guado_moblab-paladin scheduler failing
Issue description: grabbed from a moblab during https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/5824

05/04 14:59:33.962 INFO | monitor_db:0160| os.environ: {'USERNAME': 'moblab', 'SUDO_COMMAND': '/usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results', 'TERM': 'linux', 'SHELL': '/bin/bash', 'DJANGO_SETTINGS_MODULE': 'autotest_lib.frontend.settings', 'SUDO_UID': '0', 'SUDO_GID': '0', 'LOGNAME': 'moblab', 'USER': 'moblab', 'NO_GCE_CHECK': 'False', 'MAIL': '/var/mail/moblab', 'PATH': '/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin', 'SUDO_USER': 'root', 'HOME': '/home/moblab', 'GENTOO_PYTHON_PROCESS_NAME': 'monitor_db.py'}
05/04 14:59:33.963 INFO | status_server:0120| Status server running on ('0.0.0.0', 13467)
05/04 14:59:33.964 INFO | metadata_reporter:0148| Metadata reporting thread is started.
05/04 14:59:33.978 INFO | monitor_db:0213| 14:59:33 05/04/17> dispatcher starting
05/04 14:59:33.979 INFO | monitor_db:0214| My PID is 15953
05/04 14:59:33.987 INFO | monitor_db:0238| Setting signal handler
05/04 14:59:33.987 INFO |site_drone_manager:0069| Adding drone localhost
05/04 14:59:33.993 INFO | drones:0091| Running drone_utility on localhost
05/04 14:59:33.993 DEBUG| utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935173.99'
05/04 14:59:33.997 INFO | connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
05/04 14:59:34.240 INFO | config:0024| Configuration file does not exist, ignoring: /etc/chrome-infra/ts-mon.json
05/04 14:59:34.240 ERROR| config:0259| ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
05/04 14:59:34.241 NOTIC| cros_logging:0037| ts_mon was set up.
05/04 14:59:34.698 DEBUG| utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:34.850 INFO | drone_manager:0305| Drone localhost.max_processes: 1000
05/04 14:59:34.850 INFO | drone_manager:0306| Drone localhost.enabled: True
05/04 14:59:34.850 INFO | drone_manager:0308| Drone localhost.allowed_users: None
05/04 14:59:34.850 DEBUG| utils:0202| Running 'which lxc-start'
05/04 14:59:34.856 DEBUG| utils:0297| [stdout] /usr/sbin/lxc-start
05/04 14:59:34.856 DEBUG| utils:0202| Running 'sudo -n ls "/usr/local/autotest/containers/base_05"'
05/04 14:59:34.866 DEBUG| utils:0297| [stdout] base_05.log
05/04 14:59:34.866 DEBUG| utils:0297| [stdout] config
05/04 14:59:34.866 DEBUG| utils:0297| [stdout] rootfs
05/04 14:59:34.866 INFO | drone_manager:0310| Drone localhost.support_ssp: True
05/04 14:59:34.867 INFO | drone_manager:0212| Using results repository on localhost
05/04 14:59:34.867 INFO | monitor_db:0250| Connected! Running...
05/04 14:59:34.867 INFO |monitor_db_cleanup:0246| Running 24 hour clean up
05/04 14:59:34.868 INFO |monitor_db_cleanup:0252| Checking for uncleanable DB inconsistencies
05/04 14:59:34.975 INFO | drone_manager:0326| Queue cleanup_orphaned_containers at localhost
05/04 14:59:34.976 INFO | thread_lib:0068| (Worker.localhost) starting.
05/04 14:59:34.976 INFO | thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.localhost
05/04 14:59:34.977 INFO | drones:0091| Running drone_utility on localhost
05/04 14:59:34.977 DEBUG| utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935174.98'
05/04 14:59:35.677 DEBUG| utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:35.822 INFO | thread_lib:0071| (Worker.localhost) finished.
05/04 14:59:35.823 INFO | thread_lib:0095| (Task Queue) All threads have returned, clearing map.
05/04 14:59:35.824 WARNI| utils:0096| /usr/local/autotest/scheduler/scheduler_models.py:186: DeprecationWarning: object() takes no parameters return super(DBObject, cls).__new__(cls, id=id, **kwargs)
05/04 14:59:35.829 INFO | drone_manager:0767| monitoring pidfile /usr/local/autotest/results/1-moblab/hostless/.autoserv_execute
05/04 14:59:35.829 INFO | drone_manager:0463| Invoking drone refresh.
05/04 14:59:35.830 INFO | thread_lib:0068| (Worker.localhost) starting.
05/04 14:59:35.830 INFO | drones:0091| Running drone_utility on localhost
05/04 14:59:35.830 DEBUG| utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935175.83'
05/04 14:59:35.830 INFO | thread_lib:0091| (Task Queue) Waiting for drone_manager.refresh_queue.localhost
05/04 14:59:36.553 DEBUG| utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:36.738 INFO | thread_lib:0071| (Worker.localhost) finished.
05/04 14:59:36.738 INFO | thread_lib:0095| (Task Queue) All threads have returned, clearing map.
05/04 14:59:36.738 INFO | drone_manager:0490| Drones refreshed.
05/04 14:59:36.739 ERROR| drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.739 ERROR| drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.740 INFO | agent_task:0505| Recovering process localhost/16141 for HostlessQueueTask at 1-moblab/hostless
05/04 14:59:36.740 ERROR| drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.740 ERROR| drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.742 INFO | drones:0091| Running drone_utility on localhost
05/04 14:59:36.742 DEBUG| utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935176.74'
05/04 14:59:37.459 DEBUG| utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:37.594 DEBUG| monitor_db:1078| New tick
05/04 14:59:37.594 DEBUG| monitor_db:1078| Starting _garbage_collection
05/04 14:59:37.594 DEBUG| monitor_db:1078| Starting _drone_manager.trigger_refresh
05/04 14:59:37.594 INFO | drone_manager:0463| Invoking drone refresh.
05/04 14:59:37.595 INFO | thread_lib:0068| (Worker.localhost) starting.
05/04 14:59:37.595 INFO | drones:0091| Running drone_utility on localhost
05/04 14:59:37.596 DEBUG| monitor_db:1078| Starting _schedule_running_host_queue_entries
05/04 14:59:37.596 DEBUG| utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935177.6'
05/04 14:59:37.600 DEBUG| monitor_db:1078| Starting _schedule_special_tasks
05/04 14:59:37.621 INFO | ts_mon_config:0150| Waiting for ts_mon flushing process to finish...
05/04 14:59:38.286 DEBUG| utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:38.470 INFO | thread_lib:0071| (Worker.localhost) finished.
05/04 15:00:35.069 INFO | ts_mon_config:0156| Finished waiting for ts_mon process.
05/04 15:00:35.081 INFO | metadata_reporter:0162| Waiting up to 5 seconds for metadata reporting thread to complete.
05/04 15:00:35.082 INFO | status_server:0113| Shutting down server...
,
May 4 2017
And this is in lxc_cleanup.log:

05/04 15:17:12.791 INFO | lxc_cleanup:0172| Cleaning container bucket /mnt/moblab/containers
05/04 15:17:12.792 DEBUG| utils:0202| Running 'sudo lxc-ls --active'
05/04 15:17:12.807 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base/rootfs"'
05/04 15:17:12.815 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_02/rootfs"'
05/04 15:17:12.823 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
05/04 15:17:12.832 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/containers/rootfs"'
05/04 15:17:12.840 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base_05
05/04 15:17:12.840 DEBUG| lxc_cleanup:0072| Container base_05 is not created for test.
05/04 15:17:12.840 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base
05/04 15:17:12.840 DEBUG| lxc_cleanup:0072| Container base is not created for test.
05/04 15:17:12.840 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base_02
05/04 15:17:12.841 DEBUG| lxc_cleanup:0072| Container base_02 is not created for test.
05/04 15:17:12.841 INFO | lxc_cleanup:0181| Cleanup finished.
05/04 15:18:14.759 INFO | lxc_cleanup:0171|
05/04 15:18:14.759 INFO | lxc_cleanup:0172| Cleaning container bucket /mnt/moblab/containers
05/04 15:18:14.759 DEBUG| utils:0202| Running 'sudo lxc-ls --active'
05/04 15:18:14.775 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base/rootfs"'
05/04 15:18:14.783 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_02/rootfs"'
05/04 15:18:14.791 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
05/04 15:18:14.799 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/containers/rootfs"'
05/04 15:18:14.807 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base_05
05/04 15:18:14.807 DEBUG| lxc_cleanup:0072| Container base_05 is not created for test.
05/04 15:18:14.808 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base
05/04 15:18:14.808 DEBUG| lxc_cleanup:0072| Container base is not created for test.
05/04 15:18:14.808 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base_02
05/04 15:18:14.808 DEBUG| lxc_cleanup:0072| Container base_02 is not created for test.
05/04 15:18:14.808 INFO | lxc_cleanup:0181| Cleanup finished.
05/04 15:19:16.774 INFO | lxc_cleanup:0171|
05/04 15:19:16.774 INFO | lxc_cleanup:0172| Cleaning container bucket /mnt/moblab/containers
05/04 15:19:16.774 DEBUG| utils:0202| Running 'sudo lxc-ls --active'
05/04 15:19:16.790 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base/rootfs"'
05/04 15:19:16.798 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_02/rootfs"'
05/04 15:19:16.806 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
05/04 15:19:16.814 DEBUG| utils:0202| Running 'sudo test -e "/mnt/moblab/containers/containers/rootfs"'
05/04 15:19:16.822 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base_05
05/04 15:19:16.823 DEBUG| lxc_cleanup:0072| Container base_05 is not created for test.
05/04 15:19:16.823 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base
05/04 15:19:16.823 DEBUG| lxc_cleanup:0072| Container base is not created for test.
05/04 15:19:16.823 DEBUG| lxc_cleanup:0069| Checking if container is orphaned: base_02
05/04 15:19:16.823 DEBUG| lxc_cleanup:0072| Container base_02 is not created for test.
05/04 15:19:16.823 INFO | lxc_cleanup:0181| Cleanup finished.
,
May 5 2017
This is happening again, even without the suspect CL. I'm on chromeos2-row1-rack8-host1.cros.corp.google.com
,
May 5 2017
Also I don't see host_scheduler.log on the moblab. Is that normal?
,
May 5 2017
I also don't see host_scheduler running at all. Is it supposed to run on moblab?
,
May 5 2017
moblab doesn't have a standalone host scheduler. It runs inline host scheduling; there is a global config option that controls that behavior.
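For reference, the toggle lives in the SCHEDULER section of global_config.ini. The option name below (`inline_host_acquisition`) is from memory and should be double-checked against the checkout; treat this stanza as a sketch, not the authoritative config:

```ini
; Hypothetical global_config.ini stanza (option name is an assumption).
[SCHEDULER]
; When True, monitor_db acquires hosts inline instead of relying on a
; standalone host_scheduler process -- the moblab configuration, which is
; why host_scheduler.log never appears on the device.
inline_host_acquisition: True
```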
,
May 5 2017
I also see a little bit of weirdness in the gs_offloader logs, but I doubt this is related.

2017-05-04 17:10:42,560 - DEBUG - Unable to offload to gs://chromeos-image-archive/results/54:ab:3a:c2:7a:48/464f9de0-3127-11e7-8dca-54ab3ac27a48/, sleeping.
2017-05-04 17:12:44,066 - DEBUG - Unable to offload to gs://chromeos-image-archive/results/54:ab:3a:c2:7a:48/464f9de0-3127-11e7-8dca-54ab3ac27a48/, sleeping.
2017-05-04 17:14:46,990 - DEBUG - Start of offload cycle - found 0 new jobs
,
May 5 2017
#2 looks like something legitimately broken with ssp?
,
May 5 2017
Why is it referencing base_02 and base_05, when we're on base_06 at least?
,
May 5 2017
lxc_cleanup.py is a script to force cleanup of orphaned containers. The script should run as a cron job, not every couple of seconds. ssp is not affected, though.
,
May 5 2017
Re #10: moblab's base container is different from the one in the lab. base_02 and base_05 are both for moblab; it was upgraded to base_05 earlier, which is why both base containers are there. Nothing is wrong there.
,
May 5 2017
Though we might consider deleting old base containers to free up space.
,
May 5 2017
Hmm. Any guesses on the scheduler log then?
,
May 5 2017
That's the complete log? There's no obvious reason why the scheduler wants to exit at all. It almost looks like something killed the process with a signal (like a system reboot?).
,
May 5 2017
The logs sort of suggest to me that there are multiple monitor_dbs running at the same time and stepping on each other.
The error that always appears toward the end of the log, before shutdown, is:
05/04 14:59:36.739 ERROR| drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
or something similar
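For what it's worth, the shape of that error hints at why it fires: the process table entry reports comm as "python2.7" (the interpreter) while the full args clearly contain the autoserv path. A check keyed on the short comm field would reject a perfectly healthy autoserv. The sketch below illustrates that failure mode only; the function name and the check are hypothetical, not the actual drone_manager code:

```python
# Hypothetical sketch of a comm-based process check and why it misfires.
def looks_like_autoserv(proc_info, expected_comm="autoserv"):
    # A process started as "python2.7 .../autoserv" reports comm ==
    # "python2.7", so a comm-based check fails even though the args
    # show an autoserv invocation.
    return proc_info.get("comm") == expected_comm

# Fields copied from the error message in the log above.
proc = {
    "pgid": "16141", "pid": "16141", "ppid": "1",
    "comm": "python2.7",
    "args": "/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv "
            "-p -r /usr/local/autotest/results/1-moblab/hostless ...",
}

looks_like_autoserv(proc)    # False: the comm-based check rejects it
"autoserv" in proc["args"]   # True: an args-based check would accept it
```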
,
May 5 2017
localhost logs # ps aux | grep monitor_db.py
root      8041  0.0  0.0  15676  1508 ?      Ss  18:55  0:00 sudo -u moblab /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8042  2.0  1.4 300724 56824 ?      Sl  18:55  0:01 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8052  0.0  1.1 218916 46344 ?      Sl  18:55  0:00 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8056  0.0  1.2 223704 47920 ?      S   18:55  0:00 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
root      9710  0.0  0.0   4364   644 pts/0  S+  18:56  0:00 grep --colour=auto monitor_db.py
localhost logs #

That doesn't prove there are multiple ones running, though.
,
May 5 2017
Locked http://cautotest/afe/#tab_id=view_host&object_id=1128 for investigation
,
May 5 2017
/var/log/messages:

2017-05-05T02:10:55.865777+00:00 WARNING kernel: [ 7323.717516] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:11:57.842772+00:00 WARNING kernel: [ 7385.726406] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:12:59.820774+00:00 WARNING kernel: [ 7447.736294] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:14:01.808771+00:00 WARNING kernel: [ 7509.756496] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:15:03.795769+00:00 WARNING kernel: [ 7571.775245] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:15:37.717772+00:00 WARNING kernel: [ 7605.714869] init: whining-update-db main process (32706) terminated with status 1
2017-05-05T02:15:37.717781+00:00 WARNING kernel: [ 7605.714889] init: whining-update-db main process ended, respawning
2017-05-05T02:16:05.867912+00:00 WARNING kernel: [ 7633.878981] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:17:07.911772+00:00 WARNING kernel: [ 7695.955845] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:18:09.897778+00:00 WARNING kernel: [ 7757.973354] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:18:15.721222+00:00 INFO periodic_scheduler[12852]: crash_sender: running /sbin/crash_sender
2017-05-05T02:18:15.721443+00:00 INFO periodic_scheduler[12853]: cros-machine-id-regen: running /usr/sbin/cros-machine-id-regen -r periodic -t 21600
2017-05-05T02:18:15.738523+00:00 NOTICE crash_sender[12872]: Exiting early due to test image.
2017-05-05T02:18:15.741505+00:00 INFO periodic_scheduler[12879]: crash_sender: job completed
2017-05-05T02:18:15.745724+00:00 NOTICE cros-machine-id-regen[12882]: Not regenerating since we did so 7744 seconds ago.
2017-05-05T02:18:15.746925+00:00 INFO periodic_scheduler[12883]: cros-machine-id-regen: job completed
2017-05-05T02:19:11.893770+00:00 WARNING kernel: [ 7820.001723] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:20:01.541696+00:00 INFO sshd[18034]: syslogin_perform_logout: logout() returned an error
2017-05-05T02:20:01.543178+00:00 INFO sshd[18034]: Received disconnect from 172.24.26.7 port 39203:11: disconnected by user
2017-05-05T02:20:01.543233+00:00 INFO sshd[18034]: Disconnected from 172.24.26.7 port 39203
2017-05-05T02:20:06.460731+00:00 INFO sshd[15779]: Accepted publickey for root from 172.24.26.7 port 59689 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-05-05T02:20:13.872282+00:00 WARNING kernel: [ 7882.011485] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:20:38.102770+00:00 WARNING kernel: [ 7906.255012] init: whining-update-db main process (8720) terminated with status 1
2017-05-05T02:20:38.102779+00:00 WARNING kernel: [ 7906.255035] init: whining-update-db main process ended, respawning
2017-05-05T02:21:15.897322+00:00 WARNING kernel: [ 7944.068578] init: moblab-scheduler-init main process ended, respawning
,
May 5 2017
,
May 5 2017
My suggestion is to stop the init script and manually run the scheduler and see how it fails.
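Concretely, something like the following. The upstart job name is taken from the "moblab-scheduler-init main process ended, respawning" lines in /var/log/messages above, and the monitor_db command line is copied from the SUDO_COMMAND entry in the scheduler's own os.environ dump; treat both as a sketch to adapt, not a verified procedure:

```shell
# Stop the upstart-managed scheduler so it stops respawning.
# Job name taken from the /var/log/messages excerpt above.
sudo initctl stop moblab-scheduler-init

# Run the scheduler by hand with the same command line the init job uses
# (copied from the SUDO_COMMAND value in the os.environ dump above), so
# any crash and traceback land on the terminal instead of being respawned
# away.
sudo -u moblab /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
```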
,
May 5 2017
Running it manually results in:
19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 INFO | Recovering process localhost/16355 for HostlessQueueTask at 1-moblab/hostless
19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 INFO | Running drone_utility on localhost
19:34:05 INFO | Invoking drone refresh.
19:34:05 INFO | (Worker.localhost) starting.
19:34:05 INFO | Running drone_utility on localhost
19:34:05 INFO | Waiting for ts_mon flushing process to finish...
19:34:06 INFO | (Worker.localhost) finished.
19:35:03 INFO | Finished waiting for ts_mon process.
19:35:03 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
19:35:03 INFO | Shutting down server...
127.0.0.1 - - [04/May/2017 19:35:03] "GET / HTTP/1.0" 200 -
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 39922)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 657, in __init__
    self.finish()
  File "/usr/lib64/python2.7/SocketServer.py", line 716, in finish
    self.wfile.close()
  File "/usr/lib64/python2.7/socket.py", line 283, in close
    self.flush()
  File "/usr/lib64/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
----------------------------------------
,
May 5 2017
I would ignore that broken pipe, but it still fails when I tried again:

19:37:59 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
19:37:59 INFO | Shutting down server...
127.0.0.1 - - [04/May/2017 19:37:59] "GET / HTTP/1.0" 200 -
,
May 5 2017
Not sure what that shutdown noise is, but I suspect it's a red herring. Maybe there's some spam after shutdown, but the real problem is *why* is it shutting down...
,
May 5 2017
I've added an assert on the dut in the place where that drone logging is emitted. Want to see in more detail how we're getting there.
,
May 5 2017
Half baked theory that this is related to Issue 718685
,
May 5 2017
I will try shutting down the upstart scheduler and running it manually.
,
May 5 2017
Ran it manually and saw an error like "type object 'Provision' has no attribute 'partition'"?
,
May 5 2017
At ToT, the file server/cros/provision.py has a classmethod partition, but on the box it only has _partition.
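That version skew reproduces the exact error string from the log. A minimal sketch of the failure mode (the Provision class here is a stand-in with a made-up _partition body, not the real server/cros/provision.py):

```python
# Stand-in for the version of server/cros/provision.py deployed on the box,
# which only defines the private _partition classmethod.
class Provision(object):
    @classmethod
    def _partition(cls, labels):
        # Hypothetical behavior: split labels into provisionable and the rest.
        return labels, []

# Scheduler code built against ToT calls the public name and blows up:
try:
    Provision.partition(['cros-version:cyan-release/R57-9202.66.0'])
except AttributeError as e:
    message = str(e)  # type object 'Provision' has no attribute 'partition'
```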
,
May 5 2017
Maybe there are other problems, but we should try to get https://chromium-review.googlesource.com/c/446887 out of the CQ.
,
May 5 2017
Re #27: Do you have a stack trace for that? What callers got left behind that's causing this problem? Also, is this a case of us calling across SSP? I set Verified: -1 on that CL.
,
May 5 2017
Nothing to do with SSP. The call is from scheduler/prejob_task.py. There is no stack trace printed out; I had to modify the code to catch the exception and print it.
,
May 5 2017
Ugh, was our logging silently eating a stacktrace?
,
May 5 2017
I think that's a believable culprit CL. I will unlock the locked moblab.

Follow-ups:
- Log exceptions in prejob_task construction somewhere we can actually see them.
- Do idea 3 from https://bugs.chromium.org/p/chromium/issues/detail?id=718618
,
May 5 2017
Re #32, #33: monitor_db.py catches all exceptions but does not log them. We should log here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/monitor_db.py?rcl=c48be08d95ee6e9b97f4b0cb929fbe85efd74a38&l=195
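The shape of the fix is small. A sketch of the pattern (the function and names here are illustrative, not the actual monitor_db code):

```python
import logging

def run_dispatcher(tick):
    # A handler that silently swallows everything hides failures like the
    # Provision.partition AttributeError entirely. logging.exception records
    # the message plus the full traceback before the handler decides what
    # to do next (here: re-raise so the process still exits).
    try:
        tick()
    except Exception:
        logging.exception('Uncaught exception in the scheduler main loop; '
                          'shutting down.')
        raise
```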
,
May 5 2017
,
May 5 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ea8f85158c82dc7055eccea66e199df07bd9d9de

commit ea8f85158c82dc7055eccea66e199df07bd9d9de
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri May 05 21:42:51 2017

autotest: log uncaught exception in monitor_db

BUG=chromium:718615
TEST=None

Change-Id: Ifbd87c700c17cf8d6dbeb1ccbe5c61f1b134933a
Reviewed-on: https://chromium-review.googlesource.com/497828
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Michael Tang <ntang@chromium.org>

[modify] https://crrev.com/ea8f85158c82dc7055eccea66e199df07bd9d9de/scheduler/monitor_db.py
,
May 6 2017
,
May 24 2017
Issue 725345 has been merged into this issue.
,
Aug 1 2017
,
Jan 22 2018
Comment 1 by akes...@chromium.org, May 4 2017