
Issue 718615

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




guado_moblab-paladin scheduler failing

Project Member Reported by akes...@chromium.org, May 4 2017

Issue description

grabbed from a moblab during https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/5824

05/04 14:59:33.962 INFO |        monitor_db:0160| os.environ: {'USERNAME': 'moblab', 'SUDO_COMMAND': '/usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results', 'TERM': 'linux', 'SHELL': '/bin/bash', 'DJANGO_SETTINGS_MODULE': 'autotest_lib.frontend.settings', 'SUDO_UID': '0', 'SUDO_GID': '0', 'LOGNAME': 'moblab', 'USER': 'moblab', 'NO_GCE_CHECK': 'False', 'MAIL': '/var/mail/moblab', 'PATH': '/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin', 'SUDO_USER': 'root', 'HOME': '/home/moblab', 'GENTOO_PYTHON_PROCESS_NAME': 'monitor_db.py'}
05/04 14:59:33.963 INFO |     status_server:0120| Status server running on ('0.0.0.0', 13467)
05/04 14:59:33.964 INFO | metadata_reporter:0148| Metadata reporting thread is started.
05/04 14:59:33.978 INFO |        monitor_db:0213| 14:59:33 05/04/17> dispatcher starting
05/04 14:59:33.979 INFO |        monitor_db:0214| My PID is 15953
05/04 14:59:33.987 INFO |        monitor_db:0238| Setting signal handler
05/04 14:59:33.987 INFO |site_drone_manager:0069| Adding drone localhost
05/04 14:59:33.993 INFO |            drones:0091| Running drone_utility on localhost
05/04 14:59:33.993 DEBUG|             utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935173.99'
05/04 14:59:33.997 INFO |    connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
05/04 14:59:34.240 INFO |            config:0024| Configuration file does not exist, ignoring: /etc/chrome-infra/ts-mon.json
05/04 14:59:34.240 ERROR|            config:0259| ts_mon monitoring is disabled because the endpoint provided is invalid or not supported: 
05/04 14:59:34.241 NOTIC|      cros_logging:0037| ts_mon was set up.
05/04 14:59:34.698 DEBUG|             utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:34.850 INFO |     drone_manager:0305| Drone localhost.max_processes: 1000
05/04 14:59:34.850 INFO |     drone_manager:0306| Drone localhost.enabled: True
05/04 14:59:34.850 INFO |     drone_manager:0308| Drone localhost.allowed_users: None
05/04 14:59:34.850 DEBUG|             utils:0202| Running 'which lxc-start'
05/04 14:59:34.856 DEBUG|             utils:0297| [stdout] /usr/sbin/lxc-start
05/04 14:59:34.856 DEBUG|             utils:0202| Running 'sudo -n ls "/usr/local/autotest/containers/base_05"'
05/04 14:59:34.866 DEBUG|             utils:0297| [stdout] base_05.log
05/04 14:59:34.866 DEBUG|             utils:0297| [stdout] config
05/04 14:59:34.866 DEBUG|             utils:0297| [stdout] rootfs
05/04 14:59:34.866 INFO |     drone_manager:0310| Drone localhost.support_ssp: True
05/04 14:59:34.867 INFO |     drone_manager:0212| Using results repository on localhost
05/04 14:59:34.867 INFO |        monitor_db:0250| Connected! Running...
05/04 14:59:34.867 INFO |monitor_db_cleanup:0246| Running 24 hour clean up
05/04 14:59:34.868 INFO |monitor_db_cleanup:0252| Checking for uncleanable DB inconsistencies
05/04 14:59:34.975 INFO |     drone_manager:0326| Queue cleanup_orphaned_containers at localhost
05/04 14:59:34.976 INFO |        thread_lib:0068| (Worker.localhost) starting.
05/04 14:59:34.976 INFO |        thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.localhost
05/04 14:59:34.977 INFO |            drones:0091| Running drone_utility on localhost
05/04 14:59:34.977 DEBUG|             utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935174.98'
05/04 14:59:35.677 DEBUG|             utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:35.822 INFO |        thread_lib:0071| (Worker.localhost) finished.
05/04 14:59:35.823 INFO |        thread_lib:0095| (Task Queue) All threads have returned, clearing map.
05/04 14:59:35.824 WARNI|             utils:0096| /usr/local/autotest/scheduler/scheduler_models.py:186: DeprecationWarning: object() takes no parameters
  return super(DBObject, cls).__new__(cls, id=id, **kwargs)

05/04 14:59:35.829 INFO |     drone_manager:0767| monitoring pidfile /usr/local/autotest/results/1-moblab/hostless/.autoserv_execute
05/04 14:59:35.829 INFO |     drone_manager:0463| Invoking drone refresh.
05/04 14:59:35.830 INFO |        thread_lib:0068| (Worker.localhost) starting.
05/04 14:59:35.830 INFO |            drones:0091| Running drone_utility on localhost
05/04 14:59:35.830 DEBUG|             utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935175.83'
05/04 14:59:35.830 INFO |        thread_lib:0091| (Task Queue) Waiting for drone_manager.refresh_queue.localhost
05/04 14:59:36.553 DEBUG|             utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:36.738 INFO |        thread_lib:0071| (Worker.localhost) finished.
05/04 14:59:36.738 INFO |        thread_lib:0095| (Task Queue) All threads have returned, clearing map.
05/04 14:59:36.738 INFO |     drone_manager:0490| Drones refreshed.
05/04 14:59:36.739 ERROR|     drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.739 ERROR|     drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.740 INFO |        agent_task:0505| Recovering process localhost/16141 for HostlessQueueTask at 1-moblab/hostless
05/04 14:59:36.740 ERROR|     drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.740 ERROR|     drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
05/04 14:59:36.742 INFO |            drones:0091| Running drone_utility on localhost
05/04 14:59:36.742 DEBUG|             utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935176.74'
05/04 14:59:37.459 DEBUG|             utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:37.594 DEBUG|        monitor_db:1078| New tick
05/04 14:59:37.594 DEBUG|        monitor_db:1078| Starting _garbage_collection
05/04 14:59:37.594 DEBUG|        monitor_db:1078| Starting _drone_manager.trigger_refresh
05/04 14:59:37.594 INFO |     drone_manager:0463| Invoking drone refresh.
05/04 14:59:37.595 INFO |        thread_lib:0068| (Worker.localhost) starting.
05/04 14:59:37.595 INFO |            drones:0091| Running drone_utility on localhost
05/04 14:59:37.596 DEBUG|        monitor_db:1078| Starting _schedule_running_host_queue_entries
05/04 14:59:37.596 DEBUG|             utils:0202| Running 'python /usr/local/autotest/scheduler/drone_utility.py --call_time 1493935177.6'
05/04 14:59:37.600 DEBUG|        monitor_db:1078| Starting _schedule_special_tasks
05/04 14:59:37.621 INFO |     ts_mon_config:0150| Waiting for ts_mon flushing process to finish...
05/04 14:59:38.286 DEBUG|             utils:0297| [stderr] DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
05/04 14:59:38.470 INFO |        thread_lib:0071| (Worker.localhost) finished.
05/04 15:00:35.069 INFO |     ts_mon_config:0156| Finished waiting for ts_mon process.
05/04 15:00:35.081 INFO | metadata_reporter:0162| Waiting up to 5 seconds for metadata reporting thread to complete.
05/04 15:00:35.082 INFO |     status_server:0113| Shutting down server...


 
While I'm here, I'll note that I also see the following in drone_log:
05/04 13:53:23.943 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:54:25.966 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:54:25.966 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:55:28.064 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:55:28.064 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:56:30.106 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:56:30.106 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:57:32.171 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:57:32.171 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:58:34.217 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:58:34.217 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:59:36.313 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 13:59:36.313 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:00:38.383 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:00:38.383 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:01:40.458 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:01:40.458 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:02:42.548 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:02:42.548 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:03:44.573 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:03:44.573 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:04:46.647 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:04:46.647 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
05/04 14:05:48.670 INFO |     drone_utility:0439| Running ['sudo', '/usr/local/autotest/site_utils/lxc_cleanup.py', '-x', '-v', '-l', '/usr/local/autotest/logs/lxc_cleanup.log']
...

And this in lxc_cleanup.log

05/04 15:17:12.791 INFO |       lxc_cleanup:0172| Cleaning container bucket /mnt/moblab/containers
05/04 15:17:12.792 DEBUG|             utils:0202| Running 'sudo lxc-ls --active'
05/04 15:17:12.807 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base/rootfs"'
05/04 15:17:12.815 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_02/rootfs"'
05/04 15:17:12.823 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
05/04 15:17:12.832 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/containers/rootfs"'
05/04 15:17:12.840 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base_05
05/04 15:17:12.840 DEBUG|       lxc_cleanup:0072| Container base_05 is not created for test.
05/04 15:17:12.840 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base
05/04 15:17:12.840 DEBUG|       lxc_cleanup:0072| Container base is not created for test.
05/04 15:17:12.840 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base_02
05/04 15:17:12.841 DEBUG|       lxc_cleanup:0072| Container base_02 is not created for test.
05/04 15:17:12.841 INFO |       lxc_cleanup:0181| Cleanup finished.
05/04 15:18:14.759 INFO |       lxc_cleanup:0171| 
05/04 15:18:14.759 INFO |       lxc_cleanup:0172| Cleaning container bucket /mnt/moblab/containers
05/04 15:18:14.759 DEBUG|             utils:0202| Running 'sudo lxc-ls --active'
05/04 15:18:14.775 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base/rootfs"'
05/04 15:18:14.783 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_02/rootfs"'
05/04 15:18:14.791 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
05/04 15:18:14.799 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/containers/rootfs"'
05/04 15:18:14.807 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base_05
05/04 15:18:14.807 DEBUG|       lxc_cleanup:0072| Container base_05 is not created for test.
05/04 15:18:14.808 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base
05/04 15:18:14.808 DEBUG|       lxc_cleanup:0072| Container base is not created for test.
05/04 15:18:14.808 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base_02
05/04 15:18:14.808 DEBUG|       lxc_cleanup:0072| Container base_02 is not created for test.
05/04 15:18:14.808 INFO |       lxc_cleanup:0181| Cleanup finished.
05/04 15:19:16.774 INFO |       lxc_cleanup:0171| 
05/04 15:19:16.774 INFO |       lxc_cleanup:0172| Cleaning container bucket /mnt/moblab/containers
05/04 15:19:16.774 DEBUG|             utils:0202| Running 'sudo lxc-ls --active'
05/04 15:19:16.790 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base/rootfs"'
05/04 15:19:16.798 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_02/rootfs"'
05/04 15:19:16.806 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
05/04 15:19:16.814 DEBUG|             utils:0202| Running 'sudo test -e "/mnt/moblab/containers/containers/rootfs"'
05/04 15:19:16.822 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base_05
05/04 15:19:16.823 DEBUG|       lxc_cleanup:0072| Container base_05 is not created for test.
05/04 15:19:16.823 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base
05/04 15:19:16.823 DEBUG|       lxc_cleanup:0072| Container base is not created for test.
05/04 15:19:16.823 DEBUG|       lxc_cleanup:0069| Checking if container is orphaned: base_02
05/04 15:19:16.823 DEBUG|       lxc_cleanup:0072| Container base_02 is not created for test.
05/04 15:19:16.823 INFO |       lxc_cleanup:0181| Cleanup finished.

Cc: dshi@chromium.org sbasi@chromium.org
This is happening again, even without the suspect CL.

I'm on chromeos2-row1-rack8-host1.cros.corp.google.com
Also I don't see host_scheduler.log on the moblab. Is that normal?
I also don't see host_scheduler running at all. Is it supposed to run on moblab?

Comment 6 by dshi@chromium.org, May 5 2017

moblab doesn't have a standalone host scheduler. It runs inline host scheduling; a global config option controls that behavior.
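The inline-scheduling switch described above can be sketched as a boolean read from the global config. The section and option names here ([SCHEDULER] inline_host_acquisition) are assumptions for illustration, not verified against the autotest source:

```python
# Minimal sketch of gating inline host scheduling on a global_config
# flag. Section/option names are assumed, not taken from autotest.
import configparser

SAMPLE_CONFIG = """
[SCHEDULER]
inline_host_acquisition: True
"""

def uses_inline_host_scheduling(config_text):
    """Return True if the scheduler should do host scheduling inline
    (i.e. no standalone host_scheduler process), per the config."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getboolean('SCHEDULER', 'inline_host_acquisition',
                             fallback=False)
```

On a moblab-style config the flag would be True, which would explain why no host_scheduler process (and no host_scheduler.log) exists on the box.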
I also see a little bit of weirdness in the gs_offloader logs, but I doubt this is related. 

2017-05-04 17:10:42,560 - DEBUG - Unable to offload to gs://chromeos-image-archive/results/54:ab:3a:c2:7a:48/464f9de0-3127-11e7-8dca-54ab3ac27a48/, sleeping.
2017-05-04 17:12:44,066 - DEBUG - Unable to offload to gs://chromeos-image-archive/results/54:ab:3a:c2:7a:48/464f9de0-3127-11e7-8dca-54ab3ac27a48/, sleeping.
2017-05-04 17:14:46,990 - DEBUG - Start of offload cycle - found 0 new jobs
#2 looks like something legitimately broken with SSP?
Why is it referencing base_02 and base_05, when we're on base_06 at least?

Comment 10 by dshi@chromium.org, May 5 2017

lxc_cleanup.py is a script to force-clean orphaned containers. The script should run as a cron job, not every couple of seconds. SSP is not affected, though.
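The drone_log excerpt above shows lxc_cleanup being invoked roughly every minute (twice per tick) rather than on a cron cadence. A simple interval guard like the following could enforce a minimum spacing between runs; the class and its use here are hypothetical, not code from drone_utility:

```python
# Hypothetical rate limiter: only allow an expensive cleanup to run
# once per min_interval_sec, no matter how often it is requested.
import time

class IntervalGuard(object):
    def __init__(self, min_interval_sec):
        self._min_interval = min_interval_sec
        self._last_run = None

    def should_run(self, now=None):
        """Return True (and record the run) only if at least
        min_interval_sec has elapsed since the last allowed run."""
        now = time.time() if now is None else now
        if (self._last_run is not None
                and now - self._last_run < self._min_interval):
            return False
        self._last_run = now
        return True
```

Usage would look like `if guard.should_run(): invoke_lxc_cleanup()`, with one guard instance held for the life of the drone_utility process.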

Comment 11 by dshi@chromium.org, May 5 2017

Re #10

moblab's base container is different from the one in lab.
base_02 and base_05 are both for moblab. It was upgraded to base_05 earlier; that's why both base containers are there. Nothing is wrong there.

Comment 12 by dshi@chromium.org, May 5 2017

Though we might consider deleting old base containers to free up space.
Hmm. Any guesses on the scheduler log then?

Comment 14 by dshi@chromium.org, May 5 2017

That's the complete log? There's no obvious reason why the scheduler wants to exit at all. It almost looks like something killed the process with a signal (a system reboot, perhaps?).
The logs sort of suggest to me that there are multiple monitor_dbs running at the same time and stepping on each other.

The error that always appears toward the end of the log, before shutdown, is:

05/04 14:59:36.739 ERROR|     drone_manager:0811| Process localhost/16141 found, but not an autoserv process. Is {'pgid': '16141', 'pid': '16141', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}

or something similar
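One plausible reading of that error is a mismatch between which field of the process info the check inspects: the 'comm' field is python2.7 while the 'args' clearly invoke autoserv. The following is a speculative reconstruction for illustration, not the actual drone_manager code:

```python
# Process-info dict as logged above (args abbreviated). A check on
# 'comm' rejects this process; a check on 'args' accepts it. Both
# functions are illustrative stand-ins, not autotest source.
PROC = {
    'pgid': '16141', 'pid': '16141', 'ppid': '1',
    'comm': 'python2.7',
    'args': ('/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv '
             '-p -r /usr/local/autotest/results/1-moblab/hostless'),
}

def is_autoserv_by_comm(info):
    # Fails here: the interpreter, not the script, is the comm name.
    return info['comm'] == 'autoserv'

def is_autoserv_by_args(info):
    # Succeeds: one argv element is the autoserv script path.
    return any(part.endswith('/autoserv') for part in info['args'].split())
```

If the real check resembles is_autoserv_by_comm, a recovered process launched via `python2.7 -u .../autoserv` would always trip this error.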
localhost logs # ps aux | grep monitor_db.py
root      8041  0.0  0.0  15676  1508 ?        Ss   18:55   0:00 sudo -u moblab /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8042  2.0  1.4 300724 56824 ?        Sl   18:55   0:01 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8052  0.0  1.1 218916 46344 ?        Sl   18:55   0:00 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8056  0.0  1.2 223704 47920 ?        S    18:55   0:00 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
root      9710  0.0  0.0   4364   644 pts/0    S+   18:56   0:00 grep --colour=auto monitor_db.py
localhost logs # 

That doesn't prove there are multiple ones running, though.
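Filtering the `ps aux` listing above for real scheduler candidates (excluding the sudo wrapper and the grep line) can be sketched as follows. Note this still can't distinguish one scheduler plus forked children from several independent schedulers, which is the commenter's point:

```python
# Count candidate monitor_db processes in ps-aux-style output,
# skipping the usual false positives (sudo wrapper, grep itself).
PS_OUTPUT = """\
root      8041  0.0  0.0  15676  1508 ?     Ss 18:55 0:00 sudo -u moblab /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8042  2.0  1.4 300724 56824 ?     Sl 18:55 0:01 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8052  0.0  1.1 218916 46344 ?     Sl 18:55 0:00 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
moblab    8056  0.0  1.2 223704 47920 ?     S  18:55 0:00 /usr/bin/python2.7 /usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results
root      9710  0.0  0.0   4364   644 pts/0 S+ 18:56 0:00 grep --colour=auto monitor_db.py
"""

def count_scheduler_processes(ps_text):
    count = 0
    for line in ps_text.splitlines():
        # ps aux has 10 fixed columns before COMMAND; the 11th field
        # onward is the command line.
        cmd = line.split(None, 10)[-1]
        if 'monitor_db.py' not in cmd:
            continue
        if cmd.startswith(('sudo ', 'grep ')):
            continue
        count += 1
    return count
```

For the listing above this yields 3 python processes, which is consistent with either interpretation (parent plus two children, or multiple schedulers).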
/var/log/messages

2017-05-05T02:10:55.865777+00:00 WARNING kernel: [ 7323.717516] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:11:57.842772+00:00 WARNING kernel: [ 7385.726406] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:12:59.820774+00:00 WARNING kernel: [ 7447.736294] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:14:01.808771+00:00 WARNING kernel: [ 7509.756496] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:15:03.795769+00:00 WARNING kernel: [ 7571.775245] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:15:37.717772+00:00 WARNING kernel: [ 7605.714869] init: whining-update-db main process (32706) terminated with status 1
2017-05-05T02:15:37.717781+00:00 WARNING kernel: [ 7605.714889] init: whining-update-db main process ended, respawning
2017-05-05T02:16:05.867912+00:00 WARNING kernel: [ 7633.878981] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:17:07.911772+00:00 WARNING kernel: [ 7695.955845] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:18:09.897778+00:00 WARNING kernel: [ 7757.973354] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:18:15.721222+00:00 INFO periodic_scheduler[12852]: crash_sender: running /sbin/crash_sender
2017-05-05T02:18:15.721443+00:00 INFO periodic_scheduler[12853]: cros-machine-id-regen: running /usr/sbin/cros-machine-id-regen -r periodic -t 21600
2017-05-05T02:18:15.738523+00:00 NOTICE crash_sender[12872]: Exiting early due to test image.
2017-05-05T02:18:15.741505+00:00 INFO periodic_scheduler[12879]: crash_sender: job completed
2017-05-05T02:18:15.745724+00:00 NOTICE cros-machine-id-regen[12882]: Not regenerating since we did so 7744 seconds ago.
2017-05-05T02:18:15.746925+00:00 INFO periodic_scheduler[12883]: cros-machine-id-regen: job completed
2017-05-05T02:19:11.893770+00:00 WARNING kernel: [ 7820.001723] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:20:01.541696+00:00 INFO sshd[18034]: syslogin_perform_logout: logout() returned an error
2017-05-05T02:20:01.543178+00:00 INFO sshd[18034]: Received disconnect from 172.24.26.7 port 39203:11: disconnected by user
2017-05-05T02:20:01.543233+00:00 INFO sshd[18034]: Disconnected from 172.24.26.7 port 39203
2017-05-05T02:20:06.460731+00:00 INFO sshd[15779]: Accepted publickey for root from 172.24.26.7 port 59689 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-05-05T02:20:13.872282+00:00 WARNING kernel: [ 7882.011485] init: moblab-scheduler-init main process ended, respawning
2017-05-05T02:20:38.102770+00:00 WARNING kernel: [ 7906.255012] init: whining-update-db main process (8720) terminated with status 1
2017-05-05T02:20:38.102779+00:00 WARNING kernel: [ 7906.255035] init: whining-update-db main process ended, respawning
2017-05-05T02:21:15.897322+00:00 WARNING kernel: [ 7944.068578] init: moblab-scheduler-init main process ended, respawning
Cc: haddowk@chromium.org ntang@chromium.org
My suggestion is to stop the init script, run the scheduler manually, and see how it fails.
Running it manually results in:

19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 INFO | Recovering process localhost/16355 for HostlessQueueTask at 1-moblab/hostless
19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 ERROR| Process localhost/16355 found, but not an autoserv process. Is {'pgid': '16355', 'pid': '16355', 'ppid': '1', 'comm': 'python2.7', 'args': '/usr/bin/python2.7 -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1-moblab/hostless -u moblab -l cyan-release/R57-9202.66.0-test_suites/control.dummy_server -s --lab True -P 1-moblab/hostless -n /usr/local/autotest/results/drone_tmp/attach.3'}
19:34:05 INFO | Running drone_utility on localhost
19:34:05 INFO | Invoking drone refresh.
19:34:05 INFO | (Worker.localhost) starting.
19:34:05 INFO | Running drone_utility on localhost
19:34:05 INFO | Waiting for ts_mon flushing process to finish...
19:34:06 INFO | (Worker.localhost) finished.
19:35:03 INFO | Finished waiting for ts_mon process.
19:35:03 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
19:35:03 INFO | Shutting down server...
127.0.0.1 - - [04/May/2017 19:35:03] "GET / HTTP/1.0" 200 -
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 39922)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 657, in __init__
    self.finish()
  File "/usr/lib64/python2.7/SocketServer.py", line 716, in finish
    self.wfile.close()
  File "/usr/lib64/python2.7/socket.py", line 283, in close
    self.flush()
  File "/usr/lib64/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
----------------------------------------
I would ignore that broken pipe, but it still fails when I try again:
19:37:59 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
19:37:59 INFO | Shutting down server...
127.0.0.1 - - [04/May/2017 19:37:59] "GET / HTTP/1.0" 200 -
Labels: -Pri-2 Pri-1
Not sure what that shutdown noise is, but I suspect it's a red herring. Maybe there's some spam after shutdown, but the real problem is *why* it's shutting down...
I've added an assert on the DUT at the place where that drone log line is emitted. I want to see in more detail how we're getting there.
Half-baked theory that this is related to Issue 718685

Comment 26 by ntang@google.com, May 5 2017

I will try to shutdown the upstart scheduler and try it manually.

Comment 27 by ntang@google.com, May 5 2017

Ran it manually and saw an error like "type object 'Provision' has no attribute 'partition'".

Comment 28 by ntang@google.com, May 5 2017

At ToT, the file server/cros/provision.py has the class method partition, but on the box it only has _partition.
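That mismatch would fail in exactly the way comment 27 describes: the caller looks up Provision.partition, but the deployed class only defines _partition, so the lookup raises AttributeError before anything runs. The classes and method bodies below are stand-ins for illustration, not the real provision.py code:

```python
# Speculative illustration of the ToT-vs-deployed skew: only the
# private name exists on the box, so the new call site blows up.
class ProvisionDeployed(object):
    """What the box had: only _partition is defined."""
    @classmethod
    def _partition(cls, labels):
        # Stand-in body: split labels into provisionable vs plain.
        return ([l for l in labels if ':' in l],
                [l for l in labels if ':' not in l])

def schedule_prejob(labels):
    # Mirrors the failing call shape from scheduler/prejob_task.py:
    # attribute lookup fails before the method is ever invoked.
    return ProvisionDeployed.partition(labels)
```

Calling `schedule_prejob([...])` against this class raises `AttributeError: type object 'ProvisionDeployed' has no attribute 'partition'`, matching the error text in comment 27.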

Comment 29 by ntang@google.com, May 5 2017

Maybe there are other problems, but we should try to get https://chromium-review.googlesource.com/c/446887 out of the CQ.
Re #27: Do you have a stack trace for that?
What callers got left behind that's causing this problem? Also, is this a case of us calling across SSP?

I set Verified: -1 on that CL.

Comment 31 by ntang@google.com, May 5 2017

Nothing to do with SSP. The call is from scheduler/prejob_task.py.

There is no stack trace printed out. I had to modify the code to catch the exception and print it.
Ugh, was our logging silently eating a stack trace?
I think that's a believable culprit CL. I will unlock the locked moblab.

Follow ups:
 - Log exceptions in prejob_task construction somewhere we can actually see them.
 - Do idea 3 from https://bugs.chromium.org/p/chromium/issues/detail?id=718618

Comment 34 by ntang@google.com, May 5 2017

Re #32, #33: monitor_db.py catches all exceptions but does not log them. We should log here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/monitor_db.py?rcl=c48be08d95ee6e9b97f4b0cb929fbe85efd74a38&l=195
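The fix direction described here (and landed in the CL below) amounts to logging the exception with its traceback before re-raising or exiting, instead of swallowing it. A minimal sketch, with a placeholder loop function rather than the real monitor_db entry point:

```python
# Sketch: never let the scheduler die silently. logging.exception
# records the active exception's traceback at ERROR level.
import logging

def run_scheduler(loop):
    """Run the (placeholder) scheduler loop; log any uncaught
    exception with a full traceback before propagating it."""
    try:
        loop()
    except Exception:
        logging.exception('Uncaught exception; terminating monitor_db.')
        raise
```

With this in place, a failure like the Provision.partition AttributeError would show up in scheduler.log with a traceback, instead of the mysterious clean-looking shutdown seen in the issue description.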

Comment 35 by aut...@google.com, May 5 2017

Labels: -current-issue
Owner: akes...@chromium.org

Comment 36 by bugdroid1@chromium.org, May 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ea8f85158c82dc7055eccea66e199df07bd9d9de

commit ea8f85158c82dc7055eccea66e199df07bd9d9de
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri May 05 21:42:51 2017

autotest: log uncaught exception in monitor_db

BUG= chromium:718615 
TEST=None

Change-Id: Ifbd87c700c17cf8d6dbeb1ccbe5c61f1b134933a
Reviewed-on: https://chromium-review.googlesource.com/497828
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Michael Tang <ntang@chromium.org>

[modify] https://crrev.com/ea8f85158c82dc7055eccea66e199df07bd9d9de/scheduler/monitor_db.py

Status: Fixed (was: Untriaged)
Issue 725345 has been merged into this issue.
Labels: VerifyIn-61

Comment 40 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
