Scheduler was dead but didn't get shut down on the server
Issue description: The last few whirlwind-paladin builds have been failing: https://viceroy.corp.google.com/chromeos/build_details?build_config=whirlwind-paladin&build_number=6714 Looking closer, it appears that the HWTest stage is failing with test aborts. Are there not enough healthy DUTs to run the HWTests against? http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=95207766
Jan 9 2017
There doesn't seem to be anything wrong with the Whirlwind devices. It looks like there are stuck jobs on the shard for the affected Whirlwind hosts. I compared the output (after selecting "Show verifies, repairs, cleanups and resets") at these two URLs: http://cautotest/afe/#tab_id=view_host&object_id=5591 http://chromeos-server82.cbf.corp.google.com/afe/#tab_id=view_host&object_id=5591 Job 95022619 seems to be aborted on the master, but still queued on the shard. I suspect this is the cause of hwtest aborts.
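For anyone hitting this again: one way to spot this kind of master/shard divergence without clicking through both AFE pages is to compare the job's host queue entries in the two databases directly. Below is only a minimal sketch, assuming the standard afe_host_queue_entries schema, an assumed database name, and placeholder credentials; it is not something we ran as part of this investigation.

# Hypothetical sketch: compare a job's host queue entries on the master
# AFE database vs. a shard database. The schema, database name, and
# credentials below are assumptions / placeholders.
import MySQLdb

QUERY = ('SELECT id, status, aborted, complete '
         'FROM afe_host_queue_entries WHERE job_id = %s')

def hqe_state(db_host, job_id):
    conn = MySQLdb.connect(host=db_host, user='<readonly user>',
                           passwd='<password>', db='chromeos_autotest_db')
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY, (job_id,))
        return sorted(cursor.fetchall())
    finally:
        conn.close()

def compare(master_host, shard_host, job_id):
    master = hqe_state(master_host, job_id)
    shard = hqe_state(shard_host, job_id)
    print('master: %r' % (master,))
    print('shard:  %r' % (shard,))
    if master != shard:
        print('Job %s has diverged between master and shard.' % job_id)

# Job 95022619 from above: aborted on the master, still queued on the shard.
compare('cautotest', 'chromeos-server82.cbf.corp.google.com', 95022619)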
Jan 9 2017
Reassigning to snanda@ to find out who can cancel the stuck jobs and revive the Whirlwind hosts on chromeos-server82.
Jan 9 2017
Hopefully the current deputy (dgarret) knows?
Jan 9 2017
Ah sorry, I confirmed with dgarrett in crosoncall@ that he's taking a look at the problem.
Jan 9 2017
There seems to be a shard configuration issue. From pprabhu: http://chromeos-server82.cbf.corp.google.com/afe/#tab_id=view_host&object_id=774 says that whirlwind is on chromeos-server82.cbf, but the server DB disagrees. atest server list:
Hostname : chromeos-server82.cbf.corp.google.com
Status : primary
Roles : shard
Attributes : {u'board': u'wizpig, falco_li'}
Date Created : 2016-08-11 14:50:20
Date Modified: 2016-08-11 14:50:20
Note : None
That's a different set of boards. I'm guessing this is the real cause.
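For reference, the board-to-shard mapping can also be checked programmatically from the atest server list output. A small sketch, assuming the output format matches the paste above; this is only an illustration, not a tool we used here.

# Illustrative sketch: parse `atest server list` output (format assumed to
# match the paste above) and report which shard, if any, claims a board.
import ast
import subprocess

def shard_for_board(board):
    output = subprocess.check_output(['atest', 'server', 'list'],
                                     universal_newlines=True)
    hostname = None
    for line in output.splitlines():
        if line.startswith('Hostname'):
            hostname = line.split(':', 1)[1].strip()
        elif line.startswith('Attributes'):
            # e.g. Attributes : {u'board': u'wizpig, falco_li'}
            attributes = ast.literal_eval(line.split(':', 1)[1].strip())
            boards = [b.strip() for b in attributes.get('board', '').split(',')]
            if board in boards:
                return hostname
    return None

# Given the entry above, whirlwind is not claimed by chromeos-server82's
# server DB record, so this returns None.
print(shard_for_board('whirlwind'))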
Jan 9 2017
Gale doesn't seem to be using a shard (shard:null). That might be my fault - I added the Gale paladin as a non-blocking paladin (important:false) in December. Note that the Gale build is currently broken due to b/34166290, and that is OK.
Jan 9 2017
I suspect something happened to the shard configuration on the 7th, which caused these issues. I'm not sure what happened or what to do about it. However, Charlene should be the expert we need.
Jan 10 2017
In short, the cause of this is that the scheduler on chromeos-server82.cbf was dead, but the service didn't get shut down.
From the scheduler log:
01/07 18:40:53.767 DEBUG| monitor_db:1128| Starting _schedule_running_host_queue_entries
01/07 18:40:53.809 DEBUG| monitor_db:1128| Starting _schedule_special_tasks
01/07 18:40:53.819 DEBUG| monitor_db:1128| Starting _schedule_new_jobs
01/07 18:40:53.819 DEBUG| monitor_db:0841| Processing 0 queue_entries
01/07 18:40:53.820 DEBUG| monitor_db:1128| Starting _drone_manager.sync_refresh
01/07 18:40:54.187 ERROR| email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/monitor_db.py", line 180, in main_without_exception_handling
dispatcher.tick()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 368, in tick
_drone_manager.sync_refresh()
File "/usr/local/autotest/scheduler/drone_manager.py", line 484, in sync_refresh
logging.info("Drones refreshed.")
File "/usr/lib/python2.7/logging/__init__.py", line 1614, in info
root.info(msg, *args, **kwargs)
File "/usr/lib/python2.7/logging/__init__.py", line 1152, in info
self._log(INFO, msg, args, **kwargs)
File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
self.handle(record)
File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
self.callHandlers(record)
File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
hdlr.handle(record)
File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
self.emit(record)
File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
self.handleError(record)
File "/usr/local/autotest/client/setup_modules.py", line 85, in _autotest_logging_handle_error
'%r using args %r\n' % (record.msg, record.args))
IOError: [Errno 28] No space left on device
01/07 18:40:54.395 ERROR| email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/monitor_db.py", line 180, in main_without_exception_handling
dispatcher.tick()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 368, in tick
_drone_manager.sync_refresh()
File "/usr/local/autotest/scheduler/drone_manager.py", line 484, in sync_refresh
logging.info("Drones refreshed.")
File "/usr/lib/python2.7/logging/__init__.py", line 1614, in info
root.info(msg, *args, **kwargs)
File "/usr/lib/python2.7/logging/__init__.py", line 1152, in info
self._log(INFO, msg, args, **kwargs)
File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
self.handle(record)
File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
self.callHandlers(record)
File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
hdlr.handle(record)
File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
self.emit(record)
File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
self.handleError(record)
File "/usr/local/autotest/client/setup_modules.py", line 85, i01/07 18:40:55.127 INFO | metadata_reporter:0162| Waiting up to 5 seconds for metadata reporting thread to complete.
01/07 18:40:55.127 ERROR| gmail_lib:0141| Failed to send email to chromeos-test-cron+cautotest@google.com: Credential file does not exist: None. If this is a prod server, puppet should install it. If you need to be able to send email, find the credential file from chromeos-admin repo and copy it to None
01/07 18:40:55.127 INFO | status_server:0113| Shutting down server...
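Worth unpacking from the traceback: logging.info() failed because the disk was full (IOError: [Errno 28]), and autotest's logging error handler hit the same full disk while trying to report the failure, so the exception escaped the logging layer and terminated the tick. Below is a minimal standalone reproduction of that failure mode; the handler and stream names are made up for illustration and are not the real autotest code.

# Standalone illustration (not autotest code): when a logging handler's
# emit() fails and the error handler itself cannot write (e.g. the disk
# is full), the exception escapes the logging module and propagates to
# the caller, just as the IOError above escaped sync_refresh().
import errno
import logging

class FullDiskStream(object):
    """Fake stream whose writes always fail with ENOSPC."""
    def write(self, data):
        raise IOError(errno.ENOSPC, 'No space left on device')
    def flush(self):
        pass

class StrictHandler(logging.StreamHandler):
    """Reports emit() failures by writing to the same stream, which
    fails the same way when the disk is full."""
    def handleError(self, record):
        self.stream.write('Error logging %r using args %r\n'
                          % (record.msg, record.args))

logger = logging.getLogger('demo')
logger.setLevel(logging.INFO)
logger.addHandler(StrictHandler(FullDiskStream()))

try:
    logger.info('Drones refreshed.')  # emit() fails, handleError() fails too
except IOError as e:
    print('logging call raised: %s' % e)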
The scheduler was dead as of 6pm on 01/07, but the service wasn't shut down properly; it was stuck at 'Shutting down server...'. So the scheduler was no longer working, but its status was still running. In the meantime, host-scheduler and shard-client were both working, so new jobs/tests could still be scheduled to the shard, and host-scheduler could still assign them to hosts. But since the scheduler was dead, the tests never ran and just stayed queued on the hosts until the timeout was hit. The tests were then aborted on the master, but remained queued on the shard.
I just restarted the scheduler on chromeos-server82.cbf, and the problem was fixed. However, I still don't quite understand why the scheduler wasn't shut down properly. We rarely see this situation.
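One takeaway is that "service is running" is not the same as "scheduler is ticking". A sketch of the kind of liveness check that would have flagged this, assuming a hypothetical heartbeat file touched on every tick; the path and convention are invented for illustration and are not autotest's actual monitoring.

# Hypothetical watchdog sketch: flag a scheduler whose process is alive
# but whose main loop has stopped ticking. Assumes the scheduler touches
# /var/run/scheduler.heartbeat on every tick (an invented convention).
import os
import time

HEARTBEAT_FILE = '/var/run/scheduler.heartbeat'  # invented path
MAX_TICK_AGE_SECS = 15 * 60                      # alert after 15 minutes

def scheduler_is_ticking():
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        return False  # no heartbeat file at all
    return age < MAX_TICK_AGE_SECS

if __name__ == '__main__':
    if not scheduler_is_ticking():
        # In this incident the service still looked "running", so a check
        # on tick freshness (rather than on the pid) is what would have
        # fired on 01/07.
        print('Scheduler heartbeat is stale; restart the scheduler service.')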