Project: chromium
Starred by 2 users
Status: Archived
Owner:
Closed: Jan 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



scheduler was dead but didn't get shutdown on the server
Project Member Reported by snanda@chromium.org, Jan 9 2017
The last few whirlwind-paladin builds have been failing:

https://viceroy.corp.google.com/chromeos/build_details?build_config=whirlwind-paladin&build_number=6714

Looking closer, it appears that the HWTest stage is failing with test aborts.  Are there not enough healthy DUTs to run the HWTests against?

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=95207766


 
There are 0 healthy whirlwind or gale DUTs in the CQ pool.
Labels: Hotlist-TreeCloser
Comment 3 by ra...@google.com, Jan 9 2017
There doesn't seem to be anything wrong with the Whirlwind devices.

It looks like there are stuck jobs on the shard for the affected Whirlwind hosts. I compared the output (after selecting "Show verifies, repairs, cleanups and resets") at these two URLs:

http://cautotest/afe/#tab_id=view_host&object_id=5591
http://chromeos-server82.cbf.corp.google.com/afe/#tab_id=view_host&object_id=5591


Job 95022619 seems to be aborted on the master, but still queued on the shard. I suspect this is the cause of the HWTest aborts.
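
For reference, here is a rough way to spot that kind of master/shard divergence for a given job: query the same job's host queue entries on both servers and compare their status/aborted fields. This is only a sketch; it assumes the autotest AFE RPC client (autotest_lib.server.frontend.AFE) and the get_host_queue_entries RPC behave as shown, so treat the exact calls as my assumption rather than a drop-in script.

# Hedged sketch: compare a job's host-queue-entry state on the master vs. a shard.
# Assumes an autotest checkout on PYTHONPATH and that frontend.AFE exposes a
# generic run('get_host_queue_entries', ...) call; adjust if the client differs.
from autotest_lib.server import frontend

JOB_ID = 95022619
MASTER = 'cautotest'
SHARD = 'chromeos-server82.cbf.corp.google.com'

def hqe_states(server, job_id):
    afe = frontend.AFE(server=server)
    entries = afe.run('get_host_queue_entries', job=job_id)
    return [(e['host']['hostname'] if e.get('host') else None,
             e['status'], e['aborted']) for e in entries]

print 'master:', hqe_states(MASTER, JOB_ID)
print 'shard: ', hqe_states(SHARD, JOB_ID)
# An entry showing aborted=True on the master but status 'Queued' on the shard
# matches the stuck state described above.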
Comment 4 by ra...@google.com, Jan 9 2017
Cc: ra...@google.com
Owner: snanda@chromium.org
Reassigning to snanda@ to find out who can cancel the stuck jobs and revive the Whirlwind hosts on chromeos-server82.
Owner: dgarr...@chromium.org
Hopefully the current deputy (dgarret) knows?
Comment 6 by ra...@google.com, Jan 9 2017
Ah sorry, I confirmed with dgarrett in crosoncall@ that he's taking a look at the problem.
Cc: xixuan@chromium.org pprabhu@chromium.org
There seems to be a shard configuration issue.

From pprabhu:
http://chromeos-server82.cbf.corp.google.com/afe/#tab_id=view_host&object_id=774
this says that whirlwind is on chromeos-server82.cbf
but serverDB disagrees.

atest server list:
Hostname     : chromeos-server82.cbf.corp.google.com
Status       : primary
Roles        : shard
Attributes   : {u'board': u'wizpig, falco_li'}
Date Created : 2016-08-11 14:50:20
Date Modified: 2016-08-11 14:50:20
Note         : None


That's a different set of boards. I'm guessing this is the real cause.
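
The check implied here is just "do the boards the shard is supposed to serve (per the serverDB 'board' attribute) match the boards the AFE thinks are assigned to it?". A tiny sketch of that comparison is below; the two fetch_* helpers are hypothetical stand-ins for however you pull the data (atest output, RPC, or SQL), and the hard-coded values are the ones quoted above.

# Hedged sketch of the shard/board consistency check described above.
def fetch_serverdb_boards(shard_hostname):
    # Hypothetical: e.g. parse the Attributes line from `atest server list`.
    return {'wizpig', 'falco_li'}          # value reported for chromeos-server82

def fetch_afe_shard_boards(shard_hostname):
    # Hypothetical: e.g. the board labels the AFE page shows on that shard.
    return {'whirlwind', 'gale'}

shard = 'chromeos-server82.cbf.corp.google.com'
serverdb = fetch_serverdb_boards(shard)
afe = fetch_afe_shard_boards(shard)
if serverdb != afe:
    print 'shard %s misconfigured: serverDB=%s afe=%s' % (
        shard, sorted(serverdb), sorted(afe))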
Summary: whirlwind/gale shard is misconfigured? (was: whirlwind paladins are failing: lack of healthy DUTs?)
Comment 9 by ra...@google.com, Jan 9 2017
Gale doesn't seem to be using a shard (shard:null)

That might be my fault: I added the Gale paladin as a non-blocking paladin (important:false) in December. Note that the Gale build is currently broken due to b/34166290, and that is OK.
Owner: shuqianz@chromium.org
I suspect something happened to the shard configuration on the 7th, which caused these issues. I'm not sure what happened or what to do about it.

However, Charlene should be the expert we need.
Labels: -Pri-1 Pri-0
Labels: -Pri-0 Pri-2
Summary: scheduler was dead but didn't get shutdown on the server (was: whirlwind/gale shard is misconfigured?)
In short, the cause is that the scheduler on chromeos-server82.cbf was dead, but the service didn't get shut down.

From the scheduler log:
01/07 18:40:53.767 DEBUG|        monitor_db:1128| Starting _schedule_running_host_queue_entries
01/07 18:40:53.809 DEBUG|        monitor_db:1128| Starting _schedule_special_tasks
01/07 18:40:53.819 DEBUG|        monitor_db:1128| Starting _schedule_new_jobs
01/07 18:40:53.819 DEBUG|        monitor_db:0841| Processing 0 queue_entries
01/07 18:40:53.820 DEBUG|        monitor_db:1128| Starting _drone_manager.sync_refresh
01/07 18:40:54.187 ERROR|     email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 180, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 368, in tick
    _drone_manager.sync_refresh()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 484, in sync_refresh
    logging.info("Drones refreshed.")
  File "/usr/lib/python2.7/logging/__init__.py", line 1614, in info
    root.info(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1152, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/local/autotest/client/setup_modules.py", line 85, in _autotest_logging_handle_error
    '%r using args %r\n' % (record.msg, record.args))
IOError: [Errno 28] No space left on device
01/07 18:40:54.395 ERROR|     email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 180, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 368, in tick
    _drone_manager.sync_refresh()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 484, in sync_refresh
    logging.info("Drones refreshed.")
  File "/usr/lib/python2.7/logging/__init__.py", line 1614, in info
    root.info(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1152, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/local/autotest/client/setup_modules.py", line 85, i
01/07 18:40:55.127 INFO | metadata_reporter:0162| Waiting up to 5 seconds for metadata reporting thread to complete.
01/07 18:40:55.127 ERROR|         gmail_lib:0141| Failed to send email to chromeos-test-cron+cautotest@google.com: Credential file does not exist: None. If this is a prod server, puppet should install it. If you need to be able to send email, find the credential file from chromeos-admin repo and copy it to None
01/07 18:40:55.127 INFO |     status_server:0113| Shutting down server...

The scheduler died around 6pm on 01/07, but the service wasn't shut down properly; it was stuck at 'Shutting down server...'. As a result, the scheduler was no longer doing any work, but its status still showed as running. Meanwhile, host-scheduler and shard-client were both still working, so new jobs/tests could still be scheduled onto the shard, and host-scheduler could still assign them to hosts. But since the scheduler was dead, those tests never ran; they just sat queued on the hosts until the timeout was hit. The tests were then aborted on the master, but remained queued on the shard.
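
The traceback above boils down to: the disk filled up, a routine logging.info() call failed, the replacement handleError in setup_modules.py tried to write its own error report and hit ENOSPC as well, and the resulting IOError escaped the logging call and dispatcher.tick(), terminating monitor_db. A minimal sketch of that propagation path (not autotest's actual code) looks like this:

# Minimal sketch (not autotest's code) of how an ENOSPC failure inside a
# logging handler can escape a scheduler's tick loop and terminate the process.
import errno
import logging

class FullDiskHandler(logging.Handler):
    def emit(self, record):
        try:
            # Stand-in for writing the record to a file on a full filesystem.
            raise IOError(errno.ENOSPC, 'No space left on device')
        except IOError:
            # Mirrors logging.Handler.emit falling back to handleError().
            self.handleError(record)

    def handleError(self, record):
        # Like the replacement handler in setup_modules.py, this tries to write
        # its own error report; on a full disk that write fails too, so the
        # IOError escapes the logging call instead of being swallowed.
        raise IOError(errno.ENOSPC, 'No space left on device')

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(FullDiskHandler())

def tick():
    # Mirrors drone_manager.py:484 logging "Drones refreshed."
    logging.info('Drones refreshed.')

try:
    while True:
        tick()
except IOError as e:
    # In monitor_db this surfaces as "Uncaught exception; terminating monitor_db".
    print 'Tick loop terminated:', e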

I just restarted the scheduler on chromeos-server82.cbf, and the problem was fixed. However, I still don't quite understand why the scheduler wasn't shut down properly. We rarely see this situation. 
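
One way to catch this earlier would be to treat "the service reports running" and "the scheduler is actually ticking" as separate signals. Below is a hedged watchdog sketch along those lines; it assumes the scheduler keeps appending to a known log file every tick (as the monitor_db log above does), but the log path, staleness threshold, and status command are illustrative assumptions, not the real production values.

#!/usr/bin/env python2
# Hedged watchdog sketch: flag a scheduler whose service status says "running"
# but whose log has stopped advancing (as happened here after the ENOSPC crash).
import os
import subprocess
import time

SCHEDULER_LOG = '/usr/local/autotest/logs/scheduler.latest'   # assumed path
STALE_AFTER_SECS = 15 * 60                                     # assumed threshold

def log_is_stale(path, threshold):
    try:
        return time.time() - os.path.getmtime(path) > threshold
    except OSError:
        return True   # a missing log is just as suspicious

def service_claims_running(name='scheduler'):
    # 'status <name>' on an upstart-era server; swap in the real status check.
    out = subprocess.check_output(['status', name])
    return 'running' in out

if service_claims_running() and log_is_stale(SCHEDULER_LOG, STALE_AFTER_SECS):
    print 'scheduler looks dead but not shut down; consider restarting it'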
Status: Fixed
Comment 14 by dchan@google.com, Mar 4 2017
Labels: VerifyIn-58
Labels: VerifyIn-59
Labels: VerifyIn-60
Labels: VerifyIn-61
Comment 18 by dchan@chromium.org, Oct 14
Status: Archived