
Issue 679410

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Jan 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




scheduler was dead but didn't get shutdown on the server

Project Member Reported by snanda@chromium.org, Jan 9 2017

Issue description

The last few whirlwind-paladin builds have been failing:

https://viceroy.corp.google.com/chromeos/build_details?build_config=whirlwind-paladin&build_number=6714

Looking closer, it appears that the HWTest stage is failing with test aborts. Are there not enough healthy DUTs to run the HWTests against?

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=95207766


 
There are 0 healthy whirlwind or gale DUTs in the CQ pool.
Labels: Hotlist-TreeCloser

Comment 3 by ra...@google.com, Jan 9 2017

There doesn't seem to be anything wrong with the Whirlwind devices.

It looks like there are stuck jobs on the shard for the affected Whirlwind hosts. I compared the output (after selecting "Show verifies, repairs, cleanups and resets") at these two URLs:

http://cautotest/afe/#tab_id=view_host&object_id=5591
http://chromeos-server82.cbf.corp.google.com/afe/#tab_id=view_host&object_id=5591


Job 95022619 seems to be aborted on the master, but still queued on the shard. I suspect this is the cause of the HWTest aborts.
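
For anyone hitting this again, the divergence can be checked mechanically. A minimal sketch, assuming autotest's server/frontend.py AFE client and its get_host_queue_entries RPC (the exact field shapes returned are an assumption); it fetches the same job's queue entries from the master and the shard and prints any mismatched statuses:

from autotest_lib.server import frontend

MASTER = 'cautotest'
SHARD = 'chromeos-server82.cbf.corp.google.com'
JOB_ID = 95022619

def entry_states(server, job_id):
    # Map hostname -> HostQueueEntry status as reported by this AFE.
    afe = frontend.AFE(server=server)
    entries = afe.run('get_host_queue_entries', job=job_id)
    return dict((e['host']['hostname'] if e['host'] else None, e['status'])
                for e in entries)

master = entry_states(MASTER, JOB_ID)
shard = entry_states(SHARD, JOB_ID)
for host in sorted(set(master) | set(shard), key=lambda h: h or ''):
    if master.get(host) != shard.get(host):
        # For the stuck job above this prints e.g. master=Aborted, shard=Queued.
        print('%s: master=%s  shard=%s'
              % (host, master.get(host), shard.get(host)))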

Comment 4 by ra...@google.com, Jan 9 2017

Cc: ra...@google.com
Owner: snanda@chromium.org
Reassigning to snanda@ to find out who can cancel the stuck jobs and revive the Whirlwind hosts on chromeos-server82.
Owner: dgarr...@chromium.org
Hopefully the current deputy (dgarrett) knows?

Comment 6 by ra...@google.com, Jan 9 2017

Ah sorry, I confirmed with dgarrett in crosoncall@ that he's taking a look at the problem.
Cc: xixuan@chromium.org pprabhu@chromium.org
There seems to be a shard configuration issue.

From pprabhu:
http://chromeos-server82.cbf.corp.google.com/afe/#tab_id=view_host&object_id=774
This says whirlwind is on chromeos-server82.cbf, but serverDB disagrees.

atest server list:
Hostname     : chromeos-server82.cbf.corp.google.com
Status       : primary
Roles        : shard
Attributes   : {u'board': u'wizpig, falco_li'}
Date Created : 2016-08-11 14:50:20
Date Modified: 2016-08-11 14:50:20
Note         : None


That's a different set of boards. I'm guessing this is the real cause.
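
The mismatch is also machine-checkable. A throwaway sketch that parses the atest server list record format shown above and flags shards whose 'board' attribute doesn't include a board they're expected to serve (the input filename is hypothetical):

import ast
import re

def parse_servers(text):
    # Split the key/value records (blank-line separated) into dicts.
    servers, record = [], {}
    for line in text.splitlines():
        m = re.match(r'^(\S[^:]*?)\s*:\s*(.*)$', line)
        if m:
            record[m.group(1)] = m.group(2)
        elif record:
            servers.append(record)
            record = {}
    if record:
        servers.append(record)
    return servers

def boards_of(server):
    # Attributes print as a Python dict repr: {u'board': u'wizpig, falco_li'}
    attrs = ast.literal_eval(server.get('Attributes', '{}'))
    return [b.strip() for b in attrs.get('board', '').split(',') if b.strip()]

# 'atest_server_list.txt' is a hypothetical capture of the output above.
for server in parse_servers(open('atest_server_list.txt').read()):
    if 'shard' in server.get('Roles', '') and 'whirlwind' not in boards_of(server):
        print('%s does not serve whirlwind (boards: %s)'
              % (server['Hostname'], ', '.join(boards_of(server))))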
Summary: whirlwind/gale shard is misconfigured? (was: whirlwind paladins are failing: lack of healthy DUTs?)

Comment 9 by ra...@google.com, Jan 9 2017

Gale doesn't seem to be using a shard (shard:null)

That might be my fault: I added the Gale paladin as a nonblocking paladin (important:false) in December. Note that the Gale build is currently broken due to b/34166290, and that is OK.
Owner: shuqianz@chromium.org
I suspect something happened to the shard configuration on the 7th, which caused these issues. I'm not sure what happened or what to do about it.

However, Charlene should be the expert we need.
Labels: -Pri-1 Pri-0
Labels: -Pri-0 Pri-2
Summary: scheduler was dead but didn't get shutdown on the server (was: whirlwind/gale shard is misconfigured?)
In short, the cause is that the scheduler on chromeos-server82.cbf was dead, but the service didn't get shut down.

From the scheduler log:
01/07 18:40:53.767 DEBUG|        monitor_db:1128| Starting _schedule_running_host_queue_entries
01/07 18:40:53.809 DEBUG|        monitor_db:1128| Starting _schedule_special_tasks
01/07 18:40:53.819 DEBUG|        monitor_db:1128| Starting _schedule_new_jobs
01/07 18:40:53.819 DEBUG|        monitor_db:0841| Processing 0 queue_entries
01/07 18:40:53.820 DEBUG|        monitor_db:1128| Starting _drone_manager.sync_refresh
01/07 18:40:54.187 ERROR|     email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 180, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 368, in tick
    _drone_manager.sync_refresh()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 484, in sync_refresh
    logging.info("Drones refreshed.")
  File "/usr/lib/python2.7/logging/__init__.py", line 1614, in info
    root.info(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1152, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/local/autotest/client/setup_modules.py", line 85, in _autotest_logging_handle_error
    '%r using args %r\n' % (record.msg, record.args))
IOError: [Errno 28] No space left on device
01/07 18:40:54.395 ERROR|     email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 180, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 368, in tick
    _drone_manager.sync_refresh()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 484, in sync_refresh
    logging.info("Drones refreshed.")
  File "/usr/lib/python2.7/logging/__init__.py", line 1614, in info
    root.info(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1152, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/local/autotest/client/setup_modules.py", line 85, i01/07 18:40:55.127 INFO | metadata_reporter:0162| Waiting up to 5 seconds for metadata reporting thread to complete.
01/07 18:40:55.127 ERROR|         gmail_lib:0141| Failed to send email to chromeos-test-cron+cautotest@google.com: Credential file does not exist: None. If this is a prod server, puppet should install it. If you need to be able to send email, find the credential file from chromeos-admin repo and copy it to None
01/07 18:40:55.127 INFO |     status_server:0113| Shutting down server...
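
To spell out the kill mechanism visible in the traceback: a routine logging.info() call failed on the full disk, the patched handleError (_autotest_logging_handle_error, setup_modules.py:85) tried to report that and hit the same full disk, and the resulting IOError propagated out of dispatcher.tick() and terminated monitor_db. A standard-library sketch (not autotest's actual fix) of a handler that degrades to dropping records instead:

import logging

class BestEffortFileHandler(logging.FileHandler):
    """Drops records it cannot write instead of raising into the caller."""

    def emit(self, record):
        try:
            # FileHandler.emit already routes write errors to handleError();
            # the extra try also covers failures opening the stream itself.
            logging.FileHandler.emit(self, record)
        except (IOError, OSError):
            pass

    def handleError(self, record):
        # The stock handleError() falls back to sys.stderr; in this incident
        # stderr pointed at the same full filesystem, so even the fallback
        # write raised. Stay silent and drop the record instead.
        pass

# Usage: logging.getLogger().addHandler(BestEffortFileHandler('scheduler.log'))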

The scheduler died at 6pm on 01/07, but the service wasn't shut down properly; it was stuck at 'Shutting down server...'. As a result, the scheduler was no longer doing any work, but its status still showed running. Meanwhile, host-scheduler and shard-client were both still working, so new jobs/tests could still be scheduled to the shard, and host-scheduler could assign them to hosts. But since the scheduler was dead, those tests never ran; they just sat queued on the hosts until the timeout was hit, at which point they were aborted on the master but remained queued on the shard.

I just restarted the scheduler on chromeos-server82.cbf, and the problem was fixed. However, I still don't quite understand why the scheduler wasn't shut down properly; we rarely hit this situation.
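
Since the fix was a manual restart, a liveness probe on tick activity would catch this 'status says running, nothing happening' state automatically, where a plain process check would not. A minimal sketch; the log path, silence threshold, and service name are illustrative, not the production setup:

import os
import subprocess
import time

SCHEDULER_LOG = '/usr/local/autotest/logs/scheduler.latest'  # illustrative
MAX_SILENCE_SECS = 10 * 60  # a healthy tick loop logs every few seconds

def scheduler_is_live(log_path=SCHEDULER_LOG, max_silence=MAX_SILENCE_SECS):
    # A dead-but-not-stopped scheduler stops appending to its log, so a
    # stale mtime is a better signal than checking the process exists.
    try:
        age = time.time() - os.stat(log_path).st_mtime
    except OSError:
        return False  # no log at all: treat as dead
    return age < max_silence

if not scheduler_is_live():
    # What fixed this incident was a manual restart; service name illustrative.
    subprocess.check_call(['sudo', 'service', 'scheduler', 'restart'])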
Status: Fixed (was: Untriaged)

Comment 14 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 15 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 16 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 18 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
