host-scheduler failed to restart cleanly on many shards after push on 8/16
Issue description
Originally from issue 756244: the host scheduler was down on 4 shards. It never came back up after the push to prod around noon today.
chromeos-server14
chromeos-server110
chromeos-server108
chromeos-server98
,
Aug 17 2017
I found another one: chromeos-server109.mtv
chromeos-test@chromeos-server109:~$ tail -f /usr/local/autotest/logs/host_scheduler.latest
08/16 12:20:50.834 DEBUG| host_scheduler:0186| Minimum duts to get for suites (suite_id: min_duts): {}
08/16 12:20:51.309 DEBUG| rdb:0398| Processing 2092 host acquisition requests
08/16 12:20:52.031 DEBUG| rdb_cache_manager:0241| Cache stats: hit ratio: 44.44%, avg staleness per line: 0.00%.
08/16 12:20:52.032 DEBUG| rdb:0421| Host acquisition stats: distinct requests: 45, leased hosts: 0, unsatisfied requests: 2092
08/16 12:20:52.072 INFO | host_scheduler:0393| Releasing unused hosts.
08/16 12:20:52.074 INFO | host_scheduler:0395| Updating suite assignment with released hosts
08/16 12:20:52.075 INFO | host_scheduler:0397| Calling email_manager.
08/16 12:20:52.542 INFO | host_scheduler:0416| Shutdown request received.
08/16 12:20:52.543 INFO | host_scheduler:0416| Shutdown request received.
08/16 12:20:52.544 INFO | metadata_reporter:0164| Waiting up to 5 seconds for metadata reporting thread to complete.
(and restarted it).
There have to be more. nxia@ you need to scrub all shards / or just kick host_scheduler on all shards.
,
Aug 17 2017
Tried some acrobatics before recovering 109 --
chromeos-test@chromeos-server109:~$ pstree -p 15374
sudo(15374)───host_scheduler.(16329)
chromeos-test@chromeos-server109:~$ sudo strace -p 16329
Process 16329 attached
read(4, ^CProcess 16329 detached
 <detached ...>
---------------
The log lines stop at meta_reporter log, but the process has already closed the log fd:
chromeos-test@chromeos-server109:~$ lsof -p 16329 | grep logs
chromeos-test@chromeos-server109:~$
---------------------
It is stuck trying to read from a mysql connection:
chromeos-test@chromeos-server109:~$ lsof -p 16329 | grep TCP
host_sche 16329 chromeos-test 4u IPv4 2333224316 0t0 TCP chromeos-server109.mtv.corp.google.com:54633->173.194.109.7:mysql (ESTABLISHED)
----------------------
OK, I've kicked it now. chromeos-server109.mtv is back.
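For anyone repeating this diagnosis on another shard: the step that ties the two outputs together is that the descriptor in the blocked strace call (read(4, ...) matches the 4u entry in the lsof FD column. A rough sketch of that matching, assuming the standard nine-column lsof layout (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME); the helper names fd_from_strace and lsof_entry_for_fd are made up for illustration and are not part of autotest:

```python
import re


def fd_from_strace(line):
    """Return the descriptor number of a blocked read-style syscall, or None."""
    match = re.match(r"\s*(?:read|recv|recvfrom)\((\d+),", line)
    return int(match.group(1)) if match else None


def lsof_entry_for_fd(lsof_lines, fd):
    """Find the lsof row whose FD column refers to the given descriptor.

    Assumes the default lsof column order, where the FD column is the 4th
    field (e.g. '4u' means descriptor 4 opened read/write) and NAME starts
    at the 9th field.
    """
    for line in lsof_lines:
        fields = line.split()
        if len(fields) >= 9 and re.fullmatch(rf"{fd}\D*", fields[3]):
            return {"type": fields[4], "name": " ".join(fields[8:])}
    return None


# Lines copied from the session on chromeos-server109.
strace_line = "read(4, "
lsof_lines = [
    "host_sche 16329 chromeos-test 4u IPv4 2333224316 0t0 TCP "
    "chromeos-server109.mtv.corp.google.com:54633->173.194.109.7:mysql "
    "(ESTABLISHED)",
]

blocked_fd = fd_from_strace(strace_line)
entry = lsof_entry_for_fd(lsof_lines, blocked_fd)
```

Here entry["name"] is the established connection to the mysql port, which is what points at a blocked database read rather than a log write.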
,
Aug 17 2017
nxia: This needs a chromeos-infra-outages@ announcement. We don't know how many shards are stuck, but this one was stuck for ~24 hours.
,
Aug 17 2017
Issue 756380 has been merged into this issue.
,
Aug 17 2017
We grepped logs on all shards. Two more:
chromeos-server26.mtv.corp.google.com
chromeos-server48.hot.corp.google.com
Kicked on both.
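A possible way to automate the same check, sketched under assumptions: the stuck shards all show a "Shutdown request received" line near the end of host_scheduler.latest followed by no further output. The function below takes the tail of that log and the current time and flags the pattern. The 10-minute silence threshold and the names parse_stamp and looks_stuck are assumptions for illustration, not existing autotest code; the log lines carry no year, so the current year is assumed.

```python
from datetime import datetime, timedelta

SHUTDOWN_MARKER = "Shutdown request received"


def parse_stamp(line, year):
    """Parse the leading 'MM/DD HH:MM:SS.mmm' stamp; return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:18]}", "%Y %m/%d %H:%M:%S.%f")
    except ValueError:
        return None


def looks_stuck(tail_lines, now, max_silence=timedelta(minutes=10)):
    """True if the tail shows a shutdown request followed by prolonged silence."""
    stamps = [parse_stamp(line, now.year) for line in tail_lines]
    stamps = [stamp for stamp in stamps if stamp is not None]
    if not stamps:
        return False
    saw_shutdown = any(SHUTDOWN_MARKER in line for line in tail_lines)
    return saw_shutdown and now - stamps[-1] > max_silence


# Tail copied from chromeos-server109's host_scheduler.latest.
tail = [
    "08/16 12:20:52.075 INFO | host_scheduler:0397| Calling email_manager.",
    "08/16 12:20:52.542 INFO | host_scheduler:0416| Shutdown request received.",
    "08/16 12:20:52.544 INFO | metadata_reporter:0164| Waiting up to 5 seconds "
    "for metadata reporting thread to complete.",
]

stuck_next_day = looks_stuck(tail, now=datetime(2017, 8, 17, 12, 0))
stuck_right_after = looks_stuck(tail, now=datetime(2017, 8, 16, 12, 21))
```

The input could come from something like `tail -n 20 /usr/local/autotest/logs/host_scheduler.latest` on each shard. A healthy scheduler that has restarted will have fresh lines after the shutdown message, so the silence check keeps it from being flagged.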
,
Aug 17 2017
Comment 1 by pprabhu@chromium.org
, Aug 17 2017