
Issue 756517

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug

Blocked on:
issue 756244




host-scheduler failed to restart cleanly on many shards after push on 8/16

Project Member Reported by pprabhu@chromium.org, Aug 17 2017

Issue description

Originally from issue 756244

The host scheduler was down on four shards. It never came back up after the push to prod around noon today.

chromeos-server14
chromeos-server110 
chromeos-server108
chromeos-server98
 
Blockedon: 756244
I found another one: chromeos-server109.mtv

chromeos-test@chromeos-server109:~$ tail -f /usr/local/autotest/logs/host_scheduler.latest 
08/16 12:20:50.834 DEBUG|    host_scheduler:0186| Minimum duts to get for suites (suite_id: min_duts): {}
08/16 12:20:51.309 DEBUG|               rdb:0398| Processing 2092 host acquisition requests
08/16 12:20:52.031 DEBUG| rdb_cache_manager:0241| Cache stats: hit ratio: 44.44%, avg staleness per line: 0.00%.
08/16 12:20:52.032 DEBUG|               rdb:0421| Host acquisition stats: distinct requests: 45, leased hosts: 0, unsatisfied requests: 2092
08/16 12:20:52.072 INFO |    host_scheduler:0393| Releasing unused hosts.
08/16 12:20:52.074 INFO |    host_scheduler:0395| Updating suite assignment with released hosts
08/16 12:20:52.075 INFO |    host_scheduler:0397| Calling email_manager.
08/16 12:20:52.542 INFO |    host_scheduler:0416| Shutdown request received.
08/16 12:20:52.543 INFO |    host_scheduler:0416| Shutdown request received.
08/16 12:20:52.544 INFO | metadata_reporter:0164| Waiting up to 5 seconds for metadata reporting thread to complete.


(and restarted it).

There have to be more. nxia@, you need to scrub all shards, or just kick host_scheduler on all of them.
Tried some acrobatics before recovering 109:

chromeos-test@chromeos-server109:~$ pstree -p 15374
sudo(15374)───host_scheduler.(16329)

chromeos-test@chromeos-server109:~$ sudo strace -p 16329
Process 16329 attached
read(4, ^CProcess 16329 detached
 <detached ...>

---------------

The log lines stop at the metadata_reporter line, but the process has already closed the log fd:
chromeos-test@chromeos-server109:~$ lsof -p 16329 | grep logs
chromeos-test@chromeos-server109:~$ 


---------------------
It is stuck trying to read from a MySQL connection. fd 4, the one strace showed it blocked on in read(), is this TCP socket:
chromeos-test@chromeos-server109:~$ lsof -p 16329 | grep TCP
host_sche 16329 chromeos-test    4u  IPv4         2333224316      0t0        TCP chromeos-server109.mtv.corp.google.com:54633->173.194.109.7:mysql (ESTABLISHED)
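
The step from the strace output to this lsof line can be sketched in code. This is only an illustration of the reasoning, not a tool used in this thread: `strace_read_fd` and `lsof_fd_target` are hypothetical helper names, and the parser assumes the default `lsof -p PID` column layout (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME) with every column populated, as in the line above.

```python
import re


def strace_read_fd(strace_line):
    """Return the fd from a strace line such as 'read(4, ...', or None."""
    match = re.match(r"\s*read\((\d+),", strace_line)
    return int(match.group(1)) if match else None


def lsof_fd_target(lsof_output, fd):
    """Return the NAME column for `fd` in `lsof -p PID` output, or None.

    Assumes whitespace-separated columns in the default order:
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME...
    Rows with blank columns would need a sturdier parser.
    """
    for line in lsof_output.splitlines():
        cols = line.split()
        if len(cols) < 9:
            continue
        # The FD column is a number plus mode/lock letters, e.g. '4u'.
        # Entries like 'cwd' or 'txt' do not match and are skipped.
        match = re.fullmatch(r"(\d+)[a-zA-Z]*", cols[3])
        if match and int(match.group(1)) == fd:
            return " ".join(cols[8:])
    return None


if __name__ == "__main__":
    strace_line = "read(4, ^CProcess 16329 detached"
    lsof_line = (
        "host_sche 16329 chromeos-test    4u  IPv4         2333224316"
        "      0t0        TCP chromeos-server109.mtv.corp.google.com:54633"
        "->173.194.109.7:mysql (ESTABLISHED)"
    )
    print(lsof_fd_target(lsof_line, strace_read_fd(strace_line)))
```

Fed the two lines pasted in this comment, the lookup resolves the blocked fd to the `:mysql` endpoint, which is what points at the database connection rather than the log file.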

----------------------
OK, I've kicked it now. chromeos-server109.mtv is back.
nxia: This needs a chromeos-infra-outages@ announcement. We don't know how many shards are stuck, but this one was stuck for ~24 hours.
Issue 756380 has been merged into this issue.
Status: Fixed (was: Assigned)
We grepped logs on all shards.

Two more:
chromeos-server26.mtv.corp.google.com chromeos-server48.hot.corp.google.com

Kicked on both.
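
For future scrubs, the log check could be scripted. What follows is a minimal sketch, not the grep that was actually run: it assumes the stuck signature is the one seen on chromeos-server109, where the last line of host_scheduler.latest is the metadata_reporter "Waiting up to 5 seconds" message and nothing follows it. `looks_stuck`, `parse_timestamp`, and the five-minute grace period are hypothetical choices, and the log's year-less timestamps are interpreted in the year of `now`.

```python
from datetime import datetime, timedelta

# Final log line seen on the stuck shard in this thread.
STUCK_MARKER = "Waiting up to 5 seconds for metadata reporting thread to complete."


def parse_timestamp(line, year):
    """Parse the leading 'MM/DD HH:MM:SS.mmm' stamp; the log omits the year."""
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S.%f")


def looks_stuck(log_text, now, grace=timedelta(minutes=5)):
    """Return True if the log ends on the shutdown marker and has gone quiet.

    A healthy restart should write new lines soon after the marker, so a
    marker that is still the last line after `grace` suggests a hang.
    Assumption: the log does not span a year boundary.
    """
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    if not lines or STUCK_MARKER not in lines[-1]:
        return False
    return now - parse_timestamp(lines[-1], now.year) > grace


if __name__ == "__main__":
    tail = (
        "08/16 12:20:52.543 INFO |    host_scheduler:0416| Shutdown request received.\n"
        "08/16 12:20:52.544 INFO | metadata_reporter:0164| "
        "Waiting up to 5 seconds for metadata reporting thread to complete.\n"
    )
    print(looks_stuck(tail, datetime(2017, 8, 17, 12, 0)))
```

Running this against the tail of each shard's log (fetched however the shards are normally reached) would flag candidates to kick. A process check is still worthwhile, since a scheduler that exited cleanly and was never restarted leaves the same last line.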

Comment 7 by nxia@chromium.org, Aug 17 2017

Cc: pprabhu@chromium.org nxia@chromium.org
Issue 756247 has been merged into this issue.
