
Issue 817904

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 798489




host_scheduler dies around 4am every day

Project Member Reported by nxia@chromium.org, Mar 1 2018

Issue description

http://shortn/_vLiOtjpQiw

The host_scheduler on all shards was down around 3:55 am. I forced a host_scheduler restart on all shards.

This also happened around the same time yesterday, but the host_scheduler recovered on its own that time; see issue 817452.

Comment 1 by nxia@chromium.org, Mar 1 2018

Labels: Chase-Pending

Comment 2 by nxia@chromium.org, Mar 1 2018

Summary: host_scheduler was down around 3:55 am (was: host_scheduler were down around 4:00 am.)

Comment 3 by nxia@chromium.org, Mar 1 2018

 Issue 817452  has been merged into this issue.

Comment 6 by nxia@chromium.org, Mar 2 2018

Happened again this morning.

Summary: host_scheduler dies around 4am every day (was: host_scheduler was down around 3:55 am)
I see no evidence that the database was down or that errors killed the host scheduler. It looks to me like it either got killed by an external process or the tick loop decided to terminate. The fact that this has happened at almost precisely the same time for 4 consecutive days is very suspicious, and suggests to me that upstart or some cron job is doing something dumb.

03/03 03:59:55.694 INFO |    host_scheduler:0387| Calling new tick.
03/03 03:59:55.694 INFO |    host_scheduler:0388| Leasing hosts for frontend tasks.
03/03 03:59:55.697 INFO |    host_scheduler:0390| Finding hosts for new jobs.
03/03 03:59:55.915 DEBUG|    host_scheduler:0185| Minimum duts to get for suites (suite_id: min_duts): {}
03/03 03:59:56.098 DEBUG|               rdb:0397| Processing 2564 host acquisition requests
03/03 03:59:56.781 DEBUG| rdb_cache_manager:0241| Cache stats: hit ratio: 59.52%, avg staleness per line: 0.00%.
03/03 03:59:56.781 DEBUG|               rdb:0420| Host acquisition stats: distinct requests: 126, leased hosts: 1, unsatisfied requests: 2563
03/03 03:59:56.785 INFO |  scheduler_models:0572| Assigning host chromeos2-row6-rack2-host19 to entry HQE: 181026324, for job: 180602383 and host: no host has status:Queued
03/03 03:59:56.855 INFO |  scheduler_models:0584| creating block 180602383/5385
03/03 03:59:57.014 INFO |    host_scheduler:0251| Scheduling pre job tasks for entry: HQE: 181026324, for job: 180602383 and host: chromeos2-row6-rack2-host19 has status:Queued [active]
03/03 03:59:57.015 INFO |  scheduler_models:0799| chell-release/R65-10323.49.0/bvt-cq/video_ChromeRTCHWDecodeUsed/313636/None (job 180602383, entry 181026324) scheduled on chromeos2-row6-rack2-host19, status=Queued
03/03 03:59:57.069 INFO |    host_scheduler:0392| Releasing unused hosts.
03/03 03:59:57.074 INFO |    host_scheduler:0394| Updating suite assignment with released hosts
03/03 03:59:57.083 INFO |    host_scheduler:0396| Calling email_manager.
03/03 03:59:58.695 INFO |    host_scheduler:0387| Calling new tick.
03/03 03:59:58.696 INFO |    host_scheduler:0388| Leasing hosts for frontend tasks.
03/03 03:59:58.699 INFO |    host_scheduler:0390| Finding hosts for new jobs.
03/03 03:59:58.889 DEBUG|    host_scheduler:0185| Minimum duts to get for suites (suite_id: min_duts): {}
03/03 03:59:59.156 DEBUG|               rdb:0397| Processing 2563 host acquisition requests
03/03 03:59:59.777 DEBUG| rdb_cache_manager:0241| Cache stats: hit ratio: 59.52%, avg staleness per line: 0.00%.
03/03 03:59:59.777 DEBUG|               rdb:0420| Host acquisition stats: distinct requests: 126, leased hosts: 0, unsatisfied requests: 2563
03/03 03:59:59.792 INFO |    host_scheduler:0392| Releasing unused hosts.
03/03 03:59:59.841 INFO |    query_managers:0341| Setting leased = False for the hosts that match {'hostname__in': [u'chromeos4-row3-rack10-host8']}
03/03 03:59:59.903 INFO |    host_scheduler:0394| Updating suite assignment with released hosts
03/03 03:59:59.903 INFO |    host_scheduler:0396| Calling email_manager.
03/03 04:00:01.697 INFO |    host_scheduler:0387| Calling new tick.
03/03 04:00:01.698 INFO |    host_scheduler:0388| Leasing hosts for frontend tasks.
03/03 04:00:01.700 INFO |     ts_mon_config:0207| Waiting for ts_mon flushing process to finish...
03/03 04:00:36.175 INFO |     ts_mon_config:0213| Finished waiting for ts_mon process.
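One quick way to test the cron suspicion on a shard would be to look at what fired in syslog around the crash window (a sketch only; /var/log/syslog as the cron log location is an assumption about these hosts):

# List every cron job that ran between 03:55 and 04:05 on the day of the crash.
grep CRON /var/log/syslog | awk '$3 >= "03:55:00" && $3 <= "04:05:00"'

# Look for anything scheduled at 04:00 in the system crontabs.
sudo crontab -l
grep -r '^0 4' /etc/crontab /etc/cron.d/ 2>/dev/null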

Comment 9 by nxia@chromium.org, Mar 4 2018


chromeos-test@cros-full-0025

host_scheduler log:

04:01:00 INFO | Finished waiting for ts_mon process.
Traceback (most recent call last):

_mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)")


From what's recorded in syslog, mysql gets restarted by a cron job at 4:00 am every morning.

Mar  3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198228]: (root) CMD (/usr/bin/alarm 6000 /usr/share/update-notifier/reevaluate.py)
Mar  3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198229]: (root) CMD (/usr/bin/corp_cronwrap -j 3600 /usr/sbin/uplink-tester)
Mar  3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198230]: (root) CMD (/root/chromeos-admin/puppet/sync_and_run_puppet 2>&1 | /usr/bin/logger -t sync_and_run_puppet -i)
Mar  3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198231]: (root) CMD (if ! /etc/init.d/apache2 status >/dev/null; then /etc/init.d/apache2 start; fi)
Mar  3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198232]: (root) CMD (service mysql restart)
One suspect is https://chrome-internal-review.googlesource.com/563840, which maybe doesn't play nice with our new shards?
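For reference, the syslog line above is consistent with a system cron entry of roughly this shape; this is a guess at the form only, not the actual contents of that CL:

# Hypothetical /etc/cron.d entry matching the "CMD (service mysql restart)" line above
# m  h  dom mon dow  user  command
0    4  *   *   *    root  service mysql restart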

Comment 11 by nxia@chromium.org, Mar 4 2018

The host-schedulers were down until I manually restarted them around 5:56.
#9 oh, interesting. Why is host-scheduler not logging this in the example I saw?
The mysql restart is caused by https://chrome-internal-review.googlesource.com/543800.
I'm going to revert #13 for now. We should keep this open to investigate why:

1) a mysql restart causes host-scheduler to crash at all (a rough reproduction sketch follows below)
2) the crash is so poorly logged in the host-scheduler log (i.e. not logged at all in the example from comment #8, which was cros-full-0013:/usr/local/autotest/logs/host_scheduler.log.2018-03-02-18.51.35)
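For (1), a minimal reproduction sketch on a canary shard, assuming the upstart job is named host-scheduler as the log paths suggest:

# Restart mysql by hand and watch whether host-scheduler survives the outage.
sudo status host-scheduler                # note the running PID
sudo service mysql restart
sleep 60
sudo status host-scheduler                # did the PID change, or is the job stopped?
tail -n 50 /var/log/host-scheduler.log    # look for a traceback or an abrupt end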
Blocking: 798489

Comment 18 by nxia@chromium.org, Mar 4 2018

Re #12, the logs can be found in /var/log/host-scheduler.log* but not in /usr/local/autotest/logs. It's also included in the link pasted by Allen in #4.
Cc: ayatane@chromium.org dgarr...@chromium.org
I see. Why is /var/log/host-scheduler.log different from the log in /usr/local/autotest/logs?

Comment 20 by bugdroid1@chromium.org, Mar 4 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/a5873c8f87f34c637669d7b0207ad5ae50bc99c8

commit a5873c8f87f34c637669d7b0207ad5ae50bc99c8
Author: Aviv Keshet <akeshet@chromium.org>
Date: Sun Mar 04 03:09:28 2018

Things that are weird:
 - upstart (or whatever supervises it) fails to revive host-scheduler after it dies (a syslog check is sketched after this list)
   - could be that h-s takes too long to shut down?
 - host-scheduler dies just because of a transient mysql outage
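For the first point: stock upstart records process exits and respawn attempts in syslog, so something like the following should show whether it even tried (a sketch; the exact message text is an assumption):

# Did upstart notice the exit and attempt a respawn?
# Stock upstart logs lines like "init: host-scheduler main process ended, respawning".
grep 'init: host-scheduler' /var/log/syslog | tail -n 20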

upstart template looks like:

# Copyright 2014 The Chromium OS Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

description "Autotest Host Scheduler"

start on runlevel [2345]
stop on runlevel [!2345]
respawn
respawn limit unlimited
post-stop exec sleep 5

normal exit 0

# The scheduler only allows termination in between ticks; this
# number should be larger than any reasonable tick.
kill timeout 30

env LOGDIR=/var/log
env AUTOTEST=/usr/local/autotest
env SCHEDULER=scheduler/host_scheduler.py
env USER=chromeos-test

script
  OUTPUT=${LOGDIR}/${UPSTART_JOB}.log

  # Save the last 5 copies of ${OUTPUT}, numbered in order
  # from most to least recent.
  PREV=.5
  for SUFFIX in .4 .3 .2 .1 ''; do
    mv -f ${OUTPUT}${SUFFIX} ${OUTPUT}${PREV} || :
    PREV=${SUFFIX}
  done

  ARGS="--production"
  if [ <%= @host_scheduler_lifetime_hours -%> -gt 0 ]; then
    ARGS="${ARGS} --lifetime-hours=<%= @host_scheduler_lifetime_hours -%>"
  fi

  sudo -u ${USER} ${AUTOTEST}/${SCHEDULER} ${ARGS[@]} 2>&1 \
         | tee -a ${OUTPUT} \
         | sudo -u ${USER} "<%= @cloudtail_path %>" pipe \
         -service-account-json "<%= @service_account_json %>" \
         -resource-id autotest-host-scheduler \
         -resource-type "<%= scope['::fqdn'] %>"
end script
Could be explained by h-s taking too long to die or to start.
Labels: -Chase-Pending Chase
Owner: gu...@chromium.org
Status: Assigned (was: Untriaged)
Status: Started (was: Assigned)
Actually, the reason is pretty simple: because we use a pipe in the script stanza, the stanza always returns 0 regardless of how host_scheduler.py exits. We should use something like PIPESTATUS to return the exit code of host_scheduler.py.
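Roughly, the problem and the shape of the fix (a sketch only, not the actual CL; PIPESTATUS is bash-specific, so the script stanza would need to run under bash, and the cloudtail leg of the pipe is omitted here for brevity):

# Today the script stanza exits with the status of the LAST command in the
# pipeline (tee / cloudtail), so when host_scheduler.py dies the stanza can
# still exit 0, upstart sees "normal exit 0", and never respawns the job.

# Sketch of the fix: propagate host_scheduler.py's own exit status instead.
sudo -u ${USER} ${AUTOTEST}/${SCHEDULER} ${ARGS} 2>&1 | tee -a ${OUTPUT}
exit ${PIPESTATUS[0]}    # exit status of the first command in the pipe, not tee's

(set -o pipefail would get much of the same effect, at the cost of also treating a tee/cloudtail failure as a scheduler failure.)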

Comment 25 by bugdroid1@chromium.org, Mar 8 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e8e95bad34f4dd190578897e8a36c827fe56b4dd

commit e8e95bad34f4dd190578897e8a36c827fe56b4dd
Author: Congbin Guo <guocb@google.com>
Date: Thu Mar 08 01:54:55 2018

Status: Fixed (was: Started)
