host_scheduler dies around 4am every day |
|||||||||
Issue descriptionhttp://shortn/_vLiOtjpQiw The host_scheduler in all shards was down around 3:55am. I forced host_scheduler restart in all shards. This also happened around the same time yesterday but the host_scheduler recovered itself yesterday. see issue 817452 .
,
Mar 1 2018
,
Mar 1 2018
,
Mar 1 2018
,
Mar 1 2018
Issue 817452 has been merged into this issue.
,
Mar 2 2018
happened again this morning
,
Mar 4 2018
Happened again from 4:00 to 18:00 today. https://viceroy.corp.google.com/chromeos/deputy-view?utc_end=1520129384#_VG__BnnYtbi
,
Mar 4 2018
I see no evidence that database was down, or errors killed the host scheduler. Looks to me like it just got either killed by an external process or the tick loop decided to terminate. The fact that this has happened at almost precisely the same time for 4 consecutive days is very suspicios, and suggests to me that upstart or some cronjob is doing something dumb.
03/03 03:59:55.694 INFO | host_scheduler:0387| Calling new tick.
03/03 03:59:55.694 INFO | host_scheduler:0388| Leasing hosts for frontend tasks.
03/03 03:59:55.697 INFO | host_scheduler:0390| Finding hosts for new jobs.
03/03 03:59:55.915 DEBUG| host_scheduler:0185| Minimum duts to get for suites (suite_id: min_duts): {}
03/03 03:59:56.098 DEBUG| rdb:0397| Processing 2564 host acquisition requests
03/03 03:59:56.781 DEBUG| rdb_cache_manager:0241| Cache stats: hit ratio: 59.52%, avg staleness per line: 0.00%.
03/03 03:59:56.781 DEBUG| rdb:0420| Host acquisition stats: distinct requests: 126, leased hosts: 1, unsatisfied requests: 2563
03/03 03:59:56.785 INFO | scheduler_models:0572| Assigning host chromeos2-row6-rack2-host19 to entry HQE: 181026324, for job: 180602383 and host: no host has status:Queued
03/03 03:59:56.855 INFO | scheduler_models:0584| creating block 180602383/5385
03/03 03:59:57.014 INFO | host_scheduler:0251| Scheduling pre job tasks for entry: HQE: 181026324, for job: 180602383 and host: chromeos2-row6-rack2-host19 has status:Queued [active]
03/03 03:59:57.015 INFO | scheduler_models:0799| chell-release/R65-10323.49.0/bvt-cq/video_ChromeRTCHWDecodeUsed/313636/None (job 180602383, entry 181026324) scheduled on chromeos2-row6-rack2-host19, status=Queued
03/03 03:59:57.069 INFO | host_scheduler:0392| Releasing unused hosts.
03/03 03:59:57.074 INFO | host_scheduler:0394| Updating suite assignment with released hosts
03/03 03:59:57.083 INFO | host_scheduler:0396| Calling email_manager.
03/03 03:59:58.695 INFO | host_scheduler:0387| Calling new tick.
03/03 03:59:58.696 INFO | host_scheduler:0388| Leasing hosts for frontend tasks.
03/03 03:59:58.699 INFO | host_scheduler:0390| Finding hosts for new jobs.
03/03 03:59:58.889 DEBUG| host_scheduler:0185| Minimum duts to get for suites (suite_id: min_duts): {}
03/03 03:59:59.156 DEBUG| rdb:0397| Processing 2563 host acquisition requests
03/03 03:59:59.777 DEBUG| rdb_cache_manager:0241| Cache stats: hit ratio: 59.52%, avg staleness per line: 0.00%.
03/03 03:59:59.777 DEBUG| rdb:0420| Host acquisition stats: distinct requests: 126, leased hosts: 0, unsatisfied requests: 2563
03/03 03:59:59.792 INFO | host_scheduler:0392| Releasing unused hosts.
03/03 03:59:59.841 INFO | query_managers:0341| Setting leased = False for the hosts that match {'hostname__in': [u'chromeos4-row3-rack10-host8']}
03/03 03:59:59.903 INFO | host_scheduler:0394| Updating suite assignment with released hosts
03/03 03:59:59.903 INFO | host_scheduler:0396| Calling email_manager.
03/03 04:00:01.697 INFO | host_scheduler:0387| Calling new tick.
03/03 04:00:01.698 INFO | host_scheduler:0388| Leasing hosts for frontend tasks.
03/03 04:00:01.700 INFO | ts_mon_config:0207| Waiting for ts_mon flushing process to finish...
03/03 04:00:36.175 INFO | ts_mon_config:0213| Finished waiting for ts_mon process.
,
Mar 4 2018
chromeos-test@cros-full-0025 host_scheduler log: 04:01:00 INFO | Finished waiting for ts_mon process. Traceback (most recent call last): _mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)") From what's recorded in syslog, mysql gets restarted by a cron job 4:00am every morning. Mar 3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198228]: (root) CMD (/usr/bin/alarm 6000 /usr/share/update-notifier/reevaluate.py) Mar 3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198229]: (root) CMD (/usr/bin/corp_cronwrap -j 3600 /usr/sbin/uplink-tester) Mar 3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198230]: (root) CMD (/root/chromeos-admin/puppet/sync_and_run_puppet 2>&1 | /usr/bin/logger -t sync_and_run_puppet -i) Mar 3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198231]: (root) CMD (if ! /etc/init.d/apache2 status >/dev/null; then /etc/init.d/apache2 start; fi) Mar 3 04:00:01 cros-full-0025.mtv.corp.google.com CRON[198232]: (root) CMD (service mysql restart)
,
Mar 4 2018
One suspect is https://chrome-internal-review.googlesource.com/563840 which maybe doesn't play nice with our new shards?
,
Mar 4 2018
the host-schedulers were down util I manually restarted them around 5:56
,
Mar 4 2018
#9 oh, interesting. Why is host-scheduler not logging this in the example I saw?
,
Mar 4 2018
The cros restart is caused by https://chrome-internal-review.googlesource.com/543800
,
Mar 4 2018
I mean mysql restart
,
Mar 4 2018
I'm going to revert #13 for now. We should hold open to investigate why: 1) mysql restart should cause host-scheduler to crash 2) the crash is so poorly logged in host-scheduler log (i.e. not logged at all, in the example from comment #8 which was cros-full-0013:/usr/local/autotest/logs/host_scheduler.log.2018-03-02-18.51.35 )
,
Mar 4 2018
,
Mar 4 2018
Re #12, the logs can be found in /var/logs/host-scheduler.log* but not in /usr/local/autotest/logs. It's also included in the link pasted by Allen in #4.
,
Mar 4 2018
I see. Why are /var/logs/host-scheduler.log different from the /usr/local/autotest/logs log?
,
Mar 4 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/a5873c8f87f34c637669d7b0207ad5ae50bc99c8 commit a5873c8f87f34c637669d7b0207ad5ae50bc99c8 Author: Aviv Keshet <akeshet@chromium.org> Date: Sun Mar 04 03:09:28 2018
,
Mar 5 2018
Things that are weird:
- upstart or something fails to revive host-scheduler after it dies
- could be that h-s takes too long to turn down?
- host-scheduler dies just due to transient mysql outage
upstart template looks like:
# Copyright 2014 The Chromium OS Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
description "Autotest Host Scheduler"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
respawn limit unlimited
post-stop exec sleep 5
normal exit 0
# The scheduler only allows termination in between ticks; this
# number should be larger than any reasonable tick.
kill timeout 30
env LOGDIR=/var/log
env AUTOTEST=/usr/local/autotest
env SCHEDULER=scheduler/host_scheduler.py
env USER=chromeos-test
script
OUTPUT=${LOGDIR}/${UPSTART_JOB}.log
# Save the last 5 copies of ${OUTPUT}, numbered in order
# from most to least recent.
PREV=.5
for SUFFIX in .4 .3 .2 .1 ''; do
mv -f ${OUTPUT}${SUFFIX} ${OUTPUT}${PREV} || :
PREV=${SUFFIX}
done
ARGS="--production"
if [ <%= @host_scheduler_lifetime_hours -%> -gt 0 ]; then
ARGS="${ARGS} --lifetime-hours=<%= @host_scheduler_lifetime_hours -%>"
fi
sudo -u ${USER} ${AUTOTEST}/${SCHEDULER} ${ARGS[@]} 2>&1 \
| tee -a ${OUTPUT} \
| sudo -u ${USER} "<%= @cloudtail_path %>" pipe \
-service-account-json "<%= @service_account_json %>" \
-resource-id autotest-host-scheduler \
-resource-type "<%= scope['::fqdn'] %>"
end script
,
Mar 5 2018
Could be explained by h-s taking too long to die or to start.
,
Mar 5 2018
,
Mar 7 2018
Actually, the reason is pretty simple. Because we use a pipe in the script stanza and it always return 0. We should use something like PIPESTATUS to return the exit code of host_scheduler.py.
,
Mar 8 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e8e95bad34f4dd190578897e8a36c827fe56b4dd commit e8e95bad34f4dd190578897e8a36c827fe56b4dd Author: Congbin Guo <guocb@google.com> Date: Thu Mar 08 01:54:55 2018
,
Mar 12 2018
|
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by nxia@chromium.org
, Mar 1 2018