chromeos6-row2-rack20-host7 is provision-repair fail looping |
||||
Issue descriptionThis took down a cq build: https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/935# $ dut-status -f chromeos6-row2-rack20-host7 chromeos6-row2-rack20-host7 2017-09-18 09:24:35 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609484-verify/ 2017-09-18 09:01:27 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609310-reset/ 2017-09-18 08:56:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609267-repair/ 2017-09-18 08:55:51 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609264-reset/ 2017-09-18 08:50:16 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609211-repair/ 2017-09-18 08:49:54 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609202-reset/ 2017-09-18 08:46:06 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609143-repair/ 2017-09-18 08:31:23 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609060-provision/ 2017-09-18 06:59:18 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608758-repair/ 2017-09-18 06:58:58 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608753-reset/ 2017-09-18 06:52:04 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608655-repair/ 2017-09-18 06:39:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608592-provision/ 2017-09-18 05:12:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/142956045-chromeos-test/ 2017-09-18 05:12:30 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608250-reset/ 2017-09-18 04:53:47 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608216-verify/ 2017-09-18 04:47:23 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608207-reset/ 2017-09-18 04:42:11 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608203-repair/ 2017-09-18 04:41:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608202-reset/ 2017-09-18 04:36:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608187-repair/ 2017-09-18 04:36:28 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608184-reset/ 2017-09-18 04:32:40 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608129-repair/ 2017-09-18 04:17:45 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608032-provision/ 2017-09-18 02:23:22 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607541-verify/ 2017-09-18 02:02:07 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607506-reset/ 2017-09-18 01:56:53 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607473-repair/ 2017-09-18 01:56:32 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607470-reset/ 2017-09-18 01:51:24 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607424-repair/ 2017-09-18 01:51:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607417-reset/ 2017-09-18 01:47:14 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607355-repair/ 2017-09-18 01:32:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607276-provision/ 2017-09-17 23:52:57 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606959-verify/ 2017-09-17 23:50:28 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606948-reset/ 2017-09-17 23:45:25 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606935-repair/ 2017-09-17 23:45:04 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606934-reset/ 2017-09-17 23:39:54 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606891-repair/ 2017-09-17 23:39:34 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606884-reset/ 2017-09-17 23:35:47 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606825-repair/ 2017-09-17 23:21:15 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606733-provision/ 2017-09-17 22:16:08 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/142861719-chromeos-test/ Locking DUT for investigation.
,
Sep 18 2017
The reset tasks are failing with this symptom:
FAIL ---- verify.cros timestamp=1505749813 localtime=Sep 18 08:50:13 command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_Jua8h9ssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos6-row2-rack20-host7 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::upstart_status|run|wrapper] -> ssh_run(status system-
services | grep start/running)\";fi; status system-services | grep
start/running"
Exit status: 1
Duration: 0.0360298156738
Repair finds nothing wrong with the DUT, but it does report this:
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/supplied_sslh_fork.20170918.084919.2318.log
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084959.2271.core
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084959.2271.meta
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/service_failure.20170918.085047.0.meta
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/supplied_sslh_fork.20170918.084919.2318.core
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084959.2266.meta
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084959.2271.dmp
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084918.4275.meta
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084918.4281.meta
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084918.4275.dmp
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/os-release
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084918.4275.core
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084959.2266.core
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084918.4281.dmp
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/lsb-release
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084918.4281.core
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/service_failure.20170918.085047.0.log
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/run_oci.20170918.084959.2266.dmp
INFO ---- Orphaned Crash Dump timestamp=1505750027 localtime=Sep 18 08:53:47 /var/spool/crash/supplied_sslh_fork.20170918.084919.2318.meta
The reason for the reset failure is outlined in crbug.com/765686#c9; the
DUT is caught in a temporary state that causes a mis-diagnosis of "Chrome
failed to start". Here's why the loop:
* The failure first happens at provision: after reboot, system-services
hasn't finished starting, so verify fails.
* The failed provision triggers repair. At that point, system-services
is running by the time repair makes its check, so repair finds nothing
wrong. But... the last thing repair does is reboot, which puts us back
into the temporary starting state just in time for the next reset.
As for _why_ this temporary state is lasting long enough to show up so
significantly, that's probably because whatever is causing the crashes
cited above is also holding up 'system-services'.
The principle problem here is the crashes: Whatever is causing the
crash dumps is a bug that needs fixing.
Secondarily, there's an argument that the verification check is too
stringent. Mostly, I think that's right, but there's a critical
trade-off to consider: Almost any fix to make the check more
forgiving would paper over a problem like the crashes above. I'd
be really concerned that papering over those failures would make the
cure worse than the disease.
,
Sep 18 2017
> This took down a cq build: https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/935# Ah, something else: This bug and bug 765686 seem to be isolated to the paladins. There's a good chance that the root cause here is a bad CL. We should have enough failed runs at this point to pinpoint that bad CL.
,
Sep 18 2017
+sheriff
,
Sep 18 2017
+secondary sheriff
,
Sep 18 2017
In https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-release?numbuilds=200 It seems that it fails pretty often. I tried to look at the diffs between build #1507..#1506, or #1491..#1490 but nothing looks related: https://crosland.corp.google.com/log/9946.0.0..9947.0.0 https://crosland.corp.google.com/log/9930.0.0..9931.0.0
,
Sep 18 2017
> In https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-release?numbuilds=200 > > It seems that it fails pretty often. Yeah, the wizpig-release builder is pretty unhealthy, and looking at some of the provision failures, they do show the "system-services" failure mentioned here. However, they _don't_ show the extra crash dumps, nor are there any DUTs stuck in a reset/repair cycle as cited above. So... The tree is broken for wizpig, but it seems at least one recent wizpig-paladin run had a bad CL that made things worse.
,
Feb 1 2018
I replaced the ethernet cable for this DUT and restored network connectivity.
,
May 17 2018
sounds fixed |
||||
►
Sign in to add a comment |
||||
Comment 1 by akes...@chromium.org
, Sep 18 2017Labels: annot