New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 766211 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner: ----
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

chromeos6-row2-rack20-host7 is provision-repair fail looping

Project Member Reported by akes...@chromium.org, Sep 18 2017

Issue description

This took down a cq build: https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/935#


$ dut-status -f chromeos6-row2-rack20-host7
chromeos6-row2-rack20-host7
    2017-09-18 09:24:35  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609484-verify/
    2017-09-18 09:01:27  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609310-reset/
    2017-09-18 08:56:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609267-repair/
    2017-09-18 08:55:51  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609264-reset/
    2017-09-18 08:50:16  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609211-repair/
    2017-09-18 08:49:54  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609202-reset/
    2017-09-18 08:46:06  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609143-repair/
    2017-09-18 08:31:23  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/609060-provision/
    2017-09-18 06:59:18  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608758-repair/
    2017-09-18 06:58:58  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608753-reset/
    2017-09-18 06:52:04  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608655-repair/
    2017-09-18 06:39:03  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608592-provision/
    2017-09-18 05:12:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/142956045-chromeos-test/
    2017-09-18 05:12:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608250-reset/
    2017-09-18 04:53:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608216-verify/
    2017-09-18 04:47:23  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608207-reset/
    2017-09-18 04:42:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608203-repair/
    2017-09-18 04:41:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608202-reset/
    2017-09-18 04:36:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608187-repair/
    2017-09-18 04:36:28  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608184-reset/
    2017-09-18 04:32:40  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608129-repair/
    2017-09-18 04:17:45  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/608032-provision/
    2017-09-18 02:23:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607541-verify/
    2017-09-18 02:02:07  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607506-reset/
    2017-09-18 01:56:53  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607473-repair/
    2017-09-18 01:56:32  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607470-reset/
    2017-09-18 01:51:24  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607424-repair/
    2017-09-18 01:51:03  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607417-reset/
    2017-09-18 01:47:14  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607355-repair/
    2017-09-18 01:32:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/607276-provision/
    2017-09-17 23:52:57  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606959-verify/
    2017-09-17 23:50:28  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606948-reset/
    2017-09-17 23:45:25  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606935-repair/
    2017-09-17 23:45:04  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606934-reset/
    2017-09-17 23:39:54  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606891-repair/
    2017-09-17 23:39:34  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606884-reset/
    2017-09-17 23:35:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606825-repair/
    2017-09-17 23:21:15  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host7/606733-provision/
    2017-09-17 22:16:08  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/142861719-chromeos-test/


Locking DUT for investigation.
 
Cc: xixuan@chromium.org jrbarnette@chromium.org
Labels: annot
The reset tasks are failing with this symptom:
	FAIL	----	verify.cros	timestamp=1505749813	localtime=Sep 18 08:50:13	command execution error
  * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_Jua8h9ssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos6-row2-rack20-host7 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::upstart_status|run|wrapper] -> ssh_run(status system-
      services | grep start/running)\";fi; status system-services | grep
      start/running"
  Exit status: 1
  Duration: 0.0360298156738

Repair finds nothing wrong with the DUT, but it does report this:
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/supplied_sslh_fork.20170918.084919.2318.log
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084959.2271.core
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084959.2271.meta
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/service_failure.20170918.085047.0.meta
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/supplied_sslh_fork.20170918.084919.2318.core
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084959.2266.meta
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084959.2271.dmp
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084918.4275.meta
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084918.4281.meta
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084918.4275.dmp
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/os-release
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084918.4275.core
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084959.2266.core
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084918.4281.dmp
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/lsb-release
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084918.4281.core
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/service_failure.20170918.085047.0.log
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/run_oci.20170918.084959.2266.dmp
	INFO	----	Orphaned Crash Dump	timestamp=1505750027	localtime=Sep 18 08:53:47	/var/spool/crash/supplied_sslh_fork.20170918.084919.2318.meta


The reason for the reset failure is outlined in crbug.com/765686#c9; the
DUT is caught in a temporary state that causes a mis-diagnosis of "Chrome
failed to start".  Here's why the loop:
  * The failure first happens at provision:  after reboot, system-services
    hasn't finished starting, so verify fails.
  * The failed provision triggers repair.  At that point, system-services
    is running by the time repair makes its check, so repair finds nothing
    wrong.  But... the last thing repair does is reboot, which puts us back
    into the temporary starting state just in time for the next reset.

As for _why_ this temporary state is lasting long enough to show up so
significantly, that's probably because whatever is causing the crashes
cited above is also holding up 'system-services'.

The principle problem here is the crashes:  Whatever is causing the
crash dumps is a bug that needs fixing.

Secondarily, there's an argument that the verification check is too
stringent.  Mostly, I think that's right, but there's a critical
trade-off to consider:  Almost any fix to make the check more
forgiving would paper over a problem like the crashes above.  I'd
be really concerned that papering over those failures would make the
cure worse than the disease.

> This took down a cq build: https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/935#

Ah, something else:  This bug and bug 765686 seem to be isolated
to the paladins.  There's a good chance that the root cause here
is a bad CL.  We should have enough failed runs at this point to
pinpoint that bad CL.

Cc: puneetster@chromium.org
+sheriff
Cc: amstan@chromium.org
+secondary sheriff

Comment 6 by amstan@chromium.org, Sep 18 2017

In https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-release?numbuilds=200

It seems that it fails pretty often.

I tried to look at the diffs between build #1507..#1506, or #1491..#1490 but nothing looks related:
https://crosland.corp.google.com/log/9946.0.0..9947.0.0
https://crosland.corp.google.com/log/9930.0.0..9931.0.0
> In https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-release?numbuilds=200
>
> It seems that it fails pretty often.

Yeah, the wizpig-release builder is pretty unhealthy, and looking
at some of the provision failures, they do show the "system-services"
failure mentioned here.  However, they _don't_ show the extra crash
dumps, nor are there any DUTs stuck in a reset/repair cycle as cited
above.

So...  The tree is broken for wizpig, but it seems at least one recent
wizpig-paladin run had a bad CL that made things worse.

Comment 8 by wryan@google.com, Feb 1 2018

I replaced the ethernet cable for this DUT and restored network connectivity.
Status: Fixed (was: Untriaged)
sounds fixed

Sign in to add a comment