
Issue 602887

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Apr 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




DUTs seemingly randomly drop offline

Reported by jrbarnette@chromium.org, Apr 13 2016

Issue description

This may be related to bug 598517.

In looking at some DUT histories recently, I observed
behavior I can't easily explain.  The executive summary
is that DUTs that are online and working are periodically
unavailable via ssh.  There's anecdotal evidence that the
DUTs experience this as a network disconnect, reboot, and
consequently fix the problem.

A sample history of a DUT is below.  I've observed sequences
like this on several DUTs recently; this behavior isn't
idiosyncratic:
    2016-04-11 19:09:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412183-repair/
    2016-04-11 19:03:51  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412153-verify/
    2016-04-11 18:37:52  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412044-repair/

Here's an explanation of the sequence:
    18:37:52 - Repair fails.  The repair includes a powerwash.
    19:03:51 - Verify fails.  The DUT is reported as offline
        (no answer to ping).
    19:09:30 - Repair passes.  Because of the previous powerwash,
        the DUT _should_ have failed verify.  However, it passes
        because of bug 602873.

In order for the DUT to experience bug 602873, it would have to
reboot.  The failed powerwash repair doesn't perform that reboot
so the DUT must have rebooted _on its own_ sometime after the
repair failure at 18:37 and the subsequent success at 19:09.

It's possible that the verify failed because it caught the DUT
rebooting, but
 a) That's unlikely; the reboot itself would be very short, and
 b) Even if the reboot really was right around 19:03, that doesn't
    explain _why_ the DUT rebooted spontaneously.

Going back to bug 598517:  It seems like it may be that DUTs are
occasionally losing network access in the lab.  Periodic network
interruptions seem to be associated with both cases.

Somebody should look at this more closely.

 

Comment 1 by ihf@chromium.org, Apr 13 2016

Cc: jhorwich@chromium.org
eventlog.txt says
179 | 2016-04-11 17:17:05 | System Reset
180 | 2016-04-11 17:17:05 | ACPI Wake | S5
181 | 2016-04-11 17:17:05 | Wake Source | Power Button | 0
182 | 2016-04-11 17:17:43 | Kernel Event | Clean Shutdown
183 | 2016-04-11 17:17:44 | System boot | 17823
184 | 2016-04-11 17:17:44 | System Reset
185 | 2016-04-11 18:36:17 | Kernel Event | Clean Shutdown
186 | 2016-04-11 18:36:18 | System boot | 17824
187 | 2016-04-11 18:36:18 | System Reset
188 | 2016-04-11 18:38:22 | Kernel Event | Clean Shutdown
189 | 2016-04-11 18:38:23 | System boot | 17825
190 | 2016-04-11 18:38:23 | System Reset
191 | 2016-04-11 18:38:29 | Kernel Event | Clean Shutdown
192 | 2016-04-11 18:38:30 | System boot | 17826
193 | 2016-04-11 18:38:30 | System Reset
194 | 2016-04-11 18:39:06 | Kernel Event | Clean Shutdown
195 | 2016-04-11 18:39:07 | System boot | 17827
196 | 2016-04-11 18:39:07 | System Reset
197 | 2016-04-11 19:16:42 | System boot | 17828
198 | 2016-04-11 19:16:42 | SUS Power Fail
199 | 2016-04-11 19:16:42 | System Reset
200 | 2016-04-11 19:16:42 | ACPI Wake | S5
201 | 2016-04-11 19:16:42 | Wake Source | Power Button | 0
OK.  Looking more closely at the logs, I see the first (failed)
repair reports these reboots:
	GOOD	----	reboot.start	timestamp=1460425100	localtime=Apr 11 18:38:20	
	GOOD	----	reboot.verify	timestamp=1460425136	localtime=Apr 11 18:38:56	
	GOOD	----	reboot.start	timestamp=1460425143	localtime=Apr 11 18:39:03	
	GOOD	----	reboot.verify	timestamp=1460425165	localtime=Apr 11 18:39:25	

The successful repair reports this reboot:
	GOOD	----	reboot.start	timestamp=1460427461	localtime=Apr 11 19:17:41	
	GOOD	----	reboot.verify	timestamp=1460427483	localtime=Apr 11 19:18:03	

This doesn't quite jibe with the eventlog.  We must go deeper.

I've attached the full eventlog.txt from chromeos4-row1-rack3-host13.

We can, at our leisure, pull DUT repair and verify logs from intervals
in the attached file, and see what we can figure out.
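
For picking out such an interval, here's a minimal sketch of a filter over
the attached file (illustration only, not autotest code; it assumes the
pipe-delimited layout shown in c#1, "<index> | <timestamp> | <event> | ..."):
====
# Print eventlog.txt entries that fall inside a time window, so they can be
# lined up against repair/verify timestamps from the DUT history.
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

def events_between(path, start, end):
    lo = datetime.strptime(start, FMT)
    hi = datetime.strptime(end, FMT)
    with open(path) as f:
        for line in f:
            fields = [x.strip() for x in line.split('|')]
            if len(fields) < 3:
                continue                     # not an event line
            when = datetime.strptime(fields[1], FMT)
            if lo <= when <= hi:
                yield when, ' | '.join(fields[2:])

# e.g. the gap between the failed repair and the later successful one:
for when, event in events_between('eventlog.txt',
                                  '2016-04-11 18:37:52',
                                  '2016-04-11 19:09:30'):
    print('%s  %s' % (when, event))
====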

Attachment: eventlog.txt (11.8 KB)

I've looked over the logs, and I can match the eventlog in c#1
with the autotest logs.  The executive summary:
    There was no reboot while the DUT was offline.  Rebooting
    the DUT while it was offline got it back online.

Correlated history:
      2016-04-11 18:37:52  begin 412044-repair (DUT is online)
First repair action triggers reboot for Powerwash:
188 | 2016-04-11 18:38:22 | Kernel Event | Clean Shutdown
189 | 2016-04-11 18:38:23 | System boot | 17825
190 | 2016-04-11 18:38:23 | System Reset
Now DUT is up, running powerwash.

Powerwash complete, automatic reboot:
191 | 2016-04-11 18:38:29 | Kernel Event | Clean Shutdown
192 | 2016-04-11 18:38:30 | System boot | 17826
193 | 2016-04-11 18:38:30 | System Reset
DUT is now online.

Beginning of AU; reboot prior to starting work:
194 | 2016-04-11 18:39:06 | Kernel Event | Clean Shutdown
195 | 2016-04-11 18:39:07 | System boot | 17827
196 | 2016-04-11 18:39:07 | System Reset
After this, AU fails, repair fails, and the repair ends.
The DUT is online at the end of the repair.

      2016-04-11 19:03:51  begin 412153-verify/
DUT is found to be offline.

    2016-04-11 19:09:30  begin 412183-repair/
First repair action; DUT is offline, so servo resets the DUT:
197 | 2016-04-11 19:16:42 | System boot | 17828
198 | 2016-04-11 19:16:42 | SUS Power Fail
199 | 2016-04-11 19:16:42 | System Reset
200 | 2016-04-11 19:16:42 | ACPI Wake | S5
201 | 2016-04-11 19:16:42 | Wake Source | Power Button | 0
Now, DUT is online

Comment 5 by ihf@chromium.org, Apr 13 2016

To clarify, the only problem you discovered is that auto update fails? Do you have a link to the job? I am not sure I am looking at the right one.
No, the cause of update failure is known.  In this case, it's
bug 600737.

The actual problem is that a DUT with this failure can be seen
to be dropping offline periodically while idle, without explanation.
The particular cases I've got logged here show the DUT being
offline, and being recovered because Servo reset the DUT.

However, I've seen similar cases where the DUT seemed to be
offline in verify and be online moments later when repair started.
I'll go looking to see if I can find such a case.

Comment 7 by ihf@chromium.org, Apr 13 2016

How is "being offline" determined?
> How is "being offline" determined?

From the original description:
    [ ... ]
    19:03:51 - Verify fails.  The DUT is reported as offline
        (no answer to ping).
    [ ... ]

Comment 9 by ihf@chromium.org, Apr 13 2016

A unix ping command from the server to the DUT; or calling autotest server/hosts/abstract_ssh.py ssh_ping(), which does an ssh into the DUT using the existing (but possibly dropped) connection and executes "true" there?
A unix ping command.  Here's the log from the failing verify
task:

04/11 19:08:13.280 INFO |            repair:0279| Verifying this condition: host is available via ssh
04/11 19:08:13.280 DEBUG|          ssh_host:0153| Running (ssh) 'true'
04/11 19:08:13.281 INFO |      abstract_ssh:0735| Master ssh connection to chromeos4-row1-rack3-host13 is down.
04/11 19:08:13.281 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
04/11 19:08:13.282 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
04/11 19:08:13.283 INFO |      abstract_ssh:0749| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_hz_ZV6ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row1-rack3-host13'
04/11 19:08:13.284 DEBUG|        base_utils:0177| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_hz_ZV6ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row1-rack3-host13'
04/11 19:08:18.360 INFO |      abstract_ssh:0764| Timed out waiting for master-ssh connection to be established.
04/11 19:09:18.662 WARNI|        base_utils:1067| run process timeout (60) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_hz_ZV6ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row1-rack3-host13 "export LIBC_FATAL_STDERR_=1; true"
04/11 19:09:19.672 DEBUG|        base_utils:0177| Running 'ping chromeos4-row1-rack3-host13 -w1 -c1'
04/11 19:09:20.730 DEBUG|        base_utils:0268| [stdout] PING chromeos4-row1-rack3-host13.cros.corp.google.com (100.107.160.157) 56(84) bytes of data.
04/11 19:09:20.731 DEBUG|        base_utils:0268| [stdout] 
04/11 19:09:20.732 DEBUG|        base_utils:0268| [stdout] --- chromeos4-row1-rack3-host13.cros.corp.google.com ping statistics ---
04/11 19:09:20.732 DEBUG|        base_utils:0268| [stdout] 2 packets transmitted, 0 received, 100% packet loss, time 999ms
04/11 19:09:20.732 DEBUG|        base_utils:0268| [stdout] 
04/11 19:09:20.733 ERROR|            repair:0284| Failed: host is available via ssh
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 281, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/ssh_verify.py", line 37, in verify
    raise hosts.AutoservVerifyError(msg % host.hostname)
AutoservVerifyError: No answer to ping from chromeos4-row1-rack3-host13
04/11 19:09:20.735 INFO |        server_job:0128| 	FAIL	----	verify.ssh	timestamp=1460426960	localtime=Apr 11 19:09:20	No answer to ping from chromeos4-row1-rack3-host13
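
For reference, the check the log ends with boils down to roughly this; a
standalone sketch that mirrors the logged 'ping ... -w1 -c1' command, not
the actual autotest ssh_verify/ping code:
====
# Declare the DUT "offline" when a single ping (1 second deadline, 1 packet)
# gets no reply -- the same shape of command as in the log above.
import subprocess

def dut_answers_ping(hostname):
    cmd = ['ping', hostname, '-w1', '-c1']
    return subprocess.call(cmd) == 0

if not dut_answers_ping('chromeos4-row1-rack3-host13'):
    print('No answer to ping from chromeos4-row1-rack3-host13')
====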

<sigh> I think I've figured out the "dropping offline" issue.

In this case (and in at least a few others I examined) the DUT
had a working servo, but there was a problem with the USB stick
in the servo.  That led to the following sequence:
  * Powerwash the DUT, and attempt AU.
  * On AU failure, start servo repair by USB reinstall.
  * Force the DUT off prior to switching the USB stick
    to the servo.
  * Start downloading the test image to USB.
  * Download to USB fails (presumably, the USB stick is
    bad).
At the end of this sequence, the DUT is powered off.
On the next verify attempt, the DUT will fail the 'ssh'
test, which triggers repair.  Because servo is working,
the servo reset repair action successfully forces the DUT
online.

The process continues in this fashion indefinitely because
of bug 602873 (which allows repair to succeed for a failed
DUT) and bug 600737 (which prevents the DUT from updating
in repair).
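
Spelled out as a loop, roughly (a toy walk-through only; the prints stand
in for the real verify/repair steps, and none of this is autotest code):
====
# Toy walk-through of the cycle described above.  The prints stand in for
# the real verify/repair steps; the point is just the shape of the loop.
def broken_usb_stick_cycle(rounds=2):
    for n in range(1, rounds + 1):
        print('--- round %d ---' % n)
        print('repair: powerwash OK, but AU fails again (bug 600737)')
        print('repair: USB re-install -> force DUT off, download image to USB')
        print('repair: download fails (bad USB stick); repair FAILS, DUT left off')
        print('verify.ssh: FAIL (no answer to ping)')
        print('repair.reset: servo reset -> DUT back online')
        print('repair reported GOOD anyway (bug 602873); nothing actually fixed')

broken_usb_stick_cycle()
====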

Here's the repair log from a failed repair, showing the
failed re-install (repair.usb).
====
START	----	repair	timestamp=1460454699	localtime=Apr 12 02:51:39	
	GOOD	----	verify.ssh	timestamp=1460454723	localtime=Apr 12 02:52:03	
	GOOD	----	verify.power	timestamp=1460454724	localtime=Apr 12 02:52:04	
	GOOD	----	verify.cros	timestamp=1460454726	localtime=Apr 12 02:52:06	
	FAIL	----	verify.good_au	timestamp=1460454727	localtime=Apr 12 02:52:07	Last AU on this DUT failed
	START	----	repair.firmware	timestamp=1460454727	localtime=Apr 12 02:52:07	
		FAIL	----	repair.firmware	timestamp=1460454727	localtime=Apr 12 02:52:07	Firmware repair is not applicable to host chromeos4-row1-rack3-host13.
	END FAIL	----	repair.firmware	timestamp=1460454727	localtime=Apr 12 02:52:07	
	GOOD	----	verify.tpm	timestamp=1460454727	localtime=Apr 12 02:52:07	
	START	----	repair.powerwash	timestamp=1460454727	localtime=Apr 12 02:52:07	
		START	----	reboot	timestamp=1460454727	localtime=Apr 12 02:52:07	
			GOOD	----	reboot.start	timestamp=1460454727	localtime=Apr 12 02:52:07	
			GOOD	----	reboot.verify	timestamp=1460454765	localtime=Apr 12 02:52:45	
		END GOOD	----	reboot	kernel=3.8.11	localtime=Apr 12 02:52:45	timestamp=1460454765	
		START	----	reboot	timestamp=1460454772	localtime=Apr 12 02:52:52	
			GOOD	----	reboot.start	timestamp=1460454772	localtime=Apr 12 02:52:52	
			GOOD	----	reboot.verify	timestamp=1460454794	localtime=Apr 12 02:53:14	
		END GOOD	----	reboot	kernel=3.8.11	localtime=Apr 12 02:53:15	timestamp=1460454795	
		FAIL	----	repair.powerwash	timestamp=1460454865	localtime=Apr 12 02:54:25	Failed to install device image using payload at http://100.107.160.8:8082/update/wolf-release/R50-7978.48.0 on chromeos4-row1-rack3-host13. Update failed. Returned update_engine error code: ERROR_CODE=49, ERROR_MESSAGE=ErrorCode::kNonCriticalUpdateInOOBE. Reported error: AutoservRunError
	END FAIL	----	repair.powerwash	timestamp=1460454865	localtime=Apr 12 02:54:25	
	START	----	repair.usb	timestamp=1460454865	localtime=Apr 12 02:54:25	
		FAIL	----	repair.usb	timestamp=1460454915	localtime=Apr 12 02:55:15	Download image to usb failed.
	END FAIL	----	repair.usb	timestamp=1460454915	localtime=Apr 12 02:55:15	
END FAIL	----	repair	timestamp=1460454915	localtime=Apr 12 02:55:15	
====

And here's the successful repair that followed it:
====
START	----	repair	timestamp=1460456156	localtime=Apr 12 03:15:56	
	FAIL	----	verify.ssh	timestamp=1460456764	localtime=Apr 12 03:26:04	No answer to ping from chromeos4-row1-rack3-host13
	START	----	repair.rpm	timestamp=1460456764	localtime=Apr 12 03:26:04	
		FAIL	----	repair.rpm	timestamp=1460456939	localtime=Apr 12 03:28:59	chromeos4-row1-rack3-host13 is still offline after powercycling
	END FAIL	----	repair.rpm	timestamp=1460456939	localtime=Apr 12 03:28:59	
	START	----	repair.reset	timestamp=1460456939	localtime=Apr 12 03:28:59	
		GOOD	----	verify.ssh	timestamp=1460456970	localtime=Apr 12 03:29:30	
	END GOOD	----	repair.reset	timestamp=1460456970	localtime=Apr 12 03:29:30	
	GOOD	----	verify.cros	timestamp=1460456973	localtime=Apr 12 03:29:33	
	GOOD	----	verify.good_au	timestamp=1460456973	localtime=Apr 12 03:29:33	
	GOOD	----	verify.tpm	timestamp=1460456974	localtime=Apr 12 03:29:34	
	GOOD	----	verify.power	timestamp=1460456974	localtime=Apr 12 03:29:34	
	GOOD	----	verify.PASS	timestamp=1460456974	localtime=Apr 12 03:29:34	
	START	----	reboot	timestamp=1460456974	localtime=Apr 12 03:29:34	
		GOOD	----	reboot.start	timestamp=1460456974	localtime=Apr 12 03:29:34	
		GOOD	----	reboot.verify	timestamp=1460456997	localtime=Apr 12 03:29:57	
	END GOOD	----	reboot	kernel=3.8.11	localtime=Apr 12 03:29:57	timestamp=1460456997	
END GOOD	----	repair	timestamp=1460456997	localtime=Apr 12 03:29:57	{'host_attributes': {'powerunit_hostname': 'chromeos4-row1_2-rack3-rpm2', 'powerunit_outlet': '.A13'}, 'hostname': 'chromeos4-row1-rack3-host13'} repaired successfully

Status: WontFix (was: Assigned)
I'm declaring victory here.  This explains the previously observed
misbehavior in bug 602873.  This is (alas) unrelated to bug 598517.

Comment 13 by ihf@chromium.org, Apr 13 2016

Looking some more at the log
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row1-rack3-host13/412153-verify/20161104190349/debug/

I see that from 19:03:55.348 to 19:08:13.266 the DUT chromeos4-row1-rack3-host13.cros was never reachable, despite multiple attempts to contact it with different methods, while its servo was reachable.

One caveat: the servo is on a different subnet than the DUT:
100.107.160.157 chromeos4-row1-rack3-host13.cros
100.107.192.157 chromeos4-row1-rack3-host13-servo.cros

On the same subnet are chromeos4-row1-rack3-host11.cros (100.107.160.155) and host9.cros (*.153).
They ran graphics tests a little before that, supposedly successfully, but no logs got uploaded to GS.
https://uberchromeos-server38.corp.google.com/new_tko/#tab_id=test_detail_view&object_id=287196959
https://uberchromeos-server38.corp.google.com/new_tko/#tab_id=test_detail_view&object_id=287197193

What could be the reason for this? Do we have logs/traces of what happened on the 100.107.160.* subnet at about 7pm on 4/11?
Could we add pings to neighboring devices when we see that a machine is unreachable, to determine the state of the subnet and decide whether it is a machine or a network problem?
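
Something along these lines, perhaps (sketch only; the hostnames are just the
ones mentioned in this comment, and the full name for host9 is assumed):
====
# When the DUT doesn't answer, ping a couple of machines on the same subnet
# to distinguish "this DUT is dead" from "this subnet is dead".
import subprocess

def pingable(host):
    return subprocess.call(['ping', '-c1', '-w1', host]) == 0

def diagnose(dut, neighbors):
    if pingable(dut):
        return 'DUT reachable'
    if any(pingable(n) for n in neighbors):
        return 'neighbors answer: likely a DUT problem'
    return 'no neighbor answers: likely a subnet/network problem'

print(diagnose('chromeos4-row1-rack3-host13.cros',
               ['chromeos4-row1-rack3-host11.cros',
                'chromeos4-row1-rack3-host9.cros']))  # host9 name assumed
====
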
> What could be the reason for this?

Comment #11 explains why the DUT was down.  This problem is
explained purely by DUT behaviors; network behaviors aren't
a factor.
