DUTs seemingly randomly drop offline
Reported by jrbarnette@chromium.org, Apr 13 2016
Issue description
This may be related to bug 598517.
In looking at some DUT histories recently, I observed
behavior I can't easily explain. The executive summary
is that DUTs that are online and working are periodically
unavailable via ssh. There's anecdotal evidence that the
DUTs experience this as a network disconnect, then reboot
and thereby fix the problem.
A sample history of a DUT is below. I've observed sequences
like this on several DUTs recently; this behavior isn't
idiosyncratic:
2016-04-11 19:09:30 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412183-repair/
2016-04-11 19:03:51 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412153-verify/
2016-04-11 18:37:52 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412044-repair/
Here's an explanation of the sequence:
18:37:52 - Repair fails. The repair includes a powerwash.
19:03:51 - Verify fails. The DUT is reported as offline
(no answer to ping).
19:09:30 - Repair passes. Because of the previous powerwash,
the DUT _should_ have failed verify. However, it passes
because of bug 602873.
In order for the DUT to experience bug 602873, it would have to
reboot. The failed powerwash repair doesn't perform that reboot,
so the DUT must have rebooted _on its own_ sometime after the
repair failure at 18:37 and the subsequent success at 19:09.
It's possible that the verify failed because it caught the DUT
rebooting, but
a) That's unlikely; the reboot itself would be very short, and
b) Even if the reboot really was right around 19:03, that doesn't
explain _why_ the DUT rebooted spontaneously.
Going back to bug 598517: it may be that DUTs are occasionally
losing network access in the lab. Periodic network
interruptions seem to be associated with both cases.
Somebody should look at this more closely.
Apr 13 2016
OK. Looking more closely at the logs, I see the first (failed) repair reports these reboots:
GOOD ---- reboot.start timestamp=1460425100 localtime=Apr 11 18:38:20
GOOD ---- reboot.verify timestamp=1460425136 localtime=Apr 11 18:38:56
GOOD ---- reboot.start timestamp=1460425143 localtime=Apr 11 18:39:03
GOOD ---- reboot.verify timestamp=1460425165 localtime=Apr 11 18:39:25
The successful repair reports this reboot:
GOOD ---- reboot.start timestamp=1460427461 localtime=Apr 11 19:17:41
GOOD ---- reboot.verify timestamp=1460427483 localtime=Apr 11 19:18:03
This doesn't quite jibe with the eventlog. We must go deeper.
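As a sanity check when comparing the two logs, the epoch timestamps above can be converted to local time and compared against their localtime= fields. A minimal sketch, assuming the lab's status.log localtime values are US/Pacific (that zone is my assumption, not something stated in the logs):
====
# Cross-check status.log epoch timestamps against their localtime= fields.
# Assumes the lab logs local time as US/Pacific; adjust the zone if not.
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

PACIFIC = ZoneInfo("US/Pacific")

for ts, expected in [(1460425100, "Apr 11 18:38:20"),
                     (1460427461, "Apr 11 19:17:41")]:
    local = datetime.fromtimestamp(ts, PACIFIC)
    print(ts, local.strftime("%b %d %H:%M:%S"), "(expected", expected + ")")
====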
Apr 13 2016
I've attached the full eventlog.txt from chromeos4-row1-rack3-host13. We can, at our leisure, pull DUT repair and verify logs from intervals in the attached file, and see what we can figure out.
Apr 13 2016
I've looked over the logs, and I can match the eventlog in c#1
with the autotest logs. The executive summary:
There was no reboot while the DUT was offline. Rebooting
the DUT while it was offline got it back online.
Correlated history:
2016-04-11 18:37:52 begin 412044-repair (DUT is online)
First repair action triggers reboot for Powerwash:
188 | 2016-04-11 18:38:22 | Kernel Event | Clean Shutdown
189 | 2016-04-11 18:38:23 | System boot | 17825
190 | 2016-04-11 18:38:23 | System Reset
Now DUT is up, running powerwash.
Powerwash complete, automatic reboot:
191 | 2016-04-11 18:38:29 | Kernel Event | Clean Shutdown
192 | 2016-04-11 18:38:30 | System boot | 17826
193 | 2016-04-11 18:38:30 | System Reset
DUT is now online.
Beginning of AU; reboot prior to starting work:
194 | 2016-04-11 18:39:06 | Kernel Event | Clean Shutdown
195 | 2016-04-11 18:39:07 | System boot | 17827
196 | 2016-04-11 18:39:07 | System Reset
After this, AU fails, repair fails, and the repair ends.
The DUT is online at the end of the repair.
2016-04-11 19:03:51 begin 412153-verify/
DUT is found to be offline.
2016-04-11 19:09:30 begin 412183-repair/
First repair action; DUT is offline, so servo resets the DUT:
197 | 2016-04-11 19:16:42 | System boot | 17828
198 | 2016-04-11 19:16:42 | SUS Power Fail
199 | 2016-04-11 19:16:42 | System Reset
200 | 2016-04-11 19:16:42 | ACPI Wake | S5
201 | 2016-04-11 19:16:42 | Wake Source | Power Button | 0
Now, DUT is online
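For correlating further intervals against the attached eventlog.txt, a small parser helps. This is a minimal sketch, not autotest code: the "index | timestamp | event | detail" layout is taken from the entries quoted above, and the file name and helper name are illustrative.
====
# Sketch: parse eventlog.txt lines like
#   "188 | 2016-04-11 18:38:22 | Kernel Event | Clean Shutdown"
# into (datetime, description) pairs for lining up against status.log.
from datetime import datetime

def parse_eventlog(path):
    events = []
    with open(path) as f:
        for line in f:
            fields = [p.strip() for p in line.split('|')]
            if len(fields) < 3:
                continue  # skip anything that isn't an event record
            try:
                when = datetime.strptime(fields[1], '%Y-%m-%d %H:%M:%S')
            except ValueError:
                continue  # unrecognized timestamp format
            events.append((when, ' | '.join(fields[2:])))
    return events

# Example: list everything between the failed repair and the next success.
start = datetime(2016, 4, 11, 18, 37)
end = datetime(2016, 4, 11, 19, 20)
for when, what in parse_eventlog('eventlog.txt'):
    if start <= when <= end:
        print(when, what)
====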
Apr 13 2016
To clarify, is the only problem you discovered that auto update fails? Do you have a link to the job? I'm not sure I'm looking at the right one.
Apr 13 2016
No, the cause of update failure is known; in this case, it's bug 600737. The actual problem is that a DUT with this failure can be seen dropping offline periodically while idle, without explanation. The particular cases I've logged here show the DUT being offline and being recovered because Servo reset the DUT. However, I've seen similar cases where the DUT seemed to be offline during verify and then was online moments later when repair started. I'll go looking to see if I can find such a case.
Apr 13 2016
How is "being offline" determined?
Apr 13 2016
> How is "being offline" determined?
From the original description:
[ ... ]
19:03:51 - Verify fails. The DUT is reported as offline
(no answer to ping).
[ ... ]
Apr 13 2016
A Unix ping command from the server to the DUT, or a call to autotest's server/hosts/abstract_ssh.py ssh_ping(), which sshes into the DUT over the existing (but possibly dropped) connection and executes "true" there?
Apr 13 2016
A unix ping command. Here's the log from the failing verify task:
04/11 19:08:13.280 INFO | repair:0279| Verifying this condition: host is available via ssh
04/11 19:08:13.280 DEBUG| ssh_host:0153| Running (ssh) 'true'
04/11 19:08:13.281 INFO | abstract_ssh:0735| Master ssh connection to chromeos4-row1-rack3-host13 is down.
04/11 19:08:13.281 DEBUG| abstract_ssh:0696| Nuking master_ssh_job.
04/11 19:08:13.282 DEBUG| abstract_ssh:0702| Cleaning master_ssh_tempdir.
04/11 19:08:13.283 INFO | abstract_ssh:0749| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_hz_ZV6ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row1-rack3-host13'
04/11 19:08:13.284 DEBUG| base_utils:0177| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_hz_ZV6ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row1-rack3-host13'
04/11 19:08:18.360 INFO | abstract_ssh:0764| Timed out waiting for master-ssh connection to be established.
04/11 19:09:18.662 WARNI| base_utils:1067| run process timeout (60) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_hz_ZV6ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row1-rack3-host13 "export LIBC_FATAL_STDERR_=1; true"
04/11 19:09:19.672 DEBUG| base_utils:0177| Running 'ping chromeos4-row1-rack3-host13 -w1 -c1'
04/11 19:09:20.730 DEBUG| base_utils:0268| [stdout] PING chromeos4-row1-rack3-host13.cros.corp.google.com (100.107.160.157) 56(84) bytes of data.
04/11 19:09:20.731 DEBUG| base_utils:0268| [stdout]
04/11 19:09:20.732 DEBUG| base_utils:0268| [stdout] --- chromeos4-row1-rack3-host13.cros.corp.google.com ping statistics ---
04/11 19:09:20.732 DEBUG| base_utils:0268| [stdout] 2 packets transmitted, 0 received, 100% packet loss, time 999ms
04/11 19:09:20.732 DEBUG| base_utils:0268| [stdout]
04/11 19:09:20.733 ERROR| repair:0284| Failed: host is available via ssh
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 281, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/ssh_verify.py", line 37, in verify
    raise hosts.AutoservVerifyError(msg % host.hostname)
AutoservVerifyError: No answer to ping from chromeos4-row1-rack3-host13
04/11 19:09:20.735 INFO | server_job:0128| FAIL ---- verify.ssh timestamp=1460426960 localtime=Apr 11 19:09:20 No answer to ping from chromeos4-row1-rack3-host13
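To make the check sequence in that log easier to follow, here is a stripped-down sketch of it: try an ssh 'true' with a timeout, then fall back to a single ping. The real logic lives in autotest's server/hosts/abstract_ssh.py and server/hosts/ssh_verify.py; this is only an illustration of how "offline" gets decided, and the helper names are mine.
====
# Simplified sketch of the "is the DUT offline?" decision seen above:
# ssh 'true' with a timeout, falling back to a single 'ping -w1 -c1'.
# Illustrative only; not the actual autotest implementation.
import subprocess

def ssh_ok(host, timeout=60):
    cmd = ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=30',
           '-l', 'root', host, 'true']
    try:
        return subprocess.call(cmd, timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        return False

def ping_ok(host):
    # Mirrors the log's "ping <host> -w1 -c1": one packet, 1-second deadline.
    return subprocess.call(['ping', '-w1', '-c1', host]) == 0

host = 'chromeos4-row1-rack3-host13'
if not ssh_ok(host) and not ping_ok(host):
    # This is the case the verifier reports as
    # "No answer to ping from <host>", i.e. the DUT is offline.
    print('offline')
====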
Apr 13 2016
<sigh> I think I've figured out the "dropping offline" issue.
In this case (and in at least a few others I examined) the DUT
had a working servo, but there was a problem with the USB stick
in the servo. That led to the following sequence:
* Powerwash the DUT, and attempt AU.
* On AU failure, start servo repair by USB reinstall.
* Force the DUT off prior to switching the USB stick
  to the servo.
* Start downloading the test image to USB.
* Download to USB fails (presumably, the USB stick is
bad).
At the end of this sequence, the DUT is powered off.
On the next verify attempt, the DUT will fail the 'ssh'
test, which triggers repair. Because servo is working,
the servo reset repair action successfully forces the DUT
online.
The process continues in this fashion indefinitely because
of bug 602873 (which allows repair to succeed for a failed
DUT) and bug 600737 (which prevents the DUT from updating
in repair).
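To make the cycle concrete, here is a schematic simulation of it (illustrative names and stubbed outcomes only, not autotest code): every pass leaves the DUT powered off at the USB-download failure, so the next verify finds it "offline", servo reset brings it back, and the cycle repeats.
====
# Schematic simulation of the loop described above. Illustrative only.
def failed_repair(dut):
    dut['powered_on'] = True        # powerwash reboots complete
    au_ok = False                   # AU fails (bug 600737)
    if not au_ok:
        dut['powered_on'] = False   # DUT forced off before USB install
        usb_download_ok = False     # bad USB stick: download fails
        if not usb_download_ok:
            return 'repair FAIL, DUT left powered off'
    return 'repair GOOD'

def later_verify_then_repair(dut):
    if not dut['powered_on']:       # verify.ssh: no answer to ping
        dut['powered_on'] = True    # repair.reset: servo forces the DUT on
        return 'verify FAIL -> repair GOOD (bug 602873 hides the failed AU)'
    return 'verify GOOD'

dut = {'powered_on': True}
for _ in range(3):                  # and so on, indefinitely
    print(failed_repair(dut))
    print(later_verify_then_repair(dut))
====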
Here's the repair log from a failed repair, showing the
failed re-install (repair.usb).
====
START ---- repair timestamp=1460454699 localtime=Apr 12 02:51:39
GOOD ---- verify.ssh timestamp=1460454723 localtime=Apr 12 02:52:03
GOOD ---- verify.power timestamp=1460454724 localtime=Apr 12 02:52:04
GOOD ---- verify.cros timestamp=1460454726 localtime=Apr 12 02:52:06
FAIL ---- verify.good_au timestamp=1460454727 localtime=Apr 12 02:52:07 Last AU on this DUT failed
START ---- repair.firmware timestamp=1460454727 localtime=Apr 12 02:52:07
FAIL ---- repair.firmware timestamp=1460454727 localtime=Apr 12 02:52:07 Firmware repair is not applicable to host chromeos4-row1-rack3-host13.
END FAIL ---- repair.firmware timestamp=1460454727 localtime=Apr 12 02:52:07
GOOD ---- verify.tpm timestamp=1460454727 localtime=Apr 12 02:52:07
START ---- repair.powerwash timestamp=1460454727 localtime=Apr 12 02:52:07
START ---- reboot timestamp=1460454727 localtime=Apr 12 02:52:07
GOOD ---- reboot.start timestamp=1460454727 localtime=Apr 12 02:52:07
GOOD ---- reboot.verify timestamp=1460454765 localtime=Apr 12 02:52:45
END GOOD ---- reboot kernel=3.8.11 localtime=Apr 12 02:52:45 timestamp=1460454765
START ---- reboot timestamp=1460454772 localtime=Apr 12 02:52:52
GOOD ---- reboot.start timestamp=1460454772 localtime=Apr 12 02:52:52
GOOD ---- reboot.verify timestamp=1460454794 localtime=Apr 12 02:53:14
END GOOD ---- reboot kernel=3.8.11 localtime=Apr 12 02:53:15 timestamp=1460454795
FAIL ---- repair.powerwash timestamp=1460454865 localtime=Apr 12 02:54:25 Failed to install device image using payload at http://100.107.160.8:8082/update/wolf-release/R50-7978.48.0 on chromeos4-row1-rack3-host13. Update failed. Returned update_engine error code: ERROR_CODE=49, ERROR_MESSAGE=ErrorCode::kNonCriticalUpdateInOOBE. Reported error: AutoservRunError
END FAIL ---- repair.powerwash timestamp=1460454865 localtime=Apr 12 02:54:25
START ---- repair.usb timestamp=1460454865 localtime=Apr 12 02:54:25
FAIL ---- repair.usb timestamp=1460454915 localtime=Apr 12 02:55:15 Download image to usb failed.
END FAIL ---- repair.usb timestamp=1460454915 localtime=Apr 12 02:55:15
END FAIL ---- repair timestamp=1460454915 localtime=Apr 12 02:55:15
====
And here's the successful repair that followed it:
====
START ---- repair timestamp=1460456156 localtime=Apr 12 03:15:56
FAIL ---- verify.ssh timestamp=1460456764 localtime=Apr 12 03:26:04 No answer to ping from chromeos4-row1-rack3-host13
START ---- repair.rpm timestamp=1460456764 localtime=Apr 12 03:26:04
FAIL ---- repair.rpm timestamp=1460456939 localtime=Apr 12 03:28:59 chromeos4-row1-rack3-host13 is still offline after powercycling
END FAIL ---- repair.rpm timestamp=1460456939 localtime=Apr 12 03:28:59
START ---- repair.reset timestamp=1460456939 localtime=Apr 12 03:28:59
GOOD ---- verify.ssh timestamp=1460456970 localtime=Apr 12 03:29:30
END GOOD ---- repair.reset timestamp=1460456970 localtime=Apr 12 03:29:30
GOOD ---- verify.cros timestamp=1460456973 localtime=Apr 12 03:29:33
GOOD ---- verify.good_au timestamp=1460456973 localtime=Apr 12 03:29:33
GOOD ---- verify.tpm timestamp=1460456974 localtime=Apr 12 03:29:34
GOOD ---- verify.power timestamp=1460456974 localtime=Apr 12 03:29:34
GOOD ---- verify.PASS timestamp=1460456974 localtime=Apr 12 03:29:34
START ---- reboot timestamp=1460456974 localtime=Apr 12 03:29:34
GOOD ---- reboot.start timestamp=1460456974 localtime=Apr 12 03:29:34
GOOD ---- reboot.verify timestamp=1460456997 localtime=Apr 12 03:29:57
END GOOD ---- reboot kernel=3.8.11 localtime=Apr 12 03:29:57 timestamp=1460456997
END GOOD ---- repair timestamp=1460456997 localtime=Apr 12 03:29:57 {'host_attributes': {'powerunit_hostname': 'chromeos4-row1_2-rack3-rpm2', 'powerunit_outlet': '.A13'}, 'hostname': 'chromeos4-row1-rack3-host13'} repaired successfully
Apr 13 2016
I'm declaring victory here. This explains previous misbehavior seen in bug 602873. This is (alas) unrelated to bug 598517.
Apr 13 2016
Looking at the log
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row1-rack3-host13/412153-verify/20161104190349/debug/
some more, I see that from 19:03:55.348 to 19:08:13.266 the DUT
chromeos4-row1-rack3-host13.cros was never reachable, despite multiple
contact attempts using different methods, while its servo was reachable.
One caveat: the servo is on a different subnet than the DUT:
100.107.160.157 chromeos4-row1-rack3-host13.cros
100.107.192.157 chromeos4-row1-rack3-host13-servo.cros
On the same subnet are chromeos4-row1-rack3-host11.cros (100.107.160.155)
and host9.cros (*.153). They ran graphics tests a little before that,
supposedly successfully, but no logs got uploaded to GS:
https://uberchromeos-server38.corp.google.com/new_tko/#tab_id=test_detail_view&object_id=287196959
https://uberchromeos-server38.corp.google.com/new_tko/#tab_id=test_detail_view&object_id=287197193
What could be the reason for this? Do we have logs/traces of what happened
on the 100.107.160.* subnet at about 7pm on 4/11? Could we add some more
pings to neighboring devices when we see that a machine is unreachable, to
determine the state of the subnet and decide whether it is a machine or a
network problem?
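Regarding that last suggestion, here is a minimal sketch of what such a check might look like: when the DUT is unreachable, probe a couple of same-subnet neighbors and the servo to separate a machine problem from a subnet problem. The hostnames are the ones from this comment (with host9 expanded to its full name); the helper itself is hypothetical, not existing lab tooling.
====
# Sketch: when a DUT is unreachable, ping neighbors on its subnet and its
# servo to decide between a machine problem and a network problem.
# Hypothetical helper; hostnames taken from this comment.
import subprocess

def reachable(host):
    return subprocess.call(['ping', '-w1', '-c1', host],
                           stdout=subprocess.DEVNULL) == 0

dut = 'chromeos4-row1-rack3-host13.cros'
neighbors = ['chromeos4-row1-rack3-host11.cros',  # 100.107.160.155
             'chromeos4-row1-rack3-host9.cros']   # 100.107.160.153
servo = 'chromeos4-row1-rack3-host13-servo.cros'  # 100.107.192.157, other subnet

if not reachable(dut):
    up = [h for h in neighbors + [servo] if reachable(h)]
    if up:
        print('DUT down, but these answer ping -> likely a machine problem:', up)
    else:
        print('nothing answers ping -> likely a subnet/network problem')
====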
Apr 13 2016
> What could be the reason for this?
Comment #11 explains why the DUT was down. This problem is explained purely by DUT behaviors; network behaviors aren't a factor.