provision failures: DUT cannot recover from reboot at post check of rootfs update |
||||||||||||
Issue descriptionExamples: https://luci-milo.appspot.com/buildbot/chromeos/cave-paladin/2210 Due to there's a '\n' at the end of the error messages, builder page only show the first provision error. Will have a fix for that. We can find useful error messages here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/156315631-chromeos-test/chromeos2-row8-rack7-host11/debug/
,
Nov 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d0e6c9442800fec993b880936fcf0fba376c3315 commit d0e6c9442800fec993b880936fcf0fba376c3315 Author: Xixuan Wu <xixuan@chromium.org> Date: Sat Nov 18 04:18:48 2017 autotest: remove line breaks in provision error messages. BUG= chromium:785552 TEST=Ran unittest. Change-Id: Ic747f0d412fef5e5b9bc8b1f0768e939576424c0 Reviewed-on: https://chromium-review.googlesource.com/773181 Commit-Ready: Ben Chan <benchan@chromium.org> Tested-by: Xixuan Wu <xixuan@chromium.org> Reviewed-by: Ben Chan <benchan@chromium.org> [modify] https://crrev.com/d0e6c9442800fec993b880936fcf0fba376c3315/client/common_lib/cros/dev_server.py
,
Nov 20 2017
the last CQ just failed: https://uberchromegw.corp.google.com/i/chromeos/builders/coral-paladin/builds/1823 Will the latest push_to_prod fix this ?
,
Nov 20 2017
No, the fix is only for better error reporting. Assign to current sheriff for further checking whether there's any bad CL/bug in ToT with this failed build.
,
Nov 21 2017
I don't see a CL that relates to coral specifically.
Also seeing this dump:
[1;31m15:06:52: ERROR: BaseException in _RunParallelStages <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 442, in _Run
self._task(*self._task_args, **self._task_kwargs)
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 701, in Run
self.PerformStage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 319, in PerformStage
raise cmd_result.to_raise
TestLabFailure: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
File "/b/c/cbuild/repository/chromite/cbuildbot/builders/generic_builders.py", line 120, in _RunParallelStages
parallel.RunParallelSteps(steps)
File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 679, in RunParallelSteps
return [queue.get_nowait() for queue in queues]
File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 676, in RunParallelSteps
pass
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 562, in ParallelTasks
raise BackgroundFailure(exc_infos=errors)
BackgroundFailure: <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 442, in _Run
self._task(*self._task_args, **self._task_kwargs)
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 701, in Run
self.PerformStage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 319, in PerformStage
raise cmd_result.to_raise
TestLabFailure: ** HWTest did not complete due to infrastructure issues (code 3) **
,
Nov 21 2017
HWTest fails due to DUT cannot get back to online in provision and loses connection "forever" until next repair: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/157427082-chromeos-test/chromeos2-row4-rack10-host12/ Could sheriff find a proper owner to check why DUT loses connection in reboot? Maybe it's another case of ethernet dongle problem like Issue 768542, Issue 777920.
,
Nov 21 2017
Hi Ben, How did you verify the ethernet dongle error issue in the runs last week?
,
Nov 21 2017
*** The provision_AutoUpdate log indicated that issue happened around 11/20/17 13:55:51
START provision_AutoUpdate provision_AutoUpdate timestamp=1511214951 localtime=Nov 20 13:55:51
FAIL provision_AutoUpdate provision_AutoUpdate timestamp=1511217103 localtime=Nov 20 14:31:43 Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row4-rack10-host12: 0) ChromiumOSUpdateError: chromeos2-row4-rack10-host12 cannot recover from reboot at post check of stateful update, 1) SSHConnectionError: ssh: connect to host 100.115.227.75 port 22: Connection timed out
*** It wasn't clear what happened around that time frame, but the Ethernet adapter issue did appear on 11/21/17 07:57:03. The fix was to flip the adapter back to normal:
*** The Linksys Ethernet adapter went into the UMS mode (13b1:0040)
2017-11-21T07:57:03.752833+09:00 INFO kernel: [ 4.137747] usb 2-6: new SuperSpeed USB device number 4 using xhci_hcd
2017-11-21T07:57:03.752837+09:00 INFO kernel: [ 4.149265] usb 2-6: New USB device found, idVendor=13b1, idProduct=0040
2017-11-21T07:57:03.752838+09:00 INFO kernel: [ 4.149284] usb 2-6: New USB device strings: Mfr=1, Product=2, SerialNumber=6
2017-11-21T07:57:03.752839+09:00 INFO kernel: [ 4.149293] usb 2-6: Product: Linksys USB3GIGV1
2017-11-21T07:57:03.752841+09:00 INFO kernel: [ 4.149301] usb 2-6: Manufacturer: Linksys
2017-11-21T07:57:03.752842+09:00 INFO kernel: [ 4.149308] usb 2-6: SerialNumber: 000001000000
2017-11-21T07:57:03.752844+09:00 INFO kernel: [ 4.150731] usb-storage 2-6:1.0: USB Mass Storage device detected
2017-11-21T07:57:03.752849+09:00 INFO kernel: [ 4.152463] scsi host0: usb-storage 2-6:1.0
...
*** mist was invoked to switch the Ethernet adapter back Ethernet mode. It did so by performing an USB port reset. You may ignore the mist error for now. The adapter disconnected from USB after port reset (see below).
2017-11-21T07:57:04.309199+09:00 ERR mist[344]: Could not perform a USB port reset: ErrorNotFound
2017-11-21T07:57:04.309227+09:00 ERR mist[344]: Could not switch device '/sys/devices/pci0000:00/0000:00:15.0/usb2/2-6' into the modem mode.
...
*** The Ethernet adapter went back to Ethernet mode (13b1:0041) and eth1 came up.
2017-11-21T07:57:04.666075+09:00 INFO kernel: [ 5.736506] usb 2-6: new SuperSpeed USB device number 5 using xhci_hcd
2017-11-21T07:57:04.677641+09:00 INFO kernel: [ 5.747937] usb 2-6: New USB device found, idVendor=13b1, idProduct=0041
2017-11-21T07:57:04.677676+09:00 INFO kernel: [ 5.747948] usb 2-6: New USB device strings: Mfr=1, Product=2, SerialNumber=6
2017-11-21T07:57:04.677679+09:00 INFO kernel: [ 5.747951] usb 2-6: Product: Linksys USB3GIGV1
2017-11-21T07:57:04.677681+09:00 INFO kernel: [ 5.747954] usb 2-6: Manufacturer: Linksys
2017-11-21T07:57:04.677682+09:00 INFO kernel: [ 5.747957] usb 2-6: SerialNumber: 000001000000
2017-11-21T07:57:04.786091+09:00 INFO kernel: [ 5.856737] usb 2-6: reset SuperSpeed USB device number 5 using xhci_hcd
2017-11-21T07:57:04.826574+09:00 INFO kernel: [ 5.895002] r8152 2-6:1.0 eth1: v1.08.3
...
*** shill picked up eth1 and DHCP went through
2017-11-21T07:57:05.331061+09:00 INFO shill[1154]: [INFO:device.cc(202)] Device created: eth1 index 4
2017-11-21T07:57:05.333486+09:00 INFO shill[1154]: [INFO:service.cc(290)] ethernet service 2 constructed.
2017-11-21T07:57:05.334367+09:00 INFO shill[1154]: [INFO:manager.cc(1210)] Device eth1 registered.
2017-11-21T07:57:05.334812+09:00 INFO shill[1154]: [INFO:manager.cc(1484)] Device eth1 updated: enabled
2017-11-21T07:57:05.391153+09:00 INFO shill[1154]: [INFO:ethernet.cc(121)] Registering eth1 with manager.
2017-11-21T07:57:05.391476+09:00 INFO kernel: [ 6.461892] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
...
2017-11-21T07:57:09.100050+09:00 INFO kernel: [ 10.171072] r8152 2-6:1.0 eth1: carrier on
2017-11-21T07:57:09.100087+09:00 INFO kernel: [ 10.171115] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
...
2017-11-21T07:57:09.921730+09:00 INFO dhcpcd[1321]: eth1: rebinding lease of 100.115.227.75
2017-11-21T07:57:09.921944+09:00 INFO dhcpcd[1321]: eth1: sending REQUEST (xid 0xd5237bc5), next in 4.1 seconds
,
Nov 21 2017
I logged onto chromeos2-row4-rack10-host12 but couldn't find any relevant log on 11/20/17. Seems like the stateful was wiped after a full recovery. Subsequent log did indicate that we were able to recover the Ethernet adapter that went into mass storage mode, so I'm not quite sure what led to the AU error on 11/20/17. Have we seen other instances today?
,
Nov 21 2017
Thanks for all the detailed analysis, Ben! The commit queue was green for 4 consecutive times last night. That seems to imply that the provisioning failure wasn't due to ethernet flakiness? Ben, how often would the ethernet dongle issue come up in the CQ last week? Do you think after 4 straight clean runs you would call it fixed?
,
Nov 21 2017
Hi, ben, we found in staging lab, quawks DUT consistently fail provision due to the same reason: https://storage.cloud.google.com/chromeos-autotest-results/5149-chromeos-test/chromeos2-row1-rack2-host15/autoupdate_logs/CrOS_update_100.115.245.232_11809.log?_ga=2.131546951.-999627312.1503341672 https://storage.cloud.google.com/chromeos-autotest-results/5131-chromeos-test/chromeos2-row1-rack2-host17/autoupdate_logs/CrOS_update_100.115.245.233_17313.log?_ga=2.132063175.-999627312.1503341672 Is it a push blocker? which means some newly committed CL make this happen and we can't do a push.
,
Nov 21 2017
It's not exactly the same reason, both DUTs failed to reboot at post check of stateful update.
,
Nov 22 2017
https://luci-milo.appspot.com/buildbot/chromeos/coral-paladin/1834 I think we saw another failure this afternoon
,
Nov 22 2017
,
Nov 27 2017
Ben any status update on this?
,
Nov 27 2017
The SSHConnectionError could be a symptom of a similar issue or completely different issue (e.g. issue 788584, 777152, 733425). The fix to issue 452686 specifically targets the Ethernet adapters mentioned above when they are enumerated as a USB storage. Other issues like 788584 seem to indicate that the adapter failed to enumerate over USB, which will require a different fix. Since Chrome OS build 10141.0.0, we should be able to switch the following Ethernet adapters back to Ethernet mode once we detect they are in the USB mass storage mode (more details can be found in issue 452686 ): - Samsung USB Ethernet Adapter AA-AE2N12B [04e8:a100 -> 04e8:a101] - Linksys USB3GIGV1 USB 3.0 Gigabit Ethernet Adapter [13b1:0040 -> 13b1:0041] - Realtek RTL8153 based Ethernet adapters [0bda:8152 -> 0bda:8153] From what I observed in my local reboot stress tests, these adapters were switched back to Ethernet mode as expected. The logs (in comment #9) seemed to also confirm that the switching did happen. We can examine the before-repair logs for the DUTs used in those failed paladin runs to confirm. Is that any easy to pull those logs out?
,
Nov 27 2017
You can run dut-status: xixuan@xixuan0:~/chromiumos/src/third_party/autotest/files$ dut-status chromeos2-row4-rack10-host12 -s '2017-11-20 12:00:00' -u '2017-11-20 17:00:00' -f 2017-11-20 16:57:47 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61825265-reset/ 2017-11-20 16:53:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157458549-chromeos-test/ 2017-11-20 16:41:41 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61825216-provision/ 2017-11-20 14:37:08 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61824851-repair/ 2017-11-20 13:55:17 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61824131-provision/ ---the failed provision_AutoUpdate in #9 2017-11-20 12:11:04 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415526-chromeos-test/ 2017-11-20 12:09:29 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61823915-reset/ 2017-11-20 12:07:02 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415520-chromeos-test/ 2017-11-20 12:05:58 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61823904-reset/ 2017-11-20 12:03:15 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415515-chromeos-test/ 2017-11-20 12:02:15 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61823879-reset/ 2017-11-20 12:00:19 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415503-chromeos-test/
,
Nov 28 2017
Re #18: Thanks! According to http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61824851-repair: The Linksys Ethernet adapter failed to enumerate over USB: 2017-11-21T07:45:42.017981+09:00 ERR kernel: [ 1138.446978] usb 2-6: device not accepting address 113, error -71 2017-11-21T07:45:42.038304+09:00 ERR kernel: [ 1138.467844] usb usb2-port6: unable to enumerate USB device # FAIL ---- verify.ssh timestamp=1511218042 localtime=Nov 20 14:47:22 No answer to ping from chromeos2-row4-rack10-host12 # START ---- repair.servoreset timestamp=1511218528 localtime=Nov 20 14:55:28 2017-11-21T07:55:31.352744+09:00 INFO kernel: [ 2.250611] usb 2-6: new SuperSpeed USB device number 3 using xhci_hcd 2017-11-21T07:55:31.352746+09:00 INFO kernel: [ 2.262477] usb 2-6: New USB device found, idVendor=13b1, idProduct=0041 2017-11-21T07:55:31.352752+09:00 INFO kernel: [ 2.262497] usb 2-6: New USB device strings: Mfr=1, Product=2, SerialNumber=6 2017-11-21T07:55:31.352755+09:00 INFO kernel: [ 2.262506] usb 2-6: Product: Linksys USB3GIGV1 2017-11-21T07:55:31.352756+09:00 INFO kernel: [ 2.262514] usb 2-6: Manufacturer: Linksys 2017-11-21T07:55:31.352757+09:00 INFO kernel: [ 2.262526] usb 2-6: SerialNumber: 000001000000
,
Nov 28 2017
w.r.t. #19: this looks like issue 788584
,
Mar 30 2018
,
Mar 30 2018
,
Jun 6 2018
,
Jun 8 2018
,
Jul 27
,
Dec 12
|
||||||||||||
►
Sign in to add a comment |
||||||||||||
Comment 1 by shuqianz@chromium.org
, Nov 17 2017Owner: xixuan@chromium.org