New issue
Advanced search Search tips
Starred by 1 user
Status: Untriaged
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----



Sign in to add a comment
provision failures: DUT cannot recover from reboot at post check of rootfs update
Project Member Reported by xixuan@chromium.org, Nov 15 Back to list
Examples:

https://luci-milo.appspot.com/buildbot/chromeos/cave-paladin/2210

Due to there's a '\n' at the end of the error messages, builder page only show the first provision error. Will have a fix for that.

We can find useful error messages here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/156315631-chromeos-test/chromeos2-row8-rack7-host11/debug/
 
Cc: -xixuan@chromium.org
Owner: xixuan@chromium.org
https://luci-milo.appspot.com/buildbot/chromeos/coral-paladin/1789

Another one
Project Member Comment 2 by bugdroid1@chromium.org, Nov 18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d0e6c9442800fec993b880936fcf0fba376c3315

commit d0e6c9442800fec993b880936fcf0fba376c3315
Author: Xixuan Wu <xixuan@chromium.org>
Date: Sat Nov 18 04:18:48 2017

autotest: remove line breaks in provision error messages.

BUG=chromium:785552
TEST=Ran unittest.

Change-Id: Ic747f0d412fef5e5b9bc8b1f0768e939576424c0
Reviewed-on: https://chromium-review.googlesource.com/773181
Commit-Ready: Ben Chan <benchan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Ben Chan <benchan@chromium.org>

[modify] https://crrev.com/d0e6c9442800fec993b880936fcf0fba376c3315/client/common_lib/cros/dev_server.py

the last CQ just failed:

https://uberchromegw.corp.google.com/i/chromeos/builders/coral-paladin/builds/1823

Will the latest push_to_prod fix this ?
Cc: -shuqianz@chromium.org nxia@chromium.org xixuan@chromium.org
Owner: shchen@chromium.org
No, the fix is only for better error reporting.

Assign to current sheriff for further checking whether there's any bad CL/bug in ToT with this failed build.
I don't see a CL that relates to coral specifically.

Also seeing this dump:

[1;31m15:06:52: ERROR: BaseException in _RunParallelStages <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 442, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 701, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 319, in PerformStage
    raise cmd_result.to_raise
TestLabFailure: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/cbuildbot/builders/generic_builders.py", line 120, in _RunParallelStages
    parallel.RunParallelSteps(steps)
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 679, in RunParallelSteps
    return [queue.get_nowait() for queue in queues]
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 676, in RunParallelSteps
    pass
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 562, in ParallelTasks
    raise BackgroundFailure(exc_infos=errors)
BackgroundFailure: <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 442, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 701, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 319, in PerformStage
    raise cmd_result.to_raise
TestLabFailure: ** HWTest did not complete due to infrastructure issues (code 3) **
HWTest fails due to DUT cannot get back to online in provision and loses connection "forever" until next repair:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/157427082-chromeos-test/chromeos2-row4-rack10-host12/

Could sheriff find a proper owner to check why DUT loses connection in reboot? Maybe it's another case of ethernet dongle problem like Issue 768542, Issue 777920.
Owner: benchan@chromium.org
Hi Ben,

How did you verify the ethernet dongle error issue in the runs last week?

*** The provision_AutoUpdate log indicated that issue happened around 11/20/17 13:55:51

START provision_AutoUpdate  provision_AutoUpdate  timestamp=1511214951  localtime=Nov 20 13:55:51 
    FAIL  provision_AutoUpdate  provision_AutoUpdate  timestamp=1511217103  localtime=Nov 20 14:31:43 Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row4-rack10-host12: 0) ChromiumOSUpdateError: chromeos2-row4-rack10-host12 cannot recover from reboot at post check of stateful update, 1) SSHConnectionError: ssh: connect to host 100.115.227.75 port 22: Connection timed out


*** It wasn't clear what happened around that time frame, but the Ethernet adapter issue did appear on 11/21/17 07:57:03. The fix was to flip the adapter back to normal:

*** The Linksys Ethernet adapter went into the UMS mode (13b1:0040)

2017-11-21T07:57:03.752833+09:00 INFO kernel: [    4.137747] usb 2-6: new SuperSpeed USB device number 4 using xhci_hcd
2017-11-21T07:57:03.752837+09:00 INFO kernel: [    4.149265] usb 2-6: New USB device found, idVendor=13b1, idProduct=0040
2017-11-21T07:57:03.752838+09:00 INFO kernel: [    4.149284] usb 2-6: New USB device strings: Mfr=1, Product=2, SerialNumber=6
2017-11-21T07:57:03.752839+09:00 INFO kernel: [    4.149293] usb 2-6: Product: Linksys USB3GIGV1
2017-11-21T07:57:03.752841+09:00 INFO kernel: [    4.149301] usb 2-6: Manufacturer: Linksys
2017-11-21T07:57:03.752842+09:00 INFO kernel: [    4.149308] usb 2-6: SerialNumber: 000001000000
2017-11-21T07:57:03.752844+09:00 INFO kernel: [    4.150731] usb-storage 2-6:1.0: USB Mass Storage device detected
2017-11-21T07:57:03.752849+09:00 INFO kernel: [    4.152463] scsi host0: usb-storage 2-6:1.0

...

*** mist was invoked to switch the Ethernet adapter back Ethernet mode. It did so by performing an USB port reset. You may ignore the mist error for now. The adapter disconnected from USB after port reset (see below).

2017-11-21T07:57:04.309199+09:00 ERR mist[344]: Could not perform a USB port reset: ErrorNotFound
2017-11-21T07:57:04.309227+09:00 ERR mist[344]: Could not switch device '/sys/devices/pci0000:00/0000:00:15.0/usb2/2-6' into the modem mode.

...

*** The Ethernet adapter went back to Ethernet mode (13b1:0041) and eth1 came up.

2017-11-21T07:57:04.666075+09:00 INFO kernel: [    5.736506] usb 2-6: new SuperSpeed USB device number 5 using xhci_hcd
2017-11-21T07:57:04.677641+09:00 INFO kernel: [    5.747937] usb 2-6: New USB device found, idVendor=13b1, idProduct=0041
2017-11-21T07:57:04.677676+09:00 INFO kernel: [    5.747948] usb 2-6: New USB device strings: Mfr=1, Product=2, SerialNumber=6
2017-11-21T07:57:04.677679+09:00 INFO kernel: [    5.747951] usb 2-6: Product: Linksys USB3GIGV1
2017-11-21T07:57:04.677681+09:00 INFO kernel: [    5.747954] usb 2-6: Manufacturer: Linksys
2017-11-21T07:57:04.677682+09:00 INFO kernel: [    5.747957] usb 2-6: SerialNumber: 000001000000
2017-11-21T07:57:04.786091+09:00 INFO kernel: [    5.856737] usb 2-6: reset SuperSpeed USB device number 5 using xhci_hcd
2017-11-21T07:57:04.826574+09:00 INFO kernel: [    5.895002] r8152 2-6:1.0 eth1: v1.08.3

...

*** shill picked up eth1 and DHCP went through

2017-11-21T07:57:05.331061+09:00 INFO shill[1154]: [INFO:device.cc(202)] Device created: eth1 index 4
2017-11-21T07:57:05.333486+09:00 INFO shill[1154]: [INFO:service.cc(290)] ethernet service 2 constructed.
2017-11-21T07:57:05.334367+09:00 INFO shill[1154]: [INFO:manager.cc(1210)] Device eth1 registered.
2017-11-21T07:57:05.334812+09:00 INFO shill[1154]: [INFO:manager.cc(1484)] Device eth1 updated: enabled
2017-11-21T07:57:05.391153+09:00 INFO shill[1154]: [INFO:ethernet.cc(121)] Registering eth1 with manager.
2017-11-21T07:57:05.391476+09:00 INFO kernel: [    6.461892] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
...
2017-11-21T07:57:09.100050+09:00 INFO kernel: [   10.171072] r8152 2-6:1.0 eth1: carrier on
2017-11-21T07:57:09.100087+09:00 INFO kernel: [   10.171115] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
...
2017-11-21T07:57:09.921730+09:00 INFO dhcpcd[1321]: eth1: rebinding lease of 100.115.227.75
2017-11-21T07:57:09.921944+09:00 INFO dhcpcd[1321]: eth1: sending REQUEST (xid 0xd5237bc5), next in 4.1 seconds
I logged onto chromeos2-row4-rack10-host12 but couldn't find any relevant log on 11/20/17. Seems like the stateful was wiped after a full recovery. Subsequent log did indicate that we were able to recover the Ethernet adapter that went into mass storage mode, so I'm not quite sure what led to the AU error on 11/20/17.

Have we seen other instances today?
Thanks for all the detailed analysis, Ben!

The commit queue was green for 4 consecutive times last night.  That seems to imply that the provisioning failure wasn't due to ethernet flakiness?

Ben, how often would the ethernet dongle issue come up in the CQ last week?  Do you think after 4 straight clean runs you would call it fixed?
It's not exactly the same reason, both DUTs failed to reboot at post check of stateful update.
https://luci-milo.appspot.com/buildbot/chromeos/coral-paladin/1834

I think we saw another failure this afternoon
Labels: Chase-Pending
Labels: -Chase-Pending
Ben any status update on this?
The SSHConnectionError could be a symptom of a similar issue or completely different issue (e.g. issue 788584, 777152, 733425). The fix to  issue 452686  specifically targets the Ethernet adapters mentioned above when they are enumerated as a USB storage. Other issues like 788584 seem to indicate that the adapter failed to enumerate over USB, which will require a different fix.

Since Chrome OS build 10141.0.0, we should be able to switch the following Ethernet adapters back to Ethernet mode once we detect they are in the USB mass storage mode (more details can be found in  issue 452686 ):

- Samsung USB Ethernet Adapter AA-AE2N12B [04e8:a100 -> 04e8:a101]
- Linksys USB3GIGV1 USB 3.0 Gigabit Ethernet Adapter [13b1:0040 -> 13b1:0041]
- Realtek RTL8153 based Ethernet adapters [0bda:8152 -> 0bda:8153]

From what I observed in my local reboot stress tests, these adapters were switched back to Ethernet mode as expected. The logs (in comment #9) seemed to also confirm that the switching did happen. We can examine the before-repair logs for the DUTs used in those failed paladin runs to confirm. Is that any easy to pull those logs out?


You can run dut-status:

xixuan@xixuan0:~/chromiumos/src/third_party/autotest/files$ dut-status chromeos2-row4-rack10-host12 -s '2017-11-20 12:00:00' -u '2017-11-20 17:00:00' -f
2017-11-20 16:57:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61825265-reset/
    2017-11-20 16:53:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157458549-chromeos-test/
    2017-11-20 16:41:41  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61825216-provision/
    2017-11-20 14:37:08  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61824851-repair/
    2017-11-20 13:55:17  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61824131-provision/ ---the failed provision_AutoUpdate in #9
    2017-11-20 12:11:04  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415526-chromeos-test/
    2017-11-20 12:09:29  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61823915-reset/
    2017-11-20 12:07:02  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415520-chromeos-test/
    2017-11-20 12:05:58  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61823904-reset/
    2017-11-20 12:03:15  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415515-chromeos-test/
    2017-11-20 12:02:15  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61823879-reset/
    2017-11-20 12:00:19  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/157415503-chromeos-test/
Re #18: Thanks!

According to http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host12/61824851-repair:

The Linksys Ethernet adapter failed to enumerate over USB:

2017-11-21T07:45:42.017981+09:00 ERR kernel: [ 1138.446978] usb 2-6: device not accepting address 113, error -71
2017-11-21T07:45:42.038304+09:00 ERR kernel: [ 1138.467844] usb usb2-port6: unable to enumerate USB device

# FAIL	----	verify.ssh	timestamp=1511218042	localtime=Nov 20 14:47:22	No answer to ping from chromeos2-row4-rack10-host12

# START	----	repair.servoreset	timestamp=1511218528	localtime=Nov 20 14:55:28

2017-11-21T07:55:31.352744+09:00 INFO kernel: [    2.250611] usb 2-6: new SuperSpeed USB device number 3 using xhci_hcd
2017-11-21T07:55:31.352746+09:00 INFO kernel: [    2.262477] usb 2-6: New USB device found, idVendor=13b1, idProduct=0041
2017-11-21T07:55:31.352752+09:00 INFO kernel: [    2.262497] usb 2-6: New USB device strings: Mfr=1, Product=2, SerialNumber=6
2017-11-21T07:55:31.352755+09:00 INFO kernel: [    2.262506] usb 2-6: Product: Linksys USB3GIGV1
2017-11-21T07:55:31.352756+09:00 INFO kernel: [    2.262514] usb 2-6: Manufacturer: Linksys
2017-11-21T07:55:31.352757+09:00 INFO kernel: [    2.262526] usb 2-6: SerialNumber: 000001000000
w.r.t. #19: this looks like issue 788584
Sign in to add a comment