kevin DUTs provision fail looping | ssh connecting timing out |
Issue description
$ dut-status -f chromeos2-row8-rack8-host3
chromeos2-row8-rack8-host3
2017-02-14 10:04:59 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59984564-repair/
2017-02-14 09:25:39 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59983986-provision/
2017-02-14 07:25:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982640-repair/
2017-02-14 06:45:58 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982095-provision/
2017-02-14 04:47:27 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59980801-repair/
2017-02-14 04:01:43 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59980011-provision/
2017-02-14 01:52:53 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59978508-repair/
2017-02-14 01:09:01 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59978047-provision/
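The alternating pattern in this history (a provision that does not end OK, followed by a repair that does) can be counted mechanically. The following is a minimal sketch, not part of any lab tooling: it assumes the line layout shown above (timestamp, status code, log URL ending in `<job-id>-<task>/`), and the names `Event`, `parse_history`, and `count_provision_repair_cycles` are hypothetical. It also assumes that any provision entry not marked `OK` counts as a failed attempt, which is how this issue reads the history.

```python
import re
from typing import List, NamedTuple


class Event(NamedTuple):
    timestamp: str  # e.g. "2017-02-14 10:04:59"
    status: str     # status code as printed: "OK", "NO", or "--"
    task: str       # task name taken from the URL, e.g. "provision"


# Assumed layout: "<date> <time> <status> <url ending in /<id>-<task>/>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(OK|NO|--)\s+\S*/\d+-(\w+)/?$"
)


def parse_history(text: str) -> List[Event]:
    """Parse history lines; lines that do not match (host name, etc.) are skipped."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append(Event(*match.groups()))
    return events


def count_provision_repair_cycles(events: List[Event]) -> int:
    """Count 'repair OK' entries directly preceded in time by a non-OK provision.

    The history is assumed to be listed newest first, as in the output above,
    so the earlier task is the next element in the list.
    """
    cycles = 0
    for newer, older in zip(events, events[1:]):
        if (newer.task == "repair" and newer.status == "OK"
                and older.task == "provision" and older.status != "OK"):
            cycles += 1
    return cycles
```

Fed the eight history lines above, this would report four provision/repair cycles, which is the looping behavior the issue title describes.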
,
Feb 14 2017
+jrbarnette maybe the Repair verifier should also check that the DUT is ssh-able from a devserver? +haowei Is this ssh failure some other network configuration issue?
,
Feb 14 2017
,
Feb 14 2017
I don't think this is a network issue. The provision failure pattern is that after installing R58-9282.0.0-rc1, R58-9282.0.0-rc2, or R58-9282.0.0-rc3 to the rootfs partition, the host does not come back from rebooting. Before this failure pattern started, the DUT was successfully provisioned to R58-9282.0.0-rc1 (http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59976036-provision/).
So I think this DUT itself has a problem. We should take it out of the CQ and replace it with another one, then investigate the problem or, in the worst case, ask Englab for a repair. We need an expert who knows how to investigate why this host cannot reboot after a rootfs update.
The update_engine.log shows:
[0214/075639:INFO:update_attempter.cc(1201)] Marking booted slot as good.
[0214/075641:INFO:subprocess.cc(156)] Subprocess output:
Starting Google_Kevin firmware updater v5 (bootok)...
- Updater package: [Google_Kevin.8785.135.0 / EC:kevin_v1.10.137-b44884d]
- Current system: [RO:Google_Kevin.8785.135.0 , ACT:Google_Kevin.8785.135.0 / EC:kevin_v1.10.137-b44884d]
- Write protection: Hardware: off, Software: Main=off
Firmware update (bootok) completed.
[0214/085554:WARNING:libpolicy.cc(36)] Could not load the device policy file.
[0214/085554:INFO:real_device_policy_provider.cc(164)] No device policies/settings present.
Is this logging normal? Is it related to the 'cannot reboot' error?
,
Feb 14 2017
Here's recent history for the DUT:
chromeos2-row8-rack8-host3
2017-02-14 10:04:59 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59984564-repair/
2017-02-14 09:25:39 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59983986-provision/
2017-02-14 07:25:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982640-repair/
2017-02-14 06:45:58 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982095-provision/
I've attached the status.log from the last failed repair.
Here's the important part:
FAIL ---- verify.ssh timestamp=1487096223 localtime=Feb 14 10:17:03 No answer to ping from chromeos2-row8-rack8-host3
The thing that repaired it was this step:
END GOOD ---- repair.usb timestamp=1487097373 localtime=Feb 14 10:36:13
So, the DUT went offline. We couldn't bring it back up until
we re-installed from USB.
That's not a network problem. It's also not a problem with
devserver access to the DUT. The problem is that the DUT
crashed after provisioning the build.
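The two status.log entries quoted in this comment carry epoch timestamps, so the time between the failed ssh check and the successful USB repair can be computed directly. This is a rough sketch that assumes the entries are whitespace-separated exactly as quoted here (the on-disk status.log may be laid out differently); `StatusEntry` and `parse_entry` are hypothetical names, not part of autotest.

```python
import re
from typing import NamedTuple, Optional


class StatusEntry(NamedTuple):
    is_end: bool     # True when the entry starts with "END"
    status: str      # e.g. "FAIL" or "GOOD"
    operation: str   # e.g. "verify.ssh" or "repair.usb"
    timestamp: int   # seconds since the epoch
    message: str     # trailing free text, possibly empty


# Assumed shape, based only on the two entries quoted in this comment:
#   [END ]<STATUS> ---- <operation> timestamp=<epoch> localtime=<Mon DD HH:MM:SS> [message]
ENTRY_RE = re.compile(
    r"^\s*(?P<end>END\s+)?(?P<status>[A-Z]+)\s+----\s+(?P<op>\S+)\s+"
    r"timestamp=(?P<ts>\d+)\s+"
    r"localtime=\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s*(?P<msg>.*)$"
)


def parse_entry(line: str) -> Optional[StatusEntry]:
    """Return the parsed entry, or None if the line has a different shape."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    return StatusEntry(
        is_end=match.group("end") is not None,
        status=match.group("status"),
        operation=match.group("op"),
        timestamp=int(match.group("ts")),
        message=match.group("msg").strip(),
    )


failed = parse_entry(
    "FAIL ---- verify.ssh timestamp=1487096223 localtime=Feb 14 10:17:03 "
    "No answer to ping from chromeos2-row8-rack8-host3"
)
repaired = parse_entry(
    "END GOOD ---- repair.usb timestamp=1487097373 localtime=Feb 14 10:36:13"
)
# Seconds between the failed ssh check and the end of the USB repair.
elapsed = repaired.timestamp - failed.timestamp
```

For the two entries above the difference is 1150 seconds, which agrees with the quoted local times (10:17:03 to 10:36:13).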
,
Feb 14 2017
I logged into the DUT, and poked around in /var/log/messages. I found this:
2017-02-14T04:15:39.410586-08:00 ERR kernel: [ 64.731587] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-14T08:16:36.921966-08:00 ERR kernel: [14529.584008] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-14T08:26:18.338578-08:00 ERR kernel: [15111.932061] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
Looks like trouble with the eMMC.
I've removed the DUT from the CQ pool:
Balancing kevin cq pool:
Total 24 DUTs, 22 working, 2 broken, 0 reserved.
Target is 24 working DUTs; grow pool by 2 DUTs.
kevin cq pool has 5 spares available.
kevin cq pool will return 2 broken DUTs, leaving 0 still in the pool.
Transferring 2 DUTs from cq to suites.
Updating host: chromeos2-row8-rack9-host14.
Removing labels ['pool:cq'] from host chromeos2-row8-rack9-host14
Adding labels ['pool:suites'] to host chromeos2-row8-rack9-host14
Updating host: chromeos2-row8-rack8-host3.
Removing labels ['pool:cq'] from host chromeos2-row8-rack8-host3
Adding labels ['pool:suites'] to host chromeos2-row8-rack8-host3
Transferring 2 DUTs from suites to cq.
Updating host: chromeos2-row8-rack9-host1.
Removing labels ['pool:suites'] from host chromeos2-row8-rack9-host1
Adding labels ['pool:cq'] to host chromeos2-row8-rack9-host1
Updating host: chromeos2-row8-rack9-host6.
Removing labels ['pool:suites'] from host chromeos2-row8-rack9-host6
Adding labels ['pool:cq'] to host chromeos2-row8-rack9-host6
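The same check could be run across several DUTs by scanning a copy of /var/log/messages for the kernel message quoted in this comment. The sketch below assumes the message text is exactly as quoted, and `count_io_errors` is a hypothetical helper name. It only counts matching lines per block device; whether those errors really mean the eMMC is failing is this thread's hypothesis, not something the script can confirm.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the kernel message quoted above, capturing device name and sector.
IO_ERROR_RE = re.compile(
    r"blk_update_request: I/O error, dev ([^,\s]+), sector (\d+)"
)


def count_io_errors(lines: Iterable[str]) -> Counter:
    """Count block-layer I/O error lines per device name."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Usage sketch: the path is an assumption about where the log was copied to.
# with open("messages") as log:
#     print(count_io_errors(log))
```

Keying the counter by device name keeps errors on the `mmcblk0rpmb` device separate from any errors on other block devices that may appear in the same log.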
,
Feb 14 2017
Up to the deputy to follow up on getting these looked at/replaced.
,
Feb 15 2017
Both DUTs were repaired successfully:
dut-status chromeos2-row8-rack8-host3 -g
DEBUG:root:API client for gmail disabled. No module named anyjson
chromeos2-row8-rack8-host3
2017-01-26 03:27:56 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59679127-repair/
2017-01-26 03:21:34 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59679045-verify/
2017-01-26 00:11:01 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59677040-repair/
dut-status chromeos2-row8-rack9-host14 -g
DEBUG:root:API client for gmail disabled. No module named anyjson
chromeos2-row8-rack9-host14
2016-12-22 16:26:58 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/59243828-repair/
2016-12-22 16:26:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/59243821-verify/
2016-12-22 12:30:37 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/59242697-repair/
I will keep tracking this issue.
,
Feb 15 2017
Both DUTs are locked; that's why they got balanced out. It wasn't because they failed their last repair.
For chromeos2-row8-rack8-host3, see c#6: The eMMC storage is suspect.
For chromeos2-row8-rack9-host14, I checked logs, and saw this:
2017-02-13T14:02:09.610634+00:00 ERR kernel: [ 66.794929] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
So, both DUTs have a similar symptom, and the eMMC storage may have failed.
,
Feb 15 2017
Comment 1 by akes...@chromium.org, Feb 14 2017
Summary: chromeos2-row8-rack8-host3 provision fail looping | ssh connecting timing out (was: chromeos2-row8-rack8-host3 provision fail looping)