
Issue 692172


Issue metadata

Status: Duplicate
Merged: issue 692342
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocking:
issue 692179




kevin DUTs provision fail looping | ssh connecting timing out

Project Member Reported by akes...@chromium.org, Feb 14 2017

Issue description

Cc: ejcaruso@chromium.org shuqianz@chromium.org
Summary: chromeos2-row8-rack8-host3 provision fail looping | ssh connecting timing out (was: chromeos2-row8-rack8-host3 provision fail looping)
Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row8-rack8-host3: SSHConnectionError: ssh: connect to host chromeos2-row8-rack8-host3 port 22: Connection timed out
Cc: haoweiw@chromium.org xixuan@chromium.org jrbarnette@chromium.org
+jrbarnette maybe the Repair verifier should also check that the DUT is ssh-able from a devserver?
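
A minimal sketch of what such an ssh-ability check could look like, assuming it only needs to confirm that a TCP connection to port 22 opens from the machine it runs on (e.g. a devserver). The function name `ssh_port_reachable` is hypothetical and is not part of the existing Repair verifier; it does not authenticate or run a command on the DUT.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it only checks that the port accepts connections,
    not that an ssh login would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


if __name__ == "__main__":
    print(ssh_port_reachable("chromeos2-row8-rack8-host3"))
```

A timeout here would correspond to the "Connection timed out" seen in the SSHConnectionError above, but the check cannot tell a network fault from a DUT that is simply down.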

+haowei Is this ssh failure some other network configuration issue?
Blocking: 692179

Comment 4 by xixuan@chromium.org, Feb 14 2017

I don't think this is a network issue. The provision failure pattern is that after R58-9282.0.0-rc1/R58-9282.0.0-rc2/R58-9282.0.0-rc3 is installed in the rootfs partition, the host does not come back from the reboot.

Before this failure pattern started, the DUT was successfully provisioned to R58-9282.0.0-rc1 (http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59976036-provision/)

So I think this DUT itself has a problem. We should take it out of the CQ and swap in another one, then investigate the problem or, in the worst case, ask Englab for a repair.

We need someone who knows how to investigate why this host cannot reboot after a rootfs update. The update_engine.log shows:

[0214/075639:INFO:update_attempter.cc(1201)] Marking booted slot as good.
[0214/075641:INFO:subprocess.cc(156)] Subprocess output:
Starting Google_Kevin firmware updater v5 (bootok)...
 - Updater package: [Google_Kevin.8785.135.0 / EC:kevin_v1.10.137-b44884d]
 - Current system:  [RO:Google_Kevin.8785.135.0 , ACT:Google_Kevin.8785.135.0 / EC:kevin_v1.10.137-b44884d]
 - Write protection: Hardware: off, Software: Main=off
 Firmware update (bootok) completed.

[0214/085554:WARNING:libpolicy.cc(36)] Could not load the device policy file.
[0214/085554:INFO:real_device_policy_provider.cc(164)] No device policies/settings present.

Is this logging normal? Is it related to the 'cannot reboot' error?

Here's recent history for the DUT:
chromeos2-row8-rack8-host3
    2017-02-14 10:04:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59984564-repair/
    2017-02-14 09:25:39  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59983986-provision/
    2017-02-14 07:25:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982640-repair/
    2017-02-14 06:45:58  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982095-provision/
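
The provision/repair alternation in the history above can be spotted mechanically. This is a rough sketch, assuming the history is listed newest first and that every task line has the form "date time status URL" with the URL ending in "<job id>-<task>/", as in the lines above. The function names are hypothetical and not part of dut-status.

```python
import re

# One history line: timestamp, status column, then a URL ending in "<job>-<task>/".
HISTORY_LINE = re.compile(
    r"^\s*(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<status>\S+)\s+"
    r"\S*/(?P<job>\d+)-(?P<task>[a-z]+)/\s*$"
)


def parse_history(text):
    """Return (time, status, task) tuples for every line that looks like a task."""
    entries = []
    for line in text.splitlines():
        match = HISTORY_LINE.match(line)
        if match:
            entries.append((match["time"], match["status"], match["task"]))
    return entries


def count_provision_repair_cycles(entries):
    """Count provision tasks that were immediately followed by a repair task.

    `entries` is newest first, so the repair appears just before its provision.
    """
    cycles = 0
    for newer, older in zip(entries, entries[1:]):
        if newer[2] == "repair" and older[2] == "provision":
            cycles += 1
    return cycles


history = """chromeos2-row8-rack8-host3
    2017-02-14 10:04:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59984564-repair/
    2017-02-14 09:25:39  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59983986-provision/
    2017-02-14 07:25:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982640-repair/
    2017-02-14 06:45:58  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59982095-provision/
"""
print(count_provision_repair_cycles(parse_history(history)))  # prints 2
```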

I've attached the status.log from the last failed repair.

Here's the important part:
	FAIL	----	verify.ssh	timestamp=1487096223	localtime=Feb 14 10:17:03	No answer to ping from chromeos2-row8-rack8-host3

The thing that repaired it was this step:
	END GOOD	----	repair.usb	timestamp=1487097373	localtime=Feb 14 10:36:13	

So, the DUT went offline.  We couldn't bring it back up until
we re-installed from USB.

That's not a network problem.  It's also not a problem with
devserver access to the DUT.  The problem is that the DUT
crashed after the build was provisioned.
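
For anyone who wants to pull the same two lines out of a status.log, here is a rough sketch. It assumes the fields are tab-separated in the order shown in the excerpts above (status, subdir, operation, then timestamp=, localtime=, and an optional reason). The function names are hypothetical and are not the autotest parser.

```python
def parse_status_line(line):
    """Split one tab-separated status.log line into a dict, or return None."""
    fields = line.rstrip("\n").lstrip("\t").split("\t")
    if len(fields) < 3:
        return None
    entry = {"status": fields[0], "operation": fields[2], "reason": ""}
    for field in fields[3:]:
        key, sep, value = field.partition("=")
        if sep and key == "timestamp":
            entry["timestamp"] = int(value)
        elif sep and key == "localtime":
            entry["localtime"] = value
        elif field:
            # Anything else that is non-empty is treated as the reason text.
            entry["reason"] = field
    return entry


def seconds_until_repair(lines):
    """Seconds from the first FAIL to the first later END GOOD repair.* step."""
    failed_at = None
    for line in lines:
        entry = parse_status_line(line)
        if entry is None or "timestamp" not in entry:
            continue
        if failed_at is None and entry["status"] == "FAIL":
            failed_at = entry["timestamp"]
        elif (failed_at is not None and entry["status"] == "END GOOD"
              and entry["operation"].startswith("repair.")):
            return entry["timestamp"] - failed_at
    return None


excerpt = [
    "\tFAIL\t----\tverify.ssh\ttimestamp=1487096223\tlocaltime=Feb 14 10:17:03"
    "\tNo answer to ping from chromeos2-row8-rack8-host3",
    "\tEND GOOD\t----\trepair.usb\ttimestamp=1487097373\tlocaltime=Feb 14 10:36:13\t",
]
print(seconds_until_repair(excerpt))  # prints 1150
```

On the two lines quoted above this gives 1150 seconds, i.e. about 19 minutes between the failed verify.ssh and the successful repair.usb.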

Attachment: status.log (3.5 KB)
I logged into the DUT, and poked around in /var/log/messages.
I found this:

2017-02-14T04:15:39.410586-08:00 ERR kernel: [   64.731587] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-14T08:16:36.921966-08:00 ERR kernel: [14529.584008] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-14T08:26:18.338578-08:00 ERR kernel: [15111.932061] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0

Looks like trouble with the eMMC.
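
To repeat this check on other DUTs, a small sketch along these lines could count the errors per device. It assumes the kernel message format is exactly as in the lines above; `count_io_errors` is a hypothetical helper, and a count by itself does not prove that the eMMC is failing.

```python
import re
from collections import Counter

# Matches the kernel message format seen in /var/log/messages above.
IO_ERROR = re.compile(
    r"blk_update_request: I/O error, dev (?P<dev>[^\s,]+), sector (?P<sector>\d+)"
)


def count_io_errors(lines):
    """Return a Counter mapping block device name to number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match["dev"]] += 1
    return counts


sample = [
    "2017-02-14T04:15:39.410586-08:00 ERR kernel: [   64.731587] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0",
    "2017-02-14T08:16:36.921966-08:00 ERR kernel: [14529.584008] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0",
    "2017-02-14T08:26:18.338578-08:00 ERR kernel: [15111.932061] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0",
]
print(dict(count_io_errors(sample)))  # prints {'mmcblk0rpmb': 3}
```

On a DUT, the same function could be fed an open file, e.g. `count_io_errors(open("/var/log/messages"))`.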

I've removed the DUT from the CQ pool:
Balancing kevin cq pool:
Total 24 DUTs, 22 working, 2 broken, 0 reserved.
Target is 24 working DUTs; grow pool by 2 DUTs.
kevin cq pool has 5 spares available.
kevin cq pool will return 2 broken DUTs, leaving 0 still in the pool.
Transferring 2 DUTs from cq to suites.
Updating host: chromeos2-row8-rack9-host14.
Removing labels ['pool:cq'] from host chromeos2-row8-rack9-host14
Adding labels ['pool:suites'] to host chromeos2-row8-rack9-host14
Updating host: chromeos2-row8-rack8-host3.
Removing labels ['pool:cq'] from host chromeos2-row8-rack8-host3
Adding labels ['pool:suites'] to host chromeos2-row8-rack8-host3
Transferring 2 DUTs from suites to cq.
Updating host: chromeos2-row8-rack9-host1.
Removing labels ['pool:suites'] from host chromeos2-row8-rack9-host1
Adding labels ['pool:cq'] to host chromeos2-row8-rack9-host1
Updating host: chromeos2-row8-rack9-host6.
Removing labels ['pool:suites'] from host chromeos2-row8-rack9-host6
Adding labels ['pool:cq'] to host chromeos2-row8-rack9-host6

Owner: shuqianz@chromium.org
Status: Assigned (was: Untriaged)
Up to the deputy to follow up on getting these looked at/replaced.

Both DUTs were repaired successfully:
dut-status  chromeos2-row8-rack8-host3 -g
DEBUG:root:API client for gmail disabled. No module named anyjson
chromeos2-row8-rack8-host3
    2017-01-26 03:27:56  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59679127-repair/
    2017-01-26 03:21:34  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59679045-verify/
    2017-01-26 00:11:01  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host3/59677040-repair/

dut-status chromeos2-row8-rack9-host14 -g
DEBUG:root:API client for gmail disabled. No module named anyjson
chromeos2-row8-rack9-host14
    2016-12-22 16:26:58  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/59243828-repair/
    2016-12-22 16:26:03  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/59243821-verify/
    2016-12-22 12:30:37  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/59242697-repair/

I will keep tracking this issue.
Both DUTs are locked; that's why they got balanced out.  It wasn't
because they failed their last repair.

For chromeos2-row8-rack8-host3, see c#6:  The eMMC storage is suspect.

For chromeos2-row8-rack9-host14, I checked logs, and saw this:
2017-02-13T14:02:09.610634+00:00 ERR kernel: [   66.794929] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0

So, both DUTs have a similar symptom, and the eMMC storage may have failed.

Mergedinto: 692342
Status: Duplicate (was: Assigned)
Summary: kevin DUTs provision fail looping | ssh connecting timing out (was: chromeos2-row8-rack8-host3 provision fail looping | ssh connecting timing out)
