Status: Verified
Closed: Mar 2017
Pri: 1
Type: Bug

Blocking: issue 692179

kevin DUTs provision fail looping | ssh connecting timing out
Project Member Reported by shuqianz@chromium.org, Feb 15 2017
http://cautotest/afe/#tab_id=view_host&object_id=5741

This Kevin DUT has failed several tests in a row, all because it failed to connect:

Example log:
START	----	provision	timestamp=1487104962	localtime=Feb 14 12:42:42	
	GOOD	----	verify.ssh	timestamp=1487104966	localtime=Feb 14 12:42:46	
	GOOD	----	verify.update	timestamp=1487104969	localtime=Feb 14 12:42:49	
	GOOD	----	verify.brd_config	timestamp=1487104970	localtime=Feb 14 12:42:50	
	GOOD	----	verify.ser_config	timestamp=1487104970	localtime=Feb 14 12:42:50	
	GOOD	----	verify.job	timestamp=1487104971	localtime=Feb 14 12:42:51	
	GOOD	----	verify.servod	timestamp=1487104975	localtime=Feb 14 12:42:55	
	GOOD	----	verify.pwr_button	timestamp=1487104975	localtime=Feb 14 12:42:55	
	GOOD	----	verify.lid_open	timestamp=1487104975	localtime=Feb 14 12:42:55	
	GOOD	----	verify.PASS	timestamp=1487104975	localtime=Feb 14 12:42:55	
	START	provision_AutoUpdate	provision_AutoUpdate	timestamp=1487104976	localtime=Feb 14 12:42:56	
		START	----	----	timestamp=1487105134	localtime=Feb 14 12:45:34	
			GOOD	----	sysinfo.before	timestamp=1487105134	localtime=Feb 14 12:45:34	
		END GOOD	----	----	timestamp=1487105134	localtime=Feb 14 12:45:34	
		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1487105304	localtime=Feb 14 12:48:24	Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row8-rack9-host1: SSHConnectionError: Connection to 100.115.231.42 timed out while waiting to read
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 111, in run_once
      force_full_update=force)
    File "/usr/local/autotest/server/afe_utils.py", line 254, in machine_install_and_update_labels
      *args, **dargs)
    File "/usr/local/autotest/server/hosts/cros_host.py", line 729, in machine_install_by_devserver
      full_update=force_full_update)
    File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2018, in auto_update
      raise DevServerException(error_msg % (host_name, error_list[0]))
  DevServerException: CrOS auto-update failed for host chromeos2-row8-rack9-host1: SSHConnectionError: Connection to 100.115.231.42 timed out while waiting to read
	END FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1487105304	localtime=Feb 14 12:48:24	
END FAIL	----	provision	timestamp=1487105304	localtime=Feb 14 12:48:24	
INFO	----	----	timestamp=1487105304	job_abort_reason=	localtime=Feb 14 12:48:24	

Why is this DUT so flaky? Is this similar to the SSHConnection issue that happened on PFQ? xixuan@, can you take a look?
 
Comment 1 by xixuan@chromium.org, Feb 15 2017
Cc: haoweiw@chromium.org
Owner: shuqianz@chromium.org
Different from PFQ's retry problem. 

Host chromeos2-row8-rack9-host1 doesn't start any provision process. It's pingable by the devserver (100.115.245.200), but is not sshable. Retrying helps very little in this case, since the host cannot be reached over SSH.
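(As a quick reproduction of the "pingable but not sshable" state, something like the following could be run from the devserver; these are hypothetical probe commands, not autotest output:)

    $ ping -c 3 chromeos2-row8-rack9-host1           # ICMP works
    $ nc -zv -w 5 chromeos2-row8-rack9-host1 22      # raw TCP connect to port 22 also works
    $ timeout 30 ssh -o BatchMode=yes -o ConnectTimeout=10 \
        root@chromeos2-row8-rack9-host1 true         # the SSH session itself times out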

If you look into autoupdate_logs/CrOS_Update_****.log, you will find that the devserver (100.115.245.200) cannot run the following command:

ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpAIV5AY/testing_rsa root@chromeos2-row8-rack9-host1 -- mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update '&&' mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update'

I manually logged into the devserver (100.115.245.200) and ran this command; it still fails:

OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /home/chromeos-test/.ssh/config
debug1: /home/chromeos-test/.ssh/config line 2: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to chromeos2-row8-rack9-host1 [100.115.231.42] port 22.
debug1: fd 3 clearing O_NONBLOCK
debug1: Connection established.
debug1: identity file ./images/asuka-release/R58-9279.0.0/telemetry_src/src/third_party/catapult/telemetry/third_party/chromite/ssh_keys/testing_rsa type 1
debug1: identity file ./images/asuka-release/R58-9279.0.0/telemetry_src/src/third_party/catapult/telemetry/third_party/chromite/ssh_keys/testing_rsa-cert type -1
debug1: identity file /home/chromeos-test/.ssh/testing_rsa type 1
debug1: identity file /home/chromeos-test/.ssh/testing_rsa-cert type -1
debug1: identity file /home/chromeos-test/.ssh/chromium type 1
debug1: identity file /home/chromeos-test/.ssh/chromium-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.3
debug1: match: OpenSSH_7.3 pat OpenSSH* compat 0x04000000
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-sha1-etm@openssh.com none
debug1: kex: client->server aes128-ctr hmac-sha1-etm@openssh.com none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Connection to 100.115.231.42 timed out while waiting to read
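Note what the log shows: the TCP connection is established and the remote banner (OpenSSH_7.3) is received; the hang is at key exchange, waiting for SSH2_MSG_KEX_ECDH_REPLY. So a raw banner probe (a bash sketch, not something from the log above) would likely still succeed here; only the full handshake stalls:

    $ timeout 5 bash -c 'exec 3<>/dev/tcp/100.115.231.42/22 && head -n1 <&3'
    SSH-2.0-OpenSSH_7.3        # expected, since banner exchange completes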

So I'm reassigning to the deputy to handle this network problem. +haoweiw@ for more info on why devserver 100.115.245.200 cannot ssh to host 100.115.231.42.

PS: I see that in the latest provision, the drone still had some flakiness starting an SSH connection to this host. I don't know whether it's related to this bug.

02/14 15:43:47.216 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_4aNR4Pssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row8-rack9-host1'
02/14 15:43:47.217 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_4aNR4Pssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row8-rack9-host1'
02/14 15:43:52.337 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
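(Side note: an existing ControlMaster socket can be queried directly with OpenSSH's -O option; a hypothetical check, reusing the ControlPath from the log above:)

    $ ssh -O check -o ControlPath=/tmp/_autotmp_4aNR4Pssh-master/socket \
        root@chromeos2-row8-rack9-host1
    # prints "Master running (pid=...)" if the master connection is alive,
    # and errors out if the socket is dead or was never established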


Comment 2 by xixuan@chromium.org, Feb 15 2017
I locked this DUT and replaced it with chromeos2-row8-rack9-host5.

Removing labels ['pool:cq'] from host chromeos2-row8-rack9-host1
Adding labels ['pool:suites'] to host chromeos2-row8-rack9-host1
Transferring 1 DUTs from suites to cq.
Updating host: chromeos2-row8-rack9-host5.
Removing labels ['pool:suites'] from host chromeos2-row8-rack9-host5
Adding labels ['pool:cq'] to host chromeos2-row8-rack9-host5


Blocking: 692179
Cc: adurbin@chromium.org
Labels: -Pri-2 Pri-0
Owner: haoweiw@chromium.org
https://buganizer.corp.google.com/issues/35385505
haoweiw@, can you check whether this DUT is pingable but not sshable? The kevin paladin has been failing for two days due to various DUT issues. I have requested that two broken DUTs be pulled out of the lab and sent to the partner team to check whether this is a hardware issue.
https://buganizer.corp.google.com/issues/35363835
The repair history says that the DUT is fine; it's reachable
by both 'ping' and 'ssh'.  A spot check of the DUT's history
seems to show a pattern of "Provision fails, Repair finds
nothing wrong".  Here's the latest repair log:

START	----	repair	timestamp=1487164702	localtime=Feb 15 05:18:22	
	GOOD	----	verify.ssh	timestamp=1487164704	localtime=Feb 15 05:18:24	
	GOOD	----	verify.brd_config	timestamp=1487164704	localtime=Feb 15 05:18:24	
	GOOD	----	verify.ser_config	timestamp=1487164705	localtime=Feb 15 05:18:25	
	GOOD	----	verify.job	timestamp=1487164706	localtime=Feb 15 05:18:26	
	GOOD	----	verify.servod	timestamp=1487164712	localtime=Feb 15 05:18:32	
	GOOD	----	verify.pwr_button	timestamp=1487164712	localtime=Feb 15 05:18:32	
	GOOD	----	verify.lid_open	timestamp=1487164712	localtime=Feb 15 05:18:32	
	GOOD	----	verify.update	timestamp=1487164716	localtime=Feb 15 05:18:36	
	GOOD	----	verify.PASS	timestamp=1487164716	localtime=Feb 15 05:18:36	
	GOOD	----	verify.ssh	timestamp=1487164740	localtime=Feb 15 05:19:00	
	GOOD	----	verify.power	timestamp=1487164741	localtime=Feb 15 05:19:01	
	GOOD	----	verify.fwstatus	timestamp=1487164741	localtime=Feb 15 05:19:01	
	GOOD	----	verify.good_au	timestamp=1487164741	localtime=Feb 15 05:19:01	
	GOOD	----	verify.writable	timestamp=1487164742	localtime=Feb 15 05:19:02	
	GOOD	----	verify.tpm	timestamp=1487164743	localtime=Feb 15 05:19:03	
	GOOD	----	verify.ext4	timestamp=1487164744	localtime=Feb 15 05:19:04	
	GOOD	----	verify.rwfw	timestamp=1487164744	localtime=Feb 15 05:19:04	
	GOOD	----	verify.python	timestamp=1487164744	localtime=Feb 15 05:19:04	
	GOOD	----	verify.cros	timestamp=1487164750	localtime=Feb 15 05:19:10	
	GOOD	----	verify.PASS	timestamp=1487164750	localtime=Feb 15 05:19:10	
END GOOD	----	repair	timestamp=1487164750	localtime=Feb 15 05:19:10	chromeos2-row8-rack9-host1 repaired successfully

I logged into the DUT, and checked /var/log/messages.  I found
some familiar log messages:

2017-02-09T09:31:42.301442-08:00 ERR kernel: [   55.701874] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-09T11:10:40.569860-08:00 ERR kernel: [ 5993.970088] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-09T12:31:40.311749-08:00 ERR kernel: [10853.711992] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
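The same signature is easy to spot-check on other hosts (a minimal sketch, assuming the standard /var/log/messages location):

    $ for h in chromeos2-row8-rack8-host3 chromeos2-row8-rack9-host14; do \
          ssh root@$h "grep -c 'I/O error, dev mmcblk0rpmb' /var/log/messages"; done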

I'm beginning to suspect that this isn't hardware; it could be
software.  We need to get a kevin expert in to help us.

Cc: jachuang@chromium.org
Owner: alexpau@chromium.org
alexpau@, jachuang@, could you please take a look?
In the meantime, can we mark kevin as experimental?
I just logged into a device that's showing no signs of being
bad, and I found the same peculiar messages.

These messages seem more likely to be caused by software.  We need
to get this looked at.  Note that I'm not convinced that these
provision problems and the mysterious I/O errors are necessarily related.

As for taking kevin out of CQ rotation:  That could be reasonable,
but if there's a new bug in kevin, we're only pushing the problem
someplace else.

Status: Assigned
Hmm, OK. I've pinged alexpau@; waiting for him to take a look.
Cc: gwendal@chromium.org
Gwendal, would you be so kind to comment on those I/O errors?  Thanks!
Cc: xixuan@chromium.org
 Issue 692172  has been merged into this issue.
Summary: kevin DUTs provision fail looping | ssh connecting timing out (was: chromeos2-row8-rack9-host1 always failed to connect)
Cc: mnissler@chromium.org mqg@chromium.org
 Issue 692179  has been merged into this issue.
Here is a list of the problematic DUTs:
chromeos2-row8-rack8-host3
chromeos2-row8-rack9-host14
chromeos2-row8-rack9-host1 
All of these DUTs have either been removed from pool:cq or locked. However, this has not fixed the issue; new DUTs in pool:cq hit similar failures. So I don't think the DUTs themselves are causing the failure. Like Richard said, maybe there is a software issue.
Cc: moch@chromium.org alexpau@chromium.org
Owner: philipchen@chromium.org
+philip/moch, would you please help debug?

Can someone clarify the current status of the issue? It appears that the problem started with the devserver being unable to ssh into the DUT, but later it was confirmed that ssh into the DUT is not an issue. Is this a test infra issue or a DUT issue?

With which release did this issue start to occur?
This issue started happening with R58-9280.0.0-rc2.
The main problem is that all three hosts have kernel error logs like:

[10853.711992] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0

Richard thinks it's likely a software problem. The error has been occurring since https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/196.


The failure modes of these three hosts are slightly different.

For host1, devserver 100.115.245.200 cannot ssh to chromeos2-row8-rack9-host1. I saw a similar thing happen between host14 and devserver 100.115.245.198 ( crbug.com/691729 ). It involves network connections, which previously misled me into thinking it was a network issue. But now I think it may be another symptom caused by the kernel error.

For host3 and host14, they cannot get back online after provision; see  crbug.com/692172  for details.


There's additional discussion going on at go/crosoncall.

Key points:
  * At least one of the problem DUTs was seen to have failed
    with a complaint about entering recovery because it couldn't
    find its boot storage.  We don't know when that happened.
  * There are some known problems relating to setting voltage
    levels that can cause that kind of problem.  The fixes
    include updating to the .149 firmware.

The .149 firmware is available at ToT, and on the tip of the R57
and R56 branches.  I've updated kevin thus:
    $ atest stable_version modify -b kevin -i R57-9202.28.0
    Stable version for board kevin is changed from R58-9221.0.0 to R57-9202.28.0.
    $ atest stable_version modify -b kevin/rwfw -i Google_Kevin.8785.149.0
    Stable version for board kevin/rwfw is changed from Google_Kevin.8785.135.0 to Google_Kevin.8785.149.0.

I kicked off verify, so all kevin DUTs will start updating.  I
expect the updates will be transparent to all testing.  By now,
most DUTs should already have updated.
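
(To spot-check that an individual DUT actually picked up the new firmware, crossystem can be queried over ssh; a hedged example, not part of the verify run above:)

    $ ssh root@chromeos2-row8-rack9-host1 crossystem fwid
    Google_Kevin.8785.149.0    # expected once the RW firmware update has landed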


FYI, the

  blk_update_request: I/O error, dev mmcblk0rpmb, sector 0

message is nothing important.  It happens on 100% of boots on 100% of kevins and always has.  Filed <https://code.google.com/p/chrome-os-partner/issues/detail?id=62955> to get rid of this confusing message.
Cc: englab-sys-cros@google.com
Now that we've updated the firmware, I've kicked off a test on
the two DUTs that were kicked out of the CQ:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=101516502

If that test passes, I think we'll be able to declare that
the firmware update fixed the failures.

...or, possibly, the bump to the repair build.  From comlink the old repair build was:

  R58-9221.0.0

The kernel on that build was lacking DDRFreq fixes and thus you could get random memory failures while running on that build.  If you used that build to flash a new OS image it's plausible that you would flash something bad.
Re #23

The test running on chromeos2-row8-rack8-host3 seems fine.
However, the other one, running on chromeos2-row8-rack9-host14, still failed because of an un-sshable DUT:

SSHConnectionError: ssh: connect to host chromeos2-row8-rack9-host14 port 22: Connection timed out
I kicked off the same test again, but the results are the same:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=102871564

That is, chromeos2-row8-rack8-host3 passed, but chromeos2-row8-rack9-host14 failed. 

I ssh'ed to both devices and confirmed that the FW and OS versions are identical:
(fwid: Google_Kevin.8785.149.0 / OS: R58-9306.0.0)

It looks more like a HW/infra issue?
I checked the DUT's history for the past 24 hours.
The DUT passes provisioning more often than it fails:

    2017-02-23 01:11:25  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60143780-provision/
    2017-02-22 23:52:23  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60142923-provision/
    2017-02-22 22:33:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60142147-provision/
    2017-02-22 21:31:00  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60141618-provision/
    2017-02-22 21:08:21  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60141444-provision/
    2017-02-22 15:06:53  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60137694-provision/
    2017-02-22 13:22:52  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60136766-provision/
    2017-02-22 11:49:01  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60135523-provision/
    2017-02-22 10:46:29  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60134774-provision/
    2017-02-22 10:00:05  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60134136-provision/

I can't rule out a problem with the DUT, but we should also consider
the possibility that the paladin build we're testing has some
problem.  Also, I note that dianders@ indicated the possibility of
a long tail of devices that will still have memory problems, even
with all current fixes applied.  It's possible that chromeos2-row8-rack9-host14
is one of the long tail.

I note also  bug 392289 , and the fact that a substantial
number of kevin DUTs are in dev mode.  In particular,
chromeos2-row8-rack9-host14 was in dev mode, but chromeos2-row8-rack8-host3
was not.

I've pushed -host14 to normal mode; I'm re-running the
same Paladin build, to see if it changes behavior now:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=103011296
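
(For reference, dev mode can be checked, and cleared on the next reboot, from the DUT itself; a sketch assuming standard crossystem behavior, since the exact procedure used on -host14 isn't recorded here:)

    $ ssh root@chromeos2-row8-rack9-host14 crossystem mainfw_type
    developer                  # reads "normal" once the switch back is complete
    $ ssh root@chromeos2-row8-rack9-host14 'crossystem disable_dev_request=1 && reboot'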
Cc: jwerner@chromium.org
I can only skim since I'm trying to pay attention to a conference presentation, but if the device sometimes fails to boot, it's possible we have one of jwerner's favorite eMMC problems.

See http://crosbug.com/p/63025

New firmware is being qualified as we speak to fix this.
chromeos2-row8-rack9-host14 has now passed.  This suggests
that dev mode may have been the problem.

I agree with #30.
BTW, to circumvent the issue that #29 pointed out, I filed a firmware update request: https://bugs.chromium.org/p/chromium/issues/detail?id=695702
Project Member Comment 32 by sheriffbot@chromium.org, Mar 1 2017
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Pri-0 Pri-1
Comment 34 Deleted
Since both machines (chromeos2-row8-rack8-host3 & chromeos2-row8-rack9-host14) work now, can we put them back into the CQ and mark this bug as fixed?
Status: Verified
Kevin is no longer failing; the DUTs in question have been
in service more or less continuously for a while.
