kevin DUTs failing provision in a loop | ssh connections timing out
Issue description

http://cautotest/afe/#tab_id=view_host&object_id=5741

This Kevin DUT has failed several tests in a row, all because it failed to connect. Example log:

START ---- provision timestamp=1487104962 localtime=Feb 14 12:42:42
GOOD ---- verify.ssh timestamp=1487104966 localtime=Feb 14 12:42:46
GOOD ---- verify.update timestamp=1487104969 localtime=Feb 14 12:42:49
GOOD ---- verify.brd_config timestamp=1487104970 localtime=Feb 14 12:42:50
GOOD ---- verify.ser_config timestamp=1487104970 localtime=Feb 14 12:42:50
GOOD ---- verify.job timestamp=1487104971 localtime=Feb 14 12:42:51
GOOD ---- verify.servod timestamp=1487104975 localtime=Feb 14 12:42:55
GOOD ---- verify.pwr_button timestamp=1487104975 localtime=Feb 14 12:42:55
GOOD ---- verify.lid_open timestamp=1487104975 localtime=Feb 14 12:42:55
GOOD ---- verify.PASS timestamp=1487104975 localtime=Feb 14 12:42:55
START provision_AutoUpdate provision_AutoUpdate timestamp=1487104976 localtime=Feb 14 12:42:56
  START ---- ---- timestamp=1487105134 localtime=Feb 14 12:45:34
    GOOD ---- sysinfo.before timestamp=1487105134 localtime=Feb 14 12:45:34
  END GOOD ---- ---- timestamp=1487105134 localtime=Feb 14 12:45:34
  FAIL provision_AutoUpdate provision_AutoUpdate timestamp=1487105304 localtime=Feb 14 12:48:24 Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row8-rack9-host1: SSHConnectionError: Connection to 100.115.231.42 timed out while waiting to read
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 111, in run_once
      force_full_update=force)
    File "/usr/local/autotest/server/afe_utils.py", line 254, in machine_install_and_update_labels
      *args, **dargs)
    File "/usr/local/autotest/server/hosts/cros_host.py", line 729, in machine_install_by_devserver
      full_update=force_full_update)
    File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2018, in auto_update
      raise DevServerException(error_msg % (host_name, error_list[0]))
  DevServerException: CrOS auto-update failed for host chromeos2-row8-rack9-host1: SSHConnectionError: Connection to 100.115.231.42 timed out while waiting to read
END FAIL provision_AutoUpdate provision_AutoUpdate timestamp=1487105304 localtime=Feb 14 12:48:24
END FAIL ---- provision timestamp=1487105304 localtime=Feb 14 12:48:24
INFO ---- ---- timestamp=1487105304 job_abort_reason= localtime=Feb 14 12:48:24

Why is this DUT so flaky? Is this similar to the SSHConnection issue that happened on the PFQ? xixuan@, can you take a look?
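As an aside, for anyone scanning many of these logs: a minimal sketch of pulling the failing entries out of a status.log-style dump like the one above. The whitespace tokenizing and file path here are assumptions, not the real autotest status parser.

#!/usr/bin/env python
# Minimal sketch: scan a saved autotest status.log-style file and print the
# FAIL/ABORT entries with their timestamps.  Assumes the whitespace-separated
# layout shown above; this is NOT the real autotest status parser.
import sys

def failing_entries(path):
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Entries look like: "FAIL <subdir> <test> timestamp=... localtime=... <reason>"
            if line.startswith(('FAIL', 'END FAIL', 'ABORT')):
                yield line

if __name__ == '__main__':
    for entry in failing_entries(sys.argv[1]):
        print(entry)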
,
Feb 15 2017
I locked this DUT and replaced it with chromeos2-row8-rack9-host5:

Removing labels ['pool:cq'] from host chromeos2-row8-rack9-host1
Adding labels ['pool:suites'] to host chromeos2-row8-rack9-host1
Transferring 1 DUTs from suites to cq.
Updating host: chromeos2-row8-rack9-host5.
Removing labels ['pool:suites'] from host chromeos2-row8-rack9-host5
Adding labels ['pool:cq'] to host chromeos2-row8-rack9-host5
,
Feb 15 2017
https://buganizer.corp.google.com/issues/35385505

haoweiw@, can you check whether this DUT is pingable but not sshable? Kevin paladin has been failing for two days due to different DUT issues. I have requested to pull two broken DUTs out of the lab and send them to the partner team to check whether this is a hardware issue. https://buganizer.corp.google.com/issues/35363835
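For reference, the pingable-but-not-sshable check can be scripted roughly as below. This is a rough sketch using plain ping/ssh with timeouts, not autotest's verify logic; the host list and root ssh access are assumptions.

#!/usr/bin/env python
# Rough sketch: report whether each host answers ping and whether ssh connects
# within a timeout.  Uses plain ping/ssh, not autotest's verify code.
import os
import subprocess

HOSTS = ['chromeos2-row8-rack9-host1']  # example host list
DEVNULL = open(os.devnull, 'w')

def pingable(host):
    return subprocess.call(['ping', '-c', '2', '-W', '2', host],
                           stdout=DEVNULL, stderr=DEVNULL) == 0

def sshable(host):
    cmd = ['ssh', '-o', 'ConnectTimeout=10', '-o', 'BatchMode=yes',
           'root@%s' % host, 'true']
    return subprocess.call(cmd, stdout=DEVNULL, stderr=DEVNULL) == 0

for host in HOSTS:
    print('%s ping=%s ssh=%s' % (host, pingable(host), sshable(host)))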
,
Feb 15 2017
The repair history says that the DUT is fine; it's reachable by both 'ping' and 'ssh'. A spot check of DUT history seems to show a pattern of "Provision fails, Repair finds nothing wrong". Here's the latest repair log:

START ---- repair timestamp=1487164702 localtime=Feb 15 05:18:22
GOOD ---- verify.ssh timestamp=1487164704 localtime=Feb 15 05:18:24
GOOD ---- verify.brd_config timestamp=1487164704 localtime=Feb 15 05:18:24
GOOD ---- verify.ser_config timestamp=1487164705 localtime=Feb 15 05:18:25
GOOD ---- verify.job timestamp=1487164706 localtime=Feb 15 05:18:26
GOOD ---- verify.servod timestamp=1487164712 localtime=Feb 15 05:18:32
GOOD ---- verify.pwr_button timestamp=1487164712 localtime=Feb 15 05:18:32
GOOD ---- verify.lid_open timestamp=1487164712 localtime=Feb 15 05:18:32
GOOD ---- verify.update timestamp=1487164716 localtime=Feb 15 05:18:36
GOOD ---- verify.PASS timestamp=1487164716 localtime=Feb 15 05:18:36
GOOD ---- verify.ssh timestamp=1487164740 localtime=Feb 15 05:19:00
GOOD ---- verify.power timestamp=1487164741 localtime=Feb 15 05:19:01
GOOD ---- verify.fwstatus timestamp=1487164741 localtime=Feb 15 05:19:01
GOOD ---- verify.good_au timestamp=1487164741 localtime=Feb 15 05:19:01
GOOD ---- verify.writable timestamp=1487164742 localtime=Feb 15 05:19:02
GOOD ---- verify.tpm timestamp=1487164743 localtime=Feb 15 05:19:03
GOOD ---- verify.ext4 timestamp=1487164744 localtime=Feb 15 05:19:04
GOOD ---- verify.rwfw timestamp=1487164744 localtime=Feb 15 05:19:04
GOOD ---- verify.python timestamp=1487164744 localtime=Feb 15 05:19:04
GOOD ---- verify.cros timestamp=1487164750 localtime=Feb 15 05:19:10
GOOD ---- verify.PASS timestamp=1487164750 localtime=Feb 15 05:19:10
END GOOD ---- repair timestamp=1487164750 localtime=Feb 15 05:19:10

chromeos2-row8-rack9-host1 repaired successfully
,
Feb 15 2017
I logged into the DUT and checked /var/log/messages. I found some familiar log messages:

2017-02-09T09:31:42.301442-08:00 ERR kernel: [   55.701874] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-09T11:10:40.569860-08:00 ERR kernel: [ 5993.970088] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0
2017-02-09T12:31:40.311749-08:00 ERR kernel: [10853.711992] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0

I'm beginning to suspect that this isn't hardware; it could be software. We need to get a kevin expert in to help us.
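In case it helps to survey the other kevins, here is a small sketch that greps /var/log/messages for the same error over ssh. The host list is just an example and passwordless root ssh from wherever this runs is assumed.

#!/usr/bin/env python
# Sketch: count occurrences of the mmcblk0rpmb I/O error in each DUT's
# /var/log/messages.  Assumes passwordless root ssh; adjust HOSTS as needed.
import subprocess

HOSTS = ['chromeos2-row8-rack8-host3',
         'chromeos2-row8-rack9-host14',
         'chromeos2-row8-rack9-host1']

PATTERN = 'blk_update_request: I/O error, dev mmcblk0rpmb'

for host in HOSTS:
    cmd = ['ssh', '-o', 'ConnectTimeout=10', 'root@%s' % host,
           'grep -c "%s" /var/log/messages || true' % PATTERN]
    try:
        count = subprocess.check_output(cmd).strip()
    except subprocess.CalledProcessError:
        count = 'ssh failed'
    print('%s: %s occurrences' % (host, count))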
,
Feb 15 2017
alexpau@, jachuang@, could you please take a look? In the meantime, can we mark kevin as experimental?
,
Feb 15 2017
I just logged into a device that's showing no signs of being bad, and I found the same peculiar messages. These messages seem more likely to be caused by software. We need to get this looked at. Note that I'm not convinced that these provision problems and the mysterious I/O errors are necessarily related. As for taking kevin out of CQ rotation: that could be reasonable, but if there's a new bug in kevin, we're only pushing the problem someplace else.
,
Feb 15 2017
Hmm, ok. I've pinged alexpau@; waiting for him to take a look.
,
Feb 15 2017
Gwendal, would you be so kind as to comment on those I/O errors? Thanks!
,
Feb 15 2017
Here is a list of the problematic DUTs:
chromeos2-row8-rack8-host3
chromeos2-row8-rack9-host14
chromeos2-row8-rack9-host1
,
Feb 15 2017
All of these DUTs have either been removed from pool:cq or locked. However, this does not fix the issue; a new DUT in pool:cq will hit the same problem. So I think it is not the DUTs themselves causing the failure. As Richard said, maybe there is a software issue.
,
Feb 15 2017
+philip/moch - would you please help with debugging? Can someone clarify the current status of the issue? It appears that the problem started with the dev server being unable to ssh into the DUT, but it was later confirmed that ssh into the DUT is not an issue. Is this a test infra issue or a DUT issue? Since which release has this issue been occurring?
,
Feb 15 2017
This issue started happening with R58-9280.0.0-rc2.
,
Feb 15 2017
The main problem is that all 3 hosts have error logs like:

ERR kernel: [10853.711992] blk_update_request: I/O error, dev mmcblk0rpmb, sector 0

Richard thinks it's likely a software problem. The error has been happening since https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/196. The behavior of the 3 hosts differs slightly:
* For host1, devserver 100.115.245.200 cannot ssh to chromeos2-row8-rack9-host1. I also see a similar thing happening between host14 and devserver 100.115.245.198 (crbug.com/691729). This is related to network connections, which previously misled me into thinking it was a network issue, but now I think it may be another symptom of the kernel error.
* host3 and host14 cannot get back online after provision; see crbug.com/692172 for details.
,
Feb 15 2017
There's additional discussion going on at go/crosoncall. Key points:
* At least one of the problem DUTs was seen to have failed with a complaint about entering recovery because it couldn't find its boot storage. We don't know when that happened.
* There are some known problems relating to setting voltage levels that can cause that kind of problem. The fixes include updating to the .149 firmware. The .149 firmware is available at ToT, and on the tip of the R57 and R56 branches.

I've updated kevin thus:

$ atest stable_version modify -b kevin -i R57-9202.28.0
Stable version for board kevin is changed from R58-9221.0.0 to R57-9202.28.0.
$ atest stable_version modify -b kevin/rwfw -i Google_Kevin.8785.149.0
Stable version for board kevin/rwfw is changed from Google_Kevin.8785.135.0 to Google_Kevin.8785.149.0.

I kicked off verify, so all kevin DUTs will start updating. I expect the updates will be transparent to all testing. By now, most DUTs should already have updated.
,
Feb 15 2017
FYI, the message

blk_update_request: I/O error, dev mmcblk0rpmb, sector 0

is nothing important. It happens on 100% of boots on 100% of kevins and always has. Filed <https://code.google.com/p/chrome-os-partner/issues/detail?id=62955> to get rid of this confusing message.
,
Feb 15 2017
Now that we've updated the firmware, I've kicked off a test on
the two DUTs that were kicked out of the CQ:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=101516502
If that test passes, I think we'll be able to declare that
the firmware update fixed the failures.
,
Feb 15 2017
...or, possibly, the bump to the repair build. From comlink, the old repair build was R58-9221.0.0. The kernel on that build was lacking DDRFreq fixes, and thus you could get random memory failures while running on that build. If you used that build to flash a new OS image, it's plausible that you would flash something bad.
,
Feb 21 2017
Re #23: The test running on chromeos2-row8-rack8-host3 seems fine. However, the other one running on chromeos2-row8-rack9-host14 still failed because the DUT was not sshable:

SSHConnectionError: ssh: connect to host chromeos2-row8-rack9-host14 port 22: Connection timed out
,
Feb 23 2017
I kicked off the same test again, but the results are the same: https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=102871564 That is, chromeos2-row8-rack8-host3 passed, but chromeos2-row8-rack9-host14 failed. I ssh'ed to both devices and confirmed the firmware and OS versions are identical (fwid: Google_Kevin.8785.149.0 / OS: R58-9306.0.0). It looks more like a hw/infra issue?
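For the record, that version check can be scripted roughly as below; crossystem fwid and /etc/lsb-release are the usual sources, and the ssh setup and host list are assumptions.

#!/usr/bin/env python
# Sketch: read the firmware ID and OS release of each DUT over ssh.
# Assumes passwordless root ssh; crossystem and /etc/lsb-release exist on CrOS.
import subprocess

HOSTS = ['chromeos2-row8-rack8-host3', 'chromeos2-row8-rack9-host14']

def remote(host, command):
    return subprocess.check_output(
        ['ssh', '-o', 'ConnectTimeout=10', 'root@%s' % host, command]).strip()

for host in HOSTS:
    fwid = remote(host, 'crossystem fwid')
    release = remote(host, 'grep CHROMEOS_RELEASE_VERSION /etc/lsb-release')
    print('%s: fwid=%s %s' % (host, fwid, release))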
,
Feb 23 2017
I checked the DUT's history for the past 24 hours.
The DUT passes provisioning more often than it fails:
2017-02-23 01:11:25 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60143780-provision/
2017-02-22 23:52:23 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60142923-provision/
2017-02-22 22:33:47 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60142147-provision/
2017-02-22 21:31:00 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60141618-provision/
2017-02-22 21:08:21 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60141444-provision/
2017-02-22 15:06:53 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60137694-provision/
2017-02-22 13:22:52 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60136766-provision/
2017-02-22 11:49:01 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60135523-provision/
2017-02-22 10:46:29 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60134774-provision/
2017-02-22 10:00:05 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack9-host14/60134136-provision/
I can't rule out a problem with the DUT, but we should also consider the possibility that the paladin build we're testing has a problem of its own. Also, I note that dianders@ indicated the possibility of a long tail of devices that will still have memory problems even with all current fixes applied. It's possible that chromeos2-row8-rack9-host14 is part of that long tail.
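For anyone else eyeballing these listings, a throwaway sketch that tallies the OK vs. failed provision entries from a history dump saved to a file; the date/time/status/URL layout is just what's pasted above.

#!/usr/bin/env python
# Throwaway sketch: count passing vs. failing provision entries in a saved
# dut-history listing like the one above, where the third column is "OK" or "--".
import sys

ok = fail = 0
for line in open(sys.argv[1]):
    fields = line.split()
    if len(fields) < 4 or 'provision' not in line:
        continue
    status = fields[2]   # e.g. "2017-02-23 01:11:25 OK http://..."
    if status == 'OK':
        ok += 1
    elif status == '--':
        fail += 1
print('provision OK: %d, failed: %d' % (ok, fail))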
,
Feb 23 2017
I note also bug 392289, and the fact that a substantial number of kevin DUTs are in dev mode. In particular, chromeos2-row8-rack9-host14 was in dev mode, but chromeos2-row8-rack8-host3 was not. I've pushed -host14 to normal mode, and I'm re-running the same paladin build to see whether the behavior changes now: https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=103011296
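A quick way to audit which kevins are in dev mode is sketched below; crossystem mainfw_type reports "developer" vs. "normal", and the host list and root ssh access are assumptions.

#!/usr/bin/env python
# Sketch: report which DUTs booted developer-mode firmware.
# crossystem mainfw_type prints "normal", "developer", or "recovery".
# Assumes passwordless root ssh to each host.
import subprocess

HOSTS = ['chromeos2-row8-rack8-host3', 'chromeos2-row8-rack9-host14']

for host in HOSTS:
    try:
        fw_type = subprocess.check_output(
            ['ssh', '-o', 'ConnectTimeout=10', 'root@%s' % host,
             'crossystem mainfw_type']).strip()
    except subprocess.CalledProcessError:
        fw_type = 'unreachable'
    print('%s: %s' % (host, fw_type))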
,
Feb 23 2017
I can only skim since I'm trying to pay attention to a conference presentation, but if the device is sometimes failing to boot, then it's possible we have one of jwerner's favorite eMMC problems. See http://crosbug.com/p/63025. New firmware is being qualified as we speak to fix this.
,
Feb 23 2017
chromeos2-row8-rack9-host14 has now passed. This suggests that dev mode may have been the problem.
,
Feb 24 2017
I agree with #30. BTW, to circumvent the issue #29 pointed out, I filed a firmware update request: https://bugs.chromium.org/p/chromium/issues/detail?id=695702
,
Mar 1 2017
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 2 2017
Since both machines (chromeos2-row8-rack8-host3 & chromeos2-row8-rack9-host14) work now, can we put them back into the CQ and mark this bug as fixed?
,
Mar 2 2017
Kevin is no longer failing; the DUTs in question have been in service more or less continuously for a while.