All bvt veyron_rialto DUTs are failing repair (can't boot from USB)
Issue description
pprabhu@pprabhu:chromiumos$ dut-status -p bvt -b veyron_rialto
hostname                     S   last checked         URL
chromeos2-row1-rack10-host7  NO  2018-06-04 11:00:29  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack10-host7/556495-repair/
chromeos2-row2-rack10-host4  NO  2018-06-04 11:00:29  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host4/556498-repair/
chromeos2-row2-rack10-host5  NO  2018-06-01 12:37:09  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host5/545573-repair/
chromeos2-row2-rack10-host6  NO  2018-06-04 11:00:29  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host6/556496-repair/
chromeos2-row2-rack10-host7  NO  2018-06-04 11:00:30  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host7/556499-repair/
chromeos2-row2-rack10-host9  NO  2018-06-04 11:00:29  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host9/556497-repair/
pprabhu@pprabhu:chromiumos$
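For anyone who wants to post-process a listing like the one above, here is a minimal Python sketch that splits each row into fields and picks out the hosts marked NO. The row layout (hostname, status, date, time, URL separated by whitespace) is inferred only from the output shown in this bug, and `DutRow` and `parse_dut_status` are hypothetical names, not part of dut-status.

```python
from typing import List, NamedTuple


class DutRow(NamedTuple):
    hostname: str
    status: str        # "NO" for every host in this bug
    last_checked: str  # "YYYY-MM-DD HH:MM:SS"
    url: str


def parse_dut_status(text: str) -> List[DutRow]:
    """Parse pasted dut-status output into rows (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed data-row shape: hostname, status, date, time, URL.
        # The header and shell-prompt lines do not end in a URL, so they are skipped.
        if len(fields) == 5 and fields[4].startswith("http"):
            host, status, date, time, url = fields
            rows.append(DutRow(host, status, f"{date} {time}", url))
    return rows


SAMPLE = """\
hostname S last checked URL
chromeos2-row1-rack10-host7 NO 2018-06-04 11:00:29 http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack10-host7/556495-repair/
chromeos2-row2-rack10-host5 NO 2018-06-01 12:37:09 http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host5/545573-repair/
"""

rows = parse_dut_status(SAMPLE)
broken = [row.hostname for row in rows if row.status == "NO"]
print(broken)
```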
Jun 4 2018
They're all failing to even boot from USB as part of the final repair attempt:
START ---- repair timestamp=1528135231 localtime=Jun 04 11:00:31
    GOOD ---- verify.servo_ssh timestamp=1528135233 localtime=Jun 04 11:00:33
    GOOD ---- verify.brd_config timestamp=1528135234 localtime=Jun 04 11:00:34
    GOOD ---- verify.ser_config timestamp=1528135234 localtime=Jun 04 11:00:34
    GOOD ---- verify.job timestamp=1528135235 localtime=Jun 04 11:00:35
    GOOD ---- verify.servod timestamp=1528135240 localtime=Jun 04 11:00:40
    GOOD ---- verify.pwr_button timestamp=1528135240 localtime=Jun 04 11:00:40
    GOOD ---- verify.lid_open timestamp=1528135240 localtime=Jun 04 11:00:40
    GOOD ---- verify.update timestamp=1528135244 localtime=Jun 04 11:00:44
    GOOD ---- verify.PASS timestamp=1528135244 localtime=Jun 04 11:00:44
    FAIL ---- verify.ssh timestamp=1528135829 localtime=Jun 04 11:10:29 No answer to ping from chromeos2-row2-rack10-host4
    START ---- repair.rpm timestamp=1528135829 localtime=Jun 04 11:10:29
        FAIL ---- repair.rpm timestamp=1528136105 localtime=Jun 04 11:15:05 chromeos2-row2-rack10-host4 is still offline after powercycling
    END FAIL ---- repair.rpm timestamp=1528136105 localtime=Jun 04 11:15:05
    START ---- repair.sysrq timestamp=1528136105 localtime=Jun 04 11:15:05
        FAIL ---- repair.sysrq timestamp=1528136338 localtime=Jun 04 11:18:58 Host chromeos2-row2-rack10-host4 is still offline after sysrq.
    END FAIL ---- repair.sysrq timestamp=1528136338 localtime=Jun 04 11:18:58
    START ---- repair.servoreset timestamp=1528136338 localtime=Jun 04 11:18:58
        FAIL ---- repair.servoreset timestamp=1528136564 localtime=Jun 04 11:22:44 Host chromeos2-row2-rack10-host4 is still offline after servoreset.
    END FAIL ---- repair.servoreset timestamp=1528136564 localtime=Jun 04 11:22:44
    START ---- repair.firmware timestamp=1528136564 localtime=Jun 04 11:22:44
        FAIL ---- repair.firmware timestamp=1528136564 localtime=Jun 04 11:22:44 Firmware repair is not applicable to host chromeos2-row2-rack10-host4.
    END FAIL ---- repair.firmware timestamp=1528136564 localtime=Jun 04 11:22:44
    START ---- repair.usb timestamp=1528136564 localtime=Jun 04 11:22:44
        FAIL ---- repair.usb timestamp=1528137082 localtime=Jun 04 11:31:22 DUT failed to boot from USB after 300 seconds
    END FAIL ---- repair.usb timestamp=1528137082 localtime=Jun 04 11:31:22
END FAIL ---- repair timestamp=1528137082 localtime=Jun 04 11:31:22
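To see how long each repair action ran before giving up, the START and END lines in a log like the one above can be paired by step name and their timestamps subtracted. The following is a rough Python sketch, not an Autotest tool: the regular expressions are inferred from the lines pasted in this bug, and `step_durations` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Patterns inferred from the status.log excerpt in this bug (an assumption,
# not an official description of the log format).
START_RE = re.compile(r"^\s*START\s+----\s+(\S+)\s+timestamp=(\d+)")
END_RE = re.compile(r"^\s*END\s+(\S+)\s+----\s+(\S+)\s+timestamp=(\d+)")


def step_durations(log_text: str) -> List[Tuple[str, str, int]]:
    """Return (step, end status, seconds between START and END) per step."""
    started = {}
    results = []
    for line in log_text.splitlines():
        match = START_RE.match(line)
        if match:
            started[match.group(1)] = int(match.group(2))
            continue
        match = END_RE.match(line)
        if match and match.group(2) in started:
            status, step, end = match.group(1), match.group(2), int(match.group(3))
            results.append((step, status, end - started.pop(step)))
    return results


SAMPLE_LOG = """\
START ---- repair.rpm timestamp=1528135829 localtime=Jun 04 11:10:29
FAIL ---- repair.rpm timestamp=1528136105 localtime=Jun 04 11:15:05 chromeos2-row2-rack10-host4 is still offline after powercycling
END FAIL ---- repair.rpm timestamp=1528136105 localtime=Jun 04 11:15:05
START ---- repair.usb timestamp=1528136564 localtime=Jun 04 11:22:44
FAIL ---- repair.usb timestamp=1528137082 localtime=Jun 04 11:31:22 DUT failed to boot from USB after 300 seconds
END FAIL ---- repair.usb timestamp=1528137082 localtime=Jun 04 11:31:22
"""

durations = step_durations(SAMPLE_LOG)
for step, status, seconds in durations:
    print(f"{step}: {status} after {seconds}s")
```

On the two steps sampled here this gives 276 seconds for repair.rpm and 518 seconds for repair.usb, which matches the localtime values in the log.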
Jun 4 2018
Filed b/109669494 to try to get some logs from one of the dead DUTs.
Jun 4 2018
They all have separate servos, so this is not a common labstation issue.
pprabhu@pprabhu:chromiumos$ dut-status -p bvt -b veyron_rialto -n | xargs -i atest host stat {} | grep servo_host
servo_host : chromeos2-row1-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host4-servo
servo_host : chromeos2-row2-rack10-host5-servo
servo_host : chromeos2-row2-rack10-host6-servo
servo_host : chromeos2-row2-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host9-servo
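The same check can be done mechanically on the grep output above. This is a small Python sketch under the assumption that each line has the `servo_host : <name>` shape shown here; `shared_servo_hosts` is a hypothetical helper, and distinct names only show that no servo host name is repeated across the DUTs.

```python
from collections import Counter
from typing import List


def shared_servo_hosts(atest_output: str) -> List[str]:
    """Return servo host names that appear for more than one DUT."""
    names = []
    for line in atest_output.splitlines():
        key, sep, value = line.partition(":")
        # Only keep "servo_host : <name>" lines; ignore anything else.
        if sep and key.strip() == "servo_host":
            names.append(value.strip())
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


SERVO_LINES = """\
servo_host : chromeos2-row1-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host4-servo
servo_host : chromeos2-row2-rack10-host5-servo
servo_host : chromeos2-row2-rack10-host6-servo
servo_host : chromeos2-row2-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host9-servo
"""

shared = shared_servo_hosts(SERVO_LINES)
print(shared or "no servo host is shared between these DUTs")
```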
Another chance is that we have a bad veyron_rialto stable image.
Jun 4 2018
Jun 6 2018
> Another chance is that we have a bad veyron_rialto stable image.
We have enough data to answer that question easily...
$ dut-status -b veyron_rialto -p bvt
hostname                     S   last checked         URL
chromeos2-row1-rack10-host7  NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack10-host7/564195-repair/
chromeos2-row2-rack10-host4  NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host4/564196-repair/
chromeos2-row2-rack10-host5  NO  2018-06-01 12:37:09  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host5/545573-repair/
chromeos2-row2-rack10-host6  NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host6/564198-repair/
chromeos2-row2-rack10-host7  NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host7/564199-repair/
chromeos2-row2-rack10-host9  NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host9/564197-repair/
Looking at status.log from all of those failures, there's a common theme:
1) The DUT is offline.
2) Servo verification reports no errors.
3) Re-installing from USB fails like this:
START ---- repair.usb timestamp=1528311642 localtime=Jun 06 12:00:42
    FAIL ---- repair.usb timestamp=1528312159 localtime=Jun 06 12:09:19 DUT failed to boot from USB after 300 seconds
END FAIL ---- repair.usb timestamp=1528312159 localtime=Jun 06 12:09:19
Jun 6 2018
> Another chance is that we have a bad veyron_rialto stable image.
Just to be clear: a bad stable image isn't likely. The chosen image file is the latest Beta build for rialto, an R66 release. If the image were bad, the most likely cause would be corruption in Google Storage, and that's not very likely at all. The first thing to look for is whether any recent servo changes adversely impacted rialto.
Jun 16 2018
Internal bug tracking recovery fixed. All Rialtos are back.
Comment 1 by pprabhu@chromium.org, Jun 4 2018