Fix broken nyan_kitty DUTs
Reported by jrbarnette@chromium.org, Jun 19 2017
Issue description
A large number of nyan_kitty DUTs in pool:suites are currently down because of bug 734764. They need to be recovered. In particular, the lack of working spare DUTs is blocking debug of Chrome bug 734074.
Some possible approaches:
* Manually do what repair isn't doing: Reboot all the DUTs, and see if the stateful file system will become writable.
* File an urgent ticket for manual re-install from USB.
,
Jun 19 2017
This should get handled automatically via the board inventory roundup. The board inventory script didn't run on its own this morning due to planned lab downtime, so I ran it manually, and it generated this:
https://groups.google.com/a/google.com/forum/#!topic/englab-sys-cros/aQ22voTfO9I
So it looks like nyan_kitty is pretty high in the queue. I'm willing to trust englab-sys-cros's internal prioritization to get to this in time.
+haoweiw FYI.
,
Jun 20 2017
Following up based on an offline conversation with pprabhu@: The only reason for urgency in repairing these DUTs is bug 734074. If that bug gets resolved (it looks like it will be), then repairing kitty can go through the usual inventory management process. It should also be noted that fixing bug 734764 would also address the kitty failures, and that change should (or at least could) be treated as modestly urgent on its own.
,
Jun 20 2017
Actually, we're short of these DUTs in the lab now. And even though jrbarnette@ already has a fix in flight for the problem that's causing repair to fail, I'm going to try to get these back for now. (We were hit by the same CL again yesterday, and lost some more DUTs.)
Experimented with chromeos4-row13-rack1-host8, which failed in the same way.
localhost ~ # rm -rf /var/spool/crash/os-release
rm: cannot remove '/var/spool/crash/os-release': Read-only file system
Then,
localhost ~ # sudo reboot
...
ssh again:
localhost ~ # rm -rf /var/spool/crash/os-release
profit!
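The manual check above (try a write on stateful, reboot, try again) could be scripted along these lines. This is only a sketch: the probe file name `.rw_probe`, the function name, and the assumption that passwordless root ssh to the DUT already works are all illustrative and not part of any lab tool.

```python
import subprocess

# Hypothetical probe: create and remove a scratch file in the same
# directory where the rm failure was observed.
PROBE = "touch /var/spool/crash/.rw_probe && rm -f /var/spool/crash/.rw_probe"


def stateful_is_writable(host, run=subprocess.run):
    """Return True if the probe write succeeds on the DUT.

    `run` is injectable so the check can be exercised without a real DUT.
    """
    result = run(["ssh", "root@" + host, PROBE],
                 capture_output=True, text=True)
    # A read-only file system makes touch fail, giving a non-zero exit.
    # An unreachable host also gives a non-zero exit, so False here means
    # "could not confirm writable", not strictly "read-only".
    return result.returncode == 0
```

Running this before and after a reboot would reproduce the experiment on one host. As the later comments show, a successful probe right after reboot does not guarantee the file system stays writable through an update.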
,
Jun 20 2017
I've queued a reverify on the host:
http://chromeos-server10.hot.corp.google.com/afe/#tab_id=view_host&object_id=2374
If the repair reclaims that DUT, I'll do the same to all the other nyan_kitty DUTs.
,
Jun 20 2017
#5 didn't work. Repair failed again.
,
Jun 20 2017
> #5 didn't work. Repair failed again.
If rebooting the DUT doesn't leave stateful writable long enough for an update, then fixing bug 734764 may not be enough. The best that that change can do is to allow "reboot the DUT and try again." If the file system is read-only, we need a fix for 735156. However, getting that change made, tested, committed, and deployed is the work of a few days. If it's urgent, we may be back to a ticket to manually re-image the DUTs.
,
Jun 20 2017
Filed b/62834517 for manual repair.
The failure in #5 was actually while doing an auto-update.
2017/06/20 12:02:59.677 INFO | auto_updater:0645| Waiting for update...status: UPDATE_STATUS_CHECKING_FOR_UPDATE at progress 0.000000
2017/06/20 12:03:09.694 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp4mnDP4/testing_rsa root@chromeos4-row13-rack1-host8 -- update_engine_client --status
2017/06/20 12:03:20.124 DEBUG| cros_build_lib:0633| (stdout):
LAST_CHECKED_TIME=1497985387
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_REPORTING_ERROR_EVENT
NEW_VERSION=9999.0.0
NEW_SIZE=515546278
2017/06/20 12:03:20.125 DEBUG| cros_build_lib:0635| (stderr):
Warning: Permanently added 'chromeos4-row13-rack1-host8,100.115.216.8' (RSA) to the list of known hosts.
[0620/120308:INFO:update_engine_client.cc(493)] Querying Update Engine status...
2017/06/20 12:03:20.125 INFO | auto_updater:0645| Waiting for update...status: UPDATE_STATUS_REPORTING_ERROR_EVENT at progress 0.000000
2017/06/20 12:03:30.136 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp4mnDP4/testing_rsa root@chromeos4-row13-rack1-host8 -- update_engine_client --status
2017/06/20 12:03:30.433 DEBUG| cros_build_lib:0633| (stdout):
LAST_CHECKED_TIME=1497985387
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=9999.0.0
NEW_SIZE=515546278
But then, we were not able to copy logs in/out of the DUT after the failed update, again suggesting filesystem errors.
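For anyone scanning similar logs, the KEY=VALUE status output quoted above can be parsed mechanically to spot the error transition. The helper names below are hypothetical and this is not how auto_updater does it; it is a minimal sketch that relies only on the output format visible in this log.

```python
def parse_update_status(stdout):
    """Parse KEY=VALUE tokens from `update_engine_client --status` output.

    Splitting on whitespace handles both one-field-per-line output and
    the space-joined form that appears in pasted logs.
    """
    fields = {}
    for token in stdout.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that carry no '='
            fields[key] = value
    return fields


def update_reported_error(stdout):
    """True if CURRENT_OP is the error-reporting state seen in the log."""
    current_op = parse_update_status(stdout).get("CURRENT_OP")
    return current_op == "UPDATE_STATUS_REPORTING_ERROR_EVENT"
```

Applied to the two stdout samples in the log, the first would be flagged as an error and the second (`UPDATE_STATUS_IDLE`) would not. This matches the sequence seen here, where the update engine reports an error and then returns to idle without making progress.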
,
Jun 22 2017
This was handled by englab-sys-cros:
pprabhu@pprabhu:~$ dut-status -w -b nyan_kitty -p cq | wc
12 12 341
pprabhu@pprabhu:~$ dut-status -w -b nyan_kitty | wc
36 36 1019
pprabhu@pprabhu:~$ dut-status -n -b nyan_kitty | wc
6 6 144
Comment 1 by jrbarnette@chromium.org
, Jun 19 2017