Issue 734769

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 734074



Fix broken nyan_kitty DUTs

Reported by jrbarnette@chromium.org, Jun 19 2017

Issue description

A large number of nyan_kitty DUTs in pool:suites are currently
down because of bug 734764. They need to be recovered. In
particular, the lack of working spare DUTs is blocking debugging
of Chrome bug 734074.

Some possible approaches:
  * Manually do what repair isn't doing:  Reboot all the DUTs,
    and see if the stateful file system will become writable.
  * File an urgent ticket for manual re-install from USB.
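
The first approach could be scripted roughly as follows. This is only a minimal sketch, assuming root ssh access to each DUT; the helper names (`ssh_run`, `stateful_is_writable`, `reboot_and_check`), the probe path, and the two-minute boot delay are all hypothetical and not part of any existing lab tool.

```python
import subprocess
import time
from typing import Callable, Tuple

# Hypothetical probe file; assumed to live on the stateful file system.
PROBE_PATH = "/var/spool/crash/.writable_probe"

# A runner takes (host, command) and returns (exit code, stderr).
Runner = Callable[[str, str], Tuple[int, str]]


def ssh_run(host: str, command: str) -> Tuple[int, str]:
    """Run `command` on `host` as root over ssh."""
    proc = subprocess.run(
        ["ssh", "-oConnectTimeout=30", "-oStrictHostKeyChecking=no",
         "root@" + host, "--", command],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


def stateful_is_writable(host: str, run: Runner = ssh_run) -> bool:
    """Create and remove a probe file; on a read-only file system touch fails."""
    code, _ = run(host, "touch {0} && rm -f {0}".format(PROBE_PATH))
    return code == 0


def reboot_and_check(host: str, run: Runner = ssh_run,
                     wait: Callable[[float], None] = time.sleep,
                     boot_delay: float = 120.0) -> bool:
    """Reboot the DUT, wait for it to come back, then probe stateful again."""
    # The exit code is ignored: the ssh session usually drops mid-reboot.
    run(host, "reboot")
    # Assumed delay; a real script would poll until ssh answers again.
    wait(boot_delay)
    return stateful_is_writable(host, run)
```

Passing the runner and the wait function as parameters keeps the logic easy to dry-run without touching real DUTs.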

 
Blocking: 734074
Cc: haoweiw@chromium.org
This should get handled automatically via the board inventory roundup. The board inventory script didn't run on its own this morning due to planned lab downtime. So I ran it manually.

And it generated this: https://groups.google.com/a/google.com/forum/#!topic/englab-sys-cros/aQ22voTfO9I

So it looks like nyan_kitty is pretty high in the queue.
I'm willing to trust englab-sys-cros's internal prioritization to get to this in time.

+haoweiw FYI.
Following up based on an offline conversation with pprabhu@:
The only reason for urgency in repairing these DUTs is bug 734074.
If that bug gets resolved (it looks like it will be), then
repairing kitty can go through the usual inventory management
process.

It should also be noted that fixing bug 734764 would also
address the kitty failures, and that change should (or at least
could) be treated as modestly urgent on its own.

Status: Started (was: Assigned)
Actually, we're short of these DUTs in the lab now. Even though jrbarnette@ already has a fix in flight for the problem that's causing repair to fail, I'm going to try to get these back for now.
(We were hit by the same CL again yesterday and lost some more DUTs.)

I experimented with chromeos4-row13-rack1-host8, which failed in the same way.

localhost ~ # rm -rf /var/spool/crash/os-release
rm: cannot remove '/var/spool/crash/os-release': Read-only file system

Then,
localhost ~ # sudo reboot
...

ssh again:
localhost ~ # rm -rf /var/spool/crash/os-release


profit!
I've queued a reverify on the host: http://chromeos-server10.hot.corp.google.com/afe/#tab_id=view_host&object_id=2374

If the repair reclaims that DUT, I'll do the same to all the other nyan_kitty DUTs.
#5 didn't work. Repair failed again.
> #5 didn't work. Repair failed again.

If rebooting the DUT doesn't leave stateful writable long enough
for an update, then fixing bug 734764 may not be enough.  The best
that that change can do is to allow "reboot the DUT and try again."
If the file system is read-only, we need a fix for 735156.  However,
getting that change made, tested, committed, and deployed is the
work of a few days.  If it's urgent, we may be back to a ticket to
manually re-image the DUTs.

Filed b/62834517 for manual repair.

The failure in #5 actually happened during an auto-update:
2017/06/20 12:02:59.677 INFO |      auto_updater:0645| Waiting for update...status: UPDATE_STATUS_CHECKING_FOR_UPDATE at progress 0.000000
2017/06/20 12:03:09.694 DEBUG|    cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp4mnDP4/testing_rsa root@chromeos4-row13-rack1-host8 -- update_engine_client --status
2017/06/20 12:03:20.124 DEBUG|    cros_build_lib:0633| (stdout):
LAST_CHECKED_TIME=1497985387
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_REPORTING_ERROR_EVENT
NEW_VERSION=9999.0.0
NEW_SIZE=515546278

2017/06/20 12:03:20.125 DEBUG|    cros_build_lib:0635| (stderr):
Warning: Permanently added 'chromeos4-row13-rack1-host8,100.115.216.8' (RSA) to the list of known hosts.
[0620/120308:INFO:update_engine_client.cc(493)] Querying Update Engine status...

2017/06/20 12:03:20.125 INFO |      auto_updater:0645| Waiting for update...status: UPDATE_STATUS_REPORTING_ERROR_EVENT at progress 0.000000
2017/06/20 12:03:30.136 DEBUG|    cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp4mnDP4/testing_rsa root@chromeos4-row13-rack1-host8 -- update_engine_client --status
2017/06/20 12:03:30.433 DEBUG|    cros_build_lib:0633| (stdout):
LAST_CHECKED_TIME=1497985387
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=9999.0.0
NEW_SIZE=515546278

But then, we were not able to copy logs in/out of the DUT after the failed update, again suggesting filesystem errors.
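
For anyone scanning similar logs, the status polling above can be sketched as a small parser for the KEY=VALUE output of `update_engine_client --status`. This is a rough sketch, not the auto_updater code: the function names are hypothetical, and treating `UPDATE_STATUS_REPORTING_ERROR_EVENT` as the failure signal is an assumption based only on the log in this comment.

```python
# Sample stdout copied from the log above.
SAMPLE_STDOUT = """\
LAST_CHECKED_TIME=1497985387
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_REPORTING_ERROR_EVENT
NEW_VERSION=9999.0.0
NEW_SIZE=515546278
"""


def parse_update_status(stdout):
    """Turn KEY=VALUE lines into a dict, skipping lines without '='."""
    status = {}
    for line in stdout.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            status[key.strip()] = value.strip()
    return status


def update_failed(status):
    """Assumption: this CURRENT_OP value means the update hit an error."""
    return status.get("CURRENT_OP") == "UPDATE_STATUS_REPORTING_ERROR_EVENT"
```

All values are kept as strings; converting `PROGRESS` or `NEW_SIZE` to numbers is left to the caller.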
Status: Fixed (was: Started)
This was handled by englab-sys-cros:

pprabhu@pprabhu:~$ dut-status -w -b nyan_kitty -p cq | wc
     12      12     341
pprabhu@pprabhu:~$ dut-status -w -b nyan_kitty | wc
     36      36    1019
pprabhu@pprabhu:~$ dut-status -n -b nyan_kitty | wc
      6       6     144
