Issue 730067

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 736393
issue 779147

Blocking:
issue 730061



Reduce time-to-recovery for a broken DUT

Project Member Reported by pprabhu@chromium.org, Jun 6 2017

Issue description

The ask here is to look at the following example:
 [1] A provision job was launched on a DUT (chromeos4-row6-rack9-host17).
 [2] Before the provision, we rebooted the DUT.
 [3] For some reason (ignore it here), the DUT failed to return from reboot.
 [4] Eventually, the DUT recovered via servo reset in the following repair job.

The time between events [3] and [4] was about 60 minutes.
This is far too long.

The main reason for this delay is that we ping / SSH the DUT many times between those two events, each time with a large timeout. We've tended to increase SSH / ping timeouts for robustness in various areas. The result is long delays in the failure case.

The ask here is to consider just this one execution path and optimize it to be fast (again). Make repair fast again.
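One way to keep this path bounded is to have every ping / SSH attempt draw from a single shared time budget, instead of each call carrying its own large timeout. The following is a minimal sketch of that idea, not autotest code: `Deadline`, `wait_for_host`, and all parameter names and defaults are hypothetical, and `check` stands in for whatever ping or SSH probe is actually used.

```python
import time


class Deadline:
    """A fixed time budget shared by every probe on one execution path.

    Hypothetical helper for illustration; not part of autotest.
    """

    def __init__(self, budget_secs, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + budget_secs

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires - self._clock())

    def expired(self):
        return self.remaining() <= 0.0


def wait_for_host(check, deadline, per_try_timeout=30.0,
                  pause=1.0, sleep=time.sleep):
    """Retry check(timeout) until it succeeds or the shared budget runs out.

    `check` is an assumed callable that takes a timeout in seconds and
    returns True if the DUT answered. Each attempt is capped by whatever
    is left of the budget, so the total wait cannot exceed the budget no
    matter how many callers retry.
    """
    while not deadline.expired():
        timeout = min(per_try_timeout, deadline.remaining())
        if check(timeout):
            return True
        # Never sleep past the end of the budget.
        sleep(min(pause, deadline.remaining()))
    return False
```

A caller would create one `Deadline` at the point where the DUT is first suspected dead and pass it down to every later probe, so that the time between [3] and [4] is capped by one number rather than by the sum of many independent timeouts.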

Failed provision logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row6-rack9-host17/389382-provision/20170406122508/
Following repair logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row6-rack9-host17/389600-repair/20170406130547/
 
Blocking: 730061
Cc: akes...@chromium.org
+ akeshet, triage request: 

This doesn't directly cause outages or slow down test execution. Its impact is rather:
- provision jobs take longer to fail than they could.
- recovery of DUTs takes longer than it should.
- we have a whole bunch of redundant ping/ssh network traffic.

This requires someone's continued attention for a few weeks. It involves finding out all the places (from those logs) where we ran SSH / ping between [3] and [4] and understanding what changing those timeouts would mean / how we can be smarter about not repeating so much work.
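As a rough aid for that audit, the worst-case cost of each probe site can be tallied once the sites have been collected from the logs. The sketch below assumes each site can be summarized as a number of attempts and a per-attempt timeout; the `Probe` type, the function names, and the sample entries are hypothetical placeholders, not values taken from the linked logs.

```python
from collections import namedtuple

# One place where the code pings or SSHes the DUT (hypothetical record type).
Probe = namedtuple("Probe", ["where", "attempts", "timeout_secs"])


def worst_case_secs(probe):
    """Time burned at this site if every attempt runs to its full timeout."""
    return probe.attempts * probe.timeout_secs


def rank_probes(probes):
    """Order sites by worst-case cost, most expensive first."""
    return sorted(probes, key=worst_case_secs, reverse=True)


def total_worst_case_secs(probes):
    """Upper bound on time spent probing a DUT that never answers."""
    return sum(worst_case_secs(p) for p in probes)


# Placeholder entries for illustration only; real values must come from the logs.
probes = [
    Probe("reboot wait before provision", 1, 480),
    Probe("ssh check in verify", 3, 120),
    Probe("ping before servo reset", 10, 60),
]

for p in rank_probes(probes):
    print("%-32s %5d s" % (p.where, worst_case_secs(p)))
print("total worst case: %d s" % total_worst_case_secs(probes))
```

Ranking by worst-case cost shows which timeouts are worth revisiting first, and the total gives a number to compare against the time actually observed between [3] and [4].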
Labels: infra-overhead

Comment 4 by aut...@google.com, Jun 12 2017

Owner: pprabhu@chromium.org
prathmesh, can you help us ID the problematic retries? 
Blockedon: 736393
Cc: hidehiko@chromium.org
This is similar to issue 726481 that hidehiko@ is working on, but not the same: this is slowness on the order of hours when the test fails or the DUT dies.
Blockedon: 779147

Comment 8 by cindyb@chromium.org, May 31 2018

Hi, this bug has not been updated recently and remains untriaged. Please acknowledge the bug and provide status within two weeks (6/8/2018), or the bug will be closed. Thank you.
Status: Archived (was: Untriaged)
Archiving per #8
Status: Untriaged (was: Archived)
This is very much an issue.
Another instance: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack9-host18/1093337-provision/20181507195528/

Provision took 2 hours to fail, causing a very hard-to-debug suite timeout down the line.
:(
Owner: ----
Status: Available (was: Untriaged)
Here you go: https://chromium-swarm-dev.appspot.com/task?id=3fce586d8d691c10&refresh=10

This is a doomed repair attempt. The DUT is not available via SSH. The only thing that was really tried was installing via USB, which failed (give it 10 minutes to stage and download the image to USB, which didn't even get very far).

This took 45 minutes to fail.
