Issue 746997

Starred by 4 users

Issue metadata

Status: Verified
Owner:
Closed: Dec 21
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Blocking:
issue 738036




AU tests timeout when they need to powerwash after updating backwards

Project Member Reported by dhadd...@chromium.org, Jul 20 2017

Issue description

https://wmatrix.googleplex.com/platform/paygen_au_canary?platforms=quawks

This is only a problem going back to the stepping stone build. 
 
Looking into the logs...

Installing the source image on the DUT succeeds. Then it times out after the reboot.

Since this is only on the stepping stone build, my guess is that we are hitting the "ChromeOS is repairing itself. Please wait..." screen, and it is not finishing before autotest gives up.
Yup, that's what happened.
IMG_20170720_193608.jpg
Cc: jrbarnette@chromium.org gwendal@chromium.org xixuan@chromium.org josa...@chromium.org keta...@chromium.org
Labels: M-61
It took over 10 minutes for it to complete this process.

We only give the reboot 8 minutes to complete:
https://cs.corp.google.com/chromeos_public/chromite/lib/auto_updater.py?q=auto_updater.py&sq=package:%5Echromeos_(internal%7Cpublic)$&l=1317

This only happens on the quawks device.

Does anybody know why this 8 minute number was chosen?
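
For illustration, here is a minimal sketch of how a fixed reboot deadline cuts off a slow powerwash. It is not the code in auto_updater.py: wait_for_reboot, is_up, poll_interval and the injected clock/sleep are hypothetical names, and only the 8-minute figure comes from this thread.

```python
import time

# The 8-minute reboot limit quoted above, in seconds.
REBOOT_TIMEOUT_SECONDS = 8 * 60


def wait_for_reboot(is_up, timeout=REBOOT_TIMEOUT_SECONDS, poll_interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_up() until it returns True or `timeout` seconds elapse.

    is_up is a hypothetical callable (for example, an SSH reachability
    check). Returns True if the device came back before the deadline,
    False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


class FakeClock:
    """Simulated clock so the sketch can be exercised without waiting."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds
```

With a simulated device that needs 10 minutes to finish repairing itself, the 8-minute deadline reports a failure, while a longer deadline would let the same reboot succeed.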

Cc: kathrelk...@chromium.org
+milestone owner
Found one run on squawks too that failed for the same reason:
https://wmatrix.googleplex.com/failures/paygen_au_stable?platforms=squawks&builds=R59-9460.74.0&releases=59
Summary: paygen_au s/quawks only failure: Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test. (was: paygen_au quawks only failure: Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.)
Owner: xixuan@chromium.org
Status: Assigned (was: Untriaged)
xixuan, can you comment on #3?

Comment 8 by pho...@chromium.org, Jul 24 2017

Blocking: 738036

Comment 9 by pho...@chromium.org, Jul 31 2017

Cc: pho...@chromium.org
 Issue 738036  has been merged into this issue.
Ping xixuan :) 
Owner: dhadd...@chromium.org
Sorry, I forgot to reply to this. The timeout comes from the current reboot_timeout in autotest:

https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/hosts/cros_host.py?q=cros_host.py&sq=package:%5Echromeos_(internal%7Cpublic)$&l=117

Based on the comments, 8 minutes already seems long, and the previous value was 5 minutes... Now we want to set it to 10 minutes? Could we find out why the reboot takes so long? Once the timeout is extended, provisioning will wait much longer on DUTs that are genuinely offline before timing out, which increases provision time.
Yeah, 8 minutes is too long for a reboot IMO.

The problem is when the device downgrades from a recent build to the stepping stone build (M53). The device needs to powerwash and this takes a long time on certain boards. 

I can add an is_au_endtotest clause, but I wanted to check if there is a better way, since cros flash users would be hit by this too and would fail.
Owner: gwendal@chromium.org
Gwendal, do we care that this takes so long?
Ping :) 
This has happened again on Ninja:
https://cros-goldeneye.corp.google.com/chromeos/console/viewRelease?releaseName=M62-STABLE-CHROMEOS-4-NinjaOnly&disco=AAAABhw4cC8&ts=5a1f3f90&usp=comment_email_document

Going back to the old 53 build, we powerwash the device, and it takes way too long to complete, so the test fails.
Summary: AU tests timeout when they need to powerwash after updating backwards (was: paygen_au s/quawks only failure: Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.)
Labels: bvttriage
Labels: -M-61 M-72
Owner: dhadd...@chromium.org
Since this is the reason for most of the AU failures on stable, and it causes multiple reruns and morning planner confusion, AND since the lab now uses quick-provision, I think it is OK to increase the reboot timeout for the case where we update backwards.

I'll do it 
Project Member

Comment 19 by bugdroid1@chromium.org, Dec 12

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/d991b182666ed9a57bb42fa21b721847c1179daf

commit d991b182666ed9a57bb42fa21b721847c1179daf
Author: David Haddock <dhaddock@chromium.org>
Date: Wed Dec 12 21:55:53 2018

auto_updater: Increase reboot timeout after rootfs update for AU tests.

When we put old builds on devices as the source image in the AU tests it
sometimes requires the device to powerwash. This powerwashing step is
taking a long time on older devices and the test is timing out.

This is the reason for 90% of AU test failures on stable channel. The
timeout increase is only when we are doing an AU test and only for the
post rootfs reboot.

BUG= chromium:746997 
TEST=autoupdate_EndToEndTest on ninja from R53 -> R73

Change-Id: Ie1f472b680b2fbdb1fa5baa640556866d4f07d9e
Reviewed-on: https://chromium-review.googlesource.com/1372178
Commit-Ready: David Haddock <dhaddock@chromium.org>
Tested-by: David Haddock <dhaddock@chromium.org>
Reviewed-by: Amin Hassani <ahassani@chromium.org>

[modify] https://crrev.com/d991b182666ed9a57bb42fa21b721847c1179daf/lib/auto_updater.py
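
The rule described in the commit message (a longer timeout only during an AU test, and only for the post-rootfs reboot) could be sketched roughly as below. This is an assumption-laden illustration, not the actual change to lib/auto_updater.py: the function name, both flag names, and the 15-minute extended value are hypothetical, since the thread does not state the new timeout.

```python
# Default reboot limit discussed earlier in this issue (8 minutes).
DEFAULT_REBOOT_TIMEOUT = 8 * 60

# Hypothetical extended limit; the real value is not given in this thread.
AU_TEST_POST_ROOTFS_REBOOT_TIMEOUT = 15 * 60


def select_reboot_timeout(is_au_test, after_rootfs_update):
    """Return the reboot timeout in seconds for the current step.

    Only the combination "AU test" plus "reboot following the rootfs
    update" gets the extended limit, so ordinary provisioning and
    cros flash keep the default and still give up quickly on DUTs
    that are really offline.
    """
    if is_au_test and after_rootfs_update:
        return AU_TEST_POST_ROOTFS_REBOOT_TIMEOUT
    return DEFAULT_REBOOT_TIMEOUT
```

Keeping the default for every other caller addresses the concern raised earlier that a global increase would slow down provisioning on offline DUTs.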

Cc: kbleicher@chromium.org gu...@chromium.org abod...@chromium.org mkarkada@chromium.org dchan@chromium.org dhadd...@chromium.org
 Issue 915028  has been merged into this issue.
Labels: Merge-Request-71 Merge-Request-72
Project Member

Comment 22 by sheriffbot@chromium.org, Dec 13

Labels: -Merge-Request-72 Merge-Review-72 Hotlist-Merge-Review
This bug requires manual review: M72 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: govind@(Android), kariahda@(iOS), djmm@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Request-71 -Merge-Review-72 M-71 Merge-Approved-71 Merge-Approved-72
Merge approved for Chrome OS M71 and M72.
Actually, this may just require a devserver push to take effect. I'll ask infra-discuss.
Ping since this is blocking boards from getting tested and included in the M71 stable; thanks!

Owner: akes...@chromium.org
+aviv to update dev server.
Project Member

Comment 27 by sheriffbot@chromium.org, Dec 17

Cc: dhaddock@google.com kbleicher@google.com
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Pri-1 Pri-0
Raising to a P0 since this is blocking boards from being pushed to M71 Chrome OS stable.

Please advise on status and next steps.  Thanks.
Owner: dhadd...@chromium.org
I believe the devservers were updated by xixuan@. If so, this should be owned by somebody on AU.

-> dhaddock to confirm fix
Owner: xixuan@chromium.org
Actually, push is apparently blocked by our staging lab outage (due to ganeti .hot outage).

If it's a severe emergency we could consider an untested chromite push, but that carries risk of breaking the lab.
Project Member

Comment 31 by sheriffbot@chromium.org, Dec 21

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Fixed (was: Assigned)
I did a full push on Dec 19, so the fix should already be deployed.
Status: Verified (was: Fixed)
