
Issue 708262

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Blocked on:
issue 697141
issue 709280
issue 709281




test_push: autoupdate_EndtoEndTest test followed by powerwash on the same DUT fails

Project Member Reported by pprabhu@chromium.org, Apr 4 2017

Issue description

This happens almost every test_push run.
e.g., http://chromeos-autotest.hot.corp.google.com/afe/#tab_id=view_host&object_id=2


The problem is this:
- autoupdate_EndtoEndTest first installs the FSI version. In this case that's quawks-release/R54-8743.44.0
- This goes to the release GS bucket to get the image and artifacts.
- For some reason that test fails.
  - We actually mark it as pass; that's a separate bug.
  - This leaves the DUT with R54-8743.44.0 installed.

- powerwash test runs after this.
- It looks at current version, and goes to gs://chromeos-image-archive to try to obtain the autotest tarball and other artifacts for this build.
- But that location is a throwaway GS bucket, and we recently cleaned that version out of it.
- powerwash fails.
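
The mismatch described above can be sketched roughly as follows. This is only an illustration, not autotest's actual code: the function names, the URL layout, and the set standing in for the bucket contents are assumptions. Only the bucket name gs://chromeos-image-archive and the build name come from this report.

```python
# Hypothetical sketch: inferring an artifact location from the installed
# build name. The URL layout and helper names are illustrative assumptions.
ARCHIVE_BUCKET = "gs://chromeos-image-archive"


def infer_archive_url(build_name):
    """Guess where artifacts live, knowing only the installed build name."""
    return "%s/%s" % (ARCHIVE_BUCKET, build_name)


def artifacts_available(url, existing_urls):
    """Stand-in for a GS lookup: `existing_urls` models what is still stored."""
    return url in existing_urls


# The build was installed from the release bucket, so nothing guarantees
# that a copy still exists under the throwaway archive bucket.
installed = "quawks-release/R54-8743.44.0"
inferred = infer_archive_url(installed)
still_in_archive = set()  # models the bucket after the old build was cleaned out
powerwash_can_stage = artifacts_available(inferred, still_in_archive)
```

The inferred URL is well-formed, but it points at a location that never had to hold this build in the first place, which is why the lookup comes back empty.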


+ gwendal who is working on this issue in a broader context.

This is blocking test_push atm.
I think I can work around this by forcing the powerwash test to first provision the DUT to a sane build.


 
I have a CL to do what I claimed above: https://chromium-review.googlesource.com/c/468087/

But it doesn't go all the way:
- The following provision fails for the same reason. Worse, it fails without marking the DUT with /var/tmp/provision_failed.
- So, the following repair job comes along and does nothing.

- Also, a provision-repair cycle is too long for the test_push's powercycle test. So test_push would fail even if this cycle did work.
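
One way to picture the missing step is a wrapper that always leaves the failure marker behind before the error propagates. This is a sketch under assumptions, not the provision code itself: `provision_with_marker`, `needs_repair`, and the `install` callable are hypothetical, and a temporary path stands in for /var/tmp/provision_failed on the DUT.

```python
import os
import tempfile


def provision_with_marker(install, marker_path):
    """Run `install`; on any failure, leave a marker file before re-raising.

    `install` is a hypothetical callable that performs the provision.
    `marker_path` stands in for /var/tmp/provision_failed on the DUT.
    """
    try:
        install()
    except Exception:
        # Record the failure first so a later repair job can see it.
        with open(marker_path, "w") as marker:
            marker.write("provision failed\n")
        raise


def needs_repair(marker_path):
    """Model of a repair job that acts only when the marker exists."""
    return os.path.exists(marker_path)


# Usage: a failing install leaves the marker behind.
def failing_install():
    raise RuntimeError("artifacts missing from archive bucket")


marker = os.path.join(tempfile.mkdtemp(), "provision_failed")
try:
    provision_with_marker(failing_install, marker)
except RuntimeError:
    pass
repair_should_run = needs_repair(marker)
```

Without the marker, the repair job modeled here has nothing to act on, which matches the "does nothing" behavior seen above.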


Right now, I've manually marked DUTs as bad and forced a reverify. This _should_ install a good build via repair on them. And this _may_ get us a passing test_push.

BUT, this final observation means that we may be losing quawks (and other?) DUTs to this repair-verify cycle in prod as well.
Oh wow.
Now I've successfully transitioned those DUTs to repair-failed state: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row10-rack9-host15/1468-repair/20170404125553/

Repair also fails for the same reason.

bye-bye test_push @ 1:00 PM.
I don't see any sign of this affecting prod.
In particular, I don't see the failure and repair rate shooting up here: http://shortn/_OjLUKDuFC5
Cc: jrbarnette@chromium.org
I was afraid this might be hitting prod, but signals suggest otherwise.

This _is_ blocking test_push, and I'll get to it soon.
Cc: dgarr...@chromium.org
Labels: -Pri-1 Pri-0
OK, I need a test_push as a matter of course and I'm no wiser about what to do here.

+ some people for ideas
+ popping this to top of my stack.
Blockedon: 709280
Blockedon: 709281
With the two blocking bugs fixed, at least DUTs with servo will be able to get out of this situation.

DUTs without servo are still toast.

The real fix here is to remember the URL from which a build was installed instead of trying to infer it. But I can't block push-to-prod for that fix.
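
A minimal sketch of what "remember the URL" could look like, assuming a small JSON record written at install time. The class `InstallRecord`, the record file, and the bucket name gs://example-release-bucket are all hypothetical; nothing here reflects how autotest actually stores state.

```python
import json
import os
import tempfile


class InstallRecord(object):
    """Persist the URL a build was installed from (hypothetical helper)."""

    def __init__(self, path):
        self._path = path

    def remember(self, build_name, source_url):
        """Write the build name and its source URL at install time."""
        with open(self._path, "w") as f:
            json.dump({"build": build_name, "source_url": source_url}, f)

    def source_url_for(self, build_name):
        """Return the recorded URL, or None if nothing matches this build."""
        if not os.path.exists(self._path):
            return None
        with open(self._path) as f:
            record = json.load(f)
        if record.get("build") != build_name:
            return None
        return record.get("source_url")


# Usage: later tests read the recorded URL instead of guessing a bucket.
record = InstallRecord(os.path.join(tempfile.mkdtemp(), "install_record.json"))
record.remember("quawks-release/R54-8743.44.0",
                "gs://example-release-bucket/quawks-release/R54-8743.44.0")
staging_url = record.source_url_for("quawks-release/R54-8743.44.0")
```

Returning None for an unknown build keeps the caller honest: it has to handle the "no record" case explicitly rather than fall back to an inferred location.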
I changed the stable_version DEFAULT on the push master:

chromeos-test@chromeos-autotest:~$ /usr/local/autotest/cli/atest stable_version list | grep DEFAULT
DEFAULT          | R58-9334.28.0


I've re-kicked test_push. I expect it to pass.
Note that only one of the underlying bugs is expected to be fixed at this point.
We've moved back to the old repair flow in issue 709281, which means repair can recover these DUTs now.

Changing the stable version means that provision will now try to install a newer version on the DUT (so will no longer fail).
Blockedon: 697141
Project Member

Comment 12 by bugdroid1@chromium.org, Apr 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/4d80732c8a6684b0e2103fb96bc61650a2966f76

commit 4d80732c8a6684b0e2103fb96bc61650a2966f76
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 07 19:56:41 2017

chromeos_config: Temporarily mark cyan-paladin experimental

while we figure out what's causing it great grief. This takes away some
cheets coverage from CQ.

BUG= chromium:708262 
TEST=unittests

Change-Id: I6db06282e7e6e1b74ed4daf901f8b5547091cca2
Reviewed-on: https://chromium-review.googlesource.com/471867
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/4d80732c8a6684b0e2103fb96bc61650a2966f76/cbuildbot/config_dump.json
[modify] https://crrev.com/4d80732c8a6684b0e2103fb96bc61650a2966f76/cbuildbot/chromeos_config.py

CL in #12 doesn't belong here. That belongs to another fire.
Re #10: test_push failed again (twice).

Currently:
- All quawks DUTs are down (not responding to ping).
For example,
in this repair job: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/7804-autotest_system/chromeos4-row10-rack9-host15/debug/

It looks like we installed quawks-release/R58-9334.28.0 on the DUT, but then stateful_update failed (because the old flow required some files that don't exist (anymore?)).
The DUT is now not pingable.

Questions:
- Why is the old repair flow failing? What config setting on the shard do I need? Will this fail in prod?
- Why is the DUT dead?
Filed b/37163626 to recover those DUTs and see why they died.
Cc: xixuan@chromium.org
Project Member

Comment 17 by bugdroid1@chromium.org, Apr 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/f9d50162d6af62da07caa4dbbd14e8971edc43f3

commit f9d50162d6af62da07caa4dbbd14e8971edc43f3
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 07 23:30:48 2017

Revert "chromeos_config: Temporarily mark cyan-paladin experimental"

This reverts commit 4d80732c8a6684b0e2103fb96bc61650a2966f76.

Reason for revert: cyan-paladin has passed two times in a row after apache restart. We haven't root caused the problem yet, but letting cyan bugs in over the weekend is hardly the response.

Original change's description:
> chromeos_config: Temporarily mark cyan-paladin experimental
> 
> while we figure out what's causing it great grief. This takes away some
> cheets coverage from CQ.
> 
> BUG= chromium:708262 
> TEST=unittests
> 
> Change-Id: I6db06282e7e6e1b74ed4daf901f8b5547091cca2
> Reviewed-on: https://chromium-review.googlesource.com/471867
> Reviewed-by: Don Garrett <dgarrett@chromium.org>
> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
> 

TBR=jrbarnette@chromium.org,dgarrett@chromium.org,pprabhu@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:708679

Change-Id: I1b2eb8fab3ab324c392d0f3ab7f24a448551f076
Reviewed-on: https://chromium-review.googlesource.com/471812
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/f9d50162d6af62da07caa4dbbd14e8971edc43f3/cbuildbot/config_dump.json
[modify] https://crrev.com/f9d50162d6af62da07caa4dbbd14e8971edc43f3/cbuildbot/chromeos_config.py

Status: Fixed (was: Started)
Follow up on #15.

pprabhu@pprabhu:/work/chromiumos/chromeos-admin$ ssh chromeos4-row10-rack9-host15.cros -- cat /etc/lsb-release | grep RELEASE_VERSION
CHROMEOS_RELEASE_VERSION=9334.28.0
pprabhu@pprabhu:/work/chromiumos/chromeos-admin$ ssh chromeos4-row10-rack9-host21.cros -- cat /etc/lsb-release | grep RELEASE_VERSION
CHROMEOS_RELEASE_VERSION=9334.28.0

So, the two dead DUTs are on the new stable version. It's still a mystery why they died.
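
The manual check above could be scripted along these lines. This is a sketch, assuming the lsb-release text has already been fetched from the DUT; `release_version` and `on_stable_version` are hypothetical helper names, and the comparison assumes the stable version carries a milestone prefix such as "R58-" that lsb-release omits, as in the output shown here.

```python
def release_version(lsb_release_text):
    """Return CHROMEOS_RELEASE_VERSION from lsb-release contents, or None."""
    for line in lsb_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "CHROMEOS_RELEASE_VERSION":
            return value.strip()
    return None


def on_stable_version(lsb_release_text, stable_version):
    """True if the DUT's release version matches e.g. 'R58-9334.28.0'."""
    version = release_version(lsb_release_text)
    # Drop the milestone prefix ("R58-") before comparing.
    return version is not None and stable_version.split("-", 1)[-1] == version


# Usage with the value reported by both DUTs above.
sample = "CHROMEOS_RELEASE_VERSION=9334.28.0\n"
dut_is_on_stable = on_stable_version(sample, "R58-9334.28.0")
```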

Meanwhile, changing the stable_version for quawks to be the same as prod has fixed this problem with testing push.

The blocking bug 697141 tracks automating this, and is the real follow up action.

The other two blocking bugs are real issues uncovered by this bug, but are not needed for the resolution here.

Comment 19 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 21 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
