test_push: autoupdate_EndtoEndTest test followed by powerwash on the same DUT fails
Issue description

This happens on almost every test_push run, e.g.: http://chromeos-autotest.hot.corp.google.com/afe/#tab_id=view_host&object_id=2

The problem is this:
- autoupdate_EndtoEndTest first installs the FSI version. In this case that's quawks-release/R54-8743.44.0.
- It goes to the release GS bucket to get the image and artifacts.
- For some reason that test fails.
- We actually mark it as a pass; that's a separate bug.
- This leaves the DUT with R54-8743.44.0 installed.
- The powerwash test runs after this.
- It looks at the current version and goes to gs://chromeos-image-archive to try to obtain the autotest tarball and other artifacts for this build.
- But that location is a throwaway GS bucket, and we recently cleaned out that version.
- powerwash fails.

+gwendal, who is working on this issue in a broader context.

This is blocking test_push at the moment. I think I can work around this by forcing the powerwash test to first provision the DUT to a sane build.
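The failure mode described above can be sketched roughly as follows. This is an illustration only, not autotest's actual code: the function names, the "<bucket>/<build>" path layout, and the set standing in for the bucket contents are all assumptions made for the example.

```python
# Hypothetical sketch of the failure mode; not autotest's actual code.
IMAGE_ARCHIVE_BUCKET = "gs://chromeos-image-archive"  # bucket named in the report


def infer_artifact_url(build):
    """Guess where a build's artifacts live from the build name alone.

    The "<bucket>/<build>" layout is an assumption made for illustration.
    """
    return "%s/%s" % (IMAGE_ARCHIVE_BUCKET, build)


def stage_artifacts(build, existing_urls):
    """Return the inferred artifact URL, or raise if that location is gone.

    existing_urls is a plain set standing in for the bucket contents.
    """
    url = infer_artifact_url(build)
    if url not in existing_urls:
        raise LookupError("artifacts for %s not found at %s" % (build, url))
    return url
```

With a build that was installed from the release bucket and later cleaned out of the throwaway bucket, `stage_artifacts` raises, which corresponds to the powerwash failure reported here.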
,
Apr 4 2017
I have a CL to do what I claimed above: https://chromium-review.googlesource.com/c/468087/

But it doesn't go all the way:
- The following provision fails for the same reason. Worse, it fails without marking the DUT with /var/tmp/provision_failed.
- So the following repair job comes along and does nothing.
- Also, a provision-repair cycle is too long for test_push's powercycle test, so test_push would fail even if this cycle did work.

Right now, I've manually marked DUTs as bad and forced a reverify. This _should_ install a good build on them via repair, and this _may_ get us a passing test_push.

BUT, this final observation means that we may be losing quawks (and other?) DUTs to this repair-verify cycle in prod as well.
,
Apr 4 2017
Oh wow. Now I've successfully transitioned those DUTs to repair-failed state: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row10-rack9-host15/1468-repair/20170404125553/

Repair also fails for the same reason. Bye-bye test_push @ 1:00 PM.
,
Apr 4 2017
I don't see any sign of this affecting prod. In particular, I don't see the failure and repair rate shooting up here: http://shortn/_OjLUKDuFC5
,
Apr 4 2017
I was afraid this might be hitting prod, but signals suggest otherwise. This _is_ blocking test_push, and I'll get to it soon.
,
Apr 6 2017
OK, I need a test_push as a matter of course and I'm no wiser about what to do here. + some people for ideas + popping this to top of my stack.
,
Apr 6 2017
,
Apr 6 2017
,
Apr 6 2017
With the two blocking bugs fixed, at least DUTs with servo will be able to get out of this situation. DUTs without servo are still toast. The real fix here is to remember the URL from which a build was installed instead of trying to infer it. But I can't block push-to-prod for that fix.
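The "remember the URL" idea could look something like the sketch below. It is not the fix that landed and not autotest's API: the dict used as storage, both function names, and the `gs://example-release-bucket` placeholder in the usage note are hypothetical.

```python
# Hypothetical sketch: remember the install source rather than inferring it.


def record_install(state, hostname, build, source_url):
    """Store the URL a build was actually installed from for this host."""
    state[hostname] = {"build": build, "source_url": source_url}


def artifact_url_for(state, hostname, current_build, fallback_bucket):
    """Prefer the recorded URL; infer from the build name only as a fallback.

    The recorded entry is used only if it matches the build currently on
    the DUT, so a stale record never points at the wrong artifacts.
    """
    entry = state.get(hostname)
    if entry is not None and entry["build"] == current_build:
        return entry["source_url"]
    # Assumed "<bucket>/<build>" layout, for illustration only.
    return "%s/%s" % (fallback_bucket, current_build)
```

For example, after `record_install(state, "host15", "quawks-release/R54-8743.44.0", "gs://example-release-bucket/quawks-release/R54-8743.44.0")`, a later lookup for that host and build returns the recorded location instead of a guessed path under gs://chromeos-image-archive.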
,
Apr 7 2017
I changed the stable_version DEFAULT on the push master; it now reads:

chromeos-test@chromeos-autotest:~$ /usr/local/autotest/cli/atest stable_version list | grep DEFAULT
DEFAULT | R58-9334.28.0

I've re-kicked test_push. I expect it to pass.

Note that only one of the underlying bugs is expected to be fixed at this point. We've moved back to the old repair flow in issue 709281, which means repair can recover these DUTs now. Changing the stable version means that provision will now try to install a newer version on the DUT (so it will no longer fail).
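If the check needs to be scripted, the listing could be parsed along these lines. This is a sketch that assumes every row of `atest stable_version list` has the same "name | version" shape as the DEFAULT line shown above; only that one line appears in this report, so the general format is an assumption, and `parse_stable_versions` is a made-up helper name.

```python
def parse_stable_versions(output):
    """Map each name (e.g. "DEFAULT") to its version string.

    Assumes one "name | version" pair per line; lines without a "|"
    are skipped.
    """
    versions = {}
    for line in output.splitlines():
        if "|" not in line:
            continue
        name, _, version = line.partition("|")
        versions[name.strip()] = version.strip()
    return versions
```

Calling it on the output captured above gives a dict whose "DEFAULT" entry is "R58-9334.28.0".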
,
Apr 7 2017
,
Apr 7 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/4d80732c8a6684b0e2103fb96bc61650a2966f76

commit 4d80732c8a6684b0e2103fb96bc61650a2966f76
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 07 19:56:41 2017

chromeos_config: Temporarily mark cyan-paladin experimental

while we figure out what's causing it great grief. This takes away some
cheets coverage from CQ.

BUG= chromium:708262
TEST=unittests

Change-Id: I6db06282e7e6e1b74ed4daf901f8b5547091cca2
Reviewed-on: https://chromium-review.googlesource.com/471867
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/4d80732c8a6684b0e2103fb96bc61650a2966f76/cbuildbot/config_dump.json
[modify] https://crrev.com/4d80732c8a6684b0e2103fb96bc61650a2966f76/cbuildbot/chromeos_config.py
,
Apr 7 2017
CL in #12 doesn't belong here. That belongs to another fire.
,
Apr 7 2017
Re #10: test_push failed again (twice).

Currently:
- All quawks DUTs are down (not responding to ping).

For example, in this repair job: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/7804-autotest_system/chromeos4-row10-rack9-host15/debug/
It looks like we installed quawks-release/R58-9334.28.0 on the DUT, but then stateful_update failed (because the old flow required some files that don't exist (anymore?)). The DUT is now not pingable.

Questions:
- Why is the old repair flow failing? What config setting on the shard do I need? Will this fail in prod?
- Why is the DUT dead?
,
Apr 7 2017
Filed b/37163626 to recover those DUTs and see why they died.
,
Apr 7 2017
,
Apr 7 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/f9d50162d6af62da07caa4dbbd14e8971edc43f3

commit f9d50162d6af62da07caa4dbbd14e8971edc43f3
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 07 23:30:48 2017

Revert "chromeos_config: Temporarily mark cyan-paladin experimental"

This reverts commit 4d80732c8a6684b0e2103fb96bc61650a2966f76.

Reason for revert: cyan-paladin has passed two times in a row after apache restart.
We haven't root caused the problem yet, but letting cyan bugs in over the weekend is hardly the response.

Original change's description:
> chromeos_config: Temporarily mark cyan-paladin experimental
>
> while we figure out what's causing it great grief. This takes away some
> cheets coverage from CQ.
>
> BUG= chromium:708262
> TEST=unittests
>
> Change-Id: I6db06282e7e6e1b74ed4daf901f8b5547091cca2
> Reviewed-on: https://chromium-review.googlesource.com/471867
> Reviewed-by: Don Garrett <dgarrett@chromium.org>
> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
>

TBR=jrbarnette@chromium.org,dgarrett@chromium.org,pprabhu@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:708679

Change-Id: I1b2eb8fab3ab324c392d0f3ab7f24a448551f076
Reviewed-on: https://chromium-review.googlesource.com/471812
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/f9d50162d6af62da07caa4dbbd14e8971edc43f3/cbuildbot/config_dump.json
[modify] https://crrev.com/f9d50162d6af62da07caa4dbbd14e8971edc43f3/cbuildbot/chromeos_config.py
,
Apr 7 2017
Follow up on #15.

pprabhu@pprabhu:/work/chromiumos/chromeos-admin$ ssh chromeos4-row10-rack9-host15.cros -- cat /etc/lsb-release | grep RELEASE_VERSION
CHROMEOS_RELEASE_VERSION=9334.28.0
pprabhu@pprabhu:/work/chromiumos/chromeos-admin$ ssh chromeos4-row10-rack9-host21.cros -- cat /etc/lsb-release | grep RELEASE_VERSION
CHROMEOS_RELEASE_VERSION=9334.28.0

So, the two dead DUTs are on the new stable version. It's still a mystery why they died.

Meanwhile, changing the stable_version for quawks to be the same as prod has fixed this problem with testing push. The blocking bug 697141 tracks automating this, and is the real follow-up action. The other two blocking bugs are real issues uncovered by this bug, but are not needed for the resolution here.
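The manual check above (read CHROMEOS_RELEASE_VERSION from /etc/lsb-release and compare it with the stable version) can be sketched as below. Both helper names are hypothetical, and the comparison assumes the stable version string has the "R58-9334.28.0" shape seen in this thread, i.e. a milestone prefix, a dash, then the release version.

```python
def release_version(lsb_release_text):
    """Return the CHROMEOS_RELEASE_VERSION value, or None if it is absent."""
    for line in lsb_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "CHROMEOS_RELEASE_VERSION":
            return value.strip()
    return None


def is_on_stable(lsb_release_text, stable_version):
    """Check whether the DUT's release version matches the stable version.

    Assumes stable_version looks like "R58-9334.28.0"; only the part after
    the first "-" is compared with the lsb-release value.
    """
    installed = release_version(lsb_release_text)
    _, _, expected = stable_version.partition("-")
    return installed is not None and installed == expected
```

Feeding it the lsb-release line captured from either DUT together with "R58-9334.28.0" reports a match, which agrees with the observation that both dead DUTs are on the new stable version.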
,
May 30 2017
,
Aug 1 2017
,
Jan 22 2018
Comment 1 by pprabhu@chromium.org
, Apr 4 2017