whirlwind-paladin failing frequently, not for obvious bad CLs
Issue description
Recent example: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/8488
The recent example appears to be a provision flake specific to whirlwind. The previous 4 builds also failed. This is blocking progress on other CLs in the CQ. I'm going to mark whirlwind as experimental temporarily.
,
Jul 6 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/2f0012b2823daec2dde944157116b11b76915d6c
commit 2f0012b2823daec2dde944157116b11b76915d6c
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Jul 06 02:34:54 2017
chromeos_config: temporarily mark whirlwind-paladin experimental
BUG= chromium:739583
TEST=None
Change-Id: I94c789eb1efc46b48f664baa6932113a832894db
Reviewed-on: https://chromium-review.googlesource.com/560547
Reviewed-by: Kishan Kunduru <kkunduru@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
[modify] https://crrev.com/2f0012b2823daec2dde944157116b11b76915d6c/cbuildbot/config_dump.json
[modify] https://crrev.com/2f0012b2823daec2dde944157116b11b76915d6c/cbuildbot/chromeos_config.py
,
Jul 6 2017
Reducing to P1, as this is no longer a CQ-wide outage.
,
Jul 6 2017
https://uberchromegw.corp.google.com/i/chromeos/builders/whirlwind-paladin/builds/8484/steps/HWTest%20%5Bjetstream_cq%5D/logs/stdio
jetstream_DiagnosticReport ERROR: jetstream_DiagnosticReport: test does not exist
jetstream_ApiServerAttestation ERROR: jetstream_ApiServerAttestation: test does not exist
jetstream_BluetoothBeaconing ERROR: jetstream_BluetoothBeaconing: test does not exist
...
Most recent failure seems to be an issue with ap-controller not starting by the time the test tries to talk to it:
https://uberchromegw.corp.google.com/i/chromeos/builders/whirlwind-paladin/builds/8496/steps/HWTest%20%5Bjetstream_cq%5D/logs/stdio
...
Suite job [ PASSED ]
provision [ FAILED ]
provision FAIL: ap-controller service not found, command execution error, completed successfully
provision [ FAILED ]
provision FAIL: command execution error, completed successfully
jetstream_ApiServerDeveloperConfiguration [ PASSED ]
...
provision http://cautotest/tko/retrieve_logs.cgi?job=/results/127268137-chromeos-test/
provision http://cautotest/tko/retrieve_logs.cgi?job=/results/127268142-chromeos-test/
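For triage, the per-test status lines quoted above can be tallied mechanically. The following is a minimal sketch, assuming the `name [ PASSED ]` / `name [ FAILED ]` shape seen in this log excerpt; `STATUS_RE` and `count_statuses` are hypothetical names for illustration, not part of autotest.

```python
import re
from collections import Counter

# Matches lines such as "provision [ FAILED ]" or "Suite job [ PASSED ]".
# The format is assumed from the excerpt quoted in this comment only.
STATUS_RE = re.compile(r"^(?P<name>.+?)\s+\[ (?P<status>PASSED|FAILED) \]$")


def count_statuses(lines):
    """Count (test name, status) pairs; lines that are not status lines are skipped."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.match(line.strip())
        if match:
            counts[(match.group("name"), match.group("status"))] += 1
    return counts
```

Detail lines such as `provision FAIL: ...` do not end in a bracketed status, so they are ignored and each failure is counted once.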
,
Jul 6 2017
Laurence, could you take a look at this and see whether it can be addressed in autotest?
,
Jul 6 2017
Re #4: The framework verifies that ap-controller is running before launching the test. In this case, the ap-controller service did not start successfully after the device was provisioned. The framework looks for ap-controller for up to 1 minute, so it looks like a possible instance of this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=652565 However, this is more extreme since it affected multiple devices. There appears to be a pattern of failed provisioning followed by a repair followed by a failed test followed by a reset followed by a successful test. For example: https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=5596
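The "look for ap-controller for up to 1 minute" behaviour is a poll-with-deadline pattern. Below is a rough sketch of that pattern, not the framework's actual code: `wait_until`, its parameters, and the 2-second interval are assumptions made for illustration; only the 60-second limit comes from the comment above.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 60.0,
               interval_s: float = 2.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or `timeout_s` has elapsed.

    Returns True as soon as the condition holds, False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval_s, remaining))
```

With this shape, a service that takes longer than the timeout to come up is reported as not running even though it would eventually start, which is consistent with the provisioning failures described here.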
,
Jul 6 2017
We also have cases of host verification failing with system-services not running: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/127282376-chromeos-test/chromeos4-row10-jetstream-host1/debug I noticed this on Friday, July 7, and mentioned it on crosoncall. At the time, it was felt that it was due to a bad CL hitting the CQ.
,
Jul 6 2017
Sorry, first noticed that on Friday June 30.
,
Jul 6 2017
All failures seem to be due to provisioning failures, due to either ap-controller not running or system-services not running. (Except for some rare cases of "test does not exist", noted in #4, which would appear to be a build issue).
,
Jul 7 2017
I'm able to repro this locally (using a moblab). I'm also seeing all whirlwind tests in the Talyn lab failing provisioning, starting with build R61-9718.0.0. It seems likely a CL was allowed in which is causing provisioning failures.
,
Jul 7 2017
https://chrome-internal-review.googlesource.com/#/c/404970/ https://chrome-internal-review.googlesource.com/#/c/405071/ These CLs are causing ap-monitor crashes. I am working on a fix, which should be ready in a few hours.
,
Jul 7 2017
Re #11: Those changes went in to R61-9714.0.0, but a more serious breakage seems to have happened in R61-9718.0.0 (at least from the standpoint of provisioning).
,
Jul 8 2017
Looks like it is taking longer for system-services to start on whirlwind, which is causing verify.cros to fail, which in turn causes provisioning to fail.
,
Jul 10 2017
This looks similar to a bug in autoupdate/provisioning that should be fixed with this change: https://chromium-review.googlesource.com/c/556356/ ap-controller will need its own patch, but polling seems to be an acceptable design choice.
,
Jul 10 2017
Just to confirm the suspicions in 10 and 13, it does appear this is a regression and not an example of test flake. The system startup of the Whirlwind platform has changed unexpectedly and startup is now taking significantly longer, enough to alert the CQ of the problem. Laurence is working on https://chromium-review.googlesource.com/c/564221/ to get the tests green again and unblock the CQ so we can continue finding new problems. However, this does NOT address the root cause of the regression.
,
Jul 10 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a1d2302aecac09a39a41912457b07488f292d872
commit a1d2302aecac09a39a41912457b07488f292d872
Author: Laurence Goodby <lgoodby@google.com>
Date: Mon Jul 10 22:27:27 2017
autotest: Fix whirlwind host verification
Starting with build R61-7618, it takes longer for system-services to start up on whirlwinds, causing verify.cros to fail on checking system-services. This change ensure that ap-controller is up and running before attempting host verification.
BUG= chromium:739583
TEST=Run on moblab.
Change-Id: Ic7ab1c3b9dfeb6537cdb5af6c718af536189674b
Reviewed-on: https://chromium-review.googlesource.com/564221
Commit-Ready: Laurence Goodby <lgoodby@chromium.org>
Tested-by: Laurence Goodby <lgoodby@chromium.org>
Reviewed-by: Suresh Rajashekara <sureshraj@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Laurence Goodby <lgoodby@chromium.org>
[modify] https://crrev.com/a1d2302aecac09a39a41912457b07488f292d872/server/hosts/jetstream_host.py
,
Jul 10 2017
Workaround should take effect with the next infra push.
,
Jul 11 2017
Filed b/63546679 for the increased boot time.
,
Jul 12 2017
Workaround was pushed to prod today, whirlwind CQ is passing again. The original issue (still unresolved) was caused by a 10x increase in whirlwind boot time, which was causing provisioning to fail (b/63546679).
,
Jul 12 2017
The breaking change has been identified: https://chromium-review.googlesource.com/c/437525/
The failure appears to have gone like this:
- the 'hciconfig hci0 down' command broke (now has no effect on whirlwind)
- this caused a udev script that is specific to whirlwind to hang
- this in turn caused system-services to take more than two minutes to come up
- which in turn caused provisioning to fail in verify.cros
Further details in b/63546679.
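The chain above starts with a helper that hangs when a command stops having its expected effect. One general way to keep such a hang from stalling everything downstream is to bound the helper with a timeout. The sketch below illustrates that idea only; it is not the fix tracked in b/63546679, and `run_bounded` is a hypothetical name.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run `cmd` and return its exit code, or None if it ran longer than `timeout_s`.

    subprocess.run() kills the child process when the timeout expires
    and then raises TimeoutExpired, which is mapped to None here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s, check=False).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # A stand-in for a helper that never returns in time.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_bounded(slow, timeout_s=0.5))
```

A caller can then treat `None` as "the helper hung" and continue or fail fast, rather than delaying the services that depend on it.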
,
Jul 12 2017
,
Jul 12 2017
Issue 738520 has been merged into this issue.
,
Jul 13 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/a9e3886f240ed90b72ad83f2e71258605ca61f1c
commit a9e3886f240ed90b72ad83f2e71258605ca61f1c
Author: Laurence Goodby <lgoodby@google.com>
Date: Thu Jul 13 18:44:42 2017
chromeos_config: restore whirlwind-paladin to important
BUG= chromium:739583
TEST=Monitoring whirlwind-paladin
Change-Id: I3180a55b405ca6224065d36f6ace45ea9f9681c1
Reviewed-on: https://chromium-review.googlesource.com/568710
Commit-Ready: Laurence Goodby <lgoodby@chromium.org>
Tested-by: Laurence Goodby <lgoodby@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Laurence Goodby <lgoodby@chromium.org>
[modify] https://crrev.com/a9e3886f240ed90b72ad83f2e71258605ca61f1c/cbuildbot/config_dump.json
[modify] https://crrev.com/a9e3886f240ed90b72ad83f2e71258605ca61f1c/cbuildbot/chromeos_config.py
Comment 1 by akes...@chromium.org
, Jul 6 2017