
Issue 739583

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




whirlwind-paladin failing frequently, not for obvious bad CLs

Project Member Reported by akes...@chromium.org, Jul 6 2017

Issue description

Recent example: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/8488

Recent example appears to be a provision flake specific to whirlwind.

The previous 4 builds also failed.

This is blocking progress on other CLs in the CQ. I'm going to mark whirlwind as experimental temporarily.
 
Cc: xixuan@chromium.org kkunduru@chromium.org marcochen@chromium.org victorhsieh@chromium.org youcheng@chromium.org
Project Member

Comment 2 by bugdroid1@chromium.org, Jul 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/2f0012b2823daec2dde944157116b11b76915d6c

commit 2f0012b2823daec2dde944157116b11b76915d6c
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Jul 06 02:34:54 2017

chromeos_config: temporarily mark whirlwind-paladin experimental

BUG= chromium:739583 
TEST=None

Change-Id: I94c789eb1efc46b48f664baa6932113a832894db
Reviewed-on: https://chromium-review.googlesource.com/560547
Reviewed-by: Kishan Kunduru <kkunduru@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/2f0012b2823daec2dde944157116b11b76915d6c/cbuildbot/config_dump.json
[modify] https://crrev.com/2f0012b2823daec2dde944157116b11b76915d6c/cbuildbot/chromeos_config.py

Labels: -Pri-0 Pri-1
Reducing to P1, as this is no longer a CQ-wide outage.
Owner: sduvvuri@chromium.org
https://uberchromegw.corp.google.com/i/chromeos/builders/whirlwind-paladin/builds/8484/steps/HWTest%20%5Bjetstream_cq%5D/logs/stdio

  jetstream_DiagnosticReport                    ERROR: jetstream_DiagnosticReport: test does not exist
  jetstream_ApiServerAttestation                ERROR: jetstream_ApiServerAttestation: test does not exist
  jetstream_BluetoothBeaconing                  ERROR: jetstream_BluetoothBeaconing: test does not exist
...

Most recent failure seems to be an issue with ap-controller not starting by the time the test tries to talk to it:

https://uberchromegw.corp.google.com/i/chromeos/builders/whirlwind-paladin/builds/8496/steps/HWTest%20%5Bjetstream_cq%5D/logs/stdio

...
Suite job                                   [ PASSED ]
  provision                                   [ FAILED ]
  provision                                     FAIL: ap-controller service not found, command execution error, completed successfully
  provision                                   [ FAILED ]
  provision                                     FAIL: command execution error, completed successfully
  jetstream_ApiServerDeveloperConfiguration   [ PASSED ]
...
  provision http://cautotest/tko/retrieve_logs.cgi?job=/results/127268137-chromeos-test/
  provision http://cautotest/tko/retrieve_logs.cgi?job=/results/127268142-chromeos-test/
Owner: lgoo...@chromium.org
Laurence,
  Could you take a look at this and see whether it can be addressed in autotest?
Re #4: The framework verifies that ap-controller is running before launching the test. In this case, the ap-controller service did not start successfully after the device was provisioned. The framework looks for ap-controller for up to 1 minute, so it looks like a possible instance of this issue:

  https://bugs.chromium.org/p/chromium/issues/detail?id=652565

However, this is more extreme since it affected multiple devices.

There appears to be a pattern: failed provisioning, then a repair, then a failed test, then a reset, then a successful test.

For example:

  https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=5596

We also have cases of host verification failing with system-services not running:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/127282376-chromeos-test/chromeos4-row10-jetstream-host1/debug

I noticed this on Friday, July 7, and mentioned it on crosoncall.
At the time, it was felt that it was due to a bad CL hitting the CQ.

Sorry, I first noticed that on Friday, June 30.

All failures seem to be due to provisioning failures, due to either ap-controller not running or system-services not running.

(Except for some rare cases of "test does not exist", noted in #4, which would appear to be a build issue).
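
For anyone triaging similar logs, here is a rough sketch of bucketing the failure lines by cause. It is not part of autotest or chromite. The "ap-controller service not found" and "test does not exist" strings come from the log excerpts in this thread; the system-services pattern is an assumption, since the exact wording of that failure is not quoted here. The names PATTERNS, classify and summarize are hypothetical.

```python
import re
from collections import Counter

# (label, pattern) pairs. The first and third patterns match strings quoted
# in this thread; the second is an ASSUMED wording for the system-services
# failure, which is not quoted verbatim anywhere in the thread.
PATTERNS = [
    ("ap-controller not running", re.compile(r"ap-controller service not found")),
    ("system-services not running", re.compile(r"system-services", re.IGNORECASE)),
    ("test does not exist", re.compile(r"test does not exist")),
]


def classify(line):
    """Return the label of the first matching pattern, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count failure causes over lines that carry a FAIL: or ERROR: message."""
    counts = Counter()
    for line in lines:
        if "FAIL:" not in line and "ERROR:" not in line:
            continue  # skip status rows such as "[ PASSED ]" / "[ FAILED ]"
        counts[classify(line) or "other"] += 1
    return counts


# Lines copied from the excerpts earlier in this thread.
SAMPLE = [
    "  provision                                   [ FAILED ]",
    "  provision                                     FAIL: ap-controller service not found, command execution error, completed successfully",
    "  provision                                     FAIL: command execution error, completed successfully",
    "  jetstream_DiagnosticReport                    ERROR: jetstream_DiagnosticReport: test does not exist",
    "  jetstream_ApiServerDeveloperConfiguration   [ PASSED ]",
]

if __name__ == "__main__":
    for cause, count in summarize(SAMPLE).items():
        print(cause, count)
```

The second provision failure in the sample carries no specific reason, so it lands in "other"; those still need a look at the job logs.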

I'm able to repro this locally (using a moblab).

I'm also seeing all whirlwind tests in the Talyn lab failing provisioning, starting with build R61-9718.0.0.

It seems likely a CL was allowed in which is causing provisioning failures.

https://chrome-internal-review.googlesource.com/#/c/404970/
https://chrome-internal-review.googlesource.com/#/c/405071/

These CLs are causing ap-monitor crashes. I am working on a fix, which should be ready in a few hours.
Re #11: Those changes went in to R61-9714.0.0, but a more serious breakage seems to have happened in R61-9718.0.0 (at least from the standpoint of provisioning).

Looks like it is taking longer for system-services to start on whirlwind, which is causing verify.cros to fail, which in turn causes provisioning to fail.
This looks similar to a bug in autoupdate/provisioning that should be fixed with this change:
  https://chromium-review.googlesource.com/c/556356/

ap-controller will need its own patch, but polling seems to be an acceptable design choice.
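
To make the polling idea concrete, here is a minimal sketch of a poll-with-deadline loop. This is not the code in jetstream_host.py or in the CL above; wait_for_service, FakeClock and simulate are hypothetical names, and the 5-second interval is an assumption. The fake clock only exists so the example runs instantly instead of sleeping.

```python
import time
from typing import Callable


def wait_for_service(is_running: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 5.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_running() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if is_running():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Simulated clock: sleep() just advances the current time."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(ready_at_s: float, timeout_s: float) -> bool:
    """Pretend the service comes up ready_at_s seconds after boot."""
    fake = FakeClock()
    return wait_for_service(lambda: fake.now >= ready_at_s,
                            timeout_s=timeout_s,
                            clock=fake.time,
                            sleep=fake.sleep)


if __name__ == "__main__":
    print(simulate(ready_at_s=30, timeout_s=60))    # service up within the window
    print(simulate(ready_at_s=130, timeout_s=60))   # service slower than the window
    print(simulate(ready_at_s=130, timeout_s=180))  # longer window covers the delay
```

With a 60-second window, a service that needs roughly two minutes to come up is reported as missing, which matches what the provision step was seeing. A longer window only hides the slow startup; it does not explain it.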

Comment 15 by ra...@google.com, Jul 10 2017

Just to confirm the suspicions in #10 and #13: this does appear to be a regression, not test flake.

The system startup of the Whirlwind platform has changed unexpectedly, and startup now takes significantly longer, enough for the CQ to flag the problem.

Laurence is working on https://chromium-review.googlesource.com/c/564221/ to get the tests green again and unblock the CQ so we can continue finding new problems. However, this does NOT address the root cause of the regression.


Project Member

Comment 16 by bugdroid1@chromium.org, Jul 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a1d2302aecac09a39a41912457b07488f292d872

commit a1d2302aecac09a39a41912457b07488f292d872
Author: Laurence Goodby <lgoodby@google.com>
Date: Mon Jul 10 22:27:27 2017

autotest: Fix whirlwind host verification

Starting with build R61-7618, it takes longer for system-services
to start up on whirlwinds, causing verify.cros to fail on checking
system-services. This change ensure that ap-controller is up and
running before attempting host verification.

BUG= chromium:739583 
TEST=Run on moblab.

Change-Id: Ic7ab1c3b9dfeb6537cdb5af6c718af536189674b
Reviewed-on: https://chromium-review.googlesource.com/564221
Commit-Ready: Laurence Goodby <lgoodby@chromium.org>
Tested-by: Laurence Goodby <lgoodby@chromium.org>
Reviewed-by: Suresh Rajashekara <sureshraj@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Laurence Goodby <lgoodby@chromium.org>

[modify] https://crrev.com/a1d2302aecac09a39a41912457b07488f292d872/server/hosts/jetstream_host.py

Workaround should take effect with the next infra push.

Filed b/63546679 for the increased boot time.

Status: Fixed (was: Untriaged)
Workaround was pushed to prod today, whirlwind CQ is passing again.

The original issue (still unresolved) was caused by a 10x increase in whirlwind boot time, which was causing provisioning to fail (b/63546679).

The breaking change has been identified: https://chromium-review.googlesource.com/c/437525/

The failure appears to have gone like this:

 - the 'hciconfig hci0 down' command broke (now has no effect on whirlwind)
 - this caused a udev script that is specific to whirlwind to hang
 - this in turn caused system-services to take more than two minutes to come up
 - which in turn caused provisioning to fail in verify.cros

Further details in b/63546679.

Cc: dmitrygr@chromium.org
Cc: lgoo...@chromium.org ayatane@chromium.org akes...@chromium.org
Issue 738520 has been merged into this issue.
Project Member

Comment 23 by bugdroid1@chromium.org, Jul 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/a9e3886f240ed90b72ad83f2e71258605ca61f1c

commit a9e3886f240ed90b72ad83f2e71258605ca61f1c
Author: Laurence Goodby <lgoodby@google.com>
Date: Thu Jul 13 18:44:42 2017

chromeos_config: restore whirlwind-paladin to important

BUG= chromium:739583 
TEST=Monitoring whirlwind-paladin

Change-Id: I3180a55b405ca6224065d36f6ace45ea9f9681c1
Reviewed-on: https://chromium-review.googlesource.com/568710
Commit-Ready: Laurence Goodby <lgoodby@chromium.org>
Tested-by: Laurence Goodby <lgoodby@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Laurence Goodby <lgoodby@chromium.org>

[modify] https://crrev.com/a9e3886f240ed90b72ad83f2e71258605ca61f1c/cbuildbot/config_dump.json
[modify] https://crrev.com/a9e3886f240ed90b72ad83f2e71258605ca61f1c/cbuildbot/chromeos_config.py
