
Issue 755400

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



M60: Swarming timeout issues with daisy-spring/daisy-skate

Project Member Reported by josa...@chromium.org, Aug 15 2017

Issue description

Cc: pbath...@chromium.org
Similar issue on banjo, enguarde, falco_li, and snappy.

Cc: dchan@chromium.org

Comment 3 by nxia@chromium.org, Aug 15 2017

Cc: xixuan@chromium.org pprabhu@chromium.org
Looking at the log, the tests were marked as passed in cautotest. Is the swarming timeout the cause of the failures here? +@xixuan

https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy_skate-release%20release-R60-9592.B/builds/61

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=134360776

Comment 4 by xixuan@chromium.org, Aug 15 2017

It's failing because the HWTest runs for too long:

The daisy_skate builder kicks off a first swarming call to run the suite, which returns immediately: https://chromeos-proxy.appspot.com/task?id=37e9416181482010&refresh=10&show_raw=1

The daisy_skate builder then kicks off a second swarming call to monitor the running results. This call eventually succeeds, but it takes longer (about 7 hours) than the builder's deadline for HWTest: https://chromeos-proxy.appspot.com/task?id=37e941b5fcd1bf10&refresh=10&show_raw=1

I don't see anything wrong with the shard tick (chromeos-server13.cbf), and nothing wrong with swarming either. The shard didn't pick up a host for this test until around 8:00:

08/11 08:12:13.769 INFO |  scheduler_models:0565| Assigning host chromeos4-row9-rack5-host19 to entry HQE: 134710018, for job: 134360786 and host: no host has status:Queued
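
For anyone comparing assignment times across jobs, the fields in a scheduler line like the one above can be pulled out with a short script. This is only a sketch: the regular expression is inferred from that single sample line, `parse_assignment` is a hypothetical helper (not part of autotest), and the year is an assumption because the log timestamp does not carry one.

```python
import re
from datetime import datetime

# Layout inferred from the one "Assigning host" line quoted in this issue;
# other scheduler lines may use a different layout and will not match.
ASSIGN_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<level>\w+)\s*\|"
    r".*?Assigning host (?P<host>\S+) to entry HQE: (?P<hqe>\d+), "
    r"for job: (?P<job>\d+)"
)


def parse_assignment(line, year=2017):
    """Return the host assignment fields from one log line, or None.

    The log timestamp has no year, so `year` is an assumption
    (2017 here, matching the date of this issue).
    """
    match = ASSIGN_RE.search(line)
    if match is None:
        return None
    stamp = "{}/{}".format(year, match.group("ts"))
    return {
        "time": datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f"),
        "level": match.group("level"),
        "host": match.group("host"),
        "hqe": int(match.group("hqe")),
        "job": int(match.group("job")),
    }
```

Comparing the parsed `time` with the time the suite job was created gives the time the job spent waiting for a host.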

My guess is that this job takes so long because there aren't enough hosts ready for it. Either the hosts are not ready, or too many weekly suite jobs were kicked off that night and grabbed the hosts.
I think Pramod (pbathini@) has already tried some re-runs, but it seems they hit the same issues (not enough DUTs/timeouts).

Pramod, can you add a link to the re-runs?


Owner: pbath...@chromium.org
Status: Assigned (was: Untriaged)
Please reassign once we have more information. xixuan@'s reading so far is that this was caused by too much testing load.
Owner: xixuan@chromium.org
Here is the link to my re-runs

https://ubercautotest.corp.google.com/afe/#tab_id=job_list&state_filter=all&type_filter=all

Comment 8 by xixuan@chromium.org, Aug 17 2017

Owner: pbath...@chromium.org
Sorry the link doesn't work for me.
Owner: xixuan@chromium.org
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135048686

This is the re-run result for snappy
Owner: pbath...@chromium.org
Error msg:

ArtifactDownloadError: Could not find au_control.tar.bz2 in Google Storage at gs://chromeos-image-archive/snappy-release/R60-9592.82.0

This is not the same error as on daisy_skate, where the problem was 'not enough DUTs/timeouts'.

It suggests that there's no au_control package in Google Storage, which is true. So the re-run cmd '/usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/135048686-pbathini/hostless -u pbathini -l snappy-release/R60-9592.82.0-test_suites/control.au -s --lab True -P 135048686-pbathini/hostless -n /usr/local/autotest/results/drone_tmp/attach.2753' cannot run successfully.

Could you try a re-run with a release build that contains au_control.tar.bz2, or investigate why there's no au_control.tar.bz2 for snappy-release/R60-9592.82.0?
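
Before kicking off a re-run, it may be worth confirming that the artifact the suite needs is actually in the archive. The sketch below only builds the expected URL and compares a listing against a list of required names. `artifact_url` and `missing_artifacts` are hypothetical helpers, and the listing is assumed to come from something like `gsutil ls` on the build directory; nothing here talks to Google Storage.

```python
def artifact_url(archive_root, build, artifact):
    """Join the archive root, build name, and artifact file name into one URL."""
    return "/".join([archive_root.rstrip("/"), build.strip("/"), artifact])


def missing_artifacts(listing, required):
    """Return the required artifact names that have no object in `listing`.

    `listing` is an iterable of object URLs (assumed to be one per line of
    a bucket listing); only the final path component is compared.
    """
    present = {url.rstrip("/").rsplit("/", 1)[-1] for url in listing}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Path components taken from the error message in this issue.
    print(artifact_url("gs://chromeos-image-archive",
                       "snappy-release/R60-9592.82.0",
                       "au_control.tar.bz2"))
```

An empty result from `missing_artifacts` means every required file was found in the listing; a non-empty one names the files a re-run would fail on.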

I can't find a proper owner to investigate this release build, since it's a sheriff issue rather than a deputy issue, so I'm temporarily assigning it back to @pbathini.


BTW, an error in logs like 'https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy_skate-release%20release-R60-9592.B/builds/61/steps/HWTest%20%5Bsanity%5D/logs/stdio' doesn't mean the HWTest did not finish successfully. At least for daisy_skate, https://chromeos-proxy.appspot.com/task?id=37e941b5fcd1bf10&refresh=10&show_raw=1 shows it finished, just later than the builder's deadline. So before a re-run, you can check https://chromeos-proxy.appspot.com/ first to see whether your HWTest actually finished.
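
The distinction above (finished, but after the builder gave up) can be written down as a small check. This is a sketch under assumptions: `task_id_from_url` and `classify_run` are hypothetical helpers, the elapsed time has to be read off the task page by hand, and the builder's HWTest deadline is not stated in this issue, so any value passed for it is a placeholder.

```python
from datetime import timedelta
from urllib.parse import parse_qs, urlparse


def task_id_from_url(url):
    """Extract the `id` query parameter from a task URL, or None if absent."""
    ids = parse_qs(urlparse(url).query).get("id")
    return ids[0] if ids else None


def classify_run(elapsed, deadline):
    """Classify a monitoring task by how long it ran.

    `elapsed` is a timedelta, or None if the task has not completed.
    `deadline` is the builder's HWTest deadline (an assumed input;
    the real value is not given in this issue).
    """
    if elapsed is None:
        return "still running"
    if elapsed > deadline:
        return "finished after deadline"
    return "finished within deadline"


if __name__ == "__main__":
    url = ("https://chromeos-proxy.appspot.com/task"
           "?id=37e941b5fcd1bf10&refresh=10&show_raw=1")
    print(task_id_from_url(url))
    # About 7 hours is the figure reported above; 4 hours is a placeholder.
    print(classify_run(timedelta(hours=7), timedelta(hours=4)))
```

A result of "finished after deadline" matches the daisy_skate case: the builder reports a failure even though the suite itself completed.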
Owner: josa...@chromium.org

Comment 12 by dchan@chromium.org, Aug 17 2017

Cc: rjahagir@chromium.org
+rjahagir as she is M60 MO

Comment 13 by dchan@chromium.org, Aug 17 2017

Cc: dhadd...@chromium.org
+dhaddock on snappy AU
The AU control file is missing because the builder didn't generate it. It failed at the HWTest sanity phase. See issue 709663.
We have a total of 7 boards (daisy_spring, daisy_skate, banjo, enguarde, falco_li, snappy, chell) missing the AU tests in the M-60 9592.82.0 build. Is this expected on these boards?
It is not "expected", but if the build fails at the sanity stage, these artifacts are lost, and that will keep happening until issue 709663 is fixed.

Since the suite contains rollback and powerwash tests we should manually spot check these for stable. 



Cc: dgarr...@chromium.org
As an aside, I'm no longer sure of the reasoning for the split between the paygen_au_* and au suites.

Why do we need this separate AU suite again? 
Could we move autoupdate_Rollback and platform_powerwash to bvt-inline and just not run the extra npo autoupdate_endtoendtest?
TL;DR:

That might be reasonable.

Details:

As I understand it, the answer is more about the builders than the tests themselves.

The paygen payload tests (on the builder side) can only run on a build that gets signed, and only after signing finishes. This means release builds only, and only near the end of the build.

The AUSuite is a more generic test suite that can be run from any build that generates test images. I don't know if the CQ runs it, but it could. Its images and payloads are NOT 100% identical to paygen images/payloads, since they predate the signer's changes to the images.

The AUSuite runs much earlier in the build than the payload tests (which are generally the final thing to finish in release builds). We sometimes talk as if it protects the lab from builds too broken to update, but since we don't block other tests on it, that doesn't really happen.

The sanity suite is where any "protect the lab" tests should be.

Labels: FixedByAURewrite
Status: Verified (was: Assigned)
AUTestStage is no more, so we won't run into the "can't find au_control" error anymore.
