M60: Swarming timeout issues with daisy-spring/daisy-skate
Issue description
HWTest is failing on the latest M60 Stable build for daisy-skate/daisy-spring. This is suspected to be a swarming timeout problem:
- https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy_skate-release%20release-R60-9592.B/builds/61/steps/HWTest%20%5Bsanity%5D/logs/stdio
- https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy_spring-release%20release-R60-9592.B/builds/64/steps/HWTest%20%5Bsanity%5D/logs/stdio
Adding deputy nxia@.
,
Aug 15 2017
,
Aug 15 2017
Looking at the log, the tests were marked as passed in cautotest. Is the swarming timeout really the cause of the failures here? +@xixuan https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy_skate-release%20release-R60-9592.B/builds/61 http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=134360776
,
Aug 15 2017
The reason it's failing is that the HWTest runs for too long. The daisy_skate builder kicks off a first swarming call to run the suite: https://chromeos-proxy.appspot.com/task?id=37e9416181482010&refresh=10&show_raw=1 This call returns immediately. The builder then kicks off a second swarming call to monitor the running results: https://chromeos-proxy.appspot.com/task?id=37e941b5fcd1bf10&refresh=10&show_raw=1 This call eventually succeeds, but it takes longer (about 7 hours) than the builder's deadline for HWTest.
I don't see anything wrong with the shard tick (chromeos-server13.cbf), and nothing wrong with swarming either. The shard did not pick up a host for this test until after 8:00:
08/11 08:12:13.769 INFO | scheduler_models:0565| Assigning host chromeos4-row9-rack5-host19 to entry HQE: 134710018, for job: 134360786 and host: no host has status:Queued
My guess is that the job took so long because there were not enough hosts ready for it: either the hosts were not ready, or too many weekly suite jobs were kicked off that night and grabbed the hosts.
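To make the host-assignment delay easier to check across a scheduler log, the quoted line can be parsed for its timestamp, host, HQE and job ID. The following is only a sketch: the regular expression is inferred from the single sample line above, and `parse_assignment` and the `year` default are hypothetical helpers, not part of autotest.

```python
import re
from datetime import datetime

# Layout inferred from the one sample line quoted in this comment (assumption).
ASSIGN_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+INFO\s*\|\s*"
    r"scheduler_models:\d+\|\s*Assigning host (?P<host>\S+) "
    r"to entry HQE: (?P<hqe>\d+), for job: (?P<job>\d+)"
)

SAMPLE = (
    "08/11 08:12:13.769 INFO | scheduler_models:0565| Assigning host "
    "chromeos4-row9-rack5-host19 to entry HQE: 134710018, for job: 134360786 "
    "and host: no host has status:Queued"
)


def parse_assignment(line, year=2017):
    """Return assignment details from a scheduler log line, or None.

    The log line carries no year, so the caller supplies one (assumption).
    """
    m = ASSIGN_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year}/{m['ts']}", "%Y/%m/%d %H:%M:%S.%f")
    return {
        "assigned_at": ts,
        "host": m["host"],
        "hqe": int(m["hqe"]),
        "job": int(m["job"]),
    }


info = parse_assignment(SAMPLE)
print(info["host"], info["job"], info["assigned_at"].isoformat())
```

Comparing `assigned_at` with the time the suite was kicked off would show how long the job waited for a DUT before it started running.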
,
Aug 15 2017
I think Pramod (pbathini@) has already tried some re-runs, but it seems they hit the same issues (not enough DUTs/timeouts). Pramod, can you add a link to the re-runs?
,
Aug 15 2017
Please reassign once we have more information. xixuan@'s reading so far is that this was caused by too much testing load.
,
Aug 15 2017
Here is the link to my re-runs: https://ubercautotest.corp.google.com/afe/#tab_id=job_list&state_filter=all&type_filter=all
,
Aug 17 2017
Sorry the link doesn't work for me.
,
Aug 17 2017
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135048686 This is the re-run result for snappy
,
Aug 17 2017
Error msg: ArtifactDownloadError: Could not find au_control.tar.bz2 in Google Storage at gs://chromeos-image-archive/snappy-release/R60-9592.82.0
This is not the same error as the 'not enough DUTs/timeouts' one on daisy_skate. It says there is no au_control package in Google Storage, which is true. So the re-run cmd '/usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/135048686-pbathini/hostless -u pbathini -l snappy-release/R60-9592.82.0-test_suites/control.au -s --lab True -P 135048686-pbathini/hostless -n /usr/local/autotest/results/drone_tmp/attach.2753' cannot succeed. Could you re-run with a release build that contains au_control.tar.bz2, or investigate why there is no au_control.tar.bz2 for snappy-release/R60-9592.82.0?
I can't find a proper owner to investigate this release build, since it's a sheriff issue rather than a deputy issue, so I'm temporarily assigning it back to @pbathini.
BTW, an error in logs like 'https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy_skate-release%20release-R60-9592.B/builds/61/steps/HWTest%20%5Bsanity%5D/logs/stdio' doesn't mean the HWTest did not finish successfully. At least for daisy_skate, https://chromeos-proxy.appspot.com/task?id=37e941b5fcd1bf10&refresh=10&show_raw=1 shows it finished, just later than the builder's deadline. So before re-running, check https://chromeos-proxy.appspot.com/ first to see whether your HWTest actually finished.
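When triaging several re-runs, it can help to pull the missing artifact and build out of the error text so the Google Storage location can be checked before trying again. A minimal sketch, assuming the message always has the shape quoted above and that the artifact would sit directly under the build directory; `missing_artifact` is a hypothetical helper, not an autotest function.

```python
import re

# Message shape taken from the single ArtifactDownloadError quoted above.
ERR_RE = re.compile(
    r"Could not find (?P<artifact>\S+) in Google Storage at (?P<gs_dir>gs://\S+)"
)

SAMPLE_ERROR = (
    "ArtifactDownloadError: Could not find au_control.tar.bz2 in Google "
    "Storage at gs://chromeos-image-archive/snappy-release/R60-9592.82.0"
)


def missing_artifact(message):
    """Extract the artifact name, build and expected object URL, or None."""
    m = ERR_RE.search(message)
    if m is None:
        return None
    gs_dir = m["gs_dir"].rstrip("/")
    # Split "gs://<bucket>/<build path>" into bucket and build.
    bucket, _, build = gs_dir[len("gs://"):].partition("/")
    return {
        "artifact": m["artifact"],
        "bucket": bucket,
        "build": build,
        # Assumption: the artifact is expected directly under the build dir.
        "url": f"{gs_dir}/{m['artifact']}",
    }


details = missing_artifact(SAMPLE_ERROR)
print(details["build"], details["url"])
```

The resulting URL is only the place to look; whether the object exists still has to be confirmed in the bucket itself.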
,
Aug 17 2017
,
Aug 17 2017
+rjahagir as she is M60 MO
,
Aug 17 2017
+dhaddock on snappy AU
,
Aug 17 2017
The AU control file is missing because the builder didn't generate it. It failed at the HWTest sanity phase. See issue 709663
,
Aug 17 2017
We have a total of 7 boards (daisy_spring, daisy_skate, banjo, enguarde, falco_li, snappy, chell) missing the AU tests in the M-60 9592.82.0 build. Is this expected on these boards?
,
Aug 17 2017
It is not "expected", but until issue 709663 is fixed, these artifacts are lost for good whenever the build fails at the sanity stage. Since the suite contains rollback and powerwash tests, we should manually spot check these for stable.
,
Aug 17 2017
As an aside, I'm no longer sure of the reasoning behind the split between the paygen_au_* and au suites. Why do we need this separate AU suite again? Could we move autoupdate_Rollback and platform_powerwash to bvt-inline and just not run the extra npo autoupdate_endtoendtest?
,
Aug 17 2017
TL;DR: That might be reasonable. Details: As I understand it, the answer is more about the builders than the tests themselves. The paygen payload tests (on the builder side) can only run on a build that gets signed, after signing finishes. This means release builds only, and only near the end of the build. The AUSuite is a more generic test suite that can be run from any build that generates test images. I don't know if the CQ runs it, but it could. Its images and payloads are NOT 100% identical to paygen images/payloads, since they predate the signer's changes to the images. The AUSuite runs much earlier in the build than the payload tests (which are generally the final thing to finish in release builds). We sometimes talk as if it protects the lab from builds too broken to update, but since we don't block other tests on it, that doesn't really happen. The sanity suite is where any "protect the lab" tests should be.
,
Sep 19 2017
AUTestStage is no more, so we won't run into the "can't find au_control" error anymore.
Comment 1 by josa...@chromium.org
, Aug 15 2017