CQ submitted a change via strategy:cq-submit-partial-pool-cq-history that broke HWTest on multiple platforms
Issue description
Change: https://chrome-internal-review.googlesource.com/c/chromeos/autotest-cheets/+/437193
Submitted here: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15849
Broke the next 10+ runs.
20:40:31: INFO: Checking change CL:*437193; relevant configs [u'master-paladin', 'betty-paladin', 'caroline-paladin', 'chell-nowithdebug-paladin', 'veyron_mighty-paladin', 'zoombini-paladin', 'reef-paladin', 'cave-paladin', 'fizz-paladin', 'oak-paladin', 'veyron_speedy-paladin', 'scarlet-paladin', 'veyron_tiger-paladin', 'auron_yuna-paladin', 'sentry-paladin', 'quawks-paladin', 'kevin-paladin', 'chell-paladin', 'veyron_jaq-paladin', 'hana-paladin', 'samus-paladin', 'cyan-paladin', 'wizpig-paladin', 'glados-paladin', 'eve-paladin', 'elm-paladin', 'coral-paladin', 'edgar-paladin', 'poppy-paladin', 'bob-paladin', 'veyron_jerry-paladin', 'veyron_minnie-paladin', 'reef-uni-paladin']; configs passed in history ['betty-paladin', 'caroline-paladin', 'zoombini-paladin', 'glados-paladin', 'fizz-paladin', 'oak-paladin', 'scarlet-paladin', 'hana-paladin', 'veyron_tiger-paladin', 'auron_yuna-paladin', 'quawks-paladin', 'veyron_jaq-paladin', 'chell-paladin', 'samus-paladin', 'veyron_minnie-paladin', 'wizpig-paladin', 'eve-paladin', 'chell-nowithdebug-paladin', 'coral-paladin', 'poppy-paladin', 'bob-paladin', 'veyron_jerry-paladin', 'cyan-paladin', 'reef-uni-paladin'].
20:40:31: INFO: Change CL:*437193 is verified with reasons ['strategy:cq-submit-partial-pool-builds-passed', 'strategy:cq-submit-partial-pool-cq-history'], choose the final reason strategy:cq-submit-partial-pool-cq-history.
nxia@: Can you help do initial analysis on why this slipped through the cracks?
+ Chase-Pending for initial analysis only. We need to understand why we let this bug into ToT before it becomes hard to do analysis due to waterfall restart etc.
,
Aug 25 2017
Issue 759097 has been merged into this issue.
,
Aug 25 2017
Note that the outage caused by this is detailed in bug 759039.
,
Aug 25 2017
20:40:31: INFO: The following changes will be submitted using board-aware submission logic: CL:*407788 CL:*432592 CL:*437193 CL:*437235 CL:*438852 CL:*439012 CL:*439772 CL:*439797 CL:*439995 CL:617375 CL:630577 CL:633704 CL:633783 CL:633803 CL:633845 CL:634127
Looks like it was submitted by board-aware submission, which may have interacted with history-aware submission.
,
Aug 25 2017
I looked into the previous CQ run: the relevant slaves that failed in master-paladin/15849 passed in master-paladin/15841, and the relevant slaves that failed in master-paladin/15841 passed in master-paladin/15849.
The question is why the CL didn't cause the same failure on the same boards in master-paladin/15841. If the CL, combined with another CL, caused the failures in 15849 as well as in the builds after it, that's an expected drawback of history-aware submission. If the CL caused flaky test failures, it could still have gotten through the CQ even without history-aware submission.
Passing to pprabhu@: can you work with the sheriffs to find out the reasons?
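A minimal sketch of the failure mode described above, assuming that history-aware submission accepts a CL once every relevant config has passed in either the current run or an earlier one. The function name and the board names are hypothetical; this is not the chromite implementation.

```python
def verified_with_history(relevant, passed_now, passed_in_history):
    """Hypothetical check: every relevant config passed in this run or an earlier one."""
    still_unverified = set(relevant) - set(passed_now) - set(passed_in_history)
    return not still_unverified


# Illustrative config names only; the two runs fail on disjoint boards.
relevant = ["board_a-paladin", "board_b-paladin"]
run_1_passed = ["board_a-paladin"]  # board_b failed in the earlier run
run_2_passed = ["board_b-paladin"]  # board_a failed in the later run

# No single run was green on every relevant config...
assert set(run_1_passed) != set(relevant)
assert set(run_2_passed) != set(relevant)

# ...but the union of the two runs covers all of them, so the CL is accepted.
accepted = verified_with_history(relevant, run_2_passed, run_1_passed)
```

Under that assumption, a CL that only breaks in combination with other changes, or only on some runs, can be accepted without any one run having passed on all relevant boards.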
,
Aug 26 2017
Indeed, this looks like multiple CLs / other moving parts are to blame. Here are the two veyron_minnie builds involved (luckily, they are consecutive ones). Both have the blamed CL:
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3639 (passes)
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3640 (fails)
Note that the passing one didn't even run the test that failed: https://viceroy.corp.google.com/chromeos/suite_details?job_id=137056862
,
Aug 26 2017
> The question is why the CL didn't cause the same failure on the same boards in master-paladin/15841.
Here's one of the suites from the run that passed:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=137056193
The suite run doesn't include the test that was just added to the suite.
Something's wrong with the build system.
,
Aug 26 2017
Yeah, when I extract the autotest_packages tarball from the Artifacts section, I see the PlayStoreTest control file still says:
ATTRIBUTES = "suite:bvt-perbuild"
That's despite this in the uprev stage:
15:40:14: INFO: Rev: Determined that one+ of the ebuild autotest-tests-cheets rev_subdirs was touched ['client/site_tests/cheets_AndroidToChromeIntents', 'client/site_tests/cheets_BlockOutboundNetworkTest', 'client/site_tests/cheets_AntutuTest', 'client/site_tests/cheets_CameraOrientation', 'client/site_tests/cheets_CandyCrushTest', 'client/site_tests/cheets_CleanShutDown', 'client/site_tests/cheets_ClearDalvikCacheOnBoot', 'client/site_tests/cheets_ClipboardTest', 'server/site_tests/cheets_ClobberStateful', 'client/site_tests/cheets_ContainerMount', 'client/site_tests/cheets_ContainerSmokeTest', 'client/site_tests/cheets_ContainerReboot', 'client/site_tests/cheets_CryptoMigration', 'client/site_tests/cheets_desktopui_SimpleLogin', 'client/site_tests/cheets_DownloadsFilesystem', 'client/site_tests/cheets_EnterpriseForceInstall', 'client/site_tests/cheets_EnterpriseLogin', 'client/site_tests/cheets_FileSystemPermissions', 'client/site_tests/cheets_KeyboardTest', 'client/site_tests/cheets_LegacyTestsCouchsurfing', 'client/site_tests/cheets_LegacyTestsOverDrive', 'client/site_tests/cheets_LinpackTest', 'client/site_tests/cheets_LowMemoryKiller', 'client/site_tests/cheets_MailBench', 'client/site_tests/cheets_MediaPlayerVideoHWDecodeUsed', 'client/site_tests/cheets_MicrophoneApp', 'client/site_tests/cheets_MountObbTest', 'client/site_tests/cheets_NativeCrash', 'client/site_tests/cheets_NotificationTest', 'client/site_tests/cheets_NOVALegacy', 'client/site_tests/cheets_PerfBoot', 'client/site_tests/cheets_PerformanceAppTest', 'client/site_tests/cheets_PlayMusicApp', 'client/site_tests/cheets_PlayStoreTest', 'client/site_tests/cheets_PlayStoreOptIn', 'client/site_tests/cheets_PlayVideoApp', 'client/site_tests/cheets_powerLoadTest', 'client/site_tests/cheets_RemovableMedia', 'client/site_tests/cheets_ScreenRotation', 'client/site_tests/cheets_SELinuxTest', 'client/site_tests/cheets_SettingsBridge', 'client/site_tests/cheets_SystemRawImageSize', 
'client/site_tests/cheets_TouchTaps', 'client/site_tests/cheets_TouchLatencyEstimate', 'client/site_tests/cheets_VellamoTest', 'server/site_tests/cheets_CTS', 'server/site_tests/cheets_CTS_N', 'server/site_tests/cheets_GTS', 'server/site_tests/cheets_PerfBootServer', 'server/site_tests/cheets_network_WiFi_SimpleConnect']
Although I'm not sure I understand this ebuild very well...when I build locally, I don't see autotest-tests-cheets owning the test (e.g., it's not in equery-${BOARD} files).
,
Aug 26 2017
> I don't see autotest-tests-cheets owning the test (e.g., it's not in equery-${BOARD} files).
Sorry, that's wrong. `equery-${BOARD} files ...` shows it now for me. It wasn't working at some point... either I'm doing something wrong some of the time, or that could point at some strangeness in the build system.
,
Aug 26 2017
> Yeah, when I extract the autotest_packages tarball from the Artifacts section, I see the PlayStoreTest control file still says:
>
> ATTRIBUTES = "suite:bvt-perbuild"
Interesting. In the artifacts there's also a 'control_files.tar'.
In that file, the PlayStoreTest control file says this:
ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"
Which is, of course, what should be everywhere.
,
Aug 26 2017
So this is the direct artifact from the minnie build I've been dissecting (as pointed to by the UploadPrebuilts stage):
https://storage.cloud.google.com/chromeos-prebuilt/board/veyron_minnie/paladin-R62-9875.0.0-rc3/packages/chromeos-base/autotest-tests-cheets-0.0.1-r471.tbz2
That clearly contains the following:
...
/usr/local/build/autotest/client/site_tests/cheets_PlayStoreTest/control
/usr/local/build/autotest/client/site_tests/cheets_PlayStoreTest/test-cheets_PlayStoreTest.tar.bz2
...
Where the bare control file has the correct ATTRIBUTES, and the tarball contains the old contents (including the ATTRIBUTES from my comment #8). So there's definitely something screwy with the autotest tarball packaging in an uprev like this.
This is sounding like the same terrain that Prashant was traversing with this change:
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/634627
Prashant, do you have any ideas here?
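The comparison above (bare control file vs. the control file packed inside the per-test tarball) could be automated roughly as follows. This is a sketch only: the helper names are made up, and it assumes the tarball is a bzip2-compressed tar that contains a member whose basename is "control".

```python
import io
import re
import tarfile

# Matches a line such as: ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"
ATTR_RE = re.compile(r'^ATTRIBUTES\s*=\s*"([^"]*)"', re.MULTILINE)


def parse_attributes(control_text):
    """Return the set of attributes declared in a control file's text."""
    match = ATTR_RE.search(control_text)
    if match is None:
        return set()
    return {item.strip() for item in match.group(1).split(",") if item.strip()}


def attributes_in_test_tarball(tarball_bytes):
    """Parse the control file packed inside a test-*.tar.bz2 (assumed layout)."""
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r:bz2") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.split("/")[-1] == "control":
                text = tar.extractfile(member).read().decode("utf-8")
                return parse_attributes(text)
    return set()


def tarball_is_stale(bare_control_text, tarball_bytes):
    """True when the packed control file disagrees with the bare one."""
    return parse_attributes(bare_control_text) != attributes_in_test_tarball(tarball_bytes)
```

Running a check like this over every test directory in the prebuilt package would flag any tarball whose contents lag behind the bare control file next to it.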
,
Aug 28 2017
+ new sheriffs
I'm not planning to look into this more at the moment, but FYI it's probably still a problem.
,
Aug 28 2017
,
Aug 28 2017
pmalani@ are you investigating this?
,
Aug 29 2017
Just saw this (I don't check @chromium.org that often and mail forwarding isn't set up).
This has been an ongoing problem (once every 3-4 weeks). Basically, one of the following is happening:
- New artifacts aren't getting uploaded (i.e. the logic which checks whether we should upload the new artifact is wrong)
- The newly uploaded artifacts aren't visible to wherever the tests are being run.
Some questions:
- How does the test pull the artifact?
- Does it refer to a specific version number?
- Or does it simply ask for the "latest" artifact?
Earlier, *every* autotest package was uprevved whenever there was a check-in to autotest, so it could well be that this issue was always there, but was hidden because uprevs of every autotest package kept happening.
I frankly have no idea where the uploader script runs. Unless I can see the output of that, I don't have any ideas about debugging this. If someone knows where the upload stage script runs (i.e. the autotest_pkg_postinst() invocation during which packager.py --upload runs), please provide that link. If not, I can only recommend disabling both the uprev logic and the packager changes (and not bothering to switch them on again, since this is a recurring problem).
,
Aug 29 2017
BTW, is this still a problem? Assigning to sheriffs for a response.
,
Aug 29 2017
,
Aug 29 2017
The nature of the problem is such that it will occur rarely, and
will be detected even less, because usually, it won't cause
failures. So, we have no good way to determine if it's still
happening, but failures are almost certain to still be possible:
* The problem didn't happen by accident, so there must be a
bug somewhere.
* No one's done anything to fix such a bug.
Given the expected severity of failures (total tree outages
similar to bug 759039), we have to get to the root cause.
However, since this problem isn't actively causing outages right
now, this isn't a sheriff problem. Someone familiar with how we
create Autotest artifacts needs to take a swipe at explaining the
symptoms.
,
Aug 29 2017
Re-assigning to the sheriff. As I mentioned in Comment #15, unless I can be pointed to where the uploader stage runs (I'm not familiar with the build infra, so I have tried and not been able to figure this out) and get those logs, there is nothing more I can suggest, short of disabling the uprev logic. Kindly let me know if that is the option we wish to proceed with.
,
Aug 29 2017
Unfortunately I'm not familiar with the build infra either. Does anyone have a suggestion about who might be able to help with this?
,
Aug 29 2017
mka@ is right - this isn't a bug for the sheriffs.
> Some questions:
> - How does the test pull the artifact?
> - Does it refer to a specific version number?
> - Or does it simply ask for the "latest" artifact?
The test infrastructure asks for the test artifacts associated
with the specific build being tested. The version is named by
the standard partial path that identifies the directory in
googlestorage.
You can find the artifacts that were used by looking in the "Report"
stage on the builder page. Links to some relevant builders are mentioned
in c#6:
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3639 (passes)
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3640 (fails)
,
Aug 29 2017
RE: Comment #21. Thanks for the links. Unfortunately, those don't provide information regarding where the uploader stage is run. I'm hoping someone in the CC list has some information regarding when/where in the CQ run "./packager.py --upload" runs.
This issue occurs only in the CQ, not in the Pre-CQ; that suggests the artifacts are being generated and uploaded correctly in the Pre-CQ. I understand that the CQ re-uses old chroots, but that still doesn't explain why the newer artifacts aren't replacing the old ones.
,
Aug 30 2017
What makes you think this doesn't happen in the Pre-CQ? We don't do much to actually verify these client test packages there, so we wouldn't notice if they're wrong.
And if I'm reading correctly, 'packager.py' gets run from chromiumos-overlay/eclass/autotest.eclass -> autotest_pkg_postinst() -> autotest_run_packager() -> "${root_autotest_dir}/utils/packager.py" ...
So, that would be in the BuildPackages stage.
,
Aug 30 2017
> but that still doesn't explain why the newer artifacts aren't replacing the old ones.
Why do you think the problem is stale artifacts, as opposed to incorrectly generated artifacts?
,
Aug 30 2017
> What makes you think this doesn't happen in the Pre-CQ? We don't do much to actually verify these client test packages there, so we wouldn't notice if they're wrong.
The point is fair, but it's actually possible to find out
what happens in the pre-CQ. This problem was originally
found with this CL:
crosreview.com/i/437193
That CL went through the pre-cq, and produced artifacts for
several boards, including these:
https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/trybot-kevin-no-vmtest-pre-cq/R62-9875.0.0-b105039
I downloaded the autotest_packages.tar from there, and found this
in the control file:
ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"
Meaning, in fact, that it doesn't happen in the pre-cq. Which
needs to be explained...
,
Aug 30 2017
One difference: that PreCQ build was a "new" build, not an upgrade. See the build packages log from here:
https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/no_vmtest_pre_cq/105039
[ebuild N ] chromeos-base/autotest-tests-cheets-0.0.1-r471::cheets-private to /build/kevin/
[...trimmed...]
So if pmalani is correct about "stale" artifacts, there's less chance of a previous build somehow influencing the current one on a new package build.
Also, that 'autotest_packages.tar' is packaged quite a bit differently than the artifact I pulled from the CQ artifacts. I guess they upload different stuff?
,
Aug 30 2017
I checked the CQ log which failed (mentioned in Comment #6 & Comment #21) and I see the following there too:
[ebuild N ] chromeos-base/autotest-tests-cheets-0.0.1-r471::cheets-private to [ ....]
So, it seems like it's getting built there, and the version seems to be correctly defined here.
According to https://cs.corp.google.com/chromeos_internal/chromite/cbuildbot/commands.py?rcl=3f5e7fc082093e45fa66ac4514aa334d50ae8bce&l=2285 , autotest_packages.tar simply tars up whatever is in <buildroot>/autotest/packages
I do know that the CQ re-uses chroots, so it's possible that the packages haven't been replaced for some weird reason. We could clean the CQ chroot, but FWIU re-using the chroot provides some optimizations. Perhaps we can detect that an autotest package has been uprevved, and only clear out the autotest/packages directory? I know it's hacky, but it will at least ensure the packages getting tarred up are correct.
BTW, I checked the Artifacts link from the build which is suggested to be failing, and in its autotest_packages.tar (https://storage.cloud.google.com/chromeos-image-archive/veyron_minnie-paladin/R62-9876.0.0-rc3/autotest_packages.tar?_ga=2.35329185.-1079761916.1501616603), the control file in tests-cheets_PlayStoreTest.tar.bz2 seems to have the right ATTRIBUTES variable:
ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"
I could be misinterpreting something, but it seems like the autotest_packages.tar has the right value. Kindly chime in if my understanding is incorrect.
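The "only clear out autotest/packages" idea might look roughly like the sketch below. Everything here is an assumption for illustration: the function name is made up, the list of uprevved packages is taken as an input rather than derived from the uprev stage, and the only path used is the <buildroot>/autotest/packages directory mentioned above.

```python
import shutil
from pathlib import Path


def clear_packages_if_uprevved(buildroot, uprevved_packages):
    """Empty <buildroot>/autotest/packages when any autotest package was uprevved.

    Hypothetical helper: returns True if the directory was cleared.
    """
    packages_dir = Path(buildroot) / "autotest" / "packages"
    # Treat any package whose name (after the category) starts with "autotest"
    # as an autotest package; this naming rule is an assumption.
    touched = [p for p in uprevved_packages if p.split("/")[-1].startswith("autotest")]
    if not touched or not packages_dir.is_dir():
        return False
    shutil.rmtree(packages_dir)
    packages_dir.mkdir(parents=True)
    return True
```

This trades some rebuild time for the guarantee that nothing left over from an earlier build ends up in the tarball; it does not explain why stale files would survive in the first place.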
,
Aug 30 2017
> I checked the CQ log which failed
We're actually interested in the one that "passed" -- because the CL was bad, but the CQ didn't notice (and it therefore "passed" as a false positive).
So, looking at this:
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3639 (passes)
I see this:
[ebuild U ] chromeos-base/autotest-tests-cheets-0.0.1-r471::cheets-private [0.0.1-r470::cheets-private] to /build/veyron_minnie/ ...
and *that* build's artifacts have an autotest_packages.tar with the "old" files.
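Since the new-vs-upgrade distinction seems to matter here, a small parser for the "[ebuild ...]" lines quoted in this thread can help when scanning build logs. This is a sketch that assumes the line shape shown above (flags in brackets, then the package atom, then an optional previous version in brackets); the function name is hypothetical.

```python
import re

# Flags, package atom, and an optional "[previous-version]" field.
EMERGE_LINE = re.compile(r"^\[ebuild\s+([A-Za-z ]*?)\s*\]\s+(\S+)(?:\s+\[(\S+)\])?")


def classify_emerge_line(line):
    """Return (kind, package, previous) for an '[ebuild ...]' line, else None."""
    match = EMERGE_LINE.match(line.strip())
    if match is None:
        return None
    flags, package, previous = match.groups()
    if "U" in flags:
        kind = "upgrade"  # an older version was already installed
    elif "N" in flags:
        kind = "new"      # nothing was installed before
    else:
        kind = "other"
    return kind, package, previous
```

Applied to the two logs above, the passing CQ build's line would classify as an upgrade from r470, while the Pre-CQ line would classify as a new install.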
,
Oct 26 2017
I don't believe this specific issue recurs ATM. Closing this (I'm monitoring future related bugs).
Comment 1 by pprabhu@chromium.org
, Aug 25 2017