Starred by 4 users
Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



CQ submitted a change via strategy:cq-submit-partial-pool-cq-history that broke HWTest on multiple platforms
Project Member Reported by pprabhu@chromium.org, Aug 25
Change:
https://chrome-internal-review.googlesource.com/c/chromeos/autotest-cheets/+/437193
Submitted here:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15849

Broke the next 10+ runs.

20:40:31: INFO: Checking change CL:*437193; relevant configs [u'master-paladin', 'betty-paladin', 'caroline-paladin', 'chell-nowithdebug-paladin', 'veyron_mighty-paladin', 'zoombini-paladin', 'reef-paladin', 'cave-paladin', 'fizz-paladin', 'oak-paladin', 'veyron_speedy-paladin', 'scarlet-paladin', 'veyron_tiger-paladin', 'auron_yuna-paladin', 'sentry-paladin', 'quawks-paladin', 'kevin-paladin', 'chell-paladin', 'veyron_jaq-paladin', 'hana-paladin', 'samus-paladin', 'cyan-paladin', 'wizpig-paladin', 'glados-paladin', 'eve-paladin', 'elm-paladin', 'coral-paladin', 'edgar-paladin', 'poppy-paladin', 'bob-paladin', 'veyron_jerry-paladin', 'veyron_minnie-paladin', 'reef-uni-paladin']; configs passed in history ['betty-paladin', 'caroline-paladin', 'zoombini-paladin', 'glados-paladin', 'fizz-paladin', 'oak-paladin', 'scarlet-paladin', 'hana-paladin', 'veyron_tiger-paladin', 'auron_yuna-paladin', 'quawks-paladin', 'veyron_jaq-paladin', 'chell-paladin', 'samus-paladin', 'veyron_minnie-paladin', 'wizpig-paladin', 'eve-paladin', 'chell-nowithdebug-paladin', 'coral-paladin', 'poppy-paladin', 'bob-paladin', 'veyron_jerry-paladin', 'cyan-paladin', 'reef-uni-paladin'].
20:40:31: INFO: Change CL:*437193 is verified with reasons ['strategy:cq-submit-partial-pool-builds-passed', 'strategy:cq-submit-partial-pool-cq-history'], choose the final reason strategy:cq-submit-partial-pool-cq-history.
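The two log lines above hint at how the history-aware strategy reached its decision. Here is a minimal sketch, assuming (the log does not confirm this, and it is not taken from the chromite source) that a change counts as verified when every relevant config passed either in the current run or in an earlier run that included the change. The function names and the toy data split are hypothetical.

```python
def unverified_configs(relevant, passed_now, passed_in_history):
    """Return the relevant configs with no passing result in either set."""
    return sorted(set(relevant) - set(passed_now) - set(passed_in_history))


def is_verified(relevant, passed_now, passed_in_history):
    """Treat a change as verified when no relevant config is left uncovered."""
    return not unverified_configs(relevant, passed_now, passed_in_history)


# Toy data shaped like the log above. The config names appear in the log;
# the split between "now" and "history" is made up for illustration.
relevant = ['reef-paladin', 'veyron_minnie-paladin', 'kevin-paladin']
passed_now = ['reef-paladin', 'kevin-paladin']
passed_in_history = ['veyron_minnie-paladin']
```

Under this assumed rule, a config that passed in an earlier run counts toward verification even if that run did not exercise the behavior the change introduces.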


nxia@: Can you help do initial analysis on why this slipped through the cracks?

+ Chase-Pending for initial analysis only. We need to understand why we let this bug into ToT before it becomes hard to do analysis due to waterfall restart etc.
 
OK, over-spoke. Broke the next 6 runs.
 Issue 759097  has been merged into this issue.
Note that the outage caused by this is detailed in bug 759039.

20:40:31: INFO: The following changes will be submitted using board-aware submission logic: CL:*407788 CL:*432592 CL:*437193 CL:*437235 CL:*438852 CL:*439012 CL:*439772 CL:*439797 CL:*439995 CL:617375 CL:630577 CL:633704 CL:633783 CL:633803 CL:633845 CL:634127


Looks like it was submitted by the board-aware submission logic, which may have interacted with history-aware submission.
Cc: pprabhu@chromium.org
I looked into the previous CQ run: the relevant slaves that failed in master-paladin/15849 passed in master-paladin/15841, and the relevant slaves that failed in master-paladin/15841 passed in master-paladin/15849.

The question is why the CL didn't cause the same failure on the same boards in master-paladin/15841. If the reason is that the CL, combined with another CL, caused the failures in 15849 and in the builds after it, that's an expected drawback of history-aware submission. If the CL caused flaky test failures, it could still have gotten through the CQ even without history-aware submission.

pass to pprabhu@ to work with the sheriffs and find out the reasons?
Indeed, this looks like multiple CLs / other moving parts are to blame:

Here are the two veyron_minnie builds involved (luckily, they are consecutive ones). Both have the blamed CL

https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3639 (passes)
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3640 (fails)

Note that the passing one didn't even run the test that failed: https://viceroy.corp.google.com/chromeos/suite_details?job_id=137056862
> The question is why the CL didn't cause the same failure on the same boards in master-paladin/15841.

Here's one of the suites from the run that passed:
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=137056193

The suite run doesn't include the test that was just added to the suite.
Something's wrong with the build system.

Yeah, when I extract the autotest_packages tarball from the Artifacts section, I see the PlayStoreTest control file still says:

ATTRIBUTES = "suite:bvt-perbuild"

That's still despite this in the uprev stage:

15:40:14: INFO:  Rev: Determined that one+ of the ebuild autotest-tests-cheets rev_subdirs was touched ['client/site_tests/cheets_AndroidToChromeIntents', 'client/site_tests/cheets_BlockOutboundNetworkTest', 'client/site_tests/cheets_AntutuTest', 'client/site_tests/cheets_CameraOrientation', 'client/site_tests/cheets_CandyCrushTest', 'client/site_tests/cheets_CleanShutDown', 'client/site_tests/cheets_ClearDalvikCacheOnBoot', 'client/site_tests/cheets_ClipboardTest', 'server/site_tests/cheets_ClobberStateful', 'client/site_tests/cheets_ContainerMount', 'client/site_tests/cheets_ContainerSmokeTest', 'client/site_tests/cheets_ContainerReboot', 'client/site_tests/cheets_CryptoMigration', 'client/site_tests/cheets_desktopui_SimpleLogin', 'client/site_tests/cheets_DownloadsFilesystem', 'client/site_tests/cheets_EnterpriseForceInstall', 'client/site_tests/cheets_EnterpriseLogin', 'client/site_tests/cheets_FileSystemPermissions', 'client/site_tests/cheets_KeyboardTest', 'client/site_tests/cheets_LegacyTestsCouchsurfing', 'client/site_tests/cheets_LegacyTestsOverDrive', 'client/site_tests/cheets_LinpackTest', 'client/site_tests/cheets_LowMemoryKiller', 'client/site_tests/cheets_MailBench', 'client/site_tests/cheets_MediaPlayerVideoHWDecodeUsed', 'client/site_tests/cheets_MicrophoneApp', 'client/site_tests/cheets_MountObbTest', 'client/site_tests/cheets_NativeCrash', 'client/site_tests/cheets_NotificationTest', 'client/site_tests/cheets_NOVALegacy', 'client/site_tests/cheets_PerfBoot', 'client/site_tests/cheets_PerformanceAppTest', 'client/site_tests/cheets_PlayMusicApp', 'client/site_tests/cheets_PlayStoreTest', 'client/site_tests/cheets_PlayStoreOptIn', 'client/site_tests/cheets_PlayVideoApp', 'client/site_tests/cheets_powerLoadTest', 'client/site_tests/cheets_RemovableMedia', 'client/site_tests/cheets_ScreenRotation', 'client/site_tests/cheets_SELinuxTest', 'client/site_tests/cheets_SettingsBridge', 'client/site_tests/cheets_SystemRawImageSize', 
'client/site_tests/cheets_TouchTaps', 'client/site_tests/cheets_TouchLatencyEstimate', 'client/site_tests/cheets_VellamoTest', 'server/site_tests/cheets_CTS', 'server/site_tests/cheets_CTS_N', 'server/site_tests/cheets_GTS', 'server/site_tests/cheets_PerfBootServer', 'server/site_tests/cheets_network_WiFi_SimpleConnect']

Although I'm not sure I understand this ebuild very well...when I build locally, I don't see autotest-tests-cheets owning the test (e.g., it's not in equery-${BOARD} files).
> I don't see autotest-tests-cheets owning the test (e.g., it's not in equery-${BOARD} files).

Sorry, that's wrong. `equery-${BOARD} files ...` shows it now for me. It wasn't working at some point... either I'm doing something wrong some of the time, or that could point at some strangeness in the build system.
> Yeah, when I extract the autotest_packages tarball from the Artifacts section, I see the PlayStoreTest control file still says:
>
> ATTRIBUTES = "suite:bvt-perbuild"

Interesting.  In the artifacts there's also a 'control_files.tar'.
In that file, the PlayStoreTest control file says this:
    ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"

Which is, of course, what should be everywhere.

Cc: vapier@chromium.org nxia@chromium.org
Owner: pmalani@chromium.org
Status: Assigned
So this is the direct artifact from the minnie build I've been dissecting (as pointed to by the UploadPrebuilts stage):

https://storage.cloud.google.com/chromeos-prebuilt/board/veyron_minnie/paladin-R62-9875.0.0-rc3/packages/chromeos-base/autotest-tests-cheets-0.0.1-r471.tbz2

That clearly contains the following:
...
/usr/local/build/autotest/client/site_tests/cheets_PlayStoreTest/control
/usr/local/build/autotest/client/site_tests/cheets_PlayStoreTest/test-cheets_PlayStoreTest.tar.bz2
...

Where the bare control file has the correct ATTRIBUTES, and the tarball contains the old contents (including the ATTRIBUTES from my comment #8).

So there's definitely something screwy with the autotest tarball packaging in an uprev like this.
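The mismatch described above (bare control file correct, nested tarball stale) can be checked mechanically. This is a minimal sketch, assuming the package has already been unpacked so that the bare `control` text and the per-test tarball are both at hand, and assuming the control file inside the tarball has the base name `control`. The helper names are hypothetical and are not part of autotest.

```python
import io
import re
import tarfile

# Matches a line such as: ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"
ATTR_RE = re.compile(r'^ATTRIBUTES\s*=\s*["\'](.*?)["\']', re.MULTILINE)


def parse_attributes(control_text):
    """Return the set of entries in a control file's ATTRIBUTES string."""
    match = ATTR_RE.search(control_text)
    if not match:
        return set()
    return {item.strip() for item in match.group(1).split(',') if item.strip()}


def attributes_in_tarball(fileobj, member_name='control'):
    """Read the member whose base name is member_name and parse it."""
    with tarfile.open(fileobj=fileobj, mode='r:*') as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.split('/')[-1] == member_name:
                text = tar.extractfile(member).read().decode('utf-8')
                return parse_attributes(text)
    return set()


def missing_from_tarball(bare_control_text, tarball_fileobj):
    """Suites declared in the bare control file but absent from the tarball."""
    return parse_attributes(bare_control_text) - attributes_in_tarball(tarball_fileobj)
```

A non-empty result from `missing_from_tarball` would flag the kind of stale tarball seen here, where `suite:bvt-arc` is present in the bare file but not in the packaged copy.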

This is sounding like the same terrain that Prashant was traversing with this change:

https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/634627

Prashant, do you have any ideas here?
Cc: mka@chromium.org walker@google.com
Labels: OS-Chrome
+ new sheriffs

I'm not planning to look into this more at the moment, but FYI it's probably still a problem.
Labels: -Chase-Pending
pmalani@ are you investigating this?
Just saw this (I don't check @chromium.org that often and mail forwarding isn't set up). This has been an ongoing problem (once every 3-4 weeks). Basically, one of the following is happening:

- New artifacts aren't getting uploaded (i.e., the logic which checks whether we should upload the new artifact is wrong)
- The newly uploaded artifacts aren't visible to wherever the tests are being run.

Some questions:
- How does the test pull the artifact?
- Does it refer to a specific version number?
- Or does it simply ask for the "latest" artifact?

Earlier, *every* autotest package got upreved whenever there was a check-in to autotest, so it could well be that this issue was always there but stayed hidden because uprevs of every autotest package kept happening.

I frankly have no idea where the uploader script runs. Unless I can see the output of that, I don't have any ideas about debugging this. If someone knows where the upload stage script runs (i.e., the autotest_pkg_postinst() invocation during which packager.py --upload runs), please provide that link. If not, I can only recommend disabling both the uprev logic and the packager changes (and not bothering to switch them on again, since this is a recurring problem).
BTW, is this still a problem?
Assigning to sheriffs for a response.
Owner: mka@chromium.org
Owner: pmalani@chromium.org
The nature of the problem is such that it will occur rarely, and
will be detected even less, because usually, it won't cause
failures.  So, we have no good way to determine if it's still
happening, but failures are almost certain to still be possible:
  * The problem didn't happen by accident, so there must be a
    bug somewhere.
  * No one's done anything to fix such a bug.

Given the expected severity of failures (total tree outages
similar to bug 759039), we have to get to the root cause.

However, since this problem isn't actively causing outages right
now, this isn't a sheriff problem.  Someone familiar with how we
create Autotest artifacts needs to take a swipe at explaining the
symptoms.

Owner: mka@chromium.org
Re-assigning to the sheriff. As I mentioned in Comment #15, unless I can be pointed to where the uploader stage runs (and I'm not familiar with the build infra, so I have tried and not been able to figure this out), and get those logs, there is nothing more I can suggest, short of disabling the uprev logic.

Kindly let me know if that is the option we wish to proceed with.
Unfortunately I'm not familiar with the build infra either. Does anyone have a suggestion about who might be able to help with this?
Owner: pmalani@chromium.org
mka@ is right - this isn't a bug for the sheriffs.

> Some questions:
> - How does the test pull the artifact?
> - Does it refer to a specific version number?
> - Or does it simply ask for the "latest" artifact?

The test infrastructure asks for the test artifacts associated
with the specific build being tested.  The version is named by
the standard partial path that identifies the directory in
googlestorage.
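To illustrate the "partial path" idea, here is a minimal sketch that joins a builder name and a version into an artifact location. The bucket and file name mirror the `chromeos-image-archive/veyron_minnie-paladin/R62-9876.0.0-rc3/autotest_packages.tar` link quoted elsewhere in this thread; the `gs://` form and the function itself are assumptions for illustration, not the test infrastructure's actual code.

```python
def artifact_url(builder, version, artifact='autotest_packages.tar',
                 bucket='chromeos-image-archive'):
    """Join the bucket, the builder/version partial path, and the file name."""
    build_path = f'{builder}/{version}'  # the "standard partial path"
    return f'gs://{bucket}/{build_path}/{artifact}'
```

Because the path is derived from the specific build, a lookup like this would not pick up a "latest" artifact from some other build.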

You can find the artifacts that were used by looking in the "Report"
stage on the builder page.  Links to some relevant builders are mentioned
in c#6:
    https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3639 (passes)
    https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3640 (fails)

RE: Comment #21. Thanks for the links. Unfortunately, those don't provide information regarding where the uploader stage is run. I'm hoping someone in the CC list has some information about when/where in the CQ run "./packager.py --upload" runs.

This issue occurs only in CQ, not in Pre-CQ; that suggests the artifacts are being generated and uploaded correctly in Pre-CQ. I understand that CQ re-uses old chroots, but that still doesn't explain why the newer artifacts aren't replacing the old ones.
What makes you think this doesn't happen in the Pre-CQ? We don't do much to actually verify these client test packages there, so we wouldn't notice if they're wrong.

And if I'm reading correctly, 'packager.py' gets run from chromiumos-overlay/eclass/autotest.eclass -> autotest_pkg_postinst() -> autotest_run_packager() -> "${root_autotest_dir}/utils/packager.py" ...

So, that would be in the BuildPackages stage.
> but that still doesn't explain why the newer artifacts aren't replacing the old ones.

Why do you think the problem is stale artifacts, as opposed
to incorrectly generated artifacts?

> What makes you think this doesn't happen in the Pre-CQ? We don't do much to actually verify these client test packages there, so we wouldn't notice if they're wrong.

The point is fair, but it's actually possible to find out
what happens in the pre-CQ.  This problem was originally
found with this CL:
    crosreview.com/i/437193

That CL went through the pre-cq, and produced artifacts for
several boards, including these:
    https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/trybot-kevin-no-vmtest-pre-cq/R62-9875.0.0-b105039

I downloaded the autotest_packages.tar from there, and found this
in the control file:
    ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"

Meaning, in fact, that it doesn't happen in the pre-cq.  Which
needs to be explained...

One difference: that PreCQ build was a "new" build, not an upgrade. See the build packages log from here:

https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/no_vmtest_pre_cq/105039

[ebuild  N     ] chromeos-base/autotest-tests-cheets-0.0.1-r471::cheets-private to /build/kevin/ [...trimmed...]

So if pmalani is correct about "stale" artifacts, there's less chance of a previous build somehow influencing the current one on a new package build.

Also, that 'autotest_packages.tar' is packaged quite a bit differently than the artifact I pulled from the CQ artifacts. I guess they upload different stuff?
I checked the CQ log which failed (mentioned in Comment #6 & Comment #21) and I see the following there too:
[ebuild  N     ] chromeos-base/autotest-tests-cheets-0.0.1-r471::cheets-private to [ ....]

So, it seems like it's getting built there, and the version seems to be correctly defined here.

According to https://cs.corp.google.com/chromeos_internal/chromite/cbuildbot/commands.py?rcl=3f5e7fc082093e45fa66ac4514aa334d50ae8bce&l=2285 , autotest_packages.tar simply tars up whatever is in <buildroot>/autotest/packages

I do know that CQ re-uses chroots, so it's plausible that the packages haven't been replaced for some weird reason.

We could clean the CQ chroot, but FWIU re-using the chroot provided some optimizations.

Perhaps we can detect that an autotest package has been upreved, and only clear out the autotest/packages directory? I know it's hacky, but it will at least ensure the packages getting tarred up are correct.
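The workaround proposed above could look roughly like this. It is a minimal sketch, assuming the caller already knows which packages were upreved in this run and that `packages_dir` stands in for `<buildroot>/autotest/packages`. The function name and the "package name starts with autotest" test are hypothetical, not existing chromite behavior.

```python
import os
import shutil


def clear_stale_packages(packages_dir, upreved_packages):
    """Empty packages_dir if any autotest package was upreved.

    Returns True when the directory was cleared, False otherwise.
    """
    touched = any(pkg.split('/')[-1].startswith('autotest')
                  for pkg in upreved_packages)
    if not touched:
        return False
    if os.path.isdir(packages_dir):
        shutil.rmtree(packages_dir)  # drop every previously packaged tarball
    os.makedirs(packages_dir)        # leave an empty directory for the packager
    return True
```

Clearing only this directory keeps the rest of the re-used chroot intact, at the cost of repackaging every test whenever one autotest package changes.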

BTW, I checked the Artifacts link from the build which is suggested to be failing, and in its autotest_packages.tar (https://storage.cloud.google.com/chromeos-image-archive/veyron_minnie-paladin/R62-9876.0.0-rc3/autotest_packages.tar?_ga=2.35329185.-1079761916.1501616603), the control file in tests-cheets_PlayStoreTest.tar.bz2 seems to have the right ATTRIBUTES variable:
ATTRIBUTES = "suite:bvt-perbuild, suite:bvt-arc"

I could be misinterpreting something but it seems like the autotest_packages.tar has the right value. Kindly chime in if my understanding is incorrect.
> I checked the CQ log which failed

We're actually interested in the one that "passed" -- because the CL was bad, but the CQ didn't notice (and it therefore "passed" as a false positive).

So, looking at this:
    https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/3639 (passes)

I see this:

[ebuild     U  ] chromeos-base/autotest-tests-cheets-0.0.1-r471::cheets-private [0.0.1-r470::cheets-private] to /build/veyron_minnie/ ...

and *that* build's artifacts have an autotest_packages.tar with the "old" files.
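The distinction between the `N` (new install) and `U` (upgrade) lines matters here, so a small parser can help when scanning build logs for it. This is a minimal sketch based only on the two line shapes quoted in this thread; the regular expression and the function name are hypothetical and do not cover every emerge output variant.

```python
import re

# Shapes seen in this thread:
#   [ebuild  N     ] cat/pkg-ver::repo to /build/board/ ...
#   [ebuild     U  ] cat/pkg-ver::repo [oldver::repo] to /build/board/ ...
EBUILD_RE = re.compile(
    r'^\[ebuild\s+(?P<flags>[^\]]*?)\s*\]\s+'
    r'(?P<atom>[^\s:]+)(?:::\S+)?'
    r'(?:\s+\[(?P<old>[^\s:\]]+)(?:::[^\s\]]+)?\])?'
)


def classify(line):
    """Return (kind, atom, old_version), or None if the line is not an ebuild line."""
    match = EBUILD_RE.match(line.strip())
    if not match:
        return None
    flags = match.group('flags')
    if 'N' in flags:
        kind = 'new'
    elif 'U' in flags:
        kind = 'upgrade'
    else:
        kind = 'other'
    return kind, match.group('atom'), match.group('old')
```

Running this over the BuildPackages logs of the passing and failing builds would show which builds installed autotest-tests-cheets as an upgrade over r470 rather than as a new package.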