Issue 909860

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug

tryjob unnecessarily retrying without patch

Project Member Reported by jam@chromium.org, Nov 28

Issue description

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/139914 looks green, but then it's retrying without patch.

Dirk pointed out that one of the swarming jobs for layout tests says failed, but then 
1) why is the task marked green?

The failure above is also just because of a leftover process, see bug 898699. Given that almost all these cases now are due to races in Windows and not a bug in Chrome, it seems like another reason to stop doing the failure in swarming if a process is still running.

2) as Dirk points out, why are we retrying all shards instead of just the failed one?
 
Cc: jbudorick@chromium.org mar...@chromium.org martiniss@chromium.org
Components: Infra>Client>Chrome
Cc: erikc...@chromium.org
Owner: bpastene@chromium.org
Status: Assigned (was: Unconfirmed)
> Given that almost all these cases now are due to races in Windows and not a bug in Chrome, it seems like another reason to stop doing the failure in swarming if a process is still running.

I don't have much context behind the failure mode. But if dirk & MA are ok w/ it, I can drop the swarming check (at least on win7) that fails the task if there are zombie processes lying around. (And instead trigger a mandatory reboot of the machine?)

> 2) as Dirk points out, why are we retrying all shards instead of just the failed one?

We've seen that happen on various other suites in the past (+ erikchen, who's been involved with a lot of those). In this case, not only are we retrying every test, but we're retrying each exactly 10 times given the "--gtest_repeat=10" arg appended to retries.

That would be why every shard in the w/o patch retry hit its 1hr timeout. We have safeguards against that kind of excessive retrying, but it's likely that the "unexpected flakes" failure mode is tripping it up. I'll try to track that down.
There's never going to be a way to do proper containment on Windows 7, so ignoring + forced reboot looks good to me. We need to figure out how to do this without being too hacky internally but that should be doable.
(2) is interesting:
https://chromium-review.googlesource.com/c/chromium/tools/build/+/1357281

"""
it's possible for test results to be valid, and all tests
to pass, but for the test suite to be considered a failure due to an infra
issue. This can cause the list of failed tests to be empty. In this case,
retries will rerun all tests. We should avoid setting --gtest_repeat.
"""

The CL I linked to makes it so that we don't set --gtest_repeat.
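
A rough sketch of the idea, not the actual recipe code: build_retry_args and its parameters are hypothetical names made up for illustration. It assumes the retry only adds --gtest_repeat when there is a non-empty list of failed tests to filter on, so an empty list (the infra-failure case quoted above) reruns the suite once instead of 10 times.

```python
from typing import List


def build_retry_args(failed_tests: List[str], repeat_count: int = 10) -> List[str]:
    """Hypothetical helper: extra gtest args to append when retrying a suite.

    Assumption: an empty failed_tests list means the suite was marked failed
    for a reason other than a test failure (e.g. an infra issue).
    """
    if not failed_tests:
        # No filter is possible, so the retry reruns every test.
        # Do not multiply that cost with --gtest_repeat.
        return []
    return [
        # gtest accepts a colon-separated list of test name patterns.
        "--gtest_filter=" + ":".join(failed_tests),
        "--gtest_repeat=%d" % repeat_count,
    ]
```

With this shape, a retry with known failures still gets the filter plus the repeat count, while a retry with no known failures adds nothing.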

https://bugs.chromium.org/p/chromium/issues/detail?id=910706
Open question I left on the bug:
"""
2) If test results are valid, and there are no failures, it's possible we should mark the test suite as a success even if there's an infra failure?
 """
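
To make the open question concrete, here is a minimal sketch of the two possible policies. suite_passed and the forgive_infra_failure flag are hypothetical and do not correspond to anything in the recipes; the default models the behavior described in this bug, where an infra failure fails the suite even though the results are valid and nothing failed.

```python
from typing import Sequence


def suite_passed(
    results_valid: bool,
    failed_tests: Sequence[str],
    infra_failure: bool,
    forgive_infra_failure: bool = False,
) -> bool:
    """Hypothetical decision function for a test suite's final status."""
    if not results_valid:
        # Without valid results there is nothing to base a success on.
        return False
    if failed_tests:
        # Real test failures always fail the suite.
        return False
    if infra_failure:
        # The open question: valid results, no failures, but an infra issue
        # (e.g. a leftover process). Only pass if the policy forgives it.
        return forgive_infra_failure
    return True
```

Under the default policy the build in this bug would still fail and trigger a retry; with forgive_infra_failure=True it would be treated as a pass.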