Issue 900942

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Single swarming shard failure causes all tests to fail

Project Member Reported by erikc...@chromium.org, Nov 1

Issue description

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/128076


browser_tests has 10 shards. One of them hit an internal swarming failure, and as a result every test in the suite was marked as failed. :(
 
Components: Infra>Platform>Swarming
I'm not sure we can practically recover from bot death w/ partial test results.
We could rerun the failing shard [but nothing else] at the recipe layer, before moving on to 'retry without patch'.
The recipe changes needed for that are probably nontrivial, but I guess it could be done. We'd want to be careful of how lethal tasks are handled, though.
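For illustration, the shard-level retry suggested above might look roughly like this. It is a minimal sketch in plain Python, not actual recipe code: `ShardResult`, `run_shard`, `retry_infra_failed_shards`, and the `COMPLETED` state name are hypothetical, and only `BOT_DIED` comes from this thread.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# State names are assumptions for this sketch; only BOT_DIED appears in the thread.
COMPLETED = "COMPLETED"
BOT_DIED = "BOT_DIED"


@dataclass
class ShardResult:
    index: int
    state: str
    failed_tests: List[str] = field(default_factory=list)


def is_infra_failure(result: ShardResult) -> bool:
    """A shard that never completed has no trustworthy test results."""
    return result.state != COMPLETED


def retry_infra_failed_shards(
    results: List[ShardResult],
    run_shard: Callable[[int], ShardResult],
    max_retries: int = 1,
) -> List[ShardResult]:
    """Rerun only the shards that hit an infra failure; keep every other result."""
    final = {r.index: r for r in results}
    for _ in range(max_retries):
        broken = [i for i, r in final.items() if is_infra_failure(r)]
        if not broken:
            break
        for i in broken:
            # run_shard is a hypothetical callback that retriggers one shard.
            final[i] = run_shard(i)
    return [final[i] for i in sorted(final)]
```

The `max_retries` cap is there so that a shard which keeps dying is not retriggered indefinitely; the results of the healthy shards are reused as-is.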
Another example, this time on android for chrome_public_test_apk:

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-kitkat-arm-rel/115418

One of the shards failed with status: BOT_DIED and produced no outputs.

Later in 'retry with patch', we appear to have had some flaky failures, thus causing the build to incorrectly fail.

> We'd want to be careful of how lethal tasks are handled, though.

Could you clarify -- what is a lethal task? Does that mean a task that is irrecoverable and expected to fail again on retry?
A task that kills its worker. A good example was the macOS browser_tests that killed the host. Same happened with telemetry on Windows.
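One possible guard against lethal tasks, sketched under assumptions: cap how many bots a single shard is allowed to take down before giving up on it. `run_with_bot_death_cap`, the `run_shard` callback, and the `SUSPECTED_LETHAL` marker are hypothetical names for this example, not part of Swarming or the recipes.

```python
from typing import Callable, Tuple

BOT_DIED = "BOT_DIED"
# Hypothetical marker for "stop retrying, this task seems to kill its worker".
SUSPECTED_LETHAL = "SUSPECTED_LETHAL"


def run_with_bot_death_cap(
    run_shard: Callable[[int], str],
    shard_index: int,
    max_bot_deaths: int = 2,
) -> Tuple[str, int]:
    """Run a shard, retrying on BOT_DIED, but stop once the cap is reached.

    Returns the final state and the number of bot deaths observed.
    """
    deaths = 0
    while True:
        state = run_shard(shard_index)
        if state != BOT_DIED:
            return state, deaths
        deaths += 1
        if deaths >= max_bot_deaths:
            # Repeated bot deaths point at the task rather than at a flaky bot.
            return SUSPECTED_LETHAL, deaths
```

With a cap of 2, a one-off bot death is retried once, while a task that kills its worker every time costs at most two bots.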
Labels: Infra-Platform-Test
