Single swarming shard failure causes all tests to fail
Issue description
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/128076
browser_tests has 10 shards. One of them had an internal swarming failure. As a result, all tests were marked as failed. :(
Nov 1
We could rerun the failing shard [but nothing else] at the recipe layer, before moving on to 'retry without patch'.
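To make the proposal concrete, here is a minimal sketch of "rerun only the failing shard" in Python. It is not actual recipe code: `ShardResult`, `shards_to_rerun`, `run_with_shard_retry`, and the `rerun_shard` callback are hypothetical names, and treating any state other than "COMPLETED" as an infrastructure failure is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ShardResult:
    """Hypothetical model of one shard's outcome (not a real recipe type)."""
    index: int
    state: str  # assumed values: "COMPLETED", "BOT_DIED", ...
    failed_tests: List[str] = field(default_factory=list)


def shards_to_rerun(results: List[ShardResult]) -> List[int]:
    """Return indices of shards that never completed.

    Assumption: a shard whose state is not "COMPLETED" produced no usable
    test output, so it is an infrastructure failure rather than a test failure.
    """
    return [r.index for r in results if r.state != "COMPLETED"]


def run_with_shard_retry(
    results: List[ShardResult],
    rerun_shard: Callable[[int], ShardResult],
) -> List[ShardResult]:
    """Rerun only the shards that died, keeping every other shard's result."""
    merged = {r.index: r for r in results}
    for index in shards_to_rerun(results):
        # Replace the dead shard's result with the result of its single rerun.
        merged[index] = rerun_shard(index)
    return [merged[i] for i in sorted(merged)]
```

In this sketch the merged results would then feed the existing decision about whether 'retry without patch' is needed, so one dead shard no longer marks the other nine as failed.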
Nov 1
The recipe changes needed to do something like that are probably nontrivial, but I guess it could be possible. We'd want to be careful about how lethal tasks are handled, though.
Nov 1
Another example, this time on Android for chrome_public_test_apk: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-kitkat-arm-rel/115418
One of the shards failed with status BOT_DIED and produced no outputs. Later, in 'retry with patch', we appear to have had some flaky failures, which caused the build to fail incorrectly.
Nov 1
> We'd want to be careful of how lethal tasks are handled, though.
Could you clarify: what is a lethal task? Does that mean the task is irrecoverable and is expected to fail on retry?
Nov 5
A task that kills its worker. A good example was the macOS browser_tests that killed the host. The same happened with telemetry on Windows.
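One way to keep a shard-level retry from feeding bots to a lethal task is to cap the number of reruns and stop as soon as the rerun also kills its worker. The sketch below is only an illustration under assumptions: `retry_shard_with_cap`, its `max_attempts` parameter, and the `rerun_shard` callback (which returns a state string) are hypothetical and not part of any real recipe or Swarming API. Only the BOT_DIED status comes from this thread.

```python
from typing import Callable, Tuple


def retry_shard_with_cap(
    index: int,
    rerun_shard: Callable[[int], str],
    max_attempts: int = 1,
) -> Tuple[str, int]:
    """Rerun a dead shard at most max_attempts times.

    Returns (final_state, bots_lost). bots_lost starts at 1 because the
    original attempt already killed one worker.
    """
    bots_lost = 1
    for _ in range(max_attempts):
        state = rerun_shard(index)
        if state != "BOT_DIED":
            # The rerun did not kill its worker, so the task is probably
            # not lethal and the result can be used normally.
            return state, bots_lost
        bots_lost += 1
    # Every rerun killed its worker: treat the task as possibly lethal
    # and give up instead of sacrificing more bots.
    return "BOT_DIED", bots_lost
```

With the default cap of one rerun, a truly lethal task would cost at most two workers per shard before the build reports an infrastructure failure.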
Dec 4
Comment 1 by jbudorick@chromium.org, Nov 1