
Issue 894875

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Single shard failures should not cause full retries

Project Member Reported by erikc...@chromium.org, Oct 12

Issue description

e.g. see https://bugs.chromium.org/p/chromium/issues/detail?id=894637

When a single shard times out or produces invalid results, all tests contained in that shard should be assumed to have "failed". However, successful tests from other shards do not need to be rerun.
 
Components: Infra>Client>Chrome
I suspect that this will be difficult because the recipe has no way of knowing what tests the shard *would/should* have run. This will likely require changing the recipe <-> test runner interface, which is non-trivial. 
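
For illustration only, here is a minimal Python sketch of the retry selection described in this issue. It assumes the recipe could somehow obtain, per shard, the list of tests that shard was supposed to run, which is exactly the information the recipe does not have today. The function name `plan_retries` and both dictionary shapes are hypothetical and do not correspond to any existing recipe or test runner API.

```python
from typing import Dict, List, Optional, Set


def plan_retries(expected: Dict[int, List[str]],
                 results: Dict[int, Optional[Dict[str, bool]]]) -> Set[str]:
    """Return the set of tests that need to be rerun.

    expected: shard index -> tests the shard should have run (hypothetical
        input; the recipe has no way of knowing this at present).
    results: shard index -> {test name: passed}, or None when the shard
        timed out or produced invalid results.
    """
    to_retry: Set[str] = set()
    for shard, tests in expected.items():
        shard_results = results.get(shard)
        if shard_results is None:
            # Invalid shard: assume every test it contained has failed.
            to_retry.update(tests)
            continue
        for test in tests:
            # In a valid shard, rerun only tests that failed or are missing.
            if not shard_results.get(test, False):
                to_retry.add(test)
    return to_retry
```

With two shards where shard 0 reports one failure and shard 1 timed out, only the failing test from shard 0 and the tests assigned to shard 1 would be selected; tests that passed on shard 0 are left alone.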

I suspect that this will help a lot with Android builds especially, since it's common for a single shard to fail due to ADB issues.
martiniss also pointed to bug 394826 earlier, but that seems to cover a different issue.

Another option may be to improve the android test runner's timeout handling. When a shard times out, swarming sends it a SIGTERM, waits for a grace period, then sends a SIGKILL if the test is still running. It looks like the test runner catches the SIGTERM and exits without writing the results json file. (i.e. there's no isolated out for https://chromium-swarm.appspot.com/task?id=407d3fa49b092e10)

That may be a bug in the test runner. I believe fixing it would have lessened the impact of bug 894637, since the recipe would have gotten the full test results for every shard.
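
As a rough sketch of the idea, and not the android test runner's actual code, a SIGTERM handler could flush whatever results exist before exiting, so that the grace period is used to produce a results file. The function names, the flat `{test: status}` JSON layout, and the `"TIMEOUT"` status string are all assumptions made for this example; the real results format is different.

```python
import json
import os
import signal
import sys
import tempfile


def write_results(path, completed, pending):
    """Write a results file, marking tests that never finished as TIMEOUT."""
    results = dict(completed)
    for test in pending:
        # Hypothetical status string; only applied to tests with no result.
        results.setdefault(test, "TIMEOUT")
    # Write to a temp file and rename it, so a later SIGKILL cannot leave a
    # half-written JSON file at the final path.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(results, f)
    os.replace(tmp_path, path)
    return results


def install_sigterm_handler(path, completed, pending):
    """On SIGTERM, flush partial results and then exit."""
    def handler(signum, frame):
        write_results(path, completed, pending)
        sys.exit(128 + signum)
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, handler)
```

The write has to finish within the grace period, which is why the handler does nothing beyond serializing the results it already holds in memory.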
Status: Available (was: Untriaged)
Filed bug 895027 for the android test runner's results on timeout.
Labels: Infra-Platform-Test
