Mitigate invalid test results by retrying the invalid shards only
Issue description (Comment 1 by st...@chromium.org, Dec 20)

For test steps with invalid test results, the optimal mitigation is to rerun only the few Swarming shards that produced invalid results, instead of rerunning the whole test step. More details are in https://docs.google.com/document/d/1m_Q2mD9wXssbTsyEBwB-bz5VTQPASplXxWYxfNfIuzU/edit

Related: bug 917117, about invalid profraw output on the code coverage bot.
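A minimal sketch of what shard-level rerun could look like, assuming hypothetical helpers (rerun_invalid_shards, trigger_shard, collect_result) rather than the real recipe-module APIs: only the shards with invalid results are re-triggered, and their fresh results replace the invalid ones.

# Hypothetical sketch: rerun only the Swarming shards whose results are
# invalid instead of rerunning the whole test step. All names are
# illustrative stand-ins, not actual recipe-module APIs.

def rerun_invalid_shards(shard_results, trigger_shard, collect_result):
    """shard_results: list indexed by shard number; each entry is assumed to
    have .shard_index, .valid (bool), and .test_results."""
    invalid_indices = [r.shard_index for r in shard_results if not r.valid]
    if not invalid_indices:
        return shard_results  # All shards already valid; nothing to do.

    # Re-trigger only the invalid shards; valid shards keep their results.
    tasks = [trigger_shard(i) for i in invalid_indices]
    for task in tasks:
        fresh = collect_result(task)
        if fresh.valid:
            # Replace the invalid entry with the fresh, valid result.
            shard_results[fresh.shard_index] = fresh

    return shard_results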
Dec 21
+ jbudorick Thanks! I think this is a great idea. I think we should also extend this to tasks that timed out and tasks with no output status. If the retries still yield invalid/timed-out/no-status results, then we should skip 'without patch', 'retry with patch', and 'cq build retry'.

Rationale:
* If entire shards are failing because tip of tree is super busted, then we shouldn't be landing any CLs, so there's no point in running 'without patch'. [We could theoretically roll tip of tree, rebuild, and redispatch in the hope that tip of tree has been fixed in the interim... but that's a lot of additional cost for the common case where tip of tree is not horribly broken, with the benefit of only sometimes helping, since it's possible that tip of tree is still broken after the roll.]
* If the shard is repeatedly failing, then there's no point in running 'retry with patch' or 'cq build retry', since we've already performed the relevant retries.

Side notes:
* I think we should redispatch the failing shard multiple times [2 times? 3 times?] onto swarming to minimize the probability that we hit another flaky device [a lot of INVALID_TEST_RESULTS seem to be due to Android device issues]. If a single shard produces valid results, we'll accept those and move on. (See the sketch after this comment.)
* We'll need to bump the CQ build timeout [default = 3 hours] to prevent us from timing out builds that legitimately take > 3 hours -- we'll now be dispatching sets of tasks up to 4 times, and each task can theoretically have pending time up to 1 hour.

stgao: Is your team planning to tackle this? I think this will significantly reduce false rejects and also allow us to remove more expensive retry layers. I expect the implementation should take less than 1 week, and I'm happy to provide assistance.
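A rough sketch of the per-shard retry policy described above, assuming a small fixed number of redispatches and illustrative state names (INVALID_TEST_RESULTS, TIMED_OUT, NO_OUTPUT) rather than the exact swarming task states:

# Hypothetical sketch of the per-shard retry policy suggested above.
# MAX_REDISPATCHES and the state names are illustrative; "2 times? 3 times?"
# is still an open question in the comment.

MAX_REDISPATCHES = 3

def redispatch_until_valid(shard_index, dispatch_and_collect):
    """Redispatch a failing shard onto swarming up to MAX_REDISPATCHES times,
    accepting the first valid result. dispatch_and_collect stands in for the
    swarming trigger + collect steps."""
    BAD_STATES = ('INVALID_TEST_RESULTS', 'TIMED_OUT', 'NO_OUTPUT')
    last = None
    for _ in range(MAX_REDISPATCHES):
        last = dispatch_and_collect(shard_index)
        if last.state not in BAD_STATES:
            return last, True   # First valid result wins; stop retrying.
    # Still invalid/timed out/no output after every redispatch: the caller
    # should skip 'without patch', 'retry with patch', and 'cq build retry'.
    return last, False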
Dec 21
Unless you have monitoring via other means, there's no way to know that tip-of-tree is busted without running the 'without patch' step.

Also, there shouldn't be any CQ builds that are legitimately taking > 3 hours. There's little excuse for a build to take more than one hour absent pending times, and pending times of more than a few minutes should be producing infra alerts. Pending times of an hour would indicate a severe infra failure. Unless we're sure that we have monitoring elsewhere that is catching these sorts of problems, I don't think we should increase the CQ timeouts; the risk of masking real problems seems high.
Dec 21
> Unless you have monitoring via other means, there's no way to know that
> tip-of-tree is busted without running the 'without patch' step.

Why is it helpful in this case to know if tip-of-tree is busted? Example:
* Run task once -- task returns with INVALID_TEST_RESULTS.
* Run task N more times -- task returns with INVALID_TEST_RESULTS.

If tip-of-tree is not busted, then we should mark the build as a failure. If tip-of-tree is busted, we can't mark the build as a success since we straight up failed to run a suite of tests, and it's possible that if we did run them there would have been failures. The only thing we can do is run 'retry with patch' and hope that ToT has since been fixed.

So we're using 'without patch' to provide a single bit of information -- whether we should fail the build, or just run 'retry with patch'. Given that 'without patch' takes about the same amount of time as 'retry with patch', I think that we should either skip the step and just fail the build, or skip the step and run 'retry with patch'.

In general, I think we're doing a lot of extra work in the common case [CL is busted] for a very small probability of positive utility in the case that ToT is busted [ToT has been fixed between 'with patch' and 'retry with patch'].
Dec 21
> Why is it helpful in this case to know if tip-of-tree is busted?

Sorry, I was being unclear here. I meant what I wrote in the general case. In the specific case of INVALID_TEST_RESULTS, you're right, retry w/o patch isn't helpful.

> In general, I think we're doing a lot of extra work in the common case
> [CL is busted], for a very small probability of positive utility in the
> case that ToT is busted

You're right, and that's by design. Historically we've been more than happy to throw hardware at the problem to reduce the amount of work a dev might have to do and to reduce the rate of false rejects. One can argue that we could make better use of the resources, or that we should be saving costs. One could also argue that we should give up and alert on failure faster (so a dev can get notification of a likely failure sooner), but as we've discussed before, you could do that without having to stop retrying.
Jan 7
To clarify: you're suggesting rerunning the shards in a phase prior to the "without patch" and "retry with patch" phases? I think implementing this is nontrivial (i.e., I don't agree with the 1-week estimate, as I think getting the recipe logic right will be tricky), but I agree with the idea in principle. I do have concerns about adding another retry phase for timed out shards (as suggested in #2), as I think that means tying up capacity in retrying horribly broken things.
Jan 8
> To clarify: you're suggesting rerunning the shards in a phase prior to the
> "without patch" and "retry with patch" phases?

Yes.

> I do have concerns about adding another retry phase for timed out shards
> (as suggested in #2), as I think that means tying up capacity in retrying
> horribly broken things.

Actually, I think this will reduce capacity usage. We end up performing the same # of retries [sometimes fewer], just at earlier, cheaper stages. When things are horribly broken, this feature will also allow us to early exit, because we know the problem is likely not caused by device issues and is caused by the CL itself.

Case #1: CL causes all shards to time out.

Current behavior:
* 'with_patch': All shards time out.
* 'without_patch': All shards pass.
* 'retry_with_patch': All shards time out.

Proposed behavior:
* 'with_patch': All shards time out.
* 'swarm_level_retry': All shards time out. [Early exit, since we've confirmed the time-outs are not caused by device issues. Even if the timeouts are caused by ToT, we have no way of knowing whether the CL in question introduced new failures.]

Case #2: CL causes one shard to time out.

Current behavior:
* 'with_patch': One shard times out. Depending on whether the test suite supports graceful timeouts, we may attempt to retry only NOTRUN tests, or we may retry all tests, including those in successful shards.
* 'without_patch': All NOTRUN tests pass.
* 'retry_with_patch': [Exact behavior depends on the number of tests being rerun. We may be rerunning all NOTRUN/FAILED tests 10 times.]

Proposed behavior:
* 'with_patch': One shard times out.
* 'swarm_level_retry': One shard times out. [Early exit, since we've confirmed the time-outs are not caused by device issues.]

In the cases where the time-outs are caused by device issues, this feature also saves a lot of capacity.
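To make the proposed ordering concrete, here is a hedged sketch of the control flow; run_phase, rerun_bad_shards, and run_existing_retry_phases are hypothetical stand-ins, not the actual chromium_tests recipe code.

# Hypothetical sketch of the proposed phase ordering with early exit.

def run_cq_test_step(run_phase, rerun_bad_shards, run_existing_retry_phases):
    """'with_patch' -> shard-level retry -> either early exit or the existing
    'without_patch' / 'retry_with_patch' phases."""
    with_patch = run_phase('with_patch')

    bad_shards = [s for s in with_patch.shards if s.timed_out or not s.valid]
    if bad_shards:
        # Proposed 'swarm_level_retry' phase: redispatch only the bad shards.
        # rerun_bad_shards is assumed to update the shard objects in place
        # with any fresh, valid results.
        rerun_bad_shards(bad_shards)
        if any(s.timed_out or not s.valid for s in bad_shards):
            # Early exit: repeated shard-level failure means the problem is
            # not a flaky device, so skip 'without_patch', 'retry_with_patch',
            # and the CQ build retry, and fail the build.
            return 'FAILURE'

    if not with_patch.failed_tests:
        return 'SUCCESS'

    # Only reached when every shard produced valid results; the existing
    # forgiveness phases run unchanged (details elided here).
    return run_existing_retry_phases(with_patch)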
Jan 10
Ah, I missed the early exit part of the proposal. This SGTM.