Graceful handling of INVALID_TEST_RESULTS |
||||
Issue descriptionThere are several reasons why a build might return INVALID_TEST_RESULTS. [expiration, timeout, device issues] -- https://bugs.chromium.org/p/chromium/issues/detail?id=881991 flakiness in webkit_layout_test test_runner [not the tests, the test runner] -- https://bugs.chromium.org/p/chromium/issues/detail?id=884776 We should fix these. However, some of these [e.g. expiration/timeout] we will never be able to fully eliminate. In these cases, we still want graceful handling of INVALID_TEST_RESULTS. Right now, a single instance of INVALID_TEST_RESULTS immediately causes the CQ tryjob to fail. This then triggers the CQ full retry mechanism which retries everything. Instead, we should only retry the test suite which returned INVALID_TEST_RESULTS. There are a couple of ways that we can implement this. Given that I just implemented 'retry with patch', which has a very similar purpose, I think that reusing the logic to also handle this case is the right approach. This causes minimal additional code complexity. The high-level logic for tryjob recipe will be: 1) apply patch. run tests. * if specific tests fail, go to (2) * If INVALID_TEST_RESULTS, go to (3) 2) Deapply patch. run failing tests. 3) Reapply patch + roll checkout. Run failing tests [or whole test suite in case of INVALID_TEST_RESULTS]
,
Sep 19
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/684e7a5edfa5436f11f2e7dd243470037b5b0053 commit 684e7a5edfa5436f11f2e7dd243470037b5b0053 Author: erikchen <erikchen@chromium.org> Date: Wed Sep 19 13:12:28 2018 Tryjobs should gracefully handle invalid test results. Previously, invalid test results would immediately cause the tryjob to fail. This CL makes the tryjob attempt to gracefully recover from invalid test results by retrying the full test suite [assuming that 'retry with patch' is enabled]. Example of tryjob handling of invalid test results without patch: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/87769 Example of tryjob handling of invalid test results with patch [using led]: https://ci.chromium.org/swarming/task/4007120b1c8ae910?server=chromium-swarm.appspot.com Bug: 884882 Change-Id: Ic75dcb557d34a28591df4deef0c3e47eea242d60 Reviewed-on: https://chromium-review.googlesource.com/1231036 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Reviewed-by: Stephen Martinis <martiniss@chromium.org> [modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipes/blink_downstream.py [add] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_invalid_without_patch_results.json [modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/steps.py [modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/README.recipes.md [add] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_invalid_initial_test_results.json [modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/test_utils/api.py [modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_recipes.json [modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/api.py [modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.py
,
Sep 19
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/2cb3699e2517ee6387d915130dee4a5d4c68ac47 commit 2cb3699e2517ee6387d915130dee4a5d4c68ac47 Author: John Budorick <jbudorick@chromium.org> Date: Wed Sep 19 20:39:47 2018 Revert new invalid test result logic. Broke retries, e.g. http://shortn/_89kJJ5EvYr This reverts commit 684e7a5edfa5436f11f2e7dd243470037b5b0053. This reverts commit 408e6e9efe32ad3798e922bacb63e4de33e706e8. TBR=erikchen@chromium.org,martiniss@chromium.org Bug: 884882 Change-Id: Ie13dafe94c36ab113dd15384e7d0befc8f7bc32f Reviewed-on: https://chromium-review.googlesource.com/1234695 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: John Budorick <jbudorick@chromium.org> [modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipes/blink_downstream.py [modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/steps.py [modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/README.recipes.md [modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/test_utils/api.py [modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_recipes.json [modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/api.py [modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.py
,
Sep 19
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/recipes-py/+/ec1f045f324fd25c985a60cdcddb913f82da0490 commit ec1f045f324fd25c985a60cdcddb913f82da0490 Author: erikchen <erikchen@chromium.org> Date: Wed Sep 19 21:56:56 2018 Add AnnotationContains post processing step Bug: 884882 Change-Id: If5943771c99143a2febbb4b4af6ad3795b48aa8e Reviewed-on: https://chromium-review.googlesource.com/1234679 Reviewed-by: Robbie Iannucci <iannucci@chromium.org> Commit-Queue: Erik Chen <erikchen@chromium.org> [modify] https://crrev.com/ec1f045f324fd25c985a60cdcddb913f82da0490/recipe_engine/post_process.py
,
Sep 20
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/152247a198eae12aaa4191514233ca2ee5da37ba commit 152247a198eae12aaa4191514233ca2ee5da37ba Author: erikchen <erikchen@chromium.org> Date: Thu Sep 20 16:34:27 2018 [Reland #1] Tryjobs should gracefully handle invalid test results. The first CL failed because the new function failures_or_invalid_results() sometimes returned a set [for gtests] and sometimes return a dict [for non-gtests]. Calling code assumed that failures_or_invalid_results returned a set, resulting in a runtime error. This reland modifies failures_or_invalid_results to always return a set and adds tests. > Previously, invalid test results would immediately cause the tryjob to fail. > This CL makes the tryjob attempt to gracefully recover from invalid test results > by retrying the full test suite [assuming that 'retry with patch' is enabled]. > > Example of tryjob handling of invalid test results without patch: > https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/87769 > > Example of tryjob handling of invalid test results with patch [using led]: > https://ci.chromium.org/swarming/task/4007120b1c8ae910?server=chromium-swarm.appspot.com > > Bug: 884882 > Change-Id: Ic75dcb557d34a28591df4deef0c3e47eea242d60 > Reviewed-on: https://chromium-review.googlesource.com/1231036 > Commit-Queue: Erik Chen <erikchen@chromium.org> > Reviewed-by: John Budorick <jbudorick@chromium.org> > Reviewed-by: Stephen Martinis <martiniss@chromium.org> Bug: 884882 Change-Id: Ia6ffcd79173aac5c3b40db0293befd383428ea8a Reviewed-on: https://chromium-review.googlesource.com/1234437 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: Stephen Martinis <martiniss@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> [modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipes/blink_downstream.py [modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/steps.py [modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/README.recipes.md [modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/test_utils/api.py [modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_recipes.json [modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/api.py [modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.py
,
Sep 20
,
Sep 21
Is there some way that the infrastructure will know that we're getting invalid results and that some trooper should act on it? The original intent was that getting INVALID_TEST_RESULTS was *always* indicative of something being wrong in the infrastructure (i.e., it should never normally happen), and so the recipe should've produced an infra failure, not a normal failure to retry. If it's the case that most instances of this are recoverable and so we should recover from them, that seems reasonable (as long as you're not making the infrastructure problem worse by doing the retries), but I worry about losing the diagnostic.
,
Sep 21
> I worry about losing the diagnostic. We still emit that bit of information: https://chromium-review.googlesource.com/c/chromium/tools/build/+/1231036/7/scripts/slave/recipe_modules/test_utils/api.py#223 If there are supposed to be people looking into steps that emit INVALID_TEST_RESULTS, then we should follow up off-thread, as that doesn't appear to be happening. > The original intent was that getting INVALID_TEST_RESULTS was *always* indicative of something being wrong in the infrastructure (i.e., it should never normally happen), and so the recipe should've produced an infra failure, not a normal failure to retry. It's not clear to me why we shouldn't retry on infra failure. Troopers should still be able to look into steps that emit INVALID_TEST_RESULTS without needing to cause some number of CLs to unnecessarily fail [and not retry].
,
Sep 21
>> I worry about losing the diagnostic. > > We still emit that bit of information: > > https://chromium-review.googlesource.com/c/chromium/tools/build/+/1231036/7/scripts/slave/recipe_modules/test_utils/api.py#223 > > If there are supposed to be people looking into steps that emit INVALID_TEST_RESULTS, then we should follow up > off-thread, as that doesn't appear to be happening. "should be" and "are" are two different things. I agree that we're not doing this at all well today. > It's not clear to me why we shouldn't retry on infra failure. I didn't mean to say that we shouldn't retry. You're right, we probably should retry in most cases. What I was trying to say is given a hypothetical choice between classifying something as an infra failure vs. masking it as a normal failure, one might want to do the former. One can also argue that we should build the monitoring infrastructure that would detect series of "normal failures" and direct them to infra folks as appropriate, as well, since there will likely always be some infra-type failures that only manifest themselves via normal tests. It's unclear if specially classifying known infra issues differently actually helps much; this seems to be never-ending debate as to how much work we should put into this topic.
,
Dec 4
,
Dec 5
|
||||
►
Sign in to add a comment |
||||
Comment 1 by erikc...@chromium.org
, Sep 18