New issue
Advanced search Search tips

Issue 884882 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Sep 20
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 881991
issue 882969



Sign in to add a comment

Graceful handling of INVALID_TEST_RESULTS

Project Member Reported by erikc...@chromium.org, Sep 17

Issue description

There are several reasons why a build might return INVALID_TEST_RESULTS.

[expiration, timeout, device issues] -- https://bugs.chromium.org/p/chromium/issues/detail?id=881991

flakiness in webkit_layout_test test_runner [not the tests, the test runner] -- https://bugs.chromium.org/p/chromium/issues/detail?id=884776

We should fix these. However, some of these [e.g. expiration/timeout] we will never be able to fully eliminate. In these cases, we still want graceful handling of INVALID_TEST_RESULTS. Right now, a single instance of INVALID_TEST_RESULTS immediately causes the CQ tryjob to fail. This then triggers the CQ full retry mechanism which retries everything.

Instead, we should only retry the test suite which returned INVALID_TEST_RESULTS. There are a couple of ways that we can implement this. Given that I just implemented 'retry with patch', which has a very similar purpose, I think that reusing the logic to also handle this case is the right approach. This causes minimal additional code complexity.

The high-level logic for tryjob recipe will be:
1) apply patch. run tests.
  * if specific tests fail, go to (2)
  * If INVALID_TEST_RESULTS, go to (3)
2) Deapply patch. run failing tests.
3) Reapply patch + roll checkout. Run failing tests [or whole test suite in case of INVALID_TEST_RESULTS]


 
Blocking: 882969
Project Member

Comment 2 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/684e7a5edfa5436f11f2e7dd243470037b5b0053

commit 684e7a5edfa5436f11f2e7dd243470037b5b0053
Author: erikchen <erikchen@chromium.org>
Date: Wed Sep 19 13:12:28 2018

Tryjobs should gracefully handle invalid test results.

Previously, invalid test results would immediately cause the tryjob to fail.
This CL makes the tryjob attempt to gracefully recover from invalid test results
by retrying the full test suite [assuming that 'retry with patch' is enabled].

Example of tryjob handling of invalid test results without patch:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/87769

Example of tryjob handling of invalid test results with patch [using led]:
https://ci.chromium.org/swarming/task/4007120b1c8ae910?server=chromium-swarm.appspot.com

Bug:  884882 
Change-Id: Ic75dcb557d34a28591df4deef0c3e47eea242d60
Reviewed-on: https://chromium-review.googlesource.com/1231036
Commit-Queue: Erik Chen <erikchen@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>

[modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipes/blink_downstream.py
[add] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_invalid_without_patch_results.json
[modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/steps.py
[modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/README.recipes.md
[add] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_invalid_initial_test_results.json
[modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/test_utils/api.py
[modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_recipes.json
[modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/api.py
[modify] https://crrev.com/684e7a5edfa5436f11f2e7dd243470037b5b0053/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.py

Project Member

Comment 3 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/2cb3699e2517ee6387d915130dee4a5d4c68ac47

commit 2cb3699e2517ee6387d915130dee4a5d4c68ac47
Author: John Budorick <jbudorick@chromium.org>
Date: Wed Sep 19 20:39:47 2018

Revert new invalid test result logic.

Broke retries, e.g. http://shortn/_89kJJ5EvYr

This reverts commit 684e7a5edfa5436f11f2e7dd243470037b5b0053.
This reverts commit 408e6e9efe32ad3798e922bacb63e4de33e706e8.

TBR=erikchen@chromium.org,martiniss@chromium.org

Bug:  884882 
Change-Id: Ie13dafe94c36ab113dd15384e7d0befc8f7bc32f
Reviewed-on: https://chromium-review.googlesource.com/1234695
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: John Budorick <jbudorick@chromium.org>

[modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipes/blink_downstream.py
[modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/steps.py
[modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/README.recipes.md
[modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/test_utils/api.py
[modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_recipes.json
[modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/api.py
[modify] https://crrev.com/2cb3699e2517ee6387d915130dee4a5d4c68ac47/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.py

Project Member

Comment 4 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/recipes-py/+/ec1f045f324fd25c985a60cdcddb913f82da0490

commit ec1f045f324fd25c985a60cdcddb913f82da0490
Author: erikchen <erikchen@chromium.org>
Date: Wed Sep 19 21:56:56 2018

Add AnnotationContains post processing step

Bug:  884882 
Change-Id: If5943771c99143a2febbb4b4af6ad3795b48aa8e
Reviewed-on: https://chromium-review.googlesource.com/1234679
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Erik Chen <erikchen@chromium.org>

[modify] https://crrev.com/ec1f045f324fd25c985a60cdcddb913f82da0490/recipe_engine/post_process.py

Project Member

Comment 5 by bugdroid1@chromium.org, Sep 20

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/152247a198eae12aaa4191514233ca2ee5da37ba

commit 152247a198eae12aaa4191514233ca2ee5da37ba
Author: erikchen <erikchen@chromium.org>
Date: Thu Sep 20 16:34:27 2018

[Reland #1] Tryjobs should gracefully handle invalid test results.

The first CL failed because the new function failures_or_invalid_results()
sometimes returned a set [for gtests] and sometimes return a dict [for
non-gtests]. Calling code assumed that failures_or_invalid_results returned a
set, resulting in a runtime error.

This reland modifies failures_or_invalid_results to always return a set and adds
tests.

> Previously, invalid test results would immediately cause the tryjob to fail.
> This CL makes the tryjob attempt to gracefully recover from invalid test results
> by retrying the full test suite [assuming that 'retry with patch' is enabled].
>
> Example of tryjob handling of invalid test results without patch:
> https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/87769
>
> Example of tryjob handling of invalid test results with patch [using led]:
> https://ci.chromium.org/swarming/task/4007120b1c8ae910?server=chromium-swarm.appspot.com
>
> Bug:  884882 
> Change-Id: Ic75dcb557d34a28591df4deef0c3e47eea242d60
> Reviewed-on: https://chromium-review.googlesource.com/1231036
> Commit-Queue: Erik Chen <erikchen@chromium.org>
> Reviewed-by: John Budorick <jbudorick@chromium.org>
> Reviewed-by: Stephen Martinis <martiniss@chromium.org>

Bug:  884882 
Change-Id: Ia6ffcd79173aac5c3b40db0293befd383428ea8a
Reviewed-on: https://chromium-review.googlesource.com/1234437
Commit-Queue: Erik Chen <erikchen@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>

[modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipes/blink_downstream.py
[modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/steps.py
[modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/README.recipes.md
[modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/test_utils/api.py
[modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.expected/enable_retry_with_patch_recipes.json
[modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/api.py
[modify] https://crrev.com/152247a198eae12aaa4191514233ca2ee5da37ba/scripts/slave/recipe_modules/chromium_tests/tests/api/run_tests_on_tryserver.py

Status: Fixed (was: Assigned)
Is there some way that the infrastructure will know that we're getting invalid results and that some trooper should act on it? 

The original intent was that getting INVALID_TEST_RESULTS was *always* indicative of something being wrong in the infrastructure (i.e., it should never normally happen), and so the recipe should've produced an infra failure, not a normal failure to retry.

If it's the case that most instances of this are recoverable and so we should recover from them, that seems reasonable (as long as you're not making the infrastructure problem worse by doing the retries), but I worry about losing the diagnostic.
> I worry about losing the diagnostic.

We still emit that bit of information:

https://chromium-review.googlesource.com/c/chromium/tools/build/+/1231036/7/scripts/slave/recipe_modules/test_utils/api.py#223

If there are supposed to be people looking into steps that emit INVALID_TEST_RESULTS, then we should follow up off-thread, as that doesn't appear to be happening. 


> The original intent was that getting INVALID_TEST_RESULTS was *always* indicative of something being wrong in the infrastructure (i.e., it should never normally happen), and so the recipe should've produced an infra failure, not a normal failure to retry.

It's not clear to me why we shouldn't retry on infra failure. Troopers should still be able to look into steps that emit INVALID_TEST_RESULTS without needing to cause some number of CLs to unnecessarily fail [and not retry].
>> I worry about losing the diagnostic.
>
> We still emit that bit of information:
>
> https://chromium-review.googlesource.com/c/chromium/tools/build/+/1231036/7/scripts/slave/recipe_modules/test_utils/api.py#223
>
> If there are supposed to be people looking into steps that emit INVALID_TEST_RESULTS, then we should follow up 
> off-thread, as that doesn't appear to be happening. 

"should be" and "are" are two different things. I agree that we're not doing this at all well today.

> It's not clear to me why we shouldn't retry on infra failure. 

I didn't mean to say that we shouldn't retry. You're right, we probably should retry in most cases. What I was trying to say is given a hypothetical choice between classifying something as an infra failure vs. masking it as a normal failure, one might want to do the former.

One can also argue that we should build the monitoring infrastructure that would detect series of "normal failures" and direct them to infra folks as appropriate, as well, since there will likely always be some infra-type failures that only manifest themselves via normal tests. It's unclear if specially classifying known infra issues differently actually helps much; this seems to be never-ending debate as to how much work we should put into this topic.
Labels: Infra-Platform-Test
Blocking: 881991

Sign in to add a comment