Issue 783263

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocked on:
issue 835490




Flaky results of gpu pixel tests not shown on dashboard when marked as Flaky

Project Member Reported by xlai@chromium.org, Nov 9 2017

Issue description

Sometimes a test is flaky and we want to hide it on the bots, because otherwise it causes disruptions for other people who are committing patches to the CQ. But developers want to continue to monitor its flakiness for debugging purposes.

The current setup in the webkit layout tests works like this: when a developer marks a test as "Flaky", the test is still run and its result (success, failure, or timeout) is honestly shown on the dashboard https://test-results.appspot.com/dashboards/flakiness_dashboard.html; but it does not turn the bot red and so does not affect the CQ.

I notice that the GPU pixel tests do not do the same. I have a test ("gpu_tests.pixel_integration_test.PixelIntegrationTest.Pixel_OffscreenCanvasTransferBeforeStyleResize", see https://bugs.chromium.org/p/chromium/issues/detail?id=735228) that is flaky and was marked as "Flaky" in test_expectations.py. After a few months I noticed that the dashboard was showing all green rectangles, so I enabled the test. This quickly manifested itself as quite a few flaky results and triggered the sheriff to revert the change. This causes a lot of inconvenience for my debugging, as I cannot reproduce the flaky result locally and would have to re-enable the test to verify that a speculative fix works.
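For reference, the way the test is marked today looks roughly like this (a sketch only; the class layout follows gpu_tests.gpu_test_expectations, but the exact file contents may differ):

  # Sketch of marking a test Flaky in the GPU expectations files; the
  # class layout is illustrative rather than the exact file contents.
  from gpu_tests.gpu_test_expectations import GpuTestExpectations

  class PixelExpectations(GpuTestExpectations):
    def SetExpectations(self):
      # Retried on failure without turning the bot red, but the retries
      # happen inside the GPU harness, so the flaky results never reach
      # the flakiness dashboard.
      self.Flaky('Pixel_OffscreenCanvasTransferBeforeStyleResize',
                 bug=735228)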

Ned@: What do you think the correct way to handle this should be?

 

Comment 1 by kbr@chromium.org, Nov 9 2017

Cc: -nednguyen@chromium.org perezju@chromium.org nedngu...@google.com
CC'ing Ned's preferred address.

Comment 2 by kbr@chromium.org, Nov 9 2017

Cc: dpranke@chromium.org
Components: Internals>GPU>Testing
Note that the top-level harness for the tests is:
https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/testing/run_browser_tests.py

and it's actually using Dirk's typ harness under the hood at this point.

typ definitely supports this functionality since it's used by the layout tests. Telemetry's serially_executed_browser_test_case harness doesn't know about test expectations and retries, but if we added a way for the test to call back into the harness and report that it was flaky, the resulting JSON file could contain that information.
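For example, a hook along these lines might be enough (purely hypothetical names; nothing like this exists in the harness today):

  # Purely hypothetical sketch; no such hook exists in the harness today.
  from telemetry.testing import serially_executed_browser_test_case

  class PixelIntegrationTest(
      serially_executed_browser_test_case.SeriallyExecutedBrowserTestCase):

    def RunActualGpuTest(self, url, *args):
      if not self._RunPixelTestOnce(url):    # illustrative helper
        # Report the failed attempt so it can be written into the output
        # JSON, even though the test goes on to retry and may pass.
        self.ReportFlakyFailure(self.id())   # hypothetical hook
        self._RunPixelTestOnce(url)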

Some quick code searching shows that this is where the JSON results are written:
https://cs.chromium.org/chromium/src/third_party/catapult/third_party/typ/typ/runner.py?l=189

So we should be able to trace back and figure out where and how the results are populated.

I am thinking that support for test expectations & retries should probably be moved into typ instead.

Comment 4 by kbr@chromium.org, Nov 10 2017

Maybe, but after the experience of first putting these test expectations inside Telemetry and then taking the functionality back out, I am reluctant to tie the GPU tests very strongly to typ.

Open to the possibility, though, and especially to advice on exactly how this would work. We need to be able to evaluate the test expectations with respect to the exact GPU on the system, which requires using Telemetry to start up the browser and query the SystemInfo from it.

Cc: rnep...@chromium.org seanmccullough@chromium.org charliea@chromium.org
Ken: the context is that we are aiming to unify the test expectation mechanism in a way that is similar to the layout tests' expectation file. It would allow easy integration of features like disabling tests through a UI on Sheriff-o-Matic (see go/chops-tada).

I think the main challenge here is how to make the expectation conditions easily extensible, since different frameworks have different sets of conditions they regularly deal with. Coming up with one giant set of expectation conditions that works for all tests in Chrome probably doesn't work. GPU tests usually deal with GPU drivers (e.g. https://cs.chromium.org/chromium/src/content/test/gpu/gpu_tests/webgl_conformance_expectations.py).
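For example, a typical GPU expectation entry conditions on OS, GPU vendor/device, and graphics backend, roughly like this (the test path, condition tags, device id, and bug number below are all placeholders):

  # Roughly the shape of an entry in a GPU expectations file; every value
  # shown here is a placeholder.
  class WebGLConformanceExpectations(GpuTestExpectations):
    def SetExpectations(self):
      self.Fail('conformance/some/test.html',
                ['win', ('nvidia', 0x1023), 'd3d9'], bug=123)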

cc'ing Sean, Randy, and Charlie, who are driving the definition of the test-disabling expectation pipeline.
dpranke@ can confirm this, but based on my current understanding, I don't think that there's any real reason that we need to be bound to a fixed list of test conditions by which various tests can be disabled: in other words, I think that it's probably possible for the conditions to vary depending on the test framework (layout tests, Telemetry, cc tests). 

At this point, the plan (which still isn't finalized) is for Telemetry to read in the test expectations file and decide for itself whether any of the tests it was told to run should actually be disabled. Combine that with some standard way of telling Sheriff-o-Matic about all of the possible test conditions on which the user can disable, and that should be sufficient to allow different harnesses to support different conditions.

I'm not familiar enough with how typ/Telemetry report flaky versus failed versus skipped tests to contribute much to that discussion, other than that I know the layout test format - around which our current one-click disabling format is being built - supports the notion of flaky tests as opposed to skipped tests. (Sounds like you already know this though.)

Comment 7 by kbr@chromium.org, Nov 10 2017

Thanks for the input. For the purposes of this bug, let's concretely focus on how, in this particular test harness, we can properly report flaky results to the typ harness and thereby to the flakiness dashboard. The goal of a unified test expectations format is a good one, but it is a way off and we should not block local progress on it.

It is not currently possible to mark a test as flaky using typ; typ has no concept of expected results for tests.

(The comment above about the layout tests using typ was wrong; they don't use typ. typ was a new code base that uses much the same design as the layout tests, but there's currently no code overlap. That's something we should change, but it's not trivial to do so).

https://bit.ly/chromium-test-list-format has the current proposal we're working on for a more unified format, but that's all it is at the moment, no one has yet signed up to do the work AFAIK.
Hmmm, is that the right link Dirk? Looks like it might be broken.
it should be? It works for me when I click on it. What happens for you?

Comment 11 by kbr@chromium.org, Nov 14 2017

Ah, I misunderstood the situation with the layout tests and typ.

Would it be technically possible to make a change to Telemetry and typ to add an API so that the currently running test can cooperatively report a failure, without aborting the test run?

It seems that it should be technically possible; in _run_one_set in src/third_party/typ/typ/runner.py, for example, the ResultSet is available.

I see that src/third_party/typ/typ/json_results.py collapses and hides failures that come just before a success (see, for example, failed_test_names in that file). Would it be technically feasible to not do that, and to report both the pass and the failure? Isn't that what the layout test harness does, and why flaky tests still show up on the flakiness dashboard?
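In other words, the results entry for a flaky-but-ultimately-passing test could keep every attempt instead of only the final pass, something like (simplified sketch, written here as a Python dict):

  # Simplified sketch of one test's entry in the output JSON if both the
  # failing attempt and the passing retry were reported.
  results_entry = {
      'expected': 'PASS',
      'actual': 'FAIL PASS',   # every attempt, in order
      'is_unexpected': False,  # the final PASS matches the expectation
  }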

Sorry for the delay in responding.

I'm not sure what you mean by "cooperatively report a failure, without aborting the test run". How is that different from "reporting a failure"?

In typ, if a test is retried, all of the attempts should already be included in the results.json; this is indeed what the layout tests do as well.

It may be that we're not integrating with typ in such a way that the retry data is being handed to it properly; we'd have to look at the code to see.

Comment 13 by kbr@chromium.org, Nov 21 2017

The retries are being hidden from typ. They're handled in the gpu_integration_test.py harness here:
https://cs.chromium.org/chromium/src/content/test/gpu/gpu_tests/gpu_integration_test.py?type=cs&q=_runGputest&sq=package:chromium&l=117

because it's only this harness that understands our test expectations, for example being able to mark a test flaky on a particular GPU configuration.

I'd like to be able to call into typ and tell it that it retried the currently running test N times. Is that possible?

I think it's a bit dangerous for a test to directly tell typ how many times it was retried. Ideally, we would rely on typ to handle test retrying instead of delegating it to gpu_integration_test.py.

The tricky bit is that clients need to be able to implement test expectations (when to retry) in their own way. Let me think a bit about how to form an API contract here.
The core problem here is that typ's retry logic is very primitive: https://github.com/dpranke/typ/blob/master/typ/runner.py#L470 (either you retry all failed tests or you don't).

I think what Ken needs here is a way for the test to dynamically tell typ whether a failure should be retried or not. I propose adding something like a self.needRetry() method that allows an individual test case to dynamically tell typ that it will need to be retried later.
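Something with roughly this shape (hypothetical sketch; typ has no such API today):

  # Rough sketch of the proposed hook; hypothetical, typ has no such API.
  from gpu_tests import gpu_integration_test

  class PixelIntegrationTest(gpu_integration_test.GpuIntegrationTest):
    def RunActualGpuTest(self, url, *args):
      try:
        self._RunPixelTest(url)                    # illustrative test body
      except Exception:
        if self._ExpectationFor(url) == 'flaky':   # illustrative lookup
          # Tell typ to record this failure and retry the test itself,
          # rather than retrying inside the GPU harness.
          self.needRetry()
        raise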

Dirk: wdyt?
Owner: dpranke@chromium.org
Status: Assigned (was: Available)
Hm, yeah, there's not a good way to control / specify retry logic outside of typ today.

Let me think about this and suggest something ...
Owner: kbr@chromium.org
Having thought about this a bit more ... one obvious thing to do is allow you to specify test expectations dynamically up front, and we want/need to add this to typ regardless.

However, I'm not sure that that will actually do what you want. Would that work, or do you have some mechanism to dynamically determine, after running the test the first time, whether there's any point in retrying or not?

(please reassign back when answered).

Comment 18 by kbr@chromium.org, Feb 3 2018

Owner: dpranke@chromium.org
It could theoretically work to call into typ to specify test expectations up front, but there's a problem. We can only evaluate the GPU test expectations once the browser is brought up and we know what GPU type is in it. See:

https://cs.chromium.org/chromium/src/content/test/gpu/gpu_tests/test_expectations.py?sq=package:chromium&l=224

https://cs.chromium.org/chromium/src/content/test/gpu/gpu_tests/gpu_test_expectations.py?sq=package:chromium&l=100

But Telemetry's serially_executed_browser_test_case dynamically generates the list of tests very early, before the browser is brought up. So we can't evaluate the test expectations for those tests until later, when the first one is run.
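To spell out the ordering problem (a simplified sketch; the helper names are illustrative, the real lookups are at the links above):

  # Simplified sketch of the ordering problem; helper names are illustrative.
  # 1. The test list is generated up front, before any browser exists:
  tests = list(PixelIntegrationTest.GenerateGpuTests(finder_options))
  # 2. The GPU-specific expectation ('flaky', 'fail', 'skip', ...) can only
  #    be looked up much later, once the first test has started a browser:
  browser = StartBrowser(finder_options)        # illustrative
  gpu_info = browser.GetSystemInfo().gpu
  expectation = LookUpExpectation(expectations, gpu_info, test_name)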

It would be a lot simpler if we could leave the retry and test expectations logic alone and if the GPU tests could somehow call into typ reporting that the current test failed, so that it could record that fact. We would only do that if we intended to retry. Otherwise we'd raise the exception that will cause typ to record its failure as usual. Is that possible?

Not entirely sure about the details here, but for the GPU matching, shouldn't each test be able to do any checks it needs on the browser and call skipTest if it's not the right type?

https://docs.python.org/2/library/unittest.html#unittest.TestCase.skipTest

typ should then just not retry any skipped test?
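i.e. something like (plain unittest sketch; the GPU check and test body are hypothetical helpers):

  # Plain unittest sketch of the idea above; the GPU check and test body
  # are hypothetical helpers.
  import unittest

  class PixelTest(unittest.TestCase):
    def test_offscreen_canvas_transfer(self):
      if not BrowserGpuSupportsFeature():   # hypothetical check
        self.skipTest('feature not supported on this GPU configuration')
      RunPixelComparison()                  # hypothetical test body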

Cc: -nedngu...@google.com nednguyen@chromium.org

Comment 21 by kbr@chromium.org, Feb 6 2018

The issue is that typ would require the fail / flaky expectations up front – before the tests are run. That can't be done in the current setup.

Owner: kbr@chromium.org
The idea here is that you'd actually want to *skip* the test if you couldn't run it, right?

What I'd do, then, would be to launch the browser ahead of time to determine the GPU type, and then use that to set the expectations and the list of tests you'd want to run.

Failing that, I think you can call unittest.TestCase.skipTest() during the test execution to indicate that the test should be "skipped", and then typ wouldn't try to retry it (nor would we consider the test "failed").

Does that work?
Sorry, re-reading the whole thread, I guess you might want to mark something as flaky based on the gpu, not just skip based on the gpu?

However, since we don't actually want to *run* flaky tests (we should just skip them), I think that ends up reducing to what I wrote. I feel like adding some way at test-method-invocation time to say "should retry" or "don't retry" is the wrong solution.

Comment 24 by xlai@chromium.org, Feb 12 2018

dpranke@: Just adding my point of view: I think marking a test as "Skip" or "Flaky" serves two different purposes. 

A test can be skipped on a certain platform when we are sure that the particular feature that the test is testing is not supported on that platform, and therefore we do not want to run the test. 

A test is marked as flaky when there is a bug (probably a race condition one) that the developer is still working on and cannot be reproduced easily on their machines; having the flaky test continue to run would help developers monitor the current situation of the bug like whether it is already removed by a speculative fix.  
> dpranke@: Just adding my point of view: I think marking a test as "Skip" or "Flaky" 
> serves two different purposes. 

You are correct, they do.

> A test can be skipped on a certain platform when we are sure that the particular feature
> that the test is testing is not supported on that platform, and therefore we do not want
> to run the test. 

I call this the difference between "Skip" and "WontFix"; the former is meant to be for temporary issues that will be fixed.

> A test is marked as flaky when there is a bug (probably a race condition one) that the
> developer is still working on and cannot be reproduced easily on their machines; having
> the flaky test continue to run would help developers monitor the current situation of
> the bug like whether it is already removed by a speculative fix.

Yes, that is a reason to want to keep running flaky tests, and certainly marking a test 
as flaky is better than not marking it as either flaky or skipping it (and just getting
intermittent failures).

Comment 26 by kbr@chromium.org, Feb 13 2018

Owner: dpranke@chromium.org
We want to run the flaky tests. We often find a new regression, in Chrome or after a GPU card or driver update, where a previously reliable test starts flaking. Marking it flaky allows us to make sure we don't reliably break this area of functionality.

I don't see how to make this work aside from typ providing an API to let us call into it and tell it that the currently running test failed a couple of times. We will then return successfully from the current run, so that typ thinks that the results were, e.g. "FAIL FAIL PASS".

Dirk: can you please provide such an API from typ? Trying to pull the test expectations into typ for our tests is a large undertaking that would require much help from the Telemetry team, and I don't think it's feasible given all of the other work we have to do.
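Concretely, something with roughly this shape would suffice (a method-shaped sketch; the typ-side hook is made up):

  # Method-shaped sketch of the requested API; the typ-side hook is made up.
  def _RunGpuTestWithRetries(self, url, max_retries):
    for attempt in range(max_retries + 1):
      try:
        self._RunPixelTest(url)       # illustrative test body
        return                        # typ records the final PASS as usual
      except Exception:
        if attempt == max_retries:
          raise                       # out of retries: typ records the failure
        # Record the failed attempt with typ so that the results read e.g.
        # "FAIL FAIL PASS" instead of a single PASS.
        self.typ_host.RecordIntermediateFailure(self.id())   # hypothetical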

kbr@ and I chatted about this offline. To sum up our discussion:

1) We want to support expectations files in typ.

2) We don't actually need to run a given test in order to determine whether the test is expected to be flaky or not, but the way the code is structured it's currently very hard to figure out what the right expectation should be until the test method is invoked (i.e., when a running browser instance is available and we can determine which GPU we have).

I'm going to take a look at the current Telemetry and GPU code that sets up the test lists that get passed to typ, to see what I think the right thing to do here is.

Comment 29 by kbr@chromium.org, Feb 15 2018

The place in the GPU test harness where the test expectations for the currently-running test are fetched, and the retry loop is run for flaky tests:

https://cs.chromium.org/chromium/src/content/test/gpu/gpu_tests/gpu_integration_test.py?l=117

All of the autogenerated test functions, which are produced from subclasses' GenerateGpuTests overrides, call down into this base method.

Ken, Dirk, and I met offline to discuss this. Our plan is to make the GPU tests understand the test expectation format & add a hook to serially_executed_browser_test_case that returns the tags of the current platform the test suite is being run on. The concrete steps are as follows:

1) We make sure that TOT typ is in catapult (https://github.com/catapult-project/catapult/tree/master/third_party/typ). This will reduce unnecessary friction in making sure the layout test, Telemetry, and serially_executed_browser_test_case harnesses are all synced to the same version of typ.

2) We expand typ to support the test expectation format (bit.ly/chromium-test-list-format). Before running the test suite, users would set the test expectations & the tags describing the platform under test for the typ test runner.

3) To integrate this with serially_executed_browser_test_case (the base harness of the GPU tests), we would need to:
 i) Set the test expectation file path to typ runner in https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/testing/run_browser_tests.py?rcl=19f413e511563048bbd00578921fcdaf1e6958b6&l=302
 ii) Set the tags for the typ runner around the same place as (i). This can be done by adding a hook to serially_executed_browser_test_case.SeriallyExecutedBrowserTestCase.

The example code looks something like:
  runner = typ.Runner()
  ...
  runner.expectation_file_path = options.expectation_file_path
  runner.expectation_tags = test_class.GetTestExpectationTags(context.finder_options)


GPU test suites that subclass SeriallyExecutedBrowserTestCase can override GetTestExpectationTags(finder_options) to add logic to bring up the browser, query the platform info, & return appropriate tags that describe the current platform the test is being run on.
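A rough sketch of such an override (the hook name comes from the plan above; the browser bring-up helper and tag values are assumptions, not exact Telemetry calls):

  # Rough sketch; GetTestExpectationTags comes from the plan above, while
  # the browser bring-up helper and tag values are assumptions.
  from telemetry.testing import serially_executed_browser_test_case

  class PixelIntegrationTest(
      serially_executed_browser_test_case.SeriallyExecutedBrowserTestCase):

    @classmethod
    def GetTestExpectationTags(cls, finder_options):
      browser = StartBrowserForTagQuery(finder_options)   # assumed helper
      try:
        gpu = browser.GetSystemInfo().gpu.devices[0]
        # Translate platform + GPU info into the condition tags used by
        # the expectations file, e.g. ['win', 'nvidia'].
        return [PlatformTag(browser.platform), VendorTag(gpu)]   # assumed
      finally:
        browser.Close()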
Blockedon: 835490
Cc: -junov@chromium.org
Owner: estaab@chromium.org
@estaab - maybe this could be a task done as part of rhasan's starter project work?
Cc: rmhasan@google.com
NextAction: 2018-12-17
Yeah, I think that makes sense. Rakib, after your GPU test typ work wraps up we should take a look at this to see what else needs to be done to resolve this.
Once Rakib's work of converting the GPU tests to typ test expectations is done, this issue will mostly be resolved. The key problem is that the GPU tests implement the retry within the test & do not delegate to the typ harness to do the retry.
Cc: -nednguyen@chromium.org nedngu...@google.com
The NextAction date has arrived: 2018-12-17
NextAction: 2019-01-15
Will check back in when more of the typ GPU test work is done.
The NextAction date has arrived: 2019-01-15

Comment 40 by benhenry@google.com, Jan 16 (6 days ago)

Components: Test>Telemetry

Comment 41 by benhenry@google.com, Jan 16 (6 days ago)

Components: -Tests>Telemetry

Comment 42 by estaab@chromium.org, Jan 17 (5 days ago)

NextAction: ----
Owner: rmhasan@google.com

Comment 43 by nedngu...@google.com, Jan 17 (5 days ago)

Cc: -nedngu...@google.com
