
Issue 844524

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 805276




Make test runners record sequence of test execution to retry the batched tests for a single failed test

Project Member Reported by st...@chromium.org, May 18 2018

Issue description

In https://chromium-review.googlesource.com/c/chromium/src/+/1054590, Charlie Harrison encountered a nasty bug that makes test B in the telemetry_perf_unittest suite fail only when it runs right after test A (see https://chromium-review.googlesource.com/c/chromium/src/+/1054590#message-78946c8e6e1839bf5d0ffdaa9c7079d4627abe2c).
Similar cases are also known for WebKit layout tests.

If test runners could capture the sequence of test execution within the same session (meaning the same test process, the same browser instance, etc.), it would be more likely that we could reproduce the failures in a later rerun and find the culprit.

Comment 1 by kbr@chromium.org, May 18 2018

Is this information in the JSON file output by the various test harnesses? Are the tests listed in that file in the order they were run?


Comment 2 by st...@chromium.org, May 18 2018

TBH, I don't know for sure whether the order is preserved in the current JSON output.
The current data structure is a dict or trie, but I don't know whether it is an ordered dict; for a trie, there is no ordering concept as far as I know.

Either way, we cannot tell which tests were run in the same __session__, as I mentioned above.
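
For reference, here is a minimal Python sketch of the trie shape under discussion (field names are illustrative, loosely following the JSON Test Results format); flattening it recovers test names and results, but no execution order or session grouping:

# Hypothetical shape of a trie-style results file; names and
# fields are illustrative, not the exact output of any harness.
results = {
    "tests": {
        "suite": {
            "testA": {"expected": "PASS", "actual": "PASS"},
            "testB": {"expected": "PASS", "actual": "FAIL"},
        },
    },
}

def flatten(trie, prefix=""):
    """Walk the trie; leaves are dicts carrying an 'actual' field."""
    for name, node in trie.items():
        path = prefix + "." + name if prefix else name
        if "actual" in node:
            yield path, node
        else:
            yield from flatten(node, path)

# Iteration order here is just the dict insertion order of the parsed
# JSON; it says nothing about which process/session ran each test.
for path, leaf in flatten(results["tests"]):
    print(path, leaf["actual"])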

Comment 3 by kbr@chromium.org, May 18 2018

I see.

Maybe a new field could be added to the leaf nodes of the trie which would have some ordering information. Perhaps a simple string containing dotted numbers like "3.15.4". 3 could be the shard number, 15 the "session" number (i.e., this is the 15th browser launched in this shard), and 4 the number of the test within that session.
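
A minimal sketch of how such a dotted field could be encoded and decoded (the scheme and names here are illustrative, not an agreed format):

from typing import NamedTuple

class RunPosition(NamedTuple):
    shard: int    # which shard ran the test
    session: int  # which browser/process instance within the shard
    index: int    # position of the test within that session

def encode(pos: RunPosition) -> str:
    return f"{pos.shard}.{pos.session}.{pos.index}"

def decode(s: str) -> RunPosition:
    shard, session, index = (int(p) for p in s.split("."))
    return RunPosition(shard, session, index)

assert decode("3.15.4") == RunPosition(3, 15, 4)
# Sorting leaves by this key would recover the per-session execution order.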

Comment 4

Components: Build
Status: Available (was: Unconfirmed)
The current file format doesn't contain any useful ordering information; indeed, because of the trie structure, the ordering is explicitly discarded.

We could add it, but at some point we probably just need to move to a different format that is better designed. At one time I had proposed a new format that was a set of extensions to the trace file format that I think works much better for things like this and for cases where you run tests multiple times.
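
That proposal isn't spelled out here, but as a rough illustration: trace-style events form a flat, ordered list, so execution order and session grouping fall out naturally. This sketch only reuses the standard Chrome trace event fields and is not the proposed format itself:

import json

trace_events = []

def record_test_run(name, status, pid, start_us, dur_us):
    # One complete ("X") event per test run; "ts" and list order
    # preserve execution order, and "pid" groups runs by process.
    trace_events.append({
        "name": name, "ph": "X", "ts": start_us, "dur": dur_us,
        "pid": pid, "tid": 0, "args": {"status": status},
    })

record_test_run("suite.testA", "PASS", pid=101, start_us=0, dur_us=1500)
record_test_run("suite.testB", "FAIL", pid=101, start_us=1500, dur_us=900)
print(json.dumps({"traceEvents": trace_events}, indent=2))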


Comment 5 Deleted

Comment 6 Deleted

Comment 7 Deleted

Comment 8

Cc: erikc...@chromium.org sky@chromium.org w...@chromium.org jam@chromium.org estaab@chromium.org
Components: Infra>Flakiness Tools>Test>FindIt>Flakiness
Labels: -OS-Linux -Pri-3 Pri-2
Summary: Make test runners record sequence of test execution to retry the batched tests for a single failed test (was: Make test runners record sequence of test execution)
Gtest has the same problem too.
Below is a summary of findings and thoughts from the effort I put into flakiness last year (digging into the gtest launcher, discussions with various folks).
I won't be able to dig further here, but I'll document them to serve as a reference in case someone else picks this bug up.

gtest launcher scheduling logic (a condensed sketch follows this list):
1) For the first run of all tests, the main process of the gtest test launcher schedules tests to run in batches, and each batch runs in its own job process in parallel. Per an early discussion with jam@, this design is to maximize the usage of VM resources, especially CPU, but it can cause some tests to fail due to resource starvation. (wez@ also mentioned resource starvation in a recent email.)
2) Tests that fail the first run are rescheduled to rerun serially, and each test runs in its own job process for isolation. Per the discussion with jam@, this design is to reduce interference between tests, mainly to resolve flakes due to resource starvation. For the same reason, a FAIL->PASS test doesn't count as a hidden flake by Findit.
3) The main process of the gtest launcher launches (NOT forks) the job processes, so its state won't leak into them.
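
A condensed sketch of the scheduling above (illustrative Python, not the actual launcher, which is C++; per-test result parsing is elided, so a whole batch shares its process's exit status here):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_batch(binary, tests):
    # Each batch runs in its own job process via a gtest filter.
    proc = subprocess.run(
        [binary, "--gtest_filter=" + ":".join(tests)],
        capture_output=True)
    return [(t, proc.returncode == 0) for t in tests]

def launch(binary, all_tests, batch_size=10, jobs=4):
    batches = [all_tests[i:i + batch_size]
               for i in range(0, len(all_tests), batch_size)]
    failed = []
    # First run: batches execute in parallel job processes.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for results in pool.map(lambda b: run_batch(binary, b), batches):
            failed += [t for t, ok in results if not ok]
    # Retries: each failed test reruns serially, isolated in its
    # own job process.
    for t in failed:
        run_batch(binary, [t])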

Tests can flakily fail their first run for the various reasons below, and the default 3 retries may or may not help:
1) Starvation due to parallel processes executing tests.
   This is expected to be resolved by the isolated & serial retries.
2) State leaking:
    2.1) Within-process state leaking from one test makes another test *in the same batch* flaky. wez@ explicitly mentioned to both me and erikchen@ that content_unittests leaks hidden global state.
    2.2) Cross-process state interference between tests in different batches. This is tricky, but rare to the best of my knowledge. So far I have seen only one case: suite.testA and suite.testB both wrote a file with the same fixed name to disk and checked for its existence during testing; the fix was to use a scoped temporary directory or file (see the sketch after this list).
   This is expected to be resolved by the isolated & serial retries, if the state leaking is __within the context of a process instance__, e.g. in the private memory space of the browser instance which is gone when a new browser instance is launched.
   However, if the state leaking is on disk or other shared resources (maybe shared memory, but not sure), the retries might not help. In that case, we might want better mocking of underlying APIs.
3) The test itself is flaky, e.g. it depends on wall-clock time instead of a callback from an expected rendering event. The flaky tests auto-reverted by Findit fall into this category.
   The retries help to some extent, but do not fully mitigate this.
   This is more significant on the Chromium main waterfall: a lot of test failures there are not reproducible by 30 reruns in the __new__ Swarming task triggered by Findit (using exactly the binary from the same build).
4) The test itself is consistently failing.
   The retries always have the same results, and such tests will be reverted by Findit.
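
For case 2.2 above, the fix pattern is roughly the following (a Python sketch; the actual tests are C++ and would use something like base::ScopedTempDir for the same effect):

import os
import tempfile
import unittest

class SuiteTest(unittest.TestCase):
    def test_a(self):
        # Bad: a fixed path shared across tests and processes, e.g.
        #   path = "/tmp/marker_file"
        # Good: a scoped temp dir unique to this test invocation.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "marker_file")
            open(path, "w").close()
            self.assertTrue(os.path.exists(path))
        # The directory (and the marker file) is gone after the
        # block, so no state leaks into other tests or batches.

if __name__ == "__main__":
    unittest.main()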

To deal with state leaking that causes random flakes, recording the sequence can help:
1) Make test failures more reproducible during CQ so that bad CLs are rejected.
2) Enable culprit finding by Findit. It is also possible for the gtest launcher to do that during CQ.

As of now, we don't know how many tests fall into the state-leaking category.
I did a hack in the gtest launcher to record the test batches, but didn't get a chance to finish retrying a failed test together with the other tests in the same batch & in the original order (a sketch of the idea follows at the end of this comment).
I might not be able to look into this further, so if anyone is able to dig deeper, please feel free to take this bug.
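
For the record, a sketch of that record-and-replay idea (hypothetical helper names; note that gtest itself runs tests in registration order regardless of filter order, so a real implementation would need launcher support to force the recorded order):

# Remember which batch (and order) each test ran in, then retry a
# failed test together with its batch-mates in the original order.
batch_of_test = {}  # test name -> (batch id, ordered list of tests)

def record_batch(batch_id, tests):
    for t in tests:
        batch_of_test[t] = (batch_id, list(tests))

def replay_for(failed_test):
    batch_id, tests = batch_of_test[failed_test]
    # Re-run the batch in the original order, up to and including
    # the failed test, in one fresh job process.
    upto = tests[: tests.index(failed_test) + 1]
    return ["--gtest_filter=" + ":".join(upto)]

record_batch(0, ["suite.testA", "suite.testB", "suite.testC"])
print(replay_for("suite.testB"))
# ['--gtest_filter=suite.testA:suite.testB']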

Cc: -nednguyen@chromium.org
