Make test runners record the sequence of test execution so that the batched tests can be retried for a single failed test
Issue description
In https://chromium-review.googlesource.com/c/chromium/src/+/1054590, Charlie Harrison encountered a nasty bug that makes test B in the telemetry_perf_unittest suite fail only when it runs right after test A (see https://chromium-review.googlesource.com/c/chromium/src/+/1054590#message-78946c8e6e1839bf5d0ffdaa9c7079d4627abe2c). Similar cases are also known for webkit layout tests. If test runners could capture the sequence of test execution within the same session (meaning the same test process, the same browser instance, etc.), it would be more likely that we could reproduce the failures in later reruns and find the culprit.
May 18 2018
TBH, I don't know for sure whether the order is preserved in the current JSON file outputs. The current data structure is a dict or trie, but I don't know whether it is an ordered dict. For a trie, there is no ordering concept to the best of my knowledge. However, we at least cannot tell which tests are run in the same __session__ as I mentioned above.
May 18 2018
I see. Maybe a new field could be added to the leaf nodes of the trie which would carry some ordering information. Perhaps a simple string of dotted numbers like "3.15.4", where 3 would be the shard number, 15 the "session" number (i.e., this is the 15th browser launched in this shard), and 4 the index of the test within that session.
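As a rough illustration (this is not a change that exists in the results format today), a leaf node carrying such an ordering string could look like the Python sketch below; the field name "execution_order" and the helper function are hypothetical:

# Hypothetical sketch: annotate a leaf of the results trie with an
# "execution_order" string of the form "<shard>.<session>.<test index>".
# Neither the field name nor the helper exists in the current format.

def annotate_leaf(leaf, shard_index, session_index, test_index):
    """Attach ordering info to a single test's result dict (a trie leaf)."""
    leaf['execution_order'] = '%d.%d.%d' % (shard_index, session_index, test_index)
    return leaf

# Example: the 4th test run in the 15th browser session on shard 3.
leaf = {'expected': 'PASS', 'actual': 'FAIL', 'time': 1.2}
annotate_leaf(leaf, 3, 15, 4)
# leaf == {'expected': 'PASS', 'actual': 'FAIL', 'time': 1.2,
#          'execution_order': '3.15.4'}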
May 21 2018
The current file format doesn't contain any useful ordering information; indeed, because of the trie structure, the ordering is explicitly discarded. We could add it, but at some point we probably just need to move to a different format that is better designed. At one point I proposed a new format, a set of extensions to the trace file format, which I think would work much better for things like this and for cases where you run tests multiple times.
Jan 15
Gtest has the same problem here too.
Below is a summary of findings and thoughts from the effort I put into flakiness last year (digging into the gtest launcher, discussions with various folks).
I won't be able to dig further here, but I'm documenting them to serve as a reference in case someone else picks this bug up.
gtest launcher scheduling logic (a rough Python sketch follows the list below):
1) For the first run of all tests, the main process of the gtest test launcher schedules tests to run in batches, and each batch runs in its own job process in parallel. Per an early discussion with jam@, this design is to maximize the usage of VM resources, especially CPU, but it can cause some tests to fail due to resource starvation. (wez@ also mentioned resource starvation in a recent email.)
2) Tests that fail the first run are rescheduled to be rerun serially, and each such test runs in its own job process for isolation. Per the discussion with jam@, this design is to reduce interference between tests, mainly to resolve flakes due to resource starvation. For the same reason, a FAIL->PASS test doesn't count as a hidden flake by Findit.
3) The main process of the gtest launcher launches (NOT forks) the job processes, so its state won't leak into the job processes.
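Very roughly, that two-phase behavior can be modeled as in the Python sketch below. This is an illustrative model only, not the actual launcher (which is C++); the function names and the batch size are made up:

import multiprocessing

def run_batch(batch):
    # A real job process would run the gtest binary with a --gtest_filter
    # covering exactly these tests; here we just return a list of failures.
    return []

def run_single(test):
    # Retry one test in its own isolated job process; True means it passed.
    return True

def schedule(tests, batch_size=10):
    batches = [tests[i:i + batch_size] for i in range(0, len(tests), batch_size)]
    # Phase 1: batches run in parallel job processes.
    with multiprocessing.Pool() as pool:
        failures = [f for fs in pool.map(run_batch, batches) for f in fs]
    # Phase 2: each failed test is retried serially, in isolation, losing the
    # original batch context (which is the gap this bug is about).
    return [t for t in failures if not run_single(t)]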
Tests can flakily fail their first run for the various reasons below, and the default 3 retries may or may not help.
1) Starvation due to parallel processes executing tests.
This is expected to be resolved by the isolated & serial retries.
2) State leaking:
2.1) Within-process state leaking of one test makes another test *in the same batch* flaky. wez@ explicitly mentioned content_unittests leaking hidden global state to both me and erikchen@.
2.2) Cross-process state interference between tests in different batches. This is tricky, but rare to the best of my knowledge. So far I have seen only one case: suite.testA and suite.testB both wrote a file with the same fixed name to disk and checked for its existence during testing; the fix was to use a scoped temporary directory or file (see the sketch after this list).
This is expected to be resolved by the isolated & serial retries if the state leaking is __within the context of a process instance__, e.g. in the private memory space of the browser instance, which is gone when a new browser instance is launched.
However, if the state leaking is on disk or in other shared resources (maybe shared memory, but not sure), the retries might not help. In that case, we might want better mocking of the underlying APIs.
3) The test itself is flaky, e.g. it depends on timing instead of a callback from an expected rendering event. The flaky tests auto-reverted by Findit fall into this category.
The retries help to some extent, but do not fully mitigate this.
This is more significant on the Chromium main waterfall: a lot of test failures there are not reproducible by 30 reruns in a __new__ Swarming task triggered by Findit (using exactly the binary from the same build).
4) The test itself is consistently failing.
The retries always produce the same result, and such tests will be reverted by Findit.
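As a minimal illustration of the fix mentioned in 2.2: the actual tests are gtest/C++ and used a scoped temporary directory, so this Python version only shows the pattern, and the test name is hypothetical. Instead of a fixed file name shared across tests (and across job processes), each test gets a fresh, uniquely named directory:

import os
import tempfile
import unittest

class ExampleTest(unittest.TestCase):
    def setUp(self):
        # Each test run gets its own temporary directory.
        self._tmpdir = tempfile.TemporaryDirectory()

    def tearDown(self):
        # The directory and its marker file are removed even if the test fails,
        # so no state leaks to whichever test runs next.
        self._tmpdir.cleanup()

    def test_writes_marker(self):
        marker = os.path.join(self._tmpdir.name, 'marker_file')
        with open(marker, 'w') as f:
            f.write('done')
        self.assertTrue(os.path.exists(marker))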
To deal with state leaking that causes random flakes, recording the sequence can help to:
1) Make test failures more reproducible on the CQ so that bad CLs are rejected.
2) Enable culprit finding by Findit. It would also be possible for the gtest launcher to do this itself during CQ.
As of now, we don't know how many tests fall into the category of state leaking.
I did a hack in the gtest launcher to record the test batches, but didn't get a chance to finish retrying a failed test together with the other tests in its batch, in the original order (see the sketch below).
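A rough sketch of that idea, assuming the launcher has recorded which tests ran in which batch and in what order (the data structure and function names below are hypothetical, not the actual hack):

def retry_with_batch(failed_test, recorded_batches, run_batch_in_one_process):
    # Replay the whole recorded batch that contained the failed test, serially
    # and in the original order, so within-batch state leaks can reproduce.
    for batch in recorded_batches:  # e.g. [['A', 'B', 'C'], ['D', 'E'], ...]
        if failed_test in batch:
            # Tests after the failed one cannot have influenced it, so only
            # the prefix up to and including the failed test is replayed.
            prefix = batch[:batch.index(failed_test) + 1]
            return run_batch_in_one_process(prefix)
    # Not found in any recorded batch (e.g. it only ran in an isolated retry);
    # fall back to retrying it on its own.
    return run_batch_in_one_process([failed_test])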
I might not be able to look further into this, so if anyone is able to dig further, please feel free to take this bug.