Issue metadata
Sign in to add a comment
|
MessageLoop::IsIdleForTesting() is timing dependent, leading to test flakiness. |
||||||||||||||||||||||
Issue descriptionSerialWorkerTest.ExecuteAndSerializeReads sometimes fails with: [ RUN ] SerialWorkerTest.ExecuteAndSerializeReads ../../net/dns/serial_worker_unittest.cc:152: Failure Value of: message_loop_->IsIdleForTesting() Actual: false Expected: true ../../net/dns/serial_worker_unittest.cc:152: Failure Value of: message_loop_->IsIdleForTesting() Actual: false Expected: true ../../net/dns/serial_worker_unittest.cc:152: Failure Value of: message_loop_->IsIdleForTesting() Actual: false Expected: true [ FAILED ] SerialWorkerTest.ExecuteAndSerializeReads (83 ms) The test does use base::RunLoop::QuitCurrentDeprecated(), so perhaps it is exiting the wrong nested RunLoop at some point? Nothing seems to have changed recently in this test, so the most likely flake trigger seems to be other tests running at the same time, or run earlier in the same batch.
,
Jan 4 2018
,
Jan 4 2018
,
Jan 4 2018
mgersh: I'm not sure what the triggering condition is here, but FWIW we current run Fuchsia tests under QEMU, and the OS itself tends to schedule work differently from other platforms, so shows up race-conditions in code more often. Assigning this to you since I'm not familiar with how SerialWorker is supposed to work.
,
Jan 5 2018
Well, I found the change that probably started the flakiness: https://chromium-review.googlesource.com/c/chromium/src/+/847752 cc gab FYI. I'm sure there's a bug in the SerialWorker test, but this file has gone through so many migrations that subtly changed semantics that it's hard to tell what's going on, and I don't know MessageLoop internals well enough to figure it out from the CL. (That's why I was hesitant about doing yet another migration without figuring out the cause first.) Apparently today is not the day I'm going to figure out how to fix it. In the meantime, I don't object to disabling it on Fuchsia temporarily if it's causing a serious problem.
,
Jan 6 2018
Re #5: Looking at https://chromium-review.googlesource.com/c/chromium/src/+/847752, it changes the semantics from checking that there are no tasks in the "incoming" queue (i.e. the queue in which new tasks reside until the queue is ready to process them) to checking the "triage" queue (tasks already pulled from "incoming", to be processed), pulling tasks into that from "incoming" if it is empty, and checking for ready-to run tasks on the "delayed" queue. Checking the "delayed" task queue in IsIdleForTesting() is sketchy, since the answer can differ depending on how much time has elapsed, e.g. the Fuchsia bots run under QEMU and we sometimes see substantial scheduling delays, which may explain what we are seeing - I've left a comment on the code review.
,
Jan 6 2018
I've tracked this down to the same root-cause as some earlier net_unittests flakes we observed - by default all net_unittests run within the same ScopedTaskEnvironment, which means that tasks posted (either to the main-thread or to TaskScheduler) by earlier tests may still run, or at least remain in queues. In this case, some earlier test is posting a delayed task to the main thread, and the test is then taking long enough to run that that task becomes "due" while SerialWorkerTest.ExecuteAndSerializeReads is running, due to the change landed in https://chromium-review.googlesource.com/c/chromium/src/+/847752, to include delayed tasks in the IsIdleForTesting check.
,
Jan 6 2018
Simple CL which repros this reliably: https://chromium-review.googlesource.com/c/chromium/src/+/853355
,
Jan 6 2018
,
Jan 8 2018
Oh, that makes sense. We also have a bug for the net_unittests ScopedTaskEnvironment at issue 791831 .
,
Jan 8 2018
gab, fdoray: Looking at the bots, I found that there were other flakes caused by this (including one that happened in one of the try runs for the IsIdleForTesting patch, which retries masked). As with my simple repro CL above, it is also the case that our unit-test retries can mask this new flake, since it is dependent on what prior tests happen to do. I've reverted the IsIdleForTesting change (https://chromium-review.googlesource.com/c/chromium/src/+/853091). We can land this impl as a separate API for the ScopedTaskEnvironment, to un-block that work, and then migrate the existing tests to be compatible with the new API. Issue 791831 sounds ideal, and should be straightforward to accomplish?
,
Jan 9 2018
It would definitely be better if net used a ScopedTaskEnvironment per test rather than share it across tests with its suite. There's another problem highlighted here that delayed tasks from base::Timer aren't being cancelled when the base::Timer is deleted (they no-op but aren't cancelled as far as the MessageLoop is concerned so the idleness check stills considered them when they became ripe). I have a better solution for base::Timer in the works as well : https://docs.google.com/document/d/11gcfpL-orj8SWY8bUBa1_ZGe8kPLUcjtGf2_WD7nBKw/edit#heading=h.t5jxzgy5j2kz I found a way to remove the dependency on TimeTicks::Now() @ https://chromium-review.googlesource.com/c/chromium/src/+/856982/2 but it still didn't fully solve the issue per MessagePump alternating between DoWork() and DoDelayedWork(), potentially leaving ready delayed tasks behind by the time DoWork() runs the last task (even if it was posted after the delayed tasks were ready).
,
Jan 9 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6122361253281018f96d268c51cdc214bb9f644e commit 6122361253281018f96d268c51cdc214bb9f644e Author: Wez <wez@chromium.org> Date: Tue Jan 09 17:33:31 2018 Migrate test off of RunLoop::QuitCurrentDeprecated(). Bug: 799081 Change-Id: I21f996740bc6be09106a5dde8c98e1f1b1ebc382 Reviewed-on: https://chromium-review.googlesource.com/850950 Reviewed-by: Miriam Gershenson <mgersh@chromium.org> Commit-Queue: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#528030} [modify] https://crrev.com/6122361253281018f96d268c51cdc214bb9f644e/net/dns/serial_worker_unittest.cc |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by w...@chromium.org
, Jan 4 2018