New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 799081 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 1
Type: Bug-Regression



Sign in to add a comment

MessageLoop::IsIdleForTesting() is timing dependent, leading to test flakiness.

Project Member Reported by w...@chromium.org, Jan 4 2018

Issue description

SerialWorkerTest.ExecuteAndSerializeReads sometimes fails with:

[ RUN      ] SerialWorkerTest.ExecuteAndSerializeReads
../../net/dns/serial_worker_unittest.cc:152: Failure
Value of: message_loop_->IsIdleForTesting()
  Actual: false
Expected: true
../../net/dns/serial_worker_unittest.cc:152: Failure
Value of: message_loop_->IsIdleForTesting()
  Actual: false
Expected: true
../../net/dns/serial_worker_unittest.cc:152: Failure
Value of: message_loop_->IsIdleForTesting()
  Actual: false
Expected: true
[  FAILED  ] SerialWorkerTest.ExecuteAndSerializeReads (83 ms)

The test does use base::RunLoop::QuitCurrentDeprecated(), so perhaps it is exiting the wrong nested RunLoop at some point?

Nothing seems to have changed recently in this test, so the most likely flake trigger seems to be other tests running at the same time, or run earlier in the same batch.
 

Comment 1 by w...@chromium.org, Jan 4 2018

Cc: fdoray@chromium.org
Possibly related to the WorkerPool -> TaskScheduler migration?

Comment 2 by w...@chromium.org, Jan 4 2018

Cc: mge...@chromium.org
Components: Internals>Network>DNS

Comment 4 by w...@chromium.org, Jan 4 2018

Owner: mge...@chromium.org
Status: Assigned (was: Untriaged)
mgersh: I'm not sure what the triggering condition is here, but FWIW we current run Fuchsia tests under QEMU, and the OS itself tends to schedule work differently from other platforms, so shows up race-conditions in code more often.

Assigning this to you since I'm not familiar with how SerialWorker is supposed to work.
Cc: gab@chromium.org
Labels: -Pri-2 Pri-1
Well, I found the change that probably started the flakiness: https://chromium-review.googlesource.com/c/chromium/src/+/847752

cc gab FYI.

I'm sure there's a bug in the SerialWorker test, but this file has gone through so many migrations that subtly changed semantics that it's hard to tell what's going on, and I don't know MessageLoop internals well enough to figure it out from the CL. (That's why I was hesitant about doing yet another migration without figuring out the cause first.) Apparently today is not the day I'm going to figure out how to fix it. In the meantime, I don't object to disabling it on Fuchsia temporarily if it's causing a serious problem.

Comment 6 by w...@chromium.org, Jan 6 2018

Re #5: Looking at https://chromium-review.googlesource.com/c/chromium/src/+/847752, it changes the semantics from checking that there are no tasks in the "incoming" queue (i.e. the queue in which new tasks reside until the queue is ready to process them) to checking the "triage" queue (tasks already pulled from "incoming", to be processed), pulling tasks into that from "incoming" if it is empty, and checking for ready-to run tasks on the "delayed" queue.

Checking the "delayed" task queue in IsIdleForTesting() is sketchy, since the answer can differ depending on how much time has elapsed, e.g. the Fuchsia bots run under QEMU and we sometimes see substantial scheduling delays, which may explain what we are seeing - I've left a comment on the code review.

Comment 7 by w...@chromium.org, Jan 6 2018

Owner: w...@chromium.org
Status: Started (was: Assigned)
I've tracked this down to the same root-cause as some earlier net_unittests flakes we observed - by default all net_unittests run within the same ScopedTaskEnvironment, which means that tasks posted (either to the main-thread or to TaskScheduler) by earlier tests may still run, or at least remain in queues.

In this case, some earlier test is posting a delayed task to the main thread, and the test is then taking long enough to run that that task becomes "due" while SerialWorkerTest.ExecuteAndSerializeReads is running, due to the change landed in https://chromium-review.googlesource.com/c/chromium/src/+/847752, to include delayed tasks in the IsIdleForTesting check.

Comment 8 by w...@chromium.org, Jan 6 2018

Summary: IsIdleForTesting() is timing dependent, leading to test flakiness. (was: SerialWorkerTest.ExecuteAndSerializeReads flakey under Fuchsia/x64)
Simple CL which repros this reliably: https://chromium-review.googlesource.com/c/chromium/src/+/853355

Comment 9 by w...@chromium.org, Jan 6 2018

Summary: MessageLoop::IsIdleForTesting() is timing dependent, leading to test flakiness. (was: IsIdleForTesting() is timing dependent, leading to test flakiness.)
Oh, that makes sense. We also have a bug for the net_unittests ScopedTaskEnvironment at  issue 791831 .

Comment 11 by w...@chromium.org, Jan 8 2018

Status: Fixed (was: Started)
gab, fdoray: Looking at the bots, I found that there were other flakes caused by this (including one that happened in one of the try runs for the IsIdleForTesting patch, which retries masked). As with my simple repro CL above, it is also the case that our unit-test retries can mask this new flake, since it is dependent on what prior tests happen to do.

I've reverted the IsIdleForTesting change (https://chromium-review.googlesource.com/c/chromium/src/+/853091). We can land this impl as a separate API for the ScopedTaskEnvironment, to un-block that work, and then migrate the existing tests to be compatible with the new API.

 Issue 791831  sounds ideal, and should be straightforward to accomplish?

Comment 12 by gab@chromium.org, Jan 9 2018

It would definitely be better if net used a ScopedTaskEnvironment per test rather than share it across tests with its suite.

There's another problem highlighted here that delayed tasks from base::Timer aren't being cancelled when the base::Timer is deleted (they no-op but aren't cancelled as far as the MessageLoop is concerned so the idleness check stills considered them when they became ripe). I have a better solution for base::Timer in the works as well : https://docs.google.com/document/d/11gcfpL-orj8SWY8bUBa1_ZGe8kPLUcjtGf2_WD7nBKw/edit#heading=h.t5jxzgy5j2kz

I found a way to remove the dependency on TimeTicks::Now() @ https://chromium-review.googlesource.com/c/chromium/src/+/856982/2 but it still didn't fully solve the issue per MessagePump alternating between DoWork() and DoDelayedWork(), potentially leaving ready delayed tasks behind by the time DoWork() runs the last task (even if it was posted after the delayed tasks were ready).
Project Member

Comment 13 by bugdroid1@chromium.org, Jan 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6122361253281018f96d268c51cdc214bb9f644e

commit 6122361253281018f96d268c51cdc214bb9f644e
Author: Wez <wez@chromium.org>
Date: Tue Jan 09 17:33:31 2018

Migrate test off of RunLoop::QuitCurrentDeprecated().

Bug:  799081 
Change-Id: I21f996740bc6be09106a5dde8c98e1f1b1ebc382
Reviewed-on: https://chromium-review.googlesource.com/850950
Reviewed-by: Miriam Gershenson <mgersh@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#528030}
[modify] https://crrev.com/6122361253281018f96d268c51cdc214bb9f644e/net/dns/serial_worker_unittest.cc

Sign in to add a comment