New issue
Advanced search Search tips

Issue 816170 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 3
Type: Bug

Blocking:
issue 735701



Sign in to add a comment

TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithIdlePeriodAndCleanup flaked under Fuchsia/x64

Project Member Reported by w...@chromium.org, Feb 24 2018

Issue description

TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithIdlePeriodAndCleanup failed under Fuchsia/x64 with:

[00057.842] 03925.03952> [ RUN      ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithIdlePeriodAndCleanup
[00057.842] 03925.03952> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:663: Failure
[00057.844] 03925.03952> Expected equality of these values:
[00057.845] 03925.03952>   0
[00057.845] 03925.03952>   histogram->SnapshotSamples()->GetCount(0)
[00057.845] 03925.03952>     Which is: 1
[00057.845] 03925.03952> [  FAILED  ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithIdlePeriodAndCleanup (1740 ms)
[00057.847] 03925.03952> [1615/2838] TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithIdlePeriodAndCleanup (1740 ms)

It looks like this test is inherently flaky, since it relies on a a PlatformThread::Sleep() at https://cs.chromium.org/chromium/src/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc?l=638 - IIUC the timeouts that the test is aiming to wait for are scheduled on a the SchedulerWorker service thread, so there is no guarantee that that ran in a timely manner to _start_ the timeouts, in general.
 
Project Member

Comment 1 by bugdroid1@chromium.org, Feb 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/35b3464dccd14d07d154bd660adc9a608ed5bd5c

commit 35b3464dccd14d07d154bd660adc9a608ed5bd5c
Author: Wez <wez@chromium.org>
Date: Sat Feb 24 23:49:27 2018

Broaden test filter for TaskSchedulerWorkerPoolHistogramTests.

These tests rely on thread sleeps for synchronization, which makes them
racey when run on heavily-loaded systems.

TBR: scottmg
Bug:  816170 ,  735701 
Change-Id: I9cad471cf8bc88fd91dc33662489983f13f18786
Reviewed-on: https://chromium-review.googlesource.com/936446
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539025}
[modify] https://crrev.com/35b3464dccd14d07d154bd660adc9a608ed5bd5c/testing/buildbot/filters/fuchsia.base_unittests.filter

Comment 2 by gab@chromium.org, Feb 26 2018

Owner: gab@chromium.org
Status: Started (was: Untriaged)
I fixed this test's dependency on the sleep timing as part of https://chromium-review.googlesource.com/931524 (it still sleeps but only to test that the metric is logged properly when recovering from idle/cleanup -- doesn't matter whether cleanup actually kicked in, the sleep only accentuates the likelihood that it did and hence the relevance of this test).

Comment 3 by gab@chromium.org, Feb 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c

commit a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c
Author: Gabriel Charette <gab@chromium.org>
Date: Fri Feb 23 10:25:03 2018

[TaskScheduler] Do not log 0 to TaskScheduler.NumTasksBetweenWaits histogram on detach.

Doing so accesses |outer_| which was unsafe after invoking Cleanup().

While this will be made safe with https://chromium-review.googlesource.com/c/chromium/src/+/931547
we had noticed this metric being wrong in the past but had postponed
cleaning up to avoid perturbing metrics as part of a bigger change.
Now is the time to drop this report. I don't think we need to rename
the metric because it's an internal diagnosis metric for our team.

R=fdoray@chromium.org, robliao@chromium.org

Bug:   756898  , 810464
Change-Id: Ica379d6f3c21aebabeb0574847ba0bfcda346df8
Reviewed-on: https://chromium-review.googlesource.com/931524
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538751}
[modify] https://crrev.com/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc

Comment 4 by gab@chromium.org, Feb 26 2018

Wait, actually my bad. r538751 is the CL that broke this..!

The sleep isn't at fault here, the problem is that I wrongly assumed we would never get a drop in the zero bucket while it's still possible (if a worker is woken up but all the work is processed by other workers before it invokes GetWork() -- then it will do "0 tasks between waits").

Will fix.

Comment 5 by gab@chromium.org, Feb 26 2018

Blocking: 735701

Comment 6 by gab@chromium.org, Feb 26 2018

Looking at TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithIdlePeriodAndCleanup in detail I fail to see how it got a hit in bucket "0" with r538751.

Either way, cleaned up the test, hopefully this helps : https://chromium-review.googlesource.com/#/c/chromium/src/+/936626 
Project Member

Comment 7 by bugdroid1@chromium.org, Feb 28 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ee000a69dbe1884d15f87ce6565036825158fba8

commit ee000a69dbe1884d15f87ce6565036825158fba8
Author: Gabriel Charette <gab@chromium.org>
Date: Wed Feb 28 11:33:00 2018

Cleanup TaskSchedulerWorkerPoolHistogramTest.NumTasks*Cleanup tests.

TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup still depends
on cleanup timing, I'll try to remove those dependencies in a follow-up.

R=fdoray@chromium.org, robliao@chromium.org

Bug:  816170 
Change-Id: I2cca331d18ee8f3ca384493bd3584c68240c445e
Reviewed-on: https://chromium-review.googlesource.com/936626
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539789}
[modify] https://crrev.com/ee000a69dbe1884d15f87ce6565036825158fba8/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/ee000a69dbe1884d15f87ce6565036825158fba8/base/task_scheduler/scheduler_worker_pool_impl.h
[modify] https://crrev.com/ee000a69dbe1884d15f87ce6565036825158fba8/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc

Project Member

Comment 8 by bugdroid1@chromium.org, Mar 1 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/274f2e7f66d77271128910677f72b802a92e6c78

commit 274f2e7f66d77271128910677f72b802a92e6c78
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Mar 01 15:03:00 2018

[SchedulerWorkerPoolImpl] Make sure workers do not cleanup during WaitForWorkersIdle*().

While auditing the code I noticed that this was a problem: if a worker
cleans up while waiting for n workers to be idle, the idle stack will
shrink and this condition may never be reached.

This prevents this completely for WaitForAllWorkersIdleForTesting() and
greatly reduces the likelihood for WaitForWorkersIdleForTesting(n).

WaitForAllWorkersIdleForTesting() is safe because "n" is obtained under
the lock. Whereas an external call to WaitForWorkersIdleForTesting(n)
could wait on a worker which was already detached. But in practice
if the reclaim timeout is long enough and the main thread invokes this
immediately after unblocking a test task and waiting for it to resolve,
the race would require the main thread to be preempted for the full
reclaim timeout for no reason which seems highly unlikely.

Audited current usage and added a comment as such.

R=fdoray@chromium.org, robliao@chromium.org

Bug:  816170 ,  735701 
Change-Id: Ie7b4fd81c04f53cc9597ebc02a25b39d4f1d658f
Reviewed-on: https://chromium-review.googlesource.com/937361
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#540154}
[modify] https://crrev.com/274f2e7f66d77271128910677f72b802a92e6c78/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/274f2e7f66d77271128910677f72b802a92e6c78/base/task_scheduler/scheduler_worker_pool_impl.h

Project Member

Comment 9 by bugdroid1@chromium.org, Apr 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9746a243c4b7a2e79f60fd4d164e8da833dbeeee

commit 9746a243c4b7a2e79f60fd4d164e8da833dbeeee
Author: Gabriel Charette <gab@chromium.org>
Date: Fri Apr 27 14:31:27 2018

[Fuchsia+TaskScheduler] Re-enable tests that were flaky because of rogue SchedulerWorkers.

Rogue SchedulerWorkers were fixed once and for all in  crbug.com/827615 

R=wez@chromium.org

Bug:  816170 ,  827615 
Change-Id: I697124ed7cd5282bce8e504427e1daa6adc08df2
Reviewed-on: https://chromium-review.googlesource.com/1031210
Commit-Queue: Wez <wez@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#554386}
[modify] https://crrev.com/9746a243c4b7a2e79f60fd4d164e8da833dbeeee/testing/buildbot/filters/fuchsia.base_unittests.filter

Comment 10 by gab@chromium.org, Apr 27 2018

Status: Fixed (was: Started)
I think this was fixed by r552160 (i.e. that this issue was also caused by a rogue SchedulerWorker). Please reopen if not, thanks

Sign in to add a comment