New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 805502 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 2
Type: Bug
Hotlist-MemoryInfra

Blocked on:
issue 667892



Sign in to add a comment

MemoryDumpManagerTest.PostTaskForSequencedTaskRunner hang flake on Fuchsia/x64

Project Member Reported by w...@chromium.org, Jan 24 2018

Issue description

In FYI build https://build.chromium.org/p/chromium.fyi/builders/Fuchsia/builds/13225, the MemoryDumpManagerTest.PostTaskForSequencedTaskRunner hung, leading to the process timing-out.  We captured a threads dump:

[00537.357] 03902.04210> [3:1208766299:0124/020019.812458:537357663:ERROR:test_launcher.cc(376)] Dumping threads in process 45056
[00537.394] 62552.62578> Backtrace of threads of process 45056: base_unittests__exec
[00537.394] 62552.62578> arch: x86_64
[00537.394] 62552.62578> 2 thread(s)
...
[00537.401] 62552.62578> <== Thread initial-thread[45093] ==>
...
#01: pc 0x7e29a2f2c84b sp 0x1085d3ffb6a8 (<vDSO>,0x684b)
#02: pc 0x7e29a2f2c569 sp 0x1085d3ffb6b0 (<vDSO>,0x6569)
#03: pc 0x410ed427e571 sp 0x1085d3ffb6d0 (libc.so,0x44571)
#04: pc 0x410ed4251f88 sp 0x1085d3ffb710 (libc.so,0x17f88)
#05: base::ConditionVariable::Wait() at ??:?
#06: base::SequencedWorkerPool::Inner::Shutdown(int) at ??:?
#07: base::SequencedWorkerPoolOwner::~SequencedWorkerPoolOwner() at ??:?
#08: base::trace_event::(anonymous namespace)::TestSequencedTaskRunner::~TestSequencedTaskRunner() at ??:?
#09: std::__1::__tree<scoped_refptr<base::trace_event::MemoryDumpProviderInfo>, base::trace_event::MemoryDumpProviderInfo::Comparator, std::__1::allocator<scoped_refptr<base::trace_event::MemoryDumpProviderInfo> > >::destroy(std::__1::__tree_node<scoped_refptr<base::trace_event::MemoryDumpProviderInfo>, void*>*) at ??:?
#10: base::trace_event::MemoryDumpManager::~MemoryDumpManager() at ??:?
#11: base::trace_event::MemoryDumpManagerTest::TearDown() at ??:?
#12: testing::TestInfo::Run() at ??:?
#13: testing::TestCase::Run() at ??:?
#14: testing::internal::UnitTestImpl::RunAllTests() at ??:?
#15: testing::UnitTest::Run() at ??:?
#16: base::TestSuite::Run() at ??:?
#17: base::(anonymous namespace)::LaunchUnitTestsInternal(base::RepeatingCallback<int ()> const&, unsigned long, int, bool, base::RepeatingCallback<void ()> const&) at ??:?
#18: base::LaunchUnitTests(int, char**, base::RepeatingCallback<int ()> const&) at ??:?
#19: main at ??:?
#20: pc 0x410ed425514e sp 0x1085d3ffbfe0 (libc.so,0x1b14e)
#21: pc 0 sp 0x1085d3ffc000
[00537.426] 62552.62578> <== Thread Test Task RunnerWorker1/1859451[45433] ==>
...
#01: pc 0x410ed4252e50 sp 0x63c4ee42a7c8 (libc.so,0x18e50)
#02: base::internal::MessageLoopTaskRunner::RunsTasksInCurrentSequence() const at ??:?
#03: base::RunLoop::Quit() at ??:?
#04: base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) at ??:?
#05: base::MessageLoop::RunTask(base::PendingTask*) at ??:?
#06: base::MessageLoop::DoWork() at ??:?
#07: base::MessagePumpDefault::Run(base::MessagePump::Delegate*) at ??:?
#08: base::RunLoop::Run() at ??:?
#09: base::SequencedWorkerPoolOwner::~SequencedWorkerPoolOwner() at ??:?
#10: base::trace_event::(anonymous namespace)::TestSequencedTaskRunner::~TestSequencedTaskRunner() at ??:?
#11: base::trace_event::MemoryDumpManager::ContinueAsyncProcessDump(base::trace_event::MemoryDumpManager::ProcessMemoryDumpAsyncState*) at ??:?
#12: base::SequencedWorkerPool::Inner::ThreadLoop(base::SequencedWorkerPool::Worker*) at ??:?
#13: base::SequencedWorkerPool::Worker::Run() at ??:?
#14: base::SimpleThread::ThreadMain() at ??:?
#15: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#16: pc 0x410ed42528b6 sp 0x63c4ee42afe0 (libc.so,0x188b6)
#17: pc 0x410ed42c871a sp 0x63c4ee42aff0 (libc.so,0x8e71a)
#18: pc 0 sp 0x63c4ee42b000
[00537.966] 03902.04210> [3:1208766299:0124/020020.420678:537965900:ERROR:test_launcher.cc(398)] going to zx_task_kill(job) for 4a4c56fb
[00537.969] 03902.04210> [3:1208766299:0124/020020.423825:537969027:ERROR:test_launcher.cc(415)] removing 4a4c56fb from live process list
...
[00538.000] 03902.03951> 1 test timed out:
[00538.000] 03902.03951>     MemoryDumpManagerTest.PostTaskForSequencedTaskRunner (../../base/trace_event/memory_dump_manager_unittest.cc:507)

It looks (based on the stacks) like there's a MemoryDumpProvider in the manager, with the TestSequencedTaskRunner set on it, and we're managing to simultaneously touch the MDP on both the main thread and the SequencedWorkerPool thread, leading to a double-free, which in this case happens to deadlock things. O_o
 
Cc: primiano@chromium.org hjd@chromium.org ssid@chromium.org
I want to be realistic on this. I am not going to look to an OS_FUCSIA flakiness. As per "[chromium-dev] PSA: New platform: OS_FUCHSIA, "you don't have to worry more than you are worried about OS_FREEBSD or OS_SOLARIS today"
 
Patches from the Fuchsia team are more than welcome.

Comment 3 by w...@chromium.org, Jan 24 2018

Owner: w...@chromium.org
Status: Started (was: Untriaged)
Re #2: Thanks for pointing out the out-of-date PSA; we'll send out a refresh with the current state of things.

This looks to be a deadlock between the two TestSequencedTaskRunner instances, arising from the underlying SequencedWorkerOwner being destroyed on a WorkerPool/TaskScheduler thread, rather than on the main test thread. I'll diagnose further to find the most appropriate owner.

Comment 4 by w...@chromium.org, Jan 25 2018

Based on the stack trace from the hang (see bug description), this is a bug in the TestSequencedTaskRunner, which is ref-counted and may end up being deleted on any thread, but contains the SequencedWorkerPoolOwner, which owns the SequencedWorkerPool, which must be Shutdown() on the same sequenced on which it was created.

Based on the code itself, TestSequencedTaskRunner seems a peculiar helper; we create the underlying SequencedWorkerPool configured for two worker threads, and then pull out a single sequence token and use that to PostTask to it - we effectively get a single thread, but implemented using two underlying threads.

Comment 5 by w...@chromium.org, Jan 25 2018

ssid: Looks like you changed the thread-hopping logic in MDM in https://chromium-review.googlesource.com/c/chromium/src/+/847911 - any idea how ContinueAsyncProcessDump() could still be running _after_ we have return from RequestProcessDumpAndWait()?

Comment 6 by gab@chromium.org, Jan 25 2018

Cc: fdoray@chromium.org
FYI SequencedWorkerPoolOwner is about to be removed https://chromium-review.googlesource.com/c/chromium/src/+/881532 

Comment 7 by w...@chromium.org, Jan 25 2018

gab: Great; that CL should fix this test.

Basically the current implementation runs a tiny risk of the
TestSequencedTaskRunner, and hence the SequencedWorkerPoolOwner, being
deleted off the main thread, but during teardown the Owner spins a RunLoop
which was bound to the main thread at construction time, so it is actually
processing tasks posted to the main thread, on that Worker thread ->
live-lock.

Comment 8 by w...@chromium.org, Jan 25 2018

Blockedon: 667892
Owner: fdoray@chromium.org

Comment 9 by w...@chromium.org, Feb 21 2018

Is this complete, now that https://chromium-review.googlesource.com/c/chromium/src/+/881532 has landed?
Status: Fixed (was: Started)
Should be fixed. The issue was caused by the fact that SequencedWorkerPool was ref-counted. ScopedTaskEnvironment, now used in this test, doesn't have this issue.

Sign in to add a comment