MemoryDumpManagerTest.TestPollingOnDumpThread is flaky |
|||||
Issue descriptionBackground context: go/memory-infra MemoryDumpManagerTest.TestPollingOnDumpThread[1] seems to be flaky (20% failure rate locally) perhaps due to MemoryDumpScheduler being racy. For example the thread that calls MemoryDumpScheduler::DisableAllTriggers()[2] and resets periodic_state_ is not necessarily the timer thread which could be calling MemoryDumpScheduler::PollMemoryOnPollingThread() and attempting to de-reference periodic_state_ at the same moment. [1]: https://cs.chromium.org/chromium/src/base/trace_event/memory_dump_manager_unittest.cc?q=memory_dump_manager+package:%5Echromium$&dr=C&l=754 [2]: https://cs.chromium.org/chromium/src/base/trace_event/memory_dump_scheduler.cc?l=129
,
Mar 28 2017
Do you have a bot link which failed?
,
Mar 28 2017
Re #1, You're right, I confused periodic_state_ and polling_state_ so that is probably not the right cause. I don't have a bot link, I was trying this locally while working on changes to MDM. Could you CC me on to the CL for 702289 when it is ready (so I can avoid conflicting too much)?
,
Mar 28 2017
ssid I think that the following thing is happening:
- MemoryDumpScheduler::EnablePeriodicTriggerIfNeeded enables periodic_state_->timer.Start binding RequestPeriodicGlobalDump. RequestPeriodicGlobalDump will be called on the task runner of the thread that invoked EnablePeriodicTriggerIfNeeded()
- At some point trace gets disabled, OnTraceLogDisabled calls into DisableAllTriggers() which does:
if (periodic_state_->timer.IsRunning())
periodic_state_->timer.Stop();
periodic_state_.reset();
Now, if the thread where DisableAllTriggers is called is != the thread where EnablePeriodicTriggerIfNeeded is called, it the periodic_state_ will be reset while the timer is running concurrent on another thread. This causes a race.
(This seems yet another argument in favor of simplifying our threading model as discussed in Issue 705564 )
,
Mar 28 2017
The issue you are talking about in #4 is exactly same as issue 702289 . When we designed memory-infra we trusted TraceLog to always call enable and disable on the FILE thread / same thread. But now that is not the case when tracing is disabled unexpectedly. I think this issue should be merged with the issue 702289 if #4 is the cause of the crash. The test should not crash because it does not use different thread
,
Mar 28 2017
Well apparently this is causing a problem in the test. Maybe we should disable the test until we fix Issue 702289 ? P.S. it seem that we are not 100% sure what the cause is, would be great if a combination of you two could check this.
,
Mar 28 2017
I cannot reproduce the issue and not see flakes on the dashboard as well. hjd, can you post the stack trace of the crash? We never disable tracing on different thread in test. I suspect some local changes which hjd did is causing this crash. Also, I uploaded a fix for the other issue in https://codereview.chromium.org/2778313002/
,
Mar 28 2017
I have a suspect (purely based on my mental debugging of my weak memory about mdm_unittest.cc in the tube, so take everything with a grain of salt). I think there is an issue in the actual test : iirc the test fixture ends with DisableTracing() but does not wait for that to complete asynchronously. So if the test fixture reaches the end before that happens, mdm will be destroyed (onteardown) before tracing is disabled, and that will bring down the unique_ptr referred to by the timer. Which also would explain why the race shows up only on hjd workstation. Hjd@ let's check this tomorrow together
,
Mar 28 2017
So, I searched more and found a flaky build: https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyOAsSBUZsYWtlIi1NZW1vcnlEdW1wTWFuYWdlclRlc3QuVGVzdFBvbGxpbmdPbkR1bXBUaHJlYWQM https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_tsan_rel_ng/builds/39549 Stack trace: MemoryDumpManagerTest.TestPollingOnDumpThread (run #1): [ RUN ] MemoryDumpManagerTest.TestPollingOnDumpThread Received signal 11 SEGV_MAPERR 000000000000 #0 0x0000004e77b5 __interceptor_backtrace #1 0x000000e3552a base::debug::StackTrace::StackTrace() #2 0x000000e33a2f base::debug::StackTrace::StackTrace() #3 0x000000e34f3c base::debug::(anonymous namespace)::StackDumpSignalHandler() #4 0x0000004d5a43 __tsan::CallUserSignalHandler() #5 0x0000004d5ed9 rtl_sigaction() #6 0x7fc14e3b3330 <unknown> #7 0x000000f332e7 base::trace_event::MemoryDumpScheduler::PollMemoryOnPollingThread() #8 0x000000f343e6 _ZN4base8internal7InvokerINS0_9BindStateIMNS_11trace_event19MemoryDumpSchedulerEFvvEJNS0_17UnretainedWrapperIS4_EEEEEFvvEE3RunEPNS0_13BindStateBaseE #9 0x00000055e8c0 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv #10 0x000000e369c3 base::debug::TaskAnnotator::RunTask() #11 0x000000e79b51 base::MessageLoop::RunTask() #12 0x000000e7a32e base::MessageLoop::DeferOrRunPendingTask() #13 0x000000e7b099 base::MessageLoop::DoDelayedWork() #14 0x000000e7f5b5 base::MessagePumpDefault::Run() #15 0x000000e796a1 base::MessageLoop::RunHandler() #16 0x000000ebdbaf base::RunLoop::Run() #17 0x000000f14f8c base::Thread::Run() #18 0x000000f15697 base::Thread::ThreadMain() #19 0x000000f07d69 base::(anonymous namespace)::ThreadFunc() #20 0x0000004cfc72 __tsan_thread_start_func #21 0x7fc14e3ab184 start_thread #22 0x7fc14dec2bed clone r8: 0000800000000000 r9: 0003ffffffffffff r10: 0001800000000000 r11: 00007c0000000000 r12: 00007b0c00017520 r13: 00007b0c00017530 r14: 00007b0c00017520 r15: 0000000000f332a0 di: 00006000044c0000 si: 0010c0000000f46c bp: 00007fc14a59bd60 bx: 0000000000000000 dx: 00101c000000f46c ax: 0000100000000000 cx: 000000000000f46c sp: 00007fc14a59bbd0 ip: 0000000000f332e7 efl: 0000000000010246 cgf: 0000000000000033 erf: 0000000000000004 trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000000 [end of stack trace] Calling _exit(1). Core file will not be generated. The issue here is i checked the wrong condition for polling_disabled at Poll method: I changed "if (polling_state_->current_state != PollingTriggerState::ENABLED)" to "if (!polling_state_)" in https://codereview.chromium.org/2778313002/diff/20001/base/trace_event/memory_dump_scheduler.cc This is because in new model the polling_state_ is cleared instead of just setting disabled at MemoryDumpScheduler::DisablePollingOnPollingThread(). But, I am not sure if you are seeing a different issue.
,
Mar 28 2017
Re #8: There is no async tear down happening currently. What currently happens is when DisableTracing() is called the timer.Stop() is called synchronously and then timer is destroyed. All of this happens in the test main thread.
,
Mar 29 2017
The change you suggested in #9 does fix the problem! I guess the state we occasionally hit is: 1) PollMemoryOnPollingThread posts a delayed PollMemoryOnPollingThread task 2) We destroy the pointer in DisablePollingOnPollingThread 3) Now the delayed PollMemoryOnPollingThread executes and tries to deref a null pointer. Crashing example stack+some printfs [7620:7621:0329/111910.241319:1389285901707:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.242945:1389285903323:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.244384:1389285904762:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.245783:1389285906161:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.247346:1389285907724:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.248710:1389285909089:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.250113:1389285910491:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.251578:1389285911956:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.253003:1389285913381:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.254416:1389285914794:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7621:0329/111910.255873:1389285916251:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7620:7620:0329/111910.256668:1389285917047:ERROR:memory_dump_scheduler.cc(139)] MemoryDumpScheduler::DisableAllTriggers() [7620:7621:0329/111910.256898:1389285917276:ERROR:memory_dump_scheduler.cc(149)] MemoryDumpScheduler::DisablePollingOnPollingThread() [7620:7621:0329/111910.257009:1389285917385:ERROR:memory_dump_scheduler.cc(301)] PollingTriggerState::~PollingTriggerState() [7620:7621:0329/111910.257125:1389285917501:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() ../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:211: std::unique_ptr::pointer std::unique_ptr<base::trace_event::MemoryDumpScheduler::PollingTriggerState, std::default_delete<base::trace_event::MemoryDumpScheduler::PollingTriggerState> >::operator->() const [_Tp = base::trace_event::MemoryDumpScheduler::PollingTriggerState, _Dp = std::default_delete<base::trace_event::MemoryDumpScheduler::PollingTriggerState>]: Assertion 'get() != pointer()' failed. Received signal 6 #0 0x7f1913ef45bb base::debug::StackTrace::StackTrace() #1 0x7f1913ef2c4c base::debug::StackTrace::StackTrace() #2 0x7f1913ef40cf base::debug::(anonymous namespace)::StackDumpSignalHandler() #3 0x7f1914383330 <unknown> #4 0x7f1912be2c37 gsignal #5 0x7f1912be6028 abort #6 0x7f1913eb824f std::__replacement_assert() #7 0x7f191413bea6 std::unique_ptr<>::operator->() #8 0x7f191413a4a9 base::trace_event::MemoryDumpScheduler::PollMemoryOnPollingThread() #9 0x7f1913f0fe05 _ZN4base8internal13FunctorTraitsIMNS_21FileDescriptorWatcher10Controller7WatcherEFvvEvE6InvokeIPS4_JEEEvS6_OT_DpOT0_ #10 0x7f191413c5f1 _ZN4base8internal12InvokeHelperILb0EvE8MakeItSoIRKMNS_11trace_event19MemoryDumpSchedulerEFvvEJPS5_EEEvOT_DpOT0_ #11 0x7f191413c597 _ZN4base8internal7InvokerINS0_9BindStateIMNS_11trace_event19MemoryDumpSchedulerEFvvEJNS0_17UnretainedWrapperIS4_EEEEEFvvEE7RunImplIRKS6_RKSt5tupleIJS8_EEJLm0EEEEvOT_OT0_NS_13IndexSequenceIJXspT1_EEEE #12 0x7f191413c4dc _ZN4base8internal7InvokerINS0_9BindStateIMNS_11trace_event19MemoryDumpSchedulerEFvvEJNS0_17UnretainedWrapperIS4_EEEEEFvvEE3RunEPNS0_13BindStateBaseE #13 0x7f1913efa54e _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv #14 0x7f1913ef9dde base::debug::TaskAnnotator::RunTask() #15 0x7f1913f89eed base::MessageLoop::RunTask() #16 0x7f1913f8a174 base::MessageLoop::DeferOrRunPendingTask() #17 0x7f1913f8a6fc base::MessageLoop::DoDelayedWork() #18 0x7f1913fa1b67 base::MessagePumpDefault::Run() #19 0x7f1913f89a87 base::MessageLoop::RunHandler() #20 0x7f1914039fba base::RunLoop::Run() #21 0x7f19140ea094 base::Thread::Run() #22 0x7f19140ea8fa base::Thread::ThreadMain() #23 0x7f19140d0e1a base::(anonymous namespace)::ThreadFunc() #24 0x7f191437b184 start_thread #25 0x7f1912ca9bed clone r8: 29287265746e696f r9: 6767697254676e69 r10: 0000000000000008 r11: 0000000000000202 r12: 0000000000000000 r13: 0000000000000000 r14: 00007f19109a39c0 r15: 00007f19109a3700 di: 0000000000001dc4 si: 0000000000001dc5 bp: 00007f19109a0af0 bx: 0000000000000000 dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007f19109a0988 ip: 00007f1912be2c37 efl: 0000000000000202 cgf: 0000000000000033 erf: 0000000000000000 trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000 [end of stack trace] Calling _exit(1). Core file will not be generated. [7618:7619:0329/111910.297681:1389285958184:ERROR:kill_posix.cc(84)] Unable to terminate process group 7620: No such process [1/1] MemoryDumpManagerTest.TestPollingOnDumpThread (CRASHED) Working example with printfs: [7601:7602:0329/111907.641777:1389283302166:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.643389:1389283303767:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.644880:1389283305259:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.646307:1389283306685:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.647871:1389283308249:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.649256:1389283309634:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.650674:1389283311052:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.652138:1389283312517:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.653554:1389283313932:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.654989:1389283315367:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.656492:1389283316871:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7602:0329/111907.657833:1389283318211:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread() [7601:7601:0329/111907.658066:1389283318445:ERROR:memory_dump_scheduler.cc(139)] MemoryDumpScheduler::DisableAllTriggers() [7601:7602:0329/111907.658243:1389283318621:ERROR:memory_dump_scheduler.cc(149)] MemoryDumpScheduler::DisablePollingOnPollingThread() [7601:7602:0329/111907.658325:1389283318701:ERROR:memory_dump_scheduler.cc(301)] PollingTriggerState::~PollingTriggerState()
,
Mar 29 2017
Can one of you two send out the one line fix in #9 separately so we stop the test from being flake? Thanks!
,
Mar 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3d723ac3847809a99d6e46da60b297a0eb816c88 commit 3d723ac3847809a99d6e46da60b297a0eb816c88 Author: hjd <hjd@chromium.org> Date: Thu Mar 30 12:54:18 2017 memory-infra: Avoid dereferencing polling_state_ when it could be null BUG= 706023 Review-Url: https://codereview.chromium.org/2787533002 Cr-Commit-Position: refs/heads/master@{#460736} [modify] https://crrev.com/3d723ac3847809a99d6e46da60b297a0eb816c88/base/trace_event/memory_dump_scheduler.cc
,
Mar 30 2017
,
Mar 30 2017
merge-request: this code is not hit in production but it is by test. merge-req just to avoid making the tests on the official waterfall flaky.
,
Mar 31 2017
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 3 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 3 2017
Looks like the original CL that introduced the bug missed the branch point, so no need for the fix. apologies for the noise. removing MR labels |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by ssid@chromium.org
, Mar 28 2017