New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 706023 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner: ----
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug

Blocking:
issue 703184



Sign in to add a comment

MemoryDumpManagerTest.TestPollingOnDumpThread is flaky

Project Member Reported by hjd@chromium.org, Mar 28 2017

Issue description

Background context: go/memory-infra

MemoryDumpManagerTest.TestPollingOnDumpThread[1] seems to be flaky (20% failure rate locally) perhaps due to MemoryDumpScheduler being racy.

For example the thread that calls MemoryDumpScheduler::DisableAllTriggers()[2] and resets periodic_state_ is not necessarily the timer thread which could be calling MemoryDumpScheduler::PollMemoryOnPollingThread() and attempting to de-reference periodic_state_ at the same moment.

[1]: https://cs.chromium.org/chromium/src/base/trace_event/memory_dump_manager_unittest.cc?q=memory_dump_manager+package:%5Echromium$&dr=C&l=754
[2]: https://cs.chromium.org/chromium/src/base/trace_event/memory_dump_scheduler.cc?l=129

 

Comment 1 by ssid@chromium.org, Mar 28 2017

Sorry I fail to understand what is the second thing that is referencing periodic_state_. MemoryDumpScheduler::PollMemoryOnPollingThread does not access periodic_state_.
I am also trying to fix  issue 702289  which moves periodic timer to dump_thread because of different thread disabling.

Comment 2 by ssid@chromium.org, Mar 28 2017

Do you have a bot link which failed?

Comment 3 by hjd@chromium.org, Mar 28 2017

Re #1, You're right, I confused periodic_state_ and polling_state_ so that is probably not the right cause. I don't have a bot link, I was trying this locally while working on changes to MDM. Could you CC me on to the CL for 702289 when it is ready (so I can avoid conflicting too much)?
ssid I think that the following thing is happening:
- MemoryDumpScheduler::EnablePeriodicTriggerIfNeeded enables periodic_state_->timer.Start binding RequestPeriodicGlobalDump. RequestPeriodicGlobalDump will be called on the task runner of the thread that invoked EnablePeriodicTriggerIfNeeded()

- At some point trace gets disabled, OnTraceLogDisabled calls into DisableAllTriggers() which does:
    if (periodic_state_->timer.IsRunning())
      periodic_state_->timer.Stop();
    periodic_state_.reset();

Now, if the thread where DisableAllTriggers is called is != the thread where EnablePeriodicTriggerIfNeeded is called, it the periodic_state_ will be reset while the timer is running concurrent on another thread. This causes a race.

(This seems yet another argument in favor of simplifying our threading model as discussed in  Issue 705564 )

Comment 5 by ssid@chromium.org, Mar 28 2017

The issue you are talking about in #4 is exactly same as  issue 702289 . When we designed memory-infra we trusted TraceLog to always call enable and disable on the FILE thread / same thread. But now that is not the case when tracing is disabled unexpectedly. I think this issue should be merged with the  issue 702289  if #4 is the cause of the crash. The test should not crash because it does not use different thread
Well apparently this is causing a problem in the test. Maybe we should disable the test until we fix  Issue 702289  ?


P.S. it seem that we are not 100% sure what the cause is, would be great if a combination of you two could check this.

Comment 7 by ssid@chromium.org, Mar 28 2017

I cannot reproduce the issue and not see flakes on the dashboard as well. hjd, can you post the stack trace of the crash?
We never disable tracing on different thread in test. I suspect some local changes which hjd did is causing this crash.

Also, I uploaded a fix for the other issue in https://codereview.chromium.org/2778313002/
I have a suspect (purely based on my mental debugging of my weak memory about mdm_unittest.cc in the tube, so take everything with a grain of salt). 
I think there is an issue in the actual test : iirc the test fixture ends with DisableTracing() but does not wait for that to complete asynchronously. So if the test fixture reaches the end before that happens, mdm will be destroyed (onteardown) before tracing is disabled, and that will bring down the unique_ptr referred to by the timer. 
Which also would explain why the race shows up only on hjd workstation. 
Hjd@ let's check this tomorrow together 

Comment 9 by ssid@chromium.org, Mar 28 2017

So, I searched more and found a flaky build:
https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyOAsSBUZsYWtlIi1NZW1vcnlEdW1wTWFuYWdlclRlc3QuVGVzdFBvbGxpbmdPbkR1bXBUaHJlYWQM

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_tsan_rel_ng/builds/39549

Stack trace:
MemoryDumpManagerTest.TestPollingOnDumpThread (run #1):
[ RUN      ] MemoryDumpManagerTest.TestPollingOnDumpThread
Received signal 11 SEGV_MAPERR 000000000000
#0 0x0000004e77b5 __interceptor_backtrace
#1 0x000000e3552a base::debug::StackTrace::StackTrace()
#2 0x000000e33a2f base::debug::StackTrace::StackTrace()
#3 0x000000e34f3c base::debug::(anonymous namespace)::StackDumpSignalHandler()
#4 0x0000004d5a43 __tsan::CallUserSignalHandler()
#5 0x0000004d5ed9 rtl_sigaction()
#6 0x7fc14e3b3330 <unknown>
#7 0x000000f332e7 base::trace_event::MemoryDumpScheduler::PollMemoryOnPollingThread()
#8 0x000000f343e6 _ZN4base8internal7InvokerINS0_9BindStateIMNS_11trace_event19MemoryDumpSchedulerEFvvEJNS0_17UnretainedWrapperIS4_EEEEEFvvEE3RunEPNS0_13BindStateBaseE
#9 0x00000055e8c0 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv
#10 0x000000e369c3 base::debug::TaskAnnotator::RunTask()
#11 0x000000e79b51 base::MessageLoop::RunTask()
#12 0x000000e7a32e base::MessageLoop::DeferOrRunPendingTask()
#13 0x000000e7b099 base::MessageLoop::DoDelayedWork()
#14 0x000000e7f5b5 base::MessagePumpDefault::Run()
#15 0x000000e796a1 base::MessageLoop::RunHandler()
#16 0x000000ebdbaf base::RunLoop::Run()
#17 0x000000f14f8c base::Thread::Run()
#18 0x000000f15697 base::Thread::ThreadMain()
#19 0x000000f07d69 base::(anonymous namespace)::ThreadFunc()
#20 0x0000004cfc72 __tsan_thread_start_func
#21 0x7fc14e3ab184 start_thread
#22 0x7fc14dec2bed clone
  r8: 0000800000000000  r9: 0003ffffffffffff r10: 0001800000000000 r11: 00007c0000000000
 r12: 00007b0c00017520 r13: 00007b0c00017530 r14: 00007b0c00017520 r15: 0000000000f332a0
  di: 00006000044c0000  si: 0010c0000000f46c  bp: 00007fc14a59bd60  bx: 0000000000000000
  dx: 00101c000000f46c  ax: 0000100000000000  cx: 000000000000f46c  sp: 00007fc14a59bbd0
  ip: 0000000000f332e7 efl: 0000000000010246 cgf: 0000000000000033 erf: 0000000000000004
 trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]
Calling _exit(1). Core file will not be generated.

The issue here is i checked the wrong condition for polling_disabled at Poll method:

I changed 
"if (polling_state_->current_state != PollingTriggerState::ENABLED)" to
"if (!polling_state_)"
in https://codereview.chromium.org/2778313002/diff/20001/base/trace_event/memory_dump_scheduler.cc

This is because in new model the polling_state_ is cleared instead of just setting disabled at MemoryDumpScheduler::DisablePollingOnPollingThread().

But, I am not sure if you are seeing a different issue.




Comment 10 by ssid@chromium.org, Mar 28 2017

Re #8: There is no async tear down happening currently. What currently happens is when DisableTracing() is called the timer.Stop() is called synchronously and then timer is destroyed. All of this happens in the test main thread.

Comment 11 by hjd@chromium.org, Mar 29 2017

The change you suggested in #9 does fix the problem!

I guess the state we occasionally hit is:
1) PollMemoryOnPollingThread posts a delayed PollMemoryOnPollingThread task
2) We destroy the pointer in DisablePollingOnPollingThread
3) Now the delayed PollMemoryOnPollingThread executes and tries to deref a null pointer.

Crashing example stack+some printfs
[7620:7621:0329/111910.241319:1389285901707:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.242945:1389285903323:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.244384:1389285904762:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.245783:1389285906161:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.247346:1389285907724:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.248710:1389285909089:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.250113:1389285910491:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.251578:1389285911956:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.253003:1389285913381:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.254416:1389285914794:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7621:0329/111910.255873:1389285916251:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7620:7620:0329/111910.256668:1389285917047:ERROR:memory_dump_scheduler.cc(139)] MemoryDumpScheduler::DisableAllTriggers()
[7620:7621:0329/111910.256898:1389285917276:ERROR:memory_dump_scheduler.cc(149)] MemoryDumpScheduler::DisablePollingOnPollingThread()
[7620:7621:0329/111910.257009:1389285917385:ERROR:memory_dump_scheduler.cc(301)] PollingTriggerState::~PollingTriggerState()
[7620:7621:0329/111910.257125:1389285917501:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:211: std::unique_ptr::pointer std::unique_ptr<base::trace_event::MemoryDumpScheduler::PollingTriggerState, std::default_delete<base::trace_event::MemoryDumpScheduler::PollingTriggerState> >::operator->() const [_Tp = base::trace_event::MemoryDumpScheduler::PollingTriggerState, _Dp = std::default_delete<base::trace_event::MemoryDumpScheduler::PollingTriggerState>]: Assertion 'get() != pointer()' failed.
Received signal 6
#0 0x7f1913ef45bb base::debug::StackTrace::StackTrace()
#1 0x7f1913ef2c4c base::debug::StackTrace::StackTrace()
#2 0x7f1913ef40cf base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7f1914383330 <unknown>
#4 0x7f1912be2c37 gsignal
#5 0x7f1912be6028 abort
#6 0x7f1913eb824f std::__replacement_assert()
#7 0x7f191413bea6 std::unique_ptr<>::operator->()
#8 0x7f191413a4a9 base::trace_event::MemoryDumpScheduler::PollMemoryOnPollingThread()
#9 0x7f1913f0fe05 _ZN4base8internal13FunctorTraitsIMNS_21FileDescriptorWatcher10Controller7WatcherEFvvEvE6InvokeIPS4_JEEEvS6_OT_DpOT0_
#10 0x7f191413c5f1 _ZN4base8internal12InvokeHelperILb0EvE8MakeItSoIRKMNS_11trace_event19MemoryDumpSchedulerEFvvEJPS5_EEEvOT_DpOT0_
#11 0x7f191413c597 _ZN4base8internal7InvokerINS0_9BindStateIMNS_11trace_event19MemoryDumpSchedulerEFvvEJNS0_17UnretainedWrapperIS4_EEEEEFvvEE7RunImplIRKS6_RKSt5tupleIJS8_EEJLm0EEEEvOT_OT0_NS_13IndexSequenceIJXspT1_EEEE
#12 0x7f191413c4dc _ZN4base8internal7InvokerINS0_9BindStateIMNS_11trace_event19MemoryDumpSchedulerEFvvEJNS0_17UnretainedWrapperIS4_EEEEEFvvEE3RunEPNS0_13BindStateBaseE
#13 0x7f1913efa54e _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv
#14 0x7f1913ef9dde base::debug::TaskAnnotator::RunTask()
#15 0x7f1913f89eed base::MessageLoop::RunTask()
#16 0x7f1913f8a174 base::MessageLoop::DeferOrRunPendingTask()
#17 0x7f1913f8a6fc base::MessageLoop::DoDelayedWork()
#18 0x7f1913fa1b67 base::MessagePumpDefault::Run()
#19 0x7f1913f89a87 base::MessageLoop::RunHandler()
#20 0x7f1914039fba base::RunLoop::Run()
#21 0x7f19140ea094 base::Thread::Run()
#22 0x7f19140ea8fa base::Thread::ThreadMain()
#23 0x7f19140d0e1a base::(anonymous namespace)::ThreadFunc()
#24 0x7f191437b184 start_thread
#25 0x7f1912ca9bed clone
  r8: 29287265746e696f  r9: 6767697254676e69 r10: 0000000000000008 r11: 0000000000000202
 r12: 0000000000000000 r13: 0000000000000000 r14: 00007f19109a39c0 r15: 00007f19109a3700
  di: 0000000000001dc4  si: 0000000000001dc5  bp: 00007f19109a0af0  bx: 0000000000000000
  dx: 0000000000000006  ax: 0000000000000000  cx: ffffffffffffffff  sp: 00007f19109a0988
  ip: 00007f1912be2c37 efl: 0000000000000202 cgf: 0000000000000033 erf: 0000000000000000
 trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]
Calling _exit(1). Core file will not be generated.
[7618:7619:0329/111910.297681:1389285958184:ERROR:kill_posix.cc(84)] Unable to terminate process group 7620: No such process
[1/1] MemoryDumpManagerTest.TestPollingOnDumpThread (CRASHED)

Working example with printfs:
[7601:7602:0329/111907.641777:1389283302166:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.643389:1389283303767:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.644880:1389283305259:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.646307:1389283306685:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.647871:1389283308249:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.649256:1389283309634:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.650674:1389283311052:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.652138:1389283312517:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.653554:1389283313932:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.654989:1389283315367:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.656492:1389283316871:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7602:0329/111907.657833:1389283318211:ERROR:memory_dump_scheduler.cc(177)] MemoryDumpScheduler::PollMemoryOnPollingThread()
[7601:7601:0329/111907.658066:1389283318445:ERROR:memory_dump_scheduler.cc(139)] MemoryDumpScheduler::DisableAllTriggers()
[7601:7602:0329/111907.658243:1389283318621:ERROR:memory_dump_scheduler.cc(149)] MemoryDumpScheduler::DisablePollingOnPollingThread()
[7601:7602:0329/111907.658325:1389283318701:ERROR:memory_dump_scheduler.cc(301)] PollingTriggerState::~PollingTriggerState()


Can one of you two send out the one line fix in #9 separately so we stop the test from being flake? Thanks!
Project Member

Comment 13 by bugdroid1@chromium.org, Mar 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3d723ac3847809a99d6e46da60b297a0eb816c88

commit 3d723ac3847809a99d6e46da60b297a0eb816c88
Author: hjd <hjd@chromium.org>
Date: Thu Mar 30 12:54:18 2017

memory-infra: Avoid dereferencing polling_state_ when it could be null

BUG= 706023 

Review-Url: https://codereview.chromium.org/2787533002
Cr-Commit-Position: refs/heads/master@{#460736}

[modify] https://crrev.com/3d723ac3847809a99d6e46da60b297a0eb816c88/base/trace_event/memory_dump_scheduler.cc

Status: Fixed (was: Untriaged)
Labels: Merge-Request-58 OS-All
merge-request: this code is not hit in production but it is by test. merge-req just to avoid making the tests on the official waterfall flaky.
Project Member

Comment 16 by sheriffbot@chromium.org, Mar 31 2017

Labels: -Merge-Request-58 Hotlist-Merge-Approved Merge-Approved-58
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 17 by sheriffbot@chromium.org, Apr 3 2017

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Hotlist-Merge-Approved -Merge-Approved-58
Looks like the original CL that introduced the bug missed the branch point, so no need for the fix. apologies for the noise. removing MR labels

Sign in to add a comment