New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 766274 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Chrome , Mac
Pri: 2
Type: Bug



Sign in to add a comment

ThreadLocalEventBuffer is not thread safe

Project Member Reported by ssid@chromium.org, Sep 18 2017

Issue description

Background context: go/memory-infra

Crash:https://crash.corp.google.com/browse?stbtiq=d274ae8561459249&sql_dialect=dremelsql#4

I might be wrong. the crash happened due to unsafe unregistration of dump provider. I looked at the minidump and the dump provider name was ThreadLocalEventBuffer.
 

Comment 1 by ssid@chromium.org, Sep 18 2017

Labels: -Pri-2 Pri-3
The crash is happening on "File" thread created here:
https://chromium.googlesource.com/chromium/src/+/a5f2d27e43b3d7c0a1fc13bb46516c1332e69b2b/third_party/WebKit/Source/platform/exported/Platform.cpp#164

The destructor of MessageLoop says:
" // If |pump_| is non-null, this message loop has been bound and should be the
  // current one on this thread. Otherwise, this loop is being destructed before
  // it was bound to a thread, so a different message loop (or no loop at all)
  // may be current."

ThreadLocalEventBuffer is created only when MessageLoop::current() is not-null, meaning the thread is bound to message loop. So, the destructor cannot be called on a wrong thread.

The crash seems like an invalid address access. So, most likely it's because the provider is destroyed. If the provider is destroyed and we had some bug with unregistration, the post task on the task runner should fail since thread is gone. So, OnMemoryDump() is called on a different thread?

I can't think of any case this would happen. Maybe the MDM::dump_providers_ is corrupted or there is a bug with message loop.

Lowering priority since there is only one crash.
Labels: -Pri-3 Pri-2
Hmm erikchen said he experienced it more than once on canary, makes me feel like this is quite easy to hit. 
On the other side the data on crash/ shows only <10 crashes after M60 on this signature [1]
So, quite conflicted on the priority of this.

Right now I have a wild speculation that I need to confirm or verify.
ThreadLocalEventBuffer should be truly thread-bound (% the case you mentioned which we shouldn't be hitting).
However, a lot of things in chrome got silently relaxed from being thread-safe to being sequence-safe.
So I wonder if what we are hitting here is a case of two sequences (that before would have been two distinct threads) ending up sharing the same thread (and one  unregistering).
I have to think a bit more on the code. 


[1] https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3Atrace_event%3A%3AMemoryDumpManager%3A%3AInvokeOnMemoryDump%27&sql_dialect=dremelsql&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#-property-selector,+productversion:1000


I did some auditing at the code of ThreadLocalEventBuffer and it actually seems thread safe.

ThreadLocalEventBuffer:
- is only created for the MessageLoop::current(), and is only registered as MDP passing ThreadTaskRunnerHandle::Get(). When created it's only stored in a TLS pointer.

- is only destroyed when:
  - The current message loop is being destroyed
  - A new generation of tracelog is created.
  In both cases this uses a lookup from a TLS slot, so we can't possibly be picking the wrong object from another thread.

I am trying to debug the assembly in crash/d274ae8561459249 but I need some help.
grokdump.py won't show me the assembly around IP and even if I pull the symbols from go/link_for_766274 I don't seem able to get the relative address of the PC to pass to gdb.
I naively tried
relative offset = $instruction_pointer_from_crash (0x0000000102548edc) - $base_address_of_Google_Chrome_Framework_from_crash (0x0000000100978000) but that is clearly wrong and I think the Mach-O layout is less simple than what I was expecting.

Right now the only theories I have is:
- something I cannot easily spot is invalidating, without calling the dtor, the TLEB
- Our logic in MDM when we do:
 if( mdpinfo->options.dumps_on_single_thread_task_runner && task_runner->RunsTasksInCurrentSequence())
    Invoke MDP.OnMemoryDump without another PostTask
is conceptually flawed and we might end up being on the wrong thread.
The source code is:

  strncpy(provider_name_for_debugging, mdpinfo->name,
     sizeof(provider_name_for_debugging) - 1);
  provider_name_for_debugging[sizeof(provider_name_for_debugging) - 1] = '\0';
  base::debug::Alias(provider_name_for_debugging);  
  ProcessMemoryDump* pmd = pmd_async_state->process_memory_dump.get();
  bool dump_successful =
      mdpinfo->dump_provider->OnMemoryDump(pmd->dump_args(), pmd);


The SysV calling convention for x86_64 is:
rdi : 1st arg
rsi : 2nd arg
rdx : 3rd arg

The assembly:

// R13 = mdpinfo, |name| is the 3rd argument of mdpinfo, 8x3 = 24 (0x18)
// So this puts the |mdpinfo|->|name| pointer in rsi (1st arg)
[0x1bd0ead] <+461>: movq   0x18(%r13), %rsi

//  -0x50(%rbp) is very likely the address on the stack of 
// |provider_name_for_debugging|, so rbx now == |provider_name_for_debugging|
[0x1bd0eb1] <+465>: leaq   -0x50(%rbp), %rbx

// copies 15 (sizeof(provider_name_for_debugging) - 1) into rdx (3rd arg)
[0x1bd0eb5] <+469>: movl   $0xf, %edx

// rdi (1st arg) = &provider_name_for_debugging
[0x1bd0eba] <+474>: movq   %rbx, %rdi

// call strncpy(rdi, rsi, rdx), everything matches
[0x1bd0ebd] <+477>: callq  0x5ed3644                 ; symbol stub for: strncpy

// adds \0 terminator to |provider_name_for_debugging|
[0x1bd0ec2] <+482>: movb   $0x0, -0x41(%rbp)

// rdi (1st arg) = &provider_name_for_debugging
[0x1bd0ec6] <+486>: movq   %rbx, %rdi

// call base::debug::Alias, which is a noop really, but the compiler doesn't
// know it
[0x1bd0ec9] <+489>: callq  0x1b45000                 ; base::debug::Alias at alias.cc:15

// At this point r15 contains the |pmd| pointer, copy that into rdx (3rd arg)
[0x1bd0ece] <+494>: movq   (%r15), %rdx

// r13 = mdpinfo, rdi = 0x8(%r13) = &(mdpinfo->dump_provider)
[0x1bd0ed1] <+497>: movq   0x8(%r13), %rdi

// rax = *(r13) = mdpinfo->dump_provider. rax now contains the MDP pointer.
[0x1bd0ed5] <+501>: movq   (%rdi), %rax

// rsi (1st arg) = pmd->dump_args
[0x1bd0ed8] <+504>: leaq   0x70(%rdx), %rsi

// As a summary, now the args loaded on the registers are
// rdi (|this| ptr of the method): &(mdpinfo->dump_provider)
// rsi (1st arg of the method): pmd->dump_args (from 0x1bd0ed8 above)
// rdx (2nd arg of the method): pmd (from 0x1bd0ece)
// rax = mdpinfo->dump_provider
// 0x10(%rax) = vtable entry for the 2nd virtual method (8x2 = 16 = 0x10)
// of MDP, which is OnMemoryDump()

// This is where we crash.
[0x1bd0edc] <+508>: callq  *0x10(%rax)

So there are two possible crash reasons here:
1) We failed to dereference rax + 16, which essentially means the MDP pointer
  itself was pointing to unmapped memory region.
2) We managed to read *(rax + 16) but crashed when trying to jump at that
   address, as the address, result of the read above, was pointing to an
   unmapped memory region.

From the minidump context
  rax           = 0x700007fc12217314
  rip           = 0x102548edc  (within InvokeOnMemoryDump)

So this is definitely 1.
Which means that the object was deleted but we still had a reference in the
pending_dump_providers which was not marked as |disabled|.

So, still can't explain how this happens. I start thinking this is not a race
but some weird reentrancy issue. But again ran out of time today with 0 results :(
Ahh I have a theory, bear with me a second. Imagine the following scenario:

- We are on Thread A, the next thing in the list is a ThreadLocalEventBuffer on Thread B.
- So MDM::SetupNextMemoryDump does B->PostTask(InvokeOnMemoryDump)
- We are now on thread B and let's assume right now the mdp_info->dump_provider is NON-null, perfectly valid and alive.
- Hence we pass the if(should_dump) check.
- The first thing in the  if(should_dump) body is a  TRACE_EVENT1(kTraceCategory, "MemoryDumpManager::InvokeOnMemoryDump",...)
  If you look at the code of TRACE_EVENT1, the end up calling InitializeThreadLocalEventBufferIfSupported()
- InitializeThreadLocalEventBufferIfSupported will do a generation check. If the generation is old, it will delete and recreate the TLEB.
- The dtor of TLEB will call into MDM::UnregisterDumpProvider. Now we start the evil re-entrancy, so right now we are in:
  MemoryDumpManager::InvokeOnMemoryDump (note: the lock_ is NOT held in the body of the if)
    TRACE_EVENT1
       InitializeThreadLocalEventBufferIfSupported
         UnregisterDumpProvider
- The nested UnregisterDumpProvider() will set disabled = true and keep the pointer around. But now is too late, because we already entered the if(disabled) body.
- At this point the old TLEB has been destroyed, but we still have the pointer in our local mdpinfo (part of the async state). So as soon as we step over TRACE_EVENT1 we call dump_provider and BOOM.

Solution: 
Add a InitializeThreadLocalEventBufferIfSupported() call to the beginning of InvokeOnMemoryDump(). 

Although I would love to be able to reproduce this first at this point and write a test. Just can't tell how right now, but I am quite convinced that is the bug.
Yups, #c5 confirmed.
Repro and fix in https://chromium-review.googlesource.com/c/chromium/src/+/672687

Comment 7 by ssid@chromium.org, Sep 20 2017

Wow thanks for the investigation primiano!
Project Member

Comment 8 by bugdroid1@chromium.org, Sep 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/37e7759fea9965d238c9e709ea95bfdf58605953

commit 37e7759fea9965d238c9e709ea95bfdf58605953
Author: Primiano Tucci <primiano@chromium.org>
Date: Wed Sep 20 05:41:15 2017

memory-infra: Fix TOCTTOU bug on ThreadLocalEventBuffer MDP

Fixes a time-of-check-to-time-of-use bug on ThreadLocalEventBuffer
MDPs in MemoryDumpManager::InvokeOnMemoryDump().

This is the scenario:
- We are on Thread A, the next thing in the MDP list is a
  ThreadLocalEventBuffer on Thread B.
- MDM::SetupNextMemoryDump does B->PostTask(InvokeOnMemoryDump)
- We are now on thread B. Let's assume right now the
  mdp_info->dump_provider is NON-null, perfectly valid and alive.
- Hence we pass the if(should_dump) check.
- The first thing in the if(should_dump) body is a
  TRACE_EVENT1(kTraceCategory, ...)
- TRACE_EVENT1 calls InitializeThreadLocalEventBufferIfSupported()
- InitializeThreadLocalEventBufferIfSupported() will do a generation
  check. If the generation has changed (because of a new TraceLog
  recording mode), it will delete and recreate the TLEB.
- The dtor of TLEB will call into MDM::UnregisterDumpProvider.
  This triggers the re-entrancy into (another method of)
  MemoryDumpManager. Right now we are in:
  MemoryDumpManager::InvokeOnMemoryDump (lock_ is NOT held)
    TRACE_EVENT1
       InitializeThreadLocalEventBufferIfSupported()
         UnregisterDumpProvider()
- The nested UnregisterDumpProvider() will set
  |mdpinfo.disabled| = true and keep the pointer around.
  But now is too late, because we already entered the positive
  branch of the the if(disabled) condition.
- At this point the old TLEB has been destroyed, but we still
  have the pointer in our local |mdpinfo| (part of the async state).
- As soon as we step over TRACE_EVENT1 we call dump_provider and
  BOOM, we dereference a stale pointer.

Bug:  766274 
Test: services_unittests  --gtest_filter=MemoryTracingIntegrationTest.GenerationChangeDoesntReenterMDM
Change-Id: I085f849a6b4048f4bcaec4a9db4ee014d7a1057b
Reviewed-on: https://chromium-review.googlesource.com/672687
Reviewed-by: Siddhartha S <ssid@chromium.org>
Commit-Queue: Primiano Tucci <primiano@chromium.org>
Cr-Commit-Position: refs/heads/master@{#503062}
[modify] https://crrev.com/37e7759fea9965d238c9e709ea95bfdf58605953/base/trace_event/memory_dump_manager.cc
[modify] https://crrev.com/37e7759fea9965d238c9e709ea95bfdf58605953/services/resource_coordinator/public/cpp/memory_instrumentation/tracing_integration_unittest.cc

Labels: Merge-Request-62 OS-Android OS-Chrome OS-iOS OS-Linux OS-Windows
Owner: primiano@chromium.org
Status: Started (was: Untriaged)
Project Member

Comment 10 by sheriffbot@chromium.org, Sep 20 2017

Labels: -Merge-Request-62 Merge-Review-62 Hotlist-Merge-Review
This bug requires manual review: Less than 23 days to go before AppStore submit on M62
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Cc: abdulsyed@chromium.org amineer@chromium.org
Labels: -OS-iOS
+amineer, +abdulsyed, we found this small but plausible race.
The actual fix is 2 lines (looks larger because of reindenting the code). The merge request here is for #8
Project Member

Comment 12 by bugdroid1@chromium.org, Sep 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3f4691c85e47e4c585262fac668f462741d2026b

commit 3f4691c85e47e4c585262fac668f462741d2026b
Author: Primiano Tucci <primiano@chromium.org>
Date: Wed Sep 20 16:28:26 2017

memory-infra: relax race detection check in MDM::InvokeOnMemoryDump

Follow up to http://crrev.com/503062 . The race CHECK introduced
makes only sense for MDPs that are thread-bound and are destroyed
on the same thread on which they invoke OnMemoryDump().
For non-thread-bound MDPs that get deleted via
MemoryDumpManager::UnregisterAndDeleteDumpProviderSoon(),
|mdpinfo.disabled| might become actually false while dumping,
but that wouldn't be a race as they transfer the ownership
to the |mdpinfo| in that case.

TBR=ssid

Bug:  766274 
Change-Id: I019070882eb2404b40d1f5ab26247c0229962f72
Reviewed-on: https://chromium-review.googlesource.com/675223
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Commit-Queue: Primiano Tucci <primiano@chromium.org>
Cr-Commit-Position: refs/heads/master@{#503174}
[modify] https://crrev.com/3f4691c85e47e4c585262fac668f462741d2026b/base/trace_event/memory_dump_manager.cc

Labels: -Merge-Review-62 Merge-Approved-62
Approving merge to M62. Branch:3202. 
Project Member

Comment 14 by sheriffbot@chromium.org, Sep 29 2017

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 15 by sheriffbot@chromium.org, Oct 2 2017

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Reminder to please merge to M62 branch 3202 asap.
Project Member

Comment 17 by bugdroid1@chromium.org, Oct 4 2017

Labels: -merge-approved-62 merge-merged-3202
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/47448f027434abaac62bf92b0ad86d47ec2555b7

commit 47448f027434abaac62bf92b0ad86d47ec2555b7
Author: Primiano Tucci <primiano@chromium.org>
Date: Wed Oct 04 22:34:46 2017

memory-infra: Fix TOCTTOU bug on ThreadLocalEventBuffer MDP

Fixes a time-of-check-to-time-of-use bug on ThreadLocalEventBuffer
MDPs in MemoryDumpManager::InvokeOnMemoryDump().

This is the scenario:
- We are on Thread A, the next thing in the MDP list is a
  ThreadLocalEventBuffer on Thread B.
- MDM::SetupNextMemoryDump does B->PostTask(InvokeOnMemoryDump)
- We are now on thread B. Let's assume right now the
  mdp_info->dump_provider is NON-null, perfectly valid and alive.
- Hence we pass the if(should_dump) check.
- The first thing in the if(should_dump) body is a
  TRACE_EVENT1(kTraceCategory, ...)
- TRACE_EVENT1 calls InitializeThreadLocalEventBufferIfSupported()
- InitializeThreadLocalEventBufferIfSupported() will do a generation
  check. If the generation has changed (because of a new TraceLog
  recording mode), it will delete and recreate the TLEB.
- The dtor of TLEB will call into MDM::UnregisterDumpProvider.
  This triggers the re-entrancy into (another method of)
  MemoryDumpManager. Right now we are in:
  MemoryDumpManager::InvokeOnMemoryDump (lock_ is NOT held)
    TRACE_EVENT1
       InitializeThreadLocalEventBufferIfSupported()
         UnregisterDumpProvider()
- The nested UnregisterDumpProvider() will set
  |mdpinfo.disabled| = true and keep the pointer around.
  But now is too late, because we already entered the positive
  branch of the the if(disabled) condition.
- At this point the old TLEB has been destroyed, but we still
  have the pointer in our local |mdpinfo| (part of the async state).
- As soon as we step over TRACE_EVENT1 we call dump_provider and
  BOOM, we dereference a stale pointer.

TBR=primiano@chromium.org

(cherry picked from commit 37e7759fea9965d238c9e709ea95bfdf58605953)

Bug:  766274 
Test: services_unittests  --gtest_filter=MemoryTracingIntegrationTest.GenerationChangeDoesntReenterMDM
Change-Id: I085f849a6b4048f4bcaec4a9db4ee014d7a1057b
Reviewed-on: https://chromium-review.googlesource.com/672687
Reviewed-by: Siddhartha S <ssid@chromium.org>
Commit-Queue: Primiano Tucci <primiano@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#503062}
Reviewed-on: https://chromium-review.googlesource.com/701282
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Cr-Commit-Position: refs/branch-heads/3202@{#582}
Cr-Branched-From: fa6a5d87adff761bc16afc5498c3f5944c1daa68-refs/heads/master@{#499098}
[modify] https://crrev.com/47448f027434abaac62bf92b0ad86d47ec2555b7/base/trace_event/memory_dump_manager.cc
[modify] https://crrev.com/47448f027434abaac62bf92b0ad86d47ec2555b7/services/resource_coordinator/public/cpp/memory_instrumentation/tracing_integration_unittest.cc

Project Member

Comment 18 by bugdroid1@chromium.org, Oct 4 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c63f7695b544801ba890be8546572d3acd8236f9

commit c63f7695b544801ba890be8546572d3acd8236f9
Author: Primiano Tucci <primiano@chromium.org>
Date: Wed Oct 04 22:39:53 2017

memory-infra: relax race detection check in MDM::InvokeOnMemoryDump

Follow up to http://crrev.com/503062 . The race CHECK introduced
makes only sense for MDPs that are thread-bound and are destroyed
on the same thread on which they invoke OnMemoryDump().
For non-thread-bound MDPs that get deleted via
MemoryDumpManager::UnregisterAndDeleteDumpProviderSoon(),
|mdpinfo.disabled| might become actually false while dumping,
but that wouldn't be a race as they transfer the ownership
to the |mdpinfo| in that case.

TBR=primiano@chromium.org, ssid

(cherry picked from commit 3f4691c85e47e4c585262fac668f462741d2026b)

Bug:  766274 
Change-Id: I019070882eb2404b40d1f5ab26247c0229962f72
Reviewed-on: https://chromium-review.googlesource.com/675223
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Commit-Queue: Primiano Tucci <primiano@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#503174}
Reviewed-on: https://chromium-review.googlesource.com/701637
Cr-Commit-Position: refs/branch-heads/3202@{#583}
Cr-Branched-From: fa6a5d87adff761bc16afc5498c3f5944c1daa68-refs/heads/master@{#499098}
[modify] https://crrev.com/c63f7695b544801ba890be8546572d3acd8236f9/base/trace_event/memory_dump_manager.cc

Status: Fixed (was: Started)

Sign in to add a comment