ThreadLocalEventBuffer is not thread safe |
||||||||
Issue descriptionBackground context: go/memory-infra Crash:https://crash.corp.google.com/browse?stbtiq=d274ae8561459249&sql_dialect=dremelsql#4 I might be wrong. the crash happened due to unsafe unregistration of dump provider. I looked at the minidump and the dump provider name was ThreadLocalEventBuffer.
,
Sep 19 2017
Hmm erikchen said he experienced it more than once on canary, makes me feel like this is quite easy to hit. On the other side the data on crash/ shows only <10 crashes after M60 on this signature [1] So, quite conflicted on the priority of this. Right now I have a wild speculation that I need to confirm or verify. ThreadLocalEventBuffer should be truly thread-bound (% the case you mentioned which we shouldn't be hitting). However, a lot of things in chrome got silently relaxed from being thread-safe to being sequence-safe. So I wonder if what we are hitting here is a case of two sequences (that before would have been two distinct threads) ending up sharing the same thread (and one unregistering). I have to think a bit more on the code. [1] https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3Atrace_event%3A%3AMemoryDumpManager%3A%3AInvokeOnMemoryDump%27&sql_dialect=dremelsql&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#-property-selector,+productversion:1000
,
Sep 19 2017
I did some auditing at the code of ThreadLocalEventBuffer and it actually seems thread safe. ThreadLocalEventBuffer: - is only created for the MessageLoop::current(), and is only registered as MDP passing ThreadTaskRunnerHandle::Get(). When created it's only stored in a TLS pointer. - is only destroyed when: - The current message loop is being destroyed - A new generation of tracelog is created. In both cases this uses a lookup from a TLS slot, so we can't possibly be picking the wrong object from another thread. I am trying to debug the assembly in crash/d274ae8561459249 but I need some help. grokdump.py won't show me the assembly around IP and even if I pull the symbols from go/link_for_766274 I don't seem able to get the relative address of the PC to pass to gdb. I naively tried relative offset = $instruction_pointer_from_crash (0x0000000102548edc) - $base_address_of_Google_Chrome_Framework_from_crash (0x0000000100978000) but that is clearly wrong and I think the Mach-O layout is less simple than what I was expecting. Right now the only theories I have is: - something I cannot easily spot is invalidating, without calling the dtor, the TLEB - Our logic in MDM when we do: if( mdpinfo->options.dumps_on_single_thread_task_runner && task_runner->RunsTasksInCurrentSequence()) Invoke MDP.OnMemoryDump without another PostTask is conceptually flawed and we might end up being on the wrong thread.
,
Sep 19 2017
The source code is:
strncpy(provider_name_for_debugging, mdpinfo->name,
sizeof(provider_name_for_debugging) - 1);
provider_name_for_debugging[sizeof(provider_name_for_debugging) - 1] = '\0';
base::debug::Alias(provider_name_for_debugging);
ProcessMemoryDump* pmd = pmd_async_state->process_memory_dump.get();
bool dump_successful =
mdpinfo->dump_provider->OnMemoryDump(pmd->dump_args(), pmd);
The SysV calling convention for x86_64 is:
rdi : 1st arg
rsi : 2nd arg
rdx : 3rd arg
The assembly:
// R13 = mdpinfo, |name| is the 3rd argument of mdpinfo, 8x3 = 24 (0x18)
// So this puts the |mdpinfo|->|name| pointer in rsi (1st arg)
[0x1bd0ead] <+461>: movq 0x18(%r13), %rsi
// -0x50(%rbp) is very likely the address on the stack of
// |provider_name_for_debugging|, so rbx now == |provider_name_for_debugging|
[0x1bd0eb1] <+465>: leaq -0x50(%rbp), %rbx
// copies 15 (sizeof(provider_name_for_debugging) - 1) into rdx (3rd arg)
[0x1bd0eb5] <+469>: movl $0xf, %edx
// rdi (1st arg) = &provider_name_for_debugging
[0x1bd0eba] <+474>: movq %rbx, %rdi
// call strncpy(rdi, rsi, rdx), everything matches
[0x1bd0ebd] <+477>: callq 0x5ed3644 ; symbol stub for: strncpy
// adds \0 terminator to |provider_name_for_debugging|
[0x1bd0ec2] <+482>: movb $0x0, -0x41(%rbp)
// rdi (1st arg) = &provider_name_for_debugging
[0x1bd0ec6] <+486>: movq %rbx, %rdi
// call base::debug::Alias, which is a noop really, but the compiler doesn't
// know it
[0x1bd0ec9] <+489>: callq 0x1b45000 ; base::debug::Alias at alias.cc:15
// At this point r15 contains the |pmd| pointer, copy that into rdx (3rd arg)
[0x1bd0ece] <+494>: movq (%r15), %rdx
// r13 = mdpinfo, rdi = 0x8(%r13) = &(mdpinfo->dump_provider)
[0x1bd0ed1] <+497>: movq 0x8(%r13), %rdi
// rax = *(r13) = mdpinfo->dump_provider. rax now contains the MDP pointer.
[0x1bd0ed5] <+501>: movq (%rdi), %rax
// rsi (1st arg) = pmd->dump_args
[0x1bd0ed8] <+504>: leaq 0x70(%rdx), %rsi
// As a summary, now the args loaded on the registers are
// rdi (|this| ptr of the method): &(mdpinfo->dump_provider)
// rsi (1st arg of the method): pmd->dump_args (from 0x1bd0ed8 above)
// rdx (2nd arg of the method): pmd (from 0x1bd0ece)
// rax = mdpinfo->dump_provider
// 0x10(%rax) = vtable entry for the 2nd virtual method (8x2 = 16 = 0x10)
// of MDP, which is OnMemoryDump()
// This is where we crash.
[0x1bd0edc] <+508>: callq *0x10(%rax)
So there are two possible crash reasons here:
1) We failed to dereference rax + 16, which essentially means the MDP pointer
itself was pointing to unmapped memory region.
2) We managed to read *(rax + 16) but crashed when trying to jump at that
address, as the address, result of the read above, was pointing to an
unmapped memory region.
From the minidump context
rax = 0x700007fc12217314
rip = 0x102548edc (within InvokeOnMemoryDump)
So this is definitely 1.
Which means that the object was deleted but we still had a reference in the
pending_dump_providers which was not marked as |disabled|.
So, still can't explain how this happens. I start thinking this is not a race
but some weird reentrancy issue. But again ran out of time today with 0 results :(
,
Sep 19 2017
Ahh I have a theory, bear with me a second. Imagine the following scenario:
- We are on Thread A, the next thing in the list is a ThreadLocalEventBuffer on Thread B.
- So MDM::SetupNextMemoryDump does B->PostTask(InvokeOnMemoryDump)
- We are now on thread B and let's assume right now the mdp_info->dump_provider is NON-null, perfectly valid and alive.
- Hence we pass the if(should_dump) check.
- The first thing in the if(should_dump) body is a TRACE_EVENT1(kTraceCategory, "MemoryDumpManager::InvokeOnMemoryDump",...)
If you look at the code of TRACE_EVENT1, the end up calling InitializeThreadLocalEventBufferIfSupported()
- InitializeThreadLocalEventBufferIfSupported will do a generation check. If the generation is old, it will delete and recreate the TLEB.
- The dtor of TLEB will call into MDM::UnregisterDumpProvider. Now we start the evil re-entrancy, so right now we are in:
MemoryDumpManager::InvokeOnMemoryDump (note: the lock_ is NOT held in the body of the if)
TRACE_EVENT1
InitializeThreadLocalEventBufferIfSupported
UnregisterDumpProvider
- The nested UnregisterDumpProvider() will set disabled = true and keep the pointer around. But now is too late, because we already entered the if(disabled) body.
- At this point the old TLEB has been destroyed, but we still have the pointer in our local mdpinfo (part of the async state). So as soon as we step over TRACE_EVENT1 we call dump_provider and BOOM.
Solution:
Add a InitializeThreadLocalEventBufferIfSupported() call to the beginning of InvokeOnMemoryDump().
Although I would love to be able to reproduce this first at this point and write a test. Just can't tell how right now, but I am quite convinced that is the bug.
,
Sep 19 2017
Yups, #c5 confirmed. Repro and fix in https://chromium-review.googlesource.com/c/chromium/src/+/672687
,
Sep 20 2017
Wow thanks for the investigation primiano!
,
Sep 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/37e7759fea9965d238c9e709ea95bfdf58605953 commit 37e7759fea9965d238c9e709ea95bfdf58605953 Author: Primiano Tucci <primiano@chromium.org> Date: Wed Sep 20 05:41:15 2017 memory-infra: Fix TOCTTOU bug on ThreadLocalEventBuffer MDP Fixes a time-of-check-to-time-of-use bug on ThreadLocalEventBuffer MDPs in MemoryDumpManager::InvokeOnMemoryDump(). This is the scenario: - We are on Thread A, the next thing in the MDP list is a ThreadLocalEventBuffer on Thread B. - MDM::SetupNextMemoryDump does B->PostTask(InvokeOnMemoryDump) - We are now on thread B. Let's assume right now the mdp_info->dump_provider is NON-null, perfectly valid and alive. - Hence we pass the if(should_dump) check. - The first thing in the if(should_dump) body is a TRACE_EVENT1(kTraceCategory, ...) - TRACE_EVENT1 calls InitializeThreadLocalEventBufferIfSupported() - InitializeThreadLocalEventBufferIfSupported() will do a generation check. If the generation has changed (because of a new TraceLog recording mode), it will delete and recreate the TLEB. - The dtor of TLEB will call into MDM::UnregisterDumpProvider. This triggers the re-entrancy into (another method of) MemoryDumpManager. Right now we are in: MemoryDumpManager::InvokeOnMemoryDump (lock_ is NOT held) TRACE_EVENT1 InitializeThreadLocalEventBufferIfSupported() UnregisterDumpProvider() - The nested UnregisterDumpProvider() will set |mdpinfo.disabled| = true and keep the pointer around. But now is too late, because we already entered the positive branch of the the if(disabled) condition. - At this point the old TLEB has been destroyed, but we still have the pointer in our local |mdpinfo| (part of the async state). - As soon as we step over TRACE_EVENT1 we call dump_provider and BOOM, we dereference a stale pointer. Bug: 766274 Test: services_unittests --gtest_filter=MemoryTracingIntegrationTest.GenerationChangeDoesntReenterMDM Change-Id: I085f849a6b4048f4bcaec4a9db4ee014d7a1057b Reviewed-on: https://chromium-review.googlesource.com/672687 Reviewed-by: Siddhartha S <ssid@chromium.org> Commit-Queue: Primiano Tucci <primiano@chromium.org> Cr-Commit-Position: refs/heads/master@{#503062} [modify] https://crrev.com/37e7759fea9965d238c9e709ea95bfdf58605953/base/trace_event/memory_dump_manager.cc [modify] https://crrev.com/37e7759fea9965d238c9e709ea95bfdf58605953/services/resource_coordinator/public/cpp/memory_instrumentation/tracing_integration_unittest.cc
,
Sep 20 2017
,
Sep 20 2017
This bug requires manual review: Less than 23 days to go before AppStore submit on M62 Please contact the milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Sep 20 2017
+amineer, +abdulsyed, we found this small but plausible race. The actual fix is 2 lines (looks larger because of reindenting the code). The merge request here is for #8
,
Sep 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3f4691c85e47e4c585262fac668f462741d2026b commit 3f4691c85e47e4c585262fac668f462741d2026b Author: Primiano Tucci <primiano@chromium.org> Date: Wed Sep 20 16:28:26 2017 memory-infra: relax race detection check in MDM::InvokeOnMemoryDump Follow up to http://crrev.com/503062 . The race CHECK introduced makes only sense for MDPs that are thread-bound and are destroyed on the same thread on which they invoke OnMemoryDump(). For non-thread-bound MDPs that get deleted via MemoryDumpManager::UnregisterAndDeleteDumpProviderSoon(), |mdpinfo.disabled| might become actually false while dumping, but that wouldn't be a race as they transfer the ownership to the |mdpinfo| in that case. TBR=ssid Bug: 766274 Change-Id: I019070882eb2404b40d1f5ab26247c0229962f72 Reviewed-on: https://chromium-review.googlesource.com/675223 Reviewed-by: Primiano Tucci <primiano@chromium.org> Commit-Queue: Primiano Tucci <primiano@chromium.org> Cr-Commit-Position: refs/heads/master@{#503174} [modify] https://crrev.com/3f4691c85e47e4c585262fac668f462741d2026b/base/trace_event/memory_dump_manager.cc
,
Sep 25 2017
Approving merge to M62. Branch:3202.
,
Sep 29 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 2 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 2 2017
Reminder to please merge to M62 branch 3202 asap.
,
Oct 4 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/47448f027434abaac62bf92b0ad86d47ec2555b7 commit 47448f027434abaac62bf92b0ad86d47ec2555b7 Author: Primiano Tucci <primiano@chromium.org> Date: Wed Oct 04 22:34:46 2017 memory-infra: Fix TOCTTOU bug on ThreadLocalEventBuffer MDP Fixes a time-of-check-to-time-of-use bug on ThreadLocalEventBuffer MDPs in MemoryDumpManager::InvokeOnMemoryDump(). This is the scenario: - We are on Thread A, the next thing in the MDP list is a ThreadLocalEventBuffer on Thread B. - MDM::SetupNextMemoryDump does B->PostTask(InvokeOnMemoryDump) - We are now on thread B. Let's assume right now the mdp_info->dump_provider is NON-null, perfectly valid and alive. - Hence we pass the if(should_dump) check. - The first thing in the if(should_dump) body is a TRACE_EVENT1(kTraceCategory, ...) - TRACE_EVENT1 calls InitializeThreadLocalEventBufferIfSupported() - InitializeThreadLocalEventBufferIfSupported() will do a generation check. If the generation has changed (because of a new TraceLog recording mode), it will delete and recreate the TLEB. - The dtor of TLEB will call into MDM::UnregisterDumpProvider. This triggers the re-entrancy into (another method of) MemoryDumpManager. Right now we are in: MemoryDumpManager::InvokeOnMemoryDump (lock_ is NOT held) TRACE_EVENT1 InitializeThreadLocalEventBufferIfSupported() UnregisterDumpProvider() - The nested UnregisterDumpProvider() will set |mdpinfo.disabled| = true and keep the pointer around. But now is too late, because we already entered the positive branch of the the if(disabled) condition. - At this point the old TLEB has been destroyed, but we still have the pointer in our local |mdpinfo| (part of the async state). - As soon as we step over TRACE_EVENT1 we call dump_provider and BOOM, we dereference a stale pointer. TBR=primiano@chromium.org (cherry picked from commit 37e7759fea9965d238c9e709ea95bfdf58605953) Bug: 766274 Test: services_unittests --gtest_filter=MemoryTracingIntegrationTest.GenerationChangeDoesntReenterMDM Change-Id: I085f849a6b4048f4bcaec4a9db4ee014d7a1057b Reviewed-on: https://chromium-review.googlesource.com/672687 Reviewed-by: Siddhartha S <ssid@chromium.org> Commit-Queue: Primiano Tucci <primiano@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#503062} Reviewed-on: https://chromium-review.googlesource.com/701282 Reviewed-by: Primiano Tucci <primiano@chromium.org> Cr-Commit-Position: refs/branch-heads/3202@{#582} Cr-Branched-From: fa6a5d87adff761bc16afc5498c3f5944c1daa68-refs/heads/master@{#499098} [modify] https://crrev.com/47448f027434abaac62bf92b0ad86d47ec2555b7/base/trace_event/memory_dump_manager.cc [modify] https://crrev.com/47448f027434abaac62bf92b0ad86d47ec2555b7/services/resource_coordinator/public/cpp/memory_instrumentation/tracing_integration_unittest.cc
,
Oct 4 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c63f7695b544801ba890be8546572d3acd8236f9 commit c63f7695b544801ba890be8546572d3acd8236f9 Author: Primiano Tucci <primiano@chromium.org> Date: Wed Oct 04 22:39:53 2017 memory-infra: relax race detection check in MDM::InvokeOnMemoryDump Follow up to http://crrev.com/503062 . The race CHECK introduced makes only sense for MDPs that are thread-bound and are destroyed on the same thread on which they invoke OnMemoryDump(). For non-thread-bound MDPs that get deleted via MemoryDumpManager::UnregisterAndDeleteDumpProviderSoon(), |mdpinfo.disabled| might become actually false while dumping, but that wouldn't be a race as they transfer the ownership to the |mdpinfo| in that case. TBR=primiano@chromium.org, ssid (cherry picked from commit 3f4691c85e47e4c585262fac668f462741d2026b) Bug: 766274 Change-Id: I019070882eb2404b40d1f5ab26247c0229962f72 Reviewed-on: https://chromium-review.googlesource.com/675223 Reviewed-by: Primiano Tucci <primiano@chromium.org> Commit-Queue: Primiano Tucci <primiano@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#503174} Reviewed-on: https://chromium-review.googlesource.com/701637 Cr-Commit-Position: refs/branch-heads/3202@{#583} Cr-Branched-From: fa6a5d87adff761bc16afc5498c3f5944c1daa68-refs/heads/master@{#499098} [modify] https://crrev.com/c63f7695b544801ba890be8546572d3acd8236f9/base/trace_event/memory_dump_manager.cc
,
Oct 5 2017
|
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by ssid@chromium.org
, Sep 18 2017