New issue
Advanced search Search tips

Issue 920440 link

Starred by 3 users

Issue metadata

Status: Untriaged
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocking:
issue 896019



Sign in to add a comment

GWP-ASan overwrites GetLastError() with ERROR_SUCCESS

Project Member Reported by vtsyrklevich@chromium.org, Jan 9

Issue description

While trying to enable GWP-ASan's field trial testing config in https://chromium-review.googlesource.com/c/chromium/src/+/1388938 I was hitting the following unexpected test failures:
Parameterized/PolicyUpdateServiceTest.PolicyCorruptedOnStartup/1
Parameterized/PolicyUpdateServiceTest.PolicyCorruptedOnStartup/0
Parameterized/UpdateServiceTest.SuccessfulUpdate/0
Parameterized/UpdateServiceTest.SuccessfulUpdate/1
Parameterized/UpdateServiceTest.PolicyCorrupted/0
Parameterized/UpdateServiceTest.PolicyCorrupted/1
Parameterized/PolicyUpdateServiceTest.FailedUpdateRetries/0
Parameterized/PolicyUpdateServiceTest.FailedUpdateRetries/1
Parameterized/PolicyUpdateServiceTest.Backoff/0
Parameterized/PolicyUpdateServiceTest.Backoff/1

After a little bit of digging I found that the test failures were due to a DCHECK in base file routines that were triggered when an error occured but GetLastError() did not have an error saved. The cause was that an allocation routine was called at some point after the file error and GWP-ASan's TLSGetValue (using TlsGetValue on Windows) overwrites GetLastError with ERROR_SUCCESS as noted in https://docs.microsoft.com/en-us/windows/desktop/api/processthreadsapi/nf-processthreadsapi-tlsgetvalue: Functions that return indications of failure call SetLastError when they fail. They generally do not call SetLastError when they succeed. The TlsGetValue function is an exception to this general rule. The TlsGetValue function calls SetLastError to clear a thread's last error when it succeeds. That allows checking for the error-free retrieval of zero values.

To fix I need to save LastError across calls to TlsGetValue and I need to investigate whether I need something similar on POSIX platforms as well for errno. I'd like to refactor base::internal::ScopedClearLastError to base::ScopedClearLastError so that I can do that easily.
 
Cc: alph@chromium.org
Alexei I would assume that you would be hitting the same failure as I am since the SamplingHeapProfiler also looks like it calls TlsGetValue for every allocation. Is the SamplingHeapProfile turned on by default/in the field trial testing config? If so, I'm not sure why you would not hit the same failure.
Blocking: 896019
Cc: erikc...@chromium.org
That's an interesting finding! Thanks for sharing.

We don't actually enable heap profiler by default. So most of the tests (besides the profiler ones) run without profiler enabled.
However the scenario can potentially take place in the wild, when the profiler is enabled.

The TlsGetValue documentation you linked is not clear whether last error is updated on every invocation of TlsGetValue or only when the returned value happens to be zero. If former we seem indeed need to store last error across TLS accesses. If latter we can try to workaround it by making sure we never store zero to TL variables.
Project Member

Comment 4 by bugdroid1@chromium.org, Jan 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f11f6e41c03def6838a6ac32d3976990639413e9

commit f11f6e41c03def6838a6ac32d3976990639413e9
Author: Vlad Tsyrklevich <vtsyrklevich@chromium.org>
Date: Thu Jan 10 23:05:22 2019

GWP-ASan: Preserve last error across allocations

Calls to TlsGetValue on every allocation can overwrite the value in
GetLastError() before error handling code has a chance to handle the
error. Preserve its value across calls.

Bug: 920440
Change-Id: Iaee843b71505625fc88ac51698b63c41de957c33
Reviewed-on: https://chromium-review.googlesource.com/c/1404476
Commit-Queue: Vlad Tsyrklevich <vtsyrklevich@chromium.org>
Reviewed-by: Vitaly Buka <vitalybuka@chromium.org>
Cr-Commit-Position: refs/heads/master@{#621790}
[modify] https://crrev.com/f11f6e41c03def6838a6ac32d3976990639413e9/components/gwp_asan/client/sampling_allocator_shims_win.h

> If former we seem indeed need to store last error across TLS accesses.

It's the former, TlsGetValue() sets LastError to 0 on success irrespective of the returned value.
We can probably use thread_local storage on Win. At least there's a precedent https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?type=cs&q=thread_local&g=0&l=149
I was going to try to land the Mac thread_local fix in LLVM first but given that there is precedent I would rather do it sooner as it will also help with some of the perf regressions I'm seeing in crbug.com/921237. I'm also working on implementing macOS support for GWP-ASan anyway so I'll try to finish running down the LLVM bug either way.

Thanks for pointing that out.
Project Member

Comment 8 by bugdroid1@chromium.org, Jan 17 (5 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c72ef12f9ff6132ce2368484883fa0a391807f43

commit c72ef12f9ff6132ce2368484883fa0a391807f43
Author: Vlad Tsyrklevich <vtsyrklevich@chromium.org>
Date: Thu Jan 17 21:23:01 2019

GWP-ASan: Optimize sampling using thread_local

After being assigned some performance regressions after enabling
GWP-ASan in the field trial testing config I benchmarked a number of
GWP-ASan configurations. I found that by turning down the sampling
frequency from the current value I could reduce the GWP-ASan overhead
but that it was still noticeable even with sampling values so high that
we only noticed the overhead of sampling in allocations (and performing
the range checks during frees though this cost is not noticeable.)

Currently sampling is performed by holding a counter in thread-local
storage and accessing TLS using system APIs. I found that using
thread_local to store the sampling counter I could reduce the overhead
of a micro-benchmark testing the sampling overhead by approximately
30-40%. ALWAYS_INLINE improves performance even further.

A bug on macOS has held up thread_local adoption in the C++11 allowed
features; however, I've disabled macOS support in order to get the
performance improvement on Windows and I'm working down the LLVM issue
blocking macOS adoption in parallel.

Bug: 922556, 920440, 896019
Change-Id: Ic35a0e44a25b27bcf3bf8a31234e9704f26b904b
Reviewed-on: https://chromium-review.googlesource.com/c/1416093
Auto-Submit: Vlad Tsyrklevich <vtsyrklevich@chromium.org>
Commit-Queue: Vlad Tsyrklevich <vtsyrklevich@chromium.org>
Reviewed-by: Vitaly Buka <vitalybuka@chromium.org>
Cr-Commit-Position: refs/heads/master@{#623848}
[modify] https://crrev.com/c72ef12f9ff6132ce2368484883fa0a391807f43/components/gwp_asan/BUILD.gn
[modify] https://crrev.com/c72ef12f9ff6132ce2368484883fa0a391807f43/components/gwp_asan/client/BUILD.gn
[modify] https://crrev.com/c72ef12f9ff6132ce2368484883fa0a391807f43/components/gwp_asan/client/sampling_allocator_shims.cc
[delete] https://crrev.com/5bdea1a4780492ef2ae046f17b3ca9c64795711d/components/gwp_asan/client/sampling_allocator_shims_posix.h
[delete] https://crrev.com/5bdea1a4780492ef2ae046f17b3ca9c64795711d/components/gwp_asan/client/sampling_allocator_shims_win.h

Comment 9 by vtsyrklevich@chromium.org, Jan 18 (4 days ago)

Owner: alph@chromium.org
Alexei I've reassigned to you as this is fixed in GWP-ASan. Note that using thread_local in the allocator shims on macOS is tricky. First access to the thread_local in a thread will call malloc, so a re-entry guard not based on thread_local is required.

Sign in to add a comment