CRLSetTest.Parse is flaky - sometimes passes, sometimes crashes on the same revision |
||||||||
Issue description
Steps to reproduce:
$ mkdir out/repro
$ cat > out/repro <<EOT
clang_use_chrome_plugins = false
is_clang = true
is_component_build = true
is_debug = false
symbol_level = 2
llvm_force_head_revision = false
clang_base_path = "/path/to/clang"
use_goma = false
EOT
$ gn gen out/repro
$ ninja -C out/repro net_unittests
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
Expected behavior:
Every run passes.
Observed behavior:
Some runs pass, some crash.
Example crash:
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
IMPORTANT DEBUGGING NOTE: batches of tests are run inside their
own process. For debugging a test inside a debugger, use the
--gtest_filter=<your_test_name> flag along with
--single-process-tests.
Using sharding settings from environment. This is shard 0/1
Using 1 parallel jobs.
Note: Google Test filter = CRLSetTest.Parse
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CRLSetTest
[ RUN ] CRLSetTest.Parse
Received signal 11 SEGV_MAPERR 000000000000
#0 0x7fe2b5e1f43c base::debug::StackTrace::StackTrace()
#1 0x7fe2b5e1efa1 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7fe2b5b16330 <unknown>
#3 0x7fe2b4484521 <unknown>
#4 0x7fe2b500beaf std::__1::basic_string<>::assign()
#5 0x7fe2b5865971 net::ReadCRL()
#6 0x7fe2b58655c8 net::CRLSetStorage::Parse()
#7 0x00000075669d net::CRLSetTest_Parse_Test::TestBody()
#8 0x000002045506 testing::Test::Run()
#9 0x000002045fe0 testing::TestInfo::Run()
#10 0x0000020464c7 testing::TestCase::Run()
#11 0x00000204cb87 testing::internal::UnitTestImpl::RunAllTests()
#12 0x00000204c7d7 testing::UnitTest::Run()
#13 0x0000020cbc42 base::TestSuite::Run()
#14 0x0000020ceb6e base::(anonymous namespace)::LaunchUnitTestsInternal()
#15 0x0000020cea2b base::LaunchUnitTests()
#16 0x000001c531b8 main
#17 0x7fe2b4350f45 __libc_start_main
#18 0x0000005da02a _start
r8: 0000000000000006 r9: 00000000003822c2 r10: 0000000000000003 r11: 00007fe2b44ba510
r12: 00007ffea793d490 r13: 0000000000000020 r14: 00002fb233f48000 r15: ffffd04e331edc8b
di: 0000000000000020 si: 00000000003822e2 bp: 00007ffea793d290 bx: 0000000000000000
dx: 00007fe2b448451c ax: 0000000000000000 cx: 0000000000000000 sp: 00007ffea793d248
ip: 00007fe2b4484521 efl: 0000000000010202 cgf: 0000000000000033 erf: 0000000000000006
trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]
Calling _exit(1). Core file will not be generated.
[1/1] CRLSetTest.Parse (CRASHED)
1 test crashed:
CRLSetTest.Parse (../../net/cert/crl_set_unittest.cc:196)
Tests took 0 seconds.
,
Dec 23 2017
Could you clarify why you believe this is flaky (that is, not tied to the Clang version)? Nothing in this test stands out as inherently likely suspect, so it seems like any flakiness may be representative of a real error (outside of the test). I'm having trouble reproducing the flake locally - could you indicate how you tested in comment #1?
,
Dec 23 2017
Actually, I suspect that Clang is the culprit. I'm bisecting that now.
,
Dec 23 2017
I might not have seen all the emails yet, but I thought the conclusion was already that this was caused by the Clang roll and that the culprit was the change to memcpyopt in r321371.
,
Dec 27 2017
The suspected change is r321138, which relands r319483 and r319482. This is the IR diff after that change: https://reviews.llvm.org/P8054 i.e. we turn some memcpy's into memset(&str, 0, 24). I don't see how that would create new uninitialized data, though.
,
Dec 27 2017
,
Dec 27 2017
,
Dec 27 2017
You can determinstically observe the problem with valgrind. MSan should in theory be able to find the problem, but it doesn't in practice. I guess the MSan instrumentation inhibits memcpy optimization.
In order to use valgrind, I had to add use_allocator="none" to my gn args to disable tcmalloc and apply this patch:
diff --git a/base/process/memory_linux.cc b/base/process/memory_linux.cc
index a985b7a8ceb3..f06b986804ee 100644
--- a/base/process/memory_linux.cc
+++ b/base/process/memory_linux.cc
@@ -21,6 +21,8 @@
#include "third_party/tcmalloc/chromium/src/gperftools/tcmalloc.h"
#endif
+extern "C" void *__libc_malloc(size_t);
+
namespace base {
size_t g_oom_size = 0U;
We could probably also see the bug if we configured tcmalloc to fill allocated memory with random non-zero bytes.
Here's what valgrind says:
$ valgrind ./net_unittests --gtest_filter=CRLSetTest.Parse --single-process-tests
...
Note: Google Test filter = CRLSetTest.Parse
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CRLSetTest
[ RUN ] CRLSetTest.Parse
==38904== Conditional jump or move depends on uninitialised value(s)
==38904== at 0x5C25E3C: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libc++.so)
==38904== by 0x5410180: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904== by 0x540FDC8: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904== by 0x888FBC: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/net_unittests)
...
==38904==
==38904== Conditional jump or move depends on uninitialised value(s)
==38904== at 0x5C25E83: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libc++.so)
==38904== by 0x5410180: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904== by 0x540FDC8: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904== by 0x888FBC: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/net_unittests)
==38904== by 0x217FAF5: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/net_unittests)
...
==38904==
[ OK ] CRLSetTest.Parse (58 ms)
[----------] 1 test from CRLSetTest (66 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (130 ms total)
[ PASSED ] 1 test.
Apparently valgrind doesn't like clang's debug info. =/
,
Dec 28 2017
I've extracted the buggy call to push_back into it's own function in a separate .cc file for easier analysis. It looks like:
#include "net/cert/crl_set_storage.h"
namespace net {
bool ReadCRL(base::StringPiece* data,
std::string* out_parent_spki_hash,
std::vector<std::string>* out_serials);
// This function does 5 24 bytes memsets with four stack addresses. We need 2
// heap addresses to zero out the 48 byte pair.
void CRLSetStorage::ParseLoop(base::StringPiece& data, CRLSet* crl_set) {
// Speculatively push back a pair and pass it to ReadCRL() to avoid
// unnecessary copies.
crl_set->crls_.push_back(
std::make_pair(std::string(), std::vector<std::string>()));
ReadCRL(&data, &crl_set->crls_.back().first, &crl_set->crls_.back().second);
}
} // namespace net
The code we generate is wrong, but it seems like it's only wrong when debug info is enabled.
Here's how things are supposed to work:
1. Default construct the reference temporary pair of string and vector<string>. This should use two 24 byte memsets on a stack address.
2. If there is space in the vector, which there is, move from the reference temporary. memcpyopt discovers that the sources are zeroed, and turns the memcpy into memset. Again, two 24 byte memsets.
3. Destroy the reference temporary.
Here's the x86 code we get for this:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
subq $48, %rsp
movq %rsi, %r12
movq %rdi, %r14
leaq -80(%rbp), %rdi
xorl %esi, %esi
movl $24, %edx
callq memset@PLT # Default construct
leaq -56(%rbp), %r15
xorl %esi, %esi
movl $24, %edx
movq %r15, %rdi
callq memset@PLT # Default construct
movq 16(%r12), %rbx
cmpq 24(%r12), %rbx
jae .LBB0_2
# %bb.1:
leaq -80(%rbp), %rdi
xorl %esi, %esi
movl $24, %edx
callq memset@PLT # Re-zero the vector?
leaq 24(%rbx), %rdi
xorl %esi, %esi
movl $24, %edx
callq memset@PLT # Zero the vector on the heap
movq -56(%rbp), %rax # Sequence of 8-byte load/store pairs copying from stack to RBX+24
movq %rax, 24(%rbx)
movq -48(%rbp), %rax
movq %rax, 32(%rbx)
movq -40(%rbp), %rax
movq %rax, 40(%rbx)
xorl %esi, %esi
movl $24, %edx
movq %r15, %rdi
callq memset@PLT # Re-zero moved-from stack memory
addq $48, 16(%r12)
testb $1, -80(%rbp)
jne .LBB0_11
jmp .LBB0_12
------
I diffed the assembly with and without debug info, and I got something pretty suspicious:
.globl _ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE # -- Begin function _ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE
.p2align 4, 0x90
.type _ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE,@function
_ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE: # @_ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE
+.Lfunc_begin0:
+ .file 56 "../../net/cert" "crl_set_storage2.cc"
# %bb.0: # %entry
pushq %rbp
movq %rsp, %rbp
@@ -28,10 +85,6 @@
cmpq 24(%r12), %rbx
jae .LBB0_2
# %bb.1: # %_ZNSt3__16vectorINS_4pairINS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEENS0_IS7_NS5_IS7_EEEEEENS5_ISA_EEE9push_backEOSA_.exit.thread
- xorl %esi, %esi
- movl $24, %edx
- movq %rbx, %rdi
- callq memset@PLT
leaq -80(%rbp), %rdi
xorl %esi, %esi
movl $24, %edx
@@ -74,6 +127,7 @@
# %bb.6: # %if.then.i.i.i.i.i.i.i.i
# in Loop: Header=BB0_5 Depth=1
There's our missing memcpy that's supposed to initialize the heap string through RBX... Maybe now I can make a reduction.
,
Dec 28 2017
Reduction:
$ cat t.cpp
#include <assert.h>
#include <stdio.h>
#include <string>
#include <utility>
#include <vector>
void do_push_back(
std::vector<std::pair<std::string, std::vector<std::string>>>* crls) {
crls->push_back(std::make_pair(std::string(), std::vector<std::string>()));
}
int __attribute__((optnone)) main() {
// Put some data in the vector and then remove it so we take the push_back
// fast path.
std::vector<std::pair<std::string, std::vector<std::string>>> crl_set;
crl_set.push_back({"asdf", {}});
crl_set.pop_back();
printf("first word in vector storage: %p\n", *(void**)crl_set.data());
// Do the push_back which may fail to initialize the data.
do_push_back(&crl_set);
auto* first = &crl_set.back().first;
printf("first word in vector storage (should be zero): %p\n",
*(void**)crl_set.data());
assert(first->empty());
puts("ok");
}
$ clang++ -stdlib=libc++ -g -O2 t.cpp -o t.exe -Wl,-rpath=./lib && ./t.exe
first word in vector storage: 0x6664736108
first word in vector storage (should be zero): 0x6664736108
t.exe: t.cpp:25: int main(): Assertion `first->empty()' failed.
Aborted
Removing either -g or -O2 makes the problem go away.
However, it's not clear that memcpyopt is at fault here. This may be a pre-existing bug.
,
Dec 28 2017
,
Dec 28 2017
I tried debugging memcpyopt more to try to fix forward, but I gave up and reverted in r321510. We'll try a revision past that for the clang roll. Before landing, we should manually run the CRLSetTest suite a few times with the new compiler.
,
Jan 3 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0d72bb1494c94523ae7bd944476a495e8b281bcc commit 0d72bb1494c94523ae7bd944476a495e8b281bcc Author: Reid Kleckner <rnk@google.com> Date: Wed Jan 03 18:32:36 2018 Roll clang 318667:321529. Created by patching in https://chromium-review.googlesource.com/c/chromium/src/+/835358 and manually editting the revision in update.py. TBR-ing rsesek@ again for the sandbox changes. TBR-ing bretw@ again for the ppapi adapter LLD warning suppression. I will manually build and test net_unittests after uploading to goma, but I've built this clang revision locally and they are not flaky for me. R=inglorion@chromium.org,hans@chromium.org TBR=rsesek@chromium.org,brettw@chromium.org Bug: 787920 , 797267, 797168 Cq-Include-Trybots: master.tryserver.chromium.android:android_arm64_dbg_recipe;master.tryserver.chromium.android:android_compile_mips_dbg;master.tryserver.chromium.android:android_compile_x64_dbg;master.tryserver.chromium.android:android_compile_x86_dbg Change-Id: Ifd25c014e79ad7a5d4533941ae7fd74a86860191 Reviewed-on: https://chromium-review.googlesource.com/845700 Reviewed-by: Nico Weber <thakis@chromium.org> Reviewed-by: Hans Wennborg <hans@chromium.org> Reviewed-by: Robert Sesek <rsesek@chromium.org> Reviewed-by: Reid Kleckner <rnk@chromium.org> Commit-Queue: Reid Kleckner <rnk@chromium.org> Cr-Commit-Position: refs/heads/master@{#526748} [modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/build/config/BUILDCONFIG.gn [modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/build/config/compiler/BUILD.gn [modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/media/cdm/ppapi/ppapi_cdm_adapter.gni [modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/sandbox/linux/tests/unit_tests.cc [modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/tools/clang/scripts/update.py
,
Jan 5 2018
Marking fixed. The buggy optimization was reverted and the test passes reliably with the new compiler. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by inglorion@chromium.org
, Dec 22 2017