New issue
Advanced search Search tips

Issue 797168 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 3
Type: Bug

Blocking:
issue 797267
issue 787920



Sign in to add a comment

CRLSetTest.Parse is flaky - sometimes passes, sometimes crashes on the same revision

Project Member Reported by inglorion@chromium.org, Dec 22 2017

Issue description

Steps to reproduce:

$ mkdir out/repro
$ cat > out/repro <<EOT
clang_use_chrome_plugins = false
is_clang = true
is_component_build = true
is_debug = false
symbol_level = 2
llvm_force_head_revision = false
clang_base_path = "/path/to/clang"
use_goma = false
EOT
$ gn gen out/repro
$ ninja -C out/repro net_unittests
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse

Expected behavior:
Every run passes.

Observed behavior:
Some runs pass, some crash.

Example crash:
$ out/repro/net_unittests --gtest_filter=CRLSetTest.Parse
IMPORTANT DEBUGGING NOTE: batches of tests are run inside their
own process. For debugging a test inside a debugger, use the
--gtest_filter=<your_test_name> flag along with
--single-process-tests.
Using sharding settings from environment. This is shard 0/1
Using 1 parallel jobs.
Note: Google Test filter = CRLSetTest.Parse
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CRLSetTest
[ RUN      ] CRLSetTest.Parse
Received signal 11 SEGV_MAPERR 000000000000
#0 0x7fe2b5e1f43c base::debug::StackTrace::StackTrace()
#1 0x7fe2b5e1efa1 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7fe2b5b16330 <unknown>
#3 0x7fe2b4484521 <unknown>
#4 0x7fe2b500beaf std::__1::basic_string<>::assign()
#5 0x7fe2b5865971 net::ReadCRL()
#6 0x7fe2b58655c8 net::CRLSetStorage::Parse()
#7 0x00000075669d net::CRLSetTest_Parse_Test::TestBody()
#8 0x000002045506 testing::Test::Run()
#9 0x000002045fe0 testing::TestInfo::Run()
#10 0x0000020464c7 testing::TestCase::Run()
#11 0x00000204cb87 testing::internal::UnitTestImpl::RunAllTests()
#12 0x00000204c7d7 testing::UnitTest::Run()
#13 0x0000020cbc42 base::TestSuite::Run()
#14 0x0000020ceb6e base::(anonymous namespace)::LaunchUnitTestsInternal()
#15 0x0000020cea2b base::LaunchUnitTests()
#16 0x000001c531b8 main
#17 0x7fe2b4350f45 __libc_start_main
#18 0x0000005da02a _start
  r8: 0000000000000006  r9: 00000000003822c2 r10: 0000000000000003 r11: 00007fe2b44ba510
 r12: 00007ffea793d490 r13: 0000000000000020 r14: 00002fb233f48000 r15: ffffd04e331edc8b
  di: 0000000000000020  si: 00000000003822e2  bp: 00007ffea793d290  bx: 0000000000000000
  dx: 00007fe2b448451c  ax: 0000000000000000  cx: 0000000000000000  sp: 00007ffea793d248
  ip: 00007fe2b4484521 efl: 0000000000010202 cgf: 0000000000000033 erf: 0000000000000006
 trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]
Calling _exit(1). Core file will not be generated.
[1/1] CRLSetTest.Parse (CRASHED)
1 test crashed:
    CRLSetTest.Parse (../../net/cert/crl_set_unittest.cc:196)
Tests took 0 seconds.

 
This has been going on for a long time, at least going back to 516471 from November 14.
Could you clarify why you believe this is flaky (that is, not tied to the Clang version)? Nothing in this test stands out as inherently likely suspect, so it seems like any flakiness may be representative of a real error (outside of the test). I'm having trouble reproducing the flake locally - could you indicate how you tested in comment #1?
Labels: clang
Actually, I suspect that Clang is the culprit. I'm bisecting that now.

Comment 4 by h...@chromium.org, Dec 23 2017

I might not have seen all the emails yet, but I thought the conclusion was already that this was caused by the Clang roll and that the culprit was the change to memcpyopt in r321371.
Reverting Clang r319483 and r319482 makes the problem go away.

I also tried leaving those in and reverting r321371 instead, but that still gives me crashes.

Comment 6 by r...@chromium.org, Dec 27 2017

Cc: r...@chromium.org
The suspected change is r321138, which relands r319483 and r319482. This is the IR diff after that change: https://reviews.llvm.org/P8054

i.e. we turn some memcpy's into memset(&str, 0, 24). I don't see how that would create new uninitialized data, though.

Comment 7 by r...@chromium.org, Dec 27 2017

Blocking: 797267

Comment 8 by r...@chromium.org, Dec 27 2017

Blocking: 787920

Comment 9 by r...@chromium.org, Dec 27 2017

You can determinstically observe the problem with valgrind. MSan should in theory be able to find the problem, but it doesn't in practice. I guess the MSan instrumentation inhibits memcpy optimization.

In order to use valgrind, I had to add use_allocator="none" to my gn args to disable tcmalloc and apply this patch:

diff --git a/base/process/memory_linux.cc b/base/process/memory_linux.cc
index a985b7a8ceb3..f06b986804ee 100644
--- a/base/process/memory_linux.cc
+++ b/base/process/memory_linux.cc
@@ -21,6 +21,8 @@
 #include "third_party/tcmalloc/chromium/src/gperftools/tcmalloc.h"
 #endif
 
+extern "C" void *__libc_malloc(size_t);
+
 namespace base {
 
 size_t g_oom_size = 0U;

We could probably also see the bug if we configured tcmalloc to fill allocated memory with random non-zero bytes.

Here's what valgrind says:
$ valgrind ./net_unittests --gtest_filter=CRLSetTest.Parse --single-process-tests
...
Note: Google Test filter = CRLSetTest.Parse
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CRLSetTest
[ RUN      ] CRLSetTest.Parse
==38904== Conditional jump or move depends on uninitialised value(s)
==38904==    at 0x5C25E3C: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libc++.so)
==38904==    by 0x5410180: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904==    by 0x540FDC8: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904==    by 0x888FBC: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/net_unittests)
...
==38904== 
==38904== Conditional jump or move depends on uninitialised value(s)
==38904==    at 0x5C25E83: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libc++.so)
==38904==    by 0x5410180: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904==    by 0x540FDC8: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/libnet.so)
==38904==    by 0x888FBC: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/net_unittests)
==38904==    by 0x217FAF5: ??? (in /usr/local/google/home/rnk/chromium/src/out/Release/net_unittests)
...
==38904== 
[       OK ] CRLSetTest.Parse (58 ms)
[----------] 1 test from CRLSetTest (66 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (130 ms total)
[  PASSED  ] 1 test.

Apparently valgrind doesn't like clang's debug info. =/

Comment 10 by r...@chromium.org, Dec 28 2017

I've extracted the buggy call to push_back into it's own function in a separate .cc file for easier analysis. It looks like:

#include "net/cert/crl_set_storage.h"  
namespace net {
bool ReadCRL(base::StringPiece* data,  
             std::string* out_parent_spki_hash,
             std::vector<std::string>* out_serials);                                       
// This function does 5 24 bytes memsets with four stack addresses. We need 2              
// heap addresses to zero out the 48 byte pair.
void CRLSetStorage::ParseLoop(base::StringPiece& data, CRLSet* crl_set) {                  
  // Speculatively push back a pair and pass it to ReadCRL() to avoid                      
  // unnecessary copies.
  crl_set->crls_.push_back(
      std::make_pair(std::string(), std::vector<std::string>()));
  ReadCRL(&data, &crl_set->crls_.back().first, &crl_set->crls_.back().second);             
}  
}  // namespace net

The code we generate is wrong, but it seems like it's only wrong when debug info is enabled.

Here's how things are supposed to work:

1. Default construct the reference temporary pair of string and vector<string>. This should use two 24 byte memsets on a stack address.

2. If there is space in the vector, which there is, move from the reference temporary. memcpyopt discovers that the sources are zeroed, and turns the memcpy into memset. Again, two 24 byte memsets.

3. Destroy the reference temporary. 

Here's the x86 code we get for this:

        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rbx
        subq    $48, %rsp
        movq    %rsi, %r12
        movq    %rdi, %r14
        leaq    -80(%rbp), %rdi
        xorl    %esi, %esi
        movl    $24, %edx
        callq   memset@PLT  # Default construct
        leaq    -56(%rbp), %r15
        xorl    %esi, %esi
        movl    $24, %edx
        movq    %r15, %rdi
        callq   memset@PLT  # Default construct
        movq    16(%r12), %rbx
        cmpq    24(%r12), %rbx
        jae     .LBB0_2
# %bb.1:          
        leaq    -80(%rbp), %rdi
        xorl    %esi, %esi
        movl    $24, %edx
        callq   memset@PLT  # Re-zero the vector?
        leaq    24(%rbx), %rdi
        xorl    %esi, %esi
        movl    $24, %edx
        callq   memset@PLT  # Zero the vector on the heap
        movq    -56(%rbp), %rax  # Sequence of 8-byte load/store pairs copying from stack to RBX+24
        movq    %rax, 24(%rbx)
        movq    -48(%rbp), %rax
        movq    %rax, 32(%rbx)
        movq    -40(%rbp), %rax
        movq    %rax, 40(%rbx)
        xorl    %esi, %esi
        movl    $24, %edx
        movq    %r15, %rdi
        callq   memset@PLT   # Re-zero moved-from stack memory
        addq    $48, 16(%r12)
        testb   $1, -80(%rbp)
        jne     .LBB0_11
        jmp     .LBB0_12


------

I diffed the assembly with and without debug info, and I got something pretty suspicious:

        .globl  _ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE # -- Begin function _ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE
        .p2align        4, 0x90
        .type   _ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE,@function
 _ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE: # @_ZN3net13CRLSetStorage9ParseLoopERN4base16BasicStringPieceINSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEEEEPNS_6CRLSetE
+.Lfunc_begin0:
+       .file   56 "../../net/cert" "crl_set_storage2.cc"
 # %bb.0:                                # %entry
        pushq   %rbp
        movq    %rsp, %rbp
@@ -28,10 +85,6 @@
        cmpq    24(%r12), %rbx
        jae     .LBB0_2
 # %bb.1:                                # %_ZNSt3__16vectorINS_4pairINS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEENS0_IS7_NS5_IS7_EEEEEENS5_ISA_EEE9push_backEOSA_.exit.thread
-       xorl    %esi, %esi
-       movl    $24, %edx
-       movq    %rbx, %rdi
-       callq   memset@PLT
        leaq    -80(%rbp), %rdi
        xorl    %esi, %esi
        movl    $24, %edx
@@ -74,6 +127,7 @@
 # %bb.6:                                # %if.then.i.i.i.i.i.i.i.i
                                         #   in Loop: Header=BB0_5 Depth=1

There's our missing memcpy that's supposed to initialize the heap string through RBX... Maybe now I can make a reduction.

Comment 11 by r...@chromium.org, Dec 28 2017

Reduction:

$ cat t.cpp                                  
#include <assert.h>
#include <stdio.h>
#include <string>
#include <utility>
#include <vector>

void do_push_back(
    std::vector<std::pair<std::string, std::vector<std::string>>>* crls) {
  crls->push_back(std::make_pair(std::string(), std::vector<std::string>()));
}

int __attribute__((optnone)) main() {
  // Put some data in the vector and then remove it so we take the push_back
  // fast path.
  std::vector<std::pair<std::string, std::vector<std::string>>> crl_set;
  crl_set.push_back({"asdf", {}});
  crl_set.pop_back();
  printf("first word in vector storage: %p\n", *(void**)crl_set.data());

  // Do the push_back which may fail to initialize the data.
  do_push_back(&crl_set);
  auto* first = &crl_set.back().first;
  printf("first word in vector storage (should be zero): %p\n",
         *(void**)crl_set.data());
  assert(first->empty());
  puts("ok");
}

$ clang++ -stdlib=libc++ -g -O2 t.cpp -o t.exe -Wl,-rpath=./lib && ./t.exe
first word in vector storage: 0x6664736108
first word in vector storage (should be zero): 0x6664736108
t.exe: t.cpp:25: int main(): Assertion `first->empty()' failed.
Aborted

Removing either -g or -O2 makes the problem go away.

However, it's not clear that memcpyopt is at fault here. This may be a pre-existing bug.

Comment 12 by a...@chromium.org, Dec 28 2017

Cc: a...@chromium.org

Comment 13 by r...@chromium.org, Dec 28 2017

Cc: -r...@chromium.org
Owner: r...@chromium.org
Status: Assigned (was: Unconfirmed)
I tried debugging memcpyopt more to try to fix forward, but I gave up and reverted in r321510. We'll try a revision past that for the clang roll. Before landing, we should manually run the CRLSetTest suite a few times with the new compiler.
Project Member

Comment 14 by bugdroid1@chromium.org, Jan 3 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0d72bb1494c94523ae7bd944476a495e8b281bcc

commit 0d72bb1494c94523ae7bd944476a495e8b281bcc
Author: Reid Kleckner <rnk@google.com>
Date: Wed Jan 03 18:32:36 2018

Roll clang 318667:321529.

Created by patching in
https://chromium-review.googlesource.com/c/chromium/src/+/835358 and
manually editting the revision in update.py.

TBR-ing rsesek@ again for the sandbox changes.
TBR-ing bretw@ again for the ppapi adapter LLD warning suppression.

I will manually build and test net_unittests after uploading to goma,
but I've built this clang revision locally and they are not flaky for
me.

R=inglorion@chromium.org,hans@chromium.org
TBR=rsesek@chromium.org,brettw@chromium.org

Bug:  787920 , 797267,  797168 
Cq-Include-Trybots: master.tryserver.chromium.android:android_arm64_dbg_recipe;master.tryserver.chromium.android:android_compile_mips_dbg;master.tryserver.chromium.android:android_compile_x64_dbg;master.tryserver.chromium.android:android_compile_x86_dbg
Change-Id: Ifd25c014e79ad7a5d4533941ae7fd74a86860191
Reviewed-on: https://chromium-review.googlesource.com/845700
Reviewed-by: Nico Weber <thakis@chromium.org>
Reviewed-by: Hans Wennborg <hans@chromium.org>
Reviewed-by: Robert Sesek <rsesek@chromium.org>
Reviewed-by: Reid Kleckner <rnk@chromium.org>
Commit-Queue: Reid Kleckner <rnk@chromium.org>
Cr-Commit-Position: refs/heads/master@{#526748}
[modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/build/config/BUILDCONFIG.gn
[modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/build/config/compiler/BUILD.gn
[modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/media/cdm/ppapi/ppapi_cdm_adapter.gni
[modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/sandbox/linux/tests/unit_tests.cc
[modify] https://crrev.com/0d72bb1494c94523ae7bd944476a495e8b281bcc/tools/clang/scripts/update.py

Comment 15 by r...@chromium.org, Jan 5 2018

Status: Fixed (was: Assigned)
Marking fixed. The buggy optimization was reverted and the test passes reliably with the new compiler.

Sign in to add a comment