Deadlock in leveldb_service_unittests |
||||||||||||||
Issue description"mojo_apptests (with patch)" is flaky. This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label. We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw. Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs This flaky test/step was previously tracked in issue 585650 .
,
Mar 15 2016
Assigning to ben@ since he works on mojo and hopefully as some idea of what is causing this error or who can fix it (since I don't have any idea at all)
,
Mar 15 2016
,
Mar 15 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e3045853929c701a45b7ec8e203f5c1db1d73db6 commit e3045853929c701a45b7ec8e203f5c1db1d73db6 Author: csharp <csharp@chromium.org> Date: Tue Mar 15 16:03:55 2016 Mark MediaAppTest.InitializeRenderer_Success as Flaky on Linux TBR=ben@chromium.org BUG= 594977 Review URL: https://codereview.chromium.org/1798423002 Cr-Commit-Position: refs/heads/master@{#381228} [modify] https://crrev.com/e3045853929c701a45b7ec8e203f5c1db1d73db6/media/mojo/services/media_apptest.cc
,
Mar 15 2016
,
Mar 15 2016
What exactly is the failure? Is it the because of the "uninteresting mock function call"?
Uninteresting mock function call - returning directly.
Function call: ConnectionClosed()
We'll update MediaAppTest to a ShellTest soon ( issue 594408 ). So I don't plan to fix it right now but would see whether it's still flaky after the update.
,
Mar 17 2016
Detected 3 new flakes for test/step "mojo_apptests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
,
Mar 18 2016
Detected 5 new flakes for test/step "mojo_apptests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
,
Mar 21 2016
There are also timeout bugs; I've observed them in LevelDBApptest.GetSnapshotSimple and LevelDBApptest.Basic. I'm disabling both for now. I don't understand the output. It appears something times out even before the test runs, and then the test runs twice and succeeds both times (?!?). Example output: DEBUG:Command: /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=LevelDBApptest.GetSnapshotSimple mojo:leveldb_apptests DEBUG:Running Xvfb :42 (pid 24991) and Openbox (pid 25003). DEBUG:Terminating the test for timeout. [ FAILED ] Command (in 40089 ms): /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=LevelDBApptest.GetSnapshotSimple mojo:leveldb_apptests ------------------------------------------------------------------------ Note: Google Test filter = LevelDBApptest.GetSnapshotSimple [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from LevelDBApptest [ RUN ] LevelDBApptest.GetSnapshotSimple [ OK ] LevelDBApptest.GetSnapshotSimple (24 ms) [----------] 1 test from LevelDBApptest (24 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (24 ms total) [ PASSED ] 1 test. ERROR: Test timeout after 10 seconds. ERROR: Test exited with code: -15. ------------------------------------------------------------------------ Retrying failed tests (2 attempts remaining) DEBUG:Command: /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=LevelDBApptest.GetSnapshotSimple mojo:leveldb_apptests DEBUG:Running Xvfb :43 (pid 25045) and Openbox (pid 25057). DEBUG:Passed (in 107 ms) with output: Note: Google Test filter = LevelDBApptest.GetSnapshotSimple [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from LevelDBApptest [ RUN ] LevelDBApptest.GetSnapshotSimple [ OK ] LevelDBApptest.GetSnapshotSimple (26 ms) [----------] 1 test from LevelDBApptest (26 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (26 ms total) [ PASSED ] 1 test.
,
Mar 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/21138fcaeedd96af402c9715cfecf9a0a9eb9528 commit 21138fcaeedd96af402c9715cfecf9a0a9eb9528 Author: phoglund <phoglund@chromium.org> Date: Mon Mar 21 14:09:14 2016 Disabling flaky LevelDBApptest.(GetSnapshotSimple|Basic) TBR=xhwang@chromium.org BUG= 594977 Review URL: https://codereview.chromium.org/1823633002 Cr-Commit-Position: refs/heads/master@{#382276} [modify] https://crrev.com/21138fcaeedd96af402c9715cfecf9a0a9eb9528/components/leveldb/leveldb_apptest.cc
,
Mar 21 2016
Given the fact that we'll update MediaAppTest to a ShellTest soon ( issue 594408 ), lower the priority to P2.
,
Mar 21 2016
Detected 3 new flakes for test/step "mojo_apptests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
,
Mar 21 2016
Given the fact that this bug is tracking all flakiness in apptests (e.g. in #12 LevelDBApptest.WriteBatch is flaky), removing myself from the owner. When we move all test from apptests to shelltests we should be able to close this issue.
,
Mar 22 2016
This doesn't appear to have been fixed by switching from apptest to shelltests. You have to run it 20ish times to get a deadlock, but it still happens.
So I started digging:
The shelltest framework itself seems to be deadlocking with the leveldb service. Here's the stack of the main thread:
#0 0x00007ffff5bae66b in pthread_join (threadid=140737255659264, thread_return=0x0)
at pthread_join.c:92
#1 0x00007ffff7bb7e5c in base::PlatformThread::Join (thread_handle=...)
at ../../base/threading/platform_thread_posix.cc:207
#2 0x00007ffff7bc7f81 in base::SimpleThread::Join (this=0x3875703710b0)
at ../../base/threading/simple_thread.cc:47
#3 0x000000000057099b in mojo::shell::BackgroundShell::MojoThread::Stop (
this=0x3875703710b0) at ../../mojo/shell/background/background_shell.cc:102
#4 0x000000000056f391 in mojo::shell::BackgroundShell::~BackgroundShell (
this=0x3875703dc660) at ../../mojo/shell/background/background_shell.cc:177
#5 0x000000000046c5eb in std::default_delete<mojo::shell::BackgroundShell>::operator() (
this=0x38757035b930, __ptr=0x3875703dc660)
at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:63
#6 0x000000000046b94c in std::unique_ptr<mojo::shell::BackgroundShell, std::default_delete<mojo::shell::BackgroundShell> >::reset (this=0x38757035b930, __p=0x3875703dc660)
at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:245
#7 0x000000000046b47b in mojo::test::ShellTest::TearDown (this=0x38757035b910)
at ../../mojo/shell/public/cpp/lib/shell_test.cc:74
#8 0x0000000000425761 in leveldb::(anonymous namespace)::LevelDBApptest::TearDown (
this=0x38757035b910) at ../../components/leveldb/leveldb_service_unittest.cc:38
Here's the stack of the background thread:
#0 0x00007ffff5bb4ee9 in __libc_waitpid (pid=30219, stat_loc=0x7ffff22145b0, options=0)
at ../sysdeps/unix/sysv/linux/waitpid.c:40
#1 0x00007ffff7b4172e in (anonymous namespace)::WaitpidWithTimeout (handle=30219,
status=0x7ffff22145b0, wait=...) at ../../base/process/process_posix.cc:55
#2 0x00007ffff7b414cb in (anonymous namespace)::WaitForExitWithTimeoutImpl (
handle=30219, exit_code=0x7ffff22147fc, timeout=...)
at ../../base/process/process_posix.cc:196
#3 0x00007ffff7b41394 in base::Process::WaitForExitWithTimeout (this=0x38757027d4c0,
timeout=..., exit_code=0x7ffff22147fc) at ../../base/process/process_posix.cc:357
#4 0x00007ffff7b41352 in base::Process::WaitForExit (this=0x38757027d4c0,
exit_code=0x7ffff22147fc) at ../../base/process/process_posix.cc:353
#5 0x000000000061f7b4 in mojo::shell::ChildProcessHost::Join (this=0x38757027d480)
at ../../mojo/shell/runner/host/child_process_host.cc:117
#6 0x0000000000622dee in mojo::shell::OutOfProcessNativeRunner::~OutOfProcessNativeRunner
(this=0x3875703705d0)
at ../../mojo/shell/runner/host/out_of_process_native_runner.cc:30
#7 0x0000000000622e49 in mojo::shell::OutOfProcessNativeRunner::~OutOfProcessNativeRunner
(this=0x3875703705d0)
at ../../mojo/shell/runner/host/out_of_process_native_runner.cc:28
#8 0x000000000043115f in std::default_delete<mojo::Connection>::operator() (
this=0x387570301510, __ptr=0x3875703705d0)
at ../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:63
#9 0x00000000005eb03c in std::unique_ptr<mojo::shell::NativeRunner, std::default_delete<mojo::shell::NativeRunner> >::reset (this=0x387570301510, __p=0x3875703705d0)
at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:245
#10 0x00000000005e68b8 in mojo::shell::Shell::~Shell (this=0x3875702b3fe0)
at ../../mojo/shell/shell.cc:465
#11 0x00000000005e6b19 in mojo::shell::Shell::~Shell (this=0x3875702b3fe0)
at ../../mojo/shell/shell.cc:461
#12 0x000000000043115f in std::default_delete<mojo::Connection>::operator() (
this=0x3875703db6a0, __ptr=0x3875702b3fe0)
at ../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:63
#13 0x000000000060cb7c in std::unique_ptr<mojo::shell::Shell, std::default_delete<mojo::shell::Shell> >::reset (this=0x3875703db6a0, __p=0x3875702b3fe0)
at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:245
#14 0x000000000060bbe0 in mojo::shell::Context::Shutdown (this=0x3875703db5a0)
at ../../mojo/shell/standalone/context.cc:211
#15 0x00000000005716ad in mojo::shell::BackgroundShell::MojoThread::Run (
this=0x3875703710b0) at ../../mojo/shell/background/background_shell.cc:142
#16 0x00007ffff7bc8041 in base::SimpleThread::ThreadMain (this=0x3875703710b0)
at ../../base/threading/simple_thread.cc:66
#17 0x00007ffff7bb83aa in base::(anonymous namespace)::ThreadFunc (params=0x38757029de30)
The shelltest is currently loading a mojo:leveldb process and is waiting for it to exit.
There appears to have been a LevelDBApp::ShellConnectionLost() method added to the app that performs a hard _exit(1). This is not being called in the runs where we deadlock.
,
Mar 22 2016
+sky
,
Mar 23 2016
I poked at this and it looks like the deadlock is in leveldb_file_thread. I spoke with Elliot about it and he is going to investigate. I'm updating summary to reflect what is happening.
,
Mar 23 2016
Current theory after seeing sky's stack traces: In leveldb shutdown, it's trying to unlock a file. So one of the leveldb threads talks to LevelDBFileThread::Unlock() to unlock the file. LevelDBFileThread::Unlock() exists to make a thread safe interface to the otherwise thread hostile mojo. It then blocks that thread using waitevents. What happens if the filesystem::Directory closed behind our back? This is shutdown after all...
,
Mar 24 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/16167c22ec91beec0b7cbe26d15049bbda589db3 commit 16167c22ec91beec0b7cbe26d15049bbda589db3 Author: erg <erg@chromium.org> Date: Thu Mar 24 19:04:17 2016 leveldb_service: Attempt to fix hang on shutdown. On mojo connection errors, we should call WaitableEvent::Signal() on any events that are blocking the various leveldb threads. BUG= 594977 Review URL: https://codereview.chromium.org/1825413003 Cr-Commit-Position: refs/heads/master@{#383112} [modify] https://crrev.com/16167c22ec91beec0b7cbe26d15049bbda589db3/components/leveldb/leveldb_database_impl.cc [modify] https://crrev.com/16167c22ec91beec0b7cbe26d15049bbda589db3/components/leveldb/leveldb_file_thread.cc [modify] https://crrev.com/16167c22ec91beec0b7cbe26d15049bbda589db3/components/leveldb/leveldb_file_thread.h [modify] https://crrev.com/16167c22ec91beec0b7cbe26d15049bbda589db3/components/leveldb/leveldb_service_unittest.cc
,
Mar 24 2016
Is the underlying problem the same as what's causing LevelDBApptest.GetFromSnapshots to timeout on linux_chromium_rel_ng? From https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/201150/steps/leveldb_service_unittests%20%28with%20patch%29%20on%20Ubuntu-12.04/logs/stdio : [1/6] LevelDBApptest.WriteBatch (64 ms) [2/6] LevelDBApptest.Reconnect (36 ms) [3/6] LevelDBApptest.GetSnapshotSimple (32 ms) [ RUN ] LevelDBApptest.GetFromSnapshots [4/6] LevelDBApptest.GetFromSnapshots (TIMED OUT) [5/6] LevelDBApptest.MemoryDBReadWrite (40 ms) [6/6] LevelDBApptest.InvalidArgumentOnInvalidSnapshot (48 ms) Retrying 1 test (retry #1) Still waiting for the following processes to finish: ./leveldb_service_unittests --brave-new-test-launcher --gtest_flagfile=/tmp/.org.chromium.Chromium.FSIOrY/.org.chromium.Chromium.tSs9gg --single-process-tests --test-launcher-bot-mode --test-launcher-output=/tmp/.org.chromium.Chromium.0lvLj3/test_results.xml --test-launcher-summary-output=/tmp/outI7TpfN/output.json Still waiting for the following processes to finish: ./leveldb_service_unittests --brave-new-test-launcher --gtest_flagfile=/tmp/.org.chromium.Chromium.FSIOrY/.org.chromium.Chromium.tSs9gg --single-process-tests --test-launcher-bot-mode --test-launcher-output=/tmp/.org.chromium.Chromium.0lvLj3/test_results.xml --test-launcher-summary-output=/tmp/outI7TpfN/output.json [ RUN ] LevelDBApptest.GetFromSnapshots [7/7] LevelDBApptest.GetFromSnapshots (TIMED OUT) ... [8/8] LevelDBApptest.GetFromSnapshots (TIMED OUT) ... [9/9] LevelDBApptest.GetFromSnapshots (TIMED OUT) 1 test timed out: LevelDBApptest.GetFromSnapshots (../../components/leveldb/leveldb_service_unittest.cc:223)
,
Mar 24 2016
@19: theoretically. We'll see if #18 helps.
,
Mar 24 2016
,
Mar 29 2016
I searched using the toplevel search on https://chromium-try-flakes.appspot.com/ for leveldb_service_unittests (the new binary name) and found nothing. Closing this. |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by csharp@chromium.org
, Mar 15 2016LevelDBApptest.InvalidArgumentOnInvalidSnapshot only seems to have failed once so I'm not sure if the failure is it or something else. MediaAppTest.InitializeRenderer_Success has multiple failures so I'll disable it on Linux. Output: [ RUN ] MediaAppTest.InitializeRenderer_Success GMOCK WARNING: Uninteresting mock function call - returning directly. Function call: ConnectionClosed() NOTE: You can safely ignore the above warning unless this call should not happen. Do not suppress it by blindly adding an EXPECT_CALL() if you don't mean to enforce the call. See http://code.google.com/p/googlemock/wiki/CookBook#Knowing_When_to_Expect for details. [ERROR:child_process_base.cc(211)] Connection error to the shell. ------------------------------------------------------------------------ Retrying failed tests (1 attempts remaining) DEBUG:Command: /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=MediaAppTest.InitializeRenderer_Success mojo:media_apptests DEBUG:Running Xvfb :36 (pid 22903) and Openbox (pid 22915). [ FAILED ] Command (in 104 ms): /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=MediaAppTest.InitializeRenderer_Success mojo:media_apptests ------------------------------------------------------------------------ Note: Google Test filter = MediaAppTest.InitializeRenderer_Success