New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 594977 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Mar 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

Deadlock in leveldb_service_unittests

Project Member Reported by chromium...@appspot.gserviceaccount.com, Mar 15 2016

Issue description

"mojo_apptests (with patch)" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw.

Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs

This flaky test/step was previously tracked in  issue 585650 .
 

Comment 1 by csharp@chromium.org, Mar 15 2016

LevelDBApptest.InvalidArgumentOnInvalidSnapshot only seems to have failed once so I'm not sure if the failure is it or something else.

MediaAppTest.InitializeRenderer_Success has multiple failures so I'll disable it on Linux.

Output:

[ RUN      ] MediaAppTest.InitializeRenderer_Success

GMOCK WARNING:
Uninteresting mock function call - returning directly.
    Function call: ConnectionClosed()
NOTE: You can safely ignore the above warning unless this call should not happen.  Do not suppress it by blindly adding an EXPECT_CALL() if you don't mean to enforce the call.  See http://code.google.com/p/googlemock/wiki/CookBook#Knowing_When_to_Expect for details.
[ERROR:child_process_base.cc(211)] Connection error to the shell.

------------------------------------------------------------------------
Retrying failed tests (1 attempts remaining)
DEBUG:Command: /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=MediaAppTest.InitializeRenderer_Success mojo:media_apptests
DEBUG:Running Xvfb :36 (pid 22903) and Openbox (pid 22915).

[  FAILED  ] Command (in 104 ms): /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=MediaAppTest.InitializeRenderer_Success mojo:media_apptests
------------------------------------------------------------------------
Note: Google Test filter = MediaAppTest.InitializeRenderer_Success

Comment 2 by csharp@chromium.org, Mar 15 2016

Owner: ben@chromium.org
Assigning to ben@ since he works on mojo and hopefully as some idea of what is causing this error or who can fix it (since I don't have any idea at all)

Comment 3 by csharp@chromium.org, Mar 15 2016

Labels: -Sheriff-Chromium
Project Member

Comment 4 by bugdroid1@chromium.org, Mar 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e3045853929c701a45b7ec8e203f5c1db1d73db6

commit e3045853929c701a45b7ec8e203f5c1db1d73db6
Author: csharp <csharp@chromium.org>
Date: Tue Mar 15 16:03:55 2016

Mark MediaAppTest.InitializeRenderer_Success as Flaky on Linux

TBR=ben@chromium.org
BUG= 594977 

Review URL: https://codereview.chromium.org/1798423002

Cr-Commit-Position: refs/heads/master@{#381228}

[modify] https://crrev.com/e3045853929c701a45b7ec8e203f5c1db1d73db6/media/mojo/services/media_apptest.cc

Cc: xhw...@chromium.org

Comment 6 by xhw...@chromium.org, Mar 15 2016

Cc: -xhw...@chromium.org ben@chromium.org
Owner: xhw...@chromium.org
Status: Assigned (was: Untriaged)
What exactly is the failure? Is it the because of the "uninteresting mock function call"?

Uninteresting mock function call - returning directly.
    Function call: ConnectionClosed()

We'll update MediaAppTest to a ShellTest soon ( issue 594408 ). So I don't plan to fix it right now but would see whether it's still flaky after the update. 
Project Member

Comment 7 by chromium...@appspot.gserviceaccount.com, Mar 17 2016

Labels: Sheriff-Chromium
Detected 3 new flakes for test/step "mojo_apptests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
Project Member

Comment 8 by chromium...@appspot.gserviceaccount.com, Mar 18 2016

Detected 5 new flakes for test/step "mojo_apptests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
Labels: -Sheriff-Chromium
There are also timeout bugs; I've observed them in LevelDBApptest.GetSnapshotSimple and LevelDBApptest.Basic. I'm disabling both for now.

I don't understand the output. It appears something times out even before the test runs, and then the test runs twice and succeeds both times (?!?). Example output:
DEBUG:Command: /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=LevelDBApptest.GetSnapshotSimple mojo:leveldb_apptests
DEBUG:Running Xvfb :42 (pid 24991) and Openbox (pid 25003).
DEBUG:Terminating the test for timeout.

[  FAILED  ] Command (in 40089 ms): /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=LevelDBApptest.GetSnapshotSimple mojo:leveldb_apptests
------------------------------------------------------------------------
Note: Google Test filter = LevelDBApptest.GetSnapshotSimple
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LevelDBApptest
[ RUN      ] LevelDBApptest.GetSnapshotSimple
[       OK ] LevelDBApptest.GetSnapshotSimple (24 ms)
[----------] 1 test from LevelDBApptest (24 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (24 ms total)
[  PASSED  ] 1 test.
ERROR: Test timeout after 10 seconds.
ERROR: Test exited with code: -15.
------------------------------------------------------------------------
Retrying failed tests (2 attempts remaining)
DEBUG:Command: /b/build/slave/linux/build/src/out/Release/mojo_runner --use-temporary-user-data-dir --gtest_filter=LevelDBApptest.GetSnapshotSimple mojo:leveldb_apptests
DEBUG:Running Xvfb :43 (pid 25045) and Openbox (pid 25057).
DEBUG:Passed (in 107 ms) with output:
Note: Google Test filter = LevelDBApptest.GetSnapshotSimple
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LevelDBApptest
[ RUN      ] LevelDBApptest.GetSnapshotSimple
[       OK ] LevelDBApptest.GetSnapshotSimple (26 ms)
[----------] 1 test from LevelDBApptest (26 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (26 ms total)
[  PASSED  ] 1 test.
Project Member

Comment 10 by bugdroid1@chromium.org, Mar 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/21138fcaeedd96af402c9715cfecf9a0a9eb9528

commit 21138fcaeedd96af402c9715cfecf9a0a9eb9528
Author: phoglund <phoglund@chromium.org>
Date: Mon Mar 21 14:09:14 2016

Disabling flaky LevelDBApptest.(GetSnapshotSimple|Basic)

TBR=xhwang@chromium.org
BUG= 594977 

Review URL: https://codereview.chromium.org/1823633002

Cr-Commit-Position: refs/heads/master@{#382276}

[modify] https://crrev.com/21138fcaeedd96af402c9715cfecf9a0a9eb9528/components/leveldb/leveldb_apptest.cc

Labels: -Pri-1 Pri-2
Given the fact that we'll update MediaAppTest to a ShellTest soon ( issue 594408 ), lower the priority to P2.
Project Member

Comment 12 by chromium...@appspot.gserviceaccount.com, Mar 21 2016

Labels: Sheriff-Chromium
Detected 3 new flakes for test/step "mojo_apptests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyJQsSBUZsYWtlIhptb2pvX2FwcHRlc3RzICh3aXRoIHBhdGNoKQw. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
Cc: xhw...@chromium.org
Owner: ----
Status: Available (was: Assigned)
Given the fact that this bug is tracking all flakiness in apptests (e.g. in #12 LevelDBApptest.WriteBatch is flaky), removing myself from the owner. When we move all test from apptests to shelltests we should be able to close this issue.

Comment 14 by e...@chromium.org, Mar 22 2016

This doesn't appear to have been fixed by switching from apptest to shelltests. You have to run it 20ish times to get a deadlock, but it still happens.

So I started digging:

The shelltest framework itself seems to be deadlocking with the leveldb service. Here's the stack of the main thread:

#0  0x00007ffff5bae66b in pthread_join (threadid=140737255659264, thread_return=0x0)
    at pthread_join.c:92
#1  0x00007ffff7bb7e5c in base::PlatformThread::Join (thread_handle=...)
    at ../../base/threading/platform_thread_posix.cc:207
#2  0x00007ffff7bc7f81 in base::SimpleThread::Join (this=0x3875703710b0)
    at ../../base/threading/simple_thread.cc:47
#3  0x000000000057099b in mojo::shell::BackgroundShell::MojoThread::Stop (
    this=0x3875703710b0) at ../../mojo/shell/background/background_shell.cc:102
#4  0x000000000056f391 in mojo::shell::BackgroundShell::~BackgroundShell (
    this=0x3875703dc660) at ../../mojo/shell/background/background_shell.cc:177
#5  0x000000000046c5eb in std::default_delete<mojo::shell::BackgroundShell>::operator() (
    this=0x38757035b930, __ptr=0x3875703dc660)
    at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:63
#6  0x000000000046b94c in std::unique_ptr<mojo::shell::BackgroundShell, std::default_delete<mojo::shell::BackgroundShell> >::reset (this=0x38757035b930, __p=0x3875703dc660)
    at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:245
#7  0x000000000046b47b in mojo::test::ShellTest::TearDown (this=0x38757035b910)
    at ../../mojo/shell/public/cpp/lib/shell_test.cc:74
#8  0x0000000000425761 in leveldb::(anonymous namespace)::LevelDBApptest::TearDown (
    this=0x38757035b910) at ../../components/leveldb/leveldb_service_unittest.cc:38

Here's the stack of the background thread:

#0  0x00007ffff5bb4ee9 in __libc_waitpid (pid=30219, stat_loc=0x7ffff22145b0, options=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:40
#1  0x00007ffff7b4172e in (anonymous namespace)::WaitpidWithTimeout (handle=30219, 
    status=0x7ffff22145b0, wait=...) at ../../base/process/process_posix.cc:55
#2  0x00007ffff7b414cb in (anonymous namespace)::WaitForExitWithTimeoutImpl (
    handle=30219, exit_code=0x7ffff22147fc, timeout=...)
    at ../../base/process/process_posix.cc:196
#3  0x00007ffff7b41394 in base::Process::WaitForExitWithTimeout (this=0x38757027d4c0, 
    timeout=..., exit_code=0x7ffff22147fc) at ../../base/process/process_posix.cc:357
#4  0x00007ffff7b41352 in base::Process::WaitForExit (this=0x38757027d4c0, 
    exit_code=0x7ffff22147fc) at ../../base/process/process_posix.cc:353
#5  0x000000000061f7b4 in mojo::shell::ChildProcessHost::Join (this=0x38757027d480)
    at ../../mojo/shell/runner/host/child_process_host.cc:117
#6  0x0000000000622dee in mojo::shell::OutOfProcessNativeRunner::~OutOfProcessNativeRunner
    (this=0x3875703705d0)
    at ../../mojo/shell/runner/host/out_of_process_native_runner.cc:30
#7  0x0000000000622e49 in mojo::shell::OutOfProcessNativeRunner::~OutOfProcessNativeRunner
    (this=0x3875703705d0)
    at ../../mojo/shell/runner/host/out_of_process_native_runner.cc:28
#8  0x000000000043115f in std::default_delete<mojo::Connection>::operator() (
    this=0x387570301510, __ptr=0x3875703705d0)
    at ../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:63
#9  0x00000000005eb03c in std::unique_ptr<mojo::shell::NativeRunner, std::default_delete<mojo::shell::NativeRunner> >::reset (this=0x387570301510, __p=0x3875703705d0)
    at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:245
#10 0x00000000005e68b8 in mojo::shell::Shell::~Shell (this=0x3875702b3fe0)
    at ../../mojo/shell/shell.cc:465
#11 0x00000000005e6b19 in mojo::shell::Shell::~Shell (this=0x3875702b3fe0)
    at ../../mojo/shell/shell.cc:461
#12 0x000000000043115f in std::default_delete<mojo::Connection>::operator() (
    this=0x3875703db6a0, __ptr=0x3875702b3fe0)
    at ../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:63
#13 0x000000000060cb7c in std::unique_ptr<mojo::shell::Shell, std::default_delete<mojo::shell::Shell> >::reset (this=0x3875703db6a0, __p=0x3875702b3fe0)
    at ./out/Debug/../../build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/bits/unique_ptr.h:245
#14 0x000000000060bbe0 in mojo::shell::Context::Shutdown (this=0x3875703db5a0)
    at ../../mojo/shell/standalone/context.cc:211
#15 0x00000000005716ad in mojo::shell::BackgroundShell::MojoThread::Run (
    this=0x3875703710b0) at ../../mojo/shell/background/background_shell.cc:142
#16 0x00007ffff7bc8041 in base::SimpleThread::ThreadMain (this=0x3875703710b0)
    at ../../base/threading/simple_thread.cc:66
#17 0x00007ffff7bb83aa in base::(anonymous namespace)::ThreadFunc (params=0x38757029de30)

The shelltest is currently loading a mojo:leveldb process and is waiting for it to exit.

There appears to have been a LevelDBApp::ShellConnectionLost() method added to the app that performs a hard _exit(1). This is not being called in the runs where we deadlock.

Comment 15 by ben@chromium.org, Mar 22 2016

Cc: sky@chromium.org
+sky

Comment 16 by sky@chromium.org, Mar 23 2016

Owner: e...@chromium.org
Status: Started (was: Available)
Summary: Deadlock in leveldb_service_unittests (was: "mojo_apptests (with patch)" is flaky)
I poked at this and it looks like the deadlock is in leveldb_file_thread. I spoke with Elliot about it and he is going to investigate. I'm updating summary to reflect what is happening.

Comment 17 by e...@chromium.org, Mar 23 2016

Current theory after seeing sky's stack traces:

In leveldb shutdown, it's trying to unlock a file. So one of the leveldb threads talks to LevelDBFileThread::Unlock() to unlock the file. LevelDBFileThread::Unlock() exists to make a thread safe interface to the otherwise thread hostile mojo. It then blocks that thread using waitevents.

What happens if the filesystem::Directory closed behind our back? This is shutdown after all...
Project Member

Comment 18 by bugdroid1@chromium.org, Mar 24 2016

Is the underlying problem the same as what's causing LevelDBApptest.GetFromSnapshots to timeout on linux_chromium_rel_ng?

From https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/201150/steps/leveldb_service_unittests%20%28with%20patch%29%20on%20Ubuntu-12.04/logs/stdio :

[1/6] LevelDBApptest.WriteBatch (64 ms)
[2/6] LevelDBApptest.Reconnect (36 ms)
[3/6] LevelDBApptest.GetSnapshotSimple (32 ms)
[ RUN      ] LevelDBApptest.GetFromSnapshots
[4/6] LevelDBApptest.GetFromSnapshots (TIMED OUT)
[5/6] LevelDBApptest.MemoryDBReadWrite (40 ms)
[6/6] LevelDBApptest.InvalidArgumentOnInvalidSnapshot (48 ms)
Retrying 1 test (retry #1)
Still waiting for the following processes to finish:
	./leveldb_service_unittests --brave-new-test-launcher --gtest_flagfile=/tmp/.org.chromium.Chromium.FSIOrY/.org.chromium.Chromium.tSs9gg --single-process-tests --test-launcher-bot-mode --test-launcher-output=/tmp/.org.chromium.Chromium.0lvLj3/test_results.xml --test-launcher-summary-output=/tmp/outI7TpfN/output.json
Still waiting for the following processes to finish:
	./leveldb_service_unittests --brave-new-test-launcher --gtest_flagfile=/tmp/.org.chromium.Chromium.FSIOrY/.org.chromium.Chromium.tSs9gg --single-process-tests --test-launcher-bot-mode --test-launcher-output=/tmp/.org.chromium.Chromium.0lvLj3/test_results.xml --test-launcher-summary-output=/tmp/outI7TpfN/output.json
[ RUN      ] LevelDBApptest.GetFromSnapshots
[7/7] LevelDBApptest.GetFromSnapshots (TIMED OUT)
...
[8/8] LevelDBApptest.GetFromSnapshots (TIMED OUT)
...
[9/9] LevelDBApptest.GetFromSnapshots (TIMED OUT)
1 test timed out:
    LevelDBApptest.GetFromSnapshots (../../components/leveldb/leveldb_service_unittest.cc:223)

Comment 20 by e...@chromium.org, Mar 24 2016

@19: theoretically. We'll see if #18 helps.
Labels: -Sheriff-Chromium

Comment 22 by e...@chromium.org, Mar 29 2016

Status: Fixed (was: Started)
I searched using the toplevel search on https://chromium-try-flakes.appspot.com/ for leveldb_service_unittests (the new binary name) and found nothing. Closing this.

Sign in to add a comment