ConditionVariableTest.MultiThreadConsumerTest flakily timed out on Fuchsia |
|||||
Issue description
,
Aug 3 2017
(Ignoring the fact that this test is dumb, which it is...) the only real thing being done in WorkQueue::WorkQueue() is making an array of 10 threads. That in turn is doing PlatformThread::Create() which goes here https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?l=81 and doesn't do a whole lot other than pthread_create(). So that's a bit worrying. I added a bit more logging around that thread creation loop, but haven't been able to repro again yet with that.
,
Aug 3 2017
Different failure in that test :( [00003.475] 03111.03138> Note: Google Test filter = ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads [00003.475] 03111.03138> [==========] Running 2 tests from 2 test cases. [00003.475] 03111.03138> [----------] Global test environment set-up. [00003.477] 03111.03138> [----------] 1 test from ConditionVariableTest [00003.478] 03111.03138> [ RUN ] ConditionVariableTest.MultiThreadConsumerTest [00003.481] 03111.03138> [3:-1435772112:0803/181620.767473:3481293:ERROR:condition_variable_unittest.cc(258)] a0 [00003.482] 03111.03138> [3:-1435772112:0803/181620.769404:3482815:ERROR:condition_variable_unittest.cc(260)] a1 [00003.484] 03111.03138> [3:-1435772112:0803/181620.770732:3484142:ERROR:condition_variable_unittest.cc(580)] c0 [00004.095] 03111.03138> [3:-1435772112:0803/181621.382220:4095635:ERROR:condition_variable_unittest.cc(586)] c3 [00004.096] 03111.03138> [3:-1435772112:0803/181621.383275:4096687:ERROR:condition_variable_unittest.cc(262)] a2 [00004.097] 03111.03138> [3:-1435772112:0803/181621.384221:4097632:ERROR:condition_variable_unittest.cc(265)] a3 [00004.098] 03111.03138> [3:-1435772112:0803/181621.385145:4098556:ERROR:condition_variable_unittest.cc(268)] a4 [00004.099] 03111.03138> [3:-1435772112:0803/181621.386074:4099497:ERROR:condition_variable_unittest.cc(271)] a5 [00004.100] 03111.03138> [3:-1435772112:0803/181621.386999:4100410:ERROR:condition_variable_unittest.cc(273)] a6 [00183.065] 02760.03095> [3:654035760:0803/181920.350575:183064449:ERROR:process_fuchsia.cc(173)] mx_object_wait_one timed out, and no signals [00183.067] 02760.02850> Still waiting for the following processes to finish: [00183.067] 02760.02850> /system/base_unittests --gtest_filter=ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads --gtest_flagfile=/tmp/.org.chromium.Chromium.kAJihJ/.org.chromium.Chromium.NbeLng --single-process-tests -
,
Aug 4 2017
Got the failure from #c3 again:
[00003.544] 03307.03334> Note: Google Test filter = ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads
[00003.544] 03307.03334> [==========] Running 2 tests from 2 test cases.
[00003.544] 03307.03334> [----------] Global test environment set-up.
[00003.545] 03307.03334> [----------] 1 test from ConditionVariableTest
[00003.547] 03307.03334> [ RUN ] ConditionVariableTest.MultiThreadConsumerTest
[00003.550] 03307.03334> [3:-1270506704:0804/183343.163891:3549796:ERROR:condition_variable_unittest.cc(258)] a0
[00003.551] 03307.03334> [3:-1270506704:0804/183343.165813:3551304:ERROR:condition_variable_unittest.cc(260)] a1
[00003.552] 03307.03334> [3:-1270506704:0804/183343.167154:3552642:ERROR:condition_variable_unittest.cc(580)] c0
[00003.757] 03307.03334> [3:-1270506704:0804/183343.371468:3756960:ERROR:condition_variable_unittest.cc(586)] c3
[00003.757] 03307.03334> [3:-1270506704:0804/183343.372413:3757903:ERROR:condition_variable_unittest.cc(262)] a2
[00003.758] 03307.03334> [3:-1270506704:0804/183343.373302:3758792:ERROR:condition_variable_unittest.cc(265)] a3
[00003.759] 03307.03334> [3:-1270506704:0804/183343.374235:3759725:ERROR:condition_variable_unittest.cc(268)] a4
[00003.760] 03307.03334> [3:-1270506704:0804/183343.375152:3760641:ERROR:condition_variable_unittest.cc(271)] a5
[00003.761] 03307.03334> [3:-1270506704:0804/183343.376055:3761545:ERROR:condition_variable_unittest.cc(273)] a6
[00018.361] 02490.02536> Still waiting for the following processes to finish:
[00018.361] 02490.02536> /system/base_unittests --gtest_filter=ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads --gtest_flagfile=/tmp/.org.chromium.Chromium.AMjiLM/.org.chromium.Chromium.nkmFNm --single-process-tests -
[00018.361] 02490.02536> -test-launcher-output=/tmp/.org.chromium.Chromium.FGAfDo/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[00033.810] 02490.02536> Still waiting for the following processes to finish:
[00033.810] 02490.02536> /system/base_unittests --gtest_filter=ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads --gtest_flagfile=/tmp/.org.chromium.Chromium.AMjiLM/.org.chromium.Chromium.nkmFNm --single-process-tests -
[00033.810] 02490.02536> -test-launcher-output=/tmp/.org.chromium.Chromium.FGAfDo/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[00049.027] 02490.02536> Still waiting for the following processes to finish:
[00049.027] 02490.02536> /system/base_unittests --gtest_filter=ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads --gtest_flagfile=/tmp/.org.chromium.Chromium.AMjiLM/.org.chromium.Chromium.nkmFNm --single-process-tests -
[00049.027] 02490.02536> -test-launcher-output=/tmp/.org.chromium.Chromium.FGAfDo/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[00064.027] 02490.02536> Still waiting for the following processes to finish:
[00064.027] 02490.02536> /system/base_unittests --gtest_filter=ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads --gtest_flagfile=/tmp/.org.chromium.Chromium.AMjiLM/.org.chromium.Chromium.nkmFNm --single-process-tests -
[00064.027] 02490.02536> -test-launcher-output=/tmp/.org.chromium.Chromium.FGAfDo/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[00079.028] 02490.02536> Still waiting for the following processes to finish:
[00079.028] 02490.02536> /system/base_unittests --gtest_filter=ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads --gtest_flagfile=/tmp/.org.chromium.Chromium.AMjiLM/.org.chromium.Chromium.nkmFNm --single-process-tests -
[00079.028] 02490.02536> -test-launcher-output=/tmp/.org.chromium.Chromium.FGAfDo/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[00094.223] 02490.02536> Still waiting for the following processes to finish:
[00094.223] 02490.02536> /system/base_unittests --gtest_filter=ConditionVariableTest.MultiThreadConsumerTest:ReadWriteLockTest.MutexTwoThreads --gtest_flagfile=/tmp/.org.chromium.Chromium.AMjiLM/.org.chromium.Chromium.nkmFNm --single-process-tests -
[00094.223] 02490.02536> -test-launcher-output=/tmp/.org.chromium.Chromium.FGAfDo/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[00103.580] 03307.03334> [3:-1270506704:0804/183523.195169:103580670:ERROR:condition_variable_unittest.cc(276)] a7
[00103.581] 03307.03334> [3:-1270506704:0804/183523.196263:103581753:ERROR:condition_variable_unittest.cc(282)] a8
[00103.583] 03307.03334> [3:-1270506704:0804/183523.197675:103583165:ERROR:condition_variable_unittest.cc(293)] a9
[00103.584] 03307.03334> [3:-1270506704:0804/183523.198663:103584153:ERROR:condition_variable_unittest.cc(298)] a10
[00103.585] 03307.03334> [3:-1270506704:0804/183523.199591:103585081:ERROR:condition_variable_unittest.cc(300)] a11
...
I don't quite know why it manages to sort of continue after hanging for 90s just as it's about to be killed. Perhaps something in process termination releases a thing and lets it run the rest of the test.
The loop that it's gotten stuck on twice is this though:
LOG(ERROR) << "a6";
while (!queue.EveryIdWasAllocated())
queue.all_threads_have_ids()->Wait();
LOG(ERROR) << "a7";
EveryIdWasAllocated() does https://cs.chromium.org/chromium/src/base/synchronization/condition_variable_unittest.cc?rcl=c8ffa1beb0ef45a81878553717983152b735b87b&l=542 a check that all the threads have been assigned ids by counting whether thread_count_ == thread_started_counter_.
queue.all_threads_have_ids()->Wait() is waiting on a ConditionVariable that is Signal()d by the main thread under those same conditions (when EveryIdWasAllocated()) is true.
Because our ConditionVariable class no longer actually does much itself, but rather forwards to either pthreads or win32 condition variables, I'm going to assume that the Signal/Wait are not the bug for now.
My current best guess then would be that WorkQueue::GetThreadId() is racing. https://cs.chromium.org/chromium/src/base/synchronization/condition_variable_unittest.cc?rcl=1d7b37d1258fd3a24d07ef745aa37747a3732650&l=536 There is some locking-type thing there, but I'm not (yet) convinced it's doing the right thing.
This doesn't explain the original hang in #c1 though, which looked more like it was in the thread creation loop.
,
Aug 4 2017
Ah, yes, the DFAKE_SCOPED_RECURSIVE_LOCK is keyed on !NDEBUG, rather than whether dchecks are enabled like you would think (this is very old code, so probably predates that distinction).
,
Aug 4 2017
Hmm, but the one call to GetThreadId() is guarded by lock_, and that's the only place that thread_started_counter_ is mutated. So that actually seems OK. One explanation that would work for the 3 observed hangs is that pthread_create() can hang, or the started thread can hang before execution. However, that way lies madness.
,
Aug 4 2017
Wait, I don't understand this test at all I think. The main body locks the queue.lock(), and then busy loops until all threads are initialized, while the ThreadMain try to get to initialize, but they also take that lock before doing so? ... OK, they must be locking different things it seems. But I have no idea which things they're locking then. ... Uh, nope. Same lock_ pointer: [00003.563] 03310.03337> [==========] Running 1 test from 1 test case. [00003.563] 03310.03337> [----------] Global test environment set-up. [00003.563] 03310.03337> [----------] 1 test from ConditionVariableTest [00003.565] 03310.03337> [ RUN ] ConditionVariableTest.MultiThreadConsumerTest [00003.573] 03310.03390> [3:-515642576:0804/204944.217960:3572677:ERROR:condition_variable_unittest.cc(723)] ThreadMain 0 has lock 0x1e5019408718 [00003.574] 03310.03390> [3:-515642576:0804/204944.220504:3574815:ERROR:condition_variable_unittest.cc(727)] ThreadMain 0 released lock 0x1e5019408718 [00003.577] 03310.03385> [3:-759051472:0804/204944.223313:3577623:ERROR:condition_variable_unittest.cc(723)] ThreadMain 1 has lock 0x1e5019408718 [00003.579] 03310.03385> [3:-759051472:0804/204944.224606:3578932:ERROR:condition_variable_unittest.cc(727)] ThreadMain 1 released lock 0x1e5019408718 [00003.582] 03310.03401> [3:-105788624:0804/204944.227378:3581687:ERROR:condition_variable_unittest.cc(723)] ThreadMain 2 has lock 0x1e5019408718 [00003.583] 03310.03401> [3:-105788624:0804/204944.228668:3582980:ERROR:condition_variable_unittest.cc(727)] ThreadMain 2 released lock 0x1e5019408718 [00003.585] 03310.03408> [3:1462008624:0804/204944.231255:3585563:ERROR:condition_variable_unittest.cc(723)] ThreadMain 3 has lock 0x1e5019408718 [00003.586] 03310.03408> [3:1462008624:0804/204944.232546:3586857:ERROR:condition_variable_unittest.cc(727)] ThreadMain 3 released lock 0x1e5019408718 [00003.592] 03310.03415> [3:-748053712:0804/204944.237807:3592130:ERROR:condition_variable_unittest.cc(723)] ThreadMain 4 has lock 0x1e5019408718 [00003.593] 03310.03415> [3:-748053712:0804/204944.239189:3593500:ERROR:condition_variable_unittest.cc(727)] ThreadMain 4 released lock 0x1e5019408718 [00003.596] 03310.03438> [3:-1028764880:0804/204944.241863:3596173:ERROR:condition_variable_unittest.cc(723)] ThreadMain 5 has lock 0x1e5019408718 [00003.597] 03310.03438> [3:-1028764880:0804/204944.243127:3597439:ERROR:condition_variable_unittest.cc(727)] ThreadMain 5 released lock 0x1e5019408718 [00003.600] 03310.03445> [3:-703407312:0804/204944.245827:3600150:ERROR:condition_variable_unittest.cc(723)] ThreadMain 6 has lock 0x1e5019408718 [00003.602] 03310.03445> [3:-703407312:0804/204944.247931:3602245:ERROR:condition_variable_unittest.cc(727)] ThreadMain 6 released lock 0x1e5019408718 [00003.604] 03310.03337> [3:-1435772112:0804/204944.249844:3604159:ERROR:condition_variable_unittest.cc(267)] test body has lock 0x1e5019408718 [00003.605] 03310.03421> [3:-576439504:0804/204944.251149:3605461:ERROR:condition_variable_unittest.cc(723)] ThreadMain 7 has lock 0x1e5019408718 [00003.606] 03310.03421> [3:-576439504:0804/204944.252502:3606816:ERROR:condition_variable_unittest.cc(727)] ThreadMain 7 released lock 0x1e5019408718 [00003.610] 03310.03426> [3:842332976:0804/204944.255325:3609637:ERROR:condition_variable_unittest.cc(723)] ThreadMain 8 has lock 0x1e5019408718 [00003.612] 03310.03433> [3:-1359353040:0804/204944.257816:3612129:ERROR:condition_variable_unittest.cc(723)] ThreadMain 9 has lock 0x1e5019408718 [00003.613] 03310.03433> [3:-1359353040:0804/204944.259451:3613766:ERROR:condition_variable_unittest.cc(727)] ThreadMain 9 released lock 0x1e5019408718 [00003.615] 03310.03337> [3:-1435772112:0804/204944.261121:3615445:ERROR:condition_variable_unittest.cc(271)] test body released lock 0x1e5019408718 [00003.616] 03310.03426> [3:842332976:0804/204944.262495:3616810:ERROR:condition_variable_unittest.cc(727)] ThreadMain 8 released lock 0x1e5019408718 [00004.057] 03310.03337> [ OK ] ConditionVariableTest.MultiThreadConsumerTest (486 ms) [00004.060] 03310.03337> [----------] 1 test from ConditionVariableTest (492 ms total) And both the main loop and thread 7 have it acquired at the same time. Urm? [00003.604] 03310.03337> [3:-1435772112:0804/204944.249844:3604159:ERROR:condition_variable_unittest.cc(267)] test body has lock 0x1e5019408718 [00003.605] 03310.03421> [3:-576439504:0804/204944.251149:3605461:ERROR:condition_variable_unittest.cc(723)] ThreadMain 7 has lock 0x1e5019408718
,
Aug 4 2017
The changes for the logs above are https://chromium-review.googlesource.com/c/599358/2/base/synchronization/condition_variable_unittest.cc I'm going crazy I think?
,
Aug 4 2017
(As pointed out to me on IRC, the lock code could be racing the print, and the ConditionVariable unlocks its lock, so all is well there. I still have no idea why it's hanging rarely though.)
,
Aug 11 2017
A more convenient (in that it's not just a no-output hang) flake locally in the same suite, but might be a different problem:
[00097.032] 02543.02599> [ RUN ] ConditionVariableTest.LargeFastTaskTest
[00097.032] 02543.02599> [3:-223110352:0811/190217.778455:96900642:FATAL:thread_id_name_manager.cc(114)] Check failed: (id_to_handle_iter!= thread_id_to_handle_.end()).
----- start symbolized stack
#00: base::debug::StackTrace::StackTrace(unsigned long) at /work/cr/src/out/fuch/../../base/debug/stack_trace_fuchsia.cc:169
#01: endl<char, std::__1::char_traits<char> > at /work/cr/src/out/fuch/../../buildtools/third_party/libc++/trunk/include/ostream:1001
(inlined by) operator<< at /work/cr/src/out/fuch/../../buildtools/third_party/libc++/trunk/include/ostream:195
(inlined by) ~LogMessage at /work/cr/src/out/fuch/../../base/logging.cc:554
#02: base::ThreadIdNameManager::RemoveName(unsigned long, int) at /work/cr/src/out/fuch/../../base/threading/thread_id_name_manager.cc:114
#03: base::(anonymous namespace)::ThreadFunc(void*) at /work/cr/src/out/fuch/../../base/threading/platform_thread_posix.cc:77
#04: base_unittests+0x19896
#05: base_unittests+0x910fa
----- end symbolized stack
[00097.032] 02543.02599>
[00097.033] 02543.02599> [1386/2462] ConditionVariableTest.LargeFastTaskTest (CRASHED)
This seems different, but platform_thread_posix.cc seems pretty straightforward (RegisterThread(), run thread, RemoveName()) https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?rcl=24cdbc10fbc7474216e7648a0bcb6fa300e0e578&l=67 so it seems like something funky in thread teardown, or the fixture.
,
Aug 11 2017
wez found #c10, and I repro'd with
diff --git a/base/threading/thread_id_name_manager.cc b/base/threading/thread_id_name_manager.cc
index 107e0dc..3227579 100644
--- a/base/threading/thread_id_name_manager.cc
+++ b/base/threading/thread_id_name_manager.cc
@@ -44,6 +44,11 @@ const char* ThreadIdNameManager::GetDefaultInternedString() {
void ThreadIdNameManager::RegisterThread(PlatformThreadHandle::Handle handle,
PlatformThreadId id) {
AutoLock locked(lock_);
+ if (thread_id_to_handle_.find(id) != thread_id_to_handle_.end()) {
+ LOG(ERROR) << "Adding id=" << id << " as handle=" << handle
+ << ", but already there as handle="
+ << thread_id_to_handle_.find(id)->second;
+ }
thread_id_to_handle_[id] = handle;
thread_handle_to_interned_name_[handle] =
name_to_interned_name_[kDefaultName];
Eventually, we hit this output:
[00098.866] 27213.28061> [3:2099071792:0811/222136.724944:98866476:ERROR:thread_id_name_manager.cc(48)] Adding id=2099071792 as handle=97210093882160, but already there as handle=4778102704944
hex(97210093882160) == 0x58697d1d4b30
hex(4778102704944) == 0x4587d1d4b30
which are the same in the bottom 32 bits!
And PlatformThreadId is pid_t, which is int, but pthread_t (which is what pthread_self() returns) is unsigned long or __pthread* ... in either case, 64 bits. Yikes!
,
Aug 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c4822834e91f44835e3fcdd49fcf9ccafe6a84ec commit c4822834e91f44835e3fcdd49fcf9ccafe6a84ec Author: Wez <wez@chromium.org> Date: Sat Aug 12 01:59:30 2017 Fix PlatformThread::CurrentId() under Fuchsia. Under Fuchsia CurrentId() was returning the pthread_t result of calling pthread_self() as the PlatformThreadId, which under POSIX platforms is defined to pid_t. Under Fuchsia these are an unsigned long derived from the pthread's internal point (i.e. 64-bit), and an int (32-bit) respectively. This is changed to use the platform thread handle, which is a 32-bit value as well. Bug: 751894 , 735701 , 738275 Change-Id: Ie5ab21c3c01d609b9b2d5b6a43d04ae57990eaf3 Reviewed-on: https://chromium-review.googlesource.com/611853 Commit-Queue: Wez <wez@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Reviewed-by: Scott Graham <scottmg@chromium.org> Cr-Commit-Position: refs/heads/master@{#493935} [modify] https://crrev.com/c4822834e91f44835e3fcdd49fcf9ccafe6a84ec/base/threading/platform_thread.h [modify] https://crrev.com/c4822834e91f44835e3fcdd49fcf9ccafe6a84ec/base/threading/platform_thread_posix.cc
,
Aug 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0cdc0f4c4db9d2766b29d3a93b3617c107f9f8b0 commit 0cdc0f4c4db9d2766b29d3a93b3617c107f9f8b0 Author: Wez <wez@chromium.org> Date: Sat Aug 12 16:45:16 2017 Fix inappropriate use of pthread_self() in PlatformThread::SetName(). We updated PlatformThread::CurrentID() to work correctly for Fuchsia in https://chromium-review.googlesource.com/c/611853, but ironically the Fuchsia version of PlatformThread::SetName() was still mis-using pthread_self() as the PlatformThreadId as well, causing swathes of tests to fail due to ThreadIdManager::Set/Get being called with different Ids (pthread_t vs mx_handle_t) and therefore mis-matching. TBR: scottmg@chromium.org Bug: 751894 , 735701 , 738275 Change-Id: I520560050c18efe624f701b4b9c5925b92bbec9a Reviewed-on: https://chromium-review.googlesource.com/612027 Reviewed-by: Wez <wez@chromium.org> Commit-Queue: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#493980} [modify] https://crrev.com/0cdc0f4c4db9d2766b29d3a93b3617c107f9f8b0/base/threading/platform_thread_fuchsia.cc
,
Aug 31 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5c0b8e30fbf94783c389c90f4ea78bec23e51bf0 commit 5c0b8e30fbf94783c389c90f4ea78bec23e51bf0 Author: Sergey Ulanov <sergeyu@chromium.org> Date: Thu Aug 31 21:51:02 2017 Actually disable ConditionVariableTest.LargeFastTaskTest The test-name didn't have - in the front, so the test wasn't actually disabled. TBR=scottmg@chromium.org NOTRY=true Bug: 751894 Change-Id: If0292834f085fee86a7a82c88ad88cf4ceeb6570 Reviewed-on: https://chromium-review.googlesource.com/646492 Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> Reviewed-by: Sergey Ulanov <sergeyu@chromium.org> Reviewed-by: Scott Graham <scottmg@chromium.org> Cr-Commit-Position: refs/heads/master@{#499022} [modify] https://crrev.com/5c0b8e30fbf94783c389c90f4ea78bec23e51bf0/testing/buildbot/filters/fuchsia.base_unittests.filter
,
Sep 11
Fixed by c10-c13, I believe. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by scottmg@chromium.org
, Aug 3 2017