New issue
Advanced search Search tips

Issue 856900 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 1
Type: Bug

Blocking:
issue 738725



Sign in to add a comment

MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior flaked in base_unittests on Fuchsia

Project Member Reported by scottmg@chromium.org, Jun 27 2018

Issue description

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/fuchsia_x64/51308

../../base/message_loop/message_loop_task_runner_unittest.cc:247: Failure
Value of: reply_deleted_on
  Actual: true
Expected: false
Stack trace:
bt#00: pc 0x659c5899e4a6 (app:/pkg/bin/app,0xcce4a6)
bt#01: pc 0x659c5899de4b (app:/pkg/bin/app,0xccde4b)
bt#02: pc 0x659c58297849 (app:/pkg/bin/app,0x5c7849)
bt#03: end
[  FAILED  ] MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior (10 ms)
 

Comment 1 by w...@chromium.org, Jun 27 2018

Cc: fdoray@chromium.org gab@chromium.org
Components: Internals
+fdoary, gab, since this seems to indicate that an important expectation is being violated in the task-processing code.

This was on a non-Debug bot, but running under nested virtualization via QEMU, FWIW; so maybe timing related..?

Comment 2 by gab@chromium.org, Jun 28 2018

Owner: fdoray@chromium.org
Status: Assigned (was: Untriaged)
@fdoray as PostTaskAndReply expert: if anything could have made this test flaky it would be r550189 even though I'm not sure how at first sight.

Comment 3 by gab@chromium.org, Jun 28 2018

std::move(*this) from a destructor is potentially ill-defined C++? (ref. ~PostTaskAndReplyRelay())
Components: -Internals
Seems to be pretty reliably flaky:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-arm64-rel/1685 plus a couple other bugs that I'm going to dupe in a sec. This is the actual stack, not much more useful though.

[ RUN      ] MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior
../../base/message_loop/message_loop_task_runner_unittest.cc:247: Failure
Value of: reply_deleted_on
  Actual: true
Expected: false
Stack trace:
#00: StackTraceGetter::CurrentStackTrace(int, int) at third_party/googletest/custom/gtest/internal/custom/stack_trace_getter.cc:24
#01: testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop(int) at third_party/googletest/src/googletest/src/gtest.cc:846
#02: testing::internal::AssertHelper::operator=(testing::Message const&) const at third_party/googletest/src/googletest/src/gtest.cc:418
#03: base::MessageLoopTaskRunnerTest_PostTaskAndReply_DeadReplyTaskRunnerBehavior_Test::TestBody() at base/message_loop/message_loop_task_runner_unittest.cc:247
#04: void testing::internal::InvokeHelper<void, std::__2::tuple<> >::InvokeMethod<base::RunLoop, void (base::RunLoop::*)()>(base::RunLoop*, void (base::RunLoop::*)(), std::__2::tuple<> const&) at third_party/googletest/src/googlemock/include/gmock/gmock-generated-actions.h:68
 Issue 889278  has been merged into this issue.
Cc: tzik@chromium.org
 Issue 868086  has been merged into this issue.
scottmg@around:/work/cr/src$ tools/run-swarmed.py -C out/fuch --target-os fuchsia --gtest_filter=MessageLoopTaskRunnerTest.PostTaskAndReply* --test-name base_unittests --copies 500
ninja -C out/fuch base_unittests -j 1120 -l 56
ninja: Entering directory `out/fuch'
ninja: no work to do.
python tools/swarming_client/isolate.py check -i out/fuch/base_unittests.isolate -s out/fuch/base_unittests.isolated
If you get authentication errors, follow:
  https://www.chromium.org/developers/testing/isolated-testing/for-swes#TOC-Login-on-the-services
Uploading to isolate server, this can take a while...
Triggering 500 tasks...
[500/500] collected: 499 successes, 1 failures, 0 bot errors...
Results logs collected into /usr/local/google/work/cr/src/results.

and the failure is the same.

I don't really understand any of this infrastructure, but I guess I'll dig in since it's fairly reproducible.

With a sleep added like this https://chromium-review.googlesource.com/c/chromium/src/+/1246481 it fails and/or crashes on other platforms reliably too, so I'm going to disable it.


scottmg@around:/work/cr/src$ ninja -C out/lin base_unittests -j1000 && out/lin/base_unittests --gtest_filter=MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior
ninja: Entering directory `out/lin'
[0->2/2 ~1] LINK ./base_unittests
IMPORTANT DEBUGGING NOTE: batches of tests are run inside their
own process. For debugging a test inside a debugger, use the
--gtest_filter=<your_test_name> flag along with
--single-process-tests.
Using sharding settings from environment. This is shard 0/1
Using 1 parallel jobs.
Note: Google Test filter = MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from MessageLoopTaskRunnerTest
[ RUN      ] MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior
../../base/message_loop/message_loop_task_runner_unittest.cc:249: Failure
Value of: reply_deleted_on
  Actual: true
Expected: false
Stack trace:
#0 0x000000ea4f6c testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop()
#1 0x000000ea4939 testing::internal::AssertHelper::operator=()
#2 0x000000706c38 base::MessageLoopTaskRunnerTest_PostTaskAndReply_DeadReplyTaskRunnerBehavior_Test::TestBody()

[  FAILED  ] MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior (1011 ms)
[----------] 1 test from MessageLoopTaskRunnerTest (1011 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1012 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior

 1 FAILED TEST
[1/1] MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior (1010 ms)
1 test failed:
    MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior (../../base/message_loop/message_loop_task_runner_unittest.cc:204)
Tests took 1 seconds.

Nice job tracking that down, Scott!

IIUC the problem is that the test calls UnblockTaskThread(), allowing the task-thread to process the posted task, but then doesn't wait for that to complete before it resets |current_loop_| - so there is a race-condition in the PostTask'ing the reply task, and we may either:

1. Complete running the posted task before resetting, in which case the task will be queued to |current_loop_|, and actually deleted when that is reset.
2. Not complete running the posted task before the reset(), in which case it will be leaked, as expected.
Project Member

Comment 11 by bugdroid1@chromium.org, Sep 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f37018c402c4af20a6b89d2daf22766033f0d57d

commit f37018c402c4af20a6b89d2daf22766033f0d57d
Author: Scott Graham <scottmg@chromium.org>
Date: Wed Sep 26 16:38:18 2018

Disable flaky MessageLoopTaskRunnerTest.PostTaskAndReply_DeadReplyTaskRunnerBehavior

See linked bug for reliable repro and investigation.

TBR: fdoray@chromium.org
Bug: 856900
Change-Id: Ibb112c9a573ea4a2451edebffba63cccc80918de
Reviewed-on: https://chromium-review.googlesource.com/1246463
Reviewed-by: Scott Graham <scottmg@chromium.org>
Commit-Queue: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#594345}
[modify] https://crrev.com/f37018c402c4af20a6b89d2daf22766033f0d57d/base/message_loop/message_loop_task_runner_unittest.cc

Labels: -Pri-3 Test-Flaky Pri-1
Thanks for finding this, I'm not a fan of leaving tests disabled though, let's prioritize fixing this.

Sign in to add a comment