Issue metadata
Sign in to add a comment
|
Raciness in content_unittests between ScopedTempDir teardown and SimpleBackendImpl initialization |
||||||||||||||||||||||
Issue descriptionMany tests failing with [453226:1063433171:1129/071219.026473:170349682:ERROR:render_widget_host_view_base.cc(193)] Not implemented reached in virtual uint32_t content::RenderWidgetHostViewBase::GetCaptureSequenceNumber() const [453226:1063433171:1129/071219.056807:170380016:WARNING:scoped_temp_dir.cc(23)] Could not delete temp dir in dtor. [453226:980595369:1129/071219.059212:170382421:FATAL:simple_backend_impl.cc(732)] Check failed: mtime_result. e.g. https://ci.chromium.org/p/chromium/builders/luci.chromium.try/fuchsia_x64/160034 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/fuchsia_x64/160029 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/fuchsia_x64/160027 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/fuchsia_x64/160026
,
Nov 29
This DCHECK verifies that stat(2) returns zero (i.e. success) for an existing directory. Is this racy in Fuchsia?
,
Nov 29
This is also causing my patches to be rejected.
,
Nov 29
Auto-assigning, to take an initial look before handing off to Gardener kmarshall@.
,
Nov 29
sergeyu: You landed https://chromium-review.googlesource.com/c/chromium/src/+/1351888 yesterday (Nov 28th), which altered how we propagate the /data directory to test-launcher sub-processes - I can't see anything obviously wrong in the CL, but could you take a look, please?
,
Nov 29
investigating
,
Nov 29
Actually, it is noticable that the tests which fail in the retry-with-patch are the same ones which fail in the original run of the content_unittests step. But each of the failed tries listed by tikuta@ have different tests failing in them. IMO this looks like some earlier test in the same batch is doing something to break later tests.
,
Nov 29
Ah, nm, the retry steps aren't running anything at all, because --test-launcher-retry-limit isn't respected by the runner script.
,
Nov 29
,
Nov 29
OK, we should be able to do try runs to repro this, now. Looking at the log output, the two lines: [453226:1063433171:1129/071219.056807:170380016:WARNING:scoped_temp_dir.cc(23)] Could not delete temp dir in dtor. [453226:980595369:1129/071219.059212:170382421:FATAL:simple_backend_impl.cc(732)] Check failed: mtime_result. are suspicious - the first indicates that recursive deletion of a scoped temporary directory is failing, while the latter suggests that the SimpleBackendImpl cache directory is being deleted in-between being created, and us reaching the stat() call -> i.e. seemingly a race between ScopedTempDir teardown and SimpleBackendImpl initialization. We'll need to work out who is creating and owning the ScopedTempDir in this case, to be able to diagnose further, I think.
,
Nov 29
Also, if I'm reading the logs correctly, this is a race between two threads in the same process, i.e. it's two sequences in the test itself racing against one another.
,
Nov 30
Removing Sheriff-Chromium since this has an owner and is underway.
,
Dec 3
There are now other tests failing with the same message. https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8928190746169793264/+/steps/content_unittests/0/logs/ForwardingAudioStreamFactoryTest.Destruction_CallsOnSourceGoneOnRegisteredLoopbackSinks/0 It looks like cache is cleaned up asynchronously, which races with cache initialization in the next test. The cleanup task is posted here: https://codesearch.chromium.org/chromium/src/net/disk_cache/simple/simple_backend_impl.cc?gsn=SetWorkerPoolForTesting&l=389 The cleanup task is posted to a TaskScheduler task runner. The TaskScheduler for many of these is supposed to be created and destroyed in TestBrowserThreadBundle, which uses ScopedTaskEnvironment: https://codesearch.chromium.org/chromium/src/content/public/test/test_browser_thread_bundle.cc?dr=CSs&g=0&l=117 , but it's likely that some other tests don't use ScopedTaskEnvironment leaving TaskScheduler for the tests that run afterwards.
,
Dec 3
Re #13 tasks leaking between tests would be bad, but I don't think that's possible. Some tests have a ScopedTaskEnvironment in their fixture, others a TestBrowserThreadBundle but these get torn down after each test run so I'm not seeing any way to for tasks to leak. Also without a ScopedTaskEnvironment or a MessageLoop (some tests might use those directly, but it shouldn't matter) PostTask will fail.
,
Dec 3
,
Dec 3
sergeyu: These flakes are causing considerable disruption. Are we seeing them from suites other than content_unittests? If not then let's temporarily take content_unittests off the CQ until we can identify the cause. Tagging as P0, since this is causing bogus CQ failures.
,
Dec 3
I can reproduce it locally . Will disable it on CQ if I don't figure out the root cause soon
,
Dec 3
Looks like the problem is that in TestBrowserContext destructor the ScopedTestDir that contains the disk cache is destroyed while TaskScheduler is still running. The TaskScheduler is executing SimpleBackendImpl::InitCacheStructureOnDisk(), which races with directory destruction.
,
Dec 3
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8e99628583e99028dac8a8433f76d91d68965d41 commit 8e99628583e99028dac8a8433f76d91d68965d41 Author: Sergey Ulanov <sergeyu@chromium.org> Date: Mon Dec 03 21:01:20 2018 Disable content_unittests on Fuchsia x64 bot content_unittests are failing on Fuchsia. TBR=thakis@chromium.org Bug: 910029 Change-Id: Ie567ad03904e3ec27143c3a1bec1a86562c74c18 Reviewed-on: https://chromium-review.googlesource.com/c/1359057 Reviewed-by: Sergey Ulanov <sergeyu@chromium.org> Reviewed-by: Nico Weber <thakis@chromium.org> Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> Cr-Commit-Position: refs/heads/master@{#613241} [modify] https://crrev.com/8e99628583e99028dac8a8433f76d91d68965d41/testing/buildbot/chromium.linux.json [modify] https://crrev.com/8e99628583e99028dac8a8433f76d91d68965d41/testing/buildbot/test_suite_exceptions.pyl
,
Dec 3
sergeyu: Your explanation in #18 makes sense, but why is this suddenly flaking now, when it previously didn't? Has something changed about the ScopedTaskEnvironment vs ScopedTempDir dtor ordering?
,
Dec 3
Looking at estade@'s https://chromium-review.googlesource.com/c/chromium/src/+/1352969 (landed Nov 28th, just before we observed these flakes starting), I notice that a RenderViewHostTestEnabler was added to a whole set of different test bases. RenderViewHostTestEnabler does own a ScopedTaskEnvironment internally.
,
Dec 4
,
Dec 4
,
Dec 4
,
Dec 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9320bb77f51270de86de19a37a6295186ab6c9bc commit 9320bb77f51270de86de19a37a6295186ab6c9bc Author: Sergey Ulanov <sergeyu@chromium.org> Date: Wed Dec 05 10:07:42 2018 Fix race in TestBrowserContext shutdown sequence. disk_cache::SimpleBackendImpl performs disk IO asynchronously on TaskScheduler. These operations can race with ScopedTempDir destruction in TestBrowserContext. Updated ~TestBrowserContext() to ensure that all pending IO ops complete before it attempts to destroy ScopedTempDir. Bug: 910029 Change-Id: I69145a395c81a042d8c3880e9917a07a972df9ab Reviewed-on: https://chromium-review.googlesource.com/c/1359464 Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> Reviewed-by: Scott Violet <sky@chromium.org> Cr-Commit-Position: refs/heads/master@{#613925} [modify] https://crrev.com/9320bb77f51270de86de19a37a6295186ab6c9bc/content/public/test/test_browser_context.cc
,
Dec 5
FYI, queued a CL for restoration of content_unittests on the waterfall: https://chromium-review.googlesource.com/c/chromium/src/+/1363640
,
Dec 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/40a8e2c9f07c35937bbe629ffc6be3268e89626c commit 40a8e2c9f07c35937bbe629ffc6be3268e89626c Author: Wez <wez@chromium.org> Date: Wed Dec 05 19:22:03 2018 Revert "Disable content_unittests on Fuchsia x64 bot" This reverts commit 8e99628583e99028dac8a8433f76d91d68965d41. Reason for revert: Underlying race-condition leading to flake has been resolved. Original change's description: > Disable content_unittests on Fuchsia x64 bot > > content_unittests are failing on Fuchsia. > > TBR=thakis@chromium.org > > Bug: 910029 > Change-Id: Ie567ad03904e3ec27143c3a1bec1a86562c74c18 > Reviewed-on: https://chromium-review.googlesource.com/c/1359057 > Reviewed-by: Sergey Ulanov <sergeyu@chromium.org> > Reviewed-by: Nico Weber <thakis@chromium.org> > Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> > Cr-Commit-Position: refs/heads/master@{#613241} TBR=thakis@chromium.org,sergeyu@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 910029 Change-Id: Id3c04cb5efa2c875cad61ffae78805b04738f9cc Reviewed-on: https://chromium-review.googlesource.com/c/1363640 Reviewed-by: Wez <wez@chromium.org> Commit-Queue: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#614055} [modify] https://crrev.com/40a8e2c9f07c35937bbe629ffc6be3268e89626c/testing/buildbot/chromium.linux.json [modify] https://crrev.com/40a8e2c9f07c35937bbe629ffc6be3268e89626c/testing/buildbot/test_suite_exceptions.pyl
,
Dec 5
,
Dec 5
|
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by tikuta@chromium.org
, Nov 29