New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 645736 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: Bug-Regression

Blocking:
issue 676210



Sign in to add a comment

"context_lost_tests (with patch)" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Sep 10 2016

Issue description

"context_lost_tests (with patch)" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA.



This flaky test/step was previously tracked in  issue 643399 .
 

Comment 1 Deleted

Components: Internals>GPU>Internals
Status: Available (was: WontFix)

It seems GpuCrash.GPUProcessCrashesExactlyOnce is crashing like this:
https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.linux%2Flinux_chromium_rel_ng%2F295623%2F%2B%2Frecipes%2Fsteps%2Fcontext_lost_tests_on_NVIDIA_GPU_on_Linux__with_patch__on_Linux%2F0%2Fstdout

Comment 3 by vmi...@chromium.org, Sep 27 2016

Cc: kbr@chromium.org sunn...@chromium.org
Components: -Internals>GPU>Internals Internals>GPU>Testing Internals>Compositing>Rasterization
Labels: M-55 Hotlist-PixelWrangler
Owner: briander...@chromium.org
Status: Assigned (was: Available)
Assigning to current pixel wrangler.  brianderson@ could you please take a deeper look?

We are DCHECKING on a bad Resource state after a forced context loss.

[1:3:0909/211338:FATAL:resource_provider.cc(1392)] Check failed: !delegated_sync_points_required_ || resource->dirty_image || !resource->needs_sync_token(). 
#0 0x7fd5da07646e base::debug::StackTrace::StackTrace()
#1 0x7fd5da08d72b logging::LogMessage::~LogMessage()
#2 0x7fd5daaf3d32 cc::ResourceProvider::PrepareSendToParent()
#3 0x7fd5dab12ab5 cc::LayerTreeHostImpl::DrawLayers()
#4 0x7fd5dabc789d cc::ProxyImpl::DrawAndSwapInternal()
#5 0x7fd5dabc75bf cc::ProxyImpl::ScheduledActionDrawAndSwapIfPossible()
#6 0x7fd5dab94c39 cc::Scheduler::ProcessScheduledActions()
#7 0x7fd5dab94829 cc::Scheduler::OnBeginImplFrameDeadline()
#8 0x7fd5dab969ee cc::Scheduler::BeginImplFrameWithDeadline()
#9 0x7fd5dab9646f cc::Scheduler::OnBeginFrameDerivedImpl()
#10 0x7fd5daaf844d cc::BeginFrameObserverBase::OnBeginFrame()
#11 0x7fd5daaf9f7c cc::ExternalBeginFrameSource::OnBeginFrame()

Comment 4 by kbr@chromium.org, Sep 27 2016

If a recent CL caused the test to become flaky it should be reverted, investigated and relanded.

Project Member

Comment 5 by chromium...@appspot.gserviceaccount.com, Sep 28 2016

Detected 3 new flakes for test/step "context_lost_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.

Comment 6 by kbr@chromium.org, Sep 29 2016

Labels: -Type-Bug ReleaseBlock-Beta Type-Bug-Regression
What is the status of this bug? It is significantly affecting the commit queue -- there have been 10 flakes of this test on Linux today.

Brian, please treat this bug with the highest priority, find the CL that broke this and revert it. Thanks.

Comment 7 by gov...@chromium.org, Sep 29 2016

Is this bug applicable to any specific OS or all OSs?
Project Member

Comment 8 by chromium...@appspot.gserviceaccount.com, Sep 29 2016

Detected 8 new flakes for test/step "context_lost_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.
Project Member

Comment 9 by chromium...@appspot.gserviceaccount.com, Sep 30 2016

Detected 12 new flakes for test/step "context_lost_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.
Cc: ericrk@chromium.org
Eric, do you think https://codereview.chromium.org/2353033003 could be related to the flakes? The patch landed on 9/26 at 20:35 and the flakes started about 2 hours later.

From a quick look at the test (GpuCrash_GPUProcessCrashesExactlyOncePerVisitToAboutGpuCrash), it crashes the GPU service twice, with a wait for the first crash to finish before running the second crash. There's another test that crashes only once (ContextLost_WebGLContextLostFromGPUProcessExit), but that doesn't appear to flake.

I wonder if the wait before the second crash sometimes lets the logic in the patch trigger, which somehow causes us to draw without a sync token since we've lost the context and purged skia's caches?

I also notice the patch previously landed and was reverted on 9/22-9/23, but don't see any flakes during that period. So maybe it's not your patch, unless there was a significant change the second time it landed.

Comment 11 by kbr@chromium.org, Sep 30 2016

 Issue 651850  has been merged into this issue.

Comment 12 by kbr@chromium.org, Sep 30 2016

Similar report in  Issue 651850 .

If https://codereview.chromium.org/2353033003 will revert cleanly then please revert it now and let's watch to see if it clears up the flakiness.

That patch seems super unlikely to me, that is dropping caches inside skia, this is a problem of synchronization in cc.
> I wonder if the wait before the second crash sometimes lets the logic in the 
> patch trigger, which somehow causes us to draw without a sync token since we've 
> lost the context and purged skia's caches?

Can you explain what you mean here? Why would purging skia caches affect sync tokens?

Comment 15 by ericrk@google.com, Sep 30 2016

Hmm... seems unlikely given that it didn't flake the last time it was landed, and we're seeing a lot of flakes now (patch was not modified between revert/reland).... although it's possible my patch interacts with something else that was landed in that interval.

please go ahead and revert and we'll see if that fixes the issue.
Cc: vmp...@chromium.org
    // Check the synchronization and sync token state when delegated sync points
    // are required. The only case where we allow a sync token to not be set is
    // the case where the image is dirty. In that case we will bind the image
    // lazily and generate a sync token at that point.
    DCHECK(!delegated_sync_points_required_ || resource->dirty_image ||
           !resource->needs_sync_token());

needs_sync_token_ can become true in
- Resource::set_mailbox()
- Resource::SetLocallyUsed()
it becomes false always in
- Resource::UpdateSyncToken()

So we're failing to UpdateSyncToken in LayerTreeHostImpl::DrawLayers on some resource before we draw the frame, or we're using a resource we shouldn't be yet.

This seems like a problem related to TileManager more likely.

Comment 17 by ericrk@google.com, Sep 30 2016

Yeah, I agree that it seems unlikely that this is my CL - the first flake I see at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA is at 21:55UTC, ~17hrs after my patch landed, am I missing some?

Looking at changes before this, there are a couple between my CL and the first flake that may also be worth looking into:
Sunnyps@'s scheduler change - crrev.com/864a70f6f93a87ff374bf2aea2494d4d7d0150d7
vmpstr@'s re-ordering of cleanup in tile manager - crrev.com/63faa3a635d7ad979c9fc30de93ee0957003586d
Sorry Eric. I was looking at the wrong day.

I'm almost certain this is triggered by Sunny's scheduler change now.

Looking at the flakes of just the linux_chromium_rel_ng bot, there was a previous streak of flakes from 9/10 to 9/12, where Sunny's patch went in temporarily and then it started again on 9/27 after the patch relanded.

I canceled the revert of Eric's patch and will be reverting Sunny's to see if it helps.
Project Member

Comment 19 by chromium...@appspot.gserviceaccount.com, Oct 1 2016

Detected 8 new flakes for test/step "context_lost_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.

Comment 20 by ajha@chromium.org, Oct 3 2016

Just to update M-55 gets branched in 2/3 days from now. If possible please get the fix landed/reverted before M-55 branch.
Labels: OS-Linux
Looks like OS-Linux at the very least.  Please add OS tags to all release blockers, it's how the release team tracks them.
Project Member

Comment 22 by chromium...@appspot.gserviceaccount.com, Oct 3 2016

Detected 3 new flakes for test/step "context_lost_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.

Comment 23 by kbr@chromium.org, Oct 3 2016

What's the status of this bug? The recent flakes are the same problem:

[1:3:1003/105038:FATAL:resource_provider.cc(1415)] Check failed: !delegated_sync_points_required_ || resource->dirty_image || !resource->needs_sync_token(). 
#0 0x7f9da92c1b0e base::debug::StackTrace::StackTrace()
#1 0x7f9da92d93fb logging::LogMessage::~LogMessage()
#2 0x7f9da9d60122 cc::ResourceProvider::PrepareSendToParent()
#3 0x7f9da9d79ea5 cc::LayerTreeHostImpl::DrawLayers()
#4 0x7f9dacad944e cc::ProxyImpl::DrawAndSwapInternal()
#5 0x7f9dacad918f cc::ProxyImpl::ScheduledActionDrawAndSwapIfPossible()
#6 0x7f9dacad04f9 cc::Scheduler::ProcessScheduledActions()
#7 0x7f9dacad01a9 cc::Scheduler::OnBeginImplFrameDeadline()
#8 0x7f9dacad2096 cc::Scheduler::BeginImplFrameWithDeadline()
#9 0x7f9dacad1b8c cc::Scheduler::OnBeginFrameDerivedImpl()
#10 0x7f9da9d647ed cc::BeginFrameObserverBase::OnBeginFrame()
#11 0x7f9da9d6631c cc::ExternalBeginFrameSource::OnBeginFrame()
#12 0x7f9dac615abb _ZN3IPC8MessageTI23ViewMsg_BeginFrame_MetaSt5tupleIJN2cc14BeginFrameArgsEEEvE8DispatchIN7content34CompositorExternalBeginFrameSourceES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_
#13 0x7f9dac61589d content::CompositorExternalBeginFrameSource::OnMessageReceived()
#14 0x7f9dac616409 content::CompositorForwardingMessageFilter::ProcessMessageOnCompositorThread()
#15 0x7f9da9355c84 base::debug::TaskAnnotator::RunTask()

Please point to any reverts that might have occurred; this bug wasn't updated.

Revert failed to apply and is trying again today: https://codereview.chromium.org/2386183003/
Owner: sunn...@chromium.org
A friendly reminder that M55 Beta launch is coming soon! Your bug is labelled as Beta ReleaseBlock, pls make sure to land the fix ASAP. Thank you.
Cc: briander...@chromium.org piman@chromium.org danakj@chromium.org vmi...@chromium.org jbau...@chromium.org
Labels: -Pri-1 -ReleaseBlock-Beta -M-55 -Hotlist-PixelWrangler Pri-2
Not flaking since the revert landed so this isn't blocking any more.

I was able to reproduce this failure. It happens when the worker context detects that the gpu process has crashed but the compositor context doesn't until later. During this time the worker context produces null sync tokens which causes ScopedWriteLockGL to not call UpdateSyncToken. Then during draw the DCHECK above fails. At no point does the compositor context notice the gpu process crash.

I'm going to make ScopedWriteLockGL always set the sync token (even when it's null) on the resource and add a DCHECK that there was a GL error when we get a resource without a sync token in PrepareSendToParent.

piman, danakj: Any other suggestions on how to improve this? Any other things that could be affected by this race? Note that in the async worker context mode (gpu streams) the compositor and worker contexts won't be in the same share group so context loss won't propagate via ShareGroup::IsLost and the race might be more visible.

brianderson, jbauman, vmiura: FYI
My #1 suggestion is to bind the worker context to the compositor thread so the compositor can listen to the lost callback.

But that means either 1) stop sharing worker contexts 2) allow sharing contexts created on another thread.

2 is my long-term crusade right now.
I uploaded a fix for the problem for now: https://codereview.chromium.org/2387333003/

#1 *should* just work with the --enable-gpu-async-worker-context flag. That was added so that gpu raster / uploads could happen on a different stream in the gpu process and contexts in different streams can't be in a share group. I don't know about other uses of worker context on the main thread but I'm guessing most (all?) of those use mailboxes and sync tokens to synchronize already.
> #1 *should* just work with the --enable-gpu-async-worker-context flag. That was 
> added so that gpu raster / uploads could happen on a different stream in the 
> gpu process and contexts in different streams can't be in a share group. 

Oh by share I mean literally there is 1 worker context that all compositors share. We'd currently have to make 1 per compositor in order to bind it on the compositor thread. (https://bugs.chromium.org/p/chromium/issues/detail?id=487471)
Project Member

Comment 31 by bugdroid1@chromium.org, Oct 6 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e9fd952d2b376783ae04864e5fcdec647cf2942e

commit e9fd952d2b376783ae04864e5fcdec647cf2942e
Author: sunnyps <sunnyps@chromium.org>
Date: Thu Oct 06 23:03:27 2016

cc: Set sync token on resource even if context is lost.

If the gpu process crashes the worker context will return null sync
tokens which the resource write lock ignores. It's possible for the
compositor context to not detect the gpu process crash until much later.
Until then the scheduler might initiate a draw which will crash when
DCHECKing that resources being sent have sync tokens.

This CL also enables sync tokens for LayerTreeTests by default and
includes some minor cleanup in LayerTreeHostTestContext.

R=brianderson@chromium.org,danakj@chromium.org
BUG= 645736 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2387333003
Cr-Commit-Position: refs/heads/master@{#423720}

[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/resources/resource_provider.cc
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/resources/resource_provider.h
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/test/test_compositor_frame_sink.cc
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/test/test_compositor_frame_sink.h
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/test/test_web_graphics_context_3d.cc
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/trees/layer_tree_host_unittest_context.cc

Project Member

Comment 32 by bugdroid1@chromium.org, Oct 27 2016

Labels: merge-merged-2840
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e9fd952d2b376783ae04864e5fcdec647cf2942e

commit e9fd952d2b376783ae04864e5fcdec647cf2942e
Author: sunnyps <sunnyps@chromium.org>
Date: Thu Oct 06 23:03:27 2016

cc: Set sync token on resource even if context is lost.

If the gpu process crashes the worker context will return null sync
tokens which the resource write lock ignores. It's possible for the
compositor context to not detect the gpu process crash until much later.
Until then the scheduler might initiate a draw which will crash when
DCHECKing that resources being sent have sync tokens.

This CL also enables sync tokens for LayerTreeTests by default and
includes some minor cleanup in LayerTreeHostTestContext.

R=brianderson@chromium.org,danakj@chromium.org
BUG= 645736 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2387333003
Cr-Commit-Position: refs/heads/master@{#423720}

[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/resources/resource_provider.cc
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/resources/resource_provider.h
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/test/test_compositor_frame_sink.cc
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/test/test_compositor_frame_sink.h
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/test/test_web_graphics_context_3d.cc
[modify] https://crrev.com/e9fd952d2b376783ae04864e5fcdec647cf2942e/cc/trees/layer_tree_host_unittest_context.cc

Comment 33 by dimu@google.com, Nov 4 2016

Labels: -merge-merged-2840
[Automated comment] removing mislabelled merge-merged-2840
Project Member

Comment 34 by chromium...@appspot.gserviceaccount.com, Feb 10 2017

Detected 8 new flakes for test/step "context_lost_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.

Comment 35 by kbr@chromium.org, Feb 14 2017

Blocking: 676210
Status: Fixed (was: Assigned)
All of the above flakes in #34 were from recipe changes in  Issue 676210 .

At this point the flakiness of this test suite is well within acceptable levels. There are 0 flakes on most days. Closing as Fixed.

Sign in to add a comment