New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 853651 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug

Blocking:
issue 853438
issue 853762



Sign in to add a comment

"SnapshotBrowserTest.SyncMultiWindowTest" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Jun 18 2018

Issue description

"SnapshotBrowserTest.SyncMultiWindowTest" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMgsSBUZsYWtlIidTbmFwc2hvdEJyb3dzZXJUZXN0LlN5bmNNdWx0aVdpbmRvd1Rlc3QM.

Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs
 

Comment 1 by treib@chromium.org, Jun 18 2018

Labels: OS-Mac
Owner: ccameron@chromium.org
Status: Assigned (was: Untriaged)
Flakiness dashboard: https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=content_browsertests&tests=SnapshotBrowserTest.SyncMultiWindowTest

The test has been flaky on Mac since at least 06-14, likely longer (the dashboard just doesn't go further back). It's gotten a lot worse today though. First failures:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac10.11%20Tests/27446
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac10.12%20Tests/13823
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac10.13%20Tests/3595

Maybe caused by https://chromium-review.googlesource.com/c/chromium/src/+/1100239? Feel free to unassign if that's unrelated.

In the meantime, I'll disable the test on Mac.

Comment 2 by treib@chromium.org, Jun 18 2018

Cc: kbr@chromium.org
SnapshotBrowserTest.AsyncMultiWindowTest also flakes on Mac, though much more rarely:
https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=content_browsertests&tests=SnapshotBrowserTest.AsyncMultiWindowTest
Project Member

Comment 3 by bugdroid1@chromium.org, Jun 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7d7cea31e2157d20955c6d698a51e5f167132020

commit 7d7cea31e2157d20955c6d698a51e5f167132020
Author: Marc Treib <treib@chromium.org>
Date: Mon Jun 18 10:05:47 2018

Disable flaky SnapshotBrowserTest.SyncMultiWindowTest on Mac

TBR=kbr@chromium.org
NOTRY=true

Bug:  853651 
Change-Id: If04328135ede460ae8957e2e135c96ae691ae221
Reviewed-on: https://chromium-review.googlesource.com/1104165
Commit-Queue: Marc Treib <treib@chromium.org>
Reviewed-by: Marc Treib <treib@chromium.org>
Cr-Commit-Position: refs/heads/master@{#567964}
[modify] https://crrev.com/7d7cea31e2157d20955c6d698a51e5f167132020/content/browser/snapshot_browsertest.cc

Comment 4 by treib@chromium.org, Jun 18 2018

Labels: -Sheriff-Chromium SnapshotBrowserTest.SyncMultiWindowTest Test-Disabled
Disable seems like the right thing to do if this is consistently flakey.

The blamed CL changes timing of things, but fixes some serious stability bugs (so it's exactly the type of patch that we don't want blocked because of timing changes).

Comment 6 by kbr@chromium.org, Jun 20 2018

Blocking: 853438
SnapshotBrowserTest stress tests window snapshotting functionality which is used throughout the GPU tests. This test was carefully de-flaked after it was written, and has been stable for a couple of months. If the ability to take a window snapshot becomes unreliable, a lot of GPU tests on the commit queue will become flaky.

Understanding that https://chromium-review.googlesource.com/1100239 fixes stability bugs, it's still critical that this test be re-enabled. From chromium-try-flakes' output:
https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMgsSBUZsYWtlIidTbmFwc2hvdEJyb3dzZXJUZXN0LlN5bmNNdWx0aVdpbmRvd1Rlc3QM

it's clear that this CL is the cause of the recent failures.

Chris, can you please try re-enabling this test in your local build and run it in a loop? It's important that we figure out some change which will stabilize it again.

I'll take a look at this locally to see if I can reproduce the flake.
This reproduces locally most of the time (enough for debugging to not be hard).

This almost always fails on the last iteration of the loop (we do 20 snapshot tests, and it's usually on #20 that we fail).

Of the 4 renderers that we have going, the sequence of events should be
 1. RenderWidgetHostImpl::GetSnapshotFromBrowser
 2. RenderWidgetHostImpl::OnForceRedrawComplete
 3. RenderWidgetHostImpl::WindowSnapshotReachedScreen
     a. Calls RenderWidgetHostView::CopyFromSurface
     b. Calls DelegatedFrameHost::CopyFromCompositingSurface, which either
     c. Either
        i  calls DelegatedFrameHost::ProcessCopyOutputRequest, or
        ii. enqueues into DelegatedFrameHost::pending_first_frame_requests_
 4. RenderWidgetHostImpl::OnSnapshotFromSurfaceReceived
 5. SyncSnapshotCallback

Every that this test fails, it fails when we take the branch 3.c.ii. Every time that this test passes, it takes branch 3.c.i.

Every time that we fail to immediately make a copy, it's because client_->DelegatedFrameHostGetLayer()->GetFallbackSurfaceId() exists, but is not a valid id.

Next step is to find out what's making this invalid.
The stack trace for when we invalidate the fallback surface (which then causes us to never be able to snapshot the surface) is:
  cc::SurfaceLayer::SetFallbackSurfaceId(viz::SurfaceId const&)
  ui::Layer::SetFallbackSurfaceId(viz::SurfaceId const&)
  content::DelegatedFrameHost::EvictDelegatedFrame()

Any tests that are taking snapshots using DelegatedFrameHost::CopyFromCompositingSurface need to disable frame eviction, or they are going to flakily hit this bug.
And ... we already have the switch kDisableNewContentRenderingTimeout for exactly this reason.

Comment 11 by kbr@chromium.org, Jun 20 2018

Thanks Chris for tracking this down.

Can we dynamically disable frame eviction while DelegatedFrameHost::CopyFromCompositingSurface requests are in flight? The DevTools API (used by the Telemetry based tests) and RenderWidgetHostImpl::GetSnapshotFromBrowser need to obey their contracts; it's not feasible to modify the tests to disable frame eviction all the time.

Actually, it's not the FrameEvictor (although that could just as easily be a source of flakiness for this test ... it appears to be off the hook for now).

Rather, it's the code that clears the web contents 2 seconds after navigation if no new frames have been received (https://codereview.chromium.org/1339803002), the "new content rendering timeout".

This, in turns seems to be because DelegatedFrameHost::OnFirstSurfaceActivation is never hit (which is what resets the timer). We do see HostFrameSinkManager::OnFirstSurfaceActivation being called, but with some other client (being investigated...).

Comment 13 Deleted

#13 was wrong analysis, deleted it to avoid noise.

This is a race that we've fixed before, in  issue 535375 ! This is the "surface sync is enabled" version of the bug.

Here is the way that things work when the test passes:

 1. Renderer process navigates
    a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal
       - This will go more-or-less directly to the browser process
    b. Renderer calls SubmitCompositorFrame with the current surface id
       - This goes off to the viz process, and eventually will land back in
         the browser
 2. The browser process gets the message from [1.a.]
    a. Comes in as RWHI::DidNavigate
    b. Calls RWHV::DidNavigate
       - This will skip allocating a new surface id, because this is the first
         navigation
    c. STARTS the NewContentRenderTimeout (at the end of RHWI::DidNavigate)
 3. The browser process gets the first from with the id send from [1.b.]
    a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation
    b. STOPS the NewContentRenderTimeout (in
       RHWI::DidReceiveFirstFrameAfterNavigation).

So we stop the timer, and so we never evict our frame.
 
But notice how we're relying on [1.a.] arriving at the browser before [1.b.], even though they take completely independent routes to the browser?

What ends up happening to us when we get a flake is:

 1. Renderer process navigates
    a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal
    b. Renderer calls SubmitCompositorFrame with the current surface id
 2. The browser process gets the first from with the id send from [1.b.]
    a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation
    b. In RHWI::DidReceiveFirstFrameAfterNavigation, we don't stop the 
       NewContentRenderTimeout, because we never started it.
 3. The browser process gets the message from [1.a.]
    a. Comes in as RWHI::DidNavigate
    b. Calls RWHV::DidNavigate
       - This will skip allocating a new surface id, because this is the first
         navigation
    c. STARTS the new NewContentRenderTimeout (at the end of RHWI::DidNavigate)

And we'll never hit OnFirstSurfaceActivation for this surface id again (because we already hit it).

What does this have to do with my patch? I pump a run loop that will preferentially run [1.b.] over [1.a.]. But they're completely independent (in theory), so this sequence of events could happen in the wild.

Comment 15 by kbr@chromium.org, Jun 21 2018

Blocking: 853762
Patch is up at https://chromium-review.googlesource.com/c/chromium/src/+/1109448 (having a bit of trouble with unit tests, but I think that's resolved now).
Project Member

Comment 17 by bugdroid1@chromium.org, Jun 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c

commit 8af34d702f0ddf935b97070e0b01b31d0e4d1b2c
Author: Christopher Cameron <ccameron@chromium.org>
Date: Fri Jun 22 02:35:29 2018

Fix flakey SnapshotBrowserTest.SyncMultiWindowTest on Mac

This is a race between navigation and surfaces on their way to the
browser.

Here is the way that things work when the test passes:
 1. Renderer process navigates
    a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal
       - This will go more-or-less directly to the browser process
    b. Renderer calls SubmitCompositorFrame with the current surface id
       - This goes off to the viz process, and eventually will land
         back in the browser
 2. The browser process gets the message from [1.a.]
    a. Comes in as RWHI::DidNavigate
    b. Calls RWHV::DidNavigate
       - This will skip allocating a new surface id, because this is the
         first navigation
    c. STARTS the NewContentRenderTimeout (at the end of
       RHWI::DidNavigate)
 3. The browser process gets notified of the frame sent in [1.b.]
    a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation
    b. STOPS the NewContentRenderTimeout (in
       RHWI::DidReceiveFirstFrameAfterNavigation).
So we stop the timer, and so we don't evict our frame.

But notice how we're relying on [1.a.] arriving at the browser before
[1.b.], even though they take completely independent routes?

What ends up happening to us when we get a flake is:
 1. Renderer process navigates
    a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal
    b. Renderer calls SubmitCompositorFrame with the current surface id
 2. The browser process gets notified of the frame sent in [1.b.]
    a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation
    b. In RHWI::DidReceiveFirstFrameAfterNavigation, we don't stop the
       NewContentRenderTimeout, because we never started it.
 3. The browser process gets the message from [1.a.]
    a. Comes in as RWHI::DidNavigate
    b. Calls RWHV::DidNavigate
       - This will skip allocating a new surface id, because this is the
         first navigation
    c. STARTS the new NewContentRenderTimeout (at the end of
       RHWI::DidNavigate)
We'll never hit OnFirstSurfaceActivation for this surface id again
(because we already hit it), and so the timeout will always fire.

The fix is to change RHWI::DidNavigate to only start the timer when
RWHV::DidNavigate changes the surface id. Two reasons:
  - If we didn't change the surface id, it was because this was the
    first navigation, and so we don't need to clear the frame.
  - If we allocate a new surface id, then that will kick the renderer
    to trigger OnFirstSurfaceActivation that will tell us to stop the
    timer (and if we don't allocate a new id, we won't necessarily hit
    OnFirstSurfaceActivation, and we won't stop the timer).

R=samans, kbr
TBR=piman (for content/ OWNERship)

Bug:  853651 
Change-Id: I25a8fbf5f6b618b64d764546056fd342e8fc50f9
Reviewed-on: https://chromium-review.googlesource.com/1109448
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Saman Sami <samans@chromium.org>
Cr-Commit-Position: refs/heads/master@{#569513}
[modify] https://crrev.com/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c/content/browser/renderer_host/render_widget_host_impl.cc
[modify] https://crrev.com/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c/content/browser/renderer_host/render_widget_host_view_aura_unittest.cc
[modify] https://crrev.com/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c/content/browser/snapshot_browsertest.cc

Status: Fixed (was: Assigned)

Sign in to add a comment