"SnapshotBrowserTest.SyncMultiWindowTest" is flaky |
||||||
Issue description"SnapshotBrowserTest.SyncMultiWindowTest" is flaky. This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label. We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMgsSBUZsYWtlIidTbmFwc2hvdEJyb3dzZXJUZXN0LlN5bmNNdWx0aVdpbmRvd1Rlc3QM. Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs
,
Jun 18 2018
SnapshotBrowserTest.AsyncMultiWindowTest also flakes on Mac, though much more rarely: https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=content_browsertests&tests=SnapshotBrowserTest.AsyncMultiWindowTest
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7d7cea31e2157d20955c6d698a51e5f167132020 commit 7d7cea31e2157d20955c6d698a51e5f167132020 Author: Marc Treib <treib@chromium.org> Date: Mon Jun 18 10:05:47 2018 Disable flaky SnapshotBrowserTest.SyncMultiWindowTest on Mac TBR=kbr@chromium.org NOTRY=true Bug: 853651 Change-Id: If04328135ede460ae8957e2e135c96ae691ae221 Reviewed-on: https://chromium-review.googlesource.com/1104165 Commit-Queue: Marc Treib <treib@chromium.org> Reviewed-by: Marc Treib <treib@chromium.org> Cr-Commit-Position: refs/heads/master@{#567964} [modify] https://crrev.com/7d7cea31e2157d20955c6d698a51e5f167132020/content/browser/snapshot_browsertest.cc
,
Jun 18 2018
,
Jun 18 2018
Disable seems like the right thing to do if this is consistently flakey. The blamed CL changes timing of things, but fixes some serious stability bugs (so it's exactly the type of patch that we don't want blocked because of timing changes).
,
Jun 20 2018
SnapshotBrowserTest stress tests window snapshotting functionality which is used throughout the GPU tests. This test was carefully de-flaked after it was written, and has been stable for a couple of months. If the ability to take a window snapshot becomes unreliable, a lot of GPU tests on the commit queue will become flaky. Understanding that https://chromium-review.googlesource.com/1100239 fixes stability bugs, it's still critical that this test be re-enabled. From chromium-try-flakes' output: https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMgsSBUZsYWtlIidTbmFwc2hvdEJyb3dzZXJUZXN0LlN5bmNNdWx0aVdpbmRvd1Rlc3QM it's clear that this CL is the cause of the recent failures. Chris, can you please try re-enabling this test in your local build and run it in a loop? It's important that we figure out some change which will stabilize it again.
,
Jun 20 2018
I'll take a look at this locally to see if I can reproduce the flake.
,
Jun 20 2018
This reproduces locally most of the time (enough for debugging to not be hard).
This almost always fails on the last iteration of the loop (we do 20 snapshot tests, and it's usually on #20 that we fail).
Of the 4 renderers that we have going, the sequence of events should be
1. RenderWidgetHostImpl::GetSnapshotFromBrowser
2. RenderWidgetHostImpl::OnForceRedrawComplete
3. RenderWidgetHostImpl::WindowSnapshotReachedScreen
a. Calls RenderWidgetHostView::CopyFromSurface
b. Calls DelegatedFrameHost::CopyFromCompositingSurface, which either
c. Either
i calls DelegatedFrameHost::ProcessCopyOutputRequest, or
ii. enqueues into DelegatedFrameHost::pending_first_frame_requests_
4. RenderWidgetHostImpl::OnSnapshotFromSurfaceReceived
5. SyncSnapshotCallback
Every that this test fails, it fails when we take the branch 3.c.ii. Every time that this test passes, it takes branch 3.c.i.
Every time that we fail to immediately make a copy, it's because client_->DelegatedFrameHostGetLayer()->GetFallbackSurfaceId() exists, but is not a valid id.
Next step is to find out what's making this invalid.
,
Jun 20 2018
The stack trace for when we invalidate the fallback surface (which then causes us to never be able to snapshot the surface) is: cc::SurfaceLayer::SetFallbackSurfaceId(viz::SurfaceId const&) ui::Layer::SetFallbackSurfaceId(viz::SurfaceId const&) content::DelegatedFrameHost::EvictDelegatedFrame() Any tests that are taking snapshots using DelegatedFrameHost::CopyFromCompositingSurface need to disable frame eviction, or they are going to flakily hit this bug.
,
Jun 20 2018
And ... we already have the switch kDisableNewContentRenderingTimeout for exactly this reason.
,
Jun 20 2018
Thanks Chris for tracking this down. Can we dynamically disable frame eviction while DelegatedFrameHost::CopyFromCompositingSurface requests are in flight? The DevTools API (used by the Telemetry based tests) and RenderWidgetHostImpl::GetSnapshotFromBrowser need to obey their contracts; it's not feasible to modify the tests to disable frame eviction all the time.
,
Jun 20 2018
Actually, it's not the FrameEvictor (although that could just as easily be a source of flakiness for this test ... it appears to be off the hook for now). Rather, it's the code that clears the web contents 2 seconds after navigation if no new frames have been received (https://codereview.chromium.org/1339803002), the "new content rendering timeout". This, in turns seems to be because DelegatedFrameHost::OnFirstSurfaceActivation is never hit (which is what resets the timer). We do see HostFrameSinkManager::OnFirstSurfaceActivation being called, but with some other client (being investigated...).
,
Jun 21 2018
#13 was wrong analysis, deleted it to avoid noise. This is a race that we've fixed before, in issue 535375 ! This is the "surface sync is enabled" version of the bug. Here is the way that things work when the test passes: 1. Renderer process navigates a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal - This will go more-or-less directly to the browser process b. Renderer calls SubmitCompositorFrame with the current surface id - This goes off to the viz process, and eventually will land back in the browser 2. The browser process gets the message from [1.a.] a. Comes in as RWHI::DidNavigate b. Calls RWHV::DidNavigate - This will skip allocating a new surface id, because this is the first navigation c. STARTS the NewContentRenderTimeout (at the end of RHWI::DidNavigate) 3. The browser process gets the first from with the id send from [1.b.] a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation b. STOPS the NewContentRenderTimeout (in RHWI::DidReceiveFirstFrameAfterNavigation). So we stop the timer, and so we never evict our frame. But notice how we're relying on [1.a.] arriving at the browser before [1.b.], even though they take completely independent routes to the browser? What ends up happening to us when we get a flake is: 1. Renderer process navigates a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal b. Renderer calls SubmitCompositorFrame with the current surface id 2. The browser process gets the first from with the id send from [1.b.] a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation b. In RHWI::DidReceiveFirstFrameAfterNavigation, we don't stop the NewContentRenderTimeout, because we never started it. 3. The browser process gets the message from [1.a.] a. Comes in as RWHI::DidNavigate b. Calls RWHV::DidNavigate - This will skip allocating a new surface id, because this is the first navigation c. STARTS the new NewContentRenderTimeout (at the end of RHWI::DidNavigate) And we'll never hit OnFirstSurfaceActivation for this surface id again (because we already hit it). What does this have to do with my patch? I pump a run loop that will preferentially run [1.b.] over [1.a.]. But they're completely independent (in theory), so this sequence of events could happen in the wild.
,
Jun 21 2018
,
Jun 21 2018
Patch is up at https://chromium-review.googlesource.com/c/chromium/src/+/1109448 (having a bit of trouble with unit tests, but I think that's resolved now).
,
Jun 22 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c commit 8af34d702f0ddf935b97070e0b01b31d0e4d1b2c Author: Christopher Cameron <ccameron@chromium.org> Date: Fri Jun 22 02:35:29 2018 Fix flakey SnapshotBrowserTest.SyncMultiWindowTest on Mac This is a race between navigation and surfaces on their way to the browser. Here is the way that things work when the test passes: 1. Renderer process navigates a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal - This will go more-or-less directly to the browser process b. Renderer calls SubmitCompositorFrame with the current surface id - This goes off to the viz process, and eventually will land back in the browser 2. The browser process gets the message from [1.a.] a. Comes in as RWHI::DidNavigate b. Calls RWHV::DidNavigate - This will skip allocating a new surface id, because this is the first navigation c. STARTS the NewContentRenderTimeout (at the end of RHWI::DidNavigate) 3. The browser process gets notified of the frame sent in [1.b.] a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation b. STOPS the NewContentRenderTimeout (in RHWI::DidReceiveFirstFrameAfterNavigation). So we stop the timer, and so we don't evict our frame. But notice how we're relying on [1.a.] arriving at the browser before [1.b.], even though they take completely independent routes? What ends up happening to us when we get a flake is: 1. Renderer process navigates a. Renderer calls RenderFrameImpl::DidCommitNavigationInternal b. Renderer calls SubmitCompositorFrame with the current surface id 2. The browser process gets notified of the frame sent in [1.b.] a. Comes in as DelegatedFrameHost::OnFirstSurfaceActivation b. In RHWI::DidReceiveFirstFrameAfterNavigation, we don't stop the NewContentRenderTimeout, because we never started it. 3. The browser process gets the message from [1.a.] a. Comes in as RWHI::DidNavigate b. Calls RWHV::DidNavigate - This will skip allocating a new surface id, because this is the first navigation c. STARTS the new NewContentRenderTimeout (at the end of RHWI::DidNavigate) We'll never hit OnFirstSurfaceActivation for this surface id again (because we already hit it), and so the timeout will always fire. The fix is to change RHWI::DidNavigate to only start the timer when RWHV::DidNavigate changes the surface id. Two reasons: - If we didn't change the surface id, it was because this was the first navigation, and so we don't need to clear the frame. - If we allocate a new surface id, then that will kick the renderer to trigger OnFirstSurfaceActivation that will tell us to stop the timer (and if we don't allocate a new id, we won't necessarily hit OnFirstSurfaceActivation, and we won't stop the timer). R=samans, kbr TBR=piman (for content/ OWNERship) Bug: 853651 Change-Id: I25a8fbf5f6b618b64d764546056fd342e8fc50f9 Reviewed-on: https://chromium-review.googlesource.com/1109448 Reviewed-by: Kenneth Russell <kbr@chromium.org> Reviewed-by: Saman Sami <samans@chromium.org> Cr-Commit-Position: refs/heads/master@{#569513} [modify] https://crrev.com/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c/content/browser/renderer_host/render_widget_host_impl.cc [modify] https://crrev.com/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c/content/browser/renderer_host/render_widget_host_view_aura_unittest.cc [modify] https://crrev.com/8af34d702f0ddf935b97070e0b01b31d0e4d1b2c/content/browser/snapshot_browsertest.cc
,
Jun 22 2018
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by treib@chromium.org
, Jun 18 2018Owner: ccameron@chromium.org
Status: Assigned (was: Untriaged)