New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 798476 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows
Pri: 1
Type: Bug


Sign in to add a comment

"SitePerProcessBrowserTest.*" tests are flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Jan 2 2018

Issue description

"SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySwsSBUZsYWtlIkBTaXRlUGVyUHJvY2Vzc0Jyb3dzZXJUZXN0LktpbGxpbmdSZW5kZXJlckNsZWFyc0Rlc2NlbmRhbnRQcm94aWVzDA.

Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs

This flaky test/step was previously tracked in  issue 693365 .
 
Cc: lukasza@chromium.org
Components: -Tests>Flaky Internals>Sandbox>SiteIsolation
Speculatively assigning to lukasza to find the right owner.
Status: Assigned (was: Untriaged)

Comment 3 by creis@chromium.org, Jan 2 2018

Cc: kenrb@chromium.org creis@chromium.org alex...@chromium.org nzolghadr@chromium.org
Components: Blink>Input
Labels: OS-Windows
Owner: sadrul@chromium.org
Sadrul and Navid, it looks like this is a hit testing crash via RenderWidgetInputHandler::GetWidgetRoutingIdAtPoint.  Could some of your recent work have caused this?  It sounds like a real bug rather than a test that should be disabled.



[ RUN      ] SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies
DevTools listening on ws://127.0.0.1:57030/devtools/browser/a3527897-738b-4956-a9f4-e63e9dc0730c
[3120:4936:0102/093348.602:INFO:dxva_video_decode_accelerator_win.cc(1315)] mf.dll is required for hardware video decoding
[3120:4936:0102/093348.603:INFO:media_foundation_video_encode_accelerator_win.cc(376)] mf.dll is required for encoding
[4904:1380:0102/093349.122:WARNING:render_frame_host_impl.cc(2680)] OnDidStopLoading was called twice.
[4904:1380:0102/093349.413:WARNING:render_frame_host_impl.cc(2680)] OnDidStopLoading was called twice.
[5760:4764:0102/093349.421:FATAL:PaintLayer.cpp(1894)] Check failed: GetLayoutObject().GetDocument().Lifecycle().GetState() >= DocumentLifecycle::kCompositingClean.
Backtrace:
	base::debug::StackTrace::StackTrace [0x00007FF662675F34+36]
	logging::LogMessage::~LogMessage [0x00007FF6625ECBEE+94]
	blink::PaintLayer::HitTestLayer [0x00007FF66432F21F+127]
	blink::PaintLayer::HitTest [0x00007FF66432F06B+459]
	blink::LayoutView::HitTestNoLifecycleUpdate [0x00007FF6641F3CC1+273]
	blink::LayoutView::HitTest [0x00007FF6641F3A55+53]
	blink::EventHandler::HitTestResultAtPoint [0x00007FF6640612DA+602]
	blink::WebFrameWidgetImpl::HitTestResultForRootFramePos [0x00007FF664F441E0+176]
	blink::WebFrameWidgetImpl::CoreHitTestResultAt [0x00007FF664F427E2+226]
	blink::WebFrameWidgetImpl::HitTestResultAt [0x00007FF664F426C8+40]
	content::RenderWidgetInputHandler::GetWidgetRoutingIdAtPoint [0x00007FF6647C7934+132]
	content::InputTargetClientImpl::FrameSinkIdAt [0x00007FF6647341BC+44]
	viz::mojom::InputTargetClientStubDispatch::AcceptWithResponder [0x00007FF660F4C750+256]
	viz::mojom::InputTargetClientStub<mojo::RawPtrImplRefTraits<viz::mojom::InputTargetClient> >::AcceptWithResponder [0x00007FF664734287+55]
	mojo::InterfaceEndpointClient::HandleValidatedMessage [0x00007FF6623A96C9+697]
	mojo::FilterChain::Accept [0x00007FF6623B1B64+148]
	mojo::InterfaceEndpointClient::HandleIncomingMessage [0x00007FF6623AA607+119]
	mojo::internal::MultiplexRouter::ProcessIncomingMessage [0x00007FF6623A5E96+726]
	mojo::internal::MultiplexRouter::Accept [0x00007FF6623A5939+345]
	mojo::FilterChain::Accept [0x00007FF6623B1B64+148]
	mojo::Connector::ReadSingleMessage [0x00007FF6623AEBB6+438]
	mojo::Connector::ReadAllAvailableMessages [0x00007FF6623AF3CD+125]
	mojo::Connector::OnHandleReadyInternal [0x00007FF6623AF238+136]
	mojo::SimpleWatcher::OnHandleReady [0x00007FF662B43E7F+255]
	base::debug::TaskAnnotator::RunTask [0x00007FF662690AC0+288]
	blink::scheduler::TaskQueueManager::ProcessTaskFromWorkQueue [0x00007FF6621F74BE+542]
	blink::scheduler::TaskQueueManager::DoWork [0x00007FF6621F6D28+952]
	base::debug::TaskAnnotator::RunTask [0x00007FF662690AC0+288]
	blink::scheduler::internal::ThreadControllerImpl::DoWork [0x00007FF6621FBFC0+240]
	base::debug::TaskAnnotator::RunTask [0x00007FF662690AC0+288]
	base::internal::IncomingTaskQueue::RunTask [0x00007FF6626B409B+123]
	base::MessageLoop::RunTask [0x00007FF662622279+665]
	base::MessageLoop::DeferOrRunPendingTask [0x00007FF6626226CC+188]
	base::MessageLoop::DoWork [0x00007FF66262293D+557]
	base::MessagePumpDefault::Run [0x00007FF6626B658F+175]
	base::MessageLoop::Run [0x00007FF662621AD9+201]
	base::RunLoop::Run [0x00007FF66260B4D9+249]
	content::RendererMain [0x00007FF664721B70+624]
	content::ContentMainRunnerImpl::Run [0x00007FF661077072+418]
	service_manager::Main [0x00007FF6636A60E1+1201]
	content::ContentMain [0x00007FF66107684E+62]
	content::LaunchTests [0x00007FF6623004E1+369]
	main [0x00007FF6649922E0+72]
	__scrt_common_main_seh [0x00007FF66549A779+285] (f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:283)
	BaseThreadInitThunk [0x00007FFE086C8364+20]
	RtlUserThreadStart [0x00007FFE08D75E91+33]
Received fatal exception EXCEPTION_BREAKPOINT
Backtrace:
	base::debug::BreakDebugger [0x00007FF662696B3D+13]
	logging::LogMessage::~LogMessage [0x00007FF6625ECF5E+974]
	blink::PaintLayer::HitTestLayer [0x00007FF66432F21F+127]
	blink::PaintLayer::HitTest [0x00007FF66432F06B+459]
	blink::LayoutView::HitTestNoLifecycleUpdate [0x00007FF6641F3CC1+273]
	blink::LayoutView::HitTest [0x00007FF6641F3A55+53]
	blink::EventHandler::HitTestResultAtPoint [0x00007FF6640612DA+602]
	blink::WebFrameWidgetImpl::HitTestResultForRootFramePos [0x00007FF664F441E0+176]
	blink::WebFrameWidgetImpl::CoreHitTestResultAt [0x00007FF664F427E2+226]
	blink::WebFrameWidgetImpl::HitTestResultAt [0x00007FF664F426C8+40]
	content::RenderWidgetInputHandler::GetWidgetRoutingIdAtPoint [0x00007FF6647C7934+132]
	content::InputTargetClientImpl::FrameSinkIdAt [0x00007FF6647341BC+44]
	viz::mojom::InputTargetClientStubDispatch::AcceptWithResponder [0x00007FF660F4C750+256]
	viz::mojom::InputTargetClientStub<mojo::RawPtrImplRefTraits<viz::mojom::InputTargetClient> >::AcceptWithResponder [0x00007FF664734287+55]
	mojo::InterfaceEndpointClient::HandleValidatedMessage [0x00007FF6623A96C9+697]
	mojo::FilterChain::Accept [0x00007FF6623B1B64+148]
	mojo::InterfaceEndpointClient::HandleIncomingMessage [0x00007FF6623AA607+119]
	mojo::internal::MultiplexRouter::ProcessIncomingMessage [0x00007FF6623A5E96+726]
	mojo::internal::MultiplexRouter::Accept [0x00007FF6623A5939+345]
	mojo::FilterChain::Accept [0x00007FF6623B1B64+148]
	mojo::Connector::ReadSingleMessage [0x00007FF6623AEBB6+438]
	mojo::Connector::ReadAllAvailableMessages [0x00007FF6623AF3CD+125]
	mojo::Connector::OnHandleReadyInternal [0x00007FF6623AF238+136]
	mojo::SimpleWatcher::OnHandleReady [0x00007FF662B43E7F+255]
	base::debug::TaskAnnotator::RunTask [0x00007FF662690AC0+288]
	blink::scheduler::TaskQueueManager::ProcessTaskFromWorkQueue [0x00007FF6621F74BE+542]
	blink::scheduler::TaskQueueManager::DoWork [0x00007FF6621F6D28+952]
	base::debug::TaskAnnotator::RunTask [0x00007FF662690AC0+288]
	blink::scheduler::internal::ThreadControllerImpl::DoWork [0x00007FF6621FBFC0+240]
	base::debug::TaskAnnotator::RunTask [0x00007FF662690AC0+288]
	base::internal::IncomingTaskQueue::RunTask [0x00007FF6626B409B+123]
	base::MessageLoop::RunTask [0x00007FF662622279+665]
	base::MessageLoop::DeferOrRunPendingTask [0x00007FF6626226CC+188]
	base::MessageLoop::DoWork [0x00007FF66262293D+557]
	base::MessagePumpDefault::Run [0x00007FF6626B658F+175]
	base::MessageLoop::Run [0x00007FF662621AD9+201]
	base::RunLoop::Run [0x00007FF66260B4D9+249]
	content::RendererMain [0x00007FF664721B70+624]
	content::ContentMainRunnerImpl::Run [0x00007FF661077072+418]
	service_manager::Main [0x00007FF6636A60E1+1201]
	content::ContentMain [0x00007FF66107684E+62]
	content::LaunchTests [0x00007FF6623004E1+369]
	main [0x00007FF6649922E0+72]
	__scrt_common_main_seh [0x00007FF66549A779+285] (f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:283)
	BaseThreadInitThunk [0x00007FFE086C8364+20]
	RtlUserThreadStart [0x00007FFE08D75E91+33]
../../content/browser/site_per_process_browsertest.cc(3565): error: Expected equality of these values:
  site_c_url
    Which is: http://baz.com:57031/title1.html
  node4->current_url()
    Which is:
../../content/browser/site_per_process_browsertest.cc(3581): error: Expected equality of these values:
  " Site A ------------ proxies for B C\n" "   |--Site B ------- proxies for A C\n" "   |    +--Site C -- proxies for A B\n" "   +--Site A ------- proxies for B C\n" "Where A = http://a.com/\n" "      B = http://bar.com/\n" "      C = http://baz.com/"
    Which is: " Site A ------------ proxies for B C\n   |--Site B ------- proxies for A C\n   |    +--Site C -- proxies for A B\n   +--Site A ------- proxies for B C\nWhere A = http://a.com/\n      B = http://bar.com/\n      C = http://baz.com/"
  DepictFrameTree(root)
    Which is: " Site A ------------ proxies for B C\n   |--Site B ------- proxies for A C\n   |    +--Site C -- proxies for A B\n   +--Site A ------- proxies for B C\nWhere A = http://a.com/\n      B = http://bar.com/\n      C = http://baz.com/ (no process)"
With diff:
@@ -5,3 +5,3 @@
 Where A = http://a.com/
       B = http://bar.com/
-      C = http://baz.com/
+      C = http://baz.com/ (no process)
[  FAILED  ] SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies, where TypeParam =  and GetParam() =  (3008 ms)
Labels: -Sheriff-Chromium
As this is assigned to an owner, I am taking it out of the sheriff queue.
Owner: nzolghadr@chromium.org
Status: Started (was: Assigned)
Cc: sadrul@chromium.org
Labels: Hotlist-Input-Dev
Project Member

Comment 8 by chromium...@appspot.gserviceaccount.com, Jan 3 2018

Labels: Sheriff-Chromium
Detected 9 new flakes for test/step "SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySwsSBUZsYWtlIkBTaXRlUGVyUHJvY2Vzc0Jyb3dzZXJUZXN0LktpbGxpbmdSZW5kZXJlckNsZWFyc0Rlc2NlbmRhbnRQcm94aWVzDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
Cc: chrishtr@chromium.org dtapu...@chromium.org
Chris, I was wondering whether you can help with this issue.

It seems that we do call UpdateLifecycleToPrePaintClean
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/layout/LayoutView.cpp?gsn=HitTest&l=120

before calling the hittest. But the state of the layout seems to remain stale. When I follow the UpdateLifecycleToPrePaintClean path down I can see some comments about OOPIF and the cases that we may not update the layout. In this case we do have an OOPIF in the test page. But I'm not sure which one of those checks get hit as I can not get this test to fail on my local machine. Do you have any insight about this?
Labels: -Sheriff-Chromium
Project Member

Comment 11 by chromium...@appspot.gserviceaccount.com, Jan 5 2018

Labels: Sheriff-Chromium
Detected 11 new flakes for test/step "SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySwsSBUZsYWtlIkBTaXRlUGVyUHJvY2Vzc0Jyb3dzZXJUZXN0LktpbGxpbmdSZW5kZXJlckNsZWFyc0Rlc2NlbmRhbnRQcm94aWVzDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
That is indeed odd. It should be advancing the lifecycly for that frame, not
sure why it does not. The next step is to reproduce this locally and debug..
Anyone has any suggestion on how to reproduce this locally? I'm just building a debug build and run the test a few times hoping to crash. But all of the runs passed. Am I missing a particular config maybe? I assume the test automatically enables the OOPIF. So I don't need to pass any extra parameters.
Project Member

Comment 14 by chromium...@appspot.gserviceaccount.com, Jan 6 2018

Detected 4 new flakes for test/step "SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySwsSBUZsYWtlIkBTaXRlUGVyUHJvY2Vzc0Jyb3dzZXJUZXN0LktpbGxpbmdSZW5kZXJlckNsZWFyc0Rlc2NlbmRhbnRQcm94aWVzDA. This message was posted automatically by the chromium-try-flakes app.
Project Member

Comment 15 by bugdroid1@chromium.org, Jan 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/24a68053f3548ec8c97442b95e3301c45100e79d

commit 24a68053f3548ec8c97442b95e3301c45100e79d
Author: Alice Boxhall <aboxhall@chromium.org>
Date: Mon Jan 08 02:42:42 2018

Disable flaky test SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies on Win

Disabled test until we can figure out the root cause.

Bug:  798476 
Change-Id: I76bbd8226483a88984880dc014e65dfe96064f80

TBR=nzolghadr@chromium.org

Change-Id: I76bbd8226483a88984880dc014e65dfe96064f80
Reviewed-on: https://chromium-review.googlesource.com/852595
Commit-Queue: Alice Boxhall <aboxhall@chromium.org>
Reviewed-by: Alice Boxhall <aboxhall@chromium.org>
Cr-Commit-Position: refs/heads/master@{#527557}
[modify] https://crrev.com/24a68053f3548ec8c97442b95e3301c45100e79d/content/browser/site_per_process_browsertest.cc

Labels: -Sheriff-Chromium
The test automatically enables OOPIFs and doesn't require any special flags to run, but I haven't been able to repro locally either. :(  I've tried local Linux builds, debug and release, with --enable-pixel-output-in-tests, with flags that are passed to the bots, in a group with other tests -- no luck.  I noticed that the flakes all seem to be on the Win bots, so I also tried it on my Win box - again, no luck, but you might want to try the same, just in case my Win checkout isn't fully up to date.

Other than that, my only suggestion would be to put some debugging info in, upload to gerrit, and run win7_chromium_rel_ng or win10_chromium_x64_rel_ng, to see if you can get to any failures and collect any new info.  I know that's a really inefficient way to debug though. :(

Should've mentioned that I've tried all of #17 with --gtest-repeat=50, and still no failures came up.
Yup. Dave and I were talking yesterday and he suggested the try bot way too.

Comment 20 by creis@chromium.org, Jan 11 2018

Labels: Target-65 M-65 OS-Linux
Any progress?  Looks like this is affecting multiple tests: see  issue 797860  and  issue 801146  for the following tests.
SitePerProcess/TaskManagerOOPIFBrowserTest.KillSubframe/0
SitePerProcess/TaskManagerOOPIFBrowserTest.OrderingOfDependentRows/0

Feel free to dupe those into here to have a single place to investigate, as long as they all get fixed and re-enabled.

Do we have evidence this crash is happening in the wild as well?

Comment 21 by creis@chromium.org, Jan 11 2018

Blocking: 796651
Marking as a blocking bug for the meta bug for OOPIF Viz hit testing ( issue 796651 ).
I haven't seen any report in the wild yet. I'm sending CLs to Gerrit and wait for the try bot to hopefully fail and see what the output is and which DCHECK failed. Every trial takes an hour or two until I get the try bot result back. So it is not a very fast progress.
I enabled the test and tried to get the bots to fail. After a few runs still all the windows bots pass.
https://chromium-review.googlesource.com/c/chromium/src/+/861943
creis@ any idea how to proceed?
Try changing the DCHECK that is failing into a CHECK, in case the bots you
are testing against don't use DCHECK mode.
Also, do you have access tp a windows machine you can build on?

Comment 27 by creis@chromium.org, Jan 12 2018

chrishtr@: Good point-- the try bots might not hit the DCHECKs.

If that doesn't help, you could try landing a CL to re-enable it, possibly with some log statements that indicate what was happening at the time of the crash.  Then if it flakes again, we have more info.
#25-27: I thought the bots all run with dcheck_always_on, so that may not help?  After all, looking at #3, the failure itself is a DCHECK, right?

Another way might be to get access to a trybot VM where this it failing somewhat consistently and debug there.  Might be quite involved, but mentioning it just in case.  See https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access.  I'd try the suggestion from #27 first though.
Is there any chance that the lifecycle is not being updated because of the early-return here [1]? I am not really sure what 'IsActive()' means ... is it possible that the browser sends a targeting request while the page is hidden (with dirty layout)? Note that the targeting requests use a different message-pipe, so the IPCs sent for visibility and targeting request will not necessarily be handled in the same order in the renderer that they were sent in the browser.

[1] https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/frame/LocalFrameView.cpp?type=cs&sq=package:chromium&l=3132
Another option is that the test is done, and it is closing the renderer/page, but there were in-flight targeting requests that get handled after that. Is that possible?

Comment 31 by kenrb@chromium.org, Jan 16 2018

Cc: jonr...@chromium.org nasko@chromium.org
 Issue 802281  has been merged into this issue.

Comment 32 by kenrb@chromium.org, Jan 16 2018

Seeing the same flake on ContentSettingBubbleModelMixedScriptOopifTest.MixedContentInCrossSiteIframe on Win7, apparently.
Project Member

Comment 33 by bugdroid1@chromium.org, Jan 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9238b731beea3586e3c4557b96e942b2d3c8ce25

commit 9238b731beea3586e3c4557b96e942b2d3c8ce25
Author: Sadrul Habib Chowdhury <sadrul@chromium.org>
Date: Tue Jan 16 19:08:06 2018

blink: Replace a DCHECK with a DCHECK_GE.

Use DCHECK_GE, so that when the dcheck hits, we get a little bit more
information about the failure.

BUG= 798476 

Cq-Include-Trybots: master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.linux:linux_layout_tests_slimming_paint_v2
Change-Id: I99620ad9d6ec30577b30349cfb036249e48ec2b4
Reviewed-on: https://chromium-review.googlesource.com/867479
Commit-Queue: Chris Harrelson <chrishtr@chromium.org>
Reviewed-by: Chris Harrelson <chrishtr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#529482}
[modify] https://crrev.com/9238b731beea3586e3c4557b96e942b2d3c8ce25/third_party/WebKit/Source/core/paint/PaintLayer.cpp

Some results from the DCHECK -> DCHECK_GE change:

   Check failed: GetLayoutObject().GetDocument().Lifecycle().GetState() >= DocumentLifecycle::kCompositingClean (10 vs. 13)

(complete logs: https://logs.chromium.org/v/?s=chromium%2Fbb%2Ftryserver.chromium.win%2Fwin10_chromium_x64_rel_ng%2F64016%2F%2B%2Frecipes%2Fsteps%2Fcontent_browsertests__with_patch__on_Windows-10-14393%2F0%2Flogs%2FSitePerProcessBrowserTest.SubframeTouchEventRouting%2F0)

It looks like the state is kLayoutClean, but we expect it to be kCompositingClean. I don't know much about the lifecycle in blink. So it doesn't give me any clues. But maybe someone more familiar with it can speculate what's happening?
Sadrul, were are you seeing these failures? I have enabled the test and also changed the DCHECK to CHECK and I see no failure caused by this test in the try bots.
https://chromium-review.googlesource.com/c/chromium/src/+/861943
Just in random code-reviews when I look at the try failures. (on win7_ or win10_ bots, usually)
But this particular test has been disabled for some time. How did you see that?
That last failure is in a different test (see the logs).

The same crash is affecting a lot of tests, even browser_tets (see, for example https://bugs.chromium.org/p/chromium/issues/detail?id=801146#c7)
@sadrul: can you reproduce the failure locally on any test? That would be very
helpful.
I have unfortunately been unable to repro locally. I have attempted to run a lot of tests in parallel, slow down the process using nice (nice -n 15 ...) to see if that exposes any potential race, but no luck so far. :(
Blocking: 803641
Blocking: 803628
Project Member

Comment 43 by bugdroid1@chromium.org, Jan 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d35abd616a8f88e34f37a28be800694ae741db51

commit d35abd616a8f88e34f37a28be800694ae741db51
Author: Sorin Jianu <sorin@chromium.org>
Date: Fri Jan 19 01:14:17 2018

Disable flaky test SitePerProcessBrowserTest.SubframeTouchEventRouting.


TBR: kenrb@chromium.org
Bug:  798476 ,  802828 
Change-Id: I62f25924477f4d281483063bfb0a56b93252db90
Reviewed-on: https://chromium-review.googlesource.com/875294
Reviewed-by: Sorin Jianu <sorin@chromium.org>
Commit-Queue: Sorin Jianu <sorin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#530374}
[modify] https://crrev.com/d35abd616a8f88e34f37a28be800694ae741db51/content/browser/site_per_process_browsertest.cc

Blocking: 803627

Comment 45 by gab@chromium.org, Jan 19 2018

Issue 802827 has been merged into this issue.

Comment 46 by gab@chromium.org, Jan 19 2018

Cc: wjmaclean@chromium.org
 Issue 802828  has been merged into this issue.

Comment 47 by gab@chromium.org, Jan 19 2018

 Issue 803627  has been merged into this issue.

Comment 48 by gab@chromium.org, Jan 19 2018

Cc: ekaramad@chromium.org
 Issue 803628  has been merged into this issue.

Comment 49 by gab@chromium.org, Jan 19 2018

 Issue 803641  has been merged into this issue.

Comment 50 by gab@chromium.org, Jan 19 2018

Summary: "SitePerProcessBrowserTest.*" tests are flaky (was: "SitePerProcessBrowserTest.KillingRendererClearsDescendantProxies" is flaky)
I'm still seeing flakes related to this on https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/, looks like it affects all of them (many were individually disabled).

Coalesced all bug reports to make sure they're all re-enabled when this is fixed (and also they should all be disabled for now if you're actively working on this as it affects CQ).

Thanks

Comment 51 by kenrb@chromium.org, Jan 19 2018

Thanks. This potentially affects all browser tests that pass input events to out-of-process iframes.

Comment 52 by kenrb@chromium.org, Jan 19 2018

Comment 34 mentions that the OOPIF is layout clean but not compositing clean. That could result from the OOPIF being render throttled when we send it a hit test request, but it is hard to believe that because the OOPIF should be visible in most of these tests, and the tests know not to attempt to route events before we have received a compositor frame from the OOPIF. Still, given the lack of progress on this, it might be worth temporarily disabling OOPIF throttling to see if that resolves the issue.

Alternatively, we could temporarily modify LayoutView::HitTest to stop hit testing if the call to UpdateLifecycleToPrePaintClean() didn't go as far as it needs to. WebFrameWidgetImpl could return the current frame as the hit test result in that case, and I would expect tests to mostly behave normally. We would still need to figure out the underlying problem, but it would save the sheriffs a lot of work and make it so that we still get to keep all of our test coverage.
> That could result from the OOPIF being render throttled when we send it a hit
> test request, but it is hard to believe that because the OOPIF should be
> visible in most of these tests,

See comments #29 and #30 above. The visibility and targeting use two separate message-pipes I believe. So it is possible that these messages are processed out-of-sync in the renderer.
An alternate is to bind the hit testing mojo channel to be an associated channel to the frame input handler channel. This is where the visibility (and all input) messages are sent.

Comment 55 by kenrb@chromium.org, Jan 19 2018

What I mean is that before the OOPIF submits a compositor frame, the browser won't know to do asynchronous hit testing and the InputTargetClient interface won't be used. So it it must have received layout at some point before this crash could occur. I don't know how these frames could go from an unthrottled state to a throttled state, even though throttling might be consistent with this DCHECK.

Does anybody have any idea why we are seeing this behavior on tests that don't even use input? Is it just other sharded tests driving the mouse on the bots and it happening to pass over these tests' windows?


If the window happens to show up right under the cursor, I think we send a mouse-move in those cases too.
Using associated-pipe could help. But it'd be nice for performance reasons if targeting could remain a separate pipe, and not be delayed by other messages.

Comment 58 by kenrb@chromium.org, Jan 19 2018

Cc: dcheng@chromium.org
I've managed occasional repros on a local Windows machine, and have some limited debugger output. It seems that we are sending a hit test query to a frame where LocalFrameView::IsAttached() is returning false. I'm trying to look at lifetimes to understand how this can be happening, because I think detachment from the frame tree is supposed to cause the RenderFrame to be synchronously destroyed, in which case it shouldn't be possible for these messages to arrive.

Comment 59 by kenrb@chromium.org, Jan 19 2018

It doesn't look like this has to do with destruction, rather this appears to be queries getting sent too early. I can observe InputTargetImpl::FrameSinkIdAt() getting called on a frame for which the LocalFrameView has not yet had LocalFrameView::BeginLifecycleUpdates() called -- and that prevents compositing. In that case we hit the DCHECK in PaintLayer.cpp.

It's possible that we hit this problem on OOPIFs but not the main frame because OOPIFs are lifecycle throttled before BeginLifecycleUpdates() is called, and the main frame is not.

I'm not exactly clear on how we get to the state of having a surface with a compositor frame for this renderer and then we send a message to the renderer to learn that it isn't pumping frames. This would be possible in a case where the OOPIF had been rendering one Document and then loaded a new Document (because the LocalFrameView gets cycled), but I don't know why that would be happening in these tests. The initial empty Document, before the real one, shouldn't render.

Comment 60 by kenrb@chromium.org, Jan 19 2018

Blocking: 803944
Project Member

Comment 61 by bugdroid1@chromium.org, Jan 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8547da34bf6cdaaece12d13caa13e461fe7fa54a

commit 8547da34bf6cdaaece12d13caa13e461fe7fa54a
Author: Ken Buchanan <kenrb@chromium.org>
Date: Fri Jan 19 23:56:55 2018

Prevent hit testing frames before lifecycle updates

This is a speculative fix to a problem that is causing flakiness on
site isolation browser tests. It looks like the cause is that hit test
queries are being sent to renderers before the LocalFrameView has begun
lifecycle updates. These fail on out-of-process iframes when the
Document has had layout but has not composited because
BeginLifecycleUpdates() has not yet been called. This triggers a DCHECK
on the hit testing path in PaintLayer.

This change prevents hit testing in that case.

Bug:  798476 
Change-Id: Ie44fd3d5b988e1c758b7472af6444d7b260b15ec
Reviewed-on: https://chromium-review.googlesource.com/876939
Commit-Queue: Ken Buchanan <kenrb@chromium.org>
Reviewed-by: Dave Tapuska <dtapuska@chromium.org>
Cr-Commit-Position: refs/heads/master@{#530669}
[modify] https://crrev.com/8547da34bf6cdaaece12d13caa13e461fe7fa54a/third_party/WebKit/Source/core/frame/LocalFrameView.cpp
[modify] https://crrev.com/8547da34bf6cdaaece12d13caa13e461fe7fa54a/third_party/WebKit/Source/core/frame/LocalFrameView.h
[modify] https://crrev.com/8547da34bf6cdaaece12d13caa13e461fe7fa54a/third_party/WebKit/Source/core/input/EventHandler.cpp

Comment 62 by creis@chromium.org, Jan 20 2018

Blocking: 801146 797860
Owner: kenrb@chromium.org
Listing two more issues that seem to have the same hit testing crash problem.

It's not clear to me that all the bugs gab@ marked as duplicates are the same hit testing crash, though.  For example, issue 802827 doesn't seem to have crashes in the failures:
https://logs.chromium.org/v/?s=chromium%2Fbb%2Ftryserver.chromium.win%2Fwin10_chromium_x64_rel_ng%2F62958%2F%2B%2Frecipes%2Fsteps%2Fcontent_browsertests__with_patch__on_Windows-10-14393%2F0%2Flogs%2FSitePerProcessBrowserTest.InputEventRouterGesturePreventDefaultTargetMapTest%2F0

kenrb@: Can you try to put together a list of all the tests disabled in these various bugs (either duped to this one or listed as blocked on it)?  My guess is that many can be re-enabled after r530669 (assuming that fixes the hit testing crash) and the rest may need a different fix.  Either way, let's aim to get them all re-enabled in M66.  Thanks!

Comment 63 by kenrb@chromium.org, Jan 20 2018

Yes I'll sort through these. And that's true, we've seen a few tests that are just racy and failing because of the higher latency in the new hit test system.
Blockedon: 803972

Comment 65 by kenrb@chromium.org, Jan 22 2018

Blocking: -797860
Did the fix in comment 61 work?

Comment 67 by kenrb@chromium.org, Jan 22 2018

As far as I can tell. There were two tests that had seen flakes from this bug that weren't disabled yet, and I don't any instances since that fix landed.

I will be re-enabling all of the other tests that were flaking with the same crash stack shortly.
Cool!
Project Member

Comment 69 by bugdroid1@chromium.org, Jan 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cacd0085c363f70055d6a20a51ad670a1681c411

commit cacd0085c363f70055d6a20a51ad670a1681c411
Author: Ken Buchanan <kenrb@chromium.org>
Date: Mon Jan 22 20:38:12 2018

Re-enable various SitePerProcessBrowserTests

A bug that was causing flakes in many OOPIF-based browser tests was
fixed in r530669. This re-enables several tests that were disabled
due to that bug.

TBR=msramek@chromium.org

Bug:  798476 , 803641 , 803628 , 803627 , 802828 , 802281 
Change-Id: Ib37c6761cd1ee46d8e49ef393db21209935862dd
Reviewed-on: https://chromium-review.googlesource.com/879101
Commit-Queue: Ken Buchanan <kenrb@chromium.org>
Reviewed-by: Alex Moshchuk <alexmos@chromium.org>
Cr-Commit-Position: refs/heads/master@{#530981}
[modify] https://crrev.com/cacd0085c363f70055d6a20a51ad670a1681c411/chrome/browser/ui/content_settings/content_setting_bubble_model_browsertest.cc
[modify] https://crrev.com/cacd0085c363f70055d6a20a51ad670a1681c411/content/browser/site_per_process_browsertest.cc

Comment 70 by kenrb@chromium.org, Jan 23 2018

Status: Fixed (was: Started)
I haven't seen any flakes since the tests were re-enabled.
Components: Tests>Disabled
Labels: Test-Disabled

Sign in to add a comment