Issue metadata
Sign in to add a comment
|
ServiceWorkerProviderContextTest.SetControllerServiceWorker is flaky |
||||||||||||||||||||||
Issue descriptionWe are seeing content_unittests flake fairly often on the Fuchsia/x64 FYI bot, e.g. in https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/126. The test is reported to have "timed out", and from the failure log output it seems that the test hung, though unfortunately we don't have a stack-trace to show which point it hung at. This is very recent, seemingly around the time of https://chromium-review.googlesource.com/1109662
,
Jul 11
Re #1: I see two failures, both on July 9th, so those would be after the revert you linked to.
,
Jul 11
,
Jul 20
Is there a way to see historical results from the Fuschia bot, like in the flakiness dashboard? Fuchsia builders don't look included. The last content_unittests failure I see is https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/270 on July 12 which was last week. Was the test disabled or is the flake rate just low (or the test was possibly mysteriously fixed)? I guess I can try to build Fuchsia.
,
Jul 20
Ahhh it was disabled in fuchsia.content_unittests.filter.
,
Jul 20
Ah no, ServiceWorkerProviderHostTest.* is disabled, which is different. Looks enabled still. Trying to build and try locally.
,
Jul 20
The test passed 100/100 times: out/fuchsia/bin/run_content_unittests --gtest_repeat=100 --gtest_filter="ServiceWorkerProviderContextTest.SetControllerServiceWorker" Since the bot also seems to show no flakes in the past week, I'll WontFix this.
,
Jul 25
Failed again in https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-arm64-rel/470 - note that that QEMU emulating ARM64 in software, so it'll be unusually slow.
,
Aug 2
Test hung again in https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/1251 There is most likely some obscure race-condition dependent on delayed task timings or thread scheduling; it'll be easier to diagnose once we have the hang dumps re-implemented. I'll disable this test for now.
,
Aug 6
This test is flaking everywhere, not just Fuchsia. In fact it's flaking less on Fuchsia than the others: https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=content_unittests&showAllRuns=true&tests=ServiceWorkerProviderContextTest.SetControllerServiceWorker I'd like to investigate instead of disabling. I have a patch to add debug output at: https://chromium-review.googlesource.com/c/chromium/src/+/1163046 bashi or kinuko: If either of you have cycles can you pick this up? I'm crawling out of crash bugs finally and now trying to add UMA for the lost main resource requests (though it's possible that this failing unit test is related). It might be possible to repro simply on Linux with many iterations.
,
Aug 6
I'll try to repro later.
,
Aug 6
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0b039b3fe2ca9173e7f158496673d7faf7126b06 commit 0b039b3fe2ca9173e7f158496673d7faf7126b06 Author: Matt Falkenhagen <falken@chromium.org> Date: Mon Aug 06 08:40:05 2018 Add logging for flaky ServiceWorkerProviderContextTest.SetControllerServiceWorker. This test has flaky timeouts on Fuchsia bots. Bug: 862294 Change-Id: I19e164ecbac86347b9756e55c92e7a59da05dddd Reviewed-on: https://chromium-review.googlesource.com/1163046 Reviewed-by: Kinuko Yasuda <kinuko@chromium.org> Commit-Queue: Matt Falkenhagen <falken@chromium.org> Cr-Commit-Position: refs/heads/master@{#580825} [modify] https://crrev.com/0b039b3fe2ca9173e7f158496673d7faf7126b06/content/renderer/service_worker/service_worker_provider_context_unittest.cc
,
Aug 6
Re #10: If you look at e.g. the Linux flake reports, you'll see that when that test fails content_unittests takes 7m to run, whereas normally it takes 12s - the test itself either takes ~3ms or hangs. Rather than have all of the bots (and try-bots) hitting a 56x slow-down on this step, we should at least add an explicit timeout in the test, e.g. after 1-2s. Since it is already a succeed-or-hang flake, that shouldn't impact likelihood of failure, but it will reduce impact. Log output from a failed run with the new logging: [ RUN ] ServiceWorkerProviderContextTest.SetControllerServiceWorker [21656:21656:0806/091318.342821:26732422255:ERROR:service_worker_provider_context_unittest.cc(486)] 1 test ctor [21656:21656:0806/091318.345060:26732424491:ERROR:service_worker_provider_context_unittest.cc(513)] 1 RunUntilIdle after ctor [21656:21656:0806/091318.345167:26732424592:ERROR:service_worker_provider_context_unittest.cc(515)] 1 RunUntilIdle after ctor finished [21656:21656:0806/091318.347336:26732426765:ERROR:service_worker_provider_context_unittest.cc(527)] 1 loop1.Run() [21656:21656:0806/091318.351937:26732431372:ERROR:service_worker_provider_context_unittest.cc(529)] 1 loop1.Run() finished [21656:21656:0806/091318.352053:26732431482:ERROR:service_worker_provider_context_unittest.cc(535)] 2 reset controller to new one [21656:21656:0806/091318.353108:26732432535:ERROR:service_worker_provider_context_unittest.cc(555)] 2 RunUntilIdle after SetController [21656:21656:0806/091318.354379:26732433809:ERROR:service_worker_provider_context_unittest.cc(557)] 2 RunUntilIdle after SetController finished [21656:21656:0806/091318.355719:26732435151:ERROR:service_worker_provider_context_unittest.cc(572)] 2 loop2.Run() [5541/5543] ServiceWorkerProviderContextTest.SetControllerServiceWorker (TIMED OUT) So we're failing to see DispatchFetchEvent() (see https://cs.chromium.org/chromium/src/content/renderer/service_worker/service_worker_provider_context_unittest.cc?l=240) called, some of the time. Looking at the test, we seem to create multiple Mojo AssociatedInterface requests with dedicated channels, which IIUC means that there is no ordering guarantee across the channels - so could the StartRequest() call internals get re-ordered against the SetController() call?
,
Aug 7
Thanks for the log. I'm trying to untangle this very complicated test now. I think possible errors could be the use of RunUntilIdle and the associated interfaces you mention. Unfortunately I don't see an easy way to add an explicit timeout. I see some existing patterns that race a single RunLoop with a timeout, but not a simple one for the entire test case. This test has many RunLoops and RunUntilIdle so it may be difficult. Do you know of a way? Also try as I might, I couldn't repro locally still.
,
Aug 7
I wasn't able to repro locally too. Fyi here is another log for failed run on CrWinGomaStaging. [8112:2292:0806/185533.451:4464826:ERROR:service_worker_provider_context_unittest.cc(486)] 1 test ctor [8112:2292:0806/185533.451:4464826:ERROR:service_worker_provider_context_unittest.cc(513)] 1 RunUntilIdle after ctor [8112:2292:0806/185533.451:4464826:ERROR:service_worker_provider_context_unittest.cc(515)] 1 RunUntilIdle after ctor finished [8112:2292:0806/185533.452:4464826:ERROR:service_worker_provider_context_unittest.cc(527)] 1 loop1.Run() [8112:2292:0806/185533.452:4464826:ERROR:service_worker_provider_context_unittest.cc(529)] 1 loop1.Run() finished [8112:2292:0806/185533.452:4464826:ERROR:service_worker_provider_context_unittest.cc(535)] 2 reset controller to new one [8112:2292:0806/185533.452:4464826:ERROR:service_worker_provider_context_unittest.cc(555)] 2 RunUntilIdle after SetController [8112:2292:0806/185533.452:4464826:ERROR:service_worker_provider_context_unittest.cc(557)] 2 RunUntilIdle after SetController finished [8112:2292:0806/185533.452:4464826:ERROR:service_worker_provider_context_unittest.cc(572)] 2 loop2.Run() [8112:2292:0806/185533.454:4464826:ERROR:service_worker_provider_context_unittest.cc(574)] 2 loop2.Run() finished [8112:2292:0806/185533.454:4464826:ERROR:service_worker_provider_context_unittest.cc(581)] 3 reset controller to null [8112:2292:0806/185533.454:4464826:ERROR:service_worker_provider_context_unittest.cc(587)] 3 RunUntilIdle() [8112:2292:0806/185533.454:4464826:ERROR:service_worker_provider_context_unittest.cc(589)] 3 RunUntilIdle() finished [8112:2292:0806/185533.454:4464826:ERROR:service_worker_provider_context_unittest.cc(602)] 3 loop3.Run() [5587/5596] ServiceWorkerProviderContextTest.SetControllerServiceWorker (TIMED OUT)
,
Aug 7
Updating the controller does look a bit suspicious. In production code it's a Mojo IPC to ServiceWorkerProviderContext::SetController() which does antoher Mojo call to ControllerServiceWorkerConnector::UpdateController() on another thread. In the test I think the process and thread are the same and we use a single RunUntilIdle() to finish all the calls. It may be that's nondeterministic? Adding an explicit event for the controller updating somehow may be better.
,
Aug 7
,
Aug 7
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9c02c1b092d0c68241b296917678b8397f1d9a0f commit 9c02c1b092d0c68241b296917678b8397f1d9a0f Author: Matt Falkenhagen <falken@chromium.org> Date: Tue Aug 07 23:16:00 2018 Try to deflake ServiceWorkerProviderContextTest.SetControllerServiceWorker. Replace vague RunUntilIdle() calls with FlushForTesting() and explicit events. Bug: 862294 Change-Id: I76e6cf9501b5d480960f431316e385bd7bf9941d Reviewed-on: https://chromium-review.googlesource.com/1164737 Commit-Queue: Matt Falkenhagen <falken@chromium.org> Reviewed-by: Kenichi Ishibashi <bashi@chromium.org> Reviewed-by: Kinuko Yasuda <kinuko@chromium.org> Cr-Commit-Position: refs/heads/master@{#581387} [modify] https://crrev.com/9c02c1b092d0c68241b296917678b8397f1d9a0f/content/renderer/BUILD.gn [modify] https://crrev.com/9c02c1b092d0c68241b296917678b8397f1d9a0f/content/renderer/service_worker/service_worker_provider_context.cc [modify] https://crrev.com/9c02c1b092d0c68241b296917678b8397f1d9a0f/content/renderer/service_worker/service_worker_provider_context.h [modify] https://crrev.com/9c02c1b092d0c68241b296917678b8397f1d9a0f/content/renderer/service_worker/service_worker_provider_context_unittest.cc [add] https://crrev.com/9c02c1b092d0c68241b296917678b8397f1d9a0f/content/renderer/service_worker/service_worker_provider_state_for_client.cc [add] https://crrev.com/9c02c1b092d0c68241b296917678b8397f1d9a0f/content/renderer/service_worker/service_worker_provider_state_for_client.h
,
Aug 8
,
Aug 8
Still flaking: [4533:4533:0807/211518.900633:3579182482:ERROR:service_worker_provider_context_unittest.cc(518)] 1 test ctor [4533:4533:0807/211518.902720:3579184443:ERROR:service_worker_provider_context_unittest.cc(555)] 1 loop1.Run() [4533:4533:0807/211518.905193:3579186914:ERROR:service_worker_provider_context_unittest.cc(557)] 1 loop1.Run() finished [4533:4533:0807/211518.905238:3579186957:ERROR:service_worker_provider_context_unittest.cc(563)] 2 reset controller to new one [4533:4533:0807/211518.905850:3579187571:ERROR:service_worker_provider_context_unittest.cc(587)] 2 FlushForTesting() [4533:4533:0807/211518.906780:3579188501:ERROR:service_worker_provider_context_unittest.cc(589)] 2 FlushForTesting() finished [4533:4533:0807/211518.906822:3579188541:ERROR:service_worker_provider_context_unittest.cc(590)] 2 drop_binding_loop.Run() [4533:4533:0807/211518.906854:3579188573:ERROR:service_worker_provider_context_unittest.cc(592)] 2 drop_binding_loop.Run() finished [4533:4533:0807/211518.906892:3579188610:ERROR:service_worker_provider_context_unittest.cc(605)] 2 FlushControllerConnector() [4533:4533:0807/211518.908743:3579190473:ERROR:service_worker_provider_context_unittest.cc(607)] 2 FlushControllerConnector() finished [4533:4533:0807/211518.909334:3579191055:ERROR:service_worker_provider_context_unittest.cc(614)] 2 loop2.Run() [4533:4533:0807/211518.910400:3579192121:ERROR:service_worker_provider_context_unittest.cc(616)] 2 loop2.Run() finished [4533:4533:0807/211518.910443:3579192162:ERROR:service_worker_provider_context_unittest.cc(623)] 3 reset controller to null [4533:4533:0807/211518.910547:3579192266:ERROR:service_worker_provider_context_unittest.cc(631)] 3 FlushForTesting() [4533:4533:0807/211518.911259:3579192980:ERROR:service_worker_provider_context_unittest.cc(633)] 3 FlushForTesting() finished [4533:4533:0807/211518.911299:3579193018:ERROR:service_worker_provider_context_unittest.cc(634)] 3 drop_binding_loop2.Run() [4533:4533:0807/211518.911333:3579193051:ERROR:service_worker_provider_context_unittest.cc(636)] 3 drop_binding_loop2.Run() finished [4533:4533:0807/211518.911903:3579193624:ERROR:service_worker_provider_context_unittest.cc(649)] 3 loop3.Run()
,
Aug 9
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6865f0d7fb281e0660b23736e80c5db4be81cca8 commit 6865f0d7fb281e0660b23736e80c5db4be81cca8 Author: Matt Falkenhagen <falken@chromium.org> Date: Thu Aug 09 10:31:56 2018 Try more to deflake ServiceWorkerProviderContextTest.SetControllerServiceWorker I neglected to flush one of the interfaces in r581387. Bug: 862294 Change-Id: Ifaa93c99b160916f34520fa6f840976e0847ae97 Reviewed-on: https://chromium-review.googlesource.com/1168956 Commit-Queue: Matt Falkenhagen <falken@chromium.org> Commit-Queue: Kinuko Yasuda <kinuko@chromium.org> Reviewed-by: Kinuko Yasuda <kinuko@chromium.org> Cr-Commit-Position: refs/heads/master@{#581852} [modify] https://crrev.com/6865f0d7fb281e0660b23736e80c5db4be81cca8/content/renderer/service_worker/service_worker_provider_context_unittest.cc
,
Aug 10
It looks fixed since r581852. Long live FlushForTesting! Down with RunUntilIdle! Now I just need to remove the LOG(ERROR). Will do that next week if it's still green.
,
Aug 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b3af3b319d9289213586da04f703dc3608f373b5 commit b3af3b319d9289213586da04f703dc3608f373b5 Author: Matt Falkenhagen <falken@chromium.org> Date: Mon Aug 13 04:53:34 2018 Removing debugging for ServiceWorkerProviderContextTest.SetControllerServiceWorker. The test no longer flakes since the bug was fixed. Bug: 862294 Change-Id: I4802c3a5d09ea487459a93726a841ae4dd7c437a Reviewed-on: https://chromium-review.googlesource.com/1172168 Reviewed-by: Kinuko Yasuda <kinuko@chromium.org> Commit-Queue: Matt Falkenhagen <falken@chromium.org> Cr-Commit-Position: refs/heads/master@{#582505} [modify] https://crrev.com/b3af3b319d9289213586da04f703dc3608f373b5/content/renderer/service_worker/service_worker_provider_context_unittest.cc
,
Aug 13
,
Aug 13
The NextAction date has arrived: 2018-08-13 |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by falken@chromium.org
, Jul 11