Intermittent failures to restart browser via Telemetry, mainly on macOS
Issue description

Intermittent failures to start the browser on macOS in Telemetry-based test harnesses are being seen. Here are a few examples:

GpuProcess_video:
https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/7
https://chromium-swarm.appspot.com/task?id=33cef397afd77610&refresh=10&show_raw=1

Pixel_GpuRasterization_BlueBox:
https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/13
https://chromium-swarm.appspot.com/task?id=33cfe885c38d0110&refresh=10&show_raw=1

WebglConformance_conformance2_textures_webgl_canvas_tex_2d_rg16f_rg_half_float
WebglConformance_conformance2_textures_webgl_canvas_tex_2d_rgb9_e5_rgb_half_float:
https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/17
https://chromium-swarm.appspot.com/task?id=33d085a0defa1c10&refresh=10&show_raw=1

The stack trace is the same in all cases:

Traceback (most recent call last):
  _RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:68
    self.RunActualGpuTest(url, *args)
  RunActualGpuTest at content/test/gpu/gpu_tests/pixel_integration_test.py:129
    self.RestartBrowserIfNecessaryWithArgs(page.browser_args)
  RestartBrowserIfNecessaryWithArgs at content/test/gpu/gpu_tests/pixel_integration_test.py:97
    cls.tab = cls.browser.tabs[0]
  __getitem__ at third_party/catapult/telemetry/telemetry/internal/browser/tab_list.py:18
    return self._tab_list_backend.__getitem__(index)
  __getitem__ at third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py:64
    return self.GetBackendFromContextId(context_id)
  GetBackendFromContextId at third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py:78
    raise e
TimeoutException:

This means that upon restarting the browser, fetching the first created tab times out occasionally. This could be because of a race condition somewhere in the browser's core code or similar.
This has been seen before in Issue 628022 . I see that a slight restructuring of the code will allow the exception to be caught and the browser restart to be retried properly. Implementing that now.
,
Jan 19 2017
,
Jan 19 2017
This also explains the two recent flakes in Builder: Mac 10.10 Debug (Intel) https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Debug%20%28Intel%29
,
Jan 20 2017
,
Jan 20 2017
,
Jan 20 2017
Note: filed the following bug against Catapult about the TimeoutExceptions inside the backend: https://github.com/catapult-project/catapult/issues/3152
,
Jan 20 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/63505a7684eeca1f5f49c553e677e91bdb0ec67a

commit 63505a7684eeca1f5f49c553e677e91bdb0ec67a
Author: kbr <kbr@chromium.org>
Date: Fri Jan 20 01:42:07 2017

Fetch the first tab inside the browser startup retry loop.

This is the operation which is timing out. Attempt a retry to see if it can be made to work reliably.

This changes the behavior of the unit tests. Removed a forced crash from testSimpleIntegrationUnittest because it was interfering with the execution of some of the other tests. A TODO has been added about better simulating the kinds of errors being seen on the waterfall; a new test needs to be written to cover all of the new code paths. Also removed bogus expectations from testIntegrationUnittestWithBrowserFailure.

BUG= 682819
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=zmo@chromium.org

Review-Url: https://codereview.chromium.org/2643023004
Cr-Commit-Position: refs/heads/master@{#444930}

[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/gpu_integration_test.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/gpu_integration_test_unittest.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/gpu_process_integration_test.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/pixel_integration_test.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/screenshot_sync_integration_test.py
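For readers unfamiliar with the workaround, here is a rough sketch of what "fetch the first tab inside the browser startup retry loop" could look like. It is not the code from the commit above. TabTimeout, FakeTabs, FakeBrowser, start_browser_with_retry and the max_attempts default are all hypothetical names invented for illustration; the real harness uses Telemetry's browser objects and its TimeoutException. Closing the half-started browser before retrying is an assumption based on the later commit description in this thread ("stops the browser if it started, but failed to fetch the first tab").

```python
import logging


class TabTimeout(Exception):
    """Hypothetical stand-in for Telemetry's TimeoutException."""


class FakeTabs(object):
    """Hypothetical tab list whose first lookup can be made to time out."""

    def __init__(self, fail):
        self._fail = fail

    def __getitem__(self, index):
        if self._fail:
            raise TabTimeout("timed out fetching tab %d" % index)
        return "tab-%d" % index


class FakeBrowser(object):
    """Hypothetical browser exposing only what the sketch needs."""

    def __init__(self, fail):
        self.tabs = FakeTabs(fail)
        self.closed = False

    def close(self):
        self.closed = True


def start_browser_with_retry(create_browser, max_attempts=3):
    """Start a browser and fetch its first tab, retrying both as one unit.

    create_browser is any zero-argument callable returning a browser-like
    object. max_attempts must be at least 1 (an assumption of this sketch).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        browser = None
        try:
            browser = create_browser()
            # The tab fetch is the step that was timing out, so it lives
            # inside the try block and a failure triggers a full restart.
            tab = browser.tabs[0]
            return browser, tab
        except TabTimeout as e:
            last_error = e
            logging.warning("Browser start attempt %d/%d failed: %s",
                            attempt, max_attempts, e)
            # Stop the browser that started but never produced a tab.
            if browser is not None:
                browser.close()
    raise last_error
```

Calling start_browser_with_retry(lambda: FakeBrowser(fail=False)) returns the browser together with its first tab; a factory whose first browsers fail exercises the retry path, and if every attempt fails the last timeout is re-raised.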
,
Jan 20 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/cf3747f01f6dec9715b9d8890081e6f60a32eb79

commit cf3747f01f6dec9715b9d8890081e6f60a32eb79
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Jan 20 03:02:55 2017

Roll src/third_party/catapult/ 5c82c9272..49e3f62b2 (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/5c82c9272336..49e3f62b24da

$ git log 5c82c9272..49e3f62b2 --date=short --no-merges --format='%ad %ae %s'
2017-01-19 kbr Add execute_after_browser_creation FakeBrowserFinderOptions hook.

BUG= 682819

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2644273002
Cr-Commit-Position: refs/heads/master@{#444960}

[modify] https://crrev.com/cf3747f01f6dec9715b9d8890081e6f60a32eb79/DEPS
,
Jan 20 2017
,
Jan 20 2017
,
Jan 20 2017
,
Jan 20 2017
,
Jan 20 2017
In Issue 682844 there was one very interesting tryjob:
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/373762

This one failed on Linux in the same way failures have been seen on macOS:

WebglConformance_conformance_attribs_gl_bindAttribLocation_aliasing (gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest) ...
(WARNING) 2017-01-18 21:24:47,647 desktop_browser_backend._GetAllCrashpadMinidumps:365 No path to crashpad_database_util found
(INFO) 2017-01-18 21:24:47,647 desktop_browser_backend._GetMostRecentMinidump:438 No minidump found via crashpad_database_util
(WARNING) 2017-01-18 21:24:47,647 desktop_browser_backend._GetAllCrashpadMinidumps:365 No path to crashpad_database_util found
(INFO) 2017-01-18 21:24:47,647 desktop_browser_backend._GetMostRecentMinidump:438 No minidump found via crashpad_database_util
Can't get standard output with --show-stdout
(INFO) 2017-01-18 21:24:47,648 desktop_browser_backend.HasBrowserFinishedLaunching:238 Discovered ephemeral port 42163
(ERROR) 2017-01-18 21:24:47,649 gpu_integration_test._EnsureTabIsAvailable:186 Failure during browser startup
Traceback (most recent call last):
  File "/b/s/w/irV99GdO/content/test/gpu/gpu_tests/gpu_integration_test.py", line 182, in _EnsureTabIsAvailable
    cls.tab = cls.browser.tabs[0]
  File "/b/s/w/irV99GdO/third_party/catapult/telemetry/telemetry/internal/browser/tab_list.py", line 18, in __getitem__
    return self._tab_list_backend.__getitem__(index)
  File "/b/s/w/irV99GdO/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 64, in __getitem__
    return self.GetBackendFromContextId(context_id)
  File "/b/s/w/irV99GdO/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 78, in GetBackendFromContextId
    raise e
TimeoutException:

Full stdout attached.
With some of the failures seen in Issue 682834 , where it looks like the renderer process hung during execution and not just during browser startup, it's looking more and more like there's an actual bug that was introduced recently in the renderer process.
,
Jan 20 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/be437ee31e429d47e9c66da37e6dc889ac015440

commit be437ee31e429d47e9c66da37e6dc889ac015440
Author: kbr <kbr@chromium.org>
Date: Fri Jan 20 08:18:51 2017

Add test simulating a renderer process crash after browser startup.

This exercises the new code path in GpuIntegrationTest.StartBrowser which stops the browser if it started, but failed to fetch the first tab.

BUG= 682819
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=zmo@chromium.org

Review-Url: https://codereview.chromium.org/2646083002
Cr-Commit-Position: refs/heads/master@{#445015}

[modify] https://crrev.com/be437ee31e429d47e9c66da37e6dc889ac015440/content/test/gpu/gpu_tests/gpu_integration_test_unittest.py
,
Jan 20 2017
I suspect this may be unrelated, but we had strange problems trying to land https://codereview.chromium.org/2546423002/ leading to mysterious renderer hangs (possibly due to task starvation). Anyway, I've been breaking that up into small patches and landed a bunch yesterday. Based on the timeline I've tried to construct, I think it's unlikely they're to blame, but I thought I'd mention them just in case.

Flake observed # 444307 https://build.chromium.org/p/tryserver.chromium.win/builders/win_optional_gpu_tests_rel/builds/6722
https://codereview.chromium.org/2642823003/ (#444491) This patch seems an unlikely root cause
Flake observed # 444504 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_optional_gpu_tests_rel/builds/6460
Flake observed # 444583 https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Debug%20%28Intel%29/builds/22182
Flake observed # 444651 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/371735
Flake observed # 444683 https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/7
Flake observed # 444705 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/371914
https://codereview.chromium.org/2644723003/ (#444708)
Flake observed # 444713 https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/13
Flake observed # 444733 https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/17
https://codereview.chromium.org/2640763003/ (#444739) This one is a bit more risky, but the flakes seem to have started before it landed
https://codereview.chromium.org/2644553002/ (#444780)
Flake observed # 444983 https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Debug%20%28Intel%29/builds/22260
https://codereview.chromium.org/2640303002/ (#445029)
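The timeline argument in this comment can be checked mechanically. The sketch below copies the commit positions from the comment and asks which of the listed patches landed at or before the earliest observed flake. It assumes that "Flake observed # N" means the flake was seen at commit position N, and that a patch can only be a root cause if it landed no later than the first flake; the variable and function names are made up for illustration.

```python
# Commit positions at which flakes were observed, copied from the timeline.
flake_positions = [
    444307, 444504, 444583, 444651, 444683,
    444705, 444713, 444733, 444983,
]

# Landed commit position of each patch, keyed by its codereview issue number.
patch_positions = {
    "2642823003": 444491,
    "2644723003": 444708,
    "2640763003": 444739,
    "2644553002": 444780,
    "2640303002": 445029,
}


def flakes_before(patch_position, flakes):
    """Return the flakes observed strictly before a patch landed."""
    return [f for f in flakes if f < patch_position]


first_flake = min(flake_positions)

# A patch is a root-cause candidate only if it landed no later than the
# first observed flake (assumption stated in the lead-in).
candidates = sorted(
    review for review, position in patch_positions.items()
    if position <= first_flake
)

for review, position in sorted(patch_positions.items(), key=lambda kv: kv[1]):
    earlier = len(flakes_before(position, flake_positions))
    print("%s landed at #%d after %d observed flake(s)"
          % (review, position, earlier))
print("root-cause candidates:", candidates)
```

Under those assumptions every listed patch landed after the flake at #444307, so the candidate list comes out empty, which matches the conclusion that these patches are unlikely to be to blame.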
,
Jan 20 2017
Issue 682833 has been merged into this issue.
,
Jan 20 2017
Trybots are flaking too.
,
Jan 20 2017
Detected 8 new flakes for test/step "gpu_process_launch_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMAsSBUZsYWtlIiVncHVfcHJvY2Vzc19sYXVuY2hfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.
,
Jan 21 2017
,
Jan 23 2017
Add new pixel wrangler to cc.
,
Jan 23 2017
Adding RBB for M58. Do we have any evidence this impacts M57 as well?
,
Jan 23 2017
There have been flakes since Jan 19, so it might be in M57. Would this be observed as a crash? What is a good way to check if this is in M57?
,
Jan 23 2017
Issue 682844 has been merged into this issue.
,
Jan 23 2017
,
Jan 23 2017
The workaround that has been put in place for this issue seems to have stopped the huge number of reports. There is no longer evidence on chromium-try-flakes that the GPU tests are flaking. There hasn't been any progress on finding renderer process hangs in general; I posted about this on chromium-dev and there were basically no replies. Presumably there is a watchdog that would kill a hung renderer, so that this would show up as a spike in the crash database. There's only so much the GPU team can do on this front, so I'm closing this bug as fixed.
,
Jan 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/c3b4a2f32560ad82a6892a93f64bb260def54c51

commit c3b4a2f32560ad82a6892a93f64bb260def54c51
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Thu Jan 26 20:04:33 2017

Roll src/third_party/catapult/ e1e778d78..7a2a837ac (29 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/e1e778d78de1..7a2a837ac3ae

$ git log e1e778d78..7a2a837ac --date=short --no-merges --format='%ad %ae %s'
2017-01-26 benjhayden Translate RelatedHistogramSet to python.
2017-01-26 benjhayden Translate RelatedEventSet to python.
2017-01-26 yolandyan Revert of Change apk_helper.py for apk with multi instrumentations and JUnit4 (patchset #10 id:180001 of https://codereview.chromium.org/2632763003/ )
2017-01-26 nednguyen Update labels to tag in story_set_smoke_test
2017-01-26 hjd [tracing] Cache number formatters in Unit
2017-01-25 dtu [pinpoint] RunTest (Swarming) Quest and Execution.
2017-01-25 charliea Set DISABLE_CLOUD_STORAGE_IO back after psuedo lock tests
2017-01-25 alexandermont Make the whole story power metric not depend on Chrome trace.
2017-01-25 benjhayden Fix TelemetryInfo.
2017-01-25 benjhayden Redesign breakdown-span.
2017-01-25 benjhayden Translate DeviceInfo to python.
2017-01-25 benjhayden Translate TelemetryInfo to python.
2017-01-25 benjhayden Allow metrics to resegment the UserModel.
2017-01-25 benjhayden Translate BuildbotInfo to python.
2017-01-24 simonhatch Dashboard - Remove some old queues.
2017-01-24 sullivan Add ref build back into charts on /group_report page.
2017-01-24 benjhayden Translate Diagnostics to Python.
2017-01-24 benjhayden Make trace2html accept gzipped trace json files in addition to unzipped files.
2017-01-24 benjhayden Add Segments to the UserModel.
2017-01-24 alexandermont Fix function scope bug in tquery.
2017-01-24 charliea Fix bug where stale lock file can cause cloud storage timeouts
2017-01-24 benjhayden Improve BarChart and ColumnChart hover boxes.
2017-01-24 yolandyan Change apk_helper.py for apk with multi instrumentations and JUnit4
2017-01-24 kbr Only display 200 lines of syslog upon sub-process crash on macOS.
2017-01-24 kraynov Fix wrong upload of memtrack_helper for arm64 CPU.
2017-01-23 zheda.chen Change smoothness frame-times metrics on CrOS
2017-01-23 benjhayden Delete systemHealthMetrics meta-metric.
2017-01-23 simonhatch Dashboard - Fix output when tests fail to produce output.
2017-01-23 nednguyen [Telemetry] Remove labels field from story.Story constructor & labels related flags

BUG= 682005 , 682005 , 682819 ,672780, 675846 , 683998

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2654253003
Cr-Commit-Position: refs/heads/master@{#446416}

[modify] https://crrev.com/c3b4a2f32560ad82a6892a93f64bb260def54c51/DEPS
Comment 1 by kbr@chromium.org
, Jan 19 2017