
Issue 682819 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug

Blocked on:
issue 628022
issue 681433

Blocking:
issue 636153
issue 682832
issue 682834
issue 682844
issue 682906




Intermittent failures to restart browser via Telemetry, mainly on macOS

Project Member Reported by kbr@chromium.org, Jan 19 2017

Issue description

Intermittent failures to start the browser on macOS in Telemetry-based test harnesses are being seen. Here are a few examples:

GpuProcess_video:
https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/7
https://chromium-swarm.appspot.com/task?id=33cef397afd77610&refresh=10&show_raw=1

Pixel_GpuRasterization_BlueBox:
https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/13
https://chromium-swarm.appspot.com/task?id=33cfe885c38d0110&refresh=10&show_raw=1

WebglConformance_conformance2_textures_webgl_canvas_tex_2d_rg16f_rg_half_float
WebglConformance_conformance2_textures_webgl_canvas_tex_2d_rgb9_e5_rgb_half_float:
https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/17
https://chromium-swarm.appspot.com/task?id=33d085a0defa1c10&refresh=10&show_raw=1

The stack trace is the same in all cases:

Traceback (most recent call last):
_RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:68
self.RunActualGpuTest(url, *args)
RunActualGpuTest at content/test/gpu/gpu_tests/pixel_integration_test.py:129
self.RestartBrowserIfNecessaryWithArgs(page.browser_args)
RestartBrowserIfNecessaryWithArgs at content/test/gpu/gpu_tests/pixel_integration_test.py:97
cls.tab = cls.browser.tabs[0]
__getitem__ at third_party/catapult/telemetry/telemetry/internal/browser/tab_list.py:18
return self._tab_list_backend.__getitem__(index)
__getitem__ at third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py:64
return self.GetBackendFromContextId(context_id)
GetBackendFromContextId at third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py:78
raise e
TimeoutException: 


This means that, after the browser is restarted, fetching the first tab occasionally times out. A race condition somewhere in the browser's core code is one possible cause.

This has been seen before in Issue 628022. A slight restructuring of the code will allow the exception to be caught and the browser restart to be retried properly. Implementing that now.
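
For illustration only, here is a minimal sketch of the kind of restructuring described above: the first-tab fetch moves inside the retry loop so that a timeout there triggers another restart attempt. This is not the actual Telemetry or gpu_integration_test.py code. `TimeoutException` is a local stand-in class, and `create_browser`, `start_browser_with_retries`, and `max_attempts` are hypothetical names chosen for the example.

```python
import logging


class TimeoutException(Exception):
    """Local stand-in for the timeout seen in the stack trace (assumption)."""


def start_browser_with_retries(create_browser, max_attempts=3):
    """Start a browser and fetch its first tab, retrying the whole sequence.

    create_browser is a hypothetical zero-argument factory returning an
    object with a `tabs` sequence and a `Close()` method.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    last_error = None
    for attempt in range(1, max_attempts + 1):
        browser = None
        try:
            browser = create_browser()
            # Fetching the first tab is the step that times out, so it is
            # done inside the try block rather than after the loop.
            tab = browser.tabs[0]
            return browser, tab
        except TimeoutException as e:
            last_error = e
            logging.warning('Browser start attempt %d failed', attempt)
            if browser is not None:
                # The browser started but its tab was unreachable; stop it
                # before the next attempt.
                browser.Close()
    raise last_error
```

The point of the sketch is only the placement of `browser.tabs[0]`: with the fetch inside the loop, a timeout is handled like any other startup failure instead of escaping to the test.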

 

Comment 1 by kbr@chromium.org, Jan 19 2017

Blockedon: 681433

Comment 2 by kbr@chromium.org, Jan 19 2017

Cc: weiliangc@chromium.org
Labels: Hotlist-PixelWrangler
This also explains the two recent flakes in Builder: Mac 10.10 Debug (Intel)
https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Debug%20%28Intel%29


Comment 4 by kbr@chromium.org, Jan 20 2017

Status: Started (was: Assigned)

Comment 5 by kbr@chromium.org, Jan 20 2017

Blocking: 682906

Comment 6 by kbr@chromium.org, Jan 20 2017

Note: filed the following bug against Catapult about the TimeoutExceptions inside the backend:
https://github.com/catapult-project/catapult/issues/3152


Comment 7 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/63505a7684eeca1f5f49c553e677e91bdb0ec67a

commit 63505a7684eeca1f5f49c553e677e91bdb0ec67a
Author: kbr <kbr@chromium.org>
Date: Fri Jan 20 01:42:07 2017

Fetch the first tab inside the browser startup retry loop.

This is the operation which is timing out. Attempt a retry to see if
it can be made to work reliably.

This changes the behavior of the unit tests. Removed a forced crash
from testSimpleIntegrationUnittest because it was interfering with the
execution of some of the other tests. A TODO has been added about
better simulating the kinds of errors being seen on the waterfall; a
new test needs to be written to cover all of the new code paths. Also
removed bogus expectations from testIntegrationUnittestWithBrowserFailure.

BUG= 682819 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=zmo@chromium.org

Review-Url: https://codereview.chromium.org/2643023004
Cr-Commit-Position: refs/heads/master@{#444930}

[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/gpu_integration_test.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/gpu_integration_test_unittest.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/gpu_process_integration_test.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/pixel_integration_test.py
[modify] https://crrev.com/63505a7684eeca1f5f49c553e677e91bdb0ec67a/content/test/gpu/gpu_tests/screenshot_sync_integration_test.py


Comment 8 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cf3747f01f6dec9715b9d8890081e6f60a32eb79

commit cf3747f01f6dec9715b9d8890081e6f60a32eb79
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Jan 20 03:02:55 2017

Roll src/third_party/catapult/ 5c82c9272..49e3f62b2 (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/5c82c9272336..49e3f62b24da

$ git log 5c82c9272..49e3f62b2 --date=short --no-merges --format='%ad %ae %s'
2017-01-19 kbr Add execute_after_browser_creation FakeBrowserFinderOptions hook.

BUG= 682819 

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2644273002
Cr-Commit-Position: refs/heads/master@{#444960}

[modify] https://crrev.com/cf3747f01f6dec9715b9d8890081e6f60a32eb79/DEPS

Comment 9 by kbr@chromium.org, Jan 20 2017

Cc: kbr@chromium.org
 Issue 682741  has been merged into this issue.

Comment 10 by kbr@chromium.org, Jan 20 2017

Blocking: 682834

Comment 11 by kbr@chromium.org, Jan 20 2017

Blocking: 682832

Comment 12 by kbr@chromium.org, Jan 20 2017

Blocking: 682844

Comment 13 by kbr@chromium.org, Jan 20 2017

Labels: -OS-Mac OS-All
Summary: Intermittent failures to restart browser via Telemetry, mainly on macOS (was: Intermittent failures to restart browser on macOS via Telemetry)
In  Issue 682844  there was one very interesting tryjob:
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/373762

This one failed on Linux in the same way failures have been seen on macOS:

WebglConformance_conformance_attribs_gl_bindAttribLocation_aliasing (gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest) ... (WARNING) 2017-01-18 21:24:47,647 desktop_browser_backend._GetAllCrashpadMinidumps:365  No path to crashpad_database_util found
(INFO) 2017-01-18 21:24:47,647 desktop_browser_backend._GetMostRecentMinidump:438  No minidump found via crashpad_database_util
(WARNING) 2017-01-18 21:24:47,647 desktop_browser_backend._GetAllCrashpadMinidumps:365  No path to crashpad_database_util found
(INFO) 2017-01-18 21:24:47,647 desktop_browser_backend._GetMostRecentMinidump:438  No minidump found via crashpad_database_util
Can't get standard output with --show-stdout
(INFO) 2017-01-18 21:24:47,648 desktop_browser_backend.HasBrowserFinishedLaunching:238  Discovered ephemeral port 42163
(ERROR) 2017-01-18 21:24:47,649 gpu_integration_test._EnsureTabIsAvailable:186  Failure during browser startup
Traceback (most recent call last):
  File "/b/s/w/irV99GdO/content/test/gpu/gpu_tests/gpu_integration_test.py", line 182, in _EnsureTabIsAvailable
    cls.tab = cls.browser.tabs[0]
  File "/b/s/w/irV99GdO/third_party/catapult/telemetry/telemetry/internal/browser/tab_list.py", line 18, in __getitem__
    return self._tab_list_backend.__getitem__(index)
  File "/b/s/w/irV99GdO/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 64, in __getitem__
    return self.GetBackendFromContextId(context_id)
  File "/b/s/w/irV99GdO/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 78, in GetBackendFromContextId
    raise e
TimeoutException: 


Full stdout attached.

Some of the failures seen in Issue 682834 look like the renderer process hung during execution, not just during browser startup. It's looking more and more like an actual bug was recently introduced in the renderer process.

stdout.txt
3.8 MB

Comment 14 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/be437ee31e429d47e9c66da37e6dc889ac015440

commit be437ee31e429d47e9c66da37e6dc889ac015440
Author: kbr <kbr@chromium.org>
Date: Fri Jan 20 08:18:51 2017

Add test simulating a renderer process crash after browser startup.

This exercises the new code path in GpuIntegrationTest.StartBrowser
which stops the browser if it started, but failed to fetch the first
tab.

BUG= 682819 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=zmo@chromium.org

Review-Url: https://codereview.chromium.org/2646083002
Cr-Commit-Position: refs/heads/master@{#445015}

[modify] https://crrev.com/be437ee31e429d47e9c66da37e6dc889ac015440/content/test/gpu/gpu_tests/gpu_integration_test_unittest.py
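
As a rough illustration of the scenario this commit describes (the browser starts, fetching the first tab fails, and the browser is stopped), a test can be sketched with plain fakes. Nothing below is taken from gpu_integration_test_unittest.py. `FakeTimeout`, `FakeBrowser`, `GetFirstTab`, and this simplified `StartBrowser` function are hypothetical stand-ins, not Telemetry APIs.

```python
import unittest


class FakeTimeout(Exception):
    """Hypothetical stand-in for the timeout raised while fetching a tab."""


class FakeBrowser(object):
    """Hypothetical fake: always starts, but tab access may time out."""

    def __init__(self, tab_fetch_fails):
        self._tab_fetch_fails = tab_fetch_fails
        self.closed = False

    def GetFirstTab(self):
        if self._tab_fetch_fails:
            raise FakeTimeout()
        return 'first-tab'

    def Close(self):
        self.closed = True


def StartBrowser(browsers):
    """Try each pre-built fake in turn, stopping any whose tab fetch fails."""
    for browser in browsers:
        try:
            return browser, browser.GetFirstTab()
        except FakeTimeout:
            # The browser came up but is unusable, so stop it and move on.
            browser.Close()
    raise FakeTimeout()


class StartBrowserTest(unittest.TestCase):

    def testStopsBrowserWhenFirstTabFetchFails(self):
        bad, good = FakeBrowser(True), FakeBrowser(False)
        browser, tab = StartBrowser([bad, good])
        self.assertIs(browser, good)
        self.assertEqual(tab, 'first-tab')
        self.assertTrue(bad.closed)
        self.assertFalse(good.closed)

    def testRaisesWhenEveryAttemptFails(self):
        browsers = [FakeBrowser(True), FakeBrowser(True)]
        with self.assertRaises(FakeTimeout):
            StartBrowser(browsers)
        self.assertTrue(all(b.closed for b in browsers))
```

The assertion on `bad.closed` is the part that corresponds to the new code path: a browser that started but could not serve its first tab must be stopped before the next attempt.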

I suspect this may be unrelated, but we had strange problems trying to land https://codereview.chromium.org/2546423002/, which led to mysterious renderer hangs (possibly due to task starvation). I've been breaking that up into small patches and landed a bunch yesterday.

Based on the timeline I've tried to construct, I think it's unlikely they're to blame, but I thought I'd mention them just in case.

Flake observed # 444307 https://build.chromium.org/p/tryserver.chromium.win/builders/win_optional_gpu_tests_rel/builds/6722

https://codereview.chromium.org/2642823003/  (#444491)  This patch seems an unlikely root cause

Flake observed # 444504 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_optional_gpu_tests_rel/builds/6460
Flake observed # 444583 https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Debug%20%28Intel%29/builds/22182
Flake observed # 444651 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/371735
Flake observed # 444683 https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/7
Flake observed # 444705 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/371914

https://codereview.chromium.org/2644723003/  (#444708)

Flake observed # 444713 https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/13
Flake observed # 444733 https://build.chromium.org/p/client.v8.fyi/builders/Mac%20Release%20%28Intel%29/builds/17

https://codereview.chromium.org/2640763003/  (#444739) This one is a bit riskier, but the flakes seem to have started before it landed
https://codereview.chromium.org/2644553002/  (#444780)

Flake observed # 444983 https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Debug%20%28Intel%29/builds/22260

https://codereview.chromium.org/2640303002/  (#445029)
 Issue 682833  has been merged into this issue.
Trybots are flaking too.

Comment 18 by chromium...@appspot.gserviceaccount.com, Jan 20 2017

Detected 8 new flakes for test/step "gpu_process_launch_tests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMAsSBUZsYWtlIiVncHVfcHJvY2Vzc19sYXVuY2hfdGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.
Blocking: 636153
Cc: ajuma@chromium.org
Add new pixel wrangler to cc.
Cc: pinkerton@chromium.org
Labels: ReleaseBlock-Beta M-58
Adding RBB for M58.

Do we have any evidence this impacts M57 as well?
There have been flakes on Jan 19, so it might be in M57. Would this be observed as a crash? What is a good way to check whether this is in M57?

Comment 23 by kbr@chromium.org, Jan 23 2017

 Issue 682844  has been merged into this issue.

Comment 24 by kbr@chromium.org, Jan 23 2017

Cc: zmo@chromium.org
 Issue 682832  has been merged into this issue.

Comment 25 by kbr@chromium.org, Jan 23 2017

Labels: -ReleaseBlock-Beta -M-58
Status: Fixed (was: Started)
The workaround that has been put in place for this issue seems to have stopped the huge number of reports. There is no longer evidence on chromium-try-flakes that the GPU tests are flaking.

There hasn't been any progress on finding renderer process hangs in general; I posted about this on chromium-dev and there were basically no replies. Presumably there is a watchdog that would kill a hung renderer, so this would show up as a spike in the crash database.

There's only so much the GPU team can do on this front so I'm closing this bug as fixed.


Comment 26 by bugdroid1@chromium.org, Jan 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c3b4a2f32560ad82a6892a93f64bb260def54c51

commit c3b4a2f32560ad82a6892a93f64bb260def54c51
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Thu Jan 26 20:04:33 2017

Roll src/third_party/catapult/ e1e778d78..7a2a837ac (29 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/e1e778d78de1..7a2a837ac3ae

$ git log e1e778d78..7a2a837ac --date=short --no-merges --format='%ad %ae %s'
2017-01-26 benjhayden Translate RelatedHistogramSet to python.
2017-01-26 benjhayden Translate RelatedEventSet to python.
2017-01-26 yolandyan Revert of Change apk_helper.py for apk with multi instrumentations and JUnit4 (patchset #10 id:180001 of https://codereview.chromium.org/2632763003/ )
2017-01-26 nednguyen Update labels to tag in story_set_smoke_test
2017-01-26 hjd [tracing] Cache number formatters in Unit
2017-01-25 dtu [pinpoint] RunTest (Swarming) Quest and Execution.
2017-01-25 charliea Set DISABLE_CLOUD_STORAGE_IO back after psuedo lock tests
2017-01-25 alexandermont Make the whole story power metric not depend on Chrome trace.
2017-01-25 benjhayden Fix TelemetryInfo.
2017-01-25 benjhayden Redesign breakdown-span.
2017-01-25 benjhayden Translate DeviceInfo to python.
2017-01-25 benjhayden Translate TelemetryInfo to python.
2017-01-25 benjhayden Allow metrics to resegment the UserModel.
2017-01-25 benjhayden Translate BuildbotInfo to python.
2017-01-24 simonhatch Dashboard - Remove some old queues.
2017-01-24 sullivan Add ref build back into charts on /group_report page.
2017-01-24 benjhayden Translate Diagnostics to Python.
2017-01-24 benjhayden Make trace2html accept gzipped trace json files in addition to unzipped files.
2017-01-24 benjhayden Add Segments to the UserModel.
2017-01-24 alexandermont Fix function scope bug in tquery.
2017-01-24 charliea Fix bug where stale lock file can cause cloud storage timeouts
2017-01-24 benjhayden Improve BarChart and ColumnChart hover boxes.
2017-01-24 yolandyan Change apk_helper.py for apk with multi instrumentations and JUnit4
2017-01-24 kbr Only display 200 lines of syslog upon sub-process crash on macOS.
2017-01-24 kraynov Fix wrong upload of memtrack_helper for arm64 CPU.
2017-01-23 zheda.chen Change smoothness frame-times metrics on CrOS
2017-01-23 benjhayden Delete systemHealthMetrics meta-metric.
2017-01-23 simonhatch Dashboard - Fix output when tests fail to produce output.
2017-01-23 nednguyen [Telemetry] Remove labels field from story.Story constructor & labels related flags

BUG= 682005 , 682005 , 682819 ,672780, 675846 , 683998 

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2654253003
Cr-Commit-Position: refs/heads/master@{#446416}

[modify] https://crrev.com/c3b4a2f32560ad82a6892a93f64bb260def54c51/DEPS
