Timeout failures in GPU FYI bots
Issue description

Multiple FYI GPU bots are having errors:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Experimental%20Release%20%28Intel%29/2089
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28NVIDIA%29/1026
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20FYI%20x64%20Release%20%28NVIDIA%29/952
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Experimental%20Release%20%28Intel%29/2089
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Debug%20%28NVIDIA%29/1495
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20%28NVIDIA%29/1335
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20GPU%20ASAN%20Release/801

In each case, webgl2_conformance_tests are failing on a different test across a range of devices:

[111/138] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_deqp_functional_gles3_shadermatrix_add_assign failed unexpectedly 618.6246s:
[133/139] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_deqp_functional_gles3_texturespecification_basic_teximage3d_2d_array_02 failed unexpectedly 322.8495s:
[424/462] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_conformance_more_conformance_constants failed unexpectedly 324.2782s:
[60/136] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_conformance_glsl_functions_glsl_function_clamp_float failed unexpectedly 317.1177s:
[113/140] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_deqp_functional_gles3_shaderoperator_angle_and_trigonometry_03 failed unexpectedly 320.3420s:
[109/138] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_deqp_functional_gles3_shadermatrix_add_const failed unexpectedly 317.7220s:
[60/136] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_conformance_glsl_functions_glsl_function_clamp_float failed unexpectedly 317.1177s:

Each such failure looks like this:

Traceback (most recent call last):
  _RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:132
    self.RunActualGpuTest(url, *args)
  RunActualGpuTest at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:188
    getattr(self, test_name)(test_path, *args[1:])
  _RunConformanceTest at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:202
    self._CheckTestCompletion()
  _CheckTestCompletion at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:196
    'webglTestHarness._finished', timeout=self._GetTestTimeout())
  traced_function at third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52
    return func(*args, **kwargs)
  WaitForJavaScriptCondition at third_party/catapult/telemetry/telemetry/internal/actions/action_runner.py:261
    return self._tab.WaitForJavaScriptCondition(*args, **kwargs)
  traced_function at third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52
    return func(*args, **kwargs)
  WaitForJavaScriptCondition at third_party/catapult/telemetry/telemetry/internal/browser/web_contents.py:239
    return self._inspector_backend.WaitForJavaScriptCondition(*args, **kwargs)
  traced_function at third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52
    return func(*args, **kwargs)
  WaitForJavaScriptCondition at third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py:302
    e.message + '\n' + debug_message)
TimeoutException: Timed out while waiting 300s for IsJavaScriptExpressionTrue.
Console output:

Locals:
  IsJavaScriptExpressionTrue : <function IsJavaScriptExpressionTrue at 0x105997cf8>
  condition : 'webglTestHarness._finished'
  context_id : None
  debug_message : 'Console output:\n'
  e : TimeoutException('Timed out while waiting 300s for IsJavaScriptExpressionTrue.',)
  kwargs : {}
  timeout : 300

Found crashpad_database_util
No minidump found via crashpad_database_util
No minidump paths to symbolize
Found crashpad_database_util
No minidump found via crashpad_database_util
Restarting browser due to unexpected test failure
Closing browser (pid=8299) ...
Browser is closed.
Starting Chrome ['/b/s/w/ir/out/Release/Chromium.app/Contents/MacOS/Chromium', '--disable-gpu-watchdog', '--enable-experimental-web-platform-features', '--test-type=gpu', '--disable-domain-blocking-for-3d-apis', '--disable-gpu-process-crash-limit', '--disable-blink-features=WebXR', '--js-flags=--expose-gc', '--enable-logging=stderr', '--autoplay-policy=no-user-gesture-required', '--use-cmd-decoder=validating', '--enable-net-benchmarking', '--metrics-recording-only', '--no-default-browser-check', '--no-first-run', '--enable-gpu-benchmarking', '--deny-permission-prompts', '--autoplay-policy=no-user-gesture-required', '--disable-background-networking', '--disable-component-extensions-with-background-pages', '--disable-default-apps', '--disable-search-geolocation-disclosure', '--proxy-server=socks://localhost:53758', '--ignore-certificate-errors-spki-list=PhrPvGIaAMmd29hj8BCZOq096yj7uMpRNHpn5PDxI6I=', '--remote-debugging-port=0', '--enable-crash-reporter-for-testing', '--disable-component-update', '--window-size=1280,1024', '--user-data-dir=/b/s/w/itZPMhfX/tmpqw7awd', 'about:blank']
DoNothingForwarder started between 127.0.0.1:53901 and 53901
Got devtools config: ws://127.0.0.1:53901/devtools/browser/90db85be-1408-40ba-aa82-a1bde1b5e56f
Browser started (pid=8379).
OS: mac highsierra
Detailed OS version: 10.13.4 17E139j
Model: Macmini 7.1
Browser command line: /b/s/w/ir/out/Release/Chromium.app/Contents/MacOS/Chromium --disable-gpu-watchdog --enable-experimental-web-platform-features --test-type=gpu --disable-domain-blocking-for-3d-apis --disable-gpu-process-crash-limit --disable-blink-features=WebXR --js-flags=--expose-gc --enable-logging=stderr --autoplay-policy=no-user-gesture-required --use-cmd-decoder=validating --enable-net-benchmarking --metrics-recording-only --no-default-browser-check --no-first-run --enable-gpu-benchmarking --deny-permission-prompts --autoplay-policy=no-user-gesture-required --disable-background-networking --disable-component-extensions-with-background-pages --disable-default-apps --disable-search-geolocation-disclosure --proxy-server=socks://localhost:53758 --ignore-certificate-errors-spki-list=PhrPvGIaAMmd29hj8BCZOq096yj7uMpRNHpn5PDxI6I= --remote-debugging-port=0 --enable-crash-reporter-for-testing --disable-component-update --window-size=1280,1024 --user-data-dir=/b/s/w/itZPMhfX/tmpqw7awd --flag-switches-begin --flag-switches-end about:blank
GPU device 0: VENDOR = 0x8086 (Intel), DEVICE = 0xa2e

GPU Attributes:
  amd_switchable : False
  can_support_threaded_texture_mailbox: False
  direct_composition : False
  direct_rendering : True
  driver_date :
  driver_vendor :
  driver_version :
  encrypted_only : False
  gl_extensions :
  gl_renderer :
  gl_reset_notification_strategy: 0
  gl_vendor :
  gl_version :
  gl_ws_extensions :
  gl_ws_vendor :
  gl_ws_version :
  in_process_gpu : False
  initialization_time : 0.0902
  jpeg_decode_accelerator_supported: False
  max_framerate_denominator: 1
  max_framerate_numerator: 30
  max_msaa_samples :
  max_resolution_height: 2160
  max_resolution_width: 4096
  min_resolution_height: 16
  min_resolution_width: 16
  optimus : False
  passthrough_cmd_decoder: False
  pixel_shader_version:
  process_crash_count : 0
  profile : 3
  sandboxed : True
  software_rendering : False
  supports_overlays : False
  vertex_shader_version:
  video_decode_accelerator_flags: 0

Feature Status:
  2d_canvas : enabled
  flash_3d : enabled
  flash_stage3d : enabled
  flash_stage3d_baseline: enabled
  gpu_compositing : enabled
  multiple_raster_threads: enabled_on
  native_gpu_memory_buffers: enabled
  rasterization : enabled
  surface_synchronization: enabled_on
  video_decode : enabled
  viz_display_compositor: disabled_off
  webgl : enabled
  webgl2 : enabled

Driver Bug Workarounds:
  add_and_true_to_loop_condition
  adjust_src_dst_region_for_blitframebuffer
  avoid_stencil_buffers
  decode_encode_srgb_for_generatemipmap
  disable_framebuffer_cmaa
  disable_webgl_rgb_multisampling_usage
  dont_use_loops_to_initialize_variables
  emulate_abs_int_function
  get_frag_data_info_bug
  init_two_cube_map_levels_before_copyteximage
  max_msaa_sample_count_4
  msaa_is_slow
  pack_parameters_workaround_with_pack_buffer
  rebind_transform_feedback_before_resume
  regenerate_struct_names
  remove_invariant_and_centroid_for_essl3
  reset_teximage2d_base_level
  rewrite_texelfetchoffset_to_texelfetch
  scalarize_vec_and_mat_constructor_args
  set_zero_level_before_generating_mipmap
  unfold_short_circuit_as_ternary_operation
  unpack_alignment_workaround_with_unpack_buffer
  unpack_image_height_workaround_with_unpack_buffer
  use_intermediary_for_copy_texture_image
  use_unused_standard_shared_blocks

Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/testing/serially_executed_browser_test_case.py", line 206, in <lambda>
    return lambda self: based_method(self, *args)
  File "/b/s/w/ir/content/test/gpu/gpu_tests/gpu_integration_test.py", line 132, in _RunGpuTest
    self.RunActualGpuTest(url, *args)
  File "/b/s/w/ir/content/test/gpu/gpu_tests/webgl_conformance_integration_test.py", line 188, in RunActualGpuTest
    getattr(self, test_name)(test_path, *args[1:])
  File "/b/s/w/ir/content/test/gpu/gpu_tests/webgl_conformance_integration_test.py", line 202, in _RunConformanceTest
    self._CheckTestCompletion()
  File "/b/s/w/ir/content/test/gpu/gpu_tests/webgl_conformance_integration_test.py", line 196, in _CheckTestCompletion
    'webglTestHarness._finished', timeout=self._GetTestTimeout())
  File "/b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function
    return func(*args, **kwargs)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/actions/action_runner.py", line 261, in WaitForJavaScriptCondition
    return self._tab.WaitForJavaScriptCondition(*args, **kwargs)
  File "/b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function
    return func(*args, **kwargs)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/browser/web_contents.py", line 239, in WaitForJavaScriptCondition
    return self._inspector_backend.WaitForJavaScriptCondition(*args, **kwargs)
  File "/b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function
    return func(*args, **kwargs)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py", line 302, in WaitForJavaScriptCondition
    e.message + '\n' + debug_message)
TimeoutException: Timed out while waiting 300s for IsJavaScriptExpressionTrue.
Console output:
[8379:775:0508/111601.508531:WARNING:gaia_auth_fetcher.cc(902)] Could not reach Google Accounts servers: errno -120

A catapult issue perhaps? Suggestions welcome.
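For readers unfamiliar with this failure mode: the traceback shows the harness waiting for the page to set 'webglTestHarness._finished' and giving up after 300s. The sketch below illustrates that general poll-until-true pattern. It is not Telemetry's implementation; `wait_for_condition`, its parameters, and the local `TimeoutException` class are hypothetical names used only for illustration.

```python
import time
from typing import Callable


class TimeoutException(Exception):
    """Illustrative stand-in; not the Telemetry exception class."""


def wait_for_condition(
    is_true: Callable[[], bool],
    timeout: float,
    poll_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll is_true() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if is_true():
            return
        if clock() >= deadline:
            # The condition never became true: the test is reported as a
            # timeout even though nothing crashed.
            raise TimeoutException(
                "Timed out while waiting %ds for condition." % timeout)
        sleep(poll_interval)
```

Because a timeout only says that the page never signalled completion, the failure carries no information about which component stalled, which is consistent with the empty console output and missing minidumps above.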
,
May 8 2018
,
May 8 2018
Note: the V8 changes seem innocuous, aside from possibly mvstanton's change. However, some change has definitely been made recently which has destabilized these tests and caused them to intermittently time out. It's possible the regression range is not correct.
,
May 8 2018
Summary of flaking:
Builder Win7 FYI x64 Release (NVIDIA):
897 failed refs/heads/master@{#556792}
896 succeeded refs/heads/master@{#556761}
895 succeeded refs/heads/master@{#556737}
Builder Mac FYI Retina Release (NVIDIA) (flaking)
1333 success refs/heads/master@{#556761}
1332 fail refs/heads/master@{#556746}
1331 success refs/heads/master@{#556735}
1330 success refs/heads/master@{#556716}
Builder Mac FYI GPU ASAN Release
803 fail refs/heads/master@{#556770}
801 fail refs/heads/master@{#556716}
800 success refs/heads/master@{#556687}
799 success refs/heads/master@{#556686}
Builder Mac FYI Experimental Release
2090 success refs/heads/master@{#556858}
2089 fail refs/heads/master@{#556835}
2088 success refs/heads/master@{#556813}
2087 success refs/heads/master@{#556806}
Windows 10
refs/heads/master@{#556_760}, refs/heads/master@{#556_720}
Ignoring Windows 10, the flake would seem to start after 556_806, 556_686, 556_716, 556_737
and be present by 556_835, 556_716, 556_746, 556_792?
That suggests the flake starting by 556_686 and already being in place by 556_716, but the CLs between 556_760 and 556_686 don't seem suspicious other than the ones pointed out by kbr@. Maybe devtools changes are interfering with telemetry operation?
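The bounding argument above can be sketched in a few lines of Python. The build data is copied from the summary in this comment (build 2089 is treated as failing because it is listed among the failing builds in the issue description); `first_failure_window` and the variable names are hypothetical. Since the failures are intermittent, a green build does not prove the regression is absent, so this sketch treats only the earliest first-failing commit position as a firm bound.

```python
from typing import Dict, List, Optional, Tuple

# (build number, passed?, commit position), oldest first.
Build = Tuple[int, bool, int]

BUILDS: Dict[str, List[Build]] = {
    "Win7 FYI x64 Release (NVIDIA)": [
        (895, True, 556737), (896, True, 556761), (897, False, 556792)],
    "Mac FYI Retina Release (NVIDIA)": [
        (1330, True, 556716), (1331, True, 556735),
        (1332, False, 556746), (1333, True, 556761)],
    "Mac FYI GPU ASAN Release": [
        (799, True, 556686), (800, True, 556687),
        (801, False, 556716), (803, False, 556770)],
    "Mac FYI Experimental Release": [
        (2087, True, 556806), (2088, True, 556813),
        (2089, False, 556835), (2090, True, 556858)],
}


def first_failure_window(
        builds: List[Build]) -> Optional[Tuple[Optional[int], int]]:
    """Return (last good position, first bad position) for one builder."""
    last_good = None
    for _number, passed, position in builds:
        if not passed:
            return last_good, position
        last_good = position
    return None  # No failure recorded for this builder.


windows = {name: first_failure_window(b) for name, b in BUILDS.items()}

# With a flaky regression, the culprit must have landed at or before the
# earliest commit position at which any builder first failed.
earliest_first_bad = min(w[1] for w in windows.values() if w is not None)

for name, window in windows.items():
    print(name, window)
print("culprit landed at or before", earliest_first_bad)
```

The per-builder lower bounds are printed only as hints; on a flaky bot they may well be too late.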
,
May 9 2018
Issue 841388 has been merged into this issue.
,
May 9 2018
There is a clear signal of the regression on this particular bot: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20%28NVIDIA%29?limit=200 We urgently need to speculatively revert any suspect CLs to get this bot reliably green again. This may involve pausing the V8 autoroller and reverting back to an earlier version.
,
May 9 2018
,
May 9 2018
The first uptick in the timeout error rate in WebGL tests on Mac Retina NVIDIA (https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20(NVIDIA)?limit=200) is at build 1322. The average size of the success window is ~2, so I'm looking at the 4 previous builds: 1321, 1320, 1319, 1318.
Besides:
V8 roll: https://chromium.googlesource.com/chromium/src/+/c0e585261f693123f3560c63dd6e254e55a83374
Blink memory management change: https://chromium.googlesource.com/chromium/src/+/f4544427fd7e40abba353339146780a2a2b90051
maybe these CLs deserve a once-over?
* maybe an infra change: https://chromium-review.googlesource.com/c/chromium/src/+/1043002?
* really reaching here, but could perturb timings in image code: https://chromium-review.googlesource.com/c/chromium/src/+/1045729?
* devtools change might perturb how telemetry works: https://chromium-review.googlesource.com/c/chromium/src/+/1045180?
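A rough sketch of the "success window" heuristic used in this comment: measure how long runs of consecutive green builds tend to be, then look back a couple of windows before the first uptick. The `history` list is made-up illustrative data, not taken from the bot, and the factor of two is only an assumption inferred from "~2 so looking at 4 previous builds".

```python
from itertools import groupby
from typing import List


def success_run_lengths(results: List[bool]) -> List[int]:
    """Lengths of each run of consecutive passing builds."""
    return [len(list(run)) for passed, run in groupby(results) if passed]


# Hypothetical pass/fail history (True = green), oldest first.
history = [False, True, True, False, True, False, True, True, True, False]

runs = success_run_lengths(history)
mean_run = sum(runs) / len(runs)

# Assumption: look back two average windows before the first uptick.
lookback = 2 * round(mean_run)
print(runs, mean_run, lookback)
```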
,
May 9 2018
Reviewing Mac FYI ASAN (https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20GPU%20ASAN%20Release): more frequent timeout failures start at build 801. The suspects on build 801 itself from this builder are exactly the same as in #8.
,
May 9 2018
Reviewing Linux NVIDIA (https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29?limit=200): more frequent timeout failures start at build 1965. Build 1963 features the V8 roll mentioned above. Build 1962 has the devtools change mentioned above and the Blink memory change.
,
May 9 2018
,
May 9 2018
blink roll revert https://chromium-review.googlesource.com/c/chromium/src/+/1052527 landed as Change-Id: I7108d2aecebef1e7ca0355bca571e85d1cc01499 Cr-Commit-Position: refs/heads/master@{#557306}
,
May 9 2018
Revert of the Blink memory management change, https://chromium-review.googlesource.com/c/chromium/src/+/1052987, landed:
Bug: None
Change-Id: I26910af177818011bf1aa00efeacb9bb8058f108
Reviewed-on: https://chromium-review.googlesource.com/1052987
Reviewed-by: Robert Kroeger <rjkroege@chromium.org>
Reviewed-by: Kentaro Hara <haraken@chromium.org>
Commit-Queue: Robert Kroeger <rjkroege@chromium.org>
Cr-Commit-Position: refs/heads/master@{#557322}
,
May 10 2018
It looks like the Mac bots recovered and the Windows GPU bot failures look unrelated. Can we try rolling V8 again?
,
May 10 2018
Yes, please go ahead and try rolling V8 again.

This particular bot was most severely affected: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20%28NVIDIA%29?limit=200

From the attached screenshot, this is the build where the bot became green again: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20%28NVIDIA%29/1355
That build contains the revert of the Blink memory management change, https://chromium-review.googlesource.com/1052987 .

The previous build, https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20%28NVIDIA%29/1354 , contained the revert of the most recent V8 roll, and that build was still red.

So it looks like the culprit was the Blink memory management change.
,
May 10 2018
> So it looks like the culprit was the Blink memory management change.
I'm sorry. I couldn't have guessed that the CL could influence the WebGL tests. Let me try to check what the problem was. Thanks.
,
May 10 2018
>> So it looks like the culprit was the Blink memory management change.
> I'm sorry. I couldn't have guessed that the CL could influence the WebGL tests. Let me try to check what the problem was. Thanks.
BTW, though my CL was reverted, it looks like the timeout failures occurred again.
- https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20%28NVIDIA%29/1360
,
May 10 2018
gyuyoung.kim@: I think that failure is just one particular flaky test which we need to suppress. We will confirm by letting things run overnight. Please do not reland your CL. Also, please file a bug and refer to it from your CL when landing changes like yours which have potentially significant impact. Thanks.
,
May 10 2018
> gyuyoung.kim@: I think that failure is just one particular flaky test which we need to suppress. We will confirm by letting things run overnight. Please do not reland your CL. Also, please file a bug and refer to it from your CL when landing changes like yours which have potentially significant impact. Thanks.
OK, let me do that next time. Thanks too.
,
May 10 2018
All builder issues are tracked in other bugs. This particular bug seems to have been resolved. Closing.
,
May 11 2018
,
May 11 2018
Robert, *thank you* for tracking down the cause of this flakiness! This was an important regression to find and fix. Gyuyoung: I've filed follow-on Issue 842019 for you to track down the reason why your CL caused these failures. It seems suspicious that it did, and we should get to the bottom of it. Also, I've submitted https://chromium-review.googlesource.com/1054773 to suppress the flakes you observed on the Mac NVIDIA bot in the WebGL dEQP shaderoperator tests.
,
May 11 2018
kbr@, thank you for filing a bug. ok, let me investigate why my CL caused the flaky test there.
,
May 11 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/d5efb3fb1f0451424f6233de328906f7002465f3

commit d5efb3fb1f0451424f6233de328906f7002465f3
Author: Kenneth Russell <kbr@chromium.org>
Date: Fri May 11 01:40:45 2018

Add link to Blink MemoryCoordinator bug to flakiness examples.

This was a difficult bug to track down and is worth mentioning as a reason to prioritize stamping out flakiness on the waterfall.

Bug: 840988
Change-Id: If2c56a80b02178ddc28efd44d73ec440a2aa2e0c
Tbr: rjkroege@chromium.org
Reviewed-on: https://chromium-review.googlesource.com/1054585
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#557758}

[modify] https://crrev.com/d5efb3fb1f0451424f6233de328906f7002465f3/docs/gpu/gpu_testing.md
,
May 15 2018
stgao, lijeffrey: this bug and the associated flaky failures would be a good use case for FindIt. The problematic CL caused flakes in random tests, but all in the same test suite. It would be ideal if the flake detection could handle this case.
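To illustrate why per-test flake detection struggles here, the sketch below counts the failures quoted in the issue description both per test and per suite. It is only an illustration of the aggregation idea and does not reflect how FindIt works; `suite_of` and the variable names are hypothetical.

```python
from collections import Counter

PREFIX = ("gpu_tests.webgl_conformance_integration_test."
          "WebGLConformanceIntegrationTest.")

# Failing tests as listed in the issue description.
FAILED_TESTS = [PREFIX + name for name in [
    "WebglConformance_deqp_functional_gles3_shadermatrix_add_assign",
    "WebglConformance_deqp_functional_gles3_texturespecification_basic_teximage3d_2d_array_02",
    "WebglConformance_conformance_more_conformance_constants",
    "WebglConformance_conformance_glsl_functions_glsl_function_clamp_float",
    "WebglConformance_deqp_functional_gles3_shaderoperator_angle_and_trigonometry_03",
    "WebglConformance_deqp_functional_gles3_shadermatrix_add_const",
    "WebglConformance_conformance_glsl_functions_glsl_function_clamp_float",
]]


def suite_of(test_id: str) -> str:
    """Drop the final component (the test name) to get the suite id."""
    return test_id.rsplit(".", 1)[0]


by_test = Counter(FAILED_TESTS)
by_suite = Counter(suite_of(t) for t in FAILED_TESTS)

# Per test, the signal is spread thin; per suite, it is concentrated.
print("max failures for any single test:", max(by_test.values()))
print("failures per suite:", dict(by_suite))
```

Counted per test, no single test stands out, whereas the suite-level count collects every failure in one bucket.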
,
May 15 2018
As it is for flake detection, liaoyuke@ could follow up here. Jeff focuses on flake analysis -- find the culprit.
,
May 15 2018
,
May 22 2018
Comment 1 by kbr@chromium.org, May 8 2018
Labels: -Type-Bug -Pri-3 Pri-1 Type-Bug-Regression
Owner: rjkroege@chromium.org