New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 595754 link

Starred by 0 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 579664



Sign in to add a comment

flakes on the waterfall on trace_test.IOSurface2DCanvas

Project Member Reported by erikc...@chromium.org, Mar 17 2016

Issue description

https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Retina%20Release%20%28AMD%29?numbuilds=200

[59890:17155:0316/224832:ERROR:gpu_watchdog_thread.cc(362)] The GPU process hung. Terminating after 10000 ms.
Received signal 11 SEGV_MAPERR 000000000000
 [0x00010bb4e5ea]
 [0x7fff8900cf1a]
 [0x00011f270b28]
 [0x00010bafa60b]
 [0x00010bb4ec0b]
 [0x00010bb72163]
 [0x00010bb7243c]
 [0x00010bb7289b]
 [0x00010bb433dd]
 [0x00010bb6787a]
 [0x00010bb42dc4]
 [0x7fff8fd67a01]
 [0x7fff8fd59b8d]
 [0x7fff8fd591bf]
 [0x7fff8fd58bd8]
 [0x00010bb4378f]
 [0x00010bb43214]
 [0x00010bb881b3]
 [0x00010bb7193d]
 [0x00010bbaa5a8]
 [0x00010bba6677]
 [0x7fff8f69f05a]
 [0x7fff8f69efd7]
 [0x7fff8f69c3ed]
[end of stack trace]
Traceback (most recent call last):
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 84, in _RunStoryAndProcessErrorIfNeeded
    state.RunStory(results)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/content/test/gpu/gpu_tests/gpu_test_base.py", line 111, in RunStory
    RunStoryWithRetries(GpuSharedPageState, self, results)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/content/test/gpu/gpu_tests/gpu_test_base.py", line 72, in RunStoryWithRetries
    super(cls, shared_page_state).RunStory(results)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 329, in RunStory
    self._current_page.Run(self)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/__init__.py", line 84, in Run
    shared_state.page_test.RunNavigateSteps(self, current_tab)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/page_test.py", line 182, in RunNavigateSteps
    page.RunNavigateSteps(action_runner)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/content/test/gpu/page_sets/pixel_tests.py", line 28, in RunNavigateSteps
    'domAutomationController._finished', timeout_in_seconds=30)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/action_runner.py", line 186, in WaitForJavaScriptCondition
    self._tab.WaitForJavaScriptExpression(condition, timeout_in_seconds)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/internal/browser/web_contents.py", line 111, in WaitForJavaScriptExpression
    e.message + '\n' + debug_message)
TimeoutException: Timed out while waiting 30s for IsJavaScriptExpressionTrue.
Console output:
(warning) http://127.0.0.1:54664/pixel_canvas2d_accelerated.html:43: 'webkitRequestAnimationFrame' is vendor-specific. Please use the standard 'requestAnimationFrame' instead.
(error) http://127.0.0.1:54664/favicon.ico:0: Failed to load resource: the server responded with a status of 404 (File not found)
(error) :15: Uncaught TypeError: Cannot read property 'clear' of null


[  FAILED  ] trace_test.IOSurface2DCanvas (98644 ms)
 
Owner: erikc...@chromium.org
Status: Assigned (was: Untriaged)
Blocking: 579664
Cc: dyen@chromium.org
dyen: I thought we got stack traces now?
Cc: ccameron@chromium.org
The first time I landed my Canvas2D changes, we didn't see this error. That implies that something about my second landing is tickling this problem:
https://codereview.chromium.org/1804243002/

I partially symbolicated the stack, and it's totally useless. It's the stack trace of an invocation form the watchdog thread causing a stack dump.

I'm leaving notes for myself in case I ever need to do anything like this again. The trick is figure out where Chromium Framework got mapped. I compiled a local copy of Chrome, and crashed the gpu process from lldb
"""
out/Debug/Chromium.app/Contents/MacOS/Chromium --no-sandbox --gpu-startup-dialog
<attach lldb to gpu process
"""

This gives a symbolicated stack trace, only copying the bottom:
    frame #24: 0x000000010e9d608f Chromium Framework`base::MessageLoop::Run(this=0x00007ff306203320) + 303 at message_loop.cc:293
    frame #25: 0x000000010eb13467 Chromium Framework`base::Thread::Run(this=0x00007fff531a3ba0, message_loop=0x00007ff306203320) + 39 at thread.cc:202
    frame #26: 0x000000010eb13c0d Chromium Framework`base::Thread::ThreadMain(this=0x00007fff531a3ba0) + 1613 at thread.cc:254
    frame #27: 0x000000010eaf82d1 Chromium Framework`base::(anonymous namespace)::ThreadFunc(params=0x00007ff306200e00) + 705 at platform_thread_posix.cc:68
    frame #28: 0x00007fff88767c13 libsystem_pthread.dylib`_pthread_body + 131
    frame #29: 0x00007fff88767b90 libsystem_pthread.dylib`_pthread_start + 168
    frame #30: 0x00007fff88765375 libsystem_pthread.dylib`thread_start + 13

This was a debug build, so we expect some of the frames to possibly be optimized out in a release build. Notice that frame #25 and #26 are very close in address. Looking at stack we're trying to symbolicate:
 [0x00010bbaa5a8]
 [0x00010bba6677]
 [0x7fff8f69f05a]
 [0x7fff8f69efd7]
 [0x7fff8f69c3ed]

I'll bet [0x00010bbaa5a8] corresponds to base::Thread::Run. Download the isolate binary. Disassemble it in Hopper, look up the address for base::Thread::Run, subtract that address from 0x00010bbaa5a8, and then round down to the nearest page size. Check a couple of other frames...yup, this gives the right base.


Partially symbolicated, totally useless stack:

Chromium Framework image base: 0x10B5F5000
 [0x00010bb4e5ea]
0x5595EA
base::debug::(anonymous namespace)::StackDumpSignalHandler(int, __siginfo*, void*)
 [0x7fff8900cf1a]
 [0x00011f270b28]
0x13C7BB28
 [0x00010bafa60b]
0x50560B
base::internal::Invoker...
 [0x00010bb4ec0b]
0x559C0B
base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&)
 [0x00010bb72163]
0x57D163
base::MessageLoop::RunTask(base::PendingTask const&)
 [0x00010bb7243c]
 [0x00010bb7289b]
0x57D89B
base::MessageLoop::DoDelayedWork(base::TimeTicks*)
 [0x00010bb433dd]
0x54E3DD
base::MessagePumpCFRunLoopBase::RunWork()
 [0x00010bb6787a]
 [0x00010bb42dc4]
 [0x7fff8fd67a01]
 [0x7fff8fd59b8d]
 [0x7fff8fd591bf]
 [0x7fff8fd58bd8]
 [0x00010bb4378f]
 [0x00010bb43214]
 [0x00010bb881b3]
 [0x00010bb7193d]
 [0x00010bbaa5a8]
0x5B55A8
base::Thread::ThreadMain()
 [0x00010bba6677]
 [0x7fff8f69f05a]
 [0x7fff8f69efd7]
 [0x7fff8f69c3ed]
I haven't been able to reproduce this problem locally, on 2 different machines/GPUs.
Project Member

Comment 7 by bugdroid1@chromium.org, Mar 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f4f91cf0abc666aa189c2ee666b7acf0361d00bf

commit f4f91cf0abc666aa189c2ee666b7acf0361d00bf
Author: erikchen <erikchen@chromium.org>
Date: Fri Mar 18 00:14:07 2016

Temporarily disable IOSurface 2D Canvas to debug a trace_test failure.

The first time IOSurface 2D Canvas was enabled, there were no test failures over
several days. The second time it was enabled (along with a bug fix, and some new
tests), trace_test.IOSurface2DCanvas started flakily failing.

BUG= 595754 

Review URL: https://codereview.chromium.org/1813073002

Cr-Commit-Position: refs/heads/master@{#381843}

[modify] https://crrev.com/f4f91cf0abc666aa189c2ee666b7acf0361d00bf/third_party/WebKit/Source/platform/graphics/Canvas2DLayerBridge.h

Ah ha!

I disabled IOSurface 2D Canvas, and the same test failure persists.

Comment 9 by dyen@chromium.org, Mar 21 2016

Regarding the stack traces, telemetry didn't symbolize anything because it wasn't notified of any crashes. The crash happened on the GPU process which just resurrects itself. In this case, telemetry only detected that a timeout has happened. Telemetry probably could have symbolized the stacktrace if it actually had a minidump from the GPU process.

In this case, does the GPU process actually dump a ministack as well? If so, maybe it would make sense to look for minidumps on timeouts.

Comment 11 by kbr@chromium.org, Mar 22 2016

As mentioned at the top: trace_test.IOSurface2DCanvas is flaky on https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Retina%20Release%20%28AMD%29 . Additionally suppressing this test.

Project Member

Comment 12 by bugdroid1@chromium.org, Mar 22 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b16ec4dc4a1a1cee4d5bf44e1eda5e687216b1a9

commit b16ec4dc4a1a1cee4d5bf44e1eda5e687216b1a9
Author: kbr <kbr@chromium.org>
Date: Tue Mar 22 03:52:41 2016

Skip trace_test.IOSurface2DCanvas on Mac with AMD GPU.

It's flaky on this configuration.

BUG= 595754 
TBR=erikchen@chromium.org
NOTRY=true

Review URL: https://codereview.chromium.org/1823763005

Cr-Commit-Position: refs/heads/master@{#382502}

[modify] https://crrev.com/b16ec4dc4a1a1cee4d5bf44e1eda5e687216b1a9/content/test/gpu/gpu_tests/trace_test_expectations.py

Between #4633 and #4538, there were 28 failures, averaging about 1 failure every 3.4 runs.

This CL made was first seen in #4634. Since then, there have been no failures (15 runs). It looks like somehow, the 2DCanvasWebGL is causing gpu process hangs/crashes?, but only for tracing runs.
dyen: Any thoughts about why this would be happening, or better way to debug/attempt to reproduce locally?

Comment 15 by dyen@chromium.org, Mar 23 2016

Does it only happen on bots? For these kinds of problems we usually just VNC into the bot.

You can use this command to stop swarming bots:
"./swarming.py terminate -S chromium-swarm.appspot.com <botid>"

For regular non-swarming bots you can touch this file:
“/b/build/slave/shutdown.stamp”

Once the current job finishes you can run the test over and over again.
I've managed to [very rarely] reproduce this locally. 

Chrome log from failed run on bot:
https://00e9e64bacf404dbb680ae5a265c1dff1114f5edcfe6b44812-apidata.googleusercontent.com/download/storage/v1_internal/b/chrome-telemetry-output/o/log_610fda67-179e-442b-98da-b58655e03994?qk=AD5uMEu5TtHWIG81PFW8OunhBoJX_Wuh2gbu3lI3-wXubeYMSXJ13pU9HH-enJ0TqJPz-Yd2ppjZB0MXRllhs-MtoZLP9YLiyupfWbMo7rmtq1Ut6DNBYIljiBkEijIfMMFI20xe6Mu9DuNBBdCq-G1L5ZFSG8pXBmowbQEVlrMxPz6zJTGPp36MoetL-CO6hXAWxv_Ayl_W9d9kugBz-KWXBLgGinW_GNfey1_U6P0H2-DAh-FjvCXbFawpCfaAQI-Oqra4q7uCxXYSXsJt8SqCxcTEHfD0ofR-Gw8sdEwO6S18ft7QWHfKKxIdtbw7yKVjbJaXp5eSqOvoczzSqXVvcDjFFUS6QsPCqVZPdZmvUEW_PaBSE4Oq2IkTWizEaSrQBJV-d42sarWYONJiueTLQSn3Ifqgerbc5jYQNiALLZx0HidTGrgwN8DQ1kf0tStSYEcVeipFAg2Hu3SMNMoKdVUxreTJHyOQ4eF-om3gcZbG4-pAdLevZ9WhK34xwhP4pGwWea0v8Rh76VpQ5vLM5tgSOVN--8gMgnu7bW5QEeGqoA9lGdkm5xYGzJoXnh1CRg3damyIMSl_L5SrdXOOlTylIC8H00sIewxIn_aLrajCELTOWshlwz7ctdu4jWeavvsZEUcN4V3lGgowbbtSfZ3VXHMuiPqO5yKNJENDax4q75aMrOJDk1bcQymGbPUN4EOxwHM4sxt5UAXCFe7I9nkric23jJN8TPk9NocFzprEwAHXpN5NyiFACGUEY86BYDfhMQqwP2KLp97_Shuzn-lf834ll8H9PV-GB06NXHZl3SEeSD9eISzVBEb3gAWmMd3yvtwmufxVFhwRpPau-XmoOfeW0prLhkHBJ_ldrEKarjy6hMY

Chrome log from failed run locally:
https://00e9e64bacca40677ddf1fa45153716748d434917d30bb7a99-apidata.googleusercontent.com/download/storage/v1_internal/b/chrome-telemetry-output/o/log_ea7263ce-67b2-45c4-ba40-a4c4bd918762?qk=AD5uMEs3kJGhRGnGIDWt0OtiWM5Z7LyOO1lPpa_1bGHq-Ddskw1QrUbRga55hxD4fohlOzstb6thgG6wsgsM5gdHMf4zF2zNa0qHDIBQx8YL_S55P-SnQrZUJgrwqtQWfJ4v8z7WOduad7vpjRgXnnsfavwHhjKn_uBAu9DuEIXV3YPY-Idcp_StfR_QK7R03GrEBZubSYjNme_nJ1W7Xofwu75g8bqHnDcW0FBzmgQQcnXquYRyZT9M0yercg4jPpk_DsXRFsKR9jDi8nccV9l9CINKkf-B7YlAP1moGc1NHJq4uI0UZ-92od3WBuWUEBhY-Dzixlb3AeU3vczQKbrN9OwnFfrCr96tTqtfkWtGI8PCCRm4DK5bi2Q55LgrZCOMRNABOWOsa7DmK5GAyskKDFMy0J_8C7-kh51S0BrehFGNzlhW8zFWPi47Y2d2Tpx8a9WrZ7xblw3XZUwgOp4RlcCvwY_X3bbo3DN0mEfAa0by14cLBrhBf-eQgW0WxpbpIYhbME5Tqer--PlZQRTaaORuDyCHtdKnpCJL0goZ7NUvClq5KWXSsDKUwH30_4slNeg_zA0Z0hj03Ut1NB-V5UrMecVdPGVc5C3lYbgRVh0Ukp7C0FYNAots_K0sgLVEYlet_uoOL4eOK3fngJfA1DFaf9ZoC_xcXqlnKH3aXfGaVFAB7xiwV_0MO75uGbjyHlZPaGPaysmm7N-cDS_Q_dVuA18ev7zV6R1b7-lsrV5PxxEmJSk3uqq4rpwW7SRXYnsBs6CXcgxoLi0qcZy3SN7kwYQvVOeK0QoMVAqAVpF8lmGVhQkXf1RpNSEIqV7pJOT85L24xoIsKqbiPTR6zTN25DNRoSUjntNNxVzZe1rngXfXVt8

Chrome log from successful run locally:
https://00e9e64bacd6f4d0e52810049e870be5e19c5b0df995139a69-apidata.googleusercontent.com/download/storage/v1_internal/b/chrome-telemetry-output/o/log_37fcdbee-1aeb-418b-be18-44c277c46365?qk=AD5uMEtkr490vT5GE-P0MiGtvZHOr_P1Der3CD1gGL7wrWV-Ax_lhxpf7FHJab7lEpPAlKBoED_FTuTQgW9uyUiCcBjzzQqv9K4L8JgKk9x2zE4pjYLZJlIt2L3PmdE4okZfVGWGkSZg50TpsYt-hKyzNljJKToaizGwfv2NcHE3zDBBlcHsrmULrvMFOQ2QXlrGUlM5scvlUA1M3l-reY98HWhvNoXhcC4krbeSgWFfHtIvEQlm8cUJ8LqHB3wys7uF8dTDmdb67IoFaTDWL1aBNVXUzHzusun1vE4qovviKXQtRh_rm2NxUdrvmRNK9pBwusfH-XuK43O1HTmc9vGvEApnKZJ_yN6RJ3aRc_lWgTwXW2Qm6-bXaVikL4GXW6XRw5XSkREzDp2tVoisxaz4m2P45WyozTHS0yC5rCg85YwgIwp8nSV3EcO7F6XGXgUq9zkbPvdfdhPmZBCZ1mtXKTCXd1ZWl6p8Dc8NVOuqQZpXdVWETxLQSQv45jMCkMFIoH1WFQBytyTuvxLj-31e79-6HiE_3m5rSUnHlDR-DFZPvh6B3XMETFKw8N-AHZVjfyIx9TOydQhAChKKNf1x-QKzuzCPhtux4pY6LsQ__Cs2nlquq3YeoraktML3Nnrdnwj1MKR5IMoEtljqWjP_hh7Sksyb2EGfMEWVVQGvSey1QFteLgbhb5G1jXk0pv3cepWt3B6SAj2bgnz8wuiHFo5_gGNAcYQjcbtnGkGoDhI_COAgaaFi0jD_s1tr0oqyKJyUE-aSeOWWnrYT2jtjhe3RVWrhVG9forQAw-KCcWB-PbAYfml33oLjbwm1dNNB1qTRt0F3nqrlvsn8CRVHvlTnfxQZrFxE0Iio6XpwVzhoj1cvLhA

Notice that in the failed runs, both locally and on the bot, there are the lines
"""
[16258:1295:0325/100458:VERBOSE1:service_discovery_device_lister.cc(36)] DeviceListerStart: service_type: _privet._tcp.local
[16258:1295:0325/100458:VERBOSE1:service_discovery_client_mac.mm(130)] CreateServiceWatcher: _privet._tcp.local
[16258:1295:0325/100458:VERBOSE1:service_discovery_client_mac.mm(253)] ServiceWatcherImplMac::Start
[16258:1295:0325/100458:VERBOSE1:service_discovery_device_lister.cc(41)] DiscoverNewDevices: service_type: _privet._tcp.local
[16258:1295:0325/100458:VERBOSE1:service_discovery_client_mac.mm(260)] ServiceWatcherImplMac::DiscoverNewServices
"""
That are missing from successful run. Suspicious! 

Furthermore, I observed that when the test fails, the green WebGL triangle fails to draw. Reasons still unknown. Note that I haven't turned on IOSurfaces for WebGL yet, so this is not using the CA compositing path.
I think the service_discovery logging is a red triangle. It happens to show up because the test has stalled (thus leaving the browser up for longer), rather than being the root cause.
I think this is the relevant line from the log from the bot:
"""
61514:1299:0321/191115:INFO:CONSOLE(15)] "Uncaught TypeError: Cannot read property 'clear' of null", source:  (15)
"""
As its followed by 30 seconds of nothing. 

Clear() gets called by
content/test/data/gpu/pixel_webgl_util.js:drawTriangle()
pixel_canvas2d_webgl.html:waitForFinish()

This still raises the question of "why did the WebGL canvas fail to initialize?" I'm going to add better error checking to pixel_canvas2d_webgl.html and re-enable the trace tests.
test_harness_script is different for pixel.py and trace_test.py. This is surprising, since they run the same set of tests.

Excerpt from the script from trace_test.py:
"""
  domAutomationController.send = function(msg) {                                   
    // Issue a read pixel to synchronize the gpu process to ensure                 
    // the asynchronous category enabling is finished.                             
    var canvas = document.createElement("canvas")                                  
    canvas.width = 1;                                                              
    canvas.height = 1;                                                             
    var gl = canvas.getContext("webgl");                                           
    gl.clear(gl.COLOR_BUFFER_BIT);                                                 
    var id = new Uint8Array(4);                                                    
    gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, id);                      
                                                                                   
    domAutomationController._finished = true;                                      
  }                                  
"""

Comment 20 by dyen@chromium.org, Mar 25 2016

The trace_test relies on gpu tracing on the GPU process so in order to make sure the proper trace categories are enabled (which are enabled from the browser process and asynchronously enabled on the GPU process) it issues a "gl.readPixels()" to synchronizes the 2 processes before the test begins.
Project Member

Comment 21 by bugdroid1@chromium.org, Mar 28 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0f496000d1f33c9255621efbf6b88a8185a791cd

commit 0f496000d1f33c9255621efbf6b88a8185a791cd
Author: erikchen <erikchen@chromium.org>
Date: Mon Mar 28 21:36:24 2016

Add safety checks and reenable Canvas2D/WebGL trace_tests.

The trace tests were disabled because they were flakily failing, whereas their
pixel_test counterparts were having no problems. This CL makes several small
changes.
  1. Additional safety checking in trace_test and pixel_canvas2d_webgl.
  2. Rename gl->temp_gl and canvas->temp_canvas to avoid potential namespace
  collisions with global object in pixel_canvas2d_webgl.
  3. Reenable trace tests.
  4. For consistency, try to grab the "experimental-webgl" context, then the
  "webgl" context.

BUG= 595754 

Review URL: https://codereview.chromium.org/1826343003

Cr-Commit-Position: refs/heads/master@{#383572}

[modify] https://crrev.com/0f496000d1f33c9255621efbf6b88a8185a791cd/content/test/data/gpu/pixel_canvas2d_webgl.html
[modify] https://crrev.com/0f496000d1f33c9255621efbf6b88a8185a791cd/content/test/gpu/gpu_tests/trace_test.py
[modify] https://crrev.com/0f496000d1f33c9255621efbf6b88a8185a791cd/content/test/gpu/gpu_tests/trace_test_expectations.py

Status: Fixed (was: Assigned)
I'm not seeing any more flakes. 

Sign in to add a comment