flakes on the waterfall on trace_test.IOSurface2DCanvas
Issue description:
https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Retina%20Release%20%28AMD%29?numbuilds=200
"""
[59890:17155:0316/224832:ERROR:gpu_watchdog_thread.cc(362)] The GPU process hung. Terminating after 10000 ms.
Received signal 11 SEGV_MAPERR 000000000000
[0x00010bb4e5ea] [0x7fff8900cf1a] [0x00011f270b28] [0x00010bafa60b]
[0x00010bb4ec0b] [0x00010bb72163] [0x00010bb7243c] [0x00010bb7289b]
[0x00010bb433dd] [0x00010bb6787a] [0x00010bb42dc4] [0x7fff8fd67a01]
[0x7fff8fd59b8d] [0x7fff8fd591bf] [0x7fff8fd58bd8] [0x00010bb4378f]
[0x00010bb43214] [0x00010bb881b3] [0x00010bb7193d] [0x00010bbaa5a8]
[0x00010bba6677] [0x7fff8f69f05a] [0x7fff8f69efd7] [0x7fff8f69c3ed]
[end of stack trace]
Traceback (most recent call last):
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 84, in _RunStoryAndProcessErrorIfNeeded
    state.RunStory(results)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/content/test/gpu/gpu_tests/gpu_test_base.py", line 111, in RunStory
    RunStoryWithRetries(GpuSharedPageState, self, results)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/content/test/gpu/gpu_tests/gpu_test_base.py", line 72, in RunStoryWithRetries
    super(cls, shared_page_state).RunStory(results)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 329, in RunStory
    self._current_page.Run(self)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/__init__.py", line 84, in Run
    shared_state.page_test.RunNavigateSteps(self, current_tab)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/page_test.py", line 182, in RunNavigateSteps
    page.RunNavigateSteps(action_runner)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/content/test/gpu/page_sets/pixel_tests.py", line 28, in RunNavigateSteps
    'domAutomationController._finished', timeout_in_seconds=30)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/page/action_runner.py", line 186, in WaitForJavaScriptCondition
    self._tab.WaitForJavaScriptExpression(condition, timeout_in_seconds)
  File "/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/run8dj4il/third_party/catapult/telemetry/telemetry/internal/browser/web_contents.py", line 111, in WaitForJavaScriptExpression
    e.message + '\n' + debug_message)
TimeoutException: Timed out while waiting 30s for IsJavaScriptExpressionTrue.
Console output:
(warning) http://127.0.0.1:54664/pixel_canvas2d_accelerated.html:43: 'webkitRequestAnimationFrame' is vendor-specific. Please use the standard 'requestAnimationFrame' instead.
(error) http://127.0.0.1:54664/favicon.ico:0: Failed to load resource: the server responded with a status of 404 (File not found)
(error) :15: Uncaught TypeError: Cannot read property 'clear' of null
"""
[ FAILED ] trace_test.IOSurface2DCanvas (98644 ms)
,
Mar 17 2016
,
Mar 17 2016
dyen: I thought we got stack traces now?
,
Mar 17 2016
The first time I landed my Canvas2D changes, we didn't see this error. That implies that something about my second landing is tickling this problem: https://codereview.chromium.org/1804243002/
,
Mar 17 2016
I partially symbolicated the stack, and it's totally useless: it's the stack trace of an invocation from the watchdog thread causing a stack dump.
I'm leaving notes for myself in case I ever need to do anything like this again. The trick is to figure out where Chromium Framework got mapped. I compiled a local copy of Chromium and crashed the GPU process from lldb:
"""
out/Debug/Chromium.app/Contents/MacOS/Chromium --no-sandbox --gpu-startup-dialog
<attach lldb to the gpu process>
"""
This gives a symbolicated stack trace, only copying the bottom:
frame #24: 0x000000010e9d608f Chromium Framework`base::MessageLoop::Run(this=0x00007ff306203320) + 303 at message_loop.cc:293
frame #25: 0x000000010eb13467 Chromium Framework`base::Thread::Run(this=0x00007fff531a3ba0, message_loop=0x00007ff306203320) + 39 at thread.cc:202
frame #26: 0x000000010eb13c0d Chromium Framework`base::Thread::ThreadMain(this=0x00007fff531a3ba0) + 1613 at thread.cc:254
frame #27: 0x000000010eaf82d1 Chromium Framework`base::(anonymous namespace)::ThreadFunc(params=0x00007ff306200e00) + 705 at platform_thread_posix.cc:68
frame #28: 0x00007fff88767c13 libsystem_pthread.dylib`_pthread_body + 131
frame #29: 0x00007fff88767b90 libsystem_pthread.dylib`_pthread_start + 168
frame #30: 0x00007fff88765375 libsystem_pthread.dylib`thread_start + 13
This was a debug build, so we expect some of the frames to be optimized out in a release build. Notice that frames #25 and #26 are very close in address. Looking at the stack we're trying to symbolicate:
[0x00010bbaa5a8]
[0x00010bba6677]
[0x7fff8f69f05a]
[0x7fff8f69efd7]
[0x7fff8f69c3ed]
I'll bet [0x00010bbaa5a8] corresponds to base::Thread::Run. Download the isolate binary, disassemble it in Hopper, look up the address for base::Thread::Run, subtract that address from 0x00010bbaa5a8, and then round down to the nearest page boundary. Check a couple of other frames... yup, this gives the right base.
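The arithmetic just described can be sketched as follows. This is a worked example, not part of the original comment: the frame address comes from the unsymbolicated dump above, the symbol offset is the one the author found in Hopper (0x5B55A8, which the table below attributes to base::Thread::ThreadMain), and the page size is the standard 4 KiB:

```javascript
// Recover the Chromium Framework image base from one suspected frame.
const frameAddr = 0x00010bbaa5a8;  // frame from the unsymbolicated dump
const symbolOffset = 0x5b55a8;     // offset of the symbol in the binary (per Hopper)
const pageSize = 0x1000;           // 4 KiB pages

const rawBase = frameAddr - symbolOffset;
const imageBase = Math.floor(rawBase / pageSize) * pageSize;  // round down to a page boundary
console.log('image base: 0x' + imageBase.toString(16).toUpperCase());  // 0x10B5F5000

// With the base in hand, every in-image frame symbolicates by subtraction:
const offset = 0x00010bb4e5ea - imageBase;
console.log('offset: 0x' + offset.toString(16).toUpperCase());  // 0x5595EA, StackDumpSignalHandler
```

The result matches the image base 0x10B5F5000 quoted in the table below; cross-checking a second frame against the disassembly is how you confirm the guess was right.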
Partially symbolicated, totally useless stack:
Chromium Framework image base: 0x10B5F5000
[0x00010bb4e5ea]
0x5595EA
base::debug::(anonymous namespace)::StackDumpSignalHandler(int, __siginfo*, void*)
[0x7fff8900cf1a]
[0x00011f270b28]
0x13C7BB28
[0x00010bafa60b]
0x50560B
base::internal::Invoker...
[0x00010bb4ec0b]
0x559C0B
base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&)
[0x00010bb72163]
0x57D163
base::MessageLoop::RunTask(base::PendingTask const&)
[0x00010bb7243c]
[0x00010bb7289b]
0x57D89B
base::MessageLoop::DoDelayedWork(base::TimeTicks*)
[0x00010bb433dd]
0x54E3DD
base::MessagePumpCFRunLoopBase::RunWork()
[0x00010bb6787a]
[0x00010bb42dc4]
[0x7fff8fd67a01]
[0x7fff8fd59b8d]
[0x7fff8fd591bf]
[0x7fff8fd58bd8]
[0x00010bb4378f]
[0x00010bb43214]
[0x00010bb881b3]
[0x00010bb7193d]
[0x00010bbaa5a8]
0x5B55A8
base::Thread::ThreadMain()
[0x00010bba6677]
[0x7fff8f69f05a]
[0x7fff8f69efd7]
[0x7fff8f69c3ed]
,
Mar 17 2016
I haven't been able to reproduce this problem locally, on 2 different machines/GPUs.
,
Mar 18 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f4f91cf0abc666aa189c2ee666b7acf0361d00bf
commit f4f91cf0abc666aa189c2ee666b7acf0361d00bf
Author: erikchen <erikchen@chromium.org>
Date: Fri Mar 18 00:14:07 2016
Temporarily disable IOSurface 2D Canvas to debug a trace_test failure.
The first time IOSurface 2D Canvas was enabled, there were no test failures over several days. The second time it was enabled (along with a bug fix, and some new tests), trace_test.IOSurface2DCanvas started flakily failing.
BUG=595754
Review URL: https://codereview.chromium.org/1813073002
Cr-Commit-Position: refs/heads/master@{#381843}
[modify] https://crrev.com/f4f91cf0abc666aa189c2ee666b7acf0361d00bf/third_party/WebKit/Source/platform/graphics/Canvas2DLayerBridge.h
,
Mar 18 2016
Aha! I disabled IOSurface 2D Canvas, and the same test failure persists.
,
Mar 21 2016
Regarding the stack traces: telemetry didn't symbolize anything because it wasn't notified of any crashes. The crash happened in the GPU process, which just resurrects itself; telemetry only detected that a timeout had happened. Telemetry probably could have symbolized the stack trace if it actually had a minidump from the GPU process. Does the GPU process actually write a minidump in this case? If so, maybe it would make sense to look for minidumps on timeouts.
,
Mar 22 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/bf380d5748bbb3ea2b9f14ea1a721a8a4bab5ace
commit bf380d5748bbb3ea2b9f14ea1a721a8a4bab5ace
Author: erikchen <erikchen@chromium.org>
Date: Tue Mar 22 02:52:35 2016
Reenable IOSurface Canvas2D. Disable new trace tests.
Continuing to track down source of trace_test.IOSurface2DCanvas flaky errors.
BUG=595754
Review URL: https://codereview.chromium.org/1816623002
Cr-Commit-Position: refs/heads/master@{#382482}
[modify] https://crrev.com/bf380d5748bbb3ea2b9f14ea1a721a8a4bab5ace/content/test/gpu/gpu_tests/trace_test_expectations.py
[modify] https://crrev.com/bf380d5748bbb3ea2b9f14ea1a721a8a4bab5ace/third_party/WebKit/Source/platform/graphics/Canvas2DLayerBridge.cpp
[modify] https://crrev.com/bf380d5748bbb3ea2b9f14ea1a721a8a4bab5ace/third_party/WebKit/Source/platform/graphics/Canvas2DLayerBridge.h
,
Mar 22 2016
As mentioned at the top: trace_test.IOSurface2DCanvas is flaky on https://build.chromium.org/p/chromium.gpu/builders/Mac%2010.10%20Retina%20Release%20%28AMD%29 . Additionally suppressing this test.
,
Mar 22 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/b16ec4dc4a1a1cee4d5bf44e1eda5e687216b1a9
commit b16ec4dc4a1a1cee4d5bf44e1eda5e687216b1a9
Author: kbr <kbr@chromium.org>
Date: Tue Mar 22 03:52:41 2016
Skip trace_test.IOSurface2DCanvas on Mac with AMD GPU.
It's flaky on this configuration.
BUG=595754
TBR=erikchen@chromium.org
NOTRY=true
Review URL: https://codereview.chromium.org/1823763005
Cr-Commit-Position: refs/heads/master@{#382502}
[modify] https://crrev.com/b16ec4dc4a1a1cee4d5bf44e1eda5e687216b1a9/content/test/gpu/gpu_tests/trace_test_expectations.py
,
Mar 23 2016
Between builds #4538 and #4633, there were 28 failures, averaging about 1 failure every 3.4 runs. The CL was first seen in #4634. Since then, there have been no failures (15 runs). It looks like 2DCanvasWebGL is somehow causing GPU process hangs/crashes, but only for tracing runs.
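As a sanity check on the rate quoted above (treating the gap between the two build numbers as the number of runs in the failing window):

```javascript
// 28 failures were observed between builds #4538 and #4633.
const runs = 4633 - 4538;            // 95 runs in the failing window
const failures = 28;
const runsPerFailure = runs / failures;
console.log(runsPerFailure.toFixed(1) + ' runs per failure');  // 3.4 runs per failure
```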
,
Mar 23 2016
dyen: Any thoughts about why this would be happening, or better way to debug/attempt to reproduce locally?
,
Mar 23 2016
Does it only happen on bots? For these kinds of problems we usually just VNC into the bot. You can use this command to stop swarming bots:
"""
./swarming.py terminate -S chromium-swarm.appspot.com <botid>
"""
For regular non-swarming bots you can touch the file "/b/build/slave/shutdown.stamp". Once the current job finishes you can run the test over and over again.
,
Mar 25 2016
I've managed to [very rarely] reproduce this locally. Chrome log from failed run on bot: https://00e9e64bacf404dbb680ae5a265c1dff1114f5edcfe6b44812-apidata.googleusercontent.com/download/storage/v1_internal/b/chrome-telemetry-output/o/log_610fda67-179e-442b-98da-b58655e03994?qk=AD5uMEu5TtHWIG81PFW8OunhBoJX_Wuh2gbu3lI3-wXubeYMSXJ13pU9HH-enJ0TqJPz-Yd2ppjZB0MXRllhs-MtoZLP9YLiyupfWbMo7rmtq1Ut6DNBYIljiBkEijIfMMFI20xe6Mu9DuNBBdCq-G1L5ZFSG8pXBmowbQEVlrMxPz6zJTGPp36MoetL-CO6hXAWxv_Ayl_W9d9kugBz-KWXBLgGinW_GNfey1_U6P0H2-DAh-FjvCXbFawpCfaAQI-Oqra4q7uCxXYSXsJt8SqCxcTEHfD0ofR-Gw8sdEwO6S18ft7QWHfKKxIdtbw7yKVjbJaXp5eSqOvoczzSqXVvcDjFFUS6QsPCqVZPdZmvUEW_PaBSE4Oq2IkTWizEaSrQBJV-d42sarWYONJiueTLQSn3Ifqgerbc5jYQNiALLZx0HidTGrgwN8DQ1kf0tStSYEcVeipFAg2Hu3SMNMoKdVUxreTJHyOQ4eF-om3gcZbG4-pAdLevZ9WhK34xwhP4pGwWea0v8Rh76VpQ5vLM5tgSOVN--8gMgnu7bW5QEeGqoA9lGdkm5xYGzJoXnh1CRg3damyIMSl_L5SrdXOOlTylIC8H00sIewxIn_aLrajCELTOWshlwz7ctdu4jWeavvsZEUcN4V3lGgowbbtSfZ3VXHMuiPqO5yKNJENDax4q75aMrOJDk1bcQymGbPUN4EOxwHM4sxt5UAXCFe7I9nkric23jJN8TPk9NocFzprEwAHXpN5NyiFACGUEY86BYDfhMQqwP2KLp97_Shuzn-lf834ll8H9PV-GB06NXHZl3SEeSD9eISzVBEb3gAWmMd3yvtwmufxVFhwRpPau-XmoOfeW0prLhkHBJ_ldrEKarjy6hMY Chrome log from failed run locally: 
https://00e9e64bacca40677ddf1fa45153716748d434917d30bb7a99-apidata.googleusercontent.com/download/storage/v1_internal/b/chrome-telemetry-output/o/log_ea7263ce-67b2-45c4-ba40-a4c4bd918762?qk=AD5uMEs3kJGhRGnGIDWt0OtiWM5Z7LyOO1lPpa_1bGHq-Ddskw1QrUbRga55hxD4fohlOzstb6thgG6wsgsM5gdHMf4zF2zNa0qHDIBQx8YL_S55P-SnQrZUJgrwqtQWfJ4v8z7WOduad7vpjRgXnnsfavwHhjKn_uBAu9DuEIXV3YPY-Idcp_StfR_QK7R03GrEBZubSYjNme_nJ1W7Xofwu75g8bqHnDcW0FBzmgQQcnXquYRyZT9M0yercg4jPpk_DsXRFsKR9jDi8nccV9l9CINKkf-B7YlAP1moGc1NHJq4uI0UZ-92od3WBuWUEBhY-Dzixlb3AeU3vczQKbrN9OwnFfrCr96tTqtfkWtGI8PCCRm4DK5bi2Q55LgrZCOMRNABOWOsa7DmK5GAyskKDFMy0J_8C7-kh51S0BrehFGNzlhW8zFWPi47Y2d2Tpx8a9WrZ7xblw3XZUwgOp4RlcCvwY_X3bbo3DN0mEfAa0by14cLBrhBf-eQgW0WxpbpIYhbME5Tqer--PlZQRTaaORuDyCHtdKnpCJL0goZ7NUvClq5KWXSsDKUwH30_4slNeg_zA0Z0hj03Ut1NB-V5UrMecVdPGVc5C3lYbgRVh0Ukp7C0FYNAots_K0sgLVEYlet_uoOL4eOK3fngJfA1DFaf9ZoC_xcXqlnKH3aXfGaVFAB7xiwV_0MO75uGbjyHlZPaGPaysmm7N-cDS_Q_dVuA18ev7zV6R1b7-lsrV5PxxEmJSk3uqq4rpwW7SRXYnsBs6CXcgxoLi0qcZy3SN7kwYQvVOeK0QoMVAqAVpF8lmGVhQkXf1RpNSEIqV7pJOT85L24xoIsKqbiPTR6zTN25DNRoSUjntNNxVzZe1rngXfXVt8 Chrome log from successful run locally: 
https://00e9e64bacd6f4d0e52810049e870be5e19c5b0df995139a69-apidata.googleusercontent.com/download/storage/v1_internal/b/chrome-telemetry-output/o/log_37fcdbee-1aeb-418b-be18-44c277c46365?qk=AD5uMEtkr490vT5GE-P0MiGtvZHOr_P1Der3CD1gGL7wrWV-Ax_lhxpf7FHJab7lEpPAlKBoED_FTuTQgW9uyUiCcBjzzQqv9K4L8JgKk9x2zE4pjYLZJlIt2L3PmdE4okZfVGWGkSZg50TpsYt-hKyzNljJKToaizGwfv2NcHE3zDBBlcHsrmULrvMFOQ2QXlrGUlM5scvlUA1M3l-reY98HWhvNoXhcC4krbeSgWFfHtIvEQlm8cUJ8LqHB3wys7uF8dTDmdb67IoFaTDWL1aBNVXUzHzusun1vE4qovviKXQtRh_rm2NxUdrvmRNK9pBwusfH-XuK43O1HTmc9vGvEApnKZJ_yN6RJ3aRc_lWgTwXW2Qm6-bXaVikL4GXW6XRw5XSkREzDp2tVoisxaz4m2P45WyozTHS0yC5rCg85YwgIwp8nSV3EcO7F6XGXgUq9zkbPvdfdhPmZBCZ1mtXKTCXd1ZWl6p8Dc8NVOuqQZpXdVWETxLQSQv45jMCkMFIoH1WFQBytyTuvxLj-31e79-6HiE_3m5rSUnHlDR-DFZPvh6B3XMETFKw8N-AHZVjfyIx9TOydQhAChKKNf1x-QKzuzCPhtux4pY6LsQ__Cs2nlquq3YeoraktML3Nnrdnwj1MKR5IMoEtljqWjP_hh7Sksyb2EGfMEWVVQGvSey1QFteLgbhb5G1jXk0pv3cepWt3B6SAj2bgnz8wuiHFo5_gGNAcYQjcbtnGkGoDhI_COAgaaFi0jD_s1tr0oqyKJyUE-aSeOWWnrYT2jtjhe3RVWrhVG9forQAw-KCcWB-PbAYfml33oLjbwm1dNNB1qTRt0F3nqrlvsn8CRVHvlTnfxQZrFxE0Iio6XpwVzhoj1cvLhA
Notice that in the failed runs, both locally and on the bot, the log contains the lines
"""
[16258:1295:0325/100458:VERBOSE1:service_discovery_device_lister.cc(36)] DeviceListerStart: service_type: _privet._tcp.local
[16258:1295:0325/100458:VERBOSE1:service_discovery_client_mac.mm(130)] CreateServiceWatcher: _privet._tcp.local
[16258:1295:0325/100458:VERBOSE1:service_discovery_client_mac.mm(253)] ServiceWatcherImplMac::Start
[16258:1295:0325/100458:VERBOSE1:service_discovery_device_lister.cc(41)] DiscoverNewDevices: service_type: _privet._tcp.local
[16258:1295:0325/100458:VERBOSE1:service_discovery_client_mac.mm(260)] ServiceWatcherImplMac::DiscoverNewServices
"""
which are missing from the successful run. Suspicious! Furthermore, I observed that when the test fails, the green WebGL triangle fails to draw. Reasons still unknown.
Note that I haven't turned on IOSurfaces for WebGL yet, so this is not using the CA compositing path.
,
Mar 25 2016
I think the service_discovery logging is a red herring. It happens to show up because the test has stalled (thus leaving the browser up for longer), rather than being the root cause.
,
Mar 25 2016
I think this is the relevant line from the log from the bot:
"""
61514:1299:0321/191115:INFO:CONSOLE(15)] "Uncaught TypeError: Cannot read property 'clear' of null", source: (15)
"""
as it's followed by 30 seconds of nothing. clear() gets called from content/test/data/gpu/pixel_webgl_util.js:drawTriangle() and pixel_canvas2d_webgl.html:waitForFinish(). This still raises the question of why the WebGL canvas failed to initialize. I'm going to add better error checking to pixel_canvas2d_webgl.html and re-enable the trace tests.
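A minimal sketch of the kind of error checking meant here (the helper name and failure callback are made up for illustration; getContext() returns null when WebGL context creation fails, which is exactly what produces the "'clear' of null" error):

```javascript
// Hypothetical helper: try the vendor-prefixed context first, then the
// standard one, and report a failure instead of letting a later
// gl.clear(...) throw "Cannot read property 'clear' of null".
function createWebGLContext(canvas, onFailure) {
  var gl = canvas.getContext("experimental-webgl") ||
           canvas.getContext("webgl");
  if (!gl) {
    onFailure("FAILURE: could not create a WebGL context");
    return null;
  }
  return gl;
}
```

In the real harness the failure report would presumably go through domAutomationController.send, so telemetry would see an immediate failure rather than a 30-second timeout.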
,
Mar 25 2016
test_harness_script is different for pixel.py and trace_test.py. This is surprising, since they run the same set of tests.
Excerpt from the script from trace_test.py:
"""
domAutomationController.send = function(msg) {
  // Issue a read pixel to synchronize the gpu process to ensure
  // the asynchronous category enabling is finished.
  var canvas = document.createElement("canvas");
  canvas.width = 1;
  canvas.height = 1;
  var gl = canvas.getContext("webgl");
  gl.clear(gl.COLOR_BUFFER_BIT);
  var id = new Uint8Array(4);
  gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, id);
  domAutomationController._finished = true;
}
"""
,
Mar 25 2016
The trace_test relies on gpu tracing on the GPU process, so in order to make sure the proper trace categories are enabled (they are enabled from the browser process and asynchronously enabled on the GPU process), it issues a gl.readPixels() to synchronize the two processes before the test begins.
,
Mar 28 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/0f496000d1f33c9255621efbf6b88a8185a791cd
commit 0f496000d1f33c9255621efbf6b88a8185a791cd
Author: erikchen <erikchen@chromium.org>
Date: Mon Mar 28 21:36:24 2016
Add safety checks and reenable Canvas2D/WebGL trace_tests.
The trace tests were disabled because they were flakily failing, whereas their pixel_test counterparts were having no problems. This CL makes several small changes.
1. Additional safety checking in trace_test and pixel_canvas2d_webgl.
2. Rename gl->temp_gl and canvas->temp_canvas to avoid potential namespace collisions with global object in pixel_canvas2d_webgl.
3. Reenable trace tests.
4. For consistency, try to grab the "experimental-webgl" context, then the "webgl" context.
BUG=595754
Review URL: https://codereview.chromium.org/1826343003
Cr-Commit-Position: refs/heads/master@{#383572}
[modify] https://crrev.com/0f496000d1f33c9255621efbf6b88a8185a791cd/content/test/data/gpu/pixel_canvas2d_webgl.html
[modify] https://crrev.com/0f496000d1f33c9255621efbf6b88a8185a791cd/content/test/gpu/gpu_tests/trace_test.py
[modify] https://crrev.com/0f496000d1f33c9255621efbf6b88a8185a791cd/content/test/gpu/gpu_tests/trace_test_expectations.py
,
Apr 5 2016
I'm not seeing any more flakes.
Comment 1 by erikc...@chromium.org, Mar 17 2016. Status: Assigned (was: Untriaged)