context_lost tests failed with FATAL:sync_point_manager.cc(256): Check failed: release > fence_sync_release_ (1 vs. 12)
Issue description

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/216312 (stdout attached) failed the context_lost tests because of the following assertion failure, which took down the renderer process:

[23526:1299:0422/151426:FATAL:sync_point_manager.cc(256)] Check failed: release > fence_sync_release_ (1 vs. 12)
0 Chromium Framework 0x0000000102d610d3 _ZN4base5debug10StackTraceC1Ev + 19
1 Chromium Framework 0x0000000102d7ee99 _ZN7logging10LogMessageD2Ev + 73
2 Chromium Framework 0x0000000103d1ad94 _ZN3gpu20SyncPointClientState22ReleaseFenceSyncLockedEyPNSt3__16vectorIN4base8CallbackIFvvELNS3_8internal8CopyModeE1EEENS1_9allocatorIS8_EEEE + 116
3 Chromium Framework 0x0000000103d1ac9b _ZN3gpu20SyncPointClientState16ReleaseFenceSyncEy + 75
4 Chromium Framework 0x0000000103d1b676 _ZN3gpu15SyncPointClient16ReleaseFenceSyncEy + 166
5 Chromium Framework 0x0000000103e21bb8 _ZN3gpu20GpuCommandBufferStub13OnCreateImageERK38GpuCommandBufferMsg_CreateImage_Params + 440
6 Chromium Framework 0x0000000103e218ee _ZN3IPC8MessageTI36GpuCommandBufferMsg_CreateImage_MetaNSt3__15tupleIJ38GpuCommandBufferMsg_CreateImage_ParamsEEEvE8DispatchIN3gpu20GpuCommandBufferStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_ + 126
7 Chromium Framework 0x0000000103e1de59 _ZN3gpu20GpuCommandBufferStub17OnMessageReceivedERKN3IPC7MessageE + 1753
8 Chromium Framework 0x0000000103a3af6c _ZN3IPC13MessageRouter12RouteMessageERKNS_7MessageE + 140
9 Chromium Framework 0x0000000103e17ca4 _ZN3gpu10GpuChannel19HandleMessageHelperERKN3IPC7MessageE + 148
10 Chromium Framework 0x0000000103e17b14 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 708
11 Chromium Framework 0x0000000103e19e42 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
12 Chromium Framework 0x0000000102d6160b _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
13 Chromium Framework 0x0000000102d8f483 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
14 Chromium Framework 0x0000000102d8f79c _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
15 Chromium Framework 0x0000000102d8f98b _ZN4base11MessageLoop6DoWorkEv + 299
16 Chromium Framework 0x0000000102d52be3 _ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 51
17 Chromium Framework 0x0000000102d8010a _ZN4base3mac15CallWithEHFrameEU13block_pointerFvvE + 10
18 Chromium Framework 0x0000000102d52594 _ZN4base24MessagePumpCFRunLoopBase13RunWorkSourceEPv + 68
19 CoreFoundation 0x00007fff9b32ea01 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
20 CoreFoundation 0x00007fff9b320b8d __CFRunLoopDoSources0 + 269
21 CoreFoundation 0x00007fff9b3201bf __CFRunLoopRun + 927
22 CoreFoundation 0x00007fff9b31fbd8 CFRunLoopRunSpecific + 296
23 Chromium Framework 0x0000000102d5300f _ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 79
24 Chromium Framework 0x0000000102d52a1f _ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 127
25 Chromium Framework 0x0000000102d8efd7 _ZN4base11MessageLoop10RunHandlerEv + 215
26 Chromium Framework 0x0000000102dade87 _ZN4base7RunLoop3RunEv + 183
27 Chromium Framework 0x0000000102d8e267 _ZN4base11MessageLoop3RunEv + 103
28 Chromium Framework 0x0000000102d02682 _ZN7content7GpuMainERKNS_18MainFunctionParamsE + 3090
29 Chromium Framework 0x0000000102d0fe3d _ZN7content21ContentMainRunnerImpl3RunEv + 253
30 Chromium Framework 0x0000000102d0f076 _ZN7content11ContentMainERKNS_17ContentMainParamsE + 54
31 Chromium Framework 0x00000001026fbba2 ChromeMain + 66
32 Chromium Helper 0x0000000102694d62 main + 530
33 Chromium Helper 0x0000000102694b44 start + 52
34 ??? 0x0000000000000010 0x0 + 16

This could be one of the reasons the context_lost tests are failing on other platforms. David, could you please investigate this? Thanks.
,
Apr 27 2016
The test that failed was ContextLost.WebGLContextLostFromGPUProcessExit, so yes, the GPU process had exited. The client needs to handle this gracefully. I filed the bug; no need to CC me. :)
,
Apr 27 2016
I don't think the client is "resetting" things. The CommandBufferProxyImpl (which manages the fence namespace) and the GpuChannelHost (which manages the connection to the GPU process) are bound together and bound to a given GPU process. After a context loss, if we recreate a context, a new CommandBufferProxyImpl is created (together with a new GpuCommandBufferStub, hence a new SyncPointClient/SyncPointClientState) - in other words, a new namespace altogether. If need be (if the GPU process exited too), a new GpuChannelHost to a new GPU process is also created, but that shouldn't change things. I think fences are being sent out of order from the client side for some reason. I wouldn't be surprised if we're missing a lock somewhere. Maybe it's related to GPU rasterization? Re the original report: that stack trace would have taken down the GPU process, not the renderer process.
,
Apr 27 2016
When I said reset I meant through recreation. I don't see how this can be an out-of-order issue, though, because on the client side we check the fence sync count: https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/command_buffer_proxy_impl.cc&q=GpuCommandBufferMsg_CreateImage_Params&sq=package:chromium&type=cs&l=425 This is done right before the GpuCommandBufferMsg_CreateImage IPC is sent, and these fence syncs are per stream, so they are generated only on a single thread. So assuming we got 1 here (as the DCHECK indicates), this is the very first fence sync being created, and no other thread would have access to it. Assuming the channel_id/route_id pair is unique, the only case I can think of where this could happen is if the client side was recreated but somehow the GPU side was not. I tried running this locally on my machine to no avail; has this happened on the bots again?
,
Apr 28 2016
@#4: multiple threads could be using the same context on the client side. They are supposed to use a lock for mutual exclusion, but I could imagine one is missing. Without the lock, what could happen is:
1. Thread 1 is inside CommandBufferProxyImpl::CreateImage, calls GenerateFenceSyncRelease(), and passes the DCHECK.
2. Thread 2 enqueues a bunch of commands into the command buffer that include fences (calls GenerateFenceSyncRelease() multiple times), and flushes.
3. Thread 1 (still inside CommandBufferProxyImpl::CreateImage) sends the GpuCommandBufferMsg_CreateImage IPC.
That would cause this. The context lock is supposed to be there to make sure steps 1 and 3 happen atomically (relative to step 2), but if it's missing we'd have a problem. It could be something else, but I have trouble imagining what could cause an old CommandBufferProxyImpl to be magically connected to a new GpuCommandBufferStub - the latter are only created by the GpuChannelMsg_CreateCommandBuffer message, which is sent by GpuChannelHost::CreateCommandBuffer, which also creates a CommandBufferProxyImpl. GpuChannelHost picks a route id (they are not reused) and gives the same one to the CommandBufferProxyImpl and to the GpuCommandBufferStub. Another option is an out-of-order message issue on the service side, but I don't know what could cause that.
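The interleaving above can be sketched with a small, deterministic Python model. This is illustrative only, not Chromium code: ServiceState and ClientProxy are hypothetical stand-ins for SyncPointClientState and CommandBufferProxyImpl, and the interleaving is replayed sequentially rather than with real threads.

```python
class ServiceState:
    """Models the GPU-side release check in SyncPointClientState."""
    def __init__(self):
        self.fence_sync_release = 0

    def release_fence_sync(self, release):
        # Mirrors: CHECK(release > fence_sync_release_)
        if release <= self.fence_sync_release:
            raise AssertionError(
                "Check failed: release > fence_sync_release_ "
                "(%d vs. %d)" % (release, self.fence_sync_release))
        self.fence_sync_release = release


class ClientProxy:
    """Models the client-side fence sync counter."""
    def __init__(self):
        self.next_fence_sync_release = 1

    def generate_fence_sync_release(self):
        release = self.next_fence_sync_release
        self.next_fence_sync_release += 1
        return release


service = ServiceState()
client = ClientProxy()

# Step 1: thread 1 generates release 1 inside CreateImage and passes the DCHECK.
image_release = client.generate_fence_sync_release()

# Step 2: thread 2 sneaks in, generates releases 2..12 and flushes them.
for _ in range(11):
    service.release_fence_sync(client.generate_fence_sync_release())

# Step 3: thread 1 finally sends CreateImage carrying release 1 -> CHECK fires.
try:
    service.release_fence_sync(image_release)
    outcome = "no check failure"
except AssertionError as e:
    outcome = str(e)
print(outcome)
```

With this interleaving the model reproduces exactly the "(1 vs. 12)" failure from the original report.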
,
Apr 28 2016
I don't see the lock being checked in CommandBufferProxyImpl::OrderingBarrier, and that accesses next_fence_sync_release. https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/command_buffer_proxy_impl.cc&rcl=1461785041&l=258
,
Apr 28 2016
Good point, we should add a CheckLock there and see if it trips on something.
,
Apr 28 2016
It looks like the CheckLock() did not trigger anything on the bots. I also tried syncing back to the revision that was failing and adding the CheckLock() locally (in case it had been fixed since), but the test still passed 50 times. We have CheckLock() in all 3 locations where flushed_fence_sync_release_ is touched, so that shouldn't be a problem. Although I can see some places where we could have stricter DCHECKs. For example, "DCHECK_LE(image_fence_sync - 1, flushed_fence_sync_release_)" should probably be "DCHECK_EQ(image_fence_sync, flushed_fence_sync_release_ + 1)". If for some reason the generated fence sync is less than the flushed fence sync release, perhaps the above callstack could occur.
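The difference between the two assertions can be shown with a toy Python model (hypothetical stand-ins for the DCHECK macros, not the Chromium code): the LE form tolerates a stale fence sync, while the EQ form accepts only the very next release.

```python
# Assume 11 fence syncs have already been flushed on this stream.
flushed_fence_sync_release = 11

def check_le(image_fence_sync):
    # DCHECK_LE(image_fence_sync - 1, flushed_fence_sync_release_):
    # also passes for any already-flushed (stale) fence sync.
    return image_fence_sync - 1 <= flushed_fence_sync_release

def check_eq(image_fence_sync):
    # DCHECK_EQ(image_fence_sync, flushed_fence_sync_release_ + 1):
    # only the very next release is acceptable.
    return image_fence_sync == flushed_fence_sync_release + 1

# A stale fence sync (1 when 11 were already flushed) slips past the LE
# check but is caught by the EQ check; the legitimate next release (12)
# passes both.
print(check_le(1), check_eq(1))
print(check_le(12), check_eq(12))
```

Under this model, a CreateImage carrying a stale fence sync would sail through the LE variant on the client and only blow up later as the service-side CHECK in the original report.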
,
Apr 28 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/46893a3125cca64012e98924ecd93756841f2c98

commit 46893a3125cca64012e98924ecd93756841f2c98
Author: dyen <dyen@chromium.org>
Date: Thu Apr 28 21:10:52 2016

Added a lock check for CommandBufferProxyImpl::OrderingBarrier().

R=piman@chromium.org
BUG=606112
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
Review-Url: https://codereview.chromium.org/1924353002
Cr-Commit-Position: refs/heads/master@{#390484}

[modify] https://crrev.com/46893a3125cca64012e98924ecd93756841f2c98/gpu/ipc/client/command_buffer_proxy_impl.cc
,
Apr 29 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97

commit 4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97
Author: dyen <dyen@chromium.org>
Date: Fri Apr 29 01:53:57 2016

Added more strict checks for fence sync counters.

R=piman@chromium.org
BUG=606112
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
Review-Url: https://codereview.chromium.org/1933473002
Cr-Commit-Position: refs/heads/master@{#390569}

[modify] https://crrev.com/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97/gpu/ipc/client/command_buffer_proxy_impl.cc
[modify] https://crrev.com/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97/gpu/ipc/client/command_buffer_proxy_impl.h
[modify] https://crrev.com/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97/gpu/ipc/service/gpu_command_buffer_stub.cc
,
May 3 2016
Since David is out, it is urgent that progress be made on this. It is affecting the commit queue. See Issue 608923. Sunny, can you help?

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/220473 is another instance:

[ RUN ] ContextLost.WebGLContextLostFromGPUProcessExit
(INFO) 2016-05-01 15:02:54,955 desktop_browser_backend.GetBrowserStartupArgs:242 Requested remote debugging port: 0
(INFO) 2016-05-01 15:02:54,955 desktop_browser_backend.Start:274 Starting Chrome ['/b/swarm_slave/work/isolated/runOdxONx/out/Release/Chromium.app/Contents/MacOS/Chromium', '--js-flags=--expose-gc', '--enable-logging=stderr', '--disable-domain-blocking-for-3d-apis', '--disable-gpu-process-crash-limit', '--enable-gpu-benchmarking', '--enable-net-benchmarking', '--metrics-recording-only', '--no-default-browser-check', '--no-first-run', '--enable-gpu-benchmarking', '--disable-background-networking', '--no-proxy-server', '--disable-component-extensions-with-background-pages', '--disable-default-apps', '--remote-debugging-port=0', '--enable-crash-reporter-for-testing', '--window-size=1280,1024', '--user-data-dir=/b/swarm_slave/work/isolated/tmpJNsM3D/tmpkV1PPu', 'about:blank']
2016-05-01 15:02:55.230 Chromium[6151:25035] NSWindow warning: adding an unknown subview: <FullSizeContentView: 0x7fa7d598ebc0>
2016-05-01 15:02:55.230 Chromium[6151:25035] Call stack: ( "+callStackSymbols disabled for performance reasons" )
(INFO) 2016-05-01 15:02:55,267 desktop_browser_backend.HasBrowserFinishedLaunching:229 Discovered ephemeral port 49331
[6151:89103:0501/150255:WARNING:simple_synchronous_entry.cc(929)] Could not open platform files for entry.
[6151:1299:0501/150255:INFO:CONSOLE(29)] "Harness injected.", source: (29)
Received signal 11 SEGV_MAPERR 000000000000
[0x000102dcfb86] [0x7fff93cb0f1a] [0x000000000000] [0x000102d6c292] [0x000102d6b189] [0x000106fa5f22] [0x000102d6ca2b] [0x000103aa0388] [0x000102dd01db] [0x000102dfe3f3] [0x000102dfe70c] [0x000102dfea5b] [0x000102dc1863] [0x000102deee0a] [0x000102dc1214] [0x7fff99565a01] [0x7fff99557b8d] [0x7fff995571bf] [0x7fff99556bd8] [0x000102dc1c8f] [0x000102dc169f] [0x000102dfdf47] [0x000102e1d113] [0x000102dfd1d7] [0x000102d71457] [0x000102d7eb3d] [0x000102d7dd76] [0x00010276a07a] [0x000102702d62] [0x000102702b44] [0x000000000013]
[end of stack trace]
[6158:16899:0501/150256:ERROR:gpu_channel_host.cc(229)] Failed to send GpuChannelMsg_CreateCommandBuffer.
[6158:16899:0501/150256:ERROR:webgraphicscontext3d_command_buffer_impl.cc(129)] GpuChannelHost failed to create command buffer.
[6158:16899:0501/150256:ERROR:webgraphicscontext3d_command_buffer_impl.cc(153)] Failed to initialize command buffer.
[6158:16899:0501/150256:ERROR:webgraphicscontext3d_command_buffer_impl.cc(199)] Failed to initialize context.
[6151:1299:0501/150256:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[6151:1299:0501/150256:ERROR:gpu_process_transport_factory.cc(729)] Lost UI shared context.
[6159:1299:0501/150256:FATAL:gpu_command_buffer_stub.cc(1046)] Check failed: image_release_count == sync_point_client_->client_state()->fence_sync_release() + 1 (1 vs. 12)
0 Chromium Framework 0x000000010ebfeca3 _ZN4base5debug10StackTraceC1Ev + 19
1 Chromium Framework 0x000000010ec1cbd7 _ZN7logging10LogMessageD2Ev + 71
2 Chromium Framework 0x000000010fcc554c _ZN3gpu20GpuCommandBufferStub13OnCreateImageERK38GpuCommandBufferMsg_CreateImage_Params + 636
3 Chromium Framework 0x000000010fcc51be _ZN3IPC8MessageTI36GpuCommandBufferMsg_CreateImage_MetaNSt3__15tupleIJ38GpuCommandBufferMsg_CreateImage_ParamsEEEvE8DispatchIN3gpu20GpuCommandBufferStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_ + 126
4 Chromium Framework 0x000000010fcc1485 _ZN3gpu20GpuCommandBufferStub17OnMessageReceivedERKN3IPC7MessageE + 1845
5 Chromium Framework 0x000000010f8ddeac _ZN3IPC13MessageRouter12RouteMessageERKNS_7MessageE + 140
6 Chromium Framework 0x000000010fcbb2b4 _ZN3gpu10GpuChannel19HandleMessageHelperERKN3IPC7MessageE + 148
7 Chromium Framework 0x000000010fcbb124 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 708
8 Chromium Framework 0x000000010fcbd462 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
9 Chromium Framework 0x000000010ebff1db _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
10 Chromium Framework 0x000000010ec2d3f3 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
11 Chromium Framework 0x000000010ec2d70c _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
12 Chromium Framework 0x000000010ec2da5b _ZN4base11MessageLoop6DoWorkEv + 299
13 Chromium Framework 0x000000010ebf0863 _ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 51
14 Chromium Framework 0x000000010ec1de0a _ZN4base3mac15CallWithEHFrameEU13block_pointerFvvE + 10
15 Chromium Framework 0x000000010ebf0214 _ZN4base24MessagePumpCFRunLoopBase13RunWorkSourceEPv + 68
16 CoreFoundation 0x00007fff99565a01 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
17 CoreFoundation 0x00007fff99557b8d __CFRunLoopDoSources0 + 269
18 CoreFoundation 0x00007fff995571bf __CFRunLoopRun + 927
19 CoreFoundation 0x00007fff99556bd8 CFRunLoopRunSpecific + 296
20 Chromium Framework 0x000000010ebf0c8f _ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 79
21 Chromium Framework 0x000000010ebf069f _ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 127
22 Chromium Framework 0x000000010ec2cf47 _ZN4base11MessageLoop10RunHandlerEv + 215
23 Chromium Framework 0x000000010ec4c113 _ZN4base7RunLoop3RunEv + 51
24 Chromium Framework 0x000000010ec2c1d7 _ZN4base11MessageLoop3RunEv + 103
25 Chromium Framework 0x000000010eba0457 _ZN7content7GpuMainERKNS_18MainFunctionParamsE + 4087
26 Chromium Framework 0x000000010ebadb3d _ZN7content21ContentMainRunnerImpl3RunEv + 253
27 Chromium Framework 0x000000010ebacd76 _ZN7content11ContentMainERKNS_17ContentMainParamsE + 54
28 Chromium Framework 0x000000010e59907a ChromeMain + 58
29 Chromium Helper 0x000000010e531d62 main + 530
30 Chromium Helper 0x000000010e531b44 start + 52
31 ??? 0x0000000000000013 0x0 + 19
[6151:1299:0501/150256:INFO:CONSOLE(0)] "GL_OUT_OF_MEMORY : glCreateGpuMemoryBufferImageCHROMIUM: image_id < 0", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[6151:1299:0501/150256:INFO:CONSOLE(0)] "GL_INVALID_OPERATION : glGenSyncTokenCHROMIUM: fence sync must be flushed before generating sync token", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[6151:1299:0501/150256:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
Waiting for page to finish.
Traceback (most recent call last):
  File "/b/swarm_slave/work/isolated/runOdxONx/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 84, in _RunStoryAndProcessErrorIfNeeded
    state.RunStory(results)
  File "/b/swarm_slave/work/isolated/runOdxONx/content/test/gpu/gpu_tests/gpu_test_base.py", line 111, in RunStory
    RunStoryWithRetries(GpuSharedPageState, self, results)
  File "/b/swarm_slave/work/isolated/runOdxONx/content/test/gpu/gpu_tests/gpu_test_base.py", line 72, in RunStoryWithRetries
    super(cls, shared_page_state).RunStory(results)
  File "/b/swarm_slave/work/isolated/runOdxONx/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 326, in RunStory
    self._current_page, self._current_tab, results)
  File "/b/swarm_slave/work/isolated/runOdxONx/content/test/gpu/gpu_tests/context_lost.py", line 160, in ValidateAndMeasurePage
    'Test failed (context not restored properly?)')
Failure: Test failed (context not restored properly?)
[6151:1299:0501/150256:INFO:CONSOLE(0)] "GL_INVALID_VALUE : glWaitSyncTokenCHROMIUM: Cannot wait on sync_token which has not been verified", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[ FAILED ] ContextLost.WebGLContextLostFromGPUProcessExit (2145 ms)
,
May 9 2016
Victor, it is urgent that an engineer be assigned to this. It is affecting Chromium's commit queue. As one random example, please see https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/224274:

[14645:16643:0509/085328:ERROR:gpu_channel_host.cc(229)] Failed to send GpuChannelMsg_CreateCommandBuffer.
[14645:16643:0509/085328:ERROR:context_provider_command_buffer.cc(153)] GpuChannelHost failed to create command buffer.
[14638:1299:0509/085328:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:62882/webgl.html?query=kill_after_notification (0)
[14638:1299:0509/085328:ERROR:gpu_process_transport_factory.cc(754)] Lost UI shared context.
[14646:1299:0509/085328:FATAL:gpu_command_buffer_stub.cc(1047)] Check failed: image_release_count == sync_point_client_->client_state()->fence_sync_release() + 1 (1 vs. 10)
0 Chromium Framework 0x0000000104d21183 _ZN4base5debug10StackTraceC1Ev + 19
1 Chromium Framework 0x0000000104d3f077 _ZN7logging10LogMessageD2Ev + 71
2 Chromium Framework 0x0000000105ec852c _ZN3gpu20GpuCommandBufferStub13OnCreateImageERK38GpuCommandBufferMsg_CreateImage_Params + 636
3 Chromium Framework 0x0000000105ec81bb _ZN3IPC8MessageTI36GpuCommandBufferMsg_CreateImage_MetaNSt3__15tupleIJ38GpuCommandBufferMsg_CreateImage_ParamsEEEvE8DispatchIN3gpu20GpuCommandBufferStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_ + 123
4 Chromium Framework 0x0000000105ec43fd _ZN3gpu20GpuCommandBufferStub17OnMessageReceivedERKN3IPC7MessageE + 1293
5 Chromium Framework 0x0000000105afcd7c _ZN3IPC13MessageRouter12RouteMessageERKNS_7MessageE + 140
6 Chromium Framework 0x0000000105ebe414 _ZN3gpu10GpuChannel19HandleMessageHelperERKN3IPC7MessageE + 148
7 Chromium Framework 0x0000000105ebe1f9 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 665
8 Chromium Framework 0x0000000105ec05d2 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
9 Chromium Framework 0x0000000104d2169b _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
10 Chromium Framework 0x0000000104d4f853 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
11 Chromium Framework 0x0000000104d4fb5c _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
12 Chromium Framework 0x0000000104d4feab _ZN4base11MessageLoop6DoWorkEv + 299

While this affects only a corner case (context lost), this code path must remain reliable, and its flakiness is affecting the commit queue.
,
May 9 2016
I can take a look at this but not until tomorrow.
,
May 9 2016
Thanks Sunny. Reassigning to you.
,
May 9 2016
Ok, I think I found the bug. When CreateImageCHROMIUM is called, we try to send a sync IPC to validate the sync token here: https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/gpu_channel_host.cc&rcl=1462792609&l=354 But the StreamFlushInfo for the stream doesn't exist until it has been flushed/OrderingBarrier'd: https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/gpu_channel_host.cc&rcl=1462792609&l=145 Calling operator[] on stream_flush_info_ creates the StreamFlushInfo, and there's no other way it's created. I'll try a speculative fix for this tomorrow; hopefully that will fix it.
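The operator[] pitfall described above can be mimicked in Python with defaultdict; this is an illustrative analog, not the Chromium code, and the StreamFlushInfo fields shown are hypothetical simplifications: in C++, map::operator[] default-constructs a missing entry, so merely querying stream_flush_info_ fabricates a fresh, zeroed StreamFlushInfo.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class StreamFlushInfo:
    # Hypothetical fields for illustration only.
    flushed_stream_flush_count: int = 0
    verified_stream_flush_count: int = 0

# defaultdict mirrors std::map::operator[]: a lookup of a missing key
# silently inserts a default-constructed value.
stream_flush_info = defaultdict(StreamFlushInfo)

# No flush/OrderingBarrier has happened on stream 3 yet, but this
# operator[]-style lookup creates an entry whose counters claim
# "nothing pending" - which could make a caller skip validation work.
info = stream_flush_info[3]
print(3 in stream_flush_info, info.flushed_stream_flush_count)
```

A dict.get(3) lookup, by contrast, would have returned None and made the "no flush yet" state explicit instead of fabricating one.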
,
May 10 2016
FYI, more failures as reported in Issue 608923:
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/225077
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/225056
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/224587
,
May 10 2016
Re #16: I'm not sure I follow here. How does the lack of the StreamFlushInfo affect the fence release order on the service side? Also, the synchronous IPC is sent no matter what, since EnsureWorkVisible() sets force_validate to true. Even if that weren't the case and no synchronous IPC were sent for some reason, this still wouldn't trigger the DCHECK. In that case, we would have a race condition only if the sync token was passed to another process and waited on. But even that race condition would only show up as a no-op on the service side, since we ignore sync tokens which do not exist.
,
May 11 2016
I'm back now, so I'll take this bug back, but feel free to continue looking at it since I still have no idea how this could happen.
,
May 11 2016
I ran this patch locally (https://codereview.chromium.org/1972443002/) 50 times and actually got it to fail. Here is the callstack:

[21738:1295:0511/104328:FATAL:sync_point_manager.cc(259)] Check failed: release == fence_sync_release_ + 1 || release == UINT64_MAX.
0 Chromium Framework 0x000000010493e123 _ZN4base5debug10StackTraceC1Ev + 19
1 Chromium Framework 0x000000010495bee9 _ZN7logging10LogMessageD2Ev + 73
2 Chromium Framework 0x00000001058f7c76 _ZN3gpu20SyncPointClientState22ReleaseFenceSyncLockedEyPNSt3__16vectorIN4base8CallbackIFvvELNS3_8internal8CopyModeE1EEENS1_9allocatorIS8_EEEE + 182
3 Chromium Framework 0x00000001058f7035 _ZN3gpu20SyncPointClientState14EnsureReleasedEy + 85
4 Chromium Framework 0x00000001058f6f5d _ZN3gpu18SyncPointOrderData26BeginProcessingOrderNumberEj + 1085
5 Chromium Framework 0x00000001059ecb3c _ZN3gpu22GpuChannelMessageQueue22BeginMessageProcessingEv + 252
6 Chromium Framework 0x00000001059f52b2 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 34
7 Chromium Framework 0x00000001059f7882 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
8 Chromium Framework 0x000000010493e65b _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
9 Chromium Framework 0x000000010496c4c3 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
10 Chromium Framework 0x000000010496c7dc _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
11 Chromium Framework 0x000000010496c9cb _ZN4base11MessageLoop6DoWorkEv + 299
12 Chromium Framework 0x000000010492fc73 _ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 51
13 Chromium Framework 0x000000010495d15a _ZN4base3mac15CallWithEHFrameEU13block_pointerFvvE + 10
14 Chromium Framework 0x000000010492f624 _ZN4base24MessagePumpCFRunLoopBase13RunWorkSourceEPv + 68
15 CoreFoundation 0x00007fff951b7881 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
16 CoreFoundation 0x00007fff95196fbc __CFRunLoopDoSources0 + 556
17 CoreFoundation 0x00007fff951964df __CFRunLoopRun + 927
18 CoreFoundation 0x00007fff95195ed8 CFRunLoopRunSpecific + 296
19 Chromium Framework 0x000000010493009f _ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 79
20 Chromium Framework 0x000000010492faaf _ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 127
21 Chromium Framework 0x000000010496c017 _ZN4base11MessageLoop10RunHandlerEv + 215
22 Chromium Framework 0x000000010498aec7 _ZN4base7RunLoop3RunEv + 183
23 Chromium Framework 0x000000010496b2a7 _ZN4base11MessageLoop3RunEv + 103
24 Chromium Framework 0x00000001048e4a82 _ZN7content7GpuMainERKNS_18MainFunctionParamsE + 3090
25 Chromium Framework 0x00000001048ece8d _ZN7content21ContentMainRunnerImpl3RunEv + 253
26 Chromium Framework 0x00000001048ec0c6 _ZN7content11ContentMainERKNS_17ContentMainParamsE + 54
27 Chromium Framework 0x00000001042e29a2 ChromeMain + 66
28 Chromium Helper 0x000000010407bd62 main + 530
29 Chromium Helper 0x000000010407bb44 start + 52
30 ??? 0x000000000000000e 0x0 + 14

Looks like something went wrong with the order numbers. I'll continue to investigate.
,
May 11 2016
Could this have something to do with the new scheduler? It looks like messages are being processed out of order.
,
May 11 2016
The new scheduler hasn't gone in yet.
,
May 11 2016
I've verified that order numbers are indeed (very scarily) being processed out of order. What's more concerning is that the order numbers appear to be not only processed out of order but reused as well:

[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 31
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 32
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 33
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 34
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 35
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e1e980] 36
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e1e980] 37
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 38
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(86)] [0x7fc5c8e20550] 37 < 38
[30402:1295:0511/125756:FATAL:sync_point_manager.cc(264)] Check failed: release == fence_sync_release_ + 1 || release == UINT64_MAX.

Here the first number within the brackets is the pointer to the SyncPointOrderData (owned by the sync point client), and the second number is the order number.
,
May 11 2016
Scratch that last comment: the order numbers aren't being reused, they are just being paused. I think it is actually valid for order numbers to execute out of order as long as they are in order within a single sync point client. So in this case 38 can run before 37 is finished, because 37 and 38 belong to different sync point clients. There does seem to be a bug here, though: the client with 38 should not be releasing the fence sync owned by the client with 37. I will have to think about this some more.
,
May 11 2016
Ok, after thinking about this some more: 37 should never be waiting on 38, so that is the bug here. We must be missing a synchronous IPC before sync token generation. I don't think we are missing the actual call to GenSyncToken, since that would trigger a DCHECK, so we must have some bug where we think we have verified a flush when we haven't really.
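The ordering invariant being violated here can be stated as a one-line predicate. This is a minimal sketch under my reading of the discussion above, not Chromium code: a wait processed at global order number W may only target a release that was promised at an earlier order number R < W, so "37 waiting on 38" is the invalid case.

```python
def valid_wait(wait_order_number, release_order_number):
    # A wait may only depend on a release that was scheduled strictly
    # earlier in the global message order; otherwise satisfying the
    # wait would require processing a later message first.
    return release_order_number < wait_order_number

# A later message waiting on an earlier release is fine.
print(valid_wait(38, 37))
# 37 waiting on 38 - the situation observed in the logs - is invalid.
print(valid_wait(37, 38))
```

Under this model, the out-of-order execution across clients seen in the logs is harmless by itself; it only becomes a bug when combined with a wait that fails this predicate.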
,
May 12 2016
+jbauman@ I also added some logging and noticed that right before the DCHECK failure in CreateImage there are also errors related to CreateAndConsumeTexture (mailbox) and ScheduleCALayer. This suggests that the problem has to do with stale sync tokens in the browser. jbauman@ pointed out that if a renderer frame arrives in the browser after a GPU process crash, we attempt to draw it anyway. This renderer frame can contain resources which reference invalid mailboxes, fence syncs, etc. Moreover, the client id, route id and fence syncs from the renderer are reused, so we could have a WaitSyncToken that's processed on the service side with a fence sync that's ahead of the current fence sync on the renderer. That WaitSyncToken would cause an OrderFence to be inserted that would release the renderer's fence syncs earlier than they should be. One way to fix this correctly would be to version resources based on the GPU process instance they correspond to and discard frames on the browser if there's a mismatch. jbauman@ also mentioned that the previous implementation of sync points would initialize the starting sync point number randomly to avoid this, so that's another option.
,
May 12 2016
I have a patch up which just ignores these kinds of waits: https://codereview.chromium.org/1975663002/
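The behavior change can be contrasted with a rough Python sketch (this is my reading of the approach, not the actual patch; the class and method names are hypothetical stand-ins): previously an unreachable wait force-released the fence sync so the waiter could not deadlock, which let a stale or malicious wait advance a release count it never owned; the new approach satisfies the waiter without mutating the release count.

```python
class ToySyncPointClientState:
    """Hypothetical model of a client's fence sync release counter."""
    def __init__(self):
        self.fence_sync_release = 0

    def is_fence_sync_released(self, release):
        return release <= self.fence_sync_release

    def wait_old(self, release, callback):
        # Old behavior: force-release an unreachable fence sync so the
        # waiter can't deadlock - but this lets a stale wait release
        # fences the client never actually promised.
        if not self.is_fence_sync_released(release):
            self.fence_sync_release = release  # bogus early release
        callback()

    def wait_new(self, release, callback):
        # New behavior: the invalid wait is simply ignored (treated as
        # already satisfied); the release count is left untouched.
        callback()


# A stale WaitSyncToken arrives asking for release 10 on a client that
# has released nothing yet (e.g. ids reused after a GPU process crash).
old_state = ToySyncPointClientState()
old_state.wait_old(10, lambda: None)
old_release = old_state.fence_sync_release  # jumped ahead to 10

new_state = ToySyncPointClientState()
new_state.wait_new(10, lambda: None)
new_release = new_state.fence_sync_release  # still 0
print(old_release, new_release)
```

In both cases the waiter's callback runs, so no deadlock; the difference is that the new scheme no longer corrupts the release count that the later CHECK in gpu_command_buffer_stub.cc depends on.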
,
May 12 2016
I verified that my patch fixes this problem by running the test again 100 times. The DCHECKs no longer trigger even when stale sync tokens are detected.
,
May 12 2016
Excellent work, David. Let's get your patch in ASAP and remove the flaky expectations for any affected tests.
,
May 13 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/9d255f0ca75efea1773fcb718c368a7a6bc4bba4

commit 9d255f0ca75efea1773fcb718c368a7a6bc4bba4
Author: dyen <dyen@chromium.org>
Date: Fri May 13 00:32:33 2016

Invalid sync token waits now automatically get released.

Previously, invalid sync token waits would automatically release the fence sync. This would guard against invalid waits which pretended it would release a fence sync while issuing waits to cause dead locks. This worked, however it would allow an evil renderer to release fences on the browser which would invalidate waits. A better solution is to merely cause the invalid wait to be ignored instead.

R=piman@chromium.org
BUG=606112
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
Review-Url: https://codereview.chromium.org/1975663002
Cr-Commit-Position: refs/heads/master@{#393409}

[modify] https://crrev.com/9d255f0ca75efea1773fcb718c368a7a6bc4bba4/gpu/command_buffer/service/sync_point_manager.cc
[modify] https://crrev.com/9d255f0ca75efea1773fcb718c368a7a6bc4bba4/gpu/command_buffer/service/sync_point_manager.h
[modify] https://crrev.com/9d255f0ca75efea1773fcb718c368a7a6bc4bba4/gpu/command_buffer/service/sync_point_manager_unittest.cc
,
May 13 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/889fa48ee6cbe56579a77382232eaef9293816c2

commit 889fa48ee6cbe56579a77382232eaef9293816c2
Author: dyen <dyen@chromium.org>
Date: Fri May 13 06:14:26 2016

Tentatively enable flaky GpuCrash.GPUProcessCrashesExactlyOnce again.

R=kbr@chromium.org
BUG=608923, 606112
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
Review-Url: https://codereview.chromium.org/1971403002
Cr-Commit-Position: refs/heads/master@{#393454}

[modify] https://crrev.com/889fa48ee6cbe56579a77382232eaef9293816c2/content/test/gpu/gpu_tests/context_lost_expectations.py
,
May 13 2016
Comment 1 by dyen@chromium.org, Apr 25 2016