Issue 606112

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Email to this user bounced
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 603329
issue 608923




context_lost tests failed with FATAL:sync_point_manager.cc(256): Check failed: release > fence_sync_release_ (1 vs. 12)

Project Member Reported by kbr@chromium.org, Apr 23 2016

Issue description

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/216312 (stdout attached) failed context_lost tests because of the following assertion failure which took down the renderer process:

[23526:1299:0422/151426:FATAL:sync_point_manager.cc(256)] Check failed: release > fence_sync_release_ (1 vs. 12)
0   Chromium Framework                  0x0000000102d610d3 _ZN4base5debug10StackTraceC1Ev + 19
1   Chromium Framework                  0x0000000102d7ee99 _ZN7logging10LogMessageD2Ev + 73
2   Chromium Framework                  0x0000000103d1ad94 _ZN3gpu20SyncPointClientState22ReleaseFenceSyncLockedEyPNSt3__16vectorIN4base8CallbackIFvvELNS3_8internal8CopyModeE1EEENS1_9allocatorIS8_EEEE + 116
3   Chromium Framework                  0x0000000103d1ac9b _ZN3gpu20SyncPointClientState16ReleaseFenceSyncEy + 75
4   Chromium Framework                  0x0000000103d1b676 _ZN3gpu15SyncPointClient16ReleaseFenceSyncEy + 166
5   Chromium Framework                  0x0000000103e21bb8 _ZN3gpu20GpuCommandBufferStub13OnCreateImageERK38GpuCommandBufferMsg_CreateImage_Params + 440
6   Chromium Framework                  0x0000000103e218ee _ZN3IPC8MessageTI36GpuCommandBufferMsg_CreateImage_MetaNSt3__15tupleIJ38GpuCommandBufferMsg_CreateImage_ParamsEEEvE8DispatchIN3gpu20GpuCommandBufferStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_ + 126
7   Chromium Framework                  0x0000000103e1de59 _ZN3gpu20GpuCommandBufferStub17OnMessageReceivedERKN3IPC7MessageE + 1753
8   Chromium Framework                  0x0000000103a3af6c _ZN3IPC13MessageRouter12RouteMessageERKNS_7MessageE + 140
9   Chromium Framework                  0x0000000103e17ca4 _ZN3gpu10GpuChannel19HandleMessageHelperERKN3IPC7MessageE + 148
10  Chromium Framework                  0x0000000103e17b14 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 708
11  Chromium Framework                  0x0000000103e19e42 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
12  Chromium Framework                  0x0000000102d6160b _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
13  Chromium Framework                  0x0000000102d8f483 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
14  Chromium Framework                  0x0000000102d8f79c _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
15  Chromium Framework                  0x0000000102d8f98b _ZN4base11MessageLoop6DoWorkEv + 299
16  Chromium Framework                  0x0000000102d52be3 _ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 51
17  Chromium Framework                  0x0000000102d8010a _ZN4base3mac15CallWithEHFrameEU13block_pointerFvvE + 10
18  Chromium Framework                  0x0000000102d52594 _ZN4base24MessagePumpCFRunLoopBase13RunWorkSourceEPv + 68
19  CoreFoundation                      0x00007fff9b32ea01 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
20  CoreFoundation                      0x00007fff9b320b8d __CFRunLoopDoSources0 + 269
21  CoreFoundation                      0x00007fff9b3201bf __CFRunLoopRun + 927
22  CoreFoundation                      0x00007fff9b31fbd8 CFRunLoopRunSpecific + 296
23  Chromium Framework                  0x0000000102d5300f _ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 79
24  Chromium Framework                  0x0000000102d52a1f _ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 127
25  Chromium Framework                  0x0000000102d8efd7 _ZN4base11MessageLoop10RunHandlerEv + 215
26  Chromium Framework                  0x0000000102dade87 _ZN4base7RunLoop3RunEv + 183
27  Chromium Framework                  0x0000000102d8e267 _ZN4base11MessageLoop3RunEv + 103
28  Chromium Framework                  0x0000000102d02682 _ZN7content7GpuMainERKNS_18MainFunctionParamsE + 3090
29  Chromium Framework                  0x0000000102d0fe3d _ZN7content21ContentMainRunnerImpl3RunEv + 253
30  Chromium Framework                  0x0000000102d0f076 _ZN7content11ContentMainERKNS_17ContentMainParamsE + 54
31  Chromium Framework                  0x00000001026fbba2 ChromeMain + 66
32  Chromium Helper                     0x0000000102694d62 main + 530
33  Chromium Helper                     0x0000000102694b44 start + 52
34  ???                                 0x0000000000000010 0x0 + 16


This could be one of the reasons the context_lost tests are failing on other platforms. David, could you please investigate this? Thanks.

 
stdout.txt
117 KB View Download

Comment 1 by dyen@chromium.org, Apr 25 2016

Cc: kbr@chromium.org vmi...@chromium.org piman@chromium.org
This indicates a wider problem not related to sync points.

Basically, what this tells me is that during context loss the client side has reset everything (it is releasing fence 1, the very first fence) but the server side has not reset yet (it thinks the next fence should be 13). Isn't our context-lost case crashing the GPU process? The server should be in a new state.

Are there currently other context lost issues in flight that have to do with initialization?

+piman@ and vmiura@ and kbr@ to see if they know anything related to other context lost issues.

Comment 2 by kbr@chromium.org, Apr 27 2016

Cc: -kbr@chromium.org
The test that failed was ContextLost.WebGLContextLostFromGPUProcessExit, so yes, the GPU process had exited. The client needs to handle this gracefully.

I filed the bug; no need to CC: me. :)

Comment 3 by piman@chromium.org, Apr 27 2016

Cc: ericrk@chromium.org reve...@chromium.org
I don't think the client is "resetting" things. The CommandBufferProxyImpl (which manages the fence namespace) and GpuChannelHost (which manages the connection to the GPU process) are bound together and bound to a given GPU process. After a context loss, if we recreate a context, a new CommandBufferProxyImpl is created (together with a new GpuCommandBufferStub, hence a new SyncPointClient/SyncPointClientState) - in other words, a new namespace altogether. If need be (if the GPU process exited too), a new GpuChannelHost to a new GPU process is also created, but that shouldn't change things.

I think fences are sent out-of-order from the client side for some reason. I wouldn't be surprised if we're missing a lock somewhere. Maybe related to GPU rasterization?


Re the original report: that stack trace would have taken down the GPU process, not the renderer process.

Comment 4 by dyen@chromium.org, Apr 27 2016

When I said reset I meant through recreation. I don't see how this can be an out-of-order issue, though, because on the client side we check the fence sync count:

https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/command_buffer_proxy_impl.cc&q=GpuCommandBufferMsg_CreateImage_Params&sq=package:chromium&type=cs&l=425

This is done right before the GpuCommandBufferMsg_CreateImage IPC is sent, and these fence syncs are per stream, so they are generated only on a single thread. So assuming we got 1 here (as the DCHECK indicates), this is the very first fence sync being created, and no other thread would have access to it.

Assuming the channel_id plus route_id combination is unique, the only case I can think of where this could happen is if the client side was recreated but somehow the GPU side was not.
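
For reference, here is a minimal, self-contained sketch of the invariant being described (simplified, illustrative names - FenceSyncTracker is a stand-in, not the actual CommandBufferProxyImpl):

#include <cassert>
#include <cstdint>

// Stand-in for the client-side fence bookkeeping. In the real code these
// counters live on the command buffer proxy and are protected by the
// shared context lock when the context is used from multiple threads.
class FenceSyncTracker {
 public:
  // Called whenever a fence sync release is inserted into the stream.
  uint64_t GenerateFenceSyncRelease() { return ++next_fence_sync_release_; }

  // Called once the commands up to |release| have been flushed.
  void OnFlush(uint64_t release) { flushed_fence_sync_release_ = release; }

  // The check being discussed: the image's fence sync should be exactly
  // one past the last flushed release, otherwise the service would see
  // releases arrive out of order.
  void CheckBeforeCreateImage(uint64_t image_fence_sync) const {
    assert(image_fence_sync == flushed_fence_sync_release_ + 1);
  }

 private:
  uint64_t next_fence_sync_release_ = 0;
  uint64_t flushed_fence_sync_release_ = 0;
};

int main() {
  FenceSyncTracker tracker;
  uint64_t fence = tracker.GenerateFenceSyncRelease();  // fence == 1
  tracker.CheckBeforeCreateImage(fence);                // 1 == 0 + 1, OK
  tracker.OnFlush(fence);
  return 0;
}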

I tried running this locally on my machine to no avail. Has this happened on the bots again?

Comment 5 by piman@chromium.org, Apr 28 2016

Cc: sunn...@chromium.org
@#4: multiple threads could be using the same context on the client side. They are supposed to use a lock for mutual exclusion, but I could imagine one is missing. Without a lock, what could happen is:
1- thread 1 is inside of CommandBufferProxyImpl::CreateImage, calls GenerateFenceSyncRelease(), and passes the DCHECK
2- thread 2 enqueues a bunch of commands into the command buffer that include fences (calls GenerateFenceSyncRelease() multiple times), and flushes
3- thread 1 (still inside of CommandBufferProxyImpl::CreateImage) sends the GpuCommandBufferMsg_CreateImage IPC

That would cause this. Now, the context lock is supposed to be there to make sure 1 and 3 happen atomically (relative to 2), but if it's missing we'd have a problem.
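
A minimal sketch of that interleaving (illustrative names, not the real Chromium API): holding the shared lock across both the fence generation and the IPC send is what keeps steps 1 and 3 atomic relative to step 2.

#include <cstdint>
#include <mutex>

// |lock| models the shared context lock; the counters model the proxy's
// fence bookkeeping.
struct SharedFenceState {
  std::mutex lock;
  uint64_t next_release = 0;
  uint64_t flushed_release = 0;
};

// Thread 1: CreateImage-like path. If this did NOT take the lock, FlushPath
// below could run between the two statements and push flushed_release past
// image_release, producing exactly the kind of "1 vs. 12" mismatch the
// service reported.
void CreateImagePath(SharedFenceState& s) {
  std::lock_guard<std::mutex> guard(s.lock);
  uint64_t image_release = ++s.next_release;  // step 1: generate the fence
  // ... build and send the CreateImage IPC carrying |image_release| ...
  s.flushed_release = image_release;          // step 3: record the flush
}

// Thread 2: ordinary command submission that also releases fences (step 2).
void FlushPath(SharedFenceState& s) {
  std::lock_guard<std::mutex> guard(s.lock);
  s.flushed_release = ++s.next_release;
}

int main() {
  SharedFenceState state;
  CreateImagePath(state);
  FlushPath(state);
  return 0;
}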


Now, it could be something else, but I have trouble imagining what could cause an old CommandBufferProxyImpl to be magically connected to a new GpuCommandBufferStub - the latter are only created by the GpuChannelMsg_CreateCommandBuffer message, which is sent by GpuChannelHost::CreateCommandBuffer, which also creates a CommandBufferProxyImpl. GpuChannelHost picks a route id (they are not reused) and gives the same one to the CommandBufferProxyImpl and to the GpuCommandBufferStub.

Another option is an out-of-order message issue on the service side, but I don't know what could cause that.
I don't see the lock being checked in CommandBufferProxyImpl::OrderingBarrier and that accesses next_fence_sync_release. https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/command_buffer_proxy_impl.cc&rcl=1461785041&l=258

Comment 7 by piman@chromium.org, Apr 28 2016

Good point, we should add a CheckLock there and see if it trips on something.
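
The CL in comment 9 below adds such a check; as a rough, simplified sketch of the shape of the fix (stand-in names, not the literal patch):

#include <cassert>
#include <cstdint>

// Models "is the shared context lock currently held".
class LockSketch {
 public:
  void Acquire() { held_ = true; }
  void Release() { held_ = false; }
  bool held() const { return held_; }

 private:
  bool held_ = false;
};

class CommandBufferProxySketch {
 public:
  explicit CommandBufferProxySketch(LockSketch* lock) : lock_(lock) {}

  void OrderingBarrier() {
    CheckLock();                 // the added check
    ++next_fence_sync_release_;  // placeholder for the real flush logic
  }

 private:
  void CheckLock() const {
    // Only meaningful when a lock was supplied (i.e. the context is shared).
    if (lock_) assert(lock_->held() && "context lock must be held");
  }

  LockSketch* lock_;
  uint64_t next_fence_sync_release_ = 0;
};

int main() {
  LockSketch lock;
  CommandBufferProxySketch proxy(&lock);
  lock.Acquire();
  proxy.OrderingBarrier();  // passes the check
  lock.Release();
  return 0;
}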

Comment 8 by dyen@chromium.org, Apr 28 2016

It looks like the CheckLock() did not trigger anything on the bots. I also tried syncing back to the revision which was failing and adding the CheckLock() locally (in case the problem has since been fixed), but the test still passed 50 times.

We have CheckLock() in all 3 locations where flushed_fence_sync_release_ is touched, so that shouldn't be a problem. That said, I can see some places where we could have stricter DCHECKs.

For example "DCHECK_LE(image_fence_sync - 1, flushed_fence_sync_release_)" Should probably be a "DHECK_EQ(image_fence_sync, flushed_fence_sync_release_ + 1)". If for some reason the generated fence sync is less than the flushed fence sync release perhaps the above callstack could occur.

Comment 9 by bugdroid1@chromium.org, Apr 28 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/46893a3125cca64012e98924ecd93756841f2c98

commit 46893a3125cca64012e98924ecd93756841f2c98
Author: dyen <dyen@chromium.org>
Date: Thu Apr 28 21:10:52 2016

Added a lock check for CommandBufferProxyImpl::OrderingBarrier().

R=piman@chromium.org
BUG= 606112 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/1924353002
Cr-Commit-Position: refs/heads/master@{#390484}

[modify] https://crrev.com/46893a3125cca64012e98924ecd93756841f2c98/gpu/ipc/client/command_buffer_proxy_impl.cc


Comment 10 by bugdroid1@chromium.org, Apr 29 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97

commit 4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97
Author: dyen <dyen@chromium.org>
Date: Fri Apr 29 01:53:57 2016

Added more strict checks for fence sync counters.

R=piman@chromium.org
BUG= 606112 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/1933473002
Cr-Commit-Position: refs/heads/master@{#390569}

[modify] https://crrev.com/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97/gpu/ipc/client/command_buffer_proxy_impl.cc
[modify] https://crrev.com/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97/gpu/ipc/client/command_buffer_proxy_impl.h
[modify] https://crrev.com/4d075d2b7a85c4c9c5fa8a0d5158f2e74f2d5b97/gpu/ipc/service/gpu_command_buffer_stub.cc

Comment 11 by kbr@chromium.org, May 3 2016

Blocking: 608923

Comment 12 by kbr@chromium.org, May 3 2016

Cc: dyen@chromium.org
Labels: -Pri-2 Pri-1
Owner: sunn...@chromium.org
Since David is out, it is urgent that progress be made on this. It is affecting the commit queue. See Issue 608923. Sunny, can you help?

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/220473 is another occurrence:

[ RUN      ] ContextLost.WebGLContextLostFromGPUProcessExit
(INFO) 2016-05-01 15:02:54,955 desktop_browser_backend.GetBrowserStartupArgs:242  Requested remote debugging port: 0
(INFO) 2016-05-01 15:02:54,955 desktop_browser_backend.Start:274  Starting Chrome ['/b/swarm_slave/work/isolated/runOdxONx/out/Release/Chromium.app/Contents/MacOS/Chromium', '--js-flags=--expose-gc', '--enable-logging=stderr', '--disable-domain-blocking-for-3d-apis', '--disable-gpu-process-crash-limit', '--enable-gpu-benchmarking', '--enable-net-benchmarking', '--metrics-recording-only', '--no-default-browser-check', '--no-first-run', '--enable-gpu-benchmarking', '--disable-background-networking', '--no-proxy-server', '--disable-component-extensions-with-background-pages', '--disable-default-apps', '--remote-debugging-port=0', '--enable-crash-reporter-for-testing', '--window-size=1280,1024', '--user-data-dir=/b/swarm_slave/work/isolated/tmpJNsM3D/tmpkV1PPu', 'about:blank']
2016-05-01 15:02:55.230 Chromium[6151:25035] NSWindow warning: adding an unknown subview: <FullSizeContentView: 0x7fa7d598ebc0>
2016-05-01 15:02:55.230 Chromium[6151:25035] Call stack:
(
    "+callStackSymbols disabled for performance reasons"
)
(INFO) 2016-05-01 15:02:55,267 desktop_browser_backend.HasBrowserFinishedLaunching:229  Discovered ephemeral port 49331
[6151:89103:0501/150255:WARNING:simple_synchronous_entry.cc(929)] Could not open platform files for entry.
[6151:1299:0501/150255:INFO:CONSOLE(29)] "Harness injected.", source:  (29)
Received signal 11 SEGV_MAPERR 000000000000
 [0x000102dcfb86]
 [0x7fff93cb0f1a]
 [0x000000000000]
 [0x000102d6c292]
 [0x000102d6b189]
 [0x000106fa5f22]
 [0x000102d6ca2b]
 [0x000103aa0388]
 [0x000102dd01db]
 [0x000102dfe3f3]
 [0x000102dfe70c]
 [0x000102dfea5b]
 [0x000102dc1863]
 [0x000102deee0a]
 [0x000102dc1214]
 [0x7fff99565a01]
 [0x7fff99557b8d]
 [0x7fff995571bf]
 [0x7fff99556bd8]
 [0x000102dc1c8f]
 [0x000102dc169f]
 [0x000102dfdf47]
 [0x000102e1d113]
 [0x000102dfd1d7]
 [0x000102d71457]
 [0x000102d7eb3d]
 [0x000102d7dd76]
 [0x00010276a07a]
 [0x000102702d62]
 [0x000102702b44]
 [0x000000000013]
[end of stack trace]
[6158:16899:0501/150256:ERROR:gpu_channel_host.cc(229)] Failed to send GpuChannelMsg_CreateCommandBuffer.
[6158:16899:0501/150256:ERROR:webgraphicscontext3d_command_buffer_impl.cc(129)] GpuChannelHost failed to create command buffer.
[6158:16899:0501/150256:ERROR:webgraphicscontext3d_command_buffer_impl.cc(153)] Failed to initialize command buffer.
[6158:16899:0501/150256:ERROR:webgraphicscontext3d_command_buffer_impl.cc(199)] Failed to initialize context.
[6151:1299:0501/150256:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[6151:1299:0501/150256:ERROR:gpu_process_transport_factory.cc(729)] Lost UI shared context.
[6159:1299:0501/150256:FATAL:gpu_command_buffer_stub.cc(1046)] Check failed: image_release_count == sync_point_client_->client_state()->fence_sync_release() + 1 (1 vs. 12)
0   Chromium Framework                  0x000000010ebfeca3 _ZN4base5debug10StackTraceC1Ev + 19
1   Chromium Framework                  0x000000010ec1cbd7 _ZN7logging10LogMessageD2Ev + 71
2   Chromium Framework                  0x000000010fcc554c _ZN3gpu20GpuCommandBufferStub13OnCreateImageERK38GpuCommandBufferMsg_CreateImage_Params + 636
3   Chromium Framework                  0x000000010fcc51be _ZN3IPC8MessageTI36GpuCommandBufferMsg_CreateImage_MetaNSt3__15tupleIJ38GpuCommandBufferMsg_CreateImage_ParamsEEEvE8DispatchIN3gpu20GpuCommandBufferStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_ + 126
4   Chromium Framework                  0x000000010fcc1485 _ZN3gpu20GpuCommandBufferStub17OnMessageReceivedERKN3IPC7MessageE + 1845
5   Chromium Framework                  0x000000010f8ddeac _ZN3IPC13MessageRouter12RouteMessageERKNS_7MessageE + 140
6   Chromium Framework                  0x000000010fcbb2b4 _ZN3gpu10GpuChannel19HandleMessageHelperERKN3IPC7MessageE + 148
7   Chromium Framework                  0x000000010fcbb124 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 708
8   Chromium Framework                  0x000000010fcbd462 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
9   Chromium Framework                  0x000000010ebff1db _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
10  Chromium Framework                  0x000000010ec2d3f3 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
11  Chromium Framework                  0x000000010ec2d70c _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
12  Chromium Framework                  0x000000010ec2da5b _ZN4base11MessageLoop6DoWorkEv + 299
13  Chromium Framework                  0x000000010ebf0863 _ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 51
14  Chromium Framework                  0x000000010ec1de0a _ZN4base3mac15CallWithEHFrameEU13block_pointerFvvE + 10
15  Chromium Framework                  0x000000010ebf0214 _ZN4base24MessagePumpCFRunLoopBase13RunWorkSourceEPv + 68
16  CoreFoundation                      0x00007fff99565a01 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
17  CoreFoundation                      0x00007fff99557b8d __CFRunLoopDoSources0 + 269
18  CoreFoundation                      0x00007fff995571bf __CFRunLoopRun + 927
19  CoreFoundation                      0x00007fff99556bd8 CFRunLoopRunSpecific + 296
20  Chromium Framework                  0x000000010ebf0c8f _ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 79
21  Chromium Framework                  0x000000010ebf069f _ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 127
22  Chromium Framework                  0x000000010ec2cf47 _ZN4base11MessageLoop10RunHandlerEv + 215
23  Chromium Framework                  0x000000010ec4c113 _ZN4base7RunLoop3RunEv + 51
24  Chromium Framework                  0x000000010ec2c1d7 _ZN4base11MessageLoop3RunEv + 103
25  Chromium Framework                  0x000000010eba0457 _ZN7content7GpuMainERKNS_18MainFunctionParamsE + 4087
26  Chromium Framework                  0x000000010ebadb3d _ZN7content21ContentMainRunnerImpl3RunEv + 253
27  Chromium Framework                  0x000000010ebacd76 _ZN7content11ContentMainERKNS_17ContentMainParamsE + 54
28  Chromium Framework                  0x000000010e59907a ChromeMain + 58
29  Chromium Helper                     0x000000010e531d62 main + 530
30  Chromium Helper                     0x000000010e531b44 start + 52
31  ???                                 0x0000000000000013 0x0 + 19

[6151:1299:0501/150256:INFO:CONSOLE(0)] "GL_OUT_OF_MEMORY : glCreateGpuMemoryBufferImageCHROMIUM: image_id < 0", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[6151:1299:0501/150256:INFO:CONSOLE(0)] "GL_INVALID_OPERATION : glGenSyncTokenCHROMIUM: fence sync must be flushed before generating sync token", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[6151:1299:0501/150256:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
Waiting for page to finish.
Traceback (most recent call last):
  File "/b/swarm_slave/work/isolated/runOdxONx/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 84, in _RunStoryAndProcessErrorIfNeeded
    state.RunStory(results)
  File "/b/swarm_slave/work/isolated/runOdxONx/content/test/gpu/gpu_tests/gpu_test_base.py", line 111, in RunStory
    RunStoryWithRetries(GpuSharedPageState, self, results)
  File "/b/swarm_slave/work/isolated/runOdxONx/content/test/gpu/gpu_tests/gpu_test_base.py", line 72, in RunStoryWithRetries
    super(cls, shared_page_state).RunStory(results)
  File "/b/swarm_slave/work/isolated/runOdxONx/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 326, in RunStory
    self._current_page, self._current_tab, results)
  File "/b/swarm_slave/work/isolated/runOdxONx/content/test/gpu/gpu_tests/context_lost.py", line 160, in ValidateAndMeasurePage
    'Test failed (context not restored properly?)')
Failure: Test failed (context not restored properly?)

[6151:1299:0501/150256:INFO:CONSOLE(0)] "GL_INVALID_VALUE : glWaitSyncTokenCHROMIUM: Cannot wait on sync_token which has not been verified", source: http://127.0.0.1:49317/webgl.html?query=kill_after_notification (0)
[  FAILED  ] ContextLost.WebGLContextLostFromGPUProcessExit (2145 ms)

Comment 13 by kbr@chromium.org, May 9 2016

Owner: vmi...@chromium.org
Victor, it is urgent that an engineer be assigned to this. It is affecting Chromium's commit queue. For one example, see https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/224274 :

[14645:16643:0509/085328:ERROR:gpu_channel_host.cc(229)] Failed to send GpuChannelMsg_CreateCommandBuffer.
[14645:16643:0509/085328:ERROR:context_provider_command_buffer.cc(153)] GpuChannelHost failed to create command buffer.
[14638:1299:0509/085328:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:62882/webgl.html?query=kill_after_notification (0)
[14638:1299:0509/085328:ERROR:gpu_process_transport_factory.cc(754)] Lost UI shared context.
[14646:1299:0509/085328:FATAL:gpu_command_buffer_stub.cc(1047)] Check failed: image_release_count == sync_point_client_->client_state()->fence_sync_release() + 1 (1 vs. 10)
0   Chromium Framework                  0x0000000104d21183 _ZN4base5debug10StackTraceC1Ev + 19
1   Chromium Framework                  0x0000000104d3f077 _ZN7logging10LogMessageD2Ev + 71
2   Chromium Framework                  0x0000000105ec852c _ZN3gpu20GpuCommandBufferStub13OnCreateImageERK38GpuCommandBufferMsg_CreateImage_Params + 636
3   Chromium Framework                  0x0000000105ec81bb _ZN3IPC8MessageTI36GpuCommandBufferMsg_CreateImage_MetaNSt3__15tupleIJ38GpuCommandBufferMsg_CreateImage_ParamsEEEvE8DispatchIN3gpu20GpuCommandBufferStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_ + 123
4   Chromium Framework                  0x0000000105ec43fd _ZN3gpu20GpuCommandBufferStub17OnMessageReceivedERKN3IPC7MessageE + 1293
5   Chromium Framework                  0x0000000105afcd7c _ZN3IPC13MessageRouter12RouteMessageERKNS_7MessageE + 140
6   Chromium Framework                  0x0000000105ebe414 _ZN3gpu10GpuChannel19HandleMessageHelperERKN3IPC7MessageE + 148
7   Chromium Framework                  0x0000000105ebe1f9 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 665
8   Chromium Framework                  0x0000000105ec05d2 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
9   Chromium Framework                  0x0000000104d2169b _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
10  Chromium Framework                  0x0000000104d4f853 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
11  Chromium Framework                  0x0000000104d4fb5c _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
12  Chromium Framework                  0x0000000104d4feab _ZN4base11MessageLoop6DoWorkEv + 299


While this affects only a corner case (context lost), this code path must remain reliable and its flakiness is affecting the commit queue.

I can take a look at this but not until tomorrow.

Comment 15 by kbr@chromium.org, May 9 2016

Owner: sunn...@chromium.org
Thanks, Sunny. Reassigning to you.

Ok, I think I found the bug:

When CreateImageCHROMIUM is called we try to send a sync IPC to validate the sync token here: https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/gpu_channel_host.cc&rcl=1462792609&l=354

But the StreamFlushInfo for the stream doesn't exist until it has been flushed/OrderingBarrier'd: https://code.google.com/p/chromium/codesearch#chromium/src/gpu/ipc/client/gpu_channel_host.cc&rcl=1462792609&l=145

Calling operator[] on stream_flush_info_ creates the StreamFlushInfo, and there's no other way it gets created.
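
A self-contained illustration of that std::map behavior (StreamFlushInfoSketch is a stand-in, not the real gpu::StreamFlushInfo):

#include <cstdint>
#include <iostream>
#include <map>

struct StreamFlushInfoSketch {
  uint64_t flushed_release = 0;
  bool flush_pending = false;
};

int main() {
  std::map<int32_t, StreamFlushInfoSketch> stream_flush_info;

  // find() does not create anything: before the first flush/OrderingBarrier
  // there is simply no entry for the stream.
  if (stream_flush_info.find(7) == stream_flush_info.end())
    std::cout << "no flush info for stream 7 yet\n";

  // operator[] silently default-constructs a missing entry, which is the
  // only way the flush info comes into existence in this sketch.
  StreamFlushInfoSketch& info = stream_flush_info[7];
  std::cout << "created entry, flushed_release=" << info.flushed_release
            << "\n";
  return 0;
}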

I'll try a speculative fix for this tomorrow; hopefully that resolves it.

Comment 18 by dyen@chromium.org, May 10 2016

For #16, I'm not sure I follow. How does the lack of a StreamFlushInfo affect the fence release order on the server side?

Also, the synchronous IPC is sent no matter what, since EnsureWorkVisible() sets force_validate to true. Even if this weren't the case and no synchronous IPC were sent for some reason, it still wouldn't trigger the DCHECK. In that case, we would have a race condition only if the sync token were passed to another process and waited on there. But even that race would only show up as a no-op on the server side, since we ignore sync tokens which do not exist.

Comment 19 by dyen@chromium.org, May 11 2016

Owner: dyen@chromium.org
I'm back now, so I'll take this bug back, but feel free to continue looking at it since I still have no idea how this could happen.

Comment 20 by dyen@chromium.org, May 11 2016

I ran this patch locally (https://codereview.chromium.org/1972443002/) 50 times and actually got it to fail. Here is the callstack:

[21738:1295:0511/104328:FATAL:sync_point_manager.cc(259)] Check failed: release == fence_sync_release_ + 1 || release == UINT64_MAX. 
0   Chromium Framework                  0x000000010493e123 _ZN4base5debug10StackTraceC1Ev + 19
1   Chromium Framework                  0x000000010495bee9 _ZN7logging10LogMessageD2Ev + 73
2   Chromium Framework                  0x00000001058f7c76 _ZN3gpu20SyncPointClientState22ReleaseFenceSyncLockedEyPNSt3__16vectorIN4base8CallbackIFvvELNS3_8internal8CopyModeE1EEENS1_9allocatorIS8_EEEE + 182
3   Chromium Framework                  0x00000001058f7035 _ZN3gpu20SyncPointClientState14EnsureReleasedEy + 85
4   Chromium Framework                  0x00000001058f6f5d _ZN3gpu18SyncPointOrderData26BeginProcessingOrderNumberEj + 1085
5   Chromium Framework                  0x00000001059ecb3c _ZN3gpu22GpuChannelMessageQueue22BeginMessageProcessingEv + 252
6   Chromium Framework                  0x00000001059f52b2 _ZN3gpu10GpuChannel13HandleMessageERK13scoped_refptrINS_22GpuChannelMessageQueueEE + 34
7   Chromium Framework                  0x00000001059f7882 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1EEEENS0_9BindStateINS0_15RunnableAdapterIMN3gpu10GpuChannelEFvRK13scoped_refptrINS6_22GpuChannelMessageQueueEEEEEFvPS7_SC_EJNS_7WeakPtrIS7_EESC_EEENS0_12InvokeHelperILb1EvSF_EEFvvEE3RunEPNS0_13BindStateBaseE + 114
8   Chromium Framework                  0x000000010493e65b _ZN4base5debug13TaskAnnotator7RunTaskEPKcRKNS_11PendingTaskE + 187
9   Chromium Framework                  0x000000010496c4c3 _ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 579
10  Chromium Framework                  0x000000010496c7dc _ZN4base11MessageLoop21DeferOrRunPendingTaskERKNS_11PendingTaskE + 44
11  Chromium Framework                  0x000000010496c9cb _ZN4base11MessageLoop6DoWorkEv + 299
12  Chromium Framework                  0x000000010492fc73 _ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 51
13  Chromium Framework                  0x000000010495d15a _ZN4base3mac15CallWithEHFrameEU13block_pointerFvvE + 10
14  Chromium Framework                  0x000000010492f624 _ZN4base24MessagePumpCFRunLoopBase13RunWorkSourceEPv + 68
15  CoreFoundation                      0x00007fff951b7881 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
16  CoreFoundation                      0x00007fff95196fbc __CFRunLoopDoSources0 + 556
17  CoreFoundation                      0x00007fff951964df __CFRunLoopRun + 927
18  CoreFoundation                      0x00007fff95195ed8 CFRunLoopRunSpecific + 296
19  Chromium Framework                  0x000000010493009f _ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 79
20  Chromium Framework                  0x000000010492faaf _ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 127
21  Chromium Framework                  0x000000010496c017 _ZN4base11MessageLoop10RunHandlerEv + 215
22  Chromium Framework                  0x000000010498aec7 _ZN4base7RunLoop3RunEv + 183
23  Chromium Framework                  0x000000010496b2a7 _ZN4base11MessageLoop3RunEv + 103
24  Chromium Framework                  0x00000001048e4a82 _ZN7content7GpuMainERKNS_18MainFunctionParamsE + 3090
25  Chromium Framework                  0x00000001048ece8d _ZN7content21ContentMainRunnerImpl3RunEv + 253
26  Chromium Framework                  0x00000001048ec0c6 _ZN7content11ContentMainERKNS_17ContentMainParamsE + 54
27  Chromium Framework                  0x00000001042e29a2 ChromeMain + 66
28  Chromium Helper                     0x000000010407bd62 main + 530
29  Chromium Helper                     0x000000010407bb44 start + 52
30  ???                                 0x000000000000000e 0x0 + 14

Looks like something went wrong in the order numbers. I'll continue to investigate.

Comment 21 by dyen@chromium.org, May 11 2016

Could this have something to do with the new scheduler? It looks like messages are being processed out of order.
The new scheduler hasn't gone in yet.

Comment 23 by dyen@chromium.org, May 11 2016

I've verified that order numbers are indeed (very scarily) being processed out of order. What's more concerning is that it looks like the order numbers are not only being processed out of order but also being reused:

[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 31
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 32
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 33
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 34
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 35
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e1e980] 36
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e1e980] 37
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(74)] [0x7fc5c8e20550] 38
[30402:1295:0511/125756:ERROR:sync_point_manager.cc(86)] [0x7fc5c8e20550] 37 < 38
[30402:1295:0511/125756:FATAL:sync_point_manager.cc(264)] Check failed: release == fence_sync_release_ + 1 || release == UINT64_MAX.

Here the first number within the brackets is the pointer to the SyncPointOrderData (owned by the sync point client), and the second number is the order number.

Comment 24 by dyen@chromium.org, May 11 2016

Scratch that last comment: the order numbers aren't being reused, they are just being paused. I think it is actually valid for order numbers to be executed out of order as long as they are in order within a single sync point client. So in this case 38 can run before 37 is finished, because 37 and 38 belong to different sync point clients.

There does seem to be a bug here, though: the client with 38 should not be releasing the fence sync owned by the client with 37. I will have to think about this some more.
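
A small sketch of the invariant described here (illustrative names, not the real SyncPointOrderData): order numbers must strictly increase within one stream/client, but numbers belonging to different streams may legitimately interleave globally.

#include <cassert>
#include <cstdint>
#include <map>

class OrderChecker {
 public:
  void Process(int stream_id, uint32_t order_number) {
    uint32_t& last = last_processed_[stream_id];  // defaults to 0
    assert(order_number > last && "per-stream order numbers must increase");
    last = order_number;
  }

 private:
  std::map<int, uint32_t> last_processed_;
};

int main() {
  OrderChecker checker;
  checker.Process(/*stream_id=*/1, 36);
  checker.Process(/*stream_id=*/1, 37);  // stream 1, in order
  checker.Process(/*stream_id=*/2, 38);  // interleaves with stream 1: OK
  // checker.Process(2, 38);             // would assert: a reused number
  return 0;
}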

Comment 25 by dyen@chromium.org, May 11 2016

OK, after thinking about this some more: 37 should never be waiting on 38, so that is the bug here. We must be missing a synchronous IPC before sync token generation.

I don't think we are missing the actual call to GenSyncToken, since that would trigger a DCHECK, so we must have some bug where we think we have verified a flush when we actually haven't.
Cc: jbau...@chromium.org
+jbauman@

I also added some logging and noticed that right before the DCHECK failure in CreateImage there are also errors related to CreateAndConsumeTexture (mailbox) and ScheduleCALayer. This suggests that the problem has to do with stale sync tokens in the browser.

jbauman@ pointed out that if a renderer frame arrives in the browser after a GPU process crash, we attempt to draw it anyway. This renderer frame can contain resources which reference invalid mailboxes, fence syncs, etc. Moreover, the client id, route id and fence syncs from the renderer are reused, so we could have a WaitSyncToken that's processed on the service with a fence sync that's ahead of the current fence sync on the renderer. That WaitSyncToken would cause an OrderFence to be inserted that would release the renderer's fence syncs earlier than they should be.

One way to fix this correctly would be to version resources based on the GPU process instance they correspond to, and discard frames on the browser if there's a mismatch. jbauman@ also mentioned that the previous implementation of sync points would initialize the starting sync point number randomly to avoid this, so that's another option.
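
A rough sketch of the "ignore the invalid wait" direction (simplified, hypothetical names; the actual change is the sync_point_manager.cc patch referenced in comments 27-30): if a wait names a fence sync release that the releasing stream has not even promised yet, drop the wait instead of force-releasing the fence, so a stale wait cannot advance another client's release count.

#include <cstdint>
#include <iostream>

class FenceSyncWaiterSketch {
 public:
  // |current_release|: what the stream has actually released so far.
  // |max_promised_release|: the highest release the stream has flushed a
  // promise for. Returns true only for a valid, not-yet-satisfied wait.
  bool ShouldWait(uint64_t wait_release,
                  uint64_t current_release,
                  uint64_t max_promised_release) const {
    if (wait_release > max_promised_release) {
      // Previously this path would have force-released up to |wait_release|;
      // the safer behavior is to simply ignore the wait.
      std::cout << "ignoring invalid wait on release " << wait_release << "\n";
      return false;
    }
    return wait_release > current_release;
  }
};

int main() {
  FenceSyncWaiterSketch waiter;
  // A stale token from before a GPU process restart asks to wait on 12, but
  // the new stream has only promised releases up to 3.
  bool wait = waiter.ShouldWait(/*wait_release=*/12,
                                /*current_release=*/1,
                                /*max_promised_release=*/3);
  std::cout << (wait ? "waiting" : "not waiting") << "\n";
  return 0;
}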

Comment 27 by dyen@chromium.org, May 12 2016

I have a patch up which just ignores these kinds of waits:

https://codereview.chromium.org/1975663002/

Comment 28 by dyen@chromium.org, May 12 2016

I verified that my patch fixes this problem by running the test again 100 times. The DCHECKs no longer trigger, even when stale sync tokens are detected.

Comment 29 by kbr@chromium.org, May 12 2016

Excellent work, David. Let's get your patch in ASAP and remove the flaky expectations for any affected tests.


Comment 30 by bugdroid1@chromium.org, May 13 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9d255f0ca75efea1773fcb718c368a7a6bc4bba4

commit 9d255f0ca75efea1773fcb718c368a7a6bc4bba4
Author: dyen <dyen@chromium.org>
Date: Fri May 13 00:32:33 2016

Invalid sync token waits now automatically get released.

Previously, invalid sync token waits would automatically release the
fence sync. This would guard against invalid waits which pretended it
would release a fence sync while issuing waits to cause dead locks.

This worked, however it would allow an evil renderer to release fences
on the browser which would invalidate waits. A better solution is to
merely cause the invalid wait to be ignored instead.

R=piman@chromium.org
BUG= 606112 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/1975663002
Cr-Commit-Position: refs/heads/master@{#393409}

[modify] https://crrev.com/9d255f0ca75efea1773fcb718c368a7a6bc4bba4/gpu/command_buffer/service/sync_point_manager.cc
[modify] https://crrev.com/9d255f0ca75efea1773fcb718c368a7a6bc4bba4/gpu/command_buffer/service/sync_point_manager.h
[modify] https://crrev.com/9d255f0ca75efea1773fcb718c368a7a6bc4bba4/gpu/command_buffer/service/sync_point_manager_unittest.cc


Comment 31 by bugdroid1@chromium.org, May 13 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/889fa48ee6cbe56579a77382232eaef9293816c2

commit 889fa48ee6cbe56579a77382232eaef9293816c2
Author: dyen <dyen@chromium.org>
Date: Fri May 13 06:14:26 2016

Tentatively enable flaky GpuCrash.GPUProcessCrashesExactlyOnce again.

R=kbr@chromium.org
BUG= 608923 ,  606112 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/1971403002
Cr-Commit-Position: refs/heads/master@{#393454}

[modify] https://crrev.com/889fa48ee6cbe56579a77382232eaef9293816c2/content/test/gpu/gpu_tests/context_lost_expectations.py

Comment 32 by dyen@chromium.org, May 13 2016

Status: Fixed (was: Assigned)
