Project: chromium Issues People Development process History Sign in
New issue
Advanced search Search tips
Issue 365904 WebGL Context Lost/Restore tests are flaky on the GPU bots
Starred by 1 user Project Member Reported by bajones@chromium.org, Apr 22, 2014 Back to list
Status: Fixed
Owner: kbr@chromium.org
Closed: May 2014
Cc: kbr@chromium.org, alokp@chromium.org, vmi...@chromium.org, bajones@chromium.org, dongseon...@intel.com, piman@chromium.org, reve...@chromium.org, zmo@chromium.org, jbau...@chromium.org
Components:
OS: All
Pri: 1
Type: Bug-Regression


Sign in to add a comment
Since at least April 16th and possibly earlier we've been seeing failures on debug GPU bots with WebGL tests related to context restoration. This is most prominent on the Debug Mac (Intel) bot (http://build.chromium.org/p/chromium.gpu/builders/Mac%20Debug%20%28Intel%29?numbuilds=200), but has also been observed on all the other Debug bots.

The failing tests are:

webgl_conformance_tests.conformance_context_context_lost_restored
context_lost_tests.WebGLContextLostFromGPUProcessExit

The conformance test failure gives a fairly generic an unhelpful "Tab Crashed" message, but the context_lost_tests failure yields a crash stack:

ASSERTION FAILED: success
../../third_party/WebKit/Source/platform/graphics/gpu/Extensions3DUtil.cpp(39) : void WebCore::Extensions3DUtil::initializeExtensions()
1   0x209e7efb WebCore::Extensions3DUtil::initializeExtensions()
2   0x209e7e4d WebCore::Extensions3DUtil::Extensions3DUtil(blink::WebGraphicsContext3D*)
3   0x209e8059 WebCore::Extensions3DUtil::Extensions3DUtil(blink::WebGraphicsContext3D*)
4   0x209dc090 WebCore::DrawingBuffer::create(WTF::PassOwnPtr<blink::WebGraphicsContext3D>, WebCore::IntSize const&, WebCore::DrawingBuffer::PreserveDrawingBuffer, WTF::PassRefPtr<WebCore::ContextEvictionManager>)
5   0x1305f129 WebCore::WebGLRenderingContextBase::maybeRestoreContext(WebCore::Timer<WebCore::WebGLRenderingContextBase>*)

The "ASSERTION FAILED: success" refers to an assertion in Extensions3DUtil::initializeExtensions() that a makeContextCurrent() call on the WebGraphicsContext3D pointer passed to it succeeds. This is the first time that makeContextCurrent() is called on this code path, and it's failure probably indicates an issue retrieving a valid context from the blink::Platform::current()->createOffscreenGraphicsContext3D() call in WebGLRenderingContextBase::maybeRestoreContext().

Unfortunately the flaky nature of the failure and the fact that it has been going on for a while has made it very difficult to estimate when it started and which CLs may have been responsible.
 
Comment 1 by kbr@chromium.org, Apr 22, 2014
Cc: piman@chromium.org vmi...@chromium.org jbau...@chromium.org
Labels: Cr-Internals-GPU
It looks like there might be a newly introduced race condition in the command buffer which is the cause of the failure to re-create the WebGraphicsContext3DCommandBufferImpl. Here's more context from the failure log:

[1900:63235:0422/160707:INFO:CONSOLE(24)] "Harness injected.", source:  (24)
[1905:81667:0422/160709:ERROR:command_buffer_proxy_impl.cc(152)] Could not send GpuCommandBufferMsg_Initialize.
[1905:81667:0422/160709:ERROR:webgraphicscontext3d_command_buffer_impl.cc(386)] CommandBufferProxy::Initialize failed.
[1905:81667:0422/160709:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[1905:81667:0422/160709:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[1900:63235:0422/160709:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:52686/webgl.html?query=kill_after_notification (0)
[1904:63235:0422/160709:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[1904:63235:0422/160709:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
ASSERTION FAILED: success
../../third_party/WebKit/Source/platform/graphics/gpu/Extensions3DUtil.cpp(39) : void WebCore::Extensions3DUtil::initializeExtensions()
1   0x20929efb WebCore::Extensions3DUtil::initializeExtensions()
2   0x20929e4d WebCore::Extensions3DUtil::Extensions3DUtil(blink::WebGraphicsContext3D*)
3   0x2092a059 WebCore::Extensions3DUtil::Extensions3DUtil(blink::WebGraphicsContext3D*)
4   0x2091e090 WebCore::DrawingBuffer::create(WTF::PassOwnPtr<blink::WebGraphicsContext3D>, WebCore::IntSize const&, WebCore::DrawingBuffer::PreserveDrawingBuffer, WTF::PassRefPtr<WebCore::ContextEvictionManager>)


Comment 2 by kbr@chromium.org, Apr 22, 2014
Blockedon: chromium:356453
Comment 3 by kbr@chromium.org, Apr 22, 2014
Issue 356453 was another recent situation where these context lost and restore tests became flaky.

Comment 4 by piman@chromium.org, Apr 22, 2014
[1905:81667:0422/160709:ERROR:command_buffer_proxy_impl.cc(152)] Could not send GpuCommandBufferMsg_Initialize.

That means either the Initialize msg failed or that the channel was closed (either explicitly, or through a GPU process crash).

The GPU process logs its errors if the Initialize msg fails, so if we don't get logs for it, it's the latter.

How does the tests trigger lost contexts?
Comment 5 by piman@chromium.org, Apr 22, 2014
Note however that the context is allowed to be lost at any time, which means that asserting that makeContextCurrent succeeds is incorrect.
Comment 6 by kbr@chromium.org, Apr 22, 2014
The test triggers the lost context by navigating to about:gpucrash to kill the GPU process:

src/content/test/gpu/gpu_tests/context_lost.py
src/content/test/data/gpu/webgl.html

Comment 7 by kbr@chromium.org, Apr 22, 2014
Cc: dongseon...@intel.com
piman: thanks for the analysis.

I think this is another bug introduced by the change of ownership of the WebGraphicsContext3D from the WebGLRenderingContext to the DrawingBuffer. It looks like there's an incorrect assumption in the context restoration logic (maybeRestoreContext / DrawingBuffer::create()) that simply creating the WebGraphicsContext3D is sufficient to guarantee it initialized successfully. An initial makeCurrent() must gate the WebGL context restoration logic.

Comment 8 by kbr@chromium.org, Apr 22, 2014
Summary: WebGL Context Lost/Restore tests are flaky on the GPU bots (was: WebGL Context Lost/Restore tests are flaky the GPU bots)
Comment 9 by kbr@chromium.org, Apr 22, 2014
Blocking: chromium:344393
Comment 10 by piman@chromium.org, Apr 22, 2014
There is also https://codereview.chromium.org/237903002 that landed on the 15th, that affects mac in particular. But if it happens on other bots too, it may not be related.
Issue 365747 might be duplicated to this.
I recently refactor WebGLRenderingContext and DrawingBuffer; https://codereview.chromium.org/223223003
It might cause this issue.

#10 piman@, I think it happens on only mac, because only mac bot was in trouble and I can not reproduce in my linux machine.
Comment 12 by kbr@chromium.org, Apr 23, 2014
Owner: kbr@chromium.org
Status: Assigned
Comment 13 by bajones@chromium.org, Apr 23, 2014
This is happening periodically on all the Debug bots, which you can see if you look back through their build history:

http://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29?numbuilds=200
http://build.chromium.org/p/chromium.gpu/builders/Linux%20Debug%20%28NVIDIA%29?numbuilds=200

For some reason, however, it's hitting the Debug Mac Intel bot hardest. Almost all of the recent builds have failed, though the issue has been present more sporadically for a while. I'm not sure what caused the recent uptick in flakes.

http://build.chromium.org/p/chromium.gpu/builders/Mac%20Debug%20%28Intel%29?numbuilds=200
Thank you for answer.
I can see Win and Linux rarely fails on gpu tests (e.g. webgl_conformance_tests), but I cannot find the same test failure to Mac.
Comment 15 by kbr@chromium.org, Apr 23, 2014
Status: Started
Project Member Comment 16 by bugdroid1@chromium.org, Apr 24, 2014
------------------------------------------------------------------
r265833 | kbr@chromium.org | 2014-04-24T03:29:54.900641Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/test/gpu/gpu_tests/context_lost.py?r1=265833&r2=265832&pathrev=265833

Make context_lost test more of a stress test.

This change much more reliably catches the assertion failure in Issue 365904.

BUG= 365904 

Review URL: https://codereview.chromium.org/256433003
-----------------------------------------------------------------
Project Member Comment 17 by bugdroid1@chromium.org, Apr 24, 2014
Labels: merge-merged-git-svn
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e964701e45162b57f1ce9125e754b82396fcfc4d

commit e964701e45162b57f1ce9125e754b82396fcfc4d
Author: kbr@chromium.org <kbr@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Thu Apr 24 03:29:54 2014 +0000

Make context_lost test more of a stress test.

This change much more reliably catches the assertion failure in Issue 365904.

BUG= 365904 

Review URL: https://codereview.chromium.org/256433003

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@265833 0039d316-1c4b-4281-b951-d872f2087c98


Project Member Comment 18 by bugdroid1@chromium.org, Apr 24, 2014
The following revision refers to this bug:
  http://src.chromium.org/viewvc/blink?view=rev&rev=172465

------------------------------------------------------------------
r172465 | kbr@chromium.org | 2014-04-24T07:57:53.726908Z

Changed paths:
   M http://src.chromium.org/viewvc/blink/trunk/Source/platform/graphics/gpu/Extensions3DUtil.cpp?r1=172465&r2=172464&pathrev=172465
   M http://src.chromium.org/viewvc/blink/trunk/Source/platform/graphics/gpu/DrawingBuffer.cpp?r1=172465&r2=172464&pathrev=172465
   M http://src.chromium.org/viewvc/blink/trunk/Source/core/html/canvas/WebGLRenderingContextBase.cpp?r1=172465&r2=172464&pathrev=172465
   M http://src.chromium.org/viewvc/blink/trunk/Source/platform/graphics/gpu/Extensions3DUtil.h?r1=172465&r2=172464&pathrev=172465
   M http://src.chromium.org/viewvc/blink/trunk/Source/platform/graphics/gpu/DrawingBuffer.h?r1=172465&r2=172464&pathrev=172465
   M http://src.chromium.org/viewvc/blink/trunk/Source/core/html/canvas/WebGLRenderingContext.cpp?r1=172465&r2=172464&pathrev=172465

Fix WebGL context restoration logic. The retry mechanism was broken when the ownership of the WebGraphicsContext3D was transferred to the DrawingBuffer.

The Telemetry based context loss test is being made more of a stress test to verify this fix. Will also attempt to make the WebGL conformance tests more stressful to catch the underlying race condition more reliably.

BUG= 365904 

Review URL: https://codereview.chromium.org/254453002
-----------------------------------------------------------------
Comment 19 by kbr@chromium.org, Apr 24, 2014
With the increased stress in the ContextLost.WebGLContextLostFromGPUProcessExit test another assertion failure is being triggered:

ASSERTION FAILED: !m_destructionInProgress
../../third_party/WebKit/Source/platform/graphics/gpu/DrawingBuffer.cpp(195) : virtual bool WebCore::DrawingBuffer::prepareMailbox(blink::WebExternalTextureMailbox*, blink::WebExternalBitmap*)
1   0x7f018bb6f044 WebCore::DrawingBuffer::prepareMailbox(blink::WebExternalTextureMailbox*, blink::WebExternalBitmap*)
2   0x7f017749e0ac
3   0x7f0188883b92 cc::TextureLayer::Update(cc::ResourceUpdateQueue*, cc::OcclusionTracker<cc::Layer> const*)
4   0x7f01889c1e52 cc::LayerTreeHost::PaintLayerContents(cc::RenderSurfaceLayerList const&, cc::ResourceUpdateQueue*, bool*, bool*)
5   0x7f01889c1181 cc::LayerTreeHost::UpdateLayers(cc::Layer*, cc::ResourceUpdateQueue*)
6   0x7f01889c086c cc::LayerTreeHost::UpdateLayers(cc::ResourceUpdateQueue*)

Working on fixing this next.

Thank you for report. I'll investigate.

Quick fix is as follows but I don't understand why it happens.
+++ b/Source/platform/graphics/gpu/DrawingBuffer.cpp
@@ -191,8 +191,8 @@ blink::WebGraphicsContext3D* DrawingBuffer::context()
 
 bool DrawingBuffer::prepareMailbox(blink::WebExternalTextureMailbox* outMailbox, blink::WebExternalBitmap* bitmap)
 {
-    // prepareMailbox() is always called after layout.
-    ASSERT(!m_destructionInProgress);
+    if (m_destructionInProgress)
+        return false;

kbr@, I submitted the CL to fix ContextLost.WebGLContextLostFromGPUProcessExit test failure;
https://codereview.chromium.org/259533003/
Project Member Comment 22 by bugdroid1@chromium.org, Apr 24, 2014
The following revision refers to this bug:
  http://src.chromium.org/viewvc/blink?view=rev&rev=172564

------------------------------------------------------------------
r172564 | dongseong.hwang@intel.com | 2014-04-24T23:44:41.373392Z

Changed paths:
   M http://src.chromium.org/viewvc/blink/trunk/Source/platform/graphics/gpu/DrawingBuffer.cpp?r1=172564&r2=172563&pathrev=172564

WebGL: Fix assertion hit of gpu tests.

This CL removes the invalid assertion in DrawingBuffer::prepareMailbox(); ASSERT(!m_destructionInProgress).
It can be hit in the following sequence.
1. WebGL draws something.
2. The compositor begins the frame.
3. Javascript makes a context lost using WEBGL_lose_context extension.
4. Assertion hits.

TEST=ContextLost.WebGLContextLostFromGPUProcessExit
BUG= 365904 

Review URL: https://codereview.chromium.org/259533003
-----------------------------------------------------------------
Comment 23 by kbr@chromium.org, Apr 25, 2014
I'm concerned that http://build.chromium.org/p/chromium.webkit/builders/GPU%20Win7%20(dbg)%20(NVIDIA) hasn't cleared up after http://src.chromium.org/viewvc/blink?view=revision&revision=172564 . Build http://build.chromium.org/p/chromium.webkit/builders/GPU%20Win7%20%28dbg%29%20%28NVIDIA%29/builds/19941 should have picked up the fix. There may be a remaining Windows-specific problem. Note the following in the log:

[3448:3268:0424/195347:ERROR:channel.cc(292)] RawChannel fatal error (type 1)
[3448:3268:0424/195347:ERROR:channel.cc(295)] Not shutting down due Windows-only bug
[3448:972:0424/195348:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:52894/webgl.html?query=kill_after_notification (0)
[3448:972:0424/195349:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:52894/webgl.html?query=kill_after_notification (0)
[3448:892:0424/195841:FATAL:thread_watcher.cc(892)] Check failed: false. 
Backtrace:
	base::debug::StackTrace::StackTrace [0x100A9E31+33]
	logging::LogMessage::~LogMessage [0x1014F22E+94]
	policy::internal::RestrictionNode::operator= [0x051C0E44+16698314]
	base::Watchdog::ThreadDelegate::ThreadMain [0x1029D0DF+767]
	base::`anonymous namespace'::ThreadFunc [0x10262E01+257]
	BaseThreadInitThunk [0x7713336A+18]
	RtlInitializeExceptionChain [0x77A29F72+99]
	RtlInitializeExceptionChain [0x77A29F45+54]

Comment 24 by kbr@chromium.org, Apr 26, 2014
I'm undoing the stress changes to the context lost test for the time being in https://codereview.chromium.org/252793003 . I'm unable to reproduce the failure seen on the Win Debug bots locally. On my Windows workstation (which has an AMD GPU instead of NVIDIA) a Debug build can't survive even one GPU process crash in this test -- the browser hangs forever waiting for a GpuCommandBufferMsg_Initialize message after the GPU process comes back up. Here's the stack trace. The browser may be trying to talk to the dead GPU process. Don't know why this would happen with one vendor's GPU vs. another's. A Release build works fine.

base.dll!base::WaitableEvent::WaitMany(base::WaitableEvent * * events, unsigned int count)  Line 93 + 0x17 bytes	C++
ipc.dll!IPC::SyncChannel::WaitForReply(IPC::SyncChannel::SyncContext * context, base::WaitableEvent * pump_messages_event)  Line 488 + 0x10 bytes	C++
ipc.dll!IPC::SyncChannel::Send(IPC::Message * message)  Line 472 + 0x12 bytes	C++
content.dll!content::GpuChannelHost::Send(IPC::Message * msg)  Line 113 + 0x31 bytes	C++
content.dll!content::CommandBufferProxyImpl::Send(IPC::Message * msg)  Line 492 + 0x18 bytes	C++
>	content.dll!content::CommandBufferProxyImpl::Initialize()  Line 151 + 0x70 bytes	C++
content.dll!content::WebGraphicsContext3DCommandBufferImpl::InitializeCommandBuffer(bool onscreen, content::WebGraphicsContext3DCommandBufferImpl * share_context)  Line 385 + 0x29 bytes	C++
content.dll!content::WebGraphicsContext3DCommandBufferImpl::CreateContext(bool onscreen)  Line 404 + 0x11 bytes	C++
content.dll!content::WebGraphicsContext3DCommandBufferImpl::MaybeInitializeGL()  Line 281 + 0x29 bytes	C++
content.dll!content::WebGraphicsContext3DCommandBufferImpl::makeContextCurrent()  Line 462 + 0x8 bytes	C++
content.dll!content::ContextProviderCommandBuffer::BindToCurrentThread()  Line 94 + 0x26 bytes	C++
cc.dll!cc::OutputSurface::BindToClient(cc::OutputSurfaceClient * client)  Line 254 + 0x25 bytes	C++
content.dll!content::BrowserCompositorOutputSurface::BindToClient(cc::OutputSurfaceClient * client)  Line 63 + 0xf bytes C++
cc.dll!cc::LayerTreeHostImpl::InitializeRenderer(scoped_ptr<cc::OutputSurface,base::DefaultDeleter<cc::OutputSurface> > output_surface)  Line 1864 + 0x48 bytes	C++
cc.dll!cc::SingleThreadProxy::CreateAndInitializeOutputSurface()  Line 128 + 0x72 bytes	C++
cc.dll!cc::LayerTreeHost::InitializeOutputSurfaceIfNeeded()  Line 703 + 0x1d bytes	C++
cc.dll!cc::SingleThreadProxy::CommitAndComposite(base::TimeTicks frame_begin_time, const gfx::Rect & device_viewport_damage_rect, bool for_readback, cc::LayerTreeHostImpl::FrameData * frame)  Line 420 + 0xb bytes	C++
cc.dll!cc::SingleThreadProxy::CompositeImmediately(base::TimeTicks frame_begin_time)  Line 368 + 0x1d bytes	C++
cc.dll!cc::LayerTreeHost::Composite(base::TimeTicks frame_begin_time)  Line 694	C++
compositor.dll!ui::Compositor::Draw()  Line 294 + 0x2b bytes	C++
compositor.dll!base::internal::RunnableAdapter<void (__thiscall ui::Compositor::*)(void)>::Run(ui::Compositor * object)  Line 134 + 0x12 bytes	C++

Project Member Comment 25 by bugdroid1@chromium.org, Apr 26, 2014
------------------------------------------------------------------
r266315 | kbr@chromium.org | 2014-04-26T02:03:26.274377Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/test/gpu/gpu_tests/context_lost.py?r1=266315&r2=266314&pathrev=266315

Undo the recent stress changes to the context loss test.

The underlying infrastructure isn't ready for this to crash the GPU
process 30 times in a row. Change back to running this test once to
address continued flakiness on the bots.

BUG= 365904 
NOTRY=true
TBR=bajones@chromium.org

Review URL: https://codereview.chromium.org/252793003
-----------------------------------------------------------------
Project Member Comment 26 by bugdroid1@chromium.org, Apr 26, 2014
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1f85e2620e9ea5a417c0225bc4f0c63340e443d2

commit 1f85e2620e9ea5a417c0225bc4f0c63340e443d2
Author: kbr@chromium.org <kbr@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Sat Apr 26 02:03:26 2014 +0000

Undo the recent stress changes to the context loss test.

The underlying infrastructure isn't ready for this to crash the GPU
process 30 times in a row. Change back to running this test once to
address continued flakiness on the bots.

BUG= 365904 
NOTRY=true
TBR=bajones@chromium.org

Review URL: https://codereview.chromium.org/252793003

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@266315 0039d316-1c4b-4281-b951-d872f2087c98


Comment 27 by piman@chromium.org, Apr 26, 2014
Stack trace in #24 is just the main thread waiting for the reply to the CreateContext.

Which is pretty interesting / surprising. It means the channel was properly created, so a couple of messages were exchanged with the new GPU process, and then it failed. Similar to the original issue (except with a different failure mode)...

Definitely sounds like a race or something.
Comment 28 by kbr@chromium.org, Apr 28, 2014
There are continued failures of this test on the linux_gpu_triggered_tests bot. There were two recent failures in the history but they disappeared before I could record them and their logs. Need to test more locally.


Comment 29 by kbr@chromium.org, Apr 28, 2014
Labels: Build-CommitQueue
To be clear, this is a blocker for adding linux_gpu as a required job in Chromium's and Blink's CQ.

Comment 30 by kbr@chromium.org, Apr 28, 2014
Cc: alokp@chromium.org
Comment 31 by kbr@chromium.org, Apr 28, 2014
alokp@ points out that this test flaked on the Mac Release (ATI) bot too:

http://build.chromium.org/p/chromium.gpu/builders/Mac%2010.8%20Release%20%28ATI%29/builds/21129

[ RUN      ] ContextLost.WebGLContextLostFromGPUProcessExit
...
[896:25347:0428/151227:ERROR:channel.cc(291)] RawChannel fatal error (type 1)
[902:20739:0428/151228:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[902:20739:0428/151228:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[902:20739:0428/151228:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[902:20739:0428/151228:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[896:25347:0428/151228:ERROR:channel.cc(291)] RawChannel fatal error (type 1)
[901:1799:0428/151228:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
WARNING:root:While running file://webgl.html?query=kill_after_notification

Traceback (most recent call last):
  _RunPage at tools/telemetry/telemetry/page/page_runner.py:535
    test.RunPage(page, page_state.tab, results)
  RunPage at tools/telemetry/telemetry/page/page_test.py:304
    self._test_method(page, tab, results)
  ValidatePage at content/test/gpu/gpu_tests/context_lost.py:85
    new_tab.Close()
  Close at tools/telemetry/telemetry/core/tab.py:74
    self._backend_list.CloseTab(self._inspector_backend.debugger_url)
  CloseTab at tools/telemetry/telemetry/core/backends/chrome/tab_list_backend.py:36
    raise Exception('Unable to close tab, tab id not found: %s' % tab_id)
Exception: Unable to close tab, tab id not found: AC559AD4-8A8E-074C-161C-ADE0AFF550EA

Locals:
  debugger_url : u'ws://127.0.0.1:52535/devtools/page/AC559AD4-8A8E-074C-161C-ADE0AFF550EA'
  tab_id       : u'AC559AD4-8A8E-074C-161C-ADE0AFF550EA'
  timeout      : None

Traceback (most recent call last):
  File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_runner.py", line 535, in _RunPage
    test.RunPage(page, page_state.tab, results)
  File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_test.py", line 304, in RunPage
    self._test_method(page, tab, results)
  File "/private/tmp/run_tha_testLsi8TS/content/test/gpu/gpu_tests/context_lost.py", line 85, in ValidatePage
    new_tab.Close()
  File "../content/test/gpu/../../../tools/telemetry/telemetry/core/tab.py", line 74, in Close
    self._backend_list.CloseTab(self._inspector_backend.debugger_url)
  File "../content/test/gpu/../../../tools/telemetry/telemetry/core/backends/chrome/tab_list_backend.py", line 36, in CloseTab
    raise Exception('Unable to close tab, tab id not found: %s' % tab_id)
Exception: Unable to close tab, tab id not found: AC559AD4-8A8E-074C-161C-ADE0AFF550EA

[  FAILED  ] ContextLost.WebGLContextLostFromGPUProcessExit (5709 ms)

Comment 32 by kbr@chromium.org, Apr 28, 2014
Labels: -Pri-1 Pri-0
Tab crashes are also being observed during try job runs:

http://build.chromium.org/p/tryserver.chromium.gpu/builders/mac_gpu_triggered_tests/builds/1940

ERROR:root:file://webgl.html?query=kill_after_notification:

Traceback (most recent call last):
  _RunPage at tools/telemetry/telemetry/page/page_runner.py:535
    test.RunPage(page, page_state.tab, results)
  RunPage at tools/telemetry/telemetry/page/page_test.py:304
    self._test_method(page, tab, results)
  ValidatePage at content/test/gpu/gpu_tests/context_lost.py:75
    new_tab.Navigate('chrome://gpucrash')
  Navigate at tools/telemetry/telemetry/core/tab.py:253
    self._inspector_backend.Navigate(url, script_to_evaluate_on_commit, timeout)
  Navigate at tools/telemetry/telemetry/core/backends/chrome/inspector_backend.py:165
    self._page.Navigate(url, script_to_evaluate_on_commit, timeout)
  Navigate at tools/telemetry/telemetry/core/backends/chrome/inspector_page.py:118
    self.PerformActionAndWaitForNavigate(DoNavigate, timeout)
  PerformActionAndWaitForNavigate at tools/telemetry/telemetry/core/backends/chrome/inspector_page.py:87
    self._inspector_backend.DispatchNotifications(remaining_time)
  DispatchNotifications at tools/telemetry/telemetry/core/backends/chrome/inspector_websocket.py:68
    self._Receive(timeout)
  _Receive at tools/telemetry/telemetry/core/backends/chrome/inspector_websocket.py:103
    if 'method' in res and self._notification_handler(res):
  _HandleNotification at tools/telemetry/telemetry/core/backends/chrome/inspector_backend.py:227
    raise exceptions.TabCrashException()
TabCrashException

This is really bad. This test has now been flaky for several days. Everyone on the CC: list who has been working in this area of the code recently should be trying to reproduce these failures and fix them with top priority.

Comment 33 by kbr@chromium.org, Apr 29, 2014
Cc: reve...@chromium.org
Labels: -Pri-0 Pri-1
The cause of the about:gpucrash tab crashes is a recent regression. It's been identified and filed separately as Issue 368107. I'm preparing a workaround for this issue.

Comment 34 by kbr@chromium.org, Apr 29, 2014
Blocking: chromium:368107
Project Member Comment 35 by bugdroid1@chromium.org, Apr 29, 2014
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2e286a40542a8d938752a4ac4f0f26a972b45409

commit 2e286a40542a8d938752a4ac4f0f26a972b45409
Author: kbr@chromium.org <kbr@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Tue Apr 29 10:03:58 2014 +0000

Work around about:gpucrash tab crashes caused by impl-side painting.

This is a temporary workaround. The reason for the tab crashes should
be investigated and fixed, since if it happens the context is lost
early in a tab's lifetime, this bug will be triggered.

BUG= 365904 , 368107 
TBR=bajones@chromium.org

Review URL: https://codereview.chromium.org/252123003

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@266819 0039d316-1c4b-4281-b951-d872f2087c98


Project Member Comment 36 by bugdroid1@chromium.org, Apr 29, 2014
------------------------------------------------------------------
r266819 | kbr@chromium.org | 2014-04-29T10:03:58.549952Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/test/gpu/gpu_tests/context_lost.py?r1=266819&r2=266818&pathrev=266819

Work around about:gpucrash tab crashes caused by impl-side painting.

This is a temporary workaround. The reason for the tab crashes should
be investigated and fixed, since if it happens the context is lost
early in a tab's lifetime, this bug will be triggered.

BUG= 365904 , 368107 
TBR=bajones@chromium.org

Review URL: https://codereview.chromium.org/252123003
-----------------------------------------------------------------
Project Member Comment 37 by bugdroid1@chromium.org, Apr 29, 2014
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0cbfee6eb15867caf6c06d8d2a653e534676f9d4

commit 0cbfee6eb15867caf6c06d8d2a653e534676f9d4
Author: kbr@chromium.org <kbr@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Tue Apr 29 19:05:18 2014 +0000

Added missing import.

I accidentally blew away my changes just before the last commit and
made a mistake while recreating it.

BUG= 365904 , 368107 
TBR=bajones@chromium.org

Review URL: https://codereview.chromium.org/259363002

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@266930 0039d316-1c4b-4281-b951-d872f2087c98


Project Member Comment 38 by bugdroid1@chromium.org, Apr 29, 2014
------------------------------------------------------------------
r266930 | kbr@chromium.org | 2014-04-29T19:05:18.507044Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/test/gpu/gpu_tests/context_lost.py?r1=266930&r2=266929&pathrev=266930

Added missing import.

I accidentally blew away my changes just before the last commit and
made a mistake while recreating it.

BUG= 365904 , 368107 
TBR=bajones@chromium.org

Review URL: https://codereview.chromium.org/259363002
-----------------------------------------------------------------
Comment 39 by kbr@chromium.org, Apr 29, 2014
There is a continued failure of the context_lost tests on the Win7 Debug (NVIDIA) bot on the chromium.gpu.fyi waterfall:

http://build.chromium.org/p/chromium.gpu.fyi/builders/Win7%20Debug%20%28NVIDIA%29

http://build.chromium.org/p/chromium.gpu.fyi/builders/Win7%20Debug%20%28NVIDIA%29/builds/1441

The tab running the WebGL code crashes in this situation, again inside Ganesh initialization. The stack trace from the bot is below.

There's no workaround for this; Ganesh's initialization must be made more robust to lost context events. I'm blocking this on Issue 368107 and raising the priority of the other bug.

-----

ERROR:root:file://webgl.html?query=kill_after_notification:

Traceback (most recent call last):
  _RunPage at tools\telemetry\telemetry\page\page_runner.py:535
    test.RunPage(page, page_state.tab, results)
  RunPage at tools\telemetry\telemetry\page\page_test.py:304
    self._test_method(page, tab, results)
  ValidatePage at content\test\gpu\gpu_tests\context_lost.py:98
    'window.domAutomationController._succeeded'):
  EvaluateJavaScript at tools\telemetry\telemetry\core\web_contents.py:86
    expr, context_id=None, timeout=timeout)
  EvaluateJavaScriptInContext at tools\telemetry\telemetry\core\web_contents.py:102
    expr, context_id=context_id, timeout=timeout)
  EvaluateJavaScript at tools\telemetry\telemetry\core\backends\chrome\inspector_backend.py:176
    return self._runtime.Evaluate(expr, context_id, timeout)
  Evaluate at tools\telemetry\telemetry\core\backends\chrome\inspector_runtime.py:35
    res = self._inspector_backend.SyncRequest(request, timeout)
  SyncRequest at tools\telemetry\telemetry\core\backends\chrome\inspector_websocket.py:62
    res = self._Receive(timeout)
  _Receive at tools\telemetry\telemetry\core\backends\chrome\inspector_websocket.py:103
    if 'method' in res and self._notification_handler(res):
  _HandleNotification at tools\telemetry\telemetry\core\backends\chrome\inspector_backend.py:227
    raise exceptions.TabCrashException()
TabCrashException

Locals:
  res : {u'method': u'Inspector.targetCrashed'}

Traceback (most recent call last):
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_runner.py", line 535, in _RunPage
    test.RunPage(page, page_state.tab, results)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_test.py", line 304, in RunPage
    self._test_method(page, tab, results)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\gpu_tests\context_lost.py", line 98, in ValidatePage
    'window.domAutomationController._succeeded'):
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\web_contents.py", line 86, in EvaluateJavaScript
    expr, context_id=None, timeout=timeout)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\web_contents.py", line 102, in EvaluateJavaScriptInContext
    expr, context_id=context_id, timeout=timeout)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\backends\chrome\inspector_backend.py", line 176, in EvaluateJavaScript
    return self._runtime.Evaluate(expr, context_id, timeout)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\backends\chrome\inspector_runtime.py", line 35, in Evaluate
    res = self._inspector_backend.SyncRequest(request, timeout)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\backends\chrome\inspector_websocket.py", line 62, in SyncRequest
    res = self._Receive(timeout)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\backends\chrome\inspector_websocket.py", line 103, in _Receive
    if 'method' in res and self._notification_handler(res):
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\backends\chrome\inspector_backend.py", line 227, in _HandleNotification
    raise exceptions.TabCrashException()
TabCrashException

[  FAILED  ] ContextLost.WebGLContextLostFromGPUProcessExit (8482 ms)
WARNING:root:Tab crashed: file://webgl.html?query=kill_after_notification
Stack Trace:
********************************************************************************
	ChildEBP RetAddr  
	WARNING: Stack unwind information not available. Following frames may be wrong.
	26e8dfe8 10301729 KERNELBASE!RaiseException+0x58
	26e8e03c 10300c67 base!__delayLoadHelper2+0x135
	26e8ec2c 100aa8ec base!_tailMerge_dbghelp_dll+0xd
	26e8ec50 100ab1c2 base!DefaultSingletonTraits<base::debug::`anonymous namespace'::SymbolContext>::New+0x5c
	26e8ec60 100aa878 base!Singleton<base::debug::`anonymous namespace'::SymbolContext,LeakySingletonTraits<base::debug::`anonymous namespace'::SymbolContext>,base::debug::A0x4df09154::SymbolContext>::get+0x72
	26e8ec68 100aaa00 base!base::debug::`anonymous namespace'::SymbolContext::GetInstance+0x8
	26e8ec84 1014f2ae base!base::debug::StackTrace::OutputToStream+0x20
	*** WARNING: Unable to verify checksum for skia.dll
	26e8f290 01fefa75 base!logging::LogMessage::~LogMessage+0x8e
	26e8f398 020c73a9 skia!SkDebugf_FileLine+0x105
	26e8f3bc 020c8000 skia!GrDebugCrash+0x49
	26e8f3d0 020c06ba skia!GrGLInterface::validate+0x50
	26e8f3f0 0205fcc8 skia!GrGLContextInfo::initialize+0x16a
	26e8f410 0205fefd skia!GrGLContext::GrGLContext+0x48
	26e8f47c 0203e6aa skia!GrGpu::Create+0xbd
	26e8f4f0 0203802a skia!GrContext::init+0x7a
	*** WARNING: Unable to verify checksum for webkit_gpu.dll
	26e8f520 1751a9aa skia!GrContext::Create+0x8a
	*** WARNING: Unable to verify checksum for content.dll
	26e8f564 12981e20 webkit_gpu!webkit::gpu::GrContextForWebGraphicsContext3D::GrContextForWebGraphicsContext3D+0xea
	*** WARNING: Unable to verify checksum for cc.dll
	26e8f714 0f3f1f35 content!content::ContextProviderCommandBuffer::GrContext+0x250
	26e8f8fc 0f3f1c01 cc!cc::DirectRasterWorkerPool::RunTasksOnOriginThread+0x305
	26e8f90c 0f3f14f2 cc!base::internal::RunnableAdapter<void (__thiscall cc::DirectRasterWorkerPool::*)(void)>::Run+0x21
	26e8f918 0f3f1bb9 cc!base::internal::InvokeHelper<1,void,base::internal::RunnableAdapter<void (__thiscall cc::DirectRasterWorkerPool::*)(void)>,void __cdecl(base::WeakPtr<cc::DirectRasterWorkerPool> const &)>::MakeItSo+0x22
	26e8f934 1007f05f cc!base::internal::Invoker<1,base::internal::BindState<base::internal::RunnableAdapter<void (__thiscall cc::DirectRasterWorkerPool::*)(void)>,void __cdecl(cc::DirectRasterWorkerPool *),void __cdecl(base::WeakPtr<cc::DirectRasterWorkerPool>)>,void __cdecl(cc::DirectRasterWorkerPool *)>::Run+0x49
	26e8f94c 10189971 base!base::Callback<void __cdecl(void)>::Run+0x2f
	26e8fb00 10186484 base!base::MessageLoop::RunTask+0x371
	26e8fb10 10186ba3 base!base::MessageLoop::DeferOrRunPendingTask+0x34
	26e8fb80 1018ee86 base!base::MessageLoop::DoWork+0x103
	26e8fca0 1018957a base!base::MessagePumpDefault::Run+0x106
	26e8fd90 1020d079 base!base::MessageLoop::RunHandler+0x13a
	26e8fd9c 101893b1 base!base::RunLoop::Run+0x29
	26e8fde0 10282ec6 base!base::MessageLoop::Run+0x51
	26e8fdec 10283cbd base!base::Thread::Run+0x16
	26e8ff3c 10263091 base!base::Thread::ThreadMain+0x30d
	26e8ff88 7570338a base!base::`anonymous namespace'::ThreadFunc+0x101
	26e8ff94 773e9f72 kernel32!BaseThreadInitThunk+0x12
	26e8ffd4 773e9f45 ntdll!RtlInitializeExceptionChain+0x63
	26e8ffec 00000000 ntdll!RtlInitializeExceptionChain+0x36
	
********************************************************************************

Comment 40 by kbr@chromium.org, Apr 29, 2014
Blocking: -chromium:368107
Comment 41 by kbr@chromium.org, Apr 29, 2014
Blockedon: chromium:368107
Comment 42 by kbr@chromium.org, Apr 29, 2014
Blockedon: chromium:367946
Linking this to another bug whose root cause was unrelated, but which was also flakiness in the context_lost test.

Comment 43 by kbr@chromium.org, Apr 30, 2014
Here is another failure mode, this time in the WebGL conformance tests:

http://build.chromium.org/p/chromium.webkit/builders/GPU%20Mac10.7%20%28dbg%29/builds/25248

[ RUN      ] WebglConformance.conformance_context_context_lost_restored
INFO:root:Running file://conformance/context/context-lost-restored.html
[1885:1799:0429/192627:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: restoreContext: context restoration not allowed", source: http://127.0.0.1:49528/conformance/context/context-lost-restored.html (0)
[1885:1799:0429/192628:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: bindTexture: object not from this context", source: http://127.0.0.1:49528/conformance/context/context-lost-restored.html (0)
[1885:1799:0429/192628:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: useProgram: object not from this context", source: http://127.0.0.1:49528/conformance/context/context-lost-restored.html (0)
[1885:1799:0429/192628:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: bindBuffer: object not from this context", source: http://127.0.0.1:49528/conformance/context/context-lost-restored.html (0)
[1885:1799:0429/192628:INFO:CONSOLE(0)] "WebGL: INVALID_ENUM: texImage2D: invalid texture type", source: http://127.0.0.1:49528/conformance/context/context-lost-restored.html (0)
[1885:1799:0429/192628:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: restoreContext: context not lost", source: http://127.0.0.1:49528/conformance/context/context-lost-restored.html (0)
[1885:1799:0429/192628:INFO:CONSOLE(11)] "getError expected: NO_ERROR. Was INVALID_OPERATION : after evaluating: WEBGL_lose_context.restoreContext()", source:  (11)
[1885:1799:0429/192628:INFO:CONSOLE(11)] "gl.isContextLost() should be true. Was false.", source:  (11)
[1886:1799:0429/192628:ERROR:gles2_cmd_decoder_autogen.h(149)] [GroupMarkerNotSet(crbug.com/242999)!:6860B87D]GL ERROR :GL_INVALID_ENUM : glBlendFunc: sfactor was GL_TEXTURE_2D
[1885:1799:0429/192628:INFO:CONSOLE(11)] "getError expected: NO_ERROR. Was INVALID_ENUM : after evaluating: gl.blendFunc(gl.TEXTURE_2D, gl.TEXTURE_CUBE_MAP)", source:  (11)
[1885:1799:0429/192628:INFO:CONSOLE(0)] "[GroupMarkerNotSet(crbug.com/242999)!:6860B87D]GL ERROR :GL_INVALID_ENUM : glBlendFunc: sfactor was GL_TEXTURE_2D", source: http://127.0.0.1:49528/conformance/context/context-lost-restored.html (0)
WARNING:root:file://conformance/context/context-lost-restored.html:

Traceback (most recent call last):
  _RunPage at tools/telemetry/telemetry/page/page_runner.py:535
    test.RunPage(page, page_state.tab, results)
  RunPage at tools/telemetry/telemetry/page/page_test.py:304
    self._test_method(page, tab, results)
  ValidatePage at content/test/gpu/gpu_tests/webgl_conformance.py:69
    raise page_test.Failure(_WebGLTestMessages(tab))
Failure: getError expected: NO_ERROR. Was INVALID_OPERATION : after evaluating: WEBGL_lose_context.restoreContext()
gl.isContextLost() should be true. Was false.
getError expected: NO_ERROR. Was INVALID_ENUM : after evaluating: gl.blendFunc(gl.TEXTURE_2D, gl.TEXTURE_CUBE_MAP)

Locals:
  page    : <telemetry.page.page.Page object at 0x101c56d10>
  results : <telemetry.page.gtest_test_results.GTestTestResults run=34 errors=0 failures=0>
  tab     : <telemetry.core.tab.Tab object at 0x101ca2710>

Traceback (most recent call last):
  File "/b/build/slave/GPU_Mac10_7__dbg_/build/src/content/test/gpu/../../../tools/telemetry/telemetry/page/page_runner.py", line 535, in _RunPage
    test.RunPage(page, page_state.tab, results)
  File "/b/build/slave/GPU_Mac10_7__dbg_/build/src/content/test/gpu/../../../tools/telemetry/telemetry/page/page_test.py", line 304, in RunPage
    self._test_method(page, tab, results)
  File "/b/build/slave/GPU_Mac10_7__dbg_/build/src/content/test/gpu/gpu_tests/webgl_conformance.py", line 69, in ValidatePage
    raise page_test.Failure(_WebGLTestMessages(tab))
Failure: getError expected: NO_ERROR. Was INVALID_OPERATION : after evaluating: WEBGL_lose_context.restoreContext()
gl.isContextLost() should be true. Was false.
getError expected: NO_ERROR. Was INVALID_ENUM : after evaluating: gl.blendFunc(gl.TEXTURE_2D, gl.TEXTURE_CUBE_MAP)


[  FAILED  ] WebglConformance.conformance_context_context_lost_restored (481 ms)

Comment 44 by kbr@chromium.org, May 6, 2014
The ContextLost.WebGLContextLostFromGPUProcessExit failure reproducible most readily on Mac OS X with a debug build is seen here:

http://build.chromium.org/p/chromium.gpu.fyi/builders/Mac%2010.9%20Debug%20%28Intel%29/builds/2395

There is a race condition in the establishing of the GPU channel between the renderer and GPU process. The symptom is that a channel is successfully established on which OnChannelError is never called, but on which all Sends fail. Continuing to diagnose exactly the sequence of operations leading to the failure. It looks like the (browser-side) GpuMessageFilter for a particular renderer gets all the way through the OnEstablishGpuChannel message handler successfully, including running the callback, but that the GPU process dies immediately afterward, before the renderer process on the other side sets up its IPC channel. At that point, the renderer will never get an OnChannelError callback for the IPC channel it sets up, and all of its Sends to create new command buffers, etc. will fail. This means that the renderer's GpuChannelHost will never report IsLost(), and the renderer will retry the creation of new command buffers indefinitely.

An excerpt from the log is below. The compositor seems to try to re-initialize 5 times and then the WebGL implementation tries to recreate its lost context over and over again until the test times out.

There may be other bugs leading to the other flakiness linked above, but I'm planning to fix this one before looking at the others.



[90089:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90081:1287:0505/203948:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:60855/webgl.html?query=kill_after_notification (0)
[90089:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90089:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90089:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90089:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90088:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90088:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90088:1287:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[90088:1287:0505/203948:ERROR:gpu_channel_host.cc(179)] Failed to send GpuChannelMsg_CreateOffscreenCommandBuffer.
[90088:1287:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90088:1287:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90088:1287:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90088:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.
[90088:19203:0505/203948:ERROR:gpu_channel_host.cc(146)] GpuChannelHost::CreateViewCommandBuffer failed.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(377)] GpuChannelHost failed to create command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(405)] Failed to initialize command buffer.
[90088:19203:0505/203948:ERROR:webgraphicscontext3d_command_buffer_impl.cc(463)] Failed to initialize context.

Comment 45 by piman@chromium.org, May 6, 2014
Note: CreateViewCommandBuffer goes to the browser, it doesn't go through the renderer's channel.
Comment 46 by kbr@chromium.org, May 8, 2014
The reason for the failure in #44 above is now known, and a fix is in progress in https://codereview.chromium.org/270633008 , though per comments there, it will need to be revised.

Even after that is fixed, there is a continued race where the WebGL context sometimes incorrectly gets restored against a context that is actually dead on arrival. This can even be seen here on Linux:

http://build.chromium.org/p/chromium.gpu/builders/Linux%20Release%20%28NVIDIA%29/builds/32948

See log excerpt below. Continuing to try to find a reliable fix for this problem.

INFO:root:Running file://webgl.html?query=kill_after_notification
WARNING:root:Webdriver backend is unsupported without selenium pylib. For installation of selenium pylib, please refer to https://code.google.com/p/selenium/wiki/PythonBindings.
[10424:10453:0507/202357:ERROR:raw_channel_posix.cc(130)] recvmsg: Connection reset by peer
[10424:10453:0507/202357:ERROR:channel.cc(293)] RawChannel fatal error (type 1)
[28:34:0507/202357:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[28:34:0507/202357:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[18:18:0507/202357:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[10424:10424:0507/202357:ERROR:gpu_process_transport_factory.cc(336)] Lost UI shared context.
[28:34:0507/202358:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[28:34:0507/202358:ERROR:webgraphicscontext3d_command_buffer_impl.cc(468)] Context dead on arrival. Last error: 5
[10424:10453:0507/202358:ERROR:raw_channel_posix.cc(130)] recvmsg: Connection reset by peer
[10424:10453:0507/202358:ERROR:channel.cc(293)] RawChannel fatal error (type 1)
NVIDIA: could not open the device file /dev/nvidia0 (Operation not permitted).
NVIDIA: could not open the device file /dev/nvidia0 (Operation not permitted).
WARNING:root:file://webgl.html?query=kill_after_notification:

Traceback (most recent call last):
  _RunPage at tools/telemetry/telemetry/page/page_runner.py:535
    test.RunPage(page, page_state.tab, results)
  RunPage at tools/telemetry/telemetry/page/page_test.py:246
    self.ValidatePage(page, tab, results)
  ValidatePage at content/test/gpu/gpu_tests/context_lost.py:104
    'Test failed (context not restored properly?)')
Failure: Test failed (context not restored properly?)

Locals:
  completed : True
  new_tab   : <telemetry.core.tab.Tab object at 0x344fa90>
  page      : <gpu_tests.context_lost.WebGLContextLostFromGPUProcessExitPage object at 0x3430b50>
  results   : <telemetry.page.gtest_test_results.GTestTestResults run=1 errors=0 failures=0>
  tab       : <telemetry.core.tab.Tab object at 0x344f5d0>
  x         : 0

Traceback (most recent call last):
  File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_runner.py", line 535, in _RunPage
    test.RunPage(page, page_state.tab, results)
  File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_test.py", line 246, in RunPage
    self.ValidatePage(page, tab, results)
  File "/tmp/run_tha_testB3mDai/content/test/gpu/gpu_tests/context_lost.py", line 104, in ValidatePage
    'Test failed (context not restored properly?)')
Failure: Test failed (context not restored properly?)

[  FAILED  ] ContextLost.WebGLContextLostFromGPUProcessExit (3404 ms)

Comment 47 by kbr@chromium.org, May 9, 2014
Blockedon: chromium:308675
Comment 48 by kbr@chromium.org, May 9, 2014
The duplicate context lost events were caused by a bug in the earlier support I added to let Telemetry navigate to about:gpucrash. It turns out that for each tab navigated to that URL, *two* GPU process crashes were triggered in rapid succession. If the second came at just the wrong time the test would fail.

Fix for both issues up for review in https://codereview.chromium.org/277113002 .

Comment 49 by kbr@chromium.org, May 9, 2014
To clarify the changes in the above CL:

Most of the time, the stack trace that handles debug URLs looks like this:

* thread #1: tid = 0x45ed61, 0x199b228c libcontent.dylib`content::HandleDebugURL(url=0xbffc64b0, transition=33554433) + 764 at debug_urls.cc:60, name = 'CrBrowserMain', queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
  * frame #0: 0x199b228c libcontent.dylib`content::HandleDebugURL(url=0xbffc64b0, transition=33554433) + 764 at debug_urls.cc:60
    frame #1: 0x199d59c1 libcontent.dylib`content::NavigationControllerImpl::LoadURLWithParams(this=0x7f2be47c, params=0xbffc64b0) + 449 at navigation_controller_impl.cc:639
    frame #2: 0x02c48882 libchrome_main_dll.dylib`(anonymous namespace)::LoadURLInContents(target_contents=0x7f2be400, url=0xbffc6800, params=0xbffc6be0) + 562 at browser_navigator.cc:280
    frame #3: 0x02c471e8 libchrome_main_dll.dylib`chrome::Navigate(params=0xbffc6be0) + 2376 at browser_navigator.cc:640
    frame #4: 0x02c2e43b libchrome_main_dll.dylib`chrome::OpenCurrentURL(browser=0x7df23a30) + 427 at browser_commands.cc:514
    frame #5: 0x02c37aa8 libchrome_main_dll.dylib`chrome::BrowserCommandController::ExecuteCommandWithDisposition(this=0x7df0a200, id=33004, disposition=CURRENT_TAB) + 1832 at browser_command_controller.cc:398
    frame #6: 0x003e867b libchrome_main_dll.dylib`CommandUpdater::ExecuteCommandWithDisposition(this=0x7df0a218, id=33004, disposition=CURRENT_TAB) + 139 at command_updater.cc:50
    frame #7: 0x003e85ba libchrome_main_dll.dylib`CommandUpdater::ExecuteCommand(this=0x7df0a218, id=33004) + 74 at command_updater.cc:43
    frame #8: 0x02fb8228 libchrome_main_dll.dylib`OmniboxEditController::OnAutocompleteAccept(this=0x7d94846c, destination_url=0xbffc7af4, disposition=CURRENT_TAB, transition=33554433) + 184 at omnibox_edit_controller.cc:19
    frame #9: 0x02fbe3f3 libchrome_main_dll.dylib`OmniboxEditModel::OpenMatch(this=0x7d9485a0, match=<unavailable>, disposition=CURRENT_TAB, alternate_nav_url=0xbffc8190, pasted_text=0xbffc7d30, index=0) + 5091 at omnibox_edit_model.cc:834
    frame #10: 0x02fcad3c libchrome_main_dll.dylib`OmniboxView::OpenMatch(this=0x7d948310, match=0xbffc8070, disposition=CURRENT_TAB, alternate_nav_url=0xbffc8190, pasted_text=0xbffc7d30, selected_line=0) + 300 at omnibox_view.cc:115
    frame #11: 0x02e83367 libchrome_main_dll.dylib`OmniboxViewMac::OpenMatch(this=0x7d948310, match=0xbffc8070, disposition=CURRENT_TAB, alternate_nav_url=0xbffc8190, pasted_text=0xbffc7d30, selected_line=0) + 135 at omnibox_view_mac.mm:252
    frame #12: 0x02fbcf93 libchrome_main_dll.dylib`OmniboxEditModel::AcceptInput(this=0x7d9485a0, disposition=CURRENT_TAB, for_drop=false) + 1939 at omnibox_edit_model.cc:690
    frame #13: 0x02e86caa libchrome_main_dll.dylib`OmniboxViewMac::OnDoCommandBySelector(this=0x7d948310, cmd=0x90c3d87a) + 1210 at omnibox_view_mac.mm:778
    frame #14: 0x02e870bf libchrome_main_dll.dylib`non-virtual thunk to OmniboxViewMac::OnDoCommandBySelector(this=0x7d948320, cmd=0x90c3d87a) + 63 at omnibox_view_mac.mm:822
    frame #15: 0x02e425a7 libchrome_main_dll.dylib`-[AutocompleteTextFieldEditor doCommandBySelector:](self=0x7d8620b0, _cmd=0x90bc599f, cmd=0x90c3d87a) + 103 at autocomplete_text_field_editor.mm:475

...

In other words, the debug URL handling occurs when the omnibox receives the new URL.

When Telemetry initiates a page navigation it doesn't go through this code path. In https://codereview.chromium.org/27603004 , from Issue 308675, I added explicit support for processing debug URLs during page navigation, but only when the GpuBenchmarkingExtension was enabled. When Telemetry is involved, the first stack trace hitting the debug URL handler is this:

* thread #1: tid = 0x45f731, 0x199ad28c libcontent.dylib`content::HandleDebugURL(url=0xbffd43a8, transition=PAGE_TRANSITION_FROM_ADDRESS_BAR) + 764 at debug_urls.cc:60, name = 'CrBrowserMain', queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
  * frame #0: 0x199ad28c libcontent.dylib`content::HandleDebugURL(url=0xbffd43a8, transition=PAGE_TRANSITION_FROM_ADDRESS_BAR) + 764 at debug_urls.cc:60
    frame #1: 0x197481df libcontent.dylib`content::DebugURLHandler(url=0xbffd43a8, browser_context=0x7c1409d0) + 175 at browser_url_handler_impl.cc:83
    frame #2: 0x19748a5f libcontent.dylib`content::BrowserURLHandlerImpl::RewriteURLIfNecessary(this=0x78ee5760, url=0xbffd43a8, browser_context=0x7c1409d0, reverse_on_redirect=0xbffd43a7) + 159 at browser_url_handler_impl.cc:131
    frame #3: 0x199ccfdc libcontent.dylib`content::NavigationController::CreateNavigationEntry(url=0xbffd4950, referrer=0xbffd49b0, transition=PAGE_TRANSITION_TYPED, is_renderer_initiated=false, extra_headers=0xbffd4a10, browser_context=0x7c1409d0) + 172 at navigation_controller_impl.cc:158
    frame #4: 0x199d0eb8 libcontent.dylib`content::NavigationControllerImpl::LoadURLWithParams(this=0x7e2f0a7c, params=0xbffd4950) + 1720 at navigation_controller_impl.cc:694
    frame #5: 0x199d07c7 libcontent.dylib`content::NavigationControllerImpl::LoadURL(this=0x7e2f0a7c, url=0xbffd4bf0, referrer=0xbffd4b88, transition=PAGE_TRANSITION_TYPED, extra_headers=0xbffd4b80) + 247 at navigation_controller_impl.cc:634
    frame #6: 0x1982f457 libcontent.dylib`content::RendererOverridesHandler::PageNavigate(this=0x79c7f140, command=<unavailable>) + 935 at renderer_overrides_handler.cc:375
    frame #7: 0x1983a717 libcontent.dylib`base::internal::RunnableAdapter<scoped_refptr<content::DevToolsProtocol::Response> (this=0xbffd4d24, object=0x79c7f140, a1=0xbffd4e90)(scoped_refptr<content::DevToolsProtocol::Command>)>::Run(content::RendererOverridesHandler*, scoped_refptr<content::DevToolsProtocol::Command> const&) + 183 at bind_internal.h:190
    frame #8: 0x1983a5ed libcontent.dylib`base::internal::InvokeHelper<false, scoped_refptr<content::DevToolsProtocol::Response>, base::internal::RunnableAdapter<scoped_refptr<content::DevToolsProtocol::Response> (runnable=RunnableAdapter<scoped_refptr<content::DevToolsProtocol::Response> (content::RendererOverridesHandler::*)(scoped_refptr<content::DevToolsProtocol::Command>)> at 0xbffd4d24, a1=0x79c7f140, a2=0xbffd4e90)(scoped_refptr<content::DevToolsProtocol::Command>)>, void (content::RendererOverridesHandler*, scoped_refptr<content::DevToolsProtocol::Command> const&)>::MakeItSo(base::internal::RunnableAdapter<scoped_refptr<content::DevToolsProtocol::Response> (content::RendererOverridesHandler::*)(scoped_refptr<content::DevToolsProtocol::Command>)>, content::RendererOverridesHandler*, scoped_refptr<content::DevToolsProtocol::Command> const&) + 109 at bind_internal.h:890
    frame #9: 0x1983a4fe libcontent.dylib`base::internal::Invoker<1, base::internal::BindState<base::internal::RunnableAdapter<scoped_refptr<content::DevToolsProtocol::Response> (base=0x79c7f400, x2=0xbffd4e90)(scoped_refptr<content::DevToolsProtocol::Command>)>, scoped_refptr<content::DevToolsProtocol::Response> (content::RendererOverridesHandler*, scoped_refptr<content::DevToolsProtocol::Command>), void (base::internal::UnretainedWrapper<content::RendererOverridesHandler>)>, scoped_refptr<content::DevToolsProtocol::Response> (content::RendererOverridesHandler*, scoped_refptr<content::DevToolsProtocol::Command>)>::Run(base::internal::BindStateBase*, scoped_refptr<content::DevToolsProtocol::Command> const&) + 158 at bind_internal.h:1219
    frame #10: 0x1980e0f5 libcontent.dylib`base::Callback<scoped_refptr<content::DevToolsProtocol::Response> (this=0x79c7f434, a1=0xbffd4e90)>::Run(scoped_refptr<content::DevToolsProtocol::Command> const&) const + 117 at callback.h:441
    frame #11: 0x1980ce21 libcontent.dylib`content::DevToolsProtocol::Handler::HandleCommand(this=0x79c7f140, command=<unavailable>) + 241 at devtools_protocol.cc:169
    frame #12: 0x198261e9 libcontent.dylib`content::RenderViewDevToolsAgentHost::DispatchOnInspectorBackend(this=0x79c7f070, message=0x79a5adbc) + 185 at render_view_devtools_agent_host.cc:193
    frame #13: 0x197f6d75 libcontent.dylib`content::DevToolsManagerImpl::DispatchOnInspectorBackend(this=0x7c1706e0, from=0x78e2b260, message=0x79a5adbc) + 117 at devtools_manager_impl.cc:78
    frame #14: 0x197f6dfd libcontent.dylib`non-virtual thunk to content::DevToolsManagerImpl::DispatchOnInspectorBackend(this=0x7c1706e4, from=0x78e2b260, message=0x79a5adbc) + 77 at devtools_manager_impl.cc:80
    frame #15: 0x197e1950 libcontent.dylib`content::DevToolsHttpHandlerImpl::OnWebSocketMessageUI(this=0x7c129a50, connection_id=10, data=0x79a5adbc) + 192 at devtools_http_handler_impl.cc:621
...

and the second one is as follows:

* thread #1: tid = 0x45f731, 0x199ad28c libcontent.dylib`content::HandleDebugURL(url=0xbffd3aa0, transition=PAGE_TRANSITION_FROM_ADDRESS_BAR) + 764 at debug_urls.cc:60, name = 'CrBrowserMain', queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
  * frame #0: 0x199ad28c libcontent.dylib`content::HandleDebugURL(url=0xbffd3aa0, transition=PAGE_TRANSITION_FROM_ADDRESS_BAR) + 764 at debug_urls.cc:60
    frame #1: 0x197481df libcontent.dylib`content::DebugURLHandler(url=0xbffd3aa0, browser_context=0x7c1409d0) + 175 at browser_url_handler_impl.cc:83
    frame #2: 0x19748a5f libcontent.dylib`content::BrowserURLHandlerImpl::RewriteURLIfNecessary(this=0x78ee5760, url=0xbffd3aa0, browser_context=0x7c1409d0, reverse_on_redirect=0xbffd3a9f) + 159 at browser_url_handler_impl.cc:131
    frame #3: 0x199d2c01 libcontent.dylib`content::NavigationControllerImpl::RendererDidNavigateToNewPage(this=0x7e2f0a7c, rfh=0x7c5a3860, params=0xbffd4240, replace_entry=false) + 433 at navigation_controller_impl.cc:1015
    frame #4: 0x199d16f8 libcontent.dylib`content::NavigationControllerImpl::RendererDidNavigate(this=0x7e2f0a7c, rfh=0x7c5a3860, params=0xbffd4240, details=0xbffd41c8) + 856 at navigation_controller_impl.cc:771
    frame #5: 0x199ed7a7 libcontent.dylib`content::NavigatorImpl::DidNavigate(this=0x7c5a24e0, render_frame_host=0x7c5a3860, input_params=0xbffd4540) + 951 at navigator_impl.cc:473
    frame #6: 0x199f6f37 libcontent.dylib`content::RenderFrameHostImpl::OnNavigate(this=0x7c5a3860, msg=0x79a59498) + 1447 at render_frame_host_impl.cc:474
    frame #7: 0x199f57b0 libcontent.dylib`content::RenderFrameHostImpl::OnMessageReceived(this=0x7c5a3860, msg=0x79a59498) + 2880 at render_frame_host_impl.cc:277
    frame #8: 0x19fdf223 libcontent.dylib`content::RenderProcessHostImpl::OnMessageReceived(this=0x7c5a0380, msg=0x79a59498) + 2195 at render_process_host_impl.cc:1358
    frame #9: 0x19fdf7df libcontent.dylib`non-virtual thunk to content::RenderProcessHostImpl::OnMessageReceived(this=0x7c5a0384, msg=0x79a59498) + 63 at render_process_host_impl.cc:1359
    frame #10: 0x21fde863 libipc.dylib`IPC::ChannelProxy::Context::OnDispatchMessage(this=0x7c5c3370, message=0x79a59498) + 579 at ipc_channel_proxy.cc:269
    frame #11: 0x21fe7e80 libipc.dylib`base::internal::RunnableAdapter<void (this=0xbffd5050, object=0x7c5c3370, a1=0x79a59498)(IPC::Message const&)>::Run(IPC::ChannelProxy::Context*, IPC::Message const&) + 160 at bind_internal.h:190
...

Project Member Comment 50 by bugdroid1@chromium.org, May 12, 2014
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/eabfe191cb87c254770b64981e675a1805c78444

commit eabfe191cb87c254770b64981e675a1805c78444
Author: kbr@chromium.org <kbr@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Mon May 12 10:07:28 2014

Fixed flakiness of context_lost tests on GPU bots.

Two failures were observed:

1. If the GPU process crashed between EstablishGpuChannel and
   CreateViewCommandBuffer, the channel would never be properly marked
   as lost. Now, if CreateViewCommandBuffer fails, mark the
   GpuChannelHost as lost on the renderer side. The RenderThreadImpl
   will then establish a new connection to the new GPU process.

2. In Issue 308675, support was added to allow Telemetry to navigate
   to about:gpucrash, specifically for this test. It turns out that
   each navigation from Telemetry was provoking *two* GPU process
   crashes. Depending on when they came in, the test would
   intermittently receive two lost context events and fail. Squelch
   the second debug URL handling in NavigationControllerImpl.

These fix the locally reproducible failures of this test.

BUG= 365904 
TEST=src/content/test/gpu/run_gpu_test.py context_lost --browser=debug --show-stdout -v --page-filter=WebGLContextLostFromGPUProcessExit --page-repeat=30

Review URL: https://codereview.chromium.org/277113002

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@269763 0039d316-1c4b-4281-b951-d872f2087c98


Project Member Comment 51 by bugdroid1@chromium.org, May 12, 2014
------------------------------------------------------------------
r269763 | kbr@chromium.org | 2014-05-12T10:07:28.076346Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/frame_host/debug_urls.cc?r1=269763&r2=269762&pathrev=269763
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/frame_host/debug_urls.h?r1=269763&r2=269762&pathrev=269763
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/frame_host/navigation_controller_impl.cc?r1=269763&r2=269762&pathrev=269763
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/common/gpu/client/gpu_channel_host.cc?r1=269763&r2=269762&pathrev=269763

Fixed flakiness of context_lost tests on GPU bots.

Two failures were observed:

1. If the GPU process crashed between EstablishGpuChannel and
   CreateViewCommandBuffer, the channel would never be properly marked
   as lost. Now, if CreateViewCommandBuffer fails, mark the
   GpuChannelHost as lost on the renderer side. The RenderThreadImpl
   will then establish a new connection to the new GPU process.

2. In Issue 308675, support was added to allow Telemetry to navigate
   to about:gpucrash, specifically for this test. It turns out that
   each navigation from Telemetry was provoking *two* GPU process
   crashes. Depending on when they came in, the test would
   intermittently receive two lost context events and fail. Squelch
   the second debug URL handling in NavigationControllerImpl.

These fix the locally reproducible failures of this test.

BUG= 365904 
TEST=src/content/test/gpu/run_gpu_test.py context_lost --browser=debug --show-stdout -v --page-filter=WebGLContextLostFromGPUProcessExit --page-repeat=30

Review URL: https://codereview.chromium.org/277113002
-----------------------------------------------------------------
Comment 52 by kbr@chromium.org, May 12, 2014
r269763 has fixed the intermittent failures of the context_lost tests on the Linux and Mac bots. There is a remaining intermittent failure on Windows where the browser fails to establish (or reestablish) the GPU channel and hangs. Excerpts from a couple of failing builds follow. Continuing to debug this with high priority.

http://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/16447

[ RUN      ] ContextLost.WebGLContextLostFromGPUProcessExit
INFO:root:Running file://webgl.html?query=kill_after_notification
[1864:852:0512/132202:ERROR:channel.cc(297)] RawChannel fatal error (type 1)
[1864:3804:0512/132202:ERROR:gpu_process_transport_factory.cc(361)] Lost UI shared context.
[1864:3804:0512/132202:ERROR:gpu_process_transport_factory.cc(337)] Failed to establish GPU channel.



http://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/16443

[ RUN      ] ContextLost.WebGLContextLostFromGPUProcessExit
INFO:root:Running file://webgl.html?query=kill_after_notification
[3144:1144:0512/115333:ERROR:channel.cc(297)] RawChannel fatal error (type 1)
[3144:648:0512/115333:ERROR:gpu_process_transport_factory.cc(361)] Lost UI shared context.
[3144:648:0512/115333:ERROR:gpu_process_transport_factory.cc(337)] Failed to establish GPU channel.

Comment 53 by kbr@chromium.org, May 12, 2014
Blocking: chromium:327170
Comment 54 by jbau...@chromium.org, May 13, 2014
Here's the stack I normally see on the Windows hang:

0018d55c 75af15e9 ntdll!NtWaitForMultipleObjects+0x15
0018d5f8 756919f8 KERNELBASE!WaitForMultipleObjectsEx+0x100
0018d640 75694200 kernel32!WaitForMultipleObjectsExImplementation+0xe0
0018d65c 1023c8da kernel32!WaitForMultipleObjects+0x18
0018d928 0d66310e base!base::WaitableEvent::WaitMany+0x15a [d:\src\chrome\src\base\synchronization\waitable_event_win.cc @ 93]
0018d960 0d6627b7 ipc!IPC::SyncChannel::WaitForReply+0x8e [d:\src\chrome\src\ipc\ipc_sync_channel.cc @ 488]
0018dad4 1250fd03 ipc!IPC::SyncChannel::Send+0x367 [d:\src\chrome\src\ipc\ipc_sync_channel.cc @ 472]
0018dcb4 1250cfca content!content::GpuChannelHost::Send+0x143 [d:\src\chrome\src\content\common\gpu\client\gpu_channel_host.cc @ 118]
0018df8c 12537f67 content!content::GpuChannelHost::CreateOffscreenCommandBuffer+0x21a [d:\src\chrome\src\content\common\gpu\client\gpu_channel_host.cc @ 196]
0018e29c 125367ae content!content::WebGraphicsContext3DCommandBufferImpl::InitializeCommandBuffer+0x387 [d:\src\chrome\src\content\common\gpu\client\webgraphicscontext3d_command_buffer_impl.cc @ 373]
0018e654 12538837 content!content::WebGraphicsContext3DCommandBufferImpl::CreateContext+0x11e [d:\src\chrome\src\content\common\gpu\client\webgraphicscontext3d_command_buffer_impl.cc @ 404]
0018e744 125436a3 content!content::WebGraphicsContext3DCommandBufferImpl::MaybeInitializeGL+0x147 [d:\src\chrome\src\content\common\gpu\client\webgraphicscontext3d_command_buffer_impl.cc @ 281]
0018e8ec 124e0e9e content!content::WebGraphicsContext3DCommandBufferImpl::makeContextCurrent+0x53 [d:\src\chrome\src\content\common\gpu\client\webgraphicscontext3d_command_buffer_impl.cc @ 462]
0018ea2c 11376177 content!content::ContextProviderCommandBuffer::BindToCurrentThread+0x15e [d:\src\chrome\src\content\common\gpu\client\context_provider_command_buffer.cc @ 94]
0018eac8 1137451a content!content::GpuProcessTransportFactory::SharedMainThreadContextProvider+0x1e7 [d:\src\chrome\src\content\browser\compositor\gpu_process_transport_factory.cc @ 296]
0018eb0c 1135e4ee content!content::GpuProcessTransportFactory::GetGLHelper+0x8a [d:\src\chrome\src\content\browser\compositor\gpu_process_transport_factory.cc @ 260]
0018ee40 1135936f content!content::DelegatedFrameHost::PrepareTextureCopyOutputResult+0x2fe [d:\src\chrome\src\content\browser\compositor\delegated_frame_host.cc @ 488]
0018efb8 11360544 content!content::DelegatedFrameHost::CopyFromCompositingSurfaceHasResult+0x1ff [d:\src\chrome\src\content\browser\compositor\delegated_frame_host.cc @ 442]

It looks like it's attempting to communicate to a GPU process that's good and still exists and is essentially idle in its message loop, but for some reason it doesn't receive the IPC reply.
Comment 55 by kbr@chromium.org, May 13, 2014
Thanks for getting that stack. Can we come up with some hypotheses for why it's happening?

 - Could it be that the GPU process that's live at this point is actually a different one than the GpuChannelHost was instantiated for?
 - Could there be some bug in the IPC::SyncMessageFilter that under some conditions it won't notice a channel error? (and that the GPU process associated with the GpuChannelHost actually crashed and was restarted, but the browser's GpuChannelHost never received an error notification)
 - Is it known whether the GPU process actually received the GpuChannelMsg_CreateOffscreenCommandBuffer message?

Is it possible to add logging to both the browser and GPU process to help diagnose this further? This is much easier on Mac and Linux.

Comment 56 by jbau...@chromium.org, May 13, 2014
> - Could it be that the GPU process that's live at this point is actually a different one than the GpuChannelHost was instantiated for?
> Could there be some bug in the IPC::SyncMessageFilter that under some conditions it won't notice a channel error? (and that the GPU process associated with the GpuChannelHost actually crashed and was restarted, but the browser's GpuChannelHost never received an error notification)

I checked the peer_pid() of the SyncChannel, and it's the same as the process ID of the living GPU process, so everything matches up there. So whatever it is that created the GPU channel/host must have known which GPU process it was supposed to talk to.

> - Is it known whether the GPU process actually received the GpuChannelMsg_CreateOffscreenCommandBuffer message?

No clue yet - I'll have to add logging for this.
Comment 57 by jbau...@chromium.org, May 13, 2014
Looks like the browser's GPU channel is getting descheduled waiting on a sync point, and is never rescheduled. That shouldn't be possible?
Comment 58 by kbr@chromium.org, May 14, 2014
Blocking: chromium:373452
Project Member Comment 59 by bugdroid1@chromium.org, May 15, 2014
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5c20eadfe8d8e278c9483b7e37be14d893655472

commit 5c20eadfe8d8e278c9483b7e37be14d893655472
Author: jbauman@chromium.org <jbauman@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Thu May 15 06:10:57 2014

NULL out pending_requests_ when GPU channel fails to be created.

Otherwise the next attempt to create a GPU channel won't create a new pending request and will hang forever.

BUG= 365904 

Review URL: https://codereview.chromium.org/284003002

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@270594 0039d316-1c4b-4281-b951-d872f2087c98


Project Member Comment 60 by bugdroid1@chromium.org, May 15, 2014
------------------------------------------------------------------
r270594 | jbauman@chromium.org | 2014-05-15T06:10:57.500380Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/gpu/browser_gpu_channel_host_factory.cc?r1=270594&r2=270593&pathrev=270594

NULL out pending_requests_ when GPU channel fails to be created.

Otherwise the next attempt to create a GPU channel won't create a new pending request and will hang forever.

BUG= 365904 

Review URL: https://codereview.chromium.org/284003002
-----------------------------------------------------------------
Project Member Comment 61 by bugdroid1@chromium.org, May 15, 2014
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7ded3ddf6e0c855bb435bcc870b3618f7d0e29fa

commit 7ded3ddf6e0c855bb435bcc870b3618f7d0e29fa
Author: jbauman@chromium.org <jbauman@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Thu May 15 08:20:10 2014

Randomize sync point starting number to reduce risk of GPU hangs.

There's a race condition in sync point creation where code can start a wait on a sync point immediately before the sync point is inserted. This will cause the GPU channel to lock up, as the retire will be placed in the queue after the wait.
This normally won't happen in practice (as the client should only wait on sync points that have been inserted) but it can happen if the GPU process restarts and the client waits on a sync point created by the old GPU process.

To reduce the risk of this happening, randomize the starting sync point ID. That way it's much less likely that the sync point created by the old GPU process will soon be created in the new GPU process.

BUG= 365904 , 373452 

Review URL: https://codereview.chromium.org/279303003

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@270630 0039d316-1c4b-4281-b951-d872f2087c98


Project Member Comment 62 by bugdroid1@chromium.org, May 15, 2014
------------------------------------------------------------------
r270630 | jbauman@chromium.org | 2014-05-15T08:20:10.183343Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/common/gpu/sync_point_manager.cc?r1=270630&r2=270629&pathrev=270630

Randomize sync point starting number to reduce risk of GPU hangs.

There's a race condition in sync point creation where code can start a wait on a sync point immediately before the sync point is inserted. This will cause the GPU channel to lock up, as the retire will be placed in the queue after the wait.
This normally won't happen in practice (as the client should only wait on sync points that have been inserted) but it can happen if the GPU process restarts and the client waits on a sync point created by the old GPU process.

To reduce the risk of this happening, randomize the starting sync point ID. That way it's much less likely that the sync point created by the old GPU process will soon be created in the new GPU process.

BUG= 365904 , 373452 

Review URL: https://codereview.chromium.org/279303003
-----------------------------------------------------------------
Comment 63 by jbau...@chromium.org, May 15, 2014
I think this should be fixed now, unless kbr@ knows of other issues with the tests.
Comment 64 by kbr@chromium.org, May 16, 2014
jbauman@: thank you very much for tracking down those two failures. The context_lost tests are running reliably now both on the GPU tryservers and chromium.gpu waterfall.

There is a remaining intermittent failure of the conformance_context_context_lost_restored WebGL conformance test. Here are a few failing builds:

http://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29/builds/18438
http://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29/builds/18392
http://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29/builds/18307

When the test fails, this is the output:

[3920:2456:0515/071140:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: restoreContext: context restoration not allowed", source: http://127.0.0.1:51964/conformance/context/context-lost-restored.html (0)
[3920:2456:0515/071140:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: bindTexture: object not from this context", source: http://127.0.0.1:51964/conformance/context/context-lost-restored.html (0)
[3920:2456:0515/071140:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: useProgram: object not from this context", source: http://127.0.0.1:51964/conformance/context/context-lost-restored.html (0)
[3920:2456:0515/071140:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: bindBuffer: object not from this context", source: http://127.0.0.1:51964/conformance/context/context-lost-restored.html (0)
[3920:2456:0515/071140:INFO:CONSOLE(0)] "WebGL: INVALID_ENUM: texImage2D: invalid texture type", source: http://127.0.0.1:51964/conformance/context/context-lost-restored.html (0)
[3920:2456:0515/071140:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: restoreContext: context not lost", source: http://127.0.0.1:51964/conformance/context/context-lost-restored.html (0)
[3920:2456:0515/071140:INFO:CONSOLE(11)] "getError expected: NO_ERROR. Was INVALID_OPERATION : after evaluating: WEBGL_lose_context.restoreContext()", source:  (11)
[3920:2456:0515/071140:INFO:CONSOLE(11)] "gl.isContextLost() should be true. Was false.", source:  (11)
[3920:2456:0515/071140:INFO:CONSOLE(11)] "getError expected: NO_ERROR. Was INVALID_ENUM : after evaluating: gl.blendFunc(gl.TEXTURE_2D, gl.TEXTURE_CUBE_MAP)", source:  (11)
[3920:2456:0515/071140:INFO:CONSOLE(0)] "[GroupMarkerNotSet(crbug.com/242999)!:60CF2129]GL ERROR :GL_INVALID_ENUM : glBlendFunc: sfactor was GL_TEXTURE_2D", source: http://127.0.0.1:51964/conformance/context/context-lost-restored.html (0)
WARNING:root:file://conformance\context\context-lost-restored.html:

Traceback (most recent call last):
  _RunPage at tools\telemetry\telemetry\page\page_runner.py:536
    test.RunPage(page, page_state.tab, results)
  RunPage at tools\telemetry\telemetry\page\page_test.py:246
    self.ValidatePage(page, tab, results)
  ValidatePage at content\test\gpu\gpu_tests\webgl_conformance.py:71
    raise page_test.Failure(_WebGLTestMessages(tab))
Failure: getError expected: NO_ERROR. Was INVALID_OPERATION : after evaluating: WEBGL_lose_context.restoreContext()
gl.isContextLost() should be true. Was false.
getError expected: NO_ERROR. Was INVALID_ENUM : after evaluating: gl.blendFunc(gl.TEXTURE_2D, gl.TEXTURE_CUBE_MAP)

Locals:
  page    : <gpu_tests.webgl_conformance.WebglConformancePage object at 0x02F29F10>
  results : <telemetry.page.gtest_test_results.GTestTestResults run=39 errors=0 failures=0>
  tab     : <telemetry.core.tab.Tab object at 0x028FC3F0>

Traceback (most recent call last):
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_runner.py", line 536, in _RunPage
    test.RunPage(page, page_state.tab, results)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_test.py", line 246, in RunPage
    self.ValidatePage(page, tab, results)
  File "C:\b\build\slave\Win7_Debug__NVIDIA_\build\src\content\test\gpu\gpu_tests\webgl_conformance.py", line 71, in ValidatePage
    raise page_test.Failure(_WebGLTestMessages(tab))
Failure: getError expected: NO_ERROR. Was INVALID_OPERATION : after evaluating: WEBGL_lose_context.restoreContext()
gl.isContextLost() should be true. Was false.
getError expected: NO_ERROR. Was INVALID_ENUM : after evaluating: gl.blendFunc(gl.TEXTURE_2D, gl.TEXTURE_CUBE_MAP)


The correct output is:


[ RUN      ] WebglConformance.conformance_context_context_lost_restored
INFO:root:Running file://conformance\context\context-lost-restored.html
[340:3272:0515/200526:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: restoreContext: context restoration not allowed", source: http://127.0.0.1:50873/conformance/context/context-lost-restored.html (0)
[340:3272:0515/200526:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: bindTexture: object not from this context", source: http://127.0.0.1:50873/conformance/context/context-lost-restored.html (0)
[340:3272:0515/200526:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: useProgram: object not from this context", source: http://127.0.0.1:50873/conformance/context/context-lost-restored.html (0)
[340:3272:0515/200526:INFO:CONSOLE(0)] "WebGL: INVALID_OPERATION: bindBuffer: object not from this context", source: http://127.0.0.1:50873/conformance/context/context-lost-restored.html (0)
[340:3272:0515/200526:INFO:CONSOLE(0)] "WebGL: INVALID_ENUM: texImage2D: invalid texture type", source: http://127.0.0.1:50873/conformance/context/context-lost-restored.html (0)
[       OK ] WebglConformance.conformance_context_context_lost_restored (652 ms)


The failure mode is strange. It looks like the timeout which is set inside testLosingAndRestoringContext is called twice. I don't think the context lost event is being delivered twice, though that's possible. The incorrect output seems to map to a second call to 'WEBGL_lose_context.restoreContext()' which fails, along with the rest of the code in that function.

This test is still failing once a day on http://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29?numbuilds=200 which is too high a rate. We must figure out what is happening and fix it.

Project Member Comment 65 by bugdroid1@chromium.org, May 16, 2014
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/28b989cc54a58fc7fabe0ef545cf5256d04ca8d9

commit 28b989cc54a58fc7fabe0ef545cf5256d04ca8d9
Author: dongseong.hwang@intel.com <dongseong.hwang@intel.com@0039d316-1c4b-4281-b951-d872f2087c98>
Date: Fri May 16 11:57:57 2014

SyncPointManager never returns 0.

When the next sync point overflow occurs, it returns 1, not 0.

BUG= 365904 ,  373452 

Review URL: https://codereview.chromium.org/288153004

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@270997 0039d316-1c4b-4281-b951-d872f2087c98


Project Member Comment 66 by bugdroid1@chromium.org, May 16, 2014
------------------------------------------------------------------
r270997 | dongseong.hwang@intel.com | 2014-05-16T11:57:57.225177Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/common/gpu/sync_point_manager.cc?r1=270997&r2=270996&pathrev=270997
   M http://src.chromium.org/viewvc/chrome/trunk/src/content/common/gpu/sync_point_manager.h?r1=270997&r2=270996&pathrev=270997

SyncPointManager never returns 0.

When the next sync point overflow occurs, it returns 1, not 0.

BUG= 365904 ,  373452 

Review URL: https://codereview.chromium.org/288153004
-----------------------------------------------------------------
Comment 67 by kbr@chromium.org, May 16, 2014
Blocking: chromium:374378
Comment 68 by kbr@chromium.org, May 16, 2014
Status: Fixed
I attempted to reproduce the WebglConformance.conformance_context_context_lost_restored failure with a debug build on Windows. Just running the test in isolation many (1000) times doesn't reproduce it. Running the conformance/context/ tests 100 times for a total of 1400 tests doesn't reproduce it either. I haven't yet attempted to run the entire conformance suite repeatedly to try to reproduce.

Given that the majority of the lost context flakiness has been addressed with the above fixes, closing this bug as fixed. Filed Issue 374378 to track the WebGL conformance test flakiness. Thanks everybody for your efforts on fixing this.

Comment 69 by kbr@chromium.org, Jul 8, 2014
Blocking: chromium:392274
Comment 70 by kbr@chromium.org, Jul 8, 2014
Blocking: chromium:374086
Blocking: chromium:393188
Comment 72 by kbr@chromium.org, Jul 11, 2014
Blocking: chromium:381024
Comment 73 by kbr@chromium.org, Jul 21, 2014
Blocking: chromium:395326
Comment 74 by kbr@chromium.org, Aug 27, 2014
Blocking: chromium:407976
Comment 75 by kbr@chromium.org, Aug 27, 2014
Blocking: chromium:408358
Sign in to add a comment