Project: chromium Issues People Development process History Sign in
New issue
Advanced search Search tips
Issue 395914 maps_pixel_test failing frequently on win_gpu_triggered_tests try server
Starred by 3 users Project Member Reported by kbr@chromium.org, Jul 22, 2014 Back to list
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests?numbuilds=200

maps_pixel_test is timing out often -- rough estimate 25% of runs. This is an unacceptably high failure rate.

Here are some example failing builds:

http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30551
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30554
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30555
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30557
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30561
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30570
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30577
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30578
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30586
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30588
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30589
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30592

The failures aren't localized to any particular machine -- they seem to be happening on many different machines.

I'm attempting to reproduce locally but have not yet been able to. There's a possibility that this may have been caused by recent enabling of release mode asserts in Blink. Another likely possibility is a V8 roll since this is one of the tests that runs the most JavaScript on any of the waterfalls.

The failure mode is the same everywhere:

INFO:root:  rasterization       : unavailable_software
INFO:root:  threaded_rasterization: disabled_off_ok
INFO:root:  video_decode        : enabled
INFO:root:  video_encode        : enabled
INFO:root:  webgl               : enabled
INFO:root:Driver Bug Workarounds:
INFO:root:  clear_uniforms_before_first_program_use
INFO:root:  exit_on_context_lost
INFO:root:  texsubimage2d_faster_than_teximage2d
INFO:root:  use_current_program_after_successful_link
INFO:root:Running http://localhost:10020/tracker.html
Traceback (most recent call last):
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_runner.py", line 531, in _RunPage
page_state.ImplicitPageNavigation(test)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_runner.py", line 175, in ImplicitPageNavigation
test.RunNavigateSteps(self.page, self.tab)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_test.py", line 249, in RunNavigateSteps
page.RunNavigateSteps(action_runner)
File "c:\users\chrome~1.chr\appdata\local\temp\run_tha_test1wnp2p\content\test\gpu\gpu_tests\maps.py", line 86, in RunNavigateSteps
'window.testDone', timeout_in_seconds=180)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\actions\action_runner.py", line 167, in WaitForJavaScriptCondition
self._tab.WaitForJavaScriptExpression(condition, timeout_in_seconds)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\web_contents.py", line 69, in WaitForJavaScriptExpression
"""))
TimeoutException: Timed out while waiting 180s for IsJavaScriptExpressionTrue.

It doesn't look like the tab crashes, but that it fails to complete the test.

 
Comment 1 by kbr@chromium.org, Jul 22, 2014
Cc: titzer@chromium.org jkummerow@chromium.org hpayer@chromium.org a deleted user
This just failed on the Win7 Debug (NVIDIA) bot on the main waterfall:

http://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29/builds/20622

Possibly because it's a Debug-only bot, it printed more information to the console than Release builds:

INFO:root:  video_encode        : enabled
INFO:root:  webgl               : enabled
INFO:root:Driver Bug Workarounds:
INFO:root:  clear_uniforms_before_first_program_use
INFO:root:  exit_on_context_lost
INFO:root:  texsubimage2d_faster_than_teximage2d
INFO:root:  use_current_program_after_successful_link
INFO:root:Running http://localhost:10020/tracker.html
[3868:2244:0722/122000:INFO:CONSOLE(32)] "Uncaught ReferenceError: vectortown is not defined", source: http://localhost:10020/tracker.html (32)
[3868:2244:0722/122303:INFO:CONSOLE(0)] "'window.webkitStorageInfo' is deprecated. Please use 'navigator.webkitTemporaryStorage' or 'navigator.webkitPersistentStorage' instead.", source:  (0)

The uncaught ReferenceError sounds strongly like an intermittent V8 bug to me. 
Comment 2 by kbr@chromium.org, Jul 22, 2014
Cc: -a deleted user mstarzinger@chromium.org yangguo@chromium.org dslomov@chromium.org
CC'ing several V8 team members. I am attempting to reproduce this on the bots. I ask that a counterpart from the V8 team attempt to reproduce this problem locally as well. The steps to reproduce would be:

python.exe content\test\gpu\run_gpu_test maps --browser=[release,debug]

Comment 3 by kbr@chromium.org, Jul 22, 2014
Note that you'll have to run "gsutil config" and authenticate with an @google.com email address in order to download the Maps WPR archive.

Comment 4 by kbr@chromium.org, Jul 22, 2014
Cc: tonyg@chromium.org dtu@chromium.org pfeldman@chromium.org yu...@chromium.org
Labels: Cr-Platform-DevTools Cr-Tests-Telemetry
I've caught a failing run in Telemetry on one of the bots. When the problem happens, "Uncaught ReferenceError: vectortown is not defined" is printed in the JavaScript console, just as the Debug bot reported.

Some race condition has been introduced recently causing this failure. Possibilities include V8, DevTools, or Telemetry.

Comment 5 by kbr@chromium.org, Jul 22, 2014
I'm able to reproduce the failure on build161-a4 (one of the slaves backing win_gpu_triggered_tests) by running the isolate from this whitespace CL:

http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30840

python.exe swarming.client\run_isolated.py -H 4478b577fa9509ded96e8e952a153e21ecf834b0 -I https://isolateserver.appspot.com -- maps --browser=release

Currently building my own isolate to try reproducing on this machine. At that point will start reverting likely changes and seeing whether they fix the flakiness.

Comment 6 by kbr@chromium.org, Jul 22, 2014
Cc: ccameron@chromium.org
Labels: OS-Mac
Here's an interesting failure on Mac

http://build.chromium.org/p/tryserver.chromium.gpu/builders/mac_gpu_triggered_tests?numbuilds=200

http://build.chromium.org/p/tryserver.chromium.gpu/builders/mac_gpu_triggered_tests/builds/26817

INFO:root:  validate_multisample_buffer_allocation
[804:1799:0722/134601:FATAL:thread_restrictions.cc(72)] Waiting is not allowed to be used on this thread to preventjank and deadlock.
** Unknown exception behavior: 0

The test run timed out after 20 minutes.

Comment 7 by kbr@chromium.org, Jul 22, 2014
Cc: wfh@chromium.org
Reproduced the "Uncaught ReferenceError: vectortown is not defined" error on build161-a4 with binaries built on my machine.

Comment 8 by kbr@chromium.org, Jul 23, 2014
It doesn't look like recent changes to V8 are at fault. I built and pushed binaries reverting V8 to 3.28.25 (r22148) and the failure still occurred.

Comment 9 by kbr@chromium.org, Jul 23, 2014
Cc: vadimsh@chromium.org mar...@chromium.org csharp@chromium.org
I'm seeing the following behavior:

- When running a particular telemetry_gpu_test.isolate hosted on the isolate server via swarming.client\run_isolated.py -H [hash] -I https://isolateserver.appspot.com [args], the failure occurs reliably on the affected machine.

- When downloading that isolate to the local disk on the affected machine and running content\test\gpu\run_gpu_test.py maps [args], the failure does not occur.

I noticed this because I attempted to download a continuous build and wasn't able to reproduce the failure when running a previously failing isolate's run_gpu_test.py against that continuous build.

Did something change recently in run_isolated.py's setup of the temporary directory for running the test?

Comment 10 by kbr@chromium.org, Jul 23, 2014
Labels: -OS-Windows -OS-Mac OS-All
Also, this failure happened on the Linux Debug (NVIDIA) bot as well:

http://build.chromium.org/p/chromium.gpu/builders/Linux%20Debug%20%28NVIDIA%29/builds/31185

INFO:root:Running http://localhost:10020/tracker.html
[2997:3033:0722/184937:WARNING:raw_channel_posix.cc(214)] recvmsg: Connection reset by peer
[2997:2997:0722/184937:INFO:CONSOLE(32)] "Uncaught ReferenceError: vectortown is not defined", source: http://localhost:10020/tracker.html (32)
[2997:3023:0722/184946:ERROR:cert_verify_proc_nss.cc(880)] CERT_PKIXVerifyCert for ssl.gstatic.com failed err=-8172
[2997:2997:0722/185240:INFO:CONSOLE(0)] "'window.webkitStorageInfo' is deprecated. Please use 'navigator.webkitTemporaryStorage' or 'navigator.webkitPersistentStorage' instead.", source:  (0)
Traceback (most recent call last):
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_runner.py", line 531, in _RunPage
page_state.ImplicitPageNavigation(test)
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_runner.py", line 175, in ImplicitPageNavigation
test.RunNavigateSteps(self.page, self.tab)
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_test.py", line 249, in RunNavigateSteps
page.RunNavigateSteps(action_runner)
File "/tmp/run_tha_testZRby7h/content/test/gpu/gpu_tests/maps.py", line 86, in RunNavigateSteps
'window.testDone', timeout_in_seconds=180)
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/actions/action_runner.py", line 167, in WaitForJavaScriptCondition
self._tab.WaitForJavaScriptExpression(condition, timeout_in_seconds)
File "../content/test/gpu/../../../tools/telemetry/telemetry/core/web_contents.py", line 69, in WaitForJavaScriptExpression
"""))
TimeoutException: Timed out while waiting 180s for IsJavaScriptExpressionTrue.

Comment 11 by mar...@chromium.org, Jul 23, 2014
I'll speculate the fact it occurs on one host not another one is likely related to the memory layout, e.g. rebooting would cause an ASLR shuffle which would alter reproducibility. I'd be quite surprised if run_isolated is at fault.
Comment 12 by kbr@chromium.org, Jul 23, 2014
I modified run_isolated.py locally to save off the temporary directory it creates. Invoking the test via the isolate hung, presumably because of the "Uncaught ReferenceError: vectortown is not defined" error. Subsequently invoking run_gpu_test.py out of the temporary directory succeeded.

Still trying to figure out where the race condition is. It's certainly possible that the first invocation via run_isolated.py looks like a cold start of the browser and subsequent invocations from the local disk look like warm start, masking the race. The failure happens reliably on the GPU try servers and has been seen on all platforms now.

Comment 13 by wfh@chromium.org, Jul 23, 2014
Cc: scot...@chromium.org
+scottmg purely because aslr was mentioned.
Comment 14 by kbr@chromium.org, Jul 23, 2014
The problem is 99% certainly a code change in Chromium or subordinate workspaces.

My telemetry_gpu_test isolate build at Chromium r284848 / Blink r178681 reliably hangs.

My telemetry_gpu_test isolate build at Chromium r281891 / Blink r177644 works. (Ran 5 times.)

Given this the problem can be bisected. Will work on this tomorrow.

I had no success reproducing the hangs using isolates from http://build.chromium.org/p/chromium.gpu/builders/GPU%20Win%20Builder . They always ran successfully. I don't know what's the essential difference between builds coming from there and from http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu . Also, as mentioned before, was not able to reproduce the hang using Chromium binaries from the continuous build archive at https://commondatastorage.googleapis.com/chromium-browser-continuous/index.html .

Comment 15 by kbr@chromium.org, Jul 23, 2014
Note that in both of the builds above I removed the dcheck_always_on=1 and blink_asserts_always_on=1 GYP_DEFINES. They are Release mode builds without asserts.

Labels: Infra Build-CommitQueue
Project Member Comment 17 by bugdroid1@chromium.org, Jul 23, 2014
------------------------------------------------------------------
r284937 | kbr@chromium.org | 2014-07-23T16:06:29.108378Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/recipes/gpu/build_and_test.expected/win_debug_tryserver.json?r1=284937&r2=284936&pathrev=284937
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/recipe_modules/gpu/api.py?r1=284937&r2=284936&pathrev=284937
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/recipes/gpu/build_and_test.expected/win_release_tryserver.json?r1=284937&r2=284936&pathrev=284937

Temporarily disable maps_pixel_test on win_gpu_triggered_tests.

Causing CQ backlogs while root cause is being investigated.

BUG= 395914 

Review URL: https://codereview.chromium.org/413753002
-----------------------------------------------------------------
Comment 18 by kbr@chromium.org, Jul 23, 2014
Issue 396273 has been merged into this issue.
Comment 19 by kbr@chromium.org, Jul 23, 2014
Cc: vmp...@chromium.org
Comment 20 by kbr@chromium.org, Jul 23, 2014
Status: Started
Project Member Comment 21 by bugdroid1@chromium.org, Jul 23, 2014
------------------------------------------------------------------
r285048 | kbr@chromium.org | 2014-07-23T21:40:41.145420Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/gatekeeper.json?r1=285048&r2=285047&pathrev=285048

Temporarily ignore maps_pixel_test failures while investigating root cause of bug.

BUG= 395914 

Review URL: https://codereview.chromium.org/418553004
-----------------------------------------------------------------
Comment 22 by kbr@chromium.org, Jul 23, 2014
Cc: mkwst@chromium.org simonhatch@chromium.org
The flakiness was introduced in this Blink roll:

commit a6283237b69e4eedd3c95988883140113e3b30fd
Author: chrome-admin@google.com <chrome-admin@google.com@0039d316-1c4b-4281-b951-d872f2087c98>
Date:   Tue Jul 22 18:37:12 2014 +0000

    Update .DEPS.git
    
    git-svn-id: svn://svn.chromium.org/chrome/trunk/src@284740 0039d316-1c4b-4281-b951-d872f2087c98


Continuing to bisect into that roll.

Comment 23 by kbr@chromium.org, Jul 23, 2014
Sorry, to be clearer:

commit 6be076326b470b9873732cdaa90a78fc98f497e0
Author: blink-deps-roller@chromium.org <blink-deps-roller@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date:   Tue Jul 22 18:36:27 2014 +0000

    Blink roll 178550:178666
    
    http://build.chromium.org/f/chromium/perf/dashboard/ui/changelog_blink.html?url=/trunk&range=178551:178666&mode=html
    TBR=johnme@chromium.org,pkasting@chromium.org
    BUG=
    
    Review URL: https://codereview.chromium.org/407143002
    
    git-svn-id: svn://svn.chromium.org/chrome/trunk/src@284738 0039d316-1c4b-4281-b951-d872f2087c98

Comment 24 by kbr@chromium.org, Jul 24, 2014
Cc: haraken@chromium.org
After a tedious manual bisect the root cause was:

commit 5d7c2576351c3333f74065a5c92f150abcacc173
Author: mkwst@chromium.org <mkwst@chromium.org@bbb929c8-8fbe-4397-9dbb-9b2b20218538>
Date:   Mon Jul 21 17:42:46 2014 +0000

    Teach ContentSecurityPolicy about WebURLRequest::RequestContext.
    
    Rather than using `Resource::Type` to make decisions about what CSP
    directive to apply to a resource request, we should be using
    WebURLRequest::RequestContext. This patch teaches ContentSecurityPolicy
    to correctly compare a URL against the right policy directive when given
    a 'ResourceContext', and pipes the value through
    'ResourceFetcher::canRequest' to make sure its available when it needs
    to be.
    
    BUG= 390497 
    
    Review URL: https://codereview.chromium.org/398313002
    
    git-svn-id: svn://svn.chromium.org/blink/trunk@178571 bbb929c8-8fbe-4397-9dbb-9b2b20218538


This CL was already discovered as the cause of Issue 396582 and has already been reverted on Blink top of tree.

Comment 25 by kbr@chromium.org, Jul 24, 2014
Blocking: chromium:390497
Comment 26 by tonyg@chromium.org, Jul 24, 2014
Do we understand why the symptoms showed up in this way? I ask because I'm interested whether there's anything Telemetry could have done differently to make this diagnosable without a bisect.
Comment 27 by kbr@chromium.org, Jul 24, 2014
No, I don't personally understand. The side-effect looked like a race condition in navigation, but studying https://codereview.chromium.org/398313002 it doesn't look to me like the flow of control changed.

Comment 28 by mkwst@chromium.org, Jul 24, 2014
Apologies for the breakage. A race-condition in navigation is a very unexpected side-effect of the change that was reverted; it's not clear to me why that would happen, but I'll investigate today.
Comment 30 by kbr@chromium.org, Jul 24, 2014
Mergedinto: 396582
Status: Duplicate
The flakiness has been addressed and maps_pixel_test has been re-enabled.

Closing as duplicate of Issue 396582 under which the CL was reverted.

Sign in to add a comment