Issue 859705

Starred by 2 users

Issue metadata

Status: Verified
Owner: ethannicholas@chromium.org
Closed: Dec 14
Cc:
Components: Internals>GPU>Testing
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug-Regression



gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_only_one_workaround in gpu_process_launch_tests failing on chromium.gpu.fyi/Android FYI Release (Nexus 9)

Reported by sheriff-o-matic@appspot.gserviceaccount.com (Project Member), Jul 2

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of sunnyps@chromium.org

gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_only_one_workaround in gpu_process_launch_tests failing on chromium.gpu.fyi/Android FYI Release (Nexus 9)

Builders failed on: 
- Android FYI Release (Nexus 9): 
  https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Android%20FYI%20Release%20%28Nexus%209%29

Looks like a telemetry failure:
[10/15] gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_only_one_workaround failed unexpectedly 38.7628s:
...
  Traceback (most recent call last):
    _RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:132
      self.RunActualGpuTest(url, *args)
    RunActualGpuTest at content/test/gpu/gpu_tests/gpu_process_integration_test.py:102
      getattr(self, test_name)(test_path)
    _GpuProcess_only_one_workaround at content/test/gpu/gpu_tests/gpu_process_integration_test.py:322
      self._CompareAndCaptureDriverBugWorkarounds())
    _CompareAndCaptureDriverBugWorkarounds at content/test/gpu/gpu_tests/gpu_process_integration_test.py:182
      diff = set(browser_list).symmetric_difference(set(gpu_list))
  TypeError: 'NoneType' object is not iterable
  
  Locals:
    browser_list : [u'clear_uniforms_before_first_program_use', u'disable_discard_framebuffer', u'disable_framebuffer_cmaa', u'dont_disable_webgl_when_compositor_context_lost', u'force_cube_complete', u'max_msaa_sample_count_4', u'max_texture_size_limit_4096', u'pack_parameters_workaround_with_pack_buffer', u'scalarize_vec_and_mat_constructor_args', u'unpack_alignment_workaround_with_unpack_buffer', u'unpack_overlapping_rows_separately_unpack_buffer', u'use_gpu_driver_workaround_for_testing', u'use_virtualized_gl_contexts']
    gpu_list     : None
    tab          : <telemetry.internal.browser.tab.Tab object at 0x7f6abc210b10>
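
For reference, a minimal standalone sketch of the failure mode above (an illustration, not the harness code): once the gpu-side list comes back as None, set(gpu_list) is what raises the TypeError.

# Standalone illustration of the TypeError in the traceback above.
browser_list = [u'use_virtualized_gl_contexts']  # truncated example value
gpu_list = None  # what came back when the gpu-side list was unavailable
try:
  diff = set(browser_list).symmetric_difference(set(gpu_list))
except TypeError as e:
  print(e)  # "'NoneType' object is not iterable", raised by set(gpu_list)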

There's a catapult autoroll but it doesn't seem to be related:
https://chromium.googlesource.com/chromium/src/+/ebc957362cf84376f8059a42fe8ced87cb685b30
https://chromium.googlesource.com/catapult.git/+log/a14d6738e1fb..153acbd707c0

 
Components: Internals>GPU>Testing
Labels: OS-Android
Cc: kbr@chromium.org
Labels: -Pri-2 Hotlist-PixelWrangler Pri-1 Type-Bug-Regression
Owner: sunnyps@chromium.org
Status: Assigned (was: Available)
This started failing recently on this device. Here's the first failing build:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Android%20FYI%20Release%20%28Nexus%209%29/5369

but it's been failing intermittently. Studying the current state of the bot:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Android%20FYI%20Release%20%28Nexus%209%29?limit=400

The regression range might go as far back as this build:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Android%20FYI%20Release%20%28Nexus%209%29/5363

So, in other words:

http://crrev.com/4a855e2432413570b7cff4d62b9acfcf18bc1c1b..d9a2ebc6b3eab7f124f9016460520c8864d1472e

Sunny, as you're pixel wrangler this week, could you please scan through this regression range and see if there's anything obvious in it? Based on the first few errors it seems that maybe the GPU channel is gone and that's why the test fails to get the current list of GPU workarounds.

https://cs.chromium.org/chromium/src/content/renderer/gpu/gpu_benchmarking_extension.cc?type=cs&sq=package:chromium&g=0&l=1080
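
To make that hypothesis concrete: the gpu-side list appears to be queried from the renderer via the gpu_benchmarking extension at the link above, roughly like the sketch below (the JS hook name is illustrative, not verified against gpu_benchmarking_extension.cc); if the GPU channel is gone, the query yields None and the later set comparison falls over.

def _GetGpuDriverBugWorkarounds(tab):
  # Sketch only. tab is the telemetry Tab object from the test; the JS hook
  # name is illustrative -- see gpu_benchmarking_extension.cc for the real
  # binding.
  return tab.EvaluateJavaScript(
      'chrome.gpuBenchmarking.getGpuDriverBugWorkarounds()')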

Thanks.

Note: there's a slim chance this might be related to the other pixel_test failures on the same bot in Issue 858826. Not sure what would happen if the GPU process exited abruptly, or whether that could be a cause of the other failure (probably not).

There are a couple of suspect CLs in that range:
1. Turn on OOP Raster on android bots
https://chromium.googlesource.com/chromium/src/+/d82c4fabc5cc6cbe6b475186ed9fa0216980e533
2. Defer GLES2Implementation's error callbacks if needed.
https://chromium.googlesource.com/chromium/src/+/dbc3e4a9b34ac1aa5e9097fb169bf006c1e1f7b0
Would it be possible for you to build and test on a Nexus 9 locally to see if you can reproduce, and then see whether either of those is actually responsible? The log doesn't indicate that the GPU process actually crashed, so it's unclear what in those two changes could have caused the test to start failing.

I've acquired a Nexus 9, but it'll take a while to charge.

There's also a catapult autoroll:
https://chromium.googlesource.com/chromium/src/+/608c759406e8202907cc324240f1bce6aaa6d58d

But catapult changes don't seem to be related:

Roll src/third_party/catapult f76f0b44062c..d1692d4ac997 (2 commits)
https://chromium.googlesource.com/catapult.git/+log/f76f0b44062c..d1692d4ac997


git log f76f0b44062c..d1692d4ac997 --date=short --no-merges --format='%ad %ae %s'
2018-06-29 perezju@chromium.org [Telemetry] Test markers are found in trace during StartupTracingTest
2018-06-29 perezju@chromium.org [dashboard] Make returning bug comments optional
Couldn't repro on Nexus 9 with Android N. Downgrading to Android M and trying again.
Running the failing test (GpuProcess_only_one_workaround) on Android M didn't reproduce the failure. That got me thinking maybe it's the way the tests are run sequentially that causes this failure. So I tried running the isolate task locally, and I was able to reproduce the failure:

[10/15] gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_only_one_workaround failed unexpectedly 15.0275s:

  Traceback (most recent call last):
    _RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:132
      self.RunActualGpuTest(url, *args)
    RunActualGpuTest at content/test/gpu/gpu_tests/gpu_process_integration_test.py:102
      getattr(self, test_name)(test_path)
    _GpuProcess_only_one_workaround at content/test/gpu/gpu_tests/gpu_process_integration_test.py:307
      self._CompareAndCaptureDriverBugWorkarounds())
    _CompareAndCaptureDriverBugWorkarounds at content/test/gpu/gpu_tests/gpu_process_integration_test.py:176
      self.fail('No GPU channel detected')
    fail at /usr/lib/python2.7/unittest/case.py:410
      raise self.failureException(msg)
  AssertionError: No GPU channel detected
  
  Locals:
    msg : 'No GPU channel detected'


What's interesting is that the test before this one also has the same error, but is listed as passing:

[9/15] gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_gpu_info_complete passed 16.3192s

Traceback (most recent call last):
  _RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:132
    self.RunActualGpuTest(url, *args)
  RunActualGpuTest at content/test/gpu/gpu_tests/gpu_process_integration_test.py:102
    getattr(self, test_name)(test_path)
  _GpuProcess_only_one_workaround at content/test/gpu/gpu_tests/gpu_process_integration_test.py:307
    self._CompareAndCaptureDriverBugWorkarounds())
  _CompareAndCaptureDriverBugWorkarounds at content/test/gpu/gpu_tests/gpu_process_integration_test.py:176
    self.fail('No GPU channel detected')
  fail at /usr/lib/python2.7/unittest/case.py:410
    raise self.failureException(msg)
AssertionError: No GPU channel detected

Locals:
  msg : 'No GPU channel detected'
Oh, now I see that both failures are from the same failing test, _GpuProcess_only_one_workaround, and the logs are jumbled up.
tombstones.py wasn't very useful:

I    2.587s Main  Stack Trace:
I    2.587s Main    RELADDR   FUNCTION                                                                             FILE:LINE
I    2.587s Main    000000000006aaf4  <UNKNOWN>                                                                            /system/lib64/libc.so
I    2.587s Main    0000000000068284  <UNKNOWN>                                                                            /system/lib64/libc.so
I    2.587s Main    0000000000021278  <UNKNOWN>                                                                            /system/lib64/libc.so
I    2.587s Main    000000000001ba18  <UNKNOWN>                                                                            /system/lib64/libc.so
I    2.587s Main    0000000001f4b234  void base::internal::OptionalStorageBase<media::EncryptionPattern, false>::Init<>()  ??:0:0
I    2.587s Main  

Any progress on this? We should get this bot green, even if it means skipping the failing test.

I'm investigating this again. The sequence of tests leading up to the failure is this:

[8/15] gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_feature_status_under_swiftshader passed 0.0115s
(loads the chrome:gpu page and uses --gpu-blacklist-test-group=2 for browser args)
[9/15] gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_gpu_info_complete passed 16.3192s
(loads functional_3d_css.html and uses no extra browser args)
[10/15] gpu_tests.gpu_process_integration_test.GpuProcessIntegrationTest.GpuProcess_only_one_workaround failed unexpectedly 15.0275s
(loads chrome:gpu and uses no extra browser args)

So the second test launches a new browser process, and the last two tests use the same browser process. I suspect the second test (gpu_info_complete) fails (maybe due to OOP-R) but that failure isn't detected in that test since it never calls VerifyGpuProcessPresent() (like the actual CSS test case does). Trying to verify these assumptions.
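
One way to check that is to add an explicit GPU-process check at the start of the suspect test; a rough sketch of such a check is below (the body is an assumption modeled on the 'No GPU channel detected' failure seen earlier, not copied from the real helper):

def _VerifyGpuProcessPresent(self, tab):
  # Sketch only: the JS hook name is illustrative and may not match the
  # real binding in gpu_benchmarking_extension.cc.
  if not tab.EvaluateJavaScript('chrome.gpuBenchmarking.hasGpuChannel()'):
    self.fail('No GPU channel detected')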
It's highly likely this is due to OOPR. The best way to reproduce this is to disable all but these tests in gpu_process_integration_test.py:

('GpuProcess_gpu_info_complete', 'gpu/functional_3d_css.html'),
('GpuProcess_only_one_workaround', 'chrome:gpu'),
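
i.e. something along these lines in GenerateGpuTests (a sketch of the pattern only; the real list in the file is longer and the exact tuple shape may differ):

@classmethod
def GenerateGpuTests(cls, options):
  tests = (('GpuProcess_gpu_info_complete', 'gpu/functional_3d_css.html'),
           ('GpuProcess_only_one_workaround', 'chrome:gpu'))
  for test_name, url in tests:
    # Assumed: args[0] names the private method that RunActualGpuTest invokes.
    yield (test_name, url, ('_' + test_name,))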

I tried using chrome:gpu for the first test, and it makes the bug less likely, but it still happens within 5 or so attempts. AFAICT the 3D CSS page makes the crash more likely, and running only the second test doesn't trigger the crash (in my limited testing).

Disabling OOPR with --disable-features=DefaultEnableOopRasterization seems to make the problem go away (no crash in 20 or so attempts).

The crash can happen either in the _VerifyGpuProcessPresent() call (line 307), because the native code doesn't have a GPU channel, or inside the _CompareAndCaptureDriverBugWorkarounds() call (line 309), when gpu_list is None because the native code returns early. This suggests some kind of race between the GPU process shutting down and the test. Since the test does succeed sometimes, it's possible the GPU process spins back up within that small window of time.
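
One way to probe the respawn theory is to poll for the GPU channel over a short window instead of checking once; a rough diagnostic sketch (hypothetical helper, JS hook name illustrative, not a proposed fix for the test):

import time

def _WaitForGpuChannel(tab, timeout_s=5.0, poll_interval_s=0.25):
  # Returns True if the GPU channel shows up within timeout_s seconds.
  deadline = time.time() + timeout_s
  while time.time() < deadline:
    if tab.EvaluateJavaScript('chrome.gpuBenchmarking.hasGpuChannel()'):
      return True
    time.sleep(poll_interval_s)
  return False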

An important difference with OOPR disabled is that I don't see the "Chromium has stopped working" Android system errors all the time, which I do see with OOPR enabled.

The stack trace is garbage even after doing a local build with symbols and everything. I had to replicate the GN args from the bot to get this to fail (an arm32 component release build without dchecks wouldn't fail):

dcheck_always_on = true
ffmpeg_branding = "Chrome"
is_component_build = false
is_debug = false
proprietary_codecs = true
#strip_absolute_paths_from_debug_symbols = true
#strip_debug_info = true
symbol_level = 1
target_cpu = "arm64"
target_os = "android"
use_goma = true

(I commented out the debug-symbol-related args to try to get better stack traces, but it didn't work.)

Run the test like this:

content/test/gpu/run_gpu_integration_test.py gpu_process --show-stdout --browser=android-chromium --passthrough -v "--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc --disable-features=DefaultEnableOopRasterization"

I have a fail script (stolen from danakj) to repeatedly run the test until failure:

#!/bin/bash
# Repeatedly run the given command ("$@") until it exits non-zero,
# printing the attempt number each time.
COUNTER=0

until [ $? -ne 0 ]; do
    let COUNTER+=1
    echo "fail: attempt #$COUNTER"
    "$@"
done

echo "fail: failed after $COUNTER attempts"

Cc: enne@chromium.org
enne: ^ (OOPR related)
I thought running the failing page directly would help. I ran:

out/droid/bin/chrome_public_apk run https://codepen.io/anon/pen/djoQBQ

Then I switched to chrome://gpu in the running instance, and saw recurring crashes (with "Chromium has stopped working" dialogs). There's a shader compilation error and probably more. See attached logs.
Attachment: nexus9_chromegpu_crash.txt (19.1 KB)
Cc: bsalomon@chromium.org
+bsalomon
Owner: bsalomon@chromium.org
Owner: ethannicholas@chromium.org
Looks like this is CCPR/geometry shader related. Starting with Ethan, but it could just as well be in Chris's wheelhouse. (See the log in #16 for the shader compilation issue.)
Cc: piman@chromium.org
Issue 866688 has been merged into this issue.
Cc: csmartdalton@chromium.org
Can we make some progress on this? The bot's been red because of this issue for some time, and Chrome won't work correctly with OOP-R on these sorts of devices until it's fixed:
 
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Android%20FYI%20Release%20(Nexus%209)

Comment 22 by bugdroid1@chromium.org (Project Member), Jul 24

The following revision refers to this bug:
  https://skia.googlesource.com/skia/+/0b63196a7eed40388a4b7b68990b45503554b290

commit 0b63196a7eed40388a4b7b68990b45503554b290
Author: Ethan Nicholas <ethannicholas@google.com>
Date: Tue Jul 24 18:13:45 2018

fixed geometry shaders when canUseFragCoord is false

Bug:  chromium:859705 
Change-Id: Ia5c5b15bd5d12bf2d1c3265664bec2c3eaef24d2
Reviewed-on: https://skia-review.googlesource.com/143114
Commit-Queue: Brian Salomon <bsalomon@google.com>
Reviewed-by: Brian Salomon <bsalomon@google.com>

[modify] https://crrev.com/0b63196a7eed40388a4b7b68990b45503554b290/src/sksl/SkSLGLSLCodeGenerator.cpp

Comment 23 by bugdroid1@chromium.org (Project Member), Jul 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6c529359dd432040d3a15cad9bd6e2caed0a6b75

commit 6c529359dd432040d3a15cad9bd6e2caed0a6b75
Author: skia-chromium-autoroll <skia-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Tue Jul 24 20:53:44 2018

Roll src/third_party/skia 6a4e60bb8f49..642c7b758944 (3 commits)

https://skia.googlesource.com/skia.git/+log/6a4e60bb8f49..642c7b758944


git log 6a4e60bb8f49..642c7b758944 --date=short --no-merges --format='%ad %ae %s'
2018-07-24 recipe-roller@chromium.org Roll recipe dependencies (trivial).
2018-07-24 ethannicholas@google.com fixed geometry shaders when canUseFragCoord is false
2018-07-24 caryclark@skia.org handle failing pathop tests


Created with:
  gclient setdep -r src/third_party/skia@642c7b758944

The AutoRoll server is located here: https://autoroll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.

CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel;luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel

BUG= chromium:859705 
TBR=benjaminwagner@chromium.org

Change-Id: Id2d15a495b85a2c515b0c84b3ca1cd2993fff570
Reviewed-on: https://chromium-review.googlesource.com/1147915
Reviewed-by: skia-chromium-autoroll <skia-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: skia-chromium-autoroll <skia-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#577674}
[modify] https://crrev.com/6c529359dd432040d3a15cad9bd6e2caed0a6b75/DEPS

[GPU Triage Council]

This is in our P1 list -- is this fixed by the above patches?
Status: Verified (was: Assigned)
Yes, this was fixed by the Skia CL above, and the test has also been renamed/redone since then.
