Issue 865025

Starred by 14 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 14
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug-Regression

Blocked on:
issue 887514




Application freezing, gpu process crash on Veyron

Reported by josh@arreya.com, Jul 18

Issue description

UserAgent: Mozilla/5.0 (X11; CrOS x86_64 10863.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3480.0 Safari/537.36
Platform: 10575.58.0 (Official Build) stable-channel veyron_fievel

Example URL:

Steps to reproduce the problem:
1. 
2. 
3. 

What is the expected behavior?

What went wrong?
After 20-30 minutes of running a kiosk application, the application will freeze and no interaction is possible.

Multiple times in the GPU log you will see:
: The GPU process hung. Terminating after 10000 ms.
GpuProcessHostUIShim: The GPU process crashed!

Attached device log captures the crash and 'GPU soft-reset' just prior to the crash report.
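For anyone triaging similar logs, the two messages quoted above can be counted with a short script. This is only a sketch: the function name `count_gpu_events` and the sample text are hypothetical, and real log lines carry timestamps and prefixes that are not reproduced here.

```python
import re
from collections import Counter

# The two message fragments quoted in the report above.
HANG_PATTERN = re.compile(r"The GPU process hung\. Terminating after (\d+) ms\.")
CRASH_PATTERN = re.compile(r"GpuProcessHostUIShim: The GPU process crashed!")


def count_gpu_events(log_text: str) -> Counter:
    """Count GPU hang and crash messages in a block of log text."""
    counts = Counter(hang=0, crash=0)
    for line in log_text.splitlines():
        if HANG_PATTERN.search(line):
            counts["hang"] += 1
        if CRASH_PATTERN.search(line):
            counts["crash"] += 1
    return counts


# Hypothetical sample; it is not taken from the attached logs.
SAMPLE_LOG = """\
: The GPU process hung. Terminating after 10000 ms.
GpuProcessHostUIShim: The GPU process crashed!
unrelated line
: The GPU process hung. Terminating after 10000 ms.
"""

if __name__ == "__main__":
    print(dict(count_gpu_events(SAMPLE_LOG)))
```

Running the same function over an extracted device log would give a rough count of how often the watchdog fired before each crash report.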

Did this work before? Yes 

Is it a problem with Flash or HTML5? N/A

Does this work in other browsers? N/A

Chrome version: 67.0.3396.99  Channel: stable
OS Version: 10575.58.0
Flash Version: 

Contents of chrome://gpu: 
Attached as veyron-gpu.txt
 
veyron-gpu.txt
35.4 KB View Download
aopen-cb-ini-gpu-crash.tgz
1.1 MB Download
Cc: posciak@chromium.org
Can you provide some details on what the kiosk application is doing? That will help us route this bug better.
The kiosk application is playing back a video and a slideshow of still images. The application is basically a webview displaying the URL below, so I would guess that the issue is reproducible in desktop mode as well. 

We're working on a minimum repro. I can see from remote devtools that our 1000ms interval in the background javascript continues to fire, but the video is frozen and interaction isn't possible. 

https://gcmhfoundation.arreya.com/
Does chrome://crashes on this device have any ids?
We have been able to reproduce this on devices with the Rockchip CPU - Chromebit, Chromebox Mini and Chromebase Mini.  The issue affects multiple clients, sites, and content layouts.  Video playback seems to be part of the trigger for the issue - when we remove video elements from content we have not been able to replicate the issue.  We have one report that the issue was first seen a few weeks ago, and several reports were filed this week.

What went wrong?
  -After about 30 minutes devices will visually freeze but continue to run in the background.

  -Visually the content is frozen. The clock is stuck. Slideshows are stuck. Videos are stuck. Touch appears unresponsive due to the visual layout being stuck, but we are able to tell that navigation is still working via devtools inspection/dom elements and the cursor state on screen.

  -Inspecting via remote devtools shows that the javascript continues to run and update the proper dom elements with no visual change on the display. The clock appears stuck even though the dom element is updated in the Elements tab of devtools.  Slideshows continue to run and update dom elements, but nothing updates visually.

  -Remote devtools is unable to display the page preview image during this 'frozen' state.
  
  -Using the Admin Console we are no longer able to issue screenshot commands when the display is frozen.  

  -Admin console will report an error for the reboot command but eventually reboot.  When running normally all Admin Console commands are successful.

  -Occasionally, when interacting with the content in this state, the entire screen will go black and may or may not display a mouse cursor; after this the device may fully crash and reboot.

We have several devices set up for testing here including debug mode and one customer device that reliably reproduces the issue.  Please let me know if you have any steps you would like us to follow.

Here are recent crash IDs from a Chromebox Mini (logs attached) -
  b8624f34c2f9b3af
  ec3497d89a61a6d1
  2c162418af292b11
  1c69d12dcd92db2d
  6d86301103ae3c2a
  726b7aa1015fd4ac
  b4dc0832442df92a
  608411f4954ddb6e
  6310d8f33992ecad
  79c881444db2b25a
  b11d7daa29e5050c
  4c2138d66d0d88bb
  209ba4cb1d4e2221


logs_20180719-1127.zip
447 KB Download
Cc: acourbot@chromium.org
Components: OS>Kernel>Video
Looks like the V4L2 driver is hanging during Destroy:

https://crash.corp.google.com/browse?stbtiq=ec3497d89a61a6d1

+some other CrOS owners.
Owner: dstaessens@chromium.org
Status: Assigned (was: Unconfirmed)
This appears to be a hang in the GPU driver while waiting for an EGL sync to complete, and is probably issue 845645.
Verified the issue still occurs on beta and dev channel.

Google Chrome Version   
68.0.3440.70
Platform Version        
10718.58.0 (Official Build) beta-channel veyron_fievel
Firmware Version        
Google_Veyron_Fievel.6588.237.0

Google Chrome Version   
69.0.3494.0
Platform Version        
10888.0.0 (Official Build) dev-channel veyron_fievel
Firmware Version        
Google_Veyron_Fievel.6588.237.0
mini_beta_channel_crash.tgz
2.3 MB Download
mini_dev_channel_crash.tgz
2.6 MB Download
Cc: ryutas@chromium.org
Labels: Hotlist-Enterprise
dstaessens@
Is this case related to crbug.com/862409 as well?
Some of the crash reports point to crbug.com/862409.

crbug.com/862409 doesn't look like the issue I'm currently looking at, but I'm not very familiar with this code yet...

Does any video playback trigger the issue, or do I have to use a specific codec/video/...? I'm currently trying to reproduce on a RK3399 using crosvideo.appspot.com, but have yet to see the issue. Thanks!
The example content all uses H.264 MP4 video, cached via fetch, then stored in and served from IndexedDB as a blob.  I do believe video plays a role, as I have not seen any crashes on client devices since removing the videos.  Scenarios involve digital signage with simple looping videos, as well as interactive content where videos hide/show/destroy as content is changed. For example, the video involved in the initial report/example link is a looping background video with several elements on top.


  We discovered that you can trigger a full crash/reboot on the Chrome device once it is frozen by turning the display off and then back on again.  After power cycling the display and waiting a few minutes the device reboots and returns to the content functioning.  Without the display power cycle, the device will continue to show the frozen content.

  Attached is a log from a device that performed this type of crash twice this morning.  One was from freezing yesterday afternoon and the monitor being power cycled at night/in the morning.  The other was around 10:30AM CST, where we confirmed that the power cycle would cause it to reboot when frozen.





debug-logs_20180726-104543.tgz
715 KB Download
Test link moved to https://rkissuetest.arreya.com

Running this content on a Rockchip based device will eventually cause the GPU to hang / visually freeze the display.  Confirmed on 3 different models using the Rockchip CPU.  To confirm, we ran the same test on an Intel device with no issues.
Any news on this issue? We are experiencing the same issues from os64 on up to os69.
Sorry for the delay. Had a possible fix ready but needed some more work. Should be fine now once it gets through review. http://crrev.com/c/1133614
Any updates on the possible fix for this issue?
Sorry for the delay, review has been going back and forth between different approaches. Hope it gets sorted soon.

Made a few attempts to reproduce the issue but haven't seen any freezes so far.
Possibly related to issue #873750

Same device family, similar description.  Screen visually hangs, cursor state updates (hover), but DOM/rendering does not respond, black screen after HDMI unplug/plug in.
We still experience this issue running signage in both Chrome Sign Builder & Signagelive. It doesn't seem to be specific to just Google software.
Submitted http://crrev.com/c/1133614, let me know if this fixes the issue!
We're still able to reproduce this issue on 70.0.3538.0 on Veyron. If I'm not mistaken, the above fix should be in that version since it landed in 70.0.3535.0.

The issue is still easy to reproduce for us in kiosk mode with our kiosk application, we can provide more detailed instructions to reproduce the issue if needed.
canary.crash.4.tgz
756 KB Download
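The version reasoning in the comment above (a build at 70.0.3538.0 should contain a change that landed in 70.0.3535.0) comes down to a numeric comparison of the dotted version components. A minimal sketch follows; the helper names `parse_version` and `includes_change` are hypothetical, and the check assumes both versions sit on the same line of development, so it does not account for merges to release branches.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted Chrome version such as '70.0.3538.0' into integers."""
    return tuple(int(part) for part in version.split("."))


def includes_change(build: str, landed_in: str) -> bool:
    """Return True if `build` is at or after the version a change landed in.

    Assumption: both versions are on the same line of development, so a
    higher version number means a later build. Branch merges are not modelled.
    """
    # Compare as integer tuples; plain string comparison would order
    # components such as '99' and '100' incorrectly.
    return parse_version(build) >= parse_version(landed_in)


if __name__ == "__main__":
    print(includes_change("70.0.3538.0", "70.0.3535.0"))
    print(includes_change("67.0.3396.99", "70.0.3535.0"))
```

A passing check only says the build is recent enough to contain the change; it says nothing about whether the change addresses the crash being observed.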
I also noticed crrev.com/c/1195225 with similar changes that hasn't been merged, could this affect my testing on 70.0.3538.0 since it's not included?
Additional log and some crash IDs.

f409e6df76674655
7e8729d9218a67bd
33a0fda9306a181c
f1f44b5462e8b7f6
d8ed5b36b3fb6d20
canary.crash.5.tgz
981 KB Download
Cc: conradlo@chromium.org
+conradlo for additional visibility.
The issue is still repeatable with 'Hardware-accelerated video decode' disabled. The log attached to comment #19 is with it enabled and the log in comment #21 is with the flag disabled.
Cc: zmo@chromium.org
Those crashes are all issue 738907 +zmo who closed that one as WontFix due to GPU driver instability -- which hopefully wouldn't be a problem on CrOS.
crrev.com/c/1195225 isn't relevant for Veyron.

The crash ids listed above seem to be related to a different issue than the one I fixed. The issue I tackled should fix ec3497d89a61a6d1 and 726b7aa1015fd4ac.
Hmm, these crash signatures seem suspiciously like the other end of issue 845645. 
These last crash IDs seem to occur in gpu::gles2::GLES2DecoderImpl. Why do you think these are linked to issue 845645?

Do you think this is a separate issue from the V4L2SliceVideoDecodeAccelerator::Destroy hang, and if so, who would be the owner of this code?
These crashes seem to be hanging inside an EGL sync, and on the other issue you commented: "This seems to be caused by the decoder thread waiting for an EGL sync that will never come, when queuing an output buffer."

So naively they seem like they could be related. I'm just the peanut gallery who noticed some similar terms while triaging though, so I defer to your expert judgement. 
Can we get an update on this issue?  What are the next steps?  Any testing we can do or potential workarounds?
Fixing issue 845645 seems to have just moved the problem. The real problem might be related to issue 705957. Some calls to "glDeleteFramebuffersEXT" block until the watchdog kicks in.
Cc: marc...@chromium.org hoegsberg@chromium.org
Added marcheu@ and hoegsberg@ from the Chrome OS Graphics team.
Cc: marcuskoehler@chromium.org
Blockedon: 887514
Cc: snambiar@chromium.org jayhlee@chromium.org
Issue 873750 has been merged into this issue.
@arreya, can you try testing the new canary build?  There were some changes made, and we were unable to reproduce the issue.  Chrome version 71 (11124.0.0)
Thanks for the update.  Testing it now and will report back.
Labels: ReleaseBlock-Stable
Labels: M-70
Chrome version 71 (11124.0.0), kiosk mode (managed)

https://rkissuetest.arreya.com crashed in about an hour, contains video
https://rkissuetest2.arreya.com has not crashed so far, similar to the Canary testing we performed on Oct 2.  This test does not contain video, just one image and an embedded Google Slides presentation.  It crashes on stable in a couple of minutes.

I'm not sure if these comments from the email chain were relayed or included in the private issue.

"After further testing on the new repro (https://rkissuetest2.arreya.com/) I think it may be a different path to the same result (crashed renderer/blank screen).  It looks like this one is failing 100% on stable, but does not repro on Canary.  The older repro (https://rkissuetest.arreya.com/) continues to produce the issue on Canary."

Some more info here (desktop restarts renderer and iframe goes black on rkissuetest2, kiosk crashes and stays black) - https://bugs.chromium.org/p/chromium/issues/detail?id=879081#c7

Chrome Sign Builder policies attached

rkissuetest_signbuilder
454 bytes View Download
rkissuetest2_signbuilder
455 bytes View Download
Here is the log from the above crash on latest canary (11124.0.0)
veyron_r71_11124.0.0.tgz
2.8 MB Download
Are there any additional steps we can perform to report more relevant information from the crashes?  Happy to run commands and report back the results.  If it helps we can arrange a remote ssh session before/after the crash.
@bigo, this issue was discovered in M69. Can you add a rationale for the M70 RBS label? Can we take this fix in M71 instead, since M70 is nearing Stable checkpoints? Thanks.
Labels: -ReleaseBlock-Stable
Removing RBS label. Please feel free to add it back with justification if you feel differently.
Apologies for the slow progress, this issue has been very hard to reproduce and track down. So far we've been able to create a somewhat reliable repro.

It seems that the GPU is not getting enough power; increasing the voltage slightly seems to fix the issue. We're currently working on determining the exact values required, but hopefully we should be able to provide a fix soon.
Status: Fixed (was: Assigned)
Fix is in the latest build of M72 and M71, will be merged to M70 soon: https://crrev.com/c/1334450
@dstaessens

Thank you all for the hard work. We have been unable to reproduce the issue on 72 since the fix was merged. Do you know the version number we should look for in stable that has the fix?
