GPU process hang and crash on guado during video conference (stuck on bsd ring)
Issue description

We have received many bug reports about GPU process hangs and crashes on guado. The common factor is a "stuck on bsd ring" error message in the kernel log. Affected customers hit this issue regularly (maybe 2 or 3 times per day), but unfortunately there are no concrete or quick steps to reproduce it yet.

Observations so far (compiled from bug reports; I haven't reproduced the issue myself):

Device: all reports are on guado.
Version: the earliest report is on R59. It has also been reported on 60, 62, and 63, and reports are still coming in. The reports have been arriving since July, so this may be a regression, but that is not certain.
Possible workaround: one report says that disabling hardware acceleration works around the issue (ref: crbug.com/776072).

Symptom: during multi-user video conferences (some users are on Hangouts, some on faceme.com or other private apps):
1. The video works well for the first several minutes.
2. Before the crash, video quality drops, green lines or thick horizontal lines appear, and then the video hangs. Audio continues.
3. The device crashes completely (black screen, no UI).

/var/log/messages shows:
2017-11-02T12:47:25.536559+13:00 INFO kernel: [ 646.041518] [drm] stuck on bsd ring
2017-11-02T12:47:25.536577+13:00 INFO kernel: [ 646.041528] [drm] GPU crash dump saved to /sys/class/drm/card0/error
2017-11-02T12:47:25.536579+13:00 INFO kernel: [ 646.041536] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
2017-11-02T12:47:25.536582+13:00 INFO kernel: [ 646.041546] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
2017-11-02T12:47:25.536584+13:00 INFO kernel: [ 646.041558] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
2017-11-02T12:47:25.536586+13:00 INFO kernel: [ 646.041570] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
2017-11-02T12:47:25.553929+13:00 INFO crash_reporter[3572]: Consent given - collect udev crash info.
The GPU crash dump is attached (there are more in crbug.com/779190).

Plan: the current plan is to disable hardware acceleration on BDW first.
https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/799531
BDW does 1080p VP8 decode at ~23% CPU with 0 dropped frames, so disabling it should not be that impactful.

Additional info:
guado uses an old kernel (3.14), and we know there are some drm/i915 fixes in the 4.x kernels. https://bugs.freedesktop.org/show_bug.cgi?id=97396 Not sure whether this is relevant.
One customer says "all of these video meetings included a participant from iPad Pro" (crbug.com/779190). Not sure whether this is relevant.
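For anyone triaging similar reports, the hang signature in the log excerpt above can be picked out of /var/log/messages with a small script. This is only a sketch: it assumes the log lines follow the same layout as the excerpt in this report (timestamp, level, "kernel:", uptime in brackets, "[drm] stuck on <name> ring"), and the names HANG_RE, find_gpu_hangs, and hangs_per_day are made up for illustration rather than taken from any existing tool.

```python
import re
from collections import Counter

# Assumed line layout, copied from the excerpt in this report:
#   2017-11-02T12:47:25.536559+13:00 INFO kernel: [ 646.041518] [drm] stuck on bsd ring
HANG_RE = re.compile(
    r"^(?P<ts>\S+)\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"\[drm\]\s+stuck on (?P<ring>.+?) ring\s*$"
)


def find_gpu_hangs(lines):
    """Return (timestamp, uptime_seconds, ring_name) for each hang line."""
    hangs = []
    for line in lines:
        match = HANG_RE.match(line)
        if match:
            hangs.append(
                (match.group("ts"), float(match.group("uptime")), match.group("ring"))
            )
    return hangs


def hangs_per_day(hangs):
    """Count hangs per calendar day, using the date part of the timestamp."""
    return Counter(ts[:10] for ts, _uptime, _ring in hangs)


# Example use on a device (path taken from the report):
#   with open("/var/log/messages", errors="replace") as log:
#       print(hangs_per_day(find_gpu_hangs(log)))
```

Counting hangs per day makes it easier to compare a device against the "2 or 3 times per day" frequency that customers describe. The crash dump itself still has to be collected separately from /sys/class/drm/card0/error, as the kernel message says.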
,
Nov 30 2017
,
Nov 30 2017
,
Nov 30 2017
,
Nov 30 2017
,
Nov 30 2017
> Current plan is to disable hardware acceleration on BDW first.
AFAICT SW fallback for the encoder and decoder kicks in and video recovers; see https://bugs.chromium.org/p/chromium/issues/detail?id=779190#c29 for instance. SW fallback has the same effect as disabling HW.
,
Nov 30 2017
crbug.com/779190 was updated with customer feedback in crbug.com/779190#c32, explaining that the video doesn't recover and Chrome eventually crashes. FYI.
,
Nov 30 2017
@6: I don't think there is a SW fallback path. Instead, the theory is that things get "fixed" after the kernel resets the GPU... until you hit the problem again, of course.
,
Nov 30 2017
,
Dec 6 2017
Ping - any updates on this?
,
Dec 6 2017
,
Dec 6 2017
,
Dec 6 2017
,
Dec 6 2017
Are we targeting M-64 with a fix for this problem?
,
Dec 7 2017
HW decode acceleration is now disabled on BDW devices starting from M64, which should prevent the hang/crash issues.
,
Dec 7 2017
Thanks, that's great news. I will ask affected customers if they can see the issue resolved in Beta channel once Beta reaches M64. Stable M64 is still ~2 months away. Is it possible to disable it in M63 too? We have multiple customers affected by this issue, both Cfm and other video conference solutions.
,
Dec 7 2017
,
Dec 7 2017
Yes, the CL to disable should be safe to merge.
,
Dec 7 2017
The CLs that would need to be merged are:
- https://chromium-review.googlesource.com/799531
- https://chromium-review.googlesource.com/799515
- https://chromium-review.googlesource.com/804922
Thank you.
,
Dec 7 2017
This bug requires manual review: Request affecting a post-stable build.
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), gkihumba@(ChromeOS), govind@(Desktop)
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Dec 7 2017
Daniel, could you take a look?
,
Dec 11 2017
,
Dec 11 2017
,
Dec 13 2017
Issue 775789 has been merged into this issue.
,
Dec 14 2017
gkihumba@, what are your thoughts on merging this to M63? AFAIK guado is one of the few devices that still use kernel version 3.14, so the general impact should be low. The CLs that would need approval are:
1. https://chromium-review.googlesource.com/799531
2. https://chromium-review.googlesource.com/799515
3. https://chromium-review.googlesource.com/804922
,
Dec 14 2017
We currently have ~18 devices on the 3.14 kernel, and the M63 stable rollout starts today. So I am rejecting the merge, as this is a firmware change that comes quite late and affects multiple devices.
,
Dec 14 2017
The fix has already been merged into M64, so it seems there is nothing else to do on this issue. I'm closing this bug as fixed. If we run into this crash again, feel free to reopen.
,
Dec 15 2017
,
Jan 27 2018
Issue 794667 has been merged into this issue.
Comment 1 by kcwu@chromium.org
, Nov 30 2017