
Corrupted graphics on login screen

Reported by timmyisc...@gmail.com, Oct 20 2017

Issue description

UserAgent: Mozilla/5.0 (X11; CrOS x86_64 9765.81.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.120 Safari/537.36
Platform: 9765.81.0 (Official Build) stable-channel cyan

Steps to reproduce the problem:
1. Update to the latest version of chrome (works fine)
2. Reboot the device

What is the expected behavior?
A log in screen, showing all google users connected to that chromebook, where you can log into a specific account.

What went wrong?
Installing the latest update on my Acer Chromebook R11 worked fine, and after the update everything was flawless. I shut down the Chromebook and later rebooted it, only to find the login screen rendered as a set of HUGE blurry pixels. I could click in the general area of the button to log into my account and type in my password; that all worked fine, and afterwards everything looked normal. But I hope this problem will be fixed soon!

Did this work before? Yes n/a (it was the previous version of chromeOS, that came out before: 61.0.3163.120)

Chrome version: 61.0.3163.120  Channel: stable
OS Version: 9765.81.0
Flash Version: 27.0.0.130 

Before the update, it worked fine. Now, it doesn't... :(
 
Showing comments 57 - 156 of 156 Older

Comment 57 by wutao@chromium.org, Nov 10 2017

Not sure how the canary build affects this bug. Even with a canary signed MP image, I cannot repro so far on 64.0.3262.2/10112.0.0.


Comment 58 by spin...@gmail.com, Nov 11 2017

wutao@ I got the update to M62 today and can confirm that the graphic glitches on the start screen are still there, with the same GPU hang in the message log.

Platform 9901.66.0 (Official Build) stable-channel cyan
Firmware Google_Cyan.7287.57.125
ARC-Version 4421464
Kernel 3.18.0-16036-g30f3f9ed6ff1


Comment 59 by spin...@gmail.com, Nov 11 2017

There is an older, closed bug report on freedesktop.org with the same GPU HANG ecode on a Google Cyan device with Chrome OS. In that case, frequent GPU hangs during boot sometimes caused kernel crashes. These crashes were fixed by some changes in the kernel and mesa, but the GPU hangs were not investigated (there was also no GPU crash dump available):

https://bugs.freedesktop.org/show_bug.cgi?id=92545


In /var/log/messages, after the GPU hang the crash_reporter process says "collect udev crash info", but no entry is available in chrome://crashes afterwards. As already mentioned in #21, the given GPU crash dump file path is not accessible in non-dev mode, and in dev mode I cannot reproduce the issue :-( As the GPU hang and reset seem to happen only once during boot, maybe in dev mode this already happens on the Recovery screen, where you have to press Ctrl+D to boot?

2017-11-11T17:52:02.704231+01:00 INFO kernel: [    6.740442] [drm] stuck on render ring
2017-11-11T17:52:02.723251+01:00 INFO kernel: [    6.759769] [drm] GPU HANG: ecode 8:0:0x2efe5dbc, reason: Ring hung, action: reset
2017-11-11T17:52:02.723278+01:00 INFO kernel: [    6.759783] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
2017-11-11T17:52:02.723282+01:00 INFO kernel: [    6.759794] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
2017-11-11T17:52:02.723284+01:00 INFO kernel: [    6.759805] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
2017-11-11T17:52:02.723286+01:00 INFO kernel: [    6.759816] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
2017-11-11T17:52:02.723288+01:00 INFO kernel: [    6.759828] [drm] GPU crash dump saved to /sys/class/drm/card0/error
2017-11-11T17:52:02.726113+01:00 NOTICE kernel: [    6.762083] drm/i915: Resetting chip after gpu hang
2017-11-11T17:52:02.762693+01:00 INFO crash_reporter[1688]: Consent given - collect udev crash info.
2017-11-11T17:52:03.030201+01:00 WARNING kernel: [    7.066834] frecon(298): Chrome started, our work is done, exiting.
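
For anyone who wants to count how often this hang shows up across boots, here is a minimal sketch that scans lines from a saved copy of /var/log/messages. The regular expression is based only on the log lines quoted above, and the helper name `find_gpu_hangs` is made up for illustration; other kernel versions may format the message differently.

```python
import re

# Pattern derived only from the "[drm] GPU HANG" lines quoted in this thread.
GPU_HANG_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+\S+\s+kernel:\s+\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"\[drm\] GPU HANG: ecode (?P<ecode>\S+), reason: (?P<reason>[^,]+), "
    r"action: (?P<action>\S+)"
)


def find_gpu_hangs(lines):
    """Return one dict per GPU HANG line (hypothetical helper)."""
    hangs = []
    for line in lines:
        match = GPU_HANG_RE.match(line)
        if match:
            info = match.groupdict()
            # Seconds since kernel boot, taken from the bracketed field.
            info["uptime"] = float(info["uptime"])
            hangs.append(info)
    return hangs


sample = [
    "2017-11-11T17:52:02.704231+01:00 INFO kernel: [    6.740442] [drm] stuck on render ring",
    "2017-11-11T17:52:02.723251+01:00 INFO kernel: [    6.759769] [drm] GPU HANG: ecode 8:0:0x2efe5dbc, reason: Ring hung, action: reset",
]
hangs = find_gpu_hangs(sample)
for hang in hangs:
    print(hang["timestamp"], hang["ecode"], hang["reason"], hang["action"])
```

Feeding it the lines of a real log (for example `open(path).readlines()`) should give one entry per boot in which the hang occurred, which makes it easier to compare hang frequency between builds.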

Comment 60 by spin...@gmail.com, Nov 11 2017

wutao@ I have switched my device to dev channel M64. Within 30 boots I could not reproduce the graphic glitches on the start screen.

Platform 10106.0.0 (Official Build) dev-channel cyan
Firmware Google_Cyan.7287.57.137
ARC-Version 4435649


According to /var/log/messages, the GPU hangs also happen sometimes with M64, but they do not cause rendering glitches on the start screen. Maybe you can also find the GPU hangs in your log?


2017-11-11T19:20:40.811040+01:00 NOTICE kernel: [    0.000000] Linux version 3.18.0-16275-g5084500d59bb (chrome-bot@cros-beefy449-c2) (gcc version 4.9.x 20150123 (prerelease) (4.9.2_cos_gg_4.9.2-r170-0c5a656a1322e137fa4a251f2ccc6c4022918c0a_4.9.2-r170) ) #1 SMP PREEMPT Tue Nov 7 03:13:54 PST 2017
[.....]
2017-11-11T19:20:40.811100+01:00 DEBUG kernel: [    0.000000] DMI: GOOGLE Cyan, BIOS Google_Cyan.7287.57.137 08/20/2017
[.....]
2017-11-11T19:20:44.240051+01:00 DEBUG kernel: [    6.276124] SELinux: initialized (dev proc, type proc), uses genfs_contexts
2017-11-11T19:20:44.708560+01:00 INFO kernel: [    6.744408] [drm] stuck on render ring
2017-11-11T19:20:44.728226+01:00 INFO kernel: [    6.764011] [drm] GPU HANG: ecode 8:0:0x2efe5dbc, reason: Ring hung, action: reset
2017-11-11T19:20:44.728279+01:00 INFO kernel: [    6.764026] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
2017-11-11T19:20:44.728284+01:00 INFO kernel: [    6.764037] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
2017-11-11T19:20:44.728287+01:00 INFO kernel: [    6.764048] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
2017-11-11T19:20:44.728290+01:00 INFO kernel: [    6.764060] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
2017-11-11T19:20:44.728293+01:00 INFO kernel: [    6.764071] [drm] GPU crash dump saved to /sys/class/drm/card0/error
2017-11-11T19:20:44.730089+01:00 NOTICE kernel: [    6.766298] drm/i915: Resetting chip after gpu hang
2017-11-11T19:20:44.746062+01:00 INFO crash_reporter[1705]: libminijail[1705]: mount /dev/log -> /dev/log type ''
2017-11-11T19:20:44.746098+01:00 INFO crash_reporter[1705]: libminijail[1705]: mount /dev/pstore -> /dev/pstore type ''
2017-11-11T19:20:44.749069+01:00 DEBUG kernel: [    6.785584] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2017-11-11T19:20:44.765081+01:00 DEBUG kernel: [    6.801157] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2017-11-11T19:20:44.770190+01:00 INFO crash_reporter[1705]: Consent given - collect udev crash info.


Comment 61 by wutao@chromium.org, Nov 13 2017

Owner: marc...@chromium.org
marcheu@, Please see comments #59 and #60.
Could you please take a look and find a good owner for this bug. Thanks.

Comment 62 by wutao@chromium.org, Nov 13 2017

#58, spinnau@, how often did you get the graphic glitches with M62?
With M62 stable, I only got them twice, much less frequently than with M61 stable.

do you have an i915_error_state we could look at?

Comment 64 by wutao@chromium.org, Nov 13 2017

marcheu@, I cannot recover a readable i915_error_state from the system log in the report, which only contains a base64-encoded i915_error_state.

Maybe you can download one from my testing device [1] or search report submitted by spinnau@ [2]?
[1] https://feedback.corp.google.com/product/208/neutron?lView=rd&lRSort=1&lROrder=2&lRFilter=1&lReportSearch=reportId:84378480878&lReport=84378480878
[2] https://feedback.corp.google.com/product/208/neutron?lView=rd&lRSort=1&lROrder=2&lRFilter=1&lReportSearch=user:spinnau&lReport=79918636639

This is happening at boot during the ring initialization, before we even submit any command (this is the batch buffer which applies workarounds):


00000000 :  d121a243
00000004 :  00101001
00000008 :  00001080
0000000c :  00000000
00000010 :  00000000
00000014 :  00000000
00000018 :  1100000f MI_LOAD_REGISTER_IMM
0000001c :  000020c0 WA_SET_BIT_MASKED(INSTPM, INSTPM_FORCE_ORDERING)
00000020 :  00800080
00000024 :  0000209c WA_SET_BIT_MASKED(MI_MODE, ASYNC_FLIP_PERF_DISABLE);
00000028 :  40004000
0000002c :  0000e4f0 WA_SET_BIT_MASKED(GEN8_ROW_CHICKEN,
00000030 :  01200120
00000034 :  00007300 WA_SET_BIT_MASKED(HDC_CHICKEN0,
00000038 :  08100810
0000003c :  00007000 WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
00000040 :  00040000
00000044 :  00007004 WA_SET_BIT_MASKED(CACHE_MODE_1,
00000048 :  00400040
0000004c :  00007018 WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X)
00000050 :  80008000
00000054 :  00007008 WA_SET_FIELD_MASKED(GEN7_GT_MODE,
00000058 :  02800200
0000005c :  00000000
00000060 :  7a000004 PIPE_CONTROL
00000064 :  00101001
00000068 :  00001080
0000006c :  00000000
00000070 :  00000000
00000074 :  00000000
00000078 :  18800001 MI_BATCH_BUFFER_START_GEN8
0000007c :  0009e000
00000080 :  00000000
00000084 :  00000000
00000088 :  10400002 MI_STORE_DATA_IMM
0000008c :  000040c0
00000090 :  00000000
00000094 :  ffffefff
00000098 :  01000000
0000009c :  00000000
000000a0 :  00000000
000000a4 :  00000000
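
To read the value dwords in the dump above: the WA_*_MASKED macros write "masked" registers, where (as far as I know from the i915 driver's `_MASKED_BIT_ENABLE`/`_MASKED_BIT_DISABLE` helpers) the upper 16 bits select which bits may change and the lower 16 bits carry their new state. A small sketch of that decoding follows; the function names are made up for illustration, and the offset/value pairs are copied from the dump.

```python
def decode_masked_write(dword):
    """Split a 32-bit masked-register value into (mask, value)."""
    mask = (dword >> 16) & 0xFFFF   # bits the write is allowed to change
    value = dword & 0xFFFF          # new state for those bits
    return mask, value


def describe(dword):
    """Summarize which bits a masked write sets and which it clears."""
    mask, value = decode_masked_write(dword)
    set_bits = mask & value
    cleared_bits = mask & ~value & 0xFFFF
    return "set 0x%04x, clear 0x%04x" % (set_bits, cleared_bits)


# (register offset, value) pairs copied from the batch buffer dump
writes = [
    (0x20C0, 0x00800080),  # WA_SET_BIT_MASKED(INSTPM, ...)
    (0x7000, 0x00040000),  # WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, ...)
    (0x7008, 0x02800200),  # WA_SET_FIELD_MASKED(GEN7_GT_MODE, ...)
]
for offset, dword in writes:
    print("0x%04x: %s" % (offset, describe(dword)))
```

This matches the macro names in the dump: the SET entry has identical halves, the CLR entry has an empty lower half, and the FIELD entry changes a two-bit field in one write.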


Since this is so early, it can't be user space. It therefore has to be a kernel change. Between 60 and 61 there are a bunch of drm/i915 changes, in particular to enable 48 bit gtt:


2017-03-17 15:36 Matt Atwood        o  CHROMIUM: drm/i915/dp: set dvi pruning off on disconnect
2015-10-14 14:17 Chris Wilson       o  UPSTREAM: drm/i915: Report context GTT size
2015-09-30 15:36 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Flip the 48b switch
2015-10-01 13:33 Michel Thierry     o  UPSTREAM: drm/i915: Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset
2015-07-29 17:24 Michel Thierry     o  UPSTREAM: drm/i915/userptr: Kill user_size limit check
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915: batch_obj vm offset must be u64
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915: object size needs to be u64
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Add ppgtt info and debug_dump
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915: Expand error state's address width to 64b
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Initialize PDPs and PML4
2015-08-03 09:53 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Add 4 level support in insert_entries and clear_range
2015-08-03 09:52 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Pass sg_iter through pte inserts
2015-07-30 11:06 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Add 4 level switching infrastructure and lrc support
2015-07-30 11:05 Michel Thierry     o  UPSTREAM: drm/i915/gen8: implement alloc/free for 4lvl
2015-08-03 09:52 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Add PML4 structure
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Add dynamic page trace events
2015-07-30 11:02 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Generalize PTE writing for GEN8 PPGTT
2015-07-30 11:02 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Abstract PDP usage
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915/gen8: Make pdp allocation more dynamic
2015-07-29 17:23 Michel Thierry     o  UPSTREAM: drm/i915: Remove unnecessary gen8_clamp_pd

So, I'll send this Intel's way.

Cc: intel@chromium.org

Comment 67 by wutao@chromium.org, Nov 14 2017

Cc: r...@chromium.org
 Issue 784897  has been merged into this issue.

Comment 68 by wutao@chromium.org, Nov 14 2017

Hi marcheu@, how's this moving forward? Will some people from intel@ continue investigating this? Do we need to file a bug to them? "intel@chromium.org" never visited this website.

Not sure how the canary (#57) or dev (#60) builds affect this bug; M64 canary and dev channel seem to be free of it. I might download signed images to narrow down the "fix" commits in the kernel or possible other changes in Chrome. Since this bug only happens "some time" in non-dev mode, it could take a long time.



Comment 69 by wutao@chromium.org, Nov 14 2017

Cc: dchan@chromium.org
+dchan@: could the team help us to find a possible first "bad" version or first "good" version?

marcheu@, do you think it will help if we can find the above bad/good versions to narrow down kernel or chrome changes?

Thanks!
Yes at this point, this is for intel to fix.

You can bisect the kernel if you'd like, but my feeling is that it'll be one of these changes. I would revert, but if we do so that would break video, so either way we have a bug...
Here are debug logs from a Lenovo N42-20 on M61. This is still happening a LOT to our school Chromebooks.
Attachment: debug-logs_20171116-140125.tgz (5.0 MB)
Cc: yu.kang...@intel.com dongseon...@intel.com
This kernel issue was filed two years ago in https://bugs.freedesktop.org/show_bug.cgi?id=92545

The reporter said some upstream patches fixed it.
I think https://patchwork.freedesktop.org/patch/119562/ may fix it, as the patch removes the reason "Ring hung" which causes this issue in dmesg.
 [drm] GPU HANG: ecode 8:0:0x2efe5dbc, reason: Ring hung, action: reset

Let us investigate more.
Me too
+One
Cc: joone....@intel.com
Re comment #65: Cyan (Gen8) doesn't use 48-bit PPGTT. We confirmed it uses the full 32-bit range. We will test with those 19 patches removed; the expectation is that the problem will persist. Bisecting will continue after that.
Cc: dcasta...@chromium.org
Attached are 3 logs: chrome, ui.LATEST, and messages.
ui.LATEST has some interesting points.

reveman@, dcastagna@, could you look at the question below?

* The GPU process restarts. I'm not sure whether it is relaunched at the same time as the kernel detects the GPU hang and resets the GPU. marcheu@ said the GPU hang and reset happen before the Chromium process starts.
[1677:1677:1127/111529.730271:ERROR:gles2_cmd_decoder.cc(16045)] Onscreen context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
[1677:1677:1127/111529.730389:ERROR:gles2_cmd_decoder.cc(4337)]   GLES2DecoderImpl: Context reset detected after MakeCurrent.
[1677:1677:1127/111529.730472:ERROR:gpu_channel_manager.cc(188)] Exiting GPU process because some drivers cannot recover from problems.
[1677:1677:1127/111529.730585:ERROR:gpu_channel_manager.cc(188)] Exiting GPU process because some drivers cannot recover from problems.

* The output surface fails to allocate. It's the root cause of the artifact, but I don't know why it fails. It's basically a gbm bo allocation. The error happens at the following line.
https://cs.chromium.org/chromium/src/components/viz/service/display_embedder/buffer_queue.cc?type=cs&q=BufferQueue::GetNextSurface&sq=package:chromium&l=280
[1409:1409:1127/111530.099200:ERROR:buffer_queue.cc(280)] Failed to allocate backing image surface
[1409:1409:1127/111530.100229:ERROR:buffer_queue.cc(280)] Failed to allocate backing image surface

* The failed buffer can reach this point and, for some reason, has a very wrong texture target.
https://cs.chromium.org/chromium/src/components/viz/service/display/gl_renderer.cc?type=cs&q=SamplerTypeFromTextureTarget&l=117
[1409:1409:1127/111535.861996:ERROR:gl_renderer.cc(122)] NOTREACHED() hit.
[1409:1409:1127/111535.862238:ERROR:gl_renderer.cc(122)] NOTREACHED() hit.
[1409:1409:1127/111535.862283:ERROR:gl_renderer.cc(122)] NOTREACHED() hit.
[1409:1409:1127/111535.862321:ERROR:gl_renderer.cc(122)] NOTREACHED() hit.

* The texture target is GL_FALSE, instead of GL_TEXTURE_2D or GL_TEXTURE_EXTERNAL_OES
[1775:1775:1127/111535.864103:ERROR:gles2_cmd_decoder_autogen.h(143)] [.DisplayCompositor-0xb6b87ae9e00]GL ERROR :GL_INVALID_ENUM : glBindTexture: target was GL_FALSE
[1775:1775:1127/111535.864262:ERROR:gles2_cmd_decoder_autogen.h(2914)] [.DisplayCompositor-0xb6b87ae9e00]GL ERROR :GL_INVALID_ENUM : glTexParameteri: target was GL_FALSE
[1775:1775:1127/111535.864352:ERROR:gles2_cmd_decoder_autogen.h(2914)] [.DisplayCompositor-0xb6b87ae9e00]GL ERROR :GL_INVALID_ENUM : glTexParameteri: target was GL_FALSE

reveman@, dcastagna@, do you have any idea why a buffer_queue allocation failure results in GL_FALSE as the texture target in M61?
Attachments: chrome (3.4 MB), messages (1.0 MB), ui.LATEST (3.4 MB)
Surprising as we explicitly use GL_TEXTURE_2D for buffer queue on chrome os: https://cs.chromium.org/chromium/src/content/browser/compositor/gpu_process_transport_factory.cc?l=532
#79 - that's why I asked you :')
I could reproduce it using my own kernel.
My base image is ChromeOS-test-R61-9765.81.0-cyan.tar.xz

Here's how to reproduce it.
* Build kernel
 * checkout cros/stabilize-9765.76.B-chromeos-3.18 branch which is most similar to the kernel in the image.
 * cros_workon-cyan start sys-kernel/chromeos-kernel-3_18
 * emerge-cyan chromeos-kernel-3_18
* Deploy
 * switch to dev mode
 * remove rootfs verification
  - /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions 2
 * deploy
  - ~/trunk/src/scripts/update_kernel.sh --remote 10.7.201.25
 * switch to non-dev mode

Kang and I are bisecting the 3.18 kernel. Fortunately(?), I could reproduce it in stabilize-9592.82.B-chromeos-3.18.
BTW, establishing an 'N' mark is very painful...

Here is the current status.
Reproducible  Note       Version                                         Revision
N             >40 trial  remotes/cros/stabilize-9554.B-chromeos-3.18     60
trying                   remotes/cros/stabilize-9592.15.B-chromeos-3.18  60
Y             >30 trial  remotes/cros/stabilize-9592.82.B-chromeos-3.18  60
Y             few trial  remotes/cros/stabilize-9693.B-chromeos-3.18     61
Y             few trial  remotes/cros/stabilize-9765.76.B-chromeos-3.18  61

Comment 82 by wutao@chromium.org, Nov 28 2017

#81, there were some changes on R60-9592.B-chromeos-3.18, hope it can help.

https://bugs.chromium.org/p/chromium/issues/detail?id=727707#c57
#82, thank you for pointing that out. Could you cc me on that issue?

BTW, I reproduced it on stabilize-9554.B-chromeos-3.18 after >70 trials.
It may not be a regression. It may have been there all along and only been revealed by Chromium changes.
We will bisect further and report here.
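
The flip from 'N' after >40 trials to 'Y' after >70 trials shows why a negative result is so expensive with an intermittent bug. As a rough sketch, assuming each reboot is an independent trial with a fixed per-boot hang probability (an assumption; nobody in this thread has measured a rate), one can estimate how many clean boots are needed before calling a kernel good. The function names are hypothetical.

```python
import math


def miss_probability(per_boot_rate, boots):
    """Chance of seeing no hang at all in `boots` reboots of a bad kernel."""
    return (1.0 - per_boot_rate) ** boots


def clean_boots_needed(per_boot_rate, confidence=0.95):
    """Clean reboots needed before a kernel can be called good."""
    if not 0.0 < per_boot_rate < 1.0:
        raise ValueError("per_boot_rate must be strictly between 0 and 1")
    # Solve (1 - p)^n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - per_boot_rate))


# Illustrative rate only: one hang per 70 boots.
rate = 1.0 / 70.0
print("chance of a false 'N' after 40 boots: %.2f" % miss_probability(rate, 40))
print("clean boots needed for 95%% confidence: %d" % clean_boots_needed(rate))
```

Under that illustrative rate, 40 clean boots still leave better-than-even odds of mislabeling a bad kernel as good, which is consistent with what happened on the 9554.B branch.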
Cc: piman@chromium.org
#65 - marcheu@

It's not a regression caused by the bunch of drm/i915 changes between M60 and M61. I verified that it can be reproduced with the M60 to M64 kernels, although the reproduction rate is highest in M61. It still reproduces at ToT.

Reproducible  Note     Version                                         Revision
Y             >70 try  remotes/cros/stabilize-9554.B-chromeos-3.18     60
Y             >40 try  remotes/cros/stabilize-9592.15.B-chromeos-3.18  60
Y             >30 try  remotes/cros/stabilize-9592.82.B-chromeos-3.18  60
Y             few try  remotes/cros/stabilize-9693.B-chromeos-3.18     61
Y             few try  remotes/cros/stabilize-9765.76.B-chromeos-3.18  61
Y             10 try   remotes/cros/stabilize-9901.35.B-chromeos-3.18  62
Y             10 try   remotes/cros/stabilize-9901.77.B-chromeos-3.18  62
Y             20 try   remotes/cros/stabilize-9998.B-chromeos-3.18     63
Y             >30 try  chromeos-3.18                                   64

NOTE: I use ChromeOS-test-R61-9765.81.0-cyan.tar.xz, on which I think it is easiest to reproduce, and just update the kernel.

I was curious whether Chad's mesa GPU hang patch could fix this issue, so I tried it, but it doesn't. ToT mesa still has this issue.
https://chromium-review.googlesource.com/c/chromiumos/third_party/mesa/+/780796


On the other hand, this GPU hang happens in the GPU process, according to the following log.
* The kernel detects a GPU hang and resets the GPU chipset
2017-11-27T22:10:39.696119-08:00 INFO kernel: [    6.732016] [drm] stuck on render ring. my kernel 9554.B, action:5
2017-11-27T22:10:39.715053-08:00 INFO kernel: [    6.751290] [drm] GPU HANG: ecode 8:0:0x2efe5dbc, reason: Ring hung, action: reset
2017-11-27T22:10:39.715085-08:00 INFO kernel: [    6.751309] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
2017-11-27T22:10:39.715089-08:00 INFO kernel: [    6.751322] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
2017-11-27T22:10:39.715091-08:00 INFO kernel: [    6.751335] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
2017-11-27T22:10:39.715093-08:00 INFO kernel: [    6.751349] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
2017-11-27T22:10:39.715095-08:00 INFO kernel: [    6.751363] [drm] GPU crash dump saved to /sys/class/drm/card0/error
2017-11-27T22:10:39.717024-08:00 NOTICE kernel: [    6.753197] drm/i915: Resetting chip after gpu hang

* As soon as the GPU is reset, the GPU process encounters an onscreen context loss and is relaunched.
[1641:1641:1127/221039.717204:ERROR:gles2_cmd_decoder.cc(16045)] Onscreen context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
[1641:1641:1127/221039.736641:ERROR:gles2_cmd_decoder.cc(4337)]   GLES2DecoderImpl: Context reset detected after MakeCurrent.
[1641:1641:1127/221039.736742:ERROR:gpu_channel_manager.cc(188)] Exiting GPU process because some drivers cannot recover from problems.
[1641:1641:1127/221039.736851:ERROR:gpu_channel_manager.cc(188)] Exiting GPU process because some drivers cannot recover from problems.

* However, for some reason, gbm buffer allocation fails in the new GPU process...
[1373:1373:1127/221040.257611:ERROR:buffer_queue.cc(280)] Failed to allocate backing image surface
[1373:1373:1127/221040.258526:ERROR:buffer_queue.cc(280)] Failed to allocate backing image surface

piman@, do you have any idea about the GPU hang and the gbm buffer allocation failure?


Comment 85 by wutao@chromium.org, Nov 29 2017

#84, thank you dongseong.hwang@. When you try different kernels, which Chrome version is used with each?
Hi all, any update on this issue? Thanks.
@84: since the packet originates from the kernel, that means the problem lives in the kernel too, so we should still look for a kernel-side fix. But it does prevent a bisection :-(
Should we tell our end users not to send their systems in for repair, as it is an OS issue and an updated version of the OS will come out soon to resolve it? We have many customers seeing this, and while restarting can resolve the issue, they are still concerned.
Hi, Intel engineers are now pursuing two paths.
Kang: try to find the commit which fixes it in v4.4.
Me: try to find which code path in the GPU process reveals this GPU hang, and try to bypass it.
Reproduction is quite difficult (>50 reboots), so it takes time. We apologize for that.

Tao asked three questions about how to repro this with local changes:
1. When you try kernels from M60, M61 ... M64, which Chrome version is used with each ChromeOS? Is it chosen automatically by the ChromeOS version?

We used ChromeOS-test-R61-9765.81.0-cyan.tar.xz, with only the kernel updated.

2. I and other users cannot reproduce this bug in non-dev mode. What are the steps to repro with local changes, for example a Chrome change? I want to remove some code to rule out some possibilities. You mentioned in comment 81 that you switch to non-dev mode, but I cannot do this unless I use a recovery image to reinstall ChromeOS on the device.

Currently, we don't switch from non-dev to dev and back to non-dev. We found that we can update the kernel in non-dev mode, as long as we remove rootfs verification once.
Could you try to reproduce it using ChromeOS-test-R61-9765.81.0-cyan.tar.xz? I always reproduce it within 10 reboots.

3. If possible, could you please show how to repro on M64 ToT?

We have never reproduced it using M64. We reproduce it using ChromeOS-test-R61-9765.81.0-cyan.tar.xz with ToT of the v3.18 branch.
I guess M64 Chromium probably doesn't have the code in the GPU process which causes the GPU hang. I'm trying to understand it.
#88 - I'm sorry to hear that. You could say that nobody has reproduced it in M64 so far.
UPDATE: about the mysterious logs in Chromium after the GPU process relaunch

1. Allocation fails because the browser process requests it after the GPU process is killed and before the new GPU process is launched.
[1373:1373:1127/221040.257611:ERROR:buffer_queue.cc(280)] Failed to allocate backing image surface

The stack trace at the time is as follows:
#1 0x0d594dd2030f viz::BufferQueue::GetNextSurface()
#2 0x0d594dd1fe35 viz::BufferQueue::BindFramebuffer()
#3 0x0d594fc78f8b cc::GLRenderer::BindFramebufferToOutputSurface()
#4 0x0d594fc6b22d cc::DirectRenderer::UseRenderPass()
#5 0x0d594fc7208e cc::GLRenderer::DrawRenderPassQuadInternal()
#6 0x0d594fc6ea0a cc::GLRenderer::DrawRenderPassQuad()
#7 0x0d594fc6ae6e cc::DirectRenderer::DrawRenderPass()
#8 0x0d594fc6a50d cc::DirectRenderer::DrawRenderPassAndExecuteCopyRequests()
#9 0x0d594fc6a289 cc::DirectRenderer::DrawFrame()
#10 0x0d594dd13fba viz::Display::DrawAndSwap()
#11 0x0d594dd15c26 viz::DisplayScheduler::DrawAndSwap()
#12 0x0d594dd14dc1 viz::DisplayScheduler::OnBeginFrameDeadline()
#13 0x0d594cbbf35a base::debug::TaskAnnotator::RunTask()
#14 0x0d594cb47d5e base::MessageLoop::RunTask()
#15 0x0d594cb4812b base::MessageLoop::DeferOrRunPendingTask()
#16 0x0d594cb48584 base::MessageLoop::DoWork()
#17 0x0d594cb49cd9 base::MessagePumpLibevent::Run()
#18 0x0d594cb6b246 base::RunLoop::Run()
#19 0x0d594c807663 ChromeBrowserMainParts::MainMessageLoopRun()
#20 0x0d594b4bce04 content::BrowserMainLoop::RunMainMessageLoopParts()
#21 0x0d594b4bf672 content::BrowserMainRunnerImpl::Run()
#22 0x0d594b4b87fc content::BrowserMain()
#23 0x0d594c7daf46 content::ContentMainRunnerImpl::Run()
#24 0x0d594c7fcecd service_manager::Main()
#25 0x0d594c7d9ef1 content::ContentMain()
#26 0x0d594aeaab6c ChromeMain
#27 0x741e40450816 __libc_start_main
#28 0x0d594aeaa9d9 _start

2. The mysterious GL_FALSE happens because the resource_id doesn't exist in |resources_| in ResourceProvider.
[1373:1373:1127/221055.881661:ERROR:gl_renderer.cc(122)] NOTREACHED() hit.
[1744:1744:1127/221056.336861:ERROR:gles2_cmd_decoder_autogen.h(143)] [.DisplayCompositor-0x2f4789db5200]GL ERROR :GL_INVALID_ENUM : glBindTexture: target was GL_FALSE
[1744:1744:1127/221056.337035:ERROR:gles2_cmd_decoder_autogen.h(2914)] [.DisplayCompositor-0x2f4789db5200]GL ERROR :GL_INVALID_ENUM : glTexParameteri: target was GL_FALSE

The stack trace at the time is as follows:
#1 0x0d594d836714 gpu::gles2::GLES2Implementation::BindTextureHelper()
#2 0x0d594d79c584 cc::ResourceProvider::BindForSampling()
#3 0x0d594d79c49d cc::ResourceProvider::ScopedSamplerGL::ScopedSamplerGL()
#4 0x0d594fc75e24 cc::GLRenderer::DrawContentQuadNoAA()
#5 0x0d594fc7542e cc::GLRenderer::DrawContentQuad()
#6 0x0d594fc70411 cc::GLRenderer::DrawTileQuad()
#7 0x0d594fc6ae6e cc::DirectRenderer::DrawRenderPass()
#8 0x0d594fc6a50d cc::DirectRenderer::DrawRenderPassAndExecuteCopyRequests()
#9 0x0d594fc6a289 cc::DirectRenderer::DrawFrame()
#10 0x0d594dd13fba viz::Display::DrawAndSwap()
#11 0x0d594dd15c26 viz::DisplayScheduler::DrawAndSwap()
#12 0x0d594dd14dc1 viz::DisplayScheduler::OnBeginFrameDeadline()
#13 0x0d594cbbf35a base::debug::TaskAnnotator::RunTask()
#14 0x0d594cb47d5e base::MessageLoop::RunTask()
#15 0x0d594cb4812b base::MessageLoop::DeferOrRunPendingTask()
#16 0x0d594cb487b0 base::MessageLoop::DoDelayedWork()
#17 0x0d594cb49bad base::MessagePumpLibevent::Run()
#18 0x0d594cb6b246 base::RunLoop::Run()
#19 0x0d594c807663 ChromeBrowserMainParts::MainMessageLoopRun()
#20 0x0d594b4bce04 content::BrowserMainLoop::RunMainMessageLoopParts()
#21 0x0d594b4bf672 content::BrowserMainRunnerImpl::Run()
#22 0x0d594b4b87fc content::BrowserMain()
#23 0x0d594c7daf46 content::ContentMainRunnerImpl::Run()
#24 0x0d594c7fcecd service_manager::Main()
#25 0x0d594c7d9ef1 content::ContentMain()
#26 0x0d594aeaab6c ChromeMain
#27 0x741e40450816 __libc_start_main
#28 0x0d594aeaa9d9 _start

Cc: danakj@chromium.org
+danakj@, oshima@, reveman@, FYI, please see #78, 84, and 91.

Point 1 of comment 91 reminds me of comment 11 in this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=759148#c11

I wonder if this bug happens more frequently when the users are using the animated avatar icon (not the static icon) on the login screen.
#CBC-RS/TC-watchlist
The GPU hang, on the rare occasions it occurs, happens after the first image is created. After GpuCommandBufferStub is initialized and the transfer buffers are registered, it creates the first image and then does an async flush. The GPU hang occasionally follows.

GpuCommandBufferStub::Initialize completed
GpuCommandBufferMsg_SetGetBuffer message.type():655508
GpuCommandBufferMsg_RegisterTransferBuffer message.type():655553
GpuCommandBufferMsg_RegisterTransferBuffer message.type():655553
GpuCommandBufferMsg_SetGetBuffer message.type():655508
GpuCommandBufferMsg_RegisterTransferBuffer message.type():655553
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_WaitForGetOffsetInRange message.type():655532
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_WaitForGetOffsetInRange message.type():655532
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_WaitForGetOffsetInRange message.type():655532
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_WaitForGetOffsetInRange message.type():655532
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferMsg_CreateImage message.type():655597
GpuCommandBufferMsg_RegisterTransferBuffer message.type():655553
GpuCommandBufferMsg_AsyncFlush message.type():655542
GpuCommandBufferStub MakeCurrent fails!
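
A trace like the one above can be summarized mechanically when comparing several boots. Here is a minimal sketch, assuming the ad-hoc log format shown (message name followed by `message.type():<id>`); that format comes from local debug prints, not from any standard Chromium logging, and `summarize_trace` is a made-up helper name.

```python
import re
from collections import Counter

# Matches the local debug lines quoted above, e.g.
# "GpuCommandBufferMsg_AsyncFlush message.type():655542"
MSG_RE = re.compile(r"^(GpuCommandBufferMsg_\w+) message\.type\(\):(\d+)$")


def summarize_trace(lines):
    """Count IPC messages and report the last one seen before MakeCurrent failed."""
    names = []
    failed_after = None
    for raw in lines:
        line = raw.strip()
        match = MSG_RE.match(line)
        if match:
            names.append(match.group(1))
        elif "MakeCurrent fails" in line and names and failed_after is None:
            failed_after = names[-1]
    return Counter(names), failed_after


trace = [
    "GpuCommandBufferStub::Initialize completed",
    "GpuCommandBufferMsg_SetGetBuffer message.type():655508",
    "GpuCommandBufferMsg_AsyncFlush message.type():655542",
    "GpuCommandBufferMsg_CreateImage message.type():655597",
    "GpuCommandBufferMsg_RegisterTransferBuffer message.type():655553",
    "GpuCommandBufferMsg_AsyncFlush message.type():655542",
    "GpuCommandBufferStub MakeCurrent fails!",
]
counts, failed_after = summarize_trace(trace)
print(counts.most_common())
print("last message before failure:", failed_after)
```

Running this over traces from hanging and non-hanging boots would show whether the failure always follows the first CreateImage and flush, or whether that ordering is a coincidence of one capture.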


When switching between kernel revisions, the reproduction rate of the GPU hang changes widely, as mentioned in #84. In my opinion, this means there are many culprits for the GPU hang, and each revision fixes some and regresses others. There have been several GPU hang reports before, but what I noticed is that, on the record, they were rarely fixed.
We are trying to fix it, but I'm not sure how long it will take.
By the way, Chromium can recover from it gracefully, as the GPU process is meant to. See the next comment.
#92 - wutao@, could you cc me on issue 759148?

Besides the GPU hang, I now understand why the artifact happens. It's an ubercompositor issue. I'm not sure if it's fixed in ToT. Let me explain what's going on.

The Renderer sometimes creates its ResourceProvider after the GPU process is killed but before the new GPU process is launched, so the Renderer's ResourceProvider is a software provider. Meanwhile, the Browser destroys the old ResourceProvider associated with the old GPU process and constructs a new one. The Renderer sends bitmap resources to the Browser, but the Browser handles them as GPU resources, because the Browser's ResourceProvider is a GPU ResourceProvider and the Browser expects its child to have a GPU ResourceProvider too.

That is why the following logs appear.
[1373:1373:1127/221055.881661:ERROR:gl_renderer.cc(122)] NOTREACHED() hit.
[1744:1744:1127/221056.336861:ERROR:gles2_cmd_decoder_autogen.h(143)] [.DisplayCompositor-0x2f4789db5200]GL ERROR :GL_INVALID_ENUM : glBindTexture: target was GL_FALSE

The Renderer creates a bitmap resource.
#1 0x028a83ed7112 cc::ResourceProvider::InsertResource()
#2 0x028a83ed6c98 cc::ResourceProvider::CreateBitmap()
#3 0x028a83ed68e4 cc::ResourceProvider::CreateResource()
#4 0x028a83ede070 cc::ScopedResource::Allocate()
#5 0x028a83ed3e24 cc::ResourcePool::CreateResource()
#6 0x028a83efb49f cc::TileManager::CreateRasterTask()
#7 0x028a83ef8ae8 cc::TileManager::AssignGpuMemoryToTiles()
#8 0x028a83ef825b cc::TileManager::PrepareTiles()
#9 0x028a83e94fe3 cc::LayerTreeHostImpl::PrepareTiles()

And the Browser receives it with target==0. I checked where InsertResource() stores GL_FALSE (i.e. 0) as the target.
#1 0x08980d327112 cc::ResourceProvider::InsertResource()
#2 0x08980d32c39b cc::ResourceProvider::ReceiveFromChild()
#3 0x08980d8a752e viz::SurfaceAggregator::PrewalkTree()
#4 0x08980d8a869d viz::SurfaceAggregator::PrewalkTree()
#5 0x08980d8a9494 viz::SurfaceAggregator::Aggregate()
#6 0x08980d8a1154 viz::Display::DrawAndSwap()
#7 0x08980d8a3216 viz::DisplayScheduler::DrawAndSwap()

There are two possible solutions.
1. The Browser compositor handles bitmap resources gracefully.
2. When the GPU process is relaunched, the Renderer re-creates its ResourceProvider, even if the old one is a software ResourceProvider.
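
To make the mismatch concrete, here is a toy model of the race in Python. It is not Chromium code: the class and function names are invented for illustration, and the only real constants are the GL enum values. It only mirrors the description above: a renderer that starts while no GPU process exists produces bitmap resources with target 0, and a browser that assumes GPU resources then binds an invalid target. The `handle_bitmaps` flag stands in for solution 1.

```python
GL_FALSE = 0            # real GL value; here it means "no texture target"
GL_TEXTURE_2D = 0x0DE1  # real GL enum value


class ToyResource:
    """Hypothetical stand-in for a resource sent from Renderer to Browser."""

    def __init__(self, is_bitmap):
        self.is_bitmap = is_bitmap
        # Bitmap (software) resources carry no GL texture target.
        self.target = GL_FALSE if is_bitmap else GL_TEXTURE_2D


def renderer_create_resource(gpu_process_alive):
    # A renderer that starts while the GPU process is gone falls back to software.
    return ToyResource(is_bitmap=not gpu_process_alive)


def browser_bind_for_sampling(resource, handle_bitmaps=False):
    # Without the fix, the browser assumes every child resource is a GPU texture.
    if resource.is_bitmap and handle_bitmaps:
        return "uploaded bitmap as GL_TEXTURE_2D"
    if resource.target == GL_FALSE:
        return "GL_INVALID_ENUM: glBindTexture: target was GL_FALSE"
    return "bound target 0x%04X" % resource.target


racy = renderer_create_resource(gpu_process_alive=False)
print(browser_bind_for_sampling(racy))
print(browser_bind_for_sampling(racy, handle_bitmaps=True))
print(browser_bind_for_sampling(renderer_create_resource(gpu_process_alive=True)))
```

Solution 2 would correspond to never reaching the first case at all, because the renderer would rebuild its provider once the GPU process is back.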

danakj@, oshima@, reveman@, could you give me your opinion?
I could not reproduce it on M64. That may be because ToT already fixed it via #1 or #2. Otherwise, the timing of the Renderer's ResourceProvider creation may be very different. This bug requires the Renderer's ResourceProvider to be created between the old GPU process being killed and the new GPU process being created.
If ToT still has this issue, which solution do you prefer?

BTW, it's fine that BufferQueue fails surface allocation. When the new GPU process is created, the old BufferQueue is destroyed and the new BufferQueue allocates a brand new surface.
[1373:1373:1127/221040.257611:ERROR:buffer_queue.cc(280)] Failed to allocate backing image surface
> The Renderer sometimes creates its ResourceProvider after the GPU process has been killed but before the new GPU process has launched, so the Renderer's ResourceProvider is a software provider. Meanwhile, the Browser destroys the old ResourceProvider associated with the old GPU process and constructs a new one. The Renderer then sends a bitmap resource to the Browser, but the Browser handles it as a GPU resource, because the Browser's ResourceProvider is a GPU ResourceProvider and it expects its child to have a GPU ResourceProvider too.

This should be fixed in TOT with the CompositingModeWatcher/Reporter I've done recently. There's now global decision making for compositing mode instead of local decisions.
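
To illustrate the difference between a global decision and local ones, here is a minimal observer-style sketch. It is not the Chromium CompositingModeWatcher/Reporter code, which I have not reproduced here; `ModeReporter`, `Compositor`, `add_watcher` and `set_mode` are hypothetical names. The assumption is only that one component owns the mode and every compositor is told about changes, so parent and child cannot disagree.

```python
from enum import Enum
from typing import Callable, List, Optional


class CompositingMode(Enum):
    GPU = "gpu"
    SOFTWARE = "software"


class ModeReporter:
    """Hypothetical single owner of the compositing mode."""

    def __init__(self) -> None:
        self._mode = CompositingMode.GPU
        self._watchers: List[Callable[[CompositingMode], None]] = []

    def add_watcher(self, watcher: Callable[[CompositingMode], None]) -> None:
        self._watchers.append(watcher)
        watcher(self._mode)  # new watchers learn the current mode immediately

    def set_mode(self, mode: CompositingMode) -> None:
        self._mode = mode
        for watcher in self._watchers:
            watcher(mode)


class Compositor:
    """Hypothetical compositor that never picks a mode on its own."""

    def __init__(self, name: str, reporter: ModeReporter) -> None:
        self.name = name
        self.mode: Optional[CompositingMode] = None
        reporter.add_watcher(self._on_mode_changed)

    def _on_mode_changed(self, mode: CompositingMode) -> None:
        self.mode = mode
```

With this shape, a "browser" and a "renderer" compositor registered on the same reporter always hold the same mode, whichever of them was created during the GPU process restart.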
#96 - danakj@ nice! could you point out the CL. New stable release should cherry pick it.
It was more than 1 CL and they are large, there's no chance that we can cherry pick them directly to stable IMO. https://bugs.chromium.org/p/chromium/issues/detail?id=772574
Blockedon: 730660
marcheu@, danakj@, how should we fix it? Should I make a CL for M61 or M62 only?
I don't have any good ideas to fix it easily on stable, it should be fixed when those CLs roll out to stable in a few weeks, sorry.
Why is this blocked on 730660?
Blockedon: -730660 772574
#102 - sorry for noise. I thought 730660 fixed it.

marcheu@, can we just wait for a few weeks?
This issue is happening on devices that are already in the field/market and in many schools. We have received a lot of requests for help from OEM partners looking for a resolution.

It would be nice to show our OEM partners a path to the solution, so they will know what to do and how to deal with their customers, especially the schools.

Many Thanks.
The last CL in https://bugs.chromium.org/p/chromium/issues/detail?id=772574 is   https://chromium.googlesource.com/chromium/src.git/+/12ed5e142c912eb0c5fe98a6b034852b2824508f

It is
Cr-Commit-Position: refs/heads/master@{#515271}

Which means it should be fixed in M64. Verifying by trying to reproduce on the dev channel could be helpful.
dchan@, for testing and repro, please use a recovery image in non-dev mode and restart the device many, many times; please see comment 84.

From comments 26 and 44, it seems this bug happens more often in non-dev mode. I have tried the image ChromeOS-test-R61-9765.81.0-cyan.tar.xz from comment 89, but cannot repro so far.


Comment 107 by dchan@google.com, Dec 6 2017

Cc: -joone....@intel.com aashuto...@chromium.org harpreet@chromium.org

Comment 108 by dchan@google.com, Dec 6 2017

Cc: joone....@intel.com
It was fixed on Nov 9th in issue 772574. M64's branch point is Nov 30th. The M64 stable release date is Jan 23rd. The M64 beta release should be very soon.

#105 - nobody reproduced it on M64.

#106 - about comment 84, that test used a combination of M61 chromium and an M64 kernel. As M61 chromium still has this issue, it was reproduced. Pure M64 Chrome OS cannot reproduce it.

In my opinion, users would need to use beta until Jan 23rd... Is that a possible solution? Otherwise, we need a temporary fix for M63, whose release date is Dec 5th.

What should we do?
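
The schedule reasoning above can be checked with simple date arithmetic. This sketch assumes the years (2017 for the fix, branch point and this comment; 2018 for the stable release), which are inferred from the comment dates in this thread rather than stated alongside the milestones.

```python
from datetime import date

# Dates quoted in this thread; the years are an assumption (see lead-in).
FIX_LANDED = date(2017, 11, 9)
M64_BRANCH_POINT = date(2017, 11, 30)
M64_STABLE = date(2018, 1, 23)
COMMENT_DATE = date(2017, 12, 6)


def fix_in_branch(landed: date, branch_point: date) -> bool:
    """A change that landed on or before the branch point ships in that branch."""
    return landed <= branch_point


def days_until(release: date, today: date) -> int:
    """Number of days users would have to wait for the release."""
    return (release - today).days
```

`fix_in_branch(FIX_LANDED, M64_BRANCH_POINT)` is true, and `days_until(M64_STABLE, COMMENT_DATE)` comes to 48 days, which is the roughly 1.5-month wait discussed in this thread.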
@Intel, we currently have over 750k units in the field across 2 models, with customers complaining of this issue. Please help to expedite it as urgently as possible.

We can not tell our school customer with large install bases to switch over to the beta channel for 64. They are very cautious about switching from the main stable channel to anything else.


Is it OK for us at Acer to tell our customers not to send their units in for repair, as a new OS update will be out soon to fix the issue? Right now they want to send a system in for repair every time they see this issue, as they think it is a hardware failure. Thank you. Sandy
Cc: -aashuto...@chromium.org

Comment 112 by dchan@google.com, Dec 6 2017

Cc: vwang@chromium.org
+vwang to respond to comment #110 or figure out who to contact.
#110 - danakj@, it sounds like Acer cannot wait 1.5 months.
How about landing a temporary CL for the M63 stable release, which came out yesterday?
The easiest and least invasive fix would be to make the browser compositor handle bitmap resources via texture upload.
Chrome OS should never fall back to software. Can we just fix the root cause (GPU hang/crash) instead of trying to merge a crazy amount of code to stable to implement the workaround?
Is the problem fixed completely, or is the chrome gpu process now restarting without too much visible impact? Given that the source of the issue is in the kernel, I suspect it's the latter. In that case, the issue still needs to be addressed in the kernel regardless of the user space situation.
#114, #115 - The GPU hang issue is still there, but I think users would not notice it. From the user's point of view, boot may feel about 1.2 sec longer (i.e. 'new gpu initialization time' - 'old gpu initialization time' from the log) when the GPU hang happens.
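
A delta like this can be computed from the timestamps in the Chrome log prefix, which has the form `[pid:tid:MMDD/HHMMSS.microseconds:LEVEL:file(line)]`. The helper below is a sketch; `log_time` and `seconds_between` are hypothetical names, and the year is an assumption because the log prefix does not carry one. The two sample lines are the ones quoted earlier in this thread, used only to exercise the parser; they are not the lines behind the 1.2 sec figure.

```python
import re
from datetime import datetime

# Matches the "[pid:tid:MMDD/HHMMSS.uuuuuu:" prefix of a Chrome log line.
_STAMP = re.compile(r"\[\d+:\d+:(\d{2})(\d{2})/(\d{2})(\d{2})(\d{2})\.(\d{6}):")


def log_time(line: str, year: int = 2017) -> datetime:
    """Parse the timestamp of one log line; the year must be supplied."""
    match = _STAMP.match(line)
    if match is None:
        raise ValueError("no Chrome log timestamp found")
    month, day, hour, minute, second, micros = (int(g) for g in match.groups())
    return datetime(year, month, day, hour, minute, second, micros)


def seconds_between(first: str, second: str) -> float:
    """Elapsed seconds from the first log line to the second."""
    return (log_time(second) - log_time(first)).total_seconds()


LINE_A = "[1373:1373:1127/221055.881661:ERROR:gl_renderer.cc(122)] NOTREACHED() hit."
LINE_B = ("[1744:1744:1127/221056.336861:ERROR:gles2_cmd_decoder_autogen.h(143)] "
          "[.DisplayCompositor-0x2f4789db5200]GL ERROR :GL_INVALID_ENUM : "
          "glBindTexture: target was GL_FALSE")
```

Applying `seconds_between` to the first GPU initialization message of the old and new GPU processes would give the extra boot time directly.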

We are still trying to fix it in the kernel. There are 2 more GPU hang issues on Braswell, which other engineers are looking at.
Kang is still bisecting between v3.18 and v4.4, because we have never reproduced it on v4.4. We found some promising patches, but they don't fix it completely, though they make the GPU hang less frequent.
It may take a long time to fix. I suspect it's not a single issue. Even once we narrow down the kernel patches, I think the amount of code will be bigger than the chromium change.

Meanwhile, manufacturing partners are suffering...
I don't think we want to make our product worse (increase boot time, cause blinking on startup) to paper over this issue; the cost is too high. As I pointed out in comment #65, this packet is only ever emitted once, on ring initialization by the kernel (it cannot be sent by anything else), and comes from the intel driver; I think it is unfair to point the finger at Chrome for this issue.
#117 - I definitely agree with you. I just wanted to point out that an increased boot time is better than the artifact for the short time until the M64 stable release. In addition, there is no blinking, as the GPU process is killed before the first page flip.

#94 shows that the GPU process emits quite a few commands (including GPU process initialization code) before the GPU hang happens. My guess is that the M61 GPU process emits a specific command stream that exposes the GPU hang. However, I couldn't reproduce the GPU hang with the M64 test image. A kernel change or a GPU process change may be hiding the GPU hang. As it's a race condition, timing or command order is probably important for reproducing it.

We also want to fix it in the kernel and are working on it.
@118: indeed the GPU process emits some commands on startup, but as far as I can see the GPU never gets to execute them since it fails on the workaround buffer first (as shown in the error state file). In theory the workaround buffer is the very, very first thing which is queued into a fresh ring, and in our case this is also where things end.

Do you have reason to believe that the Chrome-side commands would influence ordering in any way?
> Do you have reason to believe that the Chrome-side commands would influence ordering in any way?

The Chromium GPU process always loses the context at a similar place, after the initialization and command buffer messages are done, so I was just speculating.
[1132:1132:1205/203710.721226:ERROR:gles2_cmd_decoder.cc(16047)] Offscreen context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR

I'll investigate it based on your hints. Thank you.
I think what happens is:
- the kernel creates the ring, queues the workarounds in there, but doesn't execute them
- Chrome starts, queues up a few GPU commands, flushes
- At that point, the workarounds get executed and result in the hang we are seeing

Intuitively, this seems like a form of race condition. It could be something like how long does it take from boot or from other parts of GPU initialization to the time where GPU starts running, maybe something is initializing in the background that's not quite ready.

Cc: hoegsberg@chromium.org
We found 2 kernel patches with which we have not been able to reproduce this issue so far. Kang and I rebooted cyan 300 times without reproducing it. Without the patches, we could reproduce it about 1 time in 10.

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/818337
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/818338
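
As a rough sanity check on those numbers, assume each reboot is an independent trial (an assumption; boot-time races are not necessarily independent). If the hang still occurred about 1 time in 10, the chance of 300 clean reboots in a row would be 0.9^300, which is vanishingly small. The sketch below computes that, plus the largest per-reboot hang rate that is still consistent with 300 clean reboots at a chosen confidence level. The function names are made up for illustration.

```python
def prob_no_repro(hang_rate: float, reboots: int) -> float:
    """Probability of seeing zero hangs in `reboots` independent reboots."""
    return (1.0 - hang_rate) ** reboots


def max_rate_consistent(reboots: int, confidence: float = 0.95) -> float:
    """Largest per-reboot hang rate not excluded by `reboots` clean reboots.

    Solves (1 - rate) ** reboots == 1 - confidence for rate.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / reboots)
```

`prob_no_repro(0.1, 300)` is on the order of 1e-14, so the patches clearly change the behaviour. `max_rate_consistent(300)` is about 1%, so 300 clean reboots cannot by themselves rule out a residual hang rate below roughly 1 in 100.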

Kang, Daniel Charles, and I discussed this issue with upstream kernel engineers Joonas Lahtinen, Tvrtko Ursulin and Benjamin Widawsky.
Joonas pointed out these patches, which are workarounds for gen8, Braswell's GPU generation. We have not been able to reproduce the GPU hang with the patches so far.
Our validation engineers are validating the patches to make sure they don't break any Chromium functionality. So far, we haven't seen any issues.

Stéphane, Kristian, could you review the patches?
Someone needs to have a look at the following CL:

https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/818338/

Its current state is "Cannot Merge".

I just cherry-picked refs/changes/37/818337/1 and refs/changes/38/818338/6 on the current cros/chromeos-3.18 tip 332f31ccb26e and there was no conflict.
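
For reference, the ref names above follow Gerrit's convention `refs/changes/<last two digits of the change number>/<change number>/<patch set>`. A small helper, with a hypothetical name, that builds such a ref:

```python
def gerrit_change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref for one patch set of a change."""
    if change <= 0 or patchset <= 0:
        raise ValueError("change and patchset must be positive")
    # The first path component is the change number modulo 100, zero-padded.
    return f"refs/changes/{change % 100:02d}/{change}/{patchset}"
```

For the two CLs here, `gerrit_change_ref(818337, 1)` and `gerrit_change_ref(818338, 6)` produce the refs quoted above.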

We generally ignore messages like "Cannot Merge" in the Gerrit web interface, often they just mean there's another, totally unrelated and not conflicting change in flight somewhere touching the same file(s), maybe just not abandoned yet. Other times it seems to be other patches in the *same* series generating these spurious messages - Gerrit's (and git's) concept of a series is very limited. Because of recurring confusions like this one these messages have done IMHO more harm than good, maybe they can be disabled?
Thanks for the update.
Project Member

Comment 126 by bugdroid1@chromium.org, Dec 12 2017

Labels: merge-merged-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/0d9e14b70aa611275f0e283b81e9be489fb4fe07

commit 0d9e14b70aa611275f0e283b81e9be489fb4fe07
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date: Tue Dec 12 03:37:48 2017

UPSTREAM: drm/i915: avoid the last 8mb of stolen on BDW/SKL

The FBC hardware for these platforms doesn't have access to the
bios_reserved range, so it always assumes the maximum (8mb) is used.
So avoid this range while allocating.

This solves a bunch of FIFO underruns that happen if you end up
putting the CFB in that memory range. On my machine, with 32mb of
stolen, I need a 2560x1440 mode for that.

Testcase: igt/kms_frontbuffer_tracking/fbc-* (given the right setup)
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.kernel.org/patch/7177881/
(cherry picked from commit a9da512b3ed73045253afd778e40d4298f42905b)
Signed-off-by: Dongseong Hwang <dongseong.hwang@intel.com>

BUG= 776613 
TEST=reboot and basic validation on cyan

Change-Id: Ifb75889f9625e56066f8180e22538751559c4dd3
Tested-by: Yu Kang Ku <yu.kang.ku@intel.com>
Tested-by: Brian Wilson <brian.wilson@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/818337
Commit-Ready: Dongseong Hwang <dongseong.hwang@intel.com>
Tested-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Reviewed-by: Duncan Laurie <dlaurie@google.com>
Reviewed-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Puneet Kumar <puneetster@chromium.org>

[modify] https://crrev.com/0d9e14b70aa611275f0e283b81e9be489fb4fe07/drivers/gpu/drm/i915/i915_gem_gtt.h
[modify] https://crrev.com/0d9e14b70aa611275f0e283b81e9be489fb4fe07/drivers/gpu/drm/i915/i915_drv.h
[modify] https://crrev.com/0d9e14b70aa611275f0e283b81e9be489fb4fe07/drivers/gpu/drm/i915/i915_gem_stolen.c
[modify] https://crrev.com/0d9e14b70aa611275f0e283b81e9be489fb4fe07/drivers/gpu/drm/i915/intel_fbc.c

Project Member

Comment 127 by bugdroid1@chromium.org, Dec 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/2c5bc6460d419822c978fd3275dd91f94d866610

commit 2c5bc6460d419822c978fd3275dd91f94d866610
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date: Tue Dec 12 03:37:49 2017

UPSTREAM: drm/i915: don't use the first stolen page on Broadwell

The spec says we just can't use it.

v2:
  - Add WA name (Ville).
  - Add a big comment explaining that we still didn't fix the problem
    where we inherit a framebuffer on the first page (Chris, Ville).

Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.kernel.org/patch/7250811/
(cherry picked from commit 1c3e804bfe67fc8bb336f8a456348adff2b9df26)
Signed-off-by: Dongseong Hwang <dongseong.hwang@intel.com>

BUG= chromium:776613 
TEST=reboot and basic validation on cyan
CQ-DEPEND=CL:818337

Change-Id: Ia2219124663ae874fbe6c85e4f17a808df14a8e3
Tested-by: Yu Kang Ku <yu.kang.ku@intel.com>
Tested-by: Brian Wilson <brian.wilson@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/818338
Commit-Ready: Dongseong Hwang <dongseong.hwang@intel.com>
Tested-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Reviewed-by: Dongseong Hwang <dongseong.hwang@intel.com>

[modify] https://crrev.com/2c5bc6460d419822c978fd3275dd91f94d866610/drivers/gpu/drm/i915/i915_gem_stolen.c
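
Taken together, the two commit messages above describe skipping the first page of stolen memory and avoiding its last 8 MB. The sketch below models both restrictions as bounds on a single allocation range. That is a simplification made for illustration and is not the i915 code; the function name, the 4096-byte page size and the parameters are assumptions.

```python
PAGE_SIZE = 4096          # assumed page size
MIB = 1024 * 1024


def usable_stolen_range(stolen_size: int,
                        skip_first_page: bool = True,
                        reserved_tail: int = 8 * MIB) -> tuple:
    """Return (start, end) byte offsets of the stolen memory left usable.

    start skips the first page; end stops `reserved_tail` bytes short
    of the end of stolen memory.
    """
    start = PAGE_SIZE if skip_first_page else 0
    end = stolen_size - reserved_tail
    if end <= start:
        raise ValueError("stolen memory too small for these restrictions")
    return start, end
```

With the 32 MB of stolen memory mentioned in the first commit message, this leaves the range from offset 4096 up to 24 MB.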

Status: Assigned (was: Available)
Now it's fixed in ToT :)
How can we apply it on M63 release?
Apply to 64 first, then 63 :)
Labels: Merge-Request-64 Merge-Request-63
dongseong.hwang@, do you mean what are the steps to apply the cls to M64 and M63?

We add merge requests here (done), and then after approvals, On the code review pages, you can cherry pick the two cls into different branches, following this doc:
https://www.chromium.org/developers/how-tos/drover




Project Member

Comment 131 by sheriffbot@chromium.org, Dec 13 2017

Labels: -Merge-Request-63 Merge-Review-63 Hotlist-Merge-Review
This bug requires manual review: Request affecting a post-stable build
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), gkihumba@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Hi all,
  This issue also happened on Padova_rlem. Please also help pick the CL to Padova_Rlem. Thanks a lot!
Project Member

Comment 134 by sheriffbot@chromium.org, Dec 14 2017

Labels: -Merge-Request-64 Hotlist-Merge-Approved Merge-Approved-64
Your change meets the bar and is auto-approved for M64. Please go ahead and merge the CL to branch 3282 manually. Please contact milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
@133: Sorry for the typo. Padova is relm. I can find relm in the list.
Thank you for pointing that out and explaining it. It's my first time merging CLs to a stable branch. I followed https://www.chromium.org/developers/how-tos/drover
There are two cherry-pick CLs, for M63 and M64 respectively:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/828302
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/828301

If I did something wrong, please advise me.
#136, do we have a total of two CLs that need to be merged to both M64 and M63?
FYI, per crbug/327216 this is contributing to one of the top M64 crashes.  I see that you're working to resolve / merge.  Thanks!
Project Member

Comment 140 by bugdroid1@chromium.org, Dec 15 2017

Labels: merge-merged-release-R64-10176.B-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/79e15e6ef0c415ddcd9975d0819958d73654cc62

commit 79e15e6ef0c415ddcd9975d0819958d73654cc62
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date: Fri Dec 15 22:26:15 2017

UPSTREAM: drm/i915: avoid the last 8mb of stolen on BDW/SKL

The FBC hardware for these platforms doesn't have access to the
bios_reserved range, so it always assumes the maximum (8mb) is used.
So avoid this range while allocating.

This solves a bunch of FIFO underruns that happen if you end up
putting the CFB in that memory range. On my machine, with 32mb of
stolen, I need a 2560x1440 mode for that.

Testcase: igt/kms_frontbuffer_tracking/fbc-* (given the right setup)
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.kernel.org/patch/7177881/
(cherry picked from commit a9da512b3ed73045253afd778e40d4298f42905b)
Signed-off-by: Dongseong Hwang <dongseong.hwang@intel.com>

BUG= 776613 
TEST=reboot and basic validation on cyan

Change-Id: Ifb75889f9625e56066f8180e22538751559c4dd3
Tested-by: Yu Kang Ku <yu.kang.ku@intel.com>
Tested-by: Brian Wilson <brian.wilson@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/818337
Commit-Ready: Dongseong Hwang <dongseong.hwang@intel.com>
Tested-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Reviewed-by: Duncan Laurie <dlaurie@google.com>
Reviewed-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Puneet Kumar <puneetster@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/828301
Commit-Queue: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@chromium.org>

[modify] https://crrev.com/79e15e6ef0c415ddcd9975d0819958d73654cc62/drivers/gpu/drm/i915/i915_gem_gtt.h
[modify] https://crrev.com/79e15e6ef0c415ddcd9975d0819958d73654cc62/drivers/gpu/drm/i915/i915_drv.h
[modify] https://crrev.com/79e15e6ef0c415ddcd9975d0819958d73654cc62/drivers/gpu/drm/i915/i915_gem_stolen.c
[modify] https://crrev.com/79e15e6ef0c415ddcd9975d0819958d73654cc62/drivers/gpu/drm/i915/intel_fbc.c

Project Member

Comment 141 by bugdroid1@chromium.org, Dec 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/01e32a3522ff2da739f6ac732c56661a97d3f081

commit 01e32a3522ff2da739f6ac732c56661a97d3f081
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date: Fri Dec 15 22:26:20 2017

UPSTREAM: drm/i915: don't use the first stolen page on Broadwell

The spec says we just can't use it.

v2:
  - Add WA name (Ville).
  - Add a big comment explaining that we still didn't fix the problem
    where we inherit a framebuffer on the first page (Chris, Ville).

Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.kernel.org/patch/7250811/
(cherry picked from commit 1c3e804bfe67fc8bb336f8a456348adff2b9df26)
Signed-off-by: Dongseong Hwang <dongseong.hwang@intel.com>

BUG= chromium:776613 
TEST=reboot and basic validation on cyan
CQ-DEPEND=CL:818337

Change-Id: Ia2219124663ae874fbe6c85e4f17a808df14a8e3
Tested-by: Yu Kang Ku <yu.kang.ku@intel.com>
Tested-by: Brian Wilson <brian.wilson@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/818338
Commit-Ready: Dongseong Hwang <dongseong.hwang@intel.com>
Tested-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Reviewed-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/830810
Reviewed-by: Kristian H. Kristensen <hoegsberg@chromium.org>
Commit-Queue: Dongseong Hwang <dongseong.hwang@intel.com>

[modify] https://crrev.com/01e32a3522ff2da739f6ac732c56661a97d3f081/drivers/gpu/drm/i915/i915_gem_stolen.c

Project Member

Comment 142 by bugdroid1@chromium.org, Dec 15 2017

Labels: merge-merged-release-R63-10032.B-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4b3b6c9cfa3236c0f81ab029e067b8b53b5738ca

commit 4b3b6c9cfa3236c0f81ab029e067b8b53b5738ca
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date: Fri Dec 15 23:14:23 2017

UPSTREAM: drm/i915: avoid the last 8mb of stolen on BDW/SKL

The FBC hardware for these platforms doesn't have access to the
bios_reserved range, so it always assumes the maximum (8mb) is used.
So avoid this range while allocating.

This solves a bunch of FIFO underruns that happen if you end up
putting the CFB in that memory range. On my machine, with 32mb of
stolen, I need a 2560x1440 mode for that.

Testcase: igt/kms_frontbuffer_tracking/fbc-* (given the right setup)
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.kernel.org/patch/7177881/
(cherry picked from commit a9da512b3ed73045253afd778e40d4298f42905b)
Signed-off-by: Dongseong Hwang <dongseong.hwang@intel.com>

BUG= 776613 
TEST=reboot and basic validation on cyan

Change-Id: Ifb75889f9625e56066f8180e22538751559c4dd3
Tested-by: Yu Kang Ku <yu.kang.ku@intel.com>
Tested-by: Brian Wilson <brian.wilson@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/818337
Commit-Ready: Dongseong Hwang <dongseong.hwang@intel.com>
Tested-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Reviewed-by: Duncan Laurie <dlaurie@google.com>
Reviewed-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Puneet Kumar <puneetster@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/828302
Commit-Queue: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@chromium.org>

[modify] https://crrev.com/4b3b6c9cfa3236c0f81ab029e067b8b53b5738ca/drivers/gpu/drm/i915/i915_gem_gtt.h
[modify] https://crrev.com/4b3b6c9cfa3236c0f81ab029e067b8b53b5738ca/drivers/gpu/drm/i915/i915_drv.h
[modify] https://crrev.com/4b3b6c9cfa3236c0f81ab029e067b8b53b5738ca/drivers/gpu/drm/i915/i915_gem_stolen.c
[modify] https://crrev.com/4b3b6c9cfa3236c0f81ab029e067b8b53b5738ca/drivers/gpu/drm/i915/intel_fbc.c

Project Member

Comment 143 by bugdroid1@chromium.org, Dec 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/64d05cf800043259d2c2fa99dfe044edaa0f06cf

commit 64d05cf800043259d2c2fa99dfe044edaa0f06cf
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date: Fri Dec 15 23:23:46 2017

UPSTREAM: drm/i915: don't use the first stolen page on Broadwell

The spec says we just can't use it.

v2:
  - Add WA name (Ville).
  - Add a big comment explaining that we still didn't fix the problem
    where we inherit a framebuffer on the first page (Chris, Ville).

Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.kernel.org/patch/7250811/
(cherry picked from commit 1c3e804bfe67fc8bb336f8a456348adff2b9df26)
Signed-off-by: Dongseong Hwang <dongseong.hwang@intel.com>

BUG= chromium:776613 
TEST=reboot and basic validation on cyan
CQ-DEPEND=CL:818337

Change-Id: Ia2219124663ae874fbe6c85e4f17a808df14a8e3
Tested-by: Yu Kang Ku <yu.kang.ku@intel.com>
Tested-by: Brian Wilson <brian.wilson@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/818338
Commit-Ready: Dongseong Hwang <dongseong.hwang@intel.com>
Tested-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Reviewed-by: Dongseong Hwang <dongseong.hwang@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/830682
Reviewed-by: Kristian H. Kristensen <hoegsberg@chromium.org>
Commit-Queue: Dongseong Hwang <dongseong.hwang@intel.com>

[modify] https://crrev.com/64d05cf800043259d2c2fa99dfe044edaa0f06cf/drivers/gpu/drm/i915/i915_gem_stolen.c

Status: Fixed (was: Assigned)
I believe it's fixed now :)
That's great news, thanks!
The #1 crash reported for the latest beta (M64-BETA-CHROMEOS-4) is tagged with this signature, with more than 1000 instances since the launch late Tuesday (U.S. time).

Can this be reviewed to identify whether it is a new issue or this bug left unresolved?

Thanks

https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20product.Version%3D%2710176.41.0%27%20AND%20stable_signature%3D%27crash_reporter-udev-collection-change-card0-drm%27&sql_dialect=googlesql&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D&stbtiq=&reportid=&index=0


Re #146: I don't know if this question is directed at us, but we don't have access to the links. We're also working on merging patches to solve another signature of a GPU hang that could be related. Please see b/37163270
Have those fixes actually landed in M63/M64? We have a customer who still has this issue with the latest version.
Logs from affected device - https://drive.google.com/drive/folders/1JIk_7420PHzNDUFYu-OiTIMeA3opQRRT?usp=sharing
Cc: vkhabarov@google.com
The build from those logs should have the CLs referenced above if they landed in December, as it was cut in January.
#148 - The fixes landed in ToT and the M64 release.
What's the version of the Chrome OS image on which customers encountered the issue?
Project Member

Comment 152 by bugdroid1@chromium.org, Jan 30 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/9e2f4545fb49446c11779e9edf276522cca50c83

commit 9e2f4545fb49446c11779e9edf276522cca50c83
Author: Aditya Swarup <aditya.swarup@intel.com>
Date: Tue Jan 30 22:54:22 2018

media-libs/mesa: Revert CHROMIUM-disable-hiz-on-braswell.patch

Reverting disable hiz on braswell patch as GPU hangchecks
and gpu hangs cannot be reproduced after fix for
BUG= 776613 

Removed:
	17.0-CHROMIUM-disable-hiz-on-braswell.patch
	BUG=b:35574152
	TEST=Run test_that --iterations 500 -b cyan 10.54.74.124 graphics_GLBench
	or glbench test on Chromebook.

Change-Id: I9be716c0c5ee306ac759354d52b38e2a6a9725f6
Reviewed-on: https://chromium-review.googlesource.com/892300
Commit-Ready: Gurchetan Singh <gurchetansingh@chromium.org>
Tested-by: Gurchetan Singh <gurchetansingh@chromium.org>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>

[modify] https://crrev.com/9e2f4545fb49446c11779e9edf276522cca50c83/media-libs/mesa/mesa-17.2.3.ebuild
[modify] https://crrev.com/9e2f4545fb49446c11779e9edf276522cca50c83/media-libs/mesa/mesa-9999.ebuild
[rename] https://crrev.com/9e2f4545fb49446c11779e9edf276522cca50c83/media-libs/mesa/mesa-17.2.3-r14.ebuild
[delete] https://crrev.com/abc117d2324c0ed31e0d7e4cc875dd3983e049be/media-libs/mesa/files/17.0-CHROMIUM-disable-hiz-on-braswell.patch

Project Member

Comment 153 by sheriffbot@chromium.org, Feb 12 2018

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 154 by sheriffbot@chromium.org, Feb 16 2018

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Approved-64
Also landed in M63 since mid Dec 2017:

https://crosland.corp.google.com/cl?q=chromium:818338
Cc: ka...@chromium.org