New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 669775 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug-Regression



Sign in to add a comment

Runaway GPU memory usage

Project Member Reported by sdy@chromium.org, Nov 30 2016

Issue description

Version: 57.0.2936.0
OS: macOS Sierra

What steps will reproduce the problem?
(1) Visit https://itch.io/b/149/a-good-bundle
(2) Scroll to the bottom and back to the top a couple of times.
(3) Watch the system "Memory Used" number in Activity Monitor

What is the expected result?
Stable memory usage.

What happens instead?
"Memory Used" starts to skyrocket (be careful, it consumes all of my ~16GB within 10-20 seconds), but no process is shown as using the memory. Closing the tab stops the growth. Killing the GPU process releases the memory.

Hardware:
MacBook Pro (Retina, 15-inch, Late 2013)
Intel Iris Pro 1536 MB / NVIDIA GeForce GT 750M
 

Comment 1 by sdy@chromium.org, Nov 30 2016

Attached a trace from chrome://tracing with GPU categories checked.
runaway_gpu_memory.json.gz
4.4 MB Download
Cc: erikc...@chromium.org
Hmm, it's staying constant when I try to reproduce it. Do you think this may be related to the IOSurface leak in  issue 662696 ?
Also failing to reproduce this. sdy: Can you try on latest Canary and see if it still repros?

Comment 4 by sdy@chromium.org, Nov 30 2016

I can reproduce this on my personal machine. Could you try on the Sierra test MBP? Otherwise, I'll bring mine in.
Cc: -erikc...@chromium.org
Owner: erikc...@chromium.org
Status: Assigned (was: Untriaged)
Confirmed the issue on macOS Sierra. Confirmed that this is not a dup of 662696. This is just another catastrophic memory leak. Investigating.
Closing the tab and the browser window does not cause memory usage to go down.
But killing the GPU process does reclaim the memory. I added logging to IOSurface creation, and the rate at which we're create IOSurfaces is much smaller than the rate at which we are losing [likely GPU?] memory.

I've attached a giant log from --enable-gpu-client-logging. Between line 714160 and 1085987, I was doing nothing, and the machine lost a couple of GBs of system memory [order of magnitude 5 seconds].
test.zip
6.8 MB Download
"""
Users-MacBook-Pro:src user$ ./tools/bisect-builds.py -a mac --use-local-cache -g 414607 -b 435110 -- https://itch.io/b/149/a-good-bundle
Downloading list of known revisions...
Saved revisions 15734-435266 to /Users/user/src/chromium/src/tools/.bisect-builds-cache.json
Downloading revision 425290...
Received 71937641 of 71937641 bytes, 100.00%
Bisecting range [414614 (good), 435110 (bad)].
Trying revision 425290...
Revision 425290 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b
Downloading revision 419718...
Bisecting range [414614 (good), 425290 (bad)].
Trying revision 419718...
Revision 419718 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b
Downloading revision 416930...
Bisecting range [414614 (good), 419718 (bad)].
Trying revision 416930...
Revision 416930 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b
Downloading revision 415935...
Bisecting range [414614 (good), 416930 (bad)].
Trying revision 415935...
Revision 415935 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b
Downloading revision 415031...
Bisecting range [414614 (good), 415935 (bad)].
Trying revision 415031...
Revision 415031 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g
Downloading revision 415575...
Bisecting range [415031 (good), 415935 (bad)].
Trying revision 415575...
Revision 415575 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b
Downloading revision 415276...
Bisecting range [415031 (good), 415575 (bad)].
Trying revision 415276...
Revision 415276 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g
Downloading revision 415433...
Bisecting range [415276 (good), 415575 (bad)].
Trying revision 415433...
Revision 415433 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b
Downloading revision 415317...
Bisecting range [415276 (good), 415433 (bad)].
Trying revision 415317...
Revision 415317 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g
Downloading revision 415354...
Bisecting range [415317 (good), 415433 (bad)].
Trying revision 415354...
Revision 415354 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g
Downloading revision 415396...
Bisecting range [415354 (good), 415433 (bad)].
Trying revision 415396...
Revision 415396 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g
Downloading revision 415418...
Bisecting range [415396 (good), 415433 (bad)].
Trying revision 415418...
Revision 415418 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g
You are probably looking for a change made after 415418 (known good), but no later than 415433 (first known bad).
CHANGELOG URL:

"""
Labels: M-55 ReleaseBlock-Stable
Cc: erikc...@chromium.org ericrk@chromium.org
Labels: -Pri-1 Pri-0
Owner: cblume@chromium.org
I've confirmed that running 415433 with --disable-gpu-rasterization does not exhibit the issue, but running 415433 with gpu raster does exhibit the problem.

This makes me strongly suspect:
"""
Store mipmap levels in deferred texture image
https://chromium.googlesource.com/skia.git/+/33e0cb5e7f9926b96db209c825f1eeca7c15bb16
"""

whose associated crbug is "Ganesh Image Decodes: mipmap support."

We are in the process of rolling out M55 (it's already at 5% population). What percentage of webpages/population do we expect this to effect? Setting to Pri-0 until we determine that it should be a lower priority.
I've confirmed that on tip of tree, modifying src/image/SkImage_Gpu.cpp so that "useMipMaps" is always "false" prevents the problem from occurring. This makes me certain that I've pinpointed the correct CL.
Is there any way to selectively disable this feature? If not, I think we should disable GPU raster on M-55, and then respin. 

Then we can mark this as RB-S M-56, Pri-1.
Cc: shrike@chromium.org rsch...@chromium.org rpop@chromium.org
+shrike, rpop, rschoen for P0 priority assessment and Ganesh.
Cc: pbomm...@chromium.org mmoss@chromium.org
Please note that we already cut M55 pre-stable RC for release. Please let us know ASAP if we need to retrigger RC with disable GPU raster per comment #13. Thank you.
cblume@, ericrk@ - what are your thoughts on c#13?

c#11 says M55 is at 5% but c#15 says we have a pre-stable RC for M55 - which is correct?

c#15 is correct. 
M55 is in not in stable yet. We cut pre-stable RC for small percentage windows and Mac roll out.
OK - we should not send out the current RC before we resolve c#13.
I can repro the issue.  Taking a look at the mipmap patch.

Comment 20 by ericrk@google.com, Nov 30 2016

We can definitely selectively disable this feature - it's just an optimization (nice to have, but not vital).... Unless we can quickly determine the issue/fix, I'd recommend just disabling mipmaps in deferred image and trying to re-enable this for 56.
This seems pretty bad, severity-wise. Do we know how common it is for the memory leak to be triggered?

I tentatively vote for a new RC, but am willing to be swayed otherwise.
Cc: cblume@chromium.org
Owner: ericrk@chromium.org
ericrk is working on a patch for M-55 [and testing to make sure the issue is fixed]
I'm going to land a patch that disables the mipmap pre-generation path in M55 (basically does what erikchen@ tried in #12). Will cherry-pick this to the Skia M55 branch and get the merge in ASAP.

This change is small, and disables the mipmap-pre-generation path, bringing us back to our M54/M53 behavior.

rschoen@ - I can't repro on my local machine, which makes it hard to determine the exact problem and how widespread it might be. However, we have a number of people who can repro on this bug, so I think it's probably pretty widespread. Given that I feel we have a safe fix, I vote for fixing and re-spinning.
SGTM. govind is that ok?
Ok,sounds good (as per comment #23, change seems to be safe and small) 
ericrk@, please request a merge to M55 once change is ready. I will approve the merge and retrigger M55 RC with this fix in. Thank you.
Cc: anan...@chromium.org
Project Member

Comment 27 by bugdroid1@chromium.org, Nov 30 2016

The following revision refers to this bug:
  https://skia.googlesource.com/skia.git/+/a422d706b1a92e71c6f940de3ed3d0b84a13eb1b

commit a422d706b1a92e71c6f940de3ed3d0b84a13eb1b
Author: Eric Karl <ericrk@chromium.org>
Date: Wed Nov 30 19:09:29 2016

Disable mipmap pre-generation path

There is a bug in the mipmap pre-generation logic in use in
getDeferredTextureImageData. This can cause runaway memory leaks, so we
are disabling this path until we can investigate further.

BUG=669775

Change-Id: I2027f6f7994e089edd4f3452284e894752b31779
Reviewed-on: https://skia-review.googlesource.com/5357
Reviewed-by: Brian Salomon <bsalomon@google.com>

[modify] https://crrev.com/a422d706b1a92e71c6f940de3ed3d0b84a13eb1b/src/image/SkImage_Gpu.cpp

Can we compile a list of hardware (machines, GPU card, retina/non-retina, etc) where this is breaking? That might make it easier to narrow down and help the team fix it. 

Maybe a separate doc to coordinate rather than clutter up the bug?
Labels: Merge-Request-55
Requesting a merge for the cl in #27
Labels: -Merge-Request-55 Merge-Approved-55
Approving merge to M55 branch 2883 based on comment #23. Please merge ASAP. Thank you.
Re#28, that could be useful - Here's a doc for tracking repro status. Please update if you have a machine which reproduces the issue:
https://docs.google.com/spreadsheets/d/1NOgQ1bul2qThtQS0K7mF_LhCWFyj-6e6xYhIryV0b0k/edit?usp=sharing
Project Member

Comment 32 by bugdroid1@chromium.org, Nov 30 2016

Labels: merge-merged-m55
The following revision refers to this bug:
  https://skia.googlesource.com/skia.git/+/d1740f81c843c65acd58d1b571ce94b90fee99d0

commit d1740f81c843c65acd58d1b571ce94b90fee99d0
Author: Eric Karl <ericrk@chromium.org>
Date: Wed Nov 30 19:44:45 2016

Disable mipmap pre-generation path

There is a bug in the mipmap pre-generation logic in use in
getDeferredTextureImageData. This can cause runaway memory leaks, so we
are disabling this path until we can investigate further.

BUG=669775

Change-Id: I2027f6f7994e089edd4f3452284e894752b31779
Reviewed-on: https://skia-review.googlesource.com/5357
Reviewed-by: Brian Salomon <bsalomon@google.com>

Cherry-pick to M55

NOTREECHECKS=true
NOTRY=true
NOPRESUBMIT=true

GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=5385

Change-Id: I46f81d918613b885d29332b7f0070dc518219935
Reviewed-on: https://skia-review.googlesource.com/5385
Reviewed-by: Brian Salomon <bsalomon@google.com>
Commit-Queue: Brian Salomon <bsalomon@google.com>

[modify] https://crrev.com/d1740f81c843c65acd58d1b571ce94b90fee99d0/src/image/SkImage_Gpu.cpp

Comment 33 Deleted

Labels: -Merge-Approved-55
Per comment #32, this is already merged to M55. Hence, removing "Merge-Approved-55" label. 

Triggering new RC with this fix in. Thank you everyone.
Project Member

Comment 35 by bugdroid1@chromium.org, Nov 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/48842325265cf85628f8ecea563d4326c63c827e

commit 48842325265cf85628f8ecea563d4326c63c827e
Author: skia-deps-roller <skia-deps-roller@chromium.org>
Date: Wed Nov 30 20:45:46 2016

Roll src/third_party/skia/ 9ac7b2c54..a422d706b (2 commits).

https://skia.googlesource.com/skia.git/+log/9ac7b2c54576..a422d706b1a9

$ git log 9ac7b2c54..a422d706b --date=short --no-merges --format='%ad %ae %s'
2016-11-30 ericrk Disable mipmap pre-generation path
2016-11-30 borenet [nobuildbot] Rename Win bots back to Win8/Win10

BUG=669775

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel
TBR=brianosman@google.com

Review-Url: https://codereview.chromium.org/2545563002
Cr-Commit-Position: refs/heads/master@{#435408}

[modify] https://crrev.com/48842325265cf85628f8ecea563d4326c63c827e/DEPS

I am investigating.

There are a handful of interesting things going on. First, we have several animated gifs. Second, those are scaled. So each frame of the animated gif is triggering a set of mipmaps to be generated.

The part that seems strange to me is that I don't need to scroll. It is as if the gif loops and is not reusing previous frames.



In order to see anything, I needed to close every program except terminal. Build and launch Chromium from the terminal. Navigate to that page and wait.
I was able to confirm that Eric Karl's patch works around the problem for now.

So the problem is clearly within mipmap pregeneration.


I notice this page starts with images at one size and then uses JavaScript to resize the images. This might mean that the set of parameters originally given might not include mipmaps. Then maybe a new set of parameters is used after scaled?

But I'm still after why it seems to not reuse previous frames.
Given that the memory isn't attributed to a Chrome proc, and that the presence of other apps impacts the leak (per #36) - I wonder if this is a driver issue? It seems possible that other apps were forcing the driver to flush out some caches, while without these apps things just balloon.
Labels: -Pri-0 -ReleaseBlock-Stable Pri-1
Owner: cblume@chromium.org
Dropping to P1, removing RBS, as we have a mitigation in place.

Re-assigning to cblume to investigate addressing the core issue and re-enabling the mipmap predecode feature.
It turns out there is a cap being respected.

I always closed it before hitting max memory since I feared paging to bring me to a crawl and make it hard for me to recover. However, if I just leave it be it will go up from ~4.33 GB (before loading the page) to 14.55 GB (after loading the page). I have 16 GB of ram on this machine.

It doesn't seem to be allocating any space once it hits that cap. Swap is never increased.

This also explains why I needed to close all my other programs to see the effects. Before closing the other programs I was already near 14 GB of memory usage.


The fact that it is respecting a cap is weird to me as well.

I'm going to make a little test program to see if creating several textures would show the memory being used by our app or not.
If textures are listed as memory used by our app, that leads towards a driver bug that we'll need to find a workaround for.
If textures are not listed as memory used by our app, it is possible we're just allocating memory endlessly until we hit some cap in our app.



It is also possible that this is just entirely correct.
There are many animated gifs. All of them are being resized, triggering mipmaps.
And entire mipmap chain is generated for each frame.
The cap we are hitting might just be the amount of memory it should take.


Take this image for example:
https://img.itch.io/aW1hZ2UvMzYzNC82NzA0Mi5naWY=/315x250%23c/qvR26k.gif
It is ~6s of animation. 315x250.
315 * 250 = 78,750 pixels.
78,750 pixels * 4 bytes per pixel = 315,000 bytes per frame.
315,000 bytes * 1.3 for roughly the mipmap chain memory = 409,500 bytes per frame.

If we assume 60 fps that's:
409,500 bytes per frame * 60 frames per second = 24,570,000 bytes per second.
This image is ~6 seconds, so 147,420,000 bytes for this one image + mipmaps.
That's 140 MB for one image.
14 GB / 140 MB = 102.4 images.

So does this page have ~102 images?
It might.



However, it feels like all the images would have completed their animation loop before we hit that 14 GB of memory usage. The memory usage seems to move up linearly and slowly enough that it takes more than 6 second.

Comment 41 by sdy@chromium.org, Dec 1 2016

Nice catch — if I start with enough free memory, it stabilizes for me too!

If I start with less free memory (or just open a second copy of the page…), I hit a little over 4GB of swap and then the system throws up a "Your system has run out of application memory" dialog and started sending SIGSTOP to applications.
I created a separate OpenGL test app to see where mipmapped textures are reported on MacOS by Activity Monitor.

Sure enough, when I deliberately leak texture memory it shows that memory as belonging to my app. This is not what we see in this bug.


This makes me feel even more strongly that this is a driver bug.


I'll think about how to best narrow down what is triggering the bad behavior in the driver.
After we know what is triggering it we can choose if there is a better way which doesn't trigger it or if we want to create a workaround for this specific driver.
Hello,

we just opened https://bugs.chromium.org/p/chromium/issues/detail?id=670875 and feel like the two bugs might be related. Just letting you know as this goes beyond my expertise ;)
I was investigating whether or not the fix (disable mipmap) requires a postmortem, and have been following the commit chain in this bug. It looks like this was a skia change committed in c#27 and then cherry-picked back to M55 in c#32. The cherry pick happened before the stable build and release of M55.

However, this was a change to skia and the cherry-pick was to the skia component of Chrome. In order for this new skia to make it into Chrome there needed to be a DEPS roll. This DEPS roll happened in c#35, but there was never a cherry-pick of that DEPS roll back to M55. Based on Omaha, the c#35 DEPS roll first appeared in 57.0.2938.0. So it seems like the fix never actually made it back to M55?

Please cherry-pick DEPS roll at c#35 to M55 so it gets picked up for next M55 Stable release if needed per c#44. Thank you.
From what I understood, there is no deps roll needed to get the change to M55 stable, as the stable builder should always pull from ToT of the special Skia M55 branch.

bsalomon@, can you confirm.
As far as the postmortem, I'm happy to do a postmortem once I have figured out the root cause and gathered enough information that we can improve from my mistake.


My impression at the moment is 1.) if there isn't already, maybe there can be a test to let a page sit for a while and monitor memory to make sure it doesn't cross some threshold, and 2.) make sure we have coverage of important driver/hardware combos with this test. Both of those are pretty expensive in terms of both money and test duration.

This memory leak is not reported as belonging to Chromium, which means a test like this would have to monitor all system memory, making it vulnerable to all sorts of system noise. This includes other builds / tests -- precluding it from being run in parallel with other tests and thus increasing the expense.

Given all that, maybe the way this all played out was ideal.
Re #46, this is correct. Once a change is cherry picked to origin/chrome/m55 in Skia the next M55 Chrome build will just pick it up. There is no DEPS roll required.
Thank you for the info.
Re #47, +1 to write a postmortem. Thank you cblume@.
Trying to gather the state of this P1 bug, is it fixed and we can close? Or what are the next steps then?
Labels: -Pri-1 Pri-2
Going to bump down to P2.

A work-around is in place (which bumped us down from P0).
Next steps are to investigate the root cause further and fix that.
We already know it is not a typical memory leak and likely comes from something in the driver we are triggering.

Outside the Px priority system, we talked about this being a lower priority than my current work (moving Chrome's image decoders to use SkCodec instead of the Blink image decoders). So P2 seems more correct.
Agree with P2 - The current workaround has disabled the feature - this feature never made it to stable, so this is not a regression. Turning this back on is more or less feature work at this point (not a bug fix).
Cc: brianosman@chromium.org
Hello ericrk@ - I notice that c#27 shows your fix for this bug landed on Nov 30 on ToT, and then in c#32 you cherry-picked the fix back to M55. The M56 branch point was Nov 17, so that implies your ToT patch landed in M57. Is it then the case that this patch does not exist in M56?

Thanks for catching this shrike@, I'll get this into M56 ASAP.
Labels: Merge-Approved-56
Per offline discussion approving for merge into M56
Merged to Skia M56 branch: https://skia.googlesource.com/skia/+/bf2d9e02d58ea01f1c239f7e2fc024cba140ccb1

Not sure why this bug didn't get updated - possibly just delayed.
Thanks for the merge. 
If there is no pending work please remove Merge-Approved-56 and add merge-merged-m56
Labels: -Merge-Approved-56 merge-merged-m56
Just a data point: I've tried to repro on current ToT on macOS Sierra by reverting ericrk@'s fix, and I can't seem to get the memory to spike (I start with ~6GB used and it goes up to ~7GB used on a site heavy with images).

Not sure if anything changed in the meanwhile. It seems that not everyone could consistently repro this though, so I'm a bit hesitant to say that it's better conclusively.
Cc: -mmoss@chromium.org

Comment 63 by sdy@chromium.org, Mar 13 2017

That page only shows the gifs on hover now, so it doesn't trigger the bug. You can "fix" it by scrolling all the way down and all the way up (to load all of the images) and then running this in the console:

    Array.prototype.map.call(document.querySelectorAll('div[data-gif]'), (el) => el.parentNode.style.backgroundImage = `url(${el.dataset.gif}`)

…but creating some kind of test page may be a better idea going forward.

Comment 64 by sdy@chromium.org, Mar 13 2017

(That page being the URL in the bug description: https://itch.io/b/149/a-good-bundle)
I can no longer repro.

I followed the steps in #63 and am using the exact same checkout that repro'ed before.


The only difference is I was forced to upgrade my MacOS version for security reasons.

I used to be on 10.11.6 and am now on 10.12.3.

This matches what we found before -- we could not repro on all of our devices / OS versions. It felt very much like a driver bug. And with no code change this is now fixed. Similarly, the memory was not associated with Chrome (further indicating it was a driver bug).


Is there any way we can get a device with 10.11.6?
It would be nice to identify exactly what the driver bug was and write a work-around.

If we cannot find a device to repro, I'm willing to re-enable mipmaps and close this issue. Apple themselves have fixed it.
TEs have 10.11 devices in MTV. Please see them directly, or come find me and I can introduce you.

If we believe this is fixed in 10.12+, please do not just re-enable mipmaps. We have a significant number of users on 10.10 and older. 

Note that the bug was initially filed by a Chrome developer using 10.12.
erikchen@ and I were able to repro on his device which is running 10.12.3 (same as mine). I suspect I didn't trim my memory down enough before I attempted to repro.

His device has 2 GPUs. It did not repro when running on the NVidia GPU and did repro on the Intel Iris. All of the repros so far are on Intel Iris.

I'll try to repro on other chips/drivers as well (AMD, other Intels).

Once I know what chips/drivers it repros on (which may only be Iris), I'll add a gpu driver workaround to src/gpu/config/gpu_driver_bug_list_json.cc
Cc: -rsch...@chromium.org

Sign in to add a comment