Issue metadata
Sign in to add a comment
|
Runaway GPU memory usage |
||||||||||||||||||||
Issue descriptionVersion: 57.0.2936.0 OS: macOS Sierra What steps will reproduce the problem? (1) Visit https://itch.io/b/149/a-good-bundle (2) Scroll to the bottom and back to the top a couple of times. (3) Watch the system "Memory Used" number in Activity Monitor What is the expected result? Stable memory usage. What happens instead? "Memory Used" starts to skyrocket (be careful, it consumes all of my ~16GB within 10-20 seconds), but no process is shown as using the memory. Closing the tab stops the growth. Killing the GPU process releases the memory. Hardware: MacBook Pro (Retina, 15-inch, Late 2013) Intel Iris Pro 1536 MB / NVIDIA GeForce GT 750M
,
Nov 30 2016
Hmm, it's staying constant when I try to reproduce it. Do you think this may be related to the IOSurface leak in issue 662696 ?
,
Nov 30 2016
Also failing to reproduce this. sdy: Can you try on latest Canary and see if it still repros?
,
Nov 30 2016
I can reproduce this on my personal machine. Could you try on the Sierra test MBP? Otherwise, I'll bring mine in.
,
Nov 30 2016
Confirmed the issue on macOS Sierra. Confirmed that this is not a dup of 662696. This is just another catastrophic memory leak. Investigating.
,
Nov 30 2016
Closing the tab and the browser window does not cause memory usage to go down.
,
Nov 30 2016
But killing the GPU process does reclaim the memory. I added logging to IOSurface creation, and the rate at which we're create IOSurfaces is much smaller than the rate at which we are losing [likely GPU?] memory. I've attached a giant log from --enable-gpu-client-logging. Between line 714160 and 1085987, I was doing nothing, and the machine lost a couple of GBs of system memory [order of magnitude 5 seconds].
,
Nov 30 2016
,
Nov 30 2016
""" Users-MacBook-Pro:src user$ ./tools/bisect-builds.py -a mac --use-local-cache -g 414607 -b 435110 -- https://itch.io/b/149/a-good-bundle Downloading list of known revisions... Saved revisions 15734-435266 to /Users/user/src/chromium/src/tools/.bisect-builds-cache.json Downloading revision 425290... Received 71937641 of 71937641 bytes, 100.00% Bisecting range [414614 (good), 435110 (bad)]. Trying revision 425290... Revision 425290 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b Downloading revision 419718... Bisecting range [414614 (good), 425290 (bad)]. Trying revision 419718... Revision 419718 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b Downloading revision 416930... Bisecting range [414614 (good), 419718 (bad)]. Trying revision 416930... Revision 416930 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b Downloading revision 415935... Bisecting range [414614 (good), 416930 (bad)]. Trying revision 415935... Revision 415935 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b Downloading revision 415031... Bisecting range [414614 (good), 415935 (bad)]. Trying revision 415031... Revision 415031 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g Downloading revision 415575... Bisecting range [415031 (good), 415935 (bad)]. Trying revision 415575... Revision 415575 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b Downloading revision 415276... Bisecting range [415031 (good), 415575 (bad)]. Trying revision 415276... Revision 415276 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g Downloading revision 415433... Bisecting range [415276 (good), 415575 (bad)]. Trying revision 415433... Revision 415433 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: b Downloading revision 415317... Bisecting range [415276 (good), 415433 (bad)]. Trying revision 415317... Revision 415317 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g Downloading revision 415354... Bisecting range [415317 (good), 415433 (bad)]. Trying revision 415354... Revision 415354 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g Downloading revision 415396... Bisecting range [415354 (good), 415433 (bad)]. Trying revision 415396... Revision 415396 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g Downloading revision 415418... Bisecting range [415396 (good), 415433 (bad)]. Trying revision 415418... Revision 415418 is [(g)ood/(b)ad/(r)etry/(u)nknown/(s)tdout/(q)uit]: g You are probably looking for a change made after 415418 (known good), but no later than 415433 (first known bad). CHANGELOG URL: """
,
Nov 30 2016
,
Nov 30 2016
I've confirmed that running 415433 with --disable-gpu-rasterization does not exhibit the issue, but running 415433 with gpu raster does exhibit the problem. This makes me strongly suspect: """ Store mipmap levels in deferred texture image https://chromium.googlesource.com/skia.git/+/33e0cb5e7f9926b96db209c825f1eeca7c15bb16 """ whose associated crbug is "Ganesh Image Decodes: mipmap support." We are in the process of rolling out M55 (it's already at 5% population). What percentage of webpages/population do we expect this to effect? Setting to Pri-0 until we determine that it should be a lower priority.
,
Nov 30 2016
I've confirmed that on tip of tree, modifying src/image/SkImage_Gpu.cpp so that "useMipMaps" is always "false" prevents the problem from occurring. This makes me certain that I've pinpointed the correct CL.
,
Nov 30 2016
Is there any way to selectively disable this feature? If not, I think we should disable GPU raster on M-55, and then respin. Then we can mark this as RB-S M-56, Pri-1.
,
Nov 30 2016
+shrike, rpop, rschoen for P0 priority assessment and Ganesh.
,
Nov 30 2016
Please note that we already cut M55 pre-stable RC for release. Please let us know ASAP if we need to retrigger RC with disable GPU raster per comment #13. Thank you.
,
Nov 30 2016
cblume@, ericrk@ - what are your thoughts on c#13? c#11 says M55 is at 5% but c#15 says we have a pre-stable RC for M55 - which is correct?
,
Nov 30 2016
c#15 is correct. M55 is in not in stable yet. We cut pre-stable RC for small percentage windows and Mac roll out.
,
Nov 30 2016
OK - we should not send out the current RC before we resolve c#13.
,
Nov 30 2016
I can repro the issue. Taking a look at the mipmap patch.
,
Nov 30 2016
We can definitely selectively disable this feature - it's just an optimization (nice to have, but not vital).... Unless we can quickly determine the issue/fix, I'd recommend just disabling mipmaps in deferred image and trying to re-enable this for 56.
,
Nov 30 2016
This seems pretty bad, severity-wise. Do we know how common it is for the memory leak to be triggered? I tentatively vote for a new RC, but am willing to be swayed otherwise.
,
Nov 30 2016
ericrk is working on a patch for M-55 [and testing to make sure the issue is fixed]
,
Nov 30 2016
I'm going to land a patch that disables the mipmap pre-generation path in M55 (basically does what erikchen@ tried in #12). Will cherry-pick this to the Skia M55 branch and get the merge in ASAP. This change is small, and disables the mipmap-pre-generation path, bringing us back to our M54/M53 behavior. rschoen@ - I can't repro on my local machine, which makes it hard to determine the exact problem and how widespread it might be. However, we have a number of people who can repro on this bug, so I think it's probably pretty widespread. Given that I feel we have a safe fix, I vote for fixing and re-spinning.
,
Nov 30 2016
SGTM. govind is that ok?
,
Nov 30 2016
Ok,sounds good (as per comment #23, change seems to be safe and small) ericrk@, please request a merge to M55 once change is ready. I will approve the merge and retrigger M55 RC with this fix in. Thank you.
,
Nov 30 2016
,
Nov 30 2016
The following revision refers to this bug: https://skia.googlesource.com/skia.git/+/a422d706b1a92e71c6f940de3ed3d0b84a13eb1b commit a422d706b1a92e71c6f940de3ed3d0b84a13eb1b Author: Eric Karl <ericrk@chromium.org> Date: Wed Nov 30 19:09:29 2016 Disable mipmap pre-generation path There is a bug in the mipmap pre-generation logic in use in getDeferredTextureImageData. This can cause runaway memory leaks, so we are disabling this path until we can investigate further. BUG=669775 Change-Id: I2027f6f7994e089edd4f3452284e894752b31779 Reviewed-on: https://skia-review.googlesource.com/5357 Reviewed-by: Brian Salomon <bsalomon@google.com> [modify] https://crrev.com/a422d706b1a92e71c6f940de3ed3d0b84a13eb1b/src/image/SkImage_Gpu.cpp
,
Nov 30 2016
Can we compile a list of hardware (machines, GPU card, retina/non-retina, etc) where this is breaking? That might make it easier to narrow down and help the team fix it. Maybe a separate doc to coordinate rather than clutter up the bug?
,
Nov 30 2016
Requesting a merge for the cl in #27
,
Nov 30 2016
Approving merge to M55 branch 2883 based on comment #23. Please merge ASAP. Thank you.
,
Nov 30 2016
Re#28, that could be useful - Here's a doc for tracking repro status. Please update if you have a machine which reproduces the issue: https://docs.google.com/spreadsheets/d/1NOgQ1bul2qThtQS0K7mF_LhCWFyj-6e6xYhIryV0b0k/edit?usp=sharing
,
Nov 30 2016
The following revision refers to this bug: https://skia.googlesource.com/skia.git/+/d1740f81c843c65acd58d1b571ce94b90fee99d0 commit d1740f81c843c65acd58d1b571ce94b90fee99d0 Author: Eric Karl <ericrk@chromium.org> Date: Wed Nov 30 19:44:45 2016 Disable mipmap pre-generation path There is a bug in the mipmap pre-generation logic in use in getDeferredTextureImageData. This can cause runaway memory leaks, so we are disabling this path until we can investigate further. BUG=669775 Change-Id: I2027f6f7994e089edd4f3452284e894752b31779 Reviewed-on: https://skia-review.googlesource.com/5357 Reviewed-by: Brian Salomon <bsalomon@google.com> Cherry-pick to M55 NOTREECHECKS=true NOTRY=true NOPRESUBMIT=true GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=5385 Change-Id: I46f81d918613b885d29332b7f0070dc518219935 Reviewed-on: https://skia-review.googlesource.com/5385 Reviewed-by: Brian Salomon <bsalomon@google.com> Commit-Queue: Brian Salomon <bsalomon@google.com> [modify] https://crrev.com/d1740f81c843c65acd58d1b571ce94b90fee99d0/src/image/SkImage_Gpu.cpp
,
Nov 30 2016
Per comment #32, this is already merged to M55. Hence, removing "Merge-Approved-55" label. Triggering new RC with this fix in. Thank you everyone.
,
Nov 30 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/48842325265cf85628f8ecea563d4326c63c827e commit 48842325265cf85628f8ecea563d4326c63c827e Author: skia-deps-roller <skia-deps-roller@chromium.org> Date: Wed Nov 30 20:45:46 2016 Roll src/third_party/skia/ 9ac7b2c54..a422d706b (2 commits). https://skia.googlesource.com/skia.git/+log/9ac7b2c54576..a422d706b1a9 $ git log 9ac7b2c54..a422d706b --date=short --no-merges --format='%ad %ae %s' 2016-11-30 ericrk Disable mipmap pre-generation path 2016-11-30 borenet [nobuildbot] Rename Win bots back to Win8/Win10 BUG=669775 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel TBR=brianosman@google.com Review-Url: https://codereview.chromium.org/2545563002 Cr-Commit-Position: refs/heads/master@{#435408} [modify] https://crrev.com/48842325265cf85628f8ecea563d4326c63c827e/DEPS
,
Nov 30 2016
I am investigating. There are a handful of interesting things going on. First, we have several animated gifs. Second, those are scaled. So each frame of the animated gif is triggering a set of mipmaps to be generated. The part that seems strange to me is that I don't need to scroll. It is as if the gif loops and is not reusing previous frames. In order to see anything, I needed to close every program except terminal. Build and launch Chromium from the terminal. Navigate to that page and wait.
,
Nov 30 2016
I was able to confirm that Eric Karl's patch works around the problem for now. So the problem is clearly within mipmap pregeneration. I notice this page starts with images at one size and then uses JavaScript to resize the images. This might mean that the set of parameters originally given might not include mipmaps. Then maybe a new set of parameters is used after scaled? But I'm still after why it seems to not reuse previous frames.
,
Nov 30 2016
Given that the memory isn't attributed to a Chrome proc, and that the presence of other apps impacts the leak (per #36) - I wonder if this is a driver issue? It seems possible that other apps were forcing the driver to flush out some caches, while without these apps things just balloon.
,
Nov 30 2016
Dropping to P1, removing RBS, as we have a mitigation in place. Re-assigning to cblume to investigate addressing the core issue and re-enabling the mipmap predecode feature.
,
Dec 1 2016
It turns out there is a cap being respected. I always closed it before hitting max memory since I feared paging to bring me to a crawl and make it hard for me to recover. However, if I just leave it be it will go up from ~4.33 GB (before loading the page) to 14.55 GB (after loading the page). I have 16 GB of ram on this machine. It doesn't seem to be allocating any space once it hits that cap. Swap is never increased. This also explains why I needed to close all my other programs to see the effects. Before closing the other programs I was already near 14 GB of memory usage. The fact that it is respecting a cap is weird to me as well. I'm going to make a little test program to see if creating several textures would show the memory being used by our app or not. If textures are listed as memory used by our app, that leads towards a driver bug that we'll need to find a workaround for. If textures are not listed as memory used by our app, it is possible we're just allocating memory endlessly until we hit some cap in our app. It is also possible that this is just entirely correct. There are many animated gifs. All of them are being resized, triggering mipmaps. And entire mipmap chain is generated for each frame. The cap we are hitting might just be the amount of memory it should take. Take this image for example: https://img.itch.io/aW1hZ2UvMzYzNC82NzA0Mi5naWY=/315x250%23c/qvR26k.gif It is ~6s of animation. 315x250. 315 * 250 = 78,750 pixels. 78,750 pixels * 4 bytes per pixel = 315,000 bytes per frame. 315,000 bytes * 1.3 for roughly the mipmap chain memory = 409,500 bytes per frame. If we assume 60 fps that's: 409,500 bytes per frame * 60 frames per second = 24,570,000 bytes per second. This image is ~6 seconds, so 147,420,000 bytes for this one image + mipmaps. That's 140 MB for one image. 14 GB / 140 MB = 102.4 images. So does this page have ~102 images? It might. However, it feels like all the images would have completed their animation loop before we hit that 14 GB of memory usage. The memory usage seems to move up linearly and slowly enough that it takes more than 6 second.
,
Dec 1 2016
Nice catch — if I start with enough free memory, it stabilizes for me too! If I start with less free memory (or just open a second copy of the page…), I hit a little over 4GB of swap and then the system throws up a "Your system has run out of application memory" dialog and started sending SIGSTOP to applications.
,
Dec 1 2016
I created a separate OpenGL test app to see where mipmapped textures are reported on MacOS by Activity Monitor. Sure enough, when I deliberately leak texture memory it shows that memory as belonging to my app. This is not what we see in this bug. This makes me feel even more strongly that this is a driver bug. I'll think about how to best narrow down what is triggering the bad behavior in the driver. After we know what is triggering it we can choose if there is a better way which doesn't trigger it or if we want to create a workaround for this specific driver.
,
Dec 3 2016
Hello, we just opened https://bugs.chromium.org/p/chromium/issues/detail?id=670875 and feel like the two bugs might be related. Just letting you know as this goes beyond my expertise ;)
,
Dec 5 2016
I was investigating whether or not the fix (disable mipmap) requires a postmortem, and have been following the commit chain in this bug. It looks like this was a skia change committed in c#27 and then cherry-picked back to M55 in c#32. The cherry pick happened before the stable build and release of M55. However, this was a change to skia and the cherry-pick was to the skia component of Chrome. In order for this new skia to make it into Chrome there needed to be a DEPS roll. This DEPS roll happened in c#35, but there was never a cherry-pick of that DEPS roll back to M55. Based on Omaha, the c#35 DEPS roll first appeared in 57.0.2938.0. So it seems like the fix never actually made it back to M55?
,
Dec 5 2016
Please cherry-pick DEPS roll at c#35 to M55 so it gets picked up for next M55 Stable release if needed per c#44. Thank you.
,
Dec 5 2016
From what I understood, there is no deps roll needed to get the change to M55 stable, as the stable builder should always pull from ToT of the special Skia M55 branch. bsalomon@, can you confirm.
,
Dec 5 2016
As far as the postmortem, I'm happy to do a postmortem once I have figured out the root cause and gathered enough information that we can improve from my mistake. My impression at the moment is 1.) if there isn't already, maybe there can be a test to let a page sit for a while and monitor memory to make sure it doesn't cross some threshold, and 2.) make sure we have coverage of important driver/hardware combos with this test. Both of those are pretty expensive in terms of both money and test duration. This memory leak is not reported as belonging to Chromium, which means a test like this would have to monitor all system memory, making it vulnerable to all sorts of system noise. This includes other builds / tests -- precluding it from being run in parallel with other tests and thus increasing the expense. Given all that, maybe the way this all played out was ideal.
,
Dec 6 2016
Re #46, this is correct. Once a change is cherry picked to origin/chrome/m55 in Skia the next M55 Chrome build will just pick it up. There is no DEPS roll required.
,
Dec 6 2016
Thank you for the info.
,
Dec 6 2016
Re #47, +1 to write a postmortem. Thank you cblume@.
,
Dec 16 2016
Trying to gather the state of this P1 bug, is it fixed and we can close? Or what are the next steps then?
,
Dec 16 2016
Going to bump down to P2. A work-around is in place (which bumped us down from P0). Next steps are to investigate the root cause further and fix that. We already know it is not a typical memory leak and likely comes from something in the driver we are triggering. Outside the Px priority system, we talked about this being a lower priority than my current work (moving Chrome's image decoders to use SkCodec instead of the Blink image decoders). So P2 seems more correct.
,
Dec 16 2016
Agree with P2 - The current workaround has disabled the feature - this feature never made it to stable, so this is not a regression. Turning this back on is more or less feature work at this point (not a bug fix).
,
Jan 20 2017
,
Jan 23 2017
Hello ericrk@ - I notice that c#27 shows your fix for this bug landed on Nov 30 on ToT, and then in c#32 you cherry-picked the fix back to M55. The M56 branch point was Nov 17, so that implies your ToT patch landed in M57. Is it then the case that this patch does not exist in M56?
,
Jan 23 2017
Thanks for catching this shrike@, I'll get this into M56 ASAP.
,
Jan 23 2017
Per offline discussion approving for merge into M56
,
Jan 23 2017
Merged to Skia M56 branch: https://skia.googlesource.com/skia/+/bf2d9e02d58ea01f1c239f7e2fc024cba140ccb1 Not sure why this bug didn't get updated - possibly just delayed.
,
Jan 24 2017
Thanks for the merge. If there is no pending work please remove Merge-Approved-56 and add merge-merged-m56
,
Jan 24 2017
,
Mar 9 2017
Just a data point: I've tried to repro on current ToT on macOS Sierra by reverting ericrk@'s fix, and I can't seem to get the memory to spike (I start with ~6GB used and it goes up to ~7GB used on a site heavy with images). Not sure if anything changed in the meanwhile. It seems that not everyone could consistently repro this though, so I'm a bit hesitant to say that it's better conclusively.
,
Mar 9 2017
,
Mar 13 2017
That page only shows the gifs on hover now, so it doesn't trigger the bug. You can "fix" it by scrolling all the way down and all the way up (to load all of the images) and then running this in the console:
Array.prototype.map.call(document.querySelectorAll('div[data-gif]'), (el) => el.parentNode.style.backgroundImage = `url(${el.dataset.gif}`)
…but creating some kind of test page may be a better idea going forward.
,
Mar 13 2017
(That page being the URL in the bug description: https://itch.io/b/149/a-good-bundle)
,
Mar 18 2017
I can no longer repro. I followed the steps in #63 and am using the exact same checkout that repro'ed before. The only difference is I was forced to upgrade my MacOS version for security reasons. I used to be on 10.11.6 and am now on 10.12.3. This matches what we found before -- we could not repro on all of our devices / OS versions. It felt very much like a driver bug. And with no code change this is now fixed. Similarly, the memory was not associated with Chrome (further indicating it was a driver bug). Is there any way we can get a device with 10.11.6? It would be nice to identify exactly what the driver bug was and write a work-around. If we cannot find a device to repro, I'm willing to re-enable mipmaps and close this issue. Apple themselves have fixed it.
,
Mar 18 2017
TEs have 10.11 devices in MTV. Please see them directly, or come find me and I can introduce you. If we believe this is fixed in 10.12+, please do not just re-enable mipmaps. We have a significant number of users on 10.10 and older. Note that the bug was initially filed by a Chrome developer using 10.12.
,
Mar 18 2017
erikchen@ and I were able to repro on his device which is running 10.12.3 (same as mine). I suspect I didn't trim my memory down enough before I attempted to repro. His device has 2 GPUs. It did not repro when running on the NVidia GPU and did repro on the Intel Iris. All of the repros so far are on Intel Iris. I'll try to repro on other chips/drivers as well (AMD, other Intels). Once I know what chips/drivers it repros on (which may only be Iris), I'll add a gpu driver workaround to src/gpu/config/gpu_driver_bug_list_json.cc
,
Sep 20 2017
|
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by sdy@chromium.org
, Nov 30 20164.4 MB
4.4 MB Download