
Issue 601369

Starred by 4 users

Issue metadata

Status: WontFix
Owner:
Closed: Apr 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug




[Measurement flake] Android system health plan: +6% private_dirty regression in GPU process Android

Project Member Reported by primiano@chromium.org, Apr 7 2016

Issue description

Just spotted this, which causes a visible effect on the overall health plan.
For some reason no alert was triggered on this, even though the regression seems clearly visible beyond the noise level.

https://chromeperf.appspot.com/report?sid=d6cc15cda717ef10a1c98a07d3419403d35623a3998669f0166af3f7ee3b7bf0&start_rev=378097&end_rev=385606

Regression range:
http://test-results.appspot.com/revision_range?start=380403&end=380419

Kicking bisect
 
Cc: pras...@chromium.org sullivan@chromium.org
+prasadv,sullivan: bisects are failing here as well :(
Cc: aelias@chromium.org ericrk@chromium.org
+aelias@,ericrk@: can I ask for a minute of your time for some help?
TL;DR is:
 - A regression (link in #2) is showing up as increased GL and dirty memory in the GPU process.
 - The CL range in question is very small (17 CLs):
 https://chromium.googlesource.com/chromium/src/+log/fd7ede807eb0e64820c805e8d5d0caf7ba9c5516%5E..5fb40e71b18c6c32dce4451821c7866fe80ea5e9?pretty=oneline
 - Bisects are failing for mysterious reasons.

If you can kick off some manual tryjobs / rebuild locally, I could use a little help speculating on the possible culprit without trying the 17 CLs one by one.

The possible suspects I can see there (but please take a look at the full list):
1) FrameView scrolling https://codereview.chromium.org/1752043002
   Could that affect GL + GPU process memory usage?
2) The ANGLE roll in https://codereview.chromium.org/1780463006
   Do we use ANGLE on Android? (It doesn't seem so from ui/gl/gl.gyp, but I'm not sure if there are other deps.)
3) Bubble dialogs https://codereview.chromium.org/1759453002
   Could that code possibly affect Android? (IIRC ui/views is not used on Android, right?)
Cc: -perezju@chromium.org siev...@chromium.org jmad...@chromium.org
We do use ANGLE on Android for validation (unless that changed at some point), and it's the only code in there that runs in the GPU process. jmadill@, could your changes https://chromium.googlesource.com/angle/angle.git/+/3f01e6c2531c1ff5518aeb676b617bdc1452a81f "Use std::unordered_map in ResourceManager." or https://chromium.googlesource.com/angle/angle.git/+/4e25a0d6cf2e5facdce4f90cb28b024bade1b55f "Return pointers from check*Allocation." have caused a memory increase?
Cc: perezju@chromium.org
Not in any scenario I could imagine. Those changes do not run on Android, since they are part of libANGLE, which is outside the shader translator.
OK, thanks. Another possible culprit in the ANGLE roll is https://chromium.googlesource.com/angle/angle.git/+/aa57aa4c907fc2160adbc88b0101bc67a0b8f3c2; I'm not even sure if/where we use that texture type, but if we do, it could cause increased memory usage by the driver (which would also explain the GL usage observed in #2). It's not a very suspicious CL, but I see nothing other than the ANGLE roll that could increase GPU process dirty. If it's not that, I would double-check whether the bisect range is correct.
You don't; those changes are outside of Android, in libANGLE. You guys only use the shader translator.
Yes, sorry, just noticed that right after I posted #8.  My best theory then is that the regression range is wrong.  I can't come up with even a Rube Goldberg story for how any of those patches might cause the regression.  ("FrameView scrolling" is a small refactoring in Blink, and Aura is indeed not linked in on Clank.)  The range looks much larger when I click on the chromeperf/ link in #0, maybe there was a mistake in obtaining the narrow range posted in #4?
My only rube goldberg tie to ANGLE might be that somehow we're linking in libANGLE even though we're not currently using it.
I was looking at the wrong spike on the graph; I see where primiano@ got the regression range from. But it's still confusing. Another place to look might be the clank/ change range: is this bot running a full Clank (as opposed to Chrome public) build? (Not that I would normally expect clank/ changes to affect GPU process memory either.)
OK, I'm starting to suspect the range as well.
I found three devices where GL memory went up around the same time, but their ranges don't overlap at all.
This smells like some device change / issue.

https://chromeperf.appspot.com/report?sid=fd0aac3ab1da818a6c9b0613f64e3d1c63eaf0dd119d2d3dae0f583b86b6b840&start_rev=379468&end_rev=381471

sullivan: do you remember if anything interesting happened around March 9th?

re comment 13: something does seem strange here, but some of the regressions occur on March 9 and some on March 10, so it seems unlikely to be a bot issue.

Comment 15 by k...@google.com, Apr 12 2016

Status: Assigned (was: Untriaged)
Issue 594091 has been merged into this issue.
I've just noted that issue 594091 was a duplicate of this. I also tried some bisects there, which never worked, but I was able to reproduce the regression locally. Copy-pasting from there:

~~~
I've checked and eliminated any obvious problems with the device. No apparent issues there.

I ran the benchmark locally comparing the good and bad revisions and *was* able to reproduce the regression.
https://x20web.corp.google.com/~perezju/memory_health/crbug594091/results.html

The largest relative increases appear on:
- foreground-memory_android_memtrack_gl_gpu_process/http_yahoo_com (59.47%)
- foreground-memory_mmaps_private_dirty_gpu_process/http_yahoo_com (38.11%)
~~~
Update: I spent most of today on this one. I can confirm there is a regression, reproducible locally, in GPU private dirty in the range 380403-380419, but whenever I try to bisect I end up with jumping numbers (see screenshot below).

I am now trying again, setting up a script which builds each revision in the range and tests each revision 3 times from scratch (killing all processes). Hopefully that should tell where the jumpiness starts. Very likely the results will come tomorrow, as this will take time.
regr.png (52.5 KB)
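For reference, a minimal sketch of what such a per-revision build-and-run loop could look like. This is not the actual script used here; the benchmark name, build target, package name, and flags are assumptions for illustration only.

~~~
#!/usr/bin/env python
# Hypothetical sketch only: build each revision in the suspect range and run
# the memory benchmark 3 times per revision, killing Chrome between runs.
# Benchmark name, build target, and package name are assumptions.
import subprocess

GOOD = 'fd7ede807eb0e64820c805e8d5d0caf7ba9c5516'  # start of the range in #4
BAD = '5fb40e71b18c6c32dce4451821c7866fe80ea5e9'   # end of the range in #4
RUNS_PER_REVISION = 3

def run(cmd):
    print('$ %s' % ' '.join(cmd))
    subprocess.check_call(cmd)

revisions = subprocess.check_output(
    ['git', 'rev-list', '--reverse', '%s^..%s' % (GOOD, BAD)],
    universal_newlines=True).split()

for rev in revisions:
    run(['git', 'checkout', rev])
    run(['gclient', 'sync'])  # keep DEPS (e.g. the ANGLE roll) in sync with the revision
    run(['ninja', '-C', 'out/Release', 'chrome_public_apk'])
    for i in range(RUNS_PER_REVISION):
        # Start each run from scratch: force-stop any leftover Chrome processes.
        run(['adb', 'shell', 'am', 'force-stop', 'org.chromium.chrome'])
        run(['tools/perf/run_benchmark', 'memory.top_10_mobile',  # assumed benchmark name
             '--browser=android-chromium',
             '--results-label=%s-run%d' % (rev[:8], i)])
~~~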
Status: WontFix (was: Assigned)
Summary: [Measurement flake] Android system health plan: +6% private_dirty regression in GPU process Android (was: Android system health plan: +6% private_dirty regression in GPU process Android)
OK, I think this is just a timing issue. I checked all the revisions manually, 3 times each, and it seems that the values for GPU dirty memory are pretty bimodal.
I would be really curious to know what triggered this bimodality all of a sudden, as it happens reliably on the main waterfall, but I have really spent too much energy on this bug.
Attaching perf results and a screenshot. If somebody wants to investigate further, please do, but I'm not going to spend more days on this (especially in light of petercermak's upcoming new benchmarks, which should get rid of all these problems).

picksi/amineer: if any objection comes up from the sys health council, we should just treat this as a benchmark flake and move on. We have enough data here to prove that this is actually a flake, and at the same time we are working to improve the situation going forward. I think there is nothing more we can do here.
regr.png (52.7 KB)
results.html (1.7 MB)
Thanks for your time and energy investigating this. I will add a note to the System Health plan saying this is a flake in the benchmark.
The SHP dashboard (https://chrome-health.googleplex.com/health-plan/android-chrome/memory/nexus5/) is, at the time of writing, showing overall PSS as ~4% regressed. I see that private dirty is up by ~3%. Can you confirm that these regressions are the effect of (identical with) the private dirty GPU regression discussed here?

Also, can you confirm that the flakiness/bimodality you refer to is one of timing? Is the (hand-wavy) story: something has recently changed the timing of GC (or allocation, etc.), so when we grab our memory data we see different numbers than previously, but if we changed the timing the regression would vanish?
> I see that private dirty is up by ~3%. Can you confirm that these regressions are the effect of (identical with) the private dirty gpu regression discussed here?
The effects of this issue that we should "discount" from the plan are:
1.8% on both total PSS and total private dirty, plus ~3% of Android Graphics.
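
As a back-of-envelope sanity check (absolute numbers invented; the bug doesn't state the GPU process's share of the totals), a regression confined to the GPU process dilutes into the totals in proportion to that share:

~~~
# Illustration with invented numbers; only the ratios matter. If the GPU
# process held ~30% of total private dirty, a +6% regression there would
# surface as roughly +1.8% on the total.
gpu = 30.0   # MiB, hypothetical GPU-process private dirty
rest = 70.0  # MiB, everything else

total_before = gpu + rest
total_after = gpu * 1.06 + rest  # +6% regression confined to the GPU process

print('total regression: %.1f%%' % (100.0 * (total_after / total_before - 1)))
# -> total regression: 1.8%
~~~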

> Also can you confirm the flakiness/bimodality you refer to is one of timing.
Yes. I cannot tell what caused the flakiness, but the results in #19 show that it just became more frequent at some point, causing the averages we track to drift.
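
To make the "averages drift" point concrete, here is a toy illustration (mode values and frequencies are invented, not taken from the attached results): the metric keeps jumping between the same two values, but once the high mode becomes more frequent the tracked mean moves by several percent even though no individual run changed.

~~~
# Toy illustration of a bimodal flake driving the tracked average.
# The two mode values and their frequencies are invented for this example.
low_mode, high_mode = 20.0, 30.0  # MiB: the two values the metric jumps between

def tracked_mean(high_fraction):
    return high_fraction * high_mode + (1.0 - high_fraction) * low_mode

before = tracked_mean(0.10)  # high mode seen in ~10% of runs
after = tracked_mean(0.25)   # timing change makes the high mode more frequent

print('mean before: %.1f MiB, after: %.1f MiB (+%.1f%%)'
      % (before, after, 100.0 * (after / before - 1)))
# -> mean before: 21.0 MiB, after: 22.5 MiB (+7.1%)
~~~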

