[Measurement flake] Android system health plan: +6% private_dirty regression in GPU process
Issue description: Just spotted this, which causes a visible effect on the overall health plan. For some reason no alert was triggered on it, even though the regression seems pretty clearly visible above noise levels. https://chromeperf.appspot.com/report?sid=d6cc15cda717ef10a1c98a07d3419403d35623a3998669f0166af3f7ee3b7bf0&start_rev=378097&end_rev=385606 Regression range: http://test-results.appspot.com/revision_range?start=380403&end=380419 Kicking off a bisect.
Apr 7 2016
Further data point: matches a regression in gl_total https://chromeperf.appspot.com/report?sid=a167329a92b24d0e60a310cd3f6075973d8b24784399d22529f67c852c688068&start_rev=379468&end_rev=381471
Apr 7 2016
+prasadv, sullivan: bisects are failing here as well :(
Apr 7 2016
+aelias@, ericrk@: can I ask a minute of your time for some help? TL;DR:
- A regression (link in #2) shows up as increased GL and dirty memory in the GPU process.
- The CL range in question is very small (17 CLs): https://chromium.googlesource.com/chromium/src/+log/fd7ede807eb0e64820c805e8d5d0caf7ba9c5516%5E..5fb40e71b18c6c32dce4451821c7866fe80ea5e9?pretty=oneline
- Bisects are failing for mysterious reasons.
If you can kick off some manual tryjobs / rebuild locally, I need a little help speculating on the possible culprit without trying the 17 CLs one by one. The possible suspects I can see there (but please take a look at the full list):
1) FrameView scrolling https://codereview.chromium.org/1752043002 Could that affect GL + GPU process memory usage?
2) The ANGLE roll in https://codereview.chromium.org/1780463006 Do we use ANGLE on Android? (It doesn't seem so from ui/gl/gl.gyp, but I'm not sure whether there are other deps.)
3) Bubble dialogs https://codereview.chromium.org/1759453002 Could that code affect Android? (IIRC ui/views is not used on Android, right?)
Apr 7 2016
We do use Angle on Android for validation (unless that changed at some point) and it's the only code in there that runs in the GPU process. jmadill@, could your changes https://chromium.googlesource.com/angle/angle.git/+/3f01e6c2531c1ff5518aeb676b617bdc1452a81f "Use std::unordered_map in ResourceManager." or https://chromium.googlesource.com/angle/angle.git/+/4e25a0d6cf2e5facdce4f90cb28b024bade1b55f "Return pointers from check*Allocation." have caused a memory increase?
Apr 7 2016
Apr 7 2016
Not in any scenario I can imagine. Those changes do not run on Android, since they are part of libANGLE, which is outside the shader translator.
Apr 7 2016
OK, thanks. Another possible culprit in the ANGLE roll is https://chromium.googlesource.com/angle/angle.git/+/aa57aa4c907fc2160adbc88b0101bc67a0b8f3c2; I'm not even sure if/where we use that texture type, but if we do, it could cause increased memory usage by the driver (which would also explain the GL usage increase observed in #2). It's not a very suspicious CL, but I see nothing other than the ANGLE roll that could increase GPU process dirty memory. If it's not that, I would double-check whether the bisect range is correct.
Apr 7 2016
You don't; these changes are outside of Android, in libANGLE. You guys only use the shader translator.
Apr 7 2016
Yes, sorry, I just noticed that right after I posted #8. My best theory then is that the regression range is wrong. I can't come up with even a Rube Goldberg story for how any of those patches might cause the regression. ("FrameView scrolling" is a small refactoring in Blink, and Aura is indeed not linked in on Clank.) The range looks much larger when I click on the chromeperf link in #0; maybe there was a mistake in obtaining the narrow range posted in #4?
Apr 7 2016
My only Rube Goldberg tie to ANGLE might be that somehow we're linking in libANGLE even though we're not currently using it.
Apr 7 2016
I was looking at the wrong spike on the graph; I see where primiano@ got the regression range from. But it's still confusing. Another place to look might be the clank/ change range: is this bot running a full Clank (as opposed to Chrome public) build? (Not that I would normally expect clank/ changes to affect GPU process memory either.)
Apr 7 2016
OK, I'm starting to suspect the range as well. I found three devices where GL memory went up around the same time, but their ranges don't overlap at all. This smells like some device change / issue. https://chromeperf.appspot.com/report?sid=fd0aac3ab1da818a6c9b0613f64e3d1c63eaf0dd119d2d3dae0f583b86b6b840&start_rev=379468&end_rev=381471 sullivan: do you remember if anything interesting happened around March 9th?
Apr 8 2016
re comment 13: something does seem strange here, but some of the regressions occur on March 9 and some on March 10, so it seems unlikely to be a bot issue.
Apr 12 2016
Apr 13 2016
Issue 594091 has been merged into this issue.
Apr 13 2016
I've just noticed that issue 594091 was a duplicate of this. I also tried some bisects there, which never worked, but I was able to reproduce the regression locally. Copy-pasting from there:
~~~
I've checked and eliminated any obvious problems with the device. No apparent issues there. I ran the benchmark locally comparing the good and bad revisions and *was* able to reproduce the regression. https://x20web.corp.google.com/~perezju/memory_health/crbug594091/results.html
The largest relative increases appear on:
- foreground-memory_android_memtrack_gl_gpu_process/http_yahoo_com (59.47%)
- foreground-memory_mmaps_private_dirty_gpu_process/http_yahoo_com (38.11%)
~~~
Apr 13 2016
Update: I spent most of today on this one. I can confirm there is a regression reproducible locally in GPU private dirty in the range 380403-380419, but whenever I try to bisect I end up with jumping numbers (see screenshot below). I am now trying again, setting up a script which builds each revision in the range and tests it 3 times from scratch (killing all processes); see the sketch below. Hopefully that should tell where the jumpiness starts. Very likely the results will come tomorrow, as this will take time.
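A minimal sketch of that kind of build-and-measure loop, assuming a Chromium Android checkout; the benchmark name, build target, and package name below are illustrative placeholders, not necessarily the exact ones used:
~~~
#!/usr/bin/env python
# Sketch: for each revision passed on the command line, build Chrome for
# Android and run the memory benchmark several times, so the per-revision
# value distributions can be compared for jumpiness.
import subprocess
import sys

RUNS_PER_REVISION = 3
BENCHMARK = 'memory.top_10_mobile'  # assumed benchmark name

def run(cmd):
    print('Running: %s' % ' '.join(cmd))
    subprocess.check_call(cmd)

for rev in sys.argv[1:]:
    run(['git', 'checkout', rev])
    run(['gclient', 'sync'])
    run(['ninja', '-C', 'out/Release', 'chrome_public_apk'])
    for i in range(RUNS_PER_REVISION):
        # Start each run from scratch: kill any leftover browser processes.
        subprocess.call(['adb', 'shell', 'am', 'force-stop',
                         'org.chromium.chrome'])
        run(['tools/perf/run_benchmark', BENCHMARK,
             '--browser=android-chromium',
             '--output-dir=results/%s_run%d' % (rev, i)])
~~~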
Apr 13 2016
OK, I think this is just a timing issue. I manually checked all the revisions, 3 times each, and it seems that the values for the GPU dirty memory are pretty bimodal. I would be really curious to know what triggered this bimodality all of a sudden, as it happens reliably on the main waterfall, but I have really spent too much energy on this bug. Attaching perf results and screenshot. If somebody wants to investigate more, please do, but I'm not going to spend more days on this (especially in light of petercermak's upcoming new benchmarks, which should get rid of all these problems). picksi/amineer: if any objection comes up from the sys health council, we should just take this as a benchmark flake and move on. We have enough data here to prove that this is actually a flake, and at the same time we are working to improve the situation in the future. I think there is nothing more we can do here.
Apr 14 2016
Thanks for your time and energy investigating this. I will add a note to the System Health plan saying this is a flake in the benchmark.
Apr 14 2016
The SHP dashboard (https://chrome-health.googleplex.com/health-plan/android-chrome/memory/nexus5/) is, at the time of writing, showing Overall PSS as ~4% regressed. I see that private dirty is up by ~3%. Can you confirm that these regressions are the effect of (identical with) the private dirty GPU regression discussed here? Also, can you confirm that the flakiness/bimodality you refer to is one of timing? Is the (hand-wavy) story: something has recently changed the timing of GC (or allocation, etc.), so when we grab our memory data we see different numbers from before, but if we changed the sampling timing we would see the regression vanish?
Apr 14 2016
> I see that private dirty is up by ~3%. Can you confirm that these regressions are the effect of (identical with) the private dirty gpu regression discussed here?
The effects of this issue that we should "discount" from the plan are: 1.8% on both total PSS and total private dirty, plus ~3% on Android Graphics.
> Also can you confirm the flakiness/bimodality you refer to is one of timing.
Yes. I cannot tell what caused the flakiness, but the results in #19 show that it just became more frequent at some point, causing the averages we track to drift.
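To make that drift mechanism concrete, here is a toy illustration with made-up numbers (not taken from the bot data): neither mode of the metric changes, only how often the high mode is hit, yet the tracked average regresses.
~~~
# Toy illustration with made-up numbers: a bimodal metric whose tracked
# average drifts when the high mode simply becomes more frequent.
low_mode_mb, high_mode_mb = 20.0, 32.0  # hypothetical GPU private_dirty modes

def tracked_average(high_mode_fraction):
    return (1 - high_mode_fraction) * low_mode_mb + high_mode_fraction * high_mode_mb

before = tracked_average(0.2)  # high mode hit in ~20% of runs
after = tracked_average(0.7)   # high mode hit in ~70% of runs
print('before: %.1f MB, after: %.1f MB, drift: +%.1f%%'
      % (before, after, 100 * (after - before) / before))
# Neither mode changed, only the mix; the tracked average still regresses.
~~~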