cc/gpu memory benchmarks are sensitive to memory dump delay
Issue description

There was a memory regression on Android bots due to the GPU scheduler being enabled (issue 733725). Investigation of that bug revealed a race between the resource pool's eviction task and the memory dump. This causes a regression in gpu gl textures in the GPU process (or in the browser process on low-end Android).

I ran the benchmark like this:

tools/perf/run_benchmark --browser=android-chromium --upload-results system_health.memory_mobile --story-filter browse:chrome:omnibox --pageset-repeat=1

Results: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/html-results/results-2017-06-19_18-11-26

The labels are: disabled (scheduler actually enabled), disabled_for_real, dump_10s, dump_5s, flush_in_reclaim.
- The first two are runs with the scheduler enabled and disabled, respectively.
- dump_10s and dump_5s set _DUMP_WAIT_TIME in action_runner.py to 10 and 5 seconds, with the scheduler enabled.
- flush_in_reclaim removed the !visible condition in LTHI::ReclaimResources, with the scheduler enabled and the dump wait time left at the default of 3 seconds.

As you can see, the regression goes away in the 10s and 5s runs. Moreover, cc memory usage decreases by 50MB(!) (all in renderer cc tile memory) compared to any run with the default dump wait time.

Traces:
- disabled: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_1-2017-06-19_17-52-34-2064.html
- disabled_for_real: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_1-2017-06-19_17-56-53-28778.html
- dump_10s: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-06-19_17-59-38-87127.html
- dump_5s: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-06-19_18-01-45-58511.html
- flush_in_reclaim: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-06-19_18-11-23-57720.html
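For context, a minimal sketch of the pattern the benchmark uses, with illustrative names only (request_memory_dump is a hypothetical stand-in, not Telemetry's actual API):

```python
import time

_DUMP_WAIT_TIME = 3  # seconds; the runs above try 5 and 10


def measure_memory(request_memory_dump):
    """Sleep a fixed delay, then trigger a single memory dump.

    If a cache (e.g. cc's resource pool) evicts on a timer longer than
    this delay, the dump races the eviction and reports either the
    pre- or post-eviction value, which is the noise seen in this bug.
    """
    time.sleep(_DUMP_WAIT_TIME)
    return request_memory_dump()
```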
Jun 20 2017
primiano@, this is mostly due to a change I made last month to move from a 1s to a 5s timeout (crbug.com/720146). I had meant to follow up on this, but it fell off my radar. The basic issue is that we have a resource pool in cc which evicts unused resources after a period of N seconds. With N=1, eviction ~always happened before the memory dump was taken. With N=5, it sometimes misses the dump (hence the noise here and in the other bug). We can "fix" this by either:
a) reducing the eviction time in our cache, so that it happens before the dump, or
b) increasing the dump timeout (5s seems to work).
I know of cases where (a) will regress performance, but (b) is sort of sad as it adds more delay to tracing. Do you have a preference here? The benefits of the 5s timeout were real, but we can also live without them.
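To make the race concrete, here is a toy simulation; all timing distributions are made-up assumptions for illustration, not measured values:

```python
import random


def fraction_dump_before_eviction(evict_after_s, dump_wait_s, trials=100_000):
    """Fraction of runs where the dump fires before the pool's eviction."""
    early = 0
    for _ in range(trials):
        # When the story's resources were last used, relative to the end
        # of the story (varies run to run).
        last_use = random.uniform(-2.0, 0.0)
        eviction_t = last_use + evict_after_s
        # The dump fires a fixed wait after the story, plus small jitter.
        dump_t = dump_wait_s + random.uniform(0.0, 0.5)
        if dump_t < eviction_t:
            early += 1
    return early / trials


print(fraction_dump_before_eviction(1, 3))  # ~0.0: dump always sees post-eviction memory
print(fraction_dump_before_eviction(5, 3))  # ~0.9: mixed across runs -> bimodal metric
print(fraction_dump_before_eviction(5, 5))  # ~0.0: stable again (the 5s dump wait)
```

With N=1 the dump lands after eviction in essentially every run, so the metric is stable; with N=5 and a 3s wait, some runs see pre-eviction memory and some see post-eviction memory, which is exactly the bimodal noise described above.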
Jun 21 2017
Hmm, no strong opinion honestly; adding +perezju, +erikchen, +hjd for thoughts. There seems to be a general problem with the benchmark w.r.t. taking just one dump: no matter how smart we are, things like this are always going to happen. On the other hand, we have to be pragmatic, so at some point we have to make a choice. perezju@, any idea how bad it would be for the benchmark to delay the dumps from 1 to 5 seconds? If it's not too bad, we should do this as a short-term solution. Long term, instead, my feeling is that if we want to keep the "one-dump benchmark model" (again, not ideal, but pragmatically I think it's the only thing we can afford right now) we need some sort of more robust signal for quiescence. Maybe instead of waiting for N seconds we should wait for an idle callback? ericrk@, any idea whether this specific thing would have happened if we were using idle callbacks here? Benchmarking is hard.
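A rough sketch of what an idle-based wait could look like on the benchmark side; is_idle here is a hypothetical probe, and the real signal would have to be plumbed out of the browser:

```python
import time


def wait_for_quiescence(is_idle, max_wait_s=10.0, poll_interval_s=0.25):
    """Poll an idle predicate instead of sleeping a fixed N seconds.

    Returns True if the browser reported idle before the deadline, and
    False if we gave up and should dump anyway.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if is_idle():  # e.g. "no pending eviction or cleanup tasks"
            return True
        time.sleep(poll_interval_s)
    return False
```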
Jun 21 2017
I'll kick off an experiment to measure the effect of a 5 second delay. This sounds very much related to issue 733897, where I was investigating noise in memory metrics and found gpu:graphics to be very strictly bimodal.
Jun 23 2017
Results of the experiment are in! Bumping _DUMP_WAIT_TIME from its current 3 seconds to 5 does indeed reduce noise quite significantly in both gpu and cc memory (on the gpu and renderer processes, respectively). According to my calculations this should add ~13 minutes of total run-time to system_health.memory_mobile. +nednguyen: given the large reductions in noise, it's probably worth paying the extra 13 minutes of run-time; what do you think?

== Details ==

I ran a try job with:

$ tools/perf/run_benchmark try android-nexus5 system_health.memory_mobile --story-filter browse:chrome:omnibox --pageset-repeat 100

The first graph shows the stdev (in MiB) for all metrics reported by memory-infra on individual processes. The previously noisiest metrics are now under 0.4 MiB after the patch, and a few metrics became slightly noisier, but only by a tiny amount. The second graph is a detail of the previously noisiest metrics, comparing the effect without/with the patch. Colab with data to play with: https://colab.corp.google.com/v2/notebook#fileId=0B586tdnIdbhKZV94alc1RjRwRUU
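The noise comparison can be computed from the try-job output with something like the following; load_results and the (metric, value) pair format are assumptions about the data layout, not the actual results schema:

```python
import statistics
from collections import defaultdict


def stdev_per_metric(samples):
    """samples: iterable of (metric_name, value_in_mib) pairs, one per repeat."""
    by_metric = defaultdict(list)
    for name, value in samples:
        by_metric[name].append(value)
    return {name: statistics.stdev(values)
            for name, values in by_metric.items()
            if len(values) > 1}


# Hypothetical loader for the two try-job result sets.
before = stdev_per_metric(load_results('without_patch'))
after = stdev_per_metric(load_results('with_patch'))
for name in sorted(before, key=before.get, reverse=True)[:10]:
    print('%s: %.2f -> %.2f MiB' % (name, before[name], after.get(name, float('nan'))))
```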
Jun 24 2017
FWIW, the GPU scheduler CL also improved a lot of memory metrics: https://chromeperf.appspot.com/group_report?rev=479175 I've submitted a CL to verify this on the perf bot, but now I'm thinking this might not be a real improvement :/
Jun 24 2017
@Primiano: to confirm, the long-term solution is still to move the dumping logic inside Chrome (event based) instead of keeping it in the benchmark code? If that's the plan, I'm fine with whatever is the cheapest fix we can do in the short term to reduce the noise.
Jun 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/e3095a72f2b1928064b14e9f1d0119b00b171949

commit e3095a72f2b1928064b14e9f1d0119b00b171949
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Mon Jun 26 15:40:28 2017

Roll src/third_party/catapult/ d4f2d777c..b4eb70c31 (1 commit)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/d4f2d777c7aa..b4eb70c319cb

$ git log d4f2d777c..b4eb70c31 --date=short --no-merges --format='%ad %ae %s'
2017-06-26 perezju [System Health] Extend dump time to 5 seconds

Created with:
roll-dep src/third_party/catapult

BUG=734853

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I61dc56b5ed9b543cda55f5133ecf76e9d28f22a1
Reviewed-on: https://chromium-review.googlesource.com/548676
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#482283}

[modify] https://crrev.com/e3095a72f2b1928064b14e9f1d0119b00b171949/DEPS
Jul 8 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/35a87f9da8aa1268ec4cdf1c0a7e6835ac53b268

commit 35a87f9da8aa1268ec4cdf1c0a7e6835ac53b268
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Sat Jul 08 11:53:54 2017

Roll src/third_party/catapult/ 256098db1..00b0c16c9 (1 commit)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/256098db1e09..00b0c16c9708

$ git log 256098db1..00b0c16c9 --date=short --no-merges --format='%ad %ae %s'
2017-07-08 perezju Revert of [System Health] Extend dump time to 5 seconds (patchset #2 id:20001 of https://codereview.chromium.org/2946013004/ )

Created with:
roll-dep src/third_party/catapult

BUG=734853

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I767d74467eecdddcc223be036d7ae68a6eb9c1e6
Reviewed-on: https://chromium-review.googlesource.com/564312
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#485147}

[modify] https://crrev.com/35a87f9da8aa1268ec4cdf1c0a7e6835ac53b268/DEPS
Feb 13 2018
I think this is obsolete by this point. While increasing the delay is desirable, it seems it may introduce as many headaches as it fixes. This is just something we should keep in mind as we triage memory issues.