
Issue 737353

Starred by 8 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression

Blocked on:
issue 737331
issue 738902

Blocking:
issue 734853
issue 735649



Hotlists containing this issue:
system-health-regressions



Regression due to extending memory dump wait time

Project Member Reported by sullivan@chromium.org, Jun 28 2017

Issue description

See the link to graphs below.
 
Project Member

Comment 3 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

Cc: perezju@chromium.org
Owner: perezju@chromium.org

=== Auto-CCing suspected CL author perezju@chromium.org ===

Hi perezju@chromium.org, the bisect results pointed to your CL, please take a look at the
results.


=== BISECT JOB RESULTS ===
Perf regression found with culprit

Suspected Commit
  Author : perezju
  Commit : b4eb70c319cb0f38d1a7764e7d9fb56dc96f8bcf
  Date   : Mon Jun 26 14:25:41 2017
  Subject: [System Health] Extend dump time to 5 seconds

Bisect Details
  Configuration: android_webview_arm64_aosp_perf_bisect
  Benchmark    : system_health.memory_mobile
  Metric       : memory:webview:all_processes:reported_by_os:system_memory:java_heap:proportional_resident_size_avg/load_media/load_media_dailymotion
  Change       : 36.12% | 5027498.66667 -> 6843392.0

Revision                                 Result                  N
chromium@482136                          5027499 +- 95475.8      6      good
chromium@482271                          5272576 +- 104548       6      good
chromium@482280                          5781845 +- 114117       6      good
chromium@482282                          6063445 +- 84275.5      6      good
chromium@482282,catapult@b4eb70c319      8299179 +- 100401       6      bad       <--
chromium@482283                          8167424 +- 161675       6      bad
chromium@482284                          7837696 +- 214307       6      bad
chromium@482288                          7489877 +- 140633       6      bad
chromium@482305                          7315115 +- 139795       6      bad
chromium@482339                          7214080 +- 138872       6      bad
chromium@482406                          6843392 +- 221108       6      bad

Please refer to the following doc on diagnosing memory regressions:
  https://chromium.googlesource.com/chromium/src/+/master/docs/memory-infra/memory_benchmarks.md

To Run This Test
  src/tools/perf/run_benchmark -v --browser=android-webview --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=load.media.dailymotion system_health.memory_mobile

Debug Info
  https://chromeperf.appspot.com/buildbucket_job_status/8975584062646688784

Is this bisect wrong?
  https://chromeperf.appspot.com/bad_bisect?try_job_id=6153471146852352


| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Speed>Bisection.  Thank you!
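(Editor's note: the reported 36.12% change follows directly from the good/bad means in the table above. A quick sanity check in Python; treating the raw values as bytes, and hence the MiB conversion, is an assumption based on the proportional_resident_size metric name and is not stated in the bot output.)

  # Sketch only: verify the bisect bot's reported change from the table above.
  good = 5027498.66667   # mean at chromium@482136 (assumed to be bytes)
  bad = 6843392.0        # mean at chromium@482406 (assumed to be bytes)
  print('change: %.2f%%' % (100.0 * (bad - good) / good))               # ~36.12%
  print('good: %.2f MiB, bad: %.2f MiB' % (good / 2**20, bad / 2**20))  # ~4.79 -> ~6.53 MiB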
Project Member

Comment 4 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737354  has been merged into this issue.
Project Member

Comment 5 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737356  has been merged into this issue.
Project Member

Comment 6 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737364  has been merged into this issue.
Project Member

Comment 7 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737361  has been merged into this issue.
Project Member

Comment 8 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737367  has been merged into this issue.
Blocking: 734853
Cc: primiano@chromium.org ericrk@chromium.org
Summary: Regression due to extending memory dump wait time (was: 1.7%-24.6% regression in memory.top_10_mobile at 482137:482406)
Unsurprisingly, extending the dump time caused a bunch of memory metrics to jiggle around.

These are probably fine, but I will have a careful look at them. The first thing is to clean up the list of alerts, which currently shows "WARNING: Non-overlapping alerts are grouped together."

So regressions unrelated to my change could have been mistakenly assigned to this bug.
Project Member

Comment 10 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

Cc: u...@chromium.org ulan@google.com
 Issue 737490  has been merged into this issue.
Project Member

Comment 11 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737471  has been merged into this issue.
Project Member

Comment 12 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737369  has been merged into this issue.
Project Member

Comment 13 by 42576172...@developer.gserviceaccount.com, Jun 28 2017

 Issue 737476  has been merged into this issue.

Comment 14 by u...@chromium.org, Jun 28 2017

This CL is causing huge variability (on the order of +/-30MB) in a previously stable benchmark:
https://chromeperf.appspot.com/report?sid=88c5b38cef32f976acd3712c66ce2ddfbd6180708bea8ad8f63e92f88c2905e1&rev=482343

Shouldn't we actually try to reduce the sampling interval (i.e. increase the number of samples) to reduce the noise coming from the sampling time?


I'm planning to take another large dump of measurements during the coming weekend, and assess the impact that this CL had.

Re: increasing the number of samples: that's an end goal, but it's blocked on moving the logic that triggers memory dumps into Chrome.

At the moment we still rely on a single memory dump taken per story, and run the story 3 times to get 3 samples.
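(Editor's note: a minimal sketch of the sampling scheme described above, with hypothetical helper names; this is not the actual Telemetry code. Each story run produces exactly one memory dump at the end, and the story is repeated three times to obtain three samples that are then averaged.)

  # Sketch only: FakeStory and its methods are made up for illustration.
  import random

  class FakeStory(object):
      def navigate_and_interact(self):
          pass  # page load + interactions would happen here

      def request_memory_dump(self):
          # Single dump taken at the end of the story; value in bytes.
          return 5e6 + random.gauss(0, 1e5)

  def run_story_once(story):
      story.navigate_and_interact()
      return story.request_memory_dump()

  story = FakeStory()
  samples = [run_story_once(story) for _ in range(3)]  # 3 repeats -> 3 samples
  print('mean of %d samples: %.0f bytes' % (len(samples), sum(samples) / len(samples)))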
Cc: simonhatch@chromium.org
Annie, +simonhatch: Unrelated alerts (whose range does not include my CL) keep being added to the list in:

https://chromeperf.appspot.com/group_report?bug_id=737353

After my earlier cleanup, I'm again getting "WARNING: Non-overlapping alerts are grouped together.", likely due to some recent dupe.

When a culprit CL is found, alerts on ranges that do not include said CL should probably be split off onto new bugs.

Comment 17 by u...@chromium.org, Jun 28 2017

For short-running benchmarks we are probably getting samples at the beginning of the story and losing coverage of memory usage at the end of the story. This may accidentally improve the variability of the benchmark if there is not much happening at the beginning of the story. But that is a false improvement, since it comes from ignoring the data.

> Re: increase number of samples: that's an end goal, but it's blocked on moving the logic to trigger memory dumps within Chrome.
This CL seems to me a step in the opposite direction. Why not keep the current state until the blocking issue is resolved?

On the system health benchmarks we always take a single memory dump at the end of the story.

For the broader discussion of why this change, and why now, see  issue 734853 .

Comment 19 by u...@chromium.org, Jun 28 2017

Blocking: 735649

Comment 20 by u...@chromium.org, Jun 28 2017

Oh, I misunderstood what the dumping interval means. I thought it was the interval between periodic memory dumps. Sorry for that; please ignore comment #17.

Since the new dumping interval increases the running time of a story, shouldn't the stories be re-recorded? The stories that regress with the CL are probably now throwing errors because of missing resources.

That's a good point. While assessing the impact of the CL I'll check if any of the stories need re-recording.
re: #c16

Yeah, it'd be nice if we could split off just the ranges affected by the CL; not sure whether that would be a manual or an automated process.
A bunch of memory regressions happened very close to each other, also near to my change (how lucky!).

Keeping track of them here so far:

r482282 - this issue
r482305 - issue 737360
r482333 - issue 737264
r482346 .. r482354 - issue 737525

A couple of them are still waiting for bisects: issues 737921 and 737922.

I'll have to carefully see whether some of the regressions still attached to this bug may be due to some of the other issues (or a combination of a few of them!)
Project Member

Comment 24 by 42576172...@developer.gserviceaccount.com, Jun 29 2017

 Issue 737921  has been merged into this issue.
Project Member

Comment 25 by 42576172...@developer.gserviceaccount.com, Jun 29 2017

 Issue 737922  has been merged into this issue.
Project Member

Comment 26 by 42576172...@developer.gserviceaccount.com, Jun 30 2017

Cc: tdres...@chromium.org
 Issue 737563  has been merged into this issue.
Blockedon: 737331
Cc: perezju@google.com
 Issue 738902  has been merged into this issue.
 Issue 738905  has been merged into this issue.
Blockedon: 738902
 Issue 738903  has been merged into this issue.
 Issue 738875  has been merged into this issue.
 Issue 738906  has been merged into this issue.
 Issue 739140  has been merged into this issue.
Cc: nedngu...@google.com
After more carefully splitting off alerts not related to this CL (new issues  738883  and  738884  were found!), ~200 alerts remained that are attributable to my change; they can be put into the following 4 groups:

A. Regressions in v8 / native_heap (105 alerts)

- Affected platforms: Windows (both) and Mac (v8 only).
- Bisect confirmed r482282 as culprit for v8 regression (54 MiB, load_news_qq on win).
- Not confirmed on native_heap, but worst offenders coincide in magnitude and affected stories/platforms.
- Worst offenders: load_news_qq (133 MiB), load_tools_weather (32 MiB), load_tools_docs (19 MiB), load_tools_drive (13 MiB).

B. Regressions in skia / gpu (18 alerts)

- Affected platforms: Mostly Android.
- Bisects confirmed r482282 as culprit for both skia and gpu (4.1 MiB, browse_media_youtube on android).
- Worst offenders: load_games_spychase (8.0 MiB on android), browse_media_youtube (4.1 MiB on android), load_news_qq (5.3 MiB on win and mac).
- 2 alerts on cc for load_tools_weather also possibly related and confirmed by bisect.

C. Regressions in java_heap (11 alerts)

- Affected platforms: Android.
- Bisect confirmed r482282 as culprit (6.7 MiB, load_tools_weather).
- Worst offenders: load_tools_weather (7.8 MiB), load_games_spychase (3.4 MiB), load_news_irctc (2.1 MiB).

D. Regressions in malloc (22 alerts)

- Affected platforms: Mac and Windows.
- Not confirmed by bisect, but worst offenders show both ref + regular build moving together, e.g.:
  https://chromeperf.appspot.com/report?sid=9b2ba8352519ac52c32f9e482ae41f88052054452887de1a90cef80e6a70a730&rev=482287
- Worst offenders: load_tools_weather (5.2 MiB), load_games_alphabetty (3.1 MiB), load_tools_drive (2.0 MiB) all on Mac.

================

Due to their magnitude, alerts in A probably merit more investigation. Could these be due to a leak in Chrome? (Or maybe just some javascript on the story hungrily eating more memory as time progresses?)

I'm thinking of treating the alerts in B, C, and D as an unfortunate but unavoidable effect of the change in the testing framework.

Thoughts on these?

As a follow-up, I'll now proceed to analyse the effect on noise levels (I just finished gathering data points from the waterfall over this last weekend).
Status: Started (was: Untriaged)
So B is fun because, after the GPU discussion that triggered all this, I was expecting them to go down, not up :/

C is fine and very plausible. Very likely that will change anyway with your CL, when you send USR1 to get rid of Issue 724032.

D is very likely a consequence of A (v8's malloc memory for parsing and such).

A, I agree, is curious. Possibly, as you say, that's the page trying to do something more (timers etc.) after we did the GC, and hence accumulating more memory.
Cc: -perezju@google.com
For the past couple of days I've been crunching through data gathered this past weekend, and comparing it against my previous dump (from about early May), to analyze changes in noise patterns for system health on the entire perf waterfall.

Colab with lots of pretty pictures:
https://colab.corp.google.com/v2/notebook#fileId=0B586tdnIdbhKYUVfVXlZRHY4VlU

I've got good news and bad news.

GOOD: Noise in gpu_memory decreased across most stories on N5 and N7!

BAD: Not due to my change. :-(

Some related graphs:
https://chromeperf.appspot.com/report?sid=1c9dea2d396c0a40f0f6f20b22540e6ff79445c416f9b70d97f484c968df747c&start_rev=459675&end_rev=484479

The drop in noise happened around:
http://test-results.appspot.com/revision_range?start=477936&end=481246

The CL to increase the dump wait time landed later, at r482282.

While investigating this I also found a few other cases where noise increased, also not due to my change:

- v8 on load:news:qq, on mac and win
  https://chromeperf.appspot.com/report?sid=e0133c1ed628c1c51489400ea515bca31a7cbc48b674d03e8f58e14623f640ad&start_rev=459675&end_rev=484479

- cc on load:media:imgur
  https://chromeperf.appspot.com/report?sid=7c6076f9eaf0c3a0750bf512d6460ec819bfdff89d385325a9420e968dc29baf&start_rev=459675&end_rev=484479

- private dirty on mac
  https://chromeperf.appspot.com/report?sid=5126eabea0cbdfe7a8c38f9083f1dd6b9c528da545803e2918d4a10b75f48f75&start_rev=459675&end_rev=484479

A few other time series got a bit more or a bit less noisy, but overall noise levels stayed about the same.

Several alerts (both improvements and regressions) were caught around the time my CL landed:
https://chromeperf.appspot.com/group_report?rev=482282

But given that noise did not go down that much anyway, I'm now considering reverting it. (Probably over the weekend, to avoid having it cluster again with other regressions?)

Any other thoughts or opinions?
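(Editor's note: the kind of per-series noise comparison described above can be sketched as follows. The sample values are made up, and using the coefficient of variation as the noise measure is an assumption on my part, not necessarily what the colab does.)

  # Sketch only: compare noise (stdev relative to mean) before and after a change.
  import statistics

  before = [5.0e6, 5.1e6, 4.9e6, 5.2e6, 5.0e6, 4.8e6]  # made-up samples (bytes)
  after  = [6.8e6, 7.2e6, 6.5e6, 7.5e6, 6.9e6, 7.1e6]  # made-up samples (bytes)

  def cv(samples):
      # Coefficient of variation: standard deviation as a fraction of the mean.
      return statistics.stdev(samples) / statistics.mean(samples)

  print('noise before: %.1f%%, after: %.1f%%' % (100 * cv(before), 100 * cv(after)))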
Cc: jgruber@chromium.org
 Issue 739745  has been merged into this issue.
Issue 737331 has been merged into this issue.
Yeah, given that gpu improved regardless, at this point I'd revert so that we don't have to worry about whether any of these regressions are due to the change in the delay or not.
I think we just have to figure out a better strategy with gpu folks here.
Status: Fixed (was: Started)
The CL was reverted at https://chromium-review.googlesource.com/564312
Issue 739319 has been merged into this issue.
