
Issue 784214

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression




353.4% regression in rasterize_and_record_micro.top_25 at 514314:514418

Project Member Reported by nedngu...@google.com, Nov 12 2017

Issue description

Cycle time of the benchmark increased from 20 minutes to 100 minutes (an 80-minute regression). This is too much.

Walter: can you take a look at this?
 
Project Member

Comment 1 by 42576172...@developer.gserviceaccount.com, Nov 12 2017

All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=784214

(For debugging:) Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?sid=6a7cf8fd867b3e88527a658e1e5c2d624387e242a09f87eb5b87493a1ee601c7


Bot(s) for this bug's original alert(s):

android-one
Cc: vmi...@chromium.org
I am in the office tomorrow and will take a look after the bisect completes. The obvious suspect is the recent update to the static snapshots. We may need to do the work to have Telemetry handle un-archiving archives fetched from GCS.

Where are you viewing the cycle time? I want to make sure I'm looking at the same thing you are.
Project Member

Comment 5 by 42576172...@developer.gserviceaccount.com, Nov 13 2017


=== BISECT JOB RESULTS ===
Bisect was unable to run to completion

Error: INFRA_FAILURE

The bisect was able to narrow the range, you can try running with:
  good_revision: e659fbe5f6aadc4dcd655bf7e9f3768fa4d561c3
  bad_revision : b5624e0de59bf01c17073f12b3a9f0ce9f83c87d

If failures persist contact the team (see below) and report the error.


Bisect Details
  Configuration: android_one_perf_bisect
  Benchmark    : rasterize_and_record_micro.top_25
  Metric       : benchmark_duration/benchmark_duration

Revision             Result                   N
chromium@514313      22.0163 +- 1.7245        21      good
chromium@514340      22.0481 +- 0.717216      9       good
chromium@514366      22.4394 +- 10.863        9       bad
chromium@514418      26.513 +- 63.4267        14      bad

To Run This Test
  src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests rasterize_and_record_micro.top_25

More information on addressing performance regressions:
  http://g.co/ChromePerformanceRegressions

Debug information about this bisect:
  https://chromeperf.appspot.com/buildbucket_job_status/8963092629050850080


For feedback, file a bug with component Speed>Bisection
That's so crazy. How could even 2500 files make it take 80 minutes longer? Locally it didn't seem at all slow to load pages, so I'm just guessing it's due to the time to check out the additional .sha1 files and then download them from GCS? But I would never have guessed +80 minutes.
I suppose I should ask -- does benchmark_duration include time to check out the repo and download from GCS?
Walter: it doesn't include the time to check out the repo. But it might include the time to download those files from GCS, as that logic is inside the benchmark-running code at the moment (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py?rcl=755a485cfb8a6e9010cdcf106933236aad114832&l=161)
Do you know what the current parallelism of the download looks like, thread-count-wise? The duration increase is so long that I wonder whether it's using one thread and doing them all serially. Again, I can investigate more tomorrow, but you probably know, or know immediately where to look.
Oh, I think there is no parallelism at all. It's fetching all the files serially.
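For illustration, a minimal sketch (not the actual Telemetry code) of what fanning the per-file GCS fetches out to a thread pool could look like; fetch_one below is a hypothetical stand-in for whatever per-file download helper the story runner already calls:

import concurrent.futures

def fetch_one(sha1_path):
    # Hypothetical: resolve the .sha1 file and download the matching blob
    # from cloud storage if it isn't already present locally.
    raise NotImplementedError

def prefetch_all(sha1_paths, max_workers=25):
    # Downloads are I/O bound, so a thread pool is enough; list() forces
    # every future to finish and re-raises the first error, if any.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fetch_one, sha1_paths))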
Created reverts of the updated static snapshots for now, so that we can look at archiving or parallelism without urgency around the current state:

https://chromium-review.googlesource.com/c/chromium/src/+/769368
https://chromium-review.googlesource.com/c/chromium/src/+/769528
Walter: for the next step, I think we can consider adding support for Telemetry's page/story to specify a dependency file (with dependency_manager). This system already supports unzipping files.

The core lib is https://cs.chromium.org/chromium/src/third_party/catapult/dependency_manager/

The way to use it is to add a dep file like this:
https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/binary_dependencies.json

With this dep file, one can fetch data from cloud storage with:

import dependency_manager  # catapult's dependency_manager package

config = dependency_manager.BaseConfig(<path to dep file>)
# DependencyManager takes a list of configs.
manager = dependency_manager.DependencyManager([config])
manager.PrefetchPaths(<target_platform>)

It will automatically unzip the file if the json includes "path_within_archive" (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/binary_dependencies.json?rcl=c44654bacfc5866bbafe00b7fac50f87c82a4103&l=61)
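For concreteness, a hedged sketch of what a per-story dep file entry might look like, written as a Python dict mirroring the JSON layout of binary_dependencies.json; the dependency name, bucket, folder, platform key, hash, and paths below are all made up for illustration and should be checked against the linked file:

import json

# Every concrete value here (name, bucket, platform key, paths, hash) is
# hypothetical; only the field layout is modeled on binary_dependencies.json.
STORY_DEPS = {
    "config_type": "BaseConfig",
    "dependencies": {
        "top_25_story_images": {
            "cloud_storage_bucket": "chromium-telemetry",
            "cloud_storage_base_folder": "page_dependencies",
            "file_info": {
                "android": {  # must match the target_platform passed to PrefetchPaths
                    "cloud_storage_hash": "<sha1 of the uploaded zip>",
                    "download_path": "data/top_25_images.zip",
                    # path_within_archive is what triggers the automatic
                    # unzip mentioned above.
                    "path_within_archive": "top_25_images",
                },
            },
        },
    },
}

with open("story_dependencies.json", "w") as f:
    json.dump(STORY_DEPS, f, indent=2)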
Thanks for your revert; the cycle time is back down to the normal range now.

I might spend some time this Thursday & Friday prototyping switching Telemetry to use dependency_manager to manage the image files.
OK, I'll check with you before I do more here. It also seems worth looking at adding parallel downloads in any case.

Maybe we should file two separate issues for these, (1) download/unpack zip archives and (2) parallelize GCS downloads, and close this one.

I'll defer to your judgment.
I think once we zip the files into an archive of significant size, there is not much time saving in parallelizing the download, as the bandwidth is fixed. I might be proven wrong though.


I filed issue 785397 for download/unpack zip archives
Status: Fixed (was: Assigned)
On parallelizing -- yes, I was thinking there's still value for the overall system. There are 237 .sha1 files under tools/perf outside of static_top_25. If each takes one second to round-trip, negotiate, and download (I made that up, but it's not crazy), that's almost 4 minutes, which adds up when it's going on all day, every day. Parallelizing across 25 threads would reduce that to ~10 seconds.

I realize we probably don't always download all the .sha1 files on all perf bots every time, but it still seems a worthwhile improvement at some point. Lower priority than unpacking archives, for sure.

Comment 19 by benhenry@google.com, Jan 16

Components: Test>Telemetry

Comment 20 by benhenry@google.com, Jan 16

Components: -Speed>Telemetry
