Issue 703690

Starred by 1 user

Issue metadata

Status: WontFix
Owner: ----
Closed: Mar 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug-Regression




system_health.common_desktop hasn't uploaded metrics since 3/17, despite having run

Project Member Reported by charliea@chromium.org, Mar 21 2017

Issue description

Last known good revision: 457629
Link to perf dashboard: https://chromeperf.appspot.com/report?sid=00671976640457e03421217df0e360a5353b9812f2e7db17bf1e454b39637f83

system_health.common_desktop seems to be continuing to report data (as shown by this json.output for the results upload step of the latest run on 3/21: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F451%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_Dashboard_Upload%2F0%2Flogs%2Fjson.output%2F0). However, as shown by the perf dashboard above, some of the metrics that appear in the json.output (for example, story:power_avg) are no longer being fully piped through to the perf dashboard.

simonhatch@, eakuefner@, any idea what might be happening here? I've never run into a problem where the run is successfully outputting data but the dashboard doesn't seem to be slurping it in.

If the test is disabled, please downgrade to Pri-2.

 
Cc: -nednguyen@chromium.org nedngu...@google.com charliea@google.com
Cc: -charliea@google.com
The summary metric you mention, story:power_avg, is missing from later runs, possibly because the browse:news:cnn page is failing?
Confusing: it does seem like you're right, but I don't understand why.

https://chromeperf.appspot.com/report?sid=c436201f4a0b54d2701298cb7c1020713005217e191602badf558eed5c4f23f8 shows that system_health.common_desktop:blank_about:story:power_avg is indeed being reported as recently as today (March 22). Shouldn't system_health.common_desktop:story:power_avg (the average power during the story, aggregated across all stories in the benchmark) also be reported if at least one of its stories is still running? I'd expect it to be an average of at least the stories that successfully ran.
Cc: benjhayden@chromium.org
+benjhayden

I'm not really sure where those are computed, but they're not in the uploaded JSON.

Ned or Ben might be able to answer that? I'm curious myself.
I talked with Charlie offline: if any of the stories fail, the summarized metric won't be computed, to avoid misleading people into thinking there is a regression.

example: [100, 100, 100, 1] --> average is 75.25

Assuming the fourth story breaks and we still compute the average over the remaining stories, the average would then be 100, roughly 133% of the previous value, i.e. an apparent ~33% regression even though nothing actually changed.
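A minimal sketch of that arithmetic (plain Python for illustration; summary_avg is a made-up helper, not the actual dashboard aggregation code):

```python
def summary_avg(values):
    """Average per-story results into a single summary value."""
    return sum(values) / len(values)

all_stories = [100, 100, 100, 1]   # every story ran
surviving = [100, 100, 100]        # the fourth story failed and was dropped

print(summary_avg(all_stories))    # 75.25
print(summary_avg(surviving))      # 100.0  (~133% of 75.25)
# The summary jumps ~33% purely because the failing story fell out of the
# average, not because any surviving story got slower.
```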
Status: WontFix (was: Untriaged)
Cc: perezju@chromium.org
I understand why this bug is closed and agree with its closure, but I do want to share a story from the benchmarking meeting this week: perezju@ said that the memory team is unable to use system health benchmarks for the Android System Health plan because there's not really any good way to summarize the benchmark over time. The summary metrics aren't a reliable way to track the long-term system health of Clank because:

- As stories are added and removed from the page set, the metrics change
- As stories are enabled and disabled on a given platform, the metrics change
- If a single story fails within the page set, the metrics aren't reported

This makes it very difficult to achieve system health's objective: provide a high-level view of how Chrome's performance is changing over time. Because of this, I think that the memory team uses memory.typical_10_mobile for its high-level view of Chrome memory, even though the stories in that set are known to be inferior.

I think that we might need to reexamine our strategy if we're indeed going to provide those high-level insights.
Good point. I'll add summarization to the services meeting agenda today.
To #8: according to picksi@, summarizing the benchmark doesn't make much sense anyway, and the Android System Health plan is heading toward tracking changes on a per-story basis?
Just to add to my previous comment, I could imagine a couple of different ways that we could address this:

- Only change the stories within the system health page set at fixed points in time (e.g. every Chrome release). To do this, we could have two benchmark families, system_health.* and system_health.future.*. All work on the system health benchmarks between releases is done on system_health.future.*, and if a story in system_health.future.* is stable at a given Chrome release, it's rolled into system_health.* (possibly replacing an existing story if, for example, it's a better version of that story).

- Only track a small subset of total system health stories in the Android System Health plan, identify those stories with a common tag, and add some way to group metrics by tags in the perf dashboard. I don't like this proposal because this "minimal set of important stories" is exactly what system health itself is supposed to be.

- When a data point is missing, make its contribution to the aggregate equal to the average of its values from the past three runs, or something like that (see the first sketch after this list). I don't like this because, if a graph is noisy or we try to do this right at a regression, it might make the regression look less (or more) severe than it actually is. Overall, this might affect the summary graphs in unexpected ways and lead to distrust of them.

- Compute the summary metrics when a data point is missing, but add some indicator about what data points are missing from that summary. I don't like this because it makes the graph take much more work to interpret, as you constantly have to try to assess what the graph means at any given revision.

- Allow grouping of arbitrary stories into a single summary metric (see the second sketch after this list). I think that this idea is interesting: if you could add stories on-the-fly to a summary metric, you could, for any quarter, take the 10 most stable stories (or something like that) and use that as your metric for system health that quarter. This could certainly lead to some selection bias, but it might be better than the system we use now.
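To make the third idea concrete, here is a rough sketch of filling a missing story value from its recent history before averaging. Everything here (the impute_missing helper, the data shapes) is hypothetical, not something the dashboard implements; it mainly shows how the imputed value lags reality, so a regression that coincides with a story failure gets hidden.

```python
from statistics import mean

def impute_missing(history, current):
    """Summarize one run, filling failed stories with the mean of up to
    their last three reported values.

    history: {story: [older values, oldest first]}
    current: {story: value, or None if the story failed this run}
    """
    filled = {}
    for story, value in current.items():
        if value is not None:
            filled[story] = value
        else:
            recent = history.get(story, [])[-3:]
            if recent:
                filled[story] = mean(recent)
            # with no history either, the story is simply dropped
    return mean(filled.values())

# If the fourth story regresses and fails on the same run, the imputed
# value keeps the summary at its old level, hiding the change.
history = {"a": [100, 100, 100], "b": [100, 100, 100],
           "c": [100, 100, 100], "d": [1, 1, 1]}
current = {"a": 100, "b": 100, "c": 100, "d": None}
print(impute_missing(history, current))   # 75.25, as if nothing happened
```

And a sketch of the last idea, selecting a stable subset of stories per quarter and summarizing only those. The selection rule (lowest coefficient of variation) and the function names are assumptions for illustration; whatever rule picks the "most stable" stories is itself a source of the selection bias mentioned above.

```python
from statistics import mean, pstdev

def most_stable_stories(history, n=10):
    """Rank stories by coefficient of variation over the quarter's runs
    and return the n most stable ones."""
    def cv(values):
        m = mean(values)
        return pstdev(values) / m if m else float("inf")
    return sorted(history, key=lambda story: cv(history[story]))[:n]

def grouped_summary(history, current, n=10):
    """Summarize a run using only the chosen stable subset of stories."""
    chosen = most_stable_stories(history, n)
    return mean(current[story] for story in chosen if story in current)
```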

I think that the importance of all of this decreases as we get our system health stories more stable, but I do still think this will be a problem. We don't compute the summary metric when stories fail because we don't want to skew it, yet we give no indication when the composition of the story set changes from one revision to the next (e.g. a story is added, removed, disabled, or enabled); that seems to give a false sense of confidence that the summary metric means the same thing at every point on the graph.
