
Issue 887668


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Experiment to understand story dependence

Project Member Reported by dtu@chromium.org, Sep 20

Issue description

Background
Currently, a perf bisect only runs a single story from a benchmark. When Pinpoint fails to reproduce a regression, we often find that re-running the bisect with the full benchmark causes the regression to reproduce.

We also note that for many no-repro cases, the numbers reported by Pinpoint are often very different from what we see on the perf dashboard (i.e. precise but not accurate). We know that this is not just due to hardware differences, because we parallelize bisects across many devices.

Proposal
Run an experiment to answer:
* What proportion of no-repros would be reproducible by running more of the benchmark?
* How much more of the benchmark would we need to run? Just the preceding story? 10 stories? The entire benchmark? All the preceding benchmarks?

Experimental design
1. Select a random sample of N=100 completed no-repro Pinpoint bisects from the past week.
2. For each bisect, check whether the reference build shifted, the bisect range is off, or there is some other explanation for the no-repro. If so, skip the bisect and move on to the next one.
3. Get the bisect's story index. We can pass `--story-shard-begin-index` and `--story-shard-end-index` to Telemetry to filter for a specific contiguous set of stories (see the sketch after this list).
4. Re-run the bisect, including the preceding 10 stories.
5a. If the bisect repros, try again with the preceding 5 stories, then the preceding 1 story.
5b. If the bisect doesn't repro, try again with the preceding 50 stories, then the full benchmark (if hardware capacity allows).
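To make steps 3 through 5 concrete, here is a minimal Python sketch (not Pinpoint code) of how the shard window could be derived from the bisected story's index. The `shard_range`/`shard_flags` helpers, the example story index, and the slice-style (end-exclusive) reading of `--story-shard-end-index` are assumptions for illustration.

```python
# Minimal sketch: derive the Telemetry shard flags for a re-run that should
# also execute the N stories preceding the bisected story.
# Assumption: the shard flags use Python-slice semantics (begin inclusive,
# end exclusive); adjust if Telemetry treats the end index as inclusive.

def shard_range(story_index, preceding_stories):
    """Return (begin, end) covering the bisected story and its predecessors."""
    begin = max(0, story_index - preceding_stories)
    end = story_index + 1  # one past the bisected story
    return begin, end


def shard_flags(story_index, preceding_stories):
    begin, end = shard_range(story_index, preceding_stories)
    return [
        '--story-shard-begin-index=%d' % begin,
        '--story-shard-end-index=%d' % end,
    ]


# Step 4 and the follow-ups from 5a/5b: widen or narrow the window around a
# hypothetical bisected story at index 42.
for n in (10, 5, 1, 50):
    print(n, shard_flags(42, n))
```

Running the full benchmark (the last fallback in 5b) would presumably just omit the shard flags.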
 
This plan sounds great!

Also, even after just doing steps 1 and 2, I would love to see whether there is a pattern of benchmarks or devices on which this tends to happen more frequently.
Is this related to cache state or something? Should we clear the cache between each story?
We already do clear lots of caches and state between story runs (at least in most benchmarks).

It looks like, in some cases, some things may not be getting cleared and may still leak from one run to the next. But we have no idea what they are. And that's what Dave's work will help us find out.
Makes sense. SGTM!
