
Issue 881657 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Better handling of failures in performance bisects

Project Member Reported by dtu@chromium.org, Sep 7

Issue description

https://bugs.chromium.org/p/chromium/issues/detail?id=881063#c7
In the above bug, the compile stopped failing at commit af3c256. Therefore, Pinpoint reported it as a "difference". This is confusing to users.

Under the hood, Pinpoint has so far treated functional and performance bisects the same. We want to change the way performance bisects display failures:
1. If the perf values immediately before and after the failing range are not significantly different, ignore the failure completely.
2. If the perf values immediately before and after the failing range are significantly different, say "there was a regression in this range."
3. If there are no perf values before or after the failing range, say "there could be a regression in this range, but we don't know."
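The three rules above can be sketched as a small decision function. This is only an illustration of the proposed behavior; the function and type names, and the pluggable significance test, are assumptions, not Pinpoint's actual code.

```python
from enum import Enum

class FailureVerdict(Enum):
    """Hypothetical verdicts matching rules 1-3 above."""
    IGNORE = 'ignore the failure completely'
    REGRESSION = 'there was a regression in this range'
    UNKNOWN = "there could be a regression in this range, but we don't know"

def classify_failing_range(values_before, values_after, significantly_different):
    """Classify a failing commit range in a performance bisect.

    values_before/values_after: perf samples immediately before/after the
    failing range, or None/empty if unavailable.
    significantly_different: a callable implementing whatever statistical
    test is used (assumed interface).
    """
    if not values_before or not values_after:
        return FailureVerdict.UNKNOWN       # rule 3: no values on one side
    if significantly_different(values_before, values_after):
        return FailureVerdict.REGRESSION    # rule 2: values differ across range
    return FailureVerdict.IGNORE            # rule 1: values agree across range
```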
 
This sounds good to me. Just a word of warning: be careful when designing this new "taxonomy" of possible Pinpoint results. We should avoid the mistakes of the legacy bisect system, which also had many possible "results" but where it wasn't always clear what each of them meant.

Where possible, the results should help distinguish "who is at fault": was there a problem with the build (couldn't get values due to a broken build), with the metric (too noisy, so the regression couldn't be reproduced), or with Pinpoint itself (some error in the bisect system), among a few other things that could go wrong? This was particularly hard with the old system.

As a start we should probably define the set of possible results as some enum-like constants in the code, together with clear explanations of what they mean.
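One way to sketch the suggestion above is an enum whose members carry a "who is at fault" explanation. The member names and descriptions here are assumptions for illustration, not an agreed-upon taxonomy.

```python
import enum

class BisectResult(enum.Enum):
    """Hypothetical set of bisect results, each with a clear meaning."""
    # A culprit commit was found: values differ significantly across it.
    REGRESSION_FOUND = 'regression found'
    # The build is at fault: values could not be obtained in the range.
    BUILD_FAILURE = 'build failed, could not get values'
    # The metric is at fault: too noisy to reproduce the regression.
    METRIC_TOO_NOISY = 'metric too noisy'
    # Pinpoint is at fault: an error in the bisect system itself.
    INFRA_ERROR = 'pinpoint internal error'
```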
+1. It was always difficult to pull out information on where exactly things went wrong in the legacy bisect script; we tried to piece it together after the fact on the dashboard, based on the bisect's raw output.

I'd say the range output is definitely a step in the right direction. When we did that for recipe bisect, it did help clear up confusion about what happened (though there was still confusion about why it wasn't able to narrow further into the range).
The model is a little different from legacy bisect, since legacy bisect was designed to run linearly and find exactly one regression (no more, no less), and therefore every kind of error became a top-level error.

At the top-level, Pinpoint has only 2 possible results: completed and failed. "Failed" means Pinpoint died. The bug comment will say "(sad cat face) Pinpoint stopped with an error".

If completed, the comment will say "Found {1 or more} changes" or "Couldn't reproduce a change" if there are none. It will list the changes, which can take one of 3 formats:

1.
metric_name changed from 80 units -> 120 units (+40 units)
Commit name by commit author
https://commit.url

2.
metric_name changed from 80 units -> 120 units (+40 units)
The Build failed from r123400 to r123500, so Pinpoint was unable to narrow it down.

3. In this case, Pinpoint will say "Pinpoint was unsure about {0 or more} changes."
There may or may not have been a change.
The Test failed from r123400 to r123500, so Pinpoint was unable to narrow it down.
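The first line of formats 1 and 2 above follows a simple template. A minimal sketch of a formatter producing it, assuming integer values and a hypothetical function name:

```python
def format_change(metric, old, new, unit):
    """Render a change line like the examples above (illustrative only)."""
    delta = new - old
    sign = '+' if delta >= 0 else ''
    return (f'{metric} changed from {old} {unit} -> '
            f'{new} {unit} ({sign}{delta} {unit})')
```

For example, `format_change('metric_name', 80, 120, 'units')` yields the "metric_name changed from 80 units -> 120 units (+40 units)" line shown above.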
Cc: bugsnash@google.com dtu@chromium.org m...@chromium.org
 Issue 909626  has been merged into this issue.
