Issue 669608

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Bisect - Some tests are difficult to ever bisect

Project Member Reported by simonhatch@chromium.org, Nov 29 2016

Issue description

Looking at https://bugs.chromium.org/p/chromium/issues/detail?id=668169#3

So the graph shows a regression, and the mean results also show regressions detected, but the bisect never gets confident enough to declare a regression. Looking closely at the values:


        "summary": {
          "description": "Total thread duration of all garbage collection events outside of idle tasks", 
          "important": true, 
          "name": "v8_gc_total_outside_idle", 
          "std": 0.0, 
          "type": "list_of_scalar_values", 
          "units": "ms", 
          "values": [
            6.045000000000001, 
            164.762, 
            26.87, 
            7.93, 
            4.971, 
            28.834, 
            129.6706211584481, 
            118.28299999999997, 
            23.936999999999998, 
            146.387, 
            6.383, 
            229.4098391145737, 
            5.2860000000000005, 
            63.921931803532296, 
            119.85, 
            117.67899999999999, 
            50.76523786249593, 
            106.01499999999999, 
            33.028999999999996, 
            191.44350393250392, 
            0.0, 
            122.02016841186736, 
            4.891, 
            9.045, 
            0.0, 
            6.9209999999999985, 
            0.0, 
            7.002, 
            10.526, 
            31.854000000000003, 
            3.9859999999999998, 
            0.0, 
            4.227, 
            321.018, 
            33.57, 
            102.07694816547178, 
            0.0, 
            0.0, 
            263.648
          ]
        }
      }, 


And the bisect results:

===== TESTED REVISIONS =====
Revision         Mean     Std Dev  N    Good?
chromium@433810  101.261  1946.67  195  good
chromium@433835  118.478  2567.28  195  bad



Now the perf dashboard seemingly averages these out, so it seems like the bisect should be doing the same thing? Where does the dashboard decide to do that, just on every list_of_scalar_values?
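
Rough sketch of the difference, with made-up numbers (not the actual dashboard or bisect code; scipy's mannwhitneyu just stands in for the U test compare_samples runs):

import numpy as np
from scipy import stats

# Each inner list is one run's list_of_scalar_values (values made up).
good_runs = [[6.0, 164.8, 26.9, 7.9, 0.0], [5.3, 119.9, 23.9, 4.9, 0.0],
             [6.4, 146.4, 28.8, 9.0, 0.0], [4.2, 117.7, 31.9, 7.0, 0.0]]
bad_runs  = [[7.0, 191.4, 33.0, 10.5, 0.0], [6.9, 229.4, 33.6, 9.4, 0.0],
             [7.3, 263.6, 35.1, 10.9, 0.0], [6.8, 321.0, 34.0, 10.2, 0.0]]

# What the bisect compares today (as I understand it): every value from
# every run, concatenated into one big sample per revision.
print(stats.mannwhitneyu(np.concatenate(good_runs), np.concatenate(bad_runs)))

# What the dashboard effectively compares: one mean per run.
print(stats.mannwhitneyu([np.mean(r) for r in good_runs],
                         [np.mean(r) for r in bad_runs]))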
 
Blocking: 668169
Blocking: 667417
Blocking: 666400
Blocking: 666395
Cc: dtu@chromium.org
+dtu

Comment 6 by dtu@chromium.org, Dec 5 2016

There is a small but clear regression. KS or Anderson would've detected the regression after 20-25 runs, but this bisect appeared to stop after 5.

I think there are tradeoffs with every approach. Taking the average would make it harder to detect some kinds of regressions that don't affect the mean much. Also having different algorithms for dashboard and bisect means that they're going to be inconsistent in the types of regressions each can detect.
We have different algorithms for dashboard and bisect already. I agree we should work towards moving them closer, but in the short term, I think we still want to dig more deeply into which algorithms work best for bisect?
The most pragmatic solution I can think of is to add a 'Bisect on the mean of each run' flag, preset it for specific metrics, and allow the sheriff to override it.

This could be piped through to compare_samples, which would in turn average each list of scalars in the sample instead of concatenating them all. A sketch of what that could look like is below.

The same approach could be used to bisect on stddev of each run.
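
Something like this hypothetical per-run aggregation helper (nothing that exists in compare_samples today, just to illustrate the flag):

import numpy as np

def aggregate_run(values, method='mean'):
  # Hypothetical helper: collapse one run's list_of_scalar_values into a
  # single number before the statistical comparison.  method='std' is the
  # variant for bisecting on the noise of each run rather than its mean.
  if method == 'mean':
    return np.mean(values)
  if method == 'std':
    return np.std(values)
  raise ValueError('unknown aggregation: %s' % method)

# One aggregated value per run, instead of concatenating every value.
runs = [[6.0, 164.8, 26.9], [5.3, 119.9, 23.9]]   # made-up per-run values
per_run_means = [aggregate_run(r, 'mean') for r in runs]
per_run_stds  = [aggregate_run(r, 'std') for r in runs]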
Blocking: 667313
Blocking: 671237
Blocking: 670761
Blocking: 670442
Blocking: 659050
Blocking: 672904
Blocking: 672560
Blocking: 672426
Blocking: 672386
Blocking: 670772
Blocking: 659041
Blocking: 664967
https://bugs.chromium.org/p/chromium/issues/detail?id=664967#c14

Note for later: A page was disabled, hence 2x as many samples for good as bad.
Blocking: 677455
Blocking: 677434
Blocking: 677421
Blocking: 677412
Cc: crouleau@chromium.org
Simon,

How do you get the raw data values from a bisect after it is done? I'm looking at the std deviation numbers for these bisects, and they seem impossibly large. I'm wondering if there's a bug in their calculation.
Cc: simonhatch@chromium.org
re: #c25

Pick any "Bisecting Revision" step, scroll and find the "Compare Samples", there should be 2 in each "Bisecting Revision" and 1 after the initial "Gathering Reference Values"

If you look in there, you'll find all the raw data.

Blocking: 673404
Blocking: 679509
Blocking: 679559
Blocking: 679503
Blocking: 679506
So I did some quick tests here: added a mean to list_of_scalars akin to what the dashboard does, and the results seemed a lot better. I went through the last few bugs blocked against this one, pulled out all the json files, and re-ran compare_samples manually.


679559

Original result:
  "result": {
    "U": 2572678.5, 
    "p": 0.08912829748779405, 
    "significance": "FAIL_TO_REJECT"
  }, 

New Result:
tracing/bin/compare_samples "/usr/local/google/home/simonhatch/tmp/bisect_json/good1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good6.json" "/usr/local/google/home/simonhatch/tmp/bisect_json/bad1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad6.json" frame_times/http___techcrunch.com --chartjson
{"sampleA":[16.427904639201067,16.458113989681777,16.447558139568766,16.425819587615347,16.496716494796818,16.48418041241845],"sampleB":[16.778348285006974,16.710595800494897,16.734767810021353,16.785488126696258,16.71902368421617,16.713241469828787],"result":{"U":0,"p":0.005074868097940222,"significance":"REJECT"}}




679503

Original result:
  "result": {
    "U": 352, 
    "p": 0.14944870623046325, 
    "significance": "FAIL_TO_REJECT"
  }, 

New Result:
tracing/bin/compare_samples "/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g6.json" "/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b6.json" memory:chrome:all_processes:reported_by_chrome:v8:effective_size_max/memory:chrome:all_processes:reported_by_chrome:v8:effective_size_max --chartjson
{"sampleA":[65618580.8,65973312,62538356.8,62881980.8,62823992,65665654.4],"sampleB":[78500612.8,86430537.6,86990617.6,82670548.8,78776180.8,76795252.8],"result":{"U":0,"p":0.005074868097940222,"significance":"REJECT"}}



679509

Original result:
  "result": {
    "U": 13060, 
    "p": 0.2375743650159785, 
    "significance": "FAIL_TO_REJECT"
  }, 

New Result:
tracing/bin/compare_samples "/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g6.json" "/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b6.json" mean_input_event_latency/mean_input_event_latency --chartjson
{"sampleA":[12.868892857142853,12.883678571428572,14.063392857142858,13.938214285714283,11.953214285714287,12.62953571428571],"sampleB":[14.016071428571426,13.82682142857143,15.281678571428568,15.008535714285713,15.695357142857139,15.607178571428571],"result":{"U":3,"p":0.02024057057707751,"significance":"NEED_MORE_DATA"}}

Blocking: 679865
Blocking: 678615
Blocking: 663144
Blocking: 677404
Owner: simonhatch@chromium.org
Status: Fixed (was: Untriaged)
Seems to have made a fairly massive difference; most of those old bisects were able to run through to culprits (or narrowed ranges due to build failures).
\o/
simonhatch@,

Mind linking to the CLs that fixed this?
Sorry, that just gets a little delayed since they're in catapult, and it'll update once the catapult roller lands in chromium.

https://codereview.chromium.org/2620413003/
Ah shoot. Didn't realize they weren't in Chromium yet. I reran ~6 bisects for the videostack team.
The bug only updates once it makes its way to chromium, but the fix is live since the code is in infra/catapult.

Keep in mind this may or may not fix your individual problems; it will help a lot with bisects that could clearly reproduce the issue but weren't proceeding past gathering the reference values.

These often looked something like this:

Regression: 10%
Values on dashboard: 10 -> 11


Bisect results:

chromium@000001  10 +- 9238503928029
chromium@000010  11 +- 3295823095820 (note the crazily large std dev).
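
For illustration (made-up numbers, not from an actual bisect): the per-page values inside a single run span a huge range, so their std dev swamps a 10 -> 11 shift in the per-run mean:

import numpy as np

# One made-up run per revision: the means are ~12 and ~13, but individual
# pages range from ~0 to ~40, so the within-run spread dwarfs the regression.
good_run = [0.5, 2.0, 6.0, 11.5, 40.0]
bad_run  = [0.6, 2.4, 6.5, 13.0, 43.5]

print(np.mean(good_run), np.std(good_run))  # 12.0 +- ~14.5
print(np.mean(bad_run),  np.std(bad_run))   # 13.2 +- ~15.7

With one mean per run, the comparison sees 12.0 vs 13.2 directly instead of two overlapping clouds of page values.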
Components: Speed>Bisection
Blocking: -670761
