Bisect - Some tests are difficult to ever bisect
Issue description

Looking at https://bugs.chromium.org/p/chromium/issues/detail?id=668169#3

The graph shows a regression, and the mean results also show regressions detected, but the bisect never gets confident enough to declare a regression. Looking closely at the values:

"summary": {
    "description": "Total thread duration of all garbage collection events outside of idle tasks",
    "important": true,
    "name": "v8_gc_total_outside_idle",
    "std": 0.0,
    "type": "list_of_scalar_values",
    "units": "ms",
    "values": [6.045000000000001, 164.762, 26.87, 7.93, 4.971, 28.834, 129.6706211584481, 118.28299999999997, 23.936999999999998, 146.387, 6.383, 229.4098391145737, 5.2860000000000005, 63.921931803532296, 119.85, 117.67899999999999, 50.76523786249593, 106.01499999999999, 33.028999999999996, 191.44350393250392, 0.0, 122.02016841186736, 4.891, 9.045, 0.0, 6.9209999999999985, 0.0, 7.002, 10.526, 31.854000000000003, 3.9859999999999998, 0.0, 4.227, 321.018, 33.57, 102.07694816547178, 0.0, 0.0, 263.648]
}

And the bisect results:

===== TESTED REVISIONS =====
Revision         Mean     Std Dev  N    Good?
chromium@433810  101.261  1946.67  195  good
chromium@433835  118.478  2567.28  195  bad

Now the perf dashboard seemingly averages these out, so it seems like the bisect should be doing the same thing? Where does the dashboard decide to do that, just on every list_of_scalar_values?
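To make the discrepancy concrete, here is a minimal Python sketch (not the actual dashboard or bisect code; the per-run lists are made-up toy data) contrasting pooling every raw page value against collapsing each run to its mean, which is roughly what the dashboard chart appears to plot:

import statistics

# Toy per-run data: each inner list stands in for one run's list_of_scalar_values.
runs_good = [[6.0, 164.8, 26.9, 7.9, 0.0], [5.3, 146.4, 23.9, 6.4, 0.0]]
runs_bad = [[7.0, 191.4, 33.0, 9.0, 0.0], [6.9, 229.4, 31.9, 10.5, 0.0]]

def pooled(runs):
    # Concatenate every value from every run (what the bisect comparison sees).
    return [v for run in runs for v in run]

def per_run_means(runs):
    # Collapse each run to a single mean (roughly what the dashboard plots).
    return [statistics.mean(run) for run in runs]

print(statistics.pstdev(pooled(runs_good)))         # huge spread: pages differ wildly
print(statistics.pstdev(per_run_means(runs_good)))  # small spread across runs

The pooled spread is dominated by page-to-page variation, which is consistent with the very large std dev figures in the TESTED REVISIONS table above.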
Dec 1 2016
+dtu

Dec 5 2016
There is a small but clear regression. A KS (Kolmogorov-Smirnov) or Anderson-Darling test would've detected the regression after 20-25 runs, but this bisect appeared to stop after 5. I think there are tradeoffs with every approach. Taking the average would make it harder to detect some kinds of regressions that don't affect the mean much. Also, having different algorithms for the dashboard and bisect means that they're going to be inconsistent in the types of regressions each can detect.
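For illustration only (synthetic data, and not the code the dashboard or bisect actually uses), a quick SciPy sketch of the tradeoff described here: a distribution change that barely moves the mean is visible to a two-sample KS test on the raw values, but would be nearly invisible to a comparison of per-run means:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
good = rng.normal(100, 10, size=25)
# "Bad" build: bimodal, but the overall mean is nearly unchanged.
bad = np.concatenate([rng.normal(60, 5, size=12), rng.normal(140, 5, size=13)])

print(good.mean(), bad.mean())    # the means are close
print(stats.ks_2samp(good, bad))  # KS flags the change in shape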
Dec 5 2016
We have different algorithms for dashboard and bisect already. I agree we should work towards moving them closer, but in the short term, I think we still want to dig more deeply into which algorithms work best for bisect?
Dec 5 2016
The most pragmatic solution I can think of is to add a 'Bisect on the mean of each run' flag and preset it for specific metrics, allowing the sheriff to override it. This could be piped through to compare_samples, which would in turn average each list of scalars in the sample instead of concatenating them all. The same approach could be used to bisect on the stddev of each run.
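A rough sketch of how that could look in the sample-extraction step; the aggregate parameter and helper below are hypothetical, not the real compare_samples interface:

import statistics

def extract_sample(list_of_scalar_values, aggregate=None):
    # aggregate=None   -> contribute the raw values (current behaviour: concatenate)
    # aggregate='mean' -> contribute one point per run: the run's mean
    # aggregate='std'  -> contribute one point per run: the run's standard deviation
    if aggregate == 'mean':
        return [statistics.mean(list_of_scalar_values)]
    if aggregate == 'std':
        return [statistics.pstdev(list_of_scalar_values)]
    return list(list_of_scalar_values)

# Three hypothetical runs of a noisy metric.
runs = [[5.0, 160.0, 25.0], [6.0, 150.0, 30.0], [4.0, 170.0, 20.0]]
sample_raw = [v for run in runs for v in extract_sample(run)]
sample_mean = [v for run in runs for v in extract_sample(run, aggregate='mean')]
print(sample_raw)   # 9 widely spread points
print(sample_mean)  # 3 tightly clustered per-run means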
Dec 15 2016
https://bugs.chromium.org/p/chromium/issues/detail?id=664967#c14
Note for later: a page was disabled, hence 2x as many samples for good as bad.
Jan 9 2017
Simon, how do you get the raw data values from a bisect after it's done? I'm looking at the standard deviation numbers for these bisects, and they seem impossibly large. I'm wondering if there's a bug in their calculation.
Jan 9 2017
re: #c25: Pick any "Bisecting Revision" step, then scroll and find the "Compare Samples" steps; there should be 2 in each "Bisecting Revision" and 1 after the initial "Gathering Reference Values". If you look in there, you'll find all the raw data.
Jan 11 2017
So I did some quick tests here: I added a mean to list_of_scalars, akin to what the dashboard does, and the results seemed a lot better. I went through the last few bugs blocked against this one, pulled out all the JSON files, and re-ran compare_samples manually.
679559
Original result:
"result": {
"U": 2572678.5,
"p": 0.08912829748779405,
"significance": "FAIL_TO_REJECT"
},
New Result:
tracing/bin/compare_samples "/usr/local/google/home/simonhatch/tmp/bisect_json/good1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/good6.json" "/usr/local/google/home/simonhatch/tmp/bisect_json/bad1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/bad6.json" frame_times/http___techcrunch.com --chartjson
{"sampleA":[16.427904639201067,16.458113989681777,16.447558139568766,16.425819587615347,16.496716494796818,16.48418041241845],"sampleB":[16.778348285006974,16.710595800494897,16.734767810021353,16.785488126696258,16.71902368421617,16.713241469828787],"result":{"U":0,"p":0.005074868097940222,"significance":"REJECT"}}
679503
Original result:
"result": {
"U": 352,
"p": 0.14944870623046325,
"significance": "FAIL_TO_REJECT"
},
New Result:
tracing/bin/compare_samples "/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/g6.json" "/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679503/b6.json" memory:chrome:all_processes:reported_by_chrome:v8:effective_size_max/memory:chrome:all_processes:reported_by_chrome:v8:effective_size_max --chartjson
{"sampleA":[65618580.8,65973312,62538356.8,62881980.8,62823992,65665654.4],"sampleB":[78500612.8,86430537.6,86990617.6,82670548.8,78776180.8,76795252.8],"result":{"U":0,"p":0.005074868097940222,"significance":"REJECT"}}
679509
Original result:
"result": {
"U": 13060,
"p": 0.2375743650159785,
"significance": "FAIL_TO_REJECT"
},
New Result:
tracing/bin/compare_samples "/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/g6.json" "/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b1.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b2.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b3.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b4.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b5.json,/usr/local/google/home/simonhatch/tmp/bisect_json/679509/b6.json" mean_input_event_latency/mean_input_event_latency --chartjson
{"sampleA":[12.868892857142853,12.883678571428572,14.063392857142858,13.938214285714283,11.953214285714287,12.62953571428571],"sampleB":[14.016071428571426,13.82682142857143,15.281678571428568,15.008535714285713,15.695357142857139,15.607178571428571],"result":{"U":3,"p":0.02024057057707751,"significance":"NEED_MORE_DATA"}}
Jan 13 2017
Seems to have made a fairly massive difference; most of those old bisects were able to run through to culprits (or to narrowed ranges, due to build failures).
Jan 13 2017
\o/
Jan 13 2017
simonhatch@, Mind linking to the CLs that fixed this?
Jan 13 2017
Sorry, that just gets a little delayed since they're in catapult, and it'll update once the catapult roller lands in Chromium. https://codereview.chromium.org/2620413003/
Jan 13 2017
Ah shoot. Didn't realize they weren't in Chromium yet. I reran ~6 bisects for videostack team.
Jan 13 2017
The bug only updates once it makes its way to chromium, but the fix is live since the code is in infra/catapult. Keep in mind this may or may not fix your individual problems; it will help a lot with bisects where the issue could clearly be reproduced but the bisect wasn't proceeding past gathering the reference values. Those often looked something like this:
Regression: 10%
Values on dashboard: 10 -> 11
Bisect results:
chromium@000001  10 +- 9238503928029
chromium@000010  11 +- 3295823095820
(note the crazily large std dev)