In some bisects the values are incredibly noisy, bi-modal, etc., and we still press ahead and bisect anyway. Here are some examples:
https://bugs.chromium.org/p/chromium/issues/detail?id=669184#c7
https://bugs.chromium.org/p/chromium/issues/detail?id=667836
Copy/paste my comment from 667836:
It feels like it should have bailed at some point and said that it couldn't reproduce the regression, but the initial run actually did happen to produce a clear regression:
Here are the values from Gathering Reference Values:
{
"result": {
"U": 2,
"p": 0.001947527585946629,
"significance": "REJECT"
},
"sampleA": [
7674880,
7994368,
7932928,
8346624,
7887872,
7658496,
8158208,
8084480
],
"sampleB": [
4197376,
7789568,
3419136,
3685376,
3226624,
3738624,
3398656,
3195904
]
}
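For context, that `result` block is a two-sided Mann-Whitney U test. A quick pure-Python sketch (my own re-derivation using the normal approximation with continuity correction, not the bisect's actual code) reproduces the reported statistic from these samples:

```python
import math

# Reference values copied from the JSON above (bytes).
sample_a = [7674880, 7994368, 7932928, 8346624, 7887872, 7658496, 8158208, 8084480]
sample_b = [4197376, 7789568, 3419136, 3685376, 3226624, 3738624, 3398656, 3195904]

# U statistic: the smaller of the two pairwise "win" counts.
u = min(sum(1 for b in sample_b for a in sample_a if b > a),
        sum(1 for a in sample_a for b in sample_b if a > b))

# Two-sided p-value via the normal approximation with continuity correction.
n1, n2 = len(sample_a), len(sample_b)
mu = n1 * n2 / 2.0
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
z = (u + 0.5 - mu) / sigma
p = math.erfc(-z / math.sqrt(2))

print(u, p)  # U = 2, p ~ 0.0019 -> significant, hence "REJECT"
```

Only one value in sampleB (7789568) beats any of sampleA, which is why U comes out so small and the difference looks significant.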
Other commits weren't as clear, though, and the data almost seems bi-modal. Here are the values from 432242:
"sampleA": [
3215360,
8011776, <----
8192000, <----
7970816, <----
8355840, <----
7929856, <----
3375104,
8110080, <----
6689792,
8184832, <----
3216384,
3314688,
3286016,
3175424,
3216384,
3294208,
7770112, <----
7655424, <----
3473408,
3309568,
3387392,
3469312,
3497984,
3190784,
3563520,
8122368, <----
3338240
],
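One hypothetical way to catch data like this would be a cheap bi-modality check, e.g. flagging a sample whose sorted values contain one dominant gap. This is purely my own heuristic sketch (the function name and the threshold are made up, not anything the bisect currently does):

```python
def looks_bimodal(values, gap_fraction=0.3):
    """Heuristic: flag a sample as bi-modal if the largest gap between
    consecutive sorted values covers more than gap_fraction of the full
    range. gap_fraction is an arbitrary, untuned threshold."""
    vals = sorted(values)
    spread = vals[-1] - vals[0]
    if spread == 0:
        return False
    largest_gap = max(b - a for a, b in zip(vals, vals[1:]))
    return largest_gap / spread > gap_fraction

# The sampleA values from 432242 above.
sample = [3215360, 8011776, 8192000, 7970816, 8355840, 7929856, 3375104,
          8110080, 6689792, 8184832, 3216384, 3314688, 3286016, 3175424,
          3216384, 3294208, 7770112, 7655424, 3473408, 3309568, 3387392,
          3469312, 3497984, 3190784, 3563520, 8122368, 3338240]

print(looks_bimodal(sample))  # True: the ~3.3M and ~8M clusters leave a big gap
```

On this sample the gap between the two clusters is about 60% of the total spread, so even a crude check like this would have flagged it.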
One interesting thing to note is that by the time the bisect had finished, it had expanded the number of tests for the "good" revision:
"sampleA": [
7674880,
7994368,
7932928,
8346624,
7887872,
7658496,
8158208,
8084480,
3453952,
3245056,
3294208,
3249152,
3166208,
3297280,
3268608,
8044544,
3338240,
3174400
],
If you re-test those via compare_samples, you don't have a clear regression anymore:
./tracing/bin/compare_samples ~/tmp/fake_metric_compare_samples1.json ~/tmp/fake_metric_compare_samples2.json Fake/Score --chartjson
{"sampleA":[7674880,7994368,7932928,8346624,7887872,7658496,8158208,8084480,3453952,3245056,3294208,3249152,3166208,3297280,3268608,8044544,3338240,3174400],"sampleB":[4197376,7789568,3419136,3685376,3226624,3738624,3398656,3195904],"result":{"U":58,"p":0.4532547047537364,"significance":"NEED_MORE_DATA"}}
I wonder if we could do something like re-comparing previous runs as more samples are added; if we end up with a different answer, maybe bail and say the test is too noisy?
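A rough sketch of that idea, using the numbers from this bug (the helper and the 0.05 cutoff are my own illustration, not bisect code): re-run the comparison whenever samples are appended, and bail if the verdict flips.

```python
import math

def two_sided_mwu_p(a, b):
    """Two-sided Mann-Whitney p-value via the normal approximation with
    continuity correction (matches the U/p pairs reported above)."""
    n1, n2 = len(a), len(b)
    u = min(sum(1 for y in b for x in a if y > x),
            sum(1 for x in a for y in b if x > y))
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u + 0.5 - mu) / sigma
    return math.erfc(-z / math.sqrt(2))

ALPHA = 0.05  # arbitrary significance cutoff for this sketch

# Initial reference run: clear regression (p ~ 0.0019 -> REJECT).
good = [7674880, 7994368, 7932928, 8346624, 7887872, 7658496, 8158208, 8084480]
bad = [4197376, 7789568, 3419136, 3685376, 3226624, 3738624, 3398656, 3195904]
initial_verdict = two_sided_mwu_p(good, bad) < ALPHA

# Same comparison after the bisect expanded the "good" sample.
good += [3453952, 3245056, 3294208, 3249152, 3166208, 3297280, 3268608,
         8044544, 3338240, 3174400]
later_verdict = two_sided_mwu_p(good, bad) < ALPHA

if initial_verdict != later_verdict:
    print("Verdict flipped as samples were added -- too noisy, bail out.")
```

With the expanded sample the p-value jumps from ~0.0019 to ~0.45 (the NEED_MORE_DATA result above), so the verdict flip is exactly the signal this check would key off.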
Comment 1 by simonhatch@chromium.org, Nov 29 2016