Bisect: bad classification of memory results
Issue description

https://build.chromium.org/p/tryserver.chromium.perf/builders/winx64intel_perf_bisect/builds/1649
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Ftryserver.chromium.perf%2Fwinx64intel_perf_bisect%2F1649%2F%2B%2Frecipes%2Fsteps%2FPost_bisect_results%2F0%2Flogs%2FDebug_Info%2F0

    {
      "build_id": null,
      "failed": false,
      "mean_value": 45555958.85714286,
      "std_dev": 96884515.33891024,
      "commit_hash": "03f4a0e3d173271a756cf70d28303b8917096ade",
      "revision_string": "chromium@476252",
      "failure_reason": null,
      "n_observations": 14,
      "depot_name": "chromium",
      "result": "good"
    },
    ...
    {
      "build_id": null,
      "failed": false,
      "mean_value": 41436260.571428575,
      "std_dev": 70361912.1701477,
      "commit_hash": "4aba27cc3025b6a9e1f25d24d1f1950eed5c44aa",
      "revision_string": "chromium@476293",
      "failure_reason": null,
      "n_observations": 14,
      "depot_name": "chromium",
      "result": "bad"
    },
    ...
    {
      "build_id": null,
      "failed": false,
      "mean_value": 84663786.66666667,
      "std_dev": 82620417.86141819,
      "commit_hash": "fa8183309600f1c2b8c4ad5a60aee69761916a9e",
      "revision_string": "chromium@476334",
      "failure_reason": null,
      "n_observations": 6,
      "depot_name": "chromium",
      "result": "bad"
    }

r476293 should have been classified as good.
Jun 7 2017
I've been watching the perf bugs/bisects go by pretty closely, and I have seen hardly any of these misclassification problems lately (bisects failing to reproduce the regression is definitely a much bigger issue). I think this can wait until Pinpoint, unless the desktop memory rotation that was just added ends up seeing lots of these due to the nature of its data.
Aug 1
Oct 4
Comment 1 by simonhatch@chromium.org, Jun 7 2017

Poking around at this, it looks like the test against good fails due to 2 outliers:

    {
      "result": { "U": 4, "p": 0.007512726348775978, "significance": "REJECT" },
      "sampleA": [ 37487424, 36766528, 36979520, 37290816, 37520192, 37339968, 36979520, 36733760, 37200704 ],
      "sampleB": [ 36602688, 36733760, 36733760, 26129216, 26129216, 36979520 ]
    }

But the test against bad is just a smidge above the threshold for reject:

    {
      "result": { "U": 12, "p": 0.08719762543893439, "significance": "NEED_MORE_DATA" },
      "sampleA": [ 37487424, 36766528, 36979520, 37290816, 37520192, 37339968, 36979520, 36733760, 37200704 ],
      "sampleB": [ 108553024, 108561216, 36848448, 108323648, 108618560, 37077824 ]
    }

If both had been REJECT, it would have chosen whichever of good/bad has the closer average. I think in Pinpoint, cases like this should be relatively easy to steer back on course, but I'm not sure what we can do here. There are probably always going to be some edge cases like this. Dave, any ideas?
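
For anyone who wants to check the numbers, the two results quoted above can be reproduced with a short sketch. It assumes the comparison is a two-sided Mann-Whitney U test using the normal approximation with tie and continuity corrections; that is an inference from the logged U and p values, not something confirmed from the bisect code. The function name `mann_whitney_u` and the constant names are illustrative only.

```python
import math
from collections import Counter

# Samples copied from the debug output quoted in this comment.
SAMPLE_A = [37487424, 36766528, 36979520, 37290816, 37520192,
            37339968, 36979520, 36733760, 37200704]
VS_GOOD = [36602688, 36733760, 36733760, 26129216, 26129216, 36979520]
VS_BAD = [108553024, 108561216, 36848448, 108323648, 108618560, 37077824]


def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p) for two samples.

    Assumption: normal approximation with tie and continuity
    corrections. This is a sketch, not the bisect's implementation.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Count pairs where a beats b; a tie counts as half a win.
    u_a = sum(1.0 if a > b else 0.5 if a == b else 0.0
              for a in sample_a for b in sample_b)
    u = min(u_a, n_a * n_b - u_a)

    # Variance of U, reduced for tied values in the pooled sample.
    n = n_a + n_b
    tie_term = sum(t ** 3 - t for t in Counter(sample_a + sample_b).values())
    variance = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # every value identical: no evidence of a difference

    # Continuity-corrected z score, clamped so it never goes negative.
    z = max(0.0, abs(u - n_a * n_b / 2.0) - 0.5) / math.sqrt(variance)
    p = math.erfc(z / math.sqrt(2))  # two-sided tail probability
    return u, p


if __name__ == "__main__":
    for label, other in (("vs good", VS_GOOD), ("vs bad", VS_BAD)):
        u, p = mann_whitney_u(SAMPLE_A, other)
        print(label, "U =", u, "p =", p)
```

Under these assumptions the sketch gives U = 4 against good and U = 12 against bad, matching the logged values. Because the test is rank-based, the two 26129216 outliers count exactly as much as any other low value would, which is enough to push the comparison against good into REJECT.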