
Issue 730636


Issue metadata

Status: Archived
Owner:
Closed: Oct 4
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Bisect: bad classification of memory results

Project Member Reported by sullivan@chromium.org, Jun 7 2017

Issue description

https://build.chromium.org/p/tryserver.chromium.perf/builders/winx64intel_perf_bisect/builds/1649

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Ftryserver.chromium.perf%2Fwinx64intel_perf_bisect%2F1649%2F%2B%2Frecipes%2Fsteps%2FPost_bisect_results%2F0%2Flogs%2FDebug_Info%2F0

    {
      "build_id": null,
      "failed": false,
      "mean_value": 45555958.85714286,
      "std_dev": 96884515.33891024,
      "commit_hash": "03f4a0e3d173271a756cf70d28303b8917096ade",
      "revision_string": "chromium@476252",
      "failure_reason": null,
      "n_observations": 14,
      "depot_name": "chromium",
      "result": "good"
    },
...
    {
      "build_id": null,
      "failed": false,
      "mean_value": 41436260.571428575,
      "std_dev": 70361912.1701477,
      "commit_hash": "4aba27cc3025b6a9e1f25d24d1f1950eed5c44aa",
      "revision_string": "chromium@476293",
      "failure_reason": null,
      "n_observations": 14,
      "depot_name": "chromium",
      "result": "bad"
    },
...
    {
      "build_id": null,
      "failed": false,
      "mean_value": 84663786.66666667,
      "std_dev": 82620417.86141819,
      "commit_hash": "fa8183309600f1c2b8c4ad5a60aee69761916a9e",
      "revision_string": "chromium@476334",
      "failure_reason": null,
      "n_observations": 6,
      "depot_name": "chromium",
      "result": "bad"
    }

r476293 should have been classified as good.
 
Poking around at this, it looks like the test against good fails because of two outliers:

{
  "result": {
    "U": 4,
    "p": 0.007512726348775978,
    "significance": "REJECT"
  },
  "sampleA": [
    37487424,
    36766528,
    36979520,
    37290816,
    37520192,
    37339968,
    36979520,
    36733760,
    37200704
  ],
  "sampleB": [
    36602688,
    36733760,
    36733760,
    26129216,
    26129216,
    36979520
  ]
}

But the test against bad is just a smidge above the threshold for reject:

{
  "result": {
    "U": 12,
    "p": 0.08719762543893439,
    "significance": "NEED_MORE_DATA"
  },
  "sampleA": [
    37487424,
    36766528,
    36979520,
    37290816,
    37520192,
    37339968,
    36979520,
    36733760,
    37200704
  ],
  "sampleB": [
    108553024,
    108561216,
    36848448,
    108323648,
    108618560,
    37077824
  ]
}
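The two comparisons above can be re-checked with a short sketch. It assumes the bisect computes a Mann-Whitney U statistic with a two-sided normal approximation, including tie and continuity corrections. That is a guess based on the "U" and "p" fields, not something confirmed from the bisect code. The function name `mann_whitney_u` and the constant names are made up for illustration.

```python
import math
from collections import Counter

def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p) via a normal approximation.

    Assumption: tie correction plus continuity correction; the real
    bisect implementation may differ.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Count pairs where the value from sample_a is larger; ties count half.
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )
    u = min(u_a, n_a * n_b - u_a)

    n = n_a + n_b
    # Tie correction: sum of (t^3 - t) over groups of equal values.
    tie_term = sum(t**3 - t for t in Counter(list(sample_a) + list(sample_b)).values())
    variance = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # every value identical, no evidence of a difference
    z = max(n_a * n_b / 2.0 - u - 0.5, 0.0) / math.sqrt(variance)
    return u, math.erfc(z / math.sqrt(2.0))

# Samples copied from the two debug dumps above.
SAMPLE_A = [37487424, 36766528, 36979520, 37290816, 37520192,
            37339968, 36979520, 36733760, 37200704]
VS_GOOD_B = [36602688, 36733760, 36733760, 26129216, 26129216, 36979520]
VS_BAD_B = [108553024, 108561216, 36848448, 108323648, 108618560, 37077824]

if __name__ == "__main__":
    print(mann_whitney_u(SAMPLE_A, VS_GOOD_B))
    print(mann_whitney_u(SAMPLE_A, VS_BAD_B))
```

Under those assumptions the U values come out as 4 and 12, matching the dumps, and the p-values land close to the reported 0.0075 and 0.0872. Note how the two 26129216 readings sit below every value in sampleA, which drives the first test's U toward zero, while the two ~37M readings in the second sampleB are enough to keep that test from rejecting.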

If both tests had come back REJECT, it would have chosen whichever of good/bad had the closer average. I think in Pinpoint, cases like this should be relatively easy to steer back on course, but I'm not sure what we can do here. There are probably always going to be some edge cases like this. Dave, any ideas?
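A rough sketch of the classification rule as described in this thread, to show why the revision ended up "bad". The function `classify`, its arguments, and the handling of the case where neither test rejects are hypothetical and are not taken from the bisect source. Only the "both reject, pick the closer average" branch and the observed outcome come from the discussion here.

```python
def classify(sig_vs_good, sig_vs_bad, mean, good_mean, bad_mean):
    """Classify a revision from its two significance results (hypothetical helper)."""
    differs_from_good = sig_vs_good == "REJECT"
    differs_from_bad = sig_vs_bad == "REJECT"
    if differs_from_good and differs_from_bad:
        # Both tests reject: fall back to whichever reference mean is closer.
        closer_to_good = abs(mean - good_mean) <= abs(mean - bad_mean)
        return "good" if closer_to_good else "bad"
    if differs_from_good:
        return "bad"
    if differs_from_bad:
        return "good"
    return "need_more_data"  # assumption: neither test was conclusive

def mean(values):
    return sum(values) / len(values)

# Sample means taken from the debug dumps in this issue.
REVISION_MEAN = mean([37487424, 36766528, 36979520, 37290816, 37520192,
                      37339968, 36979520, 36733760, 37200704])
GOOD_MEAN = mean([36602688, 36733760, 36733760, 26129216, 26129216, 36979520])
BAD_MEAN = mean([108553024, 108561216, 36848448, 108323648, 108618560, 37077824])

if __name__ == "__main__":
    # What happened: REJECT against good, NEED_MORE_DATA against bad.
    print(classify("REJECT", "NEED_MORE_DATA", REVISION_MEAN, GOOD_MEAN, BAD_MEAN))
    # What would have happened if the test against bad had also rejected.
    print(classify("REJECT", "REJECT", REVISION_MEAN, GOOD_MEAN, BAD_MEAN))
```

With these inputs the first call yields "bad" and the second yields "good", since the revision's sample mean (about 37.1M) is far closer to the good sample's mean (about 33.2M) than to the bad sample's (about 84.7M).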


I've been watching the perf bugs/bisects go by pretty closely, and I've seen hardly any of these misclassification problems lately (bisects failing to reproduce is definitely a much bigger issue). I think this can wait until Pinpoint, unless the desktop memory rotation that was just added ends up seeing lots of these due to the nature of its data.
Status: Assigned (was: Untriaged)
Status: Archived (was: Assigned)
