Incorrect revision range given for chromium.perf build
Issue description

Filing this bug after a rough time identifying the culprit responsible for https://bugs.chromium.org/p/chromium/issues/detail?id=675034. That bug eventually tracked down the CL responsible for smoothness.sync_scroll.key_mobile_sites_smooth starting to fail. However, we were misled for a long time by an incorrect revision range given on the builders.

For example, look at Android Nexus 5 (https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5%20Perf%20%283%29?numbuilds=200). The build where the benchmark failures began is clear (see smoothness_fail_start.png). Looking at that build, it becomes obvious that we're in luck: there's only one CL in it! (See smoothness_false_lead.png.) The CL's description and touched files seemed innocuous, so I decided to launch a perf try job to verify that reverting it would actually fix the regression before just landing the revert. I did that in https://codereview.chromium.org/2580053002/. The result: the revert had no effect. (See revert_no_effect.png.)

At this point, I was bamboozled. sullivan@ suggested widening the suspected revision range and performing a bisect to identify the culprit. I did this here (https://chromeperf.appspot.com/buildbucket_job_status/8992819288306958080), which ultimately identified https://codereview.chromium.org/2572893002 as the responsible CL. That CL had commit position 438961.

Let's map that back to the bot status page: see missing_cls.png for how each of these builds maps back to a given commit position range. Once we do this, it becomes immediately clear that the CLs with commit positions 438962...439068 aren't represented in ANY build's revision range. Or, put another way, build #4113 (https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5%20Perf%20%283%29/builds/4113, shown earlier in smoothness_false_lead.png) misrepresented its revision range by claiming that 439069 was the only new commit in that build. In reality, there were 108 (!) new commits in that range. I have no idea what might have caused this.
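For anyone who hits a similar mismatch, the cross-check described above (mapping each build back to a commit position range and looking for positions no build covers) is easy to script. A minimal sketch, with hand-entered, illustrative ranges rather than anything scraped from the builder pages:

```python
# Minimal sketch: given the commit-position range each build claims, find
# positions that no build's blamelist covers. The ranges below are
# illustrative; in practice they'd be scraped from the builder status page.
builds = {
    4112: (438900, 438961),  # (first, last) commit position in the build
    4113: (439069, 439069),  # claims a single new commit
    4114: (439070, 439120),
}

covered = set()
for first, last in builds.values():
    covered.update(range(first, last + 1))

lo = min(first for first, _ in builds.values())
hi = max(last for _, last in builds.values())
missing = [cp for cp in range(lo, hi + 1) if cp not in covered]

if missing:
    # With ranges like the ones in this bug, this reports the
    # 438962..439068 gap that no build accounted for.
    print("uncovered commit positions: %d..%d (%d total)"
          % (min(missing), max(missing), len(missing)))
```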
,
Dec 20 2016
,
Dec 20 2016
,
Dec 20 2016
The biggest thing I notice is that build 4113 was triggered from Android Compile, while all the builds around it were triggered from Android Builder. Maybe there was some sort of configuration change (a recipe change?) that then got reverted, which would explain why no builds were triggered for the missing revisions.
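One way to check that theory without clicking through every build: buildbot exposes per-build metadata over its JSON interface. A rough sketch, assuming the /json/builders/<name>/builds/<n> endpoint that build.chromium.org served at the time, and assuming the trigger source shows up in the build's 'reason' field or in a 'parent_buildername' property (the exact field and property names are a guess and may differ):

```python
import json
import urllib.request
from urllib.parse import quote

# Sketch only: assumes the buildbot 0.8-style JSON endpoint that
# build.chromium.org exposed at the time; the exact field/property names
# carrying the trigger source are a guess.
BASE = "https://build.chromium.org/p/chromium.perf/json/builders"
BUILDER = "Android Nexus5 Perf (3)"

def trigger_info(build_number):
    url = "%s/%s/builds/%d" % (BASE, quote(BUILDER), build_number)
    with urllib.request.urlopen(url) as resp:
        build = json.load(resp)
    # 'reason' usually names the scheduler/trigger; properties are
    # (name, value, source) triples, and chromium builds often carry a
    # 'parent_buildername' property for triggered builds (assumption).
    props = {name: value for name, value, _ in build.get("properties", [])}
    return build.get("reason"), props.get("parent_buildername")

for n in range(4108, 4118):  # a window around build 4113
    print(n, trigger_info(n))
```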
,
Dec 20 2016
I think Annie's correct. It seems this is a problem with Gitiles poller blamelists not being 100% correct. If you look at build 4112, there are duplicate commits in the blamelist, and as Charlie said, 4113 has one commit in the blamelist even though the build shows the correct git hash range. So I guess I don't understand this well enough.

Aaron - does your comment mean that 4113 wasn't triggered by the Gitiles poller?

Erik/Andy - this is either platform or crossover, but it seems others have had similar issues with the Gitiles poller. Could we find someone to look into this?
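For reference, checking a set of blamelists for duplicated commits (within one build or across neighboring builds) is also scriptable. A sketch over placeholder data, assuming the blamelists have already been pulled from the status pages:

```python
# Sketch: flag duplicate commits within a single blamelist and commits that
# show up in more than one build's blamelist. The hashes are placeholders.
blamelists = {
    4112: ["aaa111", "bbb222", "bbb222"],  # duplicate within one build
    4113: ["ccc333"],
    4114: ["ccc333", "ddd444"],            # overlaps with the previous build
}

seen = {}  # commit hash -> first build it appeared in
for build, commits in sorted(blamelists.items()):
    dupes = sorted({c for c in commits if commits.count(c) > 1})
    if dupes:
        print("build %d repeats commits: %s" % (build, dupes))
    for c in commits:
        if c in seen and seen[c] != build:
            print("commit %s appears in builds %d and %d" % (c, seen[c], build))
        seen.setdefault(c, build)
```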
,
Dec 20 2016
,
Dec 20 2016
I'd like to investigate Aaron's theory about a recipes change more before we dig deeper into gitiles polling. Aaron, do you suggest just looking through build revisions in that time range?
,
Dec 20 2016
(I don't have cycles to own this right now, sorry; I just posted the comment because benhenry pinged me directly.) I don't have a firm grasp on how triggering works these days, but I believe it can be done entirely from inside recipes. There might have been a recipe change that altered how triggering worked for a short time on the 17th and then was reverted. Similarly, there might have been a change to those json config files in chromium/src, which I *think* control exactly which bots trigger which others? Really, that piece of the investigation should be assigned to someone who understands recipe-based triggering and knows all the right places to look. But yes, examining the log during that time period (and slightly before it, due to lag) is where I'd start.
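For whoever picks this up, the "examine the log during that time period" step can be narrowed with git. A rough sketch, assuming local checkouts of the recipes repo and chromium/src; the paths passed to git log are placeholders and would need to be confirmed by someone who actually knows where recipe-based triggering is configured:

```python
import subprocess

# Sketch only: list commits that touched trigger-related files around the
# suspect window (the 17th, with a day of slack on each side). The paths
# below are placeholders; substitute the real recipe / trigger-config files.
def log_in_window(repo_dir, paths):
    cmd = [
        "git", "-C", repo_dir, "log", "--oneline",
        "--since=2016-12-16", "--until=2016-12-19", "--",
    ] + paths
    return subprocess.run(cmd, capture_output=True, text=True).stdout

print(log_in_window("build", ["scripts/slave/recipes"]))      # placeholder path
print(log_in_window("chromium_src", ["testing/buildbot"]))    # placeholder path
```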
,
Jan 10 2017
benhenry@, any suggestion of who might be able to investigate this further and push it to completion? Even if there's no bandwidth for it right now, it seems like an important enough part of our infrastructure that we don't want to drop this.
,
Jan 10 2017
Yeah, this seems more like a P1, actually. Erik - can we get someone assigned to this?
,
Jan 17 2017
/bump to estaab@
,
Jan 18 2017
,
Jan 23 2017
Robbie, do you think you can investigate this? If we can understand the root cause and the likelihood of it happening again, we can better determine the priority of this. Has this happened again since December?
,
Feb 10 2017
,
Mar 1 2017
I haven't seen it happen since December but I definitely think it's still sometimes a problem unless we've done something to fix it since.
,
Mar 2 2017
This is going to be super difficult to track down, IMO. The behavior is random, and it's unclear how it happens. We could archive this bug.
,
Mar 24 2017
Sorry this didn't get any attention :( Martiniss has been working on perf-related things recently and might have some luck investigating. I would recommend, however, that we just ensure this doesn't happen in milo, which is slated to replace the buildbot UI sometime in Q2. This bug looks like purely an artifact of the way that buildbot handles git polling, revision ranges, and changelists.
,
Mar 24 2017
Got it. If you think this is the case, I'm fine with just closing this as WontFix with that justification. It doesn't seem to happen enough that it's a huge problem.
,
Dec 11 2017
I think blamelists have changed since this bug was filed, and since we moved to LUCI UI. WontFix-ing.