Don't mark cancelled builds as "infrastructure failure"
Issue description
(Inspired by http://o/e/m250d93bec8000152, with corresponding discussion in the troopers weekly meeting - https://docs.google.com/a/google.com/document/d/12SNu0lSgijd5PkwOzaKZ7jNmoUV74MD5yvbT3s17aUc/edit?disco=AAAABCB8lQU) Cancelled builds shouldn't count as "infrastructure failures". The suggestion is to create a new Build status of "Cancelled" or "Aborted" or something similar (if possible). Note that builds can be cancelled via Buildbucket as well as through the Buildbot WebUI.
,
Mar 20 2017
,
Mar 20 2017
,
Mar 27 2017
This issue just triggered a page - http://o/e/m2513081f9000002e
,
Mar 27 2017
,
Mar 29 2017
Another page caused by this issue: http://o/e/m25141021c8000001
,
Mar 29 2017
,
Apr 5 2017
I'm not sure how easy this is to solve with buildbot. Nodir, since you increased the priority do you have an idea? Should we do this now or once buildbot is out of the code path?
,
Apr 7 2017
,
Apr 19 2017
I've misinterpreted this bug. Buildbot does not distinguish builds that failed due to an EXCEPTION from cancelled builds, because cancelling is implemented via an exception. The root cause has nothing to do with Milo or Swarmbucket. The metric is implemented as a part of mastermon, I think.
,
May 4 2017
another page triggered http://o/e/m2525183748000009
,
May 11 2017
,
Jun 8 2017
another page triggered https://o.corp.google.com/#Escalator:m250d93bec8000152
,
Jun 12 2017
Hi Dirk, can you see if you can find someone to look into this?
,
Jun 12 2017
Erik, I'll let you triage / prioritize this.
,
Jun 14 2017
Adding this as a concept to buildbot will be significant work, given that cancellations are implemented as exceptions (purple). We can do this much more easily in buildbucket and should make sure we support it when we port monitoring to kitchen. I'm going to put this under buildbucket since I think that's most appropriate. Eric, how do we want to handle incoming bugs that we want to add to our schedule?
,
Jun 14 2017
There is nothing to do in the swarmbucket case. The bug is specific to buildbot; this defect does not exist outside of buildbot.
,
Jun 14 2017
Is that because you can't cancel a swarmbucket build, or because we have a different way of reporting the build as cancelled?
,
Jun 14 2017
Neither. You can cancel a swarmbucket build. The way of reporting a build as cancelled is the same for swarmbucket builds and for buildbucket builds that are executed by buildbot.
Let me rephrase/correct myself: this bug is in master monitoring. It does not distinguish a cancelled build from a true status=EXCEPTION build. This monitoring code runs on master machines, thus changes to buildbucket, swarmbucket or any part of LUCI won't help here.
The LOC in question is https://chromium.googlesource.com/chromium/tools/build/+/6fbefca00bc22d0e950f92a0e0cb945d16e6ebf4/scripts/master/status_logger.py#540
Currently the value of 'result' is 'exception' for cancelled builds. This is incorrect and should be fixed.
How to determine that a build with result=exception is actually a cancelled build? Something like this: https://chromium.googlesource.com/chromium/tools/build/+/6fbefca00bc22d0e950f92a0e0cb945d16e6ebf4/scripts/master/buildbucket/integration.py#529
In my opinion, whoever owns buildbot master monitoring should own this bug.
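For illustration only, here is a rough sketch of the kind of remapping described above: report 'cancelled' instead of 'exception' when the build looks like it was cancelled. This is not the code in status_logger.py or integration.py. The function names, the `CANCEL_MARKERS` tuple, and the idea of detecting cancellation from status text fragments are all assumptions made for the example; the real signal buildbot exposes may be different.

```python
# Hypothetical sketch: remap 'exception' to 'cancelled' before reporting.
# CANCEL_MARKERS and the status-text heuristic are assumptions, not the
# actual check used by buildbot or the buildbucket integration.
CANCEL_MARKERS = ('interrupted', 'cancel')


def looks_cancelled(status_text):
    """Return True if any status text fragment hints at a cancellation."""
    return any(marker in fragment.lower()
               for fragment in status_text
               for marker in CANCEL_MARKERS)


def monitored_result(result, status_text):
    """Return the result string that monitoring should report.

    Only builds reported as 'exception' are candidates for remapping;
    every other result is passed through unchanged.
    """
    if result == 'exception' and looks_cancelled(status_text):
        return 'cancelled'
    return result
```

With a remapping like this, an alert on infrastructure failures could count only builds whose reported result is still 'exception', so a cancelled build would no longer trigger a page.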
,
Jun 18 2017
Sorry, I realize I could have been more clear in comment 16. I wanted to make sure cancelled builds are a concept in the LUCI/buildbucket world. We're going to have a similar alert after monitoring is implemented and we should make sure this problem is addressed. I'm refocussing this bug to LUCI from buildbot since I don't think it's worth building these concepts into buildbot at this point but I also don't want us to keep having it going forward.
,
Jul 25 2017
,
Apr 30 2018
this becomes increasingly irrelevant due to LUCI
Comment 1 by philwright@chromium.org, Mar 16 2017
Labels: -Pri-3 Pri-2