Improve build monitoring
Issue description

This is the bug to track discussion of the recent outage in which a bad image push resulted in a Linux bot outage.

- How do we determine whether a canary image is good or bad?
- Should we implement an automated monitoring system that would identify (specific test, specific system property) pairs that correlate with consistent failures?

Postmortem tracked under go/cit-pm-75. Assigning to nodir to comment as per Trooper discussion.

More generally, we need to be able to detect an increase in the failure rate associated with any parameter that may affect the build result: not necessarily the image, but also, for example, versions of kitchen or git, or the bot id (all builds on a specific bot are failing, while builds of the same builder on other bots do not fail that often), etc.
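To make the last request concrete, here is a minimal sketch of one possible approach, not the implementation discussed in this bug. It assumes each build is a dict with a `failed` flag and a `props` dict of hashable parameter values (image, git version, bot id, ...). The function names, the dict layout, and the `min_builds` / `min_lift` thresholds are all hypothetical. It flags any (parameter, value) pair whose failure rate is well above the overall failure rate.

```python
from collections import defaultdict


def failure_counts_by_property(builds):
    """Return {(key, value): (failures, total)} over all build properties."""
    counts = defaultdict(lambda: [0, 0])
    for build in builds:
        for key, value in build["props"].items():
            entry = counts[(key, value)]
            if build["failed"]:
                entry[0] += 1
            entry[1] += 1
    return {pair: tuple(c) for pair, c in counts.items()}


def suspicious_properties(builds, min_builds=5, min_lift=2.0):
    """Flag (key, value, rate) where rate >= min_lift * overall failure rate.

    Pairs seen in fewer than min_builds builds are skipped, since their
    failure rate is too noisy to act on. Both thresholds are illustrative.
    """
    if not builds:
        return []
    overall = sum(1 for b in builds if b["failed"]) / len(builds)
    if overall == 0:
        return []
    flagged = []
    for (key, value), (failures, total) in failure_counts_by_property(builds).items():
        if total < min_builds:
            continue
        rate = failures / total
        if rate >= min_lift * overall:
            flagged.append((key, value, rate))
    # Highest failure rate first.
    flagged.sort(key=lambda item: item[2], reverse=True)
    return flagged
```

A simple ratio test like this would not replace a proper statistical test, and it says nothing about how the build data is collected; it only illustrates the kind of per-parameter correlation being asked for.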
Apr 17 2018
Locally I have an implementation that can answer both questions. I'd like to run a project review on this once I'm less busy with urgent stuff such as migrations and buildbucket api v2.
Apr 30 2018
Issue 721571 has been merged into this issue.
May 1 2018
May 1 2018
Jun 2 2018
Friendly ping. This is a blocking bug for cit-pm-75. Please update pri and comment accordingly. Thanks!
Jun 2 2018
I'd really love to work on this, but currently the priority is to finish buildbucket api v2.
Sep 26
I've spent about a month researching this topic. It requires spinning up a new service dedicated to a kind of monitoring that Monarch is incapable of doing. We have more important things to do for now.
Oct 18
Oct 18
Comment 1 by efoo@chromium.org, Apr 17 2018