
Issue 833727


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature




Improve build monitoring

Project Member Reported by efoo@chromium.org, Apr 17 2018

Issue description

This is the bug to track discussion of the recent outage in which a bad image push resulted in a Linux bot outage.

- How do we determine when a canary image is good or bad? 
- Should we implement an automated monitoring system that identifies which specific test or specific system property correlates with consistent failures?

Postmortem tracked under go/cit-pm-75. Assigning to nodir to comment as per Trooper discussion. 

More generally, we need to be able to detect an increase in the failure rate associated with any parameter that may affect the build result. This is not necessarily the image; it could also be, for example, the kitchen or git version, or the bot ID (all builds on a specific bot are failing, while builds of the same builder on other bots fail much less often), etc.
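A minimal sketch of the kind of correlation check described above, assuming each build record carries its builder name, a pass/fail result, and a dict of parameters (image, kitchen version, git version, bot ID). The `Build` record, the function names, and the thresholds are hypothetical, not a Buildbucket or Monarch API, and the two-proportion z-test is one possible swapped-in statistic rather than the implementation mentioned later in this thread. Each parameter value is compared against the other builds of the same builder, mirroring the bot ID example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Build:
    # Hypothetical build record; not an actual Buildbucket type.
    builder: str
    failed: bool
    params: dict = field(default_factory=dict)  # e.g. {"image": ..., "bot_id": ...}


def two_proportion_z(fail_a, n_a, fail_b, n_b):
    """z-score for 'group A fails more often than group B' (pooled variance)."""
    p_pool = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # all builds passed or all failed: no signal
    return (fail_a / n_a - fail_b / n_b) / se


def suspicious_params(builds, z_threshold=3.0, min_builds=10):
    """Return (builder, param, value, z) tuples for parameter values whose
    builds fail significantly more often than the builder's other builds."""
    totals = defaultdict(lambda: [0, 0])  # builder -> [failures, count]
    groups = defaultdict(lambda: [0, 0])  # (builder, param, value) -> [failures, count]
    for b in builds:
        totals[b.builder][0] += b.failed
        totals[b.builder][1] += 1
        for name, value in b.params.items():
            g = groups[(b.builder, name, value)]
            g[0] += b.failed
            g[1] += 1

    results = []
    for (builder, name, value), (fail_a, n_a) in groups.items():
        fail_all, n_all = totals[builder]
        # Baseline: builds of the same builder that do not have this value.
        fail_b, n_b = fail_all - fail_a, n_all - n_a
        if n_a < min_builds or n_b < min_builds:
            continue  # too few builds on one side to compare
        z = two_proportion_z(fail_a, n_a, fail_b, n_b)
        if z >= z_threshold:
            results.append((builder, name, value, z))
    results.sort(key=lambda r: r[3], reverse=True)
    return results
```

For example, if a builder's builds on one bot fail 18 of 20 times while its builds on other bots fail 2 of 40 times, that bot's `bot_id` value is flagged with z ≈ 6.6. A parameter shared by every build (such as a single image version) is skipped, because there is no baseline to compare it against; detecting a bad canary image this way would require some builds to still run on the previous image. Testing many parameter values at once also inflates false positives, so the threshold here is an arbitrary assumption.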
 

Comment 1 by efoo@chromium.org, Apr 17 2018

Cc: dpranke@chromium.org jchin...@chromium.org tandrii@chromium.org

Comment 2 by no...@chromium.org, Apr 17 2018

Locally I have an implementation that can answer both questions. I'd like to run a project review on it once I'm less busy with urgent work such as migrations and buildbucket API v2.

Comment 3 by no...@chromium.org, Apr 30 2018

Cc: iannucci@chromium.org vadimsh@chromium.org estaab@chromium.org
Issue 721571 has been merged into this issue.

Comment 4 by no...@chromium.org, May 1 2018

Description: updated

Comment 5 by no...@chromium.org, May 1 2018

Summary: Improve build monitoring (was: Is canary image is good or bad?)

Comment 6 by efoo@chromium.org, Jun 2 2018

Labels: cit-pm-75
Friendly ping. This is a blocking bug for cit-pm-75. Please update pri and comment accordingly. Thanks!

Comment 7 by no...@chromium.org, Jun 2 2018

Labels: -Pri-1 Pri-2
I'd really love to work on this, but currently the priority is to finish buildbucket API v2.
Labels: -Pri-2 -Type-Task Pri-3 Type-Feature
I've spent about a month researching this topic. This requires spinning up a new service dedicated to monitoring, a kind of monitoring that Monarch is incapable of doing. We have more important things to do for now.
Cc: iannu...@google.com
Cc: -iannucci@chromium.org
