
Issue 779859

Starred by 1 user

Issue metadata

Status: Duplicate
Owner: ----
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Feature

Blocked on:
issue 786055




Monitoring for step duration

Project Member Reported by katthomas@chromium.org, Oct 31 2017

Issue description

We haven't previously monitored build step duration in Monarch due to field cardinality concerns.

This data would be useful in detecting and responding to issues.

I spoke with packrat on how best to represent this data, and this is what he suggested: monitor step duration in Monarch not by specific step, but by some aggregate metric. This metric could be a periodic measurement of the maximum running step duration, or a distribution of step durations. Neither of these would contain the step_name field. This raises the issue that different steps take different amounts of time, so we may need some way of organizing them into buckets. Builders stream step data to LogDog, so LogDog has this information.

Defining this metric more generally allows us to use time series monitoring, which means we can configure alerts to fire in response to certain changes in that metric. Yay! But if that alert fires, how do we know which step is responsible? That's where event monitoring comes in. We can graph step duration by step name on a Viceroy dashboard based on event data (that we already have!) in BigQuery.
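
To make the aggregate idea a bit more concrete, here is a minimal sketch of what reporting those two aggregates might look like; the metric paths and the report() helper are purely illustrative, not the actual Monarch/ts_mon API.

# Hypothetical sketch of the aggregate approach: no step_name field, so metric
# cardinality stays bounded. Metric paths and report() are made up for
# illustration; this is not the real Monarch/ts_mon API.
import time

def report(metric_name, value):
    # Placeholder for whatever actually writes a value to monitoring.
    print('%s = %s' % (metric_name, value))

def report_step_duration_aggregates(running_steps, finished_durations):
    """running_steps: {step_name: start_timestamp}; finished_durations: list of seconds."""
    now = time.time()
    if running_steps:
        # Periodic measurement of the longest-running in-flight step.
        report('/chrome/infra/build/step/max_running_duration',
               max(now - start for start in running_steps.values()))
    for seconds in finished_durations:
        # Fed into a distribution of completed step durations.
        report('/chrome/infra/build/step/durations', seconds)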
 
@packrat, cc'd you in case you wanted to correct or expand on anything
@jbudorick, cc'd you because you seem interested in having this data


 
hrm, I'll have to think about whether there's a way we could aggregate step times absent the step name that still gives us meaningful information. The expected runtimes of steps varies wildly.

How fine-grained could the buckets be? Shifts in the bucket distribution could potentially be interesting, though at that point we'd be using the buckets as a proxy for the step names.
*vary* wildly, argh
I think there are a couple of ways we could think about it. 

Steps with similar timeouts could be in the same bucket. That would require us to have not-overly-generic timeouts for our steps though, which I'm pretty sure is not always the case. 

We could look at a sample of steps and group them by how long they *should* take. 

We could just pick some buckets... Steps that should finish in < 1m, < 5m, < 15m, < 30m, < 1h, > 1h.
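
A minimal sketch of that last option, using the thresholds listed above (the function and label names are just illustrative):

# Sketch: map a step's expected duration (in seconds) to one of the buckets
# listed above. Function and label names are illustrative only.
def expected_duration_bucket(expected_seconds):
    thresholds = [
        (1 * 60, 'lt_1m'),
        (5 * 60, 'lt_5m'),
        (15 * 60, 'lt_15m'),
        (30 * 60, 'lt_30m'),
        (60 * 60, 'lt_1h'),
    ]
    for limit, label in thresholds:
        if expected_seconds < limit:
            return label
    return 'gt_1h'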

ooo, interesting, I hadn't thought about that.

If we do something like that, what do we wind up alerting on? Can we alert on a combination of conditions that indicate that something has definitively regressed? (Increase in 15m bucket population + decrease in 5m bucket population + no significant change in other buckets, for example?)
Oh, interesting, I thought we'd send different metrics, one for each bucket. For example /chrome/infra/build/step/<bucket>/durations where we send the durations for steps that *should* be in <bucket>.

This is an interesting idea too though! I wonder if we could alert on the rate of increase in population size of a given bucket.  
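
A rough sketch of both ideas, assuming the /chrome/infra/build/step/<bucket>/durations naming from above and a placeholder growth threshold:

# Sketch: one metric per expected-duration bucket, plus a naive check for a
# sudden increase in a bucket's population between two equal-length windows.
# The metric path follows the naming above; the 1.5x threshold is a placeholder.
def bucket_metric_name(bucket):
    return '/chrome/infra/build/step/%s/durations' % bucket

def bucket_population_regressed(prev_count, curr_count, max_growth=1.5):
    # True if the bucket grew by more than max_growth-fold window over window.
    if prev_count == 0:
        return curr_count > 0
    return curr_count / float(prev_count) > max_growth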

Comment 6 by packrat@google.com, Oct 31 2017

I would suggest that alerting (in the form of a real-time interrupt) should be reserved for stuckness, while gentle regressions can be analysed offline at a much lower frequency and perhaps generate a ticket for investigation.

The two data forms you can use for those are quite different, and the second can be  a lot more sophisticated.
We're discussing what metric should be used for alerting on that stuckness. Any advice on how to do that given that the expected step duration varies from seconds to ~1h?

Comment 8 by packrat@google.com, Oct 31 2017

My intuition is that a stuckness alert looks something like 'total (all steps) time running > average of last 5 total completion times + 50%'

but that depends a lot on how quickly you need to respond to stuckness and how exceptional a condition it is. In particular, if you're seeing things stuck very often, more localised automatic responses are going to be necessary to avoid paging yourself to death, and that system could potentially act at the level of individual steps with retry logic and data about each step's typical time taken.
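
Read literally, that heuristic might look like the following sketch; the 50% margin and the five-build window come from the comment above, the rest is assumed:

# Sketch of the stuckness heuristic from the comment above: flag a build as
# possibly stuck when its total running time so far exceeds the average of the
# last five total completion times by more than 50%.
def looks_stuck(current_running_seconds, recent_total_durations, margin=0.5):
    recent = recent_total_durations[-5:]
    if not recent:
        return False  # No history yet; nothing to compare against.
    baseline = sum(recent) / float(len(recent))
    return current_running_seconds > baseline * (1.0 + margin)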
I spoke with packrat@ about this, and we decided that it would be helpful to write a lightweight design doc that outlines the context for this issue and highlights some of the related outages we've seen. This would help SRE advise us on the best path forward, and would also just generally be useful.
Cc: dpranke@chromium.org
Cc: bpastene@chromium.org
Blockedon: 786055
Cc: no...@chromium.org
Labels: -Pri-2 Pri-1
Bumping the priority on this because it would allow us to *not* rely on completed_steps event data for a few graphs (Cycle time 50/90% and Success rate (2-hour/1-day window)) on Buildbot/buildbot. 

Comment 13 Deleted

Mergedinto: 795445
Status: Duplicate (was: Available)
