Monitoring for step duration
Issue description, Oct 31 2017

Previously we haven't monitored build step duration in Monarch due to field cardinality concerns. This data would be useful in detecting and responding to issues.

I spoke with packrat about how best to represent this data, and this is what he suggested: monitor step duration in Monarch not by specific step, but by some aggregate metric. This metric could be a periodic measurement of the max running step duration, or a distribution of step durations. Neither of these would contain the step_name field. This raises the issue that different steps take different times, so we may need some way of organizing them into buckets. Builders stream step data to LogDog, so LogDog has this information.

Defining this metric to be more general allows us to use time series monitoring, which means we can configure alerts to fire in response to certain changes in that metric. Yay! But if that alert fires, how do we know which step is responsible? That's where event monitoring comes in. We can graph step duration by step name on a Viceroy dashboard based on event data (that we already have!) in BigQuery.

@packrat, cc'd you in case you wanted to correct or expand on anything.
@jbudorick, cc'd you because you seem interested in having this data.
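To make the two suggested aggregate shapes concrete, here is a minimal Python sketch of both: a periodic gauge of the longest-running step's duration, and a reduction of completed-step durations into distribution bucket counts. The names and data source here (RUNNING_STEPS, the boundary list) are illustrative assumptions; in practice the inputs would come from LogDog and the values would be exported to Monarch.

```python
import time

# Hypothetical in-memory view of currently running steps; in reality the
# per-step start times would come from LogDog. {step_name: start_ts_seconds}
RUNNING_STEPS = {
    'compile': time.time() - 840.0,
    'browser_tests': time.time() - 95.0,
}

def max_running_step_duration(now=None):
    """First suggested shape: a single periodic gauge value, the age of the
    longest-running step. No step_name field, so cardinality stays flat."""
    now = time.time() if now is None else now
    if not RUNNING_STEPS:
        return 0.0
    return max(now - started for started in RUNNING_STEPS.values())

def duration_distribution(completed_durations_s, boundaries_s):
    """Second suggested shape: completed step durations reduced to counts
    per distribution bucket (again, no step_name field)."""
    counts = [0] * (len(boundaries_s) + 1)
    for d in completed_durations_s:
        counts[sum(1 for b in boundaries_s if d >= b)] += 1
    return counts

# e.g. boundaries at 1m/5m/15m: durations of 30s, 200s, 800s land in the
# first three buckets respectively.
assert duration_distribution([30, 200, 800], [60, 300, 900]) == [1, 1, 1, 0]
```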
Comment 1 by jbudorick@chromium.org, Oct 31 2017

*vary* wildly, argh
Comment 2, Oct 31 2017

I think there are a couple of ways we could think about it. Steps with similar timeouts could be in the same bucket. That would require us to have not-overly-generic timeouts for our steps though, which I'm pretty sure is not always the case. We could look at a sample of steps and group them by how long they *should* take. We could just pick some buckets... Steps that should finish in < 1m, < 5m, < 15m, < 30m, < 1h, > 1h.
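For illustration, bucketing by expected duration with exactly those thresholds could look like the sketch below. The label strings are invented, and an expected-duration value per step is assumed to exist (e.g. from a curated table or historical data):

```python
# Thresholds from the comment above: < 1m, < 5m, < 15m, < 30m, < 1h, > 1h.
BUCKET_BOUNDS_S = [60, 300, 900, 1800, 3600]
BUCKET_LABELS = ['lt_1m', 'lt_5m', 'lt_15m', 'lt_30m', 'lt_1h', 'ge_1h']

def bucket_for(expected_duration_s):
    """Map a step's *expected* duration (seconds) onto a bucket label."""
    for bound, label in zip(BUCKET_BOUNDS_S, BUCKET_LABELS):
        if expected_duration_s < bound:
            return label
    return BUCKET_LABELS[-1]

assert bucket_for(45) == 'lt_1m'
assert bucket_for(2400) == 'lt_1h'
assert bucket_for(7200) == 'ge_1h'
```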
Comment 3, Oct 31 2017

ooo, interesting, I hadn't thought about that. If we do something like that, what do we wind up alerting on? Can we alert on a combination of conditions that indicate that something has definitively regressed? (Increase in 15m bucket population + decrease in 5m bucket population + no significant change in other buckets, for example?)
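A rough sketch of that combined condition, assuming periodic per-bucket population snapshots; the 20% relative-change threshold is an arbitrary illustrative value, not a tuned number:

```python
def detect_bucket_shift(prev, curr, rel_threshold=0.2):
    """prev/curr: {bucket_label: step_count} snapshots.

    Returns (shrunk_bucket, grown_bucket) if exactly one bucket grew and one
    shrank by more than rel_threshold while all others stayed within it;
    returns None otherwise.
    """
    grown, shrunk = [], []
    for label, before in prev.items():
        after = curr.get(label, 0)
        base = max(before, 1)  # avoid division by zero for empty buckets
        change = (after - before) / float(base)
        if change > rel_threshold:
            grown.append(label)
        elif change < -rel_threshold:
            shrunk.append(label)
    if len(grown) == 1 and len(shrunk) == 1:
        return shrunk[0], grown[0]
    return None

# Example matching the scenario above: 5m population drops, 15m grows,
# everything else stays flat.
prev = {'lt_5m': 40, 'lt_15m': 10, 'lt_30m': 5}
curr = {'lt_5m': 28, 'lt_15m': 21, 'lt_30m': 5}
assert detect_bucket_shift(prev, curr) == ('lt_5m', 'lt_15m')
```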
Comment 4, Oct 31 2017

Oh, interesting, I thought we'd send different metrics, one for each bucket. For example, /chrome/infra/build/step/<bucket>/durations, where we send the durations for steps that *should* be in <bucket>. This is an interesting idea too though! I wonder if we could alert on the rate of increase in population size of a given bucket.
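If we went the one-metric-per-bucket route, the naming and the rate-of-increase alert might look like this sketch. The metric path pattern comes straight from the comment above; everything else (function names, the slope estimate) is an assumption for illustration:

```python
def metric_name(bucket_label):
    """One metric stream per bucket, per the layout proposed above."""
    return '/chrome/infra/build/step/%s/durations' % bucket_label

def rate_of_increase(samples, window_s):
    """Crude slope estimate over (timestamp_s, population) samples, for the
    'alert on rate of increase in bucket population' idea."""
    cutoff = samples[-1][0] - window_s
    recent = [(t, n) for t, n in samples if t >= cutoff]
    if len(recent) < 2:
        return 0.0
    (t0, n0), (t1, n1) = recent[0], recent[-1]
    return (n1 - n0) / float(t1 - t0)

assert metric_name('lt_15m') == '/chrome/infra/build/step/lt_15m/durations'
# Population grew from 10 to 22 over 600s of samples: 0.02 steps/second.
assert rate_of_increase([(0, 10), (300, 16), (600, 22)], 600) == 0.02
```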
Comment 5, Oct 31 2017

I would suggest that alerting (in the form of a real-time interrupt) should be reserved for stuckness, while gentle regressions can be analysed offline at a much lower frequency and perhaps generate a ticket for investigation. The two data forms you can use for those are quite different, and the second can be a lot more sophisticated.
Comment 6, Oct 31 2017

We're discussing what metric should be used for alerting on that stuckness. Any advice on how to do that, given that the expected step duration varies from seconds to ~1h?
Comment 7, Oct 31 2017

My intuition is that a stuckness alert looks something like 'total (all steps) time running > average of last 5 total completion times + 50%', but that depends a lot on how quickly you need to respond to stuckness and how exceptional a condition it is. In particular, if you're seeing things stuck very often, more localised automatic responses are going to be necessary to avoid paging yourself to death, and that system could potentially act at the level of individual steps, with retry logic and data about each step's typical time taken.
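That heuristic is easy to prototype. A minimal sketch, assuming the "last 5 completions + 50% slack" parameters from this comment (the class and method names are invented for illustration):

```python
from collections import deque

class StucknessDetector(object):
    """Flag a build as stuck when its total running time exceeds the mean
    of the last five completed builds' total times by more than 50%."""

    def __init__(self, history=5, slack=0.5):
        self.completion_times = deque(maxlen=history)
        self.slack = slack

    def record_completion(self, total_time_s):
        self.completion_times.append(total_time_s)

    def is_stuck(self, running_time_s):
        if not self.completion_times:
            return False  # no baseline yet; can't judge stuckness
        baseline = sum(self.completion_times) / len(self.completion_times)
        return running_time_s > baseline * (1.0 + self.slack)

detector = StucknessDetector()
for t in [600, 620, 580, 640, 610]:
    detector.record_completion(t)
assert not detector.is_stuck(700)   # within 1.5x of the 610s average
assert detector.is_stuck(1000)      # 1000 > 610 * 1.5 == 915
```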
Comment 8, Nov 30 2017

I spoke with packrat@ about this, and we decided that it would be helpful to write a lightweight design doc that outlines the context for this issue and highlights some of the related outages we've seen. This would help SRE advise us on the best path forward, and also just generally be helpful.
Comment 9, Dec 5 2017
Comment 10, Dec 7 2017
Comment 11, Feb 22 2018

Bumping the priority on this because it would allow us to *not* rely on completed_steps event data for a few graphs (Cycle time 50/90% and Success rate (2-hour/1-day window)) on Buildbot/buildbot.