Establish CrOS infra availability baseline
Issue description
Establish current baseline for test infrastructure uptime as a metric to improve against in 2019.
,
Nov 29
I'm working on a Monarch query that will give us the availability percentage for the last year. I'm treating any point where the 50th-percentile wallclock time, over a delta window of 1 day, is higher than 1 day as out of SLO, and using that to infer when we had an outage. In other words, whenever the preceding 24-hour average of the median wallclock time across all CLs is higher than 24 hours, we assume there was an outage and count that period as an "outage". That has the nice benefit of being relatively congruent with our daytime-only oncall model and the way that impacts reliability. In practice, issues that occur overnight are resolved the next morning but will still, in some cases, have pushed the wallclock time for CLs in that period over the 24-hour mark. Eyeballing the graph of just the 24-hour delta window wallclock time <http://shortn/_GNXWjEAlI6>, it seems that we're about 80% available by that metric. Based on my gut alone, that seems like about the right number.
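For anyone who wants to sanity-check the definition outside Monarch, here is a rough Python sketch of the computation described above. It is not the actual query. It assumes the input is a list of evenly spaced hourly samples of the median CL wallclock time, in hours, and the names outage_flags, availability, window and slo_hours are made up for illustration.

```python
from collections import deque
from typing import Iterable, List


def outage_flags(median_wallclock_hours: Iterable[float],
                 window: int = 24,
                 slo_hours: float = 24.0) -> List[bool]:
    """Flag each sample whose trailing-window mean exceeds the SLO threshold.

    Assumption: one sample per hour, so window=24 approximates the
    "preceding 24 hour average" from the comment above.
    """
    recent = deque(maxlen=window)  # holds at most the last `window` samples
    flags = []
    for value in median_wallclock_hours:
        recent.append(value)
        trailing_mean = sum(recent) / len(recent)
        flags.append(trailing_mean > slo_hours)
    return flags


def availability(median_wallclock_hours: Iterable[float],
                 window: int = 24,
                 slo_hours: float = 24.0) -> float:
    """Fraction of samples that are not flagged as an outage."""
    flags = outage_flags(median_wallclock_hours, window, slo_hours)
    if not flags:
        raise ValueError("need at least one sample")
    return 1.0 - sum(flags) / len(flags)


if __name__ == "__main__":
    # Synthetic data: two quiet days, one bad day, two quiet days.
    samples = [10.0] * 48 + [40.0] * 24 + [10.0] * 48
    print(f"availability at 24h target: {availability(samples):.1%}")
    print(f"availability at 48h target: {availability(samples, slo_hours=48.0):.1%}")
```

Note that the trailing average keeps a period flagged for a while after the underlying times recover, so a single bad day can count as more than a day of outage under this definition.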
,
Nov 29
Issue 909815 has been merged into this issue.
,
Nov 29
Here's our current availability based on that 24-hour wallclock time target: http://shortn/_zMk7YpdPyA (it will take a long time to render). What do people think of this?
,
Nov 29
I don't fully understand your computation. To rephrase: when submissions take more than 24 hours on average, all of that time is counted as an outage. Is that correct, or is it more nuanced than that?
,
Nov 29
That's right.
,
Nov 29
Okay, I'll take that. It's a bit shocking that we are hovering around 60% right now, but I guess that's because of the recent outages. Even with >48 hours, it still shows around 85%, so we've clearly got a bunch of work to do :-)
,
Nov 29
Thanks.
Comment 1 by akes...@chromium.org
, Nov 28