
Issue 909817

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Nov 29
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Task




Establish CrOS infra availability baseline

Project Member Reported by shapiroc@chromium.org, Nov 28

Issue description

Establish current baseline for test infrastructure uptime as a metric to improve against in 2019.
 
Cc: jclinton@chromium.org
+jclinton as we are considering examining joint baseline (CI + Test)
Status: Assigned (was: Untriaged)
Components: Infra>Client>ChromeOS>CI
I'm working on a Monarch query that will give us the availability percentile for the last year. I'm treating any 50th-percentile wallclock time over a 1-day delta window that exceeds 1 day as out of SLO, which gives us a way to infer when we had an outage. In other words, if at some point the preceding 24-hour average of the median wallclock time across all CLs is higher than 24 hours, we assume there was an outage and count that period as an "outage".

That has the nice benefit of being relatively congruent with our daytime-only oncall model and the way that impacts reliability. In practice, issues that occur overnight are resolved the next morning but will still, in some cases, have pushed the wallclock time for CLs in that period over the 24-hour mark.
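
For anyone who wants to sanity-check the numbers offline, here is a rough Python sketch of the outage rule described above. It is not the Monarch query itself. The function name `availability`, the `samples` input shape, and the assumption that samples are evenly spaced in time are all hypothetical and only for illustration.

```python
from datetime import datetime, timedelta


def availability(samples, window=timedelta(days=1), threshold=timedelta(days=1)):
    """Fraction of samples that are within SLO.

    samples: list of (timestamp, median_wallclock) tuples sorted by timestamp,
    where median_wallclock is a timedelta. ASSUMPTION: samples are evenly
    spaced, so counting samples approximates measuring time.

    A sample is an "outage" when the mean of the median wallclock times in
    the trailing `window` ending at that sample exceeds `threshold`.
    """
    if not samples:
        raise ValueError("need at least one sample")

    in_slo = 0
    start = 0                 # index of the oldest sample still in the window
    running = timedelta(0)    # sum of wallclock times currently in the window
    for i, (ts, wallclock) in enumerate(samples):
        running += wallclock
        # Drop samples that fell out of the trailing window (ts - window, ts].
        while samples[start][0] <= ts - window:
            running -= samples[start][1]
            start += 1
        mean = running / (i - start + 1)
        if mean <= threshold:
            in_slo += 1
    return in_slo / len(samples)
```

Passing a different `threshold` (for example `timedelta(hours=48)`) makes it possible to compare how sensitive the resulting percentage is to the chosen target.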

Eyeballing the graph of just the 24-hour delta window wallclock time <http://shortn/_GNXWjEAlI6>, it seems that we're about 80% available by that metric. Based on my gut alone, that seems like about the right number.

Issue 909815 has been merged into this issue.
Summary: Establish CrOS infra availability baseline (was: Establish test infra availability baseline)
Here's our current availability based on that 24-hour wallclock time target: http://shortn/_zMk7YpdPyA (it will take a long time to render).

What do people think of this?

I don't fully understand your computation. To rephrase: when submissions on average take > 24 hours, all that time is counted as an outage. Is that correct, or is it more nuanced than that?

That's right.
Okay, I'll take that. It's a bit shocking that we are hovering around 60% right now, but I guess that's because of the recent outages.

Even with >48 hours, it still shows around 85%, so we've clearly got a bunch of work to do :-)
Status: Fixed (was: Assigned)
Thanks.
