
Issue 639101 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature




Ensure quarantined bots will be repaired rapidly on perf waterfall

Project Member Reported by sullivan@chromium.org, Aug 18 2016

Issue description

We are working on swarming the perf waterfall. The perf waterfall has device affinity, which means that each test must run on the same device every time. Swarming already supports device affinity, but we need a way to ensure that when a device is quarantined:

* It gets repaired quickly
* It is easy for everyone (speed infra, troopers, infra labs) to find and follow status
 
Blocking: 633253

Comment 2 by stip@chromium.org, Aug 23 2016

Cc: abw@chromium.org mar...@chromium.org
Adding abw@ and maruel@ who have taken a keen interest in this kind of thing.

Comment 3 by mar...@chromium.org, Aug 23 2016

I'd recommend:
- set the task expiration to 10 minutes
- turn the step purple when a task expires

That'd give you an earlier signal.
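
A minimal sketch of the recommendation above, assuming a task record that exposes a state and a creation time. The state names, the color mapping, and `step_color` are hypothetical illustrations, not the Swarming or buildbot API:

```python
from datetime import datetime, timedelta

# Assumption: 10 minutes, per the recommendation above.
TASK_EXPIRATION = timedelta(minutes=10)


def step_color(state, created, now):
    """Return an illustrative waterfall step color for a task.

    state is one of 'PENDING', 'RUNNING', 'COMPLETED', 'FAILED'
    (hypothetical names chosen for this sketch).
    """
    # A task still pending after the expiration window never found a
    # bot, e.g. because its only eligible device is quarantined.
    if state == "PENDING" and now - created >= TASK_EXPIRATION:
        return "purple"
    if state == "FAILED":
        return "red"
    if state == "COMPLETED":
        return "green"
    # Still pending within the window, or currently running.
    return "yellow"
```

With device affinity, a short expiration means a quarantined device shows up as a purple step within minutes rather than as a long-pending build.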
Components: Infra>Platform>Swarming
Status: Available (was: Untriaged)
What is the perf team doing for alerts with their buildbot slaves?

On the swarming side, we could probably add a property alert in goog3 that would be more trigger-happy than the general one.
Re #5: Right now things are a mess. We use whatever hung bot detection exists, but we also have a sheriffing rotation that manually files bugs on troopers/labs to fix down slaves, and this fallback happens a lot. I think pretty much anything would be better; we just want some insight into what is quarantined and what ticket queue it's in so that we can double-check nothing's busted.
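
As a rough illustration of the kind of insight asked for above, the sketch below groups quarantined bots by the ticket queue they are filed in. The record fields (`bot_id`, `quarantined`, `ticket_queue`) and the `UNFILED` bucket are assumptions made for this example, not an existing data source:

```python
from collections import defaultdict


def quarantine_report(bots):
    """Group quarantined bot ids by ticket queue.

    bots is a list of dicts with hypothetical keys 'bot_id',
    'quarantined', and an optional 'ticket_queue'.
    """
    report = defaultdict(list)
    for bot in bots:
        if not bot.get("quarantined"):
            continue
        # Bots with no ticket are the ones nobody is tracking yet.
        queue = bot.get("ticket_queue") or "UNFILED"
        report[queue].append(bot["bot_id"])
    return {queue: sorted(ids) for queue, ids in sorted(report.items())}
```

Anything that lands in the `UNFILED` bucket would be a bot that is quarantined but not yet in any team's queue, which is the case a sheriff most needs to catch.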

Comment 7 by stip@chromium.org, Feb 10 2017

Cc: -stip@chromium.org

Comment 8 by mar...@chromium.org, Nov 21 2017

Cc: -abw@chromium.org
Was this addressed?

Comment 9 by eyaich@google.com, Nov 30 2017

Given our plans for soft device affinity in 2018, this might not be as relevant. We still have lots of issues with purple bots on our waterfall, but given our current work on 12 devices I think this is on hold.
Project Member

Comment 10 by sheriffbot@chromium.org, Nov 30

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Blocking: -633253
Labels: -Type-Bug -Pri-2 Pri-3 Type-Feature
Status: Available (was: Untriaged)
eyaich@, should we keep this open?
With soft device affinity this isn't as big an issue, since we automatically move off of down bots, but they still need to be detected, and that is a manual process on the perf waterfall. I am not sure who the right owner is here, but I still think there should be more automation around detecting down devices that aren't android.
Cc: jbudorick@chromium.org
jbudorick: does CCI team own monitoring swarming pools and ensuring that hardware is replaced with some timeline, or is that labs? Note that as Emily said in #12, the perf waterfall now has soft device affinity so it no longer requires an expedited timeline, but we're not sure who checks to make sure all the bots don't eventually go offline.
