Ensure quarantined bots will be repaired rapidly on perf waterfall
Issue description
We are working on swarming the perf waterfall. The perf waterfall has device affinity, which means that each test must run on the same device every time. Swarming already supports device affinity, but we need a way to ensure that when a device is quarantined:
* It gets repaired quickly
* It is easy for everyone (speed infra, troopers, infra labs) to find and follow status
Aug 23 2016
Adding abw@ and maruel@ who have taken a keen interest in this kind of thing.
Aug 23 2016
I'd recommend:
- set the task expiration to 10 minutes
- turn the step purple in case of expiration
That'd give you an earlier signal.
Sep 2 2016
Sep 6 2016
What is the perf team doing for alerts on their buildbot slaves? On the swarming side, we could probably add a property alert in goog3 that would be more trigger-happy than the general one.
Sep 8 2016
Re #5: Right now things are a mess. We use whatever hung bot detection exists, but we also have a sheriffing rotation that manually files bugs on troopers/labs to fix down slaves, and this fallback happens a lot. I think pretty much anything would be better; we just want some insight into what is quarantined and what ticket queue it's in so that we can double-check nothing's busted.
Feb 10 2017
Nov 21 2017
Was this addressed?
Nov 30 2017
Given our plans for soft device affinity in 2018, this might not be as relevant. We still have lots of issues with purple bots on our waterfall, but given our current work on 12 devices, I think this is on hold.
Nov 30
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Dec 6
eyach@, should we keep this open?
Dec 13
With soft device affinity this isn't as big of an issue since we automatically move off of down bots, but they still need to be detected, and that is a manual process on the perf waterfall. I am not sure who the right owner is here, but I still think there should be more automation around detecting down devices that aren't Android.
Dec 13
jbudorick: does the CCI team own monitoring swarming pools and ensuring that hardware is replaced on some timeline, or is that labs? Note that, as Emily said in #12, the perf waterfall now has soft device affinity, so it no longer requires an expedited timeline, but we're not sure who checks to make sure all the bots don't eventually go offline.
Comment 1 by sullivan@chromium.org, Aug 23 2016