Alert when drones are "down" and remove scheduler crash in this case
Issue description

This is a follow-up from the PM for issue 848337.

[1] We should receive an alert when a drone is unusable: alert at the source.
[2] The scheduler should not crash if a drone call fails. It should skip that drone and continue with the rest, to avoid cascading failures.

[1] is not well defined because, unlike shards, drones do not have a representative service running on the server that we can use as a proxy for drone health. We've avoided alerting on machine stats (CPU/memory) because they're root causes, not actual failures, and are difficult to tune.

I'd say we should create metrics _from monitor_db_ when drone calls fail instead of crashing the scheduler. Then we should alert when too many drone calls to the same drone fail.
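A minimal sketch of the proposal above, not the actual scheduler code: a failed drone call is logged and counted per drone instead of being allowed to propagate, and a threshold check picks out drones with too many failures. The names `call_all_drones` and `drones_to_alert` are hypothetical, and a plain `collections.Counter` stands in for whatever metrics API monitor_db really uses.

```python
import collections
import logging


def call_all_drones(drones, method_name, failure_counts):
    """Call method_name on every drone, skipping drones whose call fails.

    drones: dict mapping a drone name to an object exposing method_name.
    failure_counts: a collections.Counter used here as a stand-in for a
        real metrics counter (assumption; the real metric is not shown).
    Returns a dict of results for the drones that responded.
    """
    results = {}
    for name, drone in drones.items():
        try:
            results[name] = getattr(drone, method_name)()
        except Exception:
            # Log and count the failure, then continue with the remaining
            # drones so one bad drone does not take the scheduler down.
            logging.exception('Drone call %s failed on %s; skipping',
                              method_name, name)
            failure_counts[name] += 1
    return results


def drones_to_alert(failure_counts, threshold):
    """Return the sorted names of drones with at least threshold failures."""
    return sorted(name for name, count in failure_counts.items()
                  if count >= threshold)
```

Counting per drone rather than globally is what makes the "too many failures to the same drone" alert possible; the threshold itself would need tuning against real traffic.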
,
Jun 29 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7854a62acdb7c54a14d40c042e1b2f106dfa10f6

commit 7854a62acdb7c54a14d40c042e1b2f106dfa10f6
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Jun 29 01:37:58 2018

autotest: Make missing drone not crash scheduler.

Instead, it logs the error and sends metrics noting it.

BUG=chromium:853861
TEST=Full unit test run

Change-Id: Ib5d7b5010e97ed6062a90e9d1333d23e2e36568b
Reviewed-on: https://chromium-review.googlesource.com/1119068
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/7854a62acdb7c54a14d40c042e1b2f106dfa10f6/scheduler/drone_task_queue.py
,
Jul 9
Ten days later pcon still doesn't see that metric being sent at all. Not sure what's up.
,
Jul 9
I checked that the change is in prod. None of the associated logging is creating messages, AFAICT.
,
Jul 9
Crash believed fixed, so not Chase. Not verified, and there are other problems.
,
Jul 10
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3a4dced5345ac46d2d9c0ecd8a6a92c443650ac1

commit 3a4dced5345ac46d2d9c0ecd8a6a92c443650ac1
Author: Jacob Kopczynski <jkop@google.com>
Date: Tue Jul 10 23:40:22 2018

autotest: add logging for missing drone metric

Will probably be noisy.

TEST=None
BUG=chromium:853861

Change-Id: I5c7934c94249be6cc10f297ed5afce73b85da699
Reviewed-on: https://chromium-review.googlesource.com/1129365
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/3a4dced5345ac46d2d9c0ecd8a6a92c443650ac1/scheduler/drone_task_queue.py
,
Aug 27
I still don't see any of that logging showing up anywhere.
Comment 1 by pprabhu@chromium.org, Jun 18 2018