
Issue 853861

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Alert when drones are "down" and remove scheduler crash in this case

Project Member Reported by pprabhu@chromium.org, Jun 18 2018

Issue description

This is a follow-up from the PM for issue 848337.

[1] We should receive an alert when a drone is unusable.
  - alert at the source.
[2] The scheduler should not crash if a drone call fails. It should skip that drone and continue with the rest.
  - avoid cascading failures.
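
A rough sketch of what [2] asks for, not the actual scheduler code. DroneCallError and execute_on_all_drones are hypothetical names made up for illustration, and `call` stands in for whatever the scheduler does to reach a drone. The point is only that one drone's failure is logged and recorded, and the loop moves on.

```python
import logging


class DroneCallError(Exception):
    """Hypothetical error raised when a call to a drone fails."""


def execute_on_all_drones(drones, call):
    """Run call(drone) for every drone, skipping drones whose call fails.

    Returns (results, failed): results maps drone name to the call's return
    value, failed lists the drones that were skipped, in order.
    """
    results = {}
    failed = []
    for drone in drones:
        try:
            results[drone] = call(drone)
        except DroneCallError:
            # Log with traceback and keep going so that one bad drone
            # does not take down the whole scheduler tick.
            logging.exception('Drone call to %s failed; skipping it.', drone)
            failed.append(drone)
    return results, failed
```

The failed list is what a caller would feed into metrics. Only the drone-call error is caught, so unrelated bugs still surface instead of being swallowed.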

[1] is not well defined because, unlike shards, drones do not have a representative service running on the server that we can use as a proxy for drone health.
We've avoided alerting on machine stats (cpu/memory) because they're root causes, not actual failures, and are difficult to tune.

I'd say we should create metrics _from monitor_db_ when drone calls fail, instead of crashing the scheduler.
Then we should alert when too many calls to the same drone fail.
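
To make "too many drone calls fail to the same drone" concrete, here is a minimal sketch that counts failures per drone over a sliding time window. Everything in it is an assumption for illustration: the DroneFailureTracker name, the threshold, and the window length are made up, and in production the counting and alerting would presumably live in the monitoring system rather than in monitor_db itself.

```python
import collections
import time


class DroneFailureTracker(object):
    """Hypothetical per-drone failure counter over a sliding time window."""

    def __init__(self, threshold=3, window_seconds=300, clock=time.time):
        self._threshold = threshold
        self._window = window_seconds
        self._clock = clock
        # Maps drone name -> deque of failure timestamps, oldest first.
        self._failures = collections.defaultdict(collections.deque)

    def _expire(self, times, now):
        # Drop failures that are older than the window.
        while times and now - times[0] > self._window:
            times.popleft()

    def record_failure(self, drone):
        now = self._clock()
        times = self._failures[drone]
        times.append(now)
        self._expire(times, now)

    def drones_to_alert(self):
        """Return sorted names of drones at or above the failure threshold."""
        now = self._clock()
        alerting = []
        for drone, times in self._failures.items():
            self._expire(times, now)
            if len(times) >= self._threshold:
                alerting.append(drone)
        return sorted(alerting)
```

Keying the count by drone means a single flaky call does not alert, while repeated failures against one drone do.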


 
Project Member

Comment 2 by bugdroid1@chromium.org, Jun 29 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7854a62acdb7c54a14d40c042e1b2f106dfa10f6

commit 7854a62acdb7c54a14d40c042e1b2f106dfa10f6
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Jun 29 01:37:58 2018

autotest: Make missing drone not crash scheduler.

Instead, it logs the error and sends metrics noting it.

BUG=chromium:853861
TEST=Full unit test run

Change-Id: Ib5d7b5010e97ed6062a90e9d1333d23e2e36568b
Reviewed-on: https://chromium-review.googlesource.com/1119068
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/7854a62acdb7c54a14d40c042e1b2f106dfa10f6/scheduler/drone_task_queue.py

Cc: -cra...@chromium.org pprabhu@chromium.org
Ten days later pcon still doesn't see that metric being sent at all. Not sure what's up.

Comment 4 Deleted

I checked that the change is in prod. None of the associated logging is creating messages, AFAICT.
Labels: -Pri-1 -Chase Pri-2
Crash believed fixed, so not Chase. Not verified, and there are other problems.
Project Member

Comment 8 by bugdroid1@chromium.org, Jul 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3a4dced5345ac46d2d9c0ecd8a6a92c443650ac1

commit 3a4dced5345ac46d2d9c0ecd8a6a92c443650ac1
Author: Jacob Kopczynski <jkop@google.com>
Date: Tue Jul 10 23:40:22 2018

autotest: add logging for missing drone metric

Will probably be noisy.

TEST=None
BUG=chromium:853861

Change-Id: I5c7934c94249be6cc10f297ed5afce73b85da699
Reviewed-on: https://chromium-review.googlesource.com/1129365
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/3a4dced5345ac46d2d9c0ecd8a6a92c443650ac1/scheduler/drone_task_queue.py

Status: Assigned (was: Started)
Don't see any of that logging showing up anywhere even now.
