Issue 733378

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



monitor_db cleanup hits too hard: Must spread cleanup task across ticks

Project Member Reported by pprabhu@chromium.org, Jun 14 2017

Issue description

Every once in a while, autotest decides to run cleanup tasks via: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/monitor_db.py?l=399

There's no limit on how many jobs this might end up aborting in a single tick. This leads to sudden spikes in monitor_db tick time.

Here is an example: https://groups.google.com/a/google.com/forum/#!topic/chromeos-build-alerts/jPit3ir_css

Proposed mitigation (changes to monitor_db_cleanup.PeriodicCleanup and children):
- We spread any cleanup across ticks as follows:
  - run_cleanup_maybe sets a hard limit on the number of "costly" operations allowed per call.
  - Each sub-step in cleanup (like _abort_timed_out_jobs) decides what counts as costly, and stops if the limit is reached.
  - run_cleanup_maybe keeps state about whether we finished cleanup in the last call, and if not, continues cleanup in the next call.
  - The current logic around when to trigger a cleanup stays the same.
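
The proposal above could be sketched roughly as follows. This is a standalone illustration, not the actual monitor_db_cleanup code: CostBudget, BudgetedCleanup and make_abort_step are hypothetical names, the in-memory list stands in for the database query that finds timed-out jobs, and the should_start argument stands in for the existing trigger logic, which would stay unchanged.

```python
class CostBudget(object):
    """Tracks how many "costly" operations are still allowed in this tick."""

    def __init__(self, limit):
        self.remaining = limit

    def try_spend(self, cost=1):
        """Consume `cost` units; return False if the budget cannot cover it."""
        if self.remaining < cost:
            return False
        self.remaining -= cost
        return True


class BudgetedCleanup(object):
    """Hypothetical stand-in for PeriodicCleanup with a per-tick cost limit."""

    def __init__(self, steps, max_costly_ops):
        # Each step is a callable taking a CostBudget and returning True
        # once it has no work left, or False if it stopped on the limit.
        self._steps = list(steps)
        self._max_costly_ops = max_costly_ops
        self._next_step = None  # None means no cleanup is in progress.

    def in_progress(self):
        return self._next_step is not None

    def run_cleanup_maybe(self, should_start):
        """Called once per tick; `should_start` is the existing trigger."""
        if self._next_step is None:
            if not should_start:
                return
            self._next_step = 0
        budget = CostBudget(self._max_costly_ops)
        while self._next_step < len(self._steps):
            finished = self._steps[self._next_step](budget)
            if not finished:
                return  # Limit reached; resume this step on the next tick.
            self._next_step += 1
        self._next_step = None  # Every step finished; cleanup is complete.


def make_abort_step(timed_out_jobs, aborted):
    """Build a step that treats each job abort as one costly operation."""

    def step(budget):
        while timed_out_jobs:
            if not budget.try_spend():
                return False
            aborted.append(timed_out_jobs.pop())
        return True

    return step
```

With five timed-out jobs and a limit of two, this cleanup takes three ticks (two aborts, two aborts, then one) instead of doing all five aborts in a single tick. A cleanup that is already in progress continues on later ticks even when the trigger condition is no longer true.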

-------------------------------
Impact: this problem occurs intermittently, potentially multiple times a day. It is likely to make recovery from an outage / reduced capacity event worse.

A prerequisite is that a lot of cleanup work appears suddenly. This can happen if part of the lab loses the ability to run certain jobs that are created together (say via suite_scheduler).

Given that, I do not think this deserves Chase-Pending. Maybe OKR?
 
Cc: ayatane@chromium.org
FYI. This is another kind of "cleanup" that monitor_db does. 
Status: Available (was: Untriaged)
Meets: "Preventative measures against likely causes of future P1 outages."
So, P2 is correct.
Labels: OKR
Labels: -Pri-2 -OKR Pri-3
I'd rather not invest more in monitor_db unless we can prove short-term impact, and I'm not currently convinced that we can do so for this. Removing from OKR and downgrading to P3. If we get new occurrences we can upgrade.
Project Member

Comment 5 by sheriffbot@chromium.org, Dec 14

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot