Every once in a while, autotest decides to run cleanup tasks via: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/monitor_db.py?l=399
There's no limit on how many jobs a single cleanup might end up aborting, so one cleanup can cause a sudden spike in monitor_db tick time.
Here is an example: https://groups.google.com/a/google.com/forum/#!topic/chromeos-build-alerts/jPit3ir_css
Proposed mitigation (changes to monitor_db_cleanup.PeriodicCleanup and children):
- We spread any cleanup across ticks:
  - run_cleanup_maybe sets a hard limit on the number of "costly" operations allowed per call.
  - Each sub-step in cleanup (like _abort_timed_out_jobs) decides what counts as costly, and stops if the limit is reached.
  - run_cleanup_maybe keeps state about whether we finished cleanup in the last call, and if not, continues cleanup in the next call.
- The current logic around when to trigger a cleanup stays the same.
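A rough sketch of the budgeting idea above, to make the proposal concrete. This is not the existing autotest code: the class BudgetedCleanup, the helper make_abort_step, the max_costly_ops parameter and the (ops_used, finished) return convention are all hypothetical names chosen for illustration. Only run_cleanup_maybe matches a name from the proposal.

```python
class BudgetedCleanup(object):
    """Hypothetical stand-in for monitor_db_cleanup.PeriodicCleanup."""

    def __init__(self, steps, max_costly_ops):
        # steps: callables that take the remaining budget and return
        # (ops_used, finished). Each step decides what counts as costly.
        self._steps = steps
        self._max_costly_ops = max_costly_ops
        # State kept between calls: which step to resume from.
        self._next_step = 0

    def run_cleanup_maybe(self):
        """Runs up to max_costly_ops costly operations.

        Returns True if the whole cleanup finished in this call, False if
        the remainder is deferred to the next call (i.e. the next tick).
        """
        budget = self._max_costly_ops
        while self._next_step < len(self._steps):
            if budget <= 0:
                return False
            used, finished = self._steps[self._next_step](budget)
            budget -= used
            if not finished:
                return False
            self._next_step += 1
        # Every step completed: reset so the next cleanup starts fresh.
        self._next_step = 0
        return True


def make_abort_step(timed_out_jobs, aborted):
    """Hypothetical sub-step in the spirit of _abort_timed_out_jobs.

    Here each aborted job counts as one costly operation.
    """
    def step(budget):
        used = 0
        while timed_out_jobs and used < budget:
            aborted.append(timed_out_jobs.pop())
            used += 1
        return used, not timed_out_jobs
    return step
```

With five timed-out jobs and a limit of two costly operations per call, this sketch needs three calls to finish: the first two each abort two jobs and report unfinished work, and the third aborts the last job and reports completion.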
-------------------------------
Impact: this problem occurs intermittently, potentially multiple times a day. It is likely to make recovery from an outage / reduced capacity event worse.
A prerequisite is that a lot of cleanup work appears suddenly. This can happen if part of the lab loses the ability to run certain jobs that are created together (say via suite_scheduler).
Given that, I do not think this deserves Chase-Pending. Maybe OKR?
Comment 1 by pprabhu@chromium.org, Jun 14 2017