scheduler crash loops when it finds multiple HQEs for a job with different resultsdir
Issue description
This is a follow-up on incident issue 831689. The root cause is being tracked at issue 831873. This bug is about hardening the scheduler so that it ignores the said job and continues to do its job. Additionally, it should abort the said HQEs, mark them complete and inactive, and mark the DUT as REPAIR_FAILED. (That's the only safe thing the scheduler can do without having to create new special tasks directly. The DUT will somehow get handled later.)
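To make the requested hardening concrete, here is a minimal sketch of a recovery step that sets aside jobs whose HQEs disagree on the results directory instead of raising on them. It is not the autotest code: `Entry`, its fields, and `split_recoverable` are hypothetical names invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a host queue entry (HQE)."""
    job_id: int
    host: str
    results_dir: str


def split_recoverable(entries):
    """Group entries by job and return (recoverable, conflicting) job maps.

    A job counts as conflicting when its entries do not all share one
    results directory. Conflicting jobs are returned separately rather
    than raising, so one bad job cannot stop recovery of the others.
    """
    by_job = defaultdict(list)
    for entry in entries:
        by_job[entry.job_id].append(entry)

    recoverable, conflicting = {}, {}
    for job_id, job_entries in by_job.items():
        dirs = {e.results_dir for e in job_entries}
        target = recoverable if len(dirs) == 1 else conflicting
        target[job_id] = job_entries
    return recoverable, conflicting
```

The caller would recover the first map as usual and hand the second to whatever cleanup path is chosen for the bad HQEs.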
,
Apr 16 2018
,
Apr 16 2018
,
Apr 16 2018
> it should [ ... ] mark the DUT as REPAIR_FAILED [ ... ]
Ugh. That feels wrong. We do need some sort of orderly shutdown of the job, but unconditionally smashing the host state could cause trouble. If nothing else, I think it would confuse the dut-status command... Ideally, the process of saying "this job is aborted" and cleaning up the HQEs should be sufficient; I believe that the normal scheduler response to an abort is to schedule a cleanup task, which will be enough to get the DUT checked out for errors.
,
Apr 16 2018
The problem occurs while recovering a job after a scheduler restart. So I'm not sure how much work is needed to abort the job in a way that the scheduler still creates an agent... which then goes and kicks off the cleanup task. If it's easy, I'm all for it. If not, kill the DUT and let something else come resurrect it.
,
Apr 16 2018
At the point where this error fires, the Host is in a good state, so it should go back to Ready. I will use the standard Lucifer recovery mechanism (job_aborter), which marks the HQEs as failed and the hosts as Ready.
,
Apr 17 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b88e815bf28a7841c079231a2c46f5f1bb859ad7

commit b88e815bf28a7841c079231a2c46f5f1bb859ad7
Author: Allen Li <ayatane@chromium.org>
Date: Tue Apr 17 00:23:38 2018

[autotest] Abort jobs that error when sending to Lucifer

Abort jobs that cause this error (synch_count=1 jobs with multiple HQEs). Its hard to handle this from Lucifer, so let the scheduler handle it as an abort. For example, one of the HQEs might be STARTING while the other is PROVISIONING, and Lucifer cannot gracefully abort the PROVISIONING HQE.

BUG= chromium:832167
TEST=Test locally (create a bad job followed by a good job to check host scheduling)

Change-Id: Iddb215d8a0669fa5fb15e507f3a87777371321b1
Reviewed-on: https://chromium-review.googlesource.com/1014395
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/b88e815bf28a7841c079231a2c46f5f1bb859ad7/scheduler/luciferlib.py
[modify] https://crrev.com/b88e815bf28a7841c079231a2c46f5f1bb859ad7/scheduler/monitor_db.py
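For readers without the diff at hand, the idea in the commit message can be sketched roughly as follows. This is an illustration only, not the contents of luciferlib.py or monitor_db.py: the `Job` and `QueueEntry` classes, the status strings, and the `dispatch` helper are all assumptions. What happens to the host after the abort is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QueueEntry:
    """Hypothetical HQE with only the fields this sketch needs."""
    hostname: str
    status: str = "Starting"
    active: bool = True
    complete: bool = False
    aborted: bool = False


@dataclass
class Job:
    """Hypothetical job record."""
    job_id: int
    synch_count: int
    entries: List[QueueEntry] = field(default_factory=list)


def is_malformed(job: Job) -> bool:
    """True for the case named in the commit: synch_count=1, several HQEs."""
    return job.synch_count == 1 and len(job.entries) > 1


def abort_job(job: Job) -> None:
    """Mark every entry aborted, complete, and inactive."""
    for entry in job.entries:
        entry.aborted = True
        entry.status = "Aborted"
        entry.complete = True
        entry.active = False


def dispatch(jobs: List[Job], send: Callable[[Job], None]) -> List[int]:
    """Send well-formed jobs onward; abort malformed ones and keep going."""
    sent = []
    for job in jobs:
        if is_malformed(job):
            abort_job(job)
            continue
        send(job)
        sent.append(job.job_id)
    return sent
```

Running `dispatch` on a bad job followed by a good one mirrors the local test described in the commit: the bad job's entries end up aborted and the good job is still sent.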
,
Apr 23 2018
Comment 1 by akes...@chromium.org
, Apr 12 2018