
Issue 832167

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




scheduler crash loops when it finds multiple HQEs for a job with different resultsdir

Project Member Reported by pprabhu@chromium.org, Apr 12 2018

Issue description

This is a follow-up on incident issue 831689.
The root cause is being tracked at issue 831873.

This bug is about hardening the scheduler so that it ignores the offending job and continues scheduling. Additionally, it should abort the affected HQEs, marking them complete and inactive, and mark the DUT as REPAIR_FAILED (that's the only safe thing the scheduler can do without creating new special tasks directly; the DUT will somehow get handled later).
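For concreteness, here is a minimal sketch of what that hardening could look like in the scheduler's job-handoff path. The classes and names below are toy stand-ins, not the real autotest scheduler code; only the control flow is meant to match the proposal:

import logging
from dataclasses import dataclass, field

# Toy stand-ins for the scheduler's job/HQE/host objects; the real autotest
# classes and method names differ -- this only illustrates the control flow.
@dataclass
class Host:
    status: str = 'Running'

@dataclass
class HQE:
    execution_subdir: str
    host: Host = field(default_factory=Host)
    status: str = 'Starting'

@dataclass
class Job:
    id: int
    hqes: list

def handle_job(job, send_to_lucifer):
    """Hand a job off for execution, tolerating inconsistent HQEs."""
    resultsdirs = {hqe.execution_subdir for hqe in job.hqes}
    if len(resultsdirs) > 1:
        # Inconsistent state: abort and move on instead of crash-looping.
        logging.error('Job %d has HQEs with differing resultsdirs: %s',
                      job.id, sorted(resultsdirs))
        for hqe in job.hqes:
            hqe.status = 'Aborted'             # mark complete/inactive
            hqe.host.status = 'Repair Failed'  # the original proposal; see below
        return
    send_to_lucifer(job)                       # normal handoff path

As the discussion below points out, forcing the host to Repair Failed turned out to be the contentious part of this proposal.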
 
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Labels: -Chase-Pending Chase
Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
Cc: jrbarnette@chromium.org
> it should [ ... ] mark the DUT as REPAIR_FAILED [ ... ]

Ugh.  That feels wrong.  We do need some sort of orderly shutdown
of the job, but unconditionally smashing the host state could cause
trouble.  If nothing else, I think it would confuse the dut-status
command...

Ideally, the process of saying "this job is aborted" and cleaning up the
HQEs should be sufficient; I believe that the normal scheduler response to
an abort is to schedule a cleanup task, which will be enough to get the
DUT checked out for errors.

The problem occurs while recovering a job after a scheduler restart, so I'm not sure how much work is needed to abort the job in a way that the scheduler still creates an agent, which then goes and kicks off the cleanup task.

If it's easy, I'm all for it. If not, kill the DUT and let something else come resurrect it.
At the point where this error fires, the Host is in a good state, so it should go back to Ready.

I will use the standard Lucifer recovery mechanism (job_aborter), which marks the HQEs as failed and the hosts as Ready.
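For reference, a rough sketch of the outcome of that recovery path (not the actual job_aborter code; field and status names are simplified):

from dataclasses import dataclass

@dataclass
class _Host:
    status: str = 'Running'

@dataclass
class _HQE:
    host: _Host
    status: str = 'Starting'
    complete: bool = False
    active: bool = True

# Sketch of what job_aborter-style recovery amounts to for an aborted job:
# each HQE ends up failed, complete and inactive, and its host goes back
# to Ready rather than Repair Failed.
def recover_aborted_hqe(hqe: _HQE) -> None:
    hqe.status = 'Failed'
    hqe.complete = True
    hqe.active = False
    hqe.host.status = 'Ready'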
Comment 7 by bugdroid1@chromium.org (Project Member), Apr 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b88e815bf28a7841c079231a2c46f5f1bb859ad7

commit b88e815bf28a7841c079231a2c46f5f1bb859ad7
Author: Allen Li <ayatane@chromium.org>
Date: Tue Apr 17 00:23:38 2018

[autotest] Abort jobs that error when sending to Lucifer

Abort jobs that cause this error (synch_count=1 jobs with multiple
HQEs).

It's hard to handle this from Lucifer, so let the scheduler handle it
as an abort.

For example, one of the HQEs might be STARTING while the other is
PROVISIONING, and Lucifer cannot gracefully abort the PROVISIONING
HQE.

BUG=chromium:832167
TEST=Test locally (create a bad job followed by a good job to check host scheduling)

Change-Id: Iddb215d8a0669fa5fb15e507f3a87777371321b1
Reviewed-on: https://chromium-review.googlesource.com/1014395
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/b88e815bf28a7841c079231a2c46f5f1bb859ad7/scheduler/luciferlib.py
[modify] https://crrev.com/b88e815bf28a7841c079231a2c46f5f1bb859ad7/scheduler/monitor_db.py
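The CL itself is not reproduced here; as a hedged sketch of the shape of the fix the commit message describes (catch the error when handing a job to Lucifer and abort the job instead of letting the scheduler crash-loop), with hypothetical names rather than the actual code in luciferlib.py / monitor_db.py:

import logging

class JobHandoffError(Exception):
    """Hypothetical error for a job that cannot be sent to Lucifer,
    e.g. a synch_count=1 job with multiple HQEs in different states."""

def schedule_job(job, send_to_lucifer, abort_job):
    try:
        send_to_lucifer(job)
    except JobHandoffError:
        logging.exception('Cannot send job %s to Lucifer; aborting it', job)
        # Let the scheduler's normal abort path clean up the HQEs and hosts
        # instead of retrying (and crashing on) the same job every cycle.
        abort_job(job)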

Status: Fixed (was: Assigned)
