
Issue 809773 link


Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




lucifer: occasional scheduler crash due to race of AgentTask creation for lucifer HQEs

Project Member Reported by ayatane@chromium.org, Feb 6 2018

Issue description


Comment 1 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c0109e3ee4cf8cda80bc6c1a63ecf83ad5cfca89

commit c0109e3ee4cf8cda80bc6c1a63ecf83ad5cfca89
Author: Allen Li <ayatane@chromium.org>
Date: Wed Feb 14 05:15:56 2018

[autotest] Fix occasional scheduler crash

This fixes a bug introduced by
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/896825

The bug is that if we generate AgentTasks for HQEs owned by lucifer,
it is possible for the HQE to go from, e.g. GATHERING to FAILED
between the scheduler querying for HQEs and the scheduler creating
AgentTasks for the HQEs.

When the scheduler goes to create the AgentTask for a FAILED HQE, it
raises an exception.

Background on what we want:

Ideally, we don't want the scheduler to touch HQEs owned by lucifer at
all.  However, we do need the drone_manager to keep a mapping of
autoserv pids to drone hostnames, since we use that to figure out
which drone to use, and reimplementing or refactoring that logic is too
much work.  Normally, this isn't a problem, but if the scheduler
restarts, it uses AgentTasks to recover this information.

Assuming lucifer owns GATHERING, when the scheduler restarts, we want
it to create an AgentTask for a GATHERING HQE so drone_manager can
recover the pid to drone hostname mapping, even though we do not want
the scheduler to manage an AgentTask for that HQE.

Background on the state-of-the-art hack introduced by the previous CL:

We create AgentTasks for all HQEs so the scheduler can recover
pidfiles as needed, but we patch the add_agent_task() method so that
it is a no-op for AgentTasks covering HQEs managed by lucifer.
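
A minimal sketch of what such a no-op guard might look like. The
`HQE`, `AgentTask`, and `Dispatcher` classes and the
`LUCIFER_STATUSES` set are simplified, hypothetical stand-ins for
illustration; they are not the actual code in scheduler/monitor_db.py.

```python
from collections import namedtuple

# Hypothetical stand-ins; the real scheduler classes carry far more state.
HQE = namedtuple('HQE', ['id', 'status'])
AgentTask = namedtuple('AgentTask', ['queue_entries'])

# Assumption: the set of HQE statuses currently handed off to lucifer.
LUCIFER_STATUSES = frozenset(['Gathering'])


class Dispatcher(object):
    """Toy dispatcher that tracks the tasks the scheduler manages."""

    def __init__(self):
        self.agents = []

    def add_agent_task(self, task):
        """Schedule the task unless lucifer owns any of its HQEs."""
        if any(hqe.status in LUCIFER_STATUSES for hqe in task.queue_entries):
            # No-op: the task object was still constructed by the caller,
            # which is what lets pidfile recovery happen, but the
            # scheduler does not manage it.
            return
        self.agents.append(task)
```

In this sketch the AgentTask is still constructed for every HQE; only
the step of handing it to the scheduler is skipped for lucifer-owned
entries.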

However, there is a subtle race here: an HQE can go from, e.g.,
GATHERING to FAILED after the scheduler queries for the HQEs that need
AgentTasks but before it actually creates those AgentTasks.
Creating an AgentTask for a FAILED HQE raises an exception, crashing
the scheduler.

The fix is to catch the exception.
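
The shape of the fix can be sketched as follows. This is an
illustration under stated assumptions, not the code from the CL:
`SchedulerError`, `create_agent_task()`, `create_agent_tasks()`, and
the dict-based HQEs are all hypothetical names chosen for the example.

```python
import logging


class SchedulerError(Exception):
    """Hypothetical stand-in for the exception the scheduler raises."""


# Assumption: statuses for which an AgentTask can be created here.
RECOVERABLE_STATUSES = frozenset(['Gathering', 'Parsing'])


def create_agent_task(hqe):
    """Create a (fake) AgentTask, raising if the HQE status is invalid."""
    if hqe['status'] not in RECOVERABLE_STATUSES:
        raise SchedulerError('HQE %d has invalid status %s'
                             % (hqe['id'], hqe['status']))
    return ('AgentTask', hqe['id'])


def create_agent_tasks(hqes):
    """Create tasks for previously queried HQEs, tolerating stale statuses."""
    tasks = []
    for hqe in hqes:
        try:
            tasks.append(create_agent_task(hqe))
        except SchedulerError:
            # The HQE changed status after the query (e.g. lucifer moved
            # it to Failed).  Skip it instead of crashing the scheduler.
            logging.warning('Skipping HQE %d: status changed to %s',
                            hqe['id'], hqe['status'])
    return tasks


# Simulate the race: both HQEs were Gathering at query time, then one
# is marked Failed before its AgentTask is created.
queried = [{'id': 1, 'status': 'Gathering'},
           {'id': 2, 'status': 'Gathering'}]
queried[1]['status'] = 'Failed'
tasks = create_agent_tasks(queried)
```

Here only the HQE that is still in a recoverable status gets a task;
the one that changed status is logged and skipped.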

Also, for the recovery of pidfiles, it is only necessary to create
AgentTasks up to the lucifer handoff point.

Since lucifer is now running at GATHERING, that means we don't have to
recover PARSING HQEs any more.

...Except! Suite jobs go directly into PARSING, so we still need to
handle parsing.

BUG= chromium:809773 
TEST=Run dummy suite locally

Change-Id: I231b96a0eee428ade87ab8b27f1d6946c6d519a3
Reviewed-on: https://chromium-review.googlesource.com/905746
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/c0109e3ee4cf8cda80bc6c1a63ecf83ad5cfca89/scheduler/monitor_db.py

Status: Fixed (was: Started)
