
Issue 740109

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 731573




Swarming bot dies when cache cleanup fails.

Project Member Reported by d...@chromium.org, Jul 7 2017

Issue description

When a Swarming bot attempts to clean up its caches after a build, the cleanup may fail. Potential causes include:

1) File handles held open by orphaned tasks from the build (*).
2) File permission issues on Windows.

An example of such a failure is here: https://chromium-swarm.appspot.com/task?id=372fc2340c856b10&refresh=10&show_raw=1&wide_logs=true

Proposal:
1) Record which caches should be purged by dropping a purge manifest JSON file.
2) Continue to clean up caches at the end of the build. If the cleanup succeeds, delete the manifest from (1).
3) When Swarming server starts, if the manifest from (1) exists, purge the caches it lists.
4) If (2) fails for any reason, reboot the system.

(*) While cause (1) may not be Swarming's fault, Swarming should still handle it.
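
A minimal sketch of the manifest idea proposed above, to make the steps concrete. This is not Swarming code: the file name `purge_manifest.json`, the JSON layout, and the helpers `write_manifest` and `purge_from_manifest` are all hypothetical, and the reboot in step (4) is left to the caller.

```python
import json
import os
import shutil
import tempfile

MANIFEST_NAME = "purge_manifest.json"  # hypothetical file name


def write_manifest(root, cache_names):
    """Step 1: record which caches should be purged."""
    path = os.path.join(root, MANIFEST_NAME)
    with open(path, "w") as f:
        json.dump({"purge": sorted(cache_names)}, f)
    return path


def purge_from_manifest(root):
    """Steps 2 and 3: purge the listed caches; delete the manifest on success.

    Returns True if every listed cache was removed (or no manifest exists),
    False if any cache directory could not be removed.
    """
    path = os.path.join(root, MANIFEST_NAME)
    if not os.path.exists(path):
        return True
    with open(path) as f:
        names = json.load(f)["purge"]
    ok = True
    for name in names:
        target = os.path.join(root, name)
        try:
            if os.path.isdir(target):
                shutil.rmtree(target)
        except OSError:
            # For example, an orphaned process still holds a file handle.
            ok = False
    if ok:
        os.remove(path)
    return ok
```

Because the manifest survives a failed purge, the same `purge_from_manifest` call can be repeated at startup; a `False` return is the point at which step (4) would apply.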
 

Comment 1 by d...@chromium.org, Jul 7 2017

Changing approach here a bit - since Swarming already has handling for zombie processes, I propose that we just ignore named cache failures and let "run_isolated.py" fail on the follow-up task directory purging.
Project Member

Comment 2 by bugdroid1@chromium.org, Jul 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/bd3cbc5ca8f6fd82345bb9073364c56bf2b73130

commit bd3cbc5ca8f6fd82345bb9073364c56bf2b73130
Author: dnj <dnj@google.com>
Date: Fri Jul 07 18:16:44 2017

[run_isolated] Tolerate cache uninstall errors.

If a named cache cannot be uninstalled, the Swarming bot will fail with
an unfriendly code path and the task will terminate as BOT_DIED. This
can happen if a zombie process lingers from a task and retains a handle to
the named cache.

Swarming already has code paths to handle zombie processes and task
space purge errors. This patch makes it so that named cache deletion
failures fall through to standard cleanup code instead of raising an
exception.

BUG= chromium:740109 
TEST=None
R=maruel@chromium.org, vadimsh@chromium.org

Review-Url: https://codereview.chromium.org/2973113003

[modify] https://crrev.com/bd3cbc5ca8f6fd82345bb9073364c56bf2b73130/client/run_isolated.py
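
For illustration only, the behavior described in the commit message (deletion failures fall through instead of raising) might look roughly like the sketch below. It is not the actual change to run_isolated.py: the function `uninstall_named_caches` and its `remove` parameter are hypothetical names chosen for this example.

```python
import logging
import os
import shutil
import tempfile


def uninstall_named_caches(cache_dirs, remove=shutil.rmtree):
    """Try to remove each cache directory; log and continue on failure.

    Returns the directories that could not be removed, so that later
    cleanup code can deal with them instead of the caller crashing.
    """
    failed = []
    for path in cache_dirs:
        try:
            remove(path)
        except OSError as e:
            # Assumed failure mode: a zombie process still holds a handle.
            logging.warning("Failed to uninstall named cache %s: %s", path, e)
            failed.append(path)
    return failed
```

The design choice is to treat a stuck cache as a recoverable condition: the error is logged and the loop continues, leaving the standard cleanup path to handle whatever remains.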

Comment 3 by d...@chromium.org, Jul 11 2017

Owner: d...@chromium.org
Status: Fixed (was: Untriaged)
This should be fixed now.
