
Issue 868083

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Aug 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 867622




Swarming: stop trimming cache in run_isolated post-task

Project Member Reported by mar...@chromium.org, Jul 26

Issue description

Trimming slows down the task for no reason, creates post-task overhead, and exposes all sorts of issues like issue 867622.
 
Blocking: 867622
Project Member

Comment 2 by bugdroid1@chromium.org, Aug 1

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/276294e53625739d4371bf50dcaefd9105f19c10

commit 276294e53625739d4371bf50dcaefd9105f19c10
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Aug 01 17:46:33 2018

client: Add Cache.save()

This will be needed in a follow-up where .trim() calls are replaced with .save()
for performance reasons.

Bug: 868083
Change-Id: I6b4029bfc20d5c1e088d43299cf2065f9585c74d
Reviewed-on: https://chromium-review.googlesource.com/1158617
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/276294e53625739d4371bf50dcaefd9105f19c10/client/local_caching.py
[modify] https://crrev.com/276294e53625739d4371bf50dcaefd9105f19c10/client/tests/local_caching_test.py
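To illustrate why replacing .trim() with .save() helps performance, here is a minimal sketch of a cache where save() only persists state while trim() also evicts entries. SketchCache, its fields, and its JSON state file are hypothetical and are not the actual classes or on-disk format in client/local_caching.py.

```python
import json
import os
import tempfile


class SketchCache:
    """Hypothetical LRU-style cache; NOT the real local_caching.py classes."""

    def __init__(self, state_path, max_size):
        self.state_path = state_path
        self.max_size = max_size
        # name -> size in bytes; dict order is insertion order, oldest first.
        self.entries = {}

    def add(self, name, size):
        # Re-adding an entry moves it to the "most recent" end.
        self.entries.pop(name, None)
        self.entries[name] = size

    def save(self):
        """Persists the state only. Cheap: nothing is evicted or deleted."""
        with open(self.state_path, 'w') as f:
            json.dump(list(self.entries.items()), f)

    def trim(self):
        """Evicts the oldest entries until under max_size, then saves.

        In a real cache the eviction deletes files, which is the slow part.
        """
        evicted = []
        while self.entries and sum(self.entries.values()) > self.max_size:
            oldest = next(iter(self.entries))
            del self.entries[oldest]
            evicted.append(oldest)
        self.save()
        return evicted
```

In this sketch, a caller that only needs the state on disk to be current calls save() and leaves eviction to a later trim() or cleanup pass.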

Project Member

Comment 3 by bugdroid1@chromium.org, Aug 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1

commit 5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Aug 03 03:41:22 2018

[client] stop trimming cache in run_isolated

Previous behavior:
When run_isolated started a task, it trimmed the caches. When it tore down a
task, it trimmed the caches again after removing the temporary directory. Since
task_runner controls the task context and run_isolated runs as a child process
of task_runner, the post-task trimming blocked the Swarming task from
completing.

Independent of this, bot_main always runs 'run_isolated --cleanup' after every
task, which makes the trimming quite redundant: cleanup is a superset of the
operations done by trim.


New behavior:
run_isolated trims neither before the task process starts nor after it
completes, and instead exits immediately after tearing down the temporary
directory.

bot_main runs 'run_isolated --cleanup' after the task is completed, as before,
except that it now does so before updating LKGBC (last known good bot code).
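The change in ordering can be sketched as a timeline of steps. This is only a rough model of the description above: the function names and step labels are made up for illustration, and the LKGBC update is left out because its exact position relative to the other steps is not spelled out here.

```python
def task_timeline(trim_in_run_isolated):
    """Returns the ordered (actor, step) pairs for one task.

    trim_in_run_isolated=True models the previous behavior,
    False models the new behavior. Labels are illustrative only.
    """
    steps = []
    if trim_in_run_isolated:
        steps.append(('run_isolated', 'trim caches'))
    steps.append(('run_isolated', 'run task process'))
    steps.append(('run_isolated', 'remove temporary directory'))
    if trim_in_run_isolated:
        steps.append(('run_isolated', 'trim caches'))
    steps.append(('task_runner', 'report task completed'))
    # Done by bot_main in both cases, outside of the task context.
    steps.append(('bot_main', 'run_isolated --cleanup'))
    return steps


def steps_blocking_task(steps):
    """Returns the step names that run before the task is reported completed."""
    names = [name for _, name in steps]
    return names[:names.index('report task completed')]
```

With the previous behavior, 'trim caches' shows up twice among the blocking steps; with the new behavior it does not show up at all, and the only cache maintenance left is the cleanup that follows task completion.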


Context:
Trimming the isolated cache is fairly quick, but trimming a named cache can
take an excruciatingly long time, on the order of several tens of minutes. We've
seen cases where the named cache removal fails due to undeletable files.


Net effect:
This significantly reduces perceived (from the client PoV) task completion
overhead and hides trimming issues from the tasks. This doesn't remove the
*actual* overhead to trim files but changes the context in which it's done.
Previously, trimming a named cache could happen either before or after a task,
as run_isolated did both.

What's happening is that trimming is now done outside of a task context, so that
nobody is waiting while this is happening. In practice, this still means we
can't achieve full throughput, and furthermore it is (currently) hard to assess
the trimming overhead.

It was hard to assess trimming overhead before, and it still is.

There's a big drawback to removing the trim() call right after setup: the free
disk space is now limited by the isolated and CIPD input sizes, which now need
to be taken into account when calculating the minimum free disk space in
bot_config. I think it's a reasonable trade-off versus performance.

This change may require tuning the free disk space parameters in bot_config.py.
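As a rough illustration of that tuning, the minimum free disk space can be thought of as a reserve plus the largest isolated and CIPD inputs a task is expected to fetch. The helper names and parameters below are hypothetical and are not actual bot_config.py settings.

```python
GIB = 1024 ** 3


def required_min_free_space(reserve, isolated_input_size, cipd_input_size):
    """Returns the free space (bytes) to keep available before a task starts.

    Assumption: with no trim right after setup, the task inputs are written
    on top of whatever the caches already hold, so their sizes are simply
    added to the reserve.
    """
    if min(reserve, isolated_input_size, cipd_input_size) < 0:
        raise ValueError('sizes must be non-negative')
    return reserve + isolated_input_size + cipd_input_size


def has_room_for_task(free_bytes, reserve, isolated_input_size,
                      cipd_input_size):
    """Returns True if free_bytes covers the reserve plus both inputs."""
    needed = required_min_free_space(
        reserve, isolated_input_size, cipd_input_size)
    return free_bytes >= needed
```

For example, under these assumptions a 4 GiB reserve with 2 GiB of isolated inputs and 1 GiB of CIPD packages calls for 7 GiB of free space.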

Bug: 868083
Change-Id: Ia7cf41c6f7aa6e0a5b22e6279b8ef6b631e305a1
Reviewed-on: https://chromium-review.googlesource.com/1152079
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/appengine/swarming/swarming_bot/bot_code/bot_main.py
[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/appengine/swarming/swarming_bot/bot_code/task_runner.py
[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/client/isolateserver.py
[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/client/run_isolated.py

Status: Fixed (was: Assigned)
The follow-up for bot overhead monitoring is issue 870723.
File deletion is tracked in issue 867622.
