Issue 701930

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug


Optimize cipd extraction on slow Swarming bots

Project Member Reported by vadimsh@chromium.org, Mar 15 2017

Issue description

The idea is a "cipd2isolate" tool: you give it an "ensure file" with a list of CIPD packages and an isolate server URL, and it isolates the end result of installing these packages and returns an isolated hash.

It will be used by Skia infra to minimize IO on bots by switching from cipd (which always unzips and copies files) to isolate (which uses only cheap symlinking if the cache is already warm). It can potentially later be used by DM and/or the recipe bundler.

By making it a separate tool, we can do tighter integration between the 'cipd' and 'isolate' guts (e.g. skipping the unzipping of cipd files that are already known to be isolated). The initial implementation will probably be dumb, though.
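For reference, a CIPD "ensure file" is essentially a list of `<package> <version>` lines, plus `#` comments and `$`-prefixed directives. A minimal parsing sketch (hypothetical code, not the real cipd client; the package names in the example are made up):

```python
# Minimal sketch of parsing a CIPD "ensure file": package/version pairs,
# skipping blank lines, "#" comments and "$" directives.
# Hypothetical code, not the real cipd client.

def parse_ensure_file(text):
    packages = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or line.startswith('$'):
            continue
        name, version = line.split()
        packages.append((name, version))
    return packages


example = """
# Packages needed by RPI tasks (hypothetical names).
$VerifiedPlatform linux-armv6l
skia/bots/skimage  version:42
skia/bots/skp      version:7
"""
# parse_ensure_file(example) ->
#   [('skia/bots/skimage', 'version:42'), ('skia/bots/skp', 'version:7')]
```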

Comment 1 by kjlubick@google.com, Mar 15 2017

Cc: borenet@chromium.org

Comment 2 by bore...@google.com, Mar 15 2017

When will this run, and who runs it?  Isolates are relatively short-lived, right? So our more-permanent CIPD packages will have to be re-isolated on a regular basis.

Comment 3 by kjlubick@google.com, Mar 15 2017

We'll run it with Task Scheduler when we trigger builds/tests/perfs for RPI tasks.

Isolate should be smart enough to not re-upload things that are already there, so it should be mostly a no-op, possibly deduplicated by swarming.

Question: things live in Isolate for 60 days.  Suppose we run this on the 61st day, after the cipd bits have been evicted from isolate.  Will swarming deduplication run the task again, or (incorrectly) mark it as a duplicate?

Yes, it will have to be run regularly. The current idea is to run it on Swarming, via your Task Scheduler (it's just one more dependency).

In the end, developers can still use CIPD to fetch/update/reference packages (and they will be stored permanently), but bots internally will use a more optimized mechanism to install stuff.

I actually don't know how exactly you use cipd; maybe it makes sense to ditch it completely and use isolate directly instead?

For more context: the problem we are trying to solve is the slow cipd step on RPI bots. It is slow because disk performance on RPI is very poor, while the CIPD+Swarming design assumes disk IO is not a bottleneck. In particular, cipd caches the *.zip of each package and unpacks it into the Swarming task directory before each task.

Isolate, on the other hand, caches individual 'unpacked' files and just symlinks them into the working directory. It performs faster, but the cost is more complicated cache management.
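The symlink trick above can be sketched in a few lines: the cache holds each unpacked file once, and "installing" into a task's working directory is just link creation, no copying. Hypothetical layout and names, not run_isolated's real code:

```python
# Sketch of isolate-style installation: populate a task's working directory
# with symlinks into a content-addressed cache instead of copying files.
# Hypothetical layout, not run_isolated's real code.
import os

def link_into_workdir(cache_dir, manifest, work_dir):
    """manifest maps a relative path in work_dir -> a file name in cache_dir."""
    for rel_path, cached_name in manifest.items():
        dst = os.path.join(work_dir, rel_path)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        # Cheap on any filesystem; the expensive unpack happened once,
        # when the file first entered the cache.
        os.symlink(os.path.join(cache_dir, cached_name), dst)
```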

We could try to teach 'cipd' to be like isolate when unpacking stuff, but it seems wrong, since we already have isolate. If isolate looks like a better fit, might just as well use it...

If integrating with Task Scheduler is complicated, we may consider other, more hacky options, like preserving installed skimage packages on bots via the "Swarming named caches" mechanism.

> Question: Things live in Isolate for 60 days.  Suppose we run this on the 61st day, the cipd bits have been ejected from isolate.  Will swarming deduplication run the task again or (incorrectly) mark it as duplicate?

Swarming task deduplication has its own expiration time, and it is shorter than Isolate's (currently 7 days). So Swarming will run the task again.
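The safety of relying on dedup here comes down to the two retention windows; a tiny sketch of the rule (the 7- and 60-day figures are the values quoted in this thread and may change):

```python
# Sketch of why stale dedup cannot happen: Swarming only reuses a cached
# task result while it is inside the dedup window (7 days), which is far
# shorter than isolate's 60-day retention. By the time a task's outputs
# could have been evicted from isolate, the task is re-run, not deduped.
SWARMING_DEDUP_DAYS = 7      # value quoted above; may change
ISOLATE_RETENTION_DAYS = 60  # value quoted above; may change

def reuses_cached_result(result_age_days):
    return result_age_days < SWARMING_DEDUP_DAYS

# A result old enough to have been evicted from isolate is never deduped:
assert not reuses_cached_result(ISOLATE_RETENTION_DAYS + 1)
```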
(Btw, in the long term we want to merge the Isolate and CIPD storage backends, so CIPD will become just a mechanism for attaching metadata and versioning information to permanent isolate hashes. But that is a very complicated change, for another day.)

The tool is almost ready, but I now have doubts.

Maybe let's try the suggestion in #4: install packages into a swarming named cache directory, which is preserved between swarming tasks.

Pros:
 * Easier to implement. Just a minor tweak of Swarming task config.
 * Will be extremely fast for the warm cache: CIPD will basically check the installed versions, conclude they are up to date, and exit.

Cons:
 * Introduces statefulness. If this installation directory becomes corrupted somehow, this corruption will carry between tasks until manually fixed.
 * Package updates will be slow: all files will be redownloaded, unzipped and copied. If you update packages often (e.g. hourly), this will be noticeable.
 * Swarming bot dir size will double.

Wdyt?
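The "warm cache" fast path in the pros above amounts to a version check against a manifest left in the named cache directory; a rough sketch (the manifest file name and format are made up):

```python
# Sketch of the warm-cache fast path in a named cache dir: compare the
# wanted package versions against a manifest written by the last install,
# and only (re)install the packages that changed. Hypothetical manifest
# name and format.
import json
import os

def packages_to_install(cache_dir, wanted):
    """wanted: dict of package name -> version. Returns the stale subset."""
    manifest_path = os.path.join(cache_dir, 'installed.json')
    try:
        with open(manifest_path) as f:
            installed = json.load(f)
    except (IOError, ValueError):
        installed = {}  # cold or corrupted cache: install everything
    return {name: ver for name, ver in wanted.items()
            if installed.get(name) != ver}
```

On a warm cache this returns an empty dict and the install step is a no-op, which is the "check versions and exit" behavior described above.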

Comment 9 by kjlubick@google.com, Mar 17 2017

We update packages less than once a week on average, so the cache will almost always be warm.

> Introduces statefulness. If this installation directory becomes corrupted somehow, this corruption will carry between tasks until manually fixed.

I'm not too concerned about this.  We could invalidate the install directory 10% of the time, or something like that, if corruption ends up being a problem we see in the wild.
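The 10% idea is just a coin flip before each task; a sketch (the rate is the guess above, and the function name is made up):

```python
# Sketch of probabilistic invalidation: before each task, wipe the
# named-cache install dir with small probability so that any corruption
# cannot persist indefinitely. The 10% rate is the guess from the comment
# above; the function name is hypothetical.
import random
import shutil

def maybe_invalidate(install_dir, rate=0.1, rng=random.random):
    if rng() < rate:
        shutil.rmtree(install_dir, ignore_errors=True)
        return True  # next task pays the cold-cache cost, but starts clean
    return False
```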

> Swarming bot dir size will double.

You mean we'll have to cache it in two places, so the size needed for CIPD packages will double? That's fine.

Another, simpler option is to do it on the client side:
- The CIPD client is somehow integrated into run_isolated: it fetches the tarball, stores it in the isolated cache (skipped if already present), expands it in the isolated cache (skipped if already present), and moves on.
- run_isolated already has its own --clean support that is run by bot_main after the task.

This is purely a bot-side optimization; nothing else needs to be refactored, so I feel this is safer to implement, but I don't have a strong opinion.
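The client-side flow above can be sketched as a content-addressed fetch-and-extract with both steps skipped when their result already exists (cache layout and names are made up; the real run_isolated/cipd integration differs):

```python
# Sketch of the client-side option: fetch the package archive into a
# content-addressed cache and expand it next to it, skipping either step
# when its result is already present. Layout and names are made up.
import os
import zipfile

def ensure_extracted(cache_dir, pkg_hash, fetch):
    """fetch(dest_path) downloads the package zip. Returns the extracted dir."""
    zip_path = os.path.join(cache_dir, pkg_hash + '.zip')
    out_dir = os.path.join(cache_dir, pkg_hash)
    if not os.path.exists(zip_path):   # skipped if already present
        fetch(zip_path)
    if not os.path.isdir(out_dir):     # skipped if already present
        tmp = out_dir + '.tmp'
        with zipfile.ZipFile(zip_path) as z:
            z.extractall(tmp)
        os.rename(tmp, out_dir)        # publish only a complete extraction
    return out_dir
```

Extracting to a temp dir and renaming means a task killed mid-extract never leaves a half-populated directory that later tasks would mistake for a valid cache entry, which matters given run_isolated's `--clean` runs after the task.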
Cc: -kjlubick@chromium.org vadimsh@chromium.org
Owner: kjlubick@google.com
Status: Started (was: Assigned)
I'm experimenting with the approach in comment #10.
Summary: Optimize cipd extraction on slow Swarming bots (was: Write cipd2isolate tool)
Comment 13 by bugdroid1@chromium.org (Project Member), May 11 2017

Status: Fixed (was: Started)
The above solution (isolating cipd assets in their own swarming task) worked for us.  Marking as fixed.
Comment 16 by bugdroid1@chromium.org (Project Member), Sep 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3b8e443b4f17f22047dc17fc551db30f8dce5049

commit 3b8e443b4f17f22047dc17fc551db30f8dce5049
Author: Jian Li <jianli@chromium.org>
Date: Tue Sep 19 22:24:51 2017

Shared preferences for prefetch

Bug:  701930 
Change-Id: I0115c410f2daa33904af79d1f8ab1bb98620071c
Reviewed-on: https://chromium-review.googlesource.com/671655
Commit-Queue: Jian Li <jianli@chromium.org>
Reviewed-by: Justin DeWitt <dewittj@chromium.org>
Cr-Commit-Position: refs/heads/master@{#502972}
[add] https://crrev.com/3b8e443b4f17f22047dc17fc551db30f8dce5049/chrome/android/java/src/org/chromium/chrome/browser/offlinepages/prefetch/PrefetchPrefs.java
[modify] https://crrev.com/3b8e443b4f17f22047dc17fc551db30f8dce5049/chrome/android/java_sources.gni
