Optimize cipd extraction on slow Swarming bots
Issue description
You give it an "ensure file" with a list of CIPD packages and an isolate server URL; it isolates the end result of installing these packages and returns an isolated hash. It will be used by Skia infra to minimize IO on bots by switching from cipd (which always unzips and copies files) to isolate (which uses only cheap symlinking if the cache is already warm). It could later be used by DM and/or the recipe bundler. By making it a separate tool, we can integrate the 'cipd' and 'isolate' guts more tightly (e.g. skip unzipping cipd files that are already known to be isolated). The initial implementation will probably be dumb, though.
,
Mar 15 2017
When will this run, and who runs it? Isolates are relatively short-lived, right? So our more-permanent CIPD packages will have to be re-isolated on a regular basis.
,
Mar 15 2017
We'll run it with Task Scheduler when we trigger builds/tests/perfs for RPI tasks. Isolate should be smart enough to not re-upload things that are already there, so it should be mostly a no-op, possibly deduplicated by swarming. Question: Things live in Isolate for 60 days. Suppose we run this on the 61st day, the cipd bits have been ejected from isolate. Will swarming deduplication run the task again or (incorrectly) mark it as duplicate?
,
Mar 15 2017
Yes, it will have to be run regularly. The current idea is to run it on Swarming, via your Task Scheduler (it's just one more dependency). In the end, developers could still use CIPD to fetch/update/reference packages (and they will be stored permanently), but bots will internally use a more optimized mechanism to install stuff. I actually don't know how exactly you use cipd; maybe it makes sense to ditch it completely and use isolate directly instead?

For more context: the problem we are trying to solve is the slow cipd step on RPI bots. It is slow because disk performance on RPI is very poor, but the CIPD+Swarming design assumes disk IO is not a bottleneck. In particular, cipd caches the *.zip of each package and unpacks it into the Swarming task directory before each task. Isolate, on the other hand, caches individual 'unpacked' files and just symlinks them into the working directory. It performs faster, but the cost is complicated cache management. We could try to teach 'cipd' to behave like isolate when unpacking stuff, but that seems wrong, since we already have isolate. If isolate looks like a better fit, we might as well use it...

If integrating with Task Scheduler is complicated, we could consider other, hackier options, like preserving installed skimage packages on bots via the "Swarming named caches" mechanism.
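To make the contrast in the comment above concrete, here is a minimal sketch of the two install strategies. It is not the actual cipd or isolate code: the function names, the cache layout, and the use of SHA-256 as the content key are assumptions made only for illustration.

```python
import hashlib
import os
import tempfile
import zipfile


def install_by_unzip(zip_path, dest):
    """cipd-like: unpack the cached archive, writing every byte into the task dir."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)


def install_by_symlink(cache_dir, manifest, dest):
    """isolate-like: link each already-unpacked cached file into the task dir."""
    for rel_path, digest in manifest.items():
        target = os.path.join(dest, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.symlink(os.path.join(cache_dir, digest), target)


def demo():
    root = tempfile.mkdtemp()
    files = {"images/a.png": b"aaa", "images/b.png": b"bbb"}  # made-up package content

    # What a zip cache would hold: one archive per package.
    zip_path = os.path.join(root, "pkg.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        for rel_path, data in files.items():
            zf.writestr(rel_path, data)

    # What a content-addressed cache would hold: one unpacked file per digest.
    cache_dir = os.path.join(root, "cache")
    os.makedirs(cache_dir)
    manifest = {}
    for rel_path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
        with open(os.path.join(cache_dir, digest), "wb") as f:
            f.write(data)
        manifest[rel_path] = digest

    unzip_dest = os.path.join(root, "task_unzip")
    link_dest = os.path.join(root, "task_link")
    install_by_unzip(zip_path, unzip_dest)
    install_by_symlink(cache_dir, manifest, link_dest)
    return unzip_dest, link_dest
```

With a warm cache, the symlink path writes no file content at all, which is why it suits bots with slow disks; the price is that the cache has to track individual files rather than whole archives.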
,
Mar 15 2017
> Question: Things live in Isolate for 60 days. Suppose we run this on the 61st day, the cipd bits have been ejected from isolate. Will swarming deduplication run the task again or (incorrectly) mark it as duplicate?

Swarming task deduplication has an expiration time, and it is shorter than Isolate's (it is 7 days currently). So Swarming will run the task again.
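The reasoning in the answer above can be written down as a small check. The two constants are the figures quoted in this thread; the function names and the exact boundary handling (whether day 7 itself still counts) are assumptions for illustration, not Swarming's real logic.

```python
ISOLATE_RETENTION_DAYS = 60  # figure quoted in the thread
DEDUP_WINDOW_DAYS = 7        # figure quoted in the thread ("currently")


def task_outcome(days_since_last_run):
    """Return what happens to an identical task requested this many days later."""
    if days_since_last_run <= DEDUP_WINDOW_DAYS:
        return "deduplicated"
    return "re-run"


def stale_dedup_possible():
    """True only if a task could be deduplicated after its isolated files expired."""
    return DEDUP_WINDOW_DAYS > ISOLATE_RETENTION_DAYS
```

Because the deduplication window is shorter than the isolate retention period, a request on day 61 falls outside the window and the task runs again, re-uploading whatever is missing.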
,
Mar 15 2017
(Btw, in the long term we wish to merge the Isolate and CIPD storage backends, so CIPD will become just a mechanism to attach metadata and versioning information to permanent isolate hashes. But this is a very complicated change, for another day.)
,
Mar 16 2017
The following revision refers to this bug:
https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/a3e2880ae6d94c2c37432d0c36d70256cf322760

commit a3e2880ae6d94c2c37432d0c36d70256cf322760
Author: vadimsh <vadimsh@chromium.org>
Date: Thu Mar 16 19:05:56 2017

cipd2isolate: Initial CLI parsing boilerplate.

R=iannucci@chromium.org, nodir@chromium.org
BUG= 701930

Review-Url: https://codereview.chromium.org/2746363007

[modify] https://crrev.com/a3e2880ae6d94c2c37432d0c36d70256cf322760/cipd/client/cli/main.go
[add] https://crrev.com/a3e2880ae6d94c2c37432d0c36d70256cf322760/cipd/client/cmd/cipd2isolate/isolate.go
[add] https://crrev.com/a3e2880ae6d94c2c37432d0c36d70256cf322760/cipd/client/cmd/cipd2isolate/main.go
,
Mar 17 2017
The tool is almost ready, but I now have doubts. Maybe let's try the suggestion in #4: install packages into a swarming named cache directory, which is preserved between swarming tasks.

Pros:
* Easier to implement. Just a minor tweak of Swarming task config.
* Will be extremely fast for the warm cache. CIPD will basically check installed versions, conclude they are up-to-date and exit.

Cons:
* Introduces statefulness. If this installation directory becomes corrupted somehow, this corruption will carry between tasks until manually fixed.
* Package updates will be slow. It will redownload all files, unzip and copy them. If you update packages often (e.g. hourly), this will be noticeable.
* Swarming bot dir size will double.

Wdyt?
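The warm-cache behaviour described above ("check installed versions, conclude they are up-to-date and exit") can be sketched as follows. This is not how the cipd client actually records its state: the JSON state file, its name, and the package names in the test are hypothetical.

```python
import json
import os
import tempfile

STATE_FILE = "installed_versions.json"  # hypothetical file name


def packages_to_install(cache_dir, desired):
    """Return only the packages whose installed version differs from the desired one."""
    state_path = os.path.join(cache_dir, STATE_FILE)
    installed = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            installed = json.load(f)
    return {name: ver for name, ver in desired.items() if installed.get(name) != ver}


def record_installed(cache_dir, desired):
    """Persist the installed versions so the next task sees a warm cache."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, STATE_FILE), "w") as f:
        json.dump(desired, f)
```

On a warm cache `packages_to_install` returns an empty dict and nothing touches the disk beyond reading one small file. The sketch also shows the statefulness concern: whatever is recorded in the directory is trusted by every later task.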
,
Mar 17 2017
We update packages < 1/week on average, so the cache will almost always be warm.

> Introduces statefulness. If this installation directory becomes corrupted somehow, this corruption will carry between tasks until manually fixed.

I'm not too concerned about this. We could invalidate the install directory 10% of the time or something like this, if it ends up being a problem we see in the wild.

> Swarming bot dir size will double.

You mean, we'll have to cache it in two places, so the size needed for CIPD packages will double? That's fine.
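The "invalidate 10% of the time" idea could look roughly like this. The function name and the injectable `rng` parameter are assumptions added so the behaviour can be tested deterministically; nothing in the thread says this was ever implemented.

```python
import os
import random
import shutil
import tempfile


def maybe_invalidate(install_dir, probability=0.1, rng=random.random):
    """Wipe the install directory with the given probability; return True if wiped."""
    if rng() < probability:
        shutil.rmtree(install_dir, ignore_errors=True)
        return True
    return False
```

Run before the install step, this bounds how long a corrupted directory is expected to survive, at the cost of a cold-cache install on roughly one task in ten.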
,
Mar 22 2017
Another simpler option is to do it on the client side:
- The CIPD client is somehow integrated into run_isolated, fetches the tarball, stores it in the isolated cache (skipped if present), expands it in the isolated cache (skipped if present), and moves on.
- run_isolated already has its own --clean support that is run by bot_main after the task.

This is purely a bot-side optimization with no need to refactor anything else, so I feel this is safer to implement, but I don't have a strong opinion.
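A rough sketch of the "skipped if present" flow from the comment above. It is not run_isolated code: `ensure_expanded`, the cache layout, and the injected `fetch` callable are hypothetical, the archive is treated as a zip (the format mentioned earlier in the thread for cipd packages), and `package_id` is assumed to be safe to use as a file name.

```python
import io
import os
import tempfile
import zipfile


def ensure_expanded(cache_dir, package_id, fetch):
    """Fetch and expand a package into the cache, skipping steps already done."""
    archive = os.path.join(cache_dir, package_id + ".zip")
    expanded = os.path.join(cache_dir, package_id)
    steps = []
    if not os.path.exists(archive):
        os.makedirs(cache_dir, exist_ok=True)
        with open(archive, "wb") as f:
            f.write(fetch(package_id))  # fetch returns the archive bytes
        steps.append("fetched")
    if not os.path.isdir(expanded):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(expanded)
        steps.append("expanded")
    return expanded, steps


def make_zip_bytes(files):
    """Helper for the example: build an in-memory zip from a {path: bytes} dict."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for rel_path, data in files.items():
            zf.writestr(rel_path, data)
    return buf.getvalue()
```

A second call with the same `package_id` performs neither step, which is the bot-side saving the comment is after. A real implementation would also need to handle a partially expanded directory left by an interrupted run.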
,
Apr 28 2017
I'm experimenting with the approach in comment #10
,
May 11 2017
,
May 11 2017
The following revision refers to this bug:
https://skia.googlesource.com/skia/+/07072944af9fac196efeb78d6791537221cd1d4c

commit 07072944af9fac196efeb78d6791537221cd1d4c
Author: Kevin Lubick <kjlubick@google.com>
Date: Thu May 11 18:11:01 2017

Isolate CIPD assets for RPI tasks

To verify the assets all end up in the right spot, I wiped all the assets off the phone and then ran https://chromium-swarm.appspot.com/task?id=36114ccaa41bd810&refresh=10

Overhead comparisons:
Control: 103s https://chromium-swarm.appspot.com/task?id=360e10170744db10
Cold cache: 105s https://chromium-swarm.appspot.com/task?id=36113c4aec720910
Warm cache: 8s https://chromium-swarm.appspot.com/task?id=361143954c1b1c10

Bug:701930
Bug:skia:5213
Change-Id: I1dc052203ed404b63d0a1974ccbe882d26ff9e48
Reviewed-on: https://skia-review.googlesource.com/16490
Commit-Queue: Kevin Lubick <kjlubick@google.com>
Reviewed-by: Eric Boren <borenet@google.com>

[add] https://crrev.com/07072944af9fac196efeb78d6791537221cd1d4c/infra/bots/isolate_skimage.isolate
[modify] https://crrev.com/07072944af9fac196efeb78d6791537221cd1d4c/infra/bots/jobs.json
[modify] https://crrev.com/07072944af9fac196efeb78d6791537221cd1d4c/infra/bots/gen_tasks.go
[modify] https://crrev.com/07072944af9fac196efeb78d6791537221cd1d4c/infra/bots/tasks.json
[add] https://crrev.com/07072944af9fac196efeb78d6791537221cd1d4c/infra/bots/isolate_svg.isolate
[add] https://crrev.com/07072944af9fac196efeb78d6791537221cd1d4c/infra/bots/isolate_skp.isolate
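For reference, the overhead figures in the commit message above work out as follows. The three timings are taken from the message; each is a single run, so the derived numbers are indicative only. The function name is an illustration.

```python
def overhead_summary(control_s=103, cold_s=105, warm_s=8):
    """Derive savings from the overhead timings quoted in the commit message."""
    return {
        "warm_saving_s": control_s - warm_s,    # seconds saved with a warm cache
        "warm_speedup": control_s / warm_s,     # how many times faster than control
        "cold_penalty_s": cold_s - control_s,   # extra seconds with a cold cache
    }
```

With the quoted timings, the warm cache saves 95 s per task (about 12.9 times faster than the control), while a cold cache costs only 2 s more than the control.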
,
May 15 2017
The above solution (isolate cipd assets in own swarming task) worked for us. Marking as fixed.
,
Sep 15 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/7fa61a50cf6a2773667974b03a29246498b19aa0

commit 7fa61a50cf6a2773667974b03a29246498b19aa0
Author: Jian Li <jianli@chromium.org>
Date: Fri Sep 15 21:55:45 2017

Schedule retry with or without backoff for prefetch background task

Bug: 701930
Change-Id: I344f3f37e34a21d12afb654621a15527fb820760
Reviewed-on: https://chromium-review.googlesource.com/668088
Commit-Queue: Jian Li <jianli@chromium.org>
Reviewed-by: Justin DeWitt <dewittj@chromium.org>
Cr-Commit-Position: refs/heads/master@{#502393}

[modify] https://crrev.com/7fa61a50cf6a2773667974b03a29246498b19aa0/components/offline_pages/core/prefetch/page_bundle_update_task.cc
[modify] https://crrev.com/7fa61a50cf6a2773667974b03a29246498b19aa0/components/offline_pages/core/prefetch/prefetch_dispatcher_impl.cc
[modify] https://crrev.com/7fa61a50cf6a2773667974b03a29246498b19aa0/components/offline_pages/core/prefetch/prefetch_dispatcher_impl.h
[modify] https://crrev.com/7fa61a50cf6a2773667974b03a29246498b19aa0/components/offline_pages/core/prefetch/prefetch_dispatcher_impl_unittest.cc
,
Sep 19 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/3b8e443b4f17f22047dc17fc551db30f8dce5049

commit 3b8e443b4f17f22047dc17fc551db30f8dce5049
Author: Jian Li <jianli@chromium.org>
Date: Tue Sep 19 22:24:51 2017

Shared preferences for prefetch

Bug: 701930
Change-Id: I0115c410f2daa33904af79d1f8ab1bb98620071c
Reviewed-on: https://chromium-review.googlesource.com/671655
Commit-Queue: Jian Li <jianli@chromium.org>
Reviewed-by: Justin DeWitt <dewittj@chromium.org>
Cr-Commit-Position: refs/heads/master@{#502972}

[add] https://crrev.com/3b8e443b4f17f22047dc17fc551db30f8dce5049/chrome/android/java/src/org/chromium/chrome/browser/offlinepages/prefetch/PrefetchPrefs.java
[modify] https://crrev.com/3b8e443b4f17f22047dc17fc551db30f8dce5049/chrome/android/java_sources.gni
Comment 1 by kjlubick@google.com
, Mar 15 2017