Swarming: stop trimming cache in run_isolated post-task
Issue description
It's slowing down the task for no reason, creating post-task overhead, and exposing all sorts of issues like 867622.
Aug 1
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/276294e53625739d4371bf50dcaefd9105f19c10

commit 276294e53625739d4371bf50dcaefd9105f19c10
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Aug 01 17:46:33 2018

client: Add Cache.save()

This will be needed in a follow up where .trim() calls are replaced with .save() for performance reasons.

Bug: 868083
Change-Id: I6b4029bfc20d5c1e088d43299cf2065f9585c74d
Reviewed-on: https://chromium-review.googlesource.com/1158617
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/276294e53625739d4371bf50dcaefd9105f19c10/client/local_caching.py
[modify] https://crrev.com/276294e53625739d4371bf50dcaefd9105f19c10/client/tests/local_caching_test.py
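The difference between saving and trimming can be sketched roughly as follows. This is only an illustration and not the code in client/local_caching.py: the class name ToyCache, its fields, and the size-based LRU eviction policy are assumptions invented for the example. The point is that save() only persists bookkeeping, while trim() also evicts entries, which is the potentially slow part.

```python
import json
import os
import tempfile
from collections import OrderedDict


class ToyCache:
    """Hypothetical LRU cache; NOT the luci-py implementation."""

    def __init__(self, state_path, max_total_size):
        self.state_path = state_path
        self.max_total_size = max_total_size
        # Maps entry name -> size in bytes, least recently used first.
        self.entries = OrderedDict()

    def add(self, name, size):
        self.entries[name] = size
        self.entries.move_to_end(name)

    def save(self):
        """Persists the bookkeeping only; never evicts anything."""
        with open(self.state_path, "w") as f:
            json.dump(list(self.entries.items()), f)

    def trim(self):
        """Evicts the oldest entries until under budget, then saves."""
        evicted = []
        while self.entries and sum(self.entries.values()) > self.max_total_size:
            name, _ = self.entries.popitem(last=False)
            evicted.append(name)
        self.save()
        return evicted


# Usage: save() keeps both entries even though the cache is over budget;
# trim() is the call that actually evicts.
state = os.path.join(tempfile.mkdtemp(), "state.json")
cache = ToyCache(state, max_total_size=100)
cache.add("a", 60)
cache.add("b", 60)
cache.save()
entries_after_save = list(cache.entries)
evicted = cache.trim()
```

In this toy model, replacing a trim() call with save() keeps the on-disk state consistent while deferring eviction to a later point, which is the trade the commit message describes.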
Aug 3
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1

commit 5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Aug 03 03:41:22 2018

[client] stop trimming cache in run_isolated

Previous behavior:
When run_isolated starts a task, it would trim the caches. When it tears down a task, after removing the temporary directory it also trims the caches. Since it is task_runner that controls the task context, and run_isolated runs as a child process of task_runner, the post-task trimming blocks the Swarming task from completing. Independent of this, bot_main always runs 'run_isolated --cleanup' after every task, which is quite redundant: cleanup is a superset of the operations done by trim.

New behavior:
run_isolated trims neither before nor after the task process completes, and instead exits immediately after tearing down the temporary directory. bot_main runs 'run_isolated --cleanup' after the task is completed, as before, except that it now does so before updating LKGBC (last known good bot code).

Context:
Trimming the isolated cache is fairly quick, but trimming a named cache can take an excruciating amount of time, like several tens of minutes. We've seen cases where the named cache removal fails due to undeletable files.

Net effect:
This significantly reduces perceived (from the client PoV) task completion overhead and hides trimming issues from the tasks. This doesn't remove the *actual* overhead to trim files but changes the context in which it's done. Trimming a named cache could happen either before or after a task, as run_isolated did both. What's happening is that trimming is now done outside of a task context, so that nobody is waiting while this is happening. In practice, this still means we can't achieve full throughput, and furthermore it is (currently) hard to assess the trimming overhead. It was hard to assess trimming overheads before, and it still is.

There's a big drawback to removing the trim() call right after setup: the free disk space is now limited by the isolated and CIPD input sizes, which now need to be taken into account when calculating the minimum free disk space in bot_config. I think it's a reasonable trade-off vs. performance.

This change may require tuning the free disk space parameters in bot_config.py.

Bug: 868083
Change-Id: Ia7cf41c6f7aa6e0a5b22e6279b8ef6b631e305a1
Reviewed-on: https://chromium-review.googlesource.com/1152079
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/appengine/swarming/swarming_bot/bot_code/bot_main.py
[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/appengine/swarming/swarming_bot/bot_code/task_runner.py
[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/client/isolateserver.py
[modify] https://crrev.com/5cbe47ec9a86ba45d0112388fbb8cd6a8a6079d1/client/run_isolated.py
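The net effect described in the commit message can be modelled with a few lines of arithmetic. This is a simplified sketch, not code from the Swarming bot: the function name simulate_bot_cycle and the example durations are assumptions chosen for illustration. It shows that moving cleanup outside the task context shortens what the client waits for, while the time the bot is busy stays the same.

```python
def simulate_bot_cycle(task_seconds, cleanup_seconds, cleanup_inside_task):
    """Returns (seconds the client waits, seconds the bot is busy).

    Hypothetical model: cleanup_inside_task=True stands for the old
    behavior, where trimming ran before the task was reported complete.
    """
    client_wait = task_seconds
    if cleanup_inside_task:
        # Old behavior: the task does not complete until trimming is done.
        client_wait += cleanup_seconds
    # The bot spends the cleanup time either way; only the context changes.
    bot_busy = task_seconds + cleanup_seconds
    return client_wait, bot_busy


# Example with made-up numbers: a 5 minute task and a 20 minute cleanup.
old_behavior = simulate_bot_cycle(300, 1200, cleanup_inside_task=True)
new_behavior = simulate_bot_cycle(300, 1200, cleanup_inside_task=False)
```

With these made-up numbers the client-visible duration drops from 1500 to 300 seconds, but the bot is occupied for 1500 seconds in both cases, which is why full throughput is still not achieved.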
Aug 3
Follow-up for bot overhead monitoring is on issue 870723. Tracking file deletion is issue 867622.
Comment 1 by mar...@chromium.org, Jul 26