Swarming: LRU cache cleanup is locking into one of the caches (isolate, cipd, named) and not cleaning the other ones in a LRU fashion |
|||||
Issue descriptionThere was a report of a Swarming bot that had a stale named cache that put the bot under the low disk space threshold. AIs: - Add a unit test for run_isolated.clean_caches(), there was none. - Confirm any problem with named cache timestamps in the state.json LRU cache.
,
May 4 2017
The following revision refers to this bug: https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/83b611c26b82951c2e435da873c9105c245aaf6e commit 83b611c26b82951c2e435da873c9105c245aaf6e Author: maruel <maruel@chromium.org> Date: Thu May 04 16:03:49 2017 Add unit test to clean_caches() - Add unit test to ensure clean_caches() work. - Make trim parameter explicit. - Change Cache.trim() to return the number of items trimmed. BUG= 717716 R=nodir@chromium.org Review-Url: https://codereview.chromium.org/2853413002 [modify] https://crrev.com/83b611c26b82951c2e435da873c9105c245aaf6e/client/cipd.py [modify] https://crrev.com/83b611c26b82951c2e435da873c9105c245aaf6e/client/isolateserver.py [modify] https://crrev.com/83b611c26b82951c2e435da873c9105c245aaf6e/client/named_cache.py [modify] https://crrev.com/83b611c26b82951c2e435da873c9105c245aaf6e/client/run_isolated.py [modify] https://crrev.com/83b611c26b82951c2e435da873c9105c245aaf6e/client/tests/isolateserver_test.py [modify] https://crrev.com/83b611c26b82951c2e435da873c9105c245aaf6e/client/tests/run_isolated_smoke_test.py [modify] https://crrev.com/83b611c26b82951c2e435da873c9105c245aaf6e/client/tests/run_isolated_test.py
,
Mar 15 2018
,
Mar 15 2018
From closed issue: Example builder: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Android%20x64%20Builder%20%28dbg%29 Graph of disk usage: https://viceroy.corp.google.com/chrome_infra/Machines/per_machine?hostname=swarm976-c4&duration=1d&refresh=-1 This builder has failed multiple times in a row. I suspect that swarming is doing something like: * Delete <small && required> caches until < threshold * Run task * Run out of disk space There's some big, honking caches that are available for deletion on this bot (everything not ending with _v2): https://chromium-swarm.appspot.com/bot?id=swarm976-c4&sort_stats=total%3Adesc Marking as P1 as this is actually causing infra failures around the fleet, currently. If there's a systemic fix we can implement in swarming quickly, that would be the best, otherwise we can whack-a-mole.
,
Mar 21 2018
Actually mark as P1...
,
Jun 12 2018
I reproduced the problem (sadly on prod on luci.v8 bots) and I now know exactly what it is.
,
Jun 12 2018
,
Jun 12 2018
Issue 851945 has been merged into this issue.
,
Jun 12 2018
,
Jun 15 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/00938ca5c9bf8139494acf468a3512a35308e149 commit 00938ca5c9bf8139494acf468a3512a35308e149 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Fri Jun 15 19:06:12 2018 [client] Fix Memory CAC to be LRU, internal cleanups - Remove initial_number_items and initial_size. Instead have run_isolated load the values before fetching. - Replace number_items with __len__. Move from ContentAddressedCache to Cache. - Replace cached_set with __iter__. - Move __contains__ and total_size to Cache. - Stop trying to evict when double downloading. The function will be removed afterward. - Change MemoryContentAddressedCache to be a proper LRU cache. - Move code around, Cache methods first, then ContentAddressedCache, then private methods. - Add constant MAX_AGE_SECS. - Minor docstring update or removal. Minimal unit test update. A follow up is heavy on unit tests as more refactoring is done. The follow up is meant to be committed soon after this CL, but is separate to make the whole change more reviewable. Bug: 717716 Change-Id: I2752bf74e708e1ded43e172052928ff74c6b224c Reviewed-on: https://chromium-review.googlesource.com/1101856 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> [modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/isolateserver.py [modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/local_caching.py [modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/run_isolated.py [modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/tests/local_caching_test.py
,
Jun 15 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/a768d983f46e3cc7314e2d329831a4a2ce71bde2 commit a768d983f46e3cc7314e2d329831a4a2ce71bde2 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Fri Jun 15 20:04:42 2018 [client] Fix cache LRU coherency - Move run_isolated.clean_caches() to local_caching.trim_caches(). Change it to trim properly and stop calling clean(). Make the code work with an arbitrary number of caches. - Remove evict(), get_timestamp(). - Add get_oldest(), remove_oldest(). - Fix named cache cache name symlink removal on install. - Do not call .clean() on the isolated cache unconditionally. Only call it on --clean. - Rewrote unit tests to dramatically increase code coverage. - file_path.try_remove() make it return True on success. - Commented out noisy log in set_read_only(). Bug: 717716 Change-Id: I5ea53b49c97080cd68070e144b54967e28730c2d Reviewed-on: https://chromium-review.googlesource.com/1101853 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/local_caching.py [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/run_isolated.py [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/local_caching_test.py [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/lru_test.py [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/run_isolated_smoke_test.py [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/run_isolated_test.py [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/utils/file_path.py [modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/utils/lru.py
,
Jun 15 2018
,
Jun 15 2018
Was this deployed to prod yet?
,
Jun 16 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/d335c4bf862924d7014b5dfa9e032f40a075501c commit d335c4bf862924d7014b5dfa9e032f40a075501c Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Sat Jun 16 14:41:12 2018 [swarming] Implement NamedCache.cleanup() This will remove 'forgotten' named cache. There's likely many on many bots. Oops. This should close once for all the named cache disk space usage issues. Bug: 717716 Change-Id: I318395728952a8520a46b1e0516911810e9ac53a Reviewed-on: https://chromium-review.googlesource.com/1103077 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> [modify] https://crrev.com/d335c4bf862924d7014b5dfa9e032f40a075501c/client/local_caching.py [modify] https://crrev.com/d335c4bf862924d7014b5dfa9e032f40a075501c/client/tests/local_caching_test.py
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/3d8b8a15a8a745641540adffeb517676c1c6d7ee commit 3d8b8a15a8a745641540adffeb517676c1c6d7ee Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Mon Jun 18 13:09:24 2018 [client] Fix regression in a768d983f4 I typoed cleanup() to clean(). TBR=qyearsley@chromium.org Bug: 717716 Change-Id: I927fc0797b933933738c252ed8fed12b82708b1b Reviewed-on: https://chromium-review.googlesource.com/1104380 Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/3d8b8a15a8a745641540adffeb517676c1c6d7ee/client/run_isolated.py
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/6b6fa798293dcc22ad6fb99087b1a3b0842baf2c commit 6b6fa798293dcc22ad6fb99087b1a3b0842baf2c Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Mon Jun 18 13:31:34 2018 [client] Fix regression in d335c4bf86 fs.readlink() is not yet implemented on Windows. So do not run the named cache verification code there for now. TBR=qyearsley@chromium.org Bug: 717716 , 853721 Change-Id: I1de0375a5ca59e0f72bc9e68827dcaec2ccaa190 Reviewed-on: https://chromium-review.googlesource.com/1104381 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/6b6fa798293dcc22ad6fb99087b1a3b0842baf2c/client/local_caching.py [modify] https://crrev.com/6b6fa798293dcc22ad6fb99087b1a3b0842baf2c/client/utils/fs.py
,
Jun 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/1e4c78b85f523f72992151290f28837e7b4ed8d9 commit 1e4c78b85f523f72992151290f28837e7b4ed8d9 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Wed Jun 20 19:56:23 2018 [swarming] Update task_runner_test.py due to 00938ca5c9bf813 and pylint 1.4.5 No functional change. Disable a new pylint warning as pylint was upgraded to 1.4.5. TBR= Bug: 717716 Change-Id: Iaba54d06d75e718b96be7bed5b30721d31a34e31 Reviewed-on: https://chromium-review.googlesource.com/1108522 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/1e4c78b85f523f72992151290f28837e7b4ed8d9/appengine/swarming/server/task_scheduler.py [modify] https://crrev.com/1e4c78b85f523f72992151290f28837e7b4ed8d9/appengine/swarming/swarming_bot/bot_code/task_runner_test.py
,
Jun 28 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/21ffa58372cc63f5483999faa32fee0f0c3d6f6b commit 21ffa58372cc63f5483999faa32fee0f0c3d6f6b Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Thu Jun 28 14:52:34 2018 client: fix exception on symlink The server has a lot of crashes because rmtree() is unhappy with symlinks. As the code is trying to remove named cache symlink, this happens a lot. R=qyearsley@chromium.org Bug: 717716 Change-Id: I8523479926bc39d2338a40237cc061fe33efe680 Reviewed-on: https://chromium-review.googlesource.com/1117604 Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/21ffa58372cc63f5483999faa32fee0f0c3d6f6b/client/local_caching.py |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by bugdroid1@chromium.org
, May 4 2017