New issue
Advanced search Search tips

Issue 717716 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 848665



Sign in to add a comment

Swarming: LRU cache cleanup is locking into one of the caches (isolate, cipd, named) and not cleaning the other ones in a LRU fashion

Project Member Reported by mar...@chromium.org, May 2 2017

Issue description

There was a report of a Swarming bot that had a stale named cache that put the bot under the low disk space threshold.

AIs:
- Add a unit test for run_isolated.clean_caches(), there was none.
- Confirm any problem with named cache timestamps in the state.json LRU cache.
 
Cc: mar...@chromium.org
 Issue 822098  has been merged into this issue.
From closed issue:

Example builder: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Android%20x64%20Builder%20%28dbg%29

Graph of disk usage: https://viceroy.corp.google.com/chrome_infra/Machines/per_machine?hostname=swarm976-c4&duration=1d&refresh=-1

This builder has failed multiple times in a row. I suspect that swarming is doing something like:
  * Delete <small && required> caches until < threshold
  * Run task
  * Run out of disk space

There's some big, honking caches that are available for deletion on this bot (everything not ending with _v2): https://chromium-swarm.appspot.com/bot?id=swarm976-c4&sort_stats=total%3Adesc

Marking as P1 as this is actually causing infra failures around the fleet, currently. If there's a systemic fix we can implement in swarming quickly, that would be the best, otherwise we can whack-a-mole.
Labels: -Pri-2 Pri-1
Actually mark as P1...

Comment 6 by mar...@chromium.org, Jun 12 2018

I reproduced the problem (sadly on prod on luci.v8 bots) and I now know exactly what it is.

Comment 7 by mar...@chromium.org, Jun 12 2018

Summary: Swarming: LRU cache cleanup is locking into one of the caches (isolate, cipd, named) and not cleaning the other ones in a LRU fashion (was: Suspicion that named caches aren't cleaned up properly)

Comment 8 by mar...@chromium.org, Jun 12 2018

Issue 851945 has been merged into this issue.

Comment 9 by mar...@chromium.org, Jun 12 2018

Blocking: 848665
Project Member

Comment 10 by bugdroid1@chromium.org, Jun 15 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/00938ca5c9bf8139494acf468a3512a35308e149

commit 00938ca5c9bf8139494acf468a3512a35308e149
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Jun 15 19:06:12 2018

[client] Fix Memory CAC to be LRU, internal cleanups

- Remove initial_number_items and initial_size. Instead have run_isolated load
  the values before fetching.
- Replace number_items with __len__. Move from ContentAddressedCache to Cache.
- Replace cached_set with __iter__.
- Move __contains__ and total_size to Cache.
- Stop trying to evict when double downloading. The function will be removed
  afterward.
- Change MemoryContentAddressedCache to be a proper LRU cache.
- Move code around, Cache methods first, then ContentAddressedCache, then
  private methods.
- Add constant MAX_AGE_SECS.
- Minor docstring update or removal.

Minimal unit test update. A follow up is heavy on unit tests as more
refactoring is done. The follow up is meant to be committed soon after this CL,
but is separate to make the whole change more reviewable.

Bug:  717716 
Change-Id: I2752bf74e708e1ded43e172052928ff74c6b224c
Reviewed-on: https://chromium-review.googlesource.com/1101856
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/isolateserver.py
[modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/local_caching.py
[modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/run_isolated.py
[modify] https://crrev.com/00938ca5c9bf8139494acf468a3512a35308e149/client/tests/local_caching_test.py

Project Member

Comment 11 by bugdroid1@chromium.org, Jun 15 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/a768d983f46e3cc7314e2d329831a4a2ce71bde2

commit a768d983f46e3cc7314e2d329831a4a2ce71bde2
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Jun 15 20:04:42 2018

[client] Fix cache LRU coherency

- Move run_isolated.clean_caches() to local_caching.trim_caches(). Change it to
  trim properly and stop calling clean(). Make the code work with an arbitrary
  number of caches.
- Remove evict(), get_timestamp().
- Add get_oldest(), remove_oldest().
- Fix named cache cache name symlink removal on install.
- Do not call .clean() on the isolated cache unconditionally. Only call it on
  --clean.
- Rewrote unit tests to dramatically increase code coverage.
- file_path.try_remove() make it return True on success.
- Commented out noisy log in set_read_only().

Bug:  717716 
Change-Id: I5ea53b49c97080cd68070e144b54967e28730c2d
Reviewed-on: https://chromium-review.googlesource.com/1101853
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/local_caching.py
[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/run_isolated.py
[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/local_caching_test.py
[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/lru_test.py
[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/run_isolated_smoke_test.py
[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/tests/run_isolated_test.py
[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/utils/file_path.py
[modify] https://crrev.com/a768d983f46e3cc7314e2d329831a4a2ce71bde2/client/utils/lru.py

Status: Fixed (was: Assigned)
Was this deployed to prod yet?
Project Member

Comment 14 by bugdroid1@chromium.org, Jun 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d335c4bf862924d7014b5dfa9e032f40a075501c

commit d335c4bf862924d7014b5dfa9e032f40a075501c
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Sat Jun 16 14:41:12 2018

[swarming] Implement NamedCache.cleanup()

This will remove 'forgotten' named cache. There's likely many on many bots.
Oops.

This should close once for all the named cache disk space usage issues.

Bug:  717716 
Change-Id: I318395728952a8520a46b1e0516911810e9ac53a
Reviewed-on: https://chromium-review.googlesource.com/1103077
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/d335c4bf862924d7014b5dfa9e032f40a075501c/client/local_caching.py
[modify] https://crrev.com/d335c4bf862924d7014b5dfa9e032f40a075501c/client/tests/local_caching_test.py

Project Member

Comment 15 by bugdroid1@chromium.org, Jun 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/3d8b8a15a8a745641540adffeb517676c1c6d7ee

commit 3d8b8a15a8a745641540adffeb517676c1c6d7ee
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Jun 18 13:09:24 2018

[client] Fix regression in a768d983f4

I typoed cleanup() to clean().

TBR=qyearsley@chromium.org

Bug:  717716 
Change-Id: I927fc0797b933933738c252ed8fed12b82708b1b
Reviewed-on: https://chromium-review.googlesource.com/1104380
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/3d8b8a15a8a745641540adffeb517676c1c6d7ee/client/run_isolated.py

Project Member

Comment 16 by bugdroid1@chromium.org, Jun 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/6b6fa798293dcc22ad6fb99087b1a3b0842baf2c

commit 6b6fa798293dcc22ad6fb99087b1a3b0842baf2c
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Jun 18 13:31:34 2018

[client] Fix regression in d335c4bf86

fs.readlink() is not yet implemented on Windows. So do not run the named cache
verification code there for now.

TBR=qyearsley@chromium.org

Bug:  717716 , 853721
Change-Id: I1de0375a5ca59e0f72bc9e68827dcaec2ccaa190
Reviewed-on: https://chromium-review.googlesource.com/1104381
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/6b6fa798293dcc22ad6fb99087b1a3b0842baf2c/client/local_caching.py
[modify] https://crrev.com/6b6fa798293dcc22ad6fb99087b1a3b0842baf2c/client/utils/fs.py

Project Member

Comment 17 by bugdroid1@chromium.org, Jun 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/1e4c78b85f523f72992151290f28837e7b4ed8d9

commit 1e4c78b85f523f72992151290f28837e7b4ed8d9
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Jun 20 19:56:23 2018

[swarming] Update task_runner_test.py due to 00938ca5c9bf813 and pylint 1.4.5

No functional change.

Disable a new pylint warning as pylint was upgraded to 1.4.5.

TBR=
Bug:  717716 
Change-Id: Iaba54d06d75e718b96be7bed5b30721d31a34e31
Reviewed-on: https://chromium-review.googlesource.com/1108522
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/1e4c78b85f523f72992151290f28837e7b4ed8d9/appengine/swarming/server/task_scheduler.py
[modify] https://crrev.com/1e4c78b85f523f72992151290f28837e7b4ed8d9/appengine/swarming/swarming_bot/bot_code/task_runner_test.py

Project Member

Comment 18 by bugdroid1@chromium.org, Jun 28 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/21ffa58372cc63f5483999faa32fee0f0c3d6f6b

commit 21ffa58372cc63f5483999faa32fee0f0c3d6f6b
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu Jun 28 14:52:34 2018

client: fix exception on symlink

The server has a lot of crashes because rmtree() is unhappy with symlinks. As
the code is trying to remove named cache symlink, this happens a lot.

R=qyearsley@chromium.org

Bug:  717716 
Change-Id: I8523479926bc39d2338a40237cc061fe33efe680
Reviewed-on: https://chromium-review.googlesource.com/1117604
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/21ffa58372cc63f5483999faa32fee0f0c3d6f6b/client/local_caching.py

Sign in to add a comment