Issue 919599

Issue metadata

Status: Fixed
Owner:
Closed: Today
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocking:
issue 866062



CrOS DUT bots exceeding disk space

Project Member Reported by bpastene@chromium.org, Jan 7

Issue description

http://shortn/_Ue1pR3FBgM

Looks like isolated_cache on the bots is growing indefinitely (and not getting cleaned up?)

--- /b/cros_build10-a1/swarming --------------
   83.9GiB [##########] /isolated_cache
  416.3MiB [          ] /c
  103.7MiB [          ] /cipd_cache
   73.0MiB [          ] /logs
  132.0KiB [          ]  e2bfe61c8f0dc89e72a854f4afb14f4b662ea6301fc5652ebe03f80fa2b06263-cacert.pem
    4.0KiB [          ]  README
    4.0KiB [          ]  swarming.lck
    0.0  B [          ]  swarming_bot.zip


On Android bots, isolated_cache is capped at 50 GB, so it isn't a problem there. I think I need to figure out how to apply the same cache policy to the CrOS DUT bots.
 
Issue 919056 has been merged into this issue.
Issue 919098 has been merged into this issue.
Issue 919438 has been merged into this issue.
Ah, it appears that cache sizes on most bots are capped at 50 GB via the default bot_config script's get_settings():
https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/scripts/bot_config.py#394

The cros bots use their own bot_config script (https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/scripts/cros_ssh.py), so I probably just need to add a get_settings() there (which will probably just call out to bot_config's).
Actually... since the cros bot_config script doesn't have a get_settings(), it should fall back to the server's default implementation, which does cap the cache size at 50 GB:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/swarming_bot/config/bot_config.py#120

Which clearly isn't working since it's 80+ GB. Either I'm confused, or the bot is...
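
For reference, a get_settings() override in a bot_config script might look roughly like the sketch below. The dict layout ('caches' / 'isolated' / 'size' and 'items') and the item limit are assumptions about the default implementation linked above, not copied from it, and the 50 GB cap is written here as 50 GiB.

```python
# Hypothetical get_settings() for a bot_config script. The key names and the
# item limit are assumptions; check them against the default bot_config.py.
GIB = 1024 ** 3


def get_settings(bot):
    """Return bot settings that cap the isolated cache at 50 GiB."""
    del bot  # Unused in this sketch.
    return {
        'caches': {
            'isolated': {
                'size': 50 * GIB,    # Maximum total bytes kept in the cache.
                'items': 50 * 1024,  # Assumed maximum number of cached items.
            },
        },
    }
```

A cap like this only takes effect if the bot can actually delete old cache entries when it trims.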
Maybe the bot fails to delete items? Probably worth looking at the logs; I can't right now.
Ahh, this is due to the bot's inability to clobber its cros chroot (stored locally in a named cache). There are a number of these errors in the logs:
"Swallowing make_tree_deleteable() error: [Errno 1] Operation not permitted: '/b/swarming/c/Cf/chroot'"

The chroot is owned by root, but the bot runs as non-root, and failing to rm it prevents any subsequent isolated_cache cleanup. The bot has password-less sudo, and looking at the implementation of make_tree_deleteable, it should detect if that's available and use it:
https://codesearch.chromium.org/chromium/infra/luci/client/utils/file_path.py?rcl=a948849703f383b2c1c891e51f61cb1a9f758fd2&l=1126

Might be a bug somewhere in there. I'll dig in
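
As an aside, one common way to probe for password-less sudo is to run `sudo -n true` and check the exit code. The sketch below assumes that approach; it is not the logic in make_tree_deleteable, and has_passwordless_sudo is a hypothetical helper name.

```python
import subprocess


def has_passwordless_sudo(run=subprocess.call):
    """Return True if `sudo -n true` exits 0, i.e. sudo needs no password.

    `run` is injectable so the check can be exercised without real sudo.
    """
    try:
        # -n makes sudo fail immediately instead of prompting for a password.
        code = run(['sudo', '-n', 'true'],
                   stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL)
    except OSError:
        # sudo is not installed or cannot be executed.
        return False
    return code == 0
```

Even when this check passes, the cleanup can still fail if the command run under sudo does not change the permission bits that block deletion.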
Ahhhhhh ok. Turns out the bot was successfully removing 99% of the chroot, but was getting stuck on a single directory. It couldn't remove the chroot's /tmp/ dir since it has the sticky bit set:

$ ls -al /b/cros_build18-a1/swarming/c/lO/chroot/tmp/
total 12
drwxrwxrwt 3 root root 4096 Dec 14 03:01 .
drwxrwxrwx 3 root root 4096 Jan  4 03:47 ..
drwxrwxrwx 2 root  406 4096 Dec 13 10:04 screen

Adding "-t" to the chmod call fixes that. I *think* that's a reasonable fix, so I'll upload a CL. (But there are alternative solutions available.)
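
To illustrate the idea, here is a minimal Python sketch that clears the sticky bit on every directory in a tree. It is not the CL itself, which adjusts the existing chmod call in file_path.py; clear_sticky_bits is a hypothetical helper, and on a real bot the directories are owned by root, so the equivalent chmod would have to run via sudo.

```python
import os
import stat


def clear_sticky_bits(root):
    """Clear the sticky bit on root and every directory below it.

    Returns the list of directories whose mode was changed.
    """
    changed = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.lstat(dirpath).st_mode
        if mode & stat.S_ISVTX:
            # Keep all other permission bits, drop only the sticky bit.
            os.chmod(dirpath, stat.S_IMODE(mode) & ~stat.S_ISVTX)
            changed.append(dirpath)
    return changed
```

Called on a tree containing a drwxrwxrwt directory like the chroot's tmp/ above, this would leave that directory as drwxrwxrwx, after which its contents can be removed by any user with write access.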
Project Member

Comment 7 by bugdroid1@chromium.org, Jan 8

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/bfe82ee618c49228ce6c6c3035973da47d783d0a

commit bfe82ee618c49228ce6c6c3035973da47d783d0a
Author: Ben Pastene <bpastene@chromium.org>
Date: Tue Jan 08 15:18:52 2019

swarming: Remove sticky bits from caches when calling make_tree_deleteable().

If a file/dir is owned by root and has the sticky bit set, only root can
remove it. If such a file is present in a bot's cache, the bot is unable
to delete that cache if the bot runs as non-root.

This change will remove all sticky bits from caches when the bot tries
deleting things, which should allow it to do so.

R=maruel

Bug:  919599 
Change-Id: Iba741fc7a2935d092dbed7b1de6d8bb1b578a04b
Reviewed-on: https://chromium-review.googlesource.com/c/1399144
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/bfe82ee618c49228ce6c6c3035973da47d783d0a/client/utils/file_path.py

Project Member

Comment 8 by bugdroid1@chromium.org, Jan 11

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/a564a4de68d7055ece59f3f63c1f10359c5e90f1

commit a564a4de68d7055ece59f3f63c1f10359c5e90f1
Author: Ben Pastene <bpastene@chromium.org>
Date: Fri Jan 11 20:44:05 2019

Project Member

Comment 9 by bugdroid1@chromium.org, Jan 14

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/398a9ebec9b5edcbdbec17904afeb934b8938071

commit 398a9ebec9b5edcbdbec17904afeb934b8938071
Author: Ben Pastene <bpastene@chromium.org>
Date: Mon Jan 14 20:01:19 2019

Project Member

Comment 10 by bugdroid1@chromium.org, Jan 17 (5 days ago)

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/7fab0eeb7ecb32316bb58f5555571438822a3e4d

commit 7fab0eeb7ecb32316bb58f5555571438822a3e4d
Author: Ben Pastene <bpastene@chromium.org>
Date: Thu Jan 17 23:31:41 2019

Comment 11 by bpastene@chromium.org, Today (12 hours ago)

Status: Fixed (was: Assigned)
With the cache size limits, and the deployment of an extra host (bug 922285), this should be fixed.
