CrOS DUT bots exceeding disk space
Issue description

http://shortn/_Ue1pR3FBgM

Looks like isolated_cache on the bots is growing indefinitely (and not getting cleaned up?)

--- /b/cros_build10-a1/swarming --------------
   83.9GiB [##########] /isolated_cache
  416.3MiB [          ] /c
  103.7MiB [          ] /cipd_cache
   73.0MiB [          ] /logs
  132.0KiB [          ]  e2bfe61c8f0dc89e72a854f4afb14f4b662ea6301fc5652ebe03f80fa2b06263-cacert.pem
    4.0KiB [          ]  README
    4.0KiB [          ]  swarming.lck
    0.0  B [          ]  swarming_bot.zip

On Android bots, isolated_cache is capped at 50 GB, so it isn't a problem there. I think I need to figure out how to add the same cache policy on the CrOS DUT bots.
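For anyone wanting to reproduce a per-directory breakdown like the one above without an interactive tool, here is a minimal sketch. The helper name dir_sizes is hypothetical and not part of the swarming bot code. It sums apparent file sizes rather than allocated disk blocks, so the numbers can differ somewhat from what a disk-usage tool reports.

```python
import os


def dir_sizes(root):
    """Return {entry_name: total_bytes} for each top-level entry under root.

    Hypothetical helper for illustration only. Sizes are apparent file
    sizes (st_size), not allocated blocks, and symlinks are not followed.
    """
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        # Unreadable or vanished file: skip it.
                        pass
            sizes[entry.name] = total
        else:
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes
```

Calling dir_sizes('/b/cros_build10-a1/swarming') and sorting the result by value would put the largest entry first.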
,
Jan 7
Ah, it appears that cache sizes on most bots are capped at 50 GB via the default bot_config script's get_settings(): https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/scripts/bot_config.py#394

The CrOS bots use their own bot_config script (https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/scripts/cros_ssh.py), so I probably just have to add a get_settings() to it (which will probably just call out to bot_config's).
,
Jan 7
Actually... since the CrOS bot_config script (cros_ssh.py) doesn't define a get_settings(), it should fall back to the server's default implementation, which does cap the cache size at 50 GB: https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/swarming_bot/config/bot_config.py#120

That cap clearly isn't being enforced, since the cache is at 80+ GB. Either I'm confused, or the bot is...
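To make the 50 GB cap concrete, here is a rough sketch of what a get_settings() hook returning such a cap could look like. This is not copied from the linked bot_config.py. The dictionary keys ('caches', 'isolated', 'size', 'items') and the item count are assumptions about the settings format and should be checked against the real default implementation before use.

```python
GIB = 1024 ** 3


def get_settings(bot):
    """Sketch of a bot_config hook that caps the isolated cache.

    Assumption: the bot reads the cap from settings['caches']['isolated'].
    The key names and the 'items' value here are illustrative only.
    """
    return {
        'caches': {
            'isolated': {
                # Upper bound on the total bytes kept in isolated_cache.
                'size': 50 * GIB,
                # Upper bound on the number of cached items (assumed value).
                'items': 50 * 1024,
            },
        },
    }
```

A script-specific get_settings() could simply return the default settings unchanged, which matches the "call out to bot_config's" idea from the earlier comment.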
,
Jan 7
Maybe the bot fails to delete items? Probably worth looking at the logs; I can't right now.
,
Jan 7
Ahh, this is due to the bot's inability to clobber its CrOS chroot (stored locally in a named cache). There are a number of these errors in the logs:

"Swallowing make_tree_deleteable() error: [Errno 1] Operation not permitted: '/b/swarming/c/Cf/chroot'"

The chroot is owned by root, but the bot runs as non-root. And failing to rm it prevents any subsequent isolated_cache cleanup.

The bot has password-less sudo, and looking at the implementation of make_tree_deleteable, it should detect if that's available and use it: https://codesearch.chromium.org/chromium/infra/luci/client/utils/file_path.py?rcl=a948849703f383b2c1c891e51f61cb1a9f758fd2&l=1126

Might be a bug somewhere in there. I'll dig in.
,
Jan 7
Ahhhhhh ok. Turns out the bot was successfully removing 99% of the chroot, but was getting stuck on a single directory. It couldn't remove the chroot's /tmp/ dir since it has the sticky bit set (the trailing "t" in its mode):

$ ls -al /b/cros_build18-a1/swarming/c/lO/chroot/tmp/
total 12
drwxrwxrwt 3 root root 4096 Dec 14 03:01 .
drwxrwxrwx 3 root root 4096 Jan  4 03:47 ..
drwxrwxrwx 2 root 406  4096 Dec 13 10:04 screen

Adding "-t" to the chmod call clears the sticky bit, which fixes that. I *think* that's a reasonable fix to this, so I'll upload a CL. (But there are alternative solutions available.)
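As an illustration of what clearing the sticky bit involves, here is a minimal Python sketch that walks a tree and drops S_ISVTX from every directory that has it. This is not the code from the CL: the function name clear_sticky_bits is hypothetical. It also only works when the calling process owns the directories or runs as root, so for a root-owned chroot like the one above the bot would still need to go through sudo.

```python
import os
import stat


def clear_sticky_bits(root):
    """Clear the sticky bit on root and on every directory below it.

    Hypothetical helper for illustration. Returns the list of paths whose
    mode was changed. Requires ownership of the directories (or root).
    """
    changed = []
    # os.walk yields root itself first and does not descend into
    # symlinked directories by default.
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.lstat(dirpath).st_mode
        if mode & stat.S_ISVTX:
            # Keep all permission bits except the sticky bit.
            os.chmod(dirpath, stat.S_IMODE(mode) & ~stat.S_ISVTX)
            changed.append(dirpath)
    return changed
```

Applied to a directory with mode drwxrwxrwt, this leaves it as drwxrwxrwx, after which a normal recursive delete is no longer blocked by the sticky bit.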
,
Jan 8
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/bfe82ee618c49228ce6c6c3035973da47d783d0a

commit bfe82ee618c49228ce6c6c3035973da47d783d0a
Author: Ben Pastene <bpastene@chromium.org>
Date: Tue Jan 08 15:18:52 2019

swarming: Remove sticky bits from caches when calling make_tree_deleteable().

If a file/dir is owned by root and has the sticky bit set, only root can remove it. If such a file is present in a bot's cache, the bot is unable to delete that cache if the bot runs as non-root.

This change will remove all sticky bits from caches when the bot tries deleting things, which should allow it to do so.

R=maruel
Bug: 919599
Change-Id: Iba741fc7a2935d092dbed7b1de6d8bb1b578a04b
Reviewed-on: https://chromium-review.googlesource.com/c/1399144
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/bfe82ee618c49228ce6c6c3035973da47d783d0a/client/utils/file_path.py
,
Jan 11
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/a564a4de68d7055ece59f3f63c1f10359c5e90f1

commit a564a4de68d7055ece59f3f63c1f10359c5e90f1
Author: Ben Pastene <bpastene@chromium.org>
Date: Fri Jan 11 20:44:05 2019
,
Jan 14
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/398a9ebec9b5edcbdbec17904afeb934b8938071

commit 398a9ebec9b5edcbdbec17904afeb934b8938071
Author: Ben Pastene <bpastene@chromium.org>
Date: Mon Jan 14 20:01:19 2019
,
Jan 17
(5 days ago)
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/7fab0eeb7ecb32316bb58f5555571438822a3e4d

commit 7fab0eeb7ecb32316bb58f5555571438822a3e4d
Author: Ben Pastene <bpastene@chromium.org>
Date: Thu Jan 17 23:31:41 2019
,
Today
(12 hours ago)
With the cache size limits and the deployment of an extra host (bug 922285), this should be fixed.
Comment 1 by bpastene@chromium.org
, Jan 7