Better package cleanup strategy for devserver
Issue description
We are getting frequent nagios alerts that devservers are running low on disk space. These servers always seem to recover within about a day. The servers in question appear to be using as much as 680G of 736G. Whatever this problem is, if it gets any worse, we'll start losing devservers. An example notification:

Notification Type: PROBLEM
Service: Root Partition
Host: chromeos-server89.hot
Address: 100.108.1.152
State: CRITICAL
Date/Time: Thu Mar 16 04:32:00 PDT 2017
Additional Info: DISK CRITICAL - free space: / 57 GB (12% inode=88%):
Mar 22 2017
Checked /var/log/messages; clean_staged_images.py is run regularly.
The problem is that even after the script runs, some old packages remain in ~/images, because the script only removes directories that contain a 'staged.timestamp' file:
for dir_path, dir_names, file_names in os.walk(root):
  if os.path.basename(dir_path) in _EXEMPTED_DIRECTORIES:
    # Exempted directories are skipped entirely.
    logging.debug('Skipping %s', dir_path)
    dir_names[:] = []
  elif _TIMESTAMP_FILENAME in file_names:
    # Directories containing staged.timestamp are pruned and offered up as
    # cleanup candidates; anything without the timestamp is never yielded.
    dir_names[:] = []
    yield dir_path
I checked some old packages; they don't have that staged.timestamp file, so they are never deleted.
Are there further considerations behind the design of only deleting directories that contain 'staged.timestamp'? If not, we can change that behavior.
Mar 22 2017
We don't want to clean up stuff while it's still in the process of being staged, and some of the files we stage will have really old dates on them, since they can come from tar files. Maybe the timestamp file is how we avoid those two problem cases?
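For context, a minimal illustration of the second case (hypothetical paths, not devserver code): tarfile restores each member's recorded mtime on extraction, so a file staged seconds ago can look months old to any stat-based cleanup, while a staged.timestamp touched after extraction reflects when staging actually finished.

import os
import tarfile

# Extracted files keep the mtimes recorded in the archive, which can be
# arbitrarily old even though the staging itself just happened.
with tarfile.open('/tmp/payload.tar') as tar:
  tar.extractall('/tmp/staged_pkg')
for name in os.listdir('/tmp/staged_pkg'):
  print(name, os.path.getmtime(os.path.join('/tmp/staged_pkg', name)))

# Touching staged.timestamp only after extraction completes gives cleanup a
# single mtime that says "staging finished now".
open('/tmp/staged_pkg/staged.timestamp', 'a').close()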
May 30 2017
Issue 727935 has been merged into this issue.
May 30 2017
Setting aside the running-out-of-space problem, this is also a performance issue. In one example, /home/chromeos-test/images held 430G across 683,460 files/directories.
May 31 2017
My first thought is to wipe this directory fully from time to time. Perhaps put it in /tmp, and then drain/reboot the devservers from time to time (weekly?). This would lose some caching efficiency, but it would self-correct any kind of file leakage.
May 31 2017
Re #8: The drain/wipe approach certainly works, but it requires one CL to remove the devserver from production and another to add it back, which will be difficult to automate. Because of the ACL issue, we can only drain one devserver per subnet at a time, and even that will introduce load issues, since not all subnets have enough devserver capacity (not sure we have a way to quantify that nowadays). What about falling back to file stat if the timestamp file is not found?
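A rough sketch of that fallback, assuming the walk shown above; the helper and the stale_candidates name are hypothetical, not part of the real clean_staged_images.py:

import os

_TIMESTAMP_FILENAME = 'staged.timestamp'
_EXEMPTED_DIRECTORIES = ()  # whatever the real script exempts

def _newest_mtime(dir_path):
  """Newest mtime of anything under dir_path, as a fallback staleness signal."""
  newest = os.path.getmtime(dir_path)
  for root, _, files in os.walk(dir_path):
    for name in files:
      try:
        newest = max(newest, os.path.getmtime(os.path.join(root, name)))
      except OSError:
        pass  # file vanished while we were walking
  return newest

def stale_candidates(root, cutoff):
  """Yield directories to consider for cleanup; cutoff is a Unix timestamp."""
  for dir_path, dir_names, file_names in os.walk(root):
    if os.path.basename(dir_path) in _EXEMPTED_DIRECTORIES:
      dir_names[:] = []
    elif _TIMESTAMP_FILENAME in file_names:
      # Existing behavior: timestamped directories are cleanup candidates.
      dir_names[:] = []
      yield dir_path
    elif _newest_mtime(dir_path) < cutoff:
      # Fallback: no staged.timestamp, so fall back to file stat and only
      # offer the directory up if nothing under it was modified recently.
      dir_names[:] = []
      yield dir_path

A caller would compute cutoff from whatever age threshold the current cleanup already uses (e.g. now minus N days).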
May 31 2017
Well, I also want a cleaner and more reliable way to drain.
Jul 26 2017
https://viceroy.corp.google.com/chromeos/devservers?duration=8d shows more devserver disk space problems. Another possibility would be to use access times instead of modification times.
Jul 26 2017
Do we have access time updates enabled on those servers? Modern systems tend to disable them by default for performance.
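One way to check, as a sketch outside the devserver: read the mount options for the filesystem that holds the images directory. noatime means access times never update; relatime (the usual default) only updates them in limited cases (roughly at most once a day), which may still be good enough for a multi-day threshold. The path below is just an example.

def mount_options(path):
  """Return (mount_point, options) for the longest mount point prefixing path."""
  best = ('', '')
  with open('/proc/mounts') as f:
    for line in f:
      fields = line.split()
      mount_point, options = fields[1], fields[3]
      if path.startswith(mount_point) and len(mount_point) > len(best[0]):
        best = (mount_point, options)
  return best

print(mount_options('/home/chromeos-test/images'))
# e.g. ('/home', 'rw,relatime,...'); 'noatime' here would make atime useless.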
Jul 26 2017
That makes sense. Maybe as an alternative, the script should place its own staged.timestamp cookie whenever it can't find one. That way, eventually every directory would get deleted.
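A sketch of that idea, reusing the walk structure from the script (function name hypothetical): when a directory has neither an exemption nor a staged.timestamp but does hold files, touch a timestamp now so the normal age-based cleanup deletes it on a later run.

import os

_TIMESTAMP_FILENAME = 'staged.timestamp'
_EXEMPTED_DIRECTORIES = ()

def seed_missing_timestamps(root):
  for dir_path, dir_names, file_names in os.walk(root):
    if os.path.basename(dir_path) in _EXEMPTED_DIRECTORIES:
      dir_names[:] = []
    elif _TIMESTAMP_FILENAME in file_names:
      dir_names[:] = []  # already tracked by the existing cleanup
    elif file_names:
      # Untracked staged content: drop a cookie dated now, so the directory
      # starts aging today and eventually gets deleted like everything else.
      open(os.path.join(dir_path, _TIMESTAMP_FILENAME), 'a').close()
      dir_names[:] = []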
Comment 1 by akes...@chromium.org, Mar 21 2017