
Issue 703776

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Better package cleanup strategy for devserver

Project Member Reported by dgarr...@chromium.org, Mar 21 2017

Issue description

We are getting frequent nagios alerts that devservers are running low on disk space. These servers always seem to recover within about a day. The servers in question appear to be using as much as 680G of 736G.

Whatever this problem is, if it gets any worse, we'll start losing devservers.

An example notification:

Notification Type: PROBLEM

Service: Root Partition
Host: chromeos-server89.hot
Address: 100.108.1.152
State: CRITICAL

Date/Time: Thu Mar 16 04:32:00 PDT 2017

Additional Info:

DISK CRITICAL - free space: / 57 GB (12% inode=88%):
 
Cc: akes...@chromium.org dshi@chromium.org xixuan@chromium.org pho...@chromium.org

Comment 2 by xixuan@chromium.org, Mar 22 2017

I checked /var/log/message; clean_staged_images.py is run regularly.

The problem is that after the script runs, some old packages still remain in ~/images, because the script only removes directories that contain a 'staged.timestamp' file.

for dir_path, dir_names, file_names in os.walk(root):
    if os.path.basename(dir_path) in _EXEMPTED_DIRECTORIES:
        logging.debug('Skipping %s', dir_path)
        dir_names[:] = []
    elif _TIMESTAMP_FILENAME in file_names:
        dir_names[:] = []
        yield dir_path

I checked some old packages. They don't have a staged.timestamp file, so they are never deleted.

Is there a reason the script was designed to delete only directories containing 'staged.timestamp'? If not, we can change that behavior.
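To see how much is affected, the walk quoted above can be inverted to list the directories it never yields. This is a minimal sketch, not part of clean_staged_images.py: `find_untracked_dirs` is a hypothetical helper, and the contents of `_EXEMPTED_DIRECTORIES` are a placeholder because the real values are not shown in this thread.

```python
import os

# Name taken from the comment above; the real constant lives in the script.
_TIMESTAMP_FILENAME = 'staged.timestamp'
# Placeholder: the actual exempted directory names are not shown in the thread.
_EXEMPTED_DIRECTORIES = frozenset()


def find_untracked_dirs(root):
    """Yield directories that hold files but are not covered by a timestamp file.

    Pruning mirrors the quoted walk: exempted directories and directories
    that contain the timestamp file are not descended into.
    """
    for dir_path, dir_names, file_names in os.walk(root):
        if os.path.basename(dir_path) in _EXEMPTED_DIRECTORIES:
            dir_names[:] = []
        elif _TIMESTAMP_FILENAME in file_names:
            dir_names[:] = []
        elif file_names:
            # Files here can never be reached by the cleanup walk.
            yield dir_path
```

Running this against ~/images should list exactly the directories the cleanup can never reach, assuming the pruning rules match the real script.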
Owner: xixuan@chromium.org
We don't want to clean up stuff while it's still in the process of being staged, and some of the files we stage will have really old dates on them, since they can come from tar files.

Maybe the timestamp file is how we avoid those two problem cases?

Comment 4 by aut...@google.com, Mar 28 2017

Labels: -current-issue

Comment 5 by xixuan@chromium.org, May 10 2017

Summary: Better package cleanup strategy for devserver (was: PROBLEM Service Alert: <devserver name>/Root Partition is CRITICAL)

Comment 6 by pho...@chromium.org, May 30 2017

Issue 727935 has been merged into this issue.
Aside from the out-of-space problem, this is also a performance issue.

In one example:

/home/chromeos-test/images

430G
683,460 files/directories.
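Figures like these can be gathered with a short walk. The sketch below is illustrative only: `tree_usage` is a hypothetical helper, and it sums apparent file sizes, so its total will differ somewhat from what `du` reports in disk blocks.

```python
import os


def tree_usage(root):
    """Return (entry_count, total_bytes) for everything below root.

    entry_count covers files and directories; total_bytes sums the
    apparent size of files only and does not follow symlinks.
    """
    entries = 0
    total_bytes = 0
    for dir_path, dir_names, file_names in os.walk(root):
        entries += len(dir_names) + len(file_names)
        for name in file_names:
            try:
                total_bytes += os.lstat(os.path.join(dir_path, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return entries, total_bytes
```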
My first thought is to wipe this directory completely on a regular schedule. Perhaps put it in /tmp, and then drain/reboot the devservers periodically (weekly?).

This would lose some caching efficiency, but it would self-correct any kind of file leakage.

Comment 9 by dshi@chromium.org, May 31 2017

Re #8

This drain/wipe approach certainly works, but it requires one CL to remove the devserver from production and another to add it back. It will be difficult to automate.

Considering the ACL issue, we can only drain one devserver in each subnet at a time. Even that will introduce some load issues, as not all subnets have enough devserver capacity (not sure if we have a way to quantify that nowadays).

What about falling back to file stat if the timestamp file is not found?
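A minimal sketch of that fallback, assuming a directory's age is judged by the modification time of its timestamp file. `last_activity` and `is_expired` are hypothetical names, not functions from clean_staged_images.py. Note that this does not address the earlier concern: files unpacked from tar archives can carry very old mtimes, so a stat-based fallback could expire a directory that is still being staged.

```python
import os
import time

_TIMESTAMP_FILENAME = 'staged.timestamp'  # name taken from the thread


def last_activity(dir_path):
    """Return the timestamp file's mtime, or fall back to stat data.

    The fallback is the newest mtime among the directory itself and
    its direct entries.
    """
    stamp = os.path.join(dir_path, _TIMESTAMP_FILENAME)
    try:
        return os.stat(stamp).st_mtime
    except FileNotFoundError:
        pass
    newest = os.stat(dir_path).st_mtime
    with os.scandir(dir_path) as entries:
        for entry in entries:
            newest = max(newest, entry.stat(follow_symlinks=False).st_mtime)
    return newest


def is_expired(dir_path, max_age_hours, now=None):
    """Return True if the directory's last activity is older than max_age_hours."""
    now = time.time() if now is None else now
    return now - last_activity(dir_path) > max_age_hours * 3600
```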
Well, I also want a cleaner and more reliable way to drain.
https://viceroy.corp.google.com/chromeos/devservers?duration=8d shows more devserver disk space problems.

Another possibility would be to use access times instead of modification times.
Do we have access time updates enabled on those servers? Modern machines tend to disable it by default for performance.
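One way to answer this on a Linux host is to read the mount options for the filesystem that holds the images directory. The sketch below is an assumption-laden illustration: `atime_mode` is a hypothetical helper, it relies on the `/proc/mounts` format, and it does not handle mount points containing escaped spaces.

```python
import os


def atime_mode(path, mounts_file='/proc/mounts'):
    """Return 'noatime' or 'relatime' for the filesystem holding path.

    Returns None when neither option is listed, which on Linux usually
    means access times are updated on every read.
    """
    path = os.path.realpath(path)
    best_mount, best_opts = '', []
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue
            mount_point, opts = fields[1], fields[3].split(',')
            prefix = mount_point.rstrip('/') + '/'
            # Keep the longest mount point that contains path; later
            # entries win ties because they shadow earlier mounts.
            if (path + '/').startswith(prefix) and len(mount_point) >= len(best_mount):
                best_mount, best_opts = mount_point, opts
    for mode in ('noatime', 'relatime'):
        if mode in best_opts:
            return mode
    return None
```

With `relatime`, the kernel still refreshes a file's access time when the old value is more than a day old, so access times may be usable for a cleanup policy measured in days. With `noatime` they are not.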
That makes sense. Maybe as an alternative, the script should place its own staged.timestamp cookie whenever it can't find one. That way, eventually every directory would get deleted.
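A minimal sketch of the cookie idea, under the assumption that the cleanup ages directories by the timestamp file's modification time. `seed_timestamp` is a hypothetical helper, not existing devserver code.

```python
import os

_TIMESTAMP_FILENAME = 'staged.timestamp'  # name taken from the thread


def seed_timestamp(dir_path):
    """Create the timestamp file if it is missing.

    Returns True if a new file was created, False if one already existed.
    An existing file is left untouched so its age is preserved.
    """
    stamp = os.path.join(dir_path, _TIMESTAMP_FILENAME)
    try:
        # O_EXCL makes creation atomic, so a concurrent stager's file wins.
        fd = os.open(stamp, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

Because a newly seeded cookie is dated at the time of the cleanup run, a directory that is still being staged gets a full expiry period before it becomes eligible for deletion, and a leaked directory is removed one period after it is first noticed.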
Status: WontFix (was: Untriaged)
