Issue 664360

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 654953
Owner:
Closed: Dec 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




load average and number of processes in devserver increases steadily

Project Member Reported by semenzato@chromium.org, Nov 11 2016

Issue description

The load average is very large (even for a 24-cpu machine) but not much seems to be going on:

Output of "top":

top - 16:26:55 up 84 days,  1:08,  2 users,  load average: 132.55, 132.90, 133.28
Tasks: 773 total,   1 running, 771 sleeping,   0 stopped,   1 zombie
%Cpu(s):  5.1 us,  0.8 sy,  0.0 ni, 93.9 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  65961944 total, 65370276 used,   591668 free,   189748 buffers
KiB Swap: 67092476 total,  1250376 used, 65842100 free. 62120464 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                             
14761 chromeo+  20   0   24244   3552   2464 R  11.5  0.0   0:00.03 top                                                                                                                                 
 9815 cros-co+  20   0 32.849g  28488  11452 S   5.7  0.0   2360:35 consul                                                                                                                              
    1 root      20   0   33632   3832   2632 S   0.0  0.0 806:44.84 init                                                                                                                                
    2 root      20   0       0      0      0 S   0.0  0.0   0:21.46 kthreadd                                                                                                                            
    3 root      20   0       0      0      0 S   0.0  0.0  33:27.16 ksoftirqd/0                                                                                                                         
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                        
    8 root      20   0       0      0      0 S   0.0  0.0 213:23.53 rcu_sched                                                                                                                           
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh                                                                                                                              
   10 root      20   0       0      0      0 S   0.0  0.0 128:57.30 rcuos/0                                                                                                                             
   11 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0                                                                                                                             
   12 root      rt   0       0      0      0 S   0.0  0.0   1:30.45 migration/0                                                                                                                         
   13 root      rt   0       0      0      0 S   0.0  0.0   0:20.10 watchdog/0                                                                                                                          
   14 root      rt   0       0      0      0 S   0.0  0.0   0:19.26 watchdog/1                                                                                                                          
   15 root      rt   0       0      0      0 S   0.0  0.0   1:46.81 migration/1                                                                                                                         
   16 root      20   0       0      0      0 S   0.0  0.0   0:35.32 ksoftirqd/1                                                                                                                         
   18 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H                                                                                                                        
   19 root      20   0       0      0      0 S   0.0  0.0  36:12.45 rcuos/1                                                                                                                             
   20 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1                                                                                                                             
   22 root      rt   0       0      0      0 S   0.0  0.0   0:18.17 watchdog/2                                                                                                                          
   23 root      rt   0       0      0      0 S   0.0  0.0   2:07.68 migration/2                                                                                                                         
   24 root      20   0       0      0      0 S   0.0  0.0   1:45.27 ksoftirqd/2                                                                                                                         
   26 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/2:0H                                                                                                                        
   27 root      20   0       0      0      0 S   0.0  0.0  46:31.00 rcuos/2                                                                                                                             
   28 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/2                                                                                                                             
   29 root      rt   0       0      0      0 S   0.0  0.0   0:17.93 watchdog/3                                                                                                                          
   30 root      rt   0       0      0      0 S   0.0  0.0   1:44.84 migration/3                                                                                                                         
   31 root      20   0       0      0      0 S   0.0  0.0   0:30.26 ksoftirqd/3                                                                                                                         
   33 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/3:0H                                                                                                                        
   34 root      20   0       0      0      0 S   0.0  0.0  35:55.63 rcuos/3                                                                                                                             
   35 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/3                                                                                                                             
   36 root      rt   0       0      0      0 S   0.0  0.0   0:16.86 watchdog/4                                                                                                                          
   37 root      rt   0       0      0      0 S   0.0  0.0   1:52.25 migration/4                                                                                                                         
   38 root      20   0       0      0      0 S   0.0  0.0   0:48.17 ksoftirqd/4                                                                                                                         
   40 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/4:0H                                                                                                                        
   41 root      20   0       0      0      0 S   0.0  0.0  36:56.12 rcuos/4                 

Even if I let it run for a while, I never see any processes using much CPU time, and the number of "running" processes (presumably in the same unit as the load average above) stays small.

I hope this is a one-off, but, if not, it could seriously confuse some of the stats we collect.


 
Labels: -Pri-3 Pri-2
The load average has been growing steadily since Nov 5.  There is a corresponding increase in the process count.  I am guessing there are some weird processes that somehow are contributing to load average in a disproportionate manner.  It could also be a kernel bug, or both.  I'll look a little more.
Attachments: loadav-long.png (36.3 KB), processcount-long.png (54.7 KB)
Cc: davidri...@chromium.org
Owner: ----
Here are the culprits (from ps auxwww | sort -k 9):

chromeo+ 12959  0.0  0.0  24436  4616 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 16124  0.0  0.0  24436  4532 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 18037  0.0  0.0  24436  4616 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 20724  0.0  0.0  24436  4584 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 28540  0.0  0.0  24436  4664 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 29130  0.0  0.0  24436  4612 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 32235  0.0  0.0  24436  4540 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+  3334  0.0  0.0  24436  4524 ?        D    Nov06   0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images

There are lots of them, starting on Nov 5.  They don't seem to exit.  They are probably blocked on something that makes them look like they're waiting for the CPU.
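A note on why these processes matter: on Linux the load average counts tasks in uninterruptible sleep (state D in the ps output above) as well as runnable ones (state R), so a pile of stuck D-state processes raises the load average without using any CPU. The following is a minimal sketch, not something run on the devserver, that tallies processes by state from /proc; the helper names parse_state and count_states are made up for illustration. It counts processes only, not individual threads.

```python
import collections
import os


def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')' rather than on spaces.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count processes by state letter (R, S, D, Z, ...)."""
    counts = collections.Counter()
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a PID directory
        try:
            path = os.path.join(proc_root, name, "stat")
            with open(path, errors="replace") as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            # The process exited during the scan, or the file was unreadable.
            continue
    return counts
```

Calling count_states() and comparing counts["R"] + counts["D"] against os.getloadavg() should show whether the D-state processes account for most of the load.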

Owner: xixuan@chromium.org
Found problem:

$ sudo cat /proc/9774/stack
[sudo] password for chromeos-test: 
[<ffffffff811fe291>] iterate_dir+0x61/0x120
[<ffffffff811fe71e>] SyS_getdents+0x7e/0xf0
[<ffffffff817bc3b2>] entry_SYSCALL_64_fastpath+0x16/0x75
[<ffffffffffffffff>] 0xffffffffffffffff

I think this is an instance of too many files in /tmp, which I filed a while ago but thought was fixed.  That was issue 654953.

There are 16,000 directories in /tmp, each has about 150 files and 1.5 MB total size.
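For scale, those figures work out to roughly 16,000 x 150 = 2.4 million files and 16,000 x 1.5 MB = 24 GB in /tmp. Below is a rough sketch of how such a summary could be gathered; summarize_subdirs is a hypothetical helper, not the command actually used here. It counts only regular files directly inside each top-level subdirectory and does not follow symlinks. On a directory this large the scan itself may take a long time, since it goes through the same getdents path shown in the stack above.

```python
import os


def summarize_subdirs(root):
    """Return (dir_count, file_count, total_bytes) for root's immediate subdirectories.

    Only regular files directly inside each subdirectory are counted;
    symlinks are never followed.
    """
    dir_count = file_count = total_bytes = 0
    with os.scandir(root) as top:
        for entry in top:
            if not entry.is_dir(follow_symlinks=False):
                continue
            dir_count += 1
            try:
                with os.scandir(entry.path) as sub:
                    for f in sub:
                        if f.is_file(follow_symlinks=False):
                            file_count += 1
                            total_bytes += f.stat(follow_symlinks=False).st_size
            except OSError:
                # The directory vanished or is unreadable; skip its contents.
                continue
    return dir_count, file_count, total_bytes
```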

Xixuan, can you take a look?  I know you have lots of other bugs on your plate, please let me know if we should look for someone else.


Comment 4 by xixuan@chromium.org, Nov 11 2016

A cleaning script has been kicked off on all devservers to delete all files created more than 5 days ago.

I guess the script failed partway through last time. Going forward, I think I need to add a cron job on the devservers to clean up the garbage: even after the fix went in, maybe 20 folders are still left behind every day, and they are not cleaned up for various reasons.
Re #4: yes, that seems like a good idea, thanks.  Let me know if you would like to discuss details.  Also will be happy to help review.
The program tmpreaper is designed to delete files in /tmp that aren't used for a specified period of time. It's able to correctly deal with the obscure (and surprising) security concerns around doing that.

Some systems run it by default from cron, but I don't think our servers do.
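To make the age-based cleanup idea concrete, here is a minimal sketch of what a cron-driven cleaner might do. It is not tmpreaper and not the script mentioned in comment 4; find_stale_dirs, remove_stale_dirs and their parameters are hypothetical names. It only skips symlinks at the top level and does not address the other security concerns tmpreaper handles. Note also that a directory's mtime changes only when entries are added or removed, so it is a rough proxy for "unused".

```python
import os
import shutil
import time


def find_stale_dirs(root, max_age_days, now=None):
    """Return immediate subdirectories of root not modified in max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    with os.scandir(root) as entries:
        for entry in entries:
            # Skip symlinks so a link can never lead us outside root.
            if entry.is_dir(follow_symlinks=False):
                if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                    stale.append(entry.path)
    return sorted(stale)


def remove_stale_dirs(root, max_age_days, dry_run=True):
    """Delete stale subdirectories; with dry_run=True only report them."""
    stale = find_stale_dirs(root, max_age_days)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path, ignore_errors=True)
    return stale
```

Running with dry_run=True first and reviewing the returned list is the safer default before letting a cron job delete anything.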

Comment 7 by ihf@chromium.org, Dec 2 2016

Cc: ihf@chromium.org
Can we just reboot dev servers every weekend (cleaning tmp)? I don't see a reason why a dev server should be up for 84+ days.
Personally, I'd like to reboot all lab servers every so often. They are failing to get kernel and other security updates because of the uptime.
But we currently have no automated way to drain servers to ensure that in-progress tests aren't affected.

Comment 10 by ihf@chromium.org, Dec 2 2016

How about not triggering suites on Sundays? I keep beating a dead horse, but infra should be able to do anything it wants to the lab over the weekend.

Comment 11 by sosa@chromium.org, Dec 5 2016

Labels: -Pri-2 Pri-1
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)
Re: draining. In theory I agree with Ihf, but the release team often wants releases from Sunday for testing in India before releases on Monday. Saturday would likely be the best day for draining. 

IMO we should have a P1 for having an automatic drain/reboot of some of our infra services. @akeshet, do you want to swallow this bug or file another?
Labels: -Pri-1 Pri-2
Owner: xixuan@chromium.org
xixuan@ has expressed intention to work on this. Not convinced about P1 though, reducing to P2.
Summary: periodically restart devservers (was: what's wrong with chromeos2-devserver7 load average?)

Comment 14 by bugdroid1@chromium.org, Dec 7 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/9f57da38aa4fc9affff7a7b7c545db575c22ab27

commit 9f57da38aa4fc9affff7a7b7c545db575c22ab27
Author: xixuan <xixuan@chromium.org>
Date: Tue Dec 06 17:25:14 2016


Comment 15 by bugdroid1@chromium.org, Dec 9 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/d35372415d771b07afc2261d22f58c10b65fa676

commit d35372415d771b07afc2261d22f58c10b65fa676
Author: xixuan <xixuan@chromium.org>
Date: Wed Nov 23 19:26:02 2016

chromite: make sure update_engine.log is saved properly during auto-update.

Sometimes the temp dir created for auto-update in /tmp/ of devservers becomes
a file, which has exactly the same content with the update_engine.log. This CL
fixes this issue.

This CL also correct all global variables with a unified format.

BUG= chromium:664360 
TEST=Run auto-update locally.

Change-Id: I12d5a2e2bc538a348f8325f5787e95c6c122ed6b
Reviewed-on: https://chromium-review.googlesource.com/414429
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/d35372415d771b07afc2261d22f58c10b65fa676/lib/auto_updater.py

Labels: -Pri-2 Pri-1
Mergedinto: 654953
Status: Duplicate (was: Assigned)
Summary: load average and number of processes in devserver increases steadily (was: periodically restart devservers)
This is a direct consequence of leaving a huge amount of garbage in /tmp, so I am duplicating it.

Rebooting devservers periodically may be a good idea; I'll open a separate bug.
Labels: Hotlist-CrOS-DevServerLoad
