load average and number of processes in devserver increases steadily
Issue description
The load average is very large (even for a 24-cpu machine) but not much seems to be going on:
Output of "top":
top - 16:26:55 up 84 days, 1:08, 2 users, load average: 132.55, 132.90, 133.28
Tasks: 773 total, 1 running, 771 sleeping, 0 stopped, 1 zombie
%Cpu(s): 5.1 us, 0.8 sy, 0.0 ni, 93.9 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 65961944 total, 65370276 used, 591668 free, 189748 buffers
KiB Swap: 67092476 total, 1250376 used, 65842100 free. 62120464 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14761 chromeo+ 20 0 24244 3552 2464 R 11.5 0.0 0:00.03 top
9815 cros-co+ 20 0 32.849g 28488 11452 S 5.7 0.0 2360:35 consul
1 root 20 0 33632 3832 2632 S 0.0 0.0 806:44.84 init
2 root 20 0 0 0 0 S 0.0 0.0 0:21.46 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 33:27.16 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root 20 0 0 0 0 S 0.0 0.0 213:23.53 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
10 root 20 0 0 0 0 S 0.0 0.0 128:57.30 rcuos/0
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0
12 root rt 0 0 0 0 S 0.0 0.0 1:30.45 migration/0
13 root rt 0 0 0 0 S 0.0 0.0 0:20.10 watchdog/0
14 root rt 0 0 0 0 S 0.0 0.0 0:19.26 watchdog/1
15 root rt 0 0 0 0 S 0.0 0.0 1:46.81 migration/1
16 root 20 0 0 0 0 S 0.0 0.0 0:35.32 ksoftirqd/1
18 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
19 root 20 0 0 0 0 S 0.0 0.0 36:12.45 rcuos/1
20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/1
22 root rt 0 0 0 0 S 0.0 0.0 0:18.17 watchdog/2
23 root rt 0 0 0 0 S 0.0 0.0 2:07.68 migration/2
24 root 20 0 0 0 0 S 0.0 0.0 1:45.27 ksoftirqd/2
26 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
27 root 20 0 0 0 0 S 0.0 0.0 46:31.00 rcuos/2
28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/2
29 root rt 0 0 0 0 S 0.0 0.0 0:17.93 watchdog/3
30 root rt 0 0 0 0 S 0.0 0.0 1:44.84 migration/3
31 root 20 0 0 0 0 S 0.0 0.0 0:30.26 ksoftirqd/3
33 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
34 root 20 0 0 0 0 S 0.0 0.0 35:55.63 rcuos/3
35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/3
36 root rt 0 0 0 0 S 0.0 0.0 0:16.86 watchdog/4
37 root rt 0 0 0 0 S 0.0 0.0 1:52.25 migration/4
38 root 20 0 0 0 0 S 0.0 0.0 0:48.17 ksoftirqd/4
40 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/4:0H
41 root 20 0 0 0 0 S 0.0 0.0 36:56.12 rcuos/4
Even if I let it run for a while, I never see any processes using much CPU time, and the number of "running" processes (presumably in the same unit as the load average above) stays small.
I hope this is a one-off, but, if not, it could seriously confuse some of the stats we collect.
Nov 11 2016
Here are the culprits (from ps auxwww | sort -k 9):

chromeo+ 12959 0.0 0.0 24436 4616 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 16124 0.0 0.0 24436 4532 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 18037 0.0 0.0 24436 4616 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 20724 0.0 0.0 24436 4584 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 28540 0.0 0.0 24436 4664 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 29130 0.0 0.0 24436 4612 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 32235 0.0 0.0 24436 4540 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images
chromeo+ 3334 0.0 0.0 24436 4524 ? D Nov06 0:00 /usr/bin/python /home/chromeos-test/chromiumos/src/third_party/autotest/files/site_utils/admin/clean_staged_images.py --max-age 16 --max-paladin-age 16 /home/chromeos-test/images

There are lots of them, starting on Nov 5. They don't seem to exit.
They are in state D (uninterruptible sleep), so they are probably blocked in the kernel on something. Linux counts D-state processes toward the load average even though they use no CPU, which is why the load looks huge while the machine is mostly idle.
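For reference, D-state processes can be enumerated directly from /proc. A minimal Linux-only sketch (the helper names here are ours, not part of any tool mentioned in this bug):

```python
import os

def proc_state(pid):
    """Return the one-letter state (R, S, D, Z, ...) from /proc/<pid>/stat.

    The comm field can contain spaces and parentheses, so parse from the
    last ')' rather than splitting the whole line naively.
    """
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    return data.rsplit(")", 1)[1].split()[0]

def d_state_pids():
    """List PIDs currently in uninterruptible sleep (state D)."""
    pids = []
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            if proc_state(int(name)) == "D":
                pids.append(int(name))
        except (IOError, OSError):
            pass  # the process exited while we were scanning
    return pids

if __name__ == "__main__":
    pids = d_state_pids()
    print("%d processes in D state: %s" % (len(pids), pids))
```

Each D-state process adds one to the load average for as long as it stays blocked, so ~130 stuck cleanup scripts match the observed load of ~132.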
Nov 11 2016
Found the problem:

$ sudo cat /proc/9774/stack
[sudo] password for chromeos-test:
[<ffffffff811fe291>] iterate_dir+0x61/0x120
[<ffffffff811fe71e>] SyS_getdents+0x7e/0xf0
[<ffffffff817bc3b2>] entry_SYSCALL_64_fastpath+0x16/0x75
[<ffffffffffffffff>] 0xffffffffffffffff

I think this is an instance of too many files in /tmp, which I filed a while ago but thought was fixed. That was issue 654953. There are 16,000 directories in /tmp, each with about 150 files and 1.5 MB total size. Xixuan, can you take a look? I know you have lots of other bugs on your plate; please let me know if we should look for someone else.
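The stack shows getdents() iterating over a huge directory, and cleanup code that first builds a complete listing of /tmp will itself crawl. A hedged sketch of an age-based sweep that streams entries with os.scandir() instead of materializing the whole listing (the function name and 5-day retention are our illustration, not the actual clean_staged_images.py logic):

```python
import os
import shutil
import time

def clean_old_tmp(root, max_age_days, dry_run=True):
    """Remove top-level entries of `root` not modified in max_age_days days.

    os.scandir() yields entries lazily and caches stat results, which
    matters when the directory holds tens of thousands of entries.
    Returns the names of entries that were (or would be) removed.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for entry in os.scandir(root):
        try:
            # Use the entry's own mtime; never follow symlinks out of /tmp.
            if entry.stat(follow_symlinks=False).st_mtime >= cutoff:
                continue
            if not dry_run:
                if entry.is_dir(follow_symlinks=False):
                    shutil.rmtree(entry.path, ignore_errors=True)
                else:
                    os.unlink(entry.path)
            removed.append(entry.name)
        except OSError:
            pass  # the entry disappeared under us; skip it
    return removed
```

Running with dry_run=True first gives a safe preview of what a real sweep would delete.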
Nov 11 2016
A cleaning script has been kicked off on all devservers to delete files created more than 5 days ago. I guess last time the script failed in the middle. To be continued: I need to add a cronjob on the devservers to clean up the garbage, since even after the fix landed, maybe 20 folders are still left behind every day, uncleaned for various other reasons.
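For the record, a daily cron entry for such a cleanup might look like the following. The script path, user, and retention period shown here are illustrative assumptions, not the actual deployment:

```
# /etc/cron.d/devserver-tmp-clean -- illustrative only; the script path,
# user, and 5-day retention are assumptions, not the real configuration.
# m  h  dom mon dow  user           command
 30  3  *   *   *    chromeos-test  /usr/bin/python /path/to/clean_devserver_tmp.py --max-age 5 /tmp
```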
Nov 11 2016
#4: yes, that seems like a good idea, thanks. Let me know if you would like to discuss details. I will also be happy to help review.
Nov 11 2016
The program tmpreaper is designed to delete files in /tmp that aren't used for a specified period of time. It's able to correctly deal with the obscure (and surprising) security concerns around doing that. Some systems run it by default from cron, but I don't think our servers do.
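For reference, tmpreaper is normally invoked as tmpreaper <age> <dir>; on Debian-derived systems the packaged /etc/cron.daily/tmpreaper job reads its age from /etc/tmpreaper.conf. An illustrative manual run (check the installed man page before relying on the exact flags):

```
# Dry run: report what would be removed from /tmp after 5 days of disuse.
sudo tmpreaper --test 5d /tmp

# Real run, protecting X11 lock files (the packaged cron job does similar).
sudo tmpreaper --protect '/tmp/.X*-lock' 5d /tmp
```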
Dec 2 2016
Can we just reboot devservers every weekend (cleaning /tmp in the process)? I don't see a reason why a devserver should be up for 84+ days.
Dec 2 2016
Personally, I'd like to reboot all lab servers every so often. They are failing to get kernel and other security updates because of the long uptimes.
Dec 2 2016
But we currently have no automated way to drain servers to ensure that in-progress tests aren't affected.
Dec 2 2016
How about not triggering suites on a Sunday? I keep beating a dead horse, but infra should be able to do anything it wants to the lab over the weekend.
Dec 5 2016
Re: draining. In theory I agree with ihf@, but the release team often wants releases from Sunday for testing in India before releases on Monday. Saturday would likely be the best day for draining. IMO we should have a P1 bug for an automatic drain/reboot of some of our infra services. @akeshet, do you want to swallow this bug or file another?
Dec 5 2016
xixuan@ has expressed an intention to work on this. Not convinced about P1, though; reducing to P2.
Dec 7 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/9f57da38aa4fc9affff7a7b7c545db575c22ab27

commit 9f57da38aa4fc9affff7a7b7c545db575c22ab27
Author: xixuan <xixuan@chromium.org>
Date: Tue Dec 06 17:25:14 2016
Dec 9 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/d35372415d771b07afc2261d22f58c10b65fa676

commit d35372415d771b07afc2261d22f58c10b65fa676
Author: xixuan <xixuan@chromium.org>
Date: Wed Nov 23 19:26:02 2016

chromite: make sure update_engine.log is saved properly during auto-update.

Sometimes the temp dir created for auto-update in /tmp/ of devservers becomes a file, which has exactly the same content as the update_engine.log. This CL fixes this issue. This CL also corrects all global variables with a unified format.

BUG=chromium:664360
TEST=Run auto-update locally.

Change-Id: I12d5a2e2bc538a348f8325f5787e95c6c122ed6b
Reviewed-on: https://chromium-review.googlesource.com/414429
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/d35372415d771b07afc2261d22f58c10b65fa676/lib/auto_updater.py
Dec 20 2016
This is a direct consequence of leaving a huge amount of garbage in /tmp, so I am marking it as a duplicate. Rebooting devservers periodically may be a good idea; I'll open a separate bug for that.
Comment 1 by semenzato@chromium.org, Nov 11 2016: two attachments (36.3 KB and 54.7 KB), not included here.