lxc-start container leak causing slowdown and process leak on shard(s)
Issue description
https://viceroy.corp.google.com/chromeos/ssp_health?duration=8d
Particularly problematic is chromeos-server102 (the shard for board:terra, board:veyron_mickey, and board:veyron_tiger), which is hovering at 3-4 minutes of SSP overhead per test. I think that caused the failures in Issue 780738: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_tiger-release/builds/1660
Nov 10 2017
dshi or kenobi, any idea what's so slow on server102?
Nov 10 2017
Very likely a resource issue. The server is running low on virtual memory (vmcom 171.9G | vmlim 63.8G). Can we consider a reboot, or try to clean up whatever is taking up the VM?
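For context, the vmcom/vmlim figures above look like atop's committed-memory counters; the same numbers can be read straight from /proc/meminfo. A minimal sketch (not part of the original report):

# Minimal sketch: read the kernel's commit counters that correspond to the
# vmcom/vmlim figures quoted above.
def read_meminfo():
    """Return /proc/meminfo as a dict of field name -> numeric value (kB for most fields)."""
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])
    return info

if __name__ == '__main__':
    mem = read_meminfo()
    committed_gb = mem['Committed_AS'] / (1024.0 ** 2)
    limit_gb = mem['CommitLimit'] / (1024.0 ** 2)
    print('committed: %.1fG, commit limit: %.1fG' % (committed_gb, limit_gb))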
Nov 11 2017
There seems to be a process leak too: https://viceroy.corp.google.com/chromeos/machines?duration=60d&hostname=chromeos-server102&refresh=-1#_VG_oYXH0N8z
Nov 11 2017
chromeos-test@chromeos-server102:~$ ps -A --no-headers -o comm | sort | uniq -c | sort -n
[snip]
72 gs_offloader.py
118 apache2
444 autoserv
495 lxc-start
496 dbus-daemon
496 dhclient
496 init
496 rsyslogd
496 systemd-logind
496 upstart-file-br
496 upstart-socket-
497 cron
703 ssh
2482 getty
Nov 11 2017
Looks like a lot of these cron, getty, and other processes are children of lxc-start, which is piling up:
pstree
.
.
.
|-62*[lxc-start---init-+-cron]
| |-dbus-daemon]
| |-dhclient]
| |-5*[getty]]
| |-rsyslogd---3*[{rsyslogd}]]
| |-systemd-logind]
| |-systemd-udevd]
| |-upstart-file-br]
| |-upstart-socket-]
| `-upstart-udev-br]
|-404*[lxc-start---init-+-cron]
| |-dbus-daemon]
| |-dhclient]
| |-5*[getty]]
| |-rsyslogd---3*[{rsyslogd}]]
| |-systemd-logind]
| |-upstart-file-br]
| `-upstart-socket-]
|-7*[lxc-start---init-+-adb---2*[{adb}]]
| |-cron]
| |-dbus-daemon]
| |-dhclient]
| |-5*[getty]]
| |-rsyslogd---3*[{rsyslogd}]]
| |-systemd-logind]
| |-upstart-file-br]
| `-upstart-socket-]
|-21*[lxc-start---init-+-adb---4*[{adb}]]
| |-cron]
| |-dbus-daemon]
| |-dhclient]
| |-5*[getty]]
| |-rsyslogd---3*[{rsyslogd}]]
| |-systemd-logind]
| |-upstart-file-br]
| `-upstart-socket-]
|-5*[lxc-start---init-+-cron]
| |-dbus-daemon]
| |-dhclient]
| |-5*[getty]]
| |-java---49*[{java}]]
| |-rsyslogd---3*[{rsyslogd}]]
| |-systemd-logind]
| |-upstart-file-br]
| `-upstart-socket-]
|-2*[lxc-start---init-+-cron]
| |-dbus-daemon]
| |-dhclient]
| |-5*[getty]]
| |-ondemand---sleep]
| |-rsyslogd---3*[{rsyslogd}]]
| |-systemd-logind]
| |-upstart-file-br]
| `-upstart-socket-]
|-lxc-start---init-+-cron
| |-dbus-daemon
| |-dhclient
| |-5*[getty]
| |-java---49*[{java}]
| |-rsyslogd---3*[{rsyslogd}]
| |-systemd-logind
| |-systemd-udevd
| |-upstart-file-br
| |-upstart-socket-
| `-upstart-udev-br
Nov 11 2017
chromeos-test@chromeos-server102:/usr/local/autotest/containers$ ls | wc
   1170    1170   36895
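A rough way to line up those leftover container directories against containers that still have a running lxc-start (a sketch with an assumed path and heuristic, not the shard's actual tooling):

# Rough sketch (assumed path and heuristic): compare container directories
# on disk with the number of running lxc-start processes.
import os

CONTAINER_DIR = '/usr/local/autotest/containers'  # path from the ls above

def running_comms():
    """Yield the command name of every running process, read from /proc."""
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/comm' % pid) as f:
                yield f.read().strip()
        except (IOError, OSError):
            pass  # the process exited between listdir() and open()

if __name__ == '__main__':
    dirs = [d for d in os.listdir(CONTAINER_DIR)
            if os.path.isdir(os.path.join(CONTAINER_DIR, d))]
    lxc_starts = sum(1 for comm in running_comms() if comm == 'lxc-start')
    print('%d container dirs, %d lxc-start processes' % (len(dirs), lxc_starts))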
Nov 11 2017
I don't know what is causing this leak, but in the interest of unblocking the boards on this shard, I am inclined to just reboot the shard. We can continue to monitor the situation as lxc-start processes pile up.
Nov 11 2017
Also, while 102 is the worst offender, it is by no means the only shard with processes piling up: http://shortn/_zsjtV04oAa
Nov 11 2017
This causes slowdowns and failures on multiple boards.
Nov 11 2017
102 has been rebooted; other servers are still affected. This still needs to be root-caused and fixed long term.
Nov 13 2017
kenobi@: It seems there is some failure to destroy a container when a test is aborted: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/150305647-chromeos-test/chromeos2-row4-rack6-host10/ssp_logs/debug/

10/20 01:02:45.240 ERROR| autoserv:0366| Failed to destroy container test_150305647_1508482800_712.
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 358, in handle_sigterm
    container = bucket.get_container(container_id)
  File "/usr/local/autotest/site_utils/lxc/container_bucket.py", line 75, in get_container
    return self.get_all().get(container_id, None)
  File "/usr/local/autotest/site_utils/lxc/container_bucket.py", line 60, in get_all
    **info)
  File "/usr/local/autotest/site_utils/lxc/container.py", line 155, in create_from_existing_dir
    return cls(lxc_path, name, kwargs)
  File "/usr/local/autotest/site_utils/lxc/container.py", line 140, in __init__
    os.path.join(self.container_path, self.name))
  File "/usr/local/autotest/site_utils/lxc/container.py", line 76, in load
    return pickle.loads(pickle_data)
  File "/usr/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError

Before that error, there are repeated log lines about reading:
/usr/local/autotest/containers/test_149864897_1508337622_13874/container_id.p
Can you take a look at whether there is a hole in the container cleanup code?
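For what it's worth, that EOFError is exactly what unpickling an empty container_id.p produces (and empty ID files were indeed found later in this thread). A minimal repro:

# Minimal repro of the failure mode in the traceback above: unpickling an
# empty/truncated container_id.p raises EOFError.
import pickle

try:
    pickle.loads(b'')  # an empty ID file yields no pickle data at all
except EOFError as e:
    print('unpickling empty data raises: %r' % e)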
Nov 13 2017
The container ID unpickling code is not very fault-tolerant, so a bad ID raises exceptions. Coupled with the fact that the ContainerBucket frequently queries all containers (see ContainerBucket.get_all), this causes any container with a corrupted ID to block any code that calls get_all. Bad news all around. I will unblock this by catching and ignoring (for now) errors arising from unpickling of container IDs.
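A minimal sketch of that mitigation (simplified, with an assumed helper name; the actual change is the CL linked further down): treat an unreadable container_id.p as having no ID instead of letting the exception escape from get_all.

# Simplified sketch of the mitigation described above (assumed helper name,
# not the actual CL): a corrupted or empty container_id.p is treated as
# "no ID" rather than propagating an exception out of ContainerBucket.get_all().
import logging
import os
import pickle

def load_container_id(container_path, name, filename='container_id.p'):
    """Best-effort load of a pickled container ID; returns None if unreadable."""
    id_path = os.path.join(container_path, name, filename)
    try:
        with open(id_path, 'rb') as f:
            return pickle.load(f)
    except (IOError, OSError, EOFError, pickle.UnpicklingError, ValueError):
        # Log and skip this container instead of blocking every get_all() caller.
        logging.exception('Failed to load container id from %s', id_path)
        return None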
Nov 13 2017
Containers in a weird state: on chromeos-server118.mtv the following containers have ID files that are empty:
- test_149864897_1508337622_13874: the SSP logs for this test [1] reference a container with a different name (but a matching job ID).
- test_149898741_1508337602_7026: the SSP logs for this test [2] reference the correct container, but the log looks corrupted and cuts off halfway. Not sure what happened here.
These two invalid containers cause every ContainerBucket.get_all call on this server to raise an exception, which in the case of #13 is blocking the sigterm handler of another test from cleaning up.
1: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/149864897-chromeos-test/chromeos2-row4-rack6-host11/ssp_logs/debug/
2: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/149898741-chromeos-test/chromeos2-row10-rack7-host17/ssp_logs/debug/
Nov 14 2017
Hm, Gerrit doesn't seem to have updated the bug, but the fix has landed as https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/767588
Nov 14 2017
Let's leave this open until we've verified the fix. This will need a push to prod. -> nxia (secondary deputy): can we get a push (with this in it) today?
Nov 14 2017
The fix will unblock calls to container_bucket.get_all, but it will not by itself clean up any remaining orphaned (or otherwise lingering) containers. Is there any janitorial process running that will do this, or does it get triggered manually? Without an actual cleanup there will be no improvement, only a halt to further degradation.
Nov 14 2017
Push is done.
Nov 16 2017
dshi@, akeshet@ and I discussed this further today. dshi@ experimented with running the lxc cleanup script on a couple of servers (118, 103), with positive results. There are a couple of bugs in the script (attaching bug). Once a fix for those has landed, dshi@ will run a script to kick off lxc cleanup on all drones and shards.
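For reference, a rough sketch of what such a cleanup amounts to (assumed age threshold, naming heuristic, and standard lxc-stop/lxc-destroy invocations; the real Autotest cleanup script presumably does more careful checks, e.g. against job state, before destroying anything):

# Rough sketch only (assumed heuristics; not the actual Autotest cleanup
# script): stop and destroy per-test containers whose directories have been
# idle for more than a couple of days.
import os
import subprocess
import time

LXC_PATH = '/usr/local/autotest/containers'
MAX_AGE_DAYS = 2  # assumption: a test container should never live this long

def stale_containers(lxc_path, max_age_days):
    """Yield test_* container names whose directory has not changed recently."""
    cutoff = time.time() - max_age_days * 24 * 3600
    for name in os.listdir(lxc_path):
        if not name.startswith('test_'):  # only touch per-test containers
            continue
        path = os.path.join(lxc_path, name)
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            yield name

def destroy(lxc_path, name):
    """Stop the container if it is still running, then delete it."""
    subprocess.call(['lxc-stop', '-P', lxc_path, '-n', name, '-k'])
    subprocess.call(['lxc-destroy', '-P', lxc_path, '-n', name])

if __name__ == '__main__':
    for name in stale_containers(LXC_PATH, MAX_AGE_DAYS):
        print('cleaning up %s' % name)
        destroy(LXC_PATH, name)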
Nov 20 2017
I will run the cleanup after the next push.
Nov 21 2017
Cleanup is complete.
Nov 28 2017
LXC process counts have stayed under control for the past couple of weeks. I'm marking this verified.