
Issue 783865

Starred by 2 users

Issue metadata

Status: Verified
Owner: dshi@chromium.org
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 786129

Blocking:
issue 780738




lxc-start container leak causing slowdown and process leak on shard(s)

Project Member Reported by akes...@chromium.org, Nov 10 2017

Issue description

https://viceroy.corp.google.com/chromeos/ssp_health?duration=8d

Particularly problematic is chromeos-server102 (shard for board:terra, board:veyron_mickey, board:veyron_tiger) which is hovering at 3-4 minutes of ssp overhead per test.

I think that caused the failures in Issue 780738: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_tiger-release/builds/1660
 
Blocking: 780738
Cc: shuqianz@chromium.org
dshi or kenobi, any idea what's so slow on server102?

Comment 3 by dshi@chromium.org, Nov 10 2017

Very likely a resource issue. The server's committed virtual memory is far past its limit:
vmcom 171.9G  | vmlim  63.8G
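(For reference, a minimal sketch of pulling those numbers straight from /proc/meminfo; I believe atop's vmcom/vmlim correspond to the Committed_AS and CommitLimit fields, but treat that mapping as an assumption.)

# Sketch: read the commit numbers behind vmcom/vmlim from /proc/meminfo.
# Assumes vmcom ~ Committed_AS and vmlim ~ CommitLimit (both reported in kB).
def read_meminfo():
    fields = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            fields[key] = int(rest.split()[0])
    return fields

m = read_meminfo()
print('vmcom %.1fG | vmlim %.1fG' % (m['Committed_AS'] / 1048576.0,
                                     m['CommitLimit'] / 1048576.0))
if m['Committed_AS'] > m['CommitLimit']:
    print('committed VM exceeds the commit limit')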

Can we consider a reboot? Or try to clean up whatever is taking up the VM?
chromeos-test@chromeos-server102:~$ ps -A --no-headers -o comm | sort | uniq -c | sort -n
[snip]
     72 gs_offloader.py
    118 apache2
    444 autoserv
    495 lxc-start
    496 dbus-daemon
    496 dhclient
    496 init
    496 rsyslogd
    496 systemd-logind
    496 upstart-file-br
    496 upstart-socket-
    497 cron
    703 ssh
   2482 getty


Looks like a lot of these cron, getty, and other processes are children of lxc-start instances, which are piling up...

pstree
.
.
.
     |-62*[lxc-start---init-+-cron]
     |                      |-dbus-daemon]
     |                      |-dhclient]
     |                      |-5*[getty]]
     |                      |-rsyslogd---3*[{rsyslogd}]]
     |                      |-systemd-logind]
     |                      |-systemd-udevd]
     |                      |-upstart-file-br]
     |                      |-upstart-socket-]
     |                      `-upstart-udev-br]
     |-404*[lxc-start---init-+-cron]
     |                       |-dbus-daemon]
     |                       |-dhclient]
     |                       |-5*[getty]]
     |                       |-rsyslogd---3*[{rsyslogd}]]
     |                       |-systemd-logind]
     |                       |-upstart-file-br]
     |                       `-upstart-socket-]
     |-7*[lxc-start---init-+-adb---2*[{adb}]]
     |                     |-cron]
     |                     |-dbus-daemon]
     |                     |-dhclient]
     |                     |-5*[getty]]
     |                     |-rsyslogd---3*[{rsyslogd}]]
     |                     |-systemd-logind]
     |                     |-upstart-file-br]
     |                     `-upstart-socket-]
     |-21*[lxc-start---init-+-adb---4*[{adb}]]
     |                      |-cron]
     |                      |-dbus-daemon]
     |                      |-dhclient]
     |                      |-5*[getty]]
     |                      |-rsyslogd---3*[{rsyslogd}]]
     |                      |-systemd-logind]
     |                      |-upstart-file-br]
     |                      `-upstart-socket-]
     |-5*[lxc-start---init-+-cron]
     |                     |-dbus-daemon]
     |                     |-dhclient]
     |                     |-5*[getty]]
     |                     |-java---49*[{java}]]
     |                     |-rsyslogd---3*[{rsyslogd}]]
     |                     |-systemd-logind]
     |                     |-upstart-file-br]
     |                     `-upstart-socket-]
     |-2*[lxc-start---init-+-cron]
     |                     |-dbus-daemon]
     |                     |-dhclient]
     |                     |-5*[getty]]
     |                     |-ondemand---sleep]
     |                     |-rsyslogd---3*[{rsyslogd}]]
     |                     |-systemd-logind]
     |                     |-upstart-file-br]
     |                     `-upstart-socket-]
     |-lxc-start---init-+-cron
     |                  |-dbus-daemon
     |                  |-dhclient
     |                  |-5*[getty]
     |                  |-java---49*[{java}]
     |                  |-rsyslogd---3*[{rsyslogd}]
     |                  |-systemd-logind
     |                  |-systemd-udevd
     |                  |-upstart-file-br
     |                  |-upstart-socket-
     |                  `-upstart-udev-br
Labels: -Pri-2 Pri-1
Owner: dshi@chromium.org
Summary: lxc-start container leak causing slowdown and process leak on shard(s) (was: ssp setup times are creeping up, causing some suite timeouts)
chromeos-test@chromeos-server102:/usr/local/autotest/containers$ ls | wc
   1170    1170   36895
I don't know what is causing this leak, but in the interest of unblocking the boards on this shard, I am inclined to just reboot the shard. We can continue to monitor the situation as lxc-start processes pile up.
Also, while 102 is the worst offender, it is by no means the only shard with processes piling up:

http://shortn/_zsjtV04oAa
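For monitoring this, something like the sketch below could compare the container directories above against what lxc reports as running. The path is the one from the ls above; the lxc-ls -P/--running flags are lxc 1.x options, so double-check them against the installed version. This is just a monitoring sketch, not part of autotest.

# Sketch: compare container dirs with containers lxc reports as running.
import os
import subprocess

CONTAINER_PATH = '/usr/local/autotest/containers'

dirs = set(d for d in os.listdir(CONTAINER_PATH)
           if os.path.isdir(os.path.join(CONTAINER_PATH, d)))
running = set(subprocess.check_output(
    ['lxc-ls', '-P', CONTAINER_PATH, '--running']).decode().split())
print('%d container dirs, %d running, %d apparently leaked'
      % (len(dirs), len(running), len(dirs - running)))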
Labels: Chase-Pending
This causes slowdowns and failures on multiple boards.
102 rebooted; other servers are still affected. This still needs to be root-caused and fixed long term.

Comment 13 by dshi@chromium.org, Nov 13 2017

Cc: -kenobi@chromium.org
Owner: kenobi@chromium.org
kenobi@

It seems that there is some failure to destroy a container when a test is aborted:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/150305647-chromeos-test/chromeos2-row4-rack6-host10/ssp_logs/debug/

10/20 01:02:45.240 ERROR|          autoserv:0366| Failed to destroy container test_150305647_1508482800_712.
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 358, in handle_sigterm
    container = bucket.get_container(container_id)
  File "/usr/local/autotest/site_utils/lxc/container_bucket.py", line 75, in get_container
    return self.get_all().get(container_id, None)
  File "/usr/local/autotest/site_utils/lxc/container_bucket.py", line 60, in get_all
    **info)
  File "/usr/local/autotest/site_utils/lxc/container.py", line 155, in create_from_existing_dir
    return cls(lxc_path, name, kwargs)
  File "/usr/local/autotest/site_utils/lxc/container.py", line 140, in __init__
    os.path.join(self.container_path, self.name))
  File "/usr/local/autotest/site_utils/lxc/container.py", line 76, in load
    return pickle.loads(pickle_data)
  File "/usr/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError

Before that error, there are repeated log entries about reading:
/usr/local/autotest/containers/test_149864897_1508337622_13874/container_id.p

Can you take a look and see whether there is a hole in the container cleanup code?
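
(Side note: the EOFError at the bottom of that traceback is exactly what pickle raises on an empty file, which would fit a container_id.p that was created but never written. A quick check against the path above, just to confirm the theory:)

# An empty or truncated container_id.p reproduces the EOFError above.
import os
import pickle

path = ('/usr/local/autotest/containers/'
        'test_149864897_1508337622_13874/container_id.p')
print('%s: %d bytes' % (path, os.path.getsize(path)))
with open(path, 'rb') as f:
    pickle.load(f)   # raises EOFError if the file is empty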
Labels: -Chase-Pending Chase
The container ID unpickling code is not very fault-tolerant, so a bad ID raises exceptions.  Coupled with the fact that the ContainerBucket frequently queries all containers (see ContainerBucket.get_all), this causes any container with a corrupted ID to block any code that calls get_all.  Bad news all around.

I will unblock this by catching and ignoring (for now) errors arising from unpickling of container IDs.
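
Roughly along these lines (a sketch of the idea only, not the actual CL; the real change is in the autotest container code, and the helper name here is made up):

# Sketch: swallow unpickling errors so one bad container_id.p can no longer
# break every ContainerBucket.get_all() call.  load_container_id is a
# hypothetical helper, not the real autotest function.
import logging
import pickle

def load_container_id(pickle_path):
    try:
        with open(pickle_path, 'rb') as f:
            return pickle.load(f)
    except Exception:
        # EOFError / UnpicklingError / IOError: treat the container as having
        # no ID for now and let a later cleanup pass deal with it.
        logging.exception('Ignoring unreadable container id in %s', pickle_path)
        return None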
Containers in a weird state -

on server chromeos-server118.mtv the following containers have ID files that are empty:
- test_149864897_1508337622_13874
  The SSP logs for this test [1] reference a container that has a different name (but a matching job ID).

- test_149898741_1508337602_7026
  The SSP logs for this test [2] reference the correct container, but the log looks corrupted and cuts off halfway.  Not sure what happened here.

These two invalid containers are causing all the ContainerBucket.get_all calls on this server to raise exceptions, which in the case of #13 is blocking the sigterm handler of another test from cleaning up.

1: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/149864897-chromeos-test/chromeos2-row4-rack6-host11/ssp_logs/debug/

2: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/149898741-chromeos-test/chromeos2-row10-rack7-host17/ssp_logs/debug/
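
For anyone sweeping other servers, a quick script along these lines will list containers whose ID file is missing, empty, or otherwise unloadable (path taken from the logs above; the rest is just a convenience sketch):

# Sketch: flag containers with a missing, empty, or unreadable container_id.p.
import os
import pickle

CONTAINER_PATH = '/usr/local/autotest/containers'

for name in sorted(os.listdir(CONTAINER_PATH)):
    id_file = os.path.join(CONTAINER_PATH, name, 'container_id.p')
    if not os.path.isfile(id_file):
        print('%s: no container_id.p' % name)
        continue
    try:
        with open(id_file, 'rb') as f:
            pickle.load(f)
    except Exception as e:
        print('%s: bad container_id.p (%s: %s)' % (name, type(e).__name__, e))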
Status: Fixed (was: Untriaged)
Hm, gerrit doesn't seem to have updated the bug, but the fix has landed as https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/767588
Cc: kenobi@chromium.org
Owner: nxia@chromium.org
Status: Assigned (was: Fixed)
Let's leave this open until we've verified this fix.

This will need a push to prod. -> nxia (secondary deputy), can we get a push (with this in it) today?
The fix will unblock calls to container_bucket.get_all, but it will not inherently clean up any remaining orphaned (or otherwise lingering) containers.  Are there any janitorial processes running that will do this?  Or does it get triggered manually?

Without an actual cleanup there will be no improvement, only a halt to further degradation.
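
For concreteness, the kind of janitorial pass I have in mind, very roughly (the lxc-ls/lxc-destroy flags are lxc 1.x ones, and treating every stopped container as orphaned plus skipping names starting with 'base' are assumptions that the real cleanup script may not share):

# Sketch of an orphan sweep: destroy containers lxc reports as stopped under
# the autotest container path.  Illustrative only.
import subprocess

CONTAINER_PATH = '/usr/local/autotest/containers'

stopped = subprocess.check_output(
    ['lxc-ls', '-P', CONTAINER_PATH, '--stopped']).decode().split()
for name in stopped:
    if name.startswith('base'):   # assumed naming of the base image; skip it
        continue
    print('destroying %s' % name)
    subprocess.call(['lxc-destroy', '-P', CONTAINER_PATH, '-n', name])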

Comment 20 by nxia@chromium.org, Nov 14 2017

Push is done.
Blockedon: 786129
dshi@, akeshet@ and I discussed this further today.  dshi@ experimented with running the lxc cleanup script on a couple of servers (118, 103), with positive results.

There are a couple of bugs in the script (tracked in the newly blocked-on issue 786129). Once a fix for those lands, dshi@ will run a script to kick off the lxc cleanup on all drones and shards.
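
(For the record, the fan-out could be as simple as the sketch below; the host list and the cleanup command are placeholders, since the real server inventory and cleanup script live elsewhere.)

# Sketch: kick off the lxc cleanup on every drone/shard over ssh.
# HOSTS and CLEANUP_CMD are placeholders, not the real inventory or script.
import subprocess

HOSTS = ['chromeos-server102.mtv', 'chromeos-server118.mtv']  # placeholder list
CLEANUP_CMD = 'sudo /usr/local/autotest/site_utils/lxc_cleanup.py'  # assumed path

for host in HOSTS:
    print('cleaning %s' % host)
    subprocess.call(['ssh', host, CLEANUP_CMD])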
Owner: kenobi@chromium.org

Comment 23 by dshi@chromium.org, Nov 20 2017

Owner: dshi@chromium.org
I will run the cleanup after the next push.

Comment 24 by dshi@chromium.org, Nov 21 2017

Status: Fixed (was: Assigned)
Cleanup is complete.
Status: Verified (was: Fixed)
LXC process counts seem to be staying under control for the past couple of weeks.  I'm marking this verified.
