
Issue 803566


Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 804391




restarting mysql_stats job can leave orphaned child processes

Project Member Reported by ihf@chromium.org, Jan 18 2018

Issue description

Process counts have been erratic since January 4th:
https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server26&duration=2764791

root@chromeos-server26:~# ps -ef | grep mysql_stats.py | wc
    949    8544  111953

Probably a side effect of Richard's push that day.
 
Cc: pho...@chromium.org
Summary: restarting mysql_stats job can leave orphaned child processes (was: stale mysql_stats.py processes on chromeos-server26)
> Probably a side effect of Richard's push that day.

There's something deeper going on than just the side effect of
a push to prod.

Looking at the various orphaned processes, most of them have
a start time on Monday or Tuesday:

$ ps -ef | awk '/mysql_stats/ {print $5}' | sort | uniq -c | awk '$1 > 5'
    702 Jan15
    127 Jan16

A significant minority started _today_:
$ ps -ef | awk '/mysql_stats/ && $5 ~ /:/' | wc -l
115

So, digging in a bit:

$ status mysql_stats
mysql_stats start/running, process 236088
$ ps -fp 236088
UID         PID   PPID  C STIME TTY          TIME CMD
root     236088      1  0 08:33 ?        00:00:00 sudo -u chromeos-test /usr/local/autotest/site_utils/stats/mysql_stats.py
$ pstree 236088
sudo───mysql_stats.py─┬─mysql_stats.py───2*[{mysql_stats.py}]
                      └─mysql_stats.py

The 'mysql_stats' script starts three subprocesses that do work.
When the mysql_stats job restarts (which does happen every time
we push), some number of those children are apparently left behind
unterminated.
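The orphaning itself is easy to reproduce in isolation. Here's a minimal sketch (plain os.fork, not the actual mysql_stats code): a grandchild whose immediate parent exits gets reparented to init (or the nearest subreaper), which is exactly the state the stale processes above are in.

```python
import os
import time

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # intermediate child: fork the grandchild, then exit immediately,
    # orphaning it (this mirrors mysql_stats dying while workers live on)
    if os.fork() == 0:
        time.sleep(0.5)               # let the intermediate parent die first
        os.write(w, str(os.getppid()).encode())
        os._exit(0)
    os._exit(0)

os.waitpid(pid, 0)                    # reap the intermediate child
new_ppid = int(os.read(r, 32))        # typically 1 (init), or a subreaper
print("orphan reparented to pid", new_ppid)
```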

I'm also seeing four orphaned children on chromeos-server25.mtv.
I spot-checked a handful of other shards, and I don't see the symptom
there.  So, whatever causes this, it's not universal.

Adding a bit of information: something causes mysql_stats to die and
restart occasionally.

chromeos-test@chromeos-server26:~$ grep 'started' /var/log/mysql_stats.log | cut -c 1-10,24- | uniq -c
      1 Fri May 5 2017 mysql_stats: started.
     30 Mon May 8 2017 mysql_stats: started.
      2 Thu May 11 2017 mysql_stats: started.
      1 Tue Jul 18 2017 mysql_stats: started.
      1 Fri Jul 21 2017 mysql_stats: started.
      1 Mon Jul 24 2017 mysql_stats: started.
     28 Mon Jul 31 2017 mysql_stats: started.
      2 Fri Aug 4 2017 mysql_stats: started.
   6098 Wed Sep 6 2017 mysql_stats: started.
  12862 Thu Sep 7 2017 mysql_stats: started.
  12636 Fri Sep 8 2017 mysql_stats: started.
  12304 Sat Sep 9 2017 mysql_stats: started.
  12271 Sun Sep 10 2017 mysql_stats: started.
  12898 Mon Sep 11 2017 mysql_stats: started.
  11481 Tue Sep 12 2017 mysql_stats: started.
  11668 Wed Sep 13 2017 mysql_stats: started.
   5011 Thu Sep 14 2017 mysql_stats: started.
      2 Fri Oct 6 2017 mysql_stats: started.
      1 Tue Nov 7 2017 mysql_stats: started.
     30 Tue Nov 14 2017 mysql_stats: started.
      2 Wed Nov 22 2017 mysql_stats: started.
      1 Mon Dec 18 2017 mysql_stats: started.
      1 Sat Jan 13 2018 mysql_stats: started.
      1 Sun Jan 14 2018 mysql_stats: started.
    703 Mon Jan 15 2018 mysql_stats: started.
    129 Tue Jan 16 2018 mysql_stats: started.
      1 Wed Jan 17 2018 mysql_stats: started.
    121 Thu Jan 18 2018 mysql_stats: started.
     13 Fri Jan 19 2018 mysql_stats: started.
    137 Sat Jan 20 2018 mysql_stats: started.
      6 Sun Jan 21 2018 mysql_stats: started.
    245 Mon Jan 22 2018 mysql_stats: started.

Cc: akes...@chromium.org
Please loop in the deputy. I was unaware of this, and it affected the investigation of Issue 804127.

Comment 4 by jkop@chromium.org, Jan 25 2018

Labels: Chase-Pending
Status: Available (was: Untriaged)
Blocking: 804391
Flyby observation:

I restarted the service to see exactly which processes are getting left behind:

on chromeos-server26.mtv right now:

chromeos-test@chromeos-server26:~$ pstree -s -p -l 61385
init(1)───sudo(61382)───mysql_stats.py(61385)─┬─mysql_stats.py(61392)─┬─{mysql_stats.py}(61398)
                                              │                       └─{mysql_stats.py}(61406)
                                              └─mysql_stats.py(61396)

After restart:
chromeos-test@chromeos-server26:~$ for pid in 61382 61385 61392 61396 61398 61406; do echo "${pid}: $(cat /proc/${pid}/status | grep -i pgid)"; done
cat: /proc/61382/status: No such file or directory
61382: 
cat: /proc/61385/status: No such file or directory
61385: 
61392: NSpgid:	61382
cat: /proc/61396/status: No such file or directory
61396: 
cat: /proc/61398/status: No such file or directory
61398: 
61406: NSpgid:	61382


So 61392 and 61406 were left behind.
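(Aside: /proc/&lt;pid&gt;/status has no plain `Pgid` line, which is why the `grep -i pgid` above only matches the namespace field `NSpgid`; the classic process group id is in /proc/&lt;pid&gt;/stat. From Python it's simpler to ask the kernel directly:)

```python
import os

def pgid_of(pid):
    """Process group id of pid, without parsing /proc."""
    return os.getpgid(pid)

print(pgid_of(os.getpid()))
```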

chromeos-test@chromeos-server26:~$ lsof -p 61406
chromeos-test@chromeos-server26:~$ lsof -p 61392
COMMAND     PID          USER   FD   TYPE             DEVICE  SIZE/OFF     NODE NAME
mysql_sta 61392 chromeos-test  cwd    DIR              202,1      4096        2 /
mysql_sta 61392 chromeos-test  rtd    DIR              202,1      4096        2 /
mysql_sta 61392 chromeos-test  txt    REG              202,1   3341288 21103284 /usr/bin/python2.7
mysql_sta 61392 chromeos-test  mem    REG              202,1    101240 10223992 /lib/x86_64-linux-gnu/libresolv-2.19.so
....
mysql_sta 61392 chromeos-test    0u   CHR                1,3       0t0        6 /dev/null
mysql_sta 61392 chromeos-test    1w   REG              202,1 748076252 27657505 /var/log/mysql_stats.log
mysql_sta 61392 chromeos-test    2w   REG              202,1 748076252 27657505 /var/log/mysql_stats.log
mysql_sta 61392 chromeos-test    3u  unix 0x0000000000000000       0t0 49516728 socket
mysql_sta 61392 chromeos-test    4r  FIFO               0,10       0t0 49514922 pipe
mysql_sta 61392 chromeos-test    5r  FIFO               0,10       0t0 49516725 pipe
mysql_sta 61392 chromeos-test    6w  FIFO               0,10       0t0 49516725 pipe
mysql_sta 61392 chromeos-test    8r   CHR                1,3       0t0        6 /dev/null
mysql_sta 61392 chromeos-test    9u  unix 0x0000000000000000       0t0 49515883 /tmp/pymp-LsZsht/listener-n4Ol8h


chromeos-test@chromeos-server26:~$ sudo strace -p 61392
Process 61392 attached
accept(9, 


So the hung process is stuck on an accept on a unix domain socket.
Nobody else has that socket open:
chromeos-test@chromeos-server26:~$ lsof /tmp/pymp-LsZsht/listener-n4Ol8h
COMMAND     PID          USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
mysql_sta 61392 chromeos-test    9u  unix 0x0000000000000000      0t0 49515883 /tmp/pymp-LsZsht/listener-n4Ol8h


--------------------------
I straced the stuck process while sending it a SIGTERM (which is what upstart would do). Sure enough, the signal handler was installed with SA_RESTART, so accept() is transparently restarted and the process goes back to listening:

chromeos-test@chromeos-server26:~$ sudo strace -p 61392
Process 61392 attached
accept(9, 0x7ffd350a3630, [110])        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=146958, si_uid=0} ---
accept(9, 
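For illustration, here's the same mechanism in miniature (a sketch using SIGALRM and a pipe, not the real job): when SA_RESTART is cleared via siginterrupt, the blocked syscall returns EINTR and the Python-level handler actually gets to run; with SA_RESTART set, the kernel silently restarts the call, which is the behavior in the strace above.

```python
import os
import signal

r, w = os.pipe()
seen = []

def handler(signum, frame):
    seen.append(signum)
    os.write(w, b"x")        # unblock the read below from inside the handler

signal.signal(signal.SIGALRM, handler)
# flag=True clears SA_RESTART: a blocked syscall returns EINTR, giving the
# Python-level handler a chance to run. With SA_RESTART set (flag=False),
# the kernel restarts the syscall transparently and the handler may never
# run while the process is blocked in C.
signal.siginterrupt(signal.SIGALRM, True)

signal.alarm(1)
data = os.read(r, 1)         # blocks ~1s, then returns the handler's byte
```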

** Correction: "Nobody else has that socket open" is wrong. Linux doesn't provide enough information to deduce the other end of a unix domain socket, AFAICT.
Documentation for MySQLdb does suggest that _mysql connects to a local DB (like here) via a unix domain socket.

http://mysql-python.sourceforge.net/MySQLdb.html

I suspect this is a mysqldb bug, in that it forks a process that is blocked listening on a socket in an uninterruptible way. :/
Note the name on the socket:
    /tmp/pymp-LsZsht/listener-n4Ol8h
Given the name and the fact that this is a subprocess, the
socket is probably being created by/for the multiprocessing
module, not mysqldb.

I don't know whether mysqldb uses multiprocessing.  I _do_ know
that the metrics module creates subprocesses, so it's still a
suspect.

Project Member

Comment 10 by bugdroid1@chromium.org, Jan 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/10b107218cf35c99ae582cf2560137cb325e4e34

commit 10b107218cf35c99ae582cf2560137cb325e4e34
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Jan 26 20:39:38 2018

mysql_stats: ts-mon thread to mitigate process leak

Use indirect=False to force ts-mon to use a thread for sending metrics.
Hopefully this will mitigate the mysql_stats process leaks we have been seeing.

BUG= chromium:803566 
TEST=None

Change-Id: Ic1fa7f5045e0153c150db2e16d3e34f292ed86bb
Reviewed-on: https://chromium-review.googlesource.com/888260
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/10b107218cf35c99ae582cf2560137cb325e4e34/site_utils/stats/mysql_stats.py

Owner: pho...@chromium.org
Status: Assigned (was: Available)
I'm still confused about why upstart is not able to kill the child processes of mysql_stats with kill -9 after its kill timeout expires. AFAICT from reading the Upstart Cookbook, upstart sends SIGTERM to the process group, and then SIGKILL to the process group after the timeout. SIGKILL does in fact kill the rogue processes (of course...), so the only remaining possibility is that upstart isn't sending SIGKILL to the subprocesses. Do they have different process group ids? Or is the upstart script broken in some way?
Re #11: Upstart only tracks the leading process (even though it sends the signal to the leading process' process group).

In this case, the leading process does indeed quit. One of the children is hung.

ref: http://upstart.ubuntu.com/cookbook/#stopping-a-job
> In this case, the leading process does indeed quit. One of the children is hung.

More to the point, I expect that the leading process is dying due
to some error, rather than to some upstart action; see #c2.  Whatever
is causing the parent process to die isn't propagating to the children,
leaving them orphaned.

Re #13: Yes, I noticed in the logs I checked that it can fail due to monotonicity constraints:

MonitoringDecreasingValueError: Monotonically increasing metric "chromeos/autotest/afe_db/threads_created" was given value "-518", which is not greater than or equal to "0".

More to the point, regardless of whether it crashes or how often it does so, it shouldn't leave orphans behind. Maybe we could add something like this to the upstart script:

post-stop script
  pgrep mysql_stats.py --parent 1 | xargs sudo kill -9
end script

> More to the point, regardless of whether it crashes or how often
> it does so, it shouldn't leave orphans behind. Maybe we could add
> something like this to the upstart script:

Quite so.

> post-stop script
>   pgrep mysql_stats.py --parent 1 | xargs sudo kill -9
> end script

Code review comment in advance:
  * This would be simpler as
        pkill -P 1 'mysql_stats\.py'
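Combining the two suggestions, the stanza would look something like this (sketch only; untested against the real job configuration):

```
post-stop script
  # reap any workers reparented to init after the leader died
  pkill -P 1 'mysql_stats\.py' || true
end script
```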

I think I found the root cause: https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/890106

(I'd strongly advise against pkill'ing processes. That leads to very confusing process deaths.)
Issue 804391 has been merged into this issue.
Labels: -Chase-Pending Chase
Fixed pending push to prod and re-enabling of mysql_stats.
As a follow-up, we should add omens for this: issue 760755
Status: Fixed (was: Assigned)
Project Member

Comment 21 by bugdroid1@chromium.org, Feb 13 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/9cab4f1aa41c4b67719cd6a44ddaad7c4a4e24e6

commit 9cab4f1aa41c4b67719cd6a44ddaad7c4a4e24e6
Author: Paul Hobbs <phobbs@google.com>
Date: Tue Feb 13 21:13:37 2018
