restarting mysql_stats job can leave orphaned child processes
Issue description

Process counts are erratic since January 4th:
https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server26&duration=2764791

root@chromeos-server26:~# ps -ef | grep mysql_stats.py | wc
    949    8544  111953

Probably a side effect of Richard's push that day.
Jan 22 2018
Adding a bit of information: Something causes mysql_stats to die and
restart occasionally.
chromeos-test@chromeos-server26:~$ grep 'started' /var/log/mysql_stats.log | cut -c 1-10,24- | uniq -c
1 Fri May 5 2017 mysql_stats: started.
30 Mon May 8 2017 mysql_stats: started.
2 Thu May 11 2017 mysql_stats: started.
1 Tue Jul 18 2017 mysql_stats: started.
1 Fri Jul 21 2017 mysql_stats: started.
1 Mon Jul 24 2017 mysql_stats: started.
28 Mon Jul 31 2017 mysql_stats: started.
2 Fri Aug 4 2017 mysql_stats: started.
6098 Wed Sep 6 2017 mysql_stats: started.
12862 Thu Sep 7 2017 mysql_stats: started.
12636 Fri Sep 8 2017 mysql_stats: started.
12304 Sat Sep 9 2017 mysql_stats: started.
12271 Sun Sep 10 2017 mysql_stats: started.
12898 Mon Sep 11 2017 mysql_stats: started.
11481 Tue Sep 12 2017 mysql_stats: started.
11668 Wed Sep 13 2017 mysql_stats: started.
5011 Thu Sep 14 2017 mysql_stats: started.
2 Fri Oct 6 2017 mysql_stats: started.
1 Tue Nov 7 2017 mysql_stats: started.
30 Tue Nov 14 2017 mysql_stats: started.
2 Wed Nov 22 2017 mysql_stats: started.
1 Mon Dec 18 2017 mysql_stats: started.
1 Sat Jan 13 2018 mysql_stats: started.
1 Sun Jan 14 2018 mysql_stats: started.
703 Mon Jan 15 2018 mysql_stats: started.
129 Tue Jan 16 2018 mysql_stats: started.
1 Wed Jan 17 2018 mysql_stats: started.
121 Thu Jan 18 2018 mysql_stats: started.
13 Fri Jan 19 2018 mysql_stats: started.
137 Sat Jan 20 2018 mysql_stats: started.
6 Sun Jan 21 2018 mysql_stats: started.
245 Mon Jan 22 2018 mysql_stats: started.
Jan 23 2018
Please loop in the deputy. I was unaware of this, and it affected the investigation of Issue 804127.
Jan 26 2018
Flyby observation:
I restarted the service to see exactly which processes are getting left behind.
On chromeos-server26.mtv right now (before the restart):
chromeos-test@chromeos-server26:~$ pstree -s -p -l 61385
init(1)───sudo(61382)───mysql_stats.py(61385)─┬─mysql_stats.py(61392)─┬─{mysql_stats.py}(61398)
                                              │                       └─{mysql_stats.py}(61406)
                                              └─mysql_stats.py(61396)
After restart:
chromeos-test@chromeos-server26:~$ for pid in 61382 61385 61392 61396 61398 61406; do echo "${pid}: $(cat /proc/${pid}/status | grep -i pgid)"; done
cat: /proc/61382/status: No such file or directory
61382:
cat: /proc/61385/status: No such file or directory
61385:
61392: NSpgid: 61382
cat: /proc/61396/status: No such file or directory
61396:
cat: /proc/61398/status: No such file or directory
61398:
61406: NSpgid: 61382
So 61392 and 61406 were left behind.
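(For reference, a throwaway sketch of how leftovers like these can be enumerated, not part of the job itself: it just walks /proc and reports mysql_stats.py processes whose parent is init, i.e. PPid 1.)

import os

def orphaned_mysql_stats():
    # Yield pids of mysql_stats.py processes that have been reparented to init.
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/cmdline' % pid) as f:
                cmdline = f.read()
            with open('/proc/%s/status' % pid) as f:
                status = dict(line.rstrip('\n').split(':\t', 1)
                              for line in f if ':\t' in line)
        except IOError:  # raced with a process that exited
            continue
        if 'mysql_stats.py' in cmdline and status.get('PPid') == '1':
            yield int(pid)

if __name__ == '__main__':
    for pid in orphaned_mysql_stats():
        print(pid)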
chromeos-test@chromeos-server26:~$ lsof -p 61406
chromeos-test@chromeos-server26:~$ lsof -p 61392
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
mysql_sta 61392 chromeos-test cwd DIR 202,1 4096 2 /
mysql_sta 61392 chromeos-test rtd DIR 202,1 4096 2 /
mysql_sta 61392 chromeos-test txt REG 202,1 3341288 21103284 /usr/bin/python2.7
mysql_sta 61392 chromeos-test mem REG 202,1 101240 10223992 /lib/x86_64-linux-gnu/libresolv-2.19.so
....
mysql_sta 61392 chromeos-test 0u CHR 1,3 0t0 6 /dev/null
mysql_sta 61392 chromeos-test 1w REG 202,1 748076252 27657505 /var/log/mysql_stats.log
mysql_sta 61392 chromeos-test 2w REG 202,1 748076252 27657505 /var/log/mysql_stats.log
mysql_sta 61392 chromeos-test 3u unix 0x0000000000000000 0t0 49516728 socket
mysql_sta 61392 chromeos-test 4r FIFO 0,10 0t0 49514922 pipe
mysql_sta 61392 chromeos-test 5r FIFO 0,10 0t0 49516725 pipe
mysql_sta 61392 chromeos-test 6w FIFO 0,10 0t0 49516725 pipe
mysql_sta 61392 chromeos-test 8r CHR 1,3 0t0 6 /dev/null
mysql_sta 61392 chromeos-test 9u unix 0x0000000000000000 0t0 49515883 /tmp/pymp-LsZsht/listener-n4Ol8h
chromeos-test@chromeos-server26:~$ sudo strace -p 61392
Process 61392 attached
accept(9,
So the hung process is stuck on an accept on a unix domain socket.
Nobody else has that socket open:
chromeos-test@chromeos-server26:~$ lsof /tmp/pymp-LsZsht/listener-n4Ol8h
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
mysql_sta 61392 chromeos-test 9u unix 0x0000000000000000 0t0 49515883 /tmp/pymp-LsZsht/listener-n4Ol8h
--------------------------
I straced the stuck process while sending it a SIGTERM (which is what upstart would do). Sure enough, the SIGTERM handler was installed with SA_RESTART, so the accept() gets restarted and the process goes back to listening:
chromeos-test@chromeos-server26:~$ sudo strace -p 61392
Process 61392 attached
accept(9, 0x7ffd350a3630, [110]) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=146958, si_uid=0} ---
accept(9,
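To illustrate the mechanism with a standalone toy (this is not mysql_stats code; the socket path is arbitrary): if a SIGTERM handler is installed and siginterrupt(False) asks for SA_RESTART semantics, a blocking accept() on a unix socket is transparently restarted when SIGTERM arrives, so the process never even wakes up to run the handler.

import os
import signal
import socket
import tempfile

# Install a handler for SIGTERM, then request SA_RESTART semantics for it:
# syscalls interrupted by this signal are restarted instead of failing with EINTR.
signal.signal(signal.SIGTERM, lambda sig, frame: None)
signal.siginterrupt(signal.SIGTERM, False)

path = tempfile.mktemp(prefix='listener-')  # arbitrary throwaway socket name
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(1)

print('pid %d blocked in accept(); kill -TERM it from another shell' % os.getpid())
server.accept()  # SIGTERM just restarts the accept; the process stays stuck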
Jan 26 2018
** Correction: "Nobody else has that socket open" is wrong. Linux doesn't provide enough information to deduce the other end of a unix domain socket, AFAICT.
Jan 26 2018
Documentation for MySQLdb does suggest that _mysql connects to a local DB (like here) via a unix domain socket: http://mysql-python.sourceforge.net/MySQLdb.html I suspect this is a MySQLdb bug, in that it forks a process that is blocked listening on a socket in an uninterruptible way. :/
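For context, a minimal sketch of the kind of connection being discussed (every connection parameter below is a made-up placeholder, not the real mysql_stats configuration): with host='localhost', _mysql talks to the server over its unix domain socket rather than TCP.

import MySQLdb

# Placeholder credentials/paths for illustration only.
conn = MySQLdb.connect(host='localhost', user='some_user', passwd='...',
                       db='some_db',
                       unix_socket='/var/run/mysqld/mysqld.sock')
cursor = conn.cursor()
cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_created'")
print(cursor.fetchone())
conn.close()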
Jan 26 2018
Note the name on the socket:
/tmp/pymp-LsZsht/listener-n4Ol8h
Given the name and the fact that this is a subprocess, the
socket is probably being created by/for the multiprocessing
module, not MySQLdb.
I don't know whether MySQLdb uses multiprocessing. I _do_ know
that the metrics module creates subprocesses, so it's still a
suspect.
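A quick way to see where that socket name comes from (standalone illustration, not mysql_stats code): on Linux, a multiprocessing Manager starts a server subprocess that listens on a unix domain socket named /tmp/pymp-<random>/listener-<random>, which is exactly the naming pattern in the lsof output above.

import multiprocessing

if __name__ == '__main__':
    manager = multiprocessing.Manager()  # forks a server subprocess
    print(manager.address)               # e.g. /tmp/pymp-XXXXXX/listener-XXXXXX
    manager.shutdown()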
Jan 26 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/10b107218cf35c99ae582cf2560137cb325e4e34

commit 10b107218cf35c99ae582cf2560137cb325e4e34
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Jan 26 20:39:38 2018

    mysql_stats: ts-mon thread to mitigate process leak

    Use indirect=False to force ts-mon to use a thread for sending metrics.
    Hopefully this will mitigate the mysql_stats process leaks we have been
    seeing.

    BUG=chromium:803566
    TEST=None

    Change-Id: Ic1fa7f5045e0153c150db2e16d3e34f292ed86bb
    Reviewed-on: https://chromium-review.googlesource.com/888260
    Commit-Ready: Paul Hobbs <phobbs@google.com>
    Tested-by: Paul Hobbs <phobbs@google.com>
    Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/10b107218cf35c99ae582cf2560137cb325e4e34/site_utils/stats/mysql_stats.py
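Roughly what the change does, as a sketch (the import path, call signature, and helper name below are assumptions, not copied from the CL): with indirect=False, ts-mon flushes metrics from a background thread inside the mysql_stats process instead of forking a separate flushing subprocess, so there is no extra child to leak.

# Sketch only; see the linked change for the real diff.
from chromite.lib import ts_mon_config

def main():
    # indirect=False: flush metrics from a thread in this process rather
    # than from a forked flushing subprocess.
    with ts_mon_config.SetupTsMonGlobalState('mysql_stats', indirect=False):
        run_queries_and_emit_metrics()  # hypothetical main loop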
Jan 26 2018
I'm still confused about why upstart is not able to kill the child processes of mysql_stats with SIGKILL after its kill timeout expires. AFAICT from reading the Upstart Cookbook, upstart sends SIGTERM to the process group, and then SIGKILL to the process group after the timeout. SIGKILL does in fact kill the rogue processes (of course...), so the only possibility that remains is that upstart isn't sending SIGKILL to the subprocesses. Do they have different process group ids? Or is the upstart script broken in some way?
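The process-group question is easy to check directly; e.g. (the pids are the leftover children seen above; substitute whatever is currently orphaned):

import os

for pid in (61392, 61406):  # leftover children from the pstree above
    print('%d: pgid %d' % (pid, os.getpgid(pid)))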
Jan 26 2018
Re #11: Upstart only tracks the leading process (even though it sends the signal to the leading process' process group). In this case, the leading process does indeed quit. One of the children is hung. ref: http://upstart.ubuntu.com/cookbook/#stopping-a-job
Jan 26 2018
> In this case, the leading process does indeed quit. One of the children is hung.

More to the point, I expect that the leading process is dying due to some error, rather than to some upstart action; see #c2. Whatever is causing the parent process to die isn't propagating to the children, leaving them orphaned.
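A stripped-down sketch of that failure mode and one defensive pattern against it (toy code, not the real script): the parent below dies on an exception; because the child is marked daemon and an atexit hook terminates it, it does not outlive the parent. Without that kind of cleanup the child can outlive the parent and, once the parent is gone, it gets reparented to init, which is the orphan pattern seen here.

import atexit
import multiprocessing
import time

def worker():
    while True:
        time.sleep(60)  # stands in for a child blocked in accept()

def main():
    child = multiprocessing.Process(target=worker)
    child.daemon = True  # daemonic children get terminated at interpreter exit
    child.start()
    atexit.register(child.terminate)  # belt and braces for exceptional exits

    raise RuntimeError('parent dies here, e.g. on a metrics error')

if __name__ == '__main__':
    main()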
Jan 26 2018
Re #13: Yes, I noticed in the logs I checked that it can fail due to monotonicity constraints:

MonitoringDecreasingValueError: Monotonically increasing metric "chromeos/autotest/afe_db/threads_created" was given value "-518", which is not greater than or equal to "0".

More to the point, regardless of whether it crashes or how often it does so, it shouldn't leave orphans behind. Maybe we could add something like this to the upstart script:

post-stop script
  pgrep mysql_stats.py --parent 1 | xargs sudo kill -9
end script
Jan 26 2018
> More to the point, regardless of whether it crashes or how often
> it does so, it shouldn't leave orphans behind. Maybe we could add
> something like this to the upstart script:
Quite so.
> post-stop script
> pgrep mysql_stats.py --parent 1 | xargs sudo kill -9
> end script
Code review comment in advance:
* This would be simpler as
pkill -P 1 'mysql_stats\.py'
Jan 27 2018
I think I found the root cause: https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/890106 (I'd strongly advise against pkill'ing processes; that leads to very confusing process deaths.)
Jan 29 2018
Issue 804391 has been merged into this issue.
Jan 29 2018
Fixed, pending push to prod and re-enabling of mysql_stats.
Jan 29 2018
As a follow-up, we should add omens for this: issue 760755.
Feb 13 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/9cab4f1aa41c4b67719cd6a44ddaad7c4a4e24e6

commit 9cab4f1aa41c4b67719cd6a44ddaad7c4a4e24e6
Author: Paul Hobbs <phobbs@google.com>
Date: Tue Feb 13 21:13:37 2018
Comment 1 by jrbarnette@chromium.org, Jan 18 2018
Summary: restarting mysql_stats job can leave orphaned child processes (was: stale mysql_stats.py processes on chromeos-server26)
> Probably a side effect of Richard's push that day.

There's something deeper going on than just the side effect of a push to prod. Looking at the various orphaned processes, most of them have a start time on Monday or Tuesday:

$ ps -ef | awk '/mysql_stats/ {print $5}' | sort | uniq -c | awk '$1 > 5'
    702 Jan15
    127 Jan16

A significant minority started _today_:

$ ps -ef | awk '/mysql_stats/ && $5 ~ /:/' | wc -l
115

So, digging in a bit:

$ status mysql_stats
mysql_stats start/running, process 236088
$ ps -fp 236088
UID        PID    PPID  C STIME TTY      TIME     CMD
root       236088 1     0 08:33 ?        00:00:00 sudo -u chromeos-test /usr/local/autotest/site_utils/stats/mysql_stats.py
$ pstree 236088
sudo───mysql_stats.py─┬─mysql_stats.py───2*[{mysql_stats.py}]
                      └─mysql_stats.py

The 'mysql_stats' script starts three subprocesses that do work. When the mysql_stats job restarts (which does happen every time we push), some number of those children are apparently left behind unterminated.

I'm also seeing four orphaned children on chromeos-server25.mtv. I spot-checked a handful of other shards, and I don't see the symptom there. So, whatever causes this, it's not universal.