kill_slow_queries processes piling up on shard(s)
Issue description
https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server27&board=sentry&duration=8777686&mdb_role=chrome-infra&pool=managed%3Acts&refresh=-1&status=Running&topstreams=20#_VG_oYXH0N8z

chromeos-test@chromeos-server27:~$ ps aux | grep kill_slow_queries | wc
1364 20581 203536

This is starting to push us into memory pressure on at least that shard, and possibly others.
,
Jul 26 2017
This issue is not confined to server27.

chromeos-test@chromeos-server101:~$ ps aux | grep kill_slow_queries | wc
2569 38839 384034
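Note that `ps aux | grep kill_slow_queries | wc` also counts the grep process itself, so the figures above may be off by one. A minimal Python sketch of a count that avoids this, assuming `ps -eo pid,args` output is available as text. The names `count_processes` and `count_running` are hypothetical and not part of any existing tooling:

```python
import subprocess


def count_processes(ps_output, name):
    """Count lines of `ps -eo pid,args` output whose command mentions name.

    Lines belonging to a grep invocation are skipped so that the search
    itself is not counted.
    """
    count = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # header line or malformed line
        args = parts[1]
        if name in args and not args.lstrip().startswith('grep'):
            count += 1
    return count


def count_running(name):
    """Run ps locally and count matching processes (hypothetical helper)."""
    out = subprocess.check_output(['ps', '-eo', 'pid,args'], text=True)
    return count_processes(out, name)
```

Keeping the parsing separate from the `ps` call makes the count easy to check against saved output from a shard.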
,
Jul 26 2017
I checked /var/log/kill_slow_queries.log and found that, besides the kill_slow_queries upstart job, there is an old kill_slow_queries cron job running every second on the shards, which accounts for the 1000+ kill_slow_queries processes. I will clean up the old cron jobs on all shards and keep tracking this for the rest of the week.
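The cleanup amounts to dropping every crontab line that launches the old job. A sketch in Python, assuming the crontab text has been obtained elsewhere (for example via `crontab -l`). `strip_cron_entries` is a hypothetical helper, not the script that was actually run:

```python
def strip_cron_entries(crontab_text, job_name):
    """Return (kept_text, removed_lines) for a crontab.

    Comment lines are always kept, even if they mention job_name.
    """
    kept, removed = [], []
    for line in crontab_text.splitlines():
        is_comment = line.lstrip().startswith('#')
        if job_name in line and not is_comment:
            removed.append(line)
        else:
            kept.append(line)
    kept_text = '\n'.join(kept)
    if kept:
        kept_text += '\n'  # crontab files should end with a newline
    return kept_text, removed
```

Returning the removed lines alongside the kept text lets the change be reviewed per shard before it is installed.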
,
Jul 26 2017
chromeos-test@chromeos-server101:~$ sudo killall -r -9 kill_slow_queries

We still have lots of these piled up on other shards.
,
Jul 26 2017
We used to run kill_slow_queries in cron, and now we loop infinitely thanks to https://chromium-review.googlesource.com/c/544336/. It makes sense that these processes are piling up.
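One way to make a looping script safe against being launched repeatedly is a single-instance guard. A minimal sketch in Python using an advisory `flock`. This is an illustration only, not what the referenced CL does, and the function name and lock path are made up:

```python
import fcntl
import os


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file descriptor on success (keep it open for the
    life of the process), or None if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Lock is held elsewhere; release our descriptor and report failure.
        os.close(fd)
        return None
    return fd
```

A script would call this at startup and exit immediately when it returns None, so a stray cron entry would start a process that quits rather than one that loops forever.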
,
Jul 26 2017
I ran a script to clean up all the cron jobs and kill the kill_slow_queries processes on the shards. But it seems not all of the processes were killed. I will keep tracking this.
,
Jul 27 2017
I've checked all the shards and the database server. The number of kill_slow_queries processes is now under 8 on every server, and all of them were created by the upstart job. I will claim victory for this bug.
,
Jul 28 2017
Can you please check chromeos-server100.mtv? According to the graph below it is still leaking. I can't ssh into it.
,
Jul 28 2017
Checking the boards running on the shard: one possibility is that chromeos-server100.mtv is actually overloaded right now.
,
Jul 28 2017
I powercycled the server from portal as it did not respond for hours. It is back alive now.
,
Jul 28 2017
There were only 5 kill_slow_queries processes running on chromeos-server100.mtv. I think the leak is not caused by this bug.
,
Aug 15 2017
This is a real bug. See deduped bug for details. Fix is in-flight.
Comment 1 by akes...@chromium.org
, Jul 26 2017