scheduler shutdown crash in metrics
Issue description
The scheduler crashed when trying to shut down (during deploy_server: restart scheduler).
12/12 20:55:16.667 INFO | monitor_db:0209| Shutdown request received.
12/12 20:55:16.667 INFO | thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.chromeos-server6_mtv_corp_google_com
12/12 20:55:16.667 INFO | thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.chromeos-server3_hot_corp_google_com
12/12 20:55:16.668 INFO | thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.chromeos-server11_hot_corp_google_com
12/12 20:55:18.050 INFO | thread_lib:0071| (Worker.chromeos-server11.hot.corp.google.com) finished.
12/12 20:55:18.050 INFO | thread_lib:0095| (Task Queue) All threads have returned, clearing map.
12/12 20:55:18.051 DEBUG| monitor_db:0331| Calling email_manager.manager.send_queued_emails().
12/12 20:55:18.052 DEBUG| monitor_db:0331| Calling django.db.reset_queries().
12/12 20:55:18.053 ERROR| email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/monitor_db.py", line 185, in main_without_exception_handling
dispatcher.tick()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 224, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 391, in tick
metrics.Counter('chromeos/autotest/scheduler/tick').increment()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 67, in enqueue
reset_after=self.reset_after))
File "<string>", line 2, in put
File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
12/12 20:55:18.056 ERROR| email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/monitor_db.py", line 185, in main_without_exception_handling
dispatcher.tick()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 224, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 391, in tick
metrics.Counter('chromeos/autotest/scheduler/tick').increment()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 67, in enqueue
reset_after=self.reset_after))
File "<string>", line 2, in put
File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
12/12 20:55:18.057 ERROR| monitor_db:0105| Exception escaping in monitor_db
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/monitor_db.py", line 101, in main
main_without_exception_handling()
File "/usr/local/autotest/scheduler/monitor_db.py", line 197, in main_without_exception_handling
"Uncaught exception; terminating monitor_db")
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/autotest/site-packages/chromite/lib/ts_mon_config.py", line 149, in _CreateTsMonFlushingProcess
message_q.put(None)
File "<string>", line 2, in put
File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
Comment 1 by pprabhu@chromium.org, Dec 13 2016
Status: Started (was: Untriaged)
btw, this problem might exist in a lot of other places where we've added metrics: shard_client, host_scheduler, autoserv and suite_scheduler all install signal handlers and set up metrics contexts. I haven't yet checked whether the problem exists there as well.
Dec 14 2016
This won't be a problem unless the whole process group is sent SIGINT. Manually sending SIGINT to the parent process is safe.
Dec 14 2016
Current analysis:
- If you send SIGINT/SIGTERM directly to monitor_db (via kill -SIGTERM ...), it is not forwarded to the child processes (as expected). Metrics finish flushing and things are good.
- If you run monitor_db.py from a terminal and hit Ctrl-C, SIGINT is sent to the whole process group (http://shortn/_bGHohqKiSZ), so there is a chance that the metrics flushing process dies first, after which monitor_db continues, tries to add a metric, and raises an exception.
- On lab servers, we run monitor_db via an upstart service. In this case, upstart sends SIGTERM to the whole process group: http://upstart.ubuntu.com/cookbook/#stopping-a-job This means we'll see the problem once again (and we did, hence this bug).
One way to fix this for the upstart case is via a pre-stop stanza that just SIGTERMs the parent process first: http://upstart.ubuntu.com/cookbook/#pre-stop (The sketch below illustrates the single-PID vs. process-group difference.)
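[Editorial note] To make that distinction concrete, here is a minimal, self-contained sketch (an illustration with hypothetical names such as flusher, not code from monitor_db or chromite): the parent installs a SIGTERM handler, while the forked child, standing in for the metrics flushing subprocess, has no handler of its own.

import multiprocessing
import os
import signal
import time

def flusher():
    # Stand-in for the metrics flushing subprocess: it installs no
    # SIGTERM handler, so a group-wide SIGTERM kills it outright.
    while True:
        time.sleep(1)

def handle_term(signum, frame):
    print('parent got SIGTERM; shutting down gracefully')

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, handle_term)
    child = multiprocessing.Process(target=flusher)
    child.start()
    print('parent pid=%d pgid=%d child pid=%d'
          % (os.getpid(), os.getpgid(0), child.pid))
    # From another shell:
    #   kill -TERM <parent pid>  -> only the parent is signaled; the
    #                               child stays alive (the safe case).
    #   kill -TERM -<pgid>       -> the whole group is signaled, which
    #                               is what Ctrl-C and upstart's stop
    #                               do; the child dies, and any later
    #                               write to a queue it serviced raises
    #                               IOError: Broken pipe.
    child.join()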
Dec 14 2016
+ayatane, +jrbarnette for thoughts. I can see two approaches here:
- Add a pre-stop stanza to our upstart job.
- Set the signal_handler before forking the metrics process (http://shortn/_iR3qYgLk0Q). This means that the child process will inherit the signal handler (the global variable will be set, but will have no effect). The child process will exit only because the parent will respond by exiting the context that controls the child's lifetime.
Any preferences?
Dec 14 2016
I think moving the signal_handler earlier is less hacky than using pre-stop: we would basically be re-implementing upstart's stop behavior in our pre-stop, and we'd have to duplicate the waiting logic too, otherwise upstart will immediately go to stopping and SIGTERM everything anyway. The really robust solution is figuring out the proper semantics for the forked flushing process and installing the proper signal handling behavior there, so that it works correctly everywhere and doesn't rely on the parent process setting up the signal handler.
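[Editorial note] A minimal sketch of the "set the signal handler before forking" approach (hypothetical names such as flush_loop and request_shutdown; this is not the actual chromite code): the child inherits the handler, so a group-wide signal merely sets a flag in it instead of killing it, and it exits only when the parent's shutdown path sends a sentinel.

import multiprocessing
import signal
import time

shutdown_requested = False

def request_shutdown(signum, frame):
    # Inherited by the child across fork: instead of dying, the child
    # sets a flag it never acts on (as described above), and exits
    # only when the parent tears down the context that owns it.
    global shutdown_requested
    shutdown_requested = True

def flush_loop(queue):
    # Stand-in for the metrics flushing loop.
    while True:
        if queue.get() is None:  # sentinel from the parent's shutdown path
            return

if __name__ == '__main__':
    # Install handlers *before* forking so the child inherits them.
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)
    queue = multiprocessing.Queue()
    flusher = multiprocessing.Process(target=flush_loop, args=(queue,))
    flusher.start()
    time.sleep(1)    # the parent's normal work happens here
    queue.put(None)  # graceful shutdown: the child drains and exits
    flusher.join()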
Dec 16 2016
Per discussion with pprabhu, we should squelch any such exceptions in metrics.py or ts_mon_config.py. Other than syntax errors, metrics code should not emit exceptions to its callers. Not sure if this bug is related to https://bugs.chromium.org/p/chromium/issues/detail?id=675221
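[Editorial note] As a sketch of what squelching could look like (swallow_metrics_errors is a hypothetical name, not an existing chromite helper), every public metrics entry point could be wrapped so runtime failures are logged and dropped rather than propagated:

import functools
import logging

def swallow_metrics_errors(fn):
    # Metrics are best-effort: log any runtime error raised while
    # emitting a metric and return None instead of propagating it.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            logging.exception('Ignoring error while emitting metrics.')
    return wrapper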
Dec 19 2016
I modified the toy program from http://stackoverflow.com/questions/10607553/python-multiprocessing-queue-what-to-do-when-the-receiving-process-quits

import multiprocessing
import sys
import time

class MyProcess(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.queue = multiprocessing.Queue()

    def run(self):
        print "Subprocess starting!"
        time.sleep(4)
        print "Subprocess exiting!"

    def addToQueue(self, num_objs):
        starttime = time.time()
        count = 0
        print "Adding %d objects to queue..." % num_objs
        while time.time() - starttime < 4 and count < num_objs:
            self.queue.put("string")
            count += 1
        print "Added %d objects!" % count
        self.queue.close()

if __name__ == "__main__":
    args = sys.argv
    assert len(args) == 2, "Usage: ./%s <num_objs>" % args[0]
    process = MyProcess()
    process.start()
    print "Waiting for a while"
    time.sleep(2)
    process.addToQueue(int(args[1]))
    time.sleep(1)
    print "Child process state: %d" % process.is_alive()

===============================================

On both chromeos-server2 and on my machine, this consistently fails for 3841 objects in the queue, but doesn't for 3840:

pprabhu@pprabhu:~/scratch$ python pp.py 3841
Waiting for a while
Subprocess starting!
Adding 3841 objects to queue...
Added 3841 objects!
Child process state: 1
Subprocess exiting!
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe

pprabhu@pprabhu:~/scratch$ python pp.py 3840
Waiting for a while
Subprocess starting!
Adding 3840 objects to queue...
Added 3840 objects!
Child process state: 1
Subprocess exiting!
pprabhu@pprabhu:~/scratch$
Dec 20 2016
That doesn't sound abnormal. The fifo is buffered in the kernel, and if it fills up subsequent writes will block. Nothing is consuming from the queue, so...
Dec 20 2016
Re #11: Yeah, the cool part is the exact failure at 3841; I was trying to ferret out why that number is significant. Anyway, it turns out #10 is irrelevant: we use multiprocessing.Manager for shared data structures and have our own wrapper that tries to make it behave better with signals (http://shortn/_3Wfb55w6cc). I have a CL for that which should fix this. I have been trying (and failing) to write a unittest that makes the failure reproduce consistently.
Apr 6 2017
Issue 682729 has been merged into this issue.
Apr 6 2017
This is happening quite consistently. It slows down push-to-prod. It slows down server_management_library tasks that set up / tear down servers. It costs ~5 minutes each time we run 'service scheduler stop'.
Apr 6 2017
Here's the CL I referred to in #12. https://chromium-review.googlesource.com/c/423174/ I'd like to merge it without a definitive unittest guaranteeing that I've fixed the real problem.
Apr 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/9a4d3909930ffc1043ea8d970a6fae5fd70efd2d

commit 9a4d3909930ffc1043ea8d970a6fae5fd70efd2d
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 07 21:15:38 2017

parallel: Also ignore SIGTERM in the child process.

parallel.Manager is used by the metrics library to manage shared data structures with the background metrics flushing subprocess. This library is used by autotest in the lab. The client process is managed via upstart, which signals process exit via SIGTERM to the whole process group. Similar to the problems with SIGINT (noted in the original CL: crosreview.com/256698), this causes the manager to throw IOError when the background process dies due to SIGTERM but the client process tries to continue using the managed queue.Queue during graceful shutdown.

BUG=chromium:673639
TEST=- (new) unittest fails before change, passes afterwards.
- 'sudo service scheduler restart' does not print a stacktrace after chromite roll.

Change-Id: I36fec560dc09cabfbcec38f0bf9b21152c803828
Reviewed-on: https://chromium-review.googlesource.com/423174
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/9a4d3909930ffc1043ea8d970a6fae5fd70efd2d/lib/parallel_unittest.py
[modify] https://crrev.com/9a4d3909930ffc1043ea8d970a6fae5fd70efd2d/lib/parallel.py
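[Editorial note] The gist of the change, as a standalone sketch (the real fix lives in chromite's lib/parallel.py; this is not that code): start the manager's server process with an initializer that ignores shutdown signals, so a group-wide SIGINT/SIGTERM can no longer kill it out from under the client.

import multiprocessing.managers
import signal

def _ignore_shutdown_signals():
    # Runs inside the freshly forked manager process. With SIGINT
    # (Ctrl-C on the group) and SIGTERM (upstart stop) ignored here,
    # the client can keep using managed queues during its own
    # graceful shutdown.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    signal.signal(signal.SIGTERM, signal.SIG_IGN)

if __name__ == '__main__':
    manager = multiprocessing.managers.SyncManager()
    manager.start(initializer=_ignore_shutdown_signals)
    queue = manager.Queue()
    # ... use the queue as usual; shut the manager down explicitly
    # once the client is done with it.
    manager.shutdown()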
May 2 2017
Issue 717375 has been merged into this issue.
May 12 2017
phobbs@ has a CL for this: https://chromium-review.googlesource.com/c/499450/
May 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/b872187ef6c9d6cb0dfa71d4484a1f1953c58520

commit b872187ef6c9d6cb0dfa71d4484a1f1953c58520
Author: Paul Hobbs <phobbs@google.com>
Date: Thu May 18 06:24:30 2017

metrics: .put metrics on Queue only if alive

Only .put metrics on the metrics flushing process' multiprocessing.Queue if it is still alive; otherwise, do nothing. This avoids a performance problem whereby the Queue fills up when the metrics flushing process has died.

BUG=chromium:673639
TEST=unittests

Change-Id: I4864c6ca881d2b49a739a70c45dc557482b9d09e
Reviewed-on: https://chromium-review.googlesource.com/499450
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Paul Hobbs <phobbs@google.com>
Commit-Queue: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/b872187ef6c9d6cb0dfa71d4484a1f1953c58520/lib/ts_mon_config.py
[modify] https://crrev.com/b872187ef6c9d6cb0dfa71d4484a1f1953c58520/lib/metrics.py
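[Editorial note] The approach from the commit message, as a minimal sketch (safe_enqueue and flusher are hypothetical names, not the actual ts_mon_config code):

def safe_enqueue(queue, flusher, payload):
    # Only enqueue when the flushing process can still drain the queue.
    # If it has died, drop the metric rather than filling the bounded
    # pipe buffer and eventually blocking (or breaking) the caller.
    if flusher is not None and flusher.is_alive():
        queue.put(payload)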
Jun 5 2017
Fixed?