
Issue 673639

Starred by 4 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




scheduler shutdown crash in metrics

Project Member Reported by pprabhu@chromium.org, Dec 13 2016

Issue description

scheduler crashed when trying to shutdown (for deploy_server: restart scheduler)

12/12 20:55:16.667 INFO |        monitor_db:0209| Shutdown request received.
12/12 20:55:16.667 INFO |        thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.chromeos-server6_mtv_corp_google_com
12/12 20:55:16.667 INFO |        thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.chromeos-server3_hot_corp_google_com
12/12 20:55:16.668 INFO |        thread_lib:0091| (Task Queue) Waiting for drone_manager.execute_queue.chromeos-server11_hot_corp_google_com
12/12 20:55:18.050 INFO |        thread_lib:0071| (Worker.chromeos-server11.hot.corp.google.com) finished.
12/12 20:55:18.050 INFO |        thread_lib:0095| (Task Queue) All threads have returned, clearing map.
12/12 20:55:18.051 DEBUG|        monitor_db:0331| Calling email_manager.manager.send_queued_emails().
12/12 20:55:18.052 DEBUG|        monitor_db:0331| Calling django.db.reset_queries().
12/12 20:55:18.053 ERROR|     email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 185, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 224, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 391, in tick
    metrics.Counter('chromeos/autotest/scheduler/tick').increment()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 67, in enqueue
    reset_after=self.reset_after))
  File "<string>", line 2, in put
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
12/12 20:55:18.056 ERROR|     email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 185, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 224, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 391, in tick
    metrics.Counter('chromeos/autotest/scheduler/tick').increment()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 67, in enqueue
    reset_after=self.reset_after))
  File "<string>", line 2, in put
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe

12/12 20:55:18.057 ERROR|        monitor_db:0105| Exception escaping in monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 101, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 197, in main_without_exception_handling
    "Uncaught exception; terminating monitor_db")
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/autotest/site-packages/chromite/lib/ts_mon_config.py", line 149, in _CreateTsMonFlushingProcess
    message_q.put(None)
  File "<string>", line 2, in put
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
 
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
monitor_db sets up a signal handler to capture SIGINT and tries to shut down gracefully when it arrives.
But |initialize| is called *after* SetupTsMonGlobalState. This means that the metrics flushing process dies on SIGINT.
When SIGINT arrives, |_shutdown| gets set in the parent process, but the process continues with the current |tick|. Meanwhile, the metrics process dies. So when |tick| tries to enqueue a metric, it gets an IOError.

The fix is simple: we should install the SIGINT/SIGTERM handlers before entering the metrics reporting context.
By the way, this problem might exist in many other places where we've added metrics: shard_client, host_scheduler, autoserv, and suite_scheduler all install signal handlers and set up metrics contexts. I haven't yet checked whether the problem exists there as well.

Comment 3 by autumn@chromium.org, Dec 13 2016

Labels: -current-issue
This won't be a problem unless the whole process group is sent SIGINT. Manually sending SIGINT to the parent process is safe.
Status: Available (was: Started)
Current analysis:

- If you send SIGINT/SIGTERM directly to monitor_db (via kill -SIGTERM ...), it is not forwarded to the child processes (as expected). This means that the metrics process finishes flushing and shutdown proceeds cleanly.

- If you run monitor_db.py from the terminal and hit Ctrl-C, SIGINT is sent to the whole process group (http://shortn/_bGHohqKiSZ). Hence there is a chance that the metrics flushing process will die first, after which monitor_db continues, tries to add a metric, and raises an exception.

- On lab servers, we run monitor_db via an upstart service. In this case, upstart sends SIGTERM to the whole process group: http://upstart.ubuntu.com/cookbook/#stopping-a-job
This means that we'll see the problem once again (and we did, hence this bug).

One way to fix this for the upstart case is via a pre-stop stanza that just SIGTERMs the parent process first: http://upstart.ubuntu.com/cookbook/#pre-stop
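One possible shape for such a stanza (a sketch only; the job name 'scheduler' and the PID lookup are illustrative, not the deployed lab config):

```
# Sketch: signal only the main process first, before upstart's normal
# stop path SIGTERMs the whole process group.
pre-stop script
    PID=$(status scheduler | awk '/running/ {print $NF}')
    [ -n "$PID" ] && kill -TERM "$PID"
    # Give the parent time to flush metrics and reap its children.
    sleep 5
end script
```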
Cc: jrbarnette@chromium.org ayatane@chromium.org
+ayatane, +jrbarnette for thoughts.

I can see two approaches here: 
- Add a pre-stop stanza to our upstart job.
- Set the signal_handler before forking the metrics process (http://shortn/_iR3qYgLk0Q). This means that the child process will inherit the signal handler (the global variable will be set, but will have no effect). The child process will exit only because the parent will respond by exiting the context that controls the child's lifetime.

Any preferences?
I think moving the signal_handler earlier is less hacky than using pre-stop: with pre-stop we would basically be re-implementing Upstart's stop behavior ourselves, and we would have to duplicate the waiting logic too, otherwise Upstart will immediately go to stopping and SIGTERM everything anyway.

The really robust solution is figuring out the proper semantics for the forked flushing process and installing the right signal handling behavior there, so it works correctly everywhere and does not have to rely on the parent process setting up the signal handler.
Per discussion with pprabhu, we should squelch any such exceptions in metrics.py or ts_mon_config.py. Other than syntax errors, metrics code should not emit exceptions to its callers.

Not sure if this bug is related to https://bugs.chromium.org/p/chromium/issues/detail?id=675221
I modified the toy program from http://stackoverflow.com/questions/10607553/python-multiprocessing-queue-what-to-do-when-the-receiving-process-quits


import multiprocessing
import sys
import time


class MyProcess(multiprocessing.Process):

    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.queue = multiprocessing.Queue()

    def run(self):
        print "Subprocess starting!"
        time.sleep(4)
        print "Subprocess exiting!"

    def addToQueue(self, num_objs):
        starttime = time.time()
        count = 0
        print "Adding %d objects to queue..." % num_objs
        while time.time() - starttime < 4 and count < num_objs:
            self.queue.put("string")
            count += 1
        print "Added %d objects!" % count
        self.queue.close()


if __name__ == "__main__":
    args = sys.argv
    assert len(args) == 2, "Usage: ./%s <num_objs>" % args[0]
    process = MyProcess()
    process.start()
    print "Waiting for a while"
    time.sleep(2)
    process.addToQueue(int(args[1]))
    time.sleep(1)
    print "Child process state: %d" % process.is_alive()

===============================================

On both chromeos-server2 and on my machine, this consistently fails for 3841 objects in the queue, but doesn't for 3840:

pprabhu@pprabhu:~/scratch$ python pp.py 3841
Waiting for a while
Subprocess starting!
Adding 3841 objects to queue...
Added 3841 objects!
Child process state: 1
Subprocess exiting!
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
pprabhu@pprabhu:~/scratch$ python pp.py 3840
Waiting for a while
Subprocess starting!
Adding 3840 objects to queue...
Added 3840 objects!
Child process state: 1
Subprocess exiting!
pprabhu@pprabhu:~/scratch$ 


That doesn't sound abnormal.  The fifo is buffered in the kernel, and if it fills up subsequent writes will block.  Nothing is consuming from the queue, so...
Re #11: Yeah, the cool part is the exact failure at 3841. I was trying to ferret out why that number is significant.

Anyway, turns out #10 is irrelevant. We use multiprocessing.Manager for shared data structures and have our own wrapper to try to make it behave better with signals (http://shortn/_3Wfb55w6cc).
I have a CL that should fix this. I have been trying (and failing) to write a unittest that makes it fail consistently.
 Issue 682729  has been merged into this issue.
Labels: -Pri-2 Pri-1
This is happening quite consistently. It slows down push-to-prod, and it slows down server_management_library tasks that set up / tear down servers.
It costs ~5 minutes each time we try to 'service scheduler stop'.
Status: Started (was: Available)
Here's the CL I referred to in #12.
https://chromium-review.googlesource.com/c/423174/

I'd like to merge it without a definitive unittest guaranteeing that I've fixed the real problem.
Project Member

Comment 16 by bugdroid1@chromium.org, Apr 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9a4d3909930ffc1043ea8d970a6fae5fd70efd2d

commit 9a4d3909930ffc1043ea8d970a6fae5fd70efd2d
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 07 21:15:38 2017

parallel: Also ignore SIGTERM in the child process.

parallel.Manager is used by the metrics library to manage shared data
structures with the background metrics flushing subprocess.  This
library is used by autotest in the lab. The client process is managed
via upstart which signals process exit via SIGTERM to the whole process
group. Similar to the problems with SIGINT (noted in the original CL:
crosreview.com/256698), this causes the manager to throw IOError when
the background process dies due to SIGTERM, but the client process tries
to continue using the managed queue.Queue during graceful shutdown.

BUG= chromium:673639 
TEST=- (new) unittest fails before change, passes afterwards.
     - 'sudo service scheduler restart' does not print a stacktrace
       after chromite roll.

Change-Id: I36fec560dc09cabfbcec38f0bf9b21152c803828
Reviewed-on: https://chromium-review.googlesource.com/423174
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/9a4d3909930ffc1043ea8d970a6fae5fd70efd2d/lib/parallel_unittest.py
[modify] https://crrev.com/9a4d3909930ffc1043ea8d970a6fae5fd70efd2d/lib/parallel.py

Issue 717375 has been merged into this issue.
Status: Available (was: Started)
Owner: pho...@chromium.org
Status: Started (was: Available)
phobbs@ has a CL for this: https://chromium-review.googlesource.com/c/499450/
Project Member

Comment 21 by bugdroid1@chromium.org, May 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/b872187ef6c9d6cb0dfa71d4484a1f1953c58520

commit b872187ef6c9d6cb0dfa71d4484a1f1953c58520
Author: Paul Hobbs <phobbs@google.com>
Date: Thu May 18 06:24:30 2017

metrics: .put metrics on Queue only if alive

Only .put metrics on the metrics flushing process' multiprocessing.Queue if it
is still alive; otherwise, do nothing. This avoids a performance problem whereby
the Queue fills up when the metrics processing has died.

BUG= chromium:673639 
TEST=unittests

Change-Id: I4864c6ca881d2b49a739a70c45dc557482b9d09e
Reviewed-on: https://chromium-review.googlesource.com/499450
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Paul Hobbs <phobbs@google.com>
Commit-Queue: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/b872187ef6c9d6cb0dfa71d4484a1f1953c58520/lib/ts_mon_config.py
[modify] https://crrev.com/b872187ef6c9d6cb0dfa71d4484a1f1953c58520/lib/metrics.py

Fixed?
Status: Fixed (was: Started)
Labels: VerifyIn-61

Comment 25 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
