Consider respawning schedulers with more retries and a longer wait time
Issue description
Goobuntu sometimes forces security updates, which leads to a MySQL server restart in the lab. The schedulers (monitor_db, host_scheduler) crash very quickly when the db is not available. Currently the respawn setting is 10 times within 300 seconds. We should consider adding some wait time before retrying, e.g., waiting 1 minute before each retry, and extending the timeout from 5 min to 10 min. This affects both the master and the shards.
,
Aug 2 2016
There isn't a way to do this with "respawn limit", but we could have a special exit code for the process which indicates we should sleep before exiting the upstart job.
,
Aug 2 2016
Assigning to deputy (Don) who can sync with Richard to look at what needs to be done.
,
Aug 2 2016
My recommendation would be to skip trying to do this in the upstart job, and instead change monitor_db.py to do some sort of check for "database down" at startup, and then include a long wait (with retries) before giving up. Also, this is the sort of event that should generate some sort of e-mail alert.
,
Aug 2 2016
I have no idea how these processes are launched or managed. Would changes in startup in monitor_db be sufficient?
,
Aug 2 2016
> Would changes in startup in monitor_db be sufficient?
Should be. Part of monitor_db initialization involves a lot of DB queries to recover/reconstruct state. So, the specific change would be to start that recovery off with a dummy query to confirm that the database is up. If it fails, retry for some extended period of time (or possibly even retry forever). If we retry for long enough, eventually we can terminate without triggering the upstart respawn limit.
,
Aug 2 2016
The sanity check sounds like a good approach. We need to do this in both monitor_db.py and host_scheduler.py. Let's add a util function to be shared in both places.
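A rough sketch of what such a shared util could look like, purely for illustration. The name `wait_for_database`, its parameters, and the `check_connection` callable are all hypothetical and not existing autotest code; the 60-second interval and 10-minute timeout are just the numbers floated in the issue description.

```python
import time


def wait_for_database(check_connection, timeout_secs=600, interval_secs=60,
                      sleep=time.sleep, clock=time.monotonic):
    """Block until check_connection() succeeds or timeout_secs elapses.

    check_connection is a hypothetical callable that runs a dummy query
    (for example "SELECT 1") and raises if the database is unreachable.
    Returns True once the database answers, False if the timeout expires.
    sleep and clock are injectable only so the helper can be unit tested.
    """
    deadline = clock() + timeout_secs
    while True:
        try:
            check_connection()
            return True
        except Exception:
            # Assumption: a real version would catch only the DB driver's
            # connection error type instead of every exception.
            if clock() + interval_secs > deadline:
                # Another wait would run past the deadline, so give up.
                return False
            sleep(interval_secs)
```

Both monitor_db.py and host_scheduler.py could call this before starting state recovery and exit with an error if it returns False. With a wait of this length, a single process lifetime already exceeds the 300-second window, so the process should not be able to hit the "10 times within 300 seconds" respawn limit just because the database is down.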
,
Aug 2 2016
Hey Charlene, can you take a look?
,
Aug 8 2016
,
Aug 8 2016
,
Aug 8 2016
Do we have any idea of how common and how impactful this is? For triaging, as usual.
,
Jan 11 2017
,
Jan 31 2017
,
Mar 10 2018
Comment 1 by dshi@chromium.org
, Aug 2 2016
Labels: -Pri-3 Pri-1
Owner: sbasi@chromium.org