
Issue 633654

Starred by 3 users

Issue metadata

Status: Archived
Owner: ----
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Consider respawning schedulers with more retries and a longer wait time

Project Member Reported by dshi@chromium.org, Aug 2 2016

Issue description

Goobuntu sometimes forces security updates, which leads to a MySQL server restart in the lab. The schedulers (monitor_db, host_scheduler) crash very quickly when the db is not available. Currently the respawn setting is 10 times within 300 seconds. We should consider adding some wait time before retrying, e.g., wait for 1 minute before each retry, and extend the timeout from 5 min to 10 min.

This affects both master and shards.
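For context, the respawn setting described above corresponds to an upstart stanza along these lines (a sketch only; the actual job file in the lab may differ):

```
# Hypothetical fragment of the scheduler's upstart job.
# "respawn limit 10 300" means: stop respawning the job if it
# crashes more than 10 times within 300 seconds.
respawn
respawn limit 10 300
```

Extending the window from 5 to 10 minutes would mean changing the stanza to `respawn limit 10 600`.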

 

Comment 1 by dshi@chromium.org, Aug 2 2016

Cc: -sbasi@chromium.org jrbarnette@chromium.org
Labels: -Pri-3 Pri-1
Owner: sbasi@chromium.org
Assigning to the lead to find an owner for the bug.

+Richard, who might know the upstart magic to do that. I can't find a waiting-time setting for respawn.
There isn't a way to do this with "respawn limit", but we could have a special exit code for the process which indicates we should sleep before exiting the upstart job.
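The sleep-before-exit idea could be sketched like this (hypothetical exit code and function name; the real scheduler entry points would call this when the DB connection fails):

```python
import sys
import time

# Hypothetical exit code signalling "database was unavailable";
# an upstart stanza could match on it to alter respawn behavior.
DB_UNAVAILABLE_EXIT_CODE = 42

def exit_after_backoff(delay_secs=60):
    """Sleep before exiting so that rapid crash loops don't exhaust
    the upstart respawn limit (e.g. 10 respawns within 300 seconds)."""
    time.sleep(delay_secs)
    sys.exit(DB_UNAVAILABLE_EXIT_CODE)
```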

Comment 3 by sbasi@chromium.org, Aug 2 2016

Owner: dgarr...@chromium.org
Assigning to deputy (Don) who can sync with Richard to look at what needs to be done.
My recommendation would be to skip trying to do this in the upstart
job, and instead change monitor_db.py to do some sort of check for
"database down" at startup, and then include a long wait (with retries)
before giving up.

Also, this is the sort of event that should generate some sort of e-mail
alert.

I have no idea how these processes are launched or managed.

Would changes in startup in monitor_db be sufficient?
> Would changes in startup in monitor_db be sufficient?

Should be.  Part of monitor_db initialization involves a lot
of DB queries to recover/reconstruct state.  So, the specific
change would be to start that recovery off with a dummy query
to confirm that the database is up.  If it fails, retry for
some extended period of time (or possibly even retry forever).

If we retry for long enough, eventually we can terminate without
triggering the upstart respawn limit.

Comment 7 by dshi@chromium.org, Aug 2 2016

The sanity check sounds like a good approach. We need to do this in both monitor_db.py and host_scheduler.py, so let's add a util function shared by both places.
Cc: fdeng@chromium.org
Owner: shuqianz@chromium.org
Hey Charlene, can you take a look?
Labels: iptaskforce
Labels: -iptaskforce
Do we have any idea of how common and how impactful this is?  For triaging, as usual.
Labels: -Pri-1 Pri-3
Owner: ----
Status: Untriaged (was: Available)
Status: Archived (was: Untriaged)
