Issue 676062

Issue metadata

Status: Archived
Owner: ----
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature



automatically turn Nagios alerts off during devserver maintenance, then back on

Project Member Reported by semenzato@chromium.org, Dec 20 2016

Issue description

This morning I noticed by chance that chromeos2-devserver7 was down. (I was looking at the check_health results in a log.) I filed bug 675993 and it was brought back up, but it went down again quickly while I was grepping through logs.

I was able to take a quick look at /var/log/sysstat and /var/log/syslog and it appeared that the machine was down for a few multi-day stretches between about Dec. 1 and now.  (For instance, it was always down between Dec 7 and Dec 13.)

In fact viceroy shows that it was pretty much always down after Dec 7, when it reached a load average of 160 processes per CPU, and likely something melted then. :)

That's a long time, and this would impact capacity.  Do we have any mechanism in place to notice this?
 

Comment 1 by sbasi@chromium.org, Dec 20 2016

Cc: dshi@chromium.org
We have Nagios alerts set up for these issues, but sadly I did not receive any emails about this particular devserver today.

Comment 2 by dshi@chromium.org, Dec 20 2016

The devserver was set under maintenance a while ago, so the alerts were disabled:
http://chromeos-mcp/cgi-bin/nagios3/extinfo.cgi?type=1&host=chromeos2-devserver7

Apparently nobody re-enabled the alerts when it went back into prod.
Labels: -Pri-1 -Type-Bug Pri-2 Type-Feature
Summary: automatically turn Nagios alerts off during devserver maintenance, then back on (was: are we not noticing when devservers are down?)
So we need an alert to remind us to turn on the alerts?

As a last resort, an open bug might do. It would be nicer if there was a way of associating "coming out of maintenance" with "re-enable alerts". But I can only make the obvious high-level comment. Might someone, perhaps in a different team, be interested in looking into this?
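
One possible shape for such a reminder, as a minimal sketch only: periodically scan Nagios's status file and list the hosts whose notifications are currently switched off, so a forgotten re-enable shows up in a report. This assumes the usual `status.dat` layout of `hoststatus { ... }` blocks with `host_name=` and `notifications_enabled=` lines; the function name and the sample data are hypothetical, and nothing here reflects how the lab is actually set up.

```python
def hosts_with_notifications_disabled(status_text):
    """Return sorted host names whose hoststatus block has notifications_enabled=0.

    Assumes the Nagios status.dat layout: 'hoststatus {' opens a block,
    a lone '}' closes it, and fields are 'key=value' lines in between.
    """
    disabled = []
    fields = None  # None means we are not inside a hoststatus block
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("hoststatus") and line.endswith("{"):
            fields = {}
        elif line == "}" and fields is not None:
            if fields.get("notifications_enabled") == "0":
                disabled.append(fields.get("host_name", "<unknown>"))
            fields = None
        elif fields is not None and "=" in line:
            key, value = line.split("=", 1)
            fields[key] = value
    return sorted(disabled)


# Hypothetical sample data, for illustration only.
SAMPLE_STATUS = """
hoststatus {
    host_name=devserver-a
    notifications_enabled=0
    }

hoststatus {
    host_name=devserver-b
    notifications_enabled=1
    }

servicestatus {
    host_name=devserver-b
    notifications_enabled=0
    }
"""

if __name__ == "__main__":
    for host in hosts_with_notifications_disabled(SAMPLE_STATUS):
        print("notifications still disabled:", host)
```

A cron job could mail this list daily; hosts that are legitimately under maintenance would still appear, so someone has to read it.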

Comment 4 by dshi@chromium.org, Dec 20 2016

Yeah, it's more of a procedural thing. Every time a devserver is added or removed, the corresponding changes should be made in Nagios's hosts.cfg.

Will add a comment there:
https://chrome-internal-review.googlesource.com/#/c/313618/2/puppet/modules/lab/templates/shadow_config/sections/dev_server_common.ini.erb
Any remaining work here? Dan's updated the doc as a temporary measure. 

Comment 6 by dshi@chromium.org, Jan 4 2017

Status: Fixed (was: Untriaged)
There is no automated way to handle this. A warning in the config is pretty much the best we can do. A more complicated option is to add a presubmit check on the shadow config, but I think that's overkill.
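
For reference, the shadow config check mentioned above could look roughly like the sketch below: collect the devserver hostnames from the rendered config and report any that have no `host_name` entry in Nagios's hosts.cfg. The `dev_server` key holding a comma-separated list of URLs is an assumption about the config format, the function names are hypothetical, and the sample inputs are made up. It also only compares host lists; it does not know whether alerts are enabled for a host.

```python
import re
from urllib.parse import urlparse


def devserver_hosts(config_text, key="dev_server"):
    """Collect hostnames from 'key: url,url' or 'key = url,url' lines.

    The key name and the comma-separated URL format are assumptions.
    Entries without a scheme (no 'http://') are skipped.
    """
    hosts = set()
    pattern = r"^\s*" + re.escape(key) + r"\s*[:=]\s*(.*)$"
    for value in re.findall(pattern, config_text, flags=re.MULTILINE):
        for entry in value.split(","):
            hostname = urlparse(entry.strip()).hostname
            if hostname:
                hosts.add(hostname)
    return hosts


def nagios_hosts(hosts_cfg_text):
    """Collect every 'host_name <name>' directive from a Nagios object file."""
    return set(
        re.findall(r"^\s*host_name\s+(\S+)", hosts_cfg_text, flags=re.MULTILINE)
    )


def missing_from_nagios(config_text, hosts_cfg_text):
    """Return sorted devserver hostnames with no matching Nagios host entry."""
    return sorted(devserver_hosts(config_text) - nagios_hosts(hosts_cfg_text))


# Hypothetical inputs, for illustration only.
SAMPLE_CONFIG = "[CROS]\ndev_server: http://devserver-a:8082,http://devserver-b:8082\n"
SAMPLE_HOSTS_CFG = "define host {\n    host_name devserver-a\n    use generic-host\n}\n"

if __name__ == "__main__":
    for host in missing_from_nagios(SAMPLE_CONFIG, SAMPLE_HOSTS_CFG):
        print("devserver not monitored by Nagios:", host)
```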

Comment 7 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 8 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 9 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 11 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)