automatically turn Nagios alerts off during devserver maintenance, then back on
Issue description
This morning I noticed by chance that chromeos2-devserver7 was down. (I was looking at the check_health results in a log.) I filed bug 675993 and it was brought up, but it went down again quickly as I was grepping through logs. I was able to take a quick look at /var/log/sysstat and /var/log/syslog, and it appeared that the machine was down for a few multi-day stretches between about Dec. 1 and now. (For instance, it was always down between Dec 7 and Dec 13.) In fact, viceroy shows that it was pretty much always down after Dec 7, when it reached a load average of 160 processes per CPU, and likely something melted then. :) That's a long time, and this would impact capacity. Do we have any mechanism in place to notice this?
Dec 20 2016
The devserver was put under maintenance a while ago, so the alerts were disabled: http://chromeos-mcp/cgi-bin/nagios3/extinfo.cgi?type=1&host=chromeos2-devserver7 Apparently nobody re-enabled the alerts when it went back into prod.
Dec 20 2016
So we need an alert to remind us to turn on the alerts? As a last resort, an open bug might do. It would be nicer if there were a way of associating "coming out of maintenance" with "re-enable alerts". But I can only make the obvious high-level comment. Might someone, perhaps in a different team, be interested in looking into this?
Dec 20 2016
Yeah, it's more of a procedure thing. Every time a devserver is added or removed, the corresponding changes should be made in Nagios's hosts.cfg. Will add a comment there: https://chrome-internal-review.googlesource.com/#/c/313618/2/puppet/modules/lab/templates/shadow_config/sections/dev_server_common.ini.erb
Jan 4 2017
Any remaining work here? Dan's updated the doc as a temporary measure.
Jan 4 2017
There is no automated way to handle this issue. A warning in the config is pretty much the best we can do. A more complicated approach is to add a presubmit check on the shadow config, but I think that's overkill.
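For reference, a rough sketch of what an automated check of this kind could look like. It is not an existing tool: the function names are hypothetical, the list of production devservers is assumed to be supplied by the caller (for example, extracted from the shadow config), and the input is assumed to be Nagios status.dat-style text with `hoststatus { ... }` blocks containing `host_name=` and `notifications_enabled=` lines. chromeos2-devserver8 is a made-up host used only for the example.

```python
import re

# Assumption: status.dat-style text where each host is described by a
# "hoststatus { ... }" block of key=value lines.
HOSTSTATUS_BLOCK = re.compile(r"hoststatus\s*\{(.*?)\}", re.DOTALL)


def parse_notifications(status_text):
    """Map host name -> True if notifications are enabled for that host."""
    result = {}
    for block in HOSTSTATUS_BLOCK.findall(status_text):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
        if "host_name" in fields:
            result[fields["host_name"]] = (
                fields.get("notifications_enabled") == "1"
            )
    return result


def silenced_prod_hosts(prod_hosts, status_text):
    """Return production hosts with alerts off, or unknown to Nagios."""
    enabled = parse_notifications(status_text)
    return sorted(h for h in prod_hosts if not enabled.get(h, False))


# Hypothetical sample input, for illustration only.
SAMPLE = """
hoststatus {
    host_name=chromeos2-devserver7
    notifications_enabled=0
    }
hoststatus {
    host_name=chromeos2-devserver8
    notifications_enabled=1
    }
"""

if __name__ == "__main__":
    prod = ["chromeos2-devserver7", "chromeos2-devserver8"]
    print(silenced_prod_hosts(prod, SAMPLE))
```

Hosts that are listed as production but missing from the status text are reported too, on the assumption that an unmonitored production devserver is as much of a problem as a silenced one.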
Comment 1 by sbasi@chromium.org, Dec 20 2016