Send email alert when a server's disk usage is over a threshold.
Issue description
We hit a case where a shard's disk filled up with result logs, which stopped the services on that shard from working. In other words, the shard was dead. To avoid this kind of deterioration, we could have nagios send an email alert when a server's disk usage gets somewhat high, so we can rescue the server before it dies.
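For illustration, a threshold check of the kind requested here could look roughly like the sketch below. It is not the check nagios actually runs in the lab. The 80%/90% thresholds and the function names `classify_usage` and `check_disk` are assumptions made for the example. The exit codes follow the usual nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import shutil

# Status codes following the usual nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_usage(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction (0.0 to 1.0) to a status code.

    The default thresholds are illustrative assumptions, not lab policy.
    """
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=0.80, crit=0.90):
    """Return (status, used_fraction) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    return classify_usage(used_fraction, warn, crit), used_fraction
```

A wrapper script would print a one-line summary and exit with the returned status, so that the WARNING level fires while there is still time to clean up result logs.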
,
Aug 24 2016
We have disk space monitoring on all lab servers through nagios. For devservers in ACL-enabled subnets, this data is not collected.
,
Aug 24 2016
What's the alert's keyword? I searched my inbox and can't find any alerts for chromeos-server74.cbf (the shard that was "down" on the morning of Aug 16th).
,
Aug 24 2016
https://groups.google.com/a/google.com/forum/#!searchin/chromeos-build-alerts/chromeos-server74$20disk%7Csort:relevance/chromeos-build-alerts/kZTPN9UZrLI/8OEFuFTqBgAJ
The root cause for that shard was some kind of disk IO issue, which led to gs_offloader taking a long time to upload results.
,
Aug 24 2016
That shard was full on the morning of Aug 16, which kept many services from working properly. Didn't that trigger an alert at the time? The newest alert I see for that shard reporting a root partition problem is from Wed Aug 17 12:17:24, not the same day.
,
Aug 24 2016
Some sad background info: the shard was added to hosts.cfg in nagios by Charlene a while ago, but we didn't restart nagios, which is why the server was never monitored...
,
Aug 24 2016
oh! understood~
,
Aug 24 2016
> some sad background info. The shard was added to hosts.cfg
> in nagios by Charlene a while ago, but we didn't restart
> nagios, that's why the server was never monitored...

When human error occurs, it's a bug in the software. Can someone file a _new_ bug requesting that we create foolproof administrative mechanisms so that this can't happen?
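As one possible shape for such a mechanism, the sketch below flags config files that were modified after the monitoring daemon last loaded its configuration. It is only a sketch under assumptions. The names `config_is_stale` and `find_stale_configs` are hypothetical. How `last_reload_time` is obtained (for example, from the daemon's start time) depends on the deployment and is not shown.

```python
import os


def config_is_stale(config_mtime, last_reload_time):
    """True if the config changed after the daemon last loaded it.

    Both arguments are POSIX timestamps in seconds.
    """
    return config_mtime > last_reload_time


def find_stale_configs(config_paths, last_reload_time):
    """Return the config files edited since the last reload.

    last_reload_time is assumed to be supplied by the caller.
    """
    return [
        path
        for path in config_paths
        if config_is_stale(os.path.getmtime(path), last_reload_time)
    ]
```

A periodic job could alert whenever the returned list is non-empty, so a host added to hosts.cfg without a restart would be noticed instead of silently going unmonitored.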
,
Aug 24 2016
"When human error occurs, it's a bug in the software." Is that a slogan from some book?
Bug filed: https://bugs.chromium.org/p/chromium/issues/detail?id=640442
Comment 1 by aut...@google.com
, Aug 23 2016
Labels: Hotlist-Fixit