
Issue 638640

Starred by 1 user

Issue metadata

Status: WontFix
Owner: ----
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: ----




Send an email alert when a server's disk usage is over a threshold.

Project Member Reported by xixuan@chromium.org, Aug 17 2016

Issue description

We hit a case where a shard's disk filled up with result logs, which stopped the services on that shard from working. In other words, the shard was dead.

To catch this earlier, we could have nagios send an email alert when a server's disk usage gets fairly high, so that we can rescue the server before it dies.
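
For illustration only, the kind of check described above could look like the following minimal sketch. It follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL). The function name `check_disk`, the 80%/90% thresholds and the message format are assumptions made for this example; they are not taken from the lab's actual nagios setup.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0, usage=None):
    """Return (exit_code, message) for the filesystem containing `path`.

    `usage` is an optional (total, used, free) tuple in bytes, so the
    logic can be exercised without touching a real disk. The default
    thresholds are hypothetical.
    """
    if usage is None:
        usage = shutil.disk_usage(path)  # namedtuple(total, used, free)
    total, used = usage[0], usage[1]
    used_pct = 100.0 * used / total

    if used_pct >= crit_pct:
        return CRITICAL, f"DISK CRITICAL - {path} is {used_pct:.1f}% full"
    if used_pct >= warn_pct:
        return WARNING, f"DISK WARNING - {path} is {used_pct:.1f}% full"
    return OK, f"DISK OK - {path} is {used_pct:.1f}% full"
```

A wrapper script would print the message and exit with the returned code, which is what lets the monitoring system decide whether to send an email.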

 

Comment 1 by aut...@google.com, Aug 23 2016

Cc: dshi@chromium.org
Labels: Hotlist-Fixit
+ dan for input, also marking as a fixit candidate

Comment 2 by dshi@chromium.org, Aug 24 2016

Status: WontFix (was: Untriaged)
We have disk space monitoring on all lab servers through nagios. For devservers in the ACL-enabled subnet, this data is not collected.

Comment 3 by xixuan@chromium.org, Aug 24 2016

What's the alert's keyword?

I searched my inbox and can't find any alerts for chromeos-server74.cbf (the shard that was "down" on the morning of Aug 16th).

Comment 4 by dshi@chromium.org, Aug 24 2016

https://groups.google.com/a/google.com/forum/#!searchin/chromeos-build-alerts/chromeos-server74$20disk%7Csort:relevance/chromeos-build-alerts/kZTPN9UZrLI/8OEFuFTqBgAJ

The root cause for that shard was some kind of disk IO issue, which led to gs_offloader taking a long time to upload results.

Comment 5 by xixuan@chromium.org, Aug 24 2016

That shard was full on the morning of Aug 16, which kept many services from working properly. Wouldn't that have triggered an alert at the time?

The most recent alert I see for that shard reporting a root partition problem is from Wed Aug 17 12:17:24, not the same day.

Comment 6 by dshi@chromium.org, Aug 24 2016

Some sad background info. The shard was added to hosts.cfg in nagios by Charlene a while ago, but we didn't restart nagios, that's why the server was never monitored...

Comment 7 by xixuan@chromium.org, Aug 24 2016

oh! understood~ 
> some sad background info. The shard was added to hosts.cfg
> in nagios by Charlene a while ago, but we didn't restart
> nagios, that's why the server was never monitored...

When human error occurs, it's a bug in the software.

Can someone file a _new_ bug requesting that we create foolproof
administrative mechanisms so that this can't happen?
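
As a rough sketch of one such mechanism (not something this thread says was implemented), a periodic job could flag a config file that was edited after the monitoring daemon last started. The function names are hypothetical, and so is the assumption that the daemon rewrites its lock/pid file on every start, which makes that file's mtime a usable approximation of the start time.

```python
import os


def is_stale(config_mtime, daemon_started_at):
    """True if the config changed after the daemon last (re)started."""
    return config_mtime > daemon_started_at


def config_needs_reload(config_path, pid_file_path):
    """Compare a config file's mtime with the daemon's pid/lock file mtime.

    Assumption: the pid/lock file is rewritten on every daemon start.
    Both paths are supplied by the caller; none are hard-coded here.
    """
    return is_stale(os.path.getmtime(config_path),
                    os.path.getmtime(pid_file_path))
```

If `config_needs_reload` returns True for hosts.cfg, the job could raise an alert or trigger a reload, so a newly added host does not silently go unmonitored.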

Comment 9 by xixuan@chromium.org, Aug 24 2016

When human error occurs, it's a bug in the software.

--- is that a slogan from a book?


Bug filed: https://bugs.chromium.org/p/chromium/issues/detail?id=640442
