Add Shard Disk utilization metric to Monarch
Issue description: A shard's disk can fill up when it holds too many result logs (e.g. large crash infos). Once the disk is full, some services are affected; for example, shard RPCs stop working. Repartitioning the shard so that the logs are separated from the other services may be a good solution.
Aug 24 2016
> Another option could be to just cap the size / quantity of
> the logs that we store.

Rotating logs would be good, but log size isn't quite the problem. The problem is that logs and test results are in the same file system. So, when the file system fills up because of a sudden excess of test results, we lose the ability to write logs. So, capping logs won't help; we need to cap test results.

Also, I'm not aware of any mechanism we have available that would make it easy to cap either log sizes or results directory sizes. Perhaps Linux supports putting a size limit on a directory hierarchy?

> https://bugs.chromium.org/p/chromium/issues/detail?id=638641 MAY
> also help some....

Well, that fix is mostly orthogonal. It could make the problem of "the disk filled up" less common, but it can't mitigate the failure once it has occurred.
Aug 24 2016
I think the best solution is to become aware of the disk issue before it crashes the servers. Thanks to ayatane@, the shard disk utilization metric will be added to Monarch. The plan is to send an alert when disk utilization is >= 75%. Once this metric is added, we can be notified of a potential disk issue and prevent it from crashing the server. ayatane@, I'm not sure whether you already have a bug for the metric. If you do, you can merge this one into it.
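For illustration only, the 75% check described above could be sketched as follows. This is not the actual Monarch integration: the function names and the default path are hypothetical, and reporting the value to Monarch is not shown. It uses only Python's standard `shutil.disk_usage`.

```python
import shutil

# Threshold from the plan above: alert when utilization is >= 75%.
ALERT_THRESHOLD = 0.75


def disk_utilization(path="."):
    """Return the used fraction (0.0 to 1.0) of the file system holding `path`.

    `path` is a hypothetical default; on a shard it would be the mount
    point that stores logs and test results.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(utilization, threshold=ALERT_THRESHOLD):
    """Return True when the utilization has reached the alert threshold."""
    return utilization >= threshold


if __name__ == "__main__":
    current = disk_utilization()
    print("disk utilization: {:.1%}, alert: {}".format(current, should_alert(current)))
```

A check like this would only produce the value to report; the alerting rule itself would presumably live in the monitoring system rather than on the shard.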
Aug 25 2016
Comment 1 by aut...@google.com
, Aug 23 2016