infra monitoring does not support alerting on import errors
Issue description
If there is a Python import error, gae_ts_mon is not initialized, so even though an app may return HTTP 5xx for all responses (with a stack trace), our graphs won't show that and alerts won't fire. This problem could be solved by scraping logs, which are the actual source of truth, rather than relying on what the gae_ts_mon handler decorators/instrumentation see. This was the case with buildbucket today: there was an import error because gae_ts_mon started to depend on something that must be DEPSed in, but I did not run gclient sync before deploying from master.
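To illustrate the log-scraping idea, here is a minimal sketch that computes the share of 5xx responses from request log lines. The `status=<code>` log format, the function names, and the 0.5 threshold are all assumptions made for illustration; they are not taken from gae_ts_mon or from any real App Engine log API.

```python
import re

# Hypothetical log format: each request line contains "status=<code>".
STATUS_RE = re.compile(r"\bstatus=(\d{3})\b")


def server_error_ratio(log_lines):
    """Return the fraction of request log lines that carry a 5xx status."""
    total = 0
    errors = 0
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # Not a request line; ignore it.
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return errors / total if total else 0.0


def should_alert(log_lines, threshold=0.5):
    """Alert when at least `threshold` of the logged requests failed with 5xx."""
    return server_error_ratio(log_lines) >= threshold
```

Because this reads the logs rather than in-process counters, it would still see the failures when the instrumentation itself never got initialized.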
,
Aug 5 2016
Not sure; maybe it is an inherent problem with Python. If there were a separate monitoring service that scrapes logs, this problem would be solved. The bug describes a user problem and signifies a problem with the design, not with the implementation. Maybe it is not a problem with gae_ts_mon but a problem with infra monitoring in general: we don't have alerting on this kind of problem. I've changed the bug title; not sure it is WAI now.
,
Aug 5 2016
In a nutshell, the problem is that we don't receive alerts even though an app is completely down. This is not WAI.
,
Aug 5 2016
OK, agreed - we should also have blackbox monitoring in addition to whitebox, which is gae_ts_mon. I believe monorail employs some blackbox monitoring using probes (?). AFAIK, this is a separate pipeline (yet another one), and I'm not up to speed on it yet. dsansome@: do you have any comments on this?
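For reference, a blackbox probe can be as small as the sketch below: fetch a URL from outside the app and judge health only by the HTTP status. This is an assumed, generic probe, not the pipeline monorail uses; `fetch_status` and `is_healthy` are hypothetical names.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET on `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (4xx or 5xx).
        return e.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def is_healthy(status):
    """Treat an unreachable app or any 5xx response as unhealthy."""
    return status is not None and status < 500
```

A probe like this does not depend on the app's own instrumentation, so an app that answers every request with HTTP 500 would be reported as unhealthy.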
,
Aug 5 2016
,
Aug 8 2016
Blackbox monitoring wouldn't have helped detect this if the app was still up? You could just alert if your request rate drops to 0. Either way, is there a bug for being able to deploy an app without running gclient sync?
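The "request rate drops to 0" check could be sketched roughly as follows. It assumes per-interval request counts are available oldest first; the function name and the default window of 5 intervals are illustrative choices, not an existing alerting rule.

```python
def traffic_dropped_to_zero(counts, window=5):
    """Return True if the last `window` intervals saw no requests
    while at least one earlier interval did.

    counts: per-interval request counts, oldest first.
    """
    if len(counts) <= window:
        return False  # Not enough history to compare against.
    recent = counts[-window:]
    earlier = counts[:-window]
    return all(c == 0 for c in recent) and any(c > 0 for c in earlier)
```

Requiring earlier non-zero traffic keeps the check from firing for an app that has never received requests.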
,
Jan 18 2017
If this is really a Pri-1, find an owner and update the priority. This is the result of a bulk edit that moved high priority available bugs to a lower priority in an attempt to be more honest with bug filers.
,
Mar 15 2017
Comment 1 by sergeybe...@chromium.org
, Aug 5 2016