
Issue 635106

Starred by 2 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




infra monitoring does not support alerting on import errors

Project Member Reported by no...@chromium.org, Aug 5 2016

Issue description

If there is a Python import error, gae_ts_mon is not initialized. Even though the app may then return HTTP 5xx for all responses (with a stack trace), our graphs won't show that and alerts won't fire.

This problem could be solved by scraping logs, which are the actual source of truth, rather than relying on what the gae_ts_mon handler decorators/instrumentation see.

This was the case with buildbucket today. There was an import error because gae_ts_mon started to depend on something that must be pulled in via DEPS, but I did not run gclient sync before deploying from master.
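
The log-scraping idea above can be sketched roughly as follows. This is only an illustration, not how infra monitoring works: the log line format (`<timestamp> <method> <path> <status>`), the function names, and the 50% threshold are all assumptions made up for the example. The point is that the check runs outside the app, so it keeps working even when the app cannot import gae_ts_mon.

```python
import re
from typing import Iterable

# Hypothetical request-log format: '<timestamp> <method> <path> <status>'.
# Real App Engine logs look different; adjust the pattern accordingly.
STATUS_RE = re.compile(r"\s(\d{3})$")


def server_error_ratio(log_lines: Iterable[str]) -> float:
    """Return the fraction of parsed request lines that ended in HTTP 5xx."""
    total = 0
    errors = 0
    for line in log_lines:
        match = STATUS_RE.search(line.strip())
        if match is None:
            continue  # Not a request record (e.g. a stack trace line).
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return errors / total if total else 0.0


def should_alert(log_lines: Iterable[str], threshold: float = 0.5) -> bool:
    """Alert when the 5xx ratio reaches the (assumed) threshold."""
    return server_error_ratio(log_lines) >= threshold
```

With an import error, every request line would carry a 5xx status, so the ratio would be 1.0 and the alert would fire regardless of what the in-app instrumentation reports.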
 
Sorry, I'm confused. What if your log scraping code breaks on imports, then what?

Honestly, I'm tempted to mark this WAI... Let me know if I'm missing something.

Comment 2 by no...@chromium.org, Aug 5 2016

Summary: infra monitoring does not support alerting on import errors (was: gae_ts_mon python does not work on import errors)
Not sure. Maybe it is an inherent problem with Python.

If there were a separate monitoring service that scrapes logs, this problem would be solved.

The bug describes a user problem and signifies a problem with the design, not with the implementation. Maybe it is not a problem with gae_ts_mon, but a problem with infra-monitoring in general: we don't have alerting on this kind of problem. I've changed the bug title; not sure it is WAI now.

Comment 3 by no...@chromium.org, Aug 5 2016

In a nutshell, the problem is that we don't receive alerts even though an app is completely down. This is not WAI.
OK, agreed - we should also have blackbox monitoring in addition to whitebox, which is gae_ts_mon. I believe monorail employs some blackbox monitoring using probes (?).

AFAIK, this is a separate pipeline (yet another one), and I'm not up to speed on it yet.

dsansome@: do you have any comments on this?
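
For illustration, the blackbox probing mentioned above could look roughly like this. It is a minimal sketch using only the Python 3 standard library; `fetch_status`, `probe_is_healthy`, and the "anything below 500 is healthy" rule are assumptions for the example and do not describe the probes monorail actually uses.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises HTTPError for 4xx/5xx responses.


def probe_is_healthy(url: str,
                     fetch: Callable[[str], int] = fetch_status) -> bool:
    """Hypothetical health rule: any status below 500 counts as healthy."""
    try:
        status = fetch(url)
    except OSError:
        return False  # Connection failures and timeouts count as down.
    return status < 500
```

Because the probe only looks at the response from the outside, an app that answers every request with HTTP 500 after a failed import would be reported as unhealthy even though none of its own metrics are being sent.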
Status: Available (was: Untriaged)
Blackbox monitoring wouldn't have helped detect this if the app was still up?
You could just alert if your request rate drops to 0.

Either way, is there a bug for being able to deploy an app without running gclient sync?
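
The "alert if the request rate drops to 0" suggestion above can be sketched like this. It assumes the alerting side has a list of timestamps (in seconds) for the requests the app last reported; the function name and the 5-minute window are made up for the example.

```python
from typing import Sequence


def request_rate_is_zero(request_times: Sequence[float],
                         now: float,
                         window_seconds: float = 300.0) -> bool:
    """Return True if no request was reported in the trailing window."""
    window_start = now - window_seconds
    return not any(window_start < t <= now for t in request_times)
```

If gae_ts_mon never initializes, the app reports no requests at all, so a check like this would fire even though the app is still serving (failing) responses. It would also fire for an app that is simply idle, so it only suits services with steady traffic.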
Labels: Pri-2
If this is really a Pri-1, find an owner and update the priority.

This is the result of a bulk edit that moved high priority available bugs to a lower priority in an attempt to be more honest with bug filers.
Status: WontFix (was: Available)
