Use PLX Alerts to send email alerts to Sheriffs
Issue description

PLX Alerts is an end-to-end solution for taking SQL queries, executing them on a continuous basis, and then firing and ack'ing alerts based on their output. This item tracks implementing a basic set of alerts to bring the sheriffs to action. The email alert will include a link to the PLX Alerts dashboard for silencing and a link to Sheriff-o-Matic to begin investigating the alert. At a minimum, we should start with alerts for post-submit and release builder failures. We can get more sophisticated (for example, alerting on builders that have been non-critical for an excessive period, e.g. for weeks) as time allows. We can also look to expand to cover special use cases like Gardening and Constables once those PFQs are merged into the main CQ. For now, Infra folks will go to the PLX Alerts dashboard to silence all alerts during Infra outages.
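As a very rough sketch of the kind of check PLX Alerts would run here (purely illustrative: the BigQuery table and column names below are placeholders, since the real BQ export does not exist yet):

# Hypothetical sketch only: `placeholder_dataset.builds` and its columns are
# invented. PLX Alerts would execute a query like this on a schedule and fire
# (and later ack) an alert whenever it returns rows.

FAILED_POSTSUBMIT_OR_RELEASE_BUILDS_SQL = """
SELECT builder, build_id, end_time
FROM `placeholder_dataset.builds`
WHERE build_config IN ('postsubmit', 'release')
  AND status = 'FAILURE'
  AND end_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

def should_fire(rows):
  """Fire the sheriff email (with PLX dashboard + Sheriff-o-Matic links) if
  the scheduled query returned any failed builds."""
  return bool(rows)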
Nov 15
We should investigate the PLX Alerts to Alertmanager bridge, too, to see if it's viable so that we can unify silencing infra alerts with guardian alerts.
Nov 15
Earlier I suggested PLX Alerts for cases where existing solutions wouldn't work. What are the cases in this bug? What are examples of SQL queries? If it is just "alert me when a builder fails", this can be done with existing infra:
- BuildFailureAlert and InfraFailureRatioHighAlert produce an AlertManager alert, which has silencing and is integrated into on-call. These functions are just helpers for go/buildbucket-monarch metrics: https://cs.corp.google.com/piper///depot/google3/configs/monitoring/chrome_ops_foundation/common/buildbucket.py?g=0&rcl=213280154&l=186
- luci-notify.appspot.com can send an email when a builder has failed. The email template is customizable. luci-notify does not support silencing.
+packrat, who might have concerns with PLX Alerts.
Nov 16
Won't be able to give SQL examples since our BQ data sources don't exist yet.
Here's an alert example that I think isn't a good fit for Monarch:
Alert sheriffs when the parent build failed, but only when:
* The child builders didn't fail in one of a list of blacklisted stages (ones that are Infra's responsibility).
* Failures in hardware testing aren't one of a list of blacklisted failure modes that indicate a lab infra failure.
* The tree isn't closed.
* The failure doesn't look "catastrophic" (e.g. everything failing or aborting), which might indicate an infra failure.
Once we've decided to alert:
* Coalesce common failure classes together in the alert email. E.g. group all X builders that failed in BuildPackages together; group all Y builders that failed in HWTest together.
* Include links to:
  * a PLX dashboard that shows how the alert condition was met,
  * a direct link to silence it,
  * a direct link to Sheriff-o-Matic to mark the failure as under investigation.
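Purely as a sketch of that condition (as noted above I can't give real SQL yet, so every table, column, blacklist, and threshold below is invented), the query half plus the non-SQL checks might look something like:

# Hypothetical: `placeholder.builds`, its columns, the blacklists, and the
# "catastrophic" threshold are all placeholders for whatever we settle on.
INFRA_STAGE_BLACKLIST = ['Provision', 'UploadArtifacts']
LAB_FAILURE_MODE_BLACKLIST = ['DUT_UNREACHABLE', 'LAB_OUTAGE']
CATASTROPHIC_FAILURE_RATIO = 0.8

SHERIFFABLE_FAILURES_SQL = """
SELECT
  child.failed_stage AS failure_class,           -- coalesce by failure class
  ARRAY_AGG(child.builder) AS builders
FROM `placeholder.builds` AS parent
JOIN `placeholder.builds` AS child
  ON child.parent_build_id = parent.build_id
WHERE parent.status = 'FAILURE'
  AND child.status = 'FAILURE'
  AND child.failed_stage NOT IN UNNEST(@infra_stage_blacklist)
  AND (child.failed_stage != 'HWTest'
       OR child.failure_mode NOT IN UNNEST(@lab_failure_mode_blacklist))
GROUP BY failure_class
"""

def should_alert(tree_is_open, failed_builder_ratio, grouped_failures):
  """The parts of the condition that don't fit naturally in the query."""
  if not tree_is_open:
    return False
  if failed_builder_ratio >= CATASTROPHIC_FAILURE_RATIO:
    return False  # Everything failing/aborting looks like an infra failure.
  return bool(grouped_failures)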
Nov 16
SOM/gatekeeper is for sheriffs, and it already looks at the steps to decide whether it is worth closing the tree / sending an email. The alert definition also talks about tree status, which is likewise a SOM-level concept. Sean, is there a doc describing what kinds of alert conditions SOM/gatekeeper supports? I feel like we need to embrace interrupt-level alerting in SOM. FWIU, today a SOM user is expected to stare at the SOM dashboard, but my understanding might be outdated. Does SOM send emails on findings?
Nov 16
I would very much prefer all of what I've described to be an SoM feature, but I was under the impression that it wasn't on the horizon, so I designed this instead. If SoM is planning this, we would not do the PLX Alerts side of things (but would probably still export some BQ data).
Nov 16
Yeah, it is always preferable to have BQ data for ad-hoc analysis, in case SOM does not provide enough diagnosis.

I might be missing something important in SOM's scope of responsibilities, but the cases above (sheriffs, step forgiveness, tree status) seem to match SOM/gatekeeper's scope, so it would be less work overall if this were implemented at the SOM level (even if it is implemented by CrOS devs). John, would Chrome benefit from this?

Note that, I believe, the CrOS-specific concept of "parent build" is not important here. The "child builds" can be grouped by the landed commit used in all of them.
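For example (same caveat as before: the table and column names are invented), the grouping could key on the commit rather than on the CrOS-specific parent build:

# Hypothetical: group child builds by the landed commit they all built.
FAILURES_BY_COMMIT_SQL = """
SELECT
  gitiles_commit AS commit,
  COUNTIF(status = 'FAILURE') AS failed_builds,
  ARRAY_AGG(IF(status = 'FAILURE', builder, NULL) IGNORE NULLS) AS failed_builders
FROM `placeholder.builds`
GROUP BY commit
HAVING failed_builds > 0
"""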
Nov 18
I think it is super useful to have high-cardinality and pipeline data available in a queryable form (with e.g. BigQuery) and to have the ability to build alerts based on regularly executed queries of this data. At this point it's probably worth taking a week or two to step back, investigate, and write up the available technologies for achieving this, in case there is something we would prefer to extend investment in or use instead of PLX, as I foresee us doing even more of this in the future.

+1 also on moving the sheriff workflow to interrupt vs. polling, and on anything which helps move that needle.
Nov 19
I agree that:
- there are cases where Monarch isn't a suitable solution for storing monitoring data and evaluating alert conditions, but
- we should set up PLX alerts such that triggering an alert results in an incident being created in AlertManager, so that it can be routed to an Escalator queue and silenced with ACL checks.

>> We should investigate the PLX Alerts to Alertmanager bridge, too, to see if it's viable so that we can unify silencing infra alerts with guardian alerts.

Subspace is one possible solution to bridge PLX alerts to AlertManager. From http://shortn/_orrmEm6bM3: "Can send alerts from a script with Stubby API. There are cases where Monarch is not a suitable option for the monitoring backend; the API can be invoked to trigger an alert from non-Monarch monitoring sources. For example, the API can be invoked to trigger an alert based on the result of a Dremel query execution."

With Subspace, PLX scripts can make a Stubby call to Subspace to create an incident, specifying the owner of that incident. We can also forward the Subspace notifications to one or a mix of Escalator, Ticket-Queue, or Email, as necessary.
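To sketch the shape of that bridge (this is not the real Subspace API; the RPC below is only a stand-in for whatever the actual Stubby call is):

# Hypothetical sketch of the PLX -> Subspace -> AlertManager direction.
# create_subspace_incident() is a placeholder, NOT the real Subspace Stubby
# API; the point is only that the PLX script drives incident creation.

def create_subspace_incident(summary, owner, links):
  """Placeholder for the Subspace Stubby call that creates an incident."""
  raise NotImplementedError('wire up the real Subspace client here')

def bridge_plx_results(grouped_failures, owner):
  """Open one incident per coalesced failure class returned by the PLX query."""
  for failure in grouped_failures:
    create_subspace_incident(
        summary='%d builders failed in %s' % (len(failure['builders']),
                                              failure['failure_class']),
        owner=owner,
        links={'silence': failure.get('plx_dashboard_link'),
               'investigate': failure.get('som_link')})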
Nov 19
... and the receiver, which is SoM in this context, can silence the received Subspace alert.
Nov 19
- Moving SoM to user interrupt instead of user polling is a good long-term goal (especially as FindIt and other CAT tools automate more sheriffing tasks).
- SoM doesn't currently have plans to generate email, but some kind of notification method would be required for the above. Email is one, but we have talked about implementing desktop browser notifications as well. Delegating to a separate, dedicated alert management service SGTM too, if that's what sheriffs want.
- Regarding the types of failures that Gatekeeper can identify, consult the original docs* or the source code itself**, since it's probably drifted a bit since the docs were written.

* https://docs.google.com/document/d/1rJtHbQCxLiWbxg4vWAhi62o3ALTO2dvxe1e2Nu1pkVk/edit#heading=h.wpdokrn2dfrt and https://docs.google.com/document/d/1Gj-hDhmCP4ZklCvuF57-O2lHRMRFFkGTQVhDUWfZnp8/edit
** https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/gatekeeper_ng.py
Comment 1 by jclinton@chromium.org, Nov 13