Sample alert:
For page-level alerts, please post updates for this outage in the
corresponding Outalator entry:
https://o.corp.google.com/#Tickets:chrome-infra::::
Alert Details
------------------
Description:
At least 1 prod server puppet configuration is too far out of date.
name: PuppetVersionSkewTooHigh
current value: 40.838888888887595
threshold: Gt(24) for 3h
alert fields: {, }
sent at: 2018-04-27 11:22:14
active since: 2018-04-26 21:30:12 (13 hours 52 mins)
Useful Links
------------------
playbook:
https://goto.google.com/chrome-os-infra-playbook#PuppetVersionSkewTooHigh
console: https://viceroy.corp.google.com/chromeos/puppet
silence:
http://alertmanager.corp.google.com/#view=createSilence&query=alertname%3D%22PuppetVersionSkewTooHigh%22,monarch_metric_fields%3D%22%5E%24%22,monarch_module_name%3D%22chromeos-infra-autotest-alerts%22,monarch_target_fields%3D%22%5E%24%22,monitorname%3D%22monarch%22,service%3D%22chromeos-infra-alert-owners%22
alert manager:
http://alertmanager.corp.google.com/#view=conditionSummary&query=alertname%3D%22PuppetVersionSkewTooHigh%22,+monarch_metric_fields%3D%22%5E%24%22,+monarch_module_name%3D%22chromeos-infra-autotest-alerts%22,+monarch_target_fields%3D%22%5E%24%22,+monitorname%3D%22monarch%22,+service%3D%22chromeos-infra-alert-owners%22
Query
------------------
graph:
https://pcon.corp.google.com/p#chromeos-infra-alert-owners/queryplayground?query=mash&duration=1d&mash=%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/sysmon/puppet/version/config%27%29%2C%0A%20%20%20%20%20%20%20%7B%7D%29%0A%20%7C%20Window%28Align%28%2730m%27%29%29%2C%0A%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/sysmon/prod_hosts/roles%27%29%2C%0A%20%20%20%20%20%20%20%7B%27host_name%27%3A%20%27cros-full-0036%27%7D%29%0A%20%7C%20Window%28Align%28%271h%27%29%29%0A%20%7C%20GroupBy%28%5B%27metric%3Atarget_hostname%27%2C%20%27metric%3Atarget_data_center%27%5D%2C%20PickAny%28%29%29%0A%20%7C%20Filter%28True%29%0A%20%7C%20Filter%28True%29%0A%20%7C%20MapStreamId%28%27monarch.acquisitions.Task%27%2C%20%7B%27data_center%27%3A%20%27metric%3Atarget_data_center%27%2C%20%27host_name%27%3A%20%27metric%3Atarget_hostname%27%7D%2C%20drop_metric_fields%3DTrue%29%0A%20%7C%20ValueToField%28%27role%27%29%29%0A%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27left%27%2C%20right_default%3DNone%2C%20right_name%3D%27right%27%29%0A%7C%20Point%28VAL%20/%203600%29%0A%7C%20GroupBy%28%5B%5D%2C%20Range%28%29%29&endtime=1524803412
Nothing in this alert or its linked graph identifies WHICH hosts are behind. After getting these alerts, I started working through servers that were in fact recent, because I missed a single line at the bottom of the graph that was the real problem:
https://viceroy.corp.google.com/chromeos/puppet#_VG_qcaOgRcu
Having the alert or its graph call out the problematic hosts would have been helpful.
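To illustrate the request, here is a minimal sketch of what a per-host breakdown could look like. It assumes the puppet config version metric is a per-host timestamp in seconds, which is only inferred from the alert query dividing by 3600 and taking a Range() over all hosts. The function names, the host names, and the sample values are hypothetical and are not part of the existing alert or tooling.

```python
from typing import Dict, List, Tuple

THRESHOLD_HOURS = 24.0  # mirrors the alert's Gt(24) threshold


def skew_hours(config_versions: Dict[str, float]) -> float:
    """Range (max - min) of per-host values, converted from seconds to hours.

    This is the single aggregate number the alert reports today.
    """
    if not config_versions:
        return 0.0
    values = list(config_versions.values())
    return (max(values) - min(values)) / 3600.0


def lagging_hosts(
    config_versions: Dict[str, float],
    threshold_hours: float = THRESHOLD_HOURS,
) -> List[Tuple[str, float]]:
    """Return (host, hours_behind_newest) for hosts over the threshold.

    Hosts are sorted with the furthest-behind host first.
    """
    if not config_versions:
        return []
    newest = max(config_versions.values())
    behind = [
        (host, (newest - version) / 3600.0)
        for host, version in config_versions.items()
    ]
    over = [(host, lag) for host, lag in behind if lag > threshold_hours]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: assumed epoch-second config versions per host.
SAMPLE = {
    "host-a": 1524800000.0,
    "host-b": 1524800000.0 - 30 * 3600,
    "host-c": 1524800000.0 - 2 * 3600,
}

if __name__ == "__main__":
    print("skew (hours):", skew_hours(SAMPLE))
    for host, lag in lagging_hosts(SAMPLE):
        print(f"{host} is {lag:.1f}h behind the newest config")
```

With the aggregate alone, the reader sees only the skew value. Listing the hosts over the threshold points directly at the machine that needs attention.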
Comment 1 by ayatane@chromium.org, May 7 2018