Issue 837740

Issue metadata

Status: WontFix
Owner:
Closed: May 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



PuppetVersionSkewTooHigh alert should identify WHICH servers.

Project Member Reported by dgarr...@chromium.org, Apr 27 2018

Issue description

Sample alert:


For page-level alerts, please post updates for this outage in the
corresponding Outalator entry:
https://o.corp.google.com/#Tickets:chrome-infra::::

Alert Details
------------------
Description:
At least 1 prod server puppet configuration is too far out of date.

name: PuppetVersionSkewTooHigh
current value: 40.838888888887595
threshold: Gt(24) for 3h
alert fields: {, }

sent at: 2018-04-27 11:22:14
active since: 2018-04-26 21:30:12 (13 hours 52 mins)

Useful Links
------------------
playbook:  
https://goto.google.com/chrome-os-infra-playbook#PuppetVersionSkewTooHigh
console: https://viceroy.corp.google.com/chromeos/puppet
silence:  
http://alertmanager.corp.google.com/#view=createSilence&query=alertname%3D%22PuppetVersionSkewTooHigh%22,monarch_metric_fields%3D%22%5E%24%22,monarch_module_name%3D%22chromeos-infra-autotest-alerts%22,monarch_target_fields%3D%22%5E%24%22,monitorname%3D%22monarch%22,service%3D%22chromeos-infra-alert-owners%22
alert manager:  
http://alertmanager.corp.google.com/#view=conditionSummary&query=alertname%3D%22PuppetVersionSkewTooHigh%22,+monarch_metric_fields%3D%22%5E%24%22,+monarch_module_name%3D%22chromeos-infra-autotest-alerts%22,+monarch_target_fields%3D%22%5E%24%22,+monitorname%3D%22monarch%22,+service%3D%22chromeos-infra-alert-owners%22

Query
------------------
graph:  
https://pcon.corp.google.com/p#chromeos-infra-alert-owners/queryplayground?query=mash&duration=1d&mash=%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/sysmon/puppet/version/config%27%29%2C%0A%20%20%20%20%20%20%20%7B%7D%29%0A%20%7C%20Window%28Align%28%2730m%27%29%29%2C%0A%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/sysmon/prod_hosts/roles%27%29%2C%0A%20%20%20%20%20%20%20%7B%27host_name%27%3A%20%27cros-full-0036%27%7D%29%0A%20%7C%20Window%28Align%28%271h%27%29%29%0A%20%7C%20GroupBy%28%5B%27metric%3Atarget_hostname%27%2C%20%27metric%3Atarget_data_center%27%5D%2C%20PickAny%28%29%29%0A%20%7C%20Filter%28True%29%0A%20%7C%20Filter%28True%29%0A%20%7C%20MapStreamId%28%27monarch.acquisitions.Task%27%2C%20%7B%27data_center%27%3A%20%27metric%3Atarget_data_center%27%2C%20%27host_name%27%3A%20%27metric%3Atarget_hostname%27%7D%2C%20drop_metric_fields%3DTrue%29%0A%20%7C%20ValueToField%28%27role%27%29%29%0A%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27left%27%2C%20right_default%3DNone%2C%20right_name%3D%27right%27%29%0A%7C%20Point%28VAL%20/%203600%29%0A%7C%20GroupBy%28%5B%5D%2C%20Range%28%29%29&endtime=1524803412
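The graph link above carries its query URL-encoded in the `mash` parameter of the URL fragment, which makes it hard to read in the alert. Below is a minimal sketch for decoding it with Python's standard library. `decode_mash_query` and `SAMPLE_URL` are names made up for this illustration, and `SAMPLE_URL` is a shortened stand-in that keeps only the last two stages of the real query.

```python
from urllib.parse import parse_qs, urlsplit


def decode_mash_query(url: str) -> str:
    """Return the decoded `mash` parameter found in the URL fragment."""
    fragment = urlsplit(url).fragment
    # The parameters sit after the '?' inside the fragment,
    # not in the URL's regular query component.
    _, _, params = fragment.partition("?")
    values = parse_qs(params).get("mash", [])
    return values[0] if values else ""


# Shortened stand-in for the graph link (assumption: only the tail of the
# real `mash` value is reproduced here).
SAMPLE_URL = (
    "https://pcon.corp.google.com/p#chromeos-infra-alert-owners/queryplayground"
    "?query=mash&duration=1d"
    "&mash=%7C%20Point%28VAL%20/%203600%29%0A%7C%20GroupBy%28%5B%5D%2C%20Range%28%29%29"
    "&endtime=1524803412"
)

if __name__ == "__main__":
    print(decode_mash_query(SAMPLE_URL))
```

Decoded, the query ends in `Point(VAL / 3600)` followed by `GroupBy([], Range())`. A grouping on an empty field list would appear to collapse every host into a single value, which would be consistent with the empty `alert fields` in the sample alert.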




Nothing in this alert or its linked graph identifies WHICH hosts are behind. After getting these alerts, I started working through servers that were recent, because I missed a single line at the bottom of the graph that was the real problem:


https://viceroy.corp.google.com/chromeos/puppet#_VG_qcaOgRcu


Having the alert or its graph call out the problematic hosts would have been helpful.
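As an illustration of the request, here is a rough sketch of a per-host breakdown. It is not the alert's actual implementation and does not use any Monarch API. It assumes the puppet config version is a Unix timestamp in seconds, so that dividing by 3600 gives hours as in the alert query. The host names, the sample values, and the `hosts_behind` function are all hypothetical.

```python
from typing import Dict, List, Tuple

THRESHOLD_HOURS = 24.0  # mirrors the "Gt(24)" threshold in the sample alert


def hosts_behind(
    config_versions: Dict[str, float],
    threshold_hours: float = THRESHOLD_HOURS,
) -> List[Tuple[str, float]]:
    """Return (host, hours behind the newest version), worst first.

    Only hosts whose skew exceeds threshold_hours are included.
    """
    if not config_versions:
        return []
    newest = max(config_versions.values())
    skews = {
        host: (newest - version) / 3600.0
        for host, version in config_versions.items()
    }
    offenders = [(host, skew) for host, skew in skews.items() if skew > threshold_hours]
    return sorted(offenders, key=lambda item: item[1], reverse=True)


# Hypothetical data: versions are assumed to be Unix timestamps in seconds.
SAMPLE = {
    "host-a": 1524800000.0,  # newest config
    "host-b": 1524796400.0,  # 1 hour behind
    "host-c": 1524652980.0,  # 147020 seconds behind
}

if __name__ == "__main__":
    for host, skew in hosts_behind(SAMPLE):
        print(f"{host}: {skew:.2f} hours behind")
```

The largest per-host skew equals the single range value the alert reports today, so listing the offending hosts next to it would add information without changing the alert condition.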
 
Status: WontFix (was: Untriaged)
I think we ended up concluding that the dashboard/graph is crystal clear.