Omen pending_job_durations_long doesn't appear to be valid.
Issue description
Viceroy is showing that the pending_job_durations omen has been set for a while. I originally believed that this was a side effect of shard and DB performance issues, but that no longer holds up. We've also just gotten an omen for slow MySQL queries; they might be related. Finally, the master scheduler tick rate is slow, but not alert-level slow.
,
May 4 2018
Well, that query was against a fixed date. When updated to be current, it showed that "cros-full-0018" was the problematic shard, even though (or because?) it has no boards assigned. http://shortn/_PKmEXsOvMD I've rebooted it for no particularly good reason, but this does tell me that we need to improve the Omen so that it's actionable without so much effort.
,
May 7 2018
Running the omen query by hand shows a value of ~1.5. The omen thresholds are 2, 3, 7. No idea why it's firing.
(Fetch(Raw('monarch.acquisitions.Task', '/chrome/infra/chromeos/autotest/shard_client/heartbeat/known_jobs_durations'))
   | Window(Align('1m'), '1m'),
 Fetch(Raw('monarch.acquisitions.Task', '/chrome/infra/chromeos/sysmon/prod_hosts/roles'),
       {'host_name': 'cros-full-0036'})
   | Window(Align('1h'), '1m')
   | MapStreamId(
       'monarch.acquisitions.Task',
       {'host_name': 'metric:target_hostname'},
       drop_metric_fields=True)
   | ValueToField('prodrole', op=VAL)
   | Filter('shard' == FIELDS['metric:prodrole']))
| Join()
| GroupBy(['host_name'])
| Point(Percentile(50, VAL) / 60 / 60 / 24)
| PickTopStreams(1, '3h', reducer=Mean())
| Window(Reduce('3h', Mean()))
| GroupBy([])
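For anyone sanity-checking the number by hand, here is a rough Python sketch of the arithmetic in the Point() step of the query above: take the median duration and divide by 60 / 60 / 24, which suggests the raw metric is in seconds and the omen value is in days (an assumption, not confirmed here). The sample durations, the function names, and the "value >= threshold" comparison are all hypothetical; the sketch also ignores the PickTopStreams and 3h mean steps.

```python
from statistics import median

SECONDS_PER_DAY = 60 * 60 * 24

# Thresholds quoted in the comment above; assumed to be in days.
OMEN_THRESHOLDS_DAYS = (2, 3, 7)


def median_duration_days(durations_seconds):
    """Median job duration, converted from (assumed) seconds to days."""
    return median(durations_seconds) / SECONDS_PER_DAY


def omen_level(value_days, thresholds=OMEN_THRESHOLDS_DAYS):
    """Count how many thresholds the value meets or exceeds (0 = no omen).

    The >= comparison is an assumption about how the omen is evaluated.
    """
    return sum(1 for t in thresholds if value_days >= t)


# Hypothetical sample: three known job durations on one shard, in seconds.
sample = [86_400, 129_600, 172_800]
value = median_duration_days(sample)
print(value, omen_level(value))
```

Under these assumptions a value of ~1.5 sits below the lowest threshold, which matches the observation that the omen should not be firing on this query alone.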
,
May 7 2018
,
May 7 2018
,
Jun 8 2018
Comment 1 by dgarr...@chromium.org, May 4 2018