Issue 839952

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Omen pending_job_durations_long doesn't appear to be valid.

Project Member Reported by dgarr...@chromium.org, May 4 2018

Issue description

Viceroy is showing that the pending_job_durations omen has been set for a while.

I originally believed that this was a side effect of shard and db performance issues, but that no longer holds up.

We've also just gotten an omen for slow MySQL queries; the two might be related.

Finally, the master scheduler tick rate is slow, but not slow enough to alert.
 
Summary: Omen pending_job_durations_long doesn't appear to be valid. (was: pending_job_durations_long)
Trying to break the Omen down by shard (with help) got me here:
  http://shortn/_Zs3ieqwH6O

That means that the Omen is showing slow job values of about 4.6 days, but when broken down by shard, the slowest value is 1.2 days.
Well, that query was against a fixed date. When updated to be current, it showed that "cros-full-0018" was the problematic shard, even though (or because?) it has no boards assigned.

http://shortn/_PKmEXsOvMD


I've rebooted it for no particularly good reason, but this does tell me that we need to improve the Omen so that it's actionable without so much effort.
Status: Available (was: Untriaged)
Running the omen query by hand shows a value of ~1.5. The omen thresholds are 2, 3, 7. No idea why it's firing.

(Fetch(Raw('monarch.acquisitions.Task', '/chrome/infra/chromeos/autotest/shard_client/heartbeat/known_jobs_durations'))
 | Window(Align('1m'), '1m'),
 Fetch(Raw('monarch.acquisitions.Task', '/chrome/infra/chromeos/sysmon/prod_hosts/roles'),
       {'host_name': 'cros-full-0036'})
 | Window(Align('1h'), '1m')
 | MapStreamId(
     'monarch.acquisitions.Task',
     {'host_name': 'metric:target_hostname'},
     drop_metric_fields=True)
 | ValueToField('prodrole', op=VAL)
 | Filter('shard' == FIELDS['metric:prodrole']))
| Join()
| GroupBy(['host_name'])
| Point(Percentile(50, VAL) / 60 / 60 / 24)
| PickTopStreams(1, '3h', reducer=Mean())
| Window(Reduce('3h', Mean()))
| GroupBy([])
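
For anyone reading the query without Monarch access, here is a rough Python sketch of the arithmetic it appears to perform: take the median job duration per host, convert it to days, pick the worst host, and compare against the thresholds. It assumes the raw durations are in seconds (the division by 60 / 60 / 24 suggests this), it ignores the '3h' windowing and averaging, and the function names and sample data are hypothetical, not taken from the omen config.

```python
from statistics import median

SECONDS_PER_DAY = 60 * 60 * 24
# Thresholds quoted in this bug; what each level means is an assumption.
THRESHOLDS_DAYS = (2, 3, 7)


def median_days(durations_seconds):
    """Median duration, converted from seconds to days."""
    return median(durations_seconds) / SECONDS_PER_DAY


def worst_shard(durations_by_host):
    """Return (host_name, median_days) for the host with the largest median."""
    per_host = {
        host: median_days(durations)
        for host, durations in durations_by_host.items()
        if durations  # skip hosts that reported no jobs
    }
    host = max(per_host, key=per_host.get)
    return host, per_host[host]


def thresholds_crossed(value_days, thresholds=THRESHOLDS_DAYS):
    """Number of thresholds met or exceeded; 0 means nothing should fire."""
    return sum(value_days >= t for t in thresholds)


if __name__ == "__main__":
    # Made-up sample data, in seconds.
    sample = {
        "shard-a": [86400, 129600, 172800],
        "shard-b": [3600, 7200],
    }
    host, days = worst_shard(sample)
    print(host, days, thresholds_crossed(days))
```

Under these assumptions a value of about 1.5 crosses none of the thresholds, while 4.6 would cross the first two, which matches the suspicion that the omen is firing on something other than what the hand-run query returns.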

Cc: ayatane@chromium.org
Cc: akes...@chromium.org

Comment 6 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
