Add a "number of new jobs with hosts" graph to shards dashboard
Issue description

The "job pending durations too large" omen has been firing for weeks now: http://shortn/_nxkpBNoLq6

I've been personally ignoring it. And guess what: at least partially, it was pointing to a real issue, issue 864227.

The problem with this omen is that there are no obvious next steps. We should add a graph of median job pending duration to the shards dashboard: https://viceroy.corp.google.com/chromeos/capacity_health

The top 5 shards with the longest pending durations is a useful metric to know:
- either we've overloaded the shards and need to distribute load better
- or something is wrong with the shard and it isn't processing jobs as it should
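As a rough illustration of the proposed metric, here is a minimal sketch of ranking shards by median pending duration. The function name, the `(shard, pending_seconds)` input shape, and the shard names in the usage note are all hypothetical; this is not code from autotest or the dashboard.

```python
from collections import defaultdict
from statistics import median


def top_shards_by_median_pending(pending_jobs, n=5):
    """Return the n shards with the largest median pending duration.

    pending_jobs: iterable of (shard_name, pending_seconds) pairs.
    This input shape is an assumption made for the sketch.
    """
    durations_by_shard = defaultdict(list)
    for shard, pending_seconds in pending_jobs:
        durations_by_shard[shard].append(pending_seconds)

    # One median per shard, then sort from longest to shortest.
    medians = {shard: median(d) for shard, d in durations_by_shard.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)[:n]
```

For example, calling it with `[("shard-a", 10), ("shard-a", 30), ("shard-b", 100)]` and `n=1` would return only the entry for `shard-b`. A high median on one shard alone would hint at a problem with that shard, while high medians across many shards would hint at overload.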
Jul 23
We found a dashboard that _might_ be doing something like this: https://viceroy.corp.google.com/chromeos/capacity_health#_VG_7D_0PTt3 However, it does not show the impact of issue 864227: http://shortn/_7w6lNefAzb
Jul 23
Looks like the existing metric known_jobs_durations does not capture timeouts: it only looks at the incomplete jobs present on the shard at a given moment. https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/shard/shard_client.py?q=known_jobs_durations&l=334
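To make the gap concrete, here is a minimal sketch of why a snapshot of incomplete jobs can miss jobs that already timed out. The `Job` class and `snapshot_durations` function are hypothetical and only model the behavior described above; they are not the code in shard_client.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    created_at: float                    # seconds since an arbitrary epoch
    finished_at: Optional[float] = None  # None while the job is incomplete
    timed_out: bool = False


def snapshot_durations(jobs, now):
    """Ages of the jobs that are still incomplete at time `now`.

    Jobs that finished (or timed out) before `now` are skipped, so
    their waiting time never shows up in the result.
    """
    return [
        now - job.created_at
        for job in jobs
        if job.created_at <= now
        and (job.finished_at is None or job.finished_at > now)
    ]
```

With one job created at time 0 that timed out at time 50 and another created at time 90 that is still incomplete, a snapshot taken at time 100 would report only the 10-second-old job. The 50 seconds the timed-out job spent waiting would be invisible.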
Jul 30
Added an alert on the number of new jobs with hosts per shard and a corresponding dashboard at https://viceroy.corp.google.com/chromeos/capacity_health#_VG_c0_J4Kyq
Jul 30
Comment 1 by akes...@chromium.org, Jul 23
Owner: zamorzaev@chromium.org
Status: Assigned (was: Untriaged)