Graphs of failure rates per Swarming host
Issue description
It would be extremely helpful to have some graphs of the failure rates over time per Swarming host. In Issue 638718 we're finding that a couple of new hosts are misconfigured, and manually visiting the Swarming URLs for each of the new hosts is quite slow and painful.
,
Sep 13 2016
,
Apr 12 2017
This is now implemented: Swarming is instrumented with ts_mon, and graphs are available on vi/chrome_infra.
,
Apr 20 2017
Thank you Sergey! This is awesome. Could you give a link to the new graphs? I browsed around and the closest thing I could find is the set of Buildbot failure graphs, optionally broken down per master and builder. These look useful. However, they don't drill down to the individual Swarming slave that ran the tests.
,
Apr 20 2017
Oh, I didn't realize you want to look at individual machines. This is probably too detailed for ts_mon, but we should have all the data in the event_mon pipeline, if necessary. Unfortunately, this also means you'd need to write your own SQL queries... Ideally, we'd like to get to the point where individual machines don't matter, and they are invisible to the end user. That's one reason for the lack of such graphs. If you still find it useful to track individual machines, please file a bug against Swarming, and someone will help you get a viceroy console with the data you need. I'm no longer doing monitoring, though; we are pretty much on self-service at this point. I'm happy to help if you have questions.
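For illustration only, here is a rough sketch of the aggregation such a query would need to compute: the failure rate per machine per day. The record shape (bot id, day, failed flag) and the function name are assumptions made up for this example; they are not the event_mon schema.

```python
from collections import defaultdict
from datetime import date


def failure_rates(task_results):
    """Return {(bot_id, day): failure_rate}.

    task_results is an iterable of (bot_id, day, failed) tuples.
    This shape is hypothetical, not the real event_mon schema.
    """
    totals = defaultdict(lambda: [0, 0])  # (bot_id, day) -> [failures, runs]
    for bot_id, day, failed in task_results:
        counts = totals[(bot_id, day)]
        if failed:
            counts[0] += 1
        counts[1] += 1
    return {key: failures / runs for key, (failures, runs) in totals.items()}


if __name__ == "__main__":
    sample = [
        ("bot-a", date(2017, 4, 20), True),
        ("bot-a", date(2017, 4, 20), False),
        ("bot-b", date(2017, 4, 20), False),
    ]
    for (bot_id, day), rate in sorted(failure_rates(sample).items()):
        print(bot_id, day, rate)
```

Plotting the resulting rate per bot over time would give the per-host graph asked for in the issue description.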
,
Apr 21 2017
It's difficult to formulate my idea. There are occasionally hardware failures affecting the GPU bots, which result in flakiness until someone notices that a particular slave is reliably failing (certain) tests. Ideally this would be handled in an automated fashion, with some sort of meta-monitor looking at recent (~50) test runs, auto-quarantining the bot, and sending an alert if it has failed all runs of a particular test type (and they weren't all CL tryjobs sent by the same user). I'll comment on https://github.com/luci/luci-py/issues/277 which looks like it already describes this request.
,
Apr 21 2017
As I noted on GitHub, fuzzy rules are better served by an external service that queries (or listens to pubsub topics?) to get updates about the tasks on specific bots. Since Swarming doesn't know about pre-commit, post-commit, or GPU tests, it can't make an informed decision. I think a good location is SoM, since it's relatively easy to add a cron job to query and analyse the bots.
Comment 1 by sergeybe...@chromium.org, Sep 13 2016
Owner: ----