There are two types of infra failures we are concerned about here: flakey failures that happen occasionally and outage-magnitude failures. In my experience thus far the latter happens when one of our dependencies is failing in some way. There is related work to ensure that we are failing gracefully in those cases.
In the case of flakey failures, we want to eventually want to automatically create bugs (with multiple failures of the same type on the same bug) with the Infra>Trooper label.
This will be quite a large number of alerts at the moment, so we could start with a simple email alert sent to chrome-troopers-alerts+staging@ for now while we’re working on burning down the rate.
For outage-magnitude failures, we care about the impact on CQ cycle time, so this directly relates to the work that the crossover team will be doing this quarter on the speed and reliability of the CQ. We can link the bug for the CQ monitoring/alerting once we have that.
Comment 1 by serg...@chromium.org
, Oct 21 2016