When moblab has 0 ready DUTs, it should be obvious that the sheriff need do nothing
Issue description

As you can see in bug #607084, we ran into a case where there were 0 DUTs available.

From the sheriffing docs, the points of sheriffing are:
1. Make sure build blocking failures are identified and addressed in a timely fashion.
2. Manually watch over our build system in ways automation doesn’t/can’t do.
3. Give developers a chance to learn a little more about how build works and breaks.

#1: Mostly this is identifying problems and finding owners. If there are 0 DUTs available then the problem is clear and the owner is clear. The sheriff can act as a nag, but it doesn't seem like a terribly good use of resources. If the teams in charge of fixing this need a sheriff to nag them to get this done then we should fix that. Perhaps we can create an auto-nagger robot, but have it send email that looks like it came from the sheriffs if everyone is ignoring the robots?

#2: The automation catches this just fine.

#3: Sheriffs can't actually do anything about this except nag, so learning opportunities are slim to none.

---

Making it clear that this is not the sheriff's job to deal with will help reduce sheriffs' frustration and help make sure that when there _is_ something sheriffs should deal with, they'll pay attention to it.

NOTE: continuing to _notify_ sheriffs of problems like this is still quite important. I simply wish the message to help the sheriff know what to do about the failure.

---

So right now, I get this message in my inbox:

master-paladin has encountered infra failures:
guado_moblab-paladin: The HWTest [moblab_quick] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/2432
See https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/10952

It could change to:

master-paladin has encountered infra failures:
guado_moblab-paladin: The HWTest [moblab_quick] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/2432
It has been identified that there is nothing that the build sheriff can do to help this failure. Sheriffs: sit tight until the problem is fixed. If you would like to ping someone on the status of this problem, please ping on #crospfq IRC or look at the Infrastructure Rotation and ping the Infrastructure Deputy.
See https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/10952
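A rough sketch of how the proposed wording could be attached to the alert, in case it helps the discussion. This is not the actual build-system code: `build_failure_message`, `SHERIFF_GUIDANCE`, and the `sheriff_actionable` flag are hypothetical names, and how a failure gets classified as "nothing the sheriff can do" is assumed to happen elsewhere.

```python
# Hypothetical sketch; none of these names come from the real build tooling.

# Guidance text taken from the proposed message in this issue.
SHERIFF_GUIDANCE = (
    "It has been identified that there is nothing that the build sheriff "
    "can do to help this failure. Sheriffs: sit tight until the problem is "
    "fixed. If you would like to ping someone on the status of this problem, "
    "please ping on #crospfq IRC or look at the Infrastructure Rotation and "
    "ping the Infrastructure Deputy."
)


def build_failure_message(summary, details, master_url, sheriff_actionable):
    """Assemble the alert text, adding guidance when sheriffs cannot help.

    sheriff_actionable is an assumed boolean supplied by whatever code
    classifies the failure (for example, "0 DUTs available").
    """
    parts = [summary, details]
    if not sheriff_actionable:
        # Still notify the sheriff, but say plainly that no action is needed.
        parts.append(SHERIFF_GUIDANCE)
    parts.append("See " + master_url)
    return "\n".join(parts)
```

The sheriff is still notified in both cases; only the extra paragraph differs, which matches the NOTE above about keeping notifications in place.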
,
Apr 27 2016
@1: OK, great info! ...so presumably the job here is to just make it more obvious to the sheriffs. ;)
,
May 3 2016
,
May 4 2017
This issue has been available for more than 365 days, and should be re-evaluated. Please re-triage this issue. The Hotlist-Recharge-Cold label is applied for tracking purposes, and should not be removed after re-triaging the issue. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jul 20 2017
,
Feb 28 2018
IIUC, isn't this a failure that should be classified as "infra failure" and not blame any CLs? Just happened today: https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8762 and killed my CL: https://chromium-review.googlesource.com/c/chromiumos/platform/frecon/+/935372 and 16 others.
,
Feb 28 2018
Assigning to infra deputy and cc-ing current Moblab Lead.
,
Feb 28 2018
In general, the instructions to the deputy for anything *moblab in the CQ are: if it is non-obvious, mark the board as experimental and file a bug against haddowk, cc mattmallet@. I have taken the responsibility to sheriff the moblab devices, but I just cannot be looking 24x7. Is there a deputy playbook I can add this instruction to?
,
Feb 28 2018
All the sub duts in stirling ct have been rebooted/recovered.
,
Mar 5 2018
Action items:
- Add moblab failure handling to deputy playbook
- Don't scare sheriffs

The second one is much harder, I think. I'm not the best owner for that. I can take the first one as secondary.
,
Mar 19 2018
No plan to do
Comment 1 by jrbarnette@chromium.org, Apr 27 2016
Labels: -Hardware-Lab -Infra
Status: Available (was: Untriaged)
Actually, the infra deputy is already responsible for noticing and dealing with problems like this without needing the sheriffs. The real problem here is multi-fold:
* The problem first showed up at 17:50 (a bit late for immediate action).
* The deputy wasn't actually watching at the time.
* The notifications to the deputy are too easy to ignore. :-(
* The failure shouldn't have been fatal in the first place. There were three other DUTs available for work; they should have been allowed to pick up the slack.