Issue 607196

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature

Blocking:
issue 747056



When moblab has 0 ready DUTs, it should be obvious that the sheriff need do nothing

Project Member Reported by diand...@chromium.org, Apr 27 2016

Issue description

As you can see in bug #607084, we ran into a case where there were 0 DUTs available.

From the sheriffing docs, the points of sheriffing are:

1. Make sure build blocking failures are identified and addressed in a timely fashion.

2. Manually watch over our build system in ways automation doesn’t/can’t do.

3. Give developers a chance to learn a little more about how build works and breaks.


#1: Mostly this is identifying problems and finding owners.  If there are 0 DUTs available then the problem is clear and the owner is clear.  The sheriff can act as a nag, but that doesn't seem like a terribly good use of resources.  If the teams in charge of fixing this need a sheriff to nag them to get it done, then we should fix that.  Perhaps we can create an auto-nagger robot and, if everyone is ignoring the robots, have it send email that looks like it came from the sheriffs?

#2: The automation catches this just fine.

#3: Sheriffs can't actually do anything about this except nag, so learning opportunities are slim to none.

---

Making it clear that this is not the sheriff's job to deal with will help reduce sheriffs' frustration and help make sure that when there _is_ something sheriffs should deal with, they'll pay attention to it.


NOTE: continuing to _notify_ sheriffs of problems like this is still quite important.  I simply want the message to help the sheriff know what to do about the failure.

---

So right now, I get this message in my inbox:

master-paladin has encountered infra failures:

guado_moblab-paladin: The HWTest [moblab_quick] stage failed: **
HWTest did not complete due to infrastructure issues (code 3) ** in
https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/2432

See https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/10952

-

It could change to:

master-paladin has encountered infra failures:

guado_moblab-paladin: The HWTest [moblab_quick] stage failed: **
HWTest did not complete due to infrastructure issues (code 3) ** in
https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/2432

It has been identified that there is nothing that the build sheriff can
do to help this failure.  Sheriffs: sit tight until the problem is fixed.
If you would like to ping someone on the status of this problem, please
ping on #crospfq IRC or look at the Infrastructure Rotation and ping
the Infrastructure Deputy.

See https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/10952
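
As a rough illustration of the proposed change, the sketch below appends the extra paragraph only when a failure has been classified as something sheriffs cannot act on. Everything in it is hypothetical: `build_failure_message`, the `sheriff_actionable` flag, and `SHERIFF_NO_ACTION_NOTE` are names made up for this example and are not taken from the actual CQ notification code. How a failure gets classified as non-actionable (for example, 0 ready DUTs) is deliberately left out.

```python
# Hypothetical sketch: none of these names come from the real CQ code.

# Wording copied from the proposed message above.
SHERIFF_NO_ACTION_NOTE = (
    "It has been identified that there is nothing that the build sheriff can\n"
    "do to help this failure.  Sheriffs: sit tight until the problem is fixed.\n"
    "If you would like to ping someone on the status of this problem, please\n"
    "ping on #crospfq IRC or look at the Infrastructure Rotation and ping\n"
    "the Infrastructure Deputy."
)


def build_failure_message(master, builder, stage_summary,
                          builder_url, master_url, sheriff_actionable):
    """Compose the notification body for one failed builder.

    sheriff_actionable is an assumed flag supplied by whatever classifies
    the failure; when it is False the "sit tight" note is included.
    """
    parts = [
        f"{master} has encountered infra failures:",
        f"{builder}: {stage_summary} in\n{builder_url}",
    ]
    if not sheriff_actionable:
        # The sheriff is still notified, but told there is nothing to do.
        parts.append(SHERIFF_NO_ACTION_NOTE)
    parts.append(f"See {master_url}")
    return "\n\n".join(parts)
```

The notification is still sent in both cases; only the guidance paragraph differs, which matches the NOTE above about continuing to notify sheriffs.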

 
Components: -Infra Infra>Client>ChromeOS
Labels: -Hardware-Lab -Infra
Status: Available (was: Untriaged)
Actually, the infra deputy is already responsible for noticing
and dealing with problems like this without needing the sheriffs.
The real problem here is multi-fold:
  * The problem first showed up at 17:50 (a bit late for
    immediate action).
  * The deputy wasn't actually watching at the time.
  * The notifications to the deputy are too easy to ignore.  :-(
  * The failure shouldn't have been fatal in the first place.
    There were three other DUTs available for work; they should
    have been allowed to pick up the slack.

@1: OK, great info!  ...so presumably the job here is to just make it more obvious to the sheriffs.  ;)
Cc: -sbasi@chromium.org
Labels: FixIt
Owner: sbasi@chromium.org
Project Member

Comment 4 by sheriffbot@chromium.org, May 4 2017

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been available for more than 365 days, and should be re-evaluated. Please re-triage this issue.
The Hotlist-Recharge-Cold label is applied for tracking purposes, and should not be removed after re-triaging the issue.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Blocking: 747056
IIUC, isn't this a failure that should be classified as "infra failure" and not blame any CLs? Just happened today:

https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8762

and killed my CL:

https://chromium-review.googlesource.com/c/chromiumos/platform/frecon/+/935372

and 16 others.

Comment 7 by sbasi@chromium.org, Feb 28 2018

Cc: haddowk@chromium.org
Owner: nxia@chromium.org
Assigning to infra deputy and cc-ing current Moblab Lead.

In general, the instruction to the deputy for anything *moblab in the CQ is: if the cause is not obvious, mark the board as experimental and file a bug against haddowk, cc mattmallet@.

I have taken on the responsibility of sheriffing the moblab devices, but I just cannot be watching 24x7.

Is there a deputy playbook I can add this instruction to?
All the sub-DUTs in stirling ct have been rebooted/recovered.
Labels: -Hotlist-Recharge-Cold
Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
Action items:

- Add moblab failure handling to deputy playbook
- Don't scare sheriffs

The second one is much harder, I think.  I'm not the best owner for that.  I can take the first one as secondary.
Labels: -Fixit Hotlist-Fixit
Owner: ----
Status: Available (was: Assigned)
No plan to do this.
