
Issue 819884

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 821169




Buildbot buildslaves don't always reconnect.

Project Member Reported by dgarr...@chromium.org, Mar 8 2018

Issue description

From time to time I find buildbot slaves that stay offline until they are rebooted. So far I've been rebooting them manually to get them working again.

This is usually easiest to find on the tryserver waterfall by looking for slaves that are offline, and have been offline for more than a few minutes.

https://uberchromegw.corp.google.com/i/chromiumos.tryserver/buildslaves
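
For anyone who wants to automate the check described above, here is a minimal sketch. It is not an existing tool. It assumes the slave list has already been fetched into plain records, and the field names (`name`, `connected`, `last_seen`), the slave names, and the 10-minute threshold are all hypothetical, not part of any buildbot API.

```python
from datetime import datetime, timedelta


def find_stuck_slaves(slaves, now, min_offline=timedelta(minutes=10)):
    """Return the sorted names of slaves offline for at least min_offline.

    `slaves` is a list of dicts with hypothetical keys:
      name (str), connected (bool), last_seen (datetime).
    """
    stuck = []
    for slave in slaves:
        if slave["connected"]:
            continue  # Connected slaves are healthy as far as this check goes.
        if now - slave["last_seen"] >= min_offline:
            stuck.append(slave["name"])
    return sorted(stuck)


# Made-up sample data; these are not real machines.
now = datetime(2018, 3, 8, 12, 0)
slaves = [
    {"name": "example-slave-1", "connected": False,
     "last_seen": now - timedelta(hours=3)},
    {"name": "example-slave-2", "connected": False,
     "last_seen": now - timedelta(minutes=2)},
    {"name": "example-slave-3", "connected": True, "last_seen": now},
]
stuck = find_stuck_slaves(slaves, now)
print(stuck)
```

The threshold is there so that slaves in the middle of a normal restart are not flagged; only ones that have stayed disconnected past it are reported.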

We currently have a single slave in this state, and I plan to leave it that way for investigation since having it down isn't causing substantial burden.

https://uberchromegw.corp.google.com/i/chromiumos.tryserver/buildslaves/cros-standard36-c2

 
I'm aware of this problem.
I'd like to say that buildslave maintenance is a separate concern from builder maintenance, and the latter is more important. I recently filed a bug to detect when a builder is compromised because of this or other issues: issue 819419

chrome-infra-labs should generally take care of buildslaves based on their own priorities. For -c2 builders, we should probably have a separate administrative process as well, to monitor and manage the pool health.

Let's keep this bug for that.
Owner: ----
Not currently on my plate.
Owner: dgarr...@chromium.org
Status: Assigned (was: Untriaged)
https://uberchromegw.corp.google.com/i/chromeos/builders/eve-release

cros-beefy59-c2 is offline in buildbot but online in GCP. Will reboot and see.
Blocking: 821169
Not sure what the semantics of "blocking" are
Labels: Infra-Troopers Hotlist-Deputy
Components: -Infra>Client>ChromeOS Infra
Labels: -Pri-3 Pri-2
Owner: ----
Status: Untriaged (was: Assigned)
I don't currently have a builder in this state (just rebooted them all), but it is an ongoing issue.
dgarrett: are you expecting some amount of trooper intervention here?
Yes. We don't normally own or interact with the buildbot client.

I consider this lower priority than it was because we have fewer buildbot builders than we used to, but it does still cause build failures.
If you want to hand it back until we have another stuck builder, that would be very reasonable.
Owner: dgarr...@chromium.org
Status: Assigned (was: Untriaged)
If you don't mind, I'll assign the bug to you, Don; otherwise we'll keep trying to triage the issue.
Cc: shapiroc@chromium.org
We currently have two examples of this. 

https://uberchromegw.corp.google.com/i/chromeos/buildslaves/build173-m2
https://uberchromegw.corp.google.com/i/chromeos/buildslaves/cros-beefy71-c2


The first is associated with the CQ, which has 2 redundant slaves. The second is causing an outage of veyron_jaq-release, so I'm going to capture logs and reboot it.
Actually, those symptoms appear to be different: I can't ssh into either machine.
Status: WontFix (was: Assigned)
As the number of buildbot builders goes down, this has been less of an issue.
