
Issue 654165

Starred by 3 users

Issue metadata

Status: Archived
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




ChromeOS GCE Slaves Don't Always Reconnect to Buildbot

Project Member Reported by dgarr...@chromium.org, Oct 8 2016

Issue description

Here is another example of a GCE slave not reconnecting to the buildbot master (see the gap after build #1795). I rebooted it, which caused it to reconnect and pick up builds normally.

https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_spring-release


This doesn't happen often, but it does happen from time to time, and the builders never recover without manual intervention. I've only noticed it on GCE, so I assume it's GCE-specific. Can we tweak the startup scripts so that slaves auto-reboot if they fail to connect for long enough?
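
The auto-reboot request above could be sketched roughly as follows. This is a minimal illustration, not the actual GCE startup script: `should_reboot`, `watchdog`, the 30-minute limit, and the `is_connected` and `reboot` callables are all hypothetical names and values chosen for the example. How a slave would actually detect that it is connected to the master is left to the caller.

```python
import time


def should_reboot(last_connected, now, max_offline_secs=30 * 60):
    """Return True once the slave has been offline longer than the limit."""
    return (now - last_connected) > max_offline_secs


def watchdog(is_connected, reboot, max_offline_secs=30 * 60, poll_secs=60,
             clock=time.time, sleep=time.sleep):
    """Poll the connection state and call reboot() after a long outage.

    is_connected and reboot are caller-supplied callables (hypothetical);
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    last_connected = clock()
    while True:
        if is_connected():
            # Any successful check resets the outage timer.
            last_connected = clock()
        elif should_reboot(last_connected, clock(), max_offline_secs):
            reboot()
            return
        sleep(poll_secs)
```

In a real deployment `reboot` might shell out to the system reboot command, and the limit would need to be longer than any expected master restart so that routine master maintenance does not reboot the whole fleet.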
 

Comment 1 by d...@chromium.org, Oct 9 2016

Cc: hinoka@chromium.org dsansome@chromium.org d...@chromium.org
Owner: ----
The GCE BuildBot slave management is definitely different from baremetal. It's actually supposed to be more robust :)

+dsansome@, +hinoka@ FYI. I haven't looked into this at all yet.

Comment 2 by hinoka@chromium.org, Oct 10 2016

They're supposed to retry forever to connect (unless the buildbot client process freezes, but doesn't crash, which has been seen to happen on rare occasions for yet-to-be-known reasons).

Would be good to dump out the logs in /var/log/messages/chromebuild/ when this happens
Owner: dgarr...@chromium.org
Excellent. Assigning the bug to me, so I can find it next time this happens.

Comment 4 by hinoka@chromium.org, Oct 10 2016

So I'm just speculating here, but the only times I've seen this happen, the cause was within the buildbot process itself: the buildbot slave either crashes without quitting, or thinks it still has an open connection to the master despite the connection being broken.

If that happens, you'll also have to look in /b/build/slave/twistd.log to confirm.

Comment 5 by autumn@chromium.org, Oct 11 2016

Labels: -current-issue
Cc: bhthompson@chromium.org dgarr...@chromium.org gkihumba@chromium.org
Issue 667815 has been merged into this issue.
I have three physical builders that seem to be in a similar state.
  build288-m2
  build293-m2
  build300-m2

On build300-m2, the directory /var/log/messages didn't exist at all. I then tried rebooting, and it reconnected.

Looking in /b/build/slave/twistd.log on build293-m2.
Issue 670835 has been merged into this issue.
The twisted log shows that this builder has been offline and unused for a long time. The buildbot history page has no history for this build machine, so this is consistent.

2016-08-26 14:11:45-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x793f200>
2016-08-26 14:12:48-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x793f200>
2016-08-26 14:12:48-0700 [-] Connecting to master2a.golo.chromium.org:31506
2016-08-26 14:12:53-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2016-08-26 14:12:53-0700 [Broker,client] Unhandled Error
        Traceback from remote host -- Traceback unavailable
        
2016-08-26 14:12:53-0700 [Broker,client] Lost connection to master2a.golo.chromium.org:31506
2016-08-26 14:12:53-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x793f200>
2016-08-26 14:12:53-0700 [-] Main loop terminated.
2016-08-26 14:12:53-0700 [-] Server Shut Down.

build288-m2 is in the same state, at NEARLY an identical time.

2016-08-26 14:12:34-0700 [-] Connecting to master2a.golo.chromium.org:31506
2016-08-26 14:12:36-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2016-08-26 14:12:36-0700 [Broker,client] Unhandled Error
        Traceback from remote host -- Traceback unavailable
        
2016-08-26 14:12:36-0700 [Broker,client] Lost connection to master2a.golo.chromium.org:31506
2016-08-26 14:12:36-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x6346cf8>
2016-08-26 14:12:36-0700 [-] Main loop terminated.
2016-08-26 14:12:36-0700 [-] Server Shut Down.
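
Both excerpts above follow the same pattern: a `failedToGetPerspective` error, a lost connection, and then "Server Shut Down." as the final line. A small script along these lines could check a twistd.log tail for that pattern. It is only a sketch: `parse_line` and `diagnose` are hypothetical helper names, and the regular expression assumes the timestamp and bracketed-tag layout seen in the lines quoted here.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2016-08-26 14:12:36-0700 [Broker,client] Lost connection to ...
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})([+-]\d{4}) \[([^\]]*)\] (.*)$")


def parse_line(line):
    """Return (timestamp, tag, message), or None for continuation lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1) + match.group(2),
                              "%Y-%m-%d %H:%M:%S%z")
    return stamp, match.group(3), match.group(4)


def diagnose(lines):
    """Summarise a twistd.log excerpt; returns None if nothing parses."""
    events = [event for event in map(parse_line, lines) if event is not None]
    if not events:
        return None
    last_stamp, _, last_message = events[-1]
    return {
        "last_timestamp": last_stamp,
        "failed_to_get_perspective": any(
            "failedToGetPerspective" in message for _, _, message in events),
        # True when the slave process exited rather than hanging.
        "process_exited": last_message.strip() == "Server Shut Down.",
    }
```

A result with `process_exited` set and an old `last_timestamp` would match the state described for build293-m2 and build288-m2: the slave process exited and nothing restarted it.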

Comment 11 by d...@chromium.org, Dec 2 2016

"-m2" builders don't belong on this bug. Baremetal and GCE builders run BuildBot differently: GCE builders have a monitoring superprocess that restarts the BuildBot slave process if it dies. Baremetal builders do not, so the behavior in #7, #9 (assuming that's a baremetal builder), and #10 is all expected.
Ah.... I thought that part was the same. Sorry.
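
For readers unfamiliar with the idea, the monitoring superprocess mentioned in Comment 11 can be pictured as a loop like the one below. This is a generic sketch of the restart-on-exit technique, not the real GCE tooling: `supervise`, `max_restarts`, and `backoff_secs` are hypothetical names, and the command to run is supplied by the caller.

```python
import subprocess
import time


def supervise(cmd, max_restarts=None, backoff_secs=5, sleep=time.sleep):
    """Run cmd, restarting it each time it exits; return the restart count.

    With max_restarts=None the loop never returns, which is the usual mode
    for a supervisor. The limit exists mainly so the loop can be tested.
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        proc.wait()  # blocks until the child exits or crashes
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        sleep(backoff_secs)  # brief pause to avoid a tight crash loop
```

Note that a loop like this only notices a child that has exited. It would not catch the case raised in Comment 2, where the buildbot client process freezes without crashing.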

Comment 13 by d...@chromium.org, Dec 3 2016

Sadly no. One of our SREs tried to switch our baremetal fleet over, but because of the complexities, nuances, and non-homogeneous nature of the fleet, lots of stuff broke and they had to roll it back. I haven't heard any updates on the effort since.
Status: Fixed (was: Untriaged)
This problem hasn't reappeared in a long time. Calling it fixed.

Comment 15 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 16 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 17 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 19 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
