ChromeOS GCE Slaves Don't Always Reconnect to Buildbot
Issue description
Here is another example of a GCE slave not reconnecting to the buildbot master (see the gap after build #1795). I rebooted it, which caused it to reconnect and pick up normally.
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_spring-release
This doesn't happen often, but it does happen from time to time, and the builders never recover without manual intervention. I've only noticed it on GCE, so I assume it's GCE-specific. Can we tweak the startup scripts so that slaves auto-reboot if they fail to connect for long enough?
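As a rough illustration of the auto-reboot idea, here is a minimal sketch of the decision a startup script could make. It is not taken from the actual startup scripts: the function name `should_reboot`, its parameters, and any threshold passed to it (for example 30 minutes) are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_reboot(last_connected: Optional[datetime],
                  started: datetime,
                  now: datetime,
                  max_offline: timedelta) -> bool:
    """Return True when the slave has gone longer than max_offline
    without a connection to the master.

    Hypothetical helper: if the slave never connected since startup,
    the offline period is measured from the startup time instead.
    """
    reference = last_connected if last_connected is not None else started
    return now - reference > max_offline
```

A wrapper would call this periodically and trigger the reboot only when it returns True, so a slave that reconnects on its own is left alone.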
,
Oct 10 2016
They're supposed to retry connecting forever (unless the buildbot client process freezes without crashing, which has been seen to happen on rare occasions for yet-to-be-known reasons). It would be good to dump the logs in /var/log/messages/chromebuild/ when this happens.
,
Oct 10 2016
Excellent. Assigning the bug to me, so I can find it next time this happens.
,
Oct 10 2016
So I'm just speculating here, but the only times I've seen this happen, the problem was within the buildbot process itself: the buildbot slave either crashes without quitting, or thinks it still has an open connection to the master despite the connection being broken. If that happens, you'll also have to look in /b/build/slave/twistd.log to confirm.
,
Oct 11 2016
,
Nov 30 2016
Issue 667815 has been merged into this issue.
,
Dec 2 2016
I have three physical builders that seem to be in a similar state: build288-m2, build293-m2, and build300-m2. On build300-m2, the directory /var/log/messages didn't exist at all. I then tried rebooting, and it reconnected. Looking in /b/build/slave/twistd.log on build293-m2.
,
Dec 2 2016
Issue 670835 has been merged into this issue.
,
Dec 2 2016
The twisted log shows that this builder has been offline and unused for a long time. The buildbot history page has no history for this build machine, so this is consistent.
2016-08-26 14:11:45-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x793f200>
2016-08-26 14:12:48-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x793f200>
2016-08-26 14:12:48-0700 [-] Connecting to master2a.golo.chromium.org:31506
2016-08-26 14:12:53-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2016-08-26 14:12:53-0700 [Broker,client] Unhandled Error Traceback from remote host -- Traceback unavailable
2016-08-26 14:12:53-0700 [Broker,client] Lost connection to master2a.golo.chromium.org:31506
2016-08-26 14:12:53-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x793f200>
2016-08-26 14:12:53-0700 [-] Main loop terminated.
2016-08-26 14:12:53-0700 [-] Server Shut Down.
,
Dec 2 2016
build288-m2 is in the same state, and at NEARLY an identical time.
2016-08-26 14:12:34-0700 [-] Connecting to master2a.golo.chromium.org:31506
2016-08-26 14:12:36-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2016-08-26 14:12:36-0700 [Broker,client] Unhandled Error Traceback from remote host -- Traceback unavailable
2016-08-26 14:12:36-0700 [Broker,client] Lost connection to master2a.golo.chromium.org:31506
2016-08-26 14:12:36-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x6346cf8>
2016-08-26 14:12:36-0700 [-] Main loop terminated.
2016-08-26 14:12:36-0700 [-] Server Shut Down.
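The pattern in these two logs (a failedToGetPerspective entry followed by "Server Shut Down.") could be spotted automatically. The following is a minimal sketch under stated assumptions: the function name `classify_twistd_tail` and the three state labels are hypothetical, and the line format is inferred only from the excerpts quoted in this thread.

```python
import re

# Timestamp, UTC offset, bracketed source, then the message text.
# Format inferred from the twistd.log excerpts quoted above.
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4} \[([^\]]*)\] (.*)$"
)


def classify_twistd_tail(lines):
    """Classify the final state recorded in a run of twistd.log lines.

    Returns one of three hypothetical labels:
      "running"                             no shutdown seen since the last start
      "shut_down"                           "Server Shut Down." seen
      "shut_down_after_failed_perspective"  shutdown preceded by failedToGetPerspective
    """
    saw_failed = False
    state = "running"
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip continuation lines and anything unparseable
        message = match.group(2)
        if message.startswith("Starting factory"):
            # A fresh start resets whatever the previous run recorded.
            saw_failed = False
            state = "running"
        elif "failedToGetPerspective" in message:
            saw_failed = True
        elif message.startswith("Server Shut Down"):
            state = ("shut_down_after_failed_perspective"
                     if saw_failed else "shut_down")
    return state
```

Fed the build288-m2 excerpt, this would report the shutdown-after-failure state, which separates a slave process that has exited from one that is still alive but not connected.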
,
Dec 2 2016
"-m2" builders don't belong on this bug. Baremetal and GCE builders run BuildBot differently. GCE have a monitoring superprocess that kicks BuildBot slave process back alive if it dies. Baremetal does not, so the behavior in #7, #9 (assuming that's a baremetal), and #10 are all expected.
,
Dec 2 2016
Ah.... I thought that part was the same. Sorry.
,
Dec 3 2016
Sadly no. One of our SREs tried to switch our baremetal fleet over but due to the complexities and nuances and non-homogeneous nature of the fleet, lots of stuff broke and they had to roll it back. I haven't heard any updates on the effort since.
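For readers unfamiliar with the monitoring superprocess mentioned above for GCE builders, the general idea can be sketched as a loop that relaunches a child process whenever it exits. This is not the actual GCE supervisor: the function name `supervise`, its parameters, and the restart cap are assumptions made so that the example terminates.

```python
import subprocess
import time


def supervise(cmd, max_restarts, delay_seconds=1.0):
    """Run cmd, relaunching it each time it exits, up to max_restarts times.

    Returns the list of exit codes, one per launch. A real supervisor
    would loop indefinitely; the cap here is only for illustration.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        process = subprocess.Popen(cmd)
        exit_codes.append(process.wait())  # blocks until the child exits
        if attempt < max_restarts:
            time.sleep(delay_seconds)  # brief pause before relaunching
    return exit_codes
```

A loop like this only reacts when the child process exits. A slave process that freezes without crashing, as described earlier in the thread, would never trip it, which is consistent with the rare stuck GCE slaves reported here.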
,
Jan 17 2017
This problem hasn't reappeared in a long time. Calling it fixed.
,
Mar 4 2017
,
Apr 17 2017
,
May 30 2017
,
Aug 1 2017
,
Oct 14 2017
Comment 1 by d...@chromium.org, Oct 9 2016
Owner: ----