
Issue 612538 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 613612




Please, restart slave33-c1

Project Member Reported by krasin@chromium.org, May 17 2016

Issue description

slave33-c1 has been offline for ~1 hour, which is stalling three bots:

http://build.chromium.org/p/chromium.fyi/builders/CFI%20Linux%20CF
http://build.chromium.org/p/chromium.fyi/builders/LTO%20Linux
https://build.chromium.org/p/chromium.fyi/builders/LTO%20Linux%20Perf

Please restart the slave.

Note: this is a recurrence of https://crbug.com/573350 and https://crbug.com/577059; you might want to consider better monitoring for this kind of issue.
 

Comment 1 by aga...@chromium.org, May 17 2016

Owner: aga...@chromium.org
Status: Fixed (was: Untriaged)
We do have monitoring for slaves being offline; however, a single slave on a .fyi waterfall is not important enough to fire an interrupt. Thanks for filing the bug; the slave is now running a build on https://build.chromium.org/p/chromium.fyi/builders/LTO%20Linux
Cc: hinoka@chromium.org
I poked around on the bot (and brought it back online as a side effect), trying to figure out what's wrong with it.

It seems to have been terminated cleanly by a command from a master (probably because it is marked as 'auto_reboot' in the master config), but wasn't able to come back online. In the process list I saw a stuck "gclient sync" process. I think we need to impose a timeout on the initial gclient sync in https://chromium.googlesource.com/infra/infra/+/master/infra/tools/bot_setup/start/chrome.py#99 (or figure out why it is getting stuck...)
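
For illustration only, here is a minimal sketch of what a timeout around the initial sync could look like. It is not the code in chrome.py: the helper name `run_with_timeout` and any timeout value are assumptions, and it relies on POSIX process groups so that helper processes spawned by the command are killed together with it.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s, cwd=None):
    """Run cmd and return its exit code, or None if it was killed on timeout.

    Hypothetical helper, not part of the infra repository.
    """
    # start_new_session=True makes the child the leader of a new session and
    # process group (POSIX only), so its process group ID equals its PID.
    proc = subprocess.Popen(cmd, cwd=cwd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the whole group, not just the direct child, so that any
        # subprocesses started by the command do not linger.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # Reap the child to avoid leaving a zombie.
        return None
```

A caller could then do something like `run_with_timeout(['gclient', 'sync'], timeout_s=30 * 60)` and retry or fall through to starting the slave when it returns None; the 30-minute budget is an arbitrary example, not a value from this bug.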

Comment 3 by krasin@chromium.org, May 20 2016

Cc: sergeybe...@chromium.org aga...@chromium.org
Owner: sergeybe...@chromium.org
Status: Assigned (was: Fixed)
Reopening the issue, as slave33-c1 is offline again. Please restart it.
Looking... The machine itself is online, it must be the slave process that's dead.
The slave process stopped logging after 2016-05-19 18:38:57-0700.

I restarted it, and it connected to the master. I'm not yet sure what happened this time.
Latest slave logs before the restart:

2016-05-19 18:36:35-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x7f2af48b3b90>
2016-05-19 18:36:35-0700 [-] Connecting to master1.golo.chromium.org:8111
2016-05-19 18:36:35-0700 [Broker,client] Lost connection to master1.golo.chromium.org:8111
2016-05-19 18:36:35-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x7f2af4bbf4d0> will retry in 7 seconds
2016-05-19 18:36:35-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x7f2af48b3b90>
2016-05-19 18:36:43-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x7f2af48b3b90>
2016-05-19 18:36:43-0700 [-] Connecting to master1.golo.chromium.org:8111
2016-05-19 18:36:43-0700 [Broker,client] Lost connection to master1.golo.chromium.org:8111
2016-05-19 18:36:43-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x7f2af4bbf4d0> will retry in 23 seconds
2016-05-19 18:36:43-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x7f2af48b3b90>
2016-05-19 18:37:07-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x7f2af48b3b90>
2016-05-19 18:37:07-0700 [-] Connecting to master1.golo.chromium.org:8111
2016-05-19 18:37:37-0700 [-] Connection to master1.golo.chromium.org:8111 failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused connection failure.
        ]
2016-05-19 18:37:37-0700 [-] <twisted.internet.tcp.Connector instance at 0x7f2af4bbf4d0> will retry in 56 seconds
2016-05-19 18:37:37-0700 [-] Stopping factory <buildslave.bot.BotFactory instance at 0x7f2af48b3b90>
2016-05-19 18:38:34-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x7f2af48b3b90>
2016-05-19 18:38:34-0700 [-] Connecting to master1.golo.chromium.org:8111
2016-05-19 18:38:57-0700 [Broker,client] message from master: attached
2016-05-19 18:38:57-0700 [Broker,client] removing old builder LTO Linux Perf
2016-05-19 18:38:57-0700 [Broker,client] I have a leftover directory 'goma_cache' that is not being used by the buildmaster: you can delete it now
2016-05-19 18:38:57-0700 [Broker,client] I have a leftover directory 'cert' that is not being used by the buildmaster: you can delete it now
2016-05-19 18:38:57-0700 [Broker,client] I have a leftover directory 'google-chrome-lto-perf-linux_64' that is not being used by the buildmaster: you can delete it now
2016-05-19 18:38:57-0700 [Broker,client] I have a leftover directory 'cache_dir' that is not being used by the buildmaster: you can delete it now
2016-05-19 18:38:57-0700 [Broker,client] I have a leftover directory 'cache' that is not being used by the buildmaster: you can delete it now
2016-05-19 18:38:57-0700 [Broker,client] Wanted directories: ['.svn', 'CFI_Linux_CF', 'cache_dir', 'cert', 'goma_cache', 'google-chrome-lto-linux_64', 'info']
2016-05-19 18:38:57-0700 [Broker,client] Actual directories: ['CFI_Linux_CF', 'cache', 'cache_dir', 'cert', 'goma_cache', 'google-chrome-lto-linux_64', 'google-chrome-lto-perf-linux_64', 'info']
2016-05-19 18:38:57-0700 [Broker,client] Deleting unwanted directory cache
2016-05-19 18:38:57-0700 [Broker,client] Deleting unwanted directory google-chrome-lto-perf-linux_64
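
As a rough sketch of how "stopped logging after ..." could be checked automatically, the snippet below extracts the newest timestamp from log lines in the format shown above and flags the log as silent if that timestamp is too old. The function names and the 30-minute threshold are assumptions for illustration; this is not an existing monitoring tool.

```python
import re
from datetime import datetime, timedelta

# Matches the leading timestamp of a log line, e.g. "2016-05-19 18:38:57-0700 ".
_TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) ")


def last_log_time(lines):
    """Return the newest timestamp found in lines, or None if there is none."""
    last = None
    for line in lines:
        match = _TIMESTAMP.match(line)
        if not match:
            continue  # Continuation lines (e.g. traceback tails) have no timestamp.
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S%z")
        if last is None or ts > last:
            last = ts
    return last


def is_silent(lines, now, max_silence=timedelta(minutes=30)):
    """Return True if the log has no timestamped line newer than max_silence."""
    last = last_log_time(lines)
    return last is None or now - last > max_silence
```

Here `now` must be a timezone-aware datetime, for example `datetime.now(timezone.utc)`, because the parsed timestamps carry a UTC offset.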

Status: Fixed (was: Assigned)
Closing the bug - the immediate problem is solved. Something is still unstable with this slave, and it's worth a deeper investigation. Filed http://crbug.com/613612 for tracking the stability issue.
Blocking: 613612

Comment 9 by krasin@chromium.org, May 20 2016

Thank you, Sergey!
