New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 635258 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

Look into Windows BuildBot-induced process killing.

Project Member Reported by d...@chromium.org, Aug 6 2016

Issue description

Filing this bug in response to: https://groups.google.com/a/chromium.org/d/topic/infra-dev/p-VhZbbHGP0/discussion

It looks like when BuildBot on Windows kills the "remote_run" subprocess, the signal is not caught by Butler or Annotee, causing:
* The Butler to not have a chance to flush its buffered logs.
* Annotee to not emit its bootstrap result file.
* Presumably, the Butler bootstrap result file mechanism is similarly broken.

I think that this is known to work on Linux/Mac, so this is probably the result of weak Windows signal handling support. Approach should be:
1) Identify what signal BuildBot is sending to "remote_run".
2) Figure out how to catch it in Go.
3) Catch it in Go and wire that into the standard kill/flush paths of Annotee and Butler.

Making P0 b/c this is actually impacting things ATM.
 

Comment 1 by d...@chromium.org, Aug 6 2016

Also this is probably a bug in Kitchen, too.

Comment 2 by d...@chromium.org, Aug 7 2016

Blocking: 633294

Comment 3 by d...@chromium.org, Aug 7 2016

Cc: vadimsh@chromium.org iannucci@chromium.org
From the log on the slave:

2016-08-03 14:59:39-0700 [-] command timed out: 2400 seconds without output, attempting to kill
2016-08-03 14:59:39-0700 [-] using TASKKILL /F /PID 4948 /T
2016-08-03 14:59:41-0700 [-] TASKKILL /F /PID 4948 /T failed: Command 'TASKKILL /F /PID 4948 /T' returned non-zero exit status 128
2016-08-03 14:59:41-0700 [-] trying process.signalProcess('KILL')
2016-08-03 14:59:41-0700 [-] Unhandled Error
        Traceback (most recent call last):
          File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 390, in startReactor
            self.config, oldstdout, oldstderr, self.profiler, reactor)
          File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 311, in runReactorWithLogging
            reactor.run()
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1158, in run
            self.mainLoop()
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1167, in mainLoop
            self.runUntilCurrent()
        --- <exception caught here> ---
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 789, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\b\build\third_party\buildbot_slave_8_4\buildslave\runprocess.py", line 709, in doTimeout
            self.kill(msg)
          File "C:\b\build\third_party\buildbot_slave_8_4\buildslave\runprocess.py", line 792, in kill
            self.process.signalProcess(self.KILL)
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\_dumbwin32proc.py", line 248, in signalProcess
            win32process.TerminateProcess(self.hProcess, 1)
        pywintypes.error: (5, 'TerminateProcess', 'Access is denied.')

2016-08-03 15:00:12-0700 [-] command finished with signal None, exit code 1, elapsedTime: 5312.210000
2016-08-03 15:00:12-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x04FFC378>
2016-08-03 15:00:12-0700 [-] Unhandled Error
        Traceback (most recent call last):
          File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 390, in startReactor
            self.config, oldstdout, oldstderr, self.profiler, reactor)
          File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 311, in runReactorWithLogging
            reactor.run()
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1158, in run
            self.mainLoop()
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1167, in mainLoop
            self.runUntilCurrent()
        --- <exception caught here> ---
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 789, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\_pollingfile.py", line 79, in _pollEvent
            workUnits += resource.checkWork()
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\_pollingfile.py", line 136, in checkWork
            self.cleanup()
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\_pollingfile.py", line 141, in cleanup
            self.lostCallback()
          File "C:\b\build\third_party\twisted_10_2\twisted\internet\_dumbwin32proc.py", line 294, in outConnectionLost
            self.proto.childConnectionLost(1)
        exceptions.AttributeError: 'NoneType' object has no attribute 'childConnectionLost'

2016-08-03 15:00:12-0700 [Broker,client] slave shutting down on command from master

Of note:
1) It looks like killing the process is actually failing.
1a) As an aside, the slave tries to kill via TASKKILL first, which I think is basically un-catchable.
2) 20 seconds later the slave decides to restart on command from master.

What I think is happening is this:
1) TASKKILL or "process.signalProcess('KILL')" is almost completely succeeding, but is unable to kill "remote_run".
1a) It is killing LogDog Butler and Annotee, as well as the recipe engine.
1b) Butler is *not* able to flush, and I don't think there's anything we can do about this short of implementing a softer Windows kill on BuildBot.
2) Because "remote_run" wasn't killed, it continues executing. Because Butler/Annotee were killed, the sentinel file was not dropped, so "remote_run" tries to re-run the recipe.
3) The slave shuts down 20 seconds later, interrupting "remote_run" (obviously).

I see two decent approaches to solving this. We could do one or both of:
1) Remove the redundant non-LogDog fallback from "remote_run" since LogDog bootstrapping is fairly stable at this point.
2) Have Butler/Annotee emit the sentinel file the moment the execution begins, then re-emit it with the return code once it finishes. This would cause the file to exist, and allow "remote_run" and other bootstrap utilities to detect the situation where the process didn't cleanly exit.

+iannucci@, +vadimsh@, you guys have worked in the Windows internals domain as well to some extent. Does this sound sane? Do any other options come to mind?

Comment 4 by d...@chromium.org, Aug 7 2016

Blocking: -633294

Comment 5 by abw@chromium.org, Aug 8 2016

Labels: -Pri-0 Pri-1
Based on crbug.com/633294#c26, I'm downgrading this from P0; please restore priority if this bug is indeed preventing teams from being able to work.


Comment 6 by benhenry@google.com, Aug 16 2016

Status: Assigned (was: Untriaged)

Comment 7 by d...@chromium.org, Sep 27 2016

Status: Fixed (was: Assigned)
I went with the sentinel file approach, so this should be resolved.

Sign in to add a comment