Look into Windows BuildBot-induced process killing.
Issue description

Filing this bug in response to: https://groups.google.com/a/chromium.org/d/topic/infra-dev/p-VhZbbHGP0/discussion

It looks like when BuildBot on Windows kills the "remote_run" subprocess, the signal is not caught by Butler or Annotee, causing:
* The Butler to not have a chance to flush its buffered logs.
* Annotee to not emit its bootstrap result file.
* Presumably, the Butler bootstrap result file mechanism is similarly broken.

I think this is known to work on Linux/Mac, so this is probably the result of weak Windows signal handling support. The approach should be:
1) Identify what signal BuildBot is sending to "remote_run".
2) Figure out how to catch it in Go.
3) Catch it in Go and wire that into the standard kill/flush paths of Annotee and Butler.

Making this P0 because it is actively impacting things at the moment.
Aug 7 2016
From the log on the slave:
2016-08-03 14:59:39-0700 [-] command timed out: 2400 seconds without output, attempting to kill
2016-08-03 14:59:39-0700 [-] using TASKKILL /F /PID 4948 /T
2016-08-03 14:59:41-0700 [-] TASKKILL /F /PID 4948 /T failed: Command 'TASKKILL /F /PID 4948 /T' returned non-zero exit status 128
2016-08-03 14:59:41-0700 [-] trying process.signalProcess('KILL')
2016-08-03 14:59:41-0700 [-] Unhandled Error
Traceback (most recent call last):
File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 390, in startReactor
self.config, oldstdout, oldstderr, self.profiler, reactor)
File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 311, in runReactorWithLogging
reactor.run()
File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1158, in run
self.mainLoop()
File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1167, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 789, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\b\build\third_party\buildbot_slave_8_4\buildslave\runprocess.py", line 709, in doTimeout
self.kill(msg)
File "C:\b\build\third_party\buildbot_slave_8_4\buildslave\runprocess.py", line 792, in kill
self.process.signalProcess(self.KILL)
File "C:\b\build\third_party\twisted_10_2\twisted\internet\_dumbwin32proc.py", line 248, in signalProcess
win32process.TerminateProcess(self.hProcess, 1)
pywintypes.error: (5, 'TerminateProcess', 'Access is denied.')
2016-08-03 15:00:12-0700 [-] command finished with signal None, exit code 1, elapsedTime: 5312.210000
2016-08-03 15:00:12-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x04FFC378>
2016-08-03 15:00:12-0700 [-] Unhandled Error
Traceback (most recent call last):
File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 390, in startReactor
self.config, oldstdout, oldstderr, self.profiler, reactor)
File "C:\b\build\third_party\twisted_10_2\twisted\application\app.py", line 311, in runReactorWithLogging
reactor.run()
File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1158, in run
self.mainLoop()
File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 1167, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "C:\b\build\third_party\twisted_10_2\twisted\internet\base.py", line 789, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\b\build\third_party\twisted_10_2\twisted\internet\_pollingfile.py", line 79, in _pollEvent
workUnits += resource.checkWork()
File "C:\b\build\third_party\twisted_10_2\twisted\internet\_pollingfile.py", line 136, in checkWork
self.cleanup()
File "C:\b\build\third_party\twisted_10_2\twisted\internet\_pollingfile.py", line 141, in cleanup
self.lostCallback()
File "C:\b\build\third_party\twisted_10_2\twisted\internet\_dumbwin32proc.py", line 294, in outConnectionLost
self.proto.childConnectionLost(1)
exceptions.AttributeError: 'NoneType' object has no attribute 'childConnectionLost'
2016-08-03 15:00:12-0700 [Broker,client] slave shutting down on command from master
Of note:
1) It looks like killing the process is actually failing.
1a) As an aside, the slave tries to kill via TASKKILL first, which I think is basically un-catchable.
2) 20 seconds later the slave decides to restart on command from master.
What I think is happening is this:
1) TASKKILL or "process.signalProcess('KILL')" is almost completely succeeding, but is unable to kill "remote_run".
1a) It is killing LogDog Butler and Annotee, as well as the recipe engine.
1b) Butler is *not* able to flush, and I don't think there's anything we can do about this short of implementing a softer Windows kill on BuildBot.
2) Because "remote_run" wasn't killed, it continues executing. Because Butler/Annotee were killed, the sentinel file was not dropped, so "remote_run" tries to re-run the recipe.
3) The slave shuts down 20 seconds later, interrupting "remote_run" (obviously).
I see two decent approaches to solving this. We could do one or both of:
1) Remove the redundant non-LogDog fallback from "remote_run" since LogDog bootstrapping is fairly stable at this point.
2) Have Butler/Annotee emit the sentinel file the moment the execution begins, then re-emit it with the return code once it finishes. This would cause the file to exist, and allow "remote_run" and other bootstrap utilities to detect the situation where the process didn't cleanly exit.
+iannucci@, +vadimsh@, you have both also worked in the Windows internals domain to some extent. Does this sound sane? Do any other options come to mind?
Aug 7 2016

Aug 8 2016
Based on crbug.com/633294#c26, I'm downgrading this from P0; please restore priority if this bug is indeed preventing teams from being able to work.
Aug 16 2016

Sep 27 2016
I went with the sentinel file approach, so this should be resolved.
Comment 1 by d...@chromium.org, Aug 6 2016