browser_tests occasionally leaves an unkillable renderer process, failing the shard
Issue description

Two recent examples of this on the tryservers:

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/84677
https://chromium-swarm.appspot.com/task?id=3fa439f416126e10&refresh=10
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/84581
https://chromium-swarm.appspot.com/task?id=3fa3c8cc3017ae10&refresh=10&show_raw=1

In both cases the symptom is the same: an unkillable renderer process was left running, causing Swarming to fail the entire shard. Log excerpt from the first:

-----
Retrying 1 test (retry #1)
[948/948] SpellcheckServiceBrowserTest.MultilingualPreferenceNotMigrated (2183 ms)
SUCCESS: all tests passed.
Failed to delete C:\b\s\w\ir (12 files remaining). Maybe the test has a subprocess outliving it. Sleeping 2 seconds.
Failed to delete C:\b\s\w\ir (12 files remaining). Maybe the test has a subprocess outliving it. Sleeping 4 seconds.
Failed to delete C:\b\s\w\ir. The following files remain:
- \\?\C:\b\s\w\ir
- \\?\C:\b\s\w\ir\out
- \\?\C:\b\s\w\ir\out\Release_x64
- \\?\C:\b\s\w\ir\out\Release_x64\browser_tests.exe
- \\?\C:\b\s\w\ir\out\Release_x64\chrome_100_percent.pak
- \\?\C:\b\s\w\ir\out\Release_x64\chrome_200_percent.pak
- \\?\C:\b\s\w\ir\out\Release_x64\icudtl.dat
- \\?\C:\b\s\w\ir\out\Release_x64\locales
- \\?\C:\b\s\w\ir\out\Release_x64\locales\en-US.pak
- \\?\C:\b\s\w\ir\out\Release_x64\natives_blob.bin
- \\?\C:\b\s\w\ir\out\Release_x64\resources.pak
- \\?\C:\b\s\w\ir\out\Release_x64\v8_context_snapshot.bin
Enumerating processes:
- pid 8656; Handles: 274; Exe: C:\b\s\w\ir\out\Release_x64\browser_tests.exe; Cmd: "C:\b\s\w\ir\out\Release_x64\browser_tests.exe" --type=renderer --disable-compositor-ukm-for-tests --dom-automation --enable-logging=stderr --file-url-path-alias="/gen=C:\b\s\w\ir\out\Release_x64\gen" --force-color-profile=srgb --ipc-connection-timeout=30 --test-type=browser --field-trial-handle=1468,3678886453054102039,4181609478204832977,131072 --enable-features=TestFeatureForBrowserTest1 --disable-features=NetworkPrediction,SpeculativePreconnect,TestFeatureForBrowserTest2,WebRTC-H264WithOpenH264FFmpeg --service-pipe-token=8227455578678677143 --lang=en-US --noerrdialogs --user-data-dir="C:\b\s\w\ittrto8m\scoped_dir8664_3332\d8664_9461" --enable-offline-auto-reload-visible-only --start-stack-profiler --device-scale-factor=1 --num-raster-threads=4 --enable-main-frame-before-activation --service-request-channel-token=8227455578678677143 --renderer-client-id=3 --no-v8-untrusted-code-mitigations --mojo-platform-channel-handle=2332 /prefetch:1
Terminating 1 processes:
- failed to kill 8656
...
1364 2018-08-30 15:36:48.115 E: Failure with [Error 5] Access is denied: u'\\\\?\\C:\\b\\s\\w\\ir\\out\\Release_x64\\browser_tests.exe'

*** Swarming tried multiple times to delete the run directory and failed ***
*** Hard failing the task ***

Swarming detected that your testing script ran an executable, which may have started a child executable, and the main script returned early, leaving the children executables playing around unguided.

You don't want to leave children processes outliving the task on the Swarming bot, do you? The Swarming bot doesn't.

How to fix?
- For any process that starts children processes, make sure all children processes terminated properly before each parent process exits. This is especially important in very deep process trees.
- This must be done properly both in normal successful task and in case of task failure. Cleanup is very important.
- The Swarming bot sends a SIGTERM in case of timeout.
- You have 30.0 seconds to comply after the signal was sent to the process before the process is forcibly killed.
- To achieve not leaking children processes in case of signals on timeout, you MUST handle signals in each executable / python script and propagate them to children processes.
- When your test script (python or binary) receives a signal (like SIGTERM or CTRL_BREAK_EVENT on Windows), send it to all children processes and wait for them to terminate before quitting.

See https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/doc/Bot.md#Graceful-termination_aka-the-SIGTERM-and-SIGKILL-dance for more information.
-----

Not sure how long this has been happening or whether this is a recently introduced problem. Only seen on Windows so far.
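
As a rough illustration of the graceful-termination guidance quoted above, a minimal Python 3 wrapper that forwards SIGTERM (or CTRL_BREAK_EVENT on Windows) to a single child and waits for it could look like the sketch below. This is not Swarming or Chromium code; run_with_cleanup and the 25-second grace period are assumptions, chosen only to fit under the 30-second deadline.

-----
import signal
import subprocess
import sys
import time


def run_with_cleanup(cmd):
  """Launches cmd and forwards termination signals to it, waiting for the
  child to exit (and clean up its own children) before this wrapper does."""
  creationflags = 0
  if sys.platform == 'win32':
    # A separate process group is needed so CTRL_BREAK_EVENT can be sent to
    # the child without also hitting this wrapper's own group.
    creationflags = subprocess.CREATE_NEW_PROCESS_GROUP
  proc = subprocess.Popen(cmd, creationflags=creationflags)

  def forward(_signum, _frame):
    # Forward the termination request, give the child time to shut down its
    # own subprocess tree, then escalate to a hard kill.
    if sys.platform == 'win32':
      proc.send_signal(signal.CTRL_BREAK_EVENT)
    else:
      proc.terminate()
    try:
      proc.wait(timeout=25)  # Assumed grace period, under the 30 s deadline.
    except subprocess.TimeoutExpired:
      proc.kill()
      proc.wait()
    sys.exit(proc.returncode)

  handled = [signal.SIGTERM]
  if sys.platform == 'win32':
    handled.append(signal.SIGBREAK)
  for sig in handled:
    signal.signal(sig, forward)

  # Poll rather than block in wait() so the signal handler gets a chance to
  # run promptly on Windows.
  while proc.poll() is None:
    time.sleep(0.1)
  return proc.returncode


if __name__ == '__main__':
  sys.exit(run_with_cleanup(sys.argv[1:]))
-----

The key point from the Swarming doc is the escalation order: forward the signal first, wait for the child to do its own cleanup, and only then hard-kill.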
Aug 31
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/e169b6059347c841e6dc0f738c87b0113051258e

commit e169b6059347c841e6dc0f738c87b0113051258e
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Fri Aug 31 18:34:26 2018

Print details when kill(9) fails

Occasionally browser_tests.exe is left running, and is unkillable. This changes the swarming code so that it gives more information about *why* the kill command fails.

Bug: chromium:879232
Change-Id: Ideadbc98b94e7ebf7a7d7dc357ce770a2a3d362e
Reviewed-on: https://chromium-review.googlesource.com/1197963
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: smut <smut@google.com>

[modify] https://crrev.com/e169b6059347c841e6dc0f738c87b0113051258e/client/utils/file_path.py
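
For illustration of the idea (not the actual file_path.py change; kill_with_details is a hypothetical helper name), surfacing why a Win32 kill fails amounts to capturing GetLastError() after OpenProcess/TerminateProcess, roughly:

-----
# Windows-only sketch: report *why* a kill fails instead of failing silently.
import ctypes

PROCESS_TERMINATE = 0x0001

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
kernel32.OpenProcess.restype = ctypes.c_void_p
kernel32.TerminateProcess.argtypes = [ctypes.c_void_p, ctypes.c_uint]
kernel32.CloseHandle.argtypes = [ctypes.c_void_p]


def kill_with_details(pid):
  """Terminates pid; on failure, prints which call failed and the error code
  (e.g. error 5 == ERROR_ACCESS_DENIED, as seen in the log above)."""
  handle = kernel32.OpenProcess(PROCESS_TERMINATE, False, pid)
  if not handle:
    print('OpenProcess(%d) failed: error %d' % (pid, ctypes.get_last_error()))
    return False
  try:
    if not kernel32.TerminateProcess(handle, 1):
      print('TerminateProcess(%d) failed: error %d' %
            (pid, ctypes.get_last_error()))
      return False
    return True
  finally:
    kernel32.CloseHandle(handle)
-----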
Aug 31
I spent a month tracking this same problem, but for the network process on Windows (bug 820996). The summary is that there's a race condition when Windows starts a process: Windows internally keeps the new process suspended until it returns to Chrome code, which then puts the process in a job object. If Chrome is killed during that window, the new process is left suspended and never makes it into the job object. This was very common with the network process, because as taskkill was killing child processes it would kill the network process, and the browser would then restart it while it was itself being killed. My fix made the network-service scenario go away, but it can still happen very rarely with renderers or even the GPU process; I did see rare cases of the latter. A more robust solution could be to record the path of the process being killed and, after taskkill runs, look for any remaining processes with that path and kill them too; a sketch of that sweep follows.
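
A minimal sketch of that sweep, assuming the third-party psutil package is available (kill_leftovers_by_exe is a hypothetical helper, not existing Swarming or Chromium code):

-----
import os

import psutil  # Third-party; assumed available for this sketch.


def kill_leftovers_by_exe(exe_path):
  """After the normal kill pass (e.g. taskkill /T), sweeps for any surviving
  processes whose executable matches exe_path and kills those too."""
  target = os.path.normcase(os.path.abspath(exe_path))
  leftovers = []
  for proc in psutil.process_iter(['pid', 'exe']):
    try:
      exe = proc.info['exe']
      if exe and os.path.normcase(exe) == target:
        proc.kill()
        leftovers.append(proc)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
      # The process already exited, or it is the unkillable case this bug is
      # about; either way there is nothing more this sweep can do.
      continue
  psutil.wait_procs(leftovers, timeout=10)
  return [p.pid for p in leftovers]
-----

Processes that are genuinely unkillable (the [Error 5] case in the log above) would still survive this sweep, but it catches the suspended-at-startup stragglers described here.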
Sep 4
Wow - awesome detective work on that other bug.
Comment 1 by brucedaw...@chromium.org, Aug 30