
Issue 879232

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug

Blocked on:
issue 820996




browser_tests occasionally leaves an unkillable renderer process, failing the shard

Project Member Reported by kbr@chromium.org, Aug 30

Issue description

Two recent examples of this on the tryservers:

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/84677
https://chromium-swarm.appspot.com/task?id=3fa439f416126e10&refresh=10

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/84581
https://chromium-swarm.appspot.com/task?id=3fa3c8cc3017ae10&refresh=10&show_raw=1

In both cases the symptom is the same: an unkillable renderer process was left running, causing Swarming to fail the entire shard. Log excerpt from the first:

-----
Retrying 1 test (retry #1)
[948/948] SpellcheckServiceBrowserTest.MultilingualPreferenceNotMigrated (2183 ms)
SUCCESS: all tests passed.
Failed to delete C:\b\s\w\ir (12 files remaining).
  Maybe the test has a subprocess outliving it.
  Sleeping 2 seconds.
Failed to delete C:\b\s\w\ir (12 files remaining).
  Maybe the test has a subprocess outliving it.
  Sleeping 4 seconds.
Failed to delete C:\b\s\w\ir. The following files remain:
- \\?\C:\b\s\w\ir
- \\?\C:\b\s\w\ir\out
- \\?\C:\b\s\w\ir\out\Release_x64
- \\?\C:\b\s\w\ir\out\Release_x64\browser_tests.exe
- \\?\C:\b\s\w\ir\out\Release_x64\chrome_100_percent.pak
- \\?\C:\b\s\w\ir\out\Release_x64\chrome_200_percent.pak
- \\?\C:\b\s\w\ir\out\Release_x64\icudtl.dat
- \\?\C:\b\s\w\ir\out\Release_x64\locales
- \\?\C:\b\s\w\ir\out\Release_x64\locales\en-US.pak
- \\?\C:\b\s\w\ir\out\Release_x64\natives_blob.bin
- \\?\C:\b\s\w\ir\out\Release_x64\resources.pak
- \\?\C:\b\s\w\ir\out\Release_x64\v8_context_snapshot.bin
Enumerating processes:
- pid 8656; Handles: 274; Exe: C:\b\s\w\ir\out\Release_x64\browser_tests.exe; Cmd: "C:\b\s\w\ir\out\Release_x64\browser_tests.exe" --type=renderer --disable-compositor-ukm-for-tests --dom-automation --enable-logging=stderr --file-url-path-alias="/gen=C:\b\s\w\ir\out\Release_x64\gen" --force-color-profile=srgb --ipc-connection-timeout=30 --test-type=browser --field-trial-handle=1468,3678886453054102039,4181609478204832977,131072 --enable-features=TestFeatureForBrowserTest1 --disable-features=NetworkPrediction,SpeculativePreconnect,TestFeatureForBrowserTest2,WebRTC-H264WithOpenH264FFmpeg --service-pipe-token=8227455578678677143 --lang=en-US --noerrdialogs --user-data-dir="C:\b\s\w\ittrto8m\scoped_dir8664_3332\d8664_9461" --enable-offline-auto-reload-visible-only --start-stack-profiler --device-scale-factor=1 --num-raster-threads=4 --enable-main-frame-before-activation --service-request-channel-token=8227455578678677143 --renderer-client-id=3 --no-v8-untrusted-code-mitigations --mojo-platform-channel-handle=2332 /prefetch:1
Terminating 1 processes:
- failed to kill 8656
...
1364 2018-08-30 15:36:48.115 E: Failure with [Error 5] Access is denied: u'\\\\?\\C:\\b\\s\\w\\ir\\out\\Release_x64\\browser_tests.exe'
*** Swarming tried multiple times to delete the run directory and failed ***
*** Hard failing the task ***

Swarming detected that your testing script ran an executable, which may have
started a child executable, and the main script returned early, leaving the
children executables playing around unguided.

You don't want to leave children processes outliving the task on the Swarming
bot, do you? The Swarming bot doesn't.

How to fix?
- For any process that starts children processes, make sure all children
  processes terminated properly before each parent process exits. This is
  especially important in very deep process trees.
  - This must be done properly both in normal successful task and in case of
    task failure. Cleanup is very important.
- The Swarming bot sends a SIGTERM in case of timeout.
  - You have 30.0 seconds to comply after the signal was sent to the process
    before the process is forcibly killed.
- To achieve not leaking children processes in case of signals on timeout, you
  MUST handle signals in each executable / python script and propagate them to
  children processes.
  - When your test script (python or binary) receives a signal like SIGTERM or
    CTRL_BREAK_EVENT on Windows), send it to all children processes and wait for
    them to terminate before quitting.

See
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/doc/Bot.md#Graceful-termination_aka-the-SIGTERM-and-SIGKILL-dance
for more information.
-----
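
For reference, the advice in the Swarming message above (forward the termination request to child processes and wait for them before exiting) might look roughly like the sketch below. This is only an illustration: `stop_children` and `install_handlers` are hypothetical names, not part of Swarming or browser_tests, and it assumes the children are `subprocess.Popen` objects. It would not necessarily help with a child that is stuck suspended.

```python
import signal
import subprocess
import sys
import time


def stop_children(children, grace_seconds=30.0):
    """Ask each child to exit, wait up to a shared deadline, then force-kill."""
    deadline = time.monotonic() + grace_seconds
    for child in children:
        if child.poll() is None:
            # POSIX: sends SIGTERM. Windows: calls TerminateProcess, which is
            # not graceful.
            child.terminate()
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            child.kill()
            child.wait()


def install_handlers(children):
    """Hypothetical helper: clean up children when asked to terminate."""
    def handler(signum, frame):
        stop_children(children)
        sys.exit(1)

    signal.signal(signal.SIGTERM, handler)
    # SIGBREAK only exists on Windows; it is delivered for CTRL_BREAK_EVENT.
    if hasattr(signal, 'SIGBREAK'):
        signal.signal(signal.SIGBREAK, handler)
```

The shared deadline keeps the total wait within the grace period instead of allowing that long per child.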

Not sure how long this has been happening or whether this is a recently introduced problem. Only seen on Windows so far.

 
I looked at the last 200 builds and couldn't see any signs of this particular failure (although it's not 100% clear how to spot it from the summaries so I might have missed one).

I'll try landing a change to get file_path.py to print the exception information to see if that tells us anything more. For instance, if a process has gone away by the time we try to kill it, then os.kill will raise an exception that says "WindowsError: [Error 87] The parameter is incorrect".
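
A minimal sketch of the idea, not the actual file_path.py change: `kill_and_report` is a hypothetical helper, and the `kill` parameter exists only so the function can be exercised without a real process. It returns the exception text instead of swallowing it, so the caller can log why the kill failed.

```python
import os
import signal


def kill_and_report(pid, kill=os.kill):
    """Try to kill pid; return (ok, detail) so the failure reason can be logged."""
    try:
        # On Windows, os.kill with SIGTERM terminates the process via
        # TerminateProcess.
        kill(pid, signal.SIGTERM)
        return True, ''
    except OSError as e:
        # WindowsError is an OSError (an alias of it on Python 3), so this
        # also catches "Access is denied" and "The parameter is incorrect".
        return False, '%s: %s' % (type(e).__name__, e)
```

A caller could then print something like "failed to kill 8656" followed by the returned detail, rather than the bare message seen in the log above.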

Project Member

Comment 2 by bugdroid1@chromium.org, Aug 31

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/e169b6059347c841e6dc0f738c87b0113051258e

commit e169b6059347c841e6dc0f738c87b0113051258e
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Fri Aug 31 18:34:26 2018

Print details when kill(9) fails

Occasionally browser_tests.exe is left running, and is unkillable. This
changes the swarming code so that it gives more information about *why*
the kill command fails.

Bug: chromium:879232
Change-Id: Ideadbc98b94e7ebf7a7d7dc357ce770a2a3d362e
Reviewed-on: https://chromium-review.googlesource.com/1197963
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: smut <smut@google.com>

[modify] https://crrev.com/e169b6059347c841e6dc0f738c87b0113051258e/client/utils/file_path.py

I spent a month tracking this same problem, but for the network process on Windows (bug 820996). In summary, there's a race condition while Windows is starting a process: internally the new process is paused until the call returns to Chrome code, which then puts it in a job object. If Chrome is killed during that window, the process is left suspended. This was very common with the network process because, as taskkill was killing child processes, it would kill the network process, and the browser would then restart it while the browser itself was being killed.

My fix made this network-service scenario go away, but it could still happen very rarely with renderers or even the GPU process; I did see rare cases of the latter.

A more robust solution could be to get the path of the process being killed, and then after taskkill runs we can look for any more processes with that path and also kill them.
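
A rough sketch of that second pass, under stated assumptions: `find_stragglers` is a hypothetical helper, and `processes` is assumed to be a list of `(pid, exe_path)` pairs from whatever process enumeration the bot already does (like the "Enumerating processes" step in the log). Paths are compared using Windows rules (case-insensitive, either slash direction).

```python
import ntpath


def find_stragglers(processes, exe_path):
    """Return pids whose executable path matches exe_path under Windows rules."""
    target = ntpath.normcase(ntpath.normpath(exe_path))
    return [
        pid for pid, exe in processes
        # Some processes may not expose a path; skip those.
        if exe and ntpath.normcase(ntpath.normpath(exe)) == target
    ]
```

Any pids returned after taskkill has run would be candidates for another kill attempt.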
Blockedon: 820996
Thanks jam@ for your diagnosis. Hoping we can find a more complete fix.

Wow - awesome detective work on that other bug.
