[network_service_]browser_tests leaving processes open on Win7 Tests (dbg)(1) |
|||||||||||||||||||||||||||
Issue descriptionFiled by sheriff-o-matic@appspot.gserviceaccount.com on behalf of dalecurtis@chromium.org network_service_browser_tests failing on multiple builders Builders failed on: - Win7 Tests (dbg)(1): https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29
,
Oct 24
#2 is issue 845703 apparently.
,
Oct 24
#2.1: It's unable to remove the file because it has to kill a bunch of child processes; many of those appear to be instances of the audio service given "--service-sandbox-type=audio" in the command line?
,
Oct 24
+olka,maxmorin who are working on audio process.
,
Oct 25
Looks like #1 affects the regular (non-network_service) browser_tests as well, e.g.: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72498 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72509
,
Oct 25
Relevant extract of a failure log:
SUCCESS: all tests passed.
Failed to delete e:\b\s\w\ir (4 files remaining).
Maybe the test has a subprocess outliving it.
Sleeping 2 seconds.
Failed to delete e:\b\s\w\ir (4 files remaining).
Maybe the test has a subprocess outliving it.
Sleeping 4 seconds.
Failed to delete e:\b\s\w\ir. The following files remain:
- \\?\e:\b\s\w\ir
- \\?\e:\b\s\w\ir\out
- \\?\e:\b\s\w\ir\out\Debug
- \\?\e:\b\s\w\ir\out\Debug\browser_tests.exe
Enumerating processes:
- pid 2384; Handles: 3; Exe: None; Cmd: "e:\b\s\w\ir\out\Debug\browser_tests.exe" --type=utility --field-trial-handle=848,4039392711463652616,14240082366403410557,131072 --enable-features=TestFeatureForBrowserTest1 --disable-features=TestFeatureForBrowserTest2 --lang=en-US --service-sandbox-type=audio --enable-logging=stderr --override-use-software-gl-for-tests --noerrdialogs --user-data-dir="e:\b\s\w\it86yrhb\scoped_dir9544_11652\d9544_15283" --enable-logging=stderr --ipc-connection-timeout=45 --service-request-channel-token=8849658823054802485 --mojo-platform-channel-handle=3132 --ignored=" --type=renderer " /prefetch:8
<snip: other processes, many but not all of which have --service-sandbox-type=audio>
Terminating 12 processes:
- 2384 killed
- 3516 killed
- 5640 killed
- 6160 killed
- 6324 killed
- 6364 killed
- 7384 killed
- 7516 killed
- 8048 killed
- 8276 killed
- 8416 killed
- 9288 killed
*** Swarming tried multiple times to delete the run directory and failed ***
*** Hard failing the task ***
Swarming detected that your testing script ran an executable, which may have
started a child executable, and the main script returned early, leaving the
children executables playing around unguided.
You don't want to leave children processes outliving the task on the Swarming
bot, do you? The Swarming bot doesn't.
How to fix?
- For any process that starts children processes, make sure all children
processes terminated properly before each parent process exits. This is
especially important in very deep process trees.
- This must be done properly both in normal successful task and in case of
task failure. Cleanup is very important.
- The Swarming bot sends a SIGTERM in case of timeout.
- You have 30.0 seconds to comply after the signal was sent to the process
before the process is forcibly killed.
- To achieve not leaking children processes in case of signals on timeout, you
MUST handle signals in each executable / python script and propagate them to
children processes.
- When your test script (python or binary) receives a signal like SIGTERM or
CTRL_BREAK_EVENT on Windows), send it to all children processes and wait for
them to terminate before quitting.
See
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/doc/Bot.md#Graceful-termination_aka-the-SIGTERM-and-SIGKILL-dance
for more information.
*** May the SIGKILL force be with you ***
,
Oct 25
Also note that the audio service seems to be a common, but not the only offender. E.g. on https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72509, there was only one process left alive: - pid 4568; Handles: 3; Exe: None; Cmd: "e:\b\s\w\ir\out\Debug\browser_tests.exe" --type=renderer --disable-compositor-ukm-for-tests --dom-automation --enable-logging=stderr --force-color-profile=srgb --ipc-connection-timeout=45 --test-type=browser --field-trial-handle=1188,3023676602794448719,13131676868034973992,131072 --enable-features=TestFeatureForBrowserTest1 --disable-features=TestFeatureForBrowserTest2 --service-pipe-token=1809614941905085345 --lang=en-US --noerrdialogs --user-data-dir="e:\b\s\w\itkzoa2l\scoped_dir3196_21956\d3196_10551" --enable-offline-auto-reload-visible-only --device-scale-factor=1 --num-raster-threads=4 --enable-main-frame-before-activation --service-request-channel-token=1809614941905085345 --renderer-client-id=11 --no-v8-untrusted-code-mitigations --mojo-platform-channel-handle=3564 /prefetch:1 The "--renderer-client-id" switch means it's a renderer process, but that's all I could find.
,
Oct 25
olka@, could you please look into the leftover audio processes, or reassign to someone who will?
,
Oct 25
Audio process launch and lifetime are controller by ServiceManager: it's launched on the browser post-startup, and shut down on the browser shut down. Each browser test would launch the whole browser (and the audio process), and should tear down everything. So I assume multiple audio process instances mean the audio process was not teared down correctly for multiple tests. Also, the audio process is not the only child process hanging: [https://chromium-swarm.appspot.com/task?id=40c2115a2c234110&refresh=10&show_raw=1]: 2 audio processes and 3 renderers [https://chromium-swarm.appspot.com/task?id=40c1df5d52894710&refresh=10&show_raw=1]: renderer and another utility process [https://chromium-swarm.appspot.com/task?id=40c0fb4d0afe5110&refresh=10&show_raw=1]- 5 renderers and 7 audio processes. Thought I'm still surprised by the amount of audio processes hanging.
,
Oct 25
rockot@ - is it possible that there is a race between StartService() call [1] and a request to bind an audio service interface, which would result in the audio process being launched twice? [1] https://cs.chromium.org/chromium/src/content/browser/browser_main_loop.cc?type=cs&g=0&l=1678
,
Oct 25
No. StartService is just a way of pre-warming an instance. If the instance is already running by the time the Service Manager receives the StartService request, it does nothing.
,
Oct 25
Thanks Ken - so there is no race logically. Could still there be a race (a bug) if the service is launching but not fully running yet and another request to launch it is received? (Since the audio service runs out of process, the period until it's fully started may be milliseconds.)
,
Oct 25
No, there's no such race. Given the current usage, there is really no way for the Service Manager to run two instances of the service at the same time.
,
Oct 26
We do not control the audio process lifetime, besides pre-warming it (#11-12) - it's a chrome utility processed launched by the service manager. Basing on the audio process being not the only one left behind (#9), looks like there is some general issue with cleaning up child processes on the browser shutdown? The fact that audio process instances dominate may be due to them being shut down during the browser shutdown; while renderers may be explicitly shut down by the browser tests before that. In any case, isn't the browser process supposed to clean up all the child processes? treib@ do you know who can take a look?
,
Oct 26
Sorry, but I know very little about all this. If none of the CCed people feel they can properly triage this, I'll just have to leave it for the current sheriffs. One thing that seems odd to me: Even if the ServiceManager somehow failed to properly shutdown these processes (or started too many of them, etc), the test runner script tries to kill them and fails. So is there some problem with these processes not handling SIGTERM correctly? If so, is that a bug in some common process infrastructure, or separate bugs in both the renderer and the audio processes?
,
Oct 26
+Marina: can #16 be somehow related to the audio process sandbox?
,
Oct 26
This doesn't look new, I'm seeing it back as far as 2018-10-17...
,
Oct 26
The earliest I found on Win7 Tests (dbg)(1) is from 2018-10-15: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72282 It doesn't mean it never happened before then, but it definitely seemed to be happening more frequently from then on.
,
Oct 26
I'm renaming to make it more clear what the underlying issue is; this isn't normal test flakiness but a really specific issue with child processes.
,
Oct 26
,
Oct 29
rockot@ do you know who could take a look at this?
,
Oct 29
Took a quick look, but I wasn't able to tell if this is related to audio process sandbox, I'll have a closer look when I'm back from vacation next week. Not sure how does the test script send CTRL_BREAK_EVENT to child processes. +jam@, I came across this very similar bug[1], looks like you might be able to help out here, please have a look. [1] https://crbug.com/820996
,
Nov 5
A recent run https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72752 shows the same thing: "Swarming detected that your testing script ran an executable, which may have started a child executable, and the main script returned early, leaving the children executables playing around unguided." So I'm keeping the sheriff label on this one. But I don't know who to assign this to, I can't tell what is causing this.
,
Nov 5
Is this the same as https://crbug.com/876224? https://bugs.chromium.org/p/chromium/issues/detail?id=876224#c7 (Aug 21) says: "After going through a bunch of the logs I'm seeing three main failure states for tests on this bot. Which lead me to believe that it currently cannot cleanly tear down the renderer process: 1) Can't run the renderer service: [...] 2) Can't shut down the previous instance: [...] 3) Abnormal Renderer Teardown [...]"
,
Nov 6
to Marina to take a look at sandbox.
,
Nov 6
,
Nov 7
Remove Sheriff-Chromium since owners are looking...
,
Nov 14
,
Nov 14
Sorry I just saw this, was OOO when you added me and didn't see the bug update since (please feel free to IM me if I'm not responding). The first thing is to figure out the type of process that's hanging. I looked at the link in comment 24 and it looks like it's a renderer process (see --type=renderer in the command line). SO I don't think it's related to the audio process, as that would be --type=utility. +Will
,
Nov 14
Indeed most of the hangs are renderer hangs (~70% as far as I've seen), and the rest are in utility process, plus a few in gpu and ppapi. The hangs go as back as 9/5 [1]. Here are some recent utility process hangs: [2][3][4]. One strange thing that I noticed is that --service-sandbox-type=audio is passed to what looks like the network service running in utility process, for example (from [4]): "e:\b\swarm_slave\w\ir\out\Debug\browser_tests.exe" --type=utility --field-trial-handle=1124,18429007685126664494,3859344750570234916,131072 --enable-features=NetworkService,TestFeatureForBrowserTest1 --disable-features=TestFeatureForBrowserTest2 --lang=en-US --service-sandbox-type=audio --enable-logging=stderr --override-use-software-gl-for-tests --proxy-server=127.0.0.1:60378 --noerrdialogs --user-data-dir="e:\b\swarm_slave\w\itwejlw2\scoped_dir6252_31713\d6252_29755" --enable-logging=stderr --ipc-connection-timeout=45 --service-request-channel-token=13795671114383798216 --mojo-platform-channel-handle=4800 --ignored=" --type=renderer " /prefetch:8 IIRC this argument should only be passed to an utility process if it's running audio service. Perhaps there's a bug in setting the sandbox type, I'll check. [1] https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/71338 [2] https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72902 [3] https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72850 [4] https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20Tests%20%28dbg%29%281%29/72846
,
Nov 14
Setting sandbox type looks correct, though --service-sandbox-type arg is set on utility process start, so in a browser_test run the arg must be set earlier - somewhere in infrastructure I assume. The arg doesn't seem to be coming from field trial config either, features are passed in using --enable-features arg. I'll check infra scripts.
,
Nov 16
Sandbox type is not set in infrastructure, in fact that command(#31) comes from browser process starting audio utility process on browser startup [2]. That's when the sandbox type arg is set[3], while also inheriting feature arg (NetworkService), hence my confusion. So all hanging utility processes with --service-sandbox-type=audio are in fact audio processes. Looking into why they hang. [1] ".\browser_tests.exe" --brave-new-test-launcher --cfi-diag=0 --disable-gpu-process-for-dx12-vulkan-info-collection --enable-features=NetworkService --gtest_also_run_disabled_tests --gtest_filter=PlatformAppBrowserTest.AppWithContextMenuClicked --single_process --test-launcher-bot-mode --test-launcher-filter-file=../../testing/buildbot/filters/mojo.fyi.network_browser_tests.filter --test-launcher-output="e:\b\s\w\itztv3m9\scoped_dir3304_16022\results3304_9913\test_results.xml" --test-launcher-summary-output="e:\b\s\w\iovwpqov\output.json" --user-data-dir="e:\b\s\w\itztv3m9\scoped_dir3304_16022\d3304_20031" [2] https://cs.chromium.org/chromium/src/content/browser/browser_main_loop.cc?rcl=0eda6776aa9c0c2a312826762b0a5f83fb36a360&l=1665 [3] https://cs.chromium.org/chromium/src/content/browser/utility_process_host.cc?rcl=0eda6776aa9c0c2a312826762b0a5f83fb36a360&l=337
,
Nov 22
Still not sure what's the root cause, here are my findings. Sandboxed audio service process stop sequence: service manager context is shutdown destroying service manager, service_manager destructor stops and deletes all service instances which shuts down and destroys audio service; audio process is then destroyed by worker thread monitoring target process events (TargetEventsThread). Afterwards, browser clean up destroys all child process host object (BrowserChildProcessHostImpl::TerminateAll) - this generally doesn't succeed in killing audio process as it was already terminated. Overall, nothing particular about the audio process shutdown sequence. What could go wrong on stop sequence leading to processes outliving browser_test executable: 1. TargetEventsThread ends up being killed early somehow; in this case audio process should still be killed by BrowserChildProcessHostImpl::TerminateAll. 2. On BrowserChildProcessHostImpl::TerminateAll, Process::Terminate does not wait for processes to terminate (called with wait= false). The obvious fix here would be to call process::Terminate with wait=true, I'm pretty sure we don't want to do that though. @jam, @rockot any thoughts on this? Worth mentioning that this bug doesn't seem to happen so frequently lately, I was looking at the last 200 bot runs, out of which this bug reproduced in 7 runs, out of which 2 runs contained audio processes left behind. Also, since overall most hanging processes are renders, I'm making this bug available for someone with more experience with renderers to pick up. @all cc-ed, please help out with reassigning. It might also be the case that this is an infrastructure problem, the type of processes outliving browser_tests executable is quite diverse: I came across renders, audio, network, ppapi, gpu, pdf-compositor and utility processes. I haven't dug too much into infrastructure, I've mostly looked into audio process life cycle. So I'm a bit out of ideas and I need to move on to other more urgent problems for now, feel free to reassign this to me if it turns out to be indeed an audio specific issue.
,
Nov 28
@jam, @rockot, please have a look at #34. Adding Internals>Core component and Sherrif label, please help find the suitable owner for investigating renderer process hangs.
,
Nov 28
This is a P1 bug, and need an owner. jam@: can you help triaging this bug?
,
Nov 28
Sounds potentially like the browser doesn't properly reap child processes on shutdown? I don't know specifically what could be going wrong here. I do know that if the browser process is shut down, any service processes will detect this. At least for the network service, this always means process self-termination *unless* the main or IO thread is hung for some reason.
,
Nov 28
Correct, in the "normal case", the child processes are correctly cleaned up. From the three weeks I spent investigating bug 820996 , there are very rare cases where the browser process is starting a process and inside Windows code it's started as paused. If the browser process is killed at that exact moment, the child process is not cleaned up. I made this very rare for the network process, since we used to make this more likely (Windows would kill the network process, and the browser process would notice it and restart it as the browser is being killed). Some options: 1) We could add code to guard against this, but it would have to be added to all frameworks (e.g. layout tests, browser tests, telemetry). The parent test launcher (for browser tests), telemetry python process, and layout tests script would then ensure that all processes are killed. 2) On Windows 10 and after, nested job objects would prevent this, which is why we're only seeing this on Windows 7. So this will go away once we stop supporting Win 7, but that'll be for a while. 3) Maybe with the knowledge that this race exists on Windows itself, we can relax the swarming check. i.e. on Win 7 it can kill the remaining processes itself.
,
Nov 28
,
Nov 28
The test harnesses are supposed to be doing 1) already. We've spent a lot of time trying to be sure that that's true in the layout tests, at least. However, you're correct that on Windows we really need to use nested job objects and we don't.
,
Nov 30
Issue 910691 has been merged into this issue.
,
Dec 3
,
Dec 3
+Erik FYI
,
Jan 15
,
Jan 15
|
|||||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||||
Comment 1 by dalecur...@chromium.org
, Oct 24