ServiceManagerContextTest.TerminateOnServiceQuit causing content_browsertests failures
Issue description
Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of joedow@chromium.org

content_browsertests failing on chromium.win/Win 7 Tests x64 (1)

Builders failed on:
- Win 7 Tests x64 (1): https://build.chromium.org/p/chromium.win/builders/Win%207%20Tests%20x64%20%281%29

Looking back through this builder's history (for at least several days), I see numerous cases where the tests all pass but the run is marked as failed due to a running child subprocess:

Enumerating processes:
- pid 5104; Handles: 3; Exe: None; Cmd: "e:\b\s\w\ir\out\Release_x64\content_browsertests.exe" --type=gpu-process --field-trial-handle=896,12648372179796321696,3462781468841023858,131072 --disable-gl-drawing-for-tests --override-use-software-gl-for-tests --gpu-preferences=KAAAAAAAAAAAJwAAAQAAAAAAAAAAAGAAAQAAAAAAAAAIAAAAAAAAACgAAAAEAAAAIAAAAAAAAAAoAAAAAAAAADAAAAAAAAAAOAAAAAAAAAAQAAAAAAAAAAAAAAAKAAAAEAAAAAAAAAAAAAAACwAAABAAAAAAAAAAAQAAAAoAAAAQAAAAAAAAAAEAAAALAAAA --gpu-vendor-id=0xffff --gpu-device-id=0xffff --gpu-driver-vendor=swiftshader --gpu-driver-version --gpu-driver-date --ipc-connection-timeout=30 --service-request-channel-token=65091B429CEF16DF24DDAF2609B793B9 --mojo-platform-channel-handle=912 --ignored=" --type=renderer " /prefetch:2
Terminating 1 processes:
- 5104 killed

*** Swarming tried multiple times to delete the run directory and failed ***
*** Hard failing the task ***

Swarming detected that your testing script ran an executable, which may have started a child executable, and the main script returned early, leaving the children executables playing around unguided. You don't want to leave children processes outliving the task on the Swarming bot, do you? The Swarming bot doesn't.

How to fix?
- For any process that starts children processes, make sure all children processes terminated properly before each parent process exits. This is especially important in very deep process trees.
- This must be done properly both in normal successful task and in case of task failure. Cleanup is very important.
- The Swarming bot sends a SIGTERM in case of timeout.
- You have 30.0 seconds to comply after the signal was sent to the process before the process is forcibly killed.
- To achieve not leaking children processes in case of signals on timeout, you MUST handle signals in each executable / python script and propagate them to children processes.
- When your test script (python or binary) receives a signal like SIGTERM or CTRL_BREAK_EVENT on Windows), send it to all children processes and wait for them to terminate before quitting.
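
For reference, the cleanup that the Swarming message asks for can be sketched in Python roughly as follows. This is only an illustration, not the content_browsertests launcher: `reap` and `install_forwarder` are hypothetical helper names, the 30-second default mirrors the grace period quoted above, and only direct children are handled (each level of a deeper process tree would need to do the same).

```python
import signal
import subprocess
import sys
import time


def reap(children, grace_seconds=30.0):
    """Ask every still-running child to stop, wait, then kill stragglers.

    `children` is a list of subprocess.Popen objects (hypothetical helper).
    """
    for child in children:
        if child.poll() is None:
            child.terminate()  # SIGTERM on POSIX, TerminateProcess on Windows

    # One shared deadline, so the total wait stays within the grace period.
    deadline = time.monotonic() + grace_seconds
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            child.kill()
            child.wait()


def install_forwarder(children):
    """On SIGTERM, clean up the children before the parent itself exits."""

    def handler(signum, _frame):
        reap(children)
        sys.exit(128 + signum)

    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, handler)
```

A launcher would call `install_forwarder(children)` once at startup and `reap(children)` in a `finally` block, so that both the success path and the failure path leave no process behind.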
,
Jan 23 2018
Adding to tracking list to find an owner
,
Jan 24 2018
Issue 804938 has been merged into this issue.
,
Jan 24 2018
I have found one commonality in the runs which fail. This test is included in the retry list: ServiceManagerContextTest.TerminateOnServiceQuit In the successful runs, this test passes the first time. Note that in both failure and success scenarios, there can be 4-15 tests which require a rerun to pass, however the only test which showed up in the failure scenario and not the passing scenario was this one.
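
The comparison described above amounts to a set difference: tests retried in every failing run, minus tests retried in any passing run. A minimal sketch, assuming each run is represented as a list of retried test names; the `suspects` helper and every test name other than ServiceManagerContextTest.TerminateOnServiceQuit are hypothetical placeholders, not data from the actual builds:

```python
def suspects(failing_runs, passing_runs):
    """Return tests retried in every failing run and in no passing run."""
    if not failing_runs:
        return set()
    in_all_failures = set.intersection(*map(set, failing_runs))
    in_any_pass = set().union(*map(set, passing_runs))
    return in_all_failures - in_any_pass


# Hypothetical retry lists, for illustration only.
failing = [
    ["ServiceManagerContextTest.TerminateOnServiceQuit", "Placeholder.FlakyA"],
    ["ServiceManagerContextTest.TerminateOnServiceQuit", "Placeholder.FlakyB"],
]
passing = [
    ["Placeholder.FlakyA", "Placeholder.FlakyB"],
    ["Placeholder.FlakyB"],
]
result = suspects(failing, passing)
```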
,
Jan 24 2018
I'm confident this test is to blame. It looks like the test times out (i.e. the service doesn't quit), and then the swarming bot kills the process and marks the whole suite as failed (even though a follow-up test succeeds).
,
Jan 24 2018
,
Jan 24 2018
Removing the sheriff tag since this one has a workaround now.
,
Jan 24 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/01a9285084115732acecede498d535ea4cfffe83

commit 01a9285084115732acecede498d535ea4cfffe83
Author: Joe Downing <joedow@chromium.org>
Date: Wed Jan 24 21:45:36 2018

Disabling ServiceManagerContextTest.TerminateOnServiceQuit for Windows

Over the last few days we've seen several cases of {viz_}content_browsertests failing even when all of the retries pass. The reason for the failure is that the swarming bot cannot clean up the directory due to child process(es) still running.

The one thing all of the failures have in common is that the initial run of TerminateOnServiceQuit timed out. Perhaps there is some additional clean up the test framework needs to do in this case.

TBR=rockot@chromium.org
BUG=804937
Change-Id: I6937c038f91f7b254ab7ad2b8005c94ca064a4ee
Reviewed-on: https://chromium-review.googlesource.com/884230
Reviewed-by: Joe Downing <joedow@chromium.org>
Commit-Queue: Joe Downing <joedow@chromium.org>
Cr-Commit-Position: refs/heads/master@{#531694}

[modify] https://crrev.com/01a9285084115732acecede498d535ea4cfffe83/content/browser/service_manager/service_manager_context_browsertest.cc
Comment 1 by joedow@chromium.org, Jan 23 2018
Labels: Test-Flaky OS-Windows