Test launcher not terminating hung child processes on Fuchsia |
||
Issue descriptionreal_wez> Uh oh, looks like the builder has gone rogue again :-/ https://build.chromium.org/p/chromium.fyi/builders/Fuchsia/builds/8758 12:07 PM <•scottmg> oooooh, and there is actually an exited=1 there for one of the subprocs 12:08 PM <•scottmg> jamesr_ and i hypothesized that might mean that crashlogger crashed 12:08 PM <•scottmg> or it could quite plausibly be something completely else. but we have a trickle of data now! 12:09 PM <real_wez> Ah, trickle-down debugging 12:09 PM <•scottmg> (maybe the parent waiting for all of them somehow...?) 12:10 PM <•scottmg> 2 of the subprocs are "exited" but not terminated, and 2 are still running https://gist.github.com/sgraham/74ecf2390b3cd91a6d818e30fe87da31 closesgraham — 14 Aug 2017 12:10 PM <real_wez> scottmg: Where are you pulling "exited" from? Is that based on the signals on the process mx_handle_t? 12:11 PM <•scottmg> real_wez: yes, it's mx_object_get_info https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?q=mx_object_get_info&sq=package:chromium&l=1186 12:11 PM <real_wez> Hmmm, nothing has a debugger attached, but then the crashlogger is registered as the OS-level exception handler, right? 12:12 PM <real_wez> If crashlogger is itself being terminated, or crashing, though, then the default behaviour of the exception handling APIs is to let the process continue, rather than leaving it stalled 12:12 PM <•scottmg> yeah, don't think crashlogger is a debugger though 12:13 PM <real_wez> You have to attach the handler with the quiet-disconnect (or something) flag to get the process to stay stalled 12:13 PM <•scottmg> looking at ProcessDispatcher::GetInfo in the kernel, i think that means it's either State::DEAD or State::DYING 12:14 PM <real_wez> Do processes hit those states as soon as the exception-handler is triggered on them? 12:14 PM <•scottmg> oh i see, you mean debugger_exception_port_... hmm, ok, maybe 12:14 PM <•scottmg> not sure, paging cpu maybe? 12:14 PM <real_wez> cpu to the rescue? 12:18 PM <•scottmg> maybe i should have grabbed state of all threads too, it looks like the last DYING thread moves the process to DEAD 12:33 PM <•scottmg> another interesting thing is that two time out at the same time https://gist.github.com/sgraham/6f50f26dd0d54643c1aae20605c2df8b closesgraham — 14 Aug 2017 12:34 PM <•scottmg> whcih means we return false to here https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=8d7260b2352d7e5539160a015e62d828e4b3278e&l=343 12:35 PM <•scottmg> and so call Terminate() here https://cs.chromium.org/chromium/src/base/process/process_fuchsia.cc?rcl=8d7260b2352d7e5539160a015e62d828e4b3278e&l=128 12:36 PM <•scottmg> and we call that with wait=true, which is supposed to mean it mx_task_kill()s and then waits 60s 12:37 PM <•scottmg> the return_code has been set to -1, so i think that means it's toast 12:39 PM <•scottmg> so now the questions are: 12:39 PM <•scottmg> - why haven't the other 2 processes' mx_object_wait_one()s timed out like these two? 12:40 PM <•scottmg> - why is the harness waiting for the 2 that did terminate 12:48 PM <•scottmg> #2 i don't understand. it seems like we should try killing it a second time (unnecessarily) via https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=d09279c9b7cd06cc9d1fa91c03c3a49338674041&l=363 and then remove it from the list. but it's clearly still in the list because we keep iterating all 4 processes every time the watchdog says there's no output 12:50 PM <•scottmg> aaaaand #1 i don't understand since it's just thread hanging out in an mx_object_wait_one() so it should time out at some point 12:51 PM <•scottmg> so that's pretty sweet. i guess i'll try to make some subprocesses hang and see what i can repro
,
Aug 14 2017
https://chromium-review.googlesource.com/c/614580 for logging. It only attempts to explain why we keep polling processes that should have been removed from the list. I don't know why the other two didn't even time out.
,
Aug 14 2017
That run of media_unittests finally gave up https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F8758%2F%2B%2Frecipes%2Fsteps%2Fmedia_unittests%2F0%2Fstdout and reported status -21 (which is MX_ERR_TIMED_OUT) for two of the processes. I'm not sure how to interpret this. Of the 4 that were waiting here, all around the same time all of a sudden after many hours, two threads returned from a 60s wait in mx_task_kill() https://cs.chromium.org/chromium/src/base/process/process_fuchsia.cc?rcl=b8e90b623d45c0588550935f29dc2c0e080798aa&l=138 and a third timed out its mx_object_wait_one. The fourth seemingly exited normally and was deemed Just Fine. [13734.044] 02426.02570> Still waiting for the following processes to finish: [13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.kKHCJJ --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t [13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.iIdHFc/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000 [13734.045] 02426.02570> return_code=0 [13734.045] 02426.02570> started=1 [13734.045] 02426.02570> exited=0 [13734.045] 02426.02570> debugger_attached=0 [13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.AGGIOc --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t [13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.ccBLDL/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000 [13734.045] 02426.02570> return_code=0 [13734.045] 02426.02570> started=1 [13734.045] 02426.02570> exited=0 [13734.045] 02426.02570> debugger_attached=0 [13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.MlBbda --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t [13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.GLeCJA/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000 [13734.045] 02426.02570> return_code=-1 [13734.045] 02426.02570> started=1 [13734.045] 02426.02570> exited=1 [13734.045] 02426.02570> debugger_attached=0 [13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.EEeHfM --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t [13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.mgppNN/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000 [13734.045] 02426.02570> return_code=-1 [13734.045] 02426.02570> started=1 [13734.045] 02426.02570> exited=1 [13734.045] 02426.02570> debugger_attached=0 [13737.387] 02426.03385> [3:8490383:0814/181638.749531:1186172365:ERROR:process_fuchsia.cc(138)] Error waiting for process exit: -21 [13737.391] 02426.03489> [3:15827469:0814/181640.846134:1188268968:ERROR:process_fuchsia.cc(138)] Error waiting for process exit: -21 [13737.395] 02426.03331> [3:15830429:0814/181535.760830:1123183649:ERROR:process_fuchsia.cc(172)] mx_object_wait_one timed out, and no signals [13737.400] 02426.02570> Failed to get out-of-band test success data, dumping full stdio below: [13737.400] 02426.02570> [13737.483] 02426.02570> [2181/2773] FFmpeg/AudioDecoderTest.Reinitialize_AfterReset/6 (249 ms) [13737.483] 02426.02570> [2182/2773] FFmpeg/AudioDecoderTest.ProduceAudioSamples/0 (309 ms) ...
,
Aug 14 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5a441b28a19032ed33276e14a145aa8aa33ad24e commit 5a441b28a19032ed33276e14a145aa8aa33ad24e Author: Scott Graham <scottmg@chromium.org> Date: Mon Aug 14 23:28:10 2017 fuchsia: Add additional logging in test launcher to diagnose hang It's not clear why the bot is running for many hours, occasionally. Add some additional logging when running in CHROME_HEADLESS to try to narrow down amongst possibilities in https://bugs.chromium.org/p/chromium/issues/detail?id=755282#c1. Bug: 755282 Change-Id: I5853d0bf5dfae5230e4f00570850e5cde6317d3f Reviewed-on: https://chromium-review.googlesource.com/614580 Reviewed-by: Wez <wez@chromium.org> Reviewed-by: Nico Weber <thakis@chromium.org> Cr-Commit-Position: refs/heads/master@{#494241} [modify] https://crrev.com/5a441b28a19032ed33276e14a145aa8aa33ad24e/base/test/launcher/test_launcher.cc [modify] https://crrev.com/5a441b28a19032ed33276e14a145aa8aa33ad24e/build/fuchsia/runner_common.py
,
Aug 22 2017
This is still happening pretty often (maybe 0.5-1 time per day), but the logs are not yet enlightening. https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F8971%2F%2B%2Frecipes%2Fsteps%2Fnet_unittests%2F0%2Fstdout https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F8981%2F%2B%2Frecipes%2Fsteps%2Fnet_unittests%2F0%2Fstdout I think next thing I'll try is: 1. add logging to confirm that the child threads are really getting to the mx_object_wait_one() with a reasonable deadline 2. add the process handle to the "Still waiting" print, because it's hard to map back to which one's still running right now in the logs
,
Aug 22 2017
And maybe retrieve the thread info for the parent process to see if one of watcher threads is wedged. It does always seem that there's at least 2 mx_object_wait_one()s that get stuck at the same time. The watcher threads are in a worker pool https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=8bef6b274e18497490ac0a84495afdad0b950636&l=1224 so might be consistent with the waits being on the same worker thread. The pool is instantiated with the same number of threads as there are processes https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=8bef6b274e18497490ac0a84495afdad0b950636&l=839 . Hmm.
,
Aug 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b5b481d642285400f22f80bac0f3404ca71eae73 commit b5b481d642285400f22f80bac0f3404ca71eae73 Author: Scott Graham <scottmg@chromium.org> Date: Tue Aug 22 19:40:33 2017 fuchsia: add additional logging to diagnose failing timeouts Also, fix passing override of --test-launcher-jobs on the command line, useful for testing timeout-y code. Bug: 755282 Change-Id: If8b2dcb6db724e12cf29f84b3adccc095a735403 Reviewed-on: https://chromium-review.googlesource.com/626936 Commit-Queue: Scott Graham <scottmg@chromium.org> Reviewed-by: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#496407} [modify] https://crrev.com/b5b481d642285400f22f80bac0f3404ca71eae73/base/process/process_fuchsia.cc [modify] https://crrev.com/b5b481d642285400f22f80bac0f3404ca71eae73/build/fuchsia/runner_common.py
,
Jan 9 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/dc9eb2b1266e19bbfd4b9c69ff288750fef4004c commit dc9eb2b1266e19bbfd4b9c69ff288750fef4004c Author: Wez <wez@chromium.org> Date: Tue Jan 09 07:25:07 2018 Update TestLauncher to use Fuchsia jobs in place of POSIX process jobs. TestLauncher was previously using the |LaunchOptions::new_process_group| to request LaunchProcess() to isolate each test job into a separate group, for easy process cleanup. Since the |new_process_group| was not implemented in the Fuchsia implementation of LaunchProcess, this had no effect besides causing errors to be logged when we attempted to kill(-pid). We remove |new_process_group| and update TestLauncher to use native Fuchsia jobs to group test job processes. Bug: 799268 , 755282 Change-Id: Ia96cd77c5b4066d6da522cc7fe0e4e427229dac3 Reviewed-on: https://chromium-review.googlesource.com/852559 Commit-Queue: Wez <wez@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Cr-Commit-Position: refs/heads/master@{#527925} [modify] https://crrev.com/dc9eb2b1266e19bbfd4b9c69ff288750fef4004c/base/process/launch.h [modify] https://crrev.com/dc9eb2b1266e19bbfd4b9c69ff288750fef4004c/base/test/launcher/test_launcher.cc
,
Jan 10 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/26a004d0c98da273581c9d6953a35ea97874a204 commit 26a004d0c98da273581c9d6953a35ea97874a204 Author: Wez <wez@chromium.org> Date: Wed Jan 10 08:27:08 2018 Dump list of all threads in timed-out/hung sub-processes. Bug: 799268 , 755282 , 793412 Change-Id: I6737bbac53253205c6d31d32ce1a34a27e9ceee1 Reviewed-on: https://chromium-review.googlesource.com/853079 Commit-Queue: Wez <wez@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Cr-Commit-Position: refs/heads/master@{#528261} [modify] https://crrev.com/26a004d0c98da273581c9d6953a35ea97874a204/base/test/launcher/test_launcher.cc
,
Jun 5 2018
Closing this out, since I think TestLauncher does a peachy job of terminating hung sub-processes these days.
,
Jun 21 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/42c014d0916cb84be9420e0f3542a0112bc7b797 commit 42c014d0916cb84be9420e0f3542a0112bc7b797 Author: Wez <wez@chromium.org> Date: Thu Jun 21 11:36:14 2018 Remove extra Fuchsia-specific logging from Process and TestLauncher. The issues that this logging was introduced to diagnose are understood and/or resolved, so it can be removed. Also clean up some OS_POSIX && !OS_FUCHSIA preprocessor conditionals, and migrate TestLauncher to use the zx::job container from the SDK. Bug: 755282 , 750370 , 738275, 706592 Change-Id: Iedfaa62684a0b6e8524a456766186b4122e98ae0 Reviewed-on: https://chromium-review.googlesource.com/1108483 Commit-Queue: Daniel Cheng <dcheng@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Reviewed-by: Scott Graham <scottmg@chromium.org> Cr-Commit-Position: refs/heads/master@{#569219} [modify] https://crrev.com/42c014d0916cb84be9420e0f3542a0112bc7b797/base/process/process_fuchsia.cc [modify] https://crrev.com/42c014d0916cb84be9420e0f3542a0112bc7b797/base/test/BUILD.gn [modify] https://crrev.com/42c014d0916cb84be9420e0f3542a0112bc7b797/base/test/launcher/test_launcher.cc |
||
►
Sign in to add a comment |
||
Comment 1 by scottmg@chromium.org
, Aug 14 2017