New issue
Advanced search Search tips

Issue 755282 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 3
Type: Bug

Blocking:
issue 738275



Sign in to add a comment

Test launcher not terminating hung child processes on Fuchsia

Project Member Reported by scottmg@chromium.org, Aug 14 2017

Issue description

real_wez> Uh oh, looks like the builder has gone rogue again :-/ https://build.chromium.org/p/chromium.fyi/builders/Fuchsia/builds/8758
12:07 PM <•scottmg> oooooh, and there is actually an exited=1 there for one of the subprocs
12:08 PM <•scottmg> jamesr_ and i hypothesized that might mean that crashlogger crashed
12:08 PM <•scottmg> or it could quite plausibly be something completely else. but we have a trickle of data now!
12:09 PM <real_wez> Ah, trickle-down debugging
12:09 PM <•scottmg> (maybe the parent waiting for all of them somehow...?)
12:10 PM <•scottmg> 2 of the subprocs are "exited" but not terminated, and 2 are still running https://gist.github.com/sgraham/74ecf2390b3cd91a6d818e30fe87da31
closesgraham — 14 Aug 2017

12:10 PM <real_wez> scottmg: Where are you pulling "exited" from? Is that based on the signals on the process mx_handle_t?
12:11 PM <•scottmg> real_wez: yes, it's mx_object_get_info https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?q=mx_object_get_info&sq=package:chromium&l=1186
12:11 PM <real_wez> Hmmm, nothing has a debugger attached, but then the crashlogger is registered as the OS-level exception handler, right?
12:12 PM <real_wez> If crashlogger is itself being terminated, or crashing, though, then the default behaviour of the exception handling APIs is to let the process continue, rather than leaving it stalled
12:12 PM <•scottmg> yeah, don't think crashlogger is a debugger though
12:13 PM <real_wez> You have to attach the handler with the quiet-disconnect (or something) flag to get the process to stay stalled
12:13 PM <•scottmg> looking at ProcessDispatcher::GetInfo in the kernel, i think that means it's either State::DEAD or State::DYING
12:14 PM <real_wez> Do processes hit those states as soon as the exception-handler is triggered on them?
12:14 PM <•scottmg> oh i see, you mean debugger_exception_port_... hmm, ok, maybe
12:14 PM <•scottmg> not sure, paging cpu maybe?
12:14 PM <real_wez> cpu to the rescue?
12:18 PM <•scottmg> maybe i should have grabbed state of all threads too, it looks like the last DYING thread moves the process to DEAD
12:33 PM <•scottmg> another interesting thing is that two time out at the same time https://gist.github.com/sgraham/6f50f26dd0d54643c1aae20605c2df8b
closesgraham — 14 Aug 2017

12:34 PM <•scottmg> whcih means we return false to here https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=8d7260b2352d7e5539160a015e62d828e4b3278e&l=343
12:35 PM <•scottmg> and so call Terminate() here https://cs.chromium.org/chromium/src/base/process/process_fuchsia.cc?rcl=8d7260b2352d7e5539160a015e62d828e4b3278e&l=128
12:36 PM <•scottmg> and we call that with wait=true, which is supposed to mean it mx_task_kill()s and then waits 60s
12:37 PM <•scottmg> the return_code has been set to -1, so i think that means it's toast
12:39 PM <•scottmg> so now the questions are:
12:39 PM <•scottmg> - why haven't the other 2 processes' mx_object_wait_one()s timed out like these two?
12:40 PM <•scottmg> - why is the harness waiting for the 2 that did terminate
12:48 PM <•scottmg> #2 i don't understand. it seems like we should try killing it a second time (unnecessarily) via https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=d09279c9b7cd06cc9d1fa91c03c3a49338674041&l=363 and then remove it from the list. but it's clearly still in the list because we keep iterating all 4 processes every time the watchdog says there's no output
12:50 PM <•scottmg> aaaaand #1 i don't understand since it's just thread hanging out in an mx_object_wait_one() so it should time out at some point
12:51 PM <•scottmg> so that's pretty sweet. i guess i'll try to make some subprocesses hang and see what i can repro
 
Artificially causing a hang in a test child (BindTest::~BindTest in my test case) resulted in correct operation.

In particular, after 900s, the child timed out and a message like

[01266.564] 02935.03306> [3:964272227:0814/202103.382847:1266564658:ERROR:process_fuchsia.cc(172)] mx_object_wait_one timed out, and no signals

appeared, and then that subprocess never appeared in the test output again (that is, we never see return_code=-1, exited=1 like we do in e.g. https://gist.github.com/sgraham/6f50f26dd0d54643c1aae20605c2df8b from above).

Looking back at https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=8d7260b2352d7e5539160a015e62d828e4b3278e&l=343

I think that would mean that the subchild task could be hanging in:
1. process.Terminate()
2. AutoLock lock(*GetLiveProcessesLock());
3. KillProcessGroup(process.Handle());

or:
4. failing to erase in GetLiveProcesses()->erase(process.Handle());

I think #2 can be ruled out because TestLauncher::OnOutputTimeout() continues to execute, and it takes the same lock.

#1 and #3 are both mx_task_kill() on the subprocess' handle. process.Terminate() does a wait, and should LOG (not DLOG) an error if something fails, but there's no output. So either process.Terminate() is succeeding or hanging. I believe it's succeeding because that seems to be what's moving the process to exited=1 state and setting return_code to -1.
KillProcessGroup() doesn't have any LOG statement, so we don't know anything about it.

For #4 GetLiveProcesses()->erase() looks symmetric with GetLiveProcesses()->insert(), so... I dunno. I guess I'll have to add more LOGs and wait for another occurrence. :(
https://chromium-review.googlesource.com/c/614580 for logging. It only attempts to explain why we keep polling processes that should have been removed from the list. I don't know why the other two didn't even time out.
That run of media_unittests finally gave up https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F8758%2F%2B%2Frecipes%2Fsteps%2Fmedia_unittests%2F0%2Fstdout and reported status -21 (which is MX_ERR_TIMED_OUT) for two of the processes.

I'm not sure how to interpret this. Of the 4 that were waiting here, all around the same time all of a sudden after many hours, two threads returned from a 60s
wait in mx_task_kill() https://cs.chromium.org/chromium/src/base/process/process_fuchsia.cc?rcl=b8e90b623d45c0588550935f29dc2c0e080798aa&l=138 and a third timed out its mx_object_wait_one. The fourth seemingly exited normally and was deemed Just Fine.


[13734.044] 02426.02570> Still waiting for the following processes to finish:
[13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.kKHCJJ --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t
[13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.iIdHFc/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[13734.045] 02426.02570>   return_code=0
[13734.045] 02426.02570>   started=1
[13734.045] 02426.02570>   exited=0
[13734.045] 02426.02570>   debugger_attached=0
[13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.AGGIOc --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t
[13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.ccBLDL/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[13734.045] 02426.02570>   return_code=0
[13734.045] 02426.02570>   started=1
[13734.045] 02426.02570>   exited=0
[13734.045] 02426.02570>   debugger_attached=0
[13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.MlBbda --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t
[13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.GLeCJA/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[13734.045] 02426.02570>   return_code=-1
[13734.045] 02426.02570>   started=1
[13734.045] 02426.02570>   exited=1
[13734.045] 02426.02570>   debugger_attached=0
[13734.045] 02426.02570> /system/media_unittests --gtest_flagfile=/tmp/.org.chromium.Chromium.plmdLG/.org.chromium.Chromium.EEeHfM --single-process-tests --test-launcher-filter-file=/system/testing/buildbot/filters/fuchsia.media_unittests.filter --t
[13734.045] 02426.02570> est-launcher-output=/tmp/.org.chromium.Chromium.mgppNN/test_results.xml --test-launcher-retry-limit=0 --test-launcher-timeout=90000
[13734.045] 02426.02570>   return_code=-1
[13734.045] 02426.02570>   started=1
[13734.045] 02426.02570>   exited=1
[13734.045] 02426.02570>   debugger_attached=0
[13737.387] 02426.03385> [3:8490383:0814/181638.749531:1186172365:ERROR:process_fuchsia.cc(138)] Error waiting for process exit: -21
[13737.391] 02426.03489> [3:15827469:0814/181640.846134:1188268968:ERROR:process_fuchsia.cc(138)] Error waiting for process exit: -21
[13737.395] 02426.03331> [3:15830429:0814/181535.760830:1123183649:ERROR:process_fuchsia.cc(172)] mx_object_wait_one timed out, and no signals
[13737.400] 02426.02570> Failed to get out-of-band test success data, dumping full stdio below:
[13737.400] 02426.02570>
[13737.483] 02426.02570> [2181/2773] FFmpeg/AudioDecoderTest.Reinitialize_AfterReset/6 (249 ms)
[13737.483] 02426.02570> [2182/2773] FFmpeg/AudioDecoderTest.ProduceAudioSamples/0 (309 ms)
...
Project Member

Comment 4 by bugdroid1@chromium.org, Aug 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5a441b28a19032ed33276e14a145aa8aa33ad24e

commit 5a441b28a19032ed33276e14a145aa8aa33ad24e
Author: Scott Graham <scottmg@chromium.org>
Date: Mon Aug 14 23:28:10 2017

fuchsia: Add additional logging in test launcher to diagnose hang

It's not clear why the bot is running for many hours, occasionally. Add
some additional logging when running in CHROME_HEADLESS to try to narrow
down amongst possibilities in
https://bugs.chromium.org/p/chromium/issues/detail?id=755282#c1.

Bug:  755282 
Change-Id: I5853d0bf5dfae5230e4f00570850e5cde6317d3f
Reviewed-on: https://chromium-review.googlesource.com/614580
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Nico Weber <thakis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#494241}
[modify] https://crrev.com/5a441b28a19032ed33276e14a145aa8aa33ad24e/base/test/launcher/test_launcher.cc
[modify] https://crrev.com/5a441b28a19032ed33276e14a145aa8aa33ad24e/build/fuchsia/runner_common.py

This is still happening pretty often (maybe 0.5-1 time per day), but the logs are not yet enlightening.

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F8971%2F%2B%2Frecipes%2Fsteps%2Fnet_unittests%2F0%2Fstdout

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F8981%2F%2B%2Frecipes%2Fsteps%2Fnet_unittests%2F0%2Fstdout

I think next thing I'll try is:
1. add logging to confirm that the child threads are really getting to the mx_object_wait_one() with a reasonable deadline
2. add the process handle to the "Still waiting" print, because it's hard to map back to which one's still running right now in the logs
And maybe retrieve the thread info for the parent process to see if one of watcher threads is wedged.

It does always seem that there's at least 2 mx_object_wait_one()s that get stuck at the same time. The watcher threads are in a worker pool https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=8bef6b274e18497490ac0a84495afdad0b950636&l=1224 so might be consistent with the waits being on the same worker thread. The pool is instantiated with the same number of threads as there are processes https://cs.chromium.org/chromium/src/base/test/launcher/test_launcher.cc?rcl=8bef6b274e18497490ac0a84495afdad0b950636&l=839 .

Hmm.
Project Member

Comment 7 by bugdroid1@chromium.org, Aug 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b5b481d642285400f22f80bac0f3404ca71eae73

commit b5b481d642285400f22f80bac0f3404ca71eae73
Author: Scott Graham <scottmg@chromium.org>
Date: Tue Aug 22 19:40:33 2017

fuchsia: add additional logging to diagnose failing timeouts

Also, fix passing override of --test-launcher-jobs on the command line,
useful for testing timeout-y code.

Bug:  755282 
Change-Id: If8b2dcb6db724e12cf29f84b3adccc095a735403
Reviewed-on: https://chromium-review.googlesource.com/626936
Commit-Queue: Scott Graham <scottmg@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#496407}
[modify] https://crrev.com/b5b481d642285400f22f80bac0f3404ca71eae73/base/process/process_fuchsia.cc
[modify] https://crrev.com/b5b481d642285400f22f80bac0f3404ca71eae73/build/fuchsia/runner_common.py

Project Member

Comment 8 by bugdroid1@chromium.org, Jan 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/dc9eb2b1266e19bbfd4b9c69ff288750fef4004c

commit dc9eb2b1266e19bbfd4b9c69ff288750fef4004c
Author: Wez <wez@chromium.org>
Date: Tue Jan 09 07:25:07 2018

Update TestLauncher to use Fuchsia jobs in place of POSIX process jobs.

TestLauncher was previously using the |LaunchOptions::new_process_group|
to request LaunchProcess() to isolate each test job into a separate
group, for easy process cleanup.

Since the |new_process_group| was not implemented in the Fuchsia
implementation of LaunchProcess, this had no effect besides causing
errors to be logged when we attempted to kill(-pid).

We remove |new_process_group| and update TestLauncher to use native
Fuchsia jobs to group test job processes.

Bug:  799268 ,  755282 
Change-Id: Ia96cd77c5b4066d6da522cc7fe0e4e427229dac3
Reviewed-on: https://chromium-review.googlesource.com/852559
Commit-Queue: Wez <wez@chromium.org>
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Cr-Commit-Position: refs/heads/master@{#527925}
[modify] https://crrev.com/dc9eb2b1266e19bbfd4b9c69ff288750fef4004c/base/process/launch.h
[modify] https://crrev.com/dc9eb2b1266e19bbfd4b9c69ff288750fef4004c/base/test/launcher/test_launcher.cc

Project Member

Comment 9 by bugdroid1@chromium.org, Jan 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/26a004d0c98da273581c9d6953a35ea97874a204

commit 26a004d0c98da273581c9d6953a35ea97874a204
Author: Wez <wez@chromium.org>
Date: Wed Jan 10 08:27:08 2018

Dump list of all threads in timed-out/hung sub-processes.

Bug:  799268 ,  755282 ,  793412 
Change-Id: I6737bbac53253205c6d31d32ce1a34a27e9ceee1
Reviewed-on: https://chromium-review.googlesource.com/853079
Commit-Queue: Wez <wez@chromium.org>
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Cr-Commit-Position: refs/heads/master@{#528261}
[modify] https://crrev.com/26a004d0c98da273581c9d6953a35ea97874a204/base/test/launcher/test_launcher.cc

Comment 10 by w...@chromium.org, Jun 5 2018

Status: Fixed (was: Started)
Closing this out, since I think TestLauncher does a peachy job of terminating hung sub-processes these days.
Project Member

Comment 11 by bugdroid1@chromium.org, Jun 21 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/42c014d0916cb84be9420e0f3542a0112bc7b797

commit 42c014d0916cb84be9420e0f3542a0112bc7b797
Author: Wez <wez@chromium.org>
Date: Thu Jun 21 11:36:14 2018

Remove extra Fuchsia-specific logging from Process and TestLauncher.

The issues that this logging was introduced to diagnose are understood
and/or resolved, so it can be removed.

Also clean up some OS_POSIX && !OS_FUCHSIA preprocessor conditionals,
and migrate TestLauncher to use the zx::job container from the SDK.

Bug:  755282 ,  750370 , 738275, 706592
Change-Id: Iedfaa62684a0b6e8524a456766186b4122e98ae0
Reviewed-on: https://chromium-review.googlesource.com/1108483
Commit-Queue: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#569219}
[modify] https://crrev.com/42c014d0916cb84be9420e0f3542a0112bc7b797/base/process/process_fuchsia.cc
[modify] https://crrev.com/42c014d0916cb84be9420e0f3542a0112bc7b797/base/test/BUILD.gn
[modify] https://crrev.com/42c014d0916cb84be9420e0f3542a0112bc7b797/base/test/launcher/test_launcher.cc

Sign in to add a comment