New issue
Advanced search Search tips

Issue 798784 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 725620



Sign in to add a comment

moblab-vm: test_that continues even after timeout

Project Member Reported by pprabhu@chromium.org, Jan 3 2018

Issue description

See example build: https://uberchromegw.corp.google.com/i/chromiumos/builders/moblab-generic-vm-paladin/builds/458

The test timed out here for some reason (irrelevant for this bug)
So MoblabVMTest timed out and the RunCommand that was waiting for test_that was aborted. MoblabVMTest then went on to stop the KVM instance running the DUT.

But the test_that invocation continued to run, and was stuck for ~3 minutes because it was trying to SSH into the non existent KVM DUT (brought down by MoblabVMTest already).

When RunCommand aborts, test_that should have aborted.
 
The problem here is that the implementation of cros_sdk involves one or more process forks that don't properly forward SIGTERM to their children.

A simple repro setup is to run a command via cros_sdk and kill the parent process. The child process continues:

pprabhu@pprabhu:trace$ cros_sdk bash -c "sleep 30; echo done" &
[1] 29926
pprabhu@pprabhu:trace$ pstree -a -p 29926
sudo,29926 SSH_AGENT_PID=10941 SSH_AUTH_SOCK=/tmp/ssh-faa0hbwJYzrK/agent.10847 TMUX=/tmp/tmux-193821/default,26303,0 CROS_CACHEDIR=/work/chromiumos/.cache...
  └─python2,29934 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
      └─python2,29947 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
          └─python2,29954 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
              └─bash,29958 -c sleep 30; echo done
                  └─sleep,30227 30
pprabhu@pprabhu:trace$ sudo kill -SIGTERM 29926
cros_sdk: Signaled to shutdown: caught 15 signal.
pprabhu@pprabhu:trace$ pstree -a -p 29934
[1]+  Exit 1                  cros_sdk bash -c "sleep 30; echo done"
pprabhu@pprabhu:trace$ pstree -a -p 29947
python2,29947 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
  └─python2,29954 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
      └─bash,29958 -c sleep 30; echo done
          └─sleep,30227 30
pprabhu@pprabhu:trace$ done


The bug is in the implementation of lib/namespaces.py:CreatePidNs()
This forks twice to contain a process tree inside a pid namespace, but the forking processes fail to forward SIGTERM (and SIGINT) to their children.


Blocking: 725620
Note that the behaviour wrt SIGINT is even weirder: If you SIGINT the launched process, it is simply ignored. CreatePidNs explicitly ignores SIGINT claiming that we trust the child process to handle SIGINT. This is fine if SIGINT is sent to the whole process group (say when you hit Ctrl-C from the controlling terminal), but if SIGINT is sent to the parent process, it'll be a no-op

pprabhu@pprabhu:trace$ pstree -a -p 38304
sudo,38304 SSH_AGENT_PID=10941 SSH_AUTH_SOCK=/tmp/ssh-faa0hbwJYzrK/agent.10847 TMUX=/tmp/tmux-193821/default,26303,0 CROS_CACHEDIR=/work/chromiumos/.cache DEPOT_TOOLS=/work/chromiumos/chromium/tools/depot_tools -- /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
  └─python2,38307 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
      └─python2,38318 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
          └─python2,38323 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
              └─bash,38327 -c sleep 30; echo done
                  └─sleep,38590 30
pprabhu@pprabhu:trace$ sudo kill -SIGINT 38304
pprabhu@pprabhu:trace$ 
pprabhu@pprabhu:trace$ 
pprabhu@pprabhu:trace$ jobs -p
38304
pprabhu@pprabhu:trace$ pstree -a -p 38304
sudo,38304 SSH_AGENT_PID=10941 SSH_AUTH_SOCK=/tmp/ssh-faa0hbwJYzrK/agent.10847 TMUX=/tmp/tmux-193821/default,26303,0 CROS_CACHEDIR=/work/chromiumos/.cache DEPOT_TOOLS=/work/chromiumos/chromium/tools/depot_tools -- /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
  └─python2,38307 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
      └─python2,38318 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
          └─python2,38323 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
              └─bash,38327 -c sleep 30; echo done
                  └─sleep,38590 30

----------------
This doesn't affect things in practice since RunCommand doesn't use SIGINT (SIGTERM, followed by SIGKILL).
Cc: vapier@chromium.org ayatane@chromium.org
Lots of history about why this is the way it is:  issue 412484 

My take is that ignoring SIGINT only went partway -- Ctrl-C now works because the terminal sends SIGINT to the whole process group, but SIGINT/SIGTERM to the parent process still does the wrong thing.

This is especially a problem for cros_sdk as I noted above when run via RunCommand, which SIGTERM's the parent processes. I'll have CreatePidNs forward SIGINT/SIGTERM to the grandchild. The behaviour is still wrong for SIGKILL, where all bets are off.
With https://chromium-review.googlesource.com/c/chromiumos/chromite/+/849535

child exits on SIGTERM as expected:

pprabhu@pprabhu:trace$ cros_sdk bash -c "sleep 30; echo done" &
[1] 74235

pprabhu@pprabhu:trace$ pstree -a -p 74235                                                                                                                                                                                                                                                                                                                                                      
sudo,74235 SSH_AGENT_PID=10941 SSH_AUTH_SOCK=/tmp/ssh-faa0hbwJYzrK/agent.10847 TMUX=/tmp/tmux-193821/default,26303,0 CROS_CACHEDIR=/work/chromiumos/.cache DEPOT_TOOLS=/work/chromiumos/chromium/tools/depot_tools -- /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
  └─python2,74239 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
      └─python2,74250 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
          └─python2,74255 /work/chromiumos/chromite/bin/cros_sdk bash -c sleep 30; echo done
              └─bash,74259 -c sleep 30; echo done
                  └─sleep,74524 30
pprabhu@pprabhu:trace$ sudo kill -SIGINT 74235
pprabhu@pprabhu:trace$ Traceback (most recent call last):
  File "/work/chromiumos/chromite/bin/cros_sdk", line 169, in <module>
    DoMain()
  File "/work/chromiumos/chromite/bin/cros_sdk", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/work/chromiumos/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/work/chromiumos/chromite/scripts/cros_sdk.py", line 1163, in main
    chroot_command)
  File "/work/chromiumos/chromite/scripts/cros_sdk.py", line 274, in EnterChroot
    mute_output=False)
  File "/work/chromiumos/chromite/lib/cros_build_lib.py", line 622, in RunCommand
    (cmd_result.output, cmd_result.error) = proc.communicate(input)
  File "/usr/lib/python2.7/subprocess.py", line 796, in communicate
    self.wait()
  File "/usr/lib/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
  File "/work/chromiumos/chromite/lib/cros_build_lib.py", line 369, in _KillChildProcess
    if not signals.RelaySignal(original_handler, signum, frame):
  File "/work/chromiumos/chromite/lib/signals.py", line 27, in RelaySignal
    handler(signum, frame)
KeyboardInterrupt

[1]+  Exit 1                  cros_sdk bash -c "sleep 30; echo done"
pprabhu@pprabhu:trace$ 

Project Member

Comment 7 by bugdroid1@chromium.org, Jan 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/7928b3e6193e8d6b380fcffee055121e31f97289

commit 7928b3e6193e8d6b380fcffee055121e31f97289
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Jan 09 06:42:11 2018

Forward signals to processes inside pid namespace.

This improves somewhat over the existing behaviour. Now, SIGINT (not
sent via Ctrl-C, but only to the launched process) and SIGTERM are
forwarded to the actual process running inside the pid namespace. But,
- SIGKILL will still only kill the waiting external-init process, and
  the actual processes inside the namespace will continue to live.

This improves the situation particularly when cbuildbot runs any of the
build commands via cros_sdk. That results in a RunCommand call to
cros_sdk. If the outer RunCommand is aborted (due to timeout), it sends
SIGTERM to the cros_sdk process.

BUG= chromium:798784 
TEST=cros_sdk bash -c "sleep 30; echo done" &
     # Then SIGTERM the process created, the spawned bash process
     # actually exits immediately.

Change-Id: Idf41098d0225733edb1ef7a8518cf7c629192dd4
Reviewed-on: https://chromium-review.googlesource.com/849535
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/7928b3e6193e8d6b380fcffee055121e31f97289/lib/namespaces.py

Status: Fixed (was: Started)

Sign in to add a comment