New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 686611 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 682951



Sign in to add a comment

With ext4crypt, autotest test job aborts at 'restart ui'

Project Member Reported by kinaba@chromium.org, Jan 30 2017

Issue description

OS: Chrome OS 9233 + ext4crypt patches.

What steps will reproduce the problem?
(1) Log in by some regular account.
(2) test_that -b $BOARD $REMOTE cheets_CTSHelper

Expected: the UI restarts and the test starts up with a fake login.
Instead:

...
13:34:02 ERROR|   __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py:44
13:34:02 ERROR|     self._cri.RestartUI(self.browser_options.clear_enterprise_policy)
13:34:02 ERROR|   RestartUI at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/core/cros_interface.py:555
13:34:02 ERROR|     self.RunCmdOnDevice(stop_cmd)
13:34:02 ERROR|   RunCmdOnDevice at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/core/cros_interface.py:221
13:34:02 ERROR|     quiet=quiet)
13:34:02 ERROR|   GetAllCmdOutput at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/core/cros_interface.py:65
13:34:02 ERROR|     stdout, stderr = p.communicate()
13:34:02 ERROR|   communicate at /usr/local/lib64/python2.7/subprocess.py:799
13:34:02 ERROR|     return self._communicate(input)
13:34:02 ERROR|   _communicate at /usr/local/lib64/python2.7/subprocess.py:1409
13:34:02 ERROR|     stdout, stderr = self._communicate_with_poll(input)
13:34:02 ERROR|   _communicate_with_poll at /usr/local/lib64/python2.7/subprocess.py:1463
13:34:02 ERROR|     ready = poller.poll()
13:34:02 ERROR|   PrintStackAndExit at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/util/global_hooks.py:38
13:34:02 ERROR|     exception_formatter.PrintFormattedFrame(stack_frame, exception_string)
...


Here,
stop_cmd = 'stop ui'
and
global_hooks.py:38
is a hook for SIGTERM signal.

Hence, while running 'stop ui', for some reason the autotest script itself is being killed by SIGTERM and failing to continue running the rest.
As far as I remember this was not the case for ecryptfs-based backend. (Though not sure what can make the difference...)

This is probably not a blocker (since we can still run the test by manually signing out and running test_that), but it is still annoying.
 

Comment 1 by kinaba@chromium.org, Jan 30 2017

Components: -Platform>ARC

Comment 2 by kinaba@chromium.org, Jan 31 2017

memo: My guess from home without any testing is that this may be because
https://chromium.git.corp.google.com/chromiumos/platform2/+/b1046234bef29203d050a5de65d9f002c32eaeed/cryptohome/mount.cc#763
now we are always immediately invalidating the key, while previously we did only lazy unmount if any file is opened.
It may be causing unexpected file IO error somewhere in autotest.
Labels: -Pri-2 Pri-1
No time to do further investigation yet, but I noticed that this is a P1 bug.

Most of the tests that does Chrome sign-in are actually affected.
For instance, `test_that video_ChromeRTCHWDecodeUsed` fails.

The point is, nicely written tests like
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/client/site_tests/video_ChromeRTCHWDecodeUsed/video_ChromeRTCHWDecodeUsed.py#70
cleanly log out from Chrome (by the with-statement) before finishing the test.
It is `restart ui` that causes the same problem as `stop ui`.

cheets_CTSHelper was only one of few tests that does not sign-out; most of the tests actually hit by this and marked as failing.

Summary: With ext4crypt, autotest test job aborts at 'restart ui' (was: With ext4crypt autotest cannot exit from the previous login in the test setup.)
Hm, it may be simpler.

/var/log/shutdown_force_kill_processes
recorded that autotest related processes are killed by
https://cs.corp.google.com/chromeos_public/src/platform2/init/killers?type=cs&q=kill_with_open_files_on&l=56
https://cs.corp.google.com/chromeos_public/src/platform2/login_manager/init/scripts/ui-post-stop?type=cs&q=kill_with_open_files_on&l=23
So they are killed exactly because they are opening files.

What's not working well is, for some reason,

$ lsof /home/chronos/u-*

lists up processes not opening files matching /home/chronos/u-* too.
autotest is killed because it is opening "/usr/local/lib64/libpython2.7.so.1.0", etc, etc.... Weird.
TIL: lsof $mountOnPoint lists up all the open files on the same filesystem,
so in our case all open files in /dev/sda1 are affected, including /usr/local.
In the ecyptfs era, it wasn't the case because the ecryptfs is treated as a separate fs.

I'll seek more to find if there's a nice way to handle the situation.
Dumping out my work record today.

(1)
lsof has the "-D" option that does recursive directory traversal,
so `lsof -D /home/chronos/u-hash` limits the output only under that path.
However, this literally does slow recursive search on the whole directory tree
that probably we don't want to do during shutdown. Besides, I'm not sure
the files under Android /data/data can be detected by this or not.

(2)
lsof can also emit the file path (in addition to fd/pid), essentially by reading the symlinks /proc/$pid/fd/$fd.
We may be able to filter it by whitelist/blacklist. Note that processes in side the Android container has their
path inside the container.

For logged-in users /home/chronos/u-* lives on /dev/sda1 (or corresponding device file on each chromebook), so, what it kills are:

/dev/sda1 on /mnt/stateful_partition type ext4 (rw,nosuid,nodev,noexec,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /home type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /usr/local type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /home/chronos/user type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /home/user/45d4ae3aa864acc14aecc866544c12ba05e9e9f4 type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /home/chronos/u-45d4ae3aa864acc14aecc866544c12ba05e9e9f4 type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /home/root/45d4ae3aa864acc14aecc866544c12ba05e9e9f4 type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /opt/google/containers/android/rootfs/android-data type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /run/containers/android_eYAz38/root/cache type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/sda1 on /run/containers/android_eYAz38/root/data type ext4 (rw,nosuid,nodev,relatime,seclabel,commit=600,data=ordered)

The paths that we don't want to kill are
* /mnt/stateful_partition
* /usr/local
* /home
and others are the things really under cryptohome.

Comment 8 by kinaba@chromium.org, Feb 10 2017

The current status.

I created a local change to further filter pids by paths (/home(/root|/user/|chronos)|/data)
using lsof -Fn. I'll clean it up and send for review soon.

Project Member

Comment 9 by bugdroid1@chromium.org, Feb 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/421af1309ff12b8d8aa0d4fdc5222f4872f3e52d

commit 421af1309ff12b8d8aa0d4fdc5222f4872f3e52d
Author: Kazuhiro Inaba <kinaba@chromium.org>
Date: Wed Feb 15 18:47:31 2017

ext4crypto: Extend kill_with_open_files_on with filtering by path.

The change is necessary for let 'stop ui' not kill irrelevant processes,
especially the autotest scripts on /usr/local, while trying to only kill
processes accessing files on (to be unmounted) cryptohome.

The previous implementation killing all the pids of
`lsof -t /home/chronos/u-<hash>` worked on ecryptfs-based cryptohome,
because each home directory mounted an encrypted vault directory.

On ext4 crypto, /home/chronos/u-<hash> is merely a bind mount to an
encyrpted directory in the ext4 filesystem on a disk device (e.g., /dev/xxx1.)
The existing code killed more than expected, out of home directories.

This CL adds the ability for the killer script to filter the pid by
open file path, and utilize it by filtering the candidates killed by
'stop ui' with known whitelist of paths pointing inside the user data
directories (/home/(chronos|user|root) and /data from inside ARC container.)

BUG= chromium:686611 
TEST=killers_unittest
TEST=test_that suite:bvt-cq
TEST=manually checked Chrome signout, with or without file-touching processes.

Change-Id: I0b5de80df89d1df565bf06724a54206f2c2d7dfd
Reviewed-on: https://chromium-review.googlesource.com/441429
Commit-Ready: Kazuhiro Inaba <kinaba@chromium.org>
Tested-by: Kazuhiro Inaba <kinaba@chromium.org>
Reviewed-by: Ryo Hashimoto <hashimoto@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/421af1309ff12b8d8aa0d4fdc5222f4872f3e52d/init/killers_unittest
[modify] https://crrev.com/421af1309ff12b8d8aa0d4fdc5222f4872f3e52d/init/killers
[modify] https://crrev.com/421af1309ff12b8d8aa0d4fdc5222f4872f3e52d/login_manager/init/scripts/ui-post-stop

Status: Fixed (was: Started)
Status: Verified (was: Fixed)

Sign in to add a comment