New issue
Advanced search Search tips

Issue 769740 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Drop cache code in cryptohome needs autotests

Project Member Reported by kerrnel@chromium.org, Sep 28 2017

Issue description

We almost shipped a very serious cryptohome regression, because cryptohome was no longer able to write to the /proc/sys/vm/drop_caches file. Adding to the problem, the code in c83ef6c950 did not check the return code of the base::WriteFile() function correctly, and so cryptohome never logged the failure. We have fixed both issues, but we also need to do a long term fix to this problem to reboot the machine immediately if clearing the decrypted cache fails.

An autotest that logged users in and out and checked if the decrypted cache was cleared would have caught both of the bugs. Having a test for this is a high priority as we otherwise risk shipping this regression, and it hampers our ability to improve the error handling in the future.

Can you please expand the cryptohome autotests to include this case, and ideally any other new and important functionality introduced with ext4?

Thank you.

 
Summary: Drop cache code in cryptohome needs autotests (was: The ext4 code in cryptohome needs autotests)
> We have fixed both issues
Sorry, I have no context about the described issues.
Could you give me a pointer to the bugs and the CLs?

> the code in c83ef6c950 did not check the return code of the base::WriteFile() function correctly
I believe you are referring https://chromium-review.googlesource.com/c/chromiumos/platform2/+/440747, but it doesn't anything to do with how the return value of WriteFile() is handled.
What was wrong with the said code?

> An autotest that logged users in and out and checked if the decrypted cache was cleared
It sounds like writing a test of /proc/sys/vm/drop_caches.
I'm even not sure if it's possible to test the behavior of the page cache reliably with autotests (i.e. page caches being evicted doesn't necessarily mean it's evicted by /proc/sys/vm/drop_caches).

Instead, shouldn't we just make sure that cryptohome's unmount D-Bus method correctly returns an error when drop_caches fails, and the error is handled correctly?
Currently, init/scripts/ui-post-stop runs "cryptohome --unmount" to unmount the users' home directories but ignores its result.
Probably we should emit a log and reboot the device when it fails.
> Could you give me a pointer to the bugs and the CLs
The bug about sandboxing causing the /proc/sys/vm/drop_caches write to fail is here: https://bugs.chromium.org/p/chromium/issues/detail?id=764540

> What was wrong with the said code?

The CL that caused the problem with base::WriteFile is: https://chromium-review.googlesource.com/c/chromiumos/platform2/+/428525

The problem with how the return value is handled is that base::WriteFile() returns the number of bytes written, or -1 on error. It does not return a boolean true/false, but that CL expected a bool from base::WriteFile.

The return value handling was corrected in: https://chromium-review.googlesource.com/c/671188/

> I'm even not sure if it's possible to test the behavior of the page cache reliably with autotests (i.e. page caches being evicted doesn't necessarily mean it's evicted by /proc/sys/vm/drop_caches).

I was thinking a simple autotest that logs a user in, logs the user out, and makes sure there isn't decrypted information under the /home/root/$X/mount folder. Do you mean that the decrypted contents aren't reliably cleared?

> Instead, shouldn't we just make sure that cryptohome's unmount D-Bus method correctly returns an error when drop_caches fails, and the error is handled correctly?

We should fix the return value being ignored, but we still need a test to know if it works now and in the future. I was thinking cryptohome() itself would initiate the reboot as a single choke point, but if ui-post-stop script runs on everyone user logout, that could work as well. What do you think?

Regards,

Greg
Thank you for the pointers, so the said bug was a regression caused by the sandboxing, and IIUC it's not just a problem of the ext4 encryption code.

Before enabling sandboxing, did anyone use a tool (e.g. strace) to get the list of all syscalls performed by cryptohomed with and without sandboxing to make sure there is no regression?
For long, cryptohomed had been expected to run with all the root privileges, so every part of cryptohomed can be silently regressed like the drop_cache code.
> Do you mean that the decrypted contents aren't reliably cleared?

Provided the kernel code doesn't have a bug, drop_caches reliably clears the page cache (i.e. "drop_caches succeeded" → "cache is cleared" is always true), but that doesn't mean the converse (i.e. "cache is cleared" → "drop_caches succeeded" is not always true).
Page cache can be cleared for any other reasons.

Triggering drop_caches results in adding a new log line to /var/log/messages which says "drop_caches: 3", so you can know if drop_caches was triggered or not, but you cannot still know if it was executed at the right timing. Triggering drop_caches without invalidating the key doesn't result in totally clearing the decrypted data.


> if ui-post-stop script runs on everyone user logout

ui-post-stop is the only code which executes cryptohome unmount, and it should run whenever a user (or multiple users) logout.
I think what Greg is trying to say here is that there are a bunch of ways in which the drop_cache functionality can break, and given that it's critical functionality, it should have a regression test.

It's impossible to make meaningful progress on a project as large as Chrome OS if we have to manually test everything before making large-scale changes. As Linus Upson used to say, "if you liked then you should have put a test on it."

The discussion here is not who was to blame for this individual bug. We're operating in postmortem mode here: we identified a critical lack of testing that would have prevented this bug. We should add a test so that we don't regress this way again, from a security change, a kernel uprev, or something else.

Chrome OS has a great continuous integration infrastructure, with PreCQ and CQ. Let's use it to guarantee that this critical functionality doesn't regress.

Hashimoto, are you the right owner for this?
Thanks for chiming in Jorge. 

Indeed I don't really care who is to blame for what at this point. My concern is that we added code that is critical to the security and privacy of the system and it had no test, so it broke and nobody noticed. 

The reason I pointed out that the original base::WriteFile() code did not handle the error correctly, was to highlight why a test is so important: even if we had rebooted the machine on a failure of the dbus call, the failure was never being propagated.

So given the intricacies of this code that you pointed out, who is the right owner for this test?
Cc: gwendal@chromium.org
+gwendal who added ext4 encryption code with me.

Sorry for sounding defensive, but my comment in #3 is not about the blame, but about detecting other possible regressions.
As you know, cryptohome's test coverage is not wide enough, and I guess the sandboxing was the largest change which has happened to cryptohome, so I think it'd be really assuring if it's verified that there is no other regressions.

As I wrote in comment #4, I'm not sure if it's possible to write a reliable test for the said code which doesn't result in neither false-positive nor false-negative.
I think the important point here is that we can't really "verify that there is no other regressions" if we don't have tests to do that. Unless somebody writes a doc that says "when making a change to cryptohome, test the following things", manually testing for regressions doesn't work unless we know what to test for.

We need to be able to test this. If this requires adding more logging to the kernel, or exporting more state via /proc or /sys, then we should do that.

Gwendal, what do you think?
Gwendal, did you see this bug? If dropping the cache is that hard to test, we'll need to either expose more logging (like Jorge suggests), or add this to a manual testing checklist for our test team before a release goes to beta. But I don't believe we can ignore such a critical part of the system.
Cc: sarthakkukreti@chromium.org
I am planning to add a couple of autotests that cover the verification of whether caches were dropped or not. If drop_caches failed for any reason, this would result in visible /home/.shadow/<user>/mount directories (and all subdirs/files would be accessible, subject to ACLs). 

So, the test would simply create a file (within /home/.shadow/<user>) while the session is active, logout and check whether the same path is accessible.

One of the cases where this might fail is that drop_cache is usually "best effort": if a process keeps an fd open to a file in the mount, it would be virtually impossible to check that the unless the test traverses the complete mount. But that would usually indicate a problem with another process (since we expect the caches to be dropped once all other processes have stopped), so I am not sure if we should be accounting for it.
That sounds good, should we assign this bug to you Sarthak?
Sure. Apart from testing this regression in an autotest, are there still plans for handling errors from 'cryptohome --unmount' in the ui-post-stop script (as mentioned in c#4)?
Cc: hashimoto@chromium.org
Owner: sarthakkukreti@chromium.org
Hashimoto, thoughts on c#12?
Labels: M-64
Project Member

Comment 16 by bugdroid1@chromium.org, Oct 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2e43b91f9f168b99274ffc44e689de449545600d

commit 2e43b91f9f168b99274ffc44e689de449545600d
Author: Sarthak Kukreti <sarthakkukreti@chromium.org>
Date: Mon Oct 16 21:14:05 2017

cryptohome: check user data visibility after logout

On logout, decrypted user data cached by the file system should
be cleared and become inaccessible. This test checks if a user-
mount file is accessible after the user logs out.

BUG= chromium:769740 
TEST="test_that <DUT_IP> platform_CryptohomeDataLeak" passes
when /proc/sys/vm/drop_caches is written to during key
invalidation and fails when it is not written to.

Change-Id: Ic8af4ca85e7ec444e86f882e07cc61ab9479bdcc
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/719576
Reviewed-by: Jorge Lucangeli Obes <jorgelo@chromium.org>

[add] https://crrev.com/2e43b91f9f168b99274ffc44e689de449545600d/client/site_tests/platform_CryptohomeDataLeak/platform_CryptohomeDataLeak.py
[add] https://crrev.com/2e43b91f9f168b99274ffc44e689de449545600d/client/site_tests/platform_CryptohomeDataLeak/control

Project Member

Comment 17 by bugdroid1@chromium.org, Oct 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/98a537620fa9e54d5f618b5e99a9736c43a06570

commit 98a537620fa9e54d5f618b5e99a9736c43a06570
Author: Sarthak Kukreti <sarthakkukreti@chromium.org>
Date: Fri Oct 27 23:14:06 2017

cryptohome: Enable login_CryptohomeDataLeak

BUG= chromium:769740 
TEST=test_that <DUT_IP> login_CryptohomeDataLeak
CQ-DEPEND=CL:740916

Change-Id: Ie766455e16bdf2bdd8aae6fb29cc3c731f0de740
Reviewed-on: https://chromium-review.googlesource.com/719885
Commit-Ready: Sarthak Kukreti <sarthakkukreti@chromium.org>
Tested-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Reviewed-by: Ryo Hashimoto <hashimoto@chromium.org>

[modify] https://crrev.com/98a537620fa9e54d5f618b5e99a9736c43a06570/chromeos-base/autotest-chrome/autotest-chrome-9999.ebuild

Greg, do you think this bug requires more tests, or is login_CryptohomeDataLeak enough?
Status: Fixed (was: Assigned)
This is good, I really appreciate the help everyone.

Sign in to add a comment