Drop cache code in cryptohome needs autotests |
||||||
Issue descriptionWe almost shipped a very serious cryptohome regression, because cryptohome was no longer able to write to the /proc/sys/vm/drop_caches file. Adding to the problem, the code in c83ef6c950 did not check the return code of the base::WriteFile() function correctly, and so cryptohome never logged the failure. We have fixed both issues, but we also need to do a long term fix to this problem to reboot the machine immediately if clearing the decrypted cache fails. An autotest that logged users in and out and checked if the decrypted cache was cleared would have caught both of the bugs. Having a test for this is a high priority as we otherwise risk shipping this regression, and it hampers our ability to improve the error handling in the future. Can you please expand the cryptohome autotests to include this case, and ideally any other new and important functionality introduced with ext4? Thank you.
,
Sep 29 2017
> Could you give me a pointer to the bugs and the CLs The bug about sandboxing causing the /proc/sys/vm/drop_caches write to fail is here: https://bugs.chromium.org/p/chromium/issues/detail?id=764540 > What was wrong with the said code? The CL that caused the problem with base::WriteFile is: https://chromium-review.googlesource.com/c/chromiumos/platform2/+/428525 The problem with how the return value is handled is that base::WriteFile() returns the number of bytes written, or -1 on error. It does not return a boolean true/false, but that CL expected a bool from base::WriteFile. The return value handling was corrected in: https://chromium-review.googlesource.com/c/671188/ > I'm even not sure if it's possible to test the behavior of the page cache reliably with autotests (i.e. page caches being evicted doesn't necessarily mean it's evicted by /proc/sys/vm/drop_caches). I was thinking a simple autotest that logs a user in, logs the user out, and makes sure there isn't decrypted information under the /home/root/$X/mount folder. Do you mean that the decrypted contents aren't reliably cleared? > Instead, shouldn't we just make sure that cryptohome's unmount D-Bus method correctly returns an error when drop_caches fails, and the error is handled correctly? We should fix the return value being ignored, but we still need a test to know if it works now and in the future. I was thinking cryptohome() itself would initiate the reboot as a single choke point, but if ui-post-stop script runs on everyone user logout, that could work as well. What do you think? Regards, Greg
,
Oct 2 2017
Thank you for the pointers, so the said bug was a regression caused by the sandboxing, and IIUC it's not just a problem of the ext4 encryption code. Before enabling sandboxing, did anyone use a tool (e.g. strace) to get the list of all syscalls performed by cryptohomed with and without sandboxing to make sure there is no regression? For long, cryptohomed had been expected to run with all the root privileges, so every part of cryptohomed can be silently regressed like the drop_cache code.
,
Oct 2 2017
> Do you mean that the decrypted contents aren't reliably cleared? Provided the kernel code doesn't have a bug, drop_caches reliably clears the page cache (i.e. "drop_caches succeeded" → "cache is cleared" is always true), but that doesn't mean the converse (i.e. "cache is cleared" → "drop_caches succeeded" is not always true). Page cache can be cleared for any other reasons. Triggering drop_caches results in adding a new log line to /var/log/messages which says "drop_caches: 3", so you can know if drop_caches was triggered or not, but you cannot still know if it was executed at the right timing. Triggering drop_caches without invalidating the key doesn't result in totally clearing the decrypted data. > if ui-post-stop script runs on everyone user logout ui-post-stop is the only code which executes cryptohome unmount, and it should run whenever a user (or multiple users) logout.
,
Oct 2 2017
I think what Greg is trying to say here is that there are a bunch of ways in which the drop_cache functionality can break, and given that it's critical functionality, it should have a regression test. It's impossible to make meaningful progress on a project as large as Chrome OS if we have to manually test everything before making large-scale changes. As Linus Upson used to say, "if you liked then you should have put a test on it." The discussion here is not who was to blame for this individual bug. We're operating in postmortem mode here: we identified a critical lack of testing that would have prevented this bug. We should add a test so that we don't regress this way again, from a security change, a kernel uprev, or something else. Chrome OS has a great continuous integration infrastructure, with PreCQ and CQ. Let's use it to guarantee that this critical functionality doesn't regress. Hashimoto, are you the right owner for this?
,
Oct 2 2017
Thanks for chiming in Jorge. Indeed I don't really care who is to blame for what at this point. My concern is that we added code that is critical to the security and privacy of the system and it had no test, so it broke and nobody noticed. The reason I pointed out that the original base::WriteFile() code did not handle the error correctly, was to highlight why a test is so important: even if we had rebooted the machine on a failure of the dbus call, the failure was never being propagated. So given the intricacies of this code that you pointed out, who is the right owner for this test?
,
Oct 3 2017
+gwendal who added ext4 encryption code with me. Sorry for sounding defensive, but my comment in #3 is not about the blame, but about detecting other possible regressions. As you know, cryptohome's test coverage is not wide enough, and I guess the sandboxing was the largest change which has happened to cryptohome, so I think it'd be really assuring if it's verified that there is no other regressions. As I wrote in comment #4, I'm not sure if it's possible to write a reliable test for the said code which doesn't result in neither false-positive nor false-negative.
,
Oct 3 2017
I think the important point here is that we can't really "verify that there is no other regressions" if we don't have tests to do that. Unless somebody writes a doc that says "when making a change to cryptohome, test the following things", manually testing for regressions doesn't work unless we know what to test for. We need to be able to test this. If this requires adding more logging to the kernel, or exporting more state via /proc or /sys, then we should do that. Gwendal, what do you think?
,
Oct 5 2017
Gwendal, did you see this bug? If dropping the cache is that hard to test, we'll need to either expose more logging (like Jorge suggests), or add this to a manual testing checklist for our test team before a release goes to beta. But I don't believe we can ignore such a critical part of the system.
,
Oct 12 2017
I am planning to add a couple of autotests that cover the verification of whether caches were dropped or not. If drop_caches failed for any reason, this would result in visible /home/.shadow/<user>/mount directories (and all subdirs/files would be accessible, subject to ACLs). So, the test would simply create a file (within /home/.shadow/<user>) while the session is active, logout and check whether the same path is accessible. One of the cases where this might fail is that drop_cache is usually "best effort": if a process keeps an fd open to a file in the mount, it would be virtually impossible to check that the unless the test traverses the complete mount. But that would usually indicate a problem with another process (since we expect the caches to be dropped once all other processes have stopped), so I am not sure if we should be accounting for it.
,
Oct 13 2017
That sounds good, should we assign this bug to you Sarthak?
,
Oct 13 2017
Sure. Apart from testing this regression in an autotest, are there still plans for handling errors from 'cryptohome --unmount' in the ui-post-stop script (as mentioned in c#4)?
,
Oct 13 2017
Hashimoto, thoughts on c#12?
,
Oct 13 2017
,
Oct 13 2017
The autotest CL is up at: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/719576
,
Oct 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2e43b91f9f168b99274ffc44e689de449545600d commit 2e43b91f9f168b99274ffc44e689de449545600d Author: Sarthak Kukreti <sarthakkukreti@chromium.org> Date: Mon Oct 16 21:14:05 2017 cryptohome: check user data visibility after logout On logout, decrypted user data cached by the file system should be cleared and become inaccessible. This test checks if a user- mount file is accessible after the user logs out. BUG= chromium:769740 TEST="test_that <DUT_IP> platform_CryptohomeDataLeak" passes when /proc/sys/vm/drop_caches is written to during key invalidation and fails when it is not written to. Change-Id: Ic8af4ca85e7ec444e86f882e07cc61ab9479bdcc Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/719576 Reviewed-by: Jorge Lucangeli Obes <jorgelo@chromium.org> [add] https://crrev.com/2e43b91f9f168b99274ffc44e689de449545600d/client/site_tests/platform_CryptohomeDataLeak/platform_CryptohomeDataLeak.py [add] https://crrev.com/2e43b91f9f168b99274ffc44e689de449545600d/client/site_tests/platform_CryptohomeDataLeak/control
,
Oct 27 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/98a537620fa9e54d5f618b5e99a9736c43a06570 commit 98a537620fa9e54d5f618b5e99a9736c43a06570 Author: Sarthak Kukreti <sarthakkukreti@chromium.org> Date: Fri Oct 27 23:14:06 2017 cryptohome: Enable login_CryptohomeDataLeak BUG= chromium:769740 TEST=test_that <DUT_IP> login_CryptohomeDataLeak CQ-DEPEND=CL:740916 Change-Id: Ie766455e16bdf2bdd8aae6fb29cc3c731f0de740 Reviewed-on: https://chromium-review.googlesource.com/719885 Commit-Ready: Sarthak Kukreti <sarthakkukreti@chromium.org> Tested-by: Sarthak Kukreti <sarthakkukreti@chromium.org> Reviewed-by: Ryo Hashimoto <hashimoto@chromium.org> [modify] https://crrev.com/98a537620fa9e54d5f618b5e99a9736c43a06570/chromeos-base/autotest-chrome/autotest-chrome-9999.ebuild
,
Nov 7 2017
Greg, do you think this bug requires more tests, or is login_CryptohomeDataLeak enough?
,
Nov 7 2017
This is good, I really appreciate the help everyone. |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by hashimoto@chromium.org
, Sep 29 2017