Long delays at login due to sync() calls in cryptohome |
||||||
Issue descriptionI'm sometimes seeing abysmal login times when doing development on a ToT lumpy. Digging around a bit, it looks like cryptohomed may be blocking on calls to sync(): ----- chrome: [14122:14122:0711/090956:VERBOSE1:existing_user_controller.cc(398)] Setting flow from PerformLogin [14122:14122:0711/090956:VERBOSE1:user_flow.cc(20)] Flow 0x1e7fa45c1570 got host 0x1e7fa4aa2420 [14122:14122:0711/090956:VERBOSE1:login_performer.cc(278)] Offline auth started. [14122:14122:0711/091021:VERBOSE1:gaia_screen_handler.cc(392)] OnPortalDetectionCompleted Online [14122:14122:0711/091026:VERBOSE1:cryptohome_authenticator.cc(706)] Resolved state to: 12 [14122:14122:0711/091026:VERBOSE1:cryptohome_authenticator.cc(613)] Login success [14122:14122:0711/091026:VERBOSE1:login_performer.cc(82)] LoginSuccess hash: e2d8d554ca4e43218a5acb5121bd3763e5231fb2 [14122:14122:0711/091026:VERBOSE1:user_session_manager.cc(485)] Starting session for derattest2@gmail.com ----- cryptohomed: 2016-07-11T09:10:26.085372-06:00 WARNING cryptohomed[1342]: Long sync(): 29 seconds ^^^^^^^^^^^^^^^^^^^^^^^ 2016-07-11T09:10:26.095956-06:00 INFO cryptohomed[1342]: Creating pass-through directories /home/.shadow/e2d8d554ca4e43218a5acb5121bd3763e5231fb2/vault/user/GCache/v1/tmp ----- After digging around in git, I think that the original sync() call was added by https://chromium-review.googlesource.com/#/c/14485/, but it's grown into multiple calls now: - Mount::ForceUnmount() calls sync() before unmounting if the mountpoint was busy - Mount::ForceUnmount() calls sync() after unmounting unconditionally - Mount::MountCryptohomeInner() calls sync() before clearing the user keyring - Mount::UnmountCryptohome() calls sync() before clearing the user keyring - maybe more? I'm skeptical that cryptohomed should ever be calling sync(). Shouldn't it be doing more-targeted syncs of the files and directories that it actually cares about? (Its Platform class already has methods for this.) In the worst case, can't it at least call syncfs() instead? It seems like using sync() (documented as blocking until the "actual writing is done" since Linux 1.3.20) could cause all sorts of unintended delays -- won't it even block on system updates going on in the background?
,
Jul 11 2016
Possibly a dupe of bug 625514?
,
Jul 11 2016
The timing log on the other bug points at IME rather than cryptohome, doesn't it?
,
Jul 12 2016
Here's another one I just saw: ---- [9147:9147:0712/095847:VERBOSE1:login_performer.cc(278)] Offline auth started. [9147:9147:0712/095925:VERBOSE1:cryptohome_authenticator.cc(706)] Resolved state to: 12 [9147:9147:0712/095925:VERBOSE1:cryptohome_authenticator.cc(613)] Login success ---- 2016-07-12T09:58:48.122276-06:00 ERR cryptohomed[1343]: XXX ClearUserKeyring: pre-clear sync start ... 2016-07-12T09:59:25.355928-06:00 WARNING cryptohomed[1343]: Long sync(): 37 seconds 2016-07-12T09:59:25.355966-06:00 ERR cryptohomed[1343]: XXX ClearUserKeyring: pre-clear sync end ---- Gwendal, it looks like you added this call to sync() in https://chromium-review.googlesource.com/353354. I don't understand why it's there: - In the call from Mount::MountCryptohomeInner() (which I think is what I'm seeing take a long time), I believe that we've already unmounted previously. Why do we need to sync here? - In the call from Mount::UnmountCryptohome(), ForceUnmount() already called sync() if the filesystem couldn't be unmounted. What case is the additional call trying to guard against? I'm working on cleaning this up (so we just sync the directories we care about), unless someone tells me that that's a horrible idea. :-P
,
Jul 13 2016
Potential performance issues were called out during the review of https://chromium-review.googlesource.com/353354. FWIW, I'm supportive of rolling back this CL while the performance issues are investigated.
,
Jul 13 2016
,
Jul 13 2016
Looks like 53 might be affected as well.
,
Jul 13 2016
#CBC-TC-RS Watchlist Ditto for Asus Flip Google Chrome 53.0.2785.4 (Official Build) dev (32-bit) Revision 6dfaf57f2779c1591d26fe68632c0fd6a4457a4a-refs/branch-heads/2785@{#7} Platform 8530.6.0 (Official Build) dev-channel veyron_minnie ARC 3026157
,
Jul 13 2016
(I'm not familiar with this codebase, but I uploaded a speculative change to make the syncs more targeted at https://chromium-review.googlesource.com/#/c/360132/ .)
,
Jul 13 2016
#4,#5: do rollback c/353354, let me know if you have a test to reproduce the performance degradation so that I don't make the error again. I thought that sync would not be a performance issue because as pointed in #4, ClearUserKeyring should be called after an existing explicit sync() [so as I understand, little performance penalty] or unmount [where I think a sync() is implied]. But thinking about it again, if a lot of disk activities happen between logins, we pain the price at MountInner() time. Then, I am wondering if have to call ClearUserKeyring in MountInner at all. On ext4 crypto, a sync and dropping the cache is required for consistency and security reason before removing the key: - without sync, when the key is gone, ext4 would not be able to write data to the disk (it would not know how to encrypt it) - without dropping the cache, you could read data from encryption directory in clean even after the key has been removed, as long as someone put it in the cache previously. Similarly, although I haven't recreated it, the flush thread from ecryptfs to ext4 would be confused if it had to write data from the ecryptfs cache to the ext4 cache without having access to the key. ClearUserKeyring removes all keys in root user keyring, not only the keys of the user directory cryptohome thinks is mounted. Looking at the code, when cryptohome crashes and restarts, it tries to recreate the mount object, but I was not sure it could unmount properly stale mounts from the previous cryptohome process.
,
Jul 13 2016
#11: I'm able to repro a slow login on ToT lumpy by doing the following: 1. Boot the device. 2. SSH to it as root and run "dd if=/dev/zero of=/mnt/stateful_partition/foo.bin bs=1M count=512" (may need to play with the count). 3. Try to log in via the UI. 4. Look at /var/log/messages and see something like "WARNING cryptohomed[1341]: Long sync(): 17 seconds". To repeat my earlier question: why is cryptohomed calling sync() instead of calling fsync() on whatever it actually cares about? Sure, we can revert the call that adds the sync() to ClearUserKeyring, but there are still other sync() calls in cryptohomed, and they'll still block login and logout if they're called when there's a lot of data to flush, regardless of whether the data has any relation to encrypted home directories... right? Unmounting isn't the same as sync() since it's only going to block on the mount point being synced, right? With https://chromium-review.googlesource.com/#/c/360132/ applied, I no longer see the login delay, but I'm also not sure whether it's still covering whatever the sync() calls were intended to cover.
,
Jul 29 2016
Um. So. Opinions? My change is still sitting around at https://chromium-review.googlesource.com/#/c/360132/, although it has a merge conflict now.
,
Aug 1 2016
I [vaguely] remember various bugs (some worked on by +tnagel) that were caused or exacerbated by a failure to consistently sync to disk. The sync() calls may well have been an over-reaction and I'm supportive of your change. At the same time it's difficult to predict what sort of issues we may see if the sync() calls are removed and the current cryptohome maintainers should be aware / watchful / supportive so I'll leave the +2 to them.
,
Aug 11 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8 commit 40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8 Author: Daniel Erat <derat@chromium.org> Date: Tue Jul 12 17:11:03 2016 cryptohome: Avoid calling sync() when possible. Make more-targeted fsync() calls on source and destination mount directories instead of using the big, slow sync() hammer. BUG= chromium:627115 TEST=unit tests pass; no longer see 20+ second hangs when logging in after syncing an updated version of chrome to the device Change-Id: Ib4e4477609cd6a1c21768fe0be6b8fb1551ac23d Reviewed-on: https://chromium-review.googlesource.com/360132 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Gwendal Grignou <gwendal@chromium.org> [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/mount_stack.cc [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/mount.cc [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/mock_platform.h [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/mount.h [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/platform.cc [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/mount_stack_unittest.cc [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/mount_stack.h [modify] https://crrev.com/40dd95d5b501a5e5dcc6bbdf4ce4be347f3f7fd8/cryptohome/platform.h
,
Aug 11 2016
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by derat@chromium.org
, Jul 11 2016