New issue
Advanced search Search tips

Issue 822302 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Sep 1
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug


Participants' hotlists:
TPM-Userland-Performance


Sign in to add a comment

cryptohome: delay between login-prompt-visible and cryptohomed readiness on tpm2

Project Member Reported by apronin@chromium.org, Mar 15 2018

Issue description

[Spawned from issue 815269]

Unlike tpm 1.2 case, tpm-related daemons and scripts - cr50-update, trunksd, tpm_managerd, attestationd, cryptohomed, chapsd - start on 'starting system-services' and depend on each other in that order.

On the 1st login after boot, that causes a 5-10 seconds delay after login-prompt-visible (a trigger for system-services) before cryptohomed can serve requests from Chrome.

See https://crbug.com/815269#c13 based on http://listnr/product/208/report/85185344318 for an example: cr50-update _starts_ 2.6 seconds after login-prompt-visible, trunksd is ready at 3.3s mark, cryptohomed becomes ready somewhere between 7.7..9 seconds. 

It's not 100% clear yet if Chrome normally needs cryptohomed earlier than in 5 seconds even if the password is entered quickly. Still, a visible delay is likely.

The fix might include:

1) internal cryptohomed optimizations to register on dbus before all initializations (which include communicating to the previously starting attestationd, for example) are done; that may or may not help things much, since while the requests will be accepted earlier they still can't be completed until after all initializations are done;

2) reducing dependencies for cryptohomed on other tpm daemons and starting in parallel with them;

3) starting tpm-daemons before login-prompt-visible (otherwise, based on the example, ~3 seconds of delay will remain even if cryptohomed starts first after login-prompt-visible).
 
Re: 3) starting tpm-daemons before login-prompt-visible 

We delay all the other stuff until login prompt is visible to make ui more responsive. Basically, if you add more disk IO code before showing UI, it will delay login prompt even more.
Labels: -Pri-2 Pri-1
Re #1: Understood, but note that it's a UI issue in any case (possibly later login-prompt-visible vs noticeable delay after login)
1) cryptohomed doesn't do much disk IO on start (it is not mounting anything until requested by Chrome, though it does read state from disk), but it does communicate with other daemons over dbus and waits on TPM operations, for example;
2) for TPM 1.2, cryptohome already starts before login-prompt-visible (dependency chain: boot-services -> tpm-probe -> tcsd -> chapsd -> cryptohomed) and there it does the reads the same status from disk.

I will measure before proposing to go with option 3, of course. I suspect that login-prompt-visible won't see a noticeable delay, but that needs to be backed.
> I suspect that login-prompt-visible won't see a noticeable delay, but that needs to be backed.

Sounds good.
Cc: conradlo@chromium.org
Here are my $0.02 cents FWIW: Moving services to start earlier should be a last resort. Given that it takes almost 10 seconds in the worst case for cryptohomed to become ready, there seems to be quite a bit of potential for optimization of the cryptohomed startup sequence. I'd focus on that - the slow startup issue with the TPM 2.0 stack is likely going to cause further grief down the road (it already has caused problems with bootlockboxd), so this needs to be fixed sooner or later.
Looked closer at cryptohomed start perf. On fizz with ToT Chrome OS:
Initializing it from starting the process to service.Run() typically takes ~2.8s (2.3..3.0 seen). There are two main long delays:

1) Getting tpm status from tpm_managerd at the beginning of TpmInit::SetupTpm. Typically 700-800 ms (0.4 - 1.5 s seen). The majority of time we are waiting for the 1st task on tpm_managerd side - TpmManagerService::InitializeTask - to finish before it can start processing the GetTpmStatus request from cryptohomed. Then it takes ~150ms to actually get the status from the TPM.

2) Loading lockbox (lockbox()->Load()) inside InstallAttributes::Init. Typically ~1.5..1.6 s. We are checking TPM readiness and reading TPM NVRAM inside. Haven't yet checked which part is the longest inside.

Other TPM commands also take time, but are far behind #1 and #2 above, typically taking 150-200 ms for some operations: LoadOrCreateCryptohomeKey, GetLECredentialBackend (plus the above-mentioned actual GetTpmStatus seems to also eat ~150ms out of ~750ms that cryptohomed waits for response from tpm_managerd).

Next step: checking if the same profile is seen on other tpm2 devices (e.g. eve), and comparing with profile on 1.2 devices (e.g. link).
Status: Started (was: Untriaged)
Got similar to #7 results on eve. Most time spent during cryptohomed startup in:
1) waiting for tpm_mananagerd Initialize task to complete. Typically 1.0..1.2s.
2) Loading lockbox. Typically 1.6..1.7s.

LoadOrCreateCryptohomeKey (when we need to load, first creation takes longer), GetLECredentialBackend take 200..250ms. Note that i2c (used on eve) is generally slower for tpm communications that spi (used on fizz).

Total time from cryptohomed's main() to service->Run ~ 3.4..3.6 seconds (and 2.8..3.2s on fizz).

Note: The focus of this analysis is regular (non-first) logins. Running on an unowned tpm, or owned, but w/o a created owner user will lead to different results.  Times for tpm initialization, creating/loading cryptohome key, dealing with lockbox are different before everything is set up for the first time. 
For TPM1.2 on Link the situation is different. The longest operations are:
1) LoadOrCreateCryptohomeKey. Typically 2.8..2.9s!
2) AttestationInitialize. Typically 600..700ms
3) Loading lockbox. Typically 175..185ms (once 700+ms).
Overall, the time from cryptohomed's main() to service->Run() is 3.8..4.0s. Which is higher that for TPM2.0 (3.4..3.6s on eve). However, note that on TPM1.2 devices, tpm daemons are started before login-prompt-visible, while on tpm 2.0 they wait until that signal is emitted. As a result:

 - for cryptohomed on TPM 1.2 (Link): 
  -- main() is launched ~2.75s after kernel start, and before login-prompt-visible;
  -- service->Run() is reached 1.6..1.8s after login-prompt-visible.

 - for cryptohomed on TPM 2.0 (Eve):
  -- main() is launched ~4.1s after kernel start, already 1..1.2s after login-prompt-visible
  -- service->Run() is reached 4.4..4.7 seconds after login-prompt-visible. That's 3 seconds longer from login-prompt-visible to cryptohomed readiness on Eve than on Link.
Based on #10, I suspect that the solution should be a combination of (a) speeding up the slowest operations on TPM 2.0 devices - tpm_managerd initialization and and lockbox initialization; (b) allowing tpm-daemons to start before login-prompt-visible on tpm 2.0; (c) and possibly looking for a way to turn some of the cryptohomed initialization steps into non-blocking asynchronous ops.

Next steps:
1) measuring the effect of starting daemons before login-prompt-visible;
2) analyzing tpm_managerd and lockbox perf.
For item 1 from #11: Measured effects of depending on 'started boot-services' instead of 'starting system-services' on fizz and eve. Used https://crrev.com/c/1128446 for capturing additional timestamps at boot, and  https://crrev.com/c/1128449 + https://crrev.com/c/724238 to change the dependency.

For eve:
 - that slows down login-prompt-visible by ~0.05s, from ~3s to ~3.05s since boot (+1.6%);
 - cryptohomed is ready ~2.5s after login-prompt-visible, instead of ~4.5s as now.

For fizz:
 - that slows down login-prompt-visible by ~0.15s, from ~4.00s to ~4.15s since boot (+3.2%);
 - cryptohomed is ready ~1.9s after login-prompt-visible, instead of ~4.4s as now.

For comparison, for link: login-prompt-visible is at ~5s mark, and cryptohomed is ready ~1.9s after it.

More details & stats from the experiment are in go/cryptohomed-startup-metrics.

Bottomline: for fizz/eve, at the cost of <150ms slowdown to login-prompt, we get >2s improvement to cryptohomed readiness time. Which (almost) brings it to the same ballpark (~2s after login-prompt) as seen for Link, where iiuc we have no reported issues for post-password-entering delays. Is it worth it?

(Of course, this is only for item 1 of #11; performance improvements for tpm-daemons will also be investigated and applied regardless of changing these upstart dependencies.)
For item 2 from #11:

1) tpm_managerd startup slowness is due to the fact that it runs in parallel with attestationd startup. And attestationd uses policy session (StartAuthSession: 0.8s + severeal more commands) to load a key. Next step: (1) investigate if a saltless session can be used in attestationd.

2) lockbox Load performance is dominated by reading nvram, which is delegated to tpm_managerd, and uses a session (StartAuthSession ~ 0.75s). Plus, long RSA_Decrypt (another ~0.75s) is running in parallel. A
Areas of investigation: 
(a) Can we use a saltless session in this case? The data read from nvram is not secret, but should be protected from tampering. 
(b) do we even need to read nvram data at this point again (or can take input from mount-encrypted)?
(c) fixing issue 858863 first and using persistent sessions everywhere.
(d) who needs decrypt with a tpm-bound key at this stage (chapsd? attestationd?) and can it be postponed (likely, won't help the overall login time, but will help time to cryptohome readiness)?
Btw. cryptohomed doesn't have a reason to load the lockbox at all. In fact, I have started working on switching it over the lockbox contents dumped by mount-encrypted:  issue 847438 
H, re-measured the effects of CLs from #12 on fizz with the current ToT (M70-10926.0.0). Something has changed. Now the difference with that CL and w/o it is bigger that 0.05..0.15s measured in #12. See attached before and after.

Also, I see occasional spikes for login-prompt-visible. Normally it starts at around ~4s, but occasionally spikes to 6+ s. It happens both with and without the CL. The attached before & after capture only one such spike (in 'before', i.e. w/o the CL). But a saw more when manually checking results.

Looks like an additional TPM command or commands were added somewhere, and now cryptohomed may occasionally wait on them, if started a bit later.

Bottomline: Will need to re-measure and check what's going on on M70 before continuing with CL:724238.
before.txt
1.4 KB View Download
after.txt
1.4 KB View Download
More measurements before & after. It seems the time to login-prompt-visible is not that stable, but still the diff between before and after appears to be bigger than 0.15s I had in #12. Here, even if we discard those 6+ second peaks, which may be unrelated and can happen with or w/o the CL, avg(before) = 3.91, and avg(after) = 4.13 (+0.22).

It should be noted that there are also multiple crashes for mosys, flashrom, which may affect the timing. In any case, makes sense to wait with the CL.
before2.txt
2.9 KB View Download
after2.txt
2.9 KB View Download
Updated results for Fizz (which was worse than Eve) on the current ToT in go/cryptohomed-startup-metrics (this time on 50 iterations).

The diff(time_to_login_prompt) is indeed up to +0.20s (<+5%), from 3.8..4.4s to 4.0..4.6s. The savings in diff(time_to_cryptohomed_readiness) is -2.8s, so it's worth it. Will land the CL.

The results above is with anomalous delays (cases of time-to-login-prompt-visible >= 6s) discarded. There were 10/50 such long delays before change, and 7/50 after change. When that happens we get 6.15..6.5s instead of 3.8..4.6s till login-prompt-visible. The reason for those spikes is yet unknown, but is unrelated to starting cr50-update script earlier.
Project Member

Comment 18 by bugdroid1@chromium.org, Aug 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/efbd4ef7f5fa448fe558b7c360af41a58d93e90d

commit efbd4ef7f5fa448fe558b7c360af41a58d93e90d
Author: Andrey Pronin <apronin@chromium.org>
Date: Tue Aug 07 08:50:46 2018

tpm_manager: change dependency system-services -> boot-services

Before this CL tpm_managerd.conf depended on 'starting system-services'
in addition to 'started tcsd/trunksd'. Currently, that doesn't affect
anything since we have 'starting system-services' -> cr50-update -> ...
-> trunksd -> tpm_managerd chain of dependencies. However, if we decide
to start tpm-daemons earlier in the boot sequence (e.g. CL:724238),
having 'starting system-services' in tpm_managerd.conf would prevent
such earlier start.

This CL changes 'starting system-services' to 'started boot-services'.
By itself it doesn't change anything, but allows further optimizations
in CLs like CL:724238.

Also, this CL fixes 'stop on' rule for tpm_managerd - it stops it on
'stopping boot-services', as is done for all other tpm-daemons.

BUG= chromium:822302 
TEST=boot and login

Change-Id: Ibd4f84a2caac9d36e0dd13daef2d7f93beb7758f
Reviewed-on: https://chromium-review.googlesource.com/1128449
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Andrey Pronin <apronin@chromium.org>
Reviewed-by: Mattias Nissler <mnissler@chromium.org>

[modify] https://crrev.com/efbd4ef7f5fa448fe558b7c360af41a58d93e90d/tpm_manager/server/tpm_managerd.conf

Project Member

Comment 19 by bugdroid1@chromium.org, Aug 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/5b5a4657cf6eb4853abedad639dc208e435ece62

commit 5b5a4657cf6eb4853abedad639dc208e435ece62
Author: Andrey Pronin <apronin@chromium.org>
Date: Tue Aug 07 21:37:06 2018

chromeos-cr50-scripts: start on started boot-services

Before this CL, a TPM2.0-specifc cr50-update upstart script waited for
login-prompt-visible, since it is a pre-condition for 'system-services'.
cryptohomed sits several links down the chain of daemons, which starts
after cr50-update. This cr50-update dependency led to the situation
when cryptohomed in TPM 2.0 case started much later than login prompt and
that caused delays with processing user passwords, if entered quickly.

This CL changes cr50-update dependency to 'started boot-services'.
That's the point where tpm-related daemons are started in TPM 1.2 case.
This changed allows reducing the delay from login-prompt-visible to
cryptohomed readiness for TPM 2.0 case to figures comparable with TPM
1.2 case (from ~4.5s to ~2.0/2.5s) while increasing time-to-login-prompt
by 0.05..0.15s only (1.6..3.2%), and still keeping time-to-login-prompt
shorter than in TPM 1.2 case. More details are provided in chromium:822302
and go/cryptohomed-startup-metrics.

This CL is a rare exception to the general rule, allowed due to the fact
that in terms of UX benefits this significant decrease of time to actual
login outweighs slight increase of time to login prompt.

BUG=b:67912719
BUG= chromium:822302 
TEST=boot and login

Change-Id: I12344116d31c227f6b7d26d647a5e75f5c5e44b2
Reviewed-on: https://chromium-review.googlesource.com/724238
Commit-Ready: Andrey Pronin <apronin@chromium.org>
Tested-by: Andrey Pronin <apronin@chromium.org>
Reviewed-by: Mattias Nissler <mnissler@chromium.org>

[rename] https://crrev.com/5b5a4657cf6eb4853abedad639dc208e435ece62/chromeos-base/chromeos-cr50-scripts/chromeos-cr50-scripts-0.0.1-r31.ebuild
[modify] https://crrev.com/5b5a4657cf6eb4853abedad639dc208e435ece62/chromeos-base/chromeos-cr50-scripts/files/cr50-update.conf

Status: Fixed (was: Started)
#3 is done to address login-prompt -> cryptohome-ready time.
To keep track of the other stuff realetd to crypthomed startup performance (before login-prompt-visible) created a separate issue 879797.

Sign in to add a comment