New issue
Advanced search Search tips

Issue 827682 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 7
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Task



Sign in to add a comment

Possible boot perf regression from tpm-rng quality

Project Member Reported by evgreen@chromium.org, Mar 30 2018

Issue description

This bug serves as a reminder to ensure that boot performance is not regressed due to the lack of CL:967183.

A previous CHROMIUM patch CL:427668 did two things to address a boot perf regression in chromium:673183. It changed the tpm-rng driver to load earlier, and gave tpm-rng a "quality" value.

Upstream modified the design here to remove the tpm-rng driver and integrate it into the tpm core itself. With this, the first part of the patch changing the driver order is no longer applicable.

The second part of the patch, assigning a Quality value to tpm-rng data, is the less well understood one. Assigning a non-zero quality value causes the tpm to spin up a thread with hwrng_fillfn. A change in BoringSSL seemed to require more random data earlier, which blocked boot and caused the performance regression. By spinning up the hwrng_fillfn thread, random data can be supplied in parallel with the rest of boot, in theory parallelizing boot and fixing the regression.

I did some bootperf runs [1] to see if I could reproduce this performance regression on 4.14. I ran it on Clapper (where the original bug was reported), Samus, and Eve. I was unable to notice any difference between having a Quality value set and not having it set, except maaaybe on Eve there was something there. Unsure, more analysis is probably needed.

Since the CHROMIUM change conflicts with the upstream change, we decided to revert that change from 4.14. The argument being that if that change is really needed, it should be upstreamed.

This bug serves as a reminder to make sure that performance is not regressed on 4.14, or if it is then the corresponding change should be picked as an upstream patch.


[1] https://docs.google.com/spreadsheets/d/1gQLcHdWQkBbH1RW0nDasBaK5Wdyz1fdQb5oLRHN0Z80/edit?usp=sharing

 
Owner: apronin@chromium.org
Status: Assigned (was: Untriaged)
As per CL:967183, Andrey said:

> How urgent is this? I know it's small, but I'll need to do more 
> research on the right universal quality value. I'm in a bit of 
> too much multitasking already at the moment, so will be happy 
> to help with upstreaming, but in a couple weeks.

...which sounds like he volunteered to own this task, so assigning it to him.  Andrey: if something has changed and you can't own this then please yell and we can find someone else to own.  M-69 stable as a deadline gives several months for this to get done.
Cc: louiscollard@chromium.org
Owner: louiscollard@chromium.org
Status: Started (was: Assigned)
I have spent some time looking at this today. With the huge disclaimer that I am totally new to all of this, my thoughts so far are below.

To recap some extra context from the various linked bugs, mainly for my own benefit:

On at least two occasions, there have been significant increases in boot time on ChromeOS, caused by the combination of two factors:
- something during boot blocking until the kernel entropy pool is initialized
- lack of sufficient entropy-generating events during boot

The fix for this problem is to make enough entropy available to the system early on, such that boot is not blocked.

It was decided that the TPM's RNG should be used as an entropy source; this provided sufficient entropy early enough to not impact boot speed.

As a side note, it's worth noting that during work on the initial regression, some changes related to drm/dp landed, which caused more interrupts during boot - providing another source of entropy early on. This also largely 'fixed' the problem on the board in question.

As mentioned above, the TPM's RNG is 'enabled' by allocating it a quality score, which causes the kernel to use it as an entropy source. 

The current decision to use the TPM's RNG, and the score used, were motivated by reducing boot time on a specific model of Chromebook - this doesn't seem ideal; we should better understand the implications, and if possible, upstream any necessary changes. 

So, some questions to investigate:

(0. Is it a good idea to use the TPM as a source of entropy? Seems like it..)

1. What is the impact of the quality score?

2. Is the quality score TPM-specific?

3. What should the quality score be?

4. Can this be upstreamed?

---

0. Whether using the TPM is a good idea:

Seems like it should be, however, the TPM spec (sections 4.2.5 in spec 1.2 and 11.4.10.2 in 2.0) doesn't appear to require a true hardware source of entropy, and suggests that sources like "random keyboard strokes or mouse movements" could be used. This seems very similar to the kernel's own sources of entropy - so if that actually is what the TPM is using, it is perhaps not as much of a no-brainer as I had initially assumed that the TPM would be a better source of entropy than otherwise available.

---

1. Impact of quality score

The quality score is supposed to represent the "estimated number of bits of true entropy per 1024 bits".

A positive (non-zero) score will enable the TPM's RNG as a source of entropy. When enabled, the TPM will be queried continuously until the kernel's entropy pool is initialized / becomes re-initialised after being depleted.

The actual value of the score is used to determine how much the entropy score for the kernel's pool should be increased each time entropy from the TPM is added:
https://github.com/torvalds/linux/blob/master/drivers/char/hw_random/core.c#L442
https://github.com/torvalds/linux/blob/master/drivers/char/random.c#L2329

A consequence of this is that a lower score may require more reads from the TPM for the kernel's pool to be initialized (as was observed in the initial bug).

I haven't looked into the interaction between the TPM and other entropy sources, but it seems likely that a high score will cause other sources to contribute much less to the pool, whilst a lower score would give other chances more time to contribute.

---

2. Is the quality score TPM-specific?

The spec does not specify implementation (or very much at all), so I can only assume that the exact behavior of the RNG across TPMs will differ. This article compares performance of several different TPM chips, and does indeed note a difference in quality: https://arxiv.org/abs/1008.2223

That said, given the intrinsic requirement that TPM's RNG be high quality, and given the relatively coarse way in which the quality score is used by the kernel, it may be that a single score could be shared for all known good TPMs.

---

3. What should the score be?

This seems to boil down to whether or not the TPM's RNG can be trusted to generate truly random numbers. If we are absolutely 100% confident it can, then the maximum possible score seems reasonable, and is likely to achieve the fastest possible time for the entropy pool to be [re-]initialized.

That said, having this level of confidence in the TPM doesn't seem particularly realistic, so a score that results in additional entropy sources being used as well seems like a Good Idea.

I need to investigate further to figure out the highest score that would still give other sources of entropy a decent chance at being used.

---

4. Upstreaming

I don't have any experience with this, however it seems like the ChromeOS change in its current form might be tricky to upstream, even if the above issues around choice of value are resolved.

From what I've read today, it seems that TPMs in general can be polarizing in the community, and defaulting to use of a hardware RNG over which the user has no visibility or control, and that could in theory be intentionally flawed - may be difficult. 

A reasonable compromise may be to add a module parameter to configure the quality score.

---

Lastly, just wanted to note that using the TPM is not the only solution to the boot time problem seen on ChromeOS.

We could save some seed at shutdown to initialize the entropy pool on subsequent boot, as suggested here:
https://github.com/torvalds/linux/blob/master/drivers/char/random.c#L175
..however the security implications seem non-trivial, as does the seed used for 'first boot' 

We could use the TPM to add entropy to the pool once, to avoid the boot time issue, but without enabling indefinitely. This seems like a weird halfway house that's probably best avoided.

We could make sure that ChromeOS provides enough entropy through the 'traditional' sources used by the kernel, such that using the TPM is not necessary. However, there likely isn't a sensible way to do this if it isn't happening already.

===

Next Steps:
- I will try and determine a score that uses the TPM as much as possible whilst still mixing in some other source of entropy
- I will investigate adding the score as a parameter, to be upstreamed


> I haven't looked into the interaction between the TPM and other entropy sources, but it seems likely that a high score will cause other sources to contribute much less to the pool, whilst a lower score would give other chances more time to contribute.

My understanding is that hwrng uses only the highest entropy source: https://github.com/torvalds/linux/blob/master/drivers/char/hw_random/core.c#L488 and https://github.com/torvalds/linux/blob/master/drivers/char/hw_random/core.c#L210
But maybe I missed some source rotation somewhere.
> A reasonable compromise may be to add a module parameter to configure the quality score.

Given that depending on the GetRandom implementation on the TPM side the quality may be different, that sounds like a reasonable approach. Alternatively, the model-specific drivers can provide the quality. But probably the best of both worlds is the model-specific-driver giving the default quality (with 0 if the driver is not updated to use this mechanism) with a module parameter to override it (at least be able to set to 0).

On a separate note: it'd be useful to know the right value specifically for cr50 implementation of RNG and use it in cr50* drivers.
Re #4:

Sorry I didn't explain this well - that's how I understand it too; I was thinking more about other non-HW/TPM sources of entropy (eg keyboard, interrupts).

Having read more of the code though, I don't think this is an issue, as those other sources appear to get pushed into the entropy pool when they happen, regardless of the current level of entropy (they don't get 'paused' or discarded when entropy is sufficiently high, in the same way the hwrng appears to).
Regarding the score to assign for cr50 / generally:

I think we should assign a score of 1 (or something small and conservative).

IIUC, the original reason to assign a higher score was to reduce the number of calls to the TPM during boot, so as to minimise any possible delay.

Having read the code some more, I am not convinced that this is necessary.

The higher score will cause the entropy score for the pool to increase more with each read, which will indeed cause fewer reads from the TPM (as the reads stop once entropy reaches a certain level).

However, the case we care about - when we need to initialize the pool, so as to not delay boot - looks like it follows a different path, and the score is not taken into account at all, rather the pool is initialized once 64 bytes of data have been added to it:

https://github.com/torvalds/linux/blob/fe03a7594d86e0754f05e604cd803a6a9aae3c1c/drivers/char/random.c#L2318
https://github.com/torvalds/linux/blob/fe03a7594d86e0754f05e604cd803a6a9aae3c1c/drivers/char/random.c#L858

If my understanding is right, then the subsequent reads from the TPM will continue to add entropy to the pool in the background, but won't block boot.

So, for the original purpose of this change, I don't think there is a justification for having a score higher than 1.

In which case, it seems that assigning a higher score (especially a very high score like 1000) could be unnecessarily risky (albeit still not really very risky at all).

The only reason I could think of to have a higher score would be to:
- not occupy the TPM with unnecessary GetRandom requests, that could get in the way of other activities
- not keep the TPM awake? not sure how the power management works, or how much it consumes

This seems like it's just a power/performance vs security tradeoff - I'll try and get some more data, I suspect any change is unlikely to have much of a noticeable effect though?
> The only reason I could think of to have a higher score would be to:
> - not occupy the TPM with unnecessary GetRandom requests, that could get in the way of other activities
> - not keep the TPM awake? not sure how the power management works, or how much it consumes

Yep, those reasons may be important, though.

Iiuc what quality is, with a 100% quality we need 1 byte read from HWRNG for every byte requested through /dev/[u]random. [Confirmed by RNG sources: Reading each byte takes 1 << (3+3) = 64 from the entropy count. With hwrng quality set to N, each GetRandom(32 bytes) request to TPM is treated as adding (32 * N * 8) >> 10 << 3 = 2*N to the entropy count. So, we need to ask for GetRandom for every N/32 bytes requested from /dev/[u]random.]

From my experiments, getting a response to GetRandom(32 bytes) from the TPM takes 10-20ms. So, we can have 50-100 GetRandom ops per second. If nothing else is happening in the system (which is not true, of course - entropy is refilled through interrupts and such as well), once the randoms are requested at a rate higher than ~3N bytes/second, they'd have to wait for TPM. 

If an app needs to generate a 2048-bit random key using /dev/random, with N=1, it'd have to wait for 256*32 GetRandom operations, or 80 seconds (in the absence of other entropy sources, of course).  It would be interesting to see how it actually works on a real-life system, where entropy also comes from interrupts, disk access etc.

In my experiments a couple months ago, I saw that doing "dd if=/dev/random of=/dev/null" brings hwrng thread CPU usage to almost 10%, but I didn't measure the number of GetRandom requests out of it. May be the entropy provided by other sources during normal operation is more than enough for TPM to never or seldom be bothered even for frequent /dev/random use.

The 2nd point about power is also related. If TPM is idle (no requests to it, as is normally most of the time), it goes to sleep after 2 seconds reducing the power budget. 

I agree that we shouldn't set the score higher than the actual quality. But I would set it as high as it actually is.

Right, agree that a score of 1 is not good if we are trying to provide lots of high quality random numbers, and agree that if we're confident in our quality score we should use it :)

I guess I was mainly thinking that in situations where we don't have a quality score, 1 is a better default than 1000 for the original goal of avoiding boot delays.

I'll keep looking into a score for cr50, I'm not terribly sure how to proceed for eg Infineon chips though. Let's discuss over VC?


Project Member

Comment 10 by bugdroid1@chromium.org, Jun 8 2018

Labels: merge-merged-chromeos-4.14
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6ea9c796a3537ab2c220405c31776c58fe22bd1a

commit 6ea9c796a3537ab2c220405c31776c58fe22bd1a
Author: Louis Collard <louiscollard@chromium.org>
Date: Fri Jun 08 06:33:29 2018

CHROMIUM: tpm: cr50_i2c: Set hwrng quality.

This sets a quality score for the hwrng, causing it to be used
to populate the kernel's entropy pool.

The score was decided after discussion with the designer of the
rng, and is based on results from running the NIST SP800-90B
Entropy Assessment suite.

BUG= chromium:827682 
TEST=build

Change-Id: I28a30f2be8466b59a0987c345d1ec7e93dbeb15d
Reviewed-on: https://chromium-review.googlesource.com/1082180
Commit-Ready: Louis Collard <louiscollard@chromium.org>
Tested-by: Louis Collard <louiscollard@chromium.org>
Reviewed-by: Andrey Pronin <apronin@chromium.org>

[modify] https://crrev.com/6ea9c796a3537ab2c220405c31776c58fe22bd1a/drivers/char/tpm/cr50_i2c.c

Project Member

Comment 11 by bugdroid1@chromium.org, Jul 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/b8dec3da94bcda91bcf2d6a1617dcb88c097baad

commit b8dec3da94bcda91bcf2d6a1617dcb88c097baad
Author: Louis Collard <louiscollard@chromium.org>
Date: Fri Jul 13 18:50:47 2018

FROMGIT: tpm: Allow tpm_tis drivers to set hwrng quality.

Adds plumbing required for drivers based on tpm_tis to set hwrng quality.

Signed-off-by: Louis Collard <louiscollard@chromium.org>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

(cherry picked from commit 41577d0f4c3eefea989b79d331321f625a066bf8 http://git.infradead.org/users/jjs/linux-tpmdd.git master)

BUG= chromium:827682 
TEST=build

Change-Id: Ia9260445c5c2eb5b53ad0c1cbfbef4b575b4f3d5
Reviewed-on: https://chromium-review.googlesource.com/1134626
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Louis Collard <louiscollard@chromium.org>
Reviewed-by: Andrey Pronin <apronin@chromium.org>

[modify] https://crrev.com/b8dec3da94bcda91bcf2d6a1617dcb88c097baad/drivers/char/tpm/tpm_tis_core.h
[modify] https://crrev.com/b8dec3da94bcda91bcf2d6a1617dcb88c097baad/drivers/char/tpm/tpm_tis_core.c

Project Member

Comment 12 by bugdroid1@chromium.org, Jul 16

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/34c65f03425befc95ce7dc7393b7740cedbc20bc

commit 34c65f03425befc95ce7dc7393b7740cedbc20bc
Author: Louis Collard <louiscollard@chromium.org>
Date: Mon Jul 16 04:49:12 2018

CHROMIUM: tpm: cr50_spi: Set hwrng quality.

This sets a quality score for the hwrng, causing it to be used
to populate the kernel's entropy pool.

The score was decided after discussion with the designer of the
rng, and is based on results from running the NIST SP800-90B
Entropy Assessment suite.

BUG= chromium:827682 
TEST=build

Signed-off-by: Louis Collard <louiscollard@chromium.org>
Change-Id: Iaafc3a02c4e29708fa7f368785ca4f8db43c2339
Reviewed-on: https://chromium-review.googlesource.com/1090595
Reviewed-by: Andrey Pronin <apronin@chromium.org>

[modify] https://crrev.com/34c65f03425befc95ce7dc7393b7740cedbc20bc/drivers/char/tpm/cr50_spi.c

Labels: Pri-1
Release blockers should be P1 or P0
Louis, btw, isn't it fixed already? I know we haven't landed the ability to override hwrng quality (https://crrev.com/c/1082177), but for the purposes of this bug it's not needed: for cr50_spi/i2c the default tpm-rng quality has made it to chromeos-4.14 through upstream in comments 10-12 above. And those CLs are already in M69, so just mark Fixed?
Status: Fixed (was: Started)
Sorry yea I was leaving it as we still didn't get an solution accepted upstream for non-cr50 boards. I need to try and revive the discussion. 

Perhaps that is not important though - I guess any old boards will be a running an older kernel with the original patch, and any boards with newer kernel will have cr50.
Re #15: good point re boards with non-cr50 devices. We don't have those on 4.14 as of now, but we may "uprev" kernel for some tpm 1.2 devices in the future. So, let's create a separate bug for that support, but definitely not as RBS for M69.
Status: Assigned (was: Fixed)
Was a new bug created?  Can you please link it from here?  Moving this to Assigned pending a new bug being created.
Status: Fixed (was: Assigned)
Sorry for the delay, created crbug.com/871793

Sign in to add a comment