
Issue 762334

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Sep 1
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Provision is breaking the TPM on stumpy and lumpy

Reported by jrbarnette@chromium.org, Sep 5 2017

Issue description

A significant number of stumpy and lumpy devices are
failing repair, starting from around 9/1.

Here's a sample history of how the trouble starts:
chromeos4-row2-rack8-host19
    2017-09-01 23:44:47  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row2-rack8-host19/2320304-repair/
    2017-09-01 23:33:10  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row2-rack8-host19/2320295-provision/
    2017-09-01 23:32:31  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row2-rack8-host19/2320291-verify/

That shows a working DUT at 23:32:31, followed by a failed provision
task, followed by repair failing.  Both provision and repair fail
because of the 'tpm' verifier:
	FAIL	----	verify.tpm	timestamp=1504334665	localtime=Sep 01 23:44:25	Cannot load the TPM SRK public key

Basically, the TPM has gotten into a bad state that prevents certain
tests from passing.
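For context, the 'tpm' verifier's check can be sketched roughly as below. This is a hypothetical simplification, not the actual autotest verifier code; it only mirrors the fields visible later in the repair logs (`can_load_srk_pubkey`, `owned`, `last_error` from `cryptohome --action=status`):

```python
import json

def tpm_srk_ok(status_json):
    """Return True if the TPM reports a loadable SRK public key.

    status_json: JSON text as printed by `cryptohome --action=status`.
    Hypothetical helper for illustration; the real verifier runs the
    command over ssh on the DUT and fails with
    "Cannot load the TPM SRK public key" when this is false.
    """
    status = json.loads(status_json)
    tpm = status.get("tpm", {})
    return bool(tpm.get("can_load_srk_pubkey"))

# Sample matching the failing DUTs' status output: owned, SRK unloadable,
# last_error 2051 (0x803 = TPM_DEFEND_LOCK_RUNNING).
sample = ('{"tpm": {"can_load_srk_pubkey": false, '
          '"last_error": 2051, "owned": true}}')
```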

Seven lumpy DUTs are exhibiting the symptom: five in the bvt pool, and
two in the cq pool.

Three stumpy DUTs show the failure: two in the bvt pool, and one in the
cq pool.

All of the failures happened on 9/1 and 9/2; there don't seem to be any
more failures since.  I don't know exactly why.  However, if there are
more failures, we'll run out of DUTs for testing.  That will turn the CQ
and the PFQ red, and generally gum up the works.  Marking P1 for now, but
this could go to P0 very easily.

 
Cc: dlaurie@chromium.org
Owner: tbroch@chromium.org
Cc: apronin@chromium.org
Blocking: 762393
Another incarnation of issue 692134?
Discussed in more detail in issue 652565
Actually, in the provision log from the bug description, I see
Tspi_Key_CreateKey - TPM_E_DEFEND_LOCK_RUNNING
So, the boards are currently in the locked state - tpm commands are being rejected.
And in the before_repair logs from the 1st Repair job, I already see:
1)
2017-09-02T06:43:23.439416+00:00 WARNING cryptohomed[948]: Taking ownership was interrupted, continuing.

which means that the initialization was indeed interrupted (tpm is owned, but the .tpm_owned file is not there), which is likely caused by a too-quick reboot during provision (as discussed in the issues from comments 3-4).

2)
2017-09-02T06:43:23.787944+00:00 WARNING cryptohomed[948]: GetDictionaryAttackInfo: Counter mismatch: 17 vs #022
2017-09-02T06:43:23.788342+00:00 INFO cryptohomed[948]: Configuring TPM, ownership taken: 0.
2017-09-02T06:43:23.788438+00:00 INFO cryptohomed[948]: Attestation: Preparing for enrollment...
2017-09-02T06:43:23.788485+00:00 ERR cryptohomed[948]: ConnectContextAsOwner requires an owner password
2017-09-02T06:43:23.788510+00:00 ERR cryptohomed[948]: GetEndorsementPublicKey: Could not connect to the TPM.
2017-09-02T06:43:23.788534+00:00 ERR cryptohomed[948]: Attestation: Failed to get EK public key.

which indicates that we are continuously trying to use an incorrect owner password; the DA counter goes up (and, as it seems, eventually results in a locked state). May also be the result of an interrupted initialization that for some reason can't complete.
BTW, do we clear the stateful partition w/o resetting the tpm when doing Repair?

That can also lead to such outcomes. If initialization was interrupted, the tpm might be owned with a non-well-known password (since ownership was not reset), but if the stateful is cleared, the .tpm_state file with this temporarily stored password is lost. And there is no .tpm_owned file to indicate that the initialization is complete, so we treat it as an interrupted initialization (there are also SRK-related steps, and later install-attributes and enrollment-related steps that need the owner password), attempt to continue, and fail since we don't know the owner password.

After the 1st re-initialization attempt, the tpm should be marked as initialized (even if using the well-known password fails). So, we shouldn't get into a DA lockout. But the SRK will remain uninitialized, so we'll get "Cannot load the TPM SRK public key", and decide to repair.

If the stateful is cleared but the tpm ownership is not reset, we'll get into the same re-initialization attempt, and eventually the DA counter will reach the limit, and the tpm will be locked.
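The failure mode in the last two paragraphs can be sketched as a toy model. All names and the lockout threshold are illustrative assumptions, not real cryptohomed code (the thread later says the counter locks once it is "over 10"):

```python
DA_LOCKOUT_THRESHOLD = 10  # illustrative; actual TPM 1.2 limits vary by vendor

def provision_cycle(da_counter, owner_password_known):
    """Model one provision/re-init attempt against an already-owned TPM.

    If the owner password is unknown (stateful wiped, ownership not
    reset), the failed owner auth bumps the dictionary-attack counter.
    Hypothetical model of the failure mode described above.
    """
    if not owner_password_known:
        da_counter += 1  # one failed owner auth per attempt
    locked = da_counter >= DA_LOCKOUT_THRESHOLD
    return da_counter, locked

# Repeated re-initialization with a lost owner password eventually
# drives the TPM into the defend-lock state.
da, locked, cycles = 0, False, 0
while not locked:
    da, locked = provision_cycle(da, owner_password_known=False)
    cycles += 1
```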
> BTW, do we clear the stateful partition w/o resetting the tpm when doing Repair?

On stumpy and lumpy, possibly yes.  Repair for the TPM error
includes powerwash, which clears the stateful partition.
Powerwash invokes "crossystem clear_tpm_owner_request=1", which
will clear the TPM on reboot.  However, some older legacy platforms
don't support that feature; it's possible that stumpy and lumpy are
among those legacy platforms.
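A defensive sketch of that powerwash step follows. It is hypothetical, not the actual powerwash code, and it assumes only that `crossystem` reports failures via its exit status; note the caveat from this thread that on legacy platforms the request may be accepted yet have no effect, so the return value alone is not proof the TPM will be cleared:

```python
import subprocess

def request_tpm_clear(run_cmd=None):
    """Ask the firmware to clear TPM ownership on the next reboot.

    run_cmd takes an argv list and returns the command's exit status;
    by default it invokes the real crossystem binary. Callers should
    re-verify TPM state after reboot rather than trust this result,
    since legacy platforms (possibly stumpy/lumpy) ignore the request.
    """
    if run_cmd is None:
        run_cmd = lambda argv: subprocess.call(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return run_cmd(["crossystem", "clear_tpm_owner_request=1"]) == 0
```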

There is also https://crbug.com/652565#c89 (which is now line 1001, not line 940).
> There is also https://crbug.com/652565#c89 (which is now line 1001, not line 940).

Oh yeah, that code.  My guess is that stumpy and lumpy don't
support "clear_tpm_owner_request", which would explain why all
the code we've built up to avoid this kind of failure didn't
work for them.

Labels: -Pri-0 Pri-1
This was meant to be P1, not P0.  So far, we have no actual builder failures
from this, but there's still no guarantee it can't escalate.

Btw, powerwash is only called in Repair, not in normal Provision, right?

Not sure if I'm looking at the right classes, but here's what I see:
0) cros_host.machine_install() calls updater.run_update() both from its full update and stateful-only paths. And updater = autoupdater.ChromiumOSUpdater(...)
1) autoupdater.run_update() calls self.update_stateful(), so clobber=True by default there. That triggers calling stateful_update script with '--stateful_change=clean'.
2) That leads to:
  -) having "clobber" in ".update_available"-> 
  -) since we are running in dev mode (right?), wiping the stateful in dev_update_stateful_partition() called from chromeos_startup.

We have just a wipe, no tpm reset. And the preserved files don't include .tpm_owned and .tpm_state, which are directly in the root of /mnt/stateful_encrypted. So, if not for that "crossystem clear_tpm_owner_request=1" on line 1001, after Provision we'd have ended up in the state described in comment #8. I guess we can remove the comment about papering over an inexplicable problem from this line.
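To make that point concrete, a minimal sketch; the preserve-list entry here is an illustrative placeholder, not the real clobber list (which lives in chromeos_startup / the clobber logic):

```python
# Hypothetical stand-in for the stateful clobber preserve list.
PRESERVED_ON_CLOBBER = {
    "unencrypted/preserve/log",  # example entry only
}

# The TPM status markers named in this thread, at the stateful root.
TPM_STATUS_FILES = {".tpm_owned", ".tpm_state"}

# Neither marker is preserved, so a stateful wipe without
# clear_tpm_owner_request leaves the TPM owned while forgetting both
# the "initialization complete" marker and the temporarily stored
# owner password.
lost = TPM_STATUS_FILES - PRESERVED_ON_CLOBBER
```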
 
If clear_tpm_owner_request is indeed not supported, are there alternative (legacy, board-specific) ways to trigger the owner reset from CLI?

And for a possible reason why it was seen on 9/1 and 9/2 specifically. 

Recently, there was another fix for the tpm initialization state machine. Until then, we had legacy code that was setting .tpm_owned at startup, before cryptohomed started, if tpm was at all owned (w/o checking the owner auth). That tricked cryptohomed into thinking that the tpm initialization was complete even if it was in fact interrupted.

So, cryptohomed didn't attempt to finalize it, didn't try using the owner password (that is supposed to be available on the 1st boot), didn't cause DA failures; just assumed that the board was fully ready.

That worked fine for the cases when the initialization was indeed done (since we don't have or use the owner password after the 1st boot). Maybe some optional or advanced tasks, like periodic DA counter reset or attestation, were failing (depending on whether provision-wipe preserves the delegates for the old owner auth and all the attestation data - didn't check).
Blocking: -762393
Cc: englab-sys-cros@google.com
We are seeing TPM issues on the following DUTs. Please let me know if this bug is addressing these issues, otherwise I will raise another bug.

chromeos6-row2-rack7-host20   Cannot load the TPM SRK public key
chromeos6-row2-rack7-host18   Cannot load the TPM SRK public key
chromeos6-row2-rack8-host11   Cannot load the TPM SRK public key
chromeos6-row2-rack8-host13   Cannot load the TPM SRK public key
chromeos6-row2-rack8-host8    Cannot load the TPM SRK public key
chromeos6-row2-rack8-host2    Cannot load the TPM SRK public key
chromeos6-row2-rack8-host6    Cannot load the TPM SRK public key
chromeos6-row2-rack8-host4    Cannot load the TPM SRK public key
chromeos4-row2-rack8-host19   Cannot load the TPM SRK public key
chromeos4-row2-rack8-host15   Cannot load the TPM SRK public key
Cc: tbroch@chromium.org
Owner: apronin@chromium.org
Looks like crbug.com/692134 has discussion of that error, so it might be a better place for this.

Going to send this bug Andrey's way since he's been doing all the work to resolve it.
I'm not sure it's the same bug. How do I get the logs from the affected DUTs?
At least for one of the hosts (chromeos6-row2-rack8-host13), I see that the tpm is returning error 0x803 - both when we repair it and pre-repair (relevant parts of the logs below). 0x803 = TPM_DEFEND_LOCK_RUNNING.
The TPM is owned, but looks like we don't have the right owner password. If stumpy/lumpy indeed don't support crossystem clear_tpm_owner_request=1 (see comments #9 and #11), and thus wiping the stateful partition is done without clearing the tpm, that'll explain it.

Repair (chromeos6-row2-rack8-host13/60529462-repair/debug/autoserv.DEBUG):
04/12 06:31:25.578 DEBUG|          ssh_host:0284| Running (ssh) 'cryptohome --action=status'
[...]
04/12 06:31:25.968 DEBUG|        base_utils:0280| [stdout]       "can_load_srk_pubkey": false,
[...]
04/12 06:31:25.969 DEBUG|        base_utils:0280| [stdout]       "last_error": 2051,
04/12 06:31:25.970 DEBUG|        base_utils:0280| [stdout]       "owned": true
[...]
AutoservVerifyError: Cannot load the TPM SRK public key

Pre-repair (chromeos6-row2-rack8-host13/60529462-repair/chromeos6-row2-rack8-host13/before_repair/log/messages):

2017-04-12T01:26:18.154053+00:00 INFO kernel: [   10.163925] tpm_tis tpm_tis: command 0x21 (size 59) returned code 0x803
2017-04-12T01:26:18.195268+00:00 INFO cryptohomed[1052]: TPM error 0x803 (TPM is defending against dictionary attacks and is in some time-out period): WrapRsaKey: Cannot load SRK pub key
2017-04-12T01:26:18.195294+00:00 ERR cryptohomed[1052]: Couldn't wrap cryptohome key
I'll keep commenting here since https://crbug.com/692134 was a general bug, which seems to be fixed for boards that have "crossystem clear_tpm_owner_request" working (assuming it is called in Provision), but the lumpy/stumpy case is special.

Looks like indeed for stumpy and lumpy the only way to clear the tpm owner is switching to dev mode and back. Here is the bug that was opened to fix that by adding crossystem clear_tpm_owner_request: https://code.google.com/archive/p/chromium-os/issues/31974
And the fix itself: https://crrev.com/c/25646
But looks like it came too late for stumpy & lumpy - we don't pick this fix into their firmware branches (in fact, I don't see firmware-stumpy or firmware-lumpy branches at all).

What happens now for each Provision: it will clear the tpm status file (see #13), we'll attempt to initialize the tpm again, see that it was already owned and the default well-known password doesn't work (DA counter ++ for this unsuccessful auth), and mark it as 'initialized'. Then we'll do attestation initialization that also needs the owner password (and may get DA counter += N there - will need to check). Everything should be fine (to be checked) if that happens once in a while and after that the tpm is marked initialized. The DA counter is decremented with time. But if we keep trying to repair in a loop, with each attempt the counter goes up at least by 1, eventually goes over 10, and we get into a DA lockout state.
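A toy model of that counter arithmetic, with illustrative constants (the real DA decay schedule is TPM- and vendor-specific), showing why tight repair loops lock the TPM while spaced-out attempts do not:

```python
DA_LIMIT = 10        # per the comment above: lockout once "over 10"
DECAY_PER_HOUR = 1   # hypothetical decay rate, for illustration only

def simulate(attempt_interval_hours, attempts):
    """Model the DA counter under repeated failed-auth provision attempts.

    Each attempt adds at least one failed owner auth; the counter
    decays during idle time between attempts. Returns "locked" if the
    counter ever exceeds DA_LIMIT, else "ok". A hypothetical model,
    not TPM or cryptohomed code.
    """
    counter = 0
    for _ in range(attempts):
        counter = max(0, counter - DECAY_PER_HOUR * attempt_interval_hours)
        counter += 1
        if counter > DA_LIMIT:
            return "locked"
    return "ok"
```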

Then we get into a repair loop because of that (since it can't load the SRK), and we'll stay in this loop until we reset the tpm owner (and the DA counter with it) by switching between dev and normal modes, or just let the DUT sit idle long enough for its DA counter to drop sufficiently below the threshold.

Similarly, if the tpm _is_ somehow cleared but the tpm initialization is interrupted before it is complete, and we do powerwash before completing the initialization, we've owned the tpm but lost the owner password before being able to unrestrict the SRK. The device will never become ready (since loading the SRK will always fail); this leads to the same repair loop until the tpm is cleared. And for lumpy/stumpy the current Repair task doesn't clear the tpm. Also, the delegates in the attestation db

Bottomline:

1) To guarantee repairing stumpy/lumpy specifically, 
  a) Repair needs to initiate dev-normal-dev mode switching. Otherwise, it may not repair it in case of issues with tpm initialization. 
Additionally, it may make sense to
  b) When repairing wait until tpm initialization is complete.
  c) Limit the number of consecutive repair loops / ensure delays between repairs to minimize the chances of TPM_DEFEND_LOCK_RUNNING

2) I will go through the cryptohomed code to 
  a) make sure we don't try to call commands that require owner auth more than once after powerwash if the owner password is unknown / owner delegates are not available. At least in some places (periodic DA counter reset) we do check for the presence of the delegate before trying to use the command.
  b) check the logic behind preserving/clearing the owner delegate in the attestation db - it may work incorrectly if the owner was cleared but tpm initialization was interrupted.

But #2 doesn't 100% guarantee we never see repair loops due to TPM_DEFEND_LOCK_RUNNING. If we get into a repair loop (e.g. 10+ consecutive {repair+unsuccessful verify} without delays between them) for unrelated reasons, eventually it can turn into a repair loop due to "Cannot load the TPM SRK public key" because of TPM_DEFEND_LOCK_RUNNING. Only #1 repairs after that.
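Suggestion 1c above (limit consecutive repairs, enforce delays) could be sketched like this; the function, thresholds, and history representation are all hypothetical, not existing autotest code:

```python
import time

def should_attempt_repair(history, max_consecutive=3, min_gap_s=3600,
                          now=None):
    """Gate repair attempts to give the TPM's DA counter time to decay.

    history: timestamps of recent consecutive failed repair attempts,
    newest last (cleared on a successful verify). Refuses to repair if
    we've already failed max_consecutive times in a row, or if the last
    attempt was under min_gap_s ago. Hypothetical helper; thresholds
    are illustrative.
    """
    now = time.time() if now is None else now
    if len(history) >= max_consecutive:
        return False  # stop the loop; escalate instead of hammering the TPM
    if history and now - history[-1] < min_gap_s:
        return False  # too soon; let the DA counter decay
    return True
```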
Cc: -apronin@chromium.org akes...@chromium.org
akeshet@, who's the right person to work with to address #1 from comment #21 on this bug?
Cc: jrbarnette@chromium.org
jrbarnette@ (OOO until Wednesday) is a good choice
OK.  We can (by hook or by crook) get all of the failing DUTs to
do the dev mode change as described.  But that's unlikely to be
scalable if the problem recurs.

So:  Why did this happen in the first place, and how do we know it
won't happen again?

Cc: nxia@chromium.org
Issue 774317 has been merged into this issue.
OK.  Regarding fixing the TPM by toggling dev mode:  The procedure
was requested in b/67752755.  Yesterday, seven DUTs were toggled.
All of them still have TPM problems.

We need the following:
  * A working procedure for repairing the DUTs.
  * An explanation of the root cause.
  * A fix to prevent this from recurring.

We still have issues related to this:

See b/70568312

Comment 28 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
Status: WontFix (was: Assigned)
stumpy and lumpy, which didn't support clearing the tpm owner, reached EOL.
So, closing the bug as WontFix/Obsolete. LMK if something is still needed here.
