New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 864282 link

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Oct 17
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug


Participants' hotlists:
Hotlist-Tast


Sign in to add a comment

cryptohomed broken on some DUTs due to interrupted TPM initialization

Project Member Reported by derat@chromium.org, Jul 17

Issue description

The ui.ChromeLogin and ui.ChromeCrashLoggedIn Tast tests both timed out in the rikku-release build at http://stainless/browse/chromeos-autotest-results/217503455-chromeos-test/ while waiting for user directories to be mounted:

2018/07/15 08:43:56 Started test ui.ChromeCrashLoggedIn
2018/07/15 08:43:56 [08:43:56.060] Restarting ui job
2018/07/15 08:43:57 [08:43:56.542] Waiting for org.chromium.SessionManager D-Bus service
2018/07/15 08:43:57 [08:43:56.557] Asking session_manager to enable Chrome testing
2018/07/15 08:43:57 [08:43:56.560] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort
2018/07/15 08:43:58 [08:43:57.273] Removing cryptohome for testuser@gmail.com
2018/07/15 08:43:58 [08:43:57.319] Finding OOBE DevTools target
2018/07/15 08:43:58 [08:43:57.643] Connecting to Chrome at ws://127.0.0.1:35071/devtools/page/9C5F22C8CDE07F9123CC40EF1D9F6368
2018/07/15 08:43:58 [08:43:57.736] Waiting for OOBE
2018/07/15 08:44:00 [08:44:00.014] Logging in as user "testuser@gmail.com"
2018/07/15 08:44:00 [08:44:00.077] Waiting for cryptohome /home/user/c37077ca708067994b028ceecfa10fc34571a404
2018/07/15 08:44:56 [08:44:56.065] Error at chrome_crash_logged_in.go:24: Chrome login failed: /home/user/c37077ca708067994b028ceecfa10fc34571a404 not mounted: context deadline exceeded

2018/07/15 08:44:58 Started test ui.ChromeLogin
2018/07/15 08:44:58 [08:44:58.039] Restarting ui job
2018/07/15 08:44:59 [08:44:58.485] Waiting for org.chromium.SessionManager D-Bus service
2018/07/15 08:44:59 [08:44:58.502] Asking session_manager to enable Chrome testing
2018/07/15 08:44:59 [08:44:58.504] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort
2018/07/15 08:45:00 [08:44:59.257] Removing cryptohome for testuser@gmail.com
2018/07/15 08:45:00 [08:44:59.299] Finding OOBE DevTools target
2018/07/15 08:45:00 [08:44:59.640] Connecting to Chrome at ws://127.0.0.1:49143/devtools/page/17771F1A6A3DAD8D1648B7697C3537F9
2018/07/15 08:45:00 [08:44:59.697] Waiting for OOBE
2018/07/15 08:45:03 [08:45:02.114] Logging in as user "testuser@gmail.com"
2018/07/15 08:45:03 [08:45:02.221] Waiting for cryptohome /home/user/c37077ca708067994b028ceecfa10fc34571a404
2018/07/15 08:45:58 [08:45:58.043] Error at chrome_login.go:69: Chrome login failed: /home/user/c37077ca708067994b028ceecfa10fc34571a404 not mounted: context deadline exceeded

In both cases, I see suspicious-seeming cryptohomed errors in the messages file:

2018-07-15T08:44:00.068466-07:00 INFO cryptohomed[800]: HasAuthorization: TPM Owner password not available.
2018-07-15T08:44:00.068487-07:00 ERR cryptohomed[800]: Destroy() called with insufficient authorization.
2018-07-15T08:44:00.389144-07:00 INFO session_manager[5478]: [INFO:server_backed_state_key_generator.cc(126)] Stable device secret missing!

2018-07-15T08:45:02.210595-07:00 INFO cryptohomed[800]: HasAuthorization: TPM Owner password not available.
2018-07-15T08:45:02.210615-07:00 ERR cryptohomed[800]: Destroy() called with insufficient authorization.
2018-07-15T08:45:02.516675-07:00 INFO session_manager[6207]: [INFO:server_backed_state_key_generator.cc(126)] Stable device secret missing!

In a different successful ui.ChromeLogin run on a rikku DUT, I see messages like these from cryptohomed instead:

2018-07-16T08:11:58.112420-07:00 INFO cryptohomed[835]: Migrated (or created) user directory: /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount
2018-07-16T08:11:58.113796-07:00 INFO cryptohomed[835]: Creating pass-through directories /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/Cache
2018-07-16T08:11:58.114023-07:00 INFO cryptohomed[835]: Creating pass-through directories /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/Downloads
2018-07-16T08:11:58.114219-07:00 INFO cryptohomed[835]: Creating pass-through directories /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/GCache
2018-07-16T08:11:58.115666-07:00 INFO cryptohomed[835]: Creating pass-through directories /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/GCache/v1
2018-07-16T08:11:58.116268-07:00 INFO cryptohomed[835]: Creating pass-through directories /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/GCache/v2
2018-07-16T08:11:58.116839-07:00 INFO cryptohomed[835]: Creating pass-through directories /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/GCache/v1/blobs
2018-07-16T08:11:58.117467-07:00 INFO cryptohomed[835]: Creating pass-through directories /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/GCache/v1/tmp
2018-07-16T08:11:58.118710-07:00 INFO cryptohomed[835]: RecursiveCopy: /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/log
2018-07-16T08:11:58.119253-07:00 INFO cryptohomed[835]: RecursiveCopy: /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/.ssh
2018-07-16T08:11:58.119770-07:00 INFO cryptohomed[835]: RecursiveCopy: /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/.pki
2018-07-16T08:11:58.120265-07:00 INFO cryptohomed[835]: RecursiveCopy: /home/.shadow/a62b9c49d923f441cf2a9e6e872563f83c68f682/mount/user/.pki/nssdb

Presumably something on the DUT was in bad shape when these tests ran. Maybe an earlier test left them in a weird state.

cryptohome owners, any guesses about what might be happening here and how other tests can work around it?
 
Are full logs available?
"Destroy() called with insufficient authorization." is normally an indication that we lost some on-disk state and are trying to go through tpm initialization, but the tpm is already owned. E.g. if the disk or some state files were wiped w/o clearing the tpm owner.
Sample bug with similar symptoms: http://b/79376410
Looks like this is /var/log/messages from the device.
messages.txt
773 KB View Download
There is no encstateful key space defined in nvram (and no signs of communication errors from the driver + no encstateful key on disk, which would appear in case of tpm initialization interrupted before it is written to tpm). And there are no other objects that are typically persisted after tpm initialization - specifically, no salting key. Still the TPM is owned, what leads to the initialization errors.

[mount-encrypted didn't find encstateful key in TPM and wiped encstateful => on-disk tpm state, including owner password temporarily stored in case of interrupted initializaton, is lost]
2018-07-15T14:45:00.891803+00:00 NOTICE mount-encrypted[1562]: [pid:270] VFS mount state sanity check ok.
2018-07-15T14:45:00.891814+00:00 NOTICE mount-encrypted[1562]: [0715/144450:INFO:tpm.cc(304)] TPM ready
2018-07-15T14:45:00.891825+00:00 NOTICE mount-encrypted[1562]: [0715/144450:INFO:tpm.cc(459)] No encstateful NVRAM area defined.
2018-07-15T14:45:00.891835+00:00 NOTICE mount-encrypted[1562]: [0715/144450:ERROR:tpm1.cc(495)] encstateful space contents absent or too short.
2018-07-15T14:45:00.891845+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:tpm.cc(103)] NVRAM area has been defined but not written.
2018-07-15T14:45:00.891855+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:tpm.cc(103)] NVRAM area has been defined but not written.
2018-07-15T14:45:00.891865+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:tpm.cc(437)] No Lockbox NVRAM area defined.
2018-07-15T14:45:00.891875+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:encryption_key.cc(169)] Using NVRAM as system key; finalization needed.
2018-07-15T14:45:00.891888+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:encryption_key.cc(233)] Attempting to generate fresh NVRAM system key.
2018-07-15T14:45:00.891898+00:00 NOTICE mount-encrypted[1562]: [0715/144451:ERROR:tpm1.cc(495)] encstateful space contents absent or too short.
2018-07-15T14:45:00.891908+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:encryption_key.cc(267)] No usable system key found.
2018-07-15T14:45:00.891918+00:00 NOTICE mount-encrypted[1562]: [0715/144451:ERROR:encryption_key.cc(51)] Data read failed from /mnt/stateful_partition/encrypted.needs-finalization
2018-07-15T14:45:00.891928+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:encryption_key.cc(280)] Generating new encryption key.

[No salting key in TPM nvram]
2018-07-15T14:44:54.318434+00:00 INFO cryptohomed[895]: TPM error 0x2020 (Key not found in persistent storage): LoadKeyByUuid: failed LoadKeyByUUID
2018-07-15T14:44:54.318513+00:00 WARNING cryptohomed[895]: Canceled creating cryptohome key - TPM is not ready.

...

[InstallAttributes are not ready]
2018-07-15T14:44:55.537465+00:00 INFO cryptohomed[895]: Load() called prior to a successful Store()
2018-07-15T14:44:55.537492+00:00 INFO cryptohomed[895]: Resuming interrupted InstallAttributes. (Store needed.)

...

[Attempt to delete FWMP fails]
2018-07-15T14:46:28.358027+00:00 INFO cryptohomed[895]: HasAuthorization: TPM Owner password not available.
2018-07-15T14:46:28.358058+00:00 ERR cryptohomed[895]: Destroy() called with insufficient authorization.

...

[Attempt to own with new password fails]
2018-07-15T15:01:37.142438+00:00 INFO kernel: [ 1014.247697] tpm_tis tpm_tis: command 0x66 (size 55) returned code 0x1 [GetCapabilityOwner -> AUTHFAIL]
2018-07-15T15:01:37.172403+00:00 INFO kernel: [ 1014.277667] tpm_tis tpm_tis: command 0xba (size 18) returned code 0x22
2018-07-15T15:01:37.178295+00:00 INFO cryptohomed[895]: Configuring TPM, ownership taken: 0.
2018-07-15T15:01:39.703898+00:00 INFO cryptohomed[895]: HasAuthorization: TPM Owner password not available.
err correction, it's TPM 1.2, so, of course, no salting key there. :) what it can't find is the legacy cryptohome key in that failed LoadKeyByUuid... so, the only part that's definitely missing from TPM is the encstateful nvram space.
Thanks for digging into this, Andrey!

Will having a missing encstateful key prevent logins until the DUT is reimaged (or even longer)?

And if so, is there any way for tests to detect and recover from this state? Just guessing: Will GetStatusString report tpm.owned as true but tpm.has_cryptohome_key as false, or something similar?
Project Member

Comment 6 by bugdroid1@chromium.org, Jul 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/2c85a6582b4d92d09171722f2a6f09b2faf2117f

commit 2c85a6582b4d92d09171722f2a6f09b2faf2117f
Author: Daniel Erat <derat@chromium.org>
Date: Wed Jul 18 11:08:10 2018

tast-tests: Make cryptohome.WaitForUserMount log status.

Make cryptohome.WaitForUserMount log the output of
"cryptohome --action=status" if the user dir fails to show
up. This is intended to aid in debugging a TPM issue that's
making some tests fail.

Also update cryptohome.RemoveUserDir to honor the passed
context's deadline.

BUG= chromium:864282 
TEST=manual: used short timeout to force failure and
     verified that status is logged

Change-Id: Ifdd2e65bb7166e5b620b8e74423bdd70a68fe9e8
Reviewed-on: https://chromium-review.googlesource.com/1141334
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/2c85a6582b4d92d09171722f2a6f09b2faf2117f/src/chromiumos/tast/local/cryptohome/cryptohome.go

Andrey, here's the cryptohome status from a failed run at http://stainless/browse/chromeos-autotest-results/219248229-chromeos-test/ :

2018/07/20 21:45:25 Started test ui.ChromeCrashLoggedIn
2018/07/20 21:45:25 [21:45:25.000] Restarting ui job
2018/07/20 21:45:26 [21:45:25.498] Waiting for org.chromium.SessionManager D-Bus service
2018/07/20 21:45:26 [21:45:25.512] Asking session_manager to enable Chrome testing
2018/07/20 21:45:26 [21:45:25.513] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort
2018/07/20 21:45:26 [21:45:26.234] Removing cryptohome for testuser@gmail.com
2018/07/20 21:45:26 [21:45:26.273] Finding OOBE DevTools target
2018/07/20 21:45:27 [21:45:26.518] Connecting to Chrome at ws://127.0.0.1:58280/devtools/page/314BA59B593934F4505FE0CAA3588ED5
2018/07/20 21:45:27 [21:45:26.635] Waiting for OOBE
2018/07/20 21:45:29 [21:45:28.885] Logging in as user "testuser@gmail.com"
2018/07/20 21:45:29 [21:45:28.950] Waiting for cryptohome /home/user/9003073363b3235a481f0a73b4dc4bbf6640a6c9
2018/07/20 21:46:22 [21:46:22.043] cryptohome status:
{
   "installattrs": {
      "first_install": false,
      "initialized": false,
      "invalid": false,
      "lockbox_index": 536870916,
      "lockbox_nvram_version": 2,
      "secure": false,
      "size": 0,
      "version": 1
   },
   "mounts": [  ],
   "tpm": {
      "being_owned": false,
      "can_connect": true,
      "can_decrypt": false,
      "can_encrypt": false,
      "can_load_srk": false,
      "can_load_srk_pubkey": false,
      "enabled": true,
      "has_context": true,
      "has_cryptohome_key": false,
      "has_key_handle": false,
      "last_error": 8224,
      "owned": false
   }
}
2018/07/20 21:46:22 [21:46:22.043] Error at chrome_crash_logged_in.go:24: Chrome login failed: /home/user/9003073363b3235a481f0a73b4dc4bbf6640a6c9 not mounted: context deadline exceeded

Please let me know if there's anything I can do from within the test code to detect the problem and repair the system.
Yes, looked at /var/log/messages as well (generally, messages is more informative for this kind of issues). The TPM is not owned at boot, and by the time the test runs yet (and the output of cryptohome --action=status in #7 tells the same).
Here's when the test calls attestation --action=status (note it's still on GMT TZ, so 7 hrs difference in datetimes):
2018-07-21T04:41:12.830257+00:00 NOTICE ag[2504]: autotest server[stack::__init__|_get_cryptohome_status|run] -> ssh_run(cryptohome --action=status)
...
2018-07-21T04:43:57.462046+00:00 NOTICE ag[10787]: autotest server[stack::__init__|_get_cryptohome_status|run] -> ssh_run(cryptohome --action=status)

And here's when the ownership is taken:
2018-07-21T04:46:27.442899+00:00 ERR cryptohomed[890]: Taking TPM ownership took 2650ms
and evetyually full initialization completes:
2018-07-21T04:46:35.426799+00:00 INFO cryptohomed[890]: Attestation: Prepared successfully (7270 ms).


Also, chrome crashes at this point:
2018-07-21T04:46:23.387505+00:00 WARNING crash_reporter[12357]: [user] Received crash notification for chrome[12270] sig 11, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-07-21T04:46:23.411299+00:00 INFO session_manager[12253]: [INFO:session_manager_service.cc(297)] Browser process 12270 exited with signal 11 (Segmentation fault)

And only after it restarts we start taking TPM ownership (note: Chrome needs to explicitly tell to start taking the ownership - and that happens either during OOBE or after the user logs in).

So, it's possible the test doesn't know how to handle the situation of an unowned TPM. If that should never happen, jobs like Repair should make sure that the TPM is left in the owned state before Provision & then tests are run.
If that's ok to see the test running on a freshly powerwashed/repaired device, then it should make sure that Chrome instructs cryptohomed to own the TPM and wait till the process is over [it's possible that Repair does reset the TPM owner but doesn't wipe the disk, so doesn't delete /home/chronos/.oobe_completed, for example, which prevents Chrome from doing OOBE on an unowned machine on start; or that the telemetry user session that autotest initiates doesn't do OOBE in any case]

Or, alternatively, check if the TPM is owned & initiate it if not through cryptohome CLI interface. And since there's more than one test, it should probably be a common step.

If having the TPM unowned at this point is unexpected, then that also deserves investigation.
how do we see the history of jobs on that DUT? would be good to check first why the TPM is unowned, and if that's expected.
And here's full 'messages' from #7 for this bug's reference.
messages
489 KB View Download
Cc: mnissler@chromium.org
Re #5: 

1) a proper recovery from unexpectedly losing stateful partition is going through powerwash, i.e. resetting the TPM owner + wiping out the disk. The reason: Chrome keeps some state in the lost stateful partition, so is left in the inconsistent state, which may lead to side effects (example: b/73351086, e.g. see comments 87 and 90 there). We don't want those side effects to affect test results a few days down the road, when nobody can trace it back to this point.

2) I don't know a reliable way to detect that situation for losing encstateful w/o resetting tpm owner. tpm.owned as true but tpm.has_cryptohome_key as false sounds like a good indicator, but can also happen while the tpm is still in the process of being initialized (owned can be true quite early on, bu the cryptohome key is created after that on the first access to it, i.e. when the user logs in). 

We could put something onto the stateful partition each time after we clear the TPM in test lab jobs.

Similarly, mount-encrypted could create a special file on encstateful, if the tpm was owned when it started, but there was no encstateful key, but that can happen in case of interrupted initialization - if we use that file in autotests, they would have to make sure they don't reboot after owning the tpm before the initialization is complete, or they might get this flag from mount-encrypted. That's a good idea in general: see bugs like issue 652565 and issue 692134 (and there were more iirc).

There are also older devices that have no ability to clear TPMs at all, but they will soon be at AUE point, if not already. See  issue 762334 , for example.

I can't think of other scenarios when the tpm is owned at boot, and there's no encstateful key, but I may be wrong. +mnissler: I doubt we have it on the 'stateful preservation' path, but please correct me if needed. 
Cc: ihf@chromium.org jrbarnette@chromium.org achuith@chromium.org
Components: Infra>Client>ChromeOS>Test
Labels: -Pri-2 Pri-1
Owner: xixuan@chromium.org
Thanks for digging into the logs!

That Chrome segfault is expected and unrelated; a later test is sending SEGV to Chrome to verify crash reporting:

2018/07/20 21:46:23 [21:46:23.283] Sending SIGSEGV to root Chrome process 12270

These tests just want Chrome to be in a state where they can log in using Oobe.loginForTesting(). They don't care about the state of the TPM as long as they can log in and interact with Chrome.

The calls at 21:41:12 and 21:43:57 (in PDT) that you quoted in #8 must be from some earlier Autotest code; the Tast run didn't start until 21:45:25.

You can see the jobs that ran before this one at http://stainless/search?view=list&first_date=2018-07-20&last_date=2018-07-21&hostname=chromeos4-row3-rack5-host4&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false (sort by Started Time).

This DUT looks like it was in a broken state for a long time. At 2018-07-20 05:19:28 UTC, login_CryptohomeIncognito failed with "Unhandled LoginException: Failed to restart browser in guest mode (pid 19918)." There are more cryptohome-related failures in other Autotests that run after this one.

Going back even farther, I see a failure in a video_VideoReload.vp8 run at 2018-07-19 15:40:15 UTC (http://stainless/browse/chromeos-autotest-results/218735701-chromeos-test/): "Unhandled LoginException: Timed out going through login screen. Cryptohome not mounted. OOBE not dismissed." I didn't look back beyond that.

It looks to me like we need to do more to identify and repair DUTs with TPMs that are in a bad state before running tests on them. This seems like it's a cause of general flakiness in existing Autotest-based tests that need to log in.

Xixuan, I think that you worked on the TPMStatusVerifier class in server/hosts/cros_repair.py most recently. Do you have any ideas about how to detect and handle this state?
Cc: akes...@chromium.org
Cc: lgoo...@chromium.org dkrahn@chromium.org
> Xixuan, I think that you worked on the TPMStatusVerifier class in
> server/hosts/cros_repair.py most recently. Do you have any ideas
> about how to detect and handle this state?

Unfortunately, all xixuan@ did was to move the verifier code from an
old location to its current location.  I doubt she has any insights
into its behavior.

I've also worked with the code, but I don't know much more than xixuan@.
However, if I've understood the code correctly, it simply extracts state
from TPM using a query on cryptohome.  The state is reasonably easy to
parse, so if the exact settings that we consider to be suspect can be
described, there's a decent chance that just about anyone can enhance the
function to detect that condition.

For experts in the deeper mysteries of the TPM, we may have to turn to
dkrahn@ or lgoodby@.

Cc: xixuan@chromium.org
Owner: ----
Owner: lgoo...@chromium.org
Laurence, can you tell me which additional check(s) I can add to this code (in Autotest's TPMStatusVerifier class) to identify this case so we don't get unavoidable test failures?

        try:
            tpm = status['tpm']
            if not tpm['enabled']:
                raise hosts.AutoservVerifyError(
                        'TPM is not enabled -- Hardware is not working.')
            if not tpm['can_connect']:
                raise hosts.AutoservVerifyError(
                        ('TPM connect failed -- '
                         'last_error=%d.' % tpm['last_error']))
            if tpm['owned'] and not tpm['can_load_srk']:
                raise hosts.AutoservVerifyError(
                        'Cannot load the TPM SRK')
            if tpm['can_load_srk'] and not tpm['can_load_srk_pubkey']:
                raise hosts.AutoservVerifyError(
                        'Cannot load the TPM SRK public key')
        except KeyError:
            logging.info('Cannot determine the Crytohome valid status - '
                         'skipping check.')

Here's the status from a run on a then-broken rikku DUT (see #7):

   "tpm": {
      "being_owned": false,
      "can_connect": true,
      "can_decrypt": false,
      "can_encrypt": false,
      "can_load_srk": false,
      "can_load_srk_pubkey": false,
      "enabled": true,
      "has_context": true,
      "has_cryptohome_key": false,
      "has_key_handle": false,
      "last_error": 8224,
      "owned": false
   }

And here's what the same now-no-longer-broken DUT (chromeos4-row3-rack5-host4) reports now:

   "tpm": {
      "being_owned": false,
      "can_connect": true,
      "can_decrypt": true,
      "can_encrypt": true,
      "can_load_srk": true,
      "can_load_srk_pubkey": true,
      "enabled": true,
      "has_context": true,
      "has_cryptohome_key": true,
      "has_key_handle": true,
      "last_error": 0,
      "owned": true
   }

So there are a bunch of differences, but I don't know which (if any) indicate a problem vs. being expected due to the TPM not being owned in the former case (which I assume is expected at times on DUTs in the lab).
Re #16: Right, in the first case it's just a seemingly unowned device, which is fine. The problem is: if I read the mount-encrypted log from #3 right, TPM is actually owned. This line says it:
2018-07-15T14:45:00.891814+00:00 NOTICE mount-encrypted[1562]: [0715/144450:INFO:tpm.cc(304)] TPM ready

That log also says that a space in tpm nvram where encstateful key is saved was created, but nothing was written to it. Sounds like an interrupted TPM initialization.
2018-07-15T14:45:00.891845+00:00 NOTICE mount-encrypted[1562]: [0715/144451:INFO:tpm.cc(103)] NVRAM area has been defined but not written.

One idea:

If, when we are in this state, tpm.owned reported by cryptohome --action=status is indeed false, then we can detect it by comparing that tpm.owned to /sys/class/tpm/tpm0/device/owned (on older devices /sys/class/misc/tpm0/device/owned). If that one is true, and tpm.being_owned is false we are probably in this state. And let's restrict it to TPM1.2 (since 2.0 initialization is different).
So:
 - if it is TPM 1.2 (e.g. "tpmc tpmver" returns "1.2" or alt [ -f /usr/sbin/tcsd ])
 - if tpm.owned = false, tpm.being_owned = false
 - if /sys/class/(tpm|misc)/tpm0/device/owned contains 1
we are in this condition.


Owner: derat@chromium.org
Status: Started (was: Assigned)
Project Member

Comment 19 by bugdroid1@chromium.org, Aug 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/996075f2639a844bb642079dc2b8f2200819f888

commit 996075f2639a844bb642079dc2b8f2200819f888
Author: Daniel Erat <derat@chromium.org>
Date: Wed Aug 01 17:02:00 2018

tast-tests: Make cryptohome log TPM ownership status.

Update cryptohome.WaitForUserMount log the TPM device's
ownership status on failure. This will help test a theory
that TPM initialization is being interrupted on lab devices.

BUG= chromium:864282 
TEST=manual: ownership is logged after calling logStatus

Change-Id: Id99978db0dcc27f8600e5f881ae9d392ff9d364d
Reviewed-on: https://chromium-review.googlesource.com/1157916
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/996075f2639a844bb642079dc2b8f2200819f888/src/chromiumos/tast/local/cryptohome/cryptohome.go

http://stainless/browse/chromeos-autotest-results/223846952-chromeos-test/ contains a more-recent failure in ui.ChromeCrashLoggedIn:

2018/08/04 10:32:05 Started test ui.ChromeCrashLoggedIn
2018/08/04 10:32:05 [10:32:05.043] Restarting ui job
...
2018/08/04 10:32:39 [10:32:38.982] Waiting for cryptohome /home/user/59e4b5160f3449717f83b917bcbb5950ee349ffe
2018/08/04 10:33:05 [10:33:05.044] Failed to get cryptohome status
2018/08/04 10:33:05 [10:33:05.044] External command failed: signal: killed
2018/08/04 10:33:05 [10:33:05.045] Command: 'cryptohome' '--action=status'
2018/08/04 10:33:05 [10:33:05.045] Uncaptured output:
2018/08/04 10:33:05 [10:33:05.046] Error at chrome_crash_logged_in.go:24: Chrome login failed: /home/user/59e4b5160f3449717f83b917bcbb5950ee349ffe not mounted: context deadline exceeded

It looks like the "cryptohome --action=status" call that was intended to collect more info timed out. This is unexpected -- cryptohome.WaitForUserMount reserves three seconds for it to run.

I'll hold off on trying to commit https://crrev.com/c/1157533 until I see a failure indicating that we're actually seeing the state described in #17.
To give some context, "cryptohome --action=status" is not guaranteed to finish in 3 seconds:

1) That request to cryptohomed triggers several TPM commands.

2) If there's parallel TPm activity, each TPM command may have to wait on one or more other TPM commands already submitted by other daemons to run on TPM and waiting for their turn (TPM can process one command at a time).

3) There are TPM commands that take multiple seconds. E.g. RSA keypair generation is non-deterministic. For TPM2 it typically takes <10s, but can take 30+s in extreme cases (see e.g. http://b/68167013, go/cr50-keygen-speedup).

4) Those lengthy commands are rare (and commands are rarely sent to TPM in general), but they are concentrated around boot (especially first OOBE boot!) & login time. So, if "cryptohome --action=status" is unlucky, it may have to wait for a couple of lengthy 10+ second commands plus a few faster commands, so a 1+ minute wait is not impossible. See, for example, issue 772187, especially comments 28-29 there, for an example of similar delay in a different context.
Summary: cryptohomed broken on some DUTs due to interrupted TPM initialization (was: cryptohomed sometimes fails to mount user dir when tests log in)
Fixing my embarrassing cc mistake. Sorry. :-(
Cc: -hashimoto@chromium.org hidehiko@chromium.org
Components: OS>Systems>Security
Is it fixed or does it need more attention?
Status: WontFix (was: Started)
I'll WontFix this for now since I don't think that I've seen this in a few months (and can't test any mitigations as a result).

Sign in to add a comment