New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 914258 link

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Tast blocks for two minutes before giving up with "Failed ensuring that TPM is initialized"

Project Member Reported by derat@chromium.org, Dec 12

Issue description

As tracked by  issue 901363 , I tried to update Tast to check the TPM status before running tests and request that ownership of it be taken if needed. This was an attempt to reduce flakiness seen when running login tests on unowned devices.

The logic is:

a) run "cryptohome --action=tpm_more_status"
b) if "enabled" is false or "attestation_prepared" is true, don't wait
c) otherwise, run "cryptohome --action=tpm_take_ownership"
d) run "cryptohome --action=tpm_more_status" repeatedly for 2 minutes and wait for attestation_prepared to become true

However, it looks like DUTs belonging to developers (including me) frequently end up in weird states where this doesn't work. After running tpm_take_ownership, attestation_prepared never becomes true. Oddly, login still works, but this makes Tast block for two minutes every time it's asked to run a test, which isn't acceptable.

It sounds like there are multiple broken states here, so we need a faster workaround than waiting for all of the underlying issues to be fixed.

The biggest hammer would be to just skip these checks by default and only enable them when running in the lab. Passing this through from the tast command is somewhat tricky, though.

As a targeted way to detect broken states faster so we can give up immediately, I tried running "cryptohome --action=status" call right after tpm_take_ownership and checking that either "owned" or "being_owned" is true. However, even when ownership is actually being taken (e.g. on a non-broken device where I've just run "crossystem clear_tpm_owner_request=1" and rebooted to clear ownership), it looks like there's still a small time window where both are false. Is this expected?
 
> After running tpm_take_ownership, attestation_prepared never becomes true. Oddly, login still works, but this makes Tast block for two minutes every time it's asked to run a test, which isn't acceptable.

Taking ownership (including pre-owning + actual owning) plus preparing for enrollment in attestation may take a long time (> 2 minutes) on some machines if unlucky. It's hard to say whether there is some error. Do you have logs?

> It looks like there's still a small time window where both are false. Is this expected?

tpm_take_ownership spawns a new task on another thread to start the owning process. When the task is running, it first sets being_owned and then sends a request to tpm_managerd to do the rest part of the work. That being said, if there is a noticeable gap between cryptohome --action=tpm_take_ownership being executed and the task actually starting running, it's possible to see false on both owned and being_owned in the gap. However, the gap should be small. How small was the time window you observed?
> It's hard to say whether there is some error. Do you have logs?

Yes, see  https://crbug.com/901363#c34 . I'd rather not turn this bug into a another discussion of different issues that can prevent ownership from being taken, though. If I file a separate bug about the issue that I'm seeing on my dev device, should I assign it to you?

> tpm_take_ownership spawns a new task on another thread to start the owning process.
> ...
> How small was the time window you observed?

Thanks, that explanation aligns with what I saw -- my code was checking immediately. I'll try making it wait a few seconds before checking.
Even when I make Tast wait two seconds after running --action=tpm_take_ownership, --action=status still reports that the TPM is neither owned nor being owned.

I'm leaning toward just removing all of this code from Tast. I'm feeling skeptical that there's much that Tast can do to fix things when the TPM/cryptome/tpm_managerd/etc. is broken, and it feels like it's out of scope for a test framework in any case.
Re a two-second delay not working: sorry, bug in my code. After fixing it, I see Tast give up as expected when the ownership request isn't honored (and not give up when it is honored). I'll probably go with that approach as a mitigation for the delay.
Uploaded https://crrev.com/c/1374110 to make Tast give up after a second if it doesn't look like cryptohome is actually able to take ownership.
A few general comments. I look at this from a bit different angle.

> However, it looks like DUTs belonging to developers (including me) frequently end up in weird states where this doesn't work. After running tpm_take_ownership, attestation_prepared never becomes true. 

Why do people want to run tests involving login/cryptohomed, when the device is known to be in this weird state wrt to TPM status?
 - if it's expected that the state may be weird, tests results, in their login part, are meaningless - e.g. login may fail, but it's not an indication of any bugs related to it;
 - if that's unexpected, then inability to reach attestation_prepared is actually a real test failure.
The whole point of enforcing owned & attestation_prepared before running those tests was so that they don't confuse developers by producing failures on login when the actual problem is not login, but the fact that the state is weird, or we are in a middle of taking ownership and the device behaves differently, not as expected by tests.

> Oddly, login still works, but this makes Tast block for two minutes every time it's asked to run a test, which isn't acceptable.

1) attestation_prepared is not required for using TPM for login - it is only needed for tests that involve enrollment.
2) attestation_prepared was included in the checks because while attestation is being prepared right after taking ownership, the timing of various responses to requests to cryptohomed is quite different - much longer delays are expected. instead of making tests always expect the worst, we make sure the device goes through those first-boot steps once, so that we can rely on it being in the default, normal state.
3) login works even when owned=false. it just uses software keys instead of TPM-bound keys in this case.
However, the common case login is the one not on the first (oobe) boot. at this point, TPM is owned and attestation_prepared is true on real user devices. And my understanding, this is what we want to exercise in our tests (except for special tests that target first bot behavior). Otherwise, we are testing something different from what the users actually face.

One of the problems is we have a mixture of tests that use login: some to actually test the login, some since it just unlocks the UI needed for testing the actual target.
If exercising the user login as close as possible to what's happening with real-life users is important, then inability to reach attestation_prepared=true is a test blocker.
If that's not important, and TPM just gets in the way, when tests highlight TPM-related issues & weird state, then we need to do one of the following:
(a) simulate logins for UI purposes at Chrome level, w/o it even talking to cryptohomed. so that the device can show the UI as if the user is logged in. [it's as pure testing as possible: no waiting for attestation_prepared, not depending on TPM-related bugs, not even depending on login/cryptohome-related bugs.]
(b) if we can separate those tests into different suites, then
 - for group A (tests that actually test login) we can go through clearing the owner + taking ownership + waiting for attestation_prepared - then run the tests. [realistic login behavior]
 - for group B (just give me UI as if logged in, don't care about other aspects) we can go through clearing the owner and stop there. logins will be performed using software keys. [here no waiting for attestation_prepared, not depending on TPM-related bugs, etc. though cryptohome/login bugs will still affect unrelated tests]


There are non-login tests that are getting blocked by this same check (e.g., I'm looking at shill tests that don't deal with Chrome / login at all). So there's more than just Group A and B.

Also, I'm not sure if this is the right bug to add to, but the device I reported having issues with on  bug 901363  / comments #37 and #42 is back in shape after upgrading my image and running 'crossystem clear_tpm_owner_request=1; reboot'. Thanks.
> There are non-login tests that are getting blocked by this same check 

hm, I thought we are only enforcing it for tests that talk to cryptohomed. maybe we should just drop this check from unrelated tests then? 
Cc: xiy...@chromium.org
Thanks, Andrey!

This isn't a test-level check; it happens before we run any tests, and happens regardless of the tests that will be run.

I added it for the 2) case in #6: "attestation_prepared was included in the checks because while attestation is being prepared right after taking ownership, the timing of various responses to requests to cryptohomed is quite different - much longer delays are expected. instead of making tests always expect the worst, we make sure the device goes through those first-boot steps once, so that we can rely on it being in the default, normal state."

I think I misinterpreted  https://crbug.com/901363#c17  as being a general procedure that Tast can perform to get TPMs into a good state before running tests. Since that isn't the case, I'm leaning toward still trying to do this when running in the lab (where it seems useful in the off chance that it prevents long cryptohome delays later) but skipping it during manual runs.

Given the current state of Chrome's login code, I think it's unlikely that we'd be able to add a new simulated-login mode for testing anytime soon. I like the idea of avoiding the login code when possible, though. :-)
Project Member

Comment 10 by bugdroid1@chromium.org, Dec 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/16180cb0ed083ab0d9c00cf8ea65302b51c26024

commit 16180cb0ed083ab0d9c00cf8ea65302b51c26024
Author: Daniel Erat <derat@chromium.org>
Date: Fri Dec 14 03:28:03 2018

tast: Add -waituntilready flag to "run" subcommand.

Add a -waituntilready flag to "tast run" that controls
whether the test bundle's "ready" function (if any) will be
executed before running tests. Before this change, the
function was always executed.

The local "cros" bundle's readiness function waits for
various Upstart services, which is low-cost, but it also
attempts to check the TPM's ownership state and initialize
it, which can be slow or can fail entirely.

This flag currently defaults to true, but I'm planning to
make it default to false after updating builders to
explicitly set it.

BUG=chromium:914258
TEST=updated unit tests; also manually verified that
     readiness is checked by default for "tast run" and
     local_test_runner but not checked when passing
     -waituntilready=false to either command

Change-Id: I0c9d1e39437d270204df0f693d5655052d8ea903
Reviewed-on: https://chromium-review.googlesource.com/1375010
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/bundle/local_test.go
[modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/bundle/local.go
[modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/cmd/tast/run/local.go
[modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/runner/args.go
[modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/cmd/tast/run/config.go
[modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/bundle/args.go

Project Member

Comment 11 by bugdroid1@chromium.org, Dec 15

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/8832a5c072e7a5c2bd7869eb79b494dff225cb78

commit 8832a5c072e7a5c2bd7869eb79b494dff225cb78
Author: Daniel Erat <derat@chromium.org>
Date: Sat Dec 15 06:40:35 2018

cbuildbot: Pass -waituntilready to Tast.

Make the TastVMTest stage pass -waituntilready to "tast run"
to ensure that readiness checks (including attempting to
initialize the TPM) will happen before tests are executed.
This flag is true by default but will be changed to false
soon.

BUG=chromium:914258
TEST=none

Change-Id: Iffb62ac764f8604f126fa1ebe71ba9d9d2be224d
Reviewed-on: https://chromium-review.googlesource.com/1377372
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Hidehiko Abe <hidehiko@chromium.org>

[modify] https://crrev.com/8832a5c072e7a5c2bd7869eb79b494dff225cb78/cbuildbot/stages/tast_test_stages.py

Project Member

Comment 12 by bugdroid1@chromium.org, Dec 15

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b6b7c21b3e181fdbca44e6ee3f2a69593713e2da

commit b6b7c21b3e181fdbca44e6ee3f2a69593713e2da
Author: Daniel Erat <derat@chromium.org>
Date: Sat Dec 15 06:40:33 2018

autotest: Pass -waituntilready to Tast.

Make the tast.py server test pass -waituntilready=true to
"tast run" to ensure that readiness checks (including
attempting to initialize the TPM) will happen before tests
are executed. This flag is true by default but will be
changed to false soon.

BUG=chromium:914258
TEST=unit tests still pass

Change-Id: I8c59d8b2b06352eb3aac06c938a47ff70df4b54c
Reviewed-on: https://chromium-review.googlesource.com/1377555
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Hidehiko Abe <hidehiko@chromium.org>

[modify] https://crrev.com/b6b7c21b3e181fdbca44e6ee3f2a69593713e2da/server/site_tests/tast/testdata/fake_tast.py
[modify] https://crrev.com/b6b7c21b3e181fdbca44e6ee3f2a69593713e2da/server/site_tests/tast/tast.py

Project Member

Comment 13 by bugdroid1@chromium.org, Dec 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/cfa4ffaca7987424e7c96be63acba875c46a0ca6

commit cfa4ffaca7987424e7c96be63acba875c46a0ca6
Author: Daniel Erat <derat@chromium.org>
Date: Tue Dec 18 00:52:30 2018

tast: Make run command's -waituntilready default to false.

Make the "run" subcommands -waituntilready flag default to
false. Builders explicitly set it to true, so the default
can be changed now to avoid blocking for two minutes when
the TPMs on developers' DUTs are in bad states.

BUG=chromium:914258
TEST=none
CQ-DEPEND=I8c59d8b2b06352eb3aac06c938a47ff70df4b54c
CQ-DEPEND=Iffb62ac764f8604f126fa1ebe71ba9d9d2be224d

Change-Id: Ib45dfa6f22eaa0eb293c04002dd7b472bb75ad0f
Reviewed-on: https://chromium-review.googlesource.com/1377556
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Hidehiko Abe <hidehiko@chromium.org>

[modify] https://crrev.com/cfa4ffaca7987424e7c96be63acba875c46a0ca6/src/chromiumos/cmd/tast/run/config.go

Sign in to add a comment