Tast blocks for two minutes before giving up with "Failed ensuring that TPM is initialized" |
||
Issue descriptionAs tracked by issue 901363 , I tried to update Tast to check the TPM status before running tests and request that ownership of it be taken if needed. This was an attempt to reduce flakiness seen when running login tests on unowned devices. The logic is: a) run "cryptohome --action=tpm_more_status" b) if "enabled" is false or "attestation_prepared" is true, don't wait c) otherwise, run "cryptohome --action=tpm_take_ownership" d) run "cryptohome --action=tpm_more_status" repeatedly for 2 minutes and wait for attestation_prepared to become true However, it looks like DUTs belonging to developers (including me) frequently end up in weird states where this doesn't work. After running tpm_take_ownership, attestation_prepared never becomes true. Oddly, login still works, but this makes Tast block for two minutes every time it's asked to run a test, which isn't acceptable. It sounds like there are multiple broken states here, so we need a faster workaround than waiting for all of the underlying issues to be fixed. The biggest hammer would be to just skip these checks by default and only enable them when running in the lab. Passing this through from the tast command is somewhat tricky, though. As a targeted way to detect broken states faster so we can give up immediately, I tried running "cryptohome --action=status" call right after tpm_take_ownership and checking that either "owned" or "being_owned" is true. However, even when ownership is actually being taken (e.g. on a non-broken device where I've just run "crossystem clear_tpm_owner_request=1" and rebooted to clear ownership), it looks like there's still a small time window where both are false. Is this expected?
,
Dec 12
> It's hard to say whether there is some error. Do you have logs? Yes, see https://crbug.com/901363#c34 . I'd rather not turn this bug into a another discussion of different issues that can prevent ownership from being taken, though. If I file a separate bug about the issue that I'm seeing on my dev device, should I assign it to you? > tpm_take_ownership spawns a new task on another thread to start the owning process. > ... > How small was the time window you observed? Thanks, that explanation aligns with what I saw -- my code was checking immediately. I'll try making it wait a few seconds before checking.
,
Dec 12
Even when I make Tast wait two seconds after running --action=tpm_take_ownership, --action=status still reports that the TPM is neither owned nor being owned. I'm leaning toward just removing all of this code from Tast. I'm feeling skeptical that there's much that Tast can do to fix things when the TPM/cryptome/tpm_managerd/etc. is broken, and it feels like it's out of scope for a test framework in any case.
,
Dec 12
Re a two-second delay not working: sorry, bug in my code. After fixing it, I see Tast give up as expected when the ownership request isn't honored (and not give up when it is honored). I'll probably go with that approach as a mitigation for the delay.
,
Dec 12
Uploaded https://crrev.com/c/1374110 to make Tast give up after a second if it doesn't look like cryptohome is actually able to take ownership.
,
Dec 12
A few general comments. I look at this from a bit different angle. > However, it looks like DUTs belonging to developers (including me) frequently end up in weird states where this doesn't work. After running tpm_take_ownership, attestation_prepared never becomes true. Why do people want to run tests involving login/cryptohomed, when the device is known to be in this weird state wrt to TPM status? - if it's expected that the state may be weird, tests results, in their login part, are meaningless - e.g. login may fail, but it's not an indication of any bugs related to it; - if that's unexpected, then inability to reach attestation_prepared is actually a real test failure. The whole point of enforcing owned & attestation_prepared before running those tests was so that they don't confuse developers by producing failures on login when the actual problem is not login, but the fact that the state is weird, or we are in a middle of taking ownership and the device behaves differently, not as expected by tests. > Oddly, login still works, but this makes Tast block for two minutes every time it's asked to run a test, which isn't acceptable. 1) attestation_prepared is not required for using TPM for login - it is only needed for tests that involve enrollment. 2) attestation_prepared was included in the checks because while attestation is being prepared right after taking ownership, the timing of various responses to requests to cryptohomed is quite different - much longer delays are expected. instead of making tests always expect the worst, we make sure the device goes through those first-boot steps once, so that we can rely on it being in the default, normal state. 3) login works even when owned=false. it just uses software keys instead of TPM-bound keys in this case. However, the common case login is the one not on the first (oobe) boot. at this point, TPM is owned and attestation_prepared is true on real user devices. And my understanding, this is what we want to exercise in our tests (except for special tests that target first bot behavior). Otherwise, we are testing something different from what the users actually face. One of the problems is we have a mixture of tests that use login: some to actually test the login, some since it just unlocks the UI needed for testing the actual target. If exercising the user login as close as possible to what's happening with real-life users is important, then inability to reach attestation_prepared=true is a test blocker. If that's not important, and TPM just gets in the way, when tests highlight TPM-related issues & weird state, then we need to do one of the following: (a) simulate logins for UI purposes at Chrome level, w/o it even talking to cryptohomed. so that the device can show the UI as if the user is logged in. [it's as pure testing as possible: no waiting for attestation_prepared, not depending on TPM-related bugs, not even depending on login/cryptohome-related bugs.] (b) if we can separate those tests into different suites, then - for group A (tests that actually test login) we can go through clearing the owner + taking ownership + waiting for attestation_prepared - then run the tests. [realistic login behavior] - for group B (just give me UI as if logged in, don't care about other aspects) we can go through clearing the owner and stop there. logins will be performed using software keys. [here no waiting for attestation_prepared, not depending on TPM-related bugs, etc. though cryptohome/login bugs will still affect unrelated tests]
,
Dec 12
There are non-login tests that are getting blocked by this same check (e.g., I'm looking at shill tests that don't deal with Chrome / login at all). So there's more than just Group A and B. Also, I'm not sure if this is the right bug to add to, but the device I reported having issues with on bug 901363 / comments #37 and #42 is back in shape after upgrading my image and running 'crossystem clear_tpm_owner_request=1; reboot'. Thanks.
,
Dec 12
> There are non-login tests that are getting blocked by this same check hm, I thought we are only enforcing it for tests that talk to cryptohomed. maybe we should just drop this check from unrelated tests then?
,
Dec 12
Thanks, Andrey! This isn't a test-level check; it happens before we run any tests, and happens regardless of the tests that will be run. I added it for the 2) case in #6: "attestation_prepared was included in the checks because while attestation is being prepared right after taking ownership, the timing of various responses to requests to cryptohomed is quite different - much longer delays are expected. instead of making tests always expect the worst, we make sure the device goes through those first-boot steps once, so that we can rely on it being in the default, normal state." I think I misinterpreted https://crbug.com/901363#c17 as being a general procedure that Tast can perform to get TPMs into a good state before running tests. Since that isn't the case, I'm leaning toward still trying to do this when running in the lab (where it seems useful in the off chance that it prevents long cryptohome delays later) but skipping it during manual runs. Given the current state of Chrome's login code, I think it's unlikely that we'd be able to add a new simulated-login mode for testing anytime soon. I like the idea of avoiding the login code when possible, though. :-)
,
Dec 14
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast/+/16180cb0ed083ab0d9c00cf8ea65302b51c26024 commit 16180cb0ed083ab0d9c00cf8ea65302b51c26024 Author: Daniel Erat <derat@chromium.org> Date: Fri Dec 14 03:28:03 2018 tast: Add -waituntilready flag to "run" subcommand. Add a -waituntilready flag to "tast run" that controls whether the test bundle's "ready" function (if any) will be executed before running tests. Before this change, the function was always executed. The local "cros" bundle's readiness function waits for various Upstart services, which is low-cost, but it also attempts to check the TPM's ownership state and initialize it, which can be slow or can fail entirely. This flag currently defaults to true, but I'm planning to make it default to false after updating builders to explicitly set it. BUG=chromium:914258 TEST=updated unit tests; also manually verified that readiness is checked by default for "tast run" and local_test_runner but not checked when passing -waituntilready=false to either command Change-Id: I0c9d1e39437d270204df0f693d5655052d8ea903 Reviewed-on: https://chromium-review.googlesource.com/1375010 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/bundle/local_test.go [modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/bundle/local.go [modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/cmd/tast/run/local.go [modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/runner/args.go [modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/cmd/tast/run/config.go [modify] https://crrev.com/16180cb0ed083ab0d9c00cf8ea65302b51c26024/src/chromiumos/tast/bundle/args.go
,
Dec 15
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/8832a5c072e7a5c2bd7869eb79b494dff225cb78 commit 8832a5c072e7a5c2bd7869eb79b494dff225cb78 Author: Daniel Erat <derat@chromium.org> Date: Sat Dec 15 06:40:35 2018 cbuildbot: Pass -waituntilready to Tast. Make the TastVMTest stage pass -waituntilready to "tast run" to ensure that readiness checks (including attempting to initialize the TPM) will happen before tests are executed. This flag is true by default but will be changed to false soon. BUG=chromium:914258 TEST=none Change-Id: Iffb62ac764f8604f126fa1ebe71ba9d9d2be224d Reviewed-on: https://chromium-review.googlesource.com/1377372 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/8832a5c072e7a5c2bd7869eb79b494dff225cb78/cbuildbot/stages/tast_test_stages.py
,
Dec 15
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b6b7c21b3e181fdbca44e6ee3f2a69593713e2da commit b6b7c21b3e181fdbca44e6ee3f2a69593713e2da Author: Daniel Erat <derat@chromium.org> Date: Sat Dec 15 06:40:33 2018 autotest: Pass -waituntilready to Tast. Make the tast.py server test pass -waituntilready=true to "tast run" to ensure that readiness checks (including attempting to initialize the TPM) will happen before tests are executed. This flag is true by default but will be changed to false soon. BUG=chromium:914258 TEST=unit tests still pass Change-Id: I8c59d8b2b06352eb3aac06c938a47ff70df4b54c Reviewed-on: https://chromium-review.googlesource.com/1377555 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/b6b7c21b3e181fdbca44e6ee3f2a69593713e2da/server/site_tests/tast/testdata/fake_tast.py [modify] https://crrev.com/b6b7c21b3e181fdbca44e6ee3f2a69593713e2da/server/site_tests/tast/tast.py
,
Dec 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast/+/cfa4ffaca7987424e7c96be63acba875c46a0ca6 commit cfa4ffaca7987424e7c96be63acba875c46a0ca6 Author: Daniel Erat <derat@chromium.org> Date: Tue Dec 18 00:52:30 2018 tast: Make run command's -waituntilready default to false. Make the "run" subcommands -waituntilready flag default to false. Builders explicitly set it to true, so the default can be changed now to avoid blocking for two minutes when the TPMs on developers' DUTs are in bad states. BUG=chromium:914258 TEST=none CQ-DEPEND=I8c59d8b2b06352eb3aac06c938a47ff70df4b54c CQ-DEPEND=Iffb62ac764f8604f126fa1ebe71ba9d9d2be224d Change-Id: Ib45dfa6f22eaa0eb293c04002dd7b472bb75ad0f Reviewed-on: https://chromium-review.googlesource.com/1377556 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/cfa4ffaca7987424e7c96be63acba875c46a0ca6/src/chromiumos/cmd/tast/run/config.go |
||
►
Sign in to add a comment |
||
Comment 1 by garryxiao@google.com
, Dec 12