Issue metadata
Sign in to add a comment
|
Tast should try to ensure TPM is owned before running tests |
||||||||||||||||||||||
Issue descriptionTest fails when trying to removing user cryptohome during test preparation. The failure seems not really mash specific though. logs link: https://stainless.corp.google.com/browse/chromeos-autotest-results/254575188-chromeos-test/ Relevant Tast log: ==== 2018/11/01 22:12:46 Started test ui.MashLogin 2018/11/01 22:12:46 [22:12:46.087] Restarting ui job 2018/11/01 22:12:47 [22:12:46.765] Removing cryptohome for testuser@gmail.com 2018/11/01 22:14:16 [22:14:16.100] External command timed out 2018/11/01 22:14:16 [22:14:16.100] Command: cryptohome --action=remove --force --user=testuser@gmail.com 2018/11/01 22:14:16 [22:14:16.100] Uncaptured output: 2018/11/01 22:14:16 [22:14:16.126] Saving minidump: /tmp/tast_out.031681413/ui.MashLogin/tcsd.945.dmp 2018/11/01 22:14:19 [22:14:18.610] Saving minidump: /tmp/tast_out.031681413/ui.MashLogin/chapsd.974.dmp 2018/11/01 22:14:20 [22:14:19.615] Saving minidump: /tmp/tast_out.031681413/ui.MashLogin/cryptohomed.1073.dmp 2018/11/01 22:14:21 [22:14:21.073] Error at mash_login.go:26: Chrome login failed: failed to remove cryptohome: context deadline exceeded 2018/11/01 22:14:21 [22:14:21.073] Stack trace: Chrome login failed at chromiumos/tast/local/bundles/cros/ui.MashLogin (mash_login.go:26) at chromiumos/tast/testing.(*Test).Run.func4 (test.go:189) at chromiumos/tast/testing.runStages.func1.1 (stage.go:39) at chromiumos/tast/testing.runAndRecover.func1 (stage.go:69) at runtime.goexit (asm_arm.s:1015) failed to remove cryptohome at chromiumos/tast/local/cryptohome.RemoveUserDir (cryptohome.go:88) at chromiumos/tast/local/chrome.(*Chrome).restartSession (chrome.go:344) at chromiumos/tast/local/chrome.(*Chrome).restartChromeForTesting (chrome.go:268) at chromiumos/tast/local/chrome.New (chrome.go:172) at chromiumos/tast/local/bundles/cros/ui.MashLogin (mash_login.go:24) at chromiumos/tast/testing.(*Test).Run.func4 (test.go:189) at chromiumos/tast/testing.runStages.func1.1 (stage.go:39) at chromiumos/tast/testing.runAndRecover.func1 (stage.go:69) ... context deadline exceeded at ??? 2018/11/01 22:14:21 Completed test ui.MashLogin in 1m35.032s with 1 error(s) Kernel log between 22:12:47 and 22:14:16 ==== 2018-11-01T22:12:46.578741-07:00 WARNING cryptohomed[1073]: Could not load the device policy file. 2018-11-01T22:12:46.703807-07:00 WARNING chapsd[974]: Unload Token event received for unknown path: /home/root/ca15b0030f669ca76704c0e8955f7e19666001e4/chaps 2018-11-01T22:12:50.382471-07:00 INFO chapsd[974]: Deleted lock file: /run/lock/power_override/chapsd_token_init_slot_0.lock 2018-11-01T22:13:08.458240-07:00 INFO kernel: [ 382.347393] tpm_i2c_infineon 3-0020: command 0x1f (size 146) returned code 0x0 2018-11-01T22:13:08.518248-07:00 INFO kernel: [ 382.406227] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:08.563256-07:00 INFO kernel: [ 382.452383] tpm_i2c_infineon 3-0020: command 0xa (size 10) returned code 0x0 2018-11-01T22:13:08.618317-07:00 INFO kernel: [ 382.507168] tpm_i2c_infineon 3-0020: command 0xa (size 10) returned code 0x0 2018-11-01T22:13:08.733305-07:00 INFO kernel: [ 382.620367] tpm_i2c_infineon 3-0020: command 0x21 (size 59) returned code 0x0 2018-11-01T22:13:08.773256-07:00 INFO kernel: [ 382.661451] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:08.843306-07:00 INFO kernel: [ 382.730184] tpm_i2c_infineon 3-0020: command 0x65 (size 42) returned code 0x0 2018-11-01T22:13:09.010686-07:00 INFO cryptohomed[1073]: Created new cryptohome key. 2018-11-01T22:13:09.893245-07:00 INFO kernel: [ 383.780497] tpm_i2c_infineon 3-0020: command 0x41 (size 618) returned code 0x0 2018-11-01T22:13:09.953318-07:00 INFO kernel: [ 383.841105] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:10.003236-07:00 INFO kernel: [ 383.892110] tpm_i2c_infineon 3-0020: command 0xa (size 10) returned code 0x0 2018-11-01T22:13:10.123244-07:00 INFO kernel: [ 384.013864] tpm_i2c_infineon 3-0020: command 0x21 (size 59) returned code 0x0 2018-11-01T22:13:10.168236-07:00 INFO kernel: [ 384.056469] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:10.213255-07:00 INFO kernel: [ 384.102470] tpm_i2c_infineon 3-0020: command 0xa (size 10) returned code 0x0 2018-11-01T22:13:10.288256-07:00 INFO kernel: [ 384.179570] tpm_i2c_infineon 3-0020: command 0x65 (size 42) returned code 0x0 2018-11-01T22:13:10.523106-07:00 INFO chapsd[974]: Finished importing 0 pending objects. 2018-11-01T22:13:10.801140-07:00 INFO chapsd[974]: Master key is ready for token at /var/lib/chaps 2018-11-01T22:13:11.348239-07:00 INFO kernel: [ 385.235101] tpm_i2c_infineon 3-0020: command 0x41 (size 618) returned code 0x0 2018-11-01T22:13:11.403233-07:00 INFO kernel: [ 385.291208] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:11.453242-07:00 INFO kernel: [ 385.340924] tpm_i2c_infineon 3-0020: command 0x65 (size 18) returned code 0x0 2018-11-01T22:13:11.542427-07:00 INFO cryptohomed[1073]: Configuring TPM, ownership taken: 1. 2018-11-01T22:13:11.543302-07:00 INFO kernel: [ 385.433786] tpm_i2c_infineon 3-0020: command 0x21 (size 14) returned code 0x0 2018-11-01T22:13:11.588286-07:00 INFO kernel: [ 385.478298] tpm_i2c_infineon 3-0020: command 0x65 (size 18) returned code 0x0 2018-11-01T22:13:11.638270-07:00 INFO kernel: [ 385.526179] tpm_i2c_infineon 3-0020: command 0x65 (size 18) returned code 0x0 2018-11-01T22:13:11.678310-07:00 INFO kernel: [ 385.568178] tpm_i2c_infineon 3-0020: command 0x65 (size 18) returned code 0x0 2018-11-01T22:13:11.738255-07:00 INFO kernel: [ 385.626057] tpm_i2c_infineon 3-0020: command 0xb (size 36) returned code 0x0 2018-11-01T22:13:11.848569-07:00 INFO kernel: [ 385.739796] tpm_i2c_infineon 3-0020: command 0xcc (size 146) returned code 0x0 2018-11-01T22:13:11.898245-07:00 INFO kernel: [ 385.786544] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:11.938235-07:00 INFO kernel: [ 385.827110] tpm_i2c_infineon 3-0020: command 0x65 (size 18) returned code 0x0 2018-11-01T22:13:11.993247-07:00 INFO kernel: [ 385.884095] tpm_i2c_infineon 3-0020: command 0x15 (size 14) returned code 0x0 2018-11-01T22:13:12.043232-07:00 INFO kernel: [ 385.932203] tpm_i2c_infineon 3-0020: command 0x65 (size 18) returned code 0x0 2018-11-01T22:13:12.103446-07:00 INFO kernel: [ 385.994906] tpm_i2c_infineon 3-0020: command 0xb (size 36) returned code 0x0 2018-11-01T22:13:12.223302-07:00 INFO kernel: [ 386.111679] tpm_i2c_infineon 3-0020: command 0xcc (size 146) returned code 0x0 2018-11-01T22:13:12.274872-07:00 INFO cryptohomed[1073]: Lockbox created. 2018-11-01T22:13:12.278255-07:00 INFO kernel: [ 386.166288] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:12.516490-07:00 INFO cryptohomed[1073]: Attestation: Preparing for enrollment... 2018-11-01T22:13:12.568252-07:00 INFO kernel: [ 386.457522] tpm_i2c_infineon 3-0020: command 0xa (size 10) returned code 0x0 2018-11-01T22:13:12.698435-07:00 INFO kernel: [ 386.589861] tpm_i2c_infineon 3-0020: command 0x81 (size 59) returned code 0x0 2018-11-01T22:13:12.743239-07:00 INFO kernel: [ 386.631560] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:12.788226-07:00 INFO kernel: [ 386.677320] tpm_i2c_infineon 3-0020: command 0xa (size 10) returned code 0x0 2018-11-01T22:13:12.913239-07:00 INFO kernel: [ 386.804245] tpm_i2c_infineon 3-0020: command 0x81 (size 59) returned code 0x0 2018-11-01T22:13:12.957528-07:00 INFO cryptohomed[1073]: Attestation: Creating identity 0 with identity feature IDENTITY_FEATURE_ENTERPRISE_ENROLLMENT_ID. 2018-11-01T22:13:12.958343-07:00 INFO kernel: [ 386.846629] tpm_i2c_infineon 3-0020: command 0xba (size 18) returned code 0x22 2018-11-01T22:13:13.403239-07:00 INFO kernel: [ 387.290130] tpm_i2c_infineon 3-0020: command 0xb (size 36) returned code 0x0 2018-11-01T22:13:13.458250-07:00 INFO kernel: [ 387.347106] tpm_i2c_infineon 3-0020: command 0xa (size 10) returned code 0x0 cryptohomed seems not did the remove. For other tests that also removes user cryptohome, we see something like this: 2018-11-01T22:12:06.455759-07:00 WARNING cryptohomed[1073]: No valid keysets on disk for ca15b0030f669ca76704c0e8955f7e19666001e4 But this not seen for the failed test login.
,
Nov 2
I initially submitted https://crrev.com/c/1309160 to try to fix intermittent flakiness caused by bad policy files ( issue 897278 ). In doing so, I removed a call to wait for cryptohomed's D-Bus service to available before we use the cryptohome command to try to wipe the user's existing home directory when logging in. I'd previously added the wait in https://crrev.com/c/1223034 to fix issue 883554 . In https://crrev.com/c/1314050, I made us wait for the service again. We do so earlier now, though, before running any tests. Before, we did this: - stop and start ui Upstart job - ask session_manager to restart Chrome - wait for cryptohomed D-Bus service - run cryptohome command to remove user dir - etc. Now, we do this: - wait for cryptohomed Upstart job - wait for cryptohomed D-Bus service - stop ui Upstart job - remove /var/lib/whitelist - run cryptohome command to remove user dir - start ui Upstart job - ask session_manager to restart Chrome - etc. In other words, we're running the cryptohome command earlier than before now (since we don't restart Chrome first). I'm wondering if there's a window of time where cryptohomed has exported its D-Bus service but still isn't ready for requests...
,
Nov 2
Here's the chapsd trace as well. My best theory right now is that this early "cryptohome --action=remove" is somehow putting cryptohomed into a bad state, because all of the subsequent cryptohome calls also hang after this. Might be a deadlock in cryptohomed? chrome.New was also refactored a bit by https://crrev.com/c/1303785. I'm not sure if that could also be having an effect on the timing here. I've uploaded https://crrev.com/c/1315331 to make us switch to running the cryptohome command later again, after restarting Chrome. Reverting https://crrev.com/c/1314050 and https://crrev.com/c/1309160 might fix the VM race, but I'd like to avoid it if possible since it'd make use again susceptible to the policy corruption race at issue 897278 .
,
Nov 2
,
Nov 2
> I'm wondering if there's a window of time where cryptohomed has exported its D-Bus service but still isn't ready for requests... Looked at service::Initialize for it. Yes, there is time between dbus_g_object_type_install_info(), which makes dbus objects externally callable, and mount_thread_.Start(), for example. And a call to RemoveEx posts a task on mount_task_ to do the removal. Not sure what happens in this case. maybe the task is lost. But if it is posted, a lot goes on between dbus...install_info and starting the thread, so it can be at least delayed. But from the timestamp, it's 380+ seconds from kernel start. cryptohomed initialization should be long done by now, even if there is a race. Or was cryptohomed just restarted? From the log in bug descr, I also see that TPM initialization goes on in parallel (with PrepareForEnrollment not done yet by timeout). It probably also means that the TPM initialization was either (a) triggered by the previous login attempt during the previous test, or (b) going through OOBE by the previous test, or (c) restarting cryptohomed after interrupted TPM initialization (although in this case I'd expect some traces of that in the logs). Will check the logs to see if I can understand what exactly happened. (I expect also it is a signed image with Chrome (not Chromium) that runs on those DUTs, right? Chrome does trigger TPM initialization on OOBE & logins, Chromium only on logins.) Having this TPM initialization going in parallel could have also affect the request. Though it is mostly run on a parallel thread, there are things that affect user management requests like RemoveEx (and vice versa) or cyptohomed start sequence. Note that if the test account we are removing now was the one, which creation triggered TPM initalization in the previous test, that account became the device owner. And I wonder if that user has already been unmounted by now. Is it possible that we are removing the home directory for a still mounted user? The TPM initialization posts some tasks for the mounted users on the mount thread once the TPM is owned. E.g. to tell chaps that it can now establish a kestore slot for those users. There can be a race condition if we remove & do these things in parallel. Honestly, I'd separate tests for what happens when we do user operations in parallel with TPM initialization (first boot scenarios) from non-first-boot workflows, when TPM is supposed to be initialized long ago. It may be worth to have "TPM fully initialized" as a pre-condition before running tests like ui.MashLogin. So that if this test is one of the first to run on the device after it was reset and owner cleared, it first triggers the ownership, waits until it is done, and then continues. If the device is already owns it starts right away.
,
Nov 2
Cont to #15 after looking at full logs. 1) Cryptohome was already running tasks on mount thread just fine at this point. 2018-11-01T22:06:53.549485-07:00 INFO cryptohomed[1073]: The TPM chip does not support GetAlertsData. Stop UploadAlertsData task. So, doesn't look like an issue with cryptohomed being on dbus but not ready to serve requests yet. 2) Actually, it's not cryptohomed who triggers TPM ownership on login or during OOBE. It is Chrome who decides that and tells cryptohomed when to do it. Looks like TPM initialization was initiated around creating this user. 2018-11-01T22:11:54.505668-07:00 INFO cryptohomed[1073]: Migrated (or created) user directory: /home/.shadow/ca15b0030f669ca76704c0e8955f7e19666001e4/vault Since here we say that cryptohomed was told to do so at around ~22:11:57 (timestamp minus duration): 2018-11-01T22:12:38.006941-07:00 ERR cryptohomed[1073]: Taking TPM ownership took 40500ms 3) On a separate note: we've got chrome crash in between: 2018-11-01T22:12:01.770094-07:00 WARNING crash_reporter[4674]: [user] Received crash notification for chrome[4219] sig 11, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly) 4) Here, soon after TPM is owned we start processing MountEx call: 2018-11-01T22:12:38.672948-07:00 INFO cryptohomed[1073]: Migrated (or created) user directory: /home/.shadow/ca15b0030f669ca76704c0e8955f7e19666001e4/vault Putting a mount task to tell chaps to load keystore token here: 2018-11-01T22:12:39.420793-07:00 INFO cryptohomed[1073]: Putting a Pkcs11_Initialize on the mount thread. But we unmount first, so we cancel the chaps task here: 2018-11-01T22:12:46.572342-07:00 INFO cryptohomed[1073]: Cancelling PKCS#11 Init on unmount. 4) At this point mount_thread is still running Service::ConfigureOwnedTpm task 2018-11-01T22:13:11.542427-07:00 INFO cryptohomed[1073]: Configuring TPM, ownership taken: 1. ... [part of Service::InitializeInstallAttributes called from that task] 2018-11-01T22:13:12.274872-07:00 INFO cryptohomed[1073]: Lockbox created. ... [part of ServiceMonolithic::AttestationInitializeTpmComplete called from that task on TPM 1.2] 2018-11-01T22:13:12.516490-07:00 INFO cryptohomed[1073]: Attestation: Preparing for enrollment... ... 2018-11-01T22:14:30.489696-07:00 INFO cryptohomed[1073]: Attestation: Prepared successfully (77973 ms). And this task was posted at about 22:12:38, when we reported that taking TPM ownership was done. So, between 22:13:11 and 22:14:30, the mount thread is still running a task posted at 22:12:38. RemoveEx task posted at 22:12:46 didn't have a chance to run yet by the time it times out: 2018/11/01 22:12:47 [22:12:46.765] Removing cryptohome for testuser@gmail.com 2018/11/01 22:14:16 [22:14:16.100] External command timed out Bottomline: remove operation timed out because we ran the tests on a yet unowned device and had to wait for TPM initialization tasks on mount_thread before RemoveEx could be processed.
,
Nov 2
1) For tests: I strongly suggest to own the device as a precondition as I said in #5 above to avoid such flakiness. (Or further increase the timeout for cryptohomed operations, which doesn't sound as clean.) And have separate tests for OOBE/first boot - in my plans 2) For cryptohome: The issues with cryptohomed (and other daemons) becoming unresponsive while the device is being owned on the 1st boot are well-known. We are starting a refactoring project, with one of the major goals to reduce the amount of blocking operations, - literally right now: several design sessions with the team next week. That refactoring should improve the situation (though in general case we need other improvements as well to make performance on first boot acceptable). Coincidentally, RemoveEx should fully benefit from the refactoring alone, as it doesn't need to talk to the TPM itself, though now has to wait for other tasks that do.
,
Nov 2
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/246e40140009cae9c92dcd493c0609031ee33ec3 commit 246e40140009cae9c92dcd493c0609031ee33ec3 Author: Daniel Erat <derat@chromium.org> Date: Fri Nov 02 20:23:01 2018 tast-tests: Wait to delete cryptohome in chrome.New. Make chrome.New run "cryptohome --action=remove" after asking session_manager to restart Chrome for testing, rather than earlier (after stopping the ui Upstart job). It looks like there's a race when we run it earlier that can apparently result in the cryptohome command (and all future cryptohome commands) hanging. I'm doing this to try to restore the original timing that we had. BUG= chromium:901363 TEST=ran ui.ChromeLoginForever successfully Change-Id: I554f2119aa65a7a0c0b36dc912fa5897e3ad618a Reviewed-on: https://chromium-review.googlesource.com/c/1315331 Tested-by: Dan Erat <derat@chromium.org> Trybot-Ready: Dan Erat <derat@chromium.org> Reviewed-by: James Cook <jamescook@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/246e40140009cae9c92dcd493c0609031ee33ec3/src/chromiumos/tast/local/chrome/chrome.go
,
Nov 3
Sorry, I conflated the cryptohome failure here with a race that likely just happens on VM builders (which I described in #2). See e.g. http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8930999976767462272. It typically manifests itself as a failure in arc.Boot, which is the first test that runs. In that case, cryptohome exited with status code 1 rather than hanging: 2018/11/01 23:56:14 [23:56:13.230] Waiting for DUT to be ready for testing 2018/11/01 23:56:33 Started test arc.Boot 2018/11/01 23:56:33 [23:56:32.466] Restarting ui job 2018/11/01 23:56:34 [23:56:33.717] Removing cryptohome for testuser@gmail.com 2018/11/01 23:56:34 [23:56:33.766] External command failed: exit status 1 2018/11/01 23:56:34 [23:56:33.767] Command: cryptohome --action=remove --force --user=testuser@gmail.com 2018/11/01 23:56:34 [23:56:33.767] Uncaptured output: Remove call failed: The name org.chromium.Cryptohome was not provided by any .service files. This is what I've been trying to fix by making us wait longer for the D-Bus service to be ready. I haven't seen this happen in the last few amd64-generic-paladin, betty-arc64-paladin, and betty-paladin runs, so I'm hopeful that it's finally resolved. ---- The failure that xiyuan@ originally described here, and that apronin@ investigated, is different. It was on daisy-release, and other tests that performed login passed before ui.MashLogin failed. I've seen a bunch of odd daisy failures in different tests, and I suspect that many of them (not sure about this one) were just caused by the slowness of the (almost-EOL) hardware. Re waiting for TPM initialization to be complete before running tests, that sounds great! Andrey, can you offer suggestions about the right way to do that (ideally using either command-line programs or D-Bus calls, so we don't need to perform a full Chrome login before starting tests)? I'm thinking of something along the lines of: - run "cryptohome --action=status" - if tpm.being_owned, run "cryptohome --action=tpm_wait_ownership" - else if !tpm.owned, run "cryptohome --action=tpm_take_ownership" (These could be D-Bus calls to cryptohomed instead if that's preferable.) Would something like that help here?
,
Nov 4
There's another failure on daisy at http://stainless/browse/chromeos-autotest-results/255010406-chromeos-test/ : 2018/11/03 14:50:11 Started test ui.MashLogin 2018/11/03 14:50:11 [14:50:11.025] Restarting ui job 2018/11/03 14:50:12 [14:50:12.414] Waiting for org.chromium.SessionManager D-Bus service 2018/11/03 14:50:12 [14:50:12.452] Asking session_manager to enable Chrome testing 2018/11/03 14:50:12 [14:50:12.461] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort 2018/11/03 14:50:14 [14:50:14.359] Removing cryptohome for testuser@gmail.com 2018/11/03 14:52:11 [14:52:11.028] External command timed out 2018/11/03 14:52:11 [14:52:11.028] Command: cryptohome --action=remove --force --user=testuser@gmail.com 2018/11/03 14:52:11 [14:52:11.029] Uncaptured output: 2018/11/03 14:52:11 [14:52:11.030] Error at mash_login.go:26: Chrome login failed: failed to remove cryptohome: context deadline exceeded
,
Nov 5
Re #9: as for the first failure fixed by #8, there's corresponding utility in autotest. https://chromium.git.corp.google.com/chromiumos/third_party/autotest/+/4d3c33f191d3959a47b7883b5998e6d2e9ef1ab4/client/cros/ownership.py#84 According to the comment, the removal should be done while ui is being stopped? Should we wait for cryptohomed is up, instead?
,
Nov 5
#11: I'm not sure I follow. The current Tast code wipes /var/lib/whitelist while the ui job is stopped, similar to the Autotest code. The issue discussed in #9 is different: the cryptohome command that we use to wipe the encrypted home dir can apparently fail if we run it too quickly after cryptohomed starts (which can happen easily on VM builders). I don't think that that's still happening after #8, though. Please let me know if I'm misunderstanding.
,
Nov 5
The daisy "failed to remove cryptohome: context deadline exceeded" timeout failures are still happening, though. Andrey, does doing something like what I suggested at the end of #9 before running any tests sound correct? - run "cryptohome --action=status" - if tpm.being_owned, run "cryptohome --action=tpm_wait_ownership" - else if !tpm.owned, run "cryptohome --action=tpm_take_ownership"
,
Nov 6
Re #12: My understanding was; we had two issues. 1) the timing to remove the user data dir was too early (cryptohome D-Bus service may not start yet), and 2) if D-Bus call is too early even after cryptohome D-Bus service starts, it would be stuck. The fix at #8 is moving the timing later, after ui restarts. Autotest wipes /var/lib/whitelist (including policy file explicitly), in addition, /user/chronos/Local State. https://chromium.git.corp.google.com/chromiumos/third_party/autotest/+/4d3c33f191d3959a47b7883b5998e6d2e9ef1ab4/client/cros/ownership.py#105 while ui is being stopped, and according to the comment "while ui is being stopped" seems required condition (BTW, is it also for removing Local State dir, too? or only for white list dir...?). (if so,) Is it necessary for tast to follow the manner? So, my question here is; if the timing is problematic, should we a) wait for cryptohome D-Bus service is up (for 1.) during ui is stopped instead of moving it after ui restart, and b) fix some timing issue in cryptohome (for 2.)?
,
Nov 6
/home/chronos/Local State is written by Chrome, so the ui job likely needs to be stopped first to ensure that Chrome won't rewrite it after you remove it. Re waiting for the cryptohome service, we actually do it even earlier now, at the end of the "ready" function defined in chromiumos/tast/local/ready/ready.go (i.e. before running any tests). I think that it'd be good to fix the potential race around the early dbus_g_object_type_install_info call described in #5, since we're still counting on the fact that cryptohomed will happen to be ready by the time that we've finished restarting the ui job.
,
Nov 6
> the potential race around the early dbus_g_object_type_install_info call described in #5 Agreed. I looked closer, and there is a separate brillo::dbus::RegisterExclusiveService happening _after_ full initialization, incl mount thread creation, is done. So, it's quite possible we don't expose anything on dbus until then, and there's no race. But makes sense to double-check. Will create a separate bug.
,
Nov 13
Re #13: to make sure that full initialization incl attestation steps is done (to avoid running in parallel with PrepareForEnrollment - item 4 in comment #6), this will work - run "cryptohome --action=tpm_more_status" - if not attestation_prepared: true -- run "cryptohome --action=tpm_take_ownership" - while not attestation_prepared: true -- wait
,
Nov 25
,
Nov 25
Issue 902318 has been merged into this issue.
,
Nov 28
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/ec9b6b38a58a7204a4a3cb9127c8eb4418b17479 commit ec9b6b38a58a7204a4a3cb9127c8eb4418b17479 Author: Daniel Erat <derat@chromium.org> Date: Wed Nov 28 03:14:09 2018 tast-tests: Take ownership of TPM before testing. Update the "ready" package to attempt to ensure that the TPM is initialized before testing starts. Also make it wait for the cryptohomed process to be up for at least 10 seconds before running the cryptohome command to try to avoid triggering hangs that we've seen just after startup in the past. Also avoid waiting for Upstart jobs that aren't installed. BUG= chromium:901363 TEST=ran "crossystem clear_tpm_owner_request=1" and rebooted to clear ownership, then ran ui.ChromeLogin and verified that ownership was retaken before the test was run. also verified that ownership is skipped on a VM Change-Id: Ib205dc268b5e4bea6ea13d77e86037440740c102 Reviewed-on: https://chromium-review.googlesource.com/1350289 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> Reviewed-by: Andrey Pronin <apronin@chromium.org> [modify] https://crrev.com/ec9b6b38a58a7204a4a3cb9127c8eb4418b17479/src/chromiumos/tast/local/ready/ready.go
,
Nov 29
Chung-yih reported in https://crrev.com/c/1351971 that on his DUT (I'm not sure of the specifics), this code is always hanging for two minutes before timing out: (cr) ((8c78090...)) cylee@cylee1 ~/trunk/src/scripts $ tast --verbose run 10.10.10.1 vm.CrostiniDiskIOPerf 2018/11/29 09:34:58 Command line: tast --verbose run 10.10.10.1 vm.CrostiniDiskIOPerf 2018/11/29 09:34:58 Using SSH key /home/cylee/trunk/chromite/ssh_keys/testing_rsa 2018/11/29 09:34:58 Using SSH dir /home/cylee/.ssh 2018/11/29 09:34:58 Writing results to /tmp/tast/results/20181129-093458 2018/11/29 09:34:58 Connecting to 10.10.10.1 2018/11/29 09:34:58 Getting architecture from target 2018/11/29 09:34:58 Building chromiumos/tast/local/bundles/cros from /home/cylee/trunk/src/platform/tast-tests:/home/cylee/trunk/src/platform/tast:/usr/lib/gopath 2018/11/29 09:35:00 Built test bundle in 1.634s 2018/11/29 09:35:00 Pushing test bundle /tmp/tast/build/x86_64/local_bundles/cros to /usr/local/libexec/tast/bundles/local_pushed on target 2018/11/29 09:35:00 Pushed test bundle in 571ms (sent 3.6 MB) 2018/11/29 09:35:00 Getting data file list from target 2018/11/29 09:35:01 Got data file list with 5 file(s) 2018/11/29 09:35:01 Fetching external data files to /tmp/tast/external_data 2018/11/29 09:35:01 Pushing data files to target 2018/11/29 09:35:01 Pushed data files in 51ms (sent 0 B) 2018/11/29 09:35:01 Test software dependencies not checked for wildcard/literal test patterns 2018/11/29 09:35:01 Getting initial system state 2018/11/29 09:35:01 [09:35:00.541] Waiting for DUT to be ready for testing 2018/11/29 09:35:01 [09:35:00.677] TPM not initialized; taking ownership now to ensure that tests aren't blocked during login ... Then it hangs and fails like this (from an earlier run): 2018/11/29 08:55:06 [08:55:05.674] TPM not initialized; taking ownership now to ensure that tests aren't blocked during login 2018/11/29 08:55:36 [08:55:35.524] Still waiting for important system services to be running 2018/11/29 08:56:06 [08:56:05.524] Still waiting for important system services to be running 2018/11/29 08:56:36 [08:56:35.524] Still waiting for important system services to be running 2018/11/29 08:57:06 [08:57:05.525] Still waiting for important system services to be running 2018/11/29 08:57:06 [08:57:05.554] Failed ensuring that TPM is initialized: context deadline exceeded "cryptohome --action=tpm_more_status" logs this: [cryptohome.GetTpmStatusReply.reply] { enabled: true owned: true initialized: false owner_password: "" attestation_prepared: false attestation_enrolled: false dictionary_attack_counter: 0 dictionary_attack_threshold: 200 dictionary_attack_lockout_in_effect: false dictionary_attack_lockout_seconds_remaining: 0 install_lockbox_finalized: false boot_lockbox_finalized: true verified_boot_measured: false } GetTpmStatus success. "cryptohome --action=tpm_take_ownership" presumably succeeds, right? (I think it may always succeed.) Are there any cryptohome-related messages in /var/log/messages after you do that? What does "cryptohome --action=tpm_wait_ownership" log? Andrey may have more ideas. If it's unreleased hardware, we should probably file a new RVG bug for continued discussion.
,
Nov 29
localhost ~ # cryptohome --action=tpm_take_ownership localhost ~ # echo $? 0 localhost ~ # cryptohome --action=tpm_wait_ownership TPM is not currently being owned. While tast is blocked by the poll, there's nothing error output. After the 2-minutes blocking I can see related log 2018-11-29T10:11:04.326115+08:00 WARNING cryptohomed[1269]: ManageCCDPwd is not supported 2018-11-29T10:11:05.194115+08:00 INFO cryptohomed[1269]: Pinweaver IsSupported: 0 2018-11-29T10:11:06.436198+08:00 WARNING cryptohomed[1269]: Could not load the device policy file. 2018-11-29T10:11:06.647080+08:00 WARNING cryptohomed[1269]: message repeated 5 times: [ Could not load the device policy file.] 2018-11-29T10:11:07.569340+08:00 INFO cryptohomed[1269]: Migrated (or created) user directory: /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount 2018-11-29T10:11:07.571361+08:00 INFO cryptohomed[1269]: Creating pass-through directories /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/Cache 2018-11-29T10:11:07.571837+08:00 INFO cryptohomed[1269]: Creating pass-through directories /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/Downloads 2018-11-29T10:11:07.572259+08:00 INFO cryptohomed[1269]: Creating pass-through directories /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/GCache 2018-11-29T10:11:07.572731+08:00 INFO cryptohomed[1269]: Creating pass-through directories /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/GCache/v1 2018-11-29T10:11:07.573119+08:00 INFO cryptohomed[1269]: Creating pass-through directories /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/GCache/v2 2018-11-29T10:11:07.573594+08:00 INFO cryptohomed[1269]: Creating pass-through directories /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/GCache/v1/blobs 2018-11-29T10:11:07.574062+08:00 INFO cryptohomed[1269]: Creating pass-through directories /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/GCache/v1/tmp 2018-11-29T10:11:07.575819+08:00 INFO cryptohomed[1269]: RecursiveCopy: /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/log 2018-11-29T10:11:07.576317+08:00 INFO cryptohomed[1269]: RecursiveCopy: /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/.pki 2018-11-29T10:11:07.576823+08:00 INFO cryptohomed[1269]: RecursiveCopy: /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/.pki/nssdb 2018-11-29T10:11:07.578945+08:00 INFO cryptohomed[1269]: RecursiveCopy: /home/.shadow/2c5859c2a92932288ce902f6c2a0746270229178/mount/user/.ssh 2018-11-29T10:11:07.597585+08:00 WARNING cryptohomed[1269]: ManageCCDPwd is not supported 2018-11-29T10:11:07.597817+08:00 WARNING cryptohomed[1269]: Could not load the device policy file. 2018-11-29T10:11:08.456796+08:00 INFO cryptohomed[1269]: Putting a Pkcs11_Initialize on the mount thread. 2018-11-29T10:11:08.528956+08:00 INFO cryptohomed[1269]: A Pkcs11_Init event got finished. 2018-11-29T10:11:08.529019+08:00 INFO cryptohomed[1269]: PKCS#11 initialization succeeded. 2018-11-29T10:11:14.288097+08:00 WARNING cryptohomed[1269]: RemoveKeyset: key to remove not found
,
Nov 29
Hmm. Andrey, any ideas about what could be going wrong here?
,
Nov 29
Forget to append my DUT infor: CHROMEOS_RELEASE_BUILDER_PATH=eve-release/R72-11219.0.0 CHROMEOS_RELEASE_BUILD_NUMBER=11219 CHROMEOS_RELEASE_BRANCH_NUMBER=0 CHROMEOS_RELEASE_CHROME_MILESTONE=72 CHROMEOS_RELEASE_PATCH_NUMBER=0 CHROMEOS_RELEASE_TRACK=testimage-channel CHROMEOS_RELEASE_DESCRIPTION=11219.0.0 (Official Build) dev-channel eve test
,
Nov 29
,
Nov 29
Do we have logs from the device? I don't see it in the discussion above. I see "attestation_prepared: false" on the device, and that's what we are waiting for. Possibly, some issues with PrepareForAttestation in attestationd on that device after TakeOwnership completed. Or, it just takes longer than 2 minutes. But need logs to check.
,
Nov 29
Hi apronin, It looks there's some attestation errors. I didn't copy them before I refreshed my image to eve-release/R72-11313.0.0 . After the re-image I get another type of error: [05:22:22.493] Failed waiting for cryptohome D-Bus service: org.chromium.Cryptohome D-Bus service unavailable: context deadline exceeded 2018/11/30 05:22:23 Started test vm.CrostiniStartTime 2018/11/30 05:22:23 [05:22:22.495] Restarting ui job 2018/11/30 05:22:24 [05:22:23.849] Waiting for org.chromium.SessionManager D-Bus service 2018/11/30 05:22:24 [05:22:23.850] Asking session_manager to enable Chrome testing 2018/11/30 05:22:24 [05:22:23.850] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort 2018/11/30 05:22:25 [05:22:24.519] Removing cryptohome for testuser@gmail.com 2018/11/30 05:22:25 [05:22:24.546] External command failed: exit status 1 2018/11/30 05:22:25 [05:22:24.546] Command: cryptohome --action=remove --force --user=testuser@gmail.com 2018/11/30 05:22:25 [05:22:24.546] Uncaptured output: Remove call failed: The name org.chromium.Cryptohome was not provided by any .service files. I get this in my /var/log/messages 2018-11-30T05:18:22.804274+08:00 WARNING cryptohomed[1269]: ManageCCDPwd is not supported 2018-11-30T05:19:06.407904+08:00 INFO cryptohomed[1276]: Protocol version: 0 2018-11-30T05:19:06.407962+08:00 INFO cryptohomed[1276]: Pinweaver IsSupported: 0 2018-11-30T05:19:06.533337+08:00 WARNING cryptohomed[1276]: Init() failed to read attributes file. 2018-11-30T05:19:06.546666+08:00 ERR cryptohomed[1276]: AddDBusError(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.Attestation was not provided by any .service files 2018-11-30T05:19:06.547644+08:00 CRIT cryptohomed[1276]: SetSystemSalt error: 2#012/usr/lib64/libbase-core-456626.so(_ZN4base5debug10StackTraceC1Em+0x1c) [0x79765be8937c]#012 2018-11-30T05:19:06.574133+08:00 WARNING crash_reporter[1776]: [user] Received crash notification for cryptohomed[1276] sig 6, user 0 group 0 (developer build - not testing - always dumping) 2018-11-30T05:19:06.644293+08:00 INFO crash_reporter[1776]: Stored minidump to /var/spool/crash/cryptohomed.20181130.051906.1276.dmp 2018-11-30T05:19:06.646066+08:00 INFO crash_reporter[1776]: Leaving core file at /proc/self/fd/5/cryptohomed.20181130.051906.1276.core due to developer image 2018-11-30T05:19:06.652042+08:00 WARNING kernel: [ 3.675888] init: cryptohomed main process (1276) killed by ABRT signal 2018-11-30T05:19:06.652057+08:00 WARNING kernel: [ 3.675908] init: cryptohomed main process ended, respawning 2018-11-30T05:19:06.662020+08:00 WARNING kernel: [ 3.685580] init: cryptohomed-client main process (1791) terminated with status 1 ... And it repeated for many times.
,
Nov 29
So, attestationd is not running: 2018-11-30T05:19:06.546666+08:00 ERR cryptohomed[1276]: AddDBusError(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.Attestation was not provided by any .service files 2018-11-30T05:19:06.547644+08:00 CRIT cryptohomed[1276]: SetSystemSalt error: 2#012/usr/lib64/libbase-core-456626.so(_ZN4base5debug10StackTraceC1Em+0x1c) [0x79765be8937c]#012 Are there any messages that mention 'attestationd' in the logs? And (in case attestationd is not running because trunksd is not running), anything from trunksd?
,
Nov 29
After a repo sync I get the original error message running (cr) ((0bacb33...)) cylee@cylee1 ~/trunk/src/scripts $ tast --verbose run 10.10.10.1 vm.CrostiniStartTime 2018/11/30 05:42:20 [05:42:20.388] Waiting for DUT to be ready for testing 2018/11/30 05:42:20 [05:42:20.539] TPM not initialized; taking ownership now to ensure that tests aren't blocked during login 2018/11/30 05:42:50 [05:42:50.388] Still waiting for important system services to be running 2018/11/30 05:43:20 [05:43:20.389] Still waiting for important system services to be running 2018/11/30 05:43:50 [05:43:50.389] Still waiting for important system services to be running 2018/11/30 05:44:20 [05:44:20.389] Still waiting for important system services to be running 2018/11/30 05:44:20 [05:44:20.406] Failed ensuring that TPM is initialized: context deadline exceeded grep attestationd in /var/log/messages messages:2018-11-30T05:19:04.909528+08:00 INFO attestationd[1218]: Attestation Daemon Started. messages:2018-11-30T05:19:04.909841+08:00 WARNING attestationd[1218]: libminijail[1218]: non-debug build: ignoring request to enable seccomp logging messages:2018-11-30T05:19:05.416542+08:00 INFO attestationd[1218]: Attestation service started. messages:2018-11-30T05:19:05.417431+08:00 NOTICE kernel: [ 2.440540] audit: type=1326 audit(1543526345.415:6): auid=4294967295 uid=247 gid=247 ses=4294967295 subj=u:r:chromeos:s0 pid=1218 comm="attestationd" exe="/usr/sbin/attestationd" sig=31 arch=c000003e syscall=24 compat=0 ip=0x7c4417b91347 code=0x0 messages:2018-11-30T05:40:08.425087+08:00 INFO attestationd[1246]: Attestation Daemon Started. messages:2018-11-30T05:40:08.425639+08:00 WARNING attestationd[1246]: libminijail[1246]: non-debug build: ignoring request to enable seccomp logging messages:2018-11-30T05:40:08.429181+08:00 INFO attestationd[1246]: Attestation service started. messages:2018-11-30T05:40:08.435766+08:00 INFO attestationd[1246]: Loading attestation database. messages:2018-11-30T05:40:08.435795+08:00 ERR attestationd[1246]: Failed to read attestation database: No such file or directory messages:2018-11-30T05:40:08.435801+08:00 WARNING attestationd[1246]: Creating new attestation database. messages:2018-11-30T05:40:09.054915+08:00 INFO attestationd[1246]: Attestation: Preparing for enrollment... messages:2018-11-30T05:40:09.246263+08:00 WARNING attestationd[1246]: TPM endorsement password is not available. messages:2018-11-30T05:40:09.246303+08:00 ERR attestationd[1246]: GetEndorsementPublicKey: EK not available. messages:2018-11-30T05:40:09.246327+08:00 ERR attestationd[1246]: Attestation: Failed to get RSA EK public key.
,
Nov 29
Please go through clearing the tpm owner and try again. I'm not sure what the original issue was that led to this situation, but at the moment: TPM is owned, but attestation is not prepared (either was never prepared or attestation database was deleted). It tries to go through PrepareForAttestation step, but the endorsement password is also not available (normally is kept in the tpm_manager database in stateful data after the tpm is owned). That prevents attestationd from getting the endorsement key and finishing PrepareForAttestation (the last 4 messages from the log). So, the test never get to the point when it is finished. Any chance you have wiped/overwritten stateful data on the device w/o clearing the tpm owner? That'd lead to this issue. Could also be one of the bugs that led to losing stateful partition, if you haven't gone through clearing the tpm owner in a while.
,
Nov 29
Cool, after clearing the TPM owner it backs to normal. Thanks to Andrey and Daniel.
,
Nov 29
Great! I'm going to close this bug since the change to initialize the TPM seems to be sticking, and I don't see any recent "failed to remove cryptohome: context deadline exceeded" errors (per http://stainless/search?view=list&first_date=2018-11-16&last_date=2018-11-29&branch=%5Emaster%24&test=%5Etast%5C..*%5C.&status=FAIL&reason=failed+to+remove+cryptohome&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false). These errors were sporadic, though, so there isn't conclusive evidence that we won't hit them anymore. I do still see "failed to remove cryptohome: exit status 1" errors, though. Those generally seem to be caused by the "cryptohome" command aborting due to the cryptohomed D-Bus service being unavailable. I've filed issue 910407 against myself to improve error reporting there.
,
Dec 1
Andrey, I also saw this code fail to initialize the TPM after I reimaged a caroline device with a ToT dev image that I just built. I had actually signed in once with a GAIA account after reimaging it, so I would've expected the device to be owned already.
Here's what tast printed:
2018/11/30 17:48:05 [17:48:04.774] Waiting for DUT to be ready for testing
2018/11/30 17:48:07 [17:48:06.853] TPM not initialized; taking ownership now to ensure that tests aren't blocked during login
2018/11/30 17:48:35 [17:48:34.775] Still waiting for important system services to be running
2018/11/30 17:49:05 [17:49:04.776] Still waiting for important system services to be running
2018/11/30 17:49:35 [17:49:34.776] Still waiting for important system services to be running
2018/11/30 17:50:05 [17:50:04.777] Still waiting for important system services to be running
2018/11/30 17:50:05 [17:50:04.808] Failed ensuring that TPM is initialized: context deadline exceeded
The underlying issue seems different from what Cheng-yu described above, though. Here's what "cryptohome --action=tpm_more_status" prints (after a several-second delay, which I don't think I've seen before):
[cryptohome.GetTpmStatusReply.reply] {
enabled: true
owned: false
initialized: false
attestation_prepared: false
attestation_enrolled: false
install_lockbox_finalized: false
boot_lockbox_finalized: false
verified_boot_measured: false
}
GetTpmStatus success.
In /var/log/messages, I see this soon after boot:
2018-12-01T01:41:32.291676+00:00 WARNING kernel: [ 0.257742] ------------[ cut here ]------------
2018-12-01T01:41:32.291679+00:00 WARNING kernel: [ 0.257747] WARNING: CPU: 2 PID: 1 at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-3_18-3.18-r2239/work/chromeos-kernel-3_18-3.18/lib/iomap.c:43 bad_io_access+0x36/0x38()
2018-12-01T01:41:32.291680+00:00 ERR kernel: [ 0.257749] Bad IO access at port 0x84 (return inl(port))
2018-12-01T01:41:32.291680+00:00 WARNING kernel: [ 0.257751] Modules linked in:
2018-12-01T01:41:32.291681+00:00 WARNING kernel: [ 0.257755] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 3.18.0-18744-gbb60454be5e6 #1
2018-12-01T01:41:32.291681+00:00 WARNING kernel: [ 0.257757] Hardware name: Google Caroline/Caroline, BIOS Google_Caroline.7820.356.0 12/05/2017
2018-12-01T01:41:32.291682+00:00 WARNING kernel: [ 0.257759] 0000000000000000 000000008d1134dd ffff88017a22bce8 ffffffff923914ce
2018-12-01T01:41:32.291685+00:00 WARNING kernel: [ 0.257763] 0000000000000000 ffff88017a22bd40 ffff88017a22bd28 ffffffff91d8e2ed
2018-12-01T01:41:32.291686+00:00 WARNING kernel: [ 0.257765] 00000000000fed44 ffff880179803728 ffff88017a3f4000 ffff880179215410
2018-12-01T01:41:32.291686+00:00 WARNING kernel: [ 0.257769] Call Trace:
2018-12-01T01:41:32.291687+00:00 WARNING kernel: [ 0.257774] [<ffffffff923914ce>] dump_stack+0x4e/0x71
2018-12-01T01:41:32.291688+00:00 WARNING kernel: [ 0.257778] [<ffffffff91d8e2ed>] warn_slowpath_common+0x84/0x9f
2018-12-01T01:41:32.291688+00:00 WARNING kernel: [ 0.257782] [<ffffffff91d8e401>] warn_slowpath_fmt+0x55/0x6b
2018-12-01T01:41:32.291689+00:00 WARNING kernel: [ 0.257785] [<ffffffff91f7bb37>] bad_io_access+0x36/0x38
2018-12-01T01:41:32.291692+00:00 WARNING kernel: [ 0.257787] [<ffffffff91f7bd3c>] ioread32+0x2a/0x2f
2018-12-01T01:41:32.291692+00:00 WARNING kernel: [ 0.257792] [<ffffffff9201d711>] tpm_tis_init+0xb3/0x131
2018-12-01T01:41:32.291693+00:00 WARNING kernel: [ 0.257794] [<ffffffff929574fd>] init_tis+0x96/0xd3
2018-12-01T01:41:32.291694+00:00 WARNING kernel: [ 0.257797] [<ffffffff92957467>] ? tpm_init+0x80/0x80
2018-12-01T01:41:32.291694+00:00 WARNING kernel: [ 0.257804] [<ffffffff91d2cff0>] do_one_initcall+0x18f/0x1a4
2018-12-01T01:41:32.291695+00:00 WARNING kernel: [ 0.257813] [<ffffffff929210cf>] kernel_init_freeable+0x198/0x238
2018-12-01T01:41:32.291697+00:00 WARNING kernel: [ 0.257817] [<ffffffff9238ec20>] ? rest_init+0x84/0x84
2018-12-01T01:41:32.291698+00:00 WARNING kernel: [ 0.257820] [<ffffffff9238ec2e>] kernel_init+0xe/0xda
2018-12-01T01:41:32.291699+00:00 WARNING kernel: [ 0.257823] [<ffffffff923962ae>] ret_from_fork+0x6e/0xa0
2018-12-01T01:41:32.291699+00:00 WARNING kernel: [ 0.257825] [<ffffffff9238ec20>] ? rest_init+0x84/0x84
2018-12-01T01:41:32.291700+00:00 WARNING kernel: [ 0.257830] ---[ end trace f1cbf46f90dc7ba8 ]---
...
2018-12-01T01:41:32.291705+00:00 INFO kernel: [ 0.275862] tpm_tis tpm_tis: 1.2 TPM (device-id 0x1B, rev-id 16) [gentle shutdown]
...
2018-12-01T01:41:43.594503+00:00 WARNING tpm-manager[2377]: TSS: Failed unix connect: /run/tcsd/tcsd.socket - No such file or directory
2018-12-01T01:41:43.594812+00:00 WARNING tpm-manager[2377]: TSS: Got a list of valid IPs
2018-12-01T01:41:43.594934+00:00 WARNING tpm-manager[2377]: TSS: Could not connect to machine: localhost
2018-12-01T01:41:43.594952+00:00 ERR tpm-manager[2377]: TSS: Could not connect to any machine in the list.
2018-12-01T01:41:43.594958+00:00 ERR tpm-manager[2377]: TSS: Failed to send packet
Messages similar to those are also repeated by cryptohomed many times in the log:
2018-12-01T01:48:12.214049+00:00 ERR cryptohomed[1502]: TPM error 0x3011 (Communication failure): Error calling Tspi_Context_Connect
2018-12-01T01:48:12.214311+00:00 ERR cryptohomed[1502]: ConnectContextAsUser: Could not open the TPM
2018-12-01T01:48:12.214357+00:00 ERR cryptohomed[1502]: GetDictionaryAttackInfo: Failed to connect to the TPM.
2018-12-01T01:48:12.214524+00:00 WARNING cryptohomed[1502]: TSS: Failed unix connect: /run/tcsd/tcsd.socket - No such file or directory
2018-12-01T01:48:12.214710+00:00 WARNING cryptohomed[1502]: TSS: Got a list of valid IPs
2018-12-01T01:48:12.214966+00:00 WARNING cryptohomed[1502]: TSS: Could not connect to machine: localhost
2018-12-01T01:48:12.215055+00:00 ERR cryptohomed[1502]: TSS: Could not connect to any machine in the list.
2018-12-01T01:48:12.215077+00:00 ERR cryptohomed[1502]: TSS: Failed to send packet
Tests are actually still able to log in on this device, so presumably cryptohomed is functional. But I'd like to know how the device got into this state when I reimaged it, and whether there's any way for Tast to fix it (since while it's in this state, there's a two-minute delay every time I try to run tests until we give up on initializing the TPM).
I *did* see a stateful-related problem when reimaging the device, though. I initially tried to update it with 'cros flash', but got a Python error ("ImportError: No module named six") and told the script to restore the stateful partition, which it didn't do successfully. Then I booted off USB and ran chromeos-install. Could that have caused that? Is there any way for Tast to detect this so we don't wait for two minutes?
,
Dec 1
Do we have full logs? This: ... 2018-12-01T01:41:32.291689+00:00 WARNING kernel: [ 0.257785] [<ffffffff91f7bb37>] bad_io_access+0x36/0x38 2018-12-01T01:41:32.291692+00:00 WARNING kernel: [ 0.257787] [<ffffffff91f7bd3c>] ioread32+0x2a/0x2f 2018-12-01T01:41:32.291692+00:00 WARNING kernel: [ 0.257792] [<ffffffff9201d711>] tpm_tis_init+0xb3/0x131 2018-12-01T01:41:32.291693+00:00 WARNING kernel: [ 0.257794] [<ffffffff929574fd>] init_tis+0x96/0xd3 ... may or may not be a problem. Probe for tpm driver fails. But we have a forced probe for the standard tpm driver, which looks for tpm registers at a known location, and shouldn't find them there on caroline. If a subsequent 2nd probe for the actual tpm device on SPI succeeds, then this is likely WAI. And looks like we do get the 2nd successful probe: 2018-12-01T01:41:32.291705+00:00 INFO kernel: [ 0.275862] tpm_tis tpm_tis: 1.2 TPM (device-id 0x1B, rev-id 16) [gentle shutdown] Still, something goes wrong later when starting tcsd, so we don't get a tcsd socket that all daemons use to connect to the TPM: TSS: Failed unix connect: /run/tcsd/tcsd.socket - No such file or directory So, after that all daemons, including cryptohomed can't talk to the TPM. cryptohomed runs in software-only mode. In this mode it allows user logins. But other functionality that requires TPM, like attestation (and reaching attestation_prepared), fails. So, whatever the issue is it's the issue with starting tcsd and creating its socket. And it happens between these two messages: 2018-12-01T01:41:32.291705+00:00 INFO kernel: [ 0.275862] tpm_tis tpm_tis: 1.2 TPM (device-id 0x1B, rev-id 16) [gentle shutdown] ... 2018-12-01T01:41:43.594503+00:00 WARNING tpm-manager[2377]: TSS: Failed unix connect: /run/tcsd/tcsd.socket - No such file or directory
,
Dec 1
#35: Thanks! Uploaded the messages file at https://drive.google.com/file/d/1J_rDHJDUvbHkT9OIIwTAUeBwj_wZNDQn/view?usp=sharing -- let me know if there's anything else that'd be useful. Maybe I should make the "Failed ensuring that TPM is initialized" message also include a suggestion to clear the TPM, if that's a reasonable way to get out of these bad states.
,
Dec 11
I'm doing local testing on a Scarlet. It's sitting at OOBE, and I get a similar timeout message as Dan in #34, except I don't have any complaints in dmesg or syslog about anything TPM-related.
Here's my status:
# cryptohome --action=tpm_more_status
[cryptohome.GetTpmStatusReply.reply] {
enabled: true
owned: true
initialized: false
owner_password: ""
attestation_prepared: false
attestation_enrolled: false
dictionary_attack_counter: 0
dictionary_attack_threshold: 200
dictionary_attack_lockout_in_effect: false
dictionary_attack_lockout_seconds_remaining: 0
install_lockbox_finalized: false
boot_lockbox_finalized: false
verified_boot_measured: false
}
GetTpmStatus success.
# cryptohome --action=tpm_more_status
[cryptohome.GetTpmStatusReply.reply] {
enabled: true
owned: true
initialized: false
owner_password: ""
attestation_prepared: false
attestation_enrolled: false
dictionary_attack_counter: 0
dictionary_attack_threshold: 200
dictionary_attack_lockout_in_effect: false
dictionary_attack_lockout_seconds_remaining: 0
install_lockbox_finalized: false
boot_lockbox_finalized: false
verified_boot_measured: false
}
GetTpmStatus success.
,
Dec 11
#37: Sorry about the slowness. :-( Andrey said he'll look at this Tuesday morning. If it's not straightforward to make the detection and initialization logic robust, I'll either make Tast print explicit instructions for putting the TPM back into a good state or make this check only run in the lab.
,
Dec 11
Re #36: no add'l info from full logs. /dev/tpm0 is created fine: 2018-12-01T01:41:32.291705+00:00 INFO kernel: [ 0.275862] tpm_tis tpm_tis: 1.2 TPM (device-id 0x1B, rev-id 16) [gentle shutdown] Then tcsd dies (and doesn't restart), so cryptohomed can't talk to the TPM through the socket tcsd was suposed to create: 2018-12-01T01:41:33.419026+00:00 WARNING kernel: [ 3.433963] init: tcsd main process (1126) terminated with status 127 2018-12-01T01:41:33.419039+00:00 WARNING kernel: [ 3.433991] init: tcsd main process ended, respawning ... 2018-12-01T01:41:33.619394+00:00 WARNING cryptohomed[1502]: TSS: Failed unix connect: /run/tcsd/tcsd.socket - No such file or directory 2018-12-01T01:41:33.619859+00:00 WARNING cryptohomed[1502]: TSS: Got a list of valid IPs 2018-12-01T01:41:33.619889+00:00 WARNING cryptohomed[1502]: TSS: Could not connect to machine: localhost 2018-12-01T01:41:33.619900+00:00 ERR cryptohomed[1502]: TSS: Could not connect to any machine in the list. 2018-12-01T01:41:33.619904+00:00 ERR cryptohomed[1502]: TSS: Failed to send packet No additional info on the reasons why tcsd died. Did it happen once, or it's possible to reproduc it?
,
Dec 11
Re #37: do you have full logs? what OS image do you use? The important parts are: owned: true owner_password: "" attestation_prepared: false (ignore 'initialized' for TPM 2.0) Looks like the TPM was owned and the owner password has been discarded, but PrepareForEnrollment hasn't succeeded. Could be: (1) stateful partition wiped w/o clearing the tpm (then the attestation db is lost and 'attestation_prepared' = false, and no way to prepare it again w/o going through clearing the owner first, since the owner password is unavailable already); (2) an actual issue with going through PrepareForEnrollment; (3) some other issue with attestationd like issue 910811 . The fix landed in 11338.0.0/11316.6.0.
,
Dec 11
Btw, why are we tracking #36 and #37 on this bug? Looks like tast suite is doing its job and not running non-first-boot tests if device initialization fails. Separately from that, there are two errors (one - for tcsd on TPM 1.2 device - #36, the other - for something else on TPM 2.0 device) that prevented initialization on two DUTs.
,
Dec 11
#40: full logs (from 'generate_logs'): https://drive.google.com/drive/folders/1AtlA3W4TgTTvu3Pc0gmjtSC5yK2DhH9l?usp=sharing #41: Why here? Because Dan reopened at comment #34 :) Maybe it does deserve one or more new bugs. But the essence of these comments is: it is apparently easy for dev devices to get in a bad shape such that whatever TPM status checks were just added here can fail easily. So at a minimum, there should be an easy way to get dev systems out of this state. (BTW, call me a n00b, but I still don't see any instructions on how to wipe the TPM. So I'm currently just hacking out the $subject changes from tast-tests.) Side note: these checks are totally irrelevant to the Tasts I'm trying to run. Maybe that's not a huge deal, but it's one more dependency that can screw up testing. Imagine eventually, we have true BVT tests we want to run during bringup -- there's no guarantee TPM will be in good enough shape for that. Maybe that's out of scope here, or maybe I'm derailing, so feel free to ignore. But it's food for thought.
,
Dec 11
> #40: full logs (from 'generate_logs'): 1) In the recent logs I see: 2018-12-10T13:27:09.379586-08:00 ERR attestationd[1287]: libminijail[1287]: blocked syscall: sched_yield 2018-12-10T13:27:09.437663-08:00 WARNING crash_reporter[1296]: [user] Received crash notification for attestationd[1287] sig 6, user 247 group 247 (developer build - not testing - always dumping) 2018-12-10T13:27:09.469054-08:00 WARNING kernel: [ 5.940646] init: attestationd main process (1287) killed by ABRT signal So, issue 910811 is indeed still not fixed in the DUT image. I recommend updating the image (fix landed in 11338.0.0/11316.6.0; the running image is a custom 11302.0.2018_11_28_1603). 2) However, attestationd then restarts and seems to be doing just fine, except for the attestation database indeed not being there: 2018-12-10T13:27:09.469104-08:00 WARNING kernel: [ 5.947554] init: attestationd main process ended, respawning 2018-12-10T13:27:09.566569-08:00 INFO attestationd[1386]: Attestation Daemon Started. 2018-12-10T13:27:09.584997-08:00 ERR attestationd[1386]: Failed to read attestation database: No such file or directory 2018-12-10T13:27:09.585036-08:00 WARNING attestationd[1386]: Creating new attestation database. So, it can't prepare for enrollment: 2018-12-10T13:27:09.760172-08:00 INFO attestationd[1386]: Attestation: Preparing for enrollment... 2018-12-10T13:27:09.878996-08:00 WARNING attestationd[1386]: TPM endorsement password is not available. 2018-12-10T13:27:09.879055-08:00 ERR attestationd[1386]: GetEndorsementPublicKey: EK not available. 2018-12-10T13:27:09.879086-08:00 ERR attestationd[1386]: Attestation: Failed to get RSA EK public key. Looking in the past, preparing for enrollment never succeeded for the same reasons. The first attempt was on 12/05: 2018-12-05T14:12:45.516291-08:00 ERR attestationd[1291]: Failed to read attestation database: No such file or directory 2018-12-05T14:12:45.516342-08:00 WARNING attestationd[1291]: Creating new attestation database. 2018-12-05T14:12:51.117840-08:00 INFO attestationd[1291]: Attestation: Preparing for enrollment... 2018-12-05T14:12:51.228374-08:00 WARNING attestationd[1291]: TPM endorsement password is not available. 2018-12-05T14:12:51.228422-08:00 ERR attestationd[1291]: GetEndorsementPublicKey: EK not available. 2018-12-05T14:12:51.228452-08:00 ERR attestationd[1291]: Attestation: Failed to get RSA EK public key. 3) Looking at even earlier log entries. The log starts on 12/04. On that boot, /dev/tpm0 device was not created, which caused crashes for trunksd, attestationd, tpm_managerd, cryptohomed (by design, they die in such case). Looks like something was wrong with Linux kernel and tpm probe was not even started - at least I see no traces of it in the logs. Possibly, because of this: 2018-12-05T00:29:41.553890+00:00 WARNING kernel: [ 4.147762] ------------[ cut here ]------------ 2018-12-05T00:29:41.553893+00:00 WARNING kernel: [ 4.152936] mmc1: cqhci: spurious TCN for tag 31 2018-12-05T00:29:41.553896+00:00 WARNING kernel: [ 4.158137] WARNING: CPU: 0 PID: 0 at /mnt/host/source/src/third_party/kernel/v4.4/drivers/mmc/host/cqhci.c:725 cqhci_irq+0x318/0x424 2018-12-05T00:29:41.553898+00:00 WARNING kernel: [ 4.171548] Modules linked in: joydev 2018-12-05T00:29:41.553905+00:00 WARNING kernel: [ 4.175641] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.20.0-rc5+ #72 2018-12-05T00:29:41.553907+00:00 WARNING kernel: [ 4.182835] Hardware name: Google Scarlet (DT) 2018-12-05T00:29:41.553910+00:00 WARNING kernel: [ 4.187798] pstate: 60000085 (nZCv daIf -PAN -UAO) 2018-12-05T00:29:41.553913+00:00 WARNING kernel: [ 4.193148] pc : cqhci_irq+0x318/0x424 2018-12-05T00:29:41.553915+00:00 WARNING kernel: [ 4.197333] lr : cqhci_irq+0x318/0x424 2018-12-05T00:29:41.553918+00:00 WARNING kernel: [ 4.201518] sp : ffffff8008003d40 2018-12-05T00:29:41.553920+00:00 WARNING kernel: [ 4.205216] x29: ffffff8008003d40 x28: ffffff80090db000 2018-12-05T00:29:41.553922+00:00 WARNING kernel: [ 4.211149] x27: 000000000000001f x26: 0000000000000001 2018-12-05T00:29:41.553929+00:00 WARNING kernel: [ 4.217083] x25: ffffffc0f0f4e498 x24: ffffff8008c020bc 2018-12-05T00:29:41.553932+00:00 WARNING kernel: [ 4.223016] x23: ffffff8009010000 x22: ffffff80090da9bc 2018-12-05T00:29:41.553934+00:00 WARNING kernel: [ 4.228949] x21: ffffffc0f0caf000 x20: 0000000000000002 2018-12-05T00:29:41.553937+00:00 WARNING kernel: [ 4.234883] x19: ffffffc0f0f4e480 x18: ffffff80084b2000 2018-12-05T00:29:41.553939+00:00 WARNING kernel: [ 4.240816] x17: 0000000000000000 x16: 0000000000000000 2018-12-05T00:29:41.553941+00:00 WARNING kernel: [ 4.246749] x15: 0000000000000000 x14: ffffffffffffffff 2018-12-05T00:29:41.553944+00:00 WARNING kernel: [ 4.252682] x13: 00000040eee05000 x12: 0000000000000000 2018-12-05T00:29:41.553946+00:00 WARNING kernel: [ 4.258615] x11: 0000000000000000 x10: 0000000000010004 2018-12-05T00:29:41.553957+00:00 WARNING kernel: [ 4.264549] x9 : 8a22428eb8f00900 x8 : 8a22428eb8f00900 2018-12-05T00:29:41.553959+00:00 WARNING kernel: [ 4.270482] x7 : ffffff800901adc0 x6 : ffffff8009031150 2018-12-05T00:29:41.554019+00:00 WARNING kernel: [ 4.276414] x5 : ffffff8009031150 x4 : ffffff8009031118 2018-12-05T00:29:41.554022+00:00 WARNING kernel: [ 4.282348] x3 : ffffff8009031000 x2 : 0000000000010004 2018-12-05T00:29:41.554025+00:00 WARNING kernel: [ 4.288281] x1 : 0000000000010004 x0 : 0000000000000024 2018-12-05T00:29:41.554027+00:00 WARNING kernel: [ 4.294215] Call trace: 2018-12-05T00:29:41.554030+00:00 WARNING kernel: [ 4.296944] cqhci_irq+0x318/0x424 2018-12-05T00:29:41.554032+00:00 WARNING kernel: [ 4.300741] sdhci_arasan_cqhci_irq+0x64/0x90 2018-12-05T00:29:41.554036+00:00 WARNING kernel: [ 4.305607] sdhci_irq+0xb4/0xaa4 2018-12-05T00:29:41.554042+00:00 WARNING kernel: [ 4.309311] handle_irq_event_percpu+0xf0/0x258 2018-12-05T00:29:41.554044+00:00 WARNING kernel: [ 4.314371] handle_irq_event+0x4c/0x7c 2018-12-05T00:29:41.554047+00:00 WARNING kernel: [ 4.318654] handle_fasteoi_irq+0xb4/0x124 2018-12-05T00:29:41.554049+00:00 WARNING kernel: [ 4.323230] generic_handle_irq+0x30/0x44 2018-12-05T00:29:41.554052+00:00 WARNING kernel: [ 4.327706] __handle_domain_irq+0x90/0xbc 2018-12-05T00:29:41.554054+00:00 WARNING kernel: [ 4.332282] gic_handle_irq+0xe8/0x18c 2018-12-05T00:29:41.554056+00:00 WARNING kernel: [ 4.336467] el1_irq+0xb0/0x128 2018-12-05T00:29:41.554058+00:00 WARNING kernel: [ 4.339974] cpuidle_enter_state+0x304/0x370 2018-12-05T00:29:41.554064+00:00 WARNING kernel: [ 4.344742] cpuidle_enter+0x34/0x40 2018-12-05T00:29:41.554067+00:00 WARNING kernel: [ 4.348734] do_idle+0x1e4/0x248 2018-12-05T00:29:41.554070+00:00 WARNING kernel: [ 4.352336] cpu_startup_entry+0x28/0x2c 2018-12-05T00:29:41.554072+00:00 WARNING kernel: [ 4.356717] rest_init+0xd4/0xe0 2018-12-05T00:29:41.554075+00:00 WARNING kernel: [ 4.360321] arch_call_rest_init+0x14/0x1c 2018-12-05T00:29:41.554077+00:00 WARNING kernel: [ 4.364894] start_kernel+0x41c/0x474 2018-12-05T00:29:41.554079+00:00 WARNING kernel: [ 4.368982] ---[ end trace 222888278ed8d527 ]--- 4) Eventually, after one of the reboots, the driver probe succeeds and we get /dev/tpm0: 2018-12-05T14:12:43.968004-08:00 INFO kernel: [ 2.341282] cr50_spi spi2.0: TPM ready IRQ confirmed on attempt 0. 2018-12-05T14:12:43.968010-08:00 INFO kernel: [ 2.362333] cr50_spi spi2.0: 2.0 TPM (device-id 0x28, rev-id 0) 2018-12-05T14:12:43.968015-08:00 INFO kernel: [ 2.389511] cr50_spi spi2.0: registered shutdown handler [gentle shutdown] 2018-12-05T14:12:43.968021-08:00 INFO kernel: [ 2.400282] cr50_spi spi2.0: Cr50 firmware version: B2-D:0 RO_A:0.0.10/29d77172 RW_A:0.4.10/cr50_v1.9308_B.433-59c93 2018-12-05T14:12:45.324848-08:00 INFO trunksd[1177]: TPM: /dev/tpm0 opened successfully By that point, tpm_managerd already thinks that the TPM is owned: 2018-12-05T14:12:45.377240-08:00 INFO tpm_managerd[1205]: Starting TPM Manager... 2018-12-05T14:12:51.007801-08:00 WARNING tpm_managerd[1205]: WARNING: Verified boot PCR 2 is not initialized. 2018-12-05T14:12:51.028482-08:00 WARNING tpm_managerd[1205]: WARNING: Verified boot PCR 3 is not initialized. 2018-12-05T14:12:51.060973-08:00 INFO tpm_managerd[1205]: Ownership taken signal is sent. There are more errors when communicating with the TPM later on 12/05, though I don't see any errors reported by cr50 driver, which is weird: 2018-12-05T14:19:14.340341-08:00 ERR trunksd[1177]: TPM: Error writing to TPM handle.: Input/output error 2018-12-05T14:19:14.342933-08:00 ERR chapsd[5701]: LoadKey: Error loading key: TRUNKS_RC_WRITE_ERROR 2018-12-05T14:19:14.342991-08:00 ERR chapsd[5701]: Error loading key into TPM: TRUNKS_RC_WRITE_ERROR 2018-12-05T15:25:58.609990-08:00 ERR trunksd[1094]: TPM: Error writing to TPM handle.: Input/output error 2018-12-05T15:25:58.611242-08:00 ERR chapsd[11737]: AsymmetricDecrypt: Error performing RSA decrypt: TRUNKS_RC_WRITE_ERROR 2018-12-05T15:25:58.611335-08:00 ERR chapsd[11737]: Error performing unbind operation: TRUNKS_RC_WRITE_ERROR 2018-12-05T15:25:58.631378-08:00 ERR chapsd[11737]: Authentication failed for token at /var/lib/chaps, reinitializing token. I don't see mount-encrypted logs captured on 12/04 (the first record is much later: "2018-12-05T14:13:21.064140-08:00 NOTICE mount-encrypted[2317]: [pid:326] Starting."). Though, apparently it ran and in the absence of TPM communications it must have recreated the stateful partition. I see something in the logs, but can't tell if it confirms that or not: 2018-12-05T00:29:41.554139+00:00 INFO kernel: [ 6.591714] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: discard,commit=600 2018-12-05T00:29:41.554142+00:00 NOTICE kernel: [ 6.608724] audit: type=1400 audit(1543969780.181:258): avc: denied { mounton } for pid=358 comm="mount-encrypted" path="/mnt/stateful_partition/encrypted" dev="mmcblk1p1" ino=1310721 scontext=u:r:chromeos:s0 tcontext=u:object_r:unlabeled:s0 tclass=dir permissive=1 2018-12-05T00:29:41.554145+00:00 NOTICE kernel: [ 6.644980] audit: type=1400 audit(1543969780.211:259): avc: denied { create } for pid=358 comm="mount-encrypted" name="var" scontext=u:r:chromeos:s0 tcontext=u:object_r:unlabeled:s0 tclass=dir permissive=1 2018-12-05T00:29:41.554148+00:00 NOTICE kernel: [ 6.665778] audit: type=1400 audit(1543969780.213:260): avc: denied { setattr } for pid=358 comm="mount-encrypted" name="var" dev="dm-0" ino=12 scontext=u:r:chromeos:s0 tcontext=u:object_r:unlabeled:s0 tclass=dir permissive=1 2018-12-05T00:29:41.554155+00:00 NOTICE kernel: [ 6.688419] audit: type=1400 audit(1543969780.213:261): avc: denied { mounton } for pid=358 comm="mount-encrypted" path="/var" dev="mmcblk1p5" ino=15553 scontext=u:r:chromeos:s0 tcontext=u:object_r:cros_var:s0 tclass=dir permissive=1 2018-12-05T00:29:41.554157+00:00 INFO kernel: [ 7.703768] EXT4-fs (mmcblk1p1): re-mounted. Opts: resgid=20119,commit=600 2018-12-05T00:29:41.554160+00:00 WARNING kernel: [ 7.801806] init: ureadahead main process (447) terminated with status 5 2018-12-05T00:29:41.554163+00:00 INFO kernel: [ 7.858276] EXT4-fs (mmcblk1p1): re-mounted. Opts: resgid=20119,commit=600 So, not sure what happened on 01/24, but the device went out of it in some bad logical state and can't recover from it w/o powerwashing: the TPM is already owned and the owner password not available; there is no existing attestationd db with info filled in at PrepareForEnrollment step; PrepareForEnrollment can't succeed w/o owner password. So, the next step is going through powerwash / clearing the TPM owner to let it initialize itself again. Probably, also update the image to a newer one to deal with that attestationd crash. > I still don't see any instructions on how to wipe the TPM you can always go through powerwash (or normal->dev->normal mode transition), but for a quick solution on dev/test image, just do the following (ignore rm messages about missing files): rm /home/chronos/.oobe_completed rm /home/.shadow/.can_attempt_ownership rm -rf /home/.shadow/low_entropy_creds/ crossystem clear_tpm_owner_request=1 reboot
,
Dec 11
> Looks like something was wrong with Linux kernel and tpm probe was not even started - at least I see no traces of it in the logs. Possibly, because of this: That's the danger of giving you truly-full logs -- that was old stuff from running upstream kernels (4.20-rc5 IIRC) which should be ignored. Of course there are always regressions, but the key point is: Cr50 support is still not upstream *hint* *hint* :D > you can always go through powerwash (or normal->dev->normal mode transition), I don't like to do that on my test devices, since that requires manual attention (to get into recovery, and to reinstall an image with an SSH server). I prefer being able to do everything over the network, without losing my stateful / SSH server. Anyway, I should probably terminate this line of requests. Thanks for the help -- I'll see whether an upgraded image and/or a TPM reset works out for me, and I'll leave it to Dan to determine what dependencies he wants to enforce for Tast readiness, and whether he wants to move that to a new bug.
,
Dec 11
> That's the danger of giving you truly-full logs -- that was old stuff from running upstream kernels (4.20-rc5 IIRC) which should be ignored. Of course there are always regressions, but the key point is: Cr50 support is still not upstream *hint* *hint* :D Sure :) Just powerwash / clear the TPM owner, as described above, after using such experimental images. Otherwise, you risk getting those weird errors. The problems you see now (tests can't run since attestation is not prepared) started then. I suspect, the "crossystem clear_tpm_owner_request=1" option should still work fine for you. It takes seconds, clears encstateful, but not stateful in general, and the image shouldn't be affected. So, unless the ssh server needs something in /home/chronos or /var (which are wiped out), it should continue running just fine after this procedure.
,
Dec 12
+garryxiao: just in case, take a look at the logs from #40 and notes from #43: is there any chance tpm_managerd incorrectly detected the TPM state as owned and sent the signal too soon in this case? it's hard to tell if the device was actually owned with a random password (and that password discarded/lost) before all those TPM comm issues, or it was only "pre-owned" with one of the well-known passwords on the 1st boot after that. if necessary, please open a separate bug.
,
Dec 12
I checked the full log in #40. Looks like tpm_managerd thinks the TPM is already owned at the very beginning at the first time it doesn't crash (it crashed and restarted a million times before that). I didn't see any logs like "CreateStorageRootKeys" or "CreateSaltingKey" from tpm_managerd, which are normally shown in system messages during OOBE. tpm_managerd calls trunks when started to determine whether the TPM is already owned before trying to go through the full initialization process. Probably in this case trunks somehow thinks the TPM is already owned by mistake? By the way, tom_managerd only sends the signal when 1. trunks or trouser tells it TPM is already owned at the very beginning, or 2. tpm_managerd successfully went through the entire initialization process.
,
Dec 12
Here is the call stack when tpm_managerd checks if the TPM is owned (TPM 2.0): tpm_manager::Tpm2Impl::CheckAndNotifyIfTpmOwned() trunks::TpmStateImpl::IsOwned() TpmStateImpl::IsOwned() checks 3 TPM property bits, which are updated every time trunks::TpmStateImpl::Initialize() is called, and Initialize() indirectly gets capability from TPM. Inside CheckAndNotifyIfTpmOwned(), before it calls TpmStateImpl::IsOwned(), it indirectly calls TpmStateImpl::Initialize() to refresh the propery bits. In our case, looks like TpmStateImpl::IsOwned() returns true because all the 3 bits are set, and there is no error during the refresh (otherwise we'll see an error like "Failed to query TPM properties" in the log). I am not sure why TPM says it's already owned during oobe. Maybe the TPM is in some weird state? #43 also pointed out the TPM communication errors, which I guess should not happen if TPM behaves normally? Does this only happen on some specific machine?
,
Dec 12
I filed issue 914258 against myself to track mitigating the failed ownership attempts discussed in #21 and onward so that the "tast run" command doesn't block for two minutes when developers run it against their own DUTs.
,
Dec 13
I'm going to close this venerable bug. Issue 914258 tracks making this code avoid blocking developers. Let's file new bugs to track any individual TPM issues that developers are seeing on their DUTs or refinements to what Tast does to try to get the TPM into a working state. |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by xiy...@chromium.org
, Nov 2There is also a cryptohome minidump collected during the MashLogin test. Not sure whether it is related. Stack: ==== Crash reason: 0x00000000 / 0x00000000 Crash address: 0x0 Process uptime: not available Thread 0 (crashed) 0 libc-2.23.so!__poll + 0x33 r0 = 0xb8945748 r1 = 0x00000003 r2 = 0xffffffff r3 = 0x00000000 r4 = 0xffffffff r5 = 0xb6c8eaa9 r6 = 0x00000003 r7 = 0x000000a8 r8 = 0x00000000 r9 = 0xb8945748 r10 = 0xb89378f8 r12 = 0x00000000 fp = 0x00000003 sp = 0xbea219a0 lr = 0x00000000 pc = 0xb669a5e4 Found by: given as instruction pointer in context 1 libglib-2.0.so.0.5200.3!g_main_context_iterate [gmain.c : 4271 + 0x7] r4 = 0xffffffff r5 = 0xb6c8eaa9 r6 = 0x00000003 r7 = 0xbea219d8 r8 = 0x00000000 r9 = 0xb8945748 r10 = 0xb89378f8 fp = 0x00000003 sp = 0xbea219a8 pc = 0xb6c837bb Found by: call frame info 2 libglib-2.0.so.0.5200.3!g_main_loop_run [gmain.c : 4168 + 0x7] r4 = 0xb892b0e8 r5 = 0x00000001 r6 = 0xb892b0f0 r7 = 0xbea219f8 r8 = 0xb6ed0ec0 r9 = 0x00000000 r10 = 0xb6fa85a4 fp = 0x00000000 sp = 0xbea219e0 pc = 0xb6c83af1 Found by: call frame info 3 libbrillo-glib-395517.so!brillo::dbus::AbstractDbusService::Run() [abstract_dbus_service.cc : 25 + 0x3] r4 = 0xb89308d8 r5 = 0xb6ed0ec0 r6 = 0xb89308d8 r7 = 0xbea21aa8 r8 = 0xb6ed0ec0 r9 = 0x00000000 r10 = 0xb6fa85a4 fp = 0x00000000 sp = 0xbea21a00 pc = 0xb6bd1dfd Found by: call frame info 4 cryptohomed!main [cryptohomed.cc : 92 + 0x7] r4 = 0xbea21ab4 r5 = 0x00000001 r6 = 0xb89308d8 r7 = 0xbea21bc8 r8 = 0xb6ed0ec0 r9 = 0x00000000 r10 = 0xb6fa85a4 fp = 0x00000000 sp = 0xbea21ab0 pc = 0xb6ee3331 Found by: call frame info 5 libc-2.23.so!__libc_start_main [libc-start.c : 289 + 0xb] r4 = 0xbea21be8 r5 = 0x00000000 r6 = 0x00000000 r7 = 0x00000000 r8 = 0x00000000 r9 = 0x00000000 r10 = 0xb6fa85a4 fp = 0x00000000 sp = 0xbea21bd0 pc = 0xb661f8a1 Found by: call frame info 6 cryptohomed!_start + 0x32 r4 = 0x00000000 r5 = 0x00000000 r6 = 0xb6ee30a1 r7 = 0x00000000 r8 = 0x00000000 r9 = 0x00000000 r10 = 0xb6fa85a4 fp = 0x00000000 sp = 0xbea21d18 pc = 0xb6ee30d5 Found by: call frame info 7 cryptohomed!__libc_csu_init [elf-init.c : 87 + 0xf] sp = 0xbea21d1c pc = 0xb6f83f4d Found by: stack scanning 8 0xbea21e4f r3 = 0xb6eb4059 r4 = 0xbea21d24 r5 = 0xbea21e15 r6 = 0xbea21e21 r7 = 0x00000000 r8 = 0xbea21e2b r9 = 0xbea21e3d sp = 0xbea21d3c pc = 0xbea21e51 Found by: call frame info 9 cryptohomed!_GLOBAL__sub_I_tpm_status.pb.cc [once.h : 0 + 0x29] sp = 0xbea21db0 pc = 0xb6ee30a1 Found by: stack scanning14.4 KB
14.4 KB View Download