New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 616854 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Sep 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 615475
issue 615962
issue 627367
issue 641366



Sign in to add a comment

Cryptohome could not remove the user's vault.

Project Member Reported by akes...@chromium.org, Jun 2 2016

Issue description

Tracking bug for occurences of this flake.
 
Blockedon: 615475
Blockedon: 615962
Cc: dgarr...@chromium.org
Components: -Infra>Client>ChromeOS
Labels: -current-issue
Owner: scollyer@chromium.org
Seems to be a sheriff issue, please find an owner. 
I don't understand why this particular failure has been assigned as an issue. The linked failure on 'peach-pit-chrome-pfg' was for build 2109. Build 2110 showed the same issue. However, subsequent builds have either failed for different reasons or have been successful. 
Please feel free to Wont' Fix or Obsolete it.
I filed this because I need some tracking bug for the historical pfq annotating that I'm doing. If this issue is no longer relevant then feel free to close.
Cc: steve...@chromium.org michae...@chromium.org
This is still failing frequently (4 times today) in issue 615475.

The problem is always the same: "Remove call failed: Message did not receive a reply (timeout by message bus)."

Is this just a flake we have to live with, or a symptom of a bug?

07/12 10:47:38.706 DEBUG|        cryptohome:0180| Removing vault for user test@test.test with hash 19cc1a8c867fe7ddec4f09764ccf7de59cd45978
07/12 10:47:38.707 DEBUG|        base_utils:0185| Running '/usr/sbin/cryptohome --action=remove --force --user=test@test.test 2>&1'
07/12 10:47:39.508 DEBUG|        base_utils:0278| [stdout] Remove call failed: Message did not receive a reply (timeout by message bus).
07/12 10:47:39.510 DEBUG|        base_utils:0278| [stdout] Remove failed.
07/12 10:47:39.517 DEBUG|        base_utils:0185| Running 'mkdir -p /usr/local/autotest/results/default/login_Cryptohome/sysinfo/iteration.1/var/spool'
07/12 10:47:39.547 DEBUG|        base_utils:0185| Running 'rsync --no-perms --chmod=ugo+r -a --exclude=autoserv*  /var/spool/crash /usr/local/autotest/results/default/login_Cryptohome/sysinfo/iteration.1/var/spool'
07/12 10:47:39.986 DEBUG|        base_utils:0185| Running 'rm -rf /var/spool/crash/*'
07/12 10:47:40.108 DEBUG|        base_utils:0185| Running 'logger "autotest finished iteration /usr/local/autotest/results/default/login_Cryptohome/sysinfo/iteration.1"'
07/12 10:47:40.157 WARNI|              test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/tests/login_Cryptohome/login_Cryptohome.py", line 35, in run_once
    cryptohome.remove_vault(username)
  File "/usr/local/autotest/cros/cryptohome.py", line 185, in remove_vault
    raise ChromiumOSError('Cryptohome could not remove the user\'s vault.')
ChromiumOSError: Cryptohome could not remove the user's vault.


Labels: -Pri-2 Pri-1
Cc: krisr@chromium.org wad@chromium.org ihf@chromium.org scollyer@chromium.org
Components: Tests
Owner: ----
+wad@

This timeout has happened in the pfq 9 times since Monday. I don't see any recent changes to the test, so this may be a regression or an infra problem if the devices are overloaded.

Can anyone shed some light on what's happening here? When this cropped up in 2014 (issue 344862), increasing the timeout helped, but seemed like a stopgap measure. Trying to understand what has happened since then.
Cc: rbhagavatula@chromium.org wuchengli@chromium.org jwer...@chromium.org
I'm sheriff this week. This happened three times today (see issue 615475). 
Cc: achuith@chromium.org
Cc: drcrash@chromium.org dkrahn@chromium.org
Darren, do you know who would be the right person to look at this? 

This seems to be a new flaky regression.
Cc: dspaid@chromium.org uekawa@chromium.org
Cc: smbar...@chromium.org
Cc: jhorwich@chromium.org ngm@chromium.org apronin@chromium.org
Owner: dkrahn@chromium.org
Status: Assigned (was: Untriaged)
This bug started happening May 27, and was quite infrequent until Jul 11, when it started happening multiple times a day.

What do these CLs do?
https://chromium-review.googlesource.com/#/c/357971/
https://chromium-review.googlesource.com/#/c/359049/

Assigning to Darren to find an owner.
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/70182206-chromeos-test/chromeos4-row2-rack4-host3/login_Cryptohome/sysinfo/var/log/messages

2016-07-20T08:15:21.183147+00:00 INFO cryptohomed[744]: TPM error 0x2020 (Key not found in persistent storage): LoadKeyByUuid: failed LoadKeyByUUID


2016-07-20T08:15:21.836353+00:00 ERR cryptohomed[744]: Couldn't wrap cryptohome key
2016-07-20T08:15:21.836421+00:00 WARNING cryptohomed[744]: Could not load the device policy file.
2016-07-20T08:15:21.836554+00:00 ERR cryptohomed[744]: Creating new salt at /home/.shadow/salt (0, 0)
2016-07-20T08:15:22.350180+00:00 ERR shill[803]: [ERROR:crypto_des_cbc.cc(103)] Unable to load key matter from /var/lib/whitelist/owner.key
2016-07-20T08:15:22.350608+00:00 ERR shill[803]: [ERROR:crypto_des_cbc.cc(103)] Unable to load key matter from /var/lib/whitelist/owner.key
2016-07-20T08:15:22.360562+00:00 WARNING chapsd[736]: SRK does not exist - this is normal when the TPM is not yet owned.
2016-07-20T08:15:22.361106+00:00 WARNING chapsd[736]: SRK does not exist - this is normal when the TPM is not yet owned.
2016-07-20T08:15:22.394026+00:00 INFO kernel: [   11.399371] tpm_tis tpm_tis: command 0x15 (size 14) returned code 0x0
2016-07-20T08:15:22.397194+00:00 WARNING cryptohomed[744]: Failed to read PCR1.
2016-07-20T08:15:22.501176+00:00 INFO cryptohomed[744]: TPM error 0x2020 (Key not found in persistent storage): Unseal: Failed to load SRK.
2016-07-20T08:15:22.501326+00:00 ERR cryptohomed[744]: Cannot unseal aes key.
2016-07-20T08:15:22.501336+00:00 ERR cryptohomed[744]: Attestation: Could not unseal decryption key.
2016-07-20T08:15:22.501341+00:00 WARNING cryptohomed[744]: Attestation: Attestation data invalid.  This is normal if the TPM has been cleared.
2016-07-20T08:15:22.526024+00:00 INFO kernel: [   11.531450] tpm_tis tpm_tis: command 0x46 (size 14) returned code 0x0
2016-07-20T08:15:22.575027+00:00 INFO kernel: [   11.581072] tpm_tis tpm_tis: command 0x7c (size 30) returned code 0x0
2016-07-20T08:15:22.580184+00:00 INFO cryptohomed[744]: Cannot read boot lockbox files.

First, there was a regression, later fixed by https://chromium-review.googlesource.com/#/c/359986/. But we still see it now.

Second, based on two logs in issue 615475 and issue 612797, the symptoms in /var/log/messages there are similar to issue 618392. So, all three may be related. Not identical (in issue 618392 chrome dies first), but all the rest is the same:

WARNING chapsd[21545]: SRK does not exist - this is normal when the TPM is not yet owned.
WARNING cryptohomed[21554]: Could not load the device policy file.
...
INFO cryptohomed[21554]: Migrated (or created) user directory: ...
...
WARNING cryptohomed[21554]: TPM was not owned. TPM initialization call back will handle PKCS#11 initialization.
...
ERR cryptohomed[21554]: TPM initialization took 1998ms
...
INFO cryptohomed[21554]: Created new cryptohome key.
...
WARNING chapsd[21545]: Unload Token event received for unknown path
...
WARNING cryptohomed[21554]: PKCS#11 initialization requested but cryptohome is not mounted
...
WARNING crash_reporter[23124]: [user] Received crash notification for cryptohomed[21554] sig 11, user 0

And then the crashed cryptohomed manifests itself in various ways: failed commands, dbus timeouts, etc



Cc: gwendal@chromium.org
Owner: apronin@chromium.org
#18: Those CLs are for TPM 2.0 support. They should not effect existing devices.

Looks like apronin@ is looking at this :). +gwendal too.
Yes, I'm looking. Just to keep records on the investigation:
At least from two of the logs where I could match dump to symbols, looks like cryptohome is crashing in Service::InitializeTpmComplete, while trying to go through the set of mounts. 
The session has ended while crypohome was still initializing tpm. And looks like Service::InitializeTpmComplete runs in parallel with Service::RemoveMountForUser or Service::RemoveAllMounts. And somewhere there's not enough synchronization.
Consistent with observations for issue 618392.
Cc: owenlin@chromium.org
Blockedon: 627367
Status: Started (was: Assigned)
I'm adding 627367 which I imagine has the same root cause of cryptohome crashing. This failed in the PFQ again today.
apronin@ we really need to get this issue to conclusion asap. This is affecting a number of boards and a variety of tests. Do you have any CL or solution in mind already or are we still investigating this?
While I'm working on a proper fix (won't be today), this workaround may buy some time: https://chromium-review.googlesource.com/#/c/362513/.

Can't really test if it helps, as I can't reproduce the issue, but based on what I see about the bug, it might help temporarily.

The problem is with the session being closed and the new started before the tpm initialization initiated during this session is still ongoing. Then, when the tpm initialization is complete, it tries to access the objects which are not there anymore. Fixing it will take time, so I'm changing the test to delay the moment when the session is restarted.
Project Member

Comment 28 by bugdroid1@chromium.org, Jul 22 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/32820b4ed32d388e7a5a00f4c67564ee33b49d5d

commit 32820b4ed32d388e7a5a00f4c67564ee33b49d5d
Author: Andrey Pronin <apronin@google.com>
Date: Fri Jul 22 02:19:59 2016

WORKAROUND: autotest: login_Cryptohome: let tpm finish initialization

Workaround: the test seems to fail because cryptohome crashes if
a session is closed while tpm is still being initialized. While a
proper fix is being implemented, workaround by waiting before closing
the session in case we need time to finish tpm initialization.

BUG= chromium:616854 
TEST=none

Change-Id: Ied5a20efcbfa2105f40ed0720015dba5d680871d
Reviewed-on: https://chromium-review.googlesource.com/362513
Commit-Ready: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Achuith Bhandarkar <achuith@chromium.org>

[modify] https://crrev.com/32820b4ed32d388e7a5a00f4c67564ee33b49d5d/client/site_tests/login_Cryptohome/login_Cryptohome.py

Project Member

Comment 29 by bugdroid1@chromium.org, Jul 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/cd07fc15c10755078dceb8560a3dd83fd4f259e9

commit cd07fc15c10755078dceb8560a3dd83fd4f259e9
Author: Chirantan Ekbote <chirantan@chromium.org>
Date: Fri Jul 22 23:53:19 2016

cryptohome: Add missing lock when accessing mounts

The cryptohome mounts can be accessed concurrently by multiple threads
and have a lock (mounts_lock_) that must be held before they can be
accessed.  Unfortunately, when iterating over the mounts after
successfully initializing the TPM, the Service::InitializeTpmComplete
function did not acquire the lock.

Since the time this bug was introduced (commit b009d07 in March, 2013)
the tpm initialization flow has apparently been fast enough that it
never overlapped with the session teardown phase of the login_Cryptohome
test.  This changed recently and the race condition is frequently
triggered.

Add the missing lock acquisition around the mount iteration code in
Service::InitializeTpmComplete.

BUG= chromium:616854 
TEST=login_Cryptohome on peach_pit

Change-Id: I8ab4e4a6e5b17beb341cb86b13dc0dc258e04f19
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/362609
Reviewed-by: Andrey Pronin <apronin@chromium.org>

[modify] https://crrev.com/cd07fc15c10755078dceb8560a3dd83fd4f259e9/cryptohome/service.cc

We've had 6 successful PFQ runs since yesterday, and no cryptohome failures. We could try reverting the WAR to see if the missing-lock fix works.
Cc: -wuchengli@chromium.org
Blockedon: 641366
Some fixes landed - tracked in  issue 641369 ,  issue 641372 ,  issue 641366 .
Do we still see the issue? The last auto-generated bug report in issue 615475 is from Sep 7.
Status: Fixed (was: Started)
Not seen anymore, marking as Fixed
Labels: VerifyIn-55

Comment 38 by dchan@google.com, Nov 19 2016

Labels: VerifyIn-56

Comment 39 by dchan@google.com, Jan 21 2017

Labels: VerifyIn-57

Comment 40 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 41 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 42 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 44 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)

Sign in to add a comment