New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 810716 link

Starred by 27 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

ChromeOS issue: Devices losing enrollment after updating to OS 63

Project Member Reported by marcore@chromium.org, Feb 9 2018

Issue description

ChromeOS version: 63.0.3239.140
ChromeOS device model: HP Chromebook 11 G5 EE
Case#: 14753835

Description:
Devices losing enrollment after updating to OS 63
-Customer reported that devices started to lose enrollment after being updated to version 63
-Customer hasn't identify the action that force the device to lose enrollment (rebooting the device, closing the lid)
-Customer is just notified when a device lost enrollment and issue can't be reproduce on demand 
-Devices that are re-enrolled don't lose the enrollment again
-Not all devices are being affected

-Amount of the affected devices so far: 50-70


Steps to reproduce: 
N/A autoupdate
Current Behavior / Reproduction: 
device loses enrollment
Expected Behavior: 
don't lose enrollment
Drive link to logs: 
https://drive.google.com/open?id=1HswXRNzGyYN-kNqSUmpTKPkxyh6igBp1
https://drive.google.com/open?id=1Jz7U7RTRBXpt_NmuDZjsLUl6P29C7YoF

In the log I see the same errors as crbug.com/685370

messages:2018-02-06T17:48:30.741429+00:00 INFO session_manager[1252]: [INFO:policy_key.cc(51)] No policy key on disk at /var/lib/whitelist/owner.key
and
chrome/chrome_20180207-082803:[1323:1323:0207/082804.542895:ERROR:device_cloud_policy_store_chromeos.cc(233)] Device policy read on enrolled device yields no DM token! Status: 1.

 
Owner: gkihumba@chromium.org
Grace, could you help triage this issue ?
Customer Info and date&time of the issue:
https://drive.google.com/open?id=1md9OCfRdcO8sT3xXMRV0jShOEd_LWNHx4ScjsIDb_Qk
I've another customer with the same issue, but different HW:
KEFKA: Dell Chromebook 11 Model 3180
case: 14842246
customer info: https://drive.google.com/open?id=1swuwNB_DakBVI8ybPiMyHPGlf5ivhp-Kf4DN0ayaSMw
debug logs:
https://drive.google.com/open?id=1o1BBqiLaY0Hvo4BCZ5dAFPOSSSpyWnZJ
https://drive.google.com/open?id=1_qGCe_pTSrbrrKdM3ogJwItKBgdnX1jW

Cc: vkhabarov@google.com
Hi! Is there any update on this? Thanks!
I have another customer with the same HW as #1:
case: 14983742
HW: HP Chromebook 11 G5 EE relm
version: 63.0.3239.140
Platform Version 10032.86.0 (Official Build) stable-channel relm
Firmware Version Google_Relm.7287.370.0
debug logs: 
https://drive.google.com/open?id=1lISxrCEoxdInjxcEFRjbpT3Ki-WgDOp_ 
https://drive.google.com/open?id=1M9EuL4z0urcyqUo1vK7xCHY5LQ86Gqwg
Is there something else we can gather to help the investigation?

Labels: -Pri-2 Pri-1
Another customer reports issue on Terra device and v63. Raising priority
I have this same issue on many devices, all ASUS C202SA, 63.0.3239.140

Current Case:  14993413


Is there any update to this ticket?

Comment 9 by gkihumba@google.com, Feb 22 2018

Cc: kbleicher@chromium.org
Can you please update to M64 and see if the issue still exists? M63 is now a dead branch and the stable channel is M64. The issue was likely fixed there.
+ Kevin M64 TPM
I'm not aware of any issues in M64.  Hasn't been raise as an issue / blocker / etc.
@gkihumba ..... I have no way of knowing if the issue still exists after updating to M64.... When this occurs to the devices, we go through the re-enrollment steps and the device keeps its enrollment....but I do not know for how long, because it is random as to what devices this occurs on.... Some devices that are on M63 or lower do not lose their enrollment, while some do.
Could you give exact repro steps? Without that it's difficult to triage.
Cc: gkihumba@chromium.org
Owner: mdrasner@chromium.org
Status: Assigned (was: Untriaged)
Matt, is this a known issue?
Customers didn't find any particular steps to trigger this, looks random in nature and some even wanted to know if it was done by users themselves. Only thing in common is either closing the lid or powering off device.
any update to this? Please!?
Lets see if I can jumpstart some ideas on possible resolutions....

Lets say, hypothetically, we confirm the issue happens when the device updates from 63 to 64, and the user closes the lid (and the user policy is set to logout the user/restart the device on the lid closing) during the upgrade process before it has a chance to finish. (again, just being hypothetical here)

Would it be safe to assume that the problem could be avoided by turning off the policy which causes the device to logout/restart on lid closing?

Also, if we have a large majority of our devices (715 of them) still on version 63, is there any way to force all of them to update to 64?  I am pretty sure there is a spot somewhere in the Admin console that does this, and I am also pretty sure we have it set to force an update if the user doesnt do it, after 1 day, and yet the vast majority of our devices remain on version 63....

I don't think that is off topic, since it relates to the issue at hand.

Thoughts?
Cc: maxkirsch@chromium.org mdrasner@chromium.org
Owner: atwilson@chromium.org
Apologies for the delay and thank you for the relevant details of what you've seen happening with your devices. 

re#18: if the issue does occur when the device reboots after pulling down the update, then I'm not sure changing the functionality on lid closing would help other than preventing the update from actually taking affect (i.e. would not take the update unless a user signs out, reboots, or hits the update arrow). There isn't a policy that forces the update, but so long as you aren't blocking updates via policy then all devices will take the update shortly after it is released and then stage the update awaiting next reboot.

@atwilson, can you help triage to have someone investigate the logs in c#5? Only thought that comes to mind is that this is related to the FS corruption we saw previously, but given the sporadic reproduction, not sure if that could be related.
Cc: atwilson@chromium.org
Owner: apronin@chromium.org
Andrey, looks like FS corruption again - WDYT?
Cc: igorcov@chromium.org
+igor in case he sees anything else suspicious in the logs
Looking at the 1st log:

Did the problems start on Feb 06?

The issue is there's no tpm device created by the driver. So, all further communications fail:
2018-02-06T18:27:30.076234+00:00 NOTICE tpm-probe[526]: tpm is non-functional or tpm driver is missing
2018-02-06T18:27:30.359947+00:00 ERR tcsd[714]: TSS: Could not find a device to open!

The tpm driver's probe seems to fail at
2018-02-06T18:27:29.816063+00:00 ERR kernel: [    1.067868] tpm_tis tpm_tis: Unable to read header

Apparently, the tpm is unresponsive. It comes back ok after the next reboot. But by that time mount-encrypted should have received an error attempting to communicate with the TPM and reset encrypted stateful, causing the loss of enrollment and local data. 

I also see one ext4 repair:
2018-02-06T11:12:34.434506-08:00 INFO kernel: [    1.518283] EXT4-fs (mmcblk0p1): recovery complete
2018-02-06T11:12:34.434534-08:00 INFO kernel: [    1.827077] EXT4-fs (dm-1): recovery complete

which happened after this on the previous boot:
2018-02-06T18:27:31.537079+00:00 INFO kernel: [    5.567383] EXT4-fs (dm-1): resizing filesystem from 440498 to 804909 blocks
2018-02-06T18:27:31.537283+00:00 INFO kernel: [    5.568072] EXT4-fs (dm-1): resized filesystem to 804909

It's unclear if it led to filesystem corruption. Possible, but that issue with non-communicating TPM chip seems like a more serious issue and more likely root cause. Unfortunately, no indication of why that happened. So, no further alley for investigation.

mount-encrypted overreaction in case of possibly transient tpm errors is tracked separately in issue 702724.
Same in the 2nd log:
2018-02-06T17:48:27.108639+00:00 ERR kernel: [    1.070472] tpm_tis tpm_tis: Unable to read header
2018-02-06T17:48:27.624183+00:00 ERR tcsd[704]: TSS: Could not find a device to open!

then

2018-02-06T17:48:28.795488+00:00 INFO kernel: [    4.826147] EXT4-fs (dm-1): resizing filesystem from 209661 to 804909 blocks
2018-02-06T17:48:28.808072+00:00 INFO kernel: [    4.839177] EXT4-fs (dm-1): resized filesystem to 804909

then

2018-02-06T09:49:51.445561-08:00 INFO kernel: [    1.489621] EXT4-fs (mmcblk0p1): recovery complete

2018-02-06T09:49:51.445709-08:00 INFO kernel: [    1.825991] EXT4-fs (dm-1): recovery complete
In both logs I also see warnings and errors on every boot seemingly related to enumeraton of devices:


2018-02-06T17:48:27.106837+00:00 WARNING kernel: [    0.170074] ------------[ cut here ]------------
2018-02-06T17:48:27.106838+00:00 WARNING kernel: [    0.170082] WARNING: CPU: 0 PID: 1 at /mnt/host/source/src/third_party/kernel/v3.18/drivers/base/power/common.c:156 dev_pm_domain_set+0x58/0x5b()
2018-02-06T17:48:27.106840+00:00 WARNING kernel: [    0.170084] PM domains can only be changed for unbound devices
2018-02-06T17:48:27.106841+00:00 WARNING kernel: [    0.170086] Modules linked in:
2018-02-06T17:48:27.106849+00:00 WARNING kernel: [    0.170092] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.0-16288-g64d05cf80004 #1
2018-02-06T17:48:27.106851+00:00 WARNING kernel: [    0.170095] Hardware name: GOOGLE Relm, BIOS Google_Relm.7287.370.0 08/06/2017
2018-02-06T17:48:27.106853+00:00 WARNING kernel: [    0.170098]  0000000000000000 000000006ec2bee4 ffff88017b227b78 ffffffff9daa3024
2018-02-06T17:48:27.106855+00:00 WARNING kernel: [    0.170102]  0000000000000000 ffff88017b227bd0 ffff88017b227bb8 ffffffff9d4647a7
2018-02-06T17:48:27.106856+00:00 WARNING kernel: [    0.170107]  000000000000000f ffffffff9d80cb06 ffff88017a933c10 ffffffff9e095140
2018-02-06T17:48:27.106858+00:00 WARNING kernel: [    0.170111] Call Trace:
2018-02-06T17:48:27.106864+00:00 WARNING kernel: [    0.170120]  [<ffffffff9daa3024>] dump_stack+0x4e/0x71
2018-02-06T17:48:27.106867+00:00 WARNING kernel: [    0.170127]  [<ffffffff9d4647a7>] warn_slowpath_common+0x81/0x9b
2018-02-06T17:48:27.106869+00:00 WARNING kernel: [    0.170131]  [<ffffffff9d80cb06>] ? dev_pm_domain_set+0x58/0x5b
2018-02-06T17:48:27.106870+00:00 WARNING kernel: [    0.170136]  [<ffffffff9d464816>] warn_slowpath_fmt+0x55/0x6b
2018-02-06T17:48:27.106872+00:00 WARNING kernel: [    0.170142]  [<ffffffff9d6f2aea>] ? acpi_get_data_full+0x7d/0x90
2018-02-06T17:48:27.106873+00:00 WARNING kernel: [    0.170146]  [<ffffffff9d80cb06>] dev_pm_domain_set+0x58/0x5b
2018-02-06T17:48:27.106875+00:00 WARNING kernel: [    0.170152]  [<ffffffff9d6db2bc>] acpi_lpss_platform_notify+0x139/0x15b
2018-02-06T17:48:27.106882+00:00 WARNING kernel: [    0.170158]  [<ffffffff9d47eccc>] notifier_call_chain+0x40/0x64
2018-02-06T17:48:27.106884+00:00 WARNING kernel: [    0.170163]  [<ffffffff9d47ef8b>] __blocking_notifier_call_chain+0x4d/0x66
2018-02-06T17:48:27.106886+00:00 WARNING kernel: [    0.170167]  [<ffffffff9d47efb8>] blocking_notifier_call_chain+0x14/0x16
2018-02-06T17:48:27.106888+00:00 WARNING kernel: [    0.170172]  [<ffffffff9d806077>] driver_bound+0x93/0x97
2018-02-06T17:48:27.106889+00:00 WARNING kernel: [    0.170176]  [<ffffffff9d8061e5>] driver_probe_device+0x13b/0x2b1
2018-02-06T17:48:27.106891+00:00 WARNING kernel: [    0.170180]  [<ffffffff9d806421>] __driver_attach+0x5e/0x81
2018-02-06T17:48:27.106898+00:00 WARNING kernel: [    0.170184]  [<ffffffff9d8063c3>] ? __device_attach_driver+0x68/0x68
2018-02-06T17:48:27.106907+00:00 WARNING kernel: [    0.170188]  [<ffffffff9d805284>] bus_for_each_dev+0x8c/0xaf
2018-02-06T17:48:27.106909+00:00 WARNING kernel: [    0.170192]  [<ffffffff9d805c41>] driver_attach+0x1e/0x20
2018-02-06T17:48:27.106910+00:00 WARNING kernel: [    0.170196]  [<ffffffff9d8058c9>] bus_add_driver+0xeb/0x1e3
2018-02-06T17:48:27.106912+00:00 WARNING kernel: [    0.170200]  [<ffffffff9d806c04>] driver_register+0x8f/0xcc
2018-02-06T17:48:27.106913+00:00 WARNING kernel: [    0.170204]  [<ffffffff9e155297>] ? spidev_init+0xab/0xab
2018-02-06T17:48:27.106915+00:00 WARNING kernel: [    0.170209]  [<ffffffff9d807a1e>] __platform_driver_register+0x4a/0x4c
2018-02-06T17:48:27.106923+00:00 WARNING kernel: [    0.170213]  [<ffffffff9e1552a9>] pxa2xx_spi_init+0x12/0x14
2018-02-06T17:48:27.106925+00:00 WARNING kernel: [    0.170218]  [<ffffffff9d4003b5>] do_one_initcall+0x188/0x19d
2018-02-06T17:48:27.106927+00:00 WARNING kernel: [    0.170224]  [<ffffffff9e11c09c>] kernel_init_freeable+0x194/0x234
2018-02-06T17:48:27.106935+00:00 WARNING kernel: [    0.170230]  [<ffffffff9da9f920>] ? rest_init+0x84/0x84
2018-02-06T17:48:27.106937+00:00 WARNING kernel: [    0.170234]  [<ffffffff9da9f92e>] kernel_init+0xe/0xda
2018-02-06T17:48:27.107225+00:00 WARNING kernel: [    0.170238]  [<ffffffff9daa837c>] ret_from_fork+0x7c/0xb0
2018-02-06T17:48:27.107229+00:00 WARNING kernel: [    0.170243]  [<ffffffff9da9f920>] ? rest_init+0x84/0x84
2018-02-06T17:48:27.107238+00:00 WARNING kernel: [    0.170256] ---[ end trace 64a422b868e883d9 ]---

and 

2018-02-06T17:48:27.107262+00:00 WARNING kernel: [    0.177304] ------------[ cut here ]------------
2018-02-06T17:48:27.107264+00:00 WARNING kernel: [    0.177311] WARNING: CPU: 0 PID: 1 at /mnt/host/source/src/third_party/kernel/v3.18/drivers/base/power/common.c:156 dev_pm_domain_set+0x58/0x5b()
2018-02-06T17:48:27.107272+00:00 WARNING kernel: [    0.177313] PM domains can only be changed for unbound devices
2018-02-06T17:48:27.107274+00:00 WARNING kernel: [    0.177314] Modules linked in:
2018-02-06T17:48:27.107276+00:00 WARNING kernel: [    0.177319] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W      3.18.0-16288-g64d05cf80004 #1
2018-02-06T17:48:27.107278+00:00 WARNING kernel: [    0.177321] Hardware name: GOOGLE Relm, BIOS Google_Relm.7287.370.0 08/06/2017
2018-02-06T17:48:27.107279+00:00 WARNING kernel: [    0.177324]  0000000000000000 000000006ec2bee4 ffff88017b227b78 ffffffff9daa3024
2018-02-06T17:48:27.107280+00:00 WARNING kernel: [    0.177328]  0000000000000000 ffff88017b227bd0 ffff88017b227bb8 ffffffff9d4647a7
2018-02-06T17:48:27.107282+00:00 WARNING kernel: [    0.177333]  000000000000000f ffffffff9d80cb06 ffff88017a933010 ffffffff9e095140
2018-02-06T17:48:27.107289+00:00 WARNING kernel: [    0.177337] Call Trace:
2018-02-06T17:48:27.107292+00:00 WARNING kernel: [    0.177342]  [<ffffffff9daa3024>] dump_stack+0x4e/0x71
2018-02-06T17:48:27.107294+00:00 WARNING kernel: [    0.177348]  [<ffffffff9d4647a7>] warn_slowpath_common+0x81/0x9b
2018-02-06T17:48:27.107295+00:00 WARNING kernel: [    0.177352]  [<ffffffff9d80cb06>] ? dev_pm_domain_set+0x58/0x5b
2018-02-06T17:48:27.107297+00:00 WARNING kernel: [    0.177356]  [<ffffffff9d464816>] warn_slowpath_fmt+0x55/0x6b
2018-02-06T17:48:27.107298+00:00 WARNING kernel: [    0.177361]  [<ffffffff9d6f2aea>] ? acpi_get_data_full+0x7d/0x90
2018-02-06T17:48:27.107300+00:00 WARNING kernel: [    0.177366]  [<ffffffff9d80cb06>] dev_pm_domain_set+0x58/0x5b
2018-02-06T17:48:27.107308+00:00 WARNING kernel: [    0.177371]  [<ffffffff9d6db2bc>] acpi_lpss_platform_notify+0x139/0x15b
2018-02-06T17:48:27.107311+00:00 WARNING kernel: [    0.177376]  [<ffffffff9d47eccc>] notifier_call_chain+0x40/0x64
2018-02-06T17:48:27.107313+00:00 WARNING kernel: [    0.177380]  [<ffffffff9d47ef8b>] __blocking_notifier_call_chain+0x4d/0x66
2018-02-06T17:48:27.107315+00:00 WARNING kernel: [    0.177385]  [<ffffffff9d47efb8>] blocking_notifier_call_chain+0x14/0x16
2018-02-06T17:48:27.107316+00:00 WARNING kernel: [    0.177389]  [<ffffffff9d806077>] driver_bound+0x93/0x97
2018-02-06T17:48:27.107317+00:00 WARNING kernel: [    0.177393]  [<ffffffff9d8061e5>] driver_probe_device+0x13b/0x2b1
2018-02-06T17:48:27.107326+00:00 WARNING kernel: [    0.177397]  [<ffffffff9d806421>] __driver_attach+0x5e/0x81
2018-02-06T17:48:27.107328+00:00 WARNING kernel: [    0.177401]  [<ffffffff9d8063c3>] ? __device_attach_driver+0x68/0x68
2018-02-06T17:48:27.107330+00:00 WARNING kernel: [    0.177405]  [<ffffffff9d805284>] bus_for_each_dev+0x8c/0xaf
2018-02-06T17:48:27.107332+00:00 WARNING kernel: [    0.177409]  [<ffffffff9d805c41>] driver_attach+0x1e/0x20
2018-02-06T17:48:27.107333+00:00 WARNING kernel: [    0.177413]  [<ffffffff9d8058c9>] bus_add_driver+0xeb/0x1e3
2018-02-06T17:48:27.107334+00:00 WARNING kernel: [    0.177417]  [<ffffffff9d806c04>] driver_register+0x8f/0xcc
2018-02-06T17:48:27.107336+00:00 WARNING kernel: [    0.177421]  [<ffffffff9e1586d0>] ? i2c_i801_init+0xc9/0xc9
2018-02-06T17:48:27.107343+00:00 WARNING kernel: [    0.177426]  [<ffffffff9d807a1e>] __platform_driver_register+0x4a/0x4c
2018-02-06T17:48:27.107346+00:00 WARNING kernel: [    0.177430]  [<ffffffff9e1586e2>] dw_i2c_init_driver+0x12/0x14
2018-02-06T17:48:27.107347+00:00 WARNING kernel: [    0.177434]  [<ffffffff9d4003b5>] do_one_initcall+0x188/0x19d
2018-02-06T17:48:27.107349+00:00 WARNING kernel: [    0.177439]  [<ffffffff9e11c09c>] kernel_init_freeable+0x194/0x234
2018-02-06T17:48:27.107350+00:00 WARNING kernel: [    0.177444]  [<ffffffff9da9f920>] ? rest_init+0x84/0x84
2018-02-06T17:48:27.107352+00:00 WARNING kernel: [    0.177448]  [<ffffffff9da9f92e>] kernel_init+0xe/0xda
2018-02-06T17:48:27.107359+00:00 WARNING kernel: [    0.177453]  [<ffffffff9daa837c>] ret_from_fork+0x7c/0xb0
2018-02-06T17:48:27.107362+00:00 WARNING kernel: [    0.177457]  [<ffffffff9da9f920>] ? rest_init+0x84/0x84
2018-02-06T17:48:27.107363+00:00 WARNING kernel: [    0.177460] ---[ end trace 64a422b868e883da ]---

and

2018-02-06T17:48:27.107247+00:00 ERR kernel: [    0.175902] ERROR: Unable to locate IOAPIC for GSI 184
2018-02-06T17:48:27.107255+00:00 WARNING kernel: [    0.175905] Failed to set pin attr for GSI184
2018-02-06T17:48:27.107258+00:00 ERR kernel: [    0.176532] i2c i2c-0: failed to add I2C device MLFS0000:00 from ACPI
2018-02-06T17:48:27.107259+00:00 ERR kernel: [    0.176579] ERROR: Unable to locate IOAPIC for GSI 184
2018-02-06T17:48:27.107261+00:00 WARNING kernel: [    0.176582] Failed to set pin attr for GSI184

and

2018-02-06T17:48:27.107828+00:00 ERR kernel: [    0.225139] ERROR: Unable to locate IOAPIC for GSI 182
2018-02-06T17:48:27.107830+00:00 WARNING kernel: [    0.225142] Failed to set pin attr for GSI182


Those are seen both on boots when the tpm device is successfully created and when we have no /dev/tpm0, so unclear if can affect it. But overall sounds not very healthy.
What was the Chrome OS version before the AU to 63?
Do we still have un-updated devices? Is it possible to gather logs from them to compare?

I looked at other logs from #3 and #5. In all cases I see "tpm_tis tpm_tis: Unable to read header" errors from the tpm driver, leading to no /dev/tpm0 device and all tpm-communicating utilities/daemons getting only errors from it. And that'd typically lead to losing enrollment.

Also, re #24: I see the same warnings on much earlier versions, so they didn't start after AU. So, probably unrelated.
Cc: emaxx@chromium.org
Yeah, and if those TPM errors are persistent, it makes sense to wipe everything and re-enroll, re-own the TPM, etc. But if it's some transient thing, it's a shame to do the wipe as opposed to (say) rebooting or blocking at an error screen or something.

Not clear to me that "can't reach the TPM" is something Chrome OS should try to silently recover from, rather than throwing up an error and letting the user intervene (power cycle, powerwash, etc).
Re #27: Absolutely correct. I'm not saying that's how it should behave, only that this is how mount-encrypted behaves now.

Those potentially transient errors shouldn't automatically lead to wiping stateful. The bug for fixing this mount-encrypted behavior is issue 702724.

In many cases, we can't silently recover from those errors. Some retries are done at the driver level already to recover from i2c/spi level issues. There are still some semi-exotic situations when a tpm can return a recoverable error, which can be silently dealt with at Tlcl library level or mount-encrypted level. Those were not reported in practice, but possible.

Still, in most cases, we will have to show an error and make the user reboot, hoping that the error will go away. A reboot loop in case of a persistent error is better than suddenly losing stateful on reboot or falling back to tpm-less mechanisms.


OK, so not sure what next steps are here. Clearly something weird is happening with TPMs, but not sure who would be able to look at that. We have a clear AI to not clear stateful on TPM errors, but not sure when we'd move forward on that either (and requires UX review, etc).

Andrey, are you comfortable keeping this on your plate, or is there someone else who should also look at this?
Yes, let me take care of it. I deal with a similar issue in b/73351086. And fixing the tpm errors processing in mount-encrypted and tlcl library in general is what I wanted to do for a long time already.

The next steps are along two parallel trucks, and some are beyond the scope of this bug, though help to address its consequences:

1) Reproduce and find the reason for tpm communication issues. I started looking, but no repro yet. Also, I may backport to 3.18 some of the driver patches that improve error processing and log messages in case of errors. As of now, some communication timeouts (possibly what we actually get here) can lead to confusing messages and passing garbage data.

2) Fix how mount-encrypted/tlcl deal with these errors: (a) better propagate the errors to the level that can make a decision on retry, (b) silently retry a few times in case of comm issues, (c) agree on the UX in case we still get an error, (d) implement.

Comment 31 by k....@wiufsd.org, Mar 16 2018

We are seeing the same issue on version 64 now as well on HP Chromebook 11 G5 EE... This is starting to be a serious problem.
Next in my list after issue 702724. Fixing which should also help here - when hitting this issue, users may see an error screen and have to reboot, but not lose data/enrollment.
Having this issue on Samsung Chromebooks in our fleet. Hope a fix is coming soon - particularly with State Testing coming up in April.
Cc: marchuk@google.com
Have the same on celes devices

2018-03-26T12:25:31.054929+00:00 ERR kernel: [    0.146981] ERROR: Unable to locate IOAPIC for GSI 183

I see recent commits in b/73351086 but not on any build yet.
Labels: M-64
We have another customer having this issue on Relm, running 64.0.3282.190 
Drive link to debug logs: https://drive.google.com/open?id=17CeBO3uMuWWi6t53W-PYUT-PwVCTcFNj 
Thanks! For all 3 cases in comment #35 I also see
  2018-03-26T12:43:44.398082+00:00 ERR kernel: [    1.051564] tpm_tis tpm_tis: Unable to read header
followed by
  2018-03-26T12:43:44.500586+00:00 NOTICE tpm-probe[519]: tpm is non-functional or tpm driver is missing

As a result, the key for mounting encrypted stateful cannot be read from the TPM, and that partition is wiped.
I'm not sure if the IOAPIC stuff from comment #34 is directly related.
We have these messages on every boot for all 3 devices:

2018-03-26T12:43:44.397508+00:00 ERR kernel: [    0.171995] ERROR: Unable to locate IOAPIC for GSI 184
2018-03-26T12:43:44.397510+00:00 WARNING kernel: [    0.171999] Failed to set pin attr for GSI184
...
2018-03-23T18:08:46.284177+00:00 WARNING kernel: [    0.200283] Expanded resource reserved due to conflict with PCI Bus 0000:00
2018-03-23T18:08:46.284178+00:00 DEBUG kernel: [    0.200287] e820: reserve RAM buffer [mem 0x7cc13000-0x7fffffff]
...
2018-03-23T18:08:46.284477+00:00 ERR kernel: [    0.219436] ERROR: Unable to locate IOAPIC for GSI 182
2018-03-23T18:08:46.284479+00:00 WARNING kernel: [    0.219439] Failed to set pin attr for GSI182

But only on the 1st boot the TPM is unreachable.
More re #34: 

> I see recent commits in b/73351086 but not on any build yet.

The fixes for b/73351086 won't help this issue since they affect only devices with different TPM models.
Status: Started (was: Assigned)
RE #33: 

> Having this issue on Samsung Chromebooks in our fleet. 

Do you use different Samsung Chromebook models in your fleet? Is the issue limited to one particular model?
I see the same lines about IOAPIC and Expanded resource reserved due to conflict with PCI Bus on M54. So, that's not something, which was added with M63/M64.

Another observation:
One of the byproducts of tpm not being present is that chromeos reports its dictionary attack counter as 100 (special value meaning "no tpm").

Looking at the last month data only, I see that although the occurrence of counter=100 for relm, kefka and similar boards at boot remained at the level of thousandths of a percent, it's frequency has increased ~2x from M61/62 to M63/64, so it is possible that there was some change between 62 and 63.
Note that this is the analysis of the recent situation only, so this 2x is hardly a statistical fact at this point. And it's hard to compare to M62 and earlier, since as we go down with revision numbers, the number of data point quickly decreases.
General question: in many of the reports above, seeing the lost enrollment cases is tied to migration to 63/64. What was the previous Chrome OS revision running on those boards, for which lost enrollment was not seen with this frequency? M61/62 or earlier?
Cc: -gkihumba@chromium.org
In one of reports, it started in 63* but was fine in 62*, but no more specific versions.
Cc: adurbin@chromium.org
I haven't been able to reproduce the issue by running a stress boot test so far, but it's possible that 1500 reboots is not sufficient yet - the frequency may be very low (supported by UMA figures that show much lower frequency for tpm issues in general, as commented in #40).

One of the changes that landed in coreboot between M62 (9901.79.0, with firmware 7287.341.0) and M63 (10032.86.0, with firmware 7287.370.0 - that's what we see on devices with issues) is https://crrev.com/i/342463 (plus https://crrev.com/i/378309 and https://crrev.com/i/412088) for b/36790283.

It is possible it's causing these issues - similar to b/68224689, but at boot time. The outcome at the tpm driver level is similar. Then the tpm driver changes being made for b/68224689 should also help here.
I've another customer with this issue and different HW:
case: 15132165
model: Asus Chromebook C202 
version: 63.0.3239.140 / 10032.86.0
customer info & serial#: https://drive.google.com/open?id=1j-nSYUYxGvHNesiUM4D1pstSs-qscdkK0ZW7-DlNatQ
debug logs:  https://drive.google.com/open?id=1ZenXe2KZVn9qJB4AdJgJwXgVE2uDQ6h8
but in the log I have:
messages.5:2018-02-20T13:04:48.022483-08:00 INFO session_manager[1138]: [INFO:policy_key.cc(51)] No policy key on disk at /home/root/f394d6c960ed35d8d736e60486726787d1936532/session_manager/policy/key

chrome/chrome_20180227-124455:[1162:1162:0227/124457.038302:ERROR:device_cloud_policy_store_chromeos.cc(233)] Device policy read on enrolled device yields no DM token! Status: 5.
Status 5 ==   STORE_VALIDATION_ERROR,  // Policy validation failure.
instead of 1  STORE_KEY_UNAVAILABLE,   // Owner key not yet configured. 
 
but I've also the same crashes found at #24

Re #45: what are the symptoms in this case, btw?

Unlike other cases on this bug, there were no tpm errors that led to losing encstateful - the device haven't gone through wiping it since at least 02/13, so it's a different root cause and there should not have been unexpected OOBE screens on boot.

I do see policy errors:
2018-02-27T11:04:09.014158-08:00 ERR shill[1104]: [ERROR:device_policy_impl.cc(132)] Policy on disk could not be parsed!
...
2018-02-27T11:04:09.116916-08:00 ERR session_manager[1132]: [ERROR:device_policy_impl.cc(132)] Policy on disk could not be parsed!

which started after ext4 errors. I'm splitting it to a separate bug - see  issue 832204 .


Have one more model, 

case: 15132165
model: Edugear CMT Chromebook (wizpig)
version: 65.0.3325.209 / 10323.67.0

with
2018-04-19T16:23:23.192750+00:00 ERR kernel: [    1.068344] tpm_tis tpm_tis: Unable to read header

and
2018-04-19T10:45:20.301391-07:00 INFO kernel: [    1.530381] EXT4-fs (mmcblk0p1): recovery complete
2018-04-19T10:45:20.301414-07:00 INFO kernel: [    1.813420] EXT4-fs (dm-1): recovery complete

and

2018-04-19T16:23:23.192096+00:00 ERR kernel: [    0.176149] ERROR: Unable to locate IOAPIC for GSI 184
2018-04-19T16:23:23.192097+00:00 WARNING kernel: [    0.176153] Failed to set pin attr for GSI184

SN and logs
https://drive.google.com/open?id=1XuT2e2aTFDh5mK-NFtnxUDOtqELWBeo2
Any updates? Please let me know if anything else is needed from customer regarding this issue.

No updates since comment 44. The fix for the suspected root cause there is in R67.
Same issue: HP Chromebook 11 G5 65.0.3325.209 and an Asus
Cc: ryutas@chromium.org apronin@chromium.org pelets...@chromium.org
 Issue 838033  has been merged into this issue.
Same issue with our fleet of 3000 Dell 3180. Sent sample device to Dell, they confirmed its an issue with ChromeOS. The said to put pressure on Google for a fix. We've upgraded from 63, to 64, and now 65. Still occurs randomly; we get 10-20 reports per week.
Components: -Enterprise>Enrollment OS>Systems
Status: Fixed (was: Started)
"tpm_tis tpm_tis: Unable to read header" is what led to lost enrollment in reported cases with logs.

There is an known issue that causes occasional "tpm_tis tpm_tis: Unable to read header" on strago family since R63. The fix was made in R67.
Plus, additional hardening from similar low-level issues at encstateful mount time is to be included in R68.

Marking as Fixed. Please report if seen again on 67 or later.
Note that once seen on a device, fixing the problem includes powerwash. W/o powerwash, the issue will re-occur on reboots and will still be there after AU to 67+.

Sign in to add a comment