chromeos4-row4-rack11-host15 is provision fail, repair succeed looping. |
||||||||
Issue descriptionThis DUT appears to be in a bad state.
,
Dec 15 2017
> I forced a repair which included a USB installation, which worked, then unlocked the DUT.
The putatively successful repair is here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row4-rack11-host15/5038955-repair/20171512125907/
The status.log file shows that the same AU failure struck the
repair task as strikes provision; that's why the DUT had to be
re-imaged from USB.
I don't understand why the provision failure doesn't trigger
any repair actions; that's not supposed to happen, although in
this case, it wouldn't help, since even install from USB doesn't
stop the symptom.
Probably, the more interesting question is what's causing the
provision failure in the first place. The error message just
says that starting the local devserver on the DUT failed.
I can't imagine what would cause that.
,
Dec 15 2017
I wasn't certain if it was the devserver on the DUT, or a lab devserver. But yeah, I'm confused.
,
Dec 15 2017
Reboot in pre-setup of rootfs update causes '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.JvngRBuh3r' to be deleted. Then after reboot, rootfs update is kicked off, but it cannot find this directory, then it cannot start devserver. We can find it from this snippet: 2017/12/15 13:57:48.657 WARNI| auto_updater:0737| No devserver log is available. 2017/12/15 13:57:48.658 DEBUG|dev_server_wrapper:0588| No devserver running. 2017/12/15 13:57:48.658 DEBUG| cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpWY7OOY/testing_rsa root@chromeos4-row4-rack11-host15 -- 'PATH=/usr/local/bin:/usr/local/sbin:$PATH rsync' --version 2017/12/15 13:57:48.993 INFO | remote_access:0459| Using rsync compression: True Warning: Permanently added 'chromeos4-row4-rack11-host15,100.115.195.157' (ED25519) to the list of known hosts. Warning: Permanently added 'chromeos4-row4-rack11-host15,100.115.195.157' (ED25519) to the list of known hosts. receiving incremental file list rsync: change_dir "/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.JvngRBuh3r" failed: No such file or directory (2) sent 18 bytes received 143 bytes 107.33 bytes/sec total size is 0 speedup is 0.00 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.0] rsync: [Receiver] write error: Broken pipe (32) 2017/12/15 13:57:49.422 WARNI| remote_access:0177| Could not copy /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.JvngRBuh3r/dev_server.log from device.
,
Dec 15 2017
Because this is only happening on a single DUT, I believe that the TPM is in a bad state and is losing the stateful partition key each boot. Passing to a sheriff to find a product owner to confirm. If this is a hardware failure, please pass it back to the deputy to get the DUT replaced. We don't want to lose this bug, since the DUT is currently locked in the lab and creating a single DUT shortage.
,
Dec 15 2017
Andrey, could you please take a look at the TPM on the device?
,
Dec 15 2017
Is there any way to just ssh there, to whatever test image there currently is?
,
Dec 15 2017
From the logs it seems the tpm is owned but the encrypted stateful key was not written there (happens when creating install attributes after successful owning). And the on-disk key (that is kept there until owning) is lost also - probably something happened and all stateful was wiped w/o clearing the TPM. So, it re-creates encstateful on every boot - seems to be stuck in this state. From mount-encrypted.log: [pid:247] NVRAM area has been defined but not written. [pid:247] Using NVRAM as system key; finalization needed. [pid:247] No usable system key found. [pid:247] /mnt/stateful_partition/encrypted.needs-finalization does not exist. Is it possible to just powerwash (or at least clear the tpm owner) and then go through OOBE and wait until the device is fully owned?
,
Dec 15 2017
Eventlog is weird - a bunch of "System boot"s 3-5 minutes apart w/o shutdowns in between that goes like: 1 | 2017-12-13 00:44:33 | System boot | 0 2 | 2017-12-13 00:47:43 | System boot | 0 3 | 2017-12-13 00:52:00 | System boot | 0 4 | 2017-12-13 00:55:15 | System boot | 0 5 | 2017-12-13 00:58:19 | System boot | 0 6 | 2017-12-13 01:02:42 | System boot | 0 7 | 2017-12-13 01:05:52 | System boot | 0 8 | 2017-12-13 01:08:50 | System boot | 0 9 | 2017-12-13 01:13:16 | System boot | 0 10 | 2017-12-13 01:17:10 | System boot | 0 11 | 2017-12-13 01:20:48 | System boot | 0 12 | 2017-12-13 01:25:24 | System boot | 0 ... Are we EC-resetting the DUT w/o proper shutdown when attempting to recover? That's an easy way to get a dictionary attack lockout from the TPM or corrupt the filesystem.
,
Dec 16 2017
Hm, what I see after manual "crossystem clear_tpm_owner_request=1", the system boots normally once, auto-owns the tpm, presumably saves the encstateful key in the TPM NVRAM. But after reboot I see [pid:247] Version 2 Lockbox NVRAM area found. [pid:247] NVRAM area has been defined but not written. in /var/log/mount-encrypted.log. Somehow, the TPM NVRAM contents are either not saved by cryptohomed (bug in the new Chrome OS? the device runs 10032.71.1) or lost (h/w bug in TPM?). The TPM remains owned. eventlog.txt still shows only "System boot" entries (is that normal for veyron_speedy?). It's unlikely but possible to lose data written to tpm nvram if there was no proper shutdown.
,
Dec 16 2017
> Is it possible to just powerwash (or at least clear the tpm owner) > and then go through OOBE and wait until the device is fully owned? The repair cycle mentioned in c#1 would have forced power wash. Also, we forced recovery mode during that same repair cycle. Forcing the DUT through OOBE after such a cycle can be done, but it requires a certain amount of fiddling around. > Are we EC-resetting the DUT w/o proper shutdown when attempting to recover? We'd only do that if the DUT weren't responding on the network. As I understand the stated symptoms, the DUT never goes offline, except during reboot. More to the point, the description says that repair never does anything. That means (in particular) that repair isn't resetting the EC. OTOH, if the TPM could get into a long-lived awkward state because of a single EC reset, it might have happened some time ago.
,
Dec 16 2017
> OTOH, if the TPM could get into a long-lived awkward state because of a single EC reset, it might have happened some time ago. Well, TPM can enter into a long lockout after a single EC reset (it req's a combination of factors, and only relevant for some models) and we do hit it occasionally - see http://b/35578636, for example. But I meant entering a lockout after multiple EC resets. In any case, that doesn't happen here.
,
Dec 16 2017
,
Dec 16 2017
It looks like we have to quite different DUTs exhibiting this
basic failure symptom:
chromeos4-row4-rack11-host15 veyron_speedy
chromeos4-row11-rack11-host7 cyan
Do we know definitively that the failure is TPM related?
,
Dec 16 2017
In this case it's the data that was supposedly saved in TPM NVRAM upon reboot turns out to be not there. If it is seen on multiple devices, it doesn't sound like a h/w TPM glitch. Rather, we (1) either don't save it there though we think we did, (2) or we don't even attempt to save it there but the file that keeps the data on disk until it is saved in tpm is deleted due to other reasons (including fs corruption).
,
Dec 16 2017
But I looked at the logs in issue 795456 . At least for the last failed test (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/162813009-chromeos-test/chromeos4-row11-rack11-host7/crashinfo.chromeos4-row11-rack11-host7/var/log/) mount-encrypted.log shows an unowned TPM instead of "NVRAM area has been defined but not written", so not sure it's the same bug.
,
Dec 16 2017
I see that before the reboot, the lockbox spaces indeed contains FFs: # tpmc read 0x20000004 44 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff But there is /mnt/stateful_partition/encrypted.needs-finalization. After reboot, the TPM still finds the TPM NVRAM uninitialized (all FFs), as expected. But it doesn't find /mnt/stateful_partition/encrypted.needs-finalizatio [pid:248] /mnt/stateful_partition/encrypted.needs-finalization does not exist. So, the issue seems to be the disappearing /mnt/stateful_partition/encrypted.needs-finalization file.
,
Dec 16 2017
And it's not like the whole stateful is erased. Other dirs show access timestampts before the last tpm clear: # ls -l /mnt/stateful_partition total 82960 drwxr-xr-x. 22 root root 4096 Dec 11 22:18 dev_image drwxr-xr-x. 5 root root 4096 Dec 15 16:45 encrypted -rw-------. 1 root root 3317653504 Dec 15 16:51 encrypted.block -rw-------. 1 root root 48 Dec 15 16:45 encrypted.needs-finalization drwxr-xr-x. 3 root root 4096 Dec 12 05:07 etc drwxr-xr-x. 6 root root 4096 Dec 12 05:07 home drwxr-xr-x. 4 root root 4096 Dec 12 05:02 unencrypted drwxr-xr-x. 12 root root 4096 Dec 11 22:13 var_overlay However, anything new I create on stateful disappears after reboot: localhost ~ # cp /mnt/stateful_partition/encrypted.needs-finalization /mnt/stateful_partition/zzz localhost ~ # ls -la /mnt/stateful_partition/ total 55788 drwxr-xr-x. 8 root root 4096 Dec 15 16:59 . drwxr-xr-x. 3 root root 4096 Dec 11 22:12 .. -rw-r--r--. 1 root root 0 Nov 16 16:41 .labmachine -rw-r--r--. 1 root root 0 Dec 12 05:07 .tpm_owned -rw-------. 1 root root 317 Dec 12 05:07 .tpm_status -rw-------. 1 root root 8 Dec 12 05:07 .tpm_status.sum drwxr-xr-x. 22 root root 4096 Dec 11 22:18 dev_image drwxr-xr-x. 5 root root 4096 Dec 15 16:58 encrypted -rw-------. 1 root root 3317653504 Dec 15 16:58 encrypted.block <-- still here -rw-------. 1 root root 48 Dec 15 16:58 encrypted.needs-finalization <--- still here drwxr-xr-x. 3 root root 4096 Dec 12 05:07 etc drwxr-xr-x. 6 root root 4096 Dec 12 05:07 home drwxr-xr-x. 4 root root 4096 Dec 12 05:02 unencrypted drwxr-xr-x. 12 root root 4096 Dec 11 22:13 var_overlay -rw-------. 1 root root 48 Dec 15 16:59 zzz <--- still here localhost ~ # reboot ... localhost ~ # ls -la /mnt/stateful_partition/ total 7116 drwxr-xr-x. 8 root root 4096 Dec 15 16:59 . drwxr-xr-x. 3 root root 4096 Dec 11 22:12 .. -rw-r--r--. 1 root root 0 Nov 16 16:41 .labmachine -rw-r--r--. 1 root root 0 Dec 12 05:07 .tpm_owned -rw-------. 1 root root 317 Dec 12 05:07 .tpm_status -rw-------. 1 root root 8 Dec 12 05:07 .tpm_status.sum drwxr-xr-x. 22 root root 4096 Dec 11 22:18 dev_image drwxr-xr-x. 5 root root 4096 Dec 15 16:59 encrypted -rw-------. 1 root root 3317653504 Dec 15 17:00 encrypted.block <--- old file lost, new one created in place -rw-------. 1 root root 48 Dec 15 16:59 encrypted.needs-finalization <--- old file lost, new one created in place drwxr-xr-x. 3 root root 4096 Dec 12 05:07 etc drwxr-xr-x. 6 root root 4096 Dec 12 05:07 home drwxr-xr-x. 4 root root 4096 Dec 12 05:02 unencrypted drwxr-xr-x. 12 root root 4096 Dec 11 22:13 var_overlay localhost ~ # <--- no 'zzz' after reboot Something must be wrong with the filesystem?
,
Dec 16 2017
Back to the sheriffs. Doesn't seem to be TPM-related.
,
Dec 16 2017
As a side note, I see that the TPM is auto-owned on each boot in testlab. That leads to created but uninitialized lockbox (FFs) since the tpm was owned w/o any user login (so, there's no install attributes to write there). However, that shouldn't have stopped it from operating normally if only encrypted.needs-finalization was preserved. And the TPM is auto-owned because of this CL, I believe: https://crrev.com/c/688763. See the latest comments there. +alemate just in case it leads to any troubles still. We may need to return false from ShallAttemptTpmOwnership() for Chromium builds then.
,
Dec 18 2017
Bouncing to current week sheriffs, I will be OOO.
,
Dec 27 2017
,
Feb 5 2018
,
Feb 5 2018
<sigh> I just tried logging in to the DUT, to poke around at its
current state. Then I tried to reboot. The DUT failed to boot
from internal storage (it never came back). The DUT _did_ boot from
USB (using servo) without difficulty.
I tested and saw the problem (boot from eMMC failed, boot from USB
worked) twice. Then, while booted from USB, I tried examining
the current root file system (ROOT-A). That produced a warning:
localhost media # mount -o ro /dev/mmcblk0p3 /media/local
localhost media # ls /media/local/
ls: cannot access '/media/local/run': Structure needs cleaning
bin boot debugd dev etc home lib lost+found media mnt opt postinst proc root run sbin sys tmp usr var
Then I tried re-installing from USB with chromeos-install:
Installing partition 1 to /dev/mmcblk0
Installing the stateful partition...
chroot: failed to run command '/usr/bin/cros_installer': Structure needs cleaning
[ etc. ]
So, there's reason to suspect an internal storaage failure.
,
Feb 6 2018
I've filed b/72967028 to replace the unit. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by dgarr...@chromium.org
, Dec 15 2017