Issue 795396

Starred by 1 user

Issue metadata

Status: Assigned
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



chromeos4-row4-rack11-host15 is stuck in a provision-fail, repair-succeed loop.

Project Member Reported by dgarr...@chromium.org, Dec 15 2017

Issue description

This DUT appears to be in a bad state.
 
I forced a repair which included a USB installation, which worked, then unlocked the DUT. It promptly failed its first provision job, so I've relocked it.

http://chromeos-server14.mtv.corp.google.com/afe/#tab_id=view_host&object_id=3659


Provision failures look like this:

		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1513374634	localtime=Dec 15 13:50:34	Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row4-rack11-host15: 0) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',), 1) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',), 
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 831, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 495, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 362, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 400, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 126, in run_once
      with_cheets=with_cheets)
    File "/usr/local/autotest/server/afe_utils.py", line 124, in machine_install_and_update_labels
      *args, **dargs)
    File "/usr/local/autotest/server/hosts/cros_host.py", line 828, in machine_install_by_devserver
      quick_provision=quick_provision)
    File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2391, in auto_update
      error_msg % (host_name, real_error))
  DevServerException: CrOS auto-update failed for host chromeos4-row4-rack11-host15: 0) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',), 1) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',), 
> I forced a repair which included a USB installation, which worked, then unlocked the DUT.

The putatively successful repair is here:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row4-rack11-host15/5038955-repair/20171512125907/

The status.log file shows that the same AU failure that hits
provision also struck the repair task; that's why the DUT had
to be re-imaged from USB.

I don't understand why the provision failure doesn't trigger
any repair actions; that's not supposed to happen, although in
this case, it wouldn't help, since even install from USB doesn't
stop the symptom.

Probably, the more interesting question is what's causing the
provision failure in the first place.  The error message just
says that starting the local devserver on the DUT failed.
I can't imagine what would cause that.

I wasn't certain if it was the devserver on the DUT, or a lab devserver.

But yeah, I'm confused.

Comment 4 by xixuan@chromium.org, Dec 15 2017

The reboot in the pre-setup of the rootfs update causes '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.JvngRBuh3r' to be deleted. Then, after the reboot, the rootfs update is kicked off, but it cannot find this directory, so it cannot start the devserver. We can see this in this snippet:

2017/12/15 13:57:48.657 WARNI|      auto_updater:0737| No devserver log is available.
2017/12/15 13:57:48.658 DEBUG|dev_server_wrapper:0588| No devserver running.
2017/12/15 13:57:48.658 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpWY7OOY/testing_rsa root@chromeos4-row4-rack11-host15 -- 'PATH=/usr/local/bin:/usr/local/sbin:$PATH rsync' --version
2017/12/15 13:57:48.993 INFO |     remote_access:0459| Using rsync compression: True
Warning: Permanently added 'chromeos4-row4-rack11-host15,100.115.195.157' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row4-rack11-host15,100.115.195.157' (ED25519) to the list of known hosts.
receiving incremental file list
rsync: change_dir "/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.JvngRBuh3r" failed: No such file or directory (2)

sent 18 bytes  received 143 bytes  107.33 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.0]
rsync: [Receiver] write error: Broken pipe (32)
2017/12/15 13:57:49.422 WARNI|     remote_access:0177| Could not copy /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.JvngRBuh3r/dev_server.log from device.
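
A minimal sketch of how to confirm the failure mode described above, assuming test SSH access to the DUT (the staging path is the one from the log; the tmp.* name is whatever that particular auto-update run created):

    # On the DUT, after the pre-setup reboot of the rootfs update:
    # the cros-update staging directory should still be present.
    ls -l /mnt/stateful_partition/unencrypted/preserve/cros-update/
    # If the tmp.* directory is gone, the remote devserver cannot be started
    # from it, and the update fails with
    # "Timeout (30) waiting for remote devserver port_file".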

Owner: dtor@chromium.org
Because this is only happening on a single DUT, I believe that the TPM is in a bad state and is losing the stateful partition key each boot.

Passing to a sheriff to find a product owner to confirm.

If this is a hardware failure, please pass it back to the deputy to get the DUT replaced.

We don't want to lose this bug, since the DUT is currently locked in the lab and creating a single DUT shortage.

Comment 6 by dtor@chromium.org, Dec 15 2017

Cc: dtor@chromium.org
Owner: apronin@chromium.org
Andrey, could you please take a look at the TPM on the device?
Is there any way to just ssh in, to whatever test image is currently there?
From the logs it seems the tpm is owned but the encrypted stateful key was not written there (that happens when creating install attributes after successful owning). And the on-disk key (which is kept there until owning) is also lost - probably something happened and all of stateful was wiped w/o clearing the TPM. So, it re-creates encstateful on every boot - it seems to be stuck in this state.

From mount-encrypted.log:
[pid:247] NVRAM area has been defined but not written.
[pid:247] Using NVRAM as system key; finalization needed.
[pid:247] No usable system key found.
[pid:247] /mnt/stateful_partition/encrypted.needs-finalization does not exist.
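
A quick sketch of how to confirm that state on the DUT (the tpmc index/size and the paths are the ones used elsewhere in this thread):

    # Lockbox NVRAM space: all FFs means it is defined but was never written.
    tpmc read 0x20000004 44
    # On-disk copy of the system key, which should exist until finalization.
    ls -l /mnt/stateful_partition/encrypted.needs-finalization
    # What mount-encrypted decided on the last boot.
    cat /var/log/mount-encrypted.log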

Is it possible to just powerwash (or at least clear the tpm owner) and then go through OOBE and wait until the device is fully owned?
The eventlog is weird - a bunch of "System boot" entries 3-5 minutes apart, with no shutdowns in between, that goes like:
1 | 2017-12-13 00:44:33 | System boot | 0
2 | 2017-12-13 00:47:43 | System boot | 0
3 | 2017-12-13 00:52:00 | System boot | 0
4 | 2017-12-13 00:55:15 | System boot | 0
5 | 2017-12-13 00:58:19 | System boot | 0
6 | 2017-12-13 01:02:42 | System boot | 0
7 | 2017-12-13 01:05:52 | System boot | 0
8 | 2017-12-13 01:08:50 | System boot | 0
9 | 2017-12-13 01:13:16 | System boot | 0
10 | 2017-12-13 01:17:10 | System boot | 0
11 | 2017-12-13 01:20:48 | System boot | 0
12 | 2017-12-13 01:25:24 | System boot | 0
...

Are we EC-resetting the DUT w/o proper shutdown when attempting to recover? That's an easy way to get a dictionary attack lockout from the TPM or corrupt the filesystem.
Hm, what I see after a manual "crossystem clear_tpm_owner_request=1" is that the system boots normally once, auto-owns the TPM, and presumably saves the encstateful key in TPM NVRAM. But after a reboot I see
[pid:247] Version 2 Lockbox NVRAM area found.
[pid:247] NVRAM area has been defined but not written.
in /var/log/mount-encrypted.log.

Somehow, the TPM NVRAM contents are either not saved by cryptohomed (bug in the new Chrome OS? the device runs 10032.71.1) or lost (h/w bug in TPM?). The TPM remains owned.

eventlog.txt still shows only "System boot" entries (is that normal for veyron_speedy?). It's unlikely but possible to lose data written to tpm nvram if there was no proper shutdown.
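
For reference, a rough sketch of the manual sequence described above (the crossystem call is the one quoted; the rest just re-checks the same log):

    # Ask firmware to clear TPM ownership on the next boot, then reboot.
    crossystem clear_tpm_owner_request=1
    reboot
    # After the first boot the TPM is auto-owned and the encstateful key should
    # get finalized into NVRAM; after a second reboot, check whether
    # mount-encrypted still reports the area as defined but not written.
    grep 'NVRAM area' /var/log/mount-encrypted.log
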
> Is it possible to just powerwash (or at least clear the tpm owner)
> and then go through OOBE and wait until the device is fully owned?

The repair cycle mentioned in c#1 would have forced power wash.
Also, we forced recovery mode during that same repair cycle.

Forcing the DUT through OOBE after such a cycle can be done, but
it requires a certain amount of fiddling around.


> Are we EC-resetting the DUT w/o proper shutdown when attempting to recover?

We'd only do that if the DUT weren't responding on the network.
As I understand the stated symptoms, the DUT never goes offline,
except during reboot.  More to the point, the description says that
repair never does anything.  That means (in particular) that repair
isn't resetting the EC.

OTOH, if the TPM could get into a long-lived awkward state because
of a single EC reset, it might have happened some time ago.

> OTOH, if the TPM could get into a long-lived awkward state because
> of a single EC reset, it might have happened some time ago.

Well, the TPM can enter a long lockout after a single EC reset (it requires a combination of factors and is only relevant for some models), and we do hit it occasionally - see http://b/35578636, for example. But I meant entering a lockout after multiple EC resets. In any case, that isn't happening here.
Cc: jrbarnette@chromium.org dgarr...@chromium.org
Issue 795456 has been merged into this issue.
It looks like we have two quite different DUTs exhibiting this
basic failure symptom:
    chromeos4-row4-rack11-host15    veyron_speedy
    chromeos4-row11-rack11-host7    cyan

Do we know definitively that the failure is TPM related?

In this case, the data that was supposedly saved in TPM NVRAM turns out not to be there after reboot. If it is seen on multiple devices, it doesn't sound like a h/w TPM glitch. Rather, either (1) we don't save it there even though we think we did, or (2) we don't even attempt to save it there, but the file that keeps the data on disk until it is saved in the TPM is deleted for other reasons (including fs corruption).
But I looked at the logs in issue 795456. At least for the last failed test (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/162813009-chromeos-test/chromeos4-row11-rack11-host7/crashinfo.chromeos4-row11-rack11-host7/var/log/) mount-encrypted.log shows an unowned TPM instead of "NVRAM area has been defined but not written", so I'm not sure it's the same bug.
I see that before the reboot, the lockbox space indeed contains FFs:
# tpmc read 0x20000004 44
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
But there is /mnt/stateful_partition/encrypted.needs-finalization.

After reboot, the system still finds the TPM NVRAM uninitialized (all FFs), as expected.
But it doesn't find /mnt/stateful_partition/encrypted.needs-finalization:
[pid:248] /mnt/stateful_partition/encrypted.needs-finalization does not exist.

So, the issue seems to be the disappearing /mnt/stateful_partition/encrypted.needs-finalization file.
And it's not like the whole stateful is erased. Other dirs show access timestamps before the last tpm clear:
 # ls -l /mnt/stateful_partition
total 82960
drwxr-xr-x. 22 root root       4096 Dec 11 22:18 dev_image
drwxr-xr-x.  5 root root       4096 Dec 15 16:45 encrypted
-rw-------.  1 root root 3317653504 Dec 15 16:51 encrypted.block
-rw-------.  1 root root         48 Dec 15 16:45 encrypted.needs-finalization
drwxr-xr-x.  3 root root       4096 Dec 12 05:07 etc
drwxr-xr-x.  6 root root       4096 Dec 12 05:07 home
drwxr-xr-x.  4 root root       4096 Dec 12 05:02 unencrypted
drwxr-xr-x. 12 root root       4096 Dec 11 22:13 var_overlay


However, anything new I create on stateful disappears after reboot:
localhost ~ # cp  /mnt/stateful_partition/encrypted.needs-finalization /mnt/stateful_partition/zzz
localhost ~ # ls -la /mnt/stateful_partition/
total 55788
drwxr-xr-x.  8 root root       4096 Dec 15 16:59 .
drwxr-xr-x.  3 root root       4096 Dec 11 22:12 ..
-rw-r--r--.  1 root root          0 Nov 16 16:41 .labmachine
-rw-r--r--.  1 root root          0 Dec 12 05:07 .tpm_owned
-rw-------.  1 root root        317 Dec 12 05:07 .tpm_status
-rw-------.  1 root root          8 Dec 12 05:07 .tpm_status.sum
drwxr-xr-x. 22 root root       4096 Dec 11 22:18 dev_image
drwxr-xr-x.  5 root root       4096 Dec 15 16:58 encrypted
-rw-------.  1 root root 3317653504 Dec 15 16:58 encrypted.block <-- still here
-rw-------.  1 root root         48 Dec 15 16:58 encrypted.needs-finalization <--- still here
drwxr-xr-x.  3 root root       4096 Dec 12 05:07 etc
drwxr-xr-x.  6 root root       4096 Dec 12 05:07 home
drwxr-xr-x.  4 root root       4096 Dec 12 05:02 unencrypted
drwxr-xr-x. 12 root root       4096 Dec 11 22:13 var_overlay
-rw-------.  1 root root         48 Dec 15 16:59 zzz <--- still here
localhost ~ # reboot
...
localhost ~ # ls -la /mnt/stateful_partition/
total 7116
drwxr-xr-x.  8 root root       4096 Dec 15 16:59 .
drwxr-xr-x.  3 root root       4096 Dec 11 22:12 ..
-rw-r--r--.  1 root root          0 Nov 16 16:41 .labmachine
-rw-r--r--.  1 root root          0 Dec 12 05:07 .tpm_owned
-rw-------.  1 root root        317 Dec 12 05:07 .tpm_status
-rw-------.  1 root root          8 Dec 12 05:07 .tpm_status.sum
drwxr-xr-x. 22 root root       4096 Dec 11 22:18 dev_image
drwxr-xr-x.  5 root root       4096 Dec 15 16:59 encrypted
-rw-------.  1 root root 3317653504 Dec 15 17:00 encrypted.block <--- old file lost, new one created in place
-rw-------.  1 root root         48 Dec 15 16:59 encrypted.needs-finalization <--- old file lost, new one created in place
drwxr-xr-x.  3 root root       4096 Dec 12 05:07 etc
drwxr-xr-x.  6 root root       4096 Dec 12 05:07 home
drwxr-xr-x.  4 root root       4096 Dec 12 05:02 unencrypted
drwxr-xr-x. 12 root root       4096 Dec 11 22:13 var_overlay
localhost ~ #  <--- no 'zzz' after reboot


Something must be wrong with the filesystem?
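
One way to check that suspicion, as a sketch: assuming the standard Chrome OS disk layout where the stateful partition is /dev/mmcblk0p1, boot the DUT from USB (so stateful isn't mounted) and run a read-only filesystem check:

    # Read-only check of the internal stateful partition; -f forces a check
    # even if the fs looks clean, -n answers "no" so nothing is modified.
    e2fsck -fn /dev/mmcblk0p1
    # A flood of errors here would point at eMMC / filesystem corruption
    # rather than a TPM problem.
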
Cc: -dtor@chromium.org apronin@chromium.org
Labels: OS-Chrome
Owner: dtor@chromium.org
Back to the sheriffs. Doesn't seem to be TPM-related.
Cc: alemate@chromium.org
As a side note, I see that the TPM is auto-owned on each boot in testlab. That leads to a created but uninitialized lockbox (FFs), since the TPM was owned w/o any user login (so there are no install attributes to write there). However, that shouldn't have stopped it from operating normally, as long as encrypted.needs-finalization was preserved.

And the TPM is auto-owned because of this CL, I believe: https://crrev.com/c/688763. See the latest comments there.

+alemate just in case it still leads to any trouble. We may need to return false from ShallAttemptTpmOwnership() for Chromium builds then.

Comment 21 by dtor@chromium.org, Dec 18 2017

Cc: bmgordon@chromium.org
Owner: martinroth@chromium.org
Bouncing to the current week's sheriffs; I will be OOO.
Status: Assigned (was: Untriaged)
Owner: ----
<sigh> I just tried logging in to the DUT, to poke around at its
current state.  Then I tried to reboot.  The DUT failed to boot
from internal storage (it never came back).  The DUT _did_ boot from
USB (using servo) without difficulty.

I tested and saw the problem (boot from eMMC failed, boot from USB
worked) twice.  Then, while booted from USB, I tried examining
the current root file system (ROOT-A).  That produced a warning:
    localhost media # mount -o ro /dev/mmcblk0p3 /media/local 
    localhost media # ls /media/local/
    ls: cannot access '/media/local/run': Structure needs cleaning
    bin  boot  debugd  dev  etc  home  lib  lost+found  media  mnt  opt  postinst  proc  root  run  sbin  sys  tmp  usr  var

Then I tried re-installing from USB with chromeos-install:
    Installing partition 1 to /dev/mmcblk0
    Installing the stateful partition...
    chroot: failed to run command '/usr/bin/cros_installer': Structure needs cleaning
[ etc. ]

So, there's reason to suspect an internal storage failure.
I've filed b/72967028 to replace the unit.
