Status: Fixed
Closed: Jan 16
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

jetstream hosts repeatedly failing tests - TPM issue?
Reported by Project Member, Dec 12
chromeos6-row22-jetstream-host5 keeps failing whirlwind tests with:

  Timed out waiting for AP to appear operational

The history of the DUT:
The DUT is currently locked, please unlock as soon as this is resolved.
This seems the same as

2017-11-15T13:26:31.014235+00:00 ERR attestationd[1169]: TPM error 0x21 (Decryption error): Unseal: Error calling Tspi_Data_Unseal
2017-11-15T13:26:31.014296+00:00 ERR attestationd[1169]: UnsealKey: Cannot unseal aes key.
2017-11-15T13:26:31.014348+00:00 ERR attestationd[1169]: Attestation: Could not unseal decryption key.

and we need to replace the hardware. See also http://b/33758106
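
To check whether other DUTs show the same signature, the log lines quoted above can be matched mechanically. This is a minimal sketch only: the function name `find_unseal_errors` is hypothetical, and the pattern is simply the `attestationd` text from the excerpt above, not a documented error format.

```python
import re

# Pattern copied from the attestationd lines quoted above (an assumption,
# not a documented format): TPM error 0x21 reported during an unseal.
UNSEAL_ERROR = re.compile(
    r"attestationd\[\d+\]: TPM error 0x21 \(Decryption error\)"
)


def find_unseal_errors(log_text):
    """Return the log lines that carry the TPM unseal error signature."""
    return [line for line in log_text.splitlines() if UNSEAL_ERROR.search(line)]


sample = """\
2017-11-15T13:26:31.014235+00:00 ERR attestationd[1169]: TPM error 0x21 (Decryption error): Unseal: Error calling Tspi_Data_Unseal
2017-11-15T13:26:31.014296+00:00 ERR attestationd[1169]: UnsealKey: Cannot unseal aes key.
"""

for hit in find_unseal_errors(sample):
    print(hit)
```

Only the first sample line matches, since the follow-up `UnsealKey` line does not repeat the error code.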

Summary: chromeos6-row22-jetstream-host5 and host6 repeatedly failing tests (was: chromeos6-row22-jetstream-host5 repeatedly failing tests)
chromeos6-row22-jetstream-host6 is also failing:

It seems we have more DUTs experiencing this:
- chromeos6-row22-jetstream-host6

I wonder whether this is indeed a DUT problem, as asserted in http://b/33758106, or a product problem.
Both DUTs are now locked with a reason pointing to this bug. That decreases our limited inventory, but keeps them from failing tests.
I moved two additional hosts from pool:jetstream-test to pool:cq to keep the same number of whirlwind units until the failing units are fixed or replaced.

chromeos6-row22-jetstream-host9 board:whirlwind pool:cq
chromeos6-row22-jetstream-host10 board:whirlwind pool:cq

I reran the test that previously failed on chromeos6-row22-jetstream-host6, and it passed. I've seen this failure once in a while before; I think we should let it run a few more times and see what happens.

Same failure on chromeos6-row22-jetstream-host7 today:

It looks like about 6% of builds are failing with this same problem, but I haven't seen any cases where two in a row fail.
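
As a rough sanity check on that observation, here is the arithmetic under the assumption (not established in this bug) that failures are independent from build to build. The helper name `p_consecutive` is hypothetical.

```python
def p_consecutive(failure_rate, run_length=2):
    """Probability that `run_length` builds in a row all fail,
    assuming each build fails independently at `failure_rate`."""
    return failure_rate ** run_length


rate = 0.06  # the "about 6%" figure quoted above
p_pair = p_consecutive(rate)
print(f"{p_pair:.4f}")      # 0.0036
print(round(1 / p_pair))    # 278
```

Under that assumption, a back-to-back failure would be expected roughly once per 278 consecutive build pairs, so not having seen one yet does not by itself say much either way.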
chromeos6-row22-jetstream-host7 is still failing:

Issue 798540 has been merged into this issue.
Labels: -Pri-3 Pri-1
Status: Assigned
Summary: jetstream hosts repeatedly failing tests - TPM issue? (was: chromeos6-row22-jetstream-host5 and host6 repeatedly failing tests)
There appear to be multiple buganizer bugs tracking this incident,
including these:

One of the bugs above needs to become the canonical bug tracking
the problem; the others should be closed as a duplicate.

IIUC, the suspicion is that the DUT's TPM gets into a bad state.

Currently, the failures seem to be attributable to
chromeos6-row22-jetstream-host7 (only).  I've locked the DUT citing
this bug.

I'm holding this bug open as the "make it so whirlwind-paladin isn't
holding up the CQ" bug.  If locking the DUT doesn't silence the problem,
we'll have to make whirlwind paladin experimental.  Once whirlwind-paladin
is green or squelched, we can close this bug in favor of buganizer bugs
that will find a permanent solution.

> I think this caused CQ failure again.

Yup.  The failed test happened before I locked the DUT.

One point of note:  The DUT in question has a system clock that's
some 4.5 days behind:
    $ TZ=UTC date ; ssh chromeos6-row22-jetstream-host7 date ; TZ=UTC date
    Mon Jan  8 18:39:27 UTC 2018
    Thu Jan  4 05:39:25 UTC 2018
    Mon Jan  8 18:39:29 UTC 2018

I don't know if that's contributing to the troubles.
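
For reference, the skew can be computed directly from the timestamps printed in the transcript above. This is a small sketch; the helper name `skew_days` is hypothetical, and the format string assumes the `date` output layout shown there with a literal `UTC` zone.

```python
from datetime import datetime

# Matches the `date` output in the transcript, e.g.
# "Mon Jan  8 18:39:27 UTC 2018" (literal "UTC" is an assumption).
FMT = "%a %b %d %H:%M:%S UTC %Y"


def skew_days(local_stamp, dut_stamp):
    """Return how many days the DUT clock lags the local clock."""
    local = datetime.strptime(local_stamp, FMT)
    dut = datetime.strptime(dut_stamp, FMT)
    return (local - dut).total_seconds() / 86400


skew = skew_days("Mon Jan  8 18:39:27 UTC 2018", "Thu Jan  4 05:39:25 UTC 2018")
print(round(skew, 2))  # 4.54
```

That is 4 days, 13 hours and 2 seconds, consistent with the "some 4.5 days" estimate.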

Josue and I have dug deeply enough into this issue and found enough failure cases and logs for the engineer in charge of the TPM hardware on Google Wifi and OnHub to accept this bug.

This engineer needs to be identified. It's possible that this engineer does not exist or does not know they are on the hook for supporting this subsystem.

Josue and the Jetstream test team can help with pair debugging if required but will not be able to drive this problem to resolution.
Meanwhile hardware will be replaced. I will keep the units so that the dev team can help us debug the problem. It seems the conclusion the last time this happened was to replace the HW.

I'm trying to find a couple of units to replace:
> Meanwhile hardware will be replaced. [ ... ]

Given the nature of the failures, I suspect that the root cause
is in software, or is at least software triggered.  Replacing the
hardware will likely stop the current failures, but it's not a
scalable long-term solution.

> I'm trying to find a couple of units to replace:
> chromeos6-row22-jetstream-host5
> chromeos6-row22-jetstream-host6
> chromeos6-row22-jetstream-host7

At this time, only chromeos6-row22-jetstream-host7 is exhibiting
failures, and it's been locked for some time.

From what I've seen of TPM-related problems, it can sometimes
happen that this kind of symptom goes away if the DUT is allowed
to sit long enough, so it's not implausible that -host5 and -host6
simply fixed themselves.  It's also possible (though maybe less
likely) that -host7 has fixed itself by now.

OK.  I did some digging.  It seems all three of these DUTs are now

Moreover, all three show the 'tpm-manager dump_status' output that's
been flagged as characteristic of b/33758106.  So, whatever we do, we
need to do it for all three.

I'd like to close this bug on the grounds that the original failure
symptom (CQ failures) has been eliminated (for now).  We need separate
(buganizer) bugs for DUT replacement, and explaining the root cause of
the failures.

josuehe@ - can you comment on what bug we should use to track DUT
replacement, and what to use to track the underlying bug?

- DUT replacement: we can track it with b/71636396
- Underlying bug: we should keep using b/33758106

This problem has now occurred again, on a new host,

I believe it's time to make the whirlwind paladin experimental,
until we can find a fix, a workaround, or other mitigation.

Comment 26 by, Jan 16 (6 days ago)
Status: Fixed
The whirlwind paladin is now experimental, and will need to remain so
until the underlying problem in b/33758106 is addressed, at least
sufficiently to mitigate the failures.

Replacing DUTs may or may not be necessary; it is definitely not
sufficient.  We can't make whirlwind non-experimental until we
have some guarantee that more DUTs won't be affected.

I'm closing this bug, since the CQ failures have been stopped;
work on mitigating/fixing should presumably take place in
