jetstream hosts repeatedly failing tests - TPM issue?
Project Member Reported by dgarr...@chromium.org, Dec 12
chromeos6-row22-jetstream-host5 keeps failing whirlwind tests with:
Timed out waiting for AP to appear operational
The history of the DUT: http://chromeos-server155.cbf.corp.google.com/afe/#tab_id=view_host&object_id=7972
The DUT is currently locked, please unlock as soon as this is resolved.
This seems the same as https://bugs.chromium.org/p/chromium/issues/detail?id=658338#c19:
2017-11-15T13:26:31.014235+00:00 ERR attestationd: TPM error 0x21 (Decryption error): Unseal: Error calling Tspi_Data_Unseal
2017-11-15T13:26:31.014296+00:00 ERR attestationd: UnsealKey: Cannot unseal aes key.
2017-11-15T13:26:31.014348+00:00 ERR attestationd: Attestation: Could not unseal decryption key.
and we need to replace the hardware. See also http://b/33758106
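For anyone triaging other DUTs, a minimal sketch of scanning a log for the attestationd signature quoted above. The regular expression and the function name find_unseal_errors are illustrative assumptions, not part of any existing tool; the sample lines are the three copied from this comment.

```python
import re

# Assumption: log lines have the shape
#   "<timestamp> ERR attestationd: TPM error 0x.. (...): Unseal: ..."
# as in the three lines quoted in this bug.
TPM_UNSEAL_RE = re.compile(
    r"^(?P<ts>\S+)\s+ERR\s+attestationd:\s+"
    r"TPM error (?P<code>0x[0-9a-fA-F]+)\b.*Unseal"
)


def find_unseal_errors(lines):
    """Return (timestamp, error_code) pairs for TPM unseal failures."""
    hits = []
    for line in lines:
        match = TPM_UNSEAL_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("code")))
    return hits


# The log excerpt from this comment, one entry per line.
SAMPLE = [
    "2017-11-15T13:26:31.014235+00:00 ERR attestationd: TPM error 0x21 "
    "(Decryption error): Unseal: Error calling Tspi_Data_Unseal",
    "2017-11-15T13:26:31.014296+00:00 ERR attestationd: UnsealKey: "
    "Cannot unseal aes key.",
    "2017-11-15T13:26:31.014348+00:00 ERR attestationd: Attestation: "
    "Could not unseal decryption key.",
]

if __name__ == "__main__":
    for ts, code in find_unseal_errors(SAMPLE):
        print(ts, code)
```

Only the first of the three lines carries the TPM error code, so it is the only one the sketch reports; the two follow-up lines are consequences of the same failure.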
chromeos6-row22-jetstream-host6 is also failing: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/10139 https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/10135
Seems we have more DUTs experiencing this: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/10139 - chromeos6-row22-jetstream-host6
I wonder whether it is indeed a DUT problem, as asserted in http://b/33758106, or a product problem.
Both DUTs are now locked with a reason pointing to this bug. That decreases our limited inventory, but keeps them from failing tests.
I moved two additional hosts from pool:jetstream-test to pool:cq to keep the same number of whirlwind units until the failing units are fixed or replaced.
chromeos6-row22-jetstream-host9 board:whirlwind pool:cq
chromeos6-row22-jetstream-host10 board:whirlwind pool:cq
chromeos6-row22-jetstream-host7 failed: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17203
I ran the test that failed before on chromeos6-row22-jetstream-host6 and it was successful. I've seen failures like this once in a while before. I think we should let it run a few more times and see what happens.
Same failure on chromeos6-row22-jetstream-host7 today: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/10203
Same failure today on chromeos6-row22-jetstream-host7: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/10222
It looks like about 6% of builds are failing with this same problem, but I haven't seen any cases where two in a row fail.
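As a rough sanity check on the "no two in a row" observation, here is a small sketch of the arithmetic, assuming (purely for illustration) that builds fail independently at the 6% rate quoted above. The build count of 100 is a hypothetical number, not taken from the builder history.

```python
# Assumption: each build fails independently with probability 0.06.
# Real failures may be correlated (for example, tied to one DUT), in which
# case this model does not apply.
FAILURE_RATE = 0.06


def prob_two_in_a_row(p):
    """Probability that a given pair of adjacent builds both fail."""
    return p * p


def expected_consecutive_pairs(p, num_builds):
    """Expected number of adjacent failing pairs among num_builds builds."""
    return max(num_builds - 1, 0) * prob_two_in_a_row(p)


if __name__ == "__main__":
    hypothetical_builds = 100  # illustrative only
    print(prob_two_in_a_row(FAILURE_RATE))
    print(expected_consecutive_pairs(FAILURE_RATE, hypothetical_builds))
```

Under that independence assumption, back-to-back failures would be rare (well under 1% of adjacent pairs), so not having seen one yet does not by itself say much about whether the failures are random or tied to a specific DUT.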
chromeos6-row22-jetstream-host7 is still failing: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/10258
Issue 798540 has been merged into this issue.
There appear to be multiple buganizer bugs tracking this incident, including these:
b/71636396
b/33758106
One of the bugs above needs to become the canonical bug tracking the problem; the others should be closed as duplicates.
IIUC, the suspicion is that the DUT's TPM gets into a bad state. Currently, the failures seem to be attributable to chromeos6-row22-jetstream-host7 (only). I've locked the DUT citing this bug.
I'm holding this bug open as the "make it so whirlwind-paladin isn't holding up the CQ" bug. If locking the DUT doesn't silence the problem, we'll have to make whirlwind paladin experimental. Once whirlwind-paladin is green or squelched, we can close this bug in favor of buganizer bugs that will find a permanent solution.
I think this caused CQ failure again. https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17433
> I think this caused CQ failure again.
>
> https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17433
Yup. The failed test happened before I locked the DUT.
One point of note: The DUT in question has a system clock that's some 4.5 days behind:
$ TZ=UTC date ; ssh chromeos6-row22-jetstream-host7 date ; TZ=UTC date
Mon Jan 8 18:39:27 UTC 2018
Thu Jan 4 05:39:25 UTC 2018
Mon Jan 8 18:39:29 UTC 2018
I don't know if that's contributing to the troubles.
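The skew can be computed directly from the three `date` outputs above. This is a minimal sketch, assuming all three timestamps are in UTC as printed; the helper names parse_date_output and clock_skew are hypothetical, and the local clock is taken as the midpoint of the two local readings that bracket the ssh call.

```python
from datetime import datetime, timedelta

# Format of the `date` output shown in this comment, e.g.
# "Mon Jan 8 18:39:27 UTC 2018".
DATE_FORMAT = "%a %b %d %H:%M:%S %Z %Y"


def parse_date_output(text):
    """Parse one line of `date` output into a naive datetime (UTC assumed)."""
    return datetime.strptime(text.strip(), DATE_FORMAT)


def clock_skew(local_before, remote, local_after):
    """Return how far the remote clock is behind the local clock.

    The local time is estimated as the midpoint of the two local readings
    taken before and after the remote reading.
    """
    before = parse_date_output(local_before)
    after = parse_date_output(local_after)
    midpoint = before + (after - before) / 2
    return midpoint - parse_date_output(remote)


if __name__ == "__main__":
    skew = clock_skew(
        "Mon Jan 8 18:39:27 UTC 2018",
        "Thu Jan 4 05:39:25 UTC 2018",
        "Mon Jan 8 18:39:29 UTC 2018",
    )
    print(skew, round(skew.total_seconds() / 86400, 1))
```

For the readings above this works out to 4 days, 13 hours and 3 seconds, which matches the "some 4.5 days" estimate.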
Josue and I have dug deeply enough into this issue and found enough failure cases and logs for the engineer in charge of the TPM hardware on Google Wifi and OnHub to accept this bug. This engineer needs to be identified. It's possible that this engineer does not exist or does not know they are on the hook for supporting this subsystem. Josue and the Jetstream test team can help with pair debugging if required but will not be able to drive this problem to resolution.
Meanwhile hardware will be replaced. I will keep the units so that the dev team can help us debug the problem. It seems the conclusion the last time this happened was to replace the HW.
I'm trying to find a couple of units to replace:
chromeos6-row22-jetstream-host5
chromeos6-row22-jetstream-host6
chromeos6-row22-jetstream-host7
> Meanwhile hardware will be replaced. [ ... ]
Given the nature of the failures, I suspect that the root cause is in software, or is at least software triggered. Replacing the hardware will likely stop the current failures, but it's not a scalable long-term solution.
> I'm trying to find a couple of units to replace:
> chromeos6-row22-jetstream-host5
> chromeos6-row22-jetstream-host6
> chromeos6-row22-jetstream-host7
At this time, only chromeos6-row22-jetstream-host7 is exhibiting failures, and it's been locked for some time. From what I've seen of TPM-related problems, this kind of symptom can sometimes go away if the DUT is allowed to sit long enough, so it's not implausible that -host5 and -host6 simply fixed themselves. It's also possible (though maybe less likely) that -host7 has fixed itself by now.
OK. I did some digging. It seems all three of these DUTs are now locked:
chromeos6-row22-jetstream-host5
chromeos6-row22-jetstream-host6
chromeos6-row22-jetstream-host7
Moreover, all three show the 'tpm-manager dump_status' output that's been flagged as characteristic of b/33758106. So, whatever we do, we need to do it for all three.
I'd like to close this bug on the grounds that the original failure symptom (CQ failures) has been eliminated (for now). We need separate (buganizer) bugs for DUT replacement and for explaining the root cause of the failures.
josuehe@ - can you comment on what bug we should use to track DUT replacement, and what to use to track the underlying bug?
This problem has now occurred again, on a new host, chromeos6-row22-jetstream-host8: https://luci-milo.appspot.com/buildbot/chromeos/whirlwind-paladin/10397 I believe it's time to make the whirlwind paladin experimental, until we can find a fix, a workaround, or other mitigation.
Jan 16
The whirlwind paladin is now experimental, and will need to remain so until the underlying problem in b/33758106 is addressed, at least sufficiently to mitigate the failures. Replacing DUTs may or may not be necessary; it is definitely not sufficient. We can't make whirlwind non-experimental until we have some guarantee that more DUTs won't be affected. I'm closing this bug, since the CQ failures have been stopped; work on mitigating/fixing should presumably take place in buganizer.