link-paladin, stumpy-paladin, nyan_kitty-paladin and a few others keep failing with "HWTest did not complete due to infrastructure issues (code 3)"
Issue description
They've all failed 4 times in a row, apparently during provisioning, so the CQ automatically retries all the CLs. Not sure if it's an infra issue or a bad CL that keeps being retried automatically. This blocks the CQ; it looks like ~50 CLs are in the CQ-retry loop => P0. Assigning to the infra deputy to check whether it's really an infra issue; sheriffs CC'd.
Aug 9 2017
Impacted paladins:
"NOTE: The Commit Queue will retry your change automatically. The following build(s) failed:
wolf-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/wolf-paladin/15252
link-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/link-paladin/29242
nyan_big-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/nyan_big-paladin/2549
lumpy-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/lumpy-paladin/29211
nyan_kitty-paladin: The HWTest [bvt-cq] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/2551
peppy-paladin: The HWTest [bvt-cq] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/peppy-paladin/16070
peach_pit-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-paladin/16652
stumpy-paladin: The HWTest [bvt-cq] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/stumpy-paladin/29247
The build encountered Chrome OS Lab infrastructure issues. Your change will not be blamed for the failure."
Aug 9 2017
Evidence is that there's a bad CL that causes 8 boards not to boot properly, causing provision failures. I'm going to go mark -1 Verified on every CL that I can't prove definitively innocent.
Aug 9 2017
Selected potential suspects are now -1 Verified. Next step is to kill the current CQ run, to allow the (supposedly) innocent CLs to prove their innocence.
Aug 9 2017
FTR.
Actual failure during provision:
FAIL ---- verify.tpm timestamp=1502286781 localtime=Aug 09 06:53:01 command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_h76tWtssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos6-row2-rack8-host10 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::verify|run|wrapper] -> ssh_run(cryptohome
--action=status)\";fi; cryptohome --action=status"
Exit status: 134
Duration: 0.315699100494
stderr:
bash: line 1: 3009 Aborted (core dumped) cryptohome --action=status
So, something broke cryptohome to the point where it's crashing.
Aug 9 2017
Also:
INFO ---- New Crash Dump timestamp=1502286799 localtime=Aug 09 06:53:19 /usr/local/autotest/results/hosts/chromeos6-row2-rack8-host10/888884-provision/20170908063503/crashinfo.chromeos6-row2-rack8-host10/cryptohome.20170809.065300.3009.dmp
INFO ---- Orphaned Crash Dump timestamp=1502286799 localtime=Aug 09 06:53:19 /var/spool/crash/cryptohome.20170809.065300.3009.meta
INFO ---- Orphaned Crash Dump timestamp=1502286799 localtime=Aug 09 06:53:19 /var/spool/crash/cryptohome.20170809.065300.3009.dmp
INFO ---- Orphaned Crash Dump timestamp=1502286799 localtime=Aug 09 06:53:19 /var/spool/crash/os-release
INFO ---- Orphaned Crash Dump timestamp=1502286799 localtime=Aug 09 06:53:19 /var/spool/crash/cryptohome.20170809.065300.3009.core
INFO ---- Orphaned Crash Dump timestamp=1502286799 localtime=Aug 09 06:53:19 /var/spool/crash/lsb-release
cryptohome is the only thing leaving crashdumps, so it is unlikely that many / arbitrary services are crashy. This seems localized to just the one daemon.
Aug 9 2017
Some diagnosis and explanation.
First, these are the affected boards:
wolf link nyan_big lumpy nyan_kitty peppy peach_pit stumpy
I don't know for sure, but I _think_ these are all using the
same kernel.
A spot check says that the symptom is a 100% provisioning failure
rate for the affected boards. That's not infrastructure; the
build must be bad. However, the canaries are still healthy. So,
the failure must be a bad CL in the runs.
Next, the problem started with this CQ run:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15665
This is the full blamelist from the run:
adhd | dgreid | 590714 | fails:5 | CRAS: timing_ut - Add more input tests.
autotest | ayatane | 511762 | [autotest] Remove old drone log conversion
autotest | ayatane | 511763 | fails:0(1) | [autotest] Clean up ensure logic
autotest | ayatane | 511764 | fails:0(1) | [autotest] Make sync_send_file_to private
chromeos-admin | jrbarnette | *423471 | fails:1 | [lab-tools] Use venv code to read the devserver configuration.
chromite | norvez | 606697 | Promote betty-paladin back to important
chromite | nxia | 604762 | Set the buildslave_type to GCE for pre-cqs except lakitu-pre-cqs
chromiumos-overlay | derat | 600518 | target-chromium-os: Remove "feedback" USE flag.
chromiumos-overlay | skau | 604380 | net-print/cups: Fix typo in filter file.
chromiumos-overlay | vapier | 587950 | fails:1 | gmock32/gtest32: drop unused packages
ec | nvaccaro | 540124 | sensors: add bmi160 & kionix orientation driver
ec | rspangler | 602695 | fails:2 | cr50: Clean up device state code
ec | rspangler | 604499 | fails:2 | cr50: Merge CCD device handling to rdd.c
ec | rspangler | 604500 | fails:2 | cr50: Merge BattPrsnt device handling to wp.c
ec | vijay.p.hiremath-AT-intel.com | 604731 | glkrvp: Enable charger & smart battery
ec | vijay.p.hiremath-AT-intel.com | 604732 | GLKRVP: Enable fast charging and battery cut-off
platform2 | derat | 605016 | power: Ignore peripheral batteries with "Unknown" status.
platform2 | isandrk | 600214 | cryptohome: Add a DBus call to send TPM version information
platform2 | mpdenton | 602882 | cryptohome: Use namespaces for cryptohome and cryptohomed
Some repos can be ruled out because they don't contribute to the
running image. That suggests we should clear changes to chromite,
ec, autotest, and chromeos-admin. The remaining changes are the ones
that got the -1 Verified.
If the diagnosis thus far is correct, then the new CQ run will complete
without this failure. At that point, we can set about figuring out which
CL is the real culprit.
Aug 9 2017
> So, something broke cryptohome to the point where it's crashing.
Hmmm... That would indict one of these CLs:
platform2 | isandrk | 600214 | cryptohome: Add a DBus call to send TPM version information
platform2 | mpdenton | 602882 | cryptohome: Use namespaces for cryptohome and cryptohomed
Those were already high on my list of real suspects, because
cryptohome is one of the services that can take down provisioning.
I note, however, that the failures I looked at appeared to be mere
ssh timeouts. That would suggest something wrong beyond
just cryptohome.
Aug 9 2017
Also:
2017-08-09T13:51:41.323167+00:00 ERR cryptohomed[1154]: libminijail[1154]: unshare(CLONE_NEWCGROUP) failed: Invalid argument
2017-08-09T13:51:41.329063+00:00 WARNING kernel: [ 15.339256] init: cryptohomed main process (1154) killed by ABRT signal
2017-08-09T13:51:41.329081+00:00 WARNING kernel: [ 15.339296] init: cryptohomed main process ended, respawning
Looks like the new call to minijail_namespace_cgroups(...) in https://chromium-review.googlesource.com/c/602882/3/cryptohome/cryptohome.cc#92 didn't go very well.
Aug 9 2017
Re #9: I'm only speaking for the few where the failure was obviously cryptohomed crashing. If there are other failure modes, I haven't chanced upon them.
Aug 9 2017
Based on the log message in comment #10, it must be this CL:
platform2 | mpdenton | 602882 | cryptohome: Use namespaces for cryptohome and cryptohomed
This is consistent with all the failing boards having one kernel version. That CL should now be Verified-1, correct?
Aug 9 2017
From go/crconn:
3.8 devices - wolf, link, peach_pit, peppy
3.4 devices - lumpy, stumpy (I assume these were upgraded to 3.8)
3.10 devices - nyan_big, nyan_kitty
I could possibly be the culprit here for 3.8. I chumped in Intel wifi bugfixes on 4 kernels yesterday (3.8, 3.14, 3.18, and 4.4 - no Intel wifi devices on 3.10 anymore).
https://chromium-review.googlesource.com/#/q/topic:Core28-bugfixes-merge+(status:open+OR+status:merged)
https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/607364/-1..1
3.8 was the only kernel that did not get enough testing (I had cbuildbot --remote kicked off). Looking into this now.
Aug 9 2017
CLONE_NEWCGROUP was introduced in Linux kernel 4.6 (http://man7.org/linux/man-pages/man2/unshare.2.html), so that seems like the likely cause of the 3.8 boards failing.
Aug 9 2017
I just did a spot check of the current CQ run. Several DUTs on the
wolf CQ have successfully provisioned with the latest build. That
means that one of the CLs marked "-1 Verified" is the cause of the
overnight outage.
Based on the evidence to date, I'd recommend we blame this CL as the
sole culprit:
https://chromium-review.googlesource.com/#/c/chromiumos/platform2/+/602882/
Aug 9 2017
I've dropped "-1 Verified" for all of the CLs except the suspected culprit. Those CLs will start back through with the next CQ run. If we're wrong about the culprit, then we'll see this problem all over again.
Aug 10 2017
CQ is working again, culprit CL:602882 has been identified.
Jan 22 2018
Comment 1 by norvez@chromium.org, Aug 9 2017