
Issue 753838

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug




link-paladin, stumpy-paladin, nyan_kitty-paladin and a few others keep failing with "HWTest did not complete due to infrastructure issues (code 3)"

Project Member Reported by norvez@chromium.org, Aug 9 2017

Issue description


They've all failed 4 times in a row, apparently during provisioning, so the CQ automatically retries all the CLs. Not sure if it's an infra issue or a bad CL that keeps being retried automatically.

This blocks the CQ; it looks like ~50 CLs are stuck in the CQ-retry loop => P0

Assigning to the infra deputy to check if it's really an infra issue, sheriffs CC'd.
 
Components: Infra>Client>ChromeOS
Impacted paladins:

"
NOTE: The Commit Queue will retry your change automatically.

The following build(s) failed:

wolf-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/wolf-paladin/15252

link-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/link-paladin/29242

nyan_big-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/nyan_big-paladin/2549

lumpy-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/lumpy-paladin/29211

nyan_kitty-paladin: The HWTest [bvt-cq] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/2551

peppy-paladin: The HWTest [bvt-cq] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/peppy-paladin/16070

peach_pit-paladin: The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-paladin/16652

stumpy-paladin: The HWTest [bvt-cq] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/stumpy-paladin/29247

The build encountered Chrome OS Lab infrastructure issues.  Your change will not be blamed for the failure.
"
The evidence is that there's a bad CL causing 8 boards not to
boot properly, which leads to the provision failures.

I'm going to go mark -1 Verified on every CL that I can't prove
definitively innocent.

Selected potential suspects are now -1 Verified.

Next step is to kill the current CQ run, to allow the (supposedly)
innocent CLs to prove their innocence.

FTR.
Actual failure during provision:

	FAIL	----	verify.tpm	timestamp=1502286781	localtime=Aug 09 06:53:01	command execution error
  * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_h76tWtssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos6-row2-rack8-host10 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::verify|run|wrapper] -> ssh_run(cryptohome
      --action=status)\";fi; cryptohome --action=status"
  Exit status: 134
  Duration: 0.315699100494
  
  stderr:
  bash: line 1:  3009 Aborted                 (core dumped) cryptohome --action=status

So, something broke cryptohome to the point where it's crashing.
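
For anyone decoding the numbers: exit status 134 is 128 + 6, i.e. the process was killed by SIGABRT, which matches the "Aborted (core dumped)" line above. A minimal C sketch of that decoding, purely illustrative (not autotest code):

    /* Illustration only: a shell reports a child killed by a signal as
     * 128 + signal number, so SIGABRT (6) shows up as exit status 134. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0)
            abort();  /* child dies on SIGABRT, like cryptohome here */

        int status;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
            printf("killed by signal %d -> shell exit status %d\n",
                   WTERMSIG(status), 128 + WTERMSIG(status));  /* 6 -> 134 */
        return 0;
    }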
Also:
INFO	----	New Crash Dump	timestamp=1502286799	localtime=Aug 09 06:53:19	/usr/local/autotest/results/hosts/chromeos6-row2-rack8-host10/888884-provision/20170908063503/crashinfo.chromeos6-row2-rack8-host10/cryptohome.20170809.065300.3009.dmp
INFO	----	Orphaned Crash Dump	timestamp=1502286799	localtime=Aug 09 06:53:19	/var/spool/crash/cryptohome.20170809.065300.3009.meta
INFO	----	Orphaned Crash Dump	timestamp=1502286799	localtime=Aug 09 06:53:19	/var/spool/crash/cryptohome.20170809.065300.3009.dmp
INFO	----	Orphaned Crash Dump	timestamp=1502286799	localtime=Aug 09 06:53:19	/var/spool/crash/os-release
INFO	----	Orphaned Crash Dump	timestamp=1502286799	localtime=Aug 09 06:53:19	/var/spool/crash/cryptohome.20170809.065300.3009.core
INFO	----	Orphaned Crash Dump	timestamp=1502286799	localtime=Aug 09 06:53:19	/var/spool/crash/lsb-release


cryptohome is the only thing leaving crash dumps, so it is unlikely that many or arbitrary services are crashing. This seems localized to just the one daemon.
Some diagnosis and explanation.

First, these are the affected boards:
    wolf link nyan_big lumpy nyan_kitty peppy peach_pit stumpy

I don't know for sure, but I _think_ these are all using the
same kernel.

A spot check says that the symptom is a 100% provisioning failure
rate for the affected boards.  That's not infrastructure; the
build must be bad.  However, the canaries are still healthy.  So,
the failure must be a bad CL in the runs.

Next, the problem started with this CQ run:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15665

This is the full blamelist from the run:
    adhd | dgreid | 590714 | fails:5 | CRAS: timing_ut - Add more input tests.
    autotest | ayatane | 511762 | [autotest] Remove old drone log conversion
    autotest | ayatane | 511763 | fails:0(1) | [autotest] Clean up ensure logic
    autotest | ayatane | 511764 | fails:0(1) | [autotest] Make sync_send_file_to private
    chromeos-admin | jrbarnette | *423471 | fails:1 | [lab-tools] Use venv code to read the devserver configuration.
    chromite | norvez | 606697 | Promote betty-paladin back to important
    chromite | nxia | 604762 | Set the buildslave_type to GCE for pre-cqs except lakitu-pre-cqs
    chromiumos-overlay | derat | 600518 | target-chromium-os: Remove "feedback" USE flag.
    chromiumos-overlay | skau | 604380 | net-print/cups: Fix typo in filter file.
    chromiumos-overlay | vapier | 587950 | fails:1 | gmock32/gtest32: drop unused packages
    ec | nvaccaro | 540124 | sensors: add bmi160 & kionix orientation driver
    ec | rspangler | 602695 | fails:2 | cr50: Clean up device state code
    ec | rspangler | 604499 | fails:2 | cr50: Merge CCD device handling to rdd.c
    ec | rspangler | 604500 | fails:2 | cr50: Merge BattPrsnt device handling to wp.c
    ec | vijay.p.hiremath-AT-intel.com | 604731 | glkrvp: Enable charger & smart battery
    ec | vijay.p.hiremath-AT-intel.com | 604732 | GLKRVP: Enable fast charging and battery cut-off
    platform2 | derat | 605016 | power: Ignore peripheral batteries with "Unknown" status.
    platform2 | isandrk | 600214 | cryptohome: Add a DBus call to send TPM version information
    platform2 | mpdenton | 602882 | cryptohome: Use namespaces for cryptohome and cryptohomed

Some repos can be ruled out because they don't contribute to the
running image.  That suggests we should clear changes to chromite,
ec, autotest, and chromeos-admin.  The remaining changes are the ones
that got the -1 Verified.

If the diagnosis thus far is correct, then the new CQ run will complete
without this failure.  At that point, we can set about figuring out which
CL is the real culprit.

Cc: kerrnel@chromium.org mpdenton@google.com
> So, something broke cryptohome to the point where it's crashing.

Hmmm...  That would indict one of these CLs:
    platform2 | isandrk | 600214 | cryptohome: Add a DBus call to send TPM version information
    platform2 | mpdenton | 602882 | cryptohome: Use namespaces for cryptohome and cryptohomed

Those were already high on my list of real suspects, because
cryptohome is one of the services that can take down provisioning.
I note, however, that the failures I looked at appeared to be
mere ssh timeouts.  That would suggest something wrong beyond
just cryptohome.

Also:

2017-08-09T13:51:41.323167+00:00 ERR cryptohomed[1154]: libminijail[1154]: unshare(CLONE_NEWCGROUP) failed: Invalid argument
2017-08-09T13:51:41.329063+00:00 WARNING kernel: [   15.339256] init: cryptohomed main process (1154) killed by ABRT signal
2017-08-09T13:51:41.329081+00:00 WARNING kernel: [   15.339296] init: cryptohomed main process ended, respawning


Looks like the new call to minijail_namespace_cgroups(...) in https://chromium-review.googlesource.com/c/602882/3/cryptohome/cryptohome.cc#92 didn't go very well.
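
For reference, the failing step boils down to the unshare(2) call in the log line above. A minimal C sketch, assuming (as the libminijail error message suggests) that requesting a cgroup namespace ends in unshare(CLONE_NEWCGROUP); on a kernel without cgroup-namespace support that call fails with EINVAL, and a caller that treats the failure as fatal aborts just like cryptohomed did:

    /* Sketch only: reproduce the "unshare(CLONE_NEWCGROUP) failed:
     * Invalid argument" failure mode.  Needs root/CAP_SYS_ADMIN; on a
     * pre-4.6 kernel the flag is unknown and the call fails with EINVAL. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #ifndef CLONE_NEWCGROUP
    #define CLONE_NEWCGROUP 0x02000000  /* older headers don't define it either */
    #endif

    int main(void) {
        if (unshare(CLONE_NEWCGROUP) != 0) {
            /* On a 3.8 kernel: "unshare(CLONE_NEWCGROUP) failed: Invalid argument" */
            fprintf(stderr, "unshare(CLONE_NEWCGROUP) failed: %s\n", strerror(errno));
            abort();  /* a daemon treating this as fatal dies on SIGABRT */
        }
        puts("cgroup namespace created (kernel >= 4.6, run as root)");
        return 0;
    }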
Re #9: I'm only speaking for the few where the failure was obviously cryptohomed crashing. If there are other failure modes, I haven't chanced upon them.
Based on the log message in comment #10, it must be this CL:
platform2 | mpdenton | 602882 | cryptohome: Use namespaces for cryptohome and cryptohomed

This is consistent with all the failing boards having one kernel version. That CL should now be Verified-1, correct?
From go/crconn: 
3.8 devices -  wolf, link, peach_pit, peppy
3.4 devices - lumpy, stumpy (I assume these were upgraded to 3.8) 
3.10 devices - nyan_big, nyan_kitty

I could possibly be the culprit here for 3.8.
I chumped in Intel wifi bugfixes on 4 kernels yesterday (3.8, 3.14, 3.18 and 4.4 - no Intel wifi devices on 3.10 anymore).

https://chromium-review.googlesource.com/#/q/topic:Core28-bugfixes-merge+(status:open+OR+status:merged)
https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/607364/-1..1

3.8 was the only kernel that did not get enough testing (I had cbuildbot --remote kicked off). Looking into this now.



CLONE_NEWCGROUP was introduced in Linux kernel 4.6 (http://man7.org/linux/man-pages/man2/unshare.2.html), so that seems like the likely cause of the 3.8 boards failing.
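
If that's the root cause, one illustrative way to keep the sandboxing change while not breaking 3.x boards would be to gate the cgroup-namespace request on the running kernel version. A sketch of the idea, not the actual fix that landed in cryptohome/minijail:

    /* Illustrative guard, not the actual cryptohome/minijail fix: only
     * request a cgroup namespace when the running kernel is >= 4.6,
     * the first version that understands CLONE_NEWCGROUP. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/utsname.h>

    static bool kernel_supports_cgroup_ns(void) {
        struct utsname un;
        int major = 0, minor = 0;
        if (uname(&un) != 0 || sscanf(un.release, "%d.%d", &major, &minor) != 2)
            return false;
        return major > 4 || (major == 4 && minor >= 6);
    }

    int main(void) {
        if (kernel_supports_cgroup_ns())
            puts("kernel >= 4.6: safe to request CLONE_NEWCGROUP");
        else
            puts("kernel < 4.6: skip the cgroup namespace");
        return 0;
    }

Feature probing (attempting the unshare and tolerating EINVAL) would work just as well; the point is only that the namespace request can't be unconditional while boards still run 3.4/3.8/3.10 kernels.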

Comment 15 by nxia@chromium.org, Aug 9 2017

Cc: nxia@chromium.org
I just did a spot check of the current CQ run.  Several DUTs on the
wolf CQ have successfully provisioned with the latest build.  That
means that one of the CLs marked "-1 Verified" is the cause of the
overnight outage.

Based on evidence to date, I'd recommend we blame this CL as the
sole culprit:
    https://chromium-review.googlesource.com/#/c/chromiumos/platform2/+/602882/

Owner: mpdenton@google.com
I've dropped "-1 Verified" for all of the CLs except the suspected
culprit.  Those CLs will start back through with the next CQ run.
If we're wrong about the culprit, then we'll see this problem all
over again.

Status: Fixed (was: Assigned)
CQ is working again, culprit CL:602882 has been identified.

Comment 20 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
