
Issue 849137

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 30
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug
Labels: Build-Toolchain

Blocked on:
issue 856686

Blocking:
issue 879206




grunt: UnitTest stage failing during "Generating license" - ValueError('int is out of range (need a 128-bit value)')

Project Member Reported by jclinton@chromium.org, Jun 4 2018

Issue description

Comment 1 by djkurtz@google.com, Jun 4 2018

Cc: dgarr...@chromium.org vapier@chromium.org
Several unittests are "failing".  Actually the tests themselves pass; the failure is in the "Generating license" check during post-install:



=== Start output for job rootdev-0.0.1-r28 (0m21.1s) ===
rootdev-0.0.1-r28: >>> Completed installing rootdev-0.0.1-r28 into /build/grunt/tmp/portage/sys-apps/rootdev-0.0.1-r28/image/
rootdev-0.0.1-r28: 
rootdev-0.0.1-r28:  * Final size of build directory: 72 KiB
rootdev-0.0.1-r28:  * Final size of installed tree: 96 KiB
rootdev-0.0.1-r28: 
rootdev-0.0.1-r28:  * Generating license for sys-apps/rootdev-0.0.1-r28 in /build/grunt/tmp/portage/sys-apps/rootdev-0.0.1-r28
rootdev-0.0.1-r28: Traceback (most recent call last):
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/licensing/ebuild_license_hook", line 77, in <module>
rootdev-0.0.1-r28:     from chromite.lib import commandline
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/lib/commandline.py", line 29, in <module>
rootdev-0.0.1-r28:     from chromite.lib import gs
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/lib/gs.py", line 29, in <module>
rootdev-0.0.1-r28:     from chromite.lib import metrics
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/lib/metrics.py", line 25, in <module>
rootdev-0.0.1-r28:     from infra_libs import ts_mon
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/infra_libs/__init__.py", line 5, in <module>
rootdev-0.0.1-r28:     from . import ts_mon  # Must be imported first so httplib2_utils can import it.
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/infra_libs/ts_mon/__init__.py", line 5, in <module>
rootdev-0.0.1-r28:     from infra_libs.ts_mon.config import add_argparse_options
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/infra_libs/ts_mon/config.py", line 13, in <module>
rootdev-0.0.1-r28:     import requests
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/requests/__init__.py", line 53, in <module>
rootdev-0.0.1-r28:     from .packages.urllib3.contrib import pyopenssl
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/requests/packages/__init__.py", line 27, in <module>
rootdev-0.0.1-r28:     from . import urllib3
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/__init__.py", line 8, in <module>
rootdev-0.0.1-r28:     from .connectionpool import (
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/connectionpool.py", line 41, in <module>
rootdev-0.0.1-r28:     from .request import RequestMethods
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/request.py", line 7, in <module>
rootdev-0.0.1-r28:     from .filepost import encode_multipart_formdata
rootdev-0.0.1-r28:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/filepost.py", line 4, in <module>
rootdev-0.0.1-r28:     from uuid import uuid4
rootdev-0.0.1-r28:   File "/usr/lib64/python2.7/uuid.py", line 605, in <module>
rootdev-0.0.1-r28:     NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
rootdev-0.0.1-r28:   File "/usr/lib64/python2.7/uuid.py", line 168, in __init__
rootdev-0.0.1-r28:     raise ValueError('int is out of range (need a 128-bit value)')
rootdev-0.0.1-r28: ValueError: int is out of range (need a 128-bit value)
rootdev-0.0.1-r28:  * ERROR: sys-apps/rootdev-0.0.1-r28::chromiumos failed:
rootdev-0.0.1-r28:  *   
rootdev-0.0.1-r28:  * Failed Generating Licensing for sys-apps/rootdev-0.0.1-r28
rootdev-0.0.1-r28:  * Note that many/most open source licenses require that you distribute the license
rootdev-0.0.1-r28:  * with the code, therefore you should fix this instead of overridding this check.
rootdev-0.0.1-r28:  * 
rootdev-0.0.1-r28:  * Note too that you need to bundle the license with binary packages too, even
rootdev-0.0.1-r28:  * if they are not part of ChromeOS proper since all packages are available as
rootdev-0.0.1-r28:  * prebuilts to anyone and therefore must include a license.
rootdev-0.0.1-r28:  * 
rootdev-0.0.1-r28:  * If you need help resolving the licensing error you just got, please have a
rootdev-0.0.1-r28:  * look at
rootdev-0.0.1-r28:  * http://www.chromium.org/chromium-os/licensing-for-chromiumos-package-owners
rootdev-0.0.1-r28:  * 
rootdev-0.0.1-r28:  * 
rootdev-0.0.1-r28:  * If you need support, post the output of `emerge --info '=sys-apps/rootdev-0.0.1-r28::chromiumos'`,
rootdev-0.0.1-r28:  * the complete build log and the output of `emerge -pqv '=sys-apps/rootdev-0.0.1-r28::chromiumos'`.
rootdev-0.0.1-r28:  * The complete build log is located at '/build/grunt/tmp/portage/logs/sys-apps:rootdev-0.0.1-r28:20180604-151107.log'.
rootdev-0.0.1-r28:  * For convenience, a symlink to the build log is located at '/build/grunt/tmp/portage/sys-apps/rootdev-0.0.1-r28/temp/build.log'.
rootdev-0.0.1-r28:  * The ebuild environment file is located at '/build/grunt/tmp/portage/sys-apps/rootdev-0.0.1-r28/temp/environment'.
rootdev-0.0.1-r28:  * Working directory: '/usr/lib64/python2.7/site-packages'
rootdev-0.0.1-r28:  * S: '/mnt/host/source/src/third_party/rootdev'
rootdev-0.0.1-r28: !!! post install failed; exiting.

Comment 2 by djkurtz@google.com, Jun 4 2018

Cc: akes...@chromium.org cra...@chromium.org
There were two bad CLs this morning related to the license check:
https://chromium-review.googlesource.com/c/chromiumos/platform/arc-camera/+/1065594
https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/1067570

This bug is about the timeouts over the weekend (which will probably be visible again as soon as those bad CLs aren't included).

Comment 4 by djkurtz@google.com, Jun 4 2018

That license failure was from build 2011 (started 2018-06-01 11:07 PM), referenced in the OP, and is present in all of them.

Comment 5 by djkurtz@google.com, Jun 4 2018

My guess is that the verity spam is because verity also failed the license check, which causes the entire test run output to get logged.

Comment 6 by djkurtz@google.com, Jun 4 2018

@#3 - Also, sorry, I don't follow what those two patches have to do with a license check failure.

Comment 7 by djkurtz@google.com, Jun 4 2018

Summary: grunt: UnitTest stage failing during "Generating license" - ValueError('int is out of range (need a 128-bit value)') (was: grunt: UnitTest stage continuously timing out on ToT)
@#4 - I let logdog finish, and it is confirmed that the verity spew does complete, and it is caused by the same license issue.

verity-0.0.1-r87:  * Generating license for chromeos-base/verity-0.0.1-r87 in /build/grunt/tmp/portage/chromeos-base/verity-0.0.1-r87
verity-0.0.1-r87: Traceback (most recent call last):
verity-0.0.1-r87:   File "/mnt/host/source/chromite/licensing/ebuild_license_hook", line 77, in <module>
verity-0.0.1-r87:     from chromite.lib import commandline
verity-0.0.1-r87:   File "/mnt/host/source/chromite/lib/commandline.py", line 29, in <module>
verity-0.0.1-r87:     from chromite.lib import gs
verity-0.0.1-r87:   File "/mnt/host/source/chromite/lib/gs.py", line 29, in <module>
verity-0.0.1-r87:     from chromite.lib import metrics
verity-0.0.1-r87:   File "/mnt/host/source/chromite/lib/metrics.py", line 25, in <module>
verity-0.0.1-r87:     from infra_libs import ts_mon
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/infra_libs/__init__.py", line 5, in <module>
verity-0.0.1-r87:     from . import ts_mon  # Must be imported first so httplib2_utils can import it.
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/infra_libs/ts_mon/__init__.py", line 5, in <module>
verity-0.0.1-r87:     from infra_libs.ts_mon.config import add_argparse_options
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/infra_libs/ts_mon/config.py", line 13, in <module>
verity-0.0.1-r87:     import requests
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/requests/__init__.py", line 53, in <module>
verity-0.0.1-r87:     from .packages.urllib3.contrib import pyopenssl
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/requests/packages/__init__.py", line 27, in <module>
verity-0.0.1-r87:     from . import urllib3
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/__init__.py", line 8, in <module>
verity-0.0.1-r87:     from .connectionpool import (
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/connectionpool.py", line 41, in <module>
verity-0.0.1-r87:     from .request import RequestMethods
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/request.py", line 7, in <module>
verity-0.0.1-r87:     from .filepost import encode_multipart_formdata
verity-0.0.1-r87:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/filepost.py", line 4, in <module>
verity-0.0.1-r87:     from uuid import uuid4
verity-0.0.1-r87:   File "/usr/lib64/python2.7/uuid.py", line 605, in <module>
verity-0.0.1-r87:     NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
verity-0.0.1-r87:   File "/usr/lib64/python2.7/uuid.py", line 168, in __init__
verity-0.0.1-r87:     raise ValueError('int is out of range (need a 128-bit value)')
verity-0.0.1-r87: ValueError: int is out of range (need a 128-bit value)
verity-0.0.1-r87:  * ERROR: chromeos-base/verity-0.0.1-r87::chromiumos failed:


Is this reproducible locally?
i think the underlying issue is that our modules import too goddamn much ;).  the fact that importing the commandline module yields a network request for metrics is pretty crazy.

if we can't just drop the metrics logic, i guess we should split lib.gs up into a "constants" module that only has idempotent logic (constants & functional helpers w/no side-effects).  that way the rest of the users (like lib.commandline) can get a reduced module when it only wants these helpers/constants.

that said, the error here makes no sense.  the NAMESPACE_DNS member is a valid uuid which shouldn't be hitting an out of range value ...
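
To make the lib.gs split idea above concrete, a minimal sketch (the module name and symbols are illustrative, not an existing chromite layout):

# chromite/lib/gs_constants.py (hypothetical): side-effect-free pieces only.
BASE_GS_URL = 'gs://'

def PathIsGs(path):
  """Pure helper: no metrics/ts_mon imports, no network access at import time."""
  return path.startswith(BASE_GS_URL)

# chromite/lib/gs.py would keep the heavyweight bits and re-export these, e.g.:
#   from chromite.lib.gs_constants import BASE_GS_URL, PathIsGs
#   from chromite.lib import metrics  # only pulled in by callers that need it
# and chromite/lib/commandline.py could then import only the cheap module:
#   from chromite.lib import gs_constants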
Hum. No idea why we do active metrics work at import, but that seems like the most important thing to fix.
@8 No, it did not reproduce locally when I ran:
FEATURES=test emerge-grunt rootdev

Don, I think you are familiar with the license generation pieces.
Can I assign this over to you?
Owner: mikenichols@chromium.org
I did rewrite ebuild_license_hook, but haven't looked at it in a very long time. Mike has been looking at metrics.py and might be able to explain why it's doing active things at import time.
Also, why would this just suddenly start failing, and only on grunt-paladin?
kahlee-paladin, which is nearly identical, does NOT have this UnitTest failure.

Here are the CLs between the working & failing builds:
https://crosland.corp.google.com/log/10743.0.0..10744.0.0

However, this sounds more like an issue with the builder itself than with the image.
i think the licensing logic is a red herring
Status: Started (was: Assigned)
I'm working on reproducing this locally now.  

-- Mike
The timing aligns with when grunt-paladin switched builders from cros-beefy486-c2 to a bare-metal machine, build202-m2.
The normal grunt-paladin builder (cros-beefy486-c2) has been offline for a few days, so grunt-paladin has been trading off with romer-paladin on build202-m2.  Probably not coincidentally, build 2011 was the first build where this started happening.  The regular grunt and romer builders are back online this morning, so this may go away by itself if that was the cause.
Thanks for the info, Ben.  I saw that in the crosoncall earlier this morning and was trying to do some validation myself.

-- Mike
Do we understand why the move to the other machine is causing these failures?  Was it from switching back and forth between builders (which seems like it will have ramifications if we're trying to make builders more fluid and have less affinity requirements), or some issues with bare metal machines (or this one in particular)?
Cc: bmgordon@chromium.org
I don't think we understand the real cause of the failure, so saying it's tied to the machine is just a strongly educated guess.

Ben, given your improvements, do we always reset the chroot when a build changes build configs?
We don't reset between build config changes, but we do reset whenever something incompatible happens (branch change, incremental->non-incremental, etc).  The two paladins sharing the machine should have been able to share the chroot safely.  Since they were alternately failing, it was reset to a clean snapshot in between each build.  I looked through a few of the builds and didn't see anything unexpected for that part.
build202-m2 is currently idle. I'm going to log in and wipe the ChromeOS build environment fully. This will make the next build on it slower, since it will have to sync code from scratch. I probably should have done that sooner.
grunt-paladin just killed the CQ run again: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18802

This time, the build timed out during build_packages: it got to the end of build_packages and then just hung doing nothing until time expired: https://luci-milo.appspot.com/buildbot/chromeos/grunt-paladin/2049

It was running on cros-beefy486-c2.

At some point overnight cros-beefy486-c2 failed again and now we're back to build202-m2 which is back to failing unit tests. Marking experimental.
The (A?) problem has got to be the system libraries on build202-m2.golo.chromium.org.  I have no idea how to log in to this device (help?), but the last three lines of the stacktrace are very telling:



hdctools-0.0.1-r807:   File "/mnt/host/source/chromite/third_party/requests/packages/urllib3/filepost.py", line 4, in <module>
hdctools-0.0.1-r807:     from uuid import uuid4
hdctools-0.0.1-r807:   File "/usr/lib64/python2.7/uuid.py", line 605, in <module>
hdctools-0.0.1-r807:     NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
hdctools-0.0.1-r807:   File "/usr/lib64/python2.7/uuid.py", line 168, in __init__
hdctools-0.0.1-r807:     raise ValueError('int is out of range (need a 128-bit value)')

"from uuid import uuid4"
Ok, filepost.py imports uuid.  Great, that's pretty legit.

NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
Create a UUID object with a specific hex value and call it NAMESPACE_DNS.  This looks legit as well.

raise ValueError('int is out of range (need a 128-bit value)')
Hrm, what's going on here? "int is out of range."  Using the system libraries on my machine (because I have no idea how to find this machine) I see that the constructor for UUID does roughly the following:

# irrelevant parts scraped out:
# hex = '6ba7b810-9dad-11d1-80b4-00c04fd430c8'
def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None,
                    int=None, version=None):
  if hex is not None:
    hex = hex.replace('urn:', '').replace('uuid:', '')
    hex = hex.strip('{}').replace('-', '')
    if len(hex) != 32:
      raise ValueError('badly formed hexadecimal UUID string')
    int = long(hex, 16)
  # ok, now "int" (a horrible variable name) is equal to
  # >>> i = long('6ba7b810-9dad-11d1-80b4-00c04fd430c8'.replace('-',''), 16)
  # >>> print i
  # 143098242404177361603877621312831893704
  if int is not None:
    if not 0 <= int < 1<<128L:
      raise ValueError('int is out of range (need a 128-bit value)')
  # Ok, this is odd and doesn't make sense.
  # on my desktop, 143098242404177361603877621312831893704 is
  # definitely less than 1<<128L.


Smells to me like busted python libraries on that machine. Sure wish I knew how to log in.

OK, thanks to mikenichols@, I was able to get into the machine, chroot into the grunt chroot, and reproduce the (stacktrace) problem:

$ ssh build202-m2.golo
$ cd /b/c/cbuild/repository/chroot/build/grunt
$ sudo chroot .
$ python2.7
>>> import uuid
sh: 1: cannot create /dev/null: Is a directory
sh: 1: cannot create /dev/null: Is a directory
sh: 1: cannot create /dev/null: Is a directory
sh: 1: cannot create /dev/null: Is a directory
sh: 1: cannot create /dev/null: Is a directory
sh: 1: cannot create /dev/null: Is a directory
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/uuid.py", line 605, in <module>
    NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
  File "/usr/lib64/python2.7/uuid.py", line 168, in __init__
    raise ValueError('int is out of range (need a 128-bit value)')
ValueError: int is out of range (need a 128-bit value)

more specifically, this is really broken:

>>> i = long('6ba7b810-9dad-11d1-80b4-00c04fd430c8'.replace('-',''), 16)
>>> print i
25993752599367497841673592235616513366132321067116433072379666648858939910645517526286245627759603117433917196014180281909672132690638501420307763972336982924491944552885516877409584366686926903841867234761360772604689296712882651053948934123466755507186791707555270325767595022614536

I've updated cros-beefy486-c2 to Ubuntu's 4.4.0-127 kernel and rebooted it.  Assuming puppet doesn't intervene to put it back on 3.13, we can see if that does anything about the I/O hangs when grunt-paladin moves back over.
Now that I've got a pretty laser-focused reproduction case, I'm able to play around.  The issue is definitely with build202-m2 (and not the build of python), though it's not clear to me precisely what the problem is.

I built the grunt image on my workstation.  No dice, everything works fine when I do the long hex conversion.  Maybe this produced a different software artifact than the builder.

So, I copied the python interpreter and all its shared libraries from build202-m2's .../build/grunt directory to:
  * my workstation
  * a hermetic directory on build202-m2.
On both systems, I created a chroot with only these files from build202-m2:
  * /usr/bin/python2.7
  * all shared objects listed by 'ldd /usr/bin/python2.7' except for (obviously)
    linux-vdso.so.1.
  * The full contents of /usr/lib64/python2.7/...
Then, I'm able to load up the interpreter and play knowing the same software is being used on both systems.  I verified that the correct libraries were being used by looking at /proc/${pid}/maps.  I ran a few tests with this hermetically sealed and identical binary on both systems:
  * load up the interpreter and run long('10', 16).  This fails (produces 1<<59L,
    the wrong result) consistently on build202-m2 and passes (produces 16L)
    consistently on my machine.
  * Load up the interpreter, attach strace to the process and run long('10', 16).
    On both systems, this produces roughly the strace output[1] shown below.  This
    indicates that nothing interesting is happening and the conversion is not
    calling out to the kernel in any significant way.

These tests effectively rule out any software artifact produced by the build as the cause of the problem.  I believe there is something very wonky about build202-m2.



[1] strace output:
read(0, "long('10', 16)\n", 1024)       = 15
brk(0x7ff97952b000)                     = 0x7ff97952b000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
write(1, "576460752303423488L\n", 20)   = 20
ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(2, ">>> ", 4)                     = 4
read(0,  <detached ...>
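
For reference, a standalone version of that long() sanity check (the expected values come from the comments above; the wrapper script itself is just illustrative, in python2 syntax since that's what the chroot runs):

# Check the base-16 conversions that misbehave on build202-m2.
# Expected values are taken from the comments above.
cases = [
    ('10', 16),
    ('6ba7b8109dad11d180b400c04fd430c8',
     143098242404177361603877621312831893704),
]
for hex_str, expected in cases:
    actual = long(hex_str, 16)  # python2; on a healthy host this matches expected
    status = 'OK' if actual == expected else 'BROKEN (got %d)' % actual
    print('%s -> %s' % (hex_str, status))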


to be clear, this python code is running inside of the cros_sdk chroot, so the version of python/etc... in the host distro shouldn't matter.  the kernel might factor in though.
Cc: yunlian@chromium.org manojgupta@chromium.org
Manoj asked on a related CL:

"""There is some flakiness here:
grunt paladin has had a few good runs after the toolchain roll (Wed, June 6).
https://uberchromegw.corp.google.com/i/chromeos/builders/grunt-paladin/builds/2042
https://uberchromegw.corp.google.com/i/chromeos/builders/grunt-paladin/builds/2046

And a few where unit tests timed out because of errors, also on Wed, June 6.
https://uberchromegw.corp.google.com/i/chromeos/builders/grunt-paladin/builds/2053
https://uberchromegw.corp.google.com/i/chromeos/builders/grunt-paladin/builds/2056

Grunt canary builds however are doing fine (where every package is built from scratch instead of prebuilts). 
https://uberchromegw.corp.google.com/i/chromeos/builders/grunt-release

So maybe there is some prebuilt issue here? 
So follow up question is : Where does grunt-paladin gets its prebuilts from?"""

One question is whether we've seen a successful build on build202-m2?  That is the box on which we've consistently been able to recreate this issue.  Using the same tests as above, the cros-beefy486-c2 bot does not exhibit the same behavior when converting to long.  

cros-beefy486-c2 / # python2.7
Python 2.7.10 (default, Jun  6 2018, 02:28:56) 
[GCC 4.2.1 Compatible Chromium OS 7.0_pre331547_p20180529 Clang 7.0.0 (/var/cac on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print long('6ba7b8109dad11d180b400c04fd430c8', 16)
143098242404177361603877621312831893704

The differences between the hosts lie in kernel and architecture, with build202 being physical in Golo and beefy486 being GCE.

-- Mike
I saw a successful romer-paladin build on there a couple of days ago, but it was in a sea of red and purple.
Summarizing the discussions happening here in BLD, the leading suspicion is that something about the way that we're compiling or optimizing the python binary on grunt is incompatible with certain builders.  More specifically, build202-m2 and build201-m2 with grunt python binaries produce bogus answers for "long('10',16)". The same binaries/libraries run on cros-beefy97-c2 produces the correct answer.

If this proves correct, this would be bad because it would imply that we can't universally run grunt AMD binaries on intel builders.
Wait.... the python binary was emerged for the host, not the board. That would imply it was built with the host toolchain, not the board's.

I'm not sure I understand, dgarrett@.  "import uuid" works fine inside the cros_sdk, but when run under the ${cros_chroot}/build/grunt/, it breaks in the way that happens in the failed unit test from comment #1.  Smells to me like the unit test is running python from ${cros_chroot}/build/grunt/.
Do we run the same test for Arm boards? They have a python binary that will always fail on the host.
we don't run python under the board sysroot as it makes no sense

my guess is our long standing LD_LIBRARY_PATH test hack is screwing things up.  give this CL a try:
  https://chromium-review.googlesource.com/1091950
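
A rough illustration of the suspected failure mode (the sysroot path and values here are assumptions, not taken from the builder): if src_test leaves LD_LIBRARY_PATH pointing at the board sysroot, the host python2.7 can end up loading board shared objects, which is exactly the kind of mixed-library breakage seen in the long()/UUID traces above.

# Illustrative only -- run the host interpreter with a stale LD_LIBRARY_PATH
# the way the src_test hack can leave it, and see whether long() still parses.
import os
import subprocess

env = dict(os.environ)
env['LD_LIBRARY_PATH'] = '/build/grunt/usr/lib64'  # hypothetical stale value
subprocess.call(
    ['python2.7', '-c', "print long('10', 16)"],  # should print 16
    env=env)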
Project Member Comment 40 by bugdroid1@chromium.org, Jun 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/d2a01fb02937c61a8f27cf04dde96b59e9fa5319

commit d2a01fb02937c61a8f27cf04dde96b59e9fa5319
Author: Mike Frysinger <vapier@chromium.org>
Date: Fri Jun 08 10:15:34 2018

amd64/x86: clean up src_test LD_LIBRARY_PATH hacks

We've been hacking up the test env to make src_test work for random
packages by pointing LD_LIBRARY_PATH to the sysroot, but then we let
that value stick around which bleeds into later build stages.  This
in turn can random break build tools, so at least clear it once the
test phase is over.

BUG= chromium:849137 
TEST=precq passes

Change-Id: I2e5313183ebf7615066c27f394b9cd253d3fe49d
Reviewed-on: https://chromium-review.googlesource.com/1091950
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/d2a01fb02937c61a8f27cf04dde96b59e9fa5319/profiles/arch/amd64/profile.bashrc
[modify] https://crrev.com/d2a01fb02937c61a8f27cf04dde96b59e9fa5319/profiles/arch/x86/profile.bashrc

Status: Fixed (was: Started)
i'm going to call this fixed.  please re-open if we have logs after my CL with the same failure.
Labels: -Pri-1 Pri-2
Owner: vapier@chromium.org
Status: Assigned (was: Fixed)
Happened again when the GCE builder went down and we failed over to bare-metal Intel CPUs:  issue 879206 .
Should we consider moving Grunt off of all GOLO builders to only GCE?  Buildbot is being decomm'd, at which point we'll be completely on swarming, so the risk is minimal and it would at least avoid the current issues. 

-- Mike

Comment 44 Deleted

(Minor edit:)
jclinton just mentioned to me a theory that the x86 compiler used for the grunt build is building AMD-optimized x86 host binaries (ie for python, or unittests) that execute correctly on some builders (ie "GCE" cros-beefy-*), but fail when the grunt-paladin build gets migrated to other Intel builders (ie "Golo" aka bare-metal aka "build-*"), which leads to flaky unittests (eg, this bug and  issue 879206 ).

Further he mentioned that this sounds like a compiler bug, where the compiled code should still give correct results if executed on Intel machines, albeit slower, since it is somehow supposed to take the slow-path around any AMD-optimized paths.

Luis, 
Does this sound plausible?  Could our observed results on these bugs be due to an issue with the host toolchain?

Blocking: 879206
at this point in time, there's not much build can do about this

we're optimizing grunt (all stonyridge) for AMD by using -march=amdfam10.  that means it's free to use AMD-specific ISA insns.  if the code is then executed on a system where the cpu isn't AMD, we might get failures like this.  i don't see this being a compiler bug ... if it always executes correctly on an AMD cpu, and sometimes misbehaves on an Intel cpu, that isn't a compiler bug imo.

if control over the execution cpu isn't guaranteed, the options:
- run it through qemu and take a perf hit
- disable unittests altogether
- don't optimize for AMD and only optimize for ISAs that exist on the build & boards (so we can't use 3DNow or other AMD SSE extensions)

no one really wants to do any of these.  so CI can try and provide build systems where the CPU seems to be compatible enough and call it a day.

this isn't the same issue as the python code that originally drove this bug.  that scenario had board code (/build/grunt/xxx) executing in the sdk env directly and thus mixed shared libs.  the packages failing in  issue 879206  are all being run inside the board's sysroot so there's no chance of mixing objects.
As vapier already said, the Grunt builder is passing "-march=amdfam10" (https://cs.corp.google.com/chromeos_public/src/overlays/chipset-stnyridge/make.conf?l=8).

This forces the compiler to emit AMD-specific instructions, which aren't going to work on Intel machines.

One option is to disable generating AMD-specific optimizations, but so far the consensus has been not to do so (see https://bugs.chromium.org/p/chromium/issues/detail?id=856686).
Blockedon: 856686
Besides issue 856686, what would be the performance impact to devices in the field of doing "-mtune=amdfam10 -march=some_x86_64_lowest_common_denominator"?

I think that mtune is sideways compatible with the non-target CPU?

-mtune is largely about insn/register scheduling rather than ISA selection.  when you drop the -march like that, you drop support for AMD-specific ISA's like 3DNow and SSE extensions that AMD has done.  i don't know the actual perf hit, or the intersection with the baseline x86-64 arch.  i'd leave that to the toolchain/perf teams to quantify ;).
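
For anyone triaging a similar mismatch, a quick (hypothetical, Linux-only) check of whether a builder's CPU actually advertises the AMD-only extensions that -march=amdfam10 allows the compiler to use:

# Hypothetical helper, not part of chromite: look for AMD-specific ISA flags
# in /proc/cpuinfo.  The flag list is representative, not exhaustive.
AMD_ONLY_FLAGS = ('3dnow', '3dnowext', 'sse4a')

with open('/proc/cpuinfo') as f:
    for line in f:
        if line.startswith('flags'):
            flags = set(line.split(':', 1)[1].split())
            missing = [flag for flag in AMD_ONLY_FLAGS if flag not in flags]
            print('missing AMD extensions: %s' % (missing or 'none'))
            break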
Regardless of AMD/Intel, a newer feature set not supported by the builders could always cause similar problems, e.g. enabling sse<N> for a new Intel board that the builders don't support.
Status: Fixed (was: Assigned)
sure, that's happened to us in the past multiple times as discussed in issue 856686

at this point, i feel like we're just rehashing issue 856686 and not adding anything new.  this bug was largely about mixing sdk & board libs which is bad.  it was also about the build bot changing hosts (and thus CPU) which is (currently) bad for AMD boards.  we've resolved that now by switching the allocation back.

so i'm not sure we need to do much more here.  if we want to discuss approaches to solve this scenario, lets keep it on issue 856686.
If I understand correctly, you have fixed the issue for the builders. 
But this problem will happen to anyone who tries to run these unittests on their personal workstation. We don't care about that?
it depends on the workstation's CPU

Sign in to add a comment