
Issue 780727 link

Starred by 3 users

Issue metadata

Status: WontFix
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




pre-flight factory-gru-9017.B builder fails

Project Member Reported by philipchen@chromium.org, Nov 2 2017

Issue description

'pre-flight factory-gru-9017.B' builder has been failing since Oct. 30:
https://uberchromegw.corp.google.com/i/chromeos.branch/builders/gru%20pre-flight%20factory-gru-9017.B/builds/116

I can't find any informative debug message from the log.
I tried to kick off trybot builds for factory-gru-9017.B, e.g.:
bob-factory:
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/factory/builds/10
bob-release:
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/release/builds/16870

Both builds succeeded.
Does anyone know how to fix the pre-flight builder? Otherwise we can't get builder-built images on this factory branch anymore.
 
Labels: -Pri-2 Pri-1
It failed before building packages. Any ideas?

  File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
    ciphers=ciphers)
  File "/usr/lib/python2.7/ssl.py", line 241, in __init__
    ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
20:10:09: INFO: Waiting for ts_mon flushing process to finish...
20:11:07: INFO: Finished waiting for ts_mon process.
Cc: yueherngl@chromium.org
Looks like the builder's cacert has some issue.  Chrome trooper should take a look.

Comment 5 by efoo@chromium.org, Nov 4 2017

Components: -Infra Infra>Client>ChromeOS
Moving under Client>ChromeOS
Labels: Infra-Troopers
Cc: marcochen@chromium.org
Hi trooper,

Could you take a look?
This issue blocks the production for Bob.

Comment 8 by jojwang@google.com, Nov 6 2017

Cc: b...@chromium.org vhang@chromium.org

Comment 9 by jojwang@google.com, Nov 6 2017

Cc: -b...@chromium.org -vhang@chromium.org pschmidt@chromium.org
adding labs oncall
Cc: b...@chromium.org
This looks suspiciously like http2lib does not know where to find cacerts.txt

And this does not look like a system python implementation.


Comment 12 by d...@chromium.org, Nov 7 2017

> This looks suspiciously like http2lib does not know where to find cacerts.txt

"httplib2" comes bundled with certs, and defaults to using its own: 
https://chromium.googlesource.com/chromiumos/chromite/+/factory-gru-9017.B/third_party/httplib2/__init__.py#187

It's worth noting that the bundled "httplib2" hasn't changed in over a year, and is used by every other CrOS builder.

> And this does not look like a system python implementation.

Curious why not?

All I can tell is that from the stack trace, the lines in question don't line up with the chroot "ssl.py", but do line up with the local system's "ssl.py", so I think it's probably that this is the system Python.

System Python and system "httplib2" seem to both be able to access the failing URL: https://chromium.googlesource.com/chromiumos/chromite/+/factory-gru-9017.B/lib/cipd.py#29

I took a few shots at repro but fell short :( The failure is happening across multiple systems, all of which have performed green builds in the interim, so it feels safe to say that this is a CrOS problem specific to the "gru pre-flight factory-gru-9017.B" builder.

Comment 13 by d...@chromium.org, Nov 7 2017

OTOH it does seem confined to certain builders. e.g., this build on build22-m2 seems fine: https://uberchromegw.corp.google.com/i/chromeos.branch/builders/gru%20pre-flight%20factory-gru-9017.B/builds/120
Re: build22-m2, was this build from before Oct 30th, when the problem started happening?

Comment 15 by d...@chromium.org, Nov 7 2017

I manually kicked the one I linked in #13 off just a few minutes prior, and it is doing well.

I think either:
1) This is bot-specific, or
2) The certificate problem has resolved itself.
I just manually kicked off the device factory build, and it still failed:
https://uberchromegw.corp.google.com/i/chromeos.branch/builders/bob%20factory%20factory-gru-9017.B

Comment 17 by d...@chromium.org, Nov 8 2017

Worth noting that failure is also on "build22-m2", the bot that succeeded in #13. Unfortunately, since the error is occurring within "cbuildbot" code and environment, I think the best thing you can do is log onto a bot (build22-m2 seems reasonable) and find a reliable reproduction case. Chrome Operations doesn't typically dive into other codebases, and the failure here is deep within Chromite code.

Copy/paste of failure stack for posterity:

  File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 97, in _Fetch
  File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 76, in _DownloadCIPD
  File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 54, in _ChromeInfraRequest
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1593, in request
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1335, in _request
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1257, in _conn_request
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1021, in connect
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 80, in _ssl_wrap_socket
  File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
    ciphers=ciphers)
  File "/usr/lib/python2.7/ssl.py", line 241, in __init__
    ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib

Comment 18 by d...@chromium.org, Nov 8 2017

Cc: dgarr...@chromium.org
Most searches for this error are useless, but this one intrigued me: https://stackoverflow.com/questions/15696526/ssl-throwing-error-185090050-while-authentication-via-oauth

Working more with this, I found that these branch builders rapidly flip between really old and new Chromite checkouts. The one on build22-m2 is currently a "veyron_pinky factory factory-veyron-6591.B" build, whose Chromite checkout is from October 2016, before "//chromite/third_party/httplib2" was added. The "gru" and "bob" builds are from 2017, after it was added. Perhaps switching the checkouts between old and new versions is messing up file permissions on "cacerts.txt"?

To that end, I wrote a basic test:

$ mkdir -p test
$ cd test
$ git clone https://chromium.googlesource.com/chromiumos/chromite
$ PYTHONPATH=$PWD/chromite/third_party python -c 'import httplib2; print httplib2.__file__; print httplib2.Http().request(uri="https://chrome-infra-packages.appspot.com/_ah/api")'

({'status': '404', 'content-length': '9', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Mon, 01 Jan 1990 00:00:00 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-cache, no-store, max-age=0, must-revalidate', 'date': 'Wed, 08 Nov 2017 12:53:20 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="41,39,38,37,35"', 'content-type': 'text/html; charset=UTF-8'}, 'Not Found')

# Now, to trigger the failure case!
$ chmod 0220 chromite/third_party/httplib2/cacerts.txt
$ PYTHONPATH=$PWD/chromite/third_party python -c 'import httplib2; print httplib2.__file__; print httplib2.Http().request(uri="https://chrome-infra-packages.appspot.com/_ah/api")'

  <Stack Trace>...
  File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
    ciphers=ciphers)
  File "/usr/lib/python2.7/ssl.py", line 241, in __init__
    ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
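
The same class of failure can be approximated in Python 3 without touching a real checkout. This is a minimal sketch under stated assumptions, not the Chromite or httplib2 code: `try_load_ca_bundle` is a hypothetical helper, and modern Python reports a missing bundle as `FileNotFoundError` rather than the Python 2.7 `ssl.SSLError` shown in the trace. Both are `OSError` subclasses, so a single handler covers them.

```python
import os
import ssl
import tempfile


def try_load_ca_bundle(path):
    """Return None if the CA bundle loads, else the exception that was raised.

    Hypothetical helper for illustration; not part of httplib2 or Chromite.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        context.load_verify_locations(cafile=path)
    except OSError as err:  # covers ssl.SSLError, FileNotFoundError, PermissionError
        return err
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Assumed scenario: the bundle path is known, but the file was never created.
    missing = os.path.join(tmp, "cacerts.txt")
    missing_error = try_load_ca_bundle(missing)
    print(type(missing_error).__name__, missing_error)
```

The `chmod 0220` variant is left out on purpose: a process running as root can still read the file, so that case would not reproduce reliably everywhere.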

=====

I am speculating:
1) This is related to "cbuildbot_launch" toggling between old and new branches, possibly also due to the fact that "cacerts.txt" is toggled between existing and not existing depending on the branch.
2) The problem here is a permissions error on Chromite's embedded "//chromite/third_party/httplib2/cacerts.txt" file, since that's the only way I've been able to reproduce the error.
3) This is hard to repro b/c it is checkout flake. It triggers based on the previous build that the machine does, and the broken state is removed when the machine does a new build. We'd have to catch a failed build *right* after it fails to confirm.

Because of this, I propose:
A) Remove troopers / Systems from this bug. This is 100% a Chromite problem, not a problem with infrastructure or the bots.
B) Monitor https://uberchromegw.corp.google.com/i/chromeos.branch/one_line_per_build and try really hard to identify a builder immediately after a failure case. SSH onto it to confirm that something is up with "cacerts.txt".
C) In parallel with (B), look at the checkout logic for "cbuildbot_launch" or whatever's checking out Chromite and see if there is a place where it might be causing (2).

Once (C) is solved, cherry-pick the solution onto all active branches.
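
For step (B), a quick check of the bundle's state on a bot could look like the following. This is only a sketch: `describe_ca_bundle` is a hypothetical helper, and the bot path in the usage note is assembled from the stack trace directory and the file name used in the repro, so treat it as an assumption.

```python
import os
import stat
import tempfile


def describe_ca_bundle(path):
    """Summarize whether a CA bundle exists and is readable by this process.

    Hypothetical diagnostic helper; not part of Chromite.
    """
    if not os.path.exists(path):
        return {"path": path, "state": "missing"}
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Note: os.access reports True for root even when the mode bits deny reads.
    readable = os.access(path, os.R_OK)
    return {
        "path": path,
        "state": "ok" if readable else "unreadable",
        "mode": oct(mode),
    }


with tempfile.TemporaryDirectory() as tmp:
    bundle = os.path.join(tmp, "cacerts.txt")
    before = describe_ca_bundle(bundle)   # file not created yet
    with open(bundle, "w") as handle:
        handle.write("placeholder bundle\n")
    after = describe_ca_bundle(bundle)    # file exists and is readable
    print(before["state"], after["state"])
```

On a bot, the call would presumably be `describe_ca_bundle("/b/c/cbuild/repository/chromite/third_party/httplib2/cacerts.txt")`, run as the same user that runs the build.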

And that's about all of the drive-by early-morning debugging I have time for :) +dgarrett@ b/c of "cbuildbot_launch".

Comment 19 by d...@chromium.org, Nov 8 2017

More information on (#18/C): I speculate that the thing that is causing this problem is whatever software is managing "/b/c/cbuild/repository/chromite".
Labels: -Infra-Troopers
removing troopers per #18
RE: #18

We set up a new branch a few days ago and its pre-flight is showing the same problem.

https://uberchromegw.corp.google.com/i/chromeos.branch/builders/falco%20pre-flight%20firmware-servo-9040.B/builds/1

Can this be a useful test case?

Thanks. 
cbuildbot_launch is what sets up everything under /b/c/cbuild/repository. When it moves between branches, it does a detailed cleanup of the checkout which is supposed to be a more efficient clobber.

However, that cleanup is surprisingly complicated. I'm wondering if it's failing in some obscure way.

I could switch that to a full clobber when moving between branches. Safer, but slower.
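
To make the trade-off concrete, a full clobber on branch change could be as simple as the sketch below. This is not the `cbuildbot_launch` implementation: `prepare_buildroot` and its arguments are hypothetical, and the real cleanup logic is described above as considerably more involved.

```python
import os
import shutil
import tempfile


def prepare_buildroot(buildroot, previous_branch, new_branch):
    """Wipe the buildroot when the branch changes; otherwise reuse it as-is.

    Hypothetical sketch. Returns True when the branch changed.
    """
    branch_changed = previous_branch != new_branch
    if branch_changed and os.path.isdir(buildroot):
        # Safer than a selective cleanup, but everything must be re-fetched.
        shutil.rmtree(buildroot)
    os.makedirs(buildroot, exist_ok=True)
    return branch_changed


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "repository")
    os.makedirs(root)
    stale = os.path.join(root, "stale_file")
    with open(stale, "w") as handle:
        handle.write("left over from the previous branch\n")

    changed = prepare_buildroot(root, "factory-veyron-6591.B", "factory-gru-9017.B")
    stale_survived = os.path.exists(stale)
    print(changed, stale_survived)
```

The cost is that every branch switch starts from an empty directory, which is where the extra sync time comes from.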

Cc: shuqianz@chromium.org
In practice, such a change would affect the tryjob, release, and branch waterfalls (who all regularly change branches).

It would cause us to consume more GoB quota than we do now, but I'm not sure how much.

Comment 25 by d...@chromium.org, Nov 13 2017

Interesting! So advancing on #17/18 theory, I have discovered that if the "cacerts.txt" file is *deleted*, the failure also reproduces! This is huge, since deletion seems way more probable than something losing read permission.

Reading #21, it looks like the builder linked by yueherngl@ in #21 has not built anything after the failing build: https://screenshot.googleplex.com/KDpp8hz1TP1.png

Therefore, it should have a pristine state that can be used for probing the problem, right? I logged into the bot and checked it out, and much to my surprise the "/b/c/cbuild/repository/" directory is practically empty, and does not contain a "chromite" checkout:

$ ls -a /b/c/cbuild/repository/
.  ..  .cache  cbuildbot_logs  .completed_stages

This is important because it suggests that either:
1) At the time of failure, there is no "chromite" directory.
2) Something is nuking this directory after the failure.

Since I see no evidence of (2), and (1) has relevance to the deletion test that I ran, I suspect that it is the case. Something like:

1) cbuildbot is running.
2) cbuildbot imports "httplib2", which fixes the "cacerts.txt" path.
3) *something* is deleting the "chromite" checkout, including the embedded "cacerts.txt" file.
4) cbuildbot makes an HTTP call, which now fails b/c of (3).
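
The four steps above can be sketched in isolation. This is an illustration of the hypothesis only: `FakeHttpClient` is a hypothetical stand-in that resolves its bundle path up front and opens the file lazily, which is the behavior being assumed of the vendored library here, not something confirmed on the bots.

```python
import os
import tempfile


class FakeHttpClient:
    """Hypothetical stand-in for a library that fixes its CA bundle path early."""

    def __init__(self, package_dir):
        # The path is computed once, at "import" time; the file is not opened yet.
        self.ca_certs = os.path.join(package_dir, "cacerts.txt")

    def request(self):
        # The bundle is only read when a connection is actually made.
        with open(self.ca_certs) as bundle:
            return len(bundle.read())


def simulate():
    events = []
    with tempfile.TemporaryDirectory() as checkout:
        bundle = os.path.join(checkout, "cacerts.txt")
        with open(bundle, "w") as handle:
            handle.write("placeholder bundle\n")

        client = FakeHttpClient(checkout)   # step 2: path is fixed
        client.request()
        events.append("request before deletion: ok")

        os.remove(bundle)                   # step 3: something removes the file
        try:
            client.request()                # step 4: the call now fails
        except OSError as err:
            events.append("request after deletion: " + type(err).__name__)
    return events


events = simulate()
print(events)
```

The point is that the failure surfaces at request time, well after the import succeeded, which would match a checkout disappearing partway through a run.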

Anything you can fill in here, dgarrett@?
Cc: pho...@chromium.org
The "cbuildbot" stage is really cbuildbot_launch. Launch runs from a checkout of chromite in a recipe location that is independent of /b/c/cbuild/repository/.

The "RunCbuildbot" stage is "cbuild/repository/chromite/cbuildbot".

Which means that that directory existed when it was invoked.


However, the Cleanup stage recursively cleans up the "cbuild/repository" underneath itself. Maybe it's being overly aggressive? Still, I don't see why this would be broken only part of the time.

Comment 27 by d...@chromium.org, Nov 14 2017

*shrug* no idea why, maybe branch logic that didn't get cherry-picked, or branch-swapping logic responding incorrectly to a checkout from another branch?

The ".../chromite/cbuildbot" directory *definitely* existed at one point, since the stack trace includes its vendored "httplib2" in its path. It also definitely stopped existing at another point. Hard to say what happened in between, but I think it's a pretty solid running theory that it stopped existing before it should have :)
Owner: shuqianz@chromium.org
Passing to the deputy, though I know it's a weird cbuildbot issue.
Cc: nxia@chromium.org
What does deputy need to do here?
Owner: dgarr...@chromium.org
dgarrett@

Any idea about how to fix it?

The pre-flight builder might partially work now, but the device builder on the 9017 branch still fails for a similar reason:
https://uberchromegw.corp.google.com/i/chromeos.branch/builders/bob%20factory%20factory-gru-9017.B/builds/75

Our partner is blocked on this to update Bob factory image.
I'm still not sure what's going wrong. That's why I tried to hand it off, to see if anyone else had any ideas.
You have a couple of successful builds, but I'm not sure why it was failing in the first place.
Actually.... what exactly is the point of this pre flight?

What is it importing? Chrome? Android NYC? Android MNC? Something else?

Cc: bhthompson@chromium.org
+ bhthompson@

Bernie, do you know what might go wrong in 9017 factory branch?
https://uberchromegw.corp.google.com/i/chromeos.branch/builders/bob%20factory%20factory-gru-9017.B/builds/75
This is a gnarly issue, I am afraid I don't have anything to add to the debugging at this point without significant further thought. The solution is probably changing how we clean between runs (e.g. comment 22) which may or may not be worth it based on the frequency of this kind of bug.

But in the interim if this is blocking we may try using a trybot instead of the formal factory builder, it is possible this will work ok, you can try kicking one with:

`cros tryjob --remote --production --branch=factory-gru-9017.B bob-factory`

bhthompson@,
Yep, a trybot would work, as noted in my initial comment.
But I hope there is a long term solution to fix this branch builder.

dgarrett@,
will you be able to do a full clobber as you said in #22?


PS: I checked with David James. Nominally, the only thing the preflight builder does is uprev ebuilds, and submit the uprevs.

It might also update to the proper OS version at the same time.
I haven't, but I can.
I have a CL up now. Understand that it will slow all firmware/factory branch builds by about 20 minutes.

https://crrev.com/c/776273


Project Member

Comment 42 by bugdroid1@chromium.org, Nov 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/094422f998d7cdc7bfd95a14177a8ed97beb242a

commit 094422f998d7cdc7bfd95a14177a8ed97beb242a
Author: Don Garrett <dgarrett@google.com>
Date: Sat Nov 18 04:18:39 2017

cbuildbot_launch: Always wipe firmware/factory branches.

We've had weird issues with cbuildbot on some firmware/factory
branches doing weird things during cleanup. Instead of trying to fix
cleanup on the branches, just make sure that they always have a clean
slate. This is slower, and wasteful of GoB quota, but we perform very
few builds on these branches, so it shouldn't be a big deal.

BUG= chromium:780727 
TEST=run_tests + new unittests

Change-Id: I8bd4bf776f717b5d837337ecefd4c83ca9172706
Reviewed-on: https://chromium-review.googlesource.com/776273
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/094422f998d7cdc7bfd95a14177a8ed97beb242a/scripts/cbuildbot_launch.py
[modify] https://crrev.com/094422f998d7cdc7bfd95a14177a8ed97beb242a/scripts/cbuildbot_launch_unittest.py

Comment 43 by d...@chromium.org, Nov 18 2017

I manually kicked off a build after #42 landed, and it looks like it's making it past the failure point, so that's encouraging: https://luci-milo.appspot.com/buildbot/chromeos.branch/falco%20pre-flight%20firmware-servo-9040.B/5

That does not, unfortunately, satisfy my curiosity RE what's going on in #25. Oh well :)
Not sure now if this is a different problem. But I manually kicked off a build

https://uberchromegw.corp.google.com/i/chromeos.branch/builders/falco%20pre-flight%20firmware-servo-9040.B/builds/6

And I see the same problem?
 
  <Stack Trace>...
  File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 97, in _Fetch
  File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 76, in _DownloadCIPD
  File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 54, in _ChromeInfraRequest
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1593, in request
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1335, in _request
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1257, in _conn_request
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1021, in connect
  File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 80, in _ssl_wrap_socket
  File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
    ciphers=ciphers)
  File "/usr/lib/python2.7/ssl.py", line 241, in __init__
    ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib

Comment 45 by d...@chromium.org, Nov 20 2017

I can confirm that this is the same problem and (from the step text, "Firmware/Factory Branch: Wiping buildroot.") that it is running the latest patch. It looks like #42 did not fix this issue.
Agreed.

I do kinda feel like this might be related to builder configuration. One fix MIGHT be to move the builders to GCE. I don't think they run any VM Tests.
Project Member

Comment 47 by bugdroid1@chromium.org, Nov 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/b5fc08b115a71a7b9c102b936f0afbee261e8d86

commit b5fc08b115a71a7b9c102b936f0afbee261e8d86
Author: Don Garrett <dgarrett@chromium.org>
Date: Tue Nov 21 07:25:49 2017

Revert "cbuildbot_launch: Always wipe firmware/factory branches."

This reverts commit 094422f998d7cdc7bfd95a14177a8ed97beb242a.

Reason for revert:

This CL didn't fix the problem, and so hurts branched build performance for no reason.

Original change's description:
> cbuildbot_launch: Always wipe firmware/factory branches.
>
> We've had weird issues with cbuildbot on some firmware/factory
> branches doing weird things during cleanup. Instead of trying to fix
> cleanup on the branches, just make sure that they always have a clean
> slate. This is slower, and wasteful of GoB quota, but we perform very
> few builds on these branches, so it shouldn't be a big deal.
>
> BUG= chromium:780727 
> TEST=run_tests + new unittests
>
> Change-Id: I8bd4bf776f717b5d837337ecefd4c83ca9172706
> Reviewed-on: https://chromium-review.googlesource.com/776273
> Commit-Ready: Don Garrett <dgarrett@chromium.org>
> Tested-by: Don Garrett <dgarrett@chromium.org>
> Reviewed-by: Don Garrett <dgarrett@chromium.org>

Bug:  chromium:780727 
Change-Id: I604392864c9dccfd2a6cef26049d20ed5761e5d9
Reviewed-on: https://chromium-review.googlesource.com/780259
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/b5fc08b115a71a7b9c102b936f0afbee261e8d86/scripts/cbuildbot_launch.py
[modify] https://crrev.com/b5fc08b115a71a7b9c102b936f0afbee261e8d86/scripts/cbuildbot_launch_unittest.py

Status: WontFix (was: Untriaged)
The builders in factory-gru-9017.B magically work well now.
