pre-flight factory-gru-9017.B builder fails
Issue description
The 'pre-flight factory-gru-9017.B' builder has been failing since Oct. 30: https://uberchromegw.corp.google.com/i/chromeos.branch/builders/gru%20pre-flight%20factory-gru-9017.B/builds/116
I can't find any informative debug message in the log. I tried kicking off trybot builds for factory-gru-9017.B, e.g.:
bob-factory: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/factory/builds/10
bob-release: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/release/builds/16870
Both builds succeeded. Does anyone know how to fix the pre-flight builder? Otherwise we can't get builder-built images on this factory branch anymore.
,
Nov 3 2017
It failed before building packages. Any ideas?
File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
ciphers=ciphers)
File "/usr/lib/python2.7/ssl.py", line 241, in __init__
ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
20:10:09: INFO: Waiting for ts_mon flushing process to finish...
20:11:07: INFO: Finished waiting for ts_mon process.
,
Nov 3 2017
,
Nov 3 2017
Looks like the builder's cacert has some issue. Chrome trooper should take a look.
,
Nov 4 2017
Moving under Client>ChromeOS
,
Nov 4 2017
,
Nov 6 2017
Hi trooper, Could you take a look? This issue blocks the production for Bob.
,
Nov 6 2017
,
Nov 6 2017
adding labs oncall
,
Nov 7 2017
,
Nov 7 2017
This looks suspiciously like httplib2 does not know where to find cacerts.txt. And this does not look like a system Python implementation.
,
Nov 7 2017
> This looks suspiciously like httplib2 does not know where to find cacerts.txt
"httplib2" comes bundled with certs, and defaults to using its own: https://chromium.googlesource.com/chromiumos/chromite/+/factory-gru-9017.B/third_party/httplib2/__init__.py#187
It's worth noting that the bundled "httplib2" hasn't changed in over a year, and is used by every other CrOS builder.
> And this does not look like a system python implementation.
Curious why not? All I can tell from the stack trace is that the lines in question don't line up with the chroot "ssl.py", but do line up with the local system's "ssl.py", so I think this is probably the system Python.
System Python and system "httplib2" both seem to be able to access the failing URL: https://chromium.googlesource.com/chromiumos/chromite/+/factory-gru-9017.B/lib/cipd.py#29
I took a few shots at a repro but fell short :( The failure is happening across multiple systems, all of which have performed green builds in the interim, so it feels safe to say that this is a CrOS problem specific to the "gru pre-flight factory-gru-9017.B" builder.
,
Nov 7 2017
OTOH it does seem confined to certain builders. e.g., this build on build22-m2 seems fine: https://uberchromegw.corp.google.com/i/chromeos.branch/builders/gru%20pre-flight%20factory-gru-9017.B/builds/120
,
Nov 7 2017
Re: for build22-m2, this was before Oct 30th, when this problem started happening?
,
Nov 7 2017
I manually kicked the one I linked in #13 off just a few minutes prior, and it is doing well. I think either: 1) This is bot-specific, or 2) The certificate problem has resolved itself.
,
Nov 8 2017
I just manually kicked off the device factory build, it still failed: https://uberchromegw.corp.google.com/i/chromeos.branch/builders/bob%20factory%20factory-gru-9017.B
,
Nov 8 2017
Worth noting that failure is also on "build22-m2", the bot that succeeded in #13. Unfortunately, since the error is occurring within "cbuildbot" code and environment, I think the best thing you can do is log onto a bot (build22-m2 seems reasonable) and find a reliable reproduction case. Chrome Operations doesn't typically dive into other codebases, and the failure here is deep within Chromite code.
Copy/paste of failure stack for posterity:
File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 97, in _Fetch
File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 76, in _DownloadCIPD
File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 54, in _ChromeInfraRequest
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1593, in request
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1335, in _request
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1257, in _conn_request
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1021, in connect
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 80, in _ssl_wrap_socket
File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
ciphers=ciphers)
File "/usr/lib/python2.7/ssl.py", line 241, in __init__
ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
,
Nov 8 2017
Most searches for this error are useless, but this one intrigued me: https://stackoverflow.com/questions/15696526/ssl-throwing-error-185090050-while-authentication-via-oauth
Working more with this, I found that these branch builders rapidly flip between really old and new Chromite checkouts. The one on build22-m2 is currently a "veyron_pinky factory factory-veyron-6591.B" build, whose Chromite checkout is from October 2016, before "//chromite/third_party/httplib2" was added. The "gru" and "bob" builds are from 2017, after it was added. Perhaps switching the checkouts between old and new versions is messing up file permissions on "cacerts.txt"? To that end, I wrote a basic test:
$ mkdir -p test
$ cd test
$ git clone https://chromium.googlesource.com/chromiumos/chromite
$ PYTHONPATH=$PWD/chromite/third_party python -c 'import httplib2; print httplib2.__file__; print httplib2.Http().request(uri="https://chrome-infra-packages.appspot.com/_ah/api")'
({'status': '404', 'content-length': '9', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Mon, 01 Jan 1990 00:00:00 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-cache, no-store, max-age=0, must-revalidate', 'date': 'Wed, 08 Nov 2017 12:53:20 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="41,39,38,37,35"', 'content-type': 'text/html; charset=UTF-8'}, 'Not Found')
# Now, to trigger the failure case!
$ chmod 0220 chromite/third_party/httplib2/cacerts.txt
$ PYTHONPATH=$PWD/chromite/third_party python -c 'import httplib2; print httplib2.__file__; print httplib2.Http().request(uri="https://chrome-infra-packages.appspot.com/_ah/api")'
<Stack Trace>...
File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
ciphers=ciphers)
File "/usr/lib/python2.7/ssl.py", line 241, in __init__
ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
=====
I am speculating:
1) This is related to "cbuildbot_launch" toggling between old and new branches, possibly also because "cacerts.txt" toggles between existing and not existing depending on the branch.
2) The problem here is a permissions error on Chromite's embedded "//chromite/third_party/httplib2/cacerts.txt" file, since that's the only way I've been able to reproduce the error.
3) This is hard to repro b/c it is checkout flake. It triggers based on the previous build that the machine does, and the broken state is removed when the machine does a new build. We'd have to catch a failed build *right* after it fails to confirm.
Because of this, I propose:
A) Remove troopers / Systems from this bug. This is 100% a Chromite problem, not a problem with infrastructure or the bots.
B) Monitor https://uberchromegw.corp.google.com/i/chromeos.branch/one_line_per_build and try really hard to identify a builder immediately after a failure case. SSH onto it to confirm that something is up with "cacerts.txt".
C) In parallel with (B), look at the checkout logic for "cbuildbot_launch" or whatever's checking out Chromite and see if there is a place where it might be causing (2). Once (C) is solved, cherry-pick the solution onto all active branches.
And that's about all of the drive-by early-morning debugging I have time for :) +dgarrett@ b/c of "cbuildbot_launch".
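For step (B), a small helper can make the check on a bot quicker than eyeballing "ls -l". The following is only a sketch in Python 3 (the thread's bots run Python 2.7); the function name "describe_ca_bundle" is hypothetical and not part of Chromite or httplib2. It reports whether a given CA bundle path is missing, unreadable, empty, or apparently fine.

```python
import os
import sys


def describe_ca_bundle(path):
    """Return a short status string for a CA bundle path.

    Hypothetical diagnostic helper; not part of Chromite or httplib2.
    """
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a regular file"
    # os.access reflects the permissions of the current user, so run this
    # as the same user that runs the build.
    if not os.access(path, os.R_OK):
        return "unreadable"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


if __name__ == "__main__":
    # Usage: pass one or more candidate paths on the command line.
    for candidate in sys.argv[1:]:
        print("%s: %s" % (candidate, describe_ca_bundle(candidate)))
```

On a freshly failed builder this could be pointed at /b/c/cbuild/repository/chromite/third_party/httplib2/cacerts.txt, the path implied by the stack trace. Note that a root user passes the readability check regardless of mode bits.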
,
Nov 8 2017
More information on (#18/C): I speculate that the thing that is causing this problem is whatever software is managing "/b/c/cbuild/repository/chromite".
,
Nov 9 2017
removing troopers per #18
,
Nov 13 2017
RE: #18 We set up a new branch a few days ago and its pre-flight is showing the same problem. https://uberchromegw.corp.google.com/i/chromeos.branch/builders/falco%20pre-flight%20firmware-servo-9040.B/builds/1 Can this be a useful test case? Thanks.
,
Nov 13 2017
cbuildbot_launch is what sets up everything under /b/c/cbuild/repository. When it moves between branches, it does a detailed cleanup of the checkout which is supposed to be a more efficient clobber. However, that cleanup is surprisingly complicated. I'm wondering if it's failing in some obscure way. I could switch that to a full clobber when moving between branches. Safer, but slower.
,
Nov 13 2017
,
Nov 13 2017
In practice, such a change would affect the tryjob, release, and branch waterfalls (which all regularly change branches). It would cause us to consume more GoB quota than we do now, but I'm not sure how much.
,
Nov 13 2017
Interesting! So advancing on the #17/18 theory, I have discovered that if the "cacerts.txt" file is *deleted*, the failure also reproduces! This is huge, since deletion seems way more probable than something losing read permission.
Reading #21, it looks like the builder linked by yueherngl@ in #21 has not built anything after the failing build: https://screenshot.googleplex.com/KDpp8hz1TP1.png
Therefore, it should have a pristine state that can be used for probing the problem, right? I logged into the bot and checked it out, and much to my surprise the "/b/c/cbuild/repository/" directory is practically empty, and does not contain a "chromite" checkout:
$ ls -a /b/c/cbuild/repository/
. .. .cache cbuildbot_logs .completed_stages
This is important because it suggests that either:
1) At the time of failure, there is no "chromite" directory.
2) Something is nuking this directory after the failure.
Since I see no evidence of (2), and (1) has relevance to the deletion test that I ran, I suspect that it is the case. Something like:
1) cbuildbot is running.
2) cbuildbot imports "httplib2", which fixes the "cacerts.txt" path.
3) *something* is deleting the "chromite" checkout, including the embedded "cacerts.txt" file.
4) cbuildbot makes an HTTP call, which now fails b/c of (3).
Anything you can fill in here, dgarrett@?
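The four-step sequence above can be imitated outside of cbuildbot. This is a minimal sketch in Python 3 using only the standard library, not httplib2 or Chromite code; the function names and the temporary "fake checkout" directory are made up for illustration. It records a CA bundle path while the file exists, deletes the directory, and then tries to load the stale path into an SSL context, which fails at TLS setup time rather than at import time.

```python
import os
import shutil
import ssl
import tempfile


def try_load_ca_bundle(ca_path):
    """Try to load ca_path as a CA bundle.

    Returns None on success, or the exception that was raised.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        context.load_verify_locations(cafile=ca_path)
    except OSError as error:
        # In Python 3, ssl.SSLError and FileNotFoundError are both
        # subclasses of OSError.
        return error
    return None


def simulate_deleted_checkout():
    """Imitate steps 2-4: fix the path early, delete, then use it."""
    checkout = tempfile.mkdtemp(prefix="fake_chromite_")
    # Step 2: the bundle path is decided while the checkout still exists.
    ca_path = os.path.join(checkout, "cacerts.txt")
    with open(ca_path, "w") as handle:
        handle.write("placeholder, not a real certificate\n")
    # Step 3: something removes the checkout, bundle included.
    shutil.rmtree(checkout)
    # Step 4: a later TLS setup still uses the stale path and fails.
    return ca_path, try_load_ca_bundle(ca_path)


if __name__ == "__main__":
    stale_path, error = simulate_deleted_checkout()
    print("stale path: %s" % stale_path)
    print("load result: %r" % (error,))
```

The exact exception type and message differ between Python and OpenSSL versions, so this does not reproduce the "Errno 185090050" text from the thread; it only shows that the failure surfaces when the path is used, not when it is chosen.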
,
Nov 14 2017
The "cbuildbot" stage is really cbuildbot_launch. Launch runs from a checkout of chromite in a recipe location that is independent of /b/c/cbuild/repository/. The "RunCbuildbot" stage is "cbuild/repository/chromite/cbuildbot", which means that that directory existed when it was invoked. However, the Cleanup stage recursively cleans up the "cbuild/repository" underneath itself. Maybe it's being overly aggressive? However, I don't see why this would be broken only part of the time.
,
Nov 14 2017
*shrug* no idea why, maybe branch logic that didn't get cherry-picked, or branch-swapping logic responding incorrectly to a checkout from another branch? The ".../chromite/cbuildbot" directory *definitely* existed at one point, since the stack trace includes its vendored "httplib2" in its path. It also definitely stopped existing at another point. Hard to say what happened in between, but I think it's a pretty solid running theory that it stopped existing before it should have :)
,
Nov 14 2017
Passing to the deputy, though I know it's a weird cbuildbot issue.
,
Nov 14 2017
,
Nov 14 2017
What does deputy need to do here?
,
Nov 14 2017
,
Nov 15 2017
dgarrett@ Any idea how to fix it? The pre-flight builder might partially work, but the device builder on the 9017 branch still fails for a similar reason: https://uberchromegw.corp.google.com/i/chromeos.branch/builders/bob%20factory%20factory-gru-9017.B/builds/75 Our partner is blocked on this to update the Bob factory image.
,
Nov 15 2017
I'm still not sure what's going wrong. That's why I tried to hand it off, to see if anyone else had any ideas.
,
Nov 15 2017
You have a couple of successful builds, but I'm not sure why it was failing in the first place.
,
Nov 15 2017
Actually.... what exactly is the point of this pre flight? What is it importing? Chrome? Android NYC? Android MNC? Something else?
,
Nov 15 2017
+ bhthompson@ Bernie, do you know what might go wrong in 9017 factory branch? https://uberchromegw.corp.google.com/i/chromeos.branch/builders/bob%20factory%20factory-gru-9017.B/builds/75
,
Nov 16 2017
This is a gnarly issue; I am afraid I don't have anything to add to the debugging at this point without significant further thought. The solution is probably changing how we clean between runs (e.g. comment 22), which may or may not be worth it based on the frequency of this kind of bug. But in the interim, if this is blocking, we may try using a trybot instead of the formal factory builder. It is possible this will work OK; you can try kicking one off with: `cros tryjob --remote --production --branch=factory-gru-9017.B bob-factory`
,
Nov 16 2017
bhthompson@, Yep, a trybot would work, as noted in my initial comment. But I hope there is a long-term solution to fix this branch builder. dgarrett@, will you be able to do a full clobber as you said in #22?
,
Nov 17 2017
PS: I checked with David James. Nominally, the only thing the preflight builder does is uprev ebuilds, and submit the uprevs. It might also update to the proper OS version at the same time.
,
Nov 17 2017
I haven't, but I can.
,
Nov 17 2017
I have a CL up now. Understand that it will slow all firmware/factory branch builds by about 20 minutes. https://crrev.com/c/776273
,
Nov 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/094422f998d7cdc7bfd95a14177a8ed97beb242a
commit 094422f998d7cdc7bfd95a14177a8ed97beb242a
Author: Don Garrett <dgarrett@google.com>
Date: Sat Nov 18 04:18:39 2017
cbuildbot_launch: Always wipe firmware/factory branches.
We've had weird issues with cbuildbot on some firmware/factory branches doing weird things during cleanup. Instead of trying to fix cleanup on the branches, just make sure that they always have a clean slate. This is slower, and wasteful of GoB quota, but we perform very few builds on these branches, so it shouldn't be a big deal.
BUG= chromium:780727
TEST=run_tests + new unittests
Change-Id: I8bd4bf776f717b5d837337ecefd4c83ca9172706
Reviewed-on: https://chromium-review.googlesource.com/776273
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
[modify] https://crrev.com/094422f998d7cdc7bfd95a14177a8ed97beb242a/scripts/cbuildbot_launch.py
[modify] https://crrev.com/094422f998d7cdc7bfd95a14177a8ed97beb242a/scripts/cbuildbot_launch_unittest.py
,
Nov 18 2017
I manually kicked off a build after #42 landed, and it looks like it's making it past the failure point, so that's encouraging: https://luci-milo.appspot.com/buildbot/chromeos.branch/falco%20pre-flight%20firmware-servo-9040.B/5 That does not, unfortunately, satisfy my curiosity RE what's going on in #25. Oh well :)
,
Nov 20 2017
Not sure now if this is a different problem. But I manually kicked off a build https://uberchromegw.corp.google.com/i/chromeos.branch/builders/falco%20pre-flight%20firmware-servo-9040.B/builds/6 And I see the same problem?
<Stack Trace>...
File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 97, in _Fetch
File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 76, in _DownloadCIPD
File "/b/c/cbuild/repository/chromite/lib/cipd.py", line 54, in _ChromeInfraRequest
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1593, in request
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1335, in _request
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1257, in _conn_request
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 1021, in connect
File "/b/c/cbuild/repository/chromite/third_party/httplib2/__init__.py", line 80, in _ssl_wrap_socket
File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
ciphers=ciphers)
File "/usr/lib/python2.7/ssl.py", line 241, in __init__
ciphers)
ssl.SSLError: [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
,
Nov 20 2017
I can confirm that this is the same problem and (from the step text, "Firmware/Factory Branch: Wiping buildroot.") that it is running the latest patch. It looks like #42 did not fix this issue.
,
Nov 20 2017
Agreed. I do kinda feel like this might be related to builder configuration. One fix MIGHT be to move the builders to GCE. I don't think they run any VM Tests.
,
Nov 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/b5fc08b115a71a7b9c102b936f0afbee261e8d86
commit b5fc08b115a71a7b9c102b936f0afbee261e8d86
Author: Don Garrett <dgarrett@chromium.org>
Date: Tue Nov 21 07:25:49 2017
Revert "cbuildbot_launch: Always wipe firmware/factory branches."
This reverts commit 094422f998d7cdc7bfd95a14177a8ed97beb242a.
Reason for revert: This CL didn't fix the problem, and so hurts branched build performance for no reason.
Original change's description:
> cbuildbot_launch: Always wipe firmware/factory branches.
>
> We've had weird issues with cbuildbot on some firmware/factory
> branches doing weird things during cleanup. Instead of trying to fix
> cleanup on the branches, just make sure that they always have a clean
> slate. This is slower, and wasteful of GoB quota, but we perform very
> few builds on these branches, so it shouldn't be a big deal.
>
> BUG= chromium:780727
> TEST=run_tests + new unittests
>
> Change-Id: I8bd4bf776f717b5d837337ecefd4c83ca9172706
> Reviewed-on: https://chromium-review.googlesource.com/776273
> Commit-Ready: Don Garrett <dgarrett@chromium.org>
> Tested-by: Don Garrett <dgarrett@chromium.org>
> Reviewed-by: Don Garrett <dgarrett@chromium.org>
Bug: chromium:780727
Change-Id: I604392864c9dccfd2a6cef26049d20ed5761e5d9
Reviewed-on: https://chromium-review.googlesource.com/780259
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
[modify] https://crrev.com/b5fc08b115a71a7b9c102b936f0afbee261e8d86/scripts/cbuildbot_launch.py
[modify] https://crrev.com/b5fc08b115a71a7b9c102b936f0afbee261e8d86/scripts/cbuildbot_launch_unittest.py
,
Dec 22 2017
The builders in factory-gru-9017.B magically work well now.
Comment 1 by philipchen@chromium.org
, Nov 3 2017