
Issue 876964 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux, Android, Windows, Mac
Pri: 1
Type: Bug-Regression

Update best_revision calculation to use buildbucket v2 API

Project Member Reported by nyerramilli@chromium.org, Aug 23

Issue description

Canary #70.0.3531.0 failed at trigger page - Exception steps exception calculate_branch_revisions.best_revision failed failure reason

OS: All
Canary Time: Aug. 22, 2018, 8:02 p.m.

Builder URL:
--------------
https://uberchromegw.corp.google.com/i/official.infra.cron/builders/chrome-branch/builds/775

Error Log URL:
----------------------
https://logs.chromium.org/v/?s=infra-internal%2Fbb%2Fofficial.infra.cron%2Fchrome-branch%2F775%2F%2B%2Frecipes%2Fsteps%2Fcalculate_branch_revisions%2F0%2Fsteps%2Fbest_revision%2F0%2Fstdout

Error log (please check the error log URL for more details):
Traceback (most recent call last):
  File "/mnt/data/b/rr/tmpSjuMZQ/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 472, in <module>
    sys.exit(main())
  File "/mnt/data/b/rr/tmpSjuMZQ/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 442, in main
    builds_data = get_builds_data(master, builder, args.limit, token=token)
  File "/mnt/data/b/rr/tmpSjuMZQ/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 150, in get_builds_data
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 773, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error
step returned non-zero exit code: 1

Michael, could you please check?
 
Cc: -gov...@chromium.org mmoss@chromium.org
Owner: gov...@chromium.org
For a server error (in this case, apparently in the milo service), there's not much we can do but try again a little later. If you still want this canary, go ahead and trigger another one. If not, we can just skip it and wait for tonight.
Cc: benmason@google.com
Scheduled a canary trigger; hopefully it goes through.
Looks like it failed again:

https://uberchromegw.corp.google.com/i/official.infra.cron/builders/chrome-branch/builds/776

[DEBUG:urllib3.connectionpool] "POST /prpc/milo.Buildbot/GetBuildbotBuildsJSON HTTP/1.1" 500 62
Traceback (most recent call last):
  File "/mnt/data/b/rr/tmpKpgB3U/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 472, in <module>
    sys.exit(main())
  File "/mnt/data/b/rr/tmpKpgB3U/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 442, in main
    builds_data = get_builds_data(master, builder, args.limit, token=token)
  File "/mnt/data/b/rr/tmpKpgB3U/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 150, in get_builds_data
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 773, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error
step returned non-zero exit code: 1
And it failed again, both times when doing:

[INFO:root] Fetching builds for "chromium.chromiumos/linux-chromeos-rel" ...

Maybe milo is barfing on bad data for that builder or something? This probably needs to be escalated to milo folks.
Cc: -mmoss@chromium.org gov...@chromium.org
Owner: mmoss@chromium.org
Assigning back to mmoss@, please feel free to reassign. Thank you.
Cc: hinoka@chromium.org estaab@chromium.org no...@chromium.org
I can also reproduce the error when running the script locally, although in that case it doesn't always fail on the linux-chromeos-rel query, so it looks like it's just some generic flakiness with the service.

+hinoka, estaab, nodir as luci-milo contacts. Is milo having problems? I think we've very occasionally seen 500s here before, but nothing this consistent.
Cc: -gov...@chromium.org mmoss@chromium.org
Owner: gov...@chromium.org
My last few manual runs haven't had any errors, so maybe the flakiness has passed. Passing to govind to trigger again.
Milo HTTP 500s will need to be investigated, but note that in general HTTP 500s may be caused by factors we don't control, so clients should retry requests on transient errors (HTTP code >= 500).
Yeah, I added retries, but since we were seeing more frequent errors, I was just wondering if there was a more serious, non-transient issue going on.
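
(For illustration, a minimal sketch of the retry-on-transient-error approach discussed above. This is not the actual best_revision.py code; the helper name, attempt count, and backoff constants are illustrative.)

  import json
  import logging
  import random
  import time

  import requests

  def post_with_retries(url, body, headers=None, attempts=3):
      """POST `body` as JSON, retrying transient (HTTP >= 500) failures."""
      for attempt in range(attempts):
          r = requests.post(url, data=json.dumps(body), headers=headers)
          if r.status_code < 500:
              r.raise_for_status()  # 4xx is treated as permanent
              return r
          if attempt == attempts - 1:
              r.raise_for_status()  # retries exhausted; surface the 500
          delay = 10 * (2 ** attempt) + random.uniform(0, 5)
          logging.warning('Fetch failed (%d). Waiting %ds before retry.',
                          r.status_code, delay)
          time.sleep(delay)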
Owner: mmoss@chromium.org
Canary trigger in queue, assigning back to mmoss@.
Retry failed again, with a different error (Exception steps exception publish_branch_buildspec.git cl upload retry 3 failed failure reason):

https://uberchromegw.corp.google.com/i/official.infra.cron/builders/chrome-branch/builds/777
Maybe related to Issue 876910. That issue just landed another fix, which has rolled into the official recipes, so try again to see if that helps.
Cc: tandrii@chromium.org
+tandrii, any idea what's going on with 'git cl upload' here? Maybe related to stuff you've been doing lately, or recent gerrit changes?

https://logs.chromium.org/v/?s=infra-internal%2Fbb%2Fofficial.infra.cron%2Fchrome-branch%2F778%2F%2B%2Frecipes%2Fsteps%2Fpublish_branch_buildspec%2F0%2Fsteps%2Fgit_cl_upload%2F0%2Fstdout

yep, definitely my change. Is there a way to pin depot_tools at older revisions, or should I revert all my CLs?
Project Member Comment 16 by bugdroid1@chromium.org, Aug 23

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome/tools/release/scripts/+/2191486728d6f4e3856cbbf032f46191fc8c5f1c

commit 2191486728d6f4e3856cbbf032f46191fc8c5f1c
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Thu Aug 23 21:57:46 2018

Build triggered
Project Member Comment 18 by bugdroid1@chromium.org, Aug 23

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome/tools/release/scripts/+/2ed5bffdc9928599459282e618a7a33264dc0da7

commit 2ed5bffdc9928599459282e618a7a33264dc0da7
Author: Michael Moss <mmoss@google.com>
Date: Thu Aug 23 22:05:12 2018

Project Member Comment 19 by bugdroid1@chromium.org, Aug 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/depot_tools/+/6365bc444603bb8e890a0b913757379c9c074f6c

commit 6365bc444603bb8e890a0b913757379c9c074f6c
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Thu Aug 23 22:14:49 2018

git cl: temporary stop using project~number on Gerrit.

Broke release process.

R=ehmaldonado, mmoss

Bug: 876964, 876910
Change-Id: I02ea424632f5c5522af0010adce1c993e2610b48
Reviewed-on: https://chromium-review.googlesource.com/1187548
Reviewed-by: Michael Moss <mmoss@chromium.org>
Commit-Queue: Andrii Shyshkalov <tandrii@chromium.org>

[modify] https://crrev.com/6365bc444603bb8e890a0b913757379c9c074f6c/tests/git_cl_test.py
[modify] https://crrev.com/6365bc444603bb8e890a0b913757379c9c074f6c/git_cl.py

Project Member Comment 20 by bugdroid1@chromium.org, Aug 23

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome/tools/release/scripts/+/44e4bd4f0c932026708ed3b6e073b6b83b1cc09e

commit 44e4bd4f0c932026708ed3b6e073b6b83b1cc09e
Author: Andrii Shyshkalov <tandrii@google.com>
Date: Thu Aug 23 22:17:55 2018

Status: Fixed (was: Assigned)
The build with the temporary depot_tools rollback succeeded:

https://uberchromegw.corp.google.com/i/official.infra.cron/builders/chrome-branch/builds/780
Project Member Comment 22 by bugdroid1@chromium.org, Aug 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/depot_tools/+/1e82867e3d0f8f3f96672c0172ff4adad1777ce3

commit 1e82867e3d0f8f3f96672c0172ff4adad1777ce3
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Thu Aug 23 22:34:37 2018

git cl: restart using project~number on Gerrit.

This fixes Gerrit project detection based on remote URL,
accounting for potential 'a/' prefix in the URL path component,
which isn't part of the Gerrit project name.

R=ehmaldonado, mmoss

Bug: 876964, 876910
Change-Id: I473ae8c6c9e0f2034b350901abd67db151e0a3d3
Reviewed-on: https://chromium-review.googlesource.com/1187573
Reviewed-by: Michael Moss <mmoss@chromium.org>
Commit-Queue: Andrii Shyshkalov <tandrii@chromium.org>

[modify] https://crrev.com/1e82867e3d0f8f3f96672c0172ff4adad1777ce3/tests/git_cl_test.py
[modify] https://crrev.com/1e82867e3d0f8f3f96672c0172ff4adad1777ce3/git_cl.py

Project Member Comment 23 by bugdroid1@chromium.org, Aug 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8246b935fed83207fce73a19c9f06c8d896f1639

commit 8246b935fed83207fce73a19c9f06c8d896f1639
Author: depot-tools-chromium-autoroll <depot-tools-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Fri Aug 24 05:44:43 2018

Roll src/third_party/depot_tools b16da6a2a9e5..7b7eb8800be0 (3 commits)

https://chromium.googlesource.com/chromium/tools/depot_tools.git/+log/b16da6a2a9e5..7b7eb8800be0


git log b16da6a2a9e5..7b7eb8800be0 --date=short --no-merges --format='%ad %ae %s'
2018-08-23 recipe-roller@chromium.org Roll recipe dependencies (trivial).
2018-08-23 tandrii@chromium.org git cl: restart using project~number on Gerrit.
2018-08-23 tandrii@chromium.org git cl: temporary stop using project~number on Gerrit.


Created with:
  gclient setdep -r src/third_party/depot_tools@7b7eb8800be0

The AutoRoll server is located here: https://depot-tools-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.



BUG=chromium:876964,chromium:876910,chromium:876964,chromium:876910
TBR=agable@chromium.org

Change-Id: Ifa9855cab76e938fb4be4584ebb2e61708bcb3e8
Reviewed-on: https://chromium-review.googlesource.com/1187095
Reviewed-by: depot-tools-chromium-autoroll <depot-tools-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: depot-tools-chromium-autoroll <depot-tools-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#585706}
[modify] https://crrev.com/8246b935fed83207fce73a19c9f06c8d896f1639/DEPS

Today's (08/23/18) 8:00 PM PST canary failed to trigger with the same error:

https://uberchromegw.corp.google.com/i/official.infra.cron/builders/chrome-branch/builds/783


Status: Assigned (was: Fixed)
Reopening the bug.
The official master needs to be restarted to pick up the fix.
mmoss@, would it be possible for you to restart the official master, as this is blocking the canary release?
Owner: no...@chromium.org
Restart to pick up what fix? Maybe you're thinking of the duplicate buildbucket trigger bug? This bug is about failures talking to milo again, even after adding more retries and delays. It's failing the same request three times over a period of about two minutes:

[INFO:root] Fetching builds for "chromium.chromiumos/linux-chromeos-rel" ...
[INFO:urllib3.connectionpool] Starting new HTTPS connection (1): luci-milo.appspot.com
[DEBUG:urllib3.connectionpool] Setting read timeout to None
[DEBUG:urllib3.connectionpool] "POST /prpc/milo.Buildbot/GetBuildbotBuildsJSON HTTP/1.1" 500 62
[WARNING:root] Fetch failed (500). Waiting 14s before retry (2 more).
[INFO:urllib3.connectionpool] Starting new HTTPS connection (1): luci-milo.appspot.com
[DEBUG:urllib3.connectionpool] Setting read timeout to None
[DEBUG:urllib3.connectionpool] "POST /prpc/milo.Buildbot/GetBuildbotBuildsJSON HTTP/1.1" 500 62
[WARNING:root] Fetch failed (500). Waiting 26s before retry (1 more).
[INFO:urllib3.connectionpool] Starting new HTTPS connection (1): luci-milo.appspot.com
[DEBUG:urllib3.connectionpool] Setting read timeout to None
[DEBUG:urllib3.connectionpool] "POST /prpc/milo.Buildbot/GetBuildbotBuildsJSON HTTP/1.1" 500 62
[WARNING:root] Fetch failed (500). Waiting 74s before retry (0 more).
Traceback (most recent call last):
  File "/mnt/data/b/rr/tmpZcCaVP/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 482, in <module>
    sys.exit(main())
  File "/mnt/data/b/rr/tmpZcCaVP/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 452, in main
    builds_data = get_builds_data(master, builder, args.limit, token=token)
  File "/mnt/data/b/rr/tmpZcCaVP/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 160, in get_builds_data
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 773, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error


I suppose I could add a lot more retries, but I'd really rather understand what's going wrong on the server side, and see if that can be fixed.

Just to update: yesterday (08/25) Canary 70.0.3533.0 triggered successfully and all the builds are available - https://pantheon.corp.google.com/storage/browser/chrome-signed/desktop-5c0tCh/70.0.3533.0.

And today (08/26) Canary #70.0.3534.0 also triggered successfully and the builds are compiling.
Sorry, I confused this issue with the one about duplicated builds. Please ignore #c26.
Cc: gov...@chromium.org
Today's Canary job failed for the same reason:

Build URL:
---------------
https://uberchromegw.corp.google.com/i/official.infra.cron/builders/chrome-branch/builds/788

Error Log URL:
--------------
https://logs.chromium.org/v/?s=infra-internal%2Fbb%2Fofficial.infra.cron%2Fchrome-branch%2F788%2F%2B%2Frecipes%2Fsteps%2Fcalculate_branch_revisions%2F0%2Fsteps%2Fbest_revision%2F0%2Fstdout

Error Log:
-----------

[WARNING:root] Fetch failed (500). Waiting 14s before retry (2 more).
[INFO:urllib3.connectionpool] Starting new HTTPS connection (1): luci-milo.appspot.com
[DEBUG:urllib3.connectionpool] Setting read timeout to None
[DEBUG:urllib3.connectionpool] "POST /prpc/milo.Buildbot/GetBuildbotBuildsJSON HTTP/1.1" 500 62
[WARNING:root] Fetch failed (500). Waiting 26s before retry (1 more).
[INFO:urllib3.connectionpool] Starting new HTTPS connection (1): luci-milo.appspot.com
[DEBUG:urllib3.connectionpool] Setting read timeout to None
[DEBUG:urllib3.connectionpool] "POST /prpc/milo.Buildbot/GetBuildbotBuildsJSON HTTP/1.1" 500 62
[WARNING:root] Fetch failed (500). Waiting 74s before retry (0 more).
Traceback (most recent call last):
  File "/mnt/data/b/rr/tmp3uDLQ7/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 482, in <module>
    sys.exit(main())
  File "/mnt/data/b/rr/tmp3uDLQ7/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 452, in main
    builds_data = get_builds_data(master, builder, args.limit, token=token)
  File "/mnt/data/b/rr/tmp3uDLQ7/rw/checkout/recipes/recipe_modules/release/resources/best_revision.py", line 160, in get_builds_data
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 773, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error
step returned non-zero exit code: 1

As per #c1, will re-trigger after some time and update.
It's curious that, other than some of the failures I saw when I ran the script manually, these failures all seem to happen when fetching "chromium.chromiumos/linux-chromeos-rel".

nodir@, could that be a clue to something wrong with that data on the milo end, or is it just coincidence?
Canary retrigger at 11:30 PM on 08/28 failed with the same error: https://uberchromegw.corp.google.com/i/official.infra.cron/builders/chrome-branch/builds/789
The canary job triggered at 11:54 p.m. succeeded, and all the signed binaries are available:

https://pantheon.corp.google.com/storage/browser/chrome-signed/desktop-5c0tCh/70.0.3535.4
Labels: -Pri-1 Pri-0
Please note the job trigger at #34 is from the existing canary branch #3535, version 70.0.3535.4.

The canary trigger from ToT was failing last night, so raising to P0.
Status: Started (was: Assigned)
Owner: mmoss@google.com
Status: Assigned (was: Started)
best_revision.py tries to load the 50 latest builds. CrOS builds are huge. Example: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/linux-chromeos-rel/12648


  echo '{"master": "chromium.chromiumos", "builder":"linux-chromeos-rel", "limit":50}' | prpc call ci.chromium.org milo.Buildbot.GetBuildbotBuildsJSON  > res  

gives me 

  HTTP 500: no gRPC code. Body: "HTTP response was too large: 33917886. The limit is: 
  33554432."

(This message is present in responses to best_revision.py, so please improve logging in best_revision.py to help yourself)

This is a (reasonable) GAE limitation of 32 MB. There is nothing Milo can do to change this.
Please don't request 50 builds in one request. Short term, I recommend loading 10 per request with paging.
Long term, please switch to the buildbucket v2 API. It just handled 100 builds for me, using a partial response:

https://cr-buildbucket.appspot.com/rpcexplorer/services/buildbucket.v2.Builds/SearchBuilds?request={%20%20%20%20%22predicate%22:%20{%20%20%20%20%20%20%20%20%22builder%22:%20{%20%20%20%20%20%20%20%20%20%20%20%20%22project%22:%20%22chromium%22,%20%20%20%20%20%20%20%20%20%20%20%20%22bucket%22:%20%22ci%22,%20%20%20%20%20%20%20%20%20%20%20%20%22builder%22:%20%22linux-chromeos-rel%22%20%20%20%20%20%20%20%20}%20%20%20%20},%20%20%20%20%22fields%22:%20%22builds.*.id,builds.*.status,builds.*.steps.*.name,builds.*.steps.*.status%22}
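
(For reference, a sketch of roughly the same query from Python over the pRPC JSON interface. The endpoint, predicate, and fields values come from the RPC explorer link above; the token handling and the stripping of the ")]}'" response prefix reflect a general understanding of the pRPC JSON protocol, not code from the recipe.)

  import json

  import requests

  PRPC_URL = ('https://cr-buildbucket.appspot.com/prpc/'
              'buildbucket.v2.Builds/SearchBuilds')

  def search_builds(project, bucket, builder, fields, token=None):
      # pRPC JSON call: JSON request body in, JSON response body out.
      headers = {'Content-Type': 'application/json',
                 'Accept': 'application/json'}
      if token:
          headers['Authorization'] = 'Bearer ' + token
      body = {
          'predicate': {
              'builder': {'project': project,
                          'bucket': bucket,
                          'builder': builder},
          },
          # Partial response: request only the fields the script actually uses.
          'fields': fields,
      }
      r = requests.post(PRPC_URL, data=json.dumps(body), headers=headers)
      r.raise_for_status()
      # pRPC JSON responses start with a ")]}'" line; drop it before parsing.
      return json.loads(r.text.split('\n', 1)[1])

  builds_data = search_builds(
      'chromium', 'ci', 'linux-chromeos-rel',
      fields='builds.*.id,builds.*.status,'
             'builds.*.steps.*.name,builds.*.steps.*.status')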

Reassigning, unless you have ideas about what can be done on the milo side to address this.
Owner: mmoss@chromium.org
> Short term, I recommend loading 10 per request with paging. 

Any pointers on how to do paging with the milo API? I can obviously use limit to get the last 10 builds, but how do I get builds 10-20, 20-30, etc., without asking for a limit of 20 or 30 or more?
Ah, sorry Michael, I deleted the paging functionality 11 months ago because it complicated emulation.
https://chromium-review.googlesource.com/c/infra/luci/luci-go/+/685354

options:

1) Request 20 if <condition>, where the condition could be "builder=linux-chromeos-rel" or "master=chromium.chromiumos". This seems to work now, but it's not guaranteed to work tomorrow, and it is ad-hoc.

2) Use the v2 API with partial responses. We can start with v2 only for this builder (ad-hoc), and later switch to using it for all LUCI builders.




To clarify, v2 doesn't work with non-LUCI builders (e.g. the official builders), so I can't get all the data from that yet, right?
Correct, not all builders are fully supported by the buildbucket v2 API, so it would be a progressive transition. CrOS is completely on LUCI, though.
And when you say "partial responses" with the v2 API, you mean where the request explicitly asks for only some of the build data (using the "fields" param, like in your example), or something else?
Project Member Comment 43 by bugdroid1@chromium.org, Aug 29

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome/tools/release/scripts/+/d0825066ea65d89c6a0f9e45ef2374e191226f1c

commit d0825066ea65d89c6a0f9e45ef2374e191226f1c
Author: Michael Moss <mmoss@google.com>
Date: Wed Aug 29 17:53:52 2018

Yes. By default, v2 returns only fields that are unlikely to take a lot of bytes. To fetch other fields, e.g. properties or steps, a complete list of fields must be explicitly specified. The absence of partial-response support in Milo is one of the reasons the response is large. For example, it is loading stuff like step_text, and we've had users putting 100Ks of test names there when all tests failed.

Please specify only the fields that are actually used by the script, e.g. builds.*.steps.*.status (step statuses), as opposed to builds.*.steps (all fields of steps).
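
(As an illustration of that distinction, with mask values in the syntax from the RPC explorer example above; not code from the script.)

  # Narrow mask: just the build and step statuses the script inspects.
  fields = 'builds.*.id,builds.*.status,builds.*.steps.*.name,builds.*.steps.*.status'
  # Broad mask: every field of every step (step_text, logs, ...), which is
  # what inflates the response size.
  # fields = 'builds.*.steps'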
sgtm, thanks
Another example: best_revision.py does not use properties, but Milo returns them. CrOS's list of output properties is quite large.
Labels: -Pri-0 Pri-1
lowering priority since #43 should provide a workaround for the current failure.
Canary #70.0.3536.0 was triggered successfully. Thank you.
Components: -Build Infra>Client>Chrome>Release
Summary: Update best_revision calculation to use buildbucket v2 API (was: Canary Delivery issue: Exception steps exception calculate_branch_revisions.best_revision failed failure reason)
Updating the subject to reflect why this is actually still open. Initially switch just the CrOS builder, then the others once they are migrated to LUCI (per #41).
