New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 698996 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 698992



Sign in to add a comment

Long delay before builds are leased from buildbucket after master restart

Project Member Reported by philwright@chromium.org, Mar 7 2017

Issue description

(outalation that inspired this: http://o/365520)

It can take a very long time to lease builds after a master gets restarted, as can be seen in this Buildbot Overview dashboard ... http://shortn/_RgTrw4488G

Note also these master.twistd logs for a specific build id - https://pantheon.corp.google.com/logs/viewer?project=chrome-infra-logs&minLogLevel=0&expandAll=false&resource=global&advancedFilter=resource.type%3D%22global%22%0AlogName%3D%22projects%2Fchrome-infra-logs%2Flogs%2Fmaster_twistd.log%22%0Alabels.%22cloudtail%2Fresource_id%22%3D%22master.tryserver.chromium.linux%22%0A%22%5Bbuildbucket%5D%22%0A%228986188187968384384%22&timestamp=2017-03-03T01:35:38.000000000Z

In the outalation, the master restarted at around 16:15, and bots were attaching again by 16:21. But there appears to be a massive lag in the time it takes to start leasing builds from buildbucket.

Suspected culprit is race condition at https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/master/buildbucket/integration.py#164

It's possible for several separate execution flows to all end up calling _ensure_leases_loaded and getting past the initial if condition, because self._leases isn't set to something until the first flow has returned. This is borne out by the following logs - https://pantheon.corp.google.com/logs/viewer?project=chrome-infra-logs&minLogLevel=0&expandAll=false&resource=global&advancedFilter=resource.type%3D%22global%22%0AlogName%3D%22projects%2Fchrome-infra-logs%2Flogs%2Fmaster_twistd.log%22%0Alabels.%22cloudtail%2Fresource_id%22%3D%22master.tryserver.chromium.linux%22%0A%22%5Bbuildbucket%5D%22%0A%22Loaded%20current%20leases%22&timestamp=2017-03-03T00:40:10.000000000Z

The theory we came up with is that each execution flow in _ensure_leases_loaded then proceeds to execute self._send_heartbeats(), which generates a massive backlog of requests in the buildbucket client, which has a limit of 10 simultaneous requests, so it takes a long time to flush through all the requests.

This in turn prevents any other API requests to occur - ie requests to find any new builds.

---------------------
Whilst debugging this issue, we identified a few misc issues and areas where we think logging and monitoring could be improved ...

1) Put master name in the user-agent string for buildbucket client, to make searching through buildbucket's logs easier.

2) There seems to be a huge number of api requests coming in for buildbucket search. (like, ~100 qps). Is this expected? See https://pantheon.corp.google.com/logs/viewer?project=cr-buildbucket&minLogLevel=0&expandAll=false&resource=gae_app&logName=projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log&timestamp=2017-03-07T05:56:32.802036000Z

3) Make buildbucket include the build ids alongside the heartbeat APIs log mesages (currently POST requests, so build id isn't recorded in the log message - see https://pantheon.corp.google.com/logs/viewer?project=cr-buildbucket&minLogLevel=0&expandAll=false&resource=gae_app&logName=projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log&advancedFilter=resource.type%3D%22gae_app%22%0AlogName%3D%22projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log%22%0AprotoPayload.method!%3D%22GET%22%0A%22heartbeat%22&timestamp=2017-03-07T05:56:18.849578000Z)


 
Blockedon: 698992

Comment 2 by no...@chromium.org, Mar 8 2017

thank you for the productive investigation

Comment 3 by no...@chromium.org, Mar 8 2017

Status: Started (was: Untriaged)
(2) is expected - Rietveld and Gerrit fetch builds to display them alongside a CL all the time
Project Member

Comment 4 by bugdroid1@chromium.org, Mar 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/164a47e0aa7cb406d19b77fefdb6808c41d942ad

commit 164a47e0aa7cb406d19b77fefdb6808c41d942ad
Author: Nodir Turakulov <nodir@chromium.org>
Date: Wed Mar 08 00:59:02 2017

buildbucket: fix race in _ensure_leases_loaded

R=philwright@chromium.org, dsansome@chromium.org
BUG= 698996 

Change-Id: Idb7a11465e591418e85501bed809cbbaaa2a62c2
Reviewed-on: https://chromium-review.googlesource.com/451547
Reviewed-by: Dave Sansome <dsansome@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/164a47e0aa7cb406d19b77fefdb6808c41d942ad/scripts/master/buildbucket/integration.py

Comment 5 by no...@chromium.org, Mar 8 2017

> (3)
That would cause a lot of logs, in particular N log lines every 5 min, where N is the number of pending and running builds on the master. Is this OK?

Comment 6 by no...@chromium.org, Mar 8 2017

> (1)
There is http_client_name used in InstrumentedHttp https://cs.chromium.org/chromium/build/third_party/infra_libs/infra_libs/httplib2_utils.py?type=cs&q=InstrumentedHttp&l=301

The name "http_client_name" is rather generic doesn't necessarily imply that it has to be about monitoring. Same applies InstrumentedHttp.

Can we derive user agent from InstrumentedHttp's name and set the user agent in InstrumentedHttp?
(3) Ah I see - each of these heartbeat calls actually contains multiple Build IDs, right? Would it be possible to add a single log entry that lists all the builds (and the bucket/master) in each heartbeat request? I'm not super familiar with appengine, but it would be great if the log showed up as a "sub-log" for the same api request id, similar to how "Heartbeat for build" log message shows up https://chromium.googlesource.com/infra/infra/+/master/appengine/cr-buildbucket/service.py#791

(1) I'll check with Dave about this ...
(3) Note ... I'm guessing it's not actually trivial to find out the bucket/master for a particular build id, right? So logging the bucket/master may not be feasible here?

Comment 9 by no...@chromium.org, Mar 8 2017

(3) yes, heartbeat request contains multiple build ids

https://chromium.googlesource.com/infra/infra/+/master/appengine/cr-buildbucket/service.py#789 loads the build, so I can load the bucket from there. But that would result in a log line (under request log) for each build. I don't know, seems like a lot. The reason I didn't need it is if build does not expire [1], then I know that heartbeats were received

[1]: https://pantheon.corp.google.com/logs/viewer?project=cr-buildbucket&minLogLevel=200&expandAll=false&resource=gae_app%2Fmodule_id%2Fbackend&versionId=3300-a706456&logName=projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log&filters=path:reset_expired_builds
(3) I've created https://chromium-review.googlesource.com/c/451025/ to add logging per batch heartbeat. Thoughts?

(1) I'm not sure what I'm meant to be looking at ...
Project Member

Comment 11 by bugdroid1@chromium.org, Mar 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/0c2d37779791c4af88d1fe63dee953471621552f

commit 0c2d37779791c4af88d1fe63dee953471621552f
Author: philwright <philwright@google.com>
Date: Thu Mar 09 00:00:26 2017

adding logging to batch heartbeat

BUG= 698996 

Change-Id: I5479200a9565d54764a706784b423cbb25b209ed
Reviewed-on: https://chromium-review.googlesource.com/451025
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Phil Wright <philwright@chromium.org>

[modify] https://crrev.com/0c2d37779791c4af88d1fe63dee953471621552f/appengine/cr-buildbucket/service.py

Comment 12 by no...@chromium.org, Mar 16 2017

what's the status of this bug?
There's still the suggestion to (1) Put master name in the user-agent string.

@nodir: You pasted a link apparently in reference to this, but I wasn't sure what I was supposed to look at on the resulting page.

In any case, adding the master name to the user-agent string is more a nice-to-have feature request, rather than an actual issue. I'm not too fussed if it's not worth following up.

Cheers,
Phil

Comment 14 by no...@chromium.org, Mar 31 2017

in #5? I am suggestion to set user-agent in InstrumenedHttp
Status: Fixed (was: Started)
actually it was in #9 - I misinterpreted the footnote [1] with the link - didn't realise it was in reference to the text above. All good.

So anyway, as I said, the "master name" in user agent string is more of a nice to have.

If it's still needed by anyone, they should open a new bug for it.

I'll close this bug now.

Cheers,
Phil


Sign in to add a comment