Long delay before builds are leased from buildbucket after master restart |
|||
Issue description(outalation that inspired this: http://o/365520) It can take a very long time to lease builds after a master gets restarted, as can be seen in this Buildbot Overview dashboard ... http://shortn/_RgTrw4488G Note also these master.twistd logs for a specific build id - https://pantheon.corp.google.com/logs/viewer?project=chrome-infra-logs&minLogLevel=0&expandAll=false&resource=global&advancedFilter=resource.type%3D%22global%22%0AlogName%3D%22projects%2Fchrome-infra-logs%2Flogs%2Fmaster_twistd.log%22%0Alabels.%22cloudtail%2Fresource_id%22%3D%22master.tryserver.chromium.linux%22%0A%22%5Bbuildbucket%5D%22%0A%228986188187968384384%22×tamp=2017-03-03T01:35:38.000000000Z In the outalation, the master restarted at around 16:15, and bots were attaching again by 16:21. But there appears to be a massive lag in the time it takes to start leasing builds from buildbucket. Suspected culprit is race condition at https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/master/buildbucket/integration.py#164 It's possible for several separate execution flows to all end up calling _ensure_leases_loaded and getting past the initial if condition, because self._leases isn't set to something until the first flow has returned. This is borne out by the following logs - https://pantheon.corp.google.com/logs/viewer?project=chrome-infra-logs&minLogLevel=0&expandAll=false&resource=global&advancedFilter=resource.type%3D%22global%22%0AlogName%3D%22projects%2Fchrome-infra-logs%2Flogs%2Fmaster_twistd.log%22%0Alabels.%22cloudtail%2Fresource_id%22%3D%22master.tryserver.chromium.linux%22%0A%22%5Bbuildbucket%5D%22%0A%22Loaded%20current%20leases%22×tamp=2017-03-03T00:40:10.000000000Z The theory we came up with is that each execution flow in _ensure_leases_loaded then proceeds to execute self._send_heartbeats(), which generates a massive backlog of requests in the buildbucket client, which has a limit of 10 simultaneous requests, so it takes a long time to flush through all the requests. This in turn prevents any other API requests to occur - ie requests to find any new builds. --------------------- Whilst debugging this issue, we identified a few misc issues and areas where we think logging and monitoring could be improved ... 1) Put master name in the user-agent string for buildbucket client, to make searching through buildbucket's logs easier. 2) There seems to be a huge number of api requests coming in for buildbucket search. (like, ~100 qps). Is this expected? See https://pantheon.corp.google.com/logs/viewer?project=cr-buildbucket&minLogLevel=0&expandAll=false&resource=gae_app&logName=projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log×tamp=2017-03-07T05:56:32.802036000Z 3) Make buildbucket include the build ids alongside the heartbeat APIs log mesages (currently POST requests, so build id isn't recorded in the log message - see https://pantheon.corp.google.com/logs/viewer?project=cr-buildbucket&minLogLevel=0&expandAll=false&resource=gae_app&logName=projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log&advancedFilter=resource.type%3D%22gae_app%22%0AlogName%3D%22projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log%22%0AprotoPayload.method!%3D%22GET%22%0A%22heartbeat%22×tamp=2017-03-07T05:56:18.849578000Z)
,
Mar 8 2017
thank you for the productive investigation
,
Mar 8 2017
(2) is expected - Rietveld and Gerrit fetch builds to display them alongside a CL all the time
,
Mar 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/164a47e0aa7cb406d19b77fefdb6808c41d942ad commit 164a47e0aa7cb406d19b77fefdb6808c41d942ad Author: Nodir Turakulov <nodir@chromium.org> Date: Wed Mar 08 00:59:02 2017 buildbucket: fix race in _ensure_leases_loaded R=philwright@chromium.org, dsansome@chromium.org BUG= 698996 Change-Id: Idb7a11465e591418e85501bed809cbbaaa2a62c2 Reviewed-on: https://chromium-review.googlesource.com/451547 Reviewed-by: Dave Sansome <dsansome@chromium.org> Commit-Queue: Nodir Turakulov <nodir@chromium.org> [modify] https://crrev.com/164a47e0aa7cb406d19b77fefdb6808c41d942ad/scripts/master/buildbucket/integration.py
,
Mar 8 2017
> (3) That would cause a lot of logs, in particular N log lines every 5 min, where N is the number of pending and running builds on the master. Is this OK?
,
Mar 8 2017
> (1) There is http_client_name used in InstrumentedHttp https://cs.chromium.org/chromium/build/third_party/infra_libs/infra_libs/httplib2_utils.py?type=cs&q=InstrumentedHttp&l=301 The name "http_client_name" is rather generic doesn't necessarily imply that it has to be about monitoring. Same applies InstrumentedHttp. Can we derive user agent from InstrumentedHttp's name and set the user agent in InstrumentedHttp?
,
Mar 8 2017
(3) Ah I see - each of these heartbeat calls actually contains multiple Build IDs, right? Would it be possible to add a single log entry that lists all the builds (and the bucket/master) in each heartbeat request? I'm not super familiar with appengine, but it would be great if the log showed up as a "sub-log" for the same api request id, similar to how "Heartbeat for build" log message shows up https://chromium.googlesource.com/infra/infra/+/master/appengine/cr-buildbucket/service.py#791 (1) I'll check with Dave about this ...
,
Mar 8 2017
(3) Note ... I'm guessing it's not actually trivial to find out the bucket/master for a particular build id, right? So logging the bucket/master may not be feasible here?
,
Mar 8 2017
(3) yes, heartbeat request contains multiple build ids https://chromium.googlesource.com/infra/infra/+/master/appengine/cr-buildbucket/service.py#789 loads the build, so I can load the bucket from there. But that would result in a log line (under request log) for each build. I don't know, seems like a lot. The reason I didn't need it is if build does not expire [1], then I know that heartbeats were received [1]: https://pantheon.corp.google.com/logs/viewer?project=cr-buildbucket&minLogLevel=200&expandAll=false&resource=gae_app%2Fmodule_id%2Fbackend&versionId=3300-a706456&logName=projects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log&filters=path:reset_expired_builds
,
Mar 8 2017
(3) I've created https://chromium-review.googlesource.com/c/451025/ to add logging per batch heartbeat. Thoughts? (1) I'm not sure what I'm meant to be looking at ...
,
Mar 9 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/0c2d37779791c4af88d1fe63dee953471621552f commit 0c2d37779791c4af88d1fe63dee953471621552f Author: philwright <philwright@google.com> Date: Thu Mar 09 00:00:26 2017 adding logging to batch heartbeat BUG= 698996 Change-Id: I5479200a9565d54764a706784b423cbb25b209ed Reviewed-on: https://chromium-review.googlesource.com/451025 Reviewed-by: Nodir Turakulov <nodir@chromium.org> Commit-Queue: Phil Wright <philwright@chromium.org> [modify] https://crrev.com/0c2d37779791c4af88d1fe63dee953471621552f/appengine/cr-buildbucket/service.py
,
Mar 16 2017
what's the status of this bug?
,
Mar 30 2017
There's still the suggestion to (1) Put master name in the user-agent string. @nodir: You pasted a link apparently in reference to this, but I wasn't sure what I was supposed to look at on the resulting page. In any case, adding the master name to the user-agent string is more a nice-to-have feature request, rather than an actual issue. I'm not too fussed if it's not worth following up. Cheers, Phil
,
Mar 31 2017
in #5? I am suggestion to set user-agent in InstrumenedHttp
,
Apr 12 2017
actually it was in #9 - I misinterpreted the footnote [1] with the link - didn't realise it was in reference to the text above. All good. So anyway, as I said, the "master name" in user agent string is more of a nice to have. If it's still needed by anyone, they should open a new bug for it. I'll close this bug now. Cheers, Phil |
|||
►
Sign in to add a comment |
|||
Comment 1 by dsansome@chromium.org
, Mar 7 2017