
Issue 916359

Starred by 7 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 19
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




CQ triggers too many builds on active CLs

Project Member Reported by tandrii@chromium.org, Dec 19

Issue description

In other words, CQ appears not to recognize builds it triggered earlier, and so keeps triggering new ones.
 
Labels: -Restrict-View-Google
The most recent CQ release was around 16:15, but it's not clear when puppet actually deployed it (to be investigated later), so for now it has been reverted.
The last buildbucket release was at 17:15; it has also been undone.

CQ triggering spiked around the time of the buildbucket release (see screenshot).
Attachment: Screenshot from 2018-12-18 17-57-52.png (15.5 KB)
The graph above used a window size of 1h; with a 2m window we can clearly see that the insane triggering is over (internal URL http://shortn/_LRmLaoWsp2).
Attachment: Screenshot from 2018-12-18 17-59-31.png (78.8 KB)
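To illustrate why the 1h graph still looks alarming after the burst has stopped, here is a minimal sketch of a trailing-window event count. The event data and the function name `count_in_window` are made up for illustration; this is not the monitoring system's actual query.

```python
from bisect import bisect_right


def count_in_window(event_times, now, window):
    """Count events whose timestamp falls in (now - window, now].

    event_times must be sorted ascending; all values are in seconds.
    """
    lo = bisect_right(event_times, now - window)
    hi = bisect_right(event_times, now)
    return hi - lo


# Hypothetical burst: one trigger per second for 10 minutes, then silence.
burst = list(range(0, 600))

now = 900  # 5 minutes after the burst ended
hourly = count_in_window(burst, now, window=3600)  # still sees the whole burst
two_min = count_in_window(burst, now, window=120)  # sees only the quiet period
```

With a long window the old burst keeps dominating the count, while a short window reflects the current rate.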
From the CQ log, it appears that around 17:30 the most frequent message looked like this:

2018-12-18 17:28:32.655 UTC-8
[pid:43064 tid:140478196455168 infra_internal.services.cq.buildbucket_util:515] Skipping bucket result 8926752504619894192 for issue 1379183 patchset 4: not required builder.
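As a rough sketch of how a "not required builder" skip could turn into repeated triggering: if existing builds are not matched against the required builders, every required builder looks uncovered on each pass. All names below (`builders_to_trigger`, the builder names, the status strings) are hypothetical, and this thread does not confirm that this is what CQ's code does or what went wrong.

```python
def builders_to_trigger(required_builders, existing_builds):
    """Return required builders that have no matching existing build.

    existing_builds is an iterable of (builder_name, status) tuples for
    one issue/patchset. Builds for builders that are not in
    required_builders are skipped, as in the log message above.
    """
    reusable = {"SCHEDULED", "STARTED", "COMPLETED"}
    covered = set()
    for name, status in existing_builds:
        if name not in required_builders:
            continue  # "not required builder": this build is ignored
        if status in reusable:
            covered.add(name)
    return [b for b in required_builders if b not in covered]


required = ["linux_rel", "mac_rel"]

# Builder names match: nothing new needs to be triggered.
matched = [("linux_rel", "SCHEDULED"), ("mac_rel", "STARTED")]
to_trigger_matched = builders_to_trigger(required, matched)

# Builder names reported in a different form: every build is skipped,
# so both builders would be triggered again on each pass.
mismatched = [("luci/linux_rel", "SCHEDULED"), ("luci/mac_rel", "STARTED")]
to_trigger_mismatched = builders_to_trigger(required, mismatched)
```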

In other news, jbudorick@ has sent a PSA.
Purging all scheduled builds with tag user_agent:cq from luci.chromium.try via buildbucket's delete_many_builds.
Issue 916358 has been merged into this issue.
I've checked more CQ logs:
1. last push was 2018-12-18 16:39:06.760 UTC-8
[pid:43064 tid:140482652575552 infra_internal.services.cq.cq:176] The Commit Queue is going to commit stuff.
2. the revert of that push hasn't reached prod yet, so the cause is 100% the buildbucket push.
Purging all scheduled builds from luci.chromium.try, regardless of tags.
Labels: -Pri-0 Pri-1
Ran the same thing on a bunch of other buckets with large numbers of pending builds
(e.g., for v8 https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.delete_many_builds?bucket=luci.v8.try&status=SCHEDULED&_h=8 )

Downgrading to Pri-1, but I'll continue monitoring the reduction in actual pending build counts.
I think the buildbucket backend is having a hard time chewing through the backlog of things to delete:

Expected Future, received <class 'google.appengine.api.apiproxy_stub_map.UserRPC'>: <google.appengine.api.apiproxy_stub_map.UserRPC object at 0x2a77fd1ef350> (/base/alloc/tmpfs/dynamic_runtimes/python27g/d22767677e9aa897/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py:1552)
Traceback (most recent call last):
  File "third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "appengine/ext/deferred/deferred.py", line 318, in post
    self.run_from_request()
  File "appengine/ext/deferred/deferred.py", line 313, in run_from_request
    run(self.request.body)
  File "appengine/ext/deferred/deferred.py", line 155, in run
    return func(*args, **kwds)
  File "service.py", line 684, in _task_delete_many_builds
    q.map(del_if_unchanged, keys_only=True)
  File "appengine/ext/ndb/utils.py", line 160, in positional_wrapper
    return wrapped(*args, **kwds)
  File "appengine/ext/ndb/query.py", line 1190, in map
    **q_options).get_result()
  File "appengine/ext/ndb/tasklets.py", line 383, in get_result
    self.check_success()
  File "appengine/ext/ndb/tasklets.py", line 624, in _finish
    result = [r.get_result() for r in self._results]
  File "appengine/ext/ndb/tasklets.py", line 383, in get_result
    self.check_success()
  File "appengine/ext/ndb/tasklets.py", line 427, in _help_tasklet_along
    value = gen.throw(exc.__class__, exc, tb)
  File "service.py", line 671, in del_if_unchanged
    if (yield txn(key)):  # pragma: no branch
  File "appengine/ext/ndb/tasklets.py", line 430, in _help_tasklet_along
    value = gen.send(val)
  File "appengine/ext/ndb/context.py", line 1029, in transaction
    result = yield result
  File "appengine/ext/ndb/tasklets.py", line 427, in _help_tasklet_along
    value = gen.throw(exc.__class__, exc, tb)
  File "service.py", line 666, in txn
    yield futs
  File "appengine/ext/ndb/tasklets.py", line 496, in _help_tasklet_along
    mfut.add_dependent(subfuture)
  File "appengine/ext/ndb/tasklets.py", line 648, in add_dependent
    raise TypeError('Expected Future, received %s: %r' % (type(fut), fut))
TypeError: Expected Future, received <class 'google.appengine.api.apiproxy_stub_map.UserRPC'>: <google.appengine.api.apiproxy_stub_map.UserRPC object at 0x2a77fd1ef350>
Project Member

Comment 11 by bugdroid1@chromium.org, Dec 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/01045b09928f9967a6e687ee6a128838a5211b6a

commit 01045b09928f9967a6e687ee6a128838a5211b6a
Author: Nodir Turakulov <nodir@google.com>
Date: Wed Dec 19 02:52:22 2018

[buildbucket] Make cancel_task_transactionally_async a tasklet

cancel_task_transactionally_async currently returns a UserRPC
but yield [multipleFutures] does not like that.
Make it return a future.

R=tandrii@chromium.org

Bug:  916359 
Change-Id: I37059bd5f089319173938d877459e570665f8cc9
Reviewed-on: https://chromium-review.googlesource.com/c/1383376
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Auto-Submit: Nodir Turakulov <nodir@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19661}
[modify] https://crrev.com/01045b09928f9967a6e687ee6a128838a5211b6a/appengine/cr-buildbucket/swarming/swarming.py
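The failure mode described in the commit message can be sketched without App Engine. The classes below (`Future`, `FakeRPC`) and the helpers `wait_all`, `cancel_raw` and `cancel_as_future` are simplified stand-ins invented for illustration, not ndb's or buildbucket's real code; they only mimic the type check visible in the traceback and the idea of returning a future instead of a raw RPC object.

```python
class Future:
    """Minimal stand-in for a scheduler future (not ndb's Future)."""

    def __init__(self):
        self._done = False
        self._result = None

    def set_result(self, value):
        self._done = True
        self._result = value

    def get_result(self):
        if not self._done:
            raise RuntimeError("result not ready")
        return self._result


class FakeRPC:
    """Stand-in for a raw RPC handle that is not a Future."""

    def __init__(self, value):
        self._value = value

    def get_result(self):
        return self._value


def wait_all(items):
    """Accept only Futures, like the add_dependent check in the traceback."""
    for item in items:
        if not isinstance(item, Future):
            raise TypeError(
                "Expected Future, received %s: %r" % (type(item), item))
    return [item.get_result() for item in items]


def cancel_raw(task_id):
    """Before the fix (hypothetical): returns a raw RPC object."""
    return FakeRPC("cancelled %s" % task_id)


def cancel_as_future(task_id):
    """After the fix (hypothetical): wraps the RPC result in a Future."""
    fut = Future()
    fut.set_result(cancel_raw(task_id).get_result())
    return fut


try:
    wait_all([cancel_raw("t1")])
    raw_error = None
except TypeError as e:
    raw_error = str(e)

results = wait_all([cancel_as_future("t1"), cancel_as_future("t2")])
```

In the actual change the function was made a tasklet, as the commit title says; the wrapper above only shows why callers that yield a list need every element to be a future.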

Chromium appears to be almost back to normal; only a limited backlog remains on the longest-running builders, which should clear within 1-2 hours.
Status: Fixed (was: Assigned)
Sent an updated PSA. Checked other projects; the situation is similar.
Postmortem TBD
Issue 916336 has been merged into this issue.
Issue 916375 has been merged into this issue.
Issue 916378 has been merged into this issue.
Issue 916383 has been merged into this issue.
Issue 916384 has been merged into this issue.
Issue 916385 has been merged into this issue.
postmortem: go/chops-pm-110
