
Issue 657328

Starred by 3 users

Issue metadata

Status: WontFix
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Feature




CQ should implement throttling s.t. one user doesn't overload CQ

Project Member Reported by tandrii@chromium.org, Oct 19 2016

Issue description

Automatic tooling, like the autorollers in issue 657253, can overload CQ by creating too many CLs and submitting them to CQ. While those tools should theoretically be fixed, it's much more scalable to play defensively in CQ.


I propose allowing at most 5 CLs at any given time per (user, project). We can keep that state in CQ RAM only, which will be lost on every CQ restart, but so long as restarts don't happen too often, this will be sufficient to mitigate the outage.
 
It might miss the following scenario: an auto-roller keeps adding a new CL, hits CQ (so that CQ schedules jobs), then closes the CL and creates the next one. The scheduled jobs will remain.
Yes :( That means we need both a limit on concurrent CLs and a limit on the number of tryjobs scheduled per unique owner. This way, a roller's load would be severely limited.
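
A minimal sketch (not actual CQ code) of what this in-RAM throttle could look like; the class name, method names, and the tryjob limit are hypothetical, with only the 5-concurrent-CL figure taken from the proposal above:

import collections

MAX_ACTIVE_CLS = 5        # concurrent CLs per (user, project), per the proposal
MAX_PENDING_TRYJOBS = 20  # scheduled-but-unfinished tryjobs per owner (made-up figure)

class Throttle(object):
  """Per-(user, project) limits kept only in CQ process memory.

  State is lost on every CQ restart, which is acceptable as long as
  restarts are infrequent.
  """

  def __init__(self):
    self._active_cls = collections.Counter()       # (user, project) -> count
    self._pending_tryjobs = collections.Counter()  # user -> count

  def can_start_cl(self, user, project):
    return self._active_cls[(user, project)] < MAX_ACTIVE_CLS

  def cl_started(self, user, project):
    self._active_cls[(user, project)] += 1

  def cl_finished(self, user, project):
    key = (user, project)
    self._active_cls[key] = max(0, self._active_cls[key] - 1)

  def can_schedule_tryjob(self, user):
    # Covers the auto-roller scenario: even if the CL is closed right away,
    # its scheduled tryjobs still count against the owner until they finish.
    return self._pending_tryjobs[user] < MAX_PENDING_TRYJOBS

  def tryjob_scheduled(self, user):
    self._pending_tryjobs[user] += 1

  def tryjob_finished(self, user):
    self._pending_tryjobs[user] = max(0, self._pending_tryjobs[user] - 1)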

Comment 3 by stip@google.com, Oct 19 2016

Buildbucket can expire jobs. Right now that's at a 24hr timeout (from what I understand). If we make it 30 minutes, then excess scheduled jobs will just expire instead of causing a massive bubble we have to drain.
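
As a tiny illustration of that idea (the constant and function here are hypothetical, not buildbucket's actual configuration), the backlog stays bounded because any request that has waited longer than the window is simply dropped rather than drained:

import datetime

EXPIRATION = datetime.timedelta(minutes=30)  # instead of the current 24 hours

def is_expired(scheduled_at, now=None):
  """True if a still-pending build request should be dropped, not run."""
  now = now or datetime.datetime.utcnow()
  return now - scheduled_at > EXPIRATION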

Comment 4 by stip@google.com, Oct 19 2016

We had talked about this internally in July, with a buildbucket endpoint being created for CQ to sense this.

From my email:

"Aditionally, 626042 may have been at fault in that the failures would have exacerbated capacity issues from CQ retries. I would be interested in an analysis of CQ retries on overall capacity, as this is not the first time we have seen this. I know there is a global retry token-bucket algorithm used for rate limiting, but I wonder if we can be smarter here. Inspired by realtime throughput estimation algorithms such as TCP Vegas, I'd like to propose the following CQ retry model:
- CQ globally tracks the time between build request and build start ('build pending time') for each builder.
- If pending build times go over a human-meaningful number (30 minutes), automatic retries stop and only first-run patches go through.
- If pending build times go over a larger human-meaningful number (60 minutes), all tryjob triggers stop except for a small trickle (one every 10 minutes).
The idea is that CQ is constantly (and crudely) estimating the capacity of the system by reading pending time of the builds it is sending. As it hits capacity it performs crude QoS by only sending first-run patches (which haven't failed yet, and are likely to be better than someone's bad patch repeatedly failing). Finally, if there is an emergency outage, CQ will keep sending a few jobs to probe for when capacity arises.

The last part is critical, and a lesson the CQ team has learned before by sending trial jobs during slow periods to keep monitoring systems alive. To use Vegas's wording: "[I]f a connection is sending too much extra data, it will cause congestion. Less obviously, if a connection is sending too little extra data, it cannot respond rapidly enough to transient increases in the available network bandwidth.""

Note that I think we can be far simpler here: if a pending queue is more than 30 minutes, just don't send any jobs to it. The 'slow trickle' QoS idea is nice, but it requires a lot more thought. Just to get *something* working, let's do "if >30, shut off."
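
A rough sketch of this gating logic, covering both the simple ">30 minutes means shut off" rule and the optional slow-trickle tier from the email; the thresholds, function names, and the get_median_pending_time() helper are all hypothetical:

import datetime

RETRY_CUTOFF = datetime.timedelta(minutes=30)    # stop retries above this
TRICKLE_CUTOFF = datetime.timedelta(minutes=60)  # stop almost everything above this
TRICKLE_INTERVAL = datetime.timedelta(minutes=10)

def should_trigger(builder, is_retry, last_trickle_time, now,
                   get_median_pending_time):
  """Decides whether CQ should send a tryjob to `builder` right now."""
  # 'Build pending time' = time between build request and build start.
  pending = get_median_pending_time(builder)
  if pending <= RETRY_CUTOFF:
    return True                # healthy: send everything
  if pending <= TRICKLE_CUTOFF:
    return not is_retry        # strained: first-run patches only, no retries
  # Overloaded: keep a slow trickle so CQ can probe for returning capacity.
  return now - last_trickle_time >= TRICKLE_INTERVAL

The simpler variant from the previous paragraph collapses this to a single check: if the pending time exceeds 30 minutes, send nothing at all.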

Comment 5 by stip@google.com, Oct 19 2016

FYI, nodir@ implemented sensing of bucket pending times in https://codereview.chromium.org/2170673002
Owner: tandrii@chromium.org
IIUC, Mike's solution will penalize all users equally by only landing first-run patches (I assume this means no retries) or will prevent everyone from landing anything if the system is clogged with attempts. IMHO, this is a great idea on its own, but it essentially still allows one user to make everybody else's experience worse, which is what I believe this bug is about.

Regarding the endpoint, it would be great if we could also query builds by the user who scheduled them. Not sure if CQ sends this info along, but such an endpoint would be able to implement Mike's suggestion while also providing a solution for the issue described in this bug.
Cc: tandrii@chromium.org
Owner: ----
Labels: cit-cq
Ben, why did you add this new label? Shouldn't we use the Infra>CQ component, which is already set on this issue?
Cc: benhenry@chromium.org
+Ben for real now.
Labels: -cit-cq cit-pm
wrong label.
Cc: dpranke@chromium.org
Components: Infra>Platform>Buildbucket
Can we do this throttling in buildbucket instead, e.g. have buildbucket refuse to accept a flood of items into buckets? It might be easier to implement in buildbucket than in CQ.
Cc: d...@chromium.org no...@chromium.org
Cc: vadimsh@chromium.org estaab@chromium.org
Good question, Erik. There are both advantages and disadvantages to doing it in either system.

CQ:
 + actually knows who *triggered* the build, and can throttle per-user
 - but strong plans to make CQ process each CL separately (and not all in one process) negate the above, because each per-CL worker will be unaware of what and how many tryjobs are triggered by other CL workers.

Buildbucket:
 - can (easily) throttle only the CQ account, because it isn't aware of who triggered CQ, effectively penalizing all users of a project when one user is abusing CQ
 + perhaps we can let CQ somehow communicate who really is to "blame".
   Adding Vadim, as this is very similar to delegation in auth, which he has thought a lot about.
> perhaps we can let CQ somehow communicate who really is to "blame".

This is relatively easy to do technically (just one additional RPC to get a delegation token and then send it in a header when calling buildbucket). 

I'm worried though about various code checking "requestor == commit-bot@chromium.org" directly to detect CQ, e.g. https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/chromium_swarming/api.py?q=commit-bot@chromium.org&sq=package:chromium&l=56&dr=C

It has to be changed.
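
A hedged sketch of the "one extra RPC plus a header" flow described above; mint_delegation_token() is a stub and the header name is a placeholder, since the real LUCI auth delegation API may differ:

import json
import urllib2

def mint_delegation_token(delegatee, on_behalf_of):
  """Placeholder for the auth-service RPC that mints a delegation token."""
  raise NotImplementedError('call the auth service here')

def schedule_build_on_behalf_of(buildbucket_url, build_request, end_user):
  # 1. One extra RPC: ask the auth service for a token letting the CQ
  #    service account act on behalf of the end user.
  token = mint_delegation_token(delegatee='commit-bot@chromium.org',
                                on_behalf_of=end_user)
  # 2. Send the token in a header on the buildbucket call, so buildbucket
  #    can attribute (and throttle/blame) the build to the real user rather
  #    than to the CQ account.
  req = urllib2.Request(
      buildbucket_url,
      data=json.dumps(build_request),
      headers={
          'Content-Type': 'application/json',
          'X-Delegation-Token-V1': token,  # placeholder header name
      })
  return json.loads(urllib2.urlopen(req).read())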
not all users triggering CQ have rights to trigger tryjobs on their own. I'd rather keep the "requestor" as CQ, but do accounting in buildbucket internally, potentially by something else.
> not all users triggering CQ have rights to trigger tryjobs on their own.

You mean CQ does additional authorization checks (like checking for an LGTM)? I remember thinking about this problem... I believe for cases when the commit author has no permission to run tryjobs, we should use the identity of whoever LGTM-approved it. E.g. if A authors a CL, and B lgtm-s, the buildbucket job will be "Belongs to B, on behalf of A" (and B will be used for authorizations and quotas, and blamed if the tryjob hacked bots :)).

But I'm not sure it's the best way. Using CQ's own account for authorization is definitely simpler, but then an exploited (or just tricked) CQ can run arbitrary builds.
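
A small sketch of the identity choice described above ("Belongs to B, on behalf of A"); has_tryjob_rights() is a hypothetical permission check:

def pick_billed_identity(author, approver, has_tryjob_rights):
  """Returns (billed_user, on_behalf_of) for authorization, quota and blame."""
  if has_tryjob_rights(author):
    return author, None
  # The author cannot trigger tryjobs themselves, so the LGTM approver's
  # identity is used for authorization, quotas and blame, on behalf of the
  # author.
  return approver, author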
I've never seen much evidence that we have people abusing the system. I have seen some cases of people *intentionally* and for good reasons chewing up a lot of resources. I've also seen a few cases where people have done it accidentally (they didn't realize resources were so limited), or where there were bugs. But those latter three cases are relatively rare.

So, IMO we don't really need to rate-limit individual users, and doing so fairly and correctly is actually a hard problem.

However, we absolutely need to be able to rate-limit try jobs and the CQ as a whole, because it's really easy to overload and kill buildbot masters. And in this situation we really need to rate-limit *every* user, not just one big one. So I at least really think buildbucket is probably the right place and doing something simple is probably sufficient.
I now realize that this bug was originally filed specifically because we had accidentally abusive users, so sorry for the derailing :). That said, I think the right thing to do in that case is to fix the auto-rollers instead.
Labels: Infra-Failures
Owner: dpranke@chromium.org
Status: Assigned (was: Available)
I think we should assign this bug to somebody. It's P1 and a post-mortem action item. dpranke, it looks like you might be most recently up-to-speed on this issue. What needs to be done and who should do it? 

Cc: katthomas@chromium.org
Status: WontFix (was: Assigned)
I've filed bug 682541 to request a way for us to manually pause buildbucket, as discussed in comments #15-18.

As I said in comment #21, I don't think we need to (or should) rate-limit individual users (even the CQ) at this time, so I'm going to WontFix this bug. There is a separate bug -- bug 656862 -- for the generic concept of adding fine-grained quotas to either or both of buildbucket and swarming, but I think that's a separate item that needs to be prioritized and scheduled differently.
Labels: Hotlist-Infra-Failures
