
Issue 657253

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression




Massive amount of pending jobs

Project Member Reported by jam@chromium.org, Oct 19 2016

Issue description

Comment 1 by jam@chromium.org, Oct 19 2016

Labels: -Restrict-View-Google

Comment 3 by jam@chromium.org, Oct 19 2016

BTW, I couldn't find documentation on Google or internally on how to stop the autorollers.
Cc: borenet@chromium.org
On https://autoroll.skia.org/ there's a stop button. Don't know if it stays stopped. I pressed it now to stop the skia auto-roller.
Yeah, skia-deps-roller seems to have posted a lot of jobs.
Attachment: e1ZsT5mQony.png (210 KB)
Also specific examples of duplicate roll CLs:

https://codereview.chromium.org/2427333003/
https://codereview.chromium.org/2427343003/
https://codereview.chromium.org/2427353003/

All of them trying to "Roll src/third_party/skia/ e719577fe..3ac64b427 (9 commits)"
Cc: rmis...@chromium.org
The "outgoing reviews" list in https://codereview.chromium.org/user/skia-deps-roller is totally overloaded. Looks like there's a new roll every minute and they are all pending (it's always the same).

Maybe we can just uncheck the CQ bit manually everywhere and close those CLs? This won't remove the already-pending jobs, though.
Or just cancel all the related buildbucket builds.
skia-deps-roller has 100 open CLs at this point, and pdfium-deps-roller has 73.
Also affected, https://codereview.chromium.org/user/catapult-deps-roller 100 CLs at this moment.
Do you know if catapult and pdfium have similar off switches?
Stopped the pdfium and catapult rollers at:
https://catapult-roll.skia.org
https://pdfium-roll.skia.org
Owner: tandrii@chromium.org
Status: Started (was: Untriaged)
Seems like the catapult deps roller started creating several dozen CLs again while being in the Stopped state. We closed them all again.
Project Member

Comment 16 by bugdroid1@chromium.org, Oct 19 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/188a1465dae6854e48940b3a912ac23a2177d788

commit 188a1465dae6854e48940b3a912ac23a2177d788
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Wed Oct 19 09:45:46 2016

Cc: sullivan@chromium.org
CC Annie for catapult auto-roller
Eric (borenet@), can you CC whoever owns the pdfium roller?
ATTENTION ATTENTION: what my CL [1] does for CQ right now is ignore *all* CLs created by *deps-roller@chromium.org.

Therefore, this stays at P0 for now, until either:
* each autoroller which had created too many CLs (skia, pdfium and catapult) is fixed s.t. it doesn't create CLs non-stop, or
* CQ implements throttling.

[1] https://chrome-internal.googlesource.com/infra/infra_internal.git/+/188a1465dae6854e48940b3a912ac23a2177d788
Correction to the second point: CQ implements throttling && my CL above is reverted, s.t. rollers' CLs are again processed by CQ.

Filed CQ throttling  issue 657328 
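
For illustration only, the stop-gap described in this comment amounts to an owner-based filter along these lines (hypothetical Go helper; the real change lives in the internal CQ pending_manager and isn't shown here). The "*deps-roller@chromium.org" pattern is taken from the comment above:

package cqfilter

import "strings"

// ignoreOwner reports whether CQ should skip CLs from this owner while
// the autorollers are misbehaving (matches *deps-roller@chromium.org).
func ignoreOwner(owner string) bool {
	return strings.HasSuffix(owner, "deps-roller@chromium.org")
}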

Comment 21 by rmis...@google.com, Oct 19 2016

Cc: dsinclair@chromium.org
CC dsinclair@ for pdfium roller.
I'd also add something about documenting the rollers and how to stop them in an emergency. As pointed out by Andrii, docs might be stale or otherwise not scale, so I'm also for throttling in CQ.

Comment 23 by bore...@google.com, Oct 19 2016

All of the *roll.skia.org rollers are owned by me. The roller parses the output of "git cl upload" to find the issue number of the CL it just uploaded. When the server URL changed in https://chromium.googlesource.com/chromium/tools/depot_tools/+/6ff1fc0e0163002596edbfbca2335325b043b823, the rollers stopped being able to parse the issue number from the output. I'm in the process of changing the rollers to instead use "git cl issue --json" which should be more robust, and I'll add throttling as well.
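
For reference, a minimal Go sketch of the "git cl issue --json" approach described above. This is not the actual repo_manager.go change; the package, function name, and the JSON field names ("issue"/"issue_url") are assumptions about what "git cl issue --json <file>" writes:

package autoroll

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
	"os/exec"
	"path/filepath"
)

// clIssue mirrors the assumed shape of the JSON file written by
// "git cl issue --json <file>".
type clIssue struct {
	Issue    int64  `json:"issue"`
	IssueURL string `json:"issue_url"`
}

// getIssueNumber runs "git cl issue --json" in the checkout and parses
// the issue number from the JSON file it produces, instead of scraping
// the human-readable stdout of "git cl upload".
func getIssueNumber(checkoutDir string) (int64, error) {
	tmp, err := ioutil.TempDir("", "git_cl_issue")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(tmp)
	jsonFile := filepath.Join(tmp, "issue.json")
	cmd := exec.Command("git", "cl", "issue", "--json", jsonFile)
	cmd.Dir = checkoutDir
	if out, err := cmd.CombinedOutput(); err != nil {
		return 0, fmt.Errorf("git cl issue failed: %v\n%s", err, out)
	}
	b, err := ioutil.ReadFile(jsonFile)
	if err != nil {
		return 0, err
	}
	var info clIssue
	if err := json.Unmarshal(b, &info); err != nil {
		return 0, err
	}
	return info.Issue, nil
}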
Project Member

Comment 24 by bugdroid1@chromium.org, Oct 19 2016

The following revision refers to this bug:
  https://skia.googlesource.com/buildbot.git/+/ea1205a0c91f5e45eab9998d338bd8d67c18d3e2

commit ea1205a0c91f5e45eab9998d338bd8d67c18d3e2
Author: Eric Boren <borenet@google.com>
Date: Wed Oct 19 11:53:24 2016

Fix autoroll issue number parsing

Don't parse the issue number from "git cl" output. Instead, use the
--json flag to "git cl issue" and parse the JSON file it produces.

BUG= 657253 

Change-Id: I6d090b36f7d695427d867421a3a13889eb559474
Reviewed-on: https://skia-review.googlesource.com/3626
Reviewed-by: Ravi Mistry <rmistry@google.com>
Commit-Queue: Eric Boren <borenet@google.com>

[modify] https://crrev.com/ea1205a0c91f5e45eab9998d338bd8d67c18d3e2/autoroll/go/repo_manager/repo_manager.go

Some suggestions:
1) The V8 auto-roller also queries https://codereview.chromium.org/search?closed=3&owner=v8-autoroll%40chromium.org and does not upload a new CL if there's already an open CL (the URL is not really Gerrit-future-proof, though). A rough sketch of this check follows this list.
2) Maybe reduce the roll frequency, e.g. from 1m to 5-10m, so that if something bad happens, it doesn't happen as often.
3) I wonder why the catapult roller produced CLs again even though the app's status was "Stopped" at that point. Maybe check whether "Stopped" really means that the roller won't do anything.
4) Maybe add a note to the roll commit description about how to stop the auto-roller. It seems it wasn't clear to many people.
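
As a companion to suggestion 1, a rough Go sketch of an "is there already an open roll CL?" check against Rietveld. The format=json parameter and the "results" field name are assumptions about the Rietveld search API, and the function/type names are made up for illustration; closed=3 restricting to open CLs is inferred from the V8 roller's query above:

package autoroll

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// searchResults mirrors the assumed shape of a Rietveld search response.
type searchResults struct {
	Results []struct {
		Issue int64 `json:"issue"`
	} `json:"results"`
}

// hasOpenRoll reports whether ownerEmail already has an open CL on the
// given Rietveld instance, in which case the roller should skip uploading.
func hasOpenRoll(rietveldURL, ownerEmail string) (bool, error) {
	q := url.Values{}
	q.Set("closed", "3")
	q.Set("owner", ownerEmail)
	q.Set("format", "json")
	resp, err := http.Get(fmt.Sprintf("%s/search?%s", rietveldURL, q.Encode()))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var res searchResults
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return false, err
	}
	return len(res.Results) > 0, nil
}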

Comment 26 by bore...@google.com, Oct 19 2016

Re. #3, can you provide some more information? I'm scanning through the logs (here http://104.154.112.121:10115/file_server/autoroll.catapult-autoroll.default.log.INFO.20161012-122003.1375). The POST request to stop the roller finished at I1019 08:28:11.504339, after which point I just see the expected "Roller is stopped; not opening new rolls." at each cycle.
Actually, I think the CLs weren't new; CQ just started processing them at that time. E.g. there are many like this: https://codereview.chromium.org/2432313002/

So this wasn't an auto-roller problem.
Could somebody also kill these jobs? https://build.chromium.org/p/tryserver.blink/builders/linux_precise_blink_rel

They were overlooked today when deleting the buildbucket builds. V8 and skia both use this bot optionally and it's flooded.

Comment 29 by bore...@google.com, Oct 19 2016

I just manually canceled everything on that page attributed to skia-deps-roller. Didn't there used to be a "cancel all pending" button?
Thanks! Looks good now.
Project Member

Comment 31 by bugdroid1@chromium.org, Oct 19 2016

The following revision refers to this bug:
  https://skia.googlesource.com/buildbot.git/+/f9e5543fd2298505c40cd647f5a8ca510dd12c3c

commit f9e5543fd2298505c40cd647f5a8ca510dd12c3c
Author: Eric Boren <borenet@google.com>
Date: Wed Oct 19 12:44:44 2016

Add throttling to the AutoRoll bots

No more than 3 rolls uploaded every 10 minutes

BUG= 657253 

Change-Id: Idd31b9e045412e9a79e39a237777f0a1afb55337
Reviewed-on: https://skia-review.googlesource.com/3627
Reviewed-by: Ravi Mistry <rmistry@google.com>
Commit-Queue: Eric Boren <borenet@google.com>

[modify] https://crrev.com/f9e5543fd2298505c40cd647f5a8ca510dd12c3c/autoroll/go/autoroller/roller.go
[modify] https://crrev.com/f9e5543fd2298505c40cd647f5a8ca510dd12c3c/autoroll/go/autoroller/roller_test.go
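
For reference, a minimal sketch of this style of throttle (hypothetical type and names, not the actual roller.go change): allow at most 3 roll uploads within any 10-minute window.

package autoroll

import (
	"sync"
	"time"
)

const (
	maxRolls   = 3
	rollWindow = 10 * time.Minute
)

// uploadThrottle remembers recent upload times and refuses new uploads
// once maxRolls uploads have happened within rollWindow.
type uploadThrottle struct {
	mtx     sync.Mutex
	uploads []time.Time
}

// allowUpload returns true and records the upload if fewer than
// maxRolls uploads happened within the last rollWindow.
func (t *uploadThrottle) allowUpload(now time.Time) bool {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	// Drop uploads that have fallen out of the window (in-place filter).
	recent := t.uploads[:0]
	for _, ts := range t.uploads {
		if now.Sub(ts) < rollWindow {
			recent = append(recent, ts)
		}
	}
	t.uploads = recent
	if len(t.uploads) >= maxRolls {
		return false
	}
	t.uploads = append(t.uploads, now)
	return true
}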

Comment 32 by bore...@google.com, Oct 19 2016

As for your other suggestions:

1. I guess I could add a separate goroutine which queries for CLs by the deps-roller and closes those it doesn't already know about. Would that help in a situation like this, where the CQ bit is already checked but we close the CL ~1 minute later?

2. It's nice to have the roller cycling frequently so that it can pick up changed statuses and re-roll quickly. I think the throttling I just added should help though.

4. Yes, I'll add that.
Re 1) Maybe CQ experts know more, but if CQ has already picked up a CL and started scheduling builds, it won't cancel those builds if the CL is closed or the CQ bit is unchecked. Also, there's no cancel on Swarming.
Re #c32:

1. No, it wouldn't help, because CQ doesn't cancel tryjobs any more.

It proved more trouble than it was worth, particularly w.r.t. experimental tryjobs. With unreliable Rietveld results (see http://crbug.com/656756), it would have backfired badly.

In fact, if your roller closes the prior CL, then my CQ auto-throttling proposal (issue http://crbug.com/657328) won't work either, as machebach@ correctly noticed. That said, it's an issue with my proposal, not yours :)

However, if you not only close the CL but also cancel all the tryjobs associated with it, then that would help. I'm open to adding a new "git cl try-cancel -b <buildbucket_id1> -b <buildbucket_id2>" for this.

Comment 35 by bore...@google.com, Oct 19 2016

Okay, given the above fixes, it sounds like it may not be worth building a find-rogue-CLs-and-cancel-trybots scheme.  We can revisit that if we continue to run into problems.

I've landed fixes and pushed updates to all of the autoroll servers. I'd like to re-enable them one at a time when you guys are ready.
You can enable them immediately, but CQ will still ignore them for now. Then we can wait for ~30 minutes and see that rollers don't create too many CLs. Then, I'll revert my CQ patch that made it ignore rollers. SGTM?
FTR, so far CQ noticed and ignored only 5 CLs:

[W2016-10-19T04:58:08.946217-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T04:58:33.476104-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T04:59:06.081503-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T04:59:34.453334-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T06:45:39.890051-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2433073004

Actually, the first 4 lines point to the same CL.

Comment 39 by bore...@google.com, Oct 19 2016

Okay, re-enabled the rollers. So far no errors in the logs, and two have uploaded (single, non-duplicate) CLs.
Landing revert, to be deployed in ~15 minutes: https://chromereviews.googleplex.com/526267013/
Project Member

Comment 41 by bugdroid1@chromium.org, Oct 19 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/6d9753ceeed17c2de6ed33b97e69a96cb85ebca0

commit 6d9753ceeed17c2de6ed33b97e69a96cb85ebca0
Author: tandrii <tandrii@google.com>
Date: Wed Oct 19 14:15:02 2016

Labels: -Pri-0 Pri-1
Turns out I forgot to remove autodeploy.pid, so CQ wasn't being auto-deployed. Fixed that just in time for autodeploy to kick in in 4 minutes.

Also, lowering priority to P1, as there is no longer an outage.
Labels: -Pri-1 Pri-2
Things appear to be working smoothly now. I think everything that had to be done on this bug has been finished. Making CQ smarter is handled in issue 657328.

Comment 44 by d...@chromium.org, Oct 19 2016

Issue 657303 has been merged into this issue.
Issue 657306 has been merged into this issue.
I'd like to revisit the root cause of this. What would happen if a similar problem appeared in the auto-roller's logic that now consumes `git cl issue --json`, e.g. somebody changing the JSON's content significantly? Wouldn't we end up with the same outage again unless we add other hurdles?
That's what I meant by making CQ smarter - adding smartness in CQ s.t. it doesn't get overloaded. See issue 657328, in which I already sent a design doc.

Comment 48 by bore...@google.com, Oct 20 2016

So the root cause is still a potential issue, but I think it's mitigated by the fact that the roller no longer looks for the codereview server in addition to the issue number, and by the fact that it's using the JSON output rather than stdout. I'm hoping that --json is fairly stable since it is presumably designed to be consumed by machines.

If it did happen again, the rollers' throttling should reduce the load to 3 CLs every 10 minutes instead of 10 CLs every 10 minutes. I'm also looking into why Skia infra's alerts didn't fire on the apparent failure to upload CLs, because they should have.
Status: Fixed (was: Started)
Labels: cit-cq
