Massive amount of pending jobs
Issue description:
https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng - 314 pending jobs, 226 of which are pdfium & catapult roller jobs
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng - 147 pending, 102 roller jobs
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng - 390 pending, 300 roller jobs
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng - 314 pending; the page isn't even loading
,
Oct 19 2016
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng loaded now, after a minute or two: 426 pending, 203 roller jobs.
,
Oct 19 2016
By the way, I couldn't find documentation, on Google or internally, on how to stop the autorollers.
,
Oct 19 2016
On https://autoroll.skia.org/ there's a stop button. Don't know if it stays stopped. I pressed it now to stop the skia auto-roller.
,
Oct 19 2016
Yeah, skia-deps-roller seems to have posted a lot of jobs.
,
Oct 19 2016
Also, specific examples of duplicate roll CLs:
https://codereview.chromium.org/2427333003/
https://codereview.chromium.org/2427343003/
https://codereview.chromium.org/2427353003/
All of them trying to "Roll src/third_party/skia/ e719577fe..3ac64b427 (9 commits)".
,
Oct 19 2016
The "outgoing reviews" list in https://codereview.chromium.org/user/skia-deps-roller is totally overloaded. Looks like there's a new roll every minute and they are all pending (it's always the same). Maybe we can just unclick the CQ manually everywhere and close those CLs? This will not remove the pending jobs though.
,
Oct 19 2016
Or just cancel all related buildbucket builds.
,
Oct 19 2016
Also pdfium-deps-roller:
https://codereview.chromium.org/2429293002/
https://codereview.chromium.org/2429303002/
https://codereview.chromium.org/2429313002/
https://codereview.chromium.org/2429323002/
https://codereview.chromium.org/2429333002/
https://codereview.chromium.org/2429333003/
https://codereview.chromium.org/2429343002/
All of them "Roll src/third_party/pdfium/ 7c29e27da..09bad1cf2 (3 commits)".
,
Oct 19 2016
skia-deps-roller has 100 open CLs at this point, and pdfium-deps-roller has 73.
,
Oct 19 2016
Also affected: https://codereview.chromium.org/user/catapult-deps-roller, 100 CLs at this moment.
,
Oct 19 2016
Do you know if catapult and pdfium have similar off switches?
,
Oct 19 2016
Stopped the pdfium and catapult rollers at:
https://catapult-roll.skia.org
https://pdfium-roll.skia.org
,
Oct 19 2016
It seems like the catapult deps roller started creating several dozen CLs again while in the Stopped state. We closed them all again.
,
Oct 19 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/188a1465dae6854e48940b3a912ac23a2177d788
commit 188a1465dae6854e48940b3a912ac23a2177d788
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Wed Oct 19 09:45:46 2016
,
Oct 19 2016
CC Annie for catapult auto-roller
,
Oct 19 2016
Eric (borenet@), can you add to cc whoever owns pdfium roller?
,
Oct 19 2016
ATTENTION ATTENTION: what my CL [1] makes CQ do right now is ignore *all* CLs created by *deps-roller@chromium.org. Therefore, this stays at P0 for now, until either:
* each autoroller that created too many CLs (skia, pdfium and catapult) is fixed s.t. it doesn't create CLs non-stop, or
* CQ implements throttling.
[1] https://chrome-internal.googlesource.com/infra/infra_internal.git/+/188a1465dae6854e48940b3a912ac23a2177d788
,
Oct 19 2016
Correction to the second bullet: CQ implements throttling *and* my CL above is reverted, s.t. the rollers' CLs are again processed by CQ. Filed CQ throttling issue 657328.
,
Oct 19 2016
CC dsinclair@ for pdfium roller.
,
Oct 19 2016
I'd also add something about documenting the rollers and how to stop them in an emergency. As pointed out by Andrii, docs might be stale or otherwise not scale, so I'm also for throttling in CQ.
,
Oct 19 2016
All of the *roll.skia.org rollers are owned by me. The roller parses the output of "git cl upload" to find the issue number of the CL it just uploaded. When the server URL changed in https://chromium.googlesource.com/chromium/tools/depot_tools/+/6ff1fc0e0163002596edbfbca2335325b043b823, the rollers stopped being able to parse the issue number from the output. I'm in the process of changing the rollers to instead use "git cl issue --json" which should be more robust, and I'll add throttling as well.
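For illustration, a minimal Go sketch of that approach (not the actual repo_manager.go change): it assumes "git cl issue --json <file>" writes a JSON object containing an "issue" field, and getIssueNumber is a made-up helper name.

// Hedged sketch: read the uploaded CL's issue number from the JSON file
// written by "git cl issue --json" instead of scraping "git cl upload" output.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// clIssue mirrors the assumed shape of the JSON written by "git cl issue --json".
type clIssue struct {
	Issue    int64  `json:"issue"`
	IssueURL string `json:"issue_url"`
}

// getIssueNumber runs "git cl issue --json" in the checkout and parses the result.
func getIssueNumber(checkoutDir string) (int64, error) {
	jsonPath := filepath.Join(os.TempDir(), "git_cl_issue.json")
	cmd := exec.Command("git", "cl", "issue", "--json", jsonPath)
	cmd.Dir = checkoutDir
	if out, err := cmd.CombinedOutput(); err != nil {
		return 0, fmt.Errorf("git cl issue failed: %s: %s", err, out)
	}
	b, err := os.ReadFile(jsonPath)
	if err != nil {
		return 0, err
	}
	var issue clIssue
	if err := json.Unmarshal(b, &issue); err != nil {
		return 0, err
	}
	return issue.Issue, nil
}

func main() {
	n, err := getIssueNumber(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("uploaded issue:", n)
}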
,
Oct 19 2016
The following revision refers to this bug: https://skia.googlesource.com/buildbot.git/+/ea1205a0c91f5e45eab9998d338bd8d67c18d3e2
commit ea1205a0c91f5e45eab9998d338bd8d67c18d3e2
Author: Eric Boren <borenet@google.com>
Date: Wed Oct 19 11:53:24 2016

Fix autoroll issue number parsing

Don't parse the issue number from "git cl" output. Instead, use the --json flag to "git cl issue" and parse the JSON file it produces.

BUG=657253
Change-Id: I6d090b36f7d695427d867421a3a13889eb559474
Reviewed-on: https://skia-review.googlesource.com/3626
Reviewed-by: Ravi Mistry <rmistry@google.com>
Commit-Queue: Eric Boren <borenet@google.com>

[modify] https://crrev.com/ea1205a0c91f5e45eab9998d338bd8d67c18d3e2/autoroll/go/repo_manager/repo_manager.go
,
Oct 19 2016
Some suggestions:
1) The V8 auto-roller also queries https://codereview.chromium.org/search?closed=3&owner=v8-autoroll%40chromium.org and does not upload a new CL if there's already an open one (the URL is not really Gerrit future-proof, though); a sketch of this check follows after this list.
2) Maybe reduce the roll frequency, e.g. from 1m to 5-10m, so that if something bad happens, it doesn't happen so often?
3) I wonder why the catapult roller produced CLs again even though the app's status was "Stopped" at that point. Maybe check whether "Stopped" really means the roller won't do anything.
4) Maybe add a note to the roll commit description about how to stop the auto-roller. It seems it wasn't clear to many people.
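A rough Go sketch of what check 1) could look like for the *roll.skia.org rollers. The "format=json" parameter and the "results" field are assumptions about the Rietveld search API (the V8 roller's exact implementation isn't shown in this bug), and hasOpenRoll is a made-up helper name.

// Hedged sketch: skip uploading a new roll if the roller already has an open CL.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// searchResults assumes the Rietveld search endpoint can return JSON with a
// "results" array; treat this shape as an assumption, not a verified API.
type searchResults struct {
	Results []struct {
		Issue   int64  `json:"issue"`
		Subject string `json:"subject"`
	} `json:"results"`
}

// hasOpenRoll reports whether the given owner has at least one open CL.
func hasOpenRoll(owner string) (bool, error) {
	q := url.Values{}
	q.Set("closed", "3") // "3" means open, per the V8 roller's query above.
	q.Set("owner", owner)
	q.Set("format", "json")
	resp, err := http.Get("https://codereview.chromium.org/search?" + q.Encode())
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var res searchResults
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return false, err
	}
	return len(res.Results) > 0, nil
}

func main() {
	open, err := hasOpenRoll("skia-deps-roller@chromium.org")
	if err != nil {
		fmt.Println("search failed:", err)
		return
	}
	if open {
		fmt.Println("open roll CL exists; skipping upload")
	}
}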
,
Oct 19 2016
Re. #3, can you provide some more information? I'm scanning through the logs (here http://104.154.112.121:10115/file_server/autoroll.catapult-autoroll.default.log.INFO.20161012-122003.1375). The POST request to stop the roller finished at I1019 08:28:11.504339, after which point I just see the expected "Roller is stopped; not opening new rolls." at each cycle.
,
Oct 19 2016
Actually, I think the CLs weren't new; the CQ just started processing them at that time. E.g. there are many like this: https://codereview.chromium.org/2432313002/ So this wasn't the autorollers' fault.
,
Oct 19 2016
Could somebody also kill these jobs? https://build.chromium.org/p/tryserver.blink/builders/linux_precise_blink_rel They were overlooked today when deleting the buildbucket builds. V8 and skia both use this bot optionally and it's flooded.
,
Oct 19 2016
I just manually canceled everything on that page attributed to skia-deps-roller. Didn't there used to be a "cancel all pending" button?
,
Oct 19 2016
Thanks! Looks good now.
,
Oct 19 2016
The following revision refers to this bug: https://skia.googlesource.com/buildbot.git/+/f9e5543fd2298505c40cd647f5a8ca510dd12c3c
commit f9e5543fd2298505c40cd647f5a8ca510dd12c3c
Author: Eric Boren <borenet@google.com>
Date: Wed Oct 19 12:44:44 2016

Add throttling to the AutoRoll bots

No more than 3 rolls uploaded every 10 minutes

BUG=657253
Change-Id: Idd31b9e045412e9a79e39a237777f0a1afb55337
Reviewed-on: https://skia-review.googlesource.com/3627
Reviewed-by: Ravi Mistry <rmistry@google.com>
Commit-Queue: Eric Boren <borenet@google.com>

[modify] https://crrev.com/f9e5543fd2298505c40cd647f5a8ca510dd12c3c/autoroll/go/autoroller/roller.go
[modify] https://crrev.com/f9e5543fd2298505c40cd647f5a8ca510dd12c3c/autoroll/go/autoroller/roller_test.go
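For reference, a minimal Go sketch of the throttling idea from the commit message ("No more than 3 rolls uploaded every 10 minutes"). This is not the actual roller.go code, just an illustrative sliding-window check with made-up names.

// Hedged sketch: allow at most maxRolls uploads within any rolling window.
package main

import (
	"fmt"
	"time"
)

// throttle remembers timestamps of recent uploads and refuses new ones when
// the window is full.
type throttle struct {
	window   time.Duration
	maxRolls int
	recent   []time.Time
}

func newThrottle(window time.Duration, maxRolls int) *throttle {
	return &throttle{window: window, maxRolls: maxRolls}
}

// tryRoll reports whether an upload is allowed right now and records it if so.
func (t *throttle) tryRoll(now time.Time) bool {
	// Drop timestamps that have fallen out of the window.
	kept := t.recent[:0]
	for _, ts := range t.recent {
		if now.Sub(ts) < t.window {
			kept = append(kept, ts)
		}
	}
	t.recent = kept
	if len(t.recent) >= t.maxRolls {
		return false
	}
	t.recent = append(t.recent, now)
	return true
}

func main() {
	th := newThrottle(10*time.Minute, 3)
	for i := 0; i < 5; i++ {
		fmt.Printf("roll %d allowed: %v\n", i+1, th.tryRoll(time.Now()))
	}
	// Prints "allowed: true" three times, then "false" until 10 minutes pass.
}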
,
Oct 19 2016
As for your other suggestions: 1. I guess I could add a separate goroutine which queries for CLs by the deps-roller and closes those it doesn't already know about. Would that help in a situation like this, where the CQ bit is already checked but we close the CL ~1 minute later? 2. It's nice to have the roller cycling frequently so that it can pick up changed statuses and re-roll quickly. I think the throttling I just added should help though. 4. Yes, I'll add that.
,
Oct 19 2016
Re 1): Maybe the CQ experts know more, but if the CQ has already picked up a CL and started scheduling builds, it won't cancel those builds if the CL is closed or the CQ bit is unchecked. Also, on swarming there's no cancel.
,
Oct 19 2016
Re #c32:
1. No, it wouldn't help, because CQ doesn't cancel tryjobs any more. It proved more trouble than it was worth, particularly with respect to experimental tryjobs. With unreliable Rietveld results (see http://crbug.com/656756), it would have backfired badly. In fact, if your roller closes the prior CL, then my CQ auto-throttling proposal (issue http://crbug.com/657328) won't work either, as machebach@ correctly noticed. That said, it's an issue with my proposal, not yours :) However, if you not only close the CL but also cancel all tryjobs associated with it, then it would help. I'm open to a new "git cl try-cancel -b <buildbucket_id1> -b <buildbucket_id2>" for this.
,
Oct 19 2016
Okay, given the above fixes, it sounds like it may not be worth building a find-rogue-CLs-and-cancel-trybots scheme. We can revisit that if we continue to run into problems. I've landed fixes and pushed updates to all of the autoroll servers. I'd like to re-enable them one at a time when you guys are ready.
,
Oct 19 2016
You can enable them immediately, but CQ will still ignore them for now. Then we can wait for ~30 minutes and see that rollers don't create too many CLs. Then, I'll revert my CQ patch that made it ignore rollers. SGTM?
,
Oct 19 2016
FTR, so far CQ noticed and ignored only 5 CLs:
[W2016-10-19T04:58:08.946217-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T04:58:33.476104-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T04:59:06.081503-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T04:59:34.453334-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2432413002
[W2016-10-19T06:45:39.890051-07:00 28803 140071638579008 pending_manager.rietveld:646] CQ dislikes autorollers creating too many CLs: 2433073004
,
Oct 19 2016
Actually, the first 4 lines refer to the same CL.
,
Oct 19 2016
Okay, re-enabled the rollers. So far no errors in the logs, and two have uploaded (single, non-duplicate) CLs.
,
Oct 19 2016
Landing revert, to be deployed in ~15 minutes: https://chromereviews.googleplex.com/526267013/
,
Oct 19 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/6d9753ceeed17c2de6ed33b97e69a96cb85ebca0
commit 6d9753ceeed17c2de6ed33b97e69a96cb85ebca0
Author: tandrii <tandrii@google.com>
Date: Wed Oct 19 14:15:02 2016
,
Oct 19 2016
Turns out I forgot to remove autodeploy.pid, so CQ wasn't being auto-deployed. Fixed that just in time for autodeploy to kick in in 4 minutes. Also, lowering priority to P1, as there is no longer an outage.
,
Oct 19 2016
Things appear to be working smoothly now. I think everything that had to be done on this bug has been finished. Making CQ smarter is handled in issue 657328.
,
Oct 19 2016
Issue 657303 has been merged into this issue.
,
Oct 19 2016
Issue 657306 has been merged into this issue.
,
Oct 19 2016
I'd like to revisit the root cause of this. What would happen if a similar problem appeared now in the auto-roller's logic that consumes `git cl issue --json`, e.g. somebody changing the JSON's content significantly? Wouldn't we end up with the same outage again unless we add other hurdles?
,
Oct 19 2016
That's what I meant by CQ being smarter: adding smartness in CQ s.t. it doesn't get overloaded. See issue 657328, in which I already sent a design doc.
,
Oct 20 2016
So the root cause is still a potential issue, but I think it's mitigated by the fact that the roller no longer looks for the codereview server in addition to the issue number, and by the fact that it's using the JSON output rather than stdout. I'm hoping that --json is fairly stable since it is presumably designed to be consumed by machines. If it did happen again, the rollers' throttling should reduce the load to 3 CLs every 10 minutes instead of 10 CLs every 10 minutes. I'm looking into why Skia infra's alerts weren't firing due to the apparent failure to upload CLs, because they should have.