Issue metadata
Sign in to add a comment
|
Reduce performance impact of the pubsub publisher in buildbot |
||||||||||||||||||||||
Issue descriptionIn the discussion of what to do about bug 669297 on Friday, people suggested that there might be an issue with the replication of data from buildbot to Milo that might be impacting the buildbot master. I don't know the details about this, but supposedly hinoka@ did, so I'm filing this so that he can treat this as high-priority on Monday, since we don't yet know what's going on with the master and we think things are still broken. Please update the bug with more information about what you think this might be, and what work might be needed to fix it. If you think this is relatively likely to be the a real culprit, also please continue working on fixing it. Please treat this as your highest priority task; if you think you have something else more important that you should be working on, let me know and we (you/me/estaab) can prioritize things or move them around.
,
Dec 13 2016
I think the "new Milo scheduler change" that got picked up in https://bugs.chromium.org/p/chromium/issues/detail?id=669297#c66 and https://bugs.chromium.org/p/chromium/issues/detail?id=669297#c67 was the fix for this bug. If I'm right, can you link to it from here? Or is there more work to be done?
,
Dec 14 2016
So at one point, the buildbot pubsub pusher implemented support for pushing pending builds. This was left out in the initial version because the query isn't in the json by default probably because it required a postgres call. This was added by having each pubsub push also query for all of the pending builds, which buildbot has to make a postgres query for. There was suspicion that this was causing a lot of slowdown, so a patch was put in to add in a 200-item (master-wide) LRU cache in front of the postgres query.
,
Dec 14 2016
Ryan, WDYT about making it so that every Pub/Sub operation times itself and logs that time on completion? Then we can very easily notice when it takes too long, which was the clue that Ben used to identify this problem.
,
Dec 14 2016
I think that's a good idea, but I don't know how much that'll help since it might be dependent on reactor load (or maybe it's useful because it's a second derivitive of reactor load) In addition to that, I also want to log: * # of items in reactor that are ready to be consumed * Oldest item in reactor queue that is ready to be consumed But these two might be harder to dig up.
,
Dec 14 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/a8ad337f9e050566e7352a57a9ed1e897ff2d2e8 commit a8ad337f9e050566e7352a57a9ed1e897ff2d2e8 Author: Ryan Tseng <hinoka@google.com> Date: Wed Dec 14 01:09:15 2016 Buildbot: Add more logging to pubsub pusher Adds wall clock timing for each step of status push, along with reactor queue length. BUG= 673202 Change-Id: Ibe28abf025351461b9f634d603d425cb14a25634 Reviewed-on: https://chromium-review.googlesource.com/419687 Reviewed-by: Daniel Jacques <dnj@chromium.org> Commit-Queue: Ryan Tseng <hinoka@chromium.org> [modify] https://crrev.com/a8ad337f9e050566e7352a57a9ed1e897ff2d2e8/scripts/master/pubsub_json_status_push.py
,
Dec 14 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/95396906b3d575e93a6a9755dfb62b5508e6aa2f commit 95396906b3d575e93a6a9755dfb62b5508e6aa2f Author: Ryan Tseng <hinoka@google.com> Date: Wed Dec 14 21:39:24 2016
,
Dec 14 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/95396906b3d575e93a6a9755dfb62b5508e6aa2f commit 95396906b3d575e93a6a9755dfb62b5508e6aa2f Author: Ryan Tseng <hinoka@google.com> Date: Wed Dec 14 21:39:24 2016
,
Dec 14 2016
This was a much worse problem than I thought initially, after restarting FYI to pick up the above change, this appears in the logs: 2016-12-14 14:30:44-0800 [-] PubSub: Last send session took total 372.3s, 0.0 load build, 371.6 master, 0.7 send. len_tcq 1 it took over 5 minutes to compute the master data!!!
,
Dec 14 2016
Yikes. What's the next action, then?
,
Dec 14 2016
I'm adding in more instrumentation (in the form of printing stuff to logs) to dig deeper. In the meantime, if someone can rubberstamp this 3 line change, that would be fantastic: https://chromium-review.googlesource.com/c/418953/
,
Dec 14 2016
Looks like Dan had a comment on it. (side note: we should move away from rubberstamping CLs and do real reviews everywhere)
,
Dec 14 2016
Okay, so after some local testing and scanning through buildbot code, I've noticed a couple places we can make improvements. 1. Ensure there is an index for a buildRequest query. The query looks something like (Paraphrased): SELECT * FROM BuildRequest WHERE (claimed_at == None OR clamed_at == 0) AND claimed_by_name == None AND claimed_by_incarnation == None AND complete == 0 AND buildername == <builder name> https://cs.chromium.org/chromium/build/third_party/buildbot_8_4p1/buildbot/db/buildrequests.py?sq=package:chromium&dr=C&l=112 2. Limit the number of entries fetched by the above query to 25 (Sorted by oldest first). Current we are fetching _all_ of the buildrequests. The only optimization we made was limiting it to 25 buildsets (which is stored in a different table). 3. Use a COUNT (*) query instead to get the number of pending builds. This might be faster then doing a Select *. Alternative: Use count estimate? https://wiki.postgresql.org/wiki/Count_estimate +dsansome FYI become I want to do some postgres db configuration (ie add an index).
,
Dec 15 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/b7db7ffeae0260cce98ffc8e49d39614a833a7ec commit b7db7ffeae0260cce98ffc8e49d39614a833a7ec Author: Ryan Tseng <hinoka@google.com> Date: Thu Dec 15 00:01:20 2016 Buildbot PubSub pusher: Don't log stepStarted events This is pretty noisy, and the pusher doesn't even repond to this event, let's remove logging for it. BUG= 673202 Change-Id: I0b7c8077307874897813fc3f68d43cfbd690d179 Reviewed-on: https://chromium-review.googlesource.com/418953 Commit-Queue: Ryan Tseng <hinoka@chromium.org> Reviewed-by: Daniel Jacques <dnj@chromium.org> [modify] https://crrev.com/b7db7ffeae0260cce98ffc8e49d39614a833a7ec/scripts/master/pubsub_json_status_push.py
,
Dec 15 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/255c8f868c19a20e56053b3adbae9ab62647bee4 commit 255c8f868c19a20e56053b3adbae9ab62647bee4 Author: Ryan Tseng <hinoka@google.com> Date: Thu Dec 15 23:03:51 2016 Buildbot PubSub: Fetch all builders in parallel Builders were being fetched in sequence, this moves the builder fetching to a separate deferred which gets executed in parallel. Also adds some simplified top level stats about number of pending builds. BUG= 673202 Change-Id: I9382ab470f492040efdef2f366d1be481b05c005 Reviewed-on: https://chromium-review.googlesource.com/420873 Reviewed-by: Daniel Jacques <dnj@chromium.org> Commit-Queue: Ryan Tseng <hinoka@chromium.org> [modify] https://crrev.com/255c8f868c19a20e56053b3adbae9ab62647bee4/scripts/master/pubsub_json_status_push.py
,
Dec 15 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/f086a2bf1df7305abdb8c03cd6d247650863db0a commit f086a2bf1df7305abdb8c03cd6d247650863db0a Author: Ryan Tseng <hinoka@google.com> Date: Thu Dec 15 23:45:33 2016
,
Dec 21 2016
I think we can probably close this now, yes? It almost certainly was impacting performance, and seems to be much better now?
,
Dec 21 2016
Ryan is out but yes, we're in better shape now after the fixes.
,
Dec 28 2016
,
Dec 28 2016
,
Dec 28 2016
,
Dec 28 2016
I don't think we're in great shape as of yet, here are some metrics from today: 2016-12-28 11:30:02-0800 [-] PubSub: Last send session took total 3.9s, 0.0 load build, 3.4 master, 0.5 send. len_tcq 25. br 4. bs 0 That's 4 build request queries (total, for the master) taking 3.4 seconds. This won't scale well when people come back from the holidays. I added 3 blocking bugs as potential fixes. Having pending build costs be less than 0.1s / build should be the target for fixing this bug.
,
Jan 10 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/e80ad4b00ab95e996f83f54b2846a34f3a3d71c8 commit e80ad4b00ab95e996f83f54b2846a34f3a3d71c8 Author: Ryan Tseng <hinoka@google.com> Date: Tue Jan 10 05:48:47 2017 PubSub: Don't get pending builds for tryserver.chromium.linux BUG=679563, 673202 TBR=estaab Change-Id: I0cb437be78c3647ae899dd03362273fd63a26b0e Reviewed-on: https://chromium-review.googlesource.com/425797 Commit-Queue: Ryan Tseng <hinoka@chromium.org> Reviewed-by: Erik Staab <estaab@chromium.org> [modify] https://crrev.com/e80ad4b00ab95e996f83f54b2846a34f3a3d71c8/scripts/master/pubsub_json_status_push.py
,
Jan 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/4594b889500d4a602bbd147a8c897649c64f363e commit 4594b889500d4a602bbd147a8c897649c64f363e Author: Ryan Tseng <hinoka@google.com> Date: Wed Jan 11 07:04:04 2017 Enable pending build reporting for chromium.fyi At the very least, to collect some baseline data using FYI, which has historicall been known to be slow. BUG= 673202 TBR=estaab Change-Id: I95a1cda68d2a19940b11c30e7071fc80f7da3cca Reviewed-on: https://chromium-review.googlesource.com/427061 Reviewed-by: Ryan Tseng <hinoka@chromium.org> Commit-Queue: Ryan Tseng <hinoka@chromium.org> [modify] https://crrev.com/4594b889500d4a602bbd147a8c897649c64f363e/scripts/master/pubsub_json_status_push.py
,
Jan 11 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/8acd51f19e5e35185fcd3d23122e3da18cfd693e commit 8acd51f19e5e35185fcd3d23122e3da18cfd693e Author: Ryan Tseng <hinoka@google.com> Date: Wed Jan 11 07:12:52 2017
,
Jan 12 2017
For posterity: A "normal" peak for tryserver.chromium.linux without pubsub digging into pending builds is about 16 items max in the reactor queue: https://screenshot.googleplex.com/aTsbe7jnPDT
,
Jan 15 2017
,
Jan 16 2017
Is this fixed now? If not, what's left to be done? Are the "blocked on bugs" actually blocking figuring out whether the replication is affecting buildbot, or is it blocking *fixing* replication?
,
Jan 17 2017
This isn't fixed, tho the bug now is to fix the replication to be more better.
,
Jan 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/d4e8588dac9024ffce6967c8045482b1bbe9aeab commit d4e8588dac9024ffce6967c8045482b1bbe9aeab Author: Ryan Tseng <hinoka@google.com> Date: Wed Jan 18 06:22:29 2017 Buildbot PubSub: Keep track of all pending builds in the pubsub modules This uses a variety of subscription services within to keep track of pending builds. Buildbot backs pending build reqeusts by an SQL database for the case of multiple masters trying to claim the same request. But since we don't use multiple masters with a single database, this is not a concern for us and it's okay to maintain all pending builds in memory. BUG= 673202 Change-Id: Iff288ed9dc39317f4c0e6f1ab61429228e1b91b6 Reviewed-on: https://chromium-review.googlesource.com/427228 Reviewed-by: Daniel Jacques <dnj@chromium.org> Commit-Queue: Ryan Tseng <hinoka@chromium.org> [modify] https://crrev.com/d4e8588dac9024ffce6967c8045482b1bbe9aeab/scripts/master/pubsub_json_status_push.py
,
Jan 18 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/67b571f949f53e5c7adcac1981476acd059829d3 commit 67b571f949f53e5c7adcac1981476acd059829d3 Author: Ryan Tseng <hinoka@google.com> Date: Wed Jan 18 08:43:55 2017
,
Jan 18 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/34e5a86a9ebc4b08460fe79d6ff7152501b31615 commit 34e5a86a9ebc4b08460fe79d6ff7152501b31615 Author: Ryan Tseng <hinoka@google.com> Date: Wed Jan 18 08:54:47 2017
,
Jan 18 2017
Well, so this hasn't been reverted yet. Is it working? :D
,
Jan 19 2017
So the performance impact (from 1 day of running it on chromium.fyi) is low, but there are some accuracy issues. The pending builds aren't getting cleared out after the build starts. I'm looking into causes and fixes.
,
Jan 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/4b6b66ac7eaedfedb5119159fbeaabe680811a8b commit 4b6b66ac7eaedfedb5119159fbeaabe680811a8b Author: Dan Jacques <dnj@chromium.org> Date: Thu Jan 19 18:10:03 2017 Pub/Sub: Fix build request unsubscribe. Fix a bug(?) in BuildBot where BuildRequest observers were never triggered because, at the time of triggering, their build status object was not initialized. The bug was fixed by having the observer get called regardless. Add documentation to the Pub/Sub JSON status push that documents that the build status object will be None. BUG= chromium:673202 TEST=local - Fired up a local master, observed builds being removed. Change-Id: I152530cf12efc4587066baaf1e9c5232559fecc8 Reviewed-on: https://chromium-review.googlesource.com/430317 Reviewed-by: Ryan Tseng <hinoka@chromium.org> Commit-Queue: Ryan Tseng <hinoka@chromium.org> [modify] https://crrev.com/4b6b66ac7eaedfedb5119159fbeaabe680811a8b/scripts/master/pubsub_json_status_push.py [modify] https://crrev.com/4b6b66ac7eaedfedb5119159fbeaabe680811a8b/third_party/buildbot_8_4p1/buildbot/status/master.py
,
Jan 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/79b04ec68d47b7de3f7401dbf528f3d154f699a9 commit 79b04ec68d47b7de3f7401dbf528f3d154f699a9 Author: Ryan Tseng <hinoka@google.com> Date: Fri Jan 20 03:39:32 2017
,
Jan 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/02e63a25fd526050ce66f0f3b75fd17561bbb3cc commit 02e63a25fd526050ce66f0f3b75fd17561bbb3cc Author: Ryan Tseng <hinoka@google.com> Date: Fri Jan 20 05:34:29 2017
,
Jan 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/6a67db8e586ccdc04cb66c82db2fbaa39b538e8d commit 6a67db8e586ccdc04cb66c82db2fbaa39b538e8d Author: Ryan Tseng <hinoka@google.com> Date: Fri Jan 20 06:43:53 2017
,
Jan 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/162a591f69c22a7f64e3ad54c9c8c73fa80e2873 commit 162a591f69c22a7f64e3ad54c9c8c73fa80e2873 Author: Ryan Tseng <hinoka@google.com> Date: Fri Jan 20 06:46:46 2017
,
Jan 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/162a591f69c22a7f64e3ad54c9c8c73fa80e2873 commit 162a591f69c22a7f64e3ad54c9c8c73fa80e2873 Author: Ryan Tseng <hinoka@google.com> Date: Fri Jan 20 06:46:46 2017
,
Jan 23 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/fc0c174d403685f2bfbb9105471c22216628d5fe commit fc0c174d403685f2bfbb9105471c22216628d5fe Author: Ryan Tseng <hinoka@google.com> Date: Mon Jan 23 23:04:29 2017 Buildbot: Notify master for all build requests when a build starts BUG= 673202 Change-Id: I622ce7d51cc9c66a207777ef63a4c65a797deb58 Reviewed-on: https://chromium-review.googlesource.com/431120 Reviewed-by: Daniel Jacques <dnj@chromium.org> Commit-Queue: Ryan Tseng <hinoka@chromium.org> [modify] https://crrev.com/fc0c174d403685f2bfbb9105471c22216628d5fe/third_party/buildbot_8_4p1/buildbot/process/builder.py
,
Jan 23 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/8d4e2e0f26be6c589c4a85bf1184854ae759e5f3 commit 8d4e2e0f26be6c589c4a85bf1184854ae759e5f3 Author: Ryan Tseng <hinoka@google.com> Date: Mon Jan 23 23:15:07 2017
,
Jan 24 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/d6cabc8316bb31d4045f831c710cc97b093ca31f commit d6cabc8316bb31d4045f831c710cc97b093ca31f Author: Ryan Tseng <hinoka@google.com> Date: Tue Jan 24 00:35:41 2017
,
Jan 24 2017
I think the last fix did it. It should be both fast and correct now. After a few master restarts i'll mark this as fixed.
,
Jan 31 2017
|
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by hinoka@chromium.org
, Dec 13 2016