New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 673202 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: 2016-12-11
OS: All
Pri: 1
Type: Bug

Blocked on:
issue 677315
issue 677316
issue 677318

Blocking:
issue 673205
issue 669297
issue 673400



Sign in to add a comment

Reduce performance impact of the pubsub publisher in buildbot

Project Member Reported by dpranke@chromium.org, Dec 12 2016

Issue description

In the discussion of what to do about bug 669297 on Friday, people suggested that there might be an issue with the replication of data from buildbot to Milo that might be impacting the buildbot master.

I don't know the details about this, but supposedly hinoka@ did, so I'm filing this so that he can treat this as high-priority on Monday, since we don't yet know what's going on with the master and we think things are still broken.

Please update the bug with more information about what you think this might be, and what work might be needed to fix it. If you think this is relatively likely to be the a real culprit, also please continue working on fixing it.

Please treat this as your highest priority task; if you think you have something else more important that you should be working on, let me know and we (you/me/estaab) can prioritize things or move them around.


 

Comment 1 by hinoka@chromium.org, Dec 13 2016

Blocking: 673400
I think the "new Milo scheduler change" that got picked up in https://bugs.chromium.org/p/chromium/issues/detail?id=669297#c66 and https://bugs.chromium.org/p/chromium/issues/detail?id=669297#c67 was the fix for this bug.

If I'm right, can you link to it from here? Or is there more work to be done?

Comment 3 by hinoka@chromium.org, Dec 14 2016

So at one point, the buildbot pubsub pusher implemented support for pushing pending builds.  This was left out in the initial version because the query isn't in the json by default probably because it required a postgres call.

This was added by having each pubsub push also query for all of the pending builds, which buildbot has to make a postgres query for.  There was suspicion that this was causing a lot of slowdown, so a patch was put in to add in a 200-item (master-wide) LRU cache in front of the postgres query.

Comment 4 by d...@chromium.org, Dec 14 2016

Ryan, WDYT about making it so that every Pub/Sub operation times itself and logs that time on completion? Then we can very easily notice when it takes too long, which was the clue that Ben used to identify this problem.

Comment 5 by hinoka@chromium.org, Dec 14 2016

I think that's a good idea, but I don't know how much that'll help since it might be dependent on reactor load (or maybe it's useful because it's a second derivitive of reactor load)

In addition to that, I also want to log:
* # of items in reactor that are ready to be consumed
* Oldest item in reactor queue that is ready to be consumed
But these two might be harder to dig up.
Project Member

Comment 6 by bugdroid1@chromium.org, Dec 14 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/a8ad337f9e050566e7352a57a9ed1e897ff2d2e8

commit a8ad337f9e050566e7352a57a9ed1e897ff2d2e8
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Dec 14 01:09:15 2016

Buildbot: Add more logging to pubsub pusher

Adds wall clock timing for each step of status push, along with reactor queue length.

BUG= 673202 

Change-Id: Ibe28abf025351461b9f634d603d425cb14a25634
Reviewed-on: https://chromium-review.googlesource.com/419687
Reviewed-by: Daniel Jacques <dnj@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/a8ad337f9e050566e7352a57a9ed1e897ff2d2e8/scripts/master/pubsub_json_status_push.py

Project Member

Comment 7 by bugdroid1@chromium.org, Dec 14 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/95396906b3d575e93a6a9755dfb62b5508e6aa2f

commit 95396906b3d575e93a6a9755dfb62b5508e6aa2f
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Dec 14 21:39:24 2016

Project Member

Comment 8 by bugdroid1@chromium.org, Dec 14 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/95396906b3d575e93a6a9755dfb62b5508e6aa2f

commit 95396906b3d575e93a6a9755dfb62b5508e6aa2f
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Dec 14 21:39:24 2016

Comment 9 by hinoka@chromium.org, Dec 14 2016

This was a much worse problem than I thought initially, after restarting FYI to pick up the above change, this appears in the logs:

2016-12-14 14:30:44-0800 [-] PubSub: Last send session took total 372.3s, 0.0 load build, 371.6 master, 0.7 send. len_tcq 1

it took over 5 minutes to compute the master data!!!
Yikes. What's the next action, then?
I'm adding in more instrumentation (in the form of printing stuff to logs) to dig deeper.

In the meantime, if someone can rubberstamp this 3 line change, that would be fantastic: https://chromium-review.googlesource.com/c/418953/
Looks like Dan had a comment on it.

(side note: we should move away from rubberstamping CLs and do real reviews everywhere)
Cc: dsansome@chromium.org
Okay, so after some local testing and scanning through buildbot code, I've noticed a couple places we can make improvements.

1. Ensure there is an index for a buildRequest query.  The query looks something like (Paraphrased):
SELECT * FROM BuildRequest WHERE (claimed_at == None OR clamed_at == 0)
AND claimed_by_name == None
AND claimed_by_incarnation == None
AND complete == 0 
AND buildername == <builder name>

https://cs.chromium.org/chromium/build/third_party/buildbot_8_4p1/buildbot/db/buildrequests.py?sq=package:chromium&dr=C&l=112

2. Limit the number of entries fetched by the above query to 25 (Sorted by oldest first).  Current we are fetching _all_ of the buildrequests.  The only optimization we made was limiting it to 25 buildsets (which is stored in a different table).

3. Use a COUNT (*) query instead to get the number of pending builds.  This might be faster then doing a Select *.
Alternative: Use count estimate? https://wiki.postgresql.org/wiki/Count_estimate

+dsansome FYI become I want to do some postgres db configuration (ie add an index).
Project Member

Comment 14 by bugdroid1@chromium.org, Dec 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/b7db7ffeae0260cce98ffc8e49d39614a833a7ec

commit b7db7ffeae0260cce98ffc8e49d39614a833a7ec
Author: Ryan Tseng <hinoka@google.com>
Date: Thu Dec 15 00:01:20 2016

Buildbot PubSub pusher: Don't log stepStarted events

This is pretty noisy, and the pusher doesn't even repond to this event, let's remove
logging for it.

BUG= 673202 

Change-Id: I0b7c8077307874897813fc3f68d43cfbd690d179
Reviewed-on: https://chromium-review.googlesource.com/418953
Commit-Queue: Ryan Tseng <hinoka@chromium.org>
Reviewed-by: Daniel Jacques <dnj@chromium.org>

[modify] https://crrev.com/b7db7ffeae0260cce98ffc8e49d39614a833a7ec/scripts/master/pubsub_json_status_push.py

Project Member

Comment 15 by bugdroid1@chromium.org, Dec 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/255c8f868c19a20e56053b3adbae9ab62647bee4

commit 255c8f868c19a20e56053b3adbae9ab62647bee4
Author: Ryan Tseng <hinoka@google.com>
Date: Thu Dec 15 23:03:51 2016

Buildbot PubSub: Fetch all builders in parallel

Builders were being fetched in sequence, this moves the builder fetching to
a separate deferred which gets executed in parallel.

Also adds some simplified top level stats about number of pending builds.

BUG= 673202 

Change-Id: I9382ab470f492040efdef2f366d1be481b05c005
Reviewed-on: https://chromium-review.googlesource.com/420873
Reviewed-by: Daniel Jacques <dnj@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/255c8f868c19a20e56053b3adbae9ab62647bee4/scripts/master/pubsub_json_status_push.py

Project Member

Comment 16 by bugdroid1@chromium.org, Dec 15 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/f086a2bf1df7305abdb8c03cd6d247650863db0a

commit f086a2bf1df7305abdb8c03cd6d247650863db0a
Author: Ryan Tseng <hinoka@google.com>
Date: Thu Dec 15 23:45:33 2016

I think we can probably close this now, yes? It almost certainly was impacting performance, and seems to be much better now?
Status: Fixed (was: Assigned)
Ryan is out but yes, we're in better shape now after the fixes.
Blockedon: 677315
Blockedon: 677316
Blockedon: 677318
Status: Started (was: Fixed)
I don't think we're in great shape as of yet, here are some metrics from today:

2016-12-28 11:30:02-0800 [-] PubSub: Last send session took total 3.9s, 0.0 load build, 3.4 master, 0.5 send. len_tcq 25. br 4. bs 0

That's 4 build request queries (total, for the master) taking 3.4 seconds.  This won't scale well when people come back from the holidays.

I added 3 blocking bugs as potential fixes.  Having pending build costs be less than 0.1s / build should be the target for fixing this bug.
Project Member

Comment 23 by bugdroid1@chromium.org, Jan 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/e80ad4b00ab95e996f83f54b2846a34f3a3d71c8

commit e80ad4b00ab95e996f83f54b2846a34f3a3d71c8
Author: Ryan Tseng <hinoka@google.com>
Date: Tue Jan 10 05:48:47 2017

PubSub: Don't get pending builds for tryserver.chromium.linux

BUG=679563, 673202 
TBR=estaab

Change-Id: I0cb437be78c3647ae899dd03362273fd63a26b0e
Reviewed-on: https://chromium-review.googlesource.com/425797
Commit-Queue: Ryan Tseng <hinoka@chromium.org>
Reviewed-by: Erik Staab <estaab@chromium.org>

[modify] https://crrev.com/e80ad4b00ab95e996f83f54b2846a34f3a3d71c8/scripts/master/pubsub_json_status_push.py

Project Member

Comment 24 by bugdroid1@chromium.org, Jan 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/4594b889500d4a602bbd147a8c897649c64f363e

commit 4594b889500d4a602bbd147a8c897649c64f363e
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 11 07:04:04 2017

Enable pending build reporting for chromium.fyi

At the very least, to collect some baseline data using FYI, which has historicall
been known to be slow.

BUG= 673202 
TBR=estaab

Change-Id: I95a1cda68d2a19940b11c30e7071fc80f7da3cca
Reviewed-on: https://chromium-review.googlesource.com/427061
Reviewed-by: Ryan Tseng <hinoka@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/4594b889500d4a602bbd147a8c897649c64f363e/scripts/master/pubsub_json_status_push.py

Project Member

Comment 25 by bugdroid1@chromium.org, Jan 11 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/8acd51f19e5e35185fcd3d23122e3da18cfd693e

commit 8acd51f19e5e35185fcd3d23122e3da18cfd693e
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 11 07:12:52 2017

For posterity: A "normal" peak for tryserver.chromium.linux without pubsub digging into pending builds is about 16 items max in the reactor queue:

https://screenshot.googleplex.com/aTsbe7jnPDT
Blocking: 673205
Is this fixed now? If not, what's left to be done? Are the "blocked on bugs" actually blocking figuring out whether the replication is affecting buildbot, or is it blocking *fixing* replication?
Summary: Reduce performance impact of the pubsub publisher in buildbot (was: Figure out if replicating data to milo is affecting buildbot performance)
This isn't fixed, tho the bug now is to fix the replication to be more better.
Project Member

Comment 30 by bugdroid1@chromium.org, Jan 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/d4e8588dac9024ffce6967c8045482b1bbe9aeab

commit d4e8588dac9024ffce6967c8045482b1bbe9aeab
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 18 06:22:29 2017

Buildbot PubSub: Keep track of all pending builds in the pubsub modules

This uses a variety of subscription services within to keep track of pending builds.

Buildbot backs pending build reqeusts by an SQL database for the case of multiple
masters trying to claim the same request.  But since we don't use multiple
masters with a single database, this is not a concern for us and it's okay to
maintain all pending builds in memory.

BUG= 673202 

Change-Id: Iff288ed9dc39317f4c0e6f1ab61429228e1b91b6
Reviewed-on: https://chromium-review.googlesource.com/427228
Reviewed-by: Daniel Jacques <dnj@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/d4e8588dac9024ffce6967c8045482b1bbe9aeab/scripts/master/pubsub_json_status_push.py

Project Member

Comment 31 by bugdroid1@chromium.org, Jan 18 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/67b571f949f53e5c7adcac1981476acd059829d3

commit 67b571f949f53e5c7adcac1981476acd059829d3
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 18 08:43:55 2017

Project Member

Comment 32 by bugdroid1@chromium.org, Jan 18 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/34e5a86a9ebc4b08460fe79d6ff7152501b31615

commit 34e5a86a9ebc4b08460fe79d6ff7152501b31615
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 18 08:54:47 2017

Comment 33 by d...@chromium.org, Jan 18 2017

Well, so this hasn't been reverted yet. Is it working? :D
So the performance impact (from 1 day of running it on chromium.fyi) is low, but there are some accuracy issues.  The pending builds aren't getting cleared out after the build starts.  I'm looking into causes and fixes.
Project Member

Comment 35 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/4b6b66ac7eaedfedb5119159fbeaabe680811a8b

commit 4b6b66ac7eaedfedb5119159fbeaabe680811a8b
Author: Dan Jacques <dnj@chromium.org>
Date: Thu Jan 19 18:10:03 2017

Pub/Sub: Fix build request unsubscribe.

Fix a bug(?) in BuildBot where BuildRequest observers were never
triggered because, at the time of triggering, their build status object
was not initialized. The bug was fixed by having the observer get called
regardless.

Add documentation to the Pub/Sub JSON status push that documents that
the build status object will be None.

BUG= chromium:673202 
TEST=local
  - Fired up a local master, observed builds being removed.

Change-Id: I152530cf12efc4587066baaf1e9c5232559fecc8
Reviewed-on: https://chromium-review.googlesource.com/430317
Reviewed-by: Ryan Tseng <hinoka@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/4b6b66ac7eaedfedb5119159fbeaabe680811a8b/scripts/master/pubsub_json_status_push.py
[modify] https://crrev.com/4b6b66ac7eaedfedb5119159fbeaabe680811a8b/third_party/buildbot_8_4p1/buildbot/status/master.py

Project Member

Comment 36 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/79b04ec68d47b7de3f7401dbf528f3d154f699a9

commit 79b04ec68d47b7de3f7401dbf528f3d154f699a9
Author: Ryan Tseng <hinoka@google.com>
Date: Fri Jan 20 03:39:32 2017

Project Member

Comment 37 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/02e63a25fd526050ce66f0f3b75fd17561bbb3cc

commit 02e63a25fd526050ce66f0f3b75fd17561bbb3cc
Author: Ryan Tseng <hinoka@google.com>
Date: Fri Jan 20 05:34:29 2017

Project Member

Comment 38 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/6a67db8e586ccdc04cb66c82db2fbaa39b538e8d

commit 6a67db8e586ccdc04cb66c82db2fbaa39b538e8d
Author: Ryan Tseng <hinoka@google.com>
Date: Fri Jan 20 06:43:53 2017

Project Member

Comment 39 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/162a591f69c22a7f64e3ad54c9c8c73fa80e2873

commit 162a591f69c22a7f64e3ad54c9c8c73fa80e2873
Author: Ryan Tseng <hinoka@google.com>
Date: Fri Jan 20 06:46:46 2017

Project Member

Comment 40 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/162a591f69c22a7f64e3ad54c9c8c73fa80e2873

commit 162a591f69c22a7f64e3ad54c9c8c73fa80e2873
Author: Ryan Tseng <hinoka@google.com>
Date: Fri Jan 20 06:46:46 2017

Project Member

Comment 41 by bugdroid1@chromium.org, Jan 23 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/fc0c174d403685f2bfbb9105471c22216628d5fe

commit fc0c174d403685f2bfbb9105471c22216628d5fe
Author: Ryan Tseng <hinoka@google.com>
Date: Mon Jan 23 23:04:29 2017

Buildbot: Notify master for all build requests when a build starts

BUG= 673202 

Change-Id: I622ce7d51cc9c66a207777ef63a4c65a797deb58
Reviewed-on: https://chromium-review.googlesource.com/431120
Reviewed-by: Daniel Jacques <dnj@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/fc0c174d403685f2bfbb9105471c22216628d5fe/third_party/buildbot_8_4p1/buildbot/process/builder.py

Project Member

Comment 42 by bugdroid1@chromium.org, Jan 23 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/8d4e2e0f26be6c589c4a85bf1184854ae759e5f3

commit 8d4e2e0f26be6c589c4a85bf1184854ae759e5f3
Author: Ryan Tseng <hinoka@google.com>
Date: Mon Jan 23 23:15:07 2017

Project Member

Comment 43 by bugdroid1@chromium.org, Jan 24 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/d6cabc8316bb31d4045f831c710cc97b093ca31f

commit d6cabc8316bb31d4045f831c710cc97b093ca31f
Author: Ryan Tseng <hinoka@google.com>
Date: Tue Jan 24 00:35:41 2017

I think the last fix did it.  It should be both fast and correct now.  After a few master restarts i'll mark this as fixed.
Status: Fixed (was: Started)

Sign in to add a comment