Figure out how to monitor when a master is alive but badly overloaded
Issue description

This is being split off from bug 669297, where tryserver.chromium.linux is getting badly overloaded and yet we're apparently not getting any alerts for it. I've filed bug 673203 to figure out why perhaps the most obvious alert (too many pending builds) isn't firing, but that's not the only way a master might become unresponsive.

I'd like to understand what else we could or should be monitoring and alerting on that might indicate that the master is mostly unresponsive. Some possibilities:
- alert on page loads > 4-5 seconds for /json/varz or some other "simple" page
- alert on excessive CPU or load on the master (yes, I understand this may be impractical, but I'd like to at least seriously explore options here)
- figure out if there's something we could instrument in the master itself that might help
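To make the first possibility concrete, here is a rough sketch of a page-load probe. It is not an existing tool: `probe_latency`, `should_alert`, and the 5-second default are hypothetical names and values picked to match the "4-5 seconds" idea above, and the master URL is left to the caller. A failed or timed-out request is treated as alertable too, which is an assumption on top of what the description asks for.

```python
import time
import urllib.request


def probe_latency(url, timeout=30.0, opener=urllib.request.urlopen,
                  clock=time.monotonic):
    """Fetch `url` once and return (elapsed_seconds, ok).

    `opener` and `clock` are injectable so the probe can be exercised
    without a live master; both names are hypothetical.
    """
    start = clock()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()  # include body transfer in the measured time
            ok = 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        ok = False
    return clock() - start, ok


def should_alert(elapsed, ok, threshold_s=5.0):
    """Alert if the page failed to load or took longer than the threshold."""
    return (not ok) or elapsed > threshold_s
```

Something like `should_alert(*probe_latency(master_url + "/json/varz"))` run on a schedule would be the simplest use. A real alert would presumably want several consecutive slow probes before firing, rather than a single sample.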
,
Jan 10 2017
,
Jan 10 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/e21c8d6e6ddca65b4a1dc01f9e96cf33d67c5583

commit e21c8d6e6ddca65b4a1dc01f9e96cf33d67c5583
Author: Ryan Tseng <hinoka@google.com>
Date: Tue Jan 10 07:58:44 2017

Buildbot monitoring: Add metrics for reactor queue length

I'm theorizing that this metric correlates with master performance.

BUG=673205
Change-Id: I4563d50e11526597388a01ae843a95f9d64fad8c
Reviewed-on: https://chromium-review.googlesource.com/425850
Reviewed-by: Dave Sansome <dsansome@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/e21c8d6e6ddca65b4a1dc01f9e96cf33d67c5583/scripts/master/monitoring_status_receiver.py
,
Jan 11 2017
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/a9f41e602716c70523c071d06982ec01502a2d53

commit a9f41e602716c70523c071d06982ec01502a2d53
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 11 02:54:58 2017
,
Jan 11 2017
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/85a50f6987d668431095d408eaa4a51f03ecc413

commit 85a50f6987d668431095d408eaa4a51f03ecc413
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 11 03:25:51 2017
,
Jan 11 2017
Graphs here: http://shortn/_mSxzwY3eFg

Not a whole lot of data right now; it's pretty quiet. The data so far shows that restarting the master pushes the reactor queue to 700+ items. Still need to figure out what the threshold for "bad" is. I'm speculating that about 50 items for 10 minutes is enough to alert on.
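For illustration only, the "about 50 items for 10 minutes" rule could be evaluated roughly as below. This is a sketch, not the actual alert configuration: `sustained_above` is a hypothetical helper, the sample format (a time-sorted list of `(timestamp_seconds, queue_length)` pairs) is assumed, and both numbers are the speculative values from this comment. Requiring every sample in the window to be at or above the threshold is one way to keep a short restart spike from firing the alert.

```python
def sustained_above(samples, threshold=50, window_s=600):
    """Return True if the queue length stayed at or above `threshold`
    for the whole trailing window of `window_s` seconds.

    `samples` is a list of (timestamp_seconds, queue_length) tuples,
    sorted by timestamp (an assumed format).
    """
    if not samples:
        return False
    now = samples[-1][0]
    cutoff = now - window_s
    # Without history reaching back to the start of the window we cannot
    # claim the condition held for the full duration.
    if samples[0][0] > cutoff:
        return False
    recent = [value for ts, value in samples if ts >= cutoff]
    return all(value >= threshold for value in recent)
```

With one sample per minute, eleven consecutive readings of 60 would trigger, while a restart that briefly hits 700 and then drains back to single digits would not.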
,
Jan 11 2017
This starts to become tangential, but much of restarting a master is unpickling and reloading build objects off of disk, right? Are there ways to restore data that might be faster (at the cost of having less data still in memory)? E.g., if we have a world where all the data is mirrored into milo and logdog in near-realtime, can we stop caring about old builds?
,
Jan 11 2017
That's true, though loading builds from disk is just a one-time cost during a restart, and I think we care more about the ongoing load/latency than the restart cost.
,
Jan 11 2017
That's why it's tangential :). Though, my understanding is that restarting chromium.fyi and chromium.perf is *really* slow as a result and there are real problems there that we should think about, too.
,
Jan 11 2017
,
Jan 15 2017
,
Jan 16 2017
,
Jan 16 2017
,
Jan 16 2017
Reassigning to estaab@ as per our meeting on Thursday. The basic goals for this bug are currently:
- make sure we have properly configured alerts on pending builds (tracked in bug 642837)
- make sure we have properly configured alerts on slow page loads; estaab@, do we have a bug for this work?
- make sure replication to milo isn't unduly affecting buildbot (tracked in bug 673202)
,
Jan 16 2017
,
Jan 16 2017
Issue 666894 has been merged into this issue.
,
Jan 16 2017
,
Jan 16 2017
I filed bug 681425 for the slow page loads issue, since I think we agreed that dsansome@ was going to look into that one?
,
Jan 16 2017
,
Jan 16 2017
,
Jan 18 2017
,
Feb 10 2017
,
Mar 29 2017
Removing Infra>Monitoring since this is a Buildbot-related alert modification. Please reserve Infra>Monitoring for monitoring (ts_mon and event_mon) bugs. Added the Ops-AddMonitoring label to track monitoring-related tasks.
,
Jun 2 2018
This is less important now because we are migrating to LUCI.
,
Oct 19
Comment 1 by sergeybe...@chromium.org
, Jan 5 2017