
Issue 673205 link

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: 2016-12-12
OS: All
Pri: 2
Type: Feature

Blocked on:
issue 681128
issue 642837
issue 673202
issue 673890
issue 681124
issue 681425




Figure out how to monitor when a master is alive but badly overloaded

Project Member Reported by dpranke@chromium.org, Dec 12 2016

Issue description

This is being split off from bug 669297, where tryserver.chromium.linux is getting badly overloaded and yet apparently we're not getting any alerts for this.

I've filed bug 673203 to figure out why perhaps the most obvious alert (too many pending builds) isn't firing, but that's not the only way a master might become unresponsive.

I'd like to understand what other things we could or should be monitoring and alerting on that might indicate that the master is mostly non-responsive.

Some possibilities:

- alert on page loads > 4-5 seconds for /json/varz or some other "simple" page
- alert on excessive CPU or load on the master (yes, I understand this may be impractical but I'd like to at least seriously explore options here)
- figure out if there's something we could instrument in the master itself that might help
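A minimal sketch of the first possibility, timing a fetch of a "simple" page and flagging it when it is slow or fails. The names `probe_page` and `http_fetch`, the 5-second default, and the 30-second timeout are assumptions for illustration, not existing infra code:

```python
import time
import urllib.request


def http_fetch(url, timeout_secs):
    """Fetch url and discard the body; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
        resp.read()


def probe_page(url, threshold_secs=5.0, timeout_secs=30.0,
               fetch=http_fetch, clock=time.monotonic):
    """Time one page load and decide whether it should raise an alert."""
    start = clock()
    try:
        fetch(url, timeout_secs)
        ok = True
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok = False
    elapsed = clock() - start
    # Alert when the page failed to load or took longer than the threshold.
    return {'ok': ok, 'seconds': elapsed,
            'alert': (not ok) or elapsed > threshold_secs}
```

The `fetch` and `clock` parameters exist only so the probe can be exercised without a live master; a real prober would call `probe_page` on a timer against the master's /json/varz URL and send the result to the monitoring pipeline.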


 
Good news: mastermon already monitors /json/varz; you can see it in vi/chrome_infra on the per-master console.

Bad news: I'd like to deprecate mastermon - issue 662479.

We can still have a prober running on master machines that pokes the master and sends metrics; we can even have a trimmed down version of mastermon do that. The same prober may monitor the CPU usage of master processes (a bit hacky IMHO, but can be done).
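As a rough sketch of the load side of such a prober: this reads the machine-wide load average rather than per-process CPU for the master, which is a cheaper proxy and an assumption on my part. The function names and the 1.5 limit are hypothetical, and `os.getloadavg` is Unix-only:

```python
import os


def load_per_cpu(getloadavg=os.getloadavg, cpu_count=os.cpu_count):
    """Return the 5-minute load average divided by the number of CPUs."""
    _one_min, five_min, _fifteen_min = getloadavg()
    return five_min / (cpu_count() or 1)  # cpu_count() may return None


def is_overloaded(limit=1.5, **sources):
    """True when normalized load exceeds `limit` (an assumed threshold)."""
    return load_per_cpu(**sources) > limit
```

Per-process CPU for the master would need something like reading /proc or a third-party library, which is not shown here.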

The master itself can periodically create a deferred object that computes the time difference between its creation and its execution, and reports that as the current reactor delay.
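The reactor-delay idea can be sketched roughly as follows. `ReactorDelayProbe` and the `report` callback are hypothetical names. `call_later` is assumed to have the shape of Twisted's `reactor.callLater(delay, callable)`, and it is injected so the sketch does not depend on Twisted being installed:

```python
import time


class ReactorDelayProbe:
    """Measures how long a zero-delay callback waits before the loop runs it."""

    def __init__(self, call_later, report, clock=time.monotonic):
        self._call_later = call_later  # e.g. reactor.callLater (assumed)
        self._report = report          # hypothetical metric sink
        self._clock = clock

    def probe(self):
        """Schedule one measurement; the delay is reported when it fires."""
        created = self._clock()

        def fire():
            # Time spent queued behind other work in the event loop.
            self._report(self._clock() - created)

        self._call_later(0, fire)
```

On an idle reactor the reported delay should be close to zero; a busy reactor makes the callback wait, so the value grows with how far behind the loop is. Calling `probe()` on a fixed interval would give a time series to alert on.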

Comment 2 by hinoka@chromium.org, Jan 10 2017

Cc: hinoka@chromium.org
Issue 679628 has been merged into this issue.

Comment 3 by bugdroid1@chromium.org, Jan 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/e21c8d6e6ddca65b4a1dc01f9e96cf33d67c5583

commit e21c8d6e6ddca65b4a1dc01f9e96cf33d67c5583
Author: Ryan Tseng <hinoka@google.com>
Date: Tue Jan 10 07:58:44 2017

Buildbot monitoring: Add metrics for reactor queue length

I'm theorizing that this metric correlates with master performance.

BUG=673205

Change-Id: I4563d50e11526597388a01ae843a95f9d64fad8c
Reviewed-on: https://chromium-review.googlesource.com/425850
Reviewed-by: Dave Sansome <dsansome@chromium.org>
Commit-Queue: Ryan Tseng <hinoka@chromium.org>

[modify] https://crrev.com/e21c8d6e6ddca65b4a1dc01f9e96cf33d67c5583/scripts/master/monitoring_status_receiver.py


Comment 4 by bugdroid1@chromium.org, Jan 11 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/a9f41e602716c70523c071d06982ec01502a2d53

commit a9f41e602716c70523c071d06982ec01502a2d53
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 11 02:54:58 2017


Comment 5 by bugdroid1@chromium.org, Jan 11 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/85a50f6987d668431095d408eaa4a51f03ecc413

commit 85a50f6987d668431095d408eaa4a51f03ecc413
Author: Ryan Tseng <hinoka@google.com>
Date: Wed Jan 11 03:25:51 2017

Comment 6 by hinoka@chromium.org, Jan 11 2017

Graphs here: http://shortn/_mSxzwY3eFg

There isn't much data yet; things are pretty quiet. The data so far shows that restarting the master pushes the reactor queue to 700+ items. We still need to figure out what the threshold for "bad" is. I'm speculating that about 50 items sustained for 10 minutes is enough to alert on.
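One way to express a "50 items for 10 minutes" rule, as a sketch only: the function name, the sample format, and treating "about 50" as strictly greater than 50 are all assumptions, and the real alert would presumably be configured in the monitoring system rather than in Python.

```python
def sustained_above(samples, threshold=50, window_secs=600):
    """Return True if the most recent samples stayed above threshold
    for at least window_secs.

    samples: list of (timestamp_secs, queue_length), sorted by time.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    run_start = None
    # Walk backwards from the newest sample until the queue was not high.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        run_start = ts
    return run_start is not None and latest_ts - run_start >= window_secs
```

Requiring the whole window to stay high keeps a short spike, such as the 700+ items seen right after a restart, from firing the alert unless it persists.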
This starts to become tangential, but much of restarting a master is unpickling and reloading build objects off of disk, right? Are there ways to restore data that might be faster (at the cost of having less data still in memory)? E.g., if we have a world where all the data is mirrored into milo and logdog in near-realtime, can we stop caring about old builds?

Comment 8 by hinoka@chromium.org, Jan 11 2017

That's true, though loading builds from disk is just a one-time cost during a restart, and I think we care more about the ongoing load/latency than the restart cost.
That's why it's tangential :).

Though, my understanding is that restarting chromium.fyi and chromium.perf is *really* slow as a result and there are real problems there that we should think about, too.
Cc: katthomas@chromium.org
Blockedon: 673202
Blockedon: 642837
Blocking: -669297
Cc: iannucci@chromium.org stip@chromium.org d...@chromium.org
Owner: estaab@chromium.org
Status: Started (was: Assigned)
Reassigning to estaab@ as per our meeting on Thursday. The basic goals for this bug are currently:

- make sure we have properly configured alerts on pending builds (tracked in bug 642837)

- make sure we have properly configured alerts on slow page loads; @estaab, do we have a bug for this work? 

- make sure replication to milo isn't unduly affecting buildbot (tracked in bug 673202)
Blockedon: 673890
Cc: serg...@chromium.org phajdan.jr@chromium.org chrishall@chromium.org andyb...@chromium.org tandrii@chromium.org
Issue 666894 has been merged into this issue.
Blockedon: 681425
I filed bug 681425 for the slow page loads issue, since I think we agreed that dsansome@ was going to look into that one?
Blockedon: 681124
Blockedon: 681128
Cc: -andyb...@chromium.org

Comment 22 by stip@chromium.org, Feb 10 2017

Cc: -stip@chromium.org

Comment 23 by efoo@chromium.org, Mar 29 2017

Components: -Infra>Monitoring
Labels: Ops-AddMonitoring
Removing Infra>Monitoring since this is a Buildbot-related alert modification. Please reserve Infra>Monitoring for monitoring (ts_mon and event_mon) bugs. Added the Ops-AddMonitoring label to track monitoring-related tasks.
Cc: -phajdan.jr@chromium.org estaab@chromium.org no...@chromium.org
Labels: -Pri-1 -Type-Bug Pri-2 Type-Feature
Owner: ----
Status: Available (was: Started)
This is less important now because we are migrating to LUCI.
Cc: -iannucci@chromium.org iannu...@google.com
