More robust step monitoring using explicit monitoring IDs |
|||||||
Issue descriptionI recently was fixing bug 674759 , and I realized that adding per-step monitoring is a bit silly: 1. Modify the regex hard-coded in the master here: https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/master/status_logger.py#449 2. Restart all the masters to pick up this change. This is unfortunately also fragile if the names of these steps (which are intended to be human-readable and not machine-readable) ever change (as one might reasonably do to improve the legibility of a build's presentation). Additionally, restarting the masters is pretty disruptive. Unfortunately, since step names are human-readable and subject to change (up to and including arbitrary data like git commit hashes), it's not feasible to send a 'step_name' field for every step with out exploding the world. I would propose to fix this by replacing the 'step_name' field with an explicit 'monitoring_id' field, which is added to the step explicitly when the step is run. The mechanism to do this would work like: 1. Add a new annotation @@@MONITORING_ID@<NAMESPACE>@<VALUE>@@@ 2. Modify the master to add this as a field `namespace:value` to the Build object when it parses this annotation. 3. Modify the master to send metrics for the step if it has monitoring_id set, and to send monitoring_id in place of step_name. 4. Restart all the masters :) 5. Modify the recipes 'step' api to allow this to be set when starting a step (e.g. `api.step('git fetch (%s)' % (commit,), ['git', 'fetch', 'origin', commit], monitoring_id=('infra', 'git_fetch'))` ) After this, adding monitoring for a step will be as easy as adding a monitoring_id attribute to a step. This would also be potentially valuable for user steps too. Additionally, this API in recipes could persist after buildbot as well, so that recipes with monitored steps would not need to change, and with a different backend, we could potentially have the recipe engine use this directly (e.g. so that a given recipe would be monitorable on both swarming and buildbot simultaneously as we transition).
,
Feb 14 2017
,
Feb 14 2017
+nodir, +dnj, we may need to do this for luci-lite... I think we discussed something similar offline, but I don't remember what the conclusion was, or if there's a hotlist we should add this to (or if there's a bug we should block with it)
,
Feb 14 2017
This seems like a good proposal. Is there some danger to allowing users to arbitrary enable monitoring on their builds? A bad CL may be able to overwhelm the monitoring backend's stream count capacity. Perhaps we should make a set of monitoring IDs available in a place we control (recipe engine? annotated_run? Kitchen?) and limit the users to those IDs. If we had a recipe engine monitoring stream type (for annotation-less monitoring), it could respond to this annotation and related stream events by gathering / sending metrics. As an aside, all of this seems more relevant in the face of event monitoring. If we had a "step information" BigQuery table, we could monitor *every* step by default. The monitoring ID could override the step name for consistency across recipe modifications.
,
Feb 14 2017
proposal lgtm I've added luci label so this is not lost
,
Mar 29 2017
Removing Infra>Monitoring since this is a Recipes related alert modification. Please reserve Infra>Monitoring for monitoring (ts_mon and event_mon) bugs. Added Ops-AddMonitoring label to track monitoring related tasks.
,
May 16 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 16 2018
This might potentially be useful for our monitoring subteam at CCI. Adding to chrome-client-infra-monitoring hotlist and updating the component.
,
May 16 2018
based on my research in go/step-monitoring, i believe monitoring_id won't scale
,
May 16 2018
Why wouldn't it? If build names and even bot names scale, I don't see the reason why monitoring_id wouldn't, if we make sure they are not generated dynamically. When tsmon was first introduced, it took a bit of user education to use it correctly, but now we are pretty good at it. I think the same can work here.
,
Oct 22
I'm tempted to abandon this bug / approach in favor of using BigQuery data directly for step monitoring. I don't believe we have any plans to add alerts for steps, so this data is only useful for deep dives and historical analysis, which is done better in BQ anyway. |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by sergeybe...@chromium.org
, Jan 12 2017