Add non-cumulative stage duration metric |
||||
Issue descriptionThe current metric for stage based duration monitoring is cumulative which makes analysis fuzzy. We need a non-cumulative metric to monitor the value for every stage. The metric should comply with the current CI monitoring plan and add common metric fields.
,
Sep 17
,
Sep 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/f362a46fe0948664aa3e8b08966e9c3afa7a2707 commit f362a46fe0948664aa3e8b08966e9c3afa7a2707 Author: Dhanya Ganesh <dhanyaganesh@chromium.org> Date: Tue Sep 18 17:31:03 2018 Chromite: Add Common Metric Fields for all metrics Common Metric Fields currently include build_config, tryjob, and branch name. This information needs to be included into every metric pushed from the build. BUG= chromium:876085 TEST=tryjob Change-Id: I3e9880b8f3644858145f088f5aa8948d870c268c Reviewed-on: https://chromium-review.googlesource.com/1188686 Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org> Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org> Reviewed-by: Mike Nichols <mikenichols@chromium.org> [modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/lib/ts_mon_config.py [modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/lib/ts_mon_config_unittest.py [modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/scripts/cbuildbot_launch.py [modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/lib/metrics.py
,
Sep 27
(build sheriff here) Seeing a flaky failure of the TestConsumeMessages.testConsumeTwoMetrics test in the CQ: > chromite-0.0.2-r4168: ====================================================================== > chromite-0.0.2-r4168: FAIL: [chromite.lib.ts_mon_config_unittest] TestConsumeMessages.testConsumeTwoMetrics > chromite-0.0.2-r4168: Tests that sending two metrics only calls flush once. > chromite-0.0.2-r4168: ---------------------------------------------------------------------- > chromite-0.0.2-r4168: Traceback (most recent call last): > chromite-0.0.2-r4168: File "/mnt/host/source/chromite/lib/timeout_util.py", line 191, in TimeoutWrapper > chromite-0.0.2-r4168: return func(*args, **kwargs) > chromite-0.0.2-r4168: File "/mnt/host/source/chromite/lib/ts_mon_config_unittest.py", line 89, in testConsumeTwoMetrics > chromite-0.0.2-r4168: 'arg1', fields=self.common_metric_fields, kwarg1='value') > chromite-0.0.2-r4168: File "/mnt/host/source/chromite/third_party/mock.py", line 846, in assert_called_once_with > chromite-0.0.2-r4168: return self.assert_called_with(*args, **kwargs) > chromite-0.0.2-r4168: File "/mnt/host/source/chromite/third_party/mock.py", line 835, in assert_called_with > chromite-0.0.2-r4168: raise AssertionError(msg) > chromite-0.0.2-r4168: AssertionError: Expected call: mock_name1('arg1', fields={u'branch_name': u'branch', u'tryjob': True, u'build_config': u'config'}, kwarg1='value') > chromite-0.0.2-r4168: Actual call: mock_name1('arg1', fields={u'branch_name': u'master', u'tryjob': False, u'build_config': u'config'}, kwarg1='value') > chromite-0.0.2-r4168: > chromite-0.0.2-r4168: ---------------------------------------------------------------------- https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934227197327257808 (builder "daisy_skate-paladin") dhanyaganesh@: Could you please check whether this could be caused by the change tracked in this bug? (If not, I'll file a separate bug for tracking this flakiness.)
,
Sep 27
> The current metric for stage based duration monitoring is cumulative which makes analysis fuzzy.
Can you elaborate on this? What makes analysis fuzzy?
Using a non-cumulative metric only allows you to determine what the most recent value of some metric was, for instance the most recent duration of {stage:foo, config:foo, builder_name:foo}. This is probably not what you are interested in in most anayses.
,
Sep 28
Cumulative metrics provide a distribution view of the world, bucketed in 50, 90, 95th, etc.. percentiles. This is great for tracking the overall health of a stage but does nothing to identify outliers that present problems; for example, the CQ can hang for 10 hours without knowing is an example of an outlier that we do not want to be swallowed up by the averaging of data. Every build should be identifiable and send a clear signal of the time taken. These can be averaged over the span of the window or used to preemptively identify cases where there are issues at the time they occur. -- Mike
,
Sep 29
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6 commit c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6 Author: Aviv Keshet <akeshet@chromium.org> Date: Sat Sep 29 07:27:36 2018 Revert "Chromite: Add Common Metric Fields for all metrics" This reverts commit f362a46fe0948664aa3e8b08966e9c3afa7a2707. Reason for revert: 1) This library is used by autotest lab for sending metrics, not just chromite. The fields you discuss adding are not meaningfull in the lab, and I don't believe you have tested or considered how this affects the lab. 2) Adding fields to an existing metric is not safe, because if monarch sees a concurrently inconcsistent set of fields then it will panic and drop data. BUG= chromium:876085 TEST=None Original change's description: > Chromite: Add Common Metric Fields for all metrics > > Common Metric Fields currently include build_config, tryjob, > and branch name. This information needs to be included into > every metric pushed from the build. > > BUG= chromium:876085 > TEST=tryjob > > Change-Id: I3e9880b8f3644858145f088f5aa8948d870c268c > Reviewed-on: https://chromium-review.googlesource.com/1188686 > Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org> > Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org> > Reviewed-by: Mike Nichols <mikenichols@chromium.org> Bug: chromium:876085 Change-Id: I89543d3e5a9ae7f8d2ce0b8b68d2272523dde1ba Reviewed-on: https://chromium-review.googlesource.com/1249663 Commit-Ready: Aviv Keshet <akeshet@chromium.org> Tested-by: Aviv Keshet <akeshet@chromium.org> Reviewed-by: Jason Clinton <jclinton@chromium.org> [modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/lib/ts_mon_config.py [modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/lib/ts_mon_config_unittest.py [modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/scripts/cbuildbot_launch.py [modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/lib/metrics.py
,
Oct 19
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/95c5c15bdec97afc40aeb74e8679016adf055883 commit 95c5c15bdec97afc40aeb74e8679016adf055883 Author: Dhanya Ganesh <dhanyaganesh@chromium.org> Date: Fri Oct 19 02:01:24 2018 Add Common Metric Fields for all metrics Common Metric Fields currently include build_config, tryjob, and branch name. This information needs to be included into every metric pushed from the build. This CL is an improved attempt at making the process streamlined. BUG= chromium:876085 TEST=tryjob Change-Id: I125b6db1ef8d185795e19f99431d40c7bb9b851a Reviewed-on: https://chromium-review.googlesource.com/1269936 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org> Reviewed-by: Jason Clinton <jclinton@chromium.org> Reviewed-by: Aviv Keshet <akeshet@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> Reviewed-by: Mike Nichols <mikenichols@chromium.org> [modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/ts_mon_config.py [modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/metrics.py [modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/scripts/cbuildbot_launch.py [modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/ts_mon_config_unittest.py [modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/constants.py [modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/scripts/cbuildbot_launch_unittest.py
,
Oct 23
|
||||
►
Sign in to add a comment |
||||
Comment 1 by bugdroid1@chromium.org
, Sep 6