New issue
Advanced search Search tips

Issue 876085 link

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Closed: Oct 23
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Task



Sign in to add a comment

Add non-cumulative stage duration metric

Project Member Reported by dhanyaganesh@chromium.org, Aug 20

Issue description

The current metric for stage based duration monitoring is cumulative which makes analysis fuzzy. We need a non-cumulative metric to monitor the value for every stage. The metric should comply with the current CI monitoring plan and add common metric fields.

 
Project Member

Comment 1 by bugdroid1@chromium.org, Sep 6

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/89f629fce2d28f438bd3817f023cc39c22688c40

commit 89f629fce2d28f438bd3817f023cc39c22688c40
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Thu Sep 06 04:04:35 2018

generic_stages: Add non-cumulative stage duration metric

The current metric for stage based duration monitoring
is cumulative which makes analysis fuzzy. This CL adds
a non-cumulative FloatMetric to monitor the same value
for every stage. Common metric fields are added to comply
with the current CI monitoring plan.

BUG= chromium:876085 
TEST=tryjob

Change-Id: I4dbe16729add8c477ac0d4c65392cf363c03fba0
Reviewed-on: https://chromium-review.googlesource.com/1182404
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/89f629fce2d28f438bd3817f023cc39c22688c40/lib/constants.py
[modify] https://crrev.com/89f629fce2d28f438bd3817f023cc39c22688c40/cbuildbot/stages/generic_stages.py

Status: Fixed (was: Started)
Project Member

Comment 3 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/f362a46fe0948664aa3e8b08966e9c3afa7a2707

commit f362a46fe0948664aa3e8b08966e9c3afa7a2707
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Tue Sep 18 17:31:03 2018

Chromite: Add Common Metric Fields for all metrics

Common Metric Fields currently include build_config, tryjob,
and branch name. This information needs to be included into
every metric pushed from the build.

BUG= chromium:876085 
TEST=tryjob

Change-Id: I3e9880b8f3644858145f088f5aa8948d870c268c
Reviewed-on: https://chromium-review.googlesource.com/1188686
Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org>
Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/lib/ts_mon_config.py
[modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/lib/ts_mon_config_unittest.py
[modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/scripts/cbuildbot_launch.py
[modify] https://crrev.com/f362a46fe0948664aa3e8b08966e9c3afa7a2707/lib/metrics.py

Cc: emaxx@chromium.org
Status: Assigned (was: Fixed)
(build sheriff here)

Seeing a flaky failure of the TestConsumeMessages.testConsumeTwoMetrics test in the CQ:

> chromite-0.0.2-r4168: ======================================================================
> chromite-0.0.2-r4168: FAIL: [chromite.lib.ts_mon_config_unittest] TestConsumeMessages.testConsumeTwoMetrics
> chromite-0.0.2-r4168: Tests that sending two metrics only calls flush once.
> chromite-0.0.2-r4168: ----------------------------------------------------------------------
> chromite-0.0.2-r4168: Traceback (most recent call last):
> chromite-0.0.2-r4168:   File "/mnt/host/source/chromite/lib/timeout_util.py", line 191, in TimeoutWrapper
> chromite-0.0.2-r4168:     return func(*args, **kwargs)
> chromite-0.0.2-r4168:   File "/mnt/host/source/chromite/lib/ts_mon_config_unittest.py", line 89, in testConsumeTwoMetrics
> chromite-0.0.2-r4168:     'arg1', fields=self.common_metric_fields, kwarg1='value')
> chromite-0.0.2-r4168:   File "/mnt/host/source/chromite/third_party/mock.py", line 846, in assert_called_once_with
> chromite-0.0.2-r4168:     return self.assert_called_with(*args, **kwargs)
> chromite-0.0.2-r4168:   File "/mnt/host/source/chromite/third_party/mock.py", line 835, in assert_called_with
> chromite-0.0.2-r4168:     raise AssertionError(msg)
> chromite-0.0.2-r4168: AssertionError: Expected call: mock_name1('arg1', fields={u'branch_name': u'branch', u'tryjob': True, u'build_config': u'config'}, kwarg1='value')
> chromite-0.0.2-r4168: Actual call: mock_name1('arg1', fields={u'branch_name': u'master', u'tryjob': False, u'build_config': u'config'}, kwarg1='value')
> chromite-0.0.2-r4168: 
> chromite-0.0.2-r4168: ----------------------------------------------------------------------

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934227197327257808 (builder "daisy_skate-paladin")


dhanyaganesh@: Could you please check whether this could be caused by the change tracked in this bug?
(If not, I'll file a separate bug for tracking this flakiness.)
> The current metric for stage based duration monitoring is cumulative which makes analysis fuzzy.

Can you elaborate on this? What makes analysis fuzzy?
Using a non-cumulative metric only allows you to determine what the most recent value of some metric was, for instance the most recent duration of {stage:foo, config:foo, builder_name:foo}. This is probably not what you are interested in in most anayses.
Cumulative metrics provide a distribution view of the world, bucketed in 50, 90, 95th, etc.. percentiles.  This is great for tracking the overall health of a stage but does nothing to identify outliers that present problems; for example, the CQ can hang for 10 hours without knowing is an example of an outlier that we do not want to be swallowed up by the averaging of data.

Every build should be identifiable and send a clear signal of the time taken.  These can be averaged over the span of the window or used to preemptively identify cases where there are issues at the time they occur.  

-- Mike  
Project Member

Comment 7 by bugdroid1@chromium.org, Sep 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6

commit c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6
Author: Aviv Keshet <akeshet@chromium.org>
Date: Sat Sep 29 07:27:36 2018

Revert "Chromite: Add Common Metric Fields for all metrics"

This reverts commit f362a46fe0948664aa3e8b08966e9c3afa7a2707.

Reason for revert:
1) This library is used by autotest lab for sending metrics,
not just chromite. The fields you discuss adding are not meaningfull
in the lab, and I don't believe you have tested or considered how this
affects the lab.
2) Adding fields to an existing metric is not safe, because if monarch
sees a concurrently inconcsistent set of fields then it will panic
and drop data.

BUG= chromium:876085 
TEST=None

Original change's description:
> Chromite: Add Common Metric Fields for all metrics
>
> Common Metric Fields currently include build_config, tryjob,
> and branch name. This information needs to be included into
> every metric pushed from the build.
>
> BUG= chromium:876085 
> TEST=tryjob
>
> Change-Id: I3e9880b8f3644858145f088f5aa8948d870c268c
> Reviewed-on: https://chromium-review.googlesource.com/1188686
> Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org>
> Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
> Reviewed-by: Mike Nichols <mikenichols@chromium.org>

Bug:  chromium:876085 
Change-Id: I89543d3e5a9ae7f8d2ce0b8b68d2272523dde1ba
Reviewed-on: https://chromium-review.googlesource.com/1249663
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Jason Clinton <jclinton@chromium.org>

[modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/lib/ts_mon_config.py
[modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/lib/ts_mon_config_unittest.py
[modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/scripts/cbuildbot_launch.py
[modify] https://crrev.com/c38eebeacf5bc3813989cbcbcc9e8abff4d6daa6/lib/metrics.py

Project Member

Comment 8 by bugdroid1@chromium.org, Oct 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/95c5c15bdec97afc40aeb74e8679016adf055883

commit 95c5c15bdec97afc40aeb74e8679016adf055883
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Fri Oct 19 02:01:24 2018

Add Common Metric Fields for all metrics

Common Metric Fields currently include build_config, tryjob,
and branch name. This information needs to be included into
every metric pushed from the build. This CL is an improved
attempt at making the process streamlined.

BUG= chromium:876085 
TEST=tryjob

Change-Id: I125b6db1ef8d185795e19f99431d40c7bb9b851a
Reviewed-on: https://chromium-review.googlesource.com/1269936
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
Reviewed-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/ts_mon_config.py
[modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/metrics.py
[modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/scripts/cbuildbot_launch.py
[modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/ts_mon_config_unittest.py
[modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/lib/constants.py
[modify] https://crrev.com/95c5c15bdec97afc40aeb74e8679016adf055883/scripts/cbuildbot_launch_unittest.py

Status: Verified (was: Assigned)

Sign in to add a comment