
Issue 668288

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




metrics: Add a new metric to report partitioned runtime of functions.

Project Member Reported by pprabhu@chromium.org, Nov 23 2016

Issue description

We have cases where we report the distribution of runtime for an important function. Furthermore, we'd like to know what % of time was spent in each of the sub-steps for this function.

For example, 
- scheduler tick: https://chromium-review.googlesource.com/#/c/413610/4/
- scheduler db cleanup: https://chromium-review.googlesource.com/#/c/413633/1

We should add a new metric type that does this for us (a rough sketch follows the list below).

- As a decorator / context, it measures the overall time taken and reports it as a distribution.
- Additionally, it provides two functions: substep_start(name) and substep_end(name). 
  - It also collects the time taken by each (not necessarily mutually exclusive) substep.
  - Finally, it also reports, for each substep separately, a distribution metric (0-100 range) of the % of time spent in that substep.
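
A minimal sketch of the described behavior, assuming a context-manager-style class and a hypothetical report_distribution() helper standing in for real ts_mon distribution metrics (the names are illustrative, not the final chromite API):

import time

def report_distribution(name, value):
  # Hypothetical stand-in for adding a sample to a ts_mon distribution metric.
  print('%s: %f' % (name, value))

class RuntimeBreakdown(object):
  """Times an operation and reports the % of time spent in each substep."""

  def __init__(self, name):
    self._name = name
    self._substep_starts = {}
    self._substep_totals = {}

  def __enter__(self):
    self._start = time.time()
    return self

  def __exit__(self, exc_type, exc_value, exc_tb):
    total = time.time() - self._start
    report_distribution(self._name + '/duration', total)
    for substep, elapsed in self._substep_totals.items():
      # Substeps may overlap, so the percentages need not sum to 100.
      report_distribution('%s/substep/%s/percent' % (self._name, substep),
                          100.0 * elapsed / total if total else 0.0)
    return False

  def substep_start(self, substep):
    self._substep_starts[substep] = time.time()

  def substep_end(self, substep):
    elapsed = time.time() - self._substep_starts.pop(substep)
    self._substep_totals[substep] = (
        self._substep_totals.get(substep, 0.0) + elapsed)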


 
A nicer API might be:

with metrics.RunTimeBreakdown('/foo/bar') as e:
  e.StartStep('step1')
  step1()

  e.StartStep('step2')
  step2()

(instead of needing to call both _start and _end)


Also, do we want the % breakdowns to be a distribution, or a gauge? With gauge we theoretically lose a bit of resolution, but it will be a lot easier to graph / work with.
API question:

The decision here is: do we want to enforce the steps to be mutually exclusive?
By not enforcing this requirement, you are free to do:

def step():
  with RunTimeBreakdown('/foo/bar') as e:
    e.StartStep('mycomplexstep1')
    mycomplexstep1(e)
    e.EndStep('mycomplexstep1')
  
    e.StartStep('somelaterstep')
    somelaterstep()
    e.EndStep('somelaterstep')

def mycomplexstep1(e):
  e.StartStep('complexstep1_part1')
  complexstep1_part1()
  e.EndStep('complexstep1_part1')

  e.StartStep('complexstep1_part2')
  complexstep1_part2()
  e.EndStep('complexstep1_part2')

------------------------------------

Note that this use case is not just theoretical.
There are complicated substeps in a scheduler tick where you may want the fraction of time for both the outer substep and its own constituents.

Secondly, this saves us from gotchas where we thought the steps were mutually exclusive, but were wrong. I think the metric shouldn't depend on that assumption for correctness.

--------------------------------------
The same API can be made a bit better:


def step():
  with RunTimeBreakdown('/foo/bar') as e:
    with e.SubStep('mycomplexstep1'):
      mycomplexstep1(e)
  
    with e.SubStep('somelaterstep'):
      somelaterstep()


def mycomplexstep1(e):
  with e.SubStep('complexstep1_part1'):
    complexstep1_part1()

  with e.SubStep('complexstep1_part2'):
    complexstep1_part2()
Re Gauge vs Distribution: I was thinking distribution because what we actually want to look at is probably something like: what is the 90th percentile of the fraction of time spent in cleanup during a tick?

Basically, each individual % breakdown matters less to me than the distribution showing where the time is mostly going.

I can be convinced otherwise.
How do you get % breakdowns that add up to 100% if the substeps aren't mutually exclusive?

I don't think we should make this more complicated than the bare necessities.

I think if we use this, we should enforce by convention that the timer context e is never passed into interior functions. Too much room for confusion.


Cc: sergeybe...@chromium.org dsansome@chromium.org ayatane@chromium.org
+some other metrics people

Have you come up with a favorite way to measure both the runtime and the runtime breakdown of operations?
Re #4: Agreed about simplicity. I'll eat my words and follow your API recommendation.
Let's wait a couple days to see what people here think.
> How do you get % breakdowns that add up to 100% if the substeps aren't mutually exclusive?

It might not matter that the substeps don't add up to 100% if you also report a separate "total" duration, rather than calculating the total by summing the duration of all the substeps in Monarch.

As an aside: this sounds useful - please consider adding these decorators to infra_libs, and naming the metric without a /chromeos/ prefix, so they can be used elsewhere as well.
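
A rough illustration of the "report a separate total" idea above (the numbers and the report_distribution() stand-in are hypothetical):

def report_distribution(name, value):
  # Hypothetical stand-in for adding a sample to a ts_mon distribution metric.
  print('%s: %f' % (name, value))

total_seconds = 12.0                                       # measured total
substep_seconds = {'cleanup': 3.0, 'handle_agents': 9.6}   # may overlap

report_distribution('/foo/bar/duration', total_seconds)
for name, elapsed in substep_seconds.items():
  # 25% and 80% respectively; they sum to more than 100% and that's fine,
  # because the total is reported separately rather than derived in Monarch.
  report_distribution('/foo/bar/substep/%s/percent' % name,
                      100.0 * elapsed / total_seconds)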
Here for the bikeshedding.  I think it should be:

def step():
  with RunTimeBreakdown('/foo/bar') as e:
    with e.SubStep('mycomplexstep1'):
      mycomplexstep1()
  
    with e.SubStep('somelaterstep'):
      somelaterstep()

SubStep can enforce the no-nesting constraint by keeping state on __enter__/__exit__ (see the sketch at the end of this comment).  If we do want a more detailed breakdown, I think it's better to nest the RunTimeBreakdown.
def step():
  with RunTimeBreakdown('/foo/bar') as e:
    with e.SubStep('mycomplexstep1'):
      mycomplexstep1()
  
    with e.SubStep('somelaterstep'):
      somelaterstep()

def mycomplexstep1():
  with RunTimeBreakdown('/foo/bar/nested') as e:
    with e.SubStep('1'):
      substep()
  
    with e.SubStep('2'):
      substep()

This will make the breakdowns easier to reason about.
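
A minimal sketch of the no-nesting enforcement mentioned above (hypothetical code, not the committed chromite implementation; timing and reporting are elided):

import contextlib

class RunTimeBreakdown(object):
  def __init__(self, name):
    self._name = name
    self._inside_substep = False

  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_value, exc_tb):
    return False

  @contextlib.contextmanager
  def SubStep(self, name):
    # State kept across SubStep's __enter__/__exit__ forbids nesting.
    if self._inside_substep:
      raise RuntimeError('SubStep %r started inside another SubStep' % name)
    self._inside_substep = True
    try:
      yield
    finally:
      self._inside_substep = False

# Usage: the inner SubStep below would raise RuntimeError.
#   with RunTimeBreakdown('/foo/bar') as r:
#     with r.SubStep('outer'):
#       with r.SubStep('inner'):
#         pass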
If we go with #8, a typical use case (a tick() function that calls out to other handlers) gets a lot less readable.

def tick():
  with RunTimeBreakdown('foo') as r:
    r.Step('bar')
    bar()

    r.Step('baz')
    baz()

    r.Step('qux')
    qux()

vs 

def tick():
  with RunTimeBreakdown('foo') as r:
    with r.Step('bar'):
      bar()

    
    with r.Step('baz'):
      baz()

    with r.Step('qux'):
      qux()
I think in actual code the flat version will be less readable: you will have a lot more lines and consequently a lot more sections, some of which might start with r.Step(), and someone is going to put an r.Step() call in the middle of some other lines without a nice blank line to separate it.  Then someone will unintentionally start putting r.Step() calls into if blocks but not in the else blocks, and so on.

In summary, I think there's more potential for abuse in the flat design, whereas the second level of with statements enforces some basic sanity.  It will also encourage splitting the steps out into separate functions instead of creating one long 1000-line function, which can only be a good thing.
I agree with #10.
Another way to think of this: if a step begins with an explicit r.Step() call but ends by default at the next r.Step() call, it is easy to unintentionally extend a step too far (because, say, an if condition that would have started the next step was not executed, as mentioned in #10). This will pollute the timing data in a way that is hard to notice or recover from.

I also vote for simplicity here -- we _will_ assume that steps are mutually exclusive. If you try to enter a step twice, it'll blow up.

with RuntimeBreakdown('foo') as r:
  with r.Step('outer'):
    with r.Step('inner'):  # kaboom! This blows up.
      do_something()


Re #7: I'll try to put this in ts_mon. Glad that it might help others.
Labels: -current-issue
@dsansome: There are a few problems adding this to ts_mon:

- ts_mon/common/metrics.py currently does not provide any contexts / decorators. 
- A time measurement value inherently has a unit of measurement (we use seconds). All the metrics in ts_mon/common/metrics.py are unit-agnostic, although you can label a metric with the unit it is in.


Philosophically, it seems like ts_mon wants to restrict itself to reporting of metrics from the client programs. It wants to stay out of the actual collection (doesn't care how the numbers are generated / what scale they are in).

What I'm trying to add is a convenient way to generate the specific timing metrics we want; ts_mon is a library for reporting the collected data. Adding this functionality to ts_mon would be a layering violation.
(There is no precedent for ts_mon providing any way to *collect* metrics.)

I'll go ahead with a chromite CL while we're discussing this.
Actually, ts_mon is specifically designed for ease of instrumentation, and hence, ease of data collection is in scope. It already provides one context - ScopedIncrementCounter: https://cs.chromium.org/chromium/infra/packages/infra_libs/infra_libs/ts_mon/common/helpers.py?q=scopedincrementcounter&sq=package:chromium&l=8

The only reason we didn't build more collection support is lack of resources and lack of user pressure (it was easy enough to just set the metrics directly).

So, IMHO this new step time context is perfectly in scope. OTOH, making it useful in general and not just for specific ChromeOS applications may be more work than you'd like to do. So, it's really up to you.

ts_mon metrics can be (and are) declared with appropriate units, see e.g. https://cs.chromium.org/chromium/src/third_party/chromite/lib/metrics.py?q=MetricsDataUnits&sq=package:chromium&l=180&dr=C

The units might not be propagating to the backend though (it's a bug somewhere in the pipeline). But once we migrate to the new backend API, they should appear properly.
I uploaded a CL for chromite: https://chromium-review.googlesource.com/#/c/415993/

Sadly, even if I add it to ts_mon, we can't use it from chromite:
The collector class would create metrics internally. Because chromite actually wraps around ts_mon metrics (see issue 670022 for context), any metrics generated internally within ts_mon would not play well with the chromite setup.

That said, I do agree that helpers.py would have been a good place for this in ts_mon.
Status: Started (was: Assigned)
Project Member

Comment 17 by bugdroid1@chromium.org, Dec 13 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/2879f73f6ff1f58636234cffe11fa3c3277c8fe4

commit 2879f73f6ff1f58636234cffe11fa3c3277c8fe4
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Dec 01 02:18:21 2016

metrics: Add RuntimeBreakdownTimer.

RuntimeBreakdownTimer is a new metric collection class that allows us to
time a given operation along with tracking the fraction of time spent
within sub-steps of the operation.

BUG= chromium:668288 
TEST=(new) unittests; manual metric test

Change-Id: I3632f15c9bbf3e3efff5901810a383f59f6b3666
Reviewed-on: https://chromium-review.googlesource.com/415993
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/2879f73f6ff1f58636234cffe11fa3c3277c8fe4/lib/metrics_unittest.py
[modify] https://crrev.com/2879f73f6ff1f58636234cffe11fa3c3277c8fe4/lib/metrics.py

Status: Fixed (was: Started)
Project Member

Comment 19 by bugdroid1@chromium.org, Dec 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7ab1ef65b4696b486849af98a0c7fe57b8e60e5d

commit 7ab1ef65b4696b486849af98a0c7fe57b8e60e5d
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Dec 15 21:43:22 2016

scheduler: Start reporting tick() time breakdown.

BUG= chromium:668288 
TEST=unittests, local scheduler run.

Change-Id: If1df494743cbcc396c130c10abf402b31ddb8f92
Reviewed-on: https://chromium-review.googlesource.com/420476
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/7ab1ef65b4696b486849af98a0c7fe57b8e60e5d/scheduler/monitor_db.py

Project Member

Comment 20 by bugdroid1@chromium.org, Feb 28 2017

Labels: merge-merged-release-R56-9000.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/7f2eef619506a97a506270cd9788c9d57156638f

commit 7f2eef619506a97a506270cd9788c9d57156638f
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Feb 28 01:11:09 2017

metrics: Add RuntimeBreakdownTimer.

RuntimeBreakdownTimer is a new metric collection class that allows us to
time a given operation along with tracking the fraction of time spent
within sub-steps of the operation.

BUG= chromium:668288 
BUG=chromium:693778
TEST=(new) unittests; manual metric test

Change-Id: I3632f15c9bbf3e3efff5901810a383f59f6b3666
Previous-Reviewed-on: https://chromium-review.googlesource.com/415993
(cherry picked from commit 3ff9e2e73e2975e3e10dc4cd187b6881b3149b63)
Reviewed-on: https://chromium-review.googlesource.com/447176
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>
Tested-by: Keith Haddow <haddowk@chromium.org>
Commit-Queue: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/7f2eef619506a97a506270cd9788c9d57156638f/lib/metrics_unittest.py
[modify] https://crrev.com/7f2eef619506a97a506270cd9788c9d57156638f/lib/metrics.py

Comment 21 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 22 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 23 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 25 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
