metrics: Add a new metric to report partitioned runtime of functions.
Issue description

We have cases where we report the distribution of runtime for an important function. Furthermore, we'd like to know what % of time was spent in each of the sub-steps of this function. For example:
- scheduler tick: https://chromium-review.googlesource.com/#/c/413610/4/
- scheduler db cleanup: https://chromium-review.googlesource.com/#/c/413633/1

We should add a new metric type that does this for us:
- As a decorator / context, it measures the overall time taken and reports it as a distribution.
- Additionally, it provides two functions: substep_start(name) and substep_end(name).
- It also collects the time taken by each (not necessarily mutually exclusive) substep.
- Finally, it also reports, for each substep separately, a distribution metric (0-100 range) of the % of time spent in that substep.
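A minimal sketch of the arithmetic this metric would perform (the function name and the 'duration' / 'substep/<name>/percent' suffixes below are illustrative assumptions, not an existing API):

def runtime_breakdown(total_seconds, substep_seconds):
    """Compute what the proposed metric would report for one run.

    Returns the overall duration plus, for each substep, the percentage of
    the total runtime (a 0-100 value) that would feed that substep's
    distribution. Substeps need not be mutually exclusive, so the
    percentages are not required to sum to 100.
    """
    values = {'duration': total_seconds}
    for name, seconds in substep_seconds.items():
        percent = 100.0 * seconds / total_seconds if total_seconds else 0.0
        values['substep/%s/percent' % name] = percent
    return values

# Example: a 10s tick that spent 4s in db cleanup and 5s scheduling jobs.
print(runtime_breakdown(10.0, {'db_cleanup': 4.0, 'schedule': 5.0}))
# -> {'duration': 10.0, 'substep/db_cleanup/percent': 40.0,
#     'substep/schedule/percent': 50.0}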
Nov 23 2016
API question:
The decision here is: do we want to enforce the steps to be mutually exclusive?
By not enforcing this requirement, you are free to do:
def step():
    with RunTimeBreakdown('/foo/bar') as e:
        e.StartStep('mycomplexstep1')
        mycomplexstep1(e)
        e.EndStep('mycomplexstep1')
        e.StartStep('somelaterstep')
        somelaterstep()
        e.EndStep('somelaterstep')

def mycomplexstep1(e):
    e.StartStep('complexstep1_part1')
    complexstep1_part1()
    e.EndStep('complexstep1_part1')
    e.StartStep('complexstep1_part2')
    complexstep1_part2()
    e.EndStep('complexstep1_part2')
------------------------------------
Note that this use case is not just theoretical.
There are complicated substeps in a scheduler tick where you may want the fraction of time for both the outer substep and its own constituents.
Secondly, this saves us from gotchas where we thought the steps were mutually exclusive, but were wrong. I think the metric shouldn't depend on that assumption for correctness.
--------------------------------------
The same API can be made a bit better:
def step():
    with RunTimeBreakdown('/foo/bar') as e:
        with e.SubStep('mycomplexstep1'):
            mycomplexstep1(e)
        with e.SubStep('somelaterstep'):
            somelaterstep()

def mycomplexstep1(e):
    with e.SubStep('complexstep1_part1'):
        complexstep1_part1()
    with e.SubStep('complexstep1_part2'):
        complexstep1_part2()
Nov 23 2016
Re Gauge vs Distribution: I was thinking distribution because what we actually want to look at is things like: what is the 90th percentile of the fraction of time spent in cleanup during a tick? Basically, each individual % breakdown matters less to me than the distribution of where the time is mostly going. I can be convinced otherwise.
Nov 23 2016
How do you get % breakdowns that add up to 100% if the substeps aren't mutually exclusive? I don't think we should make this more complicated than the bare necessities. If we use this, I think we should enforce by convention that the timer context e is never passed into interior functions. Too much room for confusion.
Nov 23 2016
+some other metrics people: Have you come up with a favorite way to measure both the runtime and the runtime breakdown of operations?
Nov 24 2016
Re #4: Agreed about simplicity. I'll eat my words and follow your API recommendation. Let's wait a couple days to see what people here think.
Nov 24 2016
> How do you get % breakdowns that add up to 100% if the substeps aren't mutually exclusive?

It might not matter that the substeps don't add up to 100% if you also report a separate "total" duration, rather than calculating the total in Monarch by summing the durations of all the substeps.

As an aside: this sounds useful - please consider adding these decorators to infra_libs, and naming the metric without a /chromeos/ prefix, so they can be used elsewhere as well.
Nov 24 2016
Here for the bikeshedding. I think it should be:
def step():
    with RunTimeBreakdown('/foo/bar') as e:
        with e.SubStep('mycomplexstep1'):
            mycomplexstep1()
        with e.SubStep('somelaterstep'):
            somelaterstep()
SubStep can enforce the no-nesting constraint by keeping state on __enter__/__exit__ (a sketch of that is at the end of this comment). If we do want a more detailed breakdown, I think it's better to nest the RunTimeBreakdown:
def step():
    with RunTimeBreakdown('/foo/bar') as e:
        with e.SubStep('mycomplexstep1'):
            mycomplexstep1()
        with e.SubStep('somelaterstep'):
            somelaterstep()

def mycomplexstep1():
    with RunTimeBreakdown('/foo/bar/nested') as e:
        with e.SubStep('1'):
            substep()
        with e.SubStep('2'):
            substep()
This keeps the breakdowns easy to reason about.
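To illustrate the __enter__/__exit__ bookkeeping mentioned above, here is a minimal sketch of how SubStep could enforce the no-nesting constraint. This is purely illustrative -- the class below is a stand-in, not the eventual chromite implementation. Nesting whole RunTimeBreakdowns (as suggested above) stays legal; only nesting SubSteps within one breakdown raises.

import contextlib
import time

class RunTimeBreakdown(object):
    """Illustrative sketch only: times an operation and its substeps."""

    def __init__(self, name):
        self._name = name
        self._in_step = None   # Name of the currently open substep, if any.
        self._durations = {}   # Substep name -> seconds spent in it.

    def __enter__(self):
        self._start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._total = time.time() - self._start
        # A real implementation would report self._total and the per-substep
        # percentages to the monitoring pipeline here.
        return False

    @contextlib.contextmanager
    def SubStep(self, name):
        if self._in_step is not None:
            raise RuntimeError('SubStep %r started inside SubStep %r'
                               % (name, self._in_step))
        self._in_step = name
        start = time.time()
        try:
            yield
        finally:
            self._durations[name] = time.time() - start
            self._in_step = None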
Nov 24 2016
If we go with #8, a typical use case (a tick() function that calls out to other handlers) gets a lot less readable:
def tick():
    with RunTimeBreakdown('foo') as r:
        r.Step('bar')
        bar()
        r.Step('baz')
        baz()
        r.Step('qux')
        qux()
vs
def tick():
    with RunTimeBreakdown('foo') as r:
        with r.Step('bar'):
            bar()
        with r.Step('baz'):
            baz()
        with r.Step('qux'):
            qux()
Nov 24 2016
I think in actual code the flat version will be less readable: you will have a lot more lines and consequently a lot more sections, some of which might start with r.Step(), and someone is going to put an r.Step() call in the middle of some other lines without a blank line to separate it. Then someone will unintentionally start putting r.Step() calls into if blocks but not in the else blocks, and so on.

In summary, I think there's more potential for abuse with the flat design, whereas the second level of with statements enforces some basic sanity. It will also encourage splitting the steps out into separate functions instead of creating one long 1000-line function, which can only be a good thing.
Nov 24 2016
I agree with #10.
Another way to think of this: if a step begins with an explicit r.Step() call but ends by default at the next r.Step() call, it is easy to unintentionally extend a step too far (because, say, an if condition that would have started the next step was not executed, as mentioned in #10). This will pollute the timing data in a way that will be hard to notice or recover from.
I also vote for simplicity here -- we _will_ assume that steps are mutually exclusive. If you try to start a step while another is still open, it'll blow up:
with RuntimeBreakdown('foo') as r:
    with r.Step('outer'):
        with r.Step('inner'):  # Kaboom! This blows up.
            do_something()
Re #7: I'll try to put this in ts_mon. Glad that it might help others.
Dec 1 2016
@dsansome: There are a few problems with adding this to ts_mon:
- ts_mon/common/metrics.py currently does not provide any contexts / decorators.
- A time measurement inherently carries a unit (we use seconds), but all the metrics in ts_mon/common/metrics.py are unit agnostic, although you can label a metric with the unit it is in.

Philosophically, it seems like ts_mon wants to restrict itself to reporting metrics from client programs; it stays out of the actual collection (it doesn't care how the numbers are generated or what scale they are in). What I'm trying to build is a convenient way to generate the specific timing metrics we want, while ts_mon is a library for reporting the collected data. Adding this functionality to ts_mon would be a layering violation (there is no precedent for ts_mon providing any way to *collect* metrics).

I'll go ahead with a chromite CL while we're discussing this.
Dec 1 2016
Actually, ts_mon is specifically designed for ease of instrumentation, and hence ease of data collection is in scope. It already provides one context - ScopedIncrementCounter: https://cs.chromium.org/chromium/infra/packages/infra_libs/infra_libs/ts_mon/common/helpers.py?q=scopedincrementcounter&sq=package:chromium&l=8

The only reason we didn't build more collection support is lack of resources and lack of user pressure (it was easy enough to just set the metrics directly). So, IMHO, this new step-time context is perfectly in scope. OTOH, making it useful in general and not just for specific ChromeOS applications may be more work than you'd like to do. So, it's really up to you.

ts_mon metrics can be (and are) declared with appropriate units, see e.g. https://cs.chromium.org/chromium/src/third_party/chromite/lib/metrics.py?q=MetricsDataUnits&sq=package:chromium&l=180&dr=C

The units might not be propagating to the backend though (it's a bug somewhere in the pipeline), but once we migrate to the new backend API they should appear properly.
Dec 2 2016
I uploaded a CL for chromite: https://chromium-review.googlesource.com/#/c/415993/

Sadly, even if I add it to ts_mon, we can't use it from chromite: the collector class would create metrics internally. Because chromite actually wraps around ts_mon metrics (see issue 670022 for context), any metrics generated internally within ts_mon would not play well with the chromite setup. That said, I do agree that helpers.py would have been a good place for this in ts_mon.
Dec 13 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/2879f73f6ff1f58636234cffe11fa3c3277c8fe4

commit 2879f73f6ff1f58636234cffe11fa3c3277c8fe4
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Dec 01 02:18:21 2016

metrics: Add RuntimeBreakdownTimer.

RuntimeBreakdownTimer is a new metric collection class that allows us to time a given operation along with tracking the fraction of time spent within sub-steps of the operation.

BUG= chromium:668288
TEST=(new) unittests; manual metric test

Change-Id: I3632f15c9bbf3e3efff5901810a383f59f6b3666
Reviewed-on: https://chromium-review.googlesource.com/415993
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/2879f73f6ff1f58636234cffe11fa3c3277c8fe4/lib/metrics_unittest.py
[modify] https://crrev.com/2879f73f6ff1f58636234cffe11fa3c3277c8fe4/lib/metrics.py
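For reference, a usage sketch of how the new timer might read at a call site. The commit message above doesn't show the method names or metric paths, so the Step() context manager, the metric name, and the helper functions below are assumptions based on the API discussion earlier in this bug, not confirmed chromite signatures.

# Assumes chromite.lib.metrics is importable as `metrics`; the metric name
# and the helper functions are hypothetical placeholders.
def tick():
    with metrics.RuntimeBreakdownTimer('chromeos/autotest/scheduler/tick') as timer:
        with timer.Step('db_cleanup'):
            run_db_cleanup()
        with timer.Step('schedule_new_jobs'):
            schedule_new_jobs()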
Dec 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7ab1ef65b4696b486849af98a0c7fe57b8e60e5d

commit 7ab1ef65b4696b486849af98a0c7fe57b8e60e5d
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Dec 15 21:43:22 2016

scheduler: Start reporting tick() time breakdown.

BUG= chromium:668288
TEST=unittests, local scheduler run.

Change-Id: If1df494743cbcc396c130c10abf402b31ddb8f92
Reviewed-on: https://chromium-review.googlesource.com/420476
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/7ab1ef65b4696b486849af98a0c7fe57b8e60e5d/scheduler/monitor_db.py
Feb 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/7f2eef619506a97a506270cd9788c9d57156638f

commit 7f2eef619506a97a506270cd9788c9d57156638f
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Feb 28 01:11:09 2017

metrics: Add RuntimeBreakdownTimer.

RuntimeBreakdownTimer is a new metric collection class that allows us to time a given operation along with tracking the fraction of time spent within sub-steps of the operation.

BUG= chromium:668288
BUG=chromium:693778
TEST=(new) unittests; manual metric test

Change-Id: I3632f15c9bbf3e3efff5901810a383f59f6b3666
Previous-Reviewed-on: https://chromium-review.googlesource.com/415993
(cherry picked from commit 3ff9e2e73e2975e3e10dc4cd187b6881b3149b63)
Reviewed-on: https://chromium-review.googlesource.com/447176
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>
Tested-by: Keith Haddow <haddowk@chromium.org>
Commit-Queue: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/7f2eef619506a97a506270cd9788c9d57156638f/lib/metrics_unittest.py
[modify] https://crrev.com/7f2eef619506a97a506270cd9788c9d57156638f/lib/metrics.py
Comment 1 by akes...@chromium.org, Nov 23 2016

A nicer API might be:

with metrics.RunTimeBreakdown('/foo/bar') as e:
    e.StartStep('step1')
    step1()
    e.StartStep('step2')
    step2()

(instead of needing to call both _start and _end)

Also, do we want the % breakdowns to be a distribution, or a gauge? With gauge we theoretically lose a bit of resolution, but it will be a lot easier to graph / work with.