Improve the code for detecting regression spikes |
|||
Issue descriptionWhere: infra/appengine/findit/crash/detect_regression_range.py The current implementation of spike detection is excessively complicated, making the code harder to read than it needs to be. All this should be cleaned up. * nit: why "indexes" instead of "indices"? * at present GetSpikeIndexes doesn't use the y values at all, so it would be cleaner to simply pass the list of x values alone. * In fact, we don't actually care about the indices: we want pairs of the y values surrounding that index. This is currently done by a second pass with GetRegressionRangeFromSpike, but since noone cares about the actual indices, we should simply return those pairs together. (N.B., this higher-level fix obviates the previous bulletpoint, since for this change we will in fact need the y values.) * In general the exponential model for detecting time-series events is considered naive and is superseded by other models (e.g., standard-deviation based models). However, according to previous investigation it looks like the exponential model does better for our purposes of only detecting deviation from zero. We should make an explicit note of this, so that it is clear to future developers why we are using the exponential model. (N.B., for proposing actual improvements/changes to the model used, please file a separate ticket.)
,
Sep 7 2016
I don't know what you mean by "graphs". What "graphs" are there? Spike detection simply takes in a stream of events and tags certain events as "interesting"; there are no graphs involved... The indices here are meaningless; the only thing that matters is the relative ordering of events in the time series. Using indices confounds things by making them depend on how wide a window of the time series we look at, when the window starts compared to the events of interest, etc. All those things are arbitrary and have nothing to do with the problem in question, so depending on them introduces a lot of potential for bugs. Can you explicate why exactly you want the indices rather than the objects/events themselves?
,
Sep 8 2016
An initial CL for this refactoring is at https://codereview.chromium.org/2311393003/. A newer CL replaces the list of pairs (of objects with their values) and instead uses a list of objects together with a valuation function. This should help remove the need for so much munging with representations, and hopefully also helps clarify what I mean about not needing indices in order to get at all the information stored within each object/event. https://codereview.chromium.org/2325503002 (N.B., so far this new CL has only changed the GetSpikes function, it hasn't updated the unit tests nor updated DetectRegressionRange to avoid the additional munging; so it is not yet ready for review, unlike the previous CL.)
,
Sep 8 2016
,
Sep 9 2016
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra.git/+/27d3c91752defc8bf2556775462f4aa90419e74a commit 27d3c91752defc8bf2556775462f4aa90419e74a Author: wrengr <wrengr@google.com> Date: Fri Sep 09 20:16:24 2016 We avoid using array indices, to avoid potential off-by-one errors as well as to enable the possibility of streaming rather than holding onto the whole list in memory. Also, we change the API for getting spikes so that now it takes a list of objects together with a valuation function, rather than taking a list of pairs (of objects with their values). BUG= chromium:644406 Review-Url: https://codereview.chromium.org/2325503002 [modify] https://crrev.com/27d3c91752defc8bf2556775462f4aa90419e74a/appengine/findit/crash/crash_pipeline.py [modify] https://crrev.com/27d3c91752defc8bf2556775462f4aa90419e74a/appengine/findit/crash/detect_regression_range.py [modify] https://crrev.com/27d3c91752defc8bf2556775462f4aa90419e74a/appengine/findit/crash/test/detect_regression_range_test.py
,
Sep 9 2016
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra.git/+/27d3c91752defc8bf2556775462f4aa90419e74a commit 27d3c91752defc8bf2556775462f4aa90419e74a Author: wrengr <wrengr@google.com> Date: Fri Sep 09 20:16:24 2016 We avoid using array indices, to avoid potential off-by-one errors as well as to enable the possibility of streaming rather than holding onto the whole list in memory. Also, we change the API for getting spikes so that now it takes a list of objects together with a valuation function, rather than taking a list of pairs (of objects with their values). BUG= chromium:644406 Review-Url: https://codereview.chromium.org/2325503002 [modify] https://crrev.com/27d3c91752defc8bf2556775462f4aa90419e74a/appengine/findit/crash/crash_pipeline.py [modify] https://crrev.com/27d3c91752defc8bf2556775462f4aa90419e74a/appengine/findit/crash/detect_regression_range.py [modify] https://crrev.com/27d3c91752defc8bf2556775462f4aa90419e74a/appengine/findit/crash/test/detect_regression_range_test.py
,
Sep 9 2016
|
|||
►
Sign in to add a comment |
|||
Comment 1 by kateso...@chromium.org
, Sep 7 2016Actually the indices can better associate different graphs sharing the same indices, since the historical_metadata is like this: [ { "report_number": 0, "cpm": 0.0, "client_number": 0, "chrome_version": "51.0.2704.103" }, ... { "report_number": 10, "cpm": 2.1, "client_number": 8, "chrome_version": "53.0.2768.0" }, ] I think the best way is to simply pass a list x values, and return indices.