Where: infra/appengine/findit/crash/detect_regression_range.py
Right now we're using a very simple exponential smoothing model to detect spikes. Can we do any better?
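For concreteness, here is a minimal sketch of the kind of exponential-smoothing spike detection the bullets below are talking about. It is not the actual code in detect_regression_range.py; the function name, the (timestamp, cpm) event format, and the threshold cutoff are all made up for illustration.

    def find_spikes(events, alpha=0.1, threshold=10.0):
      """Yield timestamps whose cpm jumps well above the smoothed running mean.

      Sketch only: `events` is assumed to be an iterable of (timestamp, cpm)
      pairs, and `threshold` is a placeholder spike cutoff.
      """
      events = iter(events)
      try:
        # Initialize the running mean with the first observed value
        # (this is the initialization the first bullet below questions).
        _, running_mean = next(events)
      except StopIteration:
        return
      for timestamp, cpm in events:
        if cpm > threshold * max(running_mean, 1e-9):
          yield timestamp
        # Standard exponential smoothing update of the running mean.
        running_mean = alpha * cpm + (1 - alpha) * running_mean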
* For example, right now we initialize the running mean with the value of the first event in the time series; however, this places undue emphasis on where exactly we start viewing the time series from (i.e., where we truncate it to). This is especially problematic whenever alpha is very small. Is there a better way to initialize the mean? (E.g., we could try initializing the mean to zero, since that's what we want the mean to be. However, that could cause problems whenever a long-running issue keeps cpm above zero: the actual mean is then far from zero, so we'd generate a false-positive spike at the beginning of the time series. Then again, perhaps we don't care about that false positive, since right now we only look at the most recent spike anyway. One bias-correcting variant is sketched after this list.)
* The current implementation computes the mean in the naive/obvious way. Unfortunately, because of how IEEE 754 floating-point arithmetic behaves, that loses precision in many circumstances. Do we care about precision enough to fix this? Or is the infelicity insignificant for Findit's ultimate goals? Even if it is insignificant, should we fix it anyway (just for good hygiene, in case it matters in the future)? One standard remedy is sketched after this list.
* We could try other models. For example, for time-series analysis in general, a model based on standard deviations is often preferable to exponential smoothing. Apparently this was tried previously and found to work less well for our particular task (this is now documented; cf. http://crbugs.com/644406). But perhaps we should try again, or try other models; a sketch of the standard-deviation approach appears after this list.
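Re the first bullet: one way to avoid depending on the truncation point is to initialize the mean at zero but divide out the initialization bias (the same zero-debiasing trick used by optimizers like Adam). A sketch under made-up names, untested on our data:

    def debiased_running_means(values, alpha=0.1):
      """Yield a bias-corrected exponentially smoothed mean for each value."""
      biased_mean = 0.0
      weight = 0.0  # Total smoothing weight seen so far; converges to 1.
      for x in values:
        biased_mean = alpha * x + (1 - alpha) * biased_mean
        weight = alpha + (1 - alpha) * weight
        # Dividing by the accumulated weight removes the bias toward zero,
        # so early values aren't skewed by the arbitrary starting point.
        yield biased_mean / weight

With this scheme the first yielded value is exactly the first observation, and later values converge to the ordinary exponentially smoothed mean, so where we truncate the series matters much less.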
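Re the second bullet: if we decide the precision loss is worth fixing, the standard remedy for naive float summation is Kahan compensated summation (or an incremental/Welford-style update). A sketch, assuming the mean in question is an ordinary arithmetic mean over a sequence of floats:

    def kahan_mean(values):
      """Arithmetic mean computed via Kahan compensated summation."""
      total = 0.0
      compensation = 0.0  # Running estimate of the rounding error lost so far.
      count = 0
      for x in values:
        count += 1
        y = x - compensation
        t = total + y
        # (t - total) recovers the part of y that was actually added;
        # subtracting y leaves the low-order bits we would otherwise lose.
        compensation = (t - total) - y
        total = t
      return total / count if count else 0.0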
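Re the last bullet: the "standard deviations" family of models usually amounts to flagging points whose z-score against a running mean and variance exceeds some cutoff. A sketch of that idea using exponentially weighted moments (alpha and cutoff are placeholders), in case we want to re-run the comparison from the linked bug:

    import math

    def zscore_spikes(events, alpha=0.1, cutoff=3.0):
      """Yield timestamps whose cpm exceeds the mean by `cutoff` std devs."""
      mean = None
      var = 0.0
      for timestamp, cpm in events:
        if mean is None:
          mean = cpm  # Same first-value initialization caveat as above.
          continue
        std = math.sqrt(var)
        if std > 0 and (cpm - mean) / std > cutoff:
          yield timestamp
        # Exponentially weighted updates of the mean and variance.
        delta = cpm - mean
        mean += alpha * delta
        var = (1 - alpha) * (var + alpha * delta * delta)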
Comment 1 by wrengr@chromium.org, Oct 4 2016