
Issue 872638

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: ----




Perf Dashboard: regression mail was delayed about 2 weeks for a perf regression

Project Member Reported by johnylin@google.com, Aug 9

Issue description

Chrome version: 68.0.3440.84 (Linux)
URL: https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICQw6fBjQgM

Please copy and paste any errors from JavaScript console (Ctrl+Shift+J to open):

Please describe the problem:

We have a question (potentially a bug report) about regression alert mail. Recently we have found that some regression mails are sent a couple of weeks after the regression took place. For example:

https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICQw6fBjQgM
Test: ChromeOSVideo/cros-squawks/video_PlaybackPerf/hw_video_cpu_usage_h264_1080p
Value: 35.08
Point ID: 35010001090600000
Time added: 2018-07-24T23:20:45.000Z
ChromeOS Version range: 70.10905.0.0 - 70.10906.0.0 
Chrome Version range: 70.0.3500.0 - 70.0.3501.0 

The point was added on July 24, but we (chromeos-video-alerts@) received the mail reporting it on Aug 8 at 6:23 AM (UTC+8), which is a delay of about 2 weeks.
https://groups.google.com/a/google.com/forum/#!topic/chromeos-video-alerts/dQiNovTEa6A
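
For reference, the delay can be worked out from the two timestamps quoted above. This is only a quick sketch: the mail's seconds field is not given, so 0 is assumed, and the result comes out just under 14 days.

```python
from datetime import datetime, timedelta, timezone

# "Time added" of the regressed point, as quoted above (UTC).
point_added = datetime(2018, 7, 24, 23, 20, 45, tzinfo=timezone.utc)

# Mail arrival: Aug 8, 6:23 AM at UTC+8. Seconds are not given; 0 is assumed.
utc_plus_8 = timezone(timedelta(hours=8))
mail_received = datetime(2018, 8, 8, 6, 23, 0, tzinfo=utc_plus_8)

# Timezone-aware datetimes subtract correctly across different UTC offsets.
delay = mail_received - point_added
print(delay)  # 13 days, 23:02:15
```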

We would like to know whether this is a potential issue with report delay, or whether there is a configuration that would reduce it.

Thanks a lot
 
Cc: hiroh@chromium.org wuchengli@chromium.org akahuang@chromium.org posciak@chromium.org
Cc: johnylin@chromium.org
Labels: Needs-Triage-M68
Labels: TE-NeedsTriageFromHYD
Cc: tedlai@chromium.org hctsai@chromium.org cywang@chromium.org
Owner: sullivan@chromium.org
From https://sponge.corp.google.com/invocations?searchFor=video_PlaybackPerf.h264%20squawks%20R70%20after%3A2018-07-24%20before%3A2018-07-25

The test results were uploaded to chromeperf right after the tests were executed. It seems the delay comes mainly from chromeperf anomaly detection. Annie, do you have any insight into the issue?



https://screenshot.googleplex.com/aJJxwe1S1aK from crosboltv2
Owner: simonhatch@chromium.org
Reassigning to TL simonhatch
Labels: -TE-NeedsTriageFromHYD TE-NeedsTraige-help
This seems to be out of scope for TE since it relates to the perf dashboard. Adding the TE-NeedsTraige-help label to move it out of our triage bucket.

Thanks..!

Hmm, this seems to happen periodically, but we've never figured out why. Often, by the time it's reported, we have trouble going back that far in the logs. I'll see if I can add some additional logging and a notification to the team when this happens so we can dig into it better.
Status: Started (was: Unconfirmed)
Actually, I've been digging into this and I think I know what's wrong here.

So although it's not visible in the UI, the alert entity itself was generated on Aug 8, the day you were emailed, so the mail itself was not delayed. As for why the alert fired so late: from what I can see, the algorithm was having trouble getting a positive signal for it.

New points were streaming in both before and after the alert point, and some of them were huge spikes that threw off the alerting algorithm, making it wait for more points before generating the alert.
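
To illustrate the effect described above, here is a minimal sketch of how a few spikes can hide a step change. It is NOT the dashboard's actual algorithm. The score, the threshold, the helper names (`step_score`, `drop_spikes`) and the data are all made up for illustration.

```python
from statistics import mean, median, stdev

THRESHOLD = 3.0  # hypothetical cutoff, not a dashboard setting


def step_score(before, after):
    """Gap between segment means, in units of the segments' average spread."""
    spread = (stdev(before) + stdev(after)) / 2
    return abs(mean(after) - mean(before)) / spread


def drop_spikes(values, k=5.0):
    """Keep values within k median-absolute-deviations of the median."""
    mid = median(values)
    mad = median([abs(v - mid) for v in values])
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - mid) <= k * mad]


# Synthetic series with a step from about 30 to about 35.
before = [30.0, 30.5, 29.5, 30.2, 29.8, 30.1, 29.9, 30.3]
after = [35.0, 35.2, 34.8, 35.1, 34.9, 35.3]

# Same series with one huge spike on each side of the step.
before_spiky = before[:3] + [90.0] + before[4:]
after_spiky = after[:2] + [95.0] + after[3:]

clean = step_score(before, after)
spiky = step_score(before_spiky, after_spiky)
filtered = step_score(drop_spikes(before_spiky), drop_spikes(after_spiky))

print(clean > THRESHOLD)     # step is obvious on clean data
print(spiky > THRESHOLD)     # spikes inflate the spread and hide the step
print(filtered > THRESHOLD)  # removing spikes restores the signal
```

In this toy setup the spikes inflate the spread so much that the same step no longer clears the threshold, which is one plausible way a detector ends up waiting for more points.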

From https://dev-simonhatch-193e9338-dot-chromeperf.appspot.com/debug_alert?test_path=ChromeOSVideo%2Fcros-squawks%2Fvideo_PlaybackPerf%2Fhw_video_cpu_usage_h264_1080p&rev=35010001090600000&num_before=200&num_after=30&config=%7B%7D

This is a version of the dashboard with the spikes removed, and with default parameters you can see that the alert fired pretty quickly.


https://chromeperf.appspot.com/debug_alert?test_path=ChromeOSVideo%2Fcros-squawks%2Fvideo_PlaybackPerf%2Fhw_video_cpu_usage_h264_1080p&rev=35010001090600000&num_before=200&num_after=30&config=

Here's the untouched data; even with quite a few data points, no alert is generated yet. We could tweak the parameters to work better in this case, but I think those kinds of spikes always throw off the current implementation.
Thanks for the clear explanation. Since spikes may be treated as outliers, what we care about more is the step change point; the alert latency makes it harder for us to investigate. It would be great if we could shorten the latency.

You mentioned that we could improve this case by tweaking parameters. Do we need to provide anything to you for this?

Thanks

 
Cc: dtu@chromium.org
+dtu for ideas

Playing with some values, e.g. max_window_size=30, seems to get the alert considerably faster, though it might also result in some spurious alerts. We could create a new anomaly config for you that overrides the defaults.
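
For anyone who wants to try values like this themselves, here is a rough sketch of building a debug_alert URL with a config override. It reuses the query parameters seen in the debug_alert URLs earlier in this thread. That `config` takes a JSON object keyed by names like `max_window_size` is an assumption based on `config=%7B%7D` (i.e. `{}`) in those URLs, not something confirmed here. The helper name `debug_alert_url` is made up.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://chromeperf.appspot.com/debug_alert"


def debug_alert_url(test_path, rev, config, num_before=200, num_after=30):
    """Build a debug_alert URL; config is sent as a JSON string (assumed format)."""
    query = urlencode({
        "test_path": test_path,
        "rev": rev,
        "num_before": num_before,
        "num_after": num_after,
        "config": json.dumps(config),
    })
    return BASE + "?" + query


test_path = ("ChromeOSVideo/cros-squawks/video_PlaybackPerf/"
             "hw_video_cpu_usage_h264_1080p")
rev = 35010001090600000

default_url = debug_alert_url(test_path, rev, {})
override_url = debug_alert_url(test_path, rev, {"max_window_size": 30})

# Round-trip check: the override survives URL encoding and decoding.
decoded = json.loads(parse_qs(urlsplit(override_url).query)["config"][0])
print(decoded)
```

With an empty config this produces the same query string as the debug_alert links above, so only the `config` value differs between the two URLs.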
Owner: ----
Status: Available (was: Started)
Removing myself since investigation is done.
Labels: Pri-2
Setting defect without priority to Pri-2.
