Perf Dashboard: regression mail was delayed by more than 2 weeks for a perf regression
Issue description

Chrome version: 68.0.3440.84 (Linux)
URL: https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICQw6fBjQgM

Please copy and paste any errors from JavaScript console (Ctrl+Shift+J to open):

Please describe the problem:
We have a question (potentially a bug) about regression alert mail. Recently we have found that some regression mails are sent a couple of weeks after the regression took place. For example:

https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICQw6fBjQgM

Test: ChromeOSVideo/cros-squawks/video_PlaybackPerf/hw_video_cpu_usage_h264_1080p
Value: 35.08
Point ID: 35010001090600000
Time added: 2018-07-24T23:20:45.000Z
ChromeOS Version range: 70.10905.0.0 - 70.10906.0.0
Chrome Version range: 70.0.3500.0 - 70.0.3501.0

The point was added on July 24, but we (chromeos-video-alerts@) received the mail reporting it on Aug 8 at 6:23 AM (UTC+8), a delay of more than 2 weeks.
https://groups.google.com/a/google.com/forum/#!topic/chromeos-video-alerts/dQiNovTEa6A

We would like to know whether this is a report-delay issue, or whether there is a configuration that would reduce the delay. Thanks a lot.
Aug 13
According to https://sponge.corp.google.com/invocations?searchFor=video_PlaybackPerf.h264%20squawks%20R70%20after%3A2018-07-24%20before%3A2018-07-25, the test results were uploaded to chromeperf right after the tests were executed. The delay seems to come mainly from chromeperf anomaly detection. Annie, do you have any insight into this issue?

https://screenshot.googleplex.com/aJJxwe1S1aK (from crosboltv2)
Aug 13
Reassigning to TL simonhatch
Aug 14
This seems to be out of scope for TE because it relates to the perf dashboard. Adding the TE-NeedsTraige-help label to move it out of our triaging bucket. Thanks!
Aug 14
Hmm this seems to happen periodically but we've never figured out why. Often by the time it's reported, we have trouble going back that far in the logs. I'll see if I can add some additional logging and a notification to the team when this happens so we can dig into this better.
Aug 14
Actually, I've been digging into this and I think I know what's wrong here. Although it's not visible in the UI, the alert entity itself was generated on Aug 8, the day you were emailed, so there was no delay in sending the mail.

As for why it alerted so late: from what I can see, it was having trouble getting a positive signal for the alert. New points were streaming in both before and after the alert point, and some of them were huge spikes that threw off the alerting algorithm, making it wait for more points before generating the alert.

https://dev-simonhatch-193e9338-dot-chromeperf.appspot.com/debug_alert?test_path=ChromeOSVideo%2Fcros-squawks%2Fvideo_PlaybackPerf%2Fhw_video_cpu_usage_h264_1080p&rev=35010001090600000&num_before=200&num_after=30&config=%7B%7D
This is a version of the dashboard with the spikes removed; with default parameters you can see that the alert fired pretty quickly.

https://chromeperf.appspot.com/debug_alert?test_path=ChromeOSVideo%2Fcros-squawks%2Fvideo_PlaybackPerf%2Fhw_video_cpu_usage_h264_1080p&rev=35010001090600000&num_before=200&num_after=30&config=
Here's the untouched data; even with quite a few data points, no alert is generated yet.

We could tweak the parameters a bit so they work better in this case, but I think those kinds of spikes will always throw off the current implementation.
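To illustrate why spikes can postpone an alert, here is a minimal sketch of a toy step detector. It is not the dashboard's actual algorithm: the score (difference of segment means divided by the larger segment standard deviation), the threshold of 3.0, and all sample values are assumptions made up for this illustration.

```python
from statistics import mean, stdev

# Hypothetical threshold for this toy detector, not a dashboard default.
THRESHOLD = 3.0


def step_score(before, after):
    """Difference of segment means, scaled by the larger segment spread."""
    diff = abs(mean(after) - mean(before))
    spread = max(stdev(before), stdev(after), 1e-9)
    return diff / spread


def is_step(before, after, threshold=THRESHOLD):
    """Return True when the shift in mean stands out from the noise."""
    return step_score(before, after) > threshold


# Made-up CPU-usage values: a clean step from about 30 to about 35.
clean_before = [30.0, 30.5, 29.5, 30.2, 29.8, 30.1, 29.9, 30.3, 29.7, 30.0]
clean_after = [35.0, 35.2, 34.8, 35.1, 34.9]

# The same series with a few huge spikes on both sides of the step.
spiky_before = [30.0, 30.5, 29.5, 60.0, 29.8, 30.1, 29.9, 62.0, 29.7, 30.0]
spiky_after = [35.0, 70.0, 34.8, 35.1, 34.9]

print(is_step(clean_before, clean_after))  # the clean step is flagged
print(is_step(spiky_before, spiky_after))  # the spikes hide the same step
```

In this toy model the spikes inflate the spread far more than they move the means, so the same step no longer clears the threshold until many more ordinary points arrive to dilute them.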
Aug 15
Thanks for the clear explanation. Since spikes can be treated as outliers, what we care about more is the step-change point, and the alert latency makes it harder for us to investigate. It would be great if we could shorten the latency. You mentioned that we could improve this case by tweaking parameters; is there anything we need to provide for you to do this? Thanks.
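One way to treat spikes as outliers, sketched here purely as an illustration, is to drop points that sit far from the segment median, measured in units of the median absolute deviation (MAD). The helper name `drop_spikes`, the cutoff `k=5.0`, and the sample values are all hypothetical; nothing here is taken from the dashboard code.

```python
from statistics import median


def drop_spikes(values, k=5.0):
    """Keep only points within k MADs of the median of `values`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to measure against, so leave the data untouched.
        return list(values)
    return [v for v in values if abs(v - med) <= k * mad]


# Made-up segment with two huge spikes among values near 30.
segment = [30.0, 30.5, 29.5, 60.0, 29.8, 30.1, 29.9, 62.0, 29.7, 30.0]
filtered = drop_spikes(segment)
print(filtered)
```

A filter like this would have to be applied to each side of a candidate change point separately. Run over a window that spans the step, it could discard the shifted points themselves as outliers.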
Aug 15
+dtu for ideas.

Playing with some values, like max_window_size=30, seems to get the alert considerably faster. It might also result in some spurious alerts, though. We could create a new anomaly config for you that overrides the defaults.
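For anyone who wants to try such values on the debug page, the sketch below builds a debug_alert URL in the same shape as the links earlier in this thread, passing the override as URL-encoded JSON in the `config` query parameter. The helper name `debug_alert_url` is hypothetical, and how the page interprets a given config is an assumption based only on this thread.

```python
import json
from urllib.parse import urlencode

BASE = "https://chromeperf.appspot.com/debug_alert"


def debug_alert_url(test_path, rev, config, num_before=200, num_after=30):
    """Build a debug_alert URL with the config passed as encoded JSON."""
    query = urlencode({
        "test_path": test_path,
        "rev": rev,
        "num_before": num_before,
        "num_after": num_after,
        "config": json.dumps(config),
    })
    return f"{BASE}?{query}"


TEST_PATH = ("ChromeOSVideo/cros-squawks/video_PlaybackPerf/"
             "hw_video_cpu_usage_h264_1080p")

# Default parameters: an empty config encodes as %7B%7D, as in the link above.
default_url = debug_alert_url(TEST_PATH, 35010001090600000, {})

# The override mentioned in this comment.
override_url = debug_alert_url(
    TEST_PATH, 35010001090600000, {"max_window_size": 30})

print(default_url)
print(override_url)
```

Comparing the two pages side by side should show whether the smaller window fires earlier on this series and whether it adds spurious alerts elsewhere in the range.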
Oct 4
Removing myself since investigation is done.
Jan 11
Setting defect without priority to Pri-2.
Comment 1 by johnylin@chromium.org
, Aug 9