Make sure start_with_url.{cold,warm}.startup_pages are alerted on new bots
Issue description

With clank internal commit 0d21e9e8035b3a625c12b0665fac1e48da08c42d the runs on perf-clankium-l-{phone,tablet} moved to health-plan-clankium-{low-end,}-phone. Yay! I checked whether results on the non-low-end phone roughly match, and they do: https://chromeperf.appspot.com/report?sid=2168cf9630dba06b8d4ace08d36ecc228fbc006a7542ef637c98420d646b00c3 Great!

=== Request 1: For extra assurance I wanted to get a confirmation that the new graphs are alerted on. Sorry if it is trivial; I never remember where to look for the alert configuration, or whether the bot names are part of it. Here are the alerts we need:
Bots: health-plan-clankium-low-end-phone, health-plan-clankium-phone
Metrics on these bots: messageloop_start_time, foreground_tab_request_start, foreground_tab_load_complete.
Please confirm :)

=== Request 2: If not too difficult, please backfill the graphs for health-plan-clankium-phone with the data that previously came from perf-clankium-l-phone. This would reduce surprises when looking at the graphs and keep our history available for triaging problems and tracking improvements.

Thank you :) Assigning to sullivan@ for triage, out of ignorance mainly :/
May 2 2018
Simon, can you help out?
May 2 2018
ClankInternal/*/start_with_url.cold.startup_pages/*
ClankInternal/*/start_with_url.warm.startup_pages/*
This is the existing config for start_with_url.{cold,warm}.startup_pages; it looks like it doesn't care about the bot specifically, so your new ones are already covered. Confirmed by checking dev_console and pulling up a test path from health-plan-clankium-low-end-phone. These are alerting on the summary metric, though; I can switch them to alert on the individual pages like so:
ClankInternal/*/start_with_url.*.startup_pages/messageloop_start_time/*
ClankInternal/*/start_with_url.*.startup_pages/foreground_tab_request_start/*
ClankInternal/*/start_with_url.*.startup_pages/foreground_tab_load_complete/*
Let me know if you want to do that.
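For future readers, the reason the new bots are already covered is that the bot segment in these patterns is a wildcard. As a rough illustration of segment-wise wildcard matching (an approximation for this thread, not the dashboard's actual matching code):

```python
import fnmatch

def test_path_matches(pattern, test_path):
    # Sketch only: each '/'-separated segment of the pattern must match the
    # corresponding segment of the test path, with '*' as a wildcard.
    pattern_parts = pattern.split('/')
    path_parts = test_path.split('/')
    if len(pattern_parts) != len(path_parts):
        return False
    return all(fnmatch.fnmatch(part, pat)
               for part, pat in zip(path_parts, pattern_parts))

# The existing pattern covers the new bot because the bot segment is '*':
print(test_path_matches(
    'ClankInternal/*/start_with_url.cold.startup_pages/*',
    'ClankInternal/health-plan-clankium-low-end-phone/'
    'start_with_url.cold.startup_pages/messageloop_start_time'))  # True
```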
May 2 2018
+1 to alert on individual pages rather than aggregates.
May 2 2018
Very helpful, thanks! Summary metrics for messageloop_start_time and foreground_tab_request_start are WAI, but the foreground_tab_load_complete would be better off per-page. The effect of it would probably be tiny, so if it is more than 15 minutes of your time, let's not bother.
May 3 2018
Ok, switched the alerts over for these start_with_url tests:
ClankInternal/*/start_with_url.*.startup_pages/messageloop_start_time
ClankInternal/*/start_with_url.*.startup_pages/foreground_tab_request_start
ClankInternal/*/start_with_url.*.startup_pages/foreground_tab_load_complete/*
So they'll alert on messageloop_start_time, foreground_tab_request_start, and per-page on foreground_tab_load_complete. You also wanted the data migrated?
May 3 2018
> they'll alert on messageloop_start_time, foreground_tab_request_start, and per-page on foreground_tab_load_complete.

thank you!

> You also wanted the data migrated?

yes please :)
May 3 2018
Ok data migration is underway, will probably be done in 20-30 mins.
May 3 2018
Jul 13
I would like to reopen this bug as P1 just to keep the context close. Let me know if creating a new bug is better and I'll do it then. I am seeing a clear regression on May 17 on this graph: https://chromeperf.appspot.com/group_report?bug_id=800750 There seems to be no alert, while the metric seems to match the pattern. Simon, can you please take a look?
Jul 13
You probably meant the regression here? https://chromeperf.appspot.com/report?sid=904c528280a585089876ffab1f63c1b7cb07f0ba2aeaf453d46d4427e5db5884&start_rev=1522902549&end_rev=1531469218
Jul 13
perezju: oh, sorry, I pasted the wrong link; thank you for the correct one.
Jul 13
So, poking at this a bit: pulling up the TestMetadata in dev_console shows that it does have a sheriff. I went through the backlog of changes around that time and couldn't find any related sheriff test path changes. I brought it up in /debug_alert, and with the default alerting settings no alert actually shows up. Not until I changed the min_steppiness to 0.45 did an alert show up:
https://chromeperf.appspot.com/debug_alert?test_path=ClankInternal%2Fhealth-plan-clankium-phone%2Fstart_with_url.warm.startup_pages%2Fforeground_tab_request_start&rev=1526517330&num_before=300&num_after=300&config=%7B%0D%0A++%22min_steppiness%22%3A+0.40000000000000002%0D%0A%7D
So, minimally, we should probably assign an anomaly config here with updated params. Additionally, I'm a little concerned about this; I'm wondering if we should set aside some time to look at the regression detection algorithm and its default parameters, and see if we can do better.
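(In case it's useful later: the config in that URL is just URL-encoded JSON, so it's easy to rebuild the link with a different threshold. A small sketch below; the query parameter names are copied from the links in this thread, and nothing here is verified against the dashboard code.)

```python
import json
import urllib.parse

# Rebuild a /debug_alert link with a different min_steppiness override.
def debug_alert_url(test_path, rev, min_steppiness, num_before=300, num_after=300):
    params = {
        'test_path': test_path,
        'rev': rev,
        'num_before': num_before,
        'num_after': num_after,
        'config': json.dumps({'min_steppiness': min_steppiness}),
    }
    return ('https://chromeperf.appspot.com/debug_alert?' +
            urllib.parse.urlencode(params))

print(debug_alert_url(
    'ClankInternal/health-plan-clankium-phone/'
    'start_with_url.warm.startup_pages/foreground_tab_request_start',
    rev=1526517330, min_steppiness=0.4))
```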
Jul 18
Thank you for the link to the anomaly detection debugger! I played around with a bigger range. min_steppiness=0.4 is pretty good: it sometimes flags things below 50ms (we won't be able to bisect those), but not too often. The value 0.3 matches my visual intuition better, but we may spend too much time bisecting. Here is my take at 0.4 (slow to load):
https://chromeperf.appspot.com/debug_alert?test_path=ClankInternal%2Fhealth-plan-clankium-phone%2Fstart_with_url.warm.startup_pages%2Fforeground_tab_request_start&rev=1526517330&num_before=3000&num_after=600&config=%7B%0D%0A++"min_steppiness"%3A+0.3%0D%0A%7D
dtu: should we switch to this value, or are there other parameters to tweak? Is there a way to discover the current values used for alerting?
Jul 19
`min_steppiness` is the major sensitivity parameter. In some cases `multiple_of_std_dev` might be relevant, but it looks like steppiness is the limiting one here. `min_steppiness` is 0.5 by default. You can find the default parameter values here: https://chromium.googlesource.com/catapult.git/+/HEAD/dashboard/dashboard/find_change_points.py#38
-> simonhatch@ to update the alerting config.
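To give a feel for what that parameter gates: roughly, steppiness measures how much better a two-level step fits the recent series than a single flat line, and an alert only fires if it clears min_steppiness. The toy score below is only an illustration of that idea, not the formula find_change_points.py actually uses:

```python
import statistics

def steppiness(values, split):
    # Toy score in [0, 1]: relative error reduction from fitting a step
    # (two flat segments split at `split`) instead of one flat line.
    # Illustration only -- not catapult's actual formula.
    def sse(seg):
        mean = statistics.fmean(seg)
        return sum((v - mean) ** 2 for v in seg)
    total_err = sse(values)
    if total_err == 0:
        return 0.0
    step_err = sse(values[:split]) + sse(values[split:])
    return 1.0 - step_err / total_err

# A clean ~10% step scores very high (~0.96), so it clears either 0.4 or the
# 0.5 default; noisier data without a clear step scores much lower.
series = [100, 101, 99, 100, 102, 110, 111, 109, 110, 112]
print(round(steppiness(series, split=5), 2))
```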
Jul 25
Sorry, this got pushed down my list. Done.
Aug 13