No data received for system_health.memory_mobile from android-nexus7v2 since 427296
Issue description
This bot seems to be having trouble, but this test was failing even before then. It is also failing on the test build, but at a different time, so I will file a separate bug.
,
Nov 9 2016
,
Nov 10 2016
Looking into it. It appears that the benchmark has been pretty crashy on this bot. Strangely, this is also happening on the reference build.
,
Nov 10 2016
I've got a theory. Looking at:
https://chromeperf.appspot.com/report?sid=3815d8f8078c996fbdcfd9c0ce31f479e565940e6e74773e6854b214e95963ec&start_rev=424708&end_rev=428601
it appears that the data stoppage starts soon after the background stories were added to system health, on the 29th October.
But there _are_ more recent runs of the benchmark, e.g. this one from just a few minutes ago, which claims to have uploaded data to the dashboard:
  Sending result 2 of 2 to dashboard.
  {"is_ref": true, "test_suite_name": "system_health.memory_mobile", "master": "ChromiumPerf", "versions": {"webkit_rev": "431079", "chromium": "8ea079197144ef1b0f575fe3b3d167a6bf67102a", "commit_pos": 431079}, "point_id": 431079, ...
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4191/steps/system_health.memory_mobile.reference/logs/stdio
Could it be that after adding the new stories we're now sending more data and some piece of the infrastructure is unable to handle it?
The following also looks suspicious (from the same log above):
  C 2.735s Main ********************************************************************************
  C 2.735s Main Detailed Logs
  C 2.735s Main ********************************************************************************
  C 2.735s Main ********************************************************************************
  C 2.735s Main Summary
  C 2.735s Main ********************************************************************************
  C 2.735s Main [==========] 1 test ran.
  C 2.735s Main [ PASSED ] 0 tests.
  C 2.735s Main [ FAILED ] 1 test, listed below:
  C 2.735s Main [ FAILED ] PrintStep
  C 2.735s Main
  C 2.735s Main 1 FAILED TEST
  C 2.735s Main ********************************************************************************
,
Nov 10 2016
I'm seeing a similar data stoppage issue on internal bots. Here the data stops on 3rd November:
https://chromeperf.appspot.com/report?sid=1ca97f6aa3fe02826e9869f08330b7993225cda240cbca4e792184bb447e7a8a
But the health dashboard has data (just newly added) up until today:
https://clank-svelte-status.googleplex.com/system-health/android-chrome/memory/low-end-phone/
Annie, could you have a look to see whether it's something on the chromeperf end?
,
Nov 11 2016
,
Nov 11 2016
To check whether this could be a dashboard issue, we need to look at the chartjson. Since these builds are old, there are a few hoops. Here are the instructions; we're hoping to make this easier soon.
1) Open up a chart and look at the last data point. (Check the date: if there is new data, the alert should be marked recovered; file a bug if that doesn't happen.)
2) For all these, the last data point is commit pos 427296 from October 25, so the data really did stop.
3) Get the buildbot stdio link for the last data point. It's in the tooltip:
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4027/steps/system_health.memory_mobile.reference/logs/stdio
4) Shorten it to the build links for the last data point and the first build without the data point:
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4027
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4028
5) Scroll to find the chartjson links for system_health.memory_mobile. Since these are very old builds, we need to look at logdog.
Last run with data:
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus7v2_Perf__1_%2F4027%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.memory_mobile.reference%2F0%2Flogs%2Fjson.output%2F0
First run with no data:
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus7v2_Perf__1_%2F4028%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.memory_mobile.reference%2F0%2Flogs%2Fjson.output%2F0
Now we look at the chartjson.
I see a difference in what's being sent to the dashboard.
Before:
  "blank_about@@memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg": {
    "blank:about:blank": {
      "description": "effective size of cc in all processes in Chrome",
      "grouping_keys": { "case": "blank", "group": "about" },
      "important": false,
      "improvement_direction": "down",
      "name": "memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg",
      "page_id": 0,
      "std": 0.0,
      "tir_label": "blank_about",
      "type": "list_of_scalar_values",
      "units": "sizeInBytes",
      "values": [ 1149256 ]
    },
    "summary": {
      "description": "effective size of cc in all processes in Chrome",
      "grouping_keys": { "case": "blank", "group": "about" },
      "important": false,
      "improvement_direction": "down",
      "name": "memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg",
      "std": 0.0,
      "tir_label": "blank_about",
      "type": "list_of_scalar_values",
      "units": "sizeInBytes",
      "values": [ 1149256 ]
    }
  },
After, the summary metric is missing:
  "blank_about@@memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg": {
    "blank:about:blank": {
      "description": "effective size of cc in all processes in Chrome",
      "grouping_keys": { "case": "blank", "group": "about" },
      "important": false,
      "improvement_direction": "down",
      "name": "memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg",
      "page_id": 0,
      "std": 0.0,
      "tir_label": "blank_about",
      "type": "list_of_scalar_values",
      "units": "sizeInBytes",
      "values": [ 1149256 ]
    }
  },
So I think this is a problem with the test. Juan, can you take a look at the stdio in logdog before/after to see why it's no longer producing the summary?
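For anyone repeating this comparison on other builds, the before/after check can be scripted. The following is only a rough sketch: it assumes the two chartjson excerpts have already been downloaded and parsed into dicts shaped like the snippets above (chart name mapping to traces, with an optional "summary" trace). The function name `charts_missing_summary` and the trimmed sample data are made up for illustration and are not part of telemetry or the dashboard.

```python
# Hypothetical helper: find charts that lost their "summary" trace between
# two runs. The input shape mirrors the chartjson excerpts quoted above.
SUMMARY_KEY = "summary"


def charts_missing_summary(before, after):
    """Return sorted chart names that had a summary before but not after."""
    missing = []
    for chart_name, traces in before.items():
        if SUMMARY_KEY not in traces:
            continue  # No summary to lose in the first place.
        if SUMMARY_KEY not in after.get(chart_name, {}):
            missing.append(chart_name)
    return sorted(missing)


# Trimmed-down sample data based on the excerpts in this comment.
chart = ("blank_about@@memory:chrome:all_processes:"
         "reported_by_chrome:cc:effective_size_avg")
trace = {
    "type": "list_of_scalar_values",
    "units": "sizeInBytes",
    "values": [1149256],
}
before = {chart: {"blank:about:blank": dict(trace, page_id=0),
                  "summary": dict(trace)}}
after = {chart: {"blank:about:blank": dict(trace, page_id=0)}}

lost = charts_missing_summary(before, after)
print(lost)
```

Running it over the full chartjson from builds 4027 and 4028 should show whether every chart lost its summary or only some of them.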
,
Nov 11 2016
Ned, any idea why the summary metric may be missing?
,
Nov 11 2016
I don't know. Ben or Ethan can help with the summarization pipeline.
,
Nov 11 2016
Summarization fails when the test fails. The test is failing both before and after the data stoppage alert. Is someone looking into the test failure?
,
Nov 14 2016
Then I guess this is just issue 664505, i.e. the test is failing for many different reasons. And as far as telemetry/dashboard are concerned, this is WAI. Still, this is probably something important to think about. With the large (and increasing) number of stories in the benchmark, having some of them failing at any given moment is not unlikely. So this sort of data stoppage will keep happening every now and then. Should we report summaries even if some pages failed?
,
Nov 14 2016
This is a little confusing because the way it's set up, it has a summary metric for blank_about. Usually we have summary metrics for multiple pages. Juan, Ethan, Ben, any idea why there's a summary for just blank_about? Normally I would say to monitor individual pages, because if we produce a summary metric on failure and a page fails, it will move the average confusingly. But we're already doing that. I don't understand why the summarization works this way.
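To make the concern about a moving average concrete, here is a small worked example. It is a sketch only: it assumes the summary is a plain mean over the pages that produced a value, which may not be how the summarization pipeline actually computes it, and the story names and numbers are invented.

```python
# Hypothetical illustration: a summary computed as a plain mean shifts when
# one page fails and drops out, even though no page actually changed.
def summary_mean(page_values):
    """Average the per-page values, skipping pages with no value (None)."""
    present = [v for v in page_values.values() if v is not None]
    if not present:
        return None  # Nothing to summarize.
    return sum(present) / len(present)


all_pass = {"story_a": 100.0, "story_b": 200.0, "story_c": 600.0}
one_fail = {"story_a": 100.0, "story_b": 200.0, "story_c": None}

mean_all = summary_mean(all_pass)   # 300.0
mean_fail = summary_mean(one_fail)  # 150.0
print(mean_all, mean_fail)
```

On a chart, the drop from 300 to 150 would look like a large improvement even though the only thing that happened is that story_c failed.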
,
Nov 14 2016
The way the benchmark is set up, there is a "blank_about" group with a single "blank:about:blank" story. I think most other groups ("load_news", "browse_social", etc.) do have more than one story.
But now that you mention it, yeah, we should be alerting on individual stories (at least that was the plan as per [1]). Why are these data stoppage alerts on the story group level?
[1]: https://github.com/catapult-project/catapult/issues/2722#issuecomment-252910388
,
Nov 16 2016
Sorry for the long delay here. Our alerting is for patterns like:
ChromiumPerf/*/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/*/*
I checked and it does match individual pages, like:
ChromiumPerf/android-nexus7v2/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/blank_about/blank_about_blank
The weirdness comes in for ref builds. The ref builds are programmatically ignored for perf regression alerts, but what happens is that the summary for the ref build matches the pattern for the per-page alert:
ChromiumPerf/android-nexus7v2/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:java_heap:effective_size_avg/blank_about/ref
The non-summary ref for that is:
ChromiumPerf/android-nexus7v2/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/blank_about/blank_about_blank_ref
which we also match, and ignore in performance alerts.
So we do get data stoppage alerts for the summaries on ref builds, and that's why we got this alert on the summary. In general, I think we want to know if the ref build stopped sending data on that non-summary ref. It gets tricky because the dashboard has this strange naming convention.
To my knowledge, this is the first time we've gotten a data stoppage alert that was problematic in this way, but it did waste a lot of time. Do people think we should do something to refine the data stoppage alerts for ref builds?
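The pattern behaviour described above can be reproduced with a few lines. This is a sketch under one assumption: that each `*` in an alert pattern matches exactly one `/`-separated path component. The function `matches_pattern` is hypothetical and is not the dashboard's actual matching code; the gpu metric is used for every path here purely for illustration.

```python
# Hypothetical component-wise matcher, assuming "*" matches exactly one
# "/"-separated component of a test path.
from fnmatch import fnmatchcase


def matches_pattern(pattern, test_path):
    """Return True if every path component matches the pattern component."""
    pattern_parts = pattern.split("/")
    path_parts = test_path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(fnmatchcase(part, pat)
               for pat, part in zip(pattern_parts, path_parts))


METRIC = "memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg"
PREFIX = "ChromiumPerf/android-nexus7v2/system_health.memory_mobile/" + METRIC
pattern = "ChromiumPerf/*/system_health.memory_mobile/" + METRIC + "/*/*"

results = {
    "page": matches_pattern(pattern, PREFIX + "/blank_about/blank_about_blank"),
    "page_ref": matches_pattern(
        pattern, PREFIX + "/blank_about/blank_about_blank_ref"),
    "summary": matches_pattern(pattern, PREFIX + "/blank_about"),
    "summary_ref": matches_pattern(pattern, PREFIX + "/blank_about/ref"),
}
print(results)
```

Under that assumption the non-ref summary has one component too few and is skipped, while the ref summary picks up the extra `ref` component and matches the per-page pattern, which is the situation that produced this alert.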
,
Nov 17 2016
Issue 666055 has been merged into this issue.
,
Nov 17 2016
,
Nov 17 2016
Yeah, I guess it's fine to keep those alerts. And now that we know about them it's easier to diagnose them (like the two I've just dup'ed here).
,
Dec 7 2016
,
Dec 7 2016
Issue 663900 has been merged into this issue.
Comment 1 by rsch...@chromium.org
, Nov 9 2016