
Issue 663899 link

Starred by 3 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Dec 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression

Blocking:
issue 664505




No data received for system_health.memory_mobile from android-nexus7v2 since 427296

Project Member Reported by rsch...@chromium.org, Nov 9 2016

Issue description

This bot seems to be having trouble, but this test was already failing before that. It is also failing on the test build, but starting at a different time, so I will file a separate bug.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=663899

Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyvQELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io4BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Y2M6ZWZmZWN0aXZlX3NpemVfYXZnL2JsYW5rX2Fib3V0L3JlZgwLEg1TdG9wcGFnZUFsZXJ0GKCKGgw,agxzfmNocm9tZXBlcmZyvgELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io8BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Y2M6ZWZmZWN0aXZlX3NpemVfYXZnL2Jyb3dzZV9tZWRpYS9yZWYMCxINU3RvcHBhZ2VBbGVydBigihoM,agxzfmNocm9tZXBlcmZyvQELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io4BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Y2M6ZWZmZWN0aXZlX3NpemVfYXZnL2Jyb3dzZV9uZXdzL3JlZgwLEg1TdG9wcGFnZUFsZXJ0GKCKGgw,agxzfmNocm9tZXBlcmZyvwELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpABQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Y2M6ZWZmZWN0aXZlX3NpemVfYXZnL2Jyb3dzZV9zb2NpYWwvcmVmDAsSDVN0b3BwYWdlQWxlcnQYoIoaDA,agxzfmNocm9tZXBlcmZyvAELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io0BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Y2M6ZWZmZWN0aXZlX3NpemVfYXZnL2xvYWRfbWVkaWEvcmVmDAsSDVN0b3BwYWdlQWxlcnQYoIoaDA,agxzfmNocm9tZXBlcmZyuwELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IowBQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Y2M6ZWZmZWN0aXZlX3NpemVfYXZnL2xvYWRfbmV3cy9yZWYMCxINU3RvcHBhZ2VBbGVydBigihoM,agxzfmNocm9tZXBlcmZyvQELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io4BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Y2M6ZWZmZWN0aXZlX3NpemVfYXZnL2xvYWRfc2VhcmNoL3JlZgwLEg1TdG9wcGFnZUFsZXJ0GKCKGgw,agxzfmNocm9tZXBlc
mZyvgELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io8BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Z3B1OmVmZmVjdGl2ZV9zaXplX2F2Zy9ibGFua19hYm91dC9yZWYMCxINU3RvcHBhZ2VBbGVydBigihoM,agxzfmNocm9tZXBlcmZyvwELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpABQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Z3B1OmVmZmVjdGl2ZV9zaXplX2F2Zy9icm93c2VfbWVkaWEvcmVmDAsSDVN0b3BwYWdlQWxlcnQYoIoaDA,agxzfmNocm9tZXBlcmZyvgELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io8BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Z3B1OmVmZmVjdGl2ZV9zaXplX2F2Zy9icm93c2VfbmV3cy9yZWYMCxINU3RvcHBhZ2VBbGVydBigihoM,agxzfmNocm9tZXBlcmZywAELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpEBQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Z3B1OmVmZmVjdGl2ZV9zaXplX2F2Zy9icm93c2Vfc29jaWFsL3JlZgwLEg1TdG9wcGFnZUFsZXJ0GKCKGgw,agxzfmNocm9tZXBlcmZyvQELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io4BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Z3B1OmVmZmVjdGl2ZV9zaXplX2F2Zy9sb2FkX21lZGlhL3JlZgwLEg1TdG9wcGFnZUFsZXJ0GKCKGgw,agxzfmNocm9tZXBlcmZyvAELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io0BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Z3B1OmVmZmVjdGl2ZV9zaXplX2F2Zy9sb2FkX25ld3MvcmVmDAsSDVN0b3BwYWdlQWxlcnQYoIoaDA,agxzfmNocm9tZXBlcmZyvgELEhNTdG9wcGFnZUFsZXJ0UGFyZW50Io8BQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6Z3B1OmVmZmVjdGl2ZV9zaXplX2F2Zy9sb2FkX3NlYXJjaC9yZWYMCxINU3RvcHBhZ2VBbGVydBigihoM,agxzfmNocm9tZXBlcmZyxAELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpUBQ2hyb21pdW1QZXJmL2Fu
ZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6amF2YV9oZWFwOmVmZmVjdGl2ZV9zaXplX2F2Zy9ibGFua19hYm91dC9yZWYMCxINU3RvcHBhZ2VBbGVydBigihoM,agxzfmNocm9tZXBlcmZyxQELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpYBQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6amF2YV9oZWFwOmVmZmVjdGl2ZV9zaXplX2F2Zy9icm93c2VfbWVkaWEvcmVmDAsSDVN0b3BwYWdlQWxlcnQYoIoaDA,agxzfmNocm9tZXBlcmZyxAELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpUBQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6amF2YV9oZWFwOmVmZmVjdGl2ZV9zaXplX2F2Zy9icm93c2VfbmV3cy9yZWYMCxINU3RvcHBhZ2VBbGVydBigihoM,agxzfmNocm9tZXBlcmZyxgELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpcBQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6amF2YV9oZWFwOmVmZmVjdGl2ZV9zaXplX2F2Zy9icm93c2Vfc29jaWFsL3JlZgwLEg1TdG9wcGFnZUFsZXJ0GKCKGgw,agxzfmNocm9tZXBlcmZywwELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpQBQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6amF2YV9oZWFwOmVmZmVjdGl2ZV9zaXplX2F2Zy9sb2FkX21lZGlhL3JlZgwLEg1TdG9wcGFnZUFsZXJ0GKCKGgw,agxzfmNocm9tZXBlcmZywgELEhNTdG9wcGFnZUFsZXJ0UGFyZW50IpMBQ2hyb21pdW1QZXJmL2FuZHJvaWQtbmV4dXM3djIvc3lzdGVtX2hlYWx0aC5tZW1vcnlfbW9iaWxlL21lbW9yeTpjaHJvbWU6YWxsX3Byb2Nlc3NlczpyZXBvcnRlZF9ieV9jaHJvbWU6amF2YV9oZWFwOmVmZmVjdGl2ZV9zaXplX2F2Zy9sb2FkX25ld3MvcmVmDAsSDVN0b3BwYWdlQWxlcnQYoIoaDA


Bot(s) for this bug's original alert(s):

android-nexus7v2
Cc: primiano@chromium.org perezju@chromium.org
Owner: perezju@chromium.org
Status: Assigned (was: Untriaged)
Looking into it. It appears that the benchmark has been pretty crashy on this bot. Strangely, this is also happening on the reference build.
I've got a theory, looking at:
https://chromeperf.appspot.com/report?sid=3815d8f8078c996fbdcfd9c0ce31f479e565940e6e74773e6854b214e95963ec&start_rev=424708&end_rev=428601

It appears that the data stoppage starts soon after adding the background stories to system health, on the 29th October.

But there _are_ more recent runs of the benchmark, e.g. this one from just a few minutes ago, that claims to have uploaded data to the dashboard:

Sending result 2 of 2 to dashboard.
{"is_ref": true, "test_suite_name": "system_health.memory_mobile", "master": "ChromiumPerf", "versions": {"webkit_rev": "431079", "chromium": "8ea079197144ef1b0f575fe3b3d167a6bf67102a", "commit_pos": 431079}, "point_id": 431079, ...
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4191/steps/system_health.memory_mobile.reference/logs/stdio

Could it be that, after adding the new stories, we're now sending more data and some piece of the infrastructure is unable to handle it?

The following also looks suspicious (from the same log above):
C    2.735s Main  ********************************************************************************
C    2.735s Main  Detailed Logs
C    2.735s Main  ********************************************************************************
C    2.735s Main  ********************************************************************************
C    2.735s Main  Summary
C    2.735s Main  ********************************************************************************
C    2.735s Main  [==========] 1 test ran.
C    2.735s Main  [  PASSED  ] 0 tests.
C    2.735s Main  [  FAILED  ] 1 test, listed below:
C    2.735s Main  [  FAILED  ] PrintStep
C    2.735s Main  
C    2.735s Main  1 FAILED TEST
C    2.735s Main  ********************************************************************************

Owner: sullivan@chromium.org
I'm seeing a similar data stoppage issue on internal bots.

Here the data stops on 3rd November:
https://chromeperf.appspot.com/report?sid=1ca97f6aa3fe02826e9869f08330b7993225cda240cbca4e792184bb447e7a8a

But the health dashboard has data (just newly added) up until today:
https://clank-svelte-status.googleplex.com/system-health/android-chrome/memory/low-end-phone/

Annie, could you have a look to see whether it's something on the chromeperf end?
Blocking: 664505
Owner: perezju@chromium.org
To check whether this could be a dashboard issue, we need to look at the chartjson. Since these builds are old, there are a few hoops. Here are the instructions; we're hoping to make this easier soon.

1) Open up a chart and look at the last data point. (Check the date: if there is new data, the alert should have been marked recovered; file a bug if that didn't happen.)
2) For all of these, the last data point is commit position 427296 from October 25, so the data really did stop.
3) Get buildbot stdio link for last data point. It's in the tooltip:
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4027/steps/system_health.memory_mobile.reference/logs/stdio
4) Shorten it to the build links for the last data point and the first build without the data point:
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4027
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4028
5) Scroll to find the chartjson links for system_health.memory_mobile. Since these are very old builds, we need to look at logdog:
Last run with data: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus7v2_Perf__1_%2F4027%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.memory_mobile.reference%2F0%2Flogs%2Fjson.output%2F0
First run with no data: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus7v2_Perf__1_%2F4028%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.memory_mobile.reference%2F0%2Flogs%2Fjson.output%2F0
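
Step 4 above can be sketched in a few lines of Python. This is only an illustration: `build_links` is a hypothetical helper, not part of any Chromium tooling, and it assumes that the first build without the data point is simply the next build number, as it was here (4027 and 4028).

```python
import re


def build_links(stdio_url):
    """Return (build_url, next_build_url) for a buildbot stdio log URL.

    Assumption: the URL contains ".../builds/<number>" and the first build
    without data is the one with the next build number.
    """
    match = re.match(r"^(.*/builds/)(\d+)(?:/.*)?$", stdio_url)
    if match is None:
        raise ValueError("not a buildbot build URL: %s" % stdio_url)
    prefix, number = match.group(1), int(match.group(2))
    return prefix + str(number), prefix + str(number + 1)


# The stdio link from the tooltip of the last data point (step 3).
STDIO_URL = (
    "https://build.chromium.org/p/chromium.perf/builders/"
    "Android%20Nexus7v2%20Perf%20%281%29/builds/4027/steps/"
    "system_health.memory_mobile.reference/logs/stdio"
)

last_with_data, first_without_data = build_links(STDIO_URL)
print(last_with_data)
print(first_without_data)
```

If the bot skipped or retried builds, the "next build number" assumption would not hold and the build list would need to be checked by hand.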

Now we look at the chartjson. I see a difference in what's being sent to the dashboard:

Before:

      "blank_about@@memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg": {
        "blank:about:blank": {
          "description": "effective size of cc in all processes in Chrome",
          "grouping_keys": {
            "case": "blank",
            "group": "about"
          },
          "important": false,
          "improvement_direction": "down",
          "name": "memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg",
          "page_id": 0,
          "std": 0.0,
          "tir_label": "blank_about",
          "type": "list_of_scalar_values",
          "units": "sizeInBytes",
          "values": [
            1149256
          ]
        },
        "summary": {
          "description": "effective size of cc in all processes in Chrome",
          "grouping_keys": {
            "case": "blank",
            "group": "about"
          },
          "important": false,
          "improvement_direction": "down",
          "name": "memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg",
          "std": 0.0,
          "tir_label": "blank_about",
          "type": "list_of_scalar_values",
          "units": "sizeInBytes",
          "values": [
            1149256
          ]
        }
      },

After, the summary metric is missing:

      "blank_about@@memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg": {
        "blank:about:blank": {
          "description": "effective size of cc in all processes in Chrome",
          "grouping_keys": {
            "case": "blank",
            "group": "about"
          },
          "important": false,
          "improvement_direction": "down",
          "name": "memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg",
          "page_id": 0,
          "std": 0.0,
          "tir_label": "blank_about",
          "type": "list_of_scalar_values",
          "units": "sizeInBytes",
          "values": [
            1149256
          ]
        }
      },

So I think this is a problem with the test. Juan, can you take a look at the stdio in logdog before/after to see why it's no longer producing the summary?
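
For anyone repeating this comparison, a rough sketch of the check in Python follows. It assumes the input is the mapping from chart name to traces shown in the excerpts above; `charts_missing_summary` is a hypothetical helper, and the dictionaries are trimmed copies of those excerpts rather than full chartjson files.

```python
def charts_missing_summary(charts):
    """Return the sorted names of charts that have no "summary" trace.

    `charts` is assumed to map a chart name to a dict of traces, as in the
    excerpts above (per-story traces plus, normally, a "summary" trace).
    """
    return sorted(name for name, traces in charts.items()
                  if "summary" not in traces)


CHART = ("blank_about@@memory:chrome:all_processes:"
         "reported_by_chrome:cc:effective_size_avg")

# Trimmed versions of the "before" and "after" excerpts.
before = {
    CHART: {
        "blank:about:blank": {"type": "list_of_scalar_values",
                              "values": [1149256]},
        "summary": {"type": "list_of_scalar_values",
                    "values": [1149256]},
    }
}
after = {
    CHART: {
        "blank:about:blank": {"type": "list_of_scalar_values",
                              "values": [1149256]},
    }
}

print(charts_missing_summary(before))
print(charts_missing_summary(after))
```

Running this over the chartjson of the last build with data and the first build without it would list every chart that lost its summary, not only the one pasted here.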
Owner: nednguyen@chromium.org
Ned, any idea why the summary metric may be missing?
Cc: eakuefner@chromium.org benjhayden@chromium.org
Owner: ----
Status: Available (was: Assigned)
I don't know. Ben or Ethan can help with the summarization pipeline.
Summarization fails when the test fails. The test is failing both before and after the data stoppage alert. Is someone looking into the test failure?
Then I guess this is just issue 664505, i.e. the test is failing for many different reasons. As far as telemetry and the dashboard are concerned, this is working as intended (WAI).

Still, this is probably something important to think about. With the large (and increasing) number of stories in the benchmark, it is not unlikely that some of them will be failing at any given moment. So data stoppages of this sort will keep happening every now and then.

Should we report summaries even if some pages failed?
This is a little confusing because the way it's set up, it has a summary metric for blank_about. Usually we have summary metrics for multiple pages. Juan, Ethan, Ben, any idea why there's a summary for just blank_about?

Normally I would say to monitor individual pages, because if we produce a summary metric on failure and a page fails, it will move the average confusingly. But we're already doing that. I don't understand why the summarization works this way.
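
To make the trade-off concrete, here is a small sketch in Python. It is not the actual telemetry summarization code; `summarize`, the story names, and the numbers are all made up for illustration. The default mirrors the behaviour described in this thread (no summary if any story failed); `allow_partial=True` is the alternative being asked about.

```python
def summarize(values_by_story, failed_stories, allow_partial=False):
    """Average the per-story values into one summary value.

    values_by_story: values for the stories that ran successfully.
    failed_stories: names of stories that failed and produced no value.
    Returns None when no summary is produced.
    """
    if failed_stories and not allow_partial:
        return None  # described behaviour: a failure suppresses the summary
    if not values_by_story:
        return None
    return sum(values_by_story.values()) / len(values_by_story)


# Illustrative numbers only.
full_run = summarize({"story_a": 100.0, "story_b": 300.0}, [])
strict = summarize({"story_a": 100.0}, ["story_b"])
partial = summarize({"story_a": 100.0}, ["story_b"], allow_partial=True)

print(full_run, strict, partial)
```

With these numbers the partial summary drops from 200.0 to 100.0 even though neither story's memory use changed, which is the confusing shift in the average mentioned above.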
The way the benchmark is set up, there is a "blank_about" group with a single "blank:about:blank" story. I think most other groups ("load_news", "browse_social", etc.) do have more than one story.

But now that you mention it, yeah, we should be alerting on individual stories (at least that was the plan as per [1]). Why are these data stoppage alerts at the story group level?

[1]: https://github.com/catapult-project/catapult/issues/2722#issuecomment-252910388
Sorry for the long delay here. Our alerting is for patterns like:

ChromiumPerf/*/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/*/*

I checked, and it does match individual pages, like:
ChromiumPerf/android-nexus7v2/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/blank_about/blank_about_blank

The weirdness comes in for ref builds. The ref builds are programmatically ignored for perf regression alerts, but what happens is that the summary for the ref build matches the pattern for the per-page alert:
ChromiumPerf/android-nexus7v2/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:java_heap:effective_size_avg/blank_about/ref

The non-summary ref for that is ChromiumPerf/android-nexus7v2/system_health.memory_mobile/memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/blank_about/blank_about_blank_ref, which we also match, and which is ignored in performance alerts.

So we do get data stoppage alerts for the summaries on ref builds, and that's why we got this alert on the summary. In general, I think we want to know if the ref build stopped sending data on that non-summary ref. It gets tricky because the dashboard has this strange naming convention. To my knowledge, this is the first time we've gotten a data stoppage alert that was problematic in this way, but it did waste a lot of time. Do people think we should do something to refine the data stoppage alerts for ref builds?
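
The overlap can be reproduced with a small sketch in Python. This is not the dashboard's matching code: it assumes each `*` matches exactly one path segment, `matches` and `is_ref` are hypothetical helpers, and the gpu summary ref path is written by analogy with the java_heap one quoted above.

```python
from fnmatch import fnmatchcase

ALERT_PATTERN = (
    "ChromiumPerf/*/system_health.memory_mobile/"
    "memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/*/*"
)


def matches(pattern, test_path):
    """Match segment by segment so that '*' never spans a '/'."""
    pattern_parts = pattern.split("/")
    path_parts = test_path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(fnmatchcase(part, pat)
               for pat, part in zip(pattern_parts, path_parts))


def is_ref(test_path):
    """Assumed naming convention: ref traces end in 'ref' or '_ref'."""
    last = test_path.split("/")[-1]
    return last == "ref" or last.endswith("_ref")


PREFIX = (
    "ChromiumPerf/android-nexus7v2/system_health.memory_mobile/"
    "memory:chrome:all_processes:reported_by_chrome:gpu:effective_size_avg/"
    "blank_about/"
)
PAGE = PREFIX + "blank_about_blank"          # individual page
PAGE_REF = PREFIX + "blank_about_blank_ref"  # non-summary ref
SUMMARY_REF = PREFIX + "ref"                 # summary for the ref build

for path in (PAGE, PAGE_REF, SUMMARY_REF):
    print(matches(ALERT_PATTERN, path), is_ref(path))
```

All three paths match the pattern, so a per-page pattern cannot by itself tell the ref summary apart from a real page; only the trailing `ref` naming does.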
Issue 666055 has been merged into this issue.
Cc: benhenry@chromium.org
Issue 665995 has been merged into this issue.
Yeah, I guess it's fine to keep those alerts. And now that we know about them it's easier to diagnose them (like the two I've just dup'ed here).
Status: WontFix (was: Available)
Issue 663900 has been merged into this issue.
