
Issue 863045

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Jul 23
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Multiple failures for veyron_jerry-release due to denial of access token refresh

Project Member Reported by jettrink@chromium.org, Jul 12

Issue description

The two most recent veyron_jerry-release attempts failed in similar ways, apparently because of trouble refreshing the access token.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941274790613553424
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941244364493115600


From logs:
17:27:43: WARNING: HttpsMonitor.send received status 429: {
  "error": {
    "code": 429,
    "message": "Insufficient tokens for quota 'WriteGroup' and limit 'CLIENT_PROJECT-100s' of service 'prodxmon-pa.googleapis.com' for consumer 'project_number:102025095358'.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Google developer console API key",
            "url": "https://console.developers.google.com/project/102025095358/apiui/credential"
          }
        ]
      }
    ]
  }
}

17:27:51: WARNING: Exception is not retriable return code: 3; command: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmp_g6Tfr/tmpUlo67i/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name veyron_jerry-release/R69-10867.0.0-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:sanity' '--tags=build:veyron_jerry-release/R69-10867.0.0' '--tags=task_name:veyron_jerry-release/R69-10867.0.0-sanity' '--tags=board:veyron_jerry' -- /usr/local/autotest/site_utils/run_suite.py --build veyron_jerry-release/R69-10867.0.0 --board veyron_jerry --suite_name sanity --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 85406435L, 'cidb_build_id': 2740142, 'datastore_parent_key': ('Build', 2740142, 'BuildStage', 85406435L)}" -m 216195220
Triggered task: veyron_jerry-release/R69-10867.0.0-sanity



--- and ---

23:21:38: INFO: Refreshing due to a 401 (attempt 1/2)
23:21:38: INFO: Refreshing access_token
23:31:38: INFO: Refreshing due to a 401 (attempt 1/2)
23:31:38: INFO: Refreshing access_token
23:39:33: INFO: Refreshing due to a 401 (attempt 1/2)
23:39:33: INFO: Refreshing access_token
00:21:43: INFO: Refreshing due to a 401 (attempt 1/2)
00:21:43: INFO: Refreshing access_token
00:31:42: INFO: Refreshing due to a 401 (attempt 1/2)
00:31:42: INFO: Refreshing access_token
00:34:47: INFO: Re-run swarming_cmd to avoid buildbot salency check.
00:34:47: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmpn2j4aq/tmp1wbP0U/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name veyron_jerry-release/R69-10868.0.0-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:sanity' '--tags=build:veyron_jerry-release/R69-10868.0.0' '--tags=task_name:veyron_jerry-release/R69-10868.0.0-sanity' '--tags=board:veyron_jerry' -- /usr/local/autotest/site_utils/run_suite.py --build veyron_jerry-release/R69-10868.0.0 --board veyron_jerry --suite_name sanity --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 85446746L, 'cidb_build_id': 2741629, 'datastore_parent_key': ('Build', 2741629, 'BuildStage', 85446746L)}" -m 216271491
00:35:36: WARNING: Exception is not retriable return code: 3; command: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmpn2j4aq/tmp1wbP0U/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name veyron_jerry-release/R69-10868.0.0-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:sanity' '--tags=build:veyron_jerry-release/R69-10868.0.0' '--tags=task_name:veyron_jerry-release/R69-10868.0.0-sanity' '--tags=board:veyron_jerry' -- /usr/local/autotest/site_utils/run_suite.py --build veyron_jerry-release/R69-10868.0.0 --board veyron_jerry --suite_name sanity --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 85446746L, 'cidb_build_id': 2741629, 'datastore_parent_key': ('Build', 2741629, 'BuildStage', 85446746L)}" -m 216271491
Triggered task: veyron_jerry-release/R69-10868.0.0-sanity

 
I have no idea what the first issue is, but `Refreshing due to a 401 (attempt 1/2)` is a really misleading line: it simply means that something is hanging, and so the LUCI auth token is being refreshed automatically. In other words, it pretty much always means "something is hanging".
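One rough way to gauge how long a step has been hanging is to count these refresh lines and add up the time between them. The following is a minimal sketch, assuming the log lines carry `HH:MM:SS:` prefixes as in the snippets in this issue; `refresh_span` is a hypothetical helper, not part of chromite.

```python
import re
from datetime import timedelta

# Matches lines such as "23:21:38: INFO: Refreshing due to a 401 (attempt 1/2)".
REFRESH_RE = re.compile(r"^(\d\d):(\d\d):(\d\d): INFO: Refreshing due to a 401")


def refresh_span(lines):
    """Return (count, elapsed) for the 401-refresh lines in a log excerpt.

    Assumes lines are in chronological order and that consecutive refreshes
    are less than a day apart, so a backwards jump means a midnight rollover.
    """
    count = 0
    elapsed = timedelta(0)
    prev = None
    for line in lines:
        match = REFRESH_RE.match(line)
        if not match:
            continue
        hours, minutes, seconds = map(int, match.groups())
        cur = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if prev is not None:
            delta = cur - prev
            if delta < timedelta(0):
                delta += timedelta(days=1)  # the log crossed midnight
            elapsed += delta
        prev = cur
        count += 1
    return count, elapsed
```

Run over the five refresh lines quoted in the issue description (23:21:38 through 00:31:42), this returns a count of 5 and an elapsed time of 1:10:04.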
Owner: gmeinke@chromium.org
Status: Assigned (was: Untriaged)
Over to the oncall.
I saw this same error in a previously failing falco-release build (which has since passed): https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941274861474566800

https://luci-logdog.appspot.com/v/?s=chromeos/buildbucket/cr-buildbucket.appspot.com/8941274861474566800/+/steps/HWTest__bvt-inline_/0/stdout

--- log snippet ---

17:40:42: INFO: RetriableHttp: attempt 5 receiving status 503, final attempt
17:40:43: WARNING: HttpsMonitor.send received status 503: {
  "error": {
    "code": 503,
    "message": "The service is currently unavailable.",
    "status": "UNAVAILABLE"
  }
}

17:40:43: WARNING: HttpsMonitor.send received status 503: {
  "error": {
    "code": 503,
    "message": "The service is currently unavailable.",
    "status": "UNAVAILABLE"
  }
}

17:41:02: WARNING: HttpsMonitor.send received status 429: {
  "error": {
    "code": 429,
    "message": "Insufficient tokens for quota 'WriteGroup' and limit 'CLIENT_PROJECT-100s' of service 'prodxmon-pa.googleapis.com' for consumer 'project_number:102025095358'.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Google developer console API key",
            "url": "https://console.developers.google.com/project/102025095358/apiui/credential"
          }
        ]
      }
    ]
  }
}
17:41:02: WARNING: Exception is not retriable return code: 3; command: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmpV8pj3x/tmpIOl9mm/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name falco-release/R69-10867.0.0-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:bvt-inline' '--tags=build:falco-release/R69-10867.0.0' '--tags=task_name:falco-release/R69-10867.0.0-bvt-inline' '--tags=board:falco' -- /usr/local/autotest/site_utils/run_suite.py --build falco-release/R69-10867.0.0 --board falco --suite_name bvt-inline --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 4 --suite_min_duts 6 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 85408934L, 'cidb_build_id': 2740013, 'datastore_parent_key': ('Build', 2740013, 'BuildStage', 85408934L)}" --json_dump -m 216198187
Triggered task: falco-release/R69-10867.0.0-bvt-inline

We are spiking over the quota of 150k write requests per 100s. I'm not sure whether I correctly increased the quota to 200k per 100s, or whether that just changed the view of the monitoring graph:

https://pantheon.corp.google.com/apis/api/prodxmon-pa.googleapis.com/credentials?organizationId=433637338589&project=google.com:prodx-mon-chrome-infra
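To make the "N requests per 100s" limit concrete, here is a minimal client-side sketch of a sliding-window counter. It is purely illustrative: `WindowCounter` is a hypothetical name, and nothing here is meant to describe how prodxmon-pa.googleapis.com actually enforces its quota.

```python
from collections import deque


class WindowCounter:
    """Allow at most `limit` events in any trailing `window_s` seconds."""

    def __init__(self, limit, window_s=100.0):
        self.limit = limit
        self.window_s = window_s
        self._stamps = deque()  # timestamps of allowed events, oldest first

    def allow(self, now):
        """Record an event at time `now` (seconds) if the window has room."""
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False  # over the limit; the caller should drop or delay
        self._stamps.append(now)
        return True
```

A caller would construct something like `WindowCounter(limit=150_000)` and call `allow(time.monotonic())` before each write, skipping the write when it returns `False`.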


Labels: -Pri-1 Pri-2
Filed crbug.com/863466 with the troopers to investigate.

We should make sure that being unable to log monitoring data does not fail the build. I will investigate this issue; lowering priority to P2. crbug.com/863466 is the critical bug tracking why the monitoring writes are spiking.
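The idea of not failing the build on monitoring errors could look roughly like the sketch below. It assumes a hypothetical `send` callable standing in for whatever actually flushes the metrics; it is not the chromite implementation.

```python
import logging

logger = logging.getLogger("monitoring")


def send_metrics_best_effort(send, payload):
    """Call send(payload); log and swallow any failure instead of raising.

    `send` is a hypothetical callable that flushes metrics. Returns True on
    success and False if the send failed, so callers can count dropped
    payloads without the build being affected.
    """
    try:
        send(payload)
        return True
    except Exception as exc:  # monitoring is best effort; never fail the build
        logger.warning("Dropping monitoring data: %s", exc)
        return False
```

Catching `Exception` broadly is deliberate here: a 429 or 503 from the monitoring endpoint should cost only the metrics, not the build.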
Cc: rrangel@chromium.org pmalani@chromium.org
Copying current sheriffs
Verified that the monitoring errors have cleared; I don't think they were causing issues.

The second error listed above is still appearing in the logs, and the veyron_jerry-release builders are still failing. I also verified that the builds are running on different builders, so this is not a single-builder issue.

23:21:38: INFO: Refreshing due to a 401 (attempt 1/2)
23:21:38: INFO: Refreshing access_token
23:31:38: INFO: Refreshing due to a 401 (attempt 1/2)
23:31:38: INFO: Refreshing access_token

Still trying to figure out what it is that is hanging ...
Status: WontFix (was: Assigned)
Veyron_jerry-release has had 9 successful builds since 2018-07-20: https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=veyron_jerry-release&buildBranch=master

The access token failures are no longer happening, closing.
