Issue 848811

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




PFQ informational bots fail to SyncChrome due to RESOURCE_EXHAUSTED

Project Member Reported by xiy...@chromium.org, Jun 1 2018

Issue description

Happening on:
- caroline-tot-chrome-pfq-informational
- daisy-tot-chromium-pfq-informational
- eve-tot-chrome-pfq-informational
- peach_pit-tot-chrome-pfq-informational 
- tricky-tot-chrome-pfq-informational 
- veyron_minnie-tot-chrome-pfq-informational

Sample build:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8945038622751113392

05:22:21: INFO: Refreshing due to a 401 (attempt 1/2)
05:22:21: INFO: Refreshing access_token
05:45:27: INFO: RetriableHttp: attempt 1 receiving status 503, will retry
05:46:33: WARNING: HttpsMonitor.send received status 429: {
  "error": {
    "code": 429,
    "message": "Insufficient tokens for quota 'WriteGroup' and limit 'CLIENT_PROJECT-100s' of service 'prodxmon-pa.googleapis.com' for consumer 'project_number:102025095358'.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Google developer console API key",
            "url": "https://console.developers.google.com/project/102025095358/apiui/credential"
          }
        ]
      }
    ]
  }
}
 
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI
Cc: mikenichols@chromium.org dgarr...@chromium.org
Owner: athilenius@chromium.org
Status: Assigned (was: Untriaged)
Alec, it looks like the prodmon API quota is exhausted. We need to get help from the current ChOps oncall.

Components: Infra
Labels: Infra-Troopers
Oh, it seems that tsmon data couldn't be sent to prod, but it's not clear why that would block your build. Someone has to look into what the script being run actually does.
Labels: -Pri-3 Pri-1
Alec, let's check whether we are missing an exception wrap on the line that killed the build.
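A minimal sketch of the kind of exception wrap being asked about, under stated assumptions: `send_fn` is a hypothetical stand-in for whatever actually sends tsmon data, since the real cbuildbot call site is not shown in this thread. The idea is that a monitoring failure such as the 429 above is logged and swallowed rather than propagated into the build step.

```python
import logging


def send_metrics_best_effort(send_fn, payload):
    """Call send_fn(payload); log and swallow any failure.

    send_fn is a hypothetical stand-in for the real metrics sender.
    Returns True if the send succeeded, False otherwise.
    """
    try:
        send_fn(payload)
        return True
    except Exception:  # Monitoring must never take down the build.
        logging.warning("metrics send failed; continuing build", exc_info=True)
        return False
```

With a wrap like this in place, a quota error would show up only as a warning in the log, which would also make it easier to rule out as the cause of a dead build.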
Thanks for prioritizing this. These builders are very important for gardening.

Status: Started (was: Assigned)
This is uncharted territory for me. Can someone convince me that the 429 warning is the cause of the build being killed? I'm looking through the CBuildBot source, but that's slow going. The very long list of 401 auth errors spanning almost 24 hours seems a more probable reason for the build being killed.
Status: WontFix (was: Started)
This failure was the build hanging; both the 401s and the 429 are unrelated. jclinton and I looked at the swarming machine and didn't see any other issues with it. Swarming killed the task after 23 hours 50 min with a timeout error.
Status: Assigned (was: WontFix)
Actually, this is still happening: https://cros-goldeneye.corp.google.com/chromeos/legoland/builderSummary?buildBranch=&builderGroups=informational&limit=&email=&buildConfig=. We need to figure out what is hanging. We can log into the machines and get process trees.
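For the process-tree step, a small sketch that may help once logged into a builder. It assumes the output of `ps -eo pid,ppid,args` is available as text; the helper names `parse_ps` and `format_tree` are made up for illustration and are not part of chromite.

```python
from collections import defaultdict


def parse_ps(ps_output):
    """Parse `ps -eo pid,ppid,args` output into (children, commands) maps."""
    children = defaultdict(list)
    commands = {}
    for line in ps_output.strip().splitlines()[1:]:  # Skip the header row.
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, args = int(parts[0]), int(parts[1]), parts[2]
        commands[pid] = args
        children[ppid].append(pid)
    return children, commands


def format_tree(children, commands, root, depth=0):
    """Return indented lines for root and all of its descendants."""
    lines = ["  " * depth + "%d %s" % (root, commands.get(root, "?"))]
    for child in sorted(children.get(root, [])):
        lines.extend(format_tree(children, commands, child, depth + 1))
    return lines
```

Calling `format_tree` with the PID of the stuck cbuildbot process as `root` prints only that subtree, so the leaf process that is actually blocked stands out.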

`tee` is hanging because /build/amd64-generic/etc/portage/package.keywords/chrome doesn't exist. In fact, /build doesn't exist. Still trying to understand why.
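One way to make this kind of stall fail fast instead of running until the Swarming timeout, sketched under assumptions: `write_via_tee` and its parameters are hypothetical and not taken from chromite, and `path` is assumed to be absolute. It checks that the target directory exists before spawning `tee`, and bounds the child process with a timeout.

```python
import os
import subprocess


def write_via_tee(path, data, timeout=60):
    """Write data to path through `tee`, failing fast instead of hanging.

    Hypothetical helper for illustration only.
    """
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        # Surface the real problem (e.g. a missing /build) immediately.
        raise FileNotFoundError("directory does not exist: %s" % parent)
    # subprocess.run kills the child and raises TimeoutExpired if it stalls.
    subprocess.run(
        ["tee", path],
        input=data.encode(),
        stdout=subprocess.DEVNULL,
        check=True,
        timeout=timeout,
    )
```

Either failure mode would then surface as an exception within the timeout rather than as a task that sits idle for nearly a day.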
Cc: muyuanli@chromium.org
Issue 849173 has been merged into this issue.
Cc: achuith@chromium.org
Status: Fixed (was: Assigned)
This turned out to be an issue with the newly added log streaming support; the original error message (resource exhaustion) was a red herring. Each time, the builder stalled until Swarming finally killed it after 23 hours 50 min.

Reverting crrev.com/c/1063329 (revert: crrev.com/c/1091183) appears to have worked. Marking this fixed.
