Issue 900740 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 919630



health_alert_recipient emails have "bad access token" errors.

Project Member Reported by jdufault@chromium.org, Oct 31

Issue description

The bot seems to have been failing for a while. Apparently it tries to send a health report saying that it is failing, but then fails to send the report.

Log snippet:

11:54:20: INFO: Builder lakitu-incremental has failed 3 time(s) in a row.
11:54:20: INFO: Builder failed 3 consecutive times, sending health alert email to [u'gci-alerts+buildbots@google.com'].
11:54:20: INFO: URL being requested: GET https://www.googleapis.com/discovery/v1/apis/gmail/v1/rest
11:54:20: INFO: Attempting refresh to obtain initial access_token
11:54:20: INFO: Refreshing access_token
11:54:21: INFO: Failed to retrieve access token: {
  "error" : "deleted_client",
  "error_description" : "The OAuth client was deleted."
}

Example run:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931135011980512784
 
Cc: zamorzaev@chromium.org
Components: Infra>Client>ChromeOS>Test
Owner: wonderfly@chromium.org
I'm also seeing the following failure in cloud_StackdriverServices which appears in a fraction of GCETest runs seemingly at random:

pre-test siteration sysinfo error:
        traceback:0013| Traceback (most recent call last):
        traceback:0013|   File "/usr/local/autotest/common_lib/log.py", line 25, in decorated_func
        traceback:0013|     fn(*args, **dargs)
        traceback:0013|   File "/usr/local/autotest/bin/base_sysinfo.py", line 399, in log_before_each_iteration
        traceback:0013|     board = utils.get_board_with_frequency_and_memory()
        traceback:0013|   File "/usr/local/autotest/bin/utils.py", line 2103, in get_board_with_frequency_and_memory
        traceback:0013|     frequency = int(round(get_cpu_max_frequency() * 1e-8)) * 0.1
        traceback:0013|   File "/usr/local/autotest/bin/utils.py", line 1914, in get_cpu_max_frequency
        traceback:0013|     assert max_frequency > 1e8, 'Unreasonably low CPU frequency.'
        traceback:0013| AssertionError: Unreasonably low CPU frequency.

Daniel, can you please take a look?
Cc: -martis@chromium.org -jdufault@chromium.org dufault@chromium.org
Components: -Infra>Client>ChromeOS>Test Infra>Client>ChromeOS>CI
Labels: -Pri-1 Pri-3
P3 per CrOS CI SLO
Blocking: 919630
Cc: mikenichols@chromium.org dburger@chromium.org
I'm guessing that Mike removed the Gmail credentials from our builders.

We can probably do something with the LUCI Notifier instead, then remove Gmail notification support from Chromite. This would also allow us to remove "Streak" counters and a bunch of mostly unused GS access.
To use LUCI Notifier, we should update request build (and the LUCI Scheduler generation code) to use the existing email notification config values to populate the same properties we set for tryjobs.

Most of the support should already be in place, other than wiring in the config values.
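
As a rough illustration of the "wiring in the config values" step, the sketch below turns a build config's "health_alert_recipients" list into a build properties dict. The property key `email_notify`, the shape of its value, and the helper name `notify_properties` are assumptions made for illustration; they should be checked against what the tryjob code actually sets and what LUCI Notifier expects.

```python
def notify_properties(build_config, property_key='email_notify'):
    """Build notification properties from a build config dict.

    Hypothetical helper: property_key and the {'email': ...} entry shape
    are assumptions, not confirmed LUCI Notifier behavior.
    """
    recipients = build_config.get('health_alert_recipients') or []
    if not recipients:
        # No recipients configured, so add no notification properties.
        return {}
    return {property_key: [{'email': address} for address in recipients]}


if __name__ == '__main__':
    config = {'health_alert_recipients': ['alerts@example.com']}
    print(notify_properties(config))
```

Taking the key as a parameter keeps the unverified property name in one place, so it is easy to correct once the real name is confirmed.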
Nothing has changed from a builder perspective; we're still using the same image we've used for over a year and the Puppet changes were specific to the Xenial upgrade.  

Do we know which OAuth client it is using?

I do agree that we should remove email support from Chromite and default all of this to luci-notify. That would allow control of the report contents, via a template, and of the recipients.

-- Mike
Cc: xixuan@chromium.org
Vague memory says that the lab team (xixuan@) owns the credentials we are currently using; perhaps they invalidated our tokens as part of a security cleanup?
Also, these tokens are installed by puppet, not embedded on the image.
That is what I was assuming, that they were part of the larger credential install that takes place within Puppet.  There have been no recent changes to credentials, on my part.  

The easy fix would be to sort out the credentials and/or add a new set to restore access today. The alternative is LUCI Notify, but that would mean a change in the reports the recipients are getting today. That may or may not be important.

-- Mike


There is a whole bunch of nearly obsolete logic that could get deleted if we switch to LUCI notify. From a code health point of view, that change is the right thing to do.

The new emails would also be more likely to be useful.

Oh... also, it's failing because of GCE test failures, and then blowing up with the token problem when it tries to send an email about the build failure.
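
Until the email path is replaced, one way to stop a notification failure from adding a second failure to the build is to treat alerting as best effort. This is a minimal sketch only: `send_alert_safely` and the `send_fn` callable are hypothetical names, not Chromite functions.

```python
import logging


def send_alert_safely(send_fn, recipients, message):
    """Call send_fn(recipients, message) without letting it raise.

    Returns True if the alert was sent, False if sending failed.
    send_fn stands in for whatever actually sends the email.
    """
    try:
        send_fn(recipients, message)
        return True
    except Exception:  # Deliberately broad: alerting is best effort.
        logging.exception('Failed to send health alert to %s', recipients)
        return False


if __name__ == '__main__':
    def broken_sender(recipients, message):
        raise RuntimeError('deleted_client')

    # Logs the error and prints False instead of crashing.
    print(send_alert_safely(broken_sender, ['alerts@example.com'], 'failed'))
```

The build result then reflects only the original failure, while the log still records that the alert could not be delivered.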
Owner: dgarr...@chromium.org
Summary: health_alert_recipient emails have "bad access token" errors. (was: lakitu-incremental has bad access token)
I'm going to take over this bug for email notifications, but filed https://crbug.com/919647 for the GCE test failures.
A number of ChromeOS build types use the "health_alert_recipients" field to generate email notifications for failed builds, but those notifications seem to be failing to send, which means the builds now fail for multiple reasons.

Affected build configs (found by scanning chromeos_config):
  master-toolchain
  lakitu*-incremental
  lakitu*-release
  lakitu-full

It looks like this would also affect:
  master-chromium-pfq

Except that "health_threshold" is not set, so the emails are never generated.

Note: the "health_threshold" is never set to more than one for any of these builds, so the "failure streak" logic used to support it is no longer needed.
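
To make the last point concrete: assuming the alert rule is "the failure streak has reached health_threshold" (an assumption about the existing logic, with hypothetical function names), a threshold of at most one makes the streak count equivalent to "the current build failed".

```python
def should_alert_with_streak(failure_streak, health_threshold):
    """Streak-based rule (assumed): alert once the streak reaches the threshold.

    failure_streak counts consecutive failures ending at the current build.
    A threshold of 0 (unset) disables alerts.
    """
    return health_threshold > 0 and failure_streak >= health_threshold


def should_alert_simple(build_failed, health_threshold):
    """Streak-free rule: alert whenever an alert-enabled build fails."""
    return health_threshold > 0 and build_failed


if __name__ == '__main__':
    # For thresholds 0 and 1 the two rules agree for every streak length.
    for threshold in (0, 1):
        for streak in range(5):
            assert (should_alert_with_streak(streak, threshold)
                    == should_alert_simple(streak > 0, threshold))
    print('rules agree for thresholds 0 and 1')
```

With the threshold acting as an on/off flag, nothing needs to be persisted between builds to decide whether to alert.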

Status: Assigned (was: Untriaged)
This issue has an owner, a component and a priority, but is still listed as untriaged or unconfirmed. By definition, this bug is triaged. Changing status to "assigned". Please reach out to me if you disagree with how I've done this.