health_alert_recipient emails have "bad access token" errors.
Issue description

Bot seems to have been failing for a while. Apparently it tries to send a health report saying that it is failing, but then fails to send the report.

Log snippet:
11:54:20: INFO: Builder lakitu-incremental has failed 3 time(s) in a row.
11:54:20: INFO: Builder failed 3 consecutive times, sending health alert email to [u'gci-alerts+buildbots@google.com'].
11:54:20: INFO: URL being requested: GET https://www.googleapis.com/discovery/v1/apis/gmail/v1/rest
11:54:20: INFO: Attempting refresh to obtain initial access_token
11:54:20: INFO: Refreshing access_token
11:54:21: INFO: Failed to retrieve access token: { "error" : "deleted_client", "error_description" : "The OAuth client was deleted." }

Example run: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931135011980512784
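For context, a minimal sketch of the failing step, not Chromite's actual alert code. The log lines ("Attempting refresh to obtain initial access_token", "Refreshing access_token", "Failed to retrieve access token") look like oauth2client's own logging, so the sketch assumes that library; send_health_alert and token_path are made-up names for illustration.

# Illustrative only -- not Chromite's actual code path.
import httplib2
from oauth2client.client import AccessTokenRefreshError
from oauth2client.file import Storage

def send_health_alert(token_path):
    credentials = Storage(token_path).get()  # refresh token installed by Puppet
    try:
        # With "deleted_client" the refresh itself fails: the OAuth client
        # behind the stored refresh token no longer exists, so no access
        # token can be minted and the alert email is never sent.
        credentials.refresh(httplib2.Http())
    except AccessTokenRefreshError as e:
        raise RuntimeError('Cannot send health alert email: %s' % e)
    # ...build the Gmail API client with these credentials and send the email.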
Nov 19
Nov 27
Nov 27
P3 per CrOS CI SLO
Jan 7
Jan 7
I'm guessing that Mike removed the Gmail credentials from our builders. We can probably do something with the LUCI Notifier instead, then remove Gmail notification support from Chromite. This would also allow us to remove "Streak" counters and a bunch of mostly unused GS access.
Jan 7
To use LUCI Notifier, we should update request build (and the LUCI Scheduler generation code) to use the existing email notification config values to populate the same properties we set for tryjobs. Most of the support should already be in place, other than wiring in the config values.
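A hypothetical sketch of that wiring, not actual Chromite code; GetEmailNotifyProperty and the config-dict keys are made-up names. It assumes luci-notify's email_notify input property, which appears to be the same mechanism already populated for tryjobs.

def GetEmailNotifyProperty(build_config):
    """Map health_alert_recipients onto the property luci-notify reads."""
    recipients = build_config.get('health_alert_recipients') or []
    # luci-notify reads the 'email_notify' input property: a list of
    # {'email': ...} entries (optionally with a 'template' name).
    return {'email_notify': [{'email': addr} for addr in recipients]}

# The returned dict would be merged into the properties sent with the build
# request (and into the LUCI Scheduler job definitions), so luci-notify sends
# the failure email instead of Chromite's Gmail-based health alert path.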
Jan 7
Nothing has changed from a builder perspective; we're still using the same image we've used for over a year, and the Puppet changes were specific to the Xenial upgrade. Do we know which OAuth client it is using? I do agree that we should remove email support from Chromite and default this all to luci-notify. This would allow control of the report contents (via a template) and the recipients. -- Mike
Jan 7
Vague memory says that the lab team (xixuan@) owns the credentials we are currently using; perhaps they invalidated our tokens as part of a security cleanup?
Jan 7
Also, these tokens are installed by Puppet, not embedded in the image.
Jan 7
That is what I was assuming: that they were part of the larger credential install that takes place within Puppet. There have been no recent changes to credentials on my part. The easy fix would be to resolve the credential issue and/or add a new set to restore access today. The alternative is Luci-Notify, but it would mean a change in the reports they are getting today. That may or may not be important. -- Mike
Jan 7
There is a whole bunch of nearly obsolete logic that could get deleted if we switch to LUCI notify. From a code health point of view, that change is the right thing to do. The new emails would also be more likely to be useful.
Jan 7
Oh... also, it's failing because of GCE Test failures, then blowing up with the token problem when it tries to send an email about the build failure.
Jan 7
I'm going to take over this bug for email notifications, but filed https://crbug.com/919647 for the GCE test failures.
Jan 7
A number of ChromeOS build types use the "health_alert_recipients" field to generate email notifications for failed builds, but those notifications seem to be failing to send, which means the builds now fail for multiple reasons.

Affected build configs (found by scanning chromeos_config):
- master-toolchain
- lakitu*-incremental
- lakitu*-release
- lakitu-full

It looks like this would also affect master-chromium-pfq, except that "health_threshold" is not set there, so the emails are never generated.

Note: "health_threshold" is never set to more than one for any of these builds, so the "failure streak" logic used to support it is no longer needed.
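For reference, a simplified, hypothetical sketch of the decision described above, not the actual Chromite implementation; the function and dict key lookups are illustrative. It shows why the GS-backed "failure streak" counter is redundant once health_threshold never exceeds one.

def should_send_health_alert(build_config, consecutive_failures):
    """Decide whether a failed build should email health_alert_recipients."""
    threshold = build_config.get('health_threshold', 0)
    recipients = build_config.get('health_alert_recipients', [])
    if not recipients or threshold <= 0:
        # master-chromium-pfq lands here today: recipients are configured but
        # health_threshold is unset, so no alert is ever generated.
        return False
    # With threshold == 1 (the only value currently in use), any single
    # failure alerts, so tracking consecutive_failures across builds in GS
    # adds nothing.
    return consecutive_failures >= threshold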
Jan 11
This issue has an owner, a component, and a priority, but is still listed as untriaged or unconfirmed. By definition, this bug is triaged. Changing status to "assigned". Please reach out to me if you disagree with how I've done this.
Comment 1 by zamorzaev@chromium.org, Nov 1
Components: Infra>Client>ChromeOS>Test
Owner: wonderfly@chromium.org
I'm also seeing the following failure in cloud_StackdriverServices, which appears in a fraction of GCETest runs seemingly at random:

pre-test siteration sysinfo error:
traceback:0013| Traceback (most recent call last):
traceback:0013|   File "/usr/local/autotest/common_lib/log.py", line 25, in decorated_func
traceback:0013|     fn(*args, **dargs)
traceback:0013|   File "/usr/local/autotest/bin/base_sysinfo.py", line 399, in log_before_each_iteration
traceback:0013|     board = utils.get_board_with_frequency_and_memory()
traceback:0013|   File "/usr/local/autotest/bin/utils.py", line 2103, in get_board_with_frequency_and_memory
traceback:0013|     frequency = int(round(get_cpu_max_frequency() * 1e-8)) * 0.1
traceback:0013|   File "/usr/local/autotest/bin/utils.py", line 1914, in get_cpu_max_frequency
traceback:0013|     assert max_frequency > 1e8, 'Unreasonably low CPU frequency.'
traceback:0013| AssertionError: Unreasonably low CPU frequency.

Daniel, can you please take a look?
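Following the arithmetic in that traceback (illustrative sketch below, not autotest's actual helpers; board_frequency_ghz is a made-up name): get_cpu_max_frequency() is expected to return Hz, and the caller rounds it to tenths of a GHz, so the assertion fires whenever the reported value is 1e8 Hz (100 MHz) or less -- for example, if a reading comes back in kHz rather than Hz, or a sysfs read transiently returns a tiny value (an assumption about the cause, not something confirmed in the logs).

def board_frequency_ghz(max_frequency_hz):
    # Same check and rounding as utils.py lines 1914 and 2103 in the trace.
    assert max_frequency_hz > 1e8, 'Unreasonably low CPU frequency.'
    return int(round(max_frequency_hz * 1e-8)) * 0.1

print(board_frequency_ghz(2.4e9))   # ~2.4: a sane 2.4 GHz reading passes.
try:
    board_frequency_ghz(2.4e6)      # a value that looks like kHz, not Hz
except AssertionError as e:
    print(e)                        # 'Unreasonably low CPU frequency.'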