
Issue 891723


Issue metadata

Status: Fixed
Owner:
Closed: Oct 21
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




LUCI Notify not notifying.

Project Member Reported by dgarr...@chromium.org, Oct 3

Issue description

David Riley reports that he got notifications for this tryjob:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933821779979522544

Finish: 2018-10-01 18:42

But not for these two:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933738815060322000

Finish: 2018-10-02 17:06

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933742958483619088

Finish: 2018-10-02 15:31

Parameters_json for the last two does contain the notification request tag:

\"email_notify\": [{\"email\": \"davidriley@chromium.org\", \"template\": \"default\"}]


Is there any chance that LUCI Notify was/is down?
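For anyone checking whether a build actually carried a notification request, the email_notify tag quoted above can be decoded with a few lines of Go. This is only a sketch: the struct names and the parseEmailNotify helper are hypothetical, and it assumes the email_notify key sits at the top level of the JSON object, which may not match the real nesting inside parameters_json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EmailNotify mirrors one entry of the "email_notify" list quoted in the report.
type EmailNotify struct {
	Email    string `json:"email"`
	Template string `json:"template"`
}

// buildParams is a hypothetical wrapper. Only the "email_notify" key comes
// from the report; its top-level placement is an assumption.
type buildParams struct {
	EmailNotify []EmailNotify `json:"email_notify"`
}

// parseEmailNotify extracts the notification requests from a JSON string.
func parseEmailNotify(parametersJSON string) ([]EmailNotify, error) {
	var p buildParams
	if err := json.Unmarshal([]byte(parametersJSON), &p); err != nil {
		return nil, err
	}
	return p.EmailNotify, nil
}

func main() {
	raw := `{"email_notify": [{"email": "davidriley@chromium.org", "template": "default"}]}`
	reqs, err := parseEmailNotify(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range reqs {
		fmt.Printf("notify %s using template %q\n", r.Email, r.Template)
	}
}
```

An empty result here would point at the request side; a non-empty one, as in the two builds above, points at the notification service.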
 
Cc: manojgupta@chromium.org
This only applies to email notifications for builds triggered by "cros tryjob".

No effect on updates to the CLs.
Thanks. I was misreading the status anyway. I guess pre-cq is just taking longer these days, as it's running more VM tests.
Cc: groeck@chromium.org
Summary: LUCI Notify firing intermitently (was: LUCI Notify firing intermitantly)
Is this really an "intermittent" failure? groeck and I haven't received any notifications all day. I have 2 jobs that show up in go/legoland-tryjobs as PASSED, but I got no email for either.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933660091050629376
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933744143298682096
Not just today. I have started a substantial number of tryjobs since yesterday (at least 20); none of them triggered a notification.

To add to the stats: same here today. 6 trybots, 0 emails.
Cc: jclinton@chromium.org
Labels: -Pri-3 Pri-1
Let's ping the current trooper. I probably don't have this in the right component, but making my best guess.
Cc: no...@chromium.org
Labels: Infra-Trooper
Owner: ----
Summary: LUCI Notify not notifying. (was: LUCI Notify firing intermitently)
Labels: -Pri-1 Pri-0
Since it appears to be an outage, I'm raising the priority enough to reach the trooper queue. However, email notifications aren't really panic-inducing.
Components: -Infra>Platform>Config Infra
Labels: -Infra-Trooper Infra-Troopers
Labels: Foundation-Troopers
Owner: no...@chromium.org
Status: Started (was: Untriaged)
Project Member

Comment 16 by bugdroid1@chromium.org, Oct 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/93902eae21c2660a710c5a6b57d50e4c63157573

commit 93902eae21c2660a710c5a6b57d50e4c63157573
Author: Nodir Turakulov <nodir@google.com>
Date: Wed Oct 03 23:50:48 2018

[luci_notify] Dedup (recipient, template, build)

Ensure that task deduplication keys are unique in a batch.

Bug:  891723 
Change-Id: Ic2b64301ea021c0b581ee57be6db418414d72e30
Reviewed-on: https://chromium-review.googlesource.com/c/1260055
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/93902eae21c2660a710c5a6b57d50e4c63157573/luci_notify/notify/notify.go
[modify] https://crrev.com/93902eae21c2660a710c5a6b57d50e4c63157573/luci_notify/notify/notify_test.go
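The commit message above only says that task deduplication keys must be unique within a batch; the actual change to notify.go is not shown in this thread. As a rough illustration of the idea, and not the real luci_notify code, a batch could be filtered on the (recipient, template, build) triple like this. The notification type and uniqueInBatch helper are hypothetical names.

```go
package main

import "fmt"

// notification is a hypothetical stand-in for one pending email task.
// All fields are comparable, so the struct can be used as a map key.
type notification struct {
	Recipient string
	Template  string
	BuildID   int64
}

// uniqueInBatch keeps the first occurrence of each
// (recipient, template, build) triple and preserves input order.
func uniqueInBatch(batch []notification) []notification {
	seen := make(map[notification]struct{}, len(batch))
	out := make([]notification, 0, len(batch))
	for _, n := range batch {
		if _, dup := seen[n]; dup {
			continue // same triple already queued in this batch
		}
		seen[n] = struct{}{}
		out = append(out, n)
	}
	return out
}

func main() {
	batch := []notification{
		{"groeck@chromium.org", "default", 1},
		{"groeck@chromium.org", "default", 1}, // duplicate of the first
		{"davidriley@chromium.org", "default", 1},
	}
	fmt.Println(len(uniqueInBatch(batch)), "unique tasks")
}
```

Using the struct itself as the map key avoids building a string key, so no separator character can cause two different triples to collide.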

status: mitigated
the app is currently processing its backlog, going to send a lot of messages
Status: Fixed (was: Started)
Project Member

Comment 19 by bugdroid1@chromium.org, Oct 4

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/1bfb803523683e3e0c5db5888b933bfc06ce034d

commit 1bfb803523683e3e0c5db5888b933bfc06ce034d
Author: Nodir Turakulov <nodir@google.com>
Date: Thu Oct 04 00:11:48 2018

[luci_notify] Add tsmon cron job

Add a cronjob that exports metrics

R=vadimsh@chromium.org

Bug:  891723 
Change-Id: Ie555ae3e5b7a9124e528591c0fa370ed49ab79dd
Reviewed-on: https://chromium-review.googlesource.com/c/1260058
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Andrii Shyshkalov <tandrii@chromium.org>
Auto-Submit: Nodir Turakulov <nodir@chromium.org>
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>

[modify] https://crrev.com/1bfb803523683e3e0c5db5888b933bfc06ce034d/luci_notify/frontend/cron.yaml

Sorry for the delayed fix. The entire Foundation team, except tandrii, was offsite today.
The reason we didn't notice the outage is that Monarch metrics were not being exported (fixed by c19). The alerting in luci_notify was added only a few days ago, and this is the first time we should have been alerted but were not.

We will improve our alerting to verify that metrics are being exported.
Status: Started (was: Fixed)
Actually, I will reopen this bug so that I don't forget to improve alerting. But the outage is over.
Labels: -Pri-0 Pri-1
Thanks!
Cc: ddoman@chromium.org
ddoman, I cannot find a way to alert on _absence_ of metric data using GMon. How do I express such a predicate?
You can use Join() or JoinWithLiteralTable() with a default value that is high enough to trigger a given alert.
However, to cover the particular case described in this ticket, I'd suggest using a presence alert: http://shortn/_dwia7shDFC

If a given service reports at least presence metric data, doesn't that imply that ts-mon is configured properly for that service?
However, if you want to make sure that monitoring data points are continuously reported under a certain metric path, then you need to add Join() or JoinWithLiteralTable() to the mash expression of an existing alert, so that it catches any period in which no data points were reported.
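The default-value idea above can be illustrated outside of GMon. The Go sketch below is not mash syntax and does not reproduce the semantics of Join() or JoinWithLiteralTable(); it only shows the principle, with hypothetical helper names: every expected interval with no sample gets a default that is above the alert threshold, so an absence of data trips the same alert as a bad value.

```go
package main

import "fmt"

// fillMissing returns one value per expected timestamp, substituting def
// wherever no sample was reported for that timestamp.
func fillMissing(samples map[int64]float64, expected []int64, def float64) []float64 {
	out := make([]float64, len(expected))
	for i, ts := range expected {
		if v, ok := samples[ts]; ok {
			out[i] = v
		} else {
			out[i] = def
		}
	}
	return out
}

// shouldAlert reports whether any value is above the threshold.
func shouldAlert(values []float64, threshold float64) bool {
	for _, v := range values {
		if v > threshold {
			return true
		}
	}
	return false
}

func main() {
	const threshold = 100.0
	const missingDefault = threshold + 1 // high enough to trigger the alert

	// Samples were reported at t=0 and t=60, but nothing arrived at t=120.
	samples := map[int64]float64{0: 3, 60: 5}
	expected := []int64{0, 60, 120}

	values := fillMissing(samples, expected, missingDefault)
	fmt.Println("alert:", shouldAlert(values, threshold))
}
```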
Status: Fixed (was: Started)
Status: Unconfirmed (was: Fixed)
Right after (or even before) this was marked as fixed, LUCI decided not to notify again. It worked for a while, meaning it must have been un-fixed. Reopening.
Example job where no notification was received:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932129044875055648
from the server logs:

Received task: {"type":"internal.EmailTask","body":{"recipients":["groeck@chromium.org"],"subject":"[Build Status] octopus-paladin-tryjob: FAILURE","body":"\n\n\nBuild \u003ca href=\"https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932129044875055648\"\u003e\noctopus-paladin-tryjob\n\u003c/a\u003e on master\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eResult:\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003eFAILURE\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eBuilder:\u003c/td\u003e\n    \u003ctd\u003eTry\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eCreated by:\u003c/td\u003e\n    \u003ctd\u003euser:groeck@google.com\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eCreated at:\u003c/td\u003e\n    \u003ctd\u003e2018-10-20 17:02:41.284667 \u0026#43;0000 UTC\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eStart Time:\u003c/td\u003e\n    \u003ctd\u003e2018-10-20 17:02:45.175818 \u0026#43;0000 UTC\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eFinished at:\u003c/td\u003e\n    \u003ctd\u003e2018-10-20 17:13:54.649894 \u0026#43;0000 UTC\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003cp\u003e\nAll of your tryjobs\n\u003ca href=\"https://cros-goldeneye.corp.google.com/chromeos/legoland/builderSummary?buildConfig\u0026builderGroups=tryjob\u0026email=groeck%40chromium.org\"\u003ehere\u003c/a\u003e\nvia \u003ca href=\"https://cros-goldeneye.corp.google.com/chromeos/legoland\"\u003ego/legoland\u003c/a\u003e.\n\u003c/p\u003e"}}

which means the app has successfully sent the email
I see 9 similar log messages for groeck@chromium.org between 01:20 AM and 02:09 AM today. All of them appear successful. Did you not receive any of those emails?
just sent a test tryjob and received an email https://screenshot.googleplex.com/3gsCE6Tbwms
Status: Fixed (was: Unconfirmed)
Yes, now it is working again. Yesterday it didn't work. No idea what is going on. I'll mark as fixed again and open another one if it happens again.
