LUCI Notify not notifying.
Issue description
David Riley reports that he got notifications for this tryjob:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933821779979522544
Finish: 2018-10-01 18:42
But not for these two:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933738815060322000
Finish: 2018-10-02 17:06
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933742958483619088
Finish: 2018-10-02 15:31
Parameters_json for the last two does contain the notification request tag:
\"email_notify\": [{\"email\": \"davidriley@chromium.org\", \"template\": \"default\"}]
Is there any chance that LUCI Notify was/is down?
,
Oct 3
Is this affecting pre-cq too, or just manual tryjobs? I can't figure out whether the pre-cq is actually done here: https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1258409 AFAICT, all the jobs have PASSed... or is this one stuck? https://chromeos-cl-viewer-ui.googleplex.com/cl_status/chromium-review.googlesource.com/1258409/2 https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933672262472134560
,
Oct 3
This only applies to email notifications for builds triggered by "cros tryjob". No effect on updates to the CLs.
,
Oct 3
Thanks. I was misreading the status anyway. I guess pre-cq is just taking longer these days, as it's running more VM tests.
,
Oct 3
Is this really an "intermittent" failure? groeck and I haven't been getting notifications all day. I have 2 jobs that show up in go/legoland-tryjobs as PASSed but produced no email. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933660091050629376 https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933744143298682096
,
Oct 3
Not just today. I have started a substantial number of tryjobs since yesterday (at least 20); none of them triggered a notification.
,
Oct 3
To add to the stats: same here today. 6 trybots, 0 emails.
,
Oct 3
Let's ping the current trooper. I probably don't have this in the right component, but making my best guess.
,
Oct 3
,
Oct 3
,
Oct 3
Since it appears to be an outage, raising enough to reach the trooper queue. However, email notifications aren't really panic inducing.
,
Oct 3
,
Oct 3
,
Oct 3
,
Oct 3
,
Oct 3
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/93902eae21c2660a710c5a6b57d50e4c63157573

commit 93902eae21c2660a710c5a6b57d50e4c63157573
Author: Nodir Turakulov <nodir@google.com>
Date: Wed Oct 03 23:50:48 2018

[luci_notify] Dedup (recipient, template, build)

Ensure that task deduplication keys are unique in a batch.

Bug: 891723
Change-Id: Ic2b64301ea021c0b581ee57be6db418414d72e30
Reviewed-on: https://chromium-review.googlesource.com/c/1260055
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/93902eae21c2660a710c5a6b57d50e4c63157573/luci_notify/notify/notify.go
[modify] https://crrev.com/93902eae21c2660a710c5a6b57d50e4c63157573/luci_notify/notify/notify_test.go
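For readers who have not opened the change: the commit title describes deduplicating notifications by (recipient, template, build). The snippet below is only an illustrative sketch of that idea, not the code from notify.go. The `notification` type, `dedupKey`, and `dedup` are hypothetical names made up for this example.

```go
package main

import "fmt"

// notification is a hypothetical stand-in for one pending email task.
// It is NOT the type used in luci_notify.
type notification struct {
	Recipient string
	Template  string
	BuildID   int64
}

// dedupKey derives one key per (recipient, template, build) triple.
// %q quotes the strings so that a separator inside a field cannot
// make two different triples collide.
func dedupKey(n notification) string {
	return fmt.Sprintf("%d-%q-%q", n.BuildID, n.Template, n.Recipient)
}

// dedup drops entries whose key was already seen in the batch,
// keeping the first occurrence and preserving order.
func dedup(batch []notification) []notification {
	seen := make(map[string]struct{}, len(batch))
	out := make([]notification, 0, len(batch))
	for _, n := range batch {
		k := dedupKey(n)
		if _, ok := seen[k]; ok {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, n)
	}
	return out
}

func main() {
	batch := []notification{
		{Recipient: "a@example.com", Template: "default", BuildID: 1},
		{Recipient: "a@example.com", Template: "default", BuildID: 1},
		{Recipient: "b@example.com", Template: "default", BuildID: 1},
	}
	for _, n := range dedup(batch) {
		fmt.Println(dedupKey(n))
	}
}
```

With this approach every key in the resulting batch is unique, which is the property the commit message asks for.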
,
Oct 3
Status: mitigated. The app is currently processing its backlog and is going to send a lot of messages.
,
Oct 4
,
Oct 4
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/1bfb803523683e3e0c5db5888b933bfc06ce034d

commit 1bfb803523683e3e0c5db5888b933bfc06ce034d
Author: Nodir Turakulov <nodir@google.com>
Date: Thu Oct 04 00:11:48 2018

[luci_notify] Add tsmon cron job

Add a cronjob that exports metrics

R=vadimsh@chromium.org

Bug: 891723
Change-Id: Ie555ae3e5b7a9124e528591c0fa370ed49ab79dd
Reviewed-on: https://chromium-review.googlesource.com/c/1260058
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Andrii Shyshkalov <tandrii@chromium.org>
Auto-Submit: Nodir Turakulov <nodir@chromium.org>
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>

[modify] https://crrev.com/1bfb803523683e3e0c5db5888b933bfc06ce034d/luci_notify/frontend/cron.yaml
,
Oct 4
Sorry for the delayed fix. The entire Foundation team, except tandrii, was offsite today. The reason we didn't notice the outage is that Monarch metrics were not being exported (fixed by c19). The alerting in luci_notify was added only a few days ago, and this is the first time we should have been alerted but were not. We will improve our alerting to ensure that metrics are exported.
,
Oct 4
Actually, I will reopen this bug so that we don't forget to improve alerting. But the outage is over.
,
Oct 4
,
Oct 4
Thanks!
,
Oct 4
ddoman, I cannot find a way to alert on _absence_ of metric data using GMon. How do I express such a predicate?
,
Oct 4
You can use Join() or JoinWithLiteralTable() with a default value that is high enough to trigger a given alert. However, to cover the particular case described in this ticket, I'd suggest using a presence alert: http://shortn/_dwia7shDFC If a given service at least reports presence metric data, that implies ts-mon is configured properly for that service. However, if you want to make sure that monitoring data points are continuously reported under a certain metric path, then you need to add Join() or JoinWithLiteralTable() to the mash expression of an existing alert to catch any period in which no data points were reported.
,
Oct 17
,
Oct 18
,
Oct 21
Right after (or even before) this was marked as fixed, LUCI decided not to notify again. It worked for a while, so it must have become un-fixed. Reopening. Example job where no notification was received: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932129044875055648
,
Oct 21
from the server logs:
Received task: {"type":"internal.EmailTask","body":{"recipients":["groeck@chromium.org"],"subject":"[Build Status] octopus-paladin-tryjob: FAILURE","body":"\n\n\nBuild \u003ca href=\"https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932129044875055648\"\u003e\noctopus-paladin-tryjob\n\u003c/a\u003e on master\n\n\u003ctable\u003e\n \u003ctr\u003e\n \u003ctd\u003eResult:\u003c/td\u003e\n \u003ctd\u003e\u003cb\u003eFAILURE\u003c/b\u003e\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003eBuilder:\u003c/td\u003e\n \u003ctd\u003eTry\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003eCreated by:\u003c/td\u003e\n \u003ctd\u003euser:groeck@google.com\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003eCreated at:\u003c/td\u003e\n \u003ctd\u003e2018-10-20 17:02:41.284667 \u0026#43;0000 UTC\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003eStart Time:\u003c/td\u003e\n \u003ctd\u003e2018-10-20 17:02:45.175818 \u0026#43;0000 UTC\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003eFinished at:\u003c/td\u003e\n \u003ctd\u003e2018-10-20 17:13:54.649894 \u0026#43;0000 UTC\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003cp\u003e\nAll of your tryjobs\n\u003ca href=\"https://cros-goldeneye.corp.google.com/chromeos/legoland/builderSummary?buildConfig\u0026builderGroups=tryjob\u0026email=groeck%40chromium.org\"\u003ehere\u003c/a\u003e\nvia \u003ca href=\"https://cros-goldeneye.corp.google.com/chromeos/legoland\"\u003ego/legoland\u003c/a\u003e.\n\u003c/p\u003e"}}
which means the app has successfully sent the email
,
Oct 21
I see 9 similar log messages for groeck@chromium.org between 01:20 AM and 02:09 AM today. All of them appear successful. You didn't receive any of those?
,
Oct 21
just sent a test tryjob and received an email https://screenshot.googleplex.com/3gsCE6Tbwms
,
Oct 21
Yes, now it is working again. Yesterday it didn't work. No idea what is going on. I'll mark as fixed again and open another one if it happens again.
Comment 1 by dgarr...@chromium.org
, Oct 3