
Issue 734803

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




add a pre-cq-launcher tick rate alert

Project Member Reported by akes...@chromium.org, Jun 19 2017

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/pre-cq-launcher/builds/9415


The build log is full of messages like:
[W 2017-06-19 16:27:27] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:27] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:32] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:37] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:43] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:44] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:49] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:49] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:54] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:27:56] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:28:00] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W 2017-06-19 16:28:02] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}


Unsurprisingly, since publishing to logdog was failing, these messages aren't in the logdog version of the logs.
 
Our dashboards show this, but no alerts were received (we have no pre-cq alerts)

https://viceroy.corp.google.com/chromeos/pre-cq
Going to restart the pre-cq launcher and see if that fixes things.
Labels: -Pri-0 Pri-1
^ got pre-cq working again. Demoting to P1. Outage is over.

Possible preventative measures:
 - pre-cq tick rate alerts
 - root cause the pubsub publishing failure (hence Infra label on this bug)
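
A rough sketch of what a tick rate alert condition could look like, for discussion only. It assumes the launcher's tick timestamps are available as a list; the function names, the 30-minute window, and the 0.1 ticks/minute threshold are all hypothetical and are not taken from the actual monitoring config.

```python
from datetime import datetime, timedelta


def tick_rate_per_minute(tick_times, now, window):
    """Return ticks per minute observed in the interval (now - window, now]."""
    start = now - window
    recent = [t for t in tick_times if start < t <= now]
    return len(recent) / (window.total_seconds() / 60.0)


def should_alert(tick_times, now, window=timedelta(minutes=30), min_rate=0.1):
    """Return True when the observed tick rate drops below min_rate.

    window and min_rate are placeholder values (assumptions), to be tuned
    against the launcher's normal tick cadence.
    """
    return tick_rate_per_minute(tick_times, now, window) < min_rate


if __name__ == "__main__":
    now = datetime(2017, 6, 19, 17, 0, 0)
    # Launcher ticking every 2 minutes: no alert expected.
    healthy = [now - timedelta(minutes=m) for m in range(0, 30, 2)]
    # Launcher stuck for the last 40 minutes: alert expected.
    stalled = [now - timedelta(minutes=m) for m in range(40, 60, 2)]
    print(should_alert(healthy, now), should_alert(stalled, now))
```

A rate over a sliding window, rather than "time since last tick", keeps a single slow tick from firing the alert while still catching a launcher that is stuck retrying.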
Labels: Chase-Pending
Chase-Pending.

Justification: adding alerts is a well-scoped preventative measure against P0 outages.
Cc: d...@chromium.org pho...@chromium.org dgarr...@chromium.org nxia@chromium.org
+logdog people

Comment 6 by d...@chromium.org, Jun 20 2017

Pub/Sub uptime and connectivity are prerequisites. Nothing in the logs suggests anything went wrong on our end, and it was retrying consistently. This suggests either a GCE outage or an acute Pub/Sub service outage, both of which are beyond our control.
Looks like it's happening again?
Never mind, things are fine
Labels: -Pri-1 Pri-3
Labels: -Pri-3 Pri-1
Summary: add a pre-cq-launcher tick rate alert (was: chromeos pre-cq-launcher stuck on logdog pubsub calls)
pre-cq-launcher is a class 1 service. We need to shorten its outages. Re-upping to P1.
Components: -Infra Infra>Platform>LogDog
Labels: -Chase-Pending Chase
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)
Components: -Infra>Platform>LogDog
Status: Fixed (was: Assigned)
