Issue 786379

Issue metadata

Status: WontFix
Owner: ----
Closed: Sep 26
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



transient logdog errors have caused a chromium.perf:Win x64 Builder to run for more than a month

Project Member Reported by jbudorick@chromium.org, Nov 17 2017

Issue description

https://luci-milo.appspot.com/buildbot/chromium.perf/Win%20x64%20Builder/107222 is trapped in a logdog infinite loop.

[W2017-09-27T15:53:56.964821-07:00 628 0 pubsubOutput.go:220] TRANSIENT error publishing messages; retrying... {"error":"rpc error: code = 13 desc = connection error: desc = \"transport: read tcp 192.168.110.139:55917->172.217.1.138:443: wsarecv: An existing connection was forcibly closed by the remote host.\"", "count":1, "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}

Killing the build is simple. I'm wondering whether we can stop having logdog indefinitely retry on transient failures, though, because this isn't the first time I've seen this (though it's definitely the most egregious).
 

Comment 1 by d...@chromium.org, Nov 17 2017

So realistically we have a few problems:

- LogDog needs to report failures, else we can't see when things fail.
- LogDog currently never gives up on sending logs, since it is the sole owner of log data and giving up means losing data permanently.


Transient errors typically resolve themselves eventually, so this isn't usually a problem.
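
To make the trade-off concrete, here is a minimal sketch of a retry loop with a fixed delay, written in Python for brevity. It is not LogDog's actual code: `TransientError`, `publish_with_retry`, and all parameters are hypothetical names. With `deadline_s=None` it retries forever, as described above; with a deadline it eventually re-raises, which is exactly the point where unsent data would be lost.

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def publish_with_retry(
    publish: Callable[[], None],
    delay_s: float = 30.0,
    deadline_s: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call publish() until it succeeds and return the number of attempts.

    deadline_s=None means "never give up". Otherwise the last
    TransientError is re-raised once deadline_s seconds have elapsed.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            publish()
            return attempts
        except TransientError:
            # Give up only if a deadline was configured and has passed.
            if deadline_s is not None and clock() - start >= deadline_s:
                raise
            sleep(delay_s)
```

The `sleep` and `clock` arguments are injectable only so the loop can be exercised without waiting for real time to pass.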

BuildBot has three ways of terminating builds:
1) Maximum time without output. This is not going to work here: LogDog needs to report status, so it keeps producing output when it fails like this.
2) Maximum per-step execution time.
3) Maximum per-build execution time.

On Swarming, and on *most* builders, (2) and (3) kick in and stop this sort of thing from happening. LogDog supersedes (2) b/c it runs even after steps finish, so the simple and standard solution here would be to configure a maximum timeout for the overall build.
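
As an illustration of option (3), a per-build execution limit can be sketched as a wrapper that kills the build process once a wall-clock budget is exceeded. This is not BuildBot's or Swarming's implementation; `run_with_build_timeout` and `max_build_s` are hypothetical names, and the sketch only kills the direct child process, not a whole process tree.

```python
import subprocess
import sys
from typing import Sequence


def run_with_build_timeout(cmd: Sequence[str], max_build_s: float) -> str:
    """Run cmd and kill it if it runs longer than max_build_s seconds.

    Returns "finished" if the process exited on its own, "timed_out"
    if it had to be killed.
    """
    proc = subprocess.Popen(list(cmd))
    try:
        proc.wait(timeout=max_build_s)
        return "finished"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # Reap the killed process so it does not linger.
        return "timed_out"


if __name__ == "__main__":
    # A stand-in "build" that would otherwise run for 30 seconds.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_build_timeout(stuck, max_build_s=1.0))
```

Because the limit is enforced from outside the process, it applies no matter how much output the stuck process keeps producing.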

Internally, I would caution strongly against having LogDog actually give up sending logs, since that causes irrecoverable data loss. It might be worth having LogDog's logger detect duplicate messages, but that's a fair amount of work and there is an art to doing that right. I think the right solution here is to just have the builder enforce an upper limit.
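
For reference, the simplest form of duplicate detection is a logging filter that drops a message when it is identical to the previous one. The sketch below uses Python's standard `logging` module, not LogDog's logger, and `CollapseRepeats` is a hypothetical name. Exact matching is the naive case; messages whose text changes between repeats would need normalizing first, which is part of why doing this well takes more care.

```python
import logging
from typing import Optional


class CollapseRepeats(logging.Filter):
    """Drop a record whose formatted message equals the previous one."""

    def __init__(self) -> None:
        super().__init__()
        self._last: Optional[str] = None
        self.suppressed = 0  # How many records have been dropped so far.

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        if msg == self._last:
            self.suppressed += 1
            return False  # Consecutive duplicate: do not emit.
        self._last = msg
        return True
```

Attaching it with `handler.addFilter(CollapseRepeats())` suppresses only consecutive repeats, so a different message in between lets the same message through again.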
Status: Available (was: Untriaged)
Status: WontFix (was: Available)
This is mostly an issue with LogDog on BuildBot. While the root cause is unknown, this is known not to happen on LUCI, so the solution is to migrate to LUCI.