
Issue 788903 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




luci-logdog-dev collector occasionally stops receiving pubsub messages

Project Member Reported by vadimsh@chromium.org, Nov 27 2017

Issue description

https://viceroy.corp.google.com/chrome_infra/Appengine/luci_logdog_dev?duration=2d&utc_end=1511821672.46

Among other things, it breaks Milo, causing it to display "LogDog stream not found Job likely failed to start" (while the Swarming job thinks everything is totally fine).

IIRC, this is the third occurrence on -dev. It has never happened on prod. Once it happens, it never "heals" itself.
Nothing obviously wrong in the logs: https://paste.googleplex.com/4983477036384256

At some point (I2017-11-26T18:20:56.59) messages just stop coming.
I don't know what's wrong. My gut feeling is that there's a bug in the Cloud PubSub library, or in the way we use it, that causes it to stall during some unusual network event (such as an unexpectedly closed connection; we've seen eternally hanging TLS connections to Cloud before). But I can't prove or disprove this; we need more logging.

Restarting the collector pod now. 
Cc: estaab@chromium.org iannucci@chromium.org tandrii@chromium.org
It is happening again: https://viceroy.corp.google.com/chrome_infra/Appengine/luci_logdog_dev?duration=2d&utc_end=1512001490.08

:(

I guess it's time to document how to restart collectors...

My infra_internal gclient checkout is at ~/infra-git:

$ cd ~/infra-git/infra_internal/services/deployments/luci-logdog-dev
$ eval `~/infra-git/infra/go/env.py`
$ gke.py -C ./services.yaml kubectl -K collector -- get pod
<note pod name, e.g. collector-2124089204-8png0>
$ gke.py -C ./services.yaml kubectl -K collector -- get pod collector-2124089204-8png0 -o yaml > pod.yaml
$ gke.py -C ./services.yaml kubectl -K collector -- replace --force -f pod.yaml

(I'm not sure it's the best way, but it works.)

Comment 5 by d...@chromium.org, Nov 30 2017

Have you tried updating the cloud library and pushing a new version? If there is a bug in the Pub/Sub subscriber, perhaps it's been fixed upstream?

It's also odd to me that this isn't hitting prod. I thought iannucci@ deployed the same version to both.
Potential culprit: https://github.com/GoogleCloudPlatform/google-cloud-go/issues/740

I'm updating the library now; the newer version has fixes.

Comment 8 by bugdroid1@chromium.org, Nov 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/91ca088ec80539f5b371d15ef6294f238078365f

commit 91ca088ec80539f5b371d15ef6294f238078365f
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Thu Nov 30 19:59:08 2017

Roll go dependencies and luci-go DEPS.

Most notable changes are in cloud.google.com/go. It contains a potential fix
for crbug.com/788903. It also contains a breaking change to BigQuery API.

infra/go/src/go.chromium.org/luci:
5f9512aca [bqschemaupdater] Fix it after cloud.google.com/go/bigquery API change.
632469d22 [server] Enable pprof by default on all servers.

R=nodir@chromium.org, tandrii@chromium.org
BUG= 788903 

Change-Id: Ic2d0fa3de96c6eb3bfdfd844642ea7422f0e06ce
Reviewed-on: https://chromium-review.googlesource.com/801654
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/91ca088ec80539f5b371d15ef6294f238078365f/go/deps.lock
[modify] https://crrev.com/91ca088ec80539f5b371d15ef6294f238078365f/DEPS

Owner: vadimsh@chromium.org
Status: Assigned (was: Untriaged)
Updating collector and archivist on logdog-dev to pick up new code.
NextAction: 2017-12-07
Updated all services (GAE, Flex and GKE) on luci-logdog-dev.

Now let's run it for a week. If the issue doesn't reappear, we can consider it fixed. We'll then have to update prod LogDog to pick up the fix.
The NextAction date has arrived: 2017-12-07
NextAction: ----
It worked for a week without problems.

I'll update -dev again to pick up other unrelated changes, let it run for some time, and then deploy everything to prod.
Status: Fixed (was: Assigned)
