luci-logdog-dev collector occasionally stops receiving pubsub messages
Issue description
https://viceroy.corp.google.com/chrome_infra/Appengine/luci_logdog_dev?duration=2d&utc_end=1511821672.46
It breaks Milo, among other things, causing it to display "LogDog stream not found Job likely failed to start" (while the swarming job thinks everything is totally fine).
,
Nov 27 2017
Nothing obviously wrong in the logs: https://paste.googleplex.com/4983477036384256 At some point (I2017-11-26T18:20:56.59) messages just stop coming.
,
Nov 28 2017
I don't know what's wrong. My gut feeling is that there's a bug in the Cloud PubSub library, or in the way we use it, that causes it to stall during some unusual network event (such as an unexpectedly closed connection; we've seen eternally hanging TLS connections to Cloud before). But I can't prove or disprove this without more logging. Restarting the collector pod now.
,
Nov 30 2017
It is happening again: https://viceroy.corp.google.com/chrome_infra/Appengine/luci_logdog_dev?duration=2d&utc_end=1512001490.08 :(
I guess it's time to document how to restart collectors...
My infra_internal gclient checkout is at ~/infra-git:
$ cd ~/infra-git/infra_internal/services/deployments/luci-logdog-dev
$ eval `~/infra-git/infra/go/env.py`
$ gke.py -C ./services.yaml kubectl -K collector -- get pod
<note pod name, e.g. collector-2124089204-8png0>
$ gke.py -C ./services.yaml kubectl -K collector -- get pod collector-2124089204-8png0 -o yaml > pod.yaml
$ gke.py -C ./services.yaml kubectl -K collector -- replace --force -f pod.yaml
(I'm not sure it's the best way, but it works.)
,
Nov 30 2017
Have you tried updating the cloud library and pushing a new version? If there is a bug in the Pub/Sub subscriber, perhaps it's been fixed upstream? It's also odd to me that this isn't hitting prod. I thought iannucci@ deployed the same version to both.
,
Nov 30 2017
Potential culprit: https://github.com/GoogleCloudPlatform/google-cloud-go/issues/740 I'm updating the library now, it has fixes.
,
Nov 30 2017
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/5f9512acaf69863b47ea7485fded4c2d5d73edfb
commit 5f9512acaf69863b47ea7485fded4c2d5d73edfb
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Thu Nov 30 19:45:11 2017
[bqschemaupdater] Fix it after cloud.google.com/go/bigquery API change.
CreateTableOption doesn't exist anymore. Instead callers of Table.Create should pass TableMetadata.
R=nodir@chromium.org
BUG= 788903
Change-Id: Ieb533b4d2a8d38a2ea4bba7ef39f5e4ed684d6f5
Reviewed-on: https://chromium-review.googlesource.com/801261
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
[modify] https://crrev.com/5f9512acaf69863b47ea7485fded4c2d5d73edfb/tools/cmd/bqschemaupdater/dry_run_table_store.go
[modify] https://crrev.com/5f9512acaf69863b47ea7485fded4c2d5d73edfb/tools/cmd/bqschemaupdater/dry_run_table_store_test.go
[modify] https://crrev.com/5f9512acaf69863b47ea7485fded4c2d5d73edfb/tools/cmd/bqschemaupdater/local_table_store.go
[modify] https://crrev.com/5f9512acaf69863b47ea7485fded4c2d5d73edfb/tools/cmd/bqschemaupdater/local_table_store_test.go
[modify] https://crrev.com/5f9512acaf69863b47ea7485fded4c2d5d73edfb/tools/cmd/bqschemaupdater/main.go
[modify] https://crrev.com/5f9512acaf69863b47ea7485fded4c2d5d73edfb/tools/cmd/bqschemaupdater/main_test.go
[modify] https://crrev.com/5f9512acaf69863b47ea7485fded4c2d5d73edfb/tools/cmd/bqschemaupdater/table_store.go
,
Nov 30 2017
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/91ca088ec80539f5b371d15ef6294f238078365f
commit 91ca088ec80539f5b371d15ef6294f238078365f
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Thu Nov 30 19:59:08 2017
Roll go dependencies and luci-go DEPS.
Most notable changes are in cloud.google.com/go. It contains a potential fix for crbug.com/788903 . It also contains a breaking change to BigQuery API.
infra/go/src/go.chromium.org/luci:
5f9512aca [bqschemaupdater] Fix it after cloud.google.com/go/bigquery API change.
632469d22 [server] Enable pprof by default on all servers.
R=nodir@chromium.org, tandrii@chromium.org
BUG= 788903
Change-Id: Ic2d0fa3de96c6eb3bfdfd844642ea7422f0e06ce
Reviewed-on: https://chromium-review.googlesource.com/801654
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
[modify] https://crrev.com/91ca088ec80539f5b371d15ef6294f238078365f/go/deps.lock
[modify] https://crrev.com/91ca088ec80539f5b371d15ef6294f238078365f/DEPS
,
Nov 30 2017
Updating collector and archivist on logdog-dev to pick up new code.
,
Nov 30 2017
Updated all services (GAE, Flex and GKE) on luci-logdog-dev. Now let's run it for a week. If the issue doesn't reappear we can consider it fixed. We'll have to update prod logdog to pick up the fix.
,
Dec 7 2017
The NextAction date has arrived: 2017-12-07
,
Dec 7 2017
It worked for a week without problems. I'll update -dev again to pick up other unrelated changes, let it run for some time, and then deploy everything to prod.
,
Mar 13 2018
Comment 1 by vadimsh@chromium.org, Nov 27 2017