New issue
Advanced search Search tips

Issue 597886 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

LogDog Collector pipeline cleanup.

Project Member Reported by d...@chromium.org, Mar 25 2016

Issue description

Recent stress tests of have revealed a few problems with the LogDog Collector pipeline that need fixing.

1) Collector is hammering gRPC / BigTable with calls. It is probably exceeding quota, and definitely spending most of its CPU and time on these RPCs.

Currently, Collector loads each individual log entry into its own BigTable row. This makes fetching data and ensuring contiguous log space really easy, but means that large log streams will hammer BT with magnitudes more RPC calls than it's designed to tolerate.

The proposed solution is to change LogDog's intermediate storage BigTable format.

Current: Sparse table, individual row protobufs, row key == {Stream, Index}, one row per index.
Proposed: Sparse table, bulk row protobufs, row key == {Stream, Index_last}, one row per bundled set of contiguous log indexes.

We would assert in Collector that a given bundle entry is composed of contiguous indexes. This becomes a formal ingest requirement rather than a nice implementation detail.

Writing a row would now be a single write per bundle rather than len(bundle) writes, which is a huge gain.

Reading a row would seek forward from target {Stream, Index} to find the first matching row, load that row's data, then find the row value in that bundle. For example:

Want: Row 10
Have: {0-8}, {9-15}, {16-20}

We would seek for {10}. Since we're indexing based on the LAST contiguous index in the bundle, the first row would be {9-15}, which we would load, confirm that 9 < 10, and pick out 10 from the bundle.

Archive: We would still scan forwards from 0 and pull the full space.
Tail: We would still keys-only to the last row and pull that out.

2) Collector currently doesn't refresh a bundle's ACK deadline if it's taking too long. We need to implement this similar to the way task queue does it so that poison bundles don't stay in Pub/Sub forever.

We can do this for free by updating the Pub/Sub library to use the new SubscriptionHandle API (https://godoc.org/google.golang.org/cloud/pubsub#SubscriptionHandle).
 

Comment 1 by estaab@google.com, Mar 25 2016

Labels: -Pri-3 -Infra Infra-Platform Pri-1
We should probably confirm that the bottleneck is quota and not tablet hotspotting or something else. We need to scale 10-100x to cover our whole fleet, plus more for organic growth, so if we're hitting a hard limit we need to be confident we're addressing the root cause.

Comment 2 by estaab@google.com, Mar 25 2016

Also, this is a good time to be finding these issues and thanks for the careful incremental rollout up to this point. :)

Comment 3 by d...@chromium.org, Mar 25 2016

Tablet shouldn't be a problem, we're using hashes for row keys. Thrashing within a log stream may happen, but between them should not.

I'm confident that #1 and #2 are problems that need fixing, and #1 is the current thing bringing collection to its knees.

Yeah agreed, that's the point of the stress test :)
Project Member

Comment 5 by bugdroid1@chromium.org, Mar 29 2016

Project Member

Comment 8 by bugdroid1@chromium.org, Apr 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/ab3e5f8f2a91ceec6828033564bfd78aaf4f19d4

commit ab3e5f8f2a91ceec6828033564bfd78aaf4f19d4
Author: dnj <dnj@chromium.org>
Date: Fri Apr 01 03:09:49 2016

Revert of Bump luci/gae. (patchset #1 id:1 of https://codereview.chromium.org/1844333004/ )

Reason for revert:
Found a bug in luci/gae, reverting until fixed.

Original issue's description:
> Bump luci/gae.
>
> Bump to 0a4fa8065e6843ebbbca94561b3df42447150ec4.
>
> TBR=iannucci@chromium.org
> BUG= chromium:597886 
> TEST=None
>
> Committed: https://chromium.googlesource.com/infra/infra/+/7f7815b3b6c62fab3ec6c65ec111357422a4954f

TBR=iannucci@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG= chromium:597886 

Review URL: https://codereview.chromium.org/1850853003

[modify] https://crrev.com/ab3e5f8f2a91ceec6828033564bfd78aaf4f19d4/DEPS

Project Member

Comment 9 by bugdroid1@chromium.org, Apr 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/ffc30b3a4e342cab34436687a3d712deb3ff5c49

commit ffc30b3a4e342cab34436687a3d712deb3ff5c49
Author: dnj <dnj@chromium.org>
Date: Fri Apr 01 22:11:05 2016

Roll luci/gae pin.

BUG= chromium:597886 
TEST=local
  - Tested infra and luci-go locally against the new pin.

R=iannucci@chromium.org

Review URL: https://codereview.chromium.org/1847193004

[modify] https://crrev.com/ffc30b3a4e342cab34436687a3d712deb3ff5c49/DEPS

Project Member

Comment 10 by bugdroid1@chromium.org, Apr 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/63c19cda88aa254fd537edc802bf5c19cc8f009f

commit 63c19cda88aa254fd537edc802bf5c19cc8f009f
Author: dnj <dnj@chromium.org>
Date: Fri Apr 01 22:19:55 2016

LogDog: BigTable batching schema.

Update the Storage schema to enable batching Put() calls. Update
BigTable schema to store batches of sequential logs in a single cell
instead of requiring one cell per log.

Previously, BigTable schema keyed each cell row off of a log's index.
Now, the cell row is keyed based on the LAST index in a block of
sequential log entries, storing entries as a series of recordio blocks.
This change allows multiple log entries to be stored in a single
BigTable cell.

  Before: 0|1|2|3|4|5
  Now:    0 1 2 3|4 5

Before, "get message 2" meant "go to row #2, get data." Now, it is "go
to row >= 2 (in this example, 3) and pull 2 out of the block.

Update APIs accordingly.

Also update Subscriber API to control message quantity rather than
batch pull count for better tuning and parity with upstream Pub/Sub
API changes.

BUG= chromium:597886 
TEST=local

Review URL: https://codereview.chromium.org/1838803002

[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/appengine/logdog/coordinator/config/bigTable.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/appengine/logdog/coordinator/endpoints/logs/get.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/appengine/logdog/coordinator/endpoints/logs/get_test.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/subscriber/source.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/subscriber/subscriber.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/subscriber/subscriber_test.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/proto/logdog/logpb/butler.pb.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/proto/logdog/logpb/butler.proto
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/proto/logdog/svcconfig/config.pb.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/proto/logdog/svcconfig/config.proto
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/cmd/logdog_collector/main.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/internal/logdog/archivist/archivist.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/internal/logdog/archivist/archivist_test.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/internal/logdog/collector/collector.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/internal/logdog/collector/collector_test.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/internal/logdog/collector/utils_test.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/internal/logdog/service/service.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/archive/storage.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/bigtable/bigtable.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/bigtable/initialize.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/bigtable/storage.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/bigtable/storage_test.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/memory/memory.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/memory/memory_test.go
[modify] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/server/logdog/storage/storage.go

Project Member

Comment 11 by bugdroid1@chromium.org, Apr 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a

commit d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a
Author: dnj <dnj@chromium.org>
Date: Fri Apr 01 22:24:53 2016

Use native Pub/Sub library primitives.

Remove the common/gcloud/pubsub/{subscriber/ackbuffer}, as those are now
built into the default Pub/Sub library via SubscriptionHandle.

Rewrite libraries that use those tools to use the new Pub/Sub library
structs instead.

BUG= chromium:597886 
TEST=None

Review URL: https://codereview.chromium.org/1838303002

[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/appengine/tsmon/middleware.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/client/cmd/logdog_butler/main.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/client/cmd/logdog_butler/output_pubsub.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/client/internal/logdog/butler/output/pubsub/pubsubOutput.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/client/internal/logdog/butler/output/pubsub/pubsubOutput_test.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/ackbuffer/ack.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/ackbuffer/ackbuffer.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/ackbuffer/ackbuffer_test.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/connection.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/connection_impl.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/pubsub.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/common/gcloud/pubsub/quota.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/retry.go
[add] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/common/gcloud/pubsub/scopes.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/subscriber/source.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/subscriber/subscriber.go
[delete] https://crrev.com/63c19cda88aa254fd537edc802bf5c19cc8f009f/common/gcloud/pubsub/subscriber/subscriber_test.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/common/tsmon/iface.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/common/tsmon/monitor/pubsub.go
[modify] https://crrev.com/d9cce0a1a2700c61bf5a984dd0f47ffbbbd13b3a/server/cmd/logdog_collector/main.go

Project Member

Comment 12 by bugdroid1@chromium.org, Apr 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/fdfa04c6c456d06e5aad68069c79b05a6b4e71dd

commit fdfa04c6c456d06e5aad68069c79b05a6b4e71dd
Author: dnj <dnj@chromium.org>
Date: Fri Apr 01 22:35:44 2016

Fix GS client breakage, remove offset hacks.

A recent update to the Google Cloud Storage client library caused the
existing hacks to add range support to break, resulting in an inability
to read archived logs. Fortunately, the client library also added
support for a range reader!

This CL removes the offset hacks and switches over to the first-class
range reader.

BUG= chromium:597886 

Review URL: https://codereview.chromium.org/1844253002

[modify] https://crrev.com/fdfa04c6c456d06e5aad68069c79b05a6b4e71dd/common/gcloud/gs/gs.go
[delete] https://crrev.com/91b4e32d3221f45166c22362e88a0bc8de33dbc2/common/gcloud/gs/opts.go
[modify] https://crrev.com/fdfa04c6c456d06e5aad68069c79b05a6b4e71dd/server/logdog/storage/archive/storage.go

Comment 13 by d...@chromium.org, Apr 1 2016

Status: Fixed (was: Untriaged)
Components: Infra>Platform
Labels: -Infra-Platform

Sign in to add a comment