
Issue 591122

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




gae_ts_mon: bound the batches sent to pubsub

Project Member Reported by no...@chromium.org, Mar 1 2016

Issue description

https://code.google.com/p/chromium/codesearch#chromium/infra/appengine_module/gae_ts_mon/config.py&l=70 sends all metrics that are present. However, there may be too many to send in one batch. As a result, once an app accumulates a critical mass of metrics, it can no longer send any of them, and all requests fail because of the 60s timeout.

gae_ts_mon cannot be used in buildbucket endpoints until this is fixed.
 
Cc: -sergeybe...@chromium.org
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)
How many metrics does your app have? I'd like to find out if the large number of metrics is legitimate, or if it's a bug in the way the app is set up.

If our apps can indeed have tons of metrics for a good reason, this may become a problem for isolate / swarming too, and we should fix it. One possible approach is to delegate sending metrics to a taskqueue task, as sketched below.
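
A minimal sketch of that taskqueue idea, not the actual gae_ts_mon code: the worker endpoint, queue name, and send_to_pubsub() helper below are all hypothetical stand-ins.

# Sketch only: hand an already-serialized metrics batch to a push queue and
# publish it from a worker handler, off the user-facing request path.
import webapp2
from google.appengine.api import taskqueue


def enqueue_flush(serialized_proto):
  """Queues a task that will publish the serialized batch."""
  taskqueue.add(
      url='/internal/ts_mon/flush',   # hypothetical worker endpoint
      payload=serialized_proto,
      queue_name='ts-mon-flush')      # hypothetical queue name


class FlushWorker(webapp2.RequestHandler):
  """Runs in the worker task and publishes the batch."""

  def post(self):
    send_to_pubsub(self.request.body)  # stand-in for the existing PubSub call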
Looking at the logs, I see at most 300+ metrics in the default module, and 200+ in the backend. What's concerning is that the default module seems to add ~7 sec to the request to serialize the metrics into a proto... The PubSub request doesn't seem to take long. This is odd: 300+ metrics is not *that* many, really, and quite reasonable for an app with a bunch of endpoints.

Do you have a sample BQ query to show the problematic logs? I couldn't find failing requests.
This query shows that the slowest requests don't necessarily have the most metrics:

SELECT metadata.timestamp, protoPayload.latency, protoPayload.line.severity, protoPayload.status, protoPayload.line.logMessage
FROM
  TABLE_DATE_RANGE(logs.appengine_googleapis_com_request_log_, DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY'), CURRENT_TIMESTAMP())
WHERE protoPayload.line.logMessage CONTAINS "ts_mon: sending"
ORDER BY protoPayload.latency DESC
LIMIT
  1000


Row	metadata_timestamp	protoPayload_latency	protoPayload_line_severity	protoPayload_status	protoPayload_line_logMessage	 
1	2016-02-29 23:57:46 UTC	9.999973s	DEBUG	200	ts_mon: sending 36 metrics to PubSub	 
2	2016-03-01 13:31:00 UTC	9.999932s	DEBUG	200	ts_mon: sending 251 metrics to PubSub	 
3	2016-03-01 14:33:45 UTC	9.999301s	DEBUG	200	ts_mon: sending 523 metrics to PubSub	 
4	2016-03-01 09:37:43 UTC	9.997218s	DEBUG	200	ts_mon: sending 275 metrics to PubSub	 
5	2016-03-01 17:24:55 UTC	9.996101s	DEBUG	200	ts_mon: sending 161 metrics to PubSub	 
6	2016-02-29 13:22:20 UTC	9.99482s	DEBUG	302	ts_mon: sending 128 metrics to PubSub	 
7	2016-02-29 20:05:00 UTC	9.994684s	DEBUG	200	ts_mon: sending 247 metrics to PubSub	 
8	2016-03-01 06:15:34 UTC	9.994378s	DEBUG	200	ts_mon: sending 16 metrics to PubSub	 
9	2016-03-01 12:13:55 UTC	9.994209s	DEBUG	200	ts_mon: sending 250 metrics to PubSub	 
10	2016-03-01 17:37:06 UTC	9.993188s	DEBUG	200	ts_mon: sending 99 metrics to PubSub	

Comment 4 by no...@chromium.org, Mar 1 2016

I don't know why this BQ query does not display the requests I am concerned with.

Here are some logs:
https://pantheon.corp.google.com/logs?project=cr-buildbucket&debugUI=CLOUD&service=appengine.googleapis.com&key1=default&key2=3617-c6acc1e-tainted-nodir&logName=appengine.googleapis.com%2Frequest_log&minLogLevel=500&expandAll=false&timezone=America%2FLos_Angeles&filters=text:%22ts_mon:%20sending%22&lastVisibleTimestampNanos=1456858424444817000

A request that took 59 sec:

10:54:20.319
ts_mon: sending 770 metrics to PubSub
Traceback (most recent call last):
  File "appengine/runtime/wsgi.py", line 267, in Handle
    result = handler(dict(self._environ), self._StartResponse)
  File "third_party/endpoints-1.0/endpoints/apiserving.py", line 480, in __call__
    body_iter = self.service_app(environ, StartResponse)
  File "third_party/protorpc-1.0/protorpc/wsgi/util.py", line 173, in first_found_app
    response = app(environ, first_found_start_response)
  File "third_party/protorpc-1.0/protorpc/wsgi/service.py", line 181, in protorpc_service_app
    response = method(instance, request)
  File "api.py", line 166, in decorated
    gae_ts_mon.flush_metrics_if_needed()
  File "gae_ts_mon/config.py", line 49, in flush_metrics_if_needed
    return _flush_metrics_if_needed_locked(time_now)
  File "gae_ts_mon/config.py", line 70, in _flush_metrics_if_needed_locked
    interface.flush()
  File "gae_ts_mon/common/interface.py", line 103, in flush
    state.global_monitor.send(proto)
  File "gae_ts_mon/common/monitors.py", line 148, in send
    {'data': base64.b64encode(proto.SerializeToString())},
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 812, in SerializeToString
    return self.SerializePartialToString()
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 821, in SerializePartialToString
    self._InternalSerialize(out.write)
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 827, in InternalSerialize
    field_descriptor._encoder(write_bytes, field_value)
  File "components/third_party/protobuf/google/protobuf/internal/encoder.py", line 748, in EncodeRepeatedField
    element._InternalSerialize(write)
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 827, in InternalSerialize
    field_descriptor._encoder(write_bytes, field_value)
  File "components/third_party/protobuf/google/protobuf/internal/encoder.py", line 748, in EncodeRepeatedField
    element._InternalSerialize(write)
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 826, in InternalSerialize
    for field_descriptor, field_value in self.ListFields():
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 602, in ListFields
    all_fields = [item for item in self._fields.iteritems() if _IsPresent(item)]
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 592, in _IsPresent
    elif item[0].cpp_type == _FieldDescriptor.CPPTYPE_MESSAGE:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
This exceeds the overall request deadline; I'd be very surprised if 770 metrics took longer than 10s to send. Is there an indication of how long it took in the logs?

One possible mitigation would be to send the protos from a taskqueue task. However, the request timed out during proto construction, before PubSub was even invoked, and I'd be hesitant to pass anything but protos to the task. Python protos are known to be slow, so I'm not yet sure how to solve this properly.
One thing I forgot to mention: the batch size is already bounded to 1000 metrics in ts_mon. However, it will simply call PubSub multiple times if it needs to send more than 1000 metrics. There is really no way around that: if you're generating that many streams, you have to send them every minute, or you lose the metrics.
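
For illustration, the kind of chunking that bound implies looks roughly like this; the real limit is enforced inside ts_mon itself, and the names here are made up.

# Illustrative only: split the metric list into bounded batches, one PubSub
# call per batch.
METRICS_PER_BATCH = 1000  # the batch limit mentioned above


def batches(metrics, size=METRICS_PER_BATCH):
  """Yields successive slices of at most `size` metrics."""
  for start in xrange(0, len(metrics), size):
    yield metrics[start:start + size]


def flush_all(metrics, publish):
  # `publish` stands in for the existing serialize-and-send-to-PubSub step.
  for batch in batches(metrics):
    publish(batch)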
I did some load testing today with a large number of metrics, and managed to reproduce at least part of the problem in my test app: http://crrev.com/1757213002

The test was running at a constant 10 QPS to a trivial endpoint (printing "Hello world", and populating a bunch of metrics). The graphs are here: http://shortn/_JMpbkltzzZ

The important point is that the ts_mon latency of PubSub calls doesn't seem to change with the number of metrics; it's always well under 100ms. There may be multiple calls to PubSub, but they still complete well under 1s even with the 2K metrics I tested.

However, the total request durations grow to as much as 2-3 seconds. This makes me think that ts_mon is taking its time populating the protos. I'll try to instrument proto construction to measure it properly, but I'm pretty sure it's the culprit.

I'm not sure what to do about it yet. If we genuinely need to send this many metrics, something somewhere has to generate those protos from the instance's memory. The only reasonable option is to offload it to a taskqueue task, but the data still needs to be passed to the task.

Maybe it's faster to pickle the ts_mon metric_store instance, send it to a task, unpickle it there, and spend the time converting it to protos in the task. It's a weird thing to do though...
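
Roughly, that idea would look like the sketch below. The deferred library is a standard GAE helper, but flush_via_task() and _convert_to_protos_and_send() are made-up names, and this is not an existing gae_ts_mon code path.

# Sketch of the pickle-and-defer idea.
import cPickle as pickle
from google.appengine.ext import deferred


def flush_via_task(metric_store):
  # Pickling in the user request should be cheap compared to building and
  # serializing the protos.
  blob = pickle.dumps(metric_store, pickle.HIGHEST_PROTOCOL)
  deferred.defer(_convert_and_send, blob, _queue='ts-mon-flush')


def _convert_and_send(blob):
  store = pickle.loads(blob)
  # Proto construction, serialization and the PubSub call now happen in the
  # task, off the user-facing request path.
  _convert_to_protos_and_send(store)  # stand-in for the existing flush logic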

Comment 8 by no...@chromium.org, Mar 3 2016

Maybe it wasn't gae_ts_mon causing the slowdown. I will recheck after I fix
a deadlock that I've found in a GAE component.

On Wed, Mar 2, 2016 at 10:16 PM sergeyberezin@chromium.org via Monorail <
monorail@chromium.org> wrote:
Whether it was gae_ts_mon's fault or not, it was a good prompt to look at the performance anyway. And I'm now finding all sorts of interesting curiosities, like:


16:36:21.978
ts_mon: populating metrics took 1.285630 sec
16:36:22.986
ts_mon: populating a proto took 0.910440 sec
16:36:22.986
ts_mon: wrapping proto took 0.000010 sec
16:36:22.986
ts_mon: sending 1000 metrics to PubSub
16:36:24.280
ts_mon: creating PubSub message took 1.294160 sec
16:36:24.288
URL being requested: POST https://pubsub.googleapis.com/v1/projects/chrome-infra-mon-pubsub/topics/monacq:publish?alt=json
16:36:24.377
ts_mon: sending PubSub message took 0.096580 sec
16:36:24.378
ts_mon: sending a proto took 1.391810 sec  <<< this is the total of wrapping+creating+sending PubSub message

So, it's both populating the metrics (dictionary lookups) and serializing the proto to a string plus base64-encoding it that take most of the time; not the protobufs themselves, and not the PubSub communication.
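
For reference, the timing lines above can be produced with plain wall-clock instrumentation around those two steps. A sketch, assuming the same {'data': base64.b64encode(proto.SerializeToString())} call that appears in the earlier traceback; the function name is hypothetical.

# Sketch of the instrumentation behind the log lines above.
import base64
import logging
import time


def _timed_pubsub_body(proto):
  start = time.time()
  raw = proto.SerializeToString()
  logging.debug('ts_mon: serializing proto (%d bytes) took %f sec',
                len(raw), time.time() - start)

  start = time.time()
  body = {'data': base64.b64encode(raw)}
  logging.debug('ts_mon: creating PubSub message took %f sec',
                time.time() - start)
  return body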
And in slightly more detail:


16:50:19.277
ts_mon: populating metrics took 2.510540 sec
16:50:21.166
ts_mon: populating a proto took 1.782880 sec
16:50:21.166
ts_mon: wrapping proto took 0.000010 sec
16:50:21.166
ts_mon: sending 1000 metrics to PubSub
16:50:23.367
ts_mon: serializing proto (242860 bytes) took 2.200580 sec
16:50:23.369
ts_mon: creating PubSub message took 0.001750 sec
16:50:23.381
URL being requested: POST https://pubsub.googleapis.com/v1/projects/chrome-infra-mon-pubsub/topics/monacq:publish?alt=json
16:50:23.471
ts_mon: sending PubSub message took 0.101260 sec

So, yeah, serializing the proto takes the cake, apart from populating the metrics, which is within our control to fix.
And this is on my desktop:

Populating 10000 metrics:
DEBUG    2016-03-04 01:32:31,113 main.py:37] ts_mon: populating metrics took 0.153431 sec

Serializing batches of 1000 metrics:

DEBUG    2016-03-04 01:32:31,214 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:31,295 monitors.py:192] ts_mon: serializing proto (169970 bytes) took 0.081126 sec
DEBUG    2016-03-04 01:32:31,338 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:31,419 monitors.py:192] ts_mon: serializing proto (169979 bytes) took 0.080556 sec
DEBUG    2016-03-04 01:32:31,463 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:31,543 monitors.py:192] ts_mon: serializing proto (169974 bytes) took 0.080660 sec
DEBUG    2016-03-04 01:32:31,587 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:31,668 monitors.py:192] ts_mon: serializing proto (169975 bytes) took 0.081051 sec
DEBUG    2016-03-04 01:32:31,711 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:31,792 monitors.py:192] ts_mon: serializing proto (169974 bytes) took 0.080248 sec
DEBUG    2016-03-04 01:32:31,914 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:32,034 monitors.py:192] ts_mon: serializing proto (200446 bytes) took 0.119846 sec
DEBUG    2016-03-04 01:32:32,123 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:32,237 monitors.py:192] ts_mon: serializing proto (200359 bytes) took 0.113641 sec
DEBUG    2016-03-04 01:32:32,325 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:32,439 monitors.py:192] ts_mon: serializing proto (200148 bytes) took 0.113381 sec
DEBUG    2016-03-04 01:32:32,533 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:32,655 monitors.py:192] ts_mon: serializing proto (200234 bytes) took 0.122245 sec
DEBUG    2016-03-04 01:32:32,782 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG    2016-03-04 01:32:32,901 monitors.py:192] ts_mon: serializing proto (200123 bytes) took 0.118779 sec
DEBUG    2016-03-04 01:32:32,905 monitors.py:189] ts_mon: sending 11 metrics to PubSub
DEBUG    2016-03-04 01:32:32,907 monitors.py:192] ts_mon: serializing proto (2219 bytes) took 0.001547 sec

That's about 1 sec spent on serializing. Judging by the other numbers, my desktop looks ~10x faster than an AppEngine instance, so it's no surprise it takes 1-2 sec to serialize 1000 metrics there.

It also looks like populating metrics is not really a bottleneck, and it's probably something weird happening on GAE.
iannucci@ suggested this for the Go version, but it may be useful for Python too: kick off sending the metrics at the beginning of a handler in a separate thread, and wait for it at the end. If the main thread spends most of its time waiting for I/O (e.g. on Datastore), then the expensive proto serialization has a chance to complete in parallel without adding much latency to the request.
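
A minimal sketch of that suggestion, assuming a plain background thread started when the handler begins and joined before the response goes out; this is not the actual gae_ts_mon implementation, and flush_fn stands in for the existing interface.flush() call.

# Sketch only: run the metrics flush on a worker thread for the span of a
# request, so serialization overlaps with the handler's own I/O.
import threading


class ParallelFlush(object):
  def __init__(self, flush_fn):
    self._thread = threading.Thread(target=flush_fn)

  def start(self):
    # Call at the start of the request handler.
    self._thread.start()

  def wait(self):
    # Call just before returning the response.
    self._thread.join()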
Project Member Comment 13 by bugdroid1@chromium.org, Mar 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/5d220c3d397eb320a1e451e6c40df629dae48428

commit 5d220c3d397eb320a1e451e6c40df629dae48428
Author: sergeyberezin <sergeyberezin@chromium.org>
Date: Tue Mar 08 01:30:32 2016

gae_ts_mon: flush metrics asynchronously

Start flushing metrics at the beginning of a request in a parallel
thread. This reduces the overall latency of the request.

BUG= 591122 
R=dsansome@chromium.org, iannucci@chromium.org
TEST=load-tested in the cloud

Review URL: https://codereview.chromium.org/1772913003

[modify] https://crrev.com/5d220c3d397eb320a1e451e6c40df629dae48428/appengine_module/gae_ts_mon/config.py

Status: Fixed (was: Assigned)
Hopefully the above CL fixes the immediate concern with latency, closing the bug.
Components: Infra>Monitoring
Labels: -Infra-Monitoring
