gae_ts_mon: bound the batches sent to pubsub
Issue description: https://code.google.com/p/chromium/codesearch#chromium/infra/appengine_module/gae_ts_mon/config.py&l=70 sends all the metrics that are there. However, there may be too many to send in one batch. Therefore, if an app hits a critical mass of metrics, it can no longer send any metrics and fails all requests because of the 60s timeout. gae_ts_mon cannot be used in buildbucket endpoints until this is fixed.
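(For reference, a minimal sketch of the kind of bounding the title asks for; serialize_batch() and send_to_pubsub() are hypothetical stand-ins for gae_ts_mon's serialization and PubSub code, not its real API.)

# Minimal sketch only: split the accumulated metrics into bounded batches
# before publishing, instead of one unbounded batch per flush.

MAX_METRICS_PER_BATCH = 1000  # assumed cap; tune to the PubSub payload limit


def batched(items, batch_size):
  """Yields consecutive slices of at most batch_size elements."""
  for start in xrange(0, len(items), batch_size):
    yield items[start:start + batch_size]


def flush_in_batches(metrics):
  for batch in batched(metrics, MAX_METRICS_PER_BATCH):
    proto = serialize_batch(batch)  # one bounded proto per batch
    send_to_pubsub(proto)           # one PubSub publish per batch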
Mar 1 2016
Looking at the logs, I see at most 300+ metrics in the default module, and 200+ in the backend. What's concerning is that the default module seems to add ~7 sec to the request to serialize the metrics into a proto... The PubSub request doesn't seem to take long. This is odd; 300+ metrics is not *that* much, really, and quite reasonable for an app with a bunch of endpoints. Do you have a sample BQ query to show the problematic logs? I couldn't find the failing requests.
Mar 1 2016
This query shows that the slowest requests don't necessarily have the most metrics:

SELECT metadata.timestamp, protoPayload.latency, protoPayload.line.severity,
       protoPayload.status, protoPayload.line.logMessage
FROM TABLE_DATE_RANGE(logs.appengine_googleapis_com_request_log_,
                      DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY'), CURRENT_TIMESTAMP())
WHERE protoPayload.line.logMessage CONTAINS "ts_mon: sending"
ORDER BY protoPayload.latency DESC
LIMIT 1000

Row  metadata_timestamp       protoPayload_latency  protoPayload_line_severity  protoPayload_status  protoPayload_line_logMessage
1    2016-02-29 23:57:46 UTC  9.999973s             DEBUG                       200                  ts_mon: sending 36 metrics to PubSub
2    2016-03-01 13:31:00 UTC  9.999932s             DEBUG                       200                  ts_mon: sending 251 metrics to PubSub
3    2016-03-01 14:33:45 UTC  9.999301s             DEBUG                       200                  ts_mon: sending 523 metrics to PubSub
4    2016-03-01 09:37:43 UTC  9.997218s             DEBUG                       200                  ts_mon: sending 275 metrics to PubSub
5    2016-03-01 17:24:55 UTC  9.996101s             DEBUG                       200                  ts_mon: sending 161 metrics to PubSub
6    2016-02-29 13:22:20 UTC  9.99482s              DEBUG                       302                  ts_mon: sending 128 metrics to PubSub
7    2016-02-29 20:05:00 UTC  9.994684s             DEBUG                       200                  ts_mon: sending 247 metrics to PubSub
8    2016-03-01 06:15:34 UTC  9.994378s             DEBUG                       200                  ts_mon: sending 16 metrics to PubSub
9    2016-03-01 12:13:55 UTC  9.994209s             DEBUG                       200                  ts_mon: sending 250 metrics to PubSub
10   2016-03-01 17:37:06 UTC  9.993188s             DEBUG                       200                  ts_mon: sending 99 metrics to PubSub
Mar 1 2016
I don't know why this BQ query does not display the requests I am concerned with. Here are some logs:

https://pantheon.corp.google.com/logs?project=cr-buildbucket&debugUI=CLOUD&service=appengine.googleapis.com&key1=default&key2=3617-c6acc1e-tainted-nodir&logName=appengine.googleapis.com%2Frequest_log&minLogLevel=500&expandAll=false&timezone=America%2FLos_Angeles&filters=text:%22ts_mon:%20sending%22&lastVisibleTimestampNanos=1456858424444817000

59sec: 10:54:20.319 ts_mon: sending 770 metrics to PubSub
Traceback (most recent call last):
  File "appengine/runtime/wsgi.py", line 267, in Handle
    result = handler(dict(self._environ), self._StartResponse)
  File "third_party/endpoints-1.0/endpoints/apiserving.py", line 480, in __call__
    body_iter = self.service_app(environ, StartResponse)
  File "third_party/protorpc-1.0/protorpc/wsgi/util.py", line 173, in first_found_app
    response = app(environ, first_found_start_response)
  File "third_party/protorpc-1.0/protorpc/wsgi/service.py", line 181, in protorpc_service_app
    response = method(instance, request)
  File "api.py", line 166, in decorated
    gae_ts_mon.flush_metrics_if_needed()
  File "gae_ts_mon/config.py", line 49, in flush_metrics_if_needed
    return _flush_metrics_if_needed_locked(time_now)
  File "gae_ts_mon/config.py", line 70, in _flush_metrics_if_needed_locked
    interface.flush()
  File "gae_ts_mon/common/interface.py", line 103, in flush
    state.global_monitor.send(proto)
  File "gae_ts_mon/common/monitors.py", line 148, in send
    {'data': base64.b64encode(proto.SerializeToString())},
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 812, in SerializeToString
    return self.SerializePartialToString()
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 821, in SerializePartialToString
    self._InternalSerialize(out.write)
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 827, in InternalSerialize
    field_descriptor._encoder(write_bytes, field_value)
  File "components/third_party/protobuf/google/protobuf/internal/encoder.py", line 748, in EncodeRepeatedField
    element._InternalSerialize(write)
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 827, in InternalSerialize
    field_descriptor._encoder(write_bytes, field_value)
  File "components/third_party/protobuf/google/protobuf/internal/encoder.py", line 748, in EncodeRepeatedField
    element._InternalSerialize(write)
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 826, in InternalSerialize
    for field_descriptor, field_value in self.ListFields():
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 602, in ListFields
    all_fields = [item for item in self._fields.iteritems() if _IsPresent(item)]
  File "components/third_party/protobuf/google/protobuf/internal/python_message.py", line 592, in _IsPresent
    elif item[0].cpp_type == _FieldDescriptor.CPPTYPE_MESSAGE:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
Mar 2 2016
This exceeds the overall request deadline; I'd be very surprised if 770 metrics took longer than 10s to send. Is there an indication in the logs of how long it took? One possible mitigation could be to send the protos in a taskqueue task. However, the request timed out during proto construction, before PubSub was even invoked. And I'd be hesitant to pass anything but protos to the task. Python protos are known to be slow, so I'm not yet sure how to solve this properly.
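(A hedged sketch of that taskqueue mitigation, for illustration only. collect_metric_data(), build_proto() and send_to_pubsub() are stand-ins rather than gae_ts_mon's real API, and the queue name and handler path are assumptions; whether snapshotting the metric data in the user request is cheap enough is exactly the open question raised above.)

# Hypothetical sketch: do the cheap part (snapshotting and pickling the
# in-memory metric data) in the user request, and the expensive part
# (proto construction + PubSub publish) in a push-queue task.
import pickle

import webapp2
from google.appengine.api import taskqueue


def enqueue_flush():
  data = collect_metric_data()  # snapshot of the instance's metric store
  taskqueue.add(
      url='/internal/ts_mon/flush',
      payload=pickle.dumps(data, pickle.HIGHEST_PROTOCOL),
      queue_name='ts-mon-flush')


class FlushHandler(webapp2.RequestHandler):
  def post(self):
    data = pickle.loads(self.request.body)
    send_to_pubsub(build_proto(data))  # slow work runs off the user request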
Mar 2 2016
One thing I forgot to mention: the batch size is already bounded to 1000 metrics in ts_mon. However, it will simply call PubSub multiple times if it needs to send more than 1000 metrics. There is really no way around that - if you're generating that many streams, you have to send them every minute, or you lose the metrics.
Mar 3 2016
I did some load testing today with a large number of metrics, and managed to reproduce at least part of the problem in my test app: http://crrev.com/1757213002

The test was running at a constant 10 QPS against a trivial endpoint (printing "Hello world" and populating a bunch of metrics). The graphs are here: http://shortn/_JMpbkltzzZ

The important point is that the ts_mon latency of the PubSub calls doesn't seem to change with the number of metrics; it's always well under 100ms. There may be multiple calls to PubSub, but they still run well under 1s with the 2K metrics that I tested. However, the total request durations become as slow as 2-3 seconds. This makes me think that ts_mon is taking its time populating the protos. I'll try to instrument proto construction to measure it properly, but I'm pretty sure it's the culprit.

I'm not sure what to do about it yet. If we genuinely need to send this many metrics, something somewhere has to generate those protos from the instance's memory. The only reasonable option is to offload it to a taskqueue task, but the data still needs to be passed to the task. Maybe it's faster to pickle the ts_mon metric_store instance, send it to a task, unpickle it there, and take the time converting it to protos. It's a weird thing to do though...
Mar 3 2016
Maybe it wasn't gae_ts_mon causing the slowdown. I will recheck after I fix a deadlock that I've found in a GAE component.
Mar 4 2016
Whether it was gae_ts_mon's fault or not, it was a good prompt to look at the performance anyway. And I'm now finding all sorts of interesting curiosities, like:

16:36:21.978 ts_mon: populating metrics took 1.285630 sec
16:36:22.986 ts_mon: populating a proto took 0.910440 sec
16:36:22.986 ts_mon: wrapping proto took 0.000010 sec
16:36:22.986 ts_mon: sending 1000 metrics to PubSub
16:36:24.280 ts_mon: creating PubSub message took 1.294160 sec
16:36:24.288 URL being requested: POST https://pubsub.googleapis.com/v1/projects/chrome-infra-mon-pubsub/topics/monacq:publish?alt=json
16:36:24.377 ts_mon: sending PubSub message took 0.096580 sec
16:36:24.378 ts_mon: sending a proto took 1.391810 sec   <<< this is the total of wrapping + creating + sending the PubSub message

So, it's both populating the metrics (dictionary lookups) and serializing the proto to a string + base64-encoding it that take most of the time. Not the protobufs themselves, and not the PubSub communication.
Mar 4 2016
And slightly more details:

16:50:19.277 ts_mon: populating metrics took 2.510540 sec
16:50:21.166 ts_mon: populating a proto took 1.782880 sec
16:50:21.166 ts_mon: wrapping proto took 0.000010 sec
16:50:21.166 ts_mon: sending 1000 metrics to PubSub
16:50:23.367 ts_mon: serializing proto (242860 bytes) took 2.200580 sec
16:50:23.369 ts_mon: creating PubSub message took 0.001750 sec
16:50:23.381 URL being requested: POST https://pubsub.googleapis.com/v1/projects/chrome-infra-mon-pubsub/topics/monacq:publish?alt=json
16:50:23.471 ts_mon: sending PubSub message took 0.101260 sec

So, yeah, serializing the proto takes the cake. Apart from populating the metrics, which is in our control to fix.
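(The timing lines above come from ad-hoc instrumentation; a small context manager along these lines is enough to produce them. This is an illustrative sketch, not the actual code that generated those logs, and populate_metrics() is a hypothetical helper.)

# Illustrative sketch of timing instrumentation that would produce lines like
# "ts_mon: serializing proto (242860 bytes) took 2.200580 sec".
import contextlib
import logging
import time


@contextlib.contextmanager
def timed(what):
  start = time.time()
  try:
    yield
  finally:
    logging.debug('ts_mon: %s took %f sec', what, time.time() - start)

# Usage, roughly mirroring the steps measured above:
#   with timed('populating metrics'):
#     populate_metrics()
#   with timed('serializing proto'):
#     serialized = proto.SerializeToString()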
Mar 4 2016
And this is on my desktop.

Populating 10000 metrics:

DEBUG 2016-03-04 01:32:31,113 main.py:37] ts_mon: populating metrics took 0.153431 sec

Serializing batches of 1000 metrics:

DEBUG 2016-03-04 01:32:31,214 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:31,295 monitors.py:192] ts_mon: serializing proto (169970 bytes) took 0.081126 sec
DEBUG 2016-03-04 01:32:31,338 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:31,419 monitors.py:192] ts_mon: serializing proto (169979 bytes) took 0.080556 sec
DEBUG 2016-03-04 01:32:31,463 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:31,543 monitors.py:192] ts_mon: serializing proto (169974 bytes) took 0.080660 sec
DEBUG 2016-03-04 01:32:31,587 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:31,668 monitors.py:192] ts_mon: serializing proto (169975 bytes) took 0.081051 sec
DEBUG 2016-03-04 01:32:31,711 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:31,792 monitors.py:192] ts_mon: serializing proto (169974 bytes) took 0.080248 sec
DEBUG 2016-03-04 01:32:31,914 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:32,034 monitors.py:192] ts_mon: serializing proto (200446 bytes) took 0.119846 sec
DEBUG 2016-03-04 01:32:32,123 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:32,237 monitors.py:192] ts_mon: serializing proto (200359 bytes) took 0.113641 sec
DEBUG 2016-03-04 01:32:32,325 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:32,439 monitors.py:192] ts_mon: serializing proto (200148 bytes) took 0.113381 sec
DEBUG 2016-03-04 01:32:32,533 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:32,655 monitors.py:192] ts_mon: serializing proto (200234 bytes) took 0.122245 sec
DEBUG 2016-03-04 01:32:32,782 monitors.py:189] ts_mon: sending 1000 metrics to PubSub
DEBUG 2016-03-04 01:32:32,901 monitors.py:192] ts_mon: serializing proto (200123 bytes) took 0.118779 sec
DEBUG 2016-03-04 01:32:32,905 monitors.py:189] ts_mon: sending 11 metrics to PubSub
DEBUG 2016-03-04 01:32:32,907 monitors.py:192] ts_mon: serializing proto (2219 bytes) took 0.001547 sec

That's about 1 sec spent on serializing. By the other numbers, it looks like my desktop is ~10x faster than an AppEngine instance, so it's not a surprise it takes 1-2 sec to serialize 1000 metrics. It also looks like populating metrics is not really a bottleneck, and it's probably something weird happening on GAE.
Mar 5 2016
iannucci@ suggested this for the Go version, but it may be useful for Python too: queue up sending the metrics at the beginning of a handler in a separate thread, and wait for it at the end. If most of the time the main thread is waiting for I/O (e.g. for Datastore), then the expensive proto serialization may have a chance to complete in parallel, without adding much latency to the request.
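(A rough sketch of that idea in Python; the hook names and flush_if_needed() are hypothetical, not the implementation that eventually landed.)

# Hypothetical sketch: start the expensive flush in a background thread when
# the request begins, and join it (with a bound) before the response is sent,
# so proto serialization overlaps with the handler's own I/O waits.
import threading


def start_async_flush(state):
  # flush_if_needed() stands in for the real serialization + PubSub send.
  state.flush_thread = threading.Thread(target=flush_if_needed)
  state.flush_thread.start()


def wait_for_flush(state, timeout_sec=5):
  if getattr(state, 'flush_thread', None) is not None:
    # Bound the wait so a slow flush cannot add unbounded latency.
    state.flush_thread.join(timeout_sec)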
Mar 8 2016
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra.git/+/5d220c3d397eb320a1e451e6c40df629dae48428

commit 5d220c3d397eb320a1e451e6c40df629dae48428
Author: sergeyberezin <sergeyberezin@chromium.org>
Date: Tue Mar 08 01:30:32 2016

gae_ts_mon: flush metrics asynchronously

Start flushing metrics at the beginning of a request in a parallel thread.
This reduces the overall latency of the request.

BUG=591122
R=dsansome@chromium.org, iannucci@chromium.org
TEST=load-tested in the cloud

Review URL: https://codereview.chromium.org/1772913003

[modify] https://crrev.com/5d220c3d397eb320a1e451e6c40df629dae48428/appengine_module/gae_ts_mon/config.py
Mar 9 2016
Hopefully the above CL fixes the immediate concern with latency, closing the bug.
Apr 27 2016
Comment 1 by sergeybe...@chromium.org, Mar 1 2016
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)