New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 669669 link

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 674760



Sign in to add a comment

ts_mon should emit a metric that counts when other metrics have flush errors

Project Member Reported by akes...@chromium.org, Nov 29 2016

Issue description

Today, if a metric has a error during flush (due to a bad field for instance) this causes ts_mon to simply log a warning and throw that metric away (possible also throwing other metrics away? needs investigation)

Instead, ts_mon should update a counter like /chrome/infra/chromeos/ts_mon/flush_failed with a field metric_name="/name/of/the/metric/that/failed".

This will make it easier to identify metrics that are having flush problems.
 
Cc: sergeybe...@chromium.org dsansome@chromium.org
What would happen if this new metric has a flush error, or gets caught in a batch that failed to flush?

In addition to timeseries monitoring, we have a logging pipeline (cloudtail + Cloud Logging API) which would log these errors, and in particular, allows counting error log lines in a separate metric from a separate service, e.g. like this: http://shortn/_cQD5P5thgZ

The metric is defined here: https://chromium.googlesource.com/infra/infra/+/master/go/src/infra/tools/cloudtail/client.go#99

You can monitor / alert on the amount of error / warning logs. Would that solve the problem?
We're not using the cloud logging features of ts_mon as far as I know. The errors I was referring to were being logged on the client side and did not as far as I know make it up to any cloud service.
It's not a ts_mon feature, it's a whole separate pipeline. I understand that you don't use it yet, and my implied question was if it makes sense to consider using it?

For most of our services, we send all the logs to a single project, and go/chrome-infra-logs has all of them in one place. The logs are also exported to BigQuery for longer retention and more complex queries. This proved quite useful for us on many occasions.
Ah, ok. Yes, some unified cloud based logging that we use across all our services would be good, but that sounds like a rather large scope. I'd like to revisit that in a future Q.

In the meantime, still think this ts_mon metric makes sense.
We already have a PubSub connection monitoring from ts_mon, e.g. http://shortn/_9vHVXsIIys

This of course wouldn't catch failures within the client before the network call, e.g. in protobuf assignments. What types of failures are you after?
Example https://bugs.chromium.org/p/chromium/issues/detail?id=669577

11/29 10:19:30.972 ERROR|         interface:0166| Automatic monitoring flush failed.
Traceback (most recent call last):
  File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/interface.py", line 164, in _flush_and_log_exceptions
    flush()
  File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/interface.py", line 109, in flush
    metric.serialize_to(proto, start_time, fields, value, target)
  File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/metrics.py", line 115, in serialize_to
    self._populate_fields(metric_pb, fields)
  File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/metrics.py", line 142, in _populate_fields
    raise errors.MonitoringInvalidFieldTypeError(self._name, key, value)
MonitoringInvalidFieldTypeError: Metric "chrome/infra/chromeos/autotest/resolve_devserver" was given field "devserver" with invalid value "<autotest_lib.client.common_lib.cros.dev_server.AndroidBuildServer object at 0x7f28d8243e50>" (<class 'autotest_lib.client.common_lib.cros.dev_server.AndroidBuildServer'>).
I see. Well, this will abort the entire flush() operation before any of the metrics were sent, so your proposed new metric will also be dropped. And once one such bad metric makes it into the metric store, it will likely continue crashing until the process is restarted.

This is precisely why I'm suggesting a parallel logging pipeline with its own separate metrics. It shouldn't be that hard to set up, actually. We already have all the machinery for log scraping and Puppet setups. See the docs at http://go/inframon-doc > cloudtail: Cloud Logging.
RE 8 yeah I'm suggesting that we catch and report that flush error inside of ts_mon flush, so that we don't abort the entire flush operation.

From a ts_mon API perspective, the existing behavior of aborting flush of all metrics due to a single malformed metric is suboptimal too.
1. I agree that one metrics failing to flush shouldn't prevent the other metrics from flushing.
2. I'm not sure using a Monarch metric to report these failures is appropriate.

On the logging tangent:

We're currently using metrics to track some errors that may be more appropriate as log messages.  I think this is because we don't currently have a unified logging infrastructure, nor ways to run queries or alert on log messages.  Since we do have Monarch metrics set up, it's tempting to reuse our Monarch metric infrastructure to track errors.

If Monarch works well as a tool for this job, that's fine, but I suspect what we'd like is a unified logging framework to collect detailed information about errors, in parallel with Monarch stats for errors that happen often enough or aren't completely preventable to warrant time series tracking.
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
I'll take a stab at making metrics serialization errors not fail the whole flush. We can discuss on the CL if this is the correct approach.
Uploaded 2 CLs to continue-on-error when flushing metrics:
https://chromium-review.googlesource.com/#/c/415447/
https://chromium-review.googlesource.com/#/c/415584/

@sergeyberezin: Can you help review?
Left some initial comments on the CLs. Thanks for making ts_mon better!
Labels: -current-issue
Project Member

Comment 15 by bugdroid1@chromium.org, Dec 7 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/038c15d3882c4b84f124ab24c06bcc48394fbe93

commit 038c15d3882c4b84f124ab24c06bcc48394fbe93
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Nov 30 22:33:19 2016

ts_mon: Continue flushing metrics beyond serialization error.

Before this CL, If flush() encounters an error while serializing a
metric, flush immediately aborts. This means that
- A subset of the metrics get flushed (whatever batches were already
  sent)
- ts_mon never sends the remaining metrics till the offending metric is
  reset externally / the ts_mon flushing thread restarts.

This CL hardens flush() to continue flushing the rest of the metrics
beyond any corrupt metrics encountered. flush() continues to raise an
exception, so the failure mode is unchanged.

This is the first of a two part change. This only changes the code path
for serialization to old proto type.

BUG= chromium:669669 
TEST=(new) unittest fails before change, passes afterwards.

Change-Id: I2c158144cc6fd1f4a7b40ca7a625c7ff2400bfcc
Reviewed-on: https://chromium-review.googlesource.com/415447
Commit-Queue: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Sergey Berezin <sergeyberezin@chromium.org>

[modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/errors.py
[modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/interface.py
[modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/metrics.py
[modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/test/errors_test.py
[modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/test/interface_test.py
[modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/test/metrics_test.py

Project Member

Comment 16 by bugdroid1@chromium.org, Dec 7 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/23de8a9d518ad8566a328e4349f0a1afcb3dad8a

commit 23de8a9d518ad8566a328e4349f0a1afcb3dad8a
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Nov 30 23:31:15 2016

ts_mon: Continue flushing metrics beyond serialization error [2/2]

Before this CL, If flush() encounters an error while serializing a
metric, flush immediately aborts. This means that
- A subset of the metrics get flushed (whatever batches were already
  sent)
- ts_mon never sends the remaining metrics till the offending metric is
  reset externally / the ts_mon flushing thread restarts.

This CL hardens flush() to continue flushing the rest of the metrics
beyond any corrupt metrics encountered. flush() continues to raise an
exception, so the failure mode is unchanged.

This is the second of a two part change. This changes the code path for
serialization to new proto type.

BUG= chromium:669669 
TEST=(new) unittest fails before change, passes afterwards.

Change-Id: I2e1bb1be7e63c643922eb3548ef34bc02eb053ee
Reviewed-on: https://chromium-review.googlesource.com/415584
Commit-Queue: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Sergey Berezin <sergeyberezin@chromium.org>

[modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/interface.py
[modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/metrics.py
[modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/test/interface_test.py
[modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/test/metrics_test.py

Blockedon: 674760
I can't roll the changes in #15 and #16 into chromite yet because updating infra_libs surfaces a bug in the way chromite sets up protobuf dependency (see blocking bug).
Status: Fixed (was: Started)
We've since updated ts_mon twice.
So, all done.

Comment 19 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Sign in to add a comment