ts_mon should emit a metric that counts when other metrics have flush errors |
||||||
Issue descriptionToday, if a metric has a error during flush (due to a bad field for instance) this causes ts_mon to simply log a warning and throw that metric away (possible also throwing other metrics away? needs investigation) Instead, ts_mon should update a counter like /chrome/infra/chromeos/ts_mon/flush_failed with a field metric_name="/name/of/the/metric/that/failed". This will make it easier to identify metrics that are having flush problems.
,
Nov 29 2016
What would happen if this new metric has a flush error, or gets caught in a batch that failed to flush? In addition to timeseries monitoring, we have a logging pipeline (cloudtail + Cloud Logging API) which would log these errors, and in particular, allows counting error log lines in a separate metric from a separate service, e.g. like this: http://shortn/_cQD5P5thgZ The metric is defined here: https://chromium.googlesource.com/infra/infra/+/master/go/src/infra/tools/cloudtail/client.go#99 You can monitor / alert on the amount of error / warning logs. Would that solve the problem?
,
Nov 29 2016
We're not using the cloud logging features of ts_mon as far as I know. The errors I was referring to were being logged on the client side and did not as far as I know make it up to any cloud service.
,
Nov 29 2016
It's not a ts_mon feature, it's a whole separate pipeline. I understand that you don't use it yet, and my implied question was if it makes sense to consider using it? For most of our services, we send all the logs to a single project, and go/chrome-infra-logs has all of them in one place. The logs are also exported to BigQuery for longer retention and more complex queries. This proved quite useful for us on many occasions.
,
Nov 29 2016
Ah, ok. Yes, some unified cloud based logging that we use across all our services would be good, but that sounds like a rather large scope. I'd like to revisit that in a future Q. In the meantime, still think this ts_mon metric makes sense.
,
Nov 29 2016
We already have a PubSub connection monitoring from ts_mon, e.g. http://shortn/_9vHVXsIIys This of course wouldn't catch failures within the client before the network call, e.g. in protobuf assignments. What types of failures are you after?
,
Nov 30 2016
Example https://bugs.chromium.org/p/chromium/issues/detail?id=669577 11/29 10:19:30.972 ERROR| interface:0166| Automatic monitoring flush failed. Traceback (most recent call last): File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/interface.py", line 164, in _flush_and_log_exceptions flush() File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/interface.py", line 109, in flush metric.serialize_to(proto, start_time, fields, value, target) File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/metrics.py", line 115, in serialize_to self._populate_fields(metric_pb, fields) File "/usr/local/autotest/site-packages/chromite/third_party/infra_libs/ts_mon/common/metrics.py", line 142, in _populate_fields raise errors.MonitoringInvalidFieldTypeError(self._name, key, value) MonitoringInvalidFieldTypeError: Metric "chrome/infra/chromeos/autotest/resolve_devserver" was given field "devserver" with invalid value "<autotest_lib.client.common_lib.cros.dev_server.AndroidBuildServer object at 0x7f28d8243e50>" (<class 'autotest_lib.client.common_lib.cros.dev_server.AndroidBuildServer'>).
,
Nov 30 2016
I see. Well, this will abort the entire flush() operation before any of the metrics were sent, so your proposed new metric will also be dropped. And once one such bad metric makes it into the metric store, it will likely continue crashing until the process is restarted. This is precisely why I'm suggesting a parallel logging pipeline with its own separate metrics. It shouldn't be that hard to set up, actually. We already have all the machinery for log scraping and Puppet setups. See the docs at http://go/inframon-doc > cloudtail: Cloud Logging.
,
Nov 30 2016
RE 8 yeah I'm suggesting that we catch and report that flush error inside of ts_mon flush, so that we don't abort the entire flush operation. From a ts_mon API perspective, the existing behavior of aborting flush of all metrics due to a single malformed metric is suboptimal too.
,
Nov 30 2016
1. I agree that one metrics failing to flush shouldn't prevent the other metrics from flushing. 2. I'm not sure using a Monarch metric to report these failures is appropriate. On the logging tangent: We're currently using metrics to track some errors that may be more appropriate as log messages. I think this is because we don't currently have a unified logging infrastructure, nor ways to run queries or alert on log messages. Since we do have Monarch metrics set up, it's tempting to reuse our Monarch metric infrastructure to track errors. If Monarch works well as a tool for this job, that's fine, but I suspect what we'd like is a unified logging framework to collect detailed information about errors, in parallel with Monarch stats for errors that happen often enough or aren't completely preventable to warrant time series tracking.
,
Nov 30 2016
I'll take a stab at making metrics serialization errors not fail the whole flush. We can discuss on the CL if this is the correct approach.
,
Dec 1 2016
Uploaded 2 CLs to continue-on-error when flushing metrics: https://chromium-review.googlesource.com/#/c/415447/ https://chromium-review.googlesource.com/#/c/415584/ @sergeyberezin: Can you help review?
,
Dec 1 2016
Left some initial comments on the CLs. Thanks for making ts_mon better!
,
Dec 6 2016
,
Dec 7 2016
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra.git/+/038c15d3882c4b84f124ab24c06bcc48394fbe93 commit 038c15d3882c4b84f124ab24c06bcc48394fbe93 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Nov 30 22:33:19 2016 ts_mon: Continue flushing metrics beyond serialization error. Before this CL, If flush() encounters an error while serializing a metric, flush immediately aborts. This means that - A subset of the metrics get flushed (whatever batches were already sent) - ts_mon never sends the remaining metrics till the offending metric is reset externally / the ts_mon flushing thread restarts. This CL hardens flush() to continue flushing the rest of the metrics beyond any corrupt metrics encountered. flush() continues to raise an exception, so the failure mode is unchanged. This is the first of a two part change. This only changes the code path for serialization to old proto type. BUG= chromium:669669 TEST=(new) unittest fails before change, passes afterwards. Change-Id: I2c158144cc6fd1f4a7b40ca7a625c7ff2400bfcc Reviewed-on: https://chromium-review.googlesource.com/415447 Commit-Queue: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Sergey Berezin <sergeyberezin@chromium.org> [modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/errors.py [modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/interface.py [modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/metrics.py [modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/test/errors_test.py [modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/test/interface_test.py [modify] https://crrev.com/038c15d3882c4b84f124ab24c06bcc48394fbe93/packages/infra_libs/infra_libs/ts_mon/common/test/metrics_test.py
,
Dec 7 2016
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra.git/+/23de8a9d518ad8566a328e4349f0a1afcb3dad8a commit 23de8a9d518ad8566a328e4349f0a1afcb3dad8a Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Nov 30 23:31:15 2016 ts_mon: Continue flushing metrics beyond serialization error [2/2] Before this CL, If flush() encounters an error while serializing a metric, flush immediately aborts. This means that - A subset of the metrics get flushed (whatever batches were already sent) - ts_mon never sends the remaining metrics till the offending metric is reset externally / the ts_mon flushing thread restarts. This CL hardens flush() to continue flushing the rest of the metrics beyond any corrupt metrics encountered. flush() continues to raise an exception, so the failure mode is unchanged. This is the second of a two part change. This changes the code path for serialization to new proto type. BUG= chromium:669669 TEST=(new) unittest fails before change, passes afterwards. Change-Id: I2e1bb1be7e63c643922eb3548ef34bc02eb053ee Reviewed-on: https://chromium-review.googlesource.com/415584 Commit-Queue: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Sergey Berezin <sergeyberezin@chromium.org> [modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/interface.py [modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/metrics.py [modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/test/interface_test.py [modify] https://crrev.com/23de8a9d518ad8566a328e4349f0a1afcb3dad8a/packages/infra_libs/infra_libs/ts_mon/common/test/metrics_test.py
,
Dec 27 2016
I can't roll the changes in #15 and #16 into chromite yet because updating infra_libs surfaces a bug in the way chromite sets up protobuf dependency (see blocking bug).
,
Jun 28 2017
We've since updated ts_mon twice. So, all done.
,
Jan 22 2018
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by akes...@chromium.org
, Nov 29 2016