rpc-logserver metrics being dropped because of reverse timestamp
Issue description
From apache_error_log_metrics_daemon.log on chromeos-server108.mtv:
HttpsMonitor.send received status 400: {
"error": {
"code": 400,
"message": "Operation was attempted past the valid range.",
"status": "OUT_OF_RANGE",
"details": [
{
"@type": "type.googleapis.com/google.rpc.DebugInfo",
"detail": "[ORIGINAL ERROR] generic::out_of_range: APPLICATION_ERROR;streamz/StreamzCollection.Write;For metric '/chrome/infra/http/durations': Monarch only accepted 0 of 1 points. Example error:
New reset timestamp '2017/05/17-04:45:49.000 (1495021549000000)' must not be older than existing reset timestamp '2017/05/17-04:46:02.000 (1495021562000000)'. (monarch.acquisitions.Task{proxy_environment
= 'pa' acquisition_name = 'google.com:prodx-mon-chrome-infra' service_name = 'apache_error_log_metrics' job_name = 'apache_error_log_metrics' data_center = 'mtv' host_name = 'chromeos-server108' task_num
= 0 proxy_zone = 'atl'} /chrome/infra/http/durations{REDACTED}
]
}
}
We've seen something similar in the past for gs_offloader, where multiple copies of the gs_offloader process were stepping on each other's toes: issue 716259
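A minimal sketch of how two processes reporting under the same task identity could produce the error above. This is a toy model, not Monarch's or ts_mon's actual logic: the `ToyTarget` class and `simulate` function are hypothetical, and it assumes each process sends its own reset timestamp for the cumulative metric. The two timestamps are the ones from the log.

```python
# Toy model (hypothetical, not the real Monarch/ts_mon code): a backend that
# keeps one reset timestamp per metric and rejects points whose reset
# timestamp is older than the one it already holds.

OLD_PROCESS_RESET_US = 1495021549000000  # 2017/05/17-04:45:49, from the log
NEW_PROCESS_RESET_US = 1495021562000000  # 2017/05/17-04:46:02, from the log


class ToyTarget(object):
    """Tracks the latest accepted reset timestamp for each metric name."""

    def __init__(self):
        self.reset_ts = {}

    def write(self, metric, reset_ts_us):
        """Return True if the point is accepted, False if it is rejected."""
        existing = self.reset_ts.get(metric)
        if existing is not None and reset_ts_us < existing:
            # Mirrors "New reset timestamp ... must not be older than
            # existing reset timestamp ...".
            return False
        self.reset_ts[metric] = reset_ts_us
        return True


def simulate():
    """Two processes share one task identity and alternate writes."""
    target = ToyTarget()
    metric = '/chrome/infra/http/durations'
    return [
        target.write(metric, NEW_PROCESS_RESET_US),  # newer process
        target.write(metric, OLD_PROCESS_RESET_US),  # older process
        target.write(metric, NEW_PROCESS_RESET_US),  # newer process again
    ]


if __name__ == '__main__':
    print(simulate())
```

In this model the older process's points are dropped once the newer process has written, which matches the "accepted 0 of 1 points" symptom.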
,
May 17 2017
https://chrome-internal-review.googlesource.com/c/373411/ renamed the puppet module to the new name. This CL removed the old resource that was installing the upstart job for apache_error_stats.py. This is not safe: it tells puppet to stop managing the service, but it doesn't tell puppet to actually remove the service from the servers. Instead, we should tell puppet to explicitly remove the service.
At this point, we can
(1a) manually delete the old upstart job for apache_error_stats from all servers, or
(1b) re-add the job removed by the CL above, then ask puppet to explicitly remove it,
and
(2) drop the symlink from apache_error_stats --> apache_error_log_metrics.
If we just do (2), it will "fix" the current problem because the old upstart job will simply fail.
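The difference between "no longer managed" and "explicitly removed" can be sketched with a toy reconciler. This is not Puppet's API; `reconcile` and the 'present'/'absent' strings are hypothetical stand-ins for a declarative catalog, and the service names are the ones from this bug.

```python
# Toy model (hypothetical, not Puppet): one config-management run over a host.
# Services missing from the catalog are left untouched; only an explicit
# 'absent' entry removes them.

def reconcile(installed, catalog):
    """Return the set of services on the host after one run.

    installed: set of service names currently on the host.
    catalog: dict mapping service name -> 'present' or 'absent'.
    """
    result = set(installed)
    for name, ensure in catalog.items():
        if ensure == 'present':
            result.add(name)
        elif ensure == 'absent':
            result.discard(name)
        else:
            raise ValueError('unknown ensure value: %r' % (ensure,))
    return result


host = {'apache_error_stats'}

# Old resource simply dropped from the catalog: the old job stays installed
# and runs alongside the new one.
after_rename_only = reconcile(host, {'apache_error_log_metrics': 'present'})

# Old resource kept in the catalog but marked absent: the old job is removed.
after_explicit_removal = reconcile(host, {
    'apache_error_log_metrics': 'present',
    'apache_error_stats': 'absent',
})

print(sorted(after_rename_only))
print(sorted(after_explicit_removal))
```

The first run corresponds to what the rename CL did; the second corresponds to option (1b).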
,
May 17 2017
Let's drop the symlink, and use fabric to delete the old upstart job.
,
May 18 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/5fdad794eb1e77bbefdf10f0703b0b54d026ae4a

commit 5fdad794eb1e77bbefdf10f0703b0b54d026ae4a
Author: Paul Hobbs <phobbs@google.com>
Date: Thu May 18 17:20:05 2017

[autotest] Remove symlink for apache_error_stats

BUG= chromium:723696
TEST=None

Change-Id: I122697990c8f26b752a3680bfc28afa19bbc2536
Reviewed-on: https://chromium-review.googlesource.com/508211
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Trybot-Ready: Aviv Keshet <akeshet@chromium.org>

[delete] https://crrev.com/5734df660a52c6c33a03474b61288cc2d09eb6a9/site_utils/stats/apache_error_stats.py
,
May 23 2017
Should be resolved
,
May 24 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/17234132964a17b4105b3ed9f096a7e54226bcb8

commit 17234132964a17b4105b3ed9f096a7e54226bcb8
Author: Paul Hobbs <phobbs@google.com>
Date: Wed May 24 20:06:47 2017
,
Aug 1 2017
,
Jan 22 2018
Comment 1 by pprabhu@chromium.org, May 17 2017