Issue 723696

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: May 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




rpc-logserver metrics being dropped because of reverse timestamp

Project Member Reported by pprabhu@chromium.org, May 17 2017

Issue description

From apache_error_log_metrics_daemon.log on chromeos-server108.mtv:

HttpsMonitor.send received status 400: {
  "error": {
    "code": 400,
    "message": "Operation was attempted past the valid range.",
    "status": "OUT_OF_RANGE",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "[ORIGINAL ERROR] generic::out_of_range: APPLICATION_ERROR;streamz/StreamzCollection.Write;For metric '/chrome/infra/http/durations': Monarch only accepted 0 of 1 points. Example error:
New reset timestamp '2017/05/17-04:45:49.000 (1495021549000000)' must not be older than existing reset timestamp '2017/05/17-04:46:02.000 (1495021562000000)'. (monarch.acquisitions.Task{proxy_environment
= 'pa' acquisition_name = 'google.com:prodx-mon-chrome-infra' service_name = 'apache_error_log_metrics' job_name = 'apache_error_log_metrics' data_center = 'mtv' host_name = 'chromeos-server108' task_num
= 0 proxy_zone = 'atl'} /chrome/infra/http/durations{REDACTED}
    ]
  }
}
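
To illustrate the rejection above, here is a minimal sketch. It assumes the backend keeps one reset timestamp per (target, metric) stream and refuses any point whose reset timestamp is older than the one it already holds. `FakeBackend` and `write` are hypothetical names made up for this illustration, not the Monarch or ts_mon API; only the metric name, target fields and timestamps are taken from the log.

```python
class FakeBackend:
    """Toy stand-in for a metrics backend (hypothetical, for illustration only)."""

    def __init__(self):
        # (target, metric) -> newest accepted reset timestamp, in microseconds
        self._reset_ts = {}

    def write(self, target, metric, reset_ts_us):
        """Return True if the point is accepted, False if it is rejected."""
        key = (target, metric)
        existing = self._reset_ts.get(key)
        if existing is not None and reset_ts_us < existing:
            # Mirrors "New reset timestamp ... must not be older than
            # existing reset timestamp ..." from the log.
            return False
        self._reset_ts[key] = reset_ts_us
        return True


# Target fields from the log: service_name, host_name, task_num.
TARGET = ("apache_error_log_metrics", "chromeos-server108", 0)
METRIC = "/chrome/infra/http/durations"

backend = FakeBackend()
# Assumption: two processes report under the same target, each with its own
# reset timestamp. The one with the newer timestamp happens to write first.
newer_accepted = backend.write(TARGET, METRIC, 1495021562000000)
older_accepted = backend.write(TARGET, METRIC, 1495021549000000)
```

Under that assumption, whichever process carries the older reset timestamp keeps losing its points for as long as both report under the same target.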

We've seen something similar in the past for gs_offloader, where multiple copies of the gs_offloader process were stepping on each other's toes: issue 716259

 
I want to blame https://chromium-review.googlesource.com/c/500994/

On chromeos-server108.mtv, I see that both of these files were recently updated and contain these metrics errors:

root@chromeos-server108:/var/log# ls -lh apache_error_*
-rw-r--r-- 1 root root 1.1M May 17 09:00 apache_error_log_metrics_daemon.log
-rw-r--r-- 1 root root  38M May 17 09:10 apache_error_stats_daemon.log

Both of these processes are running (and stepping on each other's toes):
root@chromeos-server108:/var/log# ps aux | grep apache_error_
root      3954  0.0  0.0  81564 16824 ?        S    08:45   0:00 sudo -u chromeos-test /usr/local/autotest/site_utils/stats/apache_error_log_metrics.py
root      3956  0.0  0.0  81564 16828 ?        S    08:45   0:00 sudo -u chromeos-test /usr/local/autotest/site_utils/stats/apache_error_stats.py
chromeo+  3993  0.2  0.0 161864 26852 ?        Sl   08:45   0:04 python /usr/local/autotest/site_utils/stats/apache_error_stats.py
chromeo+  3995  0.2  0.0 161868 26852 ?        Sl   08:45   0:04 python /usr/local/autotest/site_utils/stats/apache_error_log_metrics.py
root     13987  0.0  0.0  23752   908 pts/4    S+   09:12   0:00 grep --color=auto apache_error_
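
The check done by eye on the ps output above could be scripted. This is a rough sketch, assuming the standard 11-column `ps aux` layout; `python_daemons`, `OLD_NAME` and `NEW_NAME` are names invented here, and the sample lines are copied from the output above.

```python
import os
from collections import defaultdict

PS_OUTPUT = """\
root      3954  0.0  0.0  81564 16824 ?        S    08:45   0:00 sudo -u chromeos-test /usr/local/autotest/site_utils/stats/apache_error_log_metrics.py
root      3956  0.0  0.0  81564 16828 ?        S    08:45   0:00 sudo -u chromeos-test /usr/local/autotest/site_utils/stats/apache_error_stats.py
chromeo+  3993  0.2  0.0 161864 26852 ?        Sl   08:45   0:04 python /usr/local/autotest/site_utils/stats/apache_error_stats.py
chromeo+  3995  0.2  0.0 161868 26852 ?        Sl   08:45   0:04 python /usr/local/autotest/site_utils/stats/apache_error_log_metrics.py
root     13987  0.0  0.0  23752   908 pts/4    S+   09:12   0:00 grep --color=auto apache_error_
"""

STATS_DIR = "/usr/local/autotest/site_utils/stats/"
OLD_NAME = "apache_error_stats.py"
NEW_NAME = "apache_error_log_metrics.py"


def python_daemons(ps_output, directory):
    """Map script basename -> list of PIDs for python processes under directory."""
    found = defaultdict(list)
    for line in ps_output.splitlines():
        # ps aux has 10 fixed columns followed by the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        pid, command = fields[1], fields[10]
        argv = command.split()
        # Skip the sudo wrappers and the grep itself; keep only the
        # python interpreters running a script from the stats directory.
        if len(argv) >= 2 and argv[0] == "python" and argv[1].startswith(directory):
            found[os.path.basename(argv[1])].append(int(pid))
    return dict(found)


daemons = python_daemons(PS_OUTPUT, STATS_DIR)
both_running = OLD_NAME in daemons and NEW_NAME in daemons
```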

https://chrome-internal-review.googlesource.com/c/373411/ renamed the puppet module to the new name. That CL removed the old resource that was installing the upstart job for apache_error_stats.py.

This is not safe. Removing the resource tells puppet to stop managing the service, but it doesn't tell puppet to actually remove the service from the servers. Instead, we should tell puppet to explicitly remove the service.

At this point, we can either
(1a) manually delete the old upstart job for apache_error_stats from all servers
or
(1b) re-add the job removed by the CL above, then ask puppet to explicitly remove it

and

(2) drop the symlink from apache_error_stats --> apache_error_log_metrics.


If we just do (2), it will "fix" the current problem because the old upstart job will simply fail.
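
For option (1a), a hedged sketch of the per-server cleanup follows. It only builds the shell commands and runs nothing. The job name `apache_error_stats` and the `/etc/init` location (the usual directory for system Upstart jobs) are assumptions, and `cleanup_commands` is a made-up helper; the real job file name on the servers should be checked first.

```python
import shlex


def cleanup_commands(job_name, init_dir="/etc/init"):
    """Build shell commands that stop an Upstart job and delete its config.

    Nothing is executed here; the caller hands the strings to whatever
    remote runner is in use.
    """
    conf_path = "{}/{}.conf".format(init_dir, job_name)
    return [
        # "|| true" keeps going when the job is already stopped or unknown.
        "initctl stop {} || true".format(shlex.quote(job_name)),
        # Remove the job definition so init does not start it again.
        "rm -f {}".format(shlex.quote(conf_path)),
    ]


# Assumed job name, taken from the script name in this issue.
commands = cleanup_commands("apache_error_stats")
```

Keeping the commands as plain strings makes it easy to review them before pushing them to every server.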

Comment 3 by pho...@chromium.org, May 17 2017

Status: Started (was: Assigned)
Let's drop the symlink, and use fabric to delete the old upstart job.
Project Member

Comment 4 by bugdroid1@chromium.org, May 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/5fdad794eb1e77bbefdf10f0703b0b54d026ae4a

commit 5fdad794eb1e77bbefdf10f0703b0b54d026ae4a
Author: Paul Hobbs <phobbs@google.com>
Date: Thu May 18 17:20:05 2017

[autotest] Remove symlink for apache_error_stats

BUG= chromium:723696 
TEST=None

Change-Id: I122697990c8f26b752a3680bfc28afa19bbc2536
Reviewed-on: https://chromium-review.googlesource.com/508211
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Trybot-Ready: Aviv Keshet <akeshet@chromium.org>

[delete] https://crrev.com/5734df660a52c6c33a03474b61288cc2d09eb6a9/site_utils/stats/apache_error_stats.py

Comment 5 by pho...@chromium.org, May 23 2017

Status: Fixed (was: Started)
This should be resolved now.
Project Member

Comment 6 by bugdroid1@chromium.org, May 24 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/17234132964a17b4105b3ed9f096a7e54226bcb8

commit 17234132964a17b4105b3ed9f096a7e54226bcb8
Author: Paul Hobbs <phobbs@google.com>
Date: Wed May 24 20:06:47 2017

Comment 7 by dchan@chromium.org, Aug 1 2017

Labels: VerifyIn-61

Comment 8 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
