
Issue 757494


Issue metadata

Status: Verified
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




AFE master-db metrics corrupted by shard-like data

Project Member Reported by pprabhu@chromium.org, Aug 21 2017

Issue description

No visible impact yet.

https://viceroy.corp.google.com/chromeos/afe_db?duration=8d#_VG_oYXH0N8z

All vital signs are bonkers since Aug 18th.
 
Also, why the heck is there a host-scheduler running on chromeos-server25? https://viceroy.corp.google.com/chromeos/capacity_health?duration=8d#_VG_-lD6hO9o
The time coincides with a push-to-prod on Friday.

Also, the process count doesn't match up with what I see on chromeos-server25:
chromeos-test@chromeos-server25:~$ ps aux | wc
    361    4125   31238

(sysmon is reporting thousands of processes, but ps shows only 361 lines.)

---------
Also, no host-scheduler:
chromeos-test@chromeos-server25:~$ ps aux | grep host_scheduler
chromeo+ 21638  0.0  0.0  23768   912 pts/7    S+   10:38   0:00 grep --color host_scheduler

At this point, I believe some shard is masquerading as chromeos-server25 and polluting the metrics.
Status: Started (was: Assigned)
There are a couple of CLs up for review that report the FQDN and IP addresses of our servers as metrics, so that we can figure out who the doppelganger is.
Summary: AFE master-db metrics corrupted by shard-like data (was: AFE master-db vital signs show process leak from Aug 18th onwards)
Project Member

Comment 6 by bugdroid1@chromium.org, Aug 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f

commit b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f
Author: Allen Li <ayatane@chromium.org>
Date: Tue Aug 22 02:53:02 2017

sysmon: Add FQDN metric

Useful for debugging metrics getting emitted from unknown sources.

BUG= chromium:757494 
TEST=None

Change-Id: If260ea32f7fe5bbe990610c1905f080040489235
Reviewed-on: https://chromium-review.googlesource.com/624457
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f/scripts/sysmon/net_metrics.py
[modify] https://crrev.com/b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f/scripts/sysmon/net_metrics_unittest.py
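For readers following the debugging approach, here is a minimal sketch of the idea behind an FQDN metric: record what the machine believes its own names are, so a mismatch becomes visible. This is not the chromite code from the CL above; the function name collect_host_identity and the 'consistent' check are hypothetical, and the resolver functions are injectable only so the logic can be exercised without a real resolver.

```python
import socket


def collect_host_identity(gethostname=socket.gethostname,
                          getfqdn=socket.getfqdn):
    """Return the names this machine reports for itself.

    Hypothetical helper for illustration; sysmon's real metric
    collection is not shown in this thread.
    """
    hostname = gethostname()
    fqdn = getfqdn()
    return {
        'hostname': hostname,
        'fqdn': fqdn,
        # If the FQDN is neither the hostname itself nor the hostname
        # followed by a domain, name resolution is probably misconfigured.
        'consistent': fqdn == hostname or fqdn.startswith(hostname + '.'),
    }
```

Calling collect_host_identity() with no arguments uses the real resolver. Passing stub functions that return the values seen later in this thread (hostname 172, FQDN chromeos-server25.mtv.corp.google.com) yields 'consistent': False.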

The FQDN metric from #6 didn't reveal another FQDN among the metrics reported for chromeos-server25.

/var/log/sysmon.log on chromeos-server25 is full of:

DEBUG:chromite.scripts.sysmon.git_metrics:Collecting Git timestamp 1503403728 for '/usr/local/autotest/.git'
WARNING:root:HttpsMonitor.send received status 400: {
  "error": {
    "code": 400,
    "message": "Operation was attempted past the valid range.",
    "status": "OUT_OF_RANGE",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "[ORIGINAL ERROR] generic::out_of_range: APPLICATION_ERROR;streamz/StreamzCollection.Write;For metric '/chrome/infra/chromeos/sysmon/git/hash': Monarch only accepted 0 of 2 points. Example error: New reset timestamp '2017/08/22-11:29:12.000 (1503426552000000)' must not be older than existing reset timestamp '2017/08/22-11:37:46.000 (1503427066000000)'. (monarch.acquisitions.Task{proxy_environment = 'pa' acquisition_name = 'google.com:prodx-mon-chrome-infra' service_name = 'sysmon' job_name = 'sysmon' data_center = 'mtv' host_name = 'chromeos-server25' task_num = 0 proxy_zone = 'atl'} /chrome/infra/chromeos/sysmon/git/hash{'/usr/local/autotest/.git'});AppErrorCode=11;StartTimeMs=1503431169609;tcp;Deadline(sec)=15.0;ResFormat=UNCOMPRESSED;ServerTimeSec=0.006512641906738281;LogBytes=256;FailFast;EffSecLevel=none;ReqFormat=UNCOMPRESSED;ReqID=12a8256f05089cdf;GlobalID=acf68029ebafb9c6;Server=10.83.8.208:4867"
      }
    ]
  }
}
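One reading of this error, offered as an assumption rather than a confirmed diagnosis, is that some other writer with a later start time is already writing to the same monitoring target, so points from this sysmon carry an older reset timestamp and are rejected. The sketch below pulls the two microsecond timestamps out of such a message and reports the gap. The regular expression is fitted to the message text shown above, and the function name is hypothetical.

```python
import re

# Matches: reset timestamp '2017/08/22-11:29:12.000 (1503426552000000)'
# and captures the microsecond value inside the parentheses.
_RESET_TS = re.compile(r"reset timestamp '[^']*\((\d+)\)'")


def reset_timestamp_gap_seconds(message):
    """Return (existing - new) reset timestamp in seconds, or None.

    Expects exactly two reset timestamps in the message: the new one
    first, then the existing one, as in the log line above.
    """
    stamps = [int(value) for value in _RESET_TS.findall(message)]
    if len(stamps) != 2:
        return None
    new_us, existing_us = stamps
    return (existing_us - new_us) / 1e6
```

For the message above, the existing reset timestamp is 514 seconds newer than the one this sysmon tried to write.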


Owner: ayatane@chromium.org
Allen found the culprit.
The story is ... interesting ...
chromeos-server100 (a shard) has this line in its /etc/hosts:

172.24.59.11		172.24.26.45 172

Note that 172.24.59.11 is the IP of chromeos-server100 and 172.24.26.45 is the IP of chromeos-server25 (database).

This causes the hostname command to report 172 as the hostname and 172.24.26.45 as the FQDN.

In turn, Python's socket.getfqdn() reports chromeos-server25.mtv.corp.google.com as the FQDN, so metrics emitted from the shard appear to come from chromeos-server25.
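A check along the following lines could flag this kind of entry before it causes trouble. It is a minimal sketch, assuming the usual /etc/hosts layout of an address followed by a canonical name and optional aliases; the function name suspicious_hosts_entries is hypothetical, and the rule it applies (a name field that is itself an IP address or purely numeric) is just one heuristic.

```python
import ipaddress


def _looks_numeric(name):
    """True if a hosts-file name is an IP address or only digits."""
    if name.isdigit():
        return True
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return False
    return True


def suspicious_hosts_entries(hosts_text):
    """Return (line number, address, name) for names that look numeric."""
    findings = []
    for lineno, raw in enumerate(hosts_text.splitlines(), start=1):
        # Drop comments and surrounding whitespace; skip empty lines.
        line = raw.split('#', 1)[0].strip()
        if not line:
            continue
        # First field is the address; the rest are names and aliases.
        address, *names = line.split()
        for name in names:
            if _looks_numeric(name):
                findings.append((lineno, address, name))
    return findings
```

Run against the file contents quoted in this comment, it flags both 172.24.26.45 and 172 on the 172.24.59.11 line and nothing else.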

I have fixed /etc/hosts on chromeos-server100 by hand.  I suspect it will need a reboot to clear any caches.

root@172:~# stat /etc/hosts
  File: ‘/etc/hosts’
  Size: 198       	Blocks: 8          IO Block: 4096   regular file
Device: ca01h/51713d	Inode: 44565503    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-08-21 21:56:59.426528463 -0700
Modify: 2017-08-16 21:56:59.672635294 -0700
Change: 2017-08-16 21:56:59.672635294 -0700
 Birth: -
root@172:~# cat /etc/hosts

127.0.0.1	localhost
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

172.24.59.11		172.24.26.45 172
Project Member

Comment 10 by bugdroid1@chromium.org, Aug 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/07f80875789814c1b47e452581f749cd5adafddf

commit 07f80875789814c1b47e452581f749cd5adafddf
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Aug 30 20:04:23 2017

sysmon: Report network addresses as metrics

BUG= chromium:757494 
TEST=unittests, manually run sysmon to verify metrics

Change-Id: I7f5f0fe7170ce178bfd746ce54af266de28ce1ad
Reviewed-on: https://chromium-review.googlesource.com/624461
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/07f80875789814c1b47e452581f749cd5adafddf/scripts/sysmon/net_metrics.py
[modify] https://crrev.com/07f80875789814c1b47e452581f749cd5adafddf/scripts/sysmon/net_metrics_unittest.py

Status: Verified (was: Started)