AFE master-db metrics corrupted by shard-like data
Issue description
No visible impact yet.
https://viceroy.corp.google.com/chromeos/afe_db?duration=8d#_VG_oYXH0N8z
All vital signs have been bonkers since Aug 18th.
,
Aug 21 2017
The time coincides with a push-to-prod on Friday.
Also, the process count doesn't match up with what I see on chromeos-server25:
chromeos-test@chromeos-server25:~$ ps aux | wc
361 4125 31238
(sysmon is reporting thousands of processes, but ps shows only 361 lines)
---------
Also, no host-scheduler:
chromeos-test@chromeos-server25:~$ ps aux | grep host_scheduler
chromeo+ 21638 0.0 0.0 23768 912 pts/7 S+ 10:38 0:00 grep --color host_scheduler
,
Aug 21 2017
At this point, I believe some shard is masquerading as chromeos-server25 and polluting the metrics.
,
Aug 21 2017
There are a couple of CLs up for review that add the FQDN and IP addresses of our servers as metrics, so that we can figure out who the doppelganger is.
,
Aug 22 2017
,
Aug 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f

commit b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f
Author: Allen Li <ayatane@chromium.org>
Date: Tue Aug 22 02:53:02 2017

sysmon: Add FQDN metric

Useful for debugging metrics getting emitted from unknown sources.

BUG= chromium:757494
TEST=None

Change-Id: If260ea32f7fe5bbe990610c1905f080040489235
Reviewed-on: https://chromium-review.googlesource.com/624457
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f/scripts/sysmon/net_metrics.py
[modify] https://crrev.com/b6bb5f88d864b0830f5b8d8f7fa2a6540b15a22f/scripts/sysmon/net_metrics_unittest.py
,
Aug 22 2017
The FQDN metric from #6 didn't reveal another FQDN behind the metrics reported for chromeos-server25.
/var/log/sysmon.log on chromeos-server25 is full of:
DEBUG:chromite.scripts.sysmon.git_metrics:Collecting Git timestamp 1503403728 for '/usr/local/autotest/.git'
WARNING:root:HttpsMonitor.send received status 400: {
  "error": {
    "code": 400,
    "message": "Operation was attempted past the valid range.",
    "status": "OUT_OF_RANGE",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "[ORIGINAL ERROR] generic::out_of_range: APPLICATION_ERROR;streamz/StreamzCollection.Write;For metric '/chrome/infra/chromeos/sysmon/git/hash': Monarch only accepted 0 of 2 points. Example error: New reset timestamp '2017/08/22-11:29:12.000 (1503426552000000)' must not be older than existing reset timestamp '2017/08/22-11:37:46.000 (1503427066000000)'. (monarch.acquisitions.Task{proxy_environment = 'pa' acquisition_name = 'google.com:prodx-mon-chrome-infra' service_name = 'sysmon' job_name = 'sysmon' data_center = 'mtv' host_name = 'chromeos-server25' task_num = 0 proxy_zone = 'atl'} /chrome/infra/chromeos/sysmon/git/hash{'/usr/local/autotest/.git'});AppErrorCode=11;StartTimeMs=1503431169609;tcp;Deadline(sec)=15.0;ResFormat=UNCOMPRESSED;ServerTimeSec=0.006512641906738281;LogBytes=256;FailFast;EffSecLevel=none;ReqFormat=UNCOMPRESSED;ReqID=12a8256f05089cdf;GlobalID=acf68029ebafb9c6;Server=10.83.8.208:4867"
      }
    ]
  }
}
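The two reset timestamps in that error can be pulled apart to see how far they disagree. This is a rough sketch, not part of sysmon: the function name and regex are made up for illustration, and reading the gap as "two different writers with different start times" is an assumption about what a reset timestamp means, not something the log states.

```python
import re

# Hypothetical helper: matches the microsecond epoch value that the error
# message prints in parentheses after each human-readable reset timestamp.
_RESET_RE = re.compile(r"reset timestamp '[^']*\((\d+)\)'")


def reset_timestamp_gap_seconds(message):
    """Return (new_us, existing_us, gap_seconds), or None if not found.

    Expects a message containing exactly two "reset timestamp '... (N)'"
    fragments, in the order "new" then "existing".
    """
    found = _RESET_RE.findall(message)
    if len(found) != 2:
        return None
    new_us, existing_us = (int(value) for value in found)
    # Positive gap means the rejected point has the older reset timestamp.
    return new_us, existing_us, (existing_us - new_us) / 1e6


if __name__ == "__main__":
    sample = (
        "New reset timestamp '2017/08/22-11:29:12.000 (1503426552000000)' "
        "must not be older than existing reset timestamp "
        "'2017/08/22-11:37:46.000 (1503427066000000)'."
    )
    print(reset_timestamp_gap_seconds(sample))
```

For the message quoted above the gap works out to 514 seconds (11:29:12 versus 11:37:46).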
,
Aug 23 2017
Allen found the culprit. The story is ... interesting ...
,
Aug 23 2017
chromeos-server100 (a shard) has this line in its /etc/hosts:
172.24.59.11 172.24.26.45 172
Note that 172.24.59.11 is the IP of chromeos-server100 and 172.24.26.45 is the IP of chromeos-server25 (database).
This causes hostname to report 172 as the hostname and 172.24.26.45 as the FQDN. That in turn causes Python's socket.getfqdn() to report chromeos-server25.mtv.corp.google.com as the FQDN.
I have fixed /etc/hosts on chromeos-server100 by hand. I suspect it will need a reboot to clear any caches.
root@172:~# stat /etc/hosts
File: ‘/etc/hosts’
Size: 198 Blocks: 8 IO Block: 4096 regular file
Device: ca01h/51713d Inode: 44565503 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2017-08-21 21:56:59.426528463 -0700
Modify: 2017-08-16 21:56:59.672635294 -0700
Change: 2017-08-16 21:56:59.672635294 -0700
Birth: -
root@172:~# cat /etc/hosts
127.0.0.1 localhost
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
172.24.59.11 172.24.26.45 172
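The bad /etc/hosts entry puts an IP address and a bare number where hostnames belong. A minimal sketch of a check that would flag such lines on other servers follows. It is not an existing tool: the function names are hypothetical, and treating any IP-shaped or all-digit name field as suspicious is an assumption about what counts as malformed.

```python
import ipaddress


def _looks_like_ip(token):
    """Return True if token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
    except ValueError:
        return False
    return True


def find_suspicious_hosts_lines(text):
    """Return [(line_number, line)] for hosts entries with odd name fields.

    An entry is flagged when any field after the leading address is itself
    an IP address or consists only of digits (for example "172").
    """
    suspicious = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        names = line.split()[1:]  # first field is the address itself
        if any(_looks_like_ip(name) or name.isdigit() for name in names):
            suspicious.append((number, line))
    return suspicious


if __name__ == "__main__":
    with open("/etc/hosts") as hosts_file:
        for number, line in find_suspicious_hosts_lines(hosts_file.read()):
            print("line %d looks malformed: %s" % (number, line))
```

Run against the file contents shown above, this flags only the last line, 172.24.59.11 172.24.26.45 172.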
,
Aug 30 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/07f80875789814c1b47e452581f749cd5adafddf

commit 07f80875789814c1b47e452581f749cd5adafddf
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Aug 30 20:04:23 2017

sysmon: Report network addresses as metrics

BUG= chromium:757494
TEST=unittests, manually run sysmon to verify metrics

Change-Id: I7f5f0fe7170ce178bfd746ce54af266de28ce1ad
Reviewed-on: https://chromium-review.googlesource.com/624461
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/07f80875789814c1b47e452581f749cd5adafddf/scripts/sysmon/net_metrics.py
[modify] https://crrev.com/07f80875789814c1b47e452581f749cd5adafddf/scripts/sysmon/net_metrics_unittest.py
,
Sep 11 2017
Comment 1 by pprabhu@chromium.org, Aug 21 2017