create a blackbox monitoring daemon for rpc calls to lab servers
Issue description
Create a daemon that continuously (~1x per minute) runs a no-op RPC against shards/afes in the lab and devservers, and monitors response time and return code.
,
Apr 26 2017
Some ideas:

The actual tool will do something like:
- Determine the list of shards
- Make RPC calls to master + shards using some RPC client library (frontend.py and friends)
- Report metrics using ts_mon (chromite).

This means dependencies on autotest code, so this tool should be part of autotest. There is a lot of precedent for putting such a tool somewhere under site_utils/...

As for where the service that continuously does this should run: one possible place that came up was the sentinel server. Currently, the sentinel server (see chromeos-admin/.../sentinel.pp) doesn't have an autotest checkout, so we'll have to add that:
- Something like 'require autotest' in the sentinel.pp module
- Add a cron job to run the command repeatedly, or write an upstart service that runs it continuously with a short sleep, as OP wanted (the second is preferred; cron jobs have issues when used with small intervals)

Using the sentinel service has some benefits:
- We get truly remote calls (no cautotest/ calling cautotest/)
- We can start moving other ad hoc services that currently run on cautotest to this server, since it will be set up to make RPCs correctly.
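The steps above (poll each server, time the call, report the outcome, sleep, repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions: `rpc_call` and `report` are hypothetical callables standing in for the real RPC client (frontend.py) and for ts_mon reporting, neither of which is used here, and the function names are invented for the sketch.

```python
import time


def poll_once(servers, rpc_call, clock=time.monotonic):
    """Call rpc_call(server) once per server; record outcome and latency.

    rpc_call is a hypothetical stand-in for a no-op RPC against an AFE.
    """
    results = []
    for server in servers:
        start = clock()
        try:
            rpc_call(server)
            status = "success"
        except Exception as e:  # a failed RPC is a data point, not a crash
            status = "failed: %s" % type(e).__name__
        results.append({
            "server": server,
            "status": status,
            "seconds": clock() - start,
        })
    return results


def run_forever(servers, rpc_call, report, interval=60.0,
                sleep=time.sleep, max_rounds=None):
    """Poll repeatedly, sleeping so that rounds start ~interval seconds apart.

    report is a hypothetical sink for results (where metrics would be sent).
    max_rounds exists only so the loop can be bounded in tests.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        started = time.monotonic()
        for result in poll_once(servers, rpc_call):
            report(result)
        rounds += 1
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval - elapsed))
```

Running the loop inside a long-lived service, rather than from cron, keeps the interval under the tool's own control, which matches the preference stated above for short intervals.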
,
Apr 26 2017
,
May 24 2017
,
May 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/df9a8ae1d6e460b6d3dfa4e520d0207d8f33153c

commit df9a8ae1d6e460b6d3dfa4e520d0207d8f33153c
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri May 26 00:16:00 2017

rpc_flight_recorder: Monitor service for AFEs

BUG= chromium:715386
TEST=run locally

Change-Id: I23b1d329f75214a2b67e05e624c878a4e23e7eb8
Reviewed-on: https://chromium-review.googlesource.com/501509
Commit-Ready: Chris Ching <chingcodes@chromium.org>
Tested-by: Chris Ching <chingcodes@chromium.org>
Reviewed-by: Chris Ching <chingcodes@chromium.org>

[add] https://crrev.com/df9a8ae1d6e460b6d3dfa4e520d0207d8f33153c/site_utils/rpc_flight_recorder.py
,
Jun 1 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/0a6ba313ac4c8a558b81eb4baf5236ba2cb3146b

commit 0a6ba313ac4c8a558b81eb4baf5236ba2cb3146b
Author: Chris Ching <chingcodes@chromium.org>
Date: Thu Jun 01 20:40:54 2017
,
Jun 1 2017
Seeing crashes in the daemon:
DEBUG:root:cautotest:get_motd:result =
INFO:root:cautotest:get_motd:success
DEBUG:root:Finished Server Polling
DEBUG:root:Starting Server Polling: cautotest
DEBUG:root:cautotest:get_motd:result =
INFO:root:cautotest:get_motd:success
DEBUG:root:Finished Server Polling
DEBUG:root:Starting Server Polling: cautotest
WARNING:root:cautotest:get_motd:failed - Uknown
INFO:root:Waiting for ts_mon flushing process to finish...
INFO:root:Finished waiting for ts_mon process.
Traceback (most recent call last):
  File "./rpc_flight_recorder.py", line 157, in <module>
    main(sys.argv)
  File "./rpc_flight_recorder.py", line 153, in main
    afe_monitor.poll_servers()
  File "./rpc_flight_recorder.py", line 64, in poll_servers
    self._pool.map(afe_rpc_call, self._servers)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
AttributeError: __exit__
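The traceback shows `pool.map` re-raising in the parent (`raise self._value`) an exception that occurred inside a worker, which ends the whole polling run. In Python 2.7, `AttributeError: __exit__` is the error a `with` statement raises for an object that does not implement the context manager protocol, so that is one plausible origin, though the log does not show where it was raised. One way to keep a single failing call from taking down the daemon is to catch exceptions inside the worker and return them as data. The sketch below is an assumption-laden illustration, not the fix that landed: `safe_call` and this `poll_servers` are hypothetical, and `ThreadPool` is swapped in for the process pool to keep the example self-contained.

```python
from multiprocessing.pool import ThreadPool


def safe_call(args):
    """Run func(server) and never raise; failures come back as results."""
    func, server = args
    try:
        return (server, "success", func(server))
    except Exception as e:  # keep the worker's failure out of pool.map
        return (server, "failed", repr(e))


def poll_servers(pool, func, servers):
    """Map func over servers; results come back in the order given."""
    return pool.map(safe_call, [(func, s) for s in servers])
```

With this shape, the caller sees one result tuple per server and can log or report the failed ones instead of crashing.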
,
Jun 2 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/7238fdad7938f1a8481149414513d4ffca23da59

commit 7238fdad7938f1a8481149414513d4ffca23da59
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri Jun 02 18:48:16 2017
,
Jun 2 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/500b9e7452c4c319ac71848ca830fcc564b2445b

commit 500b9e7452c4c319ac71848ca830fcc564b2445b
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri Jun 02 21:15:23 2017
,
Jun 2 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1c0fe8bb5099e9d9334420ec13bfaf5479857f7d

commit 1c0fe8bb5099e9d9334420ec13bfaf5479857f7d
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri Jun 02 22:11:21 2017

rpc_flight_recorder: fix exception handling

BUG= chromium:715386
TEST=run locally

Change-Id: I90b54431a80108e960c5f6cd696d6f8b51fcb4b2
Reviewed-on: https://chromium-review.googlesource.com/522886
Commit-Ready: Chris Ching <chingcodes@chromium.org>
Tested-by: Chris Ching <chingcodes@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/1c0fe8bb5099e9d9334420ec13bfaf5479857f7d/site_utils/rpc_flight_recorder.py
,
Jun 5 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/129c0d4b9037536193f4ef5df74ce1394f5b5d35

commit 129c0d4b9037536193f4ef5df74ce1394f5b5d35
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Mon Jun 05 08:18:08 2017

[autotest] Fix fields provided to rpc_flight_recorder metric

BUG= chromium:715386
TEST=None

Change-Id: If619e86f4d02d3ee5936cea5a5a9b46dededabd1
Reviewed-on: https://chromium-review.googlesource.com/522866
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Chris Ching <chingcodes@chromium.org>

[modify] https://crrev.com/129c0d4b9037536193f4ef5df74ce1394f5b5d35/site_utils/rpc_flight_recorder.py
,
Jun 7 2017
Chase-Pending justification: RPC performance problems and outages cause lab outages on a regular basis. Proper monitoring is essential to identifying them.
,
Jun 12 2017
,
Jun 19 2017
CL in flight for getting the shard list.
,
Jun 19 2017
And another for getting a DB-touching RPC time.
,
Jun 26 2017
Viceroy dashboard is up now: https://viceroy.corp.google.com/chromeos/afe_rpc_blackbox
,
Jun 26 2017
Last remaining CLs in review (shard list updating; db-touching operation).
,
Jun 28 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b311a424fa831d5db29f163a1aebb27ddec3d8dd

commit b311a424fa831d5db29f163a1aebb27ddec3d8dd
Author: Chris Ching <chingcodes@chromium.org>
Date: Wed Jun 28 10:09:48 2017

rpc_flight_recorder: add shard updating option

BUG= chromium:715386
TEST=run locally

Change-Id: I45a2f065d00672a375e008067a608953835ce490
Reviewed-on: https://chromium-review.googlesource.com/533638
Commit-Ready: Chris Ching <chingcodes@chromium.org>
Tested-by: Chris Ching <chingcodes@chromium.org>
Reviewed-by: Chris Ching <chingcodes@chromium.org>

[modify] https://crrev.com/b311a424fa831d5db29f163a1aebb27ddec3d8dd/site_utils/rpc_flight_recorder.py
,
Jun 29 2017
,
Jan 22 2018
Comment 1 by akes...@chromium.org
, Apr 26 2017