
Issue 715386


Issue metadata

Status: Archived
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




create a blackbox monitoring daemon for rpc calls to lab servers

Project Member Reported by akes...@chromium.org, Apr 26 2017

Issue description

Create a daemon that continuously (about once per minute) runs a no-op RPC against the shards/AFEs in the lab and against the devservers, and monitors the response time and return code.

(It should also report this info to ts_mon, obviously.)
Some ideas:

The actual tool will do something like:
- Determine the list of shards
- Make RPC calls to master + shards using some RPC client library (frontend.py and friends)
- Report metrics using ts_mon (chromite).
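
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual rpc_flight_recorder.py code: `rpc_call` and `report` are hypothetical stand-ins for the real RPC client (frontend.py and friends) and for the ts_mon reporting call, and `RpcResult`, `timed_noop_rpc`, `poll_once` and `fake_rpc` are names made up for this example.

```python
import time
from collections import namedtuple

# Hypothetical result record; the real tool would report to ts_mon instead.
RpcResult = namedtuple('RpcResult', ['server', 'success', 'seconds'])


def timed_noop_rpc(server, rpc_call, clock=time.time):
    """Run rpc_call(server) once and record success and elapsed time."""
    start = clock()
    try:
        rpc_call(server)
        success = True
    except Exception:
        # A failed RPC is a data point to report, not a reason to crash.
        success = False
    return RpcResult(server, success, clock() - start)


def poll_once(servers, rpc_call, report):
    """Poll every server once and hand each result to report()."""
    for server in servers:
        report(timed_noop_rpc(server, rpc_call))


def fake_rpc(server):
    """Stand-in RPC for illustration: one server is unreachable."""
    if server == 'shard-down':
        raise IOError('connection refused')


results = []
poll_once(['master', 'shard-1', 'shard-down'], fake_rpc, results.append)
for r in results:
    print('%s success=%s %.3fs' % (r.server, r.success, r.seconds))
```

Passing the RPC and the reporter in as callables keeps the timing logic testable without an autotest checkout or a ts_mon setup.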

This means a dependency on autotest code, so this tool should be part of autotest. There is a lot of precedent for putting such a tool somewhere under site_utils/...


As for where the service that continuously does this should run: one possible place that came up was the sentinel server.
Currently, the sentinel server (see chromeos-admin/.../sentinel.pp) doesn't have an autotest checkout, so we'll have to add that.
- Something like 'require autotest' in the sentinel.pp module
- Add a cron job to run the command repeatedly, or write an upstart service that runs it continuously with a short sleep, as OP wanted (the second option is preferred; cron jobs have issues when used with small intervals)
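
For the "run continuously with a short sleep" option, one possible shape of the loop is sketched below. It is an assumption about how such a service could be written, not the code that landed: `run_forever` and its parameters are hypothetical, and `max_iterations` exists only so the sketch can be exercised without running indefinitely.

```python
import time


def run_forever(poll, interval_secs=60.0, clock=time.time, sleep=time.sleep,
                max_iterations=None):
    """Call poll() roughly once per interval_secs.

    A real daemon would leave max_iterations as None and loop until killed.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        start = clock()
        poll()
        elapsed = clock() - start
        # Sleep only for the remainder of the interval, so a slow poll
        # does not stretch the period between polls.
        sleep(max(0.0, interval_secs - elapsed))
        iterations += 1


calls = []
run_forever(lambda: calls.append(1), interval_secs=0.01, max_iterations=3)
print('polled %d times' % len(calls))
```

Unlike a cron entry, a loop like this is not limited to one-minute granularity and never starts a second poll while the previous one is still running.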

Using the sentinel service has some benefits:
- We get truly remote calls (no cautotest/ calling cautotest/)
- We can start moving other ad hoc services that currently run on cautotest to this server, since it will be set up to make RPCs correctly.
Labels: -current-issue
Owner: chingcodes@chromium.org
Status: Assigned (was: Untriaged)
Status: Started (was: Assigned)
Project Member

Comment 5 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/df9a8ae1d6e460b6d3dfa4e520d0207d8f33153c

commit df9a8ae1d6e460b6d3dfa4e520d0207d8f33153c
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri May 26 00:16:00 2017

rpc_flight_recorder: Monitor service for AFEs

BUG= chromium:715386 
TEST=run locally

Change-Id: I23b1d329f75214a2b67e05e624c878a4e23e7eb8
Reviewed-on: https://chromium-review.googlesource.com/501509
Commit-Ready: Chris Ching <chingcodes@chromium.org>
Tested-by: Chris Ching <chingcodes@chromium.org>
Reviewed-by: Chris Ching <chingcodes@chromium.org>

[add] https://crrev.com/df9a8ae1d6e460b6d3dfa4e520d0207d8f33153c/site_utils/rpc_flight_recorder.py

Project Member

Comment 6 by bugdroid1@chromium.org, Jun 1 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/0a6ba313ac4c8a558b81eb4baf5236ba2cb3146b

commit 0a6ba313ac4c8a558b81eb4baf5236ba2cb3146b
Author: Chris Ching <chingcodes@chromium.org>
Date: Thu Jun 01 20:40:54 2017

Project Member

Comment 7 by bugdroid1@chromium.org, Jun 1 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/0a6ba313ac4c8a558b81eb4baf5236ba2cb3146b

commit 0a6ba313ac4c8a558b81eb4baf5236ba2cb3146b
Author: Chris Ching <chingcodes@chromium.org>
Date: Thu Jun 01 20:40:54 2017

Seeing crashes in daemon.

DEBUG:root:cautotest:get_motd:result =
INFO:root:cautotest:get_motd:success
DEBUG:root:Finished Server Polling
DEBUG:root:Starting Server Polling: cautotest
DEBUG:root:cautotest:get_motd:result =
INFO:root:cautotest:get_motd:success
DEBUG:root:Finished Server Polling
DEBUG:root:Starting Server Polling: cautotest
WARNING:root:cautotest:get_motd:failed - Uknown
INFO:root:Waiting for ts_mon flushing process to finish...
INFO:root:Finished waiting for ts_mon process.
Traceback (most recent call last):
  File "./rpc_flight_recorder.py", line 157, in <module>
    main(sys.argv)
  File "./rpc_flight_recorder.py", line 153, in main
    afe_monitor.poll_servers()
  File "./rpc_flight_recorder.py", line 64, in poll_servers
    self._pool.map(afe_rpc_call, self._servers)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
AttributeError: __exit__
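
The traceback shows an exception raised inside a worker being re-raised by pool.map() in the main process, which takes the whole daemon down. (In Python 2.7, `AttributeError: __exit__` is what a `with` statement raises when its object is not a context manager, which is one common cause of this message.) A minimal sketch of one way to contain such failures is to catch exceptions inside the worker and return them as data. The names `safe_call`, `poll_servers` and `flaky_rpc` are hypothetical, and ThreadPool is used here only to keep the example self-contained; this is not the actual fix that was committed.

```python
from multiprocessing.pool import ThreadPool


def safe_call(args):
    """Run one RPC and turn any exception into a result tuple."""
    server, rpc_call = args
    try:
        rpc_call(server)
        return (server, True, None)
    except Exception as e:
        # Returning the failure keeps pool.map() from re-raising it
        # in the main process and killing the polling loop.
        return (server, False, type(e).__name__)


def poll_servers(pool, servers, rpc_call):
    """Poll all servers in parallel; results come back in input order."""
    return pool.map(safe_call, [(s, rpc_call) for s in servers])


def flaky_rpc(server):
    """Stand-in RPC for illustration: fails for one server."""
    if server == 'bad':
        raise AttributeError('__exit__')


pool = ThreadPool(2)
try:
    outcome = poll_servers(pool, ['cautotest', 'bad'], flaky_rpc)
finally:
    pool.close()
    pool.join()
print(outcome)
```

With this shape, a failing server shows up as a failed data point in the results while polling of the other servers carries on.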
Project Member

Comment 9 by bugdroid1@chromium.org, Jun 2 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/7238fdad7938f1a8481149414513d4ffca23da59

commit 7238fdad7938f1a8481149414513d4ffca23da59
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri Jun 02 18:48:16 2017

Project Member

Comment 10 by bugdroid1@chromium.org, Jun 2 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/500b9e7452c4c319ac71848ca830fcc564b2445b

commit 500b9e7452c4c319ac71848ca830fcc564b2445b
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri Jun 02 21:15:23 2017

Project Member

Comment 11 by bugdroid1@chromium.org, Jun 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1c0fe8bb5099e9d9334420ec13bfaf5479857f7d

commit 1c0fe8bb5099e9d9334420ec13bfaf5479857f7d
Author: Chris Ching <chingcodes@chromium.org>
Date: Fri Jun 02 22:11:21 2017

rpc_flight_recorder: fix exception handling

BUG= chromium:715386 
TEST=run locally

Change-Id: I90b54431a80108e960c5f6cd696d6f8b51fcb4b2
Reviewed-on: https://chromium-review.googlesource.com/522886
Commit-Ready: Chris Ching <chingcodes@chromium.org>
Tested-by: Chris Ching <chingcodes@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/1c0fe8bb5099e9d9334420ec13bfaf5479857f7d/site_utils/rpc_flight_recorder.py

Project Member

Comment 12 by bugdroid1@chromium.org, Jun 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/129c0d4b9037536193f4ef5df74ce1394f5b5d35

commit 129c0d4b9037536193f4ef5df74ce1394f5b5d35
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Mon Jun 05 08:18:08 2017

[autotest] Fix fields provided to rpc_flight_recorder metric

BUG= chromium:715386 
TEST=None

Change-Id: If619e86f4d02d3ee5936cea5a5a9b46dededabd1
Reviewed-on: https://chromium-review.googlesource.com/522866
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Chris Ching <chingcodes@chromium.org>

[modify] https://crrev.com/129c0d4b9037536193f4ef5df74ce1394f5b5d35/site_utils/rpc_flight_recorder.py

Labels: -Pri-2 Chase-Pending Pri-1
Chase-Pending justification: RPC performance problems and outages cause lab outages on a regular basis. Proper monitoring is essential to identifying them.
Labels: -Chase-Pending Chase
CL in flight for getting the shard list,
and another for timing a db-touching RPC.
Last remaining CLs in review (shard list updating; db-touching operation).
Project Member

Comment 19 by bugdroid1@chromium.org, Jun 28 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b311a424fa831d5db29f163a1aebb27ddec3d8dd

commit b311a424fa831d5db29f163a1aebb27ddec3d8dd
Author: Chris Ching <chingcodes@chromium.org>
Date: Wed Jun 28 10:09:48 2017

rpc_flight_recorder: add shard updating option

BUG= chromium:715386 
TEST=run locally

Change-Id: I45a2f065d00672a375e008067a608953835ce490
Reviewed-on: https://chromium-review.googlesource.com/533638
Commit-Ready: Chris Ching <chingcodes@chromium.org>
Tested-by: Chris Ching <chingcodes@chromium.org>
Reviewed-by: Chris Ching <chingcodes@chromium.org>

[modify] https://crrev.com/b311a424fa831d5db29f163a1aebb27ddec3d8dd/site_utils/rpc_flight_recorder.py

Status: Fixed (was: Started)

Comment 21 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
