Add metric for when RPC RetryingAFE call times out
Issue description

Somewhere around the RetryingAFE run call in cros/dynamic_suite/frontend_wrappers.py, add a metric that is emitted whenever an AFE RPC call times out. (See issue 723645 for examples and stack traces of the failures we want metrics for.) Ideally there would be fields for:
- hostname generating the RPC
- destination RPC server
- call being made (e.g. get_hosts)
- script generating the RPC
- count of retries (if possible)

The metric's value would be the length of the timeout before the call failed.
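To make the request concrete, here is a minimal sketch of a retry loop that records one data point when the overall timeout is exhausted. It is not the RetryingAFE code: `call_with_retries`, `record_timeout`, `RpcTimeout`, and the in-memory `RECORDED` list are all hypothetical stand-ins, and a real change would send the fields to the actual metrics backend instead.

```python
import socket
import time

# Hypothetical in-memory sink standing in for a real metrics backend.
RECORDED = []


class RpcTimeout(Exception):
    """Raised when retries are exhausted within the overall timeout."""


def record_timeout(fields, elapsed_secs):
    # Assumption: one data point per timed-out call, valued by elapsed time.
    RECORDED.append((dict(fields), elapsed_secs))


def call_with_retries(rpc, call_name, destination_server, script,
                      timeout_secs, delay_secs=0.0, hostname=None,
                      clock=time.monotonic, sleep=time.sleep):
    """Retry rpc() until it succeeds or timeout_secs has elapsed."""
    start = clock()
    retries = 0
    while True:
        try:
            return rpc()
        except Exception as error:
            elapsed = clock() - start
            # Give up if another attempt could not start before the deadline.
            if elapsed + delay_secs >= timeout_secs:
                record_timeout({
                    'hostname': hostname or socket.gethostname(),
                    'destination_server': destination_server,
                    'call': call_name,
                    'script': script,
                    'retries': retries,
                }, elapsed)
                raise RpcTimeout(call_name) from error
            retries += 1
            sleep(delay_secs)
```

The clock and sleep functions are parameters only so the loop can be exercised without waiting in real time.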
,
May 19 2017
,
May 19 2017
This is actually different from Issue 695539. This is about metrics in RetryingAFE, from the client.
,
May 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3e41e80a1be59303a9d469e3ef43184c23485aa8

commit 3e41e80a1be59303a9d469e3ef43184c23485aa8
Author: xixuan <xixuan@chromium.org>
Date: Mon May 22 20:09:34 2017

autotest: add metrics from rpc client side for timeout RPCs.

BUG= chromium:724529
TEST=local run autotest, make an RPC call timeout, print the fields. Ran unittest.

Change-Id: If75643800bcd75d993d13a68fe14e8427f68e7ba
Reviewed-on: https://chromium-review.googlesource.com/509931
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/3e41e80a1be59303a9d469e3ef43184c23485aa8/server/cros/dynamic_suite/frontend_wrappers.py
,
May 23 2017
I think this is in prod now? However, I don't see any metrics yet. I'd be surprised if we didn't have a single RPC timeout failure in the last 2 hours.
,
May 25 2017
Metric appears to be live: http://shortn/_kcqTQ9Chva
destination_server is not what I expect. Shouldn't it be something like chromeos-server36 or the master?
,
May 25 2017
Glad the metric is live; I was getting worried after not seeing anything for several days. Does that mean we had several days with no AFE timeouts? A destination_server of localhost makes sense to me: these are autoserv jobs on the shard making shard AFE calls. The port # seems superfluous though; I have a CL to remove it.
,
May 25 2017
It would be useful to have a precomputation that rewrites localhost -> host_name, because if we're trying to measure overall timeouts hitting a particular server, the server's real name is what matters.
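A rough sketch of that rewrite, done on the client before the field is set. `resolve_destination` and `LOCAL_ALIASES` are hypothetical names, not part of autotest, and the comment above may have had a precomputation on the monitoring side in mind instead.

```python
import socket

# Assumption: these spellings all mean "the machine making the call".
LOCAL_ALIASES = frozenset(['localhost', '127.0.0.1', '::1'])


def resolve_destination(destination_server, host_name=None):
    """Return host_name in place of a local alias, else the input unchanged."""
    if destination_server in LOCAL_ALIASES:
        # Fall back to this machine's own name when none is supplied.
        return host_name or socket.gethostname()
    return destination_server
```

With this, a shard reporting localhost would report its own host name, so timeouts could be grouped by the server that was actually hit.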
,
May 25 2017
The only callers to a shard will be from that shard, so I think we are safe to use host_name for any metric where we're interested in shard afe performance. But I see your point otherwise.
,
May 25 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b5f1c11c67b798972ab5a1b417900c008be2034e

commit b5f1c11c67b798972ab5a1b417900c008be2034e
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu May 25 19:25:25 2017

autotest: drop port # from retrying_afe destination_server stats

Local calls are having destination logged with hostname and port number. This has the potential to cause a too-many-streams problem.

BUG= chromium:724529
TEST=None

Change-Id: I65203da04c99c5bc8ac9f1dc000a243db67cab78
Reviewed-on: https://chromium-review.googlesource.com/515046
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/b5f1c11c67b798972ab5a1b417900c008be2034e/server/cros/dynamic_suite/frontend_wrappers.py
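For illustration only, one way to drop the port before the value is used as a metric field, so that each distinct port does not create its own stream. `strip_port` is a hypothetical helper; the actual change in frontend_wrappers.py may do this differently.

```python
def strip_port(destination_server):
    """Return destination_server without a trailing ':<port>', if present."""
    host, sep, port = destination_server.rpartition(':')
    if not sep or not host or not port.isdigit():
        return destination_server
    # Leave bare IPv6 literals such as '::1' alone; only a bracketed
    # IPv6 address can carry an unambiguous port suffix.
    if ':' in host and not (host.startswith('[') and host.endswith(']')):
        return destination_server
    return host
```

Inputs without a numeric port suffix pass through unchanged.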
,
Jun 1 2017
,
Aug 1 2017
,
Jan 22 2018
Comment 1 by xixuan@chromium.org
, May 19 2017