Issue 724529

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Blocking:
issue 723645



Add metric for when RPC RetryingAFE call times out

Project Member Reported by davidri...@chromium.org, May 19 2017

Issue description

Somewhere around the RetryingAFE run call in cros/dynamic_suite/frontend_wrappers.py, add a metric that is emitted whenever an AFE RPC call times out. (See issue 723645 for examples and stack traces of the failures we want metrics for.)

Ideally there would be fields for:
- hostname generating rpc
- destination rpc server
- call being made (e.g. get_hosts)
- script generating RPC
- count of retries (if possible?)

The metric's value should be how long the timeout was before the call failed.
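The request above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that landed in frontend_wrappers.py: TimeoutMetric is a hypothetical in-memory stand-in for a real metrics client, and call_with_retry, its parameters, and the field names are made up for the example.

```python
import socket
import time


class TimeoutMetric(object):
    """Hypothetical in-memory stand-in for a real metrics client."""

    def __init__(self):
        self.records = []

    def record(self, value, fields):
        # Store a copy so later mutation of `fields` cannot alter the record.
        self.records.append((value, dict(fields)))


def call_with_retry(rpc, call_name, server, timeout_secs, delay_secs, metric,
                    script='unknown', clock=time.time, sleep=time.sleep):
    """Call rpc() until it succeeds or timeout_secs has elapsed.

    On timeout, record one metric point whose value is the configured
    timeout, then re-raise the last error. All names here are assumptions.
    """
    start = clock()
    retries = 0
    while True:
        try:
            return rpc()
        except Exception:
            if clock() - start >= timeout_secs:
                metric.record(timeout_secs, {
                    'hostname': socket.gethostname(),  # host generating the RPC
                    'destination_server': server,      # RPC server being called
                    'call': call_name,                 # e.g. 'get_hosts'
                    'script': script,                  # script generating the RPC
                    'retries': retries,                # retries before giving up
                })
                raise
            retries += 1
            sleep(delay_secs)
```

The clock and sleep parameters exist only so the retry loop can be exercised without real waiting.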
 

Comment 1 by xixuan@chromium.org, May 19 2017

Similar to issue 695539.
Owner: xixuan@chromium.org
This is actually different from Issue 695539. This is about metrics in RetryingAFE, from the client.
Project Member

Comment 4 by bugdroid1@chromium.org, May 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3e41e80a1be59303a9d469e3ef43184c23485aa8

commit 3e41e80a1be59303a9d469e3ef43184c23485aa8
Author: xixuan <xixuan@chromium.org>
Date: Mon May 22 20:09:34 2017

autotest: add metrics from rpc client side for timeout RPCs.

BUG= chromium:724529 
TEST=local run autotest, make an RPC call timeout, print the fields.
Ran unittest.

Change-Id: If75643800bcd75d993d13a68fe14e8427f68e7ba
Reviewed-on: https://chromium-review.googlesource.com/509931
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/3e41e80a1be59303a9d469e3ef43184c23485aa8/server/cros/dynamic_suite/frontend_wrappers.py

I think this is in prod now?

However, I don't see any metrics yet. I'd be surprised if we didn't have a single rpc timeout failure in the last 2 hours.
Metric appears to be live: http://shortn/_kcqTQ9Chva

destination_server is not what I expect.  Shouldn't it be something like chromeos-server36 or the master?
Glad the metric is live, I was getting worried by not seeing anything for several days. Does that mean we had several days with no afe timeouts?

A destination_server of localhost makes sense to me: these are autoserv jobs on the shard making shard AFE calls. The port number seems superfluous, though; I have a CL to remove it.
It would be useful to have a precomputation which changes localhost -> host_name, because if we're trying to measure overall timeouts hitting a particular server, that's what's important.
The only callers to a shard will be from that shard, so I think we are safe to use host_name for any metric where we're interested in shard afe performance.

But I see your point otherwise.
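For illustration, the two ideas discussed here (dropping the port and mapping localhost to the reporting host's name) could look something like the sketch below. The function name and its parameters are hypothetical, and it assumes the server is given as 'host' or 'host:port'; URLs and IPv6 literals are not handled.

```python
import socket


def normalize_destination(server, local_hostname=None):
    """Return a low-cardinality destination name for metric fields.

    Assumes `server` looks like 'host' or 'host:port'.
    """
    # Drop the port so each (host, port) pair does not become its own stream.
    host = server.split(':', 1)[0]
    # A local call targets this machine, so report this machine's hostname.
    if host in ('localhost', '127.0.0.1'):
        host = local_hostname or socket.gethostname()
    return host
```

Passing local_hostname explicitly keeps the function deterministic in tests; when it is omitted, socket.gethostname() supplies the name.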


Project Member

Comment 10 by bugdroid1@chromium.org, May 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b5f1c11c67b798972ab5a1b417900c008be2034e

commit b5f1c11c67b798972ab5a1b417900c008be2034e
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu May 25 19:25:25 2017

autotest: drop port # from retrying_afe destination_server stats

Local calls are having destination logged with hostname and port
number. This has the potential to cause a too-many-streams problem.

BUG= chromium:724529 
TEST=None

Change-Id: I65203da04c99c5bc8ac9f1dc000a243db67cab78
Reviewed-on: https://chromium-review.googlesource.com/515046
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/b5f1c11c67b798972ab5a1b417900c008be2034e/server/cros/dynamic_suite/frontend_wrappers.py

Status: Fixed (was: Untriaged)
Labels: VerifyIn-61

Comment 13 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
