
Issue 596527

Starred by 2 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug
OKR

Blocking:
issue 588284




Autotest RPC server load balancing

Project Member Reported by dshi@chromium.org, Mar 21 2016

Issue description

We are seeing repeated test failures related to RPC server load. For example:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/57410232-chromeos-test/chromeos2-row24-rack9-host3/debug/

and bug 594559.

We have tried the following:
1. Dedicated RPC servers for different services, e.g., shard, GoldenEye, suite scheduler, etc.
2. Using RetryingAFE in tests (CL 332823).

However, all test jobs still use cautotest as their only RPC server. We should have a way for a test to use the RPC server with the least load. One way to do this is to use Consul.

We now have the Consul service installed on all lab servers. All we need to add is a Consul check that counts the Apache processes on each RPC server, and a utility that queries Consul for the RPC server with the least load.
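The check side could look roughly like the sketch below. It is only an illustration under stated assumptions: the process name (apache2), the thresholds, and the "apache_processes=N" output format are all hypothetical and do not come from this issue. The one part it relies on from Consul itself is the script-check convention that exit code 0 means passing, 1 means warning, and any other code means critical.

```python
import subprocess

# Hypothetical thresholds; the issue does not specify any.
WARN_THRESHOLD = 100
CRIT_THRESHOLD = 200


def count_processes(name):
    """Count running processes with exactly this name, using pgrep."""
    result = subprocess.run(
        ["pgrep", "-c", "-x", name],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    out = result.stdout.strip()
    # Treat missing or unexpected output as zero matching processes.
    return int(out) if out.isdigit() else 0


def exit_code_for(count, warn=WARN_THRESHOLD, crit=CRIT_THRESHOLD):
    """Map a process count to a Consul script-check exit code.

    0 = passing, 1 = warning, 2 = critical.
    """
    if count >= crit:
        return 2
    if count >= warn:
        return 1
    return 0


def main():
    """Print the count (it becomes the check output) and return the status."""
    count = count_processes("apache2")  # assumed process name
    print("apache_processes=%d" % count)
    return exit_code_for(count)
```

A wrapper script registered as the Consul check would end with sys.exit(main()), so that the printed line is stored as the check output and the exit code sets the check status.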

The test jobs (server side) run on drones, and all drones are in the corp network, so Consul does not run into the network issue that we have with devservers.
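A minimal sketch of the query utility a drone might use is shown below. It assumes the input is the parsed JSON list that Consul's health endpoint (/v1/health/service/<service>) returns, where each entry carries a "Node" object and a list of "Checks". The check name "apache-process-count" and the "apache_processes=N" output format are hypothetical, as is falling back to cautotest when no load data is available.

```python
import re

# Assumed format of the check output, e.g. "apache_processes=42".
LOAD_PATTERN = re.compile(r"apache_processes=(\d+)")


def load_from_entry(entry, check_name="apache-process-count"):
    """Return the reported process count for one entry, or None.

    Entries whose load check is critical or unparsable are skipped.
    """
    for check in entry.get("Checks", []):
        if check.get("Name") != check_name:
            continue
        if check.get("Status") == "critical":
            return None
        match = LOAD_PATTERN.search(check.get("Output", ""))
        if match:
            return int(match.group(1))
    return None


def least_loaded_server(entries, default="cautotest"):
    """Pick the node name with the lowest reported load.

    Falls back to `default` when no entry reports a usable load.
    """
    best_host, best_load = default, None
    for entry in entries:
        load = load_from_entry(entry)
        if load is None:
            continue
        if best_load is None or load < best_load:
            best_host, best_load = entry["Node"]["Node"], load
    return best_host
```

The caller would fetch the entries from its local Consul agent over HTTP and pass the decoded list in. Keeping the selection logic separate from the network call makes it easy to test with canned data.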

 

Comment 1 by dshi@chromium.org, Mar 21 2016

Blocking: 594559

Comment 2 by dshi@chromium.org, Mar 21 2016

Blocking: -594559
Cc: jrbarnette@chromium.org

Comment 3 by dshi@chromium.org, Mar 21 2016

Summary: Autotest RPC server load balancing (was: Add a mechanism to choose an RPC server with least load)

Comment 4 by dshi@chromium.org, Mar 21 2016

Cc: fdeng@chromium.org dshi@chromium.org shuqianz@chromium.org
Issue 572991 has been merged into this issue.

Comment 5 by autumn@chromium.org, Mar 28 2016

Labels: okr
Blocking: 588284
Owner: dshi@chromium.org
Status: Assigned (was: Available)
Could you please triage this?

Comment 7 by dshi@chromium.org, Apr 4 2016

Labels: -Pri-3 Pri-2
We have a plan to implement this. A design doc will be up for review soon.

Comment 8 by benhenry@google.com, Apr 26 2016

Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS
Cc: pprabhu@chromium.org
Labels: -okr OKR
If this is causing us regular grief (is it?), it might be worth doing this as an okr as a short term measure. However, this work will be made obsolete by skylab.
Any way to get a metric about rpc server load into monarch? (Is there already one?)
There isn't any ts_mon configured for the rpc server yet.
I did a quick check on last night's canary failures. It looks like the run_suite calls that determine the result of a suite (with the -m XXXXX option) timed out at the RPC layer, even though the suite itself completed successfully. I checked that the timeout didn't happen in the swarming layer. So, my best guess at present is an RPC load issue. It'd be very helpful to get some metrics from the RPC layer.
Ignore #13. Last night's woes were caused by a server going down, not by the RPC layer.

Comment 15 by dshi@chromium.org, Oct 6 2017

Owner: ----
Status: WontFix (was: Assigned)
Likely obsoleted by skylab work
