Autotest RPC server load balancing
Issue description

We are seeing repeated test failures related to RPC load. For example: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/57410232-chromeos-test/chromeos2-row24-rack9-host3/debug/ and bug 594559.

We tried the following:
1. Dedicated RPC servers for different services, e.g., shard, GoldenEye, suite scheduler, etc.
2. Using RetryingAFE in tests (CL 332823).

However, all test jobs still use cautotest as the only RPC server. We should have a way for a test to use the RPC server with the least load.

One way to do this is to use Consul. The Consul service is now installed on all lab servers. All we need to add is a Consul check that counts the Apache processes on each RPC server, and a utility that queries Consul for the RPC server with the least load. The server side of the test jobs runs on drones, and all drones are in the corp network, so Consul does not have the network issue seen with devservers.
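
A minimal sketch of what the proposed query utility could look like, assuming each RPC server registers a Consul service whose health check prints the Apache process count as a bare integer. The service name `autotest-rpc`, the local agent URL, and the function names are hypothetical and not taken from the Autotest codebase. Only Consul's `/v1/health/service/<name>` endpoint and its `Node`/`Checks` response fields are existing API.

```python
import json
import urllib.request

# Hypothetical values for illustration; not from the Autotest codebase.
CONSUL_URL = "http://localhost:8500"
RPC_SERVICE = "autotest-rpc"
FALLBACK_SERVER = "cautotest"


def fetch_service_entries(service=RPC_SERVICE, consul_url=CONSUL_URL):
    """Return the parsed JSON list from Consul's health endpoint for a service."""
    url = "%s/v1/health/service/%s" % (consul_url, service)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def apache_process_count(entry):
    """Return the Apache process count reported for one entry, or None.

    Assumes the load check writes a bare integer to its Output field.
    Entries with any critical check are treated as unusable.
    """
    checks = entry.get("Checks", [])
    if any(check.get("Status") == "critical" for check in checks):
        return None
    for check in checks:
        try:
            return int((check.get("Output") or "").strip())
        except ValueError:
            # Not the load check (e.g. the agent liveness check); keep looking.
            continue
    return None


def pick_least_loaded(entries, fallback=FALLBACK_SERVER):
    """Return the node name with the lowest process count, else the fallback."""
    best_host, best_count = fallback, None
    for entry in entries:
        count = apache_process_count(entry)
        if count is None:
            continue
        if best_count is None or count < best_count:
            best_host, best_count = entry["Node"]["Node"], count
    return best_host
```

A drone-side caller could use `pick_least_loaded(fetch_service_entries())` and would still get `cautotest` back when no server reports a usable count, so the current behavior remains the default.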
,
Mar 21 2016
,
Mar 21 2016
,
Mar 21 2016
Issue 572991 has been merged into this issue.
,
Mar 28 2016
,
Apr 4 2016
Could you please triage this?
,
Apr 4 2016
We have a plan to implement this. A design doc will be up for review soon.
,
Apr 26 2016
,
May 17 2016
,
Dec 6 2016
If this is causing us regular grief (is it?), it might be worth doing this as an OKR as a short-term measure. However, this work will be made obsolete by Skylab.
,
Dec 6 2016
Is there any way to get a metric about RPC server load into Monarch? (Is there already one?)
,
Dec 6 2016
There isn't any ts_mon configured for the RPC server yet.
,
Dec 7 2016
I did a quick check on last night's canary failures. It looks like the run_suite calls that determine the result of a suite (with the -m XXXXX option) timed out at the RPC layer, even though the suite itself completed successfully. I checked that the timeout didn't happen in the swarming layer. So my best guess at present is an RPC load issue. It would be very helpful to get some metrics from the RPC layer.
,
Dec 7 2016
Ignore #13. Last night's woes were caused by a server going down, not by the RPC layer.
,
Oct 6 2017
Likely obsoleted by Skylab work.
Comment 1 by dshi@chromium.org
, Mar 21 2016