Suite scheduler created too many suites causing lab load
Issue description
We suspect the suite scheduler created ~800 new suite jobs last night, which is far too many. They used up all of pool:suites and left many long-running processes on the Drone. Dshi is investigating.
,
Mar 3 2016
Issue 591541 has been merged into this issue.
,
Mar 3 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8446fac5f239fb4675f359de9322535fcaa68c30

commit 8446fac5f239fb4675f359de9322535fcaa68c30
Author: Dan Shi <dshi@google.com>
Date: Thu Mar 03 06:07:39 2016

[autotest] Optimize delay_minutes setting in suite scheduler.

We are seeing some odd behavior of suite scheduler that it schedule suites for the same board repeatedly while failed to schedule suites for some other boards. This may be related to some multithreading issue. Though I haven't found the root cause, there are several issues we can improve.

BUG= chromium:591538
TEST=unittest, local suite schedule run.

Change-Id: If79f518df38a1bdf91968c5e7eefa3640dfde27a
Reviewed-on: https://chromium-review.googlesource.com/330168
Reviewed-by: Dan Shi <dshi@google.com>
Commit-Queue: Dan Shi <dshi@google.com>
Trybot-Ready: Dan Shi <dshi@google.com>
Tested-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/8446fac5f239fb4675f359de9322535fcaa68c30/site_utils/suite_scheduler/deduping_scheduler.py
[modify] https://crrev.com/8446fac5f239fb4675f359de9322535fcaa68c30/site_utils/suite_scheduler/task.py
[modify] https://crrev.com/8446fac5f239fb4675f359de9322535fcaa68c30/site_utils/suite_scheduler/driver.py
,
Mar 4 2016
I have a theory about the cause of this issue. The duplicated suites all appear to be identical. It seems that the create_suite_job RPC reached the server and the RPC server created the suite, but the socket timed out, so the client thought the call had failed. The client then retried, and eventually gave up (after 10 min). This would be related to the RPC server's load. We might also need to increase the socket timeout for the create_suite_job call. I'm doing some tests on the suite scheduler server.
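To illustrate the theory, here is a minimal simulation of how a retry after a client-side timeout can create duplicate jobs. It is only a sketch: `FakeRpcServer` and `call_with_retry` are hypothetical names made up for this example, not the autotest AFE or its retry code, and the built-in `TimeoutError` stands in for the `socket.timeout` the client would actually see.

```python
class FakeRpcServer:
    """Hypothetical stand-in for an overloaded RPC server (not the real AFE)."""

    def __init__(self, slow_calls):
        self.created_jobs = []
        # Number of calls that finish on the server only after the
        # client has already given up waiting.
        self._slow_calls = slow_calls

    def create_suite_job(self, suite, board):
        # The server-side work always succeeds...
        self.created_jobs.append((suite, board))
        if self._slow_calls > 0:
            self._slow_calls -= 1
            # ...but the client never sees the reply in time.
            raise TimeoutError('client socket timed out')
        return len(self.created_jobs)


def call_with_retry(func, max_attempts, *args):
    """Retry func on timeout; the caller cannot tell whether it took effect."""
    for _ in range(max_attempts):
        try:
            return func(*args)
        except TimeoutError:
            continue
    raise TimeoutError('gave up after %d attempts' % max_attempts)


server = FakeRpcServer(slow_calls=3)
job_id = call_with_retry(server.create_suite_job, 5, 'some_suite', 'some_board')
# One logical request, but every timed-out attempt also created a job.
print(len(server.created_jobs), 'identical jobs created')
```

Each attempt that times out on the client still leaves a job behind on the server, which matches the observation that the duplicated suites are identical.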
,
Mar 4 2016
We have a dedicated RPC server for the suite scheduler: chromeos-server30.cbf.corp.google.com. If the theory in #4 is correct, then either the DB was overloaded or the network flaked around that time.
,
Mar 8 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f770c456af6aec782e5fe2bdfe2b253a5b901f86

commit f770c456af6aec782e5fe2bdfe2b253a5b901f86
Author: Dan Shi <dshi@google.com>
Date: Fri Mar 04 20:51:18 2016

[autotest] Allow suite scheduler to set time out for RPC.

When an RPC server is under heavy load, long RPC calls like create_suite_job might time out. However the call might still successfully create suite job even though the caller received an socket.timeout exception and retry the call. That leads to multiple suites be created.

This change allows the create_suite_job call to pass in a minimum value of timeout, to reduce the flake of the RPC. The same change is applied to get_jobs and get_hostnames call, which may take longer than the default 6s timeout.

BUG= chromium:591538
TEST=local suite schedule run, unittest, local AFE

Change-Id: If7b888fe6ca80f5c7e705026e06b883cef0bfdc4
Reviewed-on: https://chromium-review.googlesource.com/330463
Commit-Ready: Dan Shi <dshi@chromium.org>
Tested-by: Dan Shi <dshi@google.com>
Reviewed-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/f770c456af6aec782e5fe2bdfe2b253a5b901f86/site_utils/suite_scheduler/deduping_scheduler.py
[modify] https://crrev.com/f770c456af6aec782e5fe2bdfe2b253a5b901f86/site_utils/suite_scheduler/deduping_scheduler_unittest.py
[modify] https://crrev.com/f770c456af6aec782e5fe2bdfe2b253a5b901f86/frontend/afe/json_rpc/proxy.py
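The idea of passing in a "minimum value of timeout" can be sketched roughly as below. This is an assumption about the general approach, not the code in proxy.py, which is not shown in this thread: `effective_timeout` and `DEFAULT_TIMEOUT_SECONDS` are hypothetical names, and only the 6s default comes from the commit message.

```python
DEFAULT_TIMEOUT_SECONDS = 6  # the "default 6s timeout" from the commit message


def effective_timeout(current_timeout, min_timeout=None):
    """Return the timeout to use for one call (hypothetical helper).

    current_timeout: the timeout already in effect, or None for "no timeout".
    min_timeout: a floor requested by the caller for slow RPCs, or None.
    """
    if min_timeout is None:
        return current_timeout
    if current_timeout is None:
        # Already waiting indefinitely; a floor cannot lengthen that.
        return None
    # Never shorten an existing timeout, only raise it to the floor.
    return max(current_timeout, min_timeout)


# A slow call such as create_suite_job could ask for a larger floor,
# while ordinary calls keep the default.
print(effective_timeout(DEFAULT_TIMEOUT_SECONDS, min_timeout=600))
print(effective_timeout(DEFAULT_TIMEOUT_SECONDS))
```

Treating the value as a floor rather than an override means a caller that already runs with a longer timeout is not accidentally shortened. A longer timeout only reduces the chance of a spurious retry; it does not by itself make the call safe to repeat.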
,
Mar 11 2016
Comment 1 by sbasi@chromium.org
, Mar 2 2016