
Issue 591538

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Suite scheduler created too many suites causing lab load

Project Member Reported by sbasi@chromium.org, Mar 2 2016

Issue description

We suspect that the suite scheduler created ~800 new suite jobs last night, which is far too many. They consumed all of pool:suites and left many long-running processes on the Drone.

Dshi is investigating.
 

Comment 2 by dshi@chromium.org, Mar 3 2016

Cc: cychiang@chromium.org waihong@chromium.org sbasi@chromium.org
Issue 591541 has been merged into this issue.

Comment 3 by bugdroid1@chromium.org, Mar 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8446fac5f239fb4675f359de9322535fcaa68c30

commit 8446fac5f239fb4675f359de9322535fcaa68c30
Author: Dan Shi <dshi@google.com>
Date: Thu Mar 03 06:07:39 2016

[autotest] Optimize delay_minutes setting in suite scheduler.

We are seeing some odd behavior in the suite scheduler: it schedules
suites for the same board repeatedly while failing to schedule suites
for some other boards. This may be related to a multithreading issue.
Though I haven't found the root cause, there are several things we can
improve.

BUG= chromium:591538 
TEST=unittest, local suite schedule run.

Change-Id: If79f518df38a1bdf91968c5e7eefa3640dfde27a
Reviewed-on: https://chromium-review.googlesource.com/330168
Reviewed-by: Dan Shi <dshi@google.com>
Commit-Queue: Dan Shi <dshi@google.com>
Trybot-Ready: Dan Shi <dshi@google.com>
Tested-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/8446fac5f239fb4675f359de9322535fcaa68c30/site_utils/suite_scheduler/deduping_scheduler.py
[modify] https://crrev.com/8446fac5f239fb4675f359de9322535fcaa68c30/site_utils/suite_scheduler/task.py
[modify] https://crrev.com/8446fac5f239fb4675f359de9322535fcaa68c30/site_utils/suite_scheduler/driver.py

Comment 4 by dshi@chromium.org, Mar 4 2016

I have a theory about the cause of this issue. The duplicated suites all appear to be identical. It seems that the create_suite_job RPC reached the server and the RPC server created the suite, but the socket timed out and the client thought the call had failed. The client then retried, and eventually timed out (after 10 min).

This is related to the RPC server's load. We might also need to increase the socket timeout for the create_suite_job call. I'm doing some tests on the suite scheduler server.
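
To illustrate the theory above, here is a minimal simulation of how retrying on a socket timeout can duplicate a non-idempotent call. FakeRpcServer and create_with_retry are hypothetical names made up for this sketch; they are not the autotest RPC client or the suite scheduler's retry logic.

```python
import socket


class FakeRpcServer:
    """Hypothetical stand-in for an overloaded RPC server (not autotest code)."""

    def __init__(self, slow_calls):
        self.jobs = []
        # Number of initial calls whose reply arrives too late for the client.
        self._slow_calls = slow_calls

    def create_suite_job(self, board, suite):
        # The server-side work succeeds first ...
        self.jobs.append((board, suite))
        if self._slow_calls > 0:
            self._slow_calls -= 1
            # ... but the client gives up before it sees the reply.
            raise socket.timeout("timed out")
        return len(self.jobs)


def create_with_retry(server, board, suite, max_attempts=3):
    """Retry on timeout, assuming (wrongly) that a timeout means no job exists."""
    for _ in range(max_attempts):
        try:
            return server.create_suite_job(board, suite)
        except socket.timeout:
            continue
    raise socket.timeout("all attempts timed out")


server = FakeRpcServer(slow_calls=2)
job_id = create_with_retry(server, "some-board", "some-suite")
print(len(server.jobs))  # 3 identical jobs for one logical request
```

The client asked for one suite, but every timed-out attempt had already created a job on the server, so the retries produced identical duplicates.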

Comment 5 by fdeng@chromium.org, Mar 4 2016

We have a dedicated RPC server for the suite scheduler:
  chromeos-server30.cbf.corp.google.com
If the theory in #4 is correct, then either the DB was overloaded or the network flaked around that time.


Comment 6 by bugdroid1@chromium.org, Mar 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f770c456af6aec782e5fe2bdfe2b253a5b901f86

commit f770c456af6aec782e5fe2bdfe2b253a5b901f86
Author: Dan Shi <dshi@google.com>
Date: Fri Mar 04 20:51:18 2016

[autotest] Allow suite scheduler to set time out for RPC.

When an RPC server is under heavy load, long RPC calls like
create_suite_job might time out. However, the call might still
successfully create the suite job even though the caller received a
socket.timeout exception and retried the call. That leads to multiple
suites being created.

This change allows the create_suite_job call to pass in a minimum
timeout value, to reduce the flakiness of the RPC. The same change is
applied to the get_jobs and get_hostnames calls, which may take longer
than the default 6s timeout.

BUG= chromium:591538 
TEST=local suite schedule run, unittest, local AFE

Change-Id: If7b888fe6ca80f5c7e705026e06b883cef0bfdc4
Reviewed-on: https://chromium-review.googlesource.com/330463
Commit-Ready: Dan Shi <dshi@chromium.org>
Tested-by: Dan Shi <dshi@google.com>
Reviewed-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/f770c456af6aec782e5fe2bdfe2b253a5b901f86/site_utils/suite_scheduler/deduping_scheduler.py
[modify] https://crrev.com/f770c456af6aec782e5fe2bdfe2b253a5b901f86/site_utils/suite_scheduler/deduping_scheduler_unittest.py
[modify] https://crrev.com/f770c456af6aec782e5fe2bdfe2b253a5b901f86/frontend/afe/json_rpc/proxy.py
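
The idea in the commit above, enforcing a minimum timeout around a slow call, could be sketched roughly as follows. This is an assumption-laden illustration and not the actual change to proxy.py: min_socket_timeout is a hypothetical helper, the 180-second value is invented for the example, and only the 6s default comes from the commit message.

```python
import contextlib
import socket


@contextlib.contextmanager
def min_socket_timeout(min_seconds):
    """Hypothetical helper: raise the default socket timeout to a minimum."""
    old = socket.getdefaulttimeout()
    # None means "no timeout", which already satisfies any minimum.
    if old is not None and old < min_seconds:
        socket.setdefaulttimeout(min_seconds)
    try:
        yield
    finally:
        # Always restore the previous default, even if the call raised.
        socket.setdefaulttimeout(old)


socket.setdefaulttimeout(6)      # the default mentioned in the commit message
with min_socket_timeout(180):    # 180 is an illustrative value only
    inside = socket.getdefaulttimeout()
    # A long call such as create_suite_job would be issued here.
after = socket.getdefaulttimeout()
```

A longer timeout only makes a spurious timeout less likely. It does not make the call idempotent, so a retry after a real timeout could still create a duplicate suite.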

Comment 7 by dshi@chromium.org, Mar 11 2016

Status: Fixed (was: Assigned)
