New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 718181 link

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Make drone_utility calls faster

Project Member Reported by pprabhu@chromium.org, May 3 2017

Issue description

In each monitor_db tick, we queue up calls to be run on the drones.

This happens twice:
- Once for drone refresh (to find out what all the launched autoserv processes are up to)
- Once at the end of the tick (to create new autoserv processes).

To parallelize this step, we use one thread per drone:
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/thread_lib.py?l=127

This does not help shards, where there is just one drone (localhost).

Instead, we should create a fixed number of threads (maybe min(100, #of drones)) and directly parallelize the SiteDrone.execute_queued_calls (http://shortn/_F0ieBTST20).

This way, shards will also be benefited by this.

By some analysis, this step takes 40 - 60% of the times on some of the heavily loaded drone: https://bugs.chromium.org/p/chromium/issues/detail?id=718130#c14

-----------------
+ junta who were interested in figuring out if any of the monitor_db:tick steps could be parallelized better.
+ ayatane, FYI. This is related to / conflicts with the monitor_db refactoring you may be doing soon.
 

Comment 1 by aut...@google.com, May 5 2017

Owner: akes...@chromium.org
+ aviv to set up a meeting for next steps here
A meeting sounds hard given that pprabhu is on the opposite side of the earth.

This does seem like a good idea, I'll shop around for somebody to investigate.

Comment 3 by aut...@google.com, May 9 2017

Labels: -current-issue
Owner: pprabhu@chromium.org
Status: Assigned (was: Available)
ayatane@ noticed that we may not be using threaded drone queue, in which case trigger_refresh would be synchronous.

Before doing anything else, we should correct this. This won't help super-much since the "asynchronous" drone refresh still only works in concurrent with a few other steps of the tick (i.e., I'd expect gains still from making execute_queued_calls multithreaded), but this is a super low hanging fruit.

Adding to my plate to just explore this first option.
Project Member

Comment 5 by bugdroid1@chromium.org, Jun 1 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/be0fea17b7ded897104a0faaaab8d8836a8f3347

commit be0fea17b7ded897104a0faaaab8d8836a8f3347
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 01 01:31:18 2017

Project Member

Comment 6 by bugdroid1@chromium.org, Jun 1 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/be0fea17b7ded897104a0faaaab8d8836a8f3347

commit be0fea17b7ded897104a0faaaab8d8836a8f3347
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 01 01:31:18 2017

Status: Started (was: Assigned)
#6 turned on threaded_drone_manager on one of the shards.
I don't see marked tick rate difference: http://shortn/_G1NpJrTaJT
But the time is now evenly spread across sync_refresh and trigger_refresh: http://shortn/_YLKYHozNGw

I'll do it for all shards now, since we haven't taken a hit wrt tick time, and using a threaded_drone_manager is the first step towards making drone calls more concurrent.
Project Member

Comment 8 by bugdroid1@chromium.org, Jun 1 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e4ea56fdc6a1f85138d173ea3ec89f377680bcc0

commit e4ea56fdc6a1f85138d173ea3ec89f377680bcc0
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 01 18:39:40 2017

Project Member

Comment 9 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d4db8fffb61d720b30417d706df62e51319ebcdc

commit d4db8fffb61d720b30417d706df62e51319ebcdc
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 08 06:45:25 2017

[autotest] Drop {Base,Site}DroneUtility

'cause we're all one.

BUG= chromium:718181 
TEST=None

Change-Id: I0b5707e4e30574e14be1cdc3874882e001062823
Reviewed-on: https://chromium-review.googlesource.com/524268
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/d4db8fffb61d720b30417d706df62e51319ebcdc/scheduler/drone_utility.py

Project Member

Comment 10 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/01e96da131fb191f9b3c4c066266e944638945ad

commit 01e96da131fb191f9b3c4c066266e944638945ad
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 08 06:45:25 2017

[autotest] Make some staticmethods free functions.

This is part of a small drone_utility refactor to pull out drone process
refresh logic and parallelize it.

BUG= chromium:718181 
TEST=unittests, local AFE run, moblab_RunSuite

Change-Id: Ib41c93c64b437bff09a119097e3bad14f3b3574e
Reviewed-on: https://chromium-review.googlesource.com/524269
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/01e96da131fb191f9b3c4c066266e944638945ad/scheduler/drone_utility.py
[modify] https://crrev.com/01e96da131fb191f9b3c4c066266e944638945ad/scheduler/drone_utility_unittest.py

Project Member

Comment 11 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/9ccda9dfa68d83807ec0c4b3c5b55d83c852c69c

commit 9ccda9dfa68d83807ec0c4b3c5b55d83c852c69c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 08 06:45:26 2017

[autotest] Use --no-header with ps to skip header.

Instead of dropping the first line returned by '/bin/ps'.

BUG= chromium:718181 
TEST=(1) Manually verify the --no-headers argument on a drone.
     (2) moblab_RunSuite on a moblab passes.

Change-Id: Idcbb3614f0efb3480ccbd9ef05475658a711fe22
Reviewed-on: https://chromium-review.googlesource.com/524270
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/9ccda9dfa68d83807ec0c4b3c5b55d83c852c69c/scheduler/drone_utility.py

Project Member

Comment 12 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a7d9e34cd1289ea650e4c1bc3632d31128c038ef

commit a7d9e34cd1289ea650e4c1bc3632d31128c038ef
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 08 06:45:26 2017

[autotest] Extract ProcessRefresher from DroneUtility

This prepares us for a refactor of the process refreshing code.
- ProcessRefresher is now a better encapsulated concept. It is
  responsible for refreshing the running process summary but not the
  actual filesystem interaction to obtain this information.
- Also fixes some awkward function design issues that made unittesting
  difficult.
- Improves unittest coverage, to give us confidence in the upcoming
  multiprocess support.

BUG= chromium:718181 
TEST=[1] unittests
     [2] Run a test on local autotest instance
     [3] moblab_RunSuite on a moblab device

Change-Id: I99119b5949cc6bfa3d976a1b9e1bf41ff06ac9a2
Reviewed-on: https://chromium-review.googlesource.com/524271
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/a7d9e34cd1289ea650e4c1bc3632d31128c038ef/scheduler/drone_utility.py
[modify] https://crrev.com/a7d9e34cd1289ea650e4c1bc3632d31128c038ef/scheduler/drone_utility_unittest.py

Project Member

Comment 13 by bugdroid1@chromium.org, Jun 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cfabbdef4cad60c34fae9d09cb079117c764e402

commit cfabbdef4cad60c34fae9d09cb079117c764e402
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Jun 09 22:33:41 2017

[autotest] Use process pool to read pidfiles in drone refresh

We spend a lot of time reading each pidfile corresponding to the running
processes on drones. Use a multiprocessing.Pool to speed this up.

For now, we hide the use of multiprocessing.Pool behind a global_config
flag. This allows us to gently roll this out to shards observing the
impact.

BUG= chromium:718181 
TEST=[1] unittests.
     [2] Run a test on local autotest instance.
     [3] moblab_RunSuite.

Change-Id: I08333453711a07ea3d7b805b44fda8e032a535eb
Reviewed-on: https://chromium-review.googlesource.com/525075
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/cfabbdef4cad60c34fae9d09cb079117c764e402/scheduler/drone_utility.py
[modify] https://crrev.com/cfabbdef4cad60c34fae9d09cb079117c764e402/scheduler/drone_utility_unittest.py

Project Member

Comment 14 by bugdroid1@chromium.org, Jun 14 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/170f2fde398ec35f13d5f5e3ed3c9ad819be7ebd

commit 170f2fde398ec35f13d5f5e3ed3c9ad819be7ebd
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Jun 14 00:47:24 2017

I turned on the experiment on chromeos-server14 but it didn't change anything:

tick rate: http://shortn/_JoxwhKF57d
tick breakdown: http://shortn/_XVUsQihHIj

I haven't parallelized checking of the dark mark yet. Should try that too.
Project Member

Comment 17 by bugdroid1@chromium.org, Jun 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a71d9e8c5deb9bc7b94fa10152097270047b54cb

commit a71d9e8c5deb9bc7b94fa10152097270047b54cb
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Jun 30 23:02:48 2017

[autotest] Use process pools to read dark mark in drone_utility.

This CL uses a multiprocessing.Pool for another slow operation executed
by drone_utility in every scheduler tick.

BUG= chromium:718181 
TEST=[1] unittests
     [2] Test run on local autotest instance.
     [3] moblab_RunSuite.

Change-Id: Ic3eb7856690c0f1c352fdf149bb4e79dcf77bc01
Reviewed-on: https://chromium-review.googlesource.com/525076
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/a71d9e8c5deb9bc7b94fa10152097270047b54cb/scheduler/drone_utility.py
[modify] https://crrev.com/a71d9e8c5deb9bc7b94fa10152097270047b54cb/scheduler/drone_utility_unittest.py

Even #17 didn't help in practice.

So, I dug in in prod.

Guess what, all the time is going in loading drone_utility module everytime we run it :/

Here are the loading time for each of the autotest imports from within drone_utility on my machine: 

pprabhu@pprabhu:files$ python loadit.py 
0(): 0.00
1(error): 0.00
2(global_config): 0.01
3(logging_manager): 0.00
4(utils): 0.05
5(retry): 0.00
6(drone_logging_config): 0.00
7(email_manager): 0.31
8(scheduler_config): 0.00
9(hosts): 0.22
10(subcommand): 0.00


That's pretty much all the time a simple call to drone_utility takes.

On chromeos-server14, where I've been experimenting with parallelized calls, drone_utility calls take ~3 seconds which is ~70% of a single tick time.

Same exercise there gives:
>>> from autotest_lib.scheduler import drone_utility_loading
0(): 0.00
1(error): 0.01
2(global_config): 0.04
3(logging_manager): 0.00
4(utils): 0.16
5(retry): 0.00
6(drone_logging_config): 0.00
7(email_manager): 0.57
8(scheduler_config): 0.01
9(hosts): 0.74
10(subcommand): 0.00

So ~1.5 seconds.
That only accounts for half of the missing 3 seconds each drone_utility call. But there's definitely 3 seconds missing between when drones.py:execute_queued_calls hands off to drone_utility and when drone_utility actually starts doing anything.

(My hunch based on logs, plus pgrep'ing is that drone_utility reexecs once before doing anything useful)
Project Member

Comment 19 by bugdroid1@chromium.org, Jul 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/240b4672a10cc1621f909bd42248bc15d9ca33ab

commit 240b4672a10cc1621f909bd42248bc15d9ca33ab
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 06 04:54:57 2017

[autotest] Delay hosts module import in drone_utility

The hosts module import takes ~.75 seconds. This has been bottlenecking
the scheduler tick by slowing down each drone_utility refresh call by
~1.5 seconds. Delay the import as a hack while we work on a real fix.

BUG= chromium:718181 
BUG= chromium:739466 
TEST=None

Change-Id: I3966259594bc5b2e5e1dd8592c8dcfd9cbda1c6c
Reviewed-on: https://chromium-review.googlesource.com/559938
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/240b4672a10cc1621f909bd42248bc15d9ca33ab/scheduler/drone_utility.py

Project Member

Comment 20 by bugdroid1@chromium.org, Jul 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2284834caadd6f3a5ea53d7d93f3ef54be4bf231

commit 2284834caadd6f3a5ea53d7d93f3ef54be4bf231
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 06 19:12:05 2017

[autotest] Drop email notification from drone_utility

I haven't seen this notification in months. We don't want email
notifications from the lab.
Importing this module adds ~1 second to every single drone_utility call
from the scheduler. So this adds ~2 seconds to every tick in the lab.

BUG= chromium:718181 
TEST=None

Change-Id: I6868271e71a06e1038ab008dbcc3e5f92d4fff61
Reviewed-on: https://chromium-review.googlesource.com/559937
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/2284834caadd6f3a5ea53d7d93f3ef54be4bf231/scheduler/drone_utility.py

#19 and #20 together improved the average and 90%ile tick rate on the shards by 2.5X: https://viceroy.corp.google.com/chromeos/capacity_health?duration=8d#_VG_z14nZKtl

So, parallelization of the refresh call may not have been necessay.

I'm going to turn off the new parallelized code flow on chromeos-server14 and see how it affects the tick there. Then take call to drop the new feature / turn it on for all shards.
Project Member

Comment 22 by bugdroid1@chromium.org, Jul 12 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/15a75589a5157f9cd779d7c090f15a644b644d57

commit 15a75589a5157f9cd779d7c090f15a644b644d57
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Jul 12 21:36:49 2017

Status: Fixed (was: Started)
#22 turned off the new process pools on server14, and it actually helped the tick rate.

https://viceroy.corp.google.com/chromeos/capacity_health?duration=8d&heatmap=False&hostname=chromeos-server14&refresh=-1&topstreams=5#_VG_nQ5cHmXp

So, conclusion: process pools are not needed. My work here is done.
We got 2.5X improvement by making drone_utility.py less stupid.

Summary: Make drone_utility calls faster (was: parallelize SiteDrone.execute_queued_calls)

Comment 25 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Sign in to add a comment