Make drone_utility calls faster |
|||||||
Issue descriptionIn each monitor_db tick, we queue up calls to be run on the drones. This happens twice: - Once for drone refresh (to find out what all the launched autoserv processes are up to) - Once at the end of the tick (to create new autoserv processes). To parallelize this step, we use one thread per drone: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/thread_lib.py?l=127 This does not help shards, where there is just one drone (localhost). Instead, we should create a fixed number of threads (maybe min(100, #of drones)) and directly parallelize the SiteDrone.execute_queued_calls (http://shortn/_F0ieBTST20). This way, shards will also be benefited by this. By some analysis, this step takes 40 - 60% of the times on some of the heavily loaded drone: https://bugs.chromium.org/p/chromium/issues/detail?id=718130#c14 ----------------- + junta who were interested in figuring out if any of the monitor_db:tick steps could be parallelized better. + ayatane, FYI. This is related to / conflicts with the monitor_db refactoring you may be doing soon.
,
May 8 2017
A meeting sounds hard given that pprabhu is on the opposite side of the earth. This does seem like a good idea, I'll shop around for somebody to investigate.
,
May 9 2017
,
May 23 2017
ayatane@ noticed that we may not be using threaded drone queue, in which case trigger_refresh would be synchronous. Before doing anything else, we should correct this. This won't help super-much since the "asynchronous" drone refresh still only works in concurrent with a few other steps of the tick (i.e., I'd expect gains still from making execute_queued_calls multithreaded), but this is a super low hanging fruit. Adding to my plate to just explore this first option.
,
Jun 1 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/be0fea17b7ded897104a0faaaab8d8836a8f3347 commit be0fea17b7ded897104a0faaaab8d8836a8f3347 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jun 01 01:31:18 2017
,
Jun 1 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/be0fea17b7ded897104a0faaaab8d8836a8f3347 commit be0fea17b7ded897104a0faaaab8d8836a8f3347 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jun 01 01:31:18 2017
,
Jun 1 2017
#6 turned on threaded_drone_manager on one of the shards. I don't see marked tick rate difference: http://shortn/_G1NpJrTaJT But the time is now evenly spread across sync_refresh and trigger_refresh: http://shortn/_YLKYHozNGw I'll do it for all shards now, since we haven't taken a hit wrt tick time, and using a threaded_drone_manager is the first step towards making drone calls more concurrent.
,
Jun 1 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e4ea56fdc6a1f85138d173ea3ec89f377680bcc0 commit e4ea56fdc6a1f85138d173ea3ec89f377680bcc0 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jun 01 18:39:40 2017
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d4db8fffb61d720b30417d706df62e51319ebcdc commit d4db8fffb61d720b30417d706df62e51319ebcdc Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jun 08 06:45:25 2017 [autotest] Drop {Base,Site}DroneUtility 'cause we're all one. BUG= chromium:718181 TEST=None Change-Id: I0b5707e4e30574e14be1cdc3874882e001062823 Reviewed-on: https://chromium-review.googlesource.com/524268 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/d4db8fffb61d720b30417d706df62e51319ebcdc/scheduler/drone_utility.py
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/01e96da131fb191f9b3c4c066266e944638945ad commit 01e96da131fb191f9b3c4c066266e944638945ad Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jun 08 06:45:25 2017 [autotest] Make some staticmethods free functions. This is part of a small drone_utility refactor to pull out drone process refresh logic and parallelize it. BUG= chromium:718181 TEST=unittests, local AFE run, moblab_RunSuite Change-Id: Ib41c93c64b437bff09a119097e3bad14f3b3574e Reviewed-on: https://chromium-review.googlesource.com/524269 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> [modify] https://crrev.com/01e96da131fb191f9b3c4c066266e944638945ad/scheduler/drone_utility.py [modify] https://crrev.com/01e96da131fb191f9b3c4c066266e944638945ad/scheduler/drone_utility_unittest.py
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/9ccda9dfa68d83807ec0c4b3c5b55d83c852c69c commit 9ccda9dfa68d83807ec0c4b3c5b55d83c852c69c Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jun 08 06:45:26 2017 [autotest] Use --no-header with ps to skip header. Instead of dropping the first line returned by '/bin/ps'. BUG= chromium:718181 TEST=(1) Manually verify the --no-headers argument on a drone. (2) moblab_RunSuite on a moblab passes. Change-Id: Idcbb3614f0efb3480ccbd9ef05475658a711fe22 Reviewed-on: https://chromium-review.googlesource.com/524270 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/9ccda9dfa68d83807ec0c4b3c5b55d83c852c69c/scheduler/drone_utility.py
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a7d9e34cd1289ea650e4c1bc3632d31128c038ef commit a7d9e34cd1289ea650e4c1bc3632d31128c038ef Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jun 08 06:45:26 2017 [autotest] Extract ProcessRefresher from DroneUtility This prepares us for a refactor of the process refreshing code. - ProcessRefresher is now a better encapsulated concept. It is responsible for refreshing the running process summary but not the actual filesystem interaction to obtain this information. - Also fixes some awkward function design issues that made unittesting difficult. - Improves unittest coverage, to give us confidence in the upcoming multiprocess support. BUG= chromium:718181 TEST=[1] unittests [2] Run a test on local autotest instance [3] moblab_RunSuite on a moblab device Change-Id: I99119b5949cc6bfa3d976a1b9e1bf41ff06ac9a2 Reviewed-on: https://chromium-review.googlesource.com/524271 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/a7d9e34cd1289ea650e4c1bc3632d31128c038ef/scheduler/drone_utility.py [modify] https://crrev.com/a7d9e34cd1289ea650e4c1bc3632d31128c038ef/scheduler/drone_utility_unittest.py
,
Jun 9 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cfabbdef4cad60c34fae9d09cb079117c764e402 commit cfabbdef4cad60c34fae9d09cb079117c764e402 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Fri Jun 09 22:33:41 2017 [autotest] Use process pool to read pidfiles in drone refresh We spend a lot of time reading each pidfile corresponding to the running processes on drones. Use a multiprocessing.Pool to speed this up. For now, we hide the use of multiprocessing.Pool behind a global_config flag. This allows us to gently roll this out to shards observing the impact. BUG= chromium:718181 TEST=[1] unittests. [2] Run a test on local autotest instance. [3] moblab_RunSuite. Change-Id: I08333453711a07ea3d7b805b44fda8e032a535eb Reviewed-on: https://chromium-review.googlesource.com/525075 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/cfabbdef4cad60c34fae9d09cb079117c764e402/scheduler/drone_utility.py [modify] https://crrev.com/cfabbdef4cad60c34fae9d09cb079117c764e402/scheduler/drone_utility_unittest.py
,
Jun 14 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/170f2fde398ec35f13d5f5e3ed3c9ad819be7ebd commit 170f2fde398ec35f13d5f5e3ed3c9ad819be7ebd Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Jun 14 00:47:24 2017
,
Jun 14 2017
I turned on the experiment on chromeos-server14 but it didn't change anything: tick rate: http://shortn/_JoxwhKF57d tick breakdown: http://shortn/_XVUsQihHIj I haven't parallelized checking of the dark mark yet. Should try that too.
,
Jun 29 2017
Re #15: In flight: https://chromium-review.googlesource.com/c/525076
,
Jun 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a71d9e8c5deb9bc7b94fa10152097270047b54cb commit a71d9e8c5deb9bc7b94fa10152097270047b54cb Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Fri Jun 30 23:02:48 2017 [autotest] Use process pools to read dark mark in drone_utility. This CL uses a multiprocessing.Pool for another slow operation executed by drone_utility in every scheduler tick. BUG= chromium:718181 TEST=[1] unittests [2] Test run on local autotest instance. [3] moblab_RunSuite. Change-Id: Ic3eb7856690c0f1c352fdf149bb4e79dcf77bc01 Reviewed-on: https://chromium-review.googlesource.com/525076 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/a71d9e8c5deb9bc7b94fa10152097270047b54cb/scheduler/drone_utility.py [modify] https://crrev.com/a71d9e8c5deb9bc7b94fa10152097270047b54cb/scheduler/drone_utility_unittest.py
,
Jul 5 2017
Even #17 didn't help in practice. So, I dug in in prod. Guess what, all the time is going in loading drone_utility module everytime we run it :/ Here are the loading time for each of the autotest imports from within drone_utility on my machine: pprabhu@pprabhu:files$ python loadit.py 0(): 0.00 1(error): 0.00 2(global_config): 0.01 3(logging_manager): 0.00 4(utils): 0.05 5(retry): 0.00 6(drone_logging_config): 0.00 7(email_manager): 0.31 8(scheduler_config): 0.00 9(hosts): 0.22 10(subcommand): 0.00 That's pretty much all the time a simple call to drone_utility takes. On chromeos-server14, where I've been experimenting with parallelized calls, drone_utility calls take ~3 seconds which is ~70% of a single tick time. Same exercise there gives: >>> from autotest_lib.scheduler import drone_utility_loading 0(): 0.00 1(error): 0.01 2(global_config): 0.04 3(logging_manager): 0.00 4(utils): 0.16 5(retry): 0.00 6(drone_logging_config): 0.00 7(email_manager): 0.57 8(scheduler_config): 0.01 9(hosts): 0.74 10(subcommand): 0.00 So ~1.5 seconds. That only accounts for half of the missing 3 seconds each drone_utility call. But there's definitely 3 seconds missing between when drones.py:execute_queued_calls hands off to drone_utility and when drone_utility actually starts doing anything. (My hunch based on logs, plus pgrep'ing is that drone_utility reexecs once before doing anything useful)
,
Jul 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/240b4672a10cc1621f909bd42248bc15d9ca33ab commit 240b4672a10cc1621f909bd42248bc15d9ca33ab Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jul 06 04:54:57 2017 [autotest] Delay hosts module import in drone_utility The hosts module import takes ~.75 seconds. This has been bottlenecking the scheduler tick by slowing down each drone_utility refresh call by ~1.5 seconds. Delay the import as a hack while we work on a real fix. BUG= chromium:718181 BUG= chromium:739466 TEST=None Change-Id: I3966259594bc5b2e5e1dd8592c8dcfd9cbda1c6c Reviewed-on: https://chromium-review.googlesource.com/559938 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/240b4672a10cc1621f909bd42248bc15d9ca33ab/scheduler/drone_utility.py
,
Jul 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2284834caadd6f3a5ea53d7d93f3ef54be4bf231 commit 2284834caadd6f3a5ea53d7d93f3ef54be4bf231 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jul 06 19:12:05 2017 [autotest] Drop email notification from drone_utility I haven't seen this notification in months. We don't want email notifications from the lab. Importing this module adds ~1 second to every single drone_utility call from the scheduler. So this adds ~2 seconds to every tick in the lab. BUG= chromium:718181 TEST=None Change-Id: I6868271e71a06e1038ab008dbcc3e5f92d4fff61 Reviewed-on: https://chromium-review.googlesource.com/559937 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> [modify] https://crrev.com/2284834caadd6f3a5ea53d7d93f3ef54be4bf231/scheduler/drone_utility.py
,
Jul 12 2017
#19 and #20 together improved the average and 90%ile tick rate on the shards by 2.5X: https://viceroy.corp.google.com/chromeos/capacity_health?duration=8d#_VG_z14nZKtl So, parallelization of the refresh call may not have been necessay. I'm going to turn off the new parallelized code flow on chromeos-server14 and see how it affects the tick there. Then take call to drop the new feature / turn it on for all shards.
,
Jul 12 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/15a75589a5157f9cd779d7c090f15a644b644d57 commit 15a75589a5157f9cd779d7c090f15a644b644d57 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Jul 12 21:36:49 2017
,
Jul 13 2017
#22 turned off the new process pools on server14, and it actually helped the tick rate. https://viceroy.corp.google.com/chromeos/capacity_health?duration=8d&heatmap=False&hostname=chromeos-server14&refresh=-1&topstreams=5#_VG_nQ5cHmXp So, conclusion: process pools are not needed. My work here is done. We got 2.5X improvement by making drone_utility.py less stupid.
,
Jul 13 2017
,
Jan 22 2018
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by aut...@google.com
, May 5 2017