Use Swarming trigger script to provide "or" of multiple GPU configurations per bot
Issue description

During operations like GPU driver or graphics card upgrades, the bots in the Swarming pool will generally be in one of two configurations: the previous config and the next one. The differences between these might include the PCI ID of the graphics card, the driver version, the exact OS version (e.g., an upgrade from Win7 to Win10, or macOS 10.12 to 10.13), etc. A migration might also involve changes to more than one of these dimensions.

In https://cs.chromium.org/chromium/src/content/test/gpu/generate_buildbot_json.py we prefer to be precise in the Swarming dimensions specified when we trigger tests, to avoid problems we've seen in the past where a one-off machine added to the Swarming pool for a particular purpose inadvertently has jobs triggered against it that weren't supposed to run on it.

In Issue 756295 we requested an "or" operator for Swarming dimensions, to be evaluated on the server. While this would be convenient for specification in the JSON files, the semantics would become messy pretty quickly. Which one would be preferred and why? What would the fallback criteria be?

martiniss@ pointed out in https://bugs.chromium.org/p/chromium/issues/detail?id=756295#c7 that he's implemented a new mechanism called trigger scripts, which have the opportunity to run code between the time the recipe runs a Swarming step and the time the shards are actually triggered on the bots. From that comment:

"""
https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/swarming/api.py?q=trigger_script&sq=package:chromium&dr=C&l=1205 are the docs on this feature. You can add a "trigger_script" key in a //testing/buildbot/*.json file, and it should use this logic.

https://cs.chromium.org/chromium/src/tools/perf/perf_device_trigger.py?q=perf_device&sq=package:chromium&l=1 is the trigger script we're using, if you want to see some sample code for a trigger script.
"""

After discussion with martiniss@ I think we should use this mechanism rather than asking for the Swarming feature server-side. The semantics of the server side would be complicated, and with a bit of work, all the flexibility, and more, can be implemented on the client side.

The rough sketch would be:

1) We write a GPU-test-specific trigger script somewhere. This script will receive as input all of the usual arguments to "swarming.py trigger". If a bot is in the process of being migrated, it will pass down via the trigger script's "args" argument (somehow, TBD) the two or more dimension sets it expects to see in the Swarming pool.

2) The trigger script makes multiple queries against the Swarming pool to enumerate the bots available for each dimension set.

3) Based on these results, the trigger script determines which dimension set it should actually use for each shard. It might consider factors like:
- The number of available bots for each dimension set
- If there are no available bots, the number of busy bots for each dimension set, assigning shards probabilistically

4) The trigger script writes out a JSON file describing how each shard is to be triggered, which is picked up by Swarming.

Then generate_buildbot_json.py can be changed so that for a given bot, like "Win7 Release (NVIDIA)", multiple swarming_dimensions can be set. If this is the case, a trigger_script property would be specified in the JSON file it outputs, passing all of the swarming_dimensions down to be evaluated as above.

There are some downsides to this approach: there is a sort of race condition with other clients triggering jobs, and a job cannot migrate between configurations once its dimension set has been chosen. However, these are probably relatively minor.

We should implement this before the next big upgrade in the fleet. It can't be done in a rush and so can't be done for the GeForce GT 610 -> Quadro P400 upgrade currently underway.
,
Dec 14 2017
In Issue 794720 there was a significant regression in a Windows 10-only code path, which strongly motivates moving the GPU tests from Windows 7 to Windows 10 so that they can be run on the CQ. The trigger script must be implemented before this upgrade can be done, because we can't take up to half the fleet offline at any given time.
,
Jan 12 2018
Work is in progress here: https://chromium-review.googlesource.com/833505/

Because it's difficult to find these tryjobs after the fact with the Gerrit user interface, I'm recording the most recent tryjob at this point: https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/79127

The trigger script still isn't working, and I've just added some debug logging to try to figure out why.
,
Jan 12 2018
,
Jan 12 2018
I'm working on issue 781021, which will make this easier, as it implements a fallback mechanism on the server. It's not strictly speaking an "or", but it should be usable enough to serve the same purpose, especially coupled with issue 706586.
,
Jan 12 2018
Thanks for working on that. It's plausible we could get rid of the GPU trigger script that's going to be added for this purpose in https://chromium-review.googlesource.com/833505 with server-side support, but now that the trigger script is working, I'm not sure we really need more sophistication server-side.
,
Jan 12 2018
,
Jan 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/681a0e18bc03d08dd601f77efc198361e556c307

commit 681a0e18bc03d08dd601f77efc198361e556c307
Author: Kenneth Russell <kbr@chromium.org>
Date: Sat Jan 13 07:32:21 2018

Add Swarming trigger script for GPU tests.

This script takes in multiple Swarming dimension sets, specified as a JSON string which decodes to a list of dictionaries. The script queries the Swarming pool for the live bots for each dimension and spreads the shard(s) for the given tests across them according to an algorithm defined in the script.

This allows two or more GPU configurations to be specified for a single bot. This way, during upgrades of the fleet, jobs can temporarily be targeted at both the old and the new configurations, avoiding temporarily losing half of the capacity.

The trigger script is enabled on these four bots:
chromium.gpu:Win7 Release (NVIDIA)
chromium.gpu.fyi:Win7 Release (NVIDIA)
chromium.gpu.fyi:Win7 dEQP Release (NVIDIA)
chromium.gpu.fyi:Optional Win7 Release (NVIDIA)

which are mirrored to the following trybots:
tryserver.chromium.win:win7_chromium_rel_ng
tryserver.chromium.win:win_optional_gpu_tests_rel
tryserver.chromium.angle:win_angle_rel_ng
tryserver.chromium.angle:win_angle_deqp_rel_ng

Once the upgrade to Win10 begins, Win10 will be set as an alternate Swarming dimension for these bots. When the upgrade is complete, the bots will be targeted solely at Win10, with Win7 only being tested on the waterfalls.

BUG= 781057
Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: Ie79953dfa023c3a9e4427e651ac6f8c10304b74c
Reviewed-on: https://chromium-review.googlesource.com/833505
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#529170}

[modify] https://crrev.com/681a0e18bc03d08dd601f77efc198361e556c307/content/test/gpu/PRESUBMIT.py
[modify] https://crrev.com/681a0e18bc03d08dd601f77efc198361e556c307/content/test/gpu/generate_buildbot_json.py
[add] https://crrev.com/681a0e18bc03d08dd601f77efc198361e556c307/content/test/gpu/trigger_gpu_test.py
[add] https://crrev.com/681a0e18bc03d08dd601f77efc198361e556c307/content/test/gpu/trigger_gpu_test_unittest.py
[modify] https://crrev.com/681a0e18bc03d08dd601f77efc198361e556c307/testing/buildbot/chromium.gpu.fyi.json
[modify] https://crrev.com/681a0e18bc03d08dd601f77efc198361e556c307/testing/buildbot/chromium.gpu.json
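The commit message says the dimension sets arrive as a JSON string that decodes to a list of dictionaries. The following Python sketch shows one way such input could be decoded and turned into per-configuration arguments. It is not the code in trigger_gpu_test.py: the helper names and the dimension values in the usage note are hypothetical, and the "--dimension KEY VALUE" argument form is assumed here, not taken from this thread.

```python
import json


def parse_dimension_sets(json_string):
    """Decode a JSON string into a list of dimension dictionaries."""
    configs = json.loads(json_string)
    if not isinstance(configs, list) or not configs:
        raise ValueError("expected a non-empty JSON list of dictionaries")
    for config in configs:
        if not isinstance(config, dict):
            raise ValueError("each dimension set must be a JSON object")
    return configs


def to_dimension_args(dimensions):
    """Flatten one dimension set into command-line style arguments.

    The '--dimension KEY VALUE' form is an assumption for this sketch.
    Keys are sorted so the output is deterministic.
    """
    args = []
    for key, value in sorted(dimensions.items()):
        args.extend(["--dimension", key, value])
    return args
```

For example, with purely illustrative values, parse_dimension_sets('[{"os": "Windows-7"}, {"os": "Windows-10"}]') yields two dictionaries, one for the old configuration and one for the new, and to_dimension_args can then be applied to whichever one is chosen for a given shard.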
,
Jan 13 2018
,
Jan 16 2018
This new Swarming trigger script is working and is running on multiple bots on both Chromium's and ANGLE's commit queues. At this point we can proceed with the upgrade of the majority of the Win7 GPU bots to Win10.
,
Jan 27 2018
,
Feb 1 2018
Comment 1 by kbr@chromium.org, Nov 15 2017