
Issue 713192


Issue metadata

Status: Fixed
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug




ANGLE CQ 32-bit Windows Release on ATI GPUs running low on capacity

Project Member Reported by jmad...@chromium.org, Apr 19 2017

Issue description

Comment 1 by kbr@chromium.org, Apr 19 2017

Summary: ANGLE CQ 32-bit Windows Release on ATI GPUs running low on capacity (was: ANGLE CQ 32-bit Windows Release seems to be having capacity problems)
The failures are all on the AMD GPU bots. Clearly they're oversubscribed.

The only short-term workaround for this problem is to stop triggering tests on the AMD bots by default by removing this entry from trybots.py:

https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/chromium_tests/trybots.py?q=trybots.py+package:%5Echromium$&dr&l=235

          {
            'mastername': 'chromium.gpu.fyi',
            'buildername': 'GPU Win Builder',
            'tester': 'Win7 Release (AMD)',
          },

The tests will still run on the waterfall but won't run on this particular tryserver.
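As a rough illustration of that workaround, the sketch below drops one tester from a tryserver's mirror list. The dictionary layout, the `bot_ids` key, the `without_tester` helper, and the second placeholder entry are assumptions made for this example; they are not copied from the real trybots.py.

```python
import copy

# Hypothetical, simplified stand-in for the tryserver-to-tester mapping.
TRYBOTS = {
    'win_angle_rel_ng': {
        'bot_ids': [
            {
                'mastername': 'chromium.gpu.fyi',
                'buildername': 'GPU Win Builder',
                'tester': 'Win7 Release (AMD)',
            },
            {
                # Placeholder entry, not taken from the real file.
                'mastername': 'chromium.gpu.fyi',
                'buildername': 'GPU Win Builder',
                'tester': 'Some Other Tester',
            },
        ],
    },
}


def without_tester(trybots, trybot_name, tester):
    """Return a copy of trybots in which trybot_name no longer mirrors tester."""
    result = copy.deepcopy(trybots)
    entry = result[trybot_name]
    entry['bot_ids'] = [b for b in entry['bot_ids'] if b.get('tester') != tester]
    return result


updated = without_tester(TRYBOTS, 'win_angle_rel_ng', 'Win7 Release (AMD)')
remaining = [b['tester'] for b in updated['win_angle_rel_ng']['bot_ids']]
print(remaining)
```

The waterfall configuration is untouched in this sketch, which matches the point that only the tryserver stops triggering the AMD tests.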

We hadn't planned on continuing to expand the Windows AMD GPU bots in the data center because of the reliability problems we encountered when first deploying them.

We could add an optional tryserver to the ANGLE CQ which could be used to manually trigger tryjobs on AMD GPUs.

I still need to look into this more, but Ken, I think removing some of the tests from AMD would be a better option than removing all of them. Currently the Vulkan back-end is only tested on Windows AMD. Thanks for pointing out that it was only AMD.

Comment 3 by kbr@chromium.org, Apr 19 2017

Right now the win_angle_rel_ng tryserver mirrors "Win7 Release (AMD)" on the chromium.gpu.fyi waterfall, so we can't reduce the set of tests that are run on the tryserver without also removing those tests from the waterfall.

We could define an "imaginary" waterfall bot similar to "Optional Win7 Release (AMD)" which would contain that subset of tests, and have win_angle_rel_ng mirror that to pick up the tests it runs on AMD.

It looks like about half of the load comes from win_optional_gpu_tests_rel. Perhaps we can remove it from some presubmits and accept detecting the problems only on the GPU.FYI waterfall?
Here is a breakdown of today's win_optional_gpu_tests_rel jobs:
gpu/PRESUBMIT.py - 9
v8-autoroll - 6
content/test/gpu/PRESUBMIT.py - 5
skia-deps-roller - 4
third_party/WebKit/Source/modules/webgl/PRESUBMIT.py - 4
media/gpu/PRESUBMIT.py - 3
tools/roll_swiftshader.py - 1

I'm puzzled why https://codereview.chromium.org/2829503002/ also triggered this bot 3 times. Maybe those runs were added manually?

So, if I remove win_optional_gpu_tests_rel from v8-autoroll and skia-deps-roller, this will reduce the load by 1/6th on AMD Win7 bots. What do you think?
BTW, can we request more bots?
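The 1/6th estimate can be checked against the breakdown above. This is a quick sketch that treats the "about half of the load" figure as exactly one half, which is an assumption; the variable names are made up for the example.

```python
from fractions import Fraction

# Today's win_optional_gpu_tests_rel jobs, as listed in the breakdown.
jobs = {
    'gpu/PRESUBMIT.py': 9,
    'v8-autoroll': 6,
    'content/test/gpu/PRESUBMIT.py': 5,
    'skia-deps-roller': 4,
    'third_party/WebKit/Source/modules/webgl/PRESUBMIT.py': 4,
    'media/gpu/PRESUBMIT.py': 3,
    'tools/roll_swiftshader.py': 1,
}

total = sum(jobs.values())
roller_jobs = jobs['v8-autoroll'] + jobs['skia-deps-roller']

# Share of win_optional_gpu_tests_rel jobs that come from the two rollers.
roller_share = Fraction(roller_jobs, total)

# Assumption: win_optional_gpu_tests_rel is exactly half of the AMD load.
optional_share_of_amd_load = Fraction(1, 2)

reduction = roller_share * optional_share_of_amd_load
print(f"{roller_jobs} of {total} jobs come from the rollers")
print(f"estimated AMD load reduction: {reduction} (about {float(reduction):.1%})")
```

Under that assumption the reduction comes out to 5/32, which is close to the 1/6th quoted.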

Comment 6 by kbr@chromium.org, Apr 19 2017

> So, if I remove win_optional_gpu_tests_rel from v8-autoroll and skia-deps-roller, this will reduce the load by 1/6th on AMD Win7 bots. What do you think?

Simply removing win_optional_gpu_tests_rel from these two rollers is not a good solution. Changes to both V8 and Skia have broken WebGL 2.0 conformance tests in the past.

Another viable option would be to remove "Optional Win7 Release (AMD)" from the mirroring of "win_optional_gpu_tests_rel", leaving win_angle_rel_ng alone.

My feeling is that removing from the optional testers, and leaving on win_angle_rel_ng, would be a good solution for ANGLE. I can look into that tomorrow if it's still available.
Labels: -Infra-Troopers
Removing troopers label for now.
I had another idea: what if we restrict the number of builds that win_optional_gpu_tests_rel and win_angle_rel_ng can run simultaneously?

I see that win_optional_gpu_tests_rel runs 5 builds now and win_angle_rel_ng runs 3 (though it looks like it's capable of 20, if the number of buildslaves is the limit). What if we limit each to a maximum of 3 builds? The CQ will take longer for the unfortunate, but we'll have more coverage.
Also note that there are 147 NVIDIA bots and only 23 AMD bots. Since win_optional_gpu_tests_rel and win_angle_rel_ng run on both, AMD is oversubscribed and NVIDIA is underutilized. Perhaps there is a way to spread the load between them more evenly? Maybe there are tests which are not specific to the driver or hardware, and we can run them only on NVIDIA?
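To put the imbalance in numbers, here is a small sketch using only the two bot counts quoted above; it says nothing about how many jobs each pool actually receives.

```python
from fractions import Fraction

nvidia_bots = 147
amd_bots = 23

# Fraction of the combined pool that is AMD hardware.
amd_share = Fraction(amd_bots, nvidia_bots + amd_bots)

# How many NVIDIA bots exist for each AMD bot.
nvidia_per_amd = nvidia_bots / amd_bots

print(f"AMD share of the combined pool: {float(amd_share):.1%}")
print(f"NVIDIA bots per AMD bot: {nvidia_per_amd:.1f}")
```

A tryserver that sends the same work to both pools therefore loads each AMD bot several times more heavily than each NVIDIA bot, all else being equal.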

Comment 11 by kbr@chromium.org, Apr 20 2017

I don't think it's a good idea to artificially reduce the number of slaves in the win_optional_gpu_tests_rel pool. Doing so would directly impact CQ time for CLs that touch code in many Chromium subdirectories, as well as V8 and Skia rolls.

Changes to ANGLE are the ones that are most likely to affect one GPU type or another on Windows. For this reason I think it's reasonable to stop running win_optional_gpu_tests_rel's tests on Win/AMD, and have only win_angle_rel_ng do so.

Note that the NVIDIA bots are triggered by win_chromium_rel_ng and handle a large number of CLs daily.

Cc: ynovikov@chromium.org
Owner: jmad...@chromium.org
I agree with Ken that we should pick our battles and take the easy solution for now. Will look at restricting the bots today.

Comment 13 by bugdroid1@chromium.org, Apr 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/886d106a5f3bb0440d03e1e327eb3ab1cb0862f8

commit 886d106a5f3bb0440d03e1e327eb3ab1cb0862f8
Author: Jamie Madill <jmadill@chromium.org>
Date: Thu Apr 20 17:28:33 2017

Disable AMD tests on GPU optional trybots.

These tests are running into capacity problems on the testers.
Only run them on the ANGLE CQ and FYI bots for now.

BUG= 713192 
R=kbr@chromium.org

Change-Id: I9795a98e51ee9c0fa41d47edc6faa0b53f01791e
Reviewed-on: https://chromium-review.googlesource.com/483028
Commit-Queue: Jamie Madill <jmadill@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>

[modify] https://crrev.com/886d106a5f3bb0440d03e1e327eb3ab1cb0862f8/scripts/slave/recipe_modules/chromium_tests/trybots.py

Status: Fixed (was: Assigned)
win_angle_rel_ng didn't run out of capacity today, so I think this is fixed. Thanks, Jamie!
