
Issue 921039


Issue metadata

Status: Fixed
Owner:
Closed: Jan 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug

Blocked on:
issue 865538




Win 7 Nvidia GPU Perf gtest-based performance tests are showing "expired, not enough capacity"

Project Member Reported by jmad...@chromium.org, Jan 11

Issue description

Waterfall:

https://ci.chromium.org/p/chrome/builders/luci.chrome.ci/Win%207%20Nvidia%20GPU%20Perf

First bad build:

https://ci.chromium.org/p/chrome/builders/luci.chrome.ci/Win%207%20Nvidia%20GPU%20Perf/3537

Caleb, I don't see a perf bot sheriff, but I recall you were helping with the Nexus 5 issue with ANGLE. Can you help investigate this? It seems like a substantial break. I'm not sure if these tests are tied to a specific bot like they used to be.

 
Blockedon: 865538
Cc: bradhall@chromium.org jbudorick@chromium.org
Thanks for the heads up! I haven't looked into this in detail, but my guess is that it's related to issue 906654. We're just running out of these devices and we need to deprecate this configuration and replace it.

I will need to figure out if we can just reshard this to move devices off of the Telemetry benchmarks and over to the gtest perf tests.
Okay, thanks. It would be good to get the gtests back while the configuration is being phased out. Currently it's my main source of test data.

Would be happy to move over to a new Windows 10 configuration when available! I'll follow the relevant issues.
This is the device that the tests should run on: https://chrome-swarming.appspot.com/bot?id=build202-m7&sort_stats=total%3Adesc

If you look at its Tasks history, it is clear that it used to run *both* the gtest_perf_tests and the Telemetry performance_test_suite tests. The last time it ran a gtest, though, was Jan 6th. For some reason it is now too busy to run the gtest_perf_tests. So perhaps the order of the runs has changed: we now start the Telemetry tests before the gtests, or no longer wait for the Telemetry tests to finish.

I will keep looking into this.
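
For anyone retracing this, here is a rough sketch of pulling that task history programmatically instead of eyeballing the web UI. The endpoint path and field names are assumptions based on the Swarming v1 API, so verify them against the chrome-swarming API explorer before relying on them:

import requests

SWARMING = "https://chrome-swarming.appspot.com/_ah/api/swarming/v1"

def recent_tasks(bot_id, auth_token, limit=50):
    # List the most recent tasks the bot has run, newest first.
    resp = requests.get(
        f"{SWARMING}/bot/{bot_id}/tasks",
        params={"limit": limit},
        headers={"Authorization": f"Bearer {auth_token}"},  # e.g. from `luci-auth token`
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

for task in recent_tasks("build202-m7", auth_token="<oauth token>"):
    print(task.get("started_ts"), task.get("state"), task.get("name"))
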
Ah, yes, I was right. The last successful build looks like this:

( 28 secs )test_pre_run:

( 3 secs )[trigger] angle_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 2 secs )[trigger] load_library_perf_tests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs )[trigger] media_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs )[trigger] passthrough_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs )[trigger] validating_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 12 secs )[trigger] performance_test_suite on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'


The first failing build looks like this:

( 31 secs )test_pre_run

( 15 secs )[trigger] performance_test_suite on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 2 secs )[trigger] angle_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs )[trigger] load_library_perf_tests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs )[trigger] media_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 2 secs )[trigger] passthrough_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs )[trigger] validating_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'


So the whole issue is that we now trigger the performance_test_suite (which is Telemetry) first, and only afterwards trigger all the gtest_perf_tests, which then have to wait for the bot to become free. Since it takes a bot about 6 hours to run a performance_test_suite shard, the gtest_perf_tests requests expire before the bot frees up.
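
A toy model of that scheduling behavior, just to make the failure mode concrete: a single bot, tasks queued in trigger order, and each task expiring if the bot is still busy past its deadline. The 1-hour expiration is a placeholder assumption; the real value comes from the task request:

EXPIRATION_H = 1.0  # assumed expiration window, not the real configured value

def simulate(trigger_order):
    # One bot; each task either starts within the expiration window or expires.
    bot_busy_until = 0.0
    outcome = []
    for name, duration_h in trigger_order:
        if bot_busy_until > EXPIRATION_H:
            outcome.append((name, "EXPIRED"))   # bot still occupied past the deadline
        else:
            bot_busy_until += duration_h        # task runs and occupies the bot
            outcome.append((name, "RAN"))
    return outcome

gtests = [("angle_perftests", 0.05), ("load_library_perf_tests", 0.05),
          ("media_perftests", 0.05), ("passthrough_command_buffer_perftests", 0.05),
          ("validating_command_buffer_perftests", 0.05)]
telemetry = [("performance_test_suite", 6.0)]

print(simulate(gtests + telemetry))   # old order: everything runs
print(simulate(telemetry + gtests))   # new order: every gtest expires behind the 6 h suite
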


Now I need to figure out what code controls this behavior and how to change it back.
Cc: dpranke@chromium.org tikuta@chromium.org
Culprit: https://chromium.googlesource.com/chromium/tools/build/+/8df5b432d7a5e51f04a11cd9daa4c14a19255cb1

I wonder if it is safe to revert the culprit.
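
I haven't read the culprit CL closely, so this is not the actual recipe code, but conceptually the revert just needs to restore an ordering like this, where the quick gtest suites are triggered before the multi-hour Telemetry suite:

def trigger_order(test_names):
    # Illustrative only: put the quick gtest suites ahead of performance_test_suite
    # so they can grab the bot before it goes busy for ~6 hours. sorted() is stable,
    # so the gtests keep their relative order.
    return sorted(test_names, key=lambda name: name == "performance_test_suite")

tests = ["performance_test_suite", "angle_perftests", "load_library_perf_tests",
         "media_perftests", "passthrough_command_buffer_perftests",
         "validating_command_buffer_perftests"]
print(trigger_order(tests))  # gtests first, performance_test_suite last
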
Nice work diagnosing this. I'm not sure about the revert, but I'll watch for the fix.
Thanks to the revert we're getting occasional results now, but builds are still often failing. I also noticed the results regressed and filed issue 921004.

Looking at the test shards, performance_test_suite uses all five shards including build202-m7 while the gtest perf tests all compete for build202-m7.

If this test configuration is being deprecated would it be possible to remove performance_test_suite from this bot? It would still be useful for ANGLE because we have multiple years of history. Right now it's our primary source of performance test data. I use it in public presentations to show our progress.
Alternatively, if there's a way to tell performance_test_suite not to use a particular shard, that would help.
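
For context on why the gtests all pile onto one machine while performance_test_suite spreads out, the generated perf test specs look roughly like this. The field names are an assumption about the spec format, not copied from the real chromium.perf config:

# Gtest perf suites are pinned to a single bot via its Swarming "id" dimension,
# so they all queue up behind build202-m7.
angle_perftests_spec = {
    "name": "angle_perftests",
    "swarming": {
        "dimension_sets": [{"id": "build202-m7", "os": "Windows-2008ServerR2-SP1"}],
    },
}

# The Telemetry suite is sharded and matched by GPU/OS rather than a single bot id,
# so its five shards land on all five machines, including build202-m7.
performance_test_suite_spec = {
    "name": "performance_test_suite",
    "swarming": {
        "shards": 5,
        "dimension_sets": [{"gpu": "10de", "os": "Windows-2008ServerR2-SP1"}],
    },
}
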
Status: Fixed (was: Assigned)
> Looking at the test shards, performance_test_suite uses all five shards including build202-m7 while the gtest perf tests all compete for build202-m7.

Filed issue 921353. Thank you!


> Alternatively, if there's a way to tell performance_test_suite not to use a particular shard, that would help.

That doesn't really make sense, since running all of the gtest_perf_tests back to back only takes about 15 minutes, while one shard of performance_test_suite takes around 6 hours. The reason is that performance_test_suite runs all the Telemetry tests, and there are many, many more Telemetry tests than gtest_perf_tests.
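
Put in rough numbers, using the estimates above: a machine excluded from performance_test_suite and left only to the gtests would sit idle for most of each build cycle.

gtest_minutes = 15                 # all gtest_perf_tests back to back (rough estimate above)
telemetry_shard_minutes = 6 * 60   # one performance_test_suite shard (rough estimate above)

idle_fraction = 1 - gtest_minutes / telemetry_shard_minutes
print(f"reserved machine idle ~{idle_fraction:.0%} of each cycle")  # ~96%
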


> If this test configuration is being deprecated would it be possible to remove performance_test_suite from this bot? It would still be useful for ANGLE because we have multiple years of history. Right now it's our primary source of performance test data. I use it in public presentations to show our progress.

We will keep you in the loop as we deprecate this configuration. In general, we need to limit the number of configurations that we support for maintainability's sake, so you will need to migrate off of this bot eventually, but there may be some transition time built in, and perhaps we can leave the gtests running on these as long as possible. Note that we don't have many of these devices left, and they are consumer-grade hardware, so they are fated to die even if we tried to maintain them indefinitely. There is a data migration process so that you could have the new charts show the old data as well (potentially with a marked spot for where the transition happened).
