Win 7 Nvidia GPU Perf gtest-based performance tests are showing "expired, not enough capacity"
Issue description
Waterfall: https://ci.chromium.org/p/chrome/builders/luci.chrome.ci/Win%207%20Nvidia%20GPU%20Perf
First bad build: https://ci.chromium.org/p/chrome/builders/luci.chrome.ci/Win%207%20Nvidia%20GPU%20Perf/3537
Caleb, I don't see a perf bot sheriff, but I recall you were helping with the Nexus 5 issue with ANGLE. Can you help investigate this? It seems like a substantial break. I'm not sure if these tests are tied to a specific bot like they used to be.
,
Jan 11
Okay, thanks. Would be good to get the gtests back again while the configuration is being phased out. Currently it's my main source of test data. Would be happy to move over to a new Windows 10 configuration when available! I'll follow the relevant issues.
,
Jan 11
This is the device that the tests should run on: https://chrome-swarming.appspot.com/bot?id=build202-m7&sort_stats=total%3Adesc If you look at its Tasks history, it is clear that it used to run *both* the gtest_perf_tests and the Telemetry performance_test_suite tests. The last time it ran a gtest, though, was Jan 6th. For some reason it is now too busy to run the gtest_perf_tests. So maybe the order of runs has changed so that we start the telemetry tests before the gtests, or we don't wait for the telemetry tests to finish. I will keep looking into this.
,
Jan 11
Ah, yes, I was right. The last successful build looks like this:
( 28 secs ) test_pre_run:
( 3 secs ) [trigger] angle_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 2 secs ) [trigger] load_library_perf_tests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs ) [trigger] media_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs ) [trigger] passthrough_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs ) [trigger] validating_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 12 secs ) [trigger] performance_test_suite on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
The first failing build looks like this:
( 31 secs ) test_pre_run
( 15 secs ) [trigger] performance_test_suite on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 2 secs ) [trigger] angle_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs ) [trigger] load_library_perf_tests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs ) [trigger] media_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 2 secs ) [trigger] passthrough_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
( 3 secs ) [trigger] validating_command_buffer_perftests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1 Run on OS: 'Windows-2008ServerR2-SP1'
So the whole issue is that we now trigger the performance_test_suite (which is telemetry) first, and only afterwards trigger all the gtest_perf_tests, which then have to wait for the bot to become free.
Since it takes a bot 6 hours to run a performance_test_suite shard, the gtest_perf_tests requests time out. Now I need to figure out what code controls this behavior and how to change it back.
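To illustrate why the trigger order matters, here is a rough sketch of a single bot serving a first-come, first-served queue. It is a toy model, not the actual Swarming scheduler or recipe code. The 6-hour duration for a performance_test_suite shard and the roughly 15 minutes for all five gtests come from this thread; the 120-minute expiration and the names `Task` and `simulate` are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    duration_min: int  # minutes the task occupies the bot once it starts


def simulate(trigger_order, expiration_min):
    """Toy model: one bot, FIFO queue, all tasks triggered at time 0.

    A task expires if it has waited longer than expiration_min
    (a hypothetical value) by the time the bot becomes free.
    """
    clock = 0  # minutes until the bot is next free
    results = {}
    for task in trigger_order:
        if clock > expiration_min:
            results[task.name] = "expired"
        else:
            results[task.name] = "ran"
            clock += task.duration_min
    return results


# Five gtests at ~3 minutes each (~15 minutes total, per the thread).
GTESTS = [
    Task("angle_perftests", 3),
    Task("load_library_perf_tests", 3),
    Task("media_perftests", 3),
    Task("passthrough_command_buffer_perftests", 3),
    Task("validating_command_buffer_perftests", 3),
]
# One performance_test_suite shard takes ~6 hours (per the thread).
TELEMETRY = Task("performance_test_suite", 360)

EXPIRATION_MIN = 120  # assumed value, not taken from the real config

old_order = simulate(GTESTS + [TELEMETRY], EXPIRATION_MIN)
new_order = simulate([TELEMETRY] + GTESTS, EXPIRATION_MIN)
```

With the gtests triggered first, everything runs, because the telemetry shard only waits about 15 minutes. With the telemetry shard triggered first, every gtest waits about 6 hours and expires under any expiration shorter than that.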
,
Jan 11
Culprit: https://chromium.googlesource.com/chromium/tools/build/+/8df5b432d7a5e51f04a11cd9daa4c14a19255cb1 I wonder if it is safe to revert the culprit.
,
Jan 12
Nice work diagnosing this. Not sure on revert but will watch for the fix.
,
Jan 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/0a6ec8ff35e6665ff9821afa79502ae41669dadb
commit 0a6ec8ff35e6665ff9821afa79502ae41669dadb
Author: Caleb Rouleau <crouleau@google.com>
Date: Sat Jan 12 01:28:55 2019
Revert "[chromium_tests, test_utils] change order of test_pre_run"
https://bugs.chromium.org/p/chromium/issues/detail?id=921039#c5
This reverts commit 8df5b432d7a5e51f04a11cd9daa4c14a19255cb1.
TBR=dpranke
Bug: 921039
Change-Id: I866a2d264a031de4264fb25d0aa4a655f1906023
Reviewed-on: https://chromium-review.googlesource.com/c/1407155
Reviewed-by: Caleb Rouleau <crouleau@google.com>
Commit-Queue: Caleb Rouleau <crouleau@google.com>
[modify] https://crrev.com/0a6ec8ff35e6665ff9821afa79502ae41669dadb/scripts/slave/README.recipes.md
[modify] https://crrev.com/0a6ec8ff35e6665ff9821afa79502ae41669dadb/scripts/slave/recipes/chromium_trybot.py
[modify] https://crrev.com/0a6ec8ff35e6665ff9821afa79502ae41669dadb/scripts/slave/recipe_modules/chromium_tests/tests/steps/swarming_isolated_script_test.py
[modify] https://crrev.com/0a6ec8ff35e6665ff9821afa79502ae41669dadb/scripts/slave/recipe_modules/test_utils/api.py
[modify] https://crrev.com/0a6ec8ff35e6665ff9821afa79502ae41669dadb/scripts/slave/recipe_modules/chromium_tests/steps.py
,
Jan 12
Thanks to the revert we're getting occasional results now, but builds are still often failing. I noticed the results regressed as well and filed issue 921004. Looking at the test shards, performance_test_suite uses all five shards including build202-m7 while the gtest perf tests all compete for build202-m7. If this test configuration is being deprecated would it be possible to remove performance_test_suite from this bot? It would still be useful for ANGLE because we have multiple years of history. Right now it's our primary source of performance test data. I use it in public presentations to show our progress.
,
Jan 12
Alternately if there's a way to tell performance_test_suite not to use a particular shard that would help.
,
Jan 12
> Looking at the test shards, performance_test_suite uses all five shards including build202-m7 while the gtest perf tests all compete for build202-m7.
Filed issue 921353. Thank you!
> Alternately if there's a way to tell performance_test_suite not to use a particular shard that would help.
This doesn't make sense, since running all of the gtest_perf_tests back to back takes only about 15 minutes, but one shard of performance_test_suite takes around 6 hours. The reason is that performance_test_suite runs all the Telemetry tests, and there are many, many more Telemetry tests than gtest_perf_tests.
> If this test configuration is being deprecated would it be possible to remove performance_test_suite from this bot? It would still be useful for ANGLE because we have multiple years of history. Right now it's our primary source of performance test data. I use it in public presentations to show our progress.
We will keep you in the loop as we deprecate this configuration. In general, we need to limit the number of configurations that we support for maintainability's sake, so you will need to migrate off of this bot eventually, but there may be some transition time built in, and perhaps we can leave the gtests running on these as long as possible. Note that we don't have many of these devices left, and they are consumer-grade hardware, so they are fated to die even if we tried to maintain them indefinitely. There is a data migration process so that you could have the new charts show the old data as well (potentially with a marked spot for where the transition happened).
Comment 1 by crouleau@chromium.org, Jan 11
Cc: bradhall@chromium.org jbudorick@chromium.org