Mac mini jobs expiring
Issue description
Jobs on the Mac mini builder are expiring. https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Mini%208GB%2010.12%20Perf/
This appears to be because the total amount of time required to run the test suites is longer than 6 hours. I'm investigating why tests are taking longer than they should.
,
Jan 31 2017
I have some data (googler only) here: https://docs.google.com/spreadsheets/d/1Fx_RCUk61ZnGPmM6dUgunVzfuNQL-r5Prog3q-8KW8M/edit#gid=2071277789
It compares test timings on this bot vs a comparable Mac 10.12 Perf bot.
Also, I just realized there is a probable culprit already: swarming overhead. Compare https://chromium-swarm.appspot.com/task?id=340838a43487e110&refresh=10&show_raw=1 (mini) to https://chromium-swarm.appspot.com/task?id=3408db4def0fbd10&refresh=10 (regular). The first task has an overhead of about 2 minutes. The second task has an overhead of about 2 seconds. This time adds up: 2 minutes * ~35 enabled tests per bot is about an hour of overhead. This explains why we're getting timeouts.
We need to fix this. A couple of possible solutions:
* Make tests run faster
* Make swarming overhead smaller
* Increase swarming timeout
* Remove this builder
,
Jan 31 2017
Well, actually, it's about 60 tests per bot, which is 2 hours of overhead.
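The arithmetic above can be checked with a quick back-of-the-envelope sketch. The per-task overhead and test counts are the approximate figures quoted in this thread, and the function name is made up for illustration; it assumes every test runs as its own swarming task and pays the full overhead.

```python
def total_overhead_hours(overhead_seconds_per_task, task_count):
    """Total swarming overhead in hours if every test runs as its own task."""
    return overhead_seconds_per_task * task_count / 3600.0


# ~2 minutes of overhead per task on the mini, ~2 seconds on the regular bot,
# with roughly 60 tests per bot (approximate figures from this thread).
mini = total_overhead_hours(120, 60)
regular = total_overhead_hours(2, 60)
print(f"mini: {mini:.2f} h, regular: {regular:.3f} h")
```

With these inputs the mini spends about 2 hours of a 6 hour budget on overhead alone, while the regular bot spends about 2 minutes.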
,
Jan 31 2017
The background here is that we had super old Mac Minis with spinning disks. They were removed because they had too much overhead, with the hope that replacing them with newer spinning disk minis would give us the perf tests we need but with less disk overhead. Looks like these still have too much overhead. I'm pretty clueless about the specifics of what causes the overhead: Dirk, will your plan to move telemetry tests into one step impact it?
,
Jan 31 2017
+ layout-tests-on-swarming folks. The rationale is that it's going to hit a similar issue.
In practice, I *think* that the best course of action is to migrate to symlinks on platforms where symlinks are faster than hardlinks. I did a partial implementation last summer in run_isolated.py but it's not complete yet, so someone would have to take it from here and add complete support:
- bool flag at task request, push the flag down to the bot
- make run_isolated.py symlink work on all relevant platforms
I'm actually confident that this can be completed in a reasonable amount of time.
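For anyone who wants to check whether symlinks really are cheaper than hardlinks on a given bot, a rough timing sketch follows. It is not how run_isolated.py maps files; the function names, directory layout, and file count are all assumptions made for illustration, and it only uses the standard os.symlink and os.link calls. Creating symlinks may require extra privileges on Windows.

```python
import os
import tempfile
import time


def materialize(cache_dir, run_dir, use_symlinks):
    """Map every file in cache_dir into run_dir via symlink or hardlink."""
    os.makedirs(run_dir, exist_ok=True)
    link = os.symlink if use_symlinks else os.link
    count = 0
    for name in os.listdir(cache_dir):
        link(os.path.join(cache_dir, name), os.path.join(run_dir, name))
        count += 1
    return count


def time_materialize(file_count=1000):
    """Return {"symlink": seconds, "hardlink": seconds} for file_count files."""
    timings = {}
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical stand-in for the bot's local isolate cache.
        cache_dir = os.path.join(root, "cache")
        os.makedirs(cache_dir)
        for i in range(file_count):
            with open(os.path.join(cache_dir, f"f{i}"), "w") as f:
                f.write("x")
        for label, use_symlinks in (("symlink", True), ("hardlink", False)):
            run_dir = os.path.join(root, "run_" + label)
            start = time.perf_counter()
            materialize(cache_dir, run_dir, use_symlinks)
            timings[label] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    print(time_materialize())
```

The numbers depend heavily on the filesystem and disk, so this should be run on the actual spinning-disk mini rather than on a workstation.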
,
Jan 31 2017
Actually, Tim reminded me that their work on exparchive is likely going to achieve the desired performance improvement.
,
Jan 31 2017
I'm just downloading this isolate now, will re-upload it using the exparchive and rerun the command to see if it changes the performance much.
,
Jan 31 2017
I just opened up access to the spreadsheet. Also gathering more data today.
,
Jan 31 2017
https://chromium-swarm.appspot.com/task?id=340f1f1e04dd6310&refresh=10&show_raw=1 Non-scientific test - with exparchive, overhead was 22 seconds from a cold cache...
,
Jan 31 2017
It looks like on a hot cache things are even better: https://chromium-swarm.appspot.com/task?id=340f4ca8dcad5f10&refresh=10&show_raw=1
Total Overhead 10s
Downloading Inputs From Isolate 7s
,
Jan 31 2017
So it looks like exparchive makes things significantly better. Here are 5 runs using an isolated generated using exparchive:
---
https://chromium-swarm.appspot.com/task?id=340f500c1b435d10&refresh=10&show_raw=1
Total Overhead 12s
Downloading Inputs From Isolate 8s
---
https://chromium-swarm.appspot.com/task?id=340f502afd973810&refresh=10&show_raw=1
Total Overhead 35s
Downloading Inputs From Isolate 6s
Uploading Outputs To Isolate 0s
---
https://chromium-swarm.appspot.com/task?id=340f50566b6fab10&refresh=10&show_raw=1
Total Overhead 17s
Downloading Inputs From Isolate 10s
Uploading Outputs To Isolate 0s
---
https://chromium-swarm.appspot.com/task?id=340f507e5fa78610&refresh=10&show_raw=1
Total Overhead 29s
Downloading Inputs From Isolate 5s
Uploading Outputs To Isolate 0s
---
https://chromium-swarm.appspot.com/task?id=340f50a57cf8de10&refresh=10&show_raw=1
Total Overhead 13s
Downloading Inputs From Isolate 6s
Uploading Outputs To Isolate 0s
---
Looks like in the "hot" case, the overhead drops to ~15 seconds and in the "cold" case it ends up being ~35 seconds.
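As a sanity check on the five runs above, a small sketch that summarizes them. The numbers are copied from the task pages listed in that comment; nothing else is measured here, and five samples is far too few to read much into.

```python
from statistics import mean, median

# (total overhead, isolate download) in seconds, one tuple per run above.
runs = [(12, 8), (35, 6), (17, 10), (29, 5), (13, 6)]

total = [t for t, _ in runs]
download = [d for _, d in runs]

print(f"total overhead: mean {mean(total):.1f}s, "
      f"median {median(total)}s, max {max(total)}s")
print(f"isolate download: mean {mean(download):.1f}s")
```

Note that the isolate download stays in a narrow 5s to 10s band, so most of the spread in total overhead comes from something other than downloading inputs.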
,
Feb 1 2017
That sounds great. Can we start using exparchive? I'm not familiar with what that actually is...
,
Feb 1 2017
I'm looking for the easiest way to make this possible. The options seem to be:
* Change the isolate binary to allow batcharchive to use exparchive in some way.
* Change recipes to use exparchive rather than batcharchive (probably via a flag).
Digging around the perf recipes, they seem fairly different to the trybots/waterfall? It looks like the isolation of the perf tests happens at a totally different location to the swarming triggers. See https://build.chromium.org/p/chromium.perf/builders/Linux%20Builder/builds/91678 for example. An example of the isolate call is https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FLinux_Builder%2F91678%2F%2B%2Frecipes%2Fsteps%2Fisolate_tests%2F0%2Fstdout
The isolate command looks like this:
-----
python -u /b/rr/tmpqwTq6S/rw/checkout/scripts/slave/recipe_modules/isolate/resources/isolate.py \
  /b/c/b/Linux_Builder/src/tools/swarming_client \
  batcharchive \
  --dump-json /tmp/tmpPT3KXf.json \
  --isolate-server https://isolateserver.appspot.com \
  --verbose /b/c/b/Linux_Builder/src/out/Release/load_library_perf_tests.isolated.gen.json \
  \
  /b/c/b/Linux_Builder/src/out/Release/cc_perftests.isolated.gen.json \
  /b/c/b/Linux_Builder/src/out/Release/tracing_perftests.isolated.gen.json \
  /b/c/b/Linux_Builder/src/out/Release/media_perftests.isolated.gen.json \
  /b/c/b/Linux_Builder/src/out/Release/telemetry_perf_tests.isolated.gen.json
-----
Open questions I have:
* Is it normal for the perf bots to only batcharchive 5-6 test suites? Or do they normally isolate 30+ like the try bots, and this run was abnormal?
* How important is getting this done?
* Who owns this stuff?
,
Feb 1 2017
I can own this, I just don't know what to do on the isolate side of things. I can do the recipe changes. Although, I will have to admit I'm not sure where exactly the inputs to all the triggered tasks get isolated. Is all we need to do change the word "batcharchive" to "exparchive" in the call to isolate.py? Or is it more complicated?
,
Feb 1 2017
Sorry martiniss, what I actually meant was:
* Who owns the perf recipes, and who can I send code reviews to? (And who is able to understand the consequences of the changes?)
,
Feb 1 2017
Ah. I generally own the perf recipes. Some of the code is in chromium_tests, which has other owners than me. The swarming bots (which includes all the Mac bots) are all on the chromium recipe.
,
Feb 1 2017
Replying to sullivan@'s question in comment #4, yes, consolidating all of these things into a single big step (or a small number of them) would likely eliminate the per-step swarming overhead concerns.
,
Feb 3 2017
FTR, I chatted with tansell@ in person. He's doing some changes to the isolate recipe module. Once those changes land, we should be able to start using the exparchive command!
,
Feb 7 2017
,
Feb 7 2017
FYI: once https://chromium-review.googlesource.com/c/436324 lands, the perf recipes just need to be updated to give always_use_exparchive=True to isolate_tests and you should get the performance improvement. (This will also be useful in helping validate whether exparchive can replace batcharchive everywhere.) However, you will **also** want to look into using a single swarming task for a whole test suite rather than having a single test per swarming task.
,
Feb 7 2017
,
Feb 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/3eeeee0a7f1eddcddcd127c995341347f5759ef5
commit 3eeeee0a7f1eddcddcd127c995341347f5759ef5
Author: Stephen Martinis <martiniss@google.com>
Date: Wed Feb 08 01:10:04 2017
chromium.perf: Ignore Mac Mini bot
The bot will not ever be green until bug crbug.com/686974 is fixed.
BUG= 686974
Change-Id: I80ab3d66788c454bb6d1ee273575203f59b9ccc9
Reviewed-on: https://chromium-review.googlesource.com/439367
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: David Tu <dtu@chromium.org>
[modify] https://crrev.com/3eeeee0a7f1eddcddcd127c995341347f5759ef5/scripts/slave/gatekeeper.json
,
Feb 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/e06832c0df218c8b07c55855435c58904b139d3b
commit e06832c0df218c8b07c55855435c58904b139d3b
Author: Tim 'mithro' Ansell <tansell@chromium.org>
Date: Wed Feb 08 05:23:48 2017
recipe_modules: Fix swarm_hashes generation.
Previously this code failed in two ways:
* When using multiple exparchive targets, swarm_hashes only contained the hash from the last exparchive command.
* When exparchive and batcharchive isolation were used together the swarm_hashes only contained the hashes from the batcharchive command.
swarm_hashes needs to contain the swarm_hashes results from all the isolation steps and now does. Also adding tests which prove these cases no longer fail.
BUG= 524758 , 686974
Change-Id: I2189c3923aa8c9d6d611ad3141393ecf31d5468d
Reviewed-on: https://chromium-review.googlesource.com/436324
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Michael McGreevy <mcgreevy@chromium.org>
Commit-Queue: Michael McGreevy <mcgreevy@chromium.org>
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-batch.json
[modify] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-batch-emiss.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi-miss.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-miss.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-batch-bmiss.json
[modify] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.py
[modify] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/api.py
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi.json
,
Feb 13 2017
,
Feb 14 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/bb5af838e6d49c226f0c5317966368e9e4924f22
commit bb5af838e6d49c226f0c5317966368e9e4924f22
Author: Stephen Martinis <martiniss@google.com>
Date: Tue Feb 14 21:14:03 2017
Use exparchive for chromium perf fyi
This should give us performance improvements for low end machines; currently it can take about 2 minutes to successfully set up the bot, but exparchive should improve this time.
BUG= 686974
Change-Id: Iebbc62ca2a3f218d5d08b7eef51026229004ce47
Reviewed-on: https://chromium-review.googlesource.com/442724
Reviewed-by: David Tu <dtu@chromium.org>
Reviewed-by: Tim 'mithro' Ansell <tansell@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipe_modules/chromium_tests/chromium_perf_fyi.py
[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipe_modules/chromium_tests/api.py
[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipes/chromium.expected/full_chromium_perf_fyi_Win_Builder_FYI.json
[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipe_modules/chromium_tests/chromium_perf.py
,
Feb 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/c6eaa22d628834eb8ab026d5396e92bda2126241
commit c6eaa22d628834eb8ab026d5396e92bda2126241
Author: Tim 'mithro' Ansell <tansell@chromium.org>
Date: Wed Feb 15 01:16:21 2017
recipe_modules/isolate: Fix always_use_exparchive.
Previously when using always_use_exparchive it would still try and run batcharchive with no targets, which fails. Now skip batcharchive if there are no batcharchive targets.
BUG= 686974
Change-Id: I0c291aa426f8e96e4ac1c8d17918b6e8de9bb465
Reviewed-on: https://chromium-review.googlesource.com/442904
Commit-Queue: Tim 'mithro' Ansell <tansell@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi-miss.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.py
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive-miss.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipes/chromium.expected/full_chromium_perf_fyi_Win_Builder_FYI.json
[add] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/always-use-exparchive.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/api.py
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi.json
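For readers who don't want to open the CL, the behaviour described in that commit message can be sketched roughly as follows. This is not the code from the isolate recipe module's api.py; the function name, the run_step callable, and the fake hashes are all hypothetical and exist only to illustrate "skip an archive step that has no targets, and merge the hashes from every step that did run".

```python
def isolate_targets(exparchive_targets, batcharchive_targets, run_step):
    """Run each archive step only if it has targets, then merge the hashes.

    run_step is a hypothetical callable: (command, targets) -> {target: hash}.
    """
    swarm_hashes = {}
    for command, targets in (("exparchive", exparchive_targets),
                             ("batcharchive", batcharchive_targets)):
        if not targets:
            continue  # an archive command with no targets would fail
        # Merge rather than overwrite, so no step's results are lost.
        swarm_hashes.update(run_step(command, targets))
    return swarm_hashes


# Usage with a fake step that records which commands were invoked.
calls = []

def fake_step(command, targets):
    calls.append(command)
    return {t: "hash-of-" + t for t in targets}

hashes = isolate_targets(["cc_perftests", "media_perftests"], [], fake_step)
print(calls, hashes)
```

With every target routed to exparchive, the batcharchive step is never invoked, which is the case the fix addresses.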
,
Feb 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/789d42c57bdcb81e837444cf35e334520436683d
commit 789d42c57bdcb81e837444cf35e334520436683d
Author: Tim 'mithro' Ansell <tansell@chromium.org>
Date: Wed Feb 15 05:20:29 2017
recipe_modules/chromium_perf: Enable exparchive for Mac.
BUG= 686974
Change-Id: I83d83ab2a1145e43aedfcfe55950b0876cfd15ec
Reviewed-on: https://chromium-review.googlesource.com/443024
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Tim 'mithro' Ansell <tansell@chromium.org>
[modify] https://crrev.com/789d42c57bdcb81e837444cf35e334520436683d/scripts/slave/recipes/chromium.expected/full_chromium_perf_Mac_Builder.json
[modify] https://crrev.com/789d42c57bdcb81e837444cf35e334520436683d/scripts/slave/recipe_modules/chromium_tests/chromium_perf.py
,
Feb 15 2017
https://build.chromium.org/p/chromium.perf/builders/Mac%20Mini%208GB%2010.12%20Perf/builds/1304 should be the first build which has the exparchive isolates. I'll look at it tomorrow, and see if we get a performance improvement (someone else can too). That link isn't live right now, but that build number should run later tonight.
,
Feb 15 2017
And, the build is good! https://chromium-swarm.appspot.com/task?id=3459fd71a87d1e10&refresh=10&show_raw=1 is an example task triggered by that build. The overhead on it is ~30 seconds, which is much better than the previous 2.5 minutes. Yay!
,
Feb 15 2017
\o/
,
Feb 15 2017
Great to see this actually worked! However, the task run time is only 5s. Thus, while going from 2.5 minutes to 30 seconds is a great improvement, it is still a huge amount of overhead. While I believe we could get the isolate down to as little as 10 seconds (with further optimisation that we haven't yet committed to doing), that would still be more than 100% overhead! You should definitely do further work to combine tests into a single task. Also look at increasing your swarming pool size; 5 machines is a pretty small pool...
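A rough sketch of why combining tests into one task helps. It uses the ~30s overhead and ~5s run time quoted for the example task, and it assumes (optimistically) that every test takes about 5s and that the per-task overhead does not grow as more tests share a task. The function name is made up for illustration.

```python
def overhead_ratio(overhead_s, run_s, tests_per_task):
    """Overhead divided by useful run time when tests share one swarming task."""
    return overhead_s / (run_s * tests_per_task)


# ~30s overhead and ~5s run time, the figures quoted for the example task.
for n in (1, 10, 60):
    ratio = overhead_ratio(30, 5, n)
    print(f"{n:>2} tests per task: overhead is {ratio:.0%} of run time")
```

Under those assumptions the overhead falls from 600% of run time with one test per task to 10% when all ~60 tests share a single task.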
,
Feb 15 2017
BTW, your cycle time previously timed out at ~7 hours. Now you are succeeding at 6.2 hours, so you don't have a huge amount of buffer here.
,
Feb 15 2017
This is great news! Nice work!
,
Feb 15 2017
Yeah, there are known issues with how we're scheduling everything on swarming. This helps a lot for now. I have plans to deal with these issues that you've mentioned.
,
Mar 24 2017
There are follow-ups planned around the suggestions in #32, so closing this bug.
Comment 1 by martiniss@chromium.org, Jan 31 2017
Components: Infra>Client>Perf
Labels: -Pri-3 Performance-Sheriff-BotHealth Pri-1