
Issue 686974

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug

Blocked on:
issue 689735

Blocking:
issue 691582




Mac mini jobs expiring

Project Member Reported by martiniss@chromium.org, Jan 31 2017

Issue description

Jobs on the mac mini builder are expiring. https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Mini%208GB%2010.12%20Perf/

This appears to be because the total amount of time required to run the test suites is longer than 6 hours. 

I'm investigating why tests are taking longer than they should be. 
 
Cc: sullivan@chromium.org
Components: Infra>Client>Perf
Labels: -Pri-3 Performance-Sheriff-BotHealth Pri-1
Labels, cc-ing people.
Cc: mar...@chromium.org
I have some data (googler only) here: https://docs.google.com/spreadsheets/d/1Fx_RCUk61ZnGPmM6dUgunVzfuNQL-r5Prog3q-8KW8M/edit#gid=2071277789

Compares test timings on this bot vs a comparable Mac 10.12 Perf bot. 

Also, I just realized there is a probable culprit already. Swarming overhead.

Compare https://chromium-swarm.appspot.com/task?id=340838a43487e110&refresh=10&show_raw=1 (mini) to https://chromium-swarm.appspot.com/task?id=3408db4def0fbd10&refresh=10 (regular). The first task has an overhead of about 2 minutes. The second task has an overhead of about 2 seconds.

This time adds up. 2 minutes *  ~35 enabled tests per bot is about an hour of overhead. This explains why we're getting timeouts.

We need to fix this. Couple possible solutions:

* Make tests run faster
* Make swarming overhead smaller
* Increase swarming timeout
* Remove this builder
Well, actually, it's about 60 tests per bot, which is 2 hours of overhead.
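The overhead estimate above can be reproduced with a short calculation. This is only a sketch using the rough figures quoted in this thread (about 2 minutes versus about 2 seconds of overhead per task, about 60 tests per bot); the function name is made up for illustration.

```python
def total_overhead_hours(per_task_overhead_s, task_count):
    """Total swarming overhead for one bot cycle, in hours."""
    return per_task_overhead_s * task_count / 3600.0


# Approximate figures from the comments above, not measured values.
mini_hours = total_overhead_hours(120, 60)    # Mac mini: ~2 minutes per task
regular_hours = total_overhead_hours(2, 60)   # regular bot: ~2 seconds per task

print("mini: %.2f h, regular: %.3f h" % (mini_hours, regular_hours))
```

With a cycle budget of roughly 6 hours, the mini spends about a third of it on overhead alone.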
Cc: benhenry@chromium.org dpranke@chromium.org
The background here is that we had super old Mac Minis with spinning disks. They were removed because they had too much overhead, with the hope that replacing them with newer spinning disk minis would give us the perf tests we need but with less disk overhead.

Looks like these still have too much overhead. I'm pretty clueless about the specifics of what causes the overhead: Dirk, will your plan to move telemetry tests into one step impact it?

Comment 5 by mar...@chromium.org, Jan 31 2017

Cc: djd@chromium.org mcgreevy@chromium.org tansell@chromium.org
+ layout-tests-on-swarming folks. The rationale is that it's going to hit a similar issue.

In practice, I *think* that the best course of action is to migrate to symlinks on platforms where symlinks are faster than hardlinks.

I did a partial implementation last summer in run_isolated.py but it's not complete yet, so someone would have to take it from here and add complete support:
- bool flag at task request, push the flag down to the bot
- make run_isolated.py symlink work on all relevant platforms

I'm actually confident that this can be completed in a reasonable amount of time.
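For readers unfamiliar with the distinction, here is a minimal sketch of exposing a cached file either as a hardlink or as a symlink. It is not the run_isolated.py implementation; `materialize` and `demo` are hypothetical names, and the sketch assumes a POSIX platform (creating symlinks on Windows may need extra privileges).

```python
import os
import tempfile


def materialize(cache_path, dest_path, use_symlink):
    """Expose a cached file at dest_path without copying its contents."""
    if use_symlink:
        # dest_path becomes a pointer to cache_path.
        os.symlink(cache_path, dest_path)
    else:
        # dest_path becomes a second directory entry for the same inode.
        os.link(cache_path, dest_path)


def demo():
    with tempfile.TemporaryDirectory() as root:
        cached = os.path.join(root, "cache_entry")
        with open(cached, "w") as f:
            f.write("payload")
        hard = os.path.join(root, "hard")
        soft = os.path.join(root, "soft")
        materialize(cached, hard, use_symlink=False)
        materialize(cached, soft, use_symlink=True)
        with open(hard) as fh, open(soft) as fs:
            return fh.read(), fs.read(), os.path.islink(hard), os.path.islink(soft)


print(demo())
```

Both paths read back the same content; which one is cheaper to create depends on the filesystem, which is the point of the proposal above.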

Comment 6 by mar...@chromium.org, Jan 31 2017

Actually, Tim reminded me that their work on exparchive is likely going to achieve the desired performance improvement.

Comment 7 by tansell@google.com, Jan 31 2017

I'm just downloading this isolate now, will re-upload it using the exparchive and rerun the command to see if it changes the performance much.
I just opened up access to the spreadsheet. Also gathering more data today. 

Comment 9 by tansell@google.com, Jan 31 2017

https://chromium-swarm.appspot.com/task?id=340f1f1e04dd6310&refresh=10&show_raw=1

Non-scientific test - with exparchive, overhead was 22 seconds from a cold cache...


Comment 10 by tansell@google.com, Jan 31 2017

It looks like on a hot cache things are even better;

https://chromium-swarm.appspot.com/task?id=340f4ca8dcad5f10&refresh=10&show_raw=1

Total Overhead	10s
Downloading Inputs From Isolate	7s

Comment 11 by tansell@google.com, Jan 31 2017

So it looks like exparchive makes things significantly better. Here are 5 runs using an isolated generated with exparchive:

---
https://chromium-swarm.appspot.com/task?id=340f500c1b435d10&refresh=10&show_raw=1
Total Overhead	12s
Downloading Inputs From Isolate	8s
---
https://chromium-swarm.appspot.com/task?id=340f502afd973810&refresh=10&show_raw=1
Total Overhead	35s
Downloading Inputs From Isolate	6s
Uploading Outputs To Isolate	0s
---
https://chromium-swarm.appspot.com/task?id=340f50566b6fab10&refresh=10&show_raw=1
Total Overhead	17s
Downloading Inputs From Isolate	10s
Uploading Outputs To Isolate	0s
---
https://chromium-swarm.appspot.com/task?id=340f507e5fa78610&refresh=10&show_raw=1
Total Overhead	29s
Downloading Inputs From Isolate	5s
Uploading Outputs To Isolate	0s
---
https://chromium-swarm.appspot.com/task?id=340f50a57cf8de10&refresh=10&show_raw=1
Total Overhead	13s
Downloading Inputs From Isolate	6s
Uploading Outputs To Isolate	0s
---

Looks like in the "hot" case, the overhead drops to ~15 seconds and in the "cold" case it ends up being ~35 seconds.
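As a rough cross-check, the five runs listed above can be summarised as follows. The numbers are copied from this comment; which runs count as "hot" or "cold" is not recorded, so the sketch only reports aggregate statistics.

```python
import statistics

# Seconds, copied from the five exparchive runs listed above.
total_overhead_s = [12, 35, 17, 29, 13]
download_s = [8, 6, 10, 5, 6]

summary = {
    "overhead_min": min(total_overhead_s),
    "overhead_median": statistics.median(total_overhead_s),
    "overhead_mean": statistics.mean(total_overhead_s),
    "overhead_max": max(total_overhead_s),
    "download_mean": statistics.mean(download_s),
}

for name, value in sorted(summary.items()):
    print("%s: %s" % (name, value))
```

Even the slowest run is well under the roughly 2 minutes seen with the existing isolates.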

That sounds great. Can we start using exparchive? I'm not familiar with what that actually is...
I'm looking for the easiest way to make this possible. The options seem to be;
 * Change isolate binary to allow batcharchive to use exparchive in some way.
 * Change recipes to use exparchive rather than batcharchive (probably via a flag).

Digging around the perf recipes, they seem fairly different to the trybots/waterfall?

It looks like the isolation of the perf tests happens at a totally different location to the swarming triggers. See https://build.chromium.org/p/chromium.perf/builders/Linux%20Builder/builds/91678 for example.

An example of the isolate call is https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FLinux_Builder%2F91678%2F%2B%2Frecipes%2Fsteps%2Fisolate_tests%2F0%2Fstdout

The isolate command looks like this;
-----
python -u /b/rr/tmpqwTq6S/rw/checkout/scripts/slave/recipe_modules/isolate/resources/isolate.py \
	/b/c/b/Linux_Builder/src/tools/swarming_client \
	batcharchive \
	--dump-json /tmp/tmpPT3KXf.json \
	--isolate-server https://isolateserver.appspot.com \
	--verbose /b/c/b/Linux_Builder/src/out/Release/load_library_perf_tests.isolated.gen.json \
	\
	/b/c/b/Linux_Builder/src/out/Release/cc_perftests.isolated.gen.json \
	/b/c/b/Linux_Builder/src/out/Release/tracing_perftests.isolated.gen.json \
	/b/c/b/Linux_Builder/src/out/Release/media_perftests.isolated.gen.json \
	/b/c/b/Linux_Builder/src/out/Release/telemetry_perf_tests.isolated.gen.json
-----

Open questions I have;

 * Is it normal for the perf bots to only batcharchive 5-6 test suites? Or does it normally isolate 30+ like the try bots and this run was abnormal?

 * How important is getting this done?

 * Who owns this stuff?


I can own this, I just don't know what to do on the isolate side of things. I can do the recipe changes. Although, I will have to admit I'm not sure where exactly the inputs to all the triggered tasks get isolated.

Is all we need to do change the word "batcharchive" to "exparchive" in the call to isolate.py? Or is it more complicated?
Sorry martiniss, what I actually meant was, 

 * Who owns the perf recipes, and whom can I send code reviews to?
   (And who is able to understand the consequences of the changes?)

Ah.

I generally own the perf recipes. Some of the code is in chromium_tests, which has other owners than me. The swarming bots (which includes all the Mac bots) are all on the chromium recipe.
Replying to sullivan@'s question in comment #4, yes, consolidating all of these things into a single big step (or a small number of them) would likely eliminate the per-step swarming overhead concerns.
Status: Assigned (was: Started)
FTR, I chatted with tansell@ in person. He's doing some changes to the isolate recipe module. Once those changes land, we should be able to start using the exparchive command!
Cc: martiniss@chromium.org
 Issue 687333  has been merged into this issue.
FYI, once https://chromium-review.googlesource.com/c/436324 lands, the perf recipes just need to be updated to pass always_use_exparchive=True to isolate_tests and you should get the performance improvement. (This will also be useful in helping validate if exparchive can replace batcharchive everywhere.)

However, you will **also** want to look into using a single swarming task for a whole test suite rather than having a single test per swarming task.
Blockedon: 689735
Project Member

Comment 22 by bugdroid1@chromium.org, Feb 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/3eeeee0a7f1eddcddcd127c995341347f5759ef5

commit 3eeeee0a7f1eddcddcd127c995341347f5759ef5
Author: Stephen Martinis <martiniss@google.com>
Date: Wed Feb 08 01:10:04 2017

chromium.perf: Ignore Mac Mini bot

The bot will not ever be green until bug  crbug.com/686974  is fixed.

BUG= 686974 

Change-Id: I80ab3d66788c454bb6d1ee273575203f59b9ccc9
Reviewed-on: https://chromium-review.googlesource.com/439367
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: David Tu <dtu@chromium.org>

[modify] https://crrev.com/3eeeee0a7f1eddcddcd127c995341347f5759ef5/scripts/slave/gatekeeper.json

Project Member

Comment 23 by bugdroid1@chromium.org, Feb 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/e06832c0df218c8b07c55855435c58904b139d3b

commit e06832c0df218c8b07c55855435c58904b139d3b
Author: Tim 'mithro' Ansell <tansell@chromium.org>
Date: Wed Feb 08 05:23:48 2017

recipe_modules: Fix swarm_hashes generation.

Previously this code failed in two ways;

 * When using multiple exparchive targets, swarm_hashes only contained
   the hash from the last exparchive command.

 * When exparchive and batcharchive isolation were used together the
   swarm_hashes only contained the hashes from the batcharchive command.

swarm_hashes needs to contain the swarm_hashes results from all the
isolation steps and now does. Also adding tests which prove these cases
no longer fail.

BUG= 524758 , 686974 

Change-Id: I2189c3923aa8c9d6d611ad3141393ecf31d5468d
Reviewed-on: https://chromium-review.googlesource.com/436324
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Michael McGreevy <mcgreevy@chromium.org>
Commit-Queue: Michael McGreevy <mcgreevy@chromium.org>

[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-batch.json
[modify] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-batch-emiss.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi-miss.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-miss.json
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-batch-bmiss.json
[modify] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.py
[modify] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/api.py
[add] https://crrev.com/e06832c0df218c8b07c55855435c58904b139d3b/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi.json
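The swarm_hashes fix described in this commit message amounts to accumulating the results of every isolation step instead of keeping only the last one. The sketch below illustrates that idea only; it is not the code in api.py, and the function name, target names and hash strings are hypothetical.

```python
def merge_swarm_hashes(step_results):
    """Combine {target: isolated_hash} dicts from all isolation steps."""
    merged = {}
    for result in step_results:
        # update() adds to what earlier steps produced rather than
        # replacing the whole mapping.
        merged.update(result)
    return merged


# Hypothetical outputs: two exparchive steps and one batcharchive step.
steps = [
    {"telemetry_perf_tests": "hash-a"},
    {"cc_perftests": "hash-b"},
    {"media_perftests": "hash-c", "tracing_perftests": "hash-d"},
]
swarm_hashes = merge_swarm_hashes(steps)
print(sorted(swarm_hashes))
```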

Blocking: 691582
Project Member

Comment 25 by bugdroid1@chromium.org, Feb 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/bb5af838e6d49c226f0c5317966368e9e4924f22

commit bb5af838e6d49c226f0c5317966368e9e4924f22
Author: Stephen Martinis <martiniss@google.com>
Date: Tue Feb 14 21:14:03 2017

Use exparchive for chromium perf fyi

This should give us performance improvements for low end machines;
currently it can take about 2 minutes to successfully set up the bot,
but exparchive should improve this time.

BUG= 686974 

Change-Id: Iebbc62ca2a3f218d5d08b7eef51026229004ce47
Reviewed-on: https://chromium-review.googlesource.com/442724
Reviewed-by: David Tu <dtu@chromium.org>
Reviewed-by: Tim 'mithro' Ansell <tansell@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>

[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipe_modules/chromium_tests/chromium_perf_fyi.py
[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipe_modules/chromium_tests/api.py
[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipes/chromium.expected/full_chromium_perf_fyi_Win_Builder_FYI.json
[modify] https://crrev.com/bb5af838e6d49c226f0c5317966368e9e4924f22/scripts/slave/recipe_modules/chromium_tests/chromium_perf.py

Project Member

Comment 27 by bugdroid1@chromium.org, Feb 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/c6eaa22d628834eb8ab026d5396e92bda2126241

commit c6eaa22d628834eb8ab026d5396e92bda2126241
Author: Tim 'mithro' Ansell <tansell@chromium.org>
Date: Wed Feb 15 01:16:21 2017

recipe_modules/isolate: Fix always_use_exparchive.

Previously when using always_use_exparchive it would still try and run
batcharchive with no targets, which fails. Now skip batcharchive if
there are no batcharchive targets.

BUG= 686974 

Change-Id: I0c291aa426f8e96e4ac1c8d17918b6e8de9bb465
Reviewed-on: https://chromium-review.googlesource.com/442904
Commit-Queue: Tim 'mithro' Ansell <tansell@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>

[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi-miss.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.py
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive-miss.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipes/chromium.expected/full_chromium_perf_fyi_Win_Builder_FYI.json
[add] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/always-use-exparchive.json
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/api.py
[modify] https://crrev.com/c6eaa22d628834eb8ab026d5396e92bda2126241/scripts/slave/recipe_modules/isolate/example.expected/exparchive-multi.json
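The behaviour described in this commit message (skip batcharchive when it has no targets) can be illustrated as below. This is a sketch, not the recipe module's code: `plan_isolate_commands` and its arguments are hypothetical, and it assumes one exparchive invocation per target and a single batcharchive invocation for the rest.

```python
def plan_isolate_commands(targets, exparchive_targets, always_use_exparchive=False):
    """Return the (command, targets) pairs to run for the given targets."""
    if always_use_exparchive:
        exp, batch = list(targets), []
    else:
        exp = [t for t in targets if t in exparchive_targets]
        batch = [t for t in targets if t not in exparchive_targets]

    commands = [("exparchive", [t]) for t in exp]
    # Only schedule batcharchive when it actually has something to archive.
    if batch:
        commands.append(("batcharchive", batch))
    return commands


mixed = plan_isolate_commands(["a", "b", "c"], {"a"})
exp_only = plan_isolate_commands(["a", "b", "c"], set(), always_use_exparchive=True)
print(mixed)
print(exp_only)
```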

Project Member

Comment 28 by bugdroid1@chromium.org, Feb 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/789d42c57bdcb81e837444cf35e334520436683d

commit 789d42c57bdcb81e837444cf35e334520436683d
Author: Tim 'mithro' Ansell <tansell@chromium.org>
Date: Wed Feb 15 05:20:29 2017

recipe_modules/chromium_perf: Enable exparchive for Mac.

BUG= 686974 

Change-Id: I83d83ab2a1145e43aedfcfe55950b0876cfd15ec
Reviewed-on: https://chromium-review.googlesource.com/443024
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Tim 'mithro' Ansell <tansell@chromium.org>

[modify] https://crrev.com/789d42c57bdcb81e837444cf35e334520436683d/scripts/slave/recipes/chromium.expected/full_chromium_perf_Mac_Builder.json
[modify] https://crrev.com/789d42c57bdcb81e837444cf35e334520436683d/scripts/slave/recipe_modules/chromium_tests/chromium_perf.py

https://build.chromium.org/p/chromium.perf/builders/Mac%20Mini%208GB%2010.12%20Perf/builds/1304 should be the first build which has the exparchive isolates. I'll look at it tomorrow, and see if we get a performance improvement (someone else can too). That link isn't live right now, but that build number should run later tonight.
And, the build is good! https://chromium-swarm.appspot.com/task?id=3459fd71a87d1e10&refresh=10&show_raw=1 is an example task triggered by that build. The overhead on it is ~30 seconds, which is much better than the previous 2.5 minutes. Yay!
\o/
Great to see this actually worked!

However, the task run time is only 5s.

Thus, while going from 2.5 minutes to 30 seconds is a great improvement, it is still a huge amount of overhead. While I believe we could get the isolate down to as little as 10 seconds (with further optimisation that we haven't yet committed to doing), that would still mean more than 100% overhead!

You should definitely do further work to combine tests into a single task. Also look at increasing your swarming pool size; 5 machines is a pretty small pool...

BTW, your cycle time previously timed out at ~7 hours. Now you are succeeding at 6.2 hours, so you don't have a huge amount of buffer here.
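To put the overhead in proportion to the 5 second task run time mentioned above, here is a small sketch. The inputs are the approximate figures from this thread and the function name is made up for illustration.

```python
def overhead_ratio(overhead_s, run_s):
    """Overhead expressed as a multiple of the useful task run time."""
    return float(overhead_s) / run_s


run_s = 5  # task run time quoted above
ratios = {
    "before (2.5 min)": overhead_ratio(150, run_s),
    "now (30 s)": overhead_ratio(30, run_s),
    "possible (10 s)": overhead_ratio(10, run_s),
}

for label, ratio in ratios.items():
    print("%s: %.0fx the run time" % (label, ratio))
```

Even in the best case the task would spend more time on setup than on the test itself, which is why combining tests into fewer tasks matters.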
This is great news! Nice work!
Yeah, there are known issues with how we're scheduling everything on swarming. This helps a lot for now.

I have plans to deal with these issues that you've mentioned.
Status: Fixed (was: Assigned)
There are follow ups planned around the suggestions in #32 so closing this bug.
