Issue 724998

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: 2017-05-26
OS: ----
Pri: 3
Type: Bug



Replacing build8-b1 bot in Mac Retina Perf with a new machine.

Project Member Reported by perezju@chromium.org, May 22 2017

Issue description

https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf/

Has lots of expired jobs with "shard #0 expired, not enough capacity" from builds 676 to 682.

Maybe related to issue 675986?
 
Cc: benhenry@chromium.org vhang@chromium.org
This is because lab has taken a bot offline due to its bulging battery.

Stephen/Ben: let's think of a better process for these.
What happens if all shards for a specific config were taken offline? Ideally, the dashboard/recipe could correct for this, right? 

Otherwise, can we put a piece of metadata in the dashboard to let people know about bots going offline? What is acceptable? What information needs to be shared and for whom? 

Do we want everyone who could possibly care that a device is offline to be able to find that out when they wonder about it? If so, we either need to prevent them from wondering about problems or add metadata in the places where people also find failures.
I chatted with Stephen today. Part of our sharding benchmark per story will include dealing with rebalancing when a bot is offline. We will send the design out later.

For this particular bug, it seems we can currently only wait for the lab to put in a replacement device.
Owner: vhang@chromium.org
Summary: Replacing build8-b1 bot in Mac Retina Perf with a new machine. (was: Many expired jobs on Mac Retina Perf bot)
Assigning this bug to Vince since the lab is working on a replacement.
I think the policy should be to turn down all of one config when one bot is bad, because of device affinity. In practice, even though there are about 4 machines running tests for that config, there is virtually one "machine" running tests for that config. We are like some sort of RAID without anything explicit other than sharding.

That said, if we pulled down all devices for one config, where would failures manifest?
I'm not exactly sure what you mean in #5. If we pulled all the devices, the entire bot would fail.

We could stop triggering jobs on the bot, like we did with the Zenbook bots.

Comment 7 by vhang@chromium.org, May 22 2017

Status: Assigned (was: Untriaged)
We have a spare MacBook, but it's not the exact same unit. The one that was pulled offline is an A1398 EMC 2910. The spare is an A1398 EMC 2673.

If you want the exact same unit, I may have to contact our vendor to see if they still have some in stock.

Comment 8 by vhang@chromium.org, May 23 2017

Corp techstop may have a used/old unit.  Will be working with them to get a replacement.
We have 4 bisectors, why not replace with one of those? https://uberchromegw.corp.google.com/i/tryserver.chromium.perf/builders/mac_retina_perf_bisect

Comment 10 by vhang@chromium.org, May 25 2017

Owner: jo...@chromium.org
Replacement will be shipped soon.  t/26829260

Assigning to johnw to help set it up.
Labels: Performance-Sheriff-BotHealth
Talking with martiniss@ offline, it sounds like we might not be able to get a replacement for build8-b1 for another week. Is there any way that we could reshard the benchmarks that were running on build8-b1 onto the other perfbots? Having two weeks of no coverage for these benchmarks is pretty scary - especially for system_health.common_desktop BattOr benchmarks, which provide most of the coverage for power regressions on Mac, given that we don't have BattOrs attached to any configurations besides the Mac Retina Perf.
NextAction: 2017-05-26
To #12: Resharding the benchmarks is a complex operation, and we risk making other benchmarks time out at the 10h limit.

We should prepare more hardware in the future so these swap operations happen quickly, but for now we can rely on other bot configurations to mitigate the risk of coverage loss.
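
To illustrate why resharding can run into the 10h limit, here is a minimal sketch of a greedy rebalance that places each benchmark on the least-loaded remaining bot and refuses to exceed a per-bot time budget. This is not the perf infrastructure's actual resharding code: the function name `reshard`, the bot names, and the runtimes are all hypothetical, and only the 10-hour limit comes from the comment above.

```python
def reshard(runtimes_hours, bots, limit_hours=10.0):
    """Greedily assign benchmarks (longest first) to the least-loaded bot.

    Raises ValueError if a benchmark cannot be placed without pushing a
    bot past limit_hours.
    """
    loads = {bot: 0.0 for bot in bots}
    assignment = {bot: [] for bot in bots}
    # Longest benchmarks first; ties broken by name so the result is stable.
    ordered = sorted(runtimes_hours.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, hours in ordered:
        # Pick the least-loaded bot; ties broken by bot name.
        bot = min(loads, key=lambda b: (loads[b], b))
        if loads[bot] + hours > limit_hours:
            raise ValueError("no bot can take %s within %sh" % (name, limit_hours))
        loads[bot] += hours
        assignment[bot].append(name)
    return assignment


# Made-up runtimes: they fit on four bots but not on three.
runtimes = {"bench_a": 6.0, "bench_b": 6.0, "bench_c": 6.0, "bench_d": 6.0}
four_bots = reshard(runtimes, ["bot-1", "bot-2", "bot-3", "bot-4"])
```

With the same made-up runtimes and only three bots, the fourth benchmark would push a bot to 12 hours, so `reshard` raises instead of returning a plan that would time out.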
The NextAction date has arrived: 2017-05-26
Annie, Ned, and I talked offline and decided that we're going to pull one of the Mac Retina Perf bisectors off and add it to the main waterfall. I talked with martiniss@ and he said that he didn't think this would be too difficult and would be willing to help make it happen.
Note that this is an exceptional case because "Mac Retina Perf" is the only Mac config on which we have BattOr, and the failing bot runs the BattOr benchmarks.

Output of running "./tools/perf/generate_perf_data" about which benchmarks are currently affected:
 Device "build8-b1" is blacklisted. These benchmarks were not scheduled:
 * battor.steady_state
 * battor.steady_state.reference
 * blink_perf.css
 * blink_perf.css.reference
 * blink_perf.events
 * blink_perf.events.reference
 * blink_perf.shadow_dom
 * blink_perf.shadow_dom.reference
 * kraken
 * kraken.reference
 * media.mse_cases
 * media.mse_cases.reference
 * media.tough_video_cases_tbmv2
 * media.tough_video_cases_tbmv2.reference
 * memory.long_running_idle_gmail_tbmv2
 * memory.long_running_idle_gmail_tbmv2.reference
 * page_cycler_v2_site_isolation.basic_oopif
 * page_cycler_v2_site_isolation.basic_oopif.reference
 * performance_browser_tests
 * rasterize_and_record_micro.partial_invalidation
 * rasterize_and_record_micro.partial_invalidation.reference
 * smoothness.desktop_tough_pinch_zoom_cases
 * smoothness.desktop_tough_pinch_zoom_cases.reference
 * smoothness.gpu_rasterization.tough_path_rendering_cases
 * smoothness.gpu_rasterization.tough_path_rendering_cases.reference
 * smoothness.maps
 * smoothness.maps.reference
 * smoothness.tough_animation_cases
 * smoothness.tough_animation_cases.reference
 * smoothness.tough_texture_upload_cases
 * smoothness.tough_texture_upload_cases.reference
 * smoothness.tough_webgl_ad_cases
 * smoothness.tough_webgl_ad_cases.reference
 * start_with_ext.warm.blank_page
 * start_with_ext.warm.blank_page.reference
 * startup.large_profile.cold.blank_page
 * startup.large_profile.cold.blank_page.reference
 * system_health.common_desktop
 * system_health.common_desktop.reference
 * system_health.webview_startup
 * system_health.webview_startup.reference
 * v8.infinite_scroll-classic_tbmv2
 * v8.infinite_scroll-classic_tbmv2.reference
 * v8.infinite_scroll_tbmv2
 * v8.infinite_scroll_tbmv2.reference
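
For readers unfamiliar with device affinity, the following is a rough sketch of how a report like the one above could be derived from a per-device sharding map and a set of blacklisted devices. It is not the implementation in perf_data_generator.py, and the real layout of benchmark_sharding_map.json is not shown in this thread: the dictionary shape, the function name `unscheduled_report`, and the device "hypothetical-bot" are assumptions made for illustration.

```python
def unscheduled_report(sharding_map, blacklisted):
    """Describe the benchmarks pinned to blacklisted devices.

    sharding_map: dict of device name -> list of benchmark names
                  (assumed shape, not the real JSON layout).
    blacklisted:  set of device names that are offline.
    """
    lines = []
    for device in sorted(blacklisted):
        benchmarks = sorted(sharding_map.get(device, []))
        if not benchmarks:
            continue
        lines.append(
            'Device "%s" is blacklisted. These benchmarks were not scheduled:'
            % device
        )
        lines.extend(" * %s" % name for name in benchmarks)
    return "\n".join(lines)


# Toy map: "build8-b1" and its two benchmarks appear in this thread;
# "hypothetical-bot" and its benchmark are invented.
sharding_map = {
    "build8-b1": ["kraken", "battor.steady_state"],
    "hypothetical-bot": ["some.other_benchmark"],
}
report = unscheduled_report(sharding_map, {"build8-b1"})
print(report)
```

Because each benchmark is pinned to exactly one device in this sketch, taking a single device offline removes all coverage for its benchmarks even though the other devices in the config are healthy.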
Project Member

Comment 18 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/95c9cb5e460bdf1a02b7e7f886e982b1df33065e

commit 95c9cb5e460bdf1a02b7e7f886e982b1df33065e
Author: Stephen Martinis <martiniss@chromium.org>
Date: Fri May 26 22:09:17 2017

//tools/perf: Replace broken Mac Retina bot

Replaces the broken Mac retina bot with a bot taken from the
bisect pool.

Bug: 724998
Change-Id: I91443e3427535f3a2e4b6be35ae8f8e00e9a8df4
Reviewed-on: https://chromium-review.googlesource.com/517293
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#475157}
[modify] https://crrev.com/95c9cb5e460bdf1a02b7e7f886e982b1df33065e/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/95c9cb5e460bdf1a02b7e7f886e982b1df33065e/tools/perf/core/benchmark_sharding_map.json
[modify] https://crrev.com/95c9cb5e460bdf1a02b7e7f886e982b1df33065e/tools/perf/core/perf_data_generator.py

Project Member

Comment 19 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/11a37bdbdae285dcbe9fde2c1ac840ed14fe4ef9

commit 11a37bdbdae285dcbe9fde2c1ac840ed14fe4ef9
Author: Stephen Martinis <martiniss@google.com>
Date: Fri May 26 22:39:51 2017

Project Member

Comment 20 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/57a109ea1ae2719fc5fcc9d7306542cbf4c3904c

commit 57a109ea1ae2719fc5fcc9d7306542cbf4c3904c
Author: Stephen Martinis <martiniss@chromium.org>
Date: Fri May 26 22:44:57 2017

Remove old retina bot from tryserver.chromium.perf

It's being used on the main waterfall now.

TBR=dtu

Bug: 724998
Change-Id: I4783c663140bef6040b53cefc00d3d2c29b6ce91
Reviewed-on: https://chromium-review.googlesource.com/517278
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>

[modify] https://crrev.com/57a109ea1ae2719fc5fcc9d7306542cbf4c3904c/masters/master.tryserver.chromium.perf/slaves.cfg

Project Member

Comment 24 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager/+/3da14c21b4f5baa44768881a51691e015940f457

commit 3da14c21b4f5baa44768881a51691e015940f457
Author: Stephen Martinis <martiniss@google.com>
Date: Fri May 26 22:54:34 2017

Comment 25 by jo...@chromium.org, Jun 14 2017

Update regarding the broken build8-b1: the replacement that came in wasn't the same spec, so we could not put it into service. Vince is looking into finding a suitable replacement.

Thanks.