Issue 649042

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug




5000+ infra failures for swarmed mac 10.10 tests in the last week

Project Member Reported by estaab@chromium.org, Sep 21 2016

Issue description

I was looking at recent top infra failures with katthomas@ and saw that 5500 steps matching ".*tests (with patch) on Mac-10.10" have failed in the last week. After inspecting a few (e.g. https://uberchromegw.corp.google.com/i/tryserver.chromium.mac/builders/mac_chromium_10.10_rel_ng/builds/129961), it looks like the swarming tasks are expiring.

maruel@, is this a capacity issue?

cc sergiyb@ since these are resulting in invalid test results, which we were trying to track down earlier.

cc phajdan for CQ SLO concerns.
 
Cc: smut@chromium.org
Components: -Infra>Platform>Swarming Infra>Client>iOS
This is Sana's 50% CQ experiment.

I'd advocate that we either go forward (get more bots and eventually add it to the CQ) or stop the experiment.

Comment 2 by estaab@chromium.org, Sep 21 2016

Cc: -smut@chromium.org
Owner: smut@chromium.org
Status: Assigned (was: Untriaged)
Thanks, that makes sense. Can we at least have the experiment percentage match the existing capacity and scale them together?
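
To make "match the experiment percentage to the existing capacity" concrete, here is a rough sketch of the arithmetic. Everything in it is an assumption for illustration: the function name, the parameters, and the sample numbers are hypothetical and are not taken from the actual Mac 10.10 pool or from cq.cfg.

```python
import math


def max_experiment_percentage(bots, avg_task_minutes, tasks_per_build,
                              builds_per_hour, headroom=0.75):
    """Estimate the largest experiment percentage (0-100) a bot pool can absorb.

    All inputs are hypothetical planning numbers:
      bots             - bots available to the experiment
      avg_task_minutes - average swarming task duration on one bot
      tasks_per_build  - swarming tasks triggered by one try build
      builds_per_hour  - CQ attempts per hour if the experiment ran at 100%
      headroom         - fraction of raw capacity we are willing to use
    """
    # Tasks the pool can complete per hour, leaving some slack so that
    # queued tasks do not sit long enough to expire.
    capacity_per_hour = bots * (60.0 / avg_task_minutes) * headroom
    # Tasks per hour that would be requested at a 100% experiment.
    demand_at_full = builds_per_hour * tasks_per_build
    if demand_at_full <= 0:
        return 100
    pct = math.floor(100.0 * capacity_per_hour / demand_at_full)
    return max(0, min(100, pct))


if __name__ == "__main__":
    # Hypothetical pool: 20 bots, 30-minute tasks, 10 tasks per build,
    # 40 CQ attempts per hour at full rollout.
    print(max_experiment_percentage(20, 30, 10, 40))
```

Scaling the two together would then mean recomputing this number whenever bots are added to or removed from the pool.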

Comment 3 by s...@google.com, Sep 21 2016

Cc: -phajdan@google.com smut@chromium.org
Owner: phajdan.jr@chromium.org
We should get more bots and add this to CQ. Can you recommend capacity requirements here?
IMHO we should track infra failures affecting CQ production builders first and only then look at experimental and git-cl-try builds. There is a query that I wrote earlier today to find all infra flakes in the last month: https://plx.corp.google.com/script/#a=qo%7Ci=google%253A%253Ascript_90._75e2b5_7f29_4087_a09d_662b4da37318. It uses the completed_builds table and filters builds using the `category` field.
Issue 498330 has been merged into this issue.
Is there a link to some information about this experiment? It's not clear to me from this issue how it is related.

Comment 7 by estaab@chromium.org, Oct 11 2016

Builders with an experiment percentage don't block CLs from landing, so they can't actually contribute to false CQ rejections and are less interesting to us. The builder is defined here:
https://cs.chromium.org/chromium/src/infra/config/cq.cfg?rcl=0&l=73&q=mac_chromium_10.10_rel_ng

It looks like CATEGORY_CQ is an appropriate filter for measuring flakes causing false rejections since there's also a CATEGORY_CQ_EXPERIMENTAL.
We already filter by CATEGORY_CQ; see the query in #4. It has:

build.category = 'CATEGORY_CQ' AND
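
For readers without access to the query, the same category filter can be sketched in Python. This is only an illustration: the record fields (`category`, `result`, `builder`) and the `INFRA_FAILURE` result value are assumptions, not the real completed_builds schema.

```python
from collections import Counter

CATEGORY_CQ = "CATEGORY_CQ"


def count_infra_failures(builds, category=CATEGORY_CQ):
    """Count infra failures per builder, keeping only one build category.

    `builds` is an iterable of dicts with hypothetical keys
    "category", "result" and "builder".
    """
    counts = Counter()
    for build in builds:
        # Skip experimental and manually triggered builds; only builds in
        # the requested category can cause false CQ rejections.
        if build.get("category") != category:
            continue
        if build.get("result") == "INFRA_FAILURE":
            counts[build["builder"]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"builder": "mac_rel", "category": "CATEGORY_CQ",
         "result": "INFRA_FAILURE"},
        {"builder": "mac_chromium_10.10_rel_ng",
         "category": "CATEGORY_CQ_EXPERIMENTAL", "result": "INFRA_FAILURE"},
        {"builder": "mac_rel", "category": "CATEGORY_CQ", "result": "SUCCESS"},
    ]
    # The experimental builder is excluded by the category filter.
    print(dict(count_infra_failures(sample)))
```

With a filter like this, expired swarming tasks on an experimental builder are excluded from the count, which is why they do not show up as false CQ rejections.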
Can this bug be closed?
Labels: to-review
Project Member

Comment 11 by bugdroid1@chromium.org, Oct 28 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/86e29baab34cd4e0bf09d17f043e57d8f259c25b

commit 86e29baab34cd4e0bf09d17f043e57d8f259c25b
Author: jam <jam@chromium.org>
Date: Fri Oct 28 22:06:05 2016

Scale back Mac 10.10 swarming experiment to 10% instead of 50%.

There aren't enough VMs for it and it's taking resources from the main waterfall.

BUG=653677, 649042 
TBR=smut@chromium.org

Review-Url: https://codereview.chromium.org/2465523002
Cr-Commit-Position: refs/heads/master@{#428510}

[modify] https://crrev.com/86e29baab34cd4e0bf09d17f043e57d8f259c25b/infra/config/cq.cfg

Labels: -Pri-2 Pri-1
Status: Fixed (was: Assigned)
Katie, you're already filtering infra failures using the CATEGORY_CQ build category, right? In that case we can close this.
Yup
Labels: Infra-Failures
Labels: Hotlist-Infra-Failures
