5000+ infra failures for swarmed mac 10.10 tests in the last week
Issue description
I was looking at recent top infra failures with katthomas@ and saw that 5500 steps matching ".*tests (with patch) on Mac-10.10" have failed in the last week. After inspecting a few (e.g. https://uberchromegw.corp.google.com/i/tryserver.chromium.mac/builders/mac_chromium_10.10_rel_ng/builds/129961), it looks like the swarming tasks are expiring. maruel@, is this a capacity issue? cc sergiyb@, since these are resulting in invalid test results, which we were trying to track down earlier. cc phajdan for CQ SLO concerns.
,
Sep 21 2016
Thanks, that makes sense. Can we at least have the experiment percentage match the existing capacity and scale them together?
,
Sep 21 2016
We should get more bots and add this to CQ. Can you recommend capacity requirements here?
,
Sep 23 2016
IMHO we should be tracking infra failures affecting CQ production builders first and only then look at experimental and git-cl-try. Here is a query that I wrote earlier today to find all infra flakes in the last month: https://plx.corp.google.com/script/#a=qo%7Ci=google%253A%253Ascript_90._75e2b5_7f29_4087_a09d_662b4da37318. It uses the completed_builds table and filters builds using the `category` field.
,
Sep 23 2016
Issue 498330 has been merged into this issue.
,
Oct 10 2016
Is there a link to some information about this experiment? It's not clear to me from this issue how it is related.
,
Oct 11 2016
Any builder with an experiment percentage doesn't block CLs from landing, so it can't actually contribute to false CQ rejections and is less interesting to us. The builder is defined here: https://cs.chromium.org/chromium/src/infra/config/cq.cfg?rcl=0&l=73&q=mac_chromium_10.10_rel_ng
It looks like CATEGORY_CQ is an appropriate filter for measuring flakes causing false rejections, since there's also a CATEGORY_CQ_EXPERIMENTAL.
,
Oct 12 2016
We already do filter by CATEGORY_CQ, see query in #4. It has build.category = 'CATEGORY_CQ' AND
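For readers without access to the query, the category filter described above can be sketched in Python. This is only an illustration: the record layout, the builder names, and the `result` values below are assumptions, not the schema of the completed_builds table. Only the category names CATEGORY_CQ and CATEGORY_CQ_EXPERIMENTAL come from this thread.

```python
from collections import Counter

# Hypothetical build records; field names, builder names and result values
# are illustrative assumptions, not the real completed_builds schema.
sample_builds = [
    {"builder": "hypothetical_cq_builder", "category": "CATEGORY_CQ",
     "result": "INFRA_FAILURE"},
    {"builder": "hypothetical_experimental_builder",
     "category": "CATEGORY_CQ_EXPERIMENTAL", "result": "INFRA_FAILURE"},
    {"builder": "hypothetical_cq_builder", "category": "CATEGORY_CQ",
     "result": "SUCCESS"},
]


def count_infra_failures(builds, category="CATEGORY_CQ"):
    """Count infra failures per builder, keeping only the given category."""
    return Counter(
        b["builder"]
        for b in builds
        if b["category"] == category and b["result"] == "INFRA_FAILURE"
    )


if __name__ == "__main__":
    # Only blocking CQ builders are counted; experimental ones are skipped.
    print(count_infra_failures(sample_builds))
```

Filtering on the category rather than on builder names means a builder that moves between experimental and blocking status is picked up without changing the filter.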
,
Oct 26 2016
Can this bug be closed?
,
Oct 27 2016
,
Oct 28 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/86e29baab34cd4e0bf09d17f043e57d8f259c25b

commit 86e29baab34cd4e0bf09d17f043e57d8f259c25b
Author: jam <jam@chromium.org>
Date: Fri Oct 28 22:06:05 2016

Scale back Mac 10.10 swarming experiment to 10% instead of 50%.

There aren't enough VMs for it and it's taking resources from the main waterfall.

BUG=653677, 649042
TBR=smut@chromium.org
Review-Url: https://codereview.chromium.org/2465523002
Cr-Commit-Position: refs/heads/master@{#428510}

[modify] https://crrev.com/86e29baab34cd4e0bf09d17f043e57d8f259c25b/infra/config/cq.cfg
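The suggestion earlier in the thread, to make the experiment percentage match the existing capacity, comes down to simple arithmetic. The sketch below is a rough illustration only. The function name, its parameters, and all the numbers in the example are hypothetical and are not taken from the real bot pool or CQ traffic.

```python
def max_experiment_percentage(available_bots, tasks_per_bot_per_hour,
                              cq_attempts_per_hour, tasks_per_attempt):
    """Largest whole experiment percentage the bot pool can absorb.

    All inputs are hypothetical planning figures (non-negative integers).
    """
    # Tasks the pool can run per hour.
    capacity = available_bots * tasks_per_bot_per_hour
    # Tasks that would be triggered per hour at a 100% experiment.
    demand_at_full = cq_attempts_per_hour * tasks_per_attempt
    if demand_at_full <= 0:
        return 100  # no load at all, so any percentage fits
    return min(100, (100 * capacity) // demand_at_full)


if __name__ == "__main__":
    # Made-up numbers: 30 bots, 3 tasks per bot per hour,
    # 100 CQ attempts per hour, 6 swarming tasks per attempt.
    print(max_experiment_percentage(30, 3, 100, 6))
```

A calculation like this ignores bursts and any load from the main waterfall sharing the same pool, so a real setting would need headroom below the computed value.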
,
Nov 10 2016
,
Nov 14 2016
Katie, you're already filtering infra failures using CATEGORY_CQ build category, right? In that case we can close this.
,
Nov 14 2016
Yup
,
Dec 8 2016
,
Jan 25 2017
Comment 1 by phajdan.jr@chromium.org, Sep 21 2016
Components: -Infra>Platform>Swarming Infra>Client>iOS