
Issue 660713

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 665159
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




A bunch of tests incorrectly marked flaky because of swarming failures

Project Member Reported by chromium...@appspot.gserviceaccount.com, Oct 30 2016

Issue description

"PolicyPrefIndicatorTestInstance/PolicyPrefIndicatorTest.CheckPolicyIndicators/22" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 4 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyWwsSBUZsYWtlIlBQb2xpY3lQcmVmSW5kaWNhdG9yVGVzdEluc3RhbmNlL1BvbGljeVByZWZJbmRpY2F0b3JUZXN0LkNoZWNrUG9saWN5SW5kaWNhdG9ycy8yMgw.

Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs
 
Components: Infra>Platform>Swarming Infra
Labels: -Sheriff-Chromium Infra-Troopers
Summary: "PolicyPrefIndicatorTestInstance/PolicyPrefIndicatorTest.CheckPolicyIndicators/22" is flaky, probably because of swarming failures (was: "PolicyPrefIndicatorTestInstance/PolicyPrefIndicatorTest.CheckPolicyIndicators/22" is flaky)
See https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/305492 for instance. If I look in the logs for the test, they actually succeed, but they're probably marked as failed because sharding failed.

If I'm right, it's pretty bad that tests get handed to the sheriff like this when they're not actually failing. A natural action to take is to just disable+file bug or look in vain for culprits, but that's the wrong action if it's an infra failure. Maybe the flakes tool should not file bugs if it detects swarming or infra errors nearby?...
Summary: A bunch of tests incorrectley marked flaky because of swarming failures (was: "PolicyPrefIndicatorTestInstance/PolicyPrefIndicatorTest.CheckPolicyIndicators/22" is flaky, probably because of swarming failures)
Issue 660712 has been merged into this issue.
Issue 660710 has been merged into this issue.
Issue 660691 has been merged into this issue.
Summary: A bunch of tests incorrectly marked flaky because of swarming failures (was: A bunch of tests incorrectley marked flaky because of swarming failures)

Comment 7 by hinoka@chromium.org, Oct 31 2016

Cc: serg...@chromium.org
Labels: -Infra-Troopers
Status: Available (was: Untriaged)
Components: -Infra Infra>Flakiness
Cc: katthomas@chromium.org
Owner: katthomas@chromium.org
Cc: -serg...@chromium.org
Re #1: We are aware that some flakes are infra-related, but it's non-trivial to automatically detect this. Since most flakes are caused by tests, we've made a decision to route bugs to sheriffs by default and expect sheriffs to investigate the issue and re-route to troopers if needed. This is also why we need human judgment and do not simply automate disabling tests.

In fact, we report flakes in some known infra steps directly to troopers, e.g. see  issue 594867 . If you think we can further improve automated detection of infra-related flakes, please file a bug describing the suggested approach and add a label Infra>Flakiness>Pipeline to it. Thank you.
Cc: phoglund@chromium.org
Owner: mar...@chromium.org
Status: Assigned (was: Available)
It looks like each of the tests listed as failing also produced "excessive output," but still passed. (I'm assuming that lines like the following indicate the test passed: 
[       OK ] ContentSettingBubbleModelMediaStreamTest.ManageLink (9285 ms))

I'm a little bit confused as to why these tests are producing excess output. @phoglund, who is the owner of these tests? Can we CC that person here to get some insight?

I'm assigning this to MA as owner of swarming. Why are these tests being marked as INFRA FAILURE? Is this the desired outcome? If so, why?
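
A minimal sketch of how the pass/fail status lines quoted above could be pulled out of a test log, to check whether a test marked as failing actually printed an OK line. This is not the chromium-try-flakes or recipe code; the function name `parse_gtest_status` and the regular expression are assumptions based only on the log line format shown in this comment.

```python
import re

# Matches status lines of the form quoted in this thread, e.g.
# "[       OK ] ContentSettingBubbleModelMediaStreamTest.ManageLink (9285 ms)".
# The FAILED alternative is an assumption about the corresponding failure line.
_STATUS_RE = re.compile(r"^\[\s*(OK|FAILED)\s*\]\s+(\S+)\s+\((\d+) ms\)\s*$")


def parse_gtest_status(log_text):
    """Return {test_name: (status, duration_ms)} for each status line found."""
    results = {}
    for line in log_text.splitlines():
        match = _STATUS_RE.match(line.strip())
        if match:
            status, name, duration_ms = match.groups()
            results[name] = (status, int(duration_ms))
    return results
```

Lines that do not match the pattern (test output, summaries without a duration) are ignored, so the result only reflects tests that printed a final status line.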
Cc: -phoglund@chromium.org phajdan.jr@chromium.org
Re #12: I understand, it's a hard problem. Assigning to sheriffs is perhaps the most reasonable thing to do.

Also, I don't "own" browser_tests; I was just the sheriff when this happened. Unfortunately, I don't know who to talk to about general browser_tests problems. Pawel, do you know? I know you've worked with the test launchers and what they print; it appears browser_tests is printing too much data for swarming in this case.
We are now discussing who should own the test launchers, like browser_tests or webkit_tests here: https://groups.google.com/a/google.com/d/msg/chrome-infra/dVpYIDMsH2M/mC0woRPWCQAJ.
@maruel, if these tests are normally fine, I'm guessing they are producing excessive output because they are failing in some way. In that case, these should be reported as red, correct?
Cc: iannucci@chromium.org mar...@chromium.org
Owner: ----
Status: Available (was: Assigned)
https://chromium-swarm.appspot.com/tasklist?f=buildername%3Alinux_chromium_chromeos_rel_ng&f=buildnumber%3A305492

I looked at each of the individual failures, and each generated a JSON file of around 100MB. The files were successfully stored, so Swarming and Isolate worked fine.

I "suspect" it is the recipes that choked trying to load all these JSON files at once and failed. I could be wrong, but still, the tasks ran successfully (from Swarming's perspective).

So yes, they should have been reported as normal failure and the fact that they are reported as infrastructure failure is a bug.
Owner: iannucci@chromium.org
Status: Assigned (was: Available)
Thanks @maruel!

Assigned to @iannucci for recipe expertise. 
Labels: Infra-Failures
If the recipe author requested that the json file be loaded with `api.json.output`, then yes, it will attempt to read the json file into the recipe. If those JSON files are enormous, then I can see it overwhelming the recipe engine process.

I would recommend changing the recipe to not load multiple 100MB JSON files into memory, but I'm not familiar with the recipes in question.

The obvious solution, of course, is to have the test harness spit out a smaller summary json document which only has the information needed by the recipe.

If that's for some reason impossible, a second thing to do would be to immediately trim the document by implementing a custom Placeholder; it would read the document, and then trim it down to just the details the recipe needs before retaining it. I would be willing to augment json.output() to take a trim function that could be used to implement that.
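
A rough sketch of what such a trim function might look like, written as standalone Python rather than against the recipe engine. The `trim_results` name, the `keep_keys` parameter, and the `{"tests": [...]}` document layout are all hypothetical; the real test-results schema and any eventual `json.output()` hook are not shown in this thread.

```python
import json


def trim_results(raw_text, keep_keys=("name", "status")):
    """Parse a results document and keep only the fields a caller needs.

    Assumes a hypothetical layout: {"tests": [{"name": ..., "status": ...,
    "output": ...}, ...]}. Everything outside keep_keys is dropped.
    """
    doc = json.loads(raw_text)
    trimmed = []
    for entry in doc.get("tests", []):
        # Drop bulky fields (for example captured output) by copying
        # only the whitelisted keys that are present.
        trimmed.append({key: entry[key] for key in keep_keys if key in entry})
    return {"tests": trimmed}
```

Note that this still parses the whole document once, so it only limits what is retained afterwards, not the peak cost of reading it.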

If simply reading and parsing the document is too much work for the recipe engine (and we can't emit less data), then we'd have to investigate adding support for an external tool, such as jq, to do a streaming filter of the json while reading it from disk before the recipe engine ever sees it. This would be a lot more work and would add a binary dependency to the recipe engine runtime (something I'm planning to support, but am not yet working on).
Oh, and while we're at it, I would probably recommend making api.raw_io (and thus api.json) hard-fail when being asked to handle documents > 256KB.
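
For illustration, a hard size limit like this could be sketched as below. This is plain Python, not the actual `api.raw_io` implementation; `read_capped`, `DocumentTooLargeError`, and `MAX_DOC_BYTES` are made-up names, and only the 256KB threshold comes from the comment above.

```python
import os

# 256KB limit suggested in the comment above.
MAX_DOC_BYTES = 256 * 1024


class DocumentTooLargeError(ValueError):
    """Raised when a file exceeds the allowed size (hypothetical error type)."""


def read_capped(path, max_bytes=MAX_DOC_BYTES):
    """Read a file fully, but fail before reading if it is larger than max_bytes."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise DocumentTooLargeError(
            "%s is %d bytes, limit is %d" % (path, size, max_bytes))
    with open(path, "rb") as f:
        return f.read()
```

Checking the size on disk first means an oversized document is rejected without ever being loaded into memory.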

"Doc! It hurts when I do this!"
"... then don't do that."
Thanks @iannucci! Do you know someone who is familiar with the recipe who would be a good fit to take this on?
Mergedinto: 665159
Status: Duplicate (was: Assigned)
Cc: mcgreevy@chromium.org djd@chromium.org
Issue 665159 has been merged into this issue.
Status: Available (was: Duplicate)
I think we actually duped those two bugs into each other
Heh, ok. Either way!
Owner: dpranke@chromium.org
Dirk, could you figure out priority/owner and such for this? I'm out these next few weeks, so I don't want to be holding it during that time anyhow.
(oh and see comments #21, #22 for diagnosis/proposed solution(s))
Yup, assigning to me is fine.
Status: Assigned (was: Available)
There is an owner on this bug, but the status was not "Assigned" or "Started". Fixing. If you do not own this bug, please remove yourself as the owner and make the status "Available".
Status: Duplicate (was: Assigned)
Labels: Hotlist-Infra-Failures
