"ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures" is flaky |
||||||||||||||
Issue description"ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures" is flaky. This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label. We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySAsSBUZsYWtlIj1DaGlsZFRyYWNpbmdUZXN0Lk11bHRpcGxlQ2hpbGRJbml0aWF0ZWRNZW1vcnlEdW1wV2l0aEZhaWx1cmVzDA. Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs
,
Oct 17 2016
I see crbug.com/563674 which is similar. CC'ing one of the CR folks to make sure the component is correct and perhaps help triage to the right person.
,
Oct 18 2016
Disabling this test and two others in the same file on Android due to flakiness. primiano@: Assigning to you since you recently edited these tests. Please try find a better owner if necessary.
,
Oct 18 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/356f59fca3ff5b0e1d82d0a85087c9628154682a commit 356f59fca3ff5b0e1d82d0a85087c9628154682a Author: guidou <guidou@chromium.org> Date: Tue Oct 18 16:46:14 2016 Disable flaky tests in ChildTracingTest on Android. BUG= 656729 TBR=jbauman@chromium.org Review-Url: https://codereview.chromium.org/2432443003 Cr-Commit-Position: refs/heads/master@{#425988} [modify] https://crrev.com/356f59fca3ff5b0e1d82d0a85087c9628154682a/components/tracing/child/child_trace_message_filter_browsertest.cc
,
Oct 18 2016
Detected 5 new flakes for test/step "ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySAsSBUZsYWtlIj1DaGlsZFRyYWNpbmdUZXN0Lk11bHRpcGxlQ2hpbGRJbml0aWF0ZWRNZW1vcnlEdW1wV2l0aEZhaWx1cmVzDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
,
Oct 19 2016
,
Oct 19 2016
kraynov, can you help here? Would be great to understand why all of a suddent this became flaky. We didn't touch either the code nor the test in a long while. this sudden flakiness worries me.
,
Oct 19 2016
Looking more, this might be related with the recent refactoring that ssid@ made.
,
Oct 19 2016
,
Nov 3 2016
Ok I have a theory here. The theory is that some previous test execution (possibly telemetry chrome_trace_config_unittest.py) leaves the trace startup trace config file around in /data/local/chrome-trace-config.json. If that file is present chrome will automatically start tracing with differnet options and cause the failure above. Essentially that file is state that should not be leaked from one test to another. kraynov: can you have a try and create startup trace file and see if you get the same failure?
,
Nov 3 2016
Ned/John/Juan: I am not familiar with swarming. What is the best way to ensure that we don't leave the stale trace config file around when running these tests? I imagine what happens here is that those tests generally clean up the chrome-trac-config.json when they execute cleanly, but they can't possibly do that if the device hangs or something crashes.
,
Nov 3 2016
+zhenw, eyaich, dtu: see #10
,
Nov 3 2016
I vaguely recall there is the concept of bot health check in swarming. Maruel may know which is the best way to ensure the bots are cleaned between runs.
,
Nov 3 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ea26a4f05c216cd7dd7d76ea19fc46665c300baa commit ea26a4f05c216cd7dd7d76ea19fc46665c300baa Author: primiano <primiano@chromium.org> Date: Thu Nov 03 15:22:03 2016 tracing: Bail out early in tracing tests when tracing is already enabled A bunch of tracing browsertests have recently become flaky. I strongly suspect this is because some other test leaves the startup tracing file (/data/local/chrome-trace-config.json) around. On Android the presence of that file is enough to initialize the tracing machinery and interfere with tests. Bail out early explicitly in that case. BUG= 657628 , 656729 Review-Url: https://codereview.chromium.org/2479563002 Cr-Commit-Position: refs/heads/master@{#429593} [modify] https://crrev.com/ea26a4f05c216cd7dd7d76ea19fc46665c300baa/content/browser/tracing/memory_tracing_browsertest.cc
,
Nov 3 2016
I think /data/local/chrome-trace-config.json is always cleaned up after each run: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py?l=265 And it is always overridden when starting tracing: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py?l=261
,
Nov 7 2016
Re #15 > I think /data/local/chrome-trace-config.json is always cleaned up after each run: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py?l=265 That won't run if the slave is abruptly terminated killing python. no idea how the swarming runtime works. I don't thing relying on the python's atExit is enough, as python could just be killed from the outside. > And it is always overridden when starting tracing: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py?l=261 Right, but only if running telemetry though. That code is not triggered when running content_browsertests. So I suspect what's happening here is: 1) Telemetry runs on a device 2) The run is abruptly terminated 3) After a while we run content_browsertests on the same device Nothin has cleared the trace config. All the tests running on the device have trace enabled with whatever was the last config in 1). Tracing tests in content_browsertests will fail because of the interference. I don't think this problem can be solved at the telemetry level. You can't safely assume telemetry will cleanup cleanly. And you can't safely assume that we run only telemetry on a device (I think) in these days of swarming. I think this should be dealt with at a swarming level. Swarming should guarantee that the device is in a clean state, regardess of what ran before.
,
Nov 7 2016
Ah, I see your point. I agree.
,
Nov 7 2016
primiano@: The Swarming bot sends a SIGTERM when the timeout occurs. The bot then gives it a grace period (30s) before escalating to SIGKILL.
It's up to the script to handle signals and handle them correctly. So every python script should handle SIGTERM, propagate the signal to children processes as relevant and exit as quickly as possible.
The data must be saved into ${ISOLATED_OUTDIR} to be kept as part of swarming task results.
,
Nov 7 2016
> It's up to the script to handle signals and handle them correctly. there is a number of things that could go wrong there: the python script getting stuck somewhere, python crashing, the AtExit handlers failing. Or more realistically: adb hanging. I challenge everybody to remove a file on the device when the test fails because of adb hangs. In general none of this should be a problem, as the state is *usually* tied to chrome's process lifetime which is restarted on every new test execution. Unfortunately this is not the case for this trace startup file here, because it represents device-wide state that can survive regardless of the chrome process lifetime. As such I think the infrastructure should guarantee to leave the device in a clean and consistent state, regardless of what the test scripts do. In the builtbot case we have the provision_devices step which takes care of this sort of device-wide initialization. Is there an equivalent for swarming which is run by unittests, browsertests and telemetry tests?
,
Nov 8 2016
maruel@ excellent, that seems to be the perfect place. perezju@ can I move this to you who are more used to the infra repo? I think we need an extra path in in |paths| in #20 + see what is the equivalent for bots not yet on swarming (if any) that run browsertests.
,
Nov 8 2016
I think this CL should be landed after getting this problem resolved. https://codereview.chromium.org/2479593002/
,
Nov 8 2016
Yup, looking into it.
,
Nov 8 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/be13f4f0f88964b18b375c61f59193e2fb6773e2 commit be13f4f0f88964b18b375c61f59193e2fb6773e2 Author: perezju <perezju@google.com> Date: Tue Nov 08 23:18:45 2016
,
Nov 8 2016
I'll make the code live tomorrow.
,
Nov 9 2016
Thanks! Do we also see this problem on any non-swarming bots?
,
Nov 10 2016
I forgot to update this bug but bot_config_public.py has been live on chromium-swarm.appspot.com for several hours now.
,
Nov 10 2016
Looks like all of the flakes come from bots using swarming, and the fix is in place. Let's try to re-enable the tests from #4?
,
Nov 10 2016
Just created revert to re-enable in https://codereview.chromium.org/2490213002/
,
Nov 10 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/170d17d3be602605691146d3f2e3675d52334bbc commit 170d17d3be602605691146d3f2e3675d52334bbc Author: perezju <perezju@chromium.org> Date: Thu Nov 10 14:09:15 2016 Revert of Disable flaky tests in ChildTracingTest on Android. (patchset #1 id:1 of https://codereview.chromium.org/2432443003/ ) Reason for revert: Fix in place for issue that introduced flakiness of these tests. Original issue's description: > Disable flaky tests in ChildTracingTest on Android. > > BUG= 656729 > TBR=jbauman@chromium.org > > Committed: https://crrev.com/356f59fca3ff5b0e1d82d0a85087c9628154682a > Cr-Commit-Position: refs/heads/master@{#425988} TBR=guidou@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG= 656729 Review-Url: https://codereview.chromium.org/2490213002 Cr-Commit-Position: refs/heads/master@{#431246} [modify] https://crrev.com/170d17d3be602605691146d3f2e3675d52334bbc/components/tracing/child/child_trace_message_filter_browsertest.cc
,
Nov 10 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f98f6a4df243fd2bff7f2c0dfb919a0717f9a511 commit f98f6a4df243fd2bff7f2c0dfb919a0717f9a511 Author: kraynov <kraynov@chromium.org> Date: Thu Nov 10 18:14:00 2016 Tracing: Bail out early in tracing tests when tracing is already enabled. Following up https://crrev.com/2479563002 . BUG= 657628 , 656729 Review-Url: https://codereview.chromium.org/2479593002 Cr-Commit-Position: refs/heads/master@{#431291} [modify] https://crrev.com/f98f6a4df243fd2bff7f2c0dfb919a0717f9a511/components/tracing/child/child_trace_message_filter_browsertest.cc [modify] https://crrev.com/f98f6a4df243fd2bff7f2c0dfb919a0717f9a511/content/browser/tracing/memory_tracing_browsertest.cc
,
Nov 11 2016
Fixed. Don't see any more failures on browsertests and tests have been re-enabled on all of: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel Do re-open if you feel some more work is needed, or flakiness re-appears.
,
Jan 4 2017
,
Jan 4 2017
|
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by joedow@chromium.org
, Oct 17 2016