New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 656729 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

"ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Oct 17 2016

Issue description

"ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySAsSBUZsYWtlIj1DaGlsZFRyYWNpbmdUZXN0Lk11bHRpcGxlQ2hpbGRJbml0aWF0ZWRNZW1vcnlEdW1wV2l0aEZhaWx1cmVzDA.

Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs
 

Comment 1 by joedow@chromium.org, Oct 17 2016

Components: Internals>Tracing
Test seems flaky, here is some failure info for two test runs (same patchset).  One failed and the rerun succeeded:

Log output for the successful run:
I   58.403s run_tests_on_device(00d4cb976535a88c)  [INFO:test_support_android.cc(185)] MessagePumpForUIFactory already set, unable to override.
I   58.403s run_tests_on_device(00d4cb976535a88c)  >>ScopedMainEntryLogger
I   58.403s run_tests_on_device(00d4cb976535a88c)  Note: Google Test filter = ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures
I   58.403s run_tests_on_device(00d4cb976535a88c)  [==========] Running 1 test from 1 test case.
I   58.403s run_tests_on_device(00d4cb976535a88c)  [----------] Global test environment set-up.
I   58.403s run_tests_on_device(00d4cb976535a88c)  [----------] 1 test from ChildTracingTest
I   58.403s run_tests_on_device(00d4cb976535a88c)  [ RUN      ] ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures
I   58.403s run_tests_on_device(00d4cb976535a88c)  [WARNING:resource_bundle_android.cc(100)] locale_file_path.empty() for locale 
I   58.403s run_tests_on_device(00d4cb976535a88c)  [       OK ] ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures (146 ms)
I   58.404s run_tests_on_device(00d4cb976535a88c)  [----------] 1 test from ChildTracingTest (146 ms total)
I   58.404s run_tests_on_device(00d4cb976535a88c)  
I   58.404s run_tests_on_device(00d4cb976535a88c)  [----------] Global test environment tear-down
I   58.404s run_tests_on_device(00d4cb976535a88c)  [==========] 1 test from 1 test case ran. (148 ms total)
I   58.404s run_tests_on_device(00d4cb976535a88c)  [  PASSED  ] 1 test.


Log output for the failed run (note the symbols didn't seem to load so the stack isn't very useful):
I   76.361s run_tests_on_device(0048fdf67961b4f7)  [INFO:test_support_android.cc(185)] MessagePumpForUIFactory already set, unable to override.
I   76.361s run_tests_on_device(0048fdf67961b4f7)  >>ScopedMainEntryLogger
I   76.361s run_tests_on_device(0048fdf67961b4f7)  Note: Google Test filter = ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures
I   76.361s run_tests_on_device(0048fdf67961b4f7)  [==========] Running 1 test from 1 test case.
I   76.361s run_tests_on_device(0048fdf67961b4f7)  [----------] Global test environment set-up.
I   76.361s run_tests_on_device(0048fdf67961b4f7)  [----------] 1 test from ChildTracingTest
I   76.361s run_tests_on_device(0048fdf67961b4f7)  [ RUN      ] ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures
I   76.362s run_tests_on_device(0048fdf67961b4f7)  [WARNING:resource_bundle_android.cc(100)] locale_file_path.empty() for locale 
I   76.362s run_tests_on_device(0048fdf67961b4f7)  [FATAL:trace_log.cc(760)] Check failed: new_options == old_options (1 vs. 16)Attempting to re-enable tracing with a different set of options.
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #00 0x0000007f9738f4f7 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x000000000076c4f7
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #01 0x0000007f97407b2b /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000007e4b2b
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #02 0x0000007f974182d7 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000007f52d7
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #03 0x0000007f97417a83 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000007f4a83
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #04 0x0000007f973580ef /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000007350ef
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #05 0x0000007f97358707 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x0000000000735707
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #06 0x0000007f9735b24f /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x000000000073824f
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #07 0x0000007f993d8f73 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000027b5f73
I   76.362s run_tests_on_device(0048fdf67961b4f7)  #08 0x0000007f993dd17f /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000027ba17f
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #09 0x0000007f993dd2ab /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000027ba2ab
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #10 0x0000007f993dd367 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000027ba367
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #11 0x0000007f993df793 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000027bc793
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #12 0x0000007f993df99f /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x00000000027bc99f
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #13 0x0000007f9935e1d7 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x000000000273b1d7
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #14 0x0000007f993ad417 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x000000000278a417
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #15 0x0000007f993b5aa3 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x0000000002792aa3
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #16 0x0000007f9733a537 /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x0000000000717537
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #17 0x0000007f9744127f /data/app/org.chromium.components_browsertests_apk-1/lib/arm64/lib_components_browsertests__library.so+0x000000000081e27f
I   76.363s run_tests_on_device(0048fdf67961b4f7)  #18 0x0000007faf0dbd13 /data/app/org.chromium.components_browsertests_apk-1/oat/arm64/base.odex+0x0000000000316d13
I   76.363s run_tests_on_device(0048fdf67961b4f7)  
I   76.363s run_tests_on_device(0048fdf67961b4f7)  [ERROR:test_suite.cc(270)] Currently running: ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures
I   76.364s run_tests_on_device(0048fdf67961b4f7)  
I   76.364s run_tests_on_device(0048fdf67961b4f7)  ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailuresUNKNOWN
I   76.364s run_tests_on_device(0048fdf67961b4f7)  OK (0 tests)
I   76.364s run_tests_on_device(0048fdf67961b4f7)  Finished running tests on this device.
I   76.365s run_tests_on_device(006c93fa7fa32e06)  Finished running tests on this device.
I   76.365s run_tests_on_device(00b65c202767c603)  Finished running tests on this device.
I   76.366s run_tests_on_device(00a29126838a29d5)  Finished running tests on this device.
I   76.366s run_tests_on_device(00a6d6b81d779237)  Finished running tests on this device.


Comment 2 by joedow@chromium.org, Oct 17 2016

Cc: primiano@chromium.org
I see  crbug.com/563674  which is similar.  CC'ing one of the CR folks to make sure the component is correct and perhaps help triage to the right person.

Comment 3 by guidou@chromium.org, Oct 18 2016

Cc: -primiano@chromium.org jbau...@chromium.org
Labels: -Sheriff-Chromium
Owner: primiano@chromium.org
Disabling this test and two others in the same file on Android due to flakiness.

primiano@: Assigning to you since you recently edited these tests. Please try find a better owner if necessary.
Project Member

Comment 4 by bugdroid1@chromium.org, Oct 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/356f59fca3ff5b0e1d82d0a85087c9628154682a

commit 356f59fca3ff5b0e1d82d0a85087c9628154682a
Author: guidou <guidou@chromium.org>
Date: Tue Oct 18 16:46:14 2016

Disable flaky tests in ChildTracingTest on Android.

BUG= 656729 
TBR=jbauman@chromium.org

Review-Url: https://codereview.chromium.org/2432443003
Cr-Commit-Position: refs/heads/master@{#425988}

[modify] https://crrev.com/356f59fca3ff5b0e1d82d0a85087c9628154682a/components/tracing/child/child_trace_message_filter_browsertest.cc

Project Member

Comment 5 by chromium...@appspot.gserviceaccount.com, Oct 18 2016

Labels: Sheriff-Chromium
Detected 5 new flakes for test/step "ChildTracingTest.MultipleChildInitiatedMemoryDumpWithFailures". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNySAsSBUZsYWtlIj1DaGlsZFRyYWNpbmdUZXN0Lk11bHRpcGxlQ2hpbGRJbml0aWF0ZWRNZW1vcnlEdW1wV2l0aEZhaWx1cmVzDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).

Comment 6 by guidou@chromium.org, Oct 19 2016

Labels: -Sheriff-Chromium
Cc: primiano@chromium.org picksi@chromium.org ssid@chromium.org
Labels: Hotlist-MemoryInfra
Owner: kraynov@chromium.org
kraynov, can you help here?
Would be great to understand why all of a suddent this became flaky.
We didn't touch either the code nor the test in a long while.
this sudden flakiness worries me.

Looking more, this might be related with the recent refactoring that ssid@ made.
Status: Assigned (was: Untriaged)
Ok I have a theory here.
The theory is that some previous test execution (possibly telemetry chrome_trace_config_unittest.py) leaves the trace startup trace config file around in /data/local/chrome-trace-config.json. If that file is present chrome will automatically  start tracing with differnet options and cause the failure above.
Essentially that file is state that should not be leaked from one test to another.

kraynov: can you have a try and create startup trace file and see if you get the same failure?

Cc: perezju@chromium.org jbudorick@chromium.org nednguyen@chromium.org
Ned/John/Juan: I am not familiar with swarming.
What is the best way to ensure that we don't leave the stale trace config file around when running these tests?
I imagine what happens here is that those tests generally clean up the chrome-trac-config.json when they execute cleanly, but they can't possibly do that if the device hangs or something crashes.

Cc: dtu@chromium.org zh...@chromium.org eyaich@chromium.org
+zhenw, eyaich, dtu: see #10
Cc: mar...@chromium.org
I vaguely recall there is the concept of bot health check in swarming. Maruel may know which is the best way to ensure the bots are cleaned between runs.
Project Member

Comment 14 by bugdroid1@chromium.org, Nov 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ea26a4f05c216cd7dd7d76ea19fc46665c300baa

commit ea26a4f05c216cd7dd7d76ea19fc46665c300baa
Author: primiano <primiano@chromium.org>
Date: Thu Nov 03 15:22:03 2016

tracing: Bail out early in tracing tests when tracing is already enabled

A bunch of tracing browsertests have recently become flaky.
I strongly suspect this is because some other test leaves the
startup tracing file (/data/local/chrome-trace-config.json)
around. On Android the presence of that file is enough to
initialize the tracing machinery and interfere with tests.
Bail out early explicitly in that case.

BUG= 657628 ,  656729 

Review-Url: https://codereview.chromium.org/2479563002
Cr-Commit-Position: refs/heads/master@{#429593}

[modify] https://crrev.com/ea26a4f05c216cd7dd7d76ea19fc46665c300baa/content/browser/tracing/memory_tracing_browsertest.cc

Re #15

> I think /data/local/chrome-trace-config.json is always cleaned up after each run:
https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py?l=265

That won't run if the slave is abruptly terminated killing python. no idea how the swarming runtime works. I don't thing relying on the python's atExit is enough, as python could just be killed from the outside.

> And it is always overridden when starting tracing:
https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py?l=261
Right, but only if running telemetry though. That code is not triggered when running content_browsertests.
So I suspect what's happening here is:
1) Telemetry runs on a device
2) The run is abruptly terminated
3) After a while we run content_browsertests on the same device
Nothin has cleared the trace config. All the tests running on the device have trace enabled with whatever was the last config in 1). Tracing tests in content_browsertests will fail because of the interference.

I don't think this problem can be solved at the telemetry level. You can't safely assume telemetry will cleanup cleanly. And you can't safely assume that we run only telemetry on a device (I think) in these days of swarming.
I think this should be dealt with at a swarming level. Swarming should guarantee that the device is in a clean state, regardess of what ran before. 
Ah, I see your point. I agree.
primiano@: The Swarming bot sends a SIGTERM when the timeout occurs. The bot then gives it a grace period (30s) before escalating to SIGKILL.

It's up to the script to handle signals and handle them correctly. So every python script should handle SIGTERM, propagate the signal to children processes as relevant and exit as quickly as possible.

The data must be saved into ${ISOLATED_OUTDIR} to be kept as part of swarming task results.
> It's up to the script to handle signals and handle them correctly.
there is a number of things that could go wrong there: the python script getting stuck somewhere, python crashing, the AtExit handlers failing. Or more realistically: adb hanging. I challenge everybody to remove a file on the device when the test fails because of adb hangs.

In general none of this should be a problem, as the state is *usually* tied to chrome's process lifetime which is restarted on every new test execution.
Unfortunately this is not the case for this trace startup file here, because it represents device-wide state that can survive regardless of the chrome process lifetime.

As such I think the infrastructure should guarantee to leave the device in a clean and consistent state, regardless of what the test scripts do.
In the builtbot case we have the provision_devices step which takes care of this sort of device-wide initialization.
Is there an equivalent for swarming which is run by unittests, browsertests and telemetry tests?
Cc: kraynov@chromium.org
Owner: perezju@chromium.org
maruel@ excellent, that seems to be the perfect place. 
perezju@ can I move this to you who are more used to the infra repo?
I think we need an extra path in in |paths| in #20 + see what is the equivalent for bots not yet on swarming (if any) that run browsertests.
I think this CL should be landed after getting this problem resolved.
https://codereview.chromium.org/2479593002/
Yup, looking into it.
Project Member

Comment 24 by bugdroid1@chromium.org, Nov 8 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/be13f4f0f88964b18b375c61f59193e2fb6773e2

commit be13f4f0f88964b18b375c61f59193e2fb6773e2
Author: perezju <perezju@google.com>
Date: Tue Nov 08 23:18:45 2016

I'll make the code live tomorrow.
Thanks! Do we also see this problem on any non-swarming bots?
I forgot to update this bug but bot_config_public.py has been live on chromium-swarm.appspot.com for several hours now.
Looks like all of the flakes come from bots using swarming, and the fix is in place. Let's try to re-enable the tests from #4?
Just created revert to re-enable in https://codereview.chromium.org/2490213002/
Project Member

Comment 30 by bugdroid1@chromium.org, Nov 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/170d17d3be602605691146d3f2e3675d52334bbc

commit 170d17d3be602605691146d3f2e3675d52334bbc
Author: perezju <perezju@chromium.org>
Date: Thu Nov 10 14:09:15 2016

Revert of Disable flaky tests in ChildTracingTest on Android. (patchset #1 id:1 of https://codereview.chromium.org/2432443003/ )

Reason for revert:
Fix in place for issue that introduced flakiness of these tests.

Original issue's description:
> Disable flaky tests in ChildTracingTest on Android.
>
> BUG= 656729 
> TBR=jbauman@chromium.org
>
> Committed: https://crrev.com/356f59fca3ff5b0e1d82d0a85087c9628154682a
> Cr-Commit-Position: refs/heads/master@{#425988}

TBR=guidou@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 656729 

Review-Url: https://codereview.chromium.org/2490213002
Cr-Commit-Position: refs/heads/master@{#431246}

[modify] https://crrev.com/170d17d3be602605691146d3f2e3675d52334bbc/components/tracing/child/child_trace_message_filter_browsertest.cc

Status: Fixed (was: Assigned)
Fixed. Don't see any more failures on browsertests and tests have been re-enabled on all of:

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel

Do re-open if you feel some more work is needed, or flakiness re-appears.
Components: Internals>Instrumentation>Memory
Components: -Internals>Tracing

Sign in to add a comment