Trybot android_n5x_swarming_rel is failing most tryjobs
Issue description

https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200 is failing most of its tryjobs. There are so many failures in so many test suites that I'm having a difficult time figuring out any common pattern.

This is the failure that affected my tryjob, making me look into this in the first place: https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/82060 It looks like telemetry_perf_unittests failed the tests browse:news:flipboard and browse:news:qq, apparently because the GPU watchdog fired.

In https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/82053, MediaStreamManagerTest.MakeAndCancelMultipleRequests in content_unittests timed out 3 times.

In https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/82051, the following three tests timed out because their callbacks were never called:
org.chromium.content.browser.webcontents.AccessibilitySnapshotTest#testRequestAccessibilitySnapshot
org.chromium.content.browser.ScreenOrientationListenerTest#testOrientationChanges
org.chromium.content.browser.input.ImeTest#testEnterKey_AfterCommitText

C 200.023s Main [FAIL] org.chromium.content.browser.ScreenOrientationListenerTest#testOrientationChanges:
C 200.023s Main java.util.concurrent.TimeoutException: waitForCallback timed out!
C 200.023s Main at org.chromium.base.test.util.CallbackHelper.waitForCallback(CallbackHelper.java:183)
C 200.023s Main at org.chromium.base.test.util.CallbackHelper.waitForCallback(CallbackHelper.java:219)
C 200.023s Main at org.chromium.content.browser.ScreenOrientationListenerTest.lockOrientationAndWait(ScreenOrientationListenerTest.java:160)
C 200.023s Main at org.chromium.content.browser.ScreenOrientationListenerTest.testOrientationChanges(ScreenOrientationListenerTest.java:167)
C 200.023s Main at android.test.InstrumentationTestCase.runMethod(InstrumentationTestCase.java:214)
C 200.023s Main at android.test.InstrumentationTestCase.runTest(InstrumentationTestCase.java:199)
C 200.023s Main at android.test.ActivityInstrumentationTestCase2.runTest(ActivityInstrumentationTestCase2.java:192)
C 200.023s Main at org.chromium.content_shell_apk.ContentShellTestBase.runTest(ContentShellTestBase.java:256)
C 200.023s Main at org.chromium.base.test.BaseTestResult.runParameterized(BaseTestResult.java:161)
C 200.023s Main at org.chromium.base.test.BaseTestResult.run(BaseTestResult.java:124)
C 200.023s Main at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:191)
C 200.023s Main at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:176)
C 200.023s Main at android.test.InstrumentationTestRunner.onStart(InstrumentationTestRunner.java:555)
C 200.023s Main at android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:1879)

Adding some components and referencing a related bug. Things have gotten a lot worse since that bug was filed. The sheriffs will certainly need help figuring out what is going wrong.
Showing comments 5 - 104 of 104
,
Dec 8 2016
+dnj for logdog question
,
Dec 8 2016
Issue 672221 has been merged into this issue.
,
Dec 8 2016
#1/5: as mentioned on the other bug, that's not the failure here. The test runner can tolerate not being able to bootstrap a logdog stream.
,
Dec 8 2016
2hr success rate on ANSR took a nosedive starting around 2pm PST yesterday: http://shortn/_PsyQGQlqQq
,
Dec 8 2016
,
Dec 8 2016
,
Dec 8 2016
(reposting comment 11 w/ log as attachment) com.android.systemui is dying?
,
Dec 8 2016
,
Dec 8 2016
Given the timing of the failure, I'm going to speculatively revert https://chromereviews.googleplex.com/501377014/ (and its follow-up CL, https://chrome-internal-review.googlesource.com/c/310089/).
,
Dec 8 2016
https://chrome-internal-review.googlesource.com/#/c/310047/ may fix the problem. Pushed to prod one minute ago. Kevin first notified me.
,
Dec 8 2016
,
Dec 8 2016
,
Dec 8 2016
,
Dec 8 2016
The CL in #15 has not had any apparent effect on the 2hr success rate on ANSR over the last hour: http://shortn/_bN4l5VCrU3

Proceeding w/ the spec revert.
,
Dec 8 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/a7135d07ccc57c9d54bd7384cc79aca7e23aaed6

commit a7135d07ccc57c9d54bd7384cc79aca7e23aaed6
Author: John Budorick <jbudorick@google.com>
Date: Thu Dec 08 14:53:34 2016
,
Dec 8 2016
Issue 672502 has been merged into this issue.
,
Dec 8 2016
Spec revert deployed as version 161 of bot_config at 2016-12-08T16:17:15.605660
,
Dec 8 2016
Issue 666293 has been merged into this issue.
,
Dec 8 2016
Since I'm reverting stuff already...
,
Dec 8 2016
RE #1, this isn't software that I am familiar with. It looks like someone is using the LogDog client library. Whether they are treating a lack of bootstrapping as fatal or not is entirely a question of whether they catch the error. If the test continues in spite of that error, I'd assume that it is non-fatal.
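In code terms, the distinction being made above is roughly the following. This is only a sketch of the pattern, not the actual test-runner code; bootstrap_logdog_stream() is a hypothetical stand-in for whatever the real LogDog client call is:

import logging
import sys


def bootstrap_logdog_stream():
  # Hypothetical stand-in for the real LogDog client bootstrap call; it
  # fails when the process is not running under a LogDog butler.
  raise RuntimeError('not running under a LogDog bootstrap')


def open_output_stream():
  # If the bootstrap error is caught here, a missing LogDog stream is
  # non-fatal: the run just logs locally instead of aborting.
  try:
    return bootstrap_logdog_stream()
  except Exception as e:
    logging.warning('LogDog bootstrap failed (%s); logging locally.', e)
    return sys.stderr


if __name__ == '__main__':
  stream = open_output_stream()
  stream.write('test output goes here\n')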
,
Dec 8 2016
2hr success rate has hit 40%: http://shortn/_BkJQ7eFU0R Continuing to monitor while looking for potential chromium-side culprits.
,
Dec 8 2016
It might be worth trying whether this could be worked around with in_process_gpu = false: https://bugs.chromium.org/p/chromium/issues/detail?id=672388#c2 I don't really know whether this is related, but https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/82060 mentioned here also has the line "in_process_gpu : True" close to the failure at the bottom of the file.
,
Dec 8 2016
,
Dec 8 2016
2hr success rate has wavered between 35%-40% since #26. Still looking. #27: I would guess that that's not related.
,
Dec 8 2016
30min success rate doesn't look too great either: http://shortn/_PVFQZD1VB7 I'll start trying to repro locally.
,
Dec 8 2016
,
Dec 8 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/2d7355cedb69e3c2a18cb0c5d25ce309dff45ef7

commit 2d7355cedb69e3c2a18cb0c5d25ce309dff45ef7
Author: jbudorick <jbudorick@chromium.org>
Date: Thu Dec 08 21:59:59 2016

Drop android_n5x_swarming_rel to 100% experimental.

Its task success rate is currently hovering between 35-40% for reasons we haven't yet been able to identify but don't appear to be related to the CLs under test.

BUG=672382
NOTRY=true
TBR=sergiyb@chromium.org

Review-Url: https://codereview.chromium.org/2559313002
Cr-Commit-Position: refs/heads/master@{#437360}

[modify] https://crrev.com/2d7355cedb69e3c2a18cb0c5d25ce309dff45ef7/infra/config/cq.cfg
,
Dec 8 2016
,
Dec 9 2016
,
Dec 9 2016
,
Dec 9 2016
,
Dec 9 2016
,
Dec 9 2016
Issue 672810 has been merged into this issue.
,
Dec 9 2016
,
Dec 9 2016
Possibly related -- some of the bots have wildly inaccurate system clocks: https://crbug.com/672843
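A quick way to check for that kind of skew across the pool (just a sketch using plain adb; the 60-second threshold is an arbitrary assumption, not anything the bots actually use):

import subprocess
import time


def devices():
  # Parse 'adb devices' output, skipping the header line.
  out = subprocess.check_output(['adb', 'devices']).decode()
  return [line.split()[0] for line in out.splitlines()[1:] if line.strip()]


def clock_skew_seconds(serial):
  # 'adb shell date +%s' prints the device's epoch time; compare to the host's.
  out = subprocess.check_output(['adb', '-s', serial, 'shell', 'date', '+%s'])
  return int(out.decode().strip()) - int(time.time())


if __name__ == '__main__':
  for serial in devices():
    skew = clock_skew_seconds(serial)
    if abs(skew) > 60:  # arbitrary threshold for "wildly inaccurate"
      print('%s clock is off by %d seconds' % (serial, skew))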
,
Dec 9 2016
,
Dec 9 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/9fc9e06b5315deeccaf504fc2d4de09ab591f654

commit 9fc9e06b5315deeccaf504fc2d4de09ab591f654
Author: John Budorick <jbudorick@google.com>
Date: Fri Dec 09 17:30:13 2016
,
Dec 9 2016
,
Dec 9 2016
FYI, had this failure on https://codereview.chromium.org/2464233002/

org.chromium.content.browser.JavaBridgeBasicsTest#testAdditionNotReflectedUntilReload (run #1):
junit.framework.AssertionFailedError: Shell is still loading.
at org.chromium.content.browser.test.util.CriteriaHelper.pollInstrumentationThread(CriteriaHelper.java:74)
at org.chromium.content.browser.test.util.CriteriaHelper.pollUiThread(CriteriaHelper.java:112)
at org.chromium.content_shell_apk.ContentShellTestBase.waitForActiveShellToBeDoneLoading(ContentShellTestBase.java:135)
at org.chromium.content.browser.JavaBridgeTestBase.setUpContentView(JavaBridgeTestBase.java:35)
at org.chromium.content.browser.JavaBridgeTestBase.setUp(JavaBridgeTestBase.java:54)
at org.chromium.content.browser.JavaBridgeBasicsTest.setUp(JavaBridgeBasicsTest.java:97)
at org.chromium.base.test.BaseTestResult.runParameterized(BaseTestResult.java:161)
at org.chromium.base.test.BaseTestResult.run(BaseTestResult.java:124)
at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:191)
at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:176)
at android.test.InstrumentationTestRunner.onStart(InstrumentationTestRunner.java:555)
at android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:1879)

Hint: perhaps the framework tries to communicate with worker threads racily (i.e. before they're actually guaranteed to be live)?
,
Dec 9 2016
,
Dec 9 2016
Still looking into this. We're fairly certain it has to do with com.android.systemui crashing on the devices during tests. We're seeing ANRs pretty reliably right before every failure. As for the cause, I'm looking into how we throttle/unthrottle the cpu. It might be that we're throttling too hard and certain processes on the device get starved out and eventually stop responding.
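For context, device CPU throttling is typically done through the cpufreq sysfs knobs; something along these lines, where the frequency value is a placeholder and the exact mechanism the harness uses is an assumption on my part:

import subprocess

CPUFREQ_DIR = '/sys/devices/system/cpu/cpu%d/cpufreq'
MAX_FREQ_KHZ = 960000  # placeholder; a real config would pick this per-device


def adb_shell(serial, cmd):
  return subprocess.check_output(['adb', '-s', serial, 'shell', cmd])


def throttle_cpu(serial, num_cpus, max_freq_khz=MAX_FREQ_KHZ):
  # Capping scaling_max_freq slows every process on the device, which is
  # how an overly aggressive cap could starve system services (systemui,
  # camera) until they ANR, as described above.
  for cpu in range(num_cpus):
    adb_shell(serial, 'echo %d > %s/scaling_max_freq'
              % (max_freq_khz, CPUFREQ_DIR % cpu))


def unthrottle_cpu(serial, num_cpus):
  # Restore the cap to the hardware maximum reported by cpuinfo_max_freq.
  for cpu in range(num_cpus):
    adb_shell(serial, 'cat %s/cpuinfo_max_freq > %s/scaling_max_freq'
              % (CPUFREQ_DIR % cpu, CPUFREQ_DIR % cpu))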
,
Dec 9 2016
,
Dec 10 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/546f40ba5282181e942e348e1bf934ea3e0adf0b

commit 546f40ba5282181e942e348e1bf934ea3e0adf0b
Author: Benjamin Pastene <bpastene@google.com>
Date: Sat Dec 10 00:21:41 2016
,
Dec 10 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/e6634a68586b009045ff3058bfb21698ca172c32

commit e6634a68586b009045ff3058bfb21698ca172c32
Author: John Budorick <jbudorick@google.com>
Date: Sat Dec 10 01:01:28 2016
,
Dec 10 2016
Summary: After two full days of investigating and debugging, we've decided to reflash all N5X devices. We flashed a subset of the devices earlier today and couldn't find any similar test flakes on the newly-flashed phones. So the hope here is that reflashing *all* of them will stamp out the flakes for good. Will keep the bot off the CQ through part of Monday to verify everything worked.
,
Dec 12 2016
Update: flashing fixed all the flakes: http://shortn/_yCEfwAMCqT Yay!! But we still have no idea what exactly went wrong. Since the bot is still off the CQ, I'm planning on relanding the CL that was reverted in #20 to narrow it down. If flakes start exploding again, then we know for sure it was my change, and we can work from there. If everything stays fine, then we have some digging to do in git logs...
,
Dec 12 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/d8372c8615637fcd770b77babae5109937d0841c

commit d8372c8615637fcd770b77babae5109937d0841c
Author: John Budorick <jbudorick@google.com>
Date: Mon Dec 12 20:42:38 2016
,
Dec 12 2016
Ok, proceeding with the reland/experiment mentioned in #52
,
Dec 13 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/093b4e7a083bd3bf26381550689c466efaf743b4

commit 093b4e7a083bd3bf26381550689c466efaf743b4
Author: Benjamin Pastene <bpastene@google.com>
Date: Mon Dec 12 23:27:29 2016
,
Dec 13 2016
,
Dec 13 2016
Possibly. The CL in #55 intentionally relanded our primary suspect in an attempt to (1) confirm our suspicions and (2) gather more info about the mechanism behind the failure if it was indeed the cause. android_n5x_swarming_rel is still running as a 100% experiment on the CQ, though.
,
Dec 13 2016
Issue 672388 has been merged into this issue.
,
Dec 13 2016
Could you provide a time estimate on when this bug will be fixed? Our group had to remove all the tests from our android_optional_gpu_tests_rel tryserver in Issue 672502 due to these failures. We've lost significant test coverage due to this and need to re-enable it ASAP. Thanks.
,
Dec 14 2016
Re #59: The bot, and the swarming pool it runs on, is back to pre-outage success rates: http://shortn/_3jGezy4SYd

Theoretically, we could resolve this as fixed and be done with it. The problem is, we still don't know what really happened. The biggest culprit was relanded over 24 hours ago and nothing's regressed since. I'd really like to get to the bottom of this before re-enabling everything.

I have an experiment on the devices right now that will run overnight. If that doesn't uncover anything useful, it may be best to turn everything back on and see how things go. I'll circle back tomorrow morning once the experiment is over, and we can work from there.

BTW, I'm currently hunting down issues with the camera as a possible culprit, so if anyone has any ideas why camera services on the device would suddenly start crashing/causing problems, please let me know. (Maybe some random tests out there started messing with them...?)
,
Dec 14 2016
Actually, this is still a P0 despite the absence of red...
,
Dec 14 2016
I object to the method of flakiness assessment based on success rates. Looking at the actual failures, to me it looks like the bot still has the same problem:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85528
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85547
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85564
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85566
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85570
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85630
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85632
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85644
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85669
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85682
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85685
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/85701

Moreover, 85701, 85644, 85570, 85566, 85564, 85547, 85528 are reported as successful, despite errors being present in the logs.
,
Dec 14 2016
#62: Can you explain why you object to using success rate as a mechanism for assessing flakiness? While the bot is still seeing some issues, it's seeing them at a dramatically lower rate than it had been late last week. That's meaningful, even if we haven't determined the root cause of the outage yet.

I think the success-with-errors case is happening because the recipes see no test failures in the JSON returned by the test runner. This can happen in multiple cases:
- test runner hang on tear-down: https://bugs.chromium.org/p/chromium/issues/detail?id=664308#c38
- deadlock on catapult dep download: https://bugs.chromium.org/p/chromium/issues/detail?id=674172
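Roughly, the gap looks like this (a sketch with a made-up results-JSON shape; the real format the recipes consume may differ):

def step_failed(results_json):
  # Naive check: the step only fails if an individual test reported failure.
  # If the runner hung on teardown or deadlocked before writing any per-test
  # results, this sees an empty list and the step is reported green.
  return any(t['status'] == 'FAILURE'
             for t in results_json.get('per_test_results', []))


def step_failed_strict(results_json, expected_tests):
  # Stricter variant: also treat "no results at all" as a failure so that
  # hangs and deadlocks can't show up as successful builds.
  results = results_json.get('per_test_results', [])
  if expected_tests and not results:
    return True
  return any(t['status'] == 'FAILURE' for t in results)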
,
Dec 14 2016
Yes, there are fewer failed builds, but that doesn't mean the problem is gone. The lower rate can have different explanations, like: "after reflashing the devices some of them reverted to a bad state, and others are going to follow them shortly". And, since the bot is experimental now, it could well be that if the CLs are retried on it after a failure, the success rate will be back to 0.

I don't think you can't say something like "if success rate is above X%, that means the problem is solved". (Unless X=100, which will never happen.)

As I've pointed out in #62 and in #56, there are still builds which fail in the same way they failed before reflashing the devices, despite the claim in #60 that "nothing's regressed since". "The same way" is what's important here, not the lower rate. You wouldn't notice a few devices going bad with the success rate metric until a large portion of them go bad, as exemplified in #60 and #62.

What I usually check when dealing with flakiness is "no errors evident in logs of the last 200 builds". It's also not 100% proof, but I think a bit more reliable.

PS. I don't understand why in #60 bpastene@ said "The biggest culprit was relanded over 24 hours ago and nothing's regressed since." when I pointed in #56 to a build which failed within 24 hours of the reland.
,
Dec 14 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/9b8d19bbc8fb578c33f444bec3d9c09567f4a3bb

commit 9b8d19bbc8fb578c33f444bec3d9c09567f4a3bb
Author: Benjamin Pastene <bpastene@google.com>
Date: Wed Dec 14 17:39:49 2016
,
Dec 14 2016
Sorry for typo, it should read "I don't think you can say something like". Another explanation for higher success rates could be that we turned off GPU tests on Nexus 5X and android_optional_gpu_tests_rel. Maybe they were the ones putting the devices into a bad state? Since they run on the same swarmed pool, I think.
,
Dec 14 2016
It bears repeating that disabling every test on the GPU bots for these devices would have an effect on the same time frame we are discussing.
,
Dec 14 2016
What makes you so sure that "there are still builds which fail in the same way they failed before reflashing the devices"? A lot of the builds you posted in #62 failed because of the CL under test or because of bug 664308. And the build posted in #56 also failed in the same way on linux_android_rel_ng, which runs on a different, and unaffected, device pool. Here are swarming task failures per hour for android_n5x_swarming_rel: http://shortn/_RhW5KRBXhn You can very clearly see the outage and how we're currently back to pre-outage failure rates.
,
Dec 14 2016
Re #66: Yes, I agree that that's possible. I'm going to reflash and revert everything again, then I suggest we reenable all tests & CQ bots.
,
Dec 14 2016
bpastene.. is the flakiness fixed? Like 99-100% fixed? It wouldn't be great to enable tests if there's still flakiness.
,
Dec 14 2016
From my perspective, and from the numbers, yes. But ynovikov@ may have other ideas :) Give me ~60 min, then everything will theoretically be in the same state it was before the outage. After that, we can start re-enabling things.
,
Dec 14 2016
Hmm, I was pretty sure that I'd filtered out the irrelevant build failures. Sorry if that wasn't so. Some of them had the same error as #2, and the CLs under test seemed to have landed without making the bot permanently red.

I actually think that turning everything back on would be a good experiment. We just need to turn it back off quickly if something goes wrong.
,
Dec 14 2016
Ok, how would everyone feel about re-enabling the gpu tests and reverting https://codereview.chromium.org/2562133002? I'd like to see how the devices would do under the full load. We can always revert it again if things go poorly, but I suspect they won't. There was a slight hiccup when I reflashed the phones, but that's since been resolved, and everything is looking good now: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=user&f=pool%3AChrome&f=device_type%3ANexus%205X%20(bullhead)&f=state%3ACOMPLETED&l=50&s=created_ts%3Adesc
,
Dec 14 2016
This is fine as a test. Please re-revert if flakiness becomes apparent. Also watch the ANGLE try waterfall: https://luci-milo.appspot.com/buildbot/tryserver.chromium.angle/android_angle_rel_ng/ Note you may have to revert by hand since the generated json can change quite a bit between changes to the generator.
,
Dec 14 2016
Sounds good. I'll start working on it.
,
Dec 14 2016
CQed the revert in https://codereview.chromium.org/2577793002/. Will keep an eye on the GPU.FYI and ANGLE waterfall.
,
Dec 15 2016
First GPU builds went well:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/4661
https://build.chromium.org/p/tryserver.chromium.android/builders/android_optional_gpu_tests_rel/builds/1387

Looks like the errors I pointed to in #56 and #62 are issue 670817 and other issues not related to com.android.systemui crashing. Sorry about that. Having too many different sources of flakiness is confusing.
,
Dec 15 2016
Let's let it run overnight and see how things go. If nothing has started failing again, I'll put android_n5x_swarming_rel back on the CQ.
,
Dec 15 2016
Next build has a problem - crash in org.chromium.chrome.browser.metrics.UmaUtils.getForegroundStartTime. https://build.chromium.org/p/tryserver.chromium.android/builders/android_optional_gpu_tests_rel/builds/1388 Searching crbug.com leads me to issue 673385 and via it to issue 673433. Looks like N5X swarmed pool and linux_android_rel_ng failures may have something in common?
,
Dec 15 2016
,
Dec 15 2016
Same assertion in org.chromium.chrome.browser.metrics.UmaUtils.getForegroundStartTime here:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/86436
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/86440

https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/4662 and https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/4664 fail with a different assertion:

[FATAL:context_provider_factory_impl_android.cc(238)] Timed out waiting for GPU channel.

Also, the "Success rate (2-hour average)" metric says that we are currently at 50-60%. I think I misjudged its usefulness. So, unless you need the bots in this state to debug them, I think we need to re-disable the GPU tests.
,
Dec 15 2016
> [FATAL:context_provider_factory_impl_android.cc(238)] Timed out waiting for GPU channel.

I've been looking into that crash a lot in crbug.com/664341. Please involve me in anything related to that crash.
,
Dec 15 2016
In https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/86481 I see:

junit.framework.AssertionFailedError: Many tests will fail if the screen is not on.
at org.chromium.content_shell_apk.ContentShellTestBase.assertScreenIsOn(ContentShellTestBase.java:72)

I thought that somewhere a flag was applied to keep the screen always on. Did something regress there?
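For reference, keeping the screen on is usually done with something like the following over adb (a sketch; whether the harness uses exactly this mechanism is an assumption):

import subprocess


def keep_screen_on(serial):
  # 'svc power stayon true' keeps the display on while the device is on
  # power (USB counts), which is what ContentShellTestBase's
  # assertScreenIsOn check relies on being in effect.
  subprocess.check_call(
      ['adb', '-s', serial, 'shell', 'svc', 'power', 'stayon', 'true'])
  # Wake the device in case the screen is currently off.
  subprocess.check_call(
      ['adb', '-s', serial, 'shell', 'input', 'keyevent', 'KEYCODE_WAKEUP'])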
,
Dec 15 2016
https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/4665 has yet another failure mode:

[1214/182854:ERROR:host_forwarder_main.cc(477)] ERROR: could not get adb port for device. You might need to add 'adb' to your PATH or provide the device serial id.
,
Dec 15 2016
Re #83: Hmmm. Nothing should have regressed. I'll take that bot out of the pool to investigate.
,
Dec 15 2016
Other than that spurt of failures the bot had yesterday evening, it seems pretty stable now? It's had 14 straight green builds and counting: https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29?numbuilds=200 Similar story with the angle tryserver (when it doesn't fail to compile): https://build.chromium.org/p/tryserver.chromium.angle/builders/android_angle_rel_ng
,
Dec 15 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/90f8c17af487a88dc2c1d09dc10dbe92aa7c0699

commit 90f8c17af487a88dc2c1d09dc10dbe92aa7c0699
Author: bpastene <bpastene@chromium.org>
Date: Thu Dec 15 18:21:22 2016

Add android_n5x_swarming_rel back to the CQ.

All incidence of flake that dropped it out of the CQ has been absent for a good few days now. I think it's time to add it back.

TBR=tandrii@chromium.org
BUG=672382

Review-Url: https://codereview.chromium.org/2583553002
Cr-Commit-Position: refs/heads/master@{#438876}

[modify] https://crrev.com/90f8c17af487a88dc2c1d09dc10dbe92aa7c0699/infra/config/cq.cfg
,
Dec 15 2016
I'd say give it another couple of days (at least one full day) before claiming victory. That spurt of failures might or might not be related.
,
Dec 15 2016
Ben, maybe you or John can confirm that the org.chromium.chrome.browser.metrics.UmaUtils.getForegroundStartTime assert is a separate bug? In issue 673385 it was dismissed as a side effect of 673433, which is now marked as fixed, but the problem is still present in recent n5x builds: https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/86884
,
Dec 19 2016
This has started happening again:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/4718
https://chromium-swarm.appspot.com/task?id=332515141de8b210&refresh=10&show_raw=1

12-17 10:13:15.111 958 975 E ActivityManager: ANR in com.android.systemui
12-17 10:13:15.111 958 975 E ActivityManager: PID: 11840
12-17 10:13:15.111 958 975 E ActivityManager: Reason: Broadcast of Intent { act=android.intent.action.TIME_TICK flg=0x50000014 (has extras) }
12-17 10:13:15.112 958 975 I ActivityManager: Killing 11840:com.android.systemui/u0a29 (adj -12): bg anr
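A simple way to watch for these on a device (sketch only; it just greps the logcat stream for the ActivityManager ANR lines quoted above):

import re
import subprocess

ANR_RE = re.compile(r'ActivityManager: ANR in (\S+)')


def watch_for_anrs(serial):
  # Stream logcat and print any ANR the ActivityManager reports, e.g.
  # 'ANR in com.android.systemui' as seen in the log above.
  proc = subprocess.Popen(
      ['adb', '-s', serial, 'logcat', '-v', 'brief'],
      stdout=subprocess.PIPE, universal_newlines=True)
  for line in proc.stdout:
    match = ANR_RE.search(line)
    if match:
      print('ANR in %s: %s' % (match.group(1), line.strip()))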
,
Dec 19 2016
Here the ANR is preceded by a crash:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/4759
https://chromium-swarm.appspot.com/task?id=332eb5981c080e10&refresh=10&show_raw=1

12-19 07:05:11.962 21314 21314 I chromium: [INFO:CONSOLE(33)] "Harness injected.", source: (33)
12-19 07:05:12.877 11595 15475 E mm-camera-intf: mm_camera_open:Failed with Connection timed out error, retrying after 20 milli-seconds
12-19 07:05:13.955 11697 11720 I Process : Sending signal. PID: 21341 SIG: 3
12-19 07:05:13.956 21341 21346 I art : Thread[2,tid=21346,WaitingInMainSignalCatcherLoop,Thread*=0x7f76784000,peer=0x12c9d0a0,"Signal Catcher"]: reacting to signal 3
--------- beginning of crash
12-19 07:05:14.020 21341 21346 F libc : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x7f66c824e0 in tid 21346 (Signal Catcher)
12-19 07:05:14.132 492 492 I SELinux : SELinux: Loaded file_contexts contexts from /file_contexts.
12-19 07:05:14.135 492 492 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
12-19 07:05:14.136 492 492 F DEBUG : Build fingerprint: 'google/bullhead/bullhead:6.0.1/MMB29Q/2480792:userdebug/dev-keys'
12-19 07:05:14.136 492 492 F DEBUG : Revision: 'rev_1.0'
12-19 07:05:14.136 492 492 F DEBUG : ABI: 'arm64'
12-19 07:05:14.136 492 492 F DEBUG : pid: 21341, tid: 21346, name: Signal Catcher >>> org.chromium.chrome:sandboxed_process0 <<<
12-19 07:05:14.136 492 492 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x7f66c824e0
12-19 07:05:14.150 492 492 F DEBUG : x0 00000000002c24c0 x1 0000007f7fdca508 x2 0000000000004001 x3 0000007f7bf89440
12-19 07:05:14.150 492 492 F DEBUG : x4 00000000000033bf x5 0000000000004001 x6 0000000000000000 x7 0000007f7fdca50c
12-19 07:05:14.150 492 492 F DEBUG : x8 0000000000005362 x9 0000000000005362 x10 0000007f7fdca50c x11 0000000000004000
12-19 07:05:14.150 492 492 F DEBUG : x12 0000000000004001 x13 0000000000000000 x14 0000000000000001 x15 0000000000000fc0
12-19 07:05:14.151 492 492 F DEBUG : x16 0000007f7fdbd910 x17 0000000000000000 x18 0000000000000fc0 x19 0000007f7e26c000
12-19 07:05:14.151 492 492 F DEBUG : x20 0000007f7fdcabb8 x21 0000007f66c824c0 x22 0000007f7bf88440 x23 0000007f669c0000
12-19 07:05:14.151 492 492 F DEBUG : x24 0000007f7bf88448 x25 00000000000033bf x26 0000000000000001 x27 0000007f7bf88358
12-19 07:05:14.151 492 492 F DEBUG : x28 0000007f7e074d10 x29 0000007f7bf882b0 x30 0000007f7e249788
12-19 07:05:14.151 492 492 F DEBUG : sp 0000007f7bf882b0 pc 0000007f7e2497b0 pstate 0000000060000000
12-19 07:05:14.156 11697 11720 I Process : Sending signal. PID: 21314 SIG: 3
12-19 07:05:14.156 21314 21319 I art : Thread[2,tid=21319,WaitingInMainSignalCatcherLoop,Thread*=0x7f76784000,peer=0x12c9b0a0,"Signal Catcher"]: reacting to signal 3
12-19 07:05:14.165 492 492 F DEBUG :
12-19 07:05:14.165 492 492 F DEBUG : backtrace:
12-19 07:05:14.166 492 492 F DEBUG : #00 pc 00000000000077b0 /system/lib64/libunwind.so
12-19 07:05:14.166 492 492 F DEBUG : #01 pc 0000000000007cf4 /system/lib64/libunwind.so (_ULaarch64_dwarf_find_debug_frame+320)
12-19 07:05:14.167 492 492 F DEBUG : #02 pc 0000000000008244 /system/lib64/libunwind.so
12-19 07:05:14.167 492 492 F DEBUG : #03 pc 0000000000003868 /system/bin/linker64 (__dl__Z18do_dl_iterate_phdrPFiP12dl_phdr_infomPvES1_+96)
12-19 07:05:14.167 492 492 F DEBUG : #04 pc 0000000000003324 /system/bin/linker64 (__dl_dl_iterate_phdr+44)
12-19 07:05:14.167 492 492 F DEBUG : #05 pc 00000000000085b8 /system/lib64/libunwind.so
12-19 07:05:14.167 492 492 F DEBUG : #06 pc 00000000000062c0 /system/lib64/libunwind.so
12-19 07:05:14.167 492 492 F DEBUG : #07 pc 00000000000071f4 /system/lib64/libunwind.so
12-19 07:05:14.167 492 492 F DEBUG : #08 pc 0000000000007618 /system/lib64/libunwind.so
12-19 07:05:14.167 492 492 F DEBUG : #09 pc 0000000000015ca8 /system/lib64/libunwind.so (_ULaarch64_step+40)
12-19 07:05:14.167 492 492 F DEBUG : #10 pc 0000000000007128 /system/lib64/libbacktrace.so (UnwindCurrent::UnwindFromContext(unsigned long, ucontext*)+360)
12-19 07:05:14.168 492 492 F DEBUG : #11 pc 0000000000005094 /system/lib64/libbacktrace.so (BacktraceCurrent::UnwindThread(unsigned long)+556)
12-19 07:05:14.168 492 492 F DEBUG : #12 pc 000000000048b108 /system/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, int, char const*, art::ArtMethod*, void*)+236)
12-19 07:05:14.168 492 492 F DEBUG : #13 pc 000000000045a2c8 /system/lib64/libart.so (art::Thread::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&) const+220)
12-19 07:05:14.168 492 492 F DEBUG : #14 pc 0000000000466ea8 /system/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+688)
12-19 07:05:14.168 492 492 F DEBUG : #15 pc 0000000000467ea0 /system/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*)+500)
12-19 07:05:14.168 492 492 F DEBUG : #16 pc 000000000046847c /system/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+188)
12-19 07:05:14.168 492 492 F DEBUG : #17 pc 0000000000468d64 /system/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+492)
12-19 07:05:14.169 492 492 F DEBUG : #18 pc 0000000000432474 /system/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+96)
12-19 07:05:14.169 492 492 F DEBUG : #19 pc 000000000043f8e4 /system/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1256)
12-19 07:05:14.169 492 492 F DEBUG : #20 pc 00000000004404f4 /system/lib64/libart.so (art::SignalCatcher::Run(void*)+452)
12-19 07:05:14.169 492 492 F DEBUG : #21 pc 0000000000066d24 /system/lib64/libc.so (__pthread_start(void*)+52)
12-19 07:05:14.169 492 492 F DEBUG : #22 pc 000000000001eb84 /system/lib64/libc.so (__start_thread+16)
12-19 07:05:14.259 21314 21319 F libc : Fatal signal 11 (SIGSEGV), code 2, fault addr 0x7f5aa824e0 in tid 21319 (Signal Catcher)
12-19 07:05:14.329 492 492 W debuggerd64: type=1400 audit(0.0:54): avc: denied { search } for name="org.chromium.chrome" dev="dm-2" ino=377236 scontext=u:r:debuggerd:s0 tcontext=u:object_r:app_data_file:s0:c512,c768 tclass=dir permissive=0
12-19 07:05:14.356 11697 11720 I Process : Sending signal. PID: 21377 SIG: 3
12-19 07:05:14.356 21377 21381 I art : Thread[2,tid=21381,WaitingInMainSignalCatcherLoop,Thread*=0x7f76784000,peer=0x12c9f0a0,"Signal Catcher"]: reacting to signal 3
12-19 07:05:14.437 21377 21381 I art : Wrote stack traces to '/data/anr/traces.txt'
12-19 07:05:14.438 11697 11720 E ActivityManager: ANR in com.android.systemui
,
Dec 19 2016
Yeah, thanks for the heads-up. Here are a few more from just looking at ANSR:
https://chromium-swarm.appspot.com/task?id=33313e776317ad10
https://chromium-swarm.appspot.com/task?id=3331395650589210

My original change hasn't been live on these devices since they were reflashed. We need to come up with a new theory as to what's causing them. I'll start investigating again.
,
Dec 19 2016
,
Dec 19 2016
A slightly different symptom here:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_optional_gpu_tests_rel/builds/1435
https://chromium-swarm.appspot.com/task?id=3323f1e2b8a47b10&refresh=10&show_raw=1

12-17 04:57:09.265 5229 5229 E mm-camera-intf: mm_camera_open:Failed with Connection timed out error, retrying after 20 milli-seconds
12-17 04:57:10.195 6577 6600 E chromium: [ERROR:gpu_watchdog_thread.cc(373)] The GPU process hung. Terminating after 10000 ms.
12-17 04:57:16.482 949 4309 E WifiHAL : Error polling socket
12-17 04:57:19.322 5229 5229 E mm-camera-intf: mm_camera_open:Failed with Connection timed out error, retrying after 20 milli-seconds
12-17 04:57:19.811 6514 6570 E chromium: [ERROR:browser_gpu_channel_host_factory.cc(113)] Failed to launch GPU process.
12-17 04:57:20.340 949 966 E ActivityManager: ANR in com.google.android.googlequicksearchbox:search

Looks to me like the camera messages and com.android.systemui ANRs are just symptoms of the whole device getting stuck.
,
Dec 19 2016
For future reference, here's a bugreport dump of a phone currently experiencing hanging camera services and crashing systemuis.
,
Dec 19 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/ddd18b00966dc32af2c8aa940692785736040963

commit ddd18b00966dc32af2c8aa940692785736040963
Author: John Budorick <jbudorick@google.com>
Date: Mon Dec 19 21:53:56 2016
,
Dec 20 2016
Filed b/33759402. Let's see where that gets us.
,
Dec 20 2016
I suggest a higher priority than P2/S4 for that bug.
,
Dec 20 2016
,
Dec 20 2016
I'd like to suggest that, if the main outage is over, this bug either be closed and new dependent ones filed, or downgraded from P0, which is supposed to be reserved for emergencies. I'm still seeing some random test failures on this bot, like:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/89392
https://chromium-swarm.appspot.com/task?id=33342d22304e1310&refresh=10&show_raw=1
which look like the connection to the device was lost. Is there another bug filed about that?
,
Dec 20 2016
The rate of failure has definitely decreased, so I'll downgrade the priority. But there's still some cleanup needed. It turns out that some of the original flashing tasks didn't land on all the devices, so I've been hunting down those that still need it and retriggering them. The failure in #100 is a timeout, and any failures after it hits that point might just be fallout from the swarming bot starting to reclaim resources/devices/files/etc on the host.
,
Jan 21 2017
Is there more to do here or can this be closed now?
,
Jan 21 2017
Nope. The N5X pool is healthy again and has been running tests normally for the past month or so.
,
Dec 22 2017