File/MediaTest.VideoBearMp4/0 - content_browsertests failing on 2 builders
Issue description

After talking with jbudorick@, it seems like content browsertests (or gtests in general) have a 2 minute timeout and this is not exceeding it.

content_browsertests failing on 2 builders

Type: build-failure
Builders failed on:
- KitKat Tablet Tester: https://build.chromium.org/p/chromium.android/builders/KitKat%20Tablet%20Tester
- Lollipop Tablet Tester: https://build.chromium.org/p/chromium.android/builders/Lollipop%20Tablet%20Tester
First failing build: https://uberchromegw.corp.google.com/i/chromium.android/builders/KitKat%20Tablet%20Tester/builds/6593
Media related change: https://codereview.chromium.org/2629223003
,
Jan 18 2017
Can't repro it on my KK Nexus 4 or L Nexus 7. It also looks like content_browsertests runs have succeeded since it failed, so it must be flaky. I'm going to repeat the test a bunch of times and see if I can make it flake. The logs from the bots don't really contain anything helpful AFAICT.
,
Jan 19 2017
+jbudorick Agreed, the logs don't have anything. John tried to trace it to the timeout killer, but I'll defer to him to clarify. I opened a bug asking for better logging around when tests start and stop (regardless of success state), but that might be quite difficult in the current setup.
,
Jan 19 2017
Thanks, I'm temporarily disabling this test on Android (https://codereview.chromium.org/2644773002/). I don't want to revert the suspect commit yet because even if it is causing this flakiness, I want to get more data from dev about the effect of that change on the crash it's supposed to fix. I've run this test 80ish times locally now and haven't seen a failure, so I'm wondering if there's a way I can repro with more logging on the bots? Now that I think about it, since the tests are running in the browser process I can probably manually change the log level (or command line?) from the test and it will automatically propagate to the children.
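For reference, a minimal sketch of what that could look like, assuming the standard ContentBrowserTest::SetUpCommandLine hook and that logging switches are among those the browser forwards to its child (renderer/GPU) processes; the class name and switch values below are illustrative, not what the bots actually run with:

#include "base/command_line.h"
#include "content/public/common/content_switches.h"
#include "content/public/test/content_browser_test.h"

// Sketch: bump logging for the whole process tree from inside a browser test.
class NoisyMediaBrowserTest : public content::ContentBrowserTest {
 protected:
  void SetUpCommandLine(base::CommandLine* command_line) override {
    // Switches appended to the browser's command line here are forwarded to
    // the child processes it launches, so the GPU process picks them up too.
    command_line->AppendSwitch(switches::kEnableLogging);
    command_line->AppendSwitchASCII("vmodule", "*media_codec*=3,*video_decode*=3");
  }
};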
,
Jan 19 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/712079835ab9fd11466c0471d6232983bd52da38

commit 712079835ab9fd11466c0471d6232983bd52da38
Author: watk <watk@chromium.org>
Date: Thu Jan 19 01:47:39 2017

media: Disable the flaky bear.mp4 browser test on Android

Disabling this test until we can fix the flakiness.

BUG=682387
Review-Url: https://codereview.chromium.org/2644773002
Cr-Commit-Position: refs/heads/master@{#444596}

[modify] https://crrev.com/712079835ab9fd11466c0471d6232983bd52da38/content/browser/media/media_browsertest.cc
,
Jan 19 2017
@watk when you're running locally, are you only running File/MediaTest.VideoBearMp4/0?
,
Jan 19 2017
I ran 100 retries of that single test. And I did 40 of '*Mp4*'. I'll try *Media* now.
,
Jan 19 2017
That's on your N7?
,
Jan 19 2017
Yep. Probably N4 is more likely to break though, so I'll switch to that actually.
,
Jan 19 2017
Given the failure pattern (http://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=content_browsertests&tests=File%2FMediaTest.VideoBearMp4%2F0), I'd stick with the tablet. Are you running a release build or a debug build?
,
Jan 19 2017
Debug
,
Jan 20 2017
Ooh, I saw some failures in BearSilentMp4 (not BearMp4), so that's weird. But it looks like the same timeout failure mode. This was with a N7, same KK build as the bots, 10 retries of *Media*. I had a lot of logging enabled, but unfortunately not enough to figure out what the problem is. I'll keep trying.
,
Jan 20 2017
I can track down an N7 tomorrow and try to repro, too.
,
Jan 20 2017
I think we're running into a bug we've seen in the past. Sometimes mediaserver just dies and media stops working until you reboot the device. Attached a logcat snippet of a failed run where you can see that the test times out. The media element is suspended because it hangs for 15 seconds, and there are no calls to queue input to MediaCodec for some reason.
1) Where are we hanging? Or is it an unreported error from AVDA? It shouldn't be codec creation because GVD init succeeds. But we never try to queue an input buffer.
2) How do other h264 tests succeed once we get into this state? Does it depend on the profile or something?
3) Why would my CL to swap the surface release order have any impact on this?
4) How do we work around it? Maybe it's the same or similar issue we saw with VP8, which was fixed by always flushing the decoder before releasing it.
,
Jan 20 2017
Forgot to mention an important fact: after the tests were failing I tried playing h264 big buck bunny in chrome and it failed in the same way. And the auto thread was hung when the codec was released.
,
Jan 20 2017
Why does this hang the test though? Are we still pending frames to vend?
,
Jan 20 2017
Yeah, we don't decode a single frame but the test is waiting for 'ended'. I have no idea why. Nothing is literally hanging because the GPU thread is still running. It's more like we are never able to dequeue an input buffer or something. I'm re-running it with more logging but it took hours to repro it yesterday :(
,
Jan 20 2017
I wonder if tearing down the surface prior to mediacodec dumps out a buffer which completes ended. Are we trying to queue an EOS and wait for it even though no frames were ever queued? I thought we fixed that, but maybe not.
,
Jan 20 2017
My comment was ambiguous. When I said "we don't decode a single frame" I mean the video is 1s long but AVDA decodes zero frames. It submits zero to MediaCodec. So it looks like:
1) Something breaks mediaserver in test run 1, but the test succeeds.
2) Test run 2 initializes successfully, but times out because it can't decode because mediaserver is broken.
I want to understand the mechanics of 2), and then we have to see about working around 1), which was probably caused by my change somehow.
,
Jan 20 2017
That circus of events sounds likely to me.
,
Jan 21 2017
Ok, yep, when things break DequeueInputBuffer() always returns TRY_AGAIN_LATER. After 1 second we stop trying, and of course the element eventually suspends and the test times out. Open questions:
1) What on earth breaks mediaserver, and is it related to my change or not? One thing of note: my mediaserver pid is ~4000 but it's ~180 after a reboot. So mediaserver restarted at some point. Likely related.
2) How could we reliably "test" our fresh MediaCodec to see if it's in this broken state? If there's no reliable test, we should consider never getting an input buffer from a fresh codec for 1 second to be an error (see the sketch below). (But I'd really rather not implement this.)
The icing on the cake is that apparently the software decoder that we fall back to on KK doesn't support baseline h264 (https://b.corp.google.com/issues/28585610#comment6), so h264 is pretty much broken at this point.
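For illustration, a rough sketch of that "never get an input buffer for 1 second" check from 2) above; DequeueStatus and the try_dequeue_input callback are hypothetical stand-ins for whatever the real MediaCodec wrapper exposes:

#include <functional>

#include "base/time/time.h"

// Hypothetical status values standing in for the real wrapper's return codes.
enum class DequeueStatus { kOk, kTryAgainLater, kError };

// Returns true if a freshly configured codec never hands out an input buffer
// within ~1 second; the caller would then report a decode error instead of
// letting the element suspend and the test time out.
bool FreshCodecLooksBroken(const std::function<DequeueStatus()>& try_dequeue_input) {
  const base::TimeTicks deadline =
      base::TimeTicks::Now() + base::TimeDelta::FromSeconds(1);
  while (base::TimeTicks::Now() < deadline) {
    switch (try_dequeue_input()) {
      case DequeueStatus::kOk:
        return false;  // Got an input buffer; the codec looks healthy.
      case DequeueStatus::kError:
        return true;   // Hard error; definitely broken.
      case DequeueStatus::kTryAgainLater:
        break;         // Keep polling until the deadline.
    }
  }
  return true;  // Never vended an input buffer within the window.
}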
,
Jan 24 2017
Can you confirm that the ST from the passing test is actually destroyed? Or is it lingering in some limbo state and hence the next set fails.
,
Jan 24 2017
The last passing test successfully releases the ST. The first failing test hangs in MediaCodec#release() by the look of it. And the ST is never released. It looks like the failure corresponds with a mediaserver crash too:
110301 E/OMXNodeInstance( 181): !!! Observer died. Quickly, do something, ... anything...
So we're doing something that causes a mediaserver crash, and that breaks the device. I remember seeing mention of a resource leak at the hardware decoder level in a buganizer. This seems a lot like that: requires a restart to fix. I'll see if I can figure out what crashes mediaserver.
,
Jan 26 2017
I've discovered that the tests that trigger this are the VideoSizeTests, where the test only waits for the size metadata. In these:
* metadata arrives and the renderer exits/is killed before AVDA init finishes.
* AVDA::Destroy is called before the async codec creation completes.
* When the async codec creation returns it releases its codec on the main thread because AVDA is gone.
* Before MC#release() returns the gpu process dies. Looks like it's killed because I don't see any evidence of a crash. Assumedly once the test completes the renderer and gpu are killed.
* Then in logcat:
I/ActivityManager( 570): Process org.chromium.content_browsertests_apk:privileged_process0 (pid 5133) has died.
E/OMXNodeInstance( 181): !!! Observer died. Quickly, do something, ... anything...
E/OMX-VDEC-1080P( 181):
E/OMX-VDEC-1080P( 181): Release output buffer failed in VCD
E/OMX-VDEC-1080P( 181): VDEC_IOCTL_FREE_H264_MV_BUFFER failed
So my hypothesis is that killing the GPU process while we have a MediaCodec allocated leads to this codec leak.
,
Jan 26 2017
good catch! if the renderer is shutting down properly, which it probably is if AVDA::Destroy is called by WMPI, then i suppose that we could make AVDA synchronously wait for MC destruction in ActualDestroy (with short timeout). it's usually pretty fast. but that's very hacky and narrowly scoped. the real problem is actual gpu crashes, and that seems unfixable.
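Purely to make that concrete, a sketch in plain C++ (not real AVDA code) of "wait synchronously for the MediaCodec release, but give up after a short timeout"; release_codec stands in for the actual release call and 500 ms is an arbitrary choice:

#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Sketch only: run the (possibly wedged) release on a detached thread and
// wait briefly, so teardown can't hang forever on a broken codec.
void ReleaseWithShortTimeout(std::function<void()> release_codec) {
  std::promise<void> released;
  std::future<void> done = released.get_future();
  std::thread([p = std::move(released), release = std::move(release_codec)]() mutable {
    release();      // Usually returns almost immediately.
    p.set_value();  // Signal the waiting thread.
  }).detach();
  // Short grace period; if the codec is stuck we move on rather than
  // blocking teardown indefinitely.
  done.wait_for(std::chrono::milliseconds(500));
}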
,
Jan 27 2017
Actually I think AVDA::Destroy() is being called by the thing that owns it observing the gpu channel deletion. And the channel is torn down when it notices the renderer died. I don't see any evidence of a clean shutdown of the renderer. It doesn't appear to be as simple as the gpu process dying while we have a codec, because I tried 'adb shell kill <pid>' while video was playing a number of times and never got mediaserver to permanently break. I saw the same "observer died" message, but not "E/OMX-VDEC-1080P( 181): Release output buffer failed in VCD". It's also not as simple as a crash during a release(). I tried posting the release and calling raise(SIGTERM). And from the logs I verified that the release was in progress when the process died. So I'm not sure what else to try.
,
Jan 27 2017
you mention that "before MC#release returns the gpu process dies". do you know how long the release was pending for? seems like that should be a fairly short function call. i'm wondering if MC was already in a bad state before the release happened. in this case, something bad happens to mediaserver, then later release() on main thread => release waits forever. independently, gpu process is torn down by test => media server connection gets dropped => mediaserver tries to clean up (observer died) => mediaserver is in a bad state and it can't clean up (release output buffer failed).
,
Jan 27 2017
Tried reverting my CL to post the ST destruction and it never hangs :( One of two things happens:

1) ~ST before MC#configure():
01-26 17:08:02.639 5842 5860 E chromium: [5842:5860:0126/170802.652856:768077623:ERROR:surface_texture.cc(31)] ~SurfaceTexture
01-26 17:08:02.639 5842 5917 E BufferQueue: [unnamed-5842-0] connect: BufferQueue has been abandoned!
01-26 17:08:02.639 5842 5917 E MediaCodec: native_window_api_connect returned an error: No such device (-19)
01-26 17:08:02.649 5842 5915 E cr_MediaCodecBridge: Cannot configure the video codec
01-26 17:08:02.649 5842 5915 E cr_MediaCodecBridge: java.lang.IllegalStateException
01-26 17:08:02.649 5842 5915 E cr_MediaCodecBridge: at android.media.MediaCodec.native_configure(Native Method)
01-26 17:08:02.649 5842 5915 E cr_MediaCodecBridge: at android.media.MediaCodec.configure(MediaCodec.java:262)
01-26 17:08:02.649 5842 5915 E cr_MediaCodecBridge: at org.chromium.media.MediaCodecBridge.configureVideo(MediaCodecBridge.java:545)
01-26 17:08:02.649 5842 5915 E cr_MediaCodecBridge: at dalvik.system.NativeStart.run(Native Method)
01-26 17:08:02.649 5842 5915 W cr_MediaCodecBridge: calling MediaCodec.release() on OMX.qcom.video.decoder.avc
01-26 17:08:02.669 5842 5915 W cr_MediaCodecBridge: MediaCodec.release() completed

2) ~ST after MC#configure():
01-26 17:07:35.012 5155 5239 W cr_MediaCodecBridge: calling MediaCodec.release() on OMX.qcom.video.decoder.avc
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5177 E chromium: [5155:5177:0126/170735.031182:740455949:ERROR:surface_texture.cc(31)] ~SurfaceTexture
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022 5155 5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.042 5155 5239 W cr_MediaCodecBridge: MediaCodec.release() completed
,
Jan 27 2017
Re #28: The time between calling MediaCodec#release() and the "observer died" message is about 70ms. And then tens of ms later ActivityManager says the process died. I also saw it hang in release() for 120s one time, and 120s is the shard timeout for the test runner. I don't know why sometimes the test runner would have to kill it, whereas usually it's killed immediately. In both cases the gpu main thread is in release() when it dies. To expand on #29: I don't think my CL causes this because it's incorrect. Rather, our previously incorrect behavior actually helped us avoid a hang by making it a benign (on this device) error instead. Maybe we'll have to do something bizarre like dropping the ST before the MC on some devices.
,
Jan 27 2017
Latest findings:
* MediaCodec#release() is a red herring. I tried putting a 2 second sleep before calling release and things break in the same way, but before release() is called.
* mediaserver crashing is unrelated. This issue occurs even if mediaserver never crashes.
I still can't repro this in a normal chrome build by killing the gpu process manually, so I still can't say what the exact conditions are. I'm going to work on a workaround for our tests for now.
,
Jan 27 2017
Ok, I can finally repro this with the following change to OnCodecConfigured. The new discovery is that it's timing sensitive. When I wait 100ms before killing the process, OMX-VDEC doesn't log any errors and things keep working. If I only wait 50ms I see the same error from OMX-VDEC, and after repeating this for a total of three times, the device is in the broken state.
@@ -924,6 +944,15 @@ void AndroidVideoDecodeAccelerator::OnCodecConfigured(
DCHECK(thread_checker_.CalledOnValidThread());
DCHECK(state_ == WAITING_FOR_CODEC || state_ == SURFACE_DESTROYED);
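+ // Debug-only repro hook (local patch, not in the tree): getProp() reads a
+ // device system property named "killit"; 1 means kill the GPU process
+ // immediately after codec configuration, and a value > 1 means sleep that
+ // many milliseconds first.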
+ int prop = getProp("killit");
+ LOG(ERROR) << "PROP IS : " << prop;
+ if (prop == 1) {
+ raise(SIGTERM);
+ } else if (prop > 1) {
+ base::PlatformThread::Sleep(base::TimeDelta::FromMilliseconds(prop));
+ raise(SIGTERM);
+ }
,
Jan 27 2017
wow, really nice detective work. i wonder if there's something we can do in OnCodecConfigured (or maybe even before the post back to the main thread) that could kick it out of the bad state. at least minimize the window where a crash will cause issues. no idea what. poll for buffers or something. i guess there aren't too many options.
,
Jan 27 2017
I tried DequeueInputBuffer() and Flush() but neither one made a difference :( We could install a SIGTERM handler that calls sleep() if we recently allocated a codec. I don't want that to be my legacy in Chrome :P That only works if SIGTERM is a non-negligible percentage of the shutdowns. They're probably usually SIGKILL from the framework though, I would guess.
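Just to spell out that (half-joking) idea, a sketch of what the handler might look like; g_last_codec_allocation is a hypothetical flag the decoder would set right after allocating a codec:

#include <atomic>
#include <csignal>
#include <ctime>
#include <unistd.h>

// Hypothetical: updated by the decoder right after a MediaCodec is allocated.
std::atomic<std::time_t> g_last_codec_allocation{0};

void DelayedTermHandler(int sig) {
  // If a codec was allocated within the last second, stall briefly so the
  // framework-side setup can settle before the process dies. Only
  // async-signal-safe calls (time, sleep, signal, raise) are used here.
  if (std::time(nullptr) - g_last_codec_allocation.load() <= 1)
    sleep(1);
  // Restore the default disposition and re-raise so the process still exits.
  std::signal(sig, SIG_DFL);
  std::raise(sig);
}

void InstallSigtermWorkaround() {
  std::signal(SIGTERM, &DelayedTermHandler);
}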
,
Jan 28 2017
based on c#30, you suspect that if we release the ST first, then killing the process no longer triggers the bug. does that work with your OnCodecConfigured repro? +1 on not adding a sleep. :)
,
Jan 28 2017
It doesn't fix it. I guess it was helping before because MC#release() was being called. So there's always a window after codec creation where the process exiting will break things. I think I'm going to assume this happens rarely in practice and make a workaround for the tests. I'll also see if I can repro this on newer Android releases.
,
Jan 28 2017
sgtm.
,
Jan 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/54db2003b44f32d6618e70682b84fb8d13fd4e15

commit 54db2003b44f32d6618e70682b84fb8d13fd4e15
Author: watk <watk@chromium.org>
Date: Sat Jan 28 04:12:33 2017

media: Add an Android bug workaround to the video size media browsertests

The video size tests previously only waited for loadedmetadata before completing and killing the browser processes. This resulted in some unfortunate timing where the GPU process was being killed shortly after a MediaCodec was allocated, triggering a framework bug that put the device into a broken state until it's rebooted. Since we don't have any way to fix the underlying issue this CL changes the test so it no longer triggers the bug. The test now waits for the decoder to initialize by waiting for loadeddata.

BUG=682387
Review-Url: https://codereview.chromium.org/2655333006
Cr-Commit-Position: refs/heads/master@{#446906}

[modify] https://crrev.com/54db2003b44f32d6618e70682b84fb8d13fd4e15/content/browser/media/media_browsertest.cc
[modify] https://crrev.com/54db2003b44f32d6618e70682b84fb8d13fd4e15/media/test/data/player.html