Issue 682387

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: ----
Type: ----




File/MediaTest.VideoBearMp4/0 - content_browsertests failing on 2 builders

Project Member Reported by tedc...@chromium.org, Jan 18 2017

Issue description

After talking with jbudorick@, it seems like content browsertests (or gtests in general) have a 2 minute timeout and this is now exceeding it.

content_browsertests failing on 2 builders

Type: build-failure

Builders failed on: 
- KitKat Tablet Tester: 
  https://build.chromium.org/p/chromium.android/builders/KitKat%20Tablet%20Tester
- Lollipop Tablet Tester: 
  https://build.chromium.org/p/chromium.android/builders/Lollipop%20Tablet%20Tester

First failing build:
https://uberchromegw.corp.google.com/i/chromium.android/builders/KitKat%20Tablet%20Tester/builds/6593

Media related change:
https://codereview.chromium.org/2629223003


 

Comment 1 by w...@chromium.org, Jan 18 2017

Taking a look. It looks like it is timing out, based on both the logcat messages and the 120 seconds that elapse between retries by the test runner.

Comment 2 by w...@chromium.org, Jan 18 2017

Can't repro it on my KK Nexus 4 or L Nexus 7. It also looks like content_browsertests runs have succeeded since it failed, so it must be flaky.

I'm going to repeat the test a bunch of times and see if I can make it flake. The logs from the bots don't really contain anything helpful AFAICT.
Comment 3

Cc: jbudorick@chromium.org
+jbudorick

Agreed, the logs don't have anything. John was able to trace it to the timeout killer, but I'll defer to him to clarify. I opened a bug asking for better logging around when tests start and stop (regardless of success state), though that might be quite difficult in the current setup.

Comment 4 by w...@chromium.org, Jan 19 2017

Thanks, I'm temporarily disabling this test on Android (https://codereview.chromium.org/2644773002/). I don't want to revert the suspect commit yet because even if it is causing this flakiness, I want to get more data from dev about the effect of that change on the crash it's supposed to fix.

I've run this test 80ish times locally now, and haven't seen a failure, so I'm wondering if there's a way I can repro with more logging on the bots?

Now that I think about it, since the tests are running in the browser process I can probably manually change the log level (or command line?) from the test and it will automatically propagate to the children.
Comment 5 by bugdroid1@chromium.org (Project Member), Jan 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/712079835ab9fd11466c0471d6232983bd52da38

commit 712079835ab9fd11466c0471d6232983bd52da38
Author: watk <watk@chromium.org>
Date: Thu Jan 19 01:47:39 2017

media: Disable the flaky bear.mp4 browser test on Android

Disabling this test until we can fix the flakiness.

BUG= 682387 

Review-Url: https://codereview.chromium.org/2644773002
Cr-Commit-Position: refs/heads/master@{#444596}

[modify] https://crrev.com/712079835ab9fd11466c0471d6232983bd52da38/content/browser/media/media_browsertest.cc

Comment 6

@watk when you're running locally, are you only running File/MediaTest.VideoBearMp4/0?

Comment 7 by w...@chromium.org, Jan 19 2017

I ran 100 retries of that single test. And I did 40 of '*Mp4*'. I'll try *Media* now.
Comment 8

That's on your N7?

Comment 9 by w...@chromium.org, Jan 19 2017

Yep. The N4 is probably more likely to break though, so I'll switch to that actually.
Comment 10

Given the failure pattern (http://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=content_browsertests&tests=File%2FMediaTest.VideoBearMp4%2F0), I'd stick with the tablet.

Are you running a release build or a debug build?

Comment 11 by w...@chromium.org, Jan 19 2017

Debug

Comment 12 by w...@chromium.org, Jan 20 2017

Ooh, I saw some failures in BearSilentMp4 (not BearMp4), so that's weird. But it looks like the same timeout failure mode. This was with an N7, same KK build as the bots, 10 retries of *Media*.

I had a lot of logging enabled, but unfortunately not enough to figure out what the problem is. I'll keep trying.
Comment 13

I can track down an N7 tomorrow and try to repro, too.
Comment 14

Components: Internals>Media
Labels: OS-Android

Comment 15 by w...@chromium.org, Jan 20 2017

Cc: liber...@chromium.org
I think we're running into a bug we've seen in the past. Sometimes mediaserver just dies and media stops working until you reboot the device. Attached a logcat snippet of a failed run where you can see that the test times out. The media element is suspended because it hangs for 15 seconds, and there are no calls to queue input to MediaCodec for some reason.

1) Where are we hanging? Or is it an unreported error from AVDA? It shouldn't be codec creation, because GVD init succeeds. But we never try to queue an input buffer.
2) How do other h264 tests succeed once we get into this state? Does it depend on the profile or something?
3) Why would my CL to swap the surface release order have any impact on this?
4) How do we work around it? Maybe it's the same or a similar issue to the one we saw with VP8, which was fixed by always flushing the decoder before releasing it (see the sketch below).
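
For concreteness, a minimal sketch of that flush-before-release pattern from 4), written against the plain android.media.MediaCodec API (illustrative only, not AVDA's actual code; safeRelease is a made-up helper name):

  import android.media.MediaCodec;

  final class CodecTeardown {
      // Flush first so the decoder hands back all in-flight buffers, then
      // stop and release. This mirrors the VP8 fix mentioned above: never
      // release a codec that still owns buffers.
      static void safeRelease(MediaCodec codec) {
          try {
              codec.flush();
              codec.stop();
          } catch (IllegalStateException e) {
              // Codec may already be dead or mid-error; release it anyway.
          } finally {
              codec.release();
          }
      }
  }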

Comment 16 by w...@chromium.org, Jan 20 2017

Forgot to mention an important fact: after the tests were failing I tried playing h264 big buck bunny in chrome and it failed in the same way. And the auto thread was hung when the codec was released.
Comment 17

Why does this hang the test, though? Do we still have pending frames to vend?

Comment 18 by w...@chromium.org, Jan 20 2017

Yeah, we don't decode a single frame but the test is waiting for 'ended'. I have no idea why. Nothing is literally hanging because the GPU thread is still running. It's more like we are never able to dequeue an input buffer or something. I'm re-running it with more logging but it took hours to repro it yesterday :(
Comment 19

I wonder if tearing down the surface prior to MediaCodec dumps out a buffer that completes 'ended'. Are we trying to queue an EOS and wait for it even though no frames were ever queued? I thought we fixed that, but maybe not.
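
For reference, the EOS handshake being speculated about, as a rough sketch with the plain android.media.MediaCodec API (AVDA's real flow differs):

  import android.media.MediaCodec;

  final class EosHandshake {
      // Queue an empty input buffer flagged EOS, then wait for the flag to
      // reappear on the output side. If the decoder never vends an input
      // buffer (the broken state above), the first loop spins until the
      // test harness times out.
      static void signalAndAwaitEos(MediaCodec codec) {
          int in;
          do {
              in = codec.dequeueInputBuffer(10000 /* timeoutUs */);
          } while (in == MediaCodec.INFO_TRY_AGAIN_LATER);
          codec.queueInputBuffer(in, 0, 0, 0, MediaCodec.BUFFER_FLAG_END_OF_STREAM);

          MediaCodec.BufferInfo info = new MediaCodec.BufferInfo();
          while (true) {
              int out = codec.dequeueOutputBuffer(info, 10000 /* timeoutUs */);
              if (out < 0) continue;  // TRY_AGAIN_LATER, format change, etc.
              boolean eos = (info.flags & MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0;
              codec.releaseOutputBuffer(out, false /* render */);
              if (eos) break;
          }
      }
  }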

Comment 20 by w...@chromium.org, Jan 20 2017

My comment was ambiguous. When I said "we don't decode a single frame" I mean the video is 1s long but AVDA decodes zero frames. It submits zero to MediaCodec.

So it looks like:
1) Something breaks mediaserver in test run 1 but the test succeeds.
2) Test run 2 initializes successfully, but times out because it can't decode because mediaserver is broken.

I want to understand the mechanics of 2), and then we have to see about working around 1), which was probably caused by my change somehow.

Comment 21

That circus of events sounds likely to me.

Comment 22 by w...@chromium.org, Jan 21 2017

Ok, yep, when things break DequeueInputBuffer() always returns TRY_AGAIN_LATER. After 1 second we stop trying, and of course the element eventually suspends and the test times out.

Open questions:
1) What on earth breaks mediaserver and is it related to my change or not? One thing of note: my mediaserver pid is ~4000 but it's ~180 after a reboot. So mediaserver restarted at some point. Likely related.

2) How could we reliably "test" a fresh MediaCodec to see if it's in this broken state? If there's no reliable test, we should consider never getting an input buffer from a fresh codec within 1 second to be an error (but I'd really rather not implement this; see the sketch below).
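
A minimal sketch of that watchdog idea from 2), against the plain android.media.MediaCodec API (looksBroken and the 1-second budget are assumptions, not Chromium code):

  import android.media.MediaCodec;
  import android.os.SystemClock;

  final class FreshCodecProbe {
      // Probe a freshly configured and started codec: if it never vends an
      // input buffer within ~1s, assume the broken-mediaserver state where
      // dequeueInputBuffer() always reports INFO_TRY_AGAIN_LATER.
      static boolean looksBroken(MediaCodec codec) {
          long deadline = SystemClock.elapsedRealtime() + 1000;
          while (SystemClock.elapsedRealtime() < deadline) {
              if (codec.dequeueInputBuffer(10000 /* timeoutUs */) >= 0) {
                  codec.flush();  // hand the dequeued buffer back
                  return false;   // healthy
              }
          }
          return true;  // treat as an initialization error, not a hang
      }
  }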

The icing on the cake is that apparently the software decoder that we fall back to on KK doesn't support baseline h264 (https://b.corp.google.com/issues/28585610#comment6) so h264 is pretty much broken at this point.
Comment 23

Can you confirm that the ST from the passing test is actually destroyed? Or is it lingering in some limbo state, and hence the next set fails?

Comment 24 by w...@chromium.org, Jan 24 2017

The last passing test successfully releases the ST. The first failing test hangs in MediaCodec#release(), by the look of it. And the ST is never released.

It looks like the failure corresponds with a mediaserver crash too:
110301 E/OMXNodeInstance(  181): !!! Observer died. Quickly, do something, ... anything...

So we're doing something that causes a mediaserver crash, and that breaks the device. I remember seeing mention of a resource leak at the hardware decoder level in a Buganizer issue. This seems a lot like that: it requires a restart to fix. I'll see if I can figure out what crashes mediaserver.

Comment 25 by w...@chromium.org, Jan 26 2017

I've discovered that the tests that trigger this are the VideoSizeTests, where the test only waits for the size metadata. In these:

* metadata arrives and the renderer exits/is killed before AVDA init finishes.
* AVDA::Destroy is called before the async codec creation completes.
* When the async codec creation returns it releases its codec on the main thread because AVDA is gone.
* Before MC#release() returns, the gpu process dies. It looks like it's killed, because I don't see any evidence of a crash. Presumably once the test completes, the renderer and gpu processes are killed.
* Then in logcat:
  I/ActivityManager(  570): Process org.chromium.content_browsertests_apk:privileged_process0 (pid 5133) has died.
  E/OMXNodeInstance(  181): !!! Observer died. Quickly, do something, ... anything...
  E/OMX-VDEC-1080P(  181): 
  E/OMX-VDEC-1080P(  181): Release output buffer failed in VCD
  E/OMX-VDEC-1080P(  181): VDEC_IOCTL_FREE_H264_MV_BUFFER failed
 
So my hypothesis is that killing the GPU process while we have a MediaCodec allocated leads to this codec leak.
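
That hypothesis as a stand-alone sketch, using plain Android APIs (the format, sizes, and texture name are placeholders; the real repro is the C++ patch in comment 32 below):

  import android.graphics.SurfaceTexture;
  import android.media.MediaCodec;
  import android.media.MediaFormat;
  import android.os.Process;
  import android.view.Surface;

  final class KillWithCodecAlive {
      static void run() throws Exception {
          // Allocate a hardware decoder...
          MediaFormat format = MediaFormat.createVideoFormat("video/avc", 640, 360);
          SurfaceTexture st = new SurfaceTexture(0 /* texName */);
          MediaCodec codec = MediaCodec.createDecoderByType("video/avc");
          codec.configure(format, new Surface(st), null /* crypto */, 0 /* flags */);
          codec.start();
          // ...then die with it still allocated: no stop(), no release().
          Process.killProcess(Process.myPid());
      }
  }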
Comment 26

good catch!

if the renderer is shutting down properly, which it probably is if AVDA::Destroy is called by WMPI, then i suppose that we could make AVDA synchronously wait for MC destruction in ActualDestroy (with short timeout). it's usually pretty fast.  but that's very hacky and narrowly scoped.

the real problem is actual gpu crashes, and that seems unfixable.

Comment 27 by w...@chromium.org, Jan 27 2017

Actually I think AVDA::Destroy() is being called by the thing that owns it observing the gpu channel deletion. And the channel is torn down when it notices the renderer died. I don't see any evidence of a clean shutdown of the renderer.

It doesn't appear to be as simple as the gpu process dying while we have a codec, because I tried 'adb shell kill <pid>' while video was playing a number of times and never got mediaserver to permanently break. I saw the same "observer died" message, but not "E/OMX-VDEC-1080P(  181): Release output buffer failed in VCD".

It's also not as simple as a crash during a release(). I tried posting the release and calling raise(SIGTERM). And from the logs I verified that the release was in progress when the process died.

So I'm not sure what else to try.
Comment 28

you mention that "before MC#release returns the gpu process dies".  do you know how long the release was pending for?  seems like that should be a fairly short function call.

i'm wondering if MC was already in a bad state before the release happened.  in this case, something bad happens to mediaserver, then later release() on main thread => release waits forever.  independently, gpu process is torn down by test => media server connection gets dropped => mediaserver tries to clean up (observer died) => mediaserver is in a bad state and it can't clean up (release output buffer failed).

Comment 29 by w...@chromium.org, Jan 27 2017

Tried reverting my CL to post the ST destruction and it never hangs :(

One of two things happens:

1) ~ST before MC#configure():
01-26 17:08:02.639  5842  5860 E chromium: [5842:5860:0126/170802.652856:768077623:ERROR:surface_texture.cc(31)] ~SurfaceTexture
01-26 17:08:02.639  5842  5917 E BufferQueue: [unnamed-5842-0] connect: BufferQueue has been abandoned!
01-26 17:08:02.639  5842  5917 E MediaCodec: native_window_api_connect returned an error: No such device (-19)
01-26 17:08:02.649  5842  5915 E cr_MediaCodecBridge: Cannot configure the video codec
01-26 17:08:02.649  5842  5915 E cr_MediaCodecBridge: java.lang.IllegalStateException
01-26 17:08:02.649  5842  5915 E cr_MediaCodecBridge:   at android.media.MediaCodec.native_configure(Native Method)
01-26 17:08:02.649  5842  5915 E cr_MediaCodecBridge:   at android.media.MediaCodec.configure(MediaCodec.java:262)
01-26 17:08:02.649  5842  5915 E cr_MediaCodecBridge:   at org.chromium.media.MediaCodecBridge.configureVideo(MediaCodecBridge.java:545)
01-26 17:08:02.649  5842  5915 E cr_MediaCodecBridge:   at dalvik.system.NativeStart.run(Native Method)
01-26 17:08:02.649  5842  5915 W cr_MediaCodecBridge: calling MediaCodec.release() on OMX.qcom.video.decoder.avc
01-26 17:08:02.669  5842  5915 W cr_MediaCodecBridge: MediaCodec.release() completed


2) ~ST after MC#configure():
01-26 17:07:35.012  5155  5239 W cr_MediaCodecBridge: calling MediaCodec.release() on OMX.qcom.video.decoder.avc
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5177 E chromium: [5155:5177:0126/170735.031182:740455949:ERROR:surface_texture.cc(31)] ~SurfaceTexture
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.022  5155  5242 W BufferQueue: [unnamed-5155-0] cancelBuffer: BufferQueue has been abandoned!
01-26 17:07:35.042  5155  5239 W cr_MediaCodecBridge: MediaCodec.release() completed

Comment 30 by w...@chromium.org, Jan 27 2017

Re #28: The time between calling MediaCodec#release() and the "observer died" message is about 70ms. And then tens of ms later ActivityManager says the process died.

I also saw it hang in release() for 120s one time, and 120s is the shard timeout for the test runner. I don't know why sometimes the test runner would have to kill it, whereas usually it's killed immediately. In both cases the gpu main thread is in release() when it dies.

To expand on #29: I don't think my CL causes this because it's incorrect; rather, our previously incorrect behavior actually helped us avoid a hang by turning this into a benign (on this device) error instead. Maybe we'll have to do something bizarre like dropping the ST before the MC on some devices.
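
The two teardown orders under discussion, as a minimal sketch with plain Android APIs (illustrative only; AVDA's actual teardown is more involved):

  import android.graphics.SurfaceTexture;
  import android.media.MediaCodec;
  import android.view.Surface;

  final class TeardownOrders {
      // Drop the SurfaceTexture first: the codec's BufferQueue is abandoned,
      // so subsequent buffer operations fail with the benign "BufferQueue
      // has been abandoned!" errors shown in comment 29.
      static void surfaceTextureFirst(MediaCodec codec, Surface s, SurfaceTexture st) {
          st.release();
          s.release();
          codec.release();
      }

      // Release the codec while its output surface is still alive: on this
      // device, the process dying inside this window is what seems to wedge
      // mediaserver.
      static void codecFirst(MediaCodec codec, Surface s, SurfaceTexture st) {
          codec.release();
          s.release();
          st.release();
      }
  }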

Comment 31 by w...@chromium.org, Jan 27 2017

Latest findings:
* MediaCodec#release() is a red herring. I tried putting a 2 second sleep before calling release and things break in the same way, but before release() is called.
* mediaserver crashing is unrelated. This issue occurs even if mediaserver never crashes.

I still can't repro this in a normal chrome build by killing the gpu process manually so I still can't say what the exact conditions are.

I'm going to work on a workaround for our tests for now.

Comment 32 by w...@chromium.org, Jan 27 2017

Ok, I can finally repro this with the following change to OnCodecConfigured. The new discovery is that it's timing sensitive. When I wait 100ms before killing the process, OMX-VDEC doesn't log any errors and things keep working. If I only wait 50ms I see the same error from OMX-VDEC, and after repeating this for a total of three times, the device is in the broken state.  

@@ -924,6 +944,15 @@ void AndroidVideoDecodeAccelerator::OnCodecConfigured(
   DCHECK(thread_checker_.CalledOnValidThread());
   DCHECK(state_ == WAITING_FOR_CODEC || state_ == SURFACE_DESTROYED);

+  int prop = getProp("killit");
+  LOG(ERROR) << "PROP IS : " << prop;
+  if (prop == 1) {
+    raise(SIGTERM);
+  } else if (prop > 1) {
+    base::PlatformThread::Sleep(base::TimeDelta::FromMilliseconds(prop));
+    raise(SIGTERM);
+  }
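
Presumably getProp() here is a local debugging helper that reads an Android system property (the helper itself isn't shown in the patch), so the kill/delay can be toggled per run with something like adb shell setprop killit 50.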

Comment 33

wow, really nice detective work.

i wonder if there's something we can do in OnCodecConfigured (or maybe even before the post back to the main thread) that could kick it out of the bad state.  at least minimize the window where a crash will cause issues.

no idea what.  poll for buffers or something.  i guess there aren't too many options.

Comment 34 by w...@chromium.org, Jan 27 2017

I tried DequeueInputBuffer() and Flush() but neither one made a difference :(

We could install a SIGTERM handler that calls sleep() if we recently allocated a codec. I don't want that to be my legacy in Chrome :P

Comment 35

That only works if SIGTERM is a non-negligible percentage of the shutdowns. They're probably usually SIGKILL from the framework though, I would guess.
based on c#30, you suspect that if we release the ST first, then killing the process no longer triggers the bug.  does that work with your OnCodecConfigured repro?

+1 on not adding a sleep.  :)

Comment 36 by w...@chromium.org, Jan 28 2017

It doesn't fix it. I guess it was helping before because MC#release() was being called. So there's always a window after codec creation where the process exiting will break things.

I think I'm going to assume this happens rarely in practice and make a workaround for the tests. I'll also see if I can repro this on newer Android releases.
Comment 37

sgtm.
Comment 38 by bugdroid1@chromium.org (Project Member), Jan 28 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/54db2003b44f32d6618e70682b84fb8d13fd4e15

commit 54db2003b44f32d6618e70682b84fb8d13fd4e15
Author: watk <watk@chromium.org>
Date: Sat Jan 28 04:12:33 2017

media: Add an Android bug workaround to the video size media browsertests

The video size tests previously only waited for loadedmetadata before
completing and killing the browser processes. This resulted in some
unfortunate timing where the GPU process was being killed shortly after
a MediaCodec was allocated, triggering a framework bug that put the
device into a broken state until it's rebooted.

Since we don't have any way to fix the underlying issue this CL changes
the test so it no longer triggers the bug. The test now waits for the
decoder to initialize by waiting for loadeddata.

BUG= 682387 

Review-Url: https://codereview.chromium.org/2655333006
Cr-Commit-Position: refs/heads/master@{#446906}

[modify] https://crrev.com/54db2003b44f32d6618e70682b84fb8d13fd4e15/content/browser/media/media_browsertest.cc
[modify] https://crrev.com/54db2003b44f32d6618e70682b84fb8d13fd4e15/media/test/data/player.html

Comment 39 by w...@chromium.org, Mar 6 2017

Status: Fixed (was: Assigned)
