New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 662786 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

CRAS: handle severe underrun

Project Member Reported by cychiang@chromium.org, Nov 7 2016

Issue description

https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=21981917814

In this report, there was a severe underrun:

2016-11-04T11:01:53.814151+00:00 ERR cras_server[1919]: pcm_avail returned frames larger than buf_size: 500232 > 65536

This means appl_ptr is behind hw_ptr by more than 400000 samples.

CRAS tried to recover it by filling one block size of zeros (usually below 1024 samples).
However, one block size is too small compared with the amount of underrun.
This causes CRAS to wake up in a busy loop and fill one block of zeros to slightly recover underrun situation.

Since the amount of underrun is too large, it needs over 30 seconds to recover, and in this duration, CRAS consume most of the CPU and makes system freeze.

Apart from the root cause of the underrun (maybe it is DMA driver misbehave), CRAS should have a better way to handle this.

Since this seems to be a real failure, we should not try to recover it. Instead, we should let it error out and close the device.

The proposed plan to handle underrun:

1. In cras_alsa_get_avail_frames, we should have two paths for underrun:
  a) Underrun is not severe (less than 10 block size maybe ?) Recover it as current behavior.

  b) Underrun is severe, return an error code.

2. The error in cras_alsa_get_avail_frames pops to frames_queued and cras_iodev_frames_queued.
   cras_iodev_frames_to_play_in_sleep should pop this error to get_next_output_wake
   get_next_output_wake should pop this error to fill_next_sleep_interval
   We can log an error here. This can prevent us from scheduling a busy wake.

3. When audio thread next time reaches write_output_samples and calls cras_iodev_frames_queued, it will get the error and remove that device. 
 

 
Maybe make it a number of ms for a threshold, and make it pretty tight, an underrun of > 10ms is pretty insane and should probably be handled differently.

if it's greater than 10ms then can we reset the device?  stop/start or something like that.
Thank you Dylan. Use a time for threshold seems reasonable.
But I am not sure if we should recover it or not.
Silently reseting the device seems the best for user experience.
Thanks!

Status: Started (was: Assigned)
Currently iodev does not support "stop", and "start" is only optionally supported.

I found that it will be better to close and then open the device following the flow in iodev_list to properly reset the device.

We can use cras_main_message so audio thread can pass message to main thread, and let main thread disables/enables device in cras_iodev_list using a message handler.


Comment 4 by akhal...@gmail.com, Nov 17 2016

I'm seeing this error every microsecond on my HP Chromebook 13 (chell):

2016-11-17T09:31:24.732070-08:00 ERR cras_server[3727]: pcm_avail returned frames larger than buf_size: 21888 > 16384
2016-11-17T09:31:24.732222-08:00 ERR cras_server[3727]: pcm_avail returned frames larger than buf_size: 21376 > 16384
2016-11-17T09:31:24.732231-08:00 ERR cras_server[3727]: pcm_avail returned frames larger than buf_size: 21376 > 16384
2016-11-17T09:31:24.732235-08:00 ERR cras_server[3727]: pcm_avail returned frames

Comment 5 by dgreid@chromium.org, Nov 17 2016

Jimmy, can you start by making this print an ATLOG so we stop spamming the logs.
Patchset 5 in https://chromium-review.googlesource.com/#/c/411022/ log underrun message only 1 out of 10.
But the message in #4 looks like a severe underrun. appl_ptr is lagged by hw_ptr by more than 4000 samples.
The close/open should recover the severe underrun so it will not spam the logs.
Project Member

Comment 7 by bugdroid1@chromium.org, Nov 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/adhd/+/0caf60b0734cc215ad147320a42614635dd9fdac

commit 0caf60b0734cc215ad147320a42614635dd9fdac
Author: Cheng-Yi Chiang <cychiang@chromium.org>
Date: Mon Nov 14 10:35:15 2016

CRAS: alsa_io - Propagate error for severe underrun

Previously, cras_alsa_get_avail_frames masks underrun and only
logs an error message.
This patch sets a threshold for severe underrun.
If severe underrun happens, the error should be propagated to audio
thread.
The following commit will reset the device so device can be recovered
from severe underrun.

BUG= chromium:662786 
TEST=make check

Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org>
Change-Id: Ib2a385a808936d0673590c0c0ab73c4ed97ada89
Reviewed-on: https://chromium-review.googlesource.com/411022
Commit-Ready: Nicolas Boichat <drinkcat@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_alsa_helpers.h
[modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/tests/alsa_helpers_unittest.cc
[modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_alsa_helpers.c
[modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_iodev.h
[modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/tests/alsa_io_unittest.cc
[modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_alsa_io.c

Project Member

Comment 8 by bugdroid1@chromium.org, Nov 23 2016

Project Member

Comment 9 by bugdroid1@chromium.org, Nov 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/adhd/+/94f07550848b7e387bf063949c939866acc0225c

commit 94f07550848b7e387bf063949c939866acc0225c
Author: Cheng-Yi Chiang <cychiang@chromium.org>
Date: Tue Nov 15 09:12:56 2016

CRAS: device_monitor - Add device monitor to handle device request

Use device monitor to handle message from audio thread.
When audio thread finds a device in bad state, it can send a message to
main thread to reset a device properly from iodev_list.

BUG= chromium:662786 
TEST=make check
TEST=With hack patch of severe underrun, check device reset upon severe
     underrun. Also check audio thread log.

Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org>
Change-Id: Iac6d3f688f40cfe434002064407363151857ec6b
Reviewed-on: https://chromium-review.googlesource.com/411743
Commit-Ready: Nicolas Boichat <drinkcat@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_main_message.h
[modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/common/cras_types.h
[modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/tests/cras_test_client.c
[add] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_device_monitor.h
[modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/tests/audio_thread_unittest.cc
[modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/Makefile.am
[add] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_device_monitor.c
[modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_server.c
[modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/audio_thread.c
[add] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/tests/device_monitor_unittest.cc

Cc: gkihumba@chromium.org bhthompson@chromium.org
Labels: Merge-Request-56 Merge-Request-55
The fix in #7, #8, #9 did these two things to improve the user experience when underrun happens:

1. Reset audio device when severe underrun happens (audio data lagging hardware by more than 20 ms), there will be a short drop of audio. Hopefully after the reset, the audio device becomes normal. 
2. Limit the logging rate. Currently, the pcm_avail error message will spam the log when underrun is large, cost a lot of CPU cycle, and hence makes the underrun worse. The fix in #7 limits the log to once every 30 seconds.

It is too late to merge for R54. Can we merge to R55 and R56 ?
This can prevent bad user experience (system freezes for many seconds) when severe underrun happens.
Thanks!
Labels: -Merge-Request-55 -Merge-Request-56 Merge-Approved-56 Merge-Approved-55
We can do 55 and 56, but it is too late for 54 (already stable, no more scheduled releases). 
Do we have to do 55?  This wasn't a regression fix. I'd like to see it get a full cycle of testing before pushing to users and 56 seems perfect for that.
Sure, you have a better assessment of the risk here than I do, comment 10 made it sound like a bad regression (freezes are most annoying). 

Comment 14 by akhal...@gmail.com, Nov 23 2016

This bug is causing daily crashes on my chell on 54 stable. Does it make sense to defer fixing that severe of a bug in the stable channel?
Anything that affect audio timing makes me nervous, if it's causing a crash, then it's probably worth the risk.  But I don't understand how this would cause a crash.

We should certainly at least disable the excessive logging in every release we can.
Re#14
Hi akahalilx@, could you please file a feedback report by pressing alt + shift + i (please mention  bug 662786  in feedback content) when your chell crashes (after reboot) ? Maybe the crash is coming from other cause.

Re#15
I can make another change that simply removes the log of pcm_avail. That would be very low risk for R55.
Thanks!
Project Member

Comment 17 by bugdroid1@chromium.org, Nov 24 2016

Labels: merge-merged-release-R55-8872.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/adhd/+/9c9ed3da99d89520ca17293a02706716a3af28a0

commit 9c9ed3da99d89520ca17293a02706716a3af28a0
Author: Cheng-Yi Chiang <cychiang@chromium.org>
Date: Thu Nov 24 06:31:52 2016

CRAS: alsa_helpers: Remove pcm_avail underrun logs

This excessive logs makes device harder to come back from underrun.
Simply remove it for R55.
Note that this is handled differently in R56 (handle severe underrun by
resetting device, log once every 30 seconds).

BUG= chromium:662786 
TEST=Cause underrun using cras_test_client, check there is no underrun
log in /var/log/messages. Check there is underrun count in audio thread
log.

Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org>
Change-Id: Ib6f017b66930b3ae7d93ad0db568ccdd8bcc4792
Reviewed-on: https://chromium-review.googlesource.com/414665
Reviewed-by: Chinyue Chen <chinyue@chromium.org>

[modify] https://crrev.com/9c9ed3da99d89520ca17293a02706716a3af28a0/cras/src/server/cras_alsa_helpers.c

Project Member

Comment 18 by sheriffbot@chromium.org, Nov 30 2016

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 19 by sheriffbot@chromium.org, Dec 4 2016

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
I merged the CL that removes the log of underrun to R56 too since it is causing trouble for the dogfooder.
https://chromium-review.googlesource.com/#/c/416535/
Project Member

Comment 21 by bugdroid1@chromium.org, Dec 6 2016

Labels: merge-merged-release-R56-9000.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/adhd/+/97d8a01c3166a6bc7640c37fbf2f3ed6a8ba7e0a

commit 97d8a01c3166a6bc7640c37fbf2f3ed6a8ba7e0a
Author: Cheng-Yi Chiang <cychiang@chromium.org>
Date: Thu Nov 24 06:31:52 2016

CRAS: alsa_helpers: Remove pcm_avail underrun logs

This excessive logs makes device harder to come back from underrun.
Simply remove it for R55.
Note that this is handled differently in R56 (handle severe underrun by
resetting device, log once every 30 seconds).

BUG= chromium:662786 
TEST=Cause underrun using cras_test_client, check there is no underrun
log in /var/log/messages. Check there is underrun count in audio thread
log.

Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org>
Change-Id: Ib6f017b66930b3ae7d93ad0db568ccdd8bcc4792
Reviewed-on: https://chromium-review.googlesource.com/414665
Reviewed-by: Chinyue Chen <chinyue@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/416535

[modify] https://crrev.com/97d8a01c3166a6bc7640c37fbf2f3ed6a8ba7e0a/cras/src/server/cras_alsa_helpers.c

Project Member

Comment 22 by bugdroid1@chromium.org, Dec 16 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/adhd/+/e4cfb42dc25e5196bc0115c1402db24359adfe8b

commit e4cfb42dc25e5196bc0115c1402db24359adfe8b
Author: Cheng-Yi Chiang <cychiang@chromium.org>
Date: Tue Dec 13 14:49:28 2016

CRAS: alsa_io - Set severe underrun threshold to five seconds

As explained in the issue, severe underrun is only targeted for special
case where bad dma driver makes hw_ptr jumps by a large distance and
creates a severe underrun that can not be recovered by filling blocks of
zeros. It should not be easily triggered by normal underrun or stressing
system.

BUG=chrome-os-partner:60497,  chromium:662786 
TEST=run balloon test in  issue 60497 , should not see resetting device.

Change-Id: I78d13d428cf11b06dbec05454340abc376f2290e
Reviewed-on: https://chromium-review.googlesource.com/418566
Commit-Ready: Cheng-Yi Chiang <cychiang@chromium.org>
Tested-by: Cheng-Yi Chiang <cychiang@chromium.org>
Reviewed-by: Cheng-Yi Chiang <cychiang@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/e4cfb42dc25e5196bc0115c1402db24359adfe8b/cras/src/server/cras_alsa_io.c

Project Member

Comment 23 by sheriffbot@chromium.org, Jan 27 2017

Labels: -Merge-Approved-56 -Merge-Approved-55
This issue hasn't been updated in the last 6 weeks, so removing its merge approval label. Please re-request a merge if needed.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Fixed (was: Started)
Forgot to mark fix.

Sign in to add a comment