CRAS: handle severe underrun |
||||||||
Issue descriptionhttps://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=21981917814 In this report, there was a severe underrun: 2016-11-04T11:01:53.814151+00:00 ERR cras_server[1919]: pcm_avail returned frames larger than buf_size: 500232 > 65536 This means appl_ptr is behind hw_ptr by more than 400000 samples. CRAS tried to recover it by filling one block size of zeros (usually below 1024 samples). However, one block size is too small compared with the amount of underrun. This causes CRAS to wake up in a busy loop and fill one block of zeros to slightly recover underrun situation. Since the amount of underrun is too large, it needs over 30 seconds to recover, and in this duration, CRAS consume most of the CPU and makes system freeze. Apart from the root cause of the underrun (maybe it is DMA driver misbehave), CRAS should have a better way to handle this. Since this seems to be a real failure, we should not try to recover it. Instead, we should let it error out and close the device. The proposed plan to handle underrun: 1. In cras_alsa_get_avail_frames, we should have two paths for underrun: a) Underrun is not severe (less than 10 block size maybe ?) Recover it as current behavior. b) Underrun is severe, return an error code. 2. The error in cras_alsa_get_avail_frames pops to frames_queued and cras_iodev_frames_queued. cras_iodev_frames_to_play_in_sleep should pop this error to get_next_output_wake get_next_output_wake should pop this error to fill_next_sleep_interval We can log an error here. This can prevent us from scheduling a busy wake. 3. When audio thread next time reaches write_output_samples and calls cras_iodev_frames_queued, it will get the error and remove that device.
,
Nov 8 2016
Thank you Dylan. Use a time for threshold seems reasonable. But I am not sure if we should recover it or not. Silently reseting the device seems the best for user experience. Thanks!
,
Nov 14 2016
Currently iodev does not support "stop", and "start" is only optionally supported. I found that it will be better to close and then open the device following the flow in iodev_list to properly reset the device. We can use cras_main_message so audio thread can pass message to main thread, and let main thread disables/enables device in cras_iodev_list using a message handler.
,
Nov 17 2016
I'm seeing this error every microsecond on my HP Chromebook 13 (chell): 2016-11-17T09:31:24.732070-08:00 ERR cras_server[3727]: pcm_avail returned frames larger than buf_size: 21888 > 16384 2016-11-17T09:31:24.732222-08:00 ERR cras_server[3727]: pcm_avail returned frames larger than buf_size: 21376 > 16384 2016-11-17T09:31:24.732231-08:00 ERR cras_server[3727]: pcm_avail returned frames larger than buf_size: 21376 > 16384 2016-11-17T09:31:24.732235-08:00 ERR cras_server[3727]: pcm_avail returned frames
,
Nov 17 2016
Jimmy, can you start by making this print an ATLOG so we stop spamming the logs.
,
Nov 18 2016
Patchset 5 in https://chromium-review.googlesource.com/#/c/411022/ log underrun message only 1 out of 10. But the message in #4 looks like a severe underrun. appl_ptr is lagged by hw_ptr by more than 4000 samples. The close/open should recover the severe underrun so it will not spam the logs.
,
Nov 23 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/adhd/+/0caf60b0734cc215ad147320a42614635dd9fdac commit 0caf60b0734cc215ad147320a42614635dd9fdac Author: Cheng-Yi Chiang <cychiang@chromium.org> Date: Mon Nov 14 10:35:15 2016 CRAS: alsa_io - Propagate error for severe underrun Previously, cras_alsa_get_avail_frames masks underrun and only logs an error message. This patch sets a threshold for severe underrun. If severe underrun happens, the error should be propagated to audio thread. The following commit will reset the device so device can be recovered from severe underrun. BUG= chromium:662786 TEST=make check Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org> Change-Id: Ib2a385a808936d0673590c0c0ab73c4ed97ada89 Reviewed-on: https://chromium-review.googlesource.com/411022 Commit-Ready: Nicolas Boichat <drinkcat@chromium.org> Reviewed-by: Dylan Reid <dgreid@chromium.org> [modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_alsa_helpers.h [modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/tests/alsa_helpers_unittest.cc [modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_alsa_helpers.c [modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_iodev.h [modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/tests/alsa_io_unittest.cc [modify] https://crrev.com/0caf60b0734cc215ad147320a42614635dd9fdac/cras/src/server/cras_alsa_io.c
,
Nov 23 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/adhd/+/d34bdb1fada946db0fb8e3360110b2cf9e16c09f commit d34bdb1fada946db0fb8e3360110b2cf9e16c09f Author: Cheng-Yi Chiang <cychiang@chromium.org> Date: Wed Nov 16 12:29:24 2016 CRAS: alsa_io - Track number of severe underruns Track number of severe underruns in alsa_io. User can see it through audio thread log. BUG= chromium:662786 TEST=with hack patch, cras_test_client --dump_a and see num_severe_underruns increased. Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org> Change-Id: I02a644971bfcc43e7589c95a3f1ca6db750c75d4 Reviewed-on: https://chromium-review.googlesource.com/411644 Commit-Ready: Nicolas Boichat <drinkcat@chromium.org> Reviewed-by: Dylan Reid <dgreid@chromium.org> [modify] https://crrev.com/d34bdb1fada946db0fb8e3360110b2cf9e16c09f/cras/src/common/cras_types.h [modify] https://crrev.com/d34bdb1fada946db0fb8e3360110b2cf9e16c09f/cras/src/tests/cras_test_client.c [modify] https://crrev.com/d34bdb1fada946db0fb8e3360110b2cf9e16c09f/cras/src/server/cras_iodev.h [modify] https://crrev.com/d34bdb1fada946db0fb8e3360110b2cf9e16c09f/cras/src/server/audio_thread.c [modify] https://crrev.com/d34bdb1fada946db0fb8e3360110b2cf9e16c09f/cras/src/tests/audio_thread_unittest.cc [modify] https://crrev.com/d34bdb1fada946db0fb8e3360110b2cf9e16c09f/cras/src/server/cras_iodev.c [modify] https://crrev.com/d34bdb1fada946db0fb8e3360110b2cf9e16c09f/cras/src/server/cras_alsa_io.c
,
Nov 23 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/adhd/+/94f07550848b7e387bf063949c939866acc0225c commit 94f07550848b7e387bf063949c939866acc0225c Author: Cheng-Yi Chiang <cychiang@chromium.org> Date: Tue Nov 15 09:12:56 2016 CRAS: device_monitor - Add device monitor to handle device request Use device monitor to handle message from audio thread. When audio thread finds a device in bad state, it can send a message to main thread to reset a device properly from iodev_list. BUG= chromium:662786 TEST=make check TEST=With hack patch of severe underrun, check device reset upon severe underrun. Also check audio thread log. Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org> Change-Id: Iac6d3f688f40cfe434002064407363151857ec6b Reviewed-on: https://chromium-review.googlesource.com/411743 Commit-Ready: Nicolas Boichat <drinkcat@chromium.org> Reviewed-by: Dylan Reid <dgreid@chromium.org> [modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_main_message.h [modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/common/cras_types.h [modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/tests/cras_test_client.c [add] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_device_monitor.h [modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/tests/audio_thread_unittest.cc [modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/Makefile.am [add] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_device_monitor.c [modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/cras_server.c [modify] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/server/audio_thread.c [add] https://crrev.com/94f07550848b7e387bf063949c939866acc0225c/cras/src/tests/device_monitor_unittest.cc
,
Nov 23 2016
The fix in #7, #8, #9 did these two things to improve the user experience when underrun happens: 1. Reset audio device when severe underrun happens (audio data lagging hardware by more than 20 ms), there will be a short drop of audio. Hopefully after the reset, the audio device becomes normal. 2. Limit the logging rate. Currently, the pcm_avail error message will spam the log when underrun is large, cost a lot of CPU cycle, and hence makes the underrun worse. The fix in #7 limits the log to once every 30 seconds. It is too late to merge for R54. Can we merge to R55 and R56 ? This can prevent bad user experience (system freezes for many seconds) when severe underrun happens. Thanks!
,
Nov 23 2016
We can do 55 and 56, but it is too late for 54 (already stable, no more scheduled releases).
,
Nov 23 2016
Do we have to do 55? This wasn't a regression fix. I'd like to see it get a full cycle of testing before pushing to users and 56 seems perfect for that.
,
Nov 23 2016
Sure, you have a better assessment of the risk here than I do, comment 10 made it sound like a bad regression (freezes are most annoying).
,
Nov 23 2016
This bug is causing daily crashes on my chell on 54 stable. Does it make sense to defer fixing that severe of a bug in the stable channel?
,
Nov 23 2016
Anything that affect audio timing makes me nervous, if it's causing a crash, then it's probably worth the risk. But I don't understand how this would cause a crash. We should certainly at least disable the excessive logging in every release we can.
,
Nov 24 2016
Re#14 Hi akahalilx@, could you please file a feedback report by pressing alt + shift + i (please mention bug 662786 in feedback content) when your chell crashes (after reboot) ? Maybe the crash is coming from other cause. Re#15 I can make another change that simply removes the log of pcm_avail. That would be very low risk for R55. Thanks!
,
Nov 24 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/adhd/+/9c9ed3da99d89520ca17293a02706716a3af28a0 commit 9c9ed3da99d89520ca17293a02706716a3af28a0 Author: Cheng-Yi Chiang <cychiang@chromium.org> Date: Thu Nov 24 06:31:52 2016 CRAS: alsa_helpers: Remove pcm_avail underrun logs This excessive logs makes device harder to come back from underrun. Simply remove it for R55. Note that this is handled differently in R56 (handle severe underrun by resetting device, log once every 30 seconds). BUG= chromium:662786 TEST=Cause underrun using cras_test_client, check there is no underrun log in /var/log/messages. Check there is underrun count in audio thread log. Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org> Change-Id: Ib6f017b66930b3ae7d93ad0db568ccdd8bcc4792 Reviewed-on: https://chromium-review.googlesource.com/414665 Reviewed-by: Chinyue Chen <chinyue@chromium.org> [modify] https://crrev.com/9c9ed3da99d89520ca17293a02706716a3af28a0/cras/src/server/cras_alsa_helpers.c
,
Nov 30 2016
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Dec 4 2016
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Dec 6 2016
I merged the CL that removes the log of underrun to R56 too since it is causing trouble for the dogfooder. https://chromium-review.googlesource.com/#/c/416535/
,
Dec 6 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/adhd/+/97d8a01c3166a6bc7640c37fbf2f3ed6a8ba7e0a commit 97d8a01c3166a6bc7640c37fbf2f3ed6a8ba7e0a Author: Cheng-Yi Chiang <cychiang@chromium.org> Date: Thu Nov 24 06:31:52 2016 CRAS: alsa_helpers: Remove pcm_avail underrun logs This excessive logs makes device harder to come back from underrun. Simply remove it for R55. Note that this is handled differently in R56 (handle severe underrun by resetting device, log once every 30 seconds). BUG= chromium:662786 TEST=Cause underrun using cras_test_client, check there is no underrun log in /var/log/messages. Check there is underrun count in audio thread log. Signed-off-by: Cheng-Yi Chiang <cychiang@chromium.org> Change-Id: Ib6f017b66930b3ae7d93ad0db568ccdd8bcc4792 Reviewed-on: https://chromium-review.googlesource.com/414665 Reviewed-by: Chinyue Chen <chinyue@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/416535 [modify] https://crrev.com/97d8a01c3166a6bc7640c37fbf2f3ed6a8ba7e0a/cras/src/server/cras_alsa_helpers.c
,
Dec 16 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/adhd/+/e4cfb42dc25e5196bc0115c1402db24359adfe8b commit e4cfb42dc25e5196bc0115c1402db24359adfe8b Author: Cheng-Yi Chiang <cychiang@chromium.org> Date: Tue Dec 13 14:49:28 2016 CRAS: alsa_io - Set severe underrun threshold to five seconds As explained in the issue, severe underrun is only targeted for special case where bad dma driver makes hw_ptr jumps by a large distance and creates a severe underrun that can not be recovered by filling blocks of zeros. It should not be easily triggered by normal underrun or stressing system. BUG=chrome-os-partner:60497, chromium:662786 TEST=run balloon test in issue 60497 , should not see resetting device. Change-Id: I78d13d428cf11b06dbec05454340abc376f2290e Reviewed-on: https://chromium-review.googlesource.com/418566 Commit-Ready: Cheng-Yi Chiang <cychiang@chromium.org> Tested-by: Cheng-Yi Chiang <cychiang@chromium.org> Reviewed-by: Cheng-Yi Chiang <cychiang@chromium.org> Reviewed-by: Dylan Reid <dgreid@chromium.org> [modify] https://crrev.com/e4cfb42dc25e5196bc0115c1402db24359adfe8b/cras/src/server/cras_alsa_io.c
,
Jan 27 2017
This issue hasn't been updated in the last 6 weeks, so removing its merge approval label. Please re-request a merge if needed. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Feb 1 2017
Forgot to mark fix. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by dgreid@chromium.org
, Nov 7 2016