s5p-jpeg driver crashed when running JDA unittest |
|||||||||||||||
Issue descriptionI added a abort test case in JDA unittest. https://codereview.chromium.org/2746073002. It caught a bug in s5p-jpeg driver. From wmatrix, this should be always reproducible. From the source, it looks like this can be reproduced by having two jpeg decoder running at the same time. Try opening two tabs and using two different cameras on peach pi/pit. Panic#1 Part1 <0>[ 1151.167806] BUG: spinlock recursion on CPU#0, swapper/0/0 <0>[ 1151.167840] lock: 0xee9b2938, .magic: dead4ead, .owner: swapper/0/0, .owner_cpu: 0 <5>[ 1151.167854] Backtrace: <5>[ 1151.167887] [<c010d118>] (unwind_backtrace+0x0/0x110) from [<c063f224>] (dump_stack+0x28/0x30) <5>[ 1151.167914] [<c063f224>] (dump_stack+0x28/0x30) from [<c0319464>] (spin_dump+0x80/0x94) <5>[ 1151.167934] [<c0319464>] (spin_dump+0x80/0x94) from [<c03194a4>] (spin_bug+0x2c/0x30) <5>[ 1151.167953] [<c03194a4>] (spin_bug+0x2c/0x30) from [<c0319558>] (do_raw_spin_lock+0x50/0x15c) <5>[ 1151.167973] [<c0319558>] (do_raw_spin_lock+0x50/0x15c) from [<c0643994>] (_raw_spin_lock_irqsave+0x20/0x28) <5>[ 1151.167997] [<c0643994>] (_raw_spin_lock_irqsave+0x20/0x28) from [<c0499b88>] (exynos3250_jpeg_device_run+0x24/0x2a0) <5>[ 1151.168016] [<c0499b88>] (exynos3250_jpeg_device_run+0x24/0x2a0) from [<c0492ad0>] (v4l2_m2m_try_run+0xbc/0xd0) <5>[ 1151.168033] [<c0492ad0>] (v4l2_m2m_try_run+0xbc/0xd0) from [<c0492dcc>] (v4l2_m2m_job_finish+0xdc/0xf4) <5>[ 1151.168048] [<c0492dcc>] (v4l2_m2m_job_finish+0xdc/0xf4) from [<c0499a14>] (exynos3250_jpeg_irq+0x164/0x188) <5>[ 1151.168069] [<c0499a14>] (exynos3250_jpeg_irq+0x164/0x188) from [<c0189188>] (handle_irq_event_percpu+0x7c/0x208) <5>[ 1151.168087] [<c0189188>] (handle_irq_event_percpu+0x7c/0x208) from [<c0189360>] (handle_irq_event+0x4c/0x6c) <5>[ 1151.168104] [<c0189360>] (handle_irq_event+0x4c/0x6c) from [<c018c6a8>] (handle_fasteoi_irq+0xc4/0x12c) <5>[ 1151.168122] [<c018c6a8>] (handle_fasteoi_irq+0xc4/0x12c) from [<c0188a10>] (generic_handle_irq+0x30/0x40) <5>[ 1151.168140] [<c0188a10>] (generic_handle_irq+0x30/0x40) from [<c01069a8>] (handle_IRQ+0x7c/0xa4) <5>[ 1151.168158] [<c01069a8>] (handle_IRQ+0x7c/0xa4) from [<c01003f0>] (gic_handle_irq+0x48/0x6c) <5>[ 1151.168176] [<c01003f0>] (gic_handle_irq+0x48/0x6c) from [<c0105bc0>] (__irq_svc+0x40/0x50) <5>[ 1151.168187] Exception stack(0xc0b01eb0 to 0xc0b01ef8) <5>[ 1151.168200] 1ea0: c0b01ef8 0000010c 06432f6d 0000010c <5>[ 1151.168213] 1ec0: c3b373f8 00000000 0637fb19 0000010c 00000000 c0b30ea8 c0b79f50 c0b01f2c <5>[ 1151.168225] 1ee0: 00000006 c0b01ef8 c0169e1c c04da8e0 20000013 ffffffff <5>[ 1151.168243] [<c0105bc0>] (__irq_svc+0x40/0x50) from [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) <5>[ 1151.168260] [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) from [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) <5>[ 1151.168275] [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) from [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) <5>[ 1151.168290] [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) from [<c04da580>] (cpuidle_idle_call+0x174/0x250) <5>[ 1151.168305] [<c04da580>] (cpuidle_idle_call+0x174/0x250) from [<c0106dbc>] (cpu_idle+0xa4/0x104) <5>[ 1151.168321] [<c0106dbc>] (cpu_idle+0xa4/0x104) from [<c063aa64>] (rest_init+0x6c/0x84) <5>[ 1151.168339] [<c063aa64>] (rest_init+0x6c/0x84) from [<c0a00898>] (start_kernel+0x2e4/0x34c) <4>[ 1152.170680] exynos5-hsi2c 12ca0000.hsi2c: tx timeout <3>[ 1152.170712] ERROR: 0: 1125000 != 1162500 <4>[ 1152.170729] Failed to increase CPU voltage: -110 <4>[ 1158.450664] iio iio:device0: Conversion timed out! Resetting <3>[ 1158.450687] read channel() error: -110 <4>[ 1158.555667] iio iio:device0: Conversion timed out! Resetting <3>[ 1158.555690] read channel() error: -110 <4>[ 1158.660668] iio iio:device0: Conversion timed out! Resetting <3>[ 1158.660690] read channel() error: -110 <4>[ 1158.765660] iio iio:device0: Conversion timed out! Resetting <3>[ 1158.765683] read channel() error: -110 <0>[ 1160.035601] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0 <5>[ 1160.035626] Backtrace: <5>[ 1160.035661] [<c010d118>] (unwind_backtrace+0x0/0x110) from [<c063f224>] (dump_stack+0x28/0x30) <5>[ 1160.035686] [<c063f224>] (dump_stack+0x28/0x30) from [<c063f72c>] (panic+0xa8/0x1fc) <5>[ 1160.035709] [<c063f72c>] (panic+0xa8/0x1fc) from [<c0188658>] (watchdog_timer_fn+0x248/0x280) <5>[ 1160.035733] [<c0188658>] (watchdog_timer_fn+0x248/0x280) from [<c0149adc>] (__run_hrtimer+0x110/0x1d4) <5>[ 1160.035757] [<c0149adc>] (__run_hrtimer+0x110/0x1d4) from [<c014a690>] (hrtimer_interrupt+0x130/0x288) <5>[ 1160.035782] [<c014a690>] (hrtimer_interrupt+0x130/0x288) from [<c04f6a24>] (exynos4_mct_tick_isr+0x70/0x84) <5>[ 1160.035806] [<c04f6a24>] (exynos4_mct_tick_isr+0x70/0x84) from [<c0189188>] (handle_irq_event_percpu+0x7c/0x208) <5>[ 1160.035829] [<c0189188>] (handle_irq_event_percpu+0x7c/0x208) from [<c0189360>] (handle_irq_event+0x4c/0x6c) <5>[ 1160.035852] [<c0189360>] (handle_irq_event+0x4c/0x6c) from [<c018c6a8>] (handle_fasteoi_irq+0xc4/0x12c) <5>[ 1160.035874] [<c018c6a8>] (handle_fasteoi_irq+0xc4/0x12c) from [<c0188a10>] (generic_handle_irq+0x30/0x40) <5>[ 1160.035898] [<c0188a10>] (generic_handle_irq+0x30/0x40) from [<c01069a8>] (handle_IRQ+0x7c/0xa4) <5>[ 1160.035918] [<c01069a8>] (handle_IRQ+0x7c/0xa4) from [<c01003f0>] (gic_handle_irq+0x48/0x6c) <5>[ 1160.035940] [<c01003f0>] (gic_handle_irq+0x48/0x6c) from [<c0105bc0>] (__irq_svc+0x40/0x50) <5>[ 1160.035959] Exception stack(0xef143ee8 to 0xef143f30) <5>[ 1160.035977] 3ee0: ef143f30 0000010e 16e0dad9 0000010e c3b523f8 00000000 <5>[ 1160.035997] 3f00: 16957c94 0000010e 00000000 c0b30ea8 c0b79f50 ef143f64 00000006 ef143f30 <5>[ 1160.036014] 3f20: c0169e1c c04da8e0 20000013 ffffffff <5>[ 1160.036036] [<c0105bc0>] (__irq_svc+0x40/0x50) from [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) <5>[ 1160.036057] [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) from [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) <5>[ 1160.036077] [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) from [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) <5>[ 1160.036099] [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) from [<c04da580>] (cpuidle_idle_call+0x174/0x250) <5>[ 1160.036120] [<c04da580>] (cpuidle_idle_call+0x174/0x250) from [<c0106dbc>] (cpu_idle+0xa4/0x104) <5>[ 1160.036143] [<c0106dbc>] (cpu_idle+0xa4/0x104) from [<c063c3bc>] (secondary_start_kernel+0x10c/0x130) <5>[ 1160.036171] [<c063c3bc>] (secondary_start_kernel+0x10c/0x130) from [<2063b948>] (0x2063b948) <2>[ 1160.036196] CPU1: stopping <5>[ 1160.036235] Backtrace: <5>[ 1160.036280] [<c010d118>] (unwind_backtrace+0x0/0x110) from [<c063f224>] (dump_stack+0x28/0x30) <5>[ 1160.036327] [<c063f224>] (dump_stack+0x28/0x30) from [<c010bb0c>] (handle_IPI+0xdc/0x158) <5>[ 1160.036381] [<c010bb0c>] (handle_IPI+0xdc/0x158) from [<c010040c>] (gic_handle_irq+0x64/0x6c) <5>[ 1160.036425] [<c010040c>] (gic_handle_irq+0x64/0x6c) from [<c0105bc0>] (__irq_svc+0x40/0x50) <5>[ 1160.036470] Exception stack(0xef13fee8 to 0xef13ff30) <5>[ 1160.036510] fee0: ef13ff30 0000010e 16ea0e51 0000010e c3b403f8 00000000 <5>[ 1160.036555] ff00: 16e23822 0000010e 00000000 c0b30ea8 c0b79f50 ef13ff64 00000006 ef13ff30 <5>[ 1160.036603] ff20: c0169e1c c04da8e0 20000013 ffffffff <5>[ 1160.036648] [<c0105bc0>] (__irq_svc+0x40/0x50) from [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) <5>[ 1160.036698] [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) from [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) <5>[ 1160.036747] [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) from [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) <5>[ 1160.036794] [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) from [<c04da580>] (cpuidle_idle_call+0x174/0x250) <5>[ 1160.036844] [<c04da580>] (cpuidle_idle_call+0x174/0x250) from [<c0106dbc>] (cpu_idle+0xa4/0x104) <5>[ 1160.036897] [<c0106dbc>] (cpu_idle+0xa4/0x104) from [<c063c3bc>] (secondary_start_kernel+0x10c/0x130) <5>[ 1160.036946] [<c063c3bc>] (secondary_start_kernel+0x10c/0x130) from [<2063b948>] (0x2063b948) <2>[ 1160.036997] CPU2: stopping <5>[ 1160.037026] Backtrace: <5>[ 1160.037064] [<c010d118>] (unwind_backtrace+0x0/0x110) from [<c063f224>] (dump_stack+0x28/0x30) <5>[ 1160.037101] [<c063f224>] (dump_stack+0x28/0x30) from [<c010bb0c>] (handle_IPI+0xdc/0x158) <5>[ 1160.037140] [<c010bb0c>] (handle_IPI+0xdc/0x158) from [<c010040c>] (gic_handle_irq+0x64/0x6c) <5>[ 1160.037179] [<c010040c>] (gic_handle_irq+0x64/0x6c) from [<c0105bc0>] (__irq_svc+0x40/0x50) <5>[ 1160.037213] Exception stack(0xef141ee8 to 0xef141f30) <5>[ 1160.037253] 1ee0: ef141f30 0000010e 16ea24c7 0000010e c3b493f8 00000000 <5>[ 1160.037289] 1f00: 16e25b9d 0000010e 00000000 c0b30ea8 c0b79f50 ef141f64 00000006 ef141f30 <5>[ 1160.037326] 1f20: c0169e1c c04da8e0 20000013 ffffffff <5>[ 1160.037361] [<c0105bc0>] (__irq_svc+0x40/0x50) from [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) <5>[ 1160.037401] [<c04da8e0>] (cpuidle_wrap_enter+0x58/0xa4) from [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) <5>[ 1160.037439] [<c04d9fcc>] (cpuidle_enter_tk+0x14/0x1c) from [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) <5>[ 1160.037476] [<c04da3c8>] (cpuidle_enter_state+0x24/0x68) from [<c04da580>] (cpuidle_idle_call+0x174/0x250) <5>[ 1160.037516] [<c04da580>] (cpuidle_idle_call+0x174/0x250) from [<c0106dbc>] (cpu_idle+0xa4/0x104) <5>[ 1160.037555] [<c0106dbc>] (cpu_idle+0xa4/0x104) from [<c063c3bc>] (secondary_start_kernel+0x10c/0x130) <5>[ 1160.037593] [<c063c3bc>] (secondary_start_kernel+0x10c/0x130) from [<2063b948>] (0x2063b948) <4>[ 1161.970202] SMP: failed to stop secondary CPUs <3>[ 1161.970224] CPU0 PC: <c04f65dc> exynos4_read_count_32+0x10/0x20 <3>[ 1161.970248] CPU3 PC: <c011c7c0> exynos5_panic_notify+0x64/0xbc
,
Apr 6 2017
,
Apr 11 2017
To keep note, this is the regression shown on wmatrix since 9391.0.0 (9390.0.0 was good) https://wmatrix.googleplex.com/platform/video?suites=bvt-perbuild&tests=video_JpegDecodeAccelerator&days_back=30&platforms=peach_pi
,
Apr 12 2017
The test was added between 9390.0.0 and 9391.0.0 (https://chromium.googlesource.com/chromium/src/+log/59.0.3041.0..59.0.3048.0?pretty=fuller&n=10000), so this is very likely not a regression between the versions, just a new test that has been failing from the start on these devices.
,
Apr 12 2017
This seems to be a classical deadlock issue. I am still trying to understand what the code exactly does (while teaching myself about V4L2), but it seems very suspicious to me that exynos3250_jpeg_device_run is called from interrupt context. We also have some weird patterns with locks being freed by one function only to be reacquired immediately by the next.
,
Apr 12 2017
,
Apr 17 2017
So the crash seems to happen whenever there is another job in the queue when the interrupt handler is triggered. The call sequence goes as follows: exynos3250_jpeg_irq() - spin_lock(&jpeg->slock); - v4l2_m2m_job_finish() - v4l2_m2m_try_run() - exynos3250_jpeg_device_run() - spin_lock(&jpeg->slock); - spin_unlock(&jpeg->slock); - spin_unlock(&jpeg->slock); In that scenario the spin_lock from exynos3250_jpeg_device_run() is called before the IRQ handler has a chance to release it. The fact this issue does not manifest itself elsewhere seems to indicate this scenario (more than one job queued) is rare. Reordering the lock release in the IRQ handler seems to do the trick as a workaround, but I want to check whether this is an acceptable solution and verify that we cannot run into another race condition.
,
Apr 17 2017
Also we are doing quite a lot of work in that interrupt handler. It probably makes sense to move the bulk of it into a worker, unless someone thinks this is the proper design?
,
Apr 17 2017
Re #7: chrome will only use one JPEG HW decoder for one camera. If users open multiple tabs to use the the same camera, video device is only opened once and there is only one JPEG decoder. Users won't see this crash if they only use internal camera. I believe you can reproduce this issue if you connect an external camera and use both cameras at the same time. Re #8: peach (exynos) is an very old device and in maintenance mode. I haven't looked at the code. If possible, find a simple fix. There's no need to refactor too much.
,
Apr 18 2017
Thanks for the explanation. The peach driver is almost identical to the mainline version, for which I suppose we will also want a fix? If we need to design a proper one anyway, we may as well apply it on both branches, shouldn't we?
,
Apr 18 2017
Yes. If our version is similar to the mainline, send the patch to upstream first. Cherrypick it back and mark it with FROMLIST.
,
Apr 18 2017
Turns out I am going to have to split the top and bottom halves of the IRQ handler. When device_run() is called from try_run(), the hook does some long polling and may call cpu_relax() from the interrupt handler, which is a bad idea. We will definitely run into other issues if this is not fixed.
,
Apr 18 2017
FYI, cpu_relax() is only a barrier, so the worst thing that might happen with that polling is bad latency (well, a kernel panic might also happen, if it takes really long time like 180 sec. or so).
,
Apr 18 2017
Right, the current thread is not going to be yielded by a cpu_relax(). Still concerned about the amount of code that runs in IRQ context though. :)
,
Apr 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/b6d26a74fd462f4dd24a3f57deb671560463fbd7 commit b6d26a74fd462f4dd24a3f57deb671560463fbd7 Author: Alexandre Courbot <acourbot@chromium.org> Date: Fri Apr 21 09:34:44 2017 CHROMIUM: media/s5p-jpeg: fix recursive spinlock acquisition v4l2_m2m_job_finish(), which is called from the interrupt handler with slock acquired, can call the device_run() hook immediately if another context was in the queue. On the s5p driver, this hook also acquires slock, resulting in a deadlock for this particular scenario. Fix this by releasing slock in the interrupt handler before calling v4l2_m2m_job_finish(). This is safe to do as the state of the hardware cannot change before v4l2_m2m_job_finish() is called anyway. Ideally v4l2_m2m_job_finish() should be called from a bottom half interrupt handler as it does a lot of things, but this quick fix is helpful in mitigating the issue. BUG= chromium:705971 TEST=video_JpegDecodeAccelerator now passes successfully Change-Id: Ic6358c1e00fed0ea0a61f84492ceb3d594df0a4a Signed-off-by: Alexandre Courbot <acourbot@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/481579 Reviewed-by: Tomasz Figa <tfiga@chromium.org> [modify] https://crrev.com/b6d26a74fd462f4dd24a3f57deb671560463fbd7/drivers/media/platform/s5p-jpeg/jpeg-core.c
,
Apr 21 2017
Alex. Can you also send this patch to upstream?
,
Apr 21 2017
Yes. Shall I close this bug now that the fix has landed in ChromeOS?
,
Apr 21 2017
Sure. You can close this.
,
Apr 24 2017
Closing this as fixed and preparing change for upstream.
,
May 16 2017
Could this be merged to R59? video_JpegDecodeAccelerator R59 is still failing.
,
May 16 2017
,
May 16 2017
,
May 17 2017
Your change meets the bar and is auto-approved for M59. Please go ahead and merge the CL to branch 3071 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), gkihumba@(ChromeOS), Abdul Syed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 19 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 19 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/79184bb31146ebb3220213eb359887faa0a9a982 commit 79184bb31146ebb3220213eb359887faa0a9a982 Author: Alexandre Courbot <acourbot@chromium.org> Date: Fri May 19 17:31:44 2017 CHROMIUM: media/s5p-jpeg: fix recursive spinlock acquisition v4l2_m2m_job_finish(), which is called from the interrupt handler with slock acquired, can call the device_run() hook immediately if another context was in the queue. On the s5p driver, this hook also acquires slock, resulting in a deadlock for this particular scenario. Fix this by releasing slock in the interrupt handler before calling v4l2_m2m_job_finish(). This is safe to do as the state of the hardware cannot change before v4l2_m2m_job_finish() is called anyway. Ideally v4l2_m2m_job_finish() should be called from a bottom half interrupt handler as it does a lot of things, but this quick fix is helpful in mitigating the issue. BUG= chromium:705971 TEST=video_JpegDecodeAccelerator now passes successfully Change-Id: Ic6358c1e00fed0ea0a61f84492ceb3d594df0a4a Signed-off-by: Alexandre Courbot <acourbot@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/481579 Reviewed-by: Tomasz Figa <tfiga@chromium.org> (cherry picked from commit b6d26a74fd462f4dd24a3f57deb671560463fbd7) Reviewed-on: https://chromium-review.googlesource.com/508175 [modify] https://crrev.com/79184bb31146ebb3220213eb359887faa0a9a982/drivers/media/platform/s5p-jpeg/jpeg-core.c
,
May 19 2017
Merged to 59. Marking as fixed.
,
May 23 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 24 2017
,
Aug 1 2017
,
Nov 6 2017
https://wmatrix.googleplex.com/unfiltered?hide_missing=True&releases=tot&tests=video_JpegDecodeAccelerator&days_back=30 |
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by jcliang@chromium.org
, Mar 28 2017