circular locking observed when booting chromeos-4.14 on caroline |
|||
Issue description
With v4.14.34-4302-gdf94c7aa2f90 and USE="lockdebug kasan pcserial tty_console_ttyS0", the following circular locking dependency was observed on caroline.
[ 62.289470] ======================================================
[ 62.296386] WARNING: possible circular locking dependency detected
[ 62.303305] 4.14.34 #4 Not tainted
[ 62.307110] ------------------------------------------------------
[ 62.314025] arc-setup/1571 is trying to acquire lock:
[ 62.319675] (&mm->mmap_sem){++++}, at: [<ffffffff8f21a3f2>] __might_fault+0x85/0x10b
[ 62.328456]
but task is already holding lock:
[ 62.334987] (&sb->s_type->i_mutex_key#4){++++}, at: [<ffffffff8f276f78>] iterate_dir+0xa9/0x22c
[ 62.344839]
which lock already depends on the new lock.
Dependency chain (shortened):
-> #7 (&sb->s_type->i_mutex_key#4){++++}:
[ 62.369705] down_write+0x3d/0x65
[ 62.374010] start_creating+0x7e/0x118
[ 62.378788] debugfs_create_dir+0x13/0x116
[ 62.384032] iwl_mvm_vif_dbgfs_register+0x88/0x6e1 [iwlmvm]
[ 62.390925] iwl_mvm_mac_add_interface+0x386/0x49a [iwlmvm]
[ 62.397912] drv_add_interface+0x1f2/0x225 [iwl7000_mac80211]
[ 62.405098] ieee80211_do_open+0x616/0xb6a [iwl7000_mac80211]
-> #6 (&mvm->mutex){+.+.}:
[ 62.470448] __mutex_lock+0x91/0x523
[ 62.475111] iwl_mvm_tzone_get_temp+0x4a/0xcd [iwlmvm]
[ 62.481454] thermal_zone_get_temp+0x90/0xae
[ 62.486824] thermal_zone_device_update+0x85/0x297
[ 62.492781] thermal_zone_device_register+0x848/0x8cd
[ 62.499097] iwl_mvm_thermal_initialize+0x2b2/0x365 [iwlmvm]
-> #5 (&tz->lock){+.+.}:
[ 62.567693] __mutex_lock+0x91/0x523
[ 62.572283] thermal_zone_get_temp+0x69/0xae
[ 62.577652] thermal_zone_device_update+0x85/0x297
[ 62.583605] thermal_zone_device_register+0x848/0x8cd
-> #4 (cpuhp_state-up){+.+.}:
[ 62.626284] cpuhp_lock_acquire+0x32/0x35
[ 62.631362] cpuhp_issue_call+0xd9/0x220
[ 62.636343] __cpuhp_setup_state_cpuslocked+0x1f6/0x2a2
[ 62.642783] __cpuhp_setup_state+0x47/0x59
[ 62.647958] page_writeback_init+0x43/0x66
-> #3 (cpuhp_state_mutex){+.+.}:
[ 62.669537] __mutex_lock+0x91/0x523
[ 62.674131] __cpuhp_setup_state_cpuslocked+0x58/0x2a2
[ 62.680470] __cpuhp_setup_state+0x47/0x59
[ 62.685645] page_alloc_init+0x28/0x30
-> #2 (cpu_hotplug_lock.rw_sem){++++}:
[ 62.707420] cpus_read_lock+0x3d/0x84
[ 62.712110] apply_wqattrs_lock+0xe/0x1d
[ 62.717089] apply_workqueue_attrs+0x1a/0x40
[ 62.722463] __alloc_workqueue_key+0x5ab/0x806
[ 62.728025] i915_gem_init_userptr+0x72/0x9a
[ 62.733392] i915_gem_init+0xb3/0x139
[ 62.738083] i915_driver_load+0x182f/0x1a02
-> #1 (&dev->struct_mutex){+.+.}:
[ 62.799307] __mutex_lock+0x91/0x523
[ 62.803903] i915_mutex_lock_interruptible+0x132/0x143
[ 62.810245] i915_gem_fault+0x229/0x739
[ 62.815130] __do_fault+0x44/0x69
-> #0 (&mm->mmap_sem){++++}:
[ 62.844821] lock_acquire+0x1e4/0x21a
[ 62.849508] __might_fault+0xc4/0x10b
[ 62.854201] _copy_to_user+0x25/0x83
[ 62.858793] filldir+0x110/0x18f
other info that might help us debug this:
[ 62.896845] Chain exists of:
&mm->mmap_sem --> &mvm->mutex --> &sb->s_type->i_mutex_key#4
[ 62.909415] Possible unsafe locking scenario:
[ 62.916044] CPU0 CPU1
[ 62.921116] ---- ----
[ 62.926187] lock(&sb->s_type->i_mutex_key#4);
[ 62.931265] lock(&mvm->mutex);
[ 62.937702] lock(&sb->s_type->i_mutex_key#4);
[ 62.945597] lock(&mm->mmap_sem);
[ 62.949411]
*** DEADLOCK ***
[ 62.956047] 1 lock held by arc-setup/1571:
[ 62.960632] #0: (&sb->s_type->i_mutex_key#4){++++}, at: [<ffffffff8f276f78>] iterate_dir+0xa9/0x22c
Complete log is attached.
,
Apr 17 2018
Manual bisect: Description Comments Result -------------------------------------------------------------- v4.14.34-4302-gdf94c7aa2f90 v4.14.34 merge point bad v4.14.25-4110-g32b887dfb630 v4.14.25 merge point bad v4.14.20-3918-g316b74faa6d8 v4.14.20 merge point bad v4.14.10-3147-g70a3f165923f v4.14.10 merge point good v4.14.15-3566-g813a7d3099c7 v4.14.15 merge point good v4.14.17-3820-g2f8e01bb20c5 v4.14.17 merge point good v4.14.18-3866-g13117e46bc65 v4.14.18 merge point bad v4.14.17-3865-gb9014010a4ab pre v4.14.18 merge bad Automated bisect: bad: b9014010a4ab good: 2f8e01bb20c5 v4.14.16-3841-gddd27534d9eb good v4.14.16-3852-g369228868afb bad v4.14.16-3846-g8a1a0317d203 bad v4.14.16-3843-ge3d91bfdf444 bad v4.14.16-3842-g1ce12c965f89 good (1st) good (2nd) Automated bisect log: $ git bisect log # bad: [b9014010a4ab69e5bff6d09f1b60bac2ec164828] UPSTREAM: drm/i915/gvt: Export intel_gvt_render_mmio_to_ring_id() # good: [2f8e01bb20c5e2e246cb2c6c8eb17b4e83400850] CHROMIUM: Merge 'v4.14.17' into chromeos-4.14 git bisect start 'b9014010a4ab' '2f8e01bb20c5' # good: [ddd27534d9eb07938cc2f86bdba83a20fafc36d7] UPSTREAM: drm/i915/gvt: Limit read hw reg to active vgpu git bisect good ddd27534d9eb07938cc2f86bdba83a20fafc36d7 # bad: [369228868afb7b43d90a1df011b85a96f5db6893] CHROMIUM: config: All arm64 get CPU_BOOST git bisect bad 369228868afb7b43d90a1df011b85a96f5db6893 # bad: [8a1a0317d2034d79b0434d96bfde484e71185bcf] UPSTREAM: drm/i915: Track GGTT writes on the vma git bisect bad 8a1a0317d2034d79b0434d96bfde484e71185bcf # bad: [e3d91bfdf444ca8beb03dd5913fc3de93542b0c8] UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM git bisect bad e3d91bfdf444ca8beb03dd5913fc3de93542b0c8 # bad: [1ce12c965f89cfba51760a503983280d01ab00d1] UPSTREAM: drm/atomic-helper: always track connector commits, too git bisect bad 1ce12c965f89cfba51760a503983280d01ab00d1 # first bad commit: [1ce12c965f89cfba51760a503983280d01ab00d1] UPSTREAM: drm/atomic-helper: always track connector commits, too Problem appears to be i915 related.
,
Apr 17 2018
,
Apr 17 2018
Last entry in automated bisect log is wrong; 1ce12c965f89 should have been marked good. Culprit is e3d91bfdf444ca ("UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM"). Sorry for the confusion.
Here is the corrected log:
# bad: [b9014010a4ab69e5bff6d09f1b60bac2ec164828] UPSTREAM: drm/i915/gvt: Export intel_gvt_render_mmio_to_ring_id()
# good: [2f8e01bb20c5e2e246cb2c6c8eb17b4e83400850] CHROMIUM: Merge 'v4.14.17' into chromeos-4.14
git bisect start 'b9014010a4ab' '2f8e01bb20c5'
# good: [ddd27534d9eb07938cc2f86bdba83a20fafc36d7] UPSTREAM: drm/i915/gvt: Limit read hw reg to active vgpu
git bisect good ddd27534d9eb07938cc2f86bdba83a20fafc36d7
# bad: [369228868afb7b43d90a1df011b85a96f5db6893] CHROMIUM: config: All arm64 get CPU_BOOST
git bisect bad 369228868afb7b43d90a1df011b85a96f5db6893
# bad: [8a1a0317d2034d79b0434d96bfde484e71185bcf] UPSTREAM: drm/i915: Track GGTT writes on the vma
git bisect bad 8a1a0317d2034d79b0434d96bfde484e71185bcf
# bad: [e3d91bfdf444ca8beb03dd5913fc3de93542b0c8] UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM
git bisect bad e3d91bfdf444ca8beb03dd5913fc3de93542b0c8
# good: [1ce12c965f89cfba51760a503983280d01ab00d1] UPSTREAM: drm/atomic-helper: always track connector commits, too
git bisect good 1ce12c965f89cfba51760a503983280d01ab00d1
# first bad commit: [e3d91bfdf444ca8beb03dd5913fc3de93542b0c8] UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM
,
Apr 17 2018
No idea what I am doing, but the problem disappears after applying the following patch.
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 77f3ab342804..2525d1beabcd 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -815,9 +815,7 @@ int i915_gem_init_userptr(struct drm_i915_private *dev_priv)
hash_init(dev_priv->mm_structs);
dev_priv->mm.userptr_wq =
- alloc_workqueue("i915-userptr-acquire",
- WQ_HIGHPRI | WQ_UNBOUND,
- 0);
+ alloc_workqueue("i915-userptr-acquire", WQ_HIGHPRI, 0);
if (!dev_priv->mm.userptr_wq)
return -ENOMEM;
,
Apr 18 2018
In #5 you're doing what Chris Wilson's original commit tried to modify(upstream 21cc6431e0c2 - "drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM" ) and then he addressed it again with the commit "drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM". Right now the code in this function in chromeos-4.14 matches upstream so this *may* be an upstream bug.
,
Apr 18 2018
#6: The comments in the upstream commit suggested to me that Chris got the wrong alloc_workqueue() in the first commit, which he fixed up in the second. There is no explanation why the second initialization would need WQ_UNBOUND. This is why I figured that it might be worthwhile trying to restore the original 2nd call. However, as I said, I have no idea what the code is supposed to do, so this is really an absolutely wild shot. I have also no idea what this all has to do with the iwl and the thermal code, both of which seem to be somehow involved. Given that, it might as well be that Chris' commit exposes some problem completely elsewhere (and that the problem isn't seen upstream for that reason).
,
Apr 18 2018
I'm not in the office until Friday and no access to a caroline but could you try booting the drm-intel-next-queued kernel and see if it has the same problem? If it does then an fdo bug can be opened.
,
Apr 24 2018
#8: Trying to run an upstream kernel on a chromebook has its own challenges, and it is very unlikely that the userspace actions needed to trigger the problem will even execute (arc-setup, for example, is expected to fail because the necessary kernel support is not there). Furthermore, there is chromeos specific code in the dependency chain - essentially the entire iwl7000 code. As mentioned in #7, it is very well possible that the problem is not caused but only exposed by the WQ_UNBOUND change.
,
Apr 25 2018
True - while I was able to boot the upstream dinq kernel (with some workaround patches and iwlwifi enabled) arc would not work. Unfortunately it is hard to prove there's an upstream bug without the full stack being functional. Is there an easy way to disable arc-setup from running and see if the lock loop still occurs?
,
Apr 25 2018
arc-setup is not always the process triggering the problem. I have also seen it with wpa_supplicant (such as in the last attached attached log).
,
Apr 26 2018
I booted yesterday's dinq and no lock loop with wpa_supplicant or anything else, just a kasan error, but not surprised. Then tried v4.15-rc3 and got a lock - see the attached log. Of course there's a huge gap between v4.15-rc3 and ToT dinq to do a bisect, but a manual poking of different kernel versions can be done until a smaller range can be bisected. I chose v4.15-rc3 because I did the chromeos-4.14 i915 forklift from that kernel initially.
,
Apr 26 2018
Hmm, different traceback, but cpu hotplug is involved. Makes me wonder if the problem is there. |
|||
►
Sign in to add a comment |
|||
Comment 1 by groeck@chromium.org
, Apr 17 201870.3 KB
70.3 KB Download
70.7 KB
70.7 KB Download