New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 833992 link

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

circular locking observed when booting chromeos-4.14 on caroline

Project Member Reported by groeck@chromium.org, Apr 17 2018

Issue description

With v4.14.34-4302-gdf94c7aa2f90 and USE="lockdebug kasan pcserial tty_console_ttyS0", the following circular locking dependency was observed on caroline.

[   62.289470] ======================================================
[   62.296386] WARNING: possible circular locking dependency detected
[   62.303305] 4.14.34 #4 Not tainted 
[   62.307110] ------------------------------------------------------
[   62.314025] arc-setup/1571 is trying to acquire lock:
[   62.319675]  (&mm->mmap_sem){++++}, at: [<ffffffff8f21a3f2>] __might_fault+0x85/0x10b
[   62.328456] 
               but task is already holding lock:
[   62.334987]  (&sb->s_type->i_mutex_key#4){++++}, at: [<ffffffff8f276f78>] iterate_dir+0xa9/0x22c
[   62.344839]
               which lock already depends on the new lock.

Dependency chain (shortened):

              -> #7 (&sb->s_type->i_mutex_key#4){++++}:
[   62.369705]        down_write+0x3d/0x65
[   62.374010]        start_creating+0x7e/0x118
[   62.378788]        debugfs_create_dir+0x13/0x116
[   62.384032]        iwl_mvm_vif_dbgfs_register+0x88/0x6e1 [iwlmvm]
[   62.390925]        iwl_mvm_mac_add_interface+0x386/0x49a [iwlmvm]
[   62.397912]        drv_add_interface+0x1f2/0x225 [iwl7000_mac80211]
[   62.405098]        ieee80211_do_open+0x616/0xb6a [iwl7000_mac80211]

               -> #6 (&mvm->mutex){+.+.}:
[   62.470448]        __mutex_lock+0x91/0x523
[   62.475111]        iwl_mvm_tzone_get_temp+0x4a/0xcd [iwlmvm]
[   62.481454]        thermal_zone_get_temp+0x90/0xae
[   62.486824]        thermal_zone_device_update+0x85/0x297
[   62.492781]        thermal_zone_device_register+0x848/0x8cd
[   62.499097]        iwl_mvm_thermal_initialize+0x2b2/0x365 [iwlmvm]

               -> #5 (&tz->lock){+.+.}:
[   62.567693]        __mutex_lock+0x91/0x523
[   62.572283]        thermal_zone_get_temp+0x69/0xae
[   62.577652]        thermal_zone_device_update+0x85/0x297
[   62.583605]        thermal_zone_device_register+0x848/0x8cd

               -> #4 (cpuhp_state-up){+.+.}:
[   62.626284]        cpuhp_lock_acquire+0x32/0x35
[   62.631362]        cpuhp_issue_call+0xd9/0x220
[   62.636343]        __cpuhp_setup_state_cpuslocked+0x1f6/0x2a2
[   62.642783]        __cpuhp_setup_state+0x47/0x59
[   62.647958]        page_writeback_init+0x43/0x66

               -> #3 (cpuhp_state_mutex){+.+.}:
[   62.669537]        __mutex_lock+0x91/0x523
[   62.674131]        __cpuhp_setup_state_cpuslocked+0x58/0x2a2
[   62.680470]        __cpuhp_setup_state+0x47/0x59
[   62.685645]        page_alloc_init+0x28/0x30

               -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[   62.707420]        cpus_read_lock+0x3d/0x84
[   62.712110]        apply_wqattrs_lock+0xe/0x1d
[   62.717089]        apply_workqueue_attrs+0x1a/0x40
[   62.722463]        __alloc_workqueue_key+0x5ab/0x806
[   62.728025]        i915_gem_init_userptr+0x72/0x9a
[   62.733392]        i915_gem_init+0xb3/0x139
[   62.738083]        i915_driver_load+0x182f/0x1a02

               -> #1 (&dev->struct_mutex){+.+.}:
[   62.799307]        __mutex_lock+0x91/0x523
[   62.803903]        i915_mutex_lock_interruptible+0x132/0x143
[   62.810245]        i915_gem_fault+0x229/0x739
[   62.815130]        __do_fault+0x44/0x69

               -> #0 (&mm->mmap_sem){++++}:
[   62.844821]        lock_acquire+0x1e4/0x21a
[   62.849508]        __might_fault+0xc4/0x10b
[   62.854201]        _copy_to_user+0x25/0x83
[   62.858793]        filldir+0x110/0x18f

               other info that might help us debug this:

[   62.896845] Chain exists of:
                 &mm->mmap_sem --> &mvm->mutex --> &sb->s_type->i_mutex_key#4

[   62.909415]  Possible unsafe locking scenario:

[   62.916044]        CPU0                    CPU1
[   62.921116]        ----                    ----
[   62.926187]   lock(&sb->s_type->i_mutex_key#4);
[   62.931265]                                lock(&mvm->mutex);
[   62.937702]                                lock(&sb->s_type->i_mutex_key#4);
[   62.945597]   lock(&mm->mmap_sem);
[   62.949411]
                *** DEADLOCK ***

[   62.956047] 1 lock held by arc-setup/1571:
[   62.960632]  #0:  (&sb->s_type->i_mutex_key#4){++++}, at: [<ffffffff8f276f78>] iterate_dir+0xa9/0x22c

Complete log is attached.

 
dmesg
70.2 KB View Download

Comment 1 by groeck@chromium.org, Apr 17 2018

The problem was observed three times in a row and thus appears to be persistent. Additional logs attached.

dmesg.circular
70.3 KB Download
dmesg.circular2
70.7 KB Download

Comment 2 by groeck@chromium.org, Apr 17 2018

Cc: marc...@chromium.org nathan.d...@intel.com
Components: -OS>Kernel OS>Kernel>Graphics
Labels: -Pri-3 Pri-2
Status: Available (was: Assigned)
Manual bisect:

Description                     Comments                Result
--------------------------------------------------------------
v4.14.34-4302-gdf94c7aa2f90     v4.14.34 merge point    bad
v4.14.25-4110-g32b887dfb630     v4.14.25 merge point    bad
v4.14.20-3918-g316b74faa6d8     v4.14.20 merge point    bad
v4.14.10-3147-g70a3f165923f     v4.14.10 merge point    good
v4.14.15-3566-g813a7d3099c7     v4.14.15 merge point    good
v4.14.17-3820-g2f8e01bb20c5     v4.14.17 merge point    good
v4.14.18-3866-g13117e46bc65     v4.14.18 merge point    bad
v4.14.17-3865-gb9014010a4ab     pre v4.14.18 merge      bad

Automated bisect:
bad:    b9014010a4ab
good:   2f8e01bb20c5

v4.14.16-3841-gddd27534d9eb     good
v4.14.16-3852-g369228868afb     bad
v4.14.16-3846-g8a1a0317d203     bad
v4.14.16-3843-ge3d91bfdf444     bad
v4.14.16-3842-g1ce12c965f89     good (1st) good (2nd)

Automated bisect log:

$ git bisect log
# bad: [b9014010a4ab69e5bff6d09f1b60bac2ec164828] UPSTREAM: drm/i915/gvt: Export intel_gvt_render_mmio_to_ring_id()
# good: [2f8e01bb20c5e2e246cb2c6c8eb17b4e83400850] CHROMIUM: Merge 'v4.14.17' into chromeos-4.14
git bisect start 'b9014010a4ab' '2f8e01bb20c5'
# good: [ddd27534d9eb07938cc2f86bdba83a20fafc36d7] UPSTREAM: drm/i915/gvt: Limit read hw reg to active vgpu
git bisect good ddd27534d9eb07938cc2f86bdba83a20fafc36d7
# bad: [369228868afb7b43d90a1df011b85a96f5db6893] CHROMIUM: config: All arm64 get CPU_BOOST
git bisect bad 369228868afb7b43d90a1df011b85a96f5db6893
# bad: [8a1a0317d2034d79b0434d96bfde484e71185bcf] UPSTREAM: drm/i915: Track GGTT writes on the vma
git bisect bad 8a1a0317d2034d79b0434d96bfde484e71185bcf
# bad: [e3d91bfdf444ca8beb03dd5913fc3de93542b0c8] UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM
git bisect bad e3d91bfdf444ca8beb03dd5913fc3de93542b0c8
# bad: [1ce12c965f89cfba51760a503983280d01ab00d1] UPSTREAM: drm/atomic-helper: always track connector commits, too
git bisect bad 1ce12c965f89cfba51760a503983280d01ab00d1
# first bad commit: [1ce12c965f89cfba51760a503983280d01ab00d1] UPSTREAM: drm/atomic-helper: always track connector commits, too

Problem appears to be i915 related.

dmesg-v4.14.16-e3d91bfdf444
70.2 KB Download
dmesg-v4.14.16-8a1a0317d203
70.2 KB Download
dmesg-v4.14.16-369228868afb
69.9 KB Download
dmesg-v4.14.17-b9014010a4ab
70.2 KB Download

Comment 3 by groeck@chromium.org, Apr 17 2018

Cc: groeck@chromium.org
Owner: ----

Comment 4 by groeck@chromium.org, Apr 17 2018

Last entry in automated bisect log is wrong; 1ce12c965f89 should have been marked good. Culprit is e3d91bfdf444ca ("UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM"). Sorry for the confusion.
Here is the corrected log:

# bad: [b9014010a4ab69e5bff6d09f1b60bac2ec164828] UPSTREAM: drm/i915/gvt: Export intel_gvt_render_mmio_to_ring_id()
# good: [2f8e01bb20c5e2e246cb2c6c8eb17b4e83400850] CHROMIUM: Merge 'v4.14.17' into chromeos-4.14
git bisect start 'b9014010a4ab' '2f8e01bb20c5'
# good: [ddd27534d9eb07938cc2f86bdba83a20fafc36d7] UPSTREAM: drm/i915/gvt: Limit read hw reg to active vgpu
git bisect good ddd27534d9eb07938cc2f86bdba83a20fafc36d7
# bad: [369228868afb7b43d90a1df011b85a96f5db6893] CHROMIUM: config: All arm64 get CPU_BOOST
git bisect bad 369228868afb7b43d90a1df011b85a96f5db6893
# bad: [8a1a0317d2034d79b0434d96bfde484e71185bcf] UPSTREAM: drm/i915: Track GGTT writes on the vma
git bisect bad 8a1a0317d2034d79b0434d96bfde484e71185bcf
# bad: [e3d91bfdf444ca8beb03dd5913fc3de93542b0c8] UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM
git bisect bad e3d91bfdf444ca8beb03dd5913fc3de93542b0c8
# good: [1ce12c965f89cfba51760a503983280d01ab00d1] UPSTREAM: drm/atomic-helper: always track connector commits, too
git bisect good 1ce12c965f89cfba51760a503983280d01ab00d1
# first bad commit: [e3d91bfdf444ca8beb03dd5913fc3de93542b0c8] UPSTREAM: drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM

Comment 5 by groeck@chromium.org, Apr 17 2018

No idea what I am doing, but the problem disappears after applying the following patch.

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 77f3ab342804..2525d1beabcd 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -815,9 +815,7 @@ int i915_gem_init_userptr(struct drm_i915_private *dev_priv)
 	hash_init(dev_priv->mm_structs);
 
 	dev_priv->mm.userptr_wq =
-		alloc_workqueue("i915-userptr-acquire",
-				WQ_HIGHPRI | WQ_UNBOUND,
-				0);
+		alloc_workqueue("i915-userptr-acquire", WQ_HIGHPRI, 0);
 	if (!dev_priv->mm.userptr_wq)
 		return -ENOMEM;

In #5 you're doing what Chris Wilson's original commit tried to modify(upstream 21cc6431e0c2 - "drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM" ) and then he addressed it again with the commit "drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM". 

Right now the code in this function in chromeos-4.14 matches upstream so this *may* be an upstream bug. 

Comment 7 by groeck@chromium.org, Apr 18 2018

#6: The comments in the upstream commit suggested to me that Chris got the wrong alloc_workqueue() in the first commit, which he fixed up in the second. There is no explanation why the second initialization would need WQ_UNBOUND. This is why I figured that it might be worthwhile trying to restore the original 2nd call. However, as I said, I have no idea what the code is supposed to do, so this is really an absolutely wild shot.
I have also no idea what this all has to do with the iwl and the thermal code, both of which seem to be somehow involved. Given that, it might as well be that Chris' commit exposes some problem completely elsewhere (and that the problem isn't seen upstream for that reason).
I'm not in the office until Friday and no access to a caroline but could you try booting the drm-intel-next-queued kernel and see if it has the same problem? If it does then an fdo bug can be opened.  

Comment 9 by groeck@chromium.org, Apr 24 2018

#8: Trying to run an upstream kernel on a chromebook has its own challenges, and it is very unlikely that the userspace actions needed to trigger the problem will even execute (arc-setup, for example, is expected to fail because the necessary kernel support is not there). Furthermore, there is chromeos specific code in the dependency chain - essentially the entire iwl7000 code. As mentioned in #7, it is very well possible that the problem is not caused but only exposed by the WQ_UNBOUND change.
True - while I was able to boot the upstream dinq kernel (with some workaround patches and iwlwifi enabled) arc would not work. Unfortunately it is hard to prove there's an upstream bug without the full stack being functional. 
Is there an easy way to disable arc-setup from running and see if the lock loop still occurs? 
arc-setup is not always the process triggering the problem. I have also seen it with wpa_supplicant (such as in the last attached attached log).

I booted yesterday's dinq and no lock loop with wpa_supplicant or anything else, just a kasan error, but not surprised. 

Then tried v4.15-rc3 and got a lock - see the attached log. Of course there's a huge gap between v4.15-rc3 and ToT dinq to do a bisect, but a manual poking of different kernel versions can be done until a smaller range can be bisected. I chose v4.15-rc3 because I did the chromeos-4.14 i915 forklift from that kernel initially. 

dmesg-4.15-rc3.log
62.6 KB View Download
Hmm, different traceback, but cpu hotplug is involved. Makes me wonder if the problem is there.

Sign in to add a comment