Observed lockdep circular dependency in SR-IOV paths

[RFC] Observed lockdep circular dependency in SR-IOV paths
Posted by Raghavendra Rao Ananta 16 hours ago
Hi Alex,

While running the vfio_pci_sriov_uapi_test [1] on a CONFIG_LOCKDEP
enabled kernel (7.0-rc1), we observed the following lockdep circular
locking dependency warning:

[  286.997167] ======================================================
[  287.003363] WARNING: possible circular locking dependency detected
[  287.009562] 7.0.0-dbg-DEV #3 Tainted: G S
[  287.015074] ------------------------------------------------------
[  287.021270] vfio_pci_sriov_/18636 is trying to acquire lock:
[  287.026942] ff45bea2294d4968 (&vdev->memory_lock){+.+.}-{4:4}, at:
vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.036530]
[  287.036530] but task is already holding lock:
[  287.042383] ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4}, at:
vfio_group_fops_unl_ioctl+0x44d/0x7b0
[  287.051879]
[  287.051879] which lock already depends on the new lock.
[  287.051879]
[  287.060070]
[  287.060070] the existing dependency chain (in reverse order) is:
[  287.067568]
[  287.067568] -> #2 (&new_dev_set->lock){+.+.}-{4:4}:
[  287.073941]        __mutex_lock+0x92/0xb80
[  287.078058]        vfio_assign_device_set+0x66/0x1b0
[  287.083042]        vfio_pci_core_register_device+0xd1/0x2a0
[  287.088638]        vfio_pci_probe+0xd2/0x100
[  287.092933]        local_pci_probe_callback+0x4d/0xa0
[  287.098001]        process_scheduled_works+0x2ca/0x680
[  287.103158]        worker_thread+0x1e8/0x2f0
[  287.107452]        kthread+0x10c/0x140
[  287.111230]        ret_from_fork+0x18e/0x360
[  287.115519]        ret_from_fork_asm+0x1a/0x30
[  287.119983]
[  287.119983] -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
[  287.127219]        __flush_work+0x345/0x490
[  287.131429]        pci_device_probe+0x2e3/0x490
[  287.135979]        really_probe+0x1f9/0x4e0
[  287.140180]        __driver_probe_device+0x77/0x100
[  287.145079]        driver_probe_device+0x1e/0x110
[  287.149803]        __device_attach_driver+0xe3/0x170
[  287.154789]        bus_for_each_drv+0x125/0x150
[  287.159346]        __device_attach+0xca/0x1a0
[  287.163720]        device_initial_probe+0x34/0x50
[  287.168445]        pci_bus_add_device+0x6e/0x90
[  287.172995]        pci_iov_add_virtfn+0x3c9/0x3e0
[  287.177719]        sriov_add_vfs+0x2c/0x60
[  287.181838]        sriov_enable+0x306/0x4a0
[  287.186038]        vfio_pci_core_sriov_configure+0x184/0x220
[  287.191715]        sriov_numvfs_store+0xd9/0x1c0
[  287.196351]        kernfs_fop_write_iter+0x13f/0x1d0
[  287.201338]        vfs_write+0x2be/0x3b0
[  287.205286]        ksys_write+0x73/0x100
[  287.209233]        do_syscall_64+0x14d/0x750
[  287.213529]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  287.219120]
[  287.219120] -> #0 (&vdev->memory_lock){+.+.}-{4:4}:
[  287.225491]        __lock_acquire+0x14c6/0x2800
[  287.230048]        lock_acquire+0xd3/0x2f0
[  287.234168]        down_write+0x3a/0xc0
[  287.238019]        vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.243436]        __rpm_callback+0x8c/0x310
[  287.247730]        rpm_resume+0x529/0x6f0
[  287.251765]        __pm_runtime_resume+0x68/0x90
[  287.256402]        vfio_pci_core_enable+0x44/0x310
[  287.261216]        vfio_pci_open_device+0x1c/0x80
[  287.265947]        vfio_df_open+0x10f/0x150
[  287.270148]        vfio_group_fops_unl_ioctl+0x4a4/0x7b0
[  287.275476]        __se_sys_ioctl+0x71/0xc0
[  287.279679]        do_syscall_64+0x14d/0x750
[  287.283975]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  287.289559]
[  287.289559] other info that might help us debug this:
[  287.289559]
[  287.297582] Chain exists of:
[  287.297582]   &vdev->memory_lock --> (work_completion)(&arg.work)
--> &new_dev_set->lock
[  287.297582]
[  287.310023]  Possible unsafe locking scenario:
[  287.310023]
[  287.315961]        CPU0                    CPU1
[  287.320510]        ----                    ----
[  287.325059]   lock(&new_dev_set->lock);
[  287.328917]
lock((work_completion)(&arg.work));
[  287.336153]                                lock(&new_dev_set->lock);
[  287.342523]   lock(&vdev->memory_lock);
[  287.346382]
[  287.346382]  *** DEADLOCK ***
[  287.346382]
[  287.352315] 2 locks held by vfio_pci_sriov_/18636:
[  287.357125]  #0: ff45bea208ed3e18 (&group->group_lock){+.+.}-{4:4},
at: vfio_group_fops_unl_ioctl+0x3e3/0x7b0
[  287.367048]  #1: ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4},
at: vfio_group_fops_unl_ioctl+0x44d/0x7b0
[  287.376976]
[  287.376976] stack backtrace:
[  287.381353] CPU: 191 UID: 0 PID: 18636 Comm: vfio_pci_sriov_
Tainted: G S                  7.0.0-dbg-DEV #3 PREEMPTLAZY
[  287.381355] Tainted: [S]=CPU_OUT_OF_SPEC
[  287.381356] Call Trace:
[  287.381357]  <TASK>
[  287.381358]  dump_stack_lvl+0x54/0x70
[  287.381361]  print_circular_bug+0x2e1/0x300
[  287.381363]  check_noncircular+0xf9/0x120
[  287.381364]  ? __lock_acquire+0x5b4/0x2800
[  287.381366]  __lock_acquire+0x14c6/0x2800
[  287.381368]  ? pci_mmcfg_read+0x4f/0x220
[  287.381370]  ? pci_mmcfg_write+0x57/0x220
[  287.381371]  ? lock_acquire+0xd3/0x2f0
[  287.381373]  ? pci_mmcfg_write+0x57/0x220
[  287.381374]  ? lock_release+0xef/0x360
[  287.381376]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381377]  lock_acquire+0xd3/0x2f0
[  287.381378]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381379]  ? lock_is_held_type+0x76/0x100
[  287.381382]  down_write+0x3a/0xc0
[  287.381382]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381383]  vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381384]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[  287.381385]  __rpm_callback+0x8c/0x310
[  287.381386]  ? ktime_get_mono_fast_ns+0x3d/0xb0
[  287.381389]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[  287.381390]  rpm_resume+0x529/0x6f0
[  287.381392]  ? lock_is_held_type+0x76/0x100
[  287.381394]  __pm_runtime_resume+0x68/0x90
[  287.381396]  vfio_pci_core_enable+0x44/0x310
[  287.381398]  vfio_pci_open_device+0x1c/0x80
[  287.381399]  vfio_df_open+0x10f/0x150
[  287.381401]  vfio_group_fops_unl_ioctl+0x4a4/0x7b0
[  287.381402]  __se_sys_ioctl+0x71/0xc0
[  287.381404]  do_syscall_64+0x14d/0x750
[  287.381405]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  287.381406]  ? trace_irq_disable+0x25/0xd0
[  287.381409]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  287.381410] RIP: 0033:0x447c8b
[  287.381413] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[  287.381414] RSP: 002b:00007ffff5fb2530 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  287.381415] RAX: ffffffffffffffda RBX: 00007ffff5fb2d90 RCX: 0000000000447c8b
[  287.381416] RDX: 00007ffff5fb25b0 RSI: 0000000000003b6a RDI: 0000000000000006
[  287.381417] RBP: 00007ffff5fb2620 R08: 0000000000000000 R09: 00000000ffffffff
[  287.381418] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffff5fb2d78
[  287.381419] R13: 0000000000000016 R14: 00000000004f5440 R15: 0000000000000016
[  287.381423]  </TASK>

Initial analysis suggests it could be a false positive, depicting a
classic ABBA circular locking scenario:
* Trace#1 and Trace#2: Triggered while setting sriov_numvfs, where the
PF's vdev->memory_lock (A) is taken first, and then the VF's
dev_set->lock is allocated and taken (B).
* Trace#0: Triggered during VFIO_GROUP_GET_DEVICE_FD where first the
VF's dev_set->lock is taken (B) and then the PF's vdev->memory_lock
(A) is taken.

From Trace#1/Trace#2, the dev_set->lock (B) is allocated and
immediately grabbed during VF creation. Since it isn't taken until
after the device is fully created and VFIO_GROUP_GET_DEVICE_FD is
called on it, it's perfectly synchronized, and therefore cannot cause
a deadlock. As far as we know, this is at least true for the VFs being
created, i.e, we were able to find a 'dev_set' from the
'vfio_device_set_xa' xarray using 'idx' in vfio_assign_device_set().

I've tried this naiive diff, which seems to help with the issue, but
may not be a complete solution:

--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -106,7 +107,8 @@ int vfio_assign_device_set(struct vfio_device
*device, void *set_id)
 found_get_ref:
        dev_set->device_count++;
        xa_unlock(&vfio_device_set_xa);
-       mutex_lock(&dev_set->lock);
+       mutex_lock_nested(&dev_set->lock, SINGLE_DEPTH_NESTING);
        device->dev_set = dev_set;
        list_add_tail(&device->dev_set_list, &dev_set->device_list);
        mutex_unlock(&dev_set->lock);

I wanted your opinion or suggestions on how to proceed with the warning.

Thank you.
Raghavendra

[1]: https://lore.kernel.org/all/20260303193822.2526335-1-rananta@google.com/