[PATCH 0/4] blk-rq-qos: fix possible deadlock

Yu Kuai posted 4 patches 2 months ago
block/blk-cgroup.c     |  6 +++++
block/blk-mq-debugfs.c | 58 ++++++++++++++++++++++++++++++------------
block/blk-mq-debugfs.h |  4 +--
block/blk-rq-qos.c     |  7 -----
block/blk-sysfs.c      |  4 +++
block/blk-wbt.c        |  7 ++++-
6 files changed, 60 insertions(+), 26 deletions(-)
[PATCH 0/4] blk-rq-qos: fix possible deadlock
Posted by Yu Kuai 2 months ago
Currently rq-qos debugfs entries is created from rq_qos_add(), while
rq_qos_add() requires queue to be freezed. This can deadlock because

creating new entries can trigger fs reclaim.

Fix this problem by delaying creating rq-qos debugfs entries until
it's initialization is complete.

- For wbt, it can be initialized by default of by blk-sysfs, fix it by
  delaying after wbt_init();
- For other policies, they can only be initialized by blkg configuration,
  fix it by delaying to blkg_conf_end();

Noted this set is cooked on the top of my other thread:
https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/

And the deadlock can be reporduced with above thead, by running blktests
throtl/001 with wbt enabled by default. While the deadlock is really a
long term problem.

BTW, it's probably better to fix this problem before the above huge
refactor, I'm sending this set anyway to make sure if people will aggree
about the solution.

Yu Kuai (4):
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: factor out a helper blk_mq_debugfs_register_rq_qos()
  blk-rq-qos: fix possible deadlock
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static

 block/blk-cgroup.c     |  6 +++++
 block/blk-mq-debugfs.c | 58 ++++++++++++++++++++++++++++++------------
 block/blk-mq-debugfs.h |  4 +--
 block/blk-rq-qos.c     |  7 -----
 block/blk-sysfs.c      |  4 +++
 block/blk-wbt.c        |  7 ++++-
 6 files changed, 60 insertions(+), 26 deletions(-)

-- 
2.39.2
Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
Posted by Nilay Shroff 2 months ago

On 10/14/25 7:51 AM, Yu Kuai wrote:
> Currently rq-qos debugfs entries is created from rq_qos_add(), while
> rq_qos_add() requires queue to be freezed. This can deadlock because
> 
> creating new entries can trigger fs reclaim.
> 
> Fix this problem by delaying creating rq-qos debugfs entries until
> it's initialization is complete.
> 
> - For wbt, it can be initialized by default of by blk-sysfs, fix it by
>   delaying after wbt_init();
> - For other policies, they can only be initialized by blkg configuration,
>   fix it by delaying to blkg_conf_end();
> 
> Noted this set is cooked on the top of my other thread:
> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
> 
> And the deadlock can be reporduced with above thead, by running blktests
> throtl/001 with wbt enabled by default. While the deadlock is really a
> long term problem.
> 
While freezing the queue we also mark GFP_NOIO scope, so doesn't that
help avoid fs-reclaim? Or maybe if you can share the lockdep splat 
encountered running throtl/001?

Thanks,
--Nilay
Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
Posted by Yu Kuai 2 months ago
Hi,

在 2025/10/14 18:58, Nilay Shroff 写道:
>
> On 10/14/25 7:51 AM, Yu Kuai wrote:
>> Currently rq-qos debugfs entries is created from rq_qos_add(), while
>> rq_qos_add() requires queue to be freezed. This can deadlock because
>>
>> creating new entries can trigger fs reclaim.
>>
>> Fix this problem by delaying creating rq-qos debugfs entries until
>> it's initialization is complete.
>>
>> - For wbt, it can be initialized by default of by blk-sysfs, fix it by
>>    delaying after wbt_init();
>> - For other policies, they can only be initialized by blkg configuration,
>>    fix it by delaying to blkg_conf_end();
>>
>> Noted this set is cooked on the top of my other thread:
>> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
>>
>> And the deadlock can be reporduced with above thead, by running blktests
>> throtl/001 with wbt enabled by default. While the deadlock is really a
>> long term problem.
>>
> While freezing the queue we also mark GFP_NOIO scope, so doesn't that
> help avoid fs-reclaim? Or maybe if you can share the lockdep splat
> encountered running throtl/001?

Yes, we can avoid fs-reclaim if queue is freezing, however,
because debugfs is a generic file system, and we can't avoid fs reclaim from
all context. There is still

Following is the log with above set and wbt enabled by default, the set acquire
lock order by:

freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex

However, fs-reclaim from other context cause the deadlock report.


[   45.632372][  T531] ======================================================
[   45.633734][  T531] WARNING: possible circular locking dependency detected
[   45.635062][  T531] 6.17.0-gfd4a560a0864-dirty #30 Not tainted
[   45.636220][  T531] ------------------------------------------------------
[   45.637587][  T531] check/531 is trying to acquire lock:
[   45.638626][  T531] ffff9473884382b0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_
conf_start+0x116/0x190
[   45.640416][  T531]
[   45.640416][  T531] but task is already holding lock:
[   45.641828][  T531] ffff9473884385d8 (&q->elevator_lock){+.+.}-{4:4}, at: blkg
_conf_start+0x108/0x190
[   45.643322][  T531]
[   45.643322][  T531] which lock already depends on the new lock.
[   45.643322][  T531]
[   45.644862][  T531]
[   45.644862][  T531] the existing dependency chain (in reverse order) is:
[   45.646046][  T531]
[   45.646046][  T531] -> #5 (&q->elevator_lock){+.+.}-{4:4}:
[   45.647052][  T531]        __mutex_lock+0xd3/0x8d0
[   45.647716][  T531]        blkg_conf_start+0x108/0x190
[   45.648395][  T531]        tg_set_limit+0x74/0x300
[   45.649046][  T531]        kernfs_fop_write_iter+0x14a/0x210
[   45.649813][  T531]        vfs_write+0x29e/0x550
[   45.650413][  T531]        ksys_write+0x74/0xf0
[   45.651032][  T531]        do_syscall_64+0xbb/0x380
[   45.651730][  T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   45.652533][  T531]
[   45.652533][  T531] -> #4 (&q->q_usage_counter(io)#3){++++}-{0:0}:
[   45.653297][  T531]        blk_alloc_queue+0x30b/0x350
[   45.653807][  T531]        blk_mq_alloc_queue+0x5d/0xd0
[   45.654300][  T531]        __blk_mq_alloc_disk+0x13/0x60
[   45.654810][  T531]        null_add_dev+0x2f8/0x660 [null_blk]
[   45.655374][  T531] nullb_device_power_store+0x88/0x130 [null_blk]
[   45.656009][  T531]        configfs_write_iter+0xcb/0x130 [configfs]
[   45.656614][  T531]        vfs_write+0x29e/0x550
[   45.657045][  T531]        ksys_write+0x74/0xf0
[   45.657497][  T531]        do_syscall_64+0xbb/0x380
[   45.657958][  T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   45.658595][  T531]
[   45.658595][  T531] -> #3 (fs_reclaim){+.+.}-{0:0}:
[   45.659266][  T531]        fs_reclaim_acquire+0xa4/0xe0
[   45.659783][  T531] kmem_cache_alloc_lru_noprof+0x3b/0x2a0
[   45.660369][  T531]        alloc_inode+0x1a/0x60
[   45.660873][  T531]        new_inode+0xd/0x90
[   45.661291][  T531]        debugfs_create_dir+0x38/0x160
[   45.661806][  T531]        component_debug_init+0x12/0x20
[   45.662316][  T531]        do_one_initcall+0x68/0x2b0
[   45.662807][  T531]        kernel_init_freeable+0x238/0x290
[   45.663241][  T531]        kernel_init+0x15/0x130
[   45.663659][  T531]        ret_from_fork+0x182/0x1d0
[   45.664054][  T531]        ret_from_fork_asm+0x1a/0x30
[   45.664601][  T531]
[   45.664601][  T531] -> #2 (&sb->s_type->i_mutex_key#2){+.+.15:25:59 [194/1936]
[   45.665274][  T531]        down_write+0x3a/0xb0
[   45.665669][  T531]        simple_start_creating+0x51/0x110
[   45.666097][  T531]        debugfs_start_creating+0x68/0xd0
[   45.666561][  T531]        debugfs_create_dir+0x10/0x160
[   45.666970][  T531]        blk_register_queue+0x8b/0x1e0
[   45.667386][  T531]        __add_disk+0x253/0x3f0
[   45.667804][  T531]        add_disk_fwnode+0x71/0x180
[   45.668205][  T531]        virtblk_probe+0x9c2/0xb50
[   45.668603][  T531]        virtio_dev_probe+0x223/0x380
[   45.669004][  T531]        really_probe+0xc2/0x260
[   45.669369][  T531]        __driver_probe_device+0x6e/0x100
[   45.669856][  T531]        driver_probe_device+0x1a/0xd0
[   45.670263][  T531]        __driver_attach+0x93/0x1a0
[   45.670672][  T531]        bus_for_each_dev+0x87/0xe0
[   45.671063][  T531]        bus_add_driver+0xe0/0x1d0
[   45.671469][  T531]        driver_register+0x70/0xe0
[   45.671856][  T531]        virtio_blk_init+0x53/0x80
[   45.672258][  T531]        do_one_initcall+0x68/0x2b0
[   45.672661][  T531]        kernel_init_freeable+0x238/0x290
[   45.673097][  T531]        kernel_init+0x15/0x130
[   45.673510][  T531]        ret_from_fork+0x182/0x1d0
[   45.673907][  T531]        ret_from_fork_asm+0x1a/0x30
[   45.674319][  T531]
[   45.674319][  T531] -> #1 (&q->debugfs_mutex){+.+.}-{4:4}:
[   45.674929][  T531]        __mutex_lock+0xd3/0x8d0
[   45.675315][  T531]        rq_qos_add+0xe0/0x130
[   45.675717][  T531]        wbt_init+0x183/0x210
[   45.676062][  T531]        blk_register_queue+0xdf/0x1e0
[   45.676490][  T531]        __add_disk+0x253/0x3f0
[   45.676844][  T531]        add_disk_fwnode+0x71/0x180
[   45.677247][  T531]        virtblk_probe+0x9c2/0xb50
[   45.677648][  T531]        virtio_dev_probe+0x223/0x380
[   45.678044][  T531]        really_probe+0xc2/0x260
[   45.678411][  T531]        __driver_probe_device+0x6e/0x100
[   45.678875][  T531]        driver_probe_device+0x1a/0xd0
[   45.679282][  T531]        __driver_attach+0x93/0x1a0
[   45.679678][  T531]        bus_for_each_dev+0x87/0xe0
[   45.680053][  T531]        bus_add_driver+0xe0/0x1d0
[   45.680458][  T531]        driver_register+0x70/0xe0
[   45.680823][  T531]        virtio_blk_init+0x53/0x80
[   45.681208][  T531]        do_one_initcall+0x68/0x2b0
[   45.681611][  T531]        kernel_init_freeable+0x238/0x290
[   45.682027][  T531]        kernel_init+0x15/0x130
[   45.682392][  T531]        ret_from_fork+0x182/0x1d0
[   45.682829][  T531]        ret_from_fork_asm+0x1a/0x30
[   45.683240][  T531]
[   45.683240][  T531] -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
[   45.683867][  T531]        __lock_acquire+0x103d/0x1650
[   45.684411][  T531]        lock_acquire+0xbc/0x2c0
[   45.684798][  T531]        __mutex_lock+0xd3/0x8d0
[   45.685172][  T531]        blkg_conf_start+0x116/0x190
[   45.685623][  T531]        tg_set_limit+0x74/0x300
[   45.685986][  T531]        kernfs_fop_write_iter+0x14a/0x210
[   45.686405][  T531]        vfs_write+0x29e/0x550
[   45.686771][  T531]        ksys_write+0x74/0xf0
[   45.687112][  T531]        do_syscall_64+0xbb/0x380
[   45.687514][  T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   45.687983][  T531]
[   45.687983][  T531] other info that might help us debug this:
[   45.687983][  T531]
[   45.688760][  T531] Chain exists of:
[   45.688760][  T531]   &q->rq_qos_mutex --> &q->q_usage_counter(io)#3 --> &q->e
levator_lock
[   45.688760][  T531]
[   45.689817][  T531]  Possible unsafe locking scenario:
[   45.689817][  T531]
[   45.690359][  T531]        CPU0                    CPU1
[   45.690764][  T531]        ----                    ----
[   45.691172][  T531]   lock(&q->elevator_lock);
[   45.691544][  T531] lock(&q->q_usage_counter(io
)#3);
[   45.692108][  T531] lock(&q->elevator_lock);
[   45.692659][  T531]   lock(&q->rq_qos_mutex);
[   45.692994][  T531]
[   45.692994][  T531]  *** DEADLOCK ***

Thanks, Kuai




>
> Thanks,
> --Nilay
>