From: Zheng Qixing <zhengqixing@huawei.com>
When switching IO schedulers on a block device (e.g., loop0),
blkcg_activate_policy() is called to allocate blkg_policy_data (pd)
for all blkgs associated with that device's request queue.
However, a race condition exists between blkcg_activate_policy() and
concurrent blkcg deletion that leads to a use-after-free:
T1 (blkcg_activate_policy):
- Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
- Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
- Enters the enomem error path to roll back blkg1's resources
T2 (blkcg deletion):
- blkcgA is being deleted concurrently
- blkg1 is freed via blkg_free_workfn()
- blkg1->pd is freed
T1 (continued):
- In the rollback path, accesses pd->online after blkg1->pd
has been freed
- Triggers use-after-free
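The pd->online read happens in the rollback loop; a minimal sketch of
that loop, modeled on the mainline code behind blkcg_policy_teardown_pds()
(simplified, per-blkcg locking elided):

	list_for_each_entry(blkg, &q->blkg_list, q_node) {
		struct blkg_policy_data *pd = blkg->pd[pol->plid];

		if (pd) {
			/* UAF: pd may already be freed by blkg_free_workfn() */
			if (pd->online && pol->pd_offline_fn)
				pol->pd_offline_fn(pd);
			pd->online = false;
			pol->pd_free_fn(pd);
			blkg->pd[pol->plid] = NULL;
		}
	}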
The issue occurs because blkcg_activate_policy() does not protect
against concurrent blkg freeing in the error rollback path. The call
trace is as follows:
==================================================================
BUG: KASAN: slab-use-after-free in blkcg_activate_policy+0x516/0x5f0
Read of size 1 at addr ffff88802e1bc00c by task sh/7357
CPU: 1 PID: 7357 Comm: sh Tainted: G OE 6.6.0+ #1
Call Trace:
<TASK>
blkcg_activate_policy+0x516/0x5f0
bfq_create_group_hierarchy+0x31/0x90
bfq_init_queue+0x6df/0x8e0
blk_mq_init_sched+0x290/0x3a0
elevator_switch+0x8a/0x190
elv_iosched_store+0x1f7/0x2a0
queue_attr_store+0xad/0xf0
kernfs_fop_write_iter+0x1ee/0x2e0
new_sync_write+0x154/0x260
vfs_write+0x313/0x3c0
ksys_write+0xbd/0x160
do_syscall_64+0x55/0x100
entry_SYSCALL_64_after_hwframe+0x78/0xe2
Allocated by task 7357:
bfq_pd_alloc+0x6e/0x120
blkcg_activate_policy+0x141/0x5f0
bfq_create_group_hierarchy+0x31/0x90
bfq_init_queue+0x6df/0x8e0
blk_mq_init_sched+0x290/0x3a0
elevator_switch+0x8a/0x190
elv_iosched_store+0x1f7/0x2a0
queue_attr_store+0xad/0xf0
kernfs_fop_write_iter+0x1ee/0x2e0
new_sync_write+0x154/0x260
vfs_write+0x313/0x3c0
ksys_write+0xbd/0x160
do_syscall_64+0x55/0x100
entry_SYSCALL_64_after_hwframe+0x78/0xe2
Freed by task 14318:
blkg_free_workfn+0x7f/0x200
process_one_work+0x2ef/0x5d0
worker_thread+0x38d/0x4f0
kthread+0x156/0x190
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Fix this bug by taking q->blkcg_mutex in the enomem branch of
blkcg_activate_policy().
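Holding q->blkcg_mutex in the rollback serializes it against
blkg_free_workfn(), which takes the same mutex around pd freeing,
roughly as follows (simplified from the code introduced by commit
f1c006f1c685; some steps elided):

	static void blkg_free_workfn(struct work_struct *work)
	{
		struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
						     free_work);
		struct request_queue *q = blkg->q;
		int i;

		mutex_lock(&q->blkcg_mutex);
		for (i = 0; i < BLKCG_MAX_POLS; i++)
			if (blkg->pd[i])
				blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
		...
		list_del_init(&blkg->q_node);
		mutex_unlock(&q->blkcg_mutex);
		...
	}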
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
---
block/blk-cgroup.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5e1a724a799a..af468676cad1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1693,9 +1693,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
enomem:
/* alloc failed, take down everything */
+ mutex_lock(&q->blkcg_mutex);
spin_lock_irq(&q->queue_lock);
blkcg_policy_teardown_pds(q, pol);
spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
ret = -ENOMEM;
goto out;
}
--
2.39.2
Hi,
On 2026/1/8 9:44, Zheng Qixing wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
>
> When switching IO schedulers on a block device (e.g., loop0),
> blkcg_activate_policy() is called to allocate blkg_policy_data (pd)
> for all blkgs associated with that device's request queue.
>
> However, a race condition exists between blkcg_activate_policy() and
> concurrent blkcg deletion that leads to a use-after-free:
>
> T1 (blkcg_activate_policy):
> - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
> - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
> - Enters the enomem error path to roll back blkg1's resources
>
> T2 (blkcg deletion):
> - blkcgA is being deleted concurrently
> - blkg1 is freed via blkg_free_workfn()
> - blkg1->pd is freed
>
> T1 (continued):
> - In the rollback path, accesses pd->online after blkg1->pd
> has been freed
> - Triggers use-after-free
>
> The issue occurs because blkcg_activate_policy() does not protect
> against concurrent blkg freeing in the error rollback path. The call
> trace is as follows:
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in blkcg_activate_policy+0x516/0x5f0
> Read of size 1 at addr ffff88802e1bc00c by task sh/7357
> CPU: 1 PID: 7357 Comm: sh Tainted: G OE 6.6.0+ #1
> Call Trace:
> <TASK>
> blkcg_activate_policy+0x516/0x5f0
> bfq_create_group_hierarchy+0x31/0x90
> bfq_init_queue+0x6df/0x8e0
> blk_mq_init_sched+0x290/0x3a0
> elevator_switch+0x8a/0x190
> elv_iosched_store+0x1f7/0x2a0
> queue_attr_store+0xad/0xf0
> kernfs_fop_write_iter+0x1ee/0x2e0
> new_sync_write+0x154/0x260
> vfs_write+0x313/0x3c0
> ksys_write+0xbd/0x160
> do_syscall_64+0x55/0x100
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
> Allocated by task 7357:
> bfq_pd_alloc+0x6e/0x120
> blkcg_activate_policy+0x141/0x5f0
> bfq_create_group_hierarchy+0x31/0x90
> bfq_init_queue+0x6df/0x8e0
> blk_mq_init_sched+0x290/0x3a0
> elevator_switch+0x8a/0x190
> elv_iosched_store+0x1f7/0x2a0
> queue_attr_store+0xad/0xf0
> kernfs_fop_write_iter+0x1ee/0x2e0
> new_sync_write+0x154/0x260
> vfs_write+0x313/0x3c0
> ksys_write+0xbd/0x160
> do_syscall_64+0x55/0x100
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
> Freed by task 14318:
> blkg_free_workfn+0x7f/0x200
> process_one_work+0x2ef/0x5d0
> worker_thread+0x38d/0x4f0
> kthread+0x156/0x190
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x1b/0x30
>
> Fix this bug by taking q->blkcg_mutex in the enomem branch of
> blkcg_activate_policy().
>
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
> ---
> block/blk-cgroup.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 5e1a724a799a..af468676cad1 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1693,9 +1693,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>
> enomem:
> /* alloc failed, take down everything */
> + mutex_lock(&q->blkcg_mutex);
> spin_lock_irq(&q->queue_lock);
> blkcg_policy_teardown_pds(q, pol);
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
This looks correct; however, I think it's better to also protect the
q->blkg_list iteration in blkcg_activate_policy() and blkg_destroy_all().
That way, all q->blkg_list accesses will be protected by blkcg_mutex, and
it will be easier to convert blkg protection from queue_lock to
blkcg_mutex.
> ret = -ENOMEM;
> goto out;
> }
--
Thanks,
Kuai
On 2026/1/9 0:11, Yu Kuai wrote:
> This looks correct; however, I think it's better to also protect the
> q->blkg_list iteration in blkcg_activate_policy() and blkg_destroy_all().
> That way, all q->blkg_list accesses will be protected by blkcg_mutex, and
> it will be easier to convert blkg protection from queue_lock to
> blkcg_mutex.

I tried adding blkcg_mutex protection in blkcg_activate_policy() and
blkg_destroy_all() as suggested. Unfortunately, the UAF still occurs even
with proper mutex protection.

The mutex does protect the list structure during traversal: blkgs won't be
added to or removed from q->blkg_list while we hold the lock. However, it
doesn't prevent the same blkg from being released twice.

[ 108.677948][ C0] ==================================================================
[ 108.678541][ C0] BUG: KASAN: slab-use-after-free in rcu_cblist_dequeue+0xb1/0xe0
[ 108.679117][ C0] Read of size 8 at addr ffff888108ee9e48 by task swapper/0/0
[ 108.679654][ C0]
[ 108.679827][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.0-ga7706cf69006-dirty #43
[ 108.680437][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[ 108.681125][ C0] Call Trace:
[ 108.681369][ C0]  <IRQ>
[ 108.684870][ C0]  rcu_cblist_dequeue+0xb1/0xe0
[ 108.685239][ C0]  rcu_do_batch+0x24c/0xd80
[ 108.686892][ C0]  rcu_core+0x4d1/0x7d0
[ 108.687205][ C0]  handle_softirqs+0x1ca/0x720
[ 108.687561][ C0]  irq_exit_rcu+0x141/0x1a0
[ 108.687896][ C0]  sysvec_apic_timer_interrupt+0x6e/0x90
[ 108.689218][ C0] RIP: 0010:pv_native_safe_halt+0xb/0x10
[ 108.689642][ C0] Code: 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d 97 d9 3d 00 fb f4 <e9> 50 ce 02 00 90 90 90 90 90 90 90 90 90 90 90 90 90 9b
[ 108.691075][ C0] RSP: 0018:ffffffff9cc07e00 EFLAGS: 00000206
[ 108.691537][ C0] RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffffffff9b280422
[ 108.692129][ C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff97b74d45
[ 108.692714][ C0] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed10e3602165
[ 108.693305][ C0] R10: ffff88871b010b2b R11: 0000000000000000 R12: 1ffffffff3980fc5
[ 108.693892][ C0] R13: ffffffff9cc88ec0 R14: dffffc0000000000 R15: 0000000000014810
[ 108.695293][ C0]  default_idle+0x5/0x10
[ 108.695594][ C0]  default_idle_call+0x97/0x1d0
[ 108.695942][ C0]  cpuidle_idle_call+0x1e5/0x270
[ 108.697162][ C0]  do_idle+0xef/0x150
[ 108.697454][ C0]  cpu_startup_entry+0x51/0x60
[ 108.698108][ C0]  rest_init+0x1cc/0x320
[ 108.698410][ C0]  arch_call_rest_init+0xf/0x30
[ 108.698761][ C0]  start_kernel+0x392/0x400
[ 108.699085][ C0]  x86_64_start_reservations+0x14/0x30
[ 108.699474][ C0]  x86_64_start_kernel+0x9b/0xa0
[ 108.699822][ C0]  secondary_startup_64_no_verify+0x194/0x19b
[ 108.700255][ C0]  </TASK>
[ 108.700477][ C0]
[ 108.700644][ C0] Allocated by task 1045:
[ 108.700948][ C0]  kasan_save_stack+0x1c/0x40
[ 108.701293][ C0]  kasan_set_track+0x21/0x30
[ 108.701617][ C0]  __kasan_kmalloc+0x8b/0x90
[ 108.701946][ C0]  blkg_alloc+0xbc/0x940
[ 108.702251][ C0]  blkg_create+0xcf6/0x13d0
[ 108.702576][ C0]  blkg_lookup_create+0x47b/0x810
[ 108.702935][ C0]  bio_associate_blkg_from_css+0x1a0/0x8c0
[ 108.703354][ C0]  bio_associate_blkg+0xa2/0x190
[ 108.703704][ C0]  bio_init+0x272/0x8d0
[ 108.704000][ C0]  bio_alloc_bioset+0x454/0x770
[ 108.704350][ C0]  ext4_bio_write_folio+0x68e/0x10d0
[ 108.704729][ C0]  mpage_submit_folio+0x14a/0x2b0
[ 108.705090][ C0]  mpage_process_page_bufs+0x1b1/0x390
[ 108.705492][ C0]  mpage_prepare_extent_to_map+0xa91/0x1060
[ 108.705915][ C0]  ext4_do_writepages+0x9af/0x1d60
[ 108.706288][ C0]  ext4_writepages+0x281/0x5a0
[ 108.706634][ C0]  do_writepages+0x165/0x5f0
[ 108.707057][ C0]  filemap_fdatawrite_wbc+0x111/0x170
[ 108.707450][ C0]  __filemap_fdatawrite_range+0x9d/0xd0
[ 108.707851][ C0]  file_write_and_wait_range+0x97/0x110
[ 108.708251][ C0]  ext4_sync_file+0x1fb/0xb60
[ 108.708592][ C0]  __x64_sys_fsync+0x55/0x90
[ 108.708932][ C0]  do_syscall_64+0x6b/0x120
[ 108.709262][ C0]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 108.709690][ C0]
[ 108.709867][ C0] Freed by task 338:
[ 108.710150][ C0]  kasan_save_stack+0x1c/0x40
[ 108.710496][ C0]  kasan_set_track+0x21/0x30
[ 108.710835][ C0]  kasan_save_free_info+0x27/0x40
[ 108.711203][ C0]  __kasan_slab_free+0x106/0x180
[ 108.711564][ C0]  __kmem_cache_free+0x1dd/0x470
[ 108.711923][ C0]  process_one_work+0x774/0x13a0
[ 108.712288][ C0]  worker_thread+0x6eb/0x12c0
[ 108.712631][ C0]  kthread+0x29f/0x360
[ 108.712928][ C0]  ret_from_fork+0x30/0x70
[ 108.713251][ C0]  ret_from_fork_asm+0x1b/0x30
Hi,

On 2026/1/9 14:22, Zheng Qixing wrote:
> I tried adding blkcg_mutex protection in blkcg_activate_policy() and
> blkg_destroy_all() as suggested. Unfortunately, the UAF still occurs
> even with proper mutex protection.
>
> The mutex does protect the list structure during traversal: blkgs won't
> be added to or removed from q->blkg_list while we hold the lock.
> However, it doesn't prevent the same blkg from being released twice.

I don't understand: what I suggested should already include your changes
from this patch. Can you show your changes, and make sure the
blkcg_policy_teardown_pds() call inside blkcg_activate_policy() is also
protected?

--
Thanks,
Kuai
Sorry, I replied with the wrong patch. Even after adding blkcg_mutex in
blkg_destroy_all() and blkcg_activate_policy(), the UAF issue described
in the 3rd patch (multiple calls to call_rcu releasing the same blkg)
still occurs.
For the 2nd patch, in addition to the blkcg_mutex that I initially added
in the enomem branch, it is indeed possible to take blkcg_mutex at the
other places where blkg_list is accessed. This would prevent the case
where pd_alloc_fn(..., GFP_NOWAIT) succeeds while the corresponding blkg
is being destroyed, which could otherwise lead to a memory leak. The
changes I tried are quoted below:
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 5e1a724a799a..439cafa98c92 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -573,6 +573,7 @@ static void blkg_destroy_all(struct gendisk *disk)
> int count = BLKG_DESTROY_BATCH_SIZE;
> int i;
>
> + mutex_lock(&q->blkcg_mutex);
> restart:
> spin_lock_irq(&q->queue_lock);
> list_for_each_entry(blkg, &q->blkg_list, q_node) {
> @@ -592,7 +593,9 @@ static void blkg_destroy_all(struct gendisk *disk)
> if (!(--count)) {
> count = BLKG_DESTROY_BATCH_SIZE;
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> cond_resched();
> + mutex_lock(&q->blkcg_mutex);
> goto restart;
> }
> }
> @@ -611,6 +614,7 @@ static void blkg_destroy_all(struct gendisk *disk)
>
> q->root_blkg = NULL;
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> }
>
> static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
> @@ -1621,6 +1625,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>
> if (queue_is_mq(q))
> memflags = blk_mq_freeze_queue(q);
> +
> + mutex_lock(&q->blkcg_mutex);
> retry:
> spin_lock_irq(&q->queue_lock);
>
> @@ -1682,6 +1688,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
> ret = 0;
>
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> out:
> if (queue_is_mq(q))
> blk_mq_unfreeze_queue(q, memflags);
> @@ -1696,6 +1703,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
> spin_lock_irq(&q->queue_lock);
> blkcg_policy_teardown_pds(q, pol);
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> ret = -ENOMEM;
> goto out;
> }