[v1] blk-cgroup: cleanup and bug fixes in blk-cgroup

[PATCH 3/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()

Posted by Zheng Qixing 1 month ago

From: Zheng Qixing <zhengqixing@huawei.com>

When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free of
the blkg.

T1:				  T2:
elv_iosched_store		  blkg_destroy
elevator_switch			  kill(&blkg->refcnt) // blkg->refcnt=0
...				  blkg_release // call_rcu
blkcg_activate_policy		  __blkg_release
list for blkg			  blkg_free
				  blkg_free_workfn
				  ->pd_free_fn(pd)
blkg_get(blkg) // blkg->refcnt=0->1
				  list_del_init(&blkg->q_node)
				  kfree(blkg)
blkg_put(pinned_blkg) // blkg->refcnt=1->0
blkg_release // call_rcu again
call_rcu(..., __blkg_release)

Fix this by replacing blkg_get() with blkg_tryget(), which fails if
the blkg's refcount has already reached zero. If blkg_tryget() fails,
skip processing this blkg since it's already being destroyed.
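
To make the difference between the two helpers concrete, here is a minimal
user-space sketch in C11. It is only a model: struct obj, obj_get(),
obj_tryget(), obj_put() and the two release_count_*() helpers are
hypothetical names, and a plain atomic counter stands in for the blkg's
refcount, which in the kernel is a percpu_ref. The sketch replays the steps
of the diagram above on a single thread, in the interleaved order shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a blkg: a refcount and a release counter. */
struct obj {
	atomic_int refcnt;
	int releases;	/* how many times the release path ran */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->refcnt, 1);
	o->releases = 0;
}

/* Unconditional get: increments even if the count already reached zero. */
static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcnt, 1);
}

/* Conditional get: succeeds only while the count is still non-zero. */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the release path runs each time the count hits zero. */
static void obj_put(struct obj *o)
{
	int prev = atomic_fetch_sub(&o->refcnt, 1);

	assert(prev > 0);	/* catch underflow in the model */
	if (prev == 1)
		o->releases++;	/* models the release callback being queued */
}

/* Buggy pattern: pin an object whose count has already dropped to zero. */
static int release_count_with_get(void)
{
	struct obj o;

	obj_init(&o);
	obj_put(&o);	/* T2: last reference dropped, 1 -> 0 */
	obj_get(&o);	/* T1: 0 -> 1, revives a dying object */
	obj_put(&o);	/* T1: 1 -> 0, release runs a second time */
	return o.releases;
}

/* Fixed pattern: skip the object when the conditional get fails. */
static int release_count_with_tryget(void)
{
	struct obj o;

	obj_init(&o);
	obj_put(&o);		/* T2: last reference dropped, 1 -> 0 */
	if (obj_tryget(&o))	/* fails: the count is already zero */
		obj_put(&o);
	return o.releases;
}
```

In this model release_count_with_get() returns 2, because the release path
runs twice for the same object, while release_count_with_tryget() returns 1.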

The UAF call trace is as follows:

==================================================================
 BUG: KASAN: slab-use-after-free in rcu_accelerate_cbs+0x114/0x120
 Read of size 8 at addr ffff88815a20b5d8 by task bash/1068
 CPU: 0 PID: 1068 Comm: bash Not tainted 6.6.0-g6918ead378dc-dirty #31
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
Call Trace:
 <IRQ>
 rcu_accelerate_cbs+0x114/0x120
 rcu_report_qs_rdp+0x1fb/0x3e0
 rcu_core+0x4d7/0x6f0
 handle_softirqs+0x198/0x550
 irq_exit_rcu+0x130/0x190
 sysvec_apic_timer_interrupt+0x6e/0x90
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x16/0x20

Allocated by task 1031:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 __kasan_kmalloc+0x8b/0x90
 blkg_alloc+0xb6/0x9c0
 blkg_create+0x8c6/0x1010
 blkg_lookup_create+0x2ca/0x660
 bio_associate_blkg_from_css+0xfb/0x4e0
 bio_associate_blkg+0x62/0xf0
 bio_init+0x272/0x8d0
 bio_alloc_bioset+0x45a/0x760
 ext4_bio_write_folio+0x68e/0x10d0
 mpage_submit_folio+0x14a/0x2b0
 mpage_process_page_bufs+0x1b1/0x390
 mpage_prepare_extent_to_map+0xa91/0x1060
 ext4_do_writepages+0x948/0x1c50
 ext4_writepages+0x23f/0x4a0
 do_writepages+0x162/0x5e0
 filemap_fdatawrite_wbc+0x11a/0x180
 __filemap_fdatawrite_range+0x9d/0xd0
 file_write_and_wait_range+0x91/0x110
 ext4_sync_file+0x1c1/0xaa0
 __x64_sys_fsync+0x55/0x90
 do_syscall_64+0x55/0x100
 entry_SYSCALL_64_after_hwframe+0x78/0xe2

Freed by task 24:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x27/0x40
 __kasan_slab_free+0x106/0x180
 __kmem_cache_free+0x162/0x350
 process_one_work+0x573/0xd30
 worker_thread+0x67f/0xc30
 kthread+0x28b/0x350
 ret_from_fork+0x30/0x70
 ret_from_fork_asm+0x1b/0x30

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
---
 block/blk-cgroup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index af468676cad1..ac7702db0836 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1645,9 +1645,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 			 * GFP_NOWAIT failed.  Free the existing one and
 			 * prealloc for @blkg w/ GFP_KERNEL.
 			 */
+			if (!blkg_tryget(blkg))
+				continue;
 			if (pinned_blkg)
 				blkg_put(pinned_blkg);
-			blkg_get(blkg);
 			pinned_blkg = blkg;
 
 			spin_unlock_irq(&q->queue_lock);
-- 
2.39.2

Re: [PATCH 3/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()

Posted by Yu Kuai 1 month ago

Hi,

On 2026/1/8 9:44, Zheng Qixing wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
>
> When switching IO schedulers on a block device, blkcg_activate_policy()
> can race with concurrent blkcg deletion, leading to a use-after-free of
> the blkg.
>
> T1:				  T2:
> elv_iosched_store		  blkg_destroy
> elevator_switch			  kill(&blkg->refcnt) // blkg->refcnt=0
> ...				  blkg_release // call_rcu
> blkcg_activate_policy		  __blkg_release
> list for blkg			  blkg_free
> 				  blkg_free_workfn
> 				  ->pd_free_fn(pd)
> blkg_get(blkg) // blkg->refcnt=0->1
> 				  list_del_init(&blkg->q_node)
> 				  kfree(blkg)
> blkg_put(pinned_blkg) // blkg->refcnt=1->0
> blkg_release // call_rcu again
> call_rcu(..., __blkg_release)

This stack is not clear to me. Can this problem be fixed by protecting the
q->blkg_list iteration with blkcg_mutex, as I said in patch 2?

>
> Fix this by replacing blkg_get() with blkg_tryget(), which fails if
> the blkg's refcount has already reached zero. If blkg_tryget() fails,
> skip processing this blkg since it's already being destroyed.
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index af468676cad1..ac7702db0836 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1645,9 +1645,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>   			 * GFP_NOWAIT failed.  Free the existing one and
>   			 * prealloc for @blkg w/ GFP_KERNEL.
>   			 */

Why is this check not done before pd_alloc_fn()? What if pd_alloc_fn() succeeds for a
removed blkg?

> +			if (!blkg_tryget(blkg))
> +				continue;
>   			if (pinned_blkg)
>   				blkg_put(pinned_blkg);
> -			blkg_get(blkg);
>   			pinned_blkg = blkg;
>   
>   			spin_unlock_irq(&q->queue_lock);

-- 
Thanks,
Kuai

Re: [PATCH 3/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()

Posted by Zheng Qixing 1 month ago

On 2026/1/9 0:22, Yu Kuai wrote:
> Hi,
>
> On 2026/1/8 9:44, Zheng Qixing wrote:
>> From: Zheng Qixing <zhengqixing@huawei.com>
>>
>> When switching IO schedulers on a block device, blkcg_activate_policy()
>> can race with concurrent blkcg deletion, leading to a use-after-free of
>> the blkg.
>>
>> T1:				  T2:
>> elv_iosched_store		  blkg_destroy
>> elevator_switch			  kill(&blkg->refcnt) // blkg->refcnt=0
>> ...				  blkg_release // call_rcu
>> blkcg_activate_policy		  __blkg_release
>> list for blkg			  blkg_free
>> 				  blkg_free_workfn
>> 				  ->pd_free_fn(pd)
>> blkg_get(blkg) // blkg->refcnt=0->1
>> 				  list_del_init(&blkg->q_node)
>> 				  kfree(blkg)
>> blkg_put(pinned_blkg) // blkg->refcnt=1->0
>> blkg_release // call_rcu again
>> call_rcu(..., __blkg_release)
> This stack is not clear to me. Can this problem be fixed by protecting the
> q->blkg_list iteration with blkcg_mutex, as I said in patch 2?

It appears that adding blkcg_mutex still cannot resolve the issue where the
same blkg has its reference count decremented to 0 twice. However, it does
fix the memory leak caused by pd_alloc_fn() succeeding for a blkg that has
already been removed.

>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index af468676cad1..ac7702db0836 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -1645,9 +1645,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>>    			 * GFP_NOWAIT failed.  Free the existing one and
>>    			 * prealloc for @blkg w/ GFP_KERNEL.
>>    			 */
> Why is this check not done before pd_alloc_fn()? What if pd_alloc_fn() succeeds for a
> removed blkg?

I will fix this memory leak issue in the next revision.
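
As an illustration only, and not the actual next revision: the ordering
asked about above could look roughly like the following user-space sketch,
where the liveness check comes before the allocation. struct item,
item_is_live() and attach_pd() are hypothetical names, malloc() stands in
for pd_alloc_fn(), and a plain integer stands in for the refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a blkg with one policy-data slot. */
struct item {
	int refcnt;	/* 0 means the item is already being destroyed */
	void *pd;	/* policy data, NULL until attached */
};

static bool item_is_live(const struct item *it)
{
	return it->refcnt > 0;
}

/*
 * Attach policy data to every live item that lacks it.
 * Returns the number of allocations made, or -1 if an allocation fails
 * (the caller then frees whatever was attached so far).
 */
static int attach_pd(struct item *items, size_t n)
{
	int allocated = 0;

	assert(items != NULL);
	for (size_t i = 0; i < n; i++) {
		if (items[i].pd)
			continue;	/* already has policy data */
		if (!item_is_live(&items[i]))
			continue;	/* dying item: check before allocating */
		items[i].pd = malloc(16);	/* arbitrary size for the sketch */
		if (!items[i].pd)
			return -1;
		allocated++;
	}
	return allocated;
}
```

With the check placed first, a dying item never receives policy data, so
there is nothing left behind for its (already completed) free path to miss.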


Thanks,

Qixing

Re: [PATCH 3/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()

Posted by Christoph Hellwig 1 month ago

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>