[PATCH v2 2/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()

Zheng Qixing posted 3 patches 4 weeks ago
Posted by Zheng Qixing 4 weeks ago
From: Zheng Qixing <zhengqixing@huawei.com>

When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs.

T1:                               T2:
                                  blkg_destroy
                                  kill(&blkg->refcnt) // blkg->refcnt=1->0
                                  blkg_release // call_rcu(__blkg_release)
                                  ...
                                  blkg_free_workfn
                                  ->pd_free_fn(pd)
elv_iosched_store
elevator_switch
...
iterate blkg list
blkg_get(blkg) // blkg->refcnt=0->1
                                  list_del_init(&blkg->q_node)
blkg_put(pinned_blkg) // blkg->refcnt=1->0
blkg_release // call_rcu again
rcu_accelerate_cbs // uaf

Fix this by replacing blkg_get() with blkg_tryget(), which fails if
the blkg's refcount has already reached zero. If blkg_tryget() fails,
skip processing this blkg since it's already being destroyed.

Link: https://lore.kernel.org/all/20260108014416.3656493-4-zhengqixing@huaweicloud.com/
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-cgroup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 600f8c5843ea..5dbc107eec53 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1622,9 +1622,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 			 * GFP_NOWAIT failed.  Free the existing one and
 			 * prealloc for @blkg w/ GFP_KERNEL.
 			 */
+			if (!blkg_tryget(blkg))
+				continue;
 			if (pinned_blkg)
 				blkg_put(pinned_blkg);
-			blkg_get(blkg);
 			pinned_blkg = blkg;
 
 			spin_unlock_irq(&q->queue_lock);
-- 
2.39.2
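[Editor's note: a userspace model of the refcount semantics the fix relies on. The names (`struct ref`, `ref_get`, `ref_tryget`, `ref_put`) are hypothetical; the real blkg refcount is a percpu_ref, and blkg_tryget() maps to percpu_ref_tryget(). This is a sketch of the semantics, not kernel code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal model: blkg_get() blindly increments, so it can "revive" a
 * refcount that already hit zero (the bug). blkg_tryget() increments
 * only while the count is still nonzero, like atomic_inc_not_zero(). */
struct ref { atomic_int cnt; };

static void ref_get(struct ref *r)          /* models blkg_get() */
{
	atomic_fetch_add(&r->cnt, 1);
}

static bool ref_tryget(struct ref *r)       /* models blkg_tryget() */
{
	int c = atomic_load(&r->cnt);

	while (c != 0) {
		/* on failure, c is reloaded with the current value */
		if (atomic_compare_exchange_weak(&r->cnt, &c, c + 1))
			return true;
	}
	return false;                       /* already dying: caller skips */
}

static bool ref_put(struct ref *r)          /* true when count hits zero */
{
	return atomic_fetch_sub(&r->cnt, 1) == 1;
}
```

With this model, the race is easy to see: after T2 drops the last reference (1->0), a blind get brings the count back to 1, so T1's later put drops it to 0 a second time and release runs twice; tryget instead fails and T1 skips the dying blkg.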
Re: [PATCH v2 2/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()
Posted by Yu Kuai 3 weeks, 5 days ago
Hi,

在 2026/1/13 14:10, Zheng Qixing 写道:
> From: Zheng Qixing <zhengqixing@huawei.com>
>
> When switching IO schedulers on a block device, blkcg_activate_policy()
> can race with concurrent blkcg deletion, leading to a use-after-free in
> rcu_accelerate_cbs.
>
> T1:                               T2:
> 		                  blkg_destroy
>                   		  kill(&blkg->refcnt) // blkg->refcnt=1->0
> 				  blkg_release // call_rcu(__blkg_release)
>                                    ...
> 				  blkg_free_workfn
>                                    ->pd_free_fn(pd)
> elv_iosched_store
> elevator_switch
> ...
> iterate blkg list
> blkg_get(blkg) // blkg->refcnt=0->1
>                                    list_del_init(&blkg->q_node)
> blkg_put(pinned_blkg) // blkg->refcnt=1->0
> blkg_release // call_rcu again
> rcu_accelerate_cbs // uaf
>
> Fix this by replacing blkg_get() with blkg_tryget(), which fails if
> the blkg's refcount has already reached zero. If blkg_tryget() fails,
> skip processing this blkg since it's already being destroyed.
>
> Link: https://lore.kernel.org/all/20260108014416.3656493-4-zhengqixing@huaweicloud.com/
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-cgroup.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 600f8c5843ea..5dbc107eec53 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1622,9 +1622,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>   			 * GFP_NOWAIT failed.  Free the existing one and
>   			 * prealloc for @blkg w/ GFP_KERNEL.
>   			 */
> +			if (!blkg_tryget(blkg))
> +				continue;

So, why is this check still before pd_alloc_fn()?

See blkg_destroy(); can you replace this with the same check:

list_for_each_entry_reverse()
	if (hlist_unhashed(&blkg->blkcg_node))
		continue;
	if (blkg->pd[pol->plid])
		continue;

>   			if (pinned_blkg)
>   				blkg_put(pinned_blkg);
> -			blkg_get(blkg);
>   			pinned_blkg = blkg;
>   
>   			spin_unlock_irq(&q->queue_lock);

-- 
Thanks,
Kuai
Re: [PATCH v2 2/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()
Posted by Zheng Qixing 3 weeks, 5 days ago
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 600f8c5843ea..5dbc107eec53 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -1622,9 +1622,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>>    			 * GFP_NOWAIT failed.  Free the existing one and
>>    			 * prealloc for @blkg w/ GFP_KERNEL.
>>    			 */
>> +			if (!blkg_tryget(blkg))
>> +				continue;
> So, why this check is still before the pd_alloc_fn()?
You mean 'after'?
> See blkg_destroy(), can you replace this by the same checking:
>
> list_for_each_entry_reverse()
> 	if (hlist_unhashed(&blkg->blkcg_node))
> 		continue;
> 	if (blkg->pd[pol->plid])
> 		continue;

This change makes sense.

This issue can be resolved either by doing blkg_tryget(blkg) before
pd_alloc_fn() or by checking blkg->blkcg_node.

To keep the behavior consistent with blkg_destroy() and blkg_destroy_all(), I will revise this in v3.

Thanks,

Qixing
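[Editor's note: a userspace model of the list-based check Yu Kuai suggests above. The type and field names (`struct fake_blkg`, `unhashed`, `pd`) are hypothetical stand-ins; in the kernel, blkg_destroy() unhashes the blkg from its blkcg list before the final put, so hlist_unhashed(&blkg->blkcg_node) identifies a dying blkg without touching its refcount.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of the two skip conditions used by blkg_destroy()/blkg_destroy_all()
 * that v3 adopts for the activation loop as well. */
struct fake_blkg {
	bool unhashed;  /* models hlist_unhashed(&blkg->blkcg_node) */
	void *pd;       /* models blkg->pd[pol->plid] */
};

/* Returns true if the blkg list walk should skip this entry. */
static bool should_skip(const struct fake_blkg *blkg)
{
	if (blkg->unhashed)     /* already being destroyed */
		return true;
	if (blkg->pd)           /* policy data already allocated */
		return true;
	return false;
}
```

The design point of the suggestion: instead of pinning a possibly dying blkg with tryget, recognize it by the same state transition blkg_destroy() already publishes, keeping the three list walkers consistent.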
Re: [PATCH v2 2/3] blk-cgroup: skip dying blkg in blkcg_activate_policy()
Posted by Michal Koutný 3 weeks, 6 days ago
On Tue, Jan 13, 2026 at 02:10:34PM +0800, Zheng Qixing <zhengqixing@huaweicloud.com> wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
> 
> When switching IO schedulers on a block device, blkcg_activate_policy()
> can race with concurrent blkcg deletion, leading to a use-after-free in
> rcu_accelerate_cbs.
> 
> T1:                               T2:
> 		                  blkg_destroy
>                  		  kill(&blkg->refcnt) // blkg->refcnt=1->0
> 				  blkg_release // call_rcu(__blkg_release)
>                                   ...
> 				  blkg_free_workfn
>                                   ->pd_free_fn(pd)
> elv_iosched_store
> elevator_switch
> ...
> iterate blkg list
> blkg_get(blkg) // blkg->refcnt=0->1
>                                   list_del_init(&blkg->q_node)
> blkg_put(pinned_blkg) // blkg->refcnt=1->0
> blkg_release // call_rcu again
> rcu_accelerate_cbs // uaf
> 
> Fix this by replacing blkg_get() with blkg_tryget(), which fails if
> the blkg's refcount has already reached zero. If blkg_tryget() fails,
> skip processing this blkg since it's already being destroyed.
> 
> Link: https://lore.kernel.org/all/20260108014416.3656493-4-zhengqixing@huaweicloud.com/
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-cgroup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>