[v2] blk-cgroup: cleanup and bugfixs in blk-cgroup

[PATCH v2 1/3] blk-cgroup: fix race between policy activation and blkg destruction

Posted by Zheng Qixing 4 weeks ago

From: Zheng Qixing <zhengqixing@huawei.com>

When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
  - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
  - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
  - Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
  - blkcgA is deleted concurrently
  - blkg1 is freed via blkg_free_workfn()
  - blkg1->pd is freed

T1 (continued):
  - Rollback path accesses blkg1->pd->online after pd is freed
  - Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.

Link: https://lore.kernel.org/all/20260108014416.3656493-3-zhengqixing@huaweicloud.com/
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3cffb68ba5d8..600f8c5843ea 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1596,6 +1596,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
+
+	mutex_lock(&q->blkcg_mutex);
 retry:
 	spin_lock_irq(&q->queue_lock);
 
@@ -1658,6 +1660,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 	spin_unlock_irq(&q->queue_lock);
 out:
+	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
 	if (pinned_blkg)
-- 
2.39.2

Re: [PATCH v2 1/3] blk-cgroup: fix race between policy activation and blkg destruction

Posted by Yu Kuai 3 weeks, 5 days ago

Hi,

You are sending to my invalid huawei email address, so I didn't see this patch.

在 2026/1/13 14:10, Zheng Qixing 写道:
> From: Zheng Qixing <zhengqixing@huawei.com>
>
> When switching an IO scheduler on a block device, blkcg_activate_policy()
> allocates blkg_policy_data (pd) for all blkgs attached to the queue.
> However, blkcg_activate_policy() may race with concurrent blkcg deletion,
> leading to use-after-free and memory leak issues.
>
> The use-after-free occurs in the following race:
>
> T1 (blkcg_activate_policy):
>    - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
>    - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
>    - Enters the enomem rollback path to release blkg1 resources
>
> T2 (blkcg deletion):
>    - blkcgA is deleted concurrently
>    - blkg1 is freed via blkg_free_workfn()
>    - blkg1->pd is freed
>
> T1 (continued):
>    - Rollback path accesses blkg1->pd->online after pd is freed
>    - Triggers use-after-free
>
> In addition, blkg_free_workfn() frees pd before removing the blkg from
> q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
> for a blkg that is being destroyed, leaving the newly allocated pd
> unreachable when the blkg is finally freed.
>
> Fix these races by extending blkcg_mutex coverage to serialize
> blkcg_activate_policy() rollback and blkg destruction, ensuring pd
> lifecycle is synchronized with blkg list visibility.
>
> Link: https://lore.kernel.org/all/20260108014416.3656493-3-zhengqixing@huaweicloud.com/
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
> ---
>   block/blk-cgroup.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 3cffb68ba5d8..600f8c5843ea 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1596,6 +1596,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>   
>   	if (queue_is_mq(q))
>   		memflags = blk_mq_freeze_queue(q);
> +
> +	mutex_lock(&q->blkcg_mutex);
>   retry:
>   	spin_lock_irq(&q->queue_lock);
>   
> @@ -1658,6 +1660,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>   
>   	spin_unlock_irq(&q->queue_lock);
>   out:
> +	mutex_unlock(&q->blkcg_mutex);
>   	if (queue_is_mq(q))
>   		blk_mq_unfreeze_queue(q, memflags);
>   	if (pinned_blkg)

Can you also protect blkg_destroy_all() will blkcg_mutex as well? Then all access for q->blkg_list will
be protected.

-- 
Thansk,
Kuai

Re: [PATCH v2 1/3] blk-cgroup: fix race between policy activation and blkg destruction

Posted by Zheng Qixing 3 weeks, 5 days ago

>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 3cffb68ba5d8..600f8c5843ea 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -1596,6 +1596,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>>    
>>    	if (queue_is_mq(q))
>>    		memflags = blk_mq_freeze_queue(q);
>> +
>> +	mutex_lock(&q->blkcg_mutex);
>>    retry:
>>    	spin_lock_irq(&q->queue_lock);
>>    
>> @@ -1658,6 +1660,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>>    
>>    	spin_unlock_irq(&q->queue_lock);
>>    out:
>> +	mutex_unlock(&q->blkcg_mutex);
>>    	if (queue_is_mq(q))
>>    		blk_mq_unfreeze_queue(q, memflags);
>>    	if (pinned_blkg)
> Can you also protect blkg_destroy_all() will blkcg_mutex as well? Then all access for q->blkg_list will
> be protected.
Why does blkg_destroy_all() also need blkcg_mutex?

After finishing ->pd_offline_fn() for blkgs and scheduling 
blkg_free_workfn() in blkg_destroy(),
blkg_destroy_all() clears the corresponding policy bit in q->blkcg_pols 
to avoid duplicate policy
teardown in blkcg_deactivate_policy().

Re: [PATCH v2 1/3] blk-cgroup: fix race between policy activation and blkg destruction

Posted by Michal Koutný 3 weeks, 6 days ago

On Tue, Jan 13, 2026 at 02:10:33PM +0800, Zheng Qixing <zhengqixing@huaweicloud.com> wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
> 
> When switching an IO scheduler on a block device, blkcg_activate_policy()
> allocates blkg_policy_data (pd) for all blkgs attached to the queue.
> However, blkcg_activate_policy() may race with concurrent blkcg deletion,
> leading to use-after-free and memory leak issues.
> 
> The use-after-free occurs in the following race:
> 
> T1 (blkcg_activate_policy):
>   - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
>   - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
>   - Enters the enomem rollback path to release blkg1 resources
> 
> T2 (blkcg deletion):
>   - blkcgA is deleted concurrently
>   - blkg1 is freed via blkg_free_workfn()
>   - blkg1->pd is freed
> 
> T1 (continued):
>   - Rollback path accesses blkg1->pd->online after pd is freed

The rollback path is under q->queue_lock same like the list removal in
blkg_free_workfn().
Why is queue_lock not enough for synchronization in this case?

(BTW have you observed this case "naturally" or have you injected the
memory allocation failure?)


>   - Triggers use-after-free
> 
> In addition, blkg_free_workfn() frees pd before removing the blkg from
> q->blkg_list.

Yeah, this looks weirdly reversed.

> This allows blkcg_activate_policy() to allocate a new pd
> for a blkg that is being destroyed, leaving the newly allocated pd
> unreachable when the blkg is finally freed.
> 
> Fix these races by extending blkcg_mutex coverage to serialize
> blkcg_activate_policy() rollback and blkg destruction, ensuring pd
> lifecycle is synchronized with blkg list visibility.
> 
> Link: https://lore.kernel.org/all/20260108014416.3656493-3-zhengqixing@huaweicloud.com/
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>

Thanks,
Michal

Re: [PATCH v2 1/3] blk-cgroup: fix race between policy activation and blkg destruction

Posted by Zheng Qixing 3 weeks, 5 days ago

Resend...

blkcg_activate_policy()                blkg_free_workfn()
-------------------                    ------------------
spin_lock(&q->queue_lock)
...
if (!pd) {
     spin_unlock(&q->queue_lock)
     ...
     goto enomem
     }
enomem:
     spin_lock(&q->queue_lock)
     if (pd) {

                        ->pd_free_fn()  // pd freed

        pd->online // uaf
        ...
     }

                        spin_lock(&q->queue_lock)

                        list_del_init(&blkg->q_node)

                        spin_unlock(&q->queue_lock)

Re: [PATCH v2 1/3] blk-cgroup: fix race between policy activation and blkg destruction

Posted by Zheng Qixing 3 weeks, 5 days ago

在 2026/1/14 18:40, Michal Koutný 写道:
> On Tue, Jan 13, 2026 at 02:10:33PM +0800, Zheng Qixing <zhengqixing@huaweicloud.com> wrote:
>> From: Zheng Qixing <zhengqixing@huawei.com>
>>
>> When switching an IO scheduler on a block device, blkcg_activate_policy()
>> allocates blkg_policy_data (pd) for all blkgs attached to the queue.
>> However, blkcg_activate_policy() may race with concurrent blkcg deletion,
>> leading to use-after-free and memory leak issues.
>>
>> The use-after-free occurs in the following race:
>>
>> T1 (blkcg_activate_policy):
>>    - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
>>    - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
>>    - Enters the enomem rollback path to release blkg1 resources
>>
>> T2 (blkcg deletion):
>>    - blkcgA is deleted concurrently
>>    - blkg1 is freed via blkg_free_workfn()
>>    - blkg1->pd is freed
>>
>> T1 (continued):
>>    - Rollback path accesses blkg1->pd->online after pd is freed
> The rollback path is under q->queue_lock same like the list removal in
> blkg_free_workfn().
> Why is queue_lock not enough for synchronization in this case?
>
> (BTW have you observed this case "naturally" or have you injected the
> memory allocation failure?)
>
Yes, this issue was discovered by injecting memory allocation failure at
->pd_alloc_fn(..., GFP_KERNEL) in blkcg_activate_policy().

In blkg_free_workfn(), q->queue_lock only protects the
list_del_init(&blkg->q_node). However, ->pd_free_fn() is called before
list_del_init(), meaning the pd is already freed before the blkg is removed
from the queue's list.

     blkcg_activate_policy()                  blkg_free_workfn()
     -------------------                          ------------------
     spin_lock(&q->queue_lock)
     ...
     if (!pd) {
         spin_unlock(&q->queue_lock)
         ...
         goto enomem
     }
     enomem:
         spin_lock(&q->queue_lock)
         if (pd) {
->pd_free_fn()  // pd freed
            pd->online // uaf
         ...
         }
spin_lock(&q->queue_lock)
list_del_init(&blkg->q_node)
spin_unlock(&q->queue_lock)
>>    - Triggers use-after-free
>>
>> In addition, blkg_free_workfn() frees pd before removing the blkg from
>> q->blkg_list.
> Yeah, this looks weirdly reversed.

Commit f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from 
blkg_free_workfn() and blkcg_deactivate_policy()") delays 
list_del_init(&blkg->q_node) until after pd_free_fn() in 
blkg_free_workfn(). This keeps blkgs visible in the queue list during 
policy deactivation, preventing parent policy data from being freed 
before child policy data and avoiding use-after-free.

Kind Regards,
Qixing

Re: [PATCH v2 1/3] blk-cgroup: fix race between policy activation and blkg destruction

Posted by Michal Koutný 3 weeks, 5 days ago

On Thu, Jan 15, 2026 at 11:27:47AM +0800, Zheng Qixing <zhengqixing@huaweicloud.com> wrote:
> Yes, this issue was discovered by injecting memory allocation failure at
> ->pd_alloc_fn(..., GFP_KERNEL) in blkcg_activate_policy().

Fair enough.

> Commit f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from
> blkg_free_workfn() and blkcg_deactivate_policy()") delays
> list_del_init(&blkg->q_node) until after pd_free_fn() in blkg_free_workfn().

IIUC, the point was to delay it from blkg_destroy until blkg_free_workfn
but then inside blkg_free_workfn it may have gone too far where it calls
pd_free_fn's before actual list removal.

(I'm Cc'ing the correct Kuai's address now.)
IOW, I'm wondering whether mere swap of these two actions (pd_free_fn
and list removal) wouldn't be a sufficient fix for the discovered issue
(instead of expanding lock coverage).

Thanks,
Michal