From: Zheng Qixing <zhengqixing@huawei.com>
When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.
The use-after-free occurs in the following race:
T1 (blkcg_activate_policy):
- Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
- Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
- Enters the enomem rollback path to release blkg1 resources
T2 (blkcg deletion):
- blkcgA is deleted concurrently
- blkg1 is freed via blkg_free_workfn()
- blkg1->pd is freed
T1 (continued):
- Rollback path accesses blkg1->pd->online after pd is freed
- Triggers use-after-free
In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.
Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.
Link: https://lore.kernel.org/all/20260108014416.3656493-3-zhengqixing@huaweicloud.com/
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
---
block/blk-cgroup.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3cffb68ba5d8..600f8c5843ea 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1596,6 +1596,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
if (queue_is_mq(q))
memflags = blk_mq_freeze_queue(q);
+
+ mutex_lock(&q->blkcg_mutex);
retry:
spin_lock_irq(&q->queue_lock);
@@ -1658,6 +1660,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
spin_unlock_irq(&q->queue_lock);
out:
+ mutex_unlock(&q->blkcg_mutex);
if (queue_is_mq(q))
blk_mq_unfreeze_queue(q, memflags);
if (pinned_blkg)
--
2.39.2
Hi,
You sent this to my invalid huawei email address, so I didn't see this patch.
On 2026/1/13 14:10, Zheng Qixing wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
>
> When switching an IO scheduler on a block device, blkcg_activate_policy()
> allocates blkg_policy_data (pd) for all blkgs attached to the queue.
> However, blkcg_activate_policy() may race with concurrent blkcg deletion,
> leading to use-after-free and memory leak issues.
>
> The use-after-free occurs in the following race:
>
> T1 (blkcg_activate_policy):
> - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
> - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
> - Enters the enomem rollback path to release blkg1 resources
>
> T2 (blkcg deletion):
> - blkcgA is deleted concurrently
> - blkg1 is freed via blkg_free_workfn()
> - blkg1->pd is freed
>
> T1 (continued):
> - Rollback path accesses blkg1->pd->online after pd is freed
> - Triggers use-after-free
>
> In addition, blkg_free_workfn() frees pd before removing the blkg from
> q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
> for a blkg that is being destroyed, leaving the newly allocated pd
> unreachable when the blkg is finally freed.
>
> Fix these races by extending blkcg_mutex coverage to serialize
> blkcg_activate_policy() rollback and blkg destruction, ensuring pd
> lifecycle is synchronized with blkg list visibility.
>
> Link: https://lore.kernel.org/all/20260108014416.3656493-3-zhengqixing@huaweicloud.com/
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
> ---
> block/blk-cgroup.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 3cffb68ba5d8..600f8c5843ea 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1596,6 +1596,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>
> if (queue_is_mq(q))
> memflags = blk_mq_freeze_queue(q);
> +
> + mutex_lock(&q->blkcg_mutex);
> retry:
> spin_lock_irq(&q->queue_lock);
>
> @@ -1658,6 +1660,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>
> spin_unlock_irq(&q->queue_lock);
> out:
> + mutex_unlock(&q->blkcg_mutex);
> if (queue_is_mq(q))
> blk_mq_unfreeze_queue(q, memflags);
> if (pinned_blkg)
Can you also protect blkg_destroy_all() with blkcg_mutex as well? Then all
accesses to q->blkg_list will be protected.
--
Thanks,
Kuai
> Can you also protect blkg_destroy_all() with blkcg_mutex as well? Then all
> accesses to q->blkg_list will be protected.

Why does blkg_destroy_all() also need blkcg_mutex? After finishing
->pd_offline_fn() for blkgs and scheduling blkg_free_workfn() in
blkg_destroy(), blkg_destroy_all() clears the corresponding policy bit in
q->blkcg_pols to avoid duplicate policy teardown in blkcg_deactivate_policy().
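For context, the relevant tail of blkg_destroy_all() is roughly the following
(paraphrased from my reading of block/blk-cgroup.c, simplified and not quoted
verbatim):

	/*
	 * Every blkg on this queue has gone through blkg_destroy() above,
	 * so mark each registered policy as already deactivated; this is
	 * what keeps blkcg_deactivate_policy() from tearing it down again.
	 */
	for (i = 0; i < BLKCG_MAX_POLS; i++) {
		struct blkcg_policy *pol = blkcg_policy[i];

		if (pol)
			__clear_bit(pol->plid, q->blkcg_pols);
	}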
On Tue, Jan 13, 2026 at 02:10:33PM +0800, Zheng Qixing <zhengqixing@huaweicloud.com> wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
>
> When switching an IO scheduler on a block device, blkcg_activate_policy()
> allocates blkg_policy_data (pd) for all blkgs attached to the queue.
> However, blkcg_activate_policy() may race with concurrent blkcg deletion,
> leading to use-after-free and memory leak issues.
>
> The use-after-free occurs in the following race:
>
> T1 (blkcg_activate_policy):
> - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
> - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
> - Enters the enomem rollback path to release blkg1 resources
>
> T2 (blkcg deletion):
> - blkcgA is deleted concurrently
> - blkg1 is freed via blkg_free_workfn()
> - blkg1->pd is freed
>
> T1 (continued):
> - Rollback path accesses blkg1->pd->online after pd is freed
The rollback path is under q->queue_lock, just like the list removal in
blkg_free_workfn().
Why is queue_lock not enough for synchronization in this case?
(BTW have you observed this case "naturally" or have you injected the
memory allocation failure?)
> - Triggers use-after-free
>
> In addition, blkg_free_workfn() frees pd before removing the blkg from
> q->blkg_list.
Yeah, this looks weirdly reversed.
> This allows blkcg_activate_policy() to allocate a new pd
> for a blkg that is being destroyed, leaving the newly allocated pd
> unreachable when the blkg is finally freed.
>
> Fix these races by extending blkcg_mutex coverage to serialize
> blkcg_activate_policy() rollback and blkg destruction, ensuring pd
> lifecycle is synchronized with blkg list visibility.
>
> Link: https://lore.kernel.org/all/20260108014416.3656493-3-zhengqixing@huaweicloud.com/
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Thanks,
Michal
Resend...
blkcg_activate_policy()                  blkg_free_workfn()
-----------------------                  ------------------
spin_lock(&q->queue_lock)
...
if (!pd) {
    spin_unlock(&q->queue_lock)
    ...
    goto enomem
}

enomem:
spin_lock(&q->queue_lock)
if (pd) {
                                         ->pd_free_fn() // pd freed
    pd->online // uaf
    ...
}
                                         spin_lock(&q->queue_lock)
                                         list_del_init(&blkg->q_node)
                                         spin_unlock(&q->queue_lock)
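The rollback path in question is the enomem label in blkcg_activate_policy(),
roughly (paraphrased and simplified, not quoted verbatim):

enomem:
	/* alloc failed, take down the pds allocated so far */
	spin_lock_irq(&q->queue_lock);
	list_for_each_entry(blkg, &q->blkg_list, q_node) {
		struct blkg_policy_data *pd;

		spin_lock(&blkg->blkcg->lock);
		pd = blkg->pd[pol->plid];
		if (pd) {
			/* pd may already have been freed by blkg_free_workfn() */
			if (pd->online && pol->pd_offline_fn)
				pol->pd_offline_fn(pd);
			pd->online = false;
			pol->pd_free_fn(pd);
			blkg->pd[pol->plid] = NULL;
		}
		spin_unlock(&blkg->blkcg->lock);
	}
	spin_unlock_irq(&q->queue_lock);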
On 2026/1/14 18:40, Michal Koutný wrote:
> On Tue, Jan 13, 2026 at 02:10:33PM +0800, Zheng Qixing <zhengqixing@huaweicloud.com> wrote:
>> From: Zheng Qixing <zhengqixing@huawei.com>
>>
>> When switching an IO scheduler on a block device, blkcg_activate_policy()
>> allocates blkg_policy_data (pd) for all blkgs attached to the queue.
>> However, blkcg_activate_policy() may race with concurrent blkcg deletion,
>> leading to use-after-free and memory leak issues.
>>
>> The use-after-free occurs in the following race:
>>
>> T1 (blkcg_activate_policy):
>> - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
>> - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
>> - Enters the enomem rollback path to release blkg1 resources
>>
>> T2 (blkcg deletion):
>> - blkcgA is deleted concurrently
>> - blkg1 is freed via blkg_free_workfn()
>> - blkg1->pd is freed
>>
>> T1 (continued):
>> - Rollback path accesses blkg1->pd->online after pd is freed
> The rollback path is under q->queue_lock, just like the list removal in
> blkg_free_workfn().
> Why is queue_lock not enough for synchronization in this case?
>
> (BTW have you observed this case "naturally" or have you injected the
> memory allocation failure?)
>
Yes, this issue was discovered by injecting memory allocation failure at
->pd_alloc_fn(..., GFP_KERNEL) in blkcg_activate_policy().
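(The injection itself was just a local debug hack, e.g. something like the
following in blkcg_activate_policy(); the fail_pd_alloc toggle is hypothetical
and only meant to illustrate where the failure was forced:)

	/* hypothetical debug-only failure injection, not part of this patch */
	if (fail_pd_alloc)
		pd = NULL;
	else
		pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_KERNEL);
	if (!pd)
		goto enomem;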
In blkg_free_workfn(), q->queue_lock only protects the
list_del_init(&blkg->q_node). However, ->pd_free_fn() is called before
list_del_init(), meaning the pd is already freed before the blkg is removed
from the queue's list.
blkcg_activate_policy()                  blkg_free_workfn()
-----------------------                  ------------------
spin_lock(&q->queue_lock)
...
if (!pd) {
    spin_unlock(&q->queue_lock)
    ...
    goto enomem
}

enomem:
spin_lock(&q->queue_lock)
if (pd) {
                                         ->pd_free_fn() // pd freed
    pd->online // uaf
    ...
}
                                         spin_lock(&q->queue_lock)
                                         list_del_init(&blkg->q_node)
                                         spin_unlock(&q->queue_lock)
>> - Triggers use-after-free
>>
>> In addition, blkg_free_workfn() frees pd before removing the blkg from
>> q->blkg_list.
> Yeah, this looks weirdly reversed.
Commit f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from
blkg_free_workfn() and blkcg_deactivate_policy()") delays
list_del_init(&blkg->q_node) until after pd_free_fn() in
blkg_free_workfn(). This keeps blkgs visible in the queue list during
policy deactivation, preventing parent policy data from being freed
before child policy data and avoiding use-after-free.
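For reference, the resulting ordering in blkg_free_workfn() is roughly
(paraphrased and simplified):

	mutex_lock(&q->blkcg_mutex);
	/* the pds are freed first ... */
	for (i = 0; i < BLKCG_MAX_POLS; i++)
		if (blkg->pd[i])
			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
	...
	/* ... and only afterwards is the blkg unlinked from q->blkg_list */
	spin_lock_irq(&q->queue_lock);
	list_del_init(&blkg->q_node);
	spin_unlock_irq(&q->queue_lock);
	mutex_unlock(&q->blkcg_mutex);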
Kind Regards,
Qixing
On Thu, Jan 15, 2026 at 11:27:47AM +0800, Zheng Qixing <zhengqixing@huaweicloud.com> wrote:
> Yes, this issue was discovered by injecting memory allocation failure at
> ->pd_alloc_fn(..., GFP_KERNEL) in blkcg_activate_policy().
Fair enough.
> Commit f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from
> blkg_free_workfn() and blkcg_deactivate_policy()") delays
> list_del_init(&blkg->q_node) until after pd_free_fn() in blkg_free_workfn().
IIUC, the point was to delay it from blkg_destroy() until blkg_free_workfn(),
but then inside blkg_free_workfn() it may have gone too far, calling the
pd_free_fn()s before the actual list removal.
(I'm Cc'ing Kuai's correct address now.)
IOW, I'm wondering whether a mere swap of these two actions (pd_free_fn
and list removal) wouldn't be a sufficient fix for the discovered issue
(instead of expanding lock coverage).
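I.e., something along these lines in blkg_free_workfn() (an untested sketch of
that reordering, not a claim that it preserves all the orderings f1c006f1c685
relies on):

	mutex_lock(&q->blkcg_mutex);
	/* unlink the blkg first, so blkcg_activate_policy() can no longer
	 * find it on q->blkg_list and allocate a new pd for it ... */
	spin_lock_irq(&q->queue_lock);
	list_del_init(&blkg->q_node);
	spin_unlock_irq(&q->queue_lock);
	/* ... and only then free the pds */
	for (i = 0; i < BLKCG_MAX_POLS; i++)
		if (blkg->pd[i])
			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
	mutex_unlock(&q->blkcg_mutex);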
Thanks,
Michal