From: Zheng Qixing <zhengqixing@huawei.com>
When switching IO schedulers on a block device (e.g., loop0),
blkcg_activate_policy() is called to allocate blkg_policy_data (pd)
for all blkgs associated with that device's request queue.
However, a race condition exists between blkcg_activate_policy() and
concurrent blkcg deletion that leads to a use-after-free:
T1 (blkcg_activate_policy):
- Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
- Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
- Enters the enomem error path to roll back blkg1's resources
T2 (blkcg deletion):
- blkcgA is being deleted concurrently
- blkg1 is freed via blkg_free_workfn()
- blkg1->pd is freed
T1 (continued):
- In the rollback path, accesses pd->online after blkg1->pd
has been freed
- Triggers use-after-free
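The pd->online read happens in the rollback loop; a minimal sketch of
that loop, modeled on the mainline code behind blkcg_policy_teardown_pds()
(simplified, per-blkcg locking elided):

	list_for_each_entry(blkg, &q->blkg_list, q_node) {
		struct blkg_policy_data *pd = blkg->pd[pol->plid];

		if (pd) {
			/* UAF: pd may already be freed by blkg_free_workfn() */
			if (pd->online && pol->pd_offline_fn)
				pol->pd_offline_fn(pd);
			pd->online = false;
			pol->pd_free_fn(pd);
			blkg->pd[pol->plid] = NULL;
		}
	}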
The issue occurs because blkcg_activate_policy() does not protect
against concurrent blkg freeing in the error rollback path. The call
trace is as follows:
==================================================================
BUG: KASAN: slab-use-after-free in blkcg_activate_policy+0x516/0x5f0
Read of size 1 at addr ffff88802e1bc00c by task sh/7357
CPU: 1 PID: 7357 Comm: sh Tainted: G OE 6.6.0+ #1
Call Trace:
<TASK>
blkcg_activate_policy+0x516/0x5f0
bfq_create_group_hierarchy+0x31/0x90
bfq_init_queue+0x6df/0x8e0
blk_mq_init_sched+0x290/0x3a0
elevator_switch+0x8a/0x190
elv_iosched_store+0x1f7/0x2a0
queue_attr_store+0xad/0xf0
kernfs_fop_write_iter+0x1ee/0x2e0
new_sync_write+0x154/0x260
vfs_write+0x313/0x3c0
ksys_write+0xbd/0x160
do_syscall_64+0x55/0x100
entry_SYSCALL_64_after_hwframe+0x78/0xe2
Allocated by task 7357:
bfq_pd_alloc+0x6e/0x120
blkcg_activate_policy+0x141/0x5f0
bfq_create_group_hierarchy+0x31/0x90
bfq_init_queue+0x6df/0x8e0
blk_mq_init_sched+0x290/0x3a0
elevator_switch+0x8a/0x190
elv_iosched_store+0x1f7/0x2a0
queue_attr_store+0xad/0xf0
kernfs_fop_write_iter+0x1ee/0x2e0
new_sync_write+0x154/0x260
vfs_write+0x313/0x3c0
ksys_write+0xbd/0x160
do_syscall_64+0x55/0x100
entry_SYSCALL_64_after_hwframe+0x78/0xe2
Freed by task 14318:
blkg_free_workfn+0x7f/0x200
process_one_work+0x2ef/0x5d0
worker_thread+0x38d/0x4f0
kthread+0x156/0x190
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Fix this bug by taking q->blkcg_mutex in the enomem branch of
blkcg_activate_policy().
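Holding q->blkcg_mutex in the rollback serializes it against
blkg_free_workfn(), which takes the same mutex around pd freeing,
roughly as follows (simplified from the code introduced by commit
f1c006f1c685; some steps elided):

	static void blkg_free_workfn(struct work_struct *work)
	{
		struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
						     free_work);
		struct request_queue *q = blkg->q;
		int i;

		mutex_lock(&q->blkcg_mutex);
		for (i = 0; i < BLKCG_MAX_POLS; i++)
			if (blkg->pd[i])
				blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
		...
		list_del_init(&blkg->q_node);
		mutex_unlock(&q->blkcg_mutex);
		...
	}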
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
---
block/blk-cgroup.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5e1a724a799a..af468676cad1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1693,9 +1693,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
enomem:
/* alloc failed, take down everything */
+ mutex_lock(&q->blkcg_mutex);
spin_lock_irq(&q->queue_lock);
blkcg_policy_teardown_pds(q, pol);
spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
ret = -ENOMEM;
goto out;
}
--
2.39.2
Hi,
On 2026/1/8 9:44, Zheng Qixing wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
>
> When switching IO schedulers on a block device (e.g., loop0),
> blkcg_activate_policy() is called to allocate blkg_policy_data (pd)
> for all blkgs associated with that device's request queue.
>
> However, a race condition exists between blkcg_activate_policy() and
> concurrent blkcg deletion that leads to a use-after-free:
>
> T1 (blkcg_activate_policy):
> - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
> - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
> - Enters the enomem error path to roll back blkg1's resources
>
> T2 (blkcg deletion):
> - blkcgA is being deleted concurrently
> - blkg1 is freed via blkg_free_workfn()
> - blkg1->pd is freed
>
> T1 (continued):
> - In the rollback path, accesses pd->online after blkg1->pd
> has been freed
> - Triggers use-after-free
>
> The issue occurs because blkcg_activate_policy() does not protect
> against concurrent blkg freeing in the error rollback path. The call
> trace is as follows:
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in blkcg_activate_policy+0x516/0x5f0
> Read of size 1 at addr ffff88802e1bc00c by task sh/7357
> CPU: 1 PID: 7357 Comm: sh Tainted: G OE 6.6.0+ #1
> Call Trace:
> <TASK>
> blkcg_activate_policy+0x516/0x5f0
> bfq_create_group_hierarchy+0x31/0x90
> bfq_init_queue+0x6df/0x8e0
> blk_mq_init_sched+0x290/0x3a0
> elevator_switch+0x8a/0x190
> elv_iosched_store+0x1f7/0x2a0
> queue_attr_store+0xad/0xf0
> kernfs_fop_write_iter+0x1ee/0x2e0
> new_sync_write+0x154/0x260
> vfs_write+0x313/0x3c0
> ksys_write+0xbd/0x160
> do_syscall_64+0x55/0x100
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
> Allocated by task 7357:
> bfq_pd_alloc+0x6e/0x120
> blkcg_activate_policy+0x141/0x5f0
> bfq_create_group_hierarchy+0x31/0x90
> bfq_init_queue+0x6df/0x8e0
> blk_mq_init_sched+0x290/0x3a0
> elevator_switch+0x8a/0x190
> elv_iosched_store+0x1f7/0x2a0
> queue_attr_store+0xad/0xf0
> kernfs_fop_write_iter+0x1ee/0x2e0
> new_sync_write+0x154/0x260
> vfs_write+0x313/0x3c0
> ksys_write+0xbd/0x160
> do_syscall_64+0x55/0x100
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
> Freed by task 14318:
> blkg_free_workfn+0x7f/0x200
> process_one_work+0x2ef/0x5d0
> worker_thread+0x38d/0x4f0
> kthread+0x156/0x190
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x1b/0x30
>
> Fix this bug by taking q->blkcg_mutex in the enomem branch of
> blkcg_activate_policy().
>
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
> ---
> block/blk-cgroup.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 5e1a724a799a..af468676cad1 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1693,9 +1693,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>
> enomem:
> /* alloc failed, take down everything */
> + mutex_lock(&q->blkcg_mutex);
> spin_lock_irq(&q->queue_lock);
> blkcg_policy_teardown_pds(q, pol);
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
This looks correct; however, I think it's better to also protect the
q->blkg_list iteration in blkcg_activate_policy() and blkg_destroy_all().
That way, all q->blkg_list accesses will be protected by blkcg_mutex, and
it will be easier to convert blkg protection from queue_lock to
blkcg_mutex.
> ret = -ENOMEM;
> goto out;
> }
--
Thanks,
Kuai
On 2026/1/9 0:11, Yu Kuai wrote:
> This looks correct; however, I think it's better to also protect the
> q->blkg_list iteration in blkcg_activate_policy() and blkg_destroy_all().
> That way, all q->blkg_list accesses will be protected by blkcg_mutex, and
> it will be easier to convert blkg protection from queue_lock to
> blkcg_mutex.

I tried adding blkcg_mutex protection in blkcg_activate_policy() and
blkg_destroy_all() as suggested. Unfortunately, the UAF still occurs even
with proper mutex protection.

The mutex does protect the list structure during traversal: blkgs won't be
added to or removed from q->blkg_list while we hold the lock. However, it
doesn't prevent the same blkg from being released twice.

[ 108.677948][ C0] ==================================================================
[ 108.678541][ C0] BUG: KASAN: slab-use-after-free in rcu_cblist_dequeue+0xb1/0xe0
[ 108.679117][ C0] Read of size 8 at addr ffff888108ee9e48 by task swapper/0/0
[ 108.679654][ C0]
[ 108.679827][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.0-ga7706cf69006-dirty #43
[ 108.680437][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[ 108.681125][ C0] Call Trace:
[ 108.681369][ C0]  <IRQ>
[ 108.684870][ C0]  rcu_cblist_dequeue+0xb1/0xe0
[ 108.685239][ C0]  rcu_do_batch+0x24c/0xd80
[ 108.686892][ C0]  rcu_core+0x4d1/0x7d0
[ 108.687205][ C0]  handle_softirqs+0x1ca/0x720
[ 108.687561][ C0]  irq_exit_rcu+0x141/0x1a0
[ 108.687896][ C0]  sysvec_apic_timer_interrupt+0x6e/0x90
[ 108.689218][ C0] RIP: 0010:pv_native_safe_halt+0xb/0x10
[ 108.689642][ C0] Code: 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d 97 d9 3d 00 fb f4 <e9> 50 ce 02 00 90 90 90 90 90 90 90 90 90 90 90 90 90 9b
[ 108.691075][ C0] RSP: 0018:ffffffff9cc07e00 EFLAGS: 00000206
[ 108.691537][ C0] RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffffffff9b280422
[ 108.692129][ C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff97b74d45
[ 108.692714][ C0] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed10e3602165
[ 108.693305][ C0] R10: ffff88871b010b2b R11: 0000000000000000 R12: 1ffffffff3980fc5
[ 108.693892][ C0] R13: ffffffff9cc88ec0 R14: dffffc0000000000 R15: 0000000000014810
[ 108.695293][ C0]  default_idle+0x5/0x10
[ 108.695594][ C0]  default_idle_call+0x97/0x1d0
[ 108.695942][ C0]  cpuidle_idle_call+0x1e5/0x270
[ 108.697162][ C0]  do_idle+0xef/0x150
[ 108.697454][ C0]  cpu_startup_entry+0x51/0x60
[ 108.698108][ C0]  rest_init+0x1cc/0x320
[ 108.698410][ C0]  arch_call_rest_init+0xf/0x30
[ 108.698761][ C0]  start_kernel+0x392/0x400
[ 108.699085][ C0]  x86_64_start_reservations+0x14/0x30
[ 108.699474][ C0]  x86_64_start_kernel+0x9b/0xa0
[ 108.699822][ C0]  secondary_startup_64_no_verify+0x194/0x19b
[ 108.700255][ C0]  </TASK>
[ 108.700477][ C0]
[ 108.700644][ C0] Allocated by task 1045:
[ 108.700948][ C0]  kasan_save_stack+0x1c/0x40
[ 108.701293][ C0]  kasan_set_track+0x21/0x30
[ 108.701617][ C0]  __kasan_kmalloc+0x8b/0x90
[ 108.701946][ C0]  blkg_alloc+0xbc/0x940
[ 108.702251][ C0]  blkg_create+0xcf6/0x13d0
[ 108.702576][ C0]  blkg_lookup_create+0x47b/0x810
[ 108.702935][ C0]  bio_associate_blkg_from_css+0x1a0/0x8c0
[ 108.703354][ C0]  bio_associate_blkg+0xa2/0x190
[ 108.703704][ C0]  bio_init+0x272/0x8d0
[ 108.704000][ C0]  bio_alloc_bioset+0x454/0x770
[ 108.704350][ C0]  ext4_bio_write_folio+0x68e/0x10d0
[ 108.704729][ C0]  mpage_submit_folio+0x14a/0x2b0
[ 108.705090][ C0]  mpage_process_page_bufs+0x1b1/0x390
[ 108.705492][ C0]  mpage_prepare_extent_to_map+0xa91/0x1060
[ 108.705915][ C0]  ext4_do_writepages+0x9af/0x1d60
[ 108.706288][ C0]  ext4_writepages+0x281/0x5a0
[ 108.706634][ C0]  do_writepages+0x165/0x5f0
[ 108.707057][ C0]  filemap_fdatawrite_wbc+0x111/0x170
[ 108.707450][ C0]  __filemap_fdatawrite_range+0x9d/0xd0
[ 108.707851][ C0]  file_write_and_wait_range+0x97/0x110
[ 108.708251][ C0]  ext4_sync_file+0x1fb/0xb60
[ 108.708592][ C0]  __x64_sys_fsync+0x55/0x90
[ 108.708932][ C0]  do_syscall_64+0x6b/0x120
[ 108.709262][ C0]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 108.709690][ C0]
[ 108.709867][ C0] Freed by task 338:
[ 108.710150][ C0]  kasan_save_stack+0x1c/0x40
[ 108.710496][ C0]  kasan_set_track+0x21/0x30
[ 108.710835][ C0]  kasan_save_free_info+0x27/0x40
[ 108.711203][ C0]  __kasan_slab_free+0x106/0x180
[ 108.711564][ C0]  __kmem_cache_free+0x1dd/0x470
[ 108.711923][ C0]  process_one_work+0x774/0x13a0
[ 108.712288][ C0]  worker_thread+0x6eb/0x12c0
[ 108.712631][ C0]  kthread+0x29f/0x360
[ 108.712928][ C0]  ret_from_fork+0x30/0x70
[ 108.713251][ C0]  ret_from_fork_asm+0x1b/0x30
Hi,

On 2026/1/9 14:22, Zheng Qixing wrote:
> I tried adding blkcg_mutex protection in blkcg_activate_policy() and
> blkg_destroy_all() as suggested. Unfortunately, the UAF still occurs
> even with proper mutex protection.
>
> The mutex does protect the list structure during traversal: blkgs won't
> be added to or removed from q->blkg_list while we hold the lock.
> However, it doesn't prevent the same blkg from being released twice.

I don't understand: what I suggested should already include your changes
from this patch. Can you show your changes, and make sure the
blkcg_policy_teardown_pds() call inside blkcg_activate_policy() is also
protected?

--
Thanks,
Kuai
Sorry, I replied with the wrong patch. Even after adding blkcg_mutex in
blkg_destroy_all() and blkcg_activate_policy(), the UAF issue described
in the 3rd patch (multiple calls to call_rcu releasing the same blkg)
still occurs.
For the 2nd patch, in addition to the blkcg_mutex that I initially added
in the enomem branch, it is indeed possible to take blkcg_mutex at the
other places where blkg_list is accessed. This would prevent the case
where pd_alloc_fn(..., GFP_NOWAIT) succeeds while the corresponding blkg
is being destroyed, which could otherwise lead to a memory leak. The
changes I tried are quoted below:
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 5e1a724a799a..439cafa98c92 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -573,6 +573,7 @@ static void blkg_destroy_all(struct gendisk *disk)
> int count = BLKG_DESTROY_BATCH_SIZE;
> int i;
>
> + mutex_lock(&q->blkcg_mutex);
> restart:
> spin_lock_irq(&q->queue_lock);
> list_for_each_entry(blkg, &q->blkg_list, q_node) {
> @@ -592,7 +593,9 @@ static void blkg_destroy_all(struct gendisk *disk)
> if (!(--count)) {
> count = BLKG_DESTROY_BATCH_SIZE;
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> cond_resched();
> + mutex_lock(&q->blkcg_mutex);
> goto restart;
> }
> }
> @@ -611,6 +614,7 @@ static void blkg_destroy_all(struct gendisk *disk)
>
> q->root_blkg = NULL;
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> }
>
> static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
> @@ -1621,6 +1625,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>
> if (queue_is_mq(q))
> memflags = blk_mq_freeze_queue(q);
> +
> + mutex_lock(&q->blkcg_mutex);
> retry:
> spin_lock_irq(&q->queue_lock);
>
> @@ -1682,6 +1688,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
> ret = 0;
>
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> out:
> if (queue_is_mq(q))
> blk_mq_unfreeze_queue(q, memflags);
> @@ -1696,6 +1703,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
> spin_lock_irq(&q->queue_lock);
> blkcg_policy_teardown_pds(q, pol);
> spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> ret = -ENOMEM;
> goto out;
> }