From: Chen Ridong <chenridong@huawei.com>
A hung task can occur during LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:
CPU0                                    CPU1
mount perf_event                        umount net_prio
cgroup1_get_tree                        cgroup_kill_sb
rebind_subsystems                       // root destruction enqueues
                                        // cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
                                        css_free_rwork_fn
                                          cgroup_destroy_root
                                            cgroup_lock_and_drain_offline
                                            // some perf descendants are dying
                                            // cgroup_destroy_wq max_active = 1
                                            // waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event css's offline work is queued on cgroup_destroy_wq
   behind the root destruction work
4. Root destruction waits for that offline to complete, but the offline
   work cannot run until root destruction finishes, because
   cgroup_destroy_wq has max_active = 1 (see the sketch below)
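
To make the ordering problem concrete, below is a minimal userspace C
sketch (an analogy only, not kernel code; every name in it, such as
css_offline_work and root_destroy_work, is invented for illustration).
A single worker thread serves a FIFO queue, mirroring cgroup_destroy_wq
with max_active = 1, and the item at the head waits on an item queued
behind it:

/*
 * Hypothetical userspace analogy (not kernel code): a FIFO work queue
 * served by a single worker, like cgroup_destroy_wq with max_active = 1.
 * The first queued item waits for the second, which can never run.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define MAX_WORK 8

static void (*work[MAX_WORK])(void);
static int head, tail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;
static sem_t css_a_offlined;

static void queue_work(void (*fn)(void))
{
	pthread_mutex_lock(&qlock);
	work[tail++ % MAX_WORK] = fn;
	pthread_cond_signal(&qcond);
	pthread_mutex_unlock(&qlock);
}

/* Stands in for the offline work of the dying css A. */
static void css_offline_work(void)
{
	sem_post(&css_a_offlined);
}

/* Stands in for cgroup_destroy_root() -> cgroup_lock_and_drain_offline(). */
static void root_destroy_work(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += 2;		/* report instead of hanging forever */
	if (sem_timedwait(&css_a_offlined, &ts))
		printf("hung: waiting for work queued behind this item\n");
}

/* Single worker: one item at a time, strictly in queue order. */
static void *worker(void *arg)
{
	for (;;) {
		void (*fn)(void);

		pthread_mutex_lock(&qlock);
		while (head == tail)
			pthread_cond_wait(&qcond, &qlock);
		fn = work[head++ % MAX_WORK];
		pthread_mutex_unlock(&qlock);
		fn();
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	sem_init(&css_a_offlined, 0, 0);
	pthread_create(&t, NULL, worker, NULL);
	queue_work(root_destroy_work);	/* "root destruction" runs first... */
	queue_work(css_offline_work);	/* ...so "css A offline" cannot run */
	sleep(3);
	return 0;
}

Compiled with gcc -pthread, this prints the "hung" message after the
timeout; in the kernel there is no timeout, so the task hangs for good.
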
Solution:
Move cgroup_lock_and_drain_offline() from cgroup_destroy_root() into
cgroup_kill_sb(), at the start of the unmount path.
This ensures:
1. cgroup_lock_and_drain_offline() will not be called within
cgroup_destroy_wq context.
2. No new dying csses for the subsystem being unmounted can appear in
   cgrp_dfl_root between the start of the unmount and the subsystem
   rebinding (the second sketch below illustrates the resulting ordering).
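
In the same hypothetical analogy (reusing worker(), queue_work(),
css_offline_work() and css_a_offlined from the sketch above), the fix
corresponds to draining in the unmount caller's own context before the
destruction work is ever queued, so no item on the single-active queue
waits on another:

/* Root destruction no longer waits; the caller drained first. */
static void root_destroy_fixed(void)
{
	printf("root destroyed\n");
}

int main(void)
{
	pthread_t t;

	sem_init(&css_a_offlined, 0, 0);
	pthread_create(&t, NULL, worker, NULL);
	queue_work(css_offline_work);	/* offline work already pending */
	sem_wait(&css_a_offlined);	/* drain in the caller's context, as
					 * cgroup_kill_sb() now does */
	queue_work(root_destroy_fixed);	/* nothing left to wait for */
	sleep(1);
	return 0;
}
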
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/cgroup/cgroup.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..7a71410b350e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1346,8 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	trace_cgroup_destroy_root(root);
 
-	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
-
+	cgroup_lock();
 	BUG_ON(atomic_read(&root->nr_cgrps));
 	BUG_ON(!list_empty(&cgrp->self.children));
 
@@ -2336,6 +2335,7 @@ static void cgroup_kill_sb(struct super_block *sb)
 	 *
 	 * And don't kill the default root.
 	 */
+	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
 	    !percpu_ref_is_dying(&root->cgrp.self.refcnt))
 		percpu_ref_kill(&root->cgrp.self.refcnt);
--
2.34.1
On 2025/7/22 17:24, Chen Ridong wrote:
> [patch quoted above]

Sorry, this is a mistake, I will send a new one.

Best regards,
Ridong