From: Chen Ridong <chenridong@huawei.com>
A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting and unmounting the perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related cases:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:

CPU0                                    CPU1
mount perf_event                        umount net_prio
cgroup1_get_tree                        cgroup_kill_sb
rebind_subsystems                       // root destruction enqueues
                                        // cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
                                        css_free_rwork_fn
                                        cgroup_destroy_root
                                        cgroup_lock_and_drain_offline
                                        // some perf descendants are dying
                                        // cgroup_destroy_wq max_active = 1
                                        // waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
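The deadlock pattern is generic to max_active=1 workqueues: a work item
that sleeps waiting for a later item on the same queue can never be woken.
A minimal, self-contained sketch of that pattern (module and symbol names
such as demo_wq, first_fn, second_fn are hypothetical, not part of this
patch; do not load this on a machine you care about, it hangs a worker by
design):

	#include <linux/module.h>
	#include <linux/workqueue.h>
	#include <linux/completion.h>

	static struct workqueue_struct *demo_wq;
	static struct work_struct first_work, second_work;
	static DECLARE_COMPLETION(second_done);

	static void second_fn(struct work_struct *work)
	{
		complete(&second_done);		/* never reached */
	}

	static void first_fn(struct work_struct *work)
	{
		/* Queued behind us on the same CPU; max_active=1 keeps it
		 * inactive until we return. */
		queue_work(demo_wq, &second_work);
		/* Sleeps forever: second_fn cannot start while we are in
		 * flight, mirroring cgroup_destroy_root waiting on css A. */
		wait_for_completion(&second_done);
	}

	static int __init demo_init(void)
	{
		/* max_active = 1, like cgroup_destroy_wq */
		demo_wq = alloc_workqueue("demo_destroy", 0, 1);
		if (!demo_wq)
			return -ENOMEM;
		INIT_WORK(&first_work, first_fn);
		INIT_WORK(&second_work, second_fn);
		queue_work(demo_wq, &first_work);
		return 0;
	}
	module_init(demo_init);
	MODULE_LICENSE("GPL");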
Solution:
Move cgroup_lock_and_drain_offline() to the start of unmount operations.
This ensures:
1. cgroup_lock_and_drain_offline() will not be called within
cgroup_destroy_wq context.
2. No new dying csses for the subsystem being unmounted can appear in
cgrp_dfl_root between unmount start and subsystem rebinding.
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/cgroup/cgroup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..af81a90f8c92 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1346,8 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	trace_cgroup_destroy_root(root);
 
-	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
-
+	cgroup_lock();
 	BUG_ON(atomic_read(&root->nr_cgrps));
 	BUG_ON(!list_empty(&cgrp->self.children));
 
@@ -2336,6 +2335,8 @@ static void cgroup_kill_sb(struct super_block *sb)
 	 *
 	 * And don't kill the default root.
 	 */
+	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
+	cgroup_unlock();
 	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
 	    !percpu_ref_is_dying(&root->cgrp.self.refcnt))
 		percpu_ref_kill(&root->cgrp.self.refcnt);
--
2.34.1
On 2025/7/22 19:27, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A hung task can occur during LTP cgroup testing [1] when repeatedly
> mounting and unmounting the perf_event and net_prio controllers with
> systemd.unified_cgroup_hierarchy=1. The hang manifests in
> cgroup_lock_and_drain_offline() during root destruction.
>
> Related cases:
> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
[...]

This can be easily reproduced by the following program (offered by Yingjie):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

#define LOOPS 10000
#define TMPDIR "/tmp"
#define CGROUP_BASE "ltp_cgtest"
#define CONTROLLERS_COUNT 2

const char *controllers[CONTROLLERS_COUNT] = {"perf_event", "net_prio"};

void safe_mkdir(const char *path)
{
	if (mkdir(path, 0777) == -1 && errno != EEXIST) {
		fprintf(stderr, "mkdir(%s) failed: %s\n", path, strerror(errno));
		exit(1);
	}
}

void safe_rmdir(const char *path)
{
	if (rmdir(path) == -1 && errno != ENOENT)
		fprintf(stderr, "rmdir(%s) failed: %s\n", path, strerror(errno));
}

int is_mounted(const char *mnt)
{
	FILE *fp = fopen("/proc/mounts", "r");
	if (!fp)
		return 0;
	char line[512];
	int found = 0;
	while (fgets(line, sizeof(line), fp)) {
		if (strstr(line, mnt)) {
			found = 1;
			break;
		}
	}
	fclose(fp);
	return found;
}

int main(void)
{
	if (getuid() != 0) {
		fprintf(stderr, "This program must be run as root\n");
		return 1;
	}

	FILE *fcg = fopen("/proc/cgroups", "r");
	if (!fcg) {
		fprintf(stderr, "Kernel does not support cgroups\n");
		return 1;
	}
	fclose(fcg);

	char mnt[256];
	for (int i = 1; i <= LOOPS; ++i) {
		for (int c = 0; c < CONTROLLERS_COUNT; ++c) {
			snprintf(mnt, sizeof(mnt), "%s/cgroup_%s", TMPDIR, controllers[c]);
			printf("=== Loop %d: %s ===\n", i, controllers[c]);
			fflush(stdout);

			safe_mkdir(mnt);
			if (!is_mounted(mnt)) {
				if (mount(controllers[c], mnt, "cgroup", 0, controllers[c]) == -1) {
					fprintf(stderr, "Failed to mount cgroup v1 for %s at %s: %s\n",
						controllers[c], mnt, strerror(errno));
					safe_rmdir(mnt);
					continue;
				}
			}
			if (umount2(mnt, 0) == -1)
				fprintf(stderr, "umount(%s) failed: %s\n", mnt, strerror(errno));
			safe_rmdir(mnt);
		}
	}
	return 0;
}
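A usage note on the reproducer above: assuming a typical build setup, it
compiles with plain gcc (e.g. gcc -O2 -o cgroup_repro repro.c, file name
hypothetical) and must run as root on a kernel booted with
systemd.unified_cgroup_hierarchy=1; per the thread, the hang shows up within
the LOOPS iterations, with the stuck worker visible in hung-task reports in
cgroup_lock_and_drain_offline().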
Hi Ridong.

On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> CPU0                                    CPU1
> mount perf_event                        umount net_prio
> cgroup1_get_tree                        cgroup_kill_sb
> rebind_subsystems                       // root destruction enqueues
[...]
> Problem scenario:
> 1. CPU0 mounts perf_event (rebind_subsystems)
> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> 3. A dying perf_event CSS gets queued for offline after root destruction
> 4. Root destruction waits for offline completion, but offline work is
>    blocked behind root destruction in cgroup_destroy_wq (max_active=1)

What's concerning me is why umount of the net_prio hierarchy waits for
draining of the default hierarchy? (Where you then run into conflict with
perf_event, which is implicit_on_dfl.)

IOW why not this:
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	trace_cgroup_destroy_root(root);
 
-	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
+	cgroup_lock_and_drain_offline(cgrp);
 
 	BUG_ON(atomic_read(&root->nr_cgrps));
 	BUG_ON(!list_empty(&cgrp->self.children));

Does this correct the LTP scenario?

Thanks,
Michal
On 2025/7/24 21:35, Michal Koutný wrote:
[...]
> What's concerning me is why umount of the net_prio hierarchy waits for
> draining of the default hierarchy? (Where you then run into conflict with
> perf_event, which is implicit_on_dfl.)

That was also my first response.

> IOW why not this:
> -	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
> +	cgroup_lock_and_drain_offline(cgrp);
>
> Does this correct the LTP scenario?

I've tested this approach and discovered it can lead to another issue that
required significant investigation. This helped me understand why unmounting
the net_prio hierarchy needs to wait for draining of the default hierarchy.

Consider this sequence:

mount net_prio                          umount perf_event
cgroup1_get_tree
// &cgrp_dfl_root.cgrp
cgroup_lock_and_drain_offline
// wait for all perf_event csses dead
prepare_to_wait(&dsct->offline_waitq)
schedule();
                                        cgroup_destroy_root
                                        // &root->cgrp, not cgrp_dfl_root
                                        cgroup_lock_and_drain_offline
                                        rebind_subsystems
                                        rcu_assign_pointer(dcgrp->subsys[ssid], css);
                                        dst_root->subsys_mask |= 1 << ssid;
                                        cgroup_propagate_control
                                        // enable cgrp_dfl_root perf_event css
                                        cgroup_apply_control_enable
                                        css = cgroup_css(dsct, ss);
                                        // since we drain root->cgrp not cgrp_dfl_root
                                        // css (dying) is not NULL on the cgrp_dfl_root
                                        // we won't create a css, but the css is dying
// got the offline_waitq wake up
goto restart;
// some perf_event dying csses are online now
prepare_to_wait(&dsct->offline_waitq)
schedule();
// never get the offline_waitq wake up

I encountered two main issues:
1. Dying csses on cgrp_dfl_root may be brought back online when rebinding the
   subsystem to cgrp_dfl_root.
2. Potential hangs during cgrp_dfl_root draining in the mounting process.

I have tried calling cgroup_lock_and_drain_offline with the subtree_ss_mask,
and it seems that can fix the issue I encountered. But I am not sure whether
there are scenarios that [u]mount multiple legacy subsystems at the same
time.

I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as
it requires that offline css work (which cgroup_destroy_root needs to drain)
cannot be enqueued after cgroup_destroy_root begins. How can we guarantee
this ordering? Therefore, I propose moving the draining operation outside of
cgroup_destroy_wq as a more robust solution that would completely eliminate
this potential race condition. This patch implements that approach.

Best regards,
Ridong
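For reference, the restart loop Ridong describes, abridged from
cgroup_lock_and_drain_offline() in kernel/cgroup/cgroup.c (a sketch; exact
code differs across kernel versions):

	void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
	{
		struct cgroup *dsct;
		struct cgroup_subsys_state *d_css;
		struct cgroup_subsys *ss;
		int ssid;

	restart:
		cgroup_lock();

		cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
			for_each_subsys(ss, ssid) {
				struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
				DEFINE_WAIT(wait);

				/* only dying csses are worth waiting for */
				if (!css || !percpu_ref_is_dying(&css->refcnt))
					continue;

				cgroup_get_live(dsct);
				prepare_to_wait(&dsct->offline_waitq, &wait,
						TASK_UNINTERRUPTIBLE);

				cgroup_unlock();
				schedule();	/* woken by offline_css() */
				finish_wait(&dsct->offline_waitq, &wait);
				cgroup_put(dsct);

				goto restart;	/* re-scan from scratch */
			}
		}
	}

The "goto restart" is why a dying css that gets revived by rebind_subsystems
leaves the waiter asleep forever: after the wakeup it re-scans, still finds a
dying css, and goes back to sleep with no further wakeup coming.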
On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
[...]
> Consider this sequence:
>
> mount net_prio                          umount perf_event
> cgroup1_get_tree
> // &cgrp_dfl_root.cgrp
> cgroup_lock_and_drain_offline
> // wait for all perf_event csses dead
> prepare_to_wait(&dsct->offline_waitq)
> schedule();
>                                         cgroup_destroy_root
>                                         // &root->cgrp, not cgrp_dfl_root
>                                         cgroup_lock_and_drain_offline

perf_event's css (offline but dying)

>                                         rebind_subsystems
[...]
>                                         // since we drain root->cgrp not cgrp_dfl_root
>                                         // css (dying) is not NULL on the cgrp_dfl_root
>                                         // we won't create a css, but the css is dying

What would prevent seeing a dying css when cgrp_dfl_root is drained?
(Or nothing drained as in the patch?)

I assume you've seen this warning from cgroup_apply_control_enable
	WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?

> I encountered two main issues:
> 1. Dying csses on cgrp_dfl_root may be brought back online when rebinding
>    the subsystem to cgrp_dfl_root.

Is this really resolved by the patch? (The questions above.)

> 2. Potential hangs during cgrp_dfl_root draining in the mounting process.

Fortunately, the typical use case (mounting at boot) wouldn't suffer from
this.

> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky,
> as it requires that offline css work (which cgroup_destroy_root needs to
> drain) cannot be enqueued after cgroup_destroy_root begins.

This is a valid point.

> How can we guarantee this ordering? Therefore, I propose moving the
> draining operation outside of cgroup_destroy_wq as a more robust solution
> that would completely eliminate this potential race condition. This patch
> implements that approach.

I acknowledge the issue (although rare in the real world). Some entity will
always have to wait for the offlining. It may be OK in cgroup_kill_sb
(ideally, if this was bound to the process context of the umount caller; not
sure if that's how kill_sb works).
I slightly dislike the form of an empty lock/unlock, which makes me wonder
if this is the best solution.

Let me think more about this...

Thanks,
Michal
On 2025/7/26 1:17, Michal Koutný wrote:
[...]
> I acknowledge the issue (although rare in the real world). Some entity
> will always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to the process context of the umount caller;
> not sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock, which makes me wonder
> if this is the best solution.
>
> Let me think more about this...

I've just submitted v3, which I believe provides an improved solution to the
issue we've been discussing. The v3 version is available here:
https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u

I'd be truly grateful if you could spare some time to review it.

--
Best regards,
Ridong
On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
[...]
> > What's concerning me is why umount of the net_prio hierarchy waits for
> > draining of the default hierarchy? (Where you then run into conflict
> > with perf_event, which is implicit_on_dfl.)

/*
 * cgroup destruction makes heavy use of work items and there can be a lot
 * of concurrent destructions.  Use a separate workqueue so that cgroup
 * destruction work items don't end up filling up max_active of system_wq
 * which may lead to deadlock.
 */

If the task hang can be reliably reproduced, it is the right time to cut the
max_active cap off cgroup_destroy_wq, according to its comment.
On 2025/8/15 10:40, Hillf Danton wrote:
[...]
> If the task hang can be reliably reproduced, it is the right time to cut
> the max_active cap off cgroup_destroy_wq, according to its comment.

Hi Danton,

Thank you for your feedback.

While modifying max_active could be a viable solution, I'm unsure whether it
might introduce other side effects. Instead, I've proposed an alternative
approach in v3 of the patch, which I believe addresses the issue more
comprehensively. I'd be very grateful if you could take a look and share your
thoughts. Your review would be greatly appreciated!

v3: https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u

--
Best regards,
Ridong
On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
[...]
> While modifying max_active could be a viable solution, I'm unsure whether
> it might introduce other side effects. Instead, I've proposed an
> alternative approach in v3 of the patch, which I believe addresses the
> issue more comprehensively.

Given your reproducer [1], it is simple to test with the max_active cap cut.

I do not think v3 is a correct fix, frankly, because it leaves the root cause
intact. Nor is it cgroup specific, even given high concurrency in
destruction.

[1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
On 2025/8/15 18:02, Hillf Danton wrote:
[...]
> Given your reproducer [1], it is simple to test with the max_active cap
> cut.
>
> I do not think v3 is a correct fix, frankly, because it leaves the root
> cause intact. Nor is it cgroup specific, even given high concurrency in
> destruction.

Hi Danton,

Thank you for your reply.

To clarify, when you mentioned "cut max_active off", did you mean setting
max_active of cgroup_destroy_wq to 1? Note that cgroup_destroy_wq already has
max_active=1 as initialized:

	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);

The v3 changes prevent subsystem root destruction from being blocked by
unrelated subsystem offline events. Since root destruction should only
proceed after all descendants are destroyed, it shouldn't be blocked by
children's offline events. My testing with the reproducer confirms this
fixes the issue I encountered.

--
Best regards,
Ridong
On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
>
> To clarify, when you mentioned "cut max_active off", did you mean setting
> max_active of cgroup_destroy_wq to 1?

	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
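For context: passing 0 as max_active to alloc_workqueue() does not mean "no
concurrency", it selects the default cap WQ_DFL_ACTIVE (256 on many kernel
versions). A hedged sketch of the two variants under discussion (function
name cgroup_wq_init is hypothetical, standing in for cgroup_wq_init() in
kernel/cgroup/cgroup.c):

	#include <linux/workqueue.h>

	static struct workqueue_struct *cgroup_destroy_wq;

	static int __init cgroup_wq_init(void)
	{
		/* Today: at most one destruction work item in flight per
		 * CPU; css offline work queues behind a blocked root
		 * destruction, which is the hang in this thread. */
		cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);

		/* Hillf's suggestion: max_active = 0 selects WQ_DFL_ACTIVE,
		 * so offline items can run concurrently with a waiting
		 * root-destruction item:
		 *
		 *	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
		 */
		return cgroup_destroy_wq ? 0 : -ENOMEM;
	}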
On 2025/8/15 19:54, Hillf Danton wrote:
> On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
> >
> > To clarify, when you mentioned "cut max_active off", did you mean
> > setting max_active of cgroup_destroy_wq to 1?
> >
> 	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);

Thank you for the additional clarification.

While this modification is functional, I'm concerned it might spawn a
significant number of cgroup destruction tasks, most of which would contend
for cgroup_mutex. Could this waste system resources?

--
Best regards,
Ridong
On Sat, 16 Aug 2025 08:33:55 +0800 Chen Ridong wrote:
[...]
> While this modification is functional, I'm concerned it might spawn a
> significant number of cgroup destruction tasks, most of which would
> contend for cgroup_mutex. Could this waste system resources?

No win comes from nothing, so you need to pay $.02, for example, at least
for curing the task hang by erasing the root cause.
> I acknowledge the issue (although rare in the real world). Some entity
> will always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to the process context of the umount caller;
> not sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock, which makes me wonder
> if this is the best solution.
>
> Let me think more about this...
>
> Thanks,
> Michal

Hi Michal,

Have you come up with a better solution for this?
Would appreciate your thoughts when you have time.

Best regards,
Ridong
Hi Ridong.

On Thu, Jul 31, 2025 at 07:53:02PM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> Have you come up with a better solution for this?
> Would appreciate your thoughts when you have time.

Sorry for taking so long. (Also expect my next response here may be slow.)
I tried reproducing it with the described LTP tests [1] (to get a better
idea about what and why needs to be offlined) but I cannot bring it to a
hang nor a lockdep report. How do you launch the particular LTP tests to
trigger this?

Thanks,
Michal

[1]
while true ; do
	/opt/ltp/testcases/bin/cgroup_fj_function.sh net_cls $pp
	/opt/ltp/testcases/bin/cgroup_fj_function.sh perf_event
done
(with pp both `;` or `&` for concurrent runs, two vCPUs)
On 2025/8/14 23:17, Michal Koutný wrote:
[...]
> I tried reproducing it with the described LTP tests [1] (to get a better
> idea about what and why needs to be offlined) but I cannot bring it to a
> hang nor a lockdep report. How do you launch the particular LTP tests to
> trigger this?

Hi Michal,

I've provided details on how to reproduce the issue; it's quite
straightforward. For your reference, here's the link to the discussion:
https://lore.kernel.org/cgroups/btaaerpdl3bolxbysbqcig6kiccdgsoo32td64sk6yo4m5l5zy@nds6s35p6e6w/T/#m01f1229f2e84e1186abaf04378b4c0f47151f7e4

Let me know if you need further clarification.

--
Best regards,
Ridong
On 2025/7/26 1:17, Michal Koutný wrote:
[...]
> What would prevent seeing a dying css when cgrp_dfl_root is drained?
> (Or nothing drained as in the patch?)
>
> I assume you've seen this warning from cgroup_apply_control_enable
> 	WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?

Yes, I have seen that warning.

Draining the cgrp_dfl_root can prevent seeing the dying css.

Q: When can a task waiting on offline_waitq be woken up?
A: When offline_css() is invoked, which does:
	RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);

If we drain the cgrp_dfl_root, the drain traverses all the csses. That means
cgroup_lock_and_drain_offline() can only return when all the dying csses
have disappeared, thus preventing seeing a dying css.

> > I encountered two main issues:
> > 1. Dying csses on cgrp_dfl_root may be brought back online when
> >    rebinding the subsystem to cgrp_dfl_root.
>
> Is this really resolved by the patch? (The questions above.)
[...]
> I acknowledge the issue (although rare in the real world). Some entity
> will always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to the process context of the umount caller;
> not sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock, which makes me wonder
> if this is the best solution.

Thank you, I'd appreciate it if you could suggest a better solution.

Thanks,
Ridong
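For completeness, the wakeup side that pairs with the drain loop, abridged
from offline_css() in kernel/cgroup/cgroup.c (a sketch; details vary by
kernel version):

	static void offline_css(struct cgroup_subsys_state *css)
	{
		struct cgroup_subsys *ss = css->ss;

		if (!(css->flags & CSS_ONLINE))
			return;

		if (ss->css_offline)
			ss->css_offline(css);

		css->flags &= ~CSS_ONLINE;
		/* The line Ridong cites: the dying css disappears from the
		 * cgroup's subsys[] array, so a re-scan of the drain loop
		 * no longer finds it... */
		RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);

		/* ...and waiters on offline_waitq are woken to re-scan. */
		wake_up_all(&css->cgroup->offline_waitq);
	}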