From: Chen Ridong <chenridong@huawei.com>
A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting and unmounting the perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related cases:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:

CPU0                                    CPU1
mount perf_event                        umount net_prio
cgroup1_get_tree                        cgroup_kill_sb
rebind_subsystems                       // root destruction enqueues
                                        // cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
                                        css_free_rwork_fn
                                        cgroup_destroy_root
                                        cgroup_lock_and_drain_offline
                                        // some perf descendants are dying
                                        // cgroup_destroy_wq max_active = 1
                                        // waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
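The deadlock pattern is generic to max_active=1 workqueues: a work item
that sleeps waiting for a later item on the same queue can never be woken.
A minimal, self-contained sketch of that pattern (module and symbol names
such as demo_wq, first_fn, second_fn are hypothetical, not part of this
patch; do not load this on a machine you care about, it hangs a worker by
design):

	#include <linux/module.h>
	#include <linux/workqueue.h>
	#include <linux/completion.h>

	static struct workqueue_struct *demo_wq;
	static struct work_struct first_work, second_work;
	static DECLARE_COMPLETION(second_done);

	static void second_fn(struct work_struct *work)
	{
		complete(&second_done);		/* never reached */
	}

	static void first_fn(struct work_struct *work)
	{
		/* Queued behind us on the same CPU; max_active=1 keeps it
		 * inactive until we return. */
		queue_work(demo_wq, &second_work);
		/* Sleeps forever: second_fn cannot start while we are in
		 * flight, mirroring cgroup_destroy_root waiting on css A. */
		wait_for_completion(&second_done);
	}

	static int __init demo_init(void)
	{
		/* max_active = 1, like cgroup_destroy_wq */
		demo_wq = alloc_workqueue("demo_destroy", 0, 1);
		if (!demo_wq)
			return -ENOMEM;
		INIT_WORK(&first_work, first_fn);
		INIT_WORK(&second_work, second_fn);
		queue_work(demo_wq, &first_work);
		return 0;
	}
	module_init(demo_init);
	MODULE_LICENSE("GPL");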
Solution:
Move cgroup_lock_and_drain_offline() to the start of unmount operations.
This ensures:
1. cgroup_lock_and_drain_offline() will not be called within
cgroup_destroy_wq context.
2. No new dying csses for the subsystem being unmounted can appear in
cgrp_dfl_root between unmount start and subsystem rebinding.
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/cgroup/cgroup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..af81a90f8c92 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1346,8 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	trace_cgroup_destroy_root(root);
 
-	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
-
+	cgroup_lock();
 	BUG_ON(atomic_read(&root->nr_cgrps));
 	BUG_ON(!list_empty(&cgrp->self.children));
 
@@ -2336,6 +2335,8 @@ static void cgroup_kill_sb(struct super_block *sb)
 	 *
 	 * And don't kill the default root.
 	 */
+	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
+	cgroup_unlock();
 	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
 	    !percpu_ref_is_dying(&root->cgrp.self.refcnt))
 		percpu_ref_kill(&root->cgrp.self.refcnt);
--
2.34.1
On 2025/7/22 19:27, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A hung task can occur during LTP cgroup testing [1] when repeatedly
> mounting and unmounting the perf_event and net_prio controllers with
> systemd.unified_cgroup_hierarchy=1. The hang manifests in
> cgroup_lock_and_drain_offline() during root destruction.
>
> Related cases:
> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
[...]

This can be easily reproduced by the following program (offered by Yingjie):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

#define LOOPS 10000
#define TMPDIR "/tmp"
#define CGROUP_BASE "ltp_cgtest"
#define CONTROLLERS_COUNT 2

const char *controllers[CONTROLLERS_COUNT] = {"perf_event", "net_prio"};

void safe_mkdir(const char *path)
{
	if (mkdir(path, 0777) == -1 && errno != EEXIST) {
		fprintf(stderr, "mkdir(%s) failed: %s\n", path, strerror(errno));
		exit(1);
	}
}

void safe_rmdir(const char *path)
{
	if (rmdir(path) == -1 && errno != ENOENT)
		fprintf(stderr, "rmdir(%s) failed: %s\n", path, strerror(errno));
}

int is_mounted(const char *mnt)
{
	FILE *fp = fopen("/proc/mounts", "r");
	if (!fp)
		return 0;
	char line[512];
	int found = 0;
	while (fgets(line, sizeof(line), fp)) {
		if (strstr(line, mnt)) {
			found = 1;
			break;
		}
	}
	fclose(fp);
	return found;
}

int main(void)
{
	if (getuid() != 0) {
		fprintf(stderr, "This program must be run as root\n");
		return 1;
	}

	FILE *fcg = fopen("/proc/cgroups", "r");
	if (!fcg) {
		fprintf(stderr, "Kernel does not support cgroups\n");
		return 1;
	}
	fclose(fcg);

	char mnt[256];
	for (int i = 1; i <= LOOPS; ++i) {
		for (int c = 0; c < CONTROLLERS_COUNT; ++c) {
			snprintf(mnt, sizeof(mnt), "%s/cgroup_%s", TMPDIR, controllers[c]);
			printf("=== Loop %d: %s ===\n", i, controllers[c]);
			fflush(stdout);

			safe_mkdir(mnt);
			if (!is_mounted(mnt)) {
				if (mount(controllers[c], mnt, "cgroup", 0, controllers[c]) == -1) {
					fprintf(stderr, "Failed to mount cgroup v1 for %s at %s: %s\n",
						controllers[c], mnt, strerror(errno));
					safe_rmdir(mnt);
					continue;
				}
			}
			if (umount2(mnt, 0) == -1)
				fprintf(stderr, "umount(%s) failed: %s\n", mnt, strerror(errno));
			safe_rmdir(mnt);
		}
	}
	return 0;
}
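A usage note on the reproducer above: assuming a typical build setup, it
compiles with plain gcc (e.g. gcc -O2 -o cgroup_repro repro.c, file name
hypothetical) and must run as root on a kernel booted with
systemd.unified_cgroup_hierarchy=1; per the thread, the hang shows up within
the LOOPS iterations, with the stuck worker visible in hung-task reports in
cgroup_lock_and_drain_offline().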
Hi Ridong.

On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> CPU0                                    CPU1
> mount perf_event                        umount net_prio
> cgroup1_get_tree                        cgroup_kill_sb
> rebind_subsystems                       // root destruction enqueues
[...]
> Problem scenario:
> 1. CPU0 mounts perf_event (rebind_subsystems)
> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> 3. A dying perf_event CSS gets queued for offline after root destruction
> 4. Root destruction waits for offline completion, but offline work is
>    blocked behind root destruction in cgroup_destroy_wq (max_active=1)

What's concerning me is why umount of the net_prio hierarchy waits for
draining of the default hierarchy? (Where you then run into conflict with
perf_event, which is implicit_on_dfl.)

IOW why not this:
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	trace_cgroup_destroy_root(root);
 
-	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
+	cgroup_lock_and_drain_offline(cgrp);
 
 	BUG_ON(atomic_read(&root->nr_cgrps));
 	BUG_ON(!list_empty(&cgrp->self.children));

Does this correct the LTP scenario?

Thanks,
Michal
On 2025/7/24 21:35, Michal Koutný wrote:
[...]
> What's concerning me is why umount of the net_prio hierarchy waits for
> draining of the default hierarchy? (Where you then run into conflict with
> perf_event, which is implicit_on_dfl.)

That was also my first response.

> IOW why not this:
> -	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
> +	cgroup_lock_and_drain_offline(cgrp);
>
> Does this correct the LTP scenario?

I've tested this approach and discovered it can lead to another issue that
required significant investigation. This helped me understand why unmounting
the net_prio hierarchy needs to wait for draining of the default hierarchy.

Consider this sequence:

mount net_prio                          umount perf_event
cgroup1_get_tree
// &cgrp_dfl_root.cgrp
cgroup_lock_and_drain_offline
// wait for all perf_event csses dead
prepare_to_wait(&dsct->offline_waitq)
schedule();
                                        cgroup_destroy_root
                                        // &root->cgrp, not cgrp_dfl_root
                                        cgroup_lock_and_drain_offline
                                        rebind_subsystems
                                        rcu_assign_pointer(dcgrp->subsys[ssid], css);
                                        dst_root->subsys_mask |= 1 << ssid;
                                        cgroup_propagate_control
                                        // enable cgrp_dfl_root perf_event css
                                        cgroup_apply_control_enable
                                        css = cgroup_css(dsct, ss);
                                        // since we drain root->cgrp not cgrp_dfl_root
                                        // css (dying) is not NULL on the cgrp_dfl_root
                                        // we won't create a css, but the css is dying
// got the offline_waitq wake up
goto restart;
// some perf_event dying csses are online now
prepare_to_wait(&dsct->offline_waitq)
schedule();
// never get the offline_waitq wake up

I encountered two main issues:
1. Dying csses on cgrp_dfl_root may be brought back online when rebinding the
   subsystem to cgrp_dfl_root.
2. Potential hangs during cgrp_dfl_root draining in the mounting process.

I have tried calling cgroup_lock_and_drain_offline with the subtree_ss_mask,
and it seems that can fix the issue I encountered. But I am not sure whether
there are scenarios that [u]mount multiple legacy subsystems at the same
time.

I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as
it requires that offline css work (which cgroup_destroy_root needs to drain)
cannot be enqueued after cgroup_destroy_root begins. How can we guarantee
this ordering? Therefore, I propose moving the draining operation outside of
cgroup_destroy_wq as a more robust solution that would completely eliminate
this potential race condition. This patch implements that approach.

Best regards,
Ridong
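For reference, the restart loop Ridong describes, abridged from
cgroup_lock_and_drain_offline() in kernel/cgroup/cgroup.c (a sketch; exact
code differs across kernel versions):

	void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
	{
		struct cgroup *dsct;
		struct cgroup_subsys_state *d_css;
		struct cgroup_subsys *ss;
		int ssid;

	restart:
		cgroup_lock();

		cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
			for_each_subsys(ss, ssid) {
				struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
				DEFINE_WAIT(wait);

				/* only dying csses are worth waiting for */
				if (!css || !percpu_ref_is_dying(&css->refcnt))
					continue;

				cgroup_get_live(dsct);
				prepare_to_wait(&dsct->offline_waitq, &wait,
						TASK_UNINTERRUPTIBLE);

				cgroup_unlock();
				schedule();	/* woken by offline_css() */
				finish_wait(&dsct->offline_waitq, &wait);
				cgroup_put(dsct);

				goto restart;	/* re-scan from scratch */
			}
		}
	}

The "goto restart" is why a dying css that gets revived by rebind_subsystems
leaves the waiter asleep forever: after the wakeup it re-scans, still finds a
dying css, and goes back to sleep with no further wakeup coming.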
On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
[...]
> Consider this sequence:
>
> mount net_prio                          umount perf_event
> cgroup1_get_tree
> // &cgrp_dfl_root.cgrp
> cgroup_lock_and_drain_offline
> // wait for all perf_event csses dead
> prepare_to_wait(&dsct->offline_waitq)
> schedule();
>                                         cgroup_destroy_root
>                                         // &root->cgrp, not cgrp_dfl_root
>                                         cgroup_lock_and_drain_offline

perf_event's css (offline but dying)

>                                         rebind_subsystems
[...]
>                                         // since we drain root->cgrp not cgrp_dfl_root
>                                         // css (dying) is not NULL on the cgrp_dfl_root
>                                         // we won't create a css, but the css is dying

What would prevent seeing a dying css when cgrp_dfl_root is drained?
(Or nothing drained as in the patch?)

I assume you've seen this warning from cgroup_apply_control_enable
	WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?

> I encountered two main issues:
> 1. Dying csses on cgrp_dfl_root may be brought back online when rebinding
>    the subsystem to cgrp_dfl_root.

Is this really resolved by the patch? (The questions above.)

> 2. Potential hangs during cgrp_dfl_root draining in the mounting process.

Fortunately, the typical use case (mounting at boot) wouldn't suffer from
this.

> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky,
> as it requires that offline css work (which cgroup_destroy_root needs to
> drain) cannot be enqueued after cgroup_destroy_root begins.

This is a valid point.

> How can we guarantee this ordering? Therefore, I propose moving the
> draining operation outside of cgroup_destroy_wq as a more robust solution
> that would completely eliminate this potential race condition. This patch
> implements that approach.

I acknowledge the issue (although rare in the real world). Some entity will
always have to wait for the offlining. It may be OK in cgroup_kill_sb
(ideally, if this was bound to the process context of the umount caller; not
sure if that's how kill_sb works).
I slightly dislike the form of an empty lock/unlock, which makes me wonder
if this is the best solution.

Let me think more about this...

Thanks,
Michal
On 2025/7/26 1:17, Michal Koutný wrote:
[...]
> I acknowledge the issue (although rare in the real world). Some entity
> will always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to the process context of the umount caller;
> not sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock, which makes me wonder
> if this is the best solution.
>
> Let me think more about this...

I've just submitted v3, which I believe provides an improved solution to the
issue we've been discussing. The v3 version is available here:
https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u

I'd be truly grateful if you could spare some time to review it.

--
Best regards,
Ridong
On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
[...]
> > What's concerning me is why umount of the net_prio hierarchy waits for
> > draining of the default hierarchy? (Where you then run into conflict
> > with perf_event, which is implicit_on_dfl.)

/*
 * cgroup destruction makes heavy use of work items and there can be a lot
 * of concurrent destructions.  Use a separate workqueue so that cgroup
 * destruction work items don't end up filling up max_active of system_wq
 * which may lead to deadlock.
 */

If the task hang can be reliably reproduced, it is the right time to cut the
max_active cap off cgroup_destroy_wq, according to its comment.
On 2025/8/15 10:40, Hillf Danton wrote:
[...]
> If the task hang can be reliably reproduced, it is the right time to cut
> the max_active cap off cgroup_destroy_wq, according to its comment.

Hi Danton,

Thank you for your feedback.

While modifying max_active could be a viable solution, I'm unsure whether it
might introduce other side effects. Instead, I've proposed an alternative
approach in v3 of the patch, which I believe addresses the issue more
comprehensively. I'd be very grateful if you could take a look and share your
thoughts. Your review would be greatly appreciated!

v3: https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u

--
Best regards,
Ridong
On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
[...]
> While modifying max_active could be a viable solution, I'm unsure whether
> it might introduce other side effects. Instead, I've proposed an
> alternative approach in v3 of the patch, which I believe addresses the
> issue more comprehensively.

Given your reproducer [1], it is simple to test with the max_active cap cut.

I do not think v3 is a correct fix, frankly, because it leaves the root cause
intact. Nor is it cgroup specific, even given high concurrency in
destruction.

[1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
On 2025/8/15 18:02, Hillf Danton wrote:
[...]
> Given your reproducer [1], it is simple to test with the max_active cap
> cut.
>
> I do not think v3 is a correct fix, frankly, because it leaves the root
> cause intact. Nor is it cgroup specific, even given high concurrency in
> destruction.

Hi Danton,

Thank you for your reply.

To clarify, when you mentioned "cut max_active off", did you mean setting
max_active of cgroup_destroy_wq to 1? Note that cgroup_destroy_wq already has
max_active=1 as initialized:

	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);

The v3 changes prevent subsystem root destruction from being blocked by
unrelated subsystem offline events. Since root destruction should only
proceed after all descendants are destroyed, it shouldn't be blocked by
children's offline events. My testing with the reproducer confirms this
fixes the issue I encountered.

--
Best regards,
Ridong
On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
>
> To clarify, when you mentioned "cut max_active off", did you mean setting
> max_active of cgroup_destroy_wq to 1?

	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
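For context: passing 0 as max_active to alloc_workqueue() does not mean "no
concurrency", it selects the default cap WQ_DFL_ACTIVE (256 on many kernel
versions). A hedged sketch of the two variants under discussion (function
name cgroup_wq_init is hypothetical, standing in for cgroup_wq_init() in
kernel/cgroup/cgroup.c):

	#include <linux/workqueue.h>

	static struct workqueue_struct *cgroup_destroy_wq;

	static int __init cgroup_wq_init(void)
	{
		/* Today: at most one destruction work item in flight per
		 * CPU; css offline work queues behind a blocked root
		 * destruction, which is the hang in this thread. */
		cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);

		/* Hillf's suggestion: max_active = 0 selects WQ_DFL_ACTIVE,
		 * so offline items can run concurrently with a waiting
		 * root-destruction item:
		 *
		 *	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
		 */
		return cgroup_destroy_wq ? 0 : -ENOMEM;
	}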
On 2025/8/15 19:54, Hillf Danton wrote:
> On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
> >
> > To clarify, when you mentioned "cut max_active off", did you mean
> > setting max_active of cgroup_destroy_wq to 1?
> >
> 	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);

Thank you for the additional clarification.

While this modification is functional, I'm concerned it might spawn a
significant number of cgroup destruction tasks, most of which would contend
for cgroup_mutex. Could this waste system resources?

--
Best regards,
Ridong
On Sat, 16 Aug 2025 08:33:55 +0800 Chen Ridong wrote:
[...]
> While this modification is functional, I'm concerned it might spawn a
> significant number of cgroup destruction tasks, most of which would
> contend for cgroup_mutex. Could this waste system resources?

No win comes from nothing, so you need to pay $.02, for example, at least
for curing the task hang by erasing the root cause.
> I acknowledge the issue (although rare in the real world). Some entity
> will always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to the process context of the umount caller;
> not sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock, which makes me wonder
> if this is the best solution.
>
> Let me think more about this...
>
> Thanks,
> Michal

Hi Michal,

Have you come up with a better solution for this?
Would appreciate your thoughts when you have time.

Best regards,
Ridong
Hi Ridong.

On Thu, Jul 31, 2025 at 07:53:02PM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> Have you come up with a better solution for this?
> Would appreciate your thoughts when you have time.

Sorry for taking so long. (Also expect my next response here may be slow.)
I tried reproducing it with the described LTP tests [1] (to get a better
idea about what and why needs to be offlined) but I cannot bring it to a
hang nor a lockdep report. How do you launch the particular LTP tests to
trigger this?

Thanks,
Michal

[1]
while true ; do
	/opt/ltp/testcases/bin/cgroup_fj_function.sh net_cls $pp
	/opt/ltp/testcases/bin/cgroup_fj_function.sh perf_event
done
(with pp both `;` or `&` for concurrent runs, two vCPUs)
On 2025/8/14 23:17, Michal Koutný wrote:
[...]
> I tried reproducing it with the described LTP tests [1] (to get a better
> idea about what and why needs to be offlined) but I cannot bring it to a
> hang nor a lockdep report. How do you launch the particular LTP tests to
> trigger this?

Hi Michal,

I've provided details on how to reproduce the issue; it's quite
straightforward. For your reference, here's the link to the discussion:
https://lore.kernel.org/cgroups/btaaerpdl3bolxbysbqcig6kiccdgsoo32td64sk6yo4m5l5zy@nds6s35p6e6w/T/#m01f1229f2e84e1186abaf04378b4c0f47151f7e4

Let me know if you need further clarification.

--
Best regards,
Ridong
On 2025/7/26 1:17, Michal Koutný wrote:
[...]
> What would prevent seeing a dying css when cgrp_dfl_root is drained?
> (Or nothing drained as in the patch?)
>
> I assume you've seen this warning from cgroup_apply_control_enable
> 	WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?

Yes, I have seen that warning.

Draining the cgrp_dfl_root can prevent seeing the dying css.

Q: When can a task waiting on offline_waitq be woken up?
A: When offline_css() is invoked, which does:
	RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);

If we drain the cgrp_dfl_root, the drain traverses all the csses. That means
cgroup_lock_and_drain_offline() can only return when all the dying csses
have disappeared, thus preventing seeing a dying css.

> > I encountered two main issues:
> > 1. Dying csses on cgrp_dfl_root may be brought back online when
> >    rebinding the subsystem to cgrp_dfl_root.
>
> Is this really resolved by the patch? (The questions above.)
[...]
> I acknowledge the issue (although rare in the real world). Some entity
> will always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to the process context of the umount caller;
> not sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock, which makes me wonder
> if this is the best solution.

Thank you, I'd appreciate it if you could suggest a better solution.

Thanks,
Ridong
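For completeness, the wakeup side that pairs with the drain loop, abridged
from offline_css() in kernel/cgroup/cgroup.c (a sketch; details vary by
kernel version):

	static void offline_css(struct cgroup_subsys_state *css)
	{
		struct cgroup_subsys *ss = css->ss;

		if (!(css->flags & CSS_ONLINE))
			return;

		if (ss->css_offline)
			ss->css_offline(css);

		css->flags &= ~CSS_ONLINE;
		/* The line Ridong cites: the dying css disappears from the
		 * cgroup's subsys[] array, so a re-scan of the drain loop
		 * no longer finds it... */
		RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);

		/* ...and waiters on offline_waitq are woken to re-scan. */
		wake_up_all(&css->cgroup->offline_waitq);
	}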