Now in cpuset_attach(), we need to synchronously wait for
flush_workqueue to complete. The execution time of flushing
cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
cpusets at that time. When the cpuset.mems of a cgroup occupying a large
amount of memory is modified, it may trigger extensive mm migration,
causing cpuset_attach() to block on flush_workqueue for an extended period.

 cgroup attach operation                |   someone change cpuset.mems
----------------------------------------+-------------------------------
__cgroup_procs_write()                  | cpuset_write_resmask()
  cgroup_kn_lock_live()                 |
  cpuset_attach()                       |   cpuset_migrate_mm()
  cpuset_post_attach()                  |
    flush_workqueue(cpuset_migrate_mm_wq);

This could be dangerous because cpuset_attach() is within the critical
section of cgroup_mutex, which may ultimately cause all cgroup-related
operations in the system to be blocked.

We encountered this issue in our production environment, and it can easily
be reproduced locally using the script below.

[Thu Sep 4 14:51:39 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Sep 4 14:51:39 2025] task:tee state:D stack:0 pid:13330 tgid:13330 ppid:13321 task_flags:0x400100 flags:0x00004000
[Thu Sep 4 14:51:39 2025] Call Trace:
[Thu Sep 4 14:51:39 2025] <TASK>
[Thu Sep 4 14:51:39 2025] __schedule+0xcc1/0x1c60
[Thu Sep 4 14:51:39 2025] ? find_held_lock+0x2d/0xa0
[Thu Sep 4 14:51:39 2025] schedule+0x3e/0xe0
[Thu Sep 4 14:51:39 2025] schedule_preempt_disabled+0x15/0x30
[Thu Sep 4 14:51:39 2025] __mutex_lock+0x928/0x1230
[Thu Sep 4 14:51:39 2025] ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep 4 14:51:39 2025] ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep 4 14:51:39 2025] cgroup_kn_lock_live+0x4a/0x240
[Thu Sep 4 14:51:39 2025] __cgroup_procs_write+0x38/0x210
[Thu Sep 4 14:51:39 2025] cgroup_procs_write+0x17/0x30
[Thu Sep 4 14:51:39 2025] cgroup_file_write+0xa5/0x260
[Thu Sep 4 14:51:39 2025] kernfs_fop_write_iter+0x13d/0x1e0
[Thu Sep 4 14:51:39 2025] vfs_write+0x310/0x530
[Thu Sep 4 14:51:39 2025] ksys_write+0x6e/0xf0
[Thu Sep 4 14:51:39 2025] do_syscall_64+0x77/0x390
[Thu Sep 4 14:51:39 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e

This patchset defers the flush_workqueue() operation to a task_work that
runs when the writer returns to userspace, as originally proposed by
Tejun [1], so that the flush happens after cgroup_mutex is dropped. That
way we maintain the operation's synchronicity while avoiding bothering
anyone else.
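For illustration only, here is a minimal sketch of how such a deferral
could be wired up with the task_work API; it is not the actual patch.
cpuset_defer_flush() and flush_migrate_mm_twork() are hypothetical names,
while cpuset_migrate_mm_wq, task_work_add() and TWA_RESUME are the
existing kernel symbols.

#include <linux/slab.h>
#include <linux/task_work.h>
#include <linux/workqueue.h>

/*
 * Runs in the writer's task context on its way back to userspace;
 * cgroup_mutex is no longer held at this point.
 */
static void flush_migrate_mm_twork(struct callback_head *head)
{
	flush_workqueue(cpuset_migrate_mm_wq);
	kfree(head);
}

/*
 * Called where the flush used to happen, while cgroup_mutex is still
 * held, to push the actual wait out of the locked region.
 */
static void cpuset_defer_flush(void)
{
	struct callback_head *twork;

	twork = kzalloc(sizeof(*twork), GFP_KERNEL);
	if (!twork) {
		/* Allocation failed: fall back to a direct, blocking flush. */
		flush_workqueue(cpuset_migrate_mm_wq);
		return;
	}

	init_task_work(twork, flush_migrate_mm_twork);
	/* TWA_RESUME: fire the callback when current returns to userspace. */
	if (task_work_add(current, twork, TWA_RESUME)) {
		/* current is exiting and cannot take task_work; flush now. */
		kfree(twork);
		flush_workqueue(cpuset_migrate_mm_wq);
	}
}

With TWA_RESUME the callback runs only after the write syscall has left
the cgroup_mutex critical section, so other cgroup operations no longer
queue up behind the flush, yet the writer still observes the migration
as complete before its write returns.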
[1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883

#!/bin/bash

sudo mkdir -p /sys/fs/cgroup/test
sudo mkdir -p /sys/fs/cgroup/test1
sudo mkdir -p /sys/fs/cgroup/test2

echo 0 > /sys/fs/cgroup/test1/cpuset.mems
echo 1 > /sys/fs/cgroup/test2/cpuset.mems

for i in {1..10}; do
    (
        pid=$BASHPID
        while true; do
            echo "Add $pid to test1"
            echo "$pid" | sudo tee /sys/fs/cgroup/test1/cgroup.procs >/dev/null
            sleep 5
            echo "Add $pid to test2"
            echo "$pid" | sudo tee /sys/fs/cgroup/test2/cgroup.procs >/dev/null
        done
    ) &
done

echo 0 > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs

stress --vm 100 --vm-bytes 2048M --vm-keep &

sleep 30

echo "begin change cpuset.mems"
echo 1 > /sys/fs/cgroup/test/cpuset.mems

Chuyi Zhou (3):
  cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask
  cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
  cgroup: Remove unused cgroup_subsys::post_attach

 include/linux/cgroup-defs.h |  1 -
 kernel/cgroup/cgroup.c      |  4 ----
 kernel/cgroup/cpuset.c      | 30 +++++++++++++++++++++++++-----
 3 files changed, 25 insertions(+), 10 deletions(-)

-- 
2.20.1
On Thu, Sep 04, 2025 at 03:45:02PM +0800, Chuyi Zhou wrote:
> Now in cpuset_attach(), we need to synchronously wait for
> flush_workqueue to complete. The execution time of flushing
> cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
> cpusets at that time. When the cpuset.mems of a cgroup occupying a large
> amount of memory is modified, it may trigger extensive mm migration,
> causing cpuset_attach() to block on flush_workqueue for an extended period.

Applied 1-3 to cgroup/for-6.18. There were a couple conflicts that I
resolved. It'd be great if you can take a look and make sure everything is
okay.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.18

Thanks.

-- 
tejun
On 2025/9/5 01:27, Tejun Heo wrote:
> On Thu, Sep 04, 2025 at 03:45:02PM +0800, Chuyi Zhou wrote:
>> Now in cpuset_attach(), we need to synchronously wait for
>> flush_workqueue to complete. The execution time of flushing
>> cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
>> cpusets at that time. When the cpuset.mems of a cgroup occupying a large
>> amount of memory is modified, it may trigger extensive mm migration,
>> causing cpuset_attach() to block on flush_workqueue for an extended period.
>
> Applied 1-3 to cgroup/for-6.18. There were a couple conflicts that I
> resolved. It'd be great if you can take a look and make sure everything is
> okay.
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.18
>
> Thanks.
>

Sorry, I forgot to rebase onto the latest cgroup branch before sending the
patchset. I checked the applied patches, and everything looks okay.

Thanks.