[PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work

Chuyi Zhou posted 3 patches 4 weeks ago
[PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work
Posted by Chuyi Zhou 4 weeks ago
Currently, cpuset_attach() has to wait synchronously for
flush_workqueue() to complete. The time needed to flush
cpuset_migrate_mm_wq depends on how much mm migration cpusets have
initiated at that point. When cpuset.mems is modified for a cgroup that
occupies a large amount of memory, it may trigger extensive mm migration,
causing cpuset_attach() to block on flush_workqueue() for an extended period.

        cgroup attach operation      | someone changes cpuset.mems
                                     |
  -----------------------------------+-------------------------------
  __cgroup_procs_write()             | cpuset_write_resmask()
    cgroup_kn_lock_live()            |
    cpuset_attach()                  |   cpuset_migrate_mm()
                                     |
                                     |
    cpuset_post_attach()             |
      flush_workqueue(cpuset_migrate_mm_wq);

This is dangerous because cpuset_attach() runs inside the cgroup_mutex
critical section, so a long flush may end up blocking all cgroup-related
operations in the system. We hit this issue in production, and it can be
easily reproduced locally with the script at the end of this message. The
hung task detector reports the blocked writers like this:

[Thu Sep  4 14:51:39 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Sep  4 14:51:39 2025] task:tee             state:D stack:0     pid:13330 tgid:13330 ppid:13321  task_flags:0x400100 flags:0x00004000
[Thu Sep  4 14:51:39 2025] Call Trace:
[Thu Sep  4 14:51:39 2025]  <TASK>
[Thu Sep  4 14:51:39 2025]  __schedule+0xcc1/0x1c60
[Thu Sep  4 14:51:39 2025]  ? find_held_lock+0x2d/0xa0
[Thu Sep  4 14:51:39 2025]  schedule+0x3e/0xe0
[Thu Sep  4 14:51:39 2025]  schedule_preempt_disabled+0x15/0x30
[Thu Sep  4 14:51:39 2025]  __mutex_lock+0x928/0x1230
[Thu Sep  4 14:51:39 2025]  ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep  4 14:51:39 2025]  ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep  4 14:51:39 2025]  cgroup_kn_lock_live+0x4a/0x240
[Thu Sep  4 14:51:39 2025]  __cgroup_procs_write+0x38/0x210
[Thu Sep  4 14:51:39 2025]  cgroup_procs_write+0x17/0x30
[Thu Sep  4 14:51:39 2025]  cgroup_file_write+0xa5/0x260
[Thu Sep  4 14:51:39 2025]  kernfs_fop_write_iter+0x13d/0x1e0
[Thu Sep  4 14:51:39 2025]  vfs_write+0x310/0x530
[Thu Sep  4 14:51:39 2025]  ksys_write+0x6e/0xf0
[Thu Sep  4 14:51:39 2025]  do_syscall_64+0x77/0x390
[Thu Sep  4 14:51:39 2025]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

This patchset defers the flush_workqueue() call until the task returns to
userspace by using task_work, as originally proposed by Tejun [1], so that
the flush happens after cgroup_mutex is dropped. That way the operation
stays synchronous from the caller's point of view without blocking anyone
else.

[1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883
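
To make the intended ordering concrete, here is a small userspace C model
of the idea. It is only an illustration and not the code from this series:
every model_* name is hypothetical, the "mutex" is a plain flag standing in
for cgroup_mutex, and the "flush" does no real work. It shows that a
callback queued on the task during attach runs only after the lock has been
released, whereas a direct flush runs while the lock is still held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
struct model_work {
	void (*func)(struct model_work *work);
	struct model_work *next;
};

struct model_task {
	struct model_work *pending;	/* run on "return to userspace" */
};

static bool model_mutex_held;		/* stands in for cgroup_mutex */
static bool flushed_under_mutex;	/* set if the flush ran with the lock held */

/* Stands in for flush_workqueue(cpuset_migrate_mm_wq). */
static void model_flush(struct model_work *work)
{
	(void)work;
	if (model_mutex_held)
		flushed_under_mutex = true;
}

/* Queue a callback on the task instead of running it right away. */
static void model_task_work_add(struct model_task *task,
				struct model_work *work)
{
	work->next = task->pending;
	task->pending = work;
}

/* Run all queued callbacks; by construction the lock is not held here. */
static void model_return_to_user(struct model_task *task)
{
	assert(!model_mutex_held);
	while (task->pending) {
		struct model_work *work = task->pending;

		task->pending = work->next;
		work->func(work);
	}
}

/* The attach path: either flush under the lock or defer the flush. */
static void model_attach(struct model_task *task, struct model_work *work,
			 bool defer)
{
	model_mutex_held = true;
	if (defer) {
		work->func = model_flush;
		model_task_work_add(task, work);
	} else {
		model_flush(work);
	}
	model_mutex_held = false;
}

/* One modelled write(); returns true if the flush ran under the lock. */
bool model_write_syscall(bool defer)
{
	struct model_task task = { .pending = NULL };
	struct model_work work = { .func = NULL, .next = NULL };

	flushed_under_mutex = false;
	model_attach(&task, &work, defer);
	model_return_to_user(&task);
	return flushed_under_mutex;
}
```

In this model, model_write_syscall(false) returns true because the flush
runs inside the locked region, while model_write_syscall(true) returns
false because the flush is postponed to model_return_to_user(). The caller
still does not get control back until the flush has run, which is the
synchronicity the cover letter refers to.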

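#!/bin/bash
# Reproducer. Run as root: the plain redirects into cgroupfs below are
# not covered by sudo. Memory nodes 0 and 1 must both exist.

sudo mkdir -p /sys/fs/cgroup/test
sudo mkdir -p /sys/fs/cgroup/test1
sudo mkdir -p /sys/fs/cgroup/test2

# Put test1 and test2 on different memory nodes.
echo 0 > /sys/fs/cgroup/test1/cpuset.mems
echo 1 > /sys/fs/cgroup/test2/cpuset.mems

# Ten background subshells that keep moving themselves between test1
# and test2, so there is always a cgroup attach operation in flight.
for i in {1..10}; do
    (
        pid=$BASHPID

        while true; do
            echo "Add $pid to test1"
            echo "$pid" | sudo tee /sys/fs/cgroup/test1/cgroup.procs >/dev/null

            sleep 5

            echo "Add $pid to test2"
            echo "$pid" | sudo tee /sys/fs/cgroup/test2/cgroup.procs >/dev/null
        done
    ) &
done

# Move this shell into test and allocate a lot of memory on node 0 ...
echo 0 > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs

stress --vm 100 --vm-bytes 2048M --vm-keep &

sleep 30

# ... then switch test to node 1 to trigger a large mm migration.
echo "begin change cpuset.mems"
echo 1 > /sys/fs/cgroup/test/cpuset.mems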

Chuyi Zhou (3):
  cpuset: Don't always flush cpuset_migrate_mm_wq in
    cpuset_write_resmask
  cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
  cgroup: Remove unused cgroup_subsys::post_attach

 include/linux/cgroup-defs.h |  1 -
 kernel/cgroup/cgroup.c      |  4 ----
 kernel/cgroup/cpuset.c      | 30 +++++++++++++++++++++++++-----
 3 files changed, 25 insertions(+), 10 deletions(-)

-- 
2.20.1
Re: [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work
Posted by Tejun Heo 4 weeks ago
On Thu, Sep 04, 2025 at 03:45:02PM +0800, Chuyi Zhou wrote:
> Now in cpuset_attach(), we need to synchronously wait for
> flush_workqueue to complete. The execution time of flushing
> cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
> cpusets at that time. When the cpuset.mems of a cgroup occupying a large
> amount of memory is modified, it may trigger extensive mm migration,
> causing cpuset_attach() to block on flush_workqueue for an extended period.

Applied 1-3 to cgroup/for-6.18. There were a couple conflicts that I
resolved. It'd be great if you can take a look and make sure everything is
okay.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.18

Thanks.

-- 
tejun
Re: [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work
Posted by Chuyi Zhou 4 weeks ago

On 2025/9/5 01:27, Tejun Heo wrote:
> On Thu, Sep 04, 2025 at 03:45:02PM +0800, Chuyi Zhou wrote:
>> Now in cpuset_attach(), we need to synchronously wait for
>> flush_workqueue to complete. The execution time of flushing
>> cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
>> cpusets at that time. When the cpuset.mems of a cgroup occupying a large
>> amount of memory is modified, it may trigger extensive mm migration,
>> causing cpuset_attach() to block on flush_workqueue for an extended period.
> 
> Applied 1-3 to cgroup/for-6.18. There were a couple conflicts that I
> resolved. It'd be great if you can take a look and make sure everything is
> okay.
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.18
> 
> Thanks.
> 

Sorry, I forgot to rebase onto the latest cgroup branch before sending the
patchset. I have checked the result and everything is okay.

Thanks.