From nobody Thu Oct 2 22:39:30 2025
From: Yi Tao
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer
Date: Wed, 10 Sep 2025 14:59:33 +0800

Dynamic cgroup migration involving threadgroup locks can be in one of
two states: no lock held, or holding the global lock. Explicitly
declare the different lock modes to make the code easier to understand
and to facilitate future extensions of the lock modes.

Signed-off-by: Yi Tao
---
 include/linux/cgroup-defs.h     |  8 ++++
 kernel/cgroup/cgroup-internal.h |  9 +++--
 kernel/cgroup/cgroup-v1.c       | 14 +++----
 kernel/cgroup/cgroup.c          | 67 ++++++++++++++++++++++++---------
 4 files changed, 69 insertions(+), 29 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..213b0d01464f 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -140,6 +140,14 @@ enum {
 	__CFTYPE_ADDED = (1 << 18),
 };
 
+enum cgroup_attach_lock_mode {
+	/* Default */
+	CGRP_ATTACH_LOCK_GLOBAL,
+
+	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
+	CGRP_ATTACH_LOCK_NONE,
+};
+
 /*
  * cgroup_file is the handle for a file instance created in a cgroup which
  * is used, for example, to generate file changed notifications. This can
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..a6d6f30b6f65 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,12 +249,13 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode);
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *locked)
+					     enum cgroup_attach_lock_mode *lock_mode)
 	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(struct task_struct *task, bool locked)
+void cgroup_procs_write_finish(struct task_struct *task,
+			       enum cgroup_attach_lock_mode lock_mode)
 	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..f3838888ea6f 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
 	cgroup_unlock();
 	return ret;
 }
@@ -502,13 +502,13 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	struct task_struct *task;
 	const struct cred *cred, *tcred;
 	ssize_t ret;
-	bool locked;
+	enum cgroup_attach_lock_mode lock_mode;
 
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -531,7 +531,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish(task, locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..2b88c7abaa00 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2459,7 +2459,7 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to down_write cgroup_threadgroup_rwsem
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2480,21 +2480,39 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
+
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_down_write(&cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.");
+		break;
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to up_write cgroup_threadgroup_rwsem
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
 {
-	if (lock_threadgroup)
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_up_write(&cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.");
+		break;
+	}
+
 	cpus_read_unlock();
 }
 
@@ -2968,7 +2986,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 }
 
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *threadgroup_locked)
+					     enum cgroup_attach_lock_mode *lock_mode)
 {
 	struct task_struct *tsk;
 	pid_t pid;
@@ -2985,8 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 * Therefore, we can skip the global lock.
 	 */
 	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
+
+	if (pid || threadgroup)
+		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(*lock_mode);
 
 	rcu_read_lock();
 	if (pid) {
@@ -3017,14 +3040,15 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	goto out_unlock_rcu;
 
 out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
-	*threadgroup_locked = false;
+	cgroup_attach_unlock(*lock_mode);
+	*lock_mode = CGRP_ATTACH_LOCK_NONE;
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
 }
 
-void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked)
+void cgroup_procs_write_finish(struct task_struct *task,
+			       enum cgroup_attach_lock_mode lock_mode)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
@@ -3032,7 +3056,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(lock_mode);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3088,6 +3112,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 	struct cgroup *dsct;
 	struct css_set *src_cset;
+	enum cgroup_attach_lock_mode lock_mode;
 	bool has_tasks;
 	int ret;
 
@@ -3119,7 +3144,13 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+
+	if (has_tasks)
+		lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(lock_mode);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3171,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(lock_mode);
 	return ret;
 }
 
@@ -5241,13 +5272,13 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	struct task_struct *task;
 	const struct cred *saved_cred;
 	ssize_t ret;
-	bool threadgroup_locked;
+	enum cgroup_attach_lock_mode lock_mode;
 
 	dst_cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!dst_cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -5273,7 +5304,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	ret = cgroup_attach_task(dst_cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish(task, threadgroup_locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
-- 
2.32.0.3.g01195cf9f

From nobody Thu Oct 2 22:39:30 2025
From: Yi Tao
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start
Date: Wed, 10 Sep 2025 14:59:34 +0800
Message-Id: <324e2f62ed7a3666e28768d2c35b8aa957dd1651.1757486368.git.escape@linux.alibaba.com>

A later patch will add a new parameter `task` to cgroup_attach_lock(),
so move the cgroup_attach_lock() call within cgroup_procs_write_start()
to after the target task has been looked up.

Between obtaining the threadgroup leader via PID and acquiring the
cgroup attach lock, the threadgroup leader may change, which could lead
to incorrect cgroup migration. Therefore, after acquiring the cgroup
attach lock, we check whether the threadgroup leader has changed, and if
so, retry the operation.

Signed-off-by: Yi Tao
---
 kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2b88c7abaa00..756807164091 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2994,29 +2994,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * If we migrate a single thread, we don't care about threadgroup
-	 * stability. If the thread is `current`, it won't exit(2) under our
-	 * hands or change PID through exec(2). We exclude
-	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
-	 * callers by cgroup_mutex.
-	 * Therefore, we can skip the global lock.
-	 */
-	lockdep_assert_held(&cgroup_mutex);
-
-	if (pid || threadgroup)
-		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
-	else
-		*lock_mode = CGRP_ATTACH_LOCK_NONE;
-
-	cgroup_attach_lock(*lock_mode);
-
+retry_find_task:
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
 			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
+			goto out_unlock_rcu;
 		}
 	} else {
 		tsk = current;
@@ -3033,15 +3017,46 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		tsk = ERR_PTR(-EINVAL);
-		goto out_unlock_threadgroup;
+		goto out_unlock_rcu;
 	}
 
 	get_task_struct(tsk);
-	goto out_unlock_rcu;
+	rcu_read_unlock();
+
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+
+	if (pid || threadgroup)
+		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(*lock_mode);
+
+	if (threadgroup) {
+		if (!thread_group_leader(tsk)) {
+			/*
+			 * a race with de_thread from another thread's exec()
+			 * may strip us of our leadership, if this happens,
+			 * there is no choice but to throw this task away and
+			 * try again; this is
+			 * "double-double-toil-and-trouble-check locking".
+			 */
+			cgroup_attach_unlock(*lock_mode);
+			put_task_struct(tsk);
+			goto retry_find_task;
+		}
+	}
+
+	return tsk;
 
-out_unlock_threadgroup:
-	cgroup_attach_unlock(*lock_mode);
-	*lock_mode = CGRP_ATTACH_LOCK_NONE;
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
-- 
2.32.0.3.g01195cf9f

From nobody Thu Oct 2 22:39:30 2025
From: Yi Tao
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
Date: Wed, 10 Sep 2025 14:59:35 +0800
Message-Id: <9d46438e61bcf7b5ffc9eb582081f4fcc99c2cbf.1757486368.git.escape@linux.alibaba.com>

The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write locking
cgroup_threadgroup_rwsem and thus doesn't benefit from this patch.

To avoid affecting other users, the per threadgroup rwsem is only used
when favordynmods is enabled.

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking to take the
write lock on signal_struct->cgroup_threadgroup_rwsem when writing a pid
to cgroup.procs/threads, instead of holding a global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using a per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->cgroup_threadgroup_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.

Signed-off-by: Yi Tao
---
 include/linux/cgroup-defs.h     | 17 ++++++++-
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  6 ++-
 kernel/cgroup/cgroup-v1.c       |  8 ++--
 kernel/cgroup/cgroup.c          | 67 +++++++++++++++++++++++++--------
 kernel/fork.c                   |  4 ++
 7 files changed, 87 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 213b0d01464f..fe29152bceff 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -91,6 +91,12 @@ enum {
 	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
 	 * forks and exits into the slow path and more expensive.
 	 *
+	 * Alleviate the contention between fork, exec, exit operations and
+	 * writing to cgroup.procs by taking a per threadgroup rwsem instead of
+	 * the global cgroup_threadgroup_rwsem. Fork and other operations
+	 * from threads in different thread groups no longer contend with
+	 * writing to cgroup.procs.
+	 *
 	 * The static usage pattern of creating a cgroup, enabling controllers,
 	 * and then seeding it with CLONE_INTO_CGROUP doesn't require write
 	 * locking cgroup_threadgroup_rwsem and thus doesn't benefit from
@@ -146,6 +152,9 @@ enum cgroup_attach_lock_mode {
 
 	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
 	CGRP_ATTACH_LOCK_NONE,
+
+	/* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */
+	CGRP_ATTACH_LOCK_PER_THREADGROUP,
 };
 
 /*
@@ -830,6 +839,7 @@ struct cgroup_subsys {
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
+extern bool cgroup_enable_per_threadgroup_rwsem;
 
 struct cgroup_of_peak {
 	unsigned long value;
@@ -841,11 +851,14 @@ struct cgroup_of_peak {
  * @tsk: target task
  *
  * Allows cgroup operations to synchronize against threadgroup changes
- * using a percpu_rw_semaphore.
+ * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
+ * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
  */
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (cgroup_enable_per_threadgroup_rwsem)
+		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
 }
 
 /**
@@ -856,6 +869,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (cgroup_enable_per_threadgroup_rwsem)
+		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..7d6449982822 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore cgroup_threadgroup_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..a55e2189206f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess = HLIST_HEAD_INIT,
 	.rlim = INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index a6d6f30b6f65..22051b4f1ccb 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode);
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode);
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk);
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     enum cgroup_attach_lock_mode *lock_mode)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index f3838888ea6f..db48550bc4cc 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 756807164091..344424dd365b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -216,6 +216,14 @@ static u16 have_canfork_callback __read_mostly;
 
 static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 
+/*
+ * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem,
+ * read protected by either.
+ *
+ * Can only be turned on, but not turned off.
+ */
+bool cgroup_enable_per_threadgroup_rwsem __read_mostly;
+
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
 	.ns.count = REFCOUNT_INIT(2),
@@ -1302,14 +1310,24 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
 {
 	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
 
-	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	/*
+	 * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
+	 * favordynmods can flip while task is between
+	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
+	 * so down_write global cgroup_threadgroup_rwsem to synchronize them.
+	 */
+	percpu_down_write(&cgroup_threadgroup_rwsem);
 	if (favor && !favoring) {
+		cgroup_enable_per_threadgroup_rwsem = true;
 		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
 		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 	} else if (!favor && favoring) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			WARN_ONCE(1, "cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n");
 		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
 		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
 	}
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 }
 
 static int cgroup_init_root_id(struct cgroup_root *root)
@@ -2459,7 +2477,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_mode: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether acquire and acquire which rwsem
+ * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2479,8 +2498,15 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * Resolve the situation by always acquiring cpus_read_lock() before optionally
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
+ *
+ * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead
+ * on dynamic cgroup modifications. see the comment above
+ * CGRP_ROOT_FAVOR_DYNMODS definition.
+ *
+ * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk)
 {
 	cpus_read_lock();
 
@@ -2490,6 +2516,9 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_down_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.");
 		break;
@@ -2498,9 +2527,11 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_mode: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether release and release which rwsem
+ * @tsk: thread group to lock
  */
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk)
 {
 	switch (lock_mode) {
 	case CGRP_ATTACH_LOCK_NONE:
@@ -2508,6 +2539,9 @@ void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_up_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.");
 		break;
@@ -3019,7 +3053,6 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 		tsk = ERR_PTR(-EINVAL);
 		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
@@ -3033,12 +3066,16 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	lockdep_assert_held(&cgroup_mutex);
 
-	if (pid || threadgroup)
-		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
-	else
+	if (pid || threadgroup) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			*lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP;
+		else
+			*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	} else {
 		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+	}
 
-	cgroup_attach_lock(*lock_mode);
+	cgroup_attach_lock(*lock_mode, tsk);
 
 	if (threadgroup) {
 		if (!thread_group_leader(tsk)) {
@@ -3049,7 +3086,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 			 * try again; this is
 			 * "double-double-toil-and-trouble-check locking".
 			 */
-			cgroup_attach_unlock(*lock_mode);
+			cgroup_attach_unlock(*lock_mode, tsk);
 			put_task_struct(tsk);
 			goto retry_find_task;
 		}
@@ -3068,11 +3105,11 @@ void cgroup_procs_write_finish(struct task_struct *task,
 	struct cgroup_subsys *ss;
 	int ssid;
 
+	cgroup_attach_unlock(lock_mode, task);
+
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(lock_mode);
-
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
 			ss->post_attach();
@@ -3165,7 +3202,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	else
 		lock_mode = CGRP_ATTACH_LOCK_NONE;
 
-	cgroup_attach_lock(lock_mode);
+	cgroup_attach_lock(lock_mode, NULL);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3186,7 +3223,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(lock_mode);
+	cgroup_attach_unlock(lock_mode, NULL);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd..9a039867ecfd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->cgroup_threadgroup_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f