From nobody Thu Oct 2 23:57:57 2025
From: Yi Tao <escape@linux.alibaba.com>
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 1/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
Date: Tue, 9 Sep 2025 15:55:28 +0800
Message-Id: <510f350477fc7a22f1fb69c37f7d87b605e3eac0.1757403652.git.escape@linux.alibaba.com>

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, so cgroup process migration is no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate this contention, take the write lock on
signal_struct->group_rwsem when writing a pid to cgroup.procs/threads
instead of holding the global write lock.

The static usage pattern of creating a cgroup, enabling controllers, and
then seeding it with CLONE_INTO_CGROUP doesn't require write locking
cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To
avoid affecting such users, the per threadgroup rwsem is only used when
favordynmods is enabled.
Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. Commit
1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") changed this to a global
cgroup_threadgroup_rwsem; the advantage of the global lock was
simplified handling of process group migrations. This patch retains the
global lock for protecting process group migration, while reducing
contention by using a per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit  | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)     | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)     | down_write(&g_rwsem)
critical section            | critical section        | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)       | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)       | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem; p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load. With this patch, under heavy fork and exec interference, the
long-tail latency of cgroup migration drops from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increases by 9%.
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     | 14 ++++-
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  4 +-
 kernel/cgroup/cgroup-v1.c       |  8 +--
 kernel/cgroup/cgroup.c          | 96 ++++++++++++++++++++++-----------
 kernel/fork.c                   |  4 ++
 7 files changed, 95 insertions(+), 38 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..552050bd2c02 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -91,6 +91,12 @@ enum {
	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
	 * forks and exits into the slow path and more expensive.
	 *
+	 * Alleviate the contention between fork, exec, exit operations and
+	 * writing to cgroup.procs by taking a per threadgroup rwsem instead of
+	 * the global cgroup_threadgroup_rwsem. Fork and other operations
+	 * from threads in different thread groups no longer contend with
+	 * writing to cgroup.procs.
+	 *
	 * The static usage pattern of creating a cgroup, enabling controllers,
	 * and then seeding it with CLONE_INTO_CGROUP doesn't require write
	 * locking cgroup_threadgroup_rwsem and thus doesn't benefit from
@@ -822,6 +828,7 @@ struct cgroup_subsys {
 };

 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
+extern bool cgroup_enable_per_threadgroup_rwsem;

 struct cgroup_of_peak {
	unsigned long value;
@@ -833,11 +840,14 @@ struct cgroup_of_peak {
  * @tsk: target task
  *
  * Allows cgroup operations to synchronize against threadgroup changes
- * using a percpu_rw_semaphore.
+ * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
+ * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
  */
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (cgroup_enable_per_threadgroup_rwsem)
+		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
 }

 /**
@@ -848,6 +858,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (cgroup_enable_per_threadgroup_rwsem)
+		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
	percpu_up_read(&cgroup_threadgroup_rwsem);
 }

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..7d6449982822 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
	struct tty_audit_buf *tty_audit_buf;
 #endif

+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore cgroup_threadgroup_rwsem;
+#endif
+
	/*
	 * Thread is the potential origin of an oom condition; kill first on
	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..a55e2189206f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
	},
	.multiprocess	= HLIST_HEAD_INIT,
	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
+#endif
	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..318cc7f22e2c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,

 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
+void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
					     bool *locked)
	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..65e9b454780c 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
	int retval = 0;

	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(true, NULL);
	for_each_root(root) {
		struct cgroup *from_cgrp;

@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
		if (retval)
			break;
	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(true, NULL);
	cgroup_unlock();

	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)

	cgroup_lock();

-	cgroup_attach_lock(true);
+	cgroup_attach_lock(true, NULL);

	/* all tasks in @from are being moved, all csets are source */
	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
	} while (task && !ret);
 out_err:
	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(true, NULL);
	cgroup_unlock();
	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..b53ae8fd9681 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -216,6 +216,14 @@ static u16 have_canfork_callback __read_mostly;

 static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);

+/*
+ * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem,
+ * read protected by either.
+ *
+ * Can only be turned on, but not turned off.
+ */
+bool cgroup_enable_per_threadgroup_rwsem __read_mostly;
+
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
	.ns.count	= REFCOUNT_INIT(2),
@@ -1302,14 +1310,24 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
 {
	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;

-	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	/*
+	 * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
+	 * favordynmods can flip while task is between
+	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
+	 * so down_write global cgroup_threadgroup_rwsem to synchronize them.
+	 */
+	percpu_down_write(&cgroup_threadgroup_rwsem);
	if (favor && !favoring) {
+		cgroup_enable_per_threadgroup_rwsem = true;
		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
	} else if (!favor && favoring) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			WARN_ONCE(1, "cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n");
		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
	}
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 }

 static int cgroup_init_root_id(struct cgroup_root *root)
@@ -2459,7 +2477,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);

 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_threadgroup: whether to down_write rwsem
+ * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2479,22 +2498,37 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * Resolve the situation by always acquiring cpus_read_lock() before optionally
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
+ *
+ * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead
+ * on dynamic cgroup modifications. see the comment above
+ * CGRP_ROOT_FAVOR_DYNMODS definition.
+ *
+ * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk)
 {
	cpus_read_lock();
-	if (lock_threadgroup)
-		percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && cgroup_enable_per_threadgroup_rwsem)
+			down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		else
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+	}
 }

 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_threadgroup: whether to up_write rwsem
+ * @tsk: thread group to lock
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk)
 {
-	if (lock_threadgroup)
-		percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && cgroup_enable_per_threadgroup_rwsem)
+			up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		else
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+	}
	cpus_read_unlock();
 }

@@ -2976,24 +3010,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
		return ERR_PTR(-EINVAL);

-	/*
-	 * If we migrate a single thread, we don't care about threadgroup
-	 * stability. If the thread is `current`, it won't exit(2) under our
-	 * hands or change PID through exec(2). We exclude
-	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
-	 * callers by cgroup_mutex.
-	 * Therefore, we can skip the global lock.
-	 */
-	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
-
	rcu_read_lock();
	if (pid) {
		tsk = find_task_by_vpid(pid);
		if (!tsk) {
			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
+			goto out_unlock_rcu;
		}
	} else {
		tsk = current;
@@ -3010,15 +3032,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
	 */
	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
		tsk = ERR_PTR(-EINVAL);
-		goto out_unlock_threadgroup;
+		goto out_unlock_rcu;
	}
-	get_task_struct(tsk);
-	goto out_unlock_rcu;

-out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
-	*threadgroup_locked = false;
+	get_task_struct(tsk);
+	rcu_read_unlock();
+
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+	*threadgroup_locked = pid || threadgroup;
+
+	cgroup_attach_lock(*threadgroup_locked, tsk);
+
+	return tsk;
+
 out_unlock_rcu:
	rcu_read_unlock();
	return tsk;
@@ -3032,7 +3066,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked)
	/* release reference from cgroup_procs_write_start() */
	put_task_struct(task);

-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(threadgroup_locked, task);

	for_each_subsys(ss, ssid)
		if (ss->post_attach)
@@ -3119,7 +3153,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
	 * write-locking can be skipped safely.
	 */
	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+	cgroup_attach_lock(has_tasks, NULL);

	/* NULL dst indicates self on default hierarchy */
	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3174,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
		ret = cgroup_migrate_execute(&mgctx);
 out_finish:
	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(has_tasks, NULL);
	return ret;
 }

diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..ae7826815c2c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
	tty_audit_fork(sig);
	sched_autogroup_fork(sig);

+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->cgroup_threadgroup_rwsem);
+#endif
+
	sig->oom_score_adj = current->signal->oom_score_adj;
	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

-- 
2.32.0.3.g01195cf9f

From nobody Thu Oct 2 23:57:57 2025
From: Yi Tao <escape@linux.alibaba.com>
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed
Date: Tue, 9 Sep 2025 15:55:29 +0800
Message-Id: <42b7b9b12a65e7b1a8e58ce055240ad63f93bf6b.1757403652.git.escape@linux.alibaba.com>

Between obtaining the
threadgroup leader via PID and acquiring the cgroup attach lock, the
threadgroup leader may change, which could lead to incorrect cgroup
migration. Therefore, after acquiring the cgroup attach lock, check
whether the threadgroup leader has changed, and if so, retry the
operation.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 kernel/cgroup/cgroup.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b53ae8fd9681..6e90b79e8fa3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3010,6 +3010,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
		return ERR_PTR(-EINVAL);

+retry_find_task:
	rcu_read_lock();
	if (pid) {
		tsk = find_task_by_vpid(pid);
@@ -3051,6 +3052,21 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,

	cgroup_attach_lock(*threadgroup_locked, tsk);

+	if (threadgroup) {
+		if (!thread_group_leader(tsk)) {
+			/*
+			 * a race with de_thread from another thread's exec()
+			 * may strip us of our leadership, if this happens,
+			 * there is no choice but to throw this task away and
+			 * try again; this is
+			 * "double-double-toil-and-trouble-check locking".
+			 */
+			cgroup_attach_unlock(*threadgroup_locked, tsk);
+			put_task_struct(tsk);
+			goto retry_find_task;
+		}
+	}
+
	return tsk;

 out_unlock_rcu:
-- 
2.32.0.3.g01195cf9f

From nobody Thu Oct 2 23:57:57 2025
From: Yi Tao <escape@linux.alibaba.com>
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer
Date: Tue, 9 Sep 2025 15:55:30 +0800

Dynamic cgroup migration involving threadgroup locks can be in one of
three states: no lock held, holding the global lock, or holding a per
threadgroup lock. Explicitly declare the different lock modes to make
the code easier to understand.
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     | 11 ++++
 kernel/cgroup/cgroup-internal.h |  8 +--
 kernel/cgroup/cgroup-v1.c       | 14 ++---
 kernel/cgroup/cgroup.c          | 91 ++++++++++++++++++++++-----------
 4 files changed, 83 insertions(+), 41 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 552050bd2c02..d764eba041c6 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -146,6 +146,17 @@ enum {
	__CFTYPE_ADDED = (1 << 18),
 };

+enum {
+	/* Default */
+	CGRP_ATTACH_LOCK_GLOBAL,
+
+	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
+	CGRP_ATTACH_LOCK_NONE,
+
+	/* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */
+	CGRP_ATTACH_LOCK_PER_THREADGROUP,
+};
+
 /*
  * cgroup_file is the handle for a file instance created in a cgroup which
  * is used, for example, to generate file changed notifications. This can
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 318cc7f22e2c..9de8ab47a335 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,12 +249,12 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,

 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
-void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
+void cgroup_attach_lock(int lock_mode, struct task_struct *tsk);
+void cgroup_attach_unlock(int lock_mode, struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *locked)
+					     int *lock_mode)
	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(struct task_struct *task, bool locked)
+void cgroup_procs_write_finish(struct task_struct *task, int lock_mode)
	__releases(&cgroup_threadgroup_rwsem);

 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 65e9b454780c..73d8dafe219f 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
	int retval = 0;

	cgroup_lock();
-	cgroup_attach_lock(true, NULL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
	for_each_root(root) {
		struct cgroup *from_cgrp;

@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
		if (retval)
			break;
	}
-	cgroup_attach_unlock(true, NULL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
	cgroup_unlock();

	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)

	cgroup_lock();

-	cgroup_attach_lock(true, NULL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);

	/* all tasks in @from are being moved, all csets are source */
	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
	} while (task && !ret);
 out_err:
	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true, NULL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
	cgroup_unlock();
	return ret;
 }
@@ -502,13 +502,13 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
	struct task_struct *task;
	const struct cred *cred, *tcred;
	ssize_t ret;
-	bool locked;
+	int lock_mode;

	cgrp = cgroup_kn_lock_live(of->kn, false);
	if (!cgrp)
		return -ENODEV;

-	task = cgroup_procs_write_start(buf, threadgroup, &locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
	ret = PTR_ERR_OR_ZERO(task);
	if (ret)
		goto out_unlock;
@@ -531,7 +531,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
		ret = cgroup_attach_task(cgrp, task, threadgroup);

 out_finish:
-	cgroup_procs_write_finish(task, locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
	cgroup_kn_unlock(of->kn);

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6e90b79e8fa3..a568629f7ed0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2477,7 +2477,7 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);

 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write rwsem
+ * @lock_mode: whether to acquire a rwsem, and which one
  * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
@@ -2499,35 +2499,46 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  *
- * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead
- * on dynamic cgroup modifications. see the comment above
- * CGRP_ROOT_FAVOR_DYNMODS definition.
- *
  * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk)
+void cgroup_attach_lock(int lock_mode, struct task_struct *tsk)
 {
	cpus_read_lock();
-	if (lock_threadgroup) {
-		if (tsk && cgroup_enable_per_threadgroup_rwsem)
-			down_write(&tsk->signal->cgroup_threadgroup_rwsem);
-		else
-			percpu_down_write(&cgroup_threadgroup_rwsem);
+
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
+		percpu_down_write(&cgroup_threadgroup_rwsem);
+		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.");
+		break;
	}
 }

 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write rwsem
+ * @lock_mode: whether to release a rwsem, and which one
  * @tsk: thread group to lock
  */
-void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk)
-{
-	if (lock_threadgroup) {
-		if (tsk && cgroup_enable_per_threadgroup_rwsem)
-			up_write(&tsk->signal->cgroup_threadgroup_rwsem);
-		else
-			percpu_up_write(&cgroup_threadgroup_rwsem);
+void cgroup_attach_unlock(int lock_mode, struct task_struct *tsk)
+{
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
+		percpu_up_write(&cgroup_threadgroup_rwsem);
+		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.");
+		break;
	}
	cpus_read_unlock();
 }
@@ -3002,7 +3013,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 }

 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *threadgroup_locked)
+					     int *lock_mode)
 {
	struct task_struct *tsk;
	pid_t pid;
@@ -3046,11 +3057,24 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
	 * callers by cgroup_mutex.
	 * Therefore, we can skip the global lock.
+	 *
+	 * When favordynmods is enabled, take per threadgroup rwsem to reduce
+	 * overhead on dynamic cgroup modifications. see the comment above
+	 * CGRP_ROOT_FAVOR_DYNMODS definition.
	 */
	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;

-	cgroup_attach_lock(*threadgroup_locked, tsk);
+	if (pid || threadgroup) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			*lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP;
+		else
+			*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	} else {
+		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+	}
+
+	cgroup_attach_lock(*lock_mode, tsk);

	if (threadgroup) {
		if (!thread_group_leader(tsk)) {
@@ -3061,7 +3085,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
			 * try again; this is
			 * "double-double-toil-and-trouble-check locking".
			 */
-			cgroup_attach_unlock(*threadgroup_locked, tsk);
+			cgroup_attach_unlock(*lock_mode, tsk);
			put_task_struct(tsk);
			goto retry_find_task;
		}
@@ -3074,7 +3098,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
	return tsk;
 }

-void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked)
+void cgroup_procs_write_finish(struct task_struct *task, int lock_mode)
 {
	struct cgroup_subsys *ss;
	int ssid;
@@ -3082,7 +3106,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked)
	/* release reference from cgroup_procs_write_start() */
	put_task_struct(task);

-	cgroup_attach_unlock(threadgroup_locked, task);
+	cgroup_attach_unlock(lock_mode, task);

	for_each_subsys(ss, ssid)
		if (ss->post_attach)
@@ -3139,6 +3163,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
	struct cgroup *dsct;
	struct css_set *src_cset;
	bool has_tasks;
+	int lock_mode;
	int ret;

	lockdep_assert_held(&cgroup_mutex);
@@ -3169,7 +3194,13 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
	 * write-locking can be skipped safely.
	 */
	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks, NULL);
+
+	if (has_tasks)
+		lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(lock_mode, NULL);

	/* NULL dst indicates self on default hierarchy */
	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3190,7 +3221,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
		ret = cgroup_migrate_execute(&mgctx);
 out_finish:
	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks, NULL);
+	cgroup_attach_unlock(lock_mode, NULL);
	return ret;
 }

@@ -5291,13 +5322,13 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
	struct task_struct *task;
	const struct cred *saved_cred;
	ssize_t ret;
-	bool threadgroup_locked;
+	int lock_mode;

	dst_cgrp = cgroup_kn_lock_live(of->kn, false);
	if (!dst_cgrp)
		return -ENODEV;

-	task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
	ret = PTR_ERR_OR_ZERO(task);
	if (ret)
		goto out_unlock;
@@ -5323,7 +5354,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
		ret = cgroup_attach_task(dst_cgrp, task, threadgroup);

 out_finish:
-	cgroup_procs_write_finish(task, threadgroup_locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
	cgroup_kn_unlock(of->kn);

-- 
2.32.0.3.g01195cf9f
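For readers wanting to exercise the series: the per threadgroup path only engages when favordynmods is on, which can be set at mount time or via remount of the cgroup2 filesystem. A sketch, assuming a standard cgroup2 mount at /sys/fs/cgroup and an illustrative cgroup name:

```sh
# Turn on favordynmods (also accepted as a mount-time option).
mount -o remount,favordynmods none /sys/fs/cgroup

# Migrations now write-lock only the target thread group's rwsem.
mkdir /sys/fs/cgroup/demo
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
```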