From: Mateusz Guzik
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: brauner@kernel.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Mateusz Guzik
Subject: [PATCH] cgroup: avoid css_set_lock in cgroup_css_set_fork()
Date: Tue, 20 Jan 2026 18:08:59 +0100
Message-ID: <20260120170859.1467868-1-mjguzik@gmail.com>

In the stock kernel the css_set_lock is taken three times during the
thread life cycle, turning it into the primary bottleneck in fork-heavy
workloads.

The acquire in preparation for clone can be avoided with a sequence
counter, which in turn pushes the lock down.

This accounts for only a 6% speedup when creating threads in parallel on
20 cores (from about 740k ops/s to 790k ops/s), as most of the
contention shifts to pidmap_lock.

Signed-off-by: Mateusz Guzik
---

I don't really care for cgroups, I merely need the thing out of the way
for fork. If someone wants to handle this differently, I'm not going to
argue as long as the bottleneck is taken care of.

On the stock kernel pidmap_lock is still the biggest problem, but there
is a patch to fix it:
https://lore.kernel.org/linux-fsdevel/CAGudoHFuhbkJ+8iA92LYPmphBboJB7sxxC2L7A8OtBXA22UXzA@mail.gmail.com/T/#m832ac70f5e8f5ea14e69ca78459578d687efdd9f

.. afterwards it is cgroups and the commit message was written
pretending it already landed.
with the patch below contention is back on pidmap_lock

 kernel/cgroup/cgroup-internal.h | 11 ++++--
 kernel/cgroup/cgroup.c          | 60 ++++++++++++++++++++++++++-------
 2 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 22051b4f1ccb..04a3aadcbc7f 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -194,6 +194,9 @@ static inline bool notify_on_release(const struct cgroup *cgrp)
 	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+/*
+ * refcounted get/put for css_set objects
+ */
 void put_css_set_locked(struct css_set *cset);
 
 static inline void put_css_set(struct css_set *cset)
@@ -213,14 +216,16 @@ static inline void put_css_set(struct css_set *cset)
 	spin_unlock_irqrestore(&css_set_lock, flags);
 }
 
-/*
- * refcounted get/put for css_set objects
- */
 static inline void get_css_set(struct css_set *cset)
 {
 	refcount_inc(&cset->refcount);
 }
 
+static inline bool get_css_set_not_zero(struct css_set *cset)
+{
+	return refcount_inc_not_zero(&cset->refcount);
+}
+
 bool cgroup_ssid_enabled(int ssid);
 bool cgroup_on_dfl(const struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 94788bd1fdf0..16d2a8d204e8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -87,7 +87,12 @@
  * cgroup.h can use them for lockdep annotations.
  */
 DEFINE_MUTEX(cgroup_mutex);
-DEFINE_SPINLOCK(css_set_lock);
+__cacheline_aligned DEFINE_SPINLOCK(css_set_lock);
+/*
+ * css_set_for_clone_seq is used to allow lockless operation in cgroup_css_set_fork()
+ */
+static __cacheline_aligned seqcount_spinlock_t css_set_for_clone_seq =
+	SEQCNT_SPINLOCK_ZERO(css_set_for_clone_seq, &css_set_lock);
 
 #if (defined CONFIG_PROVE_RCU || defined CONFIG_LOCKDEP)
 EXPORT_SYMBOL_GPL(cgroup_mutex);
@@ -907,6 +912,7 @@ static void css_set_skip_task_iters(struct css_set *cset,
  * @from_cset: css_set @task currently belongs to (may be NULL)
  * @to_cset: new css_set @task is being moved to (may be NULL)
  * @use_mg_tasks: move to @to_cset->mg_tasks instead of ->tasks
+ * @is_clone: whether @task is in the middle of a clone
  *
  * Move @task from @from_cset to @to_cset. If @task didn't belong to any
  * css_set, @from_cset can be NULL. If @task is being disassociated
@@ -918,13 +924,16 @@ static void css_set_skip_task_iters(struct css_set *cset,
  */
 static void css_set_move_task(struct task_struct *task,
 			      struct css_set *from_cset, struct css_set *to_cset,
-			      bool use_mg_tasks)
+			      bool use_mg_tasks, bool is_clone)
 {
 	lockdep_assert_held(&css_set_lock);
 
 	if (to_cset && !css_set_populated(to_cset))
 		css_set_update_populated(to_cset, true);
 
+	if (!is_clone)
+		raw_write_seqcount_begin(&css_set_for_clone_seq);
+
 	if (from_cset) {
 		WARN_ON_ONCE(list_empty(&task->cg_list));
 
@@ -949,6 +958,9 @@ static void css_set_move_task(struct task_struct *task,
 		list_add_tail(&task->cg_list, use_mg_tasks ?
 					 &to_cset->mg_tasks : &to_cset->tasks);
 	}
+
+	if (!is_clone)
+		raw_write_seqcount_end(&css_set_for_clone_seq);
 }
 
 /*
@@ -2723,7 +2735,7 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
 
 			get_css_set(to_cset);
 			to_cset->nr_tasks++;
-			css_set_move_task(task, from_cset, to_cset, true);
+			css_set_move_task(task, from_cset, to_cset, true, false);
 			from_cset->nr_tasks--;
 			/*
 			 * If the source or destination cgroup is frozen,
@@ -4183,7 +4195,9 @@ static void __cgroup_kill(struct cgroup *cgrp)
 	lockdep_assert_held(&cgroup_mutex);
 
 	spin_lock_irq(&css_set_lock);
+	raw_write_seqcount_begin(&css_set_for_clone_seq);
 	cgrp->kill_seq++;
+	raw_write_seqcount_end(&css_set_for_clone_seq);
 	spin_unlock_irq(&css_set_lock);
 
 	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
@@ -6690,20 +6704,40 @@ static int cgroup_css_set_fork(struct kernel_clone_args *kargs)
 	struct cgroup *dst_cgrp = NULL;
 	struct css_set *cset;
 	struct super_block *sb;
+	bool need_lock;
 
 	if (kargs->flags & CLONE_INTO_CGROUP)
 		cgroup_lock();
 
 	cgroup_threadgroup_change_begin(current);
 
-	spin_lock_irq(&css_set_lock);
-	cset = task_css_set(current);
-	get_css_set(cset);
-	if (kargs->cgrp)
-		kargs->kill_seq = kargs->cgrp->kill_seq;
-	else
-		kargs->kill_seq = cset->dfl_cgrp->kill_seq;
-	spin_unlock_irq(&css_set_lock);
+	need_lock = true;
+	scoped_guard(rcu) {
+		unsigned seq = raw_read_seqcount_begin(&css_set_for_clone_seq);
+		cset = task_css_set(current);
+		if (unlikely(!cset || !get_css_set_not_zero(cset)))
+			break;
+		if (kargs->cgrp)
+			kargs->kill_seq = kargs->cgrp->kill_seq;
+		else
+			kargs->kill_seq = cset->dfl_cgrp->kill_seq;
+		if (read_seqcount_retry(&css_set_for_clone_seq, seq)) {
+			put_css_set(cset);
+			break;
+		}
+		need_lock = false;
+	}
+
+	if (unlikely(need_lock)) {
+		spin_lock_irq(&css_set_lock);
+		cset = task_css_set(current);
+		get_css_set(cset);
+		if (kargs->cgrp)
+			kargs->kill_seq = kargs->cgrp->kill_seq;
+		else
+			kargs->kill_seq = cset->dfl_cgrp->kill_seq;
+		spin_unlock_irq(&css_set_lock);
+	}
 
 	if (!(kargs->flags & CLONE_INTO_CGROUP)) {
 		kargs->cset = cset;
@@ -6907,7 +6941,7 @@ void cgroup_post_fork(struct task_struct *child,
 
 		WARN_ON_ONCE(!list_empty(&child->cg_list));
 		cset->nr_tasks++;
-		css_set_move_task(child, NULL, cset, false);
+		css_set_move_task(child, NULL, cset, false, true);
 	} else {
 		put_css_set(cset);
 		cset = NULL;
@@ -6995,7 +7029,7 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
 
 	WARN_ON_ONCE(list_empty(&tsk->cg_list));
 	cset = task_css_set(tsk);
-	css_set_move_task(tsk, cset, NULL, false);
+	css_set_move_task(tsk, cset, NULL, false, false);
 	cset->nr_tasks--;
 	/* matches the signal->live check in css_task_iter_advance() */
 	if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
-- 
2.48.1
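
For readers who want the shape of the fast path in cgroup_css_set_fork()
without the kernel context, here is a rough userspace sketch of the same
idea: try a lockless read validated by a sequence counter, take a
reference only if the refcount is not already zero, and fall back to the
lock when either check fails. Everything in it is an assumption made for
illustration: struct cset, big_lock, seq, current_cset, get_not_zero(),
writer_bump_kill_seq() and snapshot() are hypothetical names built on C11
atomics, not the kernel's seqcount or refcount API, and the memory
ordering is the conservative sequentially consistent default rather than
what the kernel helpers use. It also takes the slow path as soon as it
sees an odd counter, which the kernel seqcount helpers may handle
differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel API. */
struct cset {
	atomic_uint refcount;	/* 0 means the object is being torn down */
	atomic_uint kill_seq;	/* value the reader wants a stable copy of */
};

static atomic_flag big_lock = ATOMIC_FLAG_INIT;	/* plays the role of css_set_lock */
static atomic_uint seq;				/* even: stable, odd: writer active */
static _Atomic(struct cset *) current_cset;	/* what the reader looks up */

static void lock(void)
{
	while (atomic_flag_test_and_set_explicit(&big_lock, memory_order_acquire))
		;	/* spin until the flag is clear */
}

static void unlock(void)
{
	atomic_flag_clear_explicit(&big_lock, memory_order_release);
}

/* Take a reference only if the count has not already dropped to zero. */
static bool get_not_zero(struct cset *c)
{
	unsigned old = atomic_load(&c->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Writer: the change is bracketed by two counter increments under the lock. */
static void writer_bump_kill_seq(struct cset *c)
{
	lock();
	atomic_fetch_add(&seq, 1);	/* counter is now odd */
	atomic_fetch_add(&c->kill_seq, 1);
	atomic_fetch_add(&seq, 1);	/* counter is even again */
	unlock();
}

/* Returns true if the lockless path succeeded, false if the lock was needed. */
static bool snapshot(struct cset **out, unsigned *kill_seq)
{
	unsigned start = atomic_load(&seq);
	struct cset *c = atomic_load(&current_cset);

	if (!(start & 1) && c && get_not_zero(c)) {
		*kill_seq = atomic_load(&c->kill_seq);
		if (atomic_load(&seq) == start) {
			*out = c;
			return true;
		}
		atomic_fetch_sub(&c->refcount, 1);	/* raced with a writer */
	}

	/* Slow path: same work, serialized against writers by the lock. */
	lock();
	c = atomic_load(&current_cset);
	assert(c != NULL);	/* assumption: always set once initialised */
	atomic_fetch_add(&c->refcount, 1);
	*kill_seq = atomic_load(&c->kill_seq);
	unlock();
	*out = c;
	return false;
}
```

In this sketch the reader never writes to the lock word on the common
path, which is the property the patch is after: uncontended readers stop
bouncing the lock's cache line, and only a concurrent writer or a dying
object sends them to the locked path.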