From: Mateusz Guzik
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com
Cc: brauner@kernel.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Mateusz Guzik
Subject: [PATCH v2] cgroup: avoid css_set_lock in cgroup_css_set_fork()
Date: Thu, 22 Jan 2026 12:29:51 +0100
Message-ID: <20260122112951.1854124-1-mjguzik@gmail.com>
X-Mailer: git-send-email 2.43.0

In the stock kernel the css_set_lock is taken three times over a thread's
life cycle, making it the primary bottleneck in fork-heavy workloads. The
acquisition in preparation for clone can be avoided with a sequence
counter, which in turn pushes the number of acquisitions down to two. This
accounts for only a 6% speedup when creating threads in parallel on 20
cores, as most of the contention shifts to pidmap_lock.

Signed-off-by: Mateusz Guzik
---
v2:
- change comment about clone_seq
- raw_write_seqcount* -> write_seqcount
- just loop on failed seq check
- don't bump it on task exit

 kernel/cgroup/cgroup-internal.h | 11 +++++--
 kernel/cgroup/cgroup.c          | 54 +++++++++++++++++++++++++--------
 2 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 22051b4f1ccb..04a3aadcbc7f 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -194,6 +194,9 @@ static inline bool notify_on_release(const struct cgroup *cgrp)
 	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+/*
+ * refcounted get/put for css_set objects
+ */
 void put_css_set_locked(struct css_set *cset);
 
 static inline void put_css_set(struct css_set *cset)
@@ -213,14 +216,16 @@ static inline void put_css_set(struct css_set *cset)
 	spin_unlock_irqrestore(&css_set_lock, flags);
 }
 
-/*
- * refcounted get/put for css_set objects
- */
 static inline void get_css_set(struct css_set *cset)
 {
 	refcount_inc(&cset->refcount);
 }
 
+static inline bool get_css_set_not_zero(struct css_set *cset)
+{
+	return refcount_inc_not_zero(&cset->refcount);
+}
+
 bool cgroup_ssid_enabled(int ssid);
 bool cgroup_on_dfl(const struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 94788bd1fdf0..0053582b9b56 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -87,7 +87,14 @@
  * cgroup.h can use them for lockdep annotations.
  */
 DEFINE_MUTEX(cgroup_mutex);
-DEFINE_SPINLOCK(css_set_lock);
+__cacheline_aligned DEFINE_SPINLOCK(css_set_lock);
+
+/*
+ * css_set_for_clone_seq synchronizes access to task_struct::cgroup
+ * and cgroup::kill_seq used on clone path
+ */
+static __cacheline_aligned seqcount_spinlock_t css_set_for_clone_seq =
+	SEQCNT_SPINLOCK_ZERO(css_set_for_clone_seq, &css_set_lock);
 
 #if (defined CONFIG_PROVE_RCU || defined CONFIG_LOCKDEP)
 EXPORT_SYMBOL_GPL(cgroup_mutex);
@@ -907,6 +914,7 @@ static void css_set_skip_task_iters(struct css_set *cset,
  * @from_cset: css_set @task currently belongs to (may be NULL)
  * @to_cset: new css_set @task is being moved to (may be NULL)
  * @use_mg_tasks: move to @to_cset->mg_tasks instead of ->tasks
+ * @skip_clone_seq: don't bump css_set_for_clone_seq
  *
  * Move @task from @from_cset to @to_cset. If @task didn't belong to any
  * css_set, @from_cset can be NULL. If @task is being disassociated
@@ -918,13 +926,16 @@ static void css_set_skip_task_iters(struct css_set *cset,
  */
 static void css_set_move_task(struct task_struct *task,
 			      struct css_set *from_cset, struct css_set *to_cset,
-			      bool use_mg_tasks)
+			      bool use_mg_tasks, bool skip_clone_seq)
 {
 	lockdep_assert_held(&css_set_lock);
 
 	if (to_cset && !css_set_populated(to_cset))
 		css_set_update_populated(to_cset, true);
 
+	if (!skip_clone_seq)
+		write_seqcount_begin(&css_set_for_clone_seq);
+
 	if (from_cset) {
 		WARN_ON_ONCE(list_empty(&task->cg_list));
 
@@ -949,6 +960,9 @@ static void css_set_move_task(struct task_struct *task,
 		list_add_tail(&task->cg_list, use_mg_tasks ?
 			      &to_cset->mg_tasks : &to_cset->tasks);
 	}
+
+	if (!skip_clone_seq)
+		write_seqcount_end(&css_set_for_clone_seq);
 }
 
 /*
@@ -2723,7 +2737,7 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
 
 		get_css_set(to_cset);
 		to_cset->nr_tasks++;
-		css_set_move_task(task, from_cset, to_cset, true);
+		css_set_move_task(task, from_cset, to_cset, true, false);
 		from_cset->nr_tasks--;
 		/*
 		 * If the source or destination cgroup is frozen,
@@ -4183,7 +4197,9 @@ static void __cgroup_kill(struct cgroup *cgrp)
 	lockdep_assert_held(&cgroup_mutex);
 
 	spin_lock_irq(&css_set_lock);
+	write_seqcount_begin(&css_set_for_clone_seq);
 	cgrp->kill_seq++;
+	write_seqcount_end(&css_set_for_clone_seq);
 	spin_unlock_irq(&css_set_lock);
 
 	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
@@ -6696,14 +6712,26 @@ static int cgroup_css_set_fork(struct kernel_clone_args *kargs)
 
 	cgroup_threadgroup_change_begin(current);
 
-	spin_lock_irq(&css_set_lock);
-	cset = task_css_set(current);
-	get_css_set(cset);
-	if (kargs->cgrp)
-		kargs->kill_seq = kargs->cgrp->kill_seq;
-	else
-		kargs->kill_seq = cset->dfl_cgrp->kill_seq;
-	spin_unlock_irq(&css_set_lock);
+	for (;;) {
+		unsigned seq = raw_read_seqcount_begin(&css_set_for_clone_seq);
+		bool got_ref = false;
+		rcu_read_lock();
+		cset = task_css_set(current);
+		if (kargs->cgrp)
+			kargs->kill_seq = kargs->cgrp->kill_seq;
+		else
+			kargs->kill_seq = cset->dfl_cgrp->kill_seq;
+		if (get_css_set_not_zero(cset))
+			got_ref = true;
+		rcu_read_unlock();
+		if (unlikely(!got_ref || read_seqcount_retry(&css_set_for_clone_seq, seq))) {
+			if (got_ref)
+				put_css_set(cset);
+			cpu_relax();
+			continue;
+		}
+		break;
+	}
 
 	if (!(kargs->flags & CLONE_INTO_CGROUP)) {
 		kargs->cset = cset;
@@ -6907,7 +6935,7 @@ void cgroup_post_fork(struct task_struct *child,
 
 		WARN_ON_ONCE(!list_empty(&child->cg_list));
 		cset->nr_tasks++;
-		css_set_move_task(child, NULL, cset, false);
+		css_set_move_task(child, NULL, cset, false, true);
 	} else {
 		put_css_set(cset);
 		cset = NULL;
@@ -6995,7 +7023,7 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
 
 	WARN_ON_ONCE(list_empty(&tsk->cg_list));
 	cset = task_css_set(tsk);
-	css_set_move_task(tsk, cset, NULL, false);
+	css_set_move_task(tsk, cset, NULL, false, true);
 	cset->nr_tasks--;
 	/* matches the signal->live check in css_task_iter_advance() */
 	if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
-- 
2.48.1
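
[Editor's note, not part of the patch: a minimal sketch of the generic
seqcount_spinlock_t read/retry scheme the fork path now builds on. The
names example_lock, example_seq and shared_value are made up for
illustration; writers publish under the spinlock backing the seqcount,
readers take a lockless snapshot and retry if a writer raced with them.]

#include <linux/seqlock.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);
static seqcount_spinlock_t example_seq =
	SEQCNT_SPINLOCK_ZERO(example_seq, &example_lock);
static int shared_value;

/* Writer: update under the lock, bracketed by the seqcount. */
static void example_update(int v)
{
	spin_lock(&example_lock);
	write_seqcount_begin(&example_seq);
	shared_value = v;
	write_seqcount_end(&example_seq);
	spin_unlock(&example_lock);
}

/* Reader: lockless snapshot, retried if a writer raced with us. */
static int example_read(void)
{
	unsigned int seq;
	int v;

	do {
		seq = read_seqcount_begin(&example_seq);
		v = shared_value;
	} while (read_seqcount_retry(&example_seq, seq));

	return v;
}

[The read side in cgroup_css_set_fork() additionally enters an RCU
read-side critical section, takes a conditional reference on the css_set
via get_css_set_not_zero(), and calls cpu_relax() before retrying when
either the reference could not be obtained or the sequence changed.]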