From nobody Thu Apr 2 19:00:21 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 53C2235CB88 for ; Thu, 12 Feb 2026 16:48:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914891; cv=none; b=dp8cAF8okjx7Whs6CKCxeAEAy5x/qB87X8Eqrj8xvAp766lBQk/cuweA7Vd4Yb/djjQLs3R4EkJO5jKcM6gBHlkdMALFSdS18V59WhN8SOrXSR4HIVuuByAPAR82GhrGy4kBJFWXacoJKgUR5zNM79c68StGwkww2vkIdnXF2dc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914891; c=relaxed/simple; bh=sEIIJSYnZpvauvSASLQ3csoM6BerjWwcrHU8Z7IAfiY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Rf/7XHcV7Q3l1sux+ZXaOFqXgGXmwfKDDppE5j8zDCNLGpqGnupNi1d60ajWdThdRBrwrTJeuxFbbqo5XMnbi5LR++RTCwXIuOLHkeEARfmvzrkvA3e7BFFB0pnQnD9K+XMQR8Hu+WZdBxJb0SqsJp14nFvjCGfHiTYbh9wUN78= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=h7NGNN1Q; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="h7NGNN1Q" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770914889; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=IV4KOsWbu66LHPnMQVrH+zBgc3B06m6KKDZounUUUzI=; b=h7NGNN1QT+qDnWQfpfTLfJLwfBWrFXf4f8MnCi3PxMWXTpKB/lf80e31cTJbpl4TlUcLk9 gn/GmLknOqOmsSIzjYIv3jEaWLe1MjB8ro2lviikQ179oVe3D8FhNyPd7tzs07FmYiX2iG d+LXDyVekGXlJBTCmRPifyrEyb4S7UE= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-580-wDebt_02OWuFesm9gfJa-Q-1; Thu, 12 Feb 2026 11:48:06 -0500 X-MC-Unique: wDebt_02OWuFesm9gfJa-Q-1 X-Mimecast-MFC-AGG-ID: wDebt_02OWuFesm9gfJa-Q_1770914883 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id EEB8E1800259; Thu, 12 Feb 2026 16:48:02 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.22.80.194]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 7FFE11800464; Thu, 12 Feb 2026 16:47:59 +0000 (UTC) From: Waiman Long To: Chen Ridong , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Anna-Maria Behnsen , Frederic Weisbecker , Thomas Gleixner , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Waiman Long Subject: [PATCH v5 1/6] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Date: Thu, 12 Feb 2026 11:46:35 -0500 Message-ID: <20260212164640.2408295-2-longman@redhat.com> In-Reply-To: <20260212164640.2408295-1-longman@redhat.com> 
References: <20260212164640.2408295-1-longman@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

The effective_xcpus of a cpuset can contain offline CPUs. In
partition_xcpus_del(), the xcpus parameter is incorrectly used as
a temporary cpumask to mask out offline CPUs. As xcpus can be the
effective_xcpus of a cpuset, this can result in unexpected changes
in that cpumask. Fix this problem by not making any changes to the
xcpus parameter.

Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
Signed-off-by: Waiman Long
Reviewed-by: Chen Ridong
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c43efef7df71..a366ef84f982 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1221,8 +1221,8 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 		isolated_cpus_update(old_prs, parent->partition_root_state,
 				     xcpus);
 
-	cpumask_and(xcpus, xcpus, cpu_active_mask);
 	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
+	cpumask_and(parent->effective_cpus, parent->effective_cpus, cpu_active_mask);
 }
 
 /*
-- 
2.52.0

From nobody Thu Apr 2 19:00:21 2026
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 71F9C35DD0F for ; Thu, 12 Feb 2026 16:48:13 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914894; cv=none;
b=dj4JeKl2Hv3bnPFtPsQtwnXHmF8Arch1ZS4vSFz4KNxFGDAhpWf8HDYXWVD9Jji9uTE6und0Alqoy5ycN10LouyF3H3zASyB5BcQbxX/O5IxMbG99KWuOTjgRVOvad8m2Mv4CSWyo2q06GFhh2FggtyYTT7SoWBPDswWi0effTI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914894; c=relaxed/simple; bh=cJNvne/ZLJKp+G6468ZX9HggWJl+I1tTdNbXK0Elgd8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=eeaj8Jq1g3GUCj6Nqybm2tchfZcTW0ob7j/O+CWPmA+aGg7kqqzXneqcWGzjzxibhSHafARVRjF8wqA/tAkCHkbevxirgS1LVRT9MkgyGWKjp32TCkLJxFXScWc/mSb6BhTGdVdDOQt/VxHejr34GJeReQnkr7vo/eHXiP60ats= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=CNDJFCM3; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="CNDJFCM3" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770914892; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=EMXLyoHDIGkd1WS8t/CaMj6jAYKwVeoWZxQZ060QjgI=; b=CNDJFCM3xz7DrjcNG5Z/YUiqYSc0CAZZ9Gxj1/0al+Y+l6jVgIueMPQ8kaTgUWvQ6yXkaP iNiTFOnBNre+9+Fuli9t6tT7Wf6XpMy8m8Db98D1fzoP5E0dHdpax+LgjWeH7kEysKr/E5 nOepVnVVuXZMDzlF9p9tzdc58NuWx9s= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id 
us-mta-662-z40GFMGmOKuntaRGRm0B4g-1; Thu, 12 Feb 2026 11:48:09 -0500
X-MC-Unique: z40GFMGmOKuntaRGRm0B4g-1
X-Mimecast-MFC-AGG-ID: z40GFMGmOKuntaRGRm0B4g_1770914886
Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 7179D1955BE0; Thu, 12 Feb 2026 16:48:06 +0000 (UTC)
Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.22.80.194]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 33AE118003F5; Thu, 12 Feb 2026 16:48:03 +0000 (UTC)
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH v5 2/6] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
Date: Thu, 12 Feb 2026 11:46:36 -0500
Message-ID: <20260212164640.2408295-3-longman@redhat.com>
In-Reply-To: <20260212164640.2408295-1-longman@redhat.com>
References: <20260212164640.2408295-1-longman@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Clarify the locking rules associated with file level internal variables
inside the cpuset code. There is no functional change.

Signed-off-by: Waiman Long
Reviewed-by: Chen Ridong
---
 kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
 1 file changed, 61 insertions(+), 44 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a366ef84f982..e55855269432 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
 	[PERR_REMOTE] = "Have remote partition underneath",
 };
 
+/*
+ * CPUSET Locking Convention
+ * -------------------------
+ *
+ * Below are the three global locks guarding cpuset structures in lock
+ * acquisition order:
+ * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ * - cpuset_mutex
+ * - callback_lock (raw spinlock)
+ *
+ * A task must hold all the three locks to modify externally visible or
+ * used fields of cpusets, though some of the internally used cpuset fields
+ * and internal variables can be modified without holding callback_lock. If only
+ * reliable read access of the externally used fields are needed, a task can
+ * hold either cpuset_mutex or callback_lock which are exposed to other
+ * external subsystems.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets. It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets. Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
+ *
+ * Now, the task_struct fields mems_allowed and mempolicy may be changed
+ * by other task, we use alloc_lock in the task_struct fields to protect
+ * them.
+ *
+ * The cpuset_common_seq_show() handlers only hold callback_lock across
+ * small pieces of code, such as when reading out possibly multi-word
+ * cpumasks and nodemasks.
+ */
+
+static DEFINE_MUTEX(cpuset_mutex);
+
+/*
+ * File level internal variables below follow one of the following exclusion
+ * rules.
+ *
+ * RWCS: Read/write-able by holding either cpus_write_lock (and optionally
+ *       cpuset_mutex) or both cpus_read_lock and cpuset_mutex.
+ *
+ * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
+ *       by holding both cpuset_mutex and callback_lock.
+ */
+
 /*
  * For local partitions, update to subpartitions_cpus & isolated_cpus is done
  * in update_parent_effective_cpumask(). For remote partitions, it is done in
@@ -70,19 +122,18 @@ static const char * const perr_strings[] = {
  * Exclusive CPUs distributed out to local or remote sub-partitions of
  * top_cpuset
  */
-static cpumask_var_t subpartitions_cpus;
+static cpumask_var_t subpartitions_cpus; /* RWCS */
 
 /*
- * Exclusive CPUs in isolated partitions
+ * Exclusive CPUs in isolated partitions (shown in cpuset.cpus.isolated)
  */
-static cpumask_var_t isolated_cpus;
+static cpumask_var_t isolated_cpus; /* CSCB */
 
 /*
- * isolated_cpus updating flag (protected by cpuset_mutex)
- * Set if isolated_cpus is going to be updated in the current
- * cpuset_mutex crtical section.
+ * Set if isolated_cpus is being updated in the current cpuset_mutex
+ * critical section.
  */
-static bool isolated_cpus_updating;
+static bool isolated_cpus_updating; /* RWCS */
 
 /*
  * A flag to force sched domain rebuild at the end of an operation.
@@ -98,7 +149,7 @@ static bool isolated_cpus_updating;
  * Note that update_relax_domain_level() in cpuset-v1.c can still call
  * rebuild_sched_domains_locked() directly without using this flag.
  */
-static bool force_sd_rebuild;
+static bool force_sd_rebuild; /* RWCS */
 
 /*
  * Partition root states:
@@ -218,42 +269,6 @@ struct cpuset top_cpuset = {
 	.partition_root_state = PRS_ROOT,
 };
 
-/*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
- *
- * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets. It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex. While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets. Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
- *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
- *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
- *
- * Now, the task_struct fields mems_allowed and mempolicy may be changed
- * by other task, we use alloc_lock in the task_struct fields to protect
- * them.
- *
- * The cpuset_common_seq_show() handlers only hold callback_lock across
- * small pieces of code, such as when reading out possibly multi-word
- * cpumasks and nodemasks.
- */
-
-static DEFINE_MUTEX(cpuset_mutex);
-
 /**
  * cpuset_lock - Acquire the global cpuset mutex
  *
@@ -1163,6 +1178,8 @@ static void reset_partition_data(struct cpuset *cs)
 static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
 {
 	WARN_ON_ONCE(old_prs == new_prs);
+	lockdep_assert_held(&callback_lock);
+	lockdep_assert_held(&cpuset_mutex);
 	if (new_prs == PRS_ISOLATED)
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
 	else
-- 
2.52.0

From nobody Thu Apr 2 19:00:21 2026
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1531F35DD09 for ; Thu, 12 Feb 2026 16:48:17 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914899; cv=none; b=k5iTAF8k7EFcY8u/WBwv+Dh7uaiYTEBEGepfszXYRu+OdAD6/B74YlA6T05rr/i/xfWgj1rJwOrkiUFqWKQdPbVjQeXg1OaCsKayjSQv93/YEHDCAiXBkWOIiKcS7MU6HcvmIu90KUKwkuXz+gk35W8GBqUjyc5hN/mzD94DC0M=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914899; c=relaxed/simple; bh=fuv/jyg3v7GpHgjIBWv6SQzEBACOf0vyAQ8TxmVeSCQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=quaeJTEdqlVESF1G5el0n8Q+HnFqXR2b9+Araohmml1HYOk1AbAw+HPrf+THWWVv5vV81WmPSi2SMxXIrk6Z7z522ZkRJfgmeKG4Zr+ReCTO45/sbsc4qTyWP8qO0Skuy9D7ZyslCjtnH0+Xt+8hMGbuIMMOkdsnwny0ilB3M4w=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Ng+7m94y; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Ng+7m94y" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770914897; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6qAkghoKFfQsdHzpQw7M41IwcTa05blMdq+6OMoGRvc=; b=Ng+7m94yXYtrUoD+iilaQ/3XkrTVNQNd+FpNcNr5Dox2cqPu1qtZsb0doGC+WV06Lo5545 EijVCO0TPbPr86KlzNr9EXce/rFKQT4UjRaR/NrdSJpJD13rlVmWu0Qpm3VnlMyckcYcHh Xv6h39BZrakgaTdS92unMNxowVKErug= Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-255-698C9_Y1OxOGmVsJ680FmQ-1; Thu, 12 Feb 2026 11:48:13 -0500 X-MC-Unique: 698C9_Y1OxOGmVsJ680FmQ-1 X-Mimecast-MFC-AGG-ID: 698C9_Y1OxOGmVsJ680FmQ_1770914890 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id B2BB01955F22; Thu, 12 Feb 2026 16:48:09 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.22.80.194]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id AABA01800464; Thu, 12 Feb 2026 16:48:06 +0000 (UTC) From: Waiman Long To: Chen Ridong , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Steven Rostedt , 
Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH v5 3/6] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
Date: Thu, 12 Feb 2026 11:46:37 -0500
Message-ID: <20260212164640.2408295-4-longman@redhat.com>
In-Reply-To: <20260212164640.2408295-1-longman@redhat.com>
References: <20260212164640.2408295-1-longman@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

As cpuset now updates the HK_TYPE_DOMAIN housekeeping mask whenever the
set of isolated CPUs changes, such a change is more costly than before.
Right now, the isolated_cpus_updating flag can be set even if there is
no real change in isolated_cpus. Put in additional checks to make sure
that isolated_cpus_updating is set only if there is a real change in
isolated_cpus.

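The intent of the additional checks can be sketched in userspace as follows. This is only an illustration, assuming a toy 64-bit bitmask in place of struct cpumask; isolated_cpus_update_model(), mask_subset() and mask_intersects() are hypothetical stand-ins for the kernel's isolated_cpus_update(), cpumask_subset() and cpumask_intersects(), and no locking is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PRS_MEMBER	0
#define PRS_ROOT	1
#define PRS_ISOLATED	2

/* Toy stand-ins for the file level state in cpuset.c (at most 64 CPUs). */
static uint64_t isolated_cpus;
static bool isolated_cpus_updating;

/* True if every bit set in a is also set in b. */
static bool mask_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/* True if a and b share at least one set bit. */
static bool mask_intersects(uint64_t a, uint64_t b)
{
	return (a & b) != 0;
}

/* Set the updating flag only when isolated_cpus really changes. */
void isolated_cpus_update_model(int old_prs, int new_prs, uint64_t xcpus)
{
	assert(old_prs != new_prs);	/* models WARN_ON_ONCE() */
	if (new_prs == PRS_ISOLATED) {
		if (mask_subset(xcpus, isolated_cpus))
			return;		/* nothing new to isolate */
		isolated_cpus |= xcpus;
	} else {
		if (!mask_intersects(xcpus, isolated_cpus))
			return;		/* none of xcpus is isolated */
		isolated_cpus &= ~xcpus;
	}
	isolated_cpus_updating = true;
}
```

In this model, isolating CPUs 2-3 sets the flag, while a second request to isolate CPU 2 alone leaves both the mask and the flag untouched, which is the case the patch wants to avoid paying for.
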
Signed-off-by: Waiman Long
Reviewed-by: Chen Ridong
---
 kernel/cgroup/cpuset.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e55855269432..c792380f9b60 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1180,11 +1180,15 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
 	WARN_ON_ONCE(old_prs == new_prs);
 	lockdep_assert_held(&callback_lock);
 	lockdep_assert_held(&cpuset_mutex);
-	if (new_prs == PRS_ISOLATED)
+	if (new_prs == PRS_ISOLATED) {
+		if (cpumask_subset(xcpus, isolated_cpus))
+			return;
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
-	else
+	} else {
+		if (!cpumask_intersects(xcpus, isolated_cpus))
+			return;
 		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
-
+	}
 	isolated_cpus_updating = true;
 }
 
-- 
2.52.0

From nobody Thu Apr 2 19:00:21 2026
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1523235D60B for ; Thu, 12 Feb 2026 16:48:23 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914911; cv=none; b=jF7M/D7ZhrkF6ajCrIOqtvsHNwegb7YMY6m+nrX68foyMEDkQA9e50cTQrJdoBWA5WcJRbb7qldP6cBr29JGK9xQnA+gRPPDFv9CRg8Fq1hCxdxSNnREFK+c7oWRHS/wKkDc/x/h5jfz7dkAlKBiSWeg6TIyOhaYg5t+r/s9YSs=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914911; c=relaxed/simple; bh=wRty8/J2pFM4dDwWYkfELVBdIV2Rf3GmWv2P4m107Tc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version;
b=PkIEoyKJmtKDOiPBeSUYC8ERz0YjBN+0m0gyUs6pGKyYblAvK2cJhcsASmWsDxlcl+OFOdKJajlIk7OA8ovAYCaP9GoN9aYEutCKLPp4ek+iY7LRYm5NSRX17WEZ9tQGCs5ffwNSmya/46OzAK3myTUxBfMoBy7bscZNsE2fj40= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=ZXfLI7Hf; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="ZXfLI7Hf" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770914902; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=LhafozxLtbCRF/ar7/h7DrTiUYnhUT7qKI+AsTwZgJM=; b=ZXfLI7HfmvHAgN5Z/vkIH8pyMSCttkbWkHqSCNtnQ2IhRBNPggq6T/bQLg5I10aShMjLHM nrHyoNPfEULJKp19OHO33a8kTXDV4DPnGXRHPez3s/1r40U7N3H4j4Ypsmvii2fWRM8GXb xTjckTyvJOmzX8TwYQdcsU60mkdJMkU= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-246-nu4QUERNN460wplI7OHY6w-1; Thu, 12 Feb 2026 11:48:15 -0500 X-MC-Unique: nu4QUERNN460wplI7OHY6w-1 X-Mimecast-MFC-AGG-ID: nu4QUERNN460wplI7OHY6w_1770914893 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest 
SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 333D818007E9; Thu, 12 Feb 2026 16:48:13 +0000 (UTC)
Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.22.80.194]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id D564F1800286; Thu, 12 Feb 2026 16:48:09 +0000 (UTC)
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug
Date: Thu, 12 Feb 2026 11:46:38 -0500
Message-ID: <20260212164640.2408295-5-longman@redhat.com>
In-Reply-To: <20260212164640.2408295-1-longman@redhat.com>
References: <20260212164640.2408295-1-longman@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

As any change to isolated_cpus is going to be propagated to the
HK_TYPE_DOMAIN housekeeping cpumask, it can be problematic if
housekeeping cpumasks are directly being modified from the CPU hotplug
code path. This is especially the case if we are going to enable dynamic
update to the nohz_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE)
in the near future with the help of CPU hotplug.

Avoid these potential problems by changing the cpuset code to not
update isolated_cpus when called from CPU hotplug. A new special
PRS_INVALID_ISOLCPUS state is added to indicate that the current cpuset
is an invalid partition but its effective_xcpus are still in isolated_cpus.
This special state will be set if an isolated partition becomes invalid
due to the shutdown of the last active CPU in that partition. We also
need to keep the effective_xcpus even if exclusive_cpus isn't set.

When changes are made to "cpuset.cpus", "cpuset.cpus.exclusive" or
"cpuset.cpus.partition" of a PRS_INVALID_ISOLCPUS cpuset, its state
will be reset back to PRS_INVALID_ISOLATED and its effective_xcpus will
be removed from isolated_cpus before proceeding.

As CPU hotplug will no longer update isolated_cpus, some of the test
cases in test_cpuset_prs.sh will have to be updated to match the new
expected results. Some new test cases are also added to confirm that
"cpuset.cpus.isolated" and HK_TYPE_DOMAIN housekeeping cpumask will
both be updated.

Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 85 ++++++++++++++++---
 .../selftests/cgroup/test_cpuset_prs.sh | 21 +++--
 2 files changed, 87 insertions(+), 19 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c792380f9b60..48b7f275085b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -159,6 +159,8 @@ static bool force_sd_rebuild; /* RWCS */
  * 2 - partition root without load balancing (isolated)
  * -1 - invalid partition root
  * -2 - invalid isolated partition root
+ * -3 - invalid isolated partition root but with effective xcpus still
+ *      in isolated_cpus (set from CPU hotplug side)
  *
  * There are 2 types of partitions - local or remote. Local partitions are
  * those whose parents are partition root themselves. Setting of
@@ -187,6 +189,7 @@ static bool force_sd_rebuild; /* RWCS */
 #define PRS_ISOLATED 2
 #define PRS_INVALID_ROOT -1
 #define PRS_INVALID_ISOLATED -2
+#define PRS_INVALID_ISOLCPUS -3 /* Effective xcpus still in isolated_cpus */
 
 /*
  * Temporary cpumasks for working with partitions that are passed among
@@ -382,6 +385,30 @@ static inline bool is_in_v2_mode(void)
 	      (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
 }
 
+/*
+ * If the given cpuset has a partition state of PRS_INVALID_ISOLCPUS,
+ * remove its effective_xcpus from isolated_cpus and reset its state to
+ * PRS_INVALID_ISOLATED. Also clear effective_xcpus if exclusive_cpus is
+ * empty.
+ */
+static void fix_invalid_isolcpus(struct cpuset *cs, struct cpuset *trialcs)
+{
+	if (likely(cs->partition_root_state != PRS_INVALID_ISOLCPUS))
+		return;
+	WARN_ON_ONCE(cpumask_empty(cs->effective_xcpus));
+	spin_lock_irq(&callback_lock);
+	cpumask_andnot(isolated_cpus, isolated_cpus, cs->effective_xcpus);
+	if (cpumask_empty(cs->exclusive_cpus))
+		cpumask_clear(cs->effective_xcpus);
+	cs->partition_root_state = PRS_INVALID_ISOLATED;
+	spin_unlock_irq(&callback_lock);
+	isolated_cpus_updating = true;
+	if (trialcs) {
+		trialcs->partition_root_state = PRS_INVALID_ISOLATED;
+		cpumask_copy(trialcs->effective_xcpus, cs->effective_xcpus);
+	}
+}
+
 /**
  * partition_is_populated - check if partition has tasks
  * @cs: partition root to be checked
@@ -1160,7 +1187,8 @@ static void reset_partition_data(struct cpuset *cs)
 
 	lockdep_assert_held(&callback_lock);
 
-	if (cpumask_empty(cs->exclusive_cpus)) {
+	if (cpumask_empty(cs->exclusive_cpus) &&
+	    (cs->partition_root_state != PRS_INVALID_ISOLCPUS)) {
 		cpumask_clear(cs->effective_xcpus);
 		if (is_cpu_exclusive(cs))
 			clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);
@@ -1189,6 +1217,10 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
 			return;
 		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
 	}
+	/*
+	 * Shouldn't update isolated_cpus from CPU hotplug
+	 */
+	WARN_ON_ONCE(current->flags & PF_KTHREAD);
 	isolated_cpus_updating = true;
 }
 
@@ -1208,7 +1240,6 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
 	if (!parent)
 		parent = &top_cpuset;
 
-
 	if (parent == &top_cpuset)
 		cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);
 
@@ -1224,11 +1255,12 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
  * @old_prs: old partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be removed
+ * @no_isolcpus: don't update isolated_cpus
  *
  * Remote partition if parent == NULL
  */
 static void partition_xcpus_del(int old_prs, struct cpuset *parent,
-				struct cpumask *xcpus)
+				struct cpumask *xcpus, bool no_isolcpus)
 {
 	WARN_ON_ONCE(old_prs < 0);
 	lockdep_assert_held(&callback_lock);
@@ -1238,7 +1270,7 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 	if (parent == &top_cpuset)
 		cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);
 
-	if (old_prs != parent->partition_root_state)
+	if ((old_prs != parent->partition_root_state) && !no_isolcpus)
 		isolated_cpus_update(old_prs, parent->partition_root_state,
 				     xcpus);
 
@@ -1496,6 +1528,8 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
  */
 static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 {
+	int old_prs = cs->partition_root_state;
+
 	WARN_ON_ONCE(!is_remote_partition(cs));
 	/*
 	 * When a CPU is offlined, top_cpuset may end up with no available CPUs,
@@ -1508,14 +1542,24 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 
 	spin_lock_irq(&callback_lock);
 	cs->remote_partition = false;
-	partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
 	if (cs->prs_err)
 		cs->partition_root_state = -cs->partition_root_state;
 	else
 		cs->partition_root_state = PRS_MEMBER;
+	/*
+	 * Don't update isolated_cpus if calling from CPU hotplug kthread
+	 */
+	if ((current->flags & PF_KTHREAD) &&
+	    (cs->partition_root_state == PRS_INVALID_ISOLATED))
+		cs->partition_root_state = PRS_INVALID_ISOLCPUS;
 
-	/* effective_xcpus may need to be changed */
-	compute_excpus(cs, cs->effective_xcpus);
+	partition_xcpus_del(old_prs, NULL, cs->effective_xcpus,
+			    cs->partition_root_state == PRS_INVALID_ISOLCPUS);
+	/*
+	 * effective_xcpus may need to be changed
+	 */
+	if (cs->partition_root_state != PRS_INVALID_ISOLCPUS)
+		compute_excpus(cs, cs->effective_xcpus);
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
 	update_isolation_cpumasks();
@@ -1580,7 +1624,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	if (adding)
 		partition_xcpus_add(prs, NULL, tmp->addmask);
 	if (deleting)
-		partition_xcpus_del(prs, NULL, tmp->delmask);
+		partition_xcpus_del(prs, NULL, tmp->delmask, false);
 	/*
 	 * Need to update effective_xcpus and exclusive_cpus now as
 	 * update_sibling_cpumasks() below may iterate back to the same cs.
@@ -1893,6 +1937,10 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 			if (!part_error)
 				new_prs = -old_prs;
 			break;
+		case PRS_INVALID_ISOLCPUS:
+			if (!part_error)
+				new_prs = PRS_ISOLATED;
+			break;
 		}
 	}
 
@@ -1923,12 +1971,19 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	if (old_prs != new_prs)
 		cs->partition_root_state = new_prs;
 
+	/*
+	 * Don't update isolated_cpus if calling from CPU hotplug kthread
+	 */
+	if ((current->flags & PF_KTHREAD) &&
+	    (cs->partition_root_state == PRS_INVALID_ISOLATED))
+		cs->partition_root_state = PRS_INVALID_ISOLCPUS;
 	/*
 	 * Adding to parent's effective_cpus means deletion CPUs from cs
 	 * and vice versa.
 	 */
 	if (adding)
-		partition_xcpus_del(old_prs, parent, tmp->addmask);
+		partition_xcpus_del(old_prs, parent, tmp->addmask,
+				    cs->partition_root_state == PRS_INVALID_ISOLCPUS);
 	if (deleting)
 		partition_xcpus_add(new_prs, parent, tmp->delmask);
 
@@ -2317,6 +2372,7 @@ static void partition_cpus_change(struct cpuset *cs, struct cpuset *trialcs,
 	if (cs_is_member(cs))
 		return;
 
+	fix_invalid_isolcpus(cs, trialcs);
 	prs_err = validate_partition(cs, trialcs);
 	if (prs_err)
 		trialcs->prs_err = cs->prs_err = prs_err;
@@ -2818,6 +2874,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	if (alloc_tmpmasks(&tmpmask))
 		return -ENOMEM;
 
+	fix_invalid_isolcpus(cs, NULL);
 	err = update_partition_exclusive_flag(cs, new_prs);
 	if (err)
 		goto out;
@@ -3268,6 +3325,7 @@ static int cpuset_partition_show(struct seq_file *seq, void *v)
 		type = "root";
 		fallthrough;
 	case PRS_INVALID_ISOLATED:
+	case PRS_INVALID_ISOLCPUS:
 		if (!type)
 			type = "isolated";
 		err = perr_strings[READ_ONCE(cs->prs_err)];
@@ -3463,9 +3521,9 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 }
 
 /*
- * If a dying cpuset has the 'cpus.partition' enabled, turn it off by
- * changing it back to member to free its exclusive CPUs back to the pool to
- * be used by other online cpusets.
+ * If a dying cpuset has the 'cpus.partition' enabled or is in the
+ * PRS_INVALID_ISOLCPUS state, turn it off by changing it back to member to
+ * free its exclusive CPUs back to the pool to be used by other online cpusets.
*/ static void cpuset_css_killed(struct cgroup_subsys_state *css) { @@ -3473,7 +3531,8 @@ static void cpuset_css_killed(struct cgroup_subsys_st= ate *css) =20 cpuset_full_lock(); /* Reset valid partition back to member */ - if (is_partition_valid(cs)) + if (is_partition_valid(cs) || + (cs->partition_root_state =3D=3D PRS_INVALID_ISOLCPUS)) update_prstate(cs, PRS_MEMBER); cpuset_full_unlock(); } diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/test= ing/selftests/cgroup/test_cpuset_prs.sh index 5dff3ad53867..380506157f70 100755 --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh @@ -234,6 +234,7 @@ TEST_MATRIX=3D( "$SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:|A2:2|A= 3:3 A1:P1|A2:P1|A3:P1" =20 # CPU offlining cases: + # cpuset.cpus.isolated should no longer be updated. " C0-1 . . C2-3 S+ C4-5 . O2=3D0 0 A1:0-1|B= 1:3" "C0-3:P1:S+ C2-3:P1 . . O2=3D0 . . . 0 A1:0-1|A= 2:3" "C0-3:P1:S+ C2-3:P1 . . O2=3D0 O2=3D1 . . 0 A1:0-1= |A2:2-3" @@ -245,8 +246,9 @@ TEST_MATRIX=3D( "C2-3:P1:S+ C3:P2 . . O2=3D0 O2=3D1 . . 0 A1:2|A= 2:3 A1:P1|A2:P2" "C2-3:P1:S+ C3:P1 . . O2=3D0 . . . 0 A1:|A2:3= A1:P1|A2:P1" "C2-3:P1:S+ C3:P1 . . O3=3D0 . . . 0 A1:2|A2:= A1:P1|A2:P1" - "C2-3:P1:S+ C3:P1 . . T:O2=3D0 . . . 0 A1:3|A2:= 3 A1:P1|A2:P-1" - "C2-3:P1:S+ C3:P1 . . . T:O3=3D0 . . 0 A1:2|A2:= 2 A1:P1|A2:P-1" + "C2-3:P1:S+ C3:P2 . . T:O2=3D0 . . . 0 A1:3|A2:= 3 A1:P1|A2:P-2" + "C1-3:P1:S+ C3:P2 . . . T:O3=3D0 . . 0 A1:1-2|A= 2:1-2|XA2:3 A1:P1|A2:P-2 3" + "C1-3:P1:S+ C3:P2 . . . T:O3=3D0 O3=3D1 . 0 A1:1-2= |A2:3|XA2:3 A1:P1|A2:P2 3" "$SETUP_A123_PARTITIONS . O1=3D0 . . . 0 A1:|A2:2= |A3:3 A1:P1|A2:P1|A3:P1" "$SETUP_A123_PARTITIONS . O2=3D0 . . . 0 A1:1|A2:= |A3:3 A1:P1|A2:P1|A3:P1" "$SETUP_A123_PARTITIONS . O3=3D0 . . . 
0 A1:1|A2:= 2|A3: A1:P1|A2:P1|A3:P1" @@ -299,13 +301,14 @@ TEST_MATRIX=3D( A1:P0|A2:P2|A3:P-1 2-4" =20 # Remote partition offline tests + # CPU offline shouldn't change cpuset.cpus.{isolated,exclusive.effective} " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=3D0 . 0 A1:0-1|A= 2:1|A3:3 A1:P0|A3:P2 2-3" " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=3D0 O2=3D1 0 A1:0-= 1|A2:1|A3:2-3 A1:P0|A3:P2 2-3" - " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=3D0 . 0 A1:0-2|A= 2:1-2|A3: A1:P0|A3:P2 3" - " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=3D0 . 0 A1:0-2|A= 2:1-2|A3:1-2 A1:P0|A3:P-2 3|" + " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=3D0 . 0 A1:0-2|A= 2:1-2|A3:|XA3:3 A1:P0|A3:P2 3" + " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=3D0 . 0 A1:0-2|A= 2:1-2|A3:1-2|XA3:3 A1:P0|A3:P-2 3" =20 # An invalidated remote partition cannot self-recover from hotplug - " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=3D0 O2=3D1 0 A1:0-3= |A2:1-3|A3:2 A1:P0|A3:P-2 ." + " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=3D0 O2=3D1 0 A1:0-3= |A2:1-3|A3:2|XA3:2 A1:P0|A3:P-2 2" =20 # cpus.exclusive.effective clearing test " C0-3:S+ C1-3:S+ C2 . X2-3:X . . . 0 A1:0-3|A2:= 1-3|A3:2|XA1:" @@ -764,7 +767,7 @@ check_cgroup_states() # only CPUs in isolated partitions as well as those that are isolated at # boot time. # -# $1 - expected isolated cpu list(s) {,} +# $1 - expected isolated cpu list(s) {|} # - expected sched/domains value # - cpuset.cpus.isolated value =3D if not defined # @@ -773,6 +776,7 @@ check_isolcpus() EXPECTED_ISOLCPUS=3D$1 ISCPUS=3D${CGROUP2}/cpuset.cpus.isolated ISOLCPUS=3D$(cat $ISCPUS) + HKICPUS=3D$(cat /sys/devices/system/cpu/isolated) LASTISOLCPU=3D SCHED_DOMAINS=3D/sys/kernel/debug/sched/domains if [[ $EXPECTED_ISOLCPUS =3D . 
]] @@ -810,6 +814,11 @@ check_isolcpus() ISOLCPUS=3D EXPECTED_ISOLCPUS=3D$EXPECTED_SDOMAIN =20 + # + # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS + # + [[ "$ISOLCPUS" !=3D "$HKICPUS" ]] && return 1 + # # Use the sched domain in debugfs to check isolated CPUs, if available # --=20 2.52.0 From nobody Thu Apr 2 19:00:21 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C99A635D5E7 for ; Thu, 12 Feb 2026 16:48:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914906; cv=none; b=mdkk1d5hUCWevtlNAsjz2HOeEMQlCV0OCYiEcxWDi6ybv7YpQbAKpKpneCTJeSyTbzI+3NobgJzlWFey7uq740n25nNiyKf80nsj1RfhsjA+y276GvRDWcwML7NM6p1ripxz/SetBFVpjJPN+NpRHt1dcPC+Y8k3ug7o4JegC8U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914906; c=relaxed/simple; bh=ocb4h7PxpCgLT8ddw7hxNLABaVPwyeBIxIlw7jEpoDM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ksi+KRVjnl5p8k4w/mgnCA2OLOzaSyHJzEZQlzuDqjcEs5o2U/aWutqC8wJgS7WvkouAhNHWnGmYfEJKTwczJe+THGOpg/htCgTW+zlyTX/VxFFmVhwdGfPLe063qrQJ6RHHvyYaZPsb409F1OIYVAJteK1oYfuhKA+40WlKCWU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=OS1mGtBk; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) 
header.d=redhat.com header.i=@redhat.com header.b="OS1mGtBk" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770914903; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ujXsVtZkdbe11x4mi/cE86wUSR+HJS2RKadyX3nfR/4=; b=OS1mGtBkGkFtfMvVqE/fFuw8400qABadOZAoXW1nBJHbVNvpbVbvwAokm2t3E8jCk2bjSL vQg8bnt+86pkSagJgJvOKSdKgJDqCBwtyb/hE4HBTIKk7EAwKwVfhmbDo3oCBcTp0w41+a ZIue3/1I3TUg0GleNcxAbiC52w+Tjlg= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-493-kWDOf_l4N1mV7KKkJhMkVA-1; Thu, 12 Feb 2026 11:48:20 -0500 X-MC-Unique: kWDOf_l4N1mV7KKkJhMkVA-1 X-Mimecast-MFC-AGG-ID: kWDOf_l4N1mV7KKkJhMkVA_1770914898 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 84F90180061C; Thu, 12 Feb 2026 16:48:18 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.22.80.194]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 6D2181800465; Thu, 12 Feb 2026 16:48:13 +0000 (UTC) From: Waiman Long To: Chen Ridong , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Anna-Maria Behnsen , Frederic Weisbecker , Thomas Gleixner , Shuah Khan Cc: cgroups@vger.kernel.org, 
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH v5 5/6] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
Date: Thu, 12 Feb 2026 11:46:39 -0500
Message-ID: <20260212164640.2408295-6-longman@redhat.com>
In-Reply-To: <20260212164640.2408295-1-longman@redhat.com>
References: <20260212164640.2408295-1-longman@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held, as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
that have a locking dependency on cpus_read_lock() down the chain. Below
is an example of such a circular locking problem.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.18.0-test+ #2 Tainted: G S
  ------------------------------------------------------
  test_cpuset_prs/10971 is trying to acquire lock:
  ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

  but task is already holding lock:
  ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  which lock already depends on the new lock.
  the existing dependency chain (in reverse order) is:
  -> #4 (cpuset_mutex){+.+.}-{4:4}:
  -> #3 (cpu_hotplug_lock){++++}-{0:0}:
  -> #2 (rtnl_mutex){+.+.}-{4:4}:
  -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
  -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

  Chain exists of:
  (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

  5 locks held by test_cpuset_prs/10971:
  #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
  #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
  #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
  #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
  #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  Call Trace:
  :
  touch_wq_lockdep_map+0x93/0x180
  __flush_workqueue+0x111/0x10b0
  housekeeping_update+0x12d/0x2d0
  update_parent_effective_cpumask+0x595/0x2440
  update_prstate+0x89d/0xce0
  cpuset_partition_write+0xc5/0x130
  cgroup_file_write+0x1a5/0x680
  kernfs_fop_write_iter+0x3df/0x5f0
  vfs_write+0x525/0xfd0
  ksys_write+0xf9/0x1d0
  do_syscall_64+0x95/0x520
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding cpus_read_lock() or
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.

One way to do that is to defer the housekeeping_update() call until
after the current cpuset critical section has finished, so that it runs
without cpus_read_lock held. For a cpuset control file write, this can
be done by deferring it using task_work right before returning to
userspace.
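As a rough userspace illustration of the deferral idea described above (a sketch only, not the kernel implementation), the code below changes state inside a locked section, merely notes that follow-up work is needed, and runs the expensive update once the inner lock has been dropped. All names in it (top_lock, inner_lock, heavy_update(), control_write()) are hypothetical stand-ins: the pthread mutexes play the roles of cpuset_top_mutex and of cpus_read_lock plus cpuset_mutex, and a plain function call after unlocking stands in for task_work.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
static pthread_mutex_t top_lock = PTHREAD_MUTEX_INITIALIZER;   /* outermost lock */
static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER; /* lock the heavy call must avoid */
static bool inner_held;              /* true while inner_lock is held */

static unsigned long isolated_mask;  /* state changed in the critical section */
static unsigned long applied_mask;   /* state last seen by the heavy update */
static int heavy_calls;              /* number of heavy updates performed */

/* Stand-in for the expensive update; it must not run under inner_lock. */
static void heavy_update(unsigned long mask)
{
	assert(!inner_held);
	applied_mask = mask;
	heavy_calls++;
}

/* Deferred step: takes only the outer lock and skips no-op updates. */
static void run_deferred(void)
{
	pthread_mutex_lock(&top_lock);
	if (isolated_mask != applied_mask)
		heavy_update(isolated_mask);
	pthread_mutex_unlock(&top_lock);
}

/* Model of a control file write: update state now, defer the heavy call. */
static void control_write(unsigned long new_mask)
{
	bool pending = false;

	pthread_mutex_lock(&top_lock);
	pthread_mutex_lock(&inner_lock);
	inner_held = true;

	isolated_mask = new_mask;
	pending = true;	/* queued once, however often the state changes */

	inner_held = false;
	pthread_mutex_unlock(&inner_lock);
	pthread_mutex_unlock(&top_lock);

	/* Models the return to userspace: every lock has been dropped. */
	if (pending)
		run_deferred();
}
```

Calling control_write(0x0c) twice in a row performs the heavy update only once in this model, because the deferred step compares the current mask with the one it last applied.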
To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top-level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, a cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in a production environment.

As cpus_read_lock() is no longer held when
tmigr_isolated_exclude_cpumask() is called, that function needs to
acquire cpus_read_lock() itself.

lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.

Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c        | 99 ++++++++++++++++++++++++++++++++---
 kernel/sched/isolation.c      |  4 +-
 kernel/time/timer_migration.c |  4 +-
 3 files changed, 93 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 48b7f275085b..c6a97956a991 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
  * CPUSET Locking Convention
  * -------------------------
  *
- * Below are the three global locks guarding cpuset structures in lock
+ * Below are the four global/local locks guarding cpuset structures in lock
  * acquisition order:
+ * - cpuset_top_mutex
  * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
  * - cpuset_mutex
  * - callback_lock (raw spinlock)
  *
- * A task must hold all the three locks to modify externally visible or
- * used fields of cpusets, though some of the internally used cpuset fields
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() to update housekeeping cpumasks when the set of
+ * isolated CPUs is going to be changed, it may be vulnerable to deadlock
+ * if we hold cpus_read_lock while calling into housekeeping_update().
+ * + * The first cpuset_top_mutex will be held except when calling into + * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock + * and cpuset_mutex will be held instead. The main purpose of this mutex + * is to prevent regular cpuset control file write actions from interfering + * with the call to housekeeping_update(), though CPU hotplug operation can + * still happen in parallel. This mutex also provides protection for some + * internal variables. + * + * A task must hold all the remaining three locks to modify externally vis= ible + * or used fields of cpusets, though some of the internally used cpuset fi= elds * and internal variables can be modified without holding callback_lock. I= f only * reliable read access of the externally used fields are needed, a task c= an * hold either cpuset_mutex or callback_lock which are exposed to other @@ -100,6 +114,7 @@ static const char * const perr_strings[] =3D { * cpumasks and nodemasks. */ =20 +static DEFINE_MUTEX(cpuset_top_mutex); static DEFINE_MUTEX(cpuset_mutex); =20 /* @@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex); * * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable * by holding both cpuset_mutex and callback_lock. + * + * T: Read/write-able by holding the cpuset_top_mutex. */ =20 /* @@ -135,6 +152,18 @@ static cpumask_var_t isolated_cpus; /* CSCB */ */ static bool isolated_cpus_updating; /* RWCS */ =20 +/* + * Copy of isolated_cpus to be passed to housekeeping_update() + */ +static cpumask_var_t isolated_hk_cpus; /* T */ + +/* + * Flag to prevent queuing more than one task_work to the same cpuset_top_= mutex + * critical section. + */ +static bool isolcpus_twork_queued; /* T */ + + /* * A flag to force sched domain rebuild at the end of an operation. 
* It can be set in @@ -301,20 +330,24 @@ void lockdep_assert_cpuset_lock_held(void) */ void cpuset_full_lock(void) { + mutex_lock(&cpuset_top_mutex); cpus_read_lock(); mutex_lock(&cpuset_mutex); } =20 void cpuset_full_unlock(void) { + isolcpus_twork_queued =3D false; mutex_unlock(&cpuset_mutex); cpus_read_unlock(); + mutex_unlock(&cpuset_top_mutex); } =20 #ifdef CONFIG_LOCKDEP bool lockdep_is_cpuset_held(void) { - return lockdep_is_held(&cpuset_mutex); + return lockdep_is_held(&cpuset_mutex) || + lockdep_is_held(&cpuset_top_mutex); } #endif =20 @@ -1338,6 +1371,28 @@ static bool prstate_housekeeping_conflict(int prstat= e, struct cpumask *new_cpus) return false; } =20 +/* + * housekeeping_update() will only be called if isolated_cpus differs + * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always + * be called just in case there are still pending sched domains changes. + */ +static void isolcpus_tworkfn(struct callback_head *cb) +{ + bool update_hk =3D true; + + guard(mutex)(&cpuset_top_mutex); + scoped_guard(spinlock_irq, &callback_lock) { + if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) + update_hk =3D false; + else + cpumask_copy(isolated_hk_cpus, isolated_cpus); + } + if (update_hk) + WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0); + rebuild_sched_domains(); + kfree(cb); +} + /* * update_isolation_cpumasks - Update external isolation related CPU masks * @@ -1346,15 +1401,42 @@ static bool prstate_housekeeping_conflict(int prsta= te, struct cpumask *new_cpus) */ static void update_isolation_cpumasks(void) { - int ret; + struct callback_head *twork_cb; =20 if (!isolated_cpus_updating) return; + else + isolated_cpus_updating =3D false; + + /* + * CPU hotplug shouldn't set isolated_cpus_updating. + * + * To have better flexibility and prevent the possibility of deadlock, + * we defer the housekeeping_update() call to after the current cpuset + * critical section has finished. 
This is done via the synchronous + * task_work which will be executed right before returning to userspace. + * + * update_isolation_cpumasks() may be called more than once in the + * same cpuset_mutex critical section. + */ + lockdep_assert_held(&cpuset_top_mutex); + if (isolcpus_twork_queued) + return; =20 - ret =3D housekeeping_update(isolated_cpus); - WARN_ON_ONCE(ret < 0); + twork_cb =3D kzalloc(sizeof(struct callback_head), GFP_KERNEL); + if (!twork_cb) + return; =20 - isolated_cpus_updating =3D false; + /* + * isolcpus_tworkfn() will be invoked before returning to userspace + */ + init_task_work(twork_cb, isolcpus_tworkfn); + if (task_work_add(current, twork_cb, TWA_RESUME)) { + kfree(twork_cb); + WARN_ON_ONCE(1); /* Current task shouldn't be exiting */ + } else { + isolcpus_twork_queued =3D true; + } } =20 /** @@ -3689,6 +3771,7 @@ int __init cpuset_init(void) BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL)); + BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL)); =20 cpumask_setall(top_cpuset.cpus_allowed); nodes_setall(top_cpuset.mems_allowed); diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index 3b725d39c06e..ef152d401fe2 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask) struct cpumask *trial, *old =3D NULL; int err; =20 - lockdep_assert_cpus_held(); - trial =3D kmalloc(cpumask_size(), GFP_KERNEL); if (!trial) return -ENOMEM; @@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask) } =20 if (!housekeeping.flags) - static_branch_enable_cpuslocked(&housekeeping_overridden); + static_branch_enable(&housekeeping_overridden); =20 if (housekeeping.flags & HK_FLAG_DOMAIN) old =3D housekeeping_cpumask_dereference(HK_TYPE_DOMAIN); diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c 
index 6da9cd562b20..83428aa03aef 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *ex= clude_cpumask) cpumask_var_t cpumask __free(free_cpumask_var) =3D CPUMASK_VAR_NULL; int cpu; =20 - lockdep_assert_cpus_held(); - if (!works) return -ENOMEM; if (!alloc_cpumask_var(&cpumask, GFP_KERNEL)) @@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *ex= clude_cpumask) * First set previously isolated CPUs as available (unisolate). * This cpumask contains only CPUs that switched to available now. */ + guard(cpus_read_lock)(); cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask); cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask); =20 @@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void) cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_D= OMAIN)); =20 /* Protect against RCU torture hotplug testing */ - guard(cpus_read_lock)(); return tmigr_isolated_exclude_cpumask(cpumask); } late_initcall(tmigr_init_isolation); --=20 2.52.0 From nobody Thu Apr 2 19:00:21 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 14D4A224B05 for ; Thu, 12 Feb 2026 16:48:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914913; cv=none; b=TgPGXhhlZhd0iZR/veXvuljGPg14OdHBPKGdN10sAtHqHE606zIqEgIAADOFQ/sVd3uFhnBCtyzbq6KCGgzaI4ynTD98kfVmCdovpbGOaJzej7OKF5Xc2B9TyxTH+XZ7CK8oh7OmAgprkJVfNwNv5k7RuqbFDb940speeMsZEPo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770914913; c=relaxed/simple; bh=+IASX15SHJQl3uo84DSZeDBh3tIvYlqzv39AOczHBjg=; 
h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=dQPLPNHL0Sa4dmTuegdSbwQ12IZRipjTt7GEmRAg6KrJqjhiuw3sn++EQfPksI+TXZkBiqHYrw7eZWIP/cWQsis9iG6snfXe/2wIyW5gTFRUzywDYoBg/tAijcoJ3oNXyZolQgEOljsg0iJW/akY3OBKnu9ON5Dz11LA0oyC9lk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=DemolHdK; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="DemolHdK" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770914909; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=OakEBTRc4ktVpI3ZptxUvVQZ9LjX5pa6Y5FedYN8EUM=; b=DemolHdKp19jXWyBBZwhK77R8dyxjRLo6BXgADLpi8E07mY54B3X/hmzIPl2R9h97VKrHW yRzmGTZbTgfGegszMvwgUYjtrYpTWrImkHVq++4IGeBrAsR7UtjT9eIwHXLsIPGeUYd6Zd +huzvPASIb0zH8xXq8nisPmlYUCmiYE= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-688-sGST7Oq4N7uhQoTziKEN-g-1; Thu, 12 Feb 2026 11:48:25 -0500 X-MC-Unique: sGST7Oq4N7uhQoTziKEN-g-1 X-Mimecast-MFC-AGG-ID: sGST7Oq4N7uhQoTziKEN-g_1770914903 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) 
key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 0D74418002C2; Thu, 12 Feb 2026 16:48:22 +0000 (UTC)
Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.22.80.194]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id BECB418003F5; Thu, 12 Feb 2026 16:48:18 +0000 (UTC)
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH v5 6/6] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls
Date: Thu, 12 Feb 2026 11:46:40 -0500
Message-ID: <20260212164640.2408295-7-longman@redhat.com>
In-Reply-To: <20260212164640.2408295-1-longman@redhat.com>
References: <20260212164640.2408295-1-longman@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Now that any changes to the HK_TYPE_DOMAIN housekeeping cpumask are
deferred to either task_work or a workqueue, where a
rebuild_sched_domains() call will be issued, the current
rebuild_sched_domains_locked() call near the end of the cpuset critical
section can be removed in such cases.

Currently, a boolean force_sd_rebuild flag is used to decide if
rebuild_sched_domains_locked() needs to be invoked. To allow such a
deferral, we change it to a tri-state sd_rebuild enumeration type.
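The tri-state idea can be sketched in plain userspace C as follows. This is only an illustration of the state transitions described above, not the kernel code: the enumerator names follow the patch, while force_rebuild(), defer_rebuild(), end_of_section(), deferred_work() and the rebuild_count counter are hypothetical helpers introduced for the example.

```c
#include <assert.h>

/* Enumerator names follow the patch; everything else is hypothetical. */
enum sd_rebuild_state {
	SD_NO_REBUILD = 0,	/* nothing to do */
	SD_REBUILD,		/* rebuild at the end of the critical section */
	SD_DEFER_REBUILD,	/* rebuild later, from the deferred work */
};

static enum sd_rebuild_state sd_rebuild;
static int rebuild_count;	/* counts simulated sched domain rebuilds */

/* Simulated rebuild; like the real one, it clears the pending state. */
static void rebuild_domains(void)
{
	sd_rebuild = SD_NO_REBUILD;
	rebuild_count++;
}

/* Request an immediate rebuild unless one is already pending or deferred. */
static void force_rebuild(void)
{
	if (sd_rebuild == SD_NO_REBUILD)
		sd_rebuild = SD_REBUILD;
}

/* Isolation masks changed: the rebuild will be done by the deferred work. */
static void defer_rebuild(void)
{
	sd_rebuild = SD_DEFER_REBUILD;
}

/* End of the critical section: rebuild only if not deferred. */
static void end_of_section(void)
{
	if (sd_rebuild == SD_REBUILD)
		rebuild_domains();
}

/* Deferred work (task_work or workqueue in the patch): always rebuilds. */
static void deferred_work(void)
{
	rebuild_domains();
}
```

In this model a force_rebuild() issued after defer_rebuild() does not downgrade the state, so end_of_section() skips the rebuild and the single rebuild happens in deferred_work().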
Signed-off-by: Waiman Long --- kernel/cgroup/cpuset.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index c6a97956a991..426949363ca7 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -178,7 +178,11 @@ static bool isolcpus_twork_queued; /* T */ * Note that update_relax_domain_level() in cpuset-v1.c can still call * rebuild_sched_domains_locked() directly without using this flag. */ -static bool force_sd_rebuild; /* RWCS */ +static enum { + SD_NO_REBUILD =3D 0, + SD_REBUILD, + SD_DEFER_REBUILD, +} sd_rebuild; /* RWCS */ =20 /* * Partition root states: @@ -1023,7 +1027,7 @@ void rebuild_sched_domains_locked(void) =20 lockdep_assert_cpus_held(); lockdep_assert_cpuset_lock_held(); - force_sd_rebuild =3D false; + sd_rebuild =3D SD_NO_REBUILD; =20 /* Generate domain masks and attrs */ ndoms =3D generate_sched_domains(&doms, &attr); @@ -1408,6 +1412,9 @@ static void update_isolation_cpumasks(void) else isolated_cpus_updating =3D false; =20 + /* Defer rebuild_sched_domains() to task_work or wq */ + sd_rebuild =3D SD_DEFER_REBUILD; + /* * CPU hotplug shouldn't set isolated_cpus_updating. 
* @@ -3053,7 +3060,7 @@ static int update_prstate(struct cpuset *cs, int new_= prs) update_partition_sd_lb(cs, old_prs); =20 notify_partition_change(cs, old_prs); - if (force_sd_rebuild) + if (sd_rebuild =3D=3D SD_REBUILD) rebuild_sched_domains_locked(); free_tmpmasks(&tmpmask); return 0; @@ -3330,7 +3337,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file = *of, } =20 free_cpuset(trialcs); - if (force_sd_rebuild) + if (sd_rebuild =3D=3D SD_REBUILD) rebuild_sched_domains_locked(); out_unlock: cpuset_full_unlock(); @@ -3815,7 +3822,8 @@ hotplug_update_tasks(struct cpuset *cs, =20 void cpuset_force_rebuild(void) { - force_sd_rebuild =3D true; + if (!sd_rebuild) + sd_rebuild =3D SD_REBUILD; } =20 /** @@ -4025,7 +4033,7 @@ static void cpuset_handle_hotplug(void) } =20 /* rebuild sched domains if necessary */ - if (force_sd_rebuild) + if (sd_rebuild =3D=3D SD_REBUILD) rebuild_sched_domains_cpuslocked(); =20 free_tmpmasks(ptmp); --=20 2.52.0