From nobody Sat Feb 7 10:16:38 2026
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
Date: Fri, 6 Feb 2026 15:37:09 -0500
Message-ID: <20260206203712.1989610-2-longman@redhat.com>
In-Reply-To: <20260206203712.1989610-1-longman@redhat.com>
References: <20260206203712.1989610-1-longman@redhat.com>

Clarify the locking rules associated with file-level internal variables
inside the cpuset code. There is no functional change.

Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
 1 file changed, 61 insertions(+), 44 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c43efef7df71..a4c6386a594d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
 	[PERR_REMOTE] = "Have remote partition underneath",
 };
 
+/*
+ * CPUSET Locking Convention
+ * -------------------------
+ *
+ * Below are the three global locks guarding cpuset structures in lock
+ * acquisition order:
+ *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ *  - cpuset_mutex
+ *  - callback_lock (raw spinlock)
+ *
+ * A task must hold all the three locks to modify externally visible or
+ * used fields of cpusets, though some of the internally used cpuset fields
+ * and internal variables can be modified without holding callback_lock. If only
+ * reliable read access of the externally used fields are needed, a task can
+ * hold either cpuset_mutex or callback_lock which are exposed to other
+ * external subsystems.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets. It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets. Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
+ *
+ * Now, the task_struct fields mems_allowed and mempolicy may be changed
+ * by other task, we use alloc_lock in the task_struct fields to protect
+ * them.
+ *
+ * The cpuset_common_seq_show() handlers only hold callback_lock across
+ * small pieces of code, such as when reading out possibly multi-word
+ * cpumasks and nodemasks.
+ */
+
+static DEFINE_MUTEX(cpuset_mutex);
+
+/*
+ * File level internal variables below follow one of the following exclusion
+ * rules.
+ *
+ * RWCS: Read/write-able by holding either cpus_write_lock or both
+ *	 cpus_read_lock and cpuset_mutex.
+ *
+ * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
+ *	 by holding both cpuset_mutex and callback_lock.
+ */
+
 /*
  * For local partitions, update to subpartitions_cpus & isolated_cpus is done
  * in update_parent_effective_cpumask(). For remote partitions, it is done in
@@ -70,19 +122,18 @@ static const char * const perr_strings[] = {
  * Exclusive CPUs distributed out to local or remote sub-partitions of
  * top_cpuset
  */
-static cpumask_var_t	subpartitions_cpus;
+static cpumask_var_t	subpartitions_cpus;	/* RWCS */
 
 /*
- * Exclusive CPUs in isolated partitions
+ * Exclusive CPUs in isolated partitions (shown in cpuset.cpus.isolated)
  */
-static cpumask_var_t	isolated_cpus;
+static cpumask_var_t	isolated_cpus;		/* CSCB */
 
 /*
- * isolated_cpus updating flag (protected by cpuset_mutex)
- * Set if isolated_cpus is going to be updated in the current
- * cpuset_mutex crtical section.
+ * Set if isolated_cpus is being updated in the current cpuset_mutex
+ * critical section.
  */
-static bool isolated_cpus_updating;
+static bool isolated_cpus_updating;	/* RWCS */
 
 /*
  * A flag to force sched domain rebuild at the end of an operation.
@@ -98,7 +149,7 @@ static bool isolated_cpus_updating;
  * Note that update_relax_domain_level() in cpuset-v1.c can still call
  * rebuild_sched_domains_locked() directly without using this flag.
  */
-static bool force_sd_rebuild;
+static bool force_sd_rebuild;	/* RWCS */
 
 /*
  * Partition root states:
@@ -218,42 +269,6 @@ struct cpuset top_cpuset = {
 	.partition_root_state = PRS_ROOT,
 };
 
-/*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
- *
- * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets. It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex. While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets. Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
- *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
- *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
- *
- * Now, the task_struct fields mems_allowed and mempolicy may be changed
- * by other task, we use alloc_lock in the task_struct fields to protect
- * them.
- *
- * The cpuset_common_seq_show() handlers only hold callback_lock across
- * small pieces of code, such as when reading out possibly multi-word
- * cpumasks and nodemasks.
- */
-
-static DEFINE_MUTEX(cpuset_mutex);
-
 /**
  * cpuset_lock - Acquire the global cpuset mutex
  *
@@ -1163,6 +1178,8 @@ static void reset_partition_data(struct cpuset *cs)
 static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
 {
 	WARN_ON_ONCE(old_prs == new_prs);
+	lockdep_assert_held(&callback_lock);
+	lockdep_assert_held(&cpuset_mutex);
 	if (new_prs == PRS_ISOLATED)
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
 	else
-- 
2.52.0
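Aside: the ordered-acquisition discipline documented in the patch above (take cpu_hotplug_lock, then cpuset_mutex, then callback_lock to write; take only the innermost lock for a consistent read) can be sketched with plain userspace primitives. The pthread locks below merely stand in for the kernel locks; this is an illustrative analogy under assumed names, not the cpuset implementation:

/* Userspace sketch of the three-level lock nesting described above. */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t hotplug_lock = PTHREAD_RWLOCK_INITIALIZER; /* ~cpu_hotplug_lock */
static pthread_mutex_t  set_mutex    = PTHREAD_MUTEX_INITIALIZER;  /* ~cpuset_mutex */
static pthread_spinlock_t cb_lock;                                 /* ~callback_lock */

static int shared_state;	/* ~externally visible cpuset fields */

/* Writers take all three locks, always in the same order. */
static void writer_update(int v)
{
	pthread_rwlock_rdlock(&hotplug_lock);
	pthread_mutex_lock(&set_mutex);
	/* checks and allocations would happen here, before the spinlock */
	pthread_spin_lock(&cb_lock);
	shared_state = v;		/* the only place state changes */
	pthread_spin_unlock(&cb_lock);
	pthread_mutex_unlock(&set_mutex);
	pthread_rwlock_unlock(&hotplug_lock);
}

/* Readers need only the innermost lock for a consistent snapshot. */
static int reader_query(void)
{
	pthread_spin_lock(&cb_lock);
	int v = shared_state;
	pthread_spin_unlock(&cb_lock);
	return v;
}

int main(void)
{
	pthread_spin_init(&cb_lock, PTHREAD_PROCESS_PRIVATE);
	writer_update(42);
	printf("state = %d\n", reader_query());
	return 0;
}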
From nobody Sat Feb 7 10:16:38 2026
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
Date: Fri, 6 Feb 2026 15:37:10 -0500
Message-ID: <20260206203712.1989610-3-longman@redhat.com>
In-Reply-To: <20260206203712.1989610-1-longman@redhat.com>
References: <20260206203712.1989610-1-longman@redhat.com>

The update_isolation_cpumasks() function can be called either directly
from a regular cpuset control file write with cpuset_full_lock() held,
or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.

As we are going to enable dynamic updates to the nohz_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update()
directly from update_isolation_cpumasks() will likely cause deadlock.
So any call to housekeeping_update() has to be deferred until after the
CPU hotplug operation has finished. This is now done via a workqueue,
where the actual housekeeping_update() call, if needed, will happen
after cpus_write_lock is released. We can't use the synchronous
task_work API because the call from the CPU hotplug path happens in the
per-cpu kthread of the CPU that is being shut down or brought up.

Because of the asynchronous nature of the workqueue, the HK_TYPE_DOMAIN
housekeeping cpumask will be updated a bit later than the
"cpuset.cpus.isolated" control file in this case.

Also add a check in test_cpuset_prs.sh and modify some existing test
cases to confirm that both "cpuset.cpus.isolated" and the HK_TYPE_DOMAIN
housekeeping cpumask are updated.
Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c                          | 41 +++++++++++++++++--
 .../selftests/cgroup/test_cpuset_prs.sh        | 13 ++++--
 2 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a4c6386a594d..eb0eabd85e8c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1302,6 +1302,17 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
+static void isolcpus_workfn(struct work_struct *work)
+{
+	cpuset_full_lock();
+	if (isolated_cpus_updating) {
+		isolated_cpus_updating = false;
+		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
+		rebuild_sched_domains_locked();
+	}
+	cpuset_full_unlock();
+}
+
 /*
  * update_isolation_cpumasks - Update external isolation related CPU masks
  *
@@ -1310,14 +1321,38 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
  */
 static void update_isolation_cpumasks(void)
 {
-	int ret;
+	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
 
+	lockdep_assert_cpuset_lock_held();
 	if (!isolated_cpus_updating)
 		return;
 
-	ret = housekeeping_update(isolated_cpus);
-	WARN_ON_ONCE(ret < 0);
+	/*
+	 * This function can be reached either directly from regular cpuset
+	 * control file write or via CPU hotplug. In the latter case, it is
+	 * the per-cpu kthread that calls cpuset_handle_hotplug() on behalf
+	 * of the task that initiates CPU shutdown or bringup.
+	 *
+	 * To have better flexibility and prevent the possibility of deadlock
+	 * when calling from CPU hotplug, we defer the housekeeping_update()
+	 * call to after the current cpuset critical section has finished.
+	 * This is done via workqueue.
+	 */
+	if (current->flags & PF_KTHREAD) {
+		/*
+		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
+		 * item that is still pending. Before the pending bit is
+		 * cleared, the work data is copied out and work item dequeued.
+		 * So it is possible to queue the work again before the
+		 * isolcpus_workfn() is invoked to process the previously
+		 * queued work. Since isolcpus_workfn() doesn't use the work
+		 * item at all, this is not a problem.
+		 */
+		queue_work(system_unbound_wq, &isolcpus_work);
+		return;
+	}
 
+	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
 	isolated_cpus_updating = false;
 }
 
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 5dff3ad53867..0502b156582b 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -245,8 +245,9 @@ TEST_MATRIX=(
 	"C2-3:P1:S+ C3:P2 . . O2=0 O2=1 . . 0 A1:2|A2:3 A1:P1|A2:P2"
 	"C2-3:P1:S+ C3:P1 . . O2=0 .    . . 0 A1:|A2:3 A1:P1|A2:P1"
 	"C2-3:P1:S+ C3:P1 . . O3=0 .    . . 0 A1:2|A2: A1:P1|A2:P1"
-	"C2-3:P1:S+ C3:P1 . . T:O2=0 .      . . 0 A1:3|A2:3 A1:P1|A2:P-1"
-	"C2-3:P1:S+ C3:P1 . . .      T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
+	"C2-3:P1:S+ C3:P2 . . T:O2=0 .      .    . 0 A1:3|A2:3 A1:P1|A2:P-2"
+	"C1-3:P1:S+ C3:P2 . . .      T:O3=0 .    . 0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
+	"C1-3:P1:S+ C3:P2 . . .      T:O3=0 O3=1 . 0 A1:1-2|A2:3 A1:P1|A2:P2 3"
 	"$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -764,7 +765,7 @@ check_cgroup_states()
 # only CPUs in isolated partitions as well as those that are isolated at
 # boot time.
 #
-# $1 - expected isolated cpu list(s) {,}
+# $1 - expected isolated cpu list(s) {|}
 # - expected sched/domains value
 # - cpuset.cpus.isolated value = if not defined
 #
@@ -773,6 +774,7 @@ check_isolcpus()
 	EXPECTED_ISOLCPUS=$1
 	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
 	ISOLCPUS=$(cat $ISCPUS)
+	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
 	if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -810,6 +812,11 @@ check_isolcpus()
 	ISOLCPUS=
 	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
 
+	#
+	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+	#
+	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
+
 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available
 	#
-- 
2.52.0
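Aside: the requeue-safe deferral that patch 2 relies on (queueing the same work item again before the handler runs is harmless, because the handler reads shared state rather than the queue entry) can be mimicked in userspace with a pending flag and a worker thread. The names below are illustrative stand-ins, not kernel APIs:

/* Userspace analogue of the workqueue deferral pattern in patch 2. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;
static bool work_pending;	/* rough analogue of WORK_STRUCT_PENDING_BIT */
static atomic_bool stop_worker;

static void do_housekeeping(void)	/* stand-in for the deferred update */
{
	printf("housekeeping refreshed\n");
}

static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!atomic_load(&stop_worker)) {
		while (!work_pending && !atomic_load(&stop_worker))
			pthread_cond_wait(&kick, &lock);
		if (work_pending) {
			work_pending = false;		/* clear before running */
			pthread_mutex_unlock(&lock);
			do_housekeeping();		/* runs outside the requester's context */
			pthread_mutex_lock(&lock);
		}
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void queue_housekeeping(void)		/* safe to call repeatedly */
{
	pthread_mutex_lock(&lock);
	work_pending = true;			/* duplicate requests collapse here */
	pthread_cond_signal(&kick);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	queue_housekeeping();
	queue_housekeeping();			/* may be absorbed into the first run */
	usleep(100 * 1000);
	atomic_store(&stop_worker, true);
	pthread_mutex_lock(&lock);
	pthread_cond_signal(&kick);		/* wake worker so it can exit */
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	return 0;
}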
From nobody Sat Feb 7 10:16:38 2026
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
Date: Fri, 6 Feb 2026 15:37:11 -0500
Message-ID: <20260206203712.1989610-4-longman@redhat.com>
In-Reply-To: <20260206203712.1989610-1-longman@redhat.com>
References: <20260206203712.1989610-1-longman@redhat.com>

The current cpuset partition code is able to dynamically update the
sched domains of a running system and the corresponding HK_TYPE_DOMAIN
housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held, as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
that have a locking dependency with cpus_read_lock() down the chain.
Below is an example of such a circular locking problem.

 ======================================================
 WARNING: possible circular locking dependency detected
 6.18.0-test+ #2 Tainted: G S
 ------------------------------------------------------
 test_cpuset_prs/10971 is trying to acquire lock:
 ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

 but task is already holding lock:
 ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

 which lock already depends on the new lock.
 the existing dependency chain (in reverse order) is:

 -> #4 (cpuset_mutex){+.+.}-{4:4}:
 -> #3 (cpu_hotplug_lock){++++}-{0:0}:
 -> #2 (rtnl_mutex){+.+.}-{4:4}:
 -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
 -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

 Chain exists of:
   (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

 5 locks held by test_cpuset_prs/10971:
  #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
  #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
  #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
  #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
  #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

 Call Trace:
  touch_wq_lockdep_map+0x93/0x180
  __flush_workqueue+0x111/0x10b0
  housekeeping_update+0x12d/0x2d0
  update_parent_effective_cpumask+0x595/0x2440
  update_prstate+0x89d/0xce0
  cpuset_partition_write+0xc5/0x130
  cgroup_file_write+0x1a5/0x680
  kernfs_fop_write_iter+0x3df/0x5f0
  vfs_write+0x525/0xfd0
  ksys_write+0xf9/0x1d0
  do_syscall_64+0x95/0x520
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to call
housekeeping_update() without holding cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly, but we
are likely to extend the list of wq's that are flushed in the future.
Moreover, the current set of work functions may hold locks that have
cpu_hotplug_lock down the dependency chain.

One way to do that is to defer the housekeeping_update() call until
after the current cpuset critical section has finished, without holding
cpus_read_lock. For cpuset control file writes, this can be done by
deferring it using task_work right before returning to userspace.

To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top-level
cpuset_top_mutex is introduced. This new mutex will be acquired first
to allow sharing variables used by both code paths. However, cpuset
updates from CPU hotplug can still happen in parallel with the
housekeeping_update() call, though that should be rare in a production
environment.

As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly. lockdep_is_cpuset_held() is also updated to return true if
either cpuset_top_mutex or cpuset_mutex is held.
Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c        | 107 +++++++++++++++++++++++++++-------
 kernel/sched/isolation.c      |   4 +-
 kernel/time/timer_migration.c |   4 +-
 3 files changed, 89 insertions(+), 26 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index eb0eabd85e8c..d26c77a726b2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
  * CPUSET Locking Convention
  * -------------------------
  *
- * Below are the three global locks guarding cpuset structures in lock
+ * Below are the four global/local locks guarding cpuset structures in lock
  * acquisition order:
+ *  - cpuset_top_mutex
  *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
  *  - cpuset_mutex
  *  - callback_lock (raw spinlock)
  *
- * A task must hold all the three locks to modify externally visible or
- * used fields of cpusets, though some of the internally used cpuset fields
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() to update housekeeping cpumasks when the set of
+ * isolated CPUs is going to be changed, it may be vulnerable to deadlock
+ * if we hold cpus_read_lock while calling into housekeeping_update().
+ *
+ * The first cpuset_top_mutex will be held except when calling into
+ * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
+ * and cpuset_mutex will be held instead. The main purpose of this mutex
+ * is to prevent regular cpuset control file write actions from interfering
+ * with the call to housekeeping_update(), though CPU hotplug operation can
+ * still happen in parallel. This mutex also provides protection for some
+ * internal variables.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
  * and internal variables can be modified without holding callback_lock. If only
  * reliable read access of the externally used fields are needed, a task can
  * hold either cpuset_mutex or callback_lock which are exposed to other
@@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
  * cpumasks and nodemasks.
  */
 
+static DEFINE_MUTEX(cpuset_top_mutex);
 static DEFINE_MUTEX(cpuset_mutex);
 
 /*
@@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
  *
  * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
  *	 by holding both cpuset_mutex and callback_lock.
+ *
+ * T: Read/write-able by holding the cpuset_top_mutex.
  */
 
 /*
@@ -135,6 +152,13 @@ static cpumask_var_t	isolated_cpus;	/* CSCB */
  */
 static bool isolated_cpus_updating;	/* RWCS */
 
+/*
+ * Copy of isolated_cpus to be processed by housekeeping_update()
+ */
+static cpumask_var_t	isolated_hk_cpus;	/* T */
+static bool isolcpus_twork_queued;	/* T */
+
+
 /*
  * A flag to force sched domain rebuild at the end of an operation.
  * It can be set in
@@ -298,6 +322,7 @@ void lockdep_assert_cpuset_lock_held(void)
  */
 void cpuset_full_lock(void)
 {
+	mutex_lock(&cpuset_top_mutex);
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 }
@@ -306,12 +331,14 @@ void cpuset_full_unlock(void)
 {
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	mutex_unlock(&cpuset_top_mutex);
 }
 
 #ifdef CONFIG_LOCKDEP
 bool lockdep_is_cpuset_held(void)
 {
-	return lockdep_is_held(&cpuset_mutex);
+	return lockdep_is_held(&cpuset_mutex) ||
+	       lockdep_is_held(&cpuset_top_mutex);
 }
 #endif
 
@@ -1302,30 +1329,53 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
-static void isolcpus_workfn(struct work_struct *work)
+/*
+ * housekeeping_update() will only be called if isolated_cpus differs
+ * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always
+ * be called just in case there are still pending sched domains changes.
+ */
+static void do_housekeeping_update(bool *flag)
 {
-	cpuset_full_lock();
-	if (isolated_cpus_updating) {
-		isolated_cpus_updating = false;
-		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
-		rebuild_sched_domains_locked();
+	bool update_hk = true;
+
+	guard(mutex)(&cpuset_top_mutex);
+	if (flag)
+		*flag = false;
+	scoped_guard(spinlock_irq, &callback_lock) {
+		if (cpumask_equal(isolated_hk_cpus, isolated_cpus))
+			update_hk = false;
+		else
+			cpumask_copy(isolated_hk_cpus, isolated_cpus);
 	}
-	cpuset_full_unlock();
+	if (update_hk)
+		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0);
+	rebuild_sched_domains();
+}
+
+static void isolcpus_workfn(struct work_struct *work)
+{
+	do_housekeeping_update(NULL);
+}
+
+static void isolcpus_tworkfn(struct callback_head *cb)
+{
+	/* Clear isolcpus_twork_queued */
+	do_housekeeping_update(&isolcpus_twork_queued);
 }
 
 /*
  * update_isolation_cpumasks - Update external isolation related CPU masks
- *
- * The following external CPU masks will be updated if necessary:
- *  - workqueue unbound cpumask
  */
 static void update_isolation_cpumasks(void)
 {
 	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
+	static struct callback_head twork_cb;
 
 	lockdep_assert_cpuset_lock_held();
 	if (!isolated_cpus_updating)
 		return;
+	else
+		isolated_cpus_updating = false;
 
 	/*
 	 * This function can be reached either directly from regular cpuset
@@ -1333,10 +1383,15 @@ static void update_isolation_cpumasks(void)
 	 * control file write or via CPU hotplug. In the latter case, it is
 	 * the per-cpu kthread that calls cpuset_handle_hotplug() on behalf
 	 * of the task that initiates CPU shutdown or bringup.
 	 *
-	 * To have better flexibility and prevent the possibility of deadlock
-	 * when calling from CPU hotplug, we defer the housekeeping_update()
-	 * call to after the current cpuset critical section has finished.
-	 * This is done via workqueue.
+	 * To have better flexibility and prevent the possibility of deadlock,
+	 * we defer the housekeeping_update() call to after the current
+	 * cpuset critical section has finished. This is done via task_work
+	 * for cpuset control file write and workqueue for CPU hotplug.
+	 *
+	 * When calling from CPU hotplug, cpuset_top_mutex is not held. So the
+	 * cpuset operation can run asynchronously with do_housekeeping_update().
+	 * This should not be a problem as another isolcpus_workfn() call will
+	 * be scheduled to make sure that housekeeping cpumasks will be updated.
 	 */
 	if (current->flags & PF_KTHREAD) {
 		/*
@@ -1352,8 +1407,19 @@ static void update_isolation_cpumasks(void)
 		return;
 	}
 
-	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
-	isolated_cpus_updating = false;
+	/*
+	 * update_isolation_cpumasks() may be called more than once in the
+	 * same cpuset_mutex critical section.
+	 */
+	lockdep_assert_held(&cpuset_top_mutex);
+	if (isolcpus_twork_queued)
+		return;
+
+	init_task_work(&twork_cb, isolcpus_tworkfn);
+	if (!task_work_add(current, &twork_cb, TWA_RESUME))
+		isolcpus_twork_queued = true;
+	else
+		WARN_ON_ONCE(1);	/* Current task shouldn't be exiting */
 }
 
 /**
@@ -3661,6 +3727,7 @@ int __init cpuset_init(void)
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b725d39c06e..ef152d401fe2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
 	struct cpumask *trial, *old = NULL;
 	int err;
 
-	lockdep_assert_cpus_held();
-
 	trial = kmalloc(cpumask_size(), GFP_KERNEL);
 	if (!trial)
 		return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
 	}
 
 	if (!housekeeping.flags)
-		static_branch_enable_cpuslocked(&housekeeping_overridden);
+		static_branch_enable(&housekeeping_overridden);
 
 	if (housekeeping.flags & HK_FLAG_DOMAIN)
 		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 6da9cd562b20..83428aa03aef 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
 	int cpu;
 
-	lockdep_assert_cpus_held();
-
 	if (!works)
 		return -ENOMEM;
 	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
 		return -ENOMEM;
@@ -1570,6 +1568,7 @@
 	/*
 	 * First set previously isolated CPUs as available (unisolate).
 	 * This cpumask contains only CPUs that switched to available now.
 	 */
+	guard(cpus_read_lock)();
 	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
 	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
 
@@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void)
 	cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 	/* Protect against RCU torture hotplug testing */
-	guard(cpus_read_lock)();
 	return tmigr_isolated_exclude_cpumask(cpumask);
 }
 late_initcall(tmigr_init_isolation);
-- 
2.52.0
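Aside: the task_work-style deferral in patch 3 (record that housekeeping must run while the locks are held, then run it after everything is dropped, on the way back to the caller, with a queued flag suppressing duplicate registration) can be sketched in userspace. All names here are illustrative analogues, not the kernel's task_work API:

/* Userspace sketch of the deferred-callback pattern from patch 3. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t top_mutex = PTHREAD_MUTEX_INITIALIZER;	/* ~cpuset_top_mutex */
static bool twork_queued;			/* ~isolcpus_twork_queued */
static void (*deferred_cb)(void);		/* ~pending task_work callback */

static void housekeeping_cb(void)
{
	pthread_mutex_lock(&top_mutex);		/* taken fresh, outside the write path */
	twork_queued = false;
	printf("housekeeping done after unlock\n");
	pthread_mutex_unlock(&top_mutex);
}

static void control_file_write(void)
{
	pthread_mutex_lock(&top_mutex);
	/* ... update isolated-CPU bookkeeping under the lock ... */
	if (!twork_queued) {			/* register the callback only once */
		twork_queued = true;
		deferred_cb = housekeeping_cb;
	}
	pthread_mutex_unlock(&top_mutex);

	/* "return to userspace": flush the deferred callback lock-free */
	if (deferred_cb) {
		void (*cb)(void) = deferred_cb;
		deferred_cb = NULL;
		cb();
	}
}

int main(void)
{
	control_file_write();
	return 0;
}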
From nobody Sat Feb 7 10:16:38 2026
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Waiman Long
Subject: [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls
Date: Fri, 6 Feb 2026 15:37:12 -0500
Message-ID: <20260206203712.1989610-5-longman@redhat.com>
In-Reply-To: <20260206203712.1989610-1-longman@redhat.com>
References: <20260206203712.1989610-1-longman@redhat.com>

Now that any changes to the HK_TYPE_DOMAIN housekeeping cpumask are
deferred to either task_work or a workqueue where the
rebuild_sched_domains() call will be issued, the current
rebuild_sched_domains_locked() call near the end of the cpuset critical
section can be removed in such cases.

Currently, a boolean force_sd_rebuild flag is used to decide if
rebuild_sched_domains_locked() needs to be invoked. To allow deferral
like that, change it to a tri-state sd_rebuild enumeration type.

Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d26c77a726b2..e224df321e34 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -173,7 +173,11 @@ static bool isolcpus_twork_queued;	/* T */
  * Note that update_relax_domain_level() in cpuset-v1.c can still call
  * rebuild_sched_domains_locked() directly without using this flag.
  */
-static bool force_sd_rebuild;	/* RWCS */
+static enum {
+	SD_NO_REBUILD = 0,
+	SD_REBUILD,
+	SD_DEFER_REBUILD,
+} sd_rebuild;	/* RWCS */
 
 /*
  * Partition root states:
@@ -990,7 +994,7 @@ void rebuild_sched_domains_locked(void)
 
 	lockdep_assert_cpus_held();
 	lockdep_assert_cpuset_lock_held();
-	force_sd_rebuild = false;
+	sd_rebuild = SD_NO_REBUILD;
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
@@ -1377,6 +1381,9 @@ static void update_isolation_cpumasks(void)
 	else
 		isolated_cpus_updating = false;
 
+	/* Defer rebuild_sched_domains() to task_work or wq */
+	sd_rebuild = SD_DEFER_REBUILD;
+
 	/*
 	 * This function can be reached either directly from regular cpuset
 	 * control file write or via CPU hotplug. In the latter case, it is
@@ -3011,7 +3018,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	update_partition_sd_lb(cs, old_prs);
 
 	notify_partition_change(cs, old_prs);
-	if (force_sd_rebuild)
+	if (sd_rebuild == SD_REBUILD)
 		rebuild_sched_domains_locked();
 	free_tmpmasks(&tmpmask);
 	return 0;
@@ -3288,7 +3295,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	}
 
 	free_cpuset(trialcs);
-	if (force_sd_rebuild)
+	if (sd_rebuild == SD_REBUILD)
 		rebuild_sched_domains_locked();
 out_unlock:
 	cpuset_full_unlock();
@@ -3771,7 +3778,8 @@ hotplug_update_tasks(struct cpuset *cs,
 
 void cpuset_force_rebuild(void)
 {
-	force_sd_rebuild = true;
+	if (!sd_rebuild)
+		sd_rebuild = SD_REBUILD;
 }
 
 /**
@@ -3981,7 +3989,7 @@ static void cpuset_handle_hotplug(void)
 	}
 
 	/* rebuild sched domains if necessary */
-	if (force_sd_rebuild)
+	if (sd_rebuild == SD_REBUILD)
 		rebuild_sched_domains_cpuslocked();
 
 	free_tmpmasks(ptmp);
-- 
2.52.0
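Aside: the boolean-to-tri-state change in patch 4 is easy to see in miniature. A third state lets the flush point distinguish "rebuild now" from "a deferred worker will rebuild later", and keeps a plain force-rebuild request from downgrading a deferral. A standalone sketch with the same state names, not the kernel code:

/* Minimal illustration of the tri-state rebuild flag from patch 4. */
#include <stdio.h>

enum sd_rebuild_state {
	SD_NO_REBUILD = 0,	/* nothing to do */
	SD_REBUILD,		/* rebuild before leaving the critical section */
	SD_DEFER_REBUILD,	/* a deferred worker will rebuild instead */
};

static enum sd_rebuild_state sd_rebuild;

static void rebuild_sched_domains_locked(void)
{
	sd_rebuild = SD_NO_REBUILD;		/* request consumed */
	printf("sched domains rebuilt inline\n");
}

static void force_rebuild(void)			/* ~cpuset_force_rebuild() */
{
	if (!sd_rebuild)			/* don't downgrade a deferral */
		sd_rebuild = SD_REBUILD;
}

static void end_of_critical_section(void)
{
	if (sd_rebuild == SD_REBUILD)		/* SD_DEFER_REBUILD falls through */
		rebuild_sched_domains_locked();
}

int main(void)
{
	force_rebuild();
	end_of_critical_section();		/* rebuilds inline */

	sd_rebuild = SD_DEFER_REBUILD;
	force_rebuild();			/* no-op: the deferral wins */
	end_of_critical_section();		/* nothing happens here */
	return 0;
}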