From nobody Tue Dec 2 02:18:55 2025 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D6849335BBB for ; Wed, 19 Nov 2025 09:56:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763546167; cv=none; b=cnAHQSUCsGSAMkbtJWzT3lxoieB8pRCNdHJfW8/BarXXqnZ1dD4jhc10t8sIrL7JXATKme0UoGsY+TUS4VrPhzqLZiC9CZwuKb2R1eC4hJf/uUtmrcE7vZ1ezD23UWl/kT5RoxJytZqc98fS3tMl2QqIGg/MG0oydhjJVSI8Yo4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763546167; c=relaxed/simple; bh=UgCUqkymhZ+rmMa555UKNFjPFwx6eolPXr/7CtqA/MM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=WmZoL5LVEZzkGQrMTs5NJ6aG7n2RbhefJlE6Hbn7NEHomoLfbniY5JH3V3s/7stvEk0OJI+Fd3IXc7AMz0Od8wHYAaIY2ueDR2/lD5iH1vS88lgIiqZx/rOnb1551yPfwMPVkXAIT4r0lrkacO+RFj51bBekF4IgRfUgjJy5JP0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=BBKrSXEX; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="BBKrSXEX" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1763546164; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: 
content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=M9tuRk81dBfhNfeLrLSKOJE/gf6lbseEeNlG9tm14p4=; b=BBKrSXEXLtIRLhNK55cHpqUk4jalfI8dTkPTIepLChQMLoHd6XXgHxpzD1BYyaL4j1vt31 H9+UYVEFPk0oAIMyZgtBFIMCiSuoWX+oa2vFVGd95Shdbz9tlLwVQblztyEapmg9uffIPz t01aiepFtVA1QjgEKuO9ICj06/iSwfk= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-13-vBF4cGNCNTmTXumyMJR6Eg-1; Wed, 19 Nov 2025 04:55:59 -0500 X-MC-Unique: vBF4cGNCNTmTXumyMJR6Eg-1 X-Mimecast-MFC-AGG-ID: vBF4cGNCNTmTXumyMJR6Eg_1763546158 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id ECB7919560A3; Wed, 19 Nov 2025 09:55:56 +0000 (UTC) Received: from fedora.redhat.com (unknown [10.72.112.57]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id CBFC8180049F; Wed, 19 Nov 2025 09:55:47 +0000 (UTC) From: Pingfan Liu To: cgroups@vger.kernel.org Cc: Pingfan Liu , Waiman Long , Chen Ridong , Peter Zijlstra , Juri Lelli , Pierre Gondois , Ingo Molnar , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tejun Heo , Johannes Weiner , mkoutny@suse.com, linux-kernel@vger.kernel.org Subject: [PATCHv7 1/2] cgroup/cpuset: Introduce cpuset_cpus_allowed_locked() Date: Wed, 19 Nov 2025 17:55:24 +0800 Message-ID: <20251119095525.12019-2-piliu@redhat.com> In-Reply-To: <20251119095525.12019-1-piliu@redhat.com> References: <20251119095525.12019-1-piliu@redhat.com> Precedence: bulk 
X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 cpuset_cpus_allowed() uses a reader lock that is sleepable under RT, which means it cannot be called inside raw_spin_lock_t context. Introduce a new cpuset_cpus_allowed_locked() helper that performs the same function as cpuset_cpus_allowed() except that the caller must have acquired the cpuset_mutex so that no further locking will be needed. Suggested-by: Waiman Long Signed-off-by: Pingfan Liu Cc: Waiman Long Cc: Tejun Heo Cc: Johannes Weiner Cc: "Michal Koutn=C3=BD" Cc: linux-kernel@vger.kernel.org To: cgroups@vger.kernel.org Reviewed-by: Chen Ridong Reviewed-by: Waiman Long --- include/linux/cpuset.h | 9 +++++++- kernel/cgroup/cpuset.c | 51 +++++++++++++++++++++++++++++------------- 2 files changed, 44 insertions(+), 16 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index 2ddb256187b51..a98d3330385c2 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -74,6 +74,7 @@ extern void inc_dl_tasks_cs(struct task_struct *task); extern void dec_dl_tasks_cs(struct task_struct *task); extern void cpuset_lock(void); extern void cpuset_unlock(void); +extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpuma= sk *mask); extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mas= k); extern bool cpuset_cpus_allowed_fallback(struct task_struct *p); extern bool cpuset_cpu_is_isolated(int cpu); @@ -195,10 +196,16 @@ static inline void dec_dl_tasks_cs(struct task_struct= *task) { } static inline void cpuset_lock(void) { } static inline void cpuset_unlock(void) { } =20 +static inline void cpuset_cpus_allowed_locked(struct task_struct *p, + struct cpumask *mask) +{ + cpumask_copy(mask, task_cpu_possible_mask(p)); +} + static inline void cpuset_cpus_allowed(struct 
task_struct *p, struct cpumask *mask) { - cpumask_copy(mask, task_cpu_possible_mask(p)); + cpuset_cpus_allowed_locked(p, mask); } =20 static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 52468d2c178a3..7a179a1a2e30a 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -4116,24 +4116,13 @@ void __init cpuset_init_smp(void) BUG_ON(!cpuset_migrate_mm_wq); } =20 -/** - * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset. - * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed. - * @pmask: pointer to struct cpumask variable to receive cpus_allowed set. - * - * Description: Returns the cpumask_var_t cpus_allowed of the cpuset - * attached to the specified @tsk. Guaranteed to return some non-empty - * subset of cpu_active_mask, even if this means going outside the - * tasks cpuset, except when the task is in the top cpuset. - **/ - -void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask) +/* + * Return cpus_allowed mask from a task's cpuset. + */ +static void __cpuset_cpus_allowed_locked(struct task_struct *tsk, struct c= pumask *pmask) { - unsigned long flags; struct cpuset *cs; =20 - spin_lock_irqsave(&callback_lock, flags); - cs =3D task_cs(tsk); if (cs !=3D &top_cpuset) guarantee_active_cpus(tsk, pmask); @@ -4153,7 +4142,39 @@ void cpuset_cpus_allowed(struct task_struct *tsk, st= ruct cpumask *pmask) if (!cpumask_intersects(pmask, cpu_active_mask)) cpumask_copy(pmask, possible_mask); } +} =20 +/** + * cpuset_cpus_allowed_locked - return cpus_allowed mask from a task's cpu= set. + * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed. + * @pmask: pointer to struct cpumask variable to receive cpus_allowed set. + * + * Similar to cpuset_cpus_allowed() except that the caller must have acqui= red + * cpuset_mutex. 
+ */ +void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *p= mask) +{ + lockdep_assert_held(&cpuset_mutex); + __cpuset_cpus_allowed_locked(tsk, pmask); +} + +/** + * cpuset_cpus_allowed - return cpus_allowed mask from a task's cpuset. + * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed. + * @pmask: pointer to struct cpumask variable to receive cpus_allowed set. + * + * Description: Returns the cpumask_var_t cpus_allowed of the cpuset + * attached to the specified @tsk. Guaranteed to return some non-empty + * subset of cpu_active_mask, even if this means going outside the + * tasks cpuset, except when the task is in the top cpuset. + **/ + +void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask) +{ + unsigned long flags; + + spin_lock_irqsave(&callback_lock, flags); + __cpuset_cpus_allowed_locked(tsk, pmask); spin_unlock_irqrestore(&callback_lock, flags); } =20 --=20 2.49.0 From nobody Tue Dec 2 02:18:55 2025 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 74A1034575A for ; Wed, 19 Nov 2025 09:56:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763546178; cv=none; b=ELLpbs1bJgm54Jsq8U+FQMaUaNxoCKsIwYR215u3N5KxKGJffSKxyV1hpoBT/M2sZLNtUjmmc9wSAN0UugPg0JgFWkKUOm9+5vpZqjegM99sqi9zBHPWNyeei43yDpWN+u/S+VKNLlyd01deajmDN0AjD5nd3ONl6xiX2RDvedc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763546178; c=relaxed/simple; bh=3Cx9A5Lqaz7WMQUmErzsXx881CMEh/rjrUn6EUgVTKk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; 
b=KHTZpMSHDuJdobpYm+iBtNTZar658hvDKLSqU5oGDEt/ZxG5ezCPQ1VTvF44XR3AmNQPCiqfPbCaEEG0BFEGHoxgLZWjkdljlXae3idx6dCvHbaXEwGcgnf2obHD66ECpRz0Gols7n694VsQwbueKqaCQ7G9oFiUYi53w5ScIvU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=ZJlhX2kP; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="ZJlhX2kP" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1763546175; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6Ej4ofB7euYq06zHr5+MO0e+3TVegk07xXw/9Xr6vlA=; b=ZJlhX2kPmZ4M6TDIF7KRl84ddXVa59cFM+/APUXKoHlUl4VkoBsBc35DIJAhTD9c4hTtwF FbNGF4M1i1GB67zFKGo1LqVCjb3y5Mupv7259iFwh+phR7eAdPLRFuYg+j6V697uduaejd C1FGhfp2WvTOA3pw0kI6eTg3bHPywRg= Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-266-jW6MDMXtOtm8R0DHuUe-hw-1; Wed, 19 Nov 2025 04:56:12 -0500 X-MC-Unique: jW6MDMXtOtm8R0DHuUe-hw-1 X-Mimecast-MFC-AGG-ID: jW6MDMXtOtm8R0DHuUe-hw_1763546170 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 
bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 39BAC18AB406; Wed, 19 Nov 2025 09:56:09 +0000 (UTC) Received: from fedora.redhat.com (unknown [10.72.112.57]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id C574F18002B6; Wed, 19 Nov 2025 09:55:57 +0000 (UTC) From: Pingfan Liu To: linux-kernel@vger.kernel.org Cc: Pingfan Liu , Waiman Long , Chen Ridong , Peter Zijlstra , Juri Lelli , Pierre Gondois , Ingo Molnar , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tejun Heo , Johannes Weiner , mkoutny@suse.com Subject: [PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug Date: Wed, 19 Nov 2025 17:55:25 +0800 Message-ID: <20251119095525.12019-3-piliu@redhat.com> In-Reply-To: <20251119095525.12019-1-piliu@redhat.com> References: <20251119095525.12019-1-piliu@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111

*** Bug description ***

When testing kexec-reboot on a 144-CPU machine with
isolcpus=3Dmanaged_irq,domain,1-71,73-143 on the kernel command line,
I encountered the following bug:

[ 97.114759] psci: CPU142 killed (polled 0 ms)
[ 97.333236] Failed to offline CPU143 - error=3D-16
[ 97.333246] ------------[ cut here ]------------
[ 97.342682] kernel BUG at kernel/cpu.c:1569!
[ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[...]

In essence, the issue originates from the CPU hot-removal process and
is not limited to kexec. It can be reproduced by writing a
SCHED_DEADLINE program that waits indefinitely on a semaphore, spawning
multiple instances to ensure some run on CPU 72, and then offlining
CPUs 1=E2=80=93143 one by one.
When attempting this, CPU 143 failed to go offline.

  bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'

Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
However, that -EBUSY does not reflect reality; it is caused by the
following factors:

When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. A
blocked-state deadline task (in this case, "cppc_fie") was not migrated
to CPU0, so its task_rq() information is stale and its rq->rd points to
def_root_domain instead of the one shared with CPU0. As a result, its
bandwidth is accounted to the wrong root domain during domain rebuild.

*** Issue ***

The key point is that root_domain is only tracked through active
rq->rd. To avoid using a global data structure to track all
root_domains in the system, there should be a method to locate an
active CPU within the corresponding root_domain.

*** Solution ***

To locate the active CPU, the following rules of the deadline
subsystem are useful:
  -1. Any CPU belongs to exactly one root domain at a given time.
  -2. The DL bandwidth checker ensures that a root domain with reserved
      DL bandwidth keeps at least one active CPU.

Now, let's examine the blocked-state task P.

If P is attached to a cpuset that is a partition root, it is
straightforward to find an active CPU.

If P is attached to a cpuset that has changed from 'root' to 'member',
the active CPUs are grouped into the parent root domain. Naturally, the
CPUs' capacity and reserved DL bandwidth are taken into account in the
ancestor root domain. (In practice, it may be unsafe to attach P to an
arbitrary root domain, since that domain may lack sufficient DL
bandwidth for P.) Again, it is straightforward to find an active CPU in
the ancestor root domain.

This patch groups CPUs into isolated and housekeeping sets.
For the housekeeping group, it walks up the cpuset hierarchy to find active CPUs in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd. Signed-off-by: Pingfan Liu Cc: Waiman Long Cc: Chen Ridong Cc: Peter Zijlstra Cc: Juri Lelli Cc: Pierre Gondois Cc: Ingo Molnar Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Valentin Schneider To: linux-kernel@vger.kernel.org --- kernel/sched/deadline.c | 54 ++++++++++++++++++++++++++++++++++++----- 1 file changed, 48 insertions(+), 6 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 7b7671060bf9e..194a341e85864 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2465,6 +2465,7 @@ static struct task_struct *pick_earliest_pushable_dl_= task(struct rq *rq, int cpu return NULL; } =20 +/* Access rule: must be called on local CPU with preemption disabled */ static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl); =20 static int find_later_rq(struct task_struct *task) @@ -2907,11 +2908,43 @@ void __init init_sched_dl_class(void) GFP_KERNEL, cpu_to_node(i)); } =20 +/* + * This function always returns a non-empty bitmap in @cpus. This is becau= se + * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth + * check will prevent CPU hotplug from deactivating all CPUs in that domai= n. + */ +static void dl_get_task_effective_cpus(struct task_struct *p, struct cpuma= sk *cpus) +{ + const struct cpumask *hk_msk; + + hk_msk =3D housekeeping_cpumask(HK_TYPE_DOMAIN); + if (housekeeping_enabled(HK_TYPE_DOMAIN)) { + if (!cpumask_intersects(p->cpus_ptr, hk_msk)) { + /* + * CPUs isolated by isolcpus=3D"domain" always belong to + * def_root_domain. + */ + cpumask_andnot(cpus, cpu_active_mask, hk_msk); + return; + } + } + + /* + * If a root domain holds a DL task, it must have active CPUs. So + * active CPUs can always be found by walking up the task's cpuset + * hierarchy up to the partition root. 
+ */ + cpuset_cpus_allowed_locked(p, cpus); +} + +/* The caller should hold cpuset_mutex */ void dl_add_task_root_domain(struct task_struct *p) { struct rq_flags rf; struct rq *rq; struct dl_bw *dl_b; + unsigned int cpu; + struct cpumask *msk =3D this_cpu_cpumask_var_ptr(local_cpu_mask_dl); =20 raw_spin_lock_irqsave(&p->pi_lock, rf.flags); if (!dl_task(p) || dl_entity_is_special(&p->dl)) { @@ -2919,16 +2952,25 @@ void dl_add_task_root_domain(struct task_struct *p) return; } =20 - rq =3D __task_rq_lock(p, &rf); - + /* + * Get an active rq, whose rq->rd traces the correct root + * domain. + * Ideally this would be under cpuset reader lock until rq->rd is + * fetched. However, sleepable locks cannot nest inside pi_lock, so we + * rely on the caller of dl_add_task_root_domain() holding 'cpuset_mutex' + * to guarantee the CPU stays in the cpuset. + */ + dl_get_task_effective_cpus(p, msk); + cpu =3D cpumask_first_and(cpu_active_mask, msk); + BUG_ON(cpu >=3D nr_cpu_ids); + rq =3D cpu_rq(cpu); dl_b =3D &rq->rd->dl_bw; - raw_spin_lock(&dl_b->lock); + /* End of fetching rd */ =20 + raw_spin_lock(&dl_b->lock); __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span)); - raw_spin_unlock(&dl_b->lock); - - task_rq_unlock(rq, p, &rf); + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); } =20 void dl_clear_root_domain(struct root_domain *rd) --=20 2.49.0