From nobody Thu Jun 11 01:40:31 2026
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v5 01/11] x86,fs/resctrl: Prevent out-of-bounds access while offlining CPU when SNC enabled
Date: Tue, 9 Jun 2026 14:02:27 -0700
Message-ID: <16137433df42f85013b2f7a53626795cbd6637b9.1781029125.git.reinette.chatre@intel.com>

The architecture updates the
cpu_mask in a domain's header to track which online CPUs are associated
with the domain. When this mask becomes empty, the architecture initiates
offlining of the domain, which includes calling on resctrl fs to offline
the domain. If it is a monitoring domain in which LLC occupancy is
tracked, resctrl fs forces the limbo handler to clear all busy RMID state
associated with the domain.

The limbo handler always reads the current event value associated with a
busy RMID, irrespective of whether the RMID is being checked as part of
the regular "is it still busy" check or will be forcibly released anyway.

When reading an RMID on a system with SNC enabled, the "logical RMID" is
converted to the "physical RMID". This conversion requires the NUMA node
ID of the resctrl monitoring domain, which in turn is determined by
querying the NUMA node ID of any CPU belonging to the monitoring domain.
When the monitoring domain is going offline its cpu_mask is empty, causing
the NUMA node ID query via cpu_to_node() to be done with "nr_cpu_ids" as
argument, resulting in an out-of-bounds access.

Refactor the limbo handler to skip reading the RMID when the RMID will be
forced to no longer be dirty in the domain anyway. Add a safety check to
the architecture's RMID reader to protect against this scenario.

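The failure mode described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not kernel code:
NR_CPUS_MODEL, cpu_to_node_model, struct domain_model, first_cpu() and
read_node_id() are hypothetical stand-ins for nr_cpu_ids, the per-CPU node
lookup behind cpu_to_node(), the domain header, cpumask_any() and the guarded
RMID reader.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: 8 possible CPUs, standing in for nr_cpu_ids. */
#define NR_CPUS_MODEL 8

/* Stand-in for a per-CPU NUMA node table; only indices 0..7 are valid. */
static const int cpu_to_node_model[NR_CPUS_MODEL] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Stand-in for a domain header that tracks its online CPUs in a bitmask. */
struct domain_model {
	int id;
	uint32_t cpu_mask;
};

/*
 * Models the "pick any CPU" helper: returns the lowest set bit, or
 * NR_CPUS_MODEL when the mask is empty.
 */
static int first_cpu(uint32_t mask)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		if (mask & (UINT32_C(1) << cpu))
			return cpu;
	}
	return NR_CPUS_MODEL;
}

/*
 * Models the guarded reader: refuse to look up the node of an empty
 * domain instead of indexing the table with NR_CPUS_MODEL.
 */
static int read_node_id(const struct domain_model *d, int *node)
{
	int cpu;

	if (d->cpu_mask == 0)
		return -EINVAL;

	cpu = first_cpu(d->cpu_mask);
	assert(cpu < NR_CPUS_MODEL);
	*node = cpu_to_node_model[cpu];
	return 0;
}
```

Without the empty-mask check, first_cpu() would hand NR_CPUS_MODEL to the
table lookup, which is one element past the end of the array. In this model
the caller sees -EINVAL for an empty domain and a valid node otherwise.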
Fixes: e13db55b5a0d ("x86/resctrl: Introduce snc_nodes_per_l3_cache")
Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/cover.1780456704.git.reinette.chatre%40intel.com?part=9
Signed-off-by: Reinette Chatre
---
Changes since v4:
- New patch
---
 arch/x86/kernel/cpu/resctrl/monitor.c |  5 ++++
 fs/resctrl/monitor.c                  | 39 +++++++++++++++------------
 2 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 03ee6102ab07..569894d6e5c8 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -259,6 +259,11 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
 	if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
 		return -EINVAL;
 
+	if (cpumask_empty(&hdr->cpu_mask)) {
+		pr_warn_once("Domain %d has no CPUs\n", hdr->id);
+		return -EINVAL;
+	}
+
 	d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
 	hw_dom = resctrl_to_arch_mon_dom(d);
 	cpu = cpumask_any(&hdr->cpu_mask);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 0e6a389a16bf..a932a1fea818 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -135,10 +135,10 @@ void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
 	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	struct rmid_entry *entry;
+	bool rmid_dirty = true;
 	u32 idx, cur_idx = 1;
 	void *arch_mon_ctx;
 	void *arch_priv;
-	bool rmid_dirty;
 	u64 val = 0;
 
 	arch_priv = mon_event_all[QOS_L3_OCCUP_EVENT_ID].arch_priv;
@@ -161,22 +161,27 @@ void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
 			break;
 
 		entry = __rmid_entry(idx);
-		if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid, entry->rmid,
-					   QOS_L3_OCCUP_EVENT_ID, arch_priv, &val,
-					   arch_mon_ctx)) {
-			rmid_dirty = true;
-		} else {
-			rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
-
-			/*
-			 * x86's CLOSID and RMID are independent numbers, so the entry's
-			 * CLOSID is an empty CLOSID (X86_RESCTRL_EMPTY_CLOSID). On Arm the
-			 * RMID (PMG) extends the CLOSID (PARTID) space with bits that aren't
-			 * used to select the configuration. It is thus necessary to track both
-			 * CLOSID and RMID because there may be dependencies between them
-			 * on some architectures.
-			 */
-			trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->hdr.id, val);
+		if (!force_free) {
+			if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid,
+						   entry->rmid, QOS_L3_OCCUP_EVENT_ID,
+						   arch_priv, &val, arch_mon_ctx)) {
+				rmid_dirty = true;
+			} else {
+				rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
+
+				/*
+				 * x86's CLOSID and RMID are independent numbers,
+				 * so the entry's CLOSID is an empty CLOSID
+				 * (X86_RESCTRL_EMPTY_CLOSID). On Arm the RMID
+				 * (PMG) extends the CLOSID (PARTID) space with
+				 * bits that aren't used to select the configuration.
+				 * It is thus necessary to track both CLOSID and
+				 * RMID because there may be dependencies between
+				 * them on some architectures.
+				 */
+				trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid,
+							      d->hdr.id, val);
+			}
 		}
 
 		if (force_free || !rmid_dirty) {
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v5 02/11] x86,fs/resctrl: Document safe RCU list traversal
Date: Tue, 9 Jun 2026 14:02:28 -0700
Message-ID: <6e254c033538d597fa0f968827ae3857a5d8339c.1781029125.git.reinette.chatre@intel.com>

rdt_resource::ctrl_domains and rdt_resource::mon_domains are RCU lists
with entries added and removed by the architecture from CPU hotplug
callbacks that run with cpus_write_lock() held. These lists can be
traversed safely from resctrl fs by either holding cpus_read_lock() or
relying on an RCU read-side critical section.

resctrl fs traversals of rdt_resource::ctrl_domains and
rdt_resource::mon_domains are done using list_for_each_entry() with
cpus_read_lock() held. Similarly, x86 architecture callbacks use
list_for_each_entry() expecting that resctrl fs makes the call with
cpus_read_lock() held. A lockdep_assert_cpus_held() documents this safe
RCU list traversal, but it is used inconsistently and at varying distance
from the list_for_each_entry() call.

In preparation for an upcoming traversal of rdt_resource::ctrl_domains
that needs to be done from an RCU read-side critical section, developers
must always know exactly in which context the list is being traversed.

Replace the list_for_each_entry() traversals of the RCU lists with
list_for_each_entry_rcu() to document that an RCU list is being
traversed, making use of the built-in lockdep expression to additionally
document that it is cpus_read_lock() that allows the list to be traversed
outside an RCU read-side critical section. Only fall back to documenting
the safety of the traversal with a comment when lockdep does not have the
needed visibility, as in functions called via smp_call*().

The lockdep expression within list_for_each_entry_rcu() depends on
RCU_EXPERT, which is not set in a typical debug kernel, so keep the
existing lockdep_assert_cpus_held() that is active with CONFIG_LOCKDEP=y
found in a typical debug kernel.

Signed-off-by: Reinette Chatre
---
Changes since v3:
- New patch.
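The traversal rule described in this changelog can be sketched in user space
as follows. This is a simplified model, not the kernel implementation: the
flags model_cpus_lock_held and model_rcu_read_held, struct model_domain, the
macro model_for_each_domain_rcu() and model_find_domain() are all hypothetical
names. The macro only mimics the idea that a walk is legitimate either inside
an RCU read-side section or when the caller-supplied condition holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lock state that lockdep would track. */
static bool model_cpus_lock_held;
static bool model_rcu_read_held;

/* Hypothetical singly linked domain list entry. */
struct model_domain {
	int id;
	struct model_domain *next;
};

/*
 * Models a list walk with a lockdep-style condition: the traversal is
 * only legitimate inside an RCU read-side section or when cond is true.
 */
#define model_for_each_domain_rcu(pos, head, cond)			\
	for (assert(model_rcu_read_held || (cond)), (pos) = (head);	\
	     (pos) != NULL; (pos) = (pos)->next)

/* Finds a domain by id; the condition names the lock the caller relies on. */
static struct model_domain *model_find_domain(struct model_domain *head, int id)
{
	struct model_domain *d;

	model_for_each_domain_rcu(d, head, model_cpus_lock_held) {
		if (d->id == id)
			return d;
	}
	return NULL;
}
```

Placing the condition in the iterator keeps the justification for the
traversal next to the traversal itself, which is the documentation benefit
this patch is after.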
---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  4 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  4 ++--
 fs/resctrl/ctrlmondata.c                  | 12 +++++++-----
 fs/resctrl/monitor.c                      | 23 +++++++++++++---------
 fs/resctrl/pseudo_lock.c                  |  2 +-
 fs/resctrl/rdtgroup.c                     | 24 +++++++++++------------
 7 files changed, 39 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b20e705606b8..e74f1ed54b86 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -53,7 +53,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -115,7 +115,7 @@ static void _resctrl_sdciae_enable(struct rdt_resource *r, bool enable)
 	lockdep_assert_cpus_held();
 
 	/* Update MSR_IA32_L3_QOS_EXT_CFG MSR on all the CPUs in all domains */
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list)
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held())
 		on_each_cpu_mask(&d->hdr.cpu_mask, resctrl_sdciae_set_one_amd, &enable, 1);
 }
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 569894d6e5c8..430b8fae0b77 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -510,7 +510,7 @@ static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
 
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		on_each_cpu_mask(&d->hdr.cpu_mask, resctrl_abmc_set_one_amd,
 				 &enable, 1);
 		resctrl_arch_reset_rmid_all(r, d);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 885026468440..5ffa39fa86fa 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -151,7 +151,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r_l->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -249,7 +249,7 @@ void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 		for (i = 0; i < hw_res->num_closid; i++)
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 9a7dfc48cb2e..f33712c17d38 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -261,7 +261,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.closid = rdtgrp->closid;
@@ -397,7 +397,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema,
 
 	if (resource_name)
 		seq_printf(s, "%*s:", max_name_width, resource_name);
-	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -535,6 +535,8 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
 	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
+	lockdep_assert_cpus_held();
+
 	list_for_each(l, h) {
 		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
@@ -717,7 +719,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		 * struct mon_data. Search all domains in the resource for
 		 * one that matches this cache id.
 		 */
-		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+		list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 			if (d->ci_id == domid) {
 				cpu = cpumask_any(&d->hdr.cpu_mask);
 				ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
@@ -817,7 +819,7 @@ static int resctrl_io_alloc_init_cbm(struct resctrl_schema *s, u32 closid)
 	/* Keep CDP_CODE and CDP_DATA of io_alloc CLOSID's CBM in sync. */
 	if (resctrl_arch_get_cdp_enabled(r->rid)) {
 		peer_type = resctrl_peer_type(s->conf_type);
-		list_for_each_entry(d, &s->res->ctrl_domains, hdr.list)
+		list_for_each_entry_rcu(d, &s->res->ctrl_domains, hdr.list, lockdep_is_cpus_held())
 			memcpy(&d->staged_config[peer_type],
 			       &d->staged_config[s->conf_type],
 			       sizeof(d->staged_config[0]));
@@ -980,7 +982,7 @@ static int resctrl_io_alloc_parse_line(char *line, struct rdt_resource *r,
 	}
 
 	dom = strim(dom);
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (update_all || d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.mode = RDT_MODE_SHAREABLE;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index a932a1fea818..2dacb589625d 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -309,7 +309,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	entry->busy = 0;
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		/*
 		 * For the first limbo RMID in the domain,
 		 * setup up the limbo worker.
@@ -507,6 +507,11 @@ static int __l3_mon_event_count_sum(struct rdtgroup *rdtgrp, struct rmid_read *r
 	 * all domains fail for any reason.
 	 */
 	ret = -EINVAL;
+	/*
+	 * RCU list being traversed with CPU hotplug lock held. lockdep
+	 * unable to help prove this here since this work is scheduled via
+	 * smp_call*(). Not called from MBM overflow handler.
+	 */
 	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
 		if (d->ci_id != rr->ci->id)
 			continue;
@@ -1231,7 +1236,7 @@ static int rdtgroup_assign_cntr_event(struct rdt_l3_mon_domain *d, struct rdtgro
 	int ret = 0;
 
 	if (!d) {
-		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+		list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 			int err;
 
 			err = rdtgroup_alloc_assign_cntr(r, d, rdtgrp, mevt);
@@ -1303,7 +1308,7 @@ static void rdtgroup_unassign_cntr_event(struct rdt_l3_mon_domain *d, struct rdt
 	struct rdt_resource *r = resctrl_arch_get_resource(mevt->rid);
 
 	if (!d) {
-		list_for_each_entry(d, &r->mon_domains, hdr.list)
+		list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held())
 			rdtgroup_free_unassign_cntr(r, d, rdtgrp, mevt);
 	} else {
 		rdtgroup_free_unassign_cntr(r, d, rdtgrp, mevt);
@@ -1375,7 +1380,7 @@ static void rdtgroup_update_cntr_event(struct rdt_resource *r, struct rdtgroup *
 	struct rdt_l3_mon_domain *d;
 	int cntr_id;
 
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
 		if (cntr_id >= 0)
 			rdtgroup_assign_cntr(r, d, evtid, rdtgrp->mon.rmid,
@@ -1545,7 +1550,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of, char *buf,
 		/*
 		 * Reset all the non-achitectural RMID state and assignable counters.
 		 */
-		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+		list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 			mbm_cntr_free_all(r, d);
 			resctrl_reset_rmid_all(r, d);
 		}
@@ -1568,7 +1573,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
 			seq_putc(s, ';');
 
@@ -1602,7 +1607,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
 		goto out_unlock;
 	}
 
-	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
 			seq_putc(s, ';');
 
@@ -1652,7 +1657,7 @@ int mbm_L3_assignments_show(struct kernfs_open_file *of, struct seq_file *s, voi
 
 		sep = false;
 		seq_printf(s, "%s:", mevt->name);
-		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+		list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 			if (sep)
 				seq_putc(s, ';');
 
@@ -1750,7 +1755,7 @@ static int resctrl_parse_mbm_assignment(struct rdt_resource *r, struct rdtgroup
 	}
 
 	/* Verify if the dom_id is valid */
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (d->hdr.id == dom_id) {
 			ret = rdtgroup_modify_assign_state(dom_str, d, rdtgrp, mevt);
 			if (ret) {
diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c
index d1cb0986006e..dea2b4bf966f 100644
--- a/fs/resctrl/pseudo_lock.c
+++ b/fs/resctrl/pseudo_lock.c
@@ -656,7 +656,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
+		list_for_each_entry_rcu(d_i, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index af2cbab14497..2a6221925767 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -117,7 +117,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
+		list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held())
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -1063,7 +1063,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
-	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
 			seq_putc(seq, ';');
 		hw_shareable = r->cache.shareable_bits;
@@ -1415,7 +1415,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+		list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1604,7 +1604,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+		list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1649,7 +1649,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1763,7 +1763,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (d->hdr.id == dom_id) {
 			mbm_config_write_domain(r, d, evtid, val);
 			goto next;
@@ -2554,7 +2554,7 @@ static int set_mba_sc(bool mba_sc)
 
 	rdtgroup_default.mba_mbps_event = mba_mbps_default_event;
 
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2879,7 +2879,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (resctrl_is_mbm_enabled()) {
 		r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
-		list_for_each_entry(dom, &r->mon_domains, hdr.list)
+		list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_held())
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
 						   RESCTRL_PICK_ANY_CPU);
 	}
@@ -3435,7 +3435,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(hdr, &r->mon_domains, list) {
+	list_for_each_entry_rcu(hdr, &r->mon_domains, list, lockdep_is_cpus_held()) {
 		ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
 		if (ret)
 			return ret;
@@ -3620,7 +3620,7 @@ int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_ctrl_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &s->res->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3635,7 +3635,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_ctrl_domain *d;
 
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -4506,7 +4506,7 @@ static struct rdt_l3_mon_domain *get_mon_domain_from_cpu(int cpu,
 
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v5 03/11] fs/resctrl: Move functions to avoid forward references in subsequent fixes
Date: Tue, 9 Jun 2026 14:02:29 -0700
Message-ID: <8217111523c55851d88b12cb5f6c0413a9732b4d.1781029125.git.reinette.chatre@intel.com>

From: Tony Luck

rdt_get_tree() manages resctrl fs mount and rdt_kill_sb() manages resctrl
fs unmount. There is significant overlap between error cleanup during
resctrl mount failure and cleanup on resctrl unmount yet the cleanup is
not done consistently in these two flows.

Pull some cleanup functions before rdt_get_tree() in preparation for a
new helper that can be shared between mount and unmount.

Signed-off-by: Tony Luck
Signed-off-by: Reinette Chatre
Reviewed-by: Ben Horgan
---
Changes since V2:
- Rewrite changelog.

Changes since V3:
- Add Ben's Reviewed-by tag.
---
 fs/resctrl/rdtgroup.c | 376 +++++++++++++++++++++---------------------
 1 file changed, 188 insertions(+), 188 deletions(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 2a6221925767..2b624cf02147 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2792,6 +2792,194 @@ static void schemata_list_destroy(void)
 	}
 }
 
+/*
+ * Move tasks from one to the other group. If @from is NULL, then all tasks
+ * in the systems are moved unconditionally (used for teardown).
+ *
+ * If @mask is not NULL the cpus on which moved tasks are running are set
+ * in that mask so the update smp function call is restricted to affected
+ * cpus.
+ */
+static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
+				 struct cpumask *mask)
+{
+	struct task_struct *p, *t;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(p, t) {
+		if (!from || is_closid_match(t, from) ||
+		    is_rmid_match(t, from)) {
+			resctrl_arch_set_closid_rmid(t, to->closid,
+						     to->mon.rmid);
+
+			/*
+			 * Order the closid/rmid stores above before the loads
+			 * in task_curr(). This pairs with the full barrier
+			 * between the rq->curr update and
+			 * resctrl_arch_sched_in() during context switch.
+			 */
+			smp_mb();
+
+			/*
+			 * If the task is on a CPU, set the CPU in the mask.
+			 * The detection is inaccurate as tasks might move or
+			 * schedule before the smp function call takes place.
+			 * In such a case the function call is pointless, but
+			 * there is no other side effect.
+			 */
+			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
+				cpumask_set_cpu(task_cpu(t), mask);
+		}
+	}
+	read_unlock(&tasklist_lock);
+}
+
+static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
+{
+	struct rdtgroup *sentry, *stmp;
+	struct list_head *head;
+
+	head = &rdtgrp->mon.crdtgrp_list;
+	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
+		rdtgroup_unassign_cntrs(sentry);
+		free_rmid(sentry->closid, sentry->mon.rmid);
+		list_del(&sentry->mon.crdtgrp_list);
+
+		if (atomic_read(&sentry->waitcount) != 0)
+			sentry->flags = RDT_DELETED;
+		else
+			rdtgroup_remove(sentry);
+	}
+}
+
+/*
+ * Forcibly remove all of subdirectories under root.
+ */
+static void rmdir_all_sub(void)
+{
+	struct rdtgroup *rdtgrp, *tmp;
+
+	/* Move all tasks to the default resource group */
+	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
+
+	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
+		/* Free any child rmids */
+		free_all_child_rdtgrp(rdtgrp);
+
+		/* Remove each rdtgroup other than root */
+		if (rdtgrp == &rdtgroup_default)
+			continue;
+
+		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
+		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
+			rdtgroup_pseudo_lock_remove(rdtgrp);
+
+		/*
+		 * Give any CPUs back to the default group. We cannot copy
+		 * cpu_online_mask because a CPU might have executed the
+		 * offline callback already, but is still marked online.
+		 */
+		cpumask_or(&rdtgroup_default.cpu_mask,
+			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
+
+		rdtgroup_unassign_cntrs(rdtgrp);
+
+		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+
+		kernfs_remove(rdtgrp->kn);
+		list_del(&rdtgrp->rdtgroup_list);
+
+		if (atomic_read(&rdtgrp->waitcount) != 0)
+			rdtgrp->flags = RDT_DELETED;
+		else
+			rdtgroup_remove(rdtgrp);
+	}
+	/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
+	update_closid_rmid(cpu_online_mask, &rdtgroup_default);
+
+	kernfs_remove(kn_info);
+	kernfs_remove(kn_mongrp);
+	kernfs_remove(kn_mondata);
+}
+
+/**
+ * mon_get_kn_priv() - Get the mon_data priv data for this event.
+ *
+ * The same values are used across the mon_data directories of all control and
+ * monitor groups for the same event in the same domain. Keep a list of
+ * allocated structures and re-use an existing one with the same values for
+ * @rid, @domid, etc.
+ *
+ * @rid: The resource id for the event file being created.
+ * @domid: The domain id for the event file being created.
+ * @mevt: The type of event file being created.
+ * @do_sum: Whether SNC summing monitors are being created. Only set
+ *          when @rid == RDT_RESOURCE_L3.
+ *
+ * Return: Pointer to mon_data private data of the event, NULL on failure.
+ */
+static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
+					struct mon_evt *mevt,
+					bool do_sum)
+{
+	struct mon_data *priv;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
+		if (priv->rid == rid && priv->domid == domid &&
+		    priv->sum == do_sum && priv->evt == mevt)
+			return priv;
+	}
+
+	priv = kzalloc_obj(*priv);
+	if (!priv)
+		return NULL;
+
+	priv->rid = rid;
+	priv->domid = domid;
+	priv->sum = do_sum;
+	priv->evt = mevt;
+	list_add_tail(&priv->list, &mon_data_kn_priv_list);
+
+	return priv;
+}
+
+/**
+ * mon_put_kn_priv() - Free all allocated mon_data structures.
+ *
+ * Called when resctrl file system is unmounted.
+ */
+static void mon_put_kn_priv(void)
+{
+	struct mon_data *priv, *tmp;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
+		list_del(&priv->list);
+		kfree(priv);
+	}
+}
+
+static void resctrl_fs_teardown(void)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	/* Cleared by rdtgroup_destroy_root() */
+	if (!rdtgroup_default.kn)
+		return;
+
+	rmdir_all_sub();
+	rdtgroup_unassign_cntrs(&rdtgroup_default);
+	mon_put_kn_priv();
+	rdt_pseudo_lock_release();
+	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
+	closid_exit();
+	schemata_list_destroy();
+	rdtgroup_destroy_root();
+}
+
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -2991,194 +3179,6 @@ static int rdt_init_fs_context(struct fs_context *fc)
 	return 0;
 }
 
-/*
- * Move tasks from one to the other group. If @from is NULL, then all tasks
- * in the systems are moved unconditionally (used for teardown).
- *
- * If @mask is not NULL the cpus on which moved tasks are running are set
- * in that mask so the update smp function call is restricted to affected
- * cpus.
- */
-static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
-				 struct cpumask *mask)
-{
-	struct task_struct *p, *t;
-
-	read_lock(&tasklist_lock);
-	for_each_process_thread(p, t) {
-		if (!from || is_closid_match(t, from) ||
-		    is_rmid_match(t, from)) {
-			resctrl_arch_set_closid_rmid(t, to->closid,
-						     to->mon.rmid);
-
-			/*
-			 * Order the closid/rmid stores above before the loads
-			 * in task_curr(). This pairs with the full barrier
-			 * between the rq->curr update and
-			 * resctrl_arch_sched_in() during context switch.
-			 */
-			smp_mb();
-
-			/*
-			 * If the task is on a CPU, set the CPU in the mask.
-			 * The detection is inaccurate as tasks might move or
-			 * schedule before the smp function call takes place.
-			 * In such a case the function call is pointless, but
-			 * there is no other side effect.
-			 */
-			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
-				cpumask_set_cpu(task_cpu(t), mask);
-		}
-	}
-	read_unlock(&tasklist_lock);
-}
-
-static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
-{
-	struct rdtgroup *sentry, *stmp;
-	struct list_head *head;
-
-	head = &rdtgrp->mon.crdtgrp_list;
-	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
-		rdtgroup_unassign_cntrs(sentry);
-		free_rmid(sentry->closid, sentry->mon.rmid);
-		list_del(&sentry->mon.crdtgrp_list);
-
-		if (atomic_read(&sentry->waitcount) != 0)
-			sentry->flags = RDT_DELETED;
-		else
-			rdtgroup_remove(sentry);
-	}
-}
-
-/*
- * Forcibly remove all of subdirectories under root.
- */
-static void rmdir_all_sub(void)
-{
-	struct rdtgroup *rdtgrp, *tmp;
-
-	/* Move all tasks to the default resource group */
-	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
-
-	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
-		/* Free any child rmids */
-		free_all_child_rdtgrp(rdtgrp);
-
-		/* Remove each rdtgroup other than root */
-		if (rdtgrp == &rdtgroup_default)
-			continue;
-
-		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
-		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
-			rdtgroup_pseudo_lock_remove(rdtgrp);
-
-		/*
-		 * Give any CPUs back to the default group. We cannot copy
-		 * cpu_online_mask because a CPU might have executed the
-		 * offline callback already, but is still marked online.
-		 */
-		cpumask_or(&rdtgroup_default.cpu_mask,
-			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
-
-		rdtgroup_unassign_cntrs(rdtgrp);
-
-		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-
-		kernfs_remove(rdtgrp->kn);
-		list_del(&rdtgrp->rdtgroup_list);
-
-		if (atomic_read(&rdtgrp->waitcount) != 0)
-			rdtgrp->flags = RDT_DELETED;
-		else
-			rdtgroup_remove(rdtgrp);
-	}
-	/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
-	update_closid_rmid(cpu_online_mask, &rdtgroup_default);
-
-	kernfs_remove(kn_info);
-	kernfs_remove(kn_mongrp);
-	kernfs_remove(kn_mondata);
-}
-
-/**
- * mon_get_kn_priv() - Get the mon_data priv data for this event.
- *
- * The same values are used across the mon_data directories of all control and
- * monitor groups for the same event in the same domain. Keep a list of
- * allocated structures and re-use an existing one with the same values for
- * @rid, @domid, etc.
- *
- * @rid: The resource id for the event file being created.
- * @domid: The domain id for the event file being created.
- * @mevt: The type of event file being created.
- * @do_sum: Whether SNC summing monitors are being created. Only set
- *          when @rid == RDT_RESOURCE_L3.
- *
- * Return: Pointer to mon_data private data of the event, NULL on failure.
- */
-static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
-					struct mon_evt *mevt,
-					bool do_sum)
-{
-	struct mon_data *priv;
-
-	lockdep_assert_held(&rdtgroup_mutex);
-
-	list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
-		if (priv->rid == rid && priv->domid == domid &&
-		    priv->sum == do_sum && priv->evt == mevt)
-			return priv;
-	}
-
-	priv = kzalloc_obj(*priv);
-	if (!priv)
-		return NULL;
-
-	priv->rid = rid;
-	priv->domid = domid;
-	priv->sum = do_sum;
-	priv->evt = mevt;
-	list_add_tail(&priv->list, &mon_data_kn_priv_list);
-
-	return priv;
-}
-
-/**
- * mon_put_kn_priv() - Free all allocated mon_data structures.
- *
- * Called when resctrl file system is unmounted.
- */
-static void mon_put_kn_priv(void)
-{
-	struct mon_data *priv, *tmp;
-
-	lockdep_assert_held(&rdtgroup_mutex);
-
-	list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
-		list_del(&priv->list);
-		kfree(priv);
-	}
-}
-
-static void resctrl_fs_teardown(void)
-{
-	lockdep_assert_held(&rdtgroup_mutex);
-
-	/* Cleared by rdtgroup_destroy_root() */
-	if (!rdtgroup_default.kn)
-		return;
-
-	rmdir_all_sub();
-	rdtgroup_unassign_cntrs(&rdtgroup_default);
-	mon_put_kn_priv();
-	rdt_pseudo_lock_release();
-	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
-	closid_exit();
-	schemata_list_destroy();
-	rdtgroup_destroy_root();
-}
-
 static void rdt_kill_sb(struct super_block *sb)
 {
 	struct rdt_resource *r;
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 75422374A03 for ; Tue, 9 Jun 2026 21:02:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.18 ARC-Seal: i=1; a=rsa-sha256;
d=subspace.kernel.org; s=arc-20240116; t=1781038975; cv=none; b=sZjwzD/KWCmjw4aqKVyPug/FaDThln7EOONzv8UJkvhTvTg7FE3eekfC5gbe1k5VP34nmbhdByhd0tNHRknENLahvmEvJzVeO9kFU+QKHdcz5H+JQ6THWF9KsdJq/mB1uQajCmnzIjTh6jfzH2brfJVZkM4n28S5YLwxQU14fHo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038975; c=relaxed/simple; bh=C8hY132qG+xjkRnVVTukTcbYF6pljAiJfoDuxTvymqU=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=u4tEoRuWEc9U/h2ZX9KxqLLYgQfeOtsCmK5DszfJk1ZFX4bj2ogKYaVmiscUiOBmhcfNryYUVQwFn8emYOCaBczorLv4mJ8fL4bf684QEHo9M1xS//pwl+hjWMU56wHkNW/SM4qI8JGIW9mBlLjE2GGTSg/SWArPPYRvTszmF0c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=V9xzLI6L; arc=none smtp.client-ip=198.175.65.18 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="V9xzLI6L" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1781038974; x=1812574974; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=C8hY132qG+xjkRnVVTukTcbYF6pljAiJfoDuxTvymqU=; b=V9xzLI6Lrcdo3r5LPqaTat8+dgybf7uOksdC0EaJB+/3Auf7L+kJRqAf I0FRLJ1vKLwpAQkBZ1wJxw7TOERMu2til66jt7KkoE3PdPqjbReeMhNTs 9WJBYEwinrnGGAJyH35PuA64mOfekt+kzLNrH/KWLdjK8aLh6j44P2b3w +mvPsXCWOPxMhpwe1DxRU+jMCQ2rGKvIERFaZrKAa1SmIipydplFRBgx4 vWb3138chdDY8/WrhhLrEQHsxXTh5lW39M1WKQ6SdQQCIYrTKBSDscJhr w7epfKZS9Y9BYuoxUgCF6KhCZRClc1w+Cu0ps9a0Pyt+uGN/cGqKAvhan A==; X-CSE-ConnectionGUID: h4H3wbfpQ6mQdySDmIkfng== X-CSE-MsgGUID: IG/iUiL2Txy4WbvU40udOA== X-IronPort-AV: E=McAfee;i="6800,10657,11812"; 
a="81885016" X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="81885016" Received: from fmviesa005.fm.intel.com ([10.60.135.145]) by orvoesa110.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:53 -0700 X-CSE-ConnectionGUID: Nh3OGGcdRTWACpYQ7tKkIQ== X-CSE-MsgGUID: gd3spaAYQTa5FCtEWLIr4w== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="251045678" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:51 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v5 04/11] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Date: Tue, 9 Jun 2026 14:02:30 -0700 Message-ID: <52880d6ee4d38599f69cbb42493e16f3903e7f52.1781029125.git.reinette.chatre@intel.com> X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Tony Luck If mkdir_mondata_all() or a subsequent call in rdt_get_tree() fails, the mon_data structures allocated by mon_get_kn_priv() are leaked. Add mon_put_kn_priv() to the out_mongrp error path to free the mon_data structures. 

Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing mon_data_bits")
Reported-by: Reinette Chatre
Closes: https://lore.kernel.org/lkml/5d38c1fb-8f91-472b-8897-24b2f50c772b@intel.com/
Signed-off-by: Tony Luck
Signed-off-by: Reinette Chatre
Reviewed-by: Chen Yu
Reviewed-by: Ben Horgan
---
Changes since V2:
- Reword changelog.

Changes since V3:
- Add Chenyu's Reviewed-by tag that should have been added in V2.
- Add Ben's Reviewed-by tag.
- Add Closes: tag.
---
 fs/resctrl/rdtgroup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 2b624cf02147..31cfb54a5488 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3081,6 +3081,7 @@ static int rdt_get_tree(struct fs_context *fc)
 		kernfs_remove(kn_mondata);
 out_mongrp:
 	if (resctrl_arch_mon_capable()) {
+		mon_put_kn_priv();
 		rdtgroup_unassign_cntrs(&rdtgroup_default);
 		kernfs_remove(kn_mongrp);
 	}
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 284463DD507 for ; Tue, 9 Jun 2026 21:02:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038976; cv=none; b=mhd7ISLOGTUeA40fdSUbc5Uau+SYdcL59URToCdOdSgqOm7sCMbiIc/pC5Yl4ZPCM5j6kgnwTHurYnQfsH6OWKqY6P2Q1roFzER5mSo9zOjOYkAU45MmO2b1R0zsJQEOHGi/3nz/rKdOVdc3oef8yEP3XrPhy1f2iL2ILLsA3c4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038976; c=relaxed/simple; bh=uj2HbwpeVZoI2H2eNfrEoI9yD3Oy3gxW4pyNDQE7YT4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version;
b=mUoaCXPsMEWWPOL/H2q1zaOyXK1OkP5NDzqnw8gB/TJQJdCf2J+r8zyW4q/n+DrwvQT1lyqdBkPvhlq/TTvBef/ENllW4jOa5ow4XhqCnqLvK2HsFcabwTyJ0kT+ZEu8FZPUXado6+zVYZGUNtq9QEp/Cq6hYLdvHkDgNGhobrM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=MoE9ul4t; arc=none smtp.client-ip=198.175.65.18 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="MoE9ul4t" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1781038976; x=1812574976; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=uj2HbwpeVZoI2H2eNfrEoI9yD3Oy3gxW4pyNDQE7YT4=; b=MoE9ul4tSDnWaMZwIOfUolVgDPTRGC5zQFkaA8eDVFxmJkj58Dn6eFZf Zj+orE7t5QwvMXXN1evMepFwugzxXA7VrvJty8KWQRdMKNXgyKrmRWAdl Q96i6fFHqsTnCQcMEpUY1M6OkgyPNXdn02JuFGrvOdAa+N/nZE3GrJTdr JDaGY+oklRP16mNxxHwNvRxVTLHUYId4sInHpEqmhMWNtr9O+KC6KpQ7F m8nfA0hFehcqyop06vwJDPU/RCWrVYuxX3qTv7g6ee6x17uXj1cowvOt3 pwj0UJyc7p8hZ5HGHW6A2PDytIJDjyqEF2u2siEfenapFppEihAFNOvvu Q==; X-CSE-ConnectionGUID: KuVqk39NQF6x3HmgFrbykQ== X-CSE-MsgGUID: ibWKgtbIRRWVaycq2YRe6w== X-IronPort-AV: E=McAfee;i="6800,10657,11812"; a="81885021" X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="81885021" Received: from fmviesa005.fm.intel.com ([10.60.135.145]) by orvoesa110.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:54 -0700 X-CSE-ConnectionGUID: lcR9f0kHSIS/7sHizfJRrQ== X-CSE-MsgGUID: CH8OTXeQSMyKH9X/TyOMhQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="251045686" Received: from rchatre-desk1.jf.intel.com 
([10.165.154.99]) by fmviesa005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:52 -0700
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v5 05/11] fs/resctrl: Fix use-after-free during unmount
Date: Tue, 9 Jun 2026 14:02:31 -0700
Message-ID: X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Tony Luck

During unmount or failure teardown all mon_data structures that contain
monitoring event file private data are freed after which kernfs nodes are
removed. However, the RDT_DELETED flag is never set for the statically
allocated default resource group.

A concurrent reader of an event file associated with the default resource
group may, after dropping kernfs active protection, block on rdtgroup_mutex
while unmount proceeds to free the file private data and destroy the kernfs
node without waiting for the reader. When the mutex is released, the reader
wakes up, observes that RDT_DELETED is not set for the default group, and
dereferences the already-freed file private data.

The scenario can be depicted as follows:

CPU0                                            CPU1
/*
 * Default resource group's
 * monitoring data accessible via
 * kernfs file with kernfs_node::priv
 * pointing to a struct mon_data.
 * User opens the file for reading.
 */
rdtgroup_mondata_show()                         /* arch encounters fatal error */
  rdtgroup_kn_lock_live()                       resctrl_exit()
    atomic_inc(&rdtgroup_default.waitcount)       cpus_read_lock()
    kernfs_break_active_protection(kn)            mutex_lock(&rdtgroup_mutex)
    cpus_read_lock()                              resctrl_fs_teardown()
    mutex_lock(&rdtgroup_mutex)                     rmdir_all_sub()
                                                    mon_put_kn_priv() /* Delete all mon_data structures */
                                                    rdtgroup_destroy_root()
                                                      kernfs_destroy_root()
                                                      rdtgroup_default.kn = NULL
                                                  mutex_unlock(&rdtgroup_mutex)
  /*
   * rdtgroup_default.flags is empty so
   * rdtgroup_kn_lock_live() returns
   * &rdtgroup_default
   */
  md = of->kn->priv; /* md points to freed mon_data */

Set RDT_DELETED for the default group unconditionally since the flag does
not lead to the freeing of this statically allocated group.

Do not allow a new resctrl mount if there are any waiters on default group
of previous mount. A new mount will re-initialize the default group that
would appear to waiters from previous mount as though the default group is
accessible causing them to access the mon_data structures from the previous
mount that have been removed.

Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing mon_data_bits")
Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260508182143.14592-1-tony.luck%40intel.com?part=2 [1]
Signed-off-by: Tony Luck
Signed-off-by: Reinette Chatre
Reviewed-by: Chen Yu
---
Changes since V2:
- Rewrite changelog to not describe code as much.
- Rework changelog to switch to "Reported-by/Closes".
- Merge the duplicate rdtgroup_remove() comment with the function comment.
- Fix changelog to not mention that RDT_DELETED flag is set conditionally.
- Change "Fixes:" tag to point to commit that introduced dynamically
  allocated mon_data this bug involves.

Changes since V3:
- Depict the race. (Chenyu)
- Add Chenyu's Reviewed-by tag.
- Changelog grammar fixes.
---
 fs/resctrl/rdtgroup.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 31cfb54a5488..809f0965474c 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -585,14 +585,20 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
  *
  * On resource group creation via a mkdir, an extra kernfs_node reference is
  * taken to ensure that the rdtgroup structure remains accessible for the
- * rdtgroup_kn_unlock() calls where it is removed.
+ * rdtgroup_kn_unlock() calls where it is removed. The default group is
+ * statically allocated: it does not have an extra reference but will have
+ * RDT_DELETED set on unmount to support safe access to its associated files
+ * via rdtgroup_kn_lock_live/rdtgroup_kn_unlock().
  *
- * Drop the extra reference here, then free the rdtgroup structure.
+ * For all but the default group: drop the extra reference, then free the
+ * rdtgroup structure.
  *
  * Return: void
  */
 static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 {
+	if (rdtgrp == &rdtgroup_default)
+		return;
 	kernfs_put(rdtgrp->kn);
 	kfree(rdtgrp);
 }
@@ -2975,6 +2981,7 @@ static void resctrl_fs_teardown(void)
 	mon_put_kn_priv();
 	rdt_pseudo_lock_release();
 	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
+	rdtgroup_default.flags = RDT_DELETED;
 	closid_exit();
 	schemata_list_destroy();
 	rdtgroup_destroy_root();
@@ -3000,6 +3007,12 @@ static int rdt_get_tree(struct fs_context *fc)
 		goto out;
 	}
 
+	/* Avoid races from pending operations from a previous mount */
+	if (atomic_read(&rdtgroup_default.waitcount) != 0) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	ret = setup_rmid_lru_list();
 	if (ret)
 		goto out;
@@ -4275,6 +4288,7 @@ static int rdtgroup_setup_root(struct rdt_fs_context *ctx)
 
 	ctx->kfc.root = rdt_root;
 	rdtgroup_default.kn = kernfs_root_to_node(rdt_root);
+	rdtgroup_default.flags = 0;
 
 	return 0;
 }
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026 Received: from
mgamail.intel.com (mgamail.intel.com [198.175.65.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 51A921DDC37 for ; Tue, 9 Jun 2026 21:02:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038982; cv=none; b=nxVa3NhCoS/Uzm3w8jL6Yi1cr2co1bU2Qx/Zu9bt/XHFujzr+iO0B01QK4086O3OcC9La+4/kz9gYrRbQca0f7+Rp+QZGGprqhvlsebH4IQY6bYFc1wYvStDlHZe6aoIBoL5kdFAwVYiYeuor60crIyrpF5MME/rE18qrgR+D0w= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038982; c=relaxed/simple; bh=LHCo0MUmMyDzclk6PO+WqAPmteois0/AC1uNvOLDCqs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=E/WKzQ4j4mQcFSVcHdXWanOOhqkGnnmJaTi6Y/b4sFovZzhRdTdKy5+be0g71dCNxgXeGncl4mMSJn7HX1pIEPASZp2gz+SF0GrDMEkQgf0epyXONonh2ljLqD8jksKWAKRo5AlDkLkMTmFFXHbW88t+d6YR6gJ8+wxjvzJl7RU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=KohHXMkj; arc=none smtp.client-ip=198.175.65.18 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="KohHXMkj" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1781038976; x=1812574976; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=LHCo0MUmMyDzclk6PO+WqAPmteois0/AC1uNvOLDCqs=; b=KohHXMkjT30IogTH5EkSesyOSepg9I8/5kLj/iMvD2r0DeiYqZj2sF2V 
ZTiv58XpCY5Ufc7hSonRXUk1rKNHnAxKQy9kBoWgVfEaLlqt4hnJyjLcD iJYexzG/ANkn4lpvzvbkffcnBROe2d0GcenLvE1xIyzTB4QHUHqSkHBur fV5469lYSsxfiDBpyJSMp5BzmMnwPDIxfdWftG49o/Awou6FFkMk2HspF wf1p3n9LGVJ1CCK8B1f4BaIDBQUYm2JBYNYF31Mb6tYZkzTwG+8bYkFqZ UnyfvhbzyDo2788Q88pnuJYIlIrolWcVk7QRQdQ1ZMlJlOiWiPh+j15hl Q==; X-CSE-ConnectionGUID: cHEMOlDITTy1KiTCZ+BUng== X-CSE-MsgGUID: 7Fh449vfQEqIzPMxUiBFFg== X-IronPort-AV: E=McAfee;i="6800,10657,11812"; a="81885027" X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="81885027" Received: from fmviesa005.fm.intel.com ([10.60.135.145]) by orvoesa110.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:55 -0700 X-CSE-ConnectionGUID: kG3W0rhxQza5nXD7nAFM/w== X-CSE-MsgGUID: QIbQTTDRQo+XGfZUN6JkbQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="251045700" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:53 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v5 06/11] fs/resctrl: Fix deadlock on errors during mount Date: Tue, 9 Jun 2026 14:02:32 -0700 Message-ID: <86ce88f4b66d9b1ed69942d8f2b34f751609684c.1781029125.git.reinette.chatre@intel.com> X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). 
If superblock setup fails inside kernfs_get_tree(), the VFS calls
.kill_sb() (rdt_kill_sb()) on the same thread before kernfs_get_tree()
returns. rdt_kill_sb() unconditionally attempts to acquire rdtgroup_mutex
and deadlock occurs.

Since mount failure resulting from kernfs_get_tree() already calls the
resctrl fs unmount handler (rdt_kill_sb()) let both call the same helper to
make it clear both paths perform the same cleanup. Call kernfs_get_tree()
outside of locks. If kernfs_get_tree() fails and ctx->kfc.new_sb_created is
set, then rdt_kill_sb() has already been called and no further cleanup is
needed.

kernfs_get_tree() may set ctx->kfc.new_sb_created and then fail to obtain
an inode for the new kn, causing the rdt_kill_sb() path to run with one
fewer reference than required for the root to remain accessible in
kernfs_kill_sb(). Add an extra hold on rdtgroup_default.kn to defend
against this scenario and ensure the root can be dereferenced safely from
kernfs_kill_sb().

Dropping locks before kernfs_get_tree() creates a window where CPU hotplug
callbacks can race with the mount operation. Specifically, an online event
observing resctrl_mounted == true could concurrently append directories to
the unactivated kernfs tree, allocate mon_data structures, and arm
background workers. This concurrency is safe because the mount has not yet
returned to the VFS, meaning userspace cannot interact with these transient
files. If kernfs_get_tree() subsequently fails, the standard
resctrl_unmount() teardown safely manages the concurrent modifications: any
dynamically generated kernfs nodes are removed, and the associated memory
is freed. Any background workers spawned by the hotplug event will
naturally exit without re-arming when they acquire rdtgroup_mutex and
observe resctrl_mounted == false.

Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support")
Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
Co-developed-by: Tony Luck
Signed-off-by: Tony Luck
Signed-off-by: Reinette Chatre
Reviewed-by: Ben Horgan
Reviewed-by: Chen Yu
---
Changes since V2:
- Switch to "Reported-by/Closes" in changelog

Changes since V3:
- Add Ben's Reviewed-by tag.
- Rework subject and changelog.
- s/root kn/root/ in comment. (Chenyu)
- Add Chenyu's Reviewed-by tag.
- Changelog grammar fixes.
- Add snippet to changelog about potential race with hotplug handlers.
---
 fs/resctrl/rdtgroup.c | 83 +++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 27 deletions(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 809f0965474c..0d073d4db734 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2987,10 +2987,34 @@ static void resctrl_fs_teardown(void)
 	rdtgroup_destroy_root();
 }
 
+static void resctrl_unmount(void)
+{
+	struct rdt_resource *r;
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	rdt_disable_ctx();
+
+	/* Put everything back to default values. */
+	for_each_alloc_capable_rdt_resource(r)
+		resctrl_arch_reset_all_ctrls(r);
+
+	resctrl_fs_teardown();
+	if (resctrl_arch_alloc_capable())
+		resctrl_arch_disable_alloc();
+	if (resctrl_arch_mon_capable())
+		resctrl_arch_disable_mon();
+	resctrl_mounted = false;
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+}
+
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
+	struct kernfs_node *rdt_root_kn;
 	struct rdt_l3_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
@@ -3066,10 +3090,6 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (ret)
 		goto out_mondata;
 
-	ret = kernfs_get_tree(fc);
-	if (ret < 0)
-		goto out_psl;
-
 	if (resctrl_arch_alloc_capable())
 		resctrl_arch_enable_alloc();
 	if (resctrl_arch_mon_capable())
@@ -3085,10 +3105,38 @@ static int rdt_get_tree(struct fs_context *fc)
 						   RESCTRL_PICK_ANY_CPU);
 	}
 
-	goto out;
+	/*
+	 * Ensure root remains accessible after mutex is unlocked so that
+	 * kernfs_kill_sb() can run safely if called by kernfs_get_tree()'s
+	 * failure path after creating a superblock but before taking reference
+	 * on root kn (for example, if unable to get inode for root kn).
+	 */
+	kernfs_get(rdtgroup_default.kn);
+
+	/*
+	 * Make backup of the current root kn being created to be used in
+	 * kernfs_put(). The additional reference taken above will prevent the
+	 * kn from being freed before kernfs_kill_sb() can run but
+	 * rdtgroup_default.kn may be set to NULL via rdtgroup_destroy_root()
+	 * and its backing root (rdt_root) could be overwritten before
+	 * kernfs_put() can run.
+	 */
+	rdt_root_kn = rdtgroup_default.kn;
+
+	rdt_last_cmd_clear();
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	ret = kernfs_get_tree(fc);
+	/*
+	 * resctrl can only be mounted once, new superblock only expected
+	 * to be created once.
+ */ + if (!ctx->kfc.new_sb_created) + resctrl_unmount(); + kernfs_put(rdt_root_kn); + return ret; =20 -out_psl: - rdt_pseudo_lock_release(); out_mondata: if (resctrl_arch_mon_capable()) kernfs_remove(kn_mondata); @@ -3108,7 +3156,6 @@ static int rdt_get_tree(struct fs_context *fc) out_root: rdtgroup_destroy_root(); out: - rdt_last_cmd_clear(); mutex_unlock(&rdtgroup_mutex); cpus_read_unlock(); return ret; @@ -3195,26 +3242,8 @@ static int rdt_init_fs_context(struct fs_context *fc) =20 static void rdt_kill_sb(struct super_block *sb) { - struct rdt_resource *r; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_disable_ctx(); - - /* Put everything back to default values. */ - for_each_alloc_capable_rdt_resource(r) - resctrl_arch_reset_all_ctrls(r); - - resctrl_fs_teardown(); - if (resctrl_arch_alloc_capable()) - resctrl_arch_disable_alloc(); - if (resctrl_arch_mon_capable()) - resctrl_arch_disable_mon(); - resctrl_mounted =3D false; + resctrl_unmount(); kernfs_kill_sb(sb); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); } =20 static struct file_system_type rdt_fs_type =3D { --=20 2.50.1 From nobody Thu Jun 11 01:40:31 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 216F63DEAC9 for ; Tue, 9 Jun 2026 21:02:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038982; cv=none; b=JWQSFvYDWwWG8rF0IxujxAd8G0Bz7fwDJPjclDuLianRtBhVjEKJPVsOCQxDJtYB+na7JABdPTSl7GYrPkRwcoqBzunO1gcoFnWPj4OAZddRawFA1hkEkMnSpseiDN8Vl/aAdPxuqR3pcols87Fap98bVGcZxIMl683Zadfe/LE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038982; c=relaxed/simple; bh=bQGZ3Gi1hecEzRonc8KkP4mUrFcfRliOT9dqTRLLRN4=; 
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com,
	babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de,
	dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org,
	fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev,
	reinette.chatre@intel.com
Subject: [PATCH v5 07/11] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put()
Date: Tue, 9 Jun 2026 14:02:33 -0700
Message-ID: <9bfb3dc321d46b9d28d9952addc9c7e8a84437dc.1781029125.git.reinette.chatre@intel.com>
X-Mailer: git-send-email 2.50.1

A struct rdtgroup is reference counted via rdtgroup::waitcount. Callers
that need the structure to remain valid across a sleep (while waiting on
acquiring rdtgroup_mutex) take a reference with rdtgroup_kn_get() and
release it with rdtgroup_kn_put(). The release path is intended to serve
as the fallback freer: if the count drops to zero and the group has
already been marked RDT_DELETED, rdtgroup_kn_put() frees the structure.

The bulk teardown paths free_all_child_rdtgrp() and rmdir_all_sub()
resulting from a resctrl directory remove or resctrl fs unmount act as
the primary freer: they hold rdtgroup_mutex and free each rdtgroup whose
waitcount is zero, otherwise they set RDT_DELETED and leave the freeing
to the last waiter.

These two freers race. rdtgroup_kn_put() commits waitcount == 0 with
atomic_dec_and_test() outside rdtgroup_mutex, then reads rdtgroup::flags.
Between those two operations a concurrent caller of
free_all_child_rdtgrp() or rmdir_all_sub() (which holds the mutex) can
observe waitcount == 0 via atomic_read(), call rdtgroup_remove(), and
kfree() the structure. The subsequent read of rdtgroup::flags in
rdtgroup_kn_put() is then a use-after-free, and the structure may even be
freed twice if the freed memory happens to satisfy the RDT_DELETED flag
check.

Replace the bare atomic_dec_and_test() with atomic_dec_and_mutex_lock()
so that the decrement-to-zero takes rdtgroup_mutex before the count
becomes globally visible. The inspection of rdtgroup::flags then runs
under the same mutex held by the bulk freers, making the two paths
mutually exclusive. The common case where the count does not reach zero
remains lock-free.

Defer kernfs_unbreak_active_protection() until after the mutex is dropped
since kernfs active protections functionally wrap rdtgroup_mutex. Remove
resource group, which in turn drops its kernfs reference, after kernfs
protection is restored.

Fixes: b8511ccc75c0 ("x86/resctrl: Fix use-after-free when deleting resource groups")
Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=1
Assisted-by: GitHub_Copilot:gemini-3.1-pro
Signed-off-by: Reinette Chatre
Reviewed-by: Ben Horgan
Reviewed-by: Tony Luck
---
Changes since V2:
- New patch

Changes since V3:
- Add Ben's Reviewed-by tag.
- Add Tony's Reviewed-by tag.
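
Illustration only, not part of the patch: a minimal user-space sketch of
the ordering that atomic_dec_and_mutex_lock() provides, assuming C11
atomics and a pthread mutex as stand-ins for the kernel primitives. The
names struct group, GRP_DELETED, group_mutex, dec_unless(),
dec_and_mutex_lock() and group_put() are hypothetical and exist only for
this example.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define GRP_DELETED 0x1		/* hypothetical stand-in for RDT_DELETED */

struct group {			/* hypothetical stand-in for struct rdtgroup */
	atomic_int waitcount;
	int flags;
};

static pthread_mutex_t group_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Decrement *cnt unless it currently equals @unless. True if decremented. */
static bool dec_unless(atomic_int *cnt, int unless)
{
	int old = atomic_load(cnt);

	while (old != unless) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return true;
	}
	return false;
}

/*
 * Model of atomic_dec_and_mutex_lock(): returns true with @lock held only
 * when this call dropped the count to zero.
 */
static bool dec_and_mutex_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
	/* Fast path: the count cannot reach zero, so no lock is needed. */
	if (dec_unless(cnt, 1))
		return false;

	/* The count may reach zero: take the lock before decrementing. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) != 1) {
		pthread_mutex_unlock(lock);
		return false;
	}
	return true;
}

/* Returns true when the caller is the last waiter and must free @g. */
static bool group_put(struct group *g)
{
	bool needs_free;

	if (!dec_and_mutex_lock(&g->waitcount, &group_mutex))
		return false;

	/* A bulk freer holding group_mutex cannot run concurrently here. */
	needs_free = g->flags & GRP_DELETED;
	pthread_mutex_unlock(&group_mutex);

	return needs_free;
}
```

In this model a put that leaves the count above zero never touches the
mutex, while the put that reaches zero reads the flags only while holding
the same mutex a bulk freer would hold.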
---
 fs/resctrl/rdtgroup.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 0d073d4db734..c04424c081a4 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2606,15 +2606,24 @@ static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
 
 static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
 {
-	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
-	    (rdtgrp->flags & RDT_DELETED)) {
+	bool needs_free;
+
+	if (!atomic_dec_and_mutex_lock(&rdtgrp->waitcount, &rdtgroup_mutex)) {
+		kernfs_unbreak_active_protection(kn);
+		return;
+	}
+
+	needs_free = rdtgrp->flags & RDT_DELETED;
+
+	mutex_unlock(&rdtgroup_mutex);
+
+	kernfs_unbreak_active_protection(kn);
+
+	if (needs_free) {
 		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
 		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
 			rdtgroup_pseudo_lock_remove(rdtgrp);
-		kernfs_unbreak_active_protection(kn);
 		rdtgroup_remove(rdtgrp);
-	} else {
-		kernfs_unbreak_active_protection(kn);
 	}
 }
 
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com,
	babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de,
	dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org,
	fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev,
	reinette.chatre@intel.com
Subject: [PATCH v5 08/11] fs/resctrl: Fix double-add of pseudo-locked region's RMID to free list
Date: Tue, 9 Jun 2026 14:02:34 -0700
Message-ID: <137b53e306e268af9e925e7966af304cc5ad01b3.1781029125.git.reinette.chatre@intel.com>
X-Mailer: git-send-email 2.50.1

A pseudo-locked group's RMID is freed during setup of the pseudo-locked
region. On unmount rmdir_all_sub() unconditionally frees the RMID of
every group, resulting in a double-free of the pseudo-locked group's
RMID. The consequence is that the original free adds the pseudo-locked
group's RMID to the rmid_free_lru linked list and the second free then
attempts to add the same RMID entry to rmid_free_lru again.

Do not double-free a pseudo-locked group's RMID.

Fixes: e0bdfe8e36f3 ("x86/intel_rdt: Support creation/removal of pseudo-locked region")
Signed-off-by: Reinette Chatre
---
Changes since V2:
- New patch

Changes since V3:
- Extract the double-add/double-free fix from all the other pseudo-locking
  fixes that will be deferred. This issue was uncovered during testing of
  the race fixes so drop all the Reported-by and Closes tags.
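
Illustration only, not part of the patch: a small user-space model of
why adding the same entry to a circular doubly linked list twice is
harmful, and of the teardown rule the patch adopts. All names here
(list_head helpers, struct rmid_entry, struct group, free_lru,
free_rmid_model(), pseudo_lock_setup(), teardown_group(), list_count())
are hypothetical simplifications and do not match the kernel's actual
data structures.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Model of the free list; starts out empty (points at itself). */
static struct list_head free_lru = { &free_lru, &free_lru };

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	new->next = head;
	new->prev = prev;
	prev->next = new;
	head->prev = new;
}

enum grp_mode { MODE_SHAREABLE, MODE_PSEUDO_LOCKSETUP, MODE_PSEUDO_LOCKED };

struct rmid_entry {
	struct list_head list;
	unsigned int rmid;
};

struct group {
	enum grp_mode mode;
	struct rmid_entry *entry;
};

/* Model of freeing an RMID: the entry goes onto the free list. */
static void free_rmid_model(struct rmid_entry *e)
{
	list_add_tail(&e->list, &free_lru);
}

/* Model assumption: the RMID is returned as part of pseudo-lock setup. */
static void pseudo_lock_setup(struct group *g)
{
	g->mode = MODE_PSEUDO_LOCKSETUP;
	free_rmid_model(g->entry);
}

/* Teardown rule after the fix: skip groups whose RMID is already free. */
static void teardown_group(struct group *g)
{
	if (g->mode == MODE_PSEUDO_LOCKSETUP || g->mode == MODE_PSEUDO_LOCKED)
		return;
	free_rmid_model(g->entry);
}

/* Count entries; -1 if the walk does not return to head within @limit. */
static int list_count(const struct list_head *head, int limit)
{
	const struct list_head *p = head->next;
	int n = 0;

	while (p != head) {
		if (++n > limit)
			return -1;
		p = p->next;
	}
	return n;
}
```

With the teardown rule in place every entry is added exactly once. Calling
free_rmid_model() a second time on an entry that is already linked
rewrites its neighbours' pointers, after which other entries can become
unreachable or a walk may never return to the list head.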
---
 fs/resctrl/rdtgroup.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index c04424c081a4..77c9d22017bc 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2885,10 +2885,6 @@ static void rmdir_all_sub(void)
 		if (rdtgrp == &rdtgroup_default)
 			continue;
 
-		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
-		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
-			rdtgroup_pseudo_lock_remove(rdtgrp);
-
 		/*
 		 * Give any CPUs back to the default group. We cannot copy
 		 * cpu_online_mask because a CPU might have executed the
@@ -2899,7 +2895,13 @@ static void rmdir_all_sub(void)
 
 		rdtgroup_unassign_cntrs(rdtgrp);
 
-		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
+		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
+			rdtgroup_pseudo_lock_remove(rdtgrp);
+		} else {
+			/* Pseudo-locked group's RMID is freed during setup. */
+			free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+		}
 
 		kernfs_remove(rdtgrp->kn);
 		list_del(&rdtgrp->rdtgroup_list);
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com,
	babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de,
	dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org,
	fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev,
	reinette.chatre@intel.com
Subject: [PATCH v5 09/11] fs/resctrl: Prevent deadlock and use-after-free in info file handlers
Date: Tue, 9 Jun 2026 14:02:35 -0700
Message-ID: <5daffbd831ad0b506566ff280e420954ca5a2b95.1781029125.git.reinette.chatre@intel.com>
X-Mailer: git-send-email 2.50.1

resctrl provides files under the info/ directory to expose global
configuration and capabilities to userspace. These files are instantiated
statically during filesystem mount and expose data associated with
internal schema structures via kernfs private pointers.

A potential deadlock exists between userspace readers of these info files
and the unmount filesystem teardown process. Reading an info file invokes
kernfs which acquires an active reference, after which the handler
typically attempts to acquire the rdtgroup_mutex. Concurrently,
unmounting the filesystem holds the rdtgroup_mutex and then attempts to
recursively remove the info kernfs nodes involving kernfs_drain() which
blocks until all active references are released.

Another problem exists where info files might be accessed from an
outdated mount if the filesystem is unmounted and remounted during a
reader's execution, leading to a use-after-free when reading the
now-deleted private schema data.

Introduce info_kn_lock() and info_kn_unlock() helpers to coordinate
locking across all info handlers. These helpers mirror similar logic used
by resource group handlers by deliberately breaking the kernfs active
protection before attempting to acquire the rdtgroup_mutex, preventing
the deadlock. To guard against the vulnerability from rapid mount
cycling, info_kn_lock() securely walks the parent lineage of the kernfs
node under an RCU section to confirm the node belongs to the globally
active root before permitting the operation to proceed.

Convert all info file handlers to use this helper and only de-reference
the schema after it is determined safe to do so. Make no attempt to
output an error message to last_cmd_status on failure since failure
implies there is no filesystem with which to display the error to user
space.

Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=3
Assisted-by: GitHub_Copilot:gemini-3.1-pro
Signed-off-by: Reinette Chatre
Reviewed-by: Tony Luck
---
Changes since V2:
- New patch

Changes since V3:
- Add Tony's Reviewed-by tag.
- Changelog grammar fixes.
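
Illustration only, not part of the patch: a user-space sketch of the
"validate the root before touching private data" pattern described
above. It models only the lineage check under a single mutex. Breaking
kernfs active protection, the RCU section and any CPU hotplug locking
that the real helpers deal with are deliberately left out. The names
struct node, fs_mutex, active_root, info_lock(), info_unlock() and
show_value() are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct node {			/* hypothetical stand-in for a kernfs node */
	struct node *parent;	/* NULL for a root */
	const int *priv;	/* only valid while the node's root is active */
};

static pthread_mutex_t fs_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct node *active_root;	/* NULL while nothing is mounted */

/*
 * Take the lock, then confirm that @n hangs off the currently active
 * root. Returns true with fs_mutex held, false with it released.
 */
static bool info_lock(const struct node *n)
{
	pthread_mutex_lock(&fs_mutex);

	/* Walk the parent lineage up to the root of this node. */
	while (n->parent)
		n = n->parent;

	if (active_root && n == active_root)
		return true;

	pthread_mutex_unlock(&fs_mutex);
	return false;
}

static void info_unlock(void)
{
	pthread_mutex_unlock(&fs_mutex);
}

/* Handler shape used by the converted info files in this model. */
static int show_value(const struct node *file, int *out)
{
	if (!info_lock(file))
		return -ENOENT;

	/* Private data is dereferenced only after the check succeeded. */
	*out = *file->priv;
	info_unlock();

	return 0;
}
```

A handler invoked through a node that belongs to an earlier mount fails
with -ENOENT before its private pointer is ever read, which is the
property the converted handlers in this patch rely on.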
---
 fs/resctrl/ctrlmondata.c |  38 ++++----
 fs/resctrl/internal.h    |   3 +-
 fs/resctrl/monitor.c     |  48 +++++-----
 fs/resctrl/rdtgroup.c    | 192 ++++++++++++++++++++++++++++++++-------
 4 files changed, 203 insertions(+), 78 deletions(-)

diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index f33712c17d38..2b29fb5a8702 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -771,10 +771,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 int resctrl_io_alloc_show(struct kernfs_open_file *of, struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
+	r = s->res;
 	if (r->cache.io_alloc_capable) {
 		if (resctrl_arch_get_io_alloc_enabled(r))
 			seq_puts(seq, "enabled\n");
@@ -784,7 +786,7 @@ int resctrl_io_alloc_show(struct kernfs_open_file *of, struct seq_file *seq, voi
 		seq_puts(seq, "not supported\n");
 	}
 
-	mutex_unlock(&rdtgroup_mutex);
+	info_kn_unlock(of->kn);
 
 	return 0;
 }
@@ -849,7 +851,7 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
 			       size_t nbytes, loff_t off)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 	char const *grp_name;
 	u32 io_alloc_closid;
 	bool enable;
@@ -859,9 +861,10 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
 	if (ret)
 		return ret;
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
+	r = s->res;
 	rdt_last_cmd_clear();
 
 	if (!r->cache.io_alloc_capable) {
@@ -909,8 +912,7 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
 	}
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 
 	return ret ?: nbytes;
 }
@@ -918,14 +920,15 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
 int resctrl_io_alloc_cbm_show(struct kernfs_open_file *of, struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 	int ret = 0;
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
 	rdt_last_cmd_clear();
 
+	r = s->res;
 	if (!r->cache.io_alloc_capable) {
 		rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name);
 		ret = -ENODEV;
@@ -947,8 +950,7 @@ int resctrl_io_alloc_cbm_show(struct kernfs_open_file *of, struct seq_file *seq,
 	show_doms(seq, s, NULL, resctrl_io_alloc_closid(r));
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 	return ret;
 }
 
@@ -1015,7 +1017,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open_file *of, char *buf,
 				   size_t nbytes, loff_t off)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 	u32 io_alloc_closid;
 	int ret = 0;
 
@@ -1025,10 +1027,11 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open_file *of, char *buf,
 
 	buf[nbytes - 1] = '\0';
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	rdt_last_cmd_clear();
 
+	r = s->res;
 	if (!r->cache.io_alloc_capable) {
 		rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name);
 		ret = -ENODEV;
@@ -1053,8 +1056,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open_file *of, char *buf,
 out_clear_configs:
 	rdt_staged_configs_clear();
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 
 	return ret ?: nbytes;
 }
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 48af75b9dc85..e62a277dee85 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -335,8 +335,9 @@ __printf(1, 2)
 void rdt_last_cmd_printf(const char *fmt, ...);
 
 struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
-
 void rdtgroup_kn_unlock(struct kernfs_node *kn);
+bool info_kn_lock(struct kernfs_node *kn);
+void info_kn_unlock(struct kernfs_node *kn);
 
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 2dacb589625d..15e3eeddb6df 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -1062,7 +1062,8 @@ int event_filter_show(struct kernfs_open_file *of, struct seq_file *seq, void *v
 	bool sep = false;
 	int ret = 0, i;
 
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	rdt_last_cmd_clear();
 
 	r = resctrl_arch_get_resource(mevt->rid);
@@ -1083,7 +1084,7 @@ int event_filter_show(struct kernfs_open_file *of, struct seq_file *seq, void *v
 	seq_putc(seq, '\n');
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
+	info_kn_unlock(of->kn);
 
 	return ret;
 }
@@ -1094,7 +1095,8 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_open_file *of, struct seq_fil
 	struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
 	int ret = 0;
 
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	rdt_last_cmd_clear();
 
 	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
@@ -1106,7 +1108,7 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_open_file *of, struct seq_fil
 	seq_printf(s, "%u\n", r->mon.mbm_assign_on_mkdir);
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
+	info_kn_unlock(of->kn);
 
 	return ret;
 }
@@ -1122,7 +1124,8 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kernfs_open_file *of, char *buf
 	if (ret)
 		return ret;
 
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	rdt_last_cmd_clear();
 
 	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
@@ -1134,7 +1137,7 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kernfs_open_file *of, char *buf
 	r->mon.mbm_assign_on_mkdir = value;
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
+	info_kn_unlock(of->kn);
 
 	return ret ?: nbytes;
 }
@@ -1424,8 +1427,8 @@ ssize_t event_filter_write(struct kernfs_open_file *of, char *buf, size_t nbytes
 
 	buf[nbytes - 1] = '\0';
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
 	rdt_last_cmd_clear();
 
@@ -1448,8 +1451,7 @@ ssize_t event_filter_write(struct kernfs_open_file *of, char *buf, size_t nbytes
 	}
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 
 	return ret ?: nbytes;
 }
@@ -1460,7 +1462,8 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
 	struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
 	bool enabled;
 
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	enabled = resctrl_arch_mbm_cntr_assign_enabled(r);
 
 	if (r->mon.mbm_cntr_assignable) {
@@ -1479,7 +1482,7 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
 		seq_puts(s, "[default]\n");
 	}
 
-	mutex_unlock(&rdtgroup_mutex);
+	info_kn_unlock(of->kn);
 
 	return 0;
 }
@@ -1498,8 +1501,8 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of, char *buf,
 
 	buf[nbytes - 1] = '\0';
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
 	rdt_last_cmd_clear();
 
@@ -1557,8 +1560,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of, char *buf,
 	}
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 
 	return ret ?: nbytes;
 }
@@ -1570,8 +1572,8 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
 	struct rdt_l3_mon_domain *dom;
 	bool sep = false;
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
 	list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
@@ -1582,8 +1584,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
 	}
 	seq_putc(s, '\n');
 
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 	return 0;
 }
 
@@ -1596,8 +1597,8 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
 	u32 cntrs, i;
 	int ret = 0;
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
 	rdt_last_cmd_clear();
 
@@ -1623,8 +1624,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
 	seq_putc(s, '\n');
 
 out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 
 	return ret;
 }
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 77c9d22017bc..9f998e394911 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -977,13 +977,14 @@ static int rdt_last_cmd_status_show(struct kernfs_open_file *of,
 {
 	int len;
 
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	len = seq_buf_used(&last_cmd_status);
 	if (len)
 		seq_printf(seq, "%.*s", len, last_cmd_status_buf);
 	else
 		seq_puts(seq, "ok\n");
-	mutex_unlock(&rdtgroup_mutex);
+	info_kn_unlock(of->kn);
 	return 0;
 }
 
@@ -1002,7 +1003,11 @@ static int rdt_num_closids_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	seq_printf(seq, "%u\n", s->num_closid);
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1010,9 +1015,14 @@ static int rdt_default_ctrl_show(struct kernfs_open_file *of,
 				 struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	seq_printf(seq, "%x\n", resctrl_get_default_ctrl(r));
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1020,9 +1030,15 @@ static int rdt_min_cbm_bits_show(struct kernfs_open_file *of,
 				 struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
+
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	seq_printf(seq, "%u\n", r->cache.min_cbm_bits);
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1030,9 +1046,14 @@ static int rdt_shareable_bits_show(struct kernfs_open_file *of,
 				   struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	seq_printf(seq, "%x\n", r->cache.shareable_bits);
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1060,15 +1081,16 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	 */
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
-	struct rdt_resource *r = s->res;
 	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
+	struct rdt_resource *r;
 	enum rdtgrp_mode mode;
 	bool sep = false;
 	u32 ctrl_val;
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
 			seq_putc(seq, ';');
@@ -1144,8 +1166,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 		sep = true;
 	}
 	seq_putc(seq, '\n');
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
+	info_kn_unlock(of->kn);
 	return 0;
 }
 
@@ -1153,9 +1174,14 @@ static int rdt_min_bw_show(struct kernfs_open_file *of,
 			   struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	seq_printf(seq, "%u\n", r->membw.min_bw);
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1164,8 +1190,12 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
 {
 	struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	seq_printf(seq, "%u\n", r->mon.num_rmid);
 
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1175,6 +1205,8 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
 	struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
 	struct mon_evt *mevt;
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	for_each_mon_event(mevt) {
 		if (mevt->rid != r->rid || !mevt->enabled)
 			continue;
@@ -1184,6 +1216,8 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
 			seq_printf(seq, "%s_config\n", mevt->name);
 	}
 
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1191,9 +1225,14 @@ static int rdt_bw_gran_show(struct kernfs_open_file *of,
 			    struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	seq_printf(seq, "%u\n", r->membw.bw_gran);
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1201,16 +1240,24 @@ static int rdt_delay_linear_show(struct kernfs_open_file *of,
 				 struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	seq_printf(seq, "%u\n", r->membw.delay_linear);
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
 static int max_threshold_occ_show(struct kernfs_open_file *of,
 				  struct seq_file *seq, void *v)
 {
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 	seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold);
+	info_kn_unlock(of->kn);
 
 	return 0;
 }
@@ -1219,22 +1266,28 @@ static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of,
 					 struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
+
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
 
+	r = s->res;
 	switch (r->membw.throttle_mode) {
 	case THREAD_THROTTLE_PER_THREAD:
 		seq_puts(seq, "per-thread\n");
-		return 0;
+		break;
 	case THREAD_THROTTLE_MAX:
 		seq_puts(seq, "max\n");
-		return 0;
+		break;
 	case THREAD_THROTTLE_UNDEFINED:
 		seq_puts(seq, "undefined\n");
-		return 0;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
 	}
 
-	WARN_ON_ONCE(1);
-
+	info_kn_unlock(of->kn);
 	return 0;
 }
 
@@ -1248,12 +1301,20 @@ static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
 	if (ret)
 		return ret;
 
-	if (bytes > resctrl_rmid_realloc_limit)
-		return -EINVAL;
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+
+	if (bytes > resctrl_rmid_realloc_limit) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
 
 	resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes);
 
-	return nbytes;
+out_unlock:
+	info_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
 }
 
 /*
@@ -1293,10 +1354,15 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
 					struct seq_file *seq, void *v)
 {
 	struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
-	struct rdt_resource *r = s->res;
+	struct rdt_resource *r;
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+	r = s->res;
 	seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks);
 
+	info_kn_unlock(of->kn);
+
 	return 0;
 }
 
@@ -1652,8 +1718,8 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 	struct rdt_l3_mon_domain *dom;
 	bool sep = false;
 
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
+	lockdep_assert_cpus_held();
+	lockdep_assert_held(&rdtgroup_mutex);
 
 	list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_held()) {
 		if (sep)
@@ -1670,8 +1736,6 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 	}
 	seq_puts(s, "\n");
 
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 
 	return 0;
 }
@@ -1681,8 +1745,12 @@ static int mbm_total_bytes_config_show(struct kernfs_open_file *of,
 {
 	struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
 
+	if (!info_kn_lock(of->kn))
+		return -ENOENT;
+
mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); =20 + info_kn_unlock(of->kn); return 0; } =20 @@ -1691,8 +1759,12 @@ static int mbm_local_bytes_config_show(struct kernfs= _open_file *of, { struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); =20 + info_kn_unlock(of->kn); return 0; } =20 @@ -1790,8 +1862,8 @@ static ssize_t mbm_total_bytes_config_write(struct ke= rnfs_open_file *of, if (nbytes =3D=3D 0 || buf[nbytes - 1] !=3D '\n') return -EINVAL; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1799,8 +1871,7 @@ static ssize_t mbm_total_bytes_config_write(struct ke= rnfs_open_file *of, =20 ret =3D mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1816,8 +1887,8 @@ static ssize_t mbm_local_bytes_config_write(struct ke= rnfs_open_file *of, if (nbytes =3D=3D 0 || buf[nbytes - 1] !=3D '\n') return -EINVAL; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1825,8 +1896,7 @@ static ssize_t mbm_local_bytes_config_write(struct ke= rnfs_open_file *of, =20 ret =3D mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -2659,6 +2729,58 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn) rdtgroup_kn_put(rdtgrp, kn); } =20 +/* + * Accessing the kn after breaking active protection is safe since the open + * of resctrl file holds a kernfs base reference (different from active + * protection) on the kn ensuring that it remains accessible even if it was + * unlinked. 
Each kn in turn holds base reference to parent so the kn's + * genealogy remains in memory until all base references dropped. + */ +static bool is_active_resctrl_node(struct kernfs_node *kn) +{ + struct kernfs_node *p; + bool match =3D false; + + guard(rcu)(); + p =3D kn; + while (p) { + if (p =3D=3D rdtgroup_default.kn) { + match =3D true; + break; + } + p =3D rcu_dereference(p->__parent); + } + + return match; +} + +bool info_kn_lock(struct kernfs_node *kn) +{ + kernfs_break_active_protection(kn); + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* + * Check both if resctrl is torn down (!rdtgroup_default.kn) and + * if the reader's kernfs_node originates from a dead mount. + */ + if (!rdtgroup_default.kn || !is_active_resctrl_node(kn)) { + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + kernfs_unbreak_active_protection(kn); + return false; + } + + return true; +} + +void info_kn_unlock(struct kernfs_node *kn) +{ + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + kernfs_unbreak_active_protection(kn); +} + static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **mon_data_kn); --=20 2.50.1 From nobody Thu Jun 11 01:40:31 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4EF073988E1 for ; Tue, 9 Jun 2026 21:03:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038988; cv=none; b=V51nbgmZS5bAHsbOz2JNwQo6jHUdgYR3pgp6qvVCJPl75m4jrILLX1piKbCf/ycUzSsq1jVy5a3wggcVsJPmLa4LfzNKxVERdUEh5kvuSfyGSjybW0BdQlUPOpuR/Evj526C/28WwrmoneD78SBLwdI3JJxsCm8xjAHVRhs1AeA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038988; c=relaxed/simple; 
bh=LI3fWkBiCZaOC31ilZkPdUqLlp43WfXVSeGsZHEYUvs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=dTWLeb2oSufct7pCiBXVoMCSv0ikprr+5WjOhgWYGQCVwXj/f2edpYuyvhPSqtvUJZXyUKbWMeJlCfDqIPBu8aJY/DWbGRWhhgxWXM7x2x4FkOd8UaHvvkGJpsCYajpoj/h4X3JvJ2MduQPxSWt9swWKjc6uwC8yI8fRgw253pY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Q1HzgAhV; arc=none smtp.client-ip=198.175.65.18 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Q1HzgAhV" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1781038983; x=1812574983; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=LI3fWkBiCZaOC31ilZkPdUqLlp43WfXVSeGsZHEYUvs=; b=Q1HzgAhVHcHV4SMmOl/cQVD0megN88WeE/kpVfryNh3mFX1d/91aXtbf 7l4uN6JWcspvpHVb1Skzq/GwefQfmbkVl9eq1qx861beDpenynhmtmMTO sz1skuvVhXzu1QD5Ju4VzQjXdxgAg2DlpSh5iJpKxP4AeJ/yeXpIzfvOt Dc1MMwHkVRZySD8tL+HCmsBlV2zrUQE4bV4GhIXJgP6wsqfxgdhUnfEby SqG+IAe3LXRnQLFXkldRorlW36sAZ0wQaGInUkAvohckChw7/CQD1SqfT b1x7RUN60m8p/wzPYVmbE8n4HpHTCTmNjlv4lsBVarwS7A49i+FRbvXAs g==; X-CSE-ConnectionGUID: fQ5ZmlYhTqSpEIQiGfgH0w== X-CSE-MsgGUID: S+G6jE21RZC3QF9TaQn0IA== X-IronPort-AV: E=McAfee;i="6800,10657,11812"; a="81885043" X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="81885043" Received: from fmviesa005.fm.intel.com ([10.60.135.145]) by orvoesa110.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:59 -0700 X-CSE-ConnectionGUID: sFmORcTHTb+hZXHhtm8v/w== X-CSE-MsgGUID: YfnJtOCjRmehp5+yT9DXSA== X-ExtLoop1: 1 
X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="251045755"
Received: from rchatre-desk1.jf.intel.com ([10.165.154.99])
 by fmviesa005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:57 -0700
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v5 10/11] x86/resctrl: Ensure domain fully initialized before placed on RCU list
Date: Tue, 9 Jun 2026 14:02:36 -0700
Message-ID:
X-Mailer: git-send-email 2.50.1
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

A resctrl domain consists of the domain structure itself that includes
pointers to dynamically allocated filesystem as well as architecture
specific data. For example, the L3 monitoring domain structure consists of
the architecture specific struct rdt_hw_l3_mon_domain that contains the
dynamically allocated rdt_hw_l3_mon_domain::arch_mbm_states architectural
state and the embedded struct rdt_l3_mon_domain contains the dynamically
allocated rdt_l3_mon_domain::mbm_states resctrl fs state.

The domains are added to and removed from an RCU protected list while
cpus_write_lock() is held so that readers could access domains via
cpus_read_lock() or from an RCU read-side critical section. A reader
accessing a domain via the RCU list expects that the domain and all its
dynamically allocated data is accessible.

Only place the domain on the RCU list when all its dynamically allocated
data is ready. Similarly, unlink it from the RCU list (again with
cpus_write_lock() held) before removing any of its dynamically allocated
data.

Calling resctrl_online_mon_domain() before adding the domain to the RCU
list creates the kernfs files that expose the domain's monitoring data to
user space before the domain is on the RCU list. This is safe because
rdtgroup_mondata_show() acquires cpus_read_lock() before it traverses the
RCU list and will thus block until the domain is added to the RCU list.

There are no readers accessing a domain via the RCU list. Ensure safety of
access when such a reader arrives.

Signed-off-by: Reinette Chatre
Reviewed-by: Tony Luck
Reviewed-by: Chen Yu
---
Changes since V2:
- New patch

Changes since V3:
- Add Tony's Reviewed-by tag.
- Add Chenyu's Reviewed-by tag.
- Grammar fixes in changelog.
- Add snippet to changelog about possible race with
  rdtgroup_mondata_show().
---
 arch/x86/kernel/cpu/resctrl/core.c      | 18 +++++++-----------
 arch/x86/kernel/cpu/resctrl/intel_aet.c |  5 ++---
 2 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9c01d2562b7a..bca782050198 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -515,14 +515,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail_rcu(&d->hdr.list, add_pos);
-
 	err = resctrl_online_ctrl_domain(r, d);
 	if (err) {
-		list_del_rcu(&d->hdr.list);
-		synchronize_rcu();
 		ctrl_domain_free(hw_dom);
+		return;
 	}
+	list_add_tail_rcu(&d->hdr.list, add_pos);
 }
 
 static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
@@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
 		return;
 	}
 
-	list_add_tail_rcu(&d->hdr.list, add_pos);
-
 	err = resctrl_online_mon_domain(r, &d->hdr);
 	if (err) {
-		list_del_rcu(&d->hdr.list);
-		synchronize_rcu();
 		l3_mon_domain_free(hw_dom);
+		return;
 	}
+	list_add_tail_rcu(&d->hdr.list, add_pos);
 }
 
 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
@@ -642,9 +638,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
-	resctrl_offline_ctrl_domain(r, d);
 	list_del_rcu(&hdr->list);
 	synchronize_rcu();
+	resctrl_offline_ctrl_domain(r, d);
 
 	/*
 	 * rdt_ctrl_domain "d" is going to be freed below, so clear
@@ -689,9 +685,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 
 		d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
 		hw_dom = resctrl_to_arch_mon_dom(d);
-		resctrl_offline_mon_domain(r, hdr);
 		list_del_rcu(&hdr->list);
 		synchronize_rcu();
+		resctrl_offline_mon_domain(r, hdr);
 		l3_mon_domain_free(hw_dom);
 		break;
 	}
@@ -702,9 +698,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 			return;
 
 		pkgd = container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr);
-		resctrl_offline_mon_domain(r, hdr);
 		list_del_rcu(&hdr->list);
 		synchronize_rcu();
+		resctrl_offline_mon_domain(r, hdr);
 		kfree(pkgd);
 		break;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 89b8b619d5d5..c22c3cf5167d 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -398,12 +398,11 @@ void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
 	d->hdr.type = RESCTRL_MON_DOMAIN;
 	d->hdr.rid = RDT_RESOURCE_PERF_PKG;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
-	list_add_tail_rcu(&d->hdr.list, add_pos);
 
 	err = resctrl_online_mon_domain(r, &d->hdr);
 	if (err) {
-		list_del_rcu(&d->hdr.list);
-		synchronize_rcu();
 		kfree(d);
+		return;
 	}
+	list_add_tail_rcu(&d->hdr.list, add_pos);
 }
-- 
2.50.1

From nobody Thu Jun 11 01:40:31 2026
Received: from
mgamail.intel.com (mgamail.intel.com [198.175.65.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 700B23D5640 for ; Tue, 9 Jun 2026 21:03:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038991; cv=none; b=NXEZqv/K7ull5qPCz3fd8F3ghsMqRehb1L+V60oYA+q3Kr/ZvJ6+dFTtlyOGfE1Xs+dLVlIjmYFk32bBSPF9KSgOjZDgo1IB/w/cF1B9B6/rt9qIJgSD9mMsv4HgdInCeIc0BBLUS8a2R8V62M7hGJsN4aExqhKPBuFczJFFzpQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781038991; c=relaxed/simple; bh=a5p5K5hWHX3lpBLrRN3Kva6wjWiWd4XwRFQ9Z5zfrBc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=J5lvesIm/jhDSmji6a5A6nPryXGH+qbGUDTGXm2X42H3HyamtfXbF7ue6BZUCXvpXiRKlJa2pViofiyBde9ixlxP0/ps94lLnMgDy2IvXposF8pzyC+Trh67CS1eUnKw34DvZsH2jN+f7GthARLUe/SVMfZtvLfGYtXnW04/xB4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ny/ywJDr; arc=none smtp.client-ip=198.175.65.18 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ny/ywJDr" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1781038983; x=1812574983; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=a5p5K5hWHX3lpBLrRN3Kva6wjWiWd4XwRFQ9Z5zfrBc=; b=ny/ywJDrnUMiHgu9zkGhqzHiS2A1zg8OkjyGEL7f9Qk0WG54l7ZBANNW 
uCaVReAdySKb92QNdv+t7jBi8I139EclwDxmemCErdSQbg3GeiO47O56B /oJ56FrVowEL+Az6Hi0eg0GGxAaB8pzfsS7qpm/PWJqw+OU3F4465y6+Z c2BDN3OfejN/tkuLR4M0Y9NYTea1Qhj0n+TRTJxUPOiVrCvgTnxI3ji/R 33JLAx5iF+B/aKeSE8AvscOsHrRUH2gnO680DvJh2x1JOG/WXThK+L0/d 9k0AcnVARvfmr9IpYKyKUPnIpftEPjJBqGLMEZTT5qmoIzlIw8ibjFTEg g==; X-CSE-ConnectionGUID: V1HeaD0OTJWCRxtbmAvXcA== X-CSE-MsgGUID: I41Iq8+HT3S/8Of3pjDyxA== X-IronPort-AV: E=McAfee;i="6800,10657,11812"; a="81885048" X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="81885048" Received: from fmviesa005.fm.intel.com ([10.60.135.145]) by orvoesa110.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:03:00 -0700 X-CSE-ConnectionGUID: AH9ry8rQReyXU+C1CDoBZg== X-CSE-MsgGUID: NI0fI0VlQTiFytsA5v9kUQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,196,1774335600"; d="scan'208";a="251045768" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jun 2026 14:02:58 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v5 11/11] fs/resctrl: Fix UAF from worker threads when domains are removed Date: Tue, 9 Jun 2026 14:02:37 -0700 Message-ID: X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The mbm_handle_overflow() and cqm_handle_limbo() workers read event counters and may sleep while doing so. They are scheduled via delayed_work embedded in struct rdt_l3_mon_domain. 
Architecture allocates and frees these domains from CPU hotplug callbacks
under cpus_write_lock(), and the workers acquire cpus_read_lock() to keep
the domain alive across their access.

A use-after-free can occur when a worker is blocked waiting for
cpus_read_lock() while the hotplug core holds cpus_write_lock(): the
architecture frees the rdt_l3_mon_domain that contains the worker's
work_struct. When the worker unblocks, the container_of() it performs on
the embedded work pointer dereferences freed memory.

Drop cpus_read_lock() from the workers and instead drain pending and
in-flight work synchronously before the architecture can free the domain.
Since architecture offlines the domain under cpus_write_lock() after it has
been unlinked from the RCU list and a grace period has elapsed, no new work
can be scheduled. The cancel only needs to wait out existing work.

Drop rdtgroup_mutex during CPU offline around cancel_delayed_work_sync() so
that a worker waiting on the mutex can complete before re-pinning the work
on a different CPU.

When offlining a CPU the architecture may iterate over resources in any
order. For example, the MBA control domain may be offlined before or after
a corresponding L3 monitor domain. Ensure that resctrl fs cancels the
workers no matter what order the architecture offlines the domains.

Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com # [1]
Co-developed-by: Tony Luck
Signed-off-by: Tony Luck
Signed-off-by: Reinette Chatre
---
Changes since v2:
- Rewrite changelog
- v2 attempted to solve the issue by using is_percpu_thread() within the
  worker to learn if the CPU the worker was running on is going offline.
  Sashiko (https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=5)
  pointed out that this would not be able to handle the scenario if one of
  the hotplug handlers following the resctrl offline handlers failed.
- Some other fixes attempted that failed:
  - Switch to accessing domain structure in handler via RCU so that the CPU
    hotplug lock is no longer needed. Use cancel_delayed_work_sync() with
    mutex dropped to cancel worker. Running the worker from an RCU read-side
    critical section is a problem since the worker needs to be able to sleep
    (mbm_handle_overflow()->mbm_update()->
    mbm_update_one_event()->resctrl_arch_mon_ctx_alloc()->
    might_sleep())
  - Adding a reference count to the domain structure to avoid the worker
    needing to take CPU hotplug lock. This ended up being very complicated
    with the architecture needing new APIs to manage the reference count
    which cannot cleanly integrate into MPAM since it uses a single
    architecture domain structure to contain both the control and
    monitoring domain structures. Managing the references across mount,
    unmount, online, offline, as well as worker self exit resulted in
    several asymmetrical and complicated paths that were error prone.
    Locking also proved to be complicated since architecture would need to
    initiate domain free that will need to call back into resctrl that will
    take rdtgroup_mutex which means that references need to be
    taken/released without locking.

Changes since V3:
- Traverse mon_domains list using list_for_each_entry_rcu( ...,
  lockdep_is_cpus_held()) to document how CPU hotplug lock is required to
  be held (via architecture).
- Add snippet in changelog to motivate canceling work in monitor and
  control domain offline handlers.

Changes since V4:
- Add check for empty domain to workers to avoid reading RMID when domain's
  cpu_mask is empty because x86's resctrl_arch_rmid_read() depends on there
  being CPUs in the domain's cpu_mask.
---
 fs/resctrl/monitor.c  | 60 +++++++++++++++++++++++++++++++++++-------
 fs/resctrl/rdtgroup.c | 52 +++++++++++++++++++++++++++++++++----
 2 files changed, 97 insertions(+), 15 deletions(-)

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 15e3eeddb6df..7340b1d17f17 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -633,14 +633,22 @@ void mon_event_count(void *info)
 	rr->err = 0;
 }
 
-static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu,
-							struct rdt_resource *r)
+/*
+ * Find the software controller's ctrl domain that contains @cpu on resource @r.
+ *
+ * Only called from the mbm_over worker via update_mba_bw() where the returned
+ * domain is kept alive by cancel_delayed_work_sync() in
+ * resctrl_offline_ctrl_domain(). This drains this worker and then waits on
+ * rdtgroup_mutex held here before the architecture can free the ctrl domain.
+ *
+ * Context: Call from RCU read-side critical section.
+ */
+static struct rdt_ctrl_domain *get_sc_ctrl_domain_from_cpu(int cpu,
+							   struct rdt_resource *r)
 {
 	struct rdt_ctrl_domain *d;
 
-	lockdep_assert_cpus_held();
-
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -701,7 +709,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_m
 	if (WARN_ON_ONCE(!pmbm_data))
 		return;
 
-	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
+	guard(rcu)();
+	dom_mba = get_sc_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
@@ -804,11 +813,25 @@ void cqm_handle_limbo(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	struct rdt_l3_mon_domain *d;
 
-	cpus_read_lock();
+	/*
+	 * Safe to run without CPU hotplug lock. Work is guaranteed to be
+	 * canceled before the domain structure is removed.
+	 */
 	mutex_lock(&rdtgroup_mutex);
 
+	/*
+	 * Ensure the worker is dedicated to a CPU as intended and not
+	 * relocated by workqueue subsystem as part of CPU going offline.
+	 */
+	if (!is_percpu_thread())
+		goto out_unlock;
+
 	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
 
+	/* Domain is going offline */
+	if (cpumask_empty(&d->hdr.cpu_mask))
+		goto out_unlock;
+
 	__check_limbo(d, false);
 
 	if (has_busy_rmid(d)) {
@@ -818,8 +841,8 @@ void cqm_handle_limbo(struct work_struct *work)
 					 delay);
 	}
 
+out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 }
 
 /**
@@ -851,7 +874,10 @@ void mbm_handle_overflow(struct work_struct *work)
 	struct list_head *head;
 	struct rdt_resource *r;
 
-	cpus_read_lock();
+	/*
+	 * Safe to run without CPU hotplug lock. Work is guaranteed to be
+	 * canceled before the domain structure is removed.
+	 */
 	mutex_lock(&rdtgroup_mutex);
 
 	/*
@@ -861,9 +887,24 @@ void mbm_handle_overflow(struct work_struct *work)
 	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		goto out_unlock;
 
+	/*
+	 * Ensure the worker is dedicated to a CPU and not relocated by
+	 * workqueue subsystem as part of CPU going offline since reading
+	 * events depend on smp_processor_id(). After passing this check
+	 * smp_processor_id() is valid for entire duration of this worker
+	 * since it runs with rdtgroup_mutex held and the offline handler needs
+	 * rdtgroup_mutex to offline the CPU being run on here.
+	 */
+	if (!is_percpu_thread())
+		goto out_unlock;
+
 	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
 	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
 
+	/* Domain is going offline */
+	if (cpumask_empty(&d->hdr.cpu_mask))
+		goto out_unlock;
+
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp);
 
@@ -885,7 +926,6 @@ void mbm_handle_overflow(struct work_struct *work)
 
 out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 }
 
 /**
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 9f998e394911..b5fb59d0e035 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4491,6 +4491,29 @@ static void domain_destroy_l3_mon_state(struct rdt_l3_mon_domain *d)
 
 void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	/*
+	 * mbm_handle_overflow() may dereference this ctrl domain via
+	 * update_mba_bw()->get_sc_ctrl_domain_from_cpu(). The architecture has
+	 * unlinked the domain from the RCU list and waited a grace period, so
+	 * no new worker iteration can find it; drain any worker that already
+	 * holds a pointer to it before the architecture frees the domain.
+	 *
+	 * Software controller is enabled/disabled on mount/unmount with
+	 * cpus_read_lock() held. Running here with cpus_write_lock() so
+	 * there are no concurrent changes to software controller status.
+	 */
+	if (r->rid == RDT_RESOURCE_MBA && is_mba_sc(r)) {
+		struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+		struct rdt_l3_mon_domain *mon_d;
+
+		list_for_each_entry_rcu(mon_d, &l3->mon_domains, hdr.list, lockdep_is_cpus_held()) {
+			if (mon_d->hdr.id == d->hdr.id) {
+				cancel_delayed_work_sync(&mon_d->mbm_over);
+				break;
+			}
+		}
+	}
+
 	mutex_lock(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
@@ -4503,6 +4526,24 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 {
 	struct rdt_l3_mon_domain *d;
 
+	/*
+	 * Called by architecture under CPU hotplug lock as it prepares to remove
+	 * the domain which is guaranteed to be accessible here.
+	 * The domain has been unlinked from the RCU list and a grace period
+	 * has elapsed, so no new worker can be scheduled. Drain any worker that
+	 * is in flight or pending before letting architecture proceed to free
+	 * the domain that has the workers' struct delayed_work embedded.
+	 * Do so before taking rdtgroup_mutex since the workers also acquire it.
+	 */
+	if (r->rid == RDT_RESOURCE_L3 &&
+	    domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) {
+		d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
+		if (resctrl_is_mbm_enabled())
+			cancel_delayed_work_sync(&d->mbm_over);
+		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
+			cancel_delayed_work_sync(&d->cqm_limbo);
+	}
+
 	mutex_lock(&rdtgroup_mutex);
 
 	/*
@@ -4519,8 +4560,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		goto out_unlock;
 
 	d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
-	if (resctrl_is_mbm_enabled())
-		cancel_delayed_work(&d->mbm_over);
 	if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
 		/*
 		 * When a package is going down, forcefully
@@ -4531,7 +4570,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		 * package never comes back.
 		 */
 		__check_limbo(d, true);
-		cancel_delayed_work(&d->cqm_limbo);
 	}
 
 	domain_destroy_l3_mon_state(d);
@@ -4712,12 +4750,16 @@ void resctrl_offline_cpu(unsigned int cpu)
 	d = get_mon_domain_from_cpu(cpu, l3);
 	if (d) {
 		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
-			cancel_delayed_work(&d->mbm_over);
+			mutex_unlock(&rdtgroup_mutex);
+			cancel_delayed_work_sync(&d->mbm_over);
+			mutex_lock(&rdtgroup_mutex);
 			mbm_setup_overflow_handler(d, 0, cpu);
 		}
 		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
 		    cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
-			cancel_delayed_work(&d->cqm_limbo);
+			mutex_unlock(&rdtgroup_mutex);
+			cancel_delayed_work_sync(&d->cqm_limbo);
+			mutex_lock(&rdtgroup_mutex);
 			cqm_setup_limbo_handler(d, 0, cpu);
 		}
 	}
-- 
2.50.1