From nobody Mon Jun 8 06:35:46 2026
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v4 01/10] x86,fs/resctrl: Document safe RCU list traversal
Date: Tue, 2 Jun 2026 20:27:29 -0700
Message-ID: <776eb116e624f312239fa71cb20d9005e0f709fb.1780456704.git.reinette.chatre@intel.com>
X-Mailer: git-send-email 2.50.1
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

rdt_resource::ctrl_domains and rdt_resource::mon_domains are
RCU lists with entries added and removed by the architecture from CPU hotplug callbacks that run with cpus_write_lock() held. These lists can be traversed safely from resctrl fs by either holding cpus_read_lock() or relying on an RCU read-side critical section.

resctrl fs traversals of rdt_resource::ctrl_domains and rdt_resource::mon_domains are done using list_for_each_entry() with cpus_read_lock() held. Similarly, x86 architecture callbacks use list_for_each_entry(), expecting that resctrl fs makes the call with cpus_read_lock() held. A lockdep_assert_cpus_held() inconsistently precedes the list_for_each_entry() call, at varying distance, to document this safe RCU list traversal.

In preparation for an upcoming traversal of rdt_resource::ctrl_domains that needs to be done from an RCU read-side critical section, developers must always know exactly in which context the list is being traversed.

Replace the list_for_each_entry() traversals of the RCU lists with list_for_each_entry_rcu() to document that an RCU list is being traversed. This makes use of the built-in lockdep expression, which additionally documents that it is cpus_read_lock() that allows the list to be traversed outside an RCU read-side critical section. Fall back to documenting the safety of the traversal with a comment only when lockdep lacks the needed visibility, as in functions called via smp_call*().

The lockdep expression within list_for_each_entry_rcu() depends on RCU_EXPERT, which is not set in a typical debug kernel, so keep the existing lockdep_assert_cpus_held() that is active with CONFIG_LOCKDEP=3Dy as found in a typical debug kernel.

Signed-off-by: Reinette Chatre
Reported-by: Sashiko
---
Changes since v3:
- New patch.
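The traversal rule described in the changelog can be illustrated with a small userspace sketch. It is not kernel code and not part of the patch. Every mock_* name below is hypothetical and exists only for this illustration. The sketch assumes the kernel's check amounts to "complain when the caller-supplied condition is false and no RCU read-side critical section is active", which is the behaviour the changelog relies on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical stand-in for a resctrl domain linked into an RCU list. */
struct mock_domain {
	int id;
	struct list_node list;
};

static bool mock_cpus_lock_held;	/* models cpus_read_lock() being held */
static bool mock_in_rcu_read;		/* models an RCU read-side critical section */
static int mock_unproven_walks;		/* traversals that nothing could justify */

static void mock_list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void mock_list_add_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Count a traversal as unproven when neither protection is in place. */
static void mock_list_check(bool cond)
{
	if (!cond && !mock_in_rcu_read)
		mock_unproven_walks++;
}

#define mock_entry(ptr) \
	((struct mock_domain *)((char *)(ptr) - offsetof(struct mock_domain, list)))

/* Same shape as a conditional RCU list walk: check once, then iterate. */
#define mock_for_each_domain(d, head, cond)				\
	for (mock_list_check(cond), (d) = mock_entry((head)->next);	\
	     &(d)->list != (head);					\
	     (d) = mock_entry((d)->list.next))

/* Look up a domain by id, documenting which lock makes the walk safe. */
static struct mock_domain *mock_find_domain(struct list_node *head, int id)
{
	struct mock_domain *d;

	mock_for_each_domain(d, head, mock_cpus_lock_held) {
		if (d->id == id)
			return d;
	}
	return NULL;
}
```

In this sketch the condition is evaluated once per traversal rather than once per element, so the check costs nothing inside the loop. A caller that sets neither mock_cpus_lock_held nor mock_in_rcu_read still gets a result from mock_find_domain(), but the walk is counted in mock_unproven_walks. This mirrors a lockdep check, which warns without changing behaviour.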
--- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 4 ++-- arch/x86/kernel/cpu/resctrl/monitor.c | 2 +- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 ++-- fs/resctrl/ctrlmondata.c | 12 +++++++----- fs/resctrl/monitor.c | 23 +++++++++++++--------- fs/resctrl/pseudo_lock.c | 2 +- fs/resctrl/rdtgroup.c | 24 +++++++++++------------ 7 files changed, 39 insertions(+), 32 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cp= u/resctrl/ctrlmondata.c index b20e705606b8..e74f1ed54b86 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -53,7 +53,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u= 32 closid) /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); =20 - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_he= ld()) { hw_dom =3D resctrl_to_arch_ctrl_dom(d); msr_param.res =3D NULL; for (t =3D 0; t < CDP_NUM_TYPES; t++) { @@ -115,7 +115,7 @@ static void _resctrl_sdciae_enable(struct rdt_resource = *r, bool enable) lockdep_assert_cpus_held(); =20 /* Update MSR_IA32_L3_QOS_EXT_CFG MSR on all the CPUs in all domains */ - list_for_each_entry(d, &r->ctrl_domains, hdr.list) + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_he= ld()) on_each_cpu_mask(&d->hdr.cpu_mask, resctrl_sdciae_set_one_amd, &enable, = 1); } =20 diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/re= sctrl/monitor.c index 9bf9d7e201aa..ca9c88d6fd14 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -500,7 +500,7 @@ static void _resctrl_abmc_enable(struct rdt_resource *r= , bool enable) =20 lockdep_assert_cpus_held(); =20 - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_hel= d()) { on_each_cpu_mask(&d->hdr.cpu_mask, resctrl_abmc_set_one_amd, 
&enable, 1); resctrl_arch_reset_rmid_all(r, d); diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/r= esctrl/rdtgroup.c index 885026468440..5ffa39fa86fa 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -151,7 +151,7 @@ static int set_cache_qos_cfg(int level, bool enable) return -ENOMEM; =20 r_l =3D &rdt_resources_all[level].r_resctrl; - list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r_l->ctrl_domains, hdr.list, lockdep_is_cpus_= held()) { if (r_l->cache.arch_has_per_cpu_cfg) /* Pick all the CPUs in the domain instance */ for_each_cpu(cpu, &d->hdr.cpu_mask) @@ -249,7 +249,7 @@ void resctrl_arch_reset_all_ctrls(struct rdt_resource *= r) * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU * from each domain to update the MSRs below. */ - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_he= ld()) { hw_dom =3D resctrl_to_arch_ctrl_dom(d); =20 for (i =3D 0; i < hw_res->num_closid; i++) diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index 9a7dfc48cb2e..f33712c17d38 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -261,7 +261,7 @@ static int parse_line(char *line, struct resctrl_schema= *s, return -EINVAL; } dom =3D strim(dom); - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_he= ld()) { if (d->hdr.id =3D=3D dom_id) { data.buf =3D dom; data.closid =3D rdtgrp->closid; @@ -397,7 +397,7 @@ static void show_doms(struct seq_file *s, struct resctr= l_schema *schema, =20 if (resource_name) seq_printf(s, "%*s:", max_name_width, resource_name); - list_for_each_entry(dom, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus_= held()) { if (sep) seq_puts(s, ";"); =20 @@ -535,6 +535,8 @@ struct rdt_domain_hdr 
*resctrl_find_domain(struct list_= head *h, int id, struct rdt_domain_hdr *d; struct list_head *l; =20 + lockdep_assert_cpus_held(); + list_for_each(l, h) { d =3D list_entry(l, struct rdt_domain_hdr, list); /* When id is found, return its domain. */ @@ -717,7 +719,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) * struct mon_data. Search all domains in the resource for * one that matches this cache id. */ - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_he= ld()) { if (d->ci_id =3D=3D domid) { cpu =3D cpumask_any(&d->hdr.cpu_mask); ci =3D get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE); @@ -817,7 +819,7 @@ static int resctrl_io_alloc_init_cbm(struct resctrl_sch= ema *s, u32 closid) /* Keep CDP_CODE and CDP_DATA of io_alloc CLOSID's CBM in sync. */ if (resctrl_arch_get_cdp_enabled(r->rid)) { peer_type =3D resctrl_peer_type(s->conf_type); - list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) + list_for_each_entry_rcu(d, &s->res->ctrl_domains, hdr.list, lockdep_is_c= pus_held()) memcpy(&d->staged_config[peer_type], &d->staged_config[s->conf_type], sizeof(d->staged_config[0])); @@ -980,7 +982,7 @@ static int resctrl_io_alloc_parse_line(char *line, str= uct rdt_resource *r, } =20 dom =3D strim(dom); - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_he= ld()) { if (update_all || d->hdr.id =3D=3D dom_id) { data.buf =3D dom; data.mode =3D RDT_MODE_SHAREABLE; diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index 0e6a389a16bf..d2aa7d045056 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -304,7 +304,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) idx =3D resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid); =20 entry->busy =3D 0; - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_hel= d()) { 
/* * For the first limbo RMID in the domain, * setup up the limbo worker. @@ -502,6 +502,11 @@ static int __l3_mon_event_count_sum(struct rdtgroup *r= dtgrp, struct rmid_read *r * all domains fail for any reason. */ ret =3D -EINVAL; + /* + * RCU list being traversed with CPU hotplug lock held. lockdep + * unable to help prove this here since this work is scheduled via + * smp_call*(). Not called from MBM overflow handler. + */ list_for_each_entry(d, &rr->r->mon_domains, hdr.list) { if (d->ci_id !=3D rr->ci->id) continue; @@ -1226,7 +1231,7 @@ static int rdtgroup_assign_cntr_event(struct rdt_l3_m= on_domain *d, struct rdtgro int ret =3D 0; =20 if (!d) { - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_he= ld()) { int err; =20 err =3D rdtgroup_alloc_assign_cntr(r, d, rdtgrp, mevt); @@ -1298,7 +1303,7 @@ static void rdtgroup_unassign_cntr_event(struct rdt_l= 3_mon_domain *d, struct rdt struct rdt_resource *r =3D resctrl_arch_get_resource(mevt->rid); =20 if (!d) { - list_for_each_entry(d, &r->mon_domains, hdr.list) + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_he= ld()) rdtgroup_free_unassign_cntr(r, d, rdtgrp, mevt); } else { rdtgroup_free_unassign_cntr(r, d, rdtgrp, mevt); @@ -1370,7 +1375,7 @@ static void rdtgroup_update_cntr_event(struct rdt_res= ource *r, struct rdtgroup * struct rdt_l3_mon_domain *d; int cntr_id; =20 - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_hel= d()) { cntr_id =3D mbm_cntr_get(r, d, rdtgrp, evtid); if (cntr_id >=3D 0) rdtgroup_assign_cntr(r, d, evtid, rdtgrp->mon.rmid, @@ -1540,7 +1545,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_o= pen_file *of, char *buf, /* * Reset all the non-achitectural RMID state and assignable counters. 
*/ - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_he= ld()) { mbm_cntr_free_all(r, d); resctrl_reset_rmid_all(r, d); } @@ -1563,7 +1568,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_fil= e *of, cpus_read_lock(); mutex_lock(&rdtgroup_mutex); =20 - list_for_each_entry(dom, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_h= eld()) { if (sep) seq_putc(s, ';'); =20 @@ -1597,7 +1602,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_op= en_file *of, goto out_unlock; } =20 - list_for_each_entry(dom, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_h= eld()) { if (sep) seq_putc(s, ';'); =20 @@ -1647,7 +1652,7 @@ int mbm_L3_assignments_show(struct kernfs_open_file *= of, struct seq_file *s, voi =20 sep =3D false; seq_printf(s, "%s:", mevt->name); - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_he= ld()) { if (sep) seq_putc(s, ';'); =20 @@ -1745,7 +1750,7 @@ static int resctrl_parse_mbm_assignment(struct rdt_re= source *r, struct rdtgroup } =20 /* Verify if the dom_id is valid */ - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_hel= d()) { if (d->hdr.id =3D=3D dom_id) { ret =3D rdtgroup_modify_assign_state(dom_str, d, rdtgrp, mevt); if (ret) { diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c index d1cb0986006e..dea2b4bf966f 100644 --- a/fs/resctrl/pseudo_lock.c +++ b/fs/resctrl/pseudo_lock.c @@ -656,7 +656,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctr= l_domain *d) * associated with them. 
*/ for_each_alloc_capable_rdt_resource(r) { - list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d_i, &r->ctrl_domains, hdr.list, lockdep_is_cpus= _held()) { if (d_i->plr) cpumask_or(cpu_with_psl, cpu_with_psl, &d_i->hdr.cpu_mask); diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index af2cbab14497..2a6221925767 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -117,7 +117,7 @@ void rdt_staged_configs_clear(void) lockdep_assert_held(&rdtgroup_mutex); =20 for_each_alloc_capable_rdt_resource(r) { - list_for_each_entry(dom, &r->ctrl_domains, hdr.list) + list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus= _held()) memset(dom->staged_config, 0, sizeof(dom->staged_config)); } } @@ -1063,7 +1063,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file= *of, =20 cpus_read_lock(); mutex_lock(&rdtgroup_mutex); - list_for_each_entry(dom, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus_= held()) { if (sep) seq_putc(seq, ';'); hw_shareable =3D r->cache.shareable_bits; @@ -1415,7 +1415,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgr= oup *rdtgrp) if (r->rid =3D=3D RDT_RESOURCE_MBA || r->rid =3D=3D RDT_RESOURCE_SMBA) continue; has_cache =3D true; - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_h= eld()) { ctrl =3D resctrl_arch_get_config(r, d, closid, s->conf_type); if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { @@ -1604,7 +1604,7 @@ static int rdtgroup_size_show(struct kernfs_open_file= *of, type =3D schema->conf_type; sep =3D false; seq_printf(s, "%*s:", max_name_width, schema->name); - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_h= eld()) { if (sep) seq_putc(s, ';'); if (rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKSETUP) { @@ -1649,7 +1649,7 @@ static int 
mbm_config_show(struct seq_file *s, struct= rdt_resource *r, u32 evtid cpus_read_lock(); mutex_lock(&rdtgroup_mutex); =20 - list_for_each_entry(dom, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_h= eld()) { if (sep) seq_puts(s, ";"); =20 @@ -1763,7 +1763,7 @@ static int mon_config_write(struct rdt_resource *r, c= har *tok, u32 evtid) return -EINVAL; } =20 - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_hel= d()) { if (d->hdr.id =3D=3D dom_id) { mbm_config_write_domain(r, d, evtid, val); goto next; @@ -2554,7 +2554,7 @@ static int set_mba_sc(bool mba_sc) =20 rdtgroup_default.mba_mbps_event =3D mba_mbps_default_event; =20 - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_he= ld()) { for (i =3D 0; i < num_closid; i++) d->mbps_val[i] =3D MBA_MAX_MBPS; } @@ -2879,7 +2879,7 @@ static int rdt_get_tree(struct fs_context *fc) =20 if (resctrl_is_mbm_enabled()) { r =3D resctrl_arch_get_resource(RDT_RESOURCE_L3); - list_for_each_entry(dom, &r->mon_domains, hdr.list) + list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_= held()) mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); } @@ -3435,7 +3435,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_= node *parent_kn, /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); =20 - list_for_each_entry(hdr, &r->mon_domains, list) { + list_for_each_entry_rcu(hdr, &r->mon_domains, list, lockdep_is_cpus_held(= )) { ret =3D mkdir_mondata_subdir(parent_kn, hdr, r, prgrp); if (ret) return ret; @@ -3620,7 +3620,7 @@ int rdtgroup_init_cat(struct resctrl_schema *s, u32 c= losid) struct rdt_ctrl_domain *d; int ret; =20 - list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &s->res->ctrl_domains, hdr.list, 
lockdep_is_cp= us_held()) { ret =3D __init_one_rdt_domain(d, s, closid); if (ret < 0) return ret; @@ -3635,7 +3635,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r,= u32 closid) struct resctrl_staged_config *cfg; struct rdt_ctrl_domain *d; =20 - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list, lockdep_is_cpus_he= ld()) { if (is_mba_sc(r)) { d->mbps_val[closid] =3D MBA_MAX_MBPS; continue; @@ -4506,7 +4506,7 @@ static struct rdt_l3_mon_domain *get_mon_domain_from_= cpu(int cpu, =20 lockdep_assert_cpus_held(); =20 - list_for_each_entry(d, &r->mon_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->mon_domains, hdr.list, lockdep_is_cpus_hel= d()) { /* Find the domain that contains this CPU */ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) return d; --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2EB923F1668 for ; Wed, 3 Jun 2026 03:28:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457282; cv=none; b=R/NMg7sgph5DXh723amkwD9N/YY3v2BOS94Gz54xZxhaIw3KZfCA/d1aE+UL6L3YIWkrNFhZd99Ry7JeZD0779G+li6xQg7tMYtx6ClNMtcLimLBxStqBHlPlml6YnfskIa3r4PvdIvJm0ysu8p9BPdywZsCB+q4UP93W5nZglg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457282; c=relaxed/simple; bh=6ATno5HyYAEBpGTzLDqq16Sa/353TxzmxUBOAQq6ObE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=VHOZyJFm1FC5wrydcqoRcWDX6ltN2Jy4BqXI3MC2CNvwloz17QEKVDTff+pnK0/Nnblj++z50oQbaieKAEvOAGFNgGHaejlTK+xlF8KJ8xDQ4jFwrbvS/Rx4mX7WD3RVVM0ZcLR1364gPaI5JnF7DkxBNxhoYKsCT98YWwdttC0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=mVIvMe8B; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="mVIvMe8B" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457281; x=1811993281; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=6ATno5HyYAEBpGTzLDqq16Sa/353TxzmxUBOAQq6ObE=; b=mVIvMe8BU/FDRcLuWQ2IFjiCNrvJ7Xa7JmU9Sq1JIfQpVIvj/rhJwu/Q +LxYtkwd/jVatm3c0u6egq9G1EP4TXCPMf8w90+AJtQOrPusX/VKskwfP xHxz9JDGAvHfybi9Gt1fnCmtBNwEiB+zo3nkP0RawbZRcTh5qJR/qqeLY noTikm1oj3cPQOd3imQ6KbI0ney+r+mAHNYu6KEnVD2SDstTi5MIc0tnJ Jvi+EQ96DUEOUICTV3l7Wkt2KYik4qPEHwnOTBaFz4ulfyx2NaO6+J96H ABdAttQ7xKO2KHW/rQQtKbsNs0kW6blP8ko0/tmFtzQ27sIC3zYoT7W5l g==; X-CSE-ConnectionGUID: Bx6k12EzStWYi8/5gIvdVA== X-CSE-MsgGUID: YrfBb3uSQtabInVEgBXxFg== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91938979" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91938979" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:27:59 -0700 X-CSE-ConnectionGUID: AeGIMIwyTaWxqNaT7HDxIQ== X-CSE-MsgGUID: 1qNHthCTROeJM2knSwRqzg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110099" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:27:58 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, 
dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v4 02/10] fs/resctrl: Move functions to avoid forward references in subsequent fixes Date: Tue, 2 Jun 2026 20:27:30 -0700 Message-ID: <741d65f435bd6745693c321b817eca58b70ec0b2.1780456704.git.reinette.chatre@intel.com> X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Tony Luck rdt_get_tree() manages resctrl fs mount and rdt_kill_sb() manages resctrl fs unmount. There is significant overlap between error cleanup during resctrl mount failure and cleanup on resctrl unmount yet the cleanup is not done consistently in these two flows. Pull some cleanup functions before rdt_get_tree() in preparation for a new helper that can be shared between mount and unmount. Signed-off-by: Tony Luck Signed-off-by: Reinette Chatre Reviewed-by: Ben Horgan Reported-by: Sashiko --- Changes since V2: - Rewrite changelog. Changes since V3: - Add Ben's Reviewed-by tag. --- fs/resctrl/rdtgroup.c | 376 +++++++++++++++++++++--------------------- 1 file changed, 188 insertions(+), 188 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 2a6221925767..2b624cf02147 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -2792,6 +2792,194 @@ static void schemata_list_destroy(void) } } =20 +/* + * Move tasks from one to the other group. If @from is NULL, then all tasks + * in the systems are moved unconditionally (used for teardown). 
+ * + * If @mask is not NULL the cpus on which moved tasks are running are set + * in that mask so the update smp function call is restricted to affected + * cpus. + */ +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *t= o, + struct cpumask *mask) +{ + struct task_struct *p, *t; + + read_lock(&tasklist_lock); + for_each_process_thread(p, t) { + if (!from || is_closid_match(t, from) || + is_rmid_match(t, from)) { + resctrl_arch_set_closid_rmid(t, to->closid, + to->mon.rmid); + + /* + * Order the closid/rmid stores above before the loads + * in task_curr(). This pairs with the full barrier + * between the rq->curr update and + * resctrl_arch_sched_in() during context switch. + */ + smp_mb(); + + /* + * If the task is on a CPU, set the CPU in the mask. + * The detection is inaccurate as tasks might move or + * schedule before the smp function call takes place. + * In such a case the function call is pointless, but + * there is no other side effect. + */ + if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) + cpumask_set_cpu(task_cpu(t), mask); + } + } + read_unlock(&tasklist_lock); +} + +static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) +{ + struct rdtgroup *sentry, *stmp; + struct list_head *head; + + head =3D &rdtgrp->mon.crdtgrp_list; + list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { + rdtgroup_unassign_cntrs(sentry); + free_rmid(sentry->closid, sentry->mon.rmid); + list_del(&sentry->mon.crdtgrp_list); + + if (atomic_read(&sentry->waitcount) !=3D 0) + sentry->flags =3D RDT_DELETED; + else + rdtgroup_remove(sentry); + } +} + +/* + * Forcibly remove all of subdirectories under root. 
+ */ +static void rmdir_all_sub(void) +{ + struct rdtgroup *rdtgrp, *tmp; + + /* Move all tasks to the default resource group */ + rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); + + list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { + /* Free any child rmids */ + free_all_child_rdtgrp(rdtgrp); + + /* Remove each rdtgroup other than root */ + if (rdtgrp =3D=3D &rdtgroup_default) + continue; + + if (rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + + /* + * Give any CPUs back to the default group. We cannot copy + * cpu_online_mask because a CPU might have executed the + * offline callback already, but is still marked online. + */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + rdtgroup_unassign_cntrs(rdtgrp); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + kernfs_remove(rdtgrp->kn); + list_del(&rdtgrp->rdtgroup_list); + + if (atomic_read(&rdtgrp->waitcount) !=3D 0) + rdtgrp->flags =3D RDT_DELETED; + else + rdtgroup_remove(rdtgrp); + } + /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ + update_closid_rmid(cpu_online_mask, &rdtgroup_default); + + kernfs_remove(kn_info); + kernfs_remove(kn_mongrp); + kernfs_remove(kn_mondata); +} + +/** + * mon_get_kn_priv() - Get the mon_data priv data for this event. + * + * The same values are used across the mon_data directories of all control= and + * monitor groups for the same event in the same domain. Keep a list of + * allocated structures and re-use an existing one with the same values for + * @rid, @domid, etc. + * + * @rid: The resource id for the event file being created. + * @domid: The domain id for the event file being created. + * @mevt: The type of event file being created. + * @do_sum: Whether SNC summing monitors are being created. Only set + * when @rid =3D=3D RDT_RESOURCE_L3. 
+ * + * Return: Pointer to mon_data private data of the event, NULL on failure. + */ +static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int do= mid, + struct mon_evt *mevt, + bool do_sum) +{ + struct mon_data *priv; + + lockdep_assert_held(&rdtgroup_mutex); + + list_for_each_entry(priv, &mon_data_kn_priv_list, list) { + if (priv->rid =3D=3D rid && priv->domid =3D=3D domid && + priv->sum =3D=3D do_sum && priv->evt =3D=3D mevt) + return priv; + } + + priv =3D kzalloc_obj(*priv); + if (!priv) + return NULL; + + priv->rid =3D rid; + priv->domid =3D domid; + priv->sum =3D do_sum; + priv->evt =3D mevt; + list_add_tail(&priv->list, &mon_data_kn_priv_list); + + return priv; +} + +/** + * mon_put_kn_priv() - Free all allocated mon_data structures. + * + * Called when resctrl file system is unmounted. + */ +static void mon_put_kn_priv(void) +{ + struct mon_data *priv, *tmp; + + lockdep_assert_held(&rdtgroup_mutex); + + list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) { + list_del(&priv->list); + kfree(priv); + } +} + +static void resctrl_fs_teardown(void) +{ + lockdep_assert_held(&rdtgroup_mutex); + + /* Cleared by rdtgroup_destroy_root() */ + if (!rdtgroup_default.kn) + return; + + rmdir_all_sub(); + rdtgroup_unassign_cntrs(&rdtgroup_default); + mon_put_kn_priv(); + rdt_pseudo_lock_release(); + rdtgroup_default.mode =3D RDT_MODE_SHAREABLE; + closid_exit(); + schemata_list_destroy(); + rdtgroup_destroy_root(); +} + static int rdt_get_tree(struct fs_context *fc) { struct rdt_fs_context *ctx =3D rdt_fc2context(fc); @@ -2991,194 +3179,6 @@ static int rdt_init_fs_context(struct fs_context *f= c) return 0; } =20 -/* - * Move tasks from one to the other group. If @from is NULL, then all tasks - * in the systems are moved unconditionally (used for teardown). - * - * If @mask is not NULL the cpus on which moved tasks are running are set - * in that mask so the update smp function call is restricted to affected - * cpus. 
- */ -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *t= o, - struct cpumask *mask) -{ - struct task_struct *p, *t; - - read_lock(&tasklist_lock); - for_each_process_thread(p, t) { - if (!from || is_closid_match(t, from) || - is_rmid_match(t, from)) { - resctrl_arch_set_closid_rmid(t, to->closid, - to->mon.rmid); - - /* - * Order the closid/rmid stores above before the loads - * in task_curr(). This pairs with the full barrier - * between the rq->curr update and - * resctrl_arch_sched_in() during context switch. - */ - smp_mb(); - - /* - * If the task is on a CPU, set the CPU in the mask. - * The detection is inaccurate as tasks might move or - * schedule before the smp function call takes place. - * In such a case the function call is pointless, but - * there is no other side effect. - */ - if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) - cpumask_set_cpu(task_cpu(t), mask); - } - } - read_unlock(&tasklist_lock); -} - -static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) -{ - struct rdtgroup *sentry, *stmp; - struct list_head *head; - - head =3D &rdtgrp->mon.crdtgrp_list; - list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { - rdtgroup_unassign_cntrs(sentry); - free_rmid(sentry->closid, sentry->mon.rmid); - list_del(&sentry->mon.crdtgrp_list); - - if (atomic_read(&sentry->waitcount) !=3D 0) - sentry->flags =3D RDT_DELETED; - else - rdtgroup_remove(sentry); - } -} - -/* - * Forcibly remove all of subdirectories under root. 
- */ -static void rmdir_all_sub(void) -{ - struct rdtgroup *rdtgrp, *tmp; - - /* Move all tasks to the default resource group */ - rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); - - list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { - /* Free any child rmids */ - free_all_child_rdtgrp(rdtgrp); - - /* Remove each rdtgroup other than root */ - if (rdtgrp =3D=3D &rdtgroup_default) - continue; - - if (rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKSETUP || - rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKED) - rdtgroup_pseudo_lock_remove(rdtgrp); - - /* - * Give any CPUs back to the default group. We cannot copy - * cpu_online_mask because a CPU might have executed the - * offline callback already, but is still marked online. - */ - cpumask_or(&rdtgroup_default.cpu_mask, - &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); - - rdtgroup_unassign_cntrs(rdtgrp); - - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - - kernfs_remove(rdtgrp->kn); - list_del(&rdtgrp->rdtgroup_list); - - if (atomic_read(&rdtgrp->waitcount) !=3D 0) - rdtgrp->flags =3D RDT_DELETED; - else - rdtgroup_remove(rdtgrp); - } - /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ - update_closid_rmid(cpu_online_mask, &rdtgroup_default); - - kernfs_remove(kn_info); - kernfs_remove(kn_mongrp); - kernfs_remove(kn_mondata); -} - -/** - * mon_get_kn_priv() - Get the mon_data priv data for this event. - * - * The same values are used across the mon_data directories of all control= and - * monitor groups for the same event in the same domain. Keep a list of - * allocated structures and re-use an existing one with the same values for - * @rid, @domid, etc. - * - * @rid: The resource id for the event file being created. - * @domid: The domain id for the event file being created. - * @mevt: The type of event file being created. - * @do_sum: Whether SNC summing monitors are being created. Only set - * when @rid =3D=3D RDT_RESOURCE_L3. 
- * - * Return: Pointer to mon_data private data of the event, NULL on failure. - */ -static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int do= mid, - struct mon_evt *mevt, - bool do_sum) -{ - struct mon_data *priv; - - lockdep_assert_held(&rdtgroup_mutex); - - list_for_each_entry(priv, &mon_data_kn_priv_list, list) { - if (priv->rid =3D=3D rid && priv->domid =3D=3D domid && - priv->sum =3D=3D do_sum && priv->evt =3D=3D mevt) - return priv; - } - - priv =3D kzalloc_obj(*priv); - if (!priv) - return NULL; - - priv->rid =3D rid; - priv->domid =3D domid; - priv->sum =3D do_sum; - priv->evt =3D mevt; - list_add_tail(&priv->list, &mon_data_kn_priv_list); - - return priv; -} - -/** - * mon_put_kn_priv() - Free all allocated mon_data structures. - * - * Called when resctrl file system is unmounted. - */ -static void mon_put_kn_priv(void) -{ - struct mon_data *priv, *tmp; - - lockdep_assert_held(&rdtgroup_mutex); - - list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) { - list_del(&priv->list); - kfree(priv); - } -} - -static void resctrl_fs_teardown(void) -{ - lockdep_assert_held(&rdtgroup_mutex); - - /* Cleared by rdtgroup_destroy_root() */ - if (!rdtgroup_default.kn) - return; - - rmdir_all_sub(); - rdtgroup_unassign_cntrs(&rdtgroup_default); - mon_put_kn_priv(); - rdt_pseudo_lock_release(); - rdtgroup_default.mode =3D RDT_MODE_SHAREABLE; - closid_exit(); - schemata_list_destroy(); - rdtgroup_destroy_root(); -} - static void rdt_kill_sb(struct super_block *sb) { struct rdt_resource *r; --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AA1AF3F1ACB for ; Wed, 3 Jun 2026 03:28:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; 
d=subspace.kernel.org; s=arc-20240116; t=1780457282; cv=none; b=pz37sCeAbL6oOmBH75jB/Lz3D9Ph1Dmi52cFCtD8zbGyOo6f9d+Rgvs70CVL2w3erozWHwfwVcWfcRlHRb75+s8aWWl8psl0pFSx//i2QoCnQ0yHJKDGbjpOxclSGGxJyMOfRLZEx6wR8PDj9Gmj4lFxuY2s34jnKr7kGzlfS0E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457282; c=relaxed/simple; bh=C8hY132qG+xjkRnVVTukTcbYF6pljAiJfoDuxTvymqU=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=lwv6i8ZC9hYukQsJRy8O2Xcayy/oldjU4sj7zK5Gk+LceJl+HuQvOkwZFzLY7RCNCG3ppFAAwo/jqQe53Y9Po4wGSPPFWLAVeyscFolb+JDqu3uOoKdDu5mVWzugTF036ufqnjOvHl4xcrdNzA/4LN2Y7HKV5Kq50nKbSTw9irQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Hf0kSrZg; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Hf0kSrZg" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457282; x=1811993282; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=C8hY132qG+xjkRnVVTukTcbYF6pljAiJfoDuxTvymqU=; b=Hf0kSrZgAL+bGmqHmYa5FVaDz04fr/LnGLdz3JeAQTYSWSG86wLSK3HZ 1LJVQDWAaQ8/VJLBEZ5xnJ6G0kEcIUK+0RDF0PUO1XgbPiKk/rsp/3PhB dciH36vYdXWywA+hwepvYe6uhTuZ/jlftZweG1uxiBAKHyOpqxgxRdXt4 Ugw6usOuNRePaA6CpWEdipi1+gHO9hTQZHEJyoZ5ehpQ/eD4soG7149Fa 58LBKBLRHdP07iBNHXVB5S0jKPt3T531semjfE3Xgrr1iuoJm1/qSD8tx VyuORjjMOgIY0YRYNAURFd/16XaitXL1ukuTX6lXN10EA1YiUTfEAmDUH A==; X-CSE-ConnectionGUID: 43jwLUH0QkuzKYJVOXApUQ== X-CSE-MsgGUID: +GhKgDqIT8GKouvyzR9Slw== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; 
a="91938989" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91938989" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:00 -0700 X-CSE-ConnectionGUID: j4VKF9jvTUqPXgHdZo7cQA== X-CSE-MsgGUID: bUkltbcWQwi8Pk5mgLa0BA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110103" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:27:59 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v4 03/10] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Date: Tue, 2 Jun 2026 20:27:31 -0700 Message-ID: X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Tony Luck If mkdir_mondata_all() or a subsequent call in rdt_get_tree() fails, the mon_data structures allocated by mon_get_kn_priv() are leaked. Add mon_put_kn_priv() to the out_mongrp error path to free the mon_data structures. Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing m= on_data_bits") Reported-by: Reinette Chatre Closes: https://lore.kernel.org/lkml/5d38c1fb-8f91-472b-8897-24b2f50c772b@i= ntel.com/ Signed-off-by: Tony Luck Signed-off-by: Reinette Chatre Reviewed-by: Chen Yu Reviewed-by: Ben Horgan Reported-by: Sashiko --- Changes since V2: - Reword changelog. 
Changes since V3: - Add Chenyu's Reviewed-by tag that should have been added in V2. - Add Ben's Reviewed-by tag. - Add Closes: tag. --- fs/resctrl/rdtgroup.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 2b624cf02147..31cfb54a5488 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -3081,6 +3081,7 @@ static int rdt_get_tree(struct fs_context *fc) kernfs_remove(kn_mondata); out_mongrp: if (resctrl_arch_mon_capable()) { + mon_put_kn_priv(); rdtgroup_unassign_cntrs(&rdtgroup_default); kernfs_remove(kn_mongrp); } --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DC7233F39EC for ; Wed, 3 Jun 2026 03:28:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457284; cv=none; b=J9QLx4AyztxoH2BBtM4ViV+FGXZtuaVQ1KtYLH692DUyfWkql4pGQAF/O9BaWNc1zLgGb3IPjRw5QHmdrEYuVUkqBeAWMPzFZn29ArC38gfX9pFDLjq5YtHYRK2myJ+VjFNiTyqsw6/S57gCdYQTf94XTUORp/SWYdeK2Pw1IqA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457284; c=relaxed/simple; bh=uj2HbwpeVZoI2H2eNfrEoI9yD3Oy3gxW4pyNDQE7YT4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ictM4TAXvAKA8Wd1iBt9tL2H8DzEJpfmQWA3qxi4//ASz0mCyuUQ0/WXBFjzoGXpwsZDDFc3TmwtH1muAvjXX58eomaI78UEQbnE7SZg8ZS7cK46S4bszuSVaZAwNGYlcMzfEMKABWrlt9XEuixWubKpjmm87N9lTeF7QFVbV0M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=aIJySfnM; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="aIJySfnM" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457283; x=1811993283; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=uj2HbwpeVZoI2H2eNfrEoI9yD3Oy3gxW4pyNDQE7YT4=; b=aIJySfnMot3sTGfsyCMvQXGY7gx2jZi2/FnjUlo9+wfR29nMvt0RD1dD K1cB1BwpuBLLZK734PDEqDSnDxQNSr4fFfDBPM3gmabNWno/ph04Aw8TG y1rM/iuTKPwFBJ4TNhoT5iD8fSxiPOJSVffJITPf+3dHsWqWPiuoYlnUE HtOy/+BbHL01nKgDx74xdkrFyIDRaiB2kY7F9jy8DN1qdnaVMwa0sihEF XhpbFoSCbmgMh81AWjibQv9Npcj4GoDgi4R5ps/KEjqSSSPrY3X/+1dHz dtxoRs691akp5aomwkKtSBTzLNv+nOUlHrLi6SBntgklQOiyI89P/a+9q Q==; X-CSE-ConnectionGUID: 25cwYGYeQrufZuPXiPeltw== X-CSE-MsgGUID: Zwt4JPDwT9avqRB9NTU4Ag== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91938999" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91938999" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:00 -0700 X-CSE-ConnectionGUID: rt556nYLQ9Wc7DZoPIo8CA== X-CSE-MsgGUID: 9He36uLFT8qGq+TD0vD2tw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110106" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:27:59 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, 
reinette.chatre@intel.com Subject: [PATCH v4 04/10] fs/resctrl: Fix use-after-free during unmount Date: Tue, 2 Jun 2026 20:27:32 -0700 Message-ID: X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Tony Luck During unmount or failure teardown all mon_data structures that contain monitoring event file private data are freed after which kernfs nodes are removed. However, the RDT_DELETED flag is never set for the statically allocated default resource group. A concurrent reader of an event file associated with the default resource group may, after dropping kernfs active protection, block on rdtgroup_mutex while unmount proceeds to free the file private data and destroy the kernfs node without waiting for the reader. When the mutex is released, the reader wakes up, observes that RDT_DELETED is not set for the default group, and dereferences the already-freed file private data. The scenario can be depicted as follows: CPU0 CPU1 /* * Default resource group's * monitoring data accessible via * kernfs file with kernfs_node::priv * pointing to a struct mon_data. * User opens the file for reading. 
*/ rdtgroup_mondata_show() /* arch encounters fatal error */ rdtgroup_kn_lock_live() resctrl_exit() atomic_inc(&rdtgroup_default.waitcount) cpus_read_lock() kernfs_break_active_protection(kn) mutex_lock(&rdtgroup_mutex) cpus_read_lock() resctrl_fs_teardown() mutex_lock(&rdtgroup_mutex) rmdir_all_sub() mon_put_kn_priv() /* Delete all mon_data struc= tures */ rdtgroup_destroy_root() kernfs_destroy_root() rdtgroup_default.kn =3D NULL mutex_unlock(&rdtgroup_mutex) /* * rdtgroup_default.flags is empty so * rdtgroup_kn_lock_live() returns * &rdtgroup_default */ md =3D of->kn->priv; /* md points to freed mon_data */ Set RDT_DELETED for the default group unconditionally since the flag does not lead to the freeing of this statically allocated group. Do not allow a new resctrl mount if there are any waiters on default group of previous mount. A new mount will re-initialize the default group that would appear to waiters from previous mount as though the default group is accessible causing them to access the mon_data structures from the previous mount that have been removed. Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing m= on_data_bits") Reported-by: Sashiko Closes: https://sashiko.dev/#/patchset/20260508182143.14592-1-tony.luck%40i= ntel.com?part=3D2 [1] Signed-off-by: Tony Luck Signed-off-by: Reinette Chatre Reviewed-by: Chen Yu --- Changes since V2: - Rewrite changelog to not describe code as much. - Rework changelog to switch to "Reported-by/Closes". - Merge the duplicate rdtgroup_remove() comment with the function comment. - Fix changelog to not mention that RDT_DELETED flag is set conditionally. - Change "Fixes:" tag to point to commit that introduced dynamically allocated mon_data this bug involves. Changes since V3: - Depict the race. (Chenyu) - Add Chenyu's Reviewed-by tag. - Changelog grammar fixes. 
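Illustration only, not part of the patch: a minimal userspace sketch of
the rule described in the changelog above, assuming a pthread mutex in
place of rdtgroup_mutex and a C11 atomic in place of rdtgroup::waitcount.
All names here (GRP_DELETED, struct group, model_*, demo_stale_reader)
are hypothetical stand-ins and not resctrl interfaces. It shows a reader
that took its reference before unmount finding the deleted flag set once
it obtains the mutex, and a new mount being refused while that reader is
still pending.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

#define GRP_DELETED 0x1		/* hypothetical stand-in for RDT_DELETED */

struct group {
	atomic_int waitcount;	/* readers that may still touch the group */
	int flags;
};

static pthread_mutex_t group_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct group default_group;	/* statically allocated, never freed */

/* Refuse to mount while a reader from a previous mount is pending. */
static int model_mount(void)
{
	int ret = 0;

	pthread_mutex_lock(&group_mutex);
	if (atomic_load(&default_group.waitcount) != 0)
		ret = -EBUSY;
	else
		default_group.flags = 0;
	pthread_mutex_unlock(&group_mutex);
	return ret;
}

/* Mark the default group deleted even though it is never freed. */
static void model_teardown(void)
{
	pthread_mutex_lock(&group_mutex);
	default_group.flags = GRP_DELETED;
	pthread_mutex_unlock(&group_mutex);
}

/* Reader, step 1: take a reference before sleeping on the mutex. */
static void model_get(struct group *g)
{
	atomic_fetch_add(&g->waitcount, 1);
}

/*
 * Reader, step 2: once the mutex is obtained, report whether the group
 * is still live, then unlock and drop the reference. A return value of
 * 0 means the file private data must not be dereferenced.
 */
static int model_reader_finish(struct group *g)
{
	int live;

	pthread_mutex_lock(&group_mutex);
	live = !(g->flags & GRP_DELETED);
	pthread_mutex_unlock(&group_mutex);
	atomic_fetch_sub(&g->waitcount, 1);
	return live;
}

/* Replay the race single-threaded; returns 1 if the model behaves as described. */
static int demo_stale_reader(void)
{
	int busy, live, remount;

	if (model_mount())
		return -1;
	model_get(&default_group);	/* reader is about to block */
	model_teardown();		/* unmount wins the race */
	busy = model_mount();		/* stale reader still pending */
	live = model_reader_finish(&default_group);
	remount = model_mount();	/* no waiters left */

	return busy == -EBUSY && live == 0 && remount == 0;
}
```

The sketch replays the interleaving sequentially instead of using two
threads so that the ordering is deterministic. Only the flag check and
the waitcount test are modelled; kernfs active protection and the
freeing of the mon_data structures are left out.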
--- fs/resctrl/rdtgroup.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 31cfb54a5488..809f0965474c 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -585,14 +585,20 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open= _file *of, * * On resource group creation via a mkdir, an extra kernfs_node reference = is * taken to ensure that the rdtgroup structure remains accessible for the - * rdtgroup_kn_unlock() calls where it is removed. + * rdtgroup_kn_unlock() calls where it is removed. The default group is + * statically allocated: it does not have an extra reference but will have + * RDT_DELETED set on unmount to support safe access to its associated fil= es + * via rdtgroup_kn_lock_live/rdtgroup_kn_unlock(). * - * Drop the extra reference here, then free the rdtgroup structure. + * For all but the default group: drop the extra reference, then free the + * rdtgroup structure. * * Return: void */ static void rdtgroup_remove(struct rdtgroup *rdtgrp) { + if (rdtgrp =3D=3D &rdtgroup_default) + return; kernfs_put(rdtgrp->kn); kfree(rdtgrp); } @@ -2975,6 +2981,7 @@ static void resctrl_fs_teardown(void) mon_put_kn_priv(); rdt_pseudo_lock_release(); rdtgroup_default.mode =3D RDT_MODE_SHAREABLE; + rdtgroup_default.flags =3D RDT_DELETED; closid_exit(); schemata_list_destroy(); rdtgroup_destroy_root(); @@ -3000,6 +3007,12 @@ static int rdt_get_tree(struct fs_context *fc) goto out; } =20 + /* Avoid races from pending operations from a previous mount */ + if (atomic_read(&rdtgroup_default.waitcount) !=3D 0) { + ret =3D -EBUSY; + goto out; + } + ret =3D setup_rmid_lru_list(); if (ret) goto out; @@ -4275,6 +4288,7 @@ static int rdtgroup_setup_root(struct rdt_fs_context = *ctx) =20 ctx->kfc.root =3D rdt_root; rdtgroup_default.kn =3D kernfs_root_to_node(rdt_root); + rdtgroup_default.flags =3D 0; =20 return 0; } --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from 
mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DCCDA3F39EF for ; Wed, 3 Jun 2026 03:28:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457285; cv=none; b=B0buYTrhtC/yHOZYkNXgEgYgwsmnq/ZnZl5y6jsE99iSxqt0aSFI9uUVLg1SzQq0zq79xJacCUeoCm4aK0YQtFVYaAeCb3eL1fEON2D2HXX3ceH30KRbcm6P1plWTvU8oU+/qE949Jg2g1beJrC7bydMxBdzi+6Pg7PR0IYHm4Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457285; c=relaxed/simple; bh=LHCo0MUmMyDzclk6PO+WqAPmteois0/AC1uNvOLDCqs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=GLI4p3V6iA2ljSwFdWfj2CgpZS5BDSgDmzltt1mEP9fVlL03Dy1ZvOgdsoUd3PH9gQUOiyd+uB3CamBrjg1X6dI/XI87dI/rInEtcxC88nxG//wFRqxWqvm2L1gGCrfPvPe5yRzCzb1yZhpgP3Vj+buwXAJuUW5o5I8wtDFjjV0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=iy81/ARA; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="iy81/ARA" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457283; x=1811993283; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=LHCo0MUmMyDzclk6PO+WqAPmteois0/AC1uNvOLDCqs=; b=iy81/ARAlSMTI7jvpUcJhvuJ0/lDedOdWrr6TSr8DdfH6nU8ONIf//VM 
/YcCifWzZy4HWZhXcf7n3JL7c9bKndU6mCs+vwbn5yKxvQybSa8lziDX9 pllcgudhLBODhvEHl5WpOW9q6pq/FbVQiuT3n7ZO3+8gR5VfxouZoNFrL Ip8IFQ+/gEeX37XzPe8WLqRHhIaL0KZz1jjeFN82Vf08ZL2cJKfzNaoVp k0kowyOnrz9ccMV1ng5fHs5kctN2neoTS7fZSkwVhope+ZtRIXq4LPuLW gZXOMrQHhwO2Jjd3XjWjwQQOjc9yEs5LfE0s7YROWKGuIkOK4DpKCBd8e w==; X-CSE-ConnectionGUID: peDmkmpYR9CwF8ptEemJHg== X-CSE-MsgGUID: ftJnmAXtTUiShQxLR6/oXw== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91939009" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91939009" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:00 -0700 X-CSE-ConnectionGUID: Dm+WnOybRc2uEBTPpVFKJg== X-CSE-MsgGUID: LEkV1cLHQcGp9diTW8qpLg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110110" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:00 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v4 05/10] fs/resctrl: Fix deadlock on errors during mount Date: Tue, 2 Jun 2026 20:27:33 -0700 Message-ID: <1184040fb321fb99fde6155a4ab91c654b059b1b.1780456704.git.reinette.chatre@intel.com> X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). 
If superblock setup fails inside kernfs_get_tree(), the VFS calls .kill_sb() (rdt_kill_sb()) on the same thread before kernfs_get_tree() returns. rdt_kill_sb() unconditionally attempts to acquire rdtgroup_mutex and deadlock occurs. Since mount failure resulting from kernfs_get_tree() already calls the resctrl fs unmount handler (rdt_kill_sb()) let both call the same helper to make it clear both paths perform the same cleanup. Call kernfs_get_tree() outside of locks. If kernfs_get_tree() fails and ctx->kfc.new_sb_created is set, then rdt_kill_sb() has already been called and no further cleanup is needed. kernfs_get_tree() may set ctx->kfc.new_sb_created and then fail to obtain an inode for the new kn, causing the rdt_kill_sb() path to run with one few= er reference than required for the root to remain accessible in kernfs_kill_sb= (). Add an extra hold on rdtgroup_default.kn to defend against this scenario and ensure the root can be dereferenced safely from kernfs_kill_sb(). Dropping locks before kernfs_get_tree() creates a window where CPU hotplug callbacks can race with the mount operation. Specifically, an online event observing resctrl_mounted =3D=3D true could concurrently append directories= to the unactivated kernfs tree, allocate mon_data structures, and arm backgrou= nd workers. This concurrency is safe because the mount has not yet returned to the VFS, meaning userspace cannot interact with these transient files. If kernfs_get_tree() subsequently fails, the standard resctrl_unmount() teardo= wn safely manages the concurrent modifications: any dynamically generated kern= fs nodes are removed, and the associated memory is freed. Any background workers spawned by the hotplug event will naturally exit without re-arming when they acquire rdtgroup_mutex and observe resctrl_mounted =3D=3D false. 
Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support") Reported-by: Sashiko Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40i= ntel.com [1] Co-developed-by: Tony Luck Signed-off-by: Tony Luck Signed-off-by: Reinette Chatre Reviewed-by: Ben Horgan Reviewed-by: Chen Yu --- Changes since V2: - Switch to "Reported-by/Closes" in changelog Changes since V3: - Add Ben's Reviewed-by tag. - Rework subject and changelog. - s/root kn/root/ in comment. (Chenyu) - Add Chenyu's Reviewed-by tag. - Changelog grammar fixes. - Add snippet to changelog about potential race with hotplug handlers. --- fs/resctrl/rdtgroup.c | 83 +++++++++++++++++++++++++++++-------------- 1 file changed, 56 insertions(+), 27 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 809f0965474c..0d073d4db734 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -2987,10 +2987,34 @@ static void resctrl_fs_teardown(void) rdtgroup_destroy_root(); } =20 +static void resctrl_unmount(void) +{ + struct rdt_resource *r; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_disable_ctx(); + + /* Put everything back to default values. 
*/ + for_each_alloc_capable_rdt_resource(r) + resctrl_arch_reset_all_ctrls(r); + + resctrl_fs_teardown(); + if (resctrl_arch_alloc_capable()) + resctrl_arch_disable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_disable_mon(); + resctrl_mounted =3D false; + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + static int rdt_get_tree(struct fs_context *fc) { struct rdt_fs_context *ctx =3D rdt_fc2context(fc); unsigned long flags =3D RFTYPE_CTRL_BASE; + struct kernfs_node *rdt_root_kn; struct rdt_l3_mon_domain *dom; struct rdt_resource *r; int ret; @@ -3066,10 +3090,6 @@ static int rdt_get_tree(struct fs_context *fc) if (ret) goto out_mondata; =20 - ret =3D kernfs_get_tree(fc); - if (ret < 0) - goto out_psl; - if (resctrl_arch_alloc_capable()) resctrl_arch_enable_alloc(); if (resctrl_arch_mon_capable()) @@ -3085,10 +3105,38 @@ static int rdt_get_tree(struct fs_context *fc) RESCTRL_PICK_ANY_CPU); } =20 - goto out; + /* + * Ensure root remains accessible after mutex is unlocked so that + * kernfs_kill_sb() can run safely if called by kernfs_get_tree()'s + * failure path after creating a superblock but before taking reference + * on root kn (for example, if unable to get inode for root kn). + */ + kernfs_get(rdtgroup_default.kn); + + /* + * Make backup of the current root kn being created to be used in + * kernfs_put(). The additional reference taken above will prevent the + * kn from being freed before kernfs_kill_sb() can run but + * rdtgroup_default.kn may be set to NULL via rdtgroup_destroy_root() + * and its backing root (rdt_root) could be overwritten before + * kernfs_put() can run. + */ + rdt_root_kn =3D rdtgroup_default.kn; + + rdt_last_cmd_clear(); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + ret =3D kernfs_get_tree(fc); + /* + * resctrl can only be mounted once, new superblock only expected + * to be created once. 
+ */ + if (!ctx->kfc.new_sb_created) + resctrl_unmount(); + kernfs_put(rdt_root_kn); + return ret; =20 -out_psl: - rdt_pseudo_lock_release(); out_mondata: if (resctrl_arch_mon_capable()) kernfs_remove(kn_mondata); @@ -3108,7 +3156,6 @@ static int rdt_get_tree(struct fs_context *fc) out_root: rdtgroup_destroy_root(); out: - rdt_last_cmd_clear(); mutex_unlock(&rdtgroup_mutex); cpus_read_unlock(); return ret; @@ -3195,26 +3242,8 @@ static int rdt_init_fs_context(struct fs_context *fc) =20 static void rdt_kill_sb(struct super_block *sb) { - struct rdt_resource *r; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_disable_ctx(); - - /* Put everything back to default values. */ - for_each_alloc_capable_rdt_resource(r) - resctrl_arch_reset_all_ctrls(r); - - resctrl_fs_teardown(); - if (resctrl_arch_alloc_capable()) - resctrl_arch_disable_alloc(); - if (resctrl_arch_mon_capable()) - resctrl_arch_disable_mon(); - resctrl_mounted =3D false; + resctrl_unmount(); kernfs_kill_sb(sb); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); } =20 static struct file_system_type rdt_fs_type =3D { --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 08AF03F411A for ; Wed, 3 Jun 2026 03:28:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457284; cv=none; b=d+hraj5IvGDapi0+7RfylvFKi9AsnodGO33QUlR9syDZGA4TuLWd6ff5+sX8IPZV9ajLh9mAL1bUg9LZ3+qjT+aBANcCoP+YCWJsCd4WsdsG/Q3VK/qBXYuB+CYsSnQOXNoHmL6cUC5iMrkdWz6sndiTtZ+V3q+5f/cBaf5c+VE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457284; c=relaxed/simple; bh=bQGZ3Gi1hecEzRonc8KkP4mUrFcfRliOT9dqTRLLRN4=; 
h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=kJHamgwliDB+7POn1TIdgOF+qpml3Z7GRdqsy9L467Gu8vUk32Kyglegww0wE8rAZ5uh2WrLOJiujjUkbUNq6O1oh6qWImi7jv4H1s4LNkENYHSaW46Im8Nzdm1FLt/3utqrlG9LDsZbsYjes9aBzrJl3oa/kP0+wa2AoEx+QU4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=e1hdAnrT; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="e1hdAnrT" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457283; x=1811993283; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=bQGZ3Gi1hecEzRonc8KkP4mUrFcfRliOT9dqTRLLRN4=; b=e1hdAnrTjtQEQ4Nemj5kpeNULYTnNHH2RIVftI0KR7LDdCXeT5YZwZ60 AC0ALDs9kaM0EbZ49Pn3rOHZKj7uJvn5X+l0uEmCy+eLuazljBUiLOw6Z sQrIsghn1tI4z8vbKxIEJEUATb9/JrB/pi9ISqV7JIyV+zNRzPQqSw5MN 7FzQMlWoNOGlmy/809/exonzGS5igOwBNIebwQNe0QpFeydAAsVKuilte PQvIrHMQnE0KOwbjgZEAKK5WJLpHu05jxaUGzh7FsFm+VuKVnmepa75gj uU0FiyZF96Z7diZLTw8QKzfDGxNalR8YqPgZ1BZ4QsUdHdDxfYxsiRAu+ Q==; X-CSE-ConnectionGUID: mTd5iHFTRv2ma11aLTQ+2w== X-CSE-MsgGUID: 56i2y7V4QS2zClRC427NpQ== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91939019" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91939019" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:01 -0700 X-CSE-ConnectionGUID: 4atK+GKBRHKn7zvQjHHrYw== X-CSE-MsgGUID: s45oagcsQGuyrH3D8BLmQg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; 
d="scan'208";a="241110114" Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:00 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v4 06/10] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put() Date: Tue, 2 Jun 2026 20:27:34 -0700 Message-ID: X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" A struct rdtgroup is reference counted via rdtgroup::waitcount. Callers that need the structure to remain valid across a sleep (while waiting on acquiring rdtgroup_mutex) take a reference with rdtgroup_kn_get() and release it with rdtgroup_kn_put(). The release path is intended to serve as the fallback freer: if the count drops to zero and the group has already been marked RDT_DELETED, rdtgroup_kn_put() frees the structure. The bulk teardown paths free_all_child_rdtgrp() and rmdir_all_sub() resulting from a resctrl directory remove or resctrl fs unmount act as the primary freer: they hold rdtgroup_mutex and free each rdtgroup whose waitcount is zero, otherwise they set RDT_DELETED and leave the freeing to the last waiter. These two freers race. rdtgroup_kn_put() commits waitcount =3D=3D 0 with atomic_dec_and_test() outside rdtgroup_mutex, then reads rdtgroup::flags. 
Between those two operations a concurrent caller of free_all_child_rdtgrp() or rmdir_all_sub() (which holds the mutex) can observe waitcount =3D=3D 0 v= ia atomic_read(), call rdtgroup_remove(), and kfree() the structure. The subsequent read of rdtgroup::flags in rdtgroup_kn_put() is then a use-after-free, and the structure may even be freed twice if the freed memory happens to satisfy the RDT_DELETED flag check. Replace the bare atomic_dec_and_test() with atomic_dec_and_mutex_lock() so that the decrement-to-zero takes rdtgroup_mutex before the count becomes globally visible. The inspection of rdtgroup::flags then runs under the same mutex held by the bulk freers, making the two paths mutually exclusive. The common case where the count does not reach zero remains lock-free. Defer kernfs_unbreak_active_protection() until after the mutex is dropped since kernfs active protections functionally wrap rdtgroup_mutex. Remove resource group, which in turn drops its kernfs reference, after kernfs protection is restored. Fixes: b8511ccc75c0 ("x86/resctrl: Fix use-after-free when deleting resourc= e groups") Reported-by: Sashiko Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40i= ntel.com?part=3D1 Assisted-by: GitHub_Copilot:gemini-3.1-pro Signed-off-by: Reinette Chatre Reviewed-by: Ben Horgan Reviewed-by: Tony Luck --- Changes since V2: - New patch Changes since V3: - Add Ben's Reviewed-by tag. - Add Tony's Reviewed-by tag. 
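Illustration only, not part of the patch: a minimal userspace sketch of
the decrement-then-lock pattern described in the changelog above,
assuming C11 atomics and a pthread mutex in place of atomic_t and
rdtgroup_mutex. dec_and_mutex_lock() approximates the behaviour of the
kernel's atomic_dec_and_mutex_lock(); it and the other names
(GRP_DELETED, struct group, group_put, demo_last_put) are hypothetical
and do not appear in resctrl.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define GRP_DELETED 0x1		/* hypothetical stand-in for RDT_DELETED */

struct group {
	atomic_int waitcount;
	int flags;
};

static pthread_mutex_t group_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Decrement *cnt; return true with @lock held only if it reached zero. */
static bool dec_and_mutex_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
	int old = atomic_load(cnt);

	/* Fast path: decrement lock-free while the result stays above zero. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
	}

	/* The count may reach zero: serialize against the bulk freer. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) != 1) {
		pthread_mutex_unlock(lock);
		return false;
	}
	return true;
}

/* Returns true when the caller was the last waiter and must free @g. */
static bool group_put(struct group *g)
{
	bool needs_free;

	if (!dec_and_mutex_lock(&g->waitcount, &group_mutex))
		return false;

	/* Flags are read under the same mutex the bulk freer holds. */
	needs_free = g->flags & GRP_DELETED;
	pthread_mutex_unlock(&group_mutex);
	return needs_free;
}

/* Two references on a deleted group: only the last put reports a free. */
static int demo_last_put(void)
{
	struct group g = { .flags = GRP_DELETED };
	bool first, second;

	atomic_store(&g.waitcount, 2);
	first = group_put(&g);		/* 2 -> 1, lock-free */
	second = group_put(&g);		/* 1 -> 0 under the mutex */
	return !first && second;
}
```

Because the decrement to zero happens with the mutex held, a bulk freer
that holds the same mutex and tests the count cannot see zero between
the decrement and the flags read. The actual free is left to the caller
here, after the mutex is dropped, mirroring the ordering in the patch.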
--- fs/resctrl/rdtgroup.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 0d073d4db734..c04424c081a4 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -2606,15 +2606,24 @@ static void rdtgroup_kn_get(struct rdtgroup *rdtgrp= , struct kernfs_node *kn) =20 static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *k= n) { - if (atomic_dec_and_test(&rdtgrp->waitcount) && - (rdtgrp->flags & RDT_DELETED)) { + bool needs_free; + + if (!atomic_dec_and_mutex_lock(&rdtgrp->waitcount, &rdtgroup_mutex)) { + kernfs_unbreak_active_protection(kn); + return; + } + + needs_free =3D rdtgrp->flags & RDT_DELETED; + + mutex_unlock(&rdtgroup_mutex); + + kernfs_unbreak_active_protection(kn); + + if (needs_free) { if (rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKSETUP || rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKED) rdtgroup_pseudo_lock_remove(rdtgrp); - kernfs_unbreak_active_protection(kn); rdtgroup_remove(rdtgrp); - } else { - kernfs_unbreak_active_protection(kn); } } =20 --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 616D03F54AC for ; Wed, 3 Jun 2026 03:28:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457285; cv=none; b=WiBsEhFZH3kNFfeWAIefjqHkOwDuxeupq/gKpCpsK0ow95JJZqZ/cr6bkaHBc68zpPbrmw4+cS20IDbF4Nx1hU56frvn3cVzHzRzTPiXskuKoYtCA0cXFLtocDJc73CkTyDCVk6u2n2RRvLJeEF4QA+F+8EW7//9lhUwVgVcmpY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457285; c=relaxed/simple; bh=+bCwbGXodPOyojvefV3AVotYYV06mZ2z2YAeKsYo9NE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: 
MIME-Version; b=TyDa1+E1nxkbzwd1JKuDHhenvUuMJ+seT2qq7zhFQiyumB6oBY6Snd0C4uiEOFKMxEr6BUtpgqQQyFPLGojoX6R4oHy7R184gP6NGOg2IUcJ593nVDPpwTkbWL0ZYVeooNJxbXuIOEK2NKEYTtwqgcbhLHLRLVrLp1KbnZDAzfE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=FJ4UH0z9; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="FJ4UH0z9" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457284; x=1811993284; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=+bCwbGXodPOyojvefV3AVotYYV06mZ2z2YAeKsYo9NE=; b=FJ4UH0z9xfJTlt3sya+7s1brfk/P1V3zFP55mU1F03s2CjlpvbgOV/X+ 4UCSGpA6BmXZTk/R5kIRodazoWshOhkUrpZrgVF2J5mubCNZvNJMENatv 0aHczjx6Um+8HC7EeF5VgALnG4qbLzR/Fz8zu6loUM6SjXblGmgBSmt0t 2c8/FlOHpsAtdlXwCK1k4IAy+S+2u7SYu82e64G0jibzw8eJmP2GD9RVX X8k0j/0rKZOe5PbUa9TRSnBmOEX7IyMp3WpGiPXBuMwgG1I6RfykDKGpl KHL1Djz68O7neX8jZeV4Mh4fVCdkGTmQwonI0+zhtIvNbEyJUhDMMTtCp w==; X-CSE-ConnectionGUID: 6ulp+qe1ThWjqmrQd6msPw== X-CSE-MsgGUID: ij9zDYQqSkiv3wTeoiWBHQ== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91939029" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91939029" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:01 -0700 X-CSE-ConnectionGUID: I7K+eS4KQmKbRtrIiO8SgA== X-CSE-MsgGUID: qNzA9OWNSiq/ySxEloFnkg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110119" Received: from rchatre-desk1.jf.intel.com 
([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:00 -0700
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v4 07/10] fs/resctrl: Fix double-add of pseudo-locked region's RMID to free list
Date: Tue, 2 Jun 2026 20:27:35 -0700
Message-ID: <2eda4d2873e6607b3c3f19bb5cac1d9fa8e2c04d.1780456704.git.reinette.chatre@intel.com>
X-Mailer: git-send-email 2.50.1
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

A pseudo-locked group's RMID is freed when it is created. On unmount rmdir_all_sub() unconditionally frees the RMIDs of all groups, resulting in a double-free of the pseudo-locked group's RMID. As a consequence, the original free adds the pseudo-locked group's RMID to the rmid_free_lru linked list and the second free then attempts to add the same RMID entry to rmid_free_lru again.

Do not double-free a pseudo-locked group's RMID.

Fixes: e0bdfe8e36f3 ("x86/intel_rdt: Support creation/removal of pseudo-locked region")
Signed-off-by: Reinette Chatre
Reported-by: Sashiko
---
Changes since V2:
- New patch

Changes since V3:
- Extract the double-add/double-free fix from all the other pseudo-locking fixes that will be deferred. This issue was uncovered during testing of the race fixes so drop all the Reported-by and Closes tags.
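Illustration only, not part of this patch: a small userspace model of the double-add described in the changelog above. It assumes a boolean array in place of the rmid_free_lru list, and all model_* names are hypothetical. The point it tries to show is that teardown skips the RMID free for a group whose RMID was already returned during pseudo-lock setup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; names are illustrative and not the kernel's. */
enum model_mode { MODEL_SHAREABLE, MODEL_PSEUDO_LOCKED };

#define MODEL_NUM_RMID 4

/* true while an RMID sits on the modelled free list. */
static bool model_on_free_list[MODEL_NUM_RMID];

struct model_group {
	enum model_mode mode;
	unsigned int rmid;
};

/* Returns false if the RMID is already on the free list (a double add). */
static bool model_free_rmid(unsigned int rmid)
{
	assert(rmid < MODEL_NUM_RMID);

	if (model_on_free_list[rmid])
		return false;

	model_on_free_list[rmid] = true;
	return true;
}

/* Pseudo-lock setup hands the RMID back early, as the changelog describes. */
static void model_pseudo_lock_setup(struct model_group *g)
{
	g->mode = MODEL_PSEUDO_LOCKED;
	model_free_rmid(g->rmid);
}

/* Teardown frees the RMID only for groups that still own one. */
static bool model_teardown(const struct model_group *g)
{
	if (g->mode == MODEL_PSEUDO_LOCKED)
		return true;	/* RMID was already returned during setup */

	return model_free_rmid(g->rmid);
}
```

An unconditional model_free_rmid() in model_teardown() would return false for the pseudo-locked group, which corresponds to the second list add that the patch avoids.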
--- fs/resctrl/rdtgroup.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index c04424c081a4..77c9d22017bc 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -2885,10 +2885,6 @@ static void rmdir_all_sub(void) if (rdtgrp =3D=3D &rdtgroup_default) continue; =20 - if (rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKSETUP || - rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKED) - rdtgroup_pseudo_lock_remove(rdtgrp); - /* * Give any CPUs back to the default group. We cannot copy * cpu_online_mask because a CPU might have executed the @@ -2899,7 +2895,13 @@ static void rmdir_all_sub(void) =20 rdtgroup_unassign_cntrs(rdtgrp); =20 - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + if (rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode =3D=3D RDT_MODE_PSEUDO_LOCKED) { + rdtgroup_pseudo_lock_remove(rdtgrp); + } else { + /* Pseudo-locked group's RMID is freed during setup. */ + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + } =20 kernfs_remove(rdtgrp->kn); list_del(&rdtgrp->rdtgroup_list); --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B60583F58E2 for ; Wed, 3 Jun 2026 03:28:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457287; cv=none; b=d7uUQXd37VeG7rsFg3wE+G6IrKXAZCJ5ipUweHrYt1/4nf/uP4jAMnsAXaze2mV9lXkEk50QBCE/N1Jw4Su8d+z2c8KZKIFzuqCxgK5e8zWEgolafUK/bxqzv/Ie7sb/ci87Oraep//LO5cZM6tDqSenldNUL/jc+tc2ZPP1wtY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457287; c=relaxed/simple; bh=nCfOM0nWNwA84mKyfiVdAhpO8ALeWZENqh+IUSVvkDo=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: 
MIME-Version; b=bV2SwVmndcPMNfosZndKJRFYR7015C0Kj8nK1anv7dp/SzCWaX6z8zNk70USkjdt9znWpe8nqYennjR1RQuQ2uGW58oEu/GHUi5SLRhdcrf0I1jCEejbVXgLWi0BvDmG29ry5gOUolTTk71qKAUKHKdwKgM3YzM8mtkhYlrbT6A= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=WpqbrsAp; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="WpqbrsAp" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457285; x=1811993285; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=nCfOM0nWNwA84mKyfiVdAhpO8ALeWZENqh+IUSVvkDo=; b=WpqbrsAp+jS/eC2c983qgZF02XTrknThReJroiCyv/6ocl+opDkXomWN IjdvRpp5P/Asn7VlA//eohdKE3UZvfbgYAhQum+OweQb26knG9A9k9KzJ DbkeLuwu/KVGhaI/kGCns4GA8vwJmuhjYk3CryN4JlkeU9kGdenvOb5+0 TI2M2NjgBzH2O+cCNjdvSeloU1FngwU2GvQPapO88d55qcQDPpa+X+jzz /dKk0Zl8q5NJfcU4e6vFtBAiq79vAqOCdBh+548yvJMf6OMs2aEkApIhH FLPpNYG5CQDObobRGEBxHCpyRVXIY82QJJEQferb6VpxDbIhZ+VHR+4GC A==; X-CSE-ConnectionGUID: hd2m857+QEm/pmbHbUirqQ== X-CSE-MsgGUID: zulWo/9MRdWur0iNg+ET3A== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91939040" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91939040" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:01 -0700 X-CSE-ConnectionGUID: x8+YdgjUTzeyWRYUmtCeXQ== X-CSE-MsgGUID: 3LcRbBpzTYGd11GeF13gUw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110141" Received: from rchatre-desk1.jf.intel.com 
([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:01 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v4 08/10] fs/resctrl: Prevent deadlock and use-after-free in info file handlers Date: Tue, 2 Jun 2026 20:27:36 -0700 Message-ID: X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" resctrl provides files under the info/ directory to expose global configuration and capabilities to userspace. These files are instantiated statically during filesystem mount and expose data associated with internal schema structures via kernfs private pointers. A potential deadlock exists between userspace readers of these info files and the unmount filesystem teardown process. Reading an info file invokes kernfs which acquires an active reference, after which the handler typically attempts to acquire the rdtgroup_mutex. Concurrently, unmounting the filesystem holds the rdtgroup_mutex and then attempts to recursively remove the info kernfs nodes involving kernfs_drain() which blocks until all active references are released. Another problem exists where info files might be accessed from an outdated mount if the filesystem is unmounted and remounted during a reader's execution, leading to a use-after-free when reading the now-deleted private schema data. Introduce info_kn_lock() and info_kn_unlock() helpers to coordinate locking across all info handlers. 
These helpers mirror similar logic used by resource group handlers by deliberately breaking the kernfs active protection before attempting to acquire the rdtgroup_mutex, preventing the deadlock.

To guard against the vulnerability from rapid mount cycling, info_kn_lock() securely walks the parent lineage of the kernfs node under an RCU section to confirm the node belongs to the globally active root before permitting the operation to proceed.

Convert all info file handlers to use this helper and only de-reference the schema after it is determined safe to do so. Make no attempt to output an error message to last_cmd_status on failure since failure implies there is no filesystem with which to display the error to user space.

Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=3
Assisted-by: GitHub_Copilot:gemini-3.1-pro
Signed-off-by: Reinette Chatre
Reviewed-by: Tony Luck
---
Changes since V2:
- New patch

Changes since V3:
- Add Tony's Reviewed-by tag.
- Changelog grammar fixes.
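Illustration only, not part of this patch: a userspace sketch of the parent-lineage check described in the changelog above. It assumes a plain parent pointer and a global root pointer, and it deliberately leaves out RCU, the mutex and the kernfs active protection. struct model_node, model_active_root and model_is_active_node() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernfs node; only the parent link is modelled. */
struct model_node {
	struct model_node *parent;
};

/* Root of the currently mounted tree, NULL when nothing is mounted. */
static struct model_node *model_active_root;

/* Walk from node towards its root; true only if the active root is reached. */
static bool model_is_active_node(const struct model_node *node)
{
	const struct model_node *p;

	if (!model_active_root)
		return false;

	for (p = node; p; p = p->parent) {
		if (p == model_active_root)
			return true;
	}

	return false;
}
```

A node left over from an earlier mount still has an intact chain of parents, but that chain ends at the old root, so the walk fails once a new root has been installed.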
--- fs/resctrl/ctrlmondata.c | 38 ++++---- fs/resctrl/internal.h | 3 +- fs/resctrl/monitor.c | 48 +++++----- fs/resctrl/rdtgroup.c | 192 ++++++++++++++++++++++++++++++++------- 4 files changed, 203 insertions(+), 78 deletions(-) diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index f33712c17d38..2b29fb5a8702 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -771,10 +771,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *a= rg) int resctrl_io_alloc_show(struct kernfs_open_file *of, struct seq_file *se= q, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 + r =3D s->res; if (r->cache.io_alloc_capable) { if (resctrl_arch_get_io_alloc_enabled(r)) seq_puts(seq, "enabled\n"); @@ -784,7 +786,7 @@ int resctrl_io_alloc_show(struct kernfs_open_file *of, = struct seq_file *seq, voi seq_puts(seq, "not supported\n"); } =20 - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return 0; } @@ -849,7 +851,7 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file = *of, char *buf, size_t nbytes, loff_t off) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; char const *grp_name; u32 io_alloc_closid; bool enable; @@ -859,9 +861,10 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file= *of, char *buf, if (ret) return ret; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 + r =3D s->res; rdt_last_cmd_clear(); =20 if (!r->cache.io_alloc_capable) { @@ -909,8 +912,7 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file = *of, char *buf, } =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -918,14 +920,15 @@ ssize_t resctrl_io_alloc_write(struct 
kernfs_open_fil= e *of, char *buf, int resctrl_io_alloc_cbm_show(struct kernfs_open_file *of, struct seq_file= *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; int ret =3D 0; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 + r =3D s->res; if (!r->cache.io_alloc_capable) { rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name); ret =3D -ENODEV; @@ -947,8 +950,7 @@ int resctrl_io_alloc_cbm_show(struct kernfs_open_file *= of, struct seq_file *seq, show_doms(seq, s, NULL, resctrl_io_alloc_closid(r)); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); return ret; } =20 @@ -1015,7 +1017,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open= _file *of, char *buf, size_t nbytes, loff_t off) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; u32 io_alloc_closid; int ret =3D 0; =20 @@ -1025,10 +1027,11 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_op= en_file *of, char *buf, =20 buf[nbytes - 1] =3D '\0'; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; rdt_last_cmd_clear(); =20 + r =3D s->res; if (!r->cache.io_alloc_capable) { rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name); ret =3D -ENODEV; @@ -1053,8 +1056,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open= _file *of, char *buf, out_clear_configs: rdt_staged_configs_clear(); out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index 48af75b9dc85..e62a277dee85 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -335,8 +335,9 @@ __printf(1, 2) void rdt_last_cmd_printf(const char *fmt, ...); =20 struct 
rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); - void rdtgroup_kn_unlock(struct kernfs_node *kn); +bool info_kn_lock(struct kernfs_node *kn); +void info_kn_unlock(struct kernfs_node *kn); =20 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); =20 diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index d2aa7d045056..f7ab9a1bc726 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -1057,7 +1057,8 @@ int event_filter_show(struct kernfs_open_file *of, st= ruct seq_file *seq, void *v bool sep =3D false; int ret =3D 0, i; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; rdt_last_cmd_clear(); =20 r =3D resctrl_arch_get_resource(mevt->rid); @@ -1078,7 +1079,7 @@ int event_filter_show(struct kernfs_open_file *of, st= ruct seq_file *seq, void *v seq_putc(seq, '\n'); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return ret; } @@ -1089,7 +1090,8 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_op= en_file *of, struct seq_fil struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); int ret =3D 0; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; rdt_last_cmd_clear(); =20 if (!resctrl_arch_mbm_cntr_assign_enabled(r)) { @@ -1101,7 +1103,7 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_op= en_file *of, struct seq_fil seq_printf(s, "%u\n", r->mon.mbm_assign_on_mkdir); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return ret; } @@ -1117,7 +1119,8 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kern= fs_open_file *of, char *buf if (ret) return ret; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; rdt_last_cmd_clear(); =20 if (!resctrl_arch_mbm_cntr_assign_enabled(r)) { @@ -1129,7 +1132,7 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kern= fs_open_file *of, char *buf r->mon.mbm_assign_on_mkdir =3D value; =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); + 
info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1419,8 +1422,8 @@ ssize_t event_filter_write(struct kernfs_open_file *o= f, char *buf, size_t nbytes =20 buf[nbytes - 1] =3D '\0'; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1443,8 +1446,7 @@ ssize_t event_filter_write(struct kernfs_open_file *o= f, char *buf, size_t nbytes } =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1455,7 +1457,8 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_f= ile *of, struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); bool enabled; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; enabled =3D resctrl_arch_mbm_cntr_assign_enabled(r); =20 if (r->mon.mbm_cntr_assignable) { @@ -1474,7 +1477,7 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_f= ile *of, seq_puts(s, "[default]\n"); } =20 - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return 0; } @@ -1493,8 +1496,8 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_o= pen_file *of, char *buf, =20 buf[nbytes - 1] =3D '\0'; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1552,8 +1555,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_o= pen_file *of, char *buf, } =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1565,8 +1567,8 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_fil= e *of, struct rdt_l3_mon_domain *dom; bool sep =3D false; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_h= eld()) { if (sep) @@ -1577,8 +1579,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_fil= e *of, } seq_putc(s, 
'\n'); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); return 0; } =20 @@ -1591,8 +1592,8 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_op= en_file *of, u32 cntrs, i; int ret =3D 0; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1618,8 +1619,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_op= en_file *of, seq_putc(s, '\n'); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret; } diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 77c9d22017bc..9f998e394911 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -977,13 +977,14 @@ static int rdt_last_cmd_status_show(struct kernfs_ope= n_file *of, { int len; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; len =3D seq_buf_used(&last_cmd_status); if (len) seq_printf(seq, "%.*s", len, last_cmd_status_buf); else seq_puts(seq, "ok\n"); - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); return 0; } =20 @@ -1002,7 +1003,11 @@ static int rdt_num_closids_show(struct kernfs_open_f= ile *of, { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; seq_printf(seq, "%u\n", s->num_closid); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1010,9 +1015,14 @@ static int rdt_default_ctrl_show(struct kernfs_open_= file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%x\n", resctrl_get_default_ctrl(r)); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1020,9 +1030,15 @@ static int rdt_min_cbm_bits_show(struct kernfs_open_= file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct 
rdt_resource *r =3D s->res; + struct rdt_resource *r; + =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->cache.min_cbm_bits); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1030,9 +1046,14 @@ static int rdt_shareable_bits_show(struct kernfs_ope= n_file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%x\n", r->cache.shareable_bits); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1060,15 +1081,16 @@ static int rdt_bit_usage_show(struct kernfs_open_fi= le *of, */ unsigned long sw_shareable =3D 0, hw_shareable =3D 0; unsigned long exclusive =3D 0, pseudo_locked =3D 0; - struct rdt_resource *r =3D s->res; struct rdt_ctrl_domain *dom; int i, hwb, swb, excl, psl; + struct rdt_resource *r; enum rdtgrp_mode mode; bool sep =3D false; u32 ctrl_val; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; list_for_each_entry_rcu(dom, &r->ctrl_domains, hdr.list, lockdep_is_cpus_= held()) { if (sep) seq_putc(seq, ';'); @@ -1144,8 +1166,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file= *of, sep =3D true; } seq_putc(seq, '\n'); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); return 0; } =20 @@ -1153,9 +1174,14 @@ static int rdt_min_bw_show(struct kernfs_open_file *= of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->membw.min_bw); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1164,8 +1190,12 @@ static int rdt_num_rmids_show(struct kernfs_open_fil= e *of, { struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + 
return -ENOENT; seq_printf(seq, "%u\n", r->mon.num_rmid); =20 + info_kn_unlock(of->kn); + return 0; } =20 @@ -1175,6 +1205,8 @@ static int rdt_mon_features_show(struct kernfs_open_f= ile *of, struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); struct mon_evt *mevt; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; for_each_mon_event(mevt) { if (mevt->rid !=3D r->rid || !mevt->enabled) continue; @@ -1184,6 +1216,8 @@ static int rdt_mon_features_show(struct kernfs_open_f= ile *of, seq_printf(seq, "%s_config\n", mevt->name); } =20 + info_kn_unlock(of->kn); + return 0; } =20 @@ -1191,9 +1225,14 @@ static int rdt_bw_gran_show(struct kernfs_open_file = *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->membw.bw_gran); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1201,16 +1240,24 @@ static int rdt_delay_linear_show(struct kernfs_open= _file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->membw.delay_linear); + info_kn_unlock(of->kn); + return 0; } =20 static int max_threshold_occ_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { + if (!info_kn_lock(of->kn)) + return -ENOENT; seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold); + info_kn_unlock(of->kn); =20 return 0; } @@ -1219,22 +1266,28 @@ static int rdt_thread_throttle_mode_show(struct ker= nfs_open_file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; + + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 + r =3D s->res; switch (r->membw.throttle_mode) { case THREAD_THROTTLE_PER_THREAD: 
seq_puts(seq, "per-thread\n"); - return 0; + break; case THREAD_THROTTLE_MAX: seq_puts(seq, "max\n"); - return 0; + break; case THREAD_THROTTLE_UNDEFINED: seq_puts(seq, "undefined\n"); - return 0; + break; + default: + WARN_ON_ONCE(1); + break; } =20 - WARN_ON_ONCE(1); - + info_kn_unlock(of->kn); return 0; } =20 @@ -1248,12 +1301,20 @@ static ssize_t max_threshold_occ_write(struct kernf= s_open_file *of, if (ret) return ret; =20 - if (bytes > resctrl_rmid_realloc_limit) - return -EINVAL; + if (!info_kn_lock(of->kn)) + return -ENOENT; + + if (bytes > resctrl_rmid_realloc_limit) { + ret =3D -EINVAL; + goto out_unlock; + } =20 resctrl_rmid_realloc_threshold =3D resctrl_arch_round_mon_val(bytes); =20 - return nbytes; +out_unlock: + info_kn_unlock(of->kn); + + return ret ?: nbytes; } =20 /* @@ -1293,10 +1354,15 @@ static int rdt_has_sparse_bitmasks_show(struct kern= fs_open_file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks); =20 + info_kn_unlock(of->kn); + return 0; } =20 @@ -1652,8 +1718,8 @@ static int mbm_config_show(struct seq_file *s, struct= rdt_resource *r, u32 evtid struct rdt_l3_mon_domain *dom; bool sep =3D false; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + lockdep_assert_cpus_held(); + lockdep_assert_held(&rdtgroup_mutex); =20 list_for_each_entry_rcu(dom, &r->mon_domains, hdr.list, lockdep_is_cpus_h= eld()) { if (sep) @@ -1670,8 +1736,6 @@ static int mbm_config_show(struct seq_file *s, struct= rdt_resource *r, u32 evtid } seq_puts(s, "\n"); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); =20 return 0; } @@ -1681,8 +1745,12 @@ static int mbm_total_bytes_config_show(struct kernfs= _open_file *of, { struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + 
mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); =20 + info_kn_unlock(of->kn); return 0; } =20 @@ -1691,8 +1759,12 @@ static int mbm_local_bytes_config_show(struct kernfs= _open_file *of, { struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); =20 + info_kn_unlock(of->kn); return 0; } =20 @@ -1790,8 +1862,8 @@ static ssize_t mbm_total_bytes_config_write(struct ke= rnfs_open_file *of, if (nbytes =3D=3D 0 || buf[nbytes - 1] !=3D '\n') return -EINVAL; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1799,8 +1871,7 @@ static ssize_t mbm_total_bytes_config_write(struct ke= rnfs_open_file *of, =20 ret =3D mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1816,8 +1887,8 @@ static ssize_t mbm_local_bytes_config_write(struct ke= rnfs_open_file *of, if (nbytes =3D=3D 0 || buf[nbytes - 1] !=3D '\n') return -EINVAL; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1825,8 +1896,7 @@ static ssize_t mbm_local_bytes_config_write(struct ke= rnfs_open_file *of, =20 ret =3D mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -2659,6 +2729,58 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn) rdtgroup_kn_put(rdtgrp, kn); } =20 +/* + * Accessing the kn after breaking active protection is safe since the open + * of resctrl file holds a kernfs base reference (different from active + * protection) on the kn ensuring that it remains accessible even if it was + * unlinked. 
Each kn in turn holds base reference to parent so the kn's + * genealogy remains in memory until all base references dropped. + */ +static bool is_active_resctrl_node(struct kernfs_node *kn) +{ + struct kernfs_node *p; + bool match =3D false; + + guard(rcu)(); + p =3D kn; + while (p) { + if (p =3D=3D rdtgroup_default.kn) { + match =3D true; + break; + } + p =3D rcu_dereference(p->__parent); + } + + return match; +} + +bool info_kn_lock(struct kernfs_node *kn) +{ + kernfs_break_active_protection(kn); + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* + * Check both if resctrl is torn down (!rdtgroup_default.kn) and + * if the reader's kernfs_node originates from a dead mount. + */ + if (!rdtgroup_default.kn || !is_active_resctrl_node(kn)) { + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + kernfs_unbreak_active_protection(kn); + return false; + } + + return true; +} + +void info_kn_unlock(struct kernfs_node *kn) +{ + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + kernfs_unbreak_active_protection(kn); +} + static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **mon_data_kn); --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 010643F7893 for ; Wed, 3 Jun 2026 03:28:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457287; cv=none; b=EALAfRMdUUmzSvtlZyAi7Qr+RDUdBsXfBOK5VTe3RadTkfj+Mi5Lug1FWKYAip8Wv10zetz2WjrU2mKY+Vp1ayJoxIUdvTzLMUtcMq0ayLLoGrnyx+4WOq6sGpXveOhtHbn5JZ//7UhJ0m022Ak+FGDZfEifpXLrIL10lQyFy8g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457287; c=relaxed/simple; 
bh=LI3fWkBiCZaOC31ilZkPdUqLlp43WfXVSeGsZHEYUvs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=IHEzGW5vHrvF+TMdU4sEBuB+cqM7Z8Wn8oLa3t5Nsfwmo/f01Y8COwhq/4lZ7bpJplTDLTRt/JTlnfXZ9bDrozbMMzPINV+ViWwBy5cQkef/IlIzBoTRErS1Bk7A0jrX1WH5JftQ1fo4k6BkE4gAZkGjSXmIpuNEI5xTa7R+nH8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=dPXTTdry; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="dPXTTdry" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457286; x=1811993286; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=LI3fWkBiCZaOC31ilZkPdUqLlp43WfXVSeGsZHEYUvs=; b=dPXTTdryDzJ2JAxUuHCFDQZtmKtld4epfjS1dtSVtH8UwJJ1g28rrsac 4ge4258us4o3IIPh0BTQxz5hTuoGEByHfLzx6Ekuvxv3ZA0qqe5uz9jC/ 3zIqakUbmDFh4maii6/JY0OutGXEYzqMrpg5kdWglLPLlQtPdDKI+ODru +0IKzXwXxRHgyhsCk68ql0THov73o7i9lQ8/KKLiTZKAB/U7WH76Hi09j b/oMLD7IFAXFJ1TRaCx65iawZpZRy1D9e17hM3HHRA3yHuFJfEuKObnQV OKIlWWjFoHIDtQItDfj1oOCl8Dd/gX5SDA3yhx3cVLZzEsx3b6ciI9Xgj w==; X-CSE-ConnectionGUID: Fw1xRiomSHSC44ynQX9AqA== X-CSE-MsgGUID: E5bRddxURzO1BNsn3B2kcA== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91939050" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91939050" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:02 -0700 X-CSE-ConnectionGUID: Z1M80lUZRn2YTPwIH7PLPQ== X-CSE-MsgGUID: N0pjBagxT+KsqCU8LKtwSw== X-ExtLoop1: 1 
X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110144"
Received: from rchatre-desk1.jf.intel.com ([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:01 -0700
From: Reinette Chatre
To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com
Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com
Subject: [PATCH v4 09/10] x86/resctrl: Ensure domain fully initialized before placed on RCU list
Date: Tue, 2 Jun 2026 20:27:37 -0700
Message-ID:
X-Mailer: git-send-email 2.50.1
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

A resctrl domain consists of the domain structure itself, which includes
pointers to dynamically allocated filesystem as well as architecture
specific data. For example, the L3 monitoring domain structure consists of
the architecture specific struct rdt_hw_l3_mon_domain that contains the
dynamically allocated rdt_hw_l3_mon_domain::arch_mbm_states architectural
state and the embedded struct rdt_l3_mon_domain that contains the
dynamically allocated rdt_l3_mon_domain::mbm_states resctrl fs state.

The domains are added to and removed from an RCU protected list while
cpus_write_lock() is held so that readers can access domains while holding
cpus_read_lock() or from an RCU read-side critical section. A reader
accessing a domain via the RCU list expects that the domain and all its
dynamically allocated data are accessible.
Only place the domain on the RCU list when all its dynamically allocated
data is ready. Similarly, unlink it from the RCU list (again with
cpus_write_lock() held) before removing any of its dynamically allocated
data.

Calling resctrl_online_mon_domain() before adding the domain to the RCU
list creates the kernfs files that expose the domain's monitoring data to
user space before the domain is on the list. This is safe because
rdtgroup_mondata_show() acquires cpus_read_lock() before it traverses the
RCU list and will thus block until the domain is added to the RCU list.

There are currently no readers accessing a domain via the RCU list. Ensure
safety of access when such a reader arrives.

Signed-off-by: Reinette Chatre
Reviewed-by: Tony Luck
Reviewed-by: Chen Yu
Reported-by: Sashiko
---
Changes since V2:
- New patch

Changes since V3:
- Add Tony's Reviewed-by tag.
- Add Chenyu's Reviewed-by tag.
- Grammar fixes in changelog.
- Add snippet to changelog about possible race with
  rdtgroup_mondata_show().
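
Illustrative sketch only, not part of the patch: the ordering rule described
in the changelog ("fully initialize, then publish; unlink, then tear down")
can be shown with a small user-space analogue. Everything below is
hypothetical (struct demo_domain, demo_online(), demo_add(), demo_find(),
demo_remove_head()); C11 release/acquire atomics stand in for
list_add_tail_rcu()/list_for_each_entry_rcu(), a single writer is assumed,
and the grace period that synchronize_rcu() provides in the kernel is only
marked by a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a domain with dynamically allocated state. */
struct demo_domain {
    int id;
    int *state;                 /* allocated by demo_online() */
    struct demo_domain *next;
};

/* Published list head; readers load it with acquire semantics. */
static _Atomic(struct demo_domain *) demo_head;

/* Stand-in for the "online" step: allocate the per-domain state. */
static int demo_online(struct demo_domain *d, bool fail)
{
    if (fail)
        return -1;
    d->state = calloc(4, sizeof(*d->state));
    return d->state ? 0 : -1;
}

/* Fully initialize first, publish last. Returns NULL on failure. */
static struct demo_domain *demo_add(int id, bool fail_online)
{
    struct demo_domain *d = calloc(1, sizeof(*d));

    if (!d)
        return NULL;
    d->id = id;
    if (demo_online(d, fail_online)) {
        /* Never published, so no reader can hold a pointer: free directly. */
        free(d);
        return NULL;
    }
    d->next = atomic_load_explicit(&demo_head, memory_order_relaxed);
    /* Release store: the initialization above is visible before the pointer. */
    atomic_store_explicit(&demo_head, d, memory_order_release);
    return d;
}

/* Reader: every domain reachable from the list must have valid state. */
static struct demo_domain *demo_find(int id)
{
    struct demo_domain *d =
        atomic_load_explicit(&demo_head, memory_order_acquire);

    for (; d; d = d->next) {
        assert(d->state != NULL);
        if (d->id == id)
            return d;
    }
    return NULL;
}

/* Unlink first, then tear down. Handles only the head entry for brevity. */
static void demo_remove_head(void)
{
    struct demo_domain *d =
        atomic_load_explicit(&demo_head, memory_order_relaxed);

    if (!d)
        return;
    atomic_store_explicit(&demo_head, d->next, memory_order_release);
    /* A real RCU user would wait for a grace period here. */
    free(d->state);
    free(d);
}
```

With this ordering the failure path of demo_add() needs no unlink and no
grace period, which mirrors why the patch can drop the
list_del_rcu()/synchronize_rcu() pair from the error paths.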
--- arch/x86/kernel/cpu/resctrl/core.c | 18 +++++++----------- arch/x86/kernel/cpu/resctrl/intel_aet.c | 5 ++--- 2 files changed, 9 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resct= rl/core.c index 9c01d2562b7a..bca782050198 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -515,14 +515,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_r= esource *r) return; } =20 - list_add_tail_rcu(&d->hdr.list, add_pos); - err =3D resctrl_online_ctrl_domain(r, d); if (err) { - list_del_rcu(&d->hdr.list); - synchronize_rcu(); ctrl_domain_free(hw_dom); + return; } + list_add_tail_rcu(&d->hdr.list, add_pos); } =20 static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, s= truct list_head *add_pos) @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, stru= ct rdt_resource *r, struct return; } =20 - list_add_tail_rcu(&d->hdr.list, add_pos); - err =3D resctrl_online_mon_domain(r, &d->hdr); if (err) { - list_del_rcu(&d->hdr.list); - synchronize_rcu(); l3_mon_domain_free(hw_dom); + return; } + list_add_tail_rcu(&d->hdr.list, add_pos); } =20 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r) @@ -642,9 +638,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_= resource *r) d =3D container_of(hdr, struct rdt_ctrl_domain, hdr); hw_dom =3D resctrl_to_arch_ctrl_dom(d); =20 - resctrl_offline_ctrl_domain(r, d); list_del_rcu(&hdr->list); synchronize_rcu(); + resctrl_offline_ctrl_domain(r, d); =20 /* * rdt_ctrl_domain "d" is going to be freed below, so clear @@ -689,9 +685,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_r= esource *r) =20 d =3D container_of(hdr, struct rdt_l3_mon_domain, hdr); hw_dom =3D resctrl_to_arch_mon_dom(d); - resctrl_offline_mon_domain(r, hdr); list_del_rcu(&hdr->list); synchronize_rcu(); + resctrl_offline_mon_domain(r, hdr); l3_mon_domain_free(hw_dom); break; } @@ -702,9 +698,9 @@ static void 
domain_remove_cpu_mon(int cpu, struct rdt_r= esource *r) return; =20 pkgd =3D container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr); - resctrl_offline_mon_domain(r, hdr); list_del_rcu(&hdr->list); synchronize_rcu(); + resctrl_offline_mon_domain(r, hdr); kfree(pkgd); break; } diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/= resctrl/intel_aet.c index 89b8b619d5d5..c22c3cf5167d 100644 --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c @@ -398,12 +398,11 @@ void intel_aet_mon_domain_setup(int cpu, int id, stru= ct rdt_resource *r, d->hdr.type =3D RESCTRL_MON_DOMAIN; d->hdr.rid =3D RDT_RESOURCE_PERF_PKG; cpumask_set_cpu(cpu, &d->hdr.cpu_mask); - list_add_tail_rcu(&d->hdr.list, add_pos); =20 err =3D resctrl_online_mon_domain(r, &d->hdr); if (err) { - list_del_rcu(&d->hdr.list); - synchronize_rcu(); kfree(d); + return; } + list_add_tail_rcu(&d->hdr.list, add_pos); } --=20 2.50.1 From nobody Mon Jun 8 06:35:46 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0D2EC3F7899 for ; Wed, 3 Jun 2026 03:28:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.9 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457288; cv=none; b=tD3s8himccZqaAElFJnHjaU4PfJyZPqy//lyZdooDUEu31fl73eRSujMb57eIzNY7bNLuMDoHjEOpkVOj/pHlxtXOP1DdpLuhZNGQf+YGNHXVQh9CKplnPGSR4r7B1O1cR3ngzDWzlMC42dPcnL356ZRf1TUHgSNWMdTpVNhvP4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457288; c=relaxed/simple; bh=NkRz2ujFLstUVTop8cwQvKn2tC+od6QZJws0jIRRO8k=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=fk4zom8IGUyPGvGJ3+MW5AieG14yW5unis5GXhr8QAqVxWE+LOzjz3QG5PDSohXRQWZeSfGMSVchmmQI4axBw4uWjANZrrGp95Hnusgt69i/mmpRoYgT0+VoCTIHIKdpTH0Ih0/hjV8no3RuVmXpP4dMg5MIf/Hc/TG75RUrWQw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=R09Bng2h; arc=none smtp.client-ip=192.198.163.9 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="R09Bng2h" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1780457286; x=1811993286; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=NkRz2ujFLstUVTop8cwQvKn2tC+od6QZJws0jIRRO8k=; b=R09Bng2hugCtX8+g/L9/hM7m7IUaiz9SlVxFjnl4j7ROVmIs9WAsoE5N MhE7uXPLBqVxU4n5JIXA62bnt42jgSr0SLLVYHetRdVKmsU6Lp822QSTr awmtp+4msa/kC++kdMDcyRw4y1w5UQwhPv/6mZOfkBd7asLhdwjswaG4j LzujE38c07k7jIUTbBgeqw1HboA9o2P/ofI8/+Cj2j2bnXIyakXBBYH3H jhx2WRL0lvKTfIWqgLW3qm6aJ7MmKtHcLW6r17SSBTaVhsl79KdwhW/Uz +YUONnrWqbj/ZiCPX6o92C5NYU7a9GALMzKa0VP+3C9DfUNa38FNqqjJB w==; X-CSE-ConnectionGUID: DypI4DMCQPeR07Wo6r3kAQ== X-CSE-MsgGUID: Aem/4RThQcGtiFZ7Sq961Q== X-IronPort-AV: E=McAfee;i="6800,10657,11805"; a="91939061" X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="91939061" Received: from fmviesa007.fm.intel.com ([10.60.135.147]) by fmvoesa103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:02 -0700 X-CSE-ConnectionGUID: trpoU+IfQoanmb/bMAAi9w== X-CSE-MsgGUID: WdrtOkcIR1KdSAaIWULefw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.24,184,1774335600"; d="scan'208";a="241110148" Received: from rchatre-desk1.jf.intel.com 
([10.165.154.99]) by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Jun 2026 20:28:01 -0700 From: Reinette Chatre To: tony.luck@intel.com, james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com, bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com Cc: x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com, fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com, yu.c.chen@intel.com, linux-kernel@vger.kernel.org, patches@lists.linux.dev, reinette.chatre@intel.com Subject: [PATCH v4 10/10] fs/resctrl: Fix UAF from worker threads when domains are removed Date: Tue, 2 Jun 2026 20:27:38 -0700 Message-ID: X-Mailer: git-send-email 2.50.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The mbm_handle_overflow() and cqm_handle_limbo() workers read event counters and may sleep while doing so. They are scheduled via delayed_work embedded in struct rdt_l3_mon_domain. Architecture allocates and frees these domains from CPU hotplug callbacks under cpus_write_lock(), and the workers acquire cpus_read_lock() to keep the domain alive across their access. A use-after-free can occur when a worker is blocked waiting for cpus_read_lock() while the hotplug core holds cpus_write_lock(): the architecture frees the rdt_l3_mon_domain that contains the worker's work_struct. When the worker unblocks, the container_of() it performs on the embedded work pointer dereferences freed memory. Drop cpus_read_lock() from the workers and instead drain pending and in-flight work synchronously before the architecture can free the domain. Since architecture offlines the domain under cpus_write_lock() after it has been unlinked from the RCU list and a grace period has elapsed, no new work can be scheduled. The cancel only needs to wait out existing work. 
Drop rdtgroup_mutex during CPU offline around cancel_delayed_work_sync()
so that a worker waiting on the mutex can complete before re-pinning the
work on a different CPU.

When offlining a CPU the architecture may iterate over resources in any
order. For example, the MBA control domain may be offlined before or after
a corresponding L3 monitor domain. Ensure that resctrl fs cancels the
workers no matter what order the architecture offlines the domains.

Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com # [1]
Co-developed-by: Tony Luck
Signed-off-by: Tony Luck
Signed-off-by: Reinette Chatre
---
Changes since v2:
- Rewrite changelog
- v2 attempted to solve the issue by using is_percpu_thread() within the
  worker to learn whether the CPU the worker was running on is going
  offline. Sashiko
  (https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=3D5)
  pointed out that this would not be able to handle the scenario where one
  of the hotplug handlers following the resctrl offline handlers failed.
- Some other fixes attempted that failed:
  - Switch to accessing domain structure in handler via RCU so that CPU
    hotplug lock is no longer needed. Use cancel_delayed_work_sync() with
    mutex dropped to cancel worker. Running worker from RCU read-side
    critical section is a problem since the worker needs to be able to
    sleep (mbm_handle_overflow()->mbm_update()->
    mbm_update_one_event()->resctrl_arch_mon_ctx_alloc()->
    might_sleep())
  - Adding a reference count to the domain structure to avoid the worker
    needing to take CPU hotplug lock. This ended up being very complicated
    with the architecture needing new APIs to manage the reference count
    which cannot cleanly integrate into MPAM since it uses a single
    architecture domain structure to contain both the control and
    monitoring domain structures.
Managing the references across mount, unmount, online, offline, as well as worker self exit resulted in several asymmetrical and complicated paths that were error prone. Locking also proved to be complicated since architecture would need to initiate domain free that will need to call back into resctrl that will take rdtgroup_mutex which means that references need to be taken/released without locking. Changes since V3: ---------------- - Traverse mon_domains list using list_for_each_entry_rcu( ..., lockdep_is_cpus_held()) to document how CPU hotplug lock is required to be held (via architecture). - Add snippet in changelog to motivate canceling work in monitor and control domain offline handlers. --- fs/resctrl/monitor.c | 52 ++++++++++++++++++++++++++++++++++--------- fs/resctrl/rdtgroup.c | 52 ++++++++++++++++++++++++++++++++++++++----- 2 files changed, 89 insertions(+), 15 deletions(-) diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index f7ab9a1bc726..db56c0153e3a 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -628,14 +628,22 @@ void mon_event_count(void *info) rr->err =3D 0; } =20 -static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, - struct rdt_resource *r) +/* + * Find the software controller's ctrl domain that contains @cpu on resour= ce @r. + * + * Only called from the mbm_over worker via update_mba_bw() where the retu= rned + * domain is kept alive by cancel_delayed_work_sync() in + * resctrl_offline_ctrl_domain(). This drains this worker and then waits on + * rdtgroup_mutex held here before the architecture can free the ctrl doma= in. + * + * Context: Call from RCU read-side critical section. 
+ */ +static struct rdt_ctrl_domain *get_sc_ctrl_domain_from_cpu(int cpu, + struct rdt_resource *r) { struct rdt_ctrl_domain *d; =20 - lockdep_assert_cpus_held(); - - list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list) { /* Find the domain that contains this CPU */ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) return d; @@ -696,7 +704,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct= rdt_l3_mon_domain *dom_m if (WARN_ON_ONCE(!pmbm_data)) return; =20 - dom_mba =3D get_ctrl_domain_from_cpu(smp_processor_id(), r_mba); + guard(rcu)(); + dom_mba =3D get_sc_ctrl_domain_from_cpu(smp_processor_id(), r_mba); if (!dom_mba) { pr_warn_once("Failure to get domain for MBA update\n"); return; @@ -799,9 +808,19 @@ void cqm_handle_limbo(struct work_struct *work) unsigned long delay =3D msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); struct rdt_l3_mon_domain *d; =20 - cpus_read_lock(); + /* + * Safe to run without CPU hotplug lock. Work is guaranteed to be + * canceled before the domain structure is removed. + */ mutex_lock(&rdtgroup_mutex); =20 + /* + * Ensure the worker is dedicated to a CPU as intended and not + * relocated by workqueue subsystem as part of CPU going offline. + */ + if (!is_percpu_thread()) + goto out_unlock; + d =3D container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work); =20 __check_limbo(d, false); @@ -813,8 +832,8 @@ void cqm_handle_limbo(struct work_struct *work) delay); } =20 +out_unlock: mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); } =20 /** @@ -846,7 +865,10 @@ void mbm_handle_overflow(struct work_struct *work) struct list_head *head; struct rdt_resource *r; =20 - cpus_read_lock(); + /* + * Safe to run without CPU hotplug lock. Work is guaranteed to be + * canceled before the domain structure is removed. 
+ */ mutex_lock(&rdtgroup_mutex); =20 /* @@ -856,6 +878,17 @@ void mbm_handle_overflow(struct work_struct *work) if (!resctrl_mounted || !resctrl_arch_mon_capable()) goto out_unlock; =20 + /* + * Ensure the worker is dedicated to a CPU and not relocated by + * workqueue subsystem as part of CPU going offline since reading + * events depend on smp_processor_id(). After passing this check + * smp_processor_id() is valid for entire duration of this worker + * since it runs with rdtgroup_mutex held and the offline handler needs + * rdtgroup_mutex to offline the CPU being run on here. + */ + if (!is_percpu_thread()) + goto out_unlock; + r =3D resctrl_arch_get_resource(RDT_RESOURCE_L3); d =3D container_of(work, struct rdt_l3_mon_domain, mbm_over.work); =20 @@ -880,7 +913,6 @@ void mbm_handle_overflow(struct work_struct *work) =20 out_unlock: mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); } =20 /** diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 9f998e394911..b5fb59d0e035 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -4491,6 +4491,29 @@ static void domain_destroy_l3_mon_state(struct rdt_l= 3_mon_domain *d) =20 void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_d= omain *d) { + /* + * mbm_handle_overflow() may dereference this ctrl domain via + * update_mba_bw()->get_sc_ctrl_domain_from_cpu(). The architecture has + * unlinked the domain from the RCU list and waited a grace period, so + * no new worker iteration can find it; drain any worker that already + * holds a pointer to it before the architecture frees the domain. + * + * Software controller is enabled/disabled on mount/unmount with + * cpus_read_lock() held. Running here with cpus_write_lock() so + * there are no concurrent changes to software controller status. 
+ */ + if (r->rid =3D=3D RDT_RESOURCE_MBA && is_mba_sc(r)) { + struct rdt_resource *l3 =3D resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdt_l3_mon_domain *mon_d; + + list_for_each_entry_rcu(mon_d, &l3->mon_domains, hdr.list, lockdep_is_cp= us_held()) { + if (mon_d->hdr.id =3D=3D d->hdr.id) { + cancel_delayed_work_sync(&mon_d->mbm_over); + break; + } + } + } + mutex_lock(&rdtgroup_mutex); =20 if (supports_mba_mbps() && r->rid =3D=3D RDT_RESOURCE_MBA) @@ -4503,6 +4526,24 @@ void resctrl_offline_mon_domain(struct rdt_resource = *r, struct rdt_domain_hdr *h { struct rdt_l3_mon_domain *d; =20 + /* + * Called by architecture under CPU hotplug lock as it prepares to remove + * the domain which is guaranteed to be accessible here. + * The domain has been unlinked from the RCU list and a grace period + * has elapsed, so no new worker can be scheduled. Drain any worker that + * is in flight or pending before letting architecture proceed to free + * the domain that has the workers' struct delayed_work embedded. + * Do so before taking rdtgroup_mutex since the workers also acquire it. + */ + if (r->rid =3D=3D RDT_RESOURCE_L3 && + domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) { + d =3D container_of(hdr, struct rdt_l3_mon_domain, hdr); + if (resctrl_is_mbm_enabled()) + cancel_delayed_work_sync(&d->mbm_over); + if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) + cancel_delayed_work_sync(&d->cqm_limbo); + } + mutex_lock(&rdtgroup_mutex); =20 /* @@ -4519,8 +4560,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *= r, struct rdt_domain_hdr *h goto out_unlock; =20 d =3D container_of(hdr, struct rdt_l3_mon_domain, hdr); - if (resctrl_is_mbm_enabled()) - cancel_delayed_work(&d->mbm_over); if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(= d)) { /* * When a package is going down, forcefully @@ -4531,7 +4570,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *= r, struct rdt_domain_hdr *h * package never comes back. 
*/ __check_limbo(d, true); - cancel_delayed_work(&d->cqm_limbo); } =20 domain_destroy_l3_mon_state(d); @@ -4712,12 +4750,16 @@ void resctrl_offline_cpu(unsigned int cpu) d =3D get_mon_domain_from_cpu(cpu, l3); if (d) { if (resctrl_is_mbm_enabled() && cpu =3D=3D d->mbm_work_cpu) { - cancel_delayed_work(&d->mbm_over); + mutex_unlock(&rdtgroup_mutex); + cancel_delayed_work_sync(&d->mbm_over); + mutex_lock(&rdtgroup_mutex); mbm_setup_overflow_handler(d, 0, cpu); } if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && cpu =3D=3D d->cqm_work_cpu && has_busy_rmid(d)) { - cancel_delayed_work(&d->cqm_limbo); + mutex_unlock(&rdtgroup_mutex); + cancel_delayed_work_sync(&d->cqm_limbo); + mutex_lock(&rdtgroup_mutex); cqm_setup_limbo_handler(d, 0, cpu); } } --=20 2.50.1