From nobody Mon Jun 8 19:46:40 2026
From: Reinette Chatre
To: reinette.chatre@intel.com
Cc: linux-kernel@vger.kernel.org, patches@lists.linux.dev
Subject: [PATCH RESEND v3 1/3] fs/resctrl: Prevent deadlock and use-after-free in info file handlers
Date: Tue, 26 May 2026 15:12:35 -0700
Message-ID: <04548591d62d3ae3e9937814d4b3926f8d0424c9.1779833414.git.reinette.chatre@intel.com>
X-Mailer: git-send-email 2.50.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

resctrl provides files under the info/ directory to expose global configuration and capabilities to userspace. These files are instantiated statically during filesystem mount and expose data associated with internal schema structures via kernfs private pointers.

A potential deadlock exists between userspace readers of these info files and the filesystem teardown done on unmount. Reading an info file goes through kernfs, which acquires an active reference on the node, after which the handler typically attempts to acquire rdtgroup_mutex. Concurrently, unmounting the filesystem holds rdtgroup_mutex and then recursively removes the info kernfs nodes. That removal involves kernfs_drain(), which blocks until all active references are released.

Another problem exists where info files might be accessed from an outdated mount if the filesystem is unmounted and remounted during a reader's execution, leading to a use-after-free when reading the now-deleted private schema data.

Introduce info_kn_lock() and info_kn_unlock() helpers to coordinate locking across all info handlers. These helpers mirror similar logic used by resource group handlers by deliberately breaking the kernfs active protection before attempting to acquire rdtgroup_mutex, preventing the deadlock. To guard against the vulnerability from rapid mount cycling, info_kn_lock() walks the parent lineage of the kernfs node under an RCU read-side critical section to confirm that the node belongs to the globally active root before permitting the operation to proceed.

Convert all info file handlers to use these helpers and only dereference the schema after it is determined safe to do so. Make no attempt to output an error message to last_cmd_status on failure since failure implies there is no filesystem through which to display the error to user space.

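The ancestry check described above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the patch's code: struct node, active_root and node_is_active() are hypothetical stand-ins for struct kernfs_node, rdtgroup_default.kn and is_active_resctrl_node(), and the RCU protection and locking that the real helper relies on are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kernfs_node; only the parent link is modelled. */
struct node {
	struct node *parent;
};

/* Hypothetical stand-in for rdtgroup_default.kn: root of the live mount, NULL when unmounted. */
static struct node *active_root;

/* Walk from n towards its root; true only if the live root is n or one of its ancestors. */
static bool node_is_active(const struct node *n)
{
	if (!active_root)
		return false;

	for (; n; n = n->parent) {
		if (n == active_root)
			return true;
	}

	return false;
}
```

A file that was opened on a previous mount still reaches its old root through the parent links, but that root no longer matches the live one, so the check fails and the handler can bail out before touching private data.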
Reported-by: Sashiko Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40i= ntel.com?part=3D3 Assisted-by: GitHub_Copilot:gemini-3.1-pro Signed-off-by: Reinette Chatre --- Changes since V2: - New patch --- fs/resctrl/ctrlmondata.c | 38 ++++---- fs/resctrl/internal.h | 3 +- fs/resctrl/monitor.c | 48 +++++----- fs/resctrl/rdtgroup.c | 192 ++++++++++++++++++++++++++++++++------- 4 files changed, 203 insertions(+), 78 deletions(-) diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index 9a7dfc48cb2e..b95bf6208be2 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -769,10 +769,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *a= rg) int resctrl_io_alloc_show(struct kernfs_open_file *of, struct seq_file *se= q, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 + r =3D s->res; if (r->cache.io_alloc_capable) { if (resctrl_arch_get_io_alloc_enabled(r)) seq_puts(seq, "enabled\n"); @@ -782,7 +784,7 @@ int resctrl_io_alloc_show(struct kernfs_open_file *of, = struct seq_file *seq, voi seq_puts(seq, "not supported\n"); } =20 - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return 0; } @@ -847,7 +849,7 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file = *of, char *buf, size_t nbytes, loff_t off) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; char const *grp_name; u32 io_alloc_closid; bool enable; @@ -857,9 +859,10 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file= *of, char *buf, if (ret) return ret; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 + r =3D s->res; rdt_last_cmd_clear(); =20 if (!r->cache.io_alloc_capable) { @@ -907,8 +910,7 @@ ssize_t resctrl_io_alloc_write(struct 
kernfs_open_file = *of, char *buf, } =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -916,14 +918,15 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_fil= e *of, char *buf, int resctrl_io_alloc_cbm_show(struct kernfs_open_file *of, struct seq_file= *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; int ret =3D 0; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 + r =3D s->res; if (!r->cache.io_alloc_capable) { rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name); ret =3D -ENODEV; @@ -945,8 +948,7 @@ int resctrl_io_alloc_cbm_show(struct kernfs_open_file *= of, struct seq_file *seq, show_doms(seq, s, NULL, resctrl_io_alloc_closid(r)); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); return ret; } =20 @@ -1013,7 +1015,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open= _file *of, char *buf, size_t nbytes, loff_t off) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; u32 io_alloc_closid; int ret =3D 0; =20 @@ -1023,10 +1025,11 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_op= en_file *of, char *buf, =20 buf[nbytes - 1] =3D '\0'; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; rdt_last_cmd_clear(); =20 + r =3D s->res; if (!r->cache.io_alloc_capable) { rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name); ret =3D -ENODEV; @@ -1051,8 +1054,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open= _file *of, char *buf, out_clear_configs: rdt_staged_configs_clear(); out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } diff --git 
a/fs/resctrl/internal.h b/fs/resctrl/internal.h index e7e415ee7766..4e3173f25e92 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -345,8 +345,9 @@ void rdt_last_cmd_printf(const char *fmt, ...); =20 void rdtgroup_remove(struct rdtgroup *rdtgrp); struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); - void rdtgroup_kn_unlock(struct kernfs_node *kn); +bool info_kn_lock(struct kernfs_node *kn); +void info_kn_unlock(struct kernfs_node *kn); =20 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); =20 diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index 0e6a389a16bf..4565b9864a9e 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -1052,7 +1052,8 @@ int event_filter_show(struct kernfs_open_file *of, st= ruct seq_file *seq, void *v bool sep =3D false; int ret =3D 0, i; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; rdt_last_cmd_clear(); =20 r =3D resctrl_arch_get_resource(mevt->rid); @@ -1073,7 +1074,7 @@ int event_filter_show(struct kernfs_open_file *of, st= ruct seq_file *seq, void *v seq_putc(seq, '\n'); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return ret; } @@ -1084,7 +1085,8 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_op= en_file *of, struct seq_fil struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); int ret =3D 0; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; rdt_last_cmd_clear(); =20 if (!resctrl_arch_mbm_cntr_assign_enabled(r)) { @@ -1096,7 +1098,7 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_op= en_file *of, struct seq_fil seq_printf(s, "%u\n", r->mon.mbm_assign_on_mkdir); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return ret; } @@ -1112,7 +1114,8 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kern= fs_open_file *of, char *buf if (ret) return ret; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; 
rdt_last_cmd_clear(); =20 if (!resctrl_arch_mbm_cntr_assign_enabled(r)) { @@ -1124,7 +1127,7 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kern= fs_open_file *of, char *buf r->mon.mbm_assign_on_mkdir =3D value; =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1414,8 +1417,8 @@ ssize_t event_filter_write(struct kernfs_open_file *o= f, char *buf, size_t nbytes =20 buf[nbytes - 1] =3D '\0'; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1438,8 +1441,7 @@ ssize_t event_filter_write(struct kernfs_open_file *o= f, char *buf, size_t nbytes } =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1450,7 +1452,8 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_f= ile *of, struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); bool enabled; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; enabled =3D resctrl_arch_mbm_cntr_assign_enabled(r); =20 if (r->mon.mbm_cntr_assignable) { @@ -1469,7 +1472,7 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_f= ile *of, seq_puts(s, "[default]\n"); } =20 - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); =20 return 0; } @@ -1488,8 +1491,8 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_o= pen_file *of, char *buf, =20 buf[nbytes - 1] =3D '\0'; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1547,8 +1550,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_o= pen_file *of, char *buf, } =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1560,8 +1562,8 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_fil= e *of, struct rdt_l3_mon_domain *dom; bool sep =3D false; =20 - cpus_read_lock(); - 
mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 list_for_each_entry(dom, &r->mon_domains, hdr.list) { if (sep) @@ -1572,8 +1574,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_fil= e *of, } seq_putc(s, '\n'); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); return 0; } =20 @@ -1586,8 +1587,8 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_op= en_file *of, u32 cntrs, i; int ret =3D 0; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1613,8 +1614,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_op= en_file *of, seq_putc(s, '\n'); =20 out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret; } diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index a8b4ac7dd823..6601b138ac7a 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -977,13 +977,14 @@ static int rdt_last_cmd_status_show(struct kernfs_ope= n_file *of, { int len; =20 - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; len =3D seq_buf_used(&last_cmd_status); if (len) seq_printf(seq, "%.*s", len, last_cmd_status_buf); else seq_puts(seq, "ok\n"); - mutex_unlock(&rdtgroup_mutex); + info_kn_unlock(of->kn); return 0; } =20 @@ -1002,7 +1003,11 @@ static int rdt_num_closids_show(struct kernfs_open_f= ile *of, { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; seq_printf(seq, "%u\n", s->num_closid); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1010,9 +1015,14 @@ static int rdt_default_ctrl_show(struct kernfs_open_= file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%x\n", 
resctrl_get_default_ctrl(r)); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1020,9 +1030,15 @@ static int rdt_min_cbm_bits_show(struct kernfs_open_= file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; + =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->cache.min_cbm_bits); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1030,9 +1046,14 @@ static int rdt_shareable_bits_show(struct kernfs_ope= n_file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%x\n", r->cache.shareable_bits); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1060,15 +1081,16 @@ static int rdt_bit_usage_show(struct kernfs_open_fi= le *of, */ unsigned long sw_shareable =3D 0, hw_shareable =3D 0; unsigned long exclusive =3D 0, pseudo_locked =3D 0; - struct rdt_resource *r =3D s->res; struct rdt_ctrl_domain *dom; int i, hwb, swb, excl, psl; + struct rdt_resource *r; enum rdtgrp_mode mode; bool sep =3D false; u32 ctrl_val; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; list_for_each_entry(dom, &r->ctrl_domains, hdr.list) { if (sep) seq_putc(seq, ';'); @@ -1144,8 +1166,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file= *of, sep =3D true; } seq_putc(seq, '\n'); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); return 0; } =20 @@ -1153,9 +1174,14 @@ static int rdt_min_bw_show(struct kernfs_open_file *= of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", 
r->membw.min_bw); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1164,8 +1190,12 @@ static int rdt_num_rmids_show(struct kernfs_open_fil= e *of, { struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; seq_printf(seq, "%u\n", r->mon.num_rmid); =20 + info_kn_unlock(of->kn); + return 0; } =20 @@ -1175,6 +1205,8 @@ static int rdt_mon_features_show(struct kernfs_open_f= ile *of, struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); struct mon_evt *mevt; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; for_each_mon_event(mevt) { if (mevt->rid !=3D r->rid || !mevt->enabled) continue; @@ -1184,6 +1216,8 @@ static int rdt_mon_features_show(struct kernfs_open_f= ile *of, seq_printf(seq, "%s_config\n", mevt->name); } =20 + info_kn_unlock(of->kn); + return 0; } =20 @@ -1191,9 +1225,14 @@ static int rdt_bw_gran_show(struct kernfs_open_file = *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->membw.bw_gran); + info_kn_unlock(of->kn); + return 0; } =20 @@ -1201,16 +1240,24 @@ static int rdt_delay_linear_show(struct kernfs_open= _file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->membw.delay_linear); + info_kn_unlock(of->kn); + return 0; } =20 static int max_threshold_occ_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { + if (!info_kn_lock(of->kn)) + return -ENOENT; seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold); + info_kn_unlock(of->kn); =20 return 0; } @@ -1219,22 +1266,28 @@ static int rdt_thread_throttle_mode_show(struct ker= nfs_open_file *of, struct seq_file *seq, void *v) { struct 
resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; + + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 + r =3D s->res; switch (r->membw.throttle_mode) { case THREAD_THROTTLE_PER_THREAD: seq_puts(seq, "per-thread\n"); - return 0; + break; case THREAD_THROTTLE_MAX: seq_puts(seq, "max\n"); - return 0; + break; case THREAD_THROTTLE_UNDEFINED: seq_puts(seq, "undefined\n"); - return 0; + break; + default: + WARN_ON_ONCE(1); + break; } =20 - WARN_ON_ONCE(1); - + info_kn_unlock(of->kn); return 0; } =20 @@ -1248,12 +1301,20 @@ static ssize_t max_threshold_occ_write(struct kernf= s_open_file *of, if (ret) return ret; =20 - if (bytes > resctrl_rmid_realloc_limit) - return -EINVAL; + if (!info_kn_lock(of->kn)) + return -ENOENT; + + if (bytes > resctrl_rmid_realloc_limit) { + ret =3D -EINVAL; + goto out_unlock; + } =20 resctrl_rmid_realloc_threshold =3D resctrl_arch_round_mon_val(bytes); =20 - return nbytes; +out_unlock: + info_kn_unlock(of->kn); + + return ret ?: nbytes; } =20 /* @@ -1293,10 +1354,15 @@ static int rdt_has_sparse_bitmasks_show(struct kern= fs_open_file *of, struct seq_file *seq, void *v) { struct resctrl_schema *s =3D rdt_kn_parent_priv(of->kn); - struct rdt_resource *r =3D s->res; + struct rdt_resource *r; =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + r =3D s->res; seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks); =20 + info_kn_unlock(of->kn); + return 0; } =20 @@ -1652,8 +1718,8 @@ static int mbm_config_show(struct seq_file *s, struct= rdt_resource *r, u32 evtid struct rdt_l3_mon_domain *dom; bool sep =3D false; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + lockdep_assert_cpus_held(); + lockdep_assert_held(&rdtgroup_mutex); =20 list_for_each_entry(dom, &r->mon_domains, hdr.list) { if (sep) @@ -1670,8 +1736,6 @@ static int mbm_config_show(struct seq_file *s, struct= rdt_resource *r, u32 evtid } seq_puts(s, "\n"); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); 
=20 return 0; } @@ -1681,8 +1745,12 @@ static int mbm_total_bytes_config_show(struct kernfs= _open_file *of, { struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); =20 + info_kn_unlock(of->kn); return 0; } =20 @@ -1691,8 +1759,12 @@ static int mbm_local_bytes_config_show(struct kernfs= _open_file *of, { struct rdt_resource *r =3D rdt_kn_parent_priv(of->kn); =20 + if (!info_kn_lock(of->kn)) + return -ENOENT; + mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); =20 + info_kn_unlock(of->kn); return 0; } =20 @@ -1790,8 +1862,8 @@ static ssize_t mbm_total_bytes_config_write(struct ke= rnfs_open_file *of, if (nbytes =3D=3D 0 || buf[nbytes - 1] !=3D '\n') return -EINVAL; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1799,8 +1871,7 @@ static ssize_t mbm_total_bytes_config_write(struct ke= rnfs_open_file *of, =20 ret =3D mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -1816,8 +1887,8 @@ static ssize_t mbm_local_bytes_config_write(struct ke= rnfs_open_file *of, if (nbytes =3D=3D 0 || buf[nbytes - 1] !=3D '\n') return -EINVAL; =20 - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); + if (!info_kn_lock(of->kn)) + return -ENOENT; =20 rdt_last_cmd_clear(); =20 @@ -1825,8 +1896,7 @@ static ssize_t mbm_local_bytes_config_write(struct ke= rnfs_open_file *of, =20 ret =3D mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); =20 - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); + info_kn_unlock(of->kn); =20 return ret ?: nbytes; } @@ -2660,6 +2730,58 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn) rdtgroup_kn_put(rdtgrp, kn); } =20 +/* + * Accessing the kn after breaking active protection is safe since the open + * of resctrl file holds a kernfs base reference 
(different from active + * protection) on the kn ensuring that it remains accessible even if it was + * unlinked. Each kn in turn holds base reference to parent so the kn's + * genealogy remains in memory until all base references dropped. + */ +static bool is_active_resctrl_node(struct kernfs_node *kn) +{ + struct kernfs_node *p; + bool match =3D false; + + guard(rcu)(); + p =3D kn; + while (p) { + if (p =3D=3D rdtgroup_default.kn) { + match =3D true; + break; + } + p =3D rcu_dereference(p->__parent); + } + + return match; +} + +bool info_kn_lock(struct kernfs_node *kn) +{ + kernfs_break_active_protection(kn); + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* + * Check both if resctrl is torn down (!rdtgroup_default.kn) and + * if the reader's kernfs_node originates from a dead mount. + */ + if (!rdtgroup_default.kn || !is_active_resctrl_node(kn)) { + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + kernfs_unbreak_active_protection(kn); + return false; + } + + return true; +} + +void info_kn_unlock(struct kernfs_node *kn) +{ + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + kernfs_unbreak_active_protection(kn); +} + static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **mon_data_kn); --=20 2.50.1

From nobody Mon Jun 8 19:46:40 2026
From: Reinette Chatre
To: reinette.chatre@intel.com
Cc: linux-kernel@vger.kernel.org, patches@lists.linux.dev
Subject: [PATCH RESEND v3 2/3] x86/resctrl: Ensure domain fully initialized before placed on RCU list
Date: Tue, 26 May 2026 15:12:36 -0700
Message-ID: <3ba2959b1cd3596e1e340eaee6b43487edaec0a4.1779833414.git.reinette.chatre@intel.com>
X-Mailer: git-send-email 2.50.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

A resctrl domain consists of the domain structure itself, which includes pointers to dynamically allocated filesystem as well as architecture specific data. For example, the L3 monitoring domain structure consists of the architecture specific struct rdt_hw_l3_mon_domain that contains the dynamically allocated rdt_hw_l3_mon_domain::arch_mbm_states architectural state, while the embedded struct rdt_l3_mon_domain contains the dynamically allocated rdt_l3_mon_domain::mbm_states resctrl fs state.

The domains are placed on an RCU protected list so that readers can access domains while holding cpus_read_lock() or from an RCU read-side critical section. A reader accessing a domain via the RCU list expects that the domain and all its dynamically allocated data are accessible.

Only place a domain on the RCU list when all its dynamically allocated data is ready. Similarly, unlink it from the RCU list before removing any of its dynamically allocated data.

There are currently no readers accessing a domain via the RCU list. Ensure safety of access for when such a reader arrives.

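The ordering argued for above (finish all setup, then publish) can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions rather than the kernel implementation: struct domain, domain_list, domain_add() and the init callbacks are hypothetical names, a release store stands in for list_add_tail_rcu(), and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical domain with one piece of dynamically allocated state. */
struct domain {
	int id;
	int *state;
	struct domain *next;
};

/* Head of the published list; readers are expected to load it with acquire ordering. */
static _Atomic(struct domain *) domain_list;

/* Hypothetical setup step, playing the role of resctrl_online_*_domain(). */
typedef int (*domain_init_fn)(struct domain *d);

static int init_ok(struct domain *d)
{
	d->state = calloc(4, sizeof(*d->state));
	return d->state ? 0 : -1;
}

static int init_fail(struct domain *d)
{
	(void)d;
	return -1;
}

/* Single writer assumed. The domain becomes visible only after init succeeded. */
static int domain_add(int id, domain_init_fn init)
{
	struct domain *d = calloc(1, sizeof(*d));

	if (!d)
		return -1;
	d->id = id;

	if (init(d)) {
		/* Never published, so no reader can hold a pointer: free directly. */
		free(d);
		return -1;
	}

	d->next = atomic_load_explicit(&domain_list, memory_order_relaxed);
	atomic_store_explicit(&domain_list, d, memory_order_release);
	return 0;
}
```

Because a failed domain is never linked, the error path needs neither an unlink nor a grace period, which corresponds to the list_del_rcu() and synchronize_rcu() calls the patch removes from the error handling.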
Signed-off-by: Reinette Chatre --- Changes since V2: - New patch --- arch/x86/kernel/cpu/resctrl/core.c | 18 +++++++----------- arch/x86/kernel/cpu/resctrl/intel_aet.c | 5 ++--- 2 files changed, 9 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resct= rl/core.c index 9c01d2562b7a..bca782050198 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -515,14 +515,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_r= esource *r) return; } =20 - list_add_tail_rcu(&d->hdr.list, add_pos); - err =3D resctrl_online_ctrl_domain(r, d); if (err) { - list_del_rcu(&d->hdr.list); - synchronize_rcu(); ctrl_domain_free(hw_dom); + return; } + list_add_tail_rcu(&d->hdr.list, add_pos); } =20 static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, s= truct list_head *add_pos) @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, stru= ct rdt_resource *r, struct return; } =20 - list_add_tail_rcu(&d->hdr.list, add_pos); - err =3D resctrl_online_mon_domain(r, &d->hdr); if (err) { - list_del_rcu(&d->hdr.list); - synchronize_rcu(); l3_mon_domain_free(hw_dom); + return; } + list_add_tail_rcu(&d->hdr.list, add_pos); } =20 static void domain_add_cpu_mon(int cpu, struct rdt_resource *r) @@ -642,9 +638,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_= resource *r) d =3D container_of(hdr, struct rdt_ctrl_domain, hdr); hw_dom =3D resctrl_to_arch_ctrl_dom(d); =20 - resctrl_offline_ctrl_domain(r, d); list_del_rcu(&hdr->list); synchronize_rcu(); + resctrl_offline_ctrl_domain(r, d); =20 /* * rdt_ctrl_domain "d" is going to be freed below, so clear @@ -689,9 +685,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_r= esource *r) =20 d =3D container_of(hdr, struct rdt_l3_mon_domain, hdr); hw_dom =3D resctrl_to_arch_mon_dom(d); - resctrl_offline_mon_domain(r, hdr); list_del_rcu(&hdr->list); synchronize_rcu(); + resctrl_offline_mon_domain(r, hdr); 
 		l3_mon_domain_free(hw_dom);
 		break;
 	}
@@ -702,9 +698,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 			return;
 
 		pkgd = container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr);
-		resctrl_offline_mon_domain(r, hdr);
 		list_del_rcu(&hdr->list);
 		synchronize_rcu();
+		resctrl_offline_mon_domain(r, hdr);
 		kfree(pkgd);
 		break;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 89b8b619d5d5..c22c3cf5167d 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -398,12 +398,11 @@ void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
 	d->hdr.type = RESCTRL_MON_DOMAIN;
 	d->hdr.rid = RDT_RESOURCE_PERF_PKG;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
-	list_add_tail_rcu(&d->hdr.list, add_pos);
 
 	err = resctrl_online_mon_domain(r, &d->hdr);
 	if (err) {
-		list_del_rcu(&d->hdr.list);
-		synchronize_rcu();
 		kfree(d);
+		return;
 	}
+	list_add_tail_rcu(&d->hdr.list, add_pos);
 }
-- 
2.50.1

From nobody Mon Jun 8 19:46:40 2026
From: Reinette Chatre
To: reinette.chatre@intel.com
Cc: linux-kernel@vger.kernel.org, patches@lists.linux.dev
Subject: [PATCH RESEND v3 3/3] fs/resctrl: Fix UAF from worker threads when domains are removed
Date: Tue, 26 May 2026 15:12:37 -0700

The mbm_handle_overflow() and cqm_handle_limbo() workers read event
counters and may sleep while doing so. They are scheduled via
delayed_work embedded in struct rdt_l3_mon_domain. Architecture allocates
and frees these domains from CPU hotplug callbacks under
cpus_write_lock(), and the workers acquire cpus_read_lock() to keep the
domain alive across their access.

A use-after-free can occur when a worker is blocked waiting for
cpus_read_lock() while the hotplug core holds cpus_write_lock(): the
architecture frees the rdt_l3_mon_domain that contains the worker's
work_struct. When the worker unblocks, the container_of() it performs on
the embedded work pointer dereferences freed memory.

Drop cpus_read_lock() from the workers and instead drain pending and
in-flight work synchronously before the architecture can free the
domain. Since architecture offlines the domain under cpus_write_lock()
after it has been unlinked from the RCU list and a grace period has
elapsed no new work can be scheduled. The cancel only needs to wait out
existing work.

Drop rdtgroup_mutex during CPU offline around cancel_delayed_work_sync()
so that a worker waiting on the mutex can complete before re-pinning the
work on a different CPU.

Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: Sashiko
Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com # [1]
Co-developed-by: Tony Luck
Signed-off-by: Tony Luck
Signed-off-by: Reinette Chatre
---
Changes since v2:
- Rewrite changelog
- v2 attempted to solve the issue by using is_percpu_thread() within the
  worker to learn if the CPU the worker was running on is going offline.
  Sashiko
  (https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=5)
  pointed out that this would not be able to handle the scenario where
  one of the hotplug handlers following the resctrl offline handlers
  failed.
- Some other fixes attempted that failed:
  - Switch to accessing the domain structure in the handler via RCU so
    that the CPU hotplug lock is no longer needed. Use
    cancel_delayed_work_sync() with the mutex dropped to cancel the
    worker. Running the worker from an RCU read-side critical section is
    a problem since the worker needs to be able to sleep
    (mbm_handle_overflow()->mbm_update()->mbm_update_one_event()->
    resctrl_arch_mon_ctx_alloc()->might_sleep()).
  - Adding a reference count to the domain structure to avoid the worker
    needing to take the CPU hotplug lock. This ended up being very
    complicated, with the architecture needing new APIs to manage the
    reference count. These cannot cleanly integrate into MPAM since it
    uses a single architecture domain structure to contain both the
    control and monitoring domain structures. Managing the references
    across mount, unmount, online, offline, as well as worker self exit
    resulted in several asymmetrical and complicated paths that were
    error prone. Locking also proved to be complicated: the architecture
    would need to initiate the domain free, which needs to call back into
    resctrl, which takes rdtgroup_mutex, so references would need to be
    taken/released without locking.
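The hazard in the changelog above comes from the delayed work being embedded in the domain: a handler is handed only the work pointer and recovers the domain with container_of(), so the domain has to outlive anything still queued. Below is a minimal user-space sketch of that relationship and of the drain-then-free ordering. It is not kernel code: struct demo_work, struct demo_domain and every demo_*() helper are hypothetical names made up for illustration, and the single pending flag only models queued work, not a handler that is already running on another CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified work item; not the kernel's struct delayed_work. */
struct demo_work {
	bool pending;
};

/* Hypothetical domain with the work item embedded in it. */
struct demo_domain {
	int id;
	unsigned int polls;
	struct demo_work overflow;
};

/* Handler: only the work pointer is passed in, the domain is derived from it. */
static void demo_handler(struct demo_work *work)
{
	struct demo_domain *d = demo_container_of(work, struct demo_domain, overflow);

	d->polls++;	/* Dereferences the domain: it must still be allocated. */
}

/* Models cancelling queued work: returns true if work had been pending. */
static bool demo_cancel(struct demo_work *work)
{
	bool was_pending = work->pending;

	work->pending = false;
	return was_pending;
}

/* Teardown in the order discussed above: drain first, free last. */
static void demo_domain_remove(struct demo_domain *d)
{
	demo_cancel(&d->overflow);
	assert(!d->overflow.pending);	/* Nothing can reach d through the work now. */
	free(d);
}

/* Runs the handler once, re-arms it, tears down; returns the poll count seen. */
static unsigned int demo_run(void)
{
	struct demo_domain *d = calloc(1, sizeof(*d));
	unsigned int polls;

	if (!d)
		return 0;
	d->id = 1;
	d->overflow.pending = true;

	if (demo_cancel(&d->overflow))		/* Dequeue, as a workqueue would ... */
		demo_handler(&d->overflow);	/* ... and run the handler. */

	d->overflow.pending = true;	/* The handler re-arms itself, like the workers. */
	polls = d->polls;
	demo_domain_remove(d);
	return polls;
}
```

demo_run() returns the number of times the handler ran before teardown. Calling free() before demo_cancel() would leave a queued work item pointing into freed memory, which is the ordering the changelog rules out.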
---
 fs/resctrl/monitor.c  | 52 ++++++++++++++++++++++++++++++++++---------
 fs/resctrl/rdtgroup.c | 52 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 89 insertions(+), 15 deletions(-)

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 4565b9864a9e..37df65229109 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -623,14 +623,22 @@ void mon_event_count(void *info)
 	rr->err = 0;
 }
 
-static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu,
-							struct rdt_resource *r)
+/*
+ * Find the software controller's ctrl domain that contains @cpu on resource @r.
+ *
+ * Only called from the mbm_over worker via update_mba_bw() where the returned
+ * domain is kept alive by cancel_delayed_work_sync() in
+ * resctrl_offline_ctrl_domain(). This drains this worker and then waits on
+ * rdtgroup_mutex held here before the architecture can free the ctrl domain.
+ *
+ * Context: Call from RCU read-side critical section.
+ */
+static struct rdt_ctrl_domain *get_sc_ctrl_domain_from_cpu(int cpu,
+							   struct rdt_resource *r)
 {
 	struct rdt_ctrl_domain *d;
 
-	lockdep_assert_cpus_held();
-
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -691,7 +699,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_m
 	if (WARN_ON_ONCE(!pmbm_data))
 		return;
 
-	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
+	guard(rcu)();
+	dom_mba = get_sc_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
@@ -794,9 +803,19 @@ void cqm_handle_limbo(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	struct rdt_l3_mon_domain *d;
 
-	cpus_read_lock();
+	/*
+	 * Safe to run without CPU hotplug lock. Work is guaranteed to be
+	 * canceled before the domain structure is removed.
+	 */
 	mutex_lock(&rdtgroup_mutex);
 
+	/*
+	 * Ensure the worker is dedicated to a CPU as intended and not
+	 * relocated by workqueue subsystem as part of CPU going offline.
+	 */
+	if (!is_percpu_thread())
+		goto out_unlock;
+
 	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
@@ -808,8 +827,8 @@ void cqm_handle_limbo(struct work_struct *work)
 					 delay);
 	}
 
+out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 }
 
 /**
@@ -841,7 +860,10 @@ void mbm_handle_overflow(struct work_struct *work)
 	struct list_head *head;
 	struct rdt_resource *r;
 
-	cpus_read_lock();
+	/*
+	 * Safe to run without CPU hotplug lock. Work is guaranteed to be
+	 * canceled before the domain structure is removed.
+	 */
 	mutex_lock(&rdtgroup_mutex);
 
 	/*
@@ -851,6 +873,17 @@ void mbm_handle_overflow(struct work_struct *work)
 	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		goto out_unlock;
 
+	/*
+	 * Ensure the worker is dedicated to a CPU and not relocated by
+	 * workqueue subsystem as part of CPU going offline since reading
+	 * events depend on smp_processor_id(). After passing this check
+	 * smp_processor_id() is valid for entire duration of this worker
+	 * since it runs with rdtgroup_mutex held and the offline handler needs
+	 * rdtgroup_mutex to offline the CPU being run on here.
+	 */
+	if (!is_percpu_thread())
+		goto out_unlock;
+
 	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
 	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
 
@@ -875,7 +908,6 @@ void mbm_handle_overflow(struct work_struct *work)
 
 out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 }
 
 /**
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 6601b138ac7a..9281c5a71063 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4493,6 +4493,29 @@ static void domain_destroy_l3_mon_state(struct rdt_l3_mon_domain *d)
 
 void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	/*
+	 * mbm_handle_overflow() may dereference this ctrl domain via
+	 * update_mba_bw()->get_sc_ctrl_domain_from_cpu(). The architecture has
+	 * unlinked the domain from the RCU list and waited a grace period, so
+	 * no new worker iteration can find it; drain any worker that already
+	 * holds a pointer to it before the architecture frees the domain.
+	 *
+	 * Software controller is enabled/disabled on mount/unmount with
+	 * cpus_read_lock() held. Running here with cpus_write_lock() so
+	 * there are no concurrent changes to software controller status.
+	 */
+	if (r->rid == RDT_RESOURCE_MBA && is_mba_sc(r)) {
+		struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+		struct rdt_l3_mon_domain *mon_d;
+
+		list_for_each_entry(mon_d, &l3->mon_domains, hdr.list) {
+			if (mon_d->hdr.id == d->hdr.id) {
+				cancel_delayed_work_sync(&mon_d->mbm_over);
+				break;
+			}
+		}
+	}
+
 	mutex_lock(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
@@ -4505,6 +4528,24 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 {
 	struct rdt_l3_mon_domain *d;
 
+	/*
+	 * Called by architecture under CPU hotplug lock as it prepares to remove
+	 * the domain which is guaranteed to be accessible here.
+	 * The domain has been unlinked from the RCU list and a grace period
+	 * has elapsed, so no new worker can be scheduled. Drain any worker that
+	 * is in flight or pending before letting architecture proceed to free
+	 * the domain that has the workers' struct delayed_work embedded.
+	 * Do so before taking rdtgroup_mutex since the workers also acquire it.
+	 */
+	if (r->rid == RDT_RESOURCE_L3 &&
+	    domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) {
+		d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
+		if (resctrl_is_mbm_enabled())
+			cancel_delayed_work_sync(&d->mbm_over);
+		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
+			cancel_delayed_work_sync(&d->cqm_limbo);
+	}
+
 	mutex_lock(&rdtgroup_mutex);
 
 	/*
@@ -4521,8 +4562,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		goto out_unlock;
 
 	d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
-	if (resctrl_is_mbm_enabled())
-		cancel_delayed_work(&d->mbm_over);
 	if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
 		/*
 		 * When a package is going down, forcefully
@@ -4533,7 +4572,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		 * package never comes back.
 		 */
 		__check_limbo(d, true);
-		cancel_delayed_work(&d->cqm_limbo);
 	}
 
 	domain_destroy_l3_mon_state(d);
@@ -4714,12 +4752,16 @@ void resctrl_offline_cpu(unsigned int cpu)
 	d = get_mon_domain_from_cpu(cpu, l3);
 	if (d) {
 		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
-			cancel_delayed_work(&d->mbm_over);
+			mutex_unlock(&rdtgroup_mutex);
+			cancel_delayed_work_sync(&d->mbm_over);
+			mutex_lock(&rdtgroup_mutex);
 			mbm_setup_overflow_handler(d, 0, cpu);
 		}
 		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
 		    cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
-			cancel_delayed_work(&d->cqm_limbo);
+			mutex_unlock(&rdtgroup_mutex);
+			cancel_delayed_work_sync(&d->cqm_limbo);
+			mutex_lock(&rdtgroup_mutex);
 			cqm_setup_limbo_handler(d, 0, cpu);
 		}
 	}
-- 
2.50.1
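The unlock, cancel, relock sequence added to resctrl_offline_cpu() exists because the workers take rdtgroup_mutex themselves: waiting for a worker while holding the mutex it needs would never finish. A small user-space model of that ordering with POSIX threads is sketched below. The names demo_mutex, demo_polls, demo_worker() and demo_offline() are hypothetical, and pthread_join() is only a stand-in for cancel_delayed_work_sync(): it always waits for the thread to finish, whereas the kernel call may also cancel work that has not started yet.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for rdtgroup_mutex. */
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static int demo_polls;

/* Worker: needs the mutex to make progress, like the resctrl workers. */
static void *demo_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&demo_mutex);
	demo_polls++;
	pthread_mutex_unlock(&demo_mutex);
	return NULL;
}

/*
 * Offline path: runs with the mutex held, then has to wait for the worker.
 * Returns the number of worker runs observed, or -1 if no thread was created.
 */
static int demo_offline(void)
{
	pthread_t worker;
	int rc, seen;

	pthread_mutex_lock(&demo_mutex);
	if (pthread_create(&worker, NULL, demo_worker, NULL) != 0) {
		pthread_mutex_unlock(&demo_mutex);
		return -1;
	}

	/*
	 * Joining here with demo_mutex still held would deadlock: the worker
	 * is blocked on the very mutex this thread owns. Drop it first.
	 */
	pthread_mutex_unlock(&demo_mutex);
	rc = pthread_join(worker, NULL);	/* models cancel_delayed_work_sync() */
	assert(rc == 0);
	(void)rc;
	pthread_mutex_lock(&demo_mutex);

	/* The worker is gone; state it touched can be inspected or re-armed. */
	seen = demo_polls;
	pthread_mutex_unlock(&demo_mutex);
	return seen;
}
```

Built with -pthread, a single call to demo_offline() returns once the worker has finished its one pass. Between the unlock and the relock other threads may take the mutex, so anything read before the unlock has to be treated as stale afterwards.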