[PATCH v6 37/42] x86/restrl: Expand the width of dom_id by replacing mon_data_bits

James Morse posted 42 patches 1 year ago
There is a newer version of this series
[PATCH v6 37/42] x86/restrl: Expand the width of dom_id by replacing mon_data_bits
Posted by James Morse 1 year ago
MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
the domain-id, and is packed into the mon_data_bits union bitfield.
The width of cache-id in this field is 14 bits.

Expanding the union would break 32bit x86 platforms as this union is
stored as the kernfs kn->priv pointer. This saved allocating memory
for the priv data storage.

The firmware on MPAM platforms have used the PPTT cache-id field to
expose the interconnect's id for the cache, which is sparse and uses
more than 14 bits. Use of this id is to enable PCIe direct cache
injection hints. Using this feature with VFIO means the value provided
by the ACPI table should be exposed to user-space.

To support cache-id values greater than 14 bits, convert the
mon_data_bits union to a structure. This is allocated when the kernfs
file is created, and free'd when the monitor directory is rmdir'd.
Readers and writers must hold the rdtgroup_mutex, and readers should
check for a NULL pointer to protect against an open file preventing
the kernfs file from being free'd immediately after the rmdir call.

Signed-off-by: James Morse <james.morse@arm.com>
---
Previously the MPAM tree repainted the cache-id to compact them,
argue-ing there was no other user. With VFIO use of this PCIe feature,
this is no longer an option:
http://inbox.dpdk.org/dev/PH7PR12MB8596BF09963460CEAE17582E82522@PH7PR12MB8596.namprd12.prod.outlook.com/
---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +++++++-----
 arch/x86/kernel/cpu/resctrl/internal.h    | 37 +++++++++++------------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 28 ++++++++++++-----
 3 files changed, 50 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 032a585293af..0b475e274483 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -645,7 +645,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	union mon_data_bits md;
+	struct mon_data *md;
 	int ret = 0;
 
 	rdtgrp = rdtgroup_kn_lock_live(of->kn);
@@ -654,17 +654,22 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		goto out;
 	}
 
-	md.priv = of->kn->priv;
-	resid = md.u.rid;
-	domid = md.u.domid;
-	evtid = md.u.evtid;
+	md = of->kn->priv;
+	if (!md) {
+		ret = -EIO;
+		goto out;
+	}
+
+	resid = md->rid;
+	domid = md->domid;
+	evtid = md->evtid;
 	r = resctrl_arch_get_resource(resid);
 
-	if (md.u.sum) {
+	if (md->sum) {
 		/*
 		 * This file requires summing across all domains that share
 		 * the L3 cache id that was provided in the "domid" field of the
-		 * mon_data_bits union. Search all domains in the resource for
+		 * struct mon_data. Search all domains in the resource for
 		 * one that matches this cache id.
 		 */
 		list_for_each_entry(d, &r->mon_domains, hdr.list) {
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 32ed9aeffb90..16c1a391d012 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -103,27 +103,24 @@ struct mon_evt {
 };
 
 /**
- * union mon_data_bits - Monitoring details for each event file.
- * @priv:              Used to store monitoring event data in @u
- *                     as kernfs private data.
- * @u.rid:             Resource id associated with the event file.
- * @u.evtid:           Event id associated with the event file.
- * @u.sum:             Set when event must be summed across multiple
- *                     domains.
- * @u.domid:           When @u.sum is zero this is the domain to which
- *                     the event file belongs. When @sum is one this
- *                     is the id of the L3 cache that all domains to be
- *                     summed share.
- * @u:                 Name of the bit fields struct.
+ * struct mon_data - Monitoring details for each event file.
+ * @rid:             Resource id associated with the event file.
+ * @evtid:           Event id associated with the event file.
+ * @sum:             Set when event must be summed across multiple
+ *                   domains.
+ * @domid:           When @sum is zero this is the domain to which
+ *                   the event file belongs. When @sum is one this
+ *                   is the id of the L3 cache that all domains to be
+ *                   summed share.
+ *
+ * Stored in the kernfs kn->priv field, readers and writers must hold
+ * rdtgroup_mutex.
  */
-union mon_data_bits {
-	void *priv;
-	struct {
-		unsigned int rid		: 10;
-		enum resctrl_event_id evtid	: 7;
-		unsigned int sum		: 1;
-		unsigned int domid		: 14;
-	} u;
+struct mon_data {
+	unsigned int rid;
+	enum resctrl_event_id evtid;
+	unsigned int sum;
+	unsigned int domid;
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6832ae603db3..abebe01447ba 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3113,11 +3113,19 @@ static struct file_system_type rdt_fs_type = {
 };
 
 static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
-		       void *priv)
+		       struct mon_data *_priv)
 {
 	struct kernfs_node *kn;
+	struct mon_data *priv;
 	int ret = 0;
 
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+	memcpy(priv, _priv, sizeof(*priv));
+
 	kn = __kernfs_create_file(parent_kn, name, 0444,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
 				  &kf_mondata_ops, priv, NULL, NULL);
@@ -3137,9 +3145,15 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
 {
 	struct kernfs_node *kn;
 
+	lockdep_assert_held(&rdtgroup_mutex);
+
 	kn = kernfs_find_and_get(pkn, name);
 	if (!kn)
 		return;
+
+	kfree(kn->priv);
+	kn->priv = NULL;
+
 	kernfs_put(kn);
 
 	if (kn->dir.subdirs <= 1)
@@ -3180,19 +3194,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
 			     bool do_sum)
 {
 	struct rmid_read rr = {0};
-	union mon_data_bits priv;
+	struct mon_data priv;
 	struct mon_evt *mevt;
 	int ret;
 
 	if (WARN_ON(list_empty(&r->evt_list)))
 		return -EPERM;
 
-	priv.u.rid = r->rid;
-	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
-	priv.u.sum = do_sum;
+	priv.rid = r->rid;
+	priv.domid = do_sum ? d->ci->id : d->hdr.id;
+	priv.sum = do_sum;
 	list_for_each_entry(mevt, &r->evt_list, list) {
-		priv.u.evtid = mevt->evtid;
-		ret = mon_addfile(kn, mevt->name, priv.priv);
+		priv.evtid = mevt->evtid;
+		ret = mon_addfile(kn, mevt->name, &priv);
 		if (ret)
 			return ret;
 
-- 
2.39.2
Re: [PATCH v6 37/42] x86/restrl: Expand the width of dom_id by replacing mon_data_bits
Posted by Reinette Chatre 11 months, 3 weeks ago
Hi James,

On 2/7/25 10:18 AM, James Morse wrote:
> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
> the domain-id, and is packed into the mon_data_bits union bitfield.
> The width of cache-id in this field is 14 bits.
> 
> Expanding the union would break 32bit x86 platforms as this union is
> stored as the kernfs kn->priv pointer. This saved allocating memory
> for the priv data storage.
> 
> The firmware on MPAM platforms have used the PPTT cache-id field to
> expose the interconnect's id for the cache, which is sparse and uses
> more than 14 bits. Use of this id is to enable PCIe direct cache
> injection hints. Using this feature with VFIO means the value provided
> by the ACPI table should be exposed to user-space.
> 
> To support cache-id values greater than 14 bits, convert the
> mon_data_bits union to a structure. This is allocated when the kernfs
> file is created, and free'd when the monitor directory is rmdir'd.
> Readers and writers must hold the rdtgroup_mutex, and readers should
> check for a NULL pointer to protect against an open file preventing
> the kernfs file from being free'd immediately after the rmdir call.

The last sentence is difficult to parse and took me many reads. I see
two major parts to this statement and if I understand correctly the current
implementation combined with this patch does not support either.
(a) "checking for a NULL pointer from readers"
    The reader is rdtgroup_mondata_show() and it starts by calling:
		rdtgrp = rdtgroup_kn_lock_live(of->kn);
    As I understand, on return of rdtgroup_kn_lock_live() the kernfs node
    "of->kn" may no longer exist. This seems to be an issue with current code
    also.
    Considering this, it seems to me that checking if of->kn->priv is NULL
    may be futile if of->kn may no longer exist.
    I think this also needs a reference to the data needed by the file or
    the data needs to be stashed away before the call to
    kernfs_break_active_protection().
(b) "...being free'd immediately after the rmdir call"
    I believe this refers to expectation that one task may have the file open
    while another removes the resource group directory ("rmdir") with the
    assumption that the associated struct mon_data is removed during handling
    of rmdir. In this implementation the monitoring data file's struct mon_data
    is only removed when a monitoring domain goes offline. That is, when the
    resource group remains intact while the monitoring data files associated with
    one domain is removed (for example when all CPUs associated with that domain
    goes offline). The "rmdir" to remove a resource group does not call this code
    (mon_rmdir_one_subdir()), nor does the cleanup of the default resource group's
    "kn_mondata".  

I am trying to get a handle on the different lifetimes and if I understand
correctly this implementation does not attempt to keep the struct mon_data
accessible as long as the file is open. I do not think I've discovered all the
implications of this yet.

> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Previously the MPAM tree repainted the cache-id to compact them,
> argue-ing there was no other user. With VFIO use of this PCIe feature,
> this is no longer an option:
> http://inbox.dpdk.org/dev/PH7PR12MB8596BF09963460CEAE17582E82522@PH7PR12MB8596.namprd12.prod.outlook.com/
> ---
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +++++++-----
>  arch/x86/kernel/cpu/resctrl/internal.h    | 37 +++++++++++------------
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 28 ++++++++++++-----
>  3 files changed, 50 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 032a585293af..0b475e274483 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -645,7 +645,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>  	u32 resid, evtid, domid;
>  	struct rdtgroup *rdtgrp;
>  	struct rdt_resource *r;
> -	union mon_data_bits md;
> +	struct mon_data *md;
>  	int ret = 0;
>  
>  	rdtgrp = rdtgroup_kn_lock_live(of->kn);
> @@ -654,17 +654,22 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>  		goto out;
>  	}
>  
> -	md.priv = of->kn->priv;
> -	resid = md.u.rid;
> -	domid = md.u.domid;
> -	evtid = md.u.evtid;
> +	md = of->kn->priv;
> +	if (!md) {
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	resid = md->rid;
> +	domid = md->domid;
> +	evtid = md->evtid;
>  	r = resctrl_arch_get_resource(resid);
>  
> -	if (md.u.sum) {
> +	if (md->sum) {
>  		/*
>  		 * This file requires summing across all domains that share
>  		 * the L3 cache id that was provided in the "domid" field of the
> -		 * mon_data_bits union. Search all domains in the resource for
> +		 * struct mon_data. Search all domains in the resource for
>  		 * one that matches this cache id.
>  		 */
>  		list_for_each_entry(d, &r->mon_domains, hdr.list) {
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 32ed9aeffb90..16c1a391d012 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -103,27 +103,24 @@ struct mon_evt {
>  };
>  
>  /**
> - * union mon_data_bits - Monitoring details for each event file.
> - * @priv:              Used to store monitoring event data in @u
> - *                     as kernfs private data.
> - * @u.rid:             Resource id associated with the event file.
> - * @u.evtid:           Event id associated with the event file.
> - * @u.sum:             Set when event must be summed across multiple
> - *                     domains.
> - * @u.domid:           When @u.sum is zero this is the domain to which
> - *                     the event file belongs. When @sum is one this
> - *                     is the id of the L3 cache that all domains to be
> - *                     summed share.
> - * @u:                 Name of the bit fields struct.
> + * struct mon_data - Monitoring details for each event file.
> + * @rid:             Resource id associated with the event file.
> + * @evtid:           Event id associated with the event file.
> + * @sum:             Set when event must be summed across multiple
> + *                   domains.
> + * @domid:           When @sum is zero this is the domain to which
> + *                   the event file belongs. When @sum is one this
> + *                   is the id of the L3 cache that all domains to be
> + *                   summed share.
> + *
> + * Stored in the kernfs kn->priv field, readers and writers must hold
> + * rdtgroup_mutex.
>   */
> -union mon_data_bits {
> -	void *priv;
> -	struct {
> -		unsigned int rid		: 10;
> -		enum resctrl_event_id evtid	: 7;
> -		unsigned int sum		: 1;
> -		unsigned int domid		: 14;
> -	} u;
> +struct mon_data {
> +	unsigned int rid;
> +	enum resctrl_event_id evtid;
> +	unsigned int sum;
> +	unsigned int domid;
>  };
>  
>  /**
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6832ae603db3..abebe01447ba 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3113,11 +3113,19 @@ static struct file_system_type rdt_fs_type = {
>  };
>  
>  static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> -		       void *priv)
> +		       struct mon_data *_priv)
>  {
>  	struct kernfs_node *kn;
> +	struct mon_data *priv;
>  	int ret = 0;
>  
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +	memcpy(priv, _priv, sizeof(*priv));
> +
>  	kn = __kernfs_create_file(parent_kn, name, 0444,
>  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
>  				  &kf_mondata_ops, priv, NULL, NULL);
> @@ -3137,9 +3145,15 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
>  {
>  	struct kernfs_node *kn;
>  
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
>  	kn = kernfs_find_and_get(pkn, name);
>  	if (!kn)
>  		return;
> +
> +	kfree(kn->priv);
> +	kn->priv = NULL;
> +
>  	kernfs_put(kn);
>  
>  	if (kn->dir.subdirs <= 1)
> @@ -3180,19 +3194,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
>  			     bool do_sum)
>  {
>  	struct rmid_read rr = {0};
> -	union mon_data_bits priv;
> +	struct mon_data priv;
>  	struct mon_evt *mevt;
>  	int ret;
>  
>  	if (WARN_ON(list_empty(&r->evt_list)))
>  		return -EPERM;
>  
> -	priv.u.rid = r->rid;
> -	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
> -	priv.u.sum = do_sum;
> +	priv.rid = r->rid;
> +	priv.domid = do_sum ? d->ci->id : d->hdr.id;
> +	priv.sum = do_sum;
>  	list_for_each_entry(mevt, &r->evt_list, list) {
> -		priv.u.evtid = mevt->evtid;
> -		ret = mon_addfile(kn, mevt->name, priv.priv);
> +		priv.evtid = mevt->evtid;
> +		ret = mon_addfile(kn, mevt->name, &priv);
>  		if (ret)
>  			return ret;
>  

Reinette
Re: [PATCH v6 37/42] x86/restrl: Expand the width of dom_id by replacing mon_data_bits
Posted by James Morse 11 months, 2 weeks ago
Hi Reinette,

On 20/02/2025 05:40, Reinette Chatre wrote:
> On 2/7/25 10:18 AM, James Morse wrote:
>> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
>> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
>> the domain-id, and is packed into the mon_data_bits union bitfield.
>> The width of cache-id in this field is 14 bits.
>>
>> Expanding the union would break 32bit x86 platforms as this union is
>> stored as the kernfs kn->priv pointer. This saved allocating memory
>> for the priv data storage.
>>
>> The firmware on MPAM platforms have used the PPTT cache-id field to
>> expose the interconnect's id for the cache, which is sparse and uses
>> more than 14 bits. Use of this id is to enable PCIe direct cache
>> injection hints. Using this feature with VFIO means the value provided
>> by the ACPI table should be exposed to user-space.
>>
>> To support cache-id values greater than 14 bits, convert the
>> mon_data_bits union to a structure. This is allocated when the kernfs
>> file is created, and free'd when the monitor directory is rmdir'd.

>> Readers and writers must hold the rdtgroup_mutex, and readers should
>> check for a NULL pointer to protect against an open file preventing
>> the kernfs file from being free'd immediately after the rmdir call.

> The last sentence is difficult to parse and took me many reads. I see
> two major parts to this statement and if I understand correctly the current
> implementation combined with this patch does not support either.
> (a) "checking for a NULL pointer from readers"
>     The reader is rdtgroup_mondata_show() and it starts by calling:
> 		rdtgrp = rdtgroup_kn_lock_live(of->kn);
>     As I understand, on return of rdtgroup_kn_lock_live() the kernfs node
>     "of->kn" may no longer exist. This seems to be an issue with current code
>     also.
>     Considering this, it seems to me that checking if of->kn->priv is NULL
>     may be futile if of->kn may no longer exist.

Certainly true.
Because the lifetime is different to the existing pointer-abuse version, I just added the
checks to be on the safe side.

I'll rip this out.


>     I think this also needs a reference to the data needed by the file or
>     the data needs to be stashed away before the call to
>     kernfs_break_active_protection().

I've tried to hit this problem, and been unable. I'm happy to write it off as theoretical.

In particular:
* rmdir a control group while holding the mbm_local_bytes file open for reading. Any read
after the parent control group has been destroyed gets -ENODEV, even though though
/proc/<pid>/fd shows the fd as open for reading. The kernel in question had lockdep and
kasan enabled)
* take all the CPUs in a domain offline while holding the mbm_local_bytes file open for
reading. Again, read attempts get -ENODEV.


> (b) "...being free'd immediately after the rmdir call"
>     I believe this refers to expectation that one task may have the file open
>     while another removes the resource group directory ("rmdir") with the
>     assumption that the associated struct mon_data is removed during handling
>     of rmdir.

This is what I was worried about - and it seemed worth chucking in a NULL check just in
case. Trying a bit harder to hit it - it now seems theoretical.


>     In this implementation the monitoring data file's struct mon_data
>     is only removed when a monitoring domain goes offline.

>     That is, when the
>     resource group remains intact while the monitoring data files associated with
>     one domain is removed (for example when all CPUs associated with that domain
>     goes offline). The "rmdir" to remove a resource group does not call this code
>     (mon_rmdir_one_subdir()), nor does the cleanup of the default resource group's
>     "kn_mondata".  

Huh, its the path via user-space calling rmdir() that I was worried about. I hadn't
spotted that there are two of these and they aren't joined up!

This would leak the priv pointer when the user-space path via rmdir() just leaves the
cleanup to kernfs.

Fixing this produces even more spaghetti as domain-offline manipulates one domain in all
rdtgroup, whereas rmdir manipulates all domains in on rdtgroup. Its going to be noisy to
merge these two paths.


A simpler approach is to use the event kn->priv pointers in the default control group as
the canonical copy, which also saves memory. For mbm_total in a domain, every control and
monitor group has the same values in struct mon_data_bits - the RMID is found by walking
up the tree to find the struct rdtgroup.
As user-space can't rmdir the default control group, we only need to free it for
domain-offline, when we know all the files for that domain are going to be removed - which
we can rely on to avoid doing it in a particular order.


> I am trying to get a handle on the different lifetimes and if I understand
> correctly this implementation does not attempt to keep the struct mon_data
> accessible as long as the file is open.

No, but I think that concern is theoretical...

> I do not think I've discovered all the implications of this yet.


Thanks,

James