MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
the domain-id, and is packed into the mon_data_bits union bitfield.
The width of cache-id in this field is 14 bits.
Expanding the union would break 32bit x86 platforms as this union is
stored as the kernfs kn->priv pointer. This saved allocating memory
for the priv data storage.
The firmware on MPAM platforms have used the PPTT cache-id field to
expose the interconnect's id for the cache, which is sparse and uses
more than 14 bits. Use of this id is to enable PCIe direct cache
injection hints. Using this feature with VFIO means the value provided
by the ACPI table should be exposed to user-space.
To support cache-id values greater than 14 bits, convert the
mon_data_bits union to a structure. This is allocated for the default
control group when the kernfs event files are created, and free'd when
the monitor directory is rmdir'd when the domain goes offline.
All other control and monitor groups lookup the struct mon_data allocated
for the default control group, and use this.
This simplifies the lifecycle of this structure as the default control
group cannot be rmdir()d by user-space, so only needs to consider
domain-offline, which removes all the event files corresponding to a
domain while holding rdtgroup_mutex - which prevents concurrent
readers. mkdir_mondata_subdir_allrdtgrp() must special case the default
control group to ensure it is created first.
Signed-off-by: James Morse <james.morse@arm.com>
---
Previously the MPAM tree repainted the cache-id to compact them,
argue-ing there was no other user. With VFIO use of this PCIe feature,
this is no longer an option.
Changes since v6:
* Added the get/put helpers.
* Special case the creation of the mondata files for the default control
group.
* Removed wording about files living longer than expected, the corresponding
error handling is wrapped in WARN_ON_ONCE() as this indicates a bug.
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +--
arch/x86/kernel/cpu/resctrl/internal.h | 37 +++---
arch/x86/kernel/cpu/resctrl/monitor.c | 3 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 142 ++++++++++++++++++++--
4 files changed, 162 insertions(+), 39 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0a0ac5f6112e..159972c3fe73 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -667,7 +667,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
- union mon_data_bits md;
+ struct mon_data *md;
int ret = 0;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
@@ -676,17 +676,22 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
goto out;
}
- md.priv = of->kn->priv;
- resid = md.u.rid;
- domid = md.u.domid;
- evtid = md.u.evtid;
+ md = of->kn->priv;
+ if (WARN_ON_ONCE(!md)) {
+ ret = -EIO;
+ goto out;
+ }
+
+ resid = md->rid;
+ domid = md->domid;
+ evtid = md->evtid;
r = resctrl_arch_get_resource(resid);
- if (md.u.sum) {
+ if (md->sum) {
/*
* This file requires summing across all domains that share
* the L3 cache id that was provided in the "domid" field of the
- * mon_data_bits union. Search all domains in the resource for
+ * struct mon_data. Search all domains in the resource for
* one that matches this cache id.
*/
list_for_each_entry(d, &r->mon_domains, hdr.list) {
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 32ed9aeffb90..16c1a391d012 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -103,27 +103,24 @@ struct mon_evt {
};
/**
- * union mon_data_bits - Monitoring details for each event file.
- * @priv: Used to store monitoring event data in @u
- * as kernfs private data.
- * @u.rid: Resource id associated with the event file.
- * @u.evtid: Event id associated with the event file.
- * @u.sum: Set when event must be summed across multiple
- * domains.
- * @u.domid: When @u.sum is zero this is the domain to which
- * the event file belongs. When @sum is one this
- * is the id of the L3 cache that all domains to be
- * summed share.
- * @u: Name of the bit fields struct.
+ * struct mon_data - Monitoring details for each event file.
+ * @rid: Resource id associated with the event file.
+ * @evtid: Event id associated with the event file.
+ * @sum: Set when event must be summed across multiple
+ * domains.
+ * @domid: When @sum is zero this is the domain to which
+ * the event file belongs. When @sum is one this
+ * is the id of the L3 cache that all domains to be
+ * summed share.
+ *
+ * Stored in the kernfs kn->priv field, readers and writers must hold
+ * rdtgroup_mutex.
*/
-union mon_data_bits {
- void *priv;
- struct {
- unsigned int rid : 10;
- enum resctrl_event_id evtid : 7;
- unsigned int sum : 1;
- unsigned int domid : 14;
- } u;
+struct mon_data {
+ unsigned int rid;
+ enum resctrl_event_id evtid;
+ unsigned int sum;
+ unsigned int domid;
};
/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 73e3fe4f4c87..5bf67b429dd3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1096,6 +1096,9 @@ static struct mon_evt mbm_local_event = {
* Note that MBM events are also part of RDT_RESOURCE_L3 resource
* because as per the SDM the total and local memory bandwidth
* are enumerated as part of L3 monitoring.
+ *
+ * mon_put_default_kn_priv_all() also assumes monitor events are only supported
+ * on the L3 resource.
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aecd3fa734cd..443635d195f0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3114,6 +3114,110 @@ static struct file_system_type rdt_fs_type = {
.kill_sb = rdt_kill_sb,
};
+/**
+ * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
+ * the default control group.
+ * Called when monitor event files are created for a domain.
+ * When called with the default control group, the structure will be allocated.
+ * This happens at mount time, before other control or monitor groups are
+ * created.
+ * This simplifies the lifetime management for rmdir() versus domain-offline
+ * as the default control group lives forever, and only one group needs to be
+ * special cased.
+ *
+ * @r: The resource for the event type being created.
+ * @d: The domain for the event type being created.
+ * @mevt: The event type being created.
+ * @rdtgrp: The rdtgroup for which the monitor file is being created,
+ * used to determine if this is the default control group.
+ * @do_sum: Whether the SNC sub-numa node monitors are being created.
+ */
+static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
+ struct rdt_mon_domain *d,
+ struct mon_evt *mevt,
+ struct rdtgroup *rdtgrp,
+ bool do_sum)
+{
+ struct kernfs_node *kn_dom, *kn_evt;
+ struct mon_data *priv;
+ bool snc_mode;
+ char name[32];
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ snc_mode = r->mon_scope == RESCTRL_L3_NODE;
+ if (!do_sum)
+ sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
+ else
+ sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+
+ kn_dom = kernfs_find_and_get(kn_mondata, name);
+ if (!kn_dom)
+ return NULL;
+
+ kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
+
+ /* Is this the creation of the default groups monitor files? */
+ if (!kn_evt && rdtgrp == &rdtgroup_default) {
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return NULL;
+ priv->rid = r->rid;
+ priv->domid = do_sum ? d->ci->id : d->hdr.id;
+ priv->sum = do_sum;
+ priv->evtid = mevt->evtid;
+ return priv;
+ }
+
+ if (!kn_evt)
+ return NULL;
+
+ return kn_evt->priv;
+}
+
+/**
+ * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
+ * all events from the default control group.
+ * Put the mon_data priv data for all events for a particular domain.
+ * When called with the default control group, the priv structure previously
+ * allocated will be kfree()d. This should only be done as part of taking a
+ * domain offline.
+ * Only a domain offline will 'rmdir' monitor files in the default control
+ * group. After domain offline releases rdtgrp_mutex, all references will
+ * have been removed.
+ *
+ * @rdtgrp: The rdtgroup for which the monitor files are being removed,
+ * used to determine if this is the default control group.
+ * @name: The name of the domain or SNC sub-numa domain which is being
+ * taken offline.
+ */
+static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
+{
+ struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+ struct kernfs_node *kn_dom, *kn_evt;
+ struct mon_evt *mevt;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (rdtgrp != &rdtgroup_default)
+ return;
+
+ kn_dom = kernfs_find_and_get(kn_mondata, name);
+ if (!kn_dom)
+ return;
+
+ list_for_each_entry(mevt, &r->evt_list, list) {
+ kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
+ if (!kn_evt)
+ continue;
+ if (!kn_evt->priv)
+ continue;
+
+ kfree(kn_evt->priv);
+ kn_evt->priv = NULL;
+ }
+}
+
static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
void *priv)
{
@@ -3135,19 +3239,25 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
return ret;
}
-static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subname)
+static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subname)
{
+ struct kernfs_node *pkn = rdtgrp->mon.mon_data_kn;
struct kernfs_node *kn;
kn = kernfs_find_and_get(pkn, name);
if (!kn)
return;
+
+ mon_put_default_kn_priv_all(rdtgrp, name);
+
kernfs_put(kn);
- if (kn->dir.subdirs <= 1)
+ if (kn->dir.subdirs <= 1) {
kernfs_remove(kn);
- else
+ } else {
+ mon_put_default_kn_priv_all(rdtgrp, subname);
kernfs_remove_by_name(kn, subname);
+ }
}
/*
@@ -3170,10 +3280,10 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
- mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
+ mon_rmdir_one_subdir(prgrp, name, subname);
list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
- mon_rmdir_one_subdir(crgrp->mon.mon_data_kn, name, subname);
+ mon_rmdir_one_subdir(crgrp, name, subname);
}
}
@@ -3182,19 +3292,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
bool do_sum)
{
struct rmid_read rr = {0};
- union mon_data_bits priv;
+ struct mon_data *priv;
struct mon_evt *mevt;
int ret;
if (WARN_ON(list_empty(&r->evt_list)))
return -EPERM;
- priv.u.rid = r->rid;
- priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
- priv.u.sum = do_sum;
list_for_each_entry(mevt, &r->evt_list, list) {
- priv.u.evtid = mevt->evtid;
- ret = mon_addfile(kn, mevt->name, priv.priv);
+ priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
+ if (WARN_ON_ONCE(!priv))
+ return -EINVAL;
+
+ ret = mon_addfile(kn, mevt->name, priv);
if (ret)
return ret;
@@ -3274,9 +3384,17 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdtgroup *prgrp, *crgrp;
struct list_head *head;
+ /*
+ * During domain-online create the default control group first
+ * so that mon_get_default_kn_priv() can find the allocated structure
+ * on subsequent calls.
+ */
+ mkdir_mondata_subdir(kn_mondata, d, r, &rdtgroup_default);
+
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ if (prgrp != &rdtgroup_default)
+ mkdir_mondata_subdir(parent_kn, d, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
--
2.39.5
Hi James,
> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
> the domain-id, and is packed into the mon_data_bits union bitfield.
> The width of cache-id in this field is 14 bits.
>
> Expanding the union would break 32bit x86 platforms as this union is
> stored as the kernfs kn->priv pointer. This saved allocating memory
> for the priv data storage.
>
> The firmware on MPAM platforms have used the PPTT cache-id field to
> expose the interconnect's id for the cache, which is sparse and uses
> more than 14 bits. Use of this id is to enable PCIe direct cache
> injection hints. Using this feature with VFIO means the value provided
> by the ACPI table should be exposed to user-space.
>
> To support cache-id values greater than 14 bits, convert the
> mon_data_bits union to a structure. This is allocated for the default
> control group when the kernfs event files are created, and free'd when
> the monitor directory is rmdir'd when the domain goes offline.
> All other control and monitor groups lookup the struct mon_data allocated
> for the default control group, and use this.
> This simplifies the lifecycle of this structure as the default control
> group cannot be rmdir()d by user-space, so only needs to consider
> domain-offline, which removes all the event files corresponding to a
> domain while holding rdtgroup_mutex - which prevents concurrent
> readers. mkdir_mondata_subdir_allrdtgrp() must special case the default
> control group to ensure it is created first.
>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Previously the MPAM tree repainted the cache-id to compact them,
> argue-ing there was no other user. With VFIO use of this PCIe feature,
> this is no longer an option.
>
> Changes since v6:
> * Added the get/put helpers.
> * Special case the creation of the mondata files for the default control
> group.
> * Removed wording about files living longer than expected, the corresponding
> error handling is wrapped in WARN_ON_ONCE() as this indicates a bug.
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +--
> arch/x86/kernel/cpu/resctrl/internal.h | 37 +++---
> arch/x86/kernel/cpu/resctrl/monitor.c | 3 +
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 142 ++++++++++++++++++++--
> 4 files changed, 162 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 0a0ac5f6112e..159972c3fe73 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -667,7 +667,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> u32 resid, evtid, domid;
> struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
> - union mon_data_bits md;
> + struct mon_data *md;
> int ret = 0;
>
> rdtgrp = rdtgroup_kn_lock_live(of->kn);
> @@ -676,17 +676,22 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> goto out;
> }
>
> - md.priv = of->kn->priv;
> - resid = md.u.rid;
> - domid = md.u.domid;
> - evtid = md.u.evtid;
> + md = of->kn->priv;
> + if (WARN_ON_ONCE(!md)) {
> + ret = -EIO;
> + goto out;
> + }
> +
> + resid = md->rid;
> + domid = md->domid;
> + evtid = md->evtid;
> r = resctrl_arch_get_resource(resid);
>
> - if (md.u.sum) {
> + if (md->sum) {
> /*
> * This file requires summing across all domains that share
> * the L3 cache id that was provided in the "domid" field of the
> - * mon_data_bits union. Search all domains in the resource for
> + * struct mon_data. Search all domains in the resource for
> * one that matches this cache id.
> */
> list_for_each_entry(d, &r->mon_domains, hdr.list) {
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 32ed9aeffb90..16c1a391d012 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -103,27 +103,24 @@ struct mon_evt {
> };
>
> /**
> - * union mon_data_bits - Monitoring details for each event file.
> - * @priv: Used to store monitoring event data in @u
> - * as kernfs private data.
> - * @u.rid: Resource id associated with the event file.
> - * @u.evtid: Event id associated with the event file.
> - * @u.sum: Set when event must be summed across multiple
> - * domains.
> - * @u.domid: When @u.sum is zero this is the domain to which
> - * the event file belongs. When @sum is one this
> - * is the id of the L3 cache that all domains to be
> - * summed share.
> - * @u: Name of the bit fields struct.
> + * struct mon_data - Monitoring details for each event file.
> + * @rid: Resource id associated with the event file.
> + * @evtid: Event id associated with the event file.
> + * @sum: Set when event must be summed across multiple
> + * domains.
> + * @domid: When @sum is zero this is the domain to which
> + * the event file belongs. When @sum is one this
> + * is the id of the L3 cache that all domains to be
> + * summed share.
> + *
> + * Stored in the kernfs kn->priv field, readers and writers must hold
> + * rdtgroup_mutex.
> */
> -union mon_data_bits {
> - void *priv;
> - struct {
> - unsigned int rid : 10;
> - enum resctrl_event_id evtid : 7;
> - unsigned int sum : 1;
> - unsigned int domid : 14;
> - } u;
> +struct mon_data {
> + unsigned int rid;
> + enum resctrl_event_id evtid;
> + unsigned int sum;
> + unsigned int domid;
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 73e3fe4f4c87..5bf67b429dd3 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1096,6 +1096,9 @@ static struct mon_evt mbm_local_event = {
> * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> * because as per the SDM the total and local memory bandwidth
> * are enumerated as part of L3 monitoring.
> + *
> + * mon_put_default_kn_priv_all() also assumes monitor events are only supported
> + * on the L3 resource.
> */
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index aecd3fa734cd..443635d195f0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3114,6 +3114,110 @@ static struct file_system_type rdt_fs_type = {
> .kill_sb = rdt_kill_sb,
> };
>
> +/**
> + * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
> + * the default control group.
> + * Called when monitor event files are created for a domain.
> + * When called with the default control group, the structure will be allocated.
> + * This happens at mount time, before other control or monitor groups are
> + * created.
> + * This simplifies the lifetime management for rmdir() versus domain-offline
> + * as the default control group lives forever, and only one group needs to be
> + * special cased.
> + *
> + * @r: The resource for the event type being created.
> + * @d: The domain for the event type being created.
> + * @mevt: The event type being created.
> + * @rdtgrp: The rdtgroup for which the monitor file is being created,
> + * used to determine if this is the default control group.
> + * @do_sum: Whether the SNC sub-numa node monitors are being created.
> + */
> +static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
> + struct rdt_mon_domain *d,
> + struct mon_evt *mevt,
> + struct rdtgroup *rdtgrp,
> + bool do_sum)
> +{
> + struct kernfs_node *kn_dom, *kn_evt;
> + struct mon_data *priv;
> + bool snc_mode;
> + char name[32];
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> + if (!do_sum)
> + sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
This change triggered a minor report during compilation.
fs/resctrl/rdtgroup.c: In function ‘mon_get_default_kn_priv’:
fs/resctrl/rdtgroup.c:2931:28: warning: format ‘%d’ expects argument of
type ‘int’, but argument 4 has type ‘long unsigned int’ [-Wformat=]
2931 | sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id :
d->hdr.id);
| ~~~^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| | |
| int long unsigned int
| %02ld
> + else
> + sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +
> + kn_dom = kernfs_find_and_get(kn_mondata, name);
> + if (!kn_dom)
> + return NULL;
> +
> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
> +
> + /* Is this the creation of the default groups monitor files? */
> + if (!kn_evt && rdtgrp == &rdtgroup_default) {
> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return NULL;
> + priv->rid = r->rid;
> + priv->domid = do_sum ? d->ci->id : d->hdr.id;
> + priv->sum = do_sum;
> + priv->evtid = mevt->evtid;
> + return priv;
> + }
> +
> + if (!kn_evt)
> + return NULL;
> +
> + return kn_evt->priv;
> +}
> +
> +/**
> + * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
> + * all events from the default control group.
> + * Put the mon_data priv data for all events for a particular domain.
> + * When called with the default control group, the priv structure previously
> + * allocated will be kfree()d. This should only be done as part of taking a
> + * domain offline.
> + * Only a domain offline will 'rmdir' monitor files in the default control
> + * group. After domain offline releases rdtgrp_mutex, all references will
> + * have been removed.
> + *
> + * @rdtgrp: The rdtgroup for which the monitor files are being removed,
> + * used to determine if this is the default control group.
> + * @name: The name of the domain or SNC sub-numa domain which is being
> + * taken offline.
> + */
> +static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
> +{
> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> + struct kernfs_node *kn_dom, *kn_evt;
> + struct mon_evt *mevt;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + if (rdtgrp != &rdtgroup_default)
> + return;
> +
> + kn_dom = kernfs_find_and_get(kn_mondata, name);
> + if (!kn_dom)
> + return;
> +
> + list_for_each_entry(mevt, &r->evt_list, list) {
> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
> + if (!kn_evt)
> + continue;
> + if (!kn_evt->priv)
> + continue;
> +
> + kfree(kn_evt->priv);
> + kn_evt->priv = NULL;
> + }
> +}
> +
> static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> void *priv)
> {
> @@ -3135,19 +3239,25 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> return ret;
> }
>
> -static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subname)
> +static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subname)
> {
> + struct kernfs_node *pkn = rdtgrp->mon.mon_data_kn;
> struct kernfs_node *kn;
>
> kn = kernfs_find_and_get(pkn, name);
> if (!kn)
> return;
> +
> + mon_put_default_kn_priv_all(rdtgrp, name);
> +
> kernfs_put(kn);
>
> - if (kn->dir.subdirs <= 1)
> + if (kn->dir.subdirs <= 1) {
> kernfs_remove(kn);
> - else
> + } else {
> + mon_put_default_kn_priv_all(rdtgrp, subname);
> kernfs_remove_by_name(kn, subname);
> + }
> }
>
> /*
> @@ -3170,10 +3280,10 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> - mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> + mon_rmdir_one_subdir(prgrp, name, subname);
>
> list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
> - mon_rmdir_one_subdir(crgrp->mon.mon_data_kn, name, subname);
> + mon_rmdir_one_subdir(crgrp, name, subname);
> }
> }
>
> @@ -3182,19 +3292,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> bool do_sum)
> {
> struct rmid_read rr = {0};
> - union mon_data_bits priv;
> + struct mon_data *priv;
> struct mon_evt *mevt;
> int ret;
>
> if (WARN_ON(list_empty(&r->evt_list)))
> return -EPERM;
>
> - priv.u.rid = r->rid;
> - priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
> - priv.u.sum = do_sum;
> list_for_each_entry(mevt, &r->evt_list, list) {
> - priv.u.evtid = mevt->evtid;
> - ret = mon_addfile(kn, mevt->name, priv.priv);
> + priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
> + if (WARN_ON_ONCE(!priv))
> + return -EINVAL;
> +
> + ret = mon_addfile(kn, mevt->name, priv);
> if (ret)
> return ret;
>
> @@ -3274,9 +3384,17 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdtgroup *prgrp, *crgrp;
> struct list_head *head;
>
> + /*
> + * During domain-online create the default control group first
> + * so that mon_get_default_kn_priv() can find the allocated structure
> + * on subsequent calls.
> + */
> + mkdir_mondata_subdir(kn_mondata, d, r, &rdtgroup_default);
> +
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> parent_kn = prgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> + if (prgrp != &rdtgroup_default)
> + mkdir_mondata_subdir(parent_kn, d, r, prgrp);
>
> head = &prgrp->mon.crdtgrp_list;
> list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
> --
> 2.39.5
>
On 3/1/2025 1:29 AM, James Morse wrote:
> MPAM platforms retrieve the cache-id property from the ACPI PPTT
> table. The cache-id field is 32 bits wide. Under resctrl, the cache-id
> becomes the domain-id, and is packed into the mon_data_bits union
> bitfield. The width of cache-id in this
>
> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
> the domain-id, and is packed into the mon_data_bits union bitfield.
> The width of cache-id in this field is 14 bits.
>
> Expanding the union would break 32bit x86 platforms as this union is
> stored as the kernfs kn->priv pointer. This saved allocating memory
> for the priv data storage.
>
> The firmware on MPAM platforms have used the PPTT cache-id field to
> expose the interconnect's id for the cache, which is sparse and uses
> more than 14 bits. Use of this id is to enable PCIe direct cache
> injection hints. Using this feature with VFIO means the value provided
> by the ACPI table should be exposed to user-space.
>
> To support cache-id values greater than 14 bits, convert the
> mon_data_bits union to a structure. This is allocated for the default
> control group when the kernfs event files are created, and free'd when
> the monitor directory is rmdir'd when the domain goes offline.
> All other control and monitor groups lookup the struct mon_data allocated
> for the default control group, and use this.
> This simplifies the lifecycle of this structure as the default control
> group cannot be rmdir()d by user-space, so only needs to consider
> domain-offline, which removes all the event files corresponding to a
> domain while holding rdtgroup_mutex - which prevents concurrent
> readers. mkdir_mondata_subdir_allrdtgrp() must special case the default
> control group to ensure it is created first.
>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Previously the MPAM tree repainted the cache-id to compact them,
> argue-ing there was no other user. With VFIO use of this PCIe feature,
> this is no longer an option.
>
> Changes since v6:
> * Added the get/put helpers.
> * Special case the creation of the mondata files for the default control
> group.
> * Removed wording about files living longer than expected, the corresponding
> error handling is wrapped in WARN_ON_ONCE() as this indicates a bug.
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +--
> arch/x86/kernel/cpu/resctrl/internal.h | 37 +++---
> arch/x86/kernel/cpu/resctrl/monitor.c | 3 +
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 142 ++++++++++++++++++++--
> 4 files changed, 162 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 0a0ac5f6112e..159972c3fe73 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -667,7 +667,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> u32 resid, evtid, domid;
> struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
> - union mon_data_bits md;
> + struct mon_data *md;
> int ret = 0;
>
> rdtgrp = rdtgroup_kn_lock_live(of->kn);
> @@ -676,17 +676,22 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> goto out;
> }
>
> - md.priv = of->kn->priv;
> - resid = md.u.rid;
> - domid = md.u.domid;
> - evtid = md.u.evtid;
> + md = of->kn->priv;
> + if (WARN_ON_ONCE(!md)) {
> + ret = -EIO;
> + goto out;
> + }
> +
> + resid = md->rid;
> + domid = md->domid;
> + evtid = md->evtid;
> r = resctrl_arch_get_resource(resid);
>
> - if (md.u.sum) {
> + if (md->sum) {
> /*
> * This file requires summing across all domains that share
> * the L3 cache id that was provided in the "domid" field of the
> - * mon_data_bits union. Search all domains in the resource for
> + * struct mon_data. Search all domains in the resource for
> * one that matches this cache id.
> */
> list_for_each_entry(d, &r->mon_domains, hdr.list) {
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 32ed9aeffb90..16c1a391d012 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -103,27 +103,24 @@ struct mon_evt {
> };
>
> /**
> - * union mon_data_bits - Monitoring details for each event file.
> - * @priv: Used to store monitoring event data in @u
> - * as kernfs private data.
> - * @u.rid: Resource id associated with the event file.
> - * @u.evtid: Event id associated with the event file.
> - * @u.sum: Set when event must be summed across multiple
> - * domains.
> - * @u.domid: When @u.sum is zero this is the domain to which
> - * the event file belongs. When @sum is one this
> - * is the id of the L3 cache that all domains to be
> - * summed share.
> - * @u: Name of the bit fields struct.
> + * struct mon_data - Monitoring details for each event file.
> + * @rid: Resource id associated with the event file.
> + * @evtid: Event id associated with the event file.
> + * @sum: Set when event must be summed across multiple
> + * domains.
> + * @domid: When @sum is zero this is the domain to which
> + * the event file belongs. When @sum is one this
> + * is the id of the L3 cache that all domains to be
> + * summed share.
> + *
> + * Stored in the kernfs kn->priv field, readers and writers must hold
> + * rdtgroup_mutex.
> */
> -union mon_data_bits {
> - void *priv;
> - struct {
> - unsigned int rid : 10;
> - enum resctrl_event_id evtid : 7;
> - unsigned int sum : 1;
> - unsigned int domid : 14;
> - } u;
> +struct mon_data {
> + unsigned int rid;
> + enum resctrl_event_id evtid;
> + unsigned int sum;
> + unsigned int domid;
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 73e3fe4f4c87..5bf67b429dd3 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1096,6 +1096,9 @@ static struct mon_evt mbm_local_event = {
> * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> * because as per the SDM the total and local memory bandwidth
> * are enumerated as part of L3 monitoring.
> + *
> + * mon_put_default_kn_priv_all() also assumes monitor events are only supported
> + * on the L3 resource.
> */
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index aecd3fa734cd..443635d195f0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3114,6 +3114,110 @@ static struct file_system_type rdt_fs_type = {
> .kill_sb = rdt_kill_sb,
> };
>
> +/**
> + * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
> + * the default control group.
> + * Called when monitor event files are created for a domain.
> + * When called with the default control group, the structure will be allocated.
> + * This happens at mount time, before other control or monitor groups are
> + * created.
> + * This simplifies the lifetime management for rmdir() versus domain-offline
> + * as the default control group lives forever, and only one group needs to be
> + * special cased.
> + *
> + * @r: The resource for the event type being created.
> + * @d: The domain for the event type being created.
> + * @mevt: The event type being created.
> + * @rdtgrp: The rdtgroup for which the monitor file is being created,
> + * used to determine if this is the default control group.
> + * @do_sum: Whether the SNC sub-numa node monitors are being created.
> + */
> +static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
> + struct rdt_mon_domain *d,
> + struct mon_evt *mevt,
> + struct rdtgroup *rdtgrp,
> + bool do_sum)
> +{
> + struct kernfs_node *kn_dom, *kn_evt;
> + struct mon_data *priv;
> + bool snc_mode;
> + char name[32];
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> + if (!do_sum)
> + sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> + else
> + sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +
> + kn_dom = kernfs_find_and_get(kn_mondata, name);
> + if (!kn_dom)
> + return NULL;
> +
> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
> +
> + /* Is this the creation of the default groups monitor files? */
> + if (!kn_evt && rdtgrp == &rdtgroup_default) {
> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return NULL;
> + priv->rid = r->rid;
> + priv->domid = do_sum ? d->ci->id : d->hdr.id;
> + priv->sum = do_sum;
> + priv->evtid = mevt->evtid;
> + return priv;
> + }
> +
> + if (!kn_evt)
> + return NULL;
> +
> + return kn_evt->priv;
> +}
> +
> +/**
> + * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
> + * all events from the default control group.
> + * Put the mon_data priv data for all events for a particular domain.
> + * When called with the default control group, the priv structure previously
> + * allocated will be kfree()d. This should only be done as part of taking a
> + * domain offline.
> + * Only a domain offline will 'rmdir' monitor files in the default control
> + * group. After domain offline releases rdtgrp_mutex, all references will
> + * have been removed.
> + *
> + * @rdtgrp: The rdtgroup for which the monitor files are being removed,
> + * used to determine if this is the default control group.
> + * @name: The name of the domain or SNC sub-numa domain which is being
> + * taken offline.
> + */
> +static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
> +{
> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> + struct kernfs_node *kn_dom, *kn_evt;
> + struct mon_evt *mevt;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + if (rdtgrp != &rdtgroup_default)
> + return;
> +
> + kn_dom = kernfs_find_and_get(kn_mondata, name);
> + if (!kn_dom)
> + return;
> +
> + list_for_each_entry(mevt, &r->evt_list, list) {
> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
> + if (!kn_evt)
> + continue;
> + if (!kn_evt->priv)
> + continue;
> +
> + kfree(kn_evt->priv);
> + kn_evt->priv = NULL;
> + }
> +}
> +
> static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> void *priv)
> {
> @@ -3135,19 +3239,25 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> return ret;
> }
>
> -static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subname)
> +static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subname)
> {
> + struct kernfs_node *pkn = rdtgrp->mon.mon_data_kn;
> struct kernfs_node *kn;
>
> kn = kernfs_find_and_get(pkn, name);
> if (!kn)
> return;
> +
> + mon_put_default_kn_priv_all(rdtgrp, name);
> +
> kernfs_put(kn);
>
> - if (kn->dir.subdirs <= 1)
> + if (kn->dir.subdirs <= 1) {
> kernfs_remove(kn);
> - else
> + } else {
> + mon_put_default_kn_priv_all(rdtgrp, subname);
> kernfs_remove_by_name(kn, subname);
> + }
> }
>
> /*
> @@ -3170,10 +3280,10 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> - mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> + mon_rmdir_one_subdir(prgrp, name, subname);
>
> list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
> - mon_rmdir_one_subdir(crgrp->mon.mon_data_kn, name, subname);
> + mon_rmdir_one_subdir(crgrp, name, subname);
> }
> }
>
> @@ -3182,19 +3292,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> bool do_sum)
> {
> struct rmid_read rr = {0};
> - union mon_data_bits priv;
> + struct mon_data *priv;
> struct mon_evt *mevt;
> int ret;
>
> if (WARN_ON(list_empty(&r->evt_list)))
> return -EPERM;
>
> - priv.u.rid = r->rid;
> - priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
> - priv.u.sum = do_sum;
> list_for_each_entry(mevt, &r->evt_list, list) {
> - priv.u.evtid = mevt->evtid;
> - ret = mon_addfile(kn, mevt->name, priv.priv);
> + priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
> + if (WARN_ON_ONCE(!priv))
> + return -EINVAL;
> +
> + ret = mon_addfile(kn, mevt->name, priv);
> if (ret)
> return ret;
>
> @@ -3274,9 +3384,17 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdtgroup *prgrp, *crgrp;
> struct list_head *head;
>
> + /*
> + * During domain-online create the default control group first
> + * so that mon_get_default_kn_priv() can find the allocated structure
> + * on subsequent calls.
> + */
> + mkdir_mondata_subdir(kn_mondata, d, r, &rdtgroup_default);
> +
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> parent_kn = prgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> + if (prgrp != &rdtgroup_default)
> + mkdir_mondata_subdir(parent_kn, d, r, prgrp);
>
> head = &prgrp->mon.crdtgrp_list;
> list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
> --
> 2.39.5
>
Hi Amit,
On 07/03/2025 10:17, Amit Singh Tomar wrote:
>> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
>> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
>> the domain-id, and is packed into the mon_data_bits union bitfield.
>> The width of cache-id in this field is 14 bits.
>>
>> Expanding the union would break 32bit x86 platforms as this union is
>> stored as the kernfs kn->priv pointer. This saved allocating memory
>> for the priv data storage.
>>
>> The firmware on MPAM platforms have used the PPTT cache-id field to
>> expose the interconnect's id for the cache, which is sparse and uses
>> more than 14 bits. Use of this id is to enable PCIe direct cache
>> injection hints. Using this feature with VFIO means the value provided
>> by the ACPI table should be exposed to user-space.
>>
>> To support cache-id values greater than 14 bits, convert the
>> mon_data_bits union to a structure. This is allocated for the default
>> control group when the kernfs event files are created, and free'd when
>> the monitor directory is rmdir'd when the domain goes offline.
>> All other control and monitor groups lookup the struct mon_data allocated
>> for the default control group, and use this.
>> This simplifies the lifecycle of this structure as the default control
>> group cannot be rmdir()d by user-space, so only needs to consider
>> domain-offline, which removes all the event files corresponding to a
>> domain while holding rdtgroup_mutex - which prevents concurrent
>> readers. mkdir_mondata_subdir_allrdtgrp() must special case the default
>> control group to ensure it is created first.
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/
>> rdtgroup.c
>> index aecd3fa734cd..443635d195f0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -3114,6 +3114,110 @@ static struct file_system_type rdt_fs_type = {
>> .kill_sb = rdt_kill_sb,
>> };
>> +/**
>> + * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
>> + * the default control group.
>> + * Called when monitor event files are created for a domain.
>> + * When called with the default control group, the structure will be allocated.
>> + * This happens at mount time, before other control or monitor groups are
>> + * created.
>> + * This simplifies the lifetime management for rmdir() versus domain-offline
>> + * as the default control group lives forever, and only one group needs to be
>> + * special cased.
>> + *
>> + * @r: The resource for the event type being created.
>> + * @d: The domain for the event type being created.
>> + * @mevt: The event type being created.
>> + * @rdtgrp: The rdtgroup for which the monitor file is being created,
>> + * used to determine if this is the default control group.
>> + * @do_sum: Whether the SNC sub-numa node monitors are being created.
>> + */
>> +static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
>> + struct rdt_mon_domain *d,
>> + struct mon_evt *mevt,
>> + struct rdtgroup *rdtgrp,
>> + bool do_sum)
>> +{
>> + struct kernfs_node *kn_dom, *kn_evt;
>> + struct mon_data *priv;
>> + bool snc_mode;
>> + char name[32];
>> +
>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> + snc_mode = r->mon_scope == RESCTRL_L3_NODE;
>> + if (!do_sum)
>> + sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> This change triggered a minor report during compilation.
>
> fs/resctrl/rdtgroup.c: In function ‘mon_get_default_kn_priv’:
> fs/resctrl/rdtgroup.c:2931:28: warning: format ‘%d’ expects argument of type ‘int’, but
> argument 4 has type ‘long unsigned int’ [-Wformat=]
> 2931 | sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> | ~~~^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> | | |
> | int long unsigned int
> | %02ld
Heh, not yet its not! You must have rebased the MPAM tree on-top, its a patch in there
that causes this:
This is because of the device-tree folk want to make cache-id an unsigned long so they can
use the arm CPU's affinity id as a cache-id. That patch already has to cleanup this
pattern elsewhere in resctrl, I need to add this one to it.
That thing is a discussion for the DT folk to drive ... I think they could just as easily
use the CPU number - only it wouldn't be a hardware-derived value. (the upshot is
cache-ids could change over a firmware update - which I think is fine)
Thanks,
James
Hi James,
On 2/28/25 11:59 AM, James Morse wrote:
> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
> the domain-id, and is packed into the mon_data_bits union bitfield.
> The width of cache-id in this field is 14 bits.
>
> Expanding the union would break 32bit x86 platforms as this union is
> stored as the kernfs kn->priv pointer. This saved allocating memory
> for the priv data storage.
>
> The firmware on MPAM platforms have used the PPTT cache-id field to
> expose the interconnect's id for the cache, which is sparse and uses
> more than 14 bits. Use of this id is to enable PCIe direct cache
> injection hints. Using this feature with VFIO means the value provided
> by the ACPI table should be exposed to user-space.
>
> To support cache-id values greater than 14 bits, convert the
> mon_data_bits union to a structure. This is allocated for the default
> control group when the kernfs event files are created, and free'd when
> the monitor directory is rmdir'd when the domain goes offline.
> All other control and monitor groups lookup the struct mon_data allocated
> for the default control group, and use this.
> This simplifies the lifecycle of this structure as the default control
> group cannot be rmdir()d by user-space, so only needs to consider
> domain-offline, which removes all the event files corresponding to a
> domain while holding rdtgroup_mutex - which prevents concurrent
> readers. mkdir_mondata_subdir_allrdtgrp() must special case the default
> control group to ensure it is created first.
This is novel.
>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Previously the MPAM tree repainted the cache-id to compact them,
> argue-ing there was no other user. With VFIO use of this PCIe feature,
> this is no longer an option.
>
> Changes since v6:
> * Added the get/put helpers.
> * Special case the creation of the mondata files for the default control
> group.
> * Removed wording about files living longer than expected, the corresponding
> error handling is wrapped in WARN_ON_ONCE() as this indicates a bug.
> ---
...
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index aecd3fa734cd..443635d195f0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3114,6 +3114,110 @@ static struct file_system_type rdt_fs_type = {
> .kill_sb = rdt_kill_sb,
> };
>
> +/**
> + * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
> + * the default control group.
Since this involves monitoring this would technically be the default "monitoring"
group (throughout).
> + * Called when monitor event files are created for a domain.
> + * When called with the default control group, the structure will be allocated.
A bit difficult to parse. Assuming there is a re-spin, how about something like:
"Allocate the structure when @rdtgrp is the default group."
> + * This happens at mount time, before other control or monitor groups are
> + * created.
> + * This simplifies the lifetime management for rmdir() versus domain-offline
> + * as the default control group lives forever, and only one group needs to be
> + * special cased.
> + *
> + * @r: The resource for the event type being created.
> + * @d: The domain for the event type being created.
Stray tab makes for inconsisent spacing.
> + * @mevt: The event type being created.
> + * @rdtgrp: The rdtgroup for which the monitor file is being created,
> + * used to determine if this is the default control group.
> + * @do_sum: Whether the SNC sub-numa node monitors are being created.
do_sum can be true or false when it comes to the SNC files (more below).
> + */
> +static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
> + struct rdt_mon_domain *d,
> + struct mon_evt *mevt,
> + struct rdtgroup *rdtgrp,
> + bool do_sum)
> +{
> + struct kernfs_node *kn_dom, *kn_evt;
> + struct mon_data *priv;
> + bool snc_mode;
> + char name[32];
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> + if (!do_sum)
> + sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
When in SNC mode the "mon_<resource name>_ files always use d->ci->id as the domain id.
> + else
> + sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +
The mon_sub_<resource name>_ directories are always created when in SNC mode, they do
not exist on non SNC enabled systems. The mon_<resource name>_ directories exists on
both SNC enabled and non-SNC/SNC disabled systems. The mon_<resource name>_ directories
on SNC enabled system will have "do_sum" set. I think what you may be trying to do
here is something like:
if (!snc_mode) { /* do_sum is not relevant */
sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
} else if (snc_mode && do_sum) {
sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
} else { /* snc_mode && !do_sum */
sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
}
> + kn_dom = kernfs_find_and_get(kn_mondata, name);
> + if (!kn_dom)
> + return NULL;
It seems like this either assumes the directories are on the same level or assumes
kernfs_find_and_get() does a recursive find. As I understand kernfs_find_and_get()
does not do a recursive find. On SNC enabled systems the mon_sub_<resource name>_
directories are subdirectories of the mon_<resource name>_ directories.
Example of how hierarchy may look is at:
https://lore.kernel.org/all/20240628215619.76401-9-tony.luck@intel.com/
With all of the above I do not think this will work on an SNC enabled
system ... to confirm this I tried it out and it is not possible to mount
resctrl on an SNC enabled system and the WARN_ON_ONCE() this patch adds to
mon_add_all_files() is hit.
> +
> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
Note the "...and_get..." in kernfs_find_and_get() that gets a reference to
the kn before returning it. I expect that this work will have symmetry when it
comes to get/put of references but I see four new calls to kernfs_find_and_get() but
no new matching kernfs_put() to release the new references. It looks like
kernfs_find_and_get() is just used to figure out what the kn is so the references
need not be kept around for long.
> +
> + /* Is this the creation of the default groups monitor files? */
> + if (!kn_evt && rdtgrp == &rdtgroup_default) {
> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return NULL;
> + priv->rid = r->rid;
> + priv->domid = do_sum ? d->ci->id : d->hdr.id;
> + priv->sum = do_sum;
> + priv->evtid = mevt->evtid;
> + return priv;
> + }
> +
> + if (!kn_evt)
> + return NULL;
> +
> + return kn_evt->priv;
> +}
> +
> +/**
> + * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
> + * all events from the default control group.
> + * Put the mon_data priv data for all events for a particular domain.
> + * When called with the default control group, the priv structure previously
> + * allocated will be kfree()d. This should only be done as part of taking a
> + * domain offline.
> + * Only a domain offline will 'rmdir' monitor files in the default control
> + * group. After domain offline releases rdtgrp_mutex, all references will
> + * have been removed.
> + *
> + * @rdtgrp: The rdtgroup for which the monitor files are being removed,
> + * used to determine if this is the default control group.
> + * @name: The name of the domain or SNC sub-numa domain which is being
> + * taken offline.
This is a bit confusing since domains do not have names. How about (please feel
free to improve):
"Name of directory containing monitoring files that is in process of being removed."
> + */
> +static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
> +{
> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> + struct kernfs_node *kn_dom, *kn_evt;
> + struct mon_evt *mevt;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + if (rdtgrp != &rdtgroup_default)
> + return;
> +
> + kn_dom = kernfs_find_and_get(kn_mondata, name);
> + if (!kn_dom)
> + return;
I expect this will always fail when @name is a mon_sub_* directory.
> +
> + list_for_each_entry(mevt, &r->evt_list, list) {
> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
> + if (!kn_evt)
> + continue;
> + if (!kn_evt->priv)
> + continue;
> +
> + kfree(kn_evt->priv);
> + kn_evt->priv = NULL;
> + }
> +}
> +
> static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> void *priv)
> {
> @@ -3135,19 +3239,25 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> return ret;
> }
>
> -static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subname)
> +static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subname)
> {
> + struct kernfs_node *pkn = rdtgrp->mon.mon_data_kn;
> struct kernfs_node *kn;
>
> kn = kernfs_find_and_get(pkn, name);
> if (!kn)
> return;
> +
> + mon_put_default_kn_priv_all(rdtgrp, name);
> +
> kernfs_put(kn);
>
> - if (kn->dir.subdirs <= 1)
> + if (kn->dir.subdirs <= 1) {
> kernfs_remove(kn);
> - else
> + } else {
> + mon_put_default_kn_priv_all(rdtgrp, subname);
> kernfs_remove_by_name(kn, subname);
> + }
> }
>
> /*
> @@ -3170,10 +3280,10 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> - mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> + mon_rmdir_one_subdir(prgrp, name, subname);
>
> list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
> - mon_rmdir_one_subdir(crgrp->mon.mon_data_kn, name, subname);
> + mon_rmdir_one_subdir(crgrp, name, subname);
> }
> }
>
> @@ -3182,19 +3292,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> bool do_sum)
> {
> struct rmid_read rr = {0};
> - union mon_data_bits priv;
> + struct mon_data *priv;
> struct mon_evt *mevt;
> int ret;
>
> if (WARN_ON(list_empty(&r->evt_list)))
> return -EPERM;
>
> - priv.u.rid = r->rid;
> - priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
> - priv.u.sum = do_sum;
> list_for_each_entry(mevt, &r->evt_list, list) {
> - priv.u.evtid = mevt->evtid;
> - ret = mon_addfile(kn, mevt->name, priv.priv);
> + priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
> + if (WARN_ON_ONCE(!priv))
> + return -EINVAL;
> +
This is the warning I hit on the SNC system.
> + ret = mon_addfile(kn, mevt->name, priv);
> if (ret)
> return ret;
>
> @@ -3274,9 +3384,17 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdtgroup *prgrp, *crgrp;
> struct list_head *head;
>
> + /*
> + * During domain-online create the default control group first
> + * so that mon_get_default_kn_priv() can find the allocated structure
> + * on subsequent calls.
> + */
> + mkdir_mondata_subdir(kn_mondata, d, r, &rdtgroup_default);
> +
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> parent_kn = prgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> + if (prgrp != &rdtgroup_default)
> + mkdir_mondata_subdir(parent_kn, d, r, prgrp);
>
> head = &prgrp->mon.crdtgrp_list;
> list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
Reinette
Hi Reinette,
On 07/03/2025 05:03, Reinette Chatre wrote:
> On 2/28/25 11:59 AM, James Morse wrote:
>> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
>> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
>> the domain-id, and is packed into the mon_data_bits union bitfield.
>> The width of cache-id in this field is 14 bits.
>>
>> Expanding the union would break 32bit x86 platforms as this union is
>> stored as the kernfs kn->priv pointer. This saved allocating memory
>> for the priv data storage.
>>
>> The firmware on MPAM platforms have used the PPTT cache-id field to
>> expose the interconnect's id for the cache, which is sparse and uses
>> more than 14 bits. Use of this id is to enable PCIe direct cache
>> injection hints. Using this feature with VFIO means the value provided
>> by the ACPI table should be exposed to user-space.
>>
>> To support cache-id values greater than 14 bits, convert the
>> mon_data_bits union to a structure. This is allocated for the default
>> control group when the kernfs event files are created, and free'd when
>> the monitor directory is rmdir'd when the domain goes offline.
>> All other control and monitor groups lookup the struct mon_data allocated
>> for the default control group, and use this.
>> This simplifies the lifecycle of this structure as the default control
>> group cannot be rmdir()d by user-space, so only needs to consider
>> domain-offline, which removes all the event files corresponding to a
>> domain while holding rdtgroup_mutex - which prevents concurrent
>> readers. mkdir_mondata_subdir_allrdtgrp() must special case the default
>> control group to ensure it is created first.
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index aecd3fa734cd..443635d195f0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -3114,6 +3114,110 @@ static struct file_system_type rdt_fs_type = {
>> .kill_sb = rdt_kill_sb,
>> };
>>
>> +/**
>> + * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
>> + * the default control group.
>
> Since this involves monitoring this would technically be the default "monitoring"
> group (throughout).
Makes sense - fixed.
This has all leaked out of the observation 'you can't rmdir /'.
>> + * Called when monitor event files are created for a domain.
>> + * When called with the default control group, the structure will be allocated.
>
> A bit difficult to parse. Assuming there is a re-spin, how about something like:
> "Allocate the structure when @rdtgrp is the default group."
Done,
>> + * This happens at mount time, before other control or monitor groups are
>> + * created.
>> + * This simplifies the lifetime management for rmdir() versus domain-offline
>> + * as the default control group lives forever, and only one group needs to be
>> + * special cased.
>> + *
>> + * @r: The resource for the event type being created.
>> + * @d: The domain for the event type being created.
>
> Stray tab makes for inconsisent spacing.
Fixed,
>> + * @mevt: The event type being created.
>> + * @rdtgrp: The rdtgroup for which the monitor file is being created,
>> + * used to determine if this is the default control group.
>> + * @do_sum: Whether the SNC sub-numa node monitors are being created.
> do_sum can be true or false when it comes to the SNC files (more below).
I mis-understood what the hierarchy looks like on SNC systems.
>> + */
>> +static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
>> + struct rdt_mon_domain *d,
>> + struct mon_evt *mevt,
>> + struct rdtgroup *rdtgrp,
>> + bool do_sum)
>> +{
>> + struct kernfs_node *kn_dom, *kn_evt;
>> + struct mon_data *priv;
>> + bool snc_mode;
>> + char name[32];
>> +
>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> + snc_mode = r->mon_scope == RESCTRL_L3_NODE;
>> + if (!do_sum)
>> + sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
>
> When in SNC mode the "mon_<resource name>_ files always use d->ci->id as the domain id.
(that makes sense - I'd conflated it with the next problem)
>> + else
>> + sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
>> +
> The mon_sub_<resource name>_ directories are always created when in SNC mode, they do
> not exist on non SNC enabled systems. The mon_<resource name>_ directories exists on
> both SNC enabled and non-SNC/SNC disabled systems. The mon_<resource name>_ directories
> on SNC enabled system will have "do_sum" set. I think what you may be trying to do
> here is something like:
>
> if (!snc_mode) { /* do_sum is not relevant */
> sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> } else if (snc_mode && do_sum) {
> sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> } else { /* snc_mode && !do_sum */
> sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> }
Right - that makes sense. Thanks for explaining it.
>> + kn_dom = kernfs_find_and_get(kn_mondata, name);
>> + if (!kn_dom)
>> + return NULL;
>
> It seems like this either assumes the directories are on the same level or assumes
> kernfs_find_and_get() does a recursive find. As I understand kernfs_find_and_get()
> does not do a recursive find. On SNC enabled systems the mon_sub_<resource name>_
> directories are subdirectories of the mon_<resource name>_ directories.
> Example of how hierarchy may look is at:
> https://lore.kernel.org/all/20240628215619.76401-9-tony.luck@intel.com/
(aha!)
> With all of the above I do not think this will work on an SNC enabled
> system ... to confirm this I tried it out and it is not possible to mount
> resctrl on an SNC enabled system and the WARN_ON_ONCE() this patch adds to
> mon_add_all_files() is hit.
I hadn't realised the mon_sub directories for SNC weren't all directly under mon_data.
Searching from mon_data will need the parent name too. What I've come up with is:
-------%<-------
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
if (!snc_mode) {
sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
kn_target_dir = kernfs_find_and_get(kn_mondata, name);
} else {
sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
kn_target_dir = kernfs_find_and_get(kn_mondata, name);
if (snc_mode && !do_sum) {
sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
kernfs_put(kn_target_dir);
kn_target_dir = kernfs_find_and_get(kn_target_dir, name);
}
}
kernfs_put(kn_target_dir);
if (!kn_target_dir)
return NULL;
-------%<-------
>> +
>> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
>
> Note the "...and_get..." in kernfs_find_and_get() that gets a reference to
> the kn before returning it. I expect that this work will have symmetry when it
> comes to get/put of references but I see four new calls to kernfs_find_and_get() but
> no new matching kernfs_put() to release the new references. It looks like
> kernfs_find_and_get() is just used to figure out what the kn is so the references
> need not be kept around for long.
Not sure how I missed that. There is no need to hold an extra reference as rdtgroup_mutex
is being held, I'll add a kernfs_put() call after each involuntary _get(). (and a comment
above the lockdep assert)
>> +
>> + /* Is this the creation of the default groups monitor files? */
>> + if (!kn_evt && rdtgrp == &rdtgroup_default) {
>> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> + if (!priv)
>> + return NULL;
>> + priv->rid = r->rid;
>> + priv->domid = do_sum ? d->ci->id : d->hdr.id;
>> + priv->sum = do_sum;
>> + priv->evtid = mevt->evtid;
>> + return priv;
>> + }
>> +
>> + if (!kn_evt)
>> + return NULL;
>> +
>> + return kn_evt->priv;
>> +}
>> +
>> +/**
>> + * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
>> + * all events from the default control group.
>> + * Put the mon_data priv data for all events for a particular domain.
>> + * When called with the default control group, the priv structure previously
>> + * allocated will be kfree()d. This should only be done as part of taking a
>> + * domain offline.
>> + * Only a domain offline will 'rmdir' monitor files in the default control
>> + * group. After domain offline releases rdtgrp_mutex, all references will
>> + * have been removed.
>> + *
>> + * @rdtgrp: The rdtgroup for which the monitor files are being removed,
>> + * used to determine if this is the default control group.
>> + * @name: The name of the domain or SNC sub-numa domain which is being
>> + * taken offline.
>
> This is a bit confusing since domains do not have names. How about (please feel
> free to improve):
> "Name of directory containing monitoring files that is in process of being removed."
That's clearer, thanks!
>> +static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
>> +{
>> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
>> + struct kernfs_node *kn_dom, *kn_evt;
>> + struct mon_evt *mevt;
>> +
>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> + if (rdtgrp != &rdtgroup_default)
>> + return;
>> +
>> + kn_dom = kernfs_find_and_get(kn_mondata, name);
>> + if (!kn_dom)
>> + return;
>
> I expect this will always fail when @name is a mon_sub_* directory.
Yup, fixed with a smiliar pattern to above - only the 'subname' is available to the
caller, so it can be passed in.
>> +
>> + list_for_each_entry(mevt, &r->evt_list, list) {
>> + kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
>> + if (!kn_evt)
>> + continue;
>> + if (!kn_evt->priv)
>> + continue;
>> +
>> + kfree(kn_evt->priv);
>> + kn_evt->priv = NULL;
>> + }
>> +}
>> @@ -3182,19 +3292,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
>> bool do_sum)
>> {
>> struct rmid_read rr = {0};
>> - union mon_data_bits priv;
>> + struct mon_data *priv;
>> struct mon_evt *mevt;
>> int ret;
>>
>> if (WARN_ON(list_empty(&r->evt_list)))
>> return -EPERM;
>>
>> - priv.u.rid = r->rid;
>> - priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
>> - priv.u.sum = do_sum;
>> list_for_each_entry(mevt, &r->evt_list, list) {
>> - priv.u.evtid = mevt->evtid;
>> - ret = mon_addfile(kn, mevt->name, priv.priv);
>> + priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
>> + if (WARN_ON_ONCE(!priv))
>> + return -EINVAL;
>> +
> This is the warning I hit on the SNC system.
Thanks for testing that, this is a 'should never happen - its a bug'.
Thanks!
James
Hi James,
On 3/12/25 11:04 AM, James Morse wrote:
> On 07/03/2025 05:03, Reinette Chatre wrote:
>> On 2/28/25 11:59 AM, James Morse wrote:
...
>> With all of the above I do not think this will work on an SNC enabled
>> system ... to confirm this I tried it out and it is not possible to mount
>> resctrl on an SNC enabled system and the WARN_ON_ONCE() this patch adds to
>> mon_add_all_files() is hit.
>
> I hadn't realised the mon_sub directories for SNC weren't all directly under mon_data.
> Searching from mon_data will need the parent name too. What I've come up with is:
> -------%<-------
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> if (!snc_mode) {
> sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> kn_target_dir = kernfs_find_and_get(kn_mondata, name);
> } else {
> sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> kn_target_dir = kernfs_find_and_get(kn_mondata, name);
>
> if (snc_mode && !do_sum) {
snc_mode should always be true here?
> sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> kernfs_put(kn_target_dir);
I think this needs some extra guardrails. If kn_target_dir is NULL here
it looks like that the kernfs_put() above will be fine, but from what I can tell
the kernfs_find_and_get() below will not be.
> kn_target_dir = kernfs_find_and_get(kn_target_dir, name);
> }
> }
> kernfs_put(kn_target_dir);
> if (!kn_target_dir)
> return NULL;
> -------%<-------
>
This looks good to me. In original patch a NULL kn within mon_get_default_kn_priv()
was used as prompt to create the private data. It is thus not obvious to me from this
snippet what is being returned "to", but I do not think that was your point of sharing
this snippet.
Reinette
On Thu, Mar 13, 2025 at 08:25:08AM -0700, Reinette Chatre wrote:
> Hi James,
>
> On 3/12/25 11:04 AM, James Morse wrote:
> > On 07/03/2025 05:03, Reinette Chatre wrote:
> >> On 2/28/25 11:59 AM, James Morse wrote:
>
> ...
>
> >> With all of the above I do not think this will work on an SNC enabled
> >> system ... to confirm this I tried it out and it is not possible to mount
> >> resctrl on an SNC enabled system and the WARN_ON_ONCE() this patch adds to
> >> mon_add_all_files() is hit.
> >
> > I hadn't realised the mon_sub directories for SNC weren't all directly under mon_data.
> > Searching from mon_data will need the parent name too. What I've come up with is:
> > -------%<-------
> > snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> > if (!snc_mode) {
> > sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> > kn_target_dir = kernfs_find_and_get(kn_mondata, name);
> > } else {
> > sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> > kn_target_dir = kernfs_find_and_get(kn_mondata, name);
> >
> > if (snc_mode && !do_sum) {
>
> snc_mode should always be true here?
>
> > sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> > kernfs_put(kn_target_dir);
>
> I think this needs some extra guardrails. If kn_target_dir is NULL here
> it looks like that the kernfs_put() above will be fine, but from what I can tell
> the kernfs_find_and_get() below will not be.
>
> > kn_target_dir = kernfs_find_and_get(kn_target_dir, name);
> > }
> > }
> > kernfs_put(kn_target_dir);
> > if (!kn_target_dir)
> > return NULL;
> > -------%<-------
> >
>
> This looks good to me. In original patch a NULL kn within mon_get_default_kn_priv()
> was used as prompt to create the private data. It is thus not obvious to me from this
> snippet what is being returned "to", but I do not think that was your point of sharing
> this snippet.
Is this all overly complex trying to re-use the "priv" fields from
the default resctrl group? Would it be easier to just keep a list
of each combinations of region id, domain id, sum, and event id that have
already been allocated and re-use existing ones, or add to the list
for new ones. Scanning this list may be less overhead that all the
sprintf() and kernfs_find_and_get() searches.
Something like:
static LIST_HEAD(kn_priv_list);
static struct mon_data *mon_get_kn_priv(struct rdt_resource *r,
struct rdt_mon_domain *d,
struct mon_evt *mevt,
struct rdtgroup *rdtgrp,
bool do_sum)
{
struct mon_data *priv;
list_for_each_entry(priv, &kn_priv_list, list) {
if (priv->rid == r->rid && priv->domid == (do_sum ? d->ci->id : d->hdr.id) &&
priv->sum == do_sum && priv->evtid == mevt->evtid)
return priv;
}
}
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
return NULL;
priv->rid = r->rid;
priv->domid = do_sum ? d->ci->id : d->hdr.id;
priv->sum = do_sum;
priv->evtid = mevt->evtid;
list_add_tail(&priv->list, &kn_priv_list);
return priv;
}
Maybe just ignore the "domain goes offline case" there's not much memory
tied up in these structures (and the domain may come back).
Just free the whole list when resctrl is unmounted.
static void mon_put_kn_priv(void)
{
struct mon_data *priv, *tmp;
list_for_each_entry_safe(priv, tmp, &kn_priv_list, list)
kfree(priv);
}
[Lightly tested the "get" path. Haven't tried the "put" code.]
-Tony
On Mon, Mar 24, 2025 at 05:52:37PM -0700, Luck, Tony wrote:
> On Thu, Mar 13, 2025 at 08:25:08AM -0700, Reinette Chatre wrote:
> > Hi James,
> >
> > On 3/12/25 11:04 AM, James Morse wrote:
> > > On 07/03/2025 05:03, Reinette Chatre wrote:
> > >> On 2/28/25 11:59 AM, James Morse wrote:
> >
> > ...
> >
> > >> With all of the above I do not think this will work on an SNC enabled
> > >> system ... to confirm this I tried it out and it is not possible to mount
> > >> resctrl on an SNC enabled system and the WARN_ON_ONCE() this patch adds to
> > >> mon_add_all_files() is hit.
> > >
> > > I hadn't realised the mon_sub directories for SNC weren't all directly under mon_data.
> > > Searching from mon_data will need the parent name too. What I've come up with is:
> > > -------%<-------
> > > snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> > > if (!snc_mode) {
> > > sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> > > kn_target_dir = kernfs_find_and_get(kn_mondata, name);
> > > } else {
> > > sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> > > kn_target_dir = kernfs_find_and_get(kn_mondata, name);
> > >
> > > if (snc_mode && !do_sum) {
> >
> > snc_mode should always be true here?
> >
> > > sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> > > kernfs_put(kn_target_dir);
> >
> > I think this needs some extra guardrails. If kn_target_dir is NULL here
> > it looks like that the kernfs_put() above will be fine, but from what I can tell
> > the kernfs_find_and_get() below will not be.
> >
> > > kn_target_dir = kernfs_find_and_get(kn_target_dir, name);
> > > }
> > > }
> > > kernfs_put(kn_target_dir);
> > > if (!kn_target_dir)
> > > return NULL;
> > > -------%<-------
> > >
> >
> > This looks good to me. In original patch a NULL kn within mon_get_default_kn_priv()
> > was used as prompt to create the private data. It is thus not obvious to me from this
> > snippet what is being returned "to", but I do not think that was your point of sharing
> > this snippet.
>
> Is this all overly complex trying to re-use the "priv" fields from
> the default resctrl group? Would it be easier to just keep a list
> of each combinations of region id, domain id, sum, and event id that have
> already been allocated and re-use existing ones, or add to the list
> for new ones. Scanning this list may be less overhead that all the
> sprintf() and kernfs_find_and_get() searches.
James,
I played around with the simplification some more and tested on both
normal and SNC systems. Below is a patch against:
git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/move_to_fs/v7
Note that this is *after* the move to fs/resctrl as I was too chicken
to try applying this before the move and then re-run the script to
move things. But if you take this suggestion, just mash it into your
"Expand the width of dom_id by replacing mon_data_bits" patch.
Maybe give me Co-developed-by credit, but that's not important.
Note that your expansion of mon_data is going to be very useful going
forward. I want to add extra information to struct mon_data:
1) Flag to note that an event counter can be read from any CPU, not
just the ones in the domain specified by the mon_data/mon_L3_XX/*
file.
2) Type field to specify how to display the value of each counter
(since I want floating point instead of integer for the energy
counters).
-Tony
From e2689c7439572608ce03a525c71c3fb88379057c Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 3 Apr 2025 10:53:55 -0700
Subject: [PATCH] fs/resctrl: Simplify allocation of mon_data structures
Instead of making a special case to allocate and attach these structures
to kernfs files in the default control group, simply allocate a structure
when a new combination of <rid, domain, mevt, do_sum> is needed and
re-use existing structures when possible.
Free all structures when resctrl filesystem is unmounted.
Partial revert of commit fa563b5171e9 ("x86/resctrl: Expand the width
of dom_id by replacing mon_data_bits")
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 2 +
fs/resctrl/rdtgroup.c | 138 ++++++++++++------------------------------
2 files changed, 40 insertions(+), 100 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index ec3863d18f68..e5976bd52a35 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -83,6 +83,7 @@ struct mon_evt {
/**
* struct mon_data - Monitoring details for each event file.
+ * @list: List of all allocated structures.
* @rid: Resource id associated with the event file.
* @evtid: Event id associated with the event file.
* @sum: Set when event must be summed across multiple
@@ -96,6 +97,7 @@ struct mon_evt {
* rdtgroup_mutex.
*/
struct mon_data {
+ struct list_head list;
unsigned int rid;
enum resctrl_event_id evtid;
unsigned int sum;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 234ec9dbe5b3..4ec40850752a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -69,6 +69,8 @@ static int rdtgroup_setup_root(struct rdt_fs_context *ctx);
static void rdtgroup_destroy_root(void);
+static void mon_put_kn_priv(void);
+
struct dentry *debugfs_resctrl;
/*
@@ -2873,6 +2875,7 @@ static void rdt_kill_sb(struct super_block *sb)
resctrl_arch_reset_all_ctrls(r);
rmdir_all_sub();
+ mon_put_kn_priv();
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
schemata_list_destroy();
@@ -2895,107 +2898,54 @@ static struct file_system_type rdt_fs_type = {
.kill_sb = rdt_kill_sb,
};
+static LIST_HEAD(kn_priv_list);
+
/**
- * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
- * the default control group.
+ * mon_get_kn_priv() - Get the mon_data priv data for this event
* Called when monitor event files are created for a domain.
- * When called with the default control group, the structure will be allocated.
- * This happens at mount time, before other control or monitor groups are
- * created.
- * This simplifies the lifetime management for rmdir() versus domain-offline
- * as the default control group lives forever, and only one group needs to be
- * special cased.
+ * The same values are used in multiple directories. Keep a list
+ * of allocated structures and re-use an existing one with the same
+ * list of values for rid, domain, etc.
*
- * @r: The resource for the event type being created.
- * @d: The domain for the event type being created.
- * @mevt: The event type being created.
- * @rdtgrp: The rdtgroup for which the monitor file is being created,
- * used to determine if this is the default control group.
- * @do_sum: Whether the SNC sub-numa node monitors are being created.
+ * @rid: The resource for the event type being created.
+ * @domid: The domain for the event type being created.
+ * @mevt: The event type being created.
+ * @do_sum: Whether the SNC sub-numa node monitors are being created.
*/
-static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
- struct rdt_mon_domain *d,
- struct mon_evt *mevt,
- struct rdtgroup *rdtgrp,
- bool do_sum)
+static struct mon_data *mon_get_kn_priv(int rid, int domid, struct mon_evt *mevt, bool do_sum)
{
- struct kernfs_node *kn_dom, *kn_evt;
struct mon_data *priv;
- bool snc_mode;
- char name[32];
- lockdep_assert_held(&rdtgroup_mutex);
-
- snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- if (!do_sum)
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
- else
- sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
-
- kn_dom = kernfs_find_and_get(kn_mondata, name);
- if (!kn_dom)
- return NULL;
-
- kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
-
- /* Is this the creation of the default groups monitor files? */
- if (!kn_evt && rdtgrp == &rdtgroup_default) {
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- if (!priv)
- return NULL;
- priv->rid = r->rid;
- priv->domid = do_sum ? d->ci->id : d->hdr.id;
- priv->sum = do_sum;
- priv->evtid = mevt->evtid;
- return priv;
+ list_for_each_entry(priv, &kn_priv_list, list) {
+ if (priv->rid == rid && priv->domid == domid &&
+ priv->sum == do_sum && priv->evtid == mevt->evtid)
+ return priv;
}
- if (!kn_evt)
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv)
return NULL;
- return kn_evt->priv;
+ priv->rid = rid;
+ priv->domid = domid;
+ priv->sum = do_sum;
+ priv->evtid = mevt->evtid;
+ list_add_tail(&priv->list, &kn_priv_list);
+
+ return priv;
}
/**
- * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
- * all events from the default control group.
- * Put the mon_data priv data for all events for a particular domain.
- * When called with the default control group, the priv structure previously
- * allocated will be kfree()d. This should only be done as part of taking a
- * domain offline.
- * Only a domain offline will 'rmdir' monitor files in the default control
- * group. After domain offline releases rdtgrp_mutex, all references will
- * have been removed.
- *
- * @rdtgrp: The rdtgroup for which the monitor files are being removed,
- * used to determine if this is the default control group.
- * @name: The name of the domain or SNC sub-numa domain which is being
- * taken offline.
+ * mon_put_kn_priv() - Free all allocated mon_data structures
+ * Called when resctrl file system is unmounted.
*/
-static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
+static void mon_put_kn_priv(void)
{
- struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct kernfs_node *kn_dom, *kn_evt;
- struct mon_evt *mevt;
+ struct mon_data *priv, *tmp;
- lockdep_assert_held(&rdtgroup_mutex);
-
- if (rdtgrp != &rdtgroup_default)
- return;
-
- kn_dom = kernfs_find_and_get(kn_mondata, name);
- if (!kn_dom)
- return;
-
- list_for_each_entry(mevt, &r->evt_list, list) {
- kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
- if (!kn_evt)
- continue;
- if (!kn_evt->priv)
- continue;
-
- kfree(kn_evt->priv);
- kn_evt->priv = NULL;
+ list_for_each_entry_safe(priv, tmp, &kn_priv_list, list) {
+ kfree(priv);
+ list_del(&priv->list);
}
}
@@ -3029,16 +2979,12 @@ static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subn
if (!kn)
return;
- mon_put_default_kn_priv_all(rdtgrp, name);
-
kernfs_put(kn);
- if (kn->dir.subdirs <= 1) {
+ if (kn->dir.subdirs <= 1)
kernfs_remove(kn);
- } else {
- mon_put_default_kn_priv_all(rdtgrp, subname);
+ else
kernfs_remove_by_name(kn, subname);
- }
}
/*
@@ -3081,7 +3027,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
return -EPERM;
list_for_each_entry(mevt, &r->evt_list, list) {
- priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
+ priv = mon_get_kn_priv(r->rid, do_sum ? d->ci->id : d->hdr.id, mevt, do_sum);
if (WARN_ON_ONCE(!priv))
return -EINVAL;
@@ -3165,17 +3111,9 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdtgroup *prgrp, *crgrp;
struct list_head *head;
- /*
- * During domain-online create the default control group first
- * so that mon_get_default_kn_priv() can find the allocated structure
- * on subsequent calls.
- */
- mkdir_mondata_subdir(kn_mondata, d, r, &rdtgroup_default);
-
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- if (prgrp != &rdtgroup_default)
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ mkdir_mondata_subdir(parent_kn, d, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
--
2.48.1
Hi Tony, On 03/04/2025 20:16, Luck, Tony wrote: > On Mon, Mar 24, 2025 at 05:52:37PM -0700, Luck, Tony wrote: >> Is this all overly complex trying to re-use the "priv" fields from >> the default resctrl group? I was trying to keep the kernfs hierarchy as the source of truth for this, and keep the extra memory allocated to a minimum. >> Would it be easier to just keep a list >> of each combinations of region id, domain id, sum, and event id that have >> already been allocated and re-use existing ones, or add to the list >> for new ones. I agree this is much simpler! >> Scanning this list may be less overhead that all the >> sprintf() and kernfs_find_and_get() searches. > I played around with the simplification some more and tested on both > normal and SNC systems. Below is a patch against: > > git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/move_to_fs/v7 > > Note that this is *after* the move to fs/resctrl as I was too chicken > to try applying this before the move and then re-run the script to > move things. But if you take this suggestion, just mash it into your > "Expand the width of dom_id by replacing mon_data_bits" patch. > Maybe give me Co-developed-by credit, but that's not important. Done! > Note that your expansion of mon_data is going to be very useful going > forward. I want to add extra information to struct mon_data: > > 1) Flag to note that an event counter can be read from any CPU, not > just the ones in the domain specified by the mon_data/mon_L3_XX/* > file. Awesome - (some) arm64 and risc-v have this too. > 2) Type field to specify how to display the value of each counter > (since I want floating point instead of integer for the energy > counters). Thanks, James
© 2016 - 2026 Red Hat, Inc.