[PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described

James Morse posted 33 patches 1 month, 1 week ago
There is a newer version of this series
[PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described
Posted by James Morse 1 month, 1 week ago
An MSC is a container of resources, each identified by their RIS index.
Some RIS are described by firmware to provide their position in the system.
Others are discovered when the driver probes the hardware.

To configure a resource it needs to be found by its class, e.g. 'L2'.
There are two kinds of grouping, a class is a set of components, which
are visible to user-space as there are likely to be multiple instances
of the L2 cache. (e.g. one per cluster or package)

struct mpam_components are a set of struct mpam_vmsc. A vMSC groups the
RIS in an MSC that control the same logical piece of hardware. (e.g. L2).
This is to allow hardware implementations where two controls are presented
as different RIS. Re-combining these RIS allows their feature bits to
be or-ed. This structure is not visible outside mpam_devices.c

struct mpam_vmsc are then a set of struct mpam_msc_ris, which are not
visible as each L2 cache may be composed of individual slices which need
to be configured the same as the hardware is not able to distribute the
configuration.

Add support for creating and destroying these structures.

A gfp is passed as the structures may need creating when a new RIS entry
is discovered when probing the MSC.

CC: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
Changes since RFC:
 * removed a pr_err() debug message that crept in.
---
 drivers/resctrl/mpam_devices.c  | 488 +++++++++++++++++++++++++++++++-
 drivers/resctrl/mpam_internal.h |  91 ++++++
 include/linux/arm_mpam.h        |   8 +-
 3 files changed, 574 insertions(+), 13 deletions(-)

diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
index 71a1fb1a9c75..5baf2a8786fb 100644
--- a/drivers/resctrl/mpam_devices.c
+++ b/drivers/resctrl/mpam_devices.c
@@ -20,7 +20,6 @@
 #include <linux/printk.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/srcu.h>
 #include <linux/types.h>
 
 #include <acpi/pcc.h>
@@ -35,11 +34,483 @@
 static DEFINE_MUTEX(mpam_list_lock);
 static LIST_HEAD(mpam_all_msc);
 
-static struct srcu_struct mpam_srcu;
+struct srcu_struct mpam_srcu;
 
 /* MPAM isn't available until all the MSC have been probed. */
 static u32 mpam_num_msc;
 
+/*
+ * An MSC is a physical container for controls and monitors, each identified by
+ * their RIS index. These share a base-address, interrupts and some MMIO
+ * registers. A vMSC is a virtual container for RIS in an MSC that control or
+ * monitor the same thing. Members of a vMSC are all RIS in the same MSC, but
+ * not all RIS in an MSC share a vMSC.
+ * Components are a group of vMSC that control or monitor the same thing but
+ * are from different MSC, so have different base-address, interrupts etc.
+ * Classes are the set components of the same type.
+ *
+ * The features of a vMSC is the union of the RIS it contains.
+ * The features of a Class and Component are the common subset of the vMSC
+ * they contain.
+ *
+ * e.g. The system cache may have bandwidth controls on multiple interfaces,
+ * for regulating traffic from devices independently of traffic from CPUs.
+ * If these are two RIS in one MSC, they will be treated as controlling
+ * different things, and will not share a vMSC/component/class.
+ *
+ * e.g. The L2 may have one MSC and two RIS, one for cache-controls another
+ * for bandwidth. These two RIS are members of the same vMSC.
+ *
+ * e.g. The set of RIS that make up the L2 are grouped as a component. These
+ * are sometimes termed slices. They should be configured the same, as if there
+ * were only one.
+ *
+ * e.g. The SoC probably has more than one L2, each attached to a distinct set
+ * of CPUs. All the L2 components are grouped as a class.
+ *
+ * When creating an MSC, struct mpam_msc is added to the all mpam_all_msc list,
+ * then linked via struct mpam_ris to a vmsc, component and class.
+ * The same MSC may exist under different class->component->vmsc paths, but the
+ * RIS index will be unique.
+ */
+LIST_HEAD(mpam_classes);
+
+/* List of all objects that can be free()d after synchronise_srcu() */
+static LLIST_HEAD(mpam_garbage);
+
+#define init_garbage(x)	init_llist_node(&(x)->garbage.llist)
+
+static struct mpam_vmsc *
+mpam_vmsc_alloc(struct mpam_component *comp, struct mpam_msc *msc, gfp_t gfp)
+{
+	struct mpam_vmsc *vmsc;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	vmsc = kzalloc(sizeof(*vmsc), gfp);
+	if (!comp)
+		return ERR_PTR(-ENOMEM);
+	init_garbage(vmsc);
+
+	INIT_LIST_HEAD_RCU(&vmsc->ris);
+	INIT_LIST_HEAD_RCU(&vmsc->comp_list);
+	vmsc->comp = comp;
+	vmsc->msc = msc;
+
+	list_add_rcu(&vmsc->comp_list, &comp->vmsc);
+
+	return vmsc;
+}
+
+static struct mpam_vmsc *mpam_vmsc_get(struct mpam_component *comp,
+				       struct mpam_msc *msc, bool alloc,
+				       gfp_t gfp)
+{
+	struct mpam_vmsc *vmsc;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	list_for_each_entry(vmsc, &comp->vmsc, comp_list) {
+		if (vmsc->msc->id == msc->id)
+			return vmsc;
+	}
+
+	if (!alloc)
+		return ERR_PTR(-ENOENT);
+
+	return mpam_vmsc_alloc(comp, msc, gfp);
+}
+
+static struct mpam_component *
+mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp)
+{
+	struct mpam_component *comp;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	comp = kzalloc(sizeof(*comp), gfp);
+	if (!comp)
+		return ERR_PTR(-ENOMEM);
+	init_garbage(comp);
+
+	comp->comp_id = id;
+	INIT_LIST_HEAD_RCU(&comp->vmsc);
+	/* affinity is updated when ris are added */
+	INIT_LIST_HEAD_RCU(&comp->class_list);
+	comp->class = class;
+
+	list_add_rcu(&comp->class_list, &class->components);
+
+	return comp;
+}
+
+static struct mpam_component *
+mpam_component_get(struct mpam_class *class, int id, bool alloc, gfp_t gfp)
+{
+	struct mpam_component *comp;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	list_for_each_entry(comp, &class->components, class_list) {
+		if (comp->comp_id == id)
+			return comp;
+	}
+
+	if (!alloc)
+		return ERR_PTR(-ENOENT);
+
+	return mpam_component_alloc(class, id, gfp);
+}
+
+static struct mpam_class *
+mpam_class_alloc(u8 level_idx, enum mpam_class_types type, gfp_t gfp)
+{
+	struct mpam_class *class;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	class = kzalloc(sizeof(*class), gfp);
+	if (!class)
+		return ERR_PTR(-ENOMEM);
+	init_garbage(class);
+
+	INIT_LIST_HEAD_RCU(&class->components);
+	/* affinity is updated when ris are added */
+	class->level = level_idx;
+	class->type = type;
+	INIT_LIST_HEAD_RCU(&class->classes_list);
+
+	list_add_rcu(&class->classes_list, &mpam_classes);
+
+	return class;
+}
+
+static struct mpam_class *
+mpam_class_get(u8 level_idx, enum mpam_class_types type, bool alloc, gfp_t gfp)
+{
+	bool found = false;
+	struct mpam_class *class;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	list_for_each_entry(class, &mpam_classes, classes_list) {
+		if (class->type == type && class->level == level_idx) {
+			found = true;
+			break;
+		}
+	}
+
+	if (found)
+		return class;
+
+	if (!alloc)
+		return ERR_PTR(-ENOENT);
+
+	return mpam_class_alloc(level_idx, type, gfp);
+}
+
+#define add_to_garbage(x)				\
+do {							\
+	__typeof__(x) _x = x;				\
+	(_x)->garbage.to_free = (_x);			\
+	llist_add(&(_x)->garbage.llist, &mpam_garbage);	\
+} while (0)
+
+static void mpam_class_destroy(struct mpam_class *class)
+{
+	lockdep_assert_held(&mpam_list_lock);
+
+	list_del_rcu(&class->classes_list);
+	add_to_garbage(class);
+}
+
+static void mpam_comp_destroy(struct mpam_component *comp)
+{
+	struct mpam_class *class = comp->class;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	list_del_rcu(&comp->class_list);
+	add_to_garbage(comp);
+
+	if (list_empty(&class->components))
+		mpam_class_destroy(class);
+}
+
+static void mpam_vmsc_destroy(struct mpam_vmsc *vmsc)
+{
+	struct mpam_component *comp = vmsc->comp;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	list_del_rcu(&vmsc->comp_list);
+	add_to_garbage(vmsc);
+
+	if (list_empty(&comp->vmsc))
+		mpam_comp_destroy(comp);
+}
+
+static void mpam_ris_destroy(struct mpam_msc_ris *ris)
+{
+	struct mpam_vmsc *vmsc = ris->vmsc;
+	struct mpam_msc *msc = vmsc->msc;
+	struct platform_device *pdev = msc->pdev;
+	struct mpam_component *comp = vmsc->comp;
+	struct mpam_class *class = comp->class;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity);
+	cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity);
+	clear_bit(ris->ris_idx, msc->ris_idxs);
+	list_del_rcu(&ris->vmsc_list);
+	list_del_rcu(&ris->msc_list);
+	add_to_garbage(ris);
+	ris->garbage.pdev = pdev;
+
+	if (list_empty(&vmsc->ris))
+		mpam_vmsc_destroy(vmsc);
+}
+
+/*
+ * There are two ways of reaching a struct mpam_msc_ris. Via the
+ * class->component->vmsc->ris, or via the msc.
+ * When destroying the msc, the other side needs unlinking and cleaning up too.
+ */
+static void mpam_msc_destroy(struct mpam_msc *msc)
+{
+	struct platform_device *pdev = msc->pdev;
+	struct mpam_msc_ris *ris, *tmp;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	list_del_rcu(&msc->glbl_list);
+	platform_set_drvdata(pdev, NULL);
+
+	list_for_each_entry_safe(ris, tmp, &msc->ris, msc_list)
+		mpam_ris_destroy(ris);
+
+	add_to_garbage(msc);
+	msc->garbage.pdev = pdev;
+}
+
+static void mpam_free_garbage(void)
+{
+	struct mpam_garbage *iter, *tmp;
+	struct llist_node *to_free = llist_del_all(&mpam_garbage);
+
+	if (!to_free)
+		return;
+
+	synchronize_srcu(&mpam_srcu);
+
+	llist_for_each_entry_safe(iter, tmp, to_free, llist) {
+		if (iter->pdev)
+			devm_kfree(&iter->pdev->dev, iter->to_free);
+		else
+			kfree(iter->to_free);
+	}
+}
+
+/* Called recursively to walk the list of caches from a particular CPU */
+static void __mpam_get_cpumask_from_cache_id(int cpu, struct device_node *cache_node,
+					     unsigned long cache_id,
+					     u32 cache_level,
+					     cpumask_t *affinity)
+{
+	int err;
+	u32 iter_level;
+	unsigned long iter_cache_id;
+	struct device_node *iter_node __free(device_node) = of_find_next_cache_node(cache_node);
+
+	if (!iter_node)
+		return;
+
+	err = of_property_read_u32(iter_node, "cache-level", &iter_level);
+	if (err)
+		return;
+
+	/*
+	 * get_cpu_cacheinfo_id() isn't ready until sometime
+	 * during device_initcall(). Use cache_of_calculate_id().
+	 */
+	iter_cache_id = cache_of_calculate_id(iter_node);
+	if (cache_id == ~0UL)
+		return;
+
+	if (iter_level == cache_level && iter_cache_id == cache_id)
+		cpumask_set_cpu(cpu, affinity);
+
+	__mpam_get_cpumask_from_cache_id(cpu, iter_node, cache_id, cache_level,
+					 affinity);
+}
+
+/*
+ * The cacheinfo structures are only populated when CPUs are online.
+ * This helper walks the device tree to include offline CPUs too.
+ */
+int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level,
+				   cpumask_t *affinity)
+{
+	int cpu;
+
+	if (!acpi_disabled)
+		return acpi_pptt_get_cpumask_from_cache_id(cache_id, affinity);
+
+	for_each_possible_cpu(cpu) {
+		struct device_node *cpu_node __free(device_node) = of_get_cpu_node(cpu, NULL);
+		if (!cpu_node) {
+			pr_err("Failed to find cpu%d device node\n", cpu);
+			return -ENOENT;
+		}
+
+		__mpam_get_cpumask_from_cache_id(cpu, cpu_node, cache_id,
+						 cache_level, affinity);
+			continue;
+	}
+
+	return 0;
+}
+
+/*
+ * cpumask_of_node() only knows about online CPUs. This can't tell us whether
+ * a class is represented on all possible CPUs.
+ */
+static void get_cpumask_from_node_id(u32 node_id, cpumask_t *affinity)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (node_id == cpu_to_node(cpu))
+			cpumask_set_cpu(cpu, affinity);
+	}
+}
+
+static int get_cpumask_from_cache(struct device_node *cache,
+				  cpumask_t *affinity)
+{
+	int err;
+	u32 cache_level;
+	unsigned long cache_id;
+
+	err = of_property_read_u32(cache, "cache-level", &cache_level);
+	if (err) {
+		pr_err("Failed to read cache-level from cache node\n");
+		return -ENOENT;
+	}
+
+	cache_id = cache_of_calculate_id(cache);
+	if (cache_id == ~0UL) {
+		pr_err("Failed to calculate cache-id from cache node\n");
+		return -ENOENT;
+	}
+
+	return mpam_get_cpumask_from_cache_id(cache_id, cache_level, affinity);
+}
+
+static int mpam_ris_get_affinity(struct mpam_msc *msc, cpumask_t *affinity,
+				 enum mpam_class_types type,
+				 struct mpam_class *class,
+				 struct mpam_component *comp)
+{
+	int err;
+
+	switch (type) {
+	case MPAM_CLASS_CACHE:
+		err = mpam_get_cpumask_from_cache_id(comp->comp_id, class->level,
+						     affinity);
+		if (err)
+			return err;
+
+		if (cpumask_empty(affinity))
+			pr_warn_once("%s no CPUs associated with cache node",
+				     dev_name(&msc->pdev->dev));
+
+		break;
+	case MPAM_CLASS_MEMORY:
+		get_cpumask_from_node_id(comp->comp_id, affinity);
+		/* affinity may be empty for CPU-less memory nodes */
+		break;
+	case MPAM_CLASS_UNKNOWN:
+		return 0;
+	}
+
+	cpumask_and(affinity, affinity, &msc->accessibility);
+
+	return 0;
+}
+
+static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx,
+				  enum mpam_class_types type, u8 class_id,
+				  int component_id, gfp_t gfp)
+{
+	int err;
+	struct mpam_vmsc *vmsc;
+	struct mpam_msc_ris *ris;
+	struct mpam_class *class;
+	struct mpam_component *comp;
+
+	lockdep_assert_held(&mpam_list_lock);
+
+	if (test_and_set_bit(ris_idx, msc->ris_idxs))
+		return -EBUSY;
+
+	ris = devm_kzalloc(&msc->pdev->dev, sizeof(*ris), gfp);
+	if (!ris)
+		return -ENOMEM;
+	init_garbage(ris);
+
+	class = mpam_class_get(class_id, type, true, gfp);
+	if (IS_ERR(class))
+		return PTR_ERR(class);
+
+	comp = mpam_component_get(class, component_id, true, gfp);
+	if (IS_ERR(comp)) {
+		if (list_empty(&class->components))
+			mpam_class_destroy(class);
+		return PTR_ERR(comp);
+	}
+
+	vmsc = mpam_vmsc_get(comp, msc, true, gfp);
+	if (IS_ERR(vmsc)) {
+		if (list_empty(&comp->vmsc))
+			mpam_comp_destroy(comp);
+		return PTR_ERR(vmsc);
+	}
+
+	err = mpam_ris_get_affinity(msc, &ris->affinity, type, class, comp);
+	if (err) {
+		if (list_empty(&vmsc->ris))
+			mpam_vmsc_destroy(vmsc);
+		return err;
+	}
+
+	ris->ris_idx = ris_idx;
+	INIT_LIST_HEAD_RCU(&ris->vmsc_list);
+	ris->vmsc = vmsc;
+
+	cpumask_or(&comp->affinity, &comp->affinity, &ris->affinity);
+	cpumask_or(&class->affinity, &class->affinity, &ris->affinity);
+	list_add_rcu(&ris->vmsc_list, &vmsc->ris);
+
+	return 0;
+}
+
+int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx,
+		    enum mpam_class_types type, u8 class_id, int component_id)
+{
+	int err;
+
+	mutex_lock(&mpam_list_lock);
+	err = mpam_ris_create_locked(msc, ris_idx, type, class_id,
+				     component_id, GFP_KERNEL);
+	mutex_unlock(&mpam_list_lock);
+	if (err)
+		mpam_free_garbage();
+
+	return err;
+}
+
 static void mpam_discovery_complete(void)
 {
 	pr_err("Discovered all MSC\n");
@@ -179,7 +650,10 @@ static int update_msc_accessibility(struct mpam_msc *msc)
 		cpumask_copy(&msc->accessibility, cpu_possible_mask);
 		err = 0;
 	} else {
-		if (of_device_is_compatible(parent, "memory")) {
+		if (of_device_is_compatible(parent, "cache")) {
+			err = get_cpumask_from_cache(parent,
+						     &msc->accessibility);
+		} else if (of_device_is_compatible(parent, "memory")) {
 			cpumask_copy(&msc->accessibility, cpu_possible_mask);
 			err = 0;
 		} else {
@@ -209,11 +683,10 @@ static void mpam_msc_drv_remove(struct platform_device *pdev)
 
 	mutex_lock(&mpam_list_lock);
 	mpam_num_msc--;
-	platform_set_drvdata(pdev, NULL);
-	list_del_rcu(&msc->glbl_list);
-	synchronize_srcu(&mpam_srcu);
-	devm_kfree(&pdev->dev, msc);
+	mpam_msc_destroy(msc);
 	mutex_unlock(&mpam_list_lock);
+
+	mpam_free_garbage();
 }
 
 static int mpam_msc_drv_probe(struct platform_device *pdev)
@@ -230,6 +703,7 @@ static int mpam_msc_drv_probe(struct platform_device *pdev)
 			err = -ENOMEM;
 			break;
 		}
+		init_garbage(msc);
 
 		mutex_init(&msc->probe_lock);
 		mutex_init(&msc->part_sel_lock);
diff --git a/drivers/resctrl/mpam_internal.h b/drivers/resctrl/mpam_internal.h
index 07e0f240eaca..d49bb884b433 100644
--- a/drivers/resctrl/mpam_internal.h
+++ b/drivers/resctrl/mpam_internal.h
@@ -7,10 +7,27 @@
 #include <linux/arm_mpam.h>
 #include <linux/cpumask.h>
 #include <linux/io.h>
+#include <linux/llist.h>
 #include <linux/mailbox_client.h>
 #include <linux/mutex.h>
 #include <linux/resctrl.h>
 #include <linux/sizes.h>
+#include <linux/srcu.h>
+
+/*
+ * Structures protected by SRCU may not be freed for a surprising amount of
+ * time (especially if perf is running). To ensure the MPAM error interrupt can
+ * tear down all the structures, build a list of objects that can be gargbage
+ * collected once synchronize_srcu() has returned.
+ * If pdev is non-NULL, use devm_kfree().
+ */
+struct mpam_garbage {
+	/* member of mpam_garbage */
+	struct llist_node	llist;
+
+	void			*to_free;
+	struct platform_device	*pdev;
+};
 
 struct mpam_msc {
 	/* member of mpam_all_msc */
@@ -57,6 +74,80 @@ struct mpam_msc {
 
 	void __iomem		*mapped_hwpage;
 	size_t			mapped_hwpage_sz;
+
+	struct mpam_garbage	garbage;
+};
+
+struct mpam_class {
+	/* mpam_components in this class */
+	struct list_head	components;
+
+	cpumask_t		affinity;
+
+	u8			level;
+	enum mpam_class_types	type;
+
+	/* member of mpam_classes */
+	struct list_head	classes_list;
+
+	struct mpam_garbage	garbage;
+};
+
+struct mpam_component {
+	u32			comp_id;
+
+	/* mpam_vmsc in this component */
+	struct list_head	vmsc;
+
+	cpumask_t		affinity;
+
+	/* member of mpam_class:components */
+	struct list_head	class_list;
+
+	/* parent: */
+	struct mpam_class	*class;
+
+	struct mpam_garbage	garbage;
 };
 
+struct mpam_vmsc {
+	/* member of mpam_component:vmsc_list */
+	struct list_head	comp_list;
+
+	/* mpam_msc_ris in this vmsc */
+	struct list_head	ris;
+
+	/* All RIS in this vMSC are members of this MSC */
+	struct mpam_msc		*msc;
+
+	/* parent: */
+	struct mpam_component	*comp;
+
+	struct mpam_garbage	garbage;
+};
+
+struct mpam_msc_ris {
+	u8			ris_idx;
+
+	cpumask_t		affinity;
+
+	/* member of mpam_vmsc:ris */
+	struct list_head	vmsc_list;
+
+	/* member of mpam_msc:ris */
+	struct list_head	msc_list;
+
+	/* parent: */
+	struct mpam_vmsc	*vmsc;
+
+	struct mpam_garbage	garbage;
+};
+
+/* List of all classes - protected by srcu*/
+extern struct srcu_struct mpam_srcu;
+extern struct list_head mpam_classes;
+
+int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level,
+				   cpumask_t *affinity);
+
 #endif /* MPAM_INTERNAL_H */
diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h
index 0edefa6ba019..406a77be68cb 100644
--- a/include/linux/arm_mpam.h
+++ b/include/linux/arm_mpam.h
@@ -36,11 +36,7 @@ static inline int acpi_mpam_parse_resources(struct mpam_msc *msc,
 static inline int acpi_mpam_count_msc(void) { return -EINVAL; }
 #endif
 
-static inline int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx,
-				  enum mpam_class_types type, u8 class_id,
-				  int component_id)
-{
-	return -EINVAL;
-}
+int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx,
+		    enum mpam_class_types type, u8 class_id, int component_id);
 
 #endif /* __LINUX_ARM_MPAM_H */
-- 
2.20.1
Re: [PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described
Posted by Dave Martin 1 month ago
Hi,

> Subject: arm_mpam: Add the class and component structures for ris firmware described

Mangled subject line?


There is a fair intersection between the commit message and what the
patch does, but they don't quite seem to match up.

Some key issues like locking / object lifecycle management
and DT parsing (a bit of which, it appears, lives here too) are not
mentioned at all.

In lieu of a complete rewrite, it might be best to discard the
explanation of the various object types.  The comment in the code
speaks for itself, and looks clearer.

[...]

On Fri, Aug 22, 2025 at 03:29:53PM +0000, James Morse wrote:
> An MSC is a container of resources, each identified by their RIS index.
> Some RIS are described by firmware to provide their position in the system.
> Others are discovered when the driver probes the hardware.
> 
> To configure a resource it needs to be found by its class, e.g. 'L2'.
> There are two kinds of grouping, a class is a set of components, which
> are visible to user-space as there are likely to be multiple instances
> of the L2 cache. (e.g. one per cluster or package)
> 
> struct mpam_components are a set of struct mpam_vmsc. A vMSC groups the
> RIS in an MSC that control the same logical piece of hardware. (e.g. L2).
> This is to allow hardware implementations where two controls are presented
> as different RIS. Re-combining these RIS allows their feature bits to
> be or-ed. This structure is not visible outside mpam_devices.c
> 
> struct mpam_vmsc are then a set of struct mpam_msc_ris, which are not
> visible as each L2 cache may be composed of individual slices which need
> to be configured the same as the hardware is not able to distribute the
> configuration.
> 
> Add support for creating and destroying these structures.
> A gfp is passed as the structures may need creating when a new RIS entry
> is discovered when probing the MSC.
> 
> CC: Ben Horgan <ben.horgan@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Changes since RFC:
>  * removed a pr_err() debug message that crept in.
> ---
>  drivers/resctrl/mpam_devices.c  | 488 +++++++++++++++++++++++++++++++-
>  drivers/resctrl/mpam_internal.h |  91 ++++++
>  include/linux/arm_mpam.h        |   8 +-
>  3 files changed, 574 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
> index 71a1fb1a9c75..5baf2a8786fb 100644
> --- a/drivers/resctrl/mpam_devices.c
> +++ b/drivers/resctrl/mpam_devices.c
> @@ -20,7 +20,6 @@

[...]

> @@ -35,11 +34,483 @@
>  static DEFINE_MUTEX(mpam_list_lock);
>  static LIST_HEAD(mpam_all_msc);
>  
> -static struct srcu_struct mpam_srcu;
> +struct srcu_struct mpam_srcu;

Why expose this here?  This patch makes no use of the exposed symbol.

>  
>  /* MPAM isn't available until all the MSC have been probed. */
>  static u32 mpam_num_msc;
>  
> +/*
> + * An MSC is a physical container for controls and monitors, each identified by
> + * their RIS index. These share a base-address, interrupts and some MMIO
> + * registers. A vMSC is a virtual container for RIS in an MSC that control or
> + * monitor the same thing. Members of a vMSC are all RIS in the same MSC, but
> + * not all RIS in an MSC share a vMSC.
> + * Components are a group of vMSC that control or monitor the same thing but
> + * are from different MSC, so have different base-address, interrupts etc.
> + * Classes are the set components of the same type.
> + *
> + * The features of a vMSC is the union of the RIS it contains.
> + * The features of a Class and Component are the common subset of the vMSC
> + * they contain.
> + *
> + * e.g. The system cache may have bandwidth controls on multiple interfaces,
> + * for regulating traffic from devices independently of traffic from CPUs.
> + * If these are two RIS in one MSC, they will be treated as controlling
> + * different things, and will not share a vMSC/component/class.
> + *
> + * e.g. The L2 may have one MSC and two RIS, one for cache-controls another
> + * for bandwidth. These two RIS are members of the same vMSC.
> + *
> + * e.g. The set of RIS that make up the L2 are grouped as a component. These
> + * are sometimes termed slices. They should be configured the same, as if there
> + * were only one.
> + *
> + * e.g. The SoC probably has more than one L2, each attached to a distinct set
> + * of CPUs. All the L2 components are grouped as a class.
> + *
> + * When creating an MSC, struct mpam_msc is added to the all mpam_all_msc list,
> + * then linked via struct mpam_ris to a vmsc, component and class.
> + * The same MSC may exist under different class->component->vmsc paths, but the
> + * RIS index will be unique.
> + */

This description of the structures and how they relate to each other
seems OK (bearing in mind that I am already familiar with this stuff --
I can't speak for other people).

> +LIST_HEAD(mpam_classes);
> +
> +/* List of all objects that can be free()d after synchronise_srcu() */
> +static LLIST_HEAD(mpam_garbage);
> +
> +#define init_garbage(x)	init_llist_node(&(x)->garbage.llist)

[...]

> +#define add_to_garbage(x)				\
> +do {							\
> +	__typeof__(x) _x = x;				\

Nit:

= (x)

(for the paranoid)

> +	(_x)->garbage.to_free = (_x);			\

_x->garbage.to_free = _x;

(_x is an identifier, not a macro argument.  It can't get re-parsed as
something else -- assuming that there is not a #define for _x, but then
all bets would be off anyway.)

> +	llist_add(&(_x)->garbage.llist, &mpam_garbage);	\

&_x->...


[...]

> +static void mpam_ris_destroy(struct mpam_msc_ris *ris)
> +{
> +	struct mpam_vmsc *vmsc = ris->vmsc;
> +	struct mpam_msc *msc = vmsc->msc;
> +	struct platform_device *pdev = msc->pdev;
> +	struct mpam_component *comp = vmsc->comp;
> +	struct mpam_class *class = comp->class;
> +
> +	lockdep_assert_held(&mpam_list_lock);
> +
> +	cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity);
> +	cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity);

This is not the inverse of the cpumask_or()s in mpam_ris_create_locked(),
unless the the ris associated with each class and each component have
strictly disjoint affinity masks.  Is that checked anywhere, or should
it be impossible by construction?


But, thinking about it:

I wonder why we ever really need to do the teardown.  If we get an
error interrupt then we can just go into a sulk, spam dmesg a bit, put
the hardware into the most vanilla state that we can, and refuse to
manipulate it further.  But this only happens in the case of a software
or hardware *bug* (or, in a future world where we might implement
virtualisation, an uncontainable MPAM error triggered by a guest -- for
which tearing down the host MPAM would be an overreaction).

Trying to cleanly tear the MPAM driver down after such an error seems a
bit futile.

The MPAM resctrl glue could eventually be made into a module (though
not needed from day 1) -- which would allow for unloading resctrlfs if
that is eventually module-ised.  I think this wouldn't require the MPAM
devices backend to be torn down at any point, though (?)


If we can simplify or eliminate the teardown, does it simplify the
locking at all?  The garbage collection logic can also be dispensed
with if there is never any garbage.

Since MSCs etc. never disappear from the hardware, it feels like it
ought not to be necessary ever to remove items from any of these lists
except when trying to do a teardown (?)

(Putting the hardware into a quiecent state is not the same thing as
tearing down the data structures -- we do want to quiesce MPAM when
shutting down the kernel, as least for the kexec scenario.)

> +	clear_bit(ris->ris_idx, msc->ris_idxs);
> +	list_del_rcu(&ris->vmsc_list);
> +	list_del_rcu(&ris->msc_list);
> +	add_to_garbage(ris);
> +	ris->garbage.pdev = pdev;
> +
> +	if (list_empty(&vmsc->ris))
> +		mpam_vmsc_destroy(vmsc);
> +}

[...]

> +static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx,
> +				  enum mpam_class_types type, u8 class_id,
> +				  int component_id, gfp_t gfp)
> +{
> +	int err;
> +	struct mpam_vmsc *vmsc;
> +	struct mpam_msc_ris *ris;
> +	struct mpam_class *class;
> +	struct mpam_component *comp;
> +
> +	lockdep_assert_held(&mpam_list_lock);
> +
> +	if (test_and_set_bit(ris_idx, msc->ris_idxs))
> +		return -EBUSY;

Is it impossible by construction to get in here with an out-of-range
ris_idx?

To avoid the callers (i.e., ACPI) needing to understand the internal
limitations of this code, maybe it is worth having a check here (even
if technically redundant).

[...]

Cheers
---Dave
Re: [PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described
Posted by James Morse 3 weeks, 4 days ago
Hi Dave,

On 01/09/2025 12:09, Dave Martin wrote:
>> Subject: arm_mpam: Add the class and component structures for ris firmware described
> 
> Mangled subject line?

order words hard is.


> There is a fair intersection between the commit message and what the
> patch does, but they don't quite seem to match up.
> 
> Some key issues like locking / object lifecycle management
> and DT parsing (a bit of which, it appears, lives here too) are not
> mentioned at all.

I don't think everything needs mentioning - you have the diff for that.This should capture
the motivation, what you have and what you need to find, the grouping etc.


> In lieu of a complete rewrite, it might be best to discard the
> explanation of the various object types.  The comment in the code
> speaks for itself, and looks clearer.

Fair enough,

> [...]
> 
> On Fri, Aug 22, 2025 at 03:29:53PM +0000, James Morse wrote:
>> An MSC is a container of resources, each identified by their RIS index.
>> Some RIS are described by firmware to provide their position in the system.
>> Others are discovered when the driver probes the hardware.
>>
>> To configure a resource it needs to be found by its class, e.g. 'L2'.
>> There are two kinds of grouping, a class is a set of components, which
>> are visible to user-space as there are likely to be multiple instances
>> of the L2 cache. (e.g. one per cluster or package)
>>
>> struct mpam_components are a set of struct mpam_vmsc. A vMSC groups the
>> RIS in an MSC that control the same logical piece of hardware. (e.g. L2).
>> This is to allow hardware implementations where two controls are presented
>> as different RIS. Re-combining these RIS allows their feature bits to
>> be or-ed. This structure is not visible outside mpam_devices.c
>>
>> struct mpam_vmsc are then a set of struct mpam_msc_ris, which are not
>> visible as each L2 cache may be composed of individual slices which need
>> to be configured the same as the hardware is not able to distribute the
>> configuration.
>>
>> Add support for creating and destroying these structures.
>> A gfp is passed as the structures may need creating when a new RIS entry
>> is discovered when probing the MSC.

>> diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
>> index 71a1fb1a9c75..5baf2a8786fb 100644
>> --- a/drivers/resctrl/mpam_devices.c
>> +++ b/drivers/resctrl/mpam_devices.c
>> @@ -20,7 +20,6 @@
> 
> [...]
> 
>> @@ -35,11 +34,483 @@
>>  static DEFINE_MUTEX(mpam_list_lock);
>>  static LIST_HEAD(mpam_all_msc);
>>  
>> -static struct srcu_struct mpam_srcu;
>> +struct srcu_struct mpam_srcu;

> Why expose this here?  This patch makes no use of the exposed symbol.

The mpam_resctrl code needs to take it when it walks these lists. I don't want to change
it then because its just additional churn.


>>  /* MPAM isn't available until all the MSC have been probed. */
>>  static u32 mpam_num_msc;
>>  
>> +/*
>> + * An MSC is a physical container for controls and monitors, each identified by
>> + * their RIS index. These share a base-address, interrupts and some MMIO
>> + * registers. A vMSC is a virtual container for RIS in an MSC that control or
>> + * monitor the same thing. Members of a vMSC are all RIS in the same MSC, but
>> + * not all RIS in an MSC share a vMSC.
>> + * Components are a group of vMSC that control or monitor the same thing but
>> + * are from different MSC, so have different base-address, interrupts etc.
>> + * Classes are the set components of the same type.
>> + *
>> + * The features of a vMSC is the union of the RIS it contains.
>> + * The features of a Class and Component are the common subset of the vMSC
>> + * they contain.
>> + *
>> + * e.g. The system cache may have bandwidth controls on multiple interfaces,
>> + * for regulating traffic from devices independently of traffic from CPUs.
>> + * If these are two RIS in one MSC, they will be treated as controlling
>> + * different things, and will not share a vMSC/component/class.
>> + *
>> + * e.g. The L2 may have one MSC and two RIS, one for cache-controls another
>> + * for bandwidth. These two RIS are members of the same vMSC.
>> + *
>> + * e.g. The set of RIS that make up the L2 are grouped as a component. These
>> + * are sometimes termed slices. They should be configured the same, as if there
>> + * were only one.
>> + *
>> + * e.g. The SoC probably has more than one L2, each attached to a distinct set
>> + * of CPUs. All the L2 components are grouped as a class.
>> + *
>> + * When creating an MSC, struct mpam_msc is added to the all mpam_all_msc list,
>> + * then linked via struct mpam_ris to a vmsc, component and class.
>> + * The same MSC may exist under different class->component->vmsc paths, but the
>> + * RIS index will be unique.
>> + */
> 
> This description of the structures and how they relate to each other
> seems OK (bearing in mind that I am already familiar with this stuff --
> I can't speak for other people).

Great!


> [...]
> 
>> +#define add_to_garbage(x)				\
>> +do {							\
>> +	__typeof__(x) _x = x;				\
> 
> Nit:
> 
> = (x)
> 
> (for the paranoid)

Fixed,


>> +	(_x)->garbage.to_free = (_x);			\
> 
> _x->garbage.to_free = _x;
> 
> (_x is an identifier, not a macro argument.  It can't get re-parsed as
> something else -- assuming that there is not a #define for _x, but then
> all bets would be off anyway.)

>> +	llist_add(&(_x)->garbage.llist, &mpam_garbage);	\
> 
> &_x->...

Fixed,


> [...]
> 
>> +static void mpam_ris_destroy(struct mpam_msc_ris *ris)
>> +{
>> +	struct mpam_vmsc *vmsc = ris->vmsc;
>> +	struct mpam_msc *msc = vmsc->msc;
>> +	struct platform_device *pdev = msc->pdev;
>> +	struct mpam_component *comp = vmsc->comp;
>> +	struct mpam_class *class = comp->class;
>> +
>> +	lockdep_assert_held(&mpam_list_lock);
>> +
>> +	cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity);
>> +	cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity);
> 
> This is not the inverse of the cpumask_or()s in mpam_ris_create_locked(),
> unless the the ris associated with each class and each component have
> strictly disjoint affinity masks.  Is that checked anywhere, or should
> it be impossible by construction?

They should be disjoint. These bitmaps are built from firmware description of the cache
hierarchy. I don't think its possible to describe a situation where there are overlaps.

You can build a nonsense cache hierarchy, e.g. where CPU-0's L3 is CPU-6's L2, but if you
do the scheduler is going to complain when it tries to chose the scheduler domains. I
think this should be filed under "you've got bigger problems".  There is a check that
catches this in mpam_resctrl_pick_caches(), to see that all the CPUs are accounted for,
which is to avoid tasks that get lucky with task-placement managing to escape their
resource limit.


> But, thinking about it:
> 
> I wonder why we ever really need to do the teardown.  If we get an
> error interrupt then we can just go into a sulk, spam dmesg a bit, put
> the hardware into the most vanilla state that we can, and refuse to
> manipulate it further.  But this only happens in the case of a software
> or hardware *bug* (or, in a future world where we might implement
> virtualisation, an uncontainable MPAM error triggered by a guest -- for
> which tearing down the host MPAM would be an overreaction).

The good news is guests can't escape the PARTID virtualisation that the CPU does, so any
mess a guest manages to make is confined to that guest's PARTID range.


> Trying to cleanly tear the MPAM driver down after such an error seems a
> bit futile.
> 
> The MPAM resctrl glue could eventually be made into a module (though
> not needed from day 1) -- which would allow for unloading resctrlfs if
> that is eventually module-ised.  I think this wouldn't require the MPAM
> devices backend to be torn down at any point, though (?)

It would certainly be optional. kernfs->resctrl->mpam is the reason all this has to be
built-in. If that changes I'd aim for this to be a module.

All this free()ing was added so that the driver doesn't end up sitting on memory when it
isn't providing any usable feature. I have seen a platform where the error interrupt goes
off during boot, (I suspect firmware configures an out-of-range PARTID). On such a
platform any memory that isn't free()d is a waste.

But I agree its a small amount of memory.


> If we can simplify or eliminate the teardown, does it simplify the
> locking at all?  The garbage collection logic can also be dispensed
> with if there is never any garbage.

It wouldn't simplify the locking, only remove that deferred free()ing which is needed
because of SRCU.


> Since MSCs etc. never disappear from the hardware, it feels like it
> ought not to be necessary ever to remove items from any of these lists
> except when trying to do a teardown (?)

Unbinding the driver from an MSC is another case where this may be triggered via
mpam_msc_drv_remove(). If you look at the whole thing, mpam_ris_destroy() pokes
mpam_resctrl_teardown_class() to see if resctrl needs to be torn down.

I don't anticipate folk actually needing to do that. One Reasons is for VFIO - but this
kind of stuff has a performance impact on the hypervisor, so its unlikely to ever allow a
guest direct access to this kind of thing. Another reason is to load a more specific
driver, which sounds unlikely.


Ultimately this memory free-ing code is here because its the right thing to do.
I'd prefer to keep it as making this a loadable module would mean we have to do this.


> (Putting the hardware into a quiecent state is not the same thing as
> tearing down the data structures -- we do want to quiesce MPAM when
> shutting down the kernel, as least for the kexec scenario.)

> [...]
> 
>> +static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx,
>> +				  enum mpam_class_types type, u8 class_id,
>> +				  int component_id, gfp_t gfp)
>> +{
>> +	int err;
>> +	struct mpam_vmsc *vmsc;
>> +	struct mpam_msc_ris *ris;
>> +	struct mpam_class *class;
>> +	struct mpam_component *comp;
>> +
>> +	lockdep_assert_held(&mpam_list_lock);
>> +
>> +	if (test_and_set_bit(ris_idx, msc->ris_idxs))
>> +		return -EBUSY;
> 
> Is it impossible by construction to get in here with an out-of-range
> ris_idx?
> 
> To avoid the callers (i.e., ACPI) needing to understand the internal
> limitations of this code, maybe it is worth having a check here (even
> if technically redundant).

It's possible - but only if you mess up the firmware tables.
I'll add a check for this as its easy enough.


Thanks,

James
Re: [PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described
Posted by Dave Martin 3 weeks, 3 days ago
Hi,

On Mon, Sep 08, 2025 at 06:57:41PM +0100, James Morse wrote:
> Hi Dave,
> 
> On 01/09/2025 12:09, Dave Martin wrote:
> >> Subject: arm_mpam: Add the class and component structures for ris firmware described
> > 
> > Mangled subject line?
> 
> order words hard is.
> 
> 
> > There is a fair intersection between the commit message and what the
> > patch does, but they don't quite seem to match up.
> > 
> > Some key issues like locking / object lifecycle management
> > and DT parsing (a bit of which, it appears, lives here too) are not
> > mentioned at all.
> 
> I don't think everything needs mentioning - you have the diff for that.This should capture
> the motivation, what you have and what you need to find, the grouping etc.
> 
> 
> > In lieu of a complete rewrite, it might be best to discard the
> > explanation of the various object types.  The comment in the code
> > speaks for itself, and looks clearer.
> 
> Fair enough,

OK

[...]

> >> diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
> >> index 71a1fb1a9c75..5baf2a8786fb 100644
> >> --- a/drivers/resctrl/mpam_devices.c
> >> +++ b/drivers/resctrl/mpam_devices.c
> >> @@ -20,7 +20,6 @@
> > 
> > [...]
> > 
> >> @@ -35,11 +34,483 @@
> >>  static DEFINE_MUTEX(mpam_list_lock);
> >>  static LIST_HEAD(mpam_all_msc);
> >>  
> >> -static struct srcu_struct mpam_srcu;
> >> +struct srcu_struct mpam_srcu;
> 
> > Why expose this here?  This patch makes no use of the exposed symbol.
> 
> The mpam_resctrl code needs to take it when it walks these lists. I don't want to change
> it then because its just additional churn.

I guess this is harmless, but it's no help to the kernel, or to
reviewers...

[...]

> >> +/*
> >> + * An MSC is a physical container for controls and monitors, each identified by

[...]

> >> + * The same MSC may exist under different class->component->vmsc paths, but the
> >> + * RIS index will be unique.
> >> + */
> > 
> > This description of the structures and how they relate to each other
> > seems OK (bearing in mind that I am already familiar with this stuff --
> > I can't speak for other people).
> 
> Great!

OK

[...]

> >> +static void mpam_ris_destroy(struct mpam_msc_ris *ris)
> >> +{
> >> +	struct mpam_vmsc *vmsc = ris->vmsc;
> >> +	struct mpam_msc *msc = vmsc->msc;
> >> +	struct platform_device *pdev = msc->pdev;
> >> +	struct mpam_component *comp = vmsc->comp;
> >> +	struct mpam_class *class = comp->class;
> >> +
> >> +	lockdep_assert_held(&mpam_list_lock);
> >> +
> >> +	cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity);
> >> +	cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity);
> > 
> > This is not the inverse of the cpumask_or()s in mpam_ris_create_locked(),
> > unless the the ris associated with each class and each component have
> > strictly disjoint affinity masks.  Is that checked anywhere, or should
> > it be impossible by construction?
> 
> They should be disjoint. These bitmaps are built from firmware description of the cache
> hierarchy. I don't think its possible to describe a situation where there are overlaps.
> 
> You can build a nonsense cache hierarchy, e.g. where CPU-0's L3 is CPU-6's L2, but if you
> do the scheduler is going to complain when it tries to chose the scheduler domains. I
> think this should be filed under "you've got bigger problems".  There is a check that
> catches this in mpam_resctrl_pick_caches(), to see that all the CPUs are accounted for,
> which is to avoid tasks that get lucky with task-placement managing to escape their
> resource limit.

I guess that makes sense.

If the firmware description is formally a tree structure then it should
be impossible to end up with overlapping affinity masks.

Since this doesn't bite us until teardown-time in any case, I think
this probably doesn't need to be checked explicitly, unless we observe
actual problems.

A comment documenting this assumption may be worth having.


> > But, thinking about it:
> > 
> > I wonder why we ever really need to do the teardown.  If we get an
> > error interrupt then we can just go into a sulk, spam dmesg a bit, put
> > the hardware into the most vanilla state that we can, and refuse to
> > manipulate it further.  But this only happens in the case of a software
> > or hardware *bug* (or, in a future world where we might implement
> > virtualisation, an uncontainable MPAM error triggered by a guest -- for
> > which tearing down the host MPAM would be an overreaction).
> 
> The good news is guests can't escape the PARTID virtualisation that the CPU does, so any
> mess a guest manages to make is confined to that guest's PARTID range.
> 
> 
> > Trying to cleanly tear the MPAM driver down after such an error seems a
> > bit futile.
> > 
> > The MPAM resctrl glue could eventually be made into a module (though
> > not needed from day 1) -- which would allow for unloading resctrlfs if
> > that is eventually module-ised.  I think this wouldn't require the MPAM
> > devices backend to be torn down at any point, though (?)
> 
> It would certainly be optional. kernfs->resctrl->mpam is the reason all this has to be
> built-in. If that changes I'd aim for this to be a module.
> 
> All this free()ing was added so that the driver doesn't end up sitting on memory when it
> isn't providing any usable feature. I have seen a platform where the error interrupt goes

I guess that's reasonable, but this is only applies to hardware that
has MPAM but where it is either broken, or where it is unsuitable for
running Linux but Linux has been deployed on it anyway while still
leaving the ACPI tables intact.  This does not violate any
specification, but it seems of marginal benefit to introduce a load of
complexity just to safe a few K in this situation.  (Or do we get stuck,
unable to free the config and mbwu_state arrays?  Those don't count as
large on a server-class system, but they are about the "a few K"
magnitude.)

(Not that I'm not saying that teardown is something we shouldn't do --
rather, my point is: do we really need to do it now if it is subtle and
complex to make it work, or can this be a later addition?)

> off during boot, (I suspect firmware configures an out-of-range PARTID). On such a
> platform any memory that isn't free()d is a waste.
> 
> But I agree its a small amount of memory.
> 
> 
> > If we can simplify or eliminate the teardown, does it simplify the
> > locking at all?  The garbage collection logic can also be dispensed
> > with if there is never any garbage.
> 
> It wouldn't simplify the locking, only remove that deferred free()ing which is needed
> because of SRCU.

My point was that there is no need to defend against concurrent removal
if list entries if list entries are never removed.

> > Since MSCs etc. never disappear from the hardware, it feels like it
> > ought not to be necessary ever to remove items from any of these lists
> > except when trying to do a teardown (?)
> 
> Unbinding the driver from an MSC is another case where this may be triggered via
> mpam_msc_drv_remove(). If you look at the whole thing, mpam_ris_destroy() pokes
> mpam_resctrl_teardown_class() to see if resctrl needs to be torn down.
> 
> I don't anticipate folk actually needing to do that. One Reasons is for VFIO - but this
> kind of stuff has a performance impact on the hypervisor, so its unlikely to ever allow a
> guest direct access to this kind of thing. Another reason is to load a more specific
> driver, which sounds unlikely.
> 
> 
> Ultimately this memory free-ing code is here because its the right thing to do.
> I'd prefer to keep it as making this a loadable module would mean we have to do this.

I don't disagree with that: it is messy to retrofit teardown if it was
never considered in the initial design.

I guess that this all comes from my uncertainty about the object
lifecycles and locking behaviour.

I would still prefer to see this documented.  If the the documentation
would be too unwieldy or infrasible to write, this would suggest that
the code would benefit from simplification...


For the probe phase, or for teardown, I'm really not sure why it would
break anything to have a single Big MPAM Lock (however inelegant).

For the run phase (when resctrl and other clients of the driver are
able to use the driver), the discovered system properties and the
mappings onto resctrl resources are all static, and we don't seem to
need all this RCU stuff.

> > (Putting the hardware into a quiecent state is not the same thing as
> > tearing down the data structures -- we do want to quiesce MPAM when
> > shutting down the kernel, as least for the kexec scenario.)
> 
> > [...]
> > 
> >> +static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx,
> >> +				  enum mpam_class_types type, u8 class_id,
> >> +				  int component_id, gfp_t gfp)
> >> +{
> >> +	int err;
> >> +	struct mpam_vmsc *vmsc;
> >> +	struct mpam_msc_ris *ris;
> >> +	struct mpam_class *class;
> >> +	struct mpam_component *comp;
> >> +.
> >> +	lockdep_assert_held(&mpam_list_lock);
> >> +
> >> +	if (test_and_set_bit(ris_idx, msc->ris_idxs))
> >> +		return -EBUSY;
> > 
> > Is it impossible by construction to get in here with an out-of-range
> > ris_idx?
> > 
> > To avoid the callers (i.e., ACPI) needing to understand the internal
> > limitations of this code, maybe it is worth having a check here (even
> > if technically redundant).
> 
> It's possible - but only if you mess up the firmware tables.
> I'll add a check for this as its easy enough.

OK, suits me.

Cheers
---Dave
Re: [PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described
Posted by James Morse 3 weeks, 2 days ago
Hi Dave,

On 09/09/2025 12:28, Dave Martin wrote:
> On Mon, Sep 08, 2025 at 06:57:41PM +0100, James Morse wrote:
>> On 01/09/2025 12:09, Dave Martin wrote:
>>>> Subject: arm_mpam: Add the class and component structures for ris firmware described

>>>> diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
>>>> index 71a1fb1a9c75..5baf2a8786fb 100644
>>>> --- a/drivers/resctrl/mpam_devices.c
>>>> +++ b/drivers/resctrl/mpam_devices.c
>>>> @@ -20,7 +20,6 @@
>>>
>>> [...]
>>>
>>>> @@ -35,11 +34,483 @@
>>>>  static DEFINE_MUTEX(mpam_list_lock);
>>>>  static LIST_HEAD(mpam_all_msc);
>>>>  
>>>> -static struct srcu_struct mpam_srcu;
>>>> +struct srcu_struct mpam_srcu;
>>
>>> Why expose this here?  This patch makes no use of the exposed symbol.
>>
>> The mpam_resctrl code needs to take it when it walks these lists. I don't want to change
>> it then because its just additional churn.
> 
> I guess this is harmless, but it's no help to the kernel, or to
> reviewers...

A trade-off has to be made here. The series is too big to post in one go. driver/resctrl
is the obvious split - but until both arrive then there is no need for mpam_internal.h, or
really any of the driver as it doesn't have a user-space interface.
I can barf the other series on the list as an illustration - but I think that would just
frustrate people.


[...]

>>>> +static void mpam_ris_destroy(struct mpam_msc_ris *ris)
>>>> +{
>>>> +	struct mpam_vmsc *vmsc = ris->vmsc;
>>>> +	struct mpam_msc *msc = vmsc->msc;
>>>> +	struct platform_device *pdev = msc->pdev;
>>>> +	struct mpam_component *comp = vmsc->comp;
>>>> +	struct mpam_class *class = comp->class;
>>>> +
>>>> +	lockdep_assert_held(&mpam_list_lock);
>>>> +
>>>> +	cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity);
>>>> +	cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity);
>>>
>>> This is not the inverse of the cpumask_or()s in mpam_ris_create_locked(),
>>> unless the the ris associated with each class and each component have
>>> strictly disjoint affinity masks.  Is that checked anywhere, or should
>>> it be impossible by construction?
>>
>> They should be disjoint. These bitmaps are built from firmware description of the cache
>> hierarchy. I don't think its possible to describe a situation where there are overlaps.
>>
>> You can build a nonsense cache hierarchy, e.g. where CPU-0's L3 is CPU-6's L2, but if you
>> do the scheduler is going to complain when it tries to chose the scheduler domains. I
>> think this should be filed under "you've got bigger problems".  There is a check that
>> catches this in mpam_resctrl_pick_caches(), to see that all the CPUs are accounted for,
>> which is to avoid tasks that get lucky with task-placement managing to escape their
>> resource limit.
> 
> I guess that makes sense.
> 
> If the firmware description is formally a tree structure then it should
> be impossible to end up with overlapping affinity masks.
> 
> Since this doesn't bite us until teardown-time in any case, I think
> this probably doesn't need to be checked explicitly, unless we observe
> actual problems.
> 
> A comment documenting this assumption may be worth having.

Sure,


>>> But, thinking about it:
>>>
>>> I wonder why we ever really need to do the teardown.  If we get an
>>> error interrupt then we can just go into a sulk, spam dmesg a bit, put
>>> the hardware into the most vanilla state that we can, and refuse to
>>> manipulate it further.  But this only happens in the case of a software
>>> or hardware *bug* (or, in a future world where we might implement
>>> virtualisation, an uncontainable MPAM error triggered by a guest -- for
>>> which tearing down the host MPAM would be an overreaction).
>>
>> The good news is guests can't escape the PARTID virtualisation that the CPU does, so any
>> mess a guest manages to make is confined to that guest's PARTID range.
>>
>>
>>> Trying to cleanly tear the MPAM driver down after such an error seems a
>>> bit futile.
>>>
>>> The MPAM resctrl glue could eventually be made into a module (though
>>> not needed from day 1) -- which would allow for unloading resctrlfs if
>>> that is eventually module-ised.  I think this wouldn't require the MPAM
>>> devices backend to be torn down at any point, though (?)
>>
>> It would certainly be optional. kernfs->resctrl->mpam is the reason all this has to be
>> built-in. If that changes I'd aim for this to be a module.
>>
>> All this free()ing was added so that the driver doesn't end up sitting on memory when it
>> isn't providing any usable feature. I have seen a platform where the error interrupt goes
> 
> I guess that's reasonable, but this is only applies to hardware that
> has MPAM but where it is either broken, or where it is unsuitable for
> running Linux but Linux has been deployed on it anyway while still
> leaving the ACPI tables intact.  This does not violate any
> specification, but it seems of marginal benefit to introduce a load of
> complexity just to safe a few K in this situation.  (Or do we get stuck,
> unable to free the config and mbwu_state arrays?  Those don't count as
> large on a server-class system, but they are about the "a few K"
> magnitude.)
> 
> (Not that I'm not saying that teardown is something we shouldn't do --
> rather, my point is: do we really need to do it now if it is subtle and
> complex to make it work, or can this be a later addition?)

Equally, can't someone say this memory has been leaked once the MPAM driver has given up.
As alloc/free were done together it seems to odd to do them at separate times - that will
certainly make it more subtle.


>> off during boot, (I suspect firmware configures an out-of-range PARTID). On such a
>> platform any memory that isn't free()d is a waste.
>>
>> But I agree its a small amount of memory.
>>
>>
>>> If we can simplify or eliminate the teardown, does it simplify the
>>> locking at all?  The garbage collection logic can also be dispensed
>>> with if there is never any garbage.
>>
>> It wouldn't simplify the locking, only remove that deferred free()ing which is needed
>> because of SRCU.
> 
> My point was that there is no need to defend against concurrent removal
> if list entries if list entries are never removed.

You can eyeball the writers are recognise the pattern as srcu. If it's an "oh that list is
read only" - then its much more of a driver specific hack.
I'd prefer to keep close to the srcu pattern - even if it is a bit complex.


>>> Since MSCs etc. never disappear from the hardware, it feels like it
>>> ought not to be necessary ever to remove items from any of these lists
>>> except when trying to do a teardown (?)
>>
>> Unbinding the driver from an MSC is another case where this may be triggered via
>> mpam_msc_drv_remove(). If you look at the whole thing, mpam_ris_destroy() pokes
>> mpam_resctrl_teardown_class() to see if resctrl needs to be torn down.
>>
>> I don't anticipate folk actually needing to do that. One Reasons is for VFIO - but this
>> kind of stuff has a performance impact on the hypervisor, so its unlikely to ever allow a
>> guest direct access to this kind of thing. Another reason is to load a more specific
>> driver, which sounds unlikely.
>>
>>
>> Ultimately this memory free-ing code is here because its the right thing to do.
>> I'd prefer to keep it as making this a loadable module would mean we have to do this.
> 
> I don't disagree with that: it is messy to retrofit teardown if it was
> never considered in the initial design.
> 
> I guess that this all comes from my uncertainty about the object
> lifecycles and locking behaviour.
> 
> I would still prefer to see this documented.  If the the documentation
> would be too unwieldy or infrasible to write, this would suggest that
> the code would benefit from simplification...

Right - nothing describes the 'phases' the driver has, they just emerge.
I'll try and add that, but it won't be in time for v2.


> For the probe phase, or for teardown, I'm really not sure why it would
> break anything to have a single Big MPAM Lock (however inelegant).

That is broadly what mpam_list_lock is doing before the cpuhp calls are registered.


> For the run phase (when resctrl and other clients of the driver are
> able to use the driver), the discovered system properties and the
> mappings onto resctrl resources are all static, and we don't seem to
> need all this RCU stuff.

Iff we say "driver specific hack - read only list" - I think that is worse.
Making it srcu makes it recognisable, and lets us free the memory instead of leaking it.


Thanks,

James
Re: [PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described
Posted by Fenghua Yu 1 month ago
Hi, James,

On 8/22/25 08:29, James Morse wrote:
> An MSC is a container of resources, each identified by their RIS index.
> Some RIS are described by firmware to provide their position in the system.
> Others are discovered when the driver probes the hardware.
>
> To configure a resource it needs to be found by its class, e.g. 'L2'.
> There are two kinds of grouping, a class is a set of components, which
> are visible to user-space as there are likely to be multiple instances
> of the L2 cache. (e.g. one per cluster or package)
>
> struct mpam_components are a set of struct mpam_vmsc. A vMSC groups the
> RIS in an MSC that control the same logical piece of hardware. (e.g. L2).
> This is to allow hardware implementations where two controls are presented
> as different RIS. Re-combining these RIS allows their feature bits to
> be or-ed. This structure is not visible outside mpam_devices.c
>
> struct mpam_vmsc are then a set of struct mpam_msc_ris, which are not
> visible as each L2 cache may be composed of individual slices which need
> to be configured the same as the hardware is not able to distribute the
> configuration.
>
> Add support for creating and destroying these structures.
>
> A gfp is passed as the structures may need creating when a new RIS entry
> is discovered when probing the MSC.
>
> CC: Ben Horgan <ben.horgan@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Changes since RFC:
>   * removed a pr_err() debug message that crept in.
> ---
>   drivers/resctrl/mpam_devices.c  | 488 +++++++++++++++++++++++++++++++-
>   drivers/resctrl/mpam_internal.h |  91 ++++++
>   include/linux/arm_mpam.h        |   8 +-
>   3 files changed, 574 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
> index 71a1fb1a9c75..5baf2a8786fb 100644
> --- a/drivers/resctrl/mpam_devices.c
> +++ b/drivers/resctrl/mpam_devices.c
[SNIP]
> +static struct mpam_vmsc *
> +mpam_vmsc_alloc(struct mpam_component *comp, struct mpam_msc *msc, gfp_t gfp)
> +{
> +	struct mpam_vmsc *vmsc;
> +
> +	lockdep_assert_held(&mpam_list_lock);
> +
> +	vmsc = kzalloc(sizeof(*vmsc), gfp);
> +	if (!comp)

s/if (!cmp)/if (!vmsc)/


> +		return ERR_PTR(-ENOMEM);
> +	init_garbage(vmsc);
> +
> +	INIT_LIST_HEAD_RCU(&vmsc->ris);
> +	INIT_LIST_HEAD_RCU(&vmsc->comp_list);
> +	vmsc->comp = comp;
> +	vmsc->msc = msc;
> +
> +	list_add_rcu(&vmsc->comp_list, &comp->vmsc);
> +
> +	return vmsc;
> +}

Thanks.

-Fenghua
Re: [PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described
Posted by James Morse 3 weeks, 4 days ago
Hi Fenghua,

On 28/08/2025 02:29, Fenghua Yu wrote:
> On 8/22/25 08:29, James Morse wrote:
>> An MSC is a container of resources, each identified by their RIS index.
>> Some RIS are described by firmware to provide their position in the system.
>> Others are discovered when the driver probes the hardware.
>>
>> To configure a resource it needs to be found by its class, e.g. 'L2'.
>> There are two kinds of grouping, a class is a set of components, which
>> are visible to user-space as there are likely to be multiple instances
>> of the L2 cache. (e.g. one per cluster or package)
>>
>> struct mpam_components are a set of struct mpam_vmsc. A vMSC groups the
>> RIS in an MSC that control the same logical piece of hardware. (e.g. L2).
>> This is to allow hardware implementations where two controls are presented
>> as different RIS. Re-combining these RIS allows their feature bits to
>> be or-ed. This structure is not visible outside mpam_devices.c
>>
>> struct mpam_vmsc are then a set of struct mpam_msc_ris, which are not
>> visible as each L2 cache may be composed of individual slices which need
>> to be configured the same as the hardware is not able to distribute the
>> configuration.
>>
>> Add support for creating and destroying these structures.
>>
>> A gfp is passed as the structures may need creating when a new RIS entry
>> is discovered when probing the MSC.

>> diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
>> index 71a1fb1a9c75..5baf2a8786fb 100644
>> --- a/drivers/resctrl/mpam_devices.c
>> +++ b/drivers/resctrl/mpam_devices.c
> [SNIP]
>> +static struct mpam_vmsc *
>> +mpam_vmsc_alloc(struct mpam_component *comp, struct mpam_msc *msc, gfp_t gfp)
>> +{
>> +    struct mpam_vmsc *vmsc;
>> +
>> +    lockdep_assert_held(&mpam_list_lock);
>> +
>> +    vmsc = kzalloc(sizeof(*vmsc), gfp);
>> +    if (!comp)
> 
> s/if (!cmp)/if (!vmsc)/
> 
> 
>> +        return ERR_PTR(-ENOMEM);

Yup, that's a copy-and-paste typo. Fixed,


Thanks,

James