From: James Morse <james.morse@arm.com>
resctrl has two types of counters, NUMA-local and global. MPAM has only
bandwidth counters, but the position of the MSC may mean it counts
NUMA-local or global traffic.
But the topology information is not available.
Apply a heuristic: if the L2 or L3 supports bandwidth monitors, these are
probably NUMA-local. If the memory controller supports bandwidth monitors,
they are probably global.
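For example, on a hypothetical platform where both the L3 MSCs and the
memory-controller MSCs implement MBWU monitors, the L3 class would back
mbm_local_bytes and the memory-controller class would back mbm_total_bytes.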
This also allows us to assert that we don't have the same class backing two
different resctrl events.
Because the class or component backing the event may not be 'the L3', it is
necessary for mpam_resctrl_get_domain_from_cpu() to search the monitor
domains too. This matters most for 'monitor only' systems, where 'the L3'
control domains may be empty and the ctrl_comp pointer NULL.
resctrl expects there to be enough monitors for every possible control and
monitor group to have one. Such a system gets called 'free running' as the
monitors can be programmed once and left running. Any other platform will
need to emulate ABMC.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
---
Changes since rfc:
drop has_mbwu
Changes since v2:
Iterate over mpam_resctrl_dom directly (Jonathan)
Use for_each_mpam_resctrl_mon
---
drivers/resctrl/mpam_internal.h | 8 ++
drivers/resctrl/mpam_resctrl.c | 133 +++++++++++++++++++++++++++++++-
2 files changed, 139 insertions(+), 2 deletions(-)
diff --git a/drivers/resctrl/mpam_internal.h b/drivers/resctrl/mpam_internal.h
index 21cc776e57aa..1c5492008fe8 100644
--- a/drivers/resctrl/mpam_internal.h
+++ b/drivers/resctrl/mpam_internal.h
@@ -340,6 +340,14 @@ struct mpam_msc_ris {
struct mpam_resctrl_dom {
struct mpam_component *ctrl_comp;
+
+ /*
+ * There is no single mon_comp because different events may be backed
+ * by different class/components. mon_comp is indexed by the event
+ * number.
+ */
+ struct mpam_component *mon_comp[QOS_NUM_EVENTS];
+
struct rdt_ctrl_domain resctrl_ctrl_dom;
struct rdt_mon_domain resctrl_mon_dom;
};
diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c
index 5020a5faed96..14a8dcaf1366 100644
--- a/drivers/resctrl/mpam_resctrl.c
+++ b/drivers/resctrl/mpam_resctrl.c
@@ -68,6 +68,14 @@ static bool cdp_enabled;
static bool cacheinfo_ready;
static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready);
+/* Whether this num_mbwu_mon could result in a free-running system */
+static int __mpam_monitors_free_running(u16 num_mbwu_mon)
+{
+ if (num_mbwu_mon >= resctrl_arch_system_num_rmid_idx())
+ return resctrl_arch_system_num_rmid_idx();
+ return 0;
+}
+
bool resctrl_arch_alloc_capable(void)
{
return exposed_alloc_capable;
@@ -296,6 +304,26 @@ static bool cache_has_usable_csu(struct mpam_class *class)
return true;
}
+static bool class_has_usable_mbwu(struct mpam_class *class)
+{
+ struct mpam_props *cprops = &class->props;
+
+ if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops))
+ return false;
+
+ /*
+ * resctrl expects the bandwidth counters to be free running,
+ * which means we need as many monitors as resctrl has
+ * control/monitor groups.
+ */
+ if (__mpam_monitors_free_running(cprops->num_mbwu_mon)) {
+ pr_debug("monitors usable in free-running mode\n");
+ return true;
+ }
+
+ return false;
+}
+
/*
* Calculate the worst-case percentage change from each implemented step
* in the control.
@@ -599,7 +627,36 @@ static void mpam_resctrl_pick_counters(void)
break;
}
}
+
+ if (class_has_usable_mbwu(class) && topology_matches_l3(class)) {
+ pr_debug("class %u has usable MBWU, and matches L3 topology\n",
+ class->level);
+
+ /*
+ * MBWU counters may be 'local' or 'total' depending on
+ * where they are in the topology. Counters on caches
+ * are assumed to be local. If it's on the memory
+ * controller, it's assumed to be global.
+ */
+ switch (class->type) {
+ case MPAM_CLASS_CACHE:
+ counter_update_class(QOS_L3_MBM_LOCAL_EVENT_ID,
+ class);
+ break;
+ case MPAM_CLASS_MEMORY:
+ counter_update_class(QOS_L3_MBM_TOTAL_EVENT_ID,
+ class);
+ break;
+ default:
+ break;
+ }
+ }
}
+
+ /* Allocation of MBWU monitors assumes that the class is unique... */
+ if (mpam_resctrl_counters[QOS_L3_MBM_LOCAL_EVENT_ID].class)
+ WARN_ON_ONCE(mpam_resctrl_counters[QOS_L3_MBM_LOCAL_EVENT_ID].class ==
+ mpam_resctrl_counters[QOS_L3_MBM_TOTAL_EVENT_ID].class);
}
static int mpam_resctrl_control_init(struct mpam_resctrl_res *res)
@@ -942,6 +999,20 @@ static void mpam_resctrl_domain_insert(struct list_head *list,
list_add_tail_rcu(&new->list, pos);
}
+static struct mpam_component *find_component(struct mpam_class *class, int cpu)
+{
+ struct mpam_component *comp;
+
+ guard(srcu)(&mpam_srcu);
+ list_for_each_entry_srcu(comp, &class->components, class_list,
+ srcu_read_lock_held(&mpam_srcu)) {
+ if (cpumask_test_cpu(cpu, &comp->affinity))
+ return comp;
+ }
+
+ return NULL;
+}
+
static struct mpam_resctrl_dom *
mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res)
{
@@ -990,8 +1061,33 @@ mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res)
}
if (exposed_mon_capable) {
+ struct mpam_component *any_mon_comp = NULL;
+ struct mpam_resctrl_mon *mon;
+ enum resctrl_event_id eventid;
+
+ /*
+ * Even if the monitor domain is backed by a different
+ * component, the L3 component IDs need to be used... only
+ * there may be no ctrl_comp for the L3.
+ * Search each event's class list for a component with
+ * overlapping CPUs and set up the dom->mon_comp array.
+ */
+
+ for_each_mpam_resctrl_mon(mon, eventid) {
+ struct mpam_component *mon_comp;
+
+ if (!mon->class)
+ continue; /* dummy resource */
+
+ mon_comp = find_component(mon->class, cpu);
+ dom->mon_comp[eventid] = mon_comp;
+ if (mon_comp)
+ any_mon_comp = mon_comp;
+ }
+ WARN_ON_ONCE(!any_mon_comp);
+
mon_d = &dom->resctrl_mon_dom;
- mpam_resctrl_domain_hdr_init(cpu, ctrl_comp, &mon_d->hdr);
+ mpam_resctrl_domain_hdr_init(cpu, any_mon_comp, &mon_d->hdr);
mon_d->hdr.type = RESCTRL_MON_DOMAIN;
mpam_resctrl_domain_insert(&r->mon_domains, &mon_d->hdr);
err = resctrl_online_mon_domain(r, mon_d);
@@ -1013,6 +1109,35 @@ mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res)
return dom;
}
+/*
+ * We know all the monitors are associated with the L3, even if there are no
+ * controls and therefore no control component. Find the cache-id for the CPU
+ * and use that to search for existing resctrl domains.
+ * This relies on mpam_resctrl_pick_domain_id() using the L3 cache-id
+ * for anything that is not a cache.
+ */
+static struct mpam_resctrl_dom *mpam_resctrl_get_mon_domain_from_cpu(int cpu)
+{
+ u32 cache_id;
+ struct mpam_resctrl_dom *dom;
+ struct mpam_resctrl_res *l3 = &mpam_resctrl_controls[RDT_RESOURCE_L3];
+
+ lockdep_assert_cpus_held();
+
+ if (!l3->class)
+ return NULL;
+ cache_id = get_cpu_cacheinfo_id(cpu, 3);
+ if (cache_id == ~0)
+ return NULL;
+
+ list_for_each_entry_rcu(dom, &l3->resctrl_res.mon_domains, resctrl_mon_dom.hdr.list) {
+ if (dom->resctrl_mon_dom.hdr.id == cache_id)
+ return dom;
+ }
+
+ return NULL;
+}
+
static struct mpam_resctrl_dom *
mpam_resctrl_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res)
{
@@ -1026,7 +1151,11 @@ mpam_resctrl_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res)
return dom;
}
- return NULL;
+ if (r->rid != RDT_RESOURCE_L3)
+ return NULL;
+
+ /* Search the mon domain list too - needed on monitor only platforms. */
+ return mpam_resctrl_get_mon_domain_from_cpu(cpu);
}
int mpam_resctrl_online_cpu(unsigned int cpu)
--
2.43.0
Hi Ben,

On Mon, Jan 12, 2026 at 6:02 PM Ben Horgan <ben.horgan@arm.com> wrote:
>
> From: James Morse <james.morse@arm.com>
>
> resctrl has two types of counters, NUMA-local and global. MPAM has only
> bandwidth counters, but the position of the MSC may mean it counts
> NUMA-local, or global traffic.
>
> But the topology information is not available.
>
> Apply a heuristic: the L2 or L3 supports bandwidth monitors, these are
> probably NUMA-local. If the memory controller supports bandwidth monitors,
> they are probably global.

Are remote memory accesses not cached? How do we know an MBWU monitor
residing on a cache won't count remote traffic?

Thanks,
-Peter
Hi Peter,

On 15/01/2026 15:49, Peter Newman wrote:
> On Mon, Jan 12, 2026 at 6:02 PM Ben Horgan <ben.horgan@arm.com> wrote:
>> [...]
>> Apply a heuristic: the L2 or L3 supports bandwidth monitors, these are
>> probably NUMA-local. If the memory controller supports bandwidth monitors,
>> they are probably global.
>
> Are remote memory accesses not cached? How do we know an MBWU monitor
> residing on a cache won't count remote traffic?

It will, yes you get double counting. Is forbidding both mbm_total and
mbm_local preferable?

I think this comes from 'total' in mbm_total not really having the obvious
meaning of the word:
If I have CPUs in NUMA-A and no memory controllers, then NUMA-B has no CPUs,
and all the memory-controllers.
With MPAM: we've only got one bandwidth counter, it doesn't know where the
traffic goes after the MSC. mbm-local on the L3 would reflect all the
bandwidth, and mbm-total on the memory-controllers would have the same number.
I think on x86 mbm_local on the CPUs would read zero as zero traffic went to
the 'local' memory controller, and mbm_total would reflect all the memory
bandwidth. (so 'total' really means 'other')

I think what MPAM is doing here is still useful as a system normally has both
CPUs and memory controllers in the NUMA nodes, and you can use this to spot a
control/monitor group on a NUMA-node that is hammering all the memory (outlier
mbm_local), or the same where a NUMA-node's memory controller is getting
hammered by all the NUMA nodes (outlier mbm_total)

I've not heard of a platform with both memory bandwidth monitors at L3 and
the memory controller, so this may be a theoretical issue.

Shall we only expose one of mbm-local/total to prevent this being seen by
user-space?

Thanks,
James
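To put invented numbers on James's scenario: a monitor group generating
10GB/s of L3 misses, all served from NUMA-B's memory, would be reported by
MPAM as mbm_local = 10GB/s (L3 MSC) and mbm_total = 10GB/s (memory-controller
MSC), i.e. the same traffic counted twice, whereas the x86 behaviour described
above would be mbm_local = 0 and mbm_total = 10GB/s.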
Hi James,

On Mon, Jan 19, 2026 at 1:04 PM James Morse <james.morse@arm.com> wrote:
> [...]
> It will, yes you get double counting. Is forbidding both mbm_total and
> mbm_local preferable?
>
> I think this comes from 'total' in mbm_total not really having the obvious
> meaning of the word:
> [...]
> I think on x86 mbm_local on the CPUs would read zero as zero traffic went to
> the 'local' memory controller, and mbm_total would reflect all the memory
> bandwidth. (so 'total' really means 'other')

Our software is going off the definition from the Intel SDM:

"This event monitors the L3 external bandwidth satisfied by the local
memory. In most platforms that support this event, L3 requests are
likely serviced by a memory system with non-uniform memory
architecture. This allows bandwidth to off-package memory resources to
be tracked by subtracting local from total bandwidth (for instance,
bandwidth over QPI to a memory controller on another physical
processor could be tracked by subtraction).

On NUMA-capable hardware that can support this event where all memory
is local, mbm_local == mbm_total, but in practice you can't read them
at the same time from userspace, so if you read mbm_total first,
you'll probably get a small negative result for remote bandwidth.

> [...]
> I've not heard of a platform with both memory bandwidth monitors at L3 and
> the memory controller, so this may be a theoretical issue.
>
> Shall we only expose one of mbm-local/total to prevent this being seen by
> user-space?

I believe in the current software design, MPAM is only able to support
mbm_total, as an individual MSC (or class of MSCs with the same
configuration) can't separate traffic by destination, so it must be
the combined value. On a hardware design where MSCs were placed such
that one only counts local traffic and another only counts remote, the
resctrl MPAM driver would have to understand the hardware
configuration well enough to be able to produce counts following
Intel's definition of mbm_local and mbm_total.

Thanks,
-Peter
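As a worked illustration of the SDM definitions (numbers invented): reading
mbm_total = 12GB/s and mbm_local = 9GB/s implies roughly 3GB/s of off-package
bandwidth via the total - local subtraction; when all memory is local the two
counts are equal, and because the two files cannot be read atomically the
subtraction can come out slightly negative, as Peter notes.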
Hi Peter, James,

On 1/19/26 12:47, Peter Newman wrote:
> [...]
> Our software is going off the definition from the Intel SDM:
>
> "This event monitors the L3 external bandwidth satisfied by the local
> memory. In most platforms that support this event, L3 requests are
> likely serviced by a memory system with non-uniform memory
> architecture. This allows bandwidth to off-package memory resources to
> be tracked by subtracting local from total bandwidth (for instance,
> bandwidth over QPI to a memory controller on another physical
> processor could be tracked by subtraction).

Indeed we should base our discussion on the event definition in the
Intel SDM. For our reference, the description for the external bandwidth
monitoring event (mbm_total) is:

"This event monitors the L3 total external bandwidth to the next level
of the cache hierarchy, including all demand and prefetch misses from
the L3 to the next hierarchy of the memory system. In most platforms,
this represents memory bandwidth."

> [...]
> I believe in the current software design, MPAM is only able to support
> mbm_total, as an individual MSC (or class of MSCs with the same
> configuration) can't separate traffic by destination, so it must be
> the combined value. On a hardware design where MSCs were placed such
> that one only counts local traffic and another only counts remote, the
> resctrl MPAM driver would have to understand the hardware
> configuration well enough to be able to produce counts following
> Intel's definition of mbm_local and mbm_total.

On a system with MSC measuring memory bandwidth on the L3 caches these
MSC will measure all bandwidth to the next level of the memory hierarchy
which matches the definition of mbm_total. (We assume any MSC on an L3
is at the egress even though acpi/dt doesn't distinguish ingress and
egress.)

For MSC on memory controllers then they don't distinguish which L3 cache
the traffic came from and so unless there is a single L3 then we can't
use these memory bandwidth monitors as they count neither mbm_local nor
mbm_total. When there is a single L3 (and no higher level caches) then
it would match both mbm_total and mbm_local.

Hence, I agree we should just use mbm_total and update the heuristics
such that if the MSC are at the memory only consider them if there are
no higher caches and a single L3.

The introduction of ABMC muddies the waters as the "event_filter" file
defines the meaning of mbm_local and mbm_total. In order to handle this
file properly with MPAM, fs/resctrl changes are needed. We could either
make "event_filter" show the bits that correspond to the mbm counter and
unchangeable or decouple the "event_filter" part of ABMC from the
counter assignment part. As more work is needed to not break abi here
I'll drop the ABMC patches from the next respin of this series.

Thanks,
Ben
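A minimal sketch of the tightened heuristic described above, only to make the
rule concrete - it reuses the structures and helpers from the patch, but
mpam_single_l3_no_higher_caches() is an invented helper and nothing here is a
tested change:

static void mpam_resctrl_pick_mbwu_event(struct mpam_class *class)
{
	if (!class_has_usable_mbwu(class))
		return;

	switch (class->type) {
	case MPAM_CLASS_CACHE:
		/* At the egress of the L3, downstream traffic is mbm_total. */
		if (class->level == 3)
			counter_update_class(QOS_L3_MBM_TOTAL_EVENT_ID, class);
		break;
	case MPAM_CLASS_MEMORY:
		/*
		 * A memory-side MSC can't tell which L3 the traffic came
		 * from, so only use it when there is a single L3 and no
		 * higher-level caches.
		 */
		if (mpam_single_l3_no_higher_caches())
			counter_update_class(QOS_L3_MBM_TOTAL_EVENT_ID, class);
		break;
	default:
		break;
	}
}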
Hi Ben,

On Mon, Jan 26, 2026 at 5:00 PM Ben Horgan <ben.horgan@arm.com> wrote:
> [...]
> On a system with MSC measuring memory bandwidth on the L3 caches these
> MSC will measure all bandwidth to the next level of the memory hierarchy
> which matches the definition of mbm_total. (We assume any MSC on an L3
> is at the egress even though acpi/dt doesn't distinguish ingress and
> egress.)
>
> For MSC on memory controllers then they don't distinguish which L3 cache
> the traffic came from and so unless there is a single L3 then we can't
> use these memory bandwidth monitors as they count neither mbm_local nor
> mbm_total. When there is a single L3 (and no higher level caches) then
> it would match both mbm_total and mbm_local.

The text you quoted from Intel was in the context of the L3. I assume
if such an event were implemented at a different level of the memory
system, it would continue to refer to downstream bandwidth.

> Hence, I agree we should just use mbm_total and update the heuristics
> such that if the MSC are at the memory only consider them if there are
> no higher caches and a single L3.

That should be ok for now. If I see a system where this makes MBWU
counters inaccessible, we'll continue the discussion then.

> The introduction of ABMC muddies the waters as the "event_filter" file
> defines the meaning of mbm_local and mbm_total. In order to handle this
> file properly with MPAM, fs/resctrl changes are needed. We could either
> make "event_filter" show the bits that correspond to the mbm counter and
> unchangeable or decouple the "event_filter" part of ABMC from the
> counter assignment part. As more work is needed to not break abi here
> I'll drop the ABMC patches from the next respin of this series.

I would prefer if you can just leave out the event_filter or make it
unconfigurable on MPAM. The rest of the counter assignment seems to
work well.

Longer term, the event_filter interface is supposed to give us the
ability to define and name our own counter events, but we'll have to
find a way past the decision to define the event filters in terms
copy-pasted from an AMD manual.

Thanks,
-Peter
Hi Peter,

On 1/30/26 13:04, Peter Newman wrote:
> [...]
> The text you quoted from Intel was in the context of the L3. I assume
> if such an event were implemented at a different level of the memory
> system, it would continue to refer to downstream bandwidth.

Yes, that does seem reasonable. That cache level would have to match
with what is reported in resctrl too. I expect that would involve
adding a new entry in enum resctrl_scope.

> [...]
> That should be ok for now. If I see a system where this makes MBWU
> counters inaccessible, we'll continue the discussion then.

Good to know. I'm looking into tightening the heuristics in general.
Please shout if any of the changes in heuristics mean that any hardware
or features stop being usable.

> [...]
> I would prefer if you can just leave out the event_filter or make it
> unconfigurable on MPAM. The rest of the counter assignment seems to
> work well.

If there is an event_filter file it should show the "correct" values
and so just leaving it out would be the way to go. However, unless I'm
missing something even this requires changes in fs/resctrl. As such, I
think it's expedient to defer adding ABMC to the series until we have
decided what to do in fs/resctrl.

> Longer term, the event_filter interface is supposed to give us the
> ability to define and name our own counter events, but we'll have to
> find a way past the decision to define the event filters in terms
> copy-pasted from an AMD manual.

Thanks,
Ben