The control value parser for the MB resource currently coerces the
memory bandwidth percentage value from userspace to be an exact
multiple of the bw_gran parameter.
On MPAM systems, this results in somewhat worse-than-worst-case
rounding: bw_gran is in general only an approximation of the actual
hardware granularity on these systems, and the hardware bandwidth
allocation control value is not natively a percentage.  A further
conversion is therefore needed in the resctrl_arch_update_domains()
path, regardless of the conversion done at parse time.
Allow the arch to provide its own parse-time conversion that is
appropriate for the hardware, and move the existing conversion to x86.
This will avoid accumulated error from rounding the value twice on MPAM
systems.
Clarify the documentation, but avoid overly exact promises.
Clamping to bw_min and bw_max still feels generic: leave it in the core
code, for now.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
---
Based on v6.17-rc3.
Testing: the resctrl MBA and MBM tests pass on a random x86 machine,
as do the other tests except for the NONCONT_CAT tests, which do not
seem to be supported in my configuration (and have nothing to do with
the code touched by this patch).
Notes:
I put the x86 version out of line in order to avoid having to move
struct rdt_resource and its dependencies into resctrl_types.h -- which
would create a lot of diff noise. Schemata writes from userspace have
a high overhead in any case.
For MPAM the conversion will be a no-op, because the incoming
percentage from the core resctrl code needs to be converted to hardware
representation in the driver anyway.
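
For illustration only (not part of this patch), the anticipated
MPAM-side hook might end up as simple as the following sketch, since
the driver keeps the percentage as-is at parse time:

    u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
    {
            /*
             * Pass the percentage through unmodified: conversion to
             * the hardware control value happens later, in the
             * resctrl_arch_update_domains() path.
             */
            return val;
    }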
Perhaps _all_ the types should move to resctrl_types.h.
For now, I went for the smallest diffstat...
---
Documentation/filesystems/resctrl.rst | 7 +++----
arch/x86/include/asm/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++
fs/resctrl/ctrlmondata.c | 2 +-
include/linux/resctrl.h | 6 ++++++
5 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..a1d0469d6dfb 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -143,12 +143,11 @@ with respect to allocation:
user can request.
"bandwidth_gran":
- The granularity in which the memory bandwidth
+ The approximate granularity in which the memory bandwidth
percentage is allocated. The allocated
b/w percentage is rounded off to the next
- control step available on the hardware. The
- available bandwidth control steps are:
- min_bandwidth + N * bandwidth_gran.
+ control step available on the hardware. The available
+ steps are at least as small as this value.
"delay_linear":
Indicates if the delay scale is linear or
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index feb93b50e990..8bec2b9cc503 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -18,6 +18,8 @@
*/
#define X86_RESCTRL_EMPTY_CLOSID ((u32)~0)
+struct rdt_resource;
+
/**
* struct resctrl_pqr_state - State cache for the PQR MSR
* @cur_rmid: The cached Resource Monitoring ID
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 1189c0df4ad7..cf9b30b5df3c 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -16,9 +16,15 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/cpu.h>
+#include <linux/math.h>
#include "internal.h"
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
+{
+ return roundup(val, (unsigned long)r->membw.bw_gran);
+}
+
int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index d98e0d2de09f..c5e73b75aaa0 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
return false;
}
- *data = roundup(bw, (unsigned long)r->membw.bw_gran);
+ *data = resctrl_arch_round_bw(bw, r);
return true;
}
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 6fb4894b8cfd..5b2a555cf2dd 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
+/*
+ * Round a bandwidth control value to the nearest value acceptable to
+ * the arch code for resource r:
+ */
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
+
/*
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
--
2.34.1
Hi again,

On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:

[...]

> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
>
> Sounds like MPAM may be ready to start the schema parsing discussion
> again?  I understand that MPAM has a few more ways to describe memory
> bandwidth as well as cache portion partitioning.  Previously ([1] [2])
> James mused about exposing schema format to user space, which seems
> like a good idea for new schema.

On this topic, specifically:

My own ideas in this area are a little different, though I agree with
the general idea.

Bitmap controls are distinct from numeric values, but for numbers, I'm
not sure that distinguishing percentages from other values is required,
since this is really just a specific case of a linear scale.

I imagined a generic numeric schema, described by a set of files like
the following in a schema's info directory:

    min:   minimum value, e.g., 1
    max:   maximum value, e.g., 1023
    scale: value that corresponds to one unit
    unit:  quantified base unit, e.g., "100pc", "64MBps"
    map:   mapping function name

If s is the value written in a schemata entry and p is the
corresponding physical amount of resource, then

    min <= s <= max

and

    p = map(s / scale) * unit

One reason why I prefer this scaling scheme over the floating-point
approach is that it can be exact (at least for currently known
platforms), and it doesn't require a new floating-point parser/
formatter to be written for this one thing in the kernel (which I
suspect is likely to be error-prone and poorly defined around
subtleties such as rounding behaviour).

"map" anticipates non-linear ramps, but this is only really here as a
forwards compatibility get-out.  For now, this might just be set to
"none", meaning the identity mapping (i.e., a no-op).  This may shadow
the existing "delay_linear" parameter, but with more general
applicability if we need it.

The idea is that userspace reads the info files and then does the
appropriate conversions itself.  This might or might not be seen as a
burden, but it would give exact control over the hardware configuration
with a generic interface, with possibly greater precision than the
existing schemata allow (when the hardware supports it), and without
having to second-guess the rounding that the kernel may or may not do
on the values.

For RDT MBA, we might have:

    min:   10
    max:   100
    scale: 100
    unit:  100pc
    map:   none

The schemata entry

    MB: 0=10, 1=100

would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1.

For AMD SMBA, we might have:

    min:   1
    max:   100
    scale: 8
    unit:  1GBps

(if I've understood this correctly from resctrl.rst.)

For MPAM MBW_MAX with, say, 6 bits of resolution, we might have:

    min:   1
    max:   64
    scale: 64
    unit:  100pc
    map:   none

The schemata entry

    MB: 0=1, 1=64

would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1.  This would probably need to be a new schema,
since we already have "MB" mimicking x86.

Exposing the hardware scale in this way would give userspace precise
control (including in sub-1% increments on capable hardware), without
having to second-guess the way the kernel will round the values.
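
As a worked sketch of the userspace-side conversion for a linear
("map: none") schema, using the illustrative MPAM MBW_MAX parameters
above (the value of s is invented for the example):

    #include <stdio.h>

    /* Toy conversion: physical share = (s / scale) * unit. */
    int main(void)
    {
            unsigned int s = 16;            /* value written to schemata */
            unsigned int scale = 64;        /* from the proposed info file */
            double unit_pc = 100.0;         /* "100pc": one unit = 100% */

            /* 16/64 of 100% -> 25.000% */
            printf("requested bandwidth: %.3f%%\n",
                   (double)s / scale * unit_pc);
            return 0;
    }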
> Is this something MPAM is still considering?  For example, the minimum
> and maximum ranges that can be specified, is this something you already
> have some ideas for?  Have you perhaps considered Tony's RFD [3] that
> includes discussion on how to handle min/max ranges for bandwidth?

This seems to be a different thing.  I think James had some thoughts on
this already -- I haven't checked on his current idea, but one option
would be simply to expose this as two distinct schemata, say MB_MIN,
MB_MAX.

There's a question of how to cope with multiple different schemata
entries that shadow each other (i.e., control the same hardware
resource).

Would something like the following work?  A read from schemata might
produce something like this:

    MB: 0=50, 1=50
    # MB_HW: 0=32, 1=32
    # MB_MIN: 0=31, 1=31
    # MB_MAX: 0=32, 1=32

(Where MB_HW is the MPAM schema with 6-bit resolution that I
illustrated above, and MB_MIN and MB_MAX are similar schemata for the
specific MIN and MAX controls in the hardware.)

Userspace that does not understand the new entries would need to ignore
the commented lines, but can otherwise safely alter and write back the
schemata with the expected results.  The kernel would in turn ignore
the commented lines on write.  The commented lines are meaningful but
"inactive": they describe the current hardware configuration on read,
but (unless explicitly uncommented) won't change anything on write.

Software that understands the new entries can uncomment the conflicting
entries and write them back instead of (or in addition to) the
conflicting entries.  For example, userspace might write the following:

    MB_MIN: 0=16, 1=16
    MB_MAX: 0=32, 1=32

Which might then read back as follows:

    MB: 0=50, 1=50
    # MB_HW: 0=32, 1=32
    # MB_MIN: 0=16, 1=16
    # MB_MAX: 0=32, 1=32

I haven't tried to develop this idea further, for now.

I'd be interested in people's thoughts on it, though.

Cheers
---Dave
Hi Dave,

Just one correction ...

On 9/22/25 8:04 AM, Dave Martin wrote:
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:

[...]

> For AMD SMBA, we might have:
>
>     min:   1
>     max:   100
>     scale: 8
>     unit:  1GBps

Unfortunately not like this for AMD.  Initial support for AMD MBA set
max to a hardcoded 2048 [1] that was later [2] modified to learn max
from hardware.  Of course this broke resctrl as a generic interface and
I hope we learned enough since to not repeat this mistake nor give up
on MB and make its interface even worse by, for example, adding more
architecture specific input ranges.

Reinette

[1] commit 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
[2] commit 0976783bb123 ("x86/resctrl: Remove hard-coded memory bandwidth limit")
Hi Reinette,

On Fri, Sep 26, 2025 at 01:54:10PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> Just one correction ...

[...]

> > For AMD SMBA, we might have:
> >
> >     min:   1
> >     max:   100
> >     scale: 8
> >     unit:  1GBps
>
> Unfortunately not like this for AMD.  Initial support for AMD MBA set
> max to a hardcoded 2048 [1] that was later [2] modified to learn max
> from hardware.

[...]

The "100" was just picked randomly in my example.

Looking more carefully at the spec, it may make sense to have:

    min:   1
    max:   (1 << value of BW_LEN)
    scale: 8
    unit:  1GBps

(This max value corresponds to setting the "unlimited" bit in the
control MSR; the other bits of the bandwidth value are then ignored.
For this instance of the schema, programming the "max" value would be
expected to give the nearest approximation to unlimited bandwidth that
the hardware permits.)

While the memory system is under-utilised end-to-end, I would expect
throughput from a memory-bound job to scale linearly with the control
value, but the control level at which throughput starts to saturate
will depend on the pattern of load throughout the system.

This seems fundamentally different from percentage controls -- it looks
impossible to simulate proportional controls with absolute throughput
controls, or vice-versa (?)

Cheers
---Dave
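
As a worked instance of the conversion formula under these assumed SMBA
parameters (scale = 8, unit = 1GBps, map = none; the value of s is
invented for the example): writing s = 24 would request
p = (24 / 8) * 1GBps = 3GBps, while writing the advertised max of
(1 << BW_LEN) would request the hardware's "unlimited" setting.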
On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
> Hi again,
>
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:

[...]

> Would something like the following work?  A read from schemata might
> produce something like this:
>
>     MB: 0=50, 1=50
>     # MB_HW: 0=32, 1=32
>     # MB_MIN: 0=31, 1=31
>     # MB_MAX: 0=32, 1=32

[...]

> I'd be interested in people's thoughts on it, though.

Applying this to Intel's upcoming region-aware memory bandwidth that
supports 255 steps and h/w min/max limits:

We would have info files with "min = 1, max = 255" and a schemata
file that looks like this to legacy apps:

    MB: 0=50;1=75
    #MB_HW: 0=128;1=191
    #MB_MIN: 0=128;1=191
    #MB_MAX: 0=128;1=191

But a newer app that is aware of the extensions can write:

    # cat > schemata << 'EOF'
    MB_HW: 0=10
    MB_MIN: 0=10
    MB_MAX: 0=64
    EOF

which then reads back as:

    MB: 0=4;1=75
    #MB_HW: 0=10;1=191
    #MB_MIN: 0=10;1=191
    #MB_MAX: 0=64;1=191

with the legacy line updated with the rounded value of the MB_HW
supplied by the user.  10/255 = 3.921% ... so call it "4".

The region aware h/w supports separate bandwidth controls for each
region.  We could hope (or perhaps update the spec to define) that
region0 is always node-local DDR memory and keep the "MB" tag for
that.

Then use some other tag naming for other regions.  Remote DDR,
local CXL, remote CXL are the ones we think are next in the h/w
memory sequence.  But the "region" concept would allow for other
options as other memory technologies come into use.
Hi Tony,

Thanks for taking a look at this -- comments below.

On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:

[...]

> Applying this to Intel's upcoming region-aware memory bandwidth that
> supports 255 steps and h/w min/max limits:

Following the MPAM example, would you also expect:

    scale: 255
    unit:  100pc

...?

> We would have info files with "min = 1, max = 255" and a schemata
> file that looks like this to legacy apps:

[...]

> with the legacy line updated with the rounded value of the MB_HW
> supplied by the user.  10/255 = 3.921% ... so call it "4".

I'm suggesting that this always be rounded up, so that you have a
guarantee that the steps are no smaller than the reported value.

(In this case, round-up and round-to-nearest give the same answer
anyway, though!)

> The region aware h/w supports separate bandwidth controls for each
> region.  We could hope (or perhaps update the spec to define) that
> region0 is always node-local DDR memory and keep the "MB" tag for
> that.

Do you have concerns about existing software choking on the #-prefixed
lines?

> Then use some other tag naming for other regions.  Remote DDR,
> local CXL, remote CXL are the ones we think are next in the h/w
> memory sequence.  But the "region" concept would allow for other
> options as other memory technologies come into use.

Would it be reasonable just to have a set of these schema instances,
per region, so:

    MB_HW: ...      // implicitly region 0
    MB_HW_1: ...
    MB_HW_2: ...

etc.

Or, did you have something else in mind?

My thinking is that we avoid adding complexity in the schemata file if
we treat mapping these schema instances onto the hardware topology as
an orthogonal problem.  So long as we have unique names in the schemata
file, we can describe elsewhere what they relate to in the hardware.

Cheers
---Dave
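
A minimal sketch of the round-up conversion being suggested here, using
the 10/255 figures from Tony's example (the helper name is invented):

    #include <stdio.h>

    /* ceil(hw * 100 / q): round the hardware control value up to the
     * next whole percent for the legacy "MB" line. */
    static unsigned int hw_to_legacy_percent(unsigned int hw, unsigned int q)
    {
            return (hw * 100 + q - 1) / q;
    }

    int main(void)
    {
            /* 10/255 = 3.921%, rounded up: prints "MB: 0=4" */
            printf("MB: 0=%u\n", hw_to_legacy_percent(10, 255));
            return 0;
    }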
On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote:
> Hi Tony,
>
> Thanks for taking a look at this -- comments below.

[...]

> Following the MPAM example, would you also expect:
>
>     scale: 255
>     unit:  100pc
>
> ...?

Yes.  255 (or whatever "Q" value is provided in the ACPI table)
corresponds to no throttling, so 100% bandwidth.

[...]

> I'm suggesting that this always be rounded up, so that you have a
> guarantee that the steps are no smaller than the reported value.

Round up, rather than round-to-nearest, makes sense.  Though perhaps
only cosmetic, as I would be surprised if anyone has a mix of tools
looking at the legacy schemata lines while programming using the
direct h/w controls.

> Do you have concerns about existing software choking on the #-prefixed
> lines?

Do they even need a # prefix?  We already mix lines for multiple
resources in the schemata file with a separate prefix for each
resource.  The schemata file also allows writes to just update one
resource (or one domain in a single resource).  The schemata file
started with just "L3".  Then we added "L2", "MB", and "SMBA" with no
concern that the initial "L3" manipulating tools would be confused.

> Would it be reasonable just to have a set of these schema instances,
> per region, so:
>
>     MB_HW: ...      // implicitly region 0
>     MB_HW_1: ...
>     MB_HW_2: ...

Chen Yu is currently looking at putting the word "TIER" into the
name, since there's some precedent for describing memory in "tiers".

Whatever naming scheme is used, the important part is how will users
find out what each schemata line actually means/controls.

> My thinking is that we avoid adding complexity in the schemata file if
> we treat mapping these schema instances onto the hardware topology as
> an orthogonal problem.  So long as we have unique names in the
> schemata file, we can describe elsewhere what they relate to in the
> hardware.

Yes, exactly this.

-Tony
Hi Tony,

On Mon, Sep 29, 2025 at 09:37:41AM -0700, Luck, Tony wrote:
> On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote:

[...]

> Round up, rather than round-to-nearest, makes sense.  Though perhaps
> only cosmetic, as I would be surprised if anyone has a mix of tools
> looking at the legacy schemata lines while programming using the
> direct h/w controls.

Ack

[...]

> Do they even need a # prefix?  We already mix lines for multiple
> resources in the schemata file with a separate prefix for each
> resource.  The schemata file also allows writes to just update one
> resource (or one domain in a single resource).  The schemata file
> started with just "L3".  Then we added "L2", "MB", and "SMBA" with no
> concern that the initial "L3" manipulating tools would be confused.

The "#" thing is for backwards compatibility with old userspace that
might blindly "paste back" unknown entries when writing the schemata
file.  (See also my reply to Reinette [1].)

> Chen Yu is currently looking at putting the word "TIER" into the
> name, since there's some precedent for describing memory in "tiers".
>
> Whatever naming scheme is used, the important part is how will users
> find out what each schemata line actually means/controls.

Agreed.  That's a problem, but a separate one.

[...]

> > My thinking is that we avoid adding complexity in the schemata file
> > if we treat mapping these schema instances onto the hardware
> > topology as an orthogonal problem.  So long as we have unique names
> > in the schemata file, we can describe elsewhere what they relate to
> > in the hardware.
>
> Yes, exactly this.

OK, that's reassuring.

Cheers
---Dave

[1] https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/
Hi Dave,

On 9/29/25 6:56 AM, Dave Martin wrote:
> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
>> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:

...

>> The region aware h/w supports separate bandwidth controls for each
>> region.  We could hope (or perhaps update the spec to define) that
>> region0 is always node-local DDR memory and keep the "MB" tag for
>> that.
>
> Do you have concerns about existing software choking on the #-prefixed
> lines?

I am trying to understand the purpose of the #-prefix.  I see two
motivations for the #-prefix with the primary point that multiple
schema apply to the same resource.

1) Commented schema are "inactive"
   This is unclear to me.  In the MB example the commented lines show
   the finer grained controls.  Since the original MB resource is an
   approximation and the hardware must already be configured to support
   it, would the #-prefixed lines not show the actual "active"
   configuration?

2) Commented schema are "conflicting"
   The original proposal mentioned "write them back instead of (or in
   addition to) the conflicting entries".  I do not know how resctrl
   will be able to handle a user requesting a change to both "MB" and
   "MB_HW".  This seems like something that should fail?

On a high level it is not clear to me why the # prefix is needed.  As I
understand it, the schemata names will always be unique and the new
features made backward compatible to existing schemata names.  That is,
existing MB, L3, etc. will also have the new info files that describe
their values/ranges.  I expect that user space will ignore schema that
it is not familiar with, so the # prefix seems unnecessary?

I believe the motivation is to express a relationship between different
schema (you mentioned "shadow" initially).  I think this relationship
can be expressed clearly by using a namespace prefix (like "MB_" in the
examples).  This may help more when there are multiple schemata with
this format, where a #-prefix does not make obvious which resource is
shadowed.

>> Then use some other tag naming for other regions.  Remote DDR,
>> local CXL, remote CXL are the ones we think are next in the h/w
>> memory sequence.  But the "region" concept would allow for other
>> options as other memory technologies come into use.
>
> Would it be reasonable just to have a set of these schema instances,
> per region, so:
>
>     MB_HW: ...      // implicitly region 0
>     MB_HW_1: ...
>     MB_HW_2: ...
>
> etc.
>
> Or, did you have something else in mind?
>
> My thinking is that we avoid adding complexity in the schemata file if
> we treat mapping these schema instances onto the hardware topology as
> an orthogonal problem.  So long as we have unique names in the
> schemata file, we can describe elsewhere what they relate to in the
> hardware.

Agreed ... and "elsewhere" is expected to be unique depending on the
resource.

Reinette
Hi,

On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/29/25 6:56 AM, Dave Martin wrote:

[...]

> I am trying to understand the purpose of the #-prefix.  I see two
> motivations for the #-prefix with the primary point that multiple
> schema apply to the same resource.
>
> 1) Commented schema are "inactive"
>    This is unclear to me.  In the MB example the commented lines show
>    the finer grained controls.  Since the original MB resource is an
>    approximation and the hardware must already be configured to
>    support it, would the #-prefixed lines not show the actual
>    "active" configuration?

They would show the active configuration (possibly more precisely than
"MB" does).  If not, it's not clear how userspace that is trying to use
MB_HW (say) could read out the current configuration.

The # is intended to make resctrl ignore the lines when the file is
written by userspace.  This is done so that userspace has to actually
change those lines in order for them to take effect when writing.  Old
userspace can just pass them through without modification, without
anything unexpected happening.

The reason why I think that this convention may be needed is that we
never told (old) userspace what it was supposed to do with schemata
entries that it does not recognise.

> 2) Commented schema are "conflicting"
>    The original proposal mentioned "write them back instead of (or in
>    addition to) the conflicting entries".  I do not know how resctrl
>    will be able to handle a user requesting a change to both "MB" and
>    "MB_HW".  This seems like something that should fail?

If userspace is asking for two incompatible things at the same time, we
can either pick one of them and ignore the rest, or do nothing, or fail
explicitly.

If we think that it doesn't really matter what happens, then resctrl
could just dumbly process the entries in the order given.  If the
result is not what userspace wanted, that's not our problem.

(Today, nothing prevents userspace writing multiple "MB" lines at the
same time: resctrl will process them all, but only the final one will
have a lasting effect.  So, the fact that a resctrl write can contain
mutually incompatible requests does not seem to be new.)

> On a high level it is not clear to me why the # prefix is needed.  As
> I understand it, the schemata names will always be unique and the new
> features made backward compatible to existing schemata names.  That
> is, existing MB, L3, etc. will also have the new info files that
> describe their values/ranges.

Regarding backwards compatibility for the existing controls:

This proposal is only about numeric controls.  L3 wouldn't change, but
we could still add info/ metadata for bitmap controls at the same time
as adding it for numeric controls.

MB may be hard to describe in a useful way, though -- at least in the
MPAM case, where the number of steps does not divide into 100, and the
AMD cases where the meaning of the MB control values is different.  MB
and MB_HW are not interchangeable.

To obtain predictable results from MB, userspace would need to know
precisely how the kernel is going to round the value.  This feels like
an implementation detail that doesn't belong in the ABI.

I suppose we could also add a "granularity" entry in info/, but we have
the existing "bandwidth_gran" file for MB.  For any new schema, I don't
think we need to state the granularity: the other parameters can always
be adjusted so that the granularity is exactly 1.

Regarding the "#" (see below):

> I expect that user space will ignore schema that it is not familiar
> with, so the # prefix seems unnecessary?
>
> I believe the motivation is to express a relationship between
> different schema (you mentioned "shadow" initially).  I think this
> relationship can be expressed clearly by using a namespace prefix
> (like "MB_" in the examples).

An illustration would probably help, here.

Say resctrl has schemata MB, MB_HW, MB_MIN and MB_MAX, all of which
control (aspects of) the same underlying hardware resource.  Reading
the schemata file might yield:

    MB: 0=29
    MB_HW: 0=2
    MB_MIN: 0=1
    MB_MAX: 0=2

(I assume for this toy example that MB_{HW,MIN,MAX} ranges from 0 to 7.)

Now, suppose (current) userspace wants to change the allocated
bandwidth.  It only understands the "MB" line, but it is reasonable to
expect that writing the other lines back without modification will do
nothing.  (A human user might read the file, and tweak it through an
editor to modify just the entry of interest, run it through awk, etc.)

So, the user writes back:

    MB: 0=43
    MB_HW: 0=2
    MB_MIN: 0=1
    MB_MAX: 0=2

If resctrl just processes the entries in order, it will temporarily
change the bandwidth allocation due to the "MB" row, but will then
immediately change it back again due to the other rows.  Reading
schemata again now gives:

    MB: 0=29
    MB_HW: 0=2
    MB_MIN: 0=1
    MB_MAX: 0=2

We might be able to solve some problems of this sort by reordering the
entries, but I suspect that some software may give up as soon as it
sees an unfamiliar entry -- so it may be better to keep the classic
entries (like "MB") at the start.

Anyway, going back to the "#" convention:

If the initial read of schemata has the new entries "pre-commented",
then userspace wouldn't need to know about the new entries.  It could
just tweak the MB entry (which it knows about), and write the file
back:

    MB: 0=43
    # MB_HW: 0=2
    # MB_MIN: 0=1
    # MB_MAX: 0=2

then resctrl knows to ignore the hashed lines, and so reading the file
back gives:

    MB: 0=43
    # MB_HW: 0=3
    # MB_MIN: 0=2
    # MB_MAX: 0=3

(For hardware-specific reasons, the MPAM driver currently internally
programs the MIN bound to be a bit less than the MAX bound, when
userspace writes an "MB" entry into schemata.  The key thing is that
writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
resctrl level, I don't think that we necessarily need to make promises
about what they can change _to_.  The exact effect of MIN and MAX
bounds is likely to be hardware-dependent anyway.)

Regarding new userspace:

Going forward, we can explicitly document that there should be no
conflicting or "passenger" entries in a schemata write: don't include
an entry for something that you don't explicitly want to set, and if
multiple entries affect the same resource, we don't promise what
happens.  (But sadly, we can't impose that rule on existing software
after the fact.)
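
A toy model of the write-side filtering described above (the behaviour
is printed rather than applied; this is a stand-in, not the real
resctrl parser):

    #include <stdio.h>
    #include <string.h>

    /* Apply only un-commented entries; '#'-prefixed lines are
     * inactive, so old userspace can paste them back without effect. */
    static void schemata_write(char *buf)
    {
            char *line = strtok(buf, "\n");

            while (line) {
                    if (line[0] == '#')
                            printf("ignored (inactive): %s\n", line);
                    else
                            printf("would apply:        %s\n", line);
                    line = strtok(NULL, "\n");
            }
    }

    int main(void)
    {
            char buf[] = "MB: 0=43\n"
                         "# MB_HW: 0=2\n"
                         "# MB_MIN: 0=1\n"
                         "# MB_MAX: 0=2";

            schemata_write(buf);
            return 0;
    }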
One final note: I have not provided any way to indicate that all those
entries control the same hardware resource.  The common "MB" prefix is
intended as a clue, but ultimately, userspace needs to know what an
entry controls before tweaking it.  We could try to describe the
relationships explicitly, but I'm not sure that it is useful...

> >> Then use some other tag naming for other regions.  Remote DDR,
> >> local CXL, remote CXL are the ones we think are next in the h/w
> >> memory sequence.  But the "region" concept would allow for other
> >> options as other memory technologies come into use.

[...]

> > My thinking is that we avoid adding complexity in the schemata file
> > if we treat mapping these schema instances onto the hardware
> > topology as an orthogonal problem.  So long as we have unique names
> > in the schemata file, we can describe elsewhere what they relate to
> > in the hardware.
>
> Agreed ... and "elsewhere" is expected to be unique depending on the
> resource.
>
> Reinette

Yes

Cheers
---Dave
On 9/26/2025 6:58 AM, Luck, Tony wrote:
> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:

[snip]

> But a newer app that is aware of the extensions can write:
>
>     # cat > schemata << 'EOF'
>     MB_HW: 0=10
>     MB_MIN: 0=10
>     MB_MAX: 0=64
>     EOF
>
> which then reads back as:
>
>     MB: 0=4;1=75
>     #MB_HW: 0=10;1=191
>     #MB_MIN: 0=10;1=191
>     #MB_MAX: 0=64;1=191
>
> with the legacy line updated with the rounded value of the MB_HW
> supplied by the user.  10/255 = 3.921% ... so call it "4".

This seems to be applicable as it introduces the new interface while
preserving forward compatibility.

One minor question is that, according to "Figure 6-5. MBA Optimal
Bandwidth Register" in the latest RDT specification, the maximum value
ranges from 1 to 511.  Additionally, this bandwidth field is located at
bits 48 to 56 in the MBA Optimal Bandwidth Register, and the range for
this segment could be 1 to 8191.  Just wonder if it would be possible
that the current maximum value of 512 may be extended in the future?
Perhaps we could explore a method to query the maximum upper limit from
the ACPI table or register, or use CPUID to distinguish between
platforms rather than hardcoding it.  Reinette also mentioned this in
another thread.

Thanks,
Chenyu

[1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html
Hi there,

On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote:
> On 9/26/2025 6:58 AM, Luck, Tony wrote:

[...]

> One minor question is that, according to "Figure 6-5. MBA Optimal
> Bandwidth Register" in the latest RDT specification, the maximum
> value ranges from 1 to 511.  Additionally, this bandwidth field is
> located at bits 48 to 56 in the MBA Optimal Bandwidth Register, and
> the range for this segment could be 1 to 8191.  Just wonder if it
> would be possible that the current maximum value of 512 may be
> extended in the future?  Perhaps we could explore a method to query
> the maximum upper limit from the ACPI table or register, or use CPUID
> to distinguish between platforms rather than hardcoding it.  Reinette
> also mentioned this in another thread.

I can't comment on the direction of travel in the RDT architecture.

I guess it would be up to the arch code whether to trust ACPI if it
says that the maximum value of this field is > 511.  (> 65535 would be
impossible though, since the fields would start to overlap each
other...)

Would anything break in the interface proposed here, if the maximum
value is larger than 511?  (I'm hoping not.  For MPAM, the bandwidth
controls can have up to 16 bits and the size can be probed through
MMIO registers.)

I don't think we've seen MPAM hardware that comes close to 16 bits for
now, though.

Cheers
---Dave
On 9/29/2025 10:13 PM, Dave Martin wrote:
> Hi there,
>
> On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote:

[...]

> I guess it would be up to the arch code whether to trust ACPI if it
> says that the maximum value of this field is > 511.  (> 65535 would be
> impossible though, since the fields would start to overlap each
> other...)
>
> Would anything break in the interface proposed here, if the maximum
> value is larger than 511?  (I'm hoping not.  For MPAM, the bandwidth
> controls can have up to 16 bits and the size can be probed through
> MMIO registers.)

I overlooked this bit width.  It should not exceed 511 according to the
RDT spec.  Previously, I was just wondering how to calculate the legacy
MB percentage in Tony's example.  If we want to keep consistency: if
the user provides a value of 10, what is the denominator?  Is it 255,
511, or something queried from ACPI?

    MB: 0=4;1=75          <--- 10/255
    #MB_HW: 0=10;1=191
    #MB_MIN: 0=10;1=191
    #MB_MAX: 0=64;1=191

or

    MB: 0=1;1=75          <--- 10/511
    #MB_HW: 0=10;1=191
    #MB_MIN: 0=10;1=191
    #MB_MAX: 0=64;1=191

thanks,
Chenyu
Hi,

On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote:
> On 9/29/2025 10:13 PM, Dave Martin wrote:

[...]

> I overlooked this bit width.  It should not exceed 511 according to
> the RDT spec.  Previously, I was just wondering how to calculate the
> legacy MB percentage in Tony's example.  If we want to keep
> consistency: if the user provides a value of 10, what is the
> denominator?  Is it 255, 511, or something queried from ACPI?

The denominator (the "scale" parameter in my model, though the name is
unimportant) should be whatever quantity of resource is specified in
the "unit" parameter.

For "percentage" type controls, I'd expect the unit to be 100% ("100pc"
in my syntax).

So, Tony's suggestion looks plausible to me [1]:

| Yes.  255 (or whatever "Q" value is provided in the ACPI table)
| corresponds to no throttling, so 100% bandwidth.

So, if ACPI says Q=387, that's the denominator we advertise.

Does that sound right?

Question: is this a global parameter, or per-CPU?

From the v1.2 RDT spec, it looks like it is a single, global parameter.
I hope this is true (!)  But I'm not too familiar with these specs...

Cheers
---Dave

[1] https://lore.kernel.org/lkml/aNq11fmlac6dH4pH@agluck-desk3/
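
To make the hypothetical Q=387 concrete: the info files would then
advertise scale = 387 and unit = 100pc, so a schemata value of s = 129
would request 129/387, i.e. exactly one third, of the unthrottled
bandwidth (the value of s is invented for the example).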
On 9/30/2025 11:55 PM, Dave Martin wrote:
> Hi,
>
> On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote:

[...]

> The denominator (the "scale" parameter in my model, though the name is
> unimportant) should be whatever quantity of resource is specified in
> the "unit" parameter.

[...]

> So, if ACPI says Q=387, that's the denominator we advertise.
>
> Does that sound right?

Yes, it makes sense, the denominator is the "scale" in your example.

> Question: is this a global parameter, or per-CPU?

It should be a global setting for all the MBA Register Blocks.

Thanks,
Chenyu
Hi there,

On Wed, Oct 01, 2025 at 08:13:45PM +0800, Chen, Yu C wrote:
> On 9/30/2025 11:55 PM, Dave Martin wrote:

[...]

> Yes, it makes sense, the denominator is the "scale" in your example.

Thanks for confirming that.

> > Question: is this a global parameter, or per-CPU?
>
> It should be a global setting for all the MBA Register Blocks.

That's good -- since resctrl resource controls are not per-CPU,
exposing the exact hardware resolution won't work unless the value
is scaled identically for all CPUs.

Cheers
---Dave
> > > So, if ACPI says Q=387, that's the denominator we advertise.
> > >
> > > Does that sound right?
> >
> > Yes, it makes sense, the denominator is the "scale" in your example.
>
> Thanks for confirming that.
>
> > > Question: is this a global parameter, or per-CPU?
> >
> > It should be a global setting for all the MBA Register Blocks.
>
> That's good -- since resctrl resource controls are not per-CPU,
> exposing the exact hardware resolution won't work unless the value
> is scaled identically for all CPUs.

The RDT architecture spec says there is a separate MARC table that
describes each instance on an L3 cache.  So in theory there could be
different "Q" values for each.

I'm chatting with the architects to point out that would be bad, and
they shouldn't build something that has different "Q" values on the
same system.

-Tony
> > This seems to be applicable as it introduces the new interface > > while preserving forward compatibility. > > > > One minor question is that, according to "Figure 6-5. MBA Optimal > > Bandwidth Register" in the latest RDT specification, the maximum > > value ranges from 1 to 511. > > Additionally, this bandwidth field is located at bits 48 to 56 in > > the MBA Optimal Bandwidth Register, and the range for > > this segment could be 1 to 8191. Just wonder if it would be 48..56 is still 9 bits, so max value is 511. > > possible that the current maximum value of 512 may be extended > > in the future? Perhaps we could explore a method to query the maximum upper > > limit from the ACPI table or register, or use CPUID to distinguish between > > platforms rather than hardcoding it. Reinette also mentioned this in another > > thread. I think 511 was chosen as "bigger than we expect to ever need" and 9-bits allocated in the registers based on that. Initial implementation may use 255 as the maximum - though I'm pushing on that a bit as the throttle graph at the early stage is fairly linear from "1" to some value < 255, when bandwidth hits maximum, then flat up to 255. If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI table should be the value where peak bandwidth is hit (though this is complicated because workloads with different mixes of read/write access have different throttle graphs). > > > > Thanks, > > Chenyu > > > > > > [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html > > I can't comment on the direction of travel in the RDT architecture. > > I guess it would be up to the arch code whether to trust ACPI if it > says that the maximum value of this field is > 511. (> 65535 would be > impossible though, since the fields would start to overlap each > other...) resctrl should do some sanity checks on values it sees in the ACPI tables. Linux has:

#define FW_BUG "[Firmware Bug]: "
#define FW_WARN "[Firmware Warn]: "
#define FW_INFO "[Firmware Info]: "

for good historical reasons. > > Would anything break in the interface proposed here, if the maximum > value is larger than 511? (I'm hoping not. For MPAM, the bandwidth > controls can have up to 16 bits and the size can be probed through MMIO > registers.) > > I don't think we've seen MPAM hardware that comes close to 16 bits for > now, though. While kernel code is sometimes space-conserving and uses u8/u16 types for values that fit in some limited range, I'd expect user applications that read the "info" files and program the "schemata" files to not care. Python integers have arbitrary precision, so would be just fine with: max 340282366920938463463374607431768211455 :-) -Tony
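The sanity-checking Tony mentions could look something like the hedged sketch below. The 511 ceiling comes from the 9-bit field discussed above; the function name, the parsing context, and the fall-back policy are all hypothetical.

#include <linux/printk.h>
#include <linux/types.h>

#define MBA_Q_MAX	511	/* 9-bit field, per the discussion above */

/* Hypothetical: validate a "Q" value read from the ACPI table. */
static u32 mba_validate_q(u32 q)
{
	if (!q || q > MBA_Q_MAX) {
		pr_warn(FW_BUG "MBA Q value %u out of range, capping to %u\n",
			q, MBA_Q_MAX);
		return MBA_Q_MAX;
	}
	return q;
}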
On 9/30/2025 12:23 AM, Luck, Tony wrote: >>> This seems to be applicable as it introduces the new interface >>> while preserving forward compatibility. >>> >>> One minor question is that, according to "Figure 6-5. MBA Optimal >>> Bandwidth Register" in the latest RDT specification, the maximum >>> value ranges from 1 to 511. >>> Additionally, this bandwidth field is located at bits 48 to 56 in >>> the MBA Optimal Bandwidth Register, and the range for >>> this segment could be 1 to 8191. Just wonder if it would be > > 48..56 is still 9 bits, so max value is 511. > Ah I see, I overlooked this. >>> possible that the current maximum value of 512 may be extended >>> in the future? Perhaps we could explore a method to query the maximum upper >>> limit from the ACPI table or register, or use CPUID to distinguish between >>> platforms rather than hardcoding it. Reinette also mentioned this in another >>> thread. > > I think 511 was chosen as "bigger than we expect to ever need" and 9-bits > allocated in the registers based on that. > OK, got it. > Initial implementation may use 255 as the maximum - though I'm pushing on > that a bit as the throttle graph at the early stage is fairly linear from "1" to some > value < 255, > when bandwidth hits maximum, then flat up to 255. > If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI > table should be the value where peak bandwidth is hit I see. If I understand correctly, the BIOS needs to pre-train the system to find this Q. However, if the BIOS cannot provide this Q, would it be feasible for the user to provide it? For example, the user could saturate the memory bandwidth, gradually increase MB_MAX, and finally find the Q_max where the memory bandwidth no longer increases. The user could then adjust the max field in the info file. > (though this is complicated > because workloads with different mixes of read/write access have different > throttle graphs). > Does this mean read and write operations have different Q values to saturate the memory bandwidth? For example, if the workload is all reads, there is a Q_r; if the workload is all writes, there is another Q_w. In that case, maybe we could choose the maximum of Q_r and Q_w (max(Q_r, Q_w)). thanks, Chenyu
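As a rough illustration of the user-space probing Chen Yu describes -- sweep the control upward under a saturating load and look for the point where measured bandwidth stops growing -- something like the sketch below could be used. It is shown against the existing MB schema for the root group; the domain id (0), the monitoring path, the sampling interval, and the ~1% plateau threshold are all arbitrary assumptions, and the same idea would apply to a proposed MB_MAX schema.

#include <stdio.h>
#include <unistd.h>

static unsigned long long read_mbm(const char *path)
{
	unsigned long long bytes = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%llu", &bytes) != 1)
			bytes = 0;
		fclose(f);
	}
	return bytes;
}

int main(void)
{
	const char *mbm = "/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes";
	unsigned long long prev = 0;
	int mb;

	for (mb = 10; mb <= 100; mb += 10) {
		unsigned long long b0, b1, delta;
		FILE *f = fopen("/sys/fs/resctrl/schemata", "w");

		if (!f)
			return 1;
		fprintf(f, "MB:0=%d\n", mb);
		fclose(f);

		b0 = read_mbm(mbm);
		sleep(1);		/* measure bandwidth over one second */
		b1 = read_mbm(mbm);
		delta = b1 - b0;

		/* Plateau: bandwidth stopped growing within ~1%. */
		if (prev && delta <= prev + prev / 100) {
			printf("bandwidth saturates around MB=%d\n", mb);
			break;
		}
		prev = delta;
	}
	return 0;
}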
> >>> This seems to be applicable as it introduces the new interface > >>> while preserving forward compatibility. > >>> > >>> One minor question is that, according to "Figure 6-5. MBA Optimal > >>> Bandwidth Register" in the latest RDT specification, the maximum > >>> value ranges from 1 to 511. > >>> Additionally, this bandwidth field is located at bits 48 to 56 in > >>> the MBA Optimal Bandwidth Register, and the range for > >>> this segment could be 1 to 8191. Just wonder if it would be > > > > 48..56 is still 9 bits, so max value is 511. > > > > Ah I see, I overlooked this. > > >>> possible that the current maximum value of 512 may be extended > >>> in the future? Perhaps we could explore a method to query the maximum upper > >>> limit from the ACPI table or register, or use CPUID to distinguish between > >>> platforms rather than hardcoding it. Reinette also mentioned this in another > >>> thread. > > > > I think 511 was chosen as "bigger than we expect to ever need" and 9-bits > > allocated in the registers based on that. > > > > OK, got it. > > > Initial implementation may use 255 as the maximum - though I'm pushing on > > that a bit as the throttle graph at the early stage is fairly linear from "1" to some > > value < 255, > > when bandwidth hits maximum, then flat up to 255. > > If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI > > table should be the value where peak bandwidth is hit > > I see. If I understand correctly, the BIOS needs to pre-train the system to > find this Q. However, if the BIOS cannot provide this Q, would it be > feasible > for the user to provide it? For example, the user could saturate the memory > bandwidth, gradually increase MB_MAX, and finally find the Q_max where the > memory bandwidth no longer increases. The user could then adjust the max > field in the info file. > > > (though this is complicated > > because workloads with different mixes of read/write access have different > > throttle graphs). > > > > Does this mean read and write operations have different Q values to saturate > the memory bandwidth? For example, if the workload is all reads, there > is a Q_r; > if the workload is all writes, there is another Q_w. In that case, maybe we > could choose the maximum of Q_r and Q_w (max(Q_r, Q_w)). If the BIOS doesn't provide a good enough number, then users might well do some tuning based on the workloads they plan to run and ignore the value in the info file in favor of one tuned specifically for their workloads. But it is too early to start guessing at workarounds for problems that may not even exist. -Tony
Hi Dave, nits: Please use the subject prefix "x86,fs/resctrl" to be consistent with other resctrl code (and was established by Arm :)). Also please use upper case for acronym mba->MBA. On 9/2/25 9:24 AM, Dave Martin wrote: > The control value parser for the MB resource currently coerces the > memory bandwidth percentage value from userspace to be an exact > multiple of the bw_gran parameter. (to help be specific) "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > On MPAM systems, this results in somewhat worse-than-worst-case > rounding, since bw_gran is in general only an approximation to the > actual hardware granularity on these systems, and the hardware > bandwidth allocation control value is not natively a percentage -- > necessitating a further conversion in the resctrl_arch_update_domains() > path, regardless of the conversion done at parse time. > > Allow the arch to provide its own parse-time conversion that is > appropriate for the hardware, and move the existing conversion to x86. > This will avoid accumulated error from rounding the value twice on MPAM > systems. > > Clarify the documentation, but avoid overly exact promises. > > Clamping to bw_min and bw_max still feels generic: leave it in the core > code, for now. Sounds like MPAM may be ready to start the schema parsing discussion again? I understand that MPAM has a few more ways to describe memory bandwidth as well as cache portion partitioning. Previously ([1] [2]) James mused about exposing schema format to user space, which seems like a good idea for new schema. Is this something MPAM is still considering? For example, the minimum and maximum ranges that can be specified, is this something you already have some ideas for? Have you perhaps considered Tony's RFD [3] that includes discussion on how to handle min/max ranges for bandwidth? > > No functional change. > > Signed-off-by: Dave Martin <Dave.Martin@arm.com> > > --- > > Based on v6.17-rc3. > > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ > the other tests except for the NONCONT_CAT tests, which do not seem to > be supported in my configuration -- and have nothing to do with the > code touched by this patch). Is the NONCONT_CAT test failing (i.e printing "not ok")? The NONCONT_CAT tests may print error messages as debug information as part of running, but these errors are expected as part of the test. The test should accurately state whether it passed or failed though. For example, below attempts to write a non-contiguous CBM to a system that does not support non-contiguous masks. This fails as expected, error messages printed as debugging and thus the test passes with an "ok". # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected ok 5 L3_NONCONT_CAT: test > > Notes: > > I put the x86 version out of line in order to avoid having to move > struct rdt_resource and its dependencies into resctrl_types.h -- which > would create a lot of diff noise. Schemata writes from userspace have > a high overhead in any case. Sounds good, I expect compiler will inline. > > For MPAM the conversion will be a no-op, because the incoming > percentage from the core resctrl code needs to be converted to hardware > representation in the driver anyway. (addressed below) > > Perhaps _all_ the types should move to resctrl_types.h. Can surely consider when there is a good motivation. > > For now, I went for the smallest diffstat... 
> > --- > Documentation/filesystems/resctrl.rst | 7 +++---- > arch/x86/include/asm/resctrl.h | 2 ++ > arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++ > fs/resctrl/ctrlmondata.c | 2 +- > include/linux/resctrl.h | 6 ++++++ > 5 files changed, 18 insertions(+), 5 deletions(-) > > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > index c7949dd44f2f..a1d0469d6dfb 100644 > --- a/Documentation/filesystems/resctrl.rst > +++ b/Documentation/filesystems/resctrl.rst > @@ -143,12 +143,11 @@ with respect to allocation: > user can request. > > "bandwidth_gran": > - The granularity in which the memory bandwidth > + The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated > b/w percentage is rounded off to the next > - control step available on the hardware. The > - available bandwidth control steps are: > - min_bandwidth + N * bandwidth_gran. > + control step available on the hardware. The available > + steps are at least as small as this value. A bit difficult to parse for me. Is "at least as small as" same as "at least"? Please note that the documentation has a section "Memory bandwidth Allocation and monitoring" that also contains these exact promises. > > "delay_linear": > Indicates if the delay scale is linear or > diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h > index feb93b50e990..8bec2b9cc503 100644 > --- a/arch/x86/include/asm/resctrl.h > +++ b/arch/x86/include/asm/resctrl.h > @@ -18,6 +18,8 @@ > */ > #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0) > > +struct rdt_resource; > + I'm missing something here. Why is this needed? > /** > * struct resctrl_pqr_state - State cache for the PQR MSR > * @cur_rmid: The cached Resource Monitoring ID > diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c > index 1189c0df4ad7..cf9b30b5df3c 100644 > --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c > +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c > @@ -16,9 +16,15 @@ > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > #include <linux/cpu.h> > +#include <linux/math.h> > > #include "internal.h" > > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r) > +{ > + return roundup(val, (unsigned long)r->membw.bw_gran); > +} > + > int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d, > u32 closid, enum resctrl_conf_type t, u32 cfg_val) > { > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c > index d98e0d2de09f..c5e73b75aaa0 100644 > --- a/fs/resctrl/ctrlmondata.c > +++ b/fs/resctrl/ctrlmondata.c > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) > return false; > } > > - *data = roundup(bw, (unsigned long)r->membw.bw_gran); > + *data = resctrl_arch_round_bw(bw, r); Please check that function comments remain accurate after changes (specifically if making the conversion more generic as proposed below). 
> return true; > } > > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h > index 6fb4894b8cfd..5b2a555cf2dd 100644 > --- a/include/linux/resctrl.h > +++ b/include/linux/resctrl.h > @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, > bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); > int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); > > +/* > + * Round a bandwidth control value to the nearest value acceptable to > + * the arch code for resource r: > + */ > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); > + I do not think that resctrl should make any assumptions on what the architecture's conversion does (i.e "round"). That architecture needs to be asked to "round a bandwidth control value" also sounds strange since resctrl really should be able to do something like rounding itself. As I understand from the notes this will be a no-op for MPAM making this even more confusing. How about naming the helper something like resctrl_arch_convert_bw()? (Open to other ideas of course). If you make such a change, please check that subject of patch still fits. I think that using const to pass data to architecture is great, thanks. Reinette [1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/ [2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/ [3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
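To make the division of labour in the patch concrete: after this change the parse path keeps the generic range handling in fs/resctrl and delegates only the final conversion to the architecture. A paraphrased sketch, not the literal kernel code (string parsing and error reporting are elided, and the min/max field names are approximate, following the commit message's bw_min/bw_max):

/* Sketch only: the real bw_validate() also parses the string and logs errors. */
static bool bw_validate_sketch(u32 bw, u32 *data, struct rdt_resource *r)
{
	/* Range handling against bw_min/bw_max stays in the core code... */
	if (bw < r->membw.min_bw || bw > r->membw.max_bw)
		return false;

	/* ...and only the final conversion is delegated to the arch. */
	*data = resctrl_arch_round_bw(bw, r);
	return true;
}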
Hi Reinette, Thanks for the review. On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > Hi Dave, > > nits: > Please use the subject prefix "x86,fs/resctrl" to be consistent with other > resctrl code (and was established by Arm :)). > Also please use upper case for acronym mba->MBA. Ack (the local custom in the MPAM code is to use "mba", but arguably, the meaning is not quite the same -- I'll change it.) > On 9/2/25 9:24 AM, Dave Martin wrote: > > The control value parser for the MB resource currently coerces the > > memory bandwidth percentage value from userspace to be an exact > > multiple of the bw_gran parameter. > > (to help be specific) > "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? "bw_gran" was intended as an informal shorthand for the abstract parameter (exposed both in the field you mention and through the bandwidth_gran file in resctrl). I can rewrite it as per your suggestion, but this could be read as excluding the bandwidth_gran file. Would it make sense just to write it out longhand? For now, I've rewritten it as follows: | The control value parser for the MB resource currently coerces the | memory bandwidth percentage value from userspace to be an exact | multiple of the bandwidth granularity parameter. | | On MPAM systems, this results in somewhat worse-than-worst-case | rounding, since the bandwidth granularity advertised to resctrl by the | MPAM driver is in general only an approximation [...] (I'm happy to go with your suggestion if you're not keen on this, though.) > > On MPAM systems, this results in somewhat worse-than-worst-case > > rounding, since bw_gran is in general only an approximation to the > > actual hardware granularity on these systems, and the hardware > > bandwidth allocation control value is not natively a percentage -- > > necessitating a further conversion in the resctrl_arch_update_domains() > > path, regardless of the conversion done at parse time. > > > > Allow the arch to provide its own parse-time conversion that is > > appropriate for the hardware, and move the existing conversion to x86. > > This will avoid accumulated error from rounding the value twice on MPAM > > systems. > > > > Clarify the documentation, but avoid overly exact promises. > > > > Clamping to bw_min and bw_max still feels generic: leave it in the core > > code, for now. > > Sounds like MPAM may be ready to start the schema parsing discussion again? > I understand that MPAM has a few more ways to describe memory bandwidth as > well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > schema format to user space, which seems like a good idea for new schema. My own ideas in this area are a little different, though I agree with the general idea. I'll respond separately on that, to avoid this thread getting off-topic. For this patch, my aim was to avoid changing anything unnecessarily. > Is this something MPAM is still considering? For example, the minimum > and maximum ranges that can be specified, is this something you already > have some ideas for? Have you perhaps considered Tony's RFD [3] that includes > discussion on how to handle min/max ranges for bandwidth? This is another thing that we probably do want to support at some point, but it feels like a different thing from the minimum and maximum bounds acceptable to an individual schema -- especially since in the hardware they may behave more like trigger points than hard limits. Again, I'll respond separately. [...]
> > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ > > the other tests except for the NONCONT_CAT tests, which do not seem to > > be supported in my configuration -- and have nothing to do with the > > code touched by this patch). > > Is the NONCONT_CAT test failing (i.e printing "not ok")? > > The NONCONT_CAT tests may print error messages as debug information as part of > running, but these errors are expected as part of the test. The test should accurately > state whether it passed or failed though. For example, below attempts to write > a non-contiguous CBM to a system that does not support non-contiguous masks. > This fails as expected, error messages printed as debugging and thus the test passes > with an "ok". > > # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument > # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected > ok 5 L3_NONCONT_CAT: test I don't think that this was anything to do with my changes, but I no longer seem to have the test output. (Since this test has to do with bitmap schemata (?), it seemed unlikely to be affected by changes to bw_validate().) I'll need to re-test with and without this patch to check whether it makes any difference. > > Notes: > > > > I put the x86 version out of line in order to avoid having to move > > struct rdt_resource and its dependencies into resctrl_types.h -- which > > would create a lot of diff noise. Schemata writes from userspace have > > a high overhead in any case. > > Sounds good, I expect compiler will inline. The function and caller are in separate translation units, so unless LTO is used, I don't think the function will be inlined. > > > > For MPAM the conversion will be a no-op, because the incoming > > percentage from the core resctrl code needs to be converted to hardware > > representation in the driver anyway. > > (addressed below) > > > > > Perhaps _all_ the types should move to resctrl_types.h. > > Can surely consider when there is a good motivation. > > > > For now, I went for the smallest diffstat... I'll assume the motivation is not strong enough for now, but shout if you disagree. [...] > > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > > index c7949dd44f2f..a1d0469d6dfb 100644 > > --- a/Documentation/filesystems/resctrl.rst > > +++ b/Documentation/filesystems/resctrl.rst > > @@ -143,12 +143,11 @@ with respect to allocation: > > user can request. > > > > "bandwidth_gran": > > - The granularity in which the memory bandwidth > > + The approximate granularity in which the memory bandwidth > > percentage is allocated. The allocated > > b/w percentage is rounded off to the next > > - control step available on the hardware. The > > - available bandwidth control steps are: > > - min_bandwidth + N * bandwidth_gran. > > + control step available on the hardware. The available > > + steps are at least as small as this value. > > A bit difficult to parse for me. > Is "at least as small as" same as "at least"? It was supposed to mean: "The available steps are no larger than this value." Formally, my expectation is that this value is the smallest integer number of percent which is not smaller than the apparent size of any individual rounding step. Equivalently, this is the smallest number g for which writing "MB: 0=x" and "MB: 0=y" yield different configurations for every in-range x and where y = x + g and y is also in-range. That's a bit of a mouthful, though.
If you can think of a more succinct way of putting it, I'm open to suggestions! > Please note that the documentation has a section "Memory bandwidth Allocation > and monitoring" that also contains these exact promises. Hmmm, somehow I completely missed that. Does the following make sense? Ideally, there would be a simpler way to describe the discrepancy between the reported and actual values of bw_gran... | Memory bandwidth Allocation and monitoring | ========================================== | | [...] | | The minimum bandwidth percentage value for each cpu model is predefined | and can be looked up through "info/MB/min_bandwidth". The bandwidth | granularity that is allocated is also dependent on the cpu model and can | be looked up at "info/MB/bandwidth_gran". The available bandwidth | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded | -to the next control step available on the hardware. | +control steps are: min_bw + N * (bw_gran - e), where e is a | +non-negative, hardware-defined real constant that is less than 1. | +Intermediate values are rounded to the next control step available on | +the hardware. | + | +At the time of writing, the constant e referred to in the preceding | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran | +describes the step size exactly), but this may not be the case on other | +hardware when the actual granularity is not an exact divisor of 100. > > > > > "delay_linear": > > Indicates if the delay scale is linear or > > diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h > > index feb93b50e990..8bec2b9cc503 100644 > > --- a/arch/x86/include/asm/resctrl.h > > +++ b/arch/x86/include/asm/resctrl.h > > @@ -18,6 +18,8 @@ > > */ > > #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0) > > > > +struct rdt_resource; > > + > > I'm missing something here. Why is this needed? Oops, it's not. This got left behind from when I had the function in-line here. Removed. [...] > > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c > > index d98e0d2de09f..c5e73b75aaa0 100644 > > --- a/fs/resctrl/ctrlmondata.c > > +++ b/fs/resctrl/ctrlmondata.c > > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) > > return false; > > } > > > > - *data = roundup(bw, (unsigned long)r->membw.bw_gran); > > + *data = resctrl_arch_round_bw(bw, r); > > Please check that function comments remain accurate after changes (specifically > if making the conversion more generic as proposed below). I hoped that the comment for this function was still applicable, though it can probably be improved. How about the following? | - * hardware. The allocated bandwidth percentage is rounded to the next | - * control step available on the hardware. | + * hardware. The allocated bandwidth percentage is converted as | + * appropriate for consumption by the specific hardware driver. [...] 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h > > index 6fb4894b8cfd..5b2a555cf2dd 100644 > > --- a/include/linux/resctrl.h > > +++ b/include/linux/resctrl.h > > @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, > > bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); > > int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); > > > > +/* > > + * Round a bandwidth control value to the nearest value acceptable to > > + * the arch code for resource r: > > + */ > > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); > > + > > I do not think that resctrl should make any assumptions on what the > architecture's conversion does (i.e "round"). That architecture needs to be > asked to "round a bandwidth control value" also sounds strange since resctrl really > should be able to do something like rounding itself. As I understand from > the notes this will be a no-op for MPAM making this even more confusing. > > How about naming the helper something like resctrl_arch_convert_bw()? > (Open to other ideas of course). > > If you make such a change, please check that subject of patch still fits. I struggled a bit with the name. Really, this is converting the value to an intermediate form (which might or might not involve rounding). For historical reasons, this is a value suitable for writing directly to the relevant x86 MSR without any further interpretation. For MPAM, it is convenient to do this conversion later, rather than during parsing of the value. Would a name like resctrl_arch_preconvert_bw() be acceptable? This isn't more informative than your suggestion regarding what the conversion is expected to do, but may convey the expectation that the output value may still not be in its final (i.e., hardware) form. > I think that using const to pass data to architecture is great, thanks. > > Reinette I try to constify by default when straightforward to do so, since the compiler can then find which cases need to change; the reverse direction is harder to automate... Cheers ---Dave [...] > [1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/ > [2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/ > [3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
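Given the notes above, the MPAM side of the hook -- under the resctrl_arch_preconvert_bw() name floated here -- would presumably reduce to a pass-through. A sketch under that assumption:

/*
 * Hedged sketch of a possible MPAM implementation: the percentage is
 * passed through unchanged at parse time, and the conversion to the
 * hardware control value happens later, in the
 * resctrl_arch_update_domains() path.
 */
u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r)
{
	return val;
}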
Hi Dave, On 9/22/25 7:39 AM, Dave Martin wrote: > Hi Reinette, > > Thanks for the review. > > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >> Hi Dave, >> >> nits: >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other >> resctrl code (and was established by Arm :)). >> Also please use upper case for acronym mba->MBA. > > Ack (the local custom in the MPAM code is to use "mba", but arguably, > the meaning is not quite the same -- I'll change it.) I am curious what the motivation is for the custom? Knowing this will help me to keep things consistent when the two worlds meet. > >> On 9/2/25 9:24 AM, Dave Martin wrote: >>> The control value parser for the MB resource currently coerces the >>> memory bandwidth percentage value from userspace to be an exact >>> multiple of the bw_gran parameter. >> >> (to help be specific) >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > "bw_gran" was intended as an informal shorthand for the abstract > parameter (exposed both in the field you mention and through the > bandwidth_gran file in resctrl). I do not see a need for being abstract since the bandwidth_gran file exposes the field verbatim. > > I can rewrite it as per your suggestion, but this could be read as > excluding the bandwidth_gran file. Would it make sense just to write > it out longhand? For now, I've rewritten it as follows: Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran it is not clear to me how being specific excludes the bandwidth_gran file. > > | The control value parser for the MB resource currently coerces the > | memory bandwidth percentage value from userspace to be an exact > | multiple of the bandwidth granularity parameter. If you want to include the bandwidth_gran file then the above could be something like: The control value parser for the MB resource coerces the memory bandwidth percentage value from userspace to be an exact multiple of the bandwidth granularity parameter that is exposed by the bandwidth_gran resctrl file. I still think that replacing "the bandwidth granularity parameter" with "rdt_resource::resctrl_membw::bw_gran" will help to be more specific. > | > | On MPAM systems, this results in somewhat worse-than-worst-case > | rounding, since the bandwidth granularity advertised to resctrl by the > | MPAM driver is in general only an approximation [...] > > (I'm happy to go with your suggestion if you're not keen on this, > though.) >>> On MPAM systems, this results in somewhat worse-than-worst-case >>> rounding, since bw_gran is in general only an approximation to the >>> actual hardware granularity on these systems, and the hardware >>> bandwidth allocation control value is not natively a percentage -- >>> necessitating a further conversion in the resctrl_arch_update_domains() >>> path, regardless of the conversion done at parse time. >>> >>> Allow the arch to provide its own parse-time conversion that is >>> appropriate for the hardware, and move the existing conversion to x86. >>> This will avoid accumulated error from rounding the value twice on MPAM >>> systems. >>> >>> Clarify the documentation, but avoid overly exact promises. >>> >>> Clamping to bw_min and bw_max still feels generic: leave it in the core >>> code, for now. >> >> Sounds like MPAM may be ready to start the schema parsing discussion again? >> I understand that MPAM has a few more ways to describe memory bandwidth as >> well as cache portion partitioning. 
Previously ([1] [2]) James mused about exposing >> schema format to user space, which seems like a good idea for new schema. > > My own ideas in this area are a little different, though I agree with > the general idea. Should we expect a separate proposal from James? > > I'll respond separately on that, to avoid this thread getting off-topic. Much appreciated. > > For this patch, my aim was to avoid changing anything unnecessarily. Understood. More below as I try to understand the details but it does not really sound as though the current interface works that great for MPAM. If I understand correctly this patch enables MPAM to use the existing interface for its memory bandwidth allocations but doing so does not enable users to obtain the benefit of hardware capabilities. For that users would want to use the new interface? >>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ >>> the other tests except for the NONCONT_CAT tests, which do not seem to >>> be supported in my configuration -- and have nothing to do with the >>> code touched by this patch). >> >> Is the NONCONT_CAT test failing (i.e printing "not ok")? >> >> The NONCONT_CAT tests may print error messages as debug information as part of >> running, but these errors are expected as part of the test. The test should accurately >> state whether it passed or failed though. For example, below attempts to write >> a non-contiguous CBM to a system that does not support non-contiguous masks. >> This fails as expected, error messages printed as debugging and thus the test passes >> with an "ok". >> >> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument >> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected >> ok 5 L3_NONCONT_CAT: test > > I don't think that this was anything to do with my changes, but I no longer > seem to have the test output. (Since this test has to do with > bitmap schemata (?), it seemed unlikely to be affected by changes to > bw_validate().) I agree that this should not have anything to do with this patch. My concern is that I understood that the test failed for a feature that is not supported. If this is the case then there may be a problem with the test. The test should not fail if the feature is not supported but instead skip the test. > >>> Notes: > >>> > >>> I put the x86 version out of line in order to avoid having to move > >>> struct rdt_resource and its dependencies into resctrl_types.h -- which > >>> would create a lot of diff noise. Schemata writes from userspace have > >>> a high overhead in any case. > >> > >> Sounds good, I expect compiler will inline. > > > > The function and caller are in separate translation units, so unless > > LTO is used, I don't think the function will be inlined. Thanks, yes, indeed. > > > >>> > >>> For MPAM the conversion will be a no-op, because the incoming > >>> percentage from the core resctrl code needs to be converted to hardware > >>> representation in the driver anyway. > >> > >> (addressed below) > >> > >>> > >>> Perhaps _all_ the types should move to resctrl_types.h. > >> > >> Can surely consider when there is a good motivation. > >> > >>> > >>> For now, I went for the smallest diffstat... > > > > I'll assume the motivation is not strong enough for now, but shout if > > you disagree. I agree. > > > > [...]
> >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst >>> index c7949dd44f2f..a1d0469d6dfb 100644 >>> --- a/Documentation/filesystems/resctrl.rst >>> +++ b/Documentation/filesystems/resctrl.rst >>> @@ -143,12 +143,11 @@ with respect to allocation: >>> user can request. >>> >>> "bandwidth_gran": >>> - The granularity in which the memory bandwidth >>> + The approximate granularity in which the memory bandwidth >>> percentage is allocated. The allocated >>> b/w percentage is rounded off to the next >>> - control step available on the hardware. The >>> - available bandwidth control steps are: >>> - min_bandwidth + N * bandwidth_gran. >>> + control step available on the hardware. The available >>> + steps are at least as small as this value. >> >> A bit difficult to parse for me. >> Is "at least as small as" same as "at least"? > > It was supposed to mean: "The available steps are no larger than this > value." This is clear to me, especially when compared with the planned addition to "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting the paragraph below (more below). > > Formally, my expectation is that this value is the smallest integer > number of percent which is not smaller than the apparent size of any > individual rounding step. Equivalently, this is the smallest number g Considering the two statements: - "The available steps are no larger than this value." - "this value ... is not smaller than the apparent size of any individual rounding step" The "not larger" and "not smaller" sounds like all these words just end up saying that this is the step size? > for which writing "MB: 0=x" and "MB: 0=y" yield different > configurations for every in-range x and where y = x + g and y is also > in-range. > > That's a bit of a mouthful, though. If you can think of a more > succinct way of putting it, I'm open to suggestions! > >> Please note that the documentation has a section "Memory bandwidth Allocation >> and monitoring" that also contains these exact promises. > > Hmmm, somehow I completely missed that. > > Does the following make sense? Ideally, there would be a simpler way > to describe the discrepancy between the reported and actual values of > bw_gran... > > | Memory bandwidth Allocation and monitoring > | ========================================== > | > | [...] > | > | The minimum bandwidth percentage value for each cpu model is predefined > | and can be looked up through "info/MB/min_bandwidth". The bandwidth > | granularity that is allocated is also dependent on the cpu model and can > | be looked up at "info/MB/bandwidth_gran". The available bandwidth > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded > | -to the next control step available on the hardware. > | +control steps are: min_bw + N * (bw_gran - e), where e is a > | +non-negative, hardware-defined real constant that is less than 1. > | +Intermediate values are rounded to the next control step available on > | +the hardware. > | + > | +At the time of writing, the constant e referred to in the preceding > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran > | +describes the step size exactly), but this may not be the case on other > | +hardware when the actual granularity is not an exact divisor of 100. Have you considered how to share the value of "e" with users?
>>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c >>> index d98e0d2de09f..c5e73b75aaa0 100644 >>> --- a/fs/resctrl/ctrlmondata.c >>> +++ b/fs/resctrl/ctrlmondata.c >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) >>> return false; >>> } >>> >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran); >>> + *data = resctrl_arch_round_bw(bw, r); >> >> Please check that function comments remain accurate after changes (specifically >> if making the conversion more generic as proposed below). > > I hoped that the comment for this function was still applicable, though > it can probably be improved. How about the following? > > | - * hardware. The allocated bandwidth percentage is rounded to the next > | - * control step available on the hardware. > | + * hardware. The allocated bandwidth percentage is converted as > | + * appropriate for consumption by the specific hardware driver. > > [...] Looks good to me. > >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h >>> index 6fb4894b8cfd..5b2a555cf2dd 100644 >>> --- a/include/linux/resctrl.h >>> +++ b/include/linux/resctrl.h >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); >>> >>> +/* >>> + * Round a bandwidth control value to the nearest value acceptable to >>> + * the arch code for resource r: >>> + */ >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); >>> + >> >> I do not think that resctrl should make any assumptions on what the >> architecture's conversion does (i.e "round"). That architecture needs to be >> asked to "round a bandwidth control value" also sounds strange since resctrl really >> should be able to do something like rounding itself. As I understand from >> the notes this will be a no-op for MPAM making this even more confusing. >> >> How about naming the helper something like resctrl_arch_convert_bw()? >> (Open to other ideas of course). >> >> If you make such a change, please check that subject of patch still fits. > > I struggled a bit with the name. Really, this is converting the value > to an intermediate form (which might or might not involve rounding). > For historical reasons, this is a value suitable for writing directly > to the relevant x86 MSR without any further interpretation. > > For MPAM, it is convenient to do this conversion later, rather than > during parsing of the value. > > > Would a name like resctrl_arch_preconvert_bw() be acceptable? Yes. > > This isn't more informative than your suggestion regarding what the > conversion is expected to do, but may convey the expectation that the > output value may still not be in its final (i.e., hardware) form. Sounds good, yes. > >> I think that using const to pass data to architecture is great, thanks. >> >> Reinette > > I try to constify by default when straightforward to do so, since the > compiler can then find which cases need to change; the reverse > direction is harder to automate... Could you please elaborate what you mean with "reverse direction"? Thank you Reinette
Hi Reinette, On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/22/25 7:39 AM, Dave Martin wrote: > > Hi Reinette, > > > > Thanks for the review. > > > > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > >> Hi Dave, > >> > >> nits: > >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other > >> resctrl code (and was established by Arm :)). > >> Also please use upper case for acronym mba->MBA. > > > > Ack (the local custom in the MPAM code is to use "mba", but arguably, > > the meaning is not quite the same -- I'll change it.) > > I am curious what the motivation is for the custom? Knowing this will help > me to keep things consistent when the two worlds meet. I think this has just evolved over time. On the x86 side, MBA is a specific architectural feature, but on the MPAM side the architecture doesn't really have a name for the same thing. Memory bandwidth is a concept, but a few different types of control are defined for it, with different names. So, for the MPAM driver "mba" is more of a software concept than something in a published spec: it's the glue that attaches to "MB" resource as seen through resctrl. (This isn't official though; it's just the mental model that I have formed.) > > >>> The control value parser for the MB resource currently coerces the > >>> memory bandwidth percentage value from userspace to be an exact > >>> multiple of the bw_gran parameter. > >> > >> (to help be specific) > >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > > > "bw_gran" was intended as an informal shorthand for the abstract > > parameter (exposed both in the field you mention and through the > > bandwidth_gran file in resctrl). > > I do not see a need for being abstract since the bandwidth_gran file exposes > the field verbatim. Sure; that was just my thought process. > > I can rewrite it as per your suggestion, but this could be read as > > excluding the bandwidth_gran file. Would it make sense just to write > > it out longhand? For now, I've rewritten it as follows: > > Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran > it is not clear to me how being specific excludes the bandwidth_gran file. > > > > > | The control value parser for the MB resource currently coerces the > > | memory bandwidth percentage value from userspace to be an exact > > | multiple of the bandwidth granularity parameter. > > If you want to include the bandwidth_gran file then the above could be > something like: > > The control value parser for the MB resource coerces the memory > bandwidth percentage value from userspace to be an exact multiple > of the bandwidth granularity parameter that is exposed by the > bandwidth_gran resctrl file. > > I still think that replacing "the bandwidth granularity parameter" with > "rdt_resource::resctrl_membw::bw_gran" will help to be more specific. That's fine. I'll change as per your original suggestion. > > | > > | On MPAM systems, this results in somewhat worse-than-worst-case > > | rounding, since the bandwidth granularity advertised to resctrl by the > > | MPAM driver is in general only an approximation [...] > > > > (I'm happy to go with your suggestion if you're not keen on this, > > though.)
> > >>> On MPAM systems, this results in somewhat worse-than-worst-case > >>> rounding, since bw_gran is in general only an approximation to the > >>> actual hardware granularity on these systems, and the hardware > >>> bandwidth allocation control value is not natively a percentage -- > >>> necessitating a further conversion in the resctrl_arch_update_domains() > >>> path, regardless of the conversion done at parse time. > >>> > >>> Allow the arch to provide its own parse-time conversion that is > >>> appropriate for the hardware, and move the existing conversion to x86. > >>> This will avoid accumulated error from rounding the value twice on MPAM > >>> systems. > >>> > >>> Clarify the documentation, but avoid overly exact promises. > >>> > >>> Clamping to bw_min and bw_max still feels generic: leave it in the core > >>> code, for now. > >> > >> Sounds like MPAM may be ready to start the schema parsing discussion again? > >> I understand that MPAM has a few more ways to describe memory bandwidth as > >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > >> schema format to user space, which seems like a good idea for new schema. > > > > My own ideas in this area are a little different, though I agree with > > the general idea. > > Should we expect a separate proposal from James? At some point, yes. We still need to have a chat about it. Right now, I was just throwing an idea out there. > > I'll respond separately on that, to avoid this thread getting off-topic. > > Much appreciated. > > > > For this patch, my aim was to avoid changing anything unnecessarily. > > Understood. More below as I try to understand the details but it does not > really sound as though the current interface works that great for MPAM. If I > understand correctly this patch enables MPAM to use the existing interface for > its memory bandwidth allocations but doing so does not enable users to > obtain the benefit of hardware capabilities. For that users would want to use > the new interface? In an ideal world, probably, yes.
> > I agree that this should not have anything to do with this patch. My concern > is that I understood that the test failed for a feature that is not supported. > If this is the case then there may be a problem with the test. The test should > not fail if the feature is not supported but instead skip the test. I'll try to capture more output from this when I re-run it, so that we can figure out what this is. > >>> Notes: > >>> > >>> I put the x86 version out of line in order to avoid having to move > >>> struct rdt_resource and its dependencies into resctrl_types.h -- which > >>> would create a lot of diff noise. Schemata writes from userspace have > >>> a high overhead in any case. > >> > >> Sounds good, I expect compiler will inline. > > > > The function and caller are in separate translation units, so unless > > LTO is used, I don't think the function will be inlined. > > Thanks, yes, indeed. > > > > >>> > >>> For MPAM the conversion will be a no-op, because the incoming > >>> percentage from the core resctrl code needs to be converted to hardware > >>> representation in the driver anyway. > >> > >> (addressed below) > >> > >>> > >>> Perhaps _all_ the types should move to resctrl_types.h. > >> > >> Can surely consider when there is a good motivation. > >> > >>> > >>> For now, I went for the smallest diffstat... > > > > I'll assume the motivation is not strong enough for now, but shout if > > you disagree. > > I agree. OK, I'll leave that as-is for now, then. > > > > [...] > > > >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > >>> index c7949dd44f2f..a1d0469d6dfb 100644 > >>> --- a/Documentation/filesystems/resctrl.rst > >>> +++ b/Documentation/filesystems/resctrl.rst > >>> @@ -143,12 +143,11 @@ with respect to allocation: > >>> user can request. > >>> > >>> "bandwidth_gran": > >>> - The granularity in which the memory bandwidth > >>> + The approximate granularity in which the memory bandwidth > >>> percentage is allocated. The allocated > >>> b/w percentage is rounded off to the next > >>> - control step available on the hardware. The > >>> - available bandwidth control steps are: > >>> - min_bandwidth + N * bandwidth_gran. > >>> + control step available on the hardware. The available > >>> + steps are at least as small as this value. > >> > >> A bit difficult to parse for me. > >> Is "at least as small as" same as "at least"? > > > > It was supposed to mean: "The available steps are no larger than this > > value." > > This is clear to me, especially when compared with the planned addition to > "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting > the paragraph below (more below). > > > > > > Formally, my expectation is that this value is the smallest integer > > number of percent which is not smaller than the apparent size of any > > individual rounding step. Equivalently, this is the smallest number g > > Considering the two statements: > - "The available steps are no larger than this value." > - "this value ... is not smaller than the apparent size of any individual rounding step" > > The "not larger" and "not smaller" sounds like all these words just end up saying that > this is the step size? They are intended to be the same statement: A <= B versus B >= A respectively. But I'd be the first to admit that the wording is a bit twisted! (I wouldn't be astonished if I got something wrong somewhere.) See below for an alternative way of describing this that might be more intuitive.
> > for which writing "MB: 0=x" and "MB: 0=y" yield different > configurations for every in-range x and where y = x + g and y is also > in-range. > > That's a bit of a mouthful, though. If you can think of a more > succinct way of putting it, I'm open to suggestions! > >> Please note that the documentation has a section "Memory bandwidth Allocation >> and monitoring" that also contains these exact promises. > > Hmmm, somehow I completely missed that. > > Does the following make sense? Ideally, there would be a simpler way > to describe the discrepancy between the reported and actual values of > bw_gran... > > | Memory bandwidth Allocation and monitoring > | ========================================== > | > | [...] > | > | The minimum bandwidth percentage value for each cpu model is predefined > | and can be looked up through "info/MB/min_bandwidth". The bandwidth > | granularity that is allocated is also dependent on the cpu model and can > | be looked up at "info/MB/bandwidth_gran". The available bandwidth > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded > | -to the next control step available on the hardware. > | +control steps are: min_bw + N * (bw_gran - e), where e is a > | +non-negative, hardware-defined real constant that is less than 1. > | +Intermediate values are rounded to the next control step available on > | +the hardware. > | + > | +At the time of writing, the constant e referred to in the preceding > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran > | +describes the step size exactly), but this may not be the case on other > | +hardware when the actual granularity is not an exact divisor of 100. > > Have you considered how to share the value of "e" with users? Perhaps introducing this "e" as an explicit parameter is a bad idea and overly formal. In practice, there are likely to be various sources of skid and approximation in the hardware, so exposing an actual value may be counterproductive -- i.e., what usable guarantee is this providing to userspace, if this is likely to be swamped by approximations elsewhere? Instead, maybe we can just say something like: | The available steps are spaced at roughly equal intervals between the | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading | info/MB/bandwidth_gran gives the worst-case precision of these | interval steps, in per cent. What do you think? If that's adequate, then the wording under the definition of "bandwidth_gran" could be aligned with this. > >>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c > >>> index d98e0d2de09f..c5e73b75aaa0 100644 > >>> --- a/fs/resctrl/ctrlmondata.c > >>> +++ b/fs/resctrl/ctrlmondata.c > >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) > >>> return false; > >>> } > >>> > >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran); > >>> + *data = resctrl_arch_round_bw(bw, r); > >> > >> Please check that function comments remain accurate after changes (specifically > >> if making the conversion more generic as proposed below). > > > > I hoped that the comment for this function was still applicable, though > > it can probably be improved. How about the following? > > > > | - * hardware. The allocated bandwidth percentage is rounded to the next > > | - * control step available on the hardware. > > | + * hardware. 
The allocated bandwidth percentage is converted as > > | + * appropriate for consumption by the specific hardware driver. > > > > [...] > > Looks good to me. OK. > > > >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h > >>> index 6fb4894b8cfd..5b2a555cf2dd 100644 > >>> --- a/include/linux/resctrl.h > >>> +++ b/include/linux/resctrl.h > >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, > >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); > >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); > >>> > >>> +/* > >>> + * Round a bandwidth control value to the nearest value acceptable to > >>> + * the arch code for resource r: > >>> + */ > >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); > >>> + > >> > >> I do not think that resctrl should make any assumptions on what the > >> architecture's conversion does (i.e "round"). That architecture needs to be > >> asked to "round a bandwidth control value" also sounds strange since resctrl really > >> should be able to do something like rounding itself. As I understand from > >> the notes this will be a no-op for MPAM making this even more confusing. > >> > >> How about naming the helper something like resctrl_arch_convert_bw()? > >> (Open to other ideas of course). > >> > >> If you make such a change, please check that subject of patch still fits. > > > > I struggled a bit with the name. Really, this is converting the value > > to an intermediate form (which might or might not involve rounding). > > For historical reasons, this is a value suitable for writing directly > > to the relevant x86 MSR without any further interpretation. > > > > For MPAM, it is convenient to do this conversion later, rather than > > during parsing of the value. > > > > > > Would a name like resctrl_arch_preconvert_bw() be acceptable? > > Yes. > > > > > This isn't more informative than your suggestion regarding what the > > conversion is expected to do, but may convey the expectation that the > > output value may still not be in its final (i.e., hardware) form. > > Sounds good, yes. OK, I'll hack that in. > > > > >> I think that using const to pass data to architecture is great, thanks. > >> > >> Reinette > > > > I try to constify by default when straightforward to do so, since the > > compiler can then find which cases need to change; the reverse > > direction is harder to automate... > > Could you please elaborate what you mean with "reverse direction"? I just meant that over-consting tends to result in violations of the language that the compiler will detect, but under-consting doesn't:

static void foo(int *nonconstp, const int *constp)
{
	*constp = 0;     // compiler error
	(*nonconstp);    // silently accepted, though it could have been const
}

So, the compiler will tell you places where const needs to be removed (or something else needs to change), but to find places where const could be _added_, you have to hunt them down yourself, or use some other tool that is probably not part of the usual workflow. Cheers ---Dave
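A concrete, hypothetical example may help pin down "e". Suppose a control with 64 equally spaced hardware steps, so that the true step is not an exact divisor of 100:

\[
\text{true step} = \frac{100}{64} = 1.5625\%, \qquad
\texttt{bw\_gran} = \lceil 1.5625 \rceil = 2, \qquad
e = 2 - 1.5625 = 0.4375,
\]
\[
\text{available steps} = \texttt{min\_bw} + N \times (\texttt{bw\_gran} - e)
                       = \texttt{min\_bw} + 1.5625\,N .
\]

Here bw_gran honestly reports "steps no larger than 2%", while the steps actually taken are 1.5625% apart -- the situation the proposed wording describes. The 64-step control is an assumption for illustration only.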
Hi Dave, On 9/25/25 5:46 AM, Dave Martin wrote: > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: >> On 9/22/25 7:39 AM, Dave Martin wrote: >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >>>> Hi Dave, >>>> >>>> nits: >>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other >>>> resctrl code (and was established by Arm :)). >>>> Also please use upper case for acronym mba->MBA. >>> >>> Ack (the local custom in the MPAM code is to use "mba", but arguably, >>> the meaning is not quite the same -- I'll change it.) >> >> I am curious what the motivation is for the custom? Knowing this will help >> me to keep things consistent when the two worlds meet. > > I think this has just evolved over time. On the x86 side, MBA is a > specific architectural feature, but on the MPAM side the architecture > doesn't really have a name for the same thing. Memory bandwidth is a > concept, but a few different types of control are defined for it, with > different names. > > So, for the MPAM driver "mba" is more of a software concept than > something in a published spec: it's the glue that attaches to "MB" > resource as seen through resctrl. > > (This isn't official though; it's just the mental model that I have > formed.) I see. Thank you for the details. My mental model is simpler: write acronyms in upper case. ... >>>>> On MPAM systems, this results in somewhat worse-than-worst-case >>>>> rounding, since bw_gran is in general only an approximation to the >>>>> actual hardware granularity on these systems, and the hardware >>>>> bandwidth allocation control value is not natively a percentage -- >>>>> necessitating a further conversion in the resctrl_arch_update_domains() >>>>> path, regardless of the conversion done at parse time. >>>>> >>>>> Allow the arch to provide its own parse-time conversion that is >>>>> appropriate for the hardware, and move the existing conversion to x86. >>>>> This will avoid accumulated error from rounding the value twice on MPAM >>>>> systems. >>>>> >>>>> Clarify the documentation, but avoid overly exact promises. >>>>> >>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core >>>>> code, for now. >>>> >>>> Sounds like MPAM may be ready to start the schema parsing discussion again? >>>> I understand that MPAM has a few more ways to describe memory bandwidth as >>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing >>>> schema format to user space, which seems like a good idea for new schema. >>> >>> My own ideas in this area are a little different, though I agree with >>> the general idea. >> >> Should we expect a separate proposal from James? > > At some point, yes. We still need to have a chat about it. > > Right now, I was just throwing an idea out there. Thank you very much for doing so. We are digesting it. ... >>> For this patch, my aim was to avoid changing anything unnecessarily. >> >> Understood. More below as I try to understand the details but it does not >> really sound as though the current interface works that great for MPAM. If I >> understand correctly this patch enables MPAM to use existing interface for >> its memory bandwidth allocations but doing so does not enable users to >> obtain benefit of hardware capabilities. For that users would want to use >> the new interface? > > In ideal world, probably, yes. 
> > Since not all use cases will care about full precision, the MB resource > (approximated for MPAM) should be fine for a lot of people, but I > expect that sooner or later somebody will want more exact control. ack. > >>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ >>>>> the other tests except for the NONCONT_CAT tests, which do not seem to >>>>> be supported in my configuration -- and have nothing to do with the >>>>> code touched by this patch). >>>> >>>> Is the NONCONT_CAT test failing (i.e printing "not ok")? >>>> >>>> The NONCONT_CAT tests may print error messages as debug information as part of >>>> running, but these errors are expected as part of the test. The test should accurately >>>> state whether it passed or failed though. For example, below attempts to write >>>> a non-contiguous CBM to a system that does not support non-contiguous masks. >>>> This fails as expected, error messages printed as debugging and thus the test passes >>>> with an "ok". >>>> >>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument >>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected >>>> ok 5 L3_NONCONT_CAT: test >>> >>> I don't think that this was anything to do with my changes, but I don't >>> still seem to have the test output. (Since this test has to do with >>> bitmap schemata (?), it seemed unlikely to be affected by changes to >>> bw_validate().) >> >> I agree that this should not have anything to do with this patch. My concern >> is that I understood that the test failed for a feature that is not supported. >> If this is the case then there may be a problem with the test. The test should >> not fail if the feature is not supported but instead skip the test. > > I'll try to capture more output from this when I re-run it, so that we > can figure out what this is. Thank you. ... >>> >>> [...] >>> >>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst >>>>> index c7949dd44f2f..a1d0469d6dfb 100644 >>>>> --- a/Documentation/filesystems/resctrl.rst >>>>> +++ b/Documentation/filesystems/resctrl.rst >>>>> @@ -143,12 +143,11 @@ with respect to allocation: >>>>> user can request. >>>>> >>>>> "bandwidth_gran": >>>>> - The granularity in which the memory bandwidth >>>>> + The approximate granularity in which the memory bandwidth >>>>> percentage is allocated. The allocated >>>>> b/w percentage is rounded off to the next >>>>> - control step available on the hardware. The >>>>> - available bandwidth control steps are: >>>>> - min_bandwidth + N * bandwidth_gran. >>>>> + control step available on the hardware. The available >>>>> + steps are at least as small as this value. >>>> >>>> A bit difficult to parse for me. >>>> Is "at least as small as" same as "at least"? >>> >>> It was supposed to mean: "The available steps are no larger than this >>> value." >> >> This is clear to me, especially when compared with the planned addition to >> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting >> the paragraph below (more below). >> >>> >>> Formally My expectation is that this value is the smallest integer >>> number of percent which is not smaller than the apparent size of any >>> individual rounding step. Equivalently, this is the smallest number g >> >> Considering the two statements: >> - "The available steps are no larger than this value." >> - "this value ... 
is not smaller than the apparent size of any individual rounding step" >> >> The "not larger" and "not smaller" sounds like all these words just end up saying that >> this is the step size? > > They are intended to be the same statement: A <= B versus > B >= A respectively. This is what I understood from the words ... and that made me think that it can be simplified to A = B ... but no need to digress ... onto the alternatives below ... > > But I'd be the first to admit that the wording is a bit twisted! > (I wouldn't be astonished if I got something wrong somewhere.) > > See below for an alternative way of describing this that might be more > intuitive. > >> >>> for which writing "MB: 0=x" and "MB: 0=y" yield different >>> configurations for every in-range x and where y = x + g and y is also >>> in-range. >>> >>> That's a bit of a mouthful, though. If you can think of a more >>> succinct way of putting it, I'm open to suggestions! >>> >>>> Please note that the documentation has a section "Memory bandwidth Allocation >>>> and monitoring" that also contains these exact promises. >>> >>> Hmmm, somehow I completely missed that. >>> >>> Does the following make sense? Ideally, there would be a simpler way >>> to describe the discrepancy between the reported and actual values of >>> bw_gran... >>> >>> | Memory bandwidth Allocation and monitoring >>> | ========================================== >>> | >>> | [...] >>> | >>> | The minimum bandwidth percentage value for each cpu model is predefined >>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth >>> | granularity that is allocated is also dependent on the cpu model and can >>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth >>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded >>> | -to the next control step available on the hardware. >>> | +control steps are: min_bw + N * (bw_gran - e), where e is a >>> | +non-negative, hardware-defined real constant that is less than 1. >>> | +Intermediate values are rounded to the next control step available on >>> | +the hardware. >>> | + >>> | +At the time of writing, the constant e referred to in the preceding >>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran >>> | +describes the step size exactly), but this may not be the case on other >>> | +hardware when the actual granularity is not an exact divisor of 100. >> >> Have you considered how to share the value of "e" with users? > > Perhaps introducing this "e" as an explicit parameter is a bad idea and > overly formal. In practice, there are likely to be various sources of > skid and approximation in the hardware, so exposing an actual value may > be counterproductive -- i.e., what usable guarantee is this providing > to userspace, if this is likely to be swamped by approximations > elsewhere? > > Instead, maybe we can just say something like: > > | The available steps are spaced at roughly equal intervals between the > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading > | info/MB/bandwidth_gran gives the worst-case precision of these > | interval steps, in per cent. > > What do you think? I find "worst-case precision" a bit confusing; consider, for example, what would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives the upper limit of these interval steps"? I believe this matches what you mentioned a couple of messages ago: "The available steps are no larger than this value." 
(and "per cent" -> "percent") > > If that's adequate, then the wording under the definition of > "bandwidth_gran" could be aligned with this. I think putting together a couple of your proposals and statements while making the text more accurate may work: "bandwidth_gran": The approximate granularity in which the memory bandwidth percentage is allocated. The allocated bandwidth percentage is rounded up to the next control step available on the hardware. The available hardware steps are no larger than this value. I assume "available" is needed because, even though the steps are not larger than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" to 100% range? >>>> I think that using const to pass data to architecture is great, thanks. >>>> >>>> Reinette >>> >>> I try to constify by default when straightforward to do so, since the >>> compiler can then find which cases need to change; the reverse >>> direction is harder to automate... >> >> Could you please elaborate what you mean with "reverse direction"? > > I just meant that over-consting tends to result in violations of the > language that the compiler will detect, but under-consting doesn't: > > static void foo(int *nonconstp, const int *constp) > { > *constp = 0; // compiler error > (*nonconstp); // silently accpeted, though it could have been const > } > > So, the compiler will tell you places where const needs to be removed > (or something else needs to change), but to find places where const > could be _added_, you have to hunt them down yourself, or use some > other tool that is probably not part of the usual workflow. Got it, thanks. Reinette
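A worked example may help pin the proposed wording down (assumed numbers, not taken from any real platform): on hardware with 64 equal throttling steps, each step spans 100/64 = 1.5625%, so bandwidth_gran would report 2 -- the smallest integer percentage not below the true step -- and, in the "e" formalism discussed above, e = 2 - 1.5625 = 0.4375, which is non-negative and less than 1 as required. The real steps are then indeed "no larger than" the reported value, which is exactly the promise being proposed.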
Hi Reinette, On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote: > Hi Dave, > > On 9/25/25 5:46 AM, Dave Martin wrote: > > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: > >> On 9/22/25 7:39 AM, Dave Martin wrote: > >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > >>>> Hi Dave, [...] > >>>> Also please use upper case for acronym mba->MBA. > >>> > >>> Ack (the local custom in the MPAM code is to use "mba", but arguably, > >>> the meaning is not quite the same -- I'll change it.) > >> > >> I am curious what the motivation is for the custom? Knowing this will help > >> me to keep things consistent when the two worlds meet. > > > > I think this has just evolved over time. On the x86 side, MBA is a > > specific architectural feature, but on the MPAM side the architecture > > doesn't really have a name for the same thing. Memory bandwidth is a > > concept, but a few different types of control are defined for it, with > > different names. > > > > So, for the MPAM driver "mba" is more of a software concept than > > something in a published spec: it's the glue that attaches to "MB" > > resource as seen through resctrl. > > > > (This isn't official though; it's just the mental model that I have > > formed.) > > I see. Thank you for the details. My mental model is simpler: write acronyms > in upper case. Generally, I agree, although I'm not sure whether that acronym belongs in the MPAM-specific code. For this patch, though, that's irrelevant. I've changed it to "MBA" as requested. [...] > >> really sound as though the current interface works that great for MPAM. If I > >> understand correctly this patch enables MPAM to use existing interface for > >> its memory bandwidth allocations but doing so does not enable users to > >> obtain benefit of hardware capabilities. For that users would want to use > >> the new interface? > > > > In ideal world, probably, yes. > > > > Since not all use cases will care about full precision, the MB resource > > (approximated for MPAM) should be fine for a lot of people, but I > > expect that sooner or later somebody will want more exact control. > > ack. OK. [...] > >> Considering the two statements: > >> - "The available steps are no larger than this value." > >> - "this value ... is not smaller than the apparent size of any individual rounding step" > >> > >> The "not larger" and "not smaller" sounds like all these words just end up saying that > >> this is the step size? > > > > They are intended to be the same statement: A <= B versus > > B >= A respectively. > > This is what I understood from the words ... and that made me think that it > can be simplified to A = B ... but no need to digress ... onto the alternatives below ... Right... [...] > > Instead, maybe we can just say something like: > > > > | The available steps are spaced at roughly equal intervals between the > > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading > > | info/MB/bandwidth_gran gives the worst-case precision of these > > | interval steps, in per cent. > > > > What do you think? > > I find "worst-case precision" a bit confusing; consider, for example, what > would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives > the upper limit of these interval steps"? I believe this matches what you > mentioned a couple of messages ago: "The available steps are no larger than this > value." Yes, that works. "Worst case" implies a value judgement that smaller steps are "better" than large steps, since the goal is control. But your wording, to the effect that this is the largest (apparent) step size, conveys all the needed information. > (and "per cent" -> "percent") ( Note: https://en.wiktionary.org/wiki/per_cent ) (Though either is acceptable, the fused word has a more informal feel to it for me. Happy to change it -- though your rewording below gets rid of it anyway.) (This word doesn't appear in resctrl.rst -- everything is "percentage" etc.) > > If that's adequate, then the wording under the definition of > "bandwidth_gran" could be aligned with this. > I think putting together a couple of your proposals and statements while making the > text more accurate may work: > > "bandwidth_gran": > The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated bandwidth percentage > is rounded up to the next control step available on the > hardware. The available hardware steps are no larger than > this value. That's better, thanks. I'm happy to pick this up and reword the text in both places along these lines. > I assume "available" is needed because, even though the steps are not larger > than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > to 100% range? Yes -- or, rather, the steps _look_ inconsistent because they are rounded to exact percentages by the interface. I don't think we expect the actual steps in the hardware to be irregular. [...] > Reinette Cheers ---Dave
Hi Dave, On 9/29/25 5:43 AM, Dave Martin wrote: > Hi Reinette, > > On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote: >> Hi Dave, >> >> On 9/25/25 5:46 AM, Dave Martin wrote: > > [...] > >> I see. Thank you for the details. My mental model is simpler: write acronyms >> in upper case. > > Generally, I agree, although I'm not sure whether that acronym belongs > in the MPAM-specific code. > > For this patch, though, that's irrelevant. I've changed it to "MBA" > as requested. > Thank you. ... > [...] > >>> Instead, maybe we can just say something like: >>> >>> | The available steps are spaced at roughly equal intervals between the >>> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading >>> | info/MB/bandwidth_gran gives the worst-case precision of these >>> | interval steps, in per cent. >>> >>> What do you think? >> >> I find "worst-case precision" a bit confusing; consider, for example, what >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives >> the upper limit of these interval steps"? I believe this matches what you >> mentioned a couple of messages ago: "The available steps are no larger than this >> value." > > Yes, that works. "Worst case" implies a value judgement that smaller > steps are "better" than large steps, since the goal is control. > > But your wording, to the effect that this is the largest (apparent) > step size, conveys all the needed information. Thank you for considering it. My preference is for stating things succinctly and not leaving too much for interpretation. > >> (and "per cent" -> "percent") > > ( Note: https://en.wiktionary.org/wiki/per_cent ) Yes, in particular I note the "chiefly Commonwealth". 
I respect the differences in the English language and was easily convinced in earlier MPAM work to accept different spelling. I now regret doing so because after the merge we now have a supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant to encounter when reading the code and also complicates new work. > (Though either is acceptable, the fused word has a more informal feel > to it for me. Happy to change it -- though your rewording below gets > rid of it anyway.) (This word doesn't appear in resctrl.rst -- > everything is "percentage" etc.) > >> >>> If that's adequate, then the wording under the definition of >>> "bandwidth_gran" could be aligned with this. >> >> I think putting together a couple of your proposals and statements while making the >> text more accurate may work: >> >> "bandwidth_gran": >> The approximate granularity in which the memory bandwidth >> percentage is allocated. The allocated bandwidth percentage >> is rounded up to the next control step available on the >> hardware. The available hardware steps are no larger than >> this value. > > That's better, thanks. I'm happy to pick this up and reword the text > in both places along these lines. Thank you very much. Please do feel free to rework. > >> I assume "available" is needed because, even though the steps are not larger >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" >> to 100% range? > > Yes -- or, rather, the steps _look_ inconsistent because they are > rounded to exact percentages by the interface. > > I don't think we expect the actual steps in the hardware to be > irregular. > Thank you for clarifying. Reinette
Hi Reinette, On Mon, Sep 29, 2025 at 08:38:13AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/29/25 5:43 AM, Dave Martin wrote: [...] > > Generally, I agree, although I'm not sure whether that acronym belongs > > in the MPAM-specific code. > > > > For this patch, though, that's irrelevant. I've changed it to "MBA" > > as requested. > > > > Thank you. [...] > >> I find "worst-case precision" a bit confusing; consider, for example, what > >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives > >> the upper limit of these interval steps"? I believe this matches what you > >> mentioned a couple of messages ago: "The available steps are no larger than this > >> value." > > > > Yes, that works. "Worst case" implies a value judgement that smaller > > steps are "better" than large steps, since the goal is control. > > > > But your wording, to the effect that this is the largest (apparent) > > step size, conveys all the needed information. > > Thank you for considering it. My preference is for stating things succinctly > and not leaving too much for interpretation. I find that it's not always easy to work out what information is essential without the discussion... so I hope that didn't feel like a waste of time! > >> (and "per cent" -> "percent") > > > > ( Note: https://en.wiktionary.org/wiki/per_cent ) > > Yes, in particular I note the "chiefly Commonwealth". I respect the differences > in the English language and was easily convinced in earlier MPAM work to > accept different spelling. I now regret doing so because after the merge we now have a > supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant > to encounter when reading the code and also complicates new work. > > > (Though either is acceptable, the fused word has a more informal feel > > to it for me. Happy to change it -- though your rewording below gets > > rid of it anyway.) (This word doesn't appear in resctrl.rst -- > > everything is "percentage" etc.) Sure, it's best not to have mixed-up conventions in the same document. (With this one, I wasn't aware that there were regional differences at all, so that was news to me...) [...] > >> I think putting together a couple of your proposals and statements while making the > >> text more accurate may work: > >> > >> "bandwidth_gran": > >> The approximate granularity in which the memory bandwidth > >> percentage is allocated. The allocated bandwidth percentage > >> is rounded up to the next control step available on the > >> hardware. The available hardware steps are no larger than > >> this value. > > > > That's better, thanks. I'm happy to pick this up and reword the text > > in both places along these lines. > > Thank you very much. Please do feel free to rework. > > > > >> I assume "available" is needed because, even though the steps are not larger > >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > >> to 100% range? > > > > Yes -- or, rather, the steps _look_ inconsistent because they are > > rounded to exact percentages by the interface. > > > > I don't think we expect the actual steps in the hardware to be > > irregular. > > > Thank you for clarifying. > > Reinette OK. I'll tidy up the loose ends and repost once I've had a chance to re-test. Thanks for the review. Cheers ---Dave
On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/25/25 5:46 AM, Dave Martin wrote: [...] > I think putting together a couple of your proposals and statements while making the > text more accurate may work: > > "bandwidth_gran": > The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated bandwidth percentage > is rounded up to the next control step available on the > hardware. The available hardware steps are no larger than > this value. > > I assume "available" is needed because, even though the steps are not larger > than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > to 100% range? What values are allowed for "bandwidth_gran"? The "Intel® Resource Director Technology (Intel® RDT) Architecture Specification" https://cdrdv2.intel.com/v1/dl/getContent/789566 describes the upcoming region aware memory bandwidth allocation controls as being a number from "1" to "Q" (enumerated in an ACPI table). First implementation looks like Q == 255, which means a granularity of 0.392%. The spec has headroom to allow Q == 511. I don't expect users to need that granularity at the high bandwidth end of the range, but I do expect them to care for highly throttled background/batch jobs to make sure they can't affect performance of the high priority jobs. I'd hate to have to round all low bandwidth controls to 1% steps. [...] -Tony
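A small sketch of Tony's resolution point (userspace C; it assumes, purely for illustration, that throughput scales proportionally with the programmed 1..Q value -- a relationship the spec may not actually promise, as questioned later in the thread):

#include <stdio.h>

#define Q 255	/* assumed maximum control value from the ACPI table */

/* Nearest control value in 1..Q for a requested bandwidth percentage,
 * under the proportionality assumption stated above. */
static unsigned int pct_to_q(double pct)
{
	unsigned int v = (unsigned int)(pct * Q / 100.0 + 0.5);

	return v < 1 ? 1 : (v > Q ? Q : v);
}

int main(void)
{
	unsigned int v = pct_to_q(1.2);

	/* The native scale can express ~0.392% steps: 1.2% maps to
	 * 3/255, i.e. ~1.176%; an interface rounded to whole-percent
	 * steps cannot express the difference. */
	printf("1.2%% -> %u/%u (= %.3f%%)\n", v, Q, v * 100.0 / Q);
	return 0;
}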
Hi Tony, On 9/25/25 2:35 PM, Luck, Tony wrote: > On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote: >> On 9/25/25 5:46 AM, Dave Martin wrote: ...
>> >> I think putting together a couple of your proposals and statements while making the >> text more accurate may work: >> >> "bandwidth_gran": >> The approximate granularity in which the memory bandwidth >> percentage is allocated. The allocated bandwidth percentage >> is rounded up to the next control step available on the >> hardware. The available hardware steps are no larger than >> this value. >> >> I assume "available" is needed because, even though the steps are not larger >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" >> to 100% range? > > What values are allowed for "bandwidth_gran"? The "Intel® Resource This is a property of the MB resource where the ABI is to express allocations as a percentage. Current doc: "bandwidth_gran": The granularity in which the memory bandwidth percentage is allocated. The allocated b/w percentage is rounded off to the next control step available on the hardware. The available bandwidth control steps are: min_bandwidth + N * bandwidth_gran. I do not expect we can switch it to fractions so I would say that integer values are allowed, starting at 1. I understand that the MB resource on AMD supports different ranges and I find that ABI discrepancy unfortunate. I do not think this should be seen as an opportunity where "anything goes" when it comes to MB and used as an excuse to pile on another range of hardware-dependent inputs. Instead I believe we should keep the MB interface as-is and instead work on a generic interface that enables user space to interact with resctrl to have the benefit of all hardware capabilities without needing to know which hardware is underneath. > Director Technology (Intel® RDT) Architecture Specification" > > https://cdrdv2.intel.com/v1/dl/getContent/789566 > > describes the upcoming region aware memory bandwidth allocation > controls as being a number from "1" to "Q" (enumerated in an ACPI > table). First implementation looks like Q == 255, which means a > granularity of 0.392%. The spec has headroom to allow Q == 511. > > I don't expect users to need that granularity at the high bandwidth > end of the range, but I do expect them to care for highly throttled > background/batch jobs to make sure they can't affect performance of > the high priority jobs. > > I'd hate to have to round all low bandwidth controls to 1% steps. This is the limitation of choosing to expose this feature as an MB resource and seems to be the same problem that Dave is facing. For finer granularity allocations I expect that we would need a new schema/resource backed by new properties as proposed by Dave in https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ This will require updates to user space (that will anyway be needed if wedging another non-ABI input into MB). Reinette
Hi Reinette, Tony, On Thu, Sep 25, 2025 at 03:18:51PM -0700, Reinette Chatre wrote: > Hi Tony, > > On 9/25/25 2:35 PM, Luck, Tony wrote: [...] > > Director Technology (Intel® RDT) Architecture Specification" > > > > https://cdrdv2.intel.com/v1/dl/getContent/789566 > > > > describes the upcoming region aware memory bandwidth allocation > > controls as being a number from "1" to "Q" (enumerated in an ACPI > > table). First implementation looks like Q == 255, which means a > > granularity of 0.392%. The spec has headroom to allow Q == 511. That does look like it would benefit from exposing the hardware field without rounding (similarly to MPAM). Is the relationship between this value and the expected memory system throughput actually defined anywhere? If the expected throughput is exactly proportional to this value, or a reasonable approximation to this, then that is simple -- but I can't see it actually stated. When a spec suggests a need to divide by (2^N - 1), I do wonder whether that is what they _really_ meant (and whether hardware will just do the obvious cheap approximation in defiance of the spec). > > > > I don't expect users to need that granularity at the high bandwidth > > end of the range, but I do expect them to care for highly throttled > > background/batch jobs to make sure they can't affect performance of > > the high priority jobs. A case where it _might_ matter is where there is a non-trivial number of jobs, and an attempt is made to share bandwidth among them. Although it may not matter exactly how much bandwidth is given to each job, the rounding errors may accumulate so that they add up to significantly more than or less than 100% in total. This feels undesirable. Rounding off the value in the interface effectively makes it impossible for portable software to avoid this problem... > > I'd hate to have to round all low bandwidth controls to 1% steps. +1! (No pun intended.) > This is the limitation of choosing to expose this feature as an MB resource > and seems to be the same problem that Dave is facing. For finer granularity > allocations I expect that we would need a new schema/resource backed by new > properties as proposed by Dave in > https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ > This will require updates to user space (that will anyway be needed if wedging > another non-ABI input into MB). > > Reinette Ack; while we could add decimal places to bandwidth_gran as reported to userspace, we don't know that software isn't going to choke on that. Plus, we might need to add precision to the control values too -- it's no good advertising 0.5% granularity when the MB schema only accepts/reports integers. Software that parses anything as (potentially) a real number might work transparently, but we didn't warn users that they might need to do that... Cheers ---Dave
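To put numbers on the accumulation worry above, a small sketch (userspace C, with an illustrative bw_gran of 10%): splitting bandwidth evenly across seven jobs and letting the interface round each share up -- the current roundup() behaviour at parse time -- programs well over 100% in total.

#include <stdio.h>

/* Round a percentage up to the next multiple of the granularity,
 * mirroring the roundup() done at parse time today. */
static unsigned int roundup_pct(unsigned int pct, unsigned int gran)
{
	return (pct + gran - 1) / gran * gran;
}

int main(void)
{
	const unsigned int jobs = 7, gran = 10;
	unsigned int share = 100 / jobs;	/* 14% per job */
	unsigned int total = jobs * roundup_pct(share, gran);

	/* 7 jobs x 20% = 140%: the per-job rounding errors add up */
	printf("requested ~100%% in total, programmed %u%%\n", total);
	return 0;
}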