The control value parser for the MB resource currently coerces the
memory bandwidth percentage value from userspace to be an exact
multiple of the bw_gran parameter.
On MPAM systems, this results in somewhat worse-than-worst-case
rounding: bw_gran is in general only an approximation of the actual
hardware granularity on these systems, and the hardware bandwidth
allocation control value is not natively a percentage.  A further
conversion is therefore needed in the resctrl_arch_update_domains()
path, regardless of the conversion done at parse time.
Allow the arch to provide its own parse-time conversion that is
appropriate for the hardware, and move the existing conversion to x86.
This will avoid accumulated error from rounding the value twice on MPAM
systems.
Clarify the documentation, but avoid overly exact promises.
Clamping to bw_min and bw_max still feels generic: leave it in the core
code, for now.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
---
Based on v6.17-rc3.
Testing: the resctrl MBA and MBM tests pass on a random x86 machine,
as do the other tests except for the NONCONT_CAT tests, which do not
seem to be supported in my configuration (and have nothing to do with
the code touched by this patch).
Notes:
I put the x86 version out of line in order to avoid having to move
struct rdt_resource and its dependencies into resctrl_types.h -- which
would create a lot of diff noise. Schemata writes from userspace have
a high overhead in any case.
For MPAM the conversion will be a no-op, because the incoming
percentage from the core resctrl code needs to be converted to hardware
representation in the driver anyway.
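
For illustration only (not part of this patch), the anticipated
MPAM-side hook might end up as simple as the following sketch, since
the driver keeps the percentage as-is at parse time:

    u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
    {
            /*
             * Pass the percentage through unmodified: conversion to
             * the hardware control value happens later, in the
             * resctrl_arch_update_domains() path.
             */
            return val;
    }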
Perhaps _all_ the types should move to resctrl_types.h.
For now, I went for the smallest diffstat...
---
Documentation/filesystems/resctrl.rst | 7 +++----
arch/x86/include/asm/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++
fs/resctrl/ctrlmondata.c | 2 +-
include/linux/resctrl.h | 6 ++++++
5 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..a1d0469d6dfb 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -143,12 +143,11 @@ with respect to allocation:
user can request.
"bandwidth_gran":
- The granularity in which the memory bandwidth
+ The approximate granularity in which the memory bandwidth
percentage is allocated. The allocated
b/w percentage is rounded off to the next
- control step available on the hardware. The
- available bandwidth control steps are:
- min_bandwidth + N * bandwidth_gran.
+ control step available on the hardware. The available
+ steps are at least as small as this value.
"delay_linear":
Indicates if the delay scale is linear or
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index feb93b50e990..8bec2b9cc503 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -18,6 +18,8 @@
*/
#define X86_RESCTRL_EMPTY_CLOSID ((u32)~0)
+struct rdt_resource;
+
/**
* struct resctrl_pqr_state - State cache for the PQR MSR
* @cur_rmid: The cached Resource Monitoring ID
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 1189c0df4ad7..cf9b30b5df3c 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -16,9 +16,15 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/cpu.h>
+#include <linux/math.h>
#include "internal.h"
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
+{
+ return roundup(val, (unsigned long)r->membw.bw_gran);
+}
+
int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index d98e0d2de09f..c5e73b75aaa0 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
return false;
}
- *data = roundup(bw, (unsigned long)r->membw.bw_gran);
+ *data = resctrl_arch_round_bw(bw, r);
return true;
}
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 6fb4894b8cfd..5b2a555cf2dd 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
+/*
+ * Round a bandwidth control value to the nearest value acceptable to
+ * the arch code for resource r:
+ */
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
+
/*
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
--
2.34.1
Hi again,

On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:

[...]

> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
>
> Sounds like MPAM may be ready to start the schema parsing discussion
> again?  I understand that MPAM has a few more ways to describe memory
> bandwidth as well as cache portion partitioning.  Previously ([1] [2])
> James mused about exposing schema format to user space, which seems
> like a good idea for new schema.

On this topic, specifically:

My own ideas in this area are a little different, though I agree with
the general idea.

Bitmap controls are distinct from numeric values, but for numbers, I'm
not sure that distinguishing percentages from other values is required,
since this is really just a specific case of a linear scale.

I imagined a generic numeric schema, described by a set of files like
the following in a schema's info directory:

    min:   minimum value, e.g., 1
    max:   maximum value, e.g., 1023
    scale: value that corresponds to one unit
    unit:  quantified base unit, e.g., "100pc", "64MBps"
    map:   mapping function name

If s is the value written in a schemata entry and p is the
corresponding physical amount of resource, then

    min <= s <= max

and

    p = map(s / scale) * unit

One reason why I prefer this scaling scheme over the floating-point
approach is that it can be exact (at least for currently known
platforms), and it doesn't require a new floating-point parser/
formatter to be written for this one thing in the kernel (which I
suspect is likely to be error-prone and poorly defined around
subtleties such as rounding behaviour).

"map" anticipates non-linear ramps, but this is only really here as a
forwards compatibility get-out.  For now, this might just be set to
"none", meaning the identity mapping (i.e., a no-op).  This may shadow
the existing "delay_linear" parameter, but with more general
applicability if we need it.

The idea is that userspace reads the info files and then does the
appropriate conversions itself.  This might or might not be seen as a
burden, but it would give exact control over the hardware configuration
with a generic interface, with possibly greater precision than the
existing schemata allow (when the hardware supports it), and without
having to second-guess the rounding that the kernel may or may not do
on the values.

For RDT MBA, we might have:

    min:   10
    max:   100
    scale: 100
    unit:  100pc
    map:   none

The schemata entry

    MB: 0=10, 1=100

would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1.

For AMD SMBA, we might have:

    min:   1
    max:   100
    scale: 8
    unit:  1GBps

(if I've understood this correctly from resctrl.rst.)

For MPAM MBW_MAX with, say, 6 bits of resolution, we might have:

    min:   1
    max:   64
    scale: 64
    unit:  100pc
    map:   none

The schemata entry

    MB: 0=1, 1=64

would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1.  This would probably need to be a new schema,
since we already have "MB" mimicking x86.

Exposing the hardware scale in this way would give userspace precise
control (including in sub-1% increments on capable hardware), without
having to second-guess the way the kernel will round the values.
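
As a worked sketch of the userspace-side conversion for a linear
("map: none") schema, using the illustrative MPAM MBW_MAX parameters
above (the value of s is invented for the example):

    #include <stdio.h>

    /* Toy conversion: physical share = (s / scale) * unit. */
    int main(void)
    {
            unsigned int s = 16;            /* value written to schemata */
            unsigned int scale = 64;        /* from the proposed info file */
            double unit_pc = 100.0;         /* "100pc": one unit = 100% */

            /* 16/64 of 100% -> 25.000% */
            printf("requested bandwidth: %.3f%%\n",
                   (double)s / scale * unit_pc);
            return 0;
    }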
> Is this something MPAM is still considering?  For example, the minimum
> and maximum ranges that can be specified, is this something you already
> have some ideas for?  Have you perhaps considered Tony's RFD [3] that
> includes discussion on how to handle min/max ranges for bandwidth?

This seems to be a different thing.  I think James had some thoughts on
this already -- I haven't checked on his current idea, but one option
would be simply to expose this as two distinct schemata, say MB_MIN,
MB_MAX.

There's a question of how to cope with multiple different schemata
entries that shadow each other (i.e., control the same hardware
resource).

Would something like the following work?  A read from schemata might
produce something like this:

    MB: 0=50, 1=50
    # MB_HW: 0=32, 1=32
    # MB_MIN: 0=31, 1=31
    # MB_MAX: 0=32, 1=32

(Where MB_HW is the MPAM schema with 6-bit resolution that I
illustrated above, and MB_MIN and MB_MAX are similar schemata for the
specific MIN and MAX controls in the hardware.)

Userspace that does not understand the new entries would need to ignore
the commented lines, but can otherwise safely alter and write back the
schemata with the expected results.  The kernel would in turn ignore
the commented lines on write.  The commented lines are meaningful but
"inactive": they describe the current hardware configuration on read,
but (unless explicitly uncommented) won't change anything on write.

Software that understands the new entries can uncomment the conflicting
entries and write them back instead of (or in addition to) the
conflicting entries.  For example, userspace might write the following:

    MB_MIN: 0=16, 1=16
    MB_MAX: 0=32, 1=32

Which might then read back as follows:

    MB: 0=50, 1=50
    # MB_HW: 0=32, 1=32
    # MB_MIN: 0=16, 1=16
    # MB_MAX: 0=32, 1=32

I haven't tried to develop this idea further, for now.

I'd be interested in people's thoughts on it, though.

Cheers
---Dave
Hi Dave,

Just one correction ...

On 9/22/25 8:04 AM, Dave Martin wrote:
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:

[...]

> For AMD SMBA, we might have:
>
>     min:   1
>     max:   100
>     scale: 8
>     unit:  1GBps

Unfortunately not like this for AMD.  Initial support for AMD MBA set
max to a hardcoded 2048 [1] that was later [2] modified to learn max
from hardware.  Of course this broke resctrl as a generic interface and
I hope we learned enough since to not repeat this mistake nor give up
on MB and make its interface even worse by, for example, adding more
architecture specific input ranges.

Reinette

[1] commit 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
[2] commit 0976783bb123 ("x86/resctrl: Remove hard-coded memory bandwidth limit")
Hi Reinette,

On Fri, Sep 26, 2025 at 01:54:10PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> Just one correction ...

[...]

> > For AMD SMBA, we might have:
> >
> >     min:   1
> >     max:   100
> >     scale: 8
> >     unit:  1GBps
>
> Unfortunately not like this for AMD.  Initial support for AMD MBA set
> max to a hardcoded 2048 [1] that was later [2] modified to learn max
> from hardware.

[...]

The "100" was just picked randomly in my example.

Looking more carefully at the spec, it may make sense to have:

    min:   1
    max:   (1 << value of BW_LEN)
    scale: 8
    unit:  1GBps

(This max value corresponds to setting the "unlimited" bit in the
control MSR; the other bits of the bandwidth value are then ignored.
For this instance of the schema, programming the "max" value would be
expected to give the nearest approximation to unlimited bandwidth that
the hardware permits.)

While the memory system is under-utilised end-to-end, I would expect
throughput from a memory-bound job to scale linearly with the control
value, but the control level at which throughput starts to saturate
will depend on the pattern of load throughout the system.

This seems fundamentally different from percentage controls -- it looks
impossible to simulate proportional controls with absolute throughput
controls, or vice-versa (?)

Cheers
---Dave
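
As a worked instance of the conversion formula under these assumed SMBA
parameters (scale = 8, unit = 1GBps, map = none; the value of s is
invented for the example): writing s = 24 would request
p = (24 / 8) * 1GBps = 3GBps, while writing the advertised max of
(1 << BW_LEN) would request the hardware's "unlimited" setting.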
On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
> Hi again,
>
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:

[...]

> Would something like the following work?  A read from schemata might
> produce something like this:
>
>     MB: 0=50, 1=50
>     # MB_HW: 0=32, 1=32
>     # MB_MIN: 0=31, 1=31
>     # MB_MAX: 0=32, 1=32

[...]

> I'd be interested in people's thoughts on it, though.

Applying this to Intel's upcoming region-aware memory bandwidth that
supports 255 steps and h/w min/max limits:

We would have info files with "min = 1, max = 255" and a schemata
file that looks like this to legacy apps:

    MB: 0=50;1=75
    #MB_HW: 0=128;1=191
    #MB_MIN: 0=128;1=191
    #MB_MAX: 0=128;1=191

But a newer app that is aware of the extensions can write:

    # cat > schemata << 'EOF'
    MB_HW: 0=10
    MB_MIN: 0=10
    MB_MAX: 0=64
    EOF

which then reads back as:

    MB: 0=4;1=75
    #MB_HW: 0=10;1=191
    #MB_MIN: 0=10;1=191
    #MB_MAX: 0=64;1=191

with the legacy line updated with the rounded value of the MB_HW
supplied by the user.  10/255 = 3.921% ... so call it "4".

The region aware h/w supports separate bandwidth controls for each
region.  We could hope (or perhaps update the spec to define) that
region0 is always node-local DDR memory and keep the "MB" tag for
that.

Then use some other tag naming for other regions.  Remote DDR,
local CXL, remote CXL are the ones we think are next in the h/w
memory sequence.  But the "region" concept would allow for other
options as other memory technologies come into use.
Hi Tony,

Thanks for taking a look at this -- comments below.

On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:

[...]

> Applying this to Intel's upcoming region-aware memory bandwidth that
> supports 255 steps and h/w min/max limits:

Following the MPAM example, would you also expect:

    scale: 255
    unit:  100pc

...?

> We would have info files with "min = 1, max = 255" and a schemata
> file that looks like this to legacy apps:

[...]

> with the legacy line updated with the rounded value of the MB_HW
> supplied by the user.  10/255 = 3.921% ... so call it "4".

I'm suggesting that this always be rounded up, so that you have a
guarantee that the steps are no smaller than the reported value.

(In this case, round-up and round-to-nearest give the same answer
anyway, though!)

> The region aware h/w supports separate bandwidth controls for each
> region.  We could hope (or perhaps update the spec to define) that
> region0 is always node-local DDR memory and keep the "MB" tag for
> that.

Do you have concerns about existing software choking on the #-prefixed
lines?

> Then use some other tag naming for other regions.  Remote DDR,
> local CXL, remote CXL are the ones we think are next in the h/w
> memory sequence.  But the "region" concept would allow for other
> options as other memory technologies come into use.

Would it be reasonable just to have a set of these schema instances,
per region, so:

    MB_HW: ...      // implicitly region 0
    MB_HW_1: ...
    MB_HW_2: ...

etc.

Or, did you have something else in mind?

My thinking is that we avoid adding complexity in the schemata file if
we treat mapping these schema instances onto the hardware topology as
an orthogonal problem.  So long as we have unique names in the schemata
file, we can describe elsewhere what they relate to in the hardware.

Cheers
---Dave
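
A minimal sketch of the round-up conversion being suggested here, using
the 10/255 figures from Tony's example (the helper name is invented):

    #include <stdio.h>

    /* ceil(hw * 100 / q): round the hardware control value up to the
     * next whole percent for the legacy "MB" line. */
    static unsigned int hw_to_legacy_percent(unsigned int hw, unsigned int q)
    {
            return (hw * 100 + q - 1) / q;
    }

    int main(void)
    {
            /* 10/255 = 3.921%, rounded up: prints "MB: 0=4" */
            printf("MB: 0=%u\n", hw_to_legacy_percent(10, 255));
            return 0;
    }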
On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote:
> Hi Tony,
>
> Thanks for taking a look at this -- comments below.

[...]

> Following the MPAM example, would you also expect:
>
>     scale: 255
>     unit:  100pc
>
> ...?

Yes.  255 (or whatever "Q" value is provided in the ACPI table)
corresponds to no throttling, so 100% bandwidth.

[...]

> I'm suggesting that this always be rounded up, so that you have a
> guarantee that the steps are no smaller than the reported value.

Round up, rather than round-to-nearest, makes sense.  Though perhaps
only cosmetic, as I would be surprised if anyone has a mix of tools
looking at the legacy schemata lines while programming using the
direct h/w controls.

> Do you have concerns about existing software choking on the #-prefixed
> lines?

Do they even need a # prefix?  We already mix lines for multiple
resources in the schemata file with a separate prefix for each
resource.  The schemata file also allows writes to just update one
resource (or one domain in a single resource).  The schemata file
started with just "L3".  Then we added "L2", "MB", and "SMBA" with no
concern that the initial "L3" manipulating tools would be confused.

> Would it be reasonable just to have a set of these schema instances,
> per region, so:
>
>     MB_HW: ...      // implicitly region 0
>     MB_HW_1: ...
>     MB_HW_2: ...

Chen Yu is currently looking at putting the word "TIER" into the
name, since there's some precedent for describing memory in "tiers".

Whatever naming scheme is used, the important part is how will users
find out what each schemata line actually means/controls.

> My thinking is that we avoid adding complexity in the schemata file if
> we treat mapping these schema instances onto the hardware topology as
> an orthogonal problem.  So long as we have unique names in the
> schemata file, we can describe elsewhere what they relate to in the
> hardware.

Yes, exactly this.

-Tony
Hi Tony,

On Mon, Sep 29, 2025 at 09:37:41AM -0700, Luck, Tony wrote:
> On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote:

[...]

> Round up, rather than round-to-nearest, makes sense.  Though perhaps
> only cosmetic, as I would be surprised if anyone has a mix of tools
> looking at the legacy schemata lines while programming using the
> direct h/w controls.

Ack

[...]

> Do they even need a # prefix?  We already mix lines for multiple
> resources in the schemata file with a separate prefix for each
> resource.  The schemata file also allows writes to just update one
> resource (or one domain in a single resource).  The schemata file
> started with just "L3".  Then we added "L2", "MB", and "SMBA" with no
> concern that the initial "L3" manipulating tools would be confused.

The "#" thing is for backwards compatibility with old userspace that
might blindly "paste back" unknown entries when writing the schemata
file.  (See also my reply to Reinette [1].)

> Chen Yu is currently looking at putting the word "TIER" into the
> name, since there's some precedent for describing memory in "tiers".
>
> Whatever naming scheme is used, the important part is how will users
> find out what each schemata line actually means/controls.

Agreed.  That's a problem, but a separate one.

[...]

> > My thinking is that we avoid adding complexity in the schemata file
> > if we treat mapping these schema instances onto the hardware
> > topology as an orthogonal problem.  So long as we have unique names
> > in the schemata file, we can describe elsewhere what they relate to
> > in the hardware.
>
> Yes, exactly this.

OK, that's reassuring.

Cheers
---Dave

[1] https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/
Hi Dave,

On 9/29/25 6:56 AM, Dave Martin wrote:
> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
>> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:

...

>> The region aware h/w supports separate bandwidth controls for each
>> region.  We could hope (or perhaps update the spec to define) that
>> region0 is always node-local DDR memory and keep the "MB" tag for
>> that.
>
> Do you have concerns about existing software choking on the #-prefixed
> lines?

I am trying to understand the purpose of the #-prefix.  I see two
motivations for the #-prefix with the primary point that multiple
schema apply to the same resource.

1) Commented schema are "inactive"
   This is unclear to me.  In the MB example the commented lines show
   the finer grained controls.  Since the original MB resource is an
   approximation and the hardware must already be configured to support
   it, would the #-prefixed lines not show the actual "active"
   configuration?

2) Commented schema are "conflicting"
   The original proposal mentioned "write them back instead of (or in
   addition to) the conflicting entries".  I do not know how resctrl
   will be able to handle a user requesting a change to both "MB" and
   "MB_HW".  This seems like something that should fail?

On a high level it is not clear to me why the # prefix is needed.  As I
understand it, the schemata names will always be unique and the new
features made backward compatible to existing schemata names.  That is,
existing MB, L3, etc. will also have the new info files that describe
their values/ranges.  I expect that user space will ignore schema that
it is not familiar with, so the # prefix seems unnecessary?

I believe the motivation is to express a relationship between different
schema (you mentioned "shadow" initially).  I think this relationship
can be expressed clearly by using a namespace prefix (like "MB_" in the
examples).  This may help more when there are multiple schemata with
this format, where a #-prefix does not make obvious which resource is
shadowed.

>> Then use some other tag naming for other regions.  Remote DDR,
>> local CXL, remote CXL are the ones we think are next in the h/w
>> memory sequence.  But the "region" concept would allow for other
>> options as other memory technologies come into use.
>
> Would it be reasonable just to have a set of these schema instances,
> per region, so:
>
>     MB_HW: ...      // implicitly region 0
>     MB_HW_1: ...
>     MB_HW_2: ...
>
> etc.
>
> Or, did you have something else in mind?
>
> My thinking is that we avoid adding complexity in the schemata file if
> we treat mapping these schema instances onto the hardware topology as
> an orthogonal problem.  So long as we have unique names in the
> schemata file, we can describe elsewhere what they relate to in the
> hardware.

Agreed ... and "elsewhere" is expected to be unique depending on the
resource.

Reinette
Hi,

On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/29/25 6:56 AM, Dave Martin wrote:

[...]

> I am trying to understand the purpose of the #-prefix.  I see two
> motivations for the #-prefix with the primary point that multiple
> schema apply to the same resource.
>
> 1) Commented schema are "inactive"
>    This is unclear to me.  In the MB example the commented lines show
>    the finer grained controls.  Since the original MB resource is an
>    approximation and the hardware must already be configured to
>    support it, would the #-prefixed lines not show the actual
>    "active" configuration?

They would show the active configuration (possibly more precisely than
"MB" does).  If not, it's not clear how userspace that is trying to use
MB_HW (say) could read out the current configuration.

The # is intended to make resctrl ignore the lines when the file is
written by userspace.  This is done so that userspace has to actually
change those lines in order for them to take effect when writing.  Old
userspace can just pass them through without modification, without
anything unexpected happening.

The reason why I think that this convention may be needed is that we
never told (old) userspace what it was supposed to do with schemata
entries that it does not recognise.

> 2) Commented schema are "conflicting"
>    The original proposal mentioned "write them back instead of (or in
>    addition to) the conflicting entries".  I do not know how resctrl
>    will be able to handle a user requesting a change to both "MB" and
>    "MB_HW".  This seems like something that should fail?

If userspace is asking for two incompatible things at the same time, we
can either pick one of them and ignore the rest, or do nothing, or fail
explicitly.

If we think that it doesn't really matter what happens, then resctrl
could just dumbly process the entries in the order given.  If the
result is not what userspace wanted, that's not our problem.

(Today, nothing prevents userspace writing multiple "MB" lines at the
same time: resctrl will process them all, but only the final one will
have a lasting effect.  So, the fact that a resctrl write can contain
mutually incompatible requests does not seem to be new.)

> On a high level it is not clear to me why the # prefix is needed.  As
> I understand it, the schemata names will always be unique and the new
> features made backward compatible to existing schemata names.  That
> is, existing MB, L3, etc. will also have the new info files that
> describe their values/ranges.

Regarding backwards compatibility for the existing controls:

This proposal is only about numeric controls.  L3 wouldn't change, but
we could still add info/ metadata for bitmap controls at the same time
as adding it for numeric controls.

MB may be hard to describe in a useful way, though -- at least in the
MPAM case, where the number of steps does not divide into 100, and the
AMD cases where the meaning of the MB control values is different.  MB
and MB_HW are not interchangeable.

To obtain predictable results from MB, userspace would need to know
precisely how the kernel is going to round the value.  This feels like
an implementation detail that doesn't belong in the ABI.

I suppose we could also add a "granularity" entry in info/, but we have
the existing "bandwidth_gran" file for MB.  For any new schema, I don't
think we need to state the granularity: the other parameters can always
be adjusted so that the granularity is exactly 1.

Regarding the "#" (see below):

> I expect that user space will ignore schema that it is not familiar
> with, so the # prefix seems unnecessary?
>
> I believe the motivation is to express a relationship between
> different schema (you mentioned "shadow" initially).  I think this
> relationship can be expressed clearly by using a namespace prefix
> (like "MB_" in the examples).

An illustration would probably help, here.

Say resctrl has schemata MB, MB_HW, MB_MIN and MB_MAX, all of which
control (aspects of) the same underlying hardware resource.  Reading
the schemata file might yield:

    MB: 0=29
    MB_HW: 0=2
    MB_MIN: 0=1
    MB_MAX: 0=2

(I assume for this toy example that MB_{HW,MIN,MAX} ranges from 0 to 7.)

Now, suppose (current) userspace wants to change the allocated
bandwidth.  It only understands the "MB" line, but it is reasonable to
expect that writing the other lines back without modification will do
nothing.  (A human user might read the file, and tweak it through an
editor to modify just the entry of interest, run it through awk, etc.)

So, the user writes back:

    MB: 0=43
    MB_HW: 0=2
    MB_MIN: 0=1
    MB_MAX: 0=2

If resctrl just processes the entries in order, it will temporarily
change the bandwidth allocation due to the "MB" row, but will then
immediately change it back again due to the other rows.  Reading
schemata again now gives:

    MB: 0=29
    MB_HW: 0=2
    MB_MIN: 0=1
    MB_MAX: 0=2

We might be able to solve some problems of this sort by reordering the
entries, but I suspect that some software may give up as soon as it
sees an unfamiliar entry -- so it may be better to keep the classic
entries (like "MB") at the start.

Anyway, going back to the "#" convention:

If the initial read of schemata has the new entries "pre-commented",
then userspace wouldn't need to know about the new entries.  It could
just tweak the MB entry (which it knows about), and write the file
back:

    MB: 0=43
    # MB_HW: 0=2
    # MB_MIN: 0=1
    # MB_MAX: 0=2

then resctrl knows to ignore the hashed lines, and so reading the file
back gives:

    MB: 0=43
    # MB_HW: 0=3
    # MB_MIN: 0=2
    # MB_MAX: 0=3

(For hardware-specific reasons, the MPAM driver currently internally
programs the MIN bound to be a bit less than the MAX bound, when
userspace writes an "MB" entry into schemata.  The key thing is that
writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
resctrl level, I don't think that we necessarily need to make promises
about what they can change _to_.  The exact effect of MIN and MAX
bounds is likely to be hardware-dependent anyway.)

Regarding new userspace:

Going forward, we can explicitly document that there should be no
conflicting or "passenger" entries in a schemata write: don't include
an entry for something that you don't explicitly want to set, and if
multiple entries affect the same resource, we don't promise what
happens.  (But sadly, we can't impose that rule on existing software
after the fact.)
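
A toy model of the write-side filtering described above (the behaviour
is printed rather than applied; this is a stand-in, not the real
resctrl parser):

    #include <stdio.h>
    #include <string.h>

    /* Apply only un-commented entries; '#'-prefixed lines are
     * inactive, so old userspace can paste them back without effect. */
    static void schemata_write(char *buf)
    {
            char *line = strtok(buf, "\n");

            while (line) {
                    if (line[0] == '#')
                            printf("ignored (inactive): %s\n", line);
                    else
                            printf("would apply:        %s\n", line);
                    line = strtok(NULL, "\n");
            }
    }

    int main(void)
    {
            char buf[] = "MB: 0=43\n"
                         "# MB_HW: 0=2\n"
                         "# MB_MIN: 0=1\n"
                         "# MB_MAX: 0=2";

            schemata_write(buf);
            return 0;
    }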
One final note: I have not provided any way to indicate that all those
entries control the same hardware resource.  The common "MB" prefix is
intended as a clue, but ultimately, userspace needs to know what an
entry controls before tweaking it.  We could try to describe the
relationships explicitly, but I'm not sure that it is useful...

> >> Then use some other tag naming for other regions.  Remote DDR,
> >> local CXL, remote CXL are the ones we think are next in the h/w
> >> memory sequence.  But the "region" concept would allow for other
> >> options as other memory technologies come into use.

[...]

> > My thinking is that we avoid adding complexity in the schemata file
> > if we treat mapping these schema instances onto the hardware
> > topology as an orthogonal problem.  So long as we have unique names
> > in the schemata file, we can describe elsewhere what they relate to
> > in the hardware.
>
> Agreed ... and "elsewhere" is expected to be unique depending on the
> resource.
>
> Reinette

Yes

Cheers
---Dave
On 9/26/2025 6:58 AM, Luck, Tony wrote:
> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:

[snip]

> But a newer app that is aware of the extensions can write:
>
>     # cat > schemata << 'EOF'
>     MB_HW: 0=10
>     MB_MIN: 0=10
>     MB_MAX: 0=64
>     EOF
>
> which then reads back as:
>
>     MB: 0=4;1=75
>     #MB_HW: 0=10;1=191
>     #MB_MIN: 0=10;1=191
>     #MB_MAX: 0=64;1=191
>
> with the legacy line updated with the rounded value of the MB_HW
> supplied by the user.  10/255 = 3.921% ... so call it "4".

This seems to be applicable as it introduces the new interface while
preserving forward compatibility.

One minor question is that, according to "Figure 6-5. MBA Optimal
Bandwidth Register" in the latest RDT specification, the maximum value
ranges from 1 to 511.  Additionally, this bandwidth field is located at
bits 48 to 56 in the MBA Optimal Bandwidth Register, and the range for
this segment could be 1 to 8191.  Just wonder if it would be possible
that the current maximum value of 512 may be extended in the future?
Perhaps we could explore a method to query the maximum upper limit from
the ACPI table or register, or use CPUID to distinguish between
platforms rather than hardcoding it.  Reinette also mentioned this in
another thread.

Thanks,
Chenyu

[1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html
Hi there,

On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote:
> On 9/26/2025 6:58 AM, Luck, Tony wrote:

[...]

> One minor question is that, according to "Figure 6-5. MBA Optimal
> Bandwidth Register" in the latest RDT specification, the maximum
> value ranges from 1 to 511.  Additionally, this bandwidth field is
> located at bits 48 to 56 in the MBA Optimal Bandwidth Register, and
> the range for this segment could be 1 to 8191.  Just wonder if it
> would be possible that the current maximum value of 512 may be
> extended in the future?  Perhaps we could explore a method to query
> the maximum upper limit from the ACPI table or register, or use CPUID
> to distinguish between platforms rather than hardcoding it.  Reinette
> also mentioned this in another thread.

I can't comment on the direction of travel in the RDT architecture.

I guess it would be up to the arch code whether to trust ACPI if it
says that the maximum value of this field is > 511.  (> 65535 would be
impossible though, since the fields would start to overlap each
other...)

Would anything break in the interface proposed here, if the maximum
value is larger than 511?  (I'm hoping not.  For MPAM, the bandwidth
controls can have up to 16 bits and the size can be probed through
MMIO registers.)

I don't think we've seen MPAM hardware that comes close to 16 bits for
now, though.

Cheers
---Dave
On 9/29/2025 10:13 PM, Dave Martin wrote:
> Hi there,
>
> On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote:

[...]

> I guess it would be up to the arch code whether to trust ACPI if it
> says that the maximum value of this field is > 511.  (> 65535 would be
> impossible though, since the fields would start to overlap each
> other...)
>
> Would anything break in the interface proposed here, if the maximum
> value is larger than 511?  (I'm hoping not.  For MPAM, the bandwidth
> controls can have up to 16 bits and the size can be probed through
> MMIO registers.)

I overlooked this bit width.  It should not exceed 511 according to the
RDT spec.  Previously, I was just wondering how to calculate the legacy
MB percentage in Tony's example.  If we want to keep consistency: if
the user provides a value of 10, what is the denominator?  Is it 255,
511, or something queried from ACPI?

    MB: 0=4;1=75          <--- 10/255
    #MB_HW: 0=10;1=191
    #MB_MIN: 0=10;1=191
    #MB_MAX: 0=64;1=191

or

    MB: 0=1;1=75          <--- 10/511
    #MB_HW: 0=10;1=191
    #MB_MIN: 0=10;1=191
    #MB_MAX: 0=64;1=191

thanks,
Chenyu
Hi,

On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote:
> On 9/29/2025 10:13 PM, Dave Martin wrote:

[...]

> I overlooked this bit width.  It should not exceed 511 according to
> the RDT spec.  Previously, I was just wondering how to calculate the
> legacy MB percentage in Tony's example.  If we want to keep
> consistency: if the user provides a value of 10, what is the
> denominator?  Is it 255, 511, or something queried from ACPI?

The denominator (the "scale" parameter in my model, though the name is
unimportant) should be whatever quantity of resource is specified in
the "unit" parameter.

For "percentage" type controls, I'd expect the unit to be 100% ("100pc"
in my syntax).

So, Tony's suggestion looks plausible to me [1]:

| Yes.  255 (or whatever "Q" value is provided in the ACPI table)
| corresponds to no throttling, so 100% bandwidth.

So, if ACPI says Q=387, that's the denominator we advertise.

Does that sound right?

Question: is this a global parameter, or per-CPU?

From the v1.2 RDT spec, it looks like it is a single, global parameter.
I hope this is true (!)  But I'm not too familiar with these specs...

Cheers
---Dave

[1] https://lore.kernel.org/lkml/aNq11fmlac6dH4pH@agluck-desk3/
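
To make the hypothetical Q=387 concrete: the info files would then
advertise scale = 387 and unit = 100pc, so a schemata value of s = 129
would request 129/387, i.e. exactly one third, of the unthrottled
bandwidth (the value of s is invented for the example).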
On 9/30/2025 11:55 PM, Dave Martin wrote:
> Hi,
>
> On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote:

[...]

> The denominator (the "scale" parameter in my model, though the name is
> unimportant) should be whatever quantity of resource is specified in
> the "unit" parameter.

[...]

> So, if ACPI says Q=387, that's the denominator we advertise.
>
> Does that sound right?

Yes, it makes sense, the denominator is the "scale" in your example.

> Question: is this a global parameter, or per-CPU?

It should be a global setting for all the MBA Register Blocks.

Thanks,
Chenyu
Hi there,

On Wed, Oct 01, 2025 at 08:13:45PM +0800, Chen, Yu C wrote:
> On 9/30/2025 11:55 PM, Dave Martin wrote:

[...]

> Yes, it makes sense, the denominator is the "scale" in your example.

Thanks for confirming that.

> > Question: is this a global parameter, or per-CPU?
>
> It should be a global setting for all the MBA Register Blocks.

That's good -- since resctrl resource controls are not per-CPU,
exposing the exact hardware resolution won't work unless the value
is scaled identically for all CPUs.

Cheers
---Dave
> > > So, if ACPI says Q=387, that's the denominator we advertise.
> > >
> > > Does that sound right?
> >
> > Yes, it makes sense, the denominator is the "scale" in your example.
>
> Thanks for confirming that.
>
> > > Question: is this a global parameter, or per-CPU?
> >
> > It should be a global setting for all the MBA Register Blocks.
>
> That's good -- since resctrl resource controls are not per-CPU,
> exposing the exact hardware resolution won't work unless the value
> is scaled identically for all CPUs.

The RDT architecture spec says there is a separate MARC table that
describes each instance on an L3 cache.  So in theory there could be
different "Q" values for each.

I'm chatting with the architects to point out that would be bad, and
they shouldn't build something that has different "Q" values on the
same system.

-Tony
> > This seems to be applicable as it introduces the new interface > > while preserving forward compatibility. > > > > One minor question is that, according to "Figure 6-5. MBA Optimal > > Bandwidth Register" in the latest RDT specification, the maximum > > value ranges from 1 to 511. > > Additionally, this bandwidth field is located at bits 48 to 56 in > > the MBA Optimal Bandwidth Register, and the range for > > this segment could be 1 to 8191. Just wonder if it would be 48..56 is still 9 bits, so max value is 511. > > possible that the current maximum value of 512 may be extended > > in the future? Perhaps we could explore a method to query the maximum upper > > limit from the ACPI table or register, or use CPUID to distinguish between > > platforms rather than hardcoding it. Reinette also mentioned this in another > > thread. I think 511 was chosen as "bigger than we expect to ever need" and 9-bits allocated in the registers based on that. Initial implementation may use 255 as the maximum - though I'm pushing on that a bit as the throttle graph at the early stage is fairly linear from "1" to some value < 255, when bandwidth hits maximum, then flat up to 255. If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI table should be the value where peak bandwidth is hit (though this is complicated because workloads with different mixes of read/write access have different throttle graphs). > > > > Thanks, > > Chenyu > > > > > > [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html > > I can't comment on the direction of travel in the RDT architecture. > > I guess it would be up to the arch code whether to trust ACPI if it > says that the maximum value of this field is > 511. (> 65535 would be > impossible though, since the fields would start to overlap each > other...) resctrl should do some sanity checks on values it sees in the ACPI tables. Linux has:

#define FW_BUG "[Firmware Bug]: "
#define FW_WARN "[Firmware Warn]: "
#define FW_INFO "[Firmware Info]: "

for good historical reasons. > > Would anything break in the interface proposed here, if the maximum > value is larger than 511? (I'm hoping not. For MPAM, the bandwidth > controls can have up to 16 bits and the size can be probed through MMIO > registers.) > > I don't think we've seen MPAM hardware that comes close to 16 bits for > now, though. While kernel code is sometimes space-conserving and uses u8/u16 types for values that fit in some limited range, I'd expect user applications that read the "info" files and program the "schemata" files to not care. Python integers have arbitrary precision, so would be just fine with: max 340282366920938463463374607431768211455 :-) -Tony
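The sanity-checking Tony mentions could look something like the hedged sketch below. The 511 ceiling comes from the 9-bit field discussed above; the function name, the parsing context, and the fall-back policy are all hypothetical.

#include <linux/printk.h>
#include <linux/types.h>

#define MBA_Q_MAX	511	/* 9-bit field, per the discussion above */

/* Hypothetical: validate a "Q" value read from the ACPI table. */
static u32 mba_validate_q(u32 q)
{
	if (!q || q > MBA_Q_MAX) {
		pr_warn(FW_BUG "MBA Q value %u out of range, capping to %u\n",
			q, MBA_Q_MAX);
		return MBA_Q_MAX;
	}
	return q;
}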
On 9/30/2025 12:23 AM, Luck, Tony wrote: >>> This seems to be applicable as it introduces the new interface >>> while preserving forward compatibility. >>> >>> One minor question is that, according to "Figure 6-5. MBA Optimal >>> Bandwidth Register" in the latest RDT specification, the maximum >>> value ranges from 1 to 511. >>> Additionally, this bandwidth field is located at bits 48 to 56 in >>> the MBA Optimal Bandwidth Register, and the range for >>> this segment could be 1 to 8191. Just wonder if it would be > > 48..56 is still 9 bits, so max value is 511. > Ah I see, I overlooked this. >>> possible that the current maximum value of 512 may be extended >>> in the future? Perhaps we could explore a method to query the maximum upper >>> limit from the ACPI table or register, or use CPUID to distinguish between >>> platforms rather than hardcoding it. Reinette also mentioned this in another >>> thread. > > I think 511 was chosen as "bigger than we expect to ever need" and 9-bits > allocated in the registers based on that. > OK, got it. > Initial implementation may use 255 as the maximum - though I'm pushing on > that a bit as the throttle graph at the early stage is fairly linear from "1" to some > value < 255, > when bandwidth hits maximum, then flat up to 255. > If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI > table should be the value where peak bandwidth is hit I see. If I understand correctly, the BIOS needs to pre-train the system to find this Q. However, if the BIOS cannot provide this Q, would it be feasible for the user to provide it? For example, the user could saturate the memory bandwidth, gradually increase MB_MAX, and finally find the Q_max where the memory bandwidth no longer increases. The user could then adjust the max field in the info file. > (though this is complicated > because workloads with different mixes of read/write access have different > throttle graphs). > Does this mean read and write operations have different Q values to saturate the memory bandwidth? For example, if the workload is all reads, there is a Q_r; if the workload is all writes, there is another Q_w. In that case, maybe we could choose the maximum of Q_r and Q_w (max(Q_r, Q_w)). thanks, Chenyu
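As a rough illustration of the user-space probing Chen Yu describes -- sweep the control upward under a saturating load and look for the point where measured bandwidth stops growing -- something like the sketch below could be used. It is shown against the existing MB schema for the root group; the domain id (0), the monitoring path, the sampling interval, and the ~1% plateau threshold are all arbitrary assumptions, and the same idea would apply to a proposed MB_MAX schema.

#include <stdio.h>
#include <unistd.h>

static unsigned long long read_mbm(const char *path)
{
	unsigned long long bytes = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%llu", &bytes) != 1)
			bytes = 0;
		fclose(f);
	}
	return bytes;
}

int main(void)
{
	const char *mbm = "/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes";
	unsigned long long prev = 0;
	int mb;

	for (mb = 10; mb <= 100; mb += 10) {
		unsigned long long b0, b1, delta;
		FILE *f = fopen("/sys/fs/resctrl/schemata", "w");

		if (!f)
			return 1;
		fprintf(f, "MB:0=%d\n", mb);
		fclose(f);

		b0 = read_mbm(mbm);
		sleep(1);		/* measure bandwidth over one second */
		b1 = read_mbm(mbm);
		delta = b1 - b0;

		/* Plateau: bandwidth stopped growing within ~1%. */
		if (prev && delta <= prev + prev / 100) {
			printf("bandwidth saturates around MB=%d\n", mb);
			break;
		}
		prev = delta;
	}
	return 0;
}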
> >>> This seems to be applicable as it introduces the new interface > >>> while preserving forward compatibility. > >>> > >>> One minor question is that, according to "Figure 6-5. MBA Optimal > >>> Bandwidth Register" in the latest RDT specification, the maximum > >>> value ranges from 1 to 511. > >>> Additionally, this bandwidth field is located at bits 48 to 56 in > >>> the MBA Optimal Bandwidth Register, and the range for > >>> this segment could be 1 to 8191. Just wonder if it would be > > > > 48..56 is still 9 bits, so max value is 511. > > > > Ah I see, I overlooked this. > > >>> possible that the current maximum value of 512 may be extended > >>> in the future? Perhaps we could explore a method to query the maximum upper > >>> limit from the ACPI table or register, or use CPUID to distinguish between > >>> platforms rather than hardcoding it. Reinette also mentioned this in another > >>> thread. > > > > I think 511 was chosen as "bigger than we expect to ever need" and 9-bits > > allocated in the registers based on that. > > > > OK, got it. > > > Initial implementation may use 255 as the maximum - though I'm pushing on > > that a bit as the throttle graph at the early stage is fairly linear from "1" to some > > value < 255, > > when bandwidth hits maximum, then flat up to 255. > > If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI > > table should be the value where peak bandwidth is hit > > I see. If I understand correctly, the BIOS needs to pre-train the system to > find this Q. However, if the BIOS cannot provide this Q, would it be > feasible > for the user to provide it? For example, the user could saturate the memory > bandwidth, gradually increase MB_MAX, and finally find the Q_max where the > memory bandwidth no longer increases. The user could then adjust the max > field in the info file. > > > (though this is complicated > > because workloads with different mixes of read/write access have different > > throttle graphs). > > > > Does this mean read and write operations have different Q values to saturate > the memory bandwidth? For example, if the workload is all reads, there > is a Q_r; > if the workload is all writes, there is another Q_w. In that case, maybe we > could choose the maximum of Q_r and Q_w (max(Q_r, Q_w)). If the BIOS doesn't provide a good enough number, then users might well do some tuning based on the workloads they plan to run and ignore the value in the info file in favor of one tuned specifically for their workloads. But it is too early to start guessing at workarounds for problems that may not even exist. -Tony
Hi Dave, nits: Please use the subject prefix "x86,fs/resctrl" to be consistent with other resctrl code (and was established by Arm :)). Also please use upper case for acronym mba->MBA. On 9/2/25 9:24 AM, Dave Martin wrote: > The control value parser for the MB resource currently coerces the > memory bandwidth percentage value from userspace to be an exact > multiple of the bw_gran parameter. (to help be specific) "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > On MPAM systems, this results in somewhat worse-than-worst-case > rounding, since bw_gran is in general only an approximation to the > actual hardware granularity on these systems, and the hardware > bandwidth allocation control value is not natively a percentage -- > necessitating a further conversion in the resctrl_arch_update_domains() > path, regardless of the conversion done at parse time. > > Allow the arch to provide its own parse-time conversion that is > appropriate for the hardware, and move the existing conversion to x86. > This will avoid accumulated error from rounding the value twice on MPAM > systems. > > Clarify the documentation, but avoid overly exact promises. > > Clamping to bw_min and bw_max still feels generic: leave it in the core > code, for now. Sounds like MPAM may be ready to start the schema parsing discussion again? I understand that MPAM has a few more ways to describe memory bandwidth as well as cache portion partitioning. Previously ([1] [2]) James mused about exposing schema format to user space, which seems like a good idea for new schema. Is this something MPAM is still considering? For example, the minimum and maximum ranges that can be specified, is this something you already have some ideas for? Have you perhaps considered Tony's RFD [3] that includes discussion on how to handle min/max ranges for bandwidth? > > No functional change. > > Signed-off-by: Dave Martin <Dave.Martin@arm.com> > > --- > > Based on v6.17-rc3. > > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ > the other tests except for the NONCONT_CAT tests, which do not seem to > be supported in my configuration -- and have nothing to do with the > code touched by this patch). Is the NONCONT_CAT test failing (i.e printing "not ok")? The NONCONT_CAT tests may print error messages as debug information as part of running, but these errors are expected as part of the test. The test should accurately state whether it passed or failed though. For example, below attempts to write a non-contiguous CBM to a system that does not support non-contiguous masks. This fails as expected, error messages printed as debugging and thus the test passes with an "ok". # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected ok 5 L3_NONCONT_CAT: test > > Notes: > > I put the x86 version out of line in order to avoid having to move > struct rdt_resource and its dependencies into resctrl_types.h -- which > would create a lot of diff noise. Schemata writes from userspace have > a high overhead in any case. Sounds good, I expect compiler will inline. > > For MPAM the conversion will be a no-op, because the incoming > percentage from the core resctrl code needs to be converted to hardware > representation in the driver anyway. (addressed below) > > Perhaps _all_ the types should move to resctrl_types.h. Can surely consider when there is a good motivation. > > For now, I went for the smallest diffstat... 
> > --- > Documentation/filesystems/resctrl.rst | 7 +++---- > arch/x86/include/asm/resctrl.h | 2 ++ > arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++ > fs/resctrl/ctrlmondata.c | 2 +- > include/linux/resctrl.h | 6 ++++++ > 5 files changed, 18 insertions(+), 5 deletions(-) > > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > index c7949dd44f2f..a1d0469d6dfb 100644 > --- a/Documentation/filesystems/resctrl.rst > +++ b/Documentation/filesystems/resctrl.rst > @@ -143,12 +143,11 @@ with respect to allocation: > user can request. > > "bandwidth_gran": > - The granularity in which the memory bandwidth > + The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated > b/w percentage is rounded off to the next > - control step available on the hardware. The > - available bandwidth control steps are: > - min_bandwidth + N * bandwidth_gran. > + control step available on the hardware. The available > + steps are at least as small as this value. A bit difficult to parse for me. Is "at least as small as" same as "at least"? Please note that the documentation has a section "Memory bandwidth Allocation and monitoring" that also contains these exact promises. > > "delay_linear": > Indicates if the delay scale is linear or > diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h > index feb93b50e990..8bec2b9cc503 100644 > --- a/arch/x86/include/asm/resctrl.h > +++ b/arch/x86/include/asm/resctrl.h > @@ -18,6 +18,8 @@ > */ > #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0) > > +struct rdt_resource; > + I'm missing something here. Why is this needed? > /** > * struct resctrl_pqr_state - State cache for the PQR MSR > * @cur_rmid: The cached Resource Monitoring ID > diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c > index 1189c0df4ad7..cf9b30b5df3c 100644 > --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c > +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c > @@ -16,9 +16,15 @@ > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > #include <linux/cpu.h> > +#include <linux/math.h> > > #include "internal.h" > > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r) > +{ > + return roundup(val, (unsigned long)r->membw.bw_gran); > +} > + > int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d, > u32 closid, enum resctrl_conf_type t, u32 cfg_val) > { > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c > index d98e0d2de09f..c5e73b75aaa0 100644 > --- a/fs/resctrl/ctrlmondata.c > +++ b/fs/resctrl/ctrlmondata.c > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) > return false; > } > > - *data = roundup(bw, (unsigned long)r->membw.bw_gran); > + *data = resctrl_arch_round_bw(bw, r); Please check that function comments remain accurate after changes (specifically if making the conversion more generic as proposed below). 
> return true; > } > > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h > index 6fb4894b8cfd..5b2a555cf2dd 100644 > --- a/include/linux/resctrl.h > +++ b/include/linux/resctrl.h > @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, > bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); > int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); > > +/* > + * Round a bandwidth control value to the nearest value acceptable to > + * the arch code for resource r: > + */ > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); > + I do not think that resctrl should make any assumptions on what the architecture's conversion does (i.e "round"). That architecture needs to be asked to "round a bandwidth control value" also sounds strange since resctrl really should be able to do something like rounding itself. As I understand from the notes this will be a no-op for MPAM making this even more confusing. How about naming the helper something like resctrl_arch_convert_bw()? (Open to other ideas of course). If you make such a change, please check that subject of patch still fits. I think that using const to pass data to architecture is great, thanks. Reinette [1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/ [2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/ [3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
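To make the division of labour in the patch concrete: after this change the parse path keeps the generic range handling in fs/resctrl and delegates only the final conversion to the architecture. A paraphrased sketch, not the literal kernel code (string parsing and error reporting are elided, and the min/max field names are approximate, following the commit message's bw_min/bw_max):

/* Sketch only: the real bw_validate() also parses the string and logs errors. */
static bool bw_validate_sketch(u32 bw, u32 *data, struct rdt_resource *r)
{
	/* Range handling against bw_min/bw_max stays in the core code... */
	if (bw < r->membw.min_bw || bw > r->membw.max_bw)
		return false;

	/* ...and only the final conversion is delegated to the arch. */
	*data = resctrl_arch_round_bw(bw, r);
	return true;
}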
Hi Reinette, Thanks for the review. On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > Hi Dave, > > nits: > Please use the subject prefix "x86,fs/resctrl" to be consistent with other > resctrl code (and was established by Arm :)). > Also please use upper case for acronym mba->MBA. Ack (the local custom in the MPAM code is to use "mba", but arguably, the meaning is not quite the same -- I'll change it.) > On 9/2/25 9:24 AM, Dave Martin wrote: > > The control value parser for the MB resource currently coerces the > > memory bandwidth percentage value from userspace to be an exact > > multiple of the bw_gran parameter. > > (to help be specific) > "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? "bw_gran" was intended as an informal shorthand for the abstract parameter (exposed both in the field you mention and through the bandwidth_gran file in resctrl). I can rewrite it as per your suggestion, but this could be read as excluding the bandwidth_gran file. Would it make sense just to write it out longhand? For now, I've rewritten it as follows: | The control value parser for the MB resource currently coerces the | memory bandwidth percentage value from userspace to be an exact | multiple of the bandwidth granularity parameter. | | On MPAM systems, this results in somewhat worse-than-worst-case | rounding, since the bandwidth granularity advertised to resctrl by the | MPAM driver is in general only an approximation [...] (I'm happy to go with your suggestion if you're not keen on this, though.) > > On MPAM systems, this results in somewhat worse-than-worst-case > > rounding, since bw_gran is in general only an approximation to the > > actual hardware granularity on these systems, and the hardware > > bandwidth allocation control value is not natively a percentage -- > > necessitating a further conversion in the resctrl_arch_update_domains() > > path, regardless of the conversion done at parse time. > > > > Allow the arch to provide its own parse-time conversion that is > > appropriate for the hardware, and move the existing conversion to x86. > > This will avoid accumulated error from rounding the value twice on MPAM > > systems. > > > > Clarify the documentation, but avoid overly exact promises. > > > > Clamping to bw_min and bw_max still feels generic: leave it in the core > > code, for now. > > Sounds like MPAM may be ready to start the schema parsing discussion again? > I understand that MPAM has a few more ways to describe memory bandwidth as > well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > schema format to user space, which seems like a good idea for new schema. My own ideas in this area are a little different, though I agree with the general idea. I'll respond separately on that, to avoid this thread getting off-topic. For this patch, my aim was to avoid changing anything unnecessarily. > Is this something MPAM is still considering? For example, the minimum > and maximum ranges that can be specified, is this something you already > have some ideas for? Have you perhaps considered Tony's RFD [3] that includes > discussion on how to handle min/max ranges for bandwidth? This is another thing that we probably do want to support at some point, but it feels like a different thing from the minimum and maximum bounds acceptable to an individual schema -- especially since in the hardware they may behave more like trigger points than hard limits. Again, I'll respond separately. [...]
> > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ > > the other tests except for the NONCONT_CAT tests, which do not seem to > > be supported in my configuration -- and have nothing to do with the > > code touched by this patch). > > Is the NONCONT_CAT test failing (i.e printing "not ok")? > > The NONCONT_CAT tests may print error messages as debug information as part of > running, but these errors are expected as part of the test. The test should accurately > state whether it passed or failed though. For example, below attempts to write > a non-contiguous CBM to a system that does not support non-contiguous masks. > This fails as expected, error messages printed as debugging and thus the test passes > with an "ok". > > # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument > # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected > ok 5 L3_NONCONT_CAT: test I don't think that this was anything to do with my changes, but I no longer seem to have the test output. (Since this test has to do with bitmap schemata (?), it seemed unlikely to be affected by changes to bw_validate().) I'll need to re-test with and without this patch to check whether it makes any difference. > > Notes: > > > > I put the x86 version out of line in order to avoid having to move > > struct rdt_resource and its dependencies into resctrl_types.h -- which > > would create a lot of diff noise. Schemata writes from userspace have > > a high overhead in any case. > > Sounds good, I expect compiler will inline. The function and caller are in separate translation units, so unless LTO is used, I don't think the function will be inlined. > > > > For MPAM the conversion will be a no-op, because the incoming > > percentage from the core resctrl code needs to be converted to hardware > > representation in the driver anyway. > > (addressed below) > > > > > Perhaps _all_ the types should move to resctrl_types.h. > > Can surely consider when there is a good motivation. > > > > For now, I went for the smallest diffstat... I'll assume the motivation is not strong enough for now, but shout if you disagree. [...] > > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > > index c7949dd44f2f..a1d0469d6dfb 100644 > > --- a/Documentation/filesystems/resctrl.rst > > +++ b/Documentation/filesystems/resctrl.rst > > @@ -143,12 +143,11 @@ with respect to allocation: > > user can request. > > > > "bandwidth_gran": > > - The granularity in which the memory bandwidth > > + The approximate granularity in which the memory bandwidth > > percentage is allocated. The allocated > > b/w percentage is rounded off to the next > > - control step available on the hardware. The > > - available bandwidth control steps are: > > - min_bandwidth + N * bandwidth_gran. > > + control step available on the hardware. The available > > + steps are at least as small as this value. > > A bit difficult to parse for me. > Is "at least as small as" same as "at least"? It was supposed to mean: "The available steps are no larger than this value." Formally, my expectation is that this value is the smallest integer number of percent which is not smaller than the apparent size of any individual rounding step. Equivalently, this is the smallest number g for which writing "MB: 0=x" and "MB: 0=y" yield different configurations for every in-range x and where y = x + g and y is also in-range. That's a bit of a mouthful, though.
If you can think of a more succinct way of putting it, I'm open to suggestions! > Please note that the documentation has a section "Memory bandwidth Allocation > and monitoring" that also contains these exact promises. Hmmm, somehow I completely missed that. Does the following make sense? Ideally, there would be a simpler way to describe the discrepancy between the reported and actual values of bw_gran... | Memory bandwidth Allocation and monitoring | ========================================== | | [...] | | The minimum bandwidth percentage value for each cpu model is predefined | and can be looked up through "info/MB/min_bandwidth". The bandwidth | granularity that is allocated is also dependent on the cpu model and can | be looked up at "info/MB/bandwidth_gran". The available bandwidth | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded | -to the next control step available on the hardware. | +control steps are: min_bw + N * (bw_gran - e), where e is a | +non-negative, hardware-defined real constant that is less than 1. | +Intermediate values are rounded to the next control step available on | +the hardware. | + | +At the time of writing, the constant e referred to in the preceding | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran | +describes the step size exactly), but this may not be the case on other | +hardware when the actual granularity is not an exact divisor of 100. > > > > > "delay_linear": > > Indicates if the delay scale is linear or > > diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h > > index feb93b50e990..8bec2b9cc503 100644 > > --- a/arch/x86/include/asm/resctrl.h > > +++ b/arch/x86/include/asm/resctrl.h > > @@ -18,6 +18,8 @@ > > */ > > #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0) > > > > +struct rdt_resource; > > + > > I'm missing something here. Why is this needed? Oops, it's not. This got left behind from when I had the function in-line here. Removed. [...] > > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c > > index d98e0d2de09f..c5e73b75aaa0 100644 > > --- a/fs/resctrl/ctrlmondata.c > > +++ b/fs/resctrl/ctrlmondata.c > > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) > > return false; > > } > > > > - *data = roundup(bw, (unsigned long)r->membw.bw_gran); > > + *data = resctrl_arch_round_bw(bw, r); > > Please check that function comments remain accurate after changes (specifically > if making the conversion more generic as proposed below). I hoped that the comment for this function was still applicable, though it can probably be improved. How about the following? | - * hardware. The allocated bandwidth percentage is rounded to the next | - * control step available on the hardware. | + * hardware. The allocated bandwidth percentage is converted as | + * appropriate for consumption by the specific hardware driver. [...] 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h > > index 6fb4894b8cfd..5b2a555cf2dd 100644 > > --- a/include/linux/resctrl.h > > +++ b/include/linux/resctrl.h > > @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, > > bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); > > int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); > > > > +/* > > + * Round a bandwidth control value to the nearest value acceptable to > > + * the arch code for resource r: > > + */ > > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); > > + > > I do not think that resctrl should make any assumptions on what the > architecture's conversion does (i.e "round"). That architecture needs to be > asked to "round a bandwidth control value" also sounds strange since resctrl really > should be able to do something like rounding itself. As I understand from > the notes this will be a no-op for MPAM making this even more confusing. > > How about naming the helper something like resctrl_arch_convert_bw()? > (Open to other ideas of course). > > If you make such a change, please check that subject of patch still fits. I struggled a bit with the name. Really, this is converting the value to an intermediate form (which might or might not involve rounding). For historical reasons, this is a value suitable for writing directly to the relevant x86 MSR without any further interpretation. For MPAM, it is convenient to do this conversion later, rather than during parsing of the value. Would a name like resctrl_arch_preconvert_bw() be acceptable? This isn't more informative than your suggestion regarding what the conversion is expected to do, but may convey the expectation that the output value may still not be in its final (i.e., hardware) form. > I think that using const to pass data to architecture is great, thanks. > > Reinette I try to constify by default when straightforward to do so, since the compiler can then find which cases need to change; the reverse direction is harder to automate... Cheers ---Dave [...] > [1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/ > [2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/ > [3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
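Given the notes above, the MPAM side of the hook -- under the resctrl_arch_preconvert_bw() name floated here -- would presumably reduce to a pass-through. A sketch under that assumption:

/*
 * Hedged sketch of a possible MPAM implementation: the percentage is
 * passed through unchanged at parse time, and the conversion to the
 * hardware control value happens later, in the
 * resctrl_arch_update_domains() path.
 */
u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r)
{
	return val;
}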
Hi Dave, On 9/22/25 7:39 AM, Dave Martin wrote: > Hi Reinette, > > Thanks for the review. > > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >> Hi Dave, >> >> nits: >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other >> resctrl code (and was established by Arm :)). >> Also please use upper case for acronym mba->MBA. > > Ack (the local custom in the MPAM code is to use "mba", but arguably, > the meaning is not quite the same -- I'll change it.) I am curious what the motivation is for the custom? Knowing this will help me to keep things consistent when the two worlds meet. > >> On 9/2/25 9:24 AM, Dave Martin wrote: >>> The control value parser for the MB resource currently coerces the >>> memory bandwidth percentage value from userspace to be an exact >>> multiple of the bw_gran parameter. >> >> (to help be specific) >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > "bw_gran" was intended as an informal shorthand for the abstract > parameter (exposed both in the field you mention and through the > bandwidth_gran file in resctrl). I do not see a need for being abstract since the bandwidth_gran file exposes the field verbatim. > > I can rewrite it as per your suggestion, but this could be read as > excluding the bandwidth_gran file. Would it make sense just to write > it out longhand? For now, I've rewritten it as follows: Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran it is not clear to me how being specific excludes the bandwidth_gran file. > > | The control value parser for the MB resource currently coerces the > | memory bandwidth percentage value from userspace to be an exact > | multiple of the bandwidth granularity parameter. If you want to include the bandwidth_gran file then the above could be something like: The control value parser for the MB resource coerces the memory bandwidth percentage value from userspace to be an exact multiple of the bandwidth granularity parameter that is exposed by the bandwidth_gran resctrl file. I still think that replacing "the bandwidth granularity parameter" with "rdt_resource::resctrl_membw::bw_gran" will help to be more specific. > | > | On MPAM systems, this results in somewhat worse-than-worst-case > | rounding, since the bandwidth granularity advertised to resctrl by the > | MPAM driver is in general only an approximation [...] > > (I'm happy to go with your suggestion if you're not keen on this, > though.) >>> On MPAM systems, this results in somewhat worse-than-worst-case >>> rounding, since bw_gran is in general only an approximation to the >>> actual hardware granularity on these systems, and the hardware >>> bandwidth allocation control value is not natively a percentage -- >>> necessitating a further conversion in the resctrl_arch_update_domains() >>> path, regardless of the conversion done at parse time. >>> >>> Allow the arch to provide its own parse-time conversion that is >>> appropriate for the hardware, and move the existing conversion to x86. >>> This will avoid accumulated error from rounding the value twice on MPAM >>> systems. >>> >>> Clarify the documentation, but avoid overly exact promises. >>> >>> Clamping to bw_min and bw_max still feels generic: leave it in the core >>> code, for now. >> >> Sounds like MPAM may be ready to start the schema parsing discussion again? >> I understand that MPAM has a few more ways to describe memory bandwidth as >> well as cache portion partitioning. 
Previously ([1] [2]) James mused about exposing >> schema format to user space, which seems like a good idea for new schema. > > My own ideas in this area are a little different, though I agree with > the general idea. Should we expect a separate proposal from James? > > I'll respond separately on that, to avoid this thread getting off-topic. Much appreciated. > > For this patch, my aim was to avoid changing anything unnecessarily. Understood. More below as I try to understand the details but it does not really sound as though the current interface works that great for MPAM. If I understand correctly this patch enables MPAM to use the existing interface for its memory bandwidth allocations but doing so does not enable users to obtain the benefit of hardware capabilities. For that users would want to use the new interface? >>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ >>> the other tests except for the NONCONT_CAT tests, which do not seem to >>> be supported in my configuration -- and have nothing to do with the >>> code touched by this patch). >> >> Is the NONCONT_CAT test failing (i.e printing "not ok")? >> >> The NONCONT_CAT tests may print error messages as debug information as part of >> running, but these errors are expected as part of the test. The test should accurately >> state whether it passed or failed though. For example, below attempts to write >> a non-contiguous CBM to a system that does not support non-contiguous masks. >> This fails as expected, error messages printed as debugging and thus the test passes >> with an "ok". >> >> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument >> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected >> ok 5 L3_NONCONT_CAT: test > > I don't think that this was anything to do with my changes, but I no longer > seem to have the test output. (Since this test has to do with > bitmap schemata (?), it seemed unlikely to be affected by changes to > bw_validate().) I agree that this should not have anything to do with this patch. My concern is that I understood that the test failed for a feature that is not supported. If this is the case then there may be a problem with the test. The test should not fail if the feature is not supported but instead skip the test. > >>> Notes: > >>> > >>> I put the x86 version out of line in order to avoid having to move > >>> struct rdt_resource and its dependencies into resctrl_types.h -- which > >>> would create a lot of diff noise. Schemata writes from userspace have > >>> a high overhead in any case. > >> > >> Sounds good, I expect compiler will inline. > > > > The function and caller are in separate translation units, so unless > > LTO is used, I don't think the function will be inlined. Thanks, yes, indeed. > > > >>> > >>> For MPAM the conversion will be a no-op, because the incoming > >>> percentage from the core resctrl code needs to be converted to hardware > >>> representation in the driver anyway. > >> > >> (addressed below) > >> > >>> > >>> Perhaps _all_ the types should move to resctrl_types.h. > >> > >> Can surely consider when there is a good motivation. > >> > >>> > >>> For now, I went for the smallest diffstat... > > > > I'll assume the motivation is not strong enough for now, but shout if > > you disagree. I agree. > > > > [...]
> >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst >>> index c7949dd44f2f..a1d0469d6dfb 100644 >>> --- a/Documentation/filesystems/resctrl.rst >>> +++ b/Documentation/filesystems/resctrl.rst >>> @@ -143,12 +143,11 @@ with respect to allocation: >>> user can request. >>> >>> "bandwidth_gran": >>> - The granularity in which the memory bandwidth >>> + The approximate granularity in which the memory bandwidth >>> percentage is allocated. The allocated >>> b/w percentage is rounded off to the next >>> - control step available on the hardware. The >>> - available bandwidth control steps are: >>> - min_bandwidth + N * bandwidth_gran. >>> + control step available on the hardware. The available >>> + steps are at least as small as this value. >> >> A bit difficult to parse for me. >> Is "at least as small as" same as "at least"? > > It was supposed to mean: "The available steps are no larger than this > value." This is clear to me, especially when compared with the planned addition to "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting the paragraph below (more below). > > Formally, my expectation is that this value is the smallest integer > number of percent which is not smaller than the apparent size of any > individual rounding step. Equivalently, this is the smallest number g Considering the two statements: - "The available steps are no larger than this value." - "this value ... is not smaller than the apparent size of any individual rounding step" The "not larger" and "not smaller" sounds like all these words just end up saying that this is the step size? > for which writing "MB: 0=x" and "MB: 0=y" yield different > configurations for every in-range x and where y = x + g and y is also > in-range. > > That's a bit of a mouthful, though. If you can think of a more > succinct way of putting it, I'm open to suggestions! > >> Please note that the documentation has a section "Memory bandwidth Allocation >> and monitoring" that also contains these exact promises. > > Hmmm, somehow I completely missed that. > > Does the following make sense? Ideally, there would be a simpler way > to describe the discrepancy between the reported and actual values of > bw_gran... > > | Memory bandwidth Allocation and monitoring > | ========================================== > | > | [...] > | > | The minimum bandwidth percentage value for each cpu model is predefined > | and can be looked up through "info/MB/min_bandwidth". The bandwidth > | granularity that is allocated is also dependent on the cpu model and can > | be looked up at "info/MB/bandwidth_gran". The available bandwidth > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded > | -to the next control step available on the hardware. > | +control steps are: min_bw + N * (bw_gran - e), where e is a > | +non-negative, hardware-defined real constant that is less than 1. > | +Intermediate values are rounded to the next control step available on > | +the hardware. > | + > | +At the time of writing, the constant e referred to in the preceding > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran > | +describes the step size exactly), but this may not be the case on other > | +hardware when the actual granularity is not an exact divisor of 100. Have you considered how to share the value of "e" with users?
>>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c >>> index d98e0d2de09f..c5e73b75aaa0 100644 >>> --- a/fs/resctrl/ctrlmondata.c >>> +++ b/fs/resctrl/ctrlmondata.c >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) >>> return false; >>> } >>> >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran); >>> + *data = resctrl_arch_round_bw(bw, r); >> >> Please check that function comments remain accurate after changes (specifically >> if making the conversion more generic as proposed below). > > I hoped that the comment for this function was still applicable, though > it can probably be improved. How about the following? > > | - * hardware. The allocated bandwidth percentage is rounded to the next > | - * control step available on the hardware. > | + * hardware. The allocated bandwidth percentage is converted as > | + * appropriate for consumption by the specific hardware driver. > > [...] Looks good to me. > >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h >>> index 6fb4894b8cfd..5b2a555cf2dd 100644 >>> --- a/include/linux/resctrl.h >>> +++ b/include/linux/resctrl.h >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); >>> >>> +/* >>> + * Round a bandwidth control value to the nearest value acceptable to >>> + * the arch code for resource r: >>> + */ >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); >>> + >> >> I do not think that resctrl should make any assumptions on what the >> architecture's conversion does (i.e "round"). That architecture needs to be >> asked to "round a bandwidth control value" also sounds strange since resctrl really >> should be able to do something like rounding itself. As I understand from >> the notes this will be a no-op for MPAM making this even more confusing. >> >> How about naming the helper something like resctrl_arch_convert_bw()? >> (Open to other ideas of course). >> >> If you make such a change, please check that subject of patch still fits. > > I struggled a bit with the name. Really, this is converting the value > to an intermediate form (which might or might not involve rounding). > For historical reasons, this is a value suitable for writing directly > to the relevant x86 MSR without any further interpretation. > > For MPAM, it is convenient to do this conversion later, rather than > during parsing of the value. > > > Would a name like resctrl_arch_preconvert_bw() be acceptable? Yes. > > This isn't more informative than your suggestion regarding what the > conversion is expected to do, but may convey the expectation that the > output value may still not be in its final (i.e., hardware) form. Sounds good, yes. > >> I think that using const to pass data to architecture is great, thanks. >> >> Reinette > > I try to constify by default when straightforward to do so, since the > compiler can then find which cases need to change; the reverse > direction is harder to automate... Could you please elaborate what you mean with "reverse direction"? Thank you Reinette
Hi Reinette, On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/22/25 7:39 AM, Dave Martin wrote: > > Hi Reinette, > > > > Thanks for the review. > > > > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > >> Hi Dave, > >> > >> nits: > >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other > >> resctrl code (and was established by Arm :)). > >> Also please use upper case for acronym mba->MBA. > > > > Ack (the local custom in the MPAM code is to use "mba", but arguably, > > the meaning is not quite the same -- I'll change it.) > > I am curious what the motivation is for the custom? Knowing this will help > me to keep things consistent when the two worlds meet. I think this has just evolved over time. On the x86 side, MBA is a specific architectural feature, but on the MPAM side the architecture doesn't really have a name for the same thing. Memory bandwidth is a concept, but a few different types of control are defined for it, with different names. So, for the MPAM driver "mba" is more of a software concept than something in a published spec: it's the glue that attaches to "MB" resource as seen through resctrl. (This isn't official though; it's just the mental model that I have formed.) > > >>> The control value parser for the MB resource currently coerces the > >>> memory bandwidth percentage value from userspace to be an exact > >>> multiple of the bw_gran parameter. > >> > >> (to help be specific) > >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > > > "bw_gran" was intended as an informal shorthand for the abstract > > parameter (exposed both in the field you mention and through the > > bandwidth_gran file in resctrl). > > I do not see a need for being abstract since the bandwidth_gran file exposes > the field verbatim. Sure; that was just my thought process. > > I can rewrite it as per your suggestion, but this could be read as > > excluding the bandwidth_gran file. Would it make sense just to write > > it out longhand? For now, I've rewritten it as follows: > > Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran > it is not clear to me how being specific excludes the bandwidth_gran file. > > > > > | The control value parser for the MB resource currently coerces the > > | memory bandwidth percentage value from userspace to be an exact > > | multiple of the bandwidth granularity parameter. > > If you want to include the bandwidth_gran file then the above could be > something like: > > The control value parser for the MB resource coerces the memory > bandwidth percentage value from userspace to be an exact multiple > of the bandwidth granularity parameter that is exposed by the > bandwidth_gran resctrl file. > > I still think that replacing "the bandwidth granularity parameter" with > "rdt_resource::resctrl_membw::bw_gran" will help to be more specific. That's fine. I'll change as per your original suggestion. > > | > > | On MPAM systems, this results in somewhat worse-than-worst-case > > | rounding, since the bandwidth granularity advertised to resctrl by the > > | MPAM driver is in general only an approximation [...] > > > > (I'm happy to go with your suggestion if you're not keen on this, > > though.)
> > >>> On MPAM systems, this results in somewhat worse-than-worst-case > >>> rounding, since bw_gran is in general only an approximation to the > >>> actual hardware granularity on these systems, and the hardware > >>> bandwidth allocation control value is not natively a percentage -- > >>> necessitating a further conversion in the resctrl_arch_update_domains() > >>> path, regardless of the conversion done at parse time. > >>> > >>> Allow the arch to provide its own parse-time conversion that is > >>> appropriate for the hardware, and move the existing conversion to x86. > >>> This will avoid accumulated error from rounding the value twice on MPAM > >>> systems. > >>> > >>> Clarify the documentation, but avoid overly exact promises. > >>> > >>> Clamping to bw_min and bw_max still feels generic: leave it in the core > >>> code, for now. > >> > >> Sounds like MPAM may be ready to start the schema parsing discussion again? > >> I understand that MPAM has a few more ways to describe memory bandwidth as > >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > >> schema format to user space, which seems like a good idea for new schema. > > > > My own ideas in this area are a little different, though I agree with > > the general idea. > > Should we expect a separate proposal from James? At some point, yes. We still need to have a chat about it. Right now, I was just throwing an idea out there. > > I'll respond separately on that, to avoid this thread getting off-topic. > > Much appreciated. > > > > For this patch, my aim was to avoid changing anything unnecessarily. > > Understood. More below as I try to understand the details but it does not > really sound as though the current interface works that great for MPAM. If I > understand correctly this patch enables MPAM to use the existing interface for > its memory bandwidth allocations but doing so does not enable users to > obtain the benefit of hardware capabilities. For that users would want to use > the new interface? In an ideal world, probably, yes.
> > I agree that this should not have anything to do with this patch. My concern > is that I understood that the test failed for a feature that is not supported. > If this is the case then there may be a problem with the test. The test should > not fail if the feature is not supported but instead skip the test. I'll try to capture more output from this when I re-run it, so that we can figure out what this is. > >>> Notes: > >>> > >>> I put the x86 version out of line in order to avoid having to move > >>> struct rdt_resource and its dependencies into resctrl_types.h -- which > >>> would create a lot of diff noise. Schemata writes from userspace have > >>> a high overhead in any case. > >> > >> Sounds good, I expect compiler will inline. > > > > The function and caller are in separate translation units, so unless > > LTO is used, I don't think the function will be inlined. > > Thanks, yes, indeed. > > > > >>> > >>> For MPAM the conversion will be a no-op, because the incoming > >>> percentage from the core resctrl code needs to be converted to hardware > >>> representation in the driver anyway. > >> > >> (addressed below) > >> > >>> > >>> Perhaps _all_ the types should move to resctrl_types.h. > >> > >> Can surely consider when there is a good motivation. > >> > >>> > >>> For now, I went for the smallest diffstat... > > > > I'll assume the motivation is not strong enough for now, but shout if > > you disagree. > > I agree. OK, I'll leave that as-is for now, then. > > > > [...] > > > >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > >>> index c7949dd44f2f..a1d0469d6dfb 100644 > >>> --- a/Documentation/filesystems/resctrl.rst > >>> +++ b/Documentation/filesystems/resctrl.rst > >>> @@ -143,12 +143,11 @@ with respect to allocation: > >>> user can request. > >>> > >>> "bandwidth_gran": > >>> - The granularity in which the memory bandwidth > >>> + The approximate granularity in which the memory bandwidth > >>> percentage is allocated. The allocated > >>> b/w percentage is rounded off to the next > >>> - control step available on the hardware. The > >>> - available bandwidth control steps are: > >>> - min_bandwidth + N * bandwidth_gran. > >>> + control step available on the hardware. The available > >>> + steps are at least as small as this value. > >> > >> A bit difficult to parse for me. > >> Is "at least as small as" same as "at least"? > > > > It was supposed to mean: "The available steps are no larger than this > > value." > > This is clear to me, especially when compared with the planned addition to > "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting > the paragraph below (more below). > > > > > > Formally, my expectation is that this value is the smallest integer > > number of percent which is not smaller than the apparent size of any > > individual rounding step. Equivalently, this is the smallest number g > > Considering the two statements: > - "The available steps are no larger than this value." > - "this value ... is not smaller than the apparent size of any individual rounding step" > > The "not larger" and "not smaller" sounds like all these words just end up saying that > this is the step size? They are intended to be the same statement: A <= B versus B >= A respectively. But I'd be the first to admit that the wording is a bit twisted! (I wouldn't be astonished if I got something wrong somewhere.) See below for an alternative way of describing this that might be more intuitive.
> > for which writing "MB: 0=x" and "MB: 0=y" yield different > configurations for every in-range x and where y = x + g and y is also > in-range. > > That's a bit of a mouthful, though. If you can think of a more > succinct way of putting it, I'm open to suggestions! > >> Please note that the documentation has a section "Memory bandwidth Allocation >> and monitoring" that also contains these exact promises. > > Hmmm, somehow I completely missed that. > > Does the following make sense? Ideally, there would be a simpler way > to describe the discrepancy between the reported and actual values of > bw_gran... > > | Memory bandwidth Allocation and monitoring > | ========================================== > | > | [...] > | > | The minimum bandwidth percentage value for each cpu model is predefined > | and can be looked up through "info/MB/min_bandwidth". The bandwidth > | granularity that is allocated is also dependent on the cpu model and can > | be looked up at "info/MB/bandwidth_gran". The available bandwidth > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded > | -to the next control step available on the hardware. > | +control steps are: min_bw + N * (bw_gran - e), where e is a > | +non-negative, hardware-defined real constant that is less than 1. > | +Intermediate values are rounded to the next control step available on > | +the hardware. > | + > | +At the time of writing, the constant e referred to in the preceding > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran > | +describes the step size exactly), but this may not be the case on other > | +hardware when the actual granularity is not an exact divisor of 100. > > Have you considered how to share the value of "e" with users? Perhaps introducing this "e" as an explicit parameter is a bad idea and overly formal. In practice, there are likely to be various sources of skid and approximation in the hardware, so exposing an actual value may be counterproductive -- i.e., what usable guarantee is this providing to userspace, if this is likely to be swamped by approximations elsewhere? Instead, maybe we can just say something like: | The available steps are spaced at roughly equal intervals between the | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading | info/MB/bandwidth_gran gives the worst-case precision of these | interval steps, in per cent. What do you think? If that's adequate, then the wording under the definition of "bandwidth_gran" could be aligned with this. > >>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c > >>> index d98e0d2de09f..c5e73b75aaa0 100644 > >>> --- a/fs/resctrl/ctrlmondata.c > >>> +++ b/fs/resctrl/ctrlmondata.c > >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) > >>> return false; > >>> } > >>> > >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran); > >>> + *data = resctrl_arch_round_bw(bw, r); > >> > >> Please check that function comments remain accurate after changes (specifically > >> if making the conversion more generic as proposed below). > > > > I hoped that the comment for this function was still applicable, though > > it can probably be improved. How about the following? > > > > | - * hardware. The allocated bandwidth percentage is rounded to the next > > | - * control step available on the hardware. > > | + * hardware. 
The allocated bandwidth percentage is converted as > > | + * appropriate for consumption by the specific hardware driver. > > > > [...] > > Looks good to me. OK. > > > >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h > >>> index 6fb4894b8cfd..5b2a555cf2dd 100644 > >>> --- a/include/linux/resctrl.h > >>> +++ b/include/linux/resctrl.h > >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, > >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); > >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); > >>> > >>> +/* > >>> + * Round a bandwidth control value to the nearest value acceptable to > >>> + * the arch code for resource r: > >>> + */ > >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); > >>> + > >> > >> I do not think that resctrl should make any assumptions on what the > >> architecture's conversion does (i.e "round"). That architecture needs to be > >> asked to "round a bandwidth control value" also sounds strange since resctrl really > >> should be able to do something like rounding itself. As I understand from > >> the notes this will be a no-op for MPAM making this even more confusing. > >> > >> How about naming the helper something like resctrl_arch_convert_bw()? > >> (Open to other ideas of course). > >> > >> If you make such a change, please check that subject of patch still fits. > > > > I struggled a bit with the name. Really, this is converting the value > > to an intermediate form (which might or might not involve rounding). > > For historical reasons, this is a value suitable for writing directly > > to the relevant x86 MSR without any further interpretation. > > > > For MPAM, it is convenient to do this conversion later, rather than > > during parsing of the value. > > > > > > Would a name like resctrl_arch_preconvert_bw() be acceptable? > > Yes. > > > > > This isn't more informative than your suggestion regarding what the > > conversion is expected to do, but may convey the expectation that the > > output value may still not be in its final (i.e., hardware) form. > > Sounds good, yes. OK, I'll hack that in. > > > > >> I think that using const to pass data to architecture is great, thanks. > >> > >> Reinette > > > > I try to constify by default when straightforward to do so, since the > > compiler can then find which cases need to change; the reverse > > direction is harder to automate... > > Could you please elaborate what you mean with "reverse direction"? I just meant that over-consting tends to result in violations of the language that the compiler will detect, but under-consting doesn't:

static void foo(int *nonconstp, const int *constp)
{
	*constp = 0;     // compiler error
	(*nonconstp);    // silently accepted, though it could have been const
}

So, the compiler will tell you places where const needs to be removed (or something else needs to change), but to find places where const could be _added_, you have to hunt them down yourself, or use some other tool that is probably not part of the usual workflow. Cheers ---Dave
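A concrete, hypothetical example may help pin down "e". Suppose a control with 64 equally spaced hardware steps, so that the true step is not an exact divisor of 100:

\[
\text{true step} = \frac{100}{64} = 1.5625\%, \qquad
\texttt{bw\_gran} = \lceil 1.5625 \rceil = 2, \qquad
e = 2 - 1.5625 = 0.4375,
\]
\[
\text{available steps} = \texttt{min\_bw} + N \times (\texttt{bw\_gran} - e)
                       = \texttt{min\_bw} + 1.5625\,N .
\]

Here bw_gran honestly reports "steps no larger than 2%", while the steps actually taken are 1.5625% apart -- the situation the proposed wording describes. The 64-step control is an assumption for illustration only.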
Hi Dave, On 9/25/25 5:46 AM, Dave Martin wrote: > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: >> On 9/22/25 7:39 AM, Dave Martin wrote: >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >>>> Hi Dave, >>>> >>>> nits: >>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other >>>> resctrl code (and was established by Arm :)). >>>> Also please use upper case for acronym mba->MBA. >>> >>> Ack (the local custom in the MPAM code is to use "mba", but arguably, >>> the meaning is not quite the same -- I'll change it.) >> >> I am curious what the motivation is for the custom? Knowing this will help >> me to keep things consistent when the two worlds meet. > > I think this has just evolved over time. On the x86 side, MBA is a > specific architectural feature, but on the MPAM side the architecture > doesn't really have a name for the same thing. Memory bandwidth is a > concept, but a few different types of control are defined for it, with > different names. > > So, for the MPAM driver "mba" is more of a software concept than > something in a published spec: it's the glue that attaches to "MB" > resource as seen through resctrl. > > (This isn't official though; it's just the mental model that I have > formed.) I see. Thank you for the details. My mental model is simpler: write acronyms in upper case. ... >>>>> On MPAM systems, this results in somewhat worse-than-worst-case >>>>> rounding, since bw_gran is in general only an approximation to the >>>>> actual hardware granularity on these systems, and the hardware >>>>> bandwidth allocation control value is not natively a percentage -- >>>>> necessitating a further conversion in the resctrl_arch_update_domains() >>>>> path, regardless of the conversion done at parse time. >>>>> >>>>> Allow the arch to provide its own parse-time conversion that is >>>>> appropriate for the hardware, and move the existing conversion to x86. >>>>> This will avoid accumulated error from rounding the value twice on MPAM >>>>> systems. >>>>> >>>>> Clarify the documentation, but avoid overly exact promises. >>>>> >>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core >>>>> code, for now. >>>> >>>> Sounds like MPAM may be ready to start the schema parsing discussion again? >>>> I understand that MPAM has a few more ways to describe memory bandwidth as >>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing >>>> schema format to user space, which seems like a good idea for new schema. >>> >>> My own ideas in this area are a little different, though I agree with >>> the general idea. >> >> Should we expect a separate proposal from James? > > At some point, yes. We still need to have a chat about it. > > Right now, I was just throwing an idea out there. Thank you very much for doing so. We are digesting it. ... >>> For this patch, my aim was to avoid changing anything unnecessarily. >> >> Understood. More below as I try to understand the details but it does not >> really sound as though the current interface works that great for MPAM. If I >> understand correctly this patch enables MPAM to use existing interface for >> its memory bandwidth allocations but doing so does not enable users to >> obtain benefit of hardware capabilities. For that users would want to use >> the new interface? > > In ideal world, probably, yes. 
> > Since not all use cases will care about full precision, the MB resource > (approximated for MPAM) should be fine for a lot of people, but I > expect that sooner or later somebody will want more exact control. ack. > >>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ >>>>> the other tests except for the NONCONT_CAT tests, which do not seem to >>>>> be supported in my configuration -- and have nothing to do with the >>>>> code touched by this patch). >>>> >>>> Is the NONCONT_CAT test failing (i.e printing "not ok")? >>>> >>>> The NONCONT_CAT tests may print error messages as debug information as part of >>>> running, but these errors are expected as part of the test. The test should accurately >>>> state whether it passed or failed though. For example, below attempts to write >>>> a non-contiguous CBM to a system that does not support non-contiguous masks. >>>> This fails as expected, error messages printed as debugging and thus the test passes >>>> with an "ok". >>>> >>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument >>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected >>>> ok 5 L3_NONCONT_CAT: test >>> >>> I don't think that this was anything to do with my changes, but I don't >>> still seem to have the test output. (Since this test has to do with >>> bitmap schemata (?), it seemed unlikely to be affected by changes to >>> bw_validate().) >> >> I agree that this should not have anything to do with this patch. My concern >> is that I understood that the test failed for a feature that is not supported. >> If this is the case then there may be a problem with the test. The test should >> not fail if the feature is not supported but instead skip the test. > > I'll try to capture more output from this when I re-run it, so that we > can figure out what this is. Thank you. ... >>> >>> [...] >>> >>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst >>>>> index c7949dd44f2f..a1d0469d6dfb 100644 >>>>> --- a/Documentation/filesystems/resctrl.rst >>>>> +++ b/Documentation/filesystems/resctrl.rst >>>>> @@ -143,12 +143,11 @@ with respect to allocation: >>>>> user can request. >>>>> >>>>> "bandwidth_gran": >>>>> - The granularity in which the memory bandwidth >>>>> + The approximate granularity in which the memory bandwidth >>>>> percentage is allocated. The allocated >>>>> b/w percentage is rounded off to the next >>>>> - control step available on the hardware. The >>>>> - available bandwidth control steps are: >>>>> - min_bandwidth + N * bandwidth_gran. >>>>> + control step available on the hardware. The available >>>>> + steps are at least as small as this value. >>>> >>>> A bit difficult to parse for me. >>>> Is "at least as small as" same as "at least"? >>> >>> It was supposed to mean: "The available steps are no larger than this >>> value." >> >> This is clear to me, especially when compared with the planned addition to >> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting >> the paragraph below (more below). >> >>> >>> Formally My expectation is that this value is the smallest integer >>> number of percent which is not smaller than the apparent size of any >>> individual rounding step. Equivalently, this is the smallest number g >> >> Considering the two statements: >> - "The available steps are no larger than this value." >> - "this value ... 
is not smaller than the apparent size of any individual rounding step" >> >> The "not larger" and "not smaller" sounds like all these words just end up saying that >> this is the step size? > > They are intended to be the same statement: A <= B versus > B >= A respectively. This is what I understood from the words ... and that made me think that it can be simplified to A = B ... but no need to digress ... onto the alternatives below ... > > But I'd be the first to admit that the wording is a bit twisted! > (I wouldn't be astonished if I got something wrong somewhere.) > > See below for an alternative way of describing this that might be more > intuitive. > >> >>> for which writing "MB: 0=x" and "MB: 0=y" yield different >>> configurations for every in-range x and where y = x + g and y is also >>> in-range. >>> >>> That's a bit of a mouthful, though. If you can think of a more >>> succinct way of putting it, I'm open to suggestions! >>> >>>> Please note that the documentation has a section "Memory bandwidth Allocation >>>> and monitoring" that also contains these exact promises. >>> >>> Hmmm, somehow I completely missed that. >>> >>> Does the following make sense? Ideally, there would be a simpler way >>> to describe the discrepancy between the reported and actual values of >>> bw_gran... >>> >>> | Memory bandwidth Allocation and monitoring >>> | ========================================== >>> | >>> | [...] >>> | >>> | The minimum bandwidth percentage value for each cpu model is predefined >>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth >>> | granularity that is allocated is also dependent on the cpu model and can >>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth >>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded >>> | -to the next control step available on the hardware. >>> | +control steps are: min_bw + N * (bw_gran - e), where e is a >>> | +non-negative, hardware-defined real constant that is less than 1. >>> | +Intermediate values are rounded to the next control step available on >>> | +the hardware. >>> | + >>> | +At the time of writing, the constant e referred to in the preceding >>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran >>> | +describes the step size exactly), but this may not be the case on other >>> | +hardware when the actual granularity is not an exact divisor of 100. >> >> Have you considered how to share the value of "e" with users? > > Perhaps introducing this "e" as an explicit parameter is a bad idea and > overly formal. In practice, there are likely to be various sources of > skid and approximation in the hardware, so exposing an actual value may > be counterproductive -- i.e., what usable guarantee is this providing > to userspace, if this is likely to be swamped by approximations > elsewhere? > > Instead, maybe we can just say something like: > > | The available steps are spaced at roughly equal intervals between the > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading > | info/MB/bandwidth_gran gives the worst-case precision of these > | interval steps, in per cent. > > What do you think? I find "worst-case precision" a bit confusing; consider, for example, what would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives the upper limit of these interval steps"? I believe this matches what you mentioned a couple of messages ago: "The available steps are no larger than this value." 
(and "per cent" -> "percent") > > If that's adequate, then the wording under the definition of > "bandwidth_gran" could be aligned with this. I think putting together a couple of your proposals and statements while making the text more accurate may work: "bandwidth_gran": The approximate granularity in which the memory bandwidth percentage is allocated. The allocated bandwidth percentage is rounded up to the next control step available on the hardware. The available hardware steps are no larger than this value. I assume "available" is needed because, even though the steps are not larger than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" to 100% range? >>>> I think that using const to pass data to architecture is great, thanks. >>>> >>>> Reinette >>> >>> I try to constify by default when straightforward to do so, since the >>> compiler can then find which cases need to change; the reverse >>> direction is harder to automate... >> >> Could you please elaborate what you mean with "reverse direction"? > > I just meant that over-consting tends to result in violations of the > language that the compiler will detect, but under-consting doesn't: > > static void foo(int *nonconstp, const int *constp) > { > *constp = 0; // compiler error > (*nonconstp); // silently accpeted, though it could have been const > } > > So, the compiler will tell you places where const needs to be removed > (or something else needs to change), but to find places where const > could be _added_, you have to hunt them down yourself, or use some > other tool that is probably not part of the usual workflow. Got it, thanks. Reinette
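A worked example may help pin the proposed wording down (assumed numbers, not taken from any real platform): on hardware with 64 equal throttling steps, each step spans 100/64 = 1.5625%, so bandwidth_gran would report 2 -- the smallest integer percentage not below the true step -- and, in the "e" formalism discussed above, e = 2 - 1.5625 = 0.4375, which is non-negative and less than 1 as required. The real steps are then indeed "no larger than" the reported value, which is exactly the promise being proposed.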
Hi Reinette, On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote: > Hi Dave, > > On 9/25/25 5:46 AM, Dave Martin wrote: > > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: > >> On 9/22/25 7:39 AM, Dave Martin wrote: > >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > >>>> Hi Dave, [...] > >>>> Also please use upper case for acronym mba->MBA. > >>> > >>> Ack (the local custom in the MPAM code is to use "mba", but arguably, > >>> the meaning is not quite the same -- I'll change it.) > >> > >> I am curious what the motivation is for the custom? Knowing this will help > >> me to keep things consistent when the two worlds meet. > > > > I think this has just evolved over time. On the x86 side, MBA is a > > specific architectural feature, but on the MPAM side the architecture > > doesn't really have a name for the same thing. Memory bandwidth is a > > concept, but a few different types of control are defined for it, with > > different names. > > > > So, for the MPAM driver "mba" is more of a software concept than > > something in a published spec: it's the glue that attaches to "MB" > > resource as seen through resctrl. > > > > (This isn't official though; it's just the mental model that I have > > formed.) > > I see. Thank you for the details. My mental model is simpler: write acronyms > in upper case. Generally, I agree, although I'm not sure whether that acronym belongs in the MPAM-specific code. For this patch, though, that's irrelevant. I've changed it to "MBA" as requested. [...] > >> really sound as though the current interface works that great for MPAM. If I > >> understand correctly this patch enables MPAM to use existing interface for > >> its memory bandwidth allocations but doing so does not enable users to > >> obtain benefit of hardware capabilities. For that users would want to use > >> the new interface? > > > > In ideal world, probably, yes. > > > > Since not all use cases will care about full precision, the MB resource > > (approximated for MPAM) should be fine for a lot of people, but I > > expect that sooner or later somebody will want more exact control. > > ack. OK. [...] > >> Considering the two statements: > >> - "The available steps are no larger than this value." > >> - "this value ... is not smaller than the apparent size of any individual rounding step" > >> > >> The "not larger" and "not smaller" sounds like all these words just end up saying that > >> this is the step size? > > > > They are intended to be the same statement: A <= B versus > > B >= A respectively. > > This is what I understood from the words ... and that made me think that it > can be simplified to A = B ... but no need to digress ... onto the alternatives below ... Right... [...] > > Instead, maybe we can just say something like: > > > > | The available steps are spaced at roughly equal intervals between the > > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading > > | info/MB/bandwidth_gran gives the worst-case precision of these > > | interval steps, in per cent. > > > > What do you think? > > I find "worst-case precision" a bit confusing; consider, for example, what > would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives > the upper limit of these interval steps"? I believe this matches what you > mentioned a couple of messages ago: "The available steps are no larger than this > value." Yes, that works. "Worst case" implies a value judgement that smaller steps are "better" than large steps, since the goal is control. But your wording, to the effect that this is the largest (apparent) step size, conveys all the needed information. > (and "per cent" -> "percent") ( Note: https://en.wiktionary.org/wiki/per_cent ) (Though either is acceptable, the fused word has a more informal feel to it for me. Happy to change it -- though your rewording below gets rid of it anyway.) (This word doesn't appear in resctrl.rst -- everything is "percentage" etc.) > > If that's adequate, then the wording under the definition of > "bandwidth_gran" could be aligned with this. > I think putting together a couple of your proposals and statements while making the > text more accurate may work: > > "bandwidth_gran": > The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated bandwidth percentage > is rounded up to the next control step available on the > hardware. The available hardware steps are no larger than > this value. That's better, thanks. I'm happy to pick this up and reword the text in both places along these lines. > I assume "available" is needed because, even though the steps are not larger > than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > to 100% range? Yes -- or, rather, the steps _look_ inconsistent because they are rounded to exact percentages by the interface. I don't think we expect the actual steps in the hardware to be irregular. [...] > Reinette Cheers ---Dave
Hi Dave, On 9/29/25 5:43 AM, Dave Martin wrote: > Hi Reinette, > > On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote: >> Hi Dave, >> >> On 9/25/25 5:46 AM, Dave Martin wrote: > > [...] > >> I see. Thank you for the details. My mental model is simpler: write acronyms >> in upper case. > > Generally, I agree, although I'm not sure whether that acronym belongs > in the MPAM-specific code. > > For this patch, though, that's irrelevant. I've changed it to "MBA" > as requested. > Thank you. ... > [...] > >>> Instead, maybe we can just say something like: >>> >>> | The available steps are spaced at roughly equal intervals between the >>> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading >>> | info/MB/bandwidth_gran gives the worst-case precision of these >>> | interval steps, in per cent. >>> >>> What do you think? >> >> I find "worst-case precision" a bit confusing; consider, for example, what >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives >> the upper limit of these interval steps"? I believe this matches what you >> mentioned a couple of messages ago: "The available steps are no larger than this >> value." > > Yes, that works. "Worst case" implies a value judgement that smaller > steps are "better" than large steps, since the goal is control. > > But your wording, to the effect that this is the largest (apparent) > step size, conveys all the needed information. Thank you for considering it. My preference is for stating things succinctly and not leaving too much for interpretation. > >> (and "per cent" -> "percent") > > ( Note: https://en.wiktionary.org/wiki/per_cent ) Yes, in particular I note the "chiefly Commonwealth". 
I respect the differences in the English language and was easily convinced in earlier MPAM work to accept different spelling. I now regret doing so because after the merge we now have a supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant to encounter when reading the code and also complicates new work. > (Though either is acceptable, the fused word has a more informal feel > to it for me. Happy to change it -- though your rewording below gets > rid of it anyway.) (This word doesn't appear in resctrl.rst -- > everything is "percentage" etc.) > >> >>> If that's adequate, then the wording under the definition of >>> "bandwidth_gran" could be aligned with this. >> >> I think putting together a couple of your proposals and statements while making the >> text more accurate may work: >> >> "bandwidth_gran": >> The approximate granularity in which the memory bandwidth >> percentage is allocated. The allocated bandwidth percentage >> is rounded up to the next control step available on the >> hardware. The available hardware steps are no larger than >> this value. > > That's better, thanks. I'm happy to pick this up and reword the text > in both places along these lines. Thank you very much. Please do feel free to rework. > >> I assume "available" is needed because, even though the steps are not larger >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" >> to 100% range? > > Yes -- or, rather, the steps _look_ inconsistent because they are > rounded to exact percentages by the interface. > > I don't think we expect the actual steps in the hardware to be > irregular. > Thank you for clarifying. Reinette
Hi Reinette, On Mon, Sep 29, 2025 at 08:38:13AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/29/25 5:43 AM, Dave Martin wrote: [...] > > Generally, I agree, although I'm not sure whether that acronym belongs > > in the MPAM-specific code. > > > > For this patch, though, that's irrelevant. I've changed it to "MBA" > > as requested. > > > > Thank you. [...] > >> I find "worst-case precision" a bit confusing; consider, for example, what > >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives > >> the upper limit of these interval steps"? I believe this matches what you > >> mentioned a couple of messages ago: "The available steps are no larger than this > >> value." > > > > Yes, that works. "Worst case" implies a value judgement that smaller > > steps are "better" than large steps, since the goal is control. > > > > But your wording, to the effect that this is the largest (apparent) > > step size, conveys all the needed information. > > Thank you for considering it. My preference is for stating things succinctly > and not leaving too much for interpretation. I find that it's not always easy to work out what information is essential without the discussion... so I hope that didn't feel like a waste of time! > >> (and "per cent" -> "percent") > > > > ( Note: https://en.wiktionary.org/wiki/per_cent ) > > Yes, in particular I note the "chiefly Commonwealth". I respect the differences > in the English language and was easily convinced in earlier MPAM work to > accept different spelling. I now regret doing so because after the merge we now have a > supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant > to encounter when reading the code and also complicates new work. > > > (Though either is acceptable, the fused word has a more informal feel > > to it for me. Happy to change it -- though your rewording below gets > > rid of it anyway.) (This word doesn't appear in resctrl.rst -- > > everything is "percentage" etc.) Sure, it's best not to have mixed-up conventions in the same document. (With this one, I wasn't aware that there were regional differences at all, so that was news to me...) [...] > >> I think putting together a couple of your proposals and statements while making the > >> text more accurate may work: > >> > >> "bandwidth_gran": > >> The approximate granularity in which the memory bandwidth > >> percentage is allocated. The allocated bandwidth percentage > >> is rounded up to the next control step available on the > >> hardware. The available hardware steps are no larger than > >> this value. > > > > That's better, thanks. I'm happy to pick this up and reword the text > > in both places along these lines. > > Thank you very much. Please do feel free to rework. > > > > >> I assume "available" is needed because, even though the steps are not larger > >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > >> to 100% range? > > > > Yes -- or, rather, the steps _look_ inconsistent because they are > > rounded to exact percentages by the interface. > > > > I don't think we expect the actual steps in the hardware to be > > irregular. > > > Thank you for clarifying. > > Reinette OK. I'll tidy up the loose ends and repost once I've had a chance to re-test. Thanks for the review. Cheers ---Dave
On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/25/25 5:46 AM, Dave Martin wrote: [...] > I think putting together a couple of your proposals and statements while making the > text more accurate may work: > > "bandwidth_gran": > The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated bandwidth percentage > is rounded up to the next control step available on the > hardware. The available hardware steps are no larger than > this value. > > I assume "available" is needed because, even though the steps are not larger > than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > to 100% range? What values are allowed for "bandwidth_gran"? The "Intel® Resource Director Technology (Intel® RDT) Architecture Specification" https://cdrdv2.intel.com/v1/dl/getContent/789566 describes the upcoming region aware memory bandwidth allocation controls as being a number from "1" to "Q" (enumerated in an ACPI table). First implementation looks like Q == 255, which means a granularity of 0.392%. The spec has headroom to allow Q == 511. I don't expect users to need that granularity at the high bandwidth end of the range, but I do expect them to care for highly throttled background/batch jobs to make sure they can't affect performance of the high priority jobs. I'd hate to have to round all low bandwidth controls to 1% steps. [...] -Tony
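A small sketch of Tony's resolution point (userspace C; it assumes, purely for illustration, that throughput scales proportionally with the programmed 1..Q value -- a relationship the spec may not actually promise, as questioned later in the thread):

#include <stdio.h>

#define Q 255	/* assumed maximum control value from the ACPI table */

/* Nearest control value in 1..Q for a requested bandwidth percentage,
 * under the proportionality assumption stated above. */
static unsigned int pct_to_q(double pct)
{
	unsigned int v = (unsigned int)(pct * Q / 100.0 + 0.5);

	return v < 1 ? 1 : (v > Q ? Q : v);
}

int main(void)
{
	unsigned int v = pct_to_q(1.2);

	/* The native scale can express ~0.392% steps: 1.2% maps to
	 * 3/255, i.e. ~1.176%; an interface rounded to whole-percent
	 * steps cannot express the difference. */
	printf("1.2%% -> %u/%u (= %.3f%%)\n", v, Q, v * 100.0 / Q);
	return 0;
}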
Hi Tony, On 9/25/25 2:35 PM, Luck, Tony wrote: > On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote: >> On 9/25/25 5:46 AM, Dave Martin wrote: ...
>> >> I think putting together a couple of your proposals and statements while making the >> text more accurate may work: >> >> "bandwidth_gran": >> The approximate granularity in which the memory bandwidth >> percentage is allocated. The allocated bandwidth percentage >> is rounded up to the next control step available on the >> hardware. The available hardware steps are no larger than >> this value. >> >> I assume "available" is needed because, even though the steps are not larger >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" >> to 100% range? > > What values are allowed for "bandwidth_gran"? The "Intel® Resource This is a property of the MB resource where the ABI is to express allocations as a percentage. Current doc: "bandwidth_gran": The granularity in which the memory bandwidth percentage is allocated. The allocated b/w percentage is rounded off to the next control step available on the hardware. The available bandwidth control steps are: min_bandwidth + N * bandwidth_gran. I do not expect we can switch it to fractions so I would say that integer values are allowed, starting at 1. I understand that the MB resource on AMD supports different ranges and I find that ABI discrepancy unfortunate. I do not think this should be seen as an opportunity where "anything goes" when it comes to MB and used as an excuse to pile on another range of hardware-dependent inputs. Instead I believe we should keep the MB interface as-is and instead work on a generic interface that enables user space to interact with resctrl to have the benefit of all hardware capabilities without needing to know which hardware is underneath. > Director Technology (Intel® RDT) Architecture Specification" > > https://cdrdv2.intel.com/v1/dl/getContent/789566 > > describes the upcoming region aware memory bandwidth allocation > controls as being a number from "1" to "Q" (enumerated in an ACPI > table). First implementation looks like Q == 255, which means a > granularity of 0.392%. The spec has headroom to allow Q == 511. > > I don't expect users to need that granularity at the high bandwidth > end of the range, but I do expect them to care for highly throttled > background/batch jobs to make sure they can't affect performance of > the high priority jobs. > > I'd hate to have to round all low bandwidth controls to 1% steps. This is the limitation of choosing to expose this feature as an MB resource and seems to be the same problem that Dave is facing. For finer granularity allocations I expect that we would need a new schema/resource backed by new properties as proposed by Dave in https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ This will require updates to user space (that will anyway be needed if wedging another non-ABI input into MB). Reinette
Hi Reinette, Tony, On Thu, Sep 25, 2025 at 03:18:51PM -0700, Reinette Chatre wrote: > Hi Tony, > > On 9/25/25 2:35 PM, Luck, Tony wrote: [...] > > Director Technology (Intel® RDT) Architecture Specification" > > > > https://cdrdv2.intel.com/v1/dl/getContent/789566 > > > > describes the upcoming region aware memory bandwidth allocation > > controls as being a number from "1" to "Q" (enumerated in an ACPI > > table). First implementation looks like Q == 255, which means a > > granularity of 0.392%. The spec has headroom to allow Q == 511. That does look like it would benefit from exposing the hardware field without rounding (similarly to MPAM). Is the relationship between this value and the expected memory system throughput actually defined anywhere? If the expected throughput is exactly proportional to this value, or a reasonable approximation to this, then that is simple -- but I can't see it actually stated. When a spec suggests a need to divide by (2^N - 1), I do wonder whether that is what they _really_ meant (and whether hardware will just do the obvious cheap approximation in defiance of the spec). > > > > I don't expect users to need that granularity at the high bandwidth > > end of the range, but I do expect them to care for highly throttled > > background/batch jobs to make sure they can't affect performance of > > the high priority jobs. A case where it _might_ matter is where there is a non-trivial number of jobs, and an attempt is made to share bandwidth among them. Although it may not matter exactly how much bandwidth is given to each job, the rounding errors may accumulate so that they add up to significantly more than or less than 100% in total. This feels undesirable. Rounding off the value in the interface effectively makes it impossible for portable software to avoid this problem... > > I'd hate to have to round all low bandwidth controls to 1% steps. +1! (No pun intended.) > This is the limitation of choosing to expose this feature as an MB resource > and seems to be the same problem that Dave is facing. For finer granularity > allocations I expect that we would need a new schema/resource backed by new > properties as proposed by Dave in > https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ > This will require updates to user space (that will anyway be needed if wedging > another non-ABI input into MB). > > Reinette Ack; while we could add decimal places to bandwidth_gran as reported to userspace, we don't know that software isn't going to choke on that. Plus, we might need to add precision to the control values too -- it's no good advertising 0.5% granularity when the MB schema only accepts/reports integers. Software that parses anything as (potentially) a real number might work transparently, but we didn't warn users that they might need to do that... Cheers ---Dave
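To put numbers on the accumulation worry above, a small sketch (userspace C, with an illustrative bw_gran of 10%): splitting bandwidth evenly across seven jobs and letting the interface round each share up -- the current roundup() behaviour at parse time -- programs well over 100% in total.

#include <stdio.h>

/* Round a percentage up to the next multiple of the granularity,
 * mirroring the roundup() done at parse time today. */
static unsigned int roundup_pct(unsigned int pct, unsigned int gran)
{
	return (pct + gran - 1) / gran * gran;
}

int main(void)
{
	const unsigned int jobs = 7, gran = 10;
	unsigned int share = 100 / jobs;	/* 14% per job */
	unsigned int total = jobs * roundup_pct(share, gran);

	/* 7 jobs x 20% = 140%: the per-job rounding errors add up */
	printf("requested ~100%% in total, programmed %u%%\n", total);
	return 0;
}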