[PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats

Tony Luck posted 31 patches 7 months, 3 weeks ago
There is a newer version of this series
[PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Tony Luck 7 months, 3 weeks ago
Resctrl was written with the assumption that all monitor events
can be displayed as unsigned decimal integers.

Some telemetry events provide greater precision where architecture code
uses a fixed point format with 18 binary places.

Add a "display_format" field to struct mon_evt which can specify
that the value for the event be displayed as an integer for legacy
events, or as a floating point value with six decimal places converted
from the fixed point format received from architecture code.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl_types.h |  5 +++++
 fs/resctrl/internal.h         |  2 ++
 fs/resctrl/ctrlmondata.c      | 24 +++++++++++++++++++++++-
 fs/resctrl/monitor.c          | 21 ++++++++++++---------
 4 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index 5ef14a24008c..6245034f6c76 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -50,4 +50,9 @@ enum resctrl_event_id {
 #define QOS_NUM_MBM_EVENTS	(QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
 #define MBM_EVENT_IDX(evt)	((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
 
+/* Event value display formats */
+enum resctrl_event_fmt {
+	EVT_FORMAT_U64,
+	EVT_FORMAT_U46_18,
+};
 #endif /* __LINUX_RESCTRL_TYPES_H */
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index d8aa69b42c74..aaa74a17257d 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -73,6 +73,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
  * @configurable:	true if the event is configurable
  * @enabled:		true if the event is enabled
  * @any_cpu:		true if the event can be read from any CPU
+ * @display_format:	format to display value to users
  * @list:		entry in &rdt_resource->evt_list
  */
 struct mon_evt {
@@ -82,6 +83,7 @@ struct mon_evt {
 	bool			configurable;
 	bool			enabled;
 	bool			any_cpu;
+	enum resctrl_event_fmt	display_format;
 	struct list_head	list;
 };
 
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 7a2957b9c13e..1544c103446b 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -569,6 +569,28 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
 }
 
+#define NUM_FRAC_BITS	18
+#define FRAC_MASK	GENMASK(NUM_FRAC_BITS - 1, 0)
+
+static void print_event_value(struct seq_file *m, enum resctrl_event_fmt type, u64 val)
+{
+	u64 frac;
+
+	switch (type) {
+	case EVT_FORMAT_U64:
+		seq_printf(m, "%llu\n", val);
+		break;
+	case EVT_FORMAT_U46_18:
+		frac = val & FRAC_MASK;
+		frac = frac * 1000000;
+		/* round values up to nearest decimal representation */
+		frac += 1ul << (NUM_FRAC_BITS - 1);
+		frac >>= NUM_FRAC_BITS;
+		seq_printf(m, "%llu.%06llu\n", val >> NUM_FRAC_BITS, frac);
+		break;
+	}
+}
+
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
@@ -637,7 +659,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	else if (rr.err == -EINVAL)
 		seq_puts(m, "Unavailable\n");
 	else
-		seq_printf(m, "%llu\n", rr.val);
+		print_event_value(m, evt->display_format, rr.val);
 
 out:
 	rdtgroup_kn_unlock(of->kn);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index e903d3c076ee..be78488a15e5 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -844,19 +844,22 @@ static void dom_data_exit(struct rdt_resource *r)
 
 struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
 	[QOS_L3_OCCUP_EVENT_ID] = {
-		.name	= "llc_occupancy",
-		.evtid	= QOS_L3_OCCUP_EVENT_ID,
-		.rid	= RDT_RESOURCE_L3,
+		.name		= "llc_occupancy",
+		.evtid		= QOS_L3_OCCUP_EVENT_ID,
+		.rid		= RDT_RESOURCE_L3,
+		.display_format	= EVT_FORMAT_U64,
 	},
 	[QOS_L3_MBM_TOTAL_EVENT_ID] = {
-		.name	= "mbm_total_bytes",
-		.evtid	= QOS_L3_MBM_TOTAL_EVENT_ID,
-		.rid	= RDT_RESOURCE_L3,
+		.name		= "mbm_total_bytes",
+		.evtid		= QOS_L3_MBM_TOTAL_EVENT_ID,
+		.rid		= RDT_RESOURCE_L3,
+		.display_format	= EVT_FORMAT_U64,
 	},
 	[QOS_L3_MBM_LOCAL_EVENT_ID] = {
-		.name	= "mbm_local_bytes",
-		.evtid	= QOS_L3_MBM_LOCAL_EVENT_ID,
-		.rid	= RDT_RESOURCE_L3,
+		.name		= "mbm_local_bytes",
+		.evtid		= QOS_L3_MBM_LOCAL_EVENT_ID,
+		.rid		= RDT_RESOURCE_L3,
+		.display_format	= EVT_FORMAT_U64,
 	},
 };
 
-- 
2.48.1
Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Reinette Chatre 7 months, 2 weeks ago
Hi Tony,

shortlog nit: "fs/resctrl: Support additional monitor event display formats"

On 4/28/25 5:33 PM, Tony Luck wrote:
> Resctrl was written with the assumption that all monitor events
> can be displayed as unsigned decimal integers.
> 
> Some telemetry events provide greater precision where architecture code
> uses a fixed point format with 18 binary places.
> 
> Add a "display_format" field to struct mon_evt which can specify
> that the value for the event be displayed as an integer for legacy
> events, or as a floating point value with six decimal places converted
> from the fixed point format received from architecture code.

There was no discussion on this during the previous version.
While this version addresses the issue of architecture changing the
format it does not address the issue of how to handle different
architecture formats. With this change any architecture that may
want to support any of these events will be required to translate
whatever format it uses into the one Intel uses to be translated
again into format for user space. Do you think this is reasonable? 

Alternatively, resctrl could add additional file that contains the
format so that if an architecture in the future needs to present data
differently, an interface will exist to guide userspace how to parse it.
Creation of such user interface cannot be delayed until the time
it is needed since then these formats would be ABI.

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl_types.h |  5 +++++
>  fs/resctrl/internal.h         |  2 ++
>  fs/resctrl/ctrlmondata.c      | 24 +++++++++++++++++++++++-
>  fs/resctrl/monitor.c          | 21 ++++++++++++---------
>  4 files changed, 42 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index 5ef14a24008c..6245034f6c76 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h

This needs to be internal to resctrl fs.
resctrl_types.h should only contain the types required in asm/resctrl.h

> @@ -50,4 +50,9 @@ enum resctrl_event_id {
>  #define QOS_NUM_MBM_EVENTS	(QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
>  #define MBM_EVENT_IDX(evt)	((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
>  
> +/* Event value display formats */

Please add details about what each format means (how it should
be interpreted).

Reinette
Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Luck, Tony 7 months, 2 weeks ago
On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> shortlog nit: "fs/resctrl: Support additional monitor event display formats"
> 
> On 4/28/25 5:33 PM, Tony Luck wrote:
> > Resctrl was written with the assumption that all monitor events
> > can be displayed as unsigned decimal integers.
> > 
> > Some telemetry events provide greater precision where architecture code
> > uses a fixed point format with 18 binary places.
> > 
> > Add a "display_format" field to struct mon_evt which can specify
> > that the value for the event be displayed as an integer for legacy
> > events, or as a floating point value with six decimal places converted
> > from the fixed point format received from architecture code.
> 
> There was no discussion on this during the previous version.
> While this version addresses the issue of architecture changing the
> format it does not address the issue of how to handle different
> architecture formats. With this change any architecture that may
> want to support any of these events will be required to translate
> whatever format it uses into the one Intel uses to be translated
> again into format for user space. Do you think this is reasonable? 
> 
> Alternatively, resctrl could add additional file that contains the
> format so that if an architecture in the future needs to present data
> differently, an interface will exist to guide userspace how to parse it.
> Creation of such user interface cannot be delayed until the time
> it is needed since then these formats would be ABI.

What if resctrl filesystem allows architecture to supply the number
of binary places for fixed point values when enabling an event?

That would allow h/w implementations to pick an appropriate precision
for each new event. Different implementations of the same event
(e.g. "core_energy") may pick different precision across architectures
or between generations of the same architecture.

File system code can then do:

	if (binary_places == 0)
		display as integer
	else
		convert to floating point (with one decimal place per
		three binary places)

Existing events are all integers and won't change (it would be weird
for an architecture to report "mbm_local_bytes" with a fixed point
rather than integer value).

New events may report in either integer or floating point format
with varying amounts of precision. But I'm not sure that would be
a burden for writing tools that can run on different architectures.

> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl_types.h |  5 +++++
> >  fs/resctrl/internal.h         |  2 ++
> >  fs/resctrl/ctrlmondata.c      | 24 +++++++++++++++++++++++-
> >  fs/resctrl/monitor.c          | 21 ++++++++++++---------
> >  4 files changed, 42 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> > index 5ef14a24008c..6245034f6c76 100644
> > --- a/include/linux/resctrl_types.h
> > +++ b/include/linux/resctrl_types.h
> 
> This needs to be internal to resctrl fs.
> resctrl_types.h should only contain the types required in asm/resctrl.h
> 
> > @@ -50,4 +50,9 @@ enum resctrl_event_id {
> >  #define QOS_NUM_MBM_EVENTS	(QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> >  #define MBM_EVENT_IDX(evt)	((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
> >  
> > +/* Event value display formats */
> 
> Please add details about what each format means (how it should
> be interpreted).
> 
> Reinette

-Tony
Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Reinette Chatre 7 months, 2 weeks ago
Hi Tony,

On 5/8/25 1:28 PM, Luck, Tony wrote:
> On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
>> On 4/28/25 5:33 PM, Tony Luck wrote:
>>> Resctrl was written with the assumption that all monitor events
>>> can be displayed as unsigned decimal integers.
>>>
>>> Some telemetry events provide greater precision where architecture code
>>> uses a fixed point format with 18 binary places.
>>>
>>> Add a "display_format" field to struct mon_evt which can specify
>>> that the value for the event be displayed as an integer for legacy
>>> events, or as a floating point value with six decimal places converted
>>> from the fixed point format received from architecture code.
>>
>> There was no discussion on this during the previous version.
>> While this version addresses the issue of architecture changing the
>> format it does not address the issue of how to handle different
>> architecture formats. With this change any architecture that may
>> want to support any of these events will be required to translate
>> whatever format it uses into the one Intel uses to be translated
>> again into format for user space. Do you think this is reasonable? 
>>
>> Alternatively, resctrl could add additional file that contains the
>> format so that if an architecture in the future needs to present data
>> differently, an interface will exist to guide userspace how to parse it.
>> Creation of such user interface cannot be delayed until the time
>> it is needed since then these formats would be ABI.
> 
> What if resctrl filesystem allows architecture to supply the number
> of binary places for fixed point values when enabling an event?

This sounds good. I do not think we are in a position to come up with
an ideal solution. That would require assumptions of what another
architecture may or may not do and thus we do not have complete information.

> 
> That would allow h/w implementations to pick an appropriate precision
> for each new event. Different implementations of the same event
> (e.g. "core_energy") may pick different precision across architectures
> or between generations of the same architecture.
> 
> File system code can then do:
> 
> 	if (binary_places == 0)
> 		display as integer
> 	else
> 		convert to floating point (with one decimal place per
> 		three binary places)

I do not think this problem needs to be solved in this work but there needs
to be a plan for how other architectures can be supported. When similar
enabling needs to be done for that hypothetical architecture then it can
be implemented ... if it is still valid based on what that architecture actually
supports.
It may be sufficient for the "plan" (as above) to be in comments.

> 
> Existing events are all integers and won't change (it would be weird
> for an architecture to report "mbm_local_bytes" with a fixed point
> rather than integer value).
> 
> New events may report in either integer or floating point format
> with varying amounts of precision. But I'm not sure that would be

Partly this will depend on the unit of measurement that should form part of
the definition of the event. For example, events reporting cycles or ticks
should only be integer, no?

> a burden for writing tools that can run on different architectures.

Maybe just a comment in the docs then ... and now I see that you did
so already. My apologies, I did not get to the last four patches.

Reinette
Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Dave Martin 7 months, 2 weeks ago
Hi,

(Backtrace retained for context -- see my comment at the end.)

Cheers
---Dave

[...]

On Thu, May 08, 2025 at 04:45:21PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/8/25 1:28 PM, Luck, Tony wrote:
> > On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> >> On 4/28/25 5:33 PM, Tony Luck wrote:
> >>> Resctrl was written with the assumption that all monitor events
> >>> can be displayed as unsigned decimal integers.
> >>>
> >>> Some telemetry events provide greater precision where architecture code
> >>> uses a fixed point format with 18 binary places.
> >>>
> >>> Add a "display_format" field to struct mon_evt which can specify
> >>> that the value for the event be displayed as an integer for legacy
> >>> events, or as a floating point value with six decimal places converted
> >>> from the fixed point format received from architecture code.
> >>
> >> There was no discussion on this during the previous version.
> >> While this version addresses the issue of architecture changing the
> >> format it does not address the issue of how to handle different
> >> architecture formats. With this change any architecture that may
> >> want to support any of these events will be required to translate
> >> whatever format it uses into the one Intel uses to be translated
> >> again into format for user space. Do you think this is reasonable? 
> >>
> >> Alternatively, resctrl could add additional file that contains the
> >> format so that if an architecture in the future needs to present data
> >> differently, an interface will exist to guide userspace how to parse it.
> >> Creation of such user interface cannot be delayed until the time
> >> it is needed since then these formats would be ABI.
> > 
> > What if resctrl filesystem allows architecture to supply the number
> > of binary places for fixed point values when enabling an event?
> 
> This sounds good. I do not think we are in a position to come up with
> an ideal solution. That would require assumptions of what another
> architecture may or may not do and thus we do not have complete information.
> 
> > 
> > That would allow h/w implementations to pick an appropriate precision
> > for each new event. Different implementations of the same event
> > (e.g. "core_energy") may pick different precision across architectures
> > or between generations of the same architecture.
> > 
> > File system code can then do:
> > 
> > 	if (binary_places == 0)
> > 		display as integer
> > 	else
> > 		convert to floating point (with one decimal place per
> > 		three binary places)
> 
> I do not think this problem needs to be solved in this work but there needs
> to be a plan for how other architectures can be supported. When similar
> enabling needs to be done for that hypothetical architecture then it can
> be implemented ... if it is still valid based on what that architecture actually
> supports.
> It may be sufficient for the "plan" (as above) to be in comments.
> 
> > 
> > Existing events are all integers and won't change (it would be weird
> > for an architecture to report "mbm_local_bytes" with a fixed point
> > rather than integer value).
> > 
> > New events may report in either integer or floating point format
> > with varying amounts of precision. But I'm not sure that would be
> 
> Partly this will depend on the unit of measurement that should form part of
> the definition of the event. For example, events reporting cycles or ticks
> should only be integer, no?
> 
> > a burden for writing tools that can run on different architectures.
> 
> Maybe just a comment in the docs then ... and now I see that you did
> so already. My apologies, I did not get to the last four patches.
> 
> Reinette

Just a thought, but I think that while it's not possible to be fully
generic, a parameter model along the lines of

	quantity = raw_value * ((double)multiplier / divisor) * BASE_UNIT

would cover most things that we have or can reasonably foresee,
including memory bandwidth control values.
raw_value, multiplier and divisor would all be integers.

Since raw_integer can be the value used by the hardware, its precision
can probably be fixed at 1, though we could still report it explicitly.

Fundamental base units would be things like "byte", "bytes per second"
and "none" (i.e., dimensionless quantities).  (Are there others?)


Since we cannot guess for certain what userspace wants to do with the
values, it feels better to let userspace do any scaling calculations
itself, rather than trying to prettify the interface.

For example: scaling memory bandwidth percentages for MPAM is a
nuisance because the hardware uses fixed-point values scaled by a power
of 2, not by 100: the two scales can never match up anywhere except at
multiples of 25%, leading to irregular increments when rounded to an
integer percentage value and uncertainty about what the bandwidth_gran
parameter means.  Round-trip conversions between the two
representations become error-prone due to repeated rounding -- this
proved quite fiddly to get right.  Precision beyond 1% increments may
also be available in the hardware, but is not accessible through the
resctrl interface.

For backwards compatibility we probably shouldn't change that
particular interface, but if we can avoid new instances of the same
kind of problem then that would be a benefit: i.e., explicitly tell
userspace how to scale a given parameter.

Cheers
---Dave
Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Peter Newman 7 months, 2 weeks ago
Hi Dave,

On Fri, May 9, 2025 at 1:29 PM Dave Martin <Dave.Martin@arm.com> wrote:
>
> Hi,
>
> (Backtrace retained for context -- see my comment at the end.)
>
> Cheers
> ---Dave
>
> [...]
>
> On Thu, May 08, 2025 at 04:45:21PM -0700, Reinette Chatre wrote:
> > Hi Tony,
> >
> > On 5/8/25 1:28 PM, Luck, Tony wrote:
> > > On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> > >> On 4/28/25 5:33 PM, Tony Luck wrote:
> > >>> Resctrl was written with the assumption that all monitor events
> > >>> can be displayed as unsigned decimal integers.
> > >>>
> > >>> Some telemetry events provide greater precision where architecture code
> > >>> uses a fixed point format with 18 binary places.
> > >>>
> > >>> Add a "display_format" field to struct mon_evt which can specify
> > >>> that the value for the event be displayed as an integer for legacy
> > >>> events, or as a floating point value with six decimal places converted
> > >>> from the fixed point format received from architecture code.
> > >>
> > >> There was no discussion on this during the previous version.
> > >> While this version addresses the issue of architecture changing the
> > >> format it does not address the issue of how to handle different
> > >> architecture formats. With this change any architecture that may
> > >> want to support any of these events will be required to translate
> > >> whatever format it uses into the one Intel uses to be translated
> > >> again into format for user space. Do you think this is reasonable?
> > >>
> > >> Alternatively, resctrl could add additional file that contains the
> > >> format so that if an architecture in the future needs to present data
> > >> differently, an interface will exist to guide userspace how to parse it.
> > >> Creation of such user interface cannot be delayed until the time
> > >> it is needed since then these formats would be ABI.
> > >
> > > What if resctrl filesystem allows architecture to supply the number
> > > of binary places for fixed point values when enabling an event?
> >
> > This sounds good. I do not think we are in a position to come up with
> > an ideal solution. That would require assumptions of what another
> > architecture may or may not do and thus we do not have complete information.
> >
> > >
> > > That would allow h/w implementations to pick an appropriate precision
> > > for each new event. Different implementations of the same event
> > > (e.g. "core_energy") may pick different precision across architectures
> > > or between generations of the same architecture.
> > >
> > > File system code can then do:
> > >
> > >     if (binary_places == 0)
> > >             display as integer
> > >     else
> > >             convert to floating point (with one decimal place per
> > >             three binary places)
> >
> > I do not think this problem needs to be solved in this work but there needs
> > to be a plan for how other architectures can be supported. When similar
> > enabling needs to be done for that hypothetical architecture then it can
> > be implemented ... if it is still valid based on what that architecture actually
> > supports.
> > It may be sufficient for the "plan" (as above) to be in comments.
> >
> > >
> > > Existing events are all integers and won't change (it would be weird
> > > for an architecture to report "mbm_local_bytes" with a fixed point
> > > rather than integer value).
> > >
> > > New events may report in either integer or floating point format
> > > with varying amounts of precision. But I'm not sure that would be
> >
> > Partly this will depend on the unit of measurement that should form part of
> > the definition of the event. For example, events reporting cycles or ticks
> > should only be integer, no?
> >
> > > a burden for writing tools that can run on different architectures.
> >
> > Maybe just a comment in the docs then ... and now I see that you did
> > so already. My apologies, I did not get to the last four patches.
> >
> > Reinette
>
> Just a thought, but I think that while it's not possible to be fully
> generic, a parameter model along the lines of
>
>         quantity = raw_value * ((double)multiplier / divisor) * BASE_UNIT
>
> would cover most things that we have or can reasonably foresee,
> including memory bandwidth control values.
> raw_value, multiplier and divisor would all be integers.
>
> Since raw_integer can be the value used by the hardware, its precision
> can probably be fixed at 1, though we could still report it explicitly.
>
> Fundamental base units would be things like "byte", "bytes per second"
> and "none" (i.e., dimensionless quantities).  (Are there others?)
>
>
> Since we cannot guess for certain what userspace wants to do with the
> values, it feels better to let userspace do any scaling calculations
> itself, rather than trying to prettify the interface.
>
> For example: scaling memory bandwidth percentages for MPAM is a
> nuisance because the hardware uses fixed-point values scaled by a power
> of 2, not by 100: the two scales can never match up anywhere except at
> multiples of 25%, leading to irregular increments when rounded to an
> integer percentage value and uncertainty about what the bandwidth_gran
> parameter means.  Round-trip conversions between the two
> representations become error-prone due to repeated rounding -- this
> proved quite fiddly to get right.  Precision beyond 1% increments may
> also be available in the hardware, but is not accessible through the
> resctrl interface.

Google users got annoyed with these rounding errors very quickly and
asked me to change the MBA interface to the raw, fixed-point value
used by the MPAM register interface. (but at least shifted down, since
the MBW_MIN/MAX fields are left-justified)

>
> For backwards compatibility we probably shouldn't change that
> particular interface, but if we can avoid new instances of the same
> kind of problem then that would be a benefit: i.e., explicitly tell
> userspace how to scale a given parameter.

MBA is not programmed by percentage on AMD, so I'm not sure why this
is considered necessary for backwards compatibility.

-Peter
Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Dave Martin 7 months, 2 weeks ago
Hi,

On Fri, May 09, 2025 at 04:46:30PM +0200, Peter Newman wrote:
> Hi Dave,
> 
> On Fri, May 9, 2025 at 1:29 PM Dave Martin <Dave.Martin@arm.com> wrote:

[...]

> > For example: scaling memory bandwidth percentages for MPAM is a
> > nuisance because the hardware uses fixed-point values scaled by a power
> > of 2, not by 100: the two scales can never match up anywhere except at
> > multiples of 25%, leading to irregular increments when rounded to an
> > integer percentage value and uncertainty about what the bandwidth_gran
> > parameter means.  Round-trip conversions between the two
> > representations become error-prone due to repeated rounding -- this
> > proved quite fiddly to get right.  Precision beyond 1% increments may
> > also be available in the hardware, but is not accessible through the
> > resctrl interface.
> 
> Google users got annoyed with these rounding errors very quickly and
> asked me to change the MBA interface to the raw, fixed-point value
> used by the MPAM register interface. (but at least shifted down, since
> the MBW_MIN/MAX fields are left-justified)

That's interesting.

Do you find a need to do things like step the bandwidth allocation for
a control group?  So, as part of a tuning regime, the bandwidth value
is read out, stepped to the next distinct hardware value and written
back in?

That kind of thing does not map in a convenient way onto the current
interface, although fire-and-forget programming of a predetermined
percentage works fine.

Extending my model outline, a 6-bit MPAM MBW_PART implementation might
be described by:

	min: 1
	max: 64
	step size: 1
	multiplier: 1
	divisor: 64

How easy / difficult do you think it would be for userspace to work
with this, if resctrlfs were to expose the raw control (minus the
ignored bits) with that metadata?

Needless to say, the max and divisor values would dependent on the
hardware and possibly other factors.  They would be fixed for the
lifetime of a single resctrl instance at the very least.

> > For backwards compatibility we probably shouldn't change that
> > particular interface, but if we can avoid new instances of the same
> > kind of problem then that would be a benefit: i.e., explicitly tell
> > userspace how to scale a given parameter.
> 
> MBA is not programmed by percentage on AMD, so I'm not sure why this
> is considered necessary for backwards compatibility.

I presumed scripts (or pre-tuned data fed through them) are in practice
pretty platform-specific, so that it will upset people if the interface
changes between kernel versions at least on a given hardware family.

The divergence between AMD and Intel in this area is unfortunate, but
absolute and proportional bandwidth measures do not really seem to be
interchangeable -- so a truly unified interface may not be easy to
achieve either.

Having two control names in the interface might work, say:

	MBP: proportion of total available memory bandwidth (%)

	MBA: absolute memory bandwidth (B/s)

Then just expose the one that the hardware implements natively (while
still exposing MB as a backwards compatible alias if necessary).

Cheers
---Dave
Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
Posted by Luck, Tony 7 months, 2 weeks ago
On Fri, May 09, 2025 at 04:46:30PM +0200, Peter Newman wrote:
> Hi Dave,
> 
> On Fri, May 9, 2025 at 1:29 PM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > Hi,
> >
> > (Backtrace retained for context -- see my comment at the end.)
> >
> > Cheers
> > ---Dave
> >
> > [...]
> >
> > On Thu, May 08, 2025 at 04:45:21PM -0700, Reinette Chatre wrote:
> > > Hi Tony,
> > >
> > > On 5/8/25 1:28 PM, Luck, Tony wrote:
> > > > On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> > > >> On 4/28/25 5:33 PM, Tony Luck wrote:
> > > >>> Resctrl was written with the assumption that all monitor events
> > > >>> can be displayed as unsigned decimal integers.
> > > >>>
> > > >>> Some telemetry events provide greater precision where architecture code
> > > >>> uses a fixed point format with 18 binary places.
> > > >>>
> > > >>> Add a "display_format" field to struct mon_evt which can specify
> > > >>> that the value for the event be displayed as an integer for legacy
> > > >>> events, or as a floating point value with six decimal places converted
> > > >>> from the fixed point format received from architecture code.
> > > >>
> > > >> There was no discussion on this during the previous version.
> > > >> While this version addresses the issue of architecture changing the
> > > >> format it does not address the issue of how to handle different
> > > >> architecture formats. With this change any architecture that may
> > > >> want to support any of these events will be required to translate
> > > >> whatever format it uses into the one Intel uses to be translated
> > > >> again into format for user space. Do you think this is reasonable?
> > > >>
> > > >> Alternatively, resctrl could add additional file that contains the
> > > >> format so that if an architecture in the future needs to present data
> > > >> differently, an interface will exist to guide userspace how to parse it.
> > > >> Creation of such user interface cannot be delayed until the time
> > > >> it is needed since then these formats would be ABI.
> > > >
> > > > What if resctrl filesystem allows architecture to supply the number
> > > > of binary places for fixed point values when enabling an event?
> > >
> > > This sounds good. I do not think we are in a position to come up with
> > > an ideal solution. That would require assumptions of what another
> > > architecture may or may not do and thus we do not have complete information.
> > >
> > > >
> > > > That would allow h/w implementations to pick an appropriate precision
> > > > for each new event. Different implementations of the same event
> > > > (e.g. "core_energy") may pick different precision across architectures
> > > > or between generations of the same architecture.
> > > >
> > > > File system code can then do:
> > > >
> > > >     if (binary_places == 0)
> > > >             display as integer
> > > >     else
> > > >             convert to floating point (with one decimal place per
> > > >             three binary places)
> > >
> > > I do not think this problem needs to be solved in this work but there needs
> > > to be a plan for how other architectures can be supported. When similar
> > > enabling needs to be done for that hypothetical architecture then it can
> > > be implemented ... if it is still valid based on what that architecture actually
> > > supports.
> > > It may be sufficient for the "plan" (as above) to be in comments.
> > >
> > > >
> > > > Existing events are all integers and won't change (it would be weird
> > > > for an architecture to report "mbm_local_bytes" with a fixed point
> > > > rather than integer value).
> > > >
> > > > New events may report in either integer or floating point format
> > > > with varying amounts of precision. But I'm not sure that would be
> > >
> > > Partly this will depend on the unit of measurement that should form part of
> > > the definition of the event. For example, events reporting cycles or ticks
> > > should only be integer, no?
> > >
> > > > a burden for writing tools that can run on different architectures.
> > >
> > > Maybe just a comment in the docs then ... and now I see that you did
> > > so already. My apologies, I did not get to the last four patches.
> > >
> > > Reinette
> >
> > Just a thought, but I think that while it's not possible to be fully
> > generic, a parameter model along the lines of
> >
> >         quantity = raw_value * ((double)multiplier / divisor) * BASE_UNIT
> >
> > would cover most things that we have or can reasonably foresee,
> > including memory bandwidth control values.
> > raw_value, multiplier and divisor would all be integers.
> >
> > Since raw_integer can be the value used by the hardware, its precision
> > can probably be fixed at 1, though we could still report it explicitly.
> >
> > Fundamental base units would be things like "byte", "bytes per second"
> > and "none" (i.e., dimensionless quantities).  (Are there others?)

The energy telemetry counters implemented in this series have:

	core_energy:	Units are Joules
	activity:	Units are Farads (Dynamic Capicitance or "CDyn")

Each of these is reported by h/w as a fixed-point binary value with 18
binary places. I'm proposing reporing these as floating point decimal
value with six decimal places (since 1/2^18 ~= 0.0000038)
Loss of precision from conversion to decimal is likely far smaller than
the error bars on the estimation of these values).

> >
> > Since we cannot guess for certain what userspace wants to do with the
> > values, it feels better to let userspace do any scaling calculations
> > itself, rather than trying to prettify the interface.
> >
> > For example: scaling memory bandwidth percentages for MPAM is a
> > nuisance because the hardware uses fixed-point values scaled by a power
> > of 2, not by 100: the two scales can never match up anywhere except at
> > multiples of 25%, leading to irregular increments when rounded to an
> > integer percentage value and uncertainty about what the bandwidth_gran
> > parameter means.  Round-trip conversions between the two
> > representations become error-prone due to repeated rounding -- this
> > proved quite fiddly to get right.  Precision beyond 1% increments may
> > also be available in the hardware, but is not accessible through the
> > resctrl interface.

Sounds like my fixed-point proposal could be useful for these memory
bandwidth values.

> Google users got annoyed with these rounding errors very quickly and
> asked me to change the MBA interface to the raw, fixed-point value
> used by the MPAM register interface. (but at least shifted down, since
> the MBW_MIN/MAX fields are left-justified)
> 
> >
> > For backwards compatibility we probably shouldn't change that
> > particular interface, but if we can avoid new instances of the same
> > kind of problem then that would be a benefit: i.e., explicitly tell
> > userspace how to scale a given parameter.
> 
> MBA is not programmed by percentage on AMD, so I'm not sure why this
> is considered necessary for backwards compatibility.

Upcoming "region aware" Intel RDT features also present challenges for
backward compatibility as there are going to be separate counters and
controls for each region. Maintaining the use of "percentage" for
controls only gives feeling of familiarity while any tools that are
using the "MB:" line in the schemata file aren't going to work at all.

Sadly, we won't have a direct "MBytes/second" interface (though our
goal is to get to that some day). H/w interface values for throttling
change from legacy 0,10,20,...90 (no throttle up to max) to 255,254...1
(max bandwidth to min).

There's also the bonus feature that memory bandwidth limits can
specify min and max values so hardware can grant jobs some amount
of extra bandwidth when the system is not busy, or throttle just
the low priority jobs when approaching capacity limits.

> -Peter

-Tony