[PATCH v6 00/30] x86,fs/resctrl telemetry monitoring

Tony Luck posted 30 patches 3 months, 1 week ago
There is a newer version of this series
.../admin-guide/kernel-parameters.txt         |   2 +-
Documentation/filesystems/resctrl.rst         |  53 ++-
include/linux/resctrl.h                       |  84 +++-
include/linux/resctrl_types.h                 |  26 +-
arch/x86/include/asm/resctrl.h                |  16 -
arch/x86/kernel/cpu/resctrl/internal.h        |  31 +-
fs/resctrl/internal.h                         |  56 ++-
arch/x86/kernel/cpu/resctrl/core.c            | 333 ++++++++++----
arch/x86/kernel/cpu/resctrl/intel_aet.c       | 411 ++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c         |  78 ++--
fs/resctrl/ctrlmondata.c                      | 130 +++++-
fs/resctrl/monitor.c                          | 267 +++++++-----
fs/resctrl/rdtgroup.c                         | 253 +++++++----
arch/x86/Kconfig                              |   5 +-
arch/x86/kernel/cpu/resctrl/Makefile          |   1 +
15 files changed, 1327 insertions(+), 419 deletions(-)
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
[PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Tony Luck 3 months, 1 week ago
These patches are based v6.16-rc3 plus David Box's V2 series for
"Intel VSEC/PMT: Introduce Discovery Driver" posted here:

Link: https://lore.kernel.org/all/20250617014041.2861032-1-david.e.box@linux.intel.com/

I've pushed "v6.16-rc3 + David's patches" to:
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git davidboxv2

The total set (v6.16-rc3 + David's patches + this series) is here:

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git rdt-aet-v6

The first four patches of this series are shared with Babu's ABMC
series. Perhaps if all issues in these four patches are resolved here
these patches could move to upstream?

Note that patches 0017 onwards of this V6 series depend on David's patches
so can't go upstream until that series is merged.

Changes since v5 was posted here:

Link: https://lore.kernel.org/all/20250521225049.132551-1-tony.luck@intel.com/

Change map indexed by patch numbers in v5. Some patches have been merged,
split, dropped, or re-ordered. The v6 patch numbers are referred to
by their 4-digit git format-patch numbers in an attempt to avoid
confusion.

 --- 1 ---
Fixed extra space in commit message

Changed to consistent use of "eventid" for variables/arguments
of type "enum resctrl_event_id" in routines touched by this
series.

Added resctrl_event_id::QOS_FIRST_EVENT and used it as lower bound
when looping over all events.

Added #define for_each_mon_event() to iterate over all events.

s/Description of a monitor event/Properties of a monitor event/

 --- 2 ---
Use QOS_FIRST_EVENT as lower bounds check in resctrl_is_mon_event_enabled().


 --- 3 ---
Replaced changelog with Reinette's improved version.

 --- 4 ---
Add for_each_mbm_event_id() helper to iterate over MBM events.

Peter: Use sizeof(*hw_dom->arch_mbm_states[0]) instead of sizeof(struct arch_mbm_state)
(ditto for "struct mbm_state" instance).

Replaced kerneldoc description of rdt_mon_domain::mbm_states with Reinette's
improved version. Ditto for rdt_hw_mon_domain::arch_mbm_state.

Fixed resctrl_arch_reset_rmid_all() to use same coding pattern of
looping on enabled events instead of checking if rdt_hw_mon_domain::arch_mbm_state
has been allocated.

In get_mbm_state() use local variable name "state" (singular) to match other
code patterns.

Drop duplicate @arch_mbm_states: in kerneldoc for struct rdt_hw_mon_domain.

Drop " or combined CLOSID, RMID on Arm" from @arch_mbm_states description.

Use sizeof(*hw_dom->arch_mbm_states[0]) in resctrl_arch_reset_rmid_all().

Add new macro for_each_mbm_idx() (suggested by Fenghua). Use
it in mon_domain_free() and domain_destroy_mon_state(). Also to
avoid landmines in "cleanup:" code in arch_domain_mbm_alloc() and
domain_setup_mon_state().

 --- 5 ---
Patch dropped. No need for fake definitions of OOBMSM access routines and
structures because the real patches have been posted by David Box. Version
2 of his series here:
https://lore.kernel.org/all/20250617014041.2861032-1-david.e.box@linux.intel.com/

 --- 6 ---
Now 0005 in V6 series.
Updated the domain_header_is_valid() check in rdtgroup_mondata_show() to
explicitly test for RDT_RESOURCE_L3 since at this point in the series only
the L3 resource is possible, or valid. Similar change in subsequent patches
where routines to process only struct rdt_mon_domain make consistency checks.


 --- 7 ---
Moved later to patch 0011 when it becomes a little clearer which renames
are useful, and which may be just noise. Functions that now take a
"struct rdt_l3_mon_domain" argument are obviously for L3 only and don't
need a rename to make that clear.


 --- 8 ---
Now patch 0006. No comments received on V5 version of this patch.


 --- 9 ---
Now patch 0007.
s/goto done/goto out_unlock/

Changes to make symmetric cleanups to domain_remove_cpu_ctrl()
to match the code flow changes that were made to domain_remove_cpu_mon()
split out into new patch 0008.


 --- 10 ---
Now patch 0009.
Anil reported that "d" was used uninitialized in domain_remove_cpu_ctrl().

Fenghua reported an error path that did not unlock the rdtgroup_mutex.

Reinette reported change unrelated to commit message. Moved this to
new patch 0008 (now with change log).

Several domain_header_is_valid() checks now check for RDT_RESOURCE_L3.

Updated kerneldoc comments for struct mon_data to says that @sum is only
used for RDT_RESOURCE_L3.

Updated kerneldoc for mon_get_kn_priv() @do_sum to say it is only
meaningful for L3 domain and added a WARN_ON_ONCE() if this some other
resource tried to set do_sum.

More functions changed to pass struct rdt_domain_hdr. Specifically the
call chain from rdtgroup_mondata_show() down to resctrl_arch_rmid_read()
(via the smp_call*()) so rmid_read::d has been replaced by rmid_read::hdr.


 --- 11 ---
Now patch 0010.Completing this patch before the function renaming
(was patch 7, now 0011) makes it clearer where renames are useful.


 --- 12 ---
No comments received for this patch (patch numbers now aligned as
this is 0012 in V6).

 --- 13 ---
Updated commit comment with better text from Reinette.

s/goto done/goto out_ctx_free/

Changed polarity and name of helper function from cpu_on_wrong_domain()
to cpu_on_correct_domain() to avoid double negatives.


 --- 14 ---
Add mon_evt::is_floating_point set by resctrl file system code to limit
which events architecture code can request be displayed in floating point.

Simplified the fixed-point to floating point algorithm. Reinette is
correct that the additional "lshift" and "rshift" operations are not
required. All that is needed is to multiply the fixed point fractional
part by 10**decimal_places, add a rounding amount equivalent to a "1"
in the binary place after those supplied. Finally divide by 2**binary_places
(with a right shift).

Explained in commit comment how I chose the number of decimal places to
use for each binary places value.

N.B. Dave Martin expressed an opinion that the kernel should not do
this conversion. Instead it should enumerate the scaling factor for
each event where hardware reported a fixed point value. This patch
could be dropped and replaced with one to enumerate scaling factors
per event if others agree with Dave.


 --- 15 ---
Initialize atomic in resctrl_arch_pre_mount() using ATOMIC_INIT(0).

 --- 16 ---
No comments received on this patch.


 --- 17 ---
Changed comment in struct event_group from " Data fields used by this code."
to "Data fields for additional structures to manage this group."

Removed line continuations for DEFINE_FREE(). Though checkpatch is still
not happy. My following line is a "if" statement. Checkpatch wants this
to both 1) Line vertically with the "(" on preceeding line, and 2) Not
use <TAB> followed by some <SPACE> characters to make that happen.

Refactored the interface between get_pmt_feature() and configure_event()
per-Reinette's suggestion to avoid both functions looping through the
p->count entries in the pmt_feature_group structure.

Kconfig changes here ensure that David's INTEL_PMT_TELEMETRY code is
built-in to the kernel so it can be used by resctrl.


 --- 18 ---
Define a macro XML_MMIO_SIZE() as a way to document the hard-coded numbers
used to calculate the expected size of the mmio region.

If the MMIO size reported by intel_pmt_get_regions_by_feature() is smaller
than expected, print that size as part of the warning message.


 --- 19 ---
Rename mmio_info::count to mmio_info::num_regions.

Fix typo s/[0]/[1]/g in ascii art commit message structure diagram.

Take better suggestions for kerneldoc descriptions of mmio_info
structure and the @num_regions and @addrs fields.

Add a period for event_group::pkginfo kerneldoc description.

Fix declaration of **pkginfo in configure_events()

Add "_once" for "Duplicate telemetry" warning.

 --- 20 ---
Shorten field names pmt_event::evtid -> pmt_event::id and pmt_event:evt_idx
becomes pmt_event::idx.

Fenghua: Align each of th struct event_group on the "=".

Anil: Use a macro to initialize entries in mon_event_all[]. This could
be done in patch 1, but with only three events at that point the visual
clutter wasn't too awful.


 --- 21 ---
Split into:
0021: Add mon_evt::arch_priv void pointer that can be set by architectural
code when enabling an event, and passed through to resctrl_arch_rmid_read()

0022: Code to set the arch_priv pointer and use it find the containing
struct event_group for the parameters to read counter from MMIO space.
Note that due to changes in part 0009 resctrl_arch_rmid_read() takes
a struct rdt_domain_hdr argument.


 --- 22 ---
Now 0023: Keep "default:" as last option in switch in domain_remove_cpu_mon().

Comments about "goto mkdir" (was "goto do_mkdir" covered by change in
patch 0009.

 --- 23 ---
Moved to patch 0027.

No comments received for this patch. But one small change. Now
sets r->mon_capable = true; in intel_aet_get_events() so this is
done before the calculation of the minimum of RMIDs supported
in part 0025.


 --- 24 ---
Changed to address Reinette's point that initial implementation
would not work in the same way as other boot choices. Specifically
if a quirk disables a feature because of an erratum, the user should
be able to override from the command line and use it anyway.
This patch provides the option for user to disable a telemetry
feature from the command line. The force enable option moved to
next patch where it is used.


 --- 25 ---
Improved commit comment per-Reinette suggestion.

s/Will be adjusted/Adjusted/ in kerneldoc for event_group::num_rmids

Improved text for comment on not configuring a telemetry feature that
has fewer RMIDs than supported by IA32_PQR_ASSOC.

Second part of command line implementation is here to allow user to
override the fewer RMIDs issue and use a resource anyway.

 --- 26 ---
Back in sync with patch 0026. No comments received for this patch.

 --- 27 ---
Changed from providing a mechanism for architecture code to create
a custom "info/{resource}" file to providing a debugfs directory
for use by a monitor resource. Discussion on the name of the directory
fizzled out. I've gone with:
	/sys/kernel/debug/resctrl/info/{resource}_MON/{utsname()->machine}

 --- 28 ---
No comments on this patch. Changed to create one debugfs file for each
value from each aggregator instance.


 --- 29 ---
No comments on this patch.



Background
----------

Telemetry features are being implemented in conjunction with the
IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
counts for various events to a collector in a nearby OOBMSM device to be
accumulated with counts for each <RMID, event> pair received from other
CPUs. Cores send event counts when the RMID value changes, or after each
2ms elapsed time.

Each OOBMSM device may implement multiple event collectors with each
servicing a subset of the logical CPUs on a package.  In the initial
hardware implementation, there are two categories of events: energy
and perf.

1) Energy - Two counters
core_energy: This is an estimate of Joules consumed by each core. It is
calculated based on the types of instructions executed, not from a power
meter. This counter is useful to understand how much energy a workload
is consuming.

activity: This measures "accumulated dynamic capacitance". Users who
want to optimize energy consumption for a workload may use this rather
than core_energy because it provides consistent results independent of
any frequency or voltage changes that may occur during the runtime of
the application (e.g. entry/exit from turbo mode).

2) Performance - Seven counters
These are similar events to those available via the Linux "perf" tool,
but collected in a way with much lower overhead (no need to collect data
on every context switch).

stalls_llc_hit - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which hit in the LLC

c1_res - Counts the total C1 residency across all cores. The underlying
counter increments on 100MHz clock ticks

unhalted_core_cycles - Counts the total number of unhalted core clock
cycles

stalls_llc_miss - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which missed all the
local caches

c6_res - Counts the total C6 residency. The underlying counter increments
on crystal clock (25MHz) ticks

unhalted_ref_cycles - Counts the total number of unhalted reference clock
(TSC) cycles

uops_retired - Counts the total number of uops retired

The counters are arranged in groups in MMIO space of the OOBMSM device.
E.g. for the energy counters the layout is:

Offset: Counter
0x00	core energy for RMID 0
0x08	core activity for RMID 0
0x10	core energy for RMID 1
0x18	core activity for RMID 1
...

Enumeration
-----------

The only CPUID based enumeration for this feature is the legacy
CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
IA32_PQR_ASSOC MSR and the RMID field within it.

The OOBMSM driver discovers which features are present via
PCIe VSEC capabilities. Each feature is tagged with a unique
identifier. These identifiers indicate which XML description file from
https://github.com/intel/Intel-PMT describes which event counters are
available and their layout within the MMIO BAR space of the OOBMSM device.

Resctrl User Interface
----------------------

Because there may be multiple OOBMSM collection agents per processor
package, resctrl accumulates event counts from all agents on a package
and presents a single value to users. This will provide a consistent
user interface on future platforms that vary the number of collectors,
or the mappings from logical CPUs to collectors.

Users will continue to see the legacy monitoring files in the "L3"
directories and the telemetry files in the new "PERF_PKG" directories
(with each file providing the aggregated value from all OOBMSM collectors
on that package).

$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_PERF_PKG_00
│   ├── activity
│   ├── c1_res
│   ├── c6_res
│   ├── core_energy
│   ├── stalls_llc_hit
│   ├── stalls_llc_miss
│   ├── unhalted_core_cycles
│   ├── unhalted_ref_cycles
│   └── uops_retired
└── mon_PERF_PKG_01
    ├── activity
    ├── c1_res
    ├── c6_res
    ├── core_energy
    ├── stalls_llc_hit
    ├── stalls_llc_miss
    ├── unhalted_core_cycles
    ├── unhalted_ref_cycles
    └── uops_retired

Resctrl Implementation
----------------------

The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
that returns an array of structures describing the per-RMID groups it
found from the VSEC enumeration. Linux looks at the unique identifiers
for each group and enables resctrl for all groups with known unique
identifiers.

The memory map for the counters for each <RMID, event> pair is described
by the XML file. This is too unwieldy to use in the Linux kernel, so a
simplified representation is built into the resctrl code. Note that the
counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
and IA32_QM_CTR MSRs. This means there is no need for cross-processor
calls to read counters from a CPU in a specific domain. The counters
can be read from any CPU.

High level description of code changes:

1) New scope RESCTRL_PACKAGE
2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
        switch (r->rid) {
        case RDT_RESOURCE_L3:
                helper for L3
                break;
        case RDT_RESOURCE_PERF_PKG:
                helper for PKG
                break;
        }
4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.

With only one platform providing this feature, it's tricky to tell
exactly where it is going to go. I've made the event definitions
platform specific (based on the unique ID from the VSEC enumeration). It
seems possible/likely that the list of events may change from generation
to generation.

I've picked names for events based on the descriptions in the XML file.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (30):
  x86,fs/resctrl: Consolidate monitor event descriptions
  x86,fs/resctrl: Replace architecture event enabled checks
  x86/resctrl: Remove 'rdt_mon_features' global variable
  x86,fs/resctrl: Prepare for more monitor events
  x86,fs/resctrl: Improve domain type checking
  x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
  x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain
    types
  x86/resctrl: Clean up domain_remove_cpu_ctrl()
  x86,fs/resctrl: Use struct rdt_domain_hdr instead of struct
    rdt_mon_domain
  x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
  x86,fs/resctrl: Rename some L3 specific functions
  fs/resctrl: Make event details accessible to functions when reading
    events
  x86,fs/resctrl: Handle events that can be read from any CPU
  x86,fs/resctrl: Support binary fixed point event counters
  x86,fs/resctrl: Add an architectural hook called for each mount
  x86,fs/resctrl: Add and initialize rdt_resource for package scope core
    monitor
  x86/resctrl: Discover hardware telemetry events
  x86/resctrl: Count valid telemetry aggregators per package
  x86/resctrl: Complete telemetry event enumeration
  x86,fs/resctrl: Fill in details of Clearwater Forest events
  x86,fs/resctrl: Add architectural event pointer
  x86/resctrl: Read core telemetry events
  x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
  x86/resctrl: Add energy/perf choices to rdt boot option
  x86/resctrl: Handle number of RMIDs supported by telemetry resources
  x86,fs/resctrl: Move RMID initialization to first mount
  x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
  fs/resctrl: Provide interface to create a debugfs info directory
  x86/resctrl: Add debug info/PERF_PKG_MON/status files
  x86,fs/resctrl: Update Documentation for package events

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/filesystems/resctrl.rst         |  53 ++-
 include/linux/resctrl.h                       |  84 +++-
 include/linux/resctrl_types.h                 |  26 +-
 arch/x86/include/asm/resctrl.h                |  16 -
 arch/x86/kernel/cpu/resctrl/internal.h        |  31 +-
 fs/resctrl/internal.h                         |  56 ++-
 arch/x86/kernel/cpu/resctrl/core.c            | 333 ++++++++++----
 arch/x86/kernel/cpu/resctrl/intel_aet.c       | 411 ++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c         |  78 ++--
 fs/resctrl/ctrlmondata.c                      | 130 +++++-
 fs/resctrl/monitor.c                          | 267 +++++++-----
 fs/resctrl/rdtgroup.c                         | 253 +++++++----
 arch/x86/Kconfig                              |   5 +-
 arch/x86/kernel/cpu/resctrl/Makefile          |   1 +
 15 files changed, 1327 insertions(+), 419 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c


base-tree: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
base-branch: davidboxv2
base-commit: 4742bf1fab91403ca48efc45f7f7fd68a156a955
-- 
2.49.0

Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Luck, Tony 3 months, 1 week ago
Bother. Just got e-mail after posting v6 from lkp. Apparently
I applied the fixes to avoid "'d' used before set" in
domain_remove_cpu_ctrl() and domain_remove_cpu_mon() to some
other branch than the one that made it to my final version.

Please imagine the hunks below merged into patches 7 & 8.

-Tony

---

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3ec8fbd2f778..39cee572a121 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -651,8 +651,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
 		return;
 
-	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
-	if (!cpumask_empty(&d->hdr.cpu_mask))
+	cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+	if (!cpumask_empty(&hdr->cpu_mask))
 		return;
 
 	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -696,8 +696,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
 		return;
 
-	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
-	if (!cpumask_empty(&d->hdr.cpu_mask))
+	cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+	if (!cpumask_empty(&hdr->cpu_mask))
 		return;
 
 	switch (r->rid) {
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Luck, Tony 3 months, 1 week ago
On Thu, Jun 26, 2025 at 05:26:46PM -0700, Luck, Tony wrote:
> Bother. Just got e-mail after posting v6 from lkp. Apparently
> I applied the fixes to avoid "'d' used before set" in
> domain_remove_cpu_ctrl() and domain_remove_cpu_mon() to some
> other branch than the one that made it to my final version.
> 
> Please imagine the hunks below merged into patches 7 & 8.

I merged these changes back into the series and pushed the updated
version to git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
branch rdt-aet-v6.
> 
> -Tony
> 
> ---
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 3ec8fbd2f778..39cee572a121 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -651,8 +651,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
>  	if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
>  		return;
>  
> -	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> -	if (!cpumask_empty(&d->hdr.cpu_mask))
> +	cpumask_clear_cpu(cpu, &hdr->cpu_mask);
> +	if (!cpumask_empty(&hdr->cpu_mask))
>  		return;
>  
>  	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> @@ -696,8 +696,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
>  	if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
>  		return;
>  
> -	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> -	if (!cpumask_empty(&d->hdr.cpu_mask))
> +	cpumask_clear_cpu(cpu, &hdr->cpu_mask);
> +	if (!cpumask_empty(&hdr->cpu_mask))
>  		return;
>  
>  	switch (r->rid) {
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Reinette Chatre 3 months ago
Hi Tony and Dave,

On 6/26/25 9:49 AM, Tony Luck wrote:
>  --- 14 ---
> Add mon_evt::is_floating_point set by resctrl file system code to limit
> which events architecture code can request be displayed in floating point.
> 
> Simplified the fixed-point to floating point algorithm. Reinette is
> correct that the additional "lshift" and "rshift" operations are not
> required. All that is needed is to multiply the fixed point fractional
> part by 10**decimal_places, add a rounding amount equivalent to a "1"
> in the binary place after those supplied. Finally divide by 2**binary_places
> (with a right shift).
> 
> Explained in commit comment how I chose the number of decimal places to
> use for each binary places value.
> 
> N.B. Dave Martin expressed an opinion that the kernel should not do
> this conversion. Instead it should enumerate the scaling factor for
> each event where hardware reported a fixed point value. This patch
> could be dropped and replaced with one to enumerate scaling factors
> per event if others agree with Dave.

Could resctrl accommodate both usages? For example, it does not
look too invasive to add a second file <mon_evt::name>.raw for the
mon_evt::is_floating_point events that can output something like Dave
suggested in [1]:

.raw file format could be:
	#format:<output that depends on format>
	#fixed-point:<value>/<scaling factor>

Example output:
	fixed-point:0x60000/0x40000

Reinette

[1] https://lore.kernel.org/lkml/aEhMWBemtev%2Ff3yf@e133380.arm.com/
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Luck, Tony 3 months ago
On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote:
> Hi Tony and Dave,
> 
> On 6/26/25 9:49 AM, Tony Luck wrote:
> >  --- 14 ---
> > Add mon_evt::is_floating_point set by resctrl file system code to limit
> > which events architecture code can request be displayed in floating point.
> > 
> > Simplified the fixed-point to floating point algorithm. Reinette is
> > correct that the additional "lshift" and "rshift" operations are not
> > required. All that is needed is to multiply the fixed point fractional
> > part by 10**decimal_places, add a rounding amount equivalent to a "1"
> > in the binary place after those supplied. Finally divide by 2**binary_places
> > (with a right shift).
> > 
> > Explained in commit comment how I chose the number of decimal places to
> > use for each binary places value.
> > 
> > N.B. Dave Martin expressed an opinion that the kernel should not do
> > this conversion. Instead it should enumerate the scaling factor for
> > each event where hardware reported a fixed point value. This patch
> > could be dropped and replaced with one to enumerate scaling factors
> > per event if others agree with Dave.
> 
> Could resctrl accommodate both usages? For example, it does not
> look too invasive to add a second file <mon_evt::name>.raw for the
> mon_evt::is_floating_point events that can output something like Dave
> suggested in [1]:
> 
> .raw file format could be:
> 	#format:<output that depends on format>
> 	#fixed-point:<value>/<scaling factor>
> 
> Example output:
> 	fixed-point:0x60000/0x40000

Dave: Is that what you want in the ".raw" file? An alternative would be
to put the format information for non-integer events into an
"info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?)
and just put the raw value into the ".raw" file under mon_data.

> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/aEhMWBemtev%2Ff3yf@e133380.arm.com/

-Tony
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Luck, Tony 3 months ago
On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote:
> On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote:
> > Hi Tony and Dave,
> > 
> > On 6/26/25 9:49 AM, Tony Luck wrote:
> > >  --- 14 ---
> > > Add mon_evt::is_floating_point set by resctrl file system code to limit
> > > which events architecture code can request be displayed in floating point.
> > > 
> > > Simplified the fixed-point to floating point algorithm. Reinette is
> > > correct that the additional "lshift" and "rshift" operations are not
> > > required. All that is needed is to multiply the fixed point fractional
> > > part by 10**decimal_places, add a rounding amount equivalent to a "1"
> > > in the binary place after those supplied. Finally divide by 2**binary_places
> > > (with a right shift).
> > > 
> > > Explained in commit comment how I chose the number of decimal places to
> > > use for each binary places value.
> > > 
> > > N.B. Dave Martin expressed an opinion that the kernel should not do
> > > this conversion. Instead it should enumerate the scaling factor for
> > > each event where hardware reported a fixed point value. This patch
> > > could be dropped and replaced with one to enumerate scaling factors
> > > per event if others agree with Dave.
> > 
> > Could resctrl accommodate both usages? For example, it does not
> > look too invasive to add a second file <mon_evt::name>.raw for the
> > mon_evt::is_floating_point events that can output something like Dave
> > suggested in [1]:
> > 
> > .raw file format could be:
> > 	#format:<output that depends on format>
> > 	#fixed-point:<value>/<scaling factor>
> > 
> > Example output:
> > 	fixed-point:0x60000/0x40000
> 
> Dave: Is that what you want in the ".raw" file? An alternative would be
> to put the format information for non-integer events into an
> "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?)
> and just put the raw value into the ".raw" file under mon_data.

Note that I thought it easier for users to keep the raw file to just
showing a value, rather than including the formatting details in
Reinette's proposal.

Patch to implement my alternative suggestion below. To the user things
look like this:

$ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01
$ cat core_energy
0.02203
$ cat core_energy.raw
5775
$ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale
core_energy 262144
activity 262144
$ bc -ql
5775 / 262144
.02202987670898437500

If this seems useful I can write up a commit message and include
as its own patch in v7. Suggestions for better names?

-Tony

---

diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 4704ea7228ca..5ac4e3c98f23 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -90,6 +90,8 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
  *                   the event file belongs. When @sum is one this
  *                   is the id of the L3 cache that all domains to be
  *                   summed share.
+ * @raw:             Set for ".raw" files that directly show hardware
+ *                   provided counts with no interpretation.
  *
  * Pointed to by the kernfs kn->priv field of monitoring event files.
  * Readers and writers must hold rdtgroup_mutex.
@@ -100,6 +102,7 @@ struct mon_data {
 	struct mon_evt		*evt;
 	int			domid;
 	bool			sum;
+	bool			raw;
 };
 
 /**
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 29de0e380ccc..78e7af296d5a 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -753,7 +753,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		seq_puts(m, "Error\n");
 	else if (rr.err == -EINVAL)
 		seq_puts(m, "Unavailable\n");
-	else if (evt->binary_bits == 0)
+	else if (md->raw || evt->binary_bits == 0)
 		seq_printf(m, "%llu\n", rr.val);
 	else
 		print_event_value(m, evt->binary_bits, rr.val);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 511362a67532..97786831722a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1158,6 +1158,21 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int rdt_mon_features_raw_scale_show(struct kernfs_open_file *of,
+					   struct seq_file *seq, void *v)
+{
+	struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
+	struct mon_evt *mevt;
+
+	for_each_mon_event(mevt) {
+		if (mevt->rid != r->rid || !mevt->enabled || !mevt->binary_bits)
+			continue;
+		seq_printf(seq, "%s %u\n", mevt->name, 1 << mevt->binary_bits);
+	}
+
+	return 0;
+}
+
 static int rdt_bw_gran_show(struct kernfs_open_file *of,
 			    struct seq_file *seq, void *v)
 {
@@ -1823,6 +1838,13 @@ static struct rftype res_common_files[] = {
 		.seq_show	= rdt_mon_features_show,
 		.fflags		= RFTYPE_MON_INFO,
 	},
+	{
+		.name		= "mon_features_raw_scale",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_mon_features_raw_scale_show,
+		.fflags		= RFTYPE_MON_INFO,
+	},
 	{
 		.name		= "num_rmids",
 		.mode		= 0444,
@@ -2905,7 +2927,7 @@ static void rmdir_all_sub(void)
  */
 static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
 					struct mon_evt *mevt,
-					bool do_sum)
+					bool do_sum, bool rawfile)
 {
 	struct mon_data *priv;
 
@@ -2916,7 +2938,8 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
 
 	list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
 		if (priv->rid == rid && priv->domid == domid &&
-		    priv->sum == do_sum && priv->evt == mevt)
+		    priv->sum == do_sum && priv->evt == mevt &&
+		    priv->raw == rawfile)
 			return priv;
 	}
 
@@ -2928,6 +2951,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
 	priv->domid = domid;
 	priv->sum = do_sum;
 	priv->evt = mevt;
+	priv->raw = rawfile;
 	list_add_tail(&priv->list, &mon_data_kn_priv_list);
 
 	return priv;
@@ -3078,12 +3102,13 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
 	struct rmid_read rr = {0};
 	struct mon_data *priv;
 	struct mon_evt *mevt;
+	char rawname[64];
 	int ret;
 
 	for_each_mon_event(mevt) {
 		if (mevt->rid != r->rid || !mevt->enabled)
 			continue;
-		priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
+		priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum, false);
 		if (WARN_ON_ONCE(!priv))
 			return -EINVAL;
 
@@ -3093,6 +3118,18 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
 
 		if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
 			mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true);
+
+		if (!mevt->binary_bits)
+			continue;
+
+		sprintf(rawname, "%s.raw", mevt->name);
+		priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum, true);
+		if (WARN_ON_ONCE(!priv))
+			return -EINVAL;
+
+		ret = mon_addfile(kn, rawname, priv);
+		if (ret)
+			return ret;
 	}
 
 	return 0;
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Reinette Chatre 3 months ago
Hi Tony,

On 7/8/25 12:08 PM, Luck, Tony wrote:
> On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote:
>> On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote:
>>> Hi Tony and Dave,
>>>
>>> On 6/26/25 9:49 AM, Tony Luck wrote:
>>>>  --- 14 ---
>>>> Add mon_evt::is_floating_point set by resctrl file system code to limit
>>>> which events architecture code can request be displayed in floating point.
>>>>
>>>> Simplified the fixed-point to floating point algorithm. Reinette is
>>>> correct that the additional "lshift" and "rshift" operations are not
>>>> required. All that is needed is to multiply the fixed point fractional
>>>> part by 10**decimal_places, add a rounding amount equivalent to a "1"
>>>> in the binary place after those supplied. Finally divide by 2**binary_places
>>>> (with a right shift).
>>>>
>>>> Explained in commit comment how I chose the number of decimal places to
>>>> use for each binary places value.
>>>>
>>>> N.B. Dave Martin expressed an opinion that the kernel should not do
>>>> this conversion. Instead it should enumerate the scaling factor for
>>>> each event where hardware reported a fixed point value. This patch
>>>> could be dropped and replaced with one to enumerate scaling factors
>>>> per event if others agree with Dave.
>>>
>>> Could resctrl accommodate both usages? For example, it does not
>>> look too invasive to add a second file <mon_evt::name>.raw for the
>>> mon_evt::is_floating_point events that can output something like Dave
>>> suggested in [1]:
>>>
>>> .raw file format could be:
>>> 	#format:<output that depends on format>
>>> 	#fixed-point:<value>/<scaling factor>
>>>
>>> Example output:
>>> 	fixed-point:0x60000/0x40000
>>
>> Dave: Is that what you want in the ".raw" file? An alternative would be
>> to put the format information for non-integer events into an
>> "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?)
>> and just put the raw value into the ".raw" file under mon_data.
> 
> Note that I thought it easier for users to keep the raw file to just
> showing a value, rather than including the formatting details in
> Reinette's proposal.

Could you please elaborate what makes this easier? It is not obvious to me
how it is easier for user to open, parse, and close two files rather than one.
(more below)
> 
> Patch to implement my alternative suggestion below. To the user things
> look like this:
> 
> $ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01
> $ cat core_energy
> 0.02203
> $ cat core_energy.raw
> 5775
> $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale
> core_energy 262144
> activity 262144
> $ bc -ql
> 5775 / 262144
> .02202987670898437500
> 
> If this seems useful I can write up a commit message and include
> as its own patch in v7. Suggestions for better names?
> 

I expect users to regularly interact with the monitoring files. For example,
"read the core_energy of group x every second". An API like above would require
a contract that the scale value will never change from resctrl mount to
resctrl unmount. I understand that this implementation supports exactly this by
allowing an architecture to only enable an event once, but do you think this is
something that will always be the case? If not then an interface like above will
require user space to open, parse, close two files instead of one on a frequent basis.
This is not ideal if user space wants to read monitoring data of multiple
groups frequently.

I would also like to keep extensibility in mind. We now know that
unsigned decimal and fixed-point binary needs to be supported. I think any
new interface used to communicate formatting information to user space should be done
in a way that can be extended for a new format. That is, for example, why
I used the actual term "fixed-point" in the example. Something like this avoids
needing assumptions that a raw value always implies fixed-point format.

Reinette
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Luck, Tony 3 months ago
On Tue, Jul 08, 2025 at 01:49:26PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 7/8/25 12:08 PM, Luck, Tony wrote:
> > On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote:
> >> On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote:
> >>> Hi Tony and Dave,
> >>>
> >>> On 6/26/25 9:49 AM, Tony Luck wrote:
> >>>>  --- 14 ---
> >>>> Add mon_evt::is_floating_point set by resctrl file system code to limit
> >>>> which events architecture code can request be displayed in floating point.
> >>>>
> >>>> Simplified the fixed-point to floating point algorithm. Reinette is
> >>>> correct that the additional "lshift" and "rshift" operations are not
> >>>> required. All that is needed is to multiply the fixed point fractional
> >>>> part by 10**decimal_places, add a rounding amount equivalent to a "1"
> >>>> in the binary place after those supplied. Finally divide by 2**binary_places
> >>>> (with a right shift).
> >>>>
> >>>> Explained in commit comment how I chose the number of decimal places to
> >>>> use for each binary places value.
> >>>>
> >>>> N.B. Dave Martin expressed an opinion that the kernel should not do
> >>>> this conversion. Instead it should enumerate the scaling factor for
> >>>> each event where hardware reported a fixed point value. This patch
> >>>> could be dropped and replaced with one to enumerate scaling factors
> >>>> per event if others agree with Dave.
> >>>
> >>> Could resctrl accommodate both usages? For example, it does not
> >>> look too invasive to add a second file <mon_evt::name>.raw for the
> >>> mon_evt::is_floating_point events that can output something like Dave
> >>> suggested in [1]:
> >>>
> >>> .raw file format could be:
> >>> 	#format:<output that depends on format>
> >>> 	#fixed-point:<value>/<scaling factor>
> >>>
> >>> Example output:
> >>> 	fixed-point:0x60000/0x40000
> >>
> >> Dave: Is that what you want in the ".raw" file? An alternative would be
> >> to put the format information for non-integer events into an
> >> "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?)
> >> and just put the raw value into the ".raw" file under mon_data.
> > 
> > Note that I thought it easier for users to keep the raw file to just
> > showing a value, rather than including the formatting details in
> > Reinette's proposal.
> 
> Could you please elaborate what makes this easier? It is not obvious to me
> how it is easier for user to open, parse, and close two files rather than one.
> (more below)

I had only considered the case where the format does not change while
the resctrl file system is mounted. So users would read the "info" file
to get the scaling factor once, and then read the event files with a
parser that only has to convert a numerical string.

> > Patch to implement my alternative suggestion below. To the user things
> > look like this:
> > 
> > $ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01
> > $ cat core_energy
> > 0.02203
> > $ cat core_energy.raw
> > 5775
> > $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale
> > core_energy 262144
> > activity 262144
> > $ bc -ql
> > 5775 / 262144
> > .02202987670898437500
> > 
> > If this seems useful I can write up a commit message and include
> > as its own patch in v7. Suggestions for better names?
> > 
> 
> I expect users to regularly interact with the monitoring files. For example,
> "read the core_energy of group x every second". An API like above would require
> a contract that the scale value will never change from resctrl mount to
> resctrl unmount. I understand that this implementation supports exactly this by
> allowing an architecture to only enable an event once, but do you think this is
> something that will always be the case? If not then an interface like above will
> require user space to open, parse, close two files instead of one on a frequent basis.
> This is not ideal if user space wants to read monitoring data of multiple
> groups frequently.

While hardware designers do some outlandish things. Changing the format
of an event counter on the fly seems beyond the range of possibility.
How would that even work? A driver would have to rerun enumeration of
the feature every time it read a counter. Or hardware would have to
supply some interrupt to tell s/w that the format changed.

I think it reasonable that resctrl be able to guarantee that the format
described in the info file is valid for the life of the mount.

> I would also like to keep extensibility in mind. We now know that
> unsigned decimal and fixed-point binary needs to be supported. I think any
> new interface used to communicate formatting information to user space should be done
> in a way that can be extended for a new format. That is, for example, why
> I used the actual term "fixed-point" in the example. Something like this avoids
> needing assumptions that a raw value always implies fixed-point format.

This is fair. But could be covered in the "info" file with some more
descriptive way to describe the format. Perhaps:

$ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale
core_energy fixed-point scale=262144
activity fixed-point scale=262144

To allow for other types in the future.

> 
> Reinette

-Tony
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Reinette Chatre 3 months ago
Hi Tony,

On 7/8/25 3:43 PM, Luck, Tony wrote:
> On Tue, Jul 08, 2025 at 01:49:26PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 7/8/25 12:08 PM, Luck, Tony wrote:
>>> On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote:
>>>> On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote:
>>>>> Hi Tony and Dave,
>>>>>
>>>>> On 6/26/25 9:49 AM, Tony Luck wrote:
>>>>>>  --- 14 ---
>>>>>> Add mon_evt::is_floating_point set by resctrl file system code to limit
>>>>>> which events architecture code can request be displayed in floating point.
>>>>>>
>>>>>> Simplified the fixed-point to floating point algorithm. Reinette is
>>>>>> correct that the additional "lshift" and "rshift" operations are not
>>>>>> required. All that is needed is to multiply the fixed point fractional
>>>>>> part by 10**decimal_places, add a rounding amount equivalent to a "1"
>>>>>> in the binary place after those supplied. Finally divide by 2**binary_places
>>>>>> (with a right shift).
>>>>>>
>>>>>> Explained in commit comment how I chose the number of decimal places to
>>>>>> use for each binary places value.
>>>>>>
>>>>>> N.B. Dave Martin expressed an opinion that the kernel should not do
>>>>>> this conversion. Instead it should enumerate the scaling factor for
>>>>>> each event where hardware reported a fixed point value. This patch
>>>>>> could be dropped and replaced with one to enumerate scaling factors
>>>>>> per event if others agree with Dave.
>>>>>
>>>>> Could resctrl accommodate both usages? For example, it does not
>>>>> look too invasive to add a second file <mon_evt::name>.raw for the
>>>>> mon_evt::is_floating_point events that can output something like Dave
>>>>> suggested in [1]:
>>>>>
>>>>> .raw file format could be:
>>>>> 	#format:<output that depends on format>
>>>>> 	#fixed-point:<value>/<scaling factor>
>>>>>
>>>>> Example output:
>>>>> 	fixed-point:0x60000/0x40000
>>>>
>>>> Dave: Is that what you want in the ".raw" file? An alternative would be
>>>> to put the format information for non-integer events into an
>>>> "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?)
>>>> and just put the raw value into the ".raw" file under mon_data.
>>>
>>> Note that I thought it easier for users to keep the raw file to just
>>> showing a value, rather than including the formatting details in
>>> Reinette's proposal.
>>
>> Could you please elaborate what makes this easier? It is not obvious to me
>> how it is easier for user to open, parse, and close two files rather than one.
>> (more below)
> 
> I had only considered the case where the format does not change while
> the resctrl file system is mounted. So users would read the "info" file
> to get the scaling factor once, and then read the event files with a
> parser that only has to convert a numerical string.
> 
>>> Patch to implement my alternative suggestion below. To the user things
>>> look like this:
>>>
>>> $ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01
>>> $ cat core_energy
>>> 0.02203
>>> $ cat core_energy.raw
>>> 5775
>>> $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale
>>> core_energy 262144
>>> activity 262144
>>> $ bc -ql
>>> 5775 / 262144
>>> .02202987670898437500
>>>
>>> If this seems useful I can write up a commit message and include
>>> as its own patch in v7. Suggestions for better names?
>>>
>>
>> I expect users to regularly interact with the monitoring files. For example,
>> "read the core_energy of group x every second". An API like above would require
>> a contract that the scale value will never change from resctrl mount to
>> resctrl unmount. I understand that this implementation supports exactly this by
>> allowing an architecture to only enable an event once, but do you think this is
>> something that will always be the case? If not then an interface like above will
>> require user space to open, parse, close two files instead of one on a frequent basis.
>> This is not ideal if user space wants to read monitoring data of multiple
>> groups frequently.
> 
> While hardware designers do some outlandish things. Changing the format
> of an event counter on the fly seems beyond the range of possibility.
> How would that even work? A driver would have to rerun enumeration of
> the feature every time it read a counter. Or hardware would have to
> supply some interrupt to tell s/w that the format changed.

There is also the new direction of resctrl dynamically enabling/disabling
hardware capabilities to consider. Here it could be reasonable, since this
would be triggered by user space, that a note of "doing this may change the
format" would be sufficient.

Something else to consider is the possibility of hardware using different scales
in different domains if the packages are not "uniform". 

> I think it reasonable that resctrl be able to guarantee that the format
> described in the info file is valid for the life of the mount.

I'd really like to think that it is reasonable also.

> 
>> I would also like to keep extensibility in mind. We now know that
>> unsigned decimal and fixed-point binary needs to be supported. I think any
>> new interface used to communicate formatting information to user space should be done
>> in a way that can be extended for a new format. That is, for example, why
>> I used the actual term "fixed-point" in the example. Something like this avoids
>> needing assumptions that a raw value always implies fixed-point format.
> 
> This is fair. But could be covered in the "info" file with some more
> descriptive way to describe the format. Perhaps:
> 
> $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale
> core_energy fixed-point scale=262144
> activity fixed-point scale=262144
> 
> To allow for other types in the future.

Note that the filename still has "scale" in its name making it specific to
fixed-point. 

It may be expected that every entry in mon_features has an entry in
mon_features_raw_scale (name TBD). This means the existing possible "mon_features"
need to be accommodated (except the _config ones). This may also be an
opportunity to introduce the unit of measurement. For example,

 $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale
 core_energy fixed-point scale=262144 unit=joules
 activity fixed-point scale=262144 unit=farads
 ...

Reinette
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Reinette Chatre 3 months, 1 week ago
Tony,

On 6/26/25 9:49 AM, Tony Luck wrote:
> Background
> ----------
> 
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.

To start a review of this jumbo series and find that the *first* [1]
(straight forward) request from previous review has not been addressed is
demoralizing. I was hoping that the previous version's discussions would result
in review feedback either addressed or discussed (never ignored). I
cannot imagine how requesting OOBMSM to be expanded can be invalid though.

Reinette

[1] https://lore.kernel.org/lkml/b8ddce03-65c0-4420-b30d-e43c54943667@intel.com/
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Luck, Tony 3 months, 1 week ago
On Mon, Jun 30, 2025 at 10:51:50AM -0700, Reinette Chatre wrote:
> 
> Tony,
> 
> On 6/26/25 9:49 AM, Tony Luck wrote:
> > Background
> > ----------
> > 
> > Telemetry features are being implemented in conjunction with the
> > IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> > counts for various events to a collector in a nearby OOBMSM device to be
> > accumulated with counts for each <RMID, event> pair received from other
> > CPUs. Cores send event counts when the RMID value changes, or after each
> > 2ms elapsed time.
> 
> To start a review of this jumbo series and find that the *first* [1]
> (straight forward) request from previous review has not been addressed is
> demoralizing. I was hoping that the previous version's discussions would result
> in review feedback either addressed or discussed (never ignored). I
> cannot imagine how requesting OOBMSM to be expanded can be invalid though.
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/b8ddce03-65c0-4420-b30d-e43c54943667@intel.com/

My profound apologies for blowing it (again). I went through the comments
to patches multiple times to try and catch all your comments. But somehow
skipped the cover letter :-( .

Here's a re-write to address comments, but also to try to provide
a better story line starting with how the logical processors capture
the event data, following on with aggregator processing, etc.

-Tony

---

On Intel systems that support per-RMID telemetry monitoring each logical
processor keeps a local count for various events. When the IA32_PQR_ASSOC.RMID
value for the logical processor changes (or when a two millisecond counter
expires) these event counts are transmitted to an event aggregator on
the same package as the processor together with the current RMID value. The
event counters are reset to zero to begin counting again.

Each aggregator takes the incoming event counts and adds them to
cumulative counts for each event for each RMID. Note that there can be
multiple aggregators on each package with no architectural association
between logical processors and an aggregator.

All of these aggregated counters can be read by an operating system from
the MMIO space of the Out Of Band Management Service Module (OOBMSM)
device(s) on a system. Any counter can be read from any logical processor.

Intel publishes details for each processor generation showing which
events are counted by each logical processor and the offsets for each
accumulated counter value within the MMIO space in XML files here:
https://github.com/intel/Intel-PMT.

For example there are two energy related telemetry events for the Clearwater
Forest family of processors and the MMIO space looks like this:

Offset	RMID	Event
------	----	-----
0x0000	0	core_energy
0x0008	0	activity
0x0010	1	core_energy
0x0018	1	activity
...
0x23F0	575	core_energy
0x23F8	575	activity

In addition the XML file provides the units (Joules for core_energy,
Farads for activity) and the type of data (fixed-point binary with
bit 63 used as to indicate the data is valid, and the low 18 bits as a
binary fraction).

Finally, each XML file provides a 32-bit unique id (or guid) that is
used as an index to find the correct XML description file for each
telemetry implementation.

The INTEL_PMT_DISCOVERY driver provides intel_pmt_get_regions_by_feature()
to enumerate the aggregator instances on a platform. It provides:
1) guid  - so resctrl can determine which events are supported
2) mmio base address of counters
3) package id

Resctrl accumulates counts from all aggregators on a package in order
to provide a consistent user interface across processor generations.

Directory structure for the telemetry events looks like this:

$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
mon_data
├── mon_PERF_PKG_00
│   ├── activity
│   └── core_energy
└── mon_PERF_PKG_01
    ├── activity
    └── core_energy

Reading the "core_energy" file from some resctrl mon_data directory shows
the cumulative energy (in Joules) used by all tasks that ran with the RMID
associated with that directory on a given package. Note that "core_energy"
reports only energy consumed by CPU cores (data processing units,
L1/L2 caches, etc.). It does not include energy used in the "uncore"
(L3 cache, on package devices, etc.), or used by memory or I/O devices.
Re: [PATCH v6 00/30] x86,fs/resctrl telemetry monitoring
Posted by Reinette Chatre 3 months ago
Hi Tony,

On 6/30/25 3:46 PM, Luck, Tony wrote:
> On Mon, Jun 30, 2025 at 10:51:50AM -0700, Reinette Chatre wrote:
>>
>> Tony,
>>
>> On 6/26/25 9:49 AM, Tony Luck wrote:
>>> Background
>>> ----------
>>>
>>> Telemetry features are being implemented in conjunction with the
>>> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
>>> counts for various events to a collector in a nearby OOBMSM device to be
>>> accumulated with counts for each <RMID, event> pair received from other
>>> CPUs. Cores send event counts when the RMID value changes, or after each
>>> 2ms elapsed time.
>>
>> To start a review of this jumbo series and find that the *first* [1]
>> (straight forward) request from previous review has not been addressed is
>> demoralizing. I was hoping that the previous version's discussions would result
>> in review feedback either addressed or discussed (never ignored). I
>> cannot imagine how requesting OOBMSM to be expanded can be invalid though.
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/b8ddce03-65c0-4420-b30d-e43c54943667@intel.com/
> 
> My profound apologies for blowing it (again). I went through the comments
> to patches multiple times to try and catch all your comments. But somehow
> skipped the cover letter :-( .
> 
> Here's a re-write to address comments, but also to try to provide
> a better story line starting with how the logical processors capture
> the event data, following on with aggregator processing, etc.
> 
> -Tony
> 
> ---
> 
> On Intel systems that support per-RMID telemetry monitoring each logical
> processor keeps a local count for various events. When the IA32_PQR_ASSOC.RMID
> value for the logical processor changes (or when a two millisecond counter
> expires) these event counts are transmitted to an event aggregator on
> the same package as the processor together with the current RMID value. The
> event counters are reset to zero to begin counting again.
> 
> Each aggregator takes the incoming event counts and adds them to
> cumulative counts for each event for each RMID. Note that there can be
> multiple aggregators on each package with no architectural association
> between logical processors and an aggregator.
> 
> All of these aggregated counters can be read by an operating system from
> the MMIO space of the Out Of Band Management Service Module (OOBMSM)
> device(s) on a system. Any counter can be read from any logical processor.
> 
> Intel publishes details for each processor generation showing which
> events are counted by each logical processor and the offsets for each
> accumulated counter value within the MMIO space in XML files here:
> https://github.com/intel/Intel-PMT.
> 
> For example there are two energy related telemetry events for the Clearwater
> Forest family of processors and the MMIO space looks like this:
> 
> Offset	RMID	Event
> ------	----	-----
> 0x0000	0	core_energy
> 0x0008	0	activity
> 0x0010	1	core_energy
> 0x0018	1	activity
> ...
> 0x23F0	575	core_energy
> 0x23F8	575	activity
> 
> In addition the XML file provides the units (Joules for core_energy,
> Farads for activity) and the type of data (fixed-point binary with
> bit 63 used as to indicate the data is valid, and the low 18 bits as a

"bit 63 used as to indicate" -> "bit 63 used to indicate"?

> binary fraction).
> 
> Finally, each XML file provides a 32-bit unique id (or guid) that is
> used as an index to find the correct XML description file for each
> telemetry implementation.
> 
> The INTEL_PMT_DISCOVERY driver provides intel_pmt_get_regions_by_feature()
> to enumerate the aggregator instances on a platform. It provides:

I think it will be helpful to prime the connection between "aggregator"
and "telemetery region" here. For example,

"to enumerate the aggregator instances on a platform" -> "to enumerate
the aggregator instances (also referred to as "telemetry regions" in this series)
on a platform"

> 1) guid  - so resctrl can determine which events are supported
> 2) mmio base address of counters

mmio -> MMIO

> 3) package id
> 
> Resctrl accumulates counts from all aggregators on a package in order
> to provide a consistent user interface across processor generations.
> 
> Directory structure for the telemetry events looks like this:
> 
> $ tree /sys/fs/resctrl/mon_data/
> /sys/fs/resctrl/mon_data/
> mon_data
> ├── mon_PERF_PKG_00
> │   ├── activity
> │   └── core_energy
> └── mon_PERF_PKG_01
>     ├── activity
>     └── core_energy
> 
> Reading the "core_energy" file from some resctrl mon_data directory shows
> the cumulative energy (in Joules) used by all tasks that ran with the RMID
> associated with that directory on a given package. Note that "core_energy"
> reports only energy consumed by CPU cores (data processing units,
> L1/L2 caches, etc.). It does not include energy used in the "uncore"
> (L3 cache, on package devices, etc.), or used by memory or I/O devices.

Thank you very much for this rework. I found this much easier to follow.

Reinette