[PATCH v5 00/29] x86/resctrl telemetry monitoring
Posted by Tony Luck 6 months, 3 weeks ago
These patches are based on tip x86/cache branch. HEAD at time of
snapshot is:

54d14f25664b ("MAINTAINERS: Add reviewers for fs/resctrl")

These patches are also available in the rdt-aet-v5 branch at:

Link: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git

Changes (based on feedback from Reinette and a bug report from Chen Yu)
since v4 was posted here:

Link: https://lore.kernel.org/all/20250429003359.375508-1-tony.luck@intel.com/

The change map below is indexed by patch numbers in v4. Some patches
have been merged, split, dropped, or re-ordered. The v5 patches are
referred to by their 4-digit git format-patch numbers in an attempt
to avoid confusion.

=== 1 ===

The v4 patch focused on removing the rdt_mon_features bitmap,
which included moving all the mon_evt structure definitions
into an array. Reinette noted that this array would mean the
rdt_resource::evt_list is no longer needed.

v5 splits up the changes into three parts:
0001:	Moves mon_evt structures into an array (now named
	mon_event_all[]) and replaces use of rdt_resource::evt_list
	with iteration of enabled events in the array.
0002:	Replace resctrl_arch_is_llc_occupancy_enabled() with
	resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)
	(ditto for the mbm*enabled() inline functions).
0003:	Remove remaining use of rdt_mon_features.

=== 2 ===

0004:	Fix typos. Change parameter of resctrl_is_mbm_event() to enum.
	s/QOS_NUM_MBM_EVENTS/QOS_NUM_L3_MBM_EVENTS/.
	s/MBM_EVENT_IDX/MBM_STATE_IDX/. Rewrite get_arch_mbm_state()
	in the same simple style as get_mbm_state().

=== 3 ===

Dropped this patch. No immediate plans for other mbm monitor events
that could be used as input to the "mba_MBps" feedback control.

=== 4 ===

Also dropped. The rdt_resource::evt_list no longer exists, so no need
to rearrange code to build it at mount time.

=== 5 ===

Dropped the Kconfig changes for now. This means that intel_aet.c is
always built with CONFIG_X86_CPU_RESCTRL=y. Will need to revisit when
the CONFIG_INTEL_PMT_DISCOVERY driver is upstream.

=== 6 ===

0005:	Added comments that fake interface is deliberately crafted
	with parameters to exercise multiple aggregators per package
	and insufficient RMIDs supported.

=== 7 ===

0006:	Rename check_domain_header() to domain_header_is_valid()

=== 8 ===

Split into two parts:
0007:	Better names for functions. Use "()" consistently in the
	commit message when naming functions.
0008:	Better description that this change is just for domain add.

=== 9 ===

0009:	New commit message with background and rationale for change.

=== 10 ===

0010:	More context in commit message. Dropped an unnecessary container_of().
	Made domain_add_cpu_ctrl() match domain_add_cpu_mon() with a simple
	path when adding a CPU to an existing domain.

=== 11 ===

0011:	Added rationale for the rename of the rdt_mon_domain and
	rdt_hw_mon_domain structures. Fixed alignment in structure
	definitions. Fixed broken fir tree ordering.

=== 12 ===

Split into two parts:
0012:	Make mon_data and rmid_read structures point to mon_evt instead of
	just holding the event enum.
0013:	The "read from any cpu" part.
	Fixed a bug reported by Chen Yu with the use of smp_processor_id().
	New shortlog description. Fixed "cpumast" typo. Separated
	problem description from solution.
	Fixed reverse fir tree.
	Avoided the "usually" comment (new comments are in the helper
	function that moved out of __mon_event_count()).

=== 13 ===

0014:	New direction. Don't bind specific value display formats
	to specific events, which would force other architectures to
	follow in the footsteps of the first to implement an event.
	Instead allow the architecture to specify how many binary
	fixed-point bits are used for each event.
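	A minimal sketch of how such a counter might be rendered for
	display, assuming a hypothetical frac_bits count of binary
	fixed-point bits (the names here are illustrative, not the
	actual interface in the series):

		#include <inttypes.h>
		#include <stdio.h>

		/* Sketch: the low "frac_bits" bits of val hold the fraction. */
		static void print_fixed_point(uint64_t val, unsigned int frac_bits)
		{
			uint64_t whole = val >> frac_bits;
			uint64_t frac = val & ((UINT64_C(1) << frac_bits) - 1);

			/* Scale the fractional part to six decimal digits. */
			frac = frac * UINT64_C(1000000) >> frac_bits;
			printf("%" PRIu64 ".%06" PRIu64 "\n", whole, frac);
		}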

=== 14 ===

0015:	Add period to end of sentence in comment for resctrl_arch_pre_mount().
	Use atomic_try_cmpxchg() instead of atomic_cmpxchg().

=== 15 ===

0016:	Updated commit comment to avoid "with code".
	Dropped initialization of rdt_hw_resource::rid.


=== 16 ===
=== 17 ===
=== 18 ===
=== 20 ===
=== 21 ===
	These were the "first part", "second part", ... of enumeration,
	plus the sanity check for adequate MMIO space to match expectations
	from the XML file layout description.

0017:
0018:
0019:
	These now describe what each part does, building the
	struct event_group fields as needed for each patch.
	Split the fields into the set initialized from XML
	files, and the fields used by resctrl code to manage groups.
	Fixed Link: lines with the real URL of the Intel-PMT git repo.
	Changed type of guid from int to u32.
	Changed configure_events() return value from bool to a standard
	integer error code (using -ENOMEM, -EINVAL where appropriate).
	Documented the mmio_info structure and added an ascii art picture
	to the commit comment showing how it is used.
	Use kzalloc() instead of kmalloc().
	Added a helper function skip_this_region() so that counting
	regions and allocating regions do the same thing.


=== 19 ===

0020:	Add description of layout of MMIO counters to commit comment.

=== 22 ===

0021:	Fixed the MMIO address range check in intel_aet_read_event().
	Changed return code from -EINVAL to -EIO to meet expectations
	of rdtgroup_mondata_show().
	Changed name of VALID_BIT define to DATA_VALID to indicate that
	it shows that the value in a counter is valid (as opposed to the
	counter itself).
	Added check in resctrl_arch_rmid_read() that remainder of the
	function after the check for RDT_RESOURCE_PERF_PKG has been
	passed a RDT_RESOURCE_L3 resource.

=== 23 ===

0022:	Fix typo s/domsins/domains/.
	Kept definition of struct rdt_perf_pkg_mon_domain in architecture
	code, per Reinette's comment: "This may thus be ok like this for now".
	Since this only contains the rdt_domain_hdr, there isn't anything
	extra that file system code could look at even if it somehow
	wanted to.
	Things may be different for a more complex resource that has
	to maintain additional per-domain state that file system code
	may need to be aware of.

=== 24 ===

	I merged old patch 24 into new 0016.

=== 25 ===

0023:	Unchanged

=== 26 ===

0024:	Split the hard-to-read rdt_check_option() function into two that
	have names that convey what they do: rdt_is_option_force_enabled()
	and rdt_is_option_force_disabled().

=== 27 ===

0025:	Updated commit comment and kerneldoc comment to note that the
	pmt_event::num_rmids field is initialized from data in the
	XML file, but may be overwritten.
	Added min() operation to make sure num_rmids cannot be increased
	when processing additional event groups.

=== 28 ===

0026:	In v4 this added to the mount-time initialization of the per-resource
	event lists. But that code has been dropped from this series. Added a
	new one-time call in rdt_get_tree() (inside code where resctrl_mutex
	is held). Same basic function to compute the number of RMIDs as the
	minimum across all enabled monitor resources.

=== 29 ===
=== 30 ===

0027:
0028:	v4 presented these as RFCs to add a debug info file for a resource. But
	after some thought I changed strategy so that the per-resource function
	can choose the file name. This also avoids it showing up as an empty
	file in the info directory for other resources.

=== 31 ===

0029:	No changes.



Background
----------

Telemetry features are being implemented in conjunction with the
IA32_PQR_ASSOC.RMID value on each logical CPU. This RMID value is used
to tag counts for various events sent to a collector in a nearby OOBMSM
(Out Of Band Management Services Module) device, where they are
accumulated with the counts for each <RMID, event> pair received from
other CPUs. CPUs send event counts when the RMID value changes, or
after each 2ms of elapsed time.

Each OOBMSM device may implement multiple event collectors with each
servicing a subset of the logical CPUs on a package.  In the initial
hardware implementation, there are two categories of events: energy
and perf.

1) Energy - Two counters
core_energy: This is an estimate of Joules consumed by each core. It is
calculated based on the types of instructions executed, not from a power
meter. This counter is useful to understand how much energy a workload
is consuming.

activity: This measures "accumulated dynamic capacitance". Users who
want to optimize energy consumption for a workload may use this rather
than core_energy because it provides consistent results independent of
any frequency or voltage changes that may occur during the runtime of
the application (e.g. entry/exit from turbo mode).

2) Performance - Seven counters
These are similar events to those available via the Linux "perf" tool,
but collected in a way with much lower overhead (no need to collect data
on every context switch).

stalls_llc_hit - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which hit in the LLC.

c1_res - Counts the total C1 residency across all cores. The underlying
counter increments on 100MHz clock ticks.

unhalted_core_cycles - Counts the total number of unhalted core clock
cycles.

stalls_llc_miss - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which missed all the
local caches.

c6_res - Counts the total C6 residency. The underlying counter increments
on crystal clock (25MHz) ticks.

unhalted_ref_cycles - Counts the total number of unhalted reference clock
(TSC) cycles.

uops_retired - Counts the total number of uops retired.

The counters are arranged in groups in MMIO space of the OOBMSM device.
E.g. for the energy counters the layout is:

Offset: Counter
0x00	core energy for RMID 0
0x08	core activity for RMID 0
0x10	core energy for RMID 1
0x18	core activity for RMID 1
...
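
A counter's address follows directly from this layout. As a minimal
sketch (the stride and names are assumptions based on the example
above, not the actual code in the series):

        #include <linux/io.h>

        #define COUNTERS_PER_RMID	2	/* core_energy, activity */
        #define COUNTER_SIZE		8	/* each counter is 64 bits */

        /* Sketch: locate the counter for one <RMID, event> pair. */
        static void __iomem *counter_addr(void __iomem *base, int rmid,
        				  int evt_index)
        {
        	return base + rmid * COUNTERS_PER_RMID * COUNTER_SIZE +
        	       evt_index * COUNTER_SIZE;
        }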

Enumeration
-----------

The only CPUID-based enumeration for this feature is the legacy
CPUID(eax=7,ecx=0).ebx{12} bit that indicates the presence of the
IA32_PQR_ASSOC MSR and the RMID field within it.

The OOBMSM driver discovers which features are present via
PCIe VSEC capabilities. Each feature is tagged with a unique
identifier. These identifiers indicate which XML description file from
https://github.com/intel/Intel-PMT describes which event counters are
available and their layout within the MMIO BAR space of the OOBMSM device.

Resctrl User Interface
----------------------

Because there may be multiple OOBMSM collection agents per processor
package, resctrl accumulates event counts from all agents on a package
and presents a single value to users. This will provide a consistent
user interface on future platforms that vary the number of collectors,
or the mappings from logical CPUs to collectors.

Users will continue to see the legacy monitoring files in the "L3"
directories and the telemetry files in the new "PERF_PKG" directories
(with each file providing the aggregated value from all OOBMSM collectors
on that package).

$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_PERF_PKG_00
│   ├── activity
│   ├── c1_res
│   ├── c6_res
│   ├── core_energy
│   ├── stalls_llc_hit
│   ├── stalls_llc_miss
│   ├── unhalted_core_cycles
│   ├── unhalted_ref_cycles
│   └── uops_retired
└── mon_PERF_PKG_01
    ├── activity
    ├── c1_res
    ├── c6_res
    ├── core_energy
    ├── stalls_llc_hit
    ├── stalls_llc_miss
    ├── unhalted_core_cycles
    ├── unhalted_ref_cycles
    └── uops_retired

Resctrl Implementation
----------------------

The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
that returns an array of structures describing the per-RMID groups it
found from the VSEC enumeration. Linux looks at the unique identifiers
for each group and enables resctrl for all groups with known unique
identifiers.

The memory map for the counters for each <RMID, event> pair is described
by the XML file. This is too unwieldy to use in the Linux kernel, so a
simplified representation is built into the resctrl code. Note that the
counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
and IA32_QM_CTR MSRs. This means there is no need for cross-processor
calls to read counters from a CPU in a specific domain. The counters
can be read from any CPU.
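
As a sketch of what such a read might look like (the DATA_VALID name
matches the define mentioned in the change log above, but the bit
position and the helper itself are assumptions):

        #include <linux/bits.h>
        #include <linux/errno.h>
        #include <linux/io.h>
        #include <linux/types.h>

        #define DATA_VALID	BIT_ULL(63)	/* value in the counter is valid */

        /* Sketch: read one 64-bit MMIO counter, works from any CPU. */
        static int read_telemetry_counter(void __iomem *addr, u64 *val)
        {
        	u64 v = readq(addr);	/* no cross-processor call needed */

        	if (!(v & DATA_VALID))
        		return -EIO;
        	*val = v & ~DATA_VALID;
        	return 0;
        }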

High level description of code changes:

1) New scope RESCTRL_PACKAGE
2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
        switch (r->rid) {
        case RDT_RESOURCE_L3:
                helper for L3
                break;
        case RDT_RESOURCE_PERF_PKG:
                helper for PKG
                break;
        }
4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.

With only one platform providing this feature, it's tricky to tell
exactly where it is going to go. I've made the event definitions
platform specific (based on the unique ID from the VSEC enumeration). It
seems possible/likely that the list of events may change from generation
to generation.
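
One way to picture the platform specific tables (a sketch: apart from
the u32 guid and the struct event_group name mentioned in the change
log above, the fields are assumptions):

        #include <linux/types.h>

        struct mon_evt;

        /* Sketch: a per-platform event group, matched by VSEC unique ID. */
        struct event_group {
        	u32		guid;		/* unique ID from VSEC enumeration */
        	struct mon_evt	*evts;		/* events provided by this group */
        	unsigned int	num_events;
        	unsigned int	num_rmids;	/* from the XML description */
        };

Supporting a new generation would then mostly mean adding one more
initialized instance of such a structure.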

I've picked names for events based on the descriptions in the XML file.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (29):
  x86,fs/resctrl: Consolidate monitor event descriptions
  x86,fs/resctrl: Replace architecture event enabled checks
  x86/resctrl: Remove 'rdt_mon_features' global variable
  x86,fs/resctrl: Prepare for more monitor events
  x86/resctrl: Fake OOBMSM interface
  x86,fs/resctrl: Improve domain type checking
  x86,fs/resctrl: Rename some L3 specific functions
  x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
  x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain
    types
  x86/resctrl: Change generic domain functions to use struct
    rdt_domain_hdr
  x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
  fs/resctrl: Make event details accessible to functions when reading
    events
  x86,fs/resctrl: Handle events that can be read from any CPU
  x86,fs/resctrl: Support binary fixed point event counters
  fs/resctrl: Add an architectural hook called for each mount
  x86/resctrl: Add and initialize rdt_resource for package scope core
    monitor
  x86/resctrl: Discover hardware telemetry events
  x86/resctrl: Count valid telemetry aggregators per package
  x86/resctrl: Complete telemetry event enumeration
  x86,fs/resctrl: Fill in details of Clearwater Forest events
  x86/resctrl: Read core telemetry events
  x86,fs/resctrl: Handle domain creation/deletion for
    RDT_RESOURCE_PERF_PKG
  x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
  x86/resctrl: Add energy/perf choices to rdt boot option
  x86/resctrl: Handle number of RMIDs supported by telemetry resources
  x86,fs/resctrl: Move RMID initialization to first mount
  fs/resctrl: Add file system mechanism for architecture info file
  x86/resctrl: Add info/PERF_PKG_MON/status file
  x86/resctrl: Update Documentation for package events

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/filesystems/resctrl.rst         |  53 ++-
 include/linux/resctrl.h                       |  97 ++++-
 include/linux/resctrl_types.h                 |  14 +
 arch/x86/include/asm/resctrl.h                |  16 -
 .../cpu/resctrl/fake_intel_aet_features.h     |  73 ++++
 arch/x86/kernel/cpu/resctrl/internal.h        |  30 +-
 fs/resctrl/internal.h                         |  87 ++--
 arch/x86/kernel/cpu/resctrl/core.c            | 314 ++++++++++----
 .../cpu/resctrl/fake_intel_aet_features.c     |  97 +++++
 arch/x86/kernel/cpu/resctrl/intel_aet.c       | 388 ++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c         |  67 +--
 fs/resctrl/ctrlmondata.c                      | 108 ++++-
 fs/resctrl/monitor.c                          | 262 +++++++-----
 fs/resctrl/rdtgroup.c                         | 259 ++++++++----
 arch/x86/Kconfig                              |   2 +-
 arch/x86/kernel/cpu/resctrl/Makefile          |   2 +
 17 files changed, 1439 insertions(+), 432 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
 create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c


base-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
base-branch: x86/cache
base-commit: 54d14f25664bbb75c2928dd0d64a095c0f488176
-- 
2.49.0

Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
Posted by James Morse 6 months ago
Hi Tony,

I'm still going through this, but here is my attempt to describe what equivalents arm has
in this area.


On 21/05/2025 23:50, Tony Luck wrote:
> Background
> ----------
> 
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.

This is a shared memory area where an external agent (the OOBMSM) has logged measurement data?

Arm's equivalent to this is two things.
For things close to the CPU (e.g. stalls_llc_miss) these would be a PMU (possibly an uncore
PMU) which follows a convention on its register layout, meaning the general purpose PMU
driver should be able to drive them. The meaning of the events is described to user-space
via the perf json file. The kernel knows how to read event:6, but not what event:6 means.
The spec for this mentions MPAM; values can be monitored by ~RMID, but none of this is
managed by the MPAM driver.

The other thing arm has that is a bit like this is SCMI, which is a packet format for
talking to an on-die microcontroller to get platform specific temperature, voltage and
clock values. Again, this is another bit of kernel infrastructure that has its own way of
doing things. I don't see this filtering things by ~RMID ... but I guess it's possible.
That can have shared memory areas (termed 'fast channels'). I think they are an array of
counter values, and something in the packet stream tells you which one is which.


Neither of these need picking up by the MPAM driver to expose via resctrl. But I'd like to
get that information across where possible so that user-space can be portable.


> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package.  In the initial
> hardware implementation, there are two categories of events: energy
> and perf.
> 
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.
> 
> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).

> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with much lower overhead (no need to collect data
> on every context switch).
> 
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC
> 
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks
> 
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles
> 
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches
> 
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
> 
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
> 
> uops_retired - Counts the total number of uops retired
> 
> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
> 
> Offset: Counter
> 0x00	core energy for RMID 0
> 0x08	core activity for RMID 0
> 0x10	core energy for RMID 1
> 0x18	core activity for RMID 1
> ...

For the performance counters especially, on arm I'd be trying to get these values by
teaching perf about the CLOSID/RMID values, so that perf events are only incremented for
tasks in a particular control/monitor group.
(why that might be relevant is below)


> Resctrl User Interface
> ----------------------
> 
> Because there may be multiple OOBMSM collection agents per processor
> package, resctrl accumulates event counts from all agents on a package
> and presents a single value to users. This will provide a consistent
> user interface on future platforms that vary the number of collectors,
> or the mappings from logical CPUs to collectors.

Great!


> Users will continue to see the legacy monitoring files in the "L3"
> directories and the telemetry files in the new "PERF_PKG" directories
> (with each file providing the aggregated value from all OOBMSM collectors
> on that package).
> 
> $ tree /sys/fs/resctrl/mon_data/
> /sys/fs/resctrl/mon_data/
> ├── mon_L3_00
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes

> ├── mon_PERF_PKG_00

Where do the package ids come from? How can user-space find out which CPUs are in package-0?

I don't see a package_id in either /sys/devices/system/cpu/cpu0/topology or
Documentation/ABI/stable/sysfs-devices-system-cpu.

> │   ├── activity
> │   ├── c1_res
> │   ├── c6_res
> │   ├── core_energy
> │   ├── stalls_llc_hit
> │   ├── stalls_llc_miss
> │   ├── unhalted_core_cycles
> │   ├── unhalted_ref_cycles
> │   └── uops_retired
> └── mon_PERF_PKG_01
>     ├── activity
>     ├── c1_res
>     ├── c6_res
>     ├── core_energy
>     ├── stalls_llc_hit
>     ├── stalls_llc_miss
>     ├── unhalted_core_cycles
>     ├── unhalted_ref_cycles
>     └── uops_retired

Looks good to me.

The difficulty MPAM platforms have had with mbm_total_bytes et al is the "starts counting
from the beginning of time" property. Having to enable mbm_total_bytes before it counts
would have allowed MPAM to report an error if it couldn't enable more than N counters at a
time. (ABMC suggests AMD platforms have a similar problem).


How do you feel about having to enable these before they start counting?

This would allow the MPAM driver to open the event via perf if it has a corresponding
feature/counter, then provide the value from perf via resctrl.


Another headache is how we describe the format of the contents of these files... a made-up
example: residency counts could be in absolute time, or percentages. I've been bitten
by the existing schemata strings being implicitly in a particular format, meaning
conversions have to happen. I'm not sure whether some architecture/platform would trip
over the same problem here.


> Resctrl Implementation
> ----------------------
> 
> The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
> that returns an array of structures describing the per-RMID groups it
> found from the VSEC enumeration. Linux looks at the unique identifiers
> for each group and enables resctrl for all groups with known unique
> identifiers.
> 
> The memory map for the counters for each <RMID, event> pair is described
> by the XML file. This is too unwieldy to use in the Linux kernel, so a
> simplified representation is built into the resctrl code.

(I hope there are only a few combinations!)


> Note that the
> counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
> and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> calls to read counters from a CPU in a specific domain.

Huzzah! RISC-V has this property, and many MPAM platforms do, (...but not all...)


> The counters can be read from any CPU.
> 
> High level description of code changes:
> 
> 1) New scope RESCTRL_PACKAGE
> 2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
> 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
>         switch (r->rid) {
>         case RDT_RESOURCE_L3:
>                 helper for L3
>                 break;
>         case RDT_RESOURCE_PERF_PKG:
>                 helper for PKG
>                 break;
>         }
> 4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.
> 
> With only one platform providing this feature, it's tricky to tell
> exactly where it is going to go. I've made the event definitions
> platform specific (based on the unique ID from the VSEC enumeration). It
> seems possible/likely that the list of events may change from generation
> to generation.

My thinking about this from a perf angle was to have named events for those things that
resctrl supports, but allow events to be specified by number, and funnel those through
resctrl_arch_rmid_read() so that the arch code can interpret them as a counter type-id or
an offset in some array. The idea was to allow platform specific counters to be read
without any kernel changes, reducing the pressure to add resctrl support for counters that
may only ever be present in a single platform.

With the XML data you have, would it be possible to add new 'events' to this interface via
sysfs/configfs? Or does too much depend on the data identified by that GUID...


Thanks,

James
Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
Posted by Luck, Tony 6 months ago
On Fri, Jun 13, 2025 at 05:57:26PM +0100, James Morse wrote:
> Hi Tony,
> 
> I'm still going through this, but here is my attempt to describe what equivalents arm has
> in this area.
> 
> 
> On 21/05/2025 23:50, Tony Luck wrote:
> > Background
> > ----------
> > 
> > Telemetry features are being implemented in conjunction with the
> > IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> > counts for various events to a collector in a nearby OOBMSM device to be
> > accumulated with counts for each <RMID, event> pair received from other
> > CPUs. Cores send event counts when the RMID value changes, or after each
> > 2ms elapsed time.
> 
> This is a shared memory area where an external agent (the OOBMSM) has logged measurement data?

Yes. Effectively shared memory (but in another address space, so we
need to use readq(addr) rather than plain *addr to read values and
keep sparse happy).

> Arm's equivalent to this is two things.
> For things close to the CPU (e.g. stalls_llc_miss) these would be an PMU (possibly uncore
> PMU) which follow a convention on their register layout meaning the general purpose pmu
> driver should be able to drive them. The meaning of the events is described to user-space
> via the perf json file. The kernel knows how to read event:6, but not what event:6 means.
> The spec for this mentions MPAM, values can be monitored by ~RMID, but none of this is
> managed by the MPAM driver.
> 
> The other thing arm has that is a bit like this is SCMI, which is a packet format for
> talking to an on-die microcontroller to get platform specific temperature, voltage and
> clock values. Again, this is another bit of kernel infrastructure that has its own way of
> doing things. I don't see this filtering things by ~RMID ... but I guess its possible.
> That can have shared memory areas (termed 'fast channels'). I think they are an array of
> counter values, and something in the packet stream tells you which one is which.
> 
> 
> Neither of these need picking up by the MPAM driver to expose via resctrl. But I'd like to
> get that information across where possible so that user-space can be portable.
> 
> 
> > Each OOBMSM device may implement multiple event collectors with each
> > servicing a subset of the logical CPUs on a package.  In the initial
> > hardware implementation, there are two categories of events: energy
> > and perf.
> > 
> > 1) Energy - Two counters
> > core_energy: This is an estimate of Joules consumed by each core. It is
> > calculated based on the types of instructions executed, not from a power
> > meter. This counter is useful to understand how much energy a workload
> > is consuming.
> > 
> > activity: This measures "accumulated dynamic capacitance". Users who
> > want to optimize energy consumption for a workload may use this rather
> > than core_energy because it provides consistent results independent of
> > any frequency or voltage changes that may occur during the runtime of
> > the application (e.g. entry/exit from turbo mode).
> 
> > 2) Performance - Seven counters
> > These are similar events to those available via the Linux "perf" tool,
> > but collected in a way with much lower overhead (no need to collect data
> > on every context switch).
> > 
> > stalls_llc_hit - Counts the total number of unhalted core clock cycles
> > when the core is stalled due to a demand load miss which hit in the LLC
> > 
> > c1_res - Counts the total C1 residency across all cores. The underlying
> > counter increments on 100MHz clock ticks
> > 
> > unhalted_core_cycles - Counts the total number of unhalted core clock
> > cycles
> > 
> > stalls_llc_miss - Counts the total number of unhalted core clock cycles
> > when the core is stalled due to a demand load miss which missed all the
> > local caches
> > 
> > c6_res - Counts the total C6 residency. The underlying counter increments
> > on crystal clock (25MHz) ticks
> > 
> > unhalted_ref_cycles - Counts the total number of unhalted reference clock
> > (TSC) cycles
> > 
> > uops_retired - Counts the total number of uops retired
> > 
> > The counters are arranged in groups in MMIO space of the OOBMSM device.
> > E.g. for the energy counters the layout is:
> > 
> > Offset: Counter
> > 0x00	core energy for RMID 0
> > 0x08	core activity for RMID 0
> > 0x10	core energy for RMID 1
> > 0x18	core activity for RMID 1
> > ...
> 
> For the performance counters especially, on arm I'd be trying to get these values by
> teaching perf about the CLOSID/RMID values, so that perf events are only incremented for
> tasks in a particular control/monitor group.
> (why that might be relevant is below)

Yes. If perf is enhanced to take CLOSID/RMID into account when
accumulating event counts it can provide the same functionality.

Higher overhead since perf needs to sample event counters of
interest on every context switch instead of data collection
being handled by hardware.

On the other hand the perf approach is more flexible as you can
pick any event to sample per-RMID instead of the fixed set that
the h/w designer chose.

> 
> > Resctrl User Interface
> > ----------------------
> > 
> > Because there may be multiple OOBMSM collection agents per processor
> > package, resctrl accumulates event counts from all agents on a package
> > and presents a single value to users. This will provide a consistent
> > user interface on future platforms that vary the number of collectors,
> > or the mappings from logical CPUs to collectors.
> 
> Great!
> 
> 
> > Users will continue to see the legacy monitoring files in the "L3"
> > directories and the telemetry files in the new "PERF_PKG" directories
> > (with each file providing the aggregated value from all OOBMSM collectors
> > on that package).
> > 
> > $ tree /sys/fs/resctrl/mon_data/
> > /sys/fs/resctrl/mon_data/
> > ├── mon_L3_00
> > │   ├── llc_occupancy
> > │   ├── mbm_local_bytes
> > │   └── mbm_total_bytes
> > ├── mon_L3_01
> > │   ├── llc_occupancy
> > │   ├── mbm_local_bytes
> > │   └── mbm_total_bytes
> 
> > ├── mon_PERF_PKG_00
> 
> Where do the package ids come from? How can user-space find out which CPUs are in package-0?

Resctrl gets the id from topology_physical_package_id(cpu);
> 
> I don't see a package_id in either /sys/devices/system/cpu/cpu0/topology or
> Documentation/ABI/stable/sysfs-devices-system-cpu.

These package IDs show up on x86 with these file names:

$ grep ^ /sys/devices/system/cpu/cpu0/topology/*package*
/sys/devices/system/cpu/cpu0/topology/package_cpus:0000,00000fff,ffffff00,0000000f,ffffffff
/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-35,72-107
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0

> 
> > │   ├── activity
> > │   ├── c1_res
> > │   ├── c6_res
> > │   ├── core_energy
> > │   ├── stalls_llc_hit
> > │   ├── stalls_llc_miss
> > │   ├── unhalted_core_cycles
> > │   ├── unhalted_ref_cycles
> > │   └── uops_retired
> > └── mon_PERF_PKG_01
> >     ├── activity
> >     ├── c1_res
> >     ├── c6_res
> >     ├── core_energy
> >     ├── stalls_llc_hit
> >     ├── stalls_llc_miss
> >     ├── unhalted_core_cycles
> >     ├── unhalted_ref_cycles
> >     └── uops_retired
> 
> Looks good to me.
> 
> The difficulty MPAM platforms have had with mbm_total_bytes et al is the "starts counting
> from the beginning of time" property. Having to enable mbm_total_bytes before it counts
> would have allowed MPAM to report an error if it couldn't enable more than N counters at a
> time. (ABMC suggests AMD platforms have a similar problem).

Resctrl goes to some lengths to have mbm_total_bytes start from zero
when you mkdir a group even when some old RMID is re-used that has
got some left over value from its previous lifetime. This isn't
overly painful because resctrl has to carry lots of per-RMID state
to handle the wraparound of the narrow counters.

The Intel telemetry counters are 63 bits (lose one bit for the VALID
indication). So wrap around is no concern at all for most of them
as it happens in centuries/millennia. Potentially the uops_retired
counter might wrap in months, but that only happens if every logical
CPU is running with the same RMID for that whole time. So I've chosen
to ignore wraparound. As a result counters don't start from zero when
a group is created. I don't see this as an issue because all use cases
are "read a counter; wait some interval; re-read the counter; compute
the rate" which doesn't require starting from zero.

> 
> How do you feel about having to enable these before they start counting?
> 
> This would allow the MPAM driver to open the event via perf if it has a corresponding
> feature/counter, then provide the value from perf via resctrl.

You'd have resctrl report "Unavailable" for these until connecting the
plumbing to perf to provide data?

> 
> Another headache is how we describe the format of the contents of these files... a  made
> up example: residency counts could be in absolute time, or percentages. I've been bitten
> by the existing schemata strings being implicitly in a particular format, meaning
> conversions have to happen. I'm not sure whether some architecture/platform would trip
> over the same problem here.

Reinette is adamant that format of each resctrl event file must be
fixed. So if different systems report residency in different ways,
you'd either have to convert to some common format, or if that isn't
possible, those would have to appear in resctrl as different filenames.
E.g. "residency_absolute" and "residency_percentage".

> 
> > Resctrl Implementation
> > ----------------------
> > 
> > The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
> > that returns an array of structures describing the per-RMID groups it
> > found from the VSEC enumeration. Linux looks at the unique identifiers
> > for each group and enables resctrl for all groups with known unique
> > identifiers.
> > 
> > The memory map for the counters for each <RMID, event> pair is described
> > by the XML file. This is too unwieldy to use in the Linux kernel, so a
> > simplified representation is built into the resctrl code.
> 
> (I hope there are only a few combinations!)

There is almost certain to be a new description for each CPU generation
since the number of RMIDs is embedded in the description. If the overall
structure stays the same, then each new instance is described by a dozen
or so lines of code to initialize a data structure, so I think that's an
acceptable level of pain.

> 
> > Note that the
> > counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
> > and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> > calls to read counters from a CPU in a specific domain.
> 
> Huzzah! RISC-V has this property, and many MPAM platforms do, (...but not all...)
> 
> 
> > The counters can be read from any CPU.
> > 
> > High level description of code changes:
> > 
> > 1) New scope RESCTRL_PACKAGE
> > 2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
> > 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
> >         switch (r->rid) {
> >         case RDT_RESOURCE_L3:
> >                 helper for L3
> >                 break;
> >         case RDT_RESOURCE_PERF_PKG:
> >                 helper for PKG
> >                 break;
> >         }
> > 4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.
> > 
> > With only one platform providing this feature, it's tricky to tell
> > exactly where it is going to go. I've made the event definitions
> > platform specific (based on the unique ID from the VSEC enumeration). It
> > seems possible/likely that the list of events may change from generation
> > to generation.
> 
> My thinking about this from a perf angle was to have named events for those things that
> resctrl supports, but allow events to be specified by number, and funnel those through
> resctrl_arch_rmid_read() so that the arch code can interpret them as a counter type-id or
> an offset in some array. The idea was to allow platform specific counters to be read
> without any kernel changes, reducing the pressure to add resctrl support for counters that
> may only ever be present in a single platform.
> 
> With the XML data you have, would it be possible to add new 'events' to this interface via
> sysfs/configfs? Or does too much depend on the data identified by that GUID...

Maybe. There are a number of parameters that need to be provided for an
event (a rough sketch of how these might be packaged follows the list):
1) Scope. Must be something that resctrl already knows (and is actively
using?) L2/L3 cache, node, package.
2) Base address(es) for counters for this event.
3) Parameters for F(RMID) to compute offset from base
4) Type of access (mmio_read, MSR_read, other?)
5) post-process needed after read (check for valid? maybe other things)
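
Something like this (purely hypothetical, just to make the list concrete):

	#include <linux/io.h>
	#include <linux/resctrl.h>
	#include <linux/types.h>

	struct generic_mon_event {
		enum resctrl_scope	scope;		/* 1: e.g. RESCTRL_PACKAGE */
		void __iomem		*base;		/* 2: base address for counters */
		unsigned int		stride;		/* 3: F(RMID): base + rmid * stride */
		int			access_type;	/* 4: mmio_read, MSR_read, ... */
		int			(*post_process)(u64 *val); /* 5: e.g. valid check */
	};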

Might be best to try some PoC implementation to see which of those are
required vs. overkill. Also to see what I missed from the list.
> 
> 
> Thanks,
> 
> James

-Tony
Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
Posted by Reinette Chatre 6 months, 2 weeks ago
Hi Tony,

On 5/21/25 3:50 PM, Tony Luck wrote:
> Background
> ----------
> 
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be

(could you please expand what "OOBMSM" means somewhere?)

> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.

Could you please use consistent terminology? The short paragraph above
uses "logical CPU", "CPU", and "core" seemingly interchangeably and that is
confusing since these terms mean different things on x86 (re.
Documentation/arch/x86/topology.rst). (more below)

> 
> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package.  In the initial

("logical CPU" ... but seems to be used in same context as "core" in
first paragraph?)

> hardware implementation, there are two categories of events: energy
> and perf.
> 
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.

With RMIDs being per logical CPU it is not obvious to me how these
events should be treated since most of them are described as core events. 

If the RMID is per logical CPU and the events are per core then how should
the counters be interpreted? Would the user, for example, need to set CPU
affinity to ensure only tasks within same monitor group are run on the same
cores? How else can it be ensured the data reflects the monitor group it
is reported for?

> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).

(No scope for this event.)

> 
> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with much lower overhead (no need to collect data
> on every context switch).
> 
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC

(core scope)

> 
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks

("across all cores" ... package scope?)

> 
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles

(core)

> 
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches

(core)

> 
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
> 
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
> 
> uops_retired - Counts the total number of uops retired

(no scope in descriptions of the above)

> 
> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
> 
> Offset: Counter
> 0x00	core energy for RMID 0
> 0x08	core activity for RMID 0
> 0x10	core energy for RMID 1
> 0x18	core activity for RMID 1
> ...

This does seem to hint that counters/events are always per core (but the descriptions
do not always reflect that) while RMID is per logical CPU.

Reinette
Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
Posted by Luck, Tony 6 months, 2 weeks ago
Hi Reinette,

I've begun drafting a new cover letter to explain telemetry.

Here's the introduction. Let me know if it helps cover the
gaps and ambiguities that you pointed out.

-Tony


RMID based telemetry events
---------------------------

Each CPU on a system keeps a local count of various events.

Every two milliseconds, or when the value of the RMID field in the
IA32_PQR_ASSOC MSR is changed, the CPU transmits all the event counts
together with the value of the RMID to a nearby OOBMSM (Out of band
management services module) device. The CPU then resets all counters and
begins counting events for the new RMID or time interval.

The OOBMSM device sums each event count with those received from other
CPUs keeping a running total for each event for each RMID.
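
Purely as a conceptual model of the above (this accumulation happens
in the OOBMSM hardware, not in software, and the sizes are hypothetical):

	#include <linux/types.h>

	#define NUM_RMIDS	576	/* hypothetical */
	#define NUM_EVENTS	2	/* e.g. core_energy, activity */

	static u64 totals[NUM_RMIDS][NUM_EVENTS];	/* OOBMSM running totals */

	/* Conceptually runs every 2ms, or when a CPU's RMID changes. */
	static void cpu_flush_counts(int rmid, u64 local[NUM_EVENTS])
	{
		for (int e = 0; e < NUM_EVENTS; e++) {
			totals[rmid][e] += local[e];
			local[e] = 0;	/* CPU restarts counting from zero */
		}
	}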

The operating system can read these counts to gather a picture of
system-wide activity for each of the logged events per-RMID.

E.g. the operating system may assign RMID 5 to all the tasks running to
perform a certain job. When it reads the core energy event counter for
RMID 5 it will see the total energy consumed by CPU cores for all tasks
in that job while running on any CPU. This is a much lower overhead
mechanism to track events per job than the typical "perf" approach
of reading counters on every context switch.

Events
------

"core energy" The number of Joules consumed by CPU cores during execution
of instructions for the current RMID.
Note that this does not include energy used by the "uncore" (LLC cache
and interfaces to off package devices) or energy used by memory or I/O
devices. Energy may be calculated based on measures of activity rather
than the output from a power meter.

"activity" The dynamic capacitance (Cdyn) in Farads for a core due to
execution of instructions for the current RMID. This event will be
more useful to a user interested in optimizing energy consumption
of a workload because it is invariant of frequency changes (e.g.
turbo mode) that may be outside of the control of the developer.
Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
Posted by Reinette Chatre 6 months, 2 weeks ago
Hi Tony,

On 5/28/25 2:38 PM, Luck, Tony wrote:
> Hi Reinette,
> 
> I've begun drafting a new cover letter to explain telemetry.
> 
> Here's the introduction. Let me know if it helps cover the
> gaps and ambiguities that you pointed out.
> 
> -Tony
> 
> 
> RMID based telemetry events
> ---------------------------
> 
> Each CPU on a system keeps a local count of various events.
> 
> Every two milliseconds, or when the value of the RMID field in the
> IA32_PQR_ASSOC MSR is changed, the CPU transmits all the event counts
> together with the value of the RMID to a nearby OOBMSM (Out of band
> management services module) device. The CPU then resets all counters and
> begins counting events for the new RMID or time interval.
> 
> The OOBMSM device sums each event count with those received from other
> CPUs keeping a running total for each event for each RMID.
> 
> The operating system can read these counts to gather a picture of
> system-wide activity for each of the logged events per-RMID.
> 
> E.g. the operating system may assign RMID 5 to all the tasks running to
> perform a certain job. When it reads the core energy event counter for
> RMID 5 it will see the total energy consumed by CPU cores for all tasks
> in that job while running on any CPU. This is a much lower overhead
> mechanism to track events per job than the typical "perf" approach
> of reading counters on every context switch.
> 

Could you please elaborate the CPU vs core distinction?

If the example above is for a system with below topology (copied from 
Documentation/arch/x86/topology.rst):
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0                    
			-> [thread 1] -> Linux CPU 1                    
	    -> [core 1] -> [thread 0] -> Linux CPU 2                    
			-> [thread 1] -> Linux CPU 3   

In the example, RMID 5 is assigned to tasks running "a certain job", for
convenience I will name it "jobA". Consider if the example is extended
with RMID 6 assigned to tasks running another job, "jobB".

If a jobA task is scheduled on CPU 0 and a jobB task is scheduled in CPU 1
then it may look like:
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 #RMID 5
			-> [thread 1] -> Linux CPU 1 #RMID 6            
	    -> [core 1] -> [thread 0] -> Linux CPU 2                    
			-> [thread 1] -> Linux CPU 3   

The example above states:
	When it reads the core energy event counter for RMID 5 it will
	see the total energy consumed by CPU cores for all tasks in that
	job while running on any CPU.

With RMID 5 and RMID 6 both running on core 0, and "RMID 5 will see
the total energy consumed by CPU cores", does this mean that reading RMID 5
counter will return the energy consumed by core 0 while RMID 5 is assigned to
CPU 0? Since core 0 contains both CPU 0 and CPU 1, would reading RMID 5 thus return
data of both RMID 5 and RMID 6 (jobA and jobB)?
And vice versa, reading RMID 6 will also include energy consumed by tasks
running with RMID 5?

Reinette