[PATCH v3 00/26] x86/resctrl telemetry monitoring
Posted by Tony Luck 1 month ago
Previous version here:
https://lore.kernel.org/all/20250321231609.57418-1-tony.luck@intel.com/

This series is based on James Morse's "fs/resctrl/" snapshot.

General changes since v2:
------------------------

Fixed several violations of the FS/ARCH interface. As far as
possible the patches are now divided into ARCH code patches
vs. FS code patches.

All event names/numbers are defined by FS code.

Changed the names of new bits in FS code to PERF_PKG.

Changed names of detailed event descriptions from using "CWF"
(short for Clearwater Forest) to names using the guid value.
E.g. NUM_RMIDS_CWF -> NUM_RMIDS_0x26696143

Fixed typos from v2, and improved many commit messages
and code comments.

Roadmap to patches in the series:
--------------------------------

First patch provides a simpler implementation of James' "Expand
the width of dom_id by replacing mon_data_bits" patch. James will
merge this into the next version of his resctrl fs series. So
this patch will be dropped.

Second patch is a cleaned up version of something I posted before
to improve handling of multiple MBM events.
https://lore.kernel.org/all/20250314202609.5753-1-tony.luck@intel.com/
It turns out to be a useful base to add other monitor events.

Third patch prepares for more monitor events by changing how
and when they are initialized. The "when" moves the initialization
to the first mount of resctrl, ready for adding perf events (which
are enumerated much later than normal resctrl initialization).

Fourth sets up the Kconfig elements for PERF events.

Fifth patch is the fake OOBMSM interface, providing the
intel_pmt_get_regions_by_feature() and intel_pmt_put_feature_group()
functions that will eventually be supplied directly by OOBMSM code.

Sixth patch is in response to Reinette's review of the v2 series,
where I had deleted some rdt_mon_domain type checking which
could lead to problems with the additional domain type. I
extended the type checking from simple "ctrl" vs. "mon" to
also check the resource type.

7-9 Refactor some x86 code to make adding a new rdt_resource easier.

10 Implements a suggestion from James so that events that can be
read from any CPU avoid a cross-processor interrupt while keeping
the same calling sequence between FS and ARCH code. The attribute
is available independently for each event.

11 Adds support for ARCH code to describe which format to display
monitor values in, from choices listed in an FS enum. Two choices are
implemented: 64-bit unsigned decimal number, and fixed-point with
18 binary digits displayed as floating point (see the sketch after
this list).

12 Add FS hook for ARCH code to set event counter attributes.

13 Add ARCH hook called by FS on each mount.

14-16 X86 enumeration of perf events using calls to OOBMSM.

17 Add lookup table from FS event numbers to ARCH structures
needed to process event read requests.

18 Hook into resctrl_arch_rmid_read() for RDT_RESOURCE_PERF_PKG
to read counters for its events. The code to read the events sums
over all aggregators in a package.

19 Cross-check three enumerated sources for the number of RMIDs for
perf events. This part needs further discussion on how to pass
information to users. The existing FS/ARCH interfaces don't have an
obvious way to add additional info/ files.

20 Add new RDT_RESOURCE_PERF_PKG resource and package scope define.

21 Domain creation/deletion for RDT_RESOURCE_PERF_PKG.

22 Type define for resctrl files for RDT_RESOURCE_PERF_PKG.

23 Add new events to typedef resctrl_event_id and all_events[].

24 Enable all enumerated telemetry events and the RDT_RESOURCE_PERF_PKG resource.

25 Add details for the telemetry events on Clearwater Forest systems.

26 Documentation.
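
Patch 11's fixed-point format can be illustrated with a minimal
sketch (this is not code from the series; the seq_file output and
helper name are my own assumptions):

        /*
         * Sketch only: render a counter whose low 18 bits are a
         * binary fraction as a decimal string with six digits after
         * the point.
         */
        static void show_fixed_point_18(struct seq_file *m, u64 val)
        {
                u64 frac = val & GENMASK_ULL(17, 0);

                /* Scale the 18-bit fraction to six decimal places */
                seq_printf(m, "%llu.%06llu\n", val >> 18,
                           (frac * 1000000) >> 18);
        }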

Background
----------

Telemetry features are being implemented in conjunction with the
IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
counts for various events to a collector in a nearby OOBMSM device to be
accumulated with counts for each <RMID, event> pair received from other
CPUs. Cores send event counts when the RMID value changes, or after each
2ms elapsed time.

Each OOBMSM device may implement multiple event collectors with each
servicing a subset of the logical CPUs on a package. In the initial
hardware implementation, there are two categories of events: energy
and performance, described below.

The counters are arranged in groups in MMIO space of the OOBMSM device.
E.g. for the energy counters the layout is:

Offset: Counter
0x00	core energy for RMID 0
0x08	core activity for RMID 0
0x10	core energy for RMID 1
0x18	core activity for RMID 1
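
Given that layout (8 bytes per counter, with all counters for one
RMID contiguous), the MMIO offset of a counter can be computed as in
this sketch (helper name and parameters are illustrative, not taken
from the series):

        static u64 counter_offset(unsigned int rmid, unsigned int event_idx,
                                  unsigned int events_per_rmid)
        {
                /*
                 * E.g. for the energy group events_per_rmid == 2, so
                 * core energy for RMID 1 lands at offset 0x10 as in
                 * the table above.
                 */
                return ((u64)rmid * events_per_rmid + event_idx) * 8;
        }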

1) Energy - Two counters
core_energy: This is an estimate of Joules consumed by each core. It is
calculated based on the types of instructions executed, not from a power
meter. This counter is useful to understand how much energy a workload
is consuming.

activity: This measures "accumulated dynamic capacitance". Users who
want to optimize energy consumption for a workload may use this rather
than core_energy because it provides consistent results independent of
any frequency or voltage changes that may occur during the runtime of
the application (e.g. entry/exit from turbo mode).

2) Performance - Seven counters
These are events similar to those available via the Linux "perf" tool,
but collected in a way with much lower overhead (no need to collect data
on every context switch).

stalls_llc_hit - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which hit in the LLC

c1_res - Counts the total C1 residency across all cores. The underlying
counter increments on 100MHz clock ticks

unhalted_core_cycles - Counts the total number of unhalted core clock
cycles

stalls_llc_miss - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which missed all the
local caches

c6_res - Counts the total C6 residency. The underlying counter increments
on crystal clock (25MHz) ticks

unhalted_ref_cycles - Counts the total number of unhalted reference clock
(TSC) cycles

uops_retired - Counts the total number of uops retired

Enumeration
-----------

The only CPUID-based enumeration for this feature is the legacy
CPUID(eax=7,ecx=0).ebx{12} bit that indicates the presence of the
IA32_PQR_ASSOC MSR and the RMID field within it.

The OOBMSM driver discovers which features are present via
PCIe VSEC capabilities. Each feature is tagged with a unique
identifier. These identifiers indicate which XML description file from
https://github.com/intel/Intel-PMT describes which event counters are
available and their layout within the MMIO BAR space of the OOBMSM device.
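
A sketch of how known identifiers might map to built-in event
descriptions (the struct layout and the guid/event pairing are
hypothetical; the guid value is the Clearwater Forest example from
the changelog above):

        struct telem_event_group {
                u32                     guid;   /* unique id from VSEC */
                unsigned int            num_events;
                const char * const      *event_names;
        };

        static const char * const energy_events[] = { "core_energy", "activity" };

        static const struct telem_event_group known_groups[] = {
                { .guid = 0x26696143, .num_events = 2, .event_names = energy_events },
        };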

Resctrl User Interface
----------------------

Because there may be multiple OOBMSM collection agents per processor
package, resctrl accumulates event counts from all agents on a package
and presents a single value to users. This provides a consistent
user interface on future platforms that vary the number of collectors
or the mappings from logical CPUs to collectors.
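
A minimal sketch of that aggregation (the structure and names are
assumptions, not the series' code):

        struct telem_aggregator {
                void __iomem    *base;          /* MMIO counter region */
                unsigned int    events_per_rmid;
        };

        /* Sum one event for one RMID across all aggregators on a package */
        static u64 read_pkg_event(struct telem_aggregator *aggs, int num_aggs,
                                  unsigned int rmid, unsigned int event_idx)
        {
                u64 sum = 0;
                int i;

                for (i = 0; i < num_aggs; i++)
                        sum += readq(aggs[i].base +
                                     ((u64)rmid * aggs[i].events_per_rmid +
                                      event_idx) * 8);
                return sum;
        }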

Users will see the legacy monitoring files in the "L3" directories
and the telemetry files in "PKG" directories (with each file
providing the aggregated value from all OOBMSM collectors on that
package).

$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_PKG_00
│   ├── activity
│   ├── c1_res
│   ├── c6_res
│   ├── core_energy
│   ├── stalls_llc_hit
│   ├── stalls_llc_miss
│   ├── unhalted_core_cycles
│   ├── unhalted_ref_cycles
│   └── uops_retired
└── mon_PKG_01
    ├── activity
    ├── c1_res
    ├── c6_res
    ├── core_energy
    ├── stalls_llc_hit
    ├── stalls_llc_miss
    ├── unhalted_core_cycles
    ├── unhalted_ref_cycles
    └── uops_retired
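
Reading any of the new files returns the package-wide aggregated
count, e.g. (the value shown is illustrative):

$ cat /sys/fs/resctrl/mon_data/mon_PKG_00/uops_retired
1234567890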

Resctrl Implementation
----------------------

The OOBMSM driver exposes intel_pmt_get_regions_by_feature(), which
returns an array of structures describing the per-RMID groups it
found from the VSEC enumeration. Linux looks at the unique identifiers
for each group and enables resctrl for all groups with known unique
identifiers.

The memory map for the counters for each <RMID, event> pair is described
by the XML file. This is too unwieldy to use in the Linux kernel, so a
simplified representation is built into the resctrl code. Note that the
counters are in MMIO space rather than being accessed using the
IA32_QM_EVTSEL and IA32_QM_CTR MSRs. This means there is no need for
cross-processor calls to read counters from a CPU in a specific domain.
The counters can be read from any CPU.
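
The difference can be sketched as follows (counter_offset() is the
hypothetical helper from the sketch above; the MSR names are the
architectural ones):

        /* Legacy L3 events: must run on a CPU in the target domain */
        wrmsrl(MSR_IA32_QM_EVTSEL, evtsel);     /* RMID/event selection */
        rdmsrl(MSR_IA32_QM_CTR, val);

        /* Telemetry events: a plain MMIO load, valid from any CPU */
        val = readq(agg->base + counter_offset(rmid, event_idx,
                                               events_per_rmid));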

High level description of code changes:

1) New scope RESCTRL_PACKAGE
2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
3) Refactor monitor code paths to split existing L3 paths from new ones.
   In some cases this ends up with:
        switch (r->rid) {
        case RDT_RESOURCE_L3:
                /* L3 helper */
                break;
        case RDT_RESOURCE_PERF_PKG:
                /* PKG helper */
                break;
        }
4) New source code file "intel_aet.c" for the code to enumerate,
   configure, and report event counts.

With only one platform providing this feature, it's tricky to tell
exactly how it will evolve. I've made the event definitions
platform specific (based on the unique ID from the VSEC enumeration). It
seems possible/likely that the list of events may change from generation
to generation.

I've picked names for events based on the descriptions in the XML file.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (26):
  fs/resctrl: Simplify allocation of mon_data structures
  fs-x86/resctrl: Prepare for more monitor events
  fs/resctrl: Change how events are initialized
  fs/resctrl: Set up Kconfig options for telemetry events
  x86/resctrl: Fake OOBMSM interface
  fs-x86/resctrl: Improve domain type checking
  x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
  x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain
    types
  x86/resctrl: Change generic monitor functions to use struct
    rdt_domain_hdr
  fs/resctrl: Improve handling for events that can be read from any CPU
  fs/resctrl: Add support for additional monitor event display formats
  fs/resctrl: Add hook for architecture code to set monitor event
    attributes
  fs/resctrl: Add an architectural hook called for each mount
  x86/resctrl: Add first part of telemetry event enumeration
  x86/resctrl: Second stage of telemetry event enumeration
  x86/resctrl: Third phase of telemetry event enumeration
  x86/resctrl: Build a lookup table for each resctrl event id
  x86/resctrl: Add code to read core telemetry events
  x86/resctrl: Sanity check telemetry RMID values
  x86/resctrl: Add and initialize rdt_resource for package scope core
    monitor
  fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete
  fs/resctrl: Add type define for PERF_PKG files
  fs/resctrl: Add new telemetry event id and structures
  x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG
  fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events
  x86/resctrl: Update Documentation for package events

 Documentation/filesystems/resctrl.rst         |  38 +-
 include/linux/resctrl.h                       |  42 +-
 include/linux/resctrl_types.h                 |  36 +-
 arch/x86/include/asm/resctrl.h                |   8 +-
 .../cpu/resctrl/fake_intel_aet_features.h     |  73 ++++
 arch/x86/kernel/cpu/resctrl/internal.h        |  19 +-
 fs/resctrl/internal.h                         |  23 +-
 arch/x86/kernel/cpu/resctrl/core.c            | 199 ++++++---
 .../cpu/resctrl/fake_intel_aet_features.c     |  87 ++++
 arch/x86/kernel/cpu/resctrl/intel_aet.c       | 377 ++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c         |  39 +-
 fs/resctrl/ctrlmondata.c                      |  76 ++--
 fs/resctrl/monitor.c                          | 164 ++++++--
 fs/resctrl/rdtgroup.c                         | 266 ++++++------
 arch/x86/Kconfig                              |   1 +
 arch/x86/kernel/cpu/resctrl/Makefile          |   2 +
 drivers/platform/x86/intel/pmt/Kconfig        |   6 +
 17 files changed, 1125 insertions(+), 331 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
 create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c


base-commit: 89679ca4122a5a6622530490ef4855dd39baa54e
-- 
2.48.1

Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Posted by Reinette Chatre 3 weeks ago
Hi Tony,

On 4/7/25 4:40 PM, Tony Luck wrote:
> Previous version here:
> https://lore.kernel.org/all/20250321231609.57418-1-tony.luck@intel.com/
> 
> This series is based on James Morse's "fs/resctrl/" snapshot.

Would be helpful to provide a link to the snapshot used to avoid any uncertainty
about what base to use.

> 
> Background
> 
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOMMSM device to be
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.
> 
> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package.  In the initial
> hardware implementation, there are two categories of events:
> 

(missing the two categories of events)

> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
> 
> Offset: Counter
> 0x00	core energy for RMID 0
> 0x08	core activity for RMID 0
> 0x10	core energy for RMID 1
> 0x18	core activity for RMID 1
> 
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.
> 
> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).
> 
> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with mush lower overhead (no need to collect data

"mush" -> "much"

> on every context switch).
> 
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC
> 
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks
> 
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles
> 
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches
> 
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
> 
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
> 
> uops_retired - Counts the total number of uops retired
> 
> Enumeration
> 
> The only CPUID based enumeration for this feature is the legacy
> CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
> IA32_PQR_ASSOC MSR and the RMID field within it.
> 
> The OOBMSM driver discovers which features are present via
> PCIe VSEC capabilities. Each feature is tagged with a unique
> identifier. These identifiers indicate which XML description file from
> https://github.com/intel/Intel-PMT describes which event counters are
> available and their layout within the MMIO BAR space of the OOBMSM device.
> 
> Resctrl User Interface
> 
> Because there may be multiple OOBMSM collection agents per processor
> package, resctrl accumulates event counts from all agents on a package
> and presents a single value to users. This will provide a consistent
> user interface on future platforms that vary the number of collectors,
> or the mappings from logical CPUs to collectors.
> 
> Users will see the legacy monitoring files in the "L3" directories
> and the telemetry files in "PKG" directories (with each file

Now PERF_PKG?

> providing the aggregated value from all OOBMSM collectors on that
> package).
> 
> $ tree /sys/fs/resctrl/mon_data/
> /sys/fs/resctrl/mon_data/
> ├── mon_L3_00
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_PKG_00
> │   ├── activity
> │   ├── c1_res
> │   ├── c6_res
> │   ├── core_energy
> │   ├── stalls_llc_hit
> │   ├── stalls_llc_miss
> │   ├── unhalted_core_cycles
> │   ├── unhalted_ref_cycles
> │   └── uops_retired
> └── mon_PKG_01
>     ├── activity
>     ├── c1_res
>     ├── c6_res
>     ├── core_energy
>     ├── stalls_llc_hit
>     ├── stalls_llc_miss
>     ├── unhalted_core_cycles
>     ├── unhalted_ref_cycles
>     └── uops_retired
> 
> Resctrl Implementation
> 
> The OOBMSM driver exposes a function "intel_pmt_get_regions_by_feature()"

(nit: no need to use "a function" if using ())

> that returns an array of structures describing the per-RMID groups it
> found from the VSEC enumeration. Linux looks at the unique identifiers
> for each group and enables resctrl for all groups with known unique
> identifiers.
> 
> The memory map for the counters for each <RMID, event> pair is described
> by the XML file. This is too unwieldy to use in the Linux kernel, so a
> simplified representation is built into the resctrl code. Note that the
> counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
> and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> calls to read counters from a CPU in a specific domain. The counters
> can be read from any CPU.
> 
> High level description of code changes:
> 
> 1) New scope RESCTRL_PACKAGE
> 2) New struct rdt_resource RDT_RESOURCE_INTEL_PMT
> 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
>         switch (r->rid) {
>         case RDT_RESOURCE_L3:
>                 helper for L3
>                 break;
>         case RDT_RESOURCE_INTEL_PMT:
>                 helper for PKG
>                 break;
>         }
> 4) New source code file "intel_pmt.c" for the code to enumerate, configure, and report event counts.

Needs an update to match new version of this work.

> 
> With only one platform providing this feature, it's tricky to tell
> exactly where it is going to go. I've made the event definitions
> platform specific (based on the unique ID from the VSEC enumeration). It
> seems possible/likely that the list of events may change from generation
> to generation.
> 
> I've picked names for events based on the descriptions in the XML file.

One aspect that is only hinted at in the final documentation patch is
how users are expected to use this feature. As I understand it the number of
monitor groups supported by resctrl is still guided by the number of RMIDs
supported by L3 monitoring. This work hints that the telemetry feature may
not match that number of RMIDs and a monitor group may thus exist but
when a user attempts to read any of these perf files it will return
"unavailable".

The series attempts to address it by placing the number of RMIDs available
for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
group is not exposed to user space (unless debugging is enabled) the user does
not know whether a monitor group will support this feature or not. This seems awkward
to me. Why not limit the number of monitor groups that can be created to the
minimum number of RMIDs across these resources, like what is done for CLOSid?

Reinette

Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Posted by Luck, Tony 2 weeks, 5 days ago
On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
> One aspect that is only hinted to in the final documentation patch is
> how users are expected to use this feature. As I understand the number of
> monitor groups supported by resctrl is still guided by the number of RMIDs
> supported by L3 monitoring. This work hints that the telemetry feature may
> not match that number of RMIDs and a monitor group may thus exist but
> when a user attempts to ready any of these perf files it will return
> "unavailable".
> 
> The series attempts to address it by placing the number of RMIDs available
> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
> group is not exposed to user space (unless debugging enabled) the user does
> not know if a monitor group will support this feature or not. This seems awkward
> to me. Why not limit the number of monitor groups that can be created to the
> minimum number of RMIDs across these resources like what is done for CLOSid?

Reinette,

The mismatch between number of RMIDs supported by different components
is a thorny one, and may keep repeating since it feels like systems are
composed of a bunch of lego-like bricks snapped together from a box of
parts available to the h/w architect.

In this case we have three meanings for "number of RMIDs":

1) The number for legacy features enumerated by CPUID leaf 0xF.

2) The number of registers in MMIO space for each event. This is
enumerated in the XML files and is the value I placed into telem_entry::num_rmids.

3) The number of "h/w counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that
does describe the limitations) feeding to those MMIO registers. This is
enumerated in telemetry_region::num_rmids returned from the call to
intel_pmt_get_regions_by_feature()

If "1" is the smallest of these values, the OS will be limited in
which values can be written to the IA32_PQR_ASSOC MSR. Existing
code will do the right thing by limiting RMID allocation to this
value.

If "2" is greater than "1", then the extra MMIO registers will
sit unused.

If "2" is less than "1" my v3 returns the (problematic) -ENOENT
This can't happen in the CPU that debuts this feature, but the check
is there to prevent running past the end of the MMIO space in case
this does occur some day. I'll fix error path in next version to
make sure this end up with "Unavailable".

If "3" is less than "2" then the system will attach "h/w counters" to
MMIO registers in a "most recently used" algorithm. So if the number
of active RMIDs in some time interval is less than "3" the user will
get good values. But if the number of active RMIDs rises above "3"
then the user will see "Unavailable" returns as "h/w counters" are
reassigned to different RMIDs (making the feature really hard to use).

In the debut CPU the "energy" feature has sufficient "energy" counters
to avoid this, but not enough "perf" counters. I've pushed for this, and
the next CPU with the feature will have enough "h/w counters".

My proposal for v4:

Add new options to the "rdt=" kernel boot parameter for "energy"
and "perf".

Treat the case where there are not enough "h/w counters" as an erratum
and do not enable the feature. User can override with "rdt=perf"
if they want the counters for some special case where they limit
the number of simultaneous active RMIDs.

User can use "rdt=!energy,!perf" if they don't want to see the
clutter of all the new files in each mon_data directory.

I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
and add a "take min of all RMID limits". But since this is a "can't
happen" scenario I may skip this if it starts to get complicated.

Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
possible that some CPU implementation will have different MMIO
register counts for "perf" and "energy". It's more than possible
that the number of "h/w counters" will be different. But they share
the same info file. My v3 code reports the minimum of the number
of "h/w counters", which is the most conservative option. It tells
the user not to make more mon_data directories than this if they
want usable counters across *both* perf and energy. Though they
will actually keep getting good "energy" results even if they
go past this limit.
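
I.e. something like this sketch (names are hypothetical):

        /* Most conservative: report the smaller of the two counter pools */
        num_rmids = min(perf_region->num_rmids, energy_region->num_rmids);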

-Tony

>
Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Posted by Reinette Chatre 2 weeks, 4 days ago
Hi Tony,

On 4/21/25 11:57 AM, Luck, Tony wrote:
> On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
>> One aspect that is only hinted to in the final documentation patch is
>> how users are expected to use this feature. As I understand the number of
>> monitor groups supported by resctrl is still guided by the number of RMIDs
>> supported by L3 monitoring. This work hints that the telemetry feature may
>> not match that number of RMIDs and a monitor group may thus exist but
>> when a user attempts to ready any of these perf files it will return
>> "unavailable".
>>
>> The series attempts to address it by placing the number of RMIDs available
>> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
>> group is not exposed to user space (unless debugging enabled) the user does
>> not know if a monitor group will support this feature or not. This seems awkward
>> to me. Why not limit the number of monitor groups that can be created to the
>> minimum number of RMIDs across these resources like what is done for CLOSid?
> 
> Reinette,
> 
> The mismatch between number of RMIDs supported by different components
> is a thorny one, and may keep repeating since it feels like systems are
> composed of a bunch of lego-like bricks snapped together from a box of
> parts available to the h/w architect.

With resctrl needing to support multiple architectures' way of doing things,
needing to support variety within an architecture just seems like another step.

> 
> In this case we have three meanings for "number of RMIDs":
> 
> 1) The number for legacy features enumerated by CPUID leaf 0xF.
> 
> 2) The number of registers in MMIO space for each event. This is
> enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
> 
> 3) The number of "h/w counters" (this isn't a strictly accurate
> description of how things work, but serves as a useful analogy that
> does describe the limitations) feeding to those MMIO registers. This is
> enumerated in telemetry_region::num_rmids returned from the call to
> intel_pmt_get_regions_by_feature()

Thank you for explaining this. This was not clear to me.

> 
> If "1" is the smallest of these values, the OS will be limited in
> which values can be written to the IA32_PQR_ASSOC MSR. Existing
> code will do the right thing by limiting RMID allocation to this
> value.
> 
> If "2" is greater than "1", then the extra MMIO registers will
> sit unused.

This is also an issue with this implementation, no? resctrl will not
allow creating more monitor groups than "1".

> If "2" is less than "1" my v3 returns the (problematic) -ENOENT
> This can't happen in the CPU that debuts this feature, but the check
> is there to prevent running past the end of the MMIO space in case
> this does occur some day. I'll fix error path in next version to
> make sure this end up with "Unavailable".

This is a concern since this means the interface becomes a "try and see"
for user space. As I understand a later statement the idea is that
"2" should be used by user space to know how many "mon_groups" directories
should be created to get telemetry support. To me this looks to be
a space that will create a lot of confusion. The moment user space
creates "2" + 1 "mon_groups" directories it becomes a guessing game
of what any new monitor group actually supports. After crossing that
threshold I do not see a good way for going back since if user space
removes one "mon_data" directory it does get back to "2" but then needs to
rely on resctrl internals or debugging to know for sure what the new
monitor group supports.

> 
> If "3" is less than "2" then the system will attach "h/w counters" to
> MMIO registers in a "most recently used" algorithm. So if the number
> of active RMIDs in some time interval is less than "3" the user will
> get good values. But if the number of active RMIDs rises above "3"
> then the user will see "Unavailable" returns as "h/w counters" are
> reassigned to different RMIDs (making the feature really hard to use).

Could the next step be for the architecture to allow user space to
specify which hardware counters need to be assigned? With a new user
interface being created for such capability it may be worthwhile to
consider how it could be used/adapted for this feature. [1]

> 
> In the debut CPU the "energy" feature has sufficient "energy" counters
> to avoid this. But not enough "perf" counters. I've pushed and the
> next CPU with the feature will have enough "h/w counters".
> 
> My proposal for v4:
> 
> Add new options to the "rdt=" kernel boot parameter for "energy"
> and "perf".
> 
> Treat the case where there are not enough "h/w counters" as an erratum
> and do not enable the feature. User can override with "rdt=perf"
> if they want the counters for some special case where they limit
> the number of simultaneous active RMIDs.

This only seems to address the "3" is less than "2" issue. It is not
so obvious to me that it should be treated as an erratum. Although,
I could not tell from your description how obvious this issue will be
to user space. For example, is it clear that if user space
gets *any* value then it is "good" and "Unavailable" means ... "Unavailable", or
could a returned value mean "this is partial data that was collected
during timeframe with hardware counter re-assigned at some point"?

> 
> User can use "rdt=!energy,!perf" if they don't want to see the
> clutter of all the new files in each mon_data directory.
> 
> I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
> and add a "take min of all RMID limits". But since this is a "can't
> happen" scenario I may skip this if it starts to get complicated.

I do not think that the "2" is less than "1" scenario should be 
ignored for reasons stated above and in review of this version.

What if we enhance resctrl's RMID assignment (setting aside for
a moment PMG assignment) to be directed by user space?

Below is an idea of an interface that can give user space 
control over what monitor groups are monitoring. This is very likely not
the ideal interface but I would like to present it as a start for
better ideas.

For example, monitor groups are by default created with most abundant
(and thus supporting fewest features on fewest resources) RMID.
The user is then presented with a new file (within each monitor group)
that lists all available features and which one(s) are active. For example,
let's consider a hypothetical example where PERF_PKG perf has x RMID, PERF_PKG energy
has y RMID, and L3_MON has z RMID, with x < y < z. By default when user space
creates a monitor group resctrl will pick an "abundant" RMID from the range y + 1 to z
that only supports L3 monitoring:

# cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
[L3]
PERF_PKG:energy
PERF_PKG:perf

In above case there will be *no* mon_PERF_PKG_XX directories in 
/sys/fs/resctrl/mon_groups/m1/mon_data.

*If* user space wants perf/energy telemetry for this monitor
group then they can enable needed feature with clear understanding that
it is disruptive to all ongoing monitoring since a new RMID will be assigned.
For example, if user wants PERF_PKG:energy and PERF_PKG:perf then
user can do so with:

# echo PERF_PKG:perf > /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
# cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
[L3]
[PERF_PKG:energy]
[PERF_PKG:perf]

After the above all energy and perf files will appear in new mon_PERF_PKG_XX
directories.

User space can then have full control of what is monitored by which monitoring
group. If no RMIDs are available in a particular pool then user space can get
an "out of space" error and be the one to decide how it should be managed.

This also could be a way in which the "2" is larger than "1" scenario
can be addressed. 

> Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
> possible that some CPU implementation will have different MMIO
> register counts for "perf" and "energy". It's more than possible
> that number of "h/w counters" will be different. But they share the
> same info file. My v3 code reports the minimum of the number
> of "h/w counters" which is the most conservative option. It tells
> the user not to make more mon_data directories than this if they
> want usable counters across *both* perf and energy. Though they
> will actually keep getting good "energy" results even if then
> go past this limit.

num_rmids is a source of complications. It does not have a good equivalent
for MPAM and there have been a few attempts at proposing alternatives that may
be worth keeping in mind while making changes here:
https://lore.kernel.org/all/cbe665c2-fe83-e446-1696-7115c0f9fd76@arm.com/
https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/

> 
> -Tony
> 

Reinette


[1] https://lore.kernel.org/lkml/cover.1743725907.git.babu.moger@amd.com/

ps. I needed to go back and re-read the original cover-letter a couple of
times; while doing so I noticed one typo in the Background section: OOMMSM -> OOBMSM.
Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Posted by Luck, Tony 2 weeks, 4 days ago
On Mon, Apr 21, 2025 at 03:59:15PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 4/21/25 11:57 AM, Luck, Tony wrote:
> > On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
> >> One aspect that is only hinted to in the final documentation patch is
> >> how users are expected to use this feature. As I understand the number of
> >> monitor groups supported by resctrl is still guided by the number of RMIDs
> >> supported by L3 monitoring. This work hints that the telemetry feature may
> >> not match that number of RMIDs and a monitor group may thus exist but
> >> when a user attempts to ready any of these perf files it will return
> >> "unavailable".
> >>
> >> The series attempts to address it by placing the number of RMIDs available
> >> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
> >> group is not exposed to user space (unless debugging enabled) the user does
> >> not know if a monitor group will support this feature or not. This seems awkward
> >> to me. Why not limit the number of monitor groups that can be created to the
> >> minimum number of RMIDs across these resources like what is done for CLOSid?
> > 
> > Reinette,
> > 
> > The mismatch between number of RMIDs supported by different components
> > is a thorny one, and may keep repeating since it feels like systems are
> > composed of a bunch of lego-like bricks snapped together from a box of
> > parts available to the h/w architect.
> 
> With resctrl needing to support multiple architectures' way of doing things,
> needing to support variety within an architecture just seems like another step.
> 
> > 
> > In this case we have three meanings for "number of RMIDs":
> > 
> > 1) The number for legacy features enumerated by CPUID leaf 0xF.
> > 
> > 2) The number of registers in MMIO space for each event. This is
> > enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
> > 
> > 3) The number of "h/w counters" (this isn't a strictly accurate
> > description of how things work, but serves as a useful analogy that
> > does describe the limitations) feeding to those MMIO registers. This is
> > enumerated in telemetry_region::num_rmids returned from the call to
> > intel_pmt_get_regions_by_feature()
> 
> Thank you for explaining this. This was not clear to me.
> 
> > 
> > If "1" is the smallest of these values, the OS will be limited in
> > which values can be written to the IA32_PQR_ASSOC MSR. Existing
> > code will do the right thing by limiting RMID allocation to this
> > value.
> > 
> > If "2" is greater than "1", then the extra MMIO registers will
> > sit unused.
> 
> This is also an issue with this implementation, no? resctrl will not
> allow creating more monitor groups than "1".

On Intel there is no point in creating more groups than "1" allows.
You can't make use of any RMID above that limit because you will get
a #GP fault trying to write to the IA32_PQR_ASSOC MSR.

You could read the extra MMIO registers provided by "2", but they
will always be zero since no execution occurred with an RMID in the
range "1" ... "2".

The "2" is greater than "1" may be relatively common since the h/w
for the telemetry counters is common for SKUs with different numbers
of cores, and thus different values of "1". So low core count
systems will see more telemetry counters than they can actually
make use of. I will make sure not to print a message for this case.

> > If "2" is less than "1" my v3 returns the (problematic) -ENOENT
> > This can't happen in the CPU that debuts this feature, but the check
> > is there to prevent running past the end of the MMIO space in case
> > this does occur some day. I'll fix error path in next version to
> > make sure this end up with "Unavailable".
> 
> This is a concern since this means the interface becomes a "try and see"
> for user space. As I understand a later statement the idea is that
> "2" should be used by user space to know how many "mon_groups" directories
> should be created to get telemetry support. To me this looks to be
> a space that will create a lot of confusion. The moment user space
> creates "2" + 1 "mon_groups" directories it becomes a guessing game
> of what any new monitor group actually supports. After crossing that
> threshold I do not see a good way for going back since if user space
> removes one "mon_data" directory it does get back to "2" but then needs to
> rely on resctrl internals or debugging to know for sure what the new
> monitor group supports.

But I assert that it is a "can't happen" concern. "2" will be >= "1".
See below. I will look at addressing this, unless it gets crazy complex
because of the different enumeration timeline. Delaying calculation of
number of RMIDs until rdt_get_tree() as you have suggested may be the
right thing to do.

"3" is the real problem

> > 
> > If "3" is less than "2" then the system will attach "h/w counters" to
> > MMIO registers in a "most recently used" algorithm. So if the number
> > of active RMIDs in some time interval is less than "3" the user will
> > get good values. But if the number of active RMIDs rises above "3"
> > then the user will see "Unavailable" returns as "h/w counters" are
> > reassigned to different RMIDs (making the feature really hard to use).
> 
> Could the next step be for the architecture to allow user space to
> specify which hardware counters need to be assigned? With a new user
> interface being created for such capability it may be worthwhile to
> consider how it could be used/adapted for this feature. [1]
> 
> > 
> > In the debut CPU the "energy" feature has sufficient "energy" counters
> > to avoid this. But not enough "perf" counters. I've pushed and the
> > next CPU with the feature will have enough "h/w counters".
> > 
> > My proposal for v4:
> > 
> > Add new options to the "rdt=" kernel boot parameter for "energy"
> > and "perf".
> > 
> > Treat the case where there are not enough "h/w counters" as an erratum
> > and do not enable the feature. User can override with "rdt=perf"
> > if they want the counters for some special case where they limit
> > the number of simultaneous active RMIDs.
> 
> This only seems to address the "3" is less than "2" issue. It is not
> so obvious to me that it should be treated as an erratum. Although,
> I could not tell from your description how obvious this issue will be
> to user space. For example, is it clear that if user space
> gets *any* value then it is "good" and "Unavailable" means ... "Unavailable", or
> could a returned value mean "this is partial data that was collected
> during timeframe with hardware counter re-assigned at some point"?

When running jobs with more distinct RMIDs than "3", users are at the
mercy of the h/w replacement algorithm. Resctrl use cases for monitoring
are all "read an event counter; wait for some time; re-read the event
counter; compute the rate". With "h/w counter" reassignment the second
read may get "Unavailable", or worse the "h/w counter" may have been
taken and then returned, so a value will be provided to the user, but
it won't reflect the count of events since the first read.

That's why I consider this an erratum. There's just false hope that
you can get a pair of meaningful event counts and no sure indication
that you didn't get garbage.

> > 
> > User can use "rdt=!energy,!perf" if they don't want to see the
> > clutter of all the new files in each mon_data directory.
> > 
> > I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
> > and add a "take min of all RMID limits". But since this is a "can't
> > happen" scenario I may skip this if it starts to get complicated.
> 
> I do not think that the "2" is less than "1" scenario should be 
> ignored for reasons stated above and in review of this version.
> 
> What if we enhance resctrl's RMID assignment (setting aside for
> a moment PMG assignment) to be directed by user space?

I'll take a look at reducing user reported num_rmids to the minimum
of the "1" and "2" values.

> Below is an idea of an interface that can give user space 
> control over what monitor groups are monitoring. This is very likely not
> the ideal interface but I would like to present it as a start for
> better ideas.
> 
> For example, monitor groups are by default created with most abundant
> (and thus supporting fewest features on fewest resources) RMID.
> The user is then presented with a new file (within each monitor group)
> that lists all available features and which one(s) are active. For example,
> let's consider hypothetical example where PERF_PKG perf has x RMID, PERF_PKG energy
> has y RMID, and L3_MON has z RMID, with x < y < z. By default when user space
> creates a monitor group resctrl will pick "abundant" RMID from range y + 1 to z
> that only supports L3 monitoring:

There is no way for s/w to control the reallocation of "h/w counters"
when "3" is too small. So there is no set of RMIDs that support many
events vs. fewer events. AMD is solving a similar problem with their
scheme to pin h/w counters to specific RMIDs. I discussed such an option
for the "3" case, but it wasn't practical to apply to the upcoming CPU
that has this problem. The long term solution is to ensure that "3" is
always large enough that all RMIDs have equal monitoring capabilities.

> # cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
> [L3]
> PERF_PKG:energy
> PERF_PKG:perf
> 
> In above case there will be *no* mon_PERF_PKG_XX directories in 
> /sys/fs/resctrl/mon_groups/m1/mon_data.
> 
> *If* user space wants perf/energy telemetry for this monitor
> group then they can enable needed feature with clear understanding that
> it is disruptive to all ongoing monitoring since a new RMID will be assigned.
> For example, if user wants PERF_PKG:energy and PERF_PKG:perf then
> user can do so with:
> 
> # echo PERF_PKG:perf > /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
> # cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
> [L3]
> [PERF_PKG:energy]
> [PERF_PKG:perf]
> 
> After the above all energy and perf files will appear in new mon_PERF_PKG_XX
> directories.
> 
> User space can then have full control of what is monitored by which monitoring
> group. If no RMIDs are available in a particular pool then user space can get
> an "out of space" error and be the one to decide how it should be managed.
> 
> This also could be a way in which the "2" is larger than "1" scenario
> can be addressed. 
> 
> > Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
> > possible that some CPU implementation will have different MMIO
> > register counts for "perf" and "energy". It's more than possible
> > that number of "h/w counters" will be different. But they share the
> > same info file. My v3 code reports the minimum of the number
> > of "h/w counters" which is the most conservative option. It tells
> > the user not to make more mon_data directories than this if they
> > want usable counters across *both* perf and energy. Though they
> > will actually keep getting good "energy" results even if then
> > go past this limit.
> 
> num_rmids is a source of complications. It does not have a good equivalent
> for MPAM and there has been a few attempts at proposing alternatives that may
> be worth keeping in mind while making changes here:
> https://lore.kernel.org/all/cbe665c2-fe83-e446-1696-7115c0f9fd76@arm.com/
> https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
> 
> > 
> > -Tony
> > 
> 
> Reinette
> 
> 
> [1] https://lore.kernel.org/lkml/cover.1743725907.git.babu.moger@amd.com/
> 
> ps. I needed to go back an re-read the original cover-letter a couple of
> times, while doing so I noticed one typo in the Background section: OOMMSM -> OOBMSM.

Noted. Will fix.

-Tony
Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Posted by Reinette Chatre 2 weeks, 3 days ago
Hi Tony,

On 4/22/25 9:20 AM, Luck, Tony wrote:
> On Mon, Apr 21, 2025 at 03:59:15PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 4/21/25 11:57 AM, Luck, Tony wrote:
>>> On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
>>>> One aspect that is only hinted to in the final documentation patch is
>>>> how users are expected to use this feature. As I understand the number of
>>>> monitor groups supported by resctrl is still guided by the number of RMIDs
>>>> supported by L3 monitoring. This work hints that the telemetry feature may
>>>> not match that number of RMIDs and a monitor group may thus exist but
>>>> when a user attempts to ready any of these perf files it will return
>>>> "unavailable".
>>>>
>>>> The series attempts to address it by placing the number of RMIDs available
>>>> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
>>>> group is not exposed to user space (unless debugging enabled) the user does
>>>> not know if a monitor group will support this feature or not. This seems awkward
>>>> to me. Why not limit the number of monitor groups that can be created to the
>>>> minimum number of RMIDs across these resources like what is done for CLOSid?
>>>
>>> Reinette,
>>>
>>> The mismatch between number of RMIDs supported by different components
>>> is a thorny one, and may keep repeating since it feels like systems are
>>> composed of a bunch of lego-like bricks snapped together from a box of
>>> parts available to the h/w architect.
>>
>> With resctrl needing to support multiple architectures' way of doing things,
>> needing to support variety within an architecture just seems like another step.
>>
>>>
>>> In this case we have three meanings for "number of RMIDs":
>>>
>>> 1) The number for legacy features enumerated by CPUID leaf 0xF.
>>>
>>> 2) The number of registers in MMIO space for each event. This is
>>> enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
>>>
>>> 3) The number of "h/w counters" (this isn't a strictly accurate
>>> description of how things work, but serves as a useful analogy that
>>> does describe the limitations) feeding to those MMIO registers. This is
>>> enumerated in telemetry_region::num_rmids returned from the call to
>>> intel_pmt_get_regions_by_feature()
>>
>> Thank you for explaining this. This was not clear to me.
>>
>>>
>>> If "1" is the smallest of these values, the OS will be limited in
>>> which values can be written to the IA32_PQR_ASSOC MSR. Existing
>>> code will do the right thing by limiting RMID allocation to this
>>> value.
>>>
>>> If "2" is greater than "1", then the extra MMIO registers will
>>> sit unused.
>>
>> This is also an issue with this implementation, no? resctrl will not
>> allow creating more monitor groups than "1".
> 
> On Intel there is no point in creating more groups than "1" allows.
> You can't make use of any RMID above that limit because you will get
> a #GP fault trying to write to the IA32_PQR_ASSOC MSR.
> 
> You could read the extra MMIO registers provided by "2", but they
> will always be zero since no execution occurred with an RMID in the
> range "1" ... "2".
> 
> The "2" is greater than "1" may be relatively common since the h/w
> for the telemetry counters is common for SKUs with different numbers
> of cores, and thus different values of "1". So low core count
> systems will see more telemetry counters than they can actually
> make use of. I will make sure not to print a message for this case.

I see, thank you.

> 
>>> If "2" is less than "1" my v3 returns the (problematic) -ENOENT
>>> This can't happen in the CPU that debuts this feature, but the check
>>> is there to prevent running past the end of the MMIO space in case
>>> this does occur some day. I'll fix error path in next version to
>>> make sure this end up with "Unavailable".
>>
>> This is a concern since this means the interface becomes a "try and see"
>> for user space. As I understand a later statement the idea is that
>> "2" should be used by user space to know how many "mon_groups" directories
>> should be created to get telemetry support. To me this looks to be
>> a space that will create a lot of confusion. The moment user space
>> creates "2" + 1 "mon_groups" directories it becomes a guessing game
>> of what any new monitor group actually supports. After crossing that
>> threshold I do not see a good way for going back since if user space
>> removes one "mon_data" directory it does get back to "2" but then needs to
>> rely on resctrl internals or debugging to know for sure what the new
>> monitor group supports.
> 
> But I assert that it is a "can't happen" concern. "2" will be >= "1".
> See below. I will look at addressing this, unless it gets crazy complex
> because of the different enumeration timeline. Delaying calculation of
> number of RMIDs until rdt_get_tree() as you have suggested may be the
> right thing to do.

As you explain it, it does not sound as though calculating how many
RMIDs can be supported is required to be done in rdt_get_tree(), but doing
so would make the implementation more robust since it would not rely
on assumptions about what hardware can and will support.

> 
> "3" is the real problem
> 
>>>
>>> If "3" is less than "2" then the system will attach "h/w counters" to
>>> MMIO registers in a "most recently used" algorithm. So if the number
>>> of active RMIDs in some time interval is less than "3" the user will
>>> get good values. But if the number of active RMIDs rises above "3"
>>> then the user will see "Unavailable" returns as "h/w counters" are
>>> reassigned to different RMIDs (making the feature really hard to use).
>>
>> Could the next step be for the architecture to allow user space to
>> specify which hardware counters need to be assigned? With a new user
>> interface being created for such capability it may be worthwhile to
>> consider how it could be used/adapted for this feature. [1]
>>
>>>
>>> In the debut CPU the "energy" feature has sufficient "energy" counters
>>> to avoid this. But not enough "perf" counters. I've pushed and the
>>> next CPU with the feature will have enough "h/w counters".
>>>
>>> My proposal for v4:
>>>
>>> Add new options to the "rdt=" kernel boot parameter for "energy"
>>> and "perf".
>>>
>>> Treat the case where there are not enough "h/w counters" as an erratum
>>> and do not enable the feature. User can override with "rdt=perf"
>>> if they want the counters for some special case where they limit
>>> the number of simultaneous active RMIDs.

I get this now. This will require rework of the kernel command line parsing
support since the current implementation is so closely integrated with the
X86_FEATURE_* flags (and is perhaps an unexpected architecture specific
portion of resctrl).
What if "rdt=perf" means that "3" is also included in the computation
of how many monitor groups are supported? That would help users to not
need to limit the number of simultaneous active RMIDs.

>>
>> This only seems to address the "3" is less than "2" issue. It is not
>> so obvious to me that it should be treated as an erratum. Although,
>> I could not tell from your description how obvious this issue will be
>> to user space. For example, is it clear that if user space
>> gets *any* value then it is "good" and "Unavailable" means ... "Unavailable", or
>> could a returned value mean "this is partial data that was collected
>> during timeframe with hardware counter re-assigned at some point"?
> 
> When running jobs with more distinct RMIDs than "3" users are at the
> mercy of the h/w replacement algorithm. Resctrl use cases for monitoring
> are all "read an event counter; wait for some time; re-read the event
> counter; compute the rate". With "h/w counter" reassignment the second
> read may get "Unavailable", or worse the "h/w counter" may have been
> taken, and the returned so a value will be provided to the user, but
> it won't provide the count of events since the first read.
> 
> That's why I consider this an erratum. There's just false hope that
> you can get a pair of meaningful event counts and no sure indication
> that you didn't get garbage.
> 
>>>
>>> User can use "rdt=!energy,!perf" if they don't want to see the
>>> clutter of all the new files in each mon_data directory.
>>>
>>> I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
>>> and add a "take min of all RMID limits". But since this is a "can't
>>> happen" scenario I may skip this if it starts to get complicated.
>>
>> I do not think that the "2" is less than "1" scenario should be 
>> ignored for reasons stated above and in review of this version.
>>
>> What if we enhance resctrl's RMID assignment (setting aside for
>> a moment PMG assignment) to be directed by user space?
> 
> I'll take a look at reducing user reported num_rmids to the minimum
> of the "1" and "2" values.

When comparing to "num_closids" the expectation may be that "num_rmids"
would be accurate for a particular resource, with the understanding that the
minimum among all resources guides the number of monitor groups. This
seems close enough to the existing interface to not use this as the moment
to move to a new "num_mon_hw_id" or such that works for MPAM also.

> 
>> Below is an idea of an interface that can give user space 
>> control over what monitor groups are monitoring. This is very likely not
>> the ideal interface but I would like to present it as a start for
>> better ideas.
>>
>> For example, monitor groups are by default created with most abundant
>> (and thus supporting fewest features on fewest resources) RMID.
>> The user is then presented with a new file (within each monitor group)
>> that lists all available features and which one(s) are active. For example,
>> let's consider hypothetical example where PERF_PKG perf has x RMID, PERF_PKG energy
>> has y RMID, and L3_MON has z RMID, with x < y < z. By default when user space
>> creates a monitor group resctrl will pick "abundant" RMID from range y + 1 to z
>> that only supports L3 monitoring:
> 
> There is no way for s/w to control the reallocation of "h/w counters"
> when "3" is too small. So there is no set of RMIDs that support many
> events vs. fewer events. AMD is solving this similar problem with their
> scheme to pin h/w counters to specific RMIDs. I discussed such an option
> for the "3" case, but it wasn't practical to apply to the upcoming CPU
> that has this problem. The long term solution is to ensure that "3" is
> always large enough that all RMIDs have equal monitoring capabilities.

ack.

Reinette
Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Posted by Reinette Chatre 3 weeks ago
Hi Tony,

Just noticed ... could you please include x86 maintainers
(X86 ARCHITECTURE (32-BIT AND 64-BIT) in MAINTAINERS) in your
submission?

Thank you

Reinette