.../admin-guide/kernel-parameters.txt | 2 +- Documentation/filesystems/resctrl.rst | 53 ++- include/linux/resctrl.h | 84 +++- include/linux/resctrl_types.h | 26 +- arch/x86/include/asm/resctrl.h | 16 - arch/x86/kernel/cpu/resctrl/internal.h | 31 +- fs/resctrl/internal.h | 56 ++- arch/x86/kernel/cpu/resctrl/core.c | 333 ++++++++++---- arch/x86/kernel/cpu/resctrl/intel_aet.c | 411 ++++++++++++++++++ arch/x86/kernel/cpu/resctrl/monitor.c | 78 ++-- fs/resctrl/ctrlmondata.c | 130 +++++- fs/resctrl/monitor.c | 267 +++++++----- fs/resctrl/rdtgroup.c | 253 +++++++---- arch/x86/Kconfig | 5 +- arch/x86/kernel/cpu/resctrl/Makefile | 1 + 15 files changed, 1327 insertions(+), 419 deletions(-) create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
These patches are based v6.16-rc3 plus David Box's V2 series for "Intel VSEC/PMT: Introduce Discovery Driver" posted here: Link: https://lore.kernel.org/all/20250617014041.2861032-1-david.e.box@linux.intel.com/ I've pushed "v6.16-rc3 + David's patches" to: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git davidboxv2 The total set (v6.16-rc3 + David's patches + this series) is here: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git rdt-aet-v6 The first four patches of this series are shared with Babu's ABMC series. Perhaps if all issues in these four patches are resolved here these patches could move to upstream? Note that patches 0017 onwards of this V6 series depend on David's patches so can't go upstream until that series is merged. Changes since v5 was posted here: Link: https://lore.kernel.org/all/20250521225049.132551-1-tony.luck@intel.com/ Change map indexed by patch numbers in v5. Some patches have been merged, split, dropped, or re-ordered. The v6 patch numbers are referred to by their 4-digit git format-patch numbers in an attempt to avoid confusion. --- 1 --- Fixed extra space in commit message Changed to consistent use of "eventid" for variables/arguments of type "enum resctrl_event_id" in routines touched by this series. Added resctrl_event_id::QOS_FIRST_EVENT and used it as lower bound when looping over all events. Added #define for_each_mon_event() to iterate over all events. s/Description of a monitor event/Properties of a monitor event/ --- 2 --- Use QOS_FIRST_EVENT as lower bounds check in resctrl_is_mon_event_enabled(). --- 3 --- Replaced changelog with Reinette's improved version. --- 4 --- Add for_each_mbm_event_id() helper to iterate over MBM events. Peter: Use sizeof(*hw_dom->arch_mbm_states[0]) instead of sizeof(struct arch_mbm_state) (ditto for "struct mbm_state" instance). Replaced kerneldoc description of rdt_mon_domain::mbm_states with Reinette's improved version. Ditto for rdt_hw_mon_domain::arch_mbm_state. Fixed resctrl_arch_reset_rmid_all() to use same coding pattern of looping on enabled events instead of checking if rdt_hw_mon_domain::arch_mbm_state has been allocated. In get_mbm_state() use local variable name "state" (singular) to match other code patterns. Drop duplicate @arch_mbm_states: in kerneldoc for struct rdt_hw_mon_domain. Drop " or combined CLOSID, RMID on Arm" from @arch_mbm_states description. Use sizeof(*hw_dom->arch_mbm_states[0]) in resctrl_arch_reset_rmid_all(). Add new macro for_each_mbm_idx() (suggested by Fenghua). Use it in mon_domain_free() and domain_destroy_mon_state(). Also to avoid landmines in "cleanup:" code in arch_domain_mbm_alloc() and domain_setup_mon_state(). --- 5 --- Patch dropped. No need for fake definitions of OOBMSM access routines and structures because the real patches have been posted by David Box. Version 2 of his series here: https://lore.kernel.org/all/20250617014041.2861032-1-david.e.box@linux.intel.com/ --- 6 --- Now 0005 in V6 series. Updated the domain_header_is_valid() check in rdtgroup_mondata_show() to explicitly test for RDT_RESOURCE_L3 since at this point in the series only the L3 resource is possible, or valid. Similar change in subsequent patches where routines to process only struct rdt_mon_domain make consistency checks. --- 7 --- Moved later to patch 0011 when it becomes a little clearer which renames are useful, and which may be just noise. Functions that now take a "struct rdt_l3_mon_domain" argument are obviously for L3 only and don't need a rename to make that clear. --- 8 --- Now patch 0006. No comments received on V5 version of this patch. --- 9 --- Now patch 0007. s/goto done/goto out_unlock/ Changes to make symmetric cleanups to domain_remove_cpu_ctrl() to match the code flow changes that were made to domain_remove_cpu_mon() split out into new patch 0008. --- 10 --- Now patch 0009. Anil reported that "d" was used uninitialized in domain_remove_cpu_ctrl(). Fenghua reported an error path that did not unlock the rdtgroup_mutex. Reinette reported change unrelated to commit message. Moved this to new patch 0008 (now with change log). Several domain_header_is_valid() checks now check for RDT_RESOURCE_L3. Updated kerneldoc comments for struct mon_data to says that @sum is only used for RDT_RESOURCE_L3. Updated kerneldoc for mon_get_kn_priv() @do_sum to say it is only meaningful for L3 domain and added a WARN_ON_ONCE() if this some other resource tried to set do_sum. More functions changed to pass struct rdt_domain_hdr. Specifically the call chain from rdtgroup_mondata_show() down to resctrl_arch_rmid_read() (via the smp_call*()) so rmid_read::d has been replaced by rmid_read::hdr. --- 11 --- Now patch 0010.Completing this patch before the function renaming (was patch 7, now 0011) makes it clearer where renames are useful. --- 12 --- No comments received for this patch (patch numbers now aligned as this is 0012 in V6). --- 13 --- Updated commit comment with better text from Reinette. s/goto done/goto out_ctx_free/ Changed polarity and name of helper function from cpu_on_wrong_domain() to cpu_on_correct_domain() to avoid double negatives. --- 14 --- Add mon_evt::is_floating_point set by resctrl file system code to limit which events architecture code can request be displayed in floating point. Simplified the fixed-point to floating point algorithm. Reinette is correct that the additional "lshift" and "rshift" operations are not required. All that is needed is to multiply the fixed point fractional part by 10**decimal_places, add a rounding amount equivalent to a "1" in the binary place after those supplied. Finally divide by 2**binary_places (with a right shift). Explained in commit comment how I chose the number of decimal places to use for each binary places value. N.B. Dave Martin expressed an opinion that the kernel should not do this conversion. Instead it should enumerate the scaling factor for each event where hardware reported a fixed point value. This patch could be dropped and replaced with one to enumerate scaling factors per event if others agree with Dave. --- 15 --- Initialize atomic in resctrl_arch_pre_mount() using ATOMIC_INIT(0). --- 16 --- No comments received on this patch. --- 17 --- Changed comment in struct event_group from " Data fields used by this code." to "Data fields for additional structures to manage this group." Removed line continuations for DEFINE_FREE(). Though checkpatch is still not happy. My following line is a "if" statement. Checkpatch wants this to both 1) Line vertically with the "(" on preceeding line, and 2) Not use <TAB> followed by some <SPACE> characters to make that happen. Refactored the interface between get_pmt_feature() and configure_event() per-Reinette's suggestion to avoid both functions looping through the p->count entries in the pmt_feature_group structure. Kconfig changes here ensure that David's INTEL_PMT_TELEMETRY code is built-in to the kernel so it can be used by resctrl. --- 18 --- Define a macro XML_MMIO_SIZE() as a way to document the hard-coded numbers used to calculate the expected size of the mmio region. If the MMIO size reported by intel_pmt_get_regions_by_feature() is smaller than expected, print that size as part of the warning message. --- 19 --- Rename mmio_info::count to mmio_info::num_regions. Fix typo s/[0]/[1]/g in ascii art commit message structure diagram. Take better suggestions for kerneldoc descriptions of mmio_info structure and the @num_regions and @addrs fields. Add a period for event_group::pkginfo kerneldoc description. Fix declaration of **pkginfo in configure_events() Add "_once" for "Duplicate telemetry" warning. --- 20 --- Shorten field names pmt_event::evtid -> pmt_event::id and pmt_event:evt_idx becomes pmt_event::idx. Fenghua: Align each of th struct event_group on the "=". Anil: Use a macro to initialize entries in mon_event_all[]. This could be done in patch 1, but with only three events at that point the visual clutter wasn't too awful. --- 21 --- Split into: 0021: Add mon_evt::arch_priv void pointer that can be set by architectural code when enabling an event, and passed through to resctrl_arch_rmid_read() 0022: Code to set the arch_priv pointer and use it find the containing struct event_group for the parameters to read counter from MMIO space. Note that due to changes in part 0009 resctrl_arch_rmid_read() takes a struct rdt_domain_hdr argument. --- 22 --- Now 0023: Keep "default:" as last option in switch in domain_remove_cpu_mon(). Comments about "goto mkdir" (was "goto do_mkdir" covered by change in patch 0009. --- 23 --- Moved to patch 0027. No comments received for this patch. But one small change. Now sets r->mon_capable = true; in intel_aet_get_events() so this is done before the calculation of the minimum of RMIDs supported in part 0025. --- 24 --- Changed to address Reinette's point that initial implementation would not work in the same way as other boot choices. Specifically if a quirk disables a feature because of an erratum, the user should be able to override from the command line and use it anyway. This patch provides the option for user to disable a telemetry feature from the command line. The force enable option moved to next patch where it is used. --- 25 --- Improved commit comment per-Reinette suggestion. s/Will be adjusted/Adjusted/ in kerneldoc for event_group::num_rmids Improved text for comment on not configuring a telemetry feature that has fewer RMIDs than supported by IA32_PQR_ASSOC. Second part of command line implementation is here to allow user to override the fewer RMIDs issue and use a resource anyway. --- 26 --- Back in sync with patch 0026. No comments received for this patch. --- 27 --- Changed from providing a mechanism for architecture code to create a custom "info/{resource}" file to providing a debugfs directory for use by a monitor resource. Discussion on the name of the directory fizzled out. I've gone with: /sys/kernel/debug/resctrl/info/{resource}_MON/{utsname()->machine} --- 28 --- No comments on this patch. Changed to create one debugfs file for each value from each aggregator instance. --- 29 --- No comments on this patch. Background ---------- Telemetry features are being implemented in conjunction with the IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send counts for various events to a collector in a nearby OOBMSM device to be accumulated with counts for each <RMID, event> pair received from other CPUs. Cores send event counts when the RMID value changes, or after each 2ms elapsed time. Each OOBMSM device may implement multiple event collectors with each servicing a subset of the logical CPUs on a package. In the initial hardware implementation, there are two categories of events: energy and perf. 1) Energy - Two counters core_energy: This is an estimate of Joules consumed by each core. It is calculated based on the types of instructions executed, not from a power meter. This counter is useful to understand how much energy a workload is consuming. activity: This measures "accumulated dynamic capacitance". Users who want to optimize energy consumption for a workload may use this rather than core_energy because it provides consistent results independent of any frequency or voltage changes that may occur during the runtime of the application (e.g. entry/exit from turbo mode). 2) Performance - Seven counters These are similar events to those available via the Linux "perf" tool, but collected in a way with much lower overhead (no need to collect data on every context switch). stalls_llc_hit - Counts the total number of unhalted core clock cycles when the core is stalled due to a demand load miss which hit in the LLC c1_res - Counts the total C1 residency across all cores. The underlying counter increments on 100MHz clock ticks unhalted_core_cycles - Counts the total number of unhalted core clock cycles stalls_llc_miss - Counts the total number of unhalted core clock cycles when the core is stalled due to a demand load miss which missed all the local caches c6_res - Counts the total C6 residency. The underlying counter increments on crystal clock (25MHz) ticks unhalted_ref_cycles - Counts the total number of unhalted reference clock (TSC) cycles uops_retired - Counts the total number of uops retired The counters are arranged in groups in MMIO space of the OOBMSM device. E.g. for the energy counters the layout is: Offset: Counter 0x00 core energy for RMID 0 0x08 core activity for RMID 0 0x10 core energy for RMID 1 0x18 core activity for RMID 1 ... Enumeration ----------- The only CPUID based enumeration for this feature is the legacy CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the IA32_PQR_ASSOC MSR and the RMID field within it. The OOBMSM driver discovers which features are present via PCIe VSEC capabilities. Each feature is tagged with a unique identifier. These identifiers indicate which XML description file from https://github.com/intel/Intel-PMT describes which event counters are available and their layout within the MMIO BAR space of the OOBMSM device. Resctrl User Interface ---------------------- Because there may be multiple OOBMSM collection agents per processor package, resctrl accumulates event counts from all agents on a package and presents a single value to users. This will provide a consistent user interface on future platforms that vary the number of collectors, or the mappings from logical CPUs to collectors. Users will continue to see the legacy monitoring files in the "L3" directories and the telemetry files in the new "PERF_PKG" directories (with each file providing the aggregated value from all OOBMSM collectors on that package). $ tree /sys/fs/resctrl/mon_data/ /sys/fs/resctrl/mon_data/ ├── mon_L3_00 │ ├── llc_occupancy │ ├── mbm_local_bytes │ └── mbm_total_bytes ├── mon_L3_01 │ ├── llc_occupancy │ ├── mbm_local_bytes │ └── mbm_total_bytes ├── mon_PERF_PKG_00 │ ├── activity │ ├── c1_res │ ├── c6_res │ ├── core_energy │ ├── stalls_llc_hit │ ├── stalls_llc_miss │ ├── unhalted_core_cycles │ ├── unhalted_ref_cycles │ └── uops_retired └── mon_PERF_PKG_01 ├── activity ├── c1_res ├── c6_res ├── core_energy ├── stalls_llc_hit ├── stalls_llc_miss ├── unhalted_core_cycles ├── unhalted_ref_cycles └── uops_retired Resctrl Implementation ---------------------- The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()" that returns an array of structures describing the per-RMID groups it found from the VSEC enumeration. Linux looks at the unique identifiers for each group and enables resctrl for all groups with known unique identifiers. The memory map for the counters for each <RMID, event> pair is described by the XML file. This is too unwieldy to use in the Linux kernel, so a simplified representation is built into the resctrl code. Note that the counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL and IA32_QM_CTR MSRs. This means there is no need for cross-processor calls to read counters from a CPU in a specific domain. The counters can be read from any CPU. High level description of code changes: 1) New scope RESCTRL_PACKAGE 2) New struct rdt_resource RDT_RESOURCE_PERF_PKG 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with: switch (r->rid) { case RDT_RESOURCE_L3: helper for L3 break; case RDT_RESOURCE_PERF_PKG: helper for PKG break; } 4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts. With only one platform providing this feature, it's tricky to tell exactly where it is going to go. I've made the event definitions platform specific (based on the unique ID from the VSEC enumeration). It seems possible/likely that the list of events may change from generation to generation. I've picked names for events based on the descriptions in the XML file. Signed-off-by: Tony Luck <tony.luck@intel.com> Tony Luck (30): x86,fs/resctrl: Consolidate monitor event descriptions x86,fs/resctrl: Replace architecture event enabled checks x86/resctrl: Remove 'rdt_mon_features' global variable x86,fs/resctrl: Prepare for more monitor events x86,fs/resctrl: Improve domain type checking x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types x86/resctrl: Clean up domain_remove_cpu_ctrl() x86,fs/resctrl: Use struct rdt_domain_hdr instead of struct rdt_mon_domain x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain x86,fs/resctrl: Rename some L3 specific functions fs/resctrl: Make event details accessible to functions when reading events x86,fs/resctrl: Handle events that can be read from any CPU x86,fs/resctrl: Support binary fixed point event counters x86,fs/resctrl: Add an architectural hook called for each mount x86,fs/resctrl: Add and initialize rdt_resource for package scope core monitor x86/resctrl: Discover hardware telemetry events x86/resctrl: Count valid telemetry aggregators per package x86/resctrl: Complete telemetry event enumeration x86,fs/resctrl: Fill in details of Clearwater Forest events x86,fs/resctrl: Add architectural event pointer x86/resctrl: Read core telemetry events x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG x86/resctrl: Add energy/perf choices to rdt boot option x86/resctrl: Handle number of RMIDs supported by telemetry resources x86,fs/resctrl: Move RMID initialization to first mount x86/resctrl: Enable RDT_RESOURCE_PERF_PKG fs/resctrl: Provide interface to create a debugfs info directory x86/resctrl: Add debug info/PERF_PKG_MON/status files x86,fs/resctrl: Update Documentation for package events .../admin-guide/kernel-parameters.txt | 2 +- Documentation/filesystems/resctrl.rst | 53 ++- include/linux/resctrl.h | 84 +++- include/linux/resctrl_types.h | 26 +- arch/x86/include/asm/resctrl.h | 16 - arch/x86/kernel/cpu/resctrl/internal.h | 31 +- fs/resctrl/internal.h | 56 ++- arch/x86/kernel/cpu/resctrl/core.c | 333 ++++++++++---- arch/x86/kernel/cpu/resctrl/intel_aet.c | 411 ++++++++++++++++++ arch/x86/kernel/cpu/resctrl/monitor.c | 78 ++-- fs/resctrl/ctrlmondata.c | 130 +++++- fs/resctrl/monitor.c | 267 +++++++----- fs/resctrl/rdtgroup.c | 253 +++++++---- arch/x86/Kconfig | 5 +- arch/x86/kernel/cpu/resctrl/Makefile | 1 + 15 files changed, 1327 insertions(+), 419 deletions(-) create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c base-tree: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git base-branch: davidboxv2 base-commit: 4742bf1fab91403ca48efc45f7f7fd68a156a955 -- 2.49.0
Bother. Just got e-mail after posting v6 from lkp. Apparently I applied the fixes to avoid "'d' used before set" in domain_remove_cpu_ctrl() and domain_remove_cpu_mon() to some other branch than the one that made it to my final version. Please imagine the hunks below merged into patches 7 & 8. -Tony --- diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 3ec8fbd2f778..39cee572a121 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -651,8 +651,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r) if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid)) return; - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask); - if (!cpumask_empty(&d->hdr.cpu_mask)) + cpumask_clear_cpu(cpu, &hdr->cpu_mask); + if (!cpumask_empty(&hdr->cpu_mask)) return; d = container_of(hdr, struct rdt_ctrl_domain, hdr); @@ -696,8 +696,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r) if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid)) return; - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask); - if (!cpumask_empty(&d->hdr.cpu_mask)) + cpumask_clear_cpu(cpu, &hdr->cpu_mask); + if (!cpumask_empty(&hdr->cpu_mask)) return; switch (r->rid) {
On Thu, Jun 26, 2025 at 05:26:46PM -0700, Luck, Tony wrote: > Bother. Just got e-mail after posting v6 from lkp. Apparently > I applied the fixes to avoid "'d' used before set" in > domain_remove_cpu_ctrl() and domain_remove_cpu_mon() to some > other branch than the one that made it to my final version. > > Please imagine the hunks below merged into patches 7 & 8. I merged these changes back into the series and pushed the updated version to git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git branch rdt-aet-v6. > > -Tony > > --- > > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c > index 3ec8fbd2f778..39cee572a121 100644 > --- a/arch/x86/kernel/cpu/resctrl/core.c > +++ b/arch/x86/kernel/cpu/resctrl/core.c > @@ -651,8 +651,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r) > if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid)) > return; > > - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask); > - if (!cpumask_empty(&d->hdr.cpu_mask)) > + cpumask_clear_cpu(cpu, &hdr->cpu_mask); > + if (!cpumask_empty(&hdr->cpu_mask)) > return; > > d = container_of(hdr, struct rdt_ctrl_domain, hdr); > @@ -696,8 +696,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r) > if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid)) > return; > > - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask); > - if (!cpumask_empty(&d->hdr.cpu_mask)) > + cpumask_clear_cpu(cpu, &hdr->cpu_mask); > + if (!cpumask_empty(&hdr->cpu_mask)) > return; > > switch (r->rid) {
Hi Tony and Dave, On 6/26/25 9:49 AM, Tony Luck wrote: > --- 14 --- > Add mon_evt::is_floating_point set by resctrl file system code to limit > which events architecture code can request be displayed in floating point. > > Simplified the fixed-point to floating point algorithm. Reinette is > correct that the additional "lshift" and "rshift" operations are not > required. All that is needed is to multiply the fixed point fractional > part by 10**decimal_places, add a rounding amount equivalent to a "1" > in the binary place after those supplied. Finally divide by 2**binary_places > (with a right shift). > > Explained in commit comment how I chose the number of decimal places to > use for each binary places value. > > N.B. Dave Martin expressed an opinion that the kernel should not do > this conversion. Instead it should enumerate the scaling factor for > each event where hardware reported a fixed point value. This patch > could be dropped and replaced with one to enumerate scaling factors > per event if others agree with Dave. Could resctrl accommodate both usages? For example, it does not look too invasive to add a second file <mon_evt::name>.raw for the mon_evt::is_floating_point events that can output something like Dave suggested in [1]: .raw file format could be: #format:<output that depends on format> #fixed-point:<value>/<scaling factor> Example output: fixed-point:0x60000/0x40000 Reinette [1] https://lore.kernel.org/lkml/aEhMWBemtev%2Ff3yf@e133380.arm.com/
On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote: > Hi Tony and Dave, > > On 6/26/25 9:49 AM, Tony Luck wrote: > > --- 14 --- > > Add mon_evt::is_floating_point set by resctrl file system code to limit > > which events architecture code can request be displayed in floating point. > > > > Simplified the fixed-point to floating point algorithm. Reinette is > > correct that the additional "lshift" and "rshift" operations are not > > required. All that is needed is to multiply the fixed point fractional > > part by 10**decimal_places, add a rounding amount equivalent to a "1" > > in the binary place after those supplied. Finally divide by 2**binary_places > > (with a right shift). > > > > Explained in commit comment how I chose the number of decimal places to > > use for each binary places value. > > > > N.B. Dave Martin expressed an opinion that the kernel should not do > > this conversion. Instead it should enumerate the scaling factor for > > each event where hardware reported a fixed point value. This patch > > could be dropped and replaced with one to enumerate scaling factors > > per event if others agree with Dave. > > Could resctrl accommodate both usages? For example, it does not > look too invasive to add a second file <mon_evt::name>.raw for the > mon_evt::is_floating_point events that can output something like Dave > suggested in [1]: > > .raw file format could be: > #format:<output that depends on format> > #fixed-point:<value>/<scaling factor> > > Example output: > fixed-point:0x60000/0x40000 Dave: Is that what you want in the ".raw" file? An alternative would be to put the format information for non-integer events into an "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?) and just put the raw value into the ".raw" file under mon_data. > > Reinette > > [1] https://lore.kernel.org/lkml/aEhMWBemtev%2Ff3yf@e133380.arm.com/ -Tony
On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote: > On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote: > > Hi Tony and Dave, > > > > On 6/26/25 9:49 AM, Tony Luck wrote: > > > --- 14 --- > > > Add mon_evt::is_floating_point set by resctrl file system code to limit > > > which events architecture code can request be displayed in floating point. > > > > > > Simplified the fixed-point to floating point algorithm. Reinette is > > > correct that the additional "lshift" and "rshift" operations are not > > > required. All that is needed is to multiply the fixed point fractional > > > part by 10**decimal_places, add a rounding amount equivalent to a "1" > > > in the binary place after those supplied. Finally divide by 2**binary_places > > > (with a right shift). > > > > > > Explained in commit comment how I chose the number of decimal places to > > > use for each binary places value. > > > > > > N.B. Dave Martin expressed an opinion that the kernel should not do > > > this conversion. Instead it should enumerate the scaling factor for > > > each event where hardware reported a fixed point value. This patch > > > could be dropped and replaced with one to enumerate scaling factors > > > per event if others agree with Dave. > > > > Could resctrl accommodate both usages? For example, it does not > > look too invasive to add a second file <mon_evt::name>.raw for the > > mon_evt::is_floating_point events that can output something like Dave > > suggested in [1]: > > > > .raw file format could be: > > #format:<output that depends on format> > > #fixed-point:<value>/<scaling factor> > > > > Example output: > > fixed-point:0x60000/0x40000 > > Dave: Is that what you want in the ".raw" file? An alternative would be > to put the format information for non-integer events into an > "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?) > and just put the raw value into the ".raw" file under mon_data. Note that I thought it easier for users to keep the raw file to just showing a value, rather than including the formatting details in Reinette's proposal. Patch to implement my alternative suggestion below. To the user things look like this: $ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01 $ cat core_energy 0.02203 $ cat core_energy.raw 5775 $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale core_energy 262144 activity 262144 $ bc -ql 5775 / 262144 .02202987670898437500 If this seems useful I can write up a commit message and include as its own patch in v7. Suggestions for better names? -Tony --- diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index 4704ea7228ca..5ac4e3c98f23 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -90,6 +90,8 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS]; * the event file belongs. When @sum is one this * is the id of the L3 cache that all domains to be * summed share. + * @raw: Set for ".raw" files that directly show hardware + * provided counts with no interpretation. * * Pointed to by the kernfs kn->priv field of monitoring event files. * Readers and writers must hold rdtgroup_mutex. @@ -100,6 +102,7 @@ struct mon_data { struct mon_evt *evt; int domid; bool sum; + bool raw; }; /** diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index 29de0e380ccc..78e7af296d5a 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -753,7 +753,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) seq_puts(m, "Error\n"); else if (rr.err == -EINVAL) seq_puts(m, "Unavailable\n"); - else if (evt->binary_bits == 0) + else if (md->raw || evt->binary_bits == 0) seq_printf(m, "%llu\n", rr.val); else print_event_value(m, evt->binary_bits, rr.val); diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 511362a67532..97786831722a 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -1158,6 +1158,21 @@ static int rdt_mon_features_show(struct kernfs_open_file *of, return 0; } +static int rdt_mon_features_raw_scale_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = rdt_kn_parent_priv(of->kn); + struct mon_evt *mevt; + + for_each_mon_event(mevt) { + if (mevt->rid != r->rid || !mevt->enabled || !mevt->binary_bits) + continue; + seq_printf(seq, "%s %u\n", mevt->name, 1 << mevt->binary_bits); + } + + return 0; +} + static int rdt_bw_gran_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { @@ -1823,6 +1838,13 @@ static struct rftype res_common_files[] = { .seq_show = rdt_mon_features_show, .fflags = RFTYPE_MON_INFO, }, + { + .name = "mon_features_raw_scale", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_features_raw_scale_show, + .fflags = RFTYPE_MON_INFO, + }, { .name = "num_rmids", .mode = 0444, @@ -2905,7 +2927,7 @@ static void rmdir_all_sub(void) */ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid, struct mon_evt *mevt, - bool do_sum) + bool do_sum, bool rawfile) { struct mon_data *priv; @@ -2916,7 +2938,8 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid, list_for_each_entry(priv, &mon_data_kn_priv_list, list) { if (priv->rid == rid && priv->domid == domid && - priv->sum == do_sum && priv->evt == mevt) + priv->sum == do_sum && priv->evt == mevt && + priv->raw == rawfile) return priv; } @@ -2928,6 +2951,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid, priv->domid = domid; priv->sum = do_sum; priv->evt = mevt; + priv->raw = rawfile; list_add_tail(&priv->list, &mon_data_kn_priv_list); return priv; @@ -3078,12 +3102,13 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr, struct rmid_read rr = {0}; struct mon_data *priv; struct mon_evt *mevt; + char rawname[64]; int ret; for_each_mon_event(mevt) { if (mevt->rid != r->rid || !mevt->enabled) continue; - priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum); + priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum, false); if (WARN_ON_ONCE(!priv)) return -EINVAL; @@ -3093,6 +3118,18 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr, if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid)) mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true); + + if (!mevt->binary_bits) + continue; + + sprintf(rawname, "%s.raw", mevt->name); + priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum, true); + if (WARN_ON_ONCE(!priv)) + return -EINVAL; + + ret = mon_addfile(kn, rawname, priv); + if (ret) + return ret; } return 0;
Hi Tony, On 7/8/25 12:08 PM, Luck, Tony wrote: > On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote: >> On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote: >>> Hi Tony and Dave, >>> >>> On 6/26/25 9:49 AM, Tony Luck wrote: >>>> --- 14 --- >>>> Add mon_evt::is_floating_point set by resctrl file system code to limit >>>> which events architecture code can request be displayed in floating point. >>>> >>>> Simplified the fixed-point to floating point algorithm. Reinette is >>>> correct that the additional "lshift" and "rshift" operations are not >>>> required. All that is needed is to multiply the fixed point fractional >>>> part by 10**decimal_places, add a rounding amount equivalent to a "1" >>>> in the binary place after those supplied. Finally divide by 2**binary_places >>>> (with a right shift). >>>> >>>> Explained in commit comment how I chose the number of decimal places to >>>> use for each binary places value. >>>> >>>> N.B. Dave Martin expressed an opinion that the kernel should not do >>>> this conversion. Instead it should enumerate the scaling factor for >>>> each event where hardware reported a fixed point value. This patch >>>> could be dropped and replaced with one to enumerate scaling factors >>>> per event if others agree with Dave. >>> >>> Could resctrl accommodate both usages? For example, it does not >>> look too invasive to add a second file <mon_evt::name>.raw for the >>> mon_evt::is_floating_point events that can output something like Dave >>> suggested in [1]: >>> >>> .raw file format could be: >>> #format:<output that depends on format> >>> #fixed-point:<value>/<scaling factor> >>> >>> Example output: >>> fixed-point:0x60000/0x40000 >> >> Dave: Is that what you want in the ".raw" file? An alternative would be >> to put the format information for non-integer events into an >> "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?) >> and just put the raw value into the ".raw" file under mon_data. > > Note that I thought it easier for users to keep the raw file to just > showing a value, rather than including the formatting details in > Reinette's proposal. Could you please elaborate what makes this easier? It is not obvious to me how it is easier for user to open, parse, and close two files rather than one. (more below) > > Patch to implement my alternative suggestion below. To the user things > look like this: > > $ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01 > $ cat core_energy > 0.02203 > $ cat core_energy.raw > 5775 > $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale > core_energy 262144 > activity 262144 > $ bc -ql > 5775 / 262144 > .02202987670898437500 > > If this seems useful I can write up a commit message and include > as its own patch in v7. Suggestions for better names? > I expect users to regularly interact with the monitoring files. For example, "read the core_energy of group x every second". An API like above would require a contract that the scale value will never change from resctrl mount to resctrl unmount. I understand that this implementation supports exactly this by allowing an architecture to only enable an event once, but do you think this is something that will always be the case? If not then an interface like above will require user space to open, parse, close two files instead of one on a frequent basis. This is not ideal if user space wants to read monitoring data of multiple groups frequently. I would also like to keep extensibility in mind. We now know that unsigned decimal and fixed-point binary needs to be supported. I think any new interface used to communicate formatting information to user space should be done in a way that can be extended for a new format. That is, for example, why I used the actual term "fixed-point" in the example. Something like this avoids needing assumptions that a raw value always implies fixed-point format. Reinette
On Tue, Jul 08, 2025 at 01:49:26PM -0700, Reinette Chatre wrote: > Hi Tony, > > On 7/8/25 12:08 PM, Luck, Tony wrote: > > On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote: > >> On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote: > >>> Hi Tony and Dave, > >>> > >>> On 6/26/25 9:49 AM, Tony Luck wrote: > >>>> --- 14 --- > >>>> Add mon_evt::is_floating_point set by resctrl file system code to limit > >>>> which events architecture code can request be displayed in floating point. > >>>> > >>>> Simplified the fixed-point to floating point algorithm. Reinette is > >>>> correct that the additional "lshift" and "rshift" operations are not > >>>> required. All that is needed is to multiply the fixed point fractional > >>>> part by 10**decimal_places, add a rounding amount equivalent to a "1" > >>>> in the binary place after those supplied. Finally divide by 2**binary_places > >>>> (with a right shift). > >>>> > >>>> Explained in commit comment how I chose the number of decimal places to > >>>> use for each binary places value. > >>>> > >>>> N.B. Dave Martin expressed an opinion that the kernel should not do > >>>> this conversion. Instead it should enumerate the scaling factor for > >>>> each event where hardware reported a fixed point value. This patch > >>>> could be dropped and replaced with one to enumerate scaling factors > >>>> per event if others agree with Dave. > >>> > >>> Could resctrl accommodate both usages? For example, it does not > >>> look too invasive to add a second file <mon_evt::name>.raw for the > >>> mon_evt::is_floating_point events that can output something like Dave > >>> suggested in [1]: > >>> > >>> .raw file format could be: > >>> #format:<output that depends on format> > >>> #fixed-point:<value>/<scaling factor> > >>> > >>> Example output: > >>> fixed-point:0x60000/0x40000 > >> > >> Dave: Is that what you want in the ".raw" file? An alternative would be > >> to put the format information for non-integer events into an > >> "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?) > >> and just put the raw value into the ".raw" file under mon_data. > > > > Note that I thought it easier for users to keep the raw file to just > > showing a value, rather than including the formatting details in > > Reinette's proposal. > > Could you please elaborate what makes this easier? It is not obvious to me > how it is easier for user to open, parse, and close two files rather than one. > (more below) I had only considered the case where the format does not change while the resctrl file system is mounted. So users would read the "info" file to get the scaling factor once, and then read the event files with a parser that only has to convert a numerical string. > > Patch to implement my alternative suggestion below. To the user things > > look like this: > > > > $ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01 > > $ cat core_energy > > 0.02203 > > $ cat core_energy.raw > > 5775 > > $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale > > core_energy 262144 > > activity 262144 > > $ bc -ql > > 5775 / 262144 > > .02202987670898437500 > > > > If this seems useful I can write up a commit message and include > > as its own patch in v7. Suggestions for better names? > > > > I expect users to regularly interact with the monitoring files. For example, > "read the core_energy of group x every second". An API like above would require > a contract that the scale value will never change from resctrl mount to > resctrl unmount. I understand that this implementation supports exactly this by > allowing an architecture to only enable an event once, but do you think this is > something that will always be the case? If not then an interface like above will > require user space to open, parse, close two files instead of one on a frequent basis. > This is not ideal if user space wants to read monitoring data of multiple > groups frequently. While hardware designers do some outlandish things. Changing the format of an event counter on the fly seems beyond the range of possibility. How would that even work? A driver would have to rerun enumeration of the feature every time it read a counter. Or hardware would have to supply some interrupt to tell s/w that the format changed. I think it reasonable that resctrl be able to guarantee that the format described in the info file is valid for the life of the mount. > I would also like to keep extensibility in mind. We now know that > unsigned decimal and fixed-point binary needs to be supported. I think any > new interface used to communicate formatting information to user space should be done > in a way that can be extended for a new format. That is, for example, why > I used the actual term "fixed-point" in the example. Something like this avoids > needing assumptions that a raw value always implies fixed-point format. This is fair. But could be covered in the "info" file with some more descriptive way to describe the format. Perhaps: $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale core_energy fixed-point scale=262144 activity fixed-point scale=262144 To allow for other types in the future. > > Reinette -Tony
Hi Tony, On 7/8/25 3:43 PM, Luck, Tony wrote: > On Tue, Jul 08, 2025 at 01:49:26PM -0700, Reinette Chatre wrote: >> Hi Tony, >> >> On 7/8/25 12:08 PM, Luck, Tony wrote: >>> On Thu, Jul 03, 2025 at 10:22:06AM -0700, Luck, Tony wrote: >>>> On Thu, Jul 03, 2025 at 09:45:15AM -0700, Reinette Chatre wrote: >>>>> Hi Tony and Dave, >>>>> >>>>> On 6/26/25 9:49 AM, Tony Luck wrote: >>>>>> --- 14 --- >>>>>> Add mon_evt::is_floating_point set by resctrl file system code to limit >>>>>> which events architecture code can request be displayed in floating point. >>>>>> >>>>>> Simplified the fixed-point to floating point algorithm. Reinette is >>>>>> correct that the additional "lshift" and "rshift" operations are not >>>>>> required. All that is needed is to multiply the fixed point fractional >>>>>> part by 10**decimal_places, add a rounding amount equivalent to a "1" >>>>>> in the binary place after those supplied. Finally divide by 2**binary_places >>>>>> (with a right shift). >>>>>> >>>>>> Explained in commit comment how I chose the number of decimal places to >>>>>> use for each binary places value. >>>>>> >>>>>> N.B. Dave Martin expressed an opinion that the kernel should not do >>>>>> this conversion. Instead it should enumerate the scaling factor for >>>>>> each event where hardware reported a fixed point value. This patch >>>>>> could be dropped and replaced with one to enumerate scaling factors >>>>>> per event if others agree with Dave. >>>>> >>>>> Could resctrl accommodate both usages? For example, it does not >>>>> look too invasive to add a second file <mon_evt::name>.raw for the >>>>> mon_evt::is_floating_point events that can output something like Dave >>>>> suggested in [1]: >>>>> >>>>> .raw file format could be: >>>>> #format:<output that depends on format> >>>>> #fixed-point:<value>/<scaling factor> >>>>> >>>>> Example output: >>>>> fixed-point:0x60000/0x40000 >>>> >>>> Dave: Is that what you want in the ".raw" file? An alternative would be >>>> to put the format information for non-integer events into an >>>> "info" file ("info/{RESOURCE_NAME}_MON/monfeatures.raw.formats"?) >>>> and just put the raw value into the ".raw" file under mon_data. >>> >>> Note that I thought it easier for users to keep the raw file to just >>> showing a value, rather than including the formatting details in >>> Reinette's proposal. >> >> Could you please elaborate what makes this easier? It is not obvious to me >> how it is easier for user to open, parse, and close two files rather than one. >> (more below) > > I had only considered the case where the format does not change while > the resctrl file system is mounted. So users would read the "info" file > to get the scaling factor once, and then read the event files with a > parser that only has to convert a numerical string. > >>> Patch to implement my alternative suggestion below. To the user things >>> look like this: >>> >>> $ cd /sys/fs/resctrl/mon_data/mon_PERF_PKG_01 >>> $ cat core_energy >>> 0.02203 >>> $ cat core_energy.raw >>> 5775 >>> $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale >>> core_energy 262144 >>> activity 262144 >>> $ bc -ql >>> 5775 / 262144 >>> .02202987670898437500 >>> >>> If this seems useful I can write up a commit message and include >>> as its own patch in v7. Suggestions for better names? >>> >> >> I expect users to regularly interact with the monitoring files. For example, >> "read the core_energy of group x every second". An API like above would require >> a contract that the scale value will never change from resctrl mount to >> resctrl unmount. I understand that this implementation supports exactly this by >> allowing an architecture to only enable an event once, but do you think this is >> something that will always be the case? If not then an interface like above will >> require user space to open, parse, close two files instead of one on a frequent basis. >> This is not ideal if user space wants to read monitoring data of multiple >> groups frequently. > > While hardware designers do some outlandish things. Changing the format > of an event counter on the fly seems beyond the range of possibility. > How would that even work? A driver would have to rerun enumeration of > the feature every time it read a counter. Or hardware would have to > supply some interrupt to tell s/w that the format changed. There is also the new direction of resctrl dynamically enabling/disabling hardware capabilities to consider. Here it could be reasonable, since this would be triggered by user space, that a note of "doing this may change the format" would be sufficient. Something else to consider is the possibility of hardware using different scales in different domains if the packages are not "uniform". > I think it reasonable that resctrl be able to guarantee that the format > described in the info file is valid for the life of the mount. I'd really like to think that it is reasonable also. > >> I would also like to keep extensibility in mind. We now know that >> unsigned decimal and fixed-point binary needs to be supported. I think any >> new interface used to communicate formatting information to user space should be done >> in a way that can be extended for a new format. That is, for example, why >> I used the actual term "fixed-point" in the example. Something like this avoids >> needing assumptions that a raw value always implies fixed-point format. > > This is fair. But could be covered in the "info" file with some more > descriptive way to describe the format. Perhaps: > > $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale > core_energy fixed-point scale=262144 > activity fixed-point scale=262144 > > To allow for other types in the future. Note that the filename still has "scale" in its name making it specific to fixed-point. It may be expected that every entry in mon_features has an entry in mon_features_raw_scale (name TBD). This means the existing possible "mon_features" need to be accommodated (except the _config ones). This may also be an opportunity to introduce the unit of measurement. For example, $ cat /sys/fs/resctrl/info/PERF_PKG_MON/mon_features_raw_scale core_energy fixed-point scale=262144 unit=joules activity fixed-point scale=262144 unit=farads ... Reinette
Tony, On 6/26/25 9:49 AM, Tony Luck wrote: > Background > ---------- > > Telemetry features are being implemented in conjunction with the > IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send > counts for various events to a collector in a nearby OOBMSM device to be > accumulated with counts for each <RMID, event> pair received from other > CPUs. Cores send event counts when the RMID value changes, or after each > 2ms elapsed time. To start a review of this jumbo series and find that the *first* [1] (straight forward) request from previous review has not been addressed is demoralizing. I was hoping that the previous version's discussions would result in review feedback either addressed or discussed (never ignored). I cannot imagine how requesting OOBMSM to be expanded can be invalid though. Reinette [1] https://lore.kernel.org/lkml/b8ddce03-65c0-4420-b30d-e43c54943667@intel.com/
On Mon, Jun 30, 2025 at 10:51:50AM -0700, Reinette Chatre wrote: > > Tony, > > On 6/26/25 9:49 AM, Tony Luck wrote: > > Background > > ---------- > > > > Telemetry features are being implemented in conjunction with the > > IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send > > counts for various events to a collector in a nearby OOBMSM device to be > > accumulated with counts for each <RMID, event> pair received from other > > CPUs. Cores send event counts when the RMID value changes, or after each > > 2ms elapsed time. > > To start a review of this jumbo series and find that the *first* [1] > (straight forward) request from previous review has not been addressed is > demoralizing. I was hoping that the previous version's discussions would result > in review feedback either addressed or discussed (never ignored). I > cannot imagine how requesting OOBMSM to be expanded can be invalid though. > > Reinette > > [1] https://lore.kernel.org/lkml/b8ddce03-65c0-4420-b30d-e43c54943667@intel.com/ My profound apologies for blowing it (again). I went through the comments to patches multiple times to try and catch all your comments. But somehow skipped the cover letter :-( . Here's a re-write to address comments, but also to try to provide a better story line starting with how the logical processors capture the event data, following on with aggregator processing, etc. -Tony --- On Intel systems that support per-RMID telemetry monitoring each logical processor keeps a local count for various events. When the IA32_PQR_ASSOC.RMID value for the logical processor changes (or when a two millisecond counter expires) these event counts are transmitted to an event aggregator on the same package as the processor together with the current RMID value. The event counters are reset to zero to begin counting again. Each aggregator takes the incoming event counts and adds them to cumulative counts for each event for each RMID. Note that there can be multiple aggregators on each package with no architectural association between logical processors and an aggregator. All of these aggregated counters can be read by an operating system from the MMIO space of the Out Of Band Management Service Module (OOBMSM) device(s) on a system. Any counter can be read from any logical processor. Intel publishes details for each processor generation showing which events are counted by each logical processor and the offsets for each accumulated counter value within the MMIO space in XML files here: https://github.com/intel/Intel-PMT. For example there are two energy related telemetry events for the Clearwater Forest family of processors and the MMIO space looks like this: Offset RMID Event ------ ---- ----- 0x0000 0 core_energy 0x0008 0 activity 0x0010 1 core_energy 0x0018 1 activity ... 0x23F0 575 core_energy 0x23F8 575 activity In addition the XML file provides the units (Joules for core_energy, Farads for activity) and the type of data (fixed-point binary with bit 63 used as to indicate the data is valid, and the low 18 bits as a binary fraction). Finally, each XML file provides a 32-bit unique id (or guid) that is used as an index to find the correct XML description file for each telemetry implementation. The INTEL_PMT_DISCOVERY driver provides intel_pmt_get_regions_by_feature() to enumerate the aggregator instances on a platform. It provides: 1) guid - so resctrl can determine which events are supported 2) mmio base address of counters 3) package id Resctrl accumulates counts from all aggregators on a package in order to provide a consistent user interface across processor generations. Directory structure for the telemetry events looks like this: $ tree /sys/fs/resctrl/mon_data/ /sys/fs/resctrl/mon_data/ mon_data ├── mon_PERF_PKG_00 │ ├── activity │ └── core_energy └── mon_PERF_PKG_01 ├── activity └── core_energy Reading the "core_energy" file from some resctrl mon_data directory shows the cumulative energy (in Joules) used by all tasks that ran with the RMID associated with that directory on a given package. Note that "core_energy" reports only energy consumed by CPU cores (data processing units, L1/L2 caches, etc.). It does not include energy used in the "uncore" (L3 cache, on package devices, etc.), or used by memory or I/O devices.
Hi Tony, On 6/30/25 3:46 PM, Luck, Tony wrote: > On Mon, Jun 30, 2025 at 10:51:50AM -0700, Reinette Chatre wrote: >> >> Tony, >> >> On 6/26/25 9:49 AM, Tony Luck wrote: >>> Background >>> ---------- >>> >>> Telemetry features are being implemented in conjunction with the >>> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send >>> counts for various events to a collector in a nearby OOBMSM device to be >>> accumulated with counts for each <RMID, event> pair received from other >>> CPUs. Cores send event counts when the RMID value changes, or after each >>> 2ms elapsed time. >> >> To start a review of this jumbo series and find that the *first* [1] >> (straight forward) request from previous review has not been addressed is >> demoralizing. I was hoping that the previous version's discussions would result >> in review feedback either addressed or discussed (never ignored). I >> cannot imagine how requesting OOBMSM to be expanded can be invalid though. >> >> Reinette >> >> [1] https://lore.kernel.org/lkml/b8ddce03-65c0-4420-b30d-e43c54943667@intel.com/ > > My profound apologies for blowing it (again). I went through the comments > to patches multiple times to try and catch all your comments. But somehow > skipped the cover letter :-( . > > Here's a re-write to address comments, but also to try to provide > a better story line starting with how the logical processors capture > the event data, following on with aggregator processing, etc. > > -Tony > > --- > > On Intel systems that support per-RMID telemetry monitoring each logical > processor keeps a local count for various events. When the IA32_PQR_ASSOC.RMID > value for the logical processor changes (or when a two millisecond counter > expires) these event counts are transmitted to an event aggregator on > the same package as the processor together with the current RMID value. The > event counters are reset to zero to begin counting again. > > Each aggregator takes the incoming event counts and adds them to > cumulative counts for each event for each RMID. Note that there can be > multiple aggregators on each package with no architectural association > between logical processors and an aggregator. > > All of these aggregated counters can be read by an operating system from > the MMIO space of the Out Of Band Management Service Module (OOBMSM) > device(s) on a system. Any counter can be read from any logical processor. > > Intel publishes details for each processor generation showing which > events are counted by each logical processor and the offsets for each > accumulated counter value within the MMIO space in XML files here: > https://github.com/intel/Intel-PMT. > > For example there are two energy related telemetry events for the Clearwater > Forest family of processors and the MMIO space looks like this: > > Offset RMID Event > ------ ---- ----- > 0x0000 0 core_energy > 0x0008 0 activity > 0x0010 1 core_energy > 0x0018 1 activity > ... > 0x23F0 575 core_energy > 0x23F8 575 activity > > In addition the XML file provides the units (Joules for core_energy, > Farads for activity) and the type of data (fixed-point binary with > bit 63 used as to indicate the data is valid, and the low 18 bits as a "bit 63 used as to indicate" -> "bit 63 used to indicate"? > binary fraction). > > Finally, each XML file provides a 32-bit unique id (or guid) that is > used as an index to find the correct XML description file for each > telemetry implementation. > > The INTEL_PMT_DISCOVERY driver provides intel_pmt_get_regions_by_feature() > to enumerate the aggregator instances on a platform. It provides: I think it will be helpful to prime the connection between "aggregator" and "telemetery region" here. For example, "to enumerate the aggregator instances on a platform" -> "to enumerate the aggregator instances (also referred to as "telemetry regions" in this series) on a platform" > 1) guid - so resctrl can determine which events are supported > 2) mmio base address of counters mmio -> MMIO > 3) package id > > Resctrl accumulates counts from all aggregators on a package in order > to provide a consistent user interface across processor generations. > > Directory structure for the telemetry events looks like this: > > $ tree /sys/fs/resctrl/mon_data/ > /sys/fs/resctrl/mon_data/ > mon_data > ├── mon_PERF_PKG_00 > │ ├── activity > │ └── core_energy > └── mon_PERF_PKG_01 > ├── activity > └── core_energy > > Reading the "core_energy" file from some resctrl mon_data directory shows > the cumulative energy (in Joules) used by all tasks that ran with the RMID > associated with that directory on a given package. Note that "core_energy" > reports only energy consumed by CPU cores (data processing units, > L1/L2 caches, etc.). It does not include energy used in the "uncore" > (L3 cache, on package devices, etc.), or used by memory or I/O devices. Thank you very much for this rework. I found this much easier to follow. Reinette
© 2016 - 2025 Red Hat, Inc.