[PATCH 0/8] resctrl: Add perf PMU for resctrl monitoring

Jonathan Perry posted 8 patches 3 months, 3 weeks ago
[PATCH 0/8] resctrl: Add perf PMU for resctrl monitoring
Posted by Jonathan Perry 3 months, 3 weeks ago
Expose resctrl monitoring data via a lightweight perf PMU. 

Background: The kernel's initial cache-monitoring interface shipped via 
perf (commit 4afbb24ce5e7, 2015). That approach tied monitoring to tasks
and cgroups. Later, cache control was designed around the resctrl 
filesystem to better match hardware semantics, and the incompatible perf 
CQM code was removed (commit c39a0e2c8850, 2017). This series implements
a thin, generic perf PMU that _is_ compatible with resctrl.

Motivation: perf support enables measuring cache occupancy and memory 
bandwidth metrics on hrtimer (high resolution timer) interrupts via eBPF.
Compared with polling from userspace, hrtimer-based reads remove 
scheduling jitter and context switch overhead. Further, PMU reads can be 
parallel, since the PMU read path need not lock resctrl's rdtgroup_mutex.
Parallelization and reduced jitter enable more accurate snapshots of
cache occupancy and memory bandwidth. [1] has more details on the 
motivation and design.

Design: The "resctrl" PMU is a small adapter on top of resctrl's 
monitoring path:
- Event selection uses `attr.config` to pass an open `mon_data` fd
  (e.g. `mon_L3_00/llc_occupancy`).
- Events must be CPU-bound within the file's domain. Perf is responsible
  for ensuring the read executes on the bound CPU.
- Event init resolves and pins the rdtgroup, prepares struct rmid_read via
  mon_event_setup_read(), and validates the bound CPU is in the file's 
  domain CPU mask.
- Sampling is not supported; reads match the `mon_data` file contents.
- If the rdtgroup is deleted, reads return 0.

Includes a new selftest (tools/testing/selftests/resctrl/pmu_test.c)
to validate the PMU event init path, and adds PMU testing to existing 
CMT tests.

Example usage (see Documentation/filesystems/resctrl.rst):
Open a monitoring file and pass its fd in `perf_event_attr.config`, with
`attr.type` set to the `resctrl` PMU type.
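
A minimal userspace sketch of the usage above, not taken from the series.
It assumes the PMU registers under the name "resctrl", so that its dynamic
type id would appear at /sys/bus/event_source/devices/resctrl/type (the usual
perf sysfs layout). The helper names are hypothetical. The caller opens the
`mon_data` file and keeps that fd; whether it may be closed once the event
exists is not stated here, so the sketch leaves that to the caller.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a dynamic PMU type id from a sysfs "type" file; -1 on failure. */
int read_pmu_type(const char *path)
{
	FILE *f = fopen(path, "r");
	int type = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &type) != 1)
		type = -1;
	fclose(f);
	return type;
}

/* Fill attr as the cover letter describes: type = PMU type, config = fd. */
void resctrl_attr_init(struct perf_event_attr *attr, int pmu_type, int mon_fd)
{
	assert(mon_fd >= 0);	/* caller must pass an open mon_data fd */
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = (uint32_t)pmu_type;
	attr->config = (uint64_t)mon_fd;
	/* No sample_period: the cover letter says sampling is unsupported. */
}

/* Open a counting event bound to one CPU (pid = -1). Returns fd or -1. */
int resctrl_event_open(int pmu_type, int mon_fd, int cpu)
{
	struct perf_event_attr attr;

	resctrl_attr_init(&attr, pmu_type, mon_fd);
	return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0UL);
}

/* With read_format == 0, read() returns a single u64 count. */
int resctrl_event_read(int ev_fd, uint64_t *val)
{
	return read(ev_fd, val, sizeof(*val)) == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

A caller would open e.g. a `mon_L3_00/llc_occupancy` file, pick a CPU in that
domain, call resctrl_event_open(), and then resctrl_event_read() on the
returned fd. If the chosen CPU is outside the file's domain, event init is
expected to fail per the design notes above.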

The patches are based on top of v6.18-rc1 (commit 3a8660878839).

[1] https://www.youtube.com/watch?v=4BGhAMJdZTc

Jonathan Perry (8):
  resctrl: Pin rdtgroup for mon_data file lifetime
  resctrl/mon: Split RMID read init from execution
  resctrl/mon: Select cpumask before invoking mon_event_read()
  resctrl/mon: Create mon_event_setup_read() helper
  resctrl: Propagate CPU mask validation error via rr->err
  resctrl/pmu: Introduce skeleton PMU and selftests
  resctrl/pmu: Use mon_event_setup_read() and validate CPU
  resctrl/pmu: Implement .read via direct RMID read; add LLC selftest

 Documentation/filesystems/resctrl.rst         |  64 ++++
 fs/resctrl/Makefile                           |   2 +-
 fs/resctrl/ctrlmondata.c                      | 118 ++++---
 fs/resctrl/internal.h                         |  24 +-
 fs/resctrl/monitor.c                          |   8 +-
 fs/resctrl/pmu.c                              | 217 +++++++++++++
 fs/resctrl/rdtgroup.c                         | 131 +++++++-
 tools/testing/selftests/resctrl/cache.c       |  94 +++++-
 tools/testing/selftests/resctrl/cmt_test.c    |  17 +-
 tools/testing/selftests/resctrl/pmu_test.c    | 292 ++++++++++++++++++
 tools/testing/selftests/resctrl/pmu_utils.c   |  32 ++
 tools/testing/selftests/resctrl/resctrl.h     |   4 +
 .../testing/selftests/resctrl/resctrl_tests.c |   1 +
 13 files changed, 948 insertions(+), 56 deletions(-)
 create mode 100644 fs/resctrl/pmu.c
 create mode 100644 tools/testing/selftests/resctrl/pmu_test.c
 create mode 100644 tools/testing/selftests/resctrl/pmu_utils.c
Re: [PATCH 0/8] resctrl: Add perf PMU for resctrl monitoring
Posted by Luck, Tony 3 months, 3 weeks ago
On Thu, Oct 16, 2025 at 09:46:48AM -0500, Jonathan Perry wrote:
> Motivation: perf support enables measuring cache occupancy and memory 
> bandwidth metrics on hrtimer (high resolution timer) interrupts via eBPF.
> Compared with polling from userspace, hrtimer-based reads remove 
> scheduling jitter and context switch overhead. Further, PMU reads can be 
> parallel, since the PMU read path need not lock resctrl's rdtgroup_mutex.
> Parallelization and reduced jitter enable more accurate snapshots of
> cache occupancy and memory bandwidth. [1] has more details on the 
> motivation and design.

This parallel read without rdtgroup_mutex looks worrying.

The h/w counters have limited width (24-bits on older Intel CPUs,
32-bits on AMD and Intel >= Icelake). So resctrl takes the raw
value and in get_corrected_val() figures the increment since the
previous read of the MSR to figure out how much to add to the
running per-RMID count of "chunks".

That's all inherently full of races. If perf does this at the
same time that resctrl does, then things will be corrupted
sooner or later.
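
A minimal sketch of the accumulation described above, to make the race
concrete. This is not the kernel's code: the struct and function names are
hypothetical, and only the idea (width-limited raw counter, delta since the
previous read, running total) comes from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-RMID state: last raw value seen and running total. */
struct mbm_state_sketch {
	uint64_t prev_raw;
	uint64_t chunks;
};

/* Increment between two reads of a counter that wraps at 2^width. */
uint64_t counter_delta(uint64_t prev, uint64_t cur, unsigned int width)
{
	uint64_t mask;

	assert(width > 0 && width < 64);
	mask = (UINT64_C(1) << width) - 1;
	return (cur - prev) & mask;
}

/*
 * Unlocked read-modify-write of prev_raw and chunks. If two callers run
 * this concurrently for the same RMID, both can compute a delta from the
 * same prev_raw and both add it, so the total is counted twice; or one
 * can store an older raw value over a newer one.
 */
uint64_t accumulate(struct mbm_state_sketch *s, uint64_t raw,
		    unsigned int width)
{
	s->chunks += counter_delta(s->prev_raw, raw, width);
	s->prev_raw = raw;
	return s->chunks;
}
```

With a 24-bit counter, a previous raw value of 0xFFFFF0 and a current one of
0x10 yield a delta of 0x20, which is the wraparound case the correction
exists for.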

You might fix it with a per-RMID spinlock in "struct arch_mbm_state"?

-Tony
RE: [PATCH 0/8] resctrl: Add perf PMU for resctrl monitoring
Posted by Luck, Tony 3 months, 3 weeks ago
> > Motivation: perf support enables measuring cache occupancy and memory
> > bandwidth metrics on hrtimer (high resolution timer) interrupts via eBPF.
> > Compared with polling from userspace, hrtimer-based reads remove
> > scheduling jitter and context switch overhead. Further, PMU reads can be
> > parallel, since the PMU read path need not lock resctrl's rdtgroup_mutex.
> > Parallelization and reduced jitter enable more accurate snapshots of
> > cache occupancy and memory bandwidth. [1] has more details on the
> > motivation and design.
>
> This parallel read without rdtgroup_mutex looks worrying.
>
> The h/w counters have limited width (24-bits on older Intel CPUs,
> 32-bits on AMD and Intel >= Icelake). So resctrl takes the raw
> value and in get_corrected_val() figures the increment since the
> previous read of the MSR to figure out how much to add to the
> running per-RMID count of "chunks".
>
> That's all inherently full of races. If perf does this at the
> same time that resctrl does, then things will be corrupted
> sooner or later.
>
> You might fix it with a per-RMID spinlock in "struct arch_mbm_state"?

That might be too fine a locking granularity. You'd probably be fine
with little contention with a lock in "struct rdt_mon_domain".
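
A userspace sketch of that per-domain granularity, using a pthread mutex as a
stand-in. All names here are hypothetical and this is not the layout of
"struct rdt_mon_domain". In the kernel the lock type would need more thought:
a read path that runs from an hrtimer interrupt presumably cannot take a
sleeping lock.

```c
#include <pthread.h>
#include <stdint.h>

#define SKETCH_NUM_RMIDS 4	/* illustrative size only */

struct rmid_state_sketch {
	uint64_t prev_raw;	/* last raw counter value seen */
	uint64_t chunks;	/* running total */
};

/* One lock per domain covers every RMID's state in that domain. */
struct mon_domain_sketch {
	pthread_mutex_t lock;
	unsigned int width;	/* counter width in bits, must be < 64 */
	struct rmid_state_sketch rmid[SKETCH_NUM_RMIDS];
};

void domain_init(struct mon_domain_sketch *d, unsigned int width)
{
	pthread_mutex_init(&d->lock, NULL);
	d->width = width;
	for (int i = 0; i < SKETCH_NUM_RMIDS; i++) {
		d->rmid[i].prev_raw = 0;
		d->rmid[i].chunks = 0;
	}
}

/* Fold a new raw reading into the running total under the domain lock. */
uint64_t domain_update(struct mon_domain_sketch *d, unsigned int rmid,
		       uint64_t raw)
{
	uint64_t mask = (UINT64_C(1) << d->width) - 1;
	uint64_t total;

	if (rmid >= SKETCH_NUM_RMIDS)
		return 0;

	pthread_mutex_lock(&d->lock);
	d->rmid[rmid].chunks += (raw - d->rmid[rmid].prev_raw) & mask;
	d->rmid[rmid].prev_raw = raw;
	total = d->rmid[rmid].chunks;
	pthread_mutex_unlock(&d->lock);

	return total;
}
```

Readers of different domains never contend in this arrangement; readers of
different RMIDs in the same domain do, which is the trade-off against a
per-RMID lock.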

-Tony
RE: [PATCH 0/8] resctrl: Add perf PMU for resctrl monitoring
Posted by Jonathan Perry 3 months, 3 weeks ago
> > > Motivation: perf support enables measuring cache occupancy and memory
> > > bandwidth metrics on hrtimer (high resolution timer) interrupts via eBPF.
> > > Compared with polling from userspace, hrtimer-based reads remove
> > > scheduling jitter and context switch overhead. Further, PMU reads can be
> > > parallel, since the PMU read path need not lock resctrl's rdtgroup_mutex.
> > > Parallelization and reduced jitter enable more accurate snapshots of
> > > cache occupancy and memory bandwidth. [1] has more details on the
> > > motivation and design.
> >
> > This parallel read without rdtgroup_mutex looks worrying.
> >
> > The h/w counters have limited width (24-bits on older Intel CPUs,
> > 32-bits on AMD and Intel >= Icelake). So resctrl takes the raw
> > value and in get_corrected_val() figures the increment since the
> > previous read of the MSR to figure out how much to add to the
> > running per-RMID count of "chunks".
> >
> > That's all inherently full of races. If perf does this at the
> > same time that resctrl does, then things will be corrupted
> > sooner or later.
> >
> > You might fix it with a per-RMID spinlock in "struct arch_mbm_state"?
> 
> That might be too fine a locking granularity. You'd probably be fine
> with little contention with a lock in "struct rdt_mon_domain".

Good catch. Thank you Tony!

We might be able to get the same protection that a per-RMID spinlock in
"struct arch_mbm_state" would give, but using only a memory barrier
(no spinlock). I'll look further into it.

-Jonathan