[PATCH v7 00/20] ARM64 PMU Partitioning

Colton Lewis posted 20 patches 1 month, 1 week ago
[PATCH v7 00/20] ARM64 PMU Partitioning
Posted by Colton Lewis 1 month, 1 week ago
This series creates a new PMU scheme on ARM, a partitioned PMU that
allows reserving a subset of counters for more direct guest access,
significantly reducing overhead. More details, including performance
benchmarks, can be read in the v1 cover letter linked below.

An overview of what this series accomplishes was presented at KVM
Forum 2025. Slides [1] and video [2] are linked below.

After a few false starts, meeting with Will Deacon and Mark Rutland to
discuss implementation ideas, and a few more false starts, I finally
have an implementation of dynamic counter reservation that works
without disrupting host perf too much. Now the host only loses access
to the guest counters when a vCPU resides on the CPU.

The key was creating perf_pmu_resched_update(), which behaves exactly
like perf_pmu_resched() except that it takes a callback to invoke
between scheduling the perf events out and scheduling them back in.
That allows us to update the PMU's available counters while we know
they are not in use, without exposing private perf core functions and
triple-checking that they are not called in a way that violates
existing assumptions.
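
As a rough user-space illustration of that ordering (not the kernel
code; the toy_* names and the struct are made up for this sketch), the
callback runs only once every event has been scheduled out, so it can
safely change the counter mask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a PMU; this is not the kernel's struct pmu. */
struct toy_pmu {
	uint64_t cntr_mask;	/* counters the host is allowed to use */
	uint64_t in_use;	/* counters currently holding host events */
	int nr_events;		/* host events that want a counter */
};

typedef void (*toy_update_fn)(struct toy_pmu *pmu, void *data);

/* Take every host event off the hardware. */
static void toy_sched_out(struct toy_pmu *pmu)
{
	pmu->in_use = 0;
}

/* Place host events on the lowest counters allowed by cntr_mask. */
static void toy_sched_in(struct toy_pmu *pmu)
{
	int placed = 0;

	for (int i = 0; i < 64 && placed < pmu->nr_events; i++) {
		if (pmu->cntr_mask & (UINT64_C(1) << i)) {
			pmu->in_use |= UINT64_C(1) << i;
			placed++;
		}
	}
}

/* Schedule out, run the callback while nothing is counting, schedule in. */
static void toy_resched_update(struct toy_pmu *pmu, toy_update_fn update,
			       void *data)
{
	toy_sched_out(pmu);
	assert(pmu->in_use == 0);	/* callback never sees a live counter */
	if (update)
		update(pmu, data);
	toy_sched_in(pmu);
}

/* Example callback: replace the set of counters the host may use. */
static void toy_set_mask(struct toy_pmu *pmu, void *data)
{
	pmu->cntr_mask = *(const uint64_t *)data;
}
```

With six counters and three host events sitting on counters 0-2,
calling toy_resched_update() with toy_set_mask and a mask of 0x38 moves
the host events onto counters 3-5 and leaves 0-2 free for the guest.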

Because this introduces the possibility of a perf reschedule during
vCPU load, I've optimized it so the reschedule only happens if host
events occupy the intended guest counters at the time of the load.
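
The check that gates the reschedule could look roughly like the
following. This is a sketch under the assumption that the guest is
handed the lowest-numbered counters (as an MDCR_EL2.HPMN split does);
the helper names are hypothetical and not taken from the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bitmask covering counters 0..nr_guest-1 (hypothetical helper). */
static uint64_t toy_guest_mask(unsigned int nr_guest)
{
	assert(nr_guest <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	return nr_guest == 64 ? UINT64_MAX : (UINT64_C(1) << nr_guest) - 1;
}

/*
 * A reschedule is only needed when at least one host event currently
 * sits on a counter the guest is about to take over.
 */
static bool toy_need_resched(uint64_t host_in_use, unsigned int nr_guest)
{
	return (host_in_use & toy_guest_mask(nr_guest)) != 0;
}
```

For example, with four guest counters, host events on counters 4 and 5
(0x30) need no reschedule, while a host event on counter 0 or 3 does.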

The kernel command line parameter for the driver still exists, but now
only defines an upper limit on the number of counters the guest might
use rather than taking those counters from the host permanently.

v7:

* Implement dynamic counter reservation as described above. One side
  effect is that the PMUv3 driver now needs far fewer changes to
  enforce the boundary.

* Move register accesses out of fast path for non-FGT hardware. The
  performance impact was negligible and this moves bloat out of the
  fast path and allows a more reliable design with more code sharing.

* Make PMCCNTR a special case in the context swap again because trying
  to access it with PMXEVCNTR is undefined.

* Fix a bug where kvm_pmu_guest_counter_mask was using & instead of |.

* Re-expose the dedicated instruction counter to the host since it was
  decided the guest will not own it.

* Change the global armv8pmu_reserved_host_counters to
  armv8pmu_is_partitoned because it was only used in boolean checks.

* Fix typo in vcpu attribute commit so the spelling of the flag in the
  commit message matches the code.

* Rebase to v7.0-rc7

v6:
https://lore.kernel.org/kvmarm/20260209221414.2169465-1-coltonlewis@google.com/

v5:
https://lore.kernel.org/kvmarm/20251209205121.1871534-1-coltonlewis@google.com/

v4:
https://lore.kernel.org/kvmarm/20250714225917.1396543-1-coltonlewis@google.com/

v3:
https://lore.kernel.org/kvm/20250626200459.1153955-1-coltonlewis@google.com/

v2:
https://lore.kernel.org/kvm/20250620221326.1261128-1-coltonlewis@google.com/

v1:
https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/

[1] https://gitlab.com/qemu-project/kvm-forum/-/raw/main/_attachments/2025/Optimizing__itvHkhc.pdf
[2] https://www.youtube.com/watch?v=YRzZ8jMIA6M&list=PLW3ep1uCIRfxwmllXTOA2txfDWN6vUOHp&index=9

Colton Lewis (19):
  arm64: cpufeature: Add cpucap for HPMN0
  KVM: arm64: Reorganize PMU functions
  perf: arm_pmuv3: Generalize counter bitmasks
  perf: arm_pmuv3: Check cntr_mask before using pmccntr
  perf: arm_pmuv3: Add method to partition the PMU
  KVM: arm64: Set up FGT for Partitioned PMU
  KVM: arm64: Add Partitioned PMU register trap handlers
  KVM: arm64: Set up MDCR_EL2 to handle a Partitioned PMU
  KVM: arm64: Context swap Partitioned PMU guest registers
  KVM: arm64: Enforce PMU event filter at vcpu_load()
  perf: Add perf_pmu_resched_update()
  KVM: arm64: Apply dynamic guest counter reservations
  KVM: arm64: Implement lazy PMU context swaps
  perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
  KVM: arm64: Detect overflows for the Partitioned PMU
  KVM: arm64: Add vCPU device attr to partition the PMU
  KVM: selftests: Add find_bit to KVM library
  KVM: arm64: selftests: Add test case for Partitioned PMU
  KVM: arm64: selftests: Relax testing for exceptions when partitioned

Marc Zyngier (1):
  KVM: arm64: Reorganize PMU includes

 arch/arm/include/asm/arm_pmuv3.h              |  18 +
 arch/arm64/include/asm/arm_pmuv3.h            |  12 +-
 arch/arm64/include/asm/kvm_host.h             |  17 +-
 arch/arm64/include/asm/kvm_types.h            |   6 +-
 arch/arm64/include/uapi/asm/kvm.h             |   2 +
 arch/arm64/kernel/cpufeature.c                |   8 +
 arch/arm64/kvm/Makefile                       |   2 +-
 arch/arm64/kvm/arm.c                          |   2 +
 arch/arm64/kvm/config.c                       |  41 +-
 arch/arm64/kvm/debug.c                        |  31 +-
 arch/arm64/kvm/pmu-direct.c                   | 494 ++++++++++++
 arch/arm64/kvm/pmu-emul.c                     | 674 +----------------
 arch/arm64/kvm/pmu.c                          | 701 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     | 250 ++++++-
 arch/arm64/tools/cpucaps                      |   1 +
 arch/arm64/tools/sysreg                       |   6 +-
 drivers/perf/arm_pmuv3.c                      | 111 ++-
 include/kvm/arm_pmu.h                         | 110 +++
 include/linux/perf/arm_pmu.h                  |   3 +
 include/linux/perf/arm_pmuv3.h                |  14 +-
 include/linux/perf_event.h                    |   3 +
 kernel/events/core.c                          |  28 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../selftests/kvm/arm64/vpmu_counter_access.c | 112 ++-
 tools/testing/selftests/kvm/lib/find_bit.c    |   1 +
 25 files changed, 1861 insertions(+), 787 deletions(-)
 create mode 100644 arch/arm64/kvm/pmu-direct.c
 create mode 100644 tools/testing/selftests/kvm/lib/find_bit.c


base-commit: 591cd656a1bf5ea94a222af5ef2ee76df029c1d2
--
2.54.0.545.g6539524ca2-goog
Re: [PATCH v7 00/20] ARM64 PMU Partitioning
Posted by James Clark 1 month ago

On 04/05/2026 10:17 pm, Colton Lewis wrote:
> [...]

I tested it a bit and ran the kselftests and it all seems to be working 
ok. Some of the critical sashiko comments look like they are worth 
looking into though: 
https://sashiko.dev/#/patchset/20260504211813.1804997-1-coltonlewis%40google.com

For example writing to PMCR_EL0.P from EL2 resets the host's counters, 
even if it's KVM doing it after trapping a write from the guest.
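
A toy model of the difference, assuming the guest owns counters
[0, HPMN) and the host owns the rest. This is not the series' code, the
names are invented, and the eventual fix may look different:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_COUNTERS 6

/* Hypothetical model of the event counters and the partition point. */
struct toy_pmu_regs {
	uint64_t evcntr[TOY_NR_COUNTERS];
	unsigned int hpmn;	/* counters [0, hpmn) belong to the guest */
};

/* Models writing PMCR_EL0.P at EL2: every event counter is reset. */
static void toy_pmcr_p_from_el2(struct toy_pmu_regs *r)
{
	for (unsigned int i = 0; i < TOY_NR_COUNTERS; i++)
		r->evcntr[i] = 0;
}

/* Models emulating the guest's write: only the guest range is reset. */
static void toy_pmcr_p_emulated(struct toy_pmu_regs *r)
{
	assert(r->hpmn <= TOY_NR_COUNTERS);
	for (unsigned int i = 0; i < r->hpmn; i++)
		r->evcntr[i] = 0;
}
```

With hpmn = 2, the first function wipes all six counters, while the
second clears only counters 0 and 1 and leaves the host's values alone.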
Re: [PATCH v7 00/20] ARM64 PMU Partitioning
Posted by Colton Lewis 4 weeks, 1 day ago
Hi James. Thanks for reviewing.

James Clark <james.clark@linaro.org> writes:

> [...]

> I tested it a bit and ran the kselftests and it all seems to be working

Great to hear you didn't find any obvious problems with your testing!

> ok. Some of the critical sashiko comments look like they are worth
> looking into though:
> https://sashiko.dev/#/patchset/20260504211813.1804997-1-coltonlewis%40google.com
> For example writing to PMCR_EL0.P from EL2 resets the host's counters,
> even if it's KVM doing it after trapping a write from the guest.

I will comb through this and the other sashiko comments and fix them.