[PATCH 16/17] KVM: arm64: Add ioctl to partition the PMU when supported

Posted by Colton Lewis 6 months, 2 weeks ago
Add KVM_ARM_PARTITION_PMU to partition the PMU for a given vCPU with a
specified number of reserved host counters. Add a corresponding
KVM_CAP_ARM_PARTITION_PMU to check for this ability.

This ioctl is allowed only on an initialized vCPU, and only where
PMUv3, VHE, and FGT are supported.

If the ioctl is never called, partitioning falls back on the kernel
command line parameter kvm.reserved_host_counters as before.
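
For illustration, a userspace caller might drive the new interface
roughly as follows (a sketch only, not part of this patch; kvm_fd and
vcpu_fd stand for the VM and vCPU file descriptors):

	__u8 host_counters = 2;

	/* Reserve 2 counters for the host, leaving the rest for direct
	 * guest access, if the kernel advertises the capability. */
	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_PARTITION_PMU) > 0 &&
	    ioctl(vcpu_fd, KVM_ARM_PARTITION_PMU, &host_counters) < 0)
		perror("KVM_ARM_PARTITION_PMU");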

Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 Documentation/virt/kvm/api.rst | 16 ++++++++++++++++
 arch/arm64/kvm/arm.c           | 21 +++++++++++++++++++++
 include/uapi/linux/kvm.h       |  4 ++++
 3 files changed, 41 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index fe3d6b5d2acc..88b851cb6f66 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6464,6 +6464,22 @@ the capability to be present.
 
 `flags` must currently be zero.
 
+4.144 KVM_ARM_PARTITION_PMU
+---------------------------
+
+:Capability: KVM_CAP_ARM_PARTITION_PMU
+:Architectures: arm64
+:Type: vcpu ioctl
+:Parameters: arg points to a __u8 holding the number of counters to reserve for the host
+
+This API controls the ability to partition the PMU counters into two
+sets, one reserved for the host and one reserved for the guest. When
+partitioned, KVM allows the guest direct hardware access to the most
+commonly used PMU capabilities for its counters, bypassing the KVM
+traps in the standard emulated PMU implementation and reducing the
+overhead of guest software that uses PMU capabilities such as
+`perf`. The host PMU driver will not access any of the counters or
+bits reserved for the guest.
 
 .. _kvm_run:
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 4a1cc7b72295..1c44160d3b2d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -21,6 +21,7 @@
 #include <linux/irqbypass.h>
 #include <linux/sched/stat.h>
 #include <linux/psci.h>
+#include <linux/perf/arm_pmu.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -38,6 +39,7 @@
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_mmu.h>
 #include <asm/kvm_nested.h>
+#include <asm/kvm_pmu.h>
 #include <asm/kvm_pkvm.h>
 #include <asm/kvm_ptrauth.h>
 #include <asm/sections.h>
@@ -382,6 +384,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_PMU_V3:
 		r = kvm_supports_guest_pmuv3();
 		break;
+	case KVM_CAP_ARM_PARTITION_PMU:
+		r = kvm_pmu_partition_supported();
+		break;
 	case KVM_CAP_ARM_INJECT_SERROR_ESR:
 		r = cpus_have_final_cap(ARM64_HAS_RAS_EXTN);
 		break;
@@ -1809,6 +1814,22 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 
 		return kvm_arm_vcpu_finalize(vcpu, what);
 	}
+	case KVM_ARM_PARTITION_PMU: {
+		struct arm_pmu *pmu;
+		u8 host_counters;
+
+		if (unlikely(!kvm_vcpu_initialized(vcpu)))
+			return -ENOEXEC;
+
+		if (!kvm_pmu_partition_supported())
+			return -EPERM;
+
+		if (copy_from_user(&host_counters, argp, sizeof(host_counters)))
+			return -EFAULT;
+
+		pmu = vcpu->kvm->arch.arm_pmu;
+		return kvm_pmu_partition(pmu, host_counters);
+	}
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c9d4a908976e..f7387c0696d5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -932,6 +932,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
 #define KVM_CAP_ARM_EL2 240
 #define KVM_CAP_ARM_EL2_E2H0 241
+#define KVM_CAP_ARM_PARTITION_PMU 242
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
@@ -1410,6 +1411,9 @@ struct kvm_enc_region {
 #define KVM_GET_SREGS2             _IOR(KVMIO,  0xcc, struct kvm_sregs2)
 #define KVM_SET_SREGS2             _IOW(KVMIO,  0xcd, struct kvm_sregs2)
 
+/* Available with KVM_CAP_ARM_PARTITION_PMU */
+#define KVM_ARM_PARTITION_PMU	_IOW(KVMIO, 0xce, __u8)
+
 #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE    (1 << 0)
 #define KVM_DIRTY_LOG_INITIALLY_SET            (1 << 1)
 
-- 
2.49.0.1204.g71687c7c1d-goog
Re: [PATCH 16/17] KVM: arm64: Add ioctl to partition the PMU when supported
Posted by Oliver Upton 6 months, 2 weeks ago
On Mon, Jun 02, 2025 at 07:27:01PM +0000, Colton Lewis wrote:
> +	case KVM_ARM_PARTITION_PMU: {

This should be a vCPU attribute similar to the other PMUv3 controls we
already have. Ideally a single attribute where userspace tells us it
wants paritioning and specifies the PMU ID to use. None of this can be
changed after INIT'ing the PMU.
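
For reference, the attribute-based flow would look roughly like this
from userspace (a sketch; KVM_ARM_VCPU_PMU_V3_PARTITION is a
hypothetical attribute name, while the KVM_ARM_VCPU_PMU_V3_CTRL group
and the KVM_SET_DEVICE_ATTR vcpu ioctl already exist):

	__u8 host_counters = 2;
	struct kvm_device_attr attr = {
		.group	= KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr	= KVM_ARM_VCPU_PMU_V3_PARTITION, /* hypothetical */
		.addr	= (__u64)(unsigned long)&host_counters,
	};

	/* Per the above, this must be rejected after KVM_ARM_VCPU_PMU_V3_INIT. */
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);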

> +		struct arm_pmu *pmu;
> +		u8 host_counters;
> +
> +		if (unlikely(!kvm_vcpu_initialized(vcpu)))
> +			return -ENOEXEC;
> +
> +		if (!kvm_pmu_partition_supported())
> +			return -EPERM;
> +
> +		if (copy_from_user(&host_counters, argp, sizeof(host_counters)))
> +			return -EFAULT;
> +
> +		pmu = vcpu->kvm->arch.arm_pmu;
> +		return kvm_pmu_partition(pmu, host_counters);

Yeah, we really can't be changing the counters available to the ARM PMU
driver at this point. What happens to host events already scheduled on
the CPU?

Either the partition of host / KVM-owned counters needs to be computed
up front (prior to scheduling events) or KVM needs a way to direct perf
to reschedule events on the PMU based on the new operating constraints.

Thanks,
Oliver
Re: [PATCH 16/17] KVM: arm64: Add ioctl to partition the PMU when supported
Posted by Colton Lewis 6 months, 2 weeks ago
Oliver Upton <oliver.upton@linux.dev> writes:

> On Mon, Jun 02, 2025 at 07:27:01PM +0000, Colton Lewis wrote:
>> +	case KVM_ARM_PARTITION_PMU: {

> This should be a vCPU attribute similar to the other PMUv3 controls we
> already have. Ideally a single attribute where userspace tells us it
> wants paritioning and specifies the PMU ID to use. None of this can be
> changed after INIT'ing the PMU.

Okay

>> +		struct arm_pmu *pmu;
>> +		u8 host_counters;
>> +
>> +		if (unlikely(!kvm_vcpu_initialized(vcpu)))
>> +			return -ENOEXEC;
>> +
>> +		if (!kvm_pmu_partition_supported())
>> +			return -EPERM;
>> +
>> +		if (copy_from_user(&host_counters, argp, sizeof(host_counters)))
>> +			return -EFAULT;
>> +
>> +		pmu = vcpu->kvm->arch.arm_pmu;
>> +		return kvm_pmu_partition(pmu, host_counters);

> Yeah, we really can't be changing the counters available to the ARM PMU
> driver at this point. What happens to host events already scheduled on
> the CPU?

Okay. I remember talking about this before.

> Either the partition of host / KVM-owned counters needs to be computed
> up front (prior to scheduling events) or KVM needs a way to direct perf
> to reschedule events on the PMU based on the new operating constraints.

Yes. I will think about it.
Re: [PATCH 16/17] KVM: arm64: Add ioctl to partition the PMU when supported
Posted by Colton Lewis 6 months, 2 weeks ago
> Oliver Upton <oliver.upton@linux.dev> writes:

>> On Mon, Jun 02, 2025 at 07:27:01PM +0000, Colton Lewis wrote:
>>> +	case KVM_ARM_PARTITION_PMU: {

>> This should be a vCPU attribute similar to the other PMUv3 controls we
>> already have. Ideally a single attribute where userspace tells us it
>> wants paritioning and specifies the PMU ID to use. None of this can be
>> changed after INIT'ing the PMU.

> Okay

>>> +		struct arm_pmu *pmu;
>>> +		u8 host_counters;
>>> +
>>> +		if (unlikely(!kvm_vcpu_initialized(vcpu)))
>>> +			return -ENOEXEC;
>>> +
>>> +		if (!kvm_pmu_partition_supported())
>>> +			return -EPERM;
>>> +
>>> +		if (copy_from_user(&host_counters, argp, sizeof(host_counters)))
>>> +			return -EFAULT;
>>> +
>>> +		pmu = vcpu->kvm->arch.arm_pmu;
>>> +		return kvm_pmu_partition(pmu, host_counters);

>> Yeah, we really can't be changing the counters available to the ARM PMU
>> driver at this point. What happens to host events already scheduled on
>> the CPU?

> Okay. I remember talking about this before.

>> Either the partition of host / KVM-owned counters needs to be computed
>> up front (prior to scheduling events) or KVM needs a way to direct perf
>> to reschedule events on the PMU based on the new operating constraints.

> Yes. I will think about it.

It would be cool to have perf reschedule events. I'm not positive how
to do that, but it doesn't look too hard. Can someone comment on the
correctness and feasibility here?

1. Scan perf events and call event_sched_out on all events using the
    counters KVM wants.
2. Do the PMU surgery to change the available counters.
3. Call ctx_resched to reschedule events with the available counters.

There is a second option to avoid a permanent partition up front. We
know which counters are in use through used_mask. We could check if the
partition would claim any counters in use and fail with an error if it
would.
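
For concreteness, a rough sketch of that check (assuming the existing
used_mask bitmap in struct pmu_hw_events and num_events in struct
arm_pmu, and that all counters fit in a single unsigned long; a full
version would need to look at every CPU's hw_events, not just the
local one):

	struct pmu_hw_events *hw_events = this_cpu_ptr(pmu->hw_events);
	unsigned long guest_counters = GENMASK(pmu->num_events - 1,
					       host_counters);

	/* Refuse to partition if any counter the guest would own is
	 * already claimed by a host perf event. */
	if (bitmap_intersects(hw_events->used_mask, &guest_counters,
			      pmu->num_events))
		return -EBUSY;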