From: Jan H. Schönherr <jschoenh@amazon.de>
It is possible to degrade host performance by manipulating performance
counters from a VM and tricking the host hypervisor into enabling branch
tracing. When the guest programs a CPU to track branch instructions and
deliver an interrupt after exactly one branch instruction, the value one
is handled by the host KVM/perf subsystems and incorrectly treated as a
special value that enables the branch trace store (BTS) subsystem. It
should not be possible to enable BTS from a guest. When BTS is enabled,
it degrades performance for both the VMs and the host.
Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with
a sample_period of 1 a special case and handles it as a BTS event (see
intel_pmu_has_bts_period()) -- a deviation from the usual semantics,
where sample_period is the number of branch instructions to encounter
before the overflow handler is invoked.
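The special case described above can be modelled in a few lines of
userspace C. This is only a sketch of the condition as stated in this
message, not the kernel's intel_pmu_has_bts_period(); every sketch_*
name is hypothetical and exists purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the generic perf hardware event IDs. */
enum sketch_hw_event {
	SKETCH_HW_CPU_CYCLES,
	SKETCH_HW_BRANCH_INSTRUCTIONS,
};

/* Hypothetical, simplified event description; not the kernel's perf_event. */
struct sketch_event {
	enum sketch_hw_event type;
	uint64_t sample_period;
};

/*
 * Models the special case: counting branch instructions with a
 * sample_period of exactly 1 is interpreted as a request for BTS
 * rather than as "overflow after one branch".
 */
static bool sketch_is_bts_request(const struct sketch_event *ev)
{
	return ev->type == SKETCH_HW_BRANCH_INSTRUCTIONS &&
	       ev->sample_period == 1;
}
```

In this model a branch event with sample_period 2 is an ordinary
sampling event, while the same event with sample_period 1 is not.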
Nothing prevents a guest from programming its vPMU with the above
settings (count branch, interrupt after one branch), which causes KVM to
erroneously instruct perf to create a BTS event within
pmc_reprogram_counter(). That event does not have the desired semantics.
The guest could also do something more benign and request an interrupt
after a more reasonable number of branch instructions via its vPMU. In
that case counting works initially. However, KVM occasionally pauses and
resumes the created performance counters. If the number of branch
instructions remaining until the interrupt is exactly 1 at that point,
pmc_resume_counter() fails to resume the counter and a BTS event, with
its incorrect semantics, is created instead.
Fix this behavior by not passing the special value "1" as sample_period
to perf. Instead, perform the same quirk that happens later in
x86_perf_event_set_period() anyway, when the performance counter is
transferred to the actual PMU: bump the sample_period to 2.
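The effect of the fix can be checked with a userspace model of
get_sample_period(). This is a sketch under one assumption: the counter
is 48 bits wide, matching the 0xffffffffffff value used in the testing
section. SKETCH_PMC_BITMASK and sketch_get_sample_period() are
hypothetical stand-ins for pmc_bitmask(pmc) and the kernel function.

```c
#include <stdint.h>

/* Assumption: a 48-bit counter; the kernel uses pmc_bitmask(pmc) instead. */
#define SKETCH_PMC_BITMASK 0xffffffffffffULL

/* Number of events left until the counter overflows, as passed to perf. */
static uint64_t sketch_get_sample_period(uint64_t counter_value)
{
	/* Unsigned negation is well defined; the mask keeps the low 48 bits. */
	uint64_t sample_period = (-counter_value) & SKETCH_PMC_BITMASK;

	/* Never hand the BTS special value 1 to perf. */
	if (sample_period == 1)
		sample_period = 2;

	/* A counter value of 0 means a full wrap-around is needed. */
	if (!sample_period)
		sample_period = SKETCH_PMC_BITMASK + 1;
	return sample_period;
}
```

With a counter value of 0xffffffffffff the raw period is 1, so the model
returns 2. All other counter values are unaffected by the new check.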
Testing:
From guest:
`./wrmsr -p 12 0x186 0x1100c4`
`./wrmsr -p 12 0xc1 0xffffffffffff`
`./wrmsr -p 12 0x186 0x5100c4`
This sequence programs branch instruction counting with the counter
still disabled, initializes the counter to overflow after one event
(0xffffffffffff), and then enables the counter (bit 22).
./wrmsr -p 12 0x186 0x1100c4
  Writes to IA32_PERFEVTSEL0 (0x186)
  Value 0x1100c4 breaks down as:
    Bits 0-7: 0xC4 (event select: branch instructions)
    Bit 16 (USR): 1 (count in user mode only)
    Bit 20 (INT): 1 (raise an interrupt on overflow)
    Bit 22 (EN): 0 (counter not yet enabled)
./wrmsr -p 12 0xc1 0xffffffffffff
  Writes to IA32_PMC0 (0xC1)
  Sets the counter to its maximum value (0xffffffffffff)
  This sets up the counter to overflow on the next counted branch
./wrmsr -p 12 0x186 0x5100c4
  Updates IA32_PERFEVTSEL0 again
  Same as the first value, plus bit 22 (0x1 becomes 0x5 in bits 20-23)
  Bit 22 (EN) enables the counter
These MSR writes are trapped by the hypervisor in KVM and forwarded to
the perf subsystem to create corresponding monitoring events.
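The two IA32_PERFEVTSEL0 values can be decoded with a small helper. This
is a sketch: the field positions follow the architectural layout of
IA32_PERFEVTSELx (event select in bits 0-7, unit mask in bits 8-15, USR
bit 16, OS bit 17, edge detect bit 18, INT bit 20, EN bit 22), and the
sketch_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of an IA32_PERFEVTSELx value (subset of fields). */
struct sketch_evtsel {
	uint8_t event;	/* bits 0-7: event select */
	uint8_t umask;	/* bits 8-15: unit mask */
	bool usr;	/* bit 16: count in user mode */
	bool os;	/* bit 17: count in kernel mode */
	bool edge;	/* bit 18: edge detect */
	bool intr;	/* bit 20: interrupt on overflow */
	bool enable;	/* bit 22: counter enable */
};

/* Split a raw event-select value into the fields above. */
static struct sketch_evtsel sketch_decode_evtsel(uint64_t val)
{
	struct sketch_evtsel d = {
		.event  = (uint8_t)(val & 0xff),
		.umask  = (uint8_t)((val >> 8) & 0xff),
		.usr    = (val >> 16) & 1,
		.os     = (val >> 17) & 1,
		.edge   = (val >> 18) & 1,
		.intr   = (val >> 20) & 1,
		.enable = (val >> 22) & 1,
	};
	return d;
}
```

Decoding 0x1100c4 and 0x5100c4 with this helper gives identical fields
except for the enable bit, which is set only in the second value.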
It is possible to reproduce this problem in a more realistic guest scenario:
`perf record -e branches:u -c 2 -a &`
`perf record -e branches:u -c 2 -a &`
This presumably triggers the issue when KVM pauses and resumes the
performance counter at the wrong moment, just as the counter is about
to overflow.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Hendrik Borghorst <hborghor@amazon.de>
Link: https://lore.kernel.org/r/20251124100220.238177-1-sieberf@amazon.com
---
arch/x86/kvm/pmu.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 487ad19a236e..547512028e24 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
 {
 	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
 
+	/*
+	 * A sample_period of 1 might get mistaken by perf for a BTS event, see
+	 * intel_pmu_has_bts_period(). This would prevent re-arming the counter
+	 * via pmc_resume_counter(), followed by the accidental creation of an
+	 * actual BTS event, which we do not want.
+	 *
+	 * Avoid this by bumping the sampling period. Note, that we do not lose
+	 * any precision, because the same quirk happens later anyway (for
+	 * different reasons) in x86_perf_event_set_period().
+	 */
+	if (sample_period == 1)
+		sample_period = 2;
+
 	if (!sample_period)
 		sample_period = pmc_bitmask(pmc) + 1;
 	return sample_period;
--
2.43.0