From: Jan H. Schönherr <jschoenh@amazon.de>
It is possible to degrade host performance by manipulating performance
counters from a VM and tricking the host hypervisor into enabling branch
tracing. When the guest programs a CPU to track branch instructions and
deliver an interrupt after exactly one branch instruction, the value one
is handled by the host KVM/perf subsystems and incorrectly treated as a
special value that enables the branch trace store (BTS) facility. It
should not be possible to enable BTS from a guest. When BTS is enabled,
it degrades performance for both the VMs and the host.
Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with
a sample_period of 1 a special case and handles it as a BTS event (see
intel_pmu_has_bts_period()) -- a deviation from the usual semantics,
where sample_period represents the number of branch instructions to
count before the overflow handler is invoked.
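For reference, the perf-side check that implements this special case is
roughly the following (a paraphrase of intel_pmu_has_bts_period() as it
exists before any of the changes discussed here, not verbatim kernel
source):

  /* Paraphrase of the existing period==1 BTS special case. */
  static inline bool has_bts_period(struct perf_event *event, u64 period)
  {
          unsigned int hw_event, bts_event;

          if (event->attr.freq)   /* frequency mode never uses BTS */
                  return false;

          hw_event  = event->hw.config & INTEL_ARCH_EVENT_MASK;
          bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);

          /* branch-instructions event with a fixed period of 1 => BTS */
          return hw_event == bts_event && period == 1;
  }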
Nothing prevents a guest from programming its vPMU with the above
settings (count branches, interrupt after one branch). This causes KVM
to erroneously instruct perf to create a BTS event within
pmc_reprogram_counter(), an event that does not have the desired
semantics.
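Conceptually, the event that KVM ends up asking perf to create in this
situation looks like the sketch below (simplified; the real attr is
built in pmc_reprogram_counter() and typically uses a raw config, but
the relevant fields here are the period of 1 and exclude_host):

  struct perf_event_attr attr = {
          .type          = PERF_TYPE_HARDWARE,
          .config        = PERF_COUNT_HW_BRANCH_INSTRUCTIONS,
          .sample_period = 1,   /* guest asked for an interrupt after one branch */
          .exclude_host  = 1,   /* count guest execution only */
  };
  /* perf then misclassifies this as a BTS event because of the
   * period==1 special case shown above. */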
The guest could also do something more benign and request an interrupt
after a more reasonable number of branch instructions via its vPMU. In
that case counting initially works as expected. However, KVM
occasionally pauses and resumes the created performance counters. If the
remaining number of branch instructions until the interrupt has reached
exactly 1, pmc_resume_counter() fails to resume the counter, and a BTS
event with the wrong semantics is created instead.
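To illustrate how the period decays to 1: on resume, KVM recomputes the
period from the current counter value, so a counter that is one branch
away from overflow yields a period of exactly 1 (sketch assuming a
48-bit wide counter):

  u64 mask    = (1ULL << 48) - 1;   /* assumed 48-bit general purpose counter */
  u64 counter = mask;               /* one branch away from overflow */
  u64 period  = (-counter) & mask;  /* same formula as get_sample_period() */
  /* period == 1, which perf treats as a request for BTS */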
Fix this behavior by not passing the special value "1" as sample_period
to perf. Instead, perform the same quirk that happens later in
x86_perf_event_set_period() anyway, when the performance counter is
transferred to the actual PMU: bump the sample_period to 2.
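For reference, the quirk in x86_perf_event_set_period() that this
mirrors is roughly the following (paraphrased from
arch/x86/events/core.c):

  /*
   * Quirk: certain CPUs don't like it if just 1 hw_event is left
   * before overflow, so the effective period is bumped to at least 2.
   */
  if (unlikely(left < 2))
          left = 2;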
Testing:
From guest:
`./wrmsr -p 12 0x186 0x1100c4`
`./wrmsr -p 12 0xc1 0xffffffffffff`
`./wrmsr -p 12 0x186 0x5100c4`
This sequence programs branch instruction counting, initializes the
counter so that it overflows after one event (0xffffffffffff), and then
enables the counter (bit 22).
./wrmsr -p 12 0x186 0x1100c4
Writes to IA32_PERFEVTSEL0 (0x186)
Value 0x1100c4 breaks down as:
Event = 0xC4 (Branch instructions)
Bits 16-17: 0x1 (User mode only)
Bit 20: 1 (Interrupt on counter overflow)
./wrmsr -p 12 0xc1 0xffffffffffff
Writes to IA32_PMC0 (0xC1)
Sets counter to maximum value (0xffffffffffff)
This effectively sets up the counter to overflow on the next branch
./wrmsr -p 12 0x186 0x5100c4
Updates IA32_PERFEVTSEL0 again
Similar to first command but adds bit 22 (0x1 to 0x5 in that nibble)
Enables the counter (bit 22)
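Decoding the two event-select values with the standard IA32_PERFEVTSELx
bit layout gives the following sketch (the macro names below are
illustrative, not the kernel's ARCH_PERFMON_EVENTSEL_* definitions):

  #define EVTSEL_EVENT_BRANCHES  0xc4           /* branch instructions retired */
  #define EVTSEL_USR             (1ULL << 16)   /* count in user mode */
  #define EVTSEL_INT             (1ULL << 20)   /* interrupt on overflow */
  #define EVTSEL_EN              (1ULL << 22)   /* enable the counter */

  /* 0x1100c4: event programmed and PMI armed, counter not yet enabled */
  u64 setup  = EVTSEL_EVENT_BRANCHES | EVTSEL_USR | EVTSEL_INT;
  /* 0x5100c4: same configuration with the enable bit added */
  u64 enable = setup | EVTSEL_EN;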
These MSR writes are trapped by the hypervisor in KVM and forwarded to
the perf subsystem to create corresponding monitoring events.
It is possible to repro this problem in a more realistic guest scenario:
`perf record -e branches:u -c 2 -a &`
`perf record -e branches:u -c 2 -a &`
This presumably triggers the issue by KVM pausing and resuming the
performance counter at the wrong moment, when its value is about to
overflow.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Hendrik Borghorst <hborghor@amazon.de>
Link: https://lore.kernel.org/r/20251124100220.238177-1-sieberf@amazon.com
---
arch/x86/kvm/pmu.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 487ad19a236e..547512028e24 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
{
u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
+ /*
+ * A sample_period of 1 might get mistaken by perf for a BTS event, see
+ * intel_pmu_has_bts_period(). This would prevent re-arming the counter
+ * via pmc_resume_counter(), followed by the accidental creation of an
+ * actual BTS event, which we do not want.
+ *
+ * Avoid this by bumping the sampling period. Note, that we do not lose
+ * any precision, because the same quirk happens later anyway (for
+ * different reasons) in x86_perf_event_set_period().
+ */
+ if (sample_period == 1)
+ sample_period = 2;
+
if (!sample_period)
sample_period = pmc_bitmask(pmc) + 1;
return sample_period;
--
2.43.0
On Mon, Dec 01, 2025 at 04:23:57PM +0200, Fernand Sieber wrote:
> arch/x86/kvm/pmu.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 487ad19a236e..547512028e24 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> {
> u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
>
> + /*
> + * A sample_period of 1 might get mistaken by perf for a BTS event, see
> + * intel_pmu_has_bts_period(). This would prevent re-arming the counter
> + * via pmc_resume_counter(), followed by the accidental creation of an
> + * actual BTS event, which we do not want.
> + *
> + * Avoid this by bumping the sampling period. Note, that we do not lose
> + * any precision, because the same quirk happens later anyway (for
> + * different reasons) in x86_perf_event_set_period().
> + */
> + if (sample_period == 1)
> + sample_period = 2;
> +
> if (!sample_period)
> sample_period = pmc_bitmask(pmc) + 1;
> return sample_period;
Oh gawd, I so hate this kvm code. It is so ludicrously bad. The way it
keeps recreating counters is just stupid. And then they complain it
sucks, it does :-(
Anyway, yes this is terrible. Let me try and untangle all this, see if
there's a saner solution.
On Tue, Dec 02, 2025 at 11:03:11AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 01, 2025 at 04:23:57PM +0200, Fernand Sieber wrote:
> > arch/x86/kvm/pmu.c | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 487ad19a236e..547512028e24 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> > {
> > u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
> >
> > + /*
> > + * A sample_period of 1 might get mistaken by perf for a BTS event, see
> > + * intel_pmu_has_bts_period(). This would prevent re-arming the counter
> > + * via pmc_resume_counter(), followed by the accidental creation of an
> > + * actual BTS event, which we do not want.
> > + *
> > + * Avoid this by bumping the sampling period. Note, that we do not lose
> > + * any precision, because the same quirk happens later anyway (for
> > + * different reasons) in x86_perf_event_set_period().
> > + */
> > + if (sample_period == 1)
> > + sample_period = 2;
> > +
> > if (!sample_period)
> > sample_period = pmc_bitmask(pmc) + 1;
> > return sample_period;
>
> Oh gawd, I so hate this kvm code. It is so ludicrously bad. The way it
> keeps recreating counters is just stupid. And then they complain it
> sucks, it does :-(
>
> Anyway, yes this is terrible. Let me try and untangle all this, see if
> there's a saner solution.
Does something like so work? It is still terrible, but perhaps slightly
less so.
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2b969386dcdd..493e6ba51e06 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period
struct hw_perf_event *hwc = &event->hw;
unsigned int hw_event, bts_event;
- if (event->attr.freq)
+ /*
+ * Only use BTS for fixed rate period==1 events.
+ */
+ if (event->attr.freq || period != 1)
+ return false;
+
+ /*
+ * BTS doesn't virtualize.
+ */
+ if (event->attr.exclude_host)
return false;
hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
- return hw_event == bts_event && period == 1;
+ return hw_event == bts_event;
}
static inline bool intel_pmu_has_bts(struct perf_event *event)
On Tue, Dec 02, 2025 at 01:44:23PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 02, 2025 at 11:03:11AM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 01, 2025 at 04:23:57PM +0200, Fernand Sieber wrote:
> > > arch/x86/kvm/pmu.c | 13 +++++++++++++
> > > 1 file changed, 13 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > > index 487ad19a236e..547512028e24 100644
> > > --- a/arch/x86/kvm/pmu.c
> > > +++ b/arch/x86/kvm/pmu.c
> > > @@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> > > {
> > > u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
> > >
> > > + /*
> > > + * A sample_period of 1 might get mistaken by perf for a BTS event, see
> > > + * intel_pmu_has_bts_period(). This would prevent re-arming the counter
> > > + * via pmc_resume_counter(), followed by the accidental creation of an
> > > + * actual BTS event, which we do not want.
> > > + *
> > > + * Avoid this by bumping the sampling period. Note, that we do not lose
> > > + * any precision, because the same quirk happens later anyway (for
> > > + * different reasons) in x86_perf_event_set_period().
> > > + */
> > > + if (sample_period == 1)
> > > + sample_period = 2;
> > > +
> > > if (!sample_period)
> > > sample_period = pmc_bitmask(pmc) + 1;
> > > return sample_period;
> >
> > Oh gawd, I so hate this kvm code. It is so ludicrously bad. The way it
> > keeps recreating counters is just stupid. And then they complain it
> > sucks, it does :-(
> >
> > Anyway, yes this is terrible. Let me try and untangle all this, see if
> > there's a saner solution.
>
> Does something like so work? It is still terrible, but perhaps slightly
> less so.
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 2b969386dcdd..493e6ba51e06 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period
> struct hw_perf_event *hwc = &event->hw;
> unsigned int hw_event, bts_event;
>
> - if (event->attr.freq)
> + /*
> + * Only use BTS for fixed rate period==1 events.
> + */
> + if (event->attr.freq || period != 1)
> + return false;
> +
> + /*
> + * BTS doesn't virtualize.
> + */
> + if (event->attr.exclude_host)
> return false;
>
> hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
> bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
>
> - return hw_event == bts_event && period == 1;
> + return hw_event == bts_event;
> }
>
> static inline bool intel_pmu_has_bts(struct perf_event *event)
Hi Peter,
I've pulled your changes and confirmed that they address the original
bug report.
The repro I use is running on host, with a guest running:
`perf record -e branches:u -c 2 -a &`
`perf record -e branches:u -c 2 -a &`
Then I monitor the enablement of BTS on the host and verify that without
the change BTS is enabled, and with the change it's not.
This looks good to me, should we go ahead with your changes then?
--Fernand
On Wed, Dec 10, 2025 at 12:11:47PM +0200, Fernand Sieber wrote:
> > Does something like so work? It is still terrible, but perhaps slightly
> > less so.
> >
> > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > index 2b969386dcdd..493e6ba51e06 100644
> > --- a/arch/x86/events/perf_event.h
> > +++ b/arch/x86/events/perf_event.h
> > @@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period
> > struct hw_perf_event *hwc = &event->hw;
> > unsigned int hw_event, bts_event;
> >
> > - if (event->attr.freq)
> > + /*
> > + * Only use BTS for fixed rate period==1 events.
> > + */
> > + if (event->attr.freq || period != 1)
> > + return false;
> > +
> > + /*
> > + * BTS doesn't virtualize.
> > + */
> > + if (event->attr.exclude_host)
> > return false;
> >
> > hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
> > bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
> >
> > - return hw_event == bts_event && period == 1;
> > + return hw_event == bts_event;
> > }
> >
> > static inline bool intel_pmu_has_bts(struct perf_event *event)
>
> Hi Peter,
>
> I've pulled your changes and confirmed that they address the original
> bug report.
>
> The repro I use is running on host, with a guest running:
> `perf record -e branches:u -c 2 -a &`
> `perf record -e branches:u -c 2 -a &`
> Then I monitor the enablement of BTS on the host and verify that without
> the change BTS is enabled, and with the change it's not.
>
> This looks good to me, should we go ahead with your changes then?

Yeah, I suppose. Please stick a coherent changelog on and repost.
By default when users program perf to sample branch instructions
(PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf
interprets this as a special case and enables BTS (Branch Trace Store)
as an optimization to avoid taking an interrupt on every branch.
Since BTS doesn't virtualize, this optimization doesn't make sense when
the request originates from a guest. Add an additional check that
prevents this optimization for virtualized events (exclude_host).
Reported-by: Jan H. Schönherr <jschoenh@amazon.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
---
arch/x86/events/perf_event.h | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3161ec0a3416..f2e2d9b03367 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1574,13 +1574,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period
struct hw_perf_event *hwc = &event->hw;
unsigned int hw_event, bts_event;
- if (event->attr.freq)
+ /*
+ * Only use BTS for fixed rate period==1 events.
+ */
+ if (event->attr.freq || period != 1)
+ return false;
+
+ /*
+ * BTS doesn't virtualize.
+ */
+ if (event->attr.exclude_host)
return false;
hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
- return hw_event == bts_event && period == 1;
+ return hw_event == bts_event;
}
static inline bool intel_pmu_has_bts(struct perf_event *event)
--
2.43.0
On Thu, 2025-12-11 at 20:36 +0200, Fernand Sieber wrote:
> By default when users program perf to sample branch instructions
> (PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf
> interprets this as a special case and enables BTS (Branch Trace Store)
> as an optimization to avoid taking an interrupt on every branch.
>
> Since BTS doesn't virtualize, this optimization doesn't make sense when
> the request originates from a guest. Add an additional check that
> prevents this optimization for virtualized events (exclude_host).
>
> Reported-by: Jan H. Schönherr <jschoenh@amazon.de>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>

Did this get applied?

I think you want Cc: stable@vger.kernel.org here in the body of the
commit itself, not just on the actual email.

> ---
> arch/x86/events/perf_event.h | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 3161ec0a3416..f2e2d9b03367 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -1574,13 +1574,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period
> struct hw_perf_event *hwc = &event->hw;
> unsigned int hw_event, bts_event;
>
> - if (event->attr.freq)
> + /*
> + * Only use BTS for fixed rate period==1 events.
> + */
> + if (event->attr.freq || period != 1)
> + return false;
> +
> + /*
> + * BTS doesn't virtualize.
> + */
> + if (event->attr.exclude_host)
> return false;
>
> hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
> bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
>
> - return hw_event == bts_event && period == 1;
> + return hw_event == bts_event;
> }
>
> static inline bool intel_pmu_has_bts(struct perf_event *event)
Hi Peter,

Could you please take another look and see if you are happy to pull in v2
which implements the approach that you suggested?

Thanks,
--Fernand
On Wed, Jan 21, 2026 at 03:57:13PM +0200, Fernand Sieber wrote:
> Hi Peter,
>
> Could you please take another look and see if you are happy to pull in v2 which
> implements the approach that you suggested?

Yeah, sorry, fell into a crack and all that. Got it now. Thanks!
The following commit has been merged into the perf/urgent branch of tip:
Commit-ID: 91dcfae0ff2b9b9ab03c1ec95babaceefbffb9f4
Gitweb: https://git.kernel.org/tip/91dcfae0ff2b9b9ab03c1ec95babaceefbffb9f4
Author: Fernand Sieber <sieberf@amazon.com>
AuthorDate: Thu, 11 Dec 2025 20:36:04 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 21 Jan 2026 16:28:59 +01:00
perf/x86/intel: Do not enable BTS for guests
By default when users program perf to sample branch instructions
(PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf
interprets this as a special case and enables BTS (Branch Trace Store)
as an optimization to avoid taking an interrupt on every branch.
Since BTS doesn't virtualize, this optimization doesn't make sense when
the request originates from a guest. Add an additional check that
prevents this optimization for virtualized events (exclude_host).
Reported-by: Jan H. Schönherr <jschoenh@amazon.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20251211183604.868641-1-sieberf@amazon.com
---
arch/x86/events/perf_event.h | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 6296302..ad35c54 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1574,13 +1574,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period
struct hw_perf_event *hwc = &event->hw;
unsigned int hw_event, bts_event;
- if (event->attr.freq)
+ /*
+ * Only use BTS for fixed rate period==1 events.
+ */
+ if (event->attr.freq || period != 1)
+ return false;
+
+ /*
+ * BTS doesn't virtualize.
+ */
+ if (event->attr.exclude_host)
return false;
hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
- return hw_event == bts_event && period == 1;
+ return hw_event == bts_event;
}
static inline bool intel_pmu_has_bts(struct perf_event *event)
On Tue, Dec 02, 2025, Peter Zijlstra wrote:
> Does something like so work? It is still terrible, but perhaps slightly
> less so.
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 2b969386dcdd..493e6ba51e06 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period
> struct hw_perf_event *hwc = &event->hw;
> unsigned int hw_event, bts_event;
>
> - if (event->attr.freq)
> + /*
> + * Only use BTS for fixed rate period==1 events.
> + */
> + if (event->attr.freq || period != 1)
> + return false;
> +
> + /*
> + * BTS doesn't virtualize.
> + */
> + if (event->attr.exclude_host)

Ya, this seems like the right fix.

Pulling in the original bug report:

  When BTS is enabled, it leads to general host performance degradation
  to both VMs and host.

I assume the underlying problem is that intel_pmu_enable_bts() is called
even when the event isn't enabled in the host, and BTS doesn't
discriminate once it's enabled and bogs down the host (but presumably not
the guest, at least not directly, since KVM should prevent setting BTS in
vmcs.GUEST_IA32_DEBUGCTL).

Enabling BTS for exclude-host events simply can't work, even if the event
came from host userspace.
On 12/1/25 10:23 PM, Fernand Sieber wrote:
> From: Jan H. Schönherr <jschoenh@amazon.de>
>
> It is possible to degrade host performance by manipulating performance
> counters from a VM and tricking the host hypervisor to enable branch
> tracing. When the guest programs a CPU to track branch instructions and
> deliver an interrupt after exactly one branch instruction, the value one
> is handled by the host KVM/perf subsystems and treated incorrectly as a
> special value to enable the branch trace store (BTS) subsystem. It
Based on my observations of PMU users, this is treated as a feature
(using the PMC path to trigger BTS: generating a sample for each branch
already makes it functionally and implementation-wise identical to BTS),
and it undoubtedly harms performance (just like other trace-based PMU
facilities).
[*] perf record -e branches:u -c 1 -d ls
> should not be possible to enable BTS from a guest. When BTS is enabled,
> it leads to general host performance degradation to both VMs and host.
>
> Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with
> a sample_period of 1 a special case and handles this as a BTS event (see
> intel_pmu_has_bts_period()) -- a deviation from the usual semantic,
> where the sample_period represents the amount of branch instructions to
> encounter before the overflow handler is invoked.
>
> Nothing prevents a guest from programming its vPMU with the above
> settings (count branch, interrupt after one branch), which causes KVM to
> erroneously instruct perf to create a BTS event within
> pmc_reprogram_counter(), which does not have the desired semantics.
>
> The guest could also do more benign actions and request an interrupt
> after a more reasonable number of branch instructions via its vPMU. In
> that case counting works initially. However, KVM occasionally pauses and
> resumes the created performance counters. If the remaining amount of
> branch instructions until interrupt has reached 1 exactly,
> pmc_resume_counter() fails to resume the counter and a BTS event is
> created instead with its incorrect semantics.
>
> Fix this behavior by not passing the special value "1" as sample_period
> to perf. Instead, perform the same quirk that happens later in
> x86_perf_event_set_period() anyway, when the performance counter is
> transferred to the actual PMU: bump the sample_period to 2.
>
> Testing:
> From guest:
> `./wrmsr -p 12 0x186 0x1100c4`
> `./wrmsr -p 12 0xc1 0xffffffffffff`
> `./wrmsr -p 12 0x186 0x5100c4`
>
> This sequence programs branch instruction counting, initializes the
> counter so that it overflows after one event (0xffffffffffff), and then
> enables the counter (bit 22).
>
> ./wrmsr -p 12 0x186 0x1100c4
> Writes to IA32_PERFEVTSEL0 (0x186)
> Value 0x1100c4 breaks down as:
> Event = 0xC4 (Branch instructions)
> Bits 16-17: 0x1 (User mode only)
> Bit 20: 1 (Interrupt on counter overflow)
>
> ./wrmsr -p 12 0xc1 0xffffffffffff
> Writes to IA32_PMC0 (0xC1)
> Sets counter to maximum value (0xffffffffffff)
> This effectively sets up the counter to overflow on the next branch
>
> ./wrmsr -p 12 0x186 0x5100c4
> Updates IA32_PERFEVTSEL0 again
> Similar to first command but adds bit 22 (0x1 to 0x5 in that nibble)
> Enables the counter (bit 22)
>
> These MSR writes are trapped by the hypervisor in KVM and forwarded to
> the perf subsystem to create corresponding monitoring events.
>
> It is possible to repro this problem in a more realistic guest scenario:
>
> `perf record -e branches:u -c 2 -a &`
> `perf record -e branches:u -c 2 -a &`
In this reproduction case, is there any unexpected memory corruption
(related to an unallocated BTS buffer)?
>
> This presumably triggers the issue by KVM pausing and resuming the
> performance counter at the wrong moment, when its value is about to
> overflow.
>
> Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Reviewed-by: Hendrik Borghorst <hborghor@amazon.de>
> Link: https://lore.kernel.org/r/20251124100220.238177-1-sieberf@amazon.com
> ---
> arch/x86/kvm/pmu.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 487ad19a236e..547512028e24 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> {
> u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
>
> + /*
> + * A sample_period of 1 might get mistaken by perf for a BTS event, see
> + * intel_pmu_has_bts_period(). This would prevent re-arming the counter
> + * via pmc_resume_counter(), followed by the accidental creation of an
> + * actual BTS event, which we do not want.
> + *
> + * Avoid this by bumping the sampling period. Note, that we do not lose
> + * any precision, because the same quirk happens later anyway (for
> + * different reasons) in x86_perf_event_set_period().
> + */
> + if (sample_period == 1)
> + sample_period = 2;
Even without a PERF_COUNT_HW_BRANCH_INSTRUCTIONS event check?
> +
> if (!sample_period)
> sample_period = pmc_bitmask(pmc) + 1;
> return sample_period;