[PATCH v2 0/4] perf/x86: Don't write PEBS_ENABLED on KVM transitions

Sean Christopherson posted 4 patches 1 month, 3 weeks ago
There is a newer version of this series
[PATCH v2 0/4] perf/x86: Don't write PEBS_ENABLED on KVM transitions
Posted by Sean Christopherson 1 month, 3 weeks ago
Rework the handling of PEBS_ENABLED (and related PEBS MSRs) to *never* touch
PEBS_ENABLED if the CPU provides PEBS isolation, in which case disabling
counters via PERF_GLOBAL_CTRL is sufficient to prevent generation of unwanted
PEBS records.  For vCPUs without PEBS enabled, this saves upwards of 7 MSR
writes on each roundtrip between the guest and host (KVM performs an immediate
WRMSR to zero out PEBS_ENABLED if it's in the load list).  For vCPUs with PEBS,
this saves 3 MSR writes per roundtrip.

However, performance isn't the underlying motivation.  We (more accurately,
Jim, Mingwei, and Stephane) have been chasing issues where PEBS_ENABLED bits
can get "stuck" in a '1' state when running KVM guests while profiling the host
with PEBS events.  The working theory is that perf throttles PEBS events in
NMI context, and thus clears bits in cpuc->pebs_enabled and PEBS_ENABLED, after
generating the list of PMU MSRs to context switch but before VM-Entry.  And so
when the host's PEBS_ENABLED is loaded on VM-Exit, the CPU ends up with a
stale PEBS_ENABLED that doesn't get reset until something triggers an explicit
reload in perf.
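
For illustration only, the suspected sequence can be modeled in user space.
This is a minimal sketch under the assumptions of the working theory above; it
is not kernel code, and every name in it (struct sim_cpu, the sim_*() helpers)
is hypothetical.  It treats PEBS_ENABLED as a plain integer and the VM-Exit
load as a simple copy of the earlier snapshot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one simulated MSR plus the value queued for VM-Exit. */
struct sim_cpu {
	uint64_t pebs_enable;   /* simulated hardware MSR */
	uint64_t host_snapshot; /* host value captured for the VM-Exit load */
};

/* Step 1: the host value is captured when the MSR switch list is built. */
void sim_snapshot_host(struct sim_cpu *cpu)
{
	cpu->host_snapshot = cpu->pebs_enable;
}

/* Step 2: an NMI throttles the event on counter idx and clears its bit. */
void sim_nmi_throttle(struct sim_cpu *cpu, unsigned int idx)
{
	assert(idx < 64);
	cpu->pebs_enable &= ~(1ULL << idx);
}

/* Step 3: the guest runs without PEBS, then VM-Exit reloads the snapshot. */
void sim_guest_roundtrip(struct sim_cpu *cpu)
{
	cpu->pebs_enable = 0;
	cpu->pebs_enable = cpu->host_snapshot;
}

/* Returns the bits set in the simulated MSR that perf believes are clear. */
uint64_t sim_stuck_bits(unsigned int idx)
{
	struct sim_cpu cpu = { .pebs_enable = 1ULL << idx };
	uint64_t perf_view;

	sim_snapshot_host(&cpu);
	sim_nmi_throttle(&cpu, idx);
	perf_view = cpu.pebs_enable; /* stand-in for cpuc->pebs_enabled */
	sim_guest_roundtrip(&cpu);

	return cpu.pebs_enable & ~perf_view;
}
```

In this model, sim_stuck_bits(idx) returns the throttled bit, i.e. the bit
that is "stuck" at '1' until something rewrites the MSR.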

Testing this against our "PEBS_ENABLED is stuck" reproducer is (still) a work
in progress (largely because the "reproducer" is currently "throw the kernel in
a big test pool"), i.e. I don't know if this actually resolves the problems we
are seeing.  But even if it doesn't fully resolve our woes, it seems like a
no-brainer improvement, and if we're missing something with respect to "stuck"
PEBS_ENABLED, it'd be nice to get feedback/input asap.

Note, if the throttling theory is correct (which is looking unlikely at the
moment), then there are likely more fixes that need to be done, e.g. for CPUs
without isolation, and/or if PERF_GLOBAL_CTRL can be modified from NMI context
too.

Patch 4 is a cleanup that I posted as a standalone patch almost a year ago.
I included it here because it's very related, and because I needed to refresh
it anyways.

v2:
 - "Load" the host value for the guest when an MSR should remain unchanged,
    instead of omitting the MSR from the list entirely, as KVM may need to
    _remove_ the MSR from the list. [Sashiko, Jim]
 - Collect Jim's reviews. [Jim]
 - Call out that the bug being fixed is theoretical at this point.
 - Dropping PEBS_ENABLED from the lists saves three MSR writes, not two, as
   KVM performs an explicit WRMSR prior to VM-Entry to guarantee PEBS is
   quiesced.
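
To illustrate the first v2 item, here is a minimal user-space sketch of why
reporting host == guest differs from omitting the entry.  It assumes the
consumer removes an MSR from its autoload list when the two values match and
adds it otherwise; the types and helpers (struct sim_switch_msr, struct
sim_autoload, sim_sync(), etc.) are hypothetical stand-ins, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_MSR_PEBS_ENABLE 0x3f1u /* IA32_PEBS_ENABLE */
#define SIM_MAX_MSRS        8

/* What perf reports: the value to load for the host and for the guest. */
struct sim_switch_msr {
	uint32_t msr;
	uint64_t host, guest;
};

/* Consumer-side autoload list (hypothetical, heavily simplified). */
struct sim_autoload {
	uint32_t msrs[SIM_MAX_MSRS];
	size_t nr;
};

bool sim_autoloaded(const struct sim_autoload *l, uint32_t msr)
{
	for (size_t i = 0; i < l->nr; i++)
		if (l->msrs[i] == msr)
			return true;
	return false;
}

void sim_add(struct sim_autoload *l, uint32_t msr)
{
	if (sim_autoloaded(l, msr))
		return;
	assert(l->nr < SIM_MAX_MSRS);
	l->msrs[l->nr++] = msr;
}

void sim_remove(struct sim_autoload *l, uint32_t msr)
{
	for (size_t i = 0; i < l->nr; i++) {
		if (l->msrs[i] == msr) {
			l->msrs[i] = l->msrs[--l->nr];
			return;
		}
	}
}

/* host == guest means "leave the MSR alone", so drop it from the list. */
void sim_sync(struct sim_autoload *l, const struct sim_switch_msr *e, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (e[i].host == e[i].guest)
			sim_remove(l, e[i].msr);
		else
			sim_add(l, e[i].msr);
	}
}

/* Returns whether PEBS_ENABLE is still autoloaded after the second sync. */
bool sim_still_autoloaded(bool report_unchanged_entry)
{
	struct sim_autoload l = { .nr = 0 };
	struct sim_switch_msr first = { SIM_MSR_PEBS_ENABLE, 0x1, 0x0 };
	struct sim_switch_msr same = { SIM_MSR_PEBS_ENABLE, 0x1, 0x1 };

	sim_sync(&l, &first, 1);
	sim_sync(&l, &same, report_unchanged_entry ? 1 : 0);
	return sim_autoloaded(&l, SIM_MSR_PEBS_ENABLE);
}
```

In this model, omitting the entry on the second pass leaves the MSR in the
list from the first pass, whereas reporting it with host == guest gives the
consumer the chance to remove it.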

v1: https://lore.kernel.org/all/20260414191425.2697918-1-seanjc@google.com

Sean Christopherson (4):
  perf/x86/intel: Don't write PEBS_ENABLED on host<=>guest xfers if CPU
    has isolation
  perf/x86/intel: Don't context switch DS_AREA (and PEBS config) if PEBS
    is unused
  perf/x86/intel: Make @data a mandatory param for
    intel_guest_get_msrs()
  perf/x86: KVM: Have perf define a dedicated struct for getting guest
    PEBS data

 arch/x86/events/core.c            |  5 ++-
 arch/x86/events/intel/core.c      | 69 +++++++++++++++++++------------
 arch/x86/events/perf_event.h      |  3 +-
 arch/x86/include/asm/kvm_host.h   |  9 ----
 arch/x86/include/asm/perf_event.h | 12 +++++-
 arch/x86/kvm/vmx/pmu_intel.c      | 20 +++++++--
 arch/x86/kvm/vmx/vmx.c            | 11 +++--
 arch/x86/kvm/vmx/vmx.h            |  2 +-
 8 files changed, 82 insertions(+), 49 deletions(-)


base-commit: 6b802031877a995456c528095c41d1948546bf45
-- 
2.54.0.545.g6539524ca2-goog
Re: [PATCH v2 0/4] perf/x86: Don't write PEBS_ENABLED on KVM transitions
Posted by Peter Zijlstra 1 month, 3 weeks ago
On Thu, Apr 23, 2026 at 08:03:36AM -0700, Sean Christopherson wrote:
> Testing this against our "PEBS_ENABLED is stuck" reproducer is (still) a work
> in-progress (largely because the "reproducer" is currently "throw the kernel in
> a big test pool"), i.e. I don't know if this actually resolves the problems we
> are seeing.  But even if it doesn't fully resolve our woes, it seems like a
> no-brainer improvement, and if we're missing something with respect to "stuck"
> PEBS_ENABLED, it'd be nice to get feedback/input asap.
> 
> Note, if the throttling theory is correct (which is looking unlikely at the
> moment), then there are likely more fixes that need to be done, e.g. for CPUs
> without isolation, and/or if PERF_GLOBAL_CTRL can be modified from NMI context
> too.

Throttle does: pmu->stop() := x86_pmu_stop() -> intel_pmu_disable_event()

Which in turn should:

  x86_pmu_disable_event()
    wrmsrq(config_base, config & ~EN);
  x86_pmu_pebs_disable() := intel_pmu_pebs_disable()
    wrmsr(PEBS_ENABLE, pebs_enabled & ~(1<<idx));

So that's just the counter EN bit and PEBS_ENABLED cleared. However, if
this is from PMI, then the PMI handler should also update GLOBAL_CTRL --
provided it wasn't 0.

See intel_pmu_handle_irq():

  if (pmu_enabled)
    __intel_pmu_enable_all()
      wrmsrq(GLOBAL_CTRL, intel_ctrl);
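
As a rough user-space sketch of the three writes described above (counter EN
bit, PEBS_ENABLE bit, and the GLOBAL_CTRL rewrite on PMI exit): the struct and
the sim_*() helpers are hypothetical, "PMU enabled" is approximated as
"GLOBAL_CTRL was non-zero on entry", and the handler is assumed to keep
GLOBAL_CTRL at zero while it runs.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_EVTSEL_EN   (1ULL << 22) /* enable bit in IA32_PERFEVTSELx */
#define SIM_NR_COUNTERS 4

struct sim_pmu {
	uint64_t evtsel[SIM_NR_COUNTERS];
	uint64_t pebs_enable;
	uint64_t global_ctrl;
	uint64_t intel_ctrl; /* value perf wants in GLOBAL_CTRL */
};

/* Throttle path: clear the counter's EN bit and its PEBS_ENABLE bit. */
void sim_stop_event(struct sim_pmu *pmu, unsigned int idx)
{
	assert(idx < SIM_NR_COUNTERS);
	pmu->evtsel[idx] &= ~SIM_EVTSEL_EN;
	pmu->pebs_enable &= ~(1ULL << idx);
}

/* PMI path: quiesce, throttle, then re-enable only if the PMU was enabled. */
void sim_handle_pmi(struct sim_pmu *pmu, unsigned int idx)
{
	int pmu_enabled = pmu->global_ctrl != 0;

	pmu->global_ctrl = 0;
	sim_stop_event(pmu, idx);
	if (pmu_enabled)
		pmu->global_ctrl = pmu->intel_ctrl;
}

/* Runs one throttling PMI on counter 0 and returns the final state. */
struct sim_pmu sim_throttle_demo(uint64_t initial_global_ctrl)
{
	struct sim_pmu pmu = {
		.evtsel = { SIM_EVTSEL_EN },
		.pebs_enable = 0x1,
		.global_ctrl = initial_global_ctrl,
		.intel_ctrl = 0xf,
	};

	sim_handle_pmi(&pmu, 0);
	return pmu;
}
```

The point of the model is that GLOBAL_CTRL is also written from NMI context
whenever it was non-zero on entry, and is left at zero otherwise.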
Re: [PATCH v2 0/4] perf/x86: Don't write PEBS_ENABLED on KVM transitions
Posted by Mi, Dapeng 1 month, 3 weeks ago
On 4/24/2026 12:16 AM, Peter Zijlstra wrote:
> On Thu, Apr 23, 2026 at 08:03:36AM -0700, Sean Christopherson wrote:
>> Testing this against our "PEBS_ENABLED is stuck" reproducer is (still) a work
>> in-progress (largely because the "reproducer" is currently "throw the kernel in
>> a big test pool"), i.e. I don't know if this actually resolves the problems we
>> are seeing.  But even if it doesn't fully resolve our woes, it seems like a
>> no-brainer improvement, and if we're missing something with respect to "stuck"
>> PEBS_ENABLED, it'd be nice to get feedback/input asap.
>>
>> Note, if the throttling theory is correct (which is looking unlikely at the
>> moment), then there are likely more fixes that need to be done, e.g. for CPUs
>> without isolation, and/or if PERF_GLOBAL_CTRL can be modified from NMI context
>> too.
> Throttle does: pmu->stop() := x86_pmu_stop() -> intel_pmu_disable_event()
>
> Which in turn should:
>
>   x86_pmu_disable_event()
>     wrmsrq(config_base, config & ~EN);
>   x86_pmu_pebs_disable() := intel_pmu_pebs_disable()
>     wrmsr(PEBS_ENABLE, pebs_enabled & ~(1<<idx));
>
> So that's just the counter EN bit and PEBS_ENABLED cleared. However, if
> this is from PMI, then the PMI handler should also update GLOBAL_CTRL --
> provided it wasn't 0.
>
> See intel_pmu_handle_irq():
>
>   if (pmu_enabled)
>   	__intel_pmu_enable_all()
> 	  wrmsrq(GLOBAL_CTRL, intel_ctrl);
>
Yes, currently all valid bits in GLOBAL_CTRL would be set by default on
Intel platforms. IIUC, this issue looks more like a race condition between
Perf and KVM.

1. KVM saves the value of host PEBS_ENABLE before VM-entry.

2. A PMI is triggered and interrupts the upcoming VM-entry. PEBS events are
throttled and the PEBS_ENABLE MSR is updated in the PMI handler, so the host
PEBS_ENABLE value saved by KVM becomes stale.

3. VM-entry continues and, when the next VM-exit occurs, the stale
PEBS_ENABLE value is restored.

4. The PEBS_ENABLE MSR keeps the stale value until the next write.

An alternative way to fix this issue seems to be disabling the PMU (clearing
GLOBAL_CTRL) before KVM saves the PMU MSRs?
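
A minimal user-space sketch of that ordering idea, under the simplifying
assumption that a PMI can only fire while GLOBAL_CTRL is non-zero.  The struct
and helpers are hypothetical, and the model ignores in-flight PMIs and how
GLOBAL_CTRL is restored afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sim_cpu {
	uint64_t global_ctrl;
	uint64_t pebs_enable;
	uint64_t host_snapshot; /* host value queued for the VM-Exit load */
};

/* In this model a PMI can only be raised while some counter is enabled. */
void sim_maybe_pmi(struct sim_cpu *cpu)
{
	assert(cpu != NULL);
	if (cpu->global_ctrl)
		cpu->pebs_enable = 0; /* throttling clears the PEBS bits */
}

/* Returns the PEBS bits that end up set behind perf's back. */
uint64_t sim_stale_bits(bool quiesce_first)
{
	struct sim_cpu cpu = { .global_ctrl = 0x1, .pebs_enable = 0x1 };
	uint64_t perf_view;

	if (quiesce_first)
		cpu.global_ctrl = 0; /* disable the PMU before saving */

	cpu.host_snapshot = cpu.pebs_enable;
	sim_maybe_pmi(&cpu); /* window between save and VM-entry */
	perf_view = cpu.pebs_enable;
	cpu.pebs_enable = cpu.host_snapshot; /* VM-exit restore */

	return cpu.pebs_enable & ~perf_view;
}
```

With quiesce_first set, the snapshot and perf's view cannot diverge inside the
window, so the function returns 0; without it, the throttled bit comes back.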

Thanks.


Re: [PATCH v2 0/4] perf/x86: Don't write PEBS_ENABLED on KVM transitions
Posted by Peter Zijlstra 1 month, 3 weeks ago
On Fri, Apr 24, 2026 at 08:17:42PM +0800, Mi, Dapeng wrote:
> 
> On 4/24/2026 12:16 AM, Peter Zijlstra wrote:
> > On Thu, Apr 23, 2026 at 08:03:36AM -0700, Sean Christopherson wrote:
> >> Testing this against our "PEBS_ENABLED is stuck" reproducer is (still) a work
> >> in-progress (largely because the "reproducer" is currently "throw the kernel in
> >> a big test pool"), i.e. I don't know if this actually resolves the problems we
> >> are seeing.  But even if it doesn't fully resolve our woes, it seems like a
> >> no-brainer improvement, and if we're missing something with respect to "stuck"
> >> PEBS_ENABLED, it'd be nice to get feedback/input asap.
> >>
> >> Note, if the throttling theory is correct (which is looking unlikely at the
> >> moment), then there are likely more fixes that need to be done, e.g. for CPUs
> >> without isolation, and/or if PERF_GLOBAL_CTRL can be modified from NMI context
> >> too.
> > Throttle does: pmu->stop() := x86_pmu_stop() -> intel_pmu_disable_event()
> >
> > Which in turn should:
> >
> >   x86_pmu_disable_event()
> >     wrmsrq(config_base, config & ~EN);
> >   x86_pmu_pebs_disable() := intel_pmu_pebs_disable()
> >     wrmsr(PEBS_ENABLE, pebs_enabled & ~(1<<idx));
> >
> > So that's just the counter EN bit and PEBS_ENABLED cleared. However, if
> > this is from PMI, then the PMI handler should also update GLOBAL_CTRL --
> > provided it wasn't 0.
> >
> > See intel_pmu_handle_irq():
> >
> >   if (pmu_enabled)
> >   	__intel_pmu_enable_all()
> > 	  wrmsrq(GLOBAL_CTRL, intel_ctrl);
> >
> Yes, currently all valid bits in GLOBAL_CTRL would be set by default on
> Intel platforms. IIUC, this issue looks more like a race condition between
> Perf and KVM.
> 
> 1. KVM saves the value of host PEBS_ENABLE before VM-entry.
> 
> 2. PMI is triggered and interrupts the upcoming VM-entry. PEBS events are
> throttled and PEBS_ENABLE MSR is updated in the PMI handler, then the KVM
> saved host PEBS_ENABLE value gets stale. 
> 
> 3. VM entry continues and then the next VM-exit occurs, the stale
> PEBS_ENABLE value is restored. 
> 
> 4. The PEBS_ENABLE MSR keeps the stale value until next write.
> 
> Seems an alternative way to fix this issue is to disable the PMU (Clearing
> GLOBAL_CTRL) before KVM saving the PMU MSRs?

Yes, that would seem a prudent thing to do.
Re: [PATCH v2 0/4] perf/x86: Don't write PEBS_ENABLED on KVM transitions
Posted by Jim Mattson 1 month, 3 weeks ago
On Thu, Apr 23, 2026 at 8:03 AM Sean Christopherson <seanjc@google.com> wrote:
> ...
>  - Dropping PEBS_ENABLED from the lists save three MSR writes, not two, as
>    KVM performs an explicit WRMSR prior to VM-Entry to guarantee PEBS is
>    quiesced.

Is the wrmsrq(MSR_IA32_PEBS_ENABLE, 0) in add_atomic_switch_msr()
necessary when the CPU has PEBS isolation?