[PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context

Posted by Sean Christopherson 1 month, 4 weeks ago
Add arch hooks to the mediated vPMU load/put APIs, and use the hooks to
switch PMIs to the dedicated mediated PMU IRQ vector on load, and back to
perf's standard NMI when the guest context is put.  I.e. route PMIs to
PERF_GUEST_MEDIATED_PMI_VECTOR when the guest context is active, and to
NMIs while the host context is active.

While running with guest context loaded, ignore all NMIs (in perf).  Any
NMI that arrives while the LVTPC points at the mediated PMU IRQ vector
can't possibly be due to a host perf event.

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: use arch hook instead of per-PMU callback]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/events/core.c     | 27 +++++++++++++++++++++++++++
 include/linux/perf_event.h |  3 +++
 kernel/events/core.c       |  4 ++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7610f26dfbd9..9b0525b252f1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -55,6 +55,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
 	.pmu = &pmu,
 };
 
+static DEFINE_PER_CPU(bool, x86_guest_ctx_loaded);
+
 DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
 DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
 DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
@@ -1756,6 +1758,16 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	u64 finish_clock;
 	int ret;
 
+	/*
+	 * Ignore all NMIs when a guest's mediated PMU context is loaded.  Any
+	 * such NMI can't be due to a PMI as the CPU's LVTPC is switched to/from
+	 * the dedicated mediated PMI IRQ vector while host events are quiesced.
+	 * Attempting to handle a PMI while the guest's context is loaded will
+	 * generate false positives and clobber guest state.
+	 */
+	if (this_cpu_read(x86_guest_ctx_loaded))
+		return NMI_DONE;
+
 	/*
 	 * All PMUs/events that share this PMI handler should make sure to
 	 * increment active_events for their events.
@@ -2727,6 +2739,21 @@ static struct pmu pmu = {
 	.filter			= x86_pmu_filter,
 };
 
+void arch_perf_load_guest_context(unsigned long data)
+{
+	u32 masked = data & APIC_LVT_MASKED;
+
+	apic_write(APIC_LVTPC,
+		   APIC_DM_FIXED | PERF_GUEST_MEDIATED_PMI_VECTOR | masked);
+	this_cpu_write(x86_guest_ctx_loaded, true);
+}
+
+void arch_perf_put_guest_context(void)
+{
+	this_cpu_write(x86_guest_ctx_loaded, false);
+	apic_write(APIC_LVTPC, APIC_DM_NMI);
+}
+
 void arch_perf_update_userpage(struct perf_event *event,
 			       struct perf_event_mmap_page *userpg, u64 now)
 {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0c529fbd97e6..3a9bd9c4c90e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1846,6 +1846,9 @@ static inline unsigned long perf_arch_guest_misc_flags(struct pt_regs *regs)
 # define perf_arch_guest_misc_flags(regs)	perf_arch_guest_misc_flags(regs)
 #endif
 
+extern void arch_perf_load_guest_context(unsigned long data);
+extern void arch_perf_put_guest_context(void);
+
 static inline bool needs_branch_stack(struct perf_event *event)
 {
 	return event->attr.branch_sample_type != 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e1df3c3bfc0d..ad22b182762e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
 		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
 	}
 
+	arch_perf_load_guest_context(data);
+
 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
 	if (cpuctx->task_ctx)
 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
@@ -6433,6 +6435,8 @@ void perf_put_guest_context(void)
 
 	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
 
+	arch_perf_put_guest_context();
+
 	if (cpuctx->task_ctx)
 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
-- 
2.50.1.565.gc32cd1483b-goog
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Wed, Aug 06, 2025 at 12:56:31PM -0700, Sean Christopherson wrote:

> @@ -2727,6 +2739,21 @@ static struct pmu pmu = {
>  	.filter			= x86_pmu_filter,
>  };
>  
> +void arch_perf_load_guest_context(unsigned long data)
> +{
> +	u32 masked = data & APIC_LVT_MASKED;
> +
> +	apic_write(APIC_LVTPC,
> +		   APIC_DM_FIXED | PERF_GUEST_MEDIATED_PMI_VECTOR | masked);
> +	this_cpu_write(x86_guest_ctx_loaded, true);
> +}

I'm further confused, why would this ever be masked?
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Sean Christopherson 1 month, 2 weeks ago
On Fri, Aug 15, 2025, Peter Zijlstra wrote:
> On Wed, Aug 06, 2025 at 12:56:31PM -0700, Sean Christopherson wrote:
> 
> > @@ -2727,6 +2739,21 @@ static struct pmu pmu = {
> >  	.filter			= x86_pmu_filter,
> >  };
> >  
> > +void arch_perf_load_guest_context(unsigned long data)
> > +{
> > +	u32 masked = data & APIC_LVT_MASKED;
> > +
> > +	apic_write(APIC_LVTPC,
> > +		   APIC_DM_FIXED | PERF_GUEST_MEDIATED_PMI_VECTOR | masked);
> > +	this_cpu_write(x86_guest_ctx_loaded, true);
> > +}
> 
> I'm further confused, why would this ever be masked?

The idea is to match the guest's LVTPC state so that KVM doesn't trigger IRQ
VM-Exits on counter overflow when the guest's LVTPC is masked.
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Wed, Aug 06, 2025 at 12:56:31PM -0700, Sean Christopherson wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e1df3c3bfc0d..ad22b182762e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
>  		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>  	}
>  
> +	arch_perf_load_guest_context(data);

So I still don't understand why this ever needs to reach the generic
code. x86 pmu driver and x86 kvm can surely sort this out inside of x86,
no?
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Sean Christopherson 1 month, 2 weeks ago
On Fri, Aug 15, 2025, Peter Zijlstra wrote:
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index e1df3c3bfc0d..ad22b182762e 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
> >  		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> >  	}
> >  
> > +	arch_perf_load_guest_context(data);
> 
> So I still don't understand why this ever needs to reach the generic
> code. x86 pmu driver and x86 kvm can surely sort this out inside of x86,
> no?

It's definitely possible to handle this entirely within x86, I just don't love
switching the LVTPC without the protection of perf_ctx_lock and perf_ctx_disable().
It's not a sticking point for me if you strongly prefer something like this: 

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 0e5048ae86fa..86b81c217b97 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1319,7 +1319,9 @@ void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu)
 
        lockdep_assert_irqs_disabled();
 
-       perf_load_guest_context(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
+       perf_load_guest_context();
+
+       perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
 
        /*
         * Disable all counters before loading event selectors and PMCs so that
@@ -1380,5 +1382,7 @@ void kvm_mediated_pmu_put(struct kvm_vcpu *vcpu)
 
        kvm_pmu_put_guest_pmcs(vcpu);
 
+       perf_put_guest_lvtpc();
+
        perf_put_guest_context();
 }
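
The x86 side of those helpers would presumably just be the LVTPC bits
split out of the arch hooks in this patch.  A rough sketch, assuming the
x86_guest_ctx_loaded flip moves along with the LVTPC switch:

void perf_load_guest_lvtpc(unsigned long lvtpc)
{
	u32 masked = lvtpc & APIC_LVT_MASKED;

	/* Mirror the guest's mask bit while rerouting PMIs. */
	apic_write(APIC_LVTPC,
		   APIC_DM_FIXED | PERF_GUEST_MEDIATED_PMI_VECTOR | masked);
	this_cpu_write(x86_guest_ctx_loaded, true);
}

void perf_put_guest_lvtpc(void)
{
	/* Resume honoring NMIs before pointing the LVTPC back at them. */
	this_cpu_write(x86_guest_ctx_loaded, false);
	apic_write(APIC_LVTPC, APIC_DM_NMI);
}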
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Sean Christopherson 1 month, 2 weeks ago
On Fri, Aug 15, 2025, Sean Christopherson wrote:
> On Fri, Aug 15, 2025, Peter Zijlstra wrote:
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index e1df3c3bfc0d..ad22b182762e 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
> > >  		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> > >  	}
> > >  
> > > +	arch_perf_load_guest_context(data);
> > 
> > So I still don't understand why this ever needs to reach the generic
> > code. x86 pmu driver and x86 kvm can surely sort this out inside of x86,
> > no?
> 
> It's definitely possible to handle this entirely within x86, I just don't love
> switching the LVTPC without the protection of perf_ctx_lock and perf_ctx_disable().
> It's not a sticking point for me if you strongly prefer something like this: 
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 0e5048ae86fa..86b81c217b97 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -1319,7 +1319,9 @@ void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu)
>  
>         lockdep_assert_irqs_disabled();
>  
> -       perf_load_guest_context(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> +       perf_load_guest_context();
> +
> +       perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));

Hmm, an argument for providing a dedicated perf_load_guest_lvtpc() API is that
it would allow KVM to handle LVTPC writes in KVM's VM-Exit fastpath, i.e. without
having to do a full put+reload of the guest context.

So if we're confident that switching the host LVTPC outside of
perf_{load,put}_guest_context() is functionally safe, I'm a-ok with it.
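
E.g. a purely hypothetical sketch (handle_fastpath_lvtpc_write() and its
wiring don't exist, the names are illustrative):

static fastpath_t handle_fastpath_lvtpc_write(struct kvm_vcpu *vcpu, u32 val)
{
	/* Hypothetical path; validation of 'val' omitted for brevity. */
	kvm_lapic_set_reg(vcpu->arch.apic, APIC_LVTPC, val);

	/* Propagate the guest's new mask bit into the host LVTPC. */
	perf_load_guest_lvtpc(val);

	return EXIT_FASTPATH_REENTER_GUEST;
}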

>         /*
>          * Disable all counters before loading event selectors and PMCs so that
> @@ -1380,5 +1382,7 @@ void kvm_mediated_pmu_put(struct kvm_vcpu *vcpu)
>  
>         kvm_pmu_put_guest_pmcs(vcpu);
>  
> +       perf_put_guest_lvtpc();
> +
>         perf_put_guest_context();
>  }
>
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Fri, Aug 15, 2025 at 08:55:25AM -0700, Sean Christopherson wrote:
> On Fri, Aug 15, 2025, Sean Christopherson wrote:
> > On Fri, Aug 15, 2025, Peter Zijlstra wrote:
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index e1df3c3bfc0d..ad22b182762e 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
> > > >  		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> > > >  	}
> > > >  
> > > > +	arch_perf_load_guest_context(data);
> > > 
> > > So I still don't understand why this ever needs to reach the generic
> > > code. x86 pmu driver and x86 kvm can surely sort this out inside of x86,
> > > no?
> > 
> > It's definitely possible to handle this entirely within x86, I just don't love
> > switching the LVTPC without the protection of perf_ctx_lock and perf_ctx_disable().
> > It's not a sticking point for me if you strongly prefer something like this: 
> > 
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 0e5048ae86fa..86b81c217b97 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -1319,7 +1319,9 @@ void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu)
> >  
> >         lockdep_assert_irqs_disabled();
> >  
> > -       perf_load_guest_context(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> > +       perf_load_guest_context();
> > +
> > +       perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> 
> Hmm, an argument for providing a dedicated perf_load_guest_lvtpc() API is that
> it would allow KVM to handle LVTPC writes in KVM's VM-Exit fastpath, i.e. without
> having to do a full put+reload of the guest context.
> 
> So if we're confident that switching the host LVTPC outside of
> perf_{load,put}_guest_context() is functionally safe, I'm a-ok with it.

Let me see. So the hardware sets Masked when it raises the interrupt.

The interrupt handler clears it from software -- depending on uarch in 3
different places:
 1) right at the start of the PMI
 2) in the middle, right before enabling the PMU (writing global control)
 3) at the end of the PMI

the various changelogs adding that code mention spurious PMIs and
malformed PEBS records.
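
For reference, heavily simplified, the ack placement in
intel_pmu_handle_irq() looks like:

	if (!late_ack && !mid_ack)
		apic_write(APIC_LVTPC, APIC_DM_NMI);	/* 1) at the start */

	__intel_pmu_disable_all(true);

	/* ... drain BTS/PEBS buffers, handle overflowed counters ... */

	if (mid_ack)
		apic_write(APIC_LVTPC, APIC_DM_NMI);	/* 2) before re-enable */

	__intel_pmu_enable_all(0, true);

	if (late_ack)
		apic_write(APIC_LVTPC, APIC_DM_NMI);	/* 3) at the end */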

So the fun all happens when the guest is doing PMI and gets a VM-exit
while still Masked.

At that point, we can come in and completely rewrite the PMU state,
reroute the PMI and enable things again. Then later, we 'restore' the
PMU state, re-set LVTPC masked to the guest interrupt and 'resume'.

What could possibly go wrong :/ Kan, I'm assuming, but not knowing, that
writing all the PMU MSRs is somehow serializing state sufficient to not
cause the above mentioned fails? Specifically, clearing PEBS_ENABLE
should inhibit those malformed PEBS records or something? What if the
host also has PEBS and we don't actually clear the bit?

The current order ensures we rewrite LVTPC when global control is unset;
I think we want to keep that.

While staring at this, I note that perf_load_guest_context() will clear
global ctrl, clear all the counter programming, and re-enable an empty
pmu. Now, an empty PMU should result in global control being zero --
there is nothing run after all.

But then kvm_mediated_pmu_load() writes an explicit 0 again. Perhaps
replace this with asserting it is 0 instead?

Anyway, this means that moving the LVTPC writing into
kvm_mediated_pmu_load() as you suggest is identical.
perf_load_guest_context() results in global control being 0, we then
assert it is 0, and write LVTPC while it is still 0.
kvm_pmu_load_guest_pmcs() will then frob the MSRs.

OK, so *IF* doing the VM-exit during PMI is sound, this is something
that needs a comment somewhere. Then going back again is the easy part,
since on the host side, we can never transition into KVM during a PMI.
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Sean Christopherson 1 month, 2 weeks ago
On Mon, Aug 18, 2025, Peter Zijlstra wrote:
> On Fri, Aug 15, 2025 at 08:55:25AM -0700, Sean Christopherson wrote:
> > On Fri, Aug 15, 2025, Sean Christopherson wrote:
> > > On Fri, Aug 15, 2025, Peter Zijlstra wrote:
> > > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > > index e1df3c3bfc0d..ad22b182762e 100644
> > > > > --- a/kernel/events/core.c
> > > > > +++ b/kernel/events/core.c
> > > > > @@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
> > > > >  		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> > > > >  	}
> > > > >  
> > > > > +	arch_perf_load_guest_context(data);
> > > > 
> > > > So I still don't understand why this ever needs to reach the generic
> > > > code. x86 pmu driver and x86 kvm can surely sort this out inside of x86,
> > > > no?
> > > 
> > > It's definitely possible to handle this entirely within x86, I just don't love
> > > switching the LVTPC without the protection of perf_ctx_lock and perf_ctx_disable().
> > > It's not a sticking point for me if you strongly prefer something like this: 
> > > 
> > > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > > index 0e5048ae86fa..86b81c217b97 100644
> > > --- a/arch/x86/kvm/pmu.c
> > > +++ b/arch/x86/kvm/pmu.c
> > > @@ -1319,7 +1319,9 @@ void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu)
> > >  
> > >         lockdep_assert_irqs_disabled();
> > >  
> > > -       perf_load_guest_context(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> > > +       perf_load_guest_context();
> > > +
> > > +       perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> > 
> > Hmm, an argument for providing a dedicated perf_load_guest_lvtpc() API is that
> > it would allow KVM to handle LVTPC writes in KVM's VM-Exit fastpath, i.e. without
> > having to do a full put+reload of the guest context.
> > 
> > So if we're confident that switching the host LVTPC outside of
> > perf_{load,put}_guest_context() is functionally safe, I'm a-ok with it.
> 
> Let me see. So the hardware sets Masked when it raises the interrupt.
> 
> The interrupt handler clears it from software -- depending on uarch in 3
> different places:
>  1) right at the start of the PMI
>  2) in the middle, right before enabling the PMU (writing global control)
>  3) at the end of the PMI
> 
> the various changelogs adding that code mention spurious PMIs and
> malformed PEBS records.
> 
> So the fun all happens when the guest is doing PMI and gets a VM-exit
> while still Masked.
> 
> At that point, we can come in and completely rewrite the PMU state,
> reroute the PMI and enable things again. Then later, we 'restore' the
> PMU state, re-set LVTPC masked to the guest interrupt and 'resume'.
> 
> What could possibly go wrong :/ Kan, I'm assuming, but not knowing, that
> writing all the PMU MSRs is somehow serializing state sufficient to not
> cause the above mentioned fails? Specifically, clearing PEBS_ENABLE
> should inhibit those malformed PEBS records or something? What if the
> host also has PEBS and we don't actually clear the bit?
> 
> The current order ensures we rewrite LVTPC when global control is unset;
> I think we want to keep that.

Yes, for sure.

> While staring at this, I note that perf_load_guest_context() will clear
> global ctrl, clear all the counter programming, and re-enable an empty
> pmu. Now, an empty PMU should result in global control being zero --
> there is nothing run after all.
> 
> But then kvm_mediated_pmu_load() writes an explicit 0 again. Perhaps
> replace this with asserting it is 0 instead?

Yeah, I like that idea, a lot.  This?

	perf_load_guest_context();

	/*
	 * Sanity check that "loading" guest context disabled all counters, as
	 * modifying the LVTPC while host perf is active will cause explosions,
	 * as will loading event selectors and PMCs with guest values.
	 *
	 * VMX will enable/disable counters at VM-Enter/VM-Exit by atomically
	 * loading PERF_GLOBAL_CONTROL.  SVM effectively performs the switch by
	 * configuring all events to be GUEST_ONLY.
	 */
	WARN_ON_ONCE(rdmsrq(kvm_pmu_ops.PERF_GLOBAL_CTRL));

	perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));

> Anyway, this means that moving the LVTPC writing into
> kvm_mediated_pmu_load() as you suggest is identical.
> perf_load_guest_context() results in global control being 0, we then
> assert it is 0, and write LVTPC while it is still 0.
> kvm_pmu_load_guest_pmcs() will then frob the MSRs.
> 
> OK, so *IF* doing the VM-exit during PMI is sound, this is something
> that needs a comment somewhere. 

I'm a bit lost here.  Are you essentially asking if it's ok to take a VM-Exit
while the guest is handling a PMI?  If so, that _has_ to work, because there are
myriad things that can/will trigger a VM-Exit at any point while the guest is
active.

> Then going back again is the easy part, since on the host side, we can never
> transition into KVM during a PMI.
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Mon, Aug 18, 2025 at 08:25:34AM -0700, Sean Christopherson wrote:

> > OK, so *IF* doing the VM-exit during PMI is sound, this is something
> > that needs a comment somewhere. 
> 
> I'm a bit lost here.  Are you essentially asking if it's ok to take a VM-Exit
> while the guest is handling a PMI?  If so, that _has_ to work, because there are
> myriad things that can/will trigger a VM-Exit at any point while the guest is
> active.

Yes, that's what I'm asking. Why is this VM-exit during PMI nonsense not
subject to the same failures that mandate the mid/late PMI ACK.

And yes, I realize this needs to work. But so far I'm not sure I
understand why that is a safe thing to do.

Like I wrote, I suspect writing all the PMU MSRs serializes things
sufficiently, but if that is the case, that needs to be explicitly
mentioned. Because that also doesn't explain why we need mid-ack
instead of late-ack on ADL e-cores for instance.

Could it perhaps be that we don't let the guests do PEBS because DS
doesn't virtualize? And thus we don't have the malformed PEBS record?
Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
Posted by Liang, Kan 1 month, 2 weeks ago

On 2025-08-18 9:12 a.m., Peter Zijlstra wrote:
> On Mon, Aug 18, 2025 at 08:25:34AM -0700, Sean Christopherson wrote:
> 
>>> OK, so *IF* doing the VM-exit during PMI is sound, this is something
>>> that needs a comment somewhere. 
>>
>> I'm a bit lost here.  Are you essentially asking if it's ok to take a VM-Exit
>> while the guest is handling a PMI?  If so, that _has_ to work, because there are
>> myriad things that can/will trigger a VM-Exit at any point while the guest is
>> active.
> 
> Yes, that's what I'm asking. Why is this VM-exit during PMI nonsense not
> subject to the same failures that mandate the mid/late PMI ACK.
> 
> And yes, I realize this needs to work. But so far I'm not sure I
> understand why that is a safe thing to do.
> 
> Like I wrote, I suspect writing all the PMU MSRs serializes things
> sufficiently, but if that is the case, that needs to be explicitly
> mentioned. Because that also doesn't explain why we need mid-ack
> instead of late-ack on ADL e-cores for instance.

The mid-ack and late-ack are only required in some corner cases, e.g.,
when two PMIs are triggered simultaneously with PEBS, because the ucode
of the p-core and the e-core handles pending PEBS records and PMIs
differently.
For the p-core, the ACK should be as close to the EOM as possible.
Otherwise, the pending PMI will trigger a spurious PMI warning.
For the e-core, the ucode handles the pending PMI well; there is no
spurious PMI. However, it impacts the update of PEBS_DATA_CFG, which is
global. If the ACK cannot be done before re-enabling the counters, the
stale PEBS_DATA_CFG will somehow be written into the next PEBS record
of the pending PMI, which triggers a malformed PEBS record.

For the upcoming arch PEBS, the data cfg is per-counter.
The mid-ack workaround should not be required.
> 
> Could it perhaps be that we don't let the guests do PEBS because DS
> doesn't virtualize? And thus we don't have the malformed PEBS record?
>

Yes, I don't think it can impact the mediated PMU. Legacy PEBS is not
supported for the vPMU.
Since the configuration is per-counter with arch PEBS, the malformed
PEBS record should not be triggered either.


Thanks,
Kan