arch/x86/kvm/svm/avic.c | 23 ++ .../testing/selftests/kvm/include/x86/apic.h | 5 + .../selftests/kvm/x86/xapic_state_test.c | 265 +++++++++++++++++- 3 files changed, 290 insertions(+), 3 deletions(-)
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would *not* get copied into the V_TPR field of VMCB. AVIC does sync between these two fields, however it does so only on explicit guest writes to one of these fields, not on a bare VMRUN. This is especially true when it is the userspace setting LAPIC state via KVM_SET_LAPIC ioctl() since userspace does not have access to the guest VMCB. Practice shows that it is the V_TPR that is actually used by the AVIC to decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), so any leftover value in V_TPR will cause serious interrupt delivery issues in the guest when AVIC is enabled. Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and similar code paths when AVIC is enabled. Add also a relevant set of tests to xapic_state_test so hopefully we'll be protected against getting such regressions in the future. Yes, this breaks real guests when AVIC is enabled. Specifically, the one OS that sometimes needs different handling and its name begins with letter 'W'. KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs KVM: selftests: Test TPR / CR8 sync and interrupt masking arch/x86/kvm/svm/avic.c | 23 ++ .../testing/selftests/kvm/include/x86/apic.h | 5 + .../selftests/kvm/x86/xapic_state_test.c | 265 +++++++++++++++++- 3 files changed, 290 insertions(+), 3 deletions(-)
On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in > sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would > *not* get copied into the V_TPR field of VMCB. > > AVIC does sync between these two fields, however it does so only on > explicit guest writes to one of these fields, not on a bare VMRUN. > > This is especially true when it is the userspace setting LAPIC state via > KVM_SET_LAPIC ioctl() since userspace does not have access to the guest > VMCB. Dumb question: why is the VMM updating TPR? Is this related to live migration or such? I think I do see the problem described here, but when AVIC is temporarily inhibited. So, trying to understand if there are other flows involving the VMM where TPR could be updated outside of the guest. > > Practice shows that it is the V_TPR that is actually used by the AVIC to > decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), > so any leftover value in V_TPR will cause serious interrupt delivery issues > in the guest when AVIC is enabled. > > Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in > avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and > similar code paths when AVIC is enabled. > > Add also a relevant set of tests to xapic_state_test so hopefully > we'll be protected against getting such regressions in the future. Do the new tests reproduce this issue? > > > Yes, this breaks real guests when AVIC is enabled. > Specifically, the one OS that sometimes needs different handling and its > name begins with letter 'W'. Indeed, Linux does not use TPR AFAIK. - Naveen
On 21.08.2025 10:18, Naveen N Rao wrote: > On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote: >> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >> >> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in >> sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would >> *not* get copied into the V_TPR field of VMCB. >> >> AVIC does sync between these two fields, however it does so only on >> explicit guest writes to one of these fields, not on a bare VMRUN. >> >> This is especially true when it is the userspace setting LAPIC state via >> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest >> VMCB. > > Dumb question: why is the VMM updating TPR? Is this related to live > migration or such? In this case, VMM is resetting LAPIC state on machine reset. > I think I do see the problem described here, but when AVIC is > temporarily inhibited. So, trying to understand if there are other flows > involving the VMM where TPR could be updated outside of the guest. > >> >> Practice shows that it is the V_TPR that is actually used by the AVIC to >> decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), >> so any leftover value in V_TPR will cause serious interrupt delivery issues >> in the guest when AVIC is enabled. >> >> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in >> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and >> similar code paths when AVIC is enabled. >> >> Add also a relevant set of tests to xapic_state_test so hopefully >> we'll be protected against getting such regressions in the future. > > Do the new tests reproduce this issue? Yes, and also check quite a bit more of TPR-related functionality. >> >> >> Yes, this breaks real guests when AVIC is enabled. >> Specifically, the one OS that sometimes needs different handling and its >> name begins with letter 'W'. > > Indeed, Linux does not use TPR AFAIK. > > > - Naveen > Thanks, Maciej
On 8/21/25 7:42 AM, Maciej S. Szmigiero wrote: > On 21.08.2025 10:18, Naveen N Rao wrote: >> On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote: >>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >>> >>> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR >>> sync in >>> sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC >>> state would >>> *not* get copied into the V_TPR field of VMCB. >>> >>> AVIC does sync between these two fields, however it does so only on >>> explicit guest writes to one of these fields, not on a bare VMRUN. >>> >>> This is especially true when it is the userspace setting LAPIC state via >>> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest >>> VMCB. >> >> Dumb question: why is the VMM updating TPR? Is this related to live >> migration or such? > > In this case, VMM is resetting LAPIC state on machine reset. > >> I think I do see the problem described here, but when AVIC is >> temporarily inhibited. So, trying to understand if there are other flows >> involving the VMM where TPR could be updated outside of the guest. >> >>> >>> Practice shows that it is the V_TPR that is actually used by the AVIC to >>> decide whether to issue pending interrupts to the CPU (not TPR in >>> TASKPRI), >>> so any leftover value in V_TPR will cause serious interrupt delivery >>> issues >>> in the guest when AVIC is enabled. >>> >>> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in >>> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC >>> and >>> similar code paths when AVIC is enabled. >>> >>> Add also a relevant set of tests to xapic_state_test so hopefully >>> we'll be protected against getting such regressions in the future. >> >> Do the new tests reproduce this issue? > > Yes, and also check quite a bit more of TPR-related functionality. > >>> >>> >>> Yes, this breaks real guests when AVIC is enabled. >>> Specifically, the one OS that sometimes needs different handling and its >>> name begins with letter 'W'. >> >> Indeed, Linux does not use TPR AFAIK. >> >> I believe it does, during the local APIC initialization. When Maciej determined the root cause of this issue, I was wondering why we have not seen it earlier in Linux. I found that Linux takes a defensive approach and drains all pending interrupts during lapic initialization. Essentially, for each CPU, Linux will: - temporarily disable the Local APIC (via Spurious Int Vector Reg) - set the TPR to accept all "regular" interrupts i.e. tpr=0x10 - drain all pending interrupts in ISR and/or IRR - attempt the above draining step a max of 512 times - then re-enable APIC and continue initialization The relevant code is in setup_local_APIC() https://elixir.bootlin.com/linux/v6.16/source/arch/x86/kernel/apic/apic.c#L1533-L1545 So without Maciej's proposed change, other OSs that are not as resilient could also be affected by this issue. Alejandro >> - Naveen >> > > Thanks, > Maciej > >
On Thu, Aug 21, 2025, Alejandro Jimenez wrote: > On 8/21/25 7:42 AM, Maciej S. Szmigiero wrote: > > On 21.08.2025 10:18, Naveen N Rao wrote: > > > > Yes, this breaks real guests when AVIC is enabled. > > > > Specifically, the one OS that sometimes needs different handling and its > > > > name begins with letter 'W'. > > > > > > Indeed, Linux does not use TPR AFAIK. > > I believe it does, Heh, yes, Linux technically "uses" the TPR in that it does a one-time write to it. But what Naveen really meant is that Linux doesn't actively use TPR to manage what IRQs are masked/allowed, whereas Windows heavily uses TPR to do exactly that. Specifically, what matters is that Linux doesn't use TPR to _mask_ IRQs, and so clobbering it to '0' on migration is largely benign. > during the local APIC initialization. When Maciej > determined the root cause of this issue, I was wondering why we have not > seen it earlier in Linux. I found that Linux takes a defensive approach and > drains all pending interrupts during lapic initialization. Essentially, for > each CPU, Linux will: > - temporarily disable the Local APIC (via Spurious Int Vector Reg) > - set the TPR to accept all "regular" interrupts i.e. tpr=0x10 > - drain all pending interrupts in ISR and/or IRR > - attempt the above draining step a max of 512 times > - then re-enable APIC and continue initialization > > The relevant code is in setup_local_APIC() > https://elixir.bootlin.com/linux/v6.16/source/arch/x86/kernel/apic/apic.c#L1533-L1545 > > So without Maciej's proposed change, other OSs that are not as resilient > could also be affected by this issue. > > Alejandro > > > > - Naveen > > > > > > > Thanks, > > Maciej > > > > >
© 2016 - 2025 Red Hat, Inc.