[PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest

Binbin Wu posted 16 patches 1 year ago
There is a newer version of this series
[PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Binbin Wu 1 year ago
From: Isaku Yamahata <isaku.yamahata@intel.com>

Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
from host VMM.

Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
  tdx_td_vcpu_init(). (Rick)
- Change APICV -> APICv in changelog for consistency.
- Split the changelog to 2 paragraphs.
---
 arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
 arch/x86/kvm/vmx/main.c         |  3 ++-
 arch/x86/kvm/vmx/tdx.c          |  3 +++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 32c7d58a5d68..df535f08e004 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
 	 */
 	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
 
+	/*********************************************************/
+	/* INHIBITs that are relevant only to the Intel's APICv. */
+	/*********************************************************/
+
+	/*
+	 * APICv is disabled because TDX doesn't support it.
+	 */
+	APICV_INHIBIT_REASON_TDX,
+
 	NR_APICV_INHIBIT_REASONS,
 };
 
@@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
 	__APICV_INHIBIT_REASON(IRQWIN),			\
 	__APICV_INHIBIT_REASON(PIT_REINJ),		\
 	__APICV_INHIBIT_REASON(SEV),			\
-	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
+	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),	\
+	__APICV_INHIBIT_REASON(TDX)
 
 struct kvm_arch {
 	unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7f933f821188..13a0ab0a520c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	 BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |			\
 	 BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |	\
 	 BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |		\
-	 BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
+	 BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |		\
+	 BIT(APICV_INHIBIT_REASON_TDX))
 
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b0f525069ebd..b51d2416acfb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 		goto teardown;
 	}
 
+	kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
+
 	return 0;
 
 	/*
@@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 		return -EIO;
 	}
 
+	vcpu->arch.apic->apicv_active = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
 	return 0;
-- 
2.46.0
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Binbin Wu 11 months, 1 week ago
On 12/9/2024 9:07 AM, Binbin Wu wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> from host VMM.
>
> Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
For TDX guests, APICv is always enabled by TDX module. But in current TDX basic support patch series, TDX code inhibits APICv for TDX guests from the view of KVM. Synced with Isaku, the reason was to prevent the APICv active state from toggling during runtime. Sean raised the concern in a PUCK session that it is not concept right to "lie" to KVM that APICv is disabled while it is actually enabled. Instead, it's better to make APICv enabled and prevent it from being disabled from the view of KVM. Following is the analysis about the APICv active state for TDX to kick off further discussions. APICv active state ================== From the view of KVM, whether APICv state is active or not is decided by: 1. APIC is hw enabled 2. VM and vCPU have no inhibit reasons set. APIC hw enabled --------------- After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}. - KVM_SET_LAPIC Currently, KVM allows userspace to 
request KVM_SET_LAPIC to set the state of LAPIC for TDX guests. There are two options: - Force x2APIC mode and default base address when userspace request KVM_SET_LAPIC. - Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected is true), since migration is not supported yet. Choose option 2 for simplicity for now. - KVM_SET_{SREGS, SREGS2} KVM rejects userspace to set APIC base when vcpu->kvm->arch.has_protected_state and vcpu->arch.guest_state_protected are both set. Currently for TDX, kvm->arch.has_protected_state is not set, so userspace is allowed to modify APIC base. There are three options: - Reject KVM_SET_{SREGS, SREGS2} when either vcpu->arch.guest_state_protected or vcpu->kvm->arch.has_protected_state is set. - Check vcpu->arch.guest_state_protected before kvm_apic_set_base() in __set_sregs_common(). - Set has_protected_state for TDX guests. Choose option 3, i.e. to set has_protected_state for TDX guests, aligning with SEV/SNP. APICv inhibit reasons 
--------------------- APICv could be disabled due to a few inhibit reasons. - APICV_INHIBIT_REASON_DISABLED For TDX, this could be triggered when the module parameter enable_apicv is set to false. enable_apicv could be checked in tdx_bringup(). Disable TDX support if !enable_apicv. So that APICV_INHIBIT_REASON_DISABLED will not be set during runtime and apic->apicv_active is initialized to true. - APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED KVM will reject userspace to modify APIC base, i.e., APIC mode will always be x2APIC mode, the only reason this could be set is it fails to allocate memory for KVM apic map. - APICV_INHIBIT_REASON_PIT_REINJ Based on current code, this is relevant only to AMD's AVIC, so this reason will not be set for TDX guests. However, KVM is also not be able to intercept EOI for TDX guests. For TDX, if in-kernel PIT is enabled and in re-inject mode, the use of PIT in guest may have problem. Fortunately, modern OSes don't use PIT. Options: - Enforce irqchip 
split for TDX guests, i.e. in-kernel PIT is not supported. - Leave it as it is and expect PIT will not be used. - Reasons will not be set for TDX - APICV_INHIBIT_REASON_HYPERV TDX doesn't support HyperV guest yet. - APICV_INHIBIT_REASON_ABSENT In-kernel LAPIC is checked in tdx_vcpu_create(). - APICV_INHIBIT_REASON_BLOCKIRQ TDX doesn't support KVM_SET_GUEST_DEBUG. - APICV_INHIBIT_REASON_APIC_ID_MODIFIED KVM will reject userspace to modify APIC base, i.e., APIC mode will always be x2APIC mode. - APICV_INHIBIT_REASON_APIC_BASE_MODIFIED KVM will reject userspace to set APIC base. - Reasons relevant only to AMD's AVIC - APICV_INHIBIT_REASON_NESTED, - APICV_INHIBIT_REASON_IRQWIN, - APICV_INHIBIT_REASON_SEV, - APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED. Summary about APICv inhibit reasons: APICv could still be disabled runtime in some corner case, e.g, APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure. After checking enable_apicv in tdx_bringup(), 
apic->apicv_active is initialized as true in kvm_create_lapic(). If APICv is inhibited due to any reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check if APICv is disabled for TDX, if APICv is disabled, bug the VM. Changes of APICv active from false to true ========================================== Lazy check for pending APIC EOI when In-kernel IOAPIC ----------------------------------------------------- In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor accelerates write to APIC EOI register and does not trap if the interrupt is edge-triggered. So there is a workaround by lazy check for pending APIC EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no pending APIC EOI. KVM is also not be able to intercept EOI for TDX guests. - When APICv is enabled The code of lazy check for pending APIC EOI doesn't work for TDX because KVM can't get the status of real IRR and ISR, and the values are 0s in vIRR and vISR 
in apic->regs[], kvm_apic_pending_eoi() will always return false. So the RTC pending EOI will always be cleared when ioapic_set_irq() is called for RTC. Then userspace may miss the coalesced RTC interrupts. - When When APICv is disabled ioapic_lazy_update_eoi() will not be called,then pending EOI status for RTC will not be cleared after setting and this will mislead userspace to see coalesced RTC interrupts. Options: - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC. - Leave it as it is, but the use of RTC may not be accurate. kvm_can_post_timer_interrupt() ------------------------------ Whether housekeeping CPU can deliver timer interrupt to target vCPU via posted interrupt when nohz_full option set. - When APICv active is false, it always return false. - When APICv active is true, it also depends on whether mwait or hlt in guest is set. For TDX guests, hlt will trigger #VE unconditionally and TDX guests request HLT via TDVMCALL. Whether mwait is 
allowed depends on the cpuid configuration in TD_PARAMS. So current implementation of kvm_mwait_in_guest() and kvm_hlt_in_guest() doesn't reflect the real status for TDX guests. However, Sean mentioned "consulting kvm_can_post_timer_interrupt() in the expiration path is silly". There could be cleanups for this part. https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/ So, don't do any TDX-specific logic for it. apic_timer_expired() -------------------- About kvm_can_post_timer_interrupt() in the expiration path, see the description above. For the rest part, when the function is not called from timer function - If apicv_active, the timer interrupt will be injected via kvm_apic_inject_pending_timer_irqs(). - If !apicv_active, the timer interrupt will be handled via lapic_timer.pending approach, and finally, the timer interrupt is also be injected via kvm_apic_inject_pending_timer_irqs(). Basically, they are functionally equivalent with subtle differences. E.g., if an 
hrtimer fires while KVM is handling a write to TMICT, KVM will deliver the interrupt if configured to post timer, but not if APICv is disabled, because the latter will increment "pending", and "pending" will be cleared before handling the new TMICT. Ditto for switch APIC timer modes. Sean mentioned the entire lapic_timer.pending approach may need to be ditched, and the timer interrupt could be directly delivered no matter apicv is active or not. https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/ This is not TDX specific, leave it for now. Options: - Fix kvm_mwait_in_guest()/kvm_hlt_in_guest() for TDX guests. - VMX preemption timer can't be used by TDX guests anyway, leave kvm_mwait_in_guest()/kvm_hlt_in_guest() as them are, posted timer interrupt could be used when userspace requested to disable exit for mwait/hlt. - VMX preemption timer can't be used by TDX guests anyway, skip checking kvm_mwait_in_guest()/kvm_hlt_in_guest(). kvm_arch_dy_has_pending_interrupt() 
----------------------------------- Before enabling off-TD debug, there is no functional change because there is no PAUSE Exit for TDX guests. After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true to get the pending interrupt from PID. Set APICv to active for TDX is the right thing to do. update_cr8_intercept() ---------------------- Functionally unchanged because the callback update_cr8_intercept() for TDX is ignored. Set APICv to active for TDX can return earlier to skip unnecessary code. kvm_lapic_reset() kvm_apic_set_state() -------------------- The callbacks apicv_post_state_restore(), hwapic_irr_update(), and hwapic_isr_update() will be called for TDX guests when apicv is active, these callbacks have been ignored by TDX code already, no functional changes. Issues ====== PIC interrupts -------------- KVM inject PIC interrupt via event injection path. Currently, TDX code doesn't handle this, thus PIC interrupts will be lost. Fortunately, modern OSes 
don't use PIC. We could use posted-interrupt in to deliver PIC interrupt if needed. Or can we assume PIC will not be used by TDX guests? In-kernel PIT in re-inject mode ------------------------------- See the description for "APICV_INHIBIT_REASON_PIT_REINJ" above. Lazy check for pending APIC EOI of In-kernel IOAPIC --------------------------------------------------- See the description for the same item in "Changes of APICv active from false to true". Open: For the issues related to in-kernel PIT and in-kernel IOAPIC, should KVM force irqchip split for TDX guests to eliminate the use of in-kernel PIT and in-kernel IOAPIC? Proposed code change ==================== Below is the proposed code change to change APICv active from false to true for TDX guests. Force irqchip split for TEX guests is not included. Note, by rejecting KVM_GET_LAPIC/KVM_SET_LAPIC for TDX guests (i.e., when guest_apic_protected), it returns an error code instead of returning 0. It requires modifications in 
QEMU TDX support code to avoid requesting KVM_GET_LAPIC/KVM_SET_LAPIC. 8<---------------------------------------------------------------------------- diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 0787855ab006..97025a240d54 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1289,15 +1289,6 @@ enum kvm_apicv_inhibit { */ APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED, - /*********************************************************/ - /* INHIBITs that are relevant only to the Intel's APICv. */ - /*********************************************************/ - - /* - * APICv is disabled because TDX doesn't support it. - */ - APICV_INHIBIT_REASON_TDX, - NR_APICV_INHIBIT_REASONS, }; @@ -1316,8 +1307,7 @@ enum kvm_apicv_inhibit { __APICV_INHIBIT_REASON(IRQWIN), \ __APICV_INHIBIT_REASON(PIT_REINJ), \ __APICV_INHIBIT_REASON(SEV), \ - __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED), \ - __APICV_INHIBIT_REASON(TDX) + 
__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED) struct kvm_arch { unsigned long n_used_mmu_pages; diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c index 9b79b4bb063f..df9cc4a7f2d8 100644 --- a/arch/x86/kvm/vmx/main.c +++ b/arch/x86/kvm/vmx/main.c @@ -782,8 +782,10 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu) static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) { - if (WARN_ON_ONCE(is_td_vcpu(vcpu))) + if (is_td_vcpu(vcpu)) { + KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm); return; + } vmx_refresh_apicv_exec_ctrl(vcpu); } @@ -908,8 +910,7 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \ BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \ BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \ - BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \ - BIT(APICV_INHIBIT_REASON_TDX)) + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED)) struct kvm_x86_ops vt_x86_ops __initdata = { .name = 
KBUILD_MODNAME, diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 67fc391fe798..cc516ab2d990 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -614,6 +614,7 @@ int tdx_vm_init(struct kvm *kvm) struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); kvm->arch.has_private_mem = true; + kvm->arch.has_protected_state = true; /* * Because guest TD is protected, VMM can't parse the instruction in TD. @@ -2354,8 +2355,6 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params, goto teardown; } - kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX); - return 0; /* @@ -2741,7 +2740,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx) return -EIO; } - vcpu->arch.apic->apicv_active = false; vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; return 0; @@ -3273,6 +3271,11 @@ int __init tdx_bringup(void) goto success_disable_tdx; } + if (!enable_apicv) { + pr_err("APICv is required for TDX\n"); + goto success_disable_tdx; + } + if 
(!tdp_mmu_enabled || !enable_mmio_caching) { pr_err("TDP MMU and MMIO caching is required for TDX\n"); goto success_disable_tdx; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e433c8ee63a5..837a287d8c47 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5108,6 +5108,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) { + if (vcpu->arch.apic->guest_apic_protected) + return -EINVAL; + kvm_x86_call(sync_pir_to_irr)(vcpu); return kvm_apic_get_state(vcpu, s); @@ -5118,6 +5121,9 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, { int r; + if (vcpu->arch.apic->guest_apic_protected) + return -EINVAL; + r = kvm_apic_set_state(vcpu, s); if (r) return r;
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Binbin Wu 11 months, 1 week ago


On 1/13/2025 10:03 AM, Binbin Wu wrote:
>
> On 12/9/2024 9:07 AM, Binbin Wu wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
>> from host VMM.
>>
>> Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
>> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>
Resend due to the format mess.

For TDX guests, APICv is always enabled by TDX module. But in current TDX
basic support patch series, TDX code inhibits APICv for TDX guests from
the view of KVM. Synced with Isaku, the reason was to prevent the APICv
active state from toggling during runtime.

Sean raised the concern in a PUCK session that it is not concept right to
"lie" to KVM that APICv is disabled while it is actually enabled. Instead,
it's better to make APICv enabled and prevent it from being disabled from
the view of KVM.

Following is the analysis about the APICv active state for TDX to kick off
further discussions.

APICv active state
==================
 From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.

APIC hw enabled
---------------

After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could
disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}.

- KVM_SET_LAPIC
   Currently, KVM allows userspace to request KVM_SET_LAPIC to set the state
   of LAPIC for TDX guests.
   There are two options:
   - Force x2APIC mode and default base address when userspace request
     KVM_SET_LAPIC.
   - Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected
     is true), since migration is not supported yet.
   Choose option 2 for simplicity for now.

- KVM_SET_{SREGS, SREGS2}
   KVM rejects userspace to set APIC base when
   vcpu->kvm->arch.has_protected_state and vcpu->arch.guest_state_protected
   are both set.
   Currently for TDX, kvm->arch.has_protected_state is not set, so userspace
   is allowed to modify APIC base.
   There are three options:
   - Reject KVM_SET_{SREGS, SREGS2} when either vcpu->arch.guest_state_protected
     or vcpu->kvm->arch.has_protected_state is set.
   - Check vcpu->arch.guest_state_protected before kvm_apic_set_base() in
     __set_sregs_common().
   - Set has_protected_state for TDX guests.
   Choose option 3, i.e. to set has_protected_state for TDX guests, aligning
   with SEV/SNP.

APICv inhibit reasons
---------------------

APICv could be disabled due to a few inhibit reasons.

- APICV_INHIBIT_REASON_DISABLED
   For TDX, this could be triggered when the module parameter enable_apicv is
   set to false.
   enable_apicv could be checked in tdx_bringup(). Disable TDX support if
   !enable_apicv. So that APICV_INHIBIT_REASON_DISABLED will not be set
   during runtime and apic->apicv_active is initialized to true.

- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED
   KVM will reject userspace to modify APIC base, i.e., APIC mode will always
   be x2APIC mode, the only reason this could be set is it fails to allocate
   memory for KVM apic map.

- APICV_INHIBIT_REASON_PIT_REINJ
   Based on current code, this is relevant only to AMD's AVIC, so this reason
   will not be set for TDX guests. However, KVM is also not be able to
   intercept EOI for TDX guests. For TDX, if in-kernel PIT is enabled and in
   re-inject mode, the use of PIT in guest may have problem. Fortunately,
   modern OSes don't use PIT.
   Options:
   - Enforce irqchip split for TDX guests, i.e. in-kernel PIT is not supported.
   - Leave it as it is and expect PIT will not be used.

- Reasons will not be set for TDX
   - APICV_INHIBIT_REASON_HYPERV
     TDX doesn't support HyperV guest yet.
   - APICV_INHIBIT_REASON_ABSENT
     In-kernel LAPIC is checked in tdx_vcpu_create().
   - APICV_INHIBIT_REASON_BLOCKIRQ
     TDX doesn't support KVM_SET_GUEST_DEBUG.
   - APICV_INHIBIT_REASON_APIC_ID_MODIFIED
     KVM will reject userspace to modify APIC base, i.e., APIC mode will always
     be x2APIC mode.
   - APICV_INHIBIT_REASON_APIC_BASE_MODIFIED
     KVM will reject userspace to set APIC base.

- Reasons relevant only to AMD's AVIC
   - APICV_INHIBIT_REASON_NESTED,
   - APICV_INHIBIT_REASON_IRQWIN,
   - APICV_INHIBIT_REASON_SEV,
   - APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.

Summary about APICv inhibit reasons:
APICv could still be disabled runtime in some corner case, e.g,
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
After checking enable_apicv in tdx_bringup(), apic->apicv_active is
initialized as true in kvm_create_lapic().  If APICv is inhibited due to any
reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
if APICv is disabled for TDX, if APICv is disabled, bug the VM.


Changes of APICv active from false to true
==========================================

Lazy check for pending APIC EOI when In-kernel IOAPIC
-----------------------------------------------------
In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
accelerates write to APIC EOI register and does not trap if the interrupt
is edge-triggered. So there is a workaround by lazy check for pending APIC
EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
pending APIC EOI.
KVM is also not be able to intercept EOI for TDX guests.
- When APICv is enabled
   The code of lazy check for pending APIC EOI doesn't work for TDX because
   KVM can't get the status of real IRR and ISR, and the values are 0s in
   vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
   false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
   is called for RTC. Then userspace may miss the coalesced RTC interrupts.
- When When APICv is disabled
   ioapic_lazy_update_eoi() will not be called,then pending EOI status for
   RTC will not be cleared after setting and this will mislead userspace to
   see coalesced RTC interrupts.
Options:
- Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
- Leave it as it is, but the use of RTC may not be accurate.

kvm_can_post_timer_interrupt()
------------------------------
Whether housekeeping CPU can deliver timer interrupt to target vCPU via
posted interrupt when nohz_full option set.
- When APICv active is false, it always return false.
- When APICv active is true, it also depends on whether mwait or hlt in guest
   is set.
   For TDX guests, hlt will trigger #VE unconditionally and TDX guests request
   HLT via TDVMCALL. Whether mwait is allowed depends on the cpuid configuration
   in TD_PARAMS.
   So current implementation of kvm_mwait_in_guest() and kvm_hlt_in_guest()
   doesn't reflect the real status for TDX guests.
However, Sean mentioned "consulting kvm_can_post_timer_interrupt() in the
expiration path is silly". There could be cleanups for this part.
https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/
So, don't do any TDX-specific logic for it.

apic_timer_expired()
--------------------
About kvm_can_post_timer_interrupt() in the expiration path, see the
description above.
For the rest part, when the function is not called from timer function
- If apicv_active, the timer interrupt will be injected via
   kvm_apic_inject_pending_timer_irqs().
- If !apicv_active, the timer interrupt will be handled via
   lapic_timer.pending approach, and finally, the timer interrupt is also be
   injected via kvm_apic_inject_pending_timer_irqs().
Basically, they are functionally equivalent with subtle differences.  E.g.,
if an hrtimer fires while KVM is handling a write to TMICT, KVM will deliver
the interrupt if configured to post timer, but not if APICv is disabled,
because the latter will increment "pending", and "pending" will be cleared
before handling the new TMICT.  Ditto for switch APIC timer modes.
Sean mentioned the entire lapic_timer.pending approach may need to be ditched,
and the timer interrupt could be directly delivered no matter apicv is active
or not. https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/
This is not TDX specific, leave it for now.

Options:
- Fix kvm_mwait_in_guest()/kvm_hlt_in_guest() for TDX guests.
- VMX preemption timer can't be used by TDX guests anyway, leave
   kvm_mwait_in_guest()/kvm_hlt_in_guest() as them are, posted timer interrupt
   could be used when userspace requested to disable exit for mwait/hlt.
- VMX preemption timer can't be used by TDX guests anyway, skip checking
   kvm_mwait_in_guest()/kvm_hlt_in_guest().

kvm_arch_dy_has_pending_interrupt()
-----------------------------------
Before enabling off-TD debug, there is no functional change because there
is no PAUSE Exit for TDX guests.
After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true
to get the pending interrupt from PID. Set APICv to active for TDX is the
right thing to do.

update_cr8_intercept()
----------------------
Functionally unchanged because the callback update_cr8_intercept() for TDX
is ignored. Set APICv to active for TDX can return earlier to skip unnecessary
code.

kvm_lapic_reset()
kvm_apic_set_state()
--------------------
The callbacks apicv_post_state_restore(), hwapic_irr_update(), and
hwapic_isr_update() will be called for TDX guests when apicv is active,
these callbacks have been ignored by TDX code already, no functional changes.


Issues
======

PIC interrupts
--------------
KVM inject PIC interrupt via event injection path.
Currently, TDX code doesn't handle this, thus PIC interrupts will be lost.
Fortunately, modern OSes don't use PIC.
We could use posted-interrupt in to deliver PIC interrupt if needed.
Or can we assume PIC will not be used by TDX guests?

In-kernel PIT in re-inject mode
-------------------------------
See the description for "APICV_INHIBIT_REASON_PIT_REINJ" above.

Lazy check for pending APIC EOI of In-kernel IOAPIC
---------------------------------------------------
See the description for the same item in "Changes of APICv active from false
to true".

Open:
For the issues related to in-kernel PIT and in-kernel IOAPIC, should KVM
force irqchip split for TDX guests to eliminate the use of in-kernel PIT
and in-kernel IOAPIC?


Proposed code change
====================
Below is the proposed code change to change APICv active from false to true
for TDX guests.

Force irqchip split for TEX guests is not included.

Note, by rejecting KVM_GET_LAPIC/KVM_SET_LAPIC for TDX guests (i.e., when
guest_apic_protected), it returns an error code instead of returning 0.
It requires modifications in QEMU TDX support code to avoid requesting
KVM_GET_LAPIC/KVM_SET_LAPIC.

8<----------------------------------------------------------------------------
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0787855ab006..97025a240d54 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1289,15 +1289,6 @@ enum kvm_apicv_inhibit {
       */
      APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,

-    /*********************************************************/
-    /* INHIBITs that are relevant only to the Intel's APICv. */
-    /*********************************************************/
-
-    /*
-     * APICv is disabled because TDX doesn't support it.
-     */
-    APICV_INHIBIT_REASON_TDX,
-
      NR_APICV_INHIBIT_REASONS,
  };

@@ -1316,8 +1307,7 @@ enum kvm_apicv_inhibit {
      __APICV_INHIBIT_REASON(IRQWIN),            \
      __APICV_INHIBIT_REASON(PIT_REINJ),        \
      __APICV_INHIBIT_REASON(SEV),            \
-    __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),    \
-    __APICV_INHIBIT_REASON(TDX)
+    __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)

  struct kvm_arch {
      unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9b79b4bb063f..df9cc4a7f2d8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -782,8 +782,10 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)

  static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
  {
-    if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+    if (is_td_vcpu(vcpu)) {
+        KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm);
          return;
+    }

      vmx_refresh_apicv_exec_ctrl(vcpu);
  }
@@ -908,8 +910,7 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
       BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |            \
       BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |    \
       BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |        \
-     BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |        \
-     BIT(APICV_INHIBIT_REASON_TDX))
+     BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))

  struct kvm_x86_ops vt_x86_ops __initdata = {
      .name = KBUILD_MODNAME,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 67fc391fe798..cc516ab2d990 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -614,6 +614,7 @@ int tdx_vm_init(struct kvm *kvm)
      struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

      kvm->arch.has_private_mem = true;
+    kvm->arch.has_protected_state = true;

      /*
       * Because guest TD is protected, VMM can't parse the instruction in TD.
@@ -2354,8 +2355,6 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
          goto teardown;
      }

-    kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
-
      return 0;

      /*
@@ -2741,7 +2740,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
          return -EIO;
      }

-    vcpu->arch.apic->apicv_active = false;
      vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

      return 0;
@@ -3273,6 +3271,11 @@ int __init tdx_bringup(void)
          goto success_disable_tdx;
      }

+    if (!enable_apicv) {
+        pr_err("APICv is required for TDX\n");
+        goto success_disable_tdx;
+    }
+
      if (!tdp_mmu_enabled || !enable_mmio_caching) {
          pr_err("TDP MMU and MMIO caching is required for TDX\n");
          goto success_disable_tdx;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e433c8ee63a5..837a287d8c47 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5108,6 +5108,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
                      struct kvm_lapic_state *s)
  {
+    if (vcpu->arch.apic->guest_apic_protected)
+        return -EINVAL;
+
      kvm_x86_call(sync_pir_to_irr)(vcpu);

      return kvm_apic_get_state(vcpu, s);
@@ -5118,6 +5121,9 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
  {
      int r;

+    if (vcpu->arch.apic->guest_apic_protected)
+        return -EINVAL;
+
      r = kvm_apic_set_state(vcpu, s);
      if (r)
          return r;


Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Huang, Kai 11 months ago
On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
> Lazy check for pending APIC EOI when In-kernel IOAPIC
> -----------------------------------------------------
> In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
> accelerates write to APIC EOI register and does not trap if the interrupt
> is edge-triggered. So there is a workaround by lazy check for pending APIC
> EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
> pending APIC EOI.
> KVM is also not be able to intercept EOI for TDX guests.
> - When APICv is enabled
>    The code of lazy check for pending APIC EOI doesn't work for TDX because
>    KVM can't get the status of real IRR and ISR, and the values are 0s in
>    vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
>    false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
>    is called for RTC. Then userspace may miss the coalesced RTC interrupts.
> - When When APICv is disabled
>    ioapic_lazy_update_eoi() will not be called,then pending EOI status for
>    RTC will not be cleared after setting and this will mislead userspace to
>    see coalesced RTC interrupts.
> Options:
> - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
> - Leave it as it is, but the use of RTC may not be accurate.

Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
upon EOI for level-triggered interrupt.  I don't know how does KVM works with
userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
irqchip split for TDX guest" seems not right.

I think the problem is level-triggered interrupt, so I think another option is
to reject level-triggered interrupt for TDX guest.
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Sean Christopherson 11 months ago
On Thu, Jan 16, 2025, Kai Huang wrote:
> On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
> > Lazy check for pending APIC EOI when In-kernel IOAPIC
> > -----------------------------------------------------
> > In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
> > accelerates write to APIC EOI register and does not trap if the interrupt
> > is edge-triggered. So there is a workaround by lazy check for pending APIC
> > EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
> > pending APIC EOI.
> > KVM is also not be able to intercept EOI for TDX guests.
> > - When APICv is enabled
> >    The code of lazy check for pending APIC EOI doesn't work for TDX because
> >    KVM can't get the status of real IRR and ISR, and the values are 0s in
> >    vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
> >    false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
> >    is called for RTC. Then userspace may miss the coalesced RTC interrupts.
> > - When When APICv is disabled
> >    ioapic_lazy_update_eoi() will not be called,then pending EOI status for
> >    RTC will not be cleared after setting and this will mislead userspace to
> >    see coalesced RTC interrupts.
> > Options:
> > - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
> > - Leave it as it is, but the use of RTC may not be accurate.
> 
> Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
> for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
> upon EOI for level-triggered interrupt.  I don't know how does KVM works with
> userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
> irqchip split for TDX guest" seems not right.

Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
I/O APIC and the "split" model is the way to concoct such a setup.  With a "full"
IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
nonsensical on TDX because it's fully virtual world, i.e. there's no reason to
emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
emulated (see below).

> I think the problem is level-triggered interrupt,

Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
i.e. all EOIs are accelerated and never trigger exits.

> so I think another option is to reject level-triggered interrupt for TDX guest.

This is a "don't do that, it will hurt" situation.  With a sane VMM, the level-ness
of GSIs is controlled by the guest.  For GSIs that are routed through the I/O APIC,
the level-ness is determined by the corresponding Redirection Table entry.  For
"GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
wire up MSIs), and for direct MSIs injection (KVM_SIGNAL_MSI), the level-ness is
dictated by the MSI itself, which again is guest controlled.

If the guest induces generation of a level-triggered interrupt, the VMM is left
with the choice of dropping the interrupt, sending it as-is, or converting it to
an edge-triggered interrupt.  Ditto for KVM.  All of those options will make the
guest unhappy.

So while it _might_ make debugging broken guests either, I don't think it's worth
the complexity to try and prevent the VMM/guest from sending level-triggered
GSI-routed interrupts.  It'd be a bit of a whack-a-mole and there's no architectural
behavior KVM can provide that's better than sending the interrupt and hoping for
the best.
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Binbin Wu 11 months ago


On 1/16/2025 10:50 PM, Sean Christopherson wrote:
> On Thu, Jan 16, 2025, Kai Huang wrote:
>> On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
>>> Lazy check for pending APIC EOI when In-kernel IOAPIC
>>> -----------------------------------------------------
>>> In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
>>> accelerates write to APIC EOI register and does not trap if the interrupt
>>> is edge-triggered. So there is a workaround by lazy check for pending APIC
>>> EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
>>> pending APIC EOI.
>>> KVM is also not be able to intercept EOI for TDX guests.
>>> - When APICv is enabled
>>>     The code of lazy check for pending APIC EOI doesn't work for TDX because
>>>     KVM can't get the status of real IRR and ISR, and the values are 0s in
>>>     vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
>>>     false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
>>>     is called for RTC. Then userspace may miss the coalesced RTC interrupts.
>>> - When When APICv is disabled
>>>     ioapic_lazy_update_eoi() will not be called,then pending EOI status for
>>>     RTC will not be cleared after setting and this will mislead userspace to
>>>     see coalesced RTC interrupts.
>>> Options:
>>> - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
>>> - Leave it as it is, but the use of RTC may not be accurate.
>> Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
>> for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
>> upon EOI for level-triggered interrupt.  I don't know how does KVM works with
>> userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
>> irqchip split for TDX guest" seems not right.
> Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> I/O APIC and the "split" model is the way to concoct such a setup.  With a "full"
> IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> nonsensical on TDX because it's fully virtual world, i.e. there's no reason to
> emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> emulated (see below).
>
>> I think the problem is level-triggered interrupt,
> Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> i.e. all EOIs are accelerated and never trigger exits.

Yes, and I think it needs to add the description about it and the
level-trigger interrupt in the commit message of some patch.


>
>> so I think another option is to reject level-triggered interrupt for TDX guest.
> This is a "don't do that, it will hurt" situation.  With a sane VMM, the level-ness
> of GSIs is controlled by the guest.  For GSIs that are routed through the I/O APIC,
> the level-ness is determined by the corresponding Redirection Table entry.  For
> "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> wire up MSIs), and for direct MSIs injection (KVM_SIGNAL_MSI), the level-ness is
> dictated by the MSI itself, which again is guest controlled.
>
> If the guest induces generation of a level-triggered interrupt, the VMM is left
> with the choice of dropping the interrupt, sending it as-is, or converting it to
> an edge-triggered interrupt.  Ditto for KVM.  All of those options will make the
> guest unhappy.
>
> So while it _might_ make debugging broken guests either, I don't think it's worth
> the complexity to try and prevent the VMM/guest from sending level-triggered
> GSI-routed interrupts.  It'd be a bit of a whack-a-mole and there's no architectural
> behavior KVM can provide that's better than sending the interrupt and hoping for
> the best.
Currently, KVM doesn't do anything special if the guest send level-triggered
interrupts for TDX guests.
QEMU has a patch to set the eoi_intercept_unsupported to true for tdx guests.
https://lore.kernel.org/kvm/20241105062408.3533704-41-xiaoyao.li@intel.com/

And it seems that the level_trigger_unsupported info will be passed to guest
via ACPI table. I didn't dig deep into it, I suppose with the information,
guests will not send level-triggered GSI interrupts?

Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Huang, Kai 11 months ago
On Thu, 2025-01-16 at 06:50 -0800, Sean Christopherson wrote:
> On Thu, Jan 16, 2025, Kai Huang wrote:
> > On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
> > > Lazy check for pending APIC EOI when In-kernel IOAPIC
> > > -----------------------------------------------------
> > > In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
> > > accelerates write to APIC EOI register and does not trap if the interrupt
> > > is edge-triggered. So there is a workaround by lazy check for pending APIC
> > > EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
> > > pending APIC EOI.
> > > KVM is also not be able to intercept EOI for TDX guests.
> > > - When APICv is enabled
> > >    The code of lazy check for pending APIC EOI doesn't work for TDX because
> > >    KVM can't get the status of real IRR and ISR, and the values are 0s in
> > >    vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
> > >    false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
> > >    is called for RTC. Then userspace may miss the coalesced RTC interrupts.
> > > - When When APICv is disabled
> > >    ioapic_lazy_update_eoi() will not be called,then pending EOI status for
> > >    RTC will not be cleared after setting and this will mislead userspace to
> > >    see coalesced RTC interrupts.
> > > Options:
> > > - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
> > > - Leave it as it is, but the use of RTC may not be accurate.
> > 
> > Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
> > for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
> > upon EOI for level-triggered interrupt.  I don't know how does KVM works with
> > userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
> > irqchip split for TDX guest" seems not right.
> 
> Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> I/O APIC and the "split" model is the way to concoct such a setup.  With a "full"
> IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> nonsensical on TDX because it's fully virtual world, i.e. there's no reason to
> emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> emulated (see below).

Disabling in-kernel IOAPIC/PIC for TDX guests is fine to me, but I think that,
"conceptually", having IOAPIC/PIC in userspace doesn't mean disabling IOAPIC,
because theoretically usrespace IOAPIC still needs to be told about the EOI for
emulation.  I just haven't figured out how does userpsace IOAPIC work with KVM
in case of "split IRQCHIP" w/o trapping EOI for level-triggered interrupt. :-)

If the point is to disable in-kernel IOAPIC/PIC for TDX guests, then I think
both KVM_IRQCHIP_NONE and KVM_IRQCHIP_SPLIT should be allowed for TDX, but not
just KVM_IRQCHIP_SPLIT?

> 
> > I think the problem is level-triggered interrupt,
> 
> Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> i.e. all EOIs are accelerated and never trigger exits.
> 
> > so I think another option is to reject level-triggered interrupt for TDX guest.
> 
> This is a "don't do that, it will hurt" situation.  With a sane VMM, the level-ness
> of GSIs is controlled by the guest.  For GSIs that are routed through the I/O APIC,
> the level-ness is determined by the corresponding Redirection Table entry.  For
> "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> wire up MSIs), and for direct MSIs injection (KVM_SIGNAL_MSI), the level-ness is
> dictated by the MSI itself, which again is guest controlled.
> 
> If the guest induces generation of a level-triggered interrupt, the VMM is left
> with the choice of dropping the interrupt, sending it as-is, or converting it to
> an edge-triggered interrupt.  Ditto for KVM.  All of those options will make the
> guest unhappy.
> 
> So while it _might_ make debugging broken guests either, I don't think it's worth
> the complexity to try and prevent the VMM/guest from sending level-triggered
> GSI-routed interrupts.  
> 

KVM can at least have some chance to print some error message?

> It'd be a bit of a whack-a-mole and there's no architectural
> behavior KVM can provide that's better than sending the interrupt and hoping for
> the best.

Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Sean Christopherson 11 months ago
On Thu, Jan 16, 2025, Kai Huang wrote:
> On Thu, 2025-01-16 at 06:50 -0800, Sean Christopherson wrote:
> > On Thu, Jan 16, 2025, Kai Huang wrote:

...

> > > Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
> > > for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
> > > upon EOI for level-triggered interrupt.  I don't know how does KVM works with
> > > userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
> > > irqchip split for TDX guest" seems not right.
> > 
> > Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> > I/O APIC and the "split" model is the way to concoct such a setup.  With a "full"
> > IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> > nonsensical on TDX because it's fully virtual world, i.e. there's no reason to
> > emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> > Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> > level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> > emulated (see below).
> 
> Disabling in-kernel IOAPIC/PIC for TDX guests is fine to me, but I think that,
> "conceptually", having IOAPIC/PIC in userspace doesn't mean disabling IOAPIC,
> because theoretically usrespace IOAPIC still needs to be told about the EOI for
> emulation.  I just haven't figured out how does userpsace IOAPIC work with KVM
> in case of "split IRQCHIP" w/o trapping EOI for level-triggered interrupt. :-)

Userspace I/O APIC _does_ intercept EOI.  KVM scans the GSI routes provided by
userspace and intercepts those that are configured to be delivered as level-
triggered interrupts.  Whereas with an in-kernel I/O APIC, KVM scans the GSI
routes *and* the I/O APIC Redirection Table (for interrupts that are routed
through the I/O APIC).

> If the point is to disable in-kernel IOAPIC/PIC for TDX guests, then I think
> both KVM_IRQCHIP_NONE and KVM_IRQCHIP_SPLIT should be allowed for TDX, but not
> just KVM_IRQCHIP_SPLIT?

No, because APICv is mandatory for TDX, which rules out KVM_IRQCHIP_NONE.

> 
> > 
> > > I think the problem is level-triggered interrupt,
> > 
> > Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> > i.e. all EOIs are accelerated and never trigger exits.
> > 
> > > so I think another option is to reject level-triggered interrupt for TDX guest.
> > 
> > This is a "don't do that, it will hurt" situation.  With a sane VMM, the level-ness
> > of GSIs is controlled by the guest.  For GSIs that are routed through the I/O APIC,
> > the level-ness is determined by the corresponding Redirection Table entry.  For
> > "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> > wire up MSIs), and for direct MSIs injection (KVM_SIGNAL_MSI), the level-ness is
> > dictated by the MSI itself, which again is guest controlled.
> > 
> > If the guest induces generation of a level-triggered interrupt, the VMM is left
> > with the choice of dropping the interrupt, sending it as-is, or converting it to
> > an edge-triggered interrupt.  Ditto for KVM.  All of those options will make the
> > guest unhappy.
> > 
> > So while it _might_ make debugging broken guests either, I don't think it's worth
> > the complexity to try and prevent the VMM/guest from sending level-triggered
> > GSI-routed interrupts.  
> > 
> 
> KVM can at least have some chance to print some error message?

No.  A guest can shoot itself any number of ways, and userspace has every
opportunity to log weirdness in this case.
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Huang, Kai 11 months ago
On Thu, 2025-01-16 at 14:37 -0800, Sean Christopherson wrote:
> On Thu, Jan 16, 2025, Kai Huang wrote:
> > On Thu, 2025-01-16 at 06:50 -0800, Sean Christopherson wrote:
> > > On Thu, Jan 16, 2025, Kai Huang wrote:
> 
> ...
> 
> > > > Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
> > > > for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
> > > > upon EOI for level-triggered interrupt.  I don't know how does KVM works with
> > > > userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
> > > > irqchip split for TDX guest" seems not right.
> > > 
> > > Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> > > I/O APIC and the "split" model is the way to concoct such a setup.  With a "full"
> > > IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> > > nonsensical on TDX because it's fully virtual world, i.e. there's no reason to
> > > emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> > > Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> > > level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> > > emulated (see below).
> > 
> > Disabling in-kernel IOAPIC/PIC for TDX guests is fine to me, but I think that,
> > "conceptually", having IOAPIC/PIC in userspace doesn't mean disabling IOAPIC,
> > because theoretically usrespace IOAPIC still needs to be told about the EOI for
> > emulation.  I just haven't figured out how does userpsace IOAPIC work with KVM
> > in case of "split IRQCHIP" w/o trapping EOI for level-triggered interrupt. :-)
> 
> Userspace I/O APIC _does_ intercept EOI.  KVM scans the GSI routes provided by
> userspace and intercepts those that are configured to be delivered as level-
> triggered interrupts.  
> 

Yeah see it now (I believe you mean kvm_scan_ioapic_routes()).  Thanks!

> Whereas with an in-kernel I/O APIC, KVM scans the GSI
> routes *and* the I/O APIC Redirection Table (for interrupts that are routed
> through the I/O APIC).

Right.

But neither of them work with TDX because TDX doesn't support EOI exit.

So from the sense that we don't want KVM to support in-kernel IOAPIC for TDX, I
agree we can force to use IRQCHIP split.  But my point is this doesn't seem to
be able to resolve the problem. :-)

Btw, IIUC, in case of IRQCHIP split, KVM uses KVM_IRQ_ROUTING_MSI for routes of
GSIs.  But it seems KVM only allows level-triggered MSI to be signaled (which is
a surprising):

int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
                struct kvm *kvm, int irq_source_id, int level, bool line_status)
{
        struct kvm_lapic_irq irq;

        if (kvm_msi_route_invalid(kvm, e))
                return -EINVAL;

        if (!level)
                return -1;

        kvm_set_msi_irq(kvm, e, &irq);

        return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
}

> 
> > If the point is to disable in-kernel IOAPIC/PIC for TDX guests, then I think
> > both KVM_IRQCHIP_NONE and KVM_IRQCHIP_SPLIT should be allowed for TDX, but not
> > just KVM_IRQCHIP_SPLIT?
> 
> No, because APICv is mandatory for TDX, which rules out KVM_IRQCHIP_NONE.

Yeah I missed this obvious thing.

> 
> > 
> > > 
> > > > I think the problem is level-triggered interrupt,
> > > 
> > > Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> > > i.e. all EOIs are accelerated and never trigger exits.
> > > 
> > > > so I think another option is to reject level-triggered interrupt for TDX guest.
> > > 
> > > This is a "don't do that, it will hurt" situation.  With a sane VMM, the level-ness
> > > of GSIs is controlled by the guest.  For GSIs that are routed through the I/O APIC,
> > > the level-ness is determined by the corresponding Redirection Table entry.  For
> > > "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> > > wire up MSIs), and for direct MSIs injection (KVM_SIGNAL_MSI), the level-ness is
> > > dictated by the MSI itself, which again is guest controlled.
> > > 
> > > If the guest induces generation of a level-triggered interrupt, the VMM is left
> > > with the choice of dropping the interrupt, sending it as-is, or converting it to
> > > an edge-triggered interrupt.  Ditto for KVM.  All of those options will make the
> > > guest unhappy.
> > > 
> > > So while it _might_ make debugging broken guests either, I don't think it's worth
> > > the complexity to try and prevent the VMM/guest from sending level-triggered
> > > GSI-routed interrupts.  
> > > 
> > 
> > KVM can at least have some chance to print some error message?
> 
> No.  A guest can shoot itself any number of ways, and userspace has every
> opportunity to log weirdness in this case.

Agreed.
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Huang, Kai 11 months ago
On Fri, 2025-01-17 at 09:53 +0000, Huang, Kai wrote:
> Btw, IIUC, in case of IRQCHIP split, KVM uses KVM_IRQ_ROUTING_MSI for routes of
> GSIs.  But it seems KVM only allows level-triggered MSI to be signaled (which is
> a surprising):
> 
> int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
>                 struct kvm *kvm, int irq_source_id, int level, bool line_status)
> {
>         struct kvm_lapic_irq irq;
> 
>         if (kvm_msi_route_invalid(kvm, e))
>                 return -EINVAL;
> 
>         if (!level)
>                 return -1;
> 
>         kvm_set_msi_irq(kvm, e, &irq);
> 
>         return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
> }

Ah sorry this 'level' is not trig_mode.  Please ignore :-)
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Sean Christopherson 11 months ago
On Fri, Jan 17, 2025, Kai Huang wrote:
> On Fri, 2025-01-17 at 09:53 +0000, Huang, Kai wrote:
> > Btw, IIUC, in case of IRQCHIP split, KVM uses KVM_IRQ_ROUTING_MSI for routes of
> > GSIs.  But it seems KVM only allows level-triggered MSI to be signaled (which is
> > a surprising):
> > 
> > int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
> >                 struct kvm *kvm, int irq_source_id, int level, bool line_status)
> > {
> >         struct kvm_lapic_irq irq;
> > 
> >         if (kvm_msi_route_invalid(kvm, e))
> >                 return -EINVAL;
> > 
> >         if (!level)
> >                 return -1;
> > 
> >         kvm_set_msi_irq(kvm, e, &irq);
> > 
> >         return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
> > }
> 
> Ah sorry this 'level' is not trig_mode.  Please ignore :-)

Yeah :-(  I have misread the use of "level" so, so many times in KVM's IRQ code.
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Sean Christopherson 11 months, 1 week ago
On Mon, Jan 13, 2025, Binbin Wu wrote:
> On 1/13/2025 10:03 AM, Binbin Wu wrote:
> > 
> > On 12/9/2024 9:07 AM, Binbin Wu wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> > > from host VMM.
> > > 
> > > Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
> > > it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
> > 
> Resend due to the format mess.

That was a very impressive mess :-)

> After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could
> disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}.
> 
> - KVM_SET_LAPIC
>   Currently, KVM allows userspace to request KVM_SET_LAPIC to set the state
>   of LAPIC for TDX guests.
>   There are two options:
>   - Force x2APIC mode and default base address when userspace request
>     KVM_SET_LAPIC.
>   - Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected
>     is true), since migration is not supported yet.
>   Choose option 2 for simplicity for now.

Yeah.  We'll likely need to support KVM_SET_LAPIC at some point, e.g. to support
PID.PIR save/restore, but that's definitely a future problem.

> Summary about APICv inhibit reasons:
> APICv could still be disabled runtime in some corner case, e.g,
> APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
> After checking enable_apicv in tdx_bringup(), apic->apicv_active is
> initialized as true in kvm_create_lapic().  If APICv is inhibited due to any
> reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
> if APICv is disabled for TDX, if APICv is disabled, bug the VM.

I _think_ this is a non-issue, and that KVM could do KVM_BUG_ON() if APICv is
inihibited by kvm_recalculate_apic_map() for a TDX VM.  x2APIC is mandatory
(KVM_APIC_MODE_MAP_DISABLED and "APIC_ID modified" impossible), KVM emulates
APIC_ID as read-only for x2APIC mode (physical aliasing impossible), and LDR is
read-only for x2APIC (logical aliasing impossible).

To ensure no physical aliasing, KVM would need to require KVM_CAP_X2APIC_API be
enabled, but that should probably be required for TDX no matter what.

> kvm_arch_dy_has_pending_interrupt()
> -----------------------------------
> Before enabling off-TD debug, there is no functional change because there
> is no PAUSE Exit for TDX guests.
> After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true
> to get the pending interrupt from PID. Set APICv to active for TDX is the
> right thing to do.

And as alluded to above, for save/restore, e.g. intrahost migration.
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Binbin Wu 11 months, 1 week ago


On 1/14/2025 1:16 AM, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Binbin Wu wrote:
>> On 1/13/2025 10:03 AM, Binbin Wu wrote:
>>> On 12/9/2024 9:07 AM, Binbin Wu wrote:
>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>
>>>> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
>>>> from host VMM.
>>>>
>>>> Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
>>>> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>> Resend due to the format mess.
> That was a very impressive mess :-)
>
>> After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could
>> disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}.
>>
>> - KVM_SET_LAPIC
>>    Currently, KVM allows userspace to request KVM_SET_LAPIC to set the state
>>    of LAPIC for TDX guests.
>>    There are two options:
>>    - Force x2APIC mode and default base address when userspace request
>>      KVM_SET_LAPIC.
>>    - Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected
>>      is true), since migration is not supported yet.
>>    Choose option 2 for simplicity for now.
> Yeah.  We'll likely need to support KVM_SET_LAPIC at some point, e.g. to support
> PID.PIR save/restore, but that's definitely a future problem.
>
>> Summary about APICv inhibit reasons:
>> APICv could still be disabled runtime in some corner case, e.g,
>> APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
>> After checking enable_apicv in tdx_bringup(), apic->apicv_active is
>> initialized as true in kvm_create_lapic().  If APICv is inhibited due to any
>> reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
>> if APICv is disabled for TDX, if APICv is disabled, bug the VM.
> I _think_ this is a non-issue, and that KVM could do KVM_BUG_ON() if APICv is
> inihibited by kvm_recalculate_apic_map() for a TDX VM.  x2APIC is mandatory
> (KVM_APIC_MODE_MAP_DISABLED and "APIC_ID modified" impossible), KVM emulates
> APIC_ID as read-only for x2APIC mode (physical aliasing impossible), and LDR is
> read-only for x2APIC (logical aliasing impossible).

For logical aliasing, according to the KVM code, it's only relevant to
AMD's AVIC. It's not set in VMX_REQUIRED_APICV_INHIBITS.
Is the reason AVIC using logical-id-addressing while APICv using
physical-id-addressing for IPI virtualization?

>
> To ensure no physical aliasing, KVM would need to require KVM_CAP_X2APIC_API be
> enabled, but that should probably be required for TDX no matter what.
There is no physical aliasing when APIC is in x2apic mode, vcpu_id is used
anyway.  Also, KVM is going to reject KVM_SET_LAPIC/KVM_GET_LAPIC from
userspace for TDX guests, functionally, it doesn't matter whether
KVM_CAP_X2APIC_API is enabled or not.

But for future proof, we could enforce KVM_CAP_X2APIC_API being enabled.

>
>> kvm_arch_dy_has_pending_interrupt()
>> -----------------------------------
>> Before enabling off-TD debug, there is no functional change because there
>> is no PAUSE Exit for TDX guests.
>> After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true
>> to get the pending interrupt from PID. Set APICv to active for TDX is the
>> right thing to do.
> And as alluded to above, for save/restore, e.g. intrahost migration.

Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Sean Christopherson 11 months, 1 week ago
On Tue, Jan 14, 2025, Binbin Wu wrote:
> On 1/14/2025 1:16 AM, Sean Christopherson wrote:
> > On Mon, Jan 13, 2025, Binbin Wu wrote:
> > > Summary about APICv inhibit reasons:
> > > APICv could still be disabled runtime in some corner case, e.g,
> > > APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
> > > After checking enable_apicv in tdx_bringup(), apic->apicv_active is
> > > initialized as true in kvm_create_lapic().  If APICv is inhibited due to any
> > > reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
> > > if APICv is disabled for TDX, if APICv is disabled, bug the VM.
> > I _think_ this is a non-issue, and that KVM could do KVM_BUG_ON() if APICv is
> > inihibited by kvm_recalculate_apic_map() for a TDX VM.  x2APIC is mandatory
> > (KVM_APIC_MODE_MAP_DISABLED and "APIC_ID modified" impossible), KVM emulates
> > APIC_ID as read-only for x2APIC mode (physical aliasing impossible), and LDR is
> > read-only for x2APIC (logical aliasing impossible).
> 
> For logical aliasing, according to the KVM code, it's only relevant to
> AMD's AVIC. It's not set in VMX_REQUIRED_APICV_INHIBITS.

Ah, right.

> Is the reason AVIC using logical-id-addressing while APICv using
> physical-id-addressing for IPI virtualization?

Ya, more or less.  AVIC supports virtualizing both physical and logical IPIs,
APICv only supports physical.

> > To ensure no physical aliasing, KVM would need to require KVM_CAP_X2APIC_API be
> > enabled, but that should probably be required for TDX no matter what.
> There is no physical aliasing when APIC is in x2apic mode, vcpu_id is used
> anyway.

Yeah, ignore this, I misremembered the effects of KVM_CAP_X2APIC_API.
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Vishal Annapurve 11 months, 2 weeks ago
On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> from host VMM.
>
> Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> TDX interrupts breakout:
> - Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
>   tdx_td_vcpu_init(). (Rick)
> - Change APICV -> APICv in changelog for consistency.
> - Split the changelog to 2 paragraphs.
> ---
>  arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
>  arch/x86/kvm/vmx/main.c         |  3 ++-
>  arch/x86/kvm/vmx/tdx.c          |  3 +++
>  3 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 32c7d58a5d68..df535f08e004 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
>          */
>         APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
>
> +       /*********************************************************/
> +       /* INHIBITs that are relevant only to the Intel's APICv. */
> +       /*********************************************************/
> +
> +       /*
> +        * APICv is disabled because TDX doesn't support it.
> +        */
> +       APICV_INHIBIT_REASON_TDX,
> +
>         NR_APICV_INHIBIT_REASONS,
>  };
>
> @@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
>         __APICV_INHIBIT_REASON(IRQWIN),                 \
>         __APICV_INHIBIT_REASON(PIT_REINJ),              \
>         __APICV_INHIBIT_REASON(SEV),                    \
> -       __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
> +       __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),     \
> +       __APICV_INHIBIT_REASON(TDX)
>
>  struct kvm_arch {
>         unsigned long n_used_mmu_pages;
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 7f933f821188..13a0ab0a520c 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>          BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |                   \
>          BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |        \
>          BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |           \
> -        BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
> +        BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |         \
> +        BIT(APICV_INHIBIT_REASON_TDX))
>
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>         .name = KBUILD_MODNAME,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b0f525069ebd..b51d2416acfb 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
>                 goto teardown;
>         }
>
> +       kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
> +
>         return 0;
>
>         /*
> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
>                 return -EIO;
>         }
>
> +       vcpu->arch.apic->apicv_active = false;

With this setting, apic_timer_expired[1] will always cause timer
interrupts to be pending without injecting them right away. Injecting
it after VM exit [2] could cause unbounded delays to timer interrupt
injection.

[1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/lapic.c?h=kvm-coco-queue#n1922
[2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263

>         vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>
>         return 0;
> --
> 2.46.0
>
>
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Binbin Wu 11 months, 2 weeks ago


On 1/4/2025 5:59 AM, Vishal Annapurve wrote:
> On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
>> from host VMM.
>>
>> Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
>> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> ---
>> TDX interrupts breakout:
>> - Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
>>    tdx_td_vcpu_init(). (Rick)
>> - Change APICV -> APICv in changelog for consistency.
>> - Split the changelog to 2 paragraphs.
>> ---
>>   arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
>>   arch/x86/kvm/vmx/main.c         |  3 ++-
>>   arch/x86/kvm/vmx/tdx.c          |  3 +++
>>   3 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 32c7d58a5d68..df535f08e004 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
>>           */
>>          APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
>>
>> +       /*********************************************************/
>> +       /* INHIBITs that are relevant only to the Intel's APICv. */
>> +       /*********************************************************/
>> +
>> +       /*
>> +        * APICv is disabled because TDX doesn't support it.
>> +        */
>> +       APICV_INHIBIT_REASON_TDX,
>> +
>>          NR_APICV_INHIBIT_REASONS,
>>   };
>>
>> @@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
>>          __APICV_INHIBIT_REASON(IRQWIN),                 \
>>          __APICV_INHIBIT_REASON(PIT_REINJ),              \
>>          __APICV_INHIBIT_REASON(SEV),                    \
>> -       __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
>> +       __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),     \
>> +       __APICV_INHIBIT_REASON(TDX)
>>
>>   struct kvm_arch {
>>          unsigned long n_used_mmu_pages;
>> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>> index 7f933f821188..13a0ab0a520c 100644
>> --- a/arch/x86/kvm/vmx/main.c
>> +++ b/arch/x86/kvm/vmx/main.c
>> @@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>>           BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |                   \
>>           BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |        \
>>           BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |           \
>> -        BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
>> +        BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |         \
>> +        BIT(APICV_INHIBIT_REASON_TDX))
>>
>>   struct kvm_x86_ops vt_x86_ops __initdata = {
>>          .name = KBUILD_MODNAME,
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index b0f525069ebd..b51d2416acfb 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
>>                  goto teardown;
>>          }
>>
>> +       kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
>> +
>>          return 0;
>>
>>          /*
>> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
>>                  return -EIO;
>>          }
>>
>> +       vcpu->arch.apic->apicv_active = false;
> With this setting, apic_timer_expired[1] will always cause timer
> interrupts to be pending without injecting them right away. Injecting
> it after VM exit [2] could cause unbounded delays to timer interrupt
> injection.

When apic->apicv_active is false, it will fallback to increasing the
apic->lapic_timer.pending and request KVM_REQ_UNBLOCK.
If apic_timer_expired() is called from timer function, the target vCPU
will be kicked out immediately.
So there is no unbounded delay to timer interrupt injection.

In a previous PUCK session, Sean suggested apic->apicv_active should be
set to true to align with the hardware setting because TDX module always
enables apicv for TDX guests.
Will send a write up about changing apicv to active later.

>
> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/lapic.c?h=kvm-coco-queue#n1922
> [2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
>
>>          vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>
>>          return 0;
>> --
>> 2.46.0
>>
>>

Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Vishal Annapurve 11 months, 2 weeks ago
On Sun, Jan 5, 2025 at 5:46 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>
>
>
>
> On 1/4/2025 5:59 AM, Vishal Annapurve wrote:
> > On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> >> From: Isaku Yamahata <isaku.yamahata@intel.com>
> >>
> >> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> >> from host VMM.
> >>
> >> Follow how SEV inhibits APICv.  I.e, define a new inhibit reason for TDX, set
> >> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
> >>
> >> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> >> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> >> ---
> >> TDX interrupts breakout:
> >> - Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
> >>    tdx_td_vcpu_init(). (Rick)
> >> - Change APICV -> APICv in changelog for consistency.
> >> - Split the changelog to 2 paragraphs.
> >> ---
> >>   arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
> >>   arch/x86/kvm/vmx/main.c         |  3 ++-
> >>   arch/x86/kvm/vmx/tdx.c          |  3 +++
> >>   3 files changed, 16 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index 32c7d58a5d68..df535f08e004 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
> >>           */
> >>          APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
> >>
> >> +       /*********************************************************/
> >> +       /* INHIBITs that are relevant only to the Intel's APICv. */
> >> +       /*********************************************************/
> >> +
> >> +       /*
> >> +        * APICv is disabled because TDX doesn't support it.
> >> +        */
> >> +       APICV_INHIBIT_REASON_TDX,
> >> +
> >>          NR_APICV_INHIBIT_REASONS,
> >>   };
> >>
> >> @@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
> >>          __APICV_INHIBIT_REASON(IRQWIN),                 \
> >>          __APICV_INHIBIT_REASON(PIT_REINJ),              \
> >>          __APICV_INHIBIT_REASON(SEV),                    \
> >> -       __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
> >> +       __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),     \
> >> +       __APICV_INHIBIT_REASON(TDX)
> >>
> >>   struct kvm_arch {
> >>          unsigned long n_used_mmu_pages;
> >> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> >> index 7f933f821188..13a0ab0a520c 100644
> >> --- a/arch/x86/kvm/vmx/main.c
> >> +++ b/arch/x86/kvm/vmx/main.c
> >> @@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >>           BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |                   \
> >>           BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |        \
> >>           BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |           \
> >> -        BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
> >> +        BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |         \
> >> +        BIT(APICV_INHIBIT_REASON_TDX))
> >>
> >>   struct kvm_x86_ops vt_x86_ops __initdata = {
> >>          .name = KBUILD_MODNAME,
> >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> index b0f525069ebd..b51d2416acfb 100644
> >> --- a/arch/x86/kvm/vmx/tdx.c
> >> +++ b/arch/x86/kvm/vmx/tdx.c
> >> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
> >>                  goto teardown;
> >>          }
> >>
> >> +       kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
> >> +
> >>          return 0;
> >>
> >>          /*
> >> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> >>                  return -EIO;
> >>          }
> >>
> >> +       vcpu->arch.apic->apicv_active = false;
> > With this setting, apic_timer_expired[1] will always cause timer
> > interrupts to be pending without injecting them right away. Injecting
> > it after VM exit [2] could cause unbounded delays to timer interrupt
> > injection.
>
> When apic->apicv_active is false, it will fallback to increasing the
> apic->lapic_timer.pending and request KVM_REQ_UNBLOCK.
> If apic_timer_expired() is called from timer function, the target vCPU
> will be kicked out immediately.
> So there is no unbounded delay to timer interrupt injection.

Ack. Though, wouldn't it be faster to just post timer interrupts right
away without causing vcpu exit?

Another scenario I was thinking of was hrtimer expiry during vcpu exit
being handled in KVM/userspace, which will cause timer interrupt
injection after the next exit [1] delaying timer interrupt to guest.
This scenario is not specific to TDX VMs though.

[1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263

>
> In a previous PUCK session, Sean suggested apic->apicv_active should be
> set to true to align with the hardware setting because TDX module always
> enables apicv for TDX guests.
> Will send a write up about changing apicv to active later.
>
> >
> > [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/lapic.c?h=kvm-coco-queue#n1922
> > [2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
> >
> >>          vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >>
> >>          return 0;
> >> --
> >> 2.46.0
> >>
> >>
>
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Sean Christopherson 11 months, 2 weeks ago
On Mon, Jan 06, 2025, Vishal Annapurve wrote:
> On Sun, Jan 5, 2025 at 5:46 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> > On 1/4/2025 5:59 AM, Vishal Annapurve wrote:
> > > On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> > >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > >> index b0f525069ebd..b51d2416acfb 100644
> > >> --- a/arch/x86/kvm/vmx/tdx.c
> > >> +++ b/arch/x86/kvm/vmx/tdx.c
> > >> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
> > >>                  goto teardown;
> > >>          }
> > >>
> > >> +       kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
> > >> +
> > >>          return 0;
> > >>
> > >>          /*
> > >> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > >>                  return -EIO;
> > >>          }
> > >>
> > >> +       vcpu->arch.apic->apicv_active = false;
> > > With this setting, apic_timer_expired[1] will always cause timer
> > > interrupts to be pending without injecting them right away. Injecting
> > > it after VM exit [2] could cause unbounded delays to timer interrupt
> > > injection.
> >
> > When apic->apicv_active is false, it will fallback to increasing the
> > apic->lapic_timer.pending and request KVM_REQ_UNBLOCK.
> > If apic_timer_expired() is called from timer function, the target vCPU
> > will be kicked out immediately.
> > So there is no unbounded delay to timer interrupt injection.
> 
> Ack. Though, wouldn't it be faster to just post timer interrupts right
> away without causing vcpu exit?

Yes, but if and only if hrtimers are offloaded to dedicated "housekeeping" CPUs.
If the hrtimer is running on the same pCPU that the vCPU is running on, then the
expiration of the underlying hardware timer (in practice, the "real" APIC timer)
will generate a host IRQ and thus a VM-Exit.  I.e. the vCPU will already be kicked
into the host, and the virtual timer IRQ will be delivered prior to re-entering
the guest.

Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
being disabled to detect that it's worth trying to post a timer interrupt, but off
the top of my head I'm pretty sure that's unnecessary and pointless.  The
"vcpu->mode == IN_GUEST_MODE" is super cheap, and I can't think of any harm in
posting the time interrupt if the target vCPU happens to be in guest mode even
if the host wasn't configured just so.

> Another scenario I was thinking of was hrtimer expiry during vcpu exit
> being handled in KVM/userspace, which will cause timer interrupt
> injection after the next exit [1] delaying timer interrupt to guest.

No, the IRQ won't be delayed.  Expiration from the hrtimer callback will set
KVM_REQ_UNBLOCK, which will prevent actually entering the guest (see the call
to kvm_request_pending() in kvm_vcpu_exit_request()).

> This scenario is not specific to TDX VMs though.
> 
> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Chao Gao 11 months, 2 weeks ago
>Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
>being disabled to detect that it's worth trying to post a timer interrupt, but off
>the top of my head I'm pretty sure that's unnecessary and pointless.  The

Commit 1714a4eb6fb0 gives an explanation:

  if only some guests isolated and others not, they would not
  have any benefit from posted timer interrupts, and at the same time lose
  VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
  returns true and therefore forces kvm_can_use_hv_timer() to false.

>"vcpu->mode == IN_GUEST_MODE" is super cheap, and I can't think of any harm in
>posting the time interrupt if the target vCPU happens to be in guest mode even
>if the host wasn't configured just so.
>
>> Another scenario I was thinking of was hrtimer expiry during vcpu exit
>> being handled in KVM/userspace, which will cause timer interrupt
>> injection after the next exit [1] delaying timer interrupt to guest.
>
>No, the IRQ won't be delayed.  Expiration from the hrtimer callback will set
>KVM_REQ_UNBLOCK, which will prevent actually entering the guest (see the call
>to kvm_request_pending() in kvm_vcpu_exit_request()).
>
>> This scenario is not specific to TDX VMs though.
>> 
>> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Binbin Wu 11 months, 2 weeks ago


On 1/7/2025 11:24 AM, Chao Gao wrote:
>> Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
>> being disabled to detect that it's worth trying to post a timer interrupt, but off
>> the top of my head I'm pretty sure that's unnecessary and pointless.  The
> Commit 1714a4eb6fb0 gives an explanation:
>
>    if only some guests isolated and others not, they would not
>    have any benefit from posted timer interrupts, and at the same time lose
>    VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
>    returns true and therefore forces kvm_can_use_hv_timer() to false.
Userspace uses KVM_CAP_X86_DISABLE_EXITS to enable mwait_in_guest or/and
hlt_in_guest for non TDX guest. The code doesn't work for TDX guests.
- Whether mwait in guest is allowed for TDX depends on the cpuid
   configuration in TD_PARAMS, not the capability of physical cpu checked by
   kvm_can_mwait_in_guest().
- hlt for TDX is via TDVMCALL, i.e. hlt_in_guest should always be false
   for TDX guests.

For TDX guests, not sure if it worths to fix kvm_{mwait,hlt}_in_guest()
or add special handling (i.e., skip checking mwait/hlt in guest) because
VMX preemption timer can't be used anyway, in order to allow housekeeping
CPU to post timer interrupt for TDX vCPUs when nohz_full is configured
after changing APICv active to true.


>
>> "vcpu->mode == IN_GUEST_MODE" is super cheap, and I can't think of any harm in
>> posting the time interrupt if the target vCPU happens to be in guest mode even
>> if the host wasn't configured just so.
>>
>>> Another scenario I was thinking of was hrtimer expiry during vcpu exit
>>> being handled in KVM/userspace, which will cause timer interrupt
>>> injection after the next exit [1] delaying timer interrupt to guest.
>> No, the IRQ won't be delayed.  Expiration from the hrtimer callback will set
>> KVM_REQ_UNBLOCK, which will prevent actually entering the guest (see the call
>> to kvm_request_pending() in kvm_vcpu_exit_request()).
>>
>>> This scenario is not specific to TDX VMs though.
>>>
>>> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263

Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
Posted by Sean Christopherson 11 months, 1 week ago
On Tue, Jan 07, 2025, Binbin Wu wrote:
> On 1/7/2025 11:24 AM, Chao Gao wrote:
> > > Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
> > > being disabled to detect that it's worth trying to post a timer interrupt, but off
> > > the top of my head I'm pretty sure that's unnecessary and pointless.  The
> > Commit 1714a4eb6fb0 gives an explanation:
> > 
> >    if only some guests isolated and others not, they would not
> >    have any benefit from posted timer interrupts, and at the same time lose
> >    VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
> >    returns true and therefore forces kvm_can_use_hv_timer() to false.

But that's only relevant for setup.  Upon expiration, if the target vCPU is in
guest mode and APICv is active, then from_timer_fn must be true, which in turn
means that hrtimers are in use.

> Userspace uses KVM_CAP_X86_DISABLE_EXITS to enable mwait_in_guest or/and
> hlt_in_guest for non TDX guest. The code doesn't work for TDX guests.
> - Whether mwait in guest is allowed for TDX depends on the cpuid
>   configuration in TD_PARAMS, not the capability of physical cpu checked by
>   kvm_can_mwait_in_guest().
> - hlt for TDX is via TDVMCALL, i.e. hlt_in_guest should always be false
>   for TDX guests.
> 
> For TDX guests, not sure if it worths to fix kvm_{mwait,hlt}_in_guest()
> or add special handling (i.e., skip checking mwait/hlt in guest) because
> VMX preemption timer can't be used anyway, in order to allow housekeeping
> CPU to post timer interrupt for TDX vCPUs when nohz_full is configured
> after changing APICv active to true.

Yes, but I don't think we need any TDX-specific logic beyond noting that the
VMX preemption can't be used.  As above, consulting kvm_can_post_timer_interrupt()
in the expiration path is silly.

And after staring at all of this for a few hours, I think we should ditch the
entire lapic_timer.pending approach, because at this point it's doing more harm
than good.  Tracking pending timer IRQs was primarily done to support delivery
of *all* timer interrupts when multiple interrupts were coalesced in the host,
e.g. because a vCPU was scheduled out.  But that got removed by commit
f1ed0450a5fa ("KVM: x86: Remove support for reporting coalesced APIC IRQs"),
partly because posted interrupts threw a wrench in things, but also because
delivering multiple periodic interrupts in quick succession is problematic in
its own right.

With the interrupt coalescing logic gone, I can't think of any reason KVM needs
to kick the vCPU out to the main vcpu_run() loop just to deliver the interrupt,
i.e. pending interrupts before delivering them is unnecessary.  E.g. conditioning
direct deliverly on apicv_active in the !from_timer_fn case makes no sense, because
KVM is going to manually move the interrupt from the PIR to the IRR anyways.

I also don't like that the behavior for the posting path is subtly different than
!APICv.  E.g. if an hrtimer fires while KVM is handling a write to TMICT, KVM will
deliver the interrupt if configured to post timer, but not if APICv is disabled,
because the latter will increment "pending", and "pending" will be cleared before
handling the new TMICT.  Ditto for switch APIC timer modes.

Unfortunately, there is a lot to disentangle before KVM can directly deliver all
APIC timer interupts, as KVM has built up a fair bit of ugliness on top of "pending". 

E.g. when switching back to the VMX preemption timer (after a blocking vCPU is
scheduled back in), KVM leaves the hrtimer active.  start_hv_timer() accounts for
that by checking lapic_timer.pending, but leaving the hrtimer running in this case
is absurd and unnecessarily relies on "pending" being incremented.

	if (kvm_x86_call(set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
		return false;

	ktimer->hv_timer_in_use = true;
	hrtimer_cancel(&ktimer->timer);

	/*
	 * To simplify handling the periodic timer, leave the hv timer running
	 * even if the deadline timer has expired, i.e. rely on the resulting
	 * VM-Exit to recompute the periodic timer's target expiration.
	 */
	if (!apic_lvtt_period(apic)) {
		/*
		 * Cancel the hv timer if the sw timer fired while the hv timer
		 * was being programmed, or if the hv timer itself expired.
		 */
		if (atomic_read(&ktimer->pending)) {
			cancel_hv_timer(apic);
		} else if (expired) {
			apic_timer_expired(apic, false);
			cancel_hv_timer(apic);
		}
	}

I'm also not convinced that letting the hrtimer run on a different CPU in the
"normal" case is a good idea.  KVM explicitly migrates the timer, *except* for
the posting case, and so not pinning the hrtimer is likely counter-productive,
not to mention confusing.

  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
  {
	struct hrtimer *timer;

	if (!lapic_in_kernel(vcpu) ||
		kvm_can_post_timer_interrupt(vcpu))
		return;

	timer = &vcpu->arch.apic->lapic_timer.timer;
	if (hrtimer_cancel(timer))
		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_HARD);
  }

And if the hrtimer is pinned to the pCPU, then in practice it should be impossible
for KVM to post a timer IRQ into a vCPU that's in guest mode.

Seapking of which, posting a timer IRQ into a vCPU that's in guest mode is a bit
dodgy.  I _think_ it's safe, because all of the flows that can be coincident are
actually mutually exclusive.  E.g. apic_timer_expired()'s call to
__kvm_wait_lapic_expire() can't collide with the call from the entry path, as the
entry path's invocation happens with IRQs disabled.

But all of this makes me wary, so I'd much prefer to clean up, harden, and document
the existing code before we try to optimize timer IRQ delivery for more cases.
That's especially true for TDX as we're already adding significant compexlity
and multiple novel paths for TDX; we don't need more at this time :-)