[PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD

Posted by Adrian Hunter 1 year, 2 months ago
From: Isaku Yamahata <isaku.yamahata@intel.com>

On exit from the guest TD, the host XSAVE state is clobbered: the TDX Module
sets XCR0 and IA32_XSS based on the TD's XFAM, and zeroes PKRU.  Restore the
host XSAVE state on TD exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v1:
- Remove noinstr on tdx_vcpu_enter_exit() (Sean)
- Switch to kvm_host struct for xcr0 and xss

v19:
- Add EXPORT_SYMBOL_GPL(host_xcr0)

v15 -> v16:
- Added CET flag mask
---
 arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6e4ea2d420bc..00fdd2932205 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,8 @@
 #include <linux/cleanup.h>
 #include <linux/cpu.h>
 #include <linux/mmu_context.h>
+
+#include <asm/fpu/xcr.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -709,6 +711,24 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 }
 
 
+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (static_cpu_has(X86_FEATURE_XSAVE) &&
+	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
+	if (static_cpu_has(X86_FEATURE_XSAVES) &&
+	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
+	    kvm_host.xss != (kvm_tdx->xfam &
+			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
+			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
+		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
+	if (static_cpu_has(X86_FEATURE_PKU) &&
+	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+		write_pkru(vcpu->arch.host_pkru);
+}
+
 static void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -776,6 +796,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 
 	tdx_vcpu_enter_exit(vcpu);
 
+	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
-- 
2.43.0
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Chao Gao 1 year, 2 months ago
>+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>+{
>+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>+
>+	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>+	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>+		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>+	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>+	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>+	    kvm_host.xss != (kvm_tdx->xfam &
>+			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>+			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))

Should we drop CET/PT from this series? I think they are worth a new
patch/series.

>+		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>+	if (static_cpu_has(X86_FEATURE_PKU) &&

How about using cpu_feature_enabled()?  It is used in kvm_load_host_xsave_state()
and handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
enabled.

>+	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
>+		write_pkru(vcpu->arch.host_pkru);

If host_pkru happens to match the hardware value after TD-exits, the write can
be omitted, similar to what is done above for xss and xcr0.

>+}
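
Something like this, perhaps (untested sketch combining both suggestions:
cpu_feature_enabled() plus skipping the write when the current hardware value
already matches host_pkru):

	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU) &&
	    vcpu->arch.host_pkru != rdpkru())
		wrpkru(vcpu->arch.host_pkru);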
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year, 2 months ago
On 22/11/24 07:49, Chao Gao wrote:
>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +
>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>> +	    kvm_host.xss != (kvm_tdx->xfam &
>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> 
> Should we drop CET/PT from this series? I think they are worth a new
> patch/series.
> 
>> +		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>> +	if (static_cpu_has(X86_FEATURE_PKU) &&
> 
> How about using cpu_feature_enabled()?  It is used in kvm_load_host_xsave_state()
> and handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
> enabled.

Seems reasonable

> 
>> +	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
>> +		write_pkru(vcpu->arch.host_pkru);
> 
> If host_pkru happens to match the hardware value after TD-exits, the write can
> be omitted, similar to what is done above for xss and xcr0.
> 
>> +}
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year, 2 months ago
On 22/11/24 07:49, Chao Gao wrote:
>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +
>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>> +	    kvm_host.xss != (kvm_tdx->xfam &
>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> 
> Should we drop CET/PT from this series? I think they are worth a new
> patch/series.

This is not really about CET/PT

What is happening here is that we are calculating the current
MSR_IA32_XSS value based on the TDX Module spec which says the
TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
TDX Module does that literally, from TDX Module source code:

	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
and
	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);

For KVM, rather than:

			kvm_tdx->xfam &
			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)

it would be more direct to define the bits and enforce them
via tdx_get_supported_xfam() e.g.

/* 
 * Before returning from TDH.VP.ENTER, the TDX Module assigns:
 *   XCR0 to the TD's user-mode feature bits of XFAM (bits 7:0, 9)
 *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
 */
#define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
#define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
#define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)

static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
{
	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;

	/* Ensure features are in the masks */
	val &= TDX_XFAM_MASK;

	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
		return 0;

	val &= td_conf->xfam_fixed0;

	return val;
}

and then:

	if (static_cpu_has(X86_FEATURE_XSAVE) &&
	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
	if (static_cpu_has(X86_FEATURE_XSAVES) &&
	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
		wrmsrl(MSR_IA32_XSS, kvm_host.xss);

> 
>> +		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>> +	if (static_cpu_has(X86_FEATURE_PKU) &&
> 
> How about using cpu_feature_enabled()?  It is used in kvm_load_host_xsave_state()
> and handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
> enabled.
> 
>> +	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
>> +		write_pkru(vcpu->arch.host_pkru);
> 
> If host_pkru happens to match the hardware value after TD-exits, the write can
> be omitted, similar to what is done above for xss and xcr0.

True.  It might be better to make restoring PKRU a separate
patch so that the commit message can explain why it needs to
be done here.

Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Sean Christopherson 1 year, 1 month ago
On Mon, Nov 25, 2024, Adrian Hunter wrote:
> On 22/11/24 07:49, Chao Gao wrote:
> >> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >> +
> >> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
> >> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> >> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> >> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
> >> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
> >> +	    kvm_host.xss != (kvm_tdx->xfam &
> >> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
> >> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> > 
> > Should we drop CET/PT from this series? I think they are worth a new
> > patch/series.
> 
> This is not really about CET/PT
> 
> What is happening here is that we are calculating the current
> MSR_IA32_XSS value based on the TDX Module spec which says the
> TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
> TDX Module does that literally, from TDX Module source code:
> 
> 	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
> and
> 	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);
> 
> For KVM, rather than:
> 
> 			kvm_tdx->xfam &
> 			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
> 			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
> 
> it would be more direct to define the bits and enforce them
> via tdx_get_supported_xfam() e.g.
> 
> /* 
>  * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>  *   XCR0 to the TD's user-mode feature bits of XFAM (bits 7:0, 9)
>  *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
>  */
> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
> 
> static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
> {
> 	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
> 
> 	/* Ensure features are in the masks */
> 	val &= TDX_XFAM_MASK;
> 
> 	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
> 		return 0;
> 
> 	val &= td_conf->xfam_fixed0;
> 
> 	return val;
> }
> 
> and then:
> 
> 	if (static_cpu_has(X86_FEATURE_XSAVE) &&
> 	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
> 		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> 	if (static_cpu_has(X86_FEATURE_XSAVES) &&
> 	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
> 		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
> 
> > 
> >> +		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
> >> +	if (static_cpu_has(X86_FEATURE_PKU) &&
> > 
> > How about using cpu_feature_enabled()?  It is used in kvm_load_host_xsave_state()
> > and handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
> > enabled.

I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
returning from VP.ENTER.  I don't see any justification for maintaining a special
flow for TDX, it's just more work.  E.g. TDX is missing the optimization to elide
WRPKRU if the current value is the same as the host value.

KVM already disallows emulating a WRMSR to XSS via the tdx_has_emulated_msr()
check, and AFAICT there's no path for the guest to set KVM's view of XCR0, CR0,
or CR4, so I'm pretty sure stuffing state at vCPU creation is all that's needed.

That said, out of paranoia, KVM should disallow the guest from changing XSS if
guest state is protected, i.e. in common code, as XSS is a mandatory passthrough
for SEV-ES+, i.e. XSS is fully guest-owned for both TDX and ES+.

Ditto for CR0 and CR4 (TDX only; SEV-ES+ lets the host see the guest values).
The current TDX code lets KVM read CR0 and CR4, but KVM will always see the RESET
values, which are completely wrong for TDX.  KVM can obviously "work" without a
sane view of guest CR0/CR4, but I think having a sane view will yield code that
is easier to maintain and understand, because almost all special casing will be
in TDX's initialization flow, not spread out wherever KVM needs to know that what
KVM sees in guest state is a lie.

The guest_state_protected check in kvm_load_host_xsave_state() needs to be moved
to svm_vcpu_run(), but IMO that's where the checks belong anyways, because not
restoring host state for protected guests is obviously specific to SEV-ES+ guests,
not to all protected guests.

Side topic, tdx_cache_reg() is ridiculous.  Just mark the "cached" registers as
available on exit.  Invoking a callback just to do nothing is a complete waste.
I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or PDPTRs is
at all reasonable.  CR3 and PDPTRs should be unreachable, and I gotta imagine the
same holds true for RSP.  Allowing reads/writes to RIP is fine, in that it probably
simplifies the overall code.

Something like this (probably won't apply, I have other local hacks as the result
of suggestions).

---
 arch/x86/kvm/svm/svm.c     |  7 ++++--
 arch/x86/kvm/vmx/main.c    |  4 +--
 arch/x86/kvm/vmx/tdx.c     | 50 ++++++++++----------------------------
 arch/x86/kvm/vmx/x86_ops.h |  4 ---
 arch/x86/kvm/x86.c         | 15 +++++++-----
 5 files changed, 28 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dd15cc635655..63df43e5dcce 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4251,7 +4251,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 		svm_set_dr6(svm, DR6_ACTIVE_LOW);
 
 	clgi();
-	kvm_load_guest_xsave_state(vcpu);
+
+	if (!vcpu->arch.guest_state_protected)
+		kvm_load_guest_xsave_state(vcpu);
 
 	kvm_wait_lapic_expire(vcpu);
 
@@ -4280,7 +4282,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
 		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
 
-	kvm_load_host_xsave_state(vcpu);
+	if (!vcpu->arch.guest_state_protected)
+		kvm_load_host_xsave_state(vcpu);
 	stgi();
 
 	/* Any pending NMI will happen here */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2742f2af7f55..d2e78e6675b9 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -520,10 +520,8 @@ static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 
 static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 {
-	if (is_td_vcpu(vcpu)) {
-		tdx_cache_reg(vcpu, reg);
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
 		return;
-	}
 
 	vmx_cache_reg(vcpu, reg);
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7eff717c9d0d..b49dcf32206b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.cr0_guest_owned_bits = -1ul;
 	vcpu->arch.cr4_guest_owned_bits = -1ul;
 
+	vcpu->arch.cr4 = <maximal value>;
+	vcpu->arch.cr0 = <maximal value, give or take>;
+
 	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
 	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
 	/*
@@ -659,6 +662,14 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
 
+	/*
+	 * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
+	 * maximal values supported by the guest, so from KVM's perspective,
+	 * those are the guest's values at all times.
+	 */
+	vcpu->arch.ia32_xss = (kvm_tdx->xfam & XFEATURE_SUPERVISOR_MASK);
+	vcpu->arch.xcr0 = (kvm_tdx->xfam & XFEATURE_USE_MASK);
+
 	return 0;
 }
 
@@ -824,24 +835,6 @@ static void tdx_user_return_msr_update_cache(struct kvm_vcpu *vcpu)
 		kvm_user_return_msr_update_cache(tdx_uret_tsx_ctrl_slot, 0);
 }
 
-static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
-{
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
-
-	if (static_cpu_has(X86_FEATURE_XSAVE) &&
-	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
-		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
-	if (static_cpu_has(X86_FEATURE_XSAVES) &&
-	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
-	    kvm_host.xss != (kvm_tdx->xfam &
-			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
-			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
-		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
-	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
-		write_pkru(vcpu->arch.host_pkru);
-}
-
 static union vmx_exit_reason tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -941,10 +934,10 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	tdx_vcpu_enter_exit(vcpu);
 
 	tdx_user_return_msr_update_cache(vcpu);
-	tdx_restore_host_xsave_state(vcpu);
+	kvm_load_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
-	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
 
 	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
 		return EXIT_FASTPATH_NONE;
@@ -1963,23 +1956,6 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	}
 }
 
-void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
-{
-	kvm_register_mark_available(vcpu, reg);
-	switch (reg) {
-	case VCPU_REGS_RSP:
-	case VCPU_REGS_RIP:
-	case VCPU_EXREG_PDPTR:
-	case VCPU_EXREG_CR0:
-	case VCPU_EXREG_CR3:
-	case VCPU_EXREG_CR4:
-		break;
-	default:
-		KVM_BUG_ON(1, vcpu->kvm);
-		break;
-	}
-}
-
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ef60eb7b1245..efa6723837c6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -145,8 +145,6 @@ bool tdx_has_emulated_msr(u32 index);
 int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
-void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
-
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
 int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
@@ -193,8 +191,6 @@ static inline bool tdx_has_emulated_msr(u32 index) { return false; }
 static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 
-static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
-
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
 static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f94b1e24eae..d380837433c6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1184,11 +1184,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
 
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->arch.guest_state_protected)
-		return;
+	WARN_ON_ONCE(vcpu->arch.guest_state_protected);
 
 	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
-
 		if (vcpu->arch.xcr0 != kvm_host.xcr0)
 			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 
@@ -1207,9 +1205,6 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
 
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->arch.guest_state_protected)
-		return;
-
 	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
 	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
 	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
@@ -3943,6 +3938,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
 			return 1;
+
+		if (vcpu->arch.guest_state_protected)
+			return 1;
+
 		/*
 		 * KVM supports exposing PT to the guest, but does not support
 		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
@@ -4402,6 +4401,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
 			return 1;
+
+		if (vcpu->arch.guest_state_protected)
+			return 1;
+
 		msr_info->data = vcpu->arch.ia32_xss;
 		break;
 	case MSR_K7_CLK_CTL:

base-commit: 0319082fc23089f516618e193d94da18c837e35a
-- 
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year, 1 month ago
On 17/12/24 18:09, Sean Christopherson wrote:
> On Mon, Nov 25, 2024, Adrian Hunter wrote:
> I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
> to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
> returning from VP.ENTER.  I don't see any justification for maintaining a special
> flow for TDX, it's just more work.  E.g. TDX is missing the optimization to elide
> WRPKRU if the current value is the same as the host value.

Not entirely missing since write_pkru() does do that by itself:

static inline void write_pkru(u32 pkru)
{
	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
		return;
	/*
	 * WRPKRU is relatively expensive compared to RDPKRU.
	 * Avoid WRPKRU when it would not change the value.
	 */
	if (pkru != rdpkru())
		wrpkru(pkru);
}

For TDX, we don't really need rdpkru() since the TDX Module
clears it, so it could be:

	if (pkru)
		wrpkru(pkru);

> 
> KVM already disallows emulating a WRMSR to XSS via the tdx_has_emulated_msr()
> check, and AFAICT there's no path for the guest to set KVM's view of XCR0, CR0,
> or CR4, so I'm pretty sure stuffing state at vCPU creation is all that's needed.
> 
> That said, out of paranoia, KVM should disallow the guest from changing XSS if
> guest state is protected, i.e. in common code, as XSS is a mandatory passthrough
> for SEV-ES+, i.e. XSS is fully guest-owned for both TDX and ES+.
> 
> Ditto for CR0 and CR4 (TDX only; SEV-ES+ lets the host see the guest values).
> The current TDX code lets KVM read CR0 and CR4, but KVM will always see the RESET
> values, which are completely wrong for TDX.  KVM can obviously "work" without a
> sane view of guest CR0/CR4, but I think having a sane view will yield code that
> is easier to maintain and understand, because almost all special casing will be
> in TDX's initialization flow, not spread out wherever KVM needs to know that what
> KVM sees in guest state is a lie.
> 
> The guest_state_protected check in kvm_load_host_xsave_state() needs to be moved
> to svm_vcpu_run(), but IMO that's where the checks belong anyways, because not
> restoring host state for protected guests is obviously specific to SEV-ES+ guests,
> not to all protected guests.
> 
> Side topic, tdx_cache_reg() is ridiculous.  Just mark the "cached" registers as
> available on exit.  Invoking a callback just to do nothing is a complete waste.
> I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or PDPTRs is
> at all reasonable.  CR3 and PDPTRs should be unreachable, and I gotta imagine the
> same holds true for RSP.  Allowing reads/writes to RIP is fine, in that it probably
> simplifies the overall code.
> 
> Something like this (probably won't apply, I have other local hacks as the result
> of suggestions).
> 
> ---
>  arch/x86/kvm/svm/svm.c     |  7 ++++--
>  arch/x86/kvm/vmx/main.c    |  4 +--
>  arch/x86/kvm/vmx/tdx.c     | 50 ++++++++++----------------------------
>  arch/x86/kvm/vmx/x86_ops.h |  4 ---
>  arch/x86/kvm/x86.c         | 15 +++++++-----
>  5 files changed, 28 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index dd15cc635655..63df43e5dcce 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4251,7 +4251,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  		svm_set_dr6(svm, DR6_ACTIVE_LOW);
>  
>  	clgi();
> -	kvm_load_guest_xsave_state(vcpu);
> +
> +	if (!vcpu->arch.guest_state_protected)
> +		kvm_load_guest_xsave_state(vcpu);
>  
>  	kvm_wait_lapic_expire(vcpu);
>  
> @@ -4280,7 +4282,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>  		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
>  
> -	kvm_load_host_xsave_state(vcpu);
> +	if (!vcpu->arch.guest_state_protected)
> +		kvm_load_host_xsave_state(vcpu);
>  	stgi();
>  
>  	/* Any pending NMI will happen here */
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 2742f2af7f55..d2e78e6675b9 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -520,10 +520,8 @@ static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
>  
>  static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>  {
> -	if (is_td_vcpu(vcpu)) {
> -		tdx_cache_reg(vcpu, reg);
> +	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
>  		return;
> -	}
>  
>  	vmx_cache_reg(vcpu, reg);
>  }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 7eff717c9d0d..b49dcf32206b 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>  
> +	vcpu->arch.cr4 = <maximal value>;

Sorry for slow reply.  Seems fine except maybe CR4 usage.

TDX Module validates CR4 based on XFAM and scrubs host state
based on XFAM.  It seems like we would need to use XFAM to
manufacture a CR4 that we then effectively use as a proxy
instead of just checking XFAM.

Since only some vcpu->arch.cr4 bits will be meaningful, it also
still leaves the possibility for confusion.

Are you sure you want this?

> +	vcpu->arch.cr0 = <maximal value, give or take>;

AFAICT we don't need to care about CR0

> +
>  	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
>  	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
>  	/*
> @@ -659,6 +662,14 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  
>  	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
>  
> +	/*
> +	 * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
> +	 * maximal values supported by the guest, so from KVM's perspective,
> +	 * those are the guest's values at all times.
> +	 */
> +	vcpu->arch.ia32_xss = (kvm_tdx->xfam & XFEATURE_SUPERVISOR_MASK);
> +	vcpu->arch.xcr0 = (kvm_tdx->xfam & XFEATURE_USE_MASK);
> +
>  	return 0;
>  }
>  
> @@ -824,24 +835,6 @@ static void tdx_user_return_msr_update_cache(struct kvm_vcpu *vcpu)
>  		kvm_user_return_msr_update_cache(tdx_uret_tsx_ctrl_slot, 0);
>  }
>  
> -static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> -{
> -	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> -
> -	if (static_cpu_has(X86_FEATURE_XSAVE) &&
> -	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> -		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> -	if (static_cpu_has(X86_FEATURE_XSAVES) &&
> -	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
> -	    kvm_host.xss != (kvm_tdx->xfam &
> -			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
> -			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> -		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
> -	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> -		write_pkru(vcpu->arch.host_pkru);
> -}
> -
>  static union vmx_exit_reason tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -941,10 +934,10 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  	tdx_vcpu_enter_exit(vcpu);
>  
>  	tdx_user_return_msr_update_cache(vcpu);
> -	tdx_restore_host_xsave_state(vcpu);
> +	kvm_load_host_xsave_state(vcpu);
>  	tdx->host_state_need_restore = true;
>  
> -	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> +	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
>  
>  	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
>  		return EXIT_FASTPATH_NONE;
> @@ -1963,23 +1956,6 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
>  	}
>  }
>  
> -void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> -{
> -	kvm_register_mark_available(vcpu, reg);
> -	switch (reg) {
> -	case VCPU_REGS_RSP:
> -	case VCPU_REGS_RIP:
> -	case VCPU_EXREG_PDPTR:
> -	case VCPU_EXREG_CR0:
> -	case VCPU_EXREG_CR3:
> -	case VCPU_EXREG_CR4:
> -		break;
> -	default:
> -		KVM_BUG_ON(1, vcpu->kvm);
> -		break;
> -	}
> -}
> -
>  static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
>  {
>  	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index ef60eb7b1245..efa6723837c6 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -145,8 +145,6 @@ bool tdx_has_emulated_msr(u32 index);
>  int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>  int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>  
> -void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> -
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>  
>  int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> @@ -193,8 +191,6 @@ static inline bool tdx_has_emulated_msr(u32 index) { return false; }
>  static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
>  static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
>  
> -static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
> -
>  static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>  
>  static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4f94b1e24eae..d380837433c6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1184,11 +1184,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
>  
>  void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu->arch.guest_state_protected)
> -		return;
> +	WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>  
>  	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> -
>  		if (vcpu->arch.xcr0 != kvm_host.xcr0)
>  			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>  
> @@ -1207,9 +1205,6 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>  
>  void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu->arch.guest_state_protected)
> -		return;
> -
>  	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
>  	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
> @@ -3943,6 +3938,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>  			return 1;
> +
> +		if (vcpu->arch.guest_state_protected)
> +			return 1;
> +
>  		/*
>  		 * KVM supports exposing PT to the guest, but does not support
>  		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
> @@ -4402,6 +4401,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>  			return 1;
> +
> +		if (vcpu->arch.guest_state_protected)
> +			return 1;
> +
>  		msr_info->data = vcpu->arch.ia32_xss;
>  		break;
>  	case MSR_K7_CLK_CTL:
> 
> base-commit: 0319082fc23089f516618e193d94da18c837e35a
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Sean Christopherson 1 year, 1 month ago
On Fri, Dec 20, 2024, Adrian Hunter wrote:
> On 17/12/24 18:09, Sean Christopherson wrote:
> > On Mon, Nov 25, 2024, Adrian Hunter wrote:
> > I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
> > to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
> > returning from VP.ENTER.  I don't see any justification for maintaining a special
> > flow for TDX, it's just more work.  E.g. TDX is missing the optimization to elide
> > WRPKRU if the current value is the same as the host value.
> 
> Not entirely missing since write_pkru() does do that by itself:
> 
> static inline void write_pkru(u32 pkru)
> {
> 	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
> 		return;
> 	/*
> 	 * WRPKRU is relatively expensive compared to RDPKRU.
> 	 * Avoid WRPKRU when it would not change the value.
> 	 */
> 	if (pkru != rdpkru())
> 		wrpkru(pkru);

Argh.  Well that's a bug.  KVM did the right thing, and then the core kernel
swizzled things around and got in the way.  I'll post this once it's fully tested:

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 20 Dec 2024 07:38:39 -0800
Subject: [PATCH] KVM: x86: Avoid double RDPKRU when loading host/guest PKRU

Use the raw wrpkru() helper when loading the guest/host's PKRU on switch
to/from guest context, as the write_pkru() wrapper incurs an unnecessary
rdpkru().  In both paths, KVM is guaranteed to have performed RDPKRU since
the last possible write, i.e. KVM has a fresh cache of the current value
in hardware.

This effectively restores KVM behavior to that of KVM prior to commit
c806e88734b9 ("x86/pkeys: Provide *pkru() helpers"), which renamed the raw
helper from __write_pkru() => wrpkru(), and turned __write_pkru() into a
wrapper.  Commit 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different
from current") then added the extra RDPKRU to avoid an unnecessary WRPKRU,
but completely missed that KVM already optimized away pointless writes.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different from current")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4320647bd78a..9d5cece9260b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1186,7 +1186,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
 	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
 	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
-		write_pkru(vcpu->arch.pkru);
+		wrpkru(vcpu->arch.pkru);
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
 
@@ -1200,7 +1200,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
-			write_pkru(vcpu->arch.host_pkru);
+			wrpkru(vcpu->arch.host_pkru);
 	}
 
 	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {

base-commit: 13e98294d7cec978e31138d16824f50556a62d17
-- 

> }
> 
> For TDX, we don't really need rdpkru() since the TDX Module
> clears it, so it could be:
> 
> 	if (pkru)
> 		wrpkru(pkru);

Ah, right, there's no need to do RDPKRU because KVM knows it's zero.  On top of
my previous suggestion:

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e23cd8231144..9e490fccf073 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -664,11 +664,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
        /*
         * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
-        * maximal values supported by the guest, so from KVM's perspective,
-        * those are the guest's values at all times.
+        * maximal values supported by the guest, and zeroes PKRU, so from
+        * KVM's perspective, those are the guest's values at all times.
         */
        vcpu->arch.ia32_xss = (kvm_tdx->xfam & XFEATURE_SUPERVISOR_MASK);
        vcpu->arch.xcr0 = (kvm_tdx->xfam & XFEATURE_USE_MASK);
+       vcpu->arch.pkru = 0;
 
        return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d380837433c6..d2ea7db896ba 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1208,7 +1208,8 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
        if (cpu_feature_enabled(X86_FEATURE_PKU) &&
            ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
             kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
-               vcpu->arch.pkru = rdpkru();
+               if (!vcpu->arch.guest_state_protected)
+                       vcpu->arch.pkru = rdpkru();
                if (vcpu->arch.pkru != vcpu->arch.host_pkru)
                        write_pkru(vcpu->arch.host_pkru);
        }

> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 7eff717c9d0d..b49dcf32206b 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> >  	vcpu->arch.cr0_guest_owned_bits = -1ul;
> >  	vcpu->arch.cr4_guest_owned_bits = -1ul;
> >  
> > +	vcpu->arch.cr4 = <maximal value>;
> 
> Sorry for slow reply.  Seems fine except maybe CR4 usage.
> 
> TDX Module validates CR4 based on XFAM and scrubs host state
> based on XFAM.  It seems like we would need to use XFAM to
> manufacture a CR4 that we then effectively use as a proxy
> instead of just checking XFAM.

Yep.

> Since only some vcpu->arch.cr4 bits will be meaningful, it also
> still leaves the possibility for confusion.

IMO, it's less confusing having '0' for CR0 and CR4, while having accurate values
for other state.  And I'm far more worried about KVM wandering into a bad path
because CR0 and/or CR4 are completely wrong.  E.g. kvm_mmu.cpu_role would be
completely wrong at steady state, the CR4-based runtime CPUID updates would do
the wrong thing, and any helper that wraps kvm_is_cr{0,4}_bit_set() would likely
do the worst possible thing.

> Are you sure you want this?

Yeah, pretty sure.  It would be nice if the TDX Module exposed guest CR0/CR4 to
KVM, a la the traps SEV-ES+ uses, but I think the next best thing is to assume
the guest is using all features.

> > +	vcpu->arch.cr0 = <maximal value, give or take>;
> 
> AFAICT we don't need to care about CR0

Probably not, but having e.g. CR4.PAE/LA57=1 with CR0.PG/PE=0 will be quite
weird.
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year, 1 month ago
On 20/12/24 18:22, Sean Christopherson wrote:
> On Fri, Dec 20, 2024, Adrian Hunter wrote:
>> On 17/12/24 18:09, Sean Christopherson wrote:
>>> On Mon, Nov 25, 2024, Adrian Hunter wrote:
>>> I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
>>> to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
>>> returning from VP.ENTER.  I don't see any justification for maintaining a special
>>> flow for TDX, it's just more work.
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 7eff717c9d0d..b49dcf32206b 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>>>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>>>  
>>> +	vcpu->arch.cr4 = <maximal value>;
>> Sorry for slow reply.  Seems fine except maybe CR4 usage.
>>
>> TDX Module validates CR4 based on XFAM and scrubs host state
>> based on XFAM.  It seems like we would need to use XFAM to
>> manufacture a CR4 that we then effectively use as a proxy
>> instead of just checking XFAM.
> Yep.
> 
>> Since only some vcpu->arch.cr4 bits will be meaningful, it also
>> still leaves the possibility for confusion.
> IMO, it's less confusing having '0' for CR0 and CR4, while having accurate values
> for other state.  And I'm far more worried about KVM wandering into a bad path
> because CR0 and/or CR4 are completely wrong.  E.g. kvm_mmu.cpu_role would be
> completely wrong at steady state, the CR4-based runtime CPUID updates would do
> the wrong thing, and any helper that wraps kvm_is_cr{0,4}_bit_set() would likely
> do the worst possible thing.
> 
>> Are you sure you want this?
> Yeah, pretty sure.  It would be nice if the TDX Module exposed guest CR0/CR4 to
> KVM, a la the traps SEV-ES+ uses, but I think the next best thing is to assume
> the guest is using all features.
> 
>>> +	vcpu->arch.cr0 = <maximal value, give or take>;
>> AFAICT we don't need to care about CR0
> Probably not, but having e.g. CR4.PAE/LA57=1 with CR0.PG/PE=0 will be quite
> weird.

Below is what I have so far.  It seems to work.  Note:
 - use of MSR_IA32_VMX_CR0_FIXED1 and MSR_IA32_VMX_CR4_FIXED1
 to provide base value for CR0 and CR4
 - tdx_reinforce_guest_state() to make sure host state doesn't
 get broken because the values go wrong
 - __kvm_set_xcr() to handle guest_state_protected case
 - kvm_vcpu_reset() to handle guest_state_protected case

Please let me know your feedback.

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 10aebae5af18..2a5f756b05e2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -351,8 +351,10 @@ static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 
 static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
-	if (is_td_vcpu(vcpu))
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_after_set_cpuid(vcpu);
 		return;
+	}
 
 	vmx_vcpu_after_set_cpuid(vcpu);
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 191ee209caa0..0ae427340494 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -710,6 +710,57 @@ int tdx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+/* Set a maximal guest CR0 value */
+static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
+{
+	u64 cr0;
+
+	rdmsrl(MSR_IA32_VMX_CR0_FIXED1, cr0);
+
+	if (cr4 & X86_CR4_CET)
+		cr0 |= X86_CR0_WP;
+
+	cr0 |= X86_CR0_PE | X86_CR0_NE;
+	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
+
+	return cr0;
+}
+
+/*
+ * Set a maximal guest CR4 value. Clear bits forbidden by XFAM or
+ * TD Attributes.
+ */
+static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	u64 cr4;
+
+	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
+
+	if (!(kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+		cr4 &= ~X86_CR4_PKE;
+
+	if (!(kvm_tdx->xfam & XFEATURE_MASK_CET_USER) || !(kvm_tdx->xfam & BIT_ULL(12)))
+		cr4 &= ~X86_CR4_CET;
+
+	/* User Interrupts */
+	if (!(kvm_tdx->xfam & BIT_ULL(14)))
+		cr4 &= ~BIT_ULL(25);
+
+	if (!(kvm_tdx->attributes & TDX_TD_ATTR_LASS))
+		cr4 &= ~BIT_ULL(27);
+
+	if (!(kvm_tdx->attributes & TDX_TD_ATTR_PKS))
+		cr4 &= ~BIT_ULL(24);
+
+	if (!(kvm_tdx->attributes & TDX_TD_ATTR_KL))
+		cr4 &= ~BIT_ULL(19);
+
+	cr4 &= ~X86_CR4_SMXE;
+
+	return cr4;
+}
+
 int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -732,8 +783,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.cr0_guest_owned_bits = -1ul;
 	vcpu->arch.cr4_guest_owned_bits = -1ul;
 
-	vcpu->arch.cr4 = <maximal value>;
-	vcpu->arch.cr0 = <maximal value, give or take>;
+	vcpu->arch.cr4 = tdx_guest_cr4(vcpu);
+	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
 
 	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
 	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
@@ -767,6 +818,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+	if (cpu_feature_enabled(X86_FEATURE_XSAVES))
+		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
+}
+
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -933,6 +990,24 @@ static void tdx_user_return_msr_update_cache(void)
 						 tdx_uret_msrs[i].defval);
 }
 
+static void tdx_reinforce_guest_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (WARN_ON_ONCE(vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK)))
+		vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
+	if (WARN_ON_ONCE(vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK)))
+		vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
+	if (WARN_ON_ONCE(vcpu->arch.pkru))
+		vcpu->arch.pkru = 0;
+	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVE) &&
+			 !kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)))
+		vcpu->arch.cr4 |= X86_CR4_OSXSAVE;
+	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVES) &&
+			 !guest_can_use(vcpu, X86_FEATURE_XSAVES)))
+		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
+}
+
 static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1028,9 +1103,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 		update_debugctlmsr(tdx->host_debugctlmsr);
 
 	tdx_user_return_msr_update_cache();
+
+	tdx_reinforce_guest_state(vcpu);
 	kvm_load_host_xsave_state(vcpu);
 
-	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
+	vcpu->arch.regs_avail = ~0;
 
 	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
 		return EXIT_FASTPATH_NONE;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 861c0f649b69..2e0e300a1f5e 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -110,6 +110,7 @@ struct tdx_cpuid_value {
 } __packed;
 
 #define TDX_TD_ATTR_DEBUG		BIT_ULL(0)
+#define TDX_TD_ATTR_LASS		BIT_ULL(27)
 #define TDX_TD_ATTR_SEPT_VE_DISABLE	BIT_ULL(28)
 #define TDX_TD_ATTR_PKS			BIT_ULL(30)
 #define TDX_TD_ATTR_KL			BIT_ULL(31)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7fb1bbf12b39..7f03a6a24abc 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -126,6 +126,7 @@ void tdx_vm_free(struct kvm *kvm);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
@@ -170,6 +171,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2ea7db896ba..f2b1980f830d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 	u64 old_xcr0 = vcpu->arch.xcr0;
 	u64 valid_bits;
 
+	if (vcpu->arch.guest_state_protected) {
+		kvm_update_cpuid_runtime(vcpu);
+		return 0;
+	}
+
 	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
 	if (index != XCR_XFEATURE_ENABLED_MASK)
 		return 1;
@@ -12388,7 +12393,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 * into hardware, to be zeroed at vCPU creation.  Use CRs as a sentinel
 	 * to detect improper or missing initialization.
 	 */
-	WARN_ON_ONCE(!init_event &&
+	WARN_ON_ONCE(!init_event && !vcpu->arch.guest_state_protected &&
 		     (old_cr0 || kvm_read_cr3(vcpu) || kvm_read_cr4(vcpu)));
 
 	/*
diff --git a/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c b/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c
index 075419ef3ac7..2cc9bf40a788 100644
--- a/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c
+++ b/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c
@@ -1054,7 +1054,7 @@ void verify_td_cpuid_tdcall(void)
 	TDX_TEST_ASSERT_SUCCESS(vcpu);
 
 	/* Get KVM CPUIDs for reference */
-	tmp = get_cpuid_entry(kvm_get_supported_cpuid(), 1, 0);
+	tmp = get_cpuid_entry(vcpu->cpuid, 1, 0);
 	TEST_ASSERT(tmp, "CPUID entry missing\n");
 
 	cpuid_entry = *tmp;
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Sean Christopherson 1 year ago
On Fri, Jan 03, 2025, Adrian Hunter wrote:
> On 20/12/24 18:22, Sean Christopherson wrote:
> +/* Set a maximal guest CR0 value */
> +static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
> +{
> +	u64 cr0;
> +
> +	rdmsrl(MSR_IA32_VMX_CR0_FIXED1, cr0);
> +
> +	if (cr4 & X86_CR4_CET)
> +		cr0 |= X86_CR0_WP;
> +
> +	cr0 |= X86_CR0_PE | X86_CR0_NE;
> +	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
> +
> +	return cr0;
> +}
> +
> +/*
> + * Set a maximal guest CR4 value. Clear bits forbidden by XFAM or
> + * TD Attributes.
> + */
> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	u64 cr4;
> +
> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);

This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
KVM thinks are illegal, which will cause its own problems.

For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
the CPU's.  That will mean there will likely be missing bits in vcpu->arch.cr{0,4},
but if KVM doesn't know about a bit, the fact that it's missing should be a complete
non-issue.

That also avoids weirdness for things like user-mode interrupts, LASS, PKS, etc.,
where KVM is open coding the bits.  The downside is that we'll need to remember
to update TDX when enabling those features to account for kvm_tdx->attributes,
but that's not unreasonable.

> +
> +	if (!(kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> +		cr4 &= ~X86_CR4_PKE;
> +
> +	if (!(kvm_tdx->xfam & XFEATURE_MASK_CET_USER) || !(kvm_tdx->xfam & BIT_ULL(12)))
> +		cr4 &= ~X86_CR4_CET;
> +
> +	/* User Interrupts */
> +	if (!(kvm_tdx->xfam & BIT_ULL(14)))
> +		cr4 &= ~BIT_ULL(25);
> +
> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_LASS))
> +		cr4 &= ~BIT_ULL(27);
> +
> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_PKS))
> +		cr4 &= ~BIT_ULL(24);
> +
> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_KL))
> +		cr4 &= ~BIT_ULL(19);
> +
> +	cr4 &= ~X86_CR4_SMXE;
> +
> +	return cr4;
> +}
> +
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -732,8 +783,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>  
> -	vcpu->arch.cr4 = <maximal value>;
> -	vcpu->arch.cr0 = <maximal value, give or take>;
> +	vcpu->arch.cr4 = tdx_guest_cr4(vcpu);
> +	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
>  
>  	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
>  	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> @@ -767,6 +818,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_XSAVES))

This should use kvm_cpu_cap_has(), because strictly speaking it's KVM support
that matters.  In practice, I don't think it matters for XSAVES, but it can
matter for other features (though probably not for TDX guests).

> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
> +}
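
That is, something like (sketch):

	void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
	{
		if (kvm_cpu_cap_has(X86_FEATURE_XSAVES))
			kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
	}
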
> +
>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -933,6 +990,24 @@ static void tdx_user_return_msr_update_cache(void)
>  						 tdx_uret_msrs[i].defval);
>  }
>  
> +static void tdx_reinforce_guest_state(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +
> +	if (WARN_ON_ONCE(vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK)))
> +		vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
> +	if (WARN_ON_ONCE(vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK)))
> +		vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
> +	if (WARN_ON_ONCE(vcpu->arch.pkru))
> +		vcpu->arch.pkru = 0;
> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVE) &&
> +			 !kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)))
> +		vcpu->arch.cr4 |= X86_CR4_OSXSAVE;
> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVES) &&
> +			 !guest_can_use(vcpu, X86_FEATURE_XSAVES)))
> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
> +}
> +
>  static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -1028,9 +1103,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  		update_debugctlmsr(tdx->host_debugctlmsr);
>  
>  	tdx_user_return_msr_update_cache();
> +
> +	tdx_reinforce_guest_state(vcpu);

Hmm, I don't think fixing up guest state is a good idea.  It probably works?
But continuing on when we know there's a KVM bug *and* a chance for host data
corruption seems unnecessarily risky.

My vote would be to KVM_BUG_ON() before entering the guest.  I think I'd also be
ok omitting the checks, as it's not like the potential for KVM bugs that clobber
KVM's view of state is unique to TDX (though I do agree that the behavior of the
TDX module in this case does make them more likely).

>  	kvm_load_host_xsave_state(vcpu);
>  
> -	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
> +	vcpu->arch.regs_avail = ~0;
>  
>  	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
>  		return EXIT_FASTPATH_NONE;
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> index 861c0f649b69..2e0e300a1f5e 100644
> --- a/arch/x86/kvm/vmx/tdx_arch.h
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -110,6 +110,7 @@ struct tdx_cpuid_value {
>  } __packed;
>  
>  #define TDX_TD_ATTR_DEBUG		BIT_ULL(0)
> +#define TDX_TD_ATTR_LASS		BIT_ULL(27)
>  #define TDX_TD_ATTR_SEPT_VE_DISABLE	BIT_ULL(28)
>  #define TDX_TD_ATTR_PKS			BIT_ULL(30)
>  #define TDX_TD_ATTR_KL			BIT_ULL(31)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 7fb1bbf12b39..7f03a6a24abc 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -126,6 +126,7 @@ void tdx_vm_free(struct kvm *kvm);
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>  
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>  int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
> @@ -170,6 +171,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>  
>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> +static inline void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d2ea7db896ba..f2b1980f830d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>  	u64 old_xcr0 = vcpu->arch.xcr0;
>  	u64 valid_bits;
>  
> +	if (vcpu->arch.guest_state_protected) {

This should be a WARN_ON_ONCE() + return 1, no?
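
i.e. something like (sketch):

	if (WARN_ON_ONCE(vcpu->arch.guest_state_protected))
		return 1;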

> +		kvm_update_cpuid_runtime(vcpu);
> +		return 0;
> +	}
> +
>  	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
>  	if (index != XCR_XFEATURE_ENABLED_MASK)
>  		return 1;
> @@ -12388,7 +12393,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	 * into hardware, to be zeroed at vCPU creation.  Use CRs as a sentinel
>  	 * to detect improper or missing initialization.
>  	 */
> -	WARN_ON_ONCE(!init_event &&
> +	WARN_ON_ONCE(!init_event && !vcpu->arch.guest_state_protected &&
>  		     (old_cr0 || kvm_read_cr3(vcpu) || kvm_read_cr4(vcpu)));

Maybe stuff state in tdx_vcpu_init() to avoid this waiver?  KVM is already
deferring APIC base and RCX initialization to that point, waiting to stuff all
TDX-specific vCPU state seems natural.
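
For illustration, a rough sketch of that (hypothetical placement; assumes
tdx_vcpu_init() is the KVM_TDX_INIT_VCPU handler and reuses the helpers from
the earlier example):

	static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
	{
		/* ... existing TDH.VP.INIT handling ... */

		/* Stuff TDX-specific vCPU state after INIT, not at creation. */
		vcpu->arch.cr4 = tdx_guest_cr4(vcpu);
		vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);

		return 0;
	}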
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year ago
On 9/01/25 21:11, Sean Christopherson wrote:
> My vote would be to KVM_BUG_ON() before entering the guest.

I notice that if KVM_BUG_ON() is called with interrupts disabled,
smp_call_function_many_cond() generates a warning:

WARNING: CPU: 223 PID: 4213 at kernel/smp.c:807 smp_call_function_many_cond+0x421/0x560

static void smp_call_function_many_cond(const struct cpumask *mask,
					smp_call_func_t func, void *info,
					unsigned int scf_flags,
					smp_cond_func_t cond_func)
{
	int cpu, last_cpu, this_cpu = smp_processor_id();
	struct call_function_data *cfd;
	bool wait = scf_flags & SCF_WAIT;
	int nr_cpus = 0;
	bool run_remote = false;
	bool run_local = false;

	lockdep_assert_preemption_disabled();

	/*
	 * Can deadlock when called with interrupts disabled.
	 * We allow cpu's that are not yet online though, as no one else can
	 * send smp call function interrupt to this cpu and as such deadlocks
	 * can't happen.
	 */
	if (cpu_online(this_cpu) && !oops_in_progress &&
	    !early_boot_irqs_disabled)
		lockdep_assert_irqs_enabled();			<------------- here

Do we need to care about that?
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Sean Christopherson 1 year ago
On Mon, Jan 13, 2025, Adrian Hunter wrote:
> On 9/01/25 21:11, Sean Christopherson wrote:
> > My vote would be to KVM_BUG_ON() before entering the guest.
> 
> I notice if KVM_BUG_ON() is called with interrupts disabled
> smp_call_function_many_cond() generates a warning:
> 
> WARNING: CPU: 223 PID: 4213 at kernel/smp.c:807 smp_call_function_many_cond+0x421/0x560
> 
> static void smp_call_function_many_cond(const struct cpumask *mask,
> 					smp_call_func_t func, void *info,
> 					unsigned int scf_flags,
> 					smp_cond_func_t cond_func)
> {
> 	int cpu, last_cpu, this_cpu = smp_processor_id();
> 	struct call_function_data *cfd;
> 	bool wait = scf_flags & SCF_WAIT;
> 	int nr_cpus = 0;
> 	bool run_remote = false;
> 	bool run_local = false;
> 
> 	lockdep_assert_preemption_disabled();
> 
> 	/*
> 	 * Can deadlock when called with interrupts disabled.
> 	 * We allow cpu's that are not yet online though, as no one else can
> 	 * send smp call function interrupt to this cpu and as such deadlocks
> 	 * can't happen.
> 	 */
> 	if (cpu_online(this_cpu) && !oops_in_progress &&
> 	    !early_boot_irqs_disabled)
> 		lockdep_assert_irqs_enabled();			<------------- here
> 
> Do we need to care about that?

Ugh, yes.  E.g. the deadlock mentioned in the comment would occur if two vCPUs
hit the KVM_BUG_ON() at the same time (they'd both wait for the other to respond
to *their* IPI).

Since the damage is limited to the current vCPU, i.e. letting userspace run other
vCPUs is unlikely to put KVM in harm's way, a not-awful alternative would be to
WARN_ON_ONCE() and return KVM_EXIT_INTERNAL_ERROR.
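
For illustration, a rough (untested) sketch of that alternative, using the
TDX_XFAM_* masks proposed elsewhere in this thread; the helper name and the
suberror choice are made up, this is only to show the shape of the bail-out:

	/* Sketch: refuse to enter the TD if KVM's view of state is stale. */
	static bool tdx_guest_state_is_consistent(struct kvm_vcpu *vcpu)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);

		return vcpu->arch.xcr0 == (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK) &&
		       vcpu->arch.ia32_xss == (kvm_tdx->xfam & TDX_XFAM_XSS_MASK) &&
		       !vcpu->arch.pkru;
	}

	int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
	{
		if (WARN_ON_ONCE(!tdx_guest_state_is_consistent(vcpu))) {
			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
			vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
			vcpu->run->internal.ndata = 0;
			return 0;	/* bail to userspace, don't enter the TD */
		}
		return 1;
	}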
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year ago
On 9/01/25 21:11, Sean Christopherson wrote:
> On Fri, Jan 03, 2025, Adrian Hunter wrote:
>> On 20/12/24 18:22, Sean Christopherson wrote:
>> +/* Set a maximal guest CR0 value */
>> +static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
>> +{
>> +	u64 cr0;
>> +
>> +	rdmsrl(MSR_IA32_VMX_CR0_FIXED1, cr0);
>> +
>> +	if (cr4 & X86_CR4_CET)
>> +		cr0 |= X86_CR0_WP;
>> +
>> +	cr0 |= X86_CR0_PE | X86_CR0_NE;
>> +	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
>> +
>> +	return cr0;
>> +}
>> +
>> +/*
>> + * Set a maximal guest CR4 value. Clear bits forbidden by XFAM or
>> + * TD Attributes.
>> + */
>> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +	u64 cr4;
>> +
>> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
> 
> This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
> neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
> KVM thinks are illegal, which will cause its own problems.

Currently validation of CR4 is only done when user space changes it,
which should not be allowed for TDX.  For that it looks like TDX
would need:

	kvm->arch.has_protected_state = true;

Not sure why it doesn't already?

> 
> For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
> the CPU's.  That will mean there will likely be missing bits in vcpu->arch.cr{0,4},
> but if KVM doesn't know about a bit, the fact that it's missing should be a complete
> non-issue.

What about adding:

	cr4 &= ~cr4_reserved_bits;

and

	cr0 &= ~CR0_RESERVED_BITS;
> 
> That also avoids weirdness for things like user-mode interrupts, LASS, PKS, etc.,
> where KVM is open coding the bits.  The downside is that we'll need to remember
> to update TDX when enabling those features to account for kvm_tdx->attributes,
> but that's not unreasonable.
> 
>> +
>> +	if (!(kvm_tdx->xfam & XFEATURE_MASK_PKRU))
>> +		cr4 &= ~X86_CR4_PKE;
>> +
>> +	if (!(kvm_tdx->xfam & XFEATURE_MASK_CET_USER) || !(kvm_tdx->xfam & BIT_ULL(12)))
>> +		cr4 &= ~X86_CR4_CET;
>> +
>> +	/* User Interrupts */
>> +	if (!(kvm_tdx->xfam & BIT_ULL(14)))
>> +		cr4 &= ~BIT_ULL(25);
>> +
>> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_LASS))
>> +		cr4 &= ~BIT_ULL(27);
>> +
>> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_PKS))
>> +		cr4 &= ~BIT_ULL(24);
>> +
>> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_KL))
>> +		cr4 &= ~BIT_ULL(19);
>> +
>> +	cr4 &= ~X86_CR4_SMXE;
>> +
>> +	return cr4;
>> +}
>> +
>>  int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>  {
>>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> @@ -732,8 +783,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>>  
>> -	vcpu->arch.cr4 = <maximal value>;
>> -	vcpu->arch.cr0 = <maximal value, give or take>;
>> +	vcpu->arch.cr4 = tdx_guest_cr4(vcpu);
>> +	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
>>  
>>  	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
>>  	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
>> @@ -767,6 +818,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>  	return 0;
>>  }
>>  
>> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> +{
>> +	if (cpu_feature_enabled(X86_FEATURE_XSAVES))
> 
> This should use kvm_cpu_cap_has(), because strictly speaking it's KVM support
> that matters.  In practice, I don't think it matters for XSAVES, but it can
> matter for other features (though probably not for TDX guests).
> 
>> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
>> +}
>> +
>>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  {
>>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> @@ -933,6 +990,24 @@ static void tdx_user_return_msr_update_cache(void)
>>  						 tdx_uret_msrs[i].defval);
>>  }
>>  
>> +static void tdx_reinforce_guest_state(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +
>> +	if (WARN_ON_ONCE(vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK)))
>> +		vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
>> +	if (WARN_ON_ONCE(vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK)))
>> +		vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
>> +	if (WARN_ON_ONCE(vcpu->arch.pkru))
>> +		vcpu->arch.pkru = 0;
>> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVE) &&
>> +			 !kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)))
>> +		vcpu->arch.cr4 |= X86_CR4_OSXSAVE;
>> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVES) &&
>> +			 !guest_can_use(vcpu, X86_FEATURE_XSAVES)))
>> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
>> +}
>> +
>>  static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> @@ -1028,9 +1103,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>  		update_debugctlmsr(tdx->host_debugctlmsr);
>>  
>>  	tdx_user_return_msr_update_cache();
>> +
>> +	tdx_reinforce_guest_state(vcpu);
> 
> Hmm, I don't think fixing up guest state is a good idea.  It probably works?
> But continuing on when we know there's a KVM bug *and* a chance for host data
> corruption seems unnecessarily risky.
> 
> My vote would be to KVM_BUG_ON() before entering the guest.  I think I'd also be ok
> omitting the checks; it's not like the potential for KVM bugs that clobber KVM's
> view of state is unique to TDX (though I do agree that the behavior of the TDX
> module in this case does make them more likely).

If guest state that is vital to host state restoration goes wrong, then the
machine can die without much explanation, so KVM_BUG_ON() before entering
the guest seems prudent.
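
E.g. something like the below in tdx_vcpu_run() before tdx_vcpu_enter_exit()
(sketch only, untested, and modulo the interrupts-disabled caveat with
KVM_BUG_ON() discussed above):

	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);

	/* Sketch: kill the VM rather than enter with inconsistent state. */
	if (KVM_BUG_ON(vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK) ||
		       vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK) ||
		       vcpu->arch.pkru, vcpu->kvm))
		return EXIT_FASTPATH_NONE;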

> 
>>  	kvm_load_host_xsave_state(vcpu);
>>  
>> -	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
>> +	vcpu->arch.regs_avail = ~0;
>>  
>>  	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
>>  		return EXIT_FASTPATH_NONE;
>> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
>> index 861c0f649b69..2e0e300a1f5e 100644
>> --- a/arch/x86/kvm/vmx/tdx_arch.h
>> +++ b/arch/x86/kvm/vmx/tdx_arch.h
>> @@ -110,6 +110,7 @@ struct tdx_cpuid_value {
>>  } __packed;
>>  
>>  #define TDX_TD_ATTR_DEBUG		BIT_ULL(0)
>> +#define TDX_TD_ATTR_LASS		BIT_ULL(27)
>>  #define TDX_TD_ATTR_SEPT_VE_DISABLE	BIT_ULL(28)
>>  #define TDX_TD_ATTR_PKS			BIT_ULL(30)
>>  #define TDX_TD_ATTR_KL			BIT_ULL(31)
>> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
>> index 7fb1bbf12b39..7f03a6a24abc 100644
>> --- a/arch/x86/kvm/vmx/x86_ops.h
>> +++ b/arch/x86/kvm/vmx/x86_ops.h
>> @@ -126,6 +126,7 @@ void tdx_vm_free(struct kvm *kvm);
>>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>>  
>>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
>>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>>  int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
>> @@ -170,6 +171,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
>>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>>  
>>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>> +static inline void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) {}
>>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>>  static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>>  static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index d2ea7db896ba..f2b1980f830d 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>>  	u64 old_xcr0 = vcpu->arch.xcr0;
>>  	u64 valid_bits;
>>  
>> +	if (vcpu->arch.guest_state_protected) {
> 
> This should be a WARN_ON_ONCE() + return 1, no?

With kvm->arch.has_protected_state = true, KVM_SET_XCRS
would fail, which would probably be fine except for KVM selftests:

Currently the KVM selftests expect to be able to set XCR0:

    td_vcpu_add()
	vm_vcpu_add()
	    vm_arch_vcpu_add()
		vcpu_init_xcrs()
		    vcpu_xcrs_set()
			vcpu_ioctl(KVM_SET_XCRS)
			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)

Seems like vm->arch.has_protected_state is needed for
KVM selftests?

> 
>> +		kvm_update_cpuid_runtime(vcpu);

And kvm_update_cpuid_runtime() never gets called otherwise.
Not sure where a good place to call it would be.

>> +		return 0;
>> +	}
>> +
>>  	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
>>  	if (index != XCR_XFEATURE_ENABLED_MASK)
>>  		return 1;
>> @@ -12388,7 +12393,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>  	 * into hardware, to be zeroed at vCPU creation.  Use CRs as a sentinel
>>  	 * to detect improper or missing initialization.
>>  	 */
>> -	WARN_ON_ONCE(!init_event &&
>> +	WARN_ON_ONCE(!init_event && !vcpu->arch.guest_state_protected &&
>>  		     (old_cr0 || kvm_read_cr3(vcpu) || kvm_read_cr4(vcpu)));
> 
> Maybe stuff state in tdx_vcpu_init() to avoid this waiver?  KVM is already
> deferring APIC base and RCX initialization to that point; waiting to stuff all
> TDX-specific vCPU state seems natural.
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Sean Christopherson 1 year ago
On Fri, Jan 10, 2025, Adrian Hunter wrote:
> On 9/01/25 21:11, Sean Christopherson wrote:
> > On Fri, Jan 03, 2025, Adrian Hunter wrote:
> >> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >> +	u64 cr4;
> >> +
> >> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
> > 
> > This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
> > neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
> > KVM thinks are illegal, which will cause its own problems.
> 
> Currently validation of CR4 is only done when user space changes it,
> which should not be allowed for TDX.  For that it looks like TDX
> would need:
> 
> 	kvm->arch.has_protected_state = true;
> 
> Not sure why it doesn't already?

Sorry, I didn't follow any of that.

> > For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
> > the CPU's.  That will mean there will likely be missing bits in vcpu->arch.cr{0,4},
> > but if KVM doesn't know about a bit, the fact that it's missing should be a complete
> > non-issue.
> 
> What about adding:
> 
> 	cr4 &= ~cr4_reserved_bits;
> 
> and
> 
> 	cr0 &= ~CR0_RESERVED_BITS;

I was thinking a much more explicit:

	vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;

which if it's done in tdx_vcpu_init(), in conjunction with freezing the vCPU
model (see below), should be solid.
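
Roughly, i.e. an untested sketch of the tail of tdx_vcpu_init(), borrowing
the freeze mechanism from the patch below (CR0 would still need an explicit
maximal value, as there is no cr0_guest_rsvd_bits equivalent):

	/* Stuff the maximal CR4 only once the vCPU model is final. */
	vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;

	vcpu->arch.vcpu_model_is_frozen = true;
	kvm_update_cpuid_runtime(vcpu);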

> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index d2ea7db896ba..f2b1980f830d 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
> >>  	u64 old_xcr0 = vcpu->arch.xcr0;
> >>  	u64 valid_bits;
> >>  
> >> +	if (vcpu->arch.guest_state_protected) {
> > 
> > This should be a WARN_ON_ONCE() + return 1, no?
> 
> With kvm->arch.has_protected_state = true, KVM_SET_XCRS
> would fail, which would probably be fine except for KVM selftests:
> 
> Currently the KVM selftests expect to be able to set XCR0:
> 
>     td_vcpu_add()
> 	vm_vcpu_add()
> 	    vm_arch_vcpu_add()
> 		vcpu_init_xcrs()
> 		    vcpu_xcrs_set()
> 			vcpu_ioctl(KVM_SET_XCRS)
> 			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)
> 
> Seems like vm->arch.has_protected_state is needed for KVM selftests?

I doubt it's truly needed, my guess (without looking at the code) is that selftests
are fudging around the fact that KVM doesn't stuff arch.xcr0.

> >> +		kvm_update_cpuid_runtime(vcpu);
> 
> And kvm_update_cpuid_runtime() never gets called otherwise.
> Not sure where a good place to call it would be.

I think we should call it in tdx_vcpu_init(), and then also freeze the vCPU model
at that time.  KVM currently "freezes" the model based on last_vmentry_cpu, but
that's a bit of a hack and might even be flawed, e.g. I wouldn't be surprised if
it's possible to lead KVM astray by trying to get a signal to race with KVM_RUN
so that last_vmentry_cpu isn't set despite getting quite far into KVM_RUN.

I'll test and post a patch to add vcpu_model_is_frozen (the below will conflict
mightily with the CPUID rework that's queued for v6.14), as I think it's a good
change even if we don't end up freezing the model at tdx_vcpu_init() (though I
can't think of any reason to allow CPUID updates after that point).

---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/cpuid.c            | 2 +-
 arch/x86/kvm/mmu/mmu.c          | 4 ++--
 arch/x86/kvm/pmu.c              | 2 +-
 arch/x86/kvm/vmx/tdx.c          | 7 +++++++
 arch/x86/kvm/x86.c              | 9 ++++++++-
 arch/x86/kvm/x86.h              | 5 -----
 7 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0787855ab006..41c31a69924d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -779,6 +779,7 @@ struct kvm_vcpu_arch {
 	u64 ia32_misc_enable_msr;
 	u64 smbase;
 	u64 smi_count;
+	bool vcpu_model_is_frozen;
 	bool at_instruction_boundary;
 	bool tpr_access_reporting;
 	bool xfd_no_write_intercept;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 6c7ab125f582..678518ec1c72 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -465,7 +465,7 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
 	 * KVM_SET_CPUID{,2} again. To support this legacy behavior, check
 	 * whether the supplied CPUID data is equal to what's already set.
 	 */
-	if (kvm_vcpu_has_run(vcpu)) {
+	if (vcpu->arch.vcpu_model_is_frozen) {
 		r = kvm_cpuid_check_equal(vcpu, e2, nent);
 		if (r)
 			return r;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 713ca857f2c2..75350a5c6c54 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5716,9 +5716,9 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 	/*
 	 * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
-	 * kvm_arch_vcpu_ioctl().
+	 * kvm_arch_vcpu_ioctl_run().
 	 */
-	KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm);
+	KVM_BUG_ON(vcpu->arch.vcpu_model_is_frozen, vcpu->kvm);
 }
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 47a46283c866..4f487a980eae 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -752,7 +752,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 
-	if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
+	if (KVM_BUG_ON(vcpu->arch.vcpu_model_is_frozen, vcpu->kvm))
 		return;
 
 	/*
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a587f59167a7..997d14506a1f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2822,6 +2822,13 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 	td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
 	td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
 
+	/*
+	 * Freeze the vCPU model, as KVM relies on guest CPUID and capabilities
+	 * to be consistent with the TDX Module's view from here on out.
+	 */
+	vcpu->arch.vcpu_model_is_frozen = true;
+	kvm_update_cpuid_runtime(vcpu);
+
 	tdx->state = VCPU_TD_STATE_INITIALIZED;
 	return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2ea7db896ba..3db935737b59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2218,7 +2218,8 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 	 * writes of the same value, e.g. to allow userspace to blindly stuff
 	 * all MSRs when emulating RESET.
 	 */
-	if (kvm_vcpu_has_run(vcpu) && kvm_is_immutable_feature_msr(index) &&
+	if (vcpu->arch.vcpu_model_is_frozen &&
+	    kvm_is_immutable_feature_msr(index) &&
 	    (do_get_msr(vcpu, index, &val) || *data != val))
 		return -EINVAL;
 
@@ -11469,6 +11470,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	struct kvm_run *kvm_run = vcpu->run;
 	int r;
 
+	/*
+	 * Freeze the vCPU model, i.e. disallow changing CPUID, feature MSRs,
+	 * etc.  KVM doesn't support changing the model once the vCPU has run.
+	 */
+	vcpu->arch.vcpu_model_is_frozen = true;
+
 	vcpu_load(vcpu);
 	kvm_sigset_activate(vcpu);
 	kvm_run->flags = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 514ffd7513f3..6ed074d03616 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -118,11 +118,6 @@ static inline void kvm_leave_nested(struct kvm_vcpu *vcpu)
 	kvm_x86_ops.nested_ops->leave_nested(vcpu);
 }
 
-static inline bool kvm_vcpu_has_run(struct kvm_vcpu *vcpu)
-{
-	return vcpu->arch.last_vmentry_cpu != -1;
-}
-
 static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.exception.pending ||

base-commit: d12b37e67b767a9e89b221067d48b257708d3044
--
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year ago
On 10/01/25 19:30, Sean Christopherson wrote:
> On Fri, Jan 10, 2025, Adrian Hunter wrote:
>> On 9/01/25 21:11, Sean Christopherson wrote:
>>> On Fri, Jan 03, 2025, Adrian Hunter wrote:
>>>> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>>>> +	u64 cr4;
>>>> +
>>>> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
>>>
>>> This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
>>> neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
>>> KVM thinks are illegal, which will cause its own problems.
>>
>> Currently validation of CR4 is only done when user space changes it,
>> which should not be allowed for TDX.  For that it looks like TDX
>> would need:
>>
>> 	kvm->arch.has_protected_state = true;
>>
>> Not sure why it doesn't already?
> 
> Sorry, I didn't follow any of that.
> 
>>> For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
>>> the CPU's.  That will mean there will likely be missing bits in vcpu->arch.cr{0,4},
>>> but if KVM doesn't know about a bit, the fact that it's missing should be a complete
>>> non-issue.
>>
>> What about adding:
>>
>> 	cr4 &= ~cr4_reserved_bits;
>>
>> and
>>
>> 	cr0 &= ~CR0_RESERVED_BITS;
> 
> I was thinking a much more explicit:
> 
> 	vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;
> 
> which if it's done in tdx_vcpu_init(), in conjunction with freezing the vCPU
> model (see below), should be solid.
> 
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index d2ea7db896ba..f2b1980f830d 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>>>>  	u64 old_xcr0 = vcpu->arch.xcr0;
>>>>  	u64 valid_bits;
>>>>  
>>>> +	if (vcpu->arch.guest_state_protected) {
>>>
>>> This should be a WARN_ON_ONCE() + return 1, no?
>>
>> With kvm->arch.has_protected_state = true, KVM_SET_XCRS
>> would fail, which would probably be fine except for KVM selftests:
>>
>> Currently the KVM selftests expect to be able to set XCR0:
>>
>>     td_vcpu_add()
>> 	vm_vcpu_add()
>> 	    vm_arch_vcpu_add()
>> 		vcpu_init_xcrs()
>> 		    vcpu_xcrs_set()
>> 			vcpu_ioctl(KVM_SET_XCRS)
>> 			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)
>>
>> Seems like vm->arch.has_protected_state is needed for KVM selftests?
> 
> I doubt it's truly needed, my guess (without looking at the code) is that selftests
> are fudging around the fact that KVM doesn't stuff arch.xcr0.

Here is when it was added:

commit 8b14c4d85d031f7700fa4e042aebf99d933971f0
Author: Sean Christopherson <seanjc@google.com>
Date:   Thu Oct 3 16:43:31 2024 -0700

    KVM: selftests: Configure XCR0 to max supported value by default
    
    To play nice with compilers generating AVX instructions, set CR4.OSXSAVE
    and configure XCR0 by default when creating selftests vCPUs.  Some distros
    have switched gcc to '-march=x86-64-v3' by default, and while it's hard to
    find a CPU which doesn't support AVX today, many KVM selftests fail with

Is below OK to avoid it?

diff --git a/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h b/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h
index 972bb1c4ab4c..42925152ed25 100644
--- a/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h
+++ b/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h
@@ -12,20 +12,21 @@ extern bool is_forced_emulation_enabled;
 
 struct kvm_vm_arch {
 	vm_vaddr_t gdt;
 	vm_vaddr_t tss;
 	vm_vaddr_t idt;
 
 	uint64_t c_bit;
 	uint64_t s_bit;
 	int sev_fd;
 	bool is_pt_protected;
+	bool has_protected_sregs;
 };
 
 static inline bool __vm_arch_has_protected_memory(struct kvm_vm_arch *arch)
 {
 	return arch->c_bit || arch->s_bit;
 }
 
 #define vm_arch_has_protected_memory(vm) \
 	__vm_arch_has_protected_memory(&(vm)->arch)
 
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 0ed0768b1c88..89b70fe037d1 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -704,22 +704,25 @@ struct kvm_vcpu *vm_arch_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id)
 	 *
 	 * If this code is ever used to launch a vCPU with 32-bit entry point it
 	 * may need to subtract 4 bytes instead of 8 bytes.
 	 */
 	TEST_ASSERT(IS_ALIGNED(stack_vaddr, PAGE_SIZE),
 		    "__vm_vaddr_alloc() did not provide a page-aligned address");
 	stack_vaddr -= 8;
 
 	vcpu = __vm_vcpu_add(vm, vcpu_id);
 	vcpu_init_cpuid(vcpu, kvm_get_supported_cpuid());
-	vcpu_init_sregs(vm, vcpu);
-	vcpu_init_xcrs(vm, vcpu);
+
+	if (!vm->arch.has_protected_sregs) {
+		vcpu_init_sregs(vm, vcpu);
+		vcpu_init_xcrs(vm, vcpu);
+	}
 
 	vcpu->initial_stack_addr = stack_vaddr;
 
 	/* Setup guest general purpose registers */
 	vcpu_regs_get(vcpu, &regs);
 	regs.rflags = regs.rflags | 0x2;
 	regs.rsp = stack_vaddr;
 	vcpu_regs_set(vcpu, &regs);
 
 	/* Setup the MP state */
diff --git a/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c b/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
index 16db4e97673e..da4bcfefdd70 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
@@ -477,25 +477,22 @@ static void load_td_boot_parameters(struct td_boot_parameters *params,
  * entering the TD first time.
  *
  * Input Args:
  *   vm - Virtual Machine
  *   vcpuid - The id of the VCPU to add to the VM.
  */
 struct kvm_vcpu *td_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id, void *guest_code)
 {
 	struct kvm_vcpu *vcpu;
 
-	/*
-	 * TD setup will not use the value of rip set in vm_vcpu_add anyway, so
-	 * NULL can be used for guest_code.
-	 */
-	vcpu = vm_vcpu_add(vm, vcpu_id, NULL);
+	vm->arch.has_protected_sregs = true;
+	vcpu = vm_arch_vcpu_add(vm, vcpu_id);
 
 	tdx_td_vcpu_init(vcpu);
 
 	load_td_boot_parameters(addr_gpa2hva(vm, TD_BOOT_PARAMETERS_GPA),
 				vcpu, guest_code);
 
 	return vcpu;
 }
 
 /**
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Sean Christopherson 1 year ago
On Tue, Jan 14, 2025, Adrian Hunter wrote:
> On 10/01/25 19:30, Sean Christopherson wrote:
> >> Currently the KVM selftests expect to be able to set XCR0:
> >>
> >>     td_vcpu_add()
> >> 	vm_vcpu_add()
> >> 	    vm_arch_vcpu_add()
> >> 		vcpu_init_xcrs()
> >> 		    vcpu_xcrs_set()
> >> 			vcpu_ioctl(KVM_SET_XCRS)
> >> 			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)
> >>
> >> Seems like vm->arch.has_protected_state is needed for KVM selftests?
> > 
> > I doubt it's truly needed, my guess (without looking at the code) is that selftests
> > are fudging around the fact that KVM doesn't stuff arch.xcr0.
> 
> Here is when it was added:
> 
> commit 8b14c4d85d031f7700fa4e042aebf99d933971f0
> Author: Sean Christopherson <seanjc@google.com>
> Date:   Thu Oct 3 16:43:31 2024 -0700
> 
>     KVM: selftests: Configure XCR0 to max supported value by default
>     
>     To play nice with compilers generating AVX instructions, set CR4.OSXSAVE
>     and configure XCR0 by default when creating selftests vCPUs.  Some distros
>     have switched gcc to '-march=x86-64-v3' by default, and while it's hard to
>     find a CPU which doesn't support AVX today, many KVM selftests fail with

Gah, sorry.  I misread the callstack the first time around and didn't realize it
was the common code that was writing XCR0.

> Is below OK to avoid it?

Skipping the ioctls to set XCRs and SREGS is definitely ok.  I'll hold off on
providing concrete feedback until I review the TDX selftests in their entirety,
as I'm skeptical of having td_vcpu_add() wrap vm_arch_vcpu_add() instead of the
other way around, but I don't want to cause a bunch of noise by reacting to a
sliver of the code.
PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD)
Posted by Sean Christopherson 1 year, 1 month ago
Switching topics, dropped everyone else except the list.

On Fri, Dec 20, 2024, Sean Christopherson wrote:
>  arch/x86/kvm/x86.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4320647bd78a..9d5cece9260b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1186,7 +1186,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
>  	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
> -		write_pkru(vcpu->arch.pkru);
> +		wrpkru(vcpu->arch.pkru);
>  }
>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>  
> @@ -1200,7 +1200,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
>  		vcpu->arch.pkru = rdpkru();
>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> -			write_pkru(vcpu->arch.host_pkru);
> +			wrpkru(vcpu->arch.host_pkru);
>  	}
>  
>  	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> 
> base-commit: 13e98294d7cec978e31138d16824f50556a62d17
> -- 

I tried to test this by running the mm/protection_keys selftest in a VM, but it
gives what are effectively false passes on x86-64 due to the selftest picking up
the generic syscall numbers, e.g. 289 for SYS_pkey_alloc, instead of the x86-64
numbers.

I was able to get the test to run by hacking tools/testing/selftests/mm/pkey-x86.h
to shove in the right numbers, but I can't imagine that's the intended behavior.

If I omit the #undefs from pkey-x86.h, it shows that the test is grabbing the
definitions from the generic usr/include/asm-generic/unistd.h header.

Am I doing something stupid?

Regardless of whether this is PEBKAC or working as intended, on x86, the test
should ideally assert that "ospke" support in /proc/cpuinfo is consistent with
the result of sys_pkey_alloc(), e.g. so that a failure to allocate a pkey on a
system where it should work is reported as an error, not a pass.

--
diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
index ac91777c8917..ccc3552e6b77 100644
--- a/tools/testing/selftests/mm/pkey-x86.h
+++ b/tools/testing/selftests/mm/pkey-x86.h
@@ -3,6 +3,10 @@
 #ifndef _PKEYS_X86_H
 #define _PKEYS_X86_H
 
+#define __NR_pkey_mprotect     329
+#define __NR_pkey_alloc                330
+#define __NR_pkey_free         331
+
 #ifdef __i386__
 
 #define REG_IP_IDX             REG_EIP
--

Yields:

$ ARCH=x86_64 make protection_keys_64
gcc -Wall -I /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../..  -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../usr/include -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../tools/include/uapi -no-pie -D_GNU_SOURCE=  -m64 -mxsave  protection_keys.c vm_util.c thp_settings.c -lrt -lpthread -lm -lrt -ldl -o /home/sean/go/src/kernel.org/linux/tools/testing/selftests/mm/protection_keys_64
In file included from pkey-helpers.h:102:0,
                 from protection_keys.c:49:
pkey-x86.h:6:0: warning: "__NR_pkey_mprotect" redefined
 #define __NR_pkey_mprotect 329
 
In file included from protection_keys.c:45:0:
/home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:693:0: note: this is the location of the previous definition
 #define __NR_pkey_mprotect 288
 
In file included from pkey-helpers.h:102:0,
                 from protection_keys.c:49:
pkey-x86.h:7:0: warning: "__NR_pkey_alloc" redefined
 #define __NR_pkey_alloc  330
 
In file included from protection_keys.c:45:0:
/home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:695:0: note: this is the location of the previous definition
 #define __NR_pkey_alloc 289
 
In file included from pkey-helpers.h:102:0,
                 from protection_keys.c:49:
pkey-x86.h:8:0: warning: "__NR_pkey_free" redefined
 #define __NR_pkey_free  331
 
In file included from protection_keys.c:45:0:
/home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:697:0: note: this is the location of the previous definition
 #define __NR_pkey_free 290
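
For the /proc/cpuinfo consistency check suggested above, something along
these lines in the selftest would do (untested sketch; cpu_has_pkeys() and
pkey_assert() are, if I remember the names correctly, the existing
pkey-helpers.h/pkey-x86.h helpers):

	/* pkey_alloc() must succeed if and only if the CPU reports pkeys. */
	int pkey = sys_pkey_alloc(0, 0);

	if (cpu_has_pkeys())
		pkey_assert(pkey >= 0);
	else
		pkey_assert(pkey < 0);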
 
Re: PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD)
Posted by Sean Christopherson 1 year ago
Ping, I'm guessing this fell through the holiday cracks.

On Fri, Dec 20, 2024, Sean Christopherson wrote:
> Switching topics, dropped everyone else except the list.
> 
> On Fri, Dec 20, 2024, Sean Christopherson wrote:
> >  arch/x86/kvm/x86.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4320647bd78a..9d5cece9260b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1186,7 +1186,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> >  	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
> >  	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> >  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
> > -		write_pkru(vcpu->arch.pkru);
> > +		wrpkru(vcpu->arch.pkru);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
> >  
> > @@ -1200,7 +1200,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
> >  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
> >  		vcpu->arch.pkru = rdpkru();
> >  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> > -			write_pkru(vcpu->arch.host_pkru);
> > +			wrpkru(vcpu->arch.host_pkru);
> >  	}
> >  
> >  	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> > 
> > base-commit: 13e98294d7cec978e31138d16824f50556a62d17
> > -- 
> 
> I tried to test this by running the mm/protection_keys selftest in a VM, but it
> gives what are effectively false passes on x86-64 due to the selftest picking up
> the generic syscall numbers, e.g. 289 for SYS_pkey_alloc, instead of the x86-64
> numbers.
> 
> I was able to get the test to run by hacking tools/testing/selftests/mm/pkey-x86.h
> to shove in the right numbers, but I can't imagine that's the intended behavior.
> 
> If I omit the #undefs from pkey-x86.h, it shows that the test is grabbing the
> definitions from the generic usr/include/asm-generic/unistd.h header.
> 
> Am I doing something stupid?
> 
> Regardless of whether this is PEBKAC or working as intended, on x86, the test
> should ideally assert that "ospke" support in /proc/cpuinfo is consistent with
> the result of sys_pkey_alloc(), e.g. so that a failure to allocate a pkey on a
> system where it should work is reported as an error, not a pass.
> 
> --
> diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
> index ac91777c8917..ccc3552e6b77 100644
> --- a/tools/testing/selftests/mm/pkey-x86.h
> +++ b/tools/testing/selftests/mm/pkey-x86.h
> @@ -3,6 +3,10 @@
>  #ifndef _PKEYS_X86_H
>  #define _PKEYS_X86_H
>  
> +#define __NR_pkey_mprotect     329
> +#define __NR_pkey_alloc                330
> +#define __NR_pkey_free         331
> +
>  #ifdef __i386__
>  
>  #define REG_IP_IDX             REG_EIP
> --
> 
> Yields:
> 
> $ ARCH=x86_64 make protection_keys_64
> gcc -Wall -I /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../..  -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../usr/include -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../tools/include/uapi -no-pie -D_GNU_SOURCE=  -m64 -mxsave  protection_keys.c vm_util.c thp_settings.c -lrt -lpthread -lm -lrt -ldl -o /home/sean/go/src/kernel.org/linux/tools/testing/selftests/mm/protection_keys_64
> In file included from pkey-helpers.h:102:0,
>                  from protection_keys.c:49:
> pkey-x86.h:6:0: warning: "__NR_pkey_mprotect" redefined
>  #define __NR_pkey_mprotect 329
>  
> In file included from protection_keys.c:45:0:
> /home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:693:0: note: this is the location of the previous definition
>  #define __NR_pkey_mprotect 288
>  
> In file included from pkey-helpers.h:102:0,
>                  from protection_keys.c:49:
> pkey-x86.h:7:0: warning: "__NR_pkey_alloc" redefined
>  #define __NR_pkey_alloc  330
>  
> In file included from protection_keys.c:45:0:
> /home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:695:0: note: this is the location of the previous definition
>  #define __NR_pkey_alloc 289
>  
> In file included from pkey-helpers.h:102:0,
>                  from protection_keys.c:49:
> pkey-x86.h:8:0: warning: "__NR_pkey_free" redefined
>  #define __NR_pkey_free  331
>  
> In file included from protection_keys.c:45:0:
> /home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:697:0: note: this is the location of the previous definition
>  #define __NR_pkey_free 290
>  
> 
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Chao Gao 1 year, 2 months ago
On Mon, Nov 25, 2024 at 01:10:37PM +0200, Adrian Hunter wrote:
>On 22/11/24 07:49, Chao Gao wrote:
>>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>>> +{
>>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>>> +
>>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>>> +	    kvm_host.xss != (kvm_tdx->xfam &
>>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
>> 
>> Should we drop CET/PT from this series? I think they are worth a new
>> patch/series.
>
>This is not really about CET/PT
>
>What is happening here is that we are calculating the current
>MSR_IA32_XSS value based on the TDX Module spec which says the
>TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
>TDX Module does that literally, from TDX Module source code:
>
>	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
>and
>	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);
>
>For KVM, rather than:
>
>			kvm_tdx->xfam &
>			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
>
>it would be more direct to define the bits and enforce them
>via tdx_get_supported_xfam() e.g.
>
>/* 
> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
> */
>#define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>#define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>#define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
>
>static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
>{
>	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
>
>	/* Ensure features are in the masks */
>	val &= TDX_XFAM_MASK;

Before exposing a feature to TD VMs, both the TDX module and KVM must support
it. In other words, kvm_tdx->xfam & kvm_caps.supported_xss should yield the
same result as kvm_tdx->xfam & TDX_XFAM_XSS_MASK. So, to me, the current
approach and your new proposal are functionally identical.

I prefer checking against kvm_caps.supported_xss because we don't need to
update TDX_XFAM_XSS/XCR0_MASK when new user/supervisor xstate bits are added.
Note kvm_caps.supported_xss/xcr0 need to be updated for normal VMs anyway.

>
>	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
>		return 0;
>
>	val &= td_conf->xfam_fixed0;
>
>	return val;
>}
>
>and then:
>
>	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
>		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
>		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year, 2 months ago
On 26/11/24 04:20, Chao Gao wrote:
> On Mon, Nov 25, 2024 at 01:10:37PM +0200, Adrian Hunter wrote:
>> On 22/11/24 07:49, Chao Gao wrote:
>>>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>>>> +
>>>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>>>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>>>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>>>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>>>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>>>> +	    kvm_host.xss != (kvm_tdx->xfam &
>>>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>>>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
>>>
>>> Should we drop CET/PT from this series? I think they are worth a new
>>> patch/series.
>>
>> This is not really about CET/PT
>>
>> What is happening here is that we are calculating the current
>> MSR_IA32_XSS value based on the TDX Module spec which says the
>> TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
>> TDX Module does that literally, from TDX Module source code:
>>
>> 	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
>> and
>> 	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);
>>
>> For KVM, rather than:
>>
>> 			kvm_tdx->xfam &
>> 			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>> 			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
>>
>> it would be more direct to define the bits and enforce them
>> via tdx_get_supported_xfam() e.g.
>>
>> /* 
>> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
>> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
>> */
>> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
>>
>> static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
>> {
>> 	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
>>
>> 	/* Ensure features are in the masks */
>> 	val &= TDX_XFAM_MASK;
> 
> Before exposing a feature to TD VMs, both the TDX module and KVM must support
> it. In other words, kvm_tdx->xfam & kvm_caps.supported_xss should yield the
> same result as kvm_tdx->xfam & TDX_XFAM_XSS_MASK. So, to me, the current
> approach and your new proposal are functionally identical.
> 
> I prefer checking against kvm_caps.supported_xss because we don't need to
> update TDX_XFAM_XSS/XCR0_MASK when new user/supervisor xstate bits are added.

Arguably, making the addition of new XFAM bits more visible
is a good thing.

> Note kvm_caps.supported_xss/xcr0 need to be updated for normal VMs anyway.

The way the code is at the moment, that seems too fragile.
There are direct changes to XFAM bits in tdx_get_supported_xfam()
that are not reflected in supported_xss and so have to be added to
tdx_restore_host_xsave_state() as well.
That is, with the current code, changes to tdx_get_supported_xfam()
can break tdx_restore_host_xsave_state().

The new approach:
	reflects what the TDX Module does
	reflects what the TDX Module base spec says
	makes it harder to break tdx_restore_host_xsave_state()

> 
>>
>> 	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
>> 		return 0;
>>
>> 	val &= td_conf->xfam_fixed0;
>>
>> 	return val;
>> }
>>
>> and then:
>>
>> 	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>> 	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
>> 		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>> 	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>> 	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
>> 		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>>

Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Chao Gao 1 year, 2 months ago
>>> /* 
>>> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>>> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
>>> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)

TILECFG state (bit 17) and TILEDATA state (bit 18) are also user state. Are they
cleared unconditionally?

>>> */
>>> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>>> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>>> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
Posted by Adrian Hunter 1 year, 2 months ago
On 2/12/24 04:52, Chao Gao wrote:
>>>> /* 
>>>> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>>>> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
>>>> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
> 
> TILECFG state (bit 17) and TILEDATA state (bit 18) are also user state. Are they
> cleared unconditionally?

Bits 17 and 18 should also be in TDX_XFAM_XCR0_MASK.  The
TDX Module does define them, from TDX Module sources:

	#define XCR0_USER_BIT_MASK                  0x000602FF

Thanks for spotting that!
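
i.e. presumably:

	/* User XFAM bits: x87/SSE/AVX/MPX/AVX-512 (0-7), PKRU (9), AMX (17-18) */
	#define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9) | GENMASK(18, 17))

which lines up with the TDX Module's XCR0_USER_BIT_MASK of 0x000602FF.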

> 
>>>> */
>>>> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>>>> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>>>> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)