[PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)

Sean Christopherson posted 22 patches 1 year, 6 months ago
There is a newer version of this series
[PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Posted by Sean Christopherson 1 year, 6 months ago
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation.  Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.

But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes.  Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC.  And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.

Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0.  But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.

Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots is really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.

Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 include/linux/kvm_host.h        | 7 +++++++
 virt/kvm/kvm_main.c             | 5 ++---
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..37c4a573e5fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2191,6 +2191,8 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 #define kvm_arch_has_private_mem(kvm) false
 #endif
 
+#define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
+
 static inline u16 kvm_read_ldt(void)
 {
 	u16 ldt;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 689e8be873a7..62a3d1c0cc07 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -715,6 +715,13 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 }
 #endif
 
+#ifndef kvm_arch_has_readonly_mem
+static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
+{
+	return IS_ENABLED(CONFIG_HAVE_KVM_READONLY_MEM);
+}
+#endif
+
 struct kvm_memslots {
 	u64 generation;
 	atomic_long_t last_used_slot;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..fad2d5932844 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1578,15 +1578,14 @@ static int check_memory_region_flags(struct kvm *kvm,
 	if (mem->flags & KVM_MEM_GUEST_MEMFD)
 		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
 
-#ifdef CONFIG_HAVE_KVM_READONLY_MEM
 	/*
 	 * GUEST_MEMFD is incompatible with read-only memslots, as writes to
 	 * read-only memslots have emulated MMIO, not page fault, semantics,
 	 * and KVM doesn't allow emulated MMIO for private memory.
 	 */
-	if (!(mem->flags & KVM_MEM_GUEST_MEMFD))
+	if (kvm_arch_has_readonly_mem(kvm) &&
+	    !(mem->flags & KVM_MEM_GUEST_MEMFD))
 		valid_flags |= KVM_MEM_READONLY;
-#endif
 
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
-- 
2.46.0.76.ge559c4bf1a-goog
Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Posted by Naveen N Rao 2 months, 1 week ago
Hi Sean,

On Fri, Aug 09, 2024 at 12:02:58PM -0700, Sean Christopherson wrote:
> Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
> directly emulate instructions for ES/SNP, and instead the guest must
> explicitly request emulation.  Unless the guest explicitly requests
> emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
> SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
> 
> But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
> because except for ES/SNP, doing so requires setting reserved bits in the
> SPTE, i.e. the SPTE can't be readable while also generating a #VC on
> writes.  Because KVM never creates MMIO SPTEs and jumps directly to
> emulation, the guest never gets a #VC.  And since KVM simply resumes the
> guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
> into an infinite #NPF loop if the vCPU attempts to write read-only memory.
> 
> Disallow read-only memory for all VMs with protected state, i.e. for
> upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
> to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> protections and SUPPRESS_VE=0.  But there is no strong use case for
> supporting read-only memslots on TDX, e.g. the main historical usage is
> to emulate option ROMs, but TDX disallows executing from shared memory.
> And if someone comes along with a legitimate, strong use case, the
> restriction can always be lifted for TDX.
> 
> Don't bother trying to retroactively apply the restriction to SEV-ES
> VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
> possibly work for SEV-ES, i.e. disallowing such memslots is really just
> means reporting an error to userspace instead of silently hanging vCPUs.
> Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> isn't worth the marginal benefit it would provide userspace.
> 
> Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
> Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
> Cc: Peter Gonda <pgonda@google.com>
> Cc: Michael Roth <michael.roth@amd.com>
> Cc: Vishal Annapurve <vannapurve@google.com>
> Cc: Ackerly Tng <ackerleytng@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 2 ++
>  include/linux/kvm_host.h        | 7 +++++++
>  virt/kvm/kvm_main.c             | 5 ++---
>  3 files changed, 11 insertions(+), 3 deletions(-)

As discussed in one of the previous PUCK calls, this is causing Qemu to 
throw an error when trying to enable debug-swap for a SEV-ES guest when 
using a pflash drive for OVMF. Sample qemu invocation (*):
  qemu-system-x86_64 ... \
    -drive if=pflash,format=raw,unit=0,file=/path/to/OVMF_CODE.fd,readonly=on \
    -drive if=pflash,format=raw,unit=1,file=/path/to/OVMF_VARS.fd \
    -machine q35,confidential-guest-support=sev0 \
    -object sev-guest,id=sev0,policy=0x5,cbitpos=51,reduced-phys-bits=1,debug-swap=on

This is expected since enabling debug-swap requires use of 
KVM_SEV_INIT2, which implies a VM type of KVM_X86_SEV_ES_VM. However, 
SEV-ES VMs that do not enable any VMSA SEV features (and are hence 
KVM_X86_DEFAULT_VM type) are allowed to continue to launch though they 
are also susceptible to this issue.

One of the suggestions in the call was to consider returning an error to 
userspace instead. Is this close to what you had in mind:

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 73cdcbccc89e..19e27ed27e17 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -387,8 +387,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
         * they can fix it by changing memory to shared, or they can
         * provide a better error.
         */
-       if (r == RET_PF_EMULATE && fault.is_private) {
-               pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
+       if (r == RET_PF_EMULATE && (fault.is_private ||
+           (!fault.map_writable && fault.write && vcpu->arch.guest_state_protected))) {
+               if (fault.is_private)
+                       pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
                kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
                return -EFAULT;
        }

This seems to work though Qemu seems to think we are asking it to 
convert the memory to shared (so we probably need to signal this error 
some other way?):
  qemu-system-x86_64: Convert non guest_memfd backed memory region (0xf0000 ,+ 0x1000) to shared

Thoughts?


Thanks,
Naveen

--
(*) This requires below patches for Qemu, unless using IGVM:
https://lore.kernel.org/qemu-devel/cover.1761648149.git.naveen@kernel.org/
Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Posted by Sean Christopherson 1 month, 2 weeks ago
On Wed, Dec 03, 2025, Naveen N Rao wrote:
> Hi Sean,
> 
> On Fri, Aug 09, 2024 at 12:02:58PM -0700, Sean Christopherson wrote:
> > Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
> > directly emulate instructions for ES/SNP, and instead the guest must
> > explicitly request emulation.  Unless the guest explicitly requests
> > emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
> > SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
> > 
> > But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
> > because except for ES/SNP, doing so requires setting reserved bits in the
> > SPTE, i.e. the SPTE can't be readable while also generating a #VC on
> > writes.  Because KVM never creates MMIO SPTEs and jumps directly to
> > emulation, the guest never gets a #VC.  And since KVM simply resumes the
> > guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
> > into an infinite #NPF loop if the vCPU attempts to write read-only memory.
> > 
> > Disallow read-only memory for all VMs with protected state, i.e. for
> > upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
> > to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> > fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> > protections and SUPPRESS_VE=0.  But there is no strong use case for
> > supporting read-only memslots on TDX, e.g. the main historical usage is
> > to emulate option ROMs, but TDX disallows executing from shared memory.
> > And if someone comes along with a legitimate, strong use case, the
> > restriction can always be lifted for TDX.
> > 
> > Don't bother trying to retroactively apply the restriction to SEV-ES
> > VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
> > possibly work for SEV-ES, i.e. disallowing such memslots is really just
> > means reporting an error to userspace instead of silently hanging vCPUs.
> > Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> > isn't worth the marginal benefit it would provide userspace.
> > 
> > Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
> > Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
> > Cc: Peter Gonda <pgonda@google.com>
> > Cc: Michael Roth <michael.roth@amd.com>
> > Cc: Vishal Annapurve <vannapurve@google.com>
> > Cc: Ackerly Tng <ackerleytng@google.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h | 2 ++
> >  include/linux/kvm_host.h        | 7 +++++++
> >  virt/kvm/kvm_main.c             | 5 ++---
> >  3 files changed, 11 insertions(+), 3 deletions(-)
> 
> As discussed in one of the previous PUCK calls, this is causing Qemu to 
> throw an error when trying to enable debug-swap for a SEV-ES guest when 
> using a pflash drive for OVMF. Sample qemu invocation (*):
>   qemu-system-x86_64 ... \
>     -drive if=pflash,format=raw,unit=0,file=/path/to/OVMF_CODE.fd,readonly=on \
>     -drive if=pflash,format=raw,unit=1,file=/path/to/OVMF_VARS.fd \
>     -machine q35,confidential-guest-support=sev0 \
>     -object sev-guest,id=sev0,policy=0x5,cbitpos=51,reduced-phys-bits=1,debug-swap=on
> 
> This is expected since enabling debug-swap requires use of 
> KVM_SEV_INIT2, which implies a VM type of KVM_X86_SEV_ES_VM. However, 
> SEV-ES VMs that do not enable any VMSA SEV features (and are hence 
> KVM_X86_DEFAULT_VM type) are allowed to continue to launch though they 
> are also susceptible to this issue.
> 
> One of the suggestions in the call was to consider returning an error to 
> userspace instead. Is this close to what you had in mind:
> 
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 73cdcbccc89e..19e27ed27e17 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -387,8 +387,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>          * they can fix it by changing memory to shared, or they can
>          * provide a better error.
>          */
> -       if (r == RET_PF_EMULATE && fault.is_private) {
> -               pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
> +       if (r == RET_PF_EMULATE && (fault.is_private ||
> +           (!fault.map_writable && fault.write && vcpu->arch.guest_state_protected))) {
> +               if (fault.is_private)
> +                       pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
>                 kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
>                 return -EFAULT;
>         }
> 
> This seems to work though Qemu seems to think we are asking it to 
> convert the memory to shared (so we probably need to signal this error 
> some other way?):
>   qemu-system-x86_64: Convert non guest_memfd backed memory region (0xf0000 ,+ 0x1000) to shared
> 
> Thoughts?

The choke point would be kvm_handle_error_pfn() (see below), where the RET_PF_EMULATE
originates.  But looking at all of this again, I am opposed to changing KVM's
ABI to allow KVM_MEM_READONLY for SEV-ES guests, it simply can't work.  And KVM
enumerates as much.

	case KVM_CAP_READONLY_MEM:
		r = kvm ? kvm_arch_has_readonly_mem(kvm) : 1;
		break;

More importantly, if QEMU wants to provide a not-fully-functional configuration
to allow KVM_SEV_INIT2 with pflash, QEMU can fudge around the lack of read-only
memory without KVM's assistance.  It likely won't be pretty, but it's doable,
by clearing PROT_WRITE in the backing VMA that's handed to the KVM memslot.

KVM will see a normal memslot that the guest can read/execute, and if the guest
attempts to write to the memory, hva_to_pfn() will return KVM_PFN_RR_FAULT and
kvm_handle_error_pfn() will send that out to userspace as -EFAULT.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f17324546900..27dc909b8225 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3493,8 +3493,12 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
         * into the spte otherwise read access on readonly gfn also can
         * caused mmio page fault and treat it as mmio access.
         */
-       if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
+       if (fault->pfn == KVM_PFN_ERR_RO_FAULT) {
+               if (kvm_arch_has_readonly_mem(vcpu->kvm))
+                       return -EFAULT;
+
                return RET_PF_EMULATE;
+       }
 
        if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
                kvm_send_hwpoison_signal(fault->slot, fault->gfn);
Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Posted by Paolo Bonzini 1 year, 5 months ago
On 8/9/24 21:02, Sean Christopherson wrote:
> Disallow read-only memory for all VMs with protected state, i.e. for
> upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
> to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> protections and SUPPRESS_VE=0.  But there is no strong use case for
> supporting read-only memslots on TDX, e.g. the main historical usage is
> to emulate option ROMs, but TDX disallows executing from shared memory.
> And if someone comes along with a legitimate, strong use case, the
> restriction can always be lifted for TDX.
> 
> Don't bother trying to retroactively apply the restriction to SEV-ES
> VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
> possibly work for SEV-ES, i.e. disallowing such memslots is really just
> means reporting an error to userspace instead of silently hanging vCPUs.
> Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> isn't worth the marginal benefit it would provide userspace.

Queuing this one for 6.11, thanks.

Paolo