[PATCH v6 05/11] KVM: guest_memfd: Add flag to remove from direct map

Posted by Roy, Patrick 2 weeks, 6 days ago
Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.

To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.

Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
guest_memfd itself being supported, but also on whether Linux can
manipulate the direct map at page granularity at all. This is possible
most of the time; the outliers are arm64, where it is impossible if the
direct map has been set up using hugepages (arm64 cannot break these
apart due to break-before-make semantics), and powerpc, which does not
select ARCH_HAS_SET_DIRECT_MAP (and does not support guest_memfd anyway).
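
For illustration, a minimal userspace sketch of the resulting
discover-and-create flow (vm_fd and guest_mem_size are assumed to be set
up elsewhere; error handling omitted):

	struct kvm_create_guest_memfd gmem = {
		.size = guest_mem_size,
	};

	/* Only request direct map removal where KVM reports support. */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP))
		gmem.flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;

	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);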

Note that this flag causes removal of direct map entries for all
guest_memfd folios, independent of whether they are "shared" or "private"
(although current guest_memfd only supports either all folios in the
"shared" state, or all folios in the "private" state if
GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
entries of even the shared parts of guest_memfd is a special type of
non-CoCo VM where host userspace is trusted to have access to all of
guest memory, but where Spectre-style transient execution attacks
through the host kernel's direct map should still be mitigated. In this
setup, KVM retains access to guest memory via userspace mappings of
guest_memfd, which are reflected back into KVM's memslots via
userspace_addr. This is needed for things like MMIO emulation on x86_64
to work.
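
As a sketch of that setup (assuming a guest_memfd created with both
GUEST_MEMFD_FLAG_MMAP and GUEST_MEMFD_FLAG_NO_DIRECT_MAP, reusing the
hypothetical gmem_fd from above; slot and addresses are made up):

	void *hva = mmap(NULL, guest_mem_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, gmem_fd, 0);

	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = guest_mem_size,
		.userspace_addr = (__u64)hva,
		.guest_memfd = gmem_fd,
	};

	/* KVM reaches guest memory through userspace_addr (e.g. for MMIO
	 * emulation), since the direct map entries are gone. */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);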

Do not perform TLB flushes after direct map manipulations. This is
because TLB flushes caused up to a 40x increase in guest_memfd page
fault latency (scaling with the number of CPU cores), or a 5x increase
in memory population time. TLB flushes are not needed for functional
correctness (the virt->phys mapping technically stays "correct", the
kernel simply must not use it for a while). On the other hand, this
means that the desired protection from Spectre-style attacks is not
perfect: an attacker could try to prevent a stale TLB entry from getting
evicted, keep it alive until the page it refers to is used by the guest
for some sensitive data, and then target it using a Spectre gadget.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 Documentation/virt/kvm/api.rst    |  5 ++++
 arch/arm64/include/asm/kvm_host.h | 12 ++++++++
 include/linux/kvm_host.h          |  7 +++++
 include/uapi/linux/kvm.h          |  2 ++
 virt/kvm/guest_memfd.c            | 49 +++++++++++++++++++++++++++----
 virt/kvm/kvm_main.c               |  5 ++++
 6 files changed, 75 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c17a87a0a5ac..b52c14d58798 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
 supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
 enables mmap() and faulting of guest_memfd memory to host userspace.
 
+When the capability KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP is supported, the 'flags'
+field supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Setting this flag makes the
+guest_memfd instance behave similarly to memfd_secret, and unmaps the memory
+backing it from the kernel's address space after allocation.
+
 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
 guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
 consumed from guest_memfd, regardless of whether it is a shared or a private
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2f2394cce24e..0bfd8e5fd9de 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -19,6 +19,7 @@
 #include <linux/maple_tree.h>
 #include <linux/percpu.h>
 #include <linux/psci.h>
+#include <linux/set_memory.h>
 #include <asm/arch_gicv3.h>
 #include <asm/barrier.h>
 #include <asm/cpufeature.h>
@@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
 void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
 void check_feature_map(void);
 
+#ifdef CONFIG_KVM_GUEST_MEMFD
+static inline bool kvm_arch_gmem_supports_no_direct_map(void)
+{
+	/*
+	 * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
+	 * as it calls dcache_clean_inval_poc().
+	 */
+	return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
+}
+#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
+#endif /* CONFIG_KVM_GUEST_MEMFD */
 
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1d0585616aa3..a9468bce55f2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -36,6 +36,7 @@
 #include <linux/rbtree.h>
 #include <linux/xarray.h>
 #include <asm/signal.h>
+#include <linux/set_memory.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
@@ -731,6 +732,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
 #endif
 
+#ifdef CONFIG_KVM_GUEST_MEMFD
+#ifndef kvm_arch_gmem_supports_no_direct_map
+#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
+#endif
+#endif /* CONFIG_KVM_GUEST_MEMFD */
+
 #ifndef kvm_arch_has_readonly_mem
 static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6efa98a57ec1..33c8e8946019 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -963,6 +963,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
 #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
 #define KVM_CAP_GUEST_MEMFD_MMAP 244
+#define KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP 245
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
@@ -1600,6 +1601,7 @@ struct kvm_memory_attributes {
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 #define GUEST_MEMFD_FLAG_MMAP	(1ULL << 0)
+#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 1)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 81028984ff89..3c64099fc98a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/set_memory.h>
 
 #include "kvm_mm.h"
 
@@ -42,9 +43,24 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 	return 0;
 }
 
-static inline void kvm_gmem_mark_prepared(struct folio *folio)
+static bool kvm_gmem_test_no_direct_map(struct inode *inode)
 {
-	folio_mark_uptodate(folio);
+	return ((unsigned long) inode->i_private) & GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+}
+
+static inline int kvm_gmem_mark_prepared(struct folio *folio)
+{
+	struct inode *inode = folio_inode(folio);
+	int r = 0;
+
+	if (kvm_gmem_test_no_direct_map(inode))
+		r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+						 false);
+
+	if (!r)
+		folio_mark_uptodate(folio);
+
+	return r;
 }
 
 /*
@@ -82,7 +98,7 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	index = ALIGN_DOWN(index, 1 << folio_order(folio));
 	r = __kvm_gmem_prepare_folio(kvm, slot, index, folio);
 	if (!r)
-		kvm_gmem_mark_prepared(folio);
+		r = kvm_gmem_mark_prepared(folio);
 
 	return r;
 }
@@ -344,8 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	}
 
 	if (!folio_test_uptodate(folio)) {
+		int err = 0;
+
 		clear_highpage(folio_page(folio, 0));
-		kvm_gmem_mark_prepared(folio);
+		err = kvm_gmem_mark_prepared(folio);
+
+		if (err) {
+			ret = vmf_error(err);
+			goto out_folio;
+		}
 	}
 
 	vmf->page = folio_file_page(folio, vmf->pgoff);
@@ -436,6 +459,16 @@ static void kvm_gmem_free_folio(struct address_space *mapping,
 	kvm_pfn_t pfn = page_to_pfn(page);
 	int order = folio_order(folio);
 
+	/*
+	 * Direct map restoration cannot fail, as the only error condition
+	 * for direct map manipulation is failure to allocate page tables
+	 * when splitting huge pages, but this split would have already
+	 * happened in set_direct_map_valid_noflush(..., false) in kvm_gmem_mark_prepared().
+	 * Thus set_direct_map_valid_noflush() here only updates prot bits.
+	 */
+	if (kvm_gmem_test_no_direct_map(mapping->host))
+		set_direct_map_valid_noflush(page, folio_nr_pages(folio), true);
+
 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
 }
 
@@ -500,6 +533,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
+	if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
+		mapping_set_no_direct_map(inode->i_mapping);
+
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
@@ -524,6 +560,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (kvm_arch_supports_gmem_mmap(kvm))
 		valid_flags |= GUEST_MEMFD_FLAG_MMAP;
 
+	if (kvm_arch_gmem_supports_no_direct_map())
+		valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
@@ -768,7 +807,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		p = src ? src + i * PAGE_SIZE : NULL;
 		ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
 		if (!ret)
-			kvm_gmem_mark_prepared(folio);
+			ret = kvm_gmem_mark_prepared(folio);
 
 put_folio_and_exit:
 		folio_put(folio);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 18f29ef93543..b5e702d95230 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -65,6 +65,7 @@
 #include <trace/events/kvm.h>
 
 #include <linux/kvm_dirty_ring.h>
+#include <linux/set_memory.h>
 
 
 /* Worst case buffer size needed for holding an integer. */
@@ -4916,6 +4917,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return kvm_supported_mem_attributes(kvm);
 #endif
 #ifdef CONFIG_KVM_GUEST_MEMFD
+	case KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP:
+		if (!kvm_arch_gmem_supports_no_direct_map())
+			return 0;
+		fallthrough;
 	case KVM_CAP_GUEST_MEMFD:
 		return 1;
 	case KVM_CAP_GUEST_MEMFD_MMAP:
-- 
2.50.1

Re: [PATCH v6 05/11] KVM: guest_memfd: Add flag to remove from direct map
Posted by Will Deacon 1 week, 6 days ago
Hi Patrick,

We chatted briefly at KVM Forum, so I wanted to chime in here too from
the arm64 side.

On Fri, Sep 12, 2025 at 09:17:37AM +0000, Roy, Patrick wrote:
> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
> ioctl. When set, guest_memfd folios will be removed from the direct map
> after preparation, with direct map entries only restored when the folios
> are freed.
> 
> To ensure these folios do not end up in places where the kernel cannot
> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
> 
> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
> guest_memfd itself being supported, but also on whether Linux can
> manipulate the direct map at page granularity at all. This is possible
> most of the time; the outliers are arm64, where it is impossible if the
> direct map has been set up using hugepages (arm64 cannot break these
> apart due to break-before-make semantics), and powerpc, which does not
> select ARCH_HAS_SET_DIRECT_MAP (and does not support guest_memfd anyway).
> 
> Note that this flag causes removal of direct map entries for all
> guest_memfd folios, independent of whether they are "shared" or "private"
> (although current guest_memfd only supports either all folios in the
> "shared" state, or all folios in the "private" state if
> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
> entries of even the shared parts of guest_memfd is a special type of
> non-CoCo VM where host userspace is trusted to have access to all of
> guest memory, but where Spectre-style transient execution attacks
> through the host kernel's direct map should still be mitigated. In this
> setup, KVM retains access to guest memory via userspace mappings of
> guest_memfd, which are reflected back into KVM's memslots via
> userspace_addr. This is needed for things like MMIO emulation on x86_64
> to work.
> 
> Do not perform TLB flushes after direct map manipulations. This is
> because TLB flushes caused up to a 40x increase in guest_memfd page
> fault latency (scaling with the number of CPU cores), or a 5x increase
> in memory population time. TLB flushes are not needed for functional
> correctness (the virt->phys mapping technically stays "correct", the
> kernel simply must not use it for a while). On the other hand, this
> means that the desired protection from Spectre-style attacks is not
> perfect: an attacker could try to prevent a stale TLB entry from getting
> evicted, keep it alive until the page it refers to is used by the guest
> for some sensitive data, and then target it using a Spectre gadget.

I'm really not keen on this last part (at least, for arm64).

If you're not going to bother invalidating the TLB after unmapping from
the direct map because of performance reasons, you're better off just
leaving the direct map intact and getting even better performance. On
arm64, that would mean you could use block mappings too.

On the other hand, if you actually care about the security properties
from the unmap then you need the invalidation so that the mapping
doesn't linger around. With "modern" CPU features such as pte
aggregation and shared TLB walk caches it's not unlikely that these
entries will persist a lot longer than you think and it makes the
security benefits of this series impossible to reason about.

As a compromise, could we make the TLB invalidation an architecture
opt-in so that we can have it enabled on arm64, please?

Will
Re: [PATCH v6 05/11] KVM: guest_memfd: Add flag to remove from direct map
Posted by Roy, Patrick 1 week, 6 days ago
Hi Will,

On Thu, 2025-09-18 at 21:21 +0100, Will Deacon wrote:
> Hi Patrick,
> We chatted briefly at KVM Forum, so I wanted to chime in here too from
> the arm64 side.
> 
> On Fri, Sep 12, 2025 at 09:17:37AM +0000, Roy, Patrick wrote:
>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct map
>> after preparation, with direct map entries only restored when the folios
>> are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>
>> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
>> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
>> guest_memfd itself being supported, but also on whether Linux can
>> manipulate the direct map at page granularity at all. This is possible
>> most of the time; the outliers are arm64, where it is impossible if the
>> direct map has been set up using hugepages (arm64 cannot break these
>> apart due to break-before-make semantics), and powerpc, which does not
>> select ARCH_HAS_SET_DIRECT_MAP (and does not support guest_memfd anyway).
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios, independent of whether they are "shared" or "private"
>> (although current guest_memfd only supports either all folios in the
>> "shared" state, or all folios in the "private" state if
>> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
>> entries of even the shared parts of guest_memfd is a special type of
>> non-CoCo VM where host userspace is trusted to have access to all of
>> guest memory, but where Spectre-style transient execution attacks
>> through the host kernel's direct map should still be mitigated. In this
>> setup, KVM retains access to guest memory via userspace mappings of
>> guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on x86_64
>> to work.
>>
>> Do not perform TLB flushes after direct map manipulations. This is
>> because TLB flushes caused up to a 40x increase in guest_memfd page
>> fault latency (scaling with the number of CPU cores), or a 5x increase
>> in memory population time. TLB flushes are not needed for functional
>> correctness (the virt->phys mapping technically stays "correct", the
>> kernel simply must not use it for a while). On the other hand, this
>> means that the desired protection from Spectre-style attacks is not
>> perfect: an attacker could try to prevent a stale TLB entry from getting
>> evicted, keep it alive until the page it refers to is used by the guest
>> for some sensitive data, and then target it using a Spectre gadget.
> 
> I'm really not keen on this last part (at least, for arm64).
> 
> If you're not going to bother invalidating the TLB after unmapping from
> the direct map because of performance reasons, you're better off just
> leaving the direct map intact and getting even better performance. On
> arm64, that would mean you could use block mappings too.

Not until we have hardware with the newest BBM goodies, I thought. When I
checked earlier this year, a defconfig had the direct map set up at 4K
granularity.

> On the other hand, if you actually care about the security properties
> from the unmap then you need the invalidation so that the mapping
> doesn't linger around. With "modern" CPU features such as pte
> aggregation and shared TLB walk caches it's not unlikely that these
> entries will persist a lot longer than you think and it makes the
> security benefits of this series impossible to reason about.

I agree it's not 100% protection, but it is still better than the status quo. I
would also love to have the TLB flushes, but sadly the performance impact of
them would make this completely unusable for Amazon :/

Mh, thinking about it now though, iirc the performance problems were mostly
because all CPUs needed to acknowledge the flush before the issuing CPU could
continue. Is there a way to have "fire and forget" flushes, where we don't wait
for acknowledgements?

> As a compromise, could we make the TLB invalidation an architecture
> opt-in so that we can have it enabled on arm64, please?

How about instead of an architecture opt-in, we have some sort of opt-out flag
userspace can set? Similar to the PFNMAP stuff KVM already has.

> Will

Best,
Patrick
Re: [PATCH v6 05/11] KVM: guest_memfd: Add flag to remove from direct map
Posted by Mike Rapoport 2 weeks, 4 days ago
On Fri, Sep 12, 2025 at 09:17:37AM +0000, Roy, Patrick wrote:
> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
> ioctl. When set, guest_memfd folios will be removed from the direct map
> after preparation, with direct map entries only restored when the folios
> are freed.
> 
> To ensure these folios do not end up in places where the kernel cannot
> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
> 
> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
> guest_memfd itself being supported, but also on whether Linux can
> manipulate the direct map at page granularity at all. This is possible
> most of the time; the outliers are arm64, where it is impossible if the
> direct map has been set up using hugepages (arm64 cannot break these
> apart due to break-before-make semantics), and powerpc, which does not
> select ARCH_HAS_SET_DIRECT_MAP (and does not support guest_memfd anyway).
> 
> Note that this flag causes removal of direct map entries for all
> guest_memfd folios, independent of whether they are "shared" or "private"
> (although current guest_memfd only supports either all folios in the
> "shared" state, or all folios in the "private" state if
> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
> entries of even the shared parts of guest_memfd is a special type of
> non-CoCo VM where host userspace is trusted to have access to all of
> guest memory, but where Spectre-style transient execution attacks
> through the host kernel's direct map should still be mitigated. In this
> setup, KVM retains access to guest memory via userspace mappings of
> guest_memfd, which are reflected back into KVM's memslots via
> userspace_addr. This is needed for things like MMIO emulation on x86_64
> to work.
> 
> Do not perform TLB flushes after direct map manipulations. This is
> because TLB flushes caused up to a 40x increase in guest_memfd page
> fault latency (scaling with the number of CPU cores), or a 5x increase
> in memory population time. TLB flushes are not needed for functional
> correctness (the virt->phys mapping technically stays "correct", the
> kernel simply must not use it for a while). On the other hand, this
> means that the desired protection from Spectre-style attacks is not
> perfect: an attacker could try to prevent a stale TLB entry from getting
> evicted, keep it alive until the page it refers to is used by the guest
> for some sensitive data, and then target it using a Spectre gadget.
> 
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

With a few nits below

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  Documentation/virt/kvm/api.rst    |  5 ++++
>  arch/arm64/include/asm/kvm_host.h | 12 ++++++++
>  include/linux/kvm_host.h          |  7 +++++
>  include/uapi/linux/kvm.h          |  2 ++
>  virt/kvm/guest_memfd.c            | 49 +++++++++++++++++++++++++++----
>  virt/kvm/kvm_main.c               |  5 ++++
>  6 files changed, 75 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index c17a87a0a5ac..b52c14d58798 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
>  supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
>  enables mmap() and faulting of guest_memfd memory to host userspace.
>  
> +When the capability KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP is supported, the 'flags'
> +field supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Setting this flag makes the
> +guest_memfd instance behave similarly to memfd_secret, and unmaps the memory
> +backing it from the kernel's address space after allocation.
> +
>  When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>  guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>  consumed from guest_memfd, regardless of whether it is a shared or a private
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 2f2394cce24e..0bfd8e5fd9de 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -19,6 +19,7 @@
>  #include <linux/maple_tree.h>
>  #include <linux/percpu.h>
>  #include <linux/psci.h>
> +#include <linux/set_memory.h>
>  #include <asm/arch_gicv3.h>
>  #include <asm/barrier.h>
>  #include <asm/cpufeature.h>
> @@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
>  void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
>  void check_feature_map(void);
>  
> +#ifdef CONFIG_KVM_GUEST_MEMFD
> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
> +{
> +	/*
> +	 * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
> +	 * as it calls dcache_clean_inval_poc().
> +	 */
> +	return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
> +}
> +#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>  
>  #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1d0585616aa3..a9468bce55f2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -36,6 +36,7 @@
>  #include <linux/rbtree.h>
>  #include <linux/xarray.h>
>  #include <asm/signal.h>
> +#include <linux/set_memory.h>

The set_memory APIs are not used in the header, no need to include it here.

>  #include <linux/kvm.h>
>  #include <linux/kvm_para.h>
> @@ -731,6 +732,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>  bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
>  #endif
>  
> +#ifdef CONFIG_KVM_GUEST_MEMFD
> +#ifndef kvm_arch_gmem_supports_no_direct_map
> +#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
> +#endif
> +#endif /* CONFIG_KVM_GUEST_MEMFD */
> +
>  #ifndef kvm_arch_has_readonly_mem
>  static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
>  {
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6efa98a57ec1..33c8e8946019 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -963,6 +963,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_RISCV_MP_STATE_RESET 242
>  #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
>  #define KVM_CAP_GUEST_MEMFD_MMAP 244
> +#define KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP 245
>  
>  struct kvm_irq_routing_irqchip {
>  	__u32 irqchip;
> @@ -1600,6 +1601,7 @@ struct kvm_memory_attributes {
>  
>  #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
>  #define GUEST_MEMFD_FLAG_MMAP	(1ULL << 0)
> +#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 1)
>  
>  struct kvm_create_guest_memfd {
>  	__u64 size;
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 81028984ff89..3c64099fc98a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>  #include <linux/kvm_host.h>
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/set_memory.h>
>  
>  #include "kvm_mm.h"
>  
> @@ -42,9 +43,24 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>  	return 0;
>  }
>  
> -static inline void kvm_gmem_mark_prepared(struct folio *folio)
> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>  {
> -	folio_mark_uptodate(folio);
> +	return ((unsigned long) inode->i_private) & GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
> +}
> +
> +static inline int kvm_gmem_mark_prepared(struct folio *folio)
> +{
> +	struct inode *inode = folio_inode(folio);
> +	int r = 0;
> +
> +	if (kvm_gmem_test_no_direct_map(inode))
> +		r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
> +						 false);
> +
> +	if (!r)
> +		folio_mark_uptodate(folio);
> +
> +	return r;
>  }
>  
>  /*
> @@ -82,7 +98,7 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	index = ALIGN_DOWN(index, 1 << folio_order(folio));
>  	r = __kvm_gmem_prepare_folio(kvm, slot, index, folio);
>  	if (!r)
> -		kvm_gmem_mark_prepared(folio);
> +		r = kvm_gmem_mark_prepared(folio);

If this fails, shouldn't we undo __kvm_gmem_prepare_folio()?

>  
>  	return r;
>  }
> @@ -344,8 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>  	}
>  
>  	if (!folio_test_uptodate(folio)) {
> +		int err = 0;
> +
>  		clear_highpage(folio_page(folio, 0));
> -		kvm_gmem_mark_prepared(folio);
> +		err = kvm_gmem_mark_prepared(folio);
> +
> +		if (err) {
> +			ret = vmf_error(err);
> +			goto out_folio;
> +		}
>  	}
>  
>  	vmf->page = folio_file_page(folio, vmf->pgoff);
> @@ -436,6 +459,16 @@ static void kvm_gmem_free_folio(struct address_space *mapping,
>  	kvm_pfn_t pfn = page_to_pfn(page);
>  	int order = folio_order(folio);
>  
> +	/*
> +	 * Direct map restoration cannot fail, as the only error condition
> +	 * for direct map manipulation is failure to allocate page tables
> +	 * when splitting huge pages, but this split would have already
> +      * happened in set_direct_map_valid_noflush(..., false) in kvm_gmem_mark_prepared().
> +	 * Thus set_direct_map_valid_noflush() here only updates prot bits.
> +	 */
> +	if (kvm_gmem_test_no_direct_map(mapping->host))
> +		set_direct_map_valid_noflush(page, folio_nr_pages(folio), true);
> +
>  	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>  }
>  
> @@ -500,6 +533,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  	/* Unmovable mappings are supposed to be marked unevictable as well. */
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>  
> +	if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
> +		mapping_set_no_direct_map(inode->i_mapping);
> +
>  	kvm_get_kvm(kvm);
>  	gmem->kvm = kvm;
>  	xa_init(&gmem->bindings);
> @@ -524,6 +560,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	if (kvm_arch_supports_gmem_mmap(kvm))
>  		valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>  
> +	if (kvm_arch_gmem_supports_no_direct_map())
> +		valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
> +
>  	if (flags & ~valid_flags)
>  		return -EINVAL;
>  
> @@ -768,7 +807,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>  		p = src ? src + i * PAGE_SIZE : NULL;
>  		ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
>  		if (!ret)
> -			kvm_gmem_mark_prepared(folio);
> +			ret = kvm_gmem_mark_prepared(folio);
>  
>  put_folio_and_exit:
>  		folio_put(folio);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 18f29ef93543..b5e702d95230 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -65,6 +65,7 @@
>  #include <trace/events/kvm.h>
>  
>  #include <linux/kvm_dirty_ring.h>
> +#include <linux/set_memory.h>
>  
>  
>  /* Worst case buffer size needed for holding an integer. */
> @@ -4916,6 +4917,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  		return kvm_supported_mem_attributes(kvm);
>  #endif
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> +	case KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP:
> +		if (!kvm_arch_gmem_supports_no_direct_map())
> +			return 0;
> +		fallthrough;
>  	case KVM_CAP_GUEST_MEMFD:
>  		return 1;
>  	case KVM_CAP_GUEST_MEMFD_MMAP:
> -- 
> 2.50.1
> 

-- 
Sincerely yours,
Mike.
Re: [PATCH v6 05/11] KVM: guest_memfd: Add flag to remove from direct map
Posted by Roy, Patrick 1 week, 6 days ago
Hi Mike,

...

>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 2f2394cce24e..0bfd8e5fd9de 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -19,6 +19,7 @@
>>  #include <linux/maple_tree.h>
>>  #include <linux/percpu.h>
>>  #include <linux/psci.h>
>> +#include <linux/set_memory.h>
>>  #include <asm/arch_gicv3.h>
>>  #include <asm/barrier.h>
>>  #include <asm/cpufeature.h>
>> @@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
>>  void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
>>  void check_feature_map(void);
>>
>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
>> +{
>> +     /*
>> +      * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
>> +      * as it calls dcache_clean_inval_poc().
>> +      */
>> +     return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
>> +}
>> +#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
>> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>>
>>  #endif /* __ARM64_KVM_HOST_H__ */
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 1d0585616aa3..a9468bce55f2 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -36,6 +36,7 @@
>>  #include <linux/rbtree.h>
>>  #include <linux/xarray.h>
>>  #include <asm/signal.h>
>> +#include <linux/set_memory.h>
> 
> The set_memory APIs are not used in the header, no need to include it here.
> 

Ack!

>>  #include <linux/kvm.h>
>>  #include <linux/kvm_para.h>
>> @@ -731,6 +732,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>>  bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
>>  #endif
>>
>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>> +#ifndef kvm_arch_gmem_supports_no_direct_map
>> +#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
>> +#endif
>> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>> +
>>  #ifndef kvm_arch_has_readonly_mem
>>  static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
>>  {
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6efa98a57ec1..33c8e8946019 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -963,6 +963,7 @@ struct kvm_enable_cap {
>>  #define KVM_CAP_RISCV_MP_STATE_RESET 242
>>  #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
>>  #define KVM_CAP_GUEST_MEMFD_MMAP 244
>> +#define KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP 245
>>
>>  struct kvm_irq_routing_irqchip {
>>       __u32 irqchip;
>> @@ -1600,6 +1601,7 @@ struct kvm_memory_attributes {
>>
>>  #define KVM_CREATE_GUEST_MEMFD       _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
>>  #define GUEST_MEMFD_FLAG_MMAP        (1ULL << 0)
>> +#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 1)
>>
>>  struct kvm_create_guest_memfd {
>>       __u64 size;
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 81028984ff89..3c64099fc98a 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/kvm_host.h>
>>  #include <linux/pagemap.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/set_memory.h>
>>
>>  #include "kvm_mm.h"
>>
>> @@ -42,9 +43,24 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>>       return 0;
>>  }
>>
>> -static inline void kvm_gmem_mark_prepared(struct folio *folio)
>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>>  {
>> -     folio_mark_uptodate(folio);
>> +     return ((unsigned long) inode->i_private) & GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
>> +}
>> +
>> +static inline int kvm_gmem_mark_prepared(struct folio *folio)
>> +{
>> +     struct inode *inode = folio_inode(folio);
>> +     int r = 0;
>> +
>> +     if (kvm_gmem_test_no_direct_map(inode))
>> +             r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +                                              false);
>> +
>> +     if (!r)
>> +             folio_mark_uptodate(folio);
>> +
>> +     return r;
>>  }
>>
>>  /*
>> @@ -82,7 +98,7 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>>       index = ALIGN_DOWN(index, 1 << folio_order(folio));
>>       r = __kvm_gmem_prepare_folio(kvm, slot, index, folio);
>>       if (!r)
>> -             kvm_gmem_mark_prepared(folio);
>> +             r = kvm_gmem_mark_prepared(folio);
> 
> If this fails, shouldn't we undo __kvm_gmem_prepare_folio()?
>

Yes, good point. I'm not sure if we can undo preparation (it's only used
by AMD-SEV right now, for passing off the page to the CoCo context). But
not undoing it means that guest_memfd will consider the page unprepared,
and zero it again the next time it's accessed, which will cause a
machine check because the page has already been passed off to the
confidential world.

We talked about this in the guest_memfd upstream call yesterday, and
decided that in addition to this problem, we want to separate
preparedness tracking from direct map removal state tracking anyway (and
move preparedness tracking outside of guest_memfd into the arch specific
code). And if direct map state and preparedness are separate bits, then
we can accurately record the state of "preparation worked but direct map
removal failed".

>>
>>       return r;
>>  }
>> @@ -344,8 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>       }
>>
>>       if (!folio_test_uptodate(folio)) {
>> +             int err = 0;
>> +
>>               clear_highpage(folio_page(folio, 0));
>> -             kvm_gmem_mark_prepared(folio);
>> +             err = kvm_gmem_mark_prepared(folio);
>> +
>> +             if (err) {
>> +                     ret = vmf_error(err);
>> +                     goto out_folio;
>> +             }
>>       }
>>
>>       vmf->page = folio_file_page(folio, vmf->pgoff);
>> @@ -436,6 +459,16 @@ static void kvm_gmem_free_folio(struct address_space *mapping,
>>       kvm_pfn_t pfn = page_to_pfn(page);
>>       int order = folio_order(folio);
>>
>> +     /*
>> +      * Direct map restoration cannot fail, as the only error condition
>> +      * for direct map manipulation is failure to allocate page tables
>> +      * when splitting huge pages, but this split would have already
>> +      * happened in set_direct_map_valid_noflush(..., false) in kvm_gmem_mark_prepared().
>> +      * Thus set_direct_map_valid_noflush() here only updates prot bits.
>> +      */
>> +     if (kvm_gmem_test_no_direct_map(mapping->host))
>> +             set_direct_map_valid_noflush(page, folio_nr_pages(folio), true);
>> +
>>       kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>>  }
>>
>> @@ -500,6 +533,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>>       /* Unmovable mappings are supposed to be marked unevictable as well. */
>>       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>
>> +     if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
>> +             mapping_set_no_direct_map(inode->i_mapping);
>> +
>>       kvm_get_kvm(kvm);
>>       gmem->kvm = kvm;
>>       xa_init(&gmem->bindings);
>> @@ -524,6 +560,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>       if (kvm_arch_supports_gmem_mmap(kvm))
>>               valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>>
>> +     if (kvm_arch_gmem_supports_no_direct_map())
>> +             valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
>> +
>>       if (flags & ~valid_flags)
>>               return -EINVAL;
>>
>> @@ -768,7 +807,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>>               p = src ? src + i * PAGE_SIZE : NULL;
>>               ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
>>               if (!ret)
>> -                     kvm_gmem_mark_prepared(folio);
>> +                     ret = kvm_gmem_mark_prepared(folio);
>>
>>  put_folio_and_exit:
>>               folio_put(folio);
...

>
> Sincerely yours,
> Mike.
Best, 
Patrick