From: Patrick Roy <patrick.roy@linux.dev>
Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.
To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
Note that this flag causes removal of direct map entries for all
guest_memfd folios independent of whether they are "shared" or "private"
(although current guest_memfd only supports either all folios in the
"shared" state, or all folios in the "private" state if
GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
entries of the shared parts of guest_memfd as well is a special type of
non-CoCo VM where host userspace is trusted to have access to all of
guest memory, but where Spectre-style transient execution attacks
through the host kernel's direct map should still be mitigated. In this
setup, KVM retains access to guest memory via userspace mappings of
guest_memfd, which are reflected back into KVM's memslots via
userspace_addr. This is needed for things like MMIO emulation on x86_64
to work.
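For illustration, the userspace side of such a setup might look roughly
like the sketch below (hypothetical: MEM_SIZE is a placeholder, error
handling is elided, and the memslot wiring uses the
KVM_SET_USER_MEMORY_REGION2 ABI):

	struct kvm_create_guest_memfd gmem = {
		.size  = MEM_SIZE,
		.flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED |
			 GUEST_MEMFD_FLAG_NO_DIRECT_MAP,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* Host userspace mapping of guest memory, e.g. for MMIO emulation. */
	void *mem = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
			 gmem_fd, 0);

	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr    = 0,
		.memory_size        = MEM_SIZE,
		.userspace_addr     = (__u64)mem, /* reflected into the memslot */
		.guest_memfd        = gmem_fd,
		.guest_memfd_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);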
Direct map entries are zapped right before guest or userspace mappings
of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
a gmem folio can be allocated without being mapped anywhere is
kvm_gmem_populate(), where handling potential failures of direct map
removal is not possible (by the time direct map removal is attempted,
the folio is already marked as prepared, meaning attempting to re-try
kvm_gmem_populate() would just result in -EEXIST without fixing up the
direct map state). These folios are then removed from the direct map
upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
Documentation/virt/kvm/api.rst | 22 ++++++++------
include/linux/kvm_host.h | 12 ++++++++
include/uapi/linux/kvm.h | 1 +
virt/kvm/guest_memfd.c | 54 ++++++++++++++++++++++++++++++++++
4 files changed, 80 insertions(+), 9 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 01a3abef8abb..c5f54f1370c8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6440,15 +6440,19 @@ a single guest_memfd file, but the bound ranges must not overlap).
The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be
specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags:
- ============================ ================================================
- GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
- descriptor.
- GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
- KVM_CREATE_GUEST_MEMFD (memory files created
- without INIT_SHARED will be marked private).
- Shared memory can be faulted into host userspace
- page tables. Private memory cannot.
- ============================ ================================================
+ ============================== ================================================
+ GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
+ descriptor.
+ GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
+ KVM_CREATE_GUEST_MEMFD (memory files created
+ without INIT_SHARED will be marked private).
+ Shared memory can be faulted into host userspace
+ page tables. Private memory cannot.
+ GUEST_MEMFD_FLAG_NO_DIRECT_MAP The guest_memfd instance will behave similarly
+ to memfd_secret, and unmaps the memory backing
+ it from the kernel's address space before
+ being passed off to userspace or the guest.
+ ============================== ================================================
When the KVM MMU performs a PFN lookup to service a guest fault and the backing
guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27796a09d29b..d4d5306075bf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -738,10 +738,22 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+ if (kvm_arch_gmem_supports_no_direct_map())
+ flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+
return flags;
}
#endif
+#ifdef CONFIG_KVM_GUEST_MEMFD
+#ifndef kvm_arch_gmem_supports_no_direct_map
+static inline bool kvm_arch_gmem_supports_no_direct_map(void)
+{
+ return false;
+}
+#endif
+#endif /* CONFIG_KVM_GUEST_MEMFD */
+
#ifndef kvm_arch_has_readonly_mem
static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
{
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dddb781b0507..60341e1ba1be 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1612,6 +1612,7 @@ struct kvm_memory_attributes {
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
#define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
#define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
+#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 2)
struct kvm_create_guest_memfd {
__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 92e7f8c1f303..43f64c11467a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,9 @@
#include <linux/mempolicy.h>
#include <linux/pseudo_fs.h>
#include <linux/pagemap.h>
+#include <linux/set_memory.h>
+
+#include <asm/tlbflush.h>
#include "kvm_mm.h"
@@ -76,6 +79,43 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
return 0;
}
+#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
+
+static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
+{
+ return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
+{
+ u64 gmem_flags = GMEM_I(folio_inode(folio))->flags;
+ int r = 0;
+
+ if (kvm_gmem_folio_no_direct_map(folio) || !(gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP))
+ goto out;
+
+ folio->private = (void *)((u64)folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+ r = folio_zap_direct_map(folio);
+
+out:
+ return r;
+}
+
+static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
+{
+ /*
+ * Direct map restoration cannot fail, as the only error condition
+ * for direct map manipulation is failure to allocate page tables
+ * when splitting huge pages, but this split would have already
+ * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
+ * Thus folio_restore_direct_map() here only updates prot bits.
+ */
+ if (kvm_gmem_folio_no_direct_map(folio)) {
+ WARN_ON_ONCE(folio_restore_direct_map(folio));
+ folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+ }
+}
+
static inline void kvm_gmem_mark_prepared(struct folio *folio)
{
folio_mark_uptodate(folio);
@@ -398,6 +438,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
struct folio *folio;
vm_fault_t ret = VM_FAULT_LOCKED;
+ int err;
if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
return VM_FAULT_SIGBUS;
@@ -423,6 +464,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
kvm_gmem_mark_prepared(folio);
}
+ err = kvm_gmem_folio_zap_direct_map(folio);
+ if (err) {
+ ret = vmf_error(err);
+ goto out_folio;
+ }
+
vmf->page = folio_file_page(folio, vmf->pgoff);
out_folio:
@@ -533,6 +580,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
kvm_pfn_t pfn = page_to_pfn(page);
int order = folio_order(folio);
+ kvm_gmem_folio_restore_direct_map(folio);
+
kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
}
@@ -596,6 +645,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+ if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
+ mapping_set_no_direct_map(inode->i_mapping);
+
GMEM_I(inode)->flags = flags;
file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
@@ -807,6 +859,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
if (!is_prepared)
r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
+ kvm_gmem_folio_zap_direct_map(folio);
+
folio_unlock(folio);
if (!r)
--
2.50.1
On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
> +{
> + /*
> + * Direct map restoration cannot fail, as the only error condition
> + * for direct map manipulation is failure to allocate page tables
> + * when splitting huge pages, but this split would have already
> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
> + * Thus folio_restore_direct_map() here only updates prot bits.
> + */
> + if (kvm_gmem_folio_no_direct_map(folio)) {
> + WARN_ON_ONCE(folio_restore_direct_map(folio));
> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
> + }
> +}
> +
Does this assume the folio would not have been split after it was zapped? As in,
if it was zapped at 2MB granularity (no 4KB direct map split required) but then
restored at 4KB (split required)? Or it gets merged somehow before this?
On 16/01/2026 00:00, Edgecombe, Rick P wrote:
> On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>> +{
>> + /*
>> + * Direct map restoration cannot fail, as the only error condition
>> + * for direct map manipulation is failure to allocate page tables
>> + * when splitting huge pages, but this split would have already
>> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
>> + * Thus folio_restore_direct_map() here only updates prot bits.
>> + */
>> + if (kvm_gmem_folio_no_direct_map(folio)) {
>> + WARN_ON_ONCE(folio_restore_direct_map(folio));
>> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>> + }
>> +}
>> +
>
> Does this assume the folio would not have been split after it was zapped? As in,
> if it was zapped at 2MB granularity (no 4KB direct map split required) but then
> restored at 4KB (split required)? Or it gets merged somehow before this?
AFAIK it can't be zapped at 2MB granularity as the zapping code will
inevitably cause splitting because guest_memfd faults occur at the base
page granularity as of now.
Nikita Kalyazin <kalyazin@amazon.com> writes:
> On 16/01/2026 00:00, Edgecombe, Rick P wrote:
>> On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
>>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>>> +{
>>> + /*
>>> + * Direct map restoration cannot fail, as the only error condition
>>> + * for direct map manipulation is failure to allocate page tables
>>> + * when splitting huge pages, but this split would have already
>>> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
Do you know if folio_restore_direct_map() will also end up merging page
table entries to a higher level?
>>> + * Thus folio_restore_direct_map() here only updates prot bits.
>>> + */
>>> + if (kvm_gmem_folio_no_direct_map(folio)) {
>>> + WARN_ON_ONCE(folio_restore_direct_map(folio));
>>> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>>> + }
>>> +}
>>> +
>>
>> Does this assume the folio would not have been split after it was zapped? As in,
>> if it was zapped at 2MB granularity (no 4KB direct map split required) but then
>> restored at 4KB (split required)? Or it gets merged somehow before this?
I agree with the rest of the discussion that this will probably land
before huge page support, so I will have to figure out the intersection
of the two later.
>
> AFAIK it can't be zapped at 2MB granularity as the zapping code will
> inevitably cause splitting because guest_memfd faults occur at the base
> page granularity as of now.
Here's what I'm thinking for now:
[HugeTLB, no conversions]
With initial HugeTLB support (no conversions), host userspace
guest_memfd faults will be:
+ For guest_memfd with PUD-sized pages
+ At PUD level or PTE level
+ For guest_memfd with PMD-sized pages
+ At PMD level or PTE level
Since this guest_memfd doesn't support conversions, the folio is never
split/merged, so the direct map is restored at whatever level it was
zapped. I think this works out well.
[HugeTLB + conversions]
For a guest_memfd with HugeTLB support and conversions, host userspace
guest_memfd faults will always be at PTE level, so the direct map will
be split and the faulted pages have the direct map zapped in 4K chunks
as they are faulted.
On conversion back to private, put those back into the direct map
(putting aside whether to merge the direct map PTEs for now).
Unfortunately there's no unmapping callback for guest_memfd to use, so
perhaps the principle should be to put the folios back into the direct
map ASAP - at unmapping if guest_memfd is doing the unmapping, otherwise
at freeing time?
On 22/01/2026 18:37, Ackerley Tng wrote:
> Nikita Kalyazin <kalyazin@amazon.com> writes:
>
>> On 16/01/2026 00:00, Edgecombe, Rick P wrote:
>>> On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
>>>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>>>> +{
>>>> + /*
>>>> + * Direct map restoration cannot fail, as the only error condition
>>>> + * for direct map manipulation is failure to allocate page tables
>>>> + * when splitting huge pages, but this split would have already
>>>> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
>
> Do you know if folio_restore_direct_map() will also end up merging page
> table entries to a higher level?
>
>>>> + * Thus folio_restore_direct_map() here only updates prot bits.
>>>> + */
>>>> + if (kvm_gmem_folio_no_direct_map(folio)) {
>>>> + WARN_ON_ONCE(folio_restore_direct_map(folio));
>>>> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>>>> + }
>>>> +}
>>>> +
>>>
>>> Does this assume the folio would not have been split after it was zapped? As in,
>>> if it was zapped at 2MB granularity (no 4KB direct map split required) but then
>>> restored at 4KB (split required)? Or it gets merged somehow before this?
>
> I agree with the rest of the discussion that this will probably land
> before huge page support, so I will have to figure out the intersection
> of the two later.
>
>>
>> AFAIK it can't be zapped at 2MB granularity as the zapping code will
>> inevitably cause splitting because guest_memfd faults occur at the base
>> page granularity as of now.
>
> Here's what I'm thinking for now:
>
> [HugeTLB, no conversions]
> With initial HugeTLB support (no conversions), host userspace
> guest_memfd faults will be:
>
> + For guest_memfd with PUD-sized pages
> + At PUD level or PTE level
> + For guest_memfd with PMD-sized pages
> + At PMD level or PTE level
>
> Since this guest_memfd doesn't support conversions, the folio is never
> split/merged, so the direct map is restored at whatever level it was
> zapped. I think this works out well.
>
> [HugeTLB + conversions]
> For a guest_memfd with HugeTLB support and conversions, host userspace
> guest_memfd faults will always be at PTE level, so the direct map will
> be split and the faulted pages have the direct map zapped in 4K chunks
> as they are faulted.
>
> On conversion back to private, put those back into the direct map
> (putting aside whether to merge the direct map PTEs for now).
Makes sense to me.
>
>
> Unfortunately there's no unmapping callback for guest_memfd to use, so
> perhaps the principle should be to put the folios back into the direct
> map ASAP - at unmapping if guest_memfd is doing the unmapping, otherwise
> at freeing time?
I'm not sure I fully understand what you mean here. What would be the
purpose for hooking up to unmapping? Why would making sure we put
folios back into the direct map whenever they are freed or converted to
private not be sufficient?
Nikita Kalyazin <kalyazin@amazon.com> writes:
> On 22/01/2026 18:37, Ackerley Tng wrote:
>> Nikita Kalyazin <kalyazin@amazon.com> writes:
>>
>>> On 16/01/2026 00:00, Edgecombe, Rick P wrote:
>>>> On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
>>>>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>>>>> +{
>>>>> + /*
>>>>> + * Direct map restoration cannot fail, as the only error condition
>>>>> + * for direct map manipulation is failure to allocate page tables
>>>>> + * when splitting huge pages, but this split would have already
>>>>> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
>>
>> Do you know if folio_restore_direct_map() will also end up merging page
>> table entries to a higher level?
>>
>>>>> + * Thus folio_restore_direct_map() here only updates prot bits.
>>>>> + */
>>>>> + if (kvm_gmem_folio_no_direct_map(folio)) {
>>>>> + WARN_ON_ONCE(folio_restore_direct_map(folio));
>>>>> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>>>>> + }
>>>>> +}
>>>>> +
>>>>
>>>> Does this assume the folio would not have been split after it was zapped? As in,
>>>> if it was zapped at 2MB granularity (no 4KB direct map split required) but then
>>>> restored at 4KB (split required)? Or it gets merged somehow before this?
>>
>> I agree with the rest of the discussion that this will probably land
>> before huge page support, so I will have to figure out the intersection
>> of the two later.
>>
>>>
>>> AFAIK it can't be zapped at 2MB granularity as the zapping code will
>>> inevitably cause splitting because guest_memfd faults occur at the base
>>> page granularity as of now.
>>
>> Here's what I'm thinking for now:
>>
>> [HugeTLB, no conversions]
>> With initial HugeTLB support (no conversions), host userspace
>> guest_memfd faults will be:
>>
>> + For guest_memfd with PUD-sized pages
>> + At PUD level or PTE level
>> + For guest_memfd with PMD-sized pages
>> + At PMD level or PTE level
>>
>> Since this guest_memfd doesn't support conversions, the folio is never
>> split/merged, so the direct map is restored at whatever level it was
>> zapped. I think this works out well.
>>
>> [HugeTLB + conversions]
>> For a guest_memfd with HugeTLB support and conversions, host userspace
>> guest_memfd faults will always be at PTE level, so the direct map will
>> be split and the faulted pages have the direct map zapped in 4K chunks
>> as they are faulted.
>>
>> On conversion back to private, put those back into the direct map
>> (putting aside whether to merge the direct map PTEs for now).
>
> Makes sense to me.
>
>>
>>
>> Unfortunately there's no unmapping callback for guest_memfd to use, so
>> perhaps the principle should be to put the folios back into the direct
>> map ASAP - at unmapping if guest_memfd is doing the unmapping, otherwise
>> at freeing time?
>
> I'm not sure I fully understand what you mean here. What would be the
> purpose for hooking up to unmapping? Why would making sure we put
> folios back into the direct map whenever they are freed or converted to
> private not be sufficient?
I think putting the folios back into the direct map when the folios are
freed or converted to private should cover all cases.
I was just thinking that being able to hook up to unmapping is nice
since unmapping is the counterpart to mapping when the folios are
removed from the direct map.
On 22/01/2026 18:37, Ackerley Tng wrote:
> Nikita Kalyazin <kalyazin@amazon.com> writes:
>
>> On 16/01/2026 00:00, Edgecombe, Rick P wrote:
>>> On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
>>>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>>>> +{
>>>> + /*
>>>> + * Direct map restoration cannot fail, as the only error condition
>>>> + * for direct map manipulation is failure to allocate page tables
>>>> + * when splitting huge pages, but this split would have already
>>>> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
>
> Do you know if folio_restore_direct_map() will also end up merging page
> table entries to a higher level?
By looking at the callchain in x86 at least, I can't see how it would.
>
>>>> + * Thus folio_restore_direct_map() here only updates prot bits.
>>>> + */
>>>> + if (kvm_gmem_folio_no_direct_map(folio)) {
>>>> + WARN_ON_ONCE(folio_restore_direct_map(folio));
>>>> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>>>> + }
>>>> +}
>>>> +
>>>
>>> Does this assume the folio would not have been split after it was zapped? As in,
>>> if it was zapped at 2MB granularity (no 4KB direct map split required) but then
>>> restored at 4KB (split required)? Or it gets merged somehow before this?
>
> I agree with the rest of the discussion that this will probably land
> before huge page support, so I will have to figure out the intersection
> of the two later.
>
>>
>> AFAIK it can't be zapped at 2MB granularity as the zapping code will
>> inevitably cause splitting because guest_memfd faults occur at the base
>> page granularity as of now.
>
> Here's what I'm thinking for now:
>
> [HugeTLB, no conversions]
> With initial HugeTLB support (no conversions), host userspace
> guest_memfd faults will be:
>
> + For guest_memfd with PUD-sized pages
> + At PUD level or PTE level
> + For guest_memfd with PMD-sized pages
> + At PMD level or PTE level
>
> Since this guest_memfd doesn't support conversions, the folio is never
> split/merged, so the direct map is restored at whatever level it was
> zapped. I think this works out well.
>
> [HugeTLB + conversions]
> For a guest_memfd with HugeTLB support and conversions, host userspace
> guest_memfd faults will always be at PTE level, so the direct map will
> be split and the faulted pages have the direct map zapped in 4K chunks
> as they are faulted.
>
> On conversion back to private, put those back into the direct map
> (putting aside whether to merge the direct map PTEs for now).
>
>
> Unfortunately there's no unmapping callback for guest_memfd to use, so
> perhaps the principle should be to put the folios back into the direct
> map ASAP - at unmapping if guest_memfd is doing the unmapping, otherwise
> at freeing time?
On Fri, 2026-01-16 at 15:00 +0000, Nikita Kalyazin wrote:
> > Does this assume the folio would not have been split after it was
> > zapped? As in, if it was zapped at 2MB granularity (no 4KB direct
> > map split required) but then restored at 4KB (split required)? Or
> > it gets merged somehow before this?
>
> AFAIK it can't be zapped at 2MB granularity as the zapping code will
> inevitably cause splitting because guest_memfd faults occur at the
> base page granularity as of now.

Ah, right since there are no huge pages currently. Then the huge page
series will need to keep this in mind and figure out some solution.
Probably worth a comment on that assumption to help anyone that
changes it.

I imagine this feature is really targeted towards machines running a
bunch of untrusted VMs, so cloud hypervisors really. In that case the
direct map will probably be carved up pretty quick. Did you consider
just breaking the full direct map to 4k at the start when it's in use?
On 16/01/2026 15:34, Edgecombe, Rick P wrote:
> On Fri, 2026-01-16 at 15:00 +0000, Nikita Kalyazin wrote:
>>> Does this assume the folio would not have been split after it was
>>> zapped? As in, if it was zapped at 2MB granularity (no 4KB direct
>>> map split required) but then restored at 4KB (split required)? Or
>>> it gets merged somehow before this?
>>
>> AFAIK it can't be zapped at 2MB granularity as the zapping code will
>> inevitably cause splitting because guest_memfd faults occur at the
>> base page granularity as of now.
>
> Ah, right since there are no huge pages currently. Then the huge page
> series will need to keep this in mind and figure out some solution.
> Probably worth a comment on that assumption to help anyone that
> changes it.

Makes sense. I'll leave a comment.

>
> I imagine this feature is really targeted towards machines running a
> bunch of untrusted VMs, so cloud hypervisors really. In that case the
> direct map will probably be carved up pretty quick. Did you consider
> just breaking the full direct map to 4k at the start when it's in use?

That's an interesting point, I haven't thought about it from this
perspective. We should run some tests internally to see if it'd help.
This will likely change with support for huge pages coming in though.
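For reference, a per-folio variant of that pre-splitting could look
roughly like the sketch below on x86 (illustrative and untested; the
helper name is made up, while set_memory_4k() is the existing x86
primitive):

/*
 * Illustrative only: eagerly split the direct map covering a gmem
 * folio down to 4K pages at allocation time, so that later zap/restore
 * operations never have to split page tables (and thus cannot fail or
 * trigger split-time TLB flushes there).
 */
static int kvm_gmem_presplit_direct_map(struct folio *folio)
{
	unsigned long addr = (unsigned long)folio_address(folio);

	return set_memory_4k(addr, folio_nr_pages(folio));
}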
On Fri, 2026-01-16 at 17:28 +0000, Nikita Kalyazin wrote:
> >
> > I imagine this feature is really targeted towards machines running
> > a bunch of untrusted VMs, so cloud hypervisors really. In that case
> > the direct map will probably be carved up pretty quick. Did you
> > consider just breaking the full direct map to 4k at the start when
> > it's in use?
>
> That's an interesting point, I haven't thought about it from this
> perspective. We should run some tests internally to see if it'd
> help. This will likely change with support for huge pages coming in
> though.

The thing is, those no_flush() helpers actually still flush if they
need to split a page. Plus if they need to clear out lazy vmalloc
aliases it could be another flush. There are probably a lot of
opportunities to reduce flushing even beyond pre-split.

Just curious... as far as performance, have you tested this on a big
multi-socket system, where that flushing will hurt more? It's something
that has always been a fear for these direct map unmapping solutions.
On 16/01/2026 17:36, Edgecombe, Rick P wrote:
> On Fri, 2026-01-16 at 17:28 +0000, Nikita Kalyazin wrote:
>>>
>>> I imagine this feature is really targeted towards machines running
>>> a bunch of untrusted VMs, so cloud hypervisors really. In that case
>>> the direct map will probably be carved up pretty quick. Did you
>>> consider just breaking the full direct map to 4k at the start when
>>> it's in use?
>>
>> That's an interesting point, I haven't thought about it from this
>> perspective. We should run some tests internally to see if it'd
>> help. This will likely change with support for huge pages coming in
>> though.
>
> The thing is, those no_flush() helpers actually still flush if they
> need to split a page. Plus if they need to clear out lazy vmalloc
> aliases it could be another flush. There are probably a lot of
> opportunities to reduce flushing even beyond pre-split.
>
> Just curious... as far as performance, have you tested this on a big
> multi-socket system, where that flushing will hurt more? It's something
> that has always been a fear for these direct map unmapping solutions.

Yes, this is a problem that we'd like to address. We have been
discussing it in [1]. The effect of flushing on memory population that
we see on x86 is 5-7x elongation. We are thinking of making use of the
no-direct-map memory allocator that Brendan is working on [2].

[1] https://lore.kernel.org/lkml/d1b58114-9b88-4535-b28c-09d9cc1ff3be@amazon.com
[2] https://lore.kernel.org/kvm/DDVS9ITBCE2Z.RSTLCU79EX8G@google.com
On Fri, 2026-01-16 at 17:51 +0000, Nikita Kalyazin wrote:
> Yes, this is a problem that we'd like to address. We have been
> discussing it in [1]. The effect of flushing on memory population
> that we see on x86 is 5-7x elongation. We are thinking of making use
> of the no-direct-map memory allocator that Brendan is working on [2].

Ah, makes sense.

Do you plan to merge this before the performance problems are
addressed? I guess this series focuses on safety and functionality
first.
On 16/01/2026 18:10, Edgecombe, Rick P wrote:
> On Fri, 2026-01-16 at 17:51 +0000, Nikita Kalyazin wrote:
>> Yes, this is a problem that we'd like to address. We have been
>> discussing it in [1]. The effect of flushing on memory population
>> that we see on x86 is 5-7x elongation. We are thinking of making use
>> of the no-direct-map memory allocator that Brendan is working on [2].
>
> Ah, makes sense.
>
> Do you plan to merge this before the performance problems are
> addressed? I guess this series focuses on safety and functionality
> first.

Yes, we'd like to merge the functional part first and then optimise it
in the further series.
On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
> ioctl. When set, guest_memfd folios will be removed from the direct map
> after preparation, with direct map entries only restored when the folios
> are freed.
>
> To ensure these folios do not end up in places where the kernel cannot
> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>
> Note that this flag causes removal of direct map entries for all
> guest_memfd folios independent of whether they are "shared" or "private"
> (although current guest_memfd only supports either all folios in the
> "shared" state, or all folios in the "private" state if
> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
> entries of the shared parts of guest_memfd as well is a special type of
> non-CoCo VM where host userspace is trusted to have access to all of
> guest memory, but where Spectre-style transient execution attacks
> through the host kernel's direct map should still be mitigated. In this
> setup, KVM retains access to guest memory via userspace mappings of
> guest_memfd, which are reflected back into KVM's memslots via
> userspace_addr. This is needed for things like MMIO emulation on x86_64
> to work.

TDX does some clearing at the direct map mapping for pages that come
from gmem, using a special instruction. It also does some clflushing at
the direct map address for these pages. So I think we need to make sure
TDs don't pull from gmem fds with this flag.

Not that there would be any expected use of the flag for TDs, but it
could cause a crash.
On Thu, Jan 15, 2026 at 3:04 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
> > Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
> > ioctl. When set, guest_memfd folios will be removed from the direct map
> > after preparation, with direct map entries only restored when the folios
> > are freed.
> >
> > To ensure these folios do not end up in places where the kernel cannot
> > deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
> > address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
> >
> > Note that this flag causes removal of direct map entries for all
> > guest_memfd folios independent of whether they are "shared" or "private"
> > (although current guest_memfd only supports either all folios in the
> > "shared" state, or all folios in the "private" state if
> > GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
> > entries of the shared parts of guest_memfd as well is a special type of
> > non-CoCo VM where host userspace is trusted to have access to all of
> > guest memory, but where Spectre-style transient execution attacks
> > through the host kernel's direct map should still be mitigated. In this
> > setup, KVM retains access to guest memory via userspace mappings of
> > guest_memfd, which are reflected back into KVM's memslots via
> > userspace_addr. This is needed for things like MMIO emulation on x86_64
> > to work.
>
> TDX does some clearing at the direct map mapping for pages that come from gmem,
> using a special instruction. It also does some clflushing at the direct map
> address for these pages. So I think we need to make sure TDs don't pull from
> gmem fds with this flag.

Disabling this feature for TDX VMs for now seems ok. I assume TDX code
can establish temporary mappings to the physical memory and therefore
doesn't necessarily have to rely on direct map.

Is it safe to say that we can remove direct map for guest memory for
TDX VMs (and ideally other CC VMs as well) in future as needed?

>
> Not that there would be any expected use of the flag for TDs, but it could cause
> a crash.
On Fri, 2026-01-16 at 09:30 -0800, Vishal Annapurve wrote:
> > TDX does some clearing at the direct map mapping for pages that
> > come from gmem, using a special instruction. It also does some
> > clflushing at the direct map address for these pages. So I think we
> > need to make sure TDs don't pull from gmem fds with this flag.
>
> Disabling this feature for TDX VMs for now seems ok. I assume TDX
> code can establish temporary mappings to the physical memory and
> therefore doesn't necessarily have to rely on direct map.

Can, as in, can be changed to? It doesn't now, because the direct map
is reliable today.

>
> Is it safe to say that we can remove direct map for guest memory for
> TDX VMs (and ideally other CC VMs as well) in future as needed?

Linux code doesn't need to read the cipher text of course, but it does
need to help with memory cleaning on the errata systems. Doing a new
mapping for each page getting reclaimed would add cost to the shutdown
path.

Then there is the clflush. It is not actually required for the most
part. There is a TDX flag to check to see if you need to do it, so we
could probably remove the direct map accesses for some systems and
avoid temporary mappings.

So long term, I don't see a problem. For the old systems it would have
extra cost of temporary mappings at shutdown, but I would have imagined
direct map removal would have been costly too.
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes: > On Fri, 2026-01-16 at 09:30 -0800, Vishal Annapurve wrote: >> > TDX does some clearing at the direct map mapping for pages that >> > comes from gmem, using a special instruction. It also does some >> > clflushing at the direct map address for these pages. So I think we >> > need to make sure TDs don't pull from gmem fds with this flag. >> >> Disabling this feature for TDX VMs for now seems ok. I assume TDX >> code can establish temporary mappings to the physical memory and >> therefore doesn't necessarily have to rely on direct map. > > Can, as in, can be changed to? It doesn't now, because the direct map > is reliable today. > >> >> Is it safe to say that we can remove direct map for guest memory for >> TDX VMs (and ideally other CC VMs as well) in future as needed? > > Linux code doesn't need to read the cipher text of course, but it does > need to help with memory cleaning on the errata systems. Doing a new > mapping for each page getting reclaimed would add cost to the shutdown > path. > Can we disable direct map removal for errata systems using TDX only, instead of all TDX? If it's complicated to figure that out, we can disable direct map removal for TDX for now and figure that out later. > Then there is the clfush. It is not actually required for the most > part. There is a TDX flag to check to see if you need to do it, so we > could probably remove the direct map accesses for some systems and > avoid temporary mappings. > > So long term, I don't see a problem. For the old systems it would have > extra cost of temporary mappings at shutdown, but I would have imagined > direct map removal would have been costly too. Is there a way to check if the code is running on the errata system and set up the temporary mappings only for those?
On Thu, 2026-01-22 at 08:44 -0800, Ackerley Tng wrote:
>
> Can we disable direct map removal for errata systems using TDX only,
> instead of all TDX?
>
> If it's complicated to figure that out, we can disable direct map
> removal for TDX for now and figure that out later.

In theory, but it still would require changes to TDX code since it does
the clflush unconditionally today. To know whether clflush is needed
(it's a different thing to the errata), you need to check a TDX module
flag (CLFLUSH_BEFORE_ALLOC).

Gosh, you know what, I should double check that we don't need the
clflush from the vm shutdown optimization. It should be a different
thing, but we gave scrutiny to the whole Linux flow when we did that.
So I'd have to double check nothing relied on it. We can follow up
here.

>
> > Then there is the clflush. It is not actually required for the most
> > part. There is a TDX flag to check to see if you need to do it, so
> > we could probably remove the direct map accesses for some systems
> > and avoid temporary mappings.
> >
> > So long term, I don't see a problem. For the old systems it would
> > have extra cost of temporary mappings at shutdown, but I would have
> > imagined direct map removal would have been costly too.
>
> Is there a way to check if the code is running on the errata system
> and set up the temporary mappings only for those?

The TDX code today doesn't do any remapping because the direct map is
reliably present. There isn't a flag or anything to just do the
remapping automatically. We would have to do some vmalloc mapping or
temporary_mm or something.

Can you explain what the use case is for unmapping encrypted TDX
private memory from the host direct map?
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes: > On Thu, 2026-01-22 at 08:44 -0800, Ackerley Tng wrote: >> >> Can we disable direct map removal for errata systems using TDX only, >> instead of all TDX? >> >> If it's complicated to figure that out, we can disable direct map >> removal for TDX for now and figure that out later. > > In theory, but it still would require changes to TDX code since it does > the clflush unconditionally today. To know whether clflush is needed > (it's a different thing to the errata), you need to check a TDX module > flag. (CLFLUSH_BEFORE_ALLOC) > > Gosh, you know what, I should double check that we don't need the > clflush from the vm shutdown optimization. It should be a different > thing, but for we gave scrutiny to the whole Linux flow when we did > that. So I'd have to double check nothing relied on it. We can follow > up here. > >> >> > Then there is the clfush. It is not actually required for the most >> > part. There is a TDX flag to check to see if you need to do it, so >> > we could probably remove the direct map accesses for some systems >> > and avoid temporary mappings. >> > >> > So long term, I don't see a problem. For the old systems it would >> > have extra cost of temporary mappings at shutdown, but I would have >> > imagined direct map removal would have been costly too. >> >> Is there a way to check if the code is running on the errata system >> and set up the temporary mappings only for those? > > The TDX code today doesn't do any remapping because the direct map is > reliably present. There isn't a flag or anything to just do the > remapping automatically. We would have to do some vmalloc mapping or > temporary_mm or something. > > Can you explain what the use case is for unmapping encrypted TDX > private memory from the host direct map? There's no use case I can think of for unmapping TDX private memory from the host direct map, but Sean's suggestion https://lore.kernel.org/all/aWpcDrGVLrZOqdcg@google.com/ won't even let shared guest_memfd memory be unmapped from the direct map for TDX VMs. Actually, does TDX's clflush that assumes presence in the direct map apply only for private pages, or all pages? If TDX's clflush only happens for private pages, then we could restore private pages to the direct map, and then we'd be safe even for TDX?
On Thu, 2026-01-22 at 14:47 -0800, Ackerley Tng wrote:
>
> There's no use case I can think of for unmapping TDX private memory
> from the host direct map, but Sean's suggestion
> https://lore.kernel.org/all/aWpcDrGVLrZOqdcg@google.com/ won't even
> let shared guest_memfd memory be unmapped from the direct map for TDX
> VMs.

Ah!

>
> Actually, does TDX's clflush that assumes presence in the direct map
> apply only for private pages, or all pages?
>
> If TDX's clflush only happens for private pages, then we could
> restore private pages to the direct map, and then we'd be safe even
> for TDX?

Yes, just private pages need the special treatment. But it will be much
simpler to start with just blocking the option for TDX. A shared pages
only mode could come later.

In general I think we should try to break things up like this when we
can. Kernel code is not set in stone, only ABI. I think it will lead to
overall faster upstreaming, because the series can be simpler.
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes: > On Thu, 2026-01-22 at 14:47 -0800, Ackerley Tng wrote: >> >> There's no use case I can think of for unmapping TDX private memory >> from the host direct map, but Sean's suggestion >> https://lore.kernel.org/all/aWpcDrGVLrZOqdcg@google.com/ won't even >> let shared guest_memfd memory be unmapped from the direct map for TDX >> VMs. > > Ah! > >> >> Actually, does TDX's clflush that assumes presence in the direct map >> apply only for private pages, or all pages? >> >> If TDX's clflush only happens for private pages, then we could >> restore private pages to the direct map, and then we'd be safe even >> for TDX? > > Yes, just private pages need the special treatment. But it will be much > simpler to start with just blocking the option for TDX. A shared pages > only mode could come later. > > In general I think we should try to break things up like this when we > can. Kernel code is not set in stone, only ABI. I think it will lead to > overall faster upstreaming, because the series' can be simpler. I agree on splitting the feature up :), agree that simpler series are better. Perhaps just for my understanding, + shared pages => not in direct map => no TDX clflush + private pages => always in direct map => TDX performs clflush (I could put pages back into the direct map while doing shared to private conversions). Is everything good then? Or does TDX code not apply the special treatment, as in clflush only for private pages, as of now?
On 15/01/2026 23:04, Edgecombe, Rick P wrote:
> On Wed, 2026-01-14 at 13:46 +0000, Kalyazin, Nikita wrote:
>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct map
>> after preparation, with direct map entries only restored when the folios
>> are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or "private"
>> (although current guest_memfd only supports either all folios in the
>> "shared" state, or all folios in the "private" state if
>> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
>> entries of the shared parts of guest_memfd as well is a special type of
>> non-CoCo VM where host userspace is trusted to have access to all of
>> guest memory, but where Spectre-style transient execution attacks
>> through the host kernel's direct map should still be mitigated. In this
>> setup, KVM retains access to guest memory via userspace mappings of
>> guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on x86_64
>> to work.
>
> TDX does some clearing at the direct map mapping for pages that come
> from gmem, using a special instruction. It also does some clflushing at
> the direct map address for these pages. So I think we need to make sure
> TDs don't pull from gmem fds with this flag.

Would you be able to give a pointer on how we can do that? I'm not
very familiar with the TDX code.

>
> Not that there would be any expected use of the flag for TDs, but it
> could cause a crash.
On Fri, 2026-01-16 at 15:02 +0000, Nikita Kalyazin wrote:
> > TDX does some clearing at the direct map mapping for pages that
> > come from gmem, using a special instruction. It also does some
> > clflushing at the direct map address for these pages. So I think we
> > need to make sure TDs don't pull from gmem fds with this flag.
>
> Would you be able to give a pointer on how we can do that? I'm not
> very familiar with the TDX code.

Uhh, that is a good question. Let me think.
On Fri, Jan 16, 2026, Rick P Edgecombe wrote:
> On Fri, 2026-01-16 at 15:02 +0000, Nikita Kalyazin wrote:
> > > TDX does some clearing at the direct map mapping for pages that
> > > come from gmem, using a special instruction. It also does some
> > > clflushing at the direct map address for these pages. So I think we
> > > need to make sure TDs don't pull from gmem fds with this flag.
> >
> > Would you be able to give a pointer on how we can do that? I'm not
> > very familiar with the TDX code.
>
> Uhh, that is a good question. Let me think.

Pass @kvm to kvm_arch_gmem_supports_no_direct_map() and then return
%false if it's a TDX VM.
On Fri, 2026-01-16 at 07:41 -0800, Sean Christopherson wrote:
> Pass @kvm to kvm_arch_gmem_supports_no_direct_map() and then return
> %false if it's a TDX VM.

Thanks!
On 16/01/2026 15:41, Sean Christopherson wrote:
> On Fri, Jan 16, 2026, Rick P Edgecombe wrote:
>> On Fri, 2026-01-16 at 15:02 +0000, Nikita Kalyazin wrote:
>>>> TDX does some clearing at the direct map mapping for pages that
>>>> come from gmem, using a special instruction. It also does some
>>>> clflushing at the direct map address for these pages. So I think we
>>>> need to make sure TDs don't pull from gmem fds with this flag.
>>>
>>> Would you be able to give a pointer on how we can do that? I'm not
>>> very familiar with the TDX code.
>>
>> Uhh, that is a good question. Let me think.
>
> Pass @kvm to kvm_arch_gmem_supports_no_direct_map() and then return
> %false if it's a TDX VM.

Sounds good to me, thanks.
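For reference, that could look roughly like the sketch below on x86
(illustrative only; it assumes the hook gains a @kvm parameter as
suggested, with the macro-override pattern this patch already uses):

/* Sketch: x86 override of the hook, gated on the VM not being a TD. */
#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
static inline bool kvm_arch_gmem_supports_no_direct_map(struct kvm *kvm)
{
	/* System-level capability query (no VM yet): report support. */
	if (!kvm)
		return true;

	/* TDX accesses gmem pages through the direct map (clflush etc.). */
	return kvm->arch.vm_type != KVM_X86_TDX_VM;
}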
"Kalyazin, Nikita" <kalyazin@amazon.co.uk> writes:
> From: Patrick Roy <patrick.roy@linux.dev>
>
> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
> ioctl. When set, guest_memfd folios will be removed from the direct map
> after preparation, with direct map entries only restored when the folios
> are freed.
>
> To ensure these folios do not end up in places where the kernel cannot
> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>
> Note that this flag causes removal of direct map entries for all
> guest_memfd folios independent of whether they are "shared" or "private"
> (although current guest_memfd only supports either all folios in the
> "shared" state, or all folios in the "private" state if
> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
> entries of the shared parts of guest_memfd as well is a special type of
> non-CoCo VM where host userspace is trusted to have access to all of
> guest memory, but where Spectre-style transient execution attacks
> through the host kernel's direct map should still be mitigated. In this
> setup, KVM retains access to guest memory via userspace mappings of
> guest_memfd, which are reflected back into KVM's memslots via
> userspace_addr. This is needed for things like MMIO emulation on x86_64
> to work.
>
> Direct map entries are zapped right before guest or userspace mappings
> of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
> kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
> a gmem folio can be allocated without being mapped anywhere is
> kvm_gmem_populate(), where handling potential failures of direct map
> removal is not possible (by the time direct map removal is attempted,
> the folio is already marked as prepared, meaning attempting to re-try
> kvm_gmem_populate() would just result in -EEXIST without fixing up the
> direct map state). These folios are then removed from the direct map
> upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
>
> Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> ---
> Documentation/virt/kvm/api.rst | 22 ++++++++------
> include/linux/kvm_host.h | 12 ++++++++
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/guest_memfd.c | 54 ++++++++++++++++++++++++++++++++++
> 4 files changed, 80 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 01a3abef8abb..c5f54f1370c8 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6440,15 +6440,19 @@ a single guest_memfd file, but the bound ranges must not overlap).
> The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be
> specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags:
>
> - ============================ ================================================
> - GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
> - descriptor.
> - GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
> - KVM_CREATE_GUEST_MEMFD (memory files created
> - without INIT_SHARED will be marked private).
> - Shared memory can be faulted into host userspace
> - page tables. Private memory cannot.
> - ============================ ================================================
> + ============================== ================================================
> + GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
> + descriptor.
> + GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
> + KVM_CREATE_GUEST_MEMFD (memory files created
> + without INIT_SHARED will be marked private).
> + Shared memory can be faulted into host userspace
> + page tables. Private memory cannot.
> + GUEST_MEMFD_FLAG_NO_DIRECT_MAP The guest_memfd instance will behave similarly
> + to memfd_secret, and unmaps the memory backing
Perhaps the reference to memfd_secret can be dropped to avoid anyone
assuming further similarities between guest_memfd and memfd_secret. This
could just say that "The guest_memfd instance will unmap the memory
backing it from the kernel's address space...".
> + it from the kernel's address space before
> + being passed off to userspace or the guest.
> + ============================== ================================================
>
> When the KVM MMU performs a PFN lookup to service a guest fault and the backing
> guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 27796a09d29b..d4d5306075bf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -738,10 +738,22 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
> if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
> flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
>
> + if (kvm_arch_gmem_supports_no_direct_map())
> + flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
> +
> return flags;
> }
> #endif
>
> +#ifdef CONFIG_KVM_GUEST_MEMFD
> +#ifndef kvm_arch_gmem_supports_no_direct_map
> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
> +{
> + return false;
> +}
> +#endif
> +#endif /* CONFIG_KVM_GUEST_MEMFD */
> +
> #ifndef kvm_arch_has_readonly_mem
> static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
> {
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index dddb781b0507..60341e1ba1be 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1612,6 +1612,7 @@ struct kvm_memory_attributes {
> #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
> #define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
> #define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
> +#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 2)
>
> struct kvm_create_guest_memfd {
> __u64 size;
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 92e7f8c1f303..43f64c11467a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -7,6 +7,9 @@
> #include <linux/mempolicy.h>
> #include <linux/pseudo_fs.h>
> #include <linux/pagemap.h>
> +#include <linux/set_memory.h>
> +
> +#include <asm/tlbflush.h>
>
> #include "kvm_mm.h"
>
> @@ -76,6 +79,43 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
> return 0;
> }
>
> +#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
> +
> +static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
> +{
> + return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
Nit: I think there shouldn't be a space between (u64) and what's being cast.
> +}
> +
> +static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
> +{
> + u64 gmem_flags = GMEM_I(folio_inode(folio))->flags;
> + int r = 0;
> +
> + if (kvm_gmem_folio_no_direct_map(folio) || !(gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP))
> + goto out;
> +
> + folio->private = (void *)((u64)folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
> + r = folio_zap_direct_map(folio);
> +
> +out:
> + return r;
> +}
> +
> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
> +{
> + /*
> + * Direct map restoration cannot fail, as the only error condition
> + * for direct map manipulation is failure to allocate page tables
> + * when splitting huge pages, but this split would have already
> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
> + * Thus folio_restore_direct_map() here only updates prot bits.
> + */
Thanks for this comment :)
> + if (kvm_gmem_folio_no_direct_map(folio)) {
> + WARN_ON_ONCE(folio_restore_direct_map(folio));
> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
> + }
> +}
> +
> static inline void kvm_gmem_mark_prepared(struct folio *folio)
> {
> folio_mark_uptodate(folio);
> @@ -398,6 +438,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> struct inode *inode = file_inode(vmf->vma->vm_file);
> struct folio *folio;
> vm_fault_t ret = VM_FAULT_LOCKED;
> + int err;
>
> if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> return VM_FAULT_SIGBUS;
> @@ -423,6 +464,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> kvm_gmem_mark_prepared(folio);
> }
>
> + err = kvm_gmem_folio_zap_direct_map(folio);
Perhaps the check for gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP should
be done here before making the call to kvm_gmem_folio_zap_direct_map()
to make it more obvious that zapping is conditional.
Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
this call can be completely removed by the compiler if it wasn't
compiled in.
The kvm_gmem_folio_no_direct_map() check should probably remain in
kvm_gmem_folio_zap_direct_map() since that's a "if already zapped, don't
zap again" check.
> + if (err) {
> + ret = vmf_error(err);
> + goto out_folio;
> + }
> +
> vmf->page = folio_file_page(folio, vmf->pgoff);
>
> out_folio:
> @@ -533,6 +580,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
> kvm_pfn_t pfn = page_to_pfn(page);
> int order = folio_order(folio);
>
> + kvm_gmem_folio_restore_direct_map(folio);
> +
I can't decide if the kvm_gmem_folio_no_direct_map(folio) should be in
the caller or within kvm_gmem_folio_restore_direct_map(), since this
time it's a folio-specific property being checked.
Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
this call can be completely removed by the compiler if it wasn't
compiled in. IIUC whether the check is added in the caller or within
kvm_gmem_folio_restore_direct_map() the call can still be elided.
> kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
> }
>
> @@ -596,6 +645,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> /* Unmovable mappings are supposed to be marked unevictable as well. */
> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>
> + if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
> + mapping_set_no_direct_map(inode->i_mapping);
> +
> GMEM_I(inode)->flags = flags;
>
> file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
> @@ -807,6 +859,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> if (!is_prepared)
> r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
> + kvm_gmem_folio_zap_direct_map(folio);
> +
Is there a reason why errors are not handled when faulting private memory?
> folio_unlock(folio);
>
> if (!r)
> --
> 2.50.1
On 15/01/2026 20:00, Ackerley Tng wrote:
> "Kalyazin, Nikita" <kalyazin@amazon.co.uk> writes:
>
>> From: Patrick Roy <patrick.roy@linux.dev>
>>
>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct map
>> after preparation, with direct map entries only restored when the folios
>> are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or "private"
>> (although current guest_memfd only supports either all folios in the
>> "shared" state, or all folios in the "private" state if
>> GUEST_MEMFD_FLAG_MMAP is not set). The usecase for removing direct map
>> entries of also the shared parts of guest_memfd is a special type of
>> non-CoCo VM where host userspace is trusted to have access to all of
>> guest memory, but where Spectre-style transient execution attacks
>> through the host kernel's direct map should still be mitigated. In this
>> setup, KVM retains access to guest memory via userspace mappings of
>> guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on x86_64
>> to work.
>>
>> Direct map entries are zapped right before guest or userspace mappings
>> of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
>> kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
>> a gmem folio can be allocated without being mapped anywhere is
>> kvm_gmem_populate(), where handling potential failures of direct map
>> removal is not possible (by the time direct map removal is attempted,
>> the folio is already marked as prepared, meaning attempting to re-try
>> kvm_gmem_populate() would just result in -EEXIST without fixing up the
>> direct map state). These folios are then removed from the direct map
>> upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
>>
>> Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>> ---
>> Documentation/virt/kvm/api.rst | 22 ++++++++------
>> include/linux/kvm_host.h | 12 ++++++++
>> include/uapi/linux/kvm.h | 1 +
>> virt/kvm/guest_memfd.c | 54 ++++++++++++++++++++++++++++++++++
>> 4 files changed, 80 insertions(+), 9 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 01a3abef8abb..c5f54f1370c8 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6440,15 +6440,19 @@ a single guest_memfd file, but the bound ranges must not overlap).
>> The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be
>> specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags:
>>
>> - ============================ ================================================
>> - GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
>> - descriptor.
>> - GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
>> - KVM_CREATE_GUEST_MEMFD (memory files created
>> - without INIT_SHARED will be marked private).
>> - Shared memory can be faulted into host userspace
>> - page tables. Private memory cannot.
>> - ============================ ================================================
>> + ============================== ================================================
>> + GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
>> + descriptor.
>> + GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
>> + KVM_CREATE_GUEST_MEMFD (memory files created
>> + without INIT_SHARED will be marked private).
>> + Shared memory can be faulted into host userspace
>> + page tables. Private memory cannot.
>> + GUEST_MEMFD_FLAG_NO_DIRECT_MAP The guest_memfd instance will behave similarly
>> + to memfd_secret, and unmaps the memory backing
>
> Perhaps the reference to memfd_secret can be dropped to avoid anyone
> assuming further similarities between guest_memfd and memfd_secret. This
> could just say that "The guest_memfd instance will unmap the memory
> backing it from the kernel's address space...".
Agree, it may lead to confusion down the line, thanks.
>
>> + it from the kernel's address space before
>> + being passed off to userspace or the guest.
>> + ============================== ================================================
>>
>> When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>> guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 27796a09d29b..d4d5306075bf 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -738,10 +738,22 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
>> if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
>> flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
>>
>> + if (kvm_arch_gmem_supports_no_direct_map())
>> + flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
>> +
>> return flags;
>> }
>> #endif
>>
>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>> +#ifndef kvm_arch_gmem_supports_no_direct_map
>> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
>> +{
>> + return false;
>> +}
>> +#endif
>> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>> +
>> #ifndef kvm_arch_has_readonly_mem
>> static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
>> {
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index dddb781b0507..60341e1ba1be 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1612,6 +1612,7 @@ struct kvm_memory_attributes {
>> #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
>> #define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
>> #define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
>> +#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 2)
>>
>> struct kvm_create_guest_memfd {
>> __u64 size;
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 92e7f8c1f303..43f64c11467a 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -7,6 +7,9 @@
>> #include <linux/mempolicy.h>
>> #include <linux/pseudo_fs.h>
>> #include <linux/pagemap.h>
>> +#include <linux/set_memory.h>
>> +
>> +#include <asm/tlbflush.h>
>>
>> #include "kvm_mm.h"
>>
>> @@ -76,6 +79,43 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>> return 0;
>> }
>>
>> +#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
>> +
>> +static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
>> +{
>> + return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
>
> Nit: I think there shouldn't be a space between (u64) and what's being cast.
True, will remove.
>
>> +}
>> +
>> +static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
>> +{
>> + u64 gmem_flags = GMEM_I(folio_inode(folio))->flags;
>> + int r = 0;
>> +
>> + if (kvm_gmem_folio_no_direct_map(folio) || !(gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP))
>> + goto out;
>> +
>> + folio->private = (void *)((u64)folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>> + r = folio_zap_direct_map(folio);
>> +
>> +out:
>> + return r;
>> +}
>> +
>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>> +{
>> + /*
>> + * Direct map restoration cannot fail, as the only error condition
>> + * for direct map manipulation is failure to allocate page tables
>> + * when splitting huge pages, but this split would have already
>> + * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
>> + * Thus folio_restore_direct_map() here only updates prot bits.
>> + */
>
> Thanks for this comment :)
Thanks to Patrick :)
>
>> + if (kvm_gmem_folio_no_direct_map(folio)) {
>> + WARN_ON_ONCE(folio_restore_direct_map(folio));
>> + folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>> + }
>> +}
>> +
>> static inline void kvm_gmem_mark_prepared(struct folio *folio)
>> {
>> folio_mark_uptodate(folio);
>> @@ -398,6 +438,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>> struct inode *inode = file_inode(vmf->vma->vm_file);
>> struct folio *folio;
>> vm_fault_t ret = VM_FAULT_LOCKED;
>> + int err;
>>
>> if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>> return VM_FAULT_SIGBUS;
>> @@ -423,6 +464,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>> kvm_gmem_mark_prepared(folio);
>> }
>>
>> + err = kvm_gmem_folio_zap_direct_map(folio);
>
> Perhaps the check for gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP should
> be done here before making the call to kvm_gmem_folio_zap_direct_map()
> to make it more obvious that zapping is conditional.
Makes sense to me.
>
> Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
> this call can be completely removed by the compiler if it wasn't
> compiled in.
But if it is compiled in, we will be paying the cost of the call on
every page fault? E.g. on arm64, it will call the following:
bool can_set_direct_map(void)
{
...
return rodata_full || debug_pagealloc_enabled() ||
arm64_kfence_can_set_direct_map() || is_realm_world();
}
>
> The kvm_gmem_folio_no_direct_map() check should probably remain in
> kvm_gmem_folio_zap_direct_map() since that's a "if already zapped, don't
> zap again" check.
>
>> + if (err) {
>> + ret = vmf_error(err);
>> + goto out_folio;
>> + }
>> +
>> vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>> out_folio:
>> @@ -533,6 +580,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>> kvm_pfn_t pfn = page_to_pfn(page);
>> int order = folio_order(folio);
>>
>> + kvm_gmem_folio_restore_direct_map(folio);
>> +
>
> I can't decide if the kvm_gmem_folio_no_direct_map(folio) should be in
> the caller or within kvm_gmem_folio_restore_direct_map(), since this
> time it's a folio-specific property being checked.
I'm tempted to keep it similar to the kvm_gmem_folio_zap_direct_map()
case. How does the fact it's a folio-specific property change your
reasoning?
>
> Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
> this call can be completely removed by the compiler if it wasn't
> compiled in. IIUC whether the check is added in the caller or within
> kvm_gmem_folio_restore_direct_map() the call can still be elided.
Same concern as above about kvm_gmem_folio_zap_direct_map(), i.e. the
performance in the case where kvm_arch_gmem_supports_no_direct_map() exists.
>
>> kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>> }
>>
>> @@ -596,6 +645,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>> /* Unmovable mappings are supposed to be marked unevictable as well. */
>> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>
>> + if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
>> + mapping_set_no_direct_map(inode->i_mapping);
>> +
>> GMEM_I(inode)->flags = flags;
>>
>> file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
>> @@ -807,6 +859,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> if (!is_prepared)
>> r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>
>> + kvm_gmem_folio_zap_direct_map(folio);
>> +
>
> Is there a reason why errors are not handled when faulting private memory?
No, I can't see a reason. Will add a check, thanks.
>
>> folio_unlock(folio);
>>
>> if (!r)
>> --
>> 2.50.1
Nikita Kalyazin <kalyazin@amazon.com> writes:
Was preparing the reply but couldn't get to it before the
meeting. Here's what was also discussed at the guest_memfd biweekly on
2026-01-22:
>
> [...snip...]
>
>>> @@ -423,6 +464,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>> kvm_gmem_mark_prepared(folio);
>>> }
>>>
>>> + err = kvm_gmem_folio_zap_direct_map(folio);
>>
>> Perhaps the check for gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP should
>> be done here before making the call to kvm_gmem_folio_zap_direct_map()
>> to make it more obvious that zapping is conditional.
>
> Makes sense to me.
>
>>
>> Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
>> this call can be completely removed by the compiler if it wasn't
>> compiled in.
>
> But if it is compiled in, we will be paying the cost of the call on
> every page fault? E.g. on arm64, it will call the following:
>
> bool can_set_direct_map(void)
> {
>
> ...
>
> return rodata_full || debug_pagealloc_enabled() ||
> arm64_kfence_can_set_direct_map() || is_realm_world();
> }
>
You're right that this could end up paying the cost on every page
fault. Please ignore this request!
>>
>> The kvm_gmem_folio_no_direct_map() check should probably remain in
>> kvm_gmem_folio_zap_direct_map() since that's a "if already zapped, don't
>> zap again" check.
>>
>>> + if (err) {
>>> + ret = vmf_error(err);
>>> + goto out_folio;
>>> + }
>>> +
>>> vmf->page = folio_file_page(folio, vmf->pgoff);
>>>
>>> out_folio:
>>> @@ -533,6 +580,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>>> kvm_pfn_t pfn = page_to_pfn(page);
>>> int order = folio_order(folio);
>>>
>>> + kvm_gmem_folio_restore_direct_map(folio);
>>> +
>>
>> I can't decide if the kvm_gmem_folio_no_direct_map(folio) should be in
>> the caller or within kvm_gmem_folio_restore_direct_map(), since this
>> time it's a folio-specific property being checked.
>
> I'm tempted to keep it similar to the kvm_gmem_folio_zap_direct_map()
> case. How does the fact it's a folio-specific property change your
> reasoning?
>
This is good too:
	if (kvm_gmem_folio_no_direct_map(folio))
		kvm_gmem_folio_restore_direct_map(folio);
>>
>> Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
>> this call can be completely removed by the compiler if it wasn't
>> compiled in. IIUC whether the check is added in the caller or within
>> kvm_gmem_folio_restore_direct_map() the call can still be elided.
>
> Same concern as above about kvm_gmem_folio_zap_direct_map(), i.e. the
> performance in the case where kvm_arch_gmem_supports_no_direct_map() exists.
>
Please ignore this request!
>>
>>> kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>>> }
>>>
>>> @@ -596,6 +645,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>>> /* Unmovable mappings are supposed to be marked unevictable as well. */
>>> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>>
>>> + if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
>>> + mapping_set_no_direct_map(inode->i_mapping);
>>> +
>>> GMEM_I(inode)->flags = flags;
>>>
>>> file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
>>> @@ -807,6 +859,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>> if (!is_prepared)
>>> r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>>
>>> + kvm_gmem_folio_zap_direct_map(folio);
>>> +
>>
>> Is there a reason why errors are not handled when faulting private memory?
>
> No, I can't see a reason. Will add a check, thanks.
>
>>
>>> folio_unlock(folio);
>>>
>>> if (!r)
>>> --
>>> 2.50.1
On 22/01/2026 16:34, Ackerley Tng wrote:
> Nikita Kalyazin <kalyazin@amazon.com> writes:
>
> Was preparing the reply but couldn't get to it before the
> meeting. Here's what was also discussed at the guest_memfd biweekly on
> 2026-01-22:
>
>>
>> [...snip...]
>>
>>>> @@ -423,6 +464,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>>> kvm_gmem_mark_prepared(folio);
>>>> }
>>>>
>>>> + err = kvm_gmem_folio_zap_direct_map(folio);
>>>
>>> Perhaps the check for gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP should
>>> be done here before making the call to kvm_gmem_folio_zap_direct_map()
>>> to make it more obvious that zapping is conditional.
>>
>> Makes sense to me.
>>
>>>
>>> Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
>>> this call can be completely removed by the compiler if it wasn't
>>> compiled in.
>>
>> But if it is compiled in, we will be paying the cost of the call on
>> every page fault? E.g. on arm64, it will call the following:
>>
>> bool can_set_direct_map(void)
>> {
>>
>> ...
>>
>> return rodata_full || debug_pagealloc_enabled() ||
>> arm64_kfence_can_set_direct_map() || is_realm_world();
>> }
>>
>
> You're right that this could end up paying the cost on every page
> fault. Please ignore this request!
>
>>>
>>> The kvm_gmem_folio_no_direct_map() check should probably remain in
>>> kvm_gmem_folio_zap_direct_map() since that's a "if already zapped, don't
>>> zap again" check.
>>>
>>>> + if (err) {
>>>> + ret = vmf_error(err);
>>>> + goto out_folio;
>>>> + }
>>>> +
>>>> vmf->page = folio_file_page(folio, vmf->pgoff);
>>>>
>>>> out_folio:
>>>> @@ -533,6 +580,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>>>> kvm_pfn_t pfn = page_to_pfn(page);
>>>> int order = folio_order(folio);
>>>>
>>>> + kvm_gmem_folio_restore_direct_map(folio);
>>>> +
>>>
>>> I can't decide if the kvm_gmem_folio_no_direct_map(folio) should be in
>>> the caller or within kvm_gmem_folio_restore_direct_map(), since this
>>> time it's a folio-specific property being checked.
>>
>> I'm tempted to keep it similar to the kvm_gmem_folio_zap_direct_map()
>> case. How does the fact it's a folio-specific property change your
>> reasoning?
>>
>
> This is good too:
>
> 	if (kvm_gmem_folio_no_direct_map(folio))
> 		kvm_gmem_folio_restore_direct_map(folio);
It turns out we can't do that because folio->mapping is gone by the time
filemap_free_folio() is called so we can't inspect the flags. Are you
ok with only having this check when zapping (but not when restoring)?
Do you think we should add a comment saying it's conditional here?
>
>>>
>>> Perhaps also add a check for kvm_arch_gmem_supports_no_direct_map() so
>>> this call can be completely removed by the compiler if it wasn't
>>> compiled in. IIUC whether the check is added in the caller or within
>>> kvm_gmem_folio_restore_direct_map() the call can still be elided.
>>
>> Same concern as above about kvm_gmem_folio_zap_direct_map(), i.e. the
>> performance in the case where kvm_arch_gmem_supports_no_direct_map() exists.
>>
>
> Please ignore this request!
>
>>>
>>>> kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>>>> }
>>>>
>>>> @@ -596,6 +645,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>>>> /* Unmovable mappings are supposed to be marked unevictable as well. */
>>>> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>>>
>>>> + if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
>>>> + mapping_set_no_direct_map(inode->i_mapping);
>>>> +
>>>> GMEM_I(inode)->flags = flags;
>>>>
>>>> file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
>>>> @@ -807,6 +859,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>>> if (!is_prepared)
>>>> r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>>>
>>>> + kvm_gmem_folio_zap_direct_map(folio);
>>>> +
>>>
>>> Is there a reason why errors are not handled when faulting private memory?
>>
>> No, I can't see a reason. Will add a check, thanks.
>>
>>>
>>>> folio_unlock(folio);
>>>>
>>>> if (!r)
>>>> --
>>>> 2.50.1
Nikita Kalyazin <kalyazin@amazon.com> writes:

>
> [...snip...]
>
>>>>> @@ -533,6 +580,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>>>>> kvm_pfn_t pfn = page_to_pfn(page);
>>>>> int order = folio_order(folio);
>>>>>
>>>>> + kvm_gmem_folio_restore_direct_map(folio);
>>>>> +
>>>>
>>>> I can't decide if the kvm_gmem_folio_no_direct_map(folio) should be in
>>>> the caller or within kvm_gmem_folio_restore_direct_map(), since this
>>>> time it's a folio-specific property being checked.
>>>
>>> I'm tempted to keep it similar to the kvm_gmem_folio_zap_direct_map()
>>> case. How does the fact it's a folio-specific property change your
>>> reasoning?
>>>
>>
>> This is good too:
>>
>> 	if (kvm_gmem_folio_no_direct_map(folio))
>> 		kvm_gmem_folio_restore_direct_map(folio);
>
> It turns out we can't do that because folio->mapping is gone by the time
> filemap_free_folio() is called so we can't inspect the flags. Are you
> ok with only having this check when zapping (but not when restoring)?
> Do you think we should add a comment saying it's conditional here?
>

I thought kvm_gmem_folio_no_direct_map() only reads folio->private,
which I think should still be there at the point of
filemap_free_folio().

>>
>> [...snip...]
>>
On 22/01/2026 20:30, Ackerley Tng wrote:
> Nikita Kalyazin <kalyazin@amazon.com> writes:
>
>>
>> [...snip...]
>>
>>>>>> @@ -533,6 +580,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>>>>>> kvm_pfn_t pfn = page_to_pfn(page);
>>>>>> int order = folio_order(folio);
>>>>>>
>>>>>> + kvm_gmem_folio_restore_direct_map(folio);
>>>>>> +
>>>>>
>>>>> I can't decide if the kvm_gmem_folio_no_direct_map(folio) should be in
>>>>> the caller or within kvm_gmem_folio_restore_direct_map(), since this
>>>>> time it's a folio-specific property being checked.
>>>>
>>>> I'm tempted to keep it similar to the kvm_gmem_folio_zap_direct_map()
>>>> case. How does the fact it's a folio-specific property change your
>>>> reasoning?
>>>>
>>>
>>> This is good too:
>>>
>>> 	if (kvm_gmem_folio_no_direct_map(folio))
>>> 		kvm_gmem_folio_restore_direct_map(folio);
>>
>> It turns out we can't do that because folio->mapping is gone by the time
>> filemap_free_folio() is called so we can't inspect the flags. Are you
>> ok with only having this check when zapping (but not when restoring)?
>> Do you think we should add a comment saying it's conditional here?
>>
>
> I thought kvm_gmem_folio_no_direct_map() only reads folio->private,
> which I think should still be there at the point of
> filemap_free_folio().

Oh, I misread your last reply. What you're proposing would indeed work.

>
>>>
>>> [...snip...]
>>>