[PATCH v4 03/12] KVM: guest_memfd: Add flag to remove from direct map

Posted by Patrick Roy 9 months, 4 weeks ago
Add KVM_GMEM_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD() ioctl. When
set, guest_memfd folios will be removed from the direct map after
preparation, with direct map entries only restored when the folios are
freed.

To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if KVM_GMEM_NO_DIRECT_MAP is requested.
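
For example, userspace opts in at guest_memfd creation time; a minimal
sketch (vm_fd being an open VM file descriptor, mem_size a placeholder):

	struct kvm_create_guest_memfd gmem = {
		.size  = mem_size,
		.flags = KVM_GMEM_NO_DIRECT_MAP,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);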

Note that this flag causes removal of direct map entries for all
guest_memfd folios, independent of whether they are "shared" or
"private" (although current guest_memfd only supports either all folios
in the "shared" state, or all folios in the "private" state if
!IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM)). The use case for removing
direct map entries also for the shared parts of guest_memfd is a special
type of non-CoCo VM where host userspace is trusted to have access to
all of guest memory, but where Spectre-style transient execution attacks
through the host kernel's direct map should still be mitigated.

Note that KVM retains access to guest memory via userspace
mappings of guest_memfd, which are reflected back into KVM's memslots
via userspace_addr. This is needed for things like MMIO emulation on
x86_64 to work. Previous iterations attempted to instead have KVM
temporarily restore direct map entries whenever such an access to guest
memory was needed, but this turned out to have a significant performance
impact, as well as additional complexity due to needing to refcount
direct map reinsertion operations and making them play nicely with gmem
truncations.
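
As a sketch of that setup (assuming guest_memfd is mmap'able, which this
series relies on elsewhere; error handling elided):

	void *hva = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, gmem_fd, 0);
	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = mem_size,
		.userspace_addr = (__u64)hva, /* KVM's own accesses go via this mapping */
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);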

This iteration also doesn't have KVM perform TLB flushes after direct
map manipulations. This is because TLB flushes resulted in an up to 40x
elongation of page faults in guest_memfd (scaling with the number of CPU
cores), or a 5x elongation of memory population. On the one hand, TLB
flushes are not needed for functional correctness (the virt->phys
mapping technically stays "correct", the kernel simply must not use it
for a while), so skipping them is a correct optimization to make. On the
other hand, it means that the desired protection from Spectre-style
attacks is not perfect, as an attacker could try to prevent a stale TLB
entry from getting evicted, keeping it alive until the page it refers to
is used by the guest for some sensitive data, and then targeting it with
a Spectre gadget.
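
For reference, pairing the unmap with an explicit flush would amount to
roughly the following (a sketch using existing kernel APIs; this is the
measured-and-rejected variant, not what this patch does):

	int r = set_direct_map_valid_noflush(folio_page(folio, 0),
					     folio_nr_pages(folio), false);
	if (!r) {
		unsigned long addr = (unsigned long)folio_address(folio);

		/* this flush is what caused the slowdowns quoted above */
		flush_tlb_kernel_range(addr, addr + folio_size(folio));
	}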

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/uapi/linux/kvm.h |  2 ++
 virt/kvm/guest_memfd.c   | 28 +++++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 117937a895da..4654c01a0a01 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1573,6 +1573,8 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_GMEM_NO_DIRECT_MAP			(1ULL << 0)
+
 #define KVM_PRE_FAULT_MEMORY	_IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
 
 struct kvm_pre_fault_memory {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 30b47ff0e6d2..bd7d361c9bb7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/set_memory.h>
 
 #include "kvm_mm.h"
 
@@ -42,8 +43,23 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 	return 0;
 }
 
+static bool kvm_gmem_test_no_direct_map(struct inode *inode)
+{
+	return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
+}
+
 static inline void kvm_gmem_mark_prepared(struct folio *folio)
 {
+	struct inode *inode = folio_inode(folio);
+
+	if (kvm_gmem_test_no_direct_map(inode)) {
+		int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+						     false);
+
+		if (!r)
+			folio_set_private(folio);
+	}
+
 	folio_mark_uptodate(folio);
 }
 
@@ -479,6 +495,10 @@ static void kvm_gmem_free_folio(struct folio *folio)
 	kvm_pfn_t pfn = page_to_pfn(page);
 	int order = folio_order(folio);
 
+	if (folio_test_private(folio))
+		WARN_ON_ONCE(set_direct_map_valid_noflush(folio_page(folio, 0),
+							  folio_nr_pages(folio), true));
+
 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
 }
 #endif
@@ -552,6 +572,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
+	if (flags & KVM_GMEM_NO_DIRECT_MAP)
+		mapping_set_no_direct_map(inode->i_mapping);
+
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
@@ -571,7 +594,10 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
+	u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
+
+	if (!can_set_direct_map())
+		valid_flags &= ~KVM_GMEM_NO_DIRECT_MAP;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;
-- 
2.48.1
Re: [PATCH v4 03/12] KVM: guest_memfd: Add flag to remove from direct map
Posted by David Hildenbrand 9 months, 3 weeks ago
On 21.02.25 17:07, Patrick Roy wrote:
> [...]
> 
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

...

>   
> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
> +{
> +	return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
> +}
> +
>   static inline void kvm_gmem_mark_prepared(struct folio *folio)
>   {
> +	struct inode *inode = folio_inode(folio);
> +
> +	if (kvm_gmem_test_no_direct_map(inode)) {
> +		int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
> +						     false);

Will this work if KVM is built as a module, or is this another good
reason why we might want the guest_memfd core to be part of core-mm?

-- 
Cheers,

David / dhildenb
Re: [PATCH v4 03/12] KVM: guest_memfd: Add flag to remove from direct map
Posted by Patrick Roy 9 months, 3 weeks ago

On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
> On 21.02.25 17:07, Patrick Roy wrote:
>> [...]
>>
>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> 
> ...
> 
>>
>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>> +{
>> +     return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>> +}
>> +
>>   static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>   {
>> +     struct inode *inode = folio_inode(folio);
>> +
>> +     if (kvm_gmem_test_no_direct_map(inode)) {
>> +             int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +                                                  false);
> 
> Will this work if KVM is built as a module, or is this another good
> reason why we might want the guest_memfd core to be part of core-mm?

mh, I'm admittedly not too familiar with the differences that would come
from building KVM as a module vs not. I do remember something about the
direct map accessors not being available for modules, so this would
indeed not work. Does that mean moving gmem into core-mm will be a
pre-requisite for the direct map removal stuff?

> -- 
> Cheers,
> 
> David / dhildenb
> 

Best, 
Patrick
Re: [PATCH v4 03/12] KVM: guest_memfd: Add flag to remove from direct map
Posted by David Hildenbrand 9 months, 3 weeks ago
On 26.02.25 09:48, Patrick Roy wrote:
> 
> 
> On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
>> On 21.02.25 17:07, Patrick Roy wrote:
>>> [...]
>>>
>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>
>> ...
>>
>>>
>>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>>> +{
>>> +     return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>>> +}
>>> +
>>>    static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>>    {
>>> +     struct inode *inode = folio_inode(folio);
>>> +
>>> +     if (kvm_gmem_test_no_direct_map(inode)) {
>>> +             int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>>> +                                                  false);
>>
>> Will this work if KVM is built as a module, or is this another good
>> reason why we might want the guest_memfd core to be part of core-mm?
> 
> mh, I'm admittedly not too familiar with the differences that would come
> from building KVM as a module vs not. I do remember something about the
> direct map accessors not being available for modules, so this would
> indeed not work. Does that mean moving gmem into core-mm will be a
> pre-requisite for the direct map removal stuff?

Likely, we'd need some shim.

Maybe for the time being it could be fenced using #if IS_BUILTIN() ... 
but that sure won't win in a beauty contest.
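
E.g., something like this in kvm_gmem_create() (sketch only, against the
hunk above):

	u64 valid_flags = 0;

	/*
	 * sketch: the direct map helpers aren't exported, so only offer
	 * the flag when KVM is built-in and the arch supports it
	 */
	if (IS_BUILTIN(CONFIG_KVM) && can_set_direct_map())
		valid_flags |= KVM_GMEM_NO_DIRECT_MAP;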

-- 
Cheers,

David / dhildenb
Re: [PATCH v4 03/12] KVM: guest_memfd: Add flag to remove from direct map
Posted by Patrick Roy 9 months, 3 weeks ago

On Wed, 2025-02-26 at 09:08 +0000, David Hildenbrand wrote:
> On 26.02.25 09:48, Patrick Roy wrote:
>>
>>
>> On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
>>> On 21.02.25 17:07, Patrick Roy wrote:
>>>> [...]
>>>>
>>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>>
>>> ...
>>>
>>>>
>>>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>>>> +{
>>>> +     return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>>>> +}
>>>> +
>>>>    static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>>>    {
>>>> +     struct inode *inode = folio_inode(folio);
>>>> +
>>>> +     if (kvm_gmem_test_no_direct_map(inode)) {
>>>> +             int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>>>> +                                                  false);
>>>
>>> Will this work if KVM is built as a module, or is this another good
>>> reason why we might want the guest_memfd core to be part of core-mm?
>>
>> mh, I'm admittedly not too familiar with the differences that would come
>> from building KVM as a module vs not. I do remember something about the
>> direct map accessors not being available for modules, so this would
>> indeed not work. Does that mean moving gmem into core-mm will be a
>> pre-requisite for the direct map removal stuff?
> 
> Likely, we'd need some shim.
> 
> Maybe for the time being it could be fenced using #if IS_BUILTIN() ...
> but that sure won't win in a beauty contest.

Is anyone working on such a shim at the moment? Otherwise, would it make
sense for me to look into it? (although I'll probably need a pointer or
two for what is actually needed)

I saw your comment on Fuad's series [1] indicating that he'll also need
some shim, so probably makes sense to tackle it anyway instead of
hacking around it with #if-ery.

[1]: https://lore.kernel.org/kvm/8ddab670-8416-47f2-a5a6-94fb3444f328@redhat.com/

> -- 
> Cheers,
> 
> David / dhildenb
>

Best,
Patrick
Re: [PATCH v4 03/12] KVM: guest_memfd: Add flag to remove from direct map
Posted by David Hildenbrand 9 months, 3 weeks ago
On 26.02.25 16:14, Patrick Roy wrote:
> 
> 
> On Wed, 2025-02-26 at 09:08 +0000, David Hildenbrand wrote:
>> On 26.02.25 09:48, Patrick Roy wrote:
>>>
>>>
>>> On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
>>>> On 21.02.25 17:07, Patrick Roy wrote:
>>>>> [...]
>>>>>
>>>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>>>
>>>> ...
>>>>
>>>>>
>>>>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>>>>> +{
>>>>> +     return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>>>>> +}
>>>>> +
>>>>>     static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>>>>     {
>>>>> +     struct inode *inode = folio_inode(folio);
>>>>> +
>>>>> +     if (kvm_gmem_test_no_direct_map(inode)) {
>>>>> +             int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>>>>> +                                                  false);
>>>>
>>>> Will this work if KVM is built as a module, or is this another good
>>>> reason why we might want the guest_memfd core to be part of core-mm?
>>>
>>> mh, I'm admittedly not too familiar with the differences that would come
>>> from building KVM as a module vs not. I do remember something about the
>>> direct map accessors not being available for modules, so this would
>>> indeed not work. Does that mean moving gmem into core-mm will be a
>>> pre-requisite for the direct map removal stuff?
>>
>> Likely, we'd need some shim.
>>
>> Maybe for the time being it could be fenced using #if IS_BUILTIN() ...
>> but that sure won't win in a beauty contest.
> 
> Is anyone working on such a shim at the moment? Otherwise, would it make
> sense for me to look into it? (although I'll probably need a pointer or
> two for what is actually needed)
> 
> I saw your comment on Fuad's series [1] indicating that he'll also need
> some shim, so probably makes sense to tackle it anyway instead of
> hacking around it with #if-ery.

Elliot (CC) was working on the "guestmem library" project [1], but it
was unclear what we could factor out into the core.

Looks like a simple shim for such stuff might be a good starting point,
even if it's not the final idea of encapsulating more in the library.

@Elliot, are you currently still looking into this?


[1] 
https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/T/#u

-- 
Cheers,

David / dhildenb
Re: [PATCH v4 03/12] KVM: guest_memfd: Add flag to remove from direct map
Posted by Patrick Roy 9 months ago
Hi David!

On Wed, 2025-02-26 at 15:30 +0000, David Hildenbrand wrote:
> On 26.02.25 16:14, Patrick Roy wrote:
>>
>>
>> On Wed, 2025-02-26 at 09:08 +0000, David Hildenbrand wrote:
>>> On 26.02.25 09:48, Patrick Roy wrote:
>>>>
>>>>
>>>> On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
>>>>> On 21.02.25 17:07, Patrick Roy wrote:
>>>>>> [...]
>>>>>>
>>>>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>>>>
>>>>> ...
>>>>>
>>>>>>
>>>>>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>>>>>> +{
>>>>>> +     return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>>>>>> +}
>>>>>> +
>>>>>>     static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>>>>>     {
>>>>>> +     struct inode *inode = folio_inode(folio);
>>>>>> +
>>>>>> +     if (kvm_gmem_test_no_direct_map(inode)) {
>>>>>> +             int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>>>>>> +                                                  false);
>>>>>
>>>>> Will this work if KVM is built as a module, or is this another good
>>>>> reason why we might want the guest_memfd core to be part of core-mm?
>>>>
>>>> mh, I'm admittedly not too familiar with the differences that would come
>>>> from building KVM as a module vs not. I do remember something about the
>>>> direct map accessors not being available for modules, so this would
>>>> indeed not work. Does that mean moving gmem into core-mm will be a
>>>> pre-requisite for the direct map removal stuff?
>>>
>>> Likely, we'd need some shim.
>>>
>>> Maybe for the time being it could be fenced using #if IS_BUILTIN() ...
>>> but that sure won't win in a beauty contest.
>>
>> Is anyone working on such a shim at the moment? Otherwise, would it make
>> sense for me to look into it? (although I'll probably need a pointer or
>> two for what is actually needed)
>>
>> I saw your comment on Fuad's series [1] indicating that he'll also need
>> some shim, so probably makes sense to tackle it anyway instead of
>> hacking around it with #if-ery.
> 
> Elliot (CC) was working on the "guestmem library" project [1], but it
> was unclear what we could factor out into the core.
> 
> Looks like a simple shim for such stuff might be a good starting point,
> even if it's not the final idea of encapsulating more in the library.

So I started looking into this based on what we talked about at the last
guest_memfd sync. I tried to go the way you hinted at when this topic of
"direct map removal from modules" came up in the past [1], and hide it
behind some sort of "alloc/free" abstraction. E.g. have the library/shim
expose gmem_get_folio(struct inode *inode, pgoff_t index), a rough
equivalent of today's __kvm_gmem_get_pfn(), which grabs a new folio from
the filemap, prepares it via a callback provided by KVM, and then
removes it from the direct map before returning it (rough sketch below).
But then, that could still be "abused" by module code to remove
arbitrary folios from the direct map, if a caller mocked up any old
struct inode to look sufficiently like a gmem inode for the purposes of
gmem_get_folio(). But I also couldn't really come up with anything that
_wouldn't_ allow something like this. What are your thoughts on this? Do
we need to find a way to prevent this sort of stuff, and is that even
possible? I checked some of Elliot's old submissions that contain
direct map removal as part of the library and they run into the
same problem.
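
To make the idea concrete, the entry point I've been prototyping would
look roughly like this (sketch only; names hypothetical, locking and
error handling abbreviated):

	/* gmem_test_no_direct_map() as in this series, but in the shim */
	struct folio *gmem_get_folio(struct inode *inode, pgoff_t index,
				     int (*prepare)(struct folio *folio))
	{
		struct folio *folio;
		int r;

		folio = filemap_grab_folio(inode->i_mapping, index);
		if (IS_ERR(folio))
			return folio;

		if (!folio_test_uptodate(folio)) {
			/* KVM-provided preparation callback */
			r = prepare(folio);
			if (!r && gmem_test_no_direct_map(inode))
				r = set_direct_map_valid_noflush(folio_page(folio, 0),
								 folio_nr_pages(folio),
								 false);
			if (r) {
				folio_unlock(folio);
				folio_put(folio);
				return ERR_PTR(r);
			}
			folio_mark_uptodate(folio);
		}

		return folio;
	}

The "abuse" problem above is then exactly that nothing ties the inode
argument to a genuine gmem inode.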

Best, 
Patrick

[1]: https://lore.kernel.org/all/49d14780-56f4-478d-9f5f-0857e788c667@redhat.com/
 
> @Elliot, are you currently still looking into this?
>
> [1]
> https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/T/#u
> 
> -- 
> Cheers,
> 
> David / dhildenb
>