KVM: Avoid a lurking guest_memfd ABI mess

[PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 2 weeks ago

Add a guest_memfd flag to allow userspace to state that the underlying
memory should be configured to be shared by default, and reject user page
faults if the guest_memfd instance's memory isn't shared by default.
Because KVM doesn't yet support in-place private<=>shared conversions, all
guest_memfd memory effectively follows the default state.

Alternatively, KVM could deduce the default state based on MMAP, which for
all intents and purposes is what KVM currently does.  However, implicitly
deriving the default state based on MMAP will result in a messy ABI when
support for in-place conversions is added.

For x86 CoCo VMs, which don't yet support MMAP, memory is currently private
by default (otherwise the memory would be unusable).  If MMAP implies
memory is shared by default, then the default state for CoCo VMs will vary
based on MMAP, and from userspace's perspective, will change when in-place
conversion support is added.  I.e. to maintain guest<=>host ABI, userspace
would need to immediately convert all memory from shared=>private, which
is both ugly and inefficient.  The inefficiency could be avoided by adding
a flag to state that memory is _private_ by default, irrespective of MMAP,
but that would lead to an equally messy and hard to document ABI.

Bite the bullet and immediately add a flag to control the default state so
that the effective behavior is explicit and straightforward.

Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
Cc: David Hildenbrand <david@redhat.com>
Cc: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst                 | 10 ++++++++--
 include/uapi/linux/kvm.h                       |  3 ++-
 tools/testing/selftests/kvm/guest_memfd_test.c |  5 +++--
 virt/kvm/guest_memfd.c                         |  6 +++++-
 4 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c17a87a0a5ac..4dfe156bbe3c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6415,8 +6415,14 @@ guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
 When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
-supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
-enables mmap() and faulting of guest_memfd memory to host userspace.
+supports GUEST_MEMFD_FLAG_MMAP and  GUEST_MEMFD_FLAG_DEFAULT_SHARED.  Setting
+the MMAP flag on guest_memfd creation enables mmap() and faulting of guest_memfd
+memory to host userspace (so long as the memory is currently shared).  Setting
+DEFAULT_SHARED makes all guest_memfd memory shared by default (versus private
+by default).  Note!  Because KVM doesn't yet support in-place private<=>shared
+conversions, DEFAULT_SHARED must be specified in order to fault memory into
+userspace page tables.  This limitation will go away when in-place conversions
+are supported.
 
 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
 guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6efa98a57ec1..38a2c083b6aa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1599,7 +1599,8 @@ struct kvm_memory_attributes {
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
-#define GUEST_MEMFD_FLAG_MMAP	(1ULL << 0)
+#define GUEST_MEMFD_FLAG_MMAP		(1ULL << 0)
+#define GUEST_MEMFD_FLAG_DEFAULT_SHARED	(1ULL << 1)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index b3ca6737f304..81b11a958c7a 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -274,7 +274,7 @@ static void test_guest_memfd(unsigned long vm_type)
 	vm = vm_create_barebones_type(vm_type);
 
 	if (vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_MMAP))
-		flags |= GUEST_MEMFD_FLAG_MMAP;
+		flags |= GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_DEFAULT_SHARED;
 
 	test_create_guest_memfd_multiple(vm);
 	test_create_guest_memfd_invalid_sizes(vm, flags, page_size);
@@ -337,7 +337,8 @@ static void test_guest_memfd_guest(void)
 		    "Default VM type should always support guest_memfd mmap()");
 
 	size = vm->page_size;
-	fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP);
+	fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP |
+					     GUEST_MEMFD_FLAG_DEFAULT_SHARED);
 	vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, size, NULL, fd, 0);
 
 	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 08a6bc7d25b6..19f05a45be04 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -328,6 +328,9 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
+	if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_DEFAULT_SHARED))
+		return VM_FAULT_SIGBUS;
+
 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		int err = PTR_ERR(folio);
@@ -525,7 +528,8 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 valid_flags = 0;
 
 	if (kvm_arch_supports_gmem_mmap(kvm))
-		valid_flags |= GUEST_MEMFD_FLAG_MMAP;
+		valid_flags |= GUEST_MEMFD_FLAG_MMAP |
+			       GUEST_MEMFD_FLAG_DEFAULT_SHARED;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;
-- 
2.51.0.536.g15c5d4f767-goog

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Fuad Tabba 4 months, 1 week ago

Hi Sean,

On Fri, 26 Sept 2025 at 17:31, Sean Christopherson <seanjc@google.com> wrote:
>
> Add a guest_memfd flag to allow userspace to state that the underlying
> memory should be configured to be shared by default, and reject user page
> faults if the guest_memfd instance's memory isn't shared by default.
> Because KVM doesn't yet support in-place private<=>shared conversions, all
> guest_memfd memory effectively follows the default state.
>
> Alternatively, KVM could deduce the default state based on MMAP, which for
> all intents and purposes is what KVM currently does.  However, implicitly
> deriving the default state based on MMAP will result in a messy ABI when
> support for in-place conversions is added.
>
> For x86 CoCo VMs, which don't yet support MMAP, memory is currently private
> by default (otherwise the memory would be unusable).  If MMAP implies
> memory is shared by default, then the default state for CoCo VMs will vary
> based on MMAP, and from userspace's perspective, will change when in-place
> conversion support is added.  I.e. to maintain guest<=>host ABI, userspace
> would need to immediately convert all memory from shared=>private, which
> is both ugly and inefficient.  The inefficiency could be avoided by adding
> a flag to state that memory is _private_ by default, irrespective of MMAP,
> but that would lead to an equally messy and hard to document ABI.
>
> Bite the bullet and immediately add a flag to control the default state so
> that the effective behavior is explicit and straightforward.
>
> Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Fuad Tabba <tabba@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/api.rst                 | 10 ++++++++--
>  include/uapi/linux/kvm.h                       |  3 ++-
>  tools/testing/selftests/kvm/guest_memfd_test.c |  5 +++--
>  virt/kvm/guest_memfd.c                         |  6 +++++-
>  4 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index c17a87a0a5ac..4dfe156bbe3c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6415,8 +6415,14 @@ guest_memfd range is not allowed (any number of memory regions can be bound to
>  a single guest_memfd file, but the bound ranges must not overlap).
>
>  When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
> -supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
> -enables mmap() and faulting of guest_memfd memory to host userspace.
> +supports GUEST_MEMFD_FLAG_MMAP and  GUEST_MEMFD_FLAG_DEFAULT_SHARED.  Setting

There's an extra space between `and` and `GUEST_MEMFD_FLAG_DEFAULT_SHARED`.

> +the MMAP flag on guest_memfd creation enables mmap() and faulting of guest_memfd
> +memory to host userspace (so long as the memory is currently shared).  Setting
> +DEFAULT_SHARED makes all guest_memfd memory shared by default (versus private
> +by default).  Note!  Because KVM doesn't yet support in-place private<=>shared
> +conversions, DEFAULT_SHARED must be specified in order to fault memory into
> +userspace page tables.  This limitation will go away when in-place conversions
> +are supported.

I think that a more accurate (and future proof) description of the
mmap flag could be something along the lines of:

+ Setting GUEST_MEMFD_FLAG_MMAP enables using mmap() on the file descriptor.

+ Setting GUEST_MEMFD_FLAG_DEFAULT_SHARED makes all memory in the file shared
+ by default, as opposed to private. Shared memory can be faulted into host
+ userspace page tables. Private memory cannot.
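
For illustration only, the split proposed above can be modeled as two independent checks: MMAP gates the mmap() call itself, while the shared state gates each host page fault. This is a minimal sketch, not kernel code. The flag values mirror the patch; gmem_model_mmap_ok() and gmem_model_fault_ok() are hypothetical names, and the model assumes there is no in-place conversion, so every page stays in its initial state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined by this patch in include/uapi/linux/kvm.h. */
#define GUEST_MEMFD_FLAG_MMAP           (1ULL << 0)
#define GUEST_MEMFD_FLAG_DEFAULT_SHARED (1ULL << 1)

/* Hypothetical model: may userspace mmap() the file descriptor at all? */
static bool gmem_model_mmap_ok(uint64_t flags)
{
	return (flags & GUEST_MEMFD_FLAG_MMAP) != 0;
}

/*
 * Hypothetical model: is a host userspace fault serviced (true) or
 * rejected with SIGBUS (false)?  A fault can only happen on an existing
 * mapping, so MMAP is a precondition rather than part of the answer.
 */
static bool gmem_model_fault_ok(uint64_t flags)
{
	assert(gmem_model_mmap_ok(flags));

	/* No conversions yet: the creation-time state is the current state. */
	return (flags & GUEST_MEMFD_FLAG_DEFAULT_SHARED) != 0;
}
```

In this model, a file created with only MMAP can be mapped, but every access through the mapping faults with SIGBUS.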

>  When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>  guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6efa98a57ec1..38a2c083b6aa 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1599,7 +1599,8 @@ struct kvm_memory_attributes {
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>
>  #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> -#define GUEST_MEMFD_FLAG_MMAP  (1ULL << 0)
> +#define GUEST_MEMFD_FLAG_MMAP          (1ULL << 0)
> +#define GUEST_MEMFD_FLAG_DEFAULT_SHARED        (1ULL << 1)
>
>  struct kvm_create_guest_memfd {
>         __u64 size;
> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> index b3ca6737f304..81b11a958c7a 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -274,7 +274,7 @@ static void test_guest_memfd(unsigned long vm_type)
>         vm = vm_create_barebones_type(vm_type);
>
>         if (vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_MMAP))
> -               flags |= GUEST_MEMFD_FLAG_MMAP;
> +               flags |= GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>
>         test_create_guest_memfd_multiple(vm);
>         test_create_guest_memfd_invalid_sizes(vm, flags, page_size);
> @@ -337,7 +337,8 @@ static void test_guest_memfd_guest(void)
>                     "Default VM type should always support guest_memfd mmap()");
>
>         size = vm->page_size;
> -       fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP);
> +       fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP |
> +                                            GUEST_MEMFD_FLAG_DEFAULT_SHARED);
>         vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, size, NULL, fd, 0);
>
>         mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 08a6bc7d25b6..19f05a45be04 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -328,6 +328,9 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>                 return VM_FAULT_SIGBUS;
>
> +       if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_DEFAULT_SHARED))
> +               return VM_FAULT_SIGBUS;
> +
>         folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>         if (IS_ERR(folio)) {
>                 int err = PTR_ERR(folio);
> @@ -525,7 +528,8 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>         u64 valid_flags = 0;
>
>         if (kvm_arch_supports_gmem_mmap(kvm))
> -               valid_flags |= GUEST_MEMFD_FLAG_MMAP;
> +               valid_flags |= GUEST_MEMFD_FLAG_MMAP |
> +                              GUEST_MEMFD_FLAG_DEFAULT_SHARED;

At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
checking for that, at least until we have in-place conversion? Having
only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP,
isn't a useful combination.

That said, these are all nits, I'll leave it to you. With that:

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad



>
>         if (flags & ~valid_flags)
>                 return -EINVAL;
> --
> 2.51.0.536.g15c5d4f767-goog
>

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Ackerley Tng 4 months, 1 week ago

Fuad Tabba <tabba@google.com> writes:

> Hi Sean,
>
> On Fri, 26 Sept 2025 at 17:31, Sean Christopherson <seanjc@google.com> wrote:
>>
>> Add a guest_memfd flag to allow userspace to state that the underlying
>> memory should be configured to be shared by default, and reject user page
>> faults if the guest_memfd instance's memory isn't shared by default.
>> Because KVM doesn't yet support in-place private<=>shared conversions, all
>> guest_memfd memory effectively follows the default state.
>>
>> Alternatively, KVM could deduce the default state based on MMAP, which for
>> all intents and purposes is what KVM currently does.  However, implicitly
>> deriving the default state based on MMAP will result in a messy ABI when
>> support for in-place conversions is added.
>>
>> For x86 CoCo VMs, which don't yet support MMAP, memory is currently private
>> by default (otherwise the memory would be unusable).  If MMAP implies
>> memory is shared by default, then the default state for CoCo VMs will vary
>> based on MMAP, and from userspace's perspective, will change when in-place
>> conversion support is added.  I.e. to maintain guest<=>host ABI, userspace
>> would need to immediately convert all memory from shared=>private, which
>> is both ugly and inefficient.  The inefficiency could be avoided by adding
>> a flag to state that memory is _private_ by default, irrespective of MMAP,
>> but that would lead to an equally messy and hard to document ABI.
>>
>> Bite the bullet and immediately add a flag to control the default state so
>> that the effective behavior is explicit and straightforward.
>>

I like having this flag, but didn't propose this because I thought folks
depending on the default being shared (Patrick/Nikita) might have their
usage broken.

>> Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> ---
>>  Documentation/virt/kvm/api.rst                 | 10 ++++++++--
>>  include/uapi/linux/kvm.h                       |  3 ++-
>>  tools/testing/selftests/kvm/guest_memfd_test.c |  5 +++--
>>  virt/kvm/guest_memfd.c                         |  6 +++++-
>>  4 files changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index c17a87a0a5ac..4dfe156bbe3c 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6415,8 +6415,14 @@ guest_memfd range is not allowed (any number of memory regions can be bound to
>>  a single guest_memfd file, but the bound ranges must not overlap).
>>
>>  When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
>> -supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
>> -enables mmap() and faulting of guest_memfd memory to host userspace.
>> +supports GUEST_MEMFD_FLAG_MMAP and  GUEST_MEMFD_FLAG_DEFAULT_SHARED.  Setting
>
> There's an extra space between `and` and `GUEST_MEMFD_FLAG_DEFAULT_SHARED`.
>

+1 on this. Also, would you consider putting the concept of "at creation
time" or "at initialization time" into the name of the flag?

"Default" could be interpreted as "whenever a folio is allocated for
this guest_memfd", the memory the folio represents is by default
shared.

What we want to represent is that when the guest_memfd is created,
memory at all indices is initialized as shared.

Looking a bit further, when conversion is supported, if this flag is not
specified, then all the indices are initialized as private, right?

>> +the MMAP flag on guest_memfd creation enables mmap() and faulting of guest_memfd
>> +memory to host userspace (so long as the memory is currently shared).  Setting
>> +DEFAULT_SHARED makes all guest_memfd memory shared by default (versus private
>> +by default).  Note!  Because KVM doesn't yet support in-place private<=>shared
>> +conversions, DEFAULT_SHARED must be specified in order to fault memory into
>> +userspace page tables.  This limitation will go away when in-place conversions
>> +are supported.
>
> I think that a more accurate (and future proof) description of the
> mmap flag could be something along the lines of:
>

+1 on these suggestions, I agree that making the concepts of SHARED vs
MMAP orthogonal from the start is more future proof.

> + Setting GUEST_MEMFD_FLAG_MMAP enables using mmap() on the file descriptor.
>
> + Setting GUEST_MEMFD_FLAG_DEFAULT_SHARED makes all memory in the file shared
> + by default

See above, I'd prefer clarifying this as "at initialization time" or
something similar.

> , as opposed to private. Shared memory can be faulted into host
> + userspace page tables. Private memory cannot.
>
>>  When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>>  guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6efa98a57ec1..38a2c083b6aa 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1599,7 +1599,8 @@ struct kvm_memory_attributes {
>>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>>
>>  #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
>> -#define GUEST_MEMFD_FLAG_MMAP  (1ULL << 0)
>> +#define GUEST_MEMFD_FLAG_MMAP          (1ULL << 0)
>> +#define GUEST_MEMFD_FLAG_DEFAULT_SHARED        (1ULL << 1)
>>
>>  struct kvm_create_guest_memfd {
>>         __u64 size;
>> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
>> index b3ca6737f304..81b11a958c7a 100644
>> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
>> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
>> @@ -274,7 +274,7 @@ static void test_guest_memfd(unsigned long vm_type)
>>         vm = vm_create_barebones_type(vm_type);
>>
>>         if (vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_MMAP))
>> -               flags |= GUEST_MEMFD_FLAG_MMAP;
>> +               flags |= GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>>
>>         test_create_guest_memfd_multiple(vm);
>>         test_create_guest_memfd_invalid_sizes(vm, flags, page_size);
>> @@ -337,7 +337,8 @@ static void test_guest_memfd_guest(void)
>>                     "Default VM type should always support guest_memfd mmap()");
>>
>>         size = vm->page_size;
>> -       fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP);
>> +       fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP |
>> +                                            GUEST_MEMFD_FLAG_DEFAULT_SHARED);
>>         vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, size, NULL, fd, 0);
>>
>>         mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 08a6bc7d25b6..19f05a45be04 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -328,6 +328,9 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>>                 return VM_FAULT_SIGBUS;
>>
>> +       if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_DEFAULT_SHARED))
>> +               return VM_FAULT_SIGBUS;
>> +
>>         folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>>         if (IS_ERR(folio)) {
>>                 int err = PTR_ERR(folio);
>> @@ -525,7 +528,8 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>         u64 valid_flags = 0;
>>
>>         if (kvm_arch_supports_gmem_mmap(kvm))
>> -               valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>> +               valid_flags |= GUEST_MEMFD_FLAG_MMAP |
>> +                              GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>
> At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
> GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
> checking for that, at least until we have in-place conversion? Having
> only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP,
> isn't a useful combination.
>

I think it's okay to have the two flags be orthogonal from the start.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

> That said, these are all nits, I'll leave it to you. With that:
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
>
>
>
>>
>>         if (flags & ~valid_flags)
>>                 return -EINVAL;
>> --
>> 2.51.0.536.g15c5d4f767-goog
>>

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Mon, Sep 29, 2025, Ackerley Tng wrote:
> Fuad Tabba <tabba@google.com> writes:
> 
> > Hi Sean,
> >
> > On Fri, 26 Sept 2025 at 17:31, Sean Christopherson <seanjc@google.com> wrote:
> >>
> >> Add a guest_memfd flag to allow userspace to state that the underlying
> >> memory should be configured to be shared by default, and reject user page
> >> faults if the guest_memfd instance's memory isn't shared by default.
> >> Because KVM doesn't yet support in-place private<=>shared conversions, all
> >> guest_memfd memory effectively follows the default state.
> >>
> >> Alternatively, KVM could deduce the default state based on MMAP, which for
> >> all intents and purposes is what KVM currently does.  However, implicitly
> >> deriving the default state based on MMAP will result in a messy ABI when
> >> support for in-place conversions is added.
> >>
> >> For x86 CoCo VMs, which don't yet support MMAP, memory is currently private
> >> by default (otherwise the memory would be unusable).  If MMAP implies
> >> memory is shared by default, then the default state for CoCo VMs will vary
> >> based on MMAP, and from userspace's perspective, will change when in-place
> >> conversion support is added.  I.e. to maintain guest<=>host ABI, userspace
> >> would need to immediately convert all memory from shared=>private, which
> >> is both ugly and inefficient.  The inefficiency could be avoided by adding
> >> a flag to state that memory is _private_ by default, irrespective of MMAP,
> >> but that would lead to an equally messy and hard to document ABI.
> >>
> >> Bite the bullet and immediately add a flag to control the default state so
> >> that the effective behavior is explicit and straightforward.
> >>
> 
> I like having this flag, but didn't propose this because I thought folks
> depending on the default being shared (Patrick/Nikita) might have their
> usage broken.

mmap() support hasn't landed upstream, so as far as the upstream kernel is
concerned, there is no userspace to break.  Which is exactly why I want to land
this (or something like it) in 6.18, before GUEST_MEMFD_FLAG_MMAP is officially
released.

> >> Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Fuad Tabba <tabba@google.com>
> >> Signed-off-by: Sean Christopherson <seanjc@google.com>
> >> ---
> >>  Documentation/virt/kvm/api.rst                 | 10 ++++++++--
> >>  include/uapi/linux/kvm.h                       |  3 ++-
> >>  tools/testing/selftests/kvm/guest_memfd_test.c |  5 +++--
> >>  virt/kvm/guest_memfd.c                         |  6 +++++-
> >>  4 files changed, 18 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index c17a87a0a5ac..4dfe156bbe3c 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -6415,8 +6415,14 @@ guest_memfd range is not allowed (any number of memory regions can be bound to
> >>  a single guest_memfd file, but the bound ranges must not overlap).
> >>
> >>  When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
> >> -supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
> >> -enables mmap() and faulting of guest_memfd memory to host userspace.
> >> +supports GUEST_MEMFD_FLAG_MMAP and  GUEST_MEMFD_FLAG_DEFAULT_SHARED.  Setting
> >
> > There's an extra space between `and` and `GUEST_MEMFD_FLAG_DEFAULT_SHARED`.
> >
> 
> +1 on this. Also, would you consider putting the concept of "at creation
> time" or "at initialization time" into the name of the flag?

Yah, GUEST_MEMFD_FLAG_INIT_SHARED?

> "Default" could be interpreted as "whenever a folio is allocated for
> this guest_memfd", the memory the folio represents is by default
> shared.
> 
> What we want to represent is that when the guest_memfd is created,
> memory at all indices is initialized as shared.
> 
> Looking a bit further, when conversion is supported, if this flag is not
> specified, then all the indices are initialized as private, right?

Correct, which is the current (pre-6.18) behavior.
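
To make the "initial state" semantics concrete, here is a small userspace model, offered as a sketch only. It assumes a per-index shared/private state and a future in-place conversion operation, neither of which exists in KVM as of this patch. struct gmem_model, gmem_model_init() and gmem_model_convert() are hypothetical names, and the four-page size is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value as defined by this patch. */
#define GUEST_MEMFD_FLAG_DEFAULT_SHARED (1ULL << 1)

#define GMEM_MODEL_NR_PAGES 4	/* arbitrary size for the model */

/* Hypothetical per-index state; KVM has no such tracking in this patch. */
struct gmem_model {
	bool shared[GMEM_MODEL_NR_PAGES];
};

/* Creation: the flag only picks the state that every index starts in. */
static void gmem_model_init(struct gmem_model *g, uint64_t flags)
{
	bool init_shared = (flags & GUEST_MEMFD_FLAG_DEFAULT_SHARED) != 0;

	for (size_t i = 0; i < GMEM_MODEL_NR_PAGES; i++)
		g->shared[i] = init_shared;
}

/* Hypothetical future conversion: changes one index, ignores the flag. */
static void gmem_model_convert(struct gmem_model *g, size_t index,
			       bool to_shared)
{
	assert(index < GMEM_MODEL_NR_PAGES);
	g->shared[index] = to_shared;
}
```

Once conversions exist, the flag says nothing about the current state of a given index, only about where it started, which is the argument for putting "INIT" in the name.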

> >> +the MMAP flag on guest_memfd creation enables mmap() and faulting of guest_memfd
> >> +memory to host userspace (so long as the memory is currently shared).  Setting
> >> +DEFAULT_SHARED makes all guest_memfd memory shared by default (versus private
> >> +by default).  Note!  Because KVM doesn't yet support in-place private<=>shared
> >> +conversions, DEFAULT_SHARED must be specified in order to fault memory into
> >> +userspace page tables.  This limitation will go away when in-place conversions
> >> +are supported.
> >
> > I think that a more accurate (and future proof) description of the
> > mmap flag could be something along the lines of:
> >
> 
> +1 on these suggestions, I agree that making the concepts of SHARED vs
> MMAP orthogonal from the start is more future proof.
> 
> > + Setting GUEST_MEMFD_FLAG_MMAP enables using mmap() on the file descriptor.
> >
> > + Setting GUEST_MEMFD_FLAG_DEFAULT_SHARED makes all memory in the file shared
> > + by default
> 
> See above, I'd prefer clarifying this as "at initialization time" or
> something similar.

Roger that.

> > At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and GUEST_MEMFD_FLAG_MMAP
> > don't make sense without each other. Is it worth checking for that, at
> > least until we have in-place conversion? Having only
> > GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP, isn't a
> > useful combination.

Heh, that's exactly how I coded things up to start:

        /*
         * TODO: Drop the restriction that memory must be shared by default
         *       once in-place conversions are supported.
         */
        if (flags & GUEST_MEMFD_FLAG_MMAP &&
            !(flags & GUEST_MEMFD_FLAG_DEFAULT_SHARED))
                return -EINVAL;

but if we go that route, then dropping the restriction would result in an ABI
change for non-CoCo VMs.  The odds of such an ABI change breaking userspace are
basically zero, but I couldn't think of any reason to risk it; userspace would
need to specify MMAP+SHARED either way.

And on the flip side, not enforcing the flags at the time of creation allows us
to test that user page faults to private memory are rejected.  It's not a ton of
meaningful coverage, but it's not nothing either.  And from a code perspective,
the diffs when in-place conversions are added are quite nice, as the concepts
don't change (user faults to private memory are disallowed), only the mechanics
change, i.e. the diffs highlight what all needs to happen to support conversions
without the extra noise of a change in overall semantics.
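
For reference, the two creation-time policies discussed above can be compared side by side in a standalone model. This is a sketch under stated assumptions, not the kernel implementation: gmem_model_create_strict() and gmem_model_create_posted() are hypothetical names, and the model assumes kvm_arch_supports_gmem_mmap() is true so that both flags are valid.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values as defined by this patch. */
#define GUEST_MEMFD_FLAG_MMAP           (1ULL << 0)
#define GUEST_MEMFD_FLAG_DEFAULT_SHARED (1ULL << 1)

/* Model assumption: the arch supports mmap(), so both flags are valid. */
#define GMEM_MODEL_VALID_FLAGS \
	(GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_DEFAULT_SHARED)

/* Posted patch: any combination of the known flags is accepted. */
static int gmem_model_create_posted(uint64_t flags)
{
	if (flags & ~GMEM_MODEL_VALID_FLAGS)
		return -EINVAL;

	return 0;
}

/* Earlier draft: additionally reject MMAP without DEFAULT_SHARED. */
static int gmem_model_create_strict(uint64_t flags)
{
	int ret = gmem_model_create_posted(flags);

	if (ret)
		return ret;

	if ((flags & GUEST_MEMFD_FLAG_MMAP) &&
	    !(flags & GUEST_MEMFD_FLAG_DEFAULT_SHARED))
		return -EINVAL;

	/* Sanity check on the model: only known flags get this far. */
	assert(!(flags & ~GMEM_MODEL_VALID_FLAGS));
	return 0;
}
```

The only input on which the two differ is MMAP without DEFAULT_SHARED. Under the posted policy that creation succeeds and the rejection moves to fault time, which is what lets a selftest exercise the SIGBUS path.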

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Patrick Roy 4 months, 1 week ago

Hi Ackerley!

On Mon, 2025-09-29 at 10:43 +0100, Ackerley Tng wrote:
> Fuad Tabba <tabba@google.com> writes:
> 
>> Hi Sean,
>>
>> On Fri, 26 Sept 2025 at 17:31, Sean Christopherson <seanjc@google.com> wrote:
>>>
>>> Add a guest_memfd flag to allow userspace to state that the underlying
>>> memory should be configured to be shared by default, and reject user page
>>> faults if the guest_memfd instance's memory isn't shared by default.
>>> Because KVM doesn't yet support in-place private<=>shared conversions, all
>>> guest_memfd memory effectively follows the default state.
>>>
>>> Alternatively, KVM could deduce the default state based on MMAP, which for
>>> all intents and purposes is what KVM currently does.  However, implicitly
>>> deriving the default state based on MMAP will result in a messy ABI when
>>> support for in-place conversions is added.
>>>
>>> For x86 CoCo VMs, which don't yet support MMAP, memory is currently private
>>> by default (otherwise the memory would be unusable).  If MMAP implies
>>> memory is shared by default, then the default state for CoCo VMs will vary
>>> based on MMAP, and from userspace's perspective, will change when in-place
>>> conversion support is added.  I.e. to maintain guest<=>host ABI, userspace
>>> would need to immediately convert all memory from shared=>private, which
>>> is both ugly and inefficient.  The inefficiency could be avoided by adding
>>> a flag to state that memory is _private_ by default, irrespective of MMAP,
>>> but that would lead to an equally messy and hard to document ABI.
>>>
>>> Bite the bullet and immediately add a flag to control the default state so
>>> that the effective behavior is explicit and straightforward.
>>>
> 
> I like having this flag, but didn't propose this because I thought folks
> depending on the default being shared (Patrick/Nikita) might have their
> usage broken.

We'll just need to pass the new flag in Firecracker, that's not a problem
at all :) We aren't running this anywhere in production yet, so nothing
would break on our end.

>>> Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Fuad Tabba <tabba@google.com>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>  Documentation/virt/kvm/api.rst                 | 10 ++++++++--
>>>  include/uapi/linux/kvm.h                       |  3 ++-
>>>  tools/testing/selftests/kvm/guest_memfd_test.c |  5 +++--
>>>  virt/kvm/guest_memfd.c                         |  6 +++++-
>>>  4 files changed, 18 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>>> index c17a87a0a5ac..4dfe156bbe3c 100644
>>> --- a/Documentation/virt/kvm/api.rst
>>> +++ b/Documentation/virt/kvm/api.rst
>>> @@ -6415,8 +6415,14 @@ guest_memfd range is not allowed (any number of memory regions can be bound to
>>>  a single guest_memfd file, but the bound ranges must not overlap).
>>>
>>>  When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
>>> -supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
>>> -enables mmap() and faulting of guest_memfd memory to host userspace.
>>> +supports GUEST_MEMFD_FLAG_MMAP and  GUEST_MEMFD_FLAG_DEFAULT_SHARED.  Setting
>>
>> There's an extra space between `and` and `GUEST_MEMFD_FLAG_DEFAULT_SHARED`.
>>
> 
> +1 on this. Also, would you consider putting the concept of "at creation
> time" or "at initialization time" into the name of the flag?
> 
> "Default" could be interpreted as "whenever a folio is allocated for
> this guest_memfd", the memory the folio represents is by default
> shared.
> 
> What we want to represent is that when the guest_memfd is created,
> memory at all indices is initialized as shared.
> 
> Looking a bit further, when conversion is supported, if this flag is not
> specified, then all the indices are initialized as private, right?
> 
>>> +the MMAP flag on guest_memfd creation enables mmap() and faulting of guest_memfd
>>> +memory to host userspace (so long as the memory is currently shared).  Setting
>>> +DEFAULT_SHARED makes all guest_memfd memory shared by default (versus private
>>> +by default).  Note!  Because KVM doesn't yet support in-place private<=>shared
>>> +conversions, DEFAULT_SHARED must be specified in order to fault memory into
>>> +userspace page tables.  This limitation will go away when in-place conversions
>>> +are supported.
>>
>> I think that a more accurate (and future proof) description of the
>> mmap flag could be something along the lines of:
>>
> 
> +1 on these suggestions, I agree that making the concepts of SHARED vs
> MMAP orthogonal from the start is more future proof.
> 
>> + Setting GUEST_MEMFD_FLAG_MMAP enables using mmap() on the file descriptor.
>>
>> + Setting GUEST_MEMFD_FLAG_DEFAULT_SHARED makes all memory in the file shared
>> + by default
> 
> See above, I'd prefer clarifying this as "at initialization time" or
> something similar.
> 
>> , as opposed to private. Shared memory can be faulted into host
>> + userspace page tables. Private memory cannot.
>>
>>>  When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>>>  guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 6efa98a57ec1..38a2c083b6aa 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -1599,7 +1599,8 @@ struct kvm_memory_attributes {
>>>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>>>
>>>  #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
>>> -#define GUEST_MEMFD_FLAG_MMAP  (1ULL << 0)
>>> +#define GUEST_MEMFD_FLAG_MMAP          (1ULL << 0)
>>> +#define GUEST_MEMFD_FLAG_DEFAULT_SHARED        (1ULL << 1)
>>>
>>>  struct kvm_create_guest_memfd {
>>>         __u64 size;
>>> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
>>> index b3ca6737f304..81b11a958c7a 100644
>>> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
>>> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
>>> @@ -274,7 +274,7 @@ static void test_guest_memfd(unsigned long vm_type)
>>>         vm = vm_create_barebones_type(vm_type);
>>>
>>>         if (vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_MMAP))
>>> -               flags |= GUEST_MEMFD_FLAG_MMAP;
>>> +               flags |= GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>>>
>>>         test_create_guest_memfd_multiple(vm);
>>>         test_create_guest_memfd_invalid_sizes(vm, flags, page_size);
>>> @@ -337,7 +337,8 @@ static void test_guest_memfd_guest(void)
>>>                     "Default VM type should always support guest_memfd mmap()");
>>>
>>>         size = vm->page_size;
>>> -       fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP);
>>> +       fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP |
>>> +                                            GUEST_MEMFD_FLAG_DEFAULT_SHARED);
>>>         vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, size, NULL, fd, 0);
>>>
>>>         mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>>> index 08a6bc7d25b6..19f05a45be04 100644
>>> --- a/virt/kvm/guest_memfd.c
>>> +++ b/virt/kvm/guest_memfd.c
>>> @@ -328,6 +328,9 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>>         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>>>                 return VM_FAULT_SIGBUS;
>>>
>>> +       if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_DEFAULT_SHARED))
>>> +               return VM_FAULT_SIGBUS;
>>> +
>>>         folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>>>         if (IS_ERR(folio)) {
>>>                 int err = PTR_ERR(folio);
>>> @@ -525,7 +528,8 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>>         u64 valid_flags = 0;
>>>
>>>         if (kvm_arch_supports_gmem_mmap(kvm))
>>> -               valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>>> +               valid_flags |= GUEST_MEMFD_FLAG_MMAP |
>>> +                              GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>>
>> At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
>> GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
>> checking for that, at least until we have in-place conversion? Having
>> only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP,
>> isn't a useful combination.
>>
> 
> I think it's okay to have the two flags be orthogonal from the start.

I think I dimly remember someone at one of the guest_memfd syncs
bringing up a usecase for having a VMA even if all memory is private,
not for faulting anything in, but to do madvise or something? Maybe it
was the NUMA stuff? (+Shivank)

So for that, having the flags be orthogonal would be useful even without
conversion support.

> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> 
>> That said, these are all nits, I'll leave it to you. With that:
>>
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Tested-by: Fuad Tabba <tabba@google.com>
>>
>> Cheers,
>> /fuad
>>
>>
>>
>>>
>>>         if (flags & ~valid_flags)
>>>                 return -EINVAL;
>>> --
>>> 2.51.0.536.g15c5d4f767-goog
>>>

Best,
Patrick

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by David Hildenbrand 4 months, 1 week ago

                          GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>>>
>>> At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
>>> GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
>>> checking for that, at least until we have in-place conversion? Having
>>> only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP,
>>> isn't a useful combination.
>>>
>>
>> I think it's okay to have the two flags be orthogonal from the start.
> 
> I think I dimly remember someone at one of the guest_memfd syncs
> bringing up a usecase for having a VMA even if all memory is private,
> not for faulting anything in, but to do madvise or something? Maybe it
> was the NUMA stuff? (+Shivank)

Yes, that should be it. But we're never faulting in these pages, we only 
need the VMA (for the time being, until there is the in-place conversion).

-- 
Cheers

David / dhildenb

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Ackerley Tng 4 months, 1 week ago

David Hildenbrand <david@redhat.com> writes:

>                           GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>>>>
>>>> At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
>>>> GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
>>>> checking for that, at least until we have in-place conversion? Having
>>>> only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP,
>>>> isn't a useful combination.
>>>>
>>>
>>> I think it's okay to have the two flags be orthogonal from the start.
>> 
>> I think I dimly remember someone at one of the guest_memfd syncs
>> bringing up a usecase for having a VMA even if all memory is private,
>> not for faulting anything in, but to do madvise or something? Maybe it
>> was the NUMA stuff? (+Shivank)
>
> Yes, that should be it. But we're never faulting in these pages, we only 
> need the VMA (for the time being, until there is the in-place conversion).
>

Yup, Sean's patch disables faulting if GUEST_MEMFD_FLAG_DEFAULT_SHARED
is not set, but mmap() is always enabled so madvise() still works.

Requiring GUEST_MEMFD_FLAG_DEFAULT_SHARED to be set together with
GUEST_MEMFD_FLAG_MMAP would still allow madvise() to work since
GUEST_MEMFD_FLAG_DEFAULT_SHARED only gates faulting.

To clarify, I'm still for making GUEST_MEMFD_FLAG_DEFAULT_SHARED
orthogonal to GUEST_MEMFD_FLAG_MMAP with no additional checks on top of
whatever's in this patch. :)
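
To spell out the orthogonality, here is a simplified model of the two checks
as I understand them from this patch.  It's a sketch, not the kernel code:
the helper names are made up and the flag values are local copies of the
ones in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag values from the patch under review. */
#define GMEM_FLAG_MMAP           (1ULL << 0)
#define GMEM_FLAG_DEFAULT_SHARED (1ULL << 1)

/* mmap(), and thus madvise() on the resulting VMA, depends only on MMAP. */
static bool gmem_can_mmap(uint64_t flags)
{
	return (flags & GMEM_FLAG_MMAP) != 0;
}

/*
 * A user page fault needs a VMA to fault on (MMAP) and memory that is
 * shared (DEFAULT_SHARED, until in-place conversion exists).
 */
static bool gmem_can_fault(uint64_t flags)
{
	return gmem_can_mmap(flags) &&
	       (flags & GMEM_FLAG_DEFAULT_SHARED) != 0;
}
```

MMAP alone gives a VMA that can't be faulted, which is the madvise()/NUMA
case.  DEFAULT_SHARED alone is accepted but gives userspace nothing to fault
on.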

> -- 
> Cheers
>
> David / dhildenb

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Mon, Sep 29, 2025, Ackerley Tng wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
> >                           GUEST_MEMFD_FLAG_DEFAULT_SHARED;
> >>>>
> >>>> At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
> >>>> GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
> >>>> checking for that, at least until we have in-place conversion? Having
> >>>> only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP,
> >>>> isn't a useful combination.
> >>>>
> >>>
> >>> I think it's okay to have the two flags be orthogonal from the start.
> >> 
> >> I think I dimly remember someone at one of the guest_memfd syncs
> >> bringing up a usecase for having a VMA even if all memory is private,
> >> not for faulting anything in, but to do madvise or something? Maybe it
> >> was the NUMA stuff? (+Shivank)
> >
> > Yes, that should be it. But we're never faulting in these pages, we only 
> > need the VMA (for the time being, until there is the in-place conversion).
> >
> 
> Yup, Sean's patch disables faulting if GUEST_MEMFD_FLAG_DEFAULT_SHARED
> is not set, but mmap() is always enabled so madvise() still works.

Hah!  I totally intended that :-D

> Requiring GUEST_MEMFD_FLAG_DEFAULT_SHARED to be set together with
> GUEST_MEMFD_FLAG_MMAP would still allow madvise() to work since
> GUEST_MEMFD_FLAG_DEFAULT_SHARED only gates faulting.
> 
> To clarify, I'm still for making GUEST_MEMFD_FLAG_DEFAULT_SHARED
> orthogonal to GUEST_MEMFD_FLAG_MMAP with no additional checks on top of
> whatever's in this patch. :)

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Mon, Sep 29, 2025, Sean Christopherson wrote:
> On Mon, Sep 29, 2025, Ackerley Tng wrote:
> > David Hildenbrand <david@redhat.com> writes:
> > 
> > >                           GUEST_MEMFD_FLAG_DEFAULT_SHARED;
> > >>>>
> > >>>> At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
> > >>>> GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
> > >>>> checking for that, at least until we have in-place conversion? Having
> > >>>> only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but not GUEST_MEMFD_FLAG_MMAP,
> > >>>> isn't a useful combination.
> > >>>>
> > >>>
> > >>> I think it's okay to have the two flags be orthogonal from the start.
> > >> 
> > >> I think I dimly remember someone at one of the guest_memfd syncs
> > >> bringing up a usecase for having a VMA even if all memory is private,
> > >> not for faulting anything in, but to do madvise or something? Maybe it
> > >> was the NUMA stuff? (+Shivank)
> > >
> > > Yes, that should be it. But we're never faulting in these pages, we only 
> > > need the VMA (for the time being, until there is the in-place conversion).
> > >
> > 
> > Yup, Sean's patch disables faulting if GUEST_MEMFD_FLAG_DEFAULT_SHARED
> > is not set, but mmap() is always enabled so madvise() still works.
> 
> Hah!  I totally intended that :-D
> 
> > Requiring GUEST_MEMFD_FLAG_DEFAULT_SHARED to be set together with
> > GUEST_MEMFD_FLAG_MMAP would still allow madvise() to work since
> > GUEST_MEMFD_FLAG_DEFAULT_SHARED only gates faulting.
> > 
> > To clarify, I'm still for making GUEST_MEMFD_FLAG_DEFAULT_SHARED
> > orthogonal to GUEST_MEMFD_FLAG_MMAP with no additional checks on top of
> > whatever's in this patch. :)

Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
KVM_CAP_GUEST_MEMFD_MMAP.  Two things:

 1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
    that we don't need to add a capability every time a new flag comes along,
    and so that userspace can gather all flags in a single ioctl.  If gmem ever
    supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
    that's a non-issue relatively speaking.

 2. We should allow mmap() for x86 CoCo VMs right away.  As evidenced by this
    series, mmap() on private memory is totally fine.  It's not useful until the
    NUMA and/or in-place conversion support comes along, but it's not dangerous in
    any way.  The actual restriction is on initializing memory to be shared,
    because allowing memory to be shared from gmem's perspective while it's
    private from the VM's perspective would be all kinds of broken.


E.g. with a s/kvm_arch_supports_gmem_mmap/kvm_arch_supports_gmem_init_shared:

	case KVM_CAP_GUEST_MEMFD_FLAGS:
		if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
			return GUEST_MEMFD_FLAG_MMAP |
			       GUEST_MEMFD_FLAG_INIT_SHARED;

		return GUEST_MEMFD_FLAG_MMAP;

#2 is also a good reason to add INIT_SHARED straightaway.  Without INIT_SHARED,
we'd have to add INIT_PRIVATE to make the NUMA support useful for x86 CoCo VMs, i.e.
it's not just in-place conversion that's affected, IIUC.

I'll add this in v2.
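
On the userspace side, consuming such a capability could look something like
the sketch below.  Everything here is an assumption pending v2: the flag
values mirror the proposal above, the helper names are made up, and
'check_ext_ret' stands for whatever KVM_CHECK_EXTENSION returns for the
proposed KVM_CAP_GUEST_MEMFD_FLAGS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the proposed flag values; names may change in v2. */
#define GMEM_FLAG_MMAP        (1ULL << 0)
#define GMEM_FLAG_INIT_SHARED (1ULL << 1)

/*
 * Turn the KVM_CHECK_EXTENSION return value into a mask of usable flags.
 * A failed ioctl (< 0) and an unknown capability (0) both mean "no flags".
 */
static uint64_t gmem_supported_flags(int check_ext_ret)
{
	if (check_ext_ret <= 0)
		return 0;

	return (uint64_t)check_ext_ret;
}

/* True if every bit in 'flag' is reported as supported. */
static bool gmem_flag_supported(int check_ext_ret, uint64_t flag)
{
	return (gmem_supported_flags(check_ext_ret) & flag) == flag;
}
```

E.g. an x86 CoCo VM would be expected to report MMAP without INIT_SHARED,
while a default VM would report both.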

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Vishal Annapurve 4 months, 1 week ago

On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
>
>  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
>     that we don't need to add a capability every time a new flag comes along,
>     and so that userspace can gather all flags in a single ioctl.  If gmem ever
>     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
>     that's a non-issue relatively speaking.
>

Guest_memfd capabilities don't necessarily translate into flags, so ideally:
1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
KVM_CAP_GUEST_MEMFD_CAPS.
2) IMO they should both support namespace of 64 values at least from the get go.
3) The reservation scheme for upstream should ideally be LSB's first
for the new caps/flags.

guest_memfd will achieve multiple features in future, both upstream
and in out-of-tree versions to deploy features before they make their
way upstream. Generally the scheme followed by out-of-tree versions is
to define a custom UAPI that won't conflict with upstream UAPIs in
near future. Having a namespace of 32 values gives little space to
avoid the conflict, e.g. features like hugetlb support will have to
eat up at least 5 bits from the flags [1].

[1] https://elixir.bootlin.com/linux/v6.17/source/include/uapi/asm-generic/hugetlb_encode.h#L20
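
For reference, the generic hugetlb encoding in [1] stores log2 of the page
size in a 6-bit field starting at bit 26.  A minimal sketch of that
arithmetic follows, with local copies of the shift/mask values and made-up
helper names.  Whether guest_memfd would reuse this exact layout is an
assumption on my part.

```c
#include <stdint.h>

/* Same values as HUGETLB_FLAG_ENCODE_SHIFT/MASK in hugetlb_encode.h. */
#define HUGETLB_ENCODE_SHIFT 26
#define HUGETLB_ENCODE_MASK  0x3fULL

/* Encode log2(page size), e.g. 21 for 2MB or 30 for 1GB, into flag bits. */
static uint64_t hugetlb_encode(unsigned int page_shift)
{
	return ((uint64_t)page_shift & HUGETLB_ENCODE_MASK) << HUGETLB_ENCODE_SHIFT;
}

/* Recover log2(page size) from the flag bits. */
static unsigned int hugetlb_decode(uint64_t flags)
{
	return (unsigned int)((flags >> HUGETLB_ENCODE_SHIFT) & HUGETLB_ENCODE_MASK);
}
```

That field occupies bits 26-31, i.e. the top of a 32-bit flag namespace.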

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
> >
> >  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> >     that we don't need to add a capability every time a new flag comes along,
> >     and so that userspace can gather all flags in a single ioctl.  If gmem ever
> >     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> >     that's a non-issue relatively speaking.
> >
> 
> Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> KVM_CAP_GUEST_MEMFD_CAPS.

I'm not saying we can't have another GUEST_MEMFD capability or three, all I'm
saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.

> 2) IMO they should both support namespace of 64 values at least from the get go.

It's a limitation of KVM_CHECK_EXTENSION, and all of KVM's plumbing for ioctls.
Because KVM still supports 32-bit architectures, direct returns from ioctls are
forced to fit in 32-bit values to avoid unintentionally creating different ABI
for 32-bit vs. 64-bit kernels.

We could add KVM_CHECK_EXTENSION2 or KVM_CHECK_EXTENSION64 or something, but I
honestly don't see the point.  The odds of guest_memfd supporting >32 flags are
small, and the odds of that happening in the next ~5 years are basically zero.
All so that userspace can make one syscall instead of two for a path that isn't
remotely performance critical.

So while I agree that being able to enumerate 64 flags from the get-go would be
nice to have, it's simply not worth the effort (unless someone has a clever idea).
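
If it ever came to that, the "two syscalls" version is trivial for userspace.
A sketch, assuming a hypothetical KVM_CAP_GUEST_MEMFD_FLAGS2 that reports
bits 32-63 (neither that capability nor the helper below exists today):

```c
#include <stdint.h>

/*
 * lo: value reported for the proposed KVM_CAP_GUEST_MEMFD_FLAGS (bits 0-31).
 * hi: value reported for a hypothetical KVM_CAP_GUEST_MEMFD_FLAGS2
 *     (bits 32-63).
 * Returns the full 64-bit mask of supported KVM_CREATE_GUEST_MEMFD flags.
 */
static uint64_t gmem_combine_flag_caps(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```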

> 3) The reservation scheme for upstream should ideally be LSB's first
> for the new caps/flags.

We're getting way ahead of ourselves.  Nothing needs KVM_CAP_GUEST_MEMFD_CAPS at
this time, so there's nothing to discuss.

> guest_memfd will achieve multiple features in future, both upstream
> and in out-of-tree versions to deploy features before they make their

When it comes to upstream uAPI and uABI, out-of-tree kernel code is irrelevant.

> way upstream. Generally the scheme followed by out-of-tree versions is
> to define a custom UAPI that won't conflict with upstream UAPIs in
> near future. Having a namespace of 32 values gives little space to
> avoid the conflict, e.g. features like hugetlb support will have to
> eat up at least 5 bits from the flags [1].

Why on earth would out-of-tree code use KVM_CAP_GUEST_MEMFD_FLAGS?   Providing
infrastructure to support an infinite (quite literally) number of out-of-tree
capabilities and sub-ioctls, with practically zero chance of conflict, is not
difficult.  See internal b/378111418.

But as above, this is not upstream's problem to solve.

> [1] https://elixir.bootlin.com/linux/v6.17/source/include/uapi/asm-generic/hugetlb_encode.h#L20

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Vishal Annapurve 4 months, 1 week ago

On Wed, Oct 1, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > > KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
> > >
> > >  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> > >     that we don't need to add a capability every time a new flag comes along,
> > >     and so that userspace can gather all flags in a single ioctl.  If gmem ever
> > >     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> > >     that's a non-issue relatively speaking.
> > >
> >
> > Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> > 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> > KVM_CAP_GUEST_MEMFD_CAPS.
>
> I'm not saying we can't have another GUEST_MEMFD capability or three, all I'm
> saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
> KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.

Ah, ok. Then do you envision the guest_memfd caps to still be separate
KVM caps per guest_memfd feature?

>
> > 2) IMO they should both support namespace of 64 values at least from the get go.
>
> It's a limitation of KVM_CHECK_EXTENSION, and all of KVM's plumbing for ioctls.
> Because KVM still supports 32-bit architectures, direct returns from ioctls are
> forced to fit in 32-bit values to avoid unintentionally creating different ABI
> for 32-bit vs. 64-bit kernels.
>
> We could add KVM_CHECK_EXTENSION2 or KVM_CHECK_EXTENSION64 or something, but I
> honestly don't see the point.  The odds of guest_memfd supporting >32 flags is
> small, and the odds of that happening in the next ~5 years is basically zero.
> All so that userspace can make one syscall instead of two for a path that isn't
> remotely performance critical.
>
> So while I agree that being able to enumerate 64 flags from the get-go would be
> nice to have, it's simply not worth the effort (unless someone has a clever idea).

Ack.

>
> > 3) The reservation scheme for upstream should ideally be LSB's first
> > for the new caps/flags.
>
> We're getting way ahead of ourselves.  Nothing needs KVM_CAP_GUEST_MEMFD_CAPS at
> this time, so there's nothing to discuss.
>
> > guest_memfd will achieve multiple features in future, both upstream
> > and in out-of-tree versions to deploy features before they make their
>
> When it comes to upstream uAPI and uABI, out-of-tree kernel code is irrelevant.
>
> > way upstream. Generally the scheme followed by out-of-tree versions is
> > to define a custom UAPI that won't conflict with upstream UAPIs in
> > near future. Having a namespace of 32 values gives little space to
> > avoid the conflict, e.g. features like hugetlb support will have to
> > eat up at least 5 bits from the flags [1].
>
> Why on earth would out-of-tree code use KVM_CAP_GUEST_MEMFD_FLAGS?   Providing

I can imagine a scenario where KVM_CAP_GUEST_MEMFD_FLAGS is upstreamed
and more flags landing in KVM_CAP_GUEST_MEMFD_FLAGS as supported over
time afterwards. out-of-tree code may ingest KVM_CAP_GUEST_MEMFD_FLAGS
in between.

> infrastructure to support an infinite (quite literally) number of out-of-tree
> capabilities and sub-ioctls, with practically zero chance of conflict, is not
> difficult.  See internal b/378111418.
>
> But as above, this is not upstream's problem to solve.
>
> > [1] https://elixir.bootlin.com/linux/v6.17/source/include/uapi/asm-generic/hugetlb_encode.h#L20

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> On Wed, Oct 1, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > > > KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
> > > >
> > > >  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> > > >     that we don't need to add a capability every time a new flag comes along,
> > > >     and so that userspace can gather all flags in a single ioctl.  If gmem ever
> > > >     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> > > >     that's a non-issue relatively speaking.
> > > >
> > >
> > > Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> > > 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> > > KVM_CAP_GUEST_MEMFD_CAPS.
> >
> > I'm not saying we can't have another GUEST_MEMFD capability or three, all I'm
> > saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
> > KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.
> 
> Ah, ok. Then do you envision the guest_memfd caps to still be separate
> KVM caps per guest_memfd feature?

Yes?  No?  It depends on the feature and the actual implementation.  E.g.
KVM_CAP_IRQCHIP enumerates support for a whole pile of ioctls.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Vishal Annapurve 4 months, 1 week ago

On Wed, Oct 1, 2025 at 10:16 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > On Wed, Oct 1, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > > On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > > > > KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
> > > > >
> > > > >  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> > > > >     that we don't need to add a capability every time a new flag comes along,
> > > > >     and so that userspace can gather all flags in a single ioctl.  If gmem ever
> > > > >     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> > > > >     that's a non-issue relatively speaking.
> > > > >
> > > >
> > > > Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> > > > 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> > > > KVM_CAP_GUEST_MEMFD_CAPS.
> > >
> > > I'm not saying we can't have another GUEST_MEMFD capability or three, all I'm
> > > saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
> > > KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.
> >
> > Ah, ok. Then do you envision the guest_memfd caps to still be separate
> > KVM caps per guest_memfd feature?
>
> Yes?  No?  It depends on the feature and the actual implementation.  E.g.
> KVM_CAP_IRQCHIP enumerates support for a whole pile of ioctls.

I think I am confused. Is the proposal here as follows?
* Use KVM_CAP_GUEST_MEMFD_FLAGS for features that map to guest_memfd
creation flags.
* Use KVM caps for guest_memfd features that don't map to any flags.

I think in general it would be better to have a KVM cap for each
feature irrespective of the flags as the feature may also need
additional UAPIs like IOCTLs.

I fail to see the benefits of KVM_CAP_GUEST_MEMFD_FLAGS over
KVM_CAP_GUEST_MEMFD_MMAP:
1) It limits the possible values to 32 even though we could pass 64 flags to
the original ioctl.
2) Userspace has to anyways assume the semantics of each bit position.
3) Userspace still has to check for caps for features that carry extra
UAPI baggage.

KVM_CAP_GUEST_MEMFD_MMAP allows userspace to assume that mmap is
supported and userspace can just pass in the mmap flag that it anyways
has to assume.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> On Wed, Oct 1, 2025 at 10:16 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > On Wed, Oct 1, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > > > On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > >
> > > > > > Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > > > > > KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
> > > > > >
> > > > > >  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> > > > > >     that we don't need to add a capability every time a new flag comes along,
> > > > > >     and so that userspace can gather all flags in a single ioctl.  If gmem ever
> > > > > >     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> > > > > >     that's a non-issue relatively speaking.
> > > > > >
> > > > >
> > > > > Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> > > > > 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> > > > > KVM_CAP_GUEST_MEMFD_CAPS.
> > > >
> > > > I'm not saying we can't have another GUEST_MEMFD capability or three, all I'm
> > > > saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
> > > > KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.
> > >
> > > Ah, ok. Then do you envision the guest_memfd caps to still be separate
> > > KVM caps per guest_memfd feature?
> >
> > Yes?  No?  It depends on the feature and the actual implementation.  E.g.
> > KVM_CAP_IRQCHIP enumerates support for a whole pile of ioctls.
> 
> I think I am confused. Is the proposal here as follows?
> * Use KVM_CAP_GUEST_MEMFD_FLAGS for features that map to guest_memfd
> creation flags.

No, the proposal is to use KVM_CAP_GUEST_MEMFD_FLAGS to enumerate the set of
supported KVM_CREATE_GUEST_MEMFD flags.  Whether or not there is an associated
"feature" is irrelevant.  I.e. it's a very literal "these are the supported
flags".

> * Use KVM caps for guest_memfd features that don't map to any flags.
> 
> I think in general it would be better to have a KVM cap for each
> feature irrespective of the flags as the feature may also need
                                                   ^^^
> additional UAPIs like IOCTLs.

If the _only_ user-visible asset that is added is a KVM_CREATE_GUEST_MEMFD flag,
a CAP is gross overkill.  Even if there are other assets that accompany the new
flag, there's no reason we couldn't say "this feature exists if XYZ flag is
supported".

E.g. it's functionally no different than KVM_CAP_VM_TYPES reporting support for
KVM_X86_TDX_VM also effectively reporting support for a _huge_ number of things
far beyond being able to create a VM of type KVM_X86_TDX_VM.

KVM_CAP_XEN_HVM is a big collection of flags that have very little in common other
than being for Xen emulation.

> I fail to see the benefits of KVM_CAP_GUEST_MEMFD_FLAGS over
> KVM_CAP_GUEST_MEMFD_MMAP:

Adding a new flag doesn't require all of the things that come along with a new
capability.  E.g. there's zero chance of collisions between maintainer sub-trees
(at least as far as capabilities go; if multiple maintainers are adding multiple
gmem flags in the same kernel release, I really hope they'd be coordinating).

Enumerating in userspace is also more natural, e.g. userspace doesn't have to
manually build the mask of valid flags.
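
A rough sketch of the difference, from userspace's point of view.  The
per-flag capability inputs and both helper names are hypothetical, and the
flag values are local copies of the ones discussed in this thread:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag values discussed in this thread. */
#define GMEM_FLAG_MMAP        (1ULL << 0)
#define GMEM_FLAG_INIT_SHARED (1ULL << 1)

/*
 * One capability per flag: userspace has to translate each capability into
 * its flag by hand, and grows a new branch for every new flag.
 */
static uint64_t valid_flags_from_per_flag_caps(bool cap_mmap,
					       bool cap_init_shared)
{
	uint64_t valid = 0;

	if (cap_mmap)
		valid |= GMEM_FLAG_MMAP;
	if (cap_init_shared)
		valid |= GMEM_FLAG_INIT_SHARED;

	return valid;
}

/*
 * One capability for all flags: the value reported for the capability
 * already is the mask, so new flags need no new userspace plumbing.
 */
static uint64_t valid_flags_from_flags_cap(int cap_ret)
{
	return cap_ret > 0 ? (uint64_t)cap_ret : 0;
}
```

Either way, userspace can then reject a request with
"requested & ~valid" before ever calling KVM_CREATE_GUEST_MEMFD.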

Writing documentation should be much easier (much less boilerplate), e.g. the
sum total of uAPI for adding GUEST_MEMFD_FLAG_INIT_SHARED is:

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7ba92f2ced38..754b662a453c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6438,6 +6438,11 @@ specified via KVM_CREATE_GUEST_MEMFD.  Currently defined flags:
   ============================ ================================================
   GUEST_MEMFD_FLAG_MMAP        Enable using mmap() on the guest_memfd file
                                descriptor.
+  GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
+                               KVM_CREATE_GUEST_MEMFD (memory files created
+                               without INIT_SHARED will be marked private).
+                               Shared memory can be faulted into host userspace
+                               page tables. Private memory cannot.
   ============================ ================================================

 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b1d52d0c56ec..52f6000ab020 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1599,7 +1599,8 @@ struct kvm_memory_attributes {
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)

 #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
-#define GUEST_MEMFD_FLAG_MMAP  (1ULL << 0)
+#define GUEST_MEMFD_FLAG_MMAP          (1ULL << 0)
+#define GUEST_MEMFD_FLAG_INIT_SHARED   (1ULL << 1)

 struct kvm_create_guest_memfd {
        __u64 size;

> 1) It limits the possible values to 32 even though we could pass 64 flags to
> the original ioctl.

So because we're currently limited to 32 flags, we should instead throw in the
towel and artificially limit ourselves to 1 flag (0 or 1)?  Because for all intents
and purposes, that's what we'd be doing.

Again, that is unlikely to be problematic before I retire.  It might not even be
a problem _ever_, because with luck we'll kill off 32-bit KVM in the next few
years and then we can actually leverage returning a "long" from ioctls.  Literally
every capability that returns a mask of flags has this "problem"; it's not notable
or even an issue in practice.
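
For illustration only, if a second capability such as the KVM_CAP_GUEST_MEMFD_FLAGS2 idea floated elsewhere in this thread were ever needed, userspace could stitch the two int-sized results into one 64-bit mask roughly like this. The helper name is hypothetical, and treating negative values as "query failed" is an assumption of the sketch.

```c
#include <stdint.h>

/*
 * Widen two int-sized capability results into one 64-bit flags mask.
 * Values <= 0 model "unsupported" or an error and contribute no bits.
 */
static uint64_t demo_combine_flag_caps(int flags_lo, int flags_hi)
{
	uint64_t lo = flags_lo > 0 ? (uint32_t)flags_lo : 0;
	uint64_t hi = flags_hi > 0 ? (uint32_t)flags_hi : 0;

	return lo | (hi << 32);
}
```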

> 2) Userspace has to anyways assume the semantics of each bit position.

Not always.

	uint64_t valid_flags = vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS);
	uint64_t flag;
	int fd;

	for (flag = BIT(0); flag; flag <<= 1) {
		fd = __vm_create_guest_memfd(vm, page_size, flag);
		if (flag & valid_flags) {
			TEST_ASSERT(fd >= 0,
				    "guest_memfd() with flag '0x%lx' should succeed",
				    flag);
			close(fd);
		} else {
			TEST_ASSERT(fd < 0 && errno == EINVAL,
				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
				    flag);
		}
	}

But pedantry aside, I don't see how this is at all an interesting point.  Yes,
userspace has to know how to use a feature.

> 3) Userspace still has to check for caps for features that carry extra
> UAPI baggage.

That's simply not true.  E.g. see the example with VM types.

> KVM_CAP_GUEST_MEMFD_MMAP allows userspace to assume that mmap is
> supported and userspace can just pass in the mmap flag that it anyways
> has to assume.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Vishal Annapurve 4 months, 1 week ago

On Wed, Oct 1, 2025 at 5:04 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > On Wed, Oct 1, 2025 at 10:16 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > > On Wed, Oct 1, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > > > > On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > >
> > > > > > > Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > > > > > > KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
> > > > > > >
> > > > > > >  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> > > > > > >     that we don't need to add a capability every time a new flag comes along,
> > > > > > >     and so that userspace can gather all flags in a single ioctl.  If gmem ever
> > > > > > >     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> > > > > > >     that's a non-issue relatively speaking.
> > > > > > >
> > > > > >
> > > > > > Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> > > > > > 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> > > > > > KVM_CAP_GUEST_MEMFD_CAPS.
> > > > >
> > > > > I'm not saying we can't have another GUEST_MEMFD capability or three, all I'm
> > > > > saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
> > > > > KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.
> > > >
> > > > Ah, ok. Then do you envision the guest_memfd caps to still be separate
> > > > KVM caps per guest_memfd feature?
> > >
> > > Yes?  No?  It depends on the feature and the actual implementation.  E.g.
> > > KVM_CAP_IRQCHIP enumerates support for a whole pile of ioctls.
> >
> > I think I am confused. Is the proposal here as follows?
> > * Use KVM_CAP_GUEST_MEMFD_FLAGS for features that map to guest_memfd
> > creation flags.
>
> No, the proposal is to use KVM_CAP_GUEST_MEMFD_FLAGS to enumerate the set of
> supported KVM_CREATE_GUEST_MEMFD flags.  Whether or not there is an associated
> "feature" is irrelevant.  I.e. it's a very literal "these are the supported
> flags".
>
> > * Use KVM caps for guest_memfd features that don't map to any flags.
> >
> > I think in general it would be better to have a KVM cap for each
> > feature irrespective of the flags as the feature may also need
>                                                    ^^^
> > additional UAPIs like IOCTLs.
>
> If the _only_ user-visible asset that is added is a KVM_CREATE_GUEST_MEMFD flag,
> a CAP is gross overkill.  Even if there are other assets that accompany the new
> flag, there's no reason we couldn't say "this feature exist if XYZ flag is
> supported".
>
> E.g. it's functionally no different than KVM_CAP_VM_TYPES reporting support for
> KVM_X86_TDX_VM also effectively reporting support for a _huge_ number of things
> far beyond being able to create a VM of type KVM_X86_TDX_VM.
>

What's your opinion about having KVM_CAP_GUEST_MEMFD_MMAP part of
KVM_CAP_GUEST_MEMFD_CAPS i.e. having a KVM cap covering all features
of guest_memfd? That seems more consistent to me in order for
userspace to deduce the supported features and assume flags/ioctls/...
associated with the feature as a group.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Thu, Oct 02, 2025, Vishal Annapurve wrote:
> On Wed, Oct 1, 2025 at 5:04 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > On Wed, Oct 1, 2025 at 10:16 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > > > On Wed, Oct 1, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > > >
> > > > > > On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> > > > > > > On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > > >
> > > > > > > > Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > > > > > > > KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
> > > > > > > >
> > > > > > > >  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> > > > > > > >     that we don't need to add a capability every time a new flag comes along,
> > > > > > > >     and so that userspace can gather all flags in a single ioctl.  If gmem ever
> > > > > > > >     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> > > > > > > >     that's a non-issue relatively speaking.
> > > > > > > >
> > > > > > >
> > > > > > > Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> > > > > > > 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> > > > > > > KVM_CAP_GUEST_MEMFD_CAPS.
> > > > > >
> > > > > > I'm not saying we can't have another GUEST_MEMFD capability or three, all I'm
> > > > > > saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
> > > > > > KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.
> > > > >
> > > > > Ah, ok. Then do you envision the guest_memfd caps to still be separate
> > > > > KVM caps per guest_memfd feature?
> > > >
> > > > Yes?  No?  It depends on the feature and the actual implementation.  E.g.
> > > > KVM_CAP_IRQCHIP enumerates support for a whole pile of ioctls.
> > >
> > > I think I am confused. Is the proposal here as follows?
> > > * Use KVM_CAP_GUEST_MEMFD_FLAGS for features that map to guest_memfd
> > > creation flags.
> >
> > No, the proposal is to use KVM_CAP_GUEST_MEMFD_FLAGS to enumerate the set of
> > supported KVM_CREATE_GUEST_MEMFD flags.  Whether or not there is an associated
> > "feature" is irrelevant.  I.e. it's a very literal "these are the supported
> > flags".
> >
> > > * Use KVM caps for guest_memfd features that don't map to any flags.
> > >
> > > I think in general it would be better to have a KVM cap for each
> > > feature irrespective of the flags as the feature may also need
> >                                                    ^^^
> > > additional UAPIs like IOCTLs.
> >
> > If the _only_ user-visible asset that is added is a KVM_CREATE_GUEST_MEMFD flag,
> > a CAP is gross overkill.  Even if there are other assets that accompany the new
> > flag, there's no reason we couldn't say "this feature exist if XYZ flag is
> > supported".
> >
> > E.g. it's functionally no different than KVM_CAP_VM_TYPES reporting support for
> > KVM_X86_TDX_VM also effectively reporting support for a _huge_ number of things
> > far beyond being able to create a VM of type KVM_X86_TDX_VM.
> >
> 
> What's your opinion about having KVM_CAP_GUEST_MEMFD_MMAP part of
> KVM_CAP_GUEST_MEMFD_CAPS i.e. having a KVM cap covering all features
> of guest_memfd?

I'd much prefer to have both.  Describing flags for an ioctl via a bitmask that
doesn't *exactly* match the flags is asking for problems.  At best, it will be
confusing.  E.g. we'll probably end up with code like this:

	gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);

	if (gmem_caps & KVM_CAP_GUEST_MEMFD_MMAP)
		gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
	if (gmem_caps & KVM_CAP_GUEST_MEMFD_INIT_SHARED)
		gmem_flags |= KVM_CAP_GUEST_MEMFD_INIT_SHARED;

Those types of patterns often lead to typos causing problems (LOL, case in point,
there's a typo above; I'm leaving it to illustrate my point).  That can be largely
solved by userspace via macro shenanigans, but userspace really shouldn't have to
jump through hoops for such a simple thing.
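
For reference, the kind of macro shenanigans meant here might look like the sketch below. All DEMO_* names and bit positions are hypothetical; the point is only that the feature name is typed once and token pasting derives both the cap bit and the flag bit.

```c
#include <stdint.h>

/* Hypothetical capability bits, deliberately laid out differently from the flags. */
#define DEMO_GMEM_CAP_MMAP		(1ULL << 4)
#define DEMO_GMEM_CAP_INIT_SHARED	(1ULL << 7)

/* Hypothetical flag bits for the same two features. */
#define DEMO_GMEM_FLAG_MMAP		(1ULL << 0)
#define DEMO_GMEM_FLAG_INIT_SHARED	(1ULL << 1)

/*
 * Name the feature once; token pasting picks the matching cap and flag,
 * so a mismatched cap/flag pair can't be typed by accident.
 */
#define DEMO_GMEM_CAP_TO_FLAG(caps, feat) \
	(((caps) & DEMO_GMEM_CAP_##feat) ? DEMO_GMEM_FLAG_##feat : 0ULL)

static uint64_t demo_caps_to_flags(uint64_t caps)
{
	return DEMO_GMEM_CAP_TO_FLAG(caps, MMAP) |
	       DEMO_GMEM_CAP_TO_FLAG(caps, INIT_SHARED);
}
```

It works, but it is exactly the sort of hoop that a literal "these are the supported flags" mask makes unnecessary.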

An even worse outcome is if userspace does something like:

	gmem_flags = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);

Which might actually work initially, e.g. if KVM_CAP_GUEST_MEMFD_MMAP and
GUEST_MEMFD_FLAG_MMAP have the same value.  But eventually userspace will be sad.
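
A small sketch of how that shortcut goes wrong once the cap bits and flag bits stop lining up. The DEMO_* names and bit layouts are hypothetical and chosen so that the first feature happens to match and a later one does not.

```c
#include <stdint.h>

/* Hypothetical flag bits. */
#define DEMO_FLAG_MMAP		(1ULL << 0)
#define DEMO_FLAG_INIT_SHARED	(1ULL << 1)

/* Hypothetical cap bits: MMAP matches by luck, INIT_SHARED does not. */
#define DEMO_CAPBIT_MMAP	(1ULL << 0)
#define DEMO_CAPBIT_NO_FLAG	(1ULL << 1)	/* a feature with no creation flag */
#define DEMO_CAPBIT_INIT_SHARED	(1ULL << 2)

/* The shortcut: treat the capability mask as if it were the flags mask. */
static uint64_t demo_flags_shortcut(uint64_t caps)
{
	return caps;
}

/* The explicit translation from capability bits to flag bits. */
static uint64_t demo_flags_translated(uint64_t caps)
{
	uint64_t flags = 0;

	if (caps & DEMO_CAPBIT_MMAP)
		flags |= DEMO_FLAG_MMAP;
	if (caps & DEMO_CAPBIT_INIT_SHARED)
		flags |= DEMO_FLAG_INIT_SHARED;
	return flags;
}
```

With only MMAP reported, both helpers agree, which is why the shortcut can survive initial testing. Once INIT_SHARED is reported, the shortcut passes bit 2 as a flag and misses bit 1.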

Another issue is that, while unlikely, we could run out of KVM_CAP_GUEST_MEMFD_CAPS
bits before we run out of flags.

And if we use memory attributes, we're also guaranteed to have at least one gmem
capability that returns a bitmask separately from a dedicated one-size-fits-all
cap, e.g.

	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
		if (vm_memory_attributes)
			return 0;

		return kvm_supported_mem_attributes(kvm);

Side topic, looking at this, I don't think we need KVM_CAP_GUEST_MEMFD_CAPS, I'm
pretty sure we can simply extend KVM_CAP_GUEST_MEMFD.  E.g. 

#define KVM_GUEST_MEMFD_FEAT_BASIC		(1ULL << 0)
#define KVM_GUEST_MEMFD_FEAT_FANCY		(1ULL << 1)

	case KVM_CAP_GUEST_MEMFD:
		return KVM_GUEST_MEMFD_FEAT_BASIC |
		       KVM_GUEST_MEMFD_FEAT_FANCY;
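
A sketch of why extending the existing cap could stay compatible, assuming today's return value is a plain nonzero "supported" indication that coincides with the BASIC bit. The FEAT_* names are the illustrative ones from above, not real uAPI, and the demo_* helpers are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative feature bits from the message above; not real uAPI. */
#define KVM_GUEST_MEMFD_FEAT_BASIC	(1ULL << 0)
#define KVM_GUEST_MEMFD_FEAT_FANCY	(1ULL << 1)

/* Old userspace: only asks "is guest_memfd supported at all?". */
static bool demo_old_has_gmem(int cap_ret)
{
	return cap_ret > 0;
}

/* New userspace: interprets the same return value as a feature mask. */
static bool demo_new_has_feat(int cap_ret, uint64_t feat)
{
	return cap_ret > 0 && ((uint64_t)cap_ret & feat) == feat;
}
```

Old userspace keeps working against a new kernel because any nonzero mask still reads as "supported", and new userspace on an old kernel sees only the BASIC bit.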

> That seems more consistent to me in order for userspace to deduce the
> supported features and assume flags/ioctls/...  associated with the feature
> as a group.

If we add a feature that comes with a flag, we could always add both, i.e. a
feature flag for KVM_CAP_GUEST_MEMFD along with the natural enumeration for
KVM_CAP_GUEST_MEMFD_FLAGS.  That certainly wouldn't be my first choice, but it's
a possibility, e.g. if it really is the most intuitive solution.  But that's
getting quite hypothetical.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Vishal Annapurve 4 months, 1 week ago

On Thu, Oct 2, 2025, 5:12 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > >
> > > If the _only_ user-visible asset that is added is a KVM_CREATE_GUEST_MEMFD flag,
> > > a CAP is gross overkill.  Even if there are other assets that accompany the new
> > > flag, there's no reason we couldn't say "this feature exist if XYZ flag is
> > > supported".
> > >
> > > E.g. it's functionally no different than KVM_CAP_VM_TYPES reporting support for
> > > KVM_X86_TDX_VM also effectively reporting support for a _huge_ number of things
> > > far beyond being able to create a VM of type KVM_X86_TDX_VM.
> > >
> >
> > What's your opinion about having KVM_CAP_GUEST_MEMFD_MMAP part of
> > KVM_CAP_GUEST_MEMFD_CAPS i.e. having a KVM cap covering all features
> > of guest_memfd?
>
> I'd much prefer to have both.  Describing flags for an ioctl via a bitmask that
> doesn't *exactly* match the flags is asking for problems.  At best, it will be
> confusing.  E.g. we'll probably end up with code like this:
>
>         gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);
>
>         if (gmem_caps & KVM_CAP_GUEST_MEMFD_MMAP)
>                 gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
>         if (gmem_caps & KVM_CAP_GUEST_MEMFD_INIT_SHARED)
>                 gmem_flags |= KVM_CAP_GUEST_MEMFD_INIT_SHARED;
>

No, I actually meant the userspace can just rely on the cap to assume
right flags to be available (not necessarily the same flags as cap
bits).

i.e. Userspace will do something like:
gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);

if (gmem_caps & KVM_CAP_GUEST_MEMFD_MMAP)
        gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
if (gmem_caps & KVM_CAP_GUEST_MEMFD_HUGETLB)
        gmem_flags |= GUEST_MEMFD_FLAG_HUGETLB | GUEST_MEMFD_FLAG_HUGETLB_2MB;

Userspace has to anyways assume flag values, userspace just needs to
know if a particular feature is available.

> ...
> Another issue is that, while unlikely, we could run out of KVM_CAP_GUEST_MEMFD_CAPS
> bits before we run out of flags.

I would say that's unlikely as I know of at least one feature that
needs multiple flag bits.

>
> And if we use memory attributes, we're also guaranteed to have at least one gmem
> capability that returns a bitmask separately from a dedicated one-size-fits-all
> cap, e.g.
>
>         case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
>                 if (vm_memory_attributes)
>                         return 0;
>
>                 return kvm_supported_mem_attributes(kvm);

For this one, we need a separate dedicated cap.

>
> Side topic, looking at this, I don't think we need KVM_CAP_GUEST_MEMFD_CAPS, I'm
> pretty sure we can simply extend KVM_CAP_GUEST_MEMFD.  E.g.
>
> #define KVM_GUEST_MEMFD_FEAT_BASIC              (1ULL << 0)
> #define KVM_GUEST_MEMFD_FEAT_FANCY              (1ULL << 1)
>
>         case KVM_CAP_GUEST_MEMFD:
>                 return KVM_GUEST_MEMFD_FEAT_BASIC |
>                        KVM_GUEST_MEMFD_FEAT_FANCY;

This scheme seems ok to me.

>
> > That seems more consistent to me in order for userspace to deduce the
> > supported features and assume flags/ioctls/...  associated with the feature
> > as a group.
>
> If we add a feature that comes with a flag, we could always add both, i.e. a
> feature flag for KVM_CAP_GUEST_MEMFD along with the natural enumeration for
> KVM_CAP_GUEST_MEMFD_FLAGS.  That certainly wouldn't be my first choice, but it's
> a possibility, e.g. if it really is the most intuitive solution.  But that's
> getting quite hypothetical.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Sean Christopherson 4 months, 1 week ago

On Thu, Oct 02, 2025, Vishal Annapurve wrote:
> On Thu, Oct 2, 2025, 5:12 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > >
> > > > If the _only_ user-visible asset that is added is a KVM_CREATE_GUEST_MEMFD flag,
> > > > a CAP is gross overkill.  Even if there are other assets that accompany the new
> > > > flag, there's no reason we couldn't say "this feature exist if XYZ flag is
> > > > supported".
> > > >
> > > > E.g. it's functionally no different than KVM_CAP_VM_TYPES reporting support for
> > > > KVM_X86_TDX_VM also effectively reporting support for a _huge_ number of things
> > > > far beyond being able to create a VM of type KVM_X86_TDX_VM.
> > > >
> > >
> > > What's your opinion about having KVM_CAP_GUEST_MEMFD_MMAP part of
> > > KVM_CAP_GUEST_MEMFD_CAPS i.e. having a KVM cap covering all features
> > > of guest_memfd?
> >
> > I'd much prefer to have both.  Describing flags for an ioctl via a bitmask that
> > doesn't *exactly* match the flags is asking for problems.  At best, it will be
> > confusing.  E.g. we'll probably end up with code like this:
> >
> >         gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);
> >
> >         if (gmem_caps & KVM_CAP_GUEST_MEMFD_MMAP)
> >                 gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
> >         if (gmem_caps & KVM_CAP_GUEST_MEMFD_INIT_SHARED)
> >                 gmem_flags |= KVM_CAP_GUEST_MEMFD_INIT_SHARED;
> >
> 
> No, I actually meant the userspace can just rely on the cap to assume
> right flags to be available (not necessarily the same flags as cap
> bits).
> 
> i.e. Userspace will do something like:
> gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);
> 
> if (gmem_caps & KVM_CAP_GUEST_MEMFD_MMAP)
>         gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
> if (gmem_caps & KVM_CAP_GUEST_MEMFD_HUGETLB)
>         gmem_flags |= GUEST_MEMFD_FLAG_HUGETLB | GUEST_MEMFD_FLAG_HUGETLB_2MB;

Yes, that's exactly what I said.  But I goofed when I copy+pasted and failed to
do s/KVM_CAP_GUEST_MEMFD_INIT_SHARED/GUEST_MEMFD_FLAG_INIT_SHARED, which is the
type of bug that ideally just can't happen.

Side topic, I'm not at all convinced that this is what we want for KVM's uAPI:

	if (gmem_caps & KVM_CAP_GUEST_MEMFD_HUGETLB)
		gmem_flags |= GUEST_MEMFD_FLAG_HUGETLB | GUEST_MEMFD_FLAG_HUGETLB_2MB;

See https://lore.kernel.org/all/aN_fJEZXo6wkcHOh@google.com.

> Userspace has to anyways assume flag values, userspace just needs to
> know if a particular feature is available.

I don't understand what you mean by "assume flag values".

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Vishal Annapurve 4 months, 1 week ago

On Fri, Oct 3, 2025 at 9:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Oct 02, 2025, Vishal Annapurve wrote:
> > On Thu, Oct 2, 2025, 5:12 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > >
> > > > > If the _only_ user-visible asset that is added is a KVM_CREATE_GUEST_MEMFD flag,
> > > > > a CAP is gross overkill.  Even if there are other assets that accompany the new
> > > > > flag, there's no reason we couldn't say "this feature exist if XYZ flag is
> > > > > supported".
> > > > >
> > > > > E.g. it's functionally no different than KVM_CAP_VM_TYPES reporting support for
> > > > > KVM_X86_TDX_VM also effectively reporting support for a _huge_ number of things
> > > > > far beyond being able to create a VM of type KVM_X86_TDX_VM.
> > > > >
> > > >
> > > > What's your opinion about having KVM_CAP_GUEST_MEMFD_MMAP part of
> > > > KVM_CAP_GUEST_MEMFD_CAPS i.e. having a KVM cap covering all features
> > > > of guest_memfd?
> > >
> > > I'd much prefer to have both.  Describing flags for an ioctl via a bitmask that
> > > doesn't *exactly* match the flags is asking for problems.  At best, it will be
> > > confusing.  E.g. we'll probably end up with code like this:
> > >
> > >         gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);
> > >
> > >         if (gmem_caps & KVM_CAP_GUEST_MEMFD_MMAP)
> > >                 gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
> > >         if (gmem_caps & KVM_CAP_GUEST_MEMFD_INIT_SHARED)
> > >                 gmem_flags |= KVM_CAP_GUEST_MEMFD_INIT_SHARED;
> > >
> >
> > No, I actually meant the userspace can just rely on the cap to assume
> > right flags to be available (not necessarily the same flags as cap
> > bits).
> >
> > i.e. Userspace will do something like:
> > gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);
> >
> > if (gmem_caps & KVM_CAP_GUEST_MEMFD_MMAP)
> >         gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
> > if (gmem_caps & KVM_CAP_GUEST_MEMFD_HUGETLB)
> >         gmem_flags |= GUEST_MEMFD_FLAG_HUGETLB | GUEST_MEMFD_FLAG_HUGETLB_2MB;
>
> Yes, that's exactly what I said.  But I goofed when copy+pasted and failed to
> do s/KVM_CAP_GUEST_MEMFD_INIT_SHARED/GUEST_MEMFD_FLAG_INIT_SHARED, which is the
> type of bug that ideally just can't happen.
>
> Side topic, I'm not at all convinced that this is what we want for KVM's uAPI:
>
>         if (gmem_caps & KVM_CAP_GUEST_MEMFD_HUGETLB)
>                 gmem_flags |= GUEST_MEMFD_FLAG_HUGETLB | GUEST_MEMFD_FLAG_HUGETLB_2MB;
>
> See https://lore.kernel.org/all/aN_fJEZXo6wkcHOh@google.com.

Ack, that makes sense to me.

>
> > Userspace has to anyways assume flag values, userspace just needs to
> > know if a particular feature is available.
>
> I don't understand what you mean by "assume flag values".

Ok, I think you covered the explanation of why you would prefer to
have KVM_CAP_GUEST_MEMFD_FLAGS around and I misinterpreted some of it.

One more example with KVM_CAP_GUEST_MEMFD_FLAGS around:

gmem_caps = kvm_check_cap(KVM_CAP_GUEST_MEMFD_CAPS);
valid_flags = kvm_check_cap(KVM_CAP_GUEST_MEMFD_FLAGS);

if (gmem_caps & KVM_CAP_GUEST_MEMFD_CONVERSION) {
        // Use single memory backing paths for 4K backing
        if (valid_flags & GUEST_MEMFD_FLAG_MMAP)
                gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
        else
                // error out;
}
if (gmem_caps & KVM_CAP_GUEST_MEMFD_HUGETLB_CONVERSION) {
        // Use single memory backing paths for hugetlb memory backing
        if (valid_flags & GUEST_MEMFD_FLAG_HUGETLB) {
                gmem_flags |= GUEST_MEMFD_FLAG_HUGETLB;
                kvm_create_guest_memfd.huge_page_size_log2 = ...;
        } else
                // error out;
}

Userspace will have to rely on a combination of flags and caps to
decide its control flow instead of just caps. Thinking more about
this, I don't have a strong preference between the two scenarios, i.e. with
or without KVM_CAP_GUEST_MEMFD_FLAGS.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Ackerley Tng 4 months, 1 week ago

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Sep 29, 2025, Sean Christopherson wrote:
>> On Mon, Sep 29, 2025, Ackerley Tng wrote:
>> > David Hildenbrand <david@redhat.com> writes:
>> > 
>> > >                           GUEST_MEMFD_FLAG_DEFAULT_SHARED;
>> > >>>>
>> > >>>> At least for now, GUEST_MEMFD_FLAG_DEFAULT_SHARED and
>> > >>>> GUEST_MEMFD_FLAG_MMAP don't make sense without each other. Is it worth
>> > >>>> checking for that, at least until we have in-place conversion? Having
>> > >>>> only GUEST_MEMFD_FLAG_DEFAULT_SHARED set, but GUEST_MEMFD_FLAG_MMAP,
>> > >>>> isn't a useful combination.
>> > >>>>
>> > >>>
>> > >>> I think it's okay to have the two flags be orthogonal from the start.
>> > >> 
>> > >> I think I dimly remember someone at one of the guest_memfd syncs
>> > >> bringing up a usecase for having a VMA even if all memory is private,
>> > >> not for faulting anything in, but to do madvise or something? Maybe it
>> > >> was the NUMA stuff? (+Shivank)
>> > >
>> > > Yes, that should be it. But we're never faulting in these pages, we only 
>> > > need the VMA (for the time being, until there is the in-place conversion).
>> > >
>> > 
>> > Yup, Sean's patch disables faulting if GUEST_MEMFD_FLAG_DEFAULT_SHARED
>> > is not set, but mmap() is always enabled so madvise() still works.
>> 
>> Hah!  I totally intended that :-D
>> 
>> > Requiring GUEST_MEMFD_FLAG_DEFAULT_SHARED to be set together with
>> > GUEST_MEMFD_FLAG_MMAP would still allow madvise() to work since
>> > GUEST_MEMFD_FLAG_DEFAULT_SHARED only gates faulting.
>> > 
>> > To clarify, I'm still for making GUEST_MEMFD_FLAG_DEFAULT_SHARED
>> > orthogonal to GUEST_MEMFD_FLAG_MMAP with no additional checks on top of
>> > whatever's in this patch. :)
>
> Oh!  This got me looking at kvm_arch_supports_gmem_mmap() and thus
> KVM_CAP_GUEST_MEMFD_MMAP.  Two things:
>
>  1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
>     that we don't need to add a capability every time a new flag comes along,
>     and so that userspace can gather all flags in a single ioctl.  If gmem ever
>     supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
>     that's a non-issue relatively speaking.
>

This is a good idea. In my internal WIP series I have 3 flags and 4
CAPs, lol. Some of those CAPs are not for new flags, though.

Would like to check your rationale for future reference: how about
generalizing beyond flags and having KVM_CAP_GUEST_MEMFD_CAPS which
returns 32 bits, one bit for every guest_memfd-related (not necessarily
flags-related) cap?

>  2. We should allow mmap() for x86 CoCo VMs right away.  As evidenced by this
>     series, mmap() on private memory is totally fine.  It's not useful until the
>     NUMA and/or in-place conversion support comes along, but's not dangerous in
>     any way.  The actual restriction is on initializing memory to be shared,

The actual restriction is that private memory must not be mapped to host
userspace, so it's not really about initializing, though before
conversion, initialization state is the only state.

With GUEST_MEMFD_FLAG_INIT_SHARED, the entire guest_memfd is shared and
mappable; without GUEST_MEMFD_FLAG_INIT_SHARED the entire guest_memfd is
private and not mappable (gated in kvm_gmem_fault_user_mapping()).
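
The resulting behavior can be modeled roughly as below. This is a userspace model of the semantics described in this thread, not the kernel implementation: the demo_* helpers are hypothetical, and the flag values are the ones from the uAPI diff posted earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined by the uAPI diff quoted in this thread. */
#define GUEST_MEMFD_FLAG_MMAP		(1ULL << 0)
#define GUEST_MEMFD_FLAG_INIT_SHARED	(1ULL << 1)

/* Creating a VMA via mmap() depends only on the MMAP flag. */
static bool demo_gmem_can_mmap(uint64_t flags)
{
	return (flags & GUEST_MEMFD_FLAG_MMAP) != 0;
}

/*
 * Faulting pages into host userspace additionally requires INIT_SHARED,
 * because without in-place conversion all memory keeps its initial state.
 */
static bool demo_gmem_can_fault(uint64_t flags)
{
	return demo_gmem_can_mmap(flags) &&
	       (flags & GUEST_MEMFD_FLAG_INIT_SHARED) != 0;
}
```

So MMAP without INIT_SHARED gives a VMA that is usable for things like madvise() but can never be faulted in, which is the combination CoCo VMs would get.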

So yes, I agree that CoCo VMs should be allowed mmap() but not
GUEST_MEMFD_FLAG_INIT_SHARED, since GUEST_MEMFD_FLAG_INIT_SHARED makes
the entire guest_memfd take the shared state for the lifetime of
guest_memfd.

This is turning out to be a much nicer cleanup :)

>     because allowing memory to be shared from gmem's perspective while it's
>     private from the VM's perspective would be all kinds of broken.
>
>
> E.g. with a s/kvm_arch_supports_gmem_mmap/kvm_arch_supports_gmem_init_shared:
>
> 	case KVM_CAP_GUEST_MEMFD_FLAGS:
> 		if (!kvm || kvm_arch_supports_init_shared(kvm))
> 			return GUEST_MEMFD_FLAG_MMAP |
> 			       GUEST_MEMFD_FLAG_INIT_SHARED;
>
> 		return GUEST_MEMFD_FLAG_MMAP;
>

You might end up with this while actually coding v2 up, but how about

	case KVM_CAP_GUEST_MEMFD_FLAGS: {
		int flag_caps = GUEST_MEMFD_FLAG_MMAP;
                
		if (!kvm || kvm_arch_supports_init_shared(kvm))
			flag_caps |= GUEST_MEMFD_FLAG_INIT_SHARED;

		return flag_caps;
	}

Then all the new non-optional CAPs can be or-ed onto flag_caps from the
start.
        
> #2 is also a good reason to add INIT_SHARED straightaway.  Without INIT_SHARED,
> we'd have to INIT_PRIVATE to make the NUMA support useful for x86 CoCo VMs, i.e.
> it's not just in-place conversion that's affected, IIUC.
>
> I'll add this in v2.

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by David Hildenbrand 4 months, 1 week ago

On 26.09.25 18:31, Sean Christopherson wrote:
> Add a guest_memfd flag to allow userspace to state that the underlying
> memory should be configured to be shared by default, and reject user page
> faults if the guest_memfd instance's memory isn't shared by default.
> Because KVM doesn't yet support in-place private<=>shared conversions, all
> guest_memfd memory effectively follows the default state.

I recall we discussed exactly that in the past (e.g., on April 17) in the call:

"Current plan:
  * guest_memfd creation flag to specify “all memory starts as shared”
    * Compatible with the old behavior where all memory started as private
    * Initially, only these can be mmap (no in-place conversion)
"

> 
> Alternatively, KVM could deduce the default state based on MMAP, which for
> all intents and purposes is what KVM currently does.  However, implicitly
> deriving the default state based on MMAP will result in a messy ABI when
> support for in-place conversions is added.

I don't recall the details, but I faintly remember that we discussed later that with
mmap support, the default will be shared for now, and that no other flag would be
required for the time being.

We could always add a "DEFAULT_PRIVATE" flag when we realize that we would have
to change the default later.

Ackerley might remember more details.

-- 
Cheers

David / dhildenb

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by Fuad Tabba 4 months, 1 week ago

Hi David.

On Mon, 29 Sept 2025 at 09:38, David Hildenbrand <david@redhat.com> wrote:
>
> On 26.09.25 18:31, Sean Christopherson wrote:
> > Add a guest_memfd flag to allow userspace to state that the underlying
> > memory should be configured to be shared by default, and reject user page
> > faults if the guest_memfd instance's memory isn't shared by default.
> > Because KVM doesn't yet support in-place private<=>shared conversions, all
> > guest_memfd memory effectively follows the default state.
>
> I recall we discussed exactly that in the past (e.g., on April 17) in the call:
>
> "Current plan:
>   * guest_memfd creation flag to specify “all memory starts as shared”
>     * Compatible with the old behavior where all memory started as private
>     * Initially, only these can be mmap (no in-place conversion)
> "
>
> >
> > Alternatively, KVM could deduce the default state based on MMAP, which for
> > all intents and purposes is what KVM currently does.  However, implicitly
> > deriving the default state based on MMAP will result in a messy ABI when
> > support for in-place conversions is added.
>
> I don't recall the details, but I faintly remember that we discussed later that with
> mmap support, the default will be shared for now, and that no other flag would be
> required for the time being.
>
> We could always add a "DEFAULT_PRIVATE" flag when we realize that we would have
> to change the default later.

I remember discussing this. For many confidential computing usecases,
e.g., pKVM and TDX, it would make more sense for the default case to
be private, since it's the more common state, and the initial state.
It also makes sense since sharing is usually triggered by the guest.
Ensuring that the initial state is private reduces the chances of the
VMM forgetting to convert the memory to being private later on,
potentially exposing all guest memory from the get go.

I think it makes sense to clarify things now. Especially since with
memory attributes, the default attribute is
KVM_MEMORY_ATTRIBUTE_SHARED, which adds even more confusion.

Cheers,
/fuad



>
> Ackerley might remember more details.
>
> --
> Cheers
>
> David / dhildenb
>

Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

Posted by David Hildenbrand 4 months, 1 week ago

On 29.09.25 10:57, Fuad Tabba wrote:
> Hi David.
> 
> On Mon, 29 Sept 2025 at 09:38, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.09.25 18:31, Sean Christopherson wrote:
>>> Add a guest_memfd flag to allow userspace to state that the underlying
>>> memory should be configured to be shared by default, and reject user page
>>> faults if the guest_memfd instance's memory isn't shared by default.
>>> Because KVM doesn't yet support in-place private<=>shared conversions, all
>>> guest_memfd memory effectively follows the default state.
>>
>> I recall we discussed exactly that in the past (e.g., on April 17) in the call:
>>
>> "Current plan:
>>    * guest_memfd creation flag to specify “all memory starts as shared”
>>      * Compatible with the old behavior where all memory started as private
>>      * Initially, only these can be mmap (no in-place conversion)
>> "
>>
>>>
>>> Alternatively, KVM could deduce the default state based on MMAP, which for
>>> all intents and purposes is what KVM currently does.  However, implicitly
>>> deriving the default state based on MMAP will result in a messy ABI when
>>> support for in-place conversions is added.
>>
>> I don't recall the details, but I faintly remember that we discussed later that with
>> mmap support, the default will be shared for now, and that no other flag would be
>> required for the time being.
>>
>> We could always add a "DEFAULT_PRIVATE" flag when we realize that we would have
>> to change the default later.
> 
> I remember discussing this. For many confidential computing usecases,
> e.g., pKVM and TDX, it would make more sense for the default case to
> be private, since it's the more common state, and the initial state.
> It also makes sense since sharing is usually triggered by the guest.
> Ensuring that the initial state is private reduces the changes of the
> VMM forgetting to convert the memory to being private later on,
> potentially exposing all guest memory from the get go.
> 
> I think it makes sense to clarify things now. Especially since with
> memory attributes, the default attribute is
> KVM_MEMORY_ATTRIBUTE_SHARED, which adds even more confusion.

Makes sense to me then, thanks.

-- 
Cheers

David / dhildenb

[PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set
[PATCH 2/6] KVM: selftests: Stash the host page size in a global in the guest_memfd test
[PATCH 3/6] KVM: selftests: Create a new guest_memfd for each testcase
[PATCH 4/6] KVM: selftests: Add test coverage for guest_memfd without GUEST_MEMFD_FLAG_MMAP
[PATCH 5/6] KVM: selftests: Add wrappers for mmap() and munmap() to assert success
[PATCH 6/6] KVM: selftests: Verify that faulting in private guest_memfd memory fails