[RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment

Posted by Vishal Annapurve 1 year, 11 months ago
Return an error on conversion of memory ranges that are not aligned to 2M.

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
---
 arch/x86/mm/pat/set_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..6f7b06a502f4 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	int ret;
 
 	/* Should not be working on unaligned addresses */
-	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
-		addr &= PAGE_MASK;
+	if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
+		|| WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
+			"misaligned numpages: %#lx\n", numpages))
+		return -EINVAL;
 
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
-- 
2.43.0.275.g3460e3d667-goog
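
To make the check concrete, below is a minimal user-space sketch (not from the patch itself; constants mirrored from the x86 definitions): with 2M hugepages, a conversion passes only when the start address sits on a 2M boundary and numpages covers a whole number of 2M chunks (512 4K pages each).

=============
#include <stdio.h>

#define PAGE_SHIFT  12
#define HPAGE_SHIFT 21
#define HPAGE_SIZE  (1UL << HPAGE_SHIFT)
#define HPAGE_MASK  (~(HPAGE_SIZE - 1))

/* Mirrors the patch: reject ranges whose start or length is not 2M-granular. */
static int conversion_allowed(unsigned long addr, unsigned long numpages)
{
	if (addr & ~HPAGE_MASK)                      /* start not 2M-aligned */
		return 0;
	if ((numpages << PAGE_SHIFT) & ~HPAGE_MASK)  /* length not a 2M multiple */
		return 0;
	return 1;
}

int main(void)
{
	printf("%d\n", conversion_allowed(0x40000000UL, 512)); /* 1: 2M base, 2M length */
	printf("%d\n", conversion_allowed(0x40001000UL, 512)); /* 0: start off by 4K */
	printf("%d\n", conversion_allowed(0x40000000UL, 8));   /* 0: only 32K long */
	return 0;
}
=============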
Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Dave Hansen 1 year, 10 months ago
On 1/11/24 21:52, Vishal Annapurve wrote:
> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	int ret;
>  
>  	/* Should not be working on unaligned addresses */
> -	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> -		addr &= PAGE_MASK;
> +	if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> +		|| WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> +			"misaligned numpages: %#lx\n", numpages))
> +		return -EINVAL;

This series is talking about swiotlb and DMA, then this applies a
restriction to what I *thought* was a much more generic function:
__set_memory_enc_pgtable().  What prevents this function from getting
used on 4k mappings?
Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Vishal Annapurve 1 year, 10 months ago
On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/11/24 21:52, Vishal Annapurve wrote:
> > @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >       int ret;
> >
> >       /* Should not be working on unaligned addresses */
> > -     if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> > -             addr &= PAGE_MASK;
> > +     if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> > +             || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> > +                     "misaligned numpages: %#lx\n", numpages))
> > +             return -EINVAL;
>
> This series is talking about swiotlb and DMA, then this applies a
> restriction to what I *thought* was a much more generic function:
> __set_memory_enc_pgtable().  What prevents this function from getting
> used on 4k mappings?
>
>

The end goal here is to limit the conversion granularity to hugepage
sizes. SWIOTLB allocations are the major source of the unaligned
allocations (and hence conversions) that need to be fixed before this
goal can be achieved.

This change will ensure that conversion fails for unaligned ranges, as
I don't foresee a need for 4K-aligned conversions apart from DMA
allocations.
Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Jeremi Piotrowski 1 year, 10 months ago
On 01/02/2024 04:46, Vishal Annapurve wrote:
> On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 1/11/24 21:52, Vishal Annapurve wrote:
>>> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>>>       int ret;
>>>
>>>       /* Should not be working on unaligned addresses */
>>> -     if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>>> -             addr &= PAGE_MASK;
>>> +     if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
>>> +             || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
>>> +                     "misaligned numpages: %#lx\n", numpages))
>>> +             return -EINVAL;
>>
>> This series is talking about swiotlb and DMA, then this applies a
>> restriction to what I *thought* was a much more generic function:
>> __set_memory_enc_pgtable().  What prevents this function from getting
>> used on 4k mappings?
>>
>>
> 
> The end goal here is to limit the conversion granularity to hugepage
> sizes. SWIOTLB allocations are the major source of the unaligned
> allocations (and hence conversions) that need to be fixed before this
> goal can be achieved.
> 
> This change will ensure that conversion fails for unaligned ranges, as
> I don't foresee a need for 4K-aligned conversions apart from DMA
> allocations.

Hi Vishal,

This assumption is wrong. set_memory_decrypted is called from various
parts of the kernel: kexec, sev-guest, kvmclock, hyperv code. These conversions
are for non-DMA allocations that need to be done at 4KB granularity
because the data structures in question are page sized.

Thanks,
Jeremi
Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Vishal Annapurve 1 year, 10 months ago
On Thu, Feb 1, 2024 at 5:32 PM Jeremi Piotrowski
<jpiotrowski@linux.microsoft.com> wrote:
>
> On 01/02/2024 04:46, Vishal Annapurve wrote:
> > On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 1/11/24 21:52, Vishal Annapurve wrote:
> >>> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >>>       int ret;
> >>>
> >>>       /* Should not be working on unaligned addresses */
> >>> -     if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> >>> -             addr &= PAGE_MASK;
> >>> +     if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> >>> +             || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> >>> +                     "misaligned numpages: %#lx\n", numpages))
> >>> +             return -EINVAL;
> >>
> >> This series is talking about swiotlb and DMA, then this applies a
> >> restriction to what I *thought* was a much more generic function:
> >> __set_memory_enc_pgtable().  What prevents this function from getting
> >> used on 4k mappings?
> >>
> >>
> >
> > The end goal here is to limit the conversion granularity to hugepage
> > sizes. SWIOTLB allocations are the major source of the unaligned
> > allocations (and hence conversions) that need to be fixed before this
> > goal can be achieved.
> >
> > This change will ensure that conversion fails for unaligned ranges, as
> > I don't foresee a need for 4K-aligned conversions apart from DMA
> > allocations.
>
> Hi Vishal,
>
> This assumption is wrong. set_memory_decrypted is called from various
> parts of the kernel: kexec, sev-guest, kvmclock, hyperv code. These conversions
> are for non-DMA allocations that need to be done at 4KB granularity
> because the data structures in question are page sized.
>
> Thanks,
> Jeremi

Thanks, Jeremi, for pointing out these use cases.

My brief analysis of these call sites:
1) machine_kexec_64.c, realmode/init.c, kvm/mmu/mmu.c - the shared
memory allocation/conversion happens when host-side memory encryption
(CC_ATTR_HOST_MEM_ENCRYPT) is enabled.
2) kernel/kvmclock.c - the shared memory allocation can be made
2M-aligned even if less memory is needed.
3) drivers/virt/coco/sev-guest/sev-guest.c,
drivers/virt/coco/tdx-guest/tdx-guest.c - the shared memory allocation
can be made 2M-aligned even if less memory is needed.

I admit I haven't analyzed the hyperv code in the context of these
changes, but I will take a closer look to see if the memory conversion
calls there fit the category of "the shared memory allocation can be
made 2M-aligned even if less memory is needed".

Agreed, this patch should be modified to look something like the
following (subject to further changes at the call sites):

=============
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index e9b448d1b1b7..8c608d6913c4 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2132,10 +2132,15 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
        struct cpa_data cpa;
        int ret;

        /* Should not be working on unaligned addresses */
        if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
                addr &= PAGE_MASK;

+       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
+               (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
+                       || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
+                               "misaligned numpages: %#lx\n", numpages)))
+               return -EINVAL;
+
        memset(&cpa, 0, sizeof(cpa));
        cpa.vaddr = &addr;
        cpa.numpages = numpages;
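
To illustrate the kind of call-site change that items (2) and (3) above imply, here is a hypothetical sketch (not code from this series; alloc_shared_2m() is an invented name): a site that needs only a page-sized shared buffer allocates a full 2M-aligned block instead, so the conversion satisfies the new check. An order-9 buddy allocation is naturally 2M-aligned on x86-64.

=============
/*
 * Hypothetical call-site pattern, not from this series: round a
 * page-sized shared allocation up to one 2M-aligned, 2M-sized block
 * so that set_memory_decrypted() operates on a whole hugepage.
 */
static void *alloc_shared_2m(void)
{
	unsigned int order = get_order(HPAGE_SIZE);	/* order 9 on x86-64 */
	struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);

	if (!page)
		return NULL;

	/* Convert the whole 2M block, not just the page actually used. */
	if (set_memory_decrypted((unsigned long)page_address(page), 1 << order)) {
		/* Conversion state is unclear on failure; leak rather than reuse. */
		return NULL;
	}

	return page_address(page);
}
=============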
Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Jeremi Piotrowski 1 year, 10 months ago
On 02/02/2024 06:08, Vishal Annapurve wrote:
> On Thu, Feb 1, 2024 at 5:32 PM Jeremi Piotrowski
> <jpiotrowski@linux.microsoft.com> wrote:
>>
>> On 01/02/2024 04:46, Vishal Annapurve wrote:
>>> On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>>>
>>>> On 1/11/24 21:52, Vishal Annapurve wrote:
>>>>> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>>>>>       int ret;
>>>>>
>>>>>       /* Should not be working on unaligned addresses */
>>>>> -     if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>>>>> -             addr &= PAGE_MASK;
>>>>> +     if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
>>>>> +             || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
>>>>> +                     "misaligned numpages: %#lx\n", numpages))
>>>>> +             return -EINVAL;
>>>>
>>>> This series is talking about swiotlb and DMA, then this applies a
>>>> restriction to what I *thought* was a much more generic function:
>>>> __set_memory_enc_pgtable().  What prevents this function from getting
>>>> used on 4k mappings?
>>>>
>>>>
>>>
>>> The end goal here is to limit the conversion granularity to hugepage
>>> sizes. SWIOTLB allocations are the major source of the unaligned
>>> allocations (and hence conversions) that need to be fixed before this
>>> goal can be achieved.
>>>
>>> This change will ensure that conversion fails for unaligned ranges, as
>>> I don't foresee a need for 4K-aligned conversions apart from DMA
>>> allocations.
>>
>> Hi Vishal,
>>
>> This assumption is wrong. set_memory_decrypted is called from various
>> parts of the kernel: kexec, sev-guest, kvmclock, hyperv code. These conversions
>> are for non-DMA allocations that need to be done at 4KB granularity
>> because the data structures in question are page sized.
>>
>> Thanks,
>> Jeremi
> 
> Thanks, Jeremi, for pointing out these use cases.
> 
> My brief analysis of these call sites:
> 1) machine_kexec_64.c, realmode/init.c, kvm/mmu/mmu.c - the shared
> memory allocation/conversion happens when host-side memory encryption
> (CC_ATTR_HOST_MEM_ENCRYPT) is enabled.
> 2) kernel/kvmclock.c - the shared memory allocation can be made
> 2M-aligned even if less memory is needed.
> 3) drivers/virt/coco/sev-guest/sev-guest.c,
> drivers/virt/coco/tdx-guest/tdx-guest.c - the shared memory allocation
> can be made 2M-aligned even if less memory is needed.
> 
> I admit I haven't analyzed the hyperv code in the context of these
> changes, but I will take a closer look to see if the memory conversion
> calls there fit the category of "the shared memory allocation can be
> made 2M-aligned even if less memory is needed".
> 
> Agreed, this patch should be modified to look something like the
> following (subject to further changes at the call sites)

No, this patch is still built on the wrong assumptions. You're trying
to alter a generic function in the guest for the constraints of a very
specific hypervisor + host userspace + memory backend combination.
That's not right.

Is the numpages check supposed to ensure that the guest *only* toggles
visibility in chunks of 2MB? Then you're exposing more memory to the host
than the guest intends.

If you must - focus on getting swiotlb conversions to happen at the desired
granularity but don't try to force every single conversion to be >4K.

Thanks,
Jeremi


> 
> =============
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index e9b448d1b1b7..8c608d6913c4 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2132,10 +2132,15 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>         struct cpa_data cpa;
>         int ret;
> 
>         /* Should not be working on unaligned addresses */
>         if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>                 addr &= PAGE_MASK;
> 
> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
> +               (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> +                       || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> +                               "misaligned numpages: %#lx\n", numpages)))
> +               return -EINVAL;
> +
>         memset(&cpa, 0, sizeof(cpa));
>         cpa.vaddr = &addr;
>         cpa.numpages = numpages;

Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Vishal Annapurve 1 year, 10 months ago
On Fri, Feb 2, 2024 at 1:30 PM Jeremi Piotrowski
<jpiotrowski@linux.microsoft.com> wrote:
>
> On 02/02/2024 06:08, Vishal Annapurve wrote:
> > On Thu, Feb 1, 2024 at 5:32 PM Jeremi Piotrowski
> > <jpiotrowski@linux.microsoft.com> wrote:
> >>
> >> On 01/02/2024 04:46, Vishal Annapurve wrote:
> >>> On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >>>>
> >>>> On 1/11/24 21:52, Vishal Annapurve wrote:
> >>>>> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >>>>>       int ret;
> >>>>>
> >>>>>       /* Should not be working on unaligned addresses */
> >>>>> -     if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> >>>>> -             addr &= PAGE_MASK;
> >>>>> +     if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> >>>>> +             || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> >>>>> +                     "misaligned numpages: %#lx\n", numpages))
> >>>>> +             return -EINVAL;
> >>>>
> >>>> This series is talking about swiotlb and DMA, then this applies a
> >>>> restriction to what I *thought* was a much more generic function:
> >>>> __set_memory_enc_pgtable().  What prevents this function from getting
> >>>> used on 4k mappings?
> >>>>
> >>>>
> >>>
> >>> The end goal here is to limit the conversion granularity to hugepage
> >>> sizes. SWIOTLB allocations are the major source of the unaligned
> >>> allocations (and hence conversions) that need to be fixed before this
> >>> goal can be achieved.
> >>>
> >>> This change will ensure that conversion fails for unaligned ranges, as
> >>> I don't foresee a need for 4K-aligned conversions apart from DMA
> >>> allocations.
> >>
> >> Hi Vishal,
> >>
> >> This assumption is wrong. set_memory_decrypted is called from various
> >> parts of the kernel: kexec, sev-guest, kvmclock, hyperv code. These conversions
> >> are for non-DMA allocations that need to be done at 4KB granularity
> >> because the data structures in question are page sized.
> >>
> >> Thanks,
> >> Jeremi
> >
> > Thanks, Jeremi, for pointing out these use cases.
> >
> > My brief analysis of these call sites:
> > 1) machine_kexec_64.c, realmode/init.c, kvm/mmu/mmu.c - the shared
> > memory allocation/conversion happens when host-side memory encryption
> > (CC_ATTR_HOST_MEM_ENCRYPT) is enabled.
> > 2) kernel/kvmclock.c - the shared memory allocation can be made
> > 2M-aligned even if less memory is needed.
> > 3) drivers/virt/coco/sev-guest/sev-guest.c,
> > drivers/virt/coco/tdx-guest/tdx-guest.c - the shared memory allocation
> > can be made 2M-aligned even if less memory is needed.
> >
> > I admit I haven't analyzed the hyperv code in the context of these
> > changes, but I will take a closer look to see if the memory conversion
> > calls there fit the category of "the shared memory allocation can be
> > made 2M-aligned even if less memory is needed".
> >
> > Agreed, this patch should be modified to look something like the
> > following (subject to further changes at the call sites)
>
> No, this patch is still built on the wrong assumptions. You're trying
> to alter a generic function in the guest for the constraints of a very
> specific hypervisor + host userspace + memory backend combination.
> That's not right.

Agreed, I focused on KVM for these changes. I plan to spend some time
understanding the relevance of guest memfd to other hypervisors when
dealing with CoCo VMs.

>
> Is the numpages check supposed to ensure that the guest *only* toggles
> visibility in chunks of 2MB?

Yes.

> Then you're exposing more memory to the host than the guest intends.

The goal of the series is to ensure that CoCo VMs convert memory at
hugepage granularity, so such guests would need to allocate any memory
to be converted to shared at hugepage granularity. This would not
expose any guest memory that needs to remain private.

I agree that the extra memory allocated for 2M alignment is effectively
wasted. Optimizing this overhead can be pursued further depending on
how significant the wastage turns out to be. One possible approach
would be to preallocate a large enough shared region and use it to back
smaller allocations from these call sites (very similar to SWIOTLB).
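
As a rough sketch of that preallocation idea (entirely illustrative;
shared_pool_init() and shared_pool_alloc_page() are invented names):
convert one 2M region up front and hand out page-sized chunks from it,
so individual users never trigger a sub-2M conversion.

=============
/*
 * Illustrative only: a trivial bump allocator over one preconverted
 * 2M shared region.  No locking or freeing; assumes single-threaded,
 * init-time use.
 */
static unsigned long shared_pool_base;
static unsigned long shared_pool_next;

static int shared_pool_init(void)
{
	struct page *page = alloc_pages(GFP_KERNEL, get_order(HPAGE_SIZE));
	int ret;

	if (!page)
		return -ENOMEM;

	shared_pool_base = (unsigned long)page_address(page);

	/* One 2M-aligned, 2M-sized conversion covers all future users. */
	ret = set_memory_decrypted(shared_pool_base, HPAGE_SIZE >> PAGE_SHIFT);
	if (ret)
		return ret;

	shared_pool_next = shared_pool_base;
	return 0;
}

static void *shared_pool_alloc_page(void)
{
	if (shared_pool_next >= shared_pool_base + HPAGE_SIZE)
		return NULL;	/* pool exhausted */

	shared_pool_next += PAGE_SIZE;
	return (void *)(shared_pool_next - PAGE_SIZE);
}
=============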

>
> If you must - focus on getting swiotlb conversions to happen at the desired
> granularity but don't try to force every single conversion to be >4K.

If any conversion within a guest happens at 4K granularity, it will
effectively cause non-hugepage-aligned EPT/NPT entries. This series is
trying to get all private and shared memory regions hugepage-aligned to
address the problem statement.

>
> Thanks,
> Jeremi
>
>
> >
> > =============
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index e9b448d1b1b7..8c608d6913c4 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2132,10 +2132,15 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >         struct cpa_data cpa;
> >         int ret;
> >
> >         /* Should not be working on unaligned addresses */
> >         if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> >                 addr &= PAGE_MASK;
> >
> > +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
> > +               (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> > +                       || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> > +                               "misaligned numpages: %#lx\n", numpages)))
> > +               return -EINVAL;
> > +
> >         memset(&cpa, 0, sizeof(cpa));
> >         cpa.vaddr = &addr;
> >         cpa.numpages = numpages;
>
Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Dave Hansen 1 year, 10 months ago
On 2/2/24 08:22, Vishal Annapurve wrote:
>> If you must - focus on getting swiotlb conversions to happen at the desired
>> granularity but don't try to force every single conversion to be >4K.
> If any conversion within a guest happens at 4K granularity, it will
> effectively cause non-hugepage-aligned EPT/NPT entries. This series is
> trying to get all private and shared memory regions hugepage-aligned
> to address the problem statement.

Yeah, but the series is trying to do that by being awfully myopic at
this stage and without being _declared_ to be so myopic.

Take a look at all of the set_memory_decrypted() calls.  How many of
them even operate on the part of the guest address space rooted in the
memfd where splits matter?  They're not doing conversions.  They're just
setting up shared mappings in the page tables of gunk that was never
private in the first place.
Re: [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment
Posted by Vishal Annapurve 1 year, 10 months ago
On Fri, Feb 2, 2024 at 10:06 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/2/24 08:22, Vishal Annapurve wrote:
> >> If you must - focus on getting swiotlb conversions to happen at the desired
> >> granularity but don't try to force every single conversion to be >4K.
> > If any conversion within a guest happens at 4K granularity, it will
> > effectively cause non-hugepage-aligned EPT/NPT entries. This series is
> > trying to get all private and shared memory regions hugepage-aligned
> > to address the problem statement.
>
> Yeah, but the series is trying to do that by being awfully myopic at
> this stage and without being _declared_ to be so myopic.
>

Agreed. I was being overly optimistic when I mentioned the following in
the cover message:
"** This series leaves out some of the conversion sites which might not
be 2M aligned but should be easy to fix once the approach is finalized. **"

> Take a look at all of the set_memory_decrypted() calls.  How many of
> them even operate on the part of the guest address space rooted in the
> memfd where splits matter?  They're not doing conversions.  They're just
> setting up shared mappings in the page tables of gunk that was never
> private in the first place.

Thinking it over again, yes, the conversions happening outside SWIOTLB
should impact significantly fewer memory ranges. As you and Jeremi are
suggesting, it would be a big step forward if just the memory
conversions for DMA requests were aligned to hugepage sizes.