arch/x86/include/asm/kvm-x86-ops.h | 3 + arch/x86/include/asm/kvm_host.h | 10 + arch/x86/kvm/mmu.h | 3 + arch/x86/kvm/mmu/mmu.c | 57 +++- arch/x86/kvm/mmu/tdp_mmu.c | 2 - arch/x86/kvm/svm/sev.c | 531 ++++++++++++++++++++++------- arch/x86/kvm/svm/svm.c | 4 + arch/x86/kvm/svm/svm.h | 12 +- arch/x86/kvm/x86.c | 11 +- include/linux/kvm_host.h | 12 + virt/kvm/kvm_main.c | 69 ++-- virt/kvm/kvm_mm.h | 2 +- virt/kvm/pfncache.c | 2 +- 13 files changed, 556 insertions(+), 162 deletions(-)
This is a follow-up to the RFC implementation [1] that incorporates review feedback and bug fixes. See the "RFC v1" section below for a list of changes. SEV guest requires the guest's pages to be pinned in host physical memory as migration of encrypted pages is not supported. The memory encryption scheme uses the physical address of the memory being encrypted. If guest pages are moved by the host, content decrypted in the guest would be incorrect thereby corrupting guest's memory. For SEV/SEV-ES guests, the hypervisor doesn't know which pages are encrypted and when the guest is done using those pages. Hypervisor should treat all the guest pages as encrypted until they are deallocated or the guest is destroyed. While provision a pfn, make KVM aware that guest pages need to be pinned for long-term and use appropriate pin_user_pages API for these special encrypted memory regions. KVM takes the first reference and holds it until a mapping is done. Take an extra reference before KVM releases the pfn. Actual pinning management is handled by vendor code via new kvm_x86_ops hooks. MMU calls in to vendor code to pin the page on demand. Metadata of the pinning is stored in architecture specific memslot area. During the memslot freeing path and deallocation path guest pages are unpinned. Guest boot time comparison: +---------------+----------------+-------------------+ | Guest Memory | baseline | Demand Pinning + | | Size (GB) | v5.17-rc6(secs)| v5.17-rc6(secs) | +---------------+----------------+-------------------+ | 4 | 6.16 | 5.71 | +---------------+----------------+-------------------+ | 16 | 7.38 | 5.91 | +---------------+----------------+-------------------+ | 64 | 12.17 | 6.16 | +---------------+----------------+-------------------+ | 128 | 18.20 | 6.50 | +---------------+----------------+-------------------+ | 192 | 24.56 | 6.80 | +---------------+----------------+-------------------+ Changelog: RFC v1: * Use pin_user_pages API with FOLL_LONGTERM flag for pinning the encrypted guest pages. [David Hildenbrand] * Use new api kvm_for_each_memslot_in_hva_range to walk the memslot. [Maciej S. Szmigiero] * Maintain the non-mmu pinned memory and free them on destruction. [Peter Gonda] * Handle non-mmu pinned memory for intra host migration. [Peter Gonda] * Add the missing RLIMIT_MEMLOCK check. [David Hildenbrand] * Use pin_user_pages API for long term pinning of pages. [David Hildenbrand] * Flush the page before releasing it to the host system. [Mingwei Zhang] [1] https://lore.kernel.org/kvm/20220118110621.62462-1-nikunj@amd.com/ Nikunj A Dadhania (7): KVM: Introduce pinning flag to hva_to_pfn* KVM: x86/mmu: Move hugepage adjust to direct_page_fault KVM: x86/mmu: Add hook to pin PFNs on demand in MMU KVM: SVM: Add pinning metadata in the arch memslot KVM: SVM: Implement demand page pinning KVM: SEV: Carve out routine for allocation of pages KVM: Move kvm_for_each_memslot_in_hva_range() to be used in SVM Sean Christopherson (2): KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data() arch/x86/include/asm/kvm-x86-ops.h | 3 + arch/x86/include/asm/kvm_host.h | 10 + arch/x86/kvm/mmu.h | 3 + arch/x86/kvm/mmu/mmu.c | 57 +++- arch/x86/kvm/mmu/tdp_mmu.c | 2 - arch/x86/kvm/svm/sev.c | 531 ++++++++++++++++++++++------- arch/x86/kvm/svm/svm.c | 4 + arch/x86/kvm/svm/svm.h | 12 +- arch/x86/kvm/x86.c | 11 +- include/linux/kvm_host.h | 12 + virt/kvm/kvm_main.c | 69 ++-- virt/kvm/kvm_mm.h | 2 +- virt/kvm/pfncache.c | 2 +- 13 files changed, 556 insertions(+), 162 deletions(-) -- 2.32.0
On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
> This is a follow-up to the RFC implementation [1] that incorporates
> review feedback and bug fixes. See the "RFC v1" section below for a
> list of changes.
Heh, for future reference, the initial posting of a series/patch/RFC is implicitly
v1, i.e. this should be RFC v2.
> SEV guest requires the guest's pages to be pinned in host physical
> memory as migration of encrypted pages is not supported. The memory
> encryption scheme uses the physical address of the memory being
> encrypted. If guest pages are moved by the host, content decrypted in
> the guest would be incorrect thereby corrupting guest's memory.
>
> For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
> encrypted and when the guest is done using those pages. Hypervisor
> should treat all the guest pages as encrypted until they are
> deallocated or the guest is destroyed.
>
> While provision a pfn, make KVM aware that guest pages need to be
> pinned for long-term and use appropriate pin_user_pages API for these
> special encrypted memory regions. KVM takes the first reference and
> holds it until a mapping is done. Take an extra reference before KVM
> releases the pfn.
>
> Actual pinning management is handled by vendor code via new
> kvm_x86_ops hooks. MMU calls in to vendor code to pin the page on
> demand. Metadata of the pinning is stored in architecture specific
> memslot area. During the memslot freeing path and deallocation path
> guest pages are unpinned.
>
> Guest boot time comparison:
> +---------------+----------------+-------------------+
> | Guest Memory | baseline | Demand Pinning + |
> | Size (GB) | v5.17-rc6(secs)| v5.17-rc6(secs) |
> +---------------+----------------+-------------------+
> | 4 | 6.16 | 5.71 |
> +---------------+----------------+-------------------+
> | 16 | 7.38 | 5.91 |
> +---------------+----------------+-------------------+
> | 64 | 12.17 | 6.16 |
> +---------------+----------------+-------------------+
> | 128 | 18.20 | 6.50 |
> +---------------+----------------+-------------------+
> | 192 | 24.56 | 6.80 |
> +---------------+----------------+-------------------+
Let me preface this by saying I generally like the idea and especially the
performance, but...
I think we should abandon this approach in favor of committing all our resources
to fd-based private memory[*], which (if done right) will provide on-demand pinning
for "free". I would much rather get that support merged sooner than later, and use
it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
memory. That would require guest kernel support to communicate private vs. shared,
but SEV guests already "need" to do that to play nice with live migration, so it's
not a big ask, just another carrot to entice guests/customers to update their kernel
(and possibly users to update their guest firmware).
This series isn't awful by any means, but it requires poking into core flows and
further complicates paths that are already anything but simple. And things like
conditionally grabbing vCPU0 to pin pages in its MMU make me flinch. And I think
the situation would only get worse by the time all the bugs and corner cases are
ironed out. E.g. this code is wrong:
void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
if (page_maybe_dma_pinned(page))
unpin_user_page(page);
else
put_page(page);
}
}
EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
Because (a) page_maybe_dma_pinned() is susceptible to false positives (clearly
documented), and (b) even if it didn't get false positives, there's no guarantee
that _KVM_ owns a pin of the page.
It's not an impossible problem to solve, but I suspect any solution will require
either touching a lot of code or will be fragile and difficult to maintain, e.g.
by auditing all users to understand which need to pin and which don't. Even if
we _always_ pin memory for SEV guests, we'd still need to plumb the "is SEV guest"
info around.
And FWIW, my years-old idea of using a software-available SPTE bit to track pinned
pages is plagued by the same underlying issue: KVM's current management (or lack
thereof) of SEV guest memory just isn't viable long term. In all honesty, it
probably should never have been merged. We can't change the past, but we can,
and IMO should, avoid piling on more code to an approach that is fundamentally flawed.
[*] https://lore.kernel.org/all/20220310140911.50924-1-chao.p.peng@linux.intel.com
On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
>> This is a follow-up to the RFC implementation [1] that incorporates
>> review feedback and bug fixes. See the "RFC v1" section below for a
>> list of changes.
>
> Heh, for future reference, the initial posting of a series/patch/RFC is implicitly
> v1, i.e. this should be RFC v2.
Sure.
>
>> SEV guest requires the guest's pages to be pinned in host physical
>> memory as migration of encrypted pages is not supported. The memory
>> encryption scheme uses the physical address of the memory being
>> encrypted. If guest pages are moved by the host, content decrypted in
>> the guest would be incorrect thereby corrupting guest's memory.
>>
>> For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
>> encrypted and when the guest is done using those pages. Hypervisor
>> should treat all the guest pages as encrypted until they are
>> deallocated or the guest is destroyed.
>>
>> While provision a pfn, make KVM aware that guest pages need to be
>> pinned for long-term and use appropriate pin_user_pages API for these
>> special encrypted memory regions. KVM takes the first reference and
>> holds it until a mapping is done. Take an extra reference before KVM
>> releases the pfn.
>>
>> Actual pinning management is handled by vendor code via new
>> kvm_x86_ops hooks. MMU calls in to vendor code to pin the page on
>> demand. Metadata of the pinning is stored in architecture specific
>> memslot area. During the memslot freeing path and deallocation path
>> guest pages are unpinned.
>>
>> Guest boot time comparison:
>> +---------------+----------------+-------------------+
>> | Guest Memory | baseline | Demand Pinning + |
>> | Size (GB) | v5.17-rc6(secs)| v5.17-rc6(secs) |
>> +---------------+----------------+-------------------+
>> | 4 | 6.16 | 5.71 |
>> +---------------+----------------+-------------------+
>> | 16 | 7.38 | 5.91 |
>> +---------------+----------------+-------------------+
>> | 64 | 12.17 | 6.16 |
>> +---------------+----------------+-------------------+
>> | 128 | 18.20 | 6.50 |
>> +---------------+----------------+-------------------+
>> | 192 | 24.56 | 6.80 |
>> +---------------+----------------+-------------------+
>
> Let me preface this by saying I generally like the idea and especially the
> performance, but...
>
> I think we should abandon this approach in favor of committing all our resources
> to fd-based private memory[*], which (if done right) will provide on-demand pinning
> for "free".
I will give this a try for SEV, was on my todo list.
> I would much rather get that support merged sooner than later, and use
> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> memory.
> That would require guest kernel support to communicate private vs. shared,
Could you explain this in more detail? This is required for punching hole for shared pages?
> but SEV guests already "need" to do that to play nice with live migration, so it's
> not a big ask, just another carrot to entice guests/customers to update their kernel
> (and possibly users to update their guest firmware).
>
> This series isn't awful by any means, but it requires poking into core flows and
> further complicates paths that are already anything but simple. And things like
> conditionally grabbing vCPU0 to pin pages in its MMU make me flinch. And I think
> the situation would only get worse by the time all the bugs and corner cases are
> ironed out. E.g. this code is wrong:
>
> void kvm_release_pfn_clean(kvm_pfn_t pfn)
> {
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) {
> struct page *page = pfn_to_page(pfn);
>
> if (page_maybe_dma_pinned(page))
> unpin_user_page(page);
> else
> put_page(page);
> }
> }
> EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
>
> Because (a) page_maybe_dma_pinned() is susceptible to false positives (clearly
> documented), and (b) even if it didn't get false positives, there's no guarantee
> that _KVM_ owns a pin of the page.
Right, the pinning could have been done by some other subsystem.
>
> It's not an impossible problem to solve, but I suspect any solution will require
> either touching a lot of code or will be fragile and difficult to maintain, e.g.
> by auditing all users to understand which need to pin and which don't. Even if
> we _always_ pin memory for SEV guests, we'd still need to plumb the "is SEV guest"
> info around.
>
> And FWIW, my years-old idea of using a software-available SPTE bit to track pinned
> pages is plagued by the same underlying issue: KVM's current management (or lack
> thereof) of SEV guest memory just isn't viable long term. In all honesty, it
> probably should never have been merged. We can't change the past, but we can,
> and IMO should, avoid piling on more code to an approach that is fundamentally flawed.
>
> [*] https://lore.kernel.org/all/20220310140911.50924-1-chao.p.peng@linux.intel.com
>
Thanks for the valuable feedback.
Regards
Nikunj
On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote: > On 3/29/2022 2:30 AM, Sean Christopherson wrote: > > Let me preface this by saying I generally like the idea and especially the > > performance, but... > > > > I think we should abandon this approach in favor of committing all our resources > > to fd-based private memory[*], which (if done right) will provide on-demand pinning > > for "free". > > I will give this a try for SEV, was on my todo list. > > > I would much rather get that support merged sooner than later, and use > > it as a carrot for legacy SEV to get users to move over to its new APIs, with a long > > term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private > > memory. > > > That would require guest kernel support to communicate private vs. shared, > > Could you explain this in more detail? This is required for punching hole for shared pages? Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES don't provide private vs. shared information to the host (KVM) on page fault. And it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest accesses the "wrong" GPA variant, they'll silent consume/corrupt data. That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit hypercall is mandatory. SEV doesn't even have a vendor-agnostic guest/host paravirt ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be properly enlightened beyond what is required architecturally.
On 3/31/2022 1:17 AM, Sean Christopherson wrote: > On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote: >> On 3/29/2022 2:30 AM, Sean Christopherson wrote: >>> Let me preface this by saying I generally like the idea and especially the >>> performance, but... >>> >>> I think we should abandon this approach in favor of committing all our resources >>> to fd-based private memory[*], which (if done right) will provide on-demand pinning >>> for "free". >> >> I will give this a try for SEV, was on my todo list. >> >>> I would much rather get that support merged sooner than later, and use >>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long >>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private >>> memory. >> >>> That would require guest kernel support to communicate private vs. shared, >> >> Could you explain this in more detail? This is required for punching hole for shared pages? > > Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES > don't provide private vs. shared information to the host (KVM) on page fault. And > it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest > accesses the "wrong" GPA variant, they'll silent consume/corrupt data. > > That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit > hypercall is mandatory. SEV doesn't even have a vendor-agnostic guest/host paravirt > ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so > running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be > properly enlightened beyond what is required architecturally. > So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared pages to the hypervisor, this information can be used to mark page shared/private. Regards, Nikunj
On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote: > > > > On 3/31/2022 1:17 AM, Sean Christopherson wrote: > > On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote: > >> On 3/29/2022 2:30 AM, Sean Christopherson wrote: > >>> Let me preface this by saying I generally like the idea and especially the > >>> performance, but... > >>> > >>> I think we should abandon this approach in favor of committing all our resources > >>> to fd-based private memory[*], which (if done right) will provide on-demand pinning > >>> for "free". > >> > >> I will give this a try for SEV, was on my todo list. > >> > >>> I would much rather get that support merged sooner than later, and use > >>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long > >>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private > >>> memory. > >> > >>> That would require guest kernel support to communicate private vs. shared, > >> > >> Could you explain this in more detail? This is required for punching hole for shared pages? > > > > Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES > > don't provide private vs. shared information to the host (KVM) on page fault. And > > it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest > > accesses the "wrong" GPA variant, they'll silent consume/corrupt data. > > > > That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit > > hypercall is mandatory. SEV doesn't even have a vendor-agnostic guest/host paravirt > > ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so > > running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be > > properly enlightened beyond what is required architecturally. > > > > So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting > KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared > pages to the hypervisor, this information can be used to mark page shared/private. One concern here may be that the VMM doesn't know which guests have KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the guest boots does the guest tell KVM that it supports KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't we need to pin all the memory before we run the guest to be safe to be safe.
On Thu, Mar 31, 2022, Peter Gonda wrote:
> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
> > On 3/31/2022 1:17 AM, Sean Christopherson wrote:
> > > On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote:
> > >> On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> > >>> Let me preface this by saying I generally like the idea and especially the
> > >>> performance, but...
> > >>>
> > >>> I think we should abandon this approach in favor of committing all our resources
> > >>> to fd-based private memory[*], which (if done right) will provide on-demand pinning
> > >>> for "free".
> > >>
> > >> I will give this a try for SEV, was on my todo list.
> > >>
> > >>> I would much rather get that support merged sooner than later, and use
> > >>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> > >>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> > >>> memory.
> > >>
> > >>> That would require guest kernel support to communicate private vs. shared,
> > >>
> > >> Could you explain this in more detail? This is required for punching hole for shared pages?
> > >
> > > Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES
> > > don't provide private vs. shared information to the host (KVM) on page fault. And
> > > it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest
> > > accesses the "wrong" GPA variant, they'll silent consume/corrupt data.
> > >
> > > That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit
> > > hypercall is mandatory. SEV doesn't even have a vendor-agnostic guest/host paravirt
> > > ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so
> > > running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be
> > > properly enlightened beyond what is required architecturally.
> > >
> >
> > So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting
> > KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared
> > pages to the hypervisor, this information can be used to mark page shared/private.
>
> One concern here may be that the VMM doesn't know which guests have
> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the
> guest boots does the guest tell KVM that it supports
> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't we need to pin all
> the memory before we run the guest to be safe to be safe.
Yep, that's a big reason why I view purging the existing SEV memory management as
a long term goal. The other being that userspace obviously needs to be updated to
support UPM[*]. I suspect the only feasible way to enable this for SEV/SEV-ES
would be to restrict it to new VM types that have a disclaimer regarding additional
requirements.
[*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory". We've
been using it iternally for discussion and it rolls off the tongue a lot easier than
the full phrase, and is much more precise/descriptive than just "private fd".
On 4/1/2022 12:30 AM, Sean Christopherson wrote: > On Thu, Mar 31, 2022, Peter Gonda wrote: >> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote: >>> On 3/31/2022 1:17 AM, Sean Christopherson wrote: >>>> On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote: >>>>> On 3/29/2022 2:30 AM, Sean Christopherson wrote: >>>>>> Let me preface this by saying I generally like the idea and especially the >>>>>> performance, but... >>>>>> >>>>>> I think we should abandon this approach in favor of committing all our resources >>>>>> to fd-based private memory[*], which (if done right) will provide on-demand pinning >>>>>> for "free". >>>>> >>>>> I will give this a try for SEV, was on my todo list. >>>>> >>>>>> I would much rather get that support merged sooner than later, and use >>>>>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long >>>>>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private >>>>>> memory. >>>>> >>>>>> That would require guest kernel support to communicate private vs. shared, >>>>> >>>>> Could you explain this in more detail? This is required for punching hole for shared pages? >>>> >>>> Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES >>>> don't provide private vs. shared information to the host (KVM) on page fault. And >>>> it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest >>>> accesses the "wrong" GPA variant, they'll silent consume/corrupt data. >>>> >>>> That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit >>>> hypercall is mandatory. SEV doesn't even have a vendor-agnostic guest/host paravirt >>>> ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so >>>> running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be >>>> properly enlightened beyond what is required architecturally. >>>> >>> >>> So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting >>> KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared >>> pages to the hypervisor, this information can be used to mark page shared/private. >> >> One concern here may be that the VMM doesn't know which guests have >> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the >> guest boots does the guest tell KVM that it supports >> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't we need to pin all >> the memory before we run the guest to be safe to be safe. > > Yep, that's a big reason why I view purging the existing SEV memory management as > a long term goal. The other being that userspace obviously needs to be updated to > support UPM[*]. I suspect the only feasible way to enable this for SEV/SEV-ES > would be to restrict it to new VM types that have a disclaimer regarding additional > requirements. For SEV/SEV-ES could we base demand pinning on my first RFC[*]. Those patches does not touch the core KVM flow. Moreover, it does not expect any guest/firmware changes. [*] https://lore.kernel.org/kvm/20220118110621.62462-1-nikunj@amd.com/
On Fri, Apr 01, 2022, Nikunj A. Dadhania wrote: > > On 4/1/2022 12:30 AM, Sean Christopherson wrote: > > On Thu, Mar 31, 2022, Peter Gonda wrote: > >> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote: > >>> So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting > >>> KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared > >>> pages to the hypervisor, this information can be used to mark page shared/private. > >> > >> One concern here may be that the VMM doesn't know which guests have > >> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the > >> guest boots does the guest tell KVM that it supports > >> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't we need to pin all > >> the memory before we run the guest to be safe to be safe. > > > > Yep, that's a big reason why I view purging the existing SEV memory management as > > a long term goal. The other being that userspace obviously needs to be updated to > > support UPM[*]. I suspect the only feasible way to enable this for SEV/SEV-ES > > would be to restrict it to new VM types that have a disclaimer regarding additional > > requirements. > > For SEV/SEV-ES could we base demand pinning on my first RFC[*]. No, because as David pointed out, elevating the refcount is not the same as actually pinning the page. Things like NUMA balancing will still try to migrate the page, and even go so far as to zap the PTE, before bailing due to the outstanding reference. In other words, not actually pinning makes the mm subsystem less efficient. Would it functionally work? Yes. Is it acceptable KVM behavior? No. > Those patches does not touch the core KVM flow. I don't mind touching core KVM code. If this goes forward, I actually strongly prefer having the x86 MMU code handle the pinning as opposed to burying it in SEV via kvm_x86_ops. The reason I don't think it's worth pursuing this approach is because (a) we know that the current SEV/SEV-ES memory management scheme is flawed and is a deadend, and (b) this is not so trivial as we (or at least I) originally thought/hoped it would be. In other words, it's not that I think demand pinning is a bad idea, nor do I think the issues are unsolvable, it's that I think the cost of getting a workable solution, e.g. code churn, ongoing maintenance, reviewer time, etc..., far outweighs the benefits. > Moreover, it does not expect any guest/firmware changes. > > [*] https://lore.kernel.org/kvm/20220118110621.62462-1-nikunj@amd.com/
On 4/1/2022 8:24 PM, Sean Christopherson wrote: > On Fri, Apr 01, 2022, Nikunj A. Dadhania wrote: >> >> On 4/1/2022 12:30 AM, Sean Christopherson wrote: >>> On Thu, Mar 31, 2022, Peter Gonda wrote: >>>> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote: >>>>> So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting >>>>> KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared >>>>> pages to the hypervisor, this information can be used to mark page shared/private. >>>> >>>> One concern here may be that the VMM doesn't know which guests have >>>> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the >>>> guest boots does the guest tell KVM that it supports >>>> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't we need to pin all >>>> the memory before we run the guest to be safe to be safe. >>> >>> Yep, that's a big reason why I view purging the existing SEV memory management as >>> a long term goal. The other being that userspace obviously needs to be updated to >>> support UPM[*]. I suspect the only feasible way to enable this for SEV/SEV-ES >>> would be to restrict it to new VM types that have a disclaimer regarding additional >>> requirements. >> >> For SEV/SEV-ES could we base demand pinning on my first RFC[*]. > > No, because as David pointed out, elevating the refcount is not the same as actually > pinning the page. Things like NUMA balancing will still try to migrate the page, > and even go so far as to zap the PTE, before bailing due to the outstanding reference. > In other words, not actually pinning makes the mm subsystem less efficient. Would it > functionally work? Yes. Is it acceptable KVM behavior? No. > >> Those patches does not touch the core KVM flow. > > I don't mind touching core KVM code. If this goes forward, I actually strongly > prefer having the x86 MMU code handle the pinning as opposed to burying it in SEV > via kvm_x86_ops. The reason I don't think it's worth pursuing this approach is > because (a) we know that the current SEV/SEV-ES memory management scheme is flawed > and is a deadend, and (b) this is not so trivial as we (or at least I) originally > thought/hoped it would be. In other words, it's not that I think demand pinning > is a bad idea, nor do I think the issues are unsolvable, it's that I think the > cost of getting a workable solution, e.g. code churn, ongoing maintenance, reviewer > time, etc..., far outweighs the benefits. Point noted Sean, will focus on the UPM effort. Regards Nikunj
On Thu, Mar 31, 2022 at 12:01 PM Sean Christopherson <seanjc@google.com> wrote: > > On Thu, Mar 31, 2022, Peter Gonda wrote: > > On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote: > > > On 3/31/2022 1:17 AM, Sean Christopherson wrote: > > > > On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote: > > > >> On 3/29/2022 2:30 AM, Sean Christopherson wrote: > > > >>> Let me preface this by saying I generally like the idea and especially the > > > >>> performance, but... > > > >>> > > > >>> I think we should abandon this approach in favor of committing all our resources > > > >>> to fd-based private memory[*], which (if done right) will provide on-demand pinning > > > >>> for "free". > > > >> > > > >> I will give this a try for SEV, was on my todo list. > > > >> > > > >>> I would much rather get that support merged sooner than later, and use > > > >>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long > > > >>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private > > > >>> memory. > > > >> > > > >>> That would require guest kernel support to communicate private vs. shared, > > > >> > > > >> Could you explain this in more detail? This is required for punching hole for shared pages? > > > > > > > > Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES > > > > don't provide private vs. shared information to the host (KVM) on page fault. And > > > > it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest > > > > accesses the "wrong" GPA variant, they'll silent consume/corrupt data. > > > > > > > > That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit > > > > hypercall is mandatory. SEV doesn't even have a vendor-agnostic guest/host paravirt > > > > ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so > > > > running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be > > > > properly enlightened beyond what is required architecturally. > > > > > > > > > > So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting > > > KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared > > > pages to the hypervisor, this information can be used to mark page shared/private. > > > > One concern here may be that the VMM doesn't know which guests have > > KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the > > guest boots does the guest tell KVM that it supports > > KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't we need to pin all > > the memory before we run the guest to be safe to be safe. > > Yep, that's a big reason why I view purging the existing SEV memory management as > a long term goal. The other being that userspace obviously needs to be updated to > support UPM[*]. I suspect the only feasible way to enable this for SEV/SEV-ES > would be to restrict it to new VM types that have a disclaimer regarding additional > requirements. > > [*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory". We've > been using it iternally for discussion and it rolls off the tongue a lot easier than > the full phrase, and is much more precise/descriptive than just "private fd". Can we really "purge the existing SEV memory management"? This seems like a non-starter because it violates userspace API (i.e., the ability for the userspace VMM to run a guest without KVM_FEATURE_HC_MAP_GPA_RANGE). Or maybe I'm not quite following what you mean by purge. Assuming that UPM-based lazy pinning comes together via a new VM type that only supports new images based on a minimum kernel version with KVM_FEATURE_HC_MAP_GPA_RANGE, then I think this would like as follows: 1. Userspace VMM: Check SEV VM type. If type is legacy SEV type then do upfront pinning. Else, skip up front pinning. 2. KVM: I'm not sure anything special needs to happen here. For the legacy VM types, it can be configured to use legacy memslots, presumably the same as non-CVMs will be configured. For the new VM type, it should be configured to use UPM. 3. Control plane (thing creating VMs): Responsible for not allowing legacy SEV images (i.e., images without KVM_FEATURE_HC_MAP_GPA_RANGE) with the new SEV VM types that use UPM and have support for demand pinning. Sean: Did I get this right?
On Fri, Apr 01, 2022, Marc Orr wrote: > On Thu, Mar 31, 2022 at 12:01 PM Sean Christopherson <seanjc@google.com> wrote: > > Yep, that's a big reason why I view purging the existing SEV memory management as > > a long term goal. The other being that userspace obviously needs to be updated to > > support UPM[*]. I suspect the only feasible way to enable this for SEV/SEV-ES > > would be to restrict it to new VM types that have a disclaimer regarding additional > > requirements. > > > > [*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory". We've > > been using it iternally for discussion and it rolls off the tongue a lot easier than > > the full phrase, and is much more precise/descriptive than just "private fd". > > Can we really "purge the existing SEV memory management"? This seems > like a non-starter because it violates userspace API (i.e., the > ability for the userspace VMM to run a guest without > KVM_FEATURE_HC_MAP_GPA_RANGE). Or maybe I'm not quite following what > you mean by purge. I really do mean purge, but I also really do mean "long term", as in 5+ years (probably 10+ if I'm being realistic). Removing support is completely ok, as is changing the uABI, the rule is that we can't break userspace. If all users are migrated to private-fd, e.g. by carrots and/or sticks such as putting the code into maintenance-only mode, then at some point in the future there will be no users left to break and we can drop the current code and make use of private-fd mandatory for SEV/SEV-ES guests. > Assuming that UPM-based lazy pinning comes together via a new VM type > that only supports new images based on a minimum kernel version with > KVM_FEATURE_HC_MAP_GPA_RANGE, then I think this would like as follows: > > 1. Userspace VMM: Check SEV VM type. If type is legacy SEV type then > do upfront pinning. Else, skip up front pinning. Yep, if by legacy "SEV type" you mean "SEV/SEV-ES guest that isn't required to use MAP_GPA_RANGE", which I'm pretty sure you do based on #3. > 2. KVM: I'm not sure anything special needs to happen here. For the > legacy VM types, it can be configured to use legacy memslots, > presumably the same as non-CVMs will be configured. For the new VM > type, it should be configured to use UPM. Correct, for now, KVM does nothing different for SEV/SEV-ES guests. > 3. Control plane (thing creating VMs): Responsible for not allowing > legacy SEV images (i.e., images without KVM_FEATURE_HC_MAP_GPA_RANGE) > with the new SEV VM types that use UPM and have support for demand > pinning. > > Sean: Did I get this right? Yep.
On Fri, Apr 1, 2022 at 11:02 AM Sean Christopherson <seanjc@google.com> wrote: > > On Fri, Apr 01, 2022, Marc Orr wrote: > > On Thu, Mar 31, 2022 at 12:01 PM Sean Christopherson <seanjc@google.com> wrote: > > > Yep, that's a big reason why I view purging the existing SEV memory management as > > > a long term goal. The other being that userspace obviously needs to be updated to > > > support UPM[*]. I suspect the only feasible way to enable this for SEV/SEV-ES > > > would be to restrict it to new VM types that have a disclaimer regarding additional > > > requirements. > > > > > > [*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory". We've > > > been using it iternally for discussion and it rolls off the tongue a lot easier than > > > the full phrase, and is much more precise/descriptive than just "private fd". > > > > Can we really "purge the existing SEV memory management"? This seems > > like a non-starter because it violates userspace API (i.e., the > > ability for the userspace VMM to run a guest without > > KVM_FEATURE_HC_MAP_GPA_RANGE). Or maybe I'm not quite following what > > you mean by purge. > > I really do mean purge, but I also really do mean "long term", as in 5+ years > (probably 10+ if I'm being realistic). > > Removing support is completely ok, as is changing the uABI, the rule is that we > can't break userspace. If all users are migrated to private-fd, e.g. by carrots > and/or sticks such as putting the code into maintenance-only mode, then at some > point in the future there will be no users left to break and we can drop the > current code and make use of private-fd mandatory for SEV/SEV-ES guests. Ah, it makes sense now. Thanks! > > Assuming that UPM-based lazy pinning comes together via a new VM type > > that only supports new images based on a minimum kernel version with > > KVM_FEATURE_HC_MAP_GPA_RANGE, then I think this would like as follows: > > > > 1. Userspace VMM: Check SEV VM type. If type is legacy SEV type then > > do upfront pinning. Else, skip up front pinning. > > Yep, if by legacy "SEV type" you mean "SEV/SEV-ES guest that isn't required to > use MAP_GPA_RANGE", which I'm pretty sure you do based on #3. Yeah, that's exactly what I meant. > > 2. KVM: I'm not sure anything special needs to happen here. For the > > legacy VM types, it can be configured to use legacy memslots, > > presumably the same as non-CVMs will be configured. For the new VM > > type, it should be configured to use UPM. > > Correct, for now, KVM does nothing different for SEV/SEV-ES guests. > > > 3. Control plane (thing creating VMs): Responsible for not allowing > > legacy SEV images (i.e., images without KVM_FEATURE_HC_MAP_GPA_RANGE) > > with the new SEV VM types that use UPM and have support for demand > > pinning. > > > > Sean: Did I get this right? > > Yep. Thank you for verifying.
© 2016 - 2026 Red Hat, Inc.