[RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

Yan Zhao posted 21 patches 7 months, 3 weeks ago
[RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 3 weeks ago
Increase folio ref count before mapping a private page, and decrease
folio ref count after a mapping failure or successfully removing a private
page.

The folio ref count to inc/dec corresponds to the mapping/unmapping level,
ensuring the folio ref count remains balanced after entry splitting or
merging.
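
As a rough illustration of this level-based accounting, here is a sketch (not
part of the patch) that assumes x86's KVM_PAGES_PER_HPAGE(), i.e. 1, 512 and
262144 pages for the 4K, 2M and 1G levels:

/*
 * Illustrative only: a 2M mapping takes 512 refs up front; if the S-EPT
 * entry is later split, the 512 per-4K removals give exactly 512 refs
 * back, so the folio ref count stays balanced.
 */
static void ref_balance_example(struct page *page)
{
	struct folio *folio = page_folio(page);
	unsigned long i;

	folio_ref_add(folio, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));	/* +512 */

	/* ... the 2M entry is split into 512 4K entries, then removed ... */

	for (i = 0; i < KVM_PAGES_PER_HPAGE(PG_LEVEL_2M); i++)
		folio_put_refs(folio, KVM_PAGES_PER_HPAGE(PG_LEVEL_4K));	/* -1 each */

	/* Net ref count change: 0. */
}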

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 355b21fc169f..e23dce59fc72 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
-static void tdx_unpin(struct kvm *kvm, struct page *page)
+static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
 {
-	put_page(page);
+	folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
 }
 
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
@@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 
 	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
 	if (unlikely(tdx_operand_busy(err))) {
-		tdx_unpin(kvm, page);
+		tdx_unpin(kvm, page, level);
 		return -EBUSY;
 	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
-		tdx_unpin(kvm, page);
+		tdx_unpin(kvm, page, level);
 		return -EIO;
 	}
 
@@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
 	 * migration.  Until guest_memfd supports page migration, prevent page
 	 * migration.
-	 * TODO: Once guest_memfd introduces callback on page migration,
-	 * implement it and remove get_page/put_page().
+	 * TODO: To support in-place conversion in gmem in the future, remove
+	 * folio_ref_add()/folio_put_refs(). Only increase the folio ref count
+	 * when there are errors while removing private pages.
 	 */
-	get_page(page);
+	folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
 
 	/*
 	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
@@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 
 	tdx_clear_page(page, level);
-	tdx_unpin(kvm, page);
+	tdx_unpin(kvm, page, level);
 	return 0;
 }
 
@@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
 	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
 		atomic64_dec(&kvm_tdx->nr_premapped);
-		tdx_unpin(kvm, page);
+		tdx_unpin(kvm, page, level);
 		return 0;
 	}
 
-- 
2.43.2
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 3 weeks ago
On Wed, Apr 23, 2025 at 8:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Increase folio ref count before mapping a private page, and decrease
> folio ref count after a mapping failure or successfully removing a private
> page.
>
> The folio ref count to inc/dec corresponds to the mapping/unmapping level,
> ensuring the folio ref count remains balanced after entry splitting or
> merging.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 355b21fc169f..e23dce59fc72 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>
> -static void tdx_unpin(struct kvm *kvm, struct page *page)
> +static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
>  {
> -       put_page(page);
> +       folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
>  }
>
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> @@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>
>         err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
>         if (unlikely(tdx_operand_busy(err))) {
> -               tdx_unpin(kvm, page);
> +               tdx_unpin(kvm, page, level);
>                 return -EBUSY;
>         }
>
>         if (KVM_BUG_ON(err, kvm)) {
>                 pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> -               tdx_unpin(kvm, page);
> +               tdx_unpin(kvm, page, level);
>                 return -EIO;
>         }
>
> @@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>          * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
>          * migration.  Until guest_memfd supports page migration, prevent page
>          * migration.
> -        * TODO: Once guest_memfd introduces callback on page migration,
> -        * implement it and remove get_page/put_page().
> +        * TODO: To support in-place-conversion in gmem in futre, remove
> +        * folio_ref_add()/folio_put_refs().

With necessary infrastructure in guest_memfd [1] to prevent page
migration, is it necessary to acquire extra folio refcounts? If not,
why not just cleanup this logic now?

[1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/virt/kvm/guest_memfd.c?h=kvm-coco-queue#n441

> Only increase the folio ref count
> +        * when there're errors during removing private pages.
>          */
> -       get_page(page);
> +       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
>
>         /*
>          * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> @@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>                 return -EIO;
>
>         tdx_clear_page(page, level);
> -       tdx_unpin(kvm, page);
> +       tdx_unpin(kvm, page, level);
>         return 0;
>  }
>
> @@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>         if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
>             !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
>                 atomic64_dec(&kvm_tdx->nr_premapped);
> -               tdx_unpin(kvm, page);
> +               tdx_unpin(kvm, page, level);
>                 return 0;
>         }
>
> --
> 2.43.2
>
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 3 weeks ago
On Mon, Apr 28, 2025 at 05:17:16PM -0700, Vishal Annapurve wrote:
> On Wed, Apr 23, 2025 at 8:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > Increase folio ref count before mapping a private page, and decrease
> > folio ref count after a mapping failure or successfully removing a private
> > page.
> >
> > The folio ref count to inc/dec corresponds to the mapping/unmapping level,
> > ensuring the folio ref count remains balanced after entry splitting or
> > merging.
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 355b21fc169f..e23dce59fc72 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> >         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> >  }
> >
> > -static void tdx_unpin(struct kvm *kvm, struct page *page)
> > +static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
> >  {
> > -       put_page(page);
> > +       folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> >  }
> >
> >  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > @@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> >
> >         err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> >         if (unlikely(tdx_operand_busy(err))) {
> > -               tdx_unpin(kvm, page);
> > +               tdx_unpin(kvm, page, level);
> >                 return -EBUSY;
> >         }
> >
> >         if (KVM_BUG_ON(err, kvm)) {
> >                 pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> > -               tdx_unpin(kvm, page);
> > +               tdx_unpin(kvm, page, level);
> >                 return -EIO;
> >         }
> >
> > @@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >          * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> >          * migration.  Until guest_memfd supports page migration, prevent page
> >          * migration.
> > -        * TODO: Once guest_memfd introduces callback on page migration,
> > -        * implement it and remove get_page/put_page().
> > +        * TODO: To support in-place-conversion in gmem in futre, remove
> > +        * folio_ref_add()/folio_put_refs().
> 
> With necessary infrastructure in guest_memfd [1] to prevent page
> migration, is it necessary to acquire extra folio refcounts? If not,
> why not just cleanup this logic now?
Though the old comment says the ref is acquired to prevent page migration, the
other reason is to prevent the folio from being returned to the OS until it has
been successfully removed from TDX.

If there's an error while removing or reclaiming a folio from TDX, such as a
failure in tdh_mem_page_remove()/tdh_phymem_page_wbinvd_hkid() or
tdh_phymem_page_reclaim(), it is important for TDX to keep holding a refcount
on the page.

So, we plan to remove folio_ref_add()/folio_put_refs() in the future, only
invoking folio_ref_add() in the event of a removal failure.
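
For concreteness, a minimal sketch of that plan (hypothetical, not code from
this series; it reuses the tdh_mem_page_remove() call pattern already present
in tdx_sept_drop_private_spte()): drop the up-front ref add/put pairs and take
refs only when removal fails, so the folio cannot be handed back to the OS
while the TDX module may still own it.

	err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
				  &level_state);
	if (KVM_BUG_ON(err, kvm)) {
		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
		/*
		 * Removal failed: the TDX module may still reference the
		 * page, so pin it to keep it from being returned to the OS.
		 */
		folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
		return -EIO;
	}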

> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/virt/kvm/guest_memfd.c?h=kvm-coco-queue#n441
> 
> > Only increase the folio ref count
> > +        * when there're errors during removing private pages.
> >          */
> > -       get_page(page);
> > +       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> >
> >         /*
> >          * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > @@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >                 return -EIO;
> >
> >         tdx_clear_page(page, level);
> > -       tdx_unpin(kvm, page);
> > +       tdx_unpin(kvm, page, level);
> >         return 0;
> >  }
> >
> > @@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> >         if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> >             !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> >                 atomic64_dec(&kvm_tdx->nr_premapped);
> > -               tdx_unpin(kvm, page);
> > +               tdx_unpin(kvm, page, level);
> >                 return 0;
> >         }
> >
> > --
> > 2.43.2
> >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 2 weeks ago
On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Apr 28, 2025 at 05:17:16PM -0700, Vishal Annapurve wrote:
> > On Wed, Apr 23, 2025 at 8:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > Increase folio ref count before mapping a private page, and decrease
> > > folio ref count after a mapping failure or successfully removing a private
> > > page.
> > >
> > > The folio ref count to inc/dec corresponds to the mapping/unmapping level,
> > > ensuring the folio ref count remains balanced after entry splitting or
> > > merging.
> > >
> > > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
> > >  1 file changed, 10 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 355b21fc169f..e23dce59fc72 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> > >         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> > >  }
> > >
> > > -static void tdx_unpin(struct kvm *kvm, struct page *page)
> > > +static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
> > >  {
> > > -       put_page(page);
> > > +       folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> > >  }
> > >
> > >  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > > @@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > >
> > >         err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> > >         if (unlikely(tdx_operand_busy(err))) {
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return -EBUSY;
> > >         }
> > >
> > >         if (KVM_BUG_ON(err, kvm)) {
> > >                 pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return -EIO;
> > >         }
> > >
> > > @@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > >          * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> > >          * migration.  Until guest_memfd supports page migration, prevent page
> > >          * migration.
> > > -        * TODO: Once guest_memfd introduces callback on page migration,
> > > -        * implement it and remove get_page/put_page().
> > > +        * TODO: To support in-place-conversion in gmem in futre, remove
> > > +        * folio_ref_add()/folio_put_refs().
> >
> > With necessary infrastructure in guest_memfd [1] to prevent page
> > migration, is it necessary to acquire extra folio refcounts? If not,
> > why not just cleanup this logic now?
> Though the old comment says acquiring the lock is for page migration, the other
> reason is to prevent the folio from being returned to the OS until it has been
> successfully removed from TDX.
>
> If there's an error during the removal or reclaiming of a folio from TDX, such
> as a failure in tdh_mem_page_remove()/tdh_phymem_page_wbinvd_hkid() or
> tdh_phymem_page_reclaim(), it is important to hold the page refcount within TDX.
>
> So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> folio_ref_add() in the event of a removal failure.

In my opinion, the above scheme can be deployed with this series
itself. guest_memfd will not take away memory from TDX VMs without an
invalidation. folio_ref_add() will not work for memory not backed by
page structs, but that problem can be solved in the future, possibly by
notifying guest_memfd of certain ranges being in use even after
invalidation completes.


>
> > [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/virt/kvm/guest_memfd.c?h=kvm-coco-queue#n441
> >
> > > Only increase the folio ref count
> > > +        * when there're errors during removing private pages.
> > >          */
> > > -       get_page(page);
> > > +       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> > >
> > >         /*
> > >          * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > > @@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > >                 return -EIO;
> > >
> > >         tdx_clear_page(page, level);
> > > -       tdx_unpin(kvm, page);
> > > +       tdx_unpin(kvm, page, level);
> > >         return 0;
> > >  }
> > >
> > > @@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> > >         if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> > >             !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> > >                 atomic64_dec(&kvm_tdx->nr_premapped);
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return 0;
> > >         }
> > >
> > > --
> > > 2.43.2
> > >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 2 weeks ago
Sorry for the late reply, I was on leave last week.

On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > folio_ref_add() in the event of a removal failure.
> 
> In my opinion, the above scheme can be deployed with this series
> itself. guest_memfd will not take away memory from TDX VMs without an
I initially intended to add a separate patch at the end of this series to
implement invoking folio_ref_add() only upon a removal failure. However, I
decided against it since it's not a must before guest_memfd supports in-place
conversion.

We can include it in the next version if you think it's better.

> invalidation. folio_ref_add() will not work for memory not backed by
> page structs, but that problem can be solved in future possibly by
With current TDX code, all memory must be backed by a page struct.
Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
than a pfn.

> notifying guest_memfd of certain ranges being in use even after
> invalidation completes.
A curious question:
To support memory not backed by page structs in the future, is there any
counterpart to the page struct to hold the ref count and map count?

Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 1 week ago
On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Sorry for the late reply, I was on leave last week.
>
> On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > folio_ref_add() in the event of a removal failure.
> >
> > In my opinion, the above scheme can be deployed with this series
> > itself. guest_memfd will not take away memory from TDX VMs without an
> I initially intended to add a separate patch at the end of this series to
> implement invoking folio_ref_add() only upon a removal failure. However, I
> decided against it since it's not a must before guest_memfd supports in-place
> conversion.
>
> We can include it in the next version If you think it's better.

Ackerley is planning to send out a series for 1G Hugetlb support with
guest memfd soon, hopefully this week. Plus I don't see any reason to
hold extra refcounts in the TDX stack, so it would be good to clean up this
logic.

>
> > invalidation. folio_ref_add() will not work for memory not backed by
> > page structs, but that problem can be solved in future possibly by
> With current TDX code, all memory must be backed by a page struct.
> Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> than a pfn.
>
> > notifying guest_memfd of certain ranges being in use even after
> > invalidation completes.
> A curious question:
> To support memory not backed by page structs in future, is there any counterpart
> to the page struct to hold ref count and map count?
>

I imagine the needed support will match the semantics of VM_PFNMAP [1]
memory. There is no need to maintain refcounts/map counts for such
physical memory ranges, as all users will be notified when mappings are
changed/removed.

Any guest_memfd range updates will result in invalidations/updates of
userspace, guest, IOMMU, or any other page tables referring to
guest_memfd-backed pfns. This story will become clearer once support
for a PFN range allocator backing guest_memfd starts getting discussed.

[1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 1 week ago
On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > Sorry for the late reply, I was on leave last week.
> >
> > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > folio_ref_add() in the event of a removal failure.
> > >
> > > In my opinion, the above scheme can be deployed with this series
> > > itself. guest_memfd will not take away memory from TDX VMs without an
> > I initially intended to add a separate patch at the end of this series to
> > implement invoking folio_ref_add() only upon a removal failure. However, I
> > decided against it since it's not a must before guest_memfd supports in-place
> > conversion.
> >
> > We can include it in the next version If you think it's better.
> 
> Ackerley is planning to send out a series for 1G Hugetlb support with
> guest memfd soon, hopefully this week. Plus I don't see any reason to
> hold extra refcounts in TDX stack so it would be good to clean up this
> logic.
> 
> >
> > > invalidation. folio_ref_add() will not work for memory not backed by
> > > page structs, but that problem can be solved in future possibly by
> > With current TDX code, all memory must be backed by a page struct.
> > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > than a pfn.
> >
> > > notifying guest_memfd of certain ranges being in use even after
> > > invalidation completes.
> > A curious question:
> > To support memory not backed by page structs in future, is there any counterpart
> > to the page struct to hold ref count and map count?
> >
> 
> I imagine the needed support will match similar semantics as VM_PFNMAP
> [1] memory. No need to maintain refcounts/map counts for such physical
> memory ranges as all users will be notified when mappings are
> changed/removed.
So, it's possible to map such memory in both shared and private EPT
simultaneously?


> Any guest_memfd range updates will result in invalidations/updates of
> userspace, guest, IOMMU or any other page tables referring to
> guest_memfd backed pfns. This story will become clearer once the
> support for PFN range allocator for backing guest_memfd starts getting
> discussed.
Ok. It is indeed unclear right now how that kind of memory would be supported.

Up to now, we don't anticipate that TDX will allow any mapping of VM_PFNMAP
memory into private EPT until TDX Connect.
And even in that scenario, the memory is only for private MMIO, so the backend
driver is the VFIO PCI driver rather than guest_memfd.


> [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 1 week ago
On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > Sorry for the late reply, I was on leave last week.
> > >
> > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > folio_ref_add() in the event of a removal failure.
> > > >
> > > > In my opinion, the above scheme can be deployed with this series
> > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > I initially intended to add a separate patch at the end of this series to
> > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > decided against it since it's not a must before guest_memfd supports in-place
> > > conversion.
> > >
> > > We can include it in the next version If you think it's better.
> >
> > Ackerley is planning to send out a series for 1G Hugetlb support with
> > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > hold extra refcounts in TDX stack so it would be good to clean up this
> > logic.
> >
> > >
> > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > page structs, but that problem can be solved in future possibly by
> > > With current TDX code, all memory must be backed by a page struct.
> > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > than a pfn.
> > >
> > > > notifying guest_memfd of certain ranges being in use even after
> > > > invalidation completes.
> > > A curious question:
> > > To support memory not backed by page structs in future, is there any counterpart
> > > to the page struct to hold ref count and map count?
> > >
> >
> > I imagine the needed support will match similar semantics as VM_PFNMAP
> > [1] memory. No need to maintain refcounts/map counts for such physical
> > memory ranges as all users will be notified when mappings are
> > changed/removed.
> So, it's possible to map such memory in both shared and private EPT
> simultaneously?

No, guest_memfd will still ensure that userspace can only fault in
shared memory regions in order to support CoCo VM use cases.

>
>
> > Any guest_memfd range updates will result in invalidations/updates of
> > userspace, guest, IOMMU or any other page tables referring to
> > guest_memfd backed pfns. This story will become clearer once the
> > support for PFN range allocator for backing guest_memfd starts getting
> > discussed.
> Ok. It is indeed unclear right now to support such kind of memory.
>
> Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> into private EPT until TDX connect.

There is a plan to use VM_PFNMAP memory for all guest_memfd
shared/private ranges, orthogonal to the TDX Connect use case. With TDX
Connect/SEV-TIO, the major difference would be that guest_memfd private
ranges will be mapped into IOMMU page tables.

Irrespective of whether/when VM_PFNMAP memory support lands, there
have been discussions on not using page structs for private memory
ranges altogether [1], even with the hugetlb allocator, which will
simplify the seamless merge/split story for private hugepages to support
memory conversion. So I think the general direction we should head
towards is not relying on refcounts for guest_memfd private ranges
and/or page structs altogether.

I think the series [2] that makes KVM work better with PFNMAP'd physical
memory is headed in the right direction of not assuming page-struct-backed
memory ranges for guest_memfd as well.

[1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
[2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/

> And even in that scenario, the memory is only for private MMIO, so the backend
> driver is VFIO pci driver rather than guest_memfd.

Not necessarily. As I mentioned above, guest_memfd ranges will be backed
by VM_PFNMAP memory.

>
>
> > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 1 week ago
On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > Sorry for the late reply, I was on leave last week.
> > > >
> > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > folio_ref_add() in the event of a removal failure.
> > > > >
> > > > > In my opinion, the above scheme can be deployed with this series
> > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > I initially intended to add a separate patch at the end of this series to
> > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > conversion.
> > > >
> > > > We can include it in the next version If you think it's better.
> > >
> > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > logic.
> > >
> > > >
> > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > page structs, but that problem can be solved in future possibly by
> > > > With current TDX code, all memory must be backed by a page struct.
> > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > than a pfn.
> > > >
> > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > invalidation completes.
> > > > A curious question:
> > > > To support memory not backed by page structs in future, is there any counterpart
> > > > to the page struct to hold ref count and map count?
> > > >
> > >
> > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > memory ranges as all users will be notified when mappings are
> > > changed/removed.
> > So, it's possible to map such memory in both shared and private EPT
> > simultaneously?
> 
> No, guest_memfd will still ensure that userspace can only fault in
> shared memory regions in order to support CoCo VM usecases.
Before guest_memfd converts a PFN from shared to private, how does it ensure
there are no shared mappings? For example, in [1], it uses the folio reference
count to ensure that.

Or do you believe that by eliminating the struct page, there would be no
GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
response to a guest_memfd invalidation notification?

As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
no need to register an mmu notifier. So why must users like VFIO register for
guest_memfd invalidation notifications?

Besides, how would guest_memfd handle potential unmap failures? For example,
what would prevent converting a private PFN to shared if there are errors when
TDX unmaps a private PFN, or if a device refuses to stop DMAing to a PFN?

Currently, guest_memfd can rely on the page ref count to avoid re-assigning a
PFN that fails to be unmapped.


[1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
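
(To illustrate that last point, a hypothetical sketch, not existing guest_memfd
code: a shared->private conversion could treat unexpected extra folio
references, e.g. one left behind by a failed TDX unmap, as "page still in use"
and refuse the conversion. "expected_refs" below stands for whatever refs
guest_memfd itself is known to hold.)

	if (folio_ref_count(folio) != expected_refs)
		/* Extra refs: someone (e.g. TDX) may still use the page. */
		return -EAGAIN;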


> >
> >
> > > Any guest_memfd range updates will result in invalidations/updates of
> > > userspace, guest, IOMMU or any other page tables referring to
> > > guest_memfd backed pfns. This story will become clearer once the
> > > support for PFN range allocator for backing guest_memfd starts getting
> > > discussed.
> > Ok. It is indeed unclear right now to support such kind of memory.
> >
> > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > into private EPT until TDX connect.
> 
> There is a plan to use VM_PFNMAP memory for all of guest_memfd
> shared/private ranges orthogonal to TDX connect usecase. With TDX
> connect/Sev TIO, major difference would be that guest_memfd private
> ranges will be mapped into IOMMU page tables.
> 
> Irrespective of whether/when VM_PFNMAP memory support lands, there
> have been discussions on not using page structs for private memory
> ranges altogether [1] even with hugetlb allocator, which will simplify
> seamless merge/split story for private hugepages to support memory
> conversion. So I think the general direction we should head towards is
> not relying on refcounts for guest_memfd private ranges and/or page
> structs altogether.
It's fine to use PFNs, but I wonder whether there are counterparts of struct
page to keep all the necessary info.

 
> I think the series [2] to work better with PFNMAP'd physical memory in
> KVM is in the very right direction of not assuming page struct backed
> memory ranges for guest_memfd as well.
Note: Currently, VM_PFNMAP is usually used together with the VM_IO flag. In
KVM, hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
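
(For reference, the check being referred to is roughly the following; this is a
paraphrase of the hva_to_pfn() path in virt/kvm/kvm_main.c, details vary by
kernel version, and the helper name is made up for illustration.)

static bool hva_uses_pfn_remapped_path(struct vm_area_struct *vma)
{
	/* Only VMAs with VM_IO or VM_PFNMAP take the remapped-PFN lookup. */
	return vma && (vma->vm_flags & (VM_IO | VM_PFNMAP));
}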


> [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> 
> > And even in that scenario, the memory is only for private MMIO, so the backend
> > driver is VFIO pci driver rather than guest_memfd.
> 
> Not necessary. As I mentioned above guest_memfd ranges will be backed
> by VM_PFNMAP memory.
> 
> >
> >
> > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 1 week ago
On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > Sorry for the late reply, I was on leave last week.
> > > > >
> > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > >
> > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > I initially intended to add a separate patch at the end of this series to
> > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > conversion.
> > > > >
> > > > > We can include it in the next version If you think it's better.
> > > >
> > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > logic.
> > > >
> > > > >
> > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > page structs, but that problem can be solved in future possibly by
> > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > than a pfn.
> > > > >
> > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > invalidation completes.
> > > > > A curious question:
> > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > to the page struct to hold ref count and map count?
> > > > >
> > > >
> > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > memory ranges as all users will be notified when mappings are
> > > > changed/removed.
> > > So, it's possible to map such memory in both shared and private EPT
> > > simultaneously?
> >
> > No, guest_memfd will still ensure that userspace can only fault in
> > shared memory regions in order to support CoCo VM usecases.
> Before guest_memfd converts a PFN from shared to private, how does it ensure
> there are no shared mappings? e.g., in [1], it uses the folio reference count
> to ensure that.
>
> Or do you believe that by eliminating the struct page, there would be no
> GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> response to a guest_memfd invalidation notification?

Yes.

>
> As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> no need to register mmu notifier. So why users like VFIO must register
> guest_memfd invalidation notification?

VM_PFNMAP'd memory can't be long-term pinned, so users of such memory
ranges will have to adopt mechanisms to get notified. I think it would
be easy to persuade new users of guest_memfd to follow this scheme.
Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
hugepage support already needs the stance of "Guest memfd owns all
long-term refcounts on private memory", as discussed at LPC [1].

[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
(slide 12)

>
> Besides, how would guest_memfd handle potential unmap failures? e.g. what
> happens to prevent converting a private PFN to shared if there are errors when
> TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.

Users will have to signal such failures via the invalidation callback
results or other appropriate mechanisms. guest_memfd can relay the
failures up the call chain to userspace.

>
> Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> that fails to be unmapped.
>
>
> [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
>
>
> > >
> > >
> > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > userspace, guest, IOMMU or any other page tables referring to
> > > > guest_memfd backed pfns. This story will become clearer once the
> > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > discussed.
> > > Ok. It is indeed unclear right now to support such kind of memory.
> > >
> > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > into private EPT until TDX connect.
> >
> > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > connect/Sev TIO, major difference would be that guest_memfd private
> > ranges will be mapped into IOMMU page tables.
> >
> > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > have been discussions on not using page structs for private memory
> > ranges altogether [1] even with hugetlb allocator, which will simplify
> > seamless merge/split story for private hugepages to support memory
> > conversion. So I think the general direction we should head towards is
> > not relying on refcounts for guest_memfd private ranges and/or page
> > structs altogether.
> It's fine to use PFN, but I wonder if there're counterparts of struct page to
> keep all necessary info.
>

The story will become clearer once VM_PFNMAP'd memory support starts
getting discussed. In the case of guest_memfd, there is flexibility to
store metadata for physical ranges within guest_memfd itself, just like
shareability tracking.

>
> > I think the series [2] to work better with PFNMAP'd physical memory in
> > KVM is in the very right direction of not assuming page struct backed
> > memory ranges for guest_memfd as well.
> Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
>
>
> > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> >
> > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > driver is VFIO pci driver rather than guest_memfd.
> >
> > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > by VM_PFNMAP memory.
> >
> > >
> > >
> > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 1 week ago
On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > Sorry for the late reply, I was on leave last week.
> > > > > >
> > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > >
> > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > conversion.
> > > > > >
> > > > > > We can include it in the next version If you think it's better.
> > > > >
> > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > logic.
> > > > >
> > > > > >
> > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > than a pfn.
> > > > > >
> > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > invalidation completes.
> > > > > > A curious question:
> > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > to the page struct to hold ref count and map count?
> > > > > >
> > > > >
> > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > memory ranges as all users will be notified when mappings are
> > > > > changed/removed.
> > > > So, it's possible to map such memory in both shared and private EPT
> > > > simultaneously?
> > >
> > > No, guest_memfd will still ensure that userspace can only fault in
> > > shared memory regions in order to support CoCo VM usecases.
> > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > to ensure that.
> >
> > Or do you believe that by eliminating the struct page, there would be no
> > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > response to a guest_memfd invalidation notification?
> 
> Yes.
> 
> >
> > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > no need to register mmu notifier. So why users like VFIO must register
> > guest_memfd invalidation notification?
> 
> VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> ranges will have to adopt mechanisms to get notified. I think it would
Hmm, current VFIO does not register any notifier for VM_PFNMAP'd memory.

> be easy to pursue new users of guest_memfd to follow this scheme.
> Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> hugepage support already needs the stance of: "Guest memfd owns all
> long-term refcounts on private memory" as discussed at LPC [1].
> 
> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> (slide 12)
> 
> >
> > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > happens to prevent converting a private PFN to shared if there are errors when
> > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> 
> Users will have to signal such failures via the invalidation callback
> results or other appropriate mechanisms. guest_memfd can relay the
> failures up the call chain to the userspace.
AFAIK, operations that perform the actual unmapping do not allow failure, e.g.
kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().

That's why we rely on increasing the folio ref count to reflect failures, which
are due to unexpected SEAMCALL errors.

> > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > that fails to be unmapped.
> >
> >
> > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> >
> >
> > > >
> > > >
> > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > discussed.
> > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > >
> > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > into private EPT until TDX connect.
> > >
> > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > connect/Sev TIO, major difference would be that guest_memfd private
> > > ranges will be mapped into IOMMU page tables.
> > >
> > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > have been discussions on not using page structs for private memory
> > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > seamless merge/split story for private hugepages to support memory
> > > conversion. So I think the general direction we should head towards is
> > > not relying on refcounts for guest_memfd private ranges and/or page
> > > structs altogether.
> > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > keep all necessary info.
> >
> 
> Story will become clearer once VM_PFNMAP'd memory support starts
> getting discussed. In case of guest_memfd, there is flexibility to
> store metadata for physical ranges within guest_memfd just like
> shareability tracking.
Ok.

> >
> > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > KVM is in the very right direction of not assuming page struct backed
> > > memory ranges for guest_memfd as well.
> > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> >
> >
> > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > >
> > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > driver is VFIO pci driver rather than guest_memfd.
> > >
> > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > by VM_PFNMAP memory.
> > >
> > > >
> > > >
> > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> 
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 1 week ago
On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >
> > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > >
> > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > >
> > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > conversion.
> > > > > > >
> > > > > > > We can include it in the next version If you think it's better.
> > > > > >
> > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > logic.
> > > > > >
> > > > > > >
> > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > than a pfn.
> > > > > > >
> > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > invalidation completes.
> > > > > > > A curious question:
> > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > to the page struct to hold ref count and map count?
> > > > > > >
> > > > > >
> > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > memory ranges as all users will be notified when mappings are
> > > > > > changed/removed.
> > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > simultaneously?
> > > >
> > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > shared memory regions in order to support CoCo VM usecases.
> > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > to ensure that.
> > >
> > > Or do you believe that by eliminating the struct page, there would be no
> > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > response to a guest_memfd invalidation notification?
> >
> > Yes.
> >
> > >
> > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > no need to register mmu notifier. So why users like VFIO must register
> > > guest_memfd invalidation notification?
> >
> > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > ranges will have to adopt mechanisms to get notified. I think it would
> Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.

I don't completely understand how VM_PFNMAP'd memory is used today for
VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
for normal memory backed by pfnmap is yet to materialize.

>
> > be easy to pursue new users of guest_memfd to follow this scheme.
> > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > hugepage support already needs the stance of: "Guest memfd owns all
> > long-term refcounts on private memory" as discussed at LPC [1].
> >
> > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > (slide 12)
> >
> > >
> > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > happens to prevent converting a private PFN to shared if there are errors when
> > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> >
> > Users will have to signal such failures via the invalidation callback
> > results or other appropriate mechanisms. guest_memfd can relay the
> > failures up the call chain to the userspace.
> AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().

Very likely because these operations simply don't fail.

>
> That's why we rely on increasing folio ref count to reflect failure, which are
> due to unexpected SEAMCALL errors.

The TDX stack is adding a scenario where invalidation can fail; a cleaner
solution would be to propagate the result as an invalidation failure.
Another option is to notify guest_memfd out of band to convey the
ranges that failed invalidation.

With in-place conversion supported, even if the refcount is raised for
such pages, they can still get used by the host if guest_memfd is
unaware that the invalidation failed.

>
> > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > that fails to be unmapped.
> > >
> > >
> > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > >
> > >
> > > > >
> > > > >
> > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > discussed.
> > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > >
> > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > into private EPT until TDX connect.
> > > >
> > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > ranges will be mapped into IOMMU page tables.
> > > >
> > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > have been discussions on not using page structs for private memory
> > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > seamless merge/split story for private hugepages to support memory
> > > > conversion. So I think the general direction we should head towards is
> > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > structs altogether.
> > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > keep all necessary info.
> > >
> >
> > Story will become clearer once VM_PFNMAP'd memory support starts
> > getting discussed. In case of guest_memfd, there is flexibility to
> > store metadata for physical ranges within guest_memfd just like
> > shareability tracking.
> Ok.
>
> > >
> > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > KVM is in the very right direction of not assuming page struct backed
> > > > memory ranges for guest_memfd as well.
> > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > >
> > >
> > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > >
> > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > driver is VFIO pci driver rather than guest_memfd.
> > > >
> > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > by VM_PFNMAP memory.
> > > >
> > > > >
> > > > >
> > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 1 week ago
On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote:
> On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > >
> > > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > > >
> > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > > >
> > > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > > conversion.
> > > > > > > >
> > > > > > > > We can include it in the next version If you think it's better.
> > > > > > >
> > > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > > logic.
> > > > > > >
> > > > > > > >
> > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > > than a pfn.
> > > > > > > >
> > > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > > invalidation completes.
> > > > > > > > A curious question:
> > > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > > to the page struct to hold ref count and map count?
> > > > > > > >
> > > > > > >
> > > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > > memory ranges as all users will be notified when mappings are
> > > > > > > changed/removed.
> > > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > > simultaneously?
> > > > >
> > > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > > shared memory regions in order to support CoCo VM usecases.
> > > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > > to ensure that.
> > > >
> > > > Or do you believe that by eliminating the struct page, there would be no
> > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > > response to a guest_memfd invalidation notification?
> > >
> > > Yes.
> > >
> > > >
> > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > > no need to register mmu notifier. So why users like VFIO must register
> > > > guest_memfd invalidation notification?
> > >
> > > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > > ranges will have to adopt mechanisms to get notified. I think it would
> > Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.
> 
> I don't completely understand how VM_PFNMAP'd memory is used today for
> VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
> for normal memory backed by pfnmap is yet to materialize.
VFIO can fault in VM_PFNMAP'd memory which is not from MMIO regions. It works
because it knows VM_PFNMAP'd memory is always pinned.

Another example is udmabuf (drivers/dma-buf/udmabuf.c): it mmaps normal folios
with the VM_PFNMAP flag without registering an mmu notifier because those
folios are pinned.

> >
> > > be easy to pursue new users of guest_memfd to follow this scheme.
> > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > > hugepage support already needs the stance of: "Guest memfd owns all
> > > long-term refcounts on private memory" as discussed at LPC [1].
> > >
> > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > > (slide 12)
> > >
> > > >
> > > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > > happens to prevent converting a private PFN to shared if there are errors when
> > > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> > >
> > > Users will have to signal such failures via the invalidation callback
> > > results or other appropriate mechanisms. guest_memfd can relay the
> > > failures up the call chain to the userspace.
> > AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().
> 
> Very likely because these operations simply don't fail.

I think they are intentionally designed to be no-fail.

e.g. in __iopt_area_unfill_domain(), no-fail is achieved by falling back to
a small backup buffer allocated on the stack in case of kmalloc() failure.
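
For illustration, a minimal sketch of that pattern (generic, not the
actual iommufd code; struct unfill_ctx, collect_pfns() and
unmap_and_unpin() below are made-up placeholders):

/*
 * A teardown path that must not fail: prefer a large kcalloc()'d
 * batch, but fall back to a small on-stack batch if the allocation
 * fails, so the unfill always completes, just in smaller steps.
 */
#define UNFILL_BACKUP_PFNS	32

static void unfill_range_nofail(struct unfill_ctx *ctx,
				unsigned long start, unsigned long last)
{
	unsigned long backup[UNFILL_BACKUP_PFNS];
	unsigned long *batch;
	size_t batch_size = 512;

	batch = kcalloc(batch_size, sizeof(*batch), GFP_KERNEL);
	if (!batch) {
		/* No memory: fall back to the small on-stack buffer. */
		batch = backup;
		batch_size = ARRAY_SIZE(backup);
	}

	while (start <= last) {
		/* Returns at least one pfn while the range is non-empty. */
		size_t n = collect_pfns(ctx, start, last, batch, batch_size);

		unmap_and_unpin(ctx, batch, n);	/* no-fail by design */
		start += n;
	}

	if (batch != backup)
		kfree(batch);
}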


> >
> > That's why we rely on increasing folio ref count to reflect failure, which are
> > due to unexpected SEAMCALL errors.
> 
> TDX stack is adding a scenario where invalidation can fail, a cleaner
> solution would be to propagate the result as an invalidation failure.
Not sure if the Linux kernel accepts unmap failure.

> Another option is to notify guest_memfd out of band to convey the
> ranges that failed invalidation.
Yes, this might be better. Something similar to holding the folio ref count
to let guest_memfd know that a certain PFN cannot be re-assigned.

> With in-place conversion supported, even if the refcount is raised for
> such pages, they can still get used by the host if the guest_memfd is
> unaware that the invalidation failed.
I thought guest_memfd should check if folio ref count is 0 (or a base count)
before conversion, splitting or re-assignment. Otherwise, why do you care if
TDX holds the ref count? :)
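
As a rough sketch of that kind of check (folio_ref_count() is the real
API; the expected base count and the helper itself are made up for
illustration):

/*
 * Refuse to convert/split/re-assign a folio unless its refcount is
 * exactly what guest_memfd itself expects (e.g. the filemap ref),
 * i.e. no other subsystem such as TDX still holds a reference.
 */
static bool gmem_folio_refs_are_safe(struct folio *folio,
				     unsigned long expected)
{
	return folio_ref_count(folio) == expected;
}

	/* In the conversion path: */
	if (!gmem_folio_refs_are_safe(folio, expected_refs))
		return -EAGAIN;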


> >
> > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > > that fails to be unmapped.
> > > >
> > > >
> > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > > discussed.
> > > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > > >
> > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > > into private EPT until TDX connect.
> > > > >
> > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > > ranges will be mapped into IOMMU page tables.
> > > > >
> > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > > have been discussions on not using page structs for private memory
> > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > > seamless merge/split story for private hugepages to support memory
> > > > > conversion. So I think the general direction we should head towards is
> > > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > > structs altogether.
> > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > > keep all necessary info.
> > > >
> > >
> > > Story will become clearer once VM_PFNMAP'd memory support starts
> > > getting discussed. In case of guest_memfd, there is flexibility to
> > > store metadata for physical ranges within guest_memfd just like
> > > shareability tracking.
> > Ok.
> >
> > > >
> > > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > > KVM is in the very right direction of not assuming page struct backed
> > > > > memory ranges for guest_memfd as well.
> > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > > >
> > > >
> > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > > >
> > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > > driver is VFIO pci driver rather than guest_memfd.
> > > > >
> > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > > by VM_PFNMAP memory.
> > > > >
> > > > > >
> > > > > >
> > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> > >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Ackerley Tng 7 months, 1 week ago
Yan Zhao <yan.y.zhao@intel.com> writes:

>> <snip>
>>
>> Very likely because these operations simply don't fail.
>
> I think they are intentionally designed to be no-fail.
>
> e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup
> buffer allocated on stack in case of kmalloc() failure.
>
>
>> >
>> > That's why we rely on increasing folio ref count to reflect failure, which are
>> > due to unexpected SEAMCALL errors.
>>
>> TDX stack is adding a scenario where invalidation can fail, a cleaner
>> solution would be to propagate the result as an invalidation failure.
> Not sure if linux kernel accepts unmap failure.
>
>> Another option is to notify guest_memfd out of band to convey the
>> ranges that failed invalidation.
> Yes, this might be better. Something similar like holding folio ref count to
> let guest_memfd know that a certain PFN cannot be re-assigned.
>
>> With in-place conversion supported, even if the refcount is raised for
>> such pages, they can still get used by the host if the guest_memfd is
>> unaware that the invalidation failed.
> I thought guest_memfd should check if folio ref count is 0 (or a base count)
> before conversion, splitting or re-assignment. Otherwise, why do you care if
> TDX holds the ref count? :)
>

IIUC the question here is how we should handle failures in unmapping of
private memory, which should be a rare occurrence.

I think there are two options here

1. Fail on unmapping *private* memory

2. Don't fail on unmapping *private* memory, instead tell the owner of
   the memory that this memory is never to be used again.

I think option 1 is better because it is more direct and provides timely
feedback to the caller when the issue happens. There is also room to
provide even more context about the address of the failure here.

It does seem that, in general, unmapping memory is not allowed to fail,
but I think that applies to shared memory (even in KVM MMU notifiers).
Would it be possible to establish a new contract that, for private pages,
unmapping can fail?

The kernel/KVM-internal functions for unmapping GFNs can be modified to
return an error when unmapping private memory. Specifically, when
KVM_FILTER_PRIVATE [1] is set, the unmapping function can return an
error; if it is not set, the caller should not expect failures.
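
A rough sketch of what that contract could look like (this assumes the
attr_filter field that comes with [1]; kvm_arch_unmap_private() below is
hypothetical, and flush handling is omitted):

static int kvm_unmap_gfn_range_may_fail(struct kvm *kvm,
					struct kvm_gfn_range *range)
{
	/*
	 * Only private unmaps are allowed to fail; the arch code (e.g.
	 * TDX) reports the error instead of swallowing it, and the
	 * caller (guest_memfd) decides what to do with it.
	 */
	if (range->attr_filter & KVM_FILTER_PRIVATE)
		return kvm_arch_unmap_private(kvm, range);

	kvm_unmap_gfn_range(kvm, range);	/* existing no-fail path */
	return 0;
}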

IIUC the only places where private memory is unmapped now are
guest_memfd's truncate and (future) convert operations, so guest_memfd
can handle those failures or return failure to userspace.

Option 2 is possible too - but seems a little awkward. For conversion
the general steps are to (a) unmap pages from either host, guest or both
page tables (b) change shareability status in guest_memfd. It seems
awkward to first let step (a) pass even though there was an error, and
then proceed to (b) only to check somewhere (via refcount or otherwise)
that there was an issue and the conversion needs to fail.

Currently (will be posting this 1G page support series (with
conversions) soon), for shared to private conversions I check that
refcounts == safe refcount before permitting the conversion (an error is
returned to userspace on failure).

For private to shared conversions, there is no check. At conversion
time, when splitting pages, I just spin in the kernel waiting for any
speculative refcounts to go away. The refcount check at conversion
time is currently there purely to ensure a safe merge process.
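
Roughly the shape of that wait (a simplified sketch, not the actual
filemap_remove_folio_for_restructuring() code; 'expected' stands for
whatever base count guest_memfd itself holds):

static void wait_for_speculative_refs(struct folio *folio,
				      unsigned long expected)
{
	/* Speculative references are expected to be short-lived. */
	while (folio_ref_count(folio) != expected)
		cpu_relax();
}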

It is possible to check all the refcounts of private pages (split or
huge page) in the requested conversion range to handle unmapping
failures, but that seems expensive to do for every conversion, for
possibly many 4K pages, just to find a rare error case.

Also, if we do this refcount check to find the error, there wouldn't be
any way to tell if it were an error or if it was a speculative refcount,
so guest_memfd would just have to return -EAGAIN for private to shared
conversions. This would make conversions complicated to handle in
userspace, since the userspace VMM doesn't know whether it should retry
(for speculative refcounts) or it should give up because of the
unmapping error. Returning a different error on unmapping failure would
allow userspace to handle the two cases differently.
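
Something like the following is what that distinction could look like
in the conversion path (illustrative only; the error codes and the
unmap_failed() helper are made up):

	/*
	 * Let userspace tell a transient refcount from a real unmap
	 * failure by the error code: -EAGAIN means "retry the ioctl",
	 * anything else means "give up on this range".
	 */
	if (unmap_failed(folio))
		return -EIO;
	if (folio_ref_count(folio) != expected_refs)
		return -EAGAIN;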

Regarding Option 2, another way to indicate an error could be to mark
the page as poisoned, but then again that would overlap/shadow true
memory poisoning.

In summary, I think Option 1 is best, which is that we return an error
within the kernel, and the caller (for now only guest_memfd unmaps
private memory) should handle the error.

[1] https://github.com/torvalds/linux/blob/627277ba7c2398dc4f95cc9be8222bb2d9477800/include/linux/kvm_host.h#L260

>
>> >
>> > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
>> > > > that fails to be unmapped.
>> > > >
>> > > >
>> > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
>> > > >
>> > > >
>> > > > > >
>> > > > > >
>> > > > > > > Any guest_memfd range updates will result in invalidations/updates of
>> > > > > > > userspace, guest, IOMMU or any other page tables referring to
>> > > > > > > guest_memfd backed pfns. This story will become clearer once the
>> > > > > > > support for PFN range allocator for backing guest_memfd starts getting
>> > > > > > > discussed.
>> > > > > > Ok. It is indeed unclear right now to support such kind of memory.
>> > > > > >
>> > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
>> > > > > > into private EPT until TDX connect.
>> > > > >
>> > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
>> > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
>> > > > > connect/Sev TIO, major difference would be that guest_memfd private
>> > > > > ranges will be mapped into IOMMU page tables.
>> > > > >
>> > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
>> > > > > have been discussions on not using page structs for private memory
>> > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
>> > > > > seamless merge/split story for private hugepages to support memory
>> > > > > conversion. So I think the general direction we should head towards is
>> > > > > not relying on refcounts for guest_memfd private ranges and/or page
>> > > > > structs altogether.
>> > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
>> > > > keep all necessary info.
>> > > >
>> > >
>> > > Story will become clearer once VM_PFNMAP'd memory support starts
>> > > getting discussed. In case of guest_memfd, there is flexibility to
>> > > store metadata for physical ranges within guest_memfd just like
>> > > shareability tracking.
>> > Ok.
>> >
>> > > >
>> > > > > I think the series [2] to work better with PFNMAP'd physical memory in
>> > > > > KVM is in the very right direction of not assuming page struct backed
>> > > > > memory ranges for guest_memfd as well.
>> > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
>> > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
>> > > >
>> > > >
>> > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
>> > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
>> > > > >
>> > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
>> > > > > > driver is VFIO pci driver rather than guest_memfd.
>> > > > >
>> > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
>> > > > > by VM_PFNMAP memory.
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
>> > >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 7 months, 1 week ago
On Mon, 2025-05-12 at 12:00 -0700, Ackerley Tng wrote:
> IIUC the question here is how we should handle failures in unmapping of
> private memory, which should be a rare occurrence.
> 
> I think there are two options here
> 
> 1. Fail on unmapping *private* memory
> 
> 2. Don't fail on unmapping *private* memory, instead tell the owner of
>    the memory that this memory is never to be used again.
> 
> I think option 1 is better because it is more direct and provides timely
> feedback to the caller when the issue happens. There is also room to
> provide even more context about the address of the failure here.
> 
> It does seem like generally, unmapping memory does not support failing,
> but I think that is for shared memory (even in KVM MMU notifiers).
> Would it be possible to establish a new contract that for private pages,
> unmapping can fail?
> 
> The kernel/KVM-internal functions for unmapping GFNs can be modified to
> return error when unmapping private memory. Specifically, when
> KVM_FILTER_PRIVATE [1] is set, then the unmapping function can return an
> error and if not then the caller should not expect failures.
> 
> IIUC the only places where private memory is unmapped now is via
> guest_memfd's truncate and (future) convert operations, so guest_memfd
> can handle those failures or return failure to userspace.

1. Private->shared memory conversion
2. Memslot deletion
3. kvm_gmem_invalidate_begin() callers

> 
> Option 2 is possible too - but seems a little awkward. For conversion
> the general steps are to (a) unmap pages from either host, guest or both
> page tables (b) change shareability status in guest_memfd. It seems
> awkward to first let step (a) pass even though there was an error, and
> then proceed to (b) only to check somewhere (via refcount or otherwise)
> that there was an issue and the conversion needs to fail.
> 
> Currently for private to shared conversions, (will be posting this 1g
> page support series (with conversions) soon), I check refcounts == safe
> refcount for shared to private conversions before permitting conversions
> (error returned to userspace on failure).
> 
> For private to shared conversions, there is no check. At conversion
> time, when splitting pages, I just spin in the kernel waiting for any
> speculative refcounts to drop to go away. The refcount check at
> conversion time is currently purely to ensure a safe merge process.
> 
> It is possible to check all the refcounts of private pages (split or
> huge page) in the requested conversion range to handle unmapping
> failures, but that seems expensive to do for every conversion, for
> possibly many 4K pages, just to find a rare error case.
> 
> Also, if we do this refcount check to find the error, there wouldn't be
> any way to tell if it were an error or if it was a speculative refcount,
> so guest_memfd would just have to return -EAGAIN for private to shared
> conversions. This would make conversions complicated to handle in
> userspace, since the userspace VMM doesn't know whether it should retry
> (for speculative refcounts) or it should give up because of the
> unmapping error. Returning a different error on unmapping failure would
> allow userspace to handle the two cases differently.
> 
> Regarding Option 2, another way to indicate an error could be to mark
> the page as poisoned, but then again that would overlap/shadow true
> memory poisoning.
> 
> In summary, I think Option 1 is best, which is that we return error
> within the kernel, and the caller (for now only guest_memfd unmaps
> private memory) should handle the error.

When we get to huge pages we will have two error conditions on the KVM side:
1. Fail to split
2. A TDX module error

For TDX module errors, today we are essentially talking about bugs. The handling
in the case of a TDX module bug should be to bug the VM and to prevent the memory
from being freed to the kernel. In that case the unmapping sort of succeeded,
albeit destructively.

So I'm not sure if userspace needs to know about the TDX module bugs (they are
going to find out anyway on the next KVM ioctl). But if we plumbed the error
code all the way through to guestmemfd, then I guess why not tell them.

On whether we should go to the trouble: could another option be to expose a
guestmemfd function that allows for "poisoning" the memory and have the TDX bug
paths call it? This way guestmemfd could know specifically about unrecoverable
errors. For splitting failures, we can return an error to guestmemfd without
plumbing the error code all the way through.
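
As a sketch of that idea (kvm_gmem_mark_unusable() and the whole error
path below are hypothetical, just to show the shape of such an
interface):

/* Hypothetical guest_memfd-provided interface. */
int kvm_gmem_mark_unusable(struct kvm *kvm, gfn_t gfn);

/* Sketch of a TDX bug path using it. */
static void tdx_handle_removal_failure(struct kvm *kvm, gfn_t gfn,
				       struct page *page, int level)
{
	/* Keep the ref so the page is never freed back to the kernel. */
	folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
	/* Let guest_memfd know this range is permanently unusable. */
	kvm_gmem_mark_unusable(kvm, gfn);
	/* Bug the VM, as described above. */
	KVM_BUG_ON(1, kvm);
}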

Perhaps it might be worth doing a quick POC to see how bad plumbing the error
code all the way looks.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 1 week ago
On Thu, May 8, 2025 at 8:22 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote:
> > On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >
> > > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > > > >
> > > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > > > >
> > > > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > > > conversion.
> > > > > > > > >
> > > > > > > > > We can include it in the next version If you think it's better.
> > > > > > > >
> > > > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > > > logic.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > > > than a pfn.
> > > > > > > > >
> > > > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > > > invalidation completes.
> > > > > > > > > A curious question:
> > > > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > > > to the page struct to hold ref count and map count?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > > > memory ranges as all users will be notified when mappings are
> > > > > > > > changed/removed.
> > > > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > > > simultaneously?
> > > > > >
> > > > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > > > shared memory regions in order to support CoCo VM usecases.
> > > > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > > > to ensure that.
> > > > >
> > > > > Or do you believe that by eliminating the struct page, there would be no
> > > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > > > response to a guest_memfd invalidation notification?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > > > no need to register mmu notifier. So why users like VFIO must register
> > > > > guest_memfd invalidation notification?
> > > >
> > > > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > > > ranges will have to adopt mechanisms to get notified. I think it would
> > > Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.
> >
> > I don't completely understand how VM_PFNMAP'd memory is used today for
> > VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
> > for normal memory backed by pfnmap is yet to materialize.
> VFIO can fault in VM_PFNMAP'd memory which is not from MMIO regions. It works
> because it knows VM_PFNMAP'd memory are always pinned.
>
> Another example is udmabuf (drivers/dma-buf/udmabuf.c), it mmaps normal folios
> with VM_PFNMAP flag without registering mmu notifier because those folios are
> pinned.
>

I might be wrongly throwing out some terminology here then.
The VM_PFNMAP flag can be set for memory backed by folios/page structs.
udmabuf seems to be working with pinned "folios" in the backend.

The goal is to get to a stage where guest_memfd is backed by pfn
ranges unmanaged by the kernel, which guest_memfd owns and distributes
to userspace, KVM and the IOMMU subject to shareability attributes. If
the shareability changes, the users will get notified and will have to
invalidate their mappings. guest_memfd will allow mmapping such ranges
with the VM_PFNMAP flag set by default in the VMAs to indicate the need
for special handling/lack of page structs.

As an intermediate stage, it makes sense to me to just not have
private memory backed by page structs and use a special "filemap" to
map file offsets to these private memory ranges. This step will also
need a similar contract with users -
   1) memory is pinned by guest_memfd
   2) users will get invalidation notifications on shareability changes
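
A rough sketch of what such a contract could look like from a user's
point of view (the structure and function names below are made up for
illustration; nothing like this exists yet):

/*
 * Hypothetical contract between guest_memfd and its users (KVM,
 * IOMMU, ...): the memory is pinned by guest_memfd itself, and users
 * must drop their mappings when told that shareability changed.
 */
struct gmem_invalidate_ops {
	/*
	 * Called before shareability of [start, end) changes.  The
	 * user must unmap the range; a non-zero return would make the
	 * conversion fail (see the discussion elsewhere in this
	 * thread).
	 */
	int (*invalidate)(void *priv, pgoff_t start, pgoff_t end);
};

int gmem_register_invalidate(struct file *gmem_file,
			     const struct gmem_invalidate_ops *ops,
			     void *priv);
void gmem_unregister_invalidate(struct file *gmem_file, void *priv);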

I am sure there is a lot of work here and many quirks to be addressed,
let's discuss this more with better context around. A few related RFC
series are planned to be posted in the near future.

> > >
> > > > be easy to pursue new users of guest_memfd to follow this scheme.
> > > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > > > hugepage support already needs the stance of: "Guest memfd owns all
> > > > long-term refcounts on private memory" as discussed at LPC [1].
> > > >
> > > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > > > (slide 12)
> > > >
> > > > >
> > > > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > > > happens to prevent converting a private PFN to shared if there are errors when
> > > > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> > > >
> > > > Users will have to signal such failures via the invalidation callback
> > > > results or other appropriate mechanisms. guest_memfd can relay the
> > > > failures up the call chain to the userspace.
> > > AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> > > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> > > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().
> >
> > Very likely because these operations simply don't fail.
>
> I think they are intentionally designed to be no-fail.
>
> e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup
> buffer allocated on stack in case of kmalloc() failure.
>
>
> > >
> > > That's why we rely on increasing folio ref count to reflect failure, which are
> > > due to unexpected SEAMCALL errors.
> >
> > TDX stack is adding a scenario where invalidation can fail, a cleaner
> > solution would be to propagate the result as an invalidation failure.
> Not sure if linux kernel accepts unmap failure.
>
> > Another option is to notify guest_memfd out of band to convey the
> > ranges that failed invalidation.
> Yes, this might be better. Something similar like holding folio ref count to
> let guest_memfd know that a certain PFN cannot be re-assigned.
>
> > With in-place conversion supported, even if the refcount is raised for
> > such pages, they can still get used by the host if the guest_memfd is
> > unaware that the invalidation failed.
> I thought guest_memfd should check if folio ref count is 0 (or a base count)
> before conversion, splitting or re-assignment. Otherwise, why do you care if
> TDX holds the ref count? :)
>

The soon-to-be-posted RFC series by Ackerley currently explicitly
checks for safe private page refcounts when folio splitting is needed,
and not for every private to shared conversion. A simple solution would
be for guest_memfd to check safe page refcounts for each private to
shared conversion even if a split is not required, but this will need
to be reworked when either of the stages discussed above lands and page
structs are no longer around.

>
> > >
> > > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > > > that fails to be unmapped.
> > > > >
> > > > >
> > > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > > > discussed.
> > > > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > > > >
> > > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > > > into private EPT until TDX connect.
> > > > > >
> > > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > > > ranges will be mapped into IOMMU page tables.
> > > > > >
> > > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > > > have been discussions on not using page structs for private memory
> > > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > > > seamless merge/split story for private hugepages to support memory
> > > > > > conversion. So I think the general direction we should head towards is
> > > > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > > > structs altogether.
> > > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > > > keep all necessary info.
> > > > >
> > > >
> > > > Story will become clearer once VM_PFNMAP'd memory support starts
> > > > getting discussed. In case of guest_memfd, there is flexibility to
> > > > store metadata for physical ranges within guest_memfd just like
> > > > shareability tracking.
> > > Ok.
> > >
> > > > >
> > > > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > > > KVM is in the very right direction of not assuming page struct backed
> > > > > > memory ranges for guest_memfd as well.
> > > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > > > >
> > > > >
> > > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > > > >
> > > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > > > driver is VFIO pci driver rather than guest_memfd.
> > > > > >
> > > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > > > by VM_PFNMAP memory.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> > > >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months, 1 week ago
On Fri, May 09, 2025 at 07:20:30AM -0700, Vishal Annapurve wrote:
> On Thu, May 8, 2025 at 8:22 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote:
> > > On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > > > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > > > > >
> > > > > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > > > > conversion.
> > > > > > > > > >
> > > > > > > > > > We can include it in the next version If you think it's better.
> > > > > > > > >
> > > > > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > > > > logic.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > > > > than a pfn.
> > > > > > > > > >
> > > > > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > > > > invalidation completes.
> > > > > > > > > > A curious question:
> > > > > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > > > > to the page struct to hold ref count and map count?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > > > > memory ranges as all users will be notified when mappings are
> > > > > > > > > changed/removed.
> > > > > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > > > > simultaneously?
> > > > > > >
> > > > > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > > > > shared memory regions in order to support CoCo VM usecases.
> > > > > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > > > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > > > > to ensure that.
> > > > > >
> > > > > > Or do you believe that by eliminating the struct page, there would be no
> > > > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > > > > response to a guest_memfd invalidation notification?
> > > > >
> > > > > Yes.
> > > > >
> > > > > >
> > > > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > > > > no need to register mmu notifier. So why users like VFIO must register
> > > > > > guest_memfd invalidation notification?
> > > > >
> > > > > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > > > > ranges will have to adopt mechanisms to get notified. I think it would
> > > > Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.
> > >
> > > I don't completely understand how VM_PFNMAP'd memory is used today for
> > > VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
> > > for normal memory backed by pfnmap is yet to materialize.
> > VFIO can fault in VM_PFNMAP'd memory which is not from MMIO regions. It works
> > because it knows VM_PFNMAP'd memory are always pinned.
> >
> > Another example is udmabuf (drivers/dma-buf/udmabuf.c), it mmaps normal folios
> > with VM_PFNMAP flag without registering mmu notifier because those folios are
> > pinned.
> >
> 
> I might be wrongly throwing out some terminologies here then.
> VM_PFNMAP flag can be set for memory backed by folios/page structs.
> udmabuf seems to be working with pinned "folios" in the backend.
> 
> The goal is to get to a stage where guest_memfd is backed by pfn
> ranges unmanaged by kernel that guest_memfd owns and distributes to
> userspace, KVM, IOMMU subject to shareability attributes. if the
OK. So from the point of view of the rest of the kernel, those pfns are not
regarded as memory.

> shareability changes, the users will get notified and will have to
> invalidate their mappings. guest_memfd will allow mmaping such ranges
> with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> special handling/lack of page structs.
My concern is that a failable invalidation notifier may not be ideal.
Instead of relying on ref counts (or other mechanisms) to determine whether to
start shareability changes, with a failable invalidation notifier some users
may fail the invalidation, and therefore the shareability change, even after
other users have successfully unmapped a range.

Auditing whether multiple users of shared memory correctly perform unmapping is
harder than auditing reference counts.

> private memory backed by page structs and use a special "filemap" to
> map file offsets to these private memory ranges. This step will also
> need similar contract with users -
>    1) memory is pinned by guest_memfd
>    2) users will get invalidation notifiers on shareability changes
> 
> I am sure there is a lot of work here and many quirks to be addressed,
> let's discuss this more with better context around. A few related RFC
> series are planned to be posted in the near future.
Ok. Thanks for your time and discussions :)

> > > >
> > > > > be easy to pursue new users of guest_memfd to follow this scheme.
> > > > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > > > > hugepage support already needs the stance of: "Guest memfd owns all
> > > > > long-term refcounts on private memory" as discussed at LPC [1].
> > > > >
> > > > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > > > > (slide 12)
> > > > >
> > > > > >
> > > > > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > > > > happens to prevent converting a private PFN to shared if there are errors when
> > > > > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> > > > >
> > > > > Users will have to signal such failures via the invalidation callback
> > > > > results or other appropriate mechanisms. guest_memfd can relay the
> > > > > failures up the call chain to the userspace.
> > > > AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> > > > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> > > > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().
> > >
> > > Very likely because these operations simply don't fail.
> >
> > I think they are intentionally designed to be no-fail.
> >
> > e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup
> > buffer allocated on stack in case of kmalloc() failure.
> >
> >
> > > >
> > > > That's why we rely on increasing folio ref count to reflect failure, which are
> > > > due to unexpected SEAMCALL errors.
> > >
> > > TDX stack is adding a scenario where invalidation can fail, a cleaner
> > > solution would be to propagate the result as an invalidation failure.
> > Not sure if linux kernel accepts unmap failure.
> >
> > > Another option is to notify guest_memfd out of band to convey the
> > > ranges that failed invalidation.
> > Yes, this might be better. Something similar like holding folio ref count to
> > let guest_memfd know that a certain PFN cannot be re-assigned.
> >
> > > With in-place conversion supported, even if the refcount is raised for
> > > such pages, they can still get used by the host if the guest_memfd is
> > > unaware that the invalidation failed.
> > I thought guest_memfd should check if folio ref count is 0 (or a base count)
> > before conversion, splitting or re-assignment. Otherwise, why do you care if
> > TDX holds the ref count? :)
> >
> 
> Soon to be posted RFC series by Ackerley currently explicitly checks
> for safe private page refcounts when folio splitting is needed and not
> for every private to shared conversion. A simple solution would be for
> guest_memfd to check safe page refcounts for each private to shared
> conversion even if split is not required but will need to be reworked
> when either of the stages discussed above land where page structs are
> not around.
> 
> >
> > > >
> > > > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > > > > that fails to be unmapped.
> > > > > >
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > > > > discussed.
> > > > > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > > > > >
> > > > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > > > > into private EPT until TDX connect.
> > > > > > >
> > > > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > > > > ranges will be mapped into IOMMU page tables.
> > > > > > >
> > > > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > > > > have been discussions on not using page structs for private memory
> > > > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > > > > seamless merge/split story for private hugepages to support memory
> > > > > > > conversion. So I think the general direction we should head towards is
> > > > > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > > > > structs altogether.
> > > > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > > > > keep all necessary info.
> > > > > >
> > > > >
> > > > > Story will become clearer once VM_PFNMAP'd memory support starts
> > > > > getting discussed. In case of guest_memfd, there is flexibility to
> > > > > store metadata for physical ranges within guest_memfd just like
> > > > > shareability tracking.
> > > > Ok.
> > > >
> > > > > >
> > > > > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > > > > KVM is in the very right direction of not assuming page struct backed
> > > > > > > memory ranges for guest_memfd as well.
> > > > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > > > > >
> > > > > >
> > > > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > > > > >
> > > > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > > > > driver is VFIO pci driver rather than guest_memfd.
> > > > > > >
> > > > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > > > > by VM_PFNMAP memory.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> > > > >
> 
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 1 week ago
On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> ...
> >
> > I might be wrongly throwing out some terminologies here then.
> > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> > udmabuf seems to be working with pinned "folios" in the backend.
> >
> > The goal is to get to a stage where guest_memfd is backed by pfn
> > ranges unmanaged by kernel that guest_memfd owns and distributes to
> > userspace, KVM, IOMMU subject to shareability attributes. if the
> OK. So from point of the reset part of kernel, those pfns are not regarded as
> memory.
>
> > shareability changes, the users will get notified and will have to
> > invalidate their mappings. guest_memfd will allow mmaping such ranges
> > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> > special handling/lack of page structs.
> My concern is a failable invalidation notifer may not be ideal.
> Instead of relying on ref counts (or other mechanisms) to determine whether to
> start shareabilitiy changes, with a failable invalidation notifier, some users
> may fail the invalidation and the shareability change, even after other users
> have successfully unmapped a range.

Even if one user fails to invalidate its mappings, I don't see a
reason to go ahead with shareability change. Shareability should not
change unless all existing users let go of their soon-to-be-invalid
view of memory.

>
> Auditing whether multiple users of shared memory correctly perform unmapping is
> harder than auditing reference counts.
>
> > private memory backed by page structs and use a special "filemap" to
> > map file offsets to these private memory ranges. This step will also
> > need similar contract with users -
> >    1) memory is pinned by guest_memfd
> >    2) users will get invalidation notifiers on shareability changes
> >
> > I am sure there is a lot of work here and many quirks to be addressed,
> > let's discuss this more with better context around. A few related RFC
> > series are planned to be posted in the near future.
> Ok. Thanks for your time and discussions :)
> ...
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 7 months ago
On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > ...
> > >
> > > I might be wrongly throwing out some terminologies here then.
> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> > > udmabuf seems to be working with pinned "folios" in the backend.
> > >
> > > The goal is to get to a stage where guest_memfd is backed by pfn
> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> > OK. So from point of the reset part of kernel, those pfns are not regarded as
> > memory.
> >
> > > shareability changes, the users will get notified and will have to
> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> > > special handling/lack of page structs.
> > My concern is a failable invalidation notifer may not be ideal.
> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> > start shareabilitiy changes, with a failable invalidation notifier, some users
> > may fail the invalidation and the shareability change, even after other users
> > have successfully unmapped a range.
> 
> Even if one user fails to invalidate its mappings, I don't see a
> reason to go ahead with shareability change. Shareability should not
> change unless all existing users let go of their soon-to-be-invalid
> view of memory.
My thinking is that:

1. guest_memfd starts shared-to-private conversion
2. guest_memfd sends invalidation notifications
   2.1 invalidate notification --> A --> Unmap and return success
   2.2 invalidate notification --> B --> Unmap and return success
   2.3 invalidate notification --> C --> return failure
3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
   shareability as shared

Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
2.2. Even if additional notifications could be sent to A and B to ask for
mapping the GFN back, the map operation might fail. Consequently, A and B might
not be able to restore the mapped status of the GFN. For IOMMU mappings, this
could result in DMAR failure following a failed attempt to do shared-to-private
conversion.
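
Spelled out as (purely illustrative) code, the problem is that the
notifier loop has already torn down A's and B's mappings by the time C
fails; every type and helper below is made up:

static int gmem_convert_shared_to_private(struct gmem_range *range)
{
	struct gmem_user *user;
	int r;

	list_for_each_entry(user, &range->users, node) {
		r = user->ops->invalidate(user->priv,
					  range->start, range->end);
		if (r)
			return r;	/* A and B are already unmapped here */
	}

	/* Only reached when every user unmapped successfully. */
	set_shareability(range, GMEM_PRIVATE);
	return 0;
}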

I noticed Ackerley has posted the series. Will check there later.

> >
> > Auditing whether multiple users of shared memory correctly perform unmapping is
> > harder than auditing reference counts.
> >
> > > private memory backed by page structs and use a special "filemap" to
> > > map file offsets to these private memory ranges. This step will also
> > > need similar contract with users -
> > >    1) memory is pinned by guest_memfd
> > >    2) users will get invalidation notifiers on shareability changes
> > >
> > > I am sure there is a lot of work here and many quirks to be addressed,
> > > let's discuss this more with better context around. A few related RFC
> > > series are planned to be posted in the near future.
> > Ok. Thanks for your time and discussions :)
> > ...
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Ackerley Tng 6 months, 2 weeks ago
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
>> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> > ...
>> > >
>> > > I might be wrongly throwing out some terminologies here then.
>> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
>> > > udmabuf seems to be working with pinned "folios" in the backend.
>> > >
>> > > The goal is to get to a stage where guest_memfd is backed by pfn
>> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
>> > > userspace, KVM, IOMMU subject to shareability attributes. if the
>> > OK. So from point of the reset part of kernel, those pfns are not regarded as
>> > memory.
>> >
>> > > shareability changes, the users will get notified and will have to
>> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
>> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
>> > > special handling/lack of page structs.
>> > My concern is a failable invalidation notifer may not be ideal.
>> > Instead of relying on ref counts (or other mechanisms) to determine whether to
>> > start shareabilitiy changes, with a failable invalidation notifier, some users
>> > may fail the invalidation and the shareability change, even after other users
>> > have successfully unmapped a range.
>>
>> Even if one user fails to invalidate its mappings, I don't see a
>> reason to go ahead with shareability change. Shareability should not
>> change unless all existing users let go of their soon-to-be-invalid
>> view of memory.

Hi Yan,

While working on the 1G (aka HugeTLB) page support for guest_memfd
series [1], we took into account conversion failures too. The steps are
in kvm_gmem_convert_range(). (It might be easier to pull the entire
series from GitHub [2] because the steps for conversion changed in two
separate patches.)

We do need to handle errors across ranges to be converted, possibly from
different memslots. The goal is to either have the entire conversion
happen (including page split/merge) or nothing at all when the ioctl
returns.

We try to undo the restructuring (whether split or merge) and undo any
shareability changes on error (barring ENOMEM, in which case we leave a
WARNing).

The part we don't restore is the presence of the pages in the host or
guest page tables. For that, our idea is that if unmapped, the next
access will just map it in, so there's no issue there.

> My thinking is that:
>
> 1. guest_memfd starts shared-to-private conversion
> 2. guest_memfd sends invalidation notifications
>    2.1 invalidate notification --> A --> Unmap and return success
>    2.2 invalidate notification --> B --> Unmap and return success
>    2.3 invalidate notification --> C --> return failure
> 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
>    shareability as shared
>
> Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> 2.2. Even if additional notifications could be sent to A and B to ask for
> mapping the GFN back, the map operation might fail. Consequently, A and B might
> not be able to restore the mapped status of the GFN.

For conversion we don't attempt to restore mappings anywhere (whether in
guest or host page tables). What do you think of not restoring the
mappings?

> For IOMMU mappings, this
> could result in DMAR failure following a failed attempt to do shared-to-private
> conversion.

I believe the current conversion setup guards against this because after
unmapping from the host, we check for any unexpected refcounts.

(This unmapping is not the unmapping we're concerned about, since this is
shared memory, and unmapping doesn't go through TDX.)

Coming back to the refcounts, if the IOMMU had mappings, these refcounts
are "unexpected". The conversion ioctl will return to userspace with an
error.

IO can continue to happen, since the memory is still mapped in the
IOMMU. The memory state is still shared. No issue there.

In RFCv2 [1], we expect userspace to see the error, then try and remove
the memory from the IOMMU, and then try conversion again.

The part in concern here is unmapping failures of private pages, for
private-to-shared conversions, since that part goes through TDX and
might fail.

One other thing about taking refcounts is that in RFCv2,
private-to-shared conversions assume that there are no refcounts on the
private pages at all. (See filemap_remove_folio_for_restructuring() in
[3])

Haven't had a chance to think about all the edge cases, but for now I
think on unmapping failure, in addition to taking a refcount, we should
return an error at least up to guest_memfd, so that guest_memfd could
perhaps keep the refcount on that page, but drop the page from the
filemap. Another option could be to track messed up addresses and always
check that on conversion or something - not sure yet.

Either way, guest_memfd must know. If guest_memfd is not informed, on a
next conversion request, the conversion will just spin in
filemap_remove_folio_for_restructuring().
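
As a very rough sketch combining the two ideas above (keeping a refcount and
tracking the affected index), with all names hypothetical and nothing taken
from a posted series:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/xarray.h>

/* Hypothetical per-inode guest_memfd state; the field name is made up. */
struct kvm_gmem_demo_inode {
	struct xarray failed_unmaps;	/* indices whose private unmap failed */
};

/*
 * Called (hypothetically) when TDX reports that unmapping a private page
 * failed: keep a reference so the page is never reused, and record the
 * index so later conversions skip it instead of spinning in
 * filemap_remove_folio_for_restructuring().
 */
static int kvm_gmem_mark_unmap_failed(struct kvm_gmem_demo_inode *gi,
				      pgoff_t index, struct folio *folio)
{
	folio_get(folio);

	return xa_err(xa_store(&gi->failed_unmaps, index, xa_mk_value(1),
			       GFP_KERNEL));
}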

What do you think of this part about informing guest_memfd of the
failure to unmap?

>
> I noticed Ackerley has posted the series. Will check there later.
>

[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
[3] https://lore.kernel.org/all/7753dc66229663fecea2498cf442a768cb7191ba.1747264138.git.ackerleytng@google.com/

>> >
>> > Auditing whether multiple users of shared memory correctly perform unmapping is
>> > harder than auditing reference counts.
>> >
>> > > private memory backed by page structs and use a special "filemap" to
>> > > map file offsets to these private memory ranges. This step will also
>> > > need similar contract with users -
>> > >    1) memory is pinned by guest_memfd
>> > >    2) users will get invalidation notifiers on shareability changes
>> > >
>> > > I am sure there is a lot of work here and many quirks to be addressed,
>> > > let's discuss this more with better context around. A few related RFC
>> > > series are planned to be posted in the near future.
>> > Ok. Thanks for your time and discussions :)
>> > ...
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months, 1 week ago
On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> Hi Yan,
> 
> While working on the 1G (aka HugeTLB) page support for guest_memfd
> series [1], we took into account conversion failures too. The steps are
> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> series from GitHub [2] because the steps for conversion changed in two
> separate patches.)
...
> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2

Hi Ackerley,
Thanks for providing this branch.

I'm now trying to get TD huge pages working on this branch and would like to
report the errors I encountered during this process early.

1. The symbol arch_get_align_mask() is not available when KVM is compiled as a
   module. I currently work around it as follows:

--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -102,8 +102,13 @@ static unsigned long kvm_gmem_get_align_mask(struct file *file,
        void *priv;

        inode = file_inode(file);
-       if (!kvm_gmem_has_custom_allocator(inode))
-             return arch_get_align_mask(file, flags);
+       if (!kvm_gmem_has_custom_allocator(inode)) {
+               page_size = 1 << PAGE_SHIFT;
+               return PAGE_MASK & (page_size - 1);
+       }


2. BUG: sleeping function called from invalid context

[  193.523469] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:325
[  193.539885] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3332, name: guest_memfd_con
[  193.556235] preempt_count: 1, expected: 0
[  193.564518] RCU nest depth: 0, expected: 0
[  193.572866] 3 locks held by guest_memfd_con/3332:
[  193.581800]  #0: ff16f8ec217e4438 (sb_writers#14){.+.+}-{0:0}, at: __x64_sys_fallocate+0x46/0x80
[  193.598252]  #1: ff16f8fbd85c8310 (mapping.invalidate_lock#4){++++}-{4:4}, at: kvm_gmem_fallocate+0x9e/0x310 [kvm]
[  193.616706]  #2: ff3189b5e4f65018 (&(kvm)->mmu_lock){++++}-{3:3}, at: kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
[  193.635790] Preemption disabled at:
[  193.635793] [<ffffffffc0850c6f>] kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]

This is because add_to_invalidated_kvms() invokes kzalloc() while holding
kvm->mmu_lock, which is a kind of spinlock, so the allocation may sleep in
atomic context.
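
For reference, a minimal standalone illustration of the pattern (this is not
the guest_memfd code, just a generic demonstration of why GFP_KERNEL
allocations cannot be made under a spinlock, and two common ways to avoid it):

#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

/* Broken: a GFP_KERNEL allocation may sleep while the spinlock is held. */
static void *broken_alloc(size_t sz)
{
	void *p;

	spin_lock(&demo_lock);
	p = kzalloc(sz, GFP_KERNEL);	/* "sleeping function called from
					 * invalid context" with
					 * CONFIG_DEBUG_ATOMIC_SLEEP */
	spin_unlock(&demo_lock);
	return p;
}

/* Fixed: allocate before taking the lock (or use GFP_ATOMIC inside it). */
static void *fixed_alloc(size_t sz)
{
	void *p = kzalloc(sz, GFP_KERNEL);

	spin_lock(&demo_lock);
	/* ... publish/use p under the lock ... */
	spin_unlock(&demo_lock);
	return p;
}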

I worked around it as follows.

 static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
                                             pgoff_t start, pgoff_t end,
@@ -1261,13 +1268,13 @@ static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
                        KVM_MMU_LOCK(kvm);
                        kvm_mmu_invalidate_begin(kvm);

-                       if (invalidated_kvms) {
-                               ret = add_to_invalidated_kvms(invalidated_kvms, kvm);
-                               if (ret) {
-                                       kvm_mmu_invalidate_end(kvm);
-                                       goto out;
-                               }
-                       }
                }


@@ -1523,12 +1530,14 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
        }

 out:
-       list_for_each_entry_safe(entry, tmp, &invalidated_kvms, list) {
-               kvm_gmem_do_invalidate_end(entry->kvm);
-               list_del(&entry->list);
-               kfree(entry);
-       }
+       list_for_each_entry(gmem, gmem_list, entry)
+               kvm_gmem_do_invalidate_end(gmem->kvm);

        filemap_invalidate_unlock(inode->i_mapping);


Will let you know more findings later.

Thanks
Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Ackerley Tng 6 months, 1 week ago
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> Hi Yan,
>> 
>> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> series [1], we took into account conversion failures too. The steps are
>> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> series from GitHub [2] because the steps for conversion changed in two
>> separate patches.)
> ...
>> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>
> Hi Ackerley,
> Thanks for providing this branch.

Here's the WIP branch [1], which I initially wasn't intending to make
super public since it's not even RFC standard yet and I didn't want to
add to the many guest_memfd in-flight series, but since you referred to
it, [2] is a v2 of the WIP branch :)

[1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
[2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2

This WIP branch has selftests that test 1G aka HugeTLB page support with
TDX huge page EPT mappings [7]:

1. "KVM: selftests: TDX: Test conversion to private at different
   sizes". This uses the fact that TDX module will return error if the
   page is faulted into the guest at a different level from the accept
   level to check the level that the page was faulted in.
2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
   private_mem_conversions_test for use with TDs. This test does
   multi-vCPU conversions and we use this to check for issues to do with
   conversion races.
3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
   private and shared memory". Adds a selftest similar to/on top of
   guest_memfd_conversions_test that does conversions via MapGPA.

Full list of selftests I usually run from tools/testing/selftests/kvm:

+ ./guest_memfd_test
+ ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
+ ./x86/private_mem_conversions_test.sh
+ ./set_memory_region_test
+ ./x86/private_mem_kvm_exits_test
+ ./x86/tdx_vm_test
+ ./x86/tdx_upm_test
+ ./x86/tdx_shared_mem_test
+ ./x86/tdx_gmem_private_and_shared_test

As an overview for anyone who might be interested in this WIP branch:

1.  I started with upstream's kvm/next
2.  Applied TDX selftests series [3]
3.  Applied guest_memfd mmap series [4]
4.  Applied conversions (sub)series and HugeTLB (sub)series [5]
5.  Added some fixes for 2 of the earlier series (as labeled in commit
    message)
6.  Updated guest_memfd conversions selftests to work with TDX
7.  Applied 2M EPT series [6] with some hacks
8.  Some patches to make guest_memfd mmap return huge-page-aligned
    userspace address
9.  Selftests for guest_memfd conversion with TDX 2M EPT

[3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
[4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
[7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/

>
> I'm now trying to make TD huge pages working on this branch and would like to
> report to you errors I encountered during this process early.
>
> 1. symbol arch_get_align_mask() is not available when KVM is compiled as module.
>    I currently workaround it as follows:
>
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -102,8 +102,13 @@ static unsigned long kvm_gmem_get_align_mask(struct file *file,
>         void *priv;
>
>         inode = file_inode(file);
> -       if (!kvm_gmem_has_custom_allocator(inode))
> -             return arch_get_align_mask(file, flags);
> +       if (!kvm_gmem_has_custom_allocator(inode)) {
> +               page_size = 1 << PAGE_SHIFT;
> +               return PAGE_MASK & (page_size - 1);
> +       }
>
>

Thanks, will fix in the next revision.

> 2. Bug of Sleeping function called from invalid context 
>
> [  193.523469] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:325
> [  193.539885] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3332, name: guest_memfd_con
> [  193.556235] preempt_count: 1, expected: 0
> [  193.564518] RCU nest depth: 0, expected: 0
> [  193.572866] 3 locks held by guest_memfd_con/3332:
> [  193.581800]  #0: ff16f8ec217e4438 (sb_writers#14){.+.+}-{0:0}, at: __x64_sys_fallocate+0x46/0x80
> [  193.598252]  #1: ff16f8fbd85c8310 (mapping.invalidate_lock#4){++++}-{4:4}, at: kvm_gmem_fallocate+0x9e/0x310 [kvm]
> [  193.616706]  #2: ff3189b5e4f65018 (&(kvm)->mmu_lock){++++}-{3:3}, at: kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
> [  193.635790] Preemption disabled at:
> [  193.635793] [<ffffffffc0850c6f>] kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
>
> This is because add_to_invalidated_kvms() invokes kzalloc() inside kvm->mmu_lock
> which is a kind of spinlock.
>
> I workarounded it as follows.
>
>  static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
>                                              pgoff_t start, pgoff_t end,
> @@ -1261,13 +1268,13 @@ static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
>                         KVM_MMU_LOCK(kvm);
>                         kvm_mmu_invalidate_begin(kvm);
>
> -                       if (invalidated_kvms) {
> -                               ret = add_to_invalidated_kvms(invalidated_kvms, kvm);
> -                               if (ret) {
> -                                       kvm_mmu_invalidate_end(kvm);
> -                                       goto out;
> -                               }
> -                       }
>                 }
>
>
> @@ -1523,12 +1530,14 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>         }
>
>  out:
> -       list_for_each_entry_safe(entry, tmp, &invalidated_kvms, list) {
> -               kvm_gmem_do_invalidate_end(entry->kvm);
> -               list_del(&entry->list);
> -               kfree(entry);
> -       }
> +       list_for_each_entry(gmem, gmem_list, entry)
> +               kvm_gmem_do_invalidate_end(gmem->kvm);
>
>         filemap_invalidate_unlock(inode->i_mapping);
>
>

I fixed this in WIP series v2 by grouping splitting with
unmapping. Please see this commit [8], the commit message includes an
explanation of what's done.

[8] https://github.com/googleprodkernel/linux-cc/commit/fd27635e5209b5e45a628d7fcf42a17a2b3c7e78

> Will let you know more findings later.
>
> Thanks
> Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 5 months ago
On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Hi Yan,
> >> 
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> > ...
> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Hi Ackerley,
> > Thanks for providing this branch.
> 
> Here's the WIP branch [1], which I initially wasn't intending to make
> super public since it's not even RFC standard yet and I didn't want to
> add to the many guest_memfd in-flight series, but since you referred to
> it, [2] is a v2 of the WIP branch :)
> 
> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
Hi Ackerley,

I'm working on preparing TDX huge page v2 based on [2] from you. The current
decision is that the code base of TDX huge page v2 needs to include DPAMT
and VM shutdown optimization as well.

So, we think kvm-x86/next is a good candidate for us.
(It is in repo https://github.com/kvm-x86/linux.git
 commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
 which already includes code for VM shutdown optimization).
I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.

Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
from your side. A straightforward rebase is sufficient, with no need for
any code modification. And it's better to be completed by the end of next
week.

We thought it might be easier for you to do that (depending on your bandwidth,
of course), which would allow me to work on the DPAMT part of TDX huge page v2
in parallel.

However, if it's difficult for you, please feel free to let us know.

Thanks
Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Ackerley Tng 5 months ago
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> >> Hi Yan,
>> >> 
>> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> >> series [1], we took into account conversion failures too. The steps are
>> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> >> series from GitHub [2] because the steps for conversion changed in two
>> >> separate patches.)
>> > ...
>> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> >
>> > Hi Ackerley,
>> > Thanks for providing this branch.
>> 
>> Here's the WIP branch [1], which I initially wasn't intending to make
>> super public since it's not even RFC standard yet and I didn't want to
>> add to the many guest_memfd in-flight series, but since you referred to
>> it, [2] is a v2 of the WIP branch :)
>> 
>> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
>> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> Hi Ackerley,
>
> I'm working on preparing TDX huge page v2 based on [2] from you. The current
> decision is that the code base of TDX huge page v2 needs to include DPAMT
> and VM shutdown optimization as well.
>
> So, we think kvm-x86/next is a good candidate for us.
> (It is in repo https://github.com/kvm-x86/linux.git
>  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
>  which already includes code for VM shutdown optimization).
> I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
>
> Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
> from your side. A straightforward rebase is sufficient, with no need for
> any code modification. And it's better to be completed by the end of next
> week.
>
> We thought it might be easier for you to do that (but depending on your
> bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
> parallel.
>

I'm a little tied up with some internal work, is it okay if, for the
next RFC, you base the changes that you need to make for TDX huge page
v2 and DPAMT on the base of [2]?

That will save both of us the rebasing. [2] was also based on (some
other version of) kvm/next.

I think it's okay since the main goal is to show that it works. I'll
let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
other patches that go into [2]).

[2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2

> However, if it's difficult for you, please feel free to let us know.
>
> Thanks
> Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 5 months ago
On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> >> Hi Yan,
> >> >> 
> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> >> series [1], we took into account conversion failures too. The steps are
> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> >> series from GitHub [2] because the steps for conversion changed in two
> >> >> separate patches.)
> >> > ...
> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >> >
> >> > Hi Ackerley,
> >> > Thanks for providing this branch.
> >> 
> >> Here's the WIP branch [1], which I initially wasn't intending to make
> >> super public since it's not even RFC standard yet and I didn't want to
> >> add to the many guest_memfd in-flight series, but since you referred to
> >> it, [2] is a v2 of the WIP branch :)
> >> 
> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> > Hi Ackerley,
> >
> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
> > decision is that the code base of TDX huge page v2 needs to include DPAMT
> > and VM shutdown optimization as well.
> >
> > So, we think kvm-x86/next is a good candidate for us.
> > (It is in repo https://github.com/kvm-x86/linux.git
> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
> >  which already includes code for VM shutdown optimization).
> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
> >
> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
> > from your side. A straightforward rebase is sufficient, with no need for
> > any code modification. And it's better to be completed by the end of next
> > week.
> >
> > We thought it might be easier for you to do that (but depending on your
> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
> > parallel.
> >
> 
> I'm a little tied up with some internal work, is it okay if, for the
No problem.

> next RFC, you base the changes that you need to make for TDX huge page
> v2 and DPAMT on the base of [2]?

> That will save both of us the rebasing. [2] was also based on (some
> other version of) kvm/next.
> 
> I think it's okay since the main goal is to show that it works. I'll
> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
> other patches that go into [2]).
Hmm, the upstream practice is to post code based on the latest version, and
there are lots of TDX-related fixes in the latest kvm-x86/next.


> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> 
> > However, if it's difficult for you, please feel free to let us know.
> >
> > Thanks
> > Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Ackerley Tng 4 months, 3 weeks ago
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
>> >> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> 
>> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> >> >> Hi Yan,
>> >> >> 
>> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> >> >> series [1], we took into account conversion failures too. The steps are
>> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> >> >> series from GitHub [2] because the steps for conversion changed in two
>> >> >> separate patches.)
>> >> > ...
>> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> >> >
>> >> > Hi Ackerley,
>> >> > Thanks for providing this branch.
>> >> 
>> >> Here's the WIP branch [1], which I initially wasn't intending to make
>> >> super public since it's not even RFC standard yet and I didn't want to
>> >> add to the many guest_memfd in-flight series, but since you referred to
>> >> it, [2] is a v2 of the WIP branch :)
>> >> 
>> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
>> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> > Hi Ackerley,
>> >
>> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
>> > decision is that the code base of TDX huge page v2 needs to include DPAMT
>> > and VM shutdown optimization as well.
>> >
>> > So, we think kvm-x86/next is a good candidate for us.
>> > (It is in repo https://github.com/kvm-x86/linux.git
>> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
>> >  which already includes code for VM shutdown optimization).
>> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
>> >
>> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
>> > from your side. A straightforward rebase is sufficient, with no need for
>> > any code modification. And it's better to be completed by the end of next
>> > week.
>> >
>> > We thought it might be easier for you to do that (but depending on your
>> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
>> > parallel.
>> >
>> 
>> I'm a little tied up with some internal work, is it okay if, for the
> No problem.
>
>> next RFC, you base the changes that you need to make for TDX huge page
>> v2 and DPAMT on the base of [2]?
>
>> That will save both of us the rebasing. [2] was also based on (some
>> other version of) kvm/next.
>> 
>> I think it's okay since the main goal is to show that it works. I'll
>> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
>> other patches that go into [2]).
> Hmm, the upstream practice is to post code based on latest version, and
> there're lots TDX relates fixes in latest kvm-x86/next.
>

Yup I understand.

For guest_memfd//HugeTLB I'm still waiting for guest_memfd//mmap
(managed by Fuad) to settle, and there are plenty of comments for the
guest_memfd//conversion component to iron out still, so the full update
to v3 will take longer than I think you want to wait.

I'd say for RFCs it's okay to post patch series based on some snapshot,
since there are so many series in flight?

To unblock you, if posting based on a snapshot is really not okay, here
are some other options I can think of:

a. Use [2] and posting a link to a WIP tree, similar to how [2] was
   done
b. Use some placeholder patches, assuming some interfaces to
   guest_memfd//HugeTLB, like how the first few patches in this series
   assumes some interfaces of guest_memfd with THP support, and post a
   series based on assumed interfaces

Please let me know if one of those options allow you to proceed, thanks!

>> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> 
>> > However, if it's difficult for you, please feel free to let us know.
>> >
>> > Thanks
>> > Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 4 months, 3 weeks ago
On Mon, Jul 21, 2025 at 10:33:14PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> >> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> >> 
> >> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> >> >> Hi Yan,
> >> >> >> 
> >> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> >> >> series [1], we took into account conversion failures too. The steps are
> >> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> >> >> series from GitHub [2] because the steps for conversion changed in two
> >> >> >> separate patches.)
> >> >> > ...
> >> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >> >> >
> >> >> > Hi Ackerley,
> >> >> > Thanks for providing this branch.
> >> >> 
> >> >> Here's the WIP branch [1], which I initially wasn't intending to make
> >> >> super public since it's not even RFC standard yet and I didn't want to
> >> >> add to the many guest_memfd in-flight series, but since you referred to
> >> >> it, [2] is a v2 of the WIP branch :)
> >> >> 
> >> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> >> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> >> > Hi Ackerley,
> >> >
> >> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
> >> > decision is that the code base of TDX huge page v2 needs to include DPAMT
> >> > and VM shutdown optimization as well.
> >> >
> >> > So, we think kvm-x86/next is a good candidate for us.
> >> > (It is in repo https://github.com/kvm-x86/linux.git
> >> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
> >> >  which already includes code for VM shutdown optimization).
> >> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
> >> >
> >> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
> >> > from your side. A straightforward rebase is sufficient, with no need for
> >> > any code modification. And it's better to be completed by the end of next
> >> > week.
> >> >
> >> > We thought it might be easier for you to do that (but depending on your
> >> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
> >> > parallel.
> >> >
> >> 
> >> I'm a little tied up with some internal work, is it okay if, for the
> > No problem.
> >
> >> next RFC, you base the changes that you need to make for TDX huge page
> >> v2 and DPAMT on the base of [2]?
> >
> >> That will save both of us the rebasing. [2] was also based on (some
> >> other version of) kvm/next.
> >> 
> >> I think it's okay since the main goal is to show that it works. I'll
> >> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
> >> other patches that go into [2]).
> > Hmm, the upstream practice is to post code based on latest version, and
> > there're lots TDX relates fixes in latest kvm-x86/next.
> >
> 
> Yup I understand.
> 
> For guest_memfd//HugeTLB I'm still waiting for guest_memfd//mmap
> (managed by Fuad) to settle, and there are plenty of comments for the
> guest_memfd//conversion component to iron out still, so the full update
> to v3 will take longer than I think you want to wait.
> 
> I'd say for RFCs it's okay to post patch series based on some snapshot,
> since there are so many series in flight?
> 
> To unblock you, if posting based on a snapshot is really not okay, here
> are some other options I can think of:
> 
> a. Use [2] and posting a link to a WIP tree, similar to how [2] was
>    done
> b. Use some placeholder patches, assuming some interfaces to
>    guest_memfd//HugeTLB, like how the first few patches in this series
>    assumes some interfaces of guest_memfd with THP support, and post a
>    series based on assumed interfaces
> 
> Please let me know if one of those options allow you to proceed, thanks!
Do you see any issues with directly rebasing [2] onto 6.16.0-rc6?

We currently prefer this approach. We have tested [2] for some time, and the TDX
huge page series doesn't rely on the implementation details of guest_memfd.

It's ok if you are currently occupied by Google's internal tasks. No worries.

> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> >> 
> >> > However, if it's difficult for you, please feel free to let us know.
> >> >
> >> > Thanks
> >> > Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Ackerley Tng 4 months, 3 weeks ago
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Mon, Jul 21, 2025 at 10:33:14PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
>> >> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> 
>> >> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
>> >> >> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> >> 
>> >> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> >> >> >> Hi Yan,
>> >> >> >> 
>> >> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> >> >> >> series [1], we took into account conversion failures too. The steps are
>> >> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> >> >> >> series from GitHub [2] because the steps for conversion changed in two
>> >> >> >> separate patches.)
>> >> >> > ...
>> >> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> >> >> >
>> >> >> > Hi Ackerley,
>> >> >> > Thanks for providing this branch.
>> >> >> 
>> >> >> Here's the WIP branch [1], which I initially wasn't intending to make
>> >> >> super public since it's not even RFC standard yet and I didn't want to
>> >> >> add to the many guest_memfd in-flight series, but since you referred to
>> >> >> it, [2] is a v2 of the WIP branch :)
>> >> >> 
>> >> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
>> >> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> >> > Hi Ackerley,
>> >> >
>> >> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
>> >> > decision is that the code base of TDX huge page v2 needs to include DPAMT
>> >> > and VM shutdown optimization as well.
>> >> >
>> >> > So, we think kvm-x86/next is a good candidate for us.
>> >> > (It is in repo https://github.com/kvm-x86/linux.git
>> >> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
>> >> >  which already includes code for VM shutdown optimization).
>> >> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
>> >> >
>> >> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
>> >> > from your side. A straightforward rebase is sufficient, with no need for
>> >> > any code modification. And it's better to be completed by the end of next
>> >> > week.
>> >> >
>> >> > We thought it might be easier for you to do that (but depending on your
>> >> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
>> >> > parallel.
>> >> >
>> >> 
>> >> I'm a little tied up with some internal work, is it okay if, for the
>> > No problem.
>> >
>> >> next RFC, you base the changes that you need to make for TDX huge page
>> >> v2 and DPAMT on the base of [2]?
>> >
>> >> That will save both of us the rebasing. [2] was also based on (some
>> >> other version of) kvm/next.
>> >> 
>> >> I think it's okay since the main goal is to show that it works. I'll
>> >> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
>> >> other patches that go into [2]).
>> > Hmm, the upstream practice is to post code based on latest version, and
>> > there're lots TDX relates fixes in latest kvm-x86/next.
>> >
>> 
>> Yup I understand.
>> 
>> For guest_memfd//HugeTLB I'm still waiting for guest_memfd//mmap
>> (managed by Fuad) to settle, and there are plenty of comments for the
>> guest_memfd//conversion component to iron out still, so the full update
>> to v3 will take longer than I think you want to wait.
>> 
>> I'd say for RFCs it's okay to post patch series based on some snapshot,
>> since there are so many series in flight?
>> 
>> To unblock you, if posting based on a snapshot is really not okay, here
>> are some other options I can think of:
>> 
>> a. Use [2] and posting a link to a WIP tree, similar to how [2] was
>>    done
>> b. Use some placeholder patches, assuming some interfaces to
>>    guest_memfd//HugeTLB, like how the first few patches in this series
>>    assumes some interfaces of guest_memfd with THP support, and post a
>>    series based on assumed interfaces
>> 
>> Please let me know if one of those options allow you to proceed, thanks!
> Do you see any issues with directly rebasing [2] onto 6.16.0-rc6?
>

Nope I think that should be fine. Thanks for checking!

> We currently prefer this approach. We have tested [2] for some time, and TDX
> huge page series doesn't rely on the implementation details of guest_memfd.
>
> It's ok if you are currently occupied by Google's internal tasks. No worries.
>
>> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> >> 
>> >> > However, if it's difficult for you, please feel free to let us know.
>> >> >
>> >> > Thanks
>> >> > Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 5 months, 4 weeks ago
On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Hi Yan,
> >> 
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> > ...
> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Hi Ackerley,
> > Thanks for providing this branch.
> 
> Here's the WIP branch [1], which I initially wasn't intending to make
> super public since it's not even RFC standard yet and I didn't want to
> add to the many guest_memfd in-flight series, but since you referred to
> it, [2] is a v2 of the WIP branch :)
> 
> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
Thanks. [2] works. TDX huge pages has now been successfully rebased on top of [2].


> This WIP branch has selftests that test 1G aka HugeTLB page support with
> TDX huge page EPT mappings [7]:
> 
> 1. "KVM: selftests: TDX: Test conversion to private at different
>    sizes". This uses the fact that TDX module will return error if the
>    page is faulted into the guest at a different level from the accept
>    level to check the level that the page was faulted in.
> 2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
>    private_mem_conversions_test for use with TDs. This test does
>    multi-vCPU conversions and we use this to check for issues to do with
>    conversion races.
> 3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
>    private and shared memory". Adds a selftest similar to/on top of
>    guest_memfd_conversions_test that does conversions via MapGPA.
> 
> Full list of selftests I usually run from tools/testing/selftests/kvm:
> + ./guest_memfd_test
> + ./guest_memfd_conversions_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
> + ./x86/private_mem_conversions_test.sh
> + ./set_memory_region_test
> + ./x86/private_mem_kvm_exits_test
> + ./x86/tdx_vm_test
> + ./x86/tdx_upm_test
> + ./x86/tdx_shared_mem_test
> + ./x86/tdx_gmem_private_and_shared_test
> 
> As an overview for anyone who might be interested in this WIP branch:
> 
> 1.  I started with upstream's kvm/next
> 2.  Applied TDX selftests series [3]
> 3.  Applied guest_memfd mmap series [4]
> 4.  Applied conversions (sub)series and HugeTLB (sub)series [5]
> 5.  Added some fixes for 2 of the earlier series (as labeled in commit
>     message)
> 6.  Updated guest_memfd conversions selftests to work with TDX
> 7.  Applied 2M EPT series [6] with some hacks
> 8.  Some patches to make guest_memfd mmap return huge-page-aligned
>     userspace address
> 9.  Selftests for guest_memfd conversion with TDX 2M EPT
> 
> [3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> [4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> [5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
> [6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
> [7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
Thanks.
We noticed that it's not easy for TDX initial memory regions to use the in-place
conversion version of guest_memfd, because
- tdh_mem_page_add() requires simultaneous access to the shared source memory and
  the private target memory.
- shared-to-private in-place conversion first unmaps the shared memory and checks
  whether any extra folio refcount is held before the conversion is allowed.

Therefore, though tdh_mem_page_add() actually supports in-place add (see [8]),
we can't store the initial content in the mmap-ed VA of the in-place conversion
version of guest_memfd.

So, I modified QEMU to work around this issue by adding an extra anonymous
backend to hold the source pages in shared memory, with the target private PFN
allocated from guest_memfd with GUEST_MEMFD_FLAG_SUPPORT_SHARED set.

The goal is to test whether kvm_gmem_populate() works for TDX huge pages.
This testing exposed a bug in kvm_gmem_populate(), which has been fixed in the
following patch.

commit 5f33ed7ca26f00a61c611d2d1fbc001a7ecd8dca
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Mon Jun 9 03:01:21 2025 -0700

    Bug fix: Reduce max_order when GFN is not aligned

    Fix the warning hit in kvm_gmem_populate().

    "WARNING: CPU: 7 PID: 4421 at arch/x86/kvm/../../../virt/kvm/guest_memfd.c:
    2496 kvm_gmem_populate+0x4a4/0x5b0"

    The GFN passed to kvm_gmem_populate() may have an offset so it may not be
    aligned to folio order. In this case, reduce the max_order to decrease the
    mapping level.

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 4b8047020f17..af7943c0a8ba 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -2493,7 +2493,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
                }

                folio_unlock(folio);
-               WARN_ON(!IS_ALIGNED(gfn, 1 << max_order));
+               while (!IS_ALIGNED(gfn, 1 << max_order))
+                       max_order--;

                npages_to_populate = min(npages - i, 1 << max_order);
                npages_to_populate = private_npages_to_populate(
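
To make the effect of this fix concrete, here is a small worked example
(illustrative only, not part of the patch): a GFN that is not aligned to the
folio order forces a smaller mapping level.

#include <linux/align.h>

/*
 * E.g. gfn = 0x1001 with a 2MB folio (max_order = 9, i.e. 512 x 4KB pages):
 * 0x1001 is not even 2-page aligned, so the loop reduces max_order all the
 * way down to 0 and the range is mapped at 4KB granularity.
 */
static int reduce_order_for_alignment(unsigned long gfn, int max_order)
{
	while (!IS_ALIGNED(gfn, 1UL << max_order))
		max_order--;
	return max_order;
}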



[8] https://cdrdv2-public.intel.com/839195/intel-tdx-module-1.5-abi-spec-348551002.pdf
"In-Place Add: It is allowed to set the TD page HPA in R8 to the same address as
the source page HPA in R9. In this case the source page is converted to be a TD
private page".
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 5 months, 4 weeks ago
On Thu, Jun 19, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> >
> > > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> > >> Hi Yan,
> > >>
> > >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> > >> series [1], we took into account conversion failures too. The steps are
> > >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> > >> series from GitHub [2] because the steps for conversion changed in two
> > >> separate patches.)
> > > ...
> > >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > >
> > > Hi Ackerley,
> > > Thanks for providing this branch.
> >
> > Here's the WIP branch [1], which I initially wasn't intending to make
> > super public since it's not even RFC standard yet and I didn't want to
> > add to the many guest_memfd in-flight series, but since you referred to
> > it, [2] is a v2 of the WIP branch :)
> >
> > [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> > [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> Thanks. [2] works. TDX huge pages now has successfully been rebased on top of [2].
>
>
> > This WIP branch has selftests that test 1G aka HugeTLB page support with
> > TDX huge page EPT mappings [7]:
> >
> > 1. "KVM: selftests: TDX: Test conversion to private at different
> >    sizes". This uses the fact that TDX module will return error if the
> >    page is faulted into the guest at a different level from the accept
> >    level to check the level that the page was faulted in.
> > 2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
> >    private_mem_conversions_test for use with TDs. This test does
> >    multi-vCPU conversions and we use this to check for issues to do with
> >    conversion races.
> > 3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
> >    private and shared memory". Adds a selftest similar to/on top of
> >    guest_memfd_conversions_test that does conversions via MapGPA.
> >
> > Full list of selftests I usually run from tools/testing/selftests/kvm:
> > + ./guest_memfd_test
> > + ./guest_memfd_conversions_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
> > + ./x86/private_mem_conversions_test.sh
> > + ./set_memory_region_test
> > + ./x86/private_mem_kvm_exits_test
> > + ./x86/tdx_vm_test
> > + ./x86/tdx_upm_test
> > + ./x86/tdx_shared_mem_test
> > + ./x86/tdx_gmem_private_and_shared_test
> >
> > As an overview for anyone who might be interested in this WIP branch:
> >
> > 1.  I started with upstream's kvm/next
> > 2.  Applied TDX selftests series [3]
> > 3.  Applied guest_memfd mmap series [4]
> > 4.  Applied conversions (sub)series and HugeTLB (sub)series [5]
> > 5.  Added some fixes for 2 of the earlier series (as labeled in commit
> >     message)
> > 6.  Updated guest_memfd conversions selftests to work with TDX
> > 7.  Applied 2M EPT series [6] with some hacks
> > 8.  Some patches to make guest_memfd mmap return huge-page-aligned
> >     userspace address
> > 9.  Selftests for guest_memfd conversion with TDX 2M EPT
> >
> > [3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> > [4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> > [5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
> > [6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
> > [7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
> Thanks.
> We noticed that it's not easy for TDX initial memory regions to use in-place
> conversion version of guest_memfd, because
> - tdh_mem_page_add() requires simultaneous access to shared source memory and
>   private target memory.
> - shared-to-private in-place conversion first unmaps the shared memory and tests
>   if any extra folio refcount is held before the conversion is allowed.
>
> Therefore, though tdh_mem_page_add() actually supports in-place add, see [8],
> we can't store the initial content in the mmap-ed VA of the in-place conversion
> version of guest_memfd.
>
> So, I modified QEMU to workaround this issue by adding an extra anonymous
> backend to hold source pages in shared memory, with the target private PFN
> allocated from guest_memfd with GUEST_MEMFD_FLAG_SUPPORT_SHARED set.

Yeah, this scheme of using different memory backing for initial
payload makes sense to me.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months, 1 week ago
On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> > ...
> >> > >
> >> > > I might be wrongly throwing out some terminologies here then.
> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> >> > > udmabuf seems to be working with pinned "folios" in the backend.
> >> > >
> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> >> > OK. So from point of the reset part of kernel, those pfns are not regarded as
> >> > memory.
> >> >
> >> > > shareability changes, the users will get notified and will have to
> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> >> > > special handling/lack of page structs.
> >> > My concern is a failable invalidation notifer may not be ideal.
> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> >> > start shareabilitiy changes, with a failable invalidation notifier, some users
> >> > may fail the invalidation and the shareability change, even after other users
> >> > have successfully unmapped a range.
> >>
> >> Even if one user fails to invalidate its mappings, I don't see a
> >> reason to go ahead with shareability change. Shareability should not
> >> change unless all existing users let go of their soon-to-be-invalid
> >> view of memory.
> 
> Hi Yan,
> 
> While working on the 1G (aka HugeTLB) page support for guest_memfd
> series [1], we took into account conversion failures too. The steps are
> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> series from GitHub [2] because the steps for conversion changed in two
> separate patches.)
> 
> We do need to handle errors across ranges to be converted, possibly from
> different memslots. The goal is to either have the entire conversion
> happen (including page split/merge) or nothing at all when the ioctl
> returns.
> 
> We try to undo the restructuring (whether split or merge) and undo any
> shareability changes on error (barring ENOMEM, in which case we leave a
> WARNing).
As the undo can itself fail (as in the case where you leave a WARNing, in patch 38
in [1]), it can lead to WARNings in the kernel with folios not properly added back
to the filemap.

> The part we don't restore is the presence of the pages in the host or
> guest page tables. For that, our idea is that if unmapped, the next
> access will just map it in, so there's no issue there.

I don't think so.

As in patch 38 in [1], on failure, it may fail to
- restore the shareability
- restore the folio's filemap status
- restore the folio's hugetlb stash metadata
- restore the folio's merged/split status

Also, the host page table is not restored.


> > My thinking is that:
> >
> > 1. guest_memfd starts shared-to-private conversion
> > 2. guest_memfd sends invalidation notifications
> >    2.1 invalidate notification --> A --> Unmap and return success
> >    2.2 invalidate notification --> B --> Unmap and return success
> >    2.3 invalidate notification --> C --> return failure
> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
> >    shareability as shared
> >
> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> > 2.2. Even if additional notifications could be sent to A and B to ask for
> > mapping the GFN back, the map operation might fail. Consequently, A and B might
> > not be able to restore the mapped status of the GFN.
> 
> For conversion we don't attempt to restore mappings anywhere (whether in
> guest or host page tables). What do you think of not restoring the
> mappings?
It could cause problems if the mappings in the S-EPT can't be restored.

For TDX private-to-shared conversion, if kvm_gmem_convert_should_proceed() -->
kvm_gmem_unmap_private() --> kvm_mmu_unmap_gfn_range() fails in the end, then
the GFN shareability is restored to private. The next guest access to the
partially unmapped private memory can then hit a fatal error: "access before
acceptance".

It could occur in such a scenario:
1. TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared
2. Conversion fails in KVM.
3. set_memory_decrypted() fails in TD.
4. TD thinks the GFN is still accepted as private and accesses it.


> > For IOMMU mappings, this
> > could result in DMAR failure following a failed attempt to do shared-to-private
> > conversion.
> 
> I believe the current conversion setup guards against this because after
> unmapping from the host, we check for any unexpected refcounts.
Right, it's fine if we check for any unexpected refcounts.


> (This unmapping is not the unmapping we're concerned about, since this is
> shared memory, and unmapping doesn't go through TDX.)
> 
> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
> are "unexpected". The conversion ioctl will return to userspace with an
> error.
> 
> IO can continue to happen, since the memory is still mapped in the
> IOMMU. The memory state is still shared. No issue there.
> 
> In RFCv2 [1], we expect userspace to see the error, then try and remove
> the memory from the IOMMU, and then try conversion again.
I don't think it's right to depend on userspace always behaving in the way the
kernel expects, i.e. retrying the conversion until it succeeds.

We need to restore the previous status (which includes the host page tables)
if the conversion can't be done.
That said, in my view, a better flow would be the following (a rough sketch in
code follows the list):

1. guest_memfd sends a pre-invalidation request to users (users here means the
   consumers in kernel of memory allocated from guest_memfd).

2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
   proceed. For example, in the case of TDX, this might involve memory
   allocation and page splitting.

3. Based on the pre-check results, guest_memfd either aborts the invalidation or
   proceeds by sending the actual invalidation request.

4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
   TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
   In such cases, TDX can call back into guest_memfd to report the poisoned
   status of the page, or elevate the page reference count.

5. guest_memfd completes the invalidation process. If the memory is marked as
   "poison," guest_memfd can handle it accordingly. If the page has an elevated
   reference count, guest_memfd may not need to take special action, as the
   elevated count prevents the OS from reallocating the page.
   (though from your reply below, it seems a callback to guest_memfd is the
   better approach).
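
A rough pseudocode sketch of this flow (all names are hypothetical and only for
illustration; nothing here is from a posted series):

#include <linux/types.h>

struct gmem_invalidate_user {
	/* Step 2: may fail; do memory allocation / S-EPT page splitting here. */
	int (*pre_invalidate)(void *priv, pgoff_t start, pgoff_t end);
	/* Step 4: must not fail; may report poisoned pages back to guest_memfd. */
	void (*invalidate)(void *priv, pgoff_t start, pgoff_t end);
	void *priv;
};

static int gmem_invalidate_range(struct gmem_invalidate_user *users, int nr,
				 pgoff_t start, pgoff_t end)
{
	int i, ret;

	/* Steps 1-2: pre-invalidation checks; nothing is torn down yet. */
	for (i = 0; i < nr; i++) {
		ret = users[i].pre_invalidate(users[i].priv, start, end);
		if (ret)
			return ret;	/* Step 3: abort, state untouched */
	}

	/* Step 4: the real unmap, which is expected never to fail. */
	for (i = 0; i < nr; i++)
		users[i].invalidate(users[i].priv, start, end);

	/* Step 5: guest_memfd completes invalidation / handles poisoned pages. */
	return 0;
}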


> The part in concern here is unmapping failures of private pages, for
> private-to-shared conversions, since that part goes through TDX and
> might fail.
IMO, even for TDX, the real unmap must not fail unless there are bugs in the KVM
or TDX module.
So, for page splitting in S-EPT, I prefer to try splitting in the
pre-invalidation phase before conducting any real unmap.


> One other thing about taking refcounts is that in RFCv2,
> private-to-shared conversions assume that there are no refcounts on the
> private pages at all. (See filemap_remove_folio_for_restructuring() in
> [3])
>
> Haven't had a chance to think about all the edge cases, but for now I
> think on unmapping failure, in addition to taking a refcount, we should
> return an error at least up to guest_memfd, so that guest_memfd could
> perhaps keep the refcount on that page, but drop the page from the
> filemap. Another option could be to track messed up addresses and always
> check that on conversion or something - not sure yet.

It looks good to me. See bullet 4 in my proposed flow above.

> Either way, guest_memfd must know. If guest_memfd is not informed, on a
> next conversion request, the conversion will just spin in
> filemap_remove_folio_for_restructuring().
It makes sense.


> What do you think of this part about informing guest_memfd of the
> failure to unmap?
So, do you want to add a guest_memfd callback to achieve this purpose?


BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range()
and mmu_notifier_invalidate_range_start() fail, based on the repo
https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
'hardening-v6.16-rc1-fix1-take2' of
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux")

1. Status of mmu notifier
-------------------------------
(1) There're 34 direct callers of mmu_notifier_invalidate_range_start().
    1. clear_refs_write
    2. do_pagemap_scan
    3. uprobe_write_opcode
    4. do_huge_zero_wp_pmd
    5. __split_huge_pmd (N)
    6. __split_huge_pud (N)
    7. move_pages_huge_pmd
    8. copy_hugetlb_page_range
    9. hugetlb_unshare_pmds  (N)
    10. hugetlb_change_protection
    11. hugetlb_wp
    12. unmap_hugepage_range (N)
    13. move_hugetlb_page_tables
    14. collapse_huge_page
    15. retract_page_tables
    16. collapse_pte_mapped_thp
    17. write_protect_page
    18. replace_page
    19. madvise_free_single_vma
    20. wp_clean_pre_vma
    21. wp_page_copy 
    22. zap_page_range_single_batched (N)
    23. unmap_vmas (N)
    24. copy_page_range 
    25. remove_device_exclusive_entry
    26. migrate_vma_collect
    27. __migrate_device_pages
    28. change_pud_range 
    29. move_page_tables
    30. page_vma_mkclean_one
    31. try_to_unmap_one
    32. try_to_migrate_one
    33. make_device_exclusive
    34. move_pages_pte

Of these 34 direct callers, those marked with (N) cannot tolerate
mmu_notifier_invalidate_range_start() failing. I have not yet investigated all
34 direct callers one by one, so the list of (N) is incomplete.

For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
"Note that split_huge_pmd() doesn't have any limitations on refcounting:
pmd can be split at any point and never fails." This is because split_huge_pmd()
serves as a graceful fallback design for code walking pagetables but unaware
about huge pmds.


(2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
__oom_reap_task_mm(), which only expects the error -EAGAIN.

In mn_hlist_invalidate_range_start():
"WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"


(3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In that
case, they don't need to register mmu notifier.

Or, device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages.

See Documentation/core-api/pin_user_pages.rst.
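
For reference, a minimal sketch of the second model above (pinning plus an mmu
notifier). The my_dev driver and my_dev_stop_dma() are made up, while
pin_user_pages_unlocked(), unpin_user_pages() and mmu_notifier_register() are
the in-tree APIs. The sketch is simplified: it drops all pins on any
invalidation instead of only for the overlapping range.

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_dev {
        struct mmu_notifier mn;
        struct page **pages;
        unsigned long npages;
};

static void my_dev_stop_dma(struct my_dev *dev);        /* made-up helper */

static int my_dev_invalidate(struct mmu_notifier *mn,
                             const struct mmu_notifier_range *range)
{
        struct my_dev *dev = container_of(mn, struct my_dev, mn);

        my_dev_stop_dma(dev);                           /* quiesce the device */
        unpin_user_pages(dev->pages, dev->npages);      /* then drop the pins */
        return 0;
}

static const struct mmu_notifier_ops my_dev_mn_ops = {
        .invalidate_range_start = my_dev_invalidate,
};

static int my_dev_map_user_buf(struct my_dev *dev, unsigned long start)
{
        long pinned;

        pinned = pin_user_pages_unlocked(start, dev->npages, dev->pages,
                                         FOLL_WRITE);
        if (pinned < 0)
                return pinned;

        dev->mn.ops = &my_dev_mn_ops;
        return mmu_notifier_register(&dev->mn, current->mm);
}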


2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
-------------------------------
(1) Error fallback cases.

    1. split_huge_pmd() as mentioned in Documentation/mm/transhuge.rst.
       split_huge_pmd() is designed as a graceful fallback without failure.

       split_huge_pmd
        |->__split_huge_pmd
           |->mmu_notifier_range_init
           |  mmu_notifier_invalidate_range_start
           |  split_huge_pmd_locked
           |  mmu_notifier_invalidate_range_end


    2. in fs/iomap/buffered-io.c, iomap_write_failed() itself is error handling.
       iomap_write_failed
         |->truncate_pagecache_range
            |->unmap_mapping_range
            |  |->unmap_mapping_pages
            |     |->unmap_mapping_range_tree
            |        |->unmap_mapping_range_vma
            |           |->zap_page_range_single
            |              |->zap_page_range_single_batched
            |                       |->mmu_notifier_range_init
            |                       |  mmu_notifier_invalidate_range_start
            |                       |  unmap_single_vma
            |                       |  mmu_notifier_invalidate_range_end
            |->truncate_inode_pages_range
               |->truncate_cleanup_folio
                  |->if (folio_mapped(folio))
                  |     unmap_mapping_folio(folio);
                         |->unmap_mapping_range_tree
                            |->unmap_mapping_range_vma
                               |->zap_page_range_single
                                  |->zap_page_range_single_batched
                                     |->mmu_notifier_range_init
                                     |  mmu_notifier_invalidate_range_start
                                     |  unmap_single_vma
                                     |  mmu_notifier_invalidate_range_end

   3. in mm/memory.c, zap_page_range_single() is invoked to handle error.
      remap_pfn_range_notrack
        |->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
        |  if (!error)
        |      return 0;
	|  zap_page_range_single
           |->zap_page_range_single_batched
              |->mmu_notifier_range_init
              |  mmu_notifier_invalidate_range_start
              |  unmap_single_vma
              |  mmu_notifier_invalidate_range_end

   4. in kernel/events/core.c, zap_page_range_single() is invoked to clear any
      partial mappings on error.

      perf_mmap
        |->ret = map_range(rb, vma);
                 |  err = remap_pfn_range
                 |->if (err) 
                 |     zap_page_range_single
                        |->zap_page_range_single_batched
                           |->mmu_notifier_range_init
                           |  mmu_notifier_invalidate_range_start
                           |  unmap_single_vma
                           |  mmu_notifier_invalidate_range_end


   5. in mm/memory.c, unmap_mapping_folio() is invoked to unmap a poisoned page.

      __do_fault
	|->if (unlikely(PageHWPoison(vmf->page))) { 
	|	vm_fault_t poisonret = VM_FAULT_HWPOISON;
	|	if (ret & VM_FAULT_LOCKED) {
	|		if (page_mapped(vmf->page))
	|			unmap_mapping_folio(folio);
        |                       |->unmap_mapping_range_tree
        |                          |->unmap_mapping_range_vma
        |                             |->zap_page_range_single
        |                                |->zap_page_range_single_batched
        |                                   |->mmu_notifier_range_init
        |                                   |  mmu_notifier_invalidate_range_start
        |                                   |  unmap_single_vma
        |                                   |  mmu_notifier_invalidate_range_end
	|		if (mapping_evict_folio(folio->mapping, folio))
	|			poisonret = VM_FAULT_NOPAGE; 
	|		folio_unlock(folio);
	|	}
	|	folio_put(folio);
	|	vmf->page = NULL;
	|	return poisonret;
	|  }


  6. in mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
     partial mapping done by a device driver.

     __mmap_new_vma
       |->__mmap_new_file_vma(map, vma);
          |->error = mmap_file(vma->vm_file, vma);
          |  if (error)
          |     unmap_region
                 |->unmap_vmas
                    |->mmu_notifier_range_init
                    |  mmu_notifier_invalidate_range_start
                    |  unmap_single_vma
                    |  mmu_notifier_invalidate_range_end


(2) No-fail cases
-------------------------------
1. iput() cannot fail. 

iput
 |->iput_final
    |->WRITE_ONCE(inode->i_state, state | I_FREEING);
    |  inode_lru_list_del(inode);
    |  evict(inode);
       |->op->evict_inode(inode);
          |->shmem_evict_inode
             |->shmem_truncate_range
                |->truncate_inode_pages_range
                   |->truncate_cleanup_folio
                      |->if (folio_mapped(folio))
                      |     unmap_mapping_folio(folio);
                            |->unmap_mapping_range_tree
                               |->unmap_mapping_range_vma
                                  |->zap_page_range_single
                                     |->zap_page_range_single_batched
                                        |->mmu_notifier_range_init
                                        |  mmu_notifier_invalidate_range_start
                                        |  unmap_single_vma
                                        |  mmu_notifier_invalidate_range_end


2. exit_mmap() cannot fail

exit_mmap
  |->mmu_notifier_release(mm);
     |->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
        |->mmu_notifier_range_init
        |  mmu_notifier_invalidate_range_start
        |  unmap_single_vma
        |  mmu_notifier_invalidate_range_end


3. KVM Cases That Cannot Tolerate Unmap Failure
-------------------------------
Allowing unmap operations to fail in the following scenarios would make it very
difficult or even impossible to handle the failure:

(1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow page
without expecting any failure.

mmu_alloc_direct_roots
  |->mmu_alloc_root
     |->kvm_mmu_get_shadow_page
        |->__kvm_mmu_get_shadow_page
           |->kvm_mmu_alloc_shadow_page
              |->account_shadowed
                 |->kvm_mmu_slot_gfn_write_protect
                    |->kvm_tdp_mmu_write_protect_gfn
                       |->write_protect_gfn
                          |->tdp_mmu_iter_set_spte


(2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail

kvm_vfio_release/kvm_vfio_file_del
 |->kvm_vfio_update_coherency
    |->kvm_arch_unregister_noncoherent_dma
       |->kvm_noncoherent_dma_assignment_start_or_stop
          |->kvm_zap_gfn_range
             |->kvm_tdp_mmu_zap_leafs
                |->tdp_mmu_zap_leafs
                   |->tdp_mmu_iter_set_spte


(3) There are lots of callers of __kvm_set_or_clear_apicv_inhibit() that
currently never expect unmap to fail.

__kvm_set_or_clear_apicv_inhibit
  |->kvm_zap_gfn_range
     |->kvm_tdp_mmu_zap_leafs
        |->tdp_mmu_zap_leafs
           |->tdp_mmu_iter_set_spte



4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (which updates an
SPTE with the write mmu_lock held) able to fail.

(1) kvm_vcpu_flush_tlb_guest()

kvm_vcpu_flush_tlb_guest
  |->kvm_mmu_sync_roots
     |->mmu_sync_children
        |->kvm_vcpu_write_protect_gfn
           |->kvm_mmu_slot_gfn_write_protect
              |->kvm_tdp_mmu_write_protect_gfn
                 |->write_protect_gfn
                    |->tdp_mmu_iter_set_spte
                       |->tdp_mmu_set_spte


(2) handle_removed_pt() and handle_changed_spte().


Thanks
Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months, 1 week ago
On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> We need to restore to the previous status (which includes the host page table)
> if conversion can't be done.
> That said, in my view, a better flow would be:
>
> 1. guest_memfd sends a pre-invalidation request to users (users here means the
>    consumers in kernel of memory allocated from guest_memfd).
>
> 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
>    proceed. For example, in the case of TDX, this might involve memory
>    allocation and page splitting.
>
> 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
>    proceeds by sending the actual invalidation request.
>
> 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
>    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
>    In such cases, TDX can callback guest_memfd to inform the poison-status of
>    the page or elevate the page reference count.

Few questions here:
1) It sounds like the failure to remove entries from SEPT could only
be due to bugs in the KVM/TDX module, how reliable would it be to
continue executing TDX VMs on the host once such bugs are hit?
2) Is it reliable to continue executing the host kernel and other
normal VMs once such bugs are hit?
3) Can the memory be reclaimed reliably if the VM is marked as dead
and cleaned up right away?

>
> 5. guest_memfd completes the invalidation process. If the memory is marked as
>    "poison," guest_memfd can handle it accordingly. If the page has an elevated
>    reference count, guest_memfd may not need to take special action, as the
>    elevated count prevents the OS from reallocating the page.
>    (but from your reply below, seems a callback to guest_memfd is a better
>    approach).
>
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > We need to restore to the previous status (which includes the host page table)
> > if conversion can't be done.
> > That said, in my view, a better flow would be:
> >
> > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> >    consumers in kernel of memory allocated from guest_memfd).
> >
> > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> >    proceed. For example, in the case of TDX, this might involve memory
> >    allocation and page splitting.
> >
> > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> >    proceeds by sending the actual invalidation request.
> >
> > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> >    the page or elevate the page reference count.
> 
> Few questions here:
> 1) It sounds like the failure to remove entries from SEPT could only
> be due to bugs in the KVM/TDX module,
Yes.

> how reliable would it be to
> continue executing TDX VMs on the host once such bugs are hit?
The TDX VMs will be killed. However, the private pages are still mapped in the
SEPT (after the unmapping failure).
The teardown flow for TDX VM is:

do_exit
  |->exit_files
     |->kvm_gmem_release ==> (1) Unmap guest pages 
     |->release kvmfd
        |->kvm_destroy_vm  (2) Reclaiming resources
           |->kvm_arch_pre_destroy_vm  ==> Release hkid
           |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages

Without holding a page reference after (1) fails, the guest pages may be
re-assigned by the host OS while they are still tracked in the TDX module.


> 2) Is it reliable to continue executing the host kernel and other
> normal VMs once such bugs are hit?
If TDX holds the page ref count, the impact of an unmapping failure of guest
pages is just that those pages are leaked.

> 3) Can the memory be reclaimed reliably if the VM is marked as dead
> and cleaned up right away?
As in the above flow, TDX needs to hold the page reference on unmapping failure
until after reclaiming succeeds. Well, reclaiming itself may fail as well.

So, below is my proposal, shown as simple POC code based on
https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.

Patch 1: TDX increases page ref count on unmap failure.
Patch 2: Bail out private-to-shared conversion if splitting fails.
Patch 3: Make kvm_gmem_zap() return void.

After the change,
- the actual private-to-shared conversion will not be executed on splitting
  failure (which could be due to running out of memory or bugs in the KVM/TDX
  module) or on unmapping failure (which is due to bugs in the KVM/TDX module).
- other callers of kvm_gmem_zap(), such as kvm_gmem_release(),
  kvm_gmem_error_folio(), kvm_gmem_punch_hole(), are still allowed to proceed.
  After truncating the pages out of the filemap, the pages can be leaked on
  purpose with the reference held by TDX.


commit 50432c0bb1e10591714b6b880f43fc30797ca047
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Tue Jun 10 00:02:30 2025 -0700

    KVM: TDX: Hold folio ref count on fatal error

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c60c1fa7b4ee..93c31eecfc60 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1502,6 +1502,15 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
        td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }

+/*
+ * Called when fatal error occurs during removing a TD's page.
+ * Increase the folio ref count in case it's reused by other VMs or host.
+ */
+static void tdx_hold_page_on_error(struct kvm *kvm, struct page *page, int level)
+{
+       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
                            enum pg_level level, struct page *page)
 {
@@ -1868,12 +1877,14 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
         * before any might be populated. Warn if zapping is attempted when
         * there can't be anything populated in the private EPT.
         */
-       if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-               return -EINVAL;
+       if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm)) {
+               ret = -EINVAL;
+               goto fatal_error;
+       }

        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
        if (ret <= 0)
-               return ret;
+               goto fatal_error;

        /*
         * TDX requires TLB tracking before dropping private page.  Do
@@ -1881,7 +1892,14 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
         */
        tdx_track(kvm);

-       return tdx_sept_drop_private_spte(kvm, gfn, level, page);
+       ret = tdx_sept_drop_private_spte(kvm, gfn, level, page);
+       if (ret)
+               goto fatal_error;
+       return ret;
+fatal_error:
+       if (ret < 0)
+               tdx_hold_page_on_error(kvm, page, level);
+       return ret;
 }

 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,


commit 240acb13d4bd724b4c153b73cfba3cd14d3cc296
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Tue Jun 10 19:26:56 2025 -0700

    KVM: guest_memfd: Add check kvm_gmem_private_has_safe_refcount()

    Check for extra ref counts on private pages (left behind by a TDX unmap
    failure) before a private-to-shared conversion in the backend.

    In other zap cases, it's OK to skip this ref count check so that an
    error folio remains held by TDX after guest_memfd releases the folio.

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index af7943c0a8ba..1e1312bfa157 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -521,6 +521,41 @@ static void kvm_gmem_convert_invalidate_end(struct inode *inode,
                kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
 }

+static bool kvm_gmem_private_has_safe_refcount(struct inode *inode,
+                                              pgoff_t start, pgoff_t end)
+{
+       pgoff_t index = start;
+       size_t inode_nr_pages;
+       bool ret = true;
+       void *priv;
+
+       /*
+        * Conversion in !kvm_gmem_has_custom_allocator() case does not reach here.
+        */
+       if (!kvm_gmem_has_custom_allocator(inode))
+               return ret;
+
+       priv = kvm_gmem_allocator_private(inode);
+       inode_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+       while (index < end) {
+               struct folio *f;
+               f = filemap_get_folio(inode->i_mapping, index);
+               if (IS_ERR(f)) {
+                       index += inode_nr_pages;
+                       continue;
+               }
+
+               folio_put(f);
+               if (folio_ref_count(f) > folio_nr_pages(f)) {
+                       ret = false;
+                       break;
+               }
+               index += folio_nr_pages(f);
+       }
+       return ret;
+}
+
 static int kvm_gmem_convert_should_proceed(struct inode *inode,
                                           struct conversion_work *work,
                                           bool to_shared, pgoff_t *error_index)
@@ -538,6 +573,10 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
                list_for_each_entry(gmem, gmem_list, entry) {
                        ret = kvm_gmem_zap(gmem, work->start, work_end,
                                           KVM_FILTER_PRIVATE, true);
+                       if (ret)
+                               return ret;
+                       if (!kvm_gmem_private_has_safe_refcount(inode, work->start, work_end))
+                               return -EFAULT;
                }
        } else {
                unmap_mapping_pages(inode->i_mapping, work->start,


commit 26743993663313fa6f8741a43f22ed5ac21399c7
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Tue Jun 10 20:01:23 2025 -0700

    KVM: guest_memfd: Move splitting KVM mappings out of kvm_gmem_zap()

    Modify kvm_gmem_zap() to return void and introduce a separate function,
    kvm_gmem_split_private(), to handle the splitting of private EPT.

    With these changes, kvm_gmem_zap() will only be executed after successful
    splitting across the entire conversion/punch range.

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 1e1312bfa157..e81efcef0837 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -318,8 +318,7 @@ static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t st
        return refcount_safe;
 }

-static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
-                       enum kvm_gfn_range_filter filter, bool do_split)
+static int kvm_gmem_split_private(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end)
 {
        struct kvm_memory_slot *slot;
        struct kvm *kvm = gmem->kvm;
@@ -336,7 +335,7 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
                        .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
                        .slot = slot,
                        .may_block = true,
-                       .attr_filter = filter,
+                       .attr_filter = KVM_FILTER_PRIVATE,
                };

                if (!locked) {
@@ -344,16 +343,13 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
                        locked = true;
                }

-               if (do_split) {
-                       ret = kvm_split_boundary_leafs(kvm, &gfn_range);
-                       if (ret < 0)
-                               goto out;
+               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
+               if (ret < 0)
+                       goto out;

-                       flush |= ret;
-                       ret = 0;
-               }
+               flush |= ret;
+               ret = 0;

-               flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
        }
 out:
        if (flush)
@@ -365,6 +361,42 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
        return ret;
 }

+static void kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
+                       enum kvm_gfn_range_filter filter)
+{
+       struct kvm_memory_slot *slot;
+       struct kvm *kvm = gmem->kvm;
+       unsigned long index;
+       bool locked = false;
+       bool flush = false;
+
+       xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+               pgoff_t pgoff = slot->gmem.pgoff;
+               struct kvm_gfn_range gfn_range = {
+                       .start = slot->base_gfn + max(pgoff, start) - pgoff,
+                       .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+                       .slot = slot,
+                       .may_block = true,
+                       .attr_filter = filter,
+               };
+
+               if (!locked) {
+                       KVM_MMU_LOCK(kvm);
+                       locked = true;
+               }
+
+               flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+       }
+
+       if (flush)
+               kvm_flush_remote_tlbs(kvm);
+
+       if (locked)
+               KVM_MMU_UNLOCK(kvm);
+
+       return;
+}
+
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
                                      pgoff_t end)
 {
@@ -571,10 +603,10 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,

                gmem_list = &inode->i_mapping->i_private_list;
                list_for_each_entry(gmem, gmem_list, entry) {
-                       ret = kvm_gmem_zap(gmem, work->start, work_end,
-                                          KVM_FILTER_PRIVATE, true);
+                       ret = kvm_gmem_split_private(gmem, work->start, work_end);
                        if (ret)
                                return ret;
+                       kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
                        if (!kvm_gmem_private_has_safe_refcount(inode, work->start, work_end))
                                return -EFAULT;
                }
@@ -1471,9 +1503,10 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
                 * expensive.
                 */
                filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
-               ret = kvm_gmem_zap(gmem, start, end, filter, true);
+               ret = kvm_gmem_split_private(gmem, start, end);
                if (ret)
                        goto out;
+               kvm_gmem_zap(gmem, start, end, filter);
        }

        if (kvm_gmem_has_custom_allocator(inode)) {
@@ -1606,7 +1639,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
         * memory, as its lifetime is associated with the inode, not the file.
         */
        kvm_gmem_invalidate_begin(gmem, 0, -1ul);
-       kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED, false);
+       kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);
        kvm_gmem_invalidate_end(gmem, 0, -1ul);

        list_del(&gmem->entry);
@@ -1942,7 +1975,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol

                kvm_gmem_invalidate_begin(gmem, start, end);
                filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
-               kvm_gmem_zap(gmem, start, end, filter, false);
+               kvm_gmem_zap(gmem, start, end, filter);
        }

        /*


If the above changes are agreeable, we could consider a more ambitious approach:
introducing an interface like:

int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);

This would allow guest_memfd to maintain an internal reference count for each
private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping and
guest_memfd_dec_page_ref_count() after a successful unmapping. Before truncating
a private page from the filemap, guest_memfd could increase the real folio
reference count based on its internal reference count for the private GFN.
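
Usage could then look roughly like the sketch below. The two
guest_memfd_*_page_ref_count() helpers are the hypothetical interface above,
and gmem_internal_refcount() is a made-up accessor for the per-GFN count that
guest_memfd would keep internally:

        /* TDX side: when mapping a private GFN at 'level' into the S-EPT */
        guest_memfd_add_page_ref_count(gfn, KVM_PAGES_PER_HPAGE(level));

        /* TDX side: only after the S-EPT entry is successfully removed */
        guest_memfd_dec_page_ref_count(gfn, KVM_PAGES_PER_HPAGE(level));

        /*
         * guest_memfd side: before truncating a private folio from the
         * filemap, transfer any remaining internal count (i.e. an unmap
         * that never completed) to the real folio refcount, so that the
         * page cannot be reused by the OS.
         */
        nr = gmem_internal_refcount(inode, gfn);
        if (nr)
                folio_ref_add(folio, nr);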
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months ago
On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > We need to restore to the previous status (which includes the host page table)
> > > if conversion can't be done.
> > > That said, in my view, a better flow would be:
> > >
> > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > >    consumers in kernel of memory allocated from guest_memfd).
> > >
> > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > >    proceed. For example, in the case of TDX, this might involve memory
> > >    allocation and page splitting.
> > >
> > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > >    proceeds by sending the actual invalidation request.
> > >
> > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > >    the page or elevate the page reference count.
> >
> > Few questions here:
> > 1) It sounds like the failure to remove entries from SEPT could only
> > be due to bugs in the KVM/TDX module,
> Yes.
>
> > how reliable would it be to
> > continue executing TDX VMs on the host once such bugs are hit?
> The TDX VMs will be killed. However, the private pages are still mapped in the
> SEPT (after the unmapping failure).
> The teardown flow for TDX VM is:
>
> do_exit
>   |->exit_files
>      |->kvm_gmem_release ==> (1) Unmap guest pages
>      |->release kvmfd
>         |->kvm_destroy_vm  (2) Reclaiming resources
>            |->kvm_arch_pre_destroy_vm  ==> Release hkid
>            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
>
> Without holding page reference after (1) fails, the guest pages may have been
> re-assigned by the host OS while they are still still tracked in the TDX module.

What happens to the pagetable memory holding the SEPT entry? Is that
also supposed to be leaked?

>
>
> > 2) Is it reliable to continue executing the host kernel and other
> > normal VMs once such bugs are hit?
> If with TDX holding the page ref count, the impact of unmapping failure of guest
> pages is just to leak those pages.
>
> > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > and cleaned up right away?
> As in the above flow, TDX needs to hold the page reference on unmapping failure
> until after reclaiming is successful. Well, reclaiming itself is possible to
> fail either.
>
> So, below is my proposal. Showed in the simple POC code based on
> https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
>
> Patch 1: TDX increases page ref count on unmap failure.

This will not work; as Ackerley pointed out earlier [1], it will be
impossible to differentiate between transient refcounts on private
pages and extra refcounts on private memory due to a TDX unmap failure.

[1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/

> Patch 2: Bail out private-to-shared conversion if splitting fails.
> Patch 3: Make kvm_gmem_zap() return void.
>
> ...
>         /*
>
>
> If the above changes are agreeable, we could consider a more ambitious approach:
> introducing an interface like:
>
> int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);

I don't see any reason to introduce full tracking of gfn mapping
status in SEPTs just to handle very rare scenarios which KVM/TDX are
taking utmost care to avoid.

That being said, I see value in letting guest_memfd know the exact ranges
still in use by the TDX module due to unmapping failures, so that
guest_memfd can take the right action instead of relying on refcounts.

Does KVM continue unmapping the full range even after TDX SEPT
management fails to unmap a subrange?
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > We need to restore to the previous status (which includes the host page table)
> > > > if conversion can't be done.
> > > > That said, in my view, a better flow would be:
> > > >
> > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > >    consumers in kernel of memory allocated from guest_memfd).
> > > >
> > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > >    proceed. For example, in the case of TDX, this might involve memory
> > > >    allocation and page splitting.
> > > >
> > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > >    proceeds by sending the actual invalidation request.
> > > >
> > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > >    the page or elevate the page reference count.
> > >
> > > Few questions here:
> > > 1) It sounds like the failure to remove entries from SEPT could only
> > > be due to bugs in the KVM/TDX module,
> > Yes.
> >
> > > how reliable would it be to
> > > continue executing TDX VMs on the host once such bugs are hit?
> > The TDX VMs will be killed. However, the private pages are still mapped in the
> > SEPT (after the unmapping failure).
> > The teardown flow for TDX VM is:
> >
> > do_exit
> >   |->exit_files
> >      |->kvm_gmem_release ==> (1) Unmap guest pages
> >      |->release kvmfd
> >         |->kvm_destroy_vm  (2) Reclaiming resources
> >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> >
> > Without holding page reference after (1) fails, the guest pages may have been
> > re-assigned by the host OS while they are still still tracked in the TDX module.
> 
> What happens to the pagetable memory holding the SEPT entry? Is that
> also supposed to be leaked?
It depends on whether reclaiming the page table pages holding the SEPT entry
fails. If it does, they will also be leaked.
The TDR page, however, is guaranteed to be leaked, since reclaiming the TDR
page will fail after (1) fails.



> > > 2) Is it reliable to continue executing the host kernel and other
> > > normal VMs once such bugs are hit?
> > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > pages is just to leak those pages.
> >
> > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > and cleaned up right away?
> > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > until after reclaiming is successful. Well, reclaiming itself is possible to
> > fail either.
> >
> > So, below is my proposal. Showed in the simple POC code based on
> > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> >
> > Patch 1: TDX increases page ref count on unmap failure.
> 
> This will not work as Ackerley pointed out earlier [1], it will be
> impossible to differentiate between transient refcounts on private
> pages and extra refcounts of private memory due to TDX unmap failure.
Hmm. why are there transient refcounts on private pages?
And why should we differentiate the two?


> [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> 
> > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > Patch 3: Make kvm_gmem_zap() return void.
> >
> > ...
> >         /*
> >
> >
> > If the above changes are agreeable, we could consider a more ambitious approach:
> > introducing an interface like:
> >
> > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> 
> I don't see any reason to introduce full tracking of gfn mapping
> status in SEPTs just to handle very rare scenarios which KVM/TDX are
> taking utmost care to avoid.
> 
> That being said, I see value in letting guest_memfd know exact ranges
> still being under use by the TDX module due to unmapping failures.
> guest_memfd can take the right action instead of relying on refcounts.
> 
> Does KVM continue unmapping the full range even after TDX SEPT
> management fails to unmap a subrange?
Yes, if there's no bug in KVM, it will continue unmapping the full ranges.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months ago
On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > We need to restore to the previous status (which includes the host page table)
> > > > > if conversion can't be done.
> > > > > That said, in my view, a better flow would be:
> > > > >
> > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > >
> > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > >    allocation and page splitting.
> > > > >
> > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > >    proceeds by sending the actual invalidation request.
> > > > >
> > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > >    the page or elevate the page reference count.
> > > >
> > > > Few questions here:
> > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > be due to bugs in the KVM/TDX module,
> > > Yes.
> > >
> > > > how reliable would it be to
> > > > continue executing TDX VMs on the host once such bugs are hit?
> > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > SEPT (after the unmapping failure).
> > > The teardown flow for TDX VM is:
> > >
> > > do_exit
> > >   |->exit_files
> > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > >      |->release kvmfd
> > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > >
> > > Without holding page reference after (1) fails, the guest pages may have been
> > > re-assigned by the host OS while they are still still tracked in the TDX module.
> >
> > What happens to the pagetable memory holding the SEPT entry? Is that
> > also supposed to be leaked?
> It depends on if the reclaiming of the page table pages holding the SEPT entry
> fails. If it is, it will be also leaked.
> But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> will fail after (1) fails.
>

Ok. A few questions that I would like to briefly touch on:
i) If (1) fails and the VM is then marked as bugged, will the TDX module
actually access that page in the context of the same VM again?
ii) Which resources should remain unreclaimed if (1) fails?
     * the page backing the SEPT entry
     * the page backing the PAMT entry
     * the TDMR
    If the TDMR is the only one that fails to reclaim, will the TDX module
ever access the physical memory after the VM is cleaned up?
Otherwise, should all of these be made unreclaimable?
iii) Will it be safe for the host to reuse that memory with a proper
WBINVD/memory-clearing sequence if the TDX module/TD is not going to use
that memory?

>
>
> > > > 2) Is it reliable to continue executing the host kernel and other
> > > > normal VMs once such bugs are hit?
> > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > pages is just to leak those pages.
> > >
> > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > and cleaned up right away?
> > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > fail either.
> > >
> > > So, below is my proposal. Showed in the simple POC code based on
> > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > >
> > > Patch 1: TDX increases page ref count on unmap failure.
> >
> > This will not work as Ackerley pointed out earlier [1], it will be
> > impossible to differentiate between transient refcounts on private
> > pages and extra refcounts of private memory due to TDX unmap failure.
> Hmm. why are there transient refcounts on private pages?
> And why should we differentiate the two?

Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].

Speculative/transient refcounts came up a few times in the context of
guest_memfd discussions; some examples include pagetable walkers,
page migration, speculative pagecache lookups, GUP-fast, etc. David H
can provide more context here as needed.

Effectively some core-mm features that are present today or might land
in the future can cause folio refcounts to be grabbed for short
durations without actual access to underlying physical memory. These
scenarios are unlikely to happen for private memory but can't be
discounted completely.

Another reason to avoid relying on refcounts is to not block usage of
raw physical memory unmanaged by kernel (without page structs) to back
guest private memory as we had discussed previously. This will help
simplify merge/split operations during conversions and help usecases
like guest memory persistence [2] and non-confidential VMs.

[1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
[2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
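
For reference, a transient refcount is typically just a short get/put window
around a speculative lookup; a sketch with in-tree pagecache helpers:

        struct folio *folio;

        folio = filemap_get_folio(mapping, index);      /* reference taken */
        if (!IS_ERR(folio)) {
                /* short window: refcount elevated, memory never accessed */
                folio_put(folio);                       /* reference dropped */
        }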

>
>
> > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> >
> > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > Patch 3: Make kvm_gmem_zap() return void.
> > >
> > > ...
> > >         /*
> > >
> > >
> > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > introducing an interface like:
> > >
> > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> >
> > I don't see any reason to introduce full tracking of gfn mapping
> > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > taking utmost care to avoid.
> >
> > That being said, I see value in letting guest_memfd know exact ranges
> > still being under use by the TDX module due to unmapping failures.
> > guest_memfd can take the right action instead of relying on refcounts.
> >
> > Does KVM continue unmapping the full range even after TDX SEPT
> > management fails to unmap a subrange?
> Yes, if there's no bug in KVM, it will continue unmapping the full ranges.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 6 months ago
On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].

I'm confused...

> 
> Speculative/transient refcounts came up a few times In the context of
> guest_memfd discussions, some examples include: pagetable walkers,
> page migration, speculative pagecache lookups, GUP-fast etc. David H
> can provide more context here as needed.
> 
> Effectively some core-mm features that are present today or might land
> in the future can cause folio refcounts to be grabbed for short
> durations without actual access to underlying physical memory. These
> scenarios are unlikely to happen for private memory but can't be
> discounted completely.

This means the refcount could be increased for other reasons, and so guest_memfd
shouldn't rely on refcounts for its purposes? So, it is not a problem for other
components handling the page to elevate the refcount?

> 
> Another reason to avoid relying on refcounts is to not block usage of
> raw physical memory unmanaged by kernel (without page structs) to back
> guest private memory as we had discussed previously. This will help
> simplify merge/split operations during conversions and help usecases
> like guest memory persistence [2] and non-confidential VMs.

If this becomes a thing for private memory (which it isn't yet), then couldn't
we just change things at that point?

Is the only issue with TDX taking refcounts that it won't work with future code
changes?

> 
> [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/

Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months ago
On Tue, Jun 17, 2025 at 5:34 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
>
> I'm confused...
>
> >
> > Speculative/transient refcounts came up a few times In the context of
> > guest_memfd discussions, some examples include: pagetable walkers,
> > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > can provide more context here as needed.
> >
> > Effectively some core-mm features that are present today or might land
> > in the future can cause folio refcounts to be grabbed for short
> > durations without actual access to underlying physical memory. These
> > scenarios are unlikely to happen for private memory but can't be
> > discounted completely.
>
> This means the refcount could be increased for other reasons, and so guestmemfd
> shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> components handling the page elevate the refcount?

It's simpler to handle the transient refcounts, as there are the following
options (option 1 is sketched after this list):
1) Wait for a small amount of time
2) Keep the folio refcounts frozen to zero at all times, which will
effectively eliminate the scenario of transient refcounts.
3) Use raw memory without page structs - unmanaged by kernel.
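
A sketch of option 1, reusing the refcount check from the POC earlier in this
thread. The helper name and the retry parameters are illustrative only, and
the "safe" threshold simply mirrors the kvm_gmem_private_has_safe_refcount()
check from that POC:

#include <linux/delay.h>
#include <linux/mm.h>

static bool gmem_wait_for_safe_refcount(struct folio *folio)
{
        int retries = 10;

        /* Anything above folio_nr_pages() is treated as an extra reference. */
        while (folio_ref_count(folio) > folio_nr_pages(folio) && retries--)
                msleep(1);              /* give transient users time to drop */

        return folio_ref_count(folio) <= folio_nr_pages(folio);
}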

>
> >
> > Another reason to avoid relying on refcounts is to not block usage of
> > raw physical memory unmanaged by kernel (without page structs) to back
> > guest private memory as we had discussed previously. This will help
> > simplify merge/split operations during conversions and help usecases
> > like guest memory persistence [2] and non-confidential VMs.
>
> If this becomes a thing for private memory (which it isn't yet), then couldn't
> we just change things at that point?

It would be great to avoid having to go through discussion again, if
we have good reasons to handle it now.

>
> Is the only issue with TDX taking refcounts that it won't work with future code
> changes?
>
> >
> > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
>
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 6 months ago
On Tue, 2025-06-17 at 21:29 -0700, Vishal Annapurve wrote:
> > This means the refcount could be increased for other reasons, and so
> > guestmemfd
> > shouldn't rely on refcounts for it's purposes? So, it is not a problem for
> > other
> > components handling the page elevate the refcount?
> 
> It's simpler to handle the transient refcounts as there are following options:
> 1) Wait for a small amount of time
> 2) Keep the folio refcounts frozen to zero at all times, which will
> effectively eliminate the scenario of transient refcounts.
> 3) Use raw memory without page structs - unmanaged by kernel.
> 
> > 
> > > 
> > > Another reason to avoid relying on refcounts is to not block usage of
> > > raw physical memory unmanaged by kernel (without page structs) to back
> > > guest private memory as we had discussed previously. This will help
> > > simplify merge/split operations during conversions and help usecases
> > > like guest memory persistence [2] and non-confidential VMs.
> > 
> > If this becomes a thing for private memory (which it isn't yet), then
> > couldn't
> > we just change things at that point?
> 
> It would be great to avoid having to go through discussion again, if
> we have good reasons to handle it now.

I thought we already came to agreement on whether to spend time pre-designing
for future things. This thread has gotten pretty long, can we stick to the
current problems in an effort to close it?

Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> 
> I'm confused...
> 
> > 
> > Speculative/transient refcounts came up a few times In the context of
> > guest_memfd discussions, some examples include: pagetable walkers,
> > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > can provide more context here as needed.
> > 
> > Effectively some core-mm features that are present today or might land
> > in the future can cause folio refcounts to be grabbed for short
> > durations without actual access to underlying physical memory. These
> > scenarios are unlikely to happen for private memory but can't be
> > discounted completely.
> 
> This means the refcount could be increased for other reasons, and so guestmemfd
> shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> components handling the page elevate the refcount?
Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
to convert to private, why is it allowed to just invoke
kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
account? Isn't it more easier for shared pages to have speculative/transient
refcounts?

[3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/

> > 
> > Another reason to avoid relying on refcounts is to not block usage of
> > raw physical memory unmanaged by kernel (without page structs) to back
> > guest private memory as we had discussed previously. This will help
> > simplify merge/split operations during conversions and help usecases
> > like guest memory persistence [2] and non-confidential VMs.
> 
> If this becomes a thing for private memory (which it isn't yet), then couldn't
> we just change things at that point?
> 
> Is the only issue with TDX taking refcounts that it won't work with future code
> changes?
> 
> > 
> > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
>
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months ago
On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> >
> > I'm confused...
> >
> > >
> > > Speculative/transient refcounts came up a few times In the context of
> > > guest_memfd discussions, some examples include: pagetable walkers,
> > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > can provide more context here as needed.
> > >
> > > Effectively some core-mm features that are present today or might land
> > > in the future can cause folio refcounts to be grabbed for short
> > > durations without actual access to underlying physical memory. These
> > > scenarios are unlikely to happen for private memory but can't be
> > > discounted completely.
> >
> > This means the refcount could be increased for other reasons, and so guestmemfd
> > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > components handling the page elevate the refcount?
> Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> to convert to private, why is it allowed to just invoke
> kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> account? Isn't it more easier for shared pages to have speculative/transient
> refcounts?

These speculative refcounts are taken into account: in case of unsafe
refcounts, the conversion operation immediately exits to userspace with
EAGAIN, and userspace is supposed to retry the conversion.

Yes, it's even easier for shared pages to have speculative/transient refcounts.
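
From the userspace side, that retry is just a loop. A sketch, assuming a
conversion ioctl on the guest_memfd fd; GMEM_CONVERT_PRIVATE and struct
gmem_convert are placeholders rather than the RFC's actual uapi names:

        struct gmem_convert range = { .offset = off, .size = len };

        for (;;) {
                if (ioctl(gmem_fd, GMEM_CONVERT_PRIVATE, &range) == 0)
                        break;                  /* conversion succeeded */
                if (errno != EAGAIN)
                        err(1, "conversion failed");
                usleep(100);                    /* transient refcount, retry */
        }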

>
> [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
>
> > >
> > > Another reason to avoid relying on refcounts is to not block usage of
> > > raw physical memory unmanaged by kernel (without page structs) to back
> > > guest private memory as we had discussed previously. This will help
> > > simplify merge/split operations during conversions and help usecases
> > > like guest memory persistence [2] and non-confidential VMs.
> >
> > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > we just change things at that point?
> >
> > Is the only issue with TDX taking refcounts that it won't work with future code
> > changes?
> >
> > >
> > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > >
> > > I'm confused...
> > >
> > > >
> > > > Speculative/transient refcounts came up a few times In the context of
> > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > can provide more context here as needed.
> > > >
> > > > Effectively some core-mm features that are present today or might land
> > > > in the future can cause folio refcounts to be grabbed for short
> > > > durations without actual access to underlying physical memory. These
> > > > scenarios are unlikely to happen for private memory but can't be
> > > > discounted completely.
> > >
> > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > components handling the page elevate the refcount?
> > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > to convert to private, why is it allowed to just invoke
> > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > account? Isn't it more easier for shared pages to have speculative/transient
> > refcounts?
> 
> These speculative refcounts are taken into account, in case of unsafe
> refcounts, conversion operation immediately exits to userspace with
> EAGAIN and userspace is supposed to retry conversion.
Hmm, so why can't private-to-shared conversion also exit to userspace with
EAGAIN?

In the POC
https://lore.kernel.org/lkml/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com,
kvm_gmem_convert_should_proceed() just returns EFAULT (can be modified to
EAGAIN) to userspace instead.

> 
> Yes, it's more easier for shared pages to have speculative/transient refcounts.
> 
> >
> > [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
> >
> > > >
> > > > Another reason to avoid relying on refcounts is to not block usage of
> > > > raw physical memory unmanaged by kernel (without page structs) to back
> > > > guest private memory as we had discussed previously. This will help
> > > > simplify merge/split operations during conversions and help usecases
> > > > like guest memory persistence [2] and non-confidential VMs.
> > >
> > > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > > we just change things at that point?
> > >
> > > Is the only issue with TDX taking refcounts that it won't work with future code
> > > changes?
> > >
> > > >
> > > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> > >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months ago
On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > >
> > > > I'm confused...
> > > >
> > > > >
> > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > can provide more context here as needed.
> > > > >
> > > > > Effectively some core-mm features that are present today or might land
> > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > durations without actual access to underlying physical memory. These
> > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > discounted completely.
> > > >
> > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > components handling the page elevate the refcount?
> > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > to convert to private, why is it allowed to just invoke
> > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > account? Isn't it more easier for shared pages to have speculative/transient
> > > refcounts?
> >
> > These speculative refcounts are taken into account, in case of unsafe
> > refcounts, conversion operation immediately exits to userspace with
> > EAGAIN and userspace is supposed to retry conversion.
> Hmm, so why can't private-to-shared conversion also exit to userspace with
> EAGAIN?

How would userspace/guest_memfd differentiate between
speculative/transient refcounts and extra refcounts due to TDX unmap
failures?

>
> In the POC
> https://lore.kernel.org/lkml/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com,
> kvm_gmem_convert_should_proceed() just returns EFAULT (can be modified to
> EAGAIN) to userspace instead.
>
> >
> > Yes, it's more easier for shared pages to have speculative/transient refcounts.
> >
> > >
> > > [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
> > >
> > > > >
> > > > > Another reason to avoid relying on refcounts is to not block usage of
> > > > > raw physical memory unmanaged by kernel (without page structs) to back
> > > > > guest private memory as we had discussed previously. This will help
> > > > > simplify merge/split operations during conversions and help usecases
> > > > > like guest memory persistence [2] and non-confidential VMs.
> > > >
> > > > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > > > we just change things at that point?
> > > >
> > > > Is the only issue with TDX taking refcounts that it won't work with future code
> > > > changes?
> > > >
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> > > >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 11:21:41PM -0700, Vishal Annapurve wrote:
> On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > > >
> > > > > I'm confused...
> > > > >
> > > > > >
> > > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > > can provide more context here as needed.
> > > > > >
> > > > > > Effectively some core-mm features that are present today or might land
> > > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > > durations without actual access to underlying physical memory. These
> > > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > > discounted completely.
> > > > >
> > > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > > components handling the page elevate the refcount?
> > > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > > to convert to private, why is it allowed to just invoke
> > > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > > account? Isn't it more easier for shared pages to have speculative/transient
> > > > refcounts?
> > >
> > > These speculative refcounts are taken into account, in case of unsafe
> > > refcounts, conversion operation immediately exits to userspace with
> > > EAGAIN and userspace is supposed to retry conversion.
> > Hmm, so why can't private-to-shared conversion also exit to userspace with
> > EAGAIN?
> 
> How would userspace/guest_memfd differentiate between
> speculative/transient refcounts and extra refcounts due to TDX unmap
> failures?
Hmm, it also can't differentiate between speculative/transient refcounts and
extra refcounts on shared folios due to other reasons.

> 
> >
> > In the POC
> > https://lore.kernel.org/lkml/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com,
> > kvm_gmem_convert_should_proceed() just returns EFAULT (can be modified to
> > EAGAIN) to userspace instead.
> >
> > >
> > > Yes, it's more easier for shared pages to have speculative/transient refcounts.
> > >
> > > >
> > > > [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
> > > >
> > > > > >
> > > > > > Another reason to avoid relying on refcounts is to not block usage of
> > > > > > raw physical memory unmanaged by kernel (without page structs) to back
> > > > > > guest private memory as we had discussed previously. This will help
> > > > > > simplify merge/split operations during conversions and help usecases
> > > > > > like guest memory persistence [2] and non-confidential VMs.
> > > > >
> > > > > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > > > > we just change things at that point?
> > > > >
> > > > > Is the only issue with TDX taking refcounts that it won't work with future code
> > > > > changes?
> > > > >
> > > > > >
> > > > > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > > > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> > > > >
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months ago
On Tue, Jun 17, 2025 at 11:34 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jun 17, 2025 at 11:21:41PM -0700, Vishal Annapurve wrote:
> > On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > > > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > > > >
> > > > > > I'm confused...
> > > > > >
> > > > > > >
> > > > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > > > can provide more context here as needed.
> > > > > > >
> > > > > > > Effectively some core-mm features that are present today or might land
> > > > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > > > durations without actual access to underlying physical memory. These
> > > > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > > > discounted completely.
> > > > > >
> > > > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > > > components handling the page elevate the refcount?
> > > > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > > > to convert to private, why is it allowed to just invoke
> > > > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > > > account? Isn't it more easier for shared pages to have speculative/transient
> > > > > refcounts?
> > > >
> > > > These speculative refcounts are taken into account, in case of unsafe
> > > > refcounts, conversion operation immediately exits to userspace with
> > > > EAGAIN and userspace is supposed to retry conversion.
> > > Hmm, so why can't private-to-shared conversion also exit to userspace with
> > > EAGAIN?
> >
> > How would userspace/guest_memfd differentiate between
> > speculative/transient refcounts and extra refcounts due to TDX unmap
> > failures?
> Hmm, it also can't differentiate between speculative/transient refcounts and
> extra refcounts on shared folios due to other reasons.
>

In case of shared memory ranges, userspace is effectively responsible
for extra refcounts and can act towards removing them if not done
already. If "extra" refcounts are taken care of then the only
remaining scenario is speculative/transient refcounts.

But for private memory ranges, userspace is not responsible for any
refcounts landing on them.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 11:44:34PM -0700, Vishal Annapurve wrote:
> On Tue, Jun 17, 2025 at 11:34 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jun 17, 2025 at 11:21:41PM -0700, Vishal Annapurve wrote:
> > > On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > > > > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > > > > >
> > > > > > > I'm confused...
> > > > > > >
> > > > > > > >
> > > > > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > > > > can provide more context here as needed.
> > > > > > > >
> > > > > > > > Effectively some core-mm features that are present today or might land
> > > > > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > > > > durations without actual access to underlying physical memory. These
> > > > > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > > > > discounted completely.
> > > > > > >
> > > > > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > > > > components handling the page elevate the refcount?
> > > > > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > > > > to convert to private, why is it allowed to just invoke
> > > > > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > > > > account? Isn't it more easier for shared pages to have speculative/transient
> > > > > > refcounts?
> > > > >
> > > > > These speculative refcounts are taken into account, in case of unsafe
> > > > > refcounts, conversion operation immediately exits to userspace with
> > > > > EAGAIN and userspace is supposed to retry conversion.
> > > > Hmm, so why can't private-to-shared conversion also exit to userspace with
> > > > EAGAIN?
> > >
> > > How would userspace/guest_memfd differentiate between
> > > speculative/transient refcounts and extra refcounts due to TDX unmap
> > > failures?
> > Hmm, it also can't differentiate between speculative/transient refcounts and
> > extra refcounts on shared folios due to other reasons.
> >
> 
> In case of shared memory ranges, userspace is effectively responsible
> for extra refcounts and can act towards removing them if not done
> already. If "extra" refcounts are taken care of then the only
> remaining scenario is speculative/transient refcounts.
> 
> But for private memory ranges, userspace is not responsible for any
> refcounts landing on them.
Ok. The similarities between the two are:
- userspace can't help on speculative/transient refcounts.
- userspace can't make the conversion succeed with "extra" refcounts, whether
  held by userspace or by TDX.

But I think I get your point that EAGAIN is not the right code in case of
"extra" refcounts held by TDX.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 01:09:05AM -0700, Vishal Annapurve wrote:
> On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > We need to restore to the previous status (which includes the host page table)
> > > > > > if conversion can't be done.
> > > > > > That said, in my view, a better flow would be:
> > > > > >
> > > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > > >
> > > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > > >    allocation and page splitting.
> > > > > >
> > > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > > >    proceeds by sending the actual invalidation request.
> > > > > >
> > > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > > >    the page or elevate the page reference count.
> > > > >
> > > > > Few questions here:
> > > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > > be due to bugs in the KVM/TDX module,
> > > > Yes.
> > > >
> > > > > how reliable would it be to
> > > > > continue executing TDX VMs on the host once such bugs are hit?
> > > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > > SEPT (after the unmapping failure).
> > > > The teardown flow for TDX VM is:
> > > >
> > > > do_exit
> > > >   |->exit_files
> > > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > > >      |->release kvmfd
> > > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > > >
> > > > Without holding page reference after (1) fails, the guest pages may have been
> > > > re-assigned by the host OS while they are still still tracked in the TDX module.
> > >
> > > What happens to the pagetable memory holding the SEPT entry? Is that
> > > also supposed to be leaked?
> > It depends on if the reclaiming of the page table pages holding the SEPT entry
> > fails. If it is, it will be also leaked.
> > But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> > will fail after (1) fails.
> >
> 
> Ok. Few questions that I would like to touch base briefly on:
> i) If (1) fails and then VM is marked as bugged, will the TDX module
> actually access that page in context of the same VM again?
In the TDX module, the TD is marked as TD_TEARDOWN after step (2), when the
hkid is released successfully.
Before that, the TD is able to access the pages even if KVM has marked it as
buggy.

After the TD is marked as TD_TEARDOWN, since (1) failed, the problematic guest
private pages are still tracked in the PAMT entries.
So, re-assigning the same PFNs to other TDs will fail.

> ii) What all resources should remain unreclaimed if (1) fails?
>      * page backing SEPT entry
>      * page backing PAMT entry
>      * TDMR
>     If TDMR is the only one that fails to reclaim, will the TDX module
> actually access the physical memory ever after the VM is cleaned up?
> Otherwise, should all of these be made unreclaimable?
From my understanding, they are
- guest private pages
- TDR page
- PAMT entries for guest private pages and TDR page


> iii) Will it be safe for the host to use that memory by proper
> WBINVD/memory clearing sequence if TDX module/TD is not going to use
> that memory?
I'm not sure. But it should be impossible for the host to re-assign the pages
to other TDs as long as the PAMT entries are not updated.


> > > > > 2) Is it reliable to continue executing the host kernel and other
> > > > > normal VMs once such bugs are hit?
> > > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > > pages is just to leak those pages.
> > > >
> > > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > > and cleaned up right away?
> > > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > > fail either.
> > > >
> > > > So, below is my proposal. Showed in the simple POC code based on
> > > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > > >
> > > > Patch 1: TDX increases page ref count on unmap failure.
> > >
> > > This will not work as Ackerley pointed out earlier [1], it will be
> > > impossible to differentiate between transient refcounts on private
> > > pages and extra refcounts of private memory due to TDX unmap failure.
> > Hmm. why are there transient refcounts on private pages?
> > And why should we differentiate the two?
> 
> Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> 
> Speculative/transient refcounts came up a few times In the context of
> guest_memfd discussions, some examples include: pagetable walkers,
> page migration, speculative pagecache lookups, GUP-fast etc. David H
> can provide more context here as needed.
GUP-fast only walks page tables for shared memory?
Can other walkers get a private folio by walking shared mappings?

When those speculative/transient refcounts come up, can't
kvm_gmem_convert_should_proceed() wait in an interruptible way before returning
failure?
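For illustration, a rough sketch of such a bounded, interruptible wait (the
helper name and the 10ms bound are made up here; the expected count would be
whatever reference count guest_memfd itself holds):

static int kvm_gmem_wait_for_safe_refcount(struct folio *folio, int expected)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(10);

	/* Give short-lived speculative references a chance to drop. */
	while (folio_ref_count(folio) > expected) {
		if (signal_pending(current))
			return -EINTR;
		if (time_after(jiffies, timeout))
			return -EAGAIN;
		cpu_relax();
	}

	return 0;
}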

The wait will anyway happen after the conversion is started, i.e.,
in filemap_remove_folio_for_restructuring().
       while (!folio_ref_freeze(folio, filemap_refcount)) {
                /*
                 * At this point only filemap refcounts are expected, hence okay
                 * to spin until speculative refcounts go away.
                 */
                WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
        }


BTW, I noticed that there's no filemap_invalidate_lock_shared() in
kvm_gmem_fault_shared() in 
https://lore.kernel.org/all/20250611133330.1514028-9-tabba@google.com/.

Do you know why?

> Effectively some core-mm features that are present today or might land
> in the future can cause folio refcounts to be grabbed for short
> durations without actual access to underlying physical memory. These
> scenarios are unlikely to happen for private memory but can't be
> discounted completely.
> 
> Another reason to avoid relying on refcounts is to not block usage of
> raw physical memory unmanaged by kernel (without page structs) to back
> guest private memory as we had discussed previously. This will help
> simplify merge/split operations during conversions and help usecases
> like guest memory persistence [2] and non-confidential VMs.
Ok.
Currently, "letting guest_memfd know exact ranges still being under use by the
TDX module due to unmapping failures" is good enough for TDX, though full
tracking of each GFN is even better.


> [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> 
> >
> >
> > > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> > >
> > > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > > Patch 3: Make kvm_gmem_zap() return void.
> > > >
> > > > ...
> > > >         /*
> > > >
> > > >
> > > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > > introducing an interface like:
> > > >
> > > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> > >
> > > I don't see any reason to introduce full tracking of gfn mapping
> > > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > > taking utmost care to avoid.
> > >
> > > That being said, I see value in letting guest_memfd know exact ranges
> > > still being under use by the TDX module due to unmapping failures.
> > > guest_memfd can take the right action instead of relying on refcounts.
> > >
> > > Does KVM continue unmapping the full range even after TDX SEPT
> > > management fails to unmap a subrange?
> > Yes, if there's no bug in KVM, it will continue unmapping the full ranges.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 6 months ago
On Tue, Jun 17, 2025 at 3:00 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jun 17, 2025 at 01:09:05AM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > > > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >
> > > > > > > We need to restore to the previous status (which includes the host page table)
> > > > > > > if conversion can't be done.
> > > > > > > That said, in my view, a better flow would be:
> > > > > > >
> > > > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > > > >
> > > > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > > > >    allocation and page splitting.
> > > > > > >
> > > > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > > > >    proceeds by sending the actual invalidation request.
> > > > > > >
> > > > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > > > >    the page or elevate the page reference count.
> > > > > >
> > > > > > Few questions here:
> > > > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > > > be due to bugs in the KVM/TDX module,
> > > > > Yes.
> > > > >
> > > > > > how reliable would it be to
> > > > > > continue executing TDX VMs on the host once such bugs are hit?
> > > > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > > > SEPT (after the unmapping failure).
> > > > > The teardown flow for TDX VM is:
> > > > >
> > > > > do_exit
> > > > >   |->exit_files
> > > > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > > > >      |->release kvmfd
> > > > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > > > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > > > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > > > >
> > > > > Without holding page reference after (1) fails, the guest pages may have been
> > > > > re-assigned by the host OS while they are still still tracked in the TDX module.
> > > >
> > > > What happens to the pagetable memory holding the SEPT entry? Is that
> > > > also supposed to be leaked?
> > > It depends on if the reclaiming of the page table pages holding the SEPT entry
> > > fails. If it is, it will be also leaked.
> > > But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> > > will fail after (1) fails.
> > >
> >
> > Ok. Few questions that I would like to touch base briefly on:
> > i) If (1) fails and then VM is marked as bugged, will the TDX module
> > actually access that page in context of the same VM again?
> In TDX module, the TD is marked as TD_TEARDOWN after step (2) when hkid is
> released successfully.
> Before that, TD is able to access the pages even if it is marked as buggy by KVM.
>
> After TD is marked as TD_TEARDOWN, since (1) fails, the problematic guest
> private pages are still tracked in the PAMT entries.
> So, re-assignment the same PFN to other TDs will fail.
>
> > ii) What all resources should remain unreclaimed if (1) fails?
> >      * page backing SEPT entry
> >      * page backing PAMT entry
> >      * TDMR
> >     If TDMR is the only one that fails to reclaim, will the TDX module
> > actually access the physical memory ever after the VM is cleaned up?
> > Otherwise, should all of these be made unreclaimable?
> From my understanding, they are
> - guest private pages
> - TDR page
> - PAMT entries for guest private pages and TDR page
>
>
> > iii) Will it be safe for the host to use that memory by proper
> > WBINVD/memory clearing sequence if TDX module/TD is not going to use
> > that memory?
> I'm not sure. But it should be impossible for host to re-assign the pages to
> other TDs as long as PAMT entries are not updated.
>
>
> > > > > > 2) Is it reliable to continue executing the host kernel and other
> > > > > > normal VMs once such bugs are hit?
> > > > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > > > pages is just to leak those pages.
> > > > >
> > > > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > > > and cleaned up right away?
> > > > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > > > fail either.
> > > > >
> > > > > So, below is my proposal. Showed in the simple POC code based on
> > > > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > > > >
> > > > > Patch 1: TDX increases page ref count on unmap failure.
> > > >
> > > > This will not work as Ackerley pointed out earlier [1], it will be
> > > > impossible to differentiate between transient refcounts on private
> > > > pages and extra refcounts of private memory due to TDX unmap failure.
> > > Hmm. why are there transient refcounts on private pages?
> > > And why should we differentiate the two?
> >
> > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> >
> > Speculative/transient refcounts came up a few times In the context of
> > guest_memfd discussions, some examples include: pagetable walkers,
> > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > can provide more context here as needed.
> GUP-fast only walks page tables for shared memory?
> Can other walkers get a private folio by walking shared mappings?

No, they can't. There can be walkers that parse direct map entries.

>
> On those speculative/transient refcounts came up, can't the
> kvm_gmem_convert_should_proceed() wait in an interruptible way before returning
> failure?

These refcounts can land at any time on any of the ranges, so a
guest_memfd implementation that bails out on errors would need a time
bound for each folio and would need to traverse each folio even before
the actual restructuring. That increases the complexity and latency of
the conversion operation.

>
> The wait will anyway happen after the conversion is started, i.e.,
> in filemap_remove_folio_for_restructuring().
>        while (!folio_ref_freeze(folio, filemap_refcount)) {
>                 /*
>                  * At this point only filemap refcounts are expected, hence okay
>                  * to spin until speculative refcounts go away.
>                  */
>                 WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
>         }
>
>
> BTW, I noticed that there's no filemap_invalidate_lock_shared() in
> kvm_gmem_fault_shared() in
> https://lore.kernel.org/all/20250611133330.1514028-9-tabba@google.com/.
>
> Do you know why?

It will land when guest_memfd in-place conversion support is posted.

>
> > Effectively some core-mm features that are present today or might land
> > in the future can cause folio refcounts to be grabbed for short
> > durations without actual access to underlying physical memory. These
> > scenarios are unlikely to happen for private memory but can't be
> > discounted completely.
> >
> > Another reason to avoid relying on refcounts is to not block usage of
> > raw physical memory unmanaged by kernel (without page structs) to back
> > guest private memory as we had discussed previously. This will help
> > simplify merge/split operations during conversions and help usecases
> > like guest memory persistence [2] and non-confidential VMs.
> Ok.
> Currently, "letting guest_memfd know exact ranges still being under use by the
> TDX module due to unmapping failures" is good enough for TDX, though full
> tracking of each GFN is even better.
>
>
> > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> >
> > >
> > >
> > > > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> > > >
> > > > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > > > Patch 3: Make kvm_gmem_zap() return void.
> > > > >
> > > > > ...
> > > > >         /*
> > > > >
> > > > >
> > > > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > > > introducing an interface like:
> > > > >
> > > > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> > > >
> > > > I don't see any reason to introduce full tracking of gfn mapping
> > > > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > > > taking utmost care to avoid.
> > > >
> > > > That being said, I see value in letting guest_memfd know exact ranges
> > > > still being under use by the TDX module due to unmapping failures.
> > > > guest_memfd can take the right action instead of relying on refcounts.
> > > >
> > > > Does KVM continue unmapping the full range even after TDX SEPT
> > > > management fails to unmap a subrange?
> > > Yes, if there's no bug in KVM, it will continue unmapping the full ranges.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 6 months ago
On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> > Few questions here:
> > 1) It sounds like the failure to remove entries from SEPT could only
> > be due to bugs in the KVM/TDX module,
> Yes.

A TDX module bug could hypothetically cause many types of host instability. We
should think a little more about the context of the risk before we make TDX a
special case or add much error handling code around it. If we end up with a
bunch of paranoid error handling code around TDX module behavior, that is going
to be a pain to maintain. And error handling code for rare cases will be hard to
remove.

We've had a history of unreliable page removal during the base series
development. When we solved the problem, it was not completely clean (though
more on the guest affecting side). So I think there is reason to be concerned.
But this should work reliably in theory. So I'm not sure we should use the error
case as a hard reason. Instead maybe we should focus on how to make it less
likely to have an error. Unless there is a specific case you are considering,
Yan?

That said, I think the refcounting on error (or rather, notifying guestmemfd on
error to let it handle the error how it wants) is a fine solution. As long as it
doesn't take much code (as is the case for Yan's POC).

> 
> > how reliable would it be to
> > continue executing TDX VMs on the host once such bugs are hit?
> The TDX VMs will be killed. However, the private pages are still mapped in the
> SEPT (after the unmapping failure).
> The teardown flow for TDX VM is:
> 
> do_exit
>   |->exit_files
>      |->kvm_gmem_release ==> (1) Unmap guest pages 
>      |->release kvmfd
>         |->kvm_destroy_vm  (2) Reclaiming resources
>            |->kvm_arch_pre_destroy_vm  ==> Release hkid
>            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> 
> Without holding page reference after (1) fails, the guest pages may have been
> re-assigned by the host OS while they are still still tracked in the TDX
> module.
> 
> 
> > 2) Is it reliable to continue executing the host kernel and other
> > normal VMs once such bugs are hit?
> If with TDX holding the page ref count, the impact of unmapping failure of
> guest
> pages is just to leak those pages.

If the kernel might be able to continue working, it should try. It should warn
if there is a risk, so people can use panic_on_warn if they want to stop the
kernel.

> 
> > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > and cleaned up right away?
> As in the above flow, TDX needs to hold the page reference on unmapping
> failure
> until after reclaiming is successful. Well, reclaiming itself is possible to
> fail either.

We could ask TDX module folks if there is anything they could guarantee.

Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 08:25:20AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> > > Few questions here:
> > > 1) It sounds like the failure to remove entries from SEPT could only
> > > be due to bugs in the KVM/TDX module,
> > Yes.
> 
> A TDX module bug could hypothetically cause many types of host instability. We
> should consider a little more on the context for the risk before we make TDX a
> special case or add much error handling code around it. If we end up with a
> bunch of paranoid error handling code around TDX module behavior, that is going
> to be a pain to maintain. And error handling code for rare cases will be hard to
> remove.
> 
> We've had a history of unreliable page removal during the base series
> development. When we solved the problem, it was not completely clean (though
> more on the guest affecting side). So I think there is reason to be concerned.
> But this should work reliably in theory. So I'm not sure we should use the error
> case as a hard reason. Instead maybe we should focus on how to make it less
> likely to have an error. Unless there is a specific case you are considering,
> Yan?
Yes, KVM/TDX does its utmost to ensure that page removal cannot fail. However,
if bugs occur, KVM/TDX will trigger a BUG_ON and leak the problematic page.
This is a simple way to contain the error to the affected pages. It also helps
in debugging when unexpected errors arise.

Returning the error code up the stack is not worthwhile and I don't even think
it's feasible.
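For reference, on the S-EPT removal path the "BUG_ON and leak" handling
amounts to roughly the following (illustrative sketch only; the exact code in
the POC may differ):

	/*
	 * Removal is never expected to fail.  If it does, bug the VM and
	 * take extra references so the pages cannot be handed back to the
	 * page allocator while the TDX module may still access them.
	 */
	if (KVM_BUG_ON(err, kvm)) {
		folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
		return -EIO;
	}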


> That said, I think the refcounting on error (or rather, notifying guestmemfd on
> error do let it handle the error how it wants) is a fine solution. As long as it
> doesn't take much code (as is the case for Yan's POC).
> 
> > 
> > > how reliable would it be to
> > > continue executing TDX VMs on the host once such bugs are hit?
> > The TDX VMs will be killed. However, the private pages are still mapped in the
> > SEPT (after the unmapping failure).
> > The teardown flow for TDX VM is:
> > 
> > do_exit
> >   |->exit_files
> >      |->kvm_gmem_release ==> (1) Unmap guest pages 
> >      |->release kvmfd
> >         |->kvm_destroy_vm  (2) Reclaiming resources
> >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > 
> > Without holding page reference after (1) fails, the guest pages may have been
> > re-assigned by the host OS while they are still still tracked in the TDX
> > module.
> > 
> > 
> > > 2) Is it reliable to continue executing the host kernel and other
> > > normal VMs once such bugs are hit?
> > If with TDX holding the page ref count, the impact of unmapping failure of
> > guest
> > pages is just to leak those pages.
> 
> If the kernel might be able to continue working, it should try. It should warn
> if there is a risk, so people can use panic_on_warn if they want to stop the
> kernel.
> 
> > 
> > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > and cleaned up right away?
> > As in the above flow, TDX needs to hold the page reference on unmapping
> > failure
> > until after reclaiming is successful. Well, reclaiming itself is possible to
> > fail either.
> 
> We could ask TDX module folks if there is anything they could guarantee.
> 
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 6 months ago
On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> If the above changes are agreeable, we could consider a more ambitious approach:
> introducing an interface like:
> 
> int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);

We talked about doing something like having tdx_hold_page_on_error() in
guestmemfd with a proper name. The separation of concerns will be better if we
can just tell guestmemfd, the page has an issue. Then guestmemfd can decide how
to handle it (refcount or whatever).

> 
> This would allow guest_memfd to maintain an internal reference count for each
> private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping and
> guest_memfd_dec_page_ref_count() after a successful unmapping. Before truncating
> a private page from the filemap, guest_memfd could increase the real folio
> reference count based on its internal reference count for the private GFN.

What does this get us exactly? This is the argument to have less error prone
code that can survive forgetting to refcount on error? I don't see that it is an
especially special case.

Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 08:12:50AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> > If the above changes are agreeable, we could consider a more ambitious approach:
> > introducing an interface like:
> > 
> > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> 
> We talked about doing something like having tdx_hold_page_on_error() in
> guestmemfd with a proper name. The separation of concerns will be better if we
> can just tell guestmemfd, the page has an issue. Then guestmemfd can decide how
> to handle it (refcount or whatever).
Instead of using tdx_hold_page_on_error(), the advantage of informing
guest_memfd that TDX is holding a page at 4KB granularity is that, even if there
is a bug in KVM (such as forgetting to notify TDX to remove a mapping in
handle_removed_pt()), guest_memfd would be aware that the page remains mapped in
the TDX module. This allows guest_memfd to determine how to handle the
problematic page (whether through refcount adjustments or other methods) before
truncating it.

> > 
> > This would allow guest_memfd to maintain an internal reference count for each
> > private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping and
> > guest_memfd_dec_page_ref_count() after a successful unmapping. Before truncating
> > a private page from the filemap, guest_memfd could increase the real folio
> > reference count based on its internal reference count for the private GFN.
> 
> What does this get us exactly? This is the argument to have less error prone
> code that can survive forgetting to refcount on error? I don't see that it is an
> especially special case.
Yes, for less error-prone code.

If this approach is considered too complex for an initial implementation, using
tdx_hold_page_on_error() is also a viable option.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 6 months ago
On Tue, 2025-06-17 at 09:38 +0800, Yan Zhao wrote:
> > We talked about doing something like having tdx_hold_page_on_error() in
> > guestmemfd with a proper name. The separation of concerns will be better if
> > we
> > can just tell guestmemfd, the page has an issue. Then guestmemfd can decide
> > how
> > to handle it (refcount or whatever).
> Instead of using tdx_hold_page_on_error(), the advantage of informing
> guest_memfd that TDX is holding a page at 4KB granularity is that, even if
> there
> is a bug in KVM (such as forgetting to notify TDX to remove a mapping in
> handle_removed_pt()), guest_memfd would be aware that the page remains mapped
> in
> the TDX module. This allows guest_memfd to determine how to handle the
> problematic page (whether through refcount adjustments or other methods)
> before
> truncating it.

I don't think a potential bug in KVM is a good enough reason. If we are
concerned can we think about a warning instead?

We had talked about enhancing KASAN to know when a page is mapped into the
S-EPT in the past. So rather than designing around potential bugs, we could
focus on having a simpler implementation with the infrastructure to catch and
fix the bugs.

> 
> > > 
> > > This would allow guest_memfd to maintain an internal reference count for
> > > each
> > > private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping
> > > and
> > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > truncating
> > > a private page from the filemap, guest_memfd could increase the real folio
> > > reference count based on its internal reference count for the private GFN.
> > 
> > What does this get us exactly? This is the argument to have less error prone
> > code that can survive forgetting to refcount on error? I don't see that it
> > is an
> > especially special case.
> Yes, for a less error prone code.
> 
> If this approach is considered too complex for an initial implementation,
> using
> tdx_hold_page_on_error() is also a viable option.

I'm saying I don't think it's not a good enough reason. Why is it different then
other use-after free bugs? I feel like I'm missing something.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 11:52:48PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-17 at 09:38 +0800, Yan Zhao wrote:
> > > We talked about doing something like having tdx_hold_page_on_error() in
> > > guestmemfd with a proper name. The separation of concerns will be better if
> > > we
> > > can just tell guestmemfd, the page has an issue. Then guestmemfd can decide
> > > how
> > > to handle it (refcount or whatever).
> > Instead of using tdx_hold_page_on_error(), the advantage of informing
> > guest_memfd that TDX is holding a page at 4KB granularity is that, even if
> > there
> > is a bug in KVM (such as forgetting to notify TDX to remove a mapping in
> > handle_removed_pt()), guest_memfd would be aware that the page remains mapped
> > in
> > the TDX module. This allows guest_memfd to determine how to handle the
> > problematic page (whether through refcount adjustments or other methods)
> > before
> > truncating it.
> 
> I don't think a potential bug in KVM is a good enough reason. If we are
> concerned can we think about a warning instead?
> 
> We had talked enhancing kasan to know when a page is mapped into S-EPT in the
> past. So rather than design around potential bugs we could focus on having a
> simpler implementation with the infrastructure to catch and fix the bugs.
However, if failing to remove a guest private page would only cause a memory
leak, it's fine.
If TDX does not hold any refcount, guest_memfd has to know which private pages
are still mapped. Otherwise, a page may be re-assigned to other kernel
components while it is still mapped in the S-EPT.


> > 
> > > > 
> > > > This would allow guest_memfd to maintain an internal reference count for
> > > > each
> > > > private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping
> > > > and
> > > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > > truncating
> > > > a private page from the filemap, guest_memfd could increase the real folio
> > > > reference count based on its internal reference count for the private GFN.
> > > 
> > > What does this get us exactly? This is the argument to have less error prone
> > > code that can survive forgetting to refcount on error? I don't see that it
> > > is an
> > > especially special case.
> > Yes, for a less error prone code.
> > 
> > If this approach is considered too complex for an initial implementation,
> > using
> > tdx_hold_page_on_error() is also a viable option.
> 
> I'm saying I don't think it's not a good enough reason. Why is it different then
> other use-after free bugs? I feel like I'm missing something.
tdx_hold_page_on_error() could be implemented so that, on removal failure, it
invokes a guest_memfd interface to let guest_memfd know the exact ranges still
in use by the TDX module due to unmapping failures.
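For example (kvm_gmem_report_unmap_failure() is a hypothetical name, used only
to show the shape of the interface):

static void tdx_hold_page_on_error(struct kvm *kvm, gfn_t gfn, int level)
{
	/*
	 * Hypothetical guest_memfd hook: record that this GFN range is
	 * still mapped in the S-EPT, so guest_memfd neither converts it
	 * to shared nor returns the backing pages to the allocator.
	 */
	kvm_gmem_report_unmap_failure(kvm, gfn, KVM_PAGES_PER_HPAGE(level));
}
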
Do you think it's ok?
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 6 months ago
On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
> > I don't think a potential bug in KVM is a good enough reason. If we are
> > concerned can we think about a warning instead?
> > 
> > We had talked enhancing kasan to know when a page is mapped into S-EPT in
> > the
> > past. So rather than design around potential bugs we could focus on having a
> > simpler implementation with the infrastructure to catch and fix the bugs.
> However, if failing to remove a guest private page would only cause memory
> leak,
> it's fine. 
> If TDX does not hold any refcount, guest_memfd has to know that which private
> page is still mapped. Otherwise, the page may be re-assigned to other kernel
> components while it may still be mapped in the S-EPT.

KASAN detects use-after-frees like that. However, the TDX module code is not
instrumented. It won't check against the KASAN state for its accesses.
I had a brief chat about this with Dave and Kirill. A couple ideas were
discussed. One was to use page_ext to keep a flag that says the page is in-use
by the TDX module. There was also some discussion of using a normal page flag,
and that the reserved page flag might prevent some of the MM operations that
would be needed on guestmemfd pages. I didn't see the problem when I looked.

For the solution, basically the SEAMCALL wrappers set a flag when they hand a
page to the TDX module, and clear it when they successfully reclaim it via
tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
back to the page allocator, a warning is generated.
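A rough sketch of that flow for the aug path, shown here as a wrapper around
tdh_mem_page_aug() although the flag could equally be set inside the SEAMCALL
wrapper itself (tdx_page_set_fw_owned()/tdx_page_clear_fw_owned() are assumed
helpers backed by whatever flag or page_ext bit gets picked; none of this is
existing code):

static u64 tdh_mem_page_aug_tracked(struct tdx_td *td, u64 gpa, int level,
				    struct page *page, u64 *entry,
				    u64 *level_state)
{
	u64 err;

	/* Mark the page as handed to the TDX module before the SEAMCALL. */
	tdx_page_set_fw_owned(page);
	err = tdh_mem_page_aug(td, gpa, level, page, entry, level_state);
	if (err)
		tdx_page_clear_fw_owned(page);

	return err;
}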

Also it was mentioned that SGX did have a similar issue to what is being worried
about here:
https://lore.kernel.org/linux-sgx/aCYey1W6i7i3yPLL@gmail.com/T/#m86c8c4cf0e6b9a653bf0709a22bb360034a24d95

> 
> 
> > > 
> > > > > 
> > > > > This would allow guest_memfd to maintain an internal reference count
> > > > > for
> > > > > each
> > > > > private GFN. TDX would call guest_memfd_add_page_ref_count() for
> > > > > mapping
> > > > > and
> > > > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > > > truncating
> > > > > a private page from the filemap, guest_memfd could increase the real
> > > > > folio
> > > > > reference count based on its internal reference count for the private
> > > > > GFN.
> > > > 
> > > > What does this get us exactly? This is the argument to have less error
> > > > prone
> > > > code that can survive forgetting to refcount on error? I don't see that
> > > > it
> > > > is an
> > > > especially special case.
> > > Yes, for a less error prone code.
> > > 
> > > If this approach is considered too complex for an initial implementation,
> > > using
> > > tdx_hold_page_on_error() is also a viable option.
> > 
> > I'm saying I don't think it's not a good enough reason. Why is it different
> > then
> > other use-after free bugs? I feel like I'm missing something.
> By tdx_hold_page_on_error(), it could be implememented as on removal failure,
> invoke a guest_memfd interface to let guest_memfd know exact ranges still
> being
> under use by the TDX module due to unmapping failures.
> Do you think it's ok?

Either way is ok to me. It seems like we have three ok solutions. But the tone
of the thread is that we are solving some deep problem. Maybe I'm missing
something.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 5 months, 3 weeks ago
On Wed, Jun 18, 2025 at 08:41:38AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
> > > I don't think a potential bug in KVM is a good enough reason. If we are
> > > concerned can we think about a warning instead?
> > > 
> > > We had talked enhancing kasan to know when a page is mapped into S-EPT in
> > > the
> > > past. So rather than design around potential bugs we could focus on having a
> > > simpler implementation with the infrastructure to catch and fix the bugs.
> > However, if failing to remove a guest private page would only cause memory
> > leak,
> > it's fine. 
> > If TDX does not hold any refcount, guest_memfd has to know that which private
> > page is still mapped. Otherwise, the page may be re-assigned to other kernel
> > components while it may still be mapped in the S-EPT.
> 
> KASAN detects use-after-free's like that. However, the TDX module code is not
> instrumented. It won't check against the KASAN state for it's accesses.
> 
> I had a brief chat about this with Dave and Kirill. A couple ideas were
> discussed. One was to use page_ext to keep a flag that says the page is in-use
Thanks!

To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
similar to PAGE_EXT_YOUNG?
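A minimal sketch of what the setter might look like, assuming a new
PAGE_EXT_FIRMWARE_IN_USE bit (none of this exists today):

static void tdx_page_set_firmware_in_use(struct page *page, bool in_use)
{
	struct page_ext *page_ext = page_ext_get(page);

	if (!page_ext)
		return;

	if (in_use)
		set_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);
	else
		clear_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);

	page_ext_put(page_ext);
}

The SEAMCALL wrappers would call it with in_use=true when handing a page to
the TDX module and with in_use=false only after a successful removal or
reclaim.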

Due to similar issues as those with normal page/folio flags (see the next
comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
page-by-page basis rather than folio-by-folio.

Additionally, it seems reasonable for guest_memfd not to copy the
PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
(in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
PAGE_EXT_IDLE are copied to the new folios though).

Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
introduce a PG_firmware_in_use in page flags, similar to PG_young and PG_idle?

> by the TDX module. There was also some discussion of using a normal page flag,
> and that the reserved page flag might prevent some of the MM operations that
> would be needed on guestmemfd pages. I didn't see the problem when I looked.
> 
> For the solution, basically the SEAMCALL wrappers set a flag when they hand a
> page to the TDX module, and clear it when they successfully reclaim it via
> tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
> back to the page allocator, a warning is generated.
After some testing: to use a normal page flag, we may need to set it on a
page-by-page basis rather than folio-by-folio (see "Scheme 1" below for why
folio-by-folio breaks). And guest_memfd may need to selectively copy page flags
when splitting huge folios (see "Scheme 2").

Scheme 1: Set/unset page flag on folio-by-folio basis, i.e.
        - set folio reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
        - unset folio reserved after a successful tdh_mem_page_remove() or
          tdh_phymem_page_reclaim().

        It has a problem in the following scenario:
        1. tdh_mem_page_aug() adds a 2MB folio. It marks the folio as reserved
	   via "folio_set_reserved(page_folio(page))"

        2. convert a 4KB page of the 2MB folio to shared.
        2.1 tdh_mem_page_demote() is executed first.
       
        2.2 tdh_mem_page_remove() then removes the 4KB mapping.
            "folio_clear_reserved(page_folio(page))" clears reserved flag for
            the 2MB folio while the remaining 511 pages are still mapped in the
            S-EPT.

        2.3. guest_memfd splits the 2MB folio into 512 4KB folios.


Scheme 2: Set/unset page flag on page-by-page basis, i.e.
        - set page flag reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
        - unset page flag reserved after a successful tdh_mem_page_remove() or
          tdh_phymem_page_reclaim().

        It has a problem in the following scenario:
        1. tdh_mem_page_aug() adds a 2MB folio. It marks pages as reserved by
           invoking "SetPageReserved()" on each page.
           As folio->flags equals page[0]->flags, folio->flags also has the
           reserved flag set.

        2. convert a 4KB page of the 2MB folio to shared. say, it's page[4].
        2.1 tdh_mem_page_demote() is executed first.
       
        2.2 tdh_mem_page_remove() then removes the 4KB mapping.
            "ClearPageReserved()" clears reserved flag of page[4] of the 2MB
            folio.

        2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
             In guestmem_hugetlb_split_folio(), "p->flags = folio->flags" marks
             page[4]->flags as reserved again as page[0] is still reserved.

            (see the code in https://lore.kernel.org/all/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com/
            for (i = 1; i < orig_nr_pages; ++i) {
                struct page *p = folio_page(folio, i);

                /* Copy flags from the first page to split pages. */
                p->flags = folio->flags;

                p->mapping = NULL;
                clear_compound_head(p);
            }
            )

I'm not sure if "p->flags = folio->flags" can be removed. Currently, flags like
PG_unevictable are preserved via this step.

If we selectively copy flags, we may need to implement the following changes to
prevent the page from being available to the page allocator. Otherwise, the
"HugePages_Free" count will not decrease, and the same huge folio will continue
to be recycled (i.e., being allocated and consumed by other VMs).

diff --git a/mm/swap.c b/mm/swap.c
index 2747230ced89..72d8c53e2321 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -98,8 +98,36 @@ static void page_cache_release(struct folio *folio)
                unlock_page_lruvec_irqrestore(lruvec, flags);
 }

+static inline bool folio_is_reserved(struct folio *folio)
+{
+       long nr_pages = folio_nr_pages(folio);
+       long i;
+
+       for (i = 0; i < nr_pages; i++) {
+               if (!PageReserved(folio_page(folio, i)))
+                       continue;
+
+               return true;
+       }
+
+       return false;
+}
+
 static void free_typed_folio(struct folio *folio)
 {
@@ -118,6 +146,13 @@ static void free_typed_folio(struct folio *folio)

 void __folio_put(struct folio *folio)
 {
+       if (folio_is_reserved(folio)) {
+               VM_WARN_ON_FOLIO(folio_is_reserved(folio), folio);
+               return;
+       }
+
        if (unlikely(folio_is_zone_device(folio))) {
                free_zone_device_folio(folio);
                return;
@@ -986,6 +1021,12 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
                if (!folio_ref_sub_and_test(folio, nr_refs))
                        continue;

+               if (folio_is_reserved(folio)) {
+                       VM_WARN_ON_FOLIO(folio_is_reserved(folio), folio);
+                       continue;
+               }
+
                if (unlikely(folio_has_type(folio))) {
                        /* typed folios have their own memcg, if any */
                        if (lruvec) {


Besides, guest_memfd needs to reject converting to shared when a page is still
mapped in S-EPT.

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d71653e7e51e..6449151a3a69 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -553,6 +553,41 @@ static void kvm_gmem_convert_invalidate_end(struct inode *inode,
                kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
 }

+static bool kvm_gmem_has_invalid_folio(struct address_space *mapping, pgoff_t start,
+                                       size_t nr_pages)
+{
+       pgoff_t index = start, end = start + nr_pages;
+       bool ret = false;
+
+       while (index < end) {
+               struct folio *f;
+               long i = 0, nr;
+
+               f = filemap_get_folio(mapping, index);
+               if (IS_ERR(f)) {
+                       index++;
+                       continue;
+               }
+
+               if (f->index < start)
+                       i = start - f->index;
+
+               nr = folio_nr_pages(f);
+               if (f->index + folio_nr_pages(f) > end)
+                       nr -= f->index + folio_nr_pages(f) - end;
+
+               for (; i < nr; i++) {
+                       if (PageReserved(folio_page(f, i))) {
+                               ret = true;
+                               folio_put(f);
+                               goto out;
+                       }
+               }
+               index += folio_nr_pages(f);
+               folio_put(f);
+       }
+out:
+       return ret;
+}
 static int kvm_gmem_convert_should_proceed(struct inode *inode,
                                           struct conversion_work *work,
                                           bool to_shared, pgoff_t *error_index)
@@ -572,6 +607,12 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
                        if (ret)
                                return ret;
                        kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
+
+                       if (kvm_gmem_has_invalid_folio(inode->i_mapping, work->start,
+                                                      work->nr_pages)) {
+                               ret = -EFAULT;
+                       }
+
                }
        } else {
                unmap_mapping_pages(inode->i_mapping, work->start,



> Also it was mentioned that SGX did have a similar issue to what is being worried
> about here:
> https://lore.kernel.org/linux-sgx/aCYey1W6i7i3yPLL@gmail.com/T/#m86c8c4cf0e6b9a653bf0709a22bb360034a24d95
> 
> > 
> > 
> > > > 
> > > > > > 
> > > > > > This would allow guest_memfd to maintain an internal reference count
> > > > > > for
> > > > > > each
> > > > > > private GFN. TDX would call guest_memfd_add_page_ref_count() for
> > > > > > mapping
> > > > > > and
> > > > > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > > > > truncating
> > > > > > a private page from the filemap, guest_memfd could increase the real
> > > > > > folio
> > > > > > reference count based on its internal reference count for the private
> > > > > > GFN.
> > > > > 
> > > > > What does this get us exactly? This is the argument to have less error
> > > > > prone
> > > > > code that can survive forgetting to refcount on error? I don't see that
> > > > > it
> > > > > is an
> > > > > especially special case.
> > > > Yes, for a less error prone code.
> > > > 
> > > > If this approach is considered too complex for an initial implementation,
> > > > using
> > > > tdx_hold_page_on_error() is also a viable option.
> > > 
> > > I'm saying I don't think it's not a good enough reason. Why is it different
> > > then
> > > other use-after free bugs? I feel like I'm missing something.
> > With tdx_hold_page_on_error(), it could be implemented so that on removal
> > failure, TDX invokes a guest_memfd interface to let guest_memfd know the
> > exact ranges still in use by the TDX module due to unmapping failures.
> > Do you think it's ok?
> 
> Either way is ok to me. It seems like we have three ok solutions. But the tone
> of the thread is that we are solving some deep problem. Maybe I'm missing
> something.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 5 months, 3 weeks ago
On Mon, 2025-06-23 at 17:27 +0800, Yan Zhao wrote:
> To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
> similar to PAGE_EXT_YOUNG?
> 
> Due to similar issues as those with normal page/folio flags (see the next
> comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
> page-by-page basis rather than folio-by-folio.
> 
> Additionally, it seems reasonable for guest_memfd not to copy the
> PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
> (in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
> PAGE_EXT_IDLE are copied to the new folios though).
> 
> Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
> introduce a PG_firmware_in_use in page flags, similar to PG_young and PG_idle?

Page flags are a scarce resource. If we could have used an existing one, it
would have been nice. But otherwise, I would guess the use case is not strong
enough to justify adding one.

So PAGE_EXT_FIRMWARE_IN_USE is probably a better way to go. Due to the memory
use, it would have to be a debug config like the others. If we have line of
sight to a solution, how do you feel about the following direction to move past
this issue:
1. Go with refcount on error approach for now (i.e. tdx_hold_page_on_error())
2. In a pfn-only future, plan to switch to guestmemfd callback instead of
tdx_hold_page_on_error(). We don't understand the pfn-only feature enough to
properly design for it anyway.
3. Plan for PAGE_EXT_FIRMWARE_IN_USE as follow-on work to huge pages. The
reason it should not be required before huge pages is that it is not necessary
for correct code, only to catch incorrect code slipping in.

That is based on the assessment that the effort to change the zap path to
communicate failure is too much churn. Do you happen to have a diffstat for a
POC on this BTW?
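
For reference, a rough sketch of what the refcount-on-error approach in point
1 could look like (tdx_hold_page_on_error() is not in the posted series; the
name and placement below are assumptions):

        /*
         * Hypothetical helper, called only when tdh_mem_page_remove() or
         * tdh_phymem_page_reclaim() fails: take an extra reference per 4KB
         * page so the page can never be handed back to the page allocator
         * while the TDX module may still be tracking it.
         */
        static void tdx_hold_page_on_error(struct kvm *kvm, struct page *page,
                                           int level)
        {
                pr_err("leaking page %p: possibly still in use by the TDX module\n",
                       page);
                folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
        }

Compared to taking a reference at map time, this keeps the common path
refcount-free and only pins the pages on failure.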

> 
> > by the TDX module. There was also some discussion of using a normal page flag,
> > and that the reserved page flag might prevent some of the MM operations that
> > would be needed on guestmemfd pages. I didn't see the problem when I looked.
> > 
> > For the solution, basically the SEAMCALL wrappers set a flag when they hand a
> > page to the TDX module, and clear it when they successfully reclaim it via
> > tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
> > back to the page allocator, a warning is generated.
> After some testing, to use a normal page flag, we may need to set it on a
> page-by-page basis rather than folio-by-folio. See "Scheme 1".
> And guest_memfd may need to selectively copy page flags when splitting huge
> folios. See "Scheme 2".

With page_ext, it seems we could have it be per page from the beginning?
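
For what it's worth, a rough sketch of how a per-page page_ext flag could be
set/cleared by the SEAMCALL wrappers, modeled on PAGE_EXT_YOUNG/PAGE_EXT_IDLE
(PAGE_EXT_FIRMWARE_IN_USE and the helper below are hypothetical):

        static void tdx_page_set_firmware_in_use(struct page *page, bool in_use)
        {
                struct page_ext *page_ext = page_ext_get(page);

                /* page_ext may be unavailable, e.g. if the config is off. */
                if (!page_ext)
                        return;

                if (in_use)
                        set_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);
                else
                        clear_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);

                page_ext_put(page_ext);
        }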
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Ackerley Tng 6 months, 1 week ago
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
>> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> >> > ...
>> >> > >
>> >> > > I might be wrongly throwing out some terminologies here then.
>> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
>> >> > > udmabuf seems to be working with pinned "folios" in the backend.
>> >> > >
>> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
>> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
>> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
>> >> > OK. So from point of the reset part of kernel, those pfns are not regarded as
>> >> > memory.
>> >> >
>> >> > > shareability changes, the users will get notified and will have to
>> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
>> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
>> >> > > special handling/lack of page structs.
>> >> > My concern is a failable invalidation notifer may not be ideal.
>> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
>> >> > start shareabilitiy changes, with a failable invalidation notifier, some users
>> >> > may fail the invalidation and the shareability change, even after other users
>> >> > have successfully unmapped a range.
>> >>
>> >> Even if one user fails to invalidate its mappings, I don't see a
>> >> reason to go ahead with shareability change. Shareability should not
>> >> change unless all existing users let go of their soon-to-be-invalid
>> >> view of memory.
>> 
>> Hi Yan,
>> 
>> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> series [1], we took into account conversion failures too. The steps are
>> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> series from GitHub [2] because the steps for conversion changed in two
>> separate patches.)
>> 
>> We do need to handle errors across ranges to be converted, possibly from
>> different memslots. The goal is to either have the entire conversion
>> happen (including page split/merge) or nothing at all when the ioctl
>> returns.
>> 
>> We try to undo the restructuring (whether split or merge) and undo any
>> shareability changes on error (barring ENOMEM, in which case we leave a
>> WARNing).
> As the undo can fail (as the case you leave a WARNing, in patch 38 in [1]), it
> can lead to WARNings in kernel with folios not being properly added to the
> filemap.
>

I'm not sure how else to handle errors on the rollback path. I've hopefully
addressed this on the other thread at [1].

>> The part we don't restore is the presence of the pages in the host or
>> guest page tables. For that, our idea is that if unmapped, the next
>> access will just map it in, so there's no issue there.
>
> I don't think so.
>
> As in patch 38 in [1], on failure, it may fail to
> - restore the shareability
> - restore the folio's filemap status
> - restore the folio's hugetlb stash metadata
> - restore the folio's merged/split status
>

The plan is that we try our best to restore shareability, filemap status, and
restructuring (aka split/merge, including stash metadata), except when the
rollback itself fails.

> Also, the host page table is not restored.
>
>

This is by design, the host page tables can be re-populated on the next
fault. I've hopefully addressed this on the other thread at [1].

>> > My thinking is that:
>> >
>> > 1. guest_memfd starts shared-to-private conversion
>> > 2. guest_memfd sends invalidation notifications
>> >    2.1 invalidate notification --> A --> Unmap and return success
>> >    2.2 invalidate notification --> B --> Unmap and return success
>> >    2.3 invalidate notification --> C --> return failure
>> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
>> >    shareability as shared
>> >
>> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
>> > 2.2. Even if additional notifications could be sent to A and B to ask for
>> > mapping the GFN back, the map operation might fail. Consequently, A and B might
>> > not be able to restore the mapped status of the GFN.
>> 
>> For conversion we don't attempt to restore mappings anywhere (whether in
>> guest or host page tables). What do you think of not restoring the
>> mappings?
> It could cause problem if the mappings in S-EPT can't be restored.
>
> For TDX private-to-shared conversion, if kvm_gmem_convert_should_proceed() -->
> kvm_gmem_unmap_private() --> kvm_mmu_unmap_gfn_range() fails in the end, then
> the GFN shareability is restored to private. The next guest access to
> the partially unmapped private memory can meet a fatal error: "access before
> acceptance".
>
> It could occur in such a scenario:
> 1. TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared
> 2. Conversion fails in KVM.
> 3. set_memory_decrypted() fails in TD.
> 4. TD thinks the GFN is still accepted as private and accesses it.
>
>

This is true. I was thinking that this isn't handled solely in conversion but
as part of the contract between the userspace VMM and the guest: the guest
must handle conversion failures. I've hopefully addressed this on the other
thread at [1].

>> > For IOMMU mappings, this
>> > could result in DMAR failure following a failed attempt to do shared-to-private
>> > conversion.
>> 
>> I believe the current conversion setup guards against this because after
>> unmapping from the host, we check for any unexpected refcounts.
> Right, it's fine if we check for any unexpected refcounts.
>
>
>> (This unmapping is not the unmapping we're concerned about, since this is
>> shared memory, and unmapping doesn't go through TDX.)
>> 
>> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
>> are "unexpected". The conversion ioctl will return to userspace with an
>> error.
>> 
>> IO can continue to happen, since the memory is still mapped in the
>> IOMMU. The memory state is still shared. No issue there.
>> 
>> In RFCv2 [1], we expect userspace to see the error, then try and remove
>> the memory from the IOMMU, and then try conversion again.
> I don't think it's right to depend on that userspace could always perform in 
> kernel's expected way, i.e. trying conversion until it succeeds.
>

Let me think more deeply about this. Please let me know if there's
anything I missed.

It is true that a buggy or malicious userspace VMM can ignore conversion
failures and report success to the guest, but if both the userspace VMM
and guest are malicious, it's quite hard for the kernel to defend
against that.

As long as there's no point where the guest can crash the host in a fixed
way, I think it is okay to rely on a userspace VMM and guest protocol.

IIUC the guest can crash the host (original point of having guest_memfd)
if the guest can convince the host to write to private memory. For that
to happen, the memory must be faulted into the Secure EPTs, and the
shareability state must be ALL for the host to fault it in.

So to have this issue, the conversion failure must be such that the
memory remains faulted into the Secure EPTs while shareability is
shared. Since unmapping from secure EPTs happens pretty early before any
shareability is changed or any rollback (and rollback failures) can
happen, I think we should be quite safe?

If unmapping of private memory fails, this is where I think guest_memfd
should get an error from the unmap and it should not proceed to change
shareability.


> We need to restore to the previous status (which includes the host page table)
> if conversion can't be done.

Most of the previous status (shareability, filemap, restructuring (aka
split/merge, including stash metadata)) is restored, except during rollback
failures.

As for presence in host page tables, is it okay to defer that till the
next fault, and if not okay, why not?

For presence in guest page tables, is it okay to fall back on the
protocol where the guest must handle conversion failures, and if not
okay, why not?

> That said, in my view, a better flow would be:
>
> 1. guest_memfd sends a pre-invalidation request to users (users here means the
>    consumers in kernel of memory allocated from guest_memfd).
>
> 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
>    proceed. For example, in the case of TDX, this might involve memory
>    allocation and page splitting.
>
> 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
>    proceeds by sending the actual invalidation request.
>
> 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
>    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
>    In such cases, TDX can callback guest_memfd to inform the poison-status of
>    the page or elevate the page reference count.
>
> 5. guest_memfd completes the invalidation process. If the memory is marked as
>    "poison," guest_memfd can handle it accordingly. If the page has an elevated
>    reference count, guest_memfd may not need to take special action, as the
>    elevated count prevents the OS from reallocating the page.
>    (but from your reply below, seems a callback to guest_memfd is a better
>    approach).
>
>

Thanks for this, I've tried to combine this into my response at
[1]. I think this works, but it's hard because

a. Pre-checks are hard to check (explained at [1])
b. Even after all the checks, unmapping can still fail, and those still
   have to be handled, and to handle those, we have to buy into the
   userspace VMM/guest protocol, so why not just buy into the protocol
   to start with?

[1] https://lore.kernel.org/all/diqztt4uhunj.fsf@ackerleytng-ctop.c.googlers.com/

>> The part in concern here is unmapping failures of private pages, for
>> private-to-shared conversions, since that part goes through TDX and
>> might fail.
> IMO, even for TDX, the real unmap must not fail unless there are bugs in the KVM
> or TDX module.
> So, for page splitting in S-EPT, I prefer to try splitting in the
> pre-invalidation phase before conducting any real unmap.
>
>

Thanks for your detailed suggestion.

>> One other thing about taking refcounts is that in RFCv2,
>> private-to-shared conversions assume that there are no refcounts on the
>> private pages at all. (See filemap_remove_folio_for_restructuring() in
>> [3])
>>
>> Haven't had a chance to think about all the edge cases, but for now I
>> think on unmapping failure, in addition to taking a refcount, we should
>> return an error at least up to guest_memfd, so that guest_memfd could
>> perhaps keep the refcount on that page, but drop the page from the
>> filemap. Another option could be to track messed up addresses and always
>> check that on conversion or something - not sure yet.
>
> It looks good to me. See the bullet 4 in my proposed flow above.
>

Thanks again for your detailed suggestion.

>> Either way, guest_memfd must know. If guest_memfd is not informed, on a
>> next conversion request, the conversion will just spin in
>> filemap_remove_folio_for_restructuring().
> It makes sense.
>
>
>> What do you think of this part about informing guest_memfd of the
>> failure to unmap?
> So, do you want to add a guest_memfd callback to achieve this purpose?
>

I will need to think the entire thing through, but I meant informing as in
returning an error to guest_memfd so that guest_memfd knows. I think
returning an error should be the first course of action.

As for whether guest_memfd should know how to handle the error or
whether the userspace VMM should participate in deciding what to do with
the error, I'm not sure. If you have suggestions on this, I hope we can
combine the suggestions about the conversion protocol on the other thread.

Regarding a callback, are you thinking of something like not having the unmap
return an error, but instead having TDX call a function like
kvm_gmem_error_at_offset(loff_t offset), with guest_memfd recording that
somewhere, and then, immediately after calling unmap, guest_memfd checking
kvm_gmem_was_there_an_error_in_range() to determine whether there was an
error? Something like that?

I guess it could work but feels a little odd.
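
To make the shape concrete, a very rough sketch of that record-then-check
idea (all names are hypothetical, and the per-inode xarray is just one way to
record the offsets):

        /* Called by TDX when removal of a private page fails. */
        void kvm_gmem_error_at_offset(struct inode *inode, loff_t offset)
        {
                struct kvm_gmem_inode *gi = inode->i_private;   /* assumed */

                xa_store(&gi->error_offsets, offset >> PAGE_SHIFT,
                         xa_mk_value(1), GFP_KERNEL);
        }

        /* Checked by guest_memfd right after the unmap call. */
        bool kvm_gmem_was_there_an_error_in_range(struct inode *inode,
                                                  pgoff_t start, pgoff_t end)
        {
                struct kvm_gmem_inode *gi = inode->i_private;   /* assumed */
                unsigned long index = start;

                return xa_find(&gi->error_offsets, &index, end - 1,
                               XA_PRESENT) != NULL;
        }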

>
> BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range()
> and mmu_notifier_invalidate_range_start() fail, based on the repo
> https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
> 'hardening-v6.16-rc1-fix1-take2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux")

Thank you, I appreciate the effort you took to enumerate these. The
following suggestions are based on my current understanding. I don't
have time in the near future to do the plumbing to test out the
suggestion, but for now I want to see if it makes sense; maybe you can
correct any misunderstandings first.

>
> 1. Status of mmu notifier
> -------------------------------
> (1) There're 34 direct callers of mmu_notifier_invalidate_range_start().
>     1. clear_refs_write
>     2. do_pagemap_scan
>     3. uprobe_write_opcode
>     4. do_huge_zero_wp_pmd
>     5. __split_huge_pmd (N)
>     6. __split_huge_pud (N)
>     7. move_pages_huge_pmd
>     8. copy_hugetlb_page_range
>     9. hugetlb_unshare_pmds  (N)
>     10. hugetlb_change_protection
>     11. hugetlb_wp
>     12. unmap_hugepage_range (N)
>     13. move_hugetlb_page_tables
>     14. collapse_huge_page
>     15. retract_page_tables
>     16. collapse_pte_mapped_thp
>     17. write_protect_page
>     18. replace_page
>     19. madvise_free_single_vma
>     20. wp_clean_pre_vma
>     21. wp_page_copy 
>     22. zap_page_range_single_batched (N)
>     23. unmap_vmas (N)
>     24. copy_page_range 
>     25. remove_device_exclusive_entry
>     26. migrate_vma_collect
>     27. __migrate_device_pages
>     28. change_pud_range 
>     29. move_page_tables
>     30. page_vma_mkclean_one
>     31. try_to_unmap_one
>     32. try_to_migrate_one
>     33. make_device_exclusive
>     34. move_pages_pte
>
> Of these 34 direct callers, those marked with (N) cannot tolerate
> mmu_notifier_invalidate_range_start() failing. I have not yet investigated all
> 34 direct callers one by one, so the list of (N) is incomplete.
>
> For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
> "Note that split_huge_pmd() doesn't have any limitations on refcounting:
> pmd can be split at any point and never fails." This is because split_huge_pmd()
> serves as a graceful fallback design for code walking pagetables but unaware
> about huge pmds.
>
>

Do these callers, especially those with (N), ever try to unmap any TDX
private pages? guest_memfd only gives shared pages to core-mm, so for
shared pages, there will continue to be no chance of errors.

If we change mmu_notifier_invalidate_range_start() to return an int, all
of the callers that never invalidate private pages can continue to safely
rely on the fact that mmu_notifier_invalidate_range_start() will return
0.

For the callers of mmu_notifier_invalidate_range_start() that may touch
private pages, I believe that's only guest_memfd and KVM. That's where
we want the error, and will handle the error.
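
A sketch of the shape of that change (hypothetical: today the notifier
returns void, so only the guest_memfd/KVM private-page paths would be taught
to act on a non-zero return):

        int ret = mmu_notifier_invalidate_range_start(&range);

        /* Unreachable for ranges that can only contain shared pages. */
        if (WARN_ON_ONCE(ret))
                return ret;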

Another point here is that I was thinking of putting EPT splitting together
with the actual unmapping instead of with invalidation, because we will
probably invalidate more than we unmap (see the explanation at [1] about the
race). Maybe moving EPT splitting to unmap could help?

> (2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
> __oom_reap_task_mm(), which only expects the error -EAGAIN.
>
> In mn_hlist_invalidate_range_start():
> "WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"
>
>
> (3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In that
> case, they don't need to register mmu notifier.
>
> Or, device drivers can pin pages via get_user_pages*(), and register for mmu         
> notifier callbacks for the memory range. Then, upon receiving a notifier         
> "invalidate range" callback , stop the device from using the range, and unpin    
> the pages.
>
> See Documentation/core-api/pin_user_pages.rst.
>
>

Do you mean that we should teach device drivers to get callbacks for
private pages? Are you looking ahead to handle TDX IO on private pages?
So far we haven't handled that yet.

> 2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
> -------------------------------
> (1) Error fallback cases.
>
>     1. split_huge_pmd() as mentioned in Documentation/mm/transhuge.rst.
>        split_huge_pmd() is designed as a graceful fallback without failure.
>
>        split_huge_pmd
>         |->__split_huge_pmd
>            |->mmu_notifier_range_init
>            |  mmu_notifier_invalidate_range_start
>            |  split_huge_pmd_locked
>            |  mmu_notifier_invalidate_range_end
>
>
>     2. in fs/iomap/buffered-io.c, iomap_write_failed() itself is error handling.
>        iomap_write_failed
>          |->truncate_pagecache_range
>             |->unmap_mapping_range
>             |  |->unmap_mapping_pages
>             |     |->unmap_mapping_range_tree
>             |        |->unmap_mapping_range_vma
>             |           |->zap_page_range_single
>             |              |->zap_page_range_single_batched
>             |                       |->mmu_notifier_range_init
>             |                       |  mmu_notifier_invalidate_range_start
>             |                       |  unmap_single_vma
>             |                       |  mmu_notifier_invalidate_range_end
>             |->truncate_inode_pages_range
>                |->truncate_cleanup_folio
>                   |->if (folio_mapped(folio))
>                   |     unmap_mapping_folio(folio);
>                          |->unmap_mapping_range_tree
>                             |->unmap_mapping_range_vma
>                                |->zap_page_range_single
>                                   |->zap_page_range_single_batched
>                                      |->mmu_notifier_range_init
>                                      |  mmu_notifier_invalidate_range_start
>                                      |  unmap_single_vma
>                                      |  mmu_notifier_invalidate_range_end
>
>    3. in mm/memory.c, zap_page_range_single() is invoked to handle error.
>       remap_pfn_range_notrack
>         |->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
>         |  if (!error)
>         |      return 0;
> 	|  zap_page_range_single
>            |->zap_page_range_single_batched
>               |->mmu_notifier_range_init
>               |  mmu_notifier_invalidate_range_start
>               |  unmap_single_vma
>               |  mmu_notifier_invalidate_range_end
>
>    4. in kernel/events/core.c, zap_page_range_single() is invoked to clear any
>       partial mappings on error.
>
>       perf_mmap
>         |->ret = map_range(rb, vma);
>                  |  err = remap_pfn_range
>                  |->if (err) 
>                  |     zap_page_range_single
>                         |->zap_page_range_single_batched
>                            |->mmu_notifier_range_init
>                            |  mmu_notifier_invalidate_range_start
>                            |  unmap_single_vma
>                            |  mmu_notifier_invalidate_range_end
>
>
>    5. in mm/memory.c, unmap_mapping_folio() is invoked to unmap posion page.
>
>       __do_fault
> 	|->if (unlikely(PageHWPoison(vmf->page))) { 
> 	|	vm_fault_t poisonret = VM_FAULT_HWPOISON;
> 	|	if (ret & VM_FAULT_LOCKED) {
> 	|		if (page_mapped(vmf->page))
> 	|			unmap_mapping_folio(folio);
>         |                       |->unmap_mapping_range_tree
>         |                          |->unmap_mapping_range_vma
>         |                             |->zap_page_range_single
>         |                                |->zap_page_range_single_batched
>         |                                   |->mmu_notifier_range_init
>         |                                   |  mmu_notifier_invalidate_range_start
>         |                                   |  unmap_single_vma
>         |                                   |  mmu_notifier_invalidate_range_end
> 	|		if (mapping_evict_folio(folio->mapping, folio))
> 	|			poisonret = VM_FAULT_NOPAGE; 
> 	|		folio_unlock(folio);
> 	|	}
> 	|	folio_put(folio);
> 	|	vmf->page = NULL;
> 	|	return poisonret;
> 	|  }
>
>
>   6. in mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
>      partial mapping done by a device driver.
>
>      __mmap_new_vma
>        |->__mmap_new_file_vma(map, vma);
>           |->error = mmap_file(vma->vm_file, vma);
>           |  if (error)
>           |     unmap_region
>                  |->unmap_vmas
>                     |->mmu_notifier_range_init
>                     |  mmu_notifier_invalidate_range_start
>                     |  unmap_single_vma
>                     |  mmu_notifier_invalidate_range_end
>
>

These should probably not ever be invalidating or unmapping private pages.

> (2) No-fail cases
> -------------------------------
> 1. iput() cannot fail. 
>
> iput
>  |->iput_final
>     |->WRITE_ONCE(inode->i_state, state | I_FREEING);
>     |  inode_lru_list_del(inode);
>     |  evict(inode);
>        |->op->evict_inode(inode);
>           |->shmem_evict_inode
>              |->shmem_truncate_range
>                 |->truncate_inode_pages_range
>                    |->truncate_cleanup_folio
>                       |->if (folio_mapped(folio))
>                       |     unmap_mapping_folio(folio);
>                             |->unmap_mapping_range_tree
>                                |->unmap_mapping_range_vma
>                                   |->zap_page_range_single
>                                      |->zap_page_range_single_batched
>                                         |->mmu_notifier_range_init
>                                         |  mmu_notifier_invalidate_range_start
>                                         |  unmap_single_vma
>                                         |  mmu_notifier_invalidate_range_end
>
>
> 2. exit_mmap() cannot fail
>
> exit_mmap
>   |->mmu_notifier_release(mm);
>      |->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
>         |->mmu_notifier_range_init
>         |  mmu_notifier_invalidate_range_start
>         |  unmap_single_vma
>         |  mmu_notifier_invalidate_range_end
>
>

These should probably not ever be invalidating or unmapping private pages.

> 3. KVM Cases That Cannot Tolerate Unmap Failure
> -------------------------------
> Allowing unmap operations to fail in the following scenarios would make it very
> difficult or even impossible to handle the failure:
>
> (1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow page
> without expecting any failure.
>
> mmu_alloc_direct_roots
>   |->mmu_alloc_root
>      |->kvm_mmu_get_shadow_page
>         |->__kvm_mmu_get_shadow_page
>            |->kvm_mmu_alloc_shadow_page
>               |->account_shadowed
>                  |->kvm_mmu_slot_gfn_write_protect
>                     |->kvm_tdp_mmu_write_protect_gfn
>                        |->write_protect_gfn
>                           |->tdp_mmu_iter_set_spte
>
>

I need to learn more about shadow pages but IIUC TDX doesn't use shadow
pages so this path won't interact with unmapping private pages.

> (2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail
>
> kvm_vfio_release/kvm_vfio_file_del
>  |->kvm_vfio_update_coherency
>     |->kvm_arch_unregister_noncoherent_dma
>        |->kvm_noncoherent_dma_assignment_start_or_stop
>           |->kvm_zap_gfn_range
>              |->kvm_tdp_mmu_zap_leafs
>                 |->tdp_mmu_zap_leafs
>                    |->tdp_mmu_iter_set_spte
>
>

I need to learn more about VFIO but for now IIUC IO uses shared pages,
so this path won't interact with unmapping private pages.

> (3) There're lots of callers of __kvm_set_or_clear_apicv_inhibit() currently
> never expect failure of unmap.
>
> __kvm_set_or_clear_apicv_inhibit
>   |->kvm_zap_gfn_range
>      |->kvm_tdp_mmu_zap_leafs
>         |->tdp_mmu_zap_leafs
>            |->tdp_mmu_iter_set_spte
>
>
>

There could be some TDX specific things such that TDX doesn't use this
path.

> 4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (update SPTE with
> write mmu_lock) failable.
>
> (1) kvm_vcpu_flush_tlb_guest()
>
> kvm_vcpu_flush_tlb_guest
>   |->kvm_mmu_sync_roots
>      |->mmu_sync_children
>         |->kvm_vcpu_write_protect_gfn
>            |->kvm_mmu_slot_gfn_write_protect
>               |->kvm_tdp_mmu_write_protect_gfn
>                  |->write_protect_gfn
>                     |->tdp_mmu_iter_set_spte
>                        |->tdp_mmu_set_spte
>
>
> (2) handle_removed_pt() and handle_changed_spte().
>

Thank you so much for looking into these. I'm hoping that the number of
cases where TDX private pages are unmapped is really limited to a few
paths that we have to rework.

If we agree that the error has to be handled, then regardless of how we
let the caller know that an error happened, all paths touching TDX
private pages have to be reworked.

Between (1) returning an error and (2) marking the error and having the
caller check for it, it's probably better to use the standard approach of
returning an error, since it is better understood and there's no need for
extra data structures?

>
> Thanks
> Yan
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Yan Zhao 6 months ago
On Thu, Jun 05, 2025 at 02:12:58PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> >> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> >> > ...
> >> >> > >
> >> >> > > I might be wrongly throwing out some terminologies here then.
> >> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> >> >> > > udmabuf seems to be working with pinned "folios" in the backend.
> >> >> > >
> >> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
> >> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> >> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> >> >> > OK. So from point of the reset part of kernel, those pfns are not regarded as
> >> >> > memory.
> >> >> >
> >> >> > > shareability changes, the users will get notified and will have to
> >> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> >> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> >> >> > > special handling/lack of page structs.
> >> >> > My concern is a failable invalidation notifer may not be ideal.
> >> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> >> >> > start shareabilitiy changes, with a failable invalidation notifier, some users
> >> >> > may fail the invalidation and the shareability change, even after other users
> >> >> > have successfully unmapped a range.
> >> >>
> >> >> Even if one user fails to invalidate its mappings, I don't see a
> >> >> reason to go ahead with shareability change. Shareability should not
> >> >> change unless all existing users let go of their soon-to-be-invalid
> >> >> view of memory.
> >> 
> >> Hi Yan,
> >> 
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> >> 
> >> We do need to handle errors across ranges to be converted, possibly from
> >> different memslots. The goal is to either have the entire conversion
> >> happen (including page split/merge) or nothing at all when the ioctl
> >> returns.
> >> 
> >> We try to undo the restructuring (whether split or merge) and undo any
> >> shareability changes on error (barring ENOMEM, in which case we leave a
> >> WARNing).
> > As the undo can fail (as the case you leave a WARNing, in patch 38 in [1]), it
> > can lead to WARNings in kernel with folios not being properly added to the
> > filemap.
> >
> 
> I'm not sure how else to handle errors on rollback path. I've hopefully
> addressed this on the other thread at [1].
I'll reply to [1].

Please also check my reply and proposal at [2].

> >> The part we don't restore is the presence of the pages in the host or
> >> guest page tables. For that, our idea is that if unmapped, the next
> >> access will just map it in, so there's no issue there.
> >
> > I don't think so.
> >
> > As in patch 38 in [1], on failure, it may fail to
> > - restore the shareability
> > - restore the folio's filemap status
> > - restore the folio's hugetlb stash metadata
> > - restore the folio's merged/split status
> >
> 
> The plan is that we try our best to restore shareability, filemap,
> restructuring (aka split/merge, including stash metadata) other than
> failures on rollback.
> 
> > Also, the host page table is not restored.
> >
> >
> 
> This is by design, the host page tables can be re-populated on the next
> fault. I've hopefully addressed this on the other thread at [1].
This is not. Please check my reply to [1].


> >> > My thinking is that:
> >> >
> >> > 1. guest_memfd starts shared-to-private conversion
> >> > 2. guest_memfd sends invalidation notifications
> >> >    2.1 invalidate notification --> A --> Unmap and return success
> >> >    2.2 invalidate notification --> B --> Unmap and return success
> >> >    2.3 invalidate notification --> C --> return failure
> >> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
> >> >    shareability as shared
> >> >
> >> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> >> > 2.2. Even if additional notifications could be sent to A and B to ask for
> >> > mapping the GFN back, the map operation might fail. Consequently, A and B might
> >> > not be able to restore the mapped status of the GFN.
> >> 
> >> For conversion we don't attempt to restore mappings anywhere (whether in
> >> guest or host page tables). What do you think of not restoring the
> >> mappings?
> > It could cause problem if the mappings in S-EPT can't be restored.
> >
> > For TDX private-to-shared conversion, if kvm_gmem_convert_should_proceed() -->
> > kvm_gmem_unmap_private() --> kvm_mmu_unmap_gfn_range() fails in the end, then
> > the GFN shareability is restored to private. The next guest access to
> > the partially unmapped private memory can meet a fatal error: "access before
> > acceptance".
> >
> > It could occur in such a scenario:
> > 1. TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared
> > 2. Conversion fails in KVM.
> > 3. set_memory_decrypted() fails in TD.
> > 4. TD thinks the GFN is still accepted as private and accesses it.
> >
> >
> 
> This is true, I was thinking that this isn't handled solely in
> conversion but by being part of the contract between userspace VMM and
> the guest, that guest must handle conversion failures. I've hopefully
> addressed this on the other thread at [1].
> 
> >> > For IOMMU mappings, this
> >> > could result in DMAR failure following a failed attempt to do shared-to-private
> >> > conversion.
> >> 
> >> I believe the current conversion setup guards against this because after
> >> unmapping from the host, we check for any unexpected refcounts.
> > Right, it's fine if we check for any unexpected refcounts.
> >
> >
> >> (This unmapping is not the unmapping we're concerned about, since this is
> >> shared memory, and unmapping doesn't go through TDX.)
> >> 
> >> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
> >> are "unexpected". The conversion ioctl will return to userspace with an
> >> error.
> >> 
> >> IO can continue to happen, since the memory is still mapped in the
> >> IOMMU. The memory state is still shared. No issue there.
> >> 
> >> In RFCv2 [1], we expect userspace to see the error, then try and remove
> >> the memory from the IOMMU, and then try conversion again.
> > I don't think it's right to depend on that userspace could always perform in 
> > kernel's expected way, i.e. trying conversion until it succeeds.
> >
> 
> Let me think more deeply about this. Please let me know if there's
> anything I missed.
> 
> It is true that a buggy or malicious userspace VMM can ignore conversion
> failures and report success to the guest, but if both the userspace VMM
> and guest are malicious, it's quite hard for the kernel to defend
> against that.
Hmm, doesn't expecting userspace to retry the conversion endlessly exceed
what is reasonable for a cooperative userspace?

> I think as long as there's no point where the guest can crash the host
> in a fixed way, I think it is okay to rely on a userspace VMM and guest
> protocol.
> 
> IIUC the guest can crash the host (original point of having guest_memfd)
> if the guest can convince the host to write to private memory. For that
How so?
Unless the host kernel wants to crash itself, I don't think allowing the
guest to crash the host is acceptable.
If you happen to know of such a case, please let us know and we'll fix it.

> to happen, the memory must be faulted into the Secure EPTs, and the
> shareability state must be ALL for the host to fault it in.
> 
> So to have this issue, the conversion failure must be such that the
> memory remains faulted into the Secure EPTs while shareability is
> shared. Since unmapping from secure EPTs happens pretty early before any
> shareability is changed or any rollback (and rollback failures) can
> happen, I think we should be quite safe?
It's not safe if unmapping from the secure EPT fails while the shareability is
changed to shared.


> If unmapping of private memory fails, this is where I think guest_memfd
> should get an error from the unmap and it should not proceed to change
> shareability.
Please check if my proposal at [2] is agreeable.

> 
> > We need to restore to the previous status (which includes the host page table)
> > if conversion can't be done.
> 
> Most of the previous status (shareability, filemap,
> restructuring (aka split/merge, including stash metadata)) are restored
> other than during rollback failures.
However, an error during the rollback is unacceptable.


> As for presence in host page tables, is it okay to defer that till the
> next fault, and if not okay, why not?
If the host page tables involve only shared mappings in the primary MMU
and shared EPT, it's ok.


> For presence in guest page tables, is it okay to fall back on the
> protocol where the guest must handle conversion failures, and if not
> okay, why not?
Hmm, whether or not to roll back the guest page table after a conversion
failure is the guest OS's business.

However, KVM can't rely on the guest assuming that the page state is shared
even after a private-to-shared conversion failure.


> > That said, in my view, a better flow would be:
> >
> > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> >    consumers in kernel of memory allocated from guest_memfd).
> >
> > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> >    proceed. For example, in the case of TDX, this might involve memory
> >    allocation and page splitting.
> >
> > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> >    proceeds by sending the actual invalidation request.
> >
> > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> >    the page or elevate the page reference count.
> >
> > 5. guest_memfd completes the invalidation process. If the memory is marked as
> >    "poison," guest_memfd can handle it accordingly. If the page has an elevated
> >    reference count, guest_memfd may not need to take special action, as the
> >    elevated count prevents the OS from reallocating the page.
> >    (but from your reply below, seems a callback to guest_memfd is a better
> >    approach).
> >
> >
> 
> Thanks for this, I've tried to combine this into my response at
> [1]. I think this works, but it's hard because
> 
> a. Pre-checks are hard to check (explained at [1])
Please check whether the pre-checks in my POC [2] are good.
I tested it for the case of a TDX unmapping failure: it does not change the
shareability if splitting or zapping fails.


> b. Even after all the checks, unmapping can still fail, and those still
>    have to be handled, and to handle those, we have to buy into the
>    userspace VMM/guest protocol, so why not just buy into the protocol
>    to start with?
In my POC [2], the outcome of an unmapping failure is to leak the pages.
Please check whether it looks good to you.

> [1] https://lore.kernel.org/all/diqztt4uhunj.fsf@ackerleytng-ctop.c.googlers.com/

[2] https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com/

> >> The part in concern here is unmapping failures of private pages, for
> >> private-to-shared conversions, since that part goes through TDX and
> >> might fail.
> > IMO, even for TDX, the real unmap must not fail unless there are bugs in the KVM
> > or TDX module.
> > So, for page splitting in S-EPT, I prefer to try splitting in the
> > pre-invalidation phase before conducting any real unmap.
> >
> >
> 
> Thanks for your detailed suggestion.
> 
> >> One other thing about taking refcounts is that in RFCv2,
> >> private-to-shared conversions assume that there are no refcounts on the
> >> private pages at all. (See filemap_remove_folio_for_restructuring() in
> >> [3])
> >>
> >> Haven't had a chance to think about all the edge cases, but for now I
> >> think on unmapping failure, in addition to taking a refcount, we should
> >> return an error at least up to guest_memfd, so that guest_memfd could
> >> perhaps keep the refcount on that page, but drop the page from the
> >> filemap. Another option could be to track messed up addresses and always
> >> check that on conversion or something - not sure yet.
> >
> > It looks good to me. See the bullet 4 in my proposed flow above.
> >
> 
> Thanks again for your detailed suggestion.
> 
> >> Either way, guest_memfd must know. If guest_memfd is not informed, on a
> >> next conversion request, the conversion will just spin in
> >> filemap_remove_folio_for_restructuring().
> > It makes sense.
> >
> >
> >> What do you think of this part about informing guest_memfd of the
> >> failure to unmap?
> > So, do you want to add a guest_memfd callback to achieve this purpose?
> >
> 
> I will need to think the entire thing through, but I meant informing as
> in returning an error to guest_memfd so that guest_memfd knows. I think
> returning an error should be the first cause of action.
> 
> As for whether guest_memfd should know how to handle the error or
> whether the userspace VMM should participate in deciding what to do with
> the error, I'm not sure. If you have suggestions on this, I hope we can
> combine the suggestions about the conversion protocol on the other thread.
> 
> Regarding a callback, are you thinking something like not having the
> unmap return an error, but instead TDX will call a function like
> kvm_gmem_error_at_offset(loff_t offset), and guest_memfd will then
> record that somewhere, and then immediately after calling unmap
> guest_memfd will check kvm_gmem_was_there_an_error_in_range() and then
> determining whether there's an error? Something like that?
> 
> I guess it could work but feels a little odd.
> 
> >
> > BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range()
> > and mmu_notifier_invalidate_range_start() fail, based on the repo
> > https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
> > 'hardening-v6.16-rc1-fix1-take2' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux")
> 
> Thank you, I appreciate the effort you took to enumerate these. The
> following suggestions are based on my current understanding. I don't
> have time in the near future to do the plumbing to test out the
> suggestion, but for now I want to see if this suggestion makes sense,
> maybe you can correct any misunderstandings first. 

Sorry, I realize the enumeration below caused some confusion.

I listed these cases to show that the kernel does not expect unmapping to
fail.

Please kindly let me know if any existing kernel code allows an unmap to fail.

> > 1. Status of mmu notifier
> > -------------------------------
> > (1) There're 34 direct callers of mmu_notifier_invalidate_range_start().
> >     1. clear_refs_write
> >     2. do_pagemap_scan
> >     3. uprobe_write_opcode
> >     4. do_huge_zero_wp_pmd
> >     5. __split_huge_pmd (N)
> >     6. __split_huge_pud (N)
> >     7. move_pages_huge_pmd
> >     8. copy_hugetlb_page_range
> >     9. hugetlb_unshare_pmds  (N)
> >     10. hugetlb_change_protection
> >     11. hugetlb_wp
> >     12. unmap_hugepage_range (N)
> >     13. move_hugetlb_page_tables
> >     14. collapse_huge_page
> >     15. retract_page_tables
> >     16. collapse_pte_mapped_thp
> >     17. write_protect_page
> >     18. replace_page
> >     19. madvise_free_single_vma
> >     20. wp_clean_pre_vma
> >     21. wp_page_copy 
> >     22. zap_page_range_single_batched (N)
> >     23. unmap_vmas (N)
> >     24. copy_page_range 
> >     25. remove_device_exclusive_entry
> >     26. migrate_vma_collect
> >     27. __migrate_device_pages
> >     28. change_pud_range 
> >     29. move_page_tables
> >     30. page_vma_mkclean_one
> >     31. try_to_unmap_one
> >     32. try_to_migrate_one
> >     33. make_device_exclusive
> >     34. move_pages_pte
> >
> > Of these 34 direct callers, those marked with (N) cannot tolerate
> > mmu_notifier_invalidate_range_start() failing. I have not yet investigated all
> > 34 direct callers one by one, so the list of (N) is incomplete.
> >
> > For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
> > "Note that split_huge_pmd() doesn't have any limitations on refcounting:
> > pmd can be split at any point and never fails." This is because split_huge_pmd()
> > serves as a graceful fallback design for code walking pagetables but unaware
> > about huge pmds.
> >
> >

> Do these callers, especially those with (N), ever try to unmap any TDX
> private pages? guest_memfd only gives shared pages to core-mm, so for
> shared pages, there will continue to be no chance of errors.
> 
> If we change mmu_notifier_invalidate_range_start() to return an int, all
> of the callers that never invalidate shared pages can continue to safely
> rely on the fact that mmu_notifier_invalidate_range_start() will return
> 0.
mmu_notifier_invalidate_range_start() is just to zap shared pages.

 
> For the callers of mmu_notifier_invalidate_range_start() that may touch
> private pages, I believe that's only guest_memfd and KVM. That's where
> we want the error, and will handle the error.
> 
> Another point here is that I was thinking to put EPT splitting together
> with actual unmapping instead of with invalidation because we will
> probably invalidate more than we unmap (see explanation at [1] about the
> race). Maybe moving EPT splitting to unmap could help?
> 
> > (2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
> > __oom_reap_task_mm(), which only expects the error -EAGAIN.
> >
> > In mn_hlist_invalidate_range_start():
> > "WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"
> >
> >
> > (3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In that
> > case, they don't need to register mmu notifier.
> >
> > Or, device drivers can pin pages via get_user_pages*(), and register for mmu         
> > notifier callbacks for the memory range. Then, upon receiving a notifier         
> > "invalidate range" callback , stop the device from using the range, and unpin    
> > the pages.
> >
> > See Documentation/core-api/pin_user_pages.rst.
> >
> >
> 
> Do you mean that we should teach device drivers to get callbacks for
> private pages? Are you looking ahead to handle TDX IO on private pages?
> So far we haven't handled that yet.
I was trying to show that a device driver increases the page refcount (by
pinning) when it maps a page into the IOMMU page table, and does not decrease
the refcount (by unpinning) until after unmapping.

If a page held by the device driver is allocated from hugetlb, and the page
has been truncated from hugetlb, the page is still held by the device driver
until it is unmapped from the IOMMU page table.

This is similar to TDX: as long as a page is still mapped in the S-EPT or
tracked by the TDX module, it's better to hold a page refcount even after the
page is truncated from the file mapping.
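
For reference, the long-term pinning pattern from that document, in a
simplified sketch (NR and uaddr are placeholders; error handling omitted):

        struct page *pages[NR];
        int pinned;

        pinned = pin_user_pages_fast(uaddr, NR, FOLL_WRITE | FOLL_LONGTERM,
                                     pages);
        if (pinned <= 0)
                return pinned;

        /* ... map the pinned pages into the IOMMU and start DMA ... */

        /* ... later: stop DMA and unmap from the IOMMU first, then: */
        unpin_user_pages(pages, pinned);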


> > 2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
> > -------------------------------
> > (1) Error fallback cases.
> >
> >     1. split_huge_pmd() as mentioned in Documentation/mm/transhuge.rst.
> >        split_huge_pmd() is designed as a graceful fallback without failure.
> >
> >        split_huge_pmd
> >         |->__split_huge_pmd
> >            |->mmu_notifier_range_init
> >            |  mmu_notifier_invalidate_range_start
> >            |  split_huge_pmd_locked
> >            |  mmu_notifier_invalidate_range_end
> >
> >
> >     2. in fs/iomap/buffered-io.c, iomap_write_failed() itself is error handling.
> >        iomap_write_failed
> >          |->truncate_pagecache_range
> >             |->unmap_mapping_range
> >             |  |->unmap_mapping_pages
> >             |     |->unmap_mapping_range_tree
> >             |        |->unmap_mapping_range_vma
> >             |           |->zap_page_range_single
> >             |              |->zap_page_range_single_batched
> >             |                       |->mmu_notifier_range_init
> >             |                       |  mmu_notifier_invalidate_range_start
> >             |                       |  unmap_single_vma
> >             |                       |  mmu_notifier_invalidate_range_end
> >             |->truncate_inode_pages_range
> >                |->truncate_cleanup_folio
> >                   |->if (folio_mapped(folio))
> >                   |     unmap_mapping_folio(folio);
> >                          |->unmap_mapping_range_tree
> >                             |->unmap_mapping_range_vma
> >                                |->zap_page_range_single
> >                                   |->zap_page_range_single_batched
> >                                      |->mmu_notifier_range_init
> >                                      |  mmu_notifier_invalidate_range_start
> >                                      |  unmap_single_vma
> >                                      |  mmu_notifier_invalidate_range_end
> >
> >    3. in mm/memory.c, zap_page_range_single() is invoked to handle error.
> >       remap_pfn_range_notrack
> >         |->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
> >         |  if (!error)
> >         |      return 0;
> > 	|  zap_page_range_single
> >            |->zap_page_range_single_batched
> >               |->mmu_notifier_range_init
> >               |  mmu_notifier_invalidate_range_start
> >               |  unmap_single_vma
> >               |  mmu_notifier_invalidate_range_end
> >
> >    4. in kernel/events/core.c, zap_page_range_single() is invoked to clear any
> >       partial mappings on error.
> >
> >       perf_mmap
> >         |->ret = map_range(rb, vma);
> >                  |  err = remap_pfn_range
> >                  |->if (err) 
> >                  |     zap_page_range_single
> >                         |->zap_page_range_single_batched
> >                            |->mmu_notifier_range_init
> >                            |  mmu_notifier_invalidate_range_start
> >                            |  unmap_single_vma
> >                            |  mmu_notifier_invalidate_range_end
> >
> >
> >    5. in mm/memory.c, unmap_mapping_folio() is invoked to unmap a poisoned page.
> >
> >       __do_fault
> > 	|->if (unlikely(PageHWPoison(vmf->page))) { 
> > 	|	vm_fault_t poisonret = VM_FAULT_HWPOISON;
> > 	|	if (ret & VM_FAULT_LOCKED) {
> > 	|		if (page_mapped(vmf->page))
> > 	|			unmap_mapping_folio(folio);
> >         |                       |->unmap_mapping_range_tree
> >         |                          |->unmap_mapping_range_vma
> >         |                             |->zap_page_range_single
> >         |                                |->zap_page_range_single_batched
> >         |                                   |->mmu_notifier_range_init
> >         |                                   |  mmu_notifier_invalidate_range_start
> >         |                                   |  unmap_single_vma
> >         |                                   |  mmu_notifier_invalidate_range_end
> > 	|		if (mapping_evict_folio(folio->mapping, folio))
> > 	|			poisonret = VM_FAULT_NOPAGE; 
> > 	|		folio_unlock(folio);
> > 	|	}
> > 	|	folio_put(folio);
> > 	|	vmf->page = NULL;
> > 	|	return poisonret;
> > 	|  }
> >
> >
> >   6. in mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
> >      partial mapping done by a device driver.
> >
> >      __mmap_new_vma
> >        |->__mmap_new_file_vma(map, vma);
> >           |->error = mmap_file(vma->vm_file, vma);
> >           |  if (error)
> >           |     unmap_region
> >                  |->unmap_vmas
> >                     |->mmu_notifier_range_init
> >                     |  mmu_notifier_invalidate_range_start
> >                     |  unmap_single_vma
> >                     |  mmu_notifier_invalidate_range_end
> >
> >
> 
> These should probably not ever be invalidating or unmapping private pages.
> 
> > (2) No-fail cases
> > -------------------------------
> > 1. iput() cannot fail. 
> >
> > iput
> >  |->iput_final
> >     |->WRITE_ONCE(inode->i_state, state | I_FREEING);
> >     |  inode_lru_list_del(inode);
> >     |  evict(inode);
> >        |->op->evict_inode(inode);
> >           |->shmem_evict_inode
> >              |->shmem_truncate_range
> >                 |->truncate_inode_pages_range
> >                    |->truncate_cleanup_folio
> >                       |->if (folio_mapped(folio))
> >                       |     unmap_mapping_folio(folio);
> >                             |->unmap_mapping_range_tree
> >                                |->unmap_mapping_range_vma
> >                                   |->zap_page_range_single
> >                                      |->zap_page_range_single_batched
> >                                         |->mmu_notifier_range_init
> >                                         |  mmu_notifier_invalidate_range_start
> >                                         |  unmap_single_vma
> >                                         |  mmu_notifier_invalidate_range_end
> >
> >
> > 2. exit_mmap() cannot fail
> >
> > exit_mmap
> >   |->mmu_notifier_release(mm);
> >      |->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
> >         |->mmu_notifier_range_init
> >         |  mmu_notifier_invalidate_range_start
> >         |  unmap_single_vma
> >         |  mmu_notifier_invalidate_range_end
> >
> >
> 
> These should probably not ever be invalidating or unmapping private pages.
> 
> > 3. KVM Cases That Cannot Tolerate Unmap Failure
> > -------------------------------
> > Allowing unmap operations to fail in the following scenarios would make it very
> > difficult or even impossible to handle the failure:
> >
> > (1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow page
> > without expecting any failure.
> >
> > mmu_alloc_direct_roots
> >   |->mmu_alloc_root
> >      |->kvm_mmu_get_shadow_page
> >         |->__kvm_mmu_get_shadow_page
> >            |->kvm_mmu_alloc_shadow_page
> >               |->account_shadowed
> >                  |->kvm_mmu_slot_gfn_write_protect
> >                     |->kvm_tdp_mmu_write_protect_gfn
> >                        |->write_protect_gfn
> >                           |->tdp_mmu_iter_set_spte
> >
> >
> 
> I need to learn more about shadow pages but IIUC TDX doesn't use shadow
> pages so this path won't interact with unmapping private pages.
> 
> > (2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail
> >
> > kvm_vfio_release/kvm_vfio_file_del
> >  |->kvm_vfio_update_coherency
> >     |->kvm_arch_unregister_noncoherent_dma
> >        |->kvm_noncoherent_dma_assignment_start_or_stop
> >           |->kvm_zap_gfn_range
> >              |->kvm_tdp_mmu_zap_leafs
> >                 |->tdp_mmu_zap_leafs
> >                    |->tdp_mmu_iter_set_spte
> >
> >
> 
> I need to learn more about VFIO but for now IIUC IO uses shared pages,
> so this path won't interact with unmapping private pages.
> 
> > (3) There're lots of callers of __kvm_set_or_clear_apicv_inhibit() that
> > currently never expect unmap to fail.
> >
> > __kvm_set_or_clear_apicv_inhibit
> >   |->kvm_zap_gfn_range
> >      |->kvm_tdp_mmu_zap_leafs
> >         |->tdp_mmu_zap_leafs
> >            |->tdp_mmu_iter_set_spte
> >
> >
> >
> 
> There could be some TDX-specific behavior such that TDX doesn't use this
> path.
tdp_mmu_iter_set_spte() is used by KVM generally to update an SPTE when
kvm->mmu_lock is held for write.

For TDX, tdp_mmu_iter_set_spte() further leads to unmapping of the SEPT.

Converting tdp_mmu_iter_set_spte() to return an error would be a huge amount of
work, and I don't think it's right or worthwhile.
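
To illustrate the structural problem (purely a sketch with hypothetical names,
not the real KVM code): the SPTE-update helper in this path is void, so a
failure in the TDX-side removal has no way to reach the many callers
(write-protect, zap, ...) without reworking each of them.

#include <linux/errno.h>
#include <linux/compiler.h>
#include <linux/types.h>

struct sketch_iter {
	u64 *sptep;
	u64 old_spte;
};

/* Hypothetical stand-in for the arch hook that can fail for TDX (e.g. -EBUSY). */
static int sketch_remove_private_spte(struct sketch_iter *iter)
{
	return -EBUSY;
}

static void sketch_iter_set_spte(struct sketch_iter *iter, u64 new_spte)
{
	/* The void signature gives a failure here nowhere to go. */
	sketch_remove_private_spte(iter);
	WRITE_ONCE(*iter->sptep, new_spte);
}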

> 
> > 4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (which updates an
> > SPTE with mmu_lock held for write) able to fail.
> >
> > (1) kvm_vcpu_flush_tlb_guest()
> >
> > kvm_vcpu_flush_tlb_guest
> >   |->kvm_mmu_sync_roots
> >      |->mmu_sync_children
> >         |->kvm_vcpu_write_protect_gfn
> >            |->kvm_mmu_slot_gfn_write_protect
> >               |->kvm_tdp_mmu_write_protect_gfn
> >                  |->write_protect_gfn
> >                     |->tdp_mmu_iter_set_spte
> >                        |->tdp_mmu_set_spte
> >
> >
> > (2) handle_removed_pt() and handle_changed_spte().
> >
> 
> Thank you so much for looking into these, I'm hoping that the number of
> cases where TDX and private pages are unmapped are really limited to a
> few paths that we have to rework.
> 
> If we agree that the error has to be handled, then regardless of how we
> let the caller know that an error happened, all paths touching TDX
> private pages have to be reworked.
> 
> Between (1) returning an error vs (2) marking error and having the
> caller check for errors, then it's probably better to use the standard
> approach of returning an error since it is better understood, and
> there's no need to have extra data structures?
However, I don't think returning an error from the unmap path is a standard
approach...

Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 6 months ago
On Mon, 2025-06-16 at 18:43 +0800, Yan Zhao wrote:
> > It is true that a buggy or malicious userspace VMM can ignore conversion
> > failures and report success to the guest, but if both the userspace VMM
> > and guest are malicious, it's quite hard for the kernel to defend
> > against that.

For upstream, it's going to be required that userspace can't mess up the host
kernel. Userspace is free to mess up the guest though.




Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 7 months, 1 week ago
On Fri, 2025-05-09 at 07:20 -0700, Vishal Annapurve wrote:
> I might be wrongly throwing out some terminologies here then.
> VM_PFNMAP flag can be set for memory backed by folios/page structs.
> udmabuf seems to be working with pinned "folios" in the backend.
> 
> The goal is to get to a stage where guest_memfd is backed by pfn
> ranges unmanaged by kernel that guest_memfd owns and distributes to
> userspace, KVM, IOMMU subject to shareability attributes. If the
> shareability changes, the users will get notified and will have to
> invalidate their mappings. guest_memfd will allow mmaping such ranges
> with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> special handling/lack of page structs.

I see the point about how operating on PFNs can allow smoother transition to a
solution that saves struct page memory, but I wonder about the wisdom of
building this 2MB TDX code against eventual goals.

We were thinking to enable 2MB TDX huge pages on top of:
1. Mmap shared pages
2. In-place conversion
3. 2MB huge page support

Where do you think struct page-less guestmemfd fits in that roadmap?

> 
> As an intermediate stage, it makes sense to me to just not have
> private memory backed by page structs and use a special "filemap" to
> map file offsets to these private memory ranges. This step will also
> need similar contract with users -
>    1) memory is pinned by guest_memfd
>    2) users will get invalidation notifiers on shareability changes
> 
> I am sure there is a lot of work here and many quirks to be addressed,
> let's discuss this more with better context around. A few related RFC
> series are planned to be posted in the near future.

Look forward to collecting more context, and thanks for your patience while we
catch up. But why not an iterative approach? We can't save struct page memory on
guestmemfd huge pages until we have guestmemfd huge pages.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Vishal Annapurve 7 months, 1 week ago
On Fri, May 9, 2025 at 4:45 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2025-05-09 at 07:20 -0700, Vishal Annapurve wrote:
> > I might be wrongly throwing out some terminologies here then.
> > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> > udmabuf seems to be working with pinned "folios" in the backend.
> >
> > The goal is to get to a stage where guest_memfd is backed by pfn
> > ranges unmanaged by kernel that guest_memfd owns and distributes to
> > userspace, KVM, IOMMU subject to shareability attributes. If the
> > shareability changes, the users will get notified and will have to
> > invalidate their mappings. guest_memfd will allow mmaping such ranges
> > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> > special handling/lack of page structs.
>
> I see the point about how operating on PFNs can allow smoother transition to a
> solution that saves struct page memory, but I wonder about the wisdom of
> building this 2MB TDX code against eventual goals.

This discussion was more in response to a few questions from Yan [1].

My point of this discussion was to ensure that:
1) There is more awareness about the future roadmap.
2) There is a line of sight towards supporting guest memory (at least
guest private memory) without page structs.

No need to solve these problems right away, but it would be good to
ensure that the design choices are aligned towards the future
direction.

One thing that needs to be resolved right away is - no refcounts on
guest memory from outside guest_memfd [2]. (Discounting the error
situations)

[1] https://lore.kernel.org/lkml/aBldhnTK93+eKcMq@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/lkml/CAGtprH_ggm8N-R9QbV1f8mo8-cQkqyEta3W=h2jry-NRD7_6OA@mail.gmail.com/
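
To make the "no refcounts from outside guest_memfd, except in error situations"
point concrete, here is a minimal hedged sketch (hypothetical helper names; the
reclaim function is only a stand-in for the real TDX removal path): a folio
reference is taken only when the TDX module fails to release the page, so the
page is never handed back to the allocator while the module may still access it.

#include <linux/kvm_host.h>
#include <linux/mm.h>

/* Hypothetical stand-in for the real removal path, which can fail. */
static int sketch_tdx_reclaim_page(struct kvm *kvm, struct page *page, int level)
{
	return 0;
}

static void sketch_remove_private_page(struct kvm *kvm, struct page *page,
				       int level)
{
	/*
	 * Normal case: guest_memfd holds the only long-term reference, so
	 * there is nothing to drop here.  Pin the folio only on reclaim
	 * failure.
	 */
	if (sketch_tdx_reclaim_page(kvm, page, level))
		folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
}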

>
> We were thinking to enable 2MB TDX huge pages on top of:
> 1. Mmap shared pages
> 2. In-place conversion
> 3. 2MB huge page support
>
> Where do you think struct page-less guestmemfd fits in that roadmap?

Ideally the roadmap should be:
1. mmap support
2. Huge page support in guest memfd with in-place conversion
3. 2MB huge page EPT mappings support
4. private memory without page structs
5. private/shared memory without page structs

There should be newer RFC series landing soon for 1 and 2. In my
opinion, as long as hugepage EPT support is reviewed, tested, and
stable enough, it can land upstream sooner than 2 as well.

>
> >
> > As an intermediate stage, it makes sense to me to just not have
> > private memory backed by page structs and use a special "filemap" to
> > map file offsets to these private memory ranges. This step will also
> > need similar contract with users -
> >    1) memory is pinned by guest_memfd
> >    2) users will get invalidation notifiers on shareability changes
> >
> > I am sure there is a lot of work here and many quirks to be addressed,
> > let's discuss this more with better context around. A few related RFC
> > series are planned to be posted in the near future.
>
> Look forward to collecting more context, and thanks for your patience while we
> catch up. But why not an iterative approach? We can't save struct page memory on
> guestmemfd huge pages until we have guestmemfd huge pages.
Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Posted by Edgecombe, Rick P 7 months, 1 week ago
On Fri, 2025-05-09 at 17:41 -0700, Vishal Annapurve wrote:
> > > > I see the point about how operating on PFNs can allow smoother
> > > > transition to a
> > > > solution that saves struct page memory, but I wonder about the wisdom of
> > > > building this 2MB TDX code against eventual goals.
> > 
> > This discussion was more in response to a few questions from Yan [1].

Right, I follow.

> > 
> > My point of this discussion was to ensure that:
> > 1) There is more awareness about the future roadmap.
> > 2) There is a line of sight towards supporting guest memory (at least
> > guest private memory) without page structs.
> > 
> > No need to solve these problems right away, but it would be good to
> > ensure that the design choices are aligned towards the future
> > direction.

I'm not sure how much we should consider it at this stage. The kernel is not set
in stone, so it's about how much you want to do at once. For us who have been
working on the giant TDX base series, doing things in smaller, more incremental
steps sounds nice :). That said, the necessary changes may have other good
reasons, as discussed.

> > 
> > One thing that needs to be resolved right away is - no refcounts on
> > guest memory from outside guest_memfd [2]. (Discounting the error
> > situations)

Sounds fine.

> > 
> > [1] https://lore.kernel.org/lkml/aBldhnTK93+eKcMq@yzhao56-desk.sh.intel.com/
> > [2] https://lore.kernel.org/lkml/CAGtprH_ggm8N-R9QbV1f8mo8-cQkqyEta3W=h2jry-NRD7_6OA@mail.gmail.com/