This is v3 of the TDX huge page series. There aren't any major changes to
the design, just the incorporation of various comments on RFC v2 [0] and
a switch to an external cache for splitting to align with changes in
DPAMT v4 [3]. The full stack is available at [4].
Dave/Kirill/Rick/x86 folks, the patches that will eventually need an ack
are patches 1-5. I would appreciate some review on them, with the
understanding that they may need further refinement before they're ready
for ack.
Sean, I'm feeling pretty good about the design at this point; however,
there are a few remaining design opens (see the next section with
references to specific patches). I'm wondering if we can close these as
part of the
review of this revision. Thanks a lot!
Highlights
----------
- Request review of the tip part (patches 1-5)
Patches 1-5 contain SEAMCALL wrapper updates, which fall under the tip tree.
Besides introducing a SEAMCALL wrapper for demote, the other changes are
mainly to convert mapping/unmapping related SEAMCALL wrappers from using
"struct page *page" to "struct folio *folio" and "unsigned long
start_idx" for huge pages.
- EPT mapping size and folio size
This series is built upon the rule in KVM that the mapping size in the
KVM-managed secondary MMU is no larger than the backend folio size.
Therefore, there are sanity checks in the SEAMCALL wrappers in patches
1-5 to follow this rule, either in tdh_mem_page_aug() for mapping
(patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3),
tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5)
for unmapping.
However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd
[1] splits backend folios ahead of notifying KVM for unmapping. So, this
series also relies on the fixup patch [8] to notify KVM of unmapping
before splitting the backend folio during the memory conversion ioctl.
- Enable TDX huge pages only on new TDX modules (patch 2, patch 24)
This v3 detects whether the TDX module supports feature
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY and disables TDX huge
page support for private memory on TDX modules without this feature.
Additionally, v3 provides a new module parameter "tdx_huge_page" to turn
off TDX huge pages for private memory by executing
"modprobe kvm_intel tdx_huge_page=N".
- Dynamic PAMT v4 integration (patches 17-23)
Currently, DPAMT's involvement with TDX huge pages is limited to page
splitting. v3 introduces KVM x86 ops for the per-VM external cache for
splitting (patches 19, 20).
The per-VM external cache holds pre-allocated pages (patch 19), which are
dequeued during TDX's splitting operations for installing PAMT pages
(patches 21-22).
The general logic of managing the per-VM external cache remains similar
to the per-VM cache in RFC v2; the difference is that the KVM MMU core
now calls into TDX, which implements page pre-allocation/enqueuing, page
count checking, and page freeing. Keeping the cache external makes it
easier to add a lock protecting page enqueuing (in the
topup_external_per_vm_split_cache() op) and page dequeuing (in the
split_external_spte() op) for the per-VM cache. (See more details in the
DPAMT bullet in the "Full Design" section.)
Since the basic DPAMT design (without huge pages) is not yet settled,
feel free to ignore the review request for this part. However, we
would appreciate some level of review given the differences between the
per-VM cache and per-vCPU cache.
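
For reference, here is a minimal sketch of the per-VM external split cache
ops described in the bullet above. The op names follow this cover letter,
but the grouping into one ops struct and the exact signatures are
assumptions, not the series' actual definitions:

  /*
   * Hypothetical sketch only; names follow the cover letter, the
   * signatures and the ops-struct layout are illustrative assumptions.
   */
  #include <linux/types.h>

  struct kvm;
  typedef u64 gfn_t;	/* as defined in include/linux/kvm_types.h */

  struct kvm_x86_split_cache_ops {
  	/* Checked by the MMU core under write mmu_lock before splitting. */
  	bool (*need_topup_external_split_cache)(struct kvm *kvm);
  	/* Pre-allocate and enqueue pages into the per-VM cache. */
  	int  (*topup_external_per_vm_split_cache)(struct kvm *kvm);
  	/* Consume cached pages while splitting one mirror/S-EPT leaf. */
  	int  (*split_external_spte)(struct kvm *kvm, gfn_t gfn, int level,
  				    u64 old_mirror_spte, void *external_spt);
  	/* Drop any unused pre-allocated pages, e.g. on VM teardown. */
  	void (*free_external_split_cache)(struct kvm *kvm);
  };
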
Changes from RFC v2
-------------------
- Dropped 2 patches:
"KVM: TDX: Do not hold page refcount on private guest pages"
(pulled by Sean separately),
"x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC
is set"
(To be consistent with the always-clflush policy.
The DPAMT-related bug in TDH.PHYMEM.PAGE.WBINVD only existed in the
old unreleased TDX modules for DPAMT).
- TDX_INTERRUPTED_RESTARTABLE error handling:
Disable TDX huge pages if the TDX module does not support
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.
- Added a module parameter to turn on/off TDX huge pages.
- Renamed tdx_sept_split_private_spt() to tdx_sept_split_private_spte(),
passing parameter "old_mirror_spte" instead of "pfn_for_gfn". Updated the
function implementation to align with Sean's TDX cleanup series.
- Updated API kvm_split_cross_boundary_leafs() so that it neither flushes
TLBs internally nor returns split/flush status, and added a default
implementation for
non-x86 platforms.
- Renamed tdx_check_accept_level() to tdx_honor_guest_accept_level().
Returned tdx_honor_guest_accept_level() error to userspace instead of
retry in KVM.
- Introduced KVM x86 ops for the per-VM external cache for splitting to
align with DPAMT v4, re-organized the DPAMT related patches, refined
parameter names for better readability, and added missing KVM_BUG_ON()
and warnings.
- Changed nth_page() to folio_page() in SEAMCALL wrappers, corrected a
minor bug in handling reclaimed size mismatch error.
- Removed unnecessary declarations, fixed typos and improved patch logs and
comments.
Patches Layout
--------------
- Patches 01-05: Update SEAMCALL wrappers/helpers for huge pages.
- Patch 06: Disallow page merging for TDX.
- Patches 07-09: Enable KVM MMU core to propagate splitting requests to TDX
and provide the corresponding implementation in TDX.
- Patches 10-11: Introduce a higher level API
kvm_split_cross_boundary_leafs() for splitting
cross-boundary mappings.
- Patches 12-13: Honor guest accept level in EPT violation handler.
- Patches 14-16: Split huge pages on page conversion/punch hole.
- Patches 17-23: Dynamic PAMT related changes for TDX huge pages.
- Patch 24: Turn on 2MB huge pages if tdx_huge_page is true.
(Adds the tdx_huge_page module parameter and turns it
off if the TDX module does not support
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY).
Full Design (optional reading)
------------------------------
1. guest_memfd dependency
According to the rule in KVM that the mapping size in the KVM-managed
secondary MMU should not exceed the backend folio size, the TDX huge
page series relies on guest_memfd's ability to allocate huge folios.
However, this series does not depend on guest_memfd's implementation
details, except for patch 16, which invokes API
kvm_split_cross_boundary_leafs() to request KVM splitting of private
huge mappings that cross the specified range boundary in guest_memfd
punch hole and private-to-shared conversion ioctls.
Therefore, the TDX huge page series can also function with 2MB THP
guest_memfd by updating patch 16 to match the underlying guest_memfd.
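
As a rough caller-side illustration (the real kvm_split_cross_boundary_leafs()
signature and the guest_memfd call sites differ in detail; the wrapper below
is purely hypothetical):

  #include <linux/kvm_host.h>

  /* Hypothetical helper invoked from punch-hole/conversion paths. */
  static int gmem_unmap_private_range(struct kvm *kvm, gfn_t start, gfn_t end)
  {
  	int r;

  	/*
  	 * Split any private huge leaf that straddles 'start' or 'end'
  	 * first, so the zap below only drops mappings inside [start, end).
  	 */
  	r = kvm_split_cross_boundary_leafs(kvm, start, end);
  	if (r)
  		return r;	/* e.g. -ENOMEM or a TDX splitting failure */

  	kvm_zap_gfn_range(kvm, start, end);	/* declared in arch/x86/kvm/mmu.h */
  	return 0;
  }
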
2. Restrictions:
1) Splitting under read mmu_lock is not supported.
With the current TDX module (i.e., without NON-BLOCKING-RESIZE
feature), handling BUSY errors returned from splitting operations
under read mmu_lock requires invoking tdh_mem_range_unblock(), which
may also encounter BUSY errors. To avoid complicating the lock
design, splitting under read mmu_lock for TDX is rejected before the
TDX module supports the NON-BLOCKING-RESIZE feature.
However, when TDs perform ACCEPT operations at 4KB level after KVM
creates huge mappings (e.g., non-Linux TDs' ACCEPT operations that
occur after memory access, or Linux TDs' ACCEPT operations on a
pre-faulted range), splitting operations in the fault path are
required, which usually hold read mmu_lock.
Ignoring splitting operations in the fault path would cause repeated
faults. Forcing KVM's mapping to 4KB before the guest's ACCEPT level
is available would disable huge pages for non-Linux TDs, since the
TD's later ACCEPT operations at higher levels would return
TDX_PAGE_SIZE_MISMATCH error to the guest. (See "ACCEPT level and
lpage_info" bullet).
This series holds write mmu_lock for splitting operations in the fault
path. To maintain performance, the implementation only acquires write
mmu_lock when necessary, keeping the write mmu_lock acquisition count
at an acceptable level [6].
2) No huge page if TDX_INTERRUPTED_RESTARTABLE error is possible
If SEAMCALL TDH_MEM_PAGE_DEMOTE can return the
TDX_INTERRUPTED_RESTARTABLE error (due to interrupts during SEAMCALL
execution), the TDX module provides no guaranteed maximum retry count
to ensure forward progress of page demotion. Interrupt storms could
then result in a DoS if the host simply retries endlessly on
TDX_INTERRUPTED_RESTARTABLE. Disabling interrupts before invoking the
SEAMCALL in the host cannot prevent TDX_INTERRUPTED_RESTARTABLE
because NMIs can also trigger the error. Therefore, given that
SEAMCALL execution time for demotion in basic TDX remains at a
reasonable level, the tradeoff is to have the TDX module not check
interrupts during SEAMCALL TDH_MEM_PAGE_DEMOTE, eliminating the
TDX_INTERRUPTED_RESTARTABLE error.
This series detects whether the TDX module supports feature
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY and disables TDX huge
pages on private memory for TDX modules without this feature, thus
avoiding TDX_INTERRUPTED_RESTARTABLE error for basic TDX (i.e.,
without TD partition).
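
As a rough illustration of this gate (the bit position and helper below are
assumptions based on the text, not the actual TDX module metadata API):

  #include <linux/bits.h>
  #include <linux/printk.h>
  #include <linux/types.h>

  /* Hypothetical bit position; TDX_FEATURES0 comes from TDX global metadata. */
  #define TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY	BIT_ULL(19)

  static bool tdx_supports_huge_page(u64 tdx_features0)
  {
  	if (tdx_features0 & TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY)
  		return true;

  	/* Demote may return TDX_INTERRUPTED_RESTARTABLE: no huge pages. */
  	pr_info("TDX module can interrupt TDH.MEM.PAGE.DEMOTE, disabling TDX huge pages\n");
  	return false;
  }
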
3. Page size is forced to 4KB during TD build time.
Always return 4KB in the KVM x86 hook gmem_max_mapping_level() to force
4KB mapping size during TD build time, because:
- tdh_mem_page_add() only adds private pages at 4KB.
- The amount of initial private memory is limited (typically ~4MB).
4. Page size during TD runtime can be up to 2MB.
Return up to 2MB in the KVM x86 hook gmem_max_mapping_level():
Use module parameter "tdx_huge_page" to control whether to return 2MB,
and turn off tdx_huge_page if the TDX module does not support
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.
When returning 2MB in KVM x86 hook gmem_max_mapping_level(), KVM may
still map a page at 4KB due to:
(1) the backend folio is 4KB,
(2) disallow_lpage restrictions:
- mixed private/shared pages in the 2MB range,
- level misalignment due to slot base_gfn, slot size, and ugfn,
- GUEST_INHIBIT bit set due to guest ACCEPT operation.
(3) page merging is disallowed (e.g., when part of a 2MB range has been
mapped at 4KB level during TD build time).
5. ACCEPT level and lpage_info
KVM needs to honor TDX guest's ACCEPT level by ensuring a fault's
mapping level is no higher than the guest's ACCEPT level, which is due
to the current TDX module implementation:
- If host mapping level > guest's ACCEPT level, repeated faults carrying
the guest's ACCEPT level info are generated. No error is returned to
the guest's ACCEPT operation.
- If host mapping level < guest's ACCEPT level, the guest's ACCEPT
operation returns TDX_PAGE_SIZE_MISMATCH error.
It's efficient to pass the guest's ACCEPT level info to KVM MMU core
before KVM actually creates the mapping.
Following the suggestions from Sean/Rick, this series introduces a
global approach (vs. fault-by-fault per-vCPU) to pass the ACCEPT level
info: setting the GUEST_INHIBIT bit in a slot's lpage_info(s) above the
specified guest ACCEPT level. The global approach helps simplify vendor
code while maintaining a global view across vCPUs.
Since page merging is currently not supported, and given the limitation
that page splitting under read mmu_lock is not supported, the
GUEST_INHIBIT bit is set in a one-way manner under write mmu_lock. Once
set, it will not be unset, and the GFN will not be allowed to be mapped
at higher levels, even if the mappings are zapped and re-accepted by the
guest at higher levels later. This approach simplifies the code and
prevents potential subtle bugs from different ACCEPT levels specified by
different vCPUs. Tests showed the one-way manner has minimal impact on
the huge page map count with a typical Linux TD [5].
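
For illustration, the one-way GUEST_INHIBIT setting could look roughly like
the existing disallow_lpage handling in the x86 MMU; the flag value and the
helper below are assumptions, not the series' hugepage_set_guest_inhibit():

  /* Hypothetical flag; the actual bit assignment in the series may differ. */
  #define KVM_LPAGE_GUEST_INHIBIT_FLAG	BIT(30)

  /*
   * One-way operation: once the guest has ACCEPTed at @accept_level, forbid
   * larger mappings for this gfn.  Caller holds write mmu_lock.
   */
  static void hugepage_set_guest_inhibit_above(struct kvm_memory_slot *slot,
  					     gfn_t gfn, int accept_level)
  {
  	int level;

  	for (level = accept_level + 1; level <= KVM_MAX_HUGEPAGE_LEVEL; level++)
  		lpage_info_slot(gfn, slot, level)->disallow_lpage |=
  			KVM_LPAGE_GUEST_INHIBIT_FLAG;
  }
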
Typical scenarios of honoring 4KB ACCEPT level:
a. If the guest accesses private memory without first accepting it,
1) Guest accesses private memory.
2) KVM maps the private memory at 2MB if no huge page restrictions
exist.
3) Guest accepts the private memory at 4KB.
4) KVM receives an EPT violation VMExit, whose type indicates it's
caused by the guest's ACCEPT operation at 4KB level.
5) KVM honors the guest's ACCEPT level by setting the GUEST_INHIBIT
bit in lpage_info(s) above 4KB level and splitting the created 2MB
mapping.
6) Guest accepts the private memory at 4KB level successfully and
accesses the private memory.
b. If the guest first accepts private memory before accessing,
1) Guest accepts private memory at 4KB level.
2) KVM receives an EPT violation VMExit, whose type indicates it's
caused by guest's ACCEPT operation at 4KB level.
3) KVM honors the guest's ACCEPT level by setting the GUEST_INHIBIT
bit in lpage_info(s) above 4KB level, thus creating the mapping at
4KB level.
4) Guest accepts the private memory at 4KB level successfully and
accesses the private memory.
6. Page splitting (page demotion)
With the current TDX module, splitting huge mappings in S-EPT requires
executing tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
and tdh_mem_page_demote() in sequence. If DPAMT is involved,
tdh_phymem_pamt_add() is also required.
Page splitting can occur due to:
(1) private-to-shared conversion,
(2) hole punching in guest_memfd,
(3) guest's ACCEPT operation at lower level than host mapping level.
All paths trigger splitting by invoking kvm_split_cross_boundary_leafs()
under write mmu_lock.
TDX_OPERAND_BUSY is thus handled similarly to removing a private page,
i.e., by kicking off all vCPUs and retrying, which should succeed on the
second attempt. Other errors from tdh_mem_page_demote() are not
expected, triggering TDX_BUG_ON().
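
A rough sketch of the sequence and the BUSY handling described here follows;
the wrapper signatures and error plumbing are simplified assumptions, and
the series' actual tdx_sept_split_private_spte() differs in detail:

  /* Hypothetical sketch of item 6.  Called under write mmu_lock. */
  static int tdx_split_huge_sept_entry(struct kvm *kvm, gfn_t gfn, int level,
  				     struct page *new_sept_page)
  {
  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
  	u64 entry, level_state;
  	u64 err;

  	err = tdh_mem_range_block(&kvm_tdx->td, gfn_to_gpa(gfn), level,
  				  &entry, &level_state);
  	if (unlikely(tdx_operand_busy(err))) {
  		/* Kick all vCPUs out of the TDX module and retry once. */
  		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
  		err = tdh_mem_range_block(&kvm_tdx->td, gfn_to_gpa(gfn), level,
  					  &entry, &level_state);
  	}
  	if (KVM_BUG_ON(err, kvm))
  		return -EIO;

  	/* Increment the TD's epoch and force vCPUs to re-enter. */
  	tdh_mem_track(&kvm_tdx->td);
  	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

  	/* With DPAMT, tdh_phymem_pamt_add() for the new pages happens here. */
  	err = tdh_mem_page_demote(&kvm_tdx->td, gfn_to_gpa(gfn), level,
  				  new_sept_page, &entry, &level_state);
  	return KVM_BUG_ON(err, kvm) ? -EIO : 0;
  }
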
7. Page merging (page promotion)
Promotion is disallowed, because:
- The current TDX module requires all 4KB leafs to be either all PENDING
or all ACCEPTED before a successful promotion to 2MB. This requirement
prevents successful page merging after partially converting a 2MB
range from private to shared and then back to private, which is the
primary scenario necessitating page promotion.
- tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
TDX module. Consequently, handling BUSY errors is complex, as page
merging typically occurs in the fault path under shared mmu_lock.
- The limited amount of initial private memory (typically ~4MB) means the
need for page merging during TD build time is minimal.
8. Precise zapping of private memory
Since TDX requires guest's ACCEPT operation on host's mapping of private
memory, zapping private memory for guest_memfd punch hole and
private-to-shared conversion must be precise and preceded by splitting
private memory.
Patch 16 serves this purpose and is the only patch in the TDX huge page
series sensitive to guest_memfd's implementation changes.
9. DPAMT
Currently, DPAMT's involvement with TDX huge pages is limited to page
splitting.
As shown in the following call stack, DPAMT pages used by splitting are
pre-allocated and queued in the per-VM external split cache. They are
dequeued and consumed in tdx_sept_split_private_spte().
kvm_split_cross_boundary_leafs
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
tdp_mmu_split_huge_pages_root
(*) 1) tdp_mmu_alloc_sp_for_split()
+-----2.1) need_topup_external_split_cache(): check if enough pages in
| the external split cache. Go to 3 if pages are enough.
| +--2.2) topup_external_split_cache(): preallocate/enqueue pages in
| | the external split cache.
| | 3) tdp_mmu_split_huge_page
| | tdp_mmu_link_sp
| | tdp_mmu_iter_set_spte
| |(**) tdp_mmu_set_spte
| | split_external_spte
| | kvm_x86_call(split_external_spte)
| | tdx_sept_split_private_spte
| | 3.1) BLOCK, TRACK
+--+-------------------3.2) Dequeue PAMT pages from the external split
| | cache for the new sept page
| | 3.3) PAMT_ADD for the new sept page
+--+-------------------3.4) Dequeue PAMT pages from the external split
cache for the 2MB guest private memory.
3.5) DEMOTE.
3.6) Update PAMT refcount of the 2MB guest private
memory.
(*) The write mmu_lock is held across the check for enough pages in the
cache in step 2.1 and the page dequeuing in steps 3.2 and 3.4, so
dequeuing is guaranteed to find enough pages in the cache.
(**) A spinlock prealloc_split_cache_lock is used inside the TDX's cache
implementation to protect page enqueuing in step 2.2 and page
dequeuing in steps 3.2 and 3.4.
10. TDX does not hold page ref count
TDX previously held a folio refcount and didn't release it if it failed
to zap the S-EPT. However, guest_memfd's in-place conversion feature
requires that TDX not hold folio refcounts. Several approaches were
explored. RFC v2 ultimately avoided holding page refcounts in TDX and
generates a KVM_BUG_ON() on S-EPT zapping failure without propagating
the failure back to guest_memfd or notifying guest_memfd through
out-of-band methods, considering the complexity and that S-EPT zapping
failure is currently only possible due to KVM or TDX module bugs.
This approach was acked by Sean and the patch to drop holding TDX page
refcount was pulled separately.
Base
----
This is based on the latest WIP hugetlb-based guest_memfd code [1], with
"Rework preparation/population" series v2 [2] and DPAMT v4 [3] rebased on
it. For the full stack see [4].
Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
(1) Compilation error due to missing symbol export of
hugetlb_restructuring_free_folio().
(2) guest_memfd splits backend folios when the folio is still mapped as
huge in KVM (which breaks KVM's basic assumption that EPT mapping size
should not exceed the backend folio size).
(3) guest_memfd is incapable of merging folios to huge for
shared-to-private conversions.
(4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned,
given that shared pages can only be mapped at 4KB.
So, this series also depends on the four fixup patches included in [4]:
[FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
[FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion
[FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
[FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
(lkp sent me some more gmem compilation errors. I ignored them as I didn't
encounter them with my config and env).
Testing
-------
We currently don't have a QEMU that works with the latest in-place
conversion uABI. We plan to increase testing before asking for merging.
This revision is tested via TDX selftests (enhanced to support the in-place
conversion uABI).
Please check the TDX selftests included in tree [4] (under section "Not for
upstream"[9]) that work with the in-place conversion uABI, specifically the
selftest tdx_vm_huge_page for huge page testing.
These selftests require running under "modprobe kvm vm_memory_attributes=N".
The 2MB mapping count can be checked via "/sys/kernel/debug/kvm/pages_2m".
Note #1: Since TDX does not yet enable in-place copy of init memory
regions, userspace needs to follow the sequence below for init
memory regions to work (also shown in commit "KVM: selftests: TDX:
Switch init mem to gmem with MMAP flag" [10] in tree [4]):
1) Create guest_memfd with MMAP flag for the init memory region.
2) Set the GFNs for the init memory region to shared and copy
initial data to the mmap'ed HVA of the guest_memfd.
3) Create a separate temporary shared memory backend for the init
memory region, and copy initial data from the guest_memfd HVA
to the temporary shared memory backend.
4) Convert the GFNs for the init memory region to private.
5) Invoke ioctl KVM_TDX_INIT_MEM_REGION, passing the HVA of the
temporary shared memory backend as source addr and GPAs of the
init memory region.
6) Free the temporary shared memory backend.
Note #2: This series disables TDX huge pages on TDX modules without feature
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY. For testing TDX
huge pages on those unsupported TDX modules (i.e., before
TDX_1.5.28.00.972), please cherry-pick the workaround patch
"x86/virt/tdx: Loop for TDX_INTERRUPTED_RESTARTABLE in
tdh_mem_page_demote()" [11] contained in [4].
Thanks
Yan
[0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
[1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
[2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
[3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
[4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
[5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
[6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
[7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
[8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
[9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
[10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
[11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
Edgecombe, Rick P (1):
KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror
root
Isaku Yamahata (1):
KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table
splitting
Kiryl Shutsemau (4):
KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB
KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting
x86/tdx: Add/Remove DPAMT pages for guest private memory to demote
x86/tdx: Pass guest memory's PFN info to demote for updating
pamt_refcount
Xiaoyao Li (1):
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Yan Zhao (17):
x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge
pages
x86/tdx: Introduce tdx_quirk_reset_folio() to reset private huge pages
x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
KVM: TDX: Enable huge page splitting under write mmu_lock
KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX
KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
KVM: x86: Introduce hugepage_set_guest_inhibit()
KVM: TDX: Honor the guest's accept level contained in an EPT violation
KVM: Change the return type of gfn_handler_t() from bool to int
KVM: x86: Split cross-boundary mirror leafs for
KVM_SET_MEMORY_ATTRIBUTES
KVM: guest_memfd: Split for punch hole and private-to-shared
conversion
x86/virt/tdx: Add loud warning when tdx_pamt_put() fails.
KVM: x86: Introduce per-VM external cache for splitting
KVM: TDX: Implement per-VM external cache for splitting in TDX
KVM: TDX: Turn on PG_LEVEL_2M
arch/arm64/kvm/mmu.c | 8 +-
arch/loongarch/kvm/mmu.c | 8 +-
arch/mips/kvm/mmu.c | 6 +-
arch/powerpc/kvm/book3s.c | 4 +-
arch/powerpc/kvm/e500_mmu_host.c | 8 +-
arch/riscv/kvm/mmu.c | 12 +-
arch/x86/include/asm/kvm-x86-ops.h | 4 +
arch/x86/include/asm/kvm_host.h | 22 +++
arch/x86/include/asm/tdx.h | 21 +-
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 90 +++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 204 +++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.h | 3 +
arch/x86/kvm/vmx/tdx.c | 299 ++++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.h | 5 +
arch/x86/kvm/vmx/tdx_arch.h | 3 +
arch/x86/virt/vmx/tdx/tdx.c | 164 +++++++++++++---
arch/x86/virt/vmx/tdx/tdx.h | 1 +
include/linux/kvm_host.h | 14 +-
virt/kvm/guest_memfd.c | 67 +++++++
virt/kvm/kvm_main.c | 44 +++--
21 files changed, 851 insertions(+), 139 deletions(-)
--
2.43.2
On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> - EPT mapping size and folio size
>
> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>
> Therefore, there are sanity checks in the SEAMCALL wrappers in patches 1-5 to follow this rule, either in tdh_mem_page_aug() for mapping (patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3), tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5) for unmapping.
>
> However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd [1] splits backend folios ahead of notifying KVM for unmapping. So, this series also relies on the fixup patch [8] to notify KVM of unmapping before splitting the backend folio during the memory conversion ioctl.

I think the major issue here is that if splitting fails there is no way to undo the unmapping [1]. How should KVM/VMM/guest handle the case where a guest requested conversion to shared, the conversion failed and the memory is no longer mapped as private?

[1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/

> Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
>
> (1) Compilation error due to missing symbol export of hugetlb_restructuring_free_folio().
>
> (2) guest_memfd splits backend folios when the folio is still mapped as huge in KVM (which breaks KVM's basic assumption that EPT mapping size should not exceed the backend folio size).
>
> (3) guest_memfd is incapable of merging folios to huge for shared-to-private conversions.
>
> (4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned, given that shared pages can only be mapped at 4KB.
>
> So, this series also depends on the four fixup patches included in [4]:
>
> [FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
> [FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion
> [FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
> [FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
>
> (lkp sent me some more gmem compilation errors. I ignored them as I didn't encounter them with my config and env).
>
> ...
>
> [0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
> [1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
> [2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
> [3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
> [5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
> [6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
> [7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
> [8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
> [9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
> [10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
> [11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
Vishal Annapurve <vannapurve@google.com> writes:

> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>
>> - EPT mapping size and folio size
>>
>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>

I'm not familiar with this rule and would like to find out more. Why is this rule imposed?

Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?

In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?

Specifically for TDX, does the folio size actually matter relative to the mapping, or is it more about contiguity than the folio size?

>> Therefore, there are sanity checks in the SEAMCALL wrappers in patches 1-5 to follow this rule, either in tdh_mem_page_aug() for mapping (patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3), tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5) for unmapping.
>>
>> However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd [1] splits backend folios ahead of notifying KVM for unmapping. So, this series also relies on the fixup patch [8] to notify KVM of unmapping before splitting the backend folio during the memory conversion ioctl.
>
> I think the major issue here is that if splitting fails there is no way to undo the unmapping [1]. How should KVM/VMM/guest handle the case where a guest requested conversion to shared, the conversion failed and the memory is no longer mapped as private?
>
> [1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/

Unmapping was supposed to be the point of no return in the conversion process. (This might have changed since we last discussed this. The link [1] from Vishal is where it was discussed.)

The new/current plan is that in the conversion process we'll do anything that might fail first, and then commit the conversion, beginning with zapping, and so zapping is the point of no return. (I think you also suggested this before, but back then I couldn't see a way to separate out the steps cleanly.)

Here are the conversion steps in what we're trying now (leaving out the TDX EPT splitting first):

1. Allocate enough memory for updating the attributes maple tree.
2a. Only for shared->private conversions: unmap from host page table, check for safe refcounts.
2b. Only for private->shared conversions: split folios (note: split only, no merges). Split can fail since HVO needs to be undone, and that requires allocations.
3. Invalidate begin.
4. Zap from stage 2 page tables: this is the point of no return; before this, we must be sure nothing after will fail.
5. Update the attributes maple tree using allocated memory from step 1.
6. Invalidate end.
7. Only for shared->private conversions: merge folios, making sure that merging does not fail (should not, since there are no allocations, only folio aka metadata updates).

Updating the maple tree before calling the folio merge function allows the merge function to look up the *updated* maple tree.

I'm thinking to insert the call to EPT splitting after invalidate begin (3) since EPT splitting is not undoable. However, that will be after folio splitting, hence my earlier question on whether it's a hard rule based on folio size, or based on memory contiguity. Would that work?

>> Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
>>
>> (1) Compilation error due to missing symbol export of hugetlb_restructuring_free_folio().
>>
>> (2) guest_memfd splits backend folios when the folio is still mapped as huge in KVM (which breaks KVM's basic assumption that EPT mapping size should not exceed the backend folio size).
>>
>> (3) guest_memfd is incapable of merging folios to huge for shared-to-private conversions.
>>
>> (4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned, given that shared pages can only be mapped at 4KB.
>>
>> So, this series also depends on the four fixup patches included in [4]:

Thank you for these fixes!

>> [FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
>> [FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion

Thanks for catching this, Vishal also found this in a very recent internal review. Our fix for this is to first apply the new state before doing the folio merge. See the flow described above.

>> [FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
>> [FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
>>
>> (lkp sent me some more gmem compilation errors. I ignored them as I didn't encounter them with my config and env).
>>
>> ...
>>
>> [0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
>> [1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
>> [2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
>> [3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
>> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
>> [5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
>> [6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
>> [7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
>> [8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
>> [9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
>> [10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
>> [11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
On Tue, Jan 06, 2026, Ackerley Tng wrote:
> Vishal Annapurve <vannapurve@google.com> writes:
>
>> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>
>>> - EPT mapping size and folio size
>>>
>>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>>
>
> I'm not familiar with this rule and would like to find out more. Why is this rule imposed?

Because it's the only sane way to safely map memory into the guest? :-D

> Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?

Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still is) strictly bound by the host mapping size. That handles contiguous addresses, but it _also_ handles contiguous protections (e.g. RWX) and other attributes.

> In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?

Not really? Why would the folio be split if the memory _and its attributes_ are fully contiguous? If the attributes are mixed, KVM must not create a mapping spanning mixed ranges, i.e. with multiple folios.
Sean Christopherson <seanjc@google.com> writes:

> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> Vishal Annapurve <vannapurve@google.com> writes:
>>
>>> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>>
>>>> - EPT mapping size and folio size
>>>>
>>>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>>>
>>
>> I'm not familiar with this rule and would like to find out more. Why is this rule imposed?
>
> Because it's the only sane way to safely map memory into the guest? :-D
>
>> Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?
>
> Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still is) strictly bound by the host mapping size. That handles contiguous addresses, but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>
>> In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?
>
> Not really? Why would the folio be split if the memory _and its attributes_ are fully contiguous? If the attributes are mixed, KVM must not create a mapping spanning mixed ranges, i.e. with multiple folios.

The folio can be split if any (or all) of the pages in a huge page range are shared (in the CoCo sense). So in a 1G block of memory, even if the attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio would be split, and the split folios are necessary for tracking users of shared pages using struct page refcounts.

However the split folios in that 1G range are still fully contiguous.

The process of conversion will split the EPT entries soon after the folios are split so the rule remains upheld.

I guess perhaps the question is, is it okay if the folios are smaller than the mapping while conversion is in progress? Does the order matter (split page table entries first vs split folios first)?
On Tue, Jan 06, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
>> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>>> Vishal Annapurve <vannapurve@google.com> writes:
>>>
>>>> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>>>
>>>>> - EPT mapping size and folio size
>>>>>
>>>>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>>>>
>>>
>>> I'm not familiar with this rule and would like to find out more. Why is this rule imposed?
>>
>> Because it's the only sane way to safely map memory into the guest? :-D
>>
>>> Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?
>>
>> Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still is) strictly bound by the host mapping size. That handles contiguous addresses, but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>>
>>> In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?
>>
>> Not really? Why would the folio be split if the memory _and its attributes_ are fully contiguous? If the attributes are mixed, KVM must not create a mapping spanning mixed ranges, i.e. with multiple folios.
>
> The folio can be split if any (or all) of the pages in a huge page range are shared (in the CoCo sense). So in a 1G block of memory, even if the attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio would be split, and the split folios are necessary for tracking users of shared pages using struct page refcounts.

Ahh, that's what the refcounting was referring to. Gotcha.

> However the split folios in that 1G range are still fully contiguous.
>
> The process of conversion will split the EPT entries soon after the folios are split so the rule remains upheld.
>
> I guess perhaps the question is, is it okay if the folios are smaller than the mapping while conversion is in progress? Does the order matter (split page table entries first vs split folios first)?

Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we do (a) nothing, (b) change the refcounting, or (c) add support for mapping multiple folios in one page, probably comes down to which option provides "good enough" performance without incurring too much complexity.
Sean Christopherson <seanjc@google.com> writes:

> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we

Sean, I'd like to clarify this.

> do (a) nothing,

What does do nothing mean here?

In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.

The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.

> (b) change the refcounting, or

I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.

> (c) add support for mapping multiple folios in one page,

Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.

Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?

> probably comes down to which option provides "good enough" performance without incurring too much complexity.
On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we
>
> Sean, I'd like to clarify this.
>
>> do (a) nothing,
>
> What does do nothing mean here?
>
> In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.
>
> The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.
>
>> (b) change the refcounting, or
>
> I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.
>
>> (c) add support for mapping multiple folios in one page,
>
> Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.
>
> Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?

The rule is not "unenforced/unwritten". In fact, it's the de facto standard in KVM.

For non-gmem cases, KVM uses the mapping size in the primary MMU as the max mapping size in the secondary MMU, while the primary MMU does not create a mapping larger than the backend folio size. When splitting the backend folio, the Linux kernel unmaps the folio from both the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).

On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger than folio sizes, splitting folios while they are still mapped in the IOMMU stage-2 page table is not permitted due to the extra folio refcount held by the IOMMU.

For gmem cases, KVM also does not create mappings larger than the folio size allocated from gmem. This is why the TDX huge page series relies on gmem's ability to allocate huge folios.

We really need to be careful if we hope to break this long-established rule.

>> probably comes down to which option provides "good enough" performance without incurring too much complexity.
On Wed, Jan 14, 2026, Yan Zhao wrote:
> On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we
>>
>> Sean, I'd like to clarify this.
>>
>>> do (a) nothing,
>>
>> What does do nothing mean here?

Don't support hugepages for shared mappings, at least for now (as Rick pointed out, doing nothing now doesn't mean we can't do something in the future).

>> In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.
>>
>> The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.
>>
>>> (b) change the refcounting, or
>>
>> I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.
>>
>>> (c) add support for mapping multiple folios in one page,
>>
>> Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.
>>
>> Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?
>
> The rule is not "unenforced/unwritten". In fact, it's the de facto standard in KVM.

Ya, more or less.

The rules aren't formally documented because the overarching rule is very simple: KVM must not map memory into the guest that the guest shouldn't have access to. That falls firmly into the "well, duh" category, and so it's not written down anywhere :-)

How exactly KVM has honored that rule has varied over the years, and still varies between architectures. In the past KVM x86 special cased HugeTLB and THP, but that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU page tables.

But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).

> For non-gmem cases, KVM uses the mapping size in the primary MMU as the max mapping size in the secondary MMU, while the primary MMU does not create a mapping larger than the backend folio size.

Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a driver _could_ split a folio (no idea why it would) but map the entire thing into userspace, and then userspace could hand off that memory to KVM.

So I would say _KVM's_ rule isn't so much "mapping size <= folio size", it's that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.

But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger than the folio size.

> When splitting the backend folio, the Linux kernel unmaps the folio from both the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
>
> On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger than folio sizes, splitting folios while they are still mapped in the IOMMU stage-2 page table is not permitted due to the extra folio refcount held by the IOMMU.
>
> For gmem cases, KVM also does not create mappings larger than the folio size allocated from gmem. This is why the TDX huge page series relies on gmem's ability to allocate huge folios.
>
> We really need to be careful if we hope to break this long-established rule.

+100 to being careful, but at the same time I don't think we should get _too_ fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there might not be a folio, if guest_memfd stopped using folios, then the entire discussion becomes moot.

And as above, the long-standing rule isn't about the implementation details so much as it is about KVM's behavior. If the simplest solution to support huge guest_memfd pages is to decouple the max order from the folio, then so be it.

That said, I'd very much like to get a sense of the alternatives, because at the end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, and naively, tying that to the folio seems like an easy solution.
On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> On Wed, Jan 14, 2026, Yan Zhao wrote:
>> On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
>>> Sean Christopherson <seanjc@google.com> writes:
>>>
>>>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we
>>>
>>> Sean, I'd like to clarify this.
>>>
>>>> do (a) nothing,
>>>
>>> What does do nothing mean here?
>
> Don't support hugepages for shared mappings, at least for now (as Rick pointed out, doing nothing now doesn't mean we can't do something in the future).
>
>>> In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.
>>>
>>> The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.
>>>
>>>> (b) change the refcounting, or
>>>
>>> I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.
>>>
>>>> (c) add support for mapping multiple folios in one page,
>>>
>>> Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.
>>>
>>> Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?
>>
>> The rule is not "unenforced/unwritten". In fact, it's the de facto standard in KVM.
>
> Ya, more or less.
>
> The rules aren't formally documented because the overarching rule is very simple: KVM must not map memory into the guest that the guest shouldn't have access to. That falls firmly into the "well, duh" category, and so it's not written down anywhere :-)
>
> How exactly KVM has honored that rule has varied over the years, and still varies between architectures. In the past KVM x86 special cased HugeTLB and THP, but that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU page tables.
>
> But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).
>
>> For non-gmem cases, KVM uses the mapping size in the primary MMU as the max mapping size in the secondary MMU, while the primary MMU does not create a mapping larger than the backend folio size.
>
> Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a driver _could_ split a folio (no idea why it would) but map the entire thing into userspace, and then userspace could hand off that memory to KVM.
>
> So I would say _KVM's_ rule isn't so much "mapping size <= folio size", it's that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
>
> But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger than the folio size.

Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the backend. Indeed, a driver can create a huge mapping in primary MMU for the VM_PFNMAP range with multiple discontiguous pages, if it wants.

But this occurs before KVM creates the mapping. Per my understanding, pages under VM_PFNMAP are pinned, so it looks like there're no splits after they are mapped into the primary MMU.

So, out of curiosity, do you know why linux kernel needs to unmap mappings from both primary and secondary MMUs, and check folio refcount before performing folio splitting?

>> When splitting the backend folio, the Linux kernel unmaps the folio from both the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
>>
>> On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger than folio sizes, splitting folios while they are still mapped in the IOMMU stage-2 page table is not permitted due to the extra folio refcount held by the IOMMU.
>>
>> For gmem cases, KVM also does not create mappings larger than the folio size allocated from gmem. This is why the TDX huge page series relies on gmem's ability to allocate huge folios.
>>
>> We really need to be careful if we hope to break this long-established rule.
>
> +100 to being careful, but at the same time I don't think we should get _too_ fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there might not be a folio, if guest_memfd stopped using folios, then the entire discussion becomes moot.
>
> And as above, the long-standing rule isn't about the implementation details so much as it is about KVM's behavior. If the simplest solution to support huge guest_memfd pages is to decouple the max order from the folio, then so be it.
>
> That said, I'd very much like to get a sense of the alternatives, because at the end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, and naively, tying that to the folio seems like an easy solution.

Thanks for the explanation.

Alternatively, how do you feel about the approach of splitting S-EPT first before splitting folios? If guest_memfd always splits 1GB folios to 2MB first and only splits the converted range to 4KB, splitting S-EPT before splitting folios should not introduce too much overhead. Then, we can defer the folio size problem until guest_memfd stops using folios.

If the decision is to stop relying on folios for unmapping now, do you think the following changes are reasonable for the TDX huge page series?

- Add WARN_ON_ONCE() to assert that pages are in a single folio in tdh_mem_page_aug().
- Do not assert that pages are in a single folio in tdh_phymem_page_wbinvd_hkid(). (or just assert pfn_valid() for each page?) Could you please give me guidance on https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
- Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.
On Wed, Jan 14, 2026, Yan Zhao wrote: > On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote: > > On Wed, Jan 14, 2026, Yan Zhao wrote: > > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max > > > mapping size in the secondary MMU, while the primary MMU does not create a > > > mapping larger than the backend folio size. > > > > Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a > > driver _could_ split a folio (no idea why it would) but map the entire thing into > > userspace, and then userspace could have off that memory to KVM. > > > > So I wouldn't say _KVM's_ rule isn't so much "mapping size <= folio size", it's > > that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's > > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping > > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch. > > > > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger > > than the folio size. > > Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the > backend. Indeed, a driver can create a huge mapping in primary MMU for the > VM_PFNMAP range with multiple discontiguous pages, if it wants. > > But this occurs before KVM creates the mapping. Per my understanding, pages > under VM_PFNMAP are pinned, Nope. Only the driver that owns the VMAs knows what sits behind the PFN and the lifecycle rules for that memory. That last point is *very* important. Even if the PFNs shoved into VM_PFNMAP VMAs have an associated "struct page", that doesn't mean the "struct page" is refcounted, i.e. can be pinned. That detail was the heart of "KVM: Stop grabbing references to PFNMAP'd pages" overhaul[*]. To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (priveleged) userspace to "do the right thing", the secondary MMU needs to be tied into mmu_notifiers, so that modifications to the mappings in the primary MMU are reflected into the secondary MMU. [*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com > so it looks like there're no splits after they are mapped into the primary MMU. > > So, out of curiosity, do you know why linux kernel needs to unmap mappings from > both primary and secondary MMUs, and check folio refcount before performing > folio splitting? Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, if something is going through the effort of splitting a folio, then odds are very, very good that the new folios can't be safely mapped as a contiguous hugepage. Limiting mapping sizes to folios makes the rules/behavior straightfoward for core MM to implement, and for drivers/users to understand. Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to communicate the maximum mapping size; folios are the "currency" for doing so. And then for edge cases that want to map a split folio as a hugepage (if any such edge cases exist), thus take on the responsibility of managing the lifecycle of the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. > > > When splitting the backend folio, the Linux kernel unmaps the folio from both > > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier). 
> > > > > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger > > > than folio sizes, splitting folios while they are still mapped in the IOMMU > > > stage-2 page table is not permitted due to the extra folio refcount held by the > > > IOMMU. > > > > > > For gmem cases, KVM also does not create mappings larger than the folio size > > > allocated from gmem. This is why the TDX huge page series relies on gmem's > > > ability to allocate huge folios. > > > > > > We really need to be careful if we hope to break this long-established rule. > > > > +100 to being careful, but at the same time I don't think we should get _too_ > > fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there > > might not be a folio, if guest_memfd stopped using folios, then the entire > > discussion becomes moot. > > > > And as above, the long-standing rule isn't about the implementation details so > > much as it is about KVM's behavior. If the simplest solution to support huge > > guest_memfd pages is to decouple the max order from the folio, then so be it. > > > > That said, I'd very much like to get a sense of the alternatives, because at the > > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, > > and naively, tying that to the folio seems like an easy solution. > Thanks for the explanation. > > Alternatively, how do you feel about the approach of splitting S-EPT first > before splitting folios? > If guest_memfd always splits 1GB folios to 2MB first and only splits the > converted range to 4KB, splitting S-EPT before splitting folios should not > introduce too much overhead. Then, we can defer the folio size problem until > guest_memfd stops using folios. > > If the decision is to stop relying on folios for unmapping now, do you think > the following changes are reasonable for the TDX huge page series? > > - Add WARN_ON_ONCE() to assert that pages are in a single folio in > tdh_mem_page_aug(). > - Do not assert that pages are in a single folio in > tdh_phymem_page_wbinvd_hkid(). (or just assert of pfn_valid() for each page?) > Could you please give me guidance on > https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com. > - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error. Ok, with the disclaimer that I hadn't actually looked at the patches in this series before now... TDX absolutely should not be doing _anything_ with folios. I am *very* strongly opposed to TDX assuming that memory is backed by refcounted "struct page", and thus can use folios to glean the maximum mapping size. guest_memfd is _the_ owner of that information. guest_memfd needs to explicitly _tell_ the rest of KVM what the maximum mapping size is; arch code should not infer that size from a folio. And that code+behavior already exists in the form of kvm_gmem_mapping_order() and its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level. IIUC, the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because tdx_clflush_page() ultimately requires a "struct page". That is absolutely ridiculous and not acceptable. CLFLUSH takes a virtual address, there is *zero* reason tdh_mem_page_aug() needs to require/assume a struct page. Dave may feel differently, but I am not going to budge on this. I am not going to bake in assumptions throughout KVM about memory being backed by page+folio. 
We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to PFNMAP'd pages" series; I am NOT reintroducing such assumptions. NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
On Wed, Jan 14, 2026 at 07:26:44AM -0800, Sean Christopherson wrote: > On Wed, Jan 14, 2026, Yan Zhao wrote: > > On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote: > > > On Wed, Jan 14, 2026, Yan Zhao wrote: > > > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max > > > > mapping size in the secondary MMU, while the primary MMU does not create a > > > > mapping larger than the backend folio size. > > > > > > Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a > > > driver _could_ split a folio (no idea why it would) but map the entire thing into > > > userspace, and then userspace could have off that memory to KVM. > > > > > > So I wouldn't say _KVM's_ rule isn't so much "mapping size <= folio size", it's > > > that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's > > > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping > > > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch. > > > > > > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger > > > than the folio size. > > > > Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the > > backend. Indeed, a driver can create a huge mapping in primary MMU for the > > VM_PFNMAP range with multiple discontiguous pages, if it wants. > > > > But this occurs before KVM creates the mapping. Per my understanding, pages > > under VM_PFNMAP are pinned, > > Nope. Only the driver that owns the VMAs knows what sits behind the PFN and the > lifecycle rules for that memory. > > That last point is *very* important. Even if the PFNs shoved into VM_PFNMAP VMAs > have an associated "struct page", that doesn't mean the "struct page" is refcounted, > i.e. can be pinned. That detail was the heart of "KVM: Stop grabbing references to > PFNMAP'd pages" overhaul[*]. > > To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (priveleged) > userspace to "do the right thing", the secondary MMU needs to be tied into > mmu_notifiers, so that modifications to the mappings in the primary MMU are > reflected into the secondary MMU. You are right! It maps tail page of a !compound huge page, which is not refcounted. > [*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com > > > so it looks like there're no splits after they are mapped into the primary MMU. > > > > So, out of curiosity, do you know why linux kernel needs to unmap mappings from > > both primary and secondary MMUs, and check folio refcount before performing > > folio splitting? > > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, > if something is going through the effort of splitting a folio, then odds are very, > very good that the new folios can't be safely mapped as a contiguous hugepage. > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core > MM to implement, and for drivers/users to understand. > > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to > communicate the maximum mapping size; folios are the "currency" for doing so. > > And then for edge cases that want to map a split folio as a hugepage (if any such > edge cases exist), thus take on the responsibility of managing the lifecycle of > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. Thanks for the explanation. 
> > > > When splitting the backend folio, the Linux kernel unmaps the folio from both > > > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier). > > > > > > > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger > > > > than folio sizes, splitting folios while they are still mapped in the IOMMU > > > > stage-2 page table is not permitted due to the extra folio refcount held by the > > > > IOMMU. > > > > > > > > For gmem cases, KVM also does not create mappings larger than the folio size > > > > allocated from gmem. This is why the TDX huge page series relies on gmem's > > > > ability to allocate huge folios. > > > > > > > > We really need to be careful if we hope to break this long-established rule. > > > > > > +100 to being careful, but at the same time I don't think we should get _too_ > > > fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there > > > might not be a folio, if guest_memfd stopped using folios, then the entire > > > discussion becomes moot. > > > > > > And as above, the long-standing rule isn't about the implementation details so > > > much as it is about KVM's behavior. If the simplest solution to support huge > > > guest_memfd pages is to decouple the max order from the folio, then so be it. > > > > > > That said, I'd very much like to get a sense of the alternatives, because at the > > > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, > > > and naively, tying that to the folio seems like an easy solution. > > Thanks for the explanation. > > > > Alternatively, how do you feel about the approach of splitting S-EPT first > > before splitting folios? > > If guest_memfd always splits 1GB folios to 2MB first and only splits the > > converted range to 4KB, splitting S-EPT before splitting folios should not > > introduce too much overhead. Then, we can defer the folio size problem until > > guest_memfd stops using folios. > > > > If the decision is to stop relying on folios for unmapping now, do you think > > the following changes are reasonable for the TDX huge page series? > > > > - Add WARN_ON_ONCE() to assert that pages are in a single folio in > > tdh_mem_page_aug(). > > - Do not assert that pages are in a single folio in > > tdh_phymem_page_wbinvd_hkid(). (or just assert of pfn_valid() for each page?) > > Could you please give me guidance on > > https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com. > > - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error. > > Ok, with the disclaimer that I hadn't actually looked at the patches in this > series before now... > > TDX absolutely should not be doing _anything_ with folios. I am *very* strongly > opposed to TDX assuming that memory is backed by refcounted "struct page", and > thus can use folios to glean the maximum mapping size. > > guest_memfd is _the_ owner of that information. guest_memfd needs to explicitly > _tell_ the rest of KVM what the maximum mapping size is; arch code should not > infer that size from a folio. > > And that code+behavior already exists in the form of kvm_gmem_mapping_order() and > its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level. IIUC, > the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because > tdx_clflush_page() ultimately requires a "struct page". That is absolutely > ridiculous and not acceptable. CLFLUSH takes a virtual address, there is *zero* > reason tdh_mem_page_aug() needs to require/assume a struct page. 
Not really. Per my understanding, tdx_mem_page_aug() requires "struct page" (and checks folios for huge pages) because the SEAMCALL wrapper APIs are not currently built into KVM. Since they may have callers other than KVM, some sanity checking in case the caller does something incorrect seems necessary (e.g., in case the caller provides an out-of-range struct page or a page with !pfn_valid() PFN). This is similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in __folio_split(). With tdx_mem_page_aug() ensuring page validity and contiguity, invoking the local static function tdx_clflush_page() page by page looks good to me. Alternatively, we could convert tdx_clflush_page() to tdx_clflush_cache_range(), which receives a VA. However, I'm not sure if my understanding is correct now, especially since it seems like everyone thinks the SEAMCALL wrapper APIs should trust the caller, assuming they are KVM-specific. > Dave may feel differently, but I am not going to budge on this. I am not going > to bake in assumptions throughout KVM about memory being backed by page+folio. > We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to > PFNMAP'd pages" series; I am NOT reintroducing such assumptions. > > NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
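For illustration, a rough sketch of the VA-based flush helper mentioned above, assuming the per-page flavor already forwards to the existing clflush_cache_range() helper. The tdx_clflush_cache_range() name and signature are assumptions:

#include <asm/cacheflush.h>

/*
 * Sketch only: a range-based variant that takes a direct-map VA and length,
 * so callers never need a struct page.
 */
static void tdx_clflush_cache_range(void *va, unsigned int size)
{
	clflush_cache_range(va, size);
}

/* e.g. tdx_clflush_cache_range(__va(PFN_PHYS(pfn)), PAGE_SIZE); */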
On Thu, Jan 15, 2026, Yan Zhao wrote: > On Wed, Jan 14, 2026 at 07:26:44AM -0800, Sean Christopherson wrote: > > Ok, with the disclaimer that I hadn't actually looked at the patches in this > > series before now... > > > > TDX absolutely should not be doing _anything_ with folios. I am *very* strongly > > opposed to TDX assuming that memory is backed by refcounted "struct page", and > > thus can use folios to glean the maximum mapping size. > > > > guest_memfd is _the_ owner of that information. guest_memfd needs to explicitly > > _tell_ the rest of KVM what the maximum mapping size is; arch code should not > > infer that size from a folio. > > > > And that code+behavior already exists in the form of kvm_gmem_mapping_order() and > > its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level. IIUC, > > the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because > > tdx_clflush_page() ultimately requires a "struct page". That is absolutely > > ridiculous and not acceptable. CLFLUSH takes a virtual address, there is *zero* > > reason tdh_mem_page_aug() needs to require/assume a struct page. > Not really. > > Per my understanding, tdx_mem_page_aug() requires "struct page" (and checks > folios for huge pages) because the SEAMCALL wrapper APIs are not currently built > into KVM. Since they may have callers other than KVM, some sanity checking in > case the caller does something incorrect seems necessary (e.g., in case the > caller provides an out-of-range struct page or a page with !pfn_valid() PFN). As I mentioned in my reply to Dave, I don't object to reasonable sanity checks. > This is similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in > __folio_split(). No, it's not. __folio_split() is verifying that the input for the exact one thing it's doing, splitting a huge folio, matches what the function is being asked to do. TDX requiring guest_memfd to back everything with struct page, and to only use single, huge folios to map hugepages into the guest is making completely unnecessary assumptions about guest_memfd and KVM MMU implementation details. > With tdx_mem_page_aug() ensuring page validity and contiguity, It absolutely does not. - If guest_memfd unmaps the direct map[*], CLFLUSH will fault and panic the kernel. - If the PFN isn't backed by struct page, tdx_mem_page_aug() will hit a NULL pointer deref. - If the PFN is backed by struct page, but the page is managed by something other than guest_memfd or core MM, all bets are off. [*] https://lore.kernel.org/all/20260114134510.1835-1-kalyazin@amazon.com > invoking the local static function tdx_clflush_page() page by page looks good to > me. Alternatively, we could convert tdx_clflush_page() to > tdx_clflush_cache_range(), which receives a VA. > > However, I'm not sure if my understanding is correct now, especially since it > seems like everyone thinks the SEAMCALL wrapper APIs should trust the caller, > assuming they are KVM-specific. It's all kernel code. Implying that KVM is somehow untrusted is absurd.
On 1/14/26 07:26, Sean Christopherson wrote: ... > Dave may feel differently, but I am not going to budge on this. I am not going > to bake in assumptions throughout KVM about memory being backed by page+folio. > We _just_ cleaned up that mess in the aformentioned "Stop grabbing references to > PFNMAP'd pages" series, I am NOT reintroducing such assumptions. > > NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn. 'struct page' gives us two things: One is the type safety, but I'm pretty flexible on how that's implemented as long as it's not a raw u64 getting passed around everywhere. The second thing is a (near) guarantee that the backing memory is RAM. Not only RAM, but RAM that the TDX module knows about and has a PAMT and TDMR and all that TDX jazz. We've also done things like stopping memory hotplug because you can't amend TDX page metadata at runtime. So we prevent new 'struct pages' from coming into existence. So 'struct page' is a quite useful choke point for TDX. I'd love to hear more about how guest_memfd is going to tie all the pieces together and give the same straightforward guarantees without leaning on the core mm the same way we do now.
On Wed, Jan 14, 2026, Dave Hansen wrote:
> On 1/14/26 07:26, Sean Christopherson wrote:
> ...
> > Dave may feel differently, but I am not going to budge on this. I am not going
> > to bake in assumptions throughout KVM about memory being backed by page+folio.
> > We _just_ cleaned up that mess in the aformentioned "Stop grabbing references to
> > PFNMAP'd pages" series, I am NOT reintroducing such assumptions.
> >
> > NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
>
> 'struct page' gives us two things: One is the type safety, but I'm
> pretty flexible on how that's implemented as long as it's not a raw u64
> getting passed around everywhere.
I don't necessarily disagree on the type safety front, but for the specific code
in question, any type safety is a facade. Everything leading up to the TDX code
is dealing with raw PFNs and/or PTEs. Then the TDX code assumes that the PFN
being mapped into the guest is backed by a struct page, and that the folio size
is consistent with @level, without _any_ checks whatsoever. This is providing
the exact opposite of safety.
static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
			    enum pg_level level, kvm_pfn_t pfn)
{
	int tdx_level = pg_level_to_tdx_sept_level(level);
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct page *page = pfn_to_page(pfn); <==================
	struct folio *folio = page_folio(page);
	gpa_t gpa = gfn_to_gpa(gfn);
	u64 entry, level_state;
	u64 err;

	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
			       folio_page_idx(folio, page), &entry, &level_state);
	...
}
I've no objection if e.g. tdh_mem_page_aug() wants to sanity check that a PFN
is backed by a struct page with a valid refcount, it's code like that above that
I don't want.
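As a contrast with the snippet above, a hedged sketch of the kind of check that would be reasonable inside the wrapper itself; the helper name is made up, and what exactly counts as a "valid refcount" is left open in this thread:

/*
 * Illustrative only: verify the pfn maps to a refcounted struct page before
 * anything dereferences it, rather than assuming it does.
 */
static bool tdx_pfn_is_refcounted_page(unsigned long pfn)
{
	if (!pfn_valid(pfn))
		return false;

	return folio_ref_count(page_folio(pfn_to_page(pfn))) > 0;
}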
> The second thing is a (near) guarantee that the backing memory is RAM.
> Not only RAM, but RAM that the TDX module knows about and has a PAMT and
> TDMR and all that TDX jazz.
I'm not at all opposed to backing guest_memfd with "struct page", quite the
opposite. What I don't want is to bake assumptions into KVM code that doesn't
_require_ struct page, because that has caused KVM immense pain in the past.
And I'm strongly opposed to KVM special-casing TDX or anything else, precisely
because we struggled through all that pain so that KVM would work better with
memory that isn't backed by "struct page", or more specifically, memory that has
an associated "struct page", but isn't managed by core MM, e.g. isn't refcounted.
> We've also done things like stopping memory hotplug because you can't
> amend TDX page metadata at runtime. So we prevent new 'struct pages'
> from coming into existence. So 'struct page' is a quite useful choke
> point for TDX.
>
> I'd love to hear more about how guest_memfd is going to tie all the
> pieces together and give the same straightforward guarantees without
> leaning on the core mm the same way we do now.
I don't think guest_memfd needs to be different, and that's not what I'm advocating.
What I don't want is to make KVM TDX's handling of memory different from the rest
of KVM and KVM's MMU(s).
On 1/14/26 16:19, Sean Christopherson wrote:
>> 'struct page' gives us two things: One is the type safety, but I'm
>> pretty flexible on how that's implemented as long as it's not a raw u64
>> getting passed around everywhere.
> I don't necessarily disagree on the type safety front, but for the specific code
> in question, any type safety is a facade. Everything leading up to the TDX code
> is dealing with raw PFNs and/or PTEs. Then the TDX code assumes that the PFN
> being mapped into the guest is backed by a struct page, and that the folio size
> is consistent with @level, without _any_ checks whatsoever. This is providing
> the exact opposite of safety.
>
> static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> 			    enum pg_level level, kvm_pfn_t pfn)
> {
> 	int tdx_level = pg_level_to_tdx_sept_level(level);
> 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> 	struct page *page = pfn_to_page(pfn); <==================
I of course agree that this is fundamentally unsafe, it's just not
necessarily bad code.
I hope we both agree that this could be made _more_ safe by, for
instance, making sure the page is in a zone, pfn_valid(), and a few more
things.
In a perfect world, these conversions would happen at a well-defined
layer (KVM=>TDX) and in relatively few places. That layer transition is
where the sanity checks happen. It's super useful to have:
struct page *kvm_pfn_to_tdx_private_page(kvm_pfn_t pfn)
{
	struct page *page = pfn_to_page(pfn);

#ifdef DEBUG
	WARN_ON_ONCE(!pfn_valid(pfn));
	// page must be from a "file"???
	WARN_ON_ONCE(!page_mapping(page));
	WARN_ON_ONCE(...);
#endif
	return page;
}
*EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s
are compiled out, this explicitly lays out the assumptions and it means
someone reading TDX code has an easier time comprehending it.
It's also not a crime to do the *same* checking on kvm_pfn_t and not
have a type transition. I just like the idea of changing the type so
that the transition line is clear and the concept is carried (forced,
even) through the layers of helpers.
On Fri, Jan 16, 2026, Dave Hansen wrote:
> On 1/14/26 16:19, Sean Christopherson wrote:
> >> 'struct page' gives us two things: One is the type safety, but I'm
> >> pretty flexible on how that's implemented as long as it's not a raw u64
> >> getting passed around everywhere.
> > I don't necessarily disagree on the type safety front, but for the specific code
> > in question, any type safety is a facade. Everything leading up to the TDX code
> > is dealing with raw PFNs and/or PTEs. Then the TDX code assumes that the PFN
> > being mapped into the guest is backed by a struct page, and that the folio size
> > is consistent with @level, without _any_ checks whatsoever. This is providing
> > the exact opposite of safety.
> >
> > static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > 			    enum pg_level level, kvm_pfn_t pfn)
> > {
> > 	int tdx_level = pg_level_to_tdx_sept_level(level);
> > 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > 	struct page *page = pfn_to_page(pfn); <==================
>
> I of course agree that this is fundamentally unsafe, it's just not
> necessarily bad code.
>
> I hope we both agree that this could be made _more_ safe by, for
> instance, making sure the page is in a zone, pfn_valid(), and a few more
> things.
>
> In a perfect world, these conversions would happen at a well-defined
> layer (KVM=>TDX) and in relatively few places. That layer transition is
> where the sanity checks happen. It's super useful to have:
>
> struct page *kvm_pfn_to_tdx_private_page(kvm_pfn_t pfn)
> {
> 	struct page *page = pfn_to_page(pfn);
>
> #ifdef DEBUG
> 	WARN_ON_ONCE(!pfn_valid(pfn));
> 	// page must be from a "file"???
> 	WARN_ON_ONCE(!page_mapping(page));
> 	WARN_ON_ONCE(...);
> #endif
> 	return page;
> }
>
> *EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s
> are compiled out, this explicitly lays out the assumptions and it means
> someone reading TDX code has an easier time comprehending it.
I object to the existence of those assumptions. Why the blazes does TDX care
how KVM and guest_memfd manage memory? If you want to assert that the pfn is
compatible with TDX, then by all means. But I am NOT accepting any more KVM code
that assumes TDX memory is backed by refcounted struct page. If I had been paying
more attention when the initial TDX series landed, I would have NAK'd that too.
tdh_mem_page_aug() is just an absurdly slow way of writing a PTE. It doesn't
_need_ the pfn to be backed by a struct page, at all. IMO, what you're asking for
is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and
__kvm_tdp_mmu_write_spte(). No thanks.
> It's also not a crime to do the *same* checking on kvm_pfn_t and not
> have a type transition. I just like the idea of changing the type so
> that the transition line is clear and the concept is carried (forced,
> even) through the layers of helpers.
On 1/16/26 09:14, Sean Christopherson wrote: ... >> *EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s >> are compiled out, this explicitly lays out the assumptions and it means >> someone reading TDX code has an easier idea comprehending it. > I object to the existence of those assumptions. Why the blazes does TDX care > how KVM and guest_memfd manages memory? For me, it's because TDX can't take arbitrary kvm_pfn_t's for private memory. It's got to be able to be converted in the hardware, and also have allocated TDX metadata (PAMT) that the TDX module was handed at module init time. I thought kvm_pfn_t might, for instance be pointing over to a shared page or MMIO. Those can't be used for TD private memory. I think it's a pretty useful convention to know that the generic, flexible kvm_pfn_t has been winnowed down to a more restrictive type that is what TDX needs. But, honestly, my big aversion was to u64's everywhere. I can certainly live with a few kvm_pfn_t's in the TDX code. It doesn't have to be 'struct page'. > If you want to assert that the pfn is compatible with TDX, then by > all means. But I am NOT accepting any more KVM code that assumes > TDX memory is backed by refcounted struct page. If I had been > paying more attention when the initial TDX series landed, I would > have NAK'd that too. I'm kinda surprised by that. The only memory we support handing into TDs for private memory is refcounted struct page. I can imagine us being able to do this with DAX pages in the near future, but those have 'struct page' too, and I think they're refcounted pretty normally now as well. The TDX module initialization is pretty tied to NUMA nodes, too. If it's in a NUMA node, the TDX module is told about it and it also universally gets a 'struct page'. Is there some kind of memory that I'm missing? What else *is* there? :) > tdh_mem_page_aug() is just an absurdly slow way of writing a PTE. It doesn't > _need_ the pfn to be backed a struct page, at all. IMO, what you're asking for > is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and > __kvm_tdp_mmu_write_spte(). No thanks. Which part is absurdly slow? page_to_phys()? Isn't that just a shift by an immediate and a subtraction of an immediate? Yeah, the subtraction immediate is chonky so the instruction is big. But at a quick glance I'm not seeing anything absurd.
On Fri, Jan 16, 2026, Dave Hansen wrote:
> On 1/16/26 09:14, Sean Christopherson wrote:
> > If you want to assert that the pfn is compatible with TDX, then by
> > all means. But I am NOT accepting any more KVM code that assumes
> > TDX memory is backed by refcounted struct page. If I had been
> > paying more attention when the initial TDX series landed, I would
> > have NAK'd that too.
> I'm kinda surprised by that. The only memory we support handing into TDs
> for private memory is refcounted struct page. I can imagine us being
> able to do this with DAX pages in the near future, but those have
> 'struct page' too, and I think they're refcounted pretty normally now as
> well.
>
> The TDX module initialization is pretty tied to NUMA nodes, too. If it's
> in a NUMA node, the TDX module is told about it and it also universally
> gets a 'struct page'.
>
> Is there some kind of memory that I'm missing? What else *is* there? :)
I don't want to special case TDX on the backend of KVM's MMU. There's already
waaaay too much code and complexity in KVM that exists purely for S-EPT. Baking
in assumptions on how _exactly_ KVM is managing guest memory goes too far.
The reason I'm so hostile towards struct page is that, as evidenced by this series
and a ton of historical KVM code, assuming that memory is backed by struct page is
a _very_ slippery slope towards code that is extremely nasty to unwind later on.
E.g. see all of the effort that ended up going into commit ce7b5695397b ("KVM: TDX:
Drop superfluous page pinning in S-EPT management"). And in this series, the
constraints that will be placed on guest_memfd if TDX assumes hugepages will always
be covered in a single folio. Untangling KVM's historical (non-TDX) messes around
struct page took us something like two years.
And so to avoid introducing similar messes in the future, I don't want KVM's MMU
to make _any_ references to struct page when it comes to mapping memory into the
guest unless it's absolutely necessary, e.g. to put a reference when KVM _knows_
it acquired a refcounted page via gup() (and ideally we'd kill even that, e.g.
by telling gup() not to bump the refcount in the first place).
> > tdh_mem_page_aug() is just an absurdly slow way of writing a PTE. It doesn't
> > _need_ the pfn to be backed a struct page, at all. IMO, what you're asking for
> > is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and
> > __kvm_tdp_mmu_write_spte(). No thanks.
>
> Which part is absurdly slow?
The SEAMCALL itself. I'm saying that TDH_MEM_PAGE_AUG is really just the S-EPT
version of "make this PTE PRESENT", and that piling on sanity checks that aren't
fundamental to TDX shouldn't be done when KVM is writing PTEs.
In other words, something like this is totally fine:
KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));
but this is not:
WARN_ON_ONCE(!page_mapping(pfn_to_page(pfn)));
On 1/16/26 11:59, Sean Christopherson wrote: > The SEAMCALL itself. I'm saying that TDH_MEM_PAGE_AUG is really just the S-EPT > version of "make this PTE PRESENT", and that piling on sanity checks that aren't > fundamental to TDX shouldn't be done when KVM is writing PTEs. > > In other words, something like this is totally fine: > > KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn)); > > but this is not: > > WARN_ON_ONCE(!page_mapping(pfn_to_page(pfn))); OK, I think I've got a better idea what you're aiming for. I think that's totally doable going forward.
On Wed, 2026-01-14 at 16:19 -0800, Sean Christopherson wrote: > I've no objection if e.g. tdh_mem_page_aug() wants to sanity check > that a PFN is backed by a struct page with a valid refcount, it's > code like that above that I don't want. Dave wants safety for the TDX pages getting handed to the module. Sean doesn’t want TDX code to have specialness about requiring struct page. I think they are not too much at odds actually. Here are two ideas to get both: 1. Have the TDX module do the checking Yan points out that we could possibly rely on TDX_OPERAND_ADDR_RANGE_ERROR to detect operating on the wrong type of memory. We would have to make sure this will check everything we need going forward. 2. Invent a new tdx_page_t type. We could have a mk_foo()-type helpers that does the checks and converts kvm_pfn_t to TDX’s verified type. The helper can do the checks that it is valid TDX capable memory. Then there is one place that does the conversion. It will be easy to change the verification method if we ever need to. One benefit is that struct page has already been a problem for other reasons [0]. To work around that issue we had to keep duplicate formats of the TDVPR page and lose the standardization of how we handle pages in the TDX code. This is perfectly functional, but a bit annoying. But (2) is inventing a new type, which is somewhat disagreeable too. I’m thinking maybe explore 1 first, with the eventual goal of moving everything to some type of pfn type to unify with the rest of KVM. Either KVM’s or the normal one. But before we do that, can we settle on what sanity checks we want: 1. Page is TDX capable memory 2. ... I think that is it? There was some discussion of refcount checking. I think we don’t need it? [0] https://lore.kernel.org/kvm/20250910144453.1389652-1-dave.hansen@linux.intel.com/#r
On Fri, Jan 16, 2026, Rick P Edgecombe wrote: > On Wed, 2026-01-14 at 16:19 -0800, Sean Christopherson wrote: > > I've no objection if e.g. tdh_mem_page_aug() wants to sanity check > > that a PFN is backed by a struct page with a valid refcount, it's > > code like that above that I don't want. > > Dave wants safety for the TDX pages getting handed to the module. Define "safety". As I stressed earlier, blindly retrieving a "struct page" and dereferencing that pointer is the exact opposite of safe. > 2. Invent a new tdx_page_t type. Still doesn't provide meaningful safety. Regardless of what type gets passed into the low level tdh_*() helpers, it's going to require KVM to effectively cast a bare pfn, because I am completely against passing anything other than a SPTE to tdx_sept_set_private_spte(). > 1. Page is TDX capable memory That's fine by me, but that's _very_ different than what was proposed here.
On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > > Dave wants safety for the TDX pages getting handed to the module. > > Define "safety". As I stressed earlier, blinding retrieving a > "struct page" and dereferencing that pointer is the exact opposite of > safe. I think we had two problems. 1. Passing in raw PA's via u64 led to buggy code. IIRC we had a bug with this that was caught before it went upstream. So a page needs a real type of some sort. 2. Work was done on the tip side to prevent non-TDX capable memory from entering the page allocator. With that in place, by requiring struct page, TDX code can know that it is getting the type of memory it worked hard to guarantee was good. You are saying that shifting a PFN to a struct page blindly doesn't actually guarantee that it meets those requirements. Makes sense. For (1) we can just use any old type I think - pfn_t, etc. As we discussed in the base series. For (2) we need to check that the memory came from the page allocator, or otherwise is valid TDX memory somehow. That is at least the only check that makes sense to me. There was some discussion about refcounts somewhere in this thread. I don't think it's arch/x86's worry. Then Yan was saying something last night that I didn't quite follow. We said, let's just resume the discussion on the list. So she might suggest another check. > > > 2. Invent a new tdx_page_t type. > > Still doesn't provide meaningful safety. Regardless of what type > gets passed into the low level tdh_*() helpers, it's going to require > KVM to effectively cast a bare pfn, because I am completely against > passing anything other than a SPTE to tdx_sept_set_private_spte(). I'm not sure I was clear, like: 1. A raw PFN gets passed in to the conversion helper in arch/x86. 2. The helper does the check that it is TDX capable memory, or anything it cares to check about memory safety, then returns the new type to KVM. 3. KVM uses the type as an argument to any seamcall that requires TDX capable memory. > > > 1. Page is TDX capable memory > > That's fine by me, but that's _very_ different than what was proposed > here. Proposed by me just now or the series? We are trying to find a new solution now.
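A sketch of the three-step flow outlined above, assuming the checked type is a trivial wrapper. The type name, the helper name, and the "TDX capable" check itself are placeholders for whatever is eventually settled on:

/* Step 2: one place in arch/x86 that vets a raw PFN and hands back a typed value. */
typedef struct { unsigned long pfn; } tdx_pfn_t;

static inline tdx_pfn_t tdx_vet_pfn(unsigned long pfn)
{
	/* Placeholder for the real "this is TDX capable memory" check. */
	WARN_ON_ONCE(!pfn_valid(pfn));
	return (tdx_pfn_t){ .pfn = pfn };
}

/* Step 3: SEAMCALL wrappers that need TDX capable memory take tdx_pfn_t, not a bare u64. */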
On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote: > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > > > Dave wants safety for the TDX pages getting handed to the module. > > > > Define "safety". As I stressed earlier, blinding retrieving a > > "struct page" and dereferencing that pointer is the exact opposite of > > safe. > > I think we had two problems. > > 1. Passing in raw PA's via u64 led to buggy code. IIRC we had a bug > with this that was caught before it went upstream. So a page needs a > real type of some sort. > > 2. Work was done on the tip side to prevent non-TDX capable memory from > entering the page allocator. With that in place, by requiring struct > page, TDX code can know that it is getting the type of memory it worked > hard to guarantee was good. > > You are saying that shifting a PFN to a struct page blindly doesn't > actually guarantee that it meets those requirements. Makes sense. > > For (1) we can just use any old type I think - pfn_t, etc. As we > discussed in the base series. > > For (2) we need to check that the memory came from the page allocator, > or otherwise is valid TDX memory somehow. That is at least the only > check that makes sense to me. > > There was some discussion about refcounts somewhere in this thread. I > don't think it's arch/x86's worry. Then Yan was saying something last > night that I didn't quite follow. We said, let's just resume the > discussion on the list. So she might suggest another check. Hmm, I previously had a concern about passing "struct page *" as the SEAMCALL wrapper parameter. For example, when we do sanity checks for valid TDX memory in tdh_mem_page_aug(), we need to do the sanity check on every page, right? However, with base_page + npages, it's not easy to get the ith page's pointer without first ensuring the pages are contained in a single folio. It would also be superfluous if we first get base_pfn from base_page, and then derive the ith page from base_pfn + i. IIUC, this concern should be gone as Dave has agreed to use "pfn" as the SEAMCALL parameter [1]? Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM for every pfn of a huge mapping? Or should we keep the sanity check inside the SEAMCALL wrappers? BTW, I have another question about the SEAMCALL wrapper implementation, as Kai also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure? By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust the input or trigger WARN_ON() by itself; By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document this requirement for the SEAMCALL wrappers and have the caller invoke the API correctly. So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred approach, right? [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/ [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com
On Mon, Jan 19, 2026, Yan Zhao wrote: > On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote: > > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > IIUC, this concern should be gone as Dave has agreed to use "pfn" as the > SEAMCALL parameter [1]? > Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM > for every pfn of a huge mapping? Or should we keep the sanity check inside the > SEAMCALL wrappers? I don't have a strong preference. But if it goes in KVM, definitely guard it with KVM_MMU_WARN_ON(). > BTW, I have another question about the SEAMCALL wrapper implementation, as Kai > also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available > to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID > or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure? Why not both? But maybe TDX_SW_ERROR instead of TDX_OPERAND_INVALID? If an API has a defined contract and/or set of expectations, and those expectations aren't met by the caller, then a WARN is justified. But the failure still needs to be communicated to the caller. > By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust > the input or trigger WARN_ON() by itself; > By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document > this requirement for the SEAMCALL wrappers and have the caller invoke the API > correctly. Document what exactly? Most of this should be common sense. E.g. we don't generally document that pointers must be non-NULL, because that goes without saying 99.9% of the time. IMO, that holds true here as well. E.g. trying to map memory into a TDX guest that isn't convertible is obviously a bug, I don't see any value in formally documenting that requirement. > So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred > approach, right? > > [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/ > [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com
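A minimal sketch of the "warn and still return an error" pattern discussed above, using the 2MB alignment expectation on demote as the example. The helper name and the bare-pfn signature are assumptions, not the real wrapper:

/* Sketch of the entry check only; the real wrapper obviously takes more arguments. */
u64 tdh_mem_page_demote_check(unsigned long pfn)
{
	/* Caller bug: the huge page being demoted must be 2MB aligned. */
	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PTRS_PER_PTE)))
		return TDX_SW_ERROR;

	return 0;
}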
On Fri, Jan 30, 2026 at 07:32:48AM -0800, Sean Christopherson wrote: > On Mon, Jan 19, 2026, Yan Zhao wrote: > > On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote: > > > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > > IIUC, this concern should be gone as Dave has agreed to use "pfn" as the > > SEAMCALL parameter [1]? > > Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM > > for every pfn of a huge mapping? Or should we keep the sanity check inside the > > SEAMCALL wrappers? > > I don't have a strong preference. But if it goes in KVM, definitely guard it with > KVM_MMU_WARN_ON(). Thank you for your insights, Sean! > > BTW, I have another question about the SEAMCALL wrapper implementation, as Kai > > also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available > > to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID > > or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure? > > Why not both? But maybe TDX_SW_ERROR instead of TDX_OPERAND_INVALID? Hmm, I previously returned TDX_OPERAND_INVALID for non-aligned base PFN. TDX_SW_ERROR is also ok if we want to indicate that passing an invalid PFN is a software error. (I had tdh_mem_page_demote() return TDX_SW_ERROR when an incompatible TDX module is used, i.e., when !tdx_supports_demote_nointerrupt()). > If an API has a defined contract and/or set of expectations, and those expectations > aren't met by the caller, then a WARN is justified. But the failure still needs > to be communicated to the caller. Ok. The reason for 'not both' is that there's already TDX_BUG_ON_2() in KVM after the SEAMCALL wrapper returns a non-BUSY error. I'm not sure if having double WARN_ON_ONCE() calls is good, so I intended to let the caller decide whether to warn. > > By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust > > the input or trigger WARN_ON() by itself; > > By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document > > this requirement for the SEAMCALL wrappers and have the caller invoke the API > > correctly. > > Document what exactly? Most of this should be common sense. E.g. we don't generally > document that pointers must be non-NULL, because that goes without saying 99.9% > of the time. Document the SEAMCALL wrapper's expectations. e.g., for demote, a PFN must be 2MB-aligned, or the caller must not invoke tdh_mem_page_demote() if a TDX module does not support feature ENHANCED_DEMOTE_INTERRUPTIBILITY... > IMO, that holds true here as well. E.g. trying to map memory into a TDX guest > that isn't convertible is obviously a bug, I don't see any value in formally > documenting that requirement. Do we need a comment for documentation above the tdh_mem_page_demote() API? > > So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred > > approach, right? > > > > > [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/ > > [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com >
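For the documentation question above, something as short as the following would probably suffice; the wording is illustrative only and just restates the two expectations raised in this thread:

/*
 * tdh_mem_page_demote() - Demote one huge S-EPT mapping into 512 smaller ones.
 *
 * The caller must pass a 2MB (or 1GB) aligned PFN, and must not call this
 * wrapper at all if the TDX module lacks ENHANCED_DEMOTE_INTERRUPTIBILITY,
 * i.e. if !tdx_supports_demote_nointerrupt().  Violations WARN and return
 * TDX_SW_ERROR.
 */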
Sean Christopherson <seanjc@google.com> writes: >> >> [...snip...] >> >> > +100 to being careful, but at the same time I don't think we should get _too_ >> > fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there >> > might not be a folio, if guest_memfd stopped using folios, then the entire >> > discussion becomes moot. +1, IMO the usage of folios on the guest_memfd <-> KVM boundary (kvm_gmem_get_pfn()) is transitional, hopefully we get to a point where guest_memfd will pass KVM pfn, order and no folios. >> > And as above, the long-standing rule isn't about the implementation details so >> > much as it is about KVM's behavior. If the simplest solution to support huge >> > guest_memfd pages is to decouple the max order from the folio, then so be it. >> > >> > That said, I'd very much like to get a sense of the alternatives, because at the >> > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, >> > and naively, tying that to the folio seems like an easy solution. The upcoming attributes maple tree allows a lookup from guest_memfd index to contiguous range, so the max mapping size (at least guest_memfd's contribution to max mapping level, to be augmented by contribution from lpage_info etc) would be the contiguous range in the xarray containing the index, clamped to guest_memfd page size bounds (both for huge pages and regular PAGE_SIZE pages). The lookup complexity is mainly the maple tree lookup complexity. This lookup happens on mapping and on trying to recover to the largest mapping level, both of which shouldn't happen super often, so I think this should be pretty good for now. This max mapping size is currently memoized as folio size with all the folio splitting work, but memoizing into a folio is expensive (struct pages/folios are big). Hopefully guest_memfd gets to a point where it also supports non-struct page backed memory, and that would save us a bunch more memory. >> >> [...snip...] >> >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from >> both primary and secondary MMUs, and check folio refcount before performing >> folio splitting? > > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, > if something is going through the effort of splitting a folio, then odds are very, > very good that the new folios can't be safely mapped as a contiguous hugepage. > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core > MM to implement, and for drivers/users to understand. > > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to > communicate the maximum mapping size; folios are the "currency" for doing so. > > And then for edge cases that want to map a split folio as a hugepage (if any such > edge cases exist), thus take on the responsibility of managing the lifecycle of > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. > Here's my understanding, hope it helps: there might also be a practical/simpler reason for first unmapping then check refcounts, and then splitting folios, and guest_memfd kind of does the same thing. Folio splitting races with lots of other things in the kernel, and the folio lock isn't super useful because the lock itself is going to be split up. Folio splitting wants all users to stop using this folio, so one big source of users is mappings. Hence, get those mappers (both primary and secondary MMUs) to unmap. 
Core-mm-managed mappings take a refcount, so those refcounts go away. Of the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does unmap as requested, so that still falls in line with "stop using this folio". I think the refcounting check isn't actually necessary if all users of folios STOP using the folio on request (via mmu notifiers or otherwise). Unfortunately, there are other users other than mappers. The best way to find these users is to check the refcount. The refcount check is asking "how many other users are left?" and if the number of users is as expected (just the filemap, or whatever else is expected), then splitting can go ahead, since the splitting code is now confident the remaining users won't try and use the folio metadata while splitting is happening. guest_memfd does a modified version of that on shared to private conversions. guest_memfd will unmap from host userspace page tables for the same reason, mainly to tell all the host users to unmap. The unmapping also triggers mmu notifiers so the stage 2 mappings also go away (TBD if this should be skipped) and this is okay because they're shared pages. guest usage will just map them back in on any failure and it doesn't break guests. At this point all the mappers are gone, then guest_memfd checks refcounts to make sure that guest_memfd itself is the only user of the folio. If the refcount is as expected, guest_memfd is confident to continue with splitting folios, since other folio accesses will be locked out by the filemap invalidate lock. The one main guest_memfd folio user that won't go away on an unmap call is if the folios get pinned for IOMMU access. In this case, guest_memfd fails the conversion and returns an error to userspace so userspace can sort out the IOMMU unpinning. As for private to shared conversions, folio merging would require the same thing that nobody else is using the folios (the folio metadata). guest_memfd skips that check because for private memory, KVM is the only other user, and guest_memfd knows KVM doesn't use folio metadata once the memory is mapped for the guest. >> >> [...snip...] >>
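Going back to the attributes maple tree lookup described at the top of this message: a very rough sketch of that lookup, assuming a tree keyed by guest_memfd index whose entries cover runs of identical attributes. The helper name, tree layout, and locking are assumptions and may not match the eventual implementation:

#include <linux/maple_tree.h>
#include <linux/align.h>
#include <linux/rcupdate.h>

/*
 * Largest order such that the naturally aligned block around @index sits
 * entirely inside the attribute range containing @index, capped at
 * @max_order (the page-size bound from the backing allocation).
 */
static int gmem_max_mapping_order(struct maple_tree *attrs, unsigned long index,
				  int max_order)
{
	MA_STATE(mas, attrs, index, index);
	unsigned long first, last, bsz;
	int order;

	rcu_read_lock();
	mas_walk(&mas);
	first = mas.index;
	last = mas.last;
	rcu_read_unlock();

	for (order = 0; order < max_order; order++) {
		bsz = 1UL << (order + 1);
		if (ALIGN_DOWN(index, bsz) < first ||
		    ALIGN_DOWN(index, bsz) + bsz - 1 > last)
			break;
	}

	return order;
}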
On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote: > Sean Christopherson <seanjc@google.com> writes: > >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from > >> both primary and secondary MMUs, and check folio refcount before performing > >> folio splitting? > > > > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, > > if something is going through the effort of splitting a folio, then odds are very, > > very good that the new folios can't be safely mapped as a contiguous hugepage. > > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core > > MM to implement, and for drivers/users to understand. > > > > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to > > communicate the maximum mapping size; folios are the "currency" for doing so. > > > > And then for edge cases that want to map a split folio as a hugepage (if any such > > edge cases exist), thus take on the responsibility of managing the lifecycle of > > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. > > > > Here's my understanding, hope it helps: there might also be a > practical/simpler reason for first unmapping then check refcounts, and > then splitting folios, and guest_memfd kind of does the same thing. > > Folio splitting races with lots of other things in the kernel, and the > folio lock isn't super useful because the lock itself is going to be > split up. > > Folio splitting wants all users to stop using this folio, so one big > source of users is mappings. Hence, get those mappers (both primary and > secondary MMUs) to unmap. > > Core-mm-managed mappings take a refcount, so those refcounts go away. Of > the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does > unmap as requested, so that still falls in line with "stop using this > folio". > > I think the refcounting check isn't actually necessary if all users of > folios STOP using the folio on request (via mmu notifiers or > otherwise). Unfortunately, there are other users other than mappers. The > best way to find these users is to check the refcount. The refcount > check is asking "how many other users are left?" and if the number of > users is as expected (just the filemap, or whatever else is expected), > then splitting can go ahead, since the splitting code is now confident > the remaining users won't try and use the folio metadata while splitting > is happening. > > > guest_memfd does a modified version of that on shared to private > conversions. guest_memfd will unmap from host userspace page tables for > the same reason, mainly to tell all the host users to unmap. The > unmapping also triggers mmu notifiers so the stage 2 mappings also go > away (TBD if this should be skipped) and this is okay because they're > shared pages. guest usage will just map them back in on any failure and > it doesn't break guests. > > At this point all the mappers are gone, then guest_memfd checks > refcounts to make sure that guest_memfd itself is the only user of the > folio. If the refcount is as expected, guest_memfd is confident to > continue with splitting folios, since other folio accesses will be > locked out by the filemap invalidate lock. > > The one main guest_memfd folio user that won't go away on an unmap call > is if the folios get pinned for IOMMU access. In this case, guest_memfd > fails the conversion and returns an error to userspace so userspace can > sort out the IOMMU unpinning. 
> > > As for private to shared conversions, folio merging would require the > same thing that nobody else is using the folios (the folio > metadata). guest_memfd skips that check because for private memory, KVM > is the only other user, and guest_memfd knows KVM doesn't use folio > metadata once the memory is mapped for the guest. Ok. That makes sense. Thanks for the explanation. It looks like guest_memfd also rules out concurrent folio metadata access by holding the filemap_invalidate_lock. BTW: Could that potentially cause guest soft lockup due to holding the filemap_invalidate_lock for too long?
Yan Zhao <yan.y.zhao@intel.com> writes: > On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote: >> Sean Christopherson <seanjc@google.com> writes: >> >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from >> >> both primary and secondary MMUs, and check folio refcount before performing >> >> folio splitting? >> > >> > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, >> > if something is going through the effort of splitting a folio, then odds are very, >> > very good that the new folios can't be safely mapped as a contiguous hugepage. >> > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core >> > MM to implement, and for drivers/users to understand. >> > >> > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to >> > communicate the maximum mapping size; folios are the "currency" for doing so. >> > >> > And then for edge cases that want to map a split folio as a hugepage (if any such >> > edge cases exist), thus take on the responsibility of managing the lifecycle of >> > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. >> > >> >> Here's my understanding, hope it helps: there might also be a >> practical/simpler reason for first unmapping then check refcounts, and >> then splitting folios, and guest_memfd kind of does the same thing. >> >> Folio splitting races with lots of other things in the kernel, and the >> folio lock isn't super useful because the lock itself is going to be >> split up. >> >> Folio splitting wants all users to stop using this folio, so one big >> source of users is mappings. Hence, get those mappers (both primary and >> secondary MMUs) to unmap. >> >> Core-mm-managed mappings take a refcount, so those refcounts go away. Of >> the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does >> unmap as requested, so that still falls in line with "stop using this >> folio". >> >> I think the refcounting check isn't actually necessary if all users of >> folios STOP using the folio on request (via mmu notifiers or >> otherwise). Unfortunately, there are other users other than mappers. The >> best way to find these users is to check the refcount. The refcount >> check is asking "how many other users are left?" and if the number of >> users is as expected (just the filemap, or whatever else is expected), >> then splitting can go ahead, since the splitting code is now confident >> the remaining users won't try and use the folio metadata while splitting >> is happening. >> >> >> guest_memfd does a modified version of that on shared to private >> conversions. guest_memfd will unmap from host userspace page tables for >> the same reason, mainly to tell all the host users to unmap. The >> unmapping also triggers mmu notifiers so the stage 2 mappings also go >> away (TBD if this should be skipped) and this is okay because they're >> shared pages. guest usage will just map them back in on any failure and >> it doesn't break guests. >> >> At this point all the mappers are gone, then guest_memfd checks >> refcounts to make sure that guest_memfd itself is the only user of the >> folio. If the refcount is as expected, guest_memfd is confident to >> continue with splitting folios, since other folio accesses will be >> locked out by the filemap invalidate lock. >> >> The one main guest_memfd folio user that won't go away on an unmap call >> is if the folios get pinned for IOMMU access. 
In this case, guest_memfd >> fails the conversion and returns an error to userspace so userspace can >> sort out the IOMMU unpinning. >> >> >> As for private to shared conversions, folio merging would require the >> same thing that nobody else is using the folios (the folio >> metadata). guest_memfd skips that check because for private memory, KVM >> is the only other user, and guest_memfd knows KVM doesn't use folio >> metadata once the memory is mapped for the guest. > Ok. That makes sense. Thanks for the explanation. > It looks like guest_memfd also rules out concurrent folio metadata access by > holding the filemap_invalidate_lock. > > BTW: Could that potentially cause guest soft lockup due to holding the > filemap_invalidate_lock for too long? Yes, potentially. You mean because the vCPUs are all blocked on page faults, right? We can definitely optimize later, perhaps lock by guest_memfd index ranges.
On Tue, 2026-01-06 at 15:43 -0800, Sean Christopherson wrote:
> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage. As to whether we
> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> multiple folios in one page, probably comes down to which option provides "good
> enough" performance without incurring too much complexity.

Can we add "whether we can push it off to the future" to the considerations
list? The in-flight gmem stuff is pretty complex and this doesn't seem to have
an ABI intersection.
On Wed, Jan 07, 2026, Rick P Edgecombe wrote:
> On Tue, 2026-01-06 at 15:43 -0800, Sean Christopherson wrote:
> > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > mapping multiple guest_memfd folios with a single hugepage. As to whether we
> > do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> > multiple folios in one page, probably comes down to which option provides "good
> > enough" performance without incurring too much complexity.
>
> Can we add "whether we can push it off to the future" to the considerations
> list? The in-flight gmem stuff is pretty complex and this doesn't seem to have
> an ABI intersection.

Ya, for sure. The only wrinkle I can think of is if the refcounting somehow
bleeds into userspace, but that seems like it'd be a flaw on its own.
On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> >
> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> Vishal Annapurve <vannapurve@google.com> writes:
> > >>
> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >> >>
> > >> >> - EPT mapping size and folio size
> > >> >>
> > >> >> This series is built upon the rule in KVM that the mapping size in the
> > >> >> KVM-managed secondary MMU is no larger than the backend folio size.
> > >> >>
> > >>
> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> this rule imposed?
> > >
> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >
> > >> Is this rule there just because traditionally folio sizes also define the
> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> size in case the block of memory represented by the folio is not contiguous?
> > >
> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still
> > > is) strictly bound by the host mapping size. That's handles contiguous addresses,
> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >
> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> tracking purposese on private to shared conversion), the memory is still
> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> the concerns?
> > >
> > > Not really? Why would the folio be split if the memory _and its attributes_ are
> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping
> > > spanning mixed ranges, i.e. with multiple folios.
> >
> > The folio can be split if any (or all) of the pages in a huge page range
> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > would be split, and the split folios are necessary for tracking users of
> > shared pages using struct page refcounts.
>
> Ahh, that's what the refcounting was referring to. Gotcha.
>
> > However the split folios in that 1G range are still fully contiguous.
> >
> > The process of conversion will split the EPT entries soon after the
> > folios are split so the rule remains upheld.
Overall, I don't think allowing folios smaller than the mappings while
conversion is in progress brings enough benefit.
Cons:
(1) TDX's zapping callback has no idea whether the zapping is caused by an
in-progress private-to-shared conversion or other reasons. It also has no
idea if the attributes of the underlying folios remain unchanged during an
in-progress private-to-shared conversion. Even if the assertion Ackerley
mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
callback for in-progress private-to-shared conversion alone (which would
increase TDX's dependency on guest_memfd's specific implementation even if
it's feasible).
Removing the sanity checks entirely from TDX's zapping callback is confusing
and would set a false expectation in KVM -- what if a huge folio is
incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
others) in other conditions? And then do we still need the check in TDX's
mapping callback? If not, does it mean TDX huge pages can stop relying on
guest_memfd's ability to allocate huge folios, as KVM could still create
huge mappings as long as small folios are physically contiguous with
homogeneous memory attributes?
(2) Allowing folios smaller than the mapping would require splitting S-EPT in
kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
invalidate lock held in __kvm_gmem_set_attributes() could guard against
concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
error-prone. (This may also apply to kvm_gmem_migrate_folio().)
Pro: Not zapping private memory until the conversion succeeds is a real benefit.
However, could we achieve this benefit in other ways? For example, is it
possible to ensure hugetlb_restructuring_split_folio() can't fail, by ensuring
split_entries() can't fail (via pre-allocation?) and by disabling the
hugetlb_vmemmap optimization? (hugetlb_vmemmap conversion is very slow in my
observation, so I always disable it.) Or by pre-allocating for
vmemmap_remap_alloc()?
Dropping TDX's sanity check should only be our last resort. IMHO, zapping
private memory before the conversion succeeds is still better than introducing
a mess between folio size and mapping size.
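For what it's worth, the "pre-allocate so the split can't fail" idea above could take a prepare/commit shape roughly like the following. Every structure and helper here is hypothetical, not an existing HugeTLB or guest_memfd API; the point is only that all allocations happen where failure is still harmless:

/* Purely illustrative; no real HugeTLB/guest_memfd interfaces. */
struct split_charge {
	struct list_head pt_pages;	/* pages for split_entries() */
	struct list_head vmemmap_pages;	/* pages for vmemmap_remap_alloc() */
};

static int gmem_prepare_folio_split(struct split_charge *charge,
				    unsigned int order)
{
	/* May fail; nothing has been modified yet, so the conversion can
	 * simply abort here with -ENOMEM. */
	return gmem_prealloc_split_charge(charge, order);
}

static void gmem_commit_folio_split(struct folio *folio,
				    struct split_charge *charge)
{
	/* Must not fail: only consumes pages dequeued from @charge. */
	gmem_do_split_folio(folio, charge);
}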
> > I guess perhaps the question is, is it okay if the folios are smaller
> > than the mapping while conversion is in progress? Does the order matter
> > (split page table entries first vs split folios first)?
>
> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage. As to whether we
> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> multiple folios in one page, probably comes down to which option provides "good
> enough" performance without incurring too much complexity.
Yan Zhao <yan.y.zhao@intel.com> writes: > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: >> On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > Sean Christopherson <seanjc@google.com> writes: >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > >> Vishal Annapurve <vannapurve@google.com> writes: >> > >> >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: >> > >> >> >> > >> >> - EPT mapping size and folio size >> > >> >> >> > >> >> This series is built upon the rule in KVM that the mapping size in the >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. >> > >> >> >> > >> >> > >> I'm not familiar with this rule and would like to find out more. Why is >> > >> this rule imposed? >> > > >> > > Because it's the only sane way to safely map memory into the guest? :-D >> > > >> > >> Is this rule there just because traditionally folio sizes also define the >> > >> limit of contiguity, and so the mapping size must not be greater than folio >> > >> size in case the block of memory represented by the folio is not contiguous? >> > > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. >> > > >> > >> In guest_memfd's case, even if the folio is split (just for refcount >> > >> tracking purposese on private to shared conversion), the memory is still >> > >> contiguous up to the original folio's size. Will the contiguity address >> > >> the concerns? >> > > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping >> > > spanning mixed ranges, i.e. with multiple folios. >> > >> > The folio can be split if any (or all) of the pages in a huge page range >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio >> > would be split, and the split folios are necessary for tracking users of >> > shared pages using struct page refcounts. >> >> Ahh, that's what the refcounting was referring to. Gotcha. >> >> > However the split folios in that 1G range are still fully contiguous. >> > >> > The process of conversion will split the EPT entries soon after the >> > folios are split so the rule remains upheld. Correction here: If we go with splitting from 1G to 4K uniformly on sharing, only the EPT entries around the shared 4K folio will have their page table entries split, so many of the EPT entries will be at 2M level though the folios are 4K sized. This would be last beyond the conversion process. > Overall, I don't think allowing folios smaller than the mappings while > conversion is in progress brings enough benefit. > I'll look into making the restructuring process always succeed, but off the top of my head that's hard because 1. HugeTLB Vmemmap Optimization code would have to be refactored to use pre-allocated pages, which is refactoring deep in HugeTLB code 2. If we want to split non-uniformly such that only the folios that are shared are 4K, and the remaining folios are as large as possible (PMD sized as much as possible), it gets complex to figure out how many pages to allocate ahead of time. So it's complex and will probably delay HugeTLB+conversion support even more! 
> Cons: > (1) TDX's zapping callback has no idea whether the zapping is caused by an > in-progress private-to-shared conversion or other reasons. It also has no > idea if the attributes of the underlying folios remain unchanged during an > in-progress private-to-shared conversion. Even if the assertion Ackerley > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > callback for in-progress private-to-shared conversion alone (which would > increase TDX's dependency on guest_memfd's specific implementation even if > it's feasible). > > Removing the sanity checks entirely in TDX's zapping callback is confusing > and would show a bad/false expectation from KVM -- what if a huge folio is > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > others) in other conditions? And then do we still need the check in TDX's > mapping callback? If not, does it mean TDX huge pages can stop relying on > guest_memfd's ability to allocate huge folios, as KVM could still create > huge mappings as long as small folios are physically contiguous with > homogeneous memory attributes? > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > invalidate lock held in __kvm_gmem_set_attributes() could guard against > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > I think the central question I have among all the above is what TDX needs to actually care about (putting aside what KVM's folio size/memory contiguity vs mapping level rule for a while). I think TDX code can check what it cares about (if required to aid debugging, as Dave suggested). Does TDX actually care about folio sizes, or does it actually care about memory contiguity and alignment? Separately, KVM could also enforce the folio size/memory contiguity vs mapping level rule, but TDX code shouldn't enforce KVM's rules. So if the check is deemed necessary, it still shouldn't be in TDX code, I think. > Pro: Preventing zapping private memory until conversion is successful is good. > > However, could we achieve this benefit in other ways? For example, is it > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > optimization? (hugetlb_vmemmap conversion is super slow according to my > observation and I always disable it). HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a huge VM, multiplied by a large number of hosts, this is not a trivial amount of memory. It's one of the key reasons why we are using HugeTLB in guest_memfd in the first place, other than to be able to get high level page table mappings. We want this in production. > Or pre-allocation for > vmemmap_remap_alloc()? > Will investigate if this is possible as mentioned above. Thanks for the suggestion again! > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > private memory before conversion succeeds is still better than introducing the > mess between folio size and mapping size. > >> > I guess perhaps the question is, is it okay if the folios are smaller >> > than the mapping while conversion is in progress? Does the order matter >> > (split page table entries first vs split folios first)? 
>> >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is >> conceptually totally fine, i.e. I'm not totally opposed to adding support for >> mapping multiple guest_memfd folios with a single hugepage. As to whether we >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping >> multiple folios in one page, probably comes down to which option provides "good >> enough" performance without incurring too much complexity.
On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > Yan Zhao <yan.y.zhao@intel.com> writes: > > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > Sean Christopherson <seanjc@google.com> writes: > >> > > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > >> > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > >> > >> >> > >> > >> >> - EPT mapping size and folio size > >> > >> >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > >> > >> >> > >> > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is > >> > >> this rule imposed? > >> > > > >> > > Because it's the only sane way to safely map memory into the guest? :-D > >> > > > >> > >> Is this rule there just because traditionally folio sizes also define the > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > >> > >> size in case the block of memory represented by the folio is not contiguous? > >> > > > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > >> > > > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > >> > >> tracking purposese on private to shared conversion), the memory is still > >> > >> contiguous up to the original folio's size. Will the contiguity address > >> > >> the concerns? > >> > > > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > >> > > spanning mixed ranges, i.e. with multiple folios. > >> > > >> > The folio can be split if any (or all) of the pages in a huge page range > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > >> > would be split, and the split folios are necessary for tracking users of > >> > shared pages using struct page refcounts. > >> > >> Ahh, that's what the refcounting was referring to. Gotcha. > >> > >> > However the split folios in that 1G range are still fully contiguous. > >> > > >> > The process of conversion will split the EPT entries soon after the > >> > folios are split so the rule remains upheld. > > Correction here: If we go with splitting from 1G to 4K uniformly on > sharing, only the EPT entries around the shared 4K folio will have their > page table entries split, so many of the EPT entries will be at 2M level > though the folios are 4K sized. This would be last beyond the conversion > process. > > > Overall, I don't think allowing folios smaller than the mappings while > > conversion is in progress brings enough benefit. > > > > I'll look into making the restructuring process always succeed, but off > the top of my head that's hard because > > 1. HugeTLB Vmemmap Optimization code would have to be refactored to > use pre-allocated pages, which is refactoring deep in HugeTLB code > > 2. 
If we want to split non-uniformly such that only the folios that are > shared are 4K, and the remaining folios are as large as possible (PMD > sized as much as possible), it gets complex to figure out how many > pages to allocate ahead of time. > > So it's complex and will probably delay HugeTLB+conversion support even > more! > > > Cons: > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > > in-progress private-to-shared conversion or other reasons. It also has no > > idea if the attributes of the underlying folios remain unchanged during an > > in-progress private-to-shared conversion. Even if the assertion Ackerley > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > > callback for in-progress private-to-shared conversion alone (which would > > increase TDX's dependency on guest_memfd's specific implementation even if > > it's feasible). > > > > Removing the sanity checks entirely in TDX's zapping callback is confusing > > and would show a bad/false expectation from KVM -- what if a huge folio is > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > > others) in other conditions? And then do we still need the check in TDX's > > mapping callback? If not, does it mean TDX huge pages can stop relying on > > guest_memfd's ability to allocate huge folios, as KVM could still create > > huge mappings as long as small folios are physically contiguous with > > homogeneous memory attributes? > > > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > > > > I think the central question I have among all the above is what TDX > needs to actually care about (putting aside what KVM's folio size/memory > contiguity vs mapping level rule for a while). > > I think TDX code can check what it cares about (if required to aid > debugging, as Dave suggested). Does TDX actually care about folio sizes, > or does it actually care about memory contiguity and alignment? TDX cares about memory contiguity. A single folio ensures memory contiguity. Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a contiguous range larger than the page's folio range. Additionally, we don't split private mappings in kvm_gmem_error_folio(). If smaller folios are allowed, splitting private mapping is required there. (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it possible for splitting a huge folio to fail partially, without merging the huge folio back or further zapping?). Not sure if there're other edge cases we're still missing. > Separately, KVM could also enforce the folio size/memory contiguity vs > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > the check is deemed necessary, it still shouldn't be in TDX code, I > think. > > > Pro: Preventing zapping private memory until conversion is successful is good. > > > > However, could we achieve this benefit in other ways? For example, is it > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > > split_entries() can't fail (via pre-allocation?) 
and disabling hugetlb_vmemmap > > optimization? (hugetlb_vmemmap conversion is super slow according to my > > observation and I always disable it). > > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > huge VM, multiplied by a large number of hosts, this is not a trivial > amount of memory. It's one of the key reasons why we are using HugeTLB > in guest_memfd in the first place, other than to be able to get high > level page table mappings. We want this in production. > > > Or pre-allocation for > > vmemmap_remap_alloc()? > > > > Will investigate if this is possible as mentioned above. Thanks for the > suggestion again! > > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > > private memory before conversion succeeds is still better than introducing the > > mess between folio size and mapping size. > > > >> > I guess perhaps the question is, is it okay if the folios are smaller > >> > than the mapping while conversion is in progress? Does the order matter > >> > (split page table entries first vs split folios first)? > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > >> multiple folios in one page, probably comes down to which option provides "good > >> enough" performance without incurring too much complexity. >
On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > > Yan Zhao <yan.y.zhao@intel.com> writes: > > > > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > Sean Christopherson <seanjc@google.com> writes: > > >> > > > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > > >> > >> > > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > >> > >> >> > > >> > >> >> - EPT mapping size and folio size > > >> > >> >> > > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > > >> > >> >> > > >> > >> > > >> > >> I'm not familiar with this rule and would like to find out more. Why is > > >> > >> this rule imposed? > > >> > > > > >> > > Because it's the only sane way to safely map memory into the guest? :-D > > >> > > > > >> > >> Is this rule there just because traditionally folio sizes also define the > > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > > >> > >> size in case the block of memory represented by the folio is not contiguous? > > >> > > > > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > > >> > > > > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > > >> > >> tracking purposese on private to shared conversion), the memory is still > > >> > >> contiguous up to the original folio's size. Will the contiguity address > > >> > >> the concerns? > > >> > > > > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > > >> > > spanning mixed ranges, i.e. with multiple folios. > > >> > > > >> > The folio can be split if any (or all) of the pages in a huge page range > > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > > >> > would be split, and the split folios are necessary for tracking users of > > >> > shared pages using struct page refcounts. > > >> > > >> Ahh, that's what the refcounting was referring to. Gotcha. > > >> > > >> > However the split folios in that 1G range are still fully contiguous. > > >> > > > >> > The process of conversion will split the EPT entries soon after the > > >> > folios are split so the rule remains upheld. > > > > Correction here: If we go with splitting from 1G to 4K uniformly on > > sharing, only the EPT entries around the shared 4K folio will have their > > page table entries split, so many of the EPT entries will be at 2M level > > though the folios are 4K sized. This would be last beyond the conversion > > process. > > > > > Overall, I don't think allowing folios smaller than the mappings while > > > conversion is in progress brings enough benefit. > > > > > > > I'll look into making the restructuring process always succeed, but off > > the top of my head that's hard because > > > > 1. 
HugeTLB Vmemmap Optimization code would have to be refactored to > > use pre-allocated pages, which is refactoring deep in HugeTLB code > > > > 2. If we want to split non-uniformly such that only the folios that are > > shared are 4K, and the remaining folios are as large as possible (PMD > > sized as much as possible), it gets complex to figure out how many > > pages to allocate ahead of time. > > > > So it's complex and will probably delay HugeTLB+conversion support even > > more! > > > > > Cons: > > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > > > in-progress private-to-shared conversion or other reasons. It also has no > > > idea if the attributes of the underlying folios remain unchanged during an > > > in-progress private-to-shared conversion. Even if the assertion Ackerley > > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > > > callback for in-progress private-to-shared conversion alone (which would > > > increase TDX's dependency on guest_memfd's specific implementation even if > > > it's feasible). > > > > > > Removing the sanity checks entirely in TDX's zapping callback is confusing > > > and would show a bad/false expectation from KVM -- what if a huge folio is > > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > > > others) in other conditions? And then do we still need the check in TDX's > > > mapping callback? If not, does it mean TDX huge pages can stop relying on > > > guest_memfd's ability to allocate huge folios, as KVM could still create > > > huge mappings as long as small folios are physically contiguous with > > > homogeneous memory attributes? > > > > > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > > > > > > > I think the central question I have among all the above is what TDX > > needs to actually care about (putting aside what KVM's folio size/memory > > contiguity vs mapping level rule for a while). > > > > I think TDX code can check what it cares about (if required to aid > > debugging, as Dave suggested). Does TDX actually care about folio sizes, > > or does it actually care about memory contiguity and alignment? > TDX cares about memory contiguity. A single folio ensures memory contiguity. In this slightly unusual case, I think the guarantee needed here is that as long as a range is mapped into SEPT entries, guest_memfd ensures that the complete range stays private. i.e. I think it should be safe to rely on guest_memfd here, irrespective of the folio sizes: 1) KVM TDX stack should be able to reclaim the complete range when unmapping. 2) KVM TDX stack can assume that as long as memory is mapped in SEPT entries, guest_memfd will not let host userspace mappings to access guest private memory. > > Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > contiguous range larger than the page's folio range. What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > Additionally, we don't split private mappings in kvm_gmem_error_folio(). 
> If smaller folios are allowed, splitting private mapping is required there. Yes, I believe splitting private mappings will be invoked to ensure that the whole huge folio is not unmapped from KVM due to an error on just a 4K page. Is that a problem? If splitting fails, the implementation can fall back to completely zapping the folio range. > (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > possible for splitting a huge folio to fail partially, without merging the huge > folio back or further zapping?). Yes, splitting can fail partially, but guest_memfd will not make the ranges available to host userspace and derivatives until: 1) The complete range to be converted is split to 4K granularity. 2) The complete range to be converted is zapped from KVM EPT mappings. > Not sure if there're other edge cases we're still missing. > > > Separately, KVM could also enforce the folio size/memory contiguity vs > > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > > the check is deemed necessary, it still shouldn't be in TDX code, I > > think. > > > > > Pro: Preventing zapping private memory until conversion is successful is good. > > > > > > However, could we achieve this benefit in other ways? For example, is it > > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > > > optimization? (hugetlb_vmemmap conversion is super slow according to my > > > observation and I always disable it). > > > > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > > huge VM, multiplied by a large number of hosts, this is not a trivial > > amount of memory. It's one of the key reasons why we are using HugeTLB > > in guest_memfd in the first place, other than to be able to get high > > level page table mappings. We want this in production. > > > > > Or pre-allocation for > > > vmemmap_remap_alloc()? > > > > > > > Will investigate if this is possible as mentioned above. Thanks for the > > suggestion again! > > > > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > > > private memory before conversion succeeds is still better than introducing the > > > mess between folio size and mapping size. > > > > > >> > I guess perhaps the question is, is it okay if the folios are smaller > > >> > than the mapping while conversion is in progress? Does the order matter > > >> > (split page table entries first vs split folios first)? > > >> > > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > > >> multiple folios in one page, probably comes down to which option provides "good > > >> enough" performance without incurring too much complexity. > >
Vishal Annapurve <vannapurve@google.com> writes: > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: >> >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: >> > Yan Zhao <yan.y.zhao@intel.com> writes: >> > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > >> > Sean Christopherson <seanjc@google.com> writes: >> > >> > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes: >> > >> > >> >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: >> > >> > >> >> >> > >> > >> >> - EPT mapping size and folio size >> > >> > >> >> >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. >> > >> > >> >> >> > >> > >> >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is >> > >> > >> this rule imposed? >> > >> > > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D >> > >> > > >> > >> > >> Is this rule there just because traditionally folio sizes also define the >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio >> > >> > >> size in case the block of memory represented by the folio is not contiguous? >> > >> > > >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still >> > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. >> > >> > > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount >> > >> > >> tracking purposese on private to shared conversion), the memory is still >> > >> > >> contiguous up to the original folio's size. Will the contiguity address >> > >> > >> the concerns? >> > >> > > >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping >> > >> > > spanning mixed ranges, i.e. with multiple folios. >> > >> > >> > >> > The folio can be split if any (or all) of the pages in a huge page range >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio >> > >> > would be split, and the split folios are necessary for tracking users of >> > >> > shared pages using struct page refcounts. >> > >> >> > >> Ahh, that's what the refcounting was referring to. Gotcha. >> > >> >> > >> > However the split folios in that 1G range are still fully contiguous. >> > >> > >> > >> > The process of conversion will split the EPT entries soon after the >> > >> > folios are split so the rule remains upheld. >> > >> > Correction here: If we go with splitting from 1G to 4K uniformly on >> > sharing, only the EPT entries around the shared 4K folio will have their >> > page table entries split, so many of the EPT entries will be at 2M level >> > though the folios are 4K sized. This would be last beyond the conversion >> > process. >> > >> > > Overall, I don't think allowing folios smaller than the mappings while >> > > conversion is in progress brings enough benefit. 
>> > > >> > >> > I'll look into making the restructuring process always succeed, but off >> > the top of my head that's hard because >> > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to >> > use pre-allocated pages, which is refactoring deep in HugeTLB code >> > >> > 2. If we want to split non-uniformly such that only the folios that are >> > shared are 4K, and the remaining folios are as large as possible (PMD >> > sized as much as possible), it gets complex to figure out how many >> > pages to allocate ahead of time. >> > >> > So it's complex and will probably delay HugeTLB+conversion support even >> > more! >> > >> > > Cons: >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an >> > > in-progress private-to-shared conversion or other reasons. It also has no >> > > idea if the attributes of the underlying folios remain unchanged during an >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping >> > > callback for in-progress private-to-shared conversion alone (which would >> > > increase TDX's dependency on guest_memfd's specific implementation even if >> > > it's feasible). >> > > >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing >> > > and would show a bad/false expectation from KVM -- what if a huge folio is >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or >> > > others) in other conditions? And then do we still need the check in TDX's >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on >> > > guest_memfd's ability to allocate huge folios, as KVM could still create >> > > huge mappings as long as small folios are physically contiguous with >> > > homogeneous memory attributes? >> > > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). >> > > >> > >> > I think the central question I have among all the above is what TDX >> > needs to actually care about (putting aside what KVM's folio size/memory >> > contiguity vs mapping level rule for a while). >> > >> > I think TDX code can check what it cares about (if required to aid >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, >> > or does it actually care about memory contiguity and alignment? >> TDX cares about memory contiguity. A single folio ensures memory contiguity. > > In this slightly unusual case, I think the guarantee needed here is > that as long as a range is mapped into SEPT entries, guest_memfd > ensures that the complete range stays private. > > i.e. I think it should be safe to rely on guest_memfd here, > irrespective of the folio sizes: > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > entries, guest_memfd will not let host userspace mappings to access > guest private memory. > >> >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a >> contiguous range larger than the page's folio range. 
> > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > >> >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). >> If smaller folios are allowed, splitting private mapping is required there. It was discussed before that for memory failure handling, we will want to split huge pages, we will get to it! The trouble is that guest_memfd took the page from HugeTLB (unlike buddy or HugeTLB which manages memory from the ground up), so we'll still need to figure out it's okay to let HugeTLB deal with it when freeing, and when I last looked, HugeTLB doesn't actually deal with poisoned folios on freeing, so there's more work to do on the HugeTLB side. This is a good point, although IIUC it is a separate issue. The need to split private mappings on memory failure is not for confidentiality in the TDX sense but to ensure that the guest doesn't use the failed memory. In that case, contiguity is broken by the failed memory. The folio is split, the private EPTs are split. The folio size should still not be checked in TDX code. guest_memfd knows contiguity got broken, so guest_memfd calls TDX code to split the EPTs. > > Yes, I believe splitting private mappings will be invoked to ensure > that the whole huge folio is not unmapped from KVM due to an error on > just a 4K page. Is that a problem? > > If splitting fails, the implementation can fall back to completely > zapping the folio range. > >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it >> possible for splitting a huge folio to fail partially, without merging the huge >> folio back or further zapping?). The current stance is to allow splitting failures and not undo that splitting failure, so there's no merge back to fix the splitting failure. (Not set in stone yet, I think merging back could turn out to be a requirement from the mm side, which comes with more complexity in restructuring logic.) If it is not merged back on a split failure, the pages are still contiguous, the pages are guaranteed contiguous while they are owned by guest_memfd (even in the case of memory failure, if I get my way :P) so TDX can still trust that. I think you're worried that on split failure some folios are split, but the private EPTs for those are not split, but the memory for those unsplit private EPTs are still contiguous, and on split failure we quit early so guest_memfd still tracks the ranges as private. Privateness and contiguity are preserved so I think TDX should be good with that? The TD can still run. IIUC it is part of the plan that on splitting failure, conversion ioctl returns failure, guest is informed of conversion failure so that it can do whatever it should do to clean up. > > Yes, splitting can fail partially, but guest_memfd will not make the > ranges available to host userspace and derivatives until: > 1) The complete range to be converted is split to 4K granularity. > 2) The complete range to be converted is zapped from KVM EPT mappings. > >> Not sure if there're other edge cases we're still missing. >> As you said, at the core TDX is concerned about contiguity of the memory ranges (start_addr, length) that it was given. Contiguity is guaranteed by guest_memfd while the folio is in guest_memfd ownership up to the boundaries of the original folio, before any restructuring. So if we're looking for edge cases, I think they would be around truncation. Can't think of anything now. 
(guest_memfd will also ensure truncation of anything less than the original size of the folio before restructuring is blocked, regardless of the current size of the folio) >> > Separately, KVM could also enforce the folio size/memory contiguity vs >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if >> > the check is deemed necessary, it still shouldn't be in TDX code, I >> > think. >> > >> > > Pro: Preventing zapping private memory until conversion is successful is good. >> > > >> > > However, could we achieve this benefit in other ways? For example, is it >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my >> > > observation and I always disable it). >> > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a >> > huge VM, multiplied by a large number of hosts, this is not a trivial >> > amount of memory. It's one of the key reasons why we are using HugeTLB >> > in guest_memfd in the first place, other than to be able to get high >> > level page table mappings. We want this in production. >> > >> > > Or pre-allocation for >> > > vmemmap_remap_alloc()? >> > > >> > >> > Will investigate if this is possible as mentioned above. Thanks for the >> > suggestion again! >> > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping >> > > private memory before conversion succeeds is still better than introducing the >> > > mess between folio size and mapping size. >> > > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller >> > >> > than the mapping while conversion is in progress? Does the order matter >> > >> > (split page table entries first vs split folios first)? >> > >> >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping >> > >> multiple folios in one page, probably comes down to which option provides "good >> > >> enough" performance without incurring too much complexity. >> >
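As a rough sketch of the kind of folio-independent sanity check TDX code could keep (names illustrative): a pfn-plus-level mapping is physically contiguous by construction, so the only thing TDX can verify locally is alignment; the "single owner for the whole range" part is exactly the guarantee that has to come from guest_memfd (or from the folio-size rule):

/* Illustrative only: alignment is what the mapping SEAMCALLs require of the
 * HPA; ownership/contiguity of the backing allocation cannot be checked here
 * and is guest_memfd's guarantee. */
static bool tdx_mapping_args_sane(kvm_pfn_t pfn, int level)
{
	unsigned long nr_pages = KVM_PAGES_PER_HPAGE(level);

	return IS_ALIGNED(pfn, nr_pages);
}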
On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote: > Vishal Annapurve <vannapurve@google.com> writes: > > > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > >> > >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > >> > Yan Zhao <yan.y.zhao@intel.com> writes: > >> > > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > >> > Sean Christopherson <seanjc@google.com> writes: > >> > >> > > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > >> > >> > >> > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > >> > >> > >> >> > >> > >> > >> >> - EPT mapping size and folio size > >> > >> > >> >> > >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > >> > >> > >> >> > >> > >> > >> > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is > >> > >> > >> this rule imposed? > >> > >> > > > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D > >> > >> > > > >> > >> > >> Is this rule there just because traditionally folio sizes also define the > >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > >> > >> > >> size in case the block of memory represented by the folio is not contiguous? > >> > >> > > > >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > >> > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > >> > >> > > > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > >> > >> > >> tracking purposese on private to shared conversion), the memory is still > >> > >> > >> contiguous up to the original folio's size. Will the contiguity address > >> > >> > >> the concerns? > >> > >> > > > >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > >> > >> > > spanning mixed ranges, i.e. with multiple folios. > >> > >> > > >> > >> > The folio can be split if any (or all) of the pages in a huge page range > >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > >> > >> > would be split, and the split folios are necessary for tracking users of > >> > >> > shared pages using struct page refcounts. > >> > >> > >> > >> Ahh, that's what the refcounting was referring to. Gotcha. > >> > >> > >> > >> > However the split folios in that 1G range are still fully contiguous. > >> > >> > > >> > >> > The process of conversion will split the EPT entries soon after the > >> > >> > folios are split so the rule remains upheld. > >> > > >> > Correction here: If we go with splitting from 1G to 4K uniformly on > >> > sharing, only the EPT entries around the shared 4K folio will have their > >> > page table entries split, so many of the EPT entries will be at 2M level > >> > though the folios are 4K sized. This would be last beyond the conversion > >> > process. 
> >> > > >> > > Overall, I don't think allowing folios smaller than the mappings while > >> > > conversion is in progress brings enough benefit. > >> > > > >> > > >> > I'll look into making the restructuring process always succeed, but off > >> > the top of my head that's hard because > >> > > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to > >> > use pre-allocated pages, which is refactoring deep in HugeTLB code > >> > > >> > 2. If we want to split non-uniformly such that only the folios that are > >> > shared are 4K, and the remaining folios are as large as possible (PMD > >> > sized as much as possible), it gets complex to figure out how many > >> > pages to allocate ahead of time. > >> > > >> > So it's complex and will probably delay HugeTLB+conversion support even > >> > more! > >> > > >> > > Cons: > >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > >> > > in-progress private-to-shared conversion or other reasons. It also has no > >> > > idea if the attributes of the underlying folios remain unchanged during an > >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley > >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > >> > > callback for in-progress private-to-shared conversion alone (which would > >> > > increase TDX's dependency on guest_memfd's specific implementation even if > >> > > it's feasible). > >> > > > >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing > >> > > and would show a bad/false expectation from KVM -- what if a huge folio is > >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > >> > > others) in other conditions? And then do we still need the check in TDX's > >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on > >> > > guest_memfd's ability to allocate huge folios, as KVM could still create > >> > > huge mappings as long as small folios are physically contiguous with > >> > > homogeneous memory attributes? > >> > > > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > >> > > > >> > > >> > I think the central question I have among all the above is what TDX > >> > needs to actually care about (putting aside what KVM's folio size/memory > >> > contiguity vs mapping level rule for a while). > >> > > >> > I think TDX code can check what it cares about (if required to aid > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, > >> > or does it actually care about memory contiguity and alignment? > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. > > > > In this slightly unusual case, I think the guarantee needed here is > > that as long as a range is mapped into SEPT entries, guest_memfd > > ensures that the complete range stays private. > > > > i.e. I think it should be safe to rely on guest_memfd here, > > irrespective of the folio sizes: > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. 
> > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > > entries, guest_memfd will not let host userspace mappings to access > > guest private memory. > > > >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > >> contiguous range larger than the page's folio range. > > > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > >> If smaller folios are allowed, splitting private mapping is required there. > > It was discussed before that for memory failure handling, we will want > to split huge pages, we will get to it! The trouble is that guest_memfd > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > from the ground up), so we'll still need to figure out it's okay to let > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > doesn't actually deal with poisoned folios on freeing, so there's more > work to do on the HugeTLB side. > > This is a good point, although IIUC it is a separate issue. The need to > split private mappings on memory failure is not for confidentiality in > the TDX sense but to ensure that the guest doesn't use the failed > memory. In that case, contiguity is broken by the failed memory. The > folio is split, the private EPTs are split. The folio size should still > not be checked in TDX code. guest_memfd knows contiguity got broken, so > guest_memfd calls TDX code to split the EPTs. Hmm, maybe the key is that we need to split S-EPT first before allowing guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the folio splitting. This is better than performing folio splitting while it's mapped as huge in S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in kvm_gmem_error_folio() would still trigger the over-zapping issue. In the primary MMU, it follows the rule of unmapping a folio before splitting, truncating, or migrating a folio. For S-EPT, considering the cost of zapping more ranges than necessary, maybe a trade-off is to always split S-EPT before allowing backend folio splitting. Does this look good to you? So, to convert a 2MB range from private to shared, even though guest_memfd will eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, don't split the backend folio. Even if folio splitting may fail later, it just leaves split S-EPT mappings, which matters little, especially after we support S-EPT promotion later. The benefit is that we don't need to worry even in the case when guest_memfd splits a 1GB folio directly to 4KB granularity, potentially introducing the over-zapping issue later. > > Yes, I believe splitting private mappings will be invoked to ensure > > that the whole huge folio is not unmapped from KVM due to an error on > > just a 4K page. Is that a problem? > > > > If splitting fails, the implementation can fall back to completely > > zapping the folio range. > > > >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > >> possible for splitting a huge folio to fail partially, without merging the huge > >> folio back or further zapping?). > > The current stance is to allow splitting failures and not undo that > splitting failure, so there's no merge back to fix the splitting > failure. 
(Not set in stone yet, I think merging back could turn out to > be a requirement from the mm side, which comes with more complexity in > restructuring logic.) > > If it is not merged back on a split failure, the pages are still > contiguous, the pages are guaranteed contiguous while they are owned by > guest_memfd (even in the case of memory failure, if I get my way :P) so > TDX can still trust that. > > I think you're worried that on split failure some folios are split, but > the private EPTs for those are not split, but the memory for those > unsplit private EPTs are still contiguous, and on split failure we quit > early so guest_memfd still tracks the ranges as private. > > Privateness and contiguity are preserved so I think TDX should be good > with that? The TD can still run. IIUC it is part of the plan that on > splitting failure, conversion ioctl returns failure, guest is informed > of conversion failure so that it can do whatever it should do to clean > up. As above, what about the idea of always requesting KVM to split S-EPT before guest_memfd splits a folio? I think splitting S-EPT first is already required for all cases anyway, except for the private-to-shared conversion of a full 2MB or 1GB range. Requesting S-EPT splitting when it's about to do folio splitting is better than leaving huge mappings with split folios and having to patch things up here and there, just to make the single case of private-to-shared conversion easier. > > Yes, splitting can fail partially, but guest_memfd will not make the > > ranges available to host userspace and derivatives until: > > 1) The complete range to be converted is split to 4K granularity. > > 2) The complete range to be converted is zapped from KVM EPT mappings. > > > >> Not sure if there're other edge cases we're still missing. > >> > > As you said, at the core TDX is concerned about contiguity of the memory > ranges (start_addr, length) that it was given. Contiguity is guaranteed > by guest_memfd while the folio is in guest_memfd ownership up to the > boundaries of the original folio, before any restructuring. So if we're > looking for edge cases, I think they would be around > truncation. Can't think of anything now. Potentially, folio migration, if we support it in the future. > (guest_memfd will also ensure truncation of anything less than the > original size of the folio before restructuring is blocked, regardless > of the current size of the folio) > >> > Separately, KVM could also enforce the folio size/memory contiguity vs > >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > >> > the check is deemed necessary, it still shouldn't be in TDX code, I > >> > think. > >> > > >> > > Pro: Preventing zapping private memory until conversion is successful is good. > >> > > > >> > > However, could we achieve this benefit in other ways? For example, is it > >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my > >> > > observation and I always disable it). > >> > > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > >> > huge VM, multiplied by a large number of hosts, this is not a trivial > >> > amount of memory. It's one of the key reasons why we are using HugeTLB > >> > in guest_memfd in the first place, other than to be able to get high > >> > level page table mappings. 
We want this in production. > >> > > >> > > Or pre-allocation for > >> > > vmemmap_remap_alloc()? > >> > > > >> > > >> > Will investigate if this is possible as mentioned above. Thanks for the > >> > suggestion again! > >> > > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > >> > > private memory before conversion succeeds is still better than introducing the > >> > > mess between folio size and mapping size. > >> > > > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller > >> > >> > than the mapping while conversion is in progress? Does the order matter > >> > >> > (split page table entries first vs split folios first)? > >> > >> > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > >> > >> multiple folios in one page, probably comes down to which option provides "good > >> > >> enough" performance without incurring too much complexity. > >> > >
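To make the ordering proposed above concrete, here is a minimal sketch (not code from this series; kvm_split_private_mappings() and gmem_split_folio() are hypothetical names standing in for whatever KVM/guest_memfd helpers end up being used): split the S-EPT down to the target granularity first, and only restructure the backend folio if that succeeds.

/*
 * Hypothetical sketch of "split S-EPT before splitting the backend folio".
 * Helper names are illustrative only; error handling is elided.
 */
static int gmem_convert_split(struct kvm *kvm, gfn_t gfn, unsigned long npages)
{
	int r;

	/* 1. Split the private S-EPT mappings first. */
	r = kvm_split_private_mappings(kvm, gfn, npages);	/* hypothetical */
	if (r)
		return r;	/* the huge folio is left intact */

	/*
	 * 2. Only then split the guest_memfd folio.  If this later fails, the
	 *    worst case is split S-EPT mappings left behind, which promotion
	 *    can repair once it is supported.
	 */
	return gmem_split_folio(kvm, gfn, npages);		/* hypothetical */
}

If the S-EPT split fails, the conversion ioctl can return the failure to userspace without ever touching the folio, so the huge folio and the huge mapping stay consistent.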
On Mon, Jan 12, 2026 at 09:39:39AM +0800, Yan Zhao wrote: > On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote: > > Vishal Annapurve <vannapurve@google.com> writes: > > > > > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > >> > > >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > > >> > Yan Zhao <yan.y.zhao@intel.com> writes: > > >> > > > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > > >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > >> > Sean Christopherson <seanjc@google.com> writes: > > >> > >> > > > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > > >> > >> > >> > > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > >> > >> > >> >> > > >> > >> > >> >> - EPT mapping size and folio size > > >> > >> > >> >> > > >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > > >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > > >> > >> > >> >> > > >> > >> > >> > > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is > > >> > >> > >> this rule imposed? > > >> > >> > > > > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D > > >> > >> > > > > >> > >> > >> Is this rule there just because traditionally folio sizes also define the > > >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > > >> > >> > >> size in case the block of memory represented by the folio is not contiguous? > > >> > >> > > > > >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > > >> > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > > >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > > >> > >> > > > > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > > >> > >> > >> tracking purposese on private to shared conversion), the memory is still > > >> > >> > >> contiguous up to the original folio's size. Will the contiguity address > > >> > >> > >> the concerns? > > >> > >> > > > > >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > > >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > > >> > >> > > spanning mixed ranges, i.e. with multiple folios. > > >> > >> > > > >> > >> > The folio can be split if any (or all) of the pages in a huge page range > > >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > > >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > > >> > >> > would be split, and the split folios are necessary for tracking users of > > >> > >> > shared pages using struct page refcounts. > > >> > >> > > >> > >> Ahh, that's what the refcounting was referring to. Gotcha. > > >> > >> > > >> > >> > However the split folios in that 1G range are still fully contiguous. > > >> > >> > > > >> > >> > The process of conversion will split the EPT entries soon after the > > >> > >> > folios are split so the rule remains upheld. 
> > >> > > > >> > Correction here: If we go with splitting from 1G to 4K uniformly on > > >> > sharing, only the EPT entries around the shared 4K folio will have their > > >> > page table entries split, so many of the EPT entries will be at 2M level > > >> > though the folios are 4K sized. This would be last beyond the conversion > > >> > process. > > >> > > > >> > > Overall, I don't think allowing folios smaller than the mappings while > > >> > > conversion is in progress brings enough benefit. > > >> > > > > >> > > > >> > I'll look into making the restructuring process always succeed, but off > > >> > the top of my head that's hard because > > >> > > > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to > > >> > use pre-allocated pages, which is refactoring deep in HugeTLB code > > >> > > > >> > 2. If we want to split non-uniformly such that only the folios that are > > >> > shared are 4K, and the remaining folios are as large as possible (PMD > > >> > sized as much as possible), it gets complex to figure out how many > > >> > pages to allocate ahead of time. > > >> > > > >> > So it's complex and will probably delay HugeTLB+conversion support even > > >> > more! > > >> > > > >> > > Cons: > > >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > > >> > > in-progress private-to-shared conversion or other reasons. It also has no > > >> > > idea if the attributes of the underlying folios remain unchanged during an > > >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley > > >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > > >> > > callback for in-progress private-to-shared conversion alone (which would > > >> > > increase TDX's dependency on guest_memfd's specific implementation even if > > >> > > it's feasible). > > >> > > > > >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing > > >> > > and would show a bad/false expectation from KVM -- what if a huge folio is > > >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > > >> > > others) in other conditions? And then do we still need the check in TDX's > > >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on > > >> > > guest_memfd's ability to allocate huge folios, as KVM could still create > > >> > > huge mappings as long as small folios are physically contiguous with > > >> > > homogeneous memory attributes? > > >> > > > > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > > >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > > >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > > >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > > >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > > >> > > > > >> > > > >> > I think the central question I have among all the above is what TDX > > >> > needs to actually care about (putting aside what KVM's folio size/memory > > >> > contiguity vs mapping level rule for a while). > > >> > > > >> > I think TDX code can check what it cares about (if required to aid > > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, > > >> > or does it actually care about memory contiguity and alignment? > > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. 
> > > > > > In this slightly unusual case, I think the guarantee needed here is > > > that as long as a range is mapped into SEPT entries, guest_memfd > > > ensures that the complete range stays private. > > > > > > i.e. I think it should be safe to rely on guest_memfd here, > > > irrespective of the folio sizes: > > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > > > entries, guest_memfd will not let host userspace mappings to access > > > guest private memory. > > > > > >> > > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > > >> contiguous range larger than the page's folio range. > > > > > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > > > > >> > > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > >> If smaller folios are allowed, splitting private mapping is required there. > > > > It was discussed before that for memory failure handling, we will want > > to split huge pages, we will get to it! The trouble is that guest_memfd > > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > > from the ground up), so we'll still need to figure out it's okay to let > > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > > doesn't actually deal with poisoned folios on freeing, so there's more > > work to do on the HugeTLB side. > > > > This is a good point, although IIUC it is a separate issue. The need to > > split private mappings on memory failure is not for confidentiality in > > the TDX sense but to ensure that the guest doesn't use the failed > > memory. In that case, contiguity is broken by the failed memory. The > > folio is split, the private EPTs are split. The folio size should still > > not be checked in TDX code. guest_memfd knows contiguity got broken, so > > guest_memfd calls TDX code to split the EPTs. > > Hmm, maybe the key is that we need to split S-EPT first before allowing > guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > folio splitting. > > This is better than performing folio splitting while it's mapped as huge in > S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > kvm_gmem_error_folio() would still trigger the over-zapping issue. > > In the primary MMU, it follows the rule of unmapping a folio before splitting, > truncating, or migrating a folio. For S-EPT, considering the cost of zapping > more ranges than necessary, maybe a trade-off is to always split S-EPT before > allowing backend folio splitting. > > Does this look good to you? So, the flow of converting 0-4KB from private to shared in a 1GB folio in guest_memfd is: a. If guest_memfd splits 1GB to 2MB first: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. 2. split folio 3. zap the 0-4KB mapping. b. If guest_memfd splits 1GB to 4KB directly: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. 2. split folio 3. zap the 0-4KB mapping. The flow of converting 0-2MB from private to shared in a 1GB folio in guest_memfd is: a. If guest_memfd splits 1GB to 2MB first: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. 2. split folio 3. zap the 0-2MB mapping. b. 
If guest_memfd splits 1GB to 4KB directly: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. 2. split folio 3. zap the 0-2MB mapping. > So, to convert a 2MB range from private to shared, even though guest_memfd will > eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, > don't split the backend folio. > > Even if folio splitting may fail later, it just leaves split S-EPT mappings, > which matters little, especially after we support S-EPT promotion later. > > The benefit is that we don't need to worry even in the case when guest_memfd > splits a 1GB folio directly to 4KB granularity, potentially introducing the > over-zapping issue later. > > > > Yes, I believe splitting private mappings will be invoked to ensure > > > that the whole huge folio is not unmapped from KVM due to an error on > > > just a 4K page. Is that a problem? > > > > > > If splitting fails, the implementation can fall back to completely > > > zapping the folio range. > > > > > >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > > >> possible for splitting a huge folio to fail partially, without merging the huge > > >> folio back or further zapping?). > > > > The current stance is to allow splitting failures and not undo that > > splitting failure, so there's no merge back to fix the splitting > > failure. (Not set in stone yet, I think merging back could turn out to > > be a requirement from the mm side, which comes with more complexity in > > restructuring logic.) > > > > If it is not merged back on a split failure, the pages are still > > contiguous, the pages are guaranteed contiguous while they are owned by > > guest_memfd (even in the case of memory failure, if I get my way :P) so > > TDX can still trust that. > > > > I think you're worried that on split failure some folios are split, but > > the private EPTs for those are not split, but the memory for those > > unsplit private EPTs are still contiguous, and on split failure we quit > > early so guest_memfd still tracks the ranges as private. > > > > Privateness and contiguity are preserved so I think TDX should be good > > with that? The TD can still run. IIUC it is part of the plan that on > > splitting failure, conversion ioctl returns failure, guest is informed > > of conversion failure so that it can do whatever it should do to clean > > up. > As above, what about the idea of always requesting KVM to split S-EPT before > guest_memfd splits a folio? > > I think splitting S-EPT first is already required for all cases anyway, except > for the private-to-shared conversion of a full 2MB or 1GB range. > > Requesting S-EPT splitting when it's about to do folio splitting is better than > leaving huge mappings with split folios and having to patch things up here and > there, just to make the single case of private-to-shared conversion easier. > > > > Yes, splitting can fail partially, but guest_memfd will not make the > > > ranges available to host userspace and derivatives until: > > > 1) The complete range to be converted is split to 4K granularity. > > > 2) The complete range to be converted is zapped from KVM EPT mappings. > > > > > >> Not sure if there're other edge cases we're still missing. > > >> > > > > As you said, at the core TDX is concerned about contiguity of the memory > > ranges (start_addr, length) that it was given. 
Contiguity is guaranteed > > by guest_memfd while the folio is in guest_memfd ownership up to the > > boundaries of the original folio, before any restructuring. So if we're > > looking for edge cases, I think they would be around > > truncation. Can't think of anything now. > Potentially, folio migration, if we support it in the future. > > > (guest_memfd will also ensure truncation of anything less than the > > original size of the folio before restructuring is blocked, regardless > > of the current size of the folio) > > >> > Separately, KVM could also enforce the folio size/memory contiguity vs > > >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > > >> > the check is deemed necessary, it still shouldn't be in TDX code, I > > >> > think. > > >> > > > >> > > Pro: Preventing zapping private memory until conversion is successful is good. > > >> > > > > >> > > However, could we achieve this benefit in other ways? For example, is it > > >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > > >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > > >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my > > >> > > observation and I always disable it). > > >> > > > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > > >> > huge VM, multiplied by a large number of hosts, this is not a trivial > > >> > amount of memory. It's one of the key reasons why we are using HugeTLB > > >> > in guest_memfd in the first place, other than to be able to get high > > >> > level page table mappings. We want this in production. > > >> > > > >> > > Or pre-allocation for > > >> > > vmemmap_remap_alloc()? > > >> > > > > >> > > > >> > Will investigate if this is possible as mentioned above. Thanks for the > > >> > suggestion again! > > >> > > > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > > >> > > private memory before conversion succeeds is still better than introducing the > > >> > > mess between folio size and mapping size. > > >> > > > > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller > > >> > >> > than the mapping while conversion is in progress? Does the order matter > > >> > >> > (split page table entries first vs split folios first)? > > >> > >> > > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > > >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > > >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > > >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > > >> > >> multiple folios in one page, probably comes down to which option provides "good > > >> > >> enough" performance without incurring too much complexity. > > >> > > >
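To make the split granularity in flows (a)/(b) above concrete, here is a small stand-alone model (illustrative only, not kernel code): within a 1GB private region, any 2MB block that overlaps the range being converted to shared gets 4KB S-EPT entries, while the rest can stay at (or be split to) 2MB.

/* Stand-alone model of the granularity in the flows above; illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define SZ_2M	(2UL << 20)
#define SZ_1G	(1UL << 30)

/* A 2MB block needs 4KB S-EPT entries iff it overlaps the converted range. */
static bool block_needs_4k(unsigned long conv_start, unsigned long conv_len,
			   unsigned long block_off)
{
	return block_off < conv_start + conv_len &&
	       conv_start < block_off + SZ_2M;
}

int main(void)
{
	/* As in the first flow: converting bytes 0-4KB of a 1GB folio. */
	unsigned long conv_start = 0, conv_len = 4096, off;
	unsigned long blocks_4k = 0, blocks_2m = 0;

	for (off = 0; off < SZ_1G; off += SZ_2M) {
		if (block_needs_4k(conv_start, conv_len, off))
			blocks_4k++;	/* split S-EPT to 4KB here */
		else
			blocks_2m++;	/* can stay at (or be split to) 2MB */
	}
	printf("2MB blocks needing 4KB S-EPT: %lu, blocks that can stay 2MB: %lu\n",
	       blocks_4k, blocks_2m);
	return 0;
}

For the 0-4KB conversion this reports 1 block at 4KB granularity and 511 blocks that can remain 2MB, matching flow (a); flow (b) differs only in that guest_memfd splits the folio itself all the way down to 4KB.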
Yan Zhao <yan.y.zhao@intel.com> writes: >> > >> > I think the central question I have among all the above is what TDX >> > >> > needs to actually care about (putting aside what KVM's folio size/memory >> > >> > contiguity vs mapping level rule for a while). >> > >> > >> > >> > I think TDX code can check what it cares about (if required to aid >> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, >> > >> > or does it actually care about memory contiguity and alignment? >> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. >> > > >> > > In this slightly unusual case, I think the guarantee needed here is >> > > that as long as a range is mapped into SEPT entries, guest_memfd >> > > ensures that the complete range stays private. >> > > >> > > i.e. I think it should be safe to rely on guest_memfd here, >> > > irrespective of the folio sizes: >> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. >> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT >> > > entries, guest_memfd will not let host userspace mappings to access >> > > guest private memory. >> > > >> > >> >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer >> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a >> > >> contiguous range larger than the page's folio range. >> > > >> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? >> > > Please let us know what you think of this too, why not parametrize using page and nr_pages? >> > >> >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). >> > >> If smaller folios are allowed, splitting private mapping is required there. >> > >> > It was discussed before that for memory failure handling, we will want >> > to split huge pages, we will get to it! The trouble is that guest_memfd >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory >> > from the ground up), so we'll still need to figure out it's okay to let >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB >> > doesn't actually deal with poisoned folios on freeing, so there's more >> > work to do on the HugeTLB side. >> > >> > This is a good point, although IIUC it is a separate issue. The need to >> > split private mappings on memory failure is not for confidentiality in >> > the TDX sense but to ensure that the guest doesn't use the failed >> > memory. In that case, contiguity is broken by the failed memory. The >> > folio is split, the private EPTs are split. The folio size should still >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so >> > guest_memfd calls TDX code to split the EPTs. >> >> Hmm, maybe the key is that we need to split S-EPT first before allowing >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the >> folio splitting. >> >> This is better than performing folio splitting while it's mapped as huge in >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in >> kvm_gmem_error_folio() would still trigger the over-zapping issue. >> Let's put memory failure handling aside for now since for now it zaps the entire huge page, so there's no impact on ordering between S-EPT and folio split. 
>> In the primary MMU, it follows the rule of unmapping a folio before splitting, >> truncating, or migrating a folio. For S-EPT, considering the cost of zapping >> more ranges than necessary, maybe a trade-off is to always split S-EPT before >> allowing backend folio splitting. >> The mapping size <= folio size rule (for KVM and the primary MMU) is there because it is the safe way to map memory into the guest because a folio implies contiguity. Folios are basically a core MM concept so it makes sense that the primary MMU relies on that. IIUC the core of the rule isn't folio sizes, it's memory contiguity. guest_memfd guarantees memory contiguity, and KVM should be able to rely on guest_memfd's guarantee, especially since guest_memfd is virtualiation-first, and KVM first. I think rules from the primary MMU are a good reference, but we shouldn't copy rules from the primary MMU, and KVM can rely on guest_memfd's guarantee of memory contiguity. >> Does this look good to you? > So, the flow of converting 0-4KB from private to shared in a 1GB folio in > guest_memfd is: > > a. If guest_memfd splits 1GB to 2MB first: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > 2. split folio > 3. zap the 0-4KB mapping. > > b. If guest_memfd splits 1GB to 4KB directly: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > 2. split folio > 3. zap the 0-4KB mapping. > > The flow of converting 0-2MB from private to shared in a 1GB folio in > guest_memfd is: > > a. If guest_memfd splits 1GB to 2MB first: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > 2. split folio > 3. zap the 0-2MB mapping. > > b. If guest_memfd splits 1GB to 4KB directly: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > 2. split folio > 3. zap the 0-2MB mapping. > >> So, to convert a 2MB range from private to shared, even though guest_memfd will >> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, >> don't split the backend folio. >> >> Even if folio splitting may fail later, it just leaves split S-EPT mappings, >> which matters little, especially after we support S-EPT promotion later. >> I didn't consider leaving split S-EPT mappings since there is a performance impact. Let me think about this a little. Meanwhile, if the folios are split before the S-EPTs are split, as long as huge folios worth of memory are guaranteed contiguous by guest_memfd for KVM, what are the problems you see?
On Mon, Jan 12, 2026 at 11:56:01AM -0800, Ackerley Tng wrote: > Yan Zhao <yan.y.zhao@intel.com> writes: > > >> > >> > I think the central question I have among all the above is what TDX > >> > >> > needs to actually care about (putting aside what KVM's folio size/memory > >> > >> > contiguity vs mapping level rule for a while). > >> > >> > > >> > >> > I think TDX code can check what it cares about (if required to aid > >> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, > >> > >> > or does it actually care about memory contiguity and alignment? > >> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. > >> > > > >> > > In this slightly unusual case, I think the guarantee needed here is > >> > > that as long as a range is mapped into SEPT entries, guest_memfd > >> > > ensures that the complete range stays private. > >> > > > >> > > i.e. I think it should be safe to rely on guest_memfd here, > >> > > irrespective of the folio sizes: > >> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > >> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > >> > > entries, guest_memfd will not let host userspace mappings to access > >> > > guest private memory. > >> > > > >> > >> > >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > >> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > >> > >> contiguous range larger than the page's folio range. > >> > > > >> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > >> > > > > Please let us know what you think of this too, why not parametrize using > page and nr_pages? With (struct page*, unsigned long nr_pages) pair, IMHO, a warning when the entire range is not fully contained in a folio is still necessary. I expressed the concern here: https://lore.kernel.org/kvm/aWRfVOZpTUdYJ+7C@yzhao56-desk.sh.intel.com/ > >> > >> > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > >> > >> If smaller folios are allowed, splitting private mapping is required there. > >> > > >> > It was discussed before that for memory failure handling, we will want > >> > to split huge pages, we will get to it! The trouble is that guest_memfd > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > >> > from the ground up), so we'll still need to figure out it's okay to let > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > >> > doesn't actually deal with poisoned folios on freeing, so there's more > >> > work to do on the HugeTLB side. > >> > > >> > This is a good point, although IIUC it is a separate issue. The need to > >> > split private mappings on memory failure is not for confidentiality in > >> > the TDX sense but to ensure that the guest doesn't use the failed > >> > memory. In that case, contiguity is broken by the failed memory. The > >> > folio is split, the private EPTs are split. The folio size should still > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so > >> > guest_memfd calls TDX code to split the EPTs. > >> > >> Hmm, maybe the key is that we need to split S-EPT first before allowing > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > >> folio splitting. 
> >> > >> This is better than performing folio splitting while it's mapped as huge in > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > >> kvm_gmem_error_folio() would still trigger the over-zapping issue. > >> > > Let's put memory failure handling aside for now since for now it zaps > the entire huge page, so there's no impact on ordering between S-EPT and > folio split. Relying on guest_memfd's specific implementation is not a good thing. e.g., > > Given there's a version of guest_memfd allocating folios from buddy. > 1. KVM maps a 2MB folio in a 2MB mapping. > 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the > 2MB folio partially split. > 3. Memory failure occurs on one of the split folios. > 4. When splitting S-EPT fails, the over-zapping issue is still there. > > >> In the primary MMU, it follows the rule of unmapping a folio before splitting, > >> truncating, or migrating a folio. For S-EPT, considering the cost of zapping > >> more ranges than necessary, maybe a trade-off is to always split S-EPT before > >> allowing backend folio splitting. > >> > > The mapping size <= folio size rule (for KVM and the primary MMU) is > there because it is the safe way to map memory into the guest because a > folio implies contiguity. Folios are basically a core MM concept so it > makes sense that the primary MMU relies on that. So, why does the primary MMU need to unmap and check the refcount before folio splitting? > IIUC the core of the rule isn't folio sizes, it's memory > contiguity. guest_memfd guarantees memory contiguity, and KVM should be > able to rely on guest_memfd's guarantee, especially since guest_memfd is > virtualiation-first, and KVM first. > > I think rules from the primary MMU are a good reference, but we > shouldn't copy rules from the primary MMU, and KVM can rely on > guest_memfd's guarantee of memory contiguity. > > >> Does this look good to you? > > So, the flow of converting 0-4KB from private to shared in a 1GB folio in > > guest_memfd is: > > > > a. If guest_memfd splits 1GB to 2MB first: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > > 2. split folio > > 3. zap the 0-4KB mapping. > > > > b. If guest_memfd splits 1GB to 4KB directly: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > > 2. split folio > > 3. zap the 0-4KB mapping. > > > > The flow of converting 0-2MB from private to shared in a 1GB folio in > > guest_memfd is: > > > > a. If guest_memfd splits 1GB to 2MB first: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > > 2. split folio > > 3. zap the 0-2MB mapping. > > > > b. If guest_memfd splits 1GB to 4KB directly: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > > 2. split folio > > 3. zap the 0-2MB mapping. > > > >> So, to convert a 2MB range from private to shared, even though guest_memfd will > >> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, > >> don't split the backend folio. > >> > >> Even if folio splitting may fail later, it just leaves split S-EPT mappings, > >> which matters little, especially after we support S-EPT promotion later. > >> > > I didn't consider leaving split S-EPT mappings since there is a > performance impact. Let me think about this a little.
> > Meanwhile, if the folios are split before the S-EPTs are split, as long > as huge folios worth of memory are guaranteed contiguous by guest_memfd > for KVM, what are the problems you see? Hmm. As in the reply at https://lore.kernel.org/kvm/aV4hAfPZXfKKB+7i@yzhao56-desk.sh.intel.com/, there are pros and cons. I'll defer to the maintainers' decision.
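On the (struct page *, unsigned long nr_pages) interface mentioned above, the containment warning Yan is describing could look roughly like the sketch below (illustrative only, not from the series), using the existing page_folio(), folio_page_idx() and folio_nr_pages() helpers from <linux/mm.h>:

/*
 * Sketch of the containment check: with a (page, nr_pages) interface, warn
 * if the range spills past the folio that the first page belongs to.
 */
static bool range_within_folio(struct page *page, unsigned long nr_pages)
{
	struct folio *folio = page_folio(page);

	return folio_page_idx(folio, page) + nr_pages <= folio_nr_pages(folio);
}

A wrapper such as tdh_phymem_page_wbinvd_hkid() could then do WARN_ON_ONCE(!range_within_folio(page, nr_pages)) before issuing the SEAMCALL, which keeps the debugging value of the current per-folio checks without hard-coding a folio-size assumption into the mapping path.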
On Mon, Jan 12, 2026 at 10:13 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > >> > >> > > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > >> > >> If smaller folios are allowed, splitting private mapping is required there. > > >> > > > >> > It was discussed before that for memory failure handling, we will want > > >> > to split huge pages, we will get to it! The trouble is that guest_memfd > > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > > >> > from the ground up), so we'll still need to figure out it's okay to let > > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > > >> > doesn't actually deal with poisoned folios on freeing, so there's more > > >> > work to do on the HugeTLB side. > > >> > > > >> > This is a good point, although IIUC it is a separate issue. The need to > > >> > split private mappings on memory failure is not for confidentiality in > > >> > the TDX sense but to ensure that the guest doesn't use the failed > > >> > memory. In that case, contiguity is broken by the failed memory. The > > >> > folio is split, the private EPTs are split. The folio size should still > > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so > > >> > guest_memfd calls TDX code to split the EPTs. > > >> > > >> Hmm, maybe the key is that we need to split S-EPT first before allowing > > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > > >> folio splitting. > > >> > > >> This is better than performing folio splitting while it's mapped as huge in > > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > > >> kvm_gmem_error_folio() would still trigger the over-zapping issue. > > >> > > > > Let's put memory failure handling aside for now since for now it zaps > > the entire huge page, so there's no impact on ordering between S-EPT and > > folio split. > Relying on guest_memfd's specific implemenation is not a good thing. e.g., > > Given there's a version of guest_memfd allocating folios from buddy. > 1. KVM maps a 2MB folio in a 2MB mappings. > 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the > 2MB folio partially split. > 3. Memory failure occurs on one of the split folio. > 4. When splitting S-EPT fails, the over-zapping issue is still there. > Why is overzapping an issue? Memory failure is supposed to be a rare occurrence and if there is no memory to handle the splitting, I don't see any other choice than overzapping. IIUC splitting the huge page range (in 1G -> 4K scenario) requires even more memory than just splitting cross-boundary leaves and has a higher chance of failing. i.e. Whether the folio is split first or the SEPTs, there is always a chance of failure leading to over-zapping. I don't see value in optimizing rare failures within rarer memory failure handling codepaths which are supposed to make best-effort decisions anyway.
On Tue, Jan 13, 2026 at 08:40:11AM -0800, Vishal Annapurve wrote: > On Mon, Jan 12, 2026 at 10:13 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > >> > >> > > > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > > >> > >> If smaller folios are allowed, splitting private mapping is required there. > > > >> > > > > >> > It was discussed before that for memory failure handling, we will want > > > >> > to split huge pages, we will get to it! The trouble is that guest_memfd > > > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > > > >> > from the ground up), so we'll still need to figure out it's okay to let > > > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > > > >> > doesn't actually deal with poisoned folios on freeing, so there's more > > > >> > work to do on the HugeTLB side. > > > >> > > > > >> > This is a good point, although IIUC it is a separate issue. The need to > > > >> > split private mappings on memory failure is not for confidentiality in > > > >> > the TDX sense but to ensure that the guest doesn't use the failed > > > >> > memory. In that case, contiguity is broken by the failed memory. The > > > >> > folio is split, the private EPTs are split. The folio size should still > > > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so > > > >> > guest_memfd calls TDX code to split the EPTs. > > > >> > > > >> Hmm, maybe the key is that we need to split S-EPT first before allowing > > > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > > > >> folio splitting. > > > >> > > > >> This is better than performing folio splitting while it's mapped as huge in > > > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > > > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > > > >> kvm_gmem_error_folio() would still trigger the over-zapping issue. > > > >> > > > > > > Let's put memory failure handling aside for now since for now it zaps > > > the entire huge page, so there's no impact on ordering between S-EPT and > > > folio split. > > Relying on guest_memfd's specific implemenation is not a good thing. e.g., > > > > Given there's a version of guest_memfd allocating folios from buddy. > > 1. KVM maps a 2MB folio in a 2MB mappings. > > 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the > > 2MB folio partially split. > > 3. Memory failure occurs on one of the split folio. > > 4. When splitting S-EPT fails, the over-zapping issue is still there. > > > > Why is overzapping an issue? > Memory failure is supposed to be a rare occurrence and if there is no > memory to handle the splitting, I don't see any other choice than > overzapping. IIUC splitting the huge page range (in 1G -> 4K scenario) > requires even more memory than just splitting cross-boundary leaves > and has a higher chance of failing. > > i.e. Whether the folio is split first or the SEPTs, there is always a > chance of failure leading to over-zapping. I don't see value in Hmm. If the split occurs after memory failure, yes, splitting S-EPT first also has chance of over-zapping. But if the split occurs during private-to-shared conversion for the non-conversion range, when memory failure later occurs on the split folio, over-zapping can be avoided. > optimizing rare failures within rarer memory failure handling > codepaths which are supposed to make best-effort decisions anyway. 
I agree it's of low priority. Just not sure if there're edge cases besides this one.
On Fri, Jan 9, 2026 at 8:12 AM Vishal Annapurve <vannapurve@google.com> wrote: > > > > > > > > > > > I think the central question I have among all the above is what TDX > > > needs to actually care about (putting aside what KVM's folio size/memory > > > contiguity vs mapping level rule for a while). > > > > > > I think TDX code can check what it cares about (if required to aid > > > debugging, as Dave suggested). Does TDX actually care about folio sizes, > > > or does it actually care about memory contiguity and alignment? > > TDX cares about memory contiguity. A single folio ensures memory contiguity. > > In this slightly unusual case, I think the guarantee needed here is > that as long as a range is mapped into SEPT entries, guest_memfd > ensures that the complete range stays private. > > i.e. I think it should be safe to rely on guest_memfd here, > irrespective of the folio sizes: > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > entries, guest_memfd will not let host userspace mappings to access > guest private memory. > > > > > Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > > reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > > contiguous range larger than the page's folio range. > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > > > > Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > If smaller folios are allowed, splitting private mapping is required there. > > Yes, I believe splitting private mappings will be invoked to ensure > that the whole huge folio is not unmapped from KVM due to an error on > just a 4K page. Is that a problem? > > If splitting fails, the implementation can fall back to completely > zapping the folio range. I forgot to mention that this is a future improvement that will introduce hugetlb memory failure handling and is not covered by Ackerley's current set of patches. > > > (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > > possible for splitting a huge folio to fail partially, without merging the huge > > folio back or further zapping?). > > Yes, splitting can fail partially, but guest_memfd will not make the > ranges available to host userspace and derivatives until: > 1) The complete range to be converted is split to 4K granularity. > 2) The complete range to be converted is zapped from KVM EPT mappings. > > > Not sure if there're other edge cases we're still missing.
On Tue, Jan 06, 2026, Yan Zhao wrote: > This is v3 of the TDX huge page series. The full stack is available at [4]. Nope, that's different code. > [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3 E.g. this branch doesn't have [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote: > On Tue, Jan 06, 2026, Yan Zhao wrote: > > This is v3 of the TDX huge page series. The full stack is available at [4]. > > Nope, that's different code. I double-checked. It's the correct code. See https://github.com/intel-staging/tdx/commits/huge_page_v3. However, I did make a mistake with the section name, which says "=== Beginning of section "TDX huge page v2" ===", but it's actually v3! [facepalm] > > [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3 > > E.g. this branch doesn't have > > [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion It's here: https://github.com/intel-staging/tdx/commit/d0f7465a7f56f10b5bcc9593351c32b37be073a4
On Fri, Jan 16, 2026, Yan Zhao wrote: > On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote: > > On Tue, Jan 06, 2026, Yan Zhao wrote: > > > This is v3 of the TDX huge page series. The full stack is available at [4]. > > > > Nope, that's different code. > I double-checked. It's the correct code. > See https://github.com/intel-staging/tdx/commits/huge_page_v3. Argh, and I even double-checked before complaining, but apparently I screwed up twice. On triple-checking, I do see the same code as the patches. *sigh* Sorry.
On Fri, Jan 16, 2026 at 06:46:09AM -0800, Sean Christopherson wrote: > On Fri, Jan 16, 2026, Yan Zhao wrote: > > On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote: > > > On Tue, Jan 06, 2026, Yan Zhao wrote: > > > > This is v3 of the TDX huge page series. The full stack is available at [4]. > > > > > > Nope, that's different code. > > I double-checked. It's the correct code. > > See https://github.com/intel-staging/tdx/commits/huge_page_v3. > > Argh, and I even double-checked before complaining, but apparently I screwed up > twice. On triple-checking, I do see the same code as the patches. *sigh* > > Sorry. No problem. Thanks for downloading the code and reviewing it!