This is v3 of the TDX huge page series. There aren't any major changes to
the design, just the incorporation of various comments on RFC v2 [0] and
a switch to an external cache for splitting to align with changes in
DPAMT v4 [3]. The full stack is available at [4].
Dave/Kirill/Rick/x86 folks, the patches that will eventually need an ack
are patches 1-5. I would appreciate some review on them, with the
understanding that they may need further refinement before they're ready
for ack.
Sean, I'm feeling pretty good about the design at this point; however,
there are a few remaining design opens (see the next section with
references to specific patches). I'm wondering if we can close these as
part of the
review of this revision. Thanks a lot!
Highlights
----------
- Request review of the tip part (patches 1-5)
Patches 1-5 contain SEAMCALL wrapper updates, which fall under the tip tree.
Besides introducing a SEAMCALL wrapper for demote, the other changes are
mainly to convert mapping/unmapping related SEAMCALL wrappers from using
"struct page *page" to "struct folio *folio" and "unsigned long
start_idx" for huge pages.
- EPT mapping size and folio size
This series is built upon the rule in KVM that the mapping size in the
KVM-managed secondary MMU is no larger than the backend folio size.
Therefore, there are sanity checks in the SEAMCALL wrappers in patches
1-5 to follow this rule, either in tdh_mem_page_aug() for mapping
(patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3),
tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5)
for unmapping.
However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd
[1] splits backend folios ahead of notifying KVM for unmapping. So, this
series also relies on the fixup patch [8] to notify KVM of unmapping
before splitting the backend folio during the memory conversion ioctl.
- Enable TDX huge pages only on new TDX modules (patch 2, patch 24)
This v3 detects whether the TDX module supports feature
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY and disables TDX huge
page support for private memory on TDX modules without this feature.
Additionally, v3 provides a new module parameter "tdx_huge_page" to turn
off TDX huge pages for private memory by executing
"modprobe kvm_intel tdx_huge_page=N".
- Dynamic PAMT v4 integration (patches 17-23)
Currently, DPAMT's involvement with TDX huge pages is limited to page
splitting. v3 introduces KVM x86 ops for the per-VM external cache for
splitting (patches 19, 20).
The per-VM external cache holds pre-allocated pages (patch 19), which are
dequeued during TDX's splitting operations for installing PAMT pages
(patches 21-22).
The general logic of managing the per-VM external cache remains similar
to the per-VM cache in RFC v2; the difference is that the KVM MMU core
now calls into TDX, which implements page pre-allocation/enqueuing, page
count checking, and page freeing. Keeping the cache external makes it
easier to add a lock protecting page enqueuing (in the
topup_external_per_vm_split_cache() op) and page dequeuing (in the
split_external_spte() op) for the per-VM cache. (See more details in the
DPAMT bullet in the "Full Design" section.)
Since the basic DPAMT design (without huge pages) is not yet settled,
feel free to ignore the review request for this part. However, we
would appreciate some level of review given the differences between the
per-VM cache and per-vCPU cache.
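
For reference, here is a minimal sketch of the per-VM external split cache
ops described in the bullet above. The op names follow this cover letter,
but the grouping into one ops struct and the exact signatures are
assumptions, not the series' actual definitions:

  /*
   * Hypothetical sketch only; names follow the cover letter, the
   * signatures and the ops-struct layout are illustrative assumptions.
   */
  #include <linux/types.h>

  struct kvm;
  typedef u64 gfn_t;	/* as defined in include/linux/kvm_types.h */

  struct kvm_x86_split_cache_ops {
  	/* Checked by the MMU core under write mmu_lock before splitting. */
  	bool (*need_topup_external_split_cache)(struct kvm *kvm);
  	/* Pre-allocate and enqueue pages into the per-VM cache. */
  	int  (*topup_external_per_vm_split_cache)(struct kvm *kvm);
  	/* Consume cached pages while splitting one mirror/S-EPT leaf. */
  	int  (*split_external_spte)(struct kvm *kvm, gfn_t gfn, int level,
  				    u64 old_mirror_spte, void *external_spt);
  	/* Drop any unused pre-allocated pages, e.g. on VM teardown. */
  	void (*free_external_split_cache)(struct kvm *kvm);
  };
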
Changes from RFC v2
-------------------
- Dropped 2 patches:
"KVM: TDX: Do not hold page refcount on private guest pages"
(pulled by Sean separately),
"x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC
is set"
(To be consistent with the always-clflush policy.
The DPAMT-related bug in TDH.PHYMEM.PAGE.WBINVD only existed in the
old unreleased TDX modules for DPAMT).
- TDX_INTERRUPTED_RESTARTABLE error handling:
Disable TDX huge pages if the TDX module does not support
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.
- Added a module parameter to turn on/off TDX huge pages.
- Renamed tdx_sept_split_private_spt() to tdx_sept_split_private_spte(),
passing parameter "old_mirror_spte" instead of "pfn_for_gfn". Updated the
function implementation to align with Sean's TDX cleanup series.
- Updated API kvm_split_cross_boundary_leafs() so that it neither flushes
TLBs internally nor returns split/flush status, and added a default
implementation for
non-x86 platforms.
- Renamed tdx_check_accept_level() to tdx_honor_guest_accept_level().
Returned tdx_honor_guest_accept_level() error to userspace instead of
retry in KVM.
- Introduced KVM x86 ops for the per-VM external cache for splitting to
align with DPAMT v4, re-organized the DPAMT related patches, refined
parameter names for better readability, and added missing KVM_BUG_ON()
and warnings.
- Changed nth_page() to folio_page() in SEAMCALL wrappers, corrected a
minor bug in handling reclaimed size mismatch error.
- Removed unnecessary declarations, fixed typos and improved patch logs and
comments.
Patches Layout
--------------
- Patches 01-05: Update SEAMCALL wrappers/helpers for huge pages.
- Patch 06: Disallow page merging for TDX.
- Patches 07-09: Enable KVM MMU core to propagate splitting requests to TDX
and provide the corresponding implementation in TDX.
- Patches 10-11: Introduce a higher level API
kvm_split_cross_boundary_leafs() for splitting
cross-boundary mappings.
- Patches 12-13: Honor guest accept level in EPT violation handler.
- Patches 14-16: Split huge pages on page conversion/punch hole.
- Patches 17-23: Dynamic PAMT related changes for TDX huge pages.
- Patch 24: Turn on 2MB huge pages if tdx_huge_page is true.
(Adds the tdx_huge_page module parameter and turns it
off if the TDX module does not support
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY).
Full Design (optional reading)
------------------------------
1. guest_memfd dependency
According to the rule in KVM that the mapping size in the KVM-managed
secondary MMU should not exceed the backend folio size, the TDX huge
page series relies on guest_memfd's ability to allocate huge folios.
However, this series does not depend on guest_memfd's implementation
details, except for patch 16, which invokes API
kvm_split_cross_boundary_leafs() to request KVM splitting of private
huge mappings that cross the specified range boundary in guest_memfd
punch hole and private-to-shared conversion ioctls.
Therefore, the TDX huge page series can also function with 2MB THP
guest_memfd by updating patch 16 to match the underlying guest_memfd.
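
As a rough caller-side illustration (the real kvm_split_cross_boundary_leafs()
signature and the guest_memfd call sites differ in detail; the wrapper below
is purely hypothetical):

  #include <linux/kvm_host.h>

  /* Hypothetical helper invoked from punch-hole/conversion paths. */
  static int gmem_unmap_private_range(struct kvm *kvm, gfn_t start, gfn_t end)
  {
  	int r;

  	/*
  	 * Split any private huge leaf that straddles 'start' or 'end'
  	 * first, so the zap below only drops mappings inside [start, end).
  	 */
  	r = kvm_split_cross_boundary_leafs(kvm, start, end);
  	if (r)
  		return r;	/* e.g. -ENOMEM or a TDX splitting failure */

  	kvm_zap_gfn_range(kvm, start, end);	/* declared in arch/x86/kvm/mmu.h */
  	return 0;
  }
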
2. Restrictions:
1) Splitting under read mmu_lock is not supported.
With the current TDX module (i.e., without NON-BLOCKING-RESIZE
feature), handling BUSY errors returned from splitting operations
under read mmu_lock requires invoking tdh_mem_range_unblock(), which
may also encounter BUSY errors. To avoid complicating the lock
design, splitting under read mmu_lock for TDX is rejected before the
TDX module supports the NON-BLOCKING-RESIZE feature.
However, when TDs perform ACCEPT operations at 4KB level after KVM
creates huge mappings (e.g., non-Linux TDs' ACCEPT operations that
occur after memory access, or Linux TDs' ACCEPT operations on a
pre-faulted range), splitting operations in the fault path are
required, which usually hold read mmu_lock.
Ignoring splitting operations in the fault path would cause repeated
faults. Forcing KVM's mapping to 4KB before the guest's ACCEPT level
is available would disable huge pages for non-Linux TDs, since the
TD's later ACCEPT operations at higher levels would return
TDX_PAGE_SIZE_MISMATCH error to the guest. (See "ACCEPT level and
lpage_info" bullet).
This series holds write mmu_lock for splitting operations in the fault
path. To maintain performance, the implementation only acquires write
mmu_lock when necessary, keeping the write mmu_lock acquisition count
at an acceptable level [6].
2) No huge page if TDX_INTERRUPTED_RESTARTABLE error is possible
If SEAMCALL TDH_MEM_PAGE_DEMOTE can return the
TDX_INTERRUPTED_RESTARTABLE error (due to interrupts during SEAMCALL
execution), the TDX module provides no guaranteed maximum retry count
to ensure forward progress of page demotion. Interrupt storms could
then result in a DoS if the host simply retries endlessly on
TDX_INTERRUPTED_RESTARTABLE. Disabling interrupts before invoking the
SEAMCALL in the host cannot prevent TDX_INTERRUPTED_RESTARTABLE
because NMIs can also trigger the error. Therefore, given that
SEAMCALL execution time for demotion in basic TDX remains at a
reasonable level, the tradeoff is to have the TDX module not check
interrupts during SEAMCALL TDH_MEM_PAGE_DEMOTE, eliminating the
TDX_INTERRUPTED_RESTARTABLE error.
This series detects whether the TDX module supports feature
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY and disables TDX huge
pages on private memory for TDX modules without this feature, thus
avoiding TDX_INTERRUPTED_RESTARTABLE error for basic TDX (i.e.,
without TD partition).
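
As a rough illustration of this gate (the bit position and helper below are
assumptions based on the text, not the actual TDX module metadata API):

  #include <linux/bits.h>
  #include <linux/printk.h>
  #include <linux/types.h>

  /* Hypothetical bit position; TDX_FEATURES0 comes from TDX global metadata. */
  #define TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY	BIT_ULL(19)

  static bool tdx_supports_huge_page(u64 tdx_features0)
  {
  	if (tdx_features0 & TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY)
  		return true;

  	/* Demote may return TDX_INTERRUPTED_RESTARTABLE: no huge pages. */
  	pr_info("TDX module can interrupt TDH.MEM.PAGE.DEMOTE, disabling TDX huge pages\n");
  	return false;
  }
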
3. Page size is forced to 4KB during TD build time.
Always return 4KB in the KVM x86 hook gmem_max_mapping_level() to force
4KB mapping size during TD build time, because:
- tdh_mem_page_add() only adds private pages at 4KB.
- The amount of initial private memory is limited (typically ~4MB).
4. Page size during TD runtime can be up to 2MB.
Return up to 2MB in the KVM x86 hook gmem_max_mapping_level():
Use module parameter "tdx_huge_page" to control whether to return 2MB,
and turn off tdx_huge_page if the TDX module does not support
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.
When returning 2MB in KVM x86 hook gmem_max_mapping_level(), KVM may
still map a page at 4KB due to:
(1) the backend folio is 4KB,
(2) disallow_lpage restrictions:
- mixed private/shared pages in the 2MB range,
- level misalignment due to slot base_gfn, slot size, and ugfn,
- GUEST_INHIBIT bit set due to guest ACCEPT operation.
(3) page merging is disallowed (e.g., when part of a 2MB range has been
mapped at 4KB level during TD build time).
5. ACCEPT level and lpage_info
KVM needs to honor TDX guest's ACCEPT level by ensuring a fault's
mapping level is no higher than the guest's ACCEPT level, which is due
to the current TDX module implementation:
- If host mapping level > guest's ACCEPT level, repeated faults carrying
the guest's ACCEPT level info are generated. No error is returned to
the guest's ACCEPT operation.
- If host mapping level < guest's ACCEPT level, the guest's ACCEPT
operation returns TDX_PAGE_SIZE_MISMATCH error.
It's efficient to pass the guest's ACCEPT level info to KVM MMU core
before KVM actually creates the mapping.
Following the suggestions from Sean/Rick, this series introduces a
global approach (vs. fault-by-fault per-vCPU) to pass the ACCEPT level
info: setting the GUEST_INHIBIT bit in a slot's lpage_info(s) above the
specified guest ACCEPT level. The global approach helps simplify vendor
code while maintaining a global view across vCPUs.
Since page merging is currently not supported, and given the limitation
that page splitting under read mmu_lock is not supported, the
GUEST_INHIBIT bit is set in a one-way manner under write mmu_lock. Once
set, it will not be unset, and the GFN will not be allowed to be mapped
at higher levels, even if the mappings are zapped and re-accepted by the
guest at higher levels later. This approach simplifies the code and
prevents potential subtle bugs from different ACCEPT levels specified by
different vCPUs. Tests showed the one-way manner has minimal impact on
the huge page map count with a typical Linux TD [5].
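
For illustration, the one-way GUEST_INHIBIT setting could look roughly like
the existing disallow_lpage handling in the x86 MMU; the flag value and the
helper below are assumptions, not the series' hugepage_set_guest_inhibit():

  /* Hypothetical flag; the actual bit assignment in the series may differ. */
  #define KVM_LPAGE_GUEST_INHIBIT_FLAG	BIT(30)

  /*
   * One-way operation: once the guest has ACCEPTed at @accept_level, forbid
   * larger mappings for this gfn.  Caller holds write mmu_lock.
   */
  static void hugepage_set_guest_inhibit_above(struct kvm_memory_slot *slot,
  					     gfn_t gfn, int accept_level)
  {
  	int level;

  	for (level = accept_level + 1; level <= KVM_MAX_HUGEPAGE_LEVEL; level++)
  		lpage_info_slot(gfn, slot, level)->disallow_lpage |=
  			KVM_LPAGE_GUEST_INHIBIT_FLAG;
  }
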
Typical scenarios of honoring 4KB ACCEPT level:
a. If the guest accesses private memory without first accepting it,
1) Guest accesses private memory.
2) KVM maps the private memory at 2MB if no huge page restrictions
exist.
3) Guest accepts the private memory at 4KB.
4) KVM receives an EPT violation VMExit, whose type indicates it's
caused by the guest's ACCEPT operation at 4KB level.
5) KVM honors the guest's ACCEPT level by setting the GUEST_INHIBIT
bit in lpage_info(s) above 4KB level and splitting the created 2MB
mapping.
6) Guest accepts the private memory at 4KB level successfully and
accesses the private memory.
b. If the guest first accepts private memory before accessing,
1) Guest accepts private memory at 4KB level.
2) KVM receives an EPT violation VMExit, whose type indicates it's
caused by guest's ACCEPT operation at 4KB level.
3) KVM honors the guest's ACCEPT level by setting the GUEST_INHIBIT
bit in lpage_info(s) above 4KB level, thus creating the mapping at
4KB level.
4) Guest accepts the private memory at 4KB level successfully and
accesses the private memory.
6. Page splitting (page demotion)
With the current TDX module, splitting huge mappings in S-EPT requires
executing tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
and tdh_mem_page_demote() in sequence. If DPAMT is involved,
tdh_phymem_pamt_add() is also required.
Page splitting can occur due to:
(1) private-to-shared conversion,
(2) hole punching in guest_memfd,
(3) guest's ACCEPT operation at lower level than host mapping level.
All paths trigger splitting by invoking kvm_split_cross_boundary_leafs()
under write mmu_lock.
TDX_OPERAND_BUSY is thus handled similarly to removing a private page,
i.e., by kicking off all vCPUs and retrying, which should succeed on the
second attempt. Other errors from tdh_mem_page_demote() are not
expected, triggering TDX_BUG_ON().
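
A rough sketch of the sequence and the BUSY handling described here follows;
the wrapper signatures and error plumbing are simplified assumptions, and
the series' actual tdx_sept_split_private_spte() differs in detail:

  /* Hypothetical sketch of item 6.  Called under write mmu_lock. */
  static int tdx_split_huge_sept_entry(struct kvm *kvm, gfn_t gfn, int level,
  				     struct page *new_sept_page)
  {
  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
  	u64 entry, level_state;
  	u64 err;

  	err = tdh_mem_range_block(&kvm_tdx->td, gfn_to_gpa(gfn), level,
  				  &entry, &level_state);
  	if (unlikely(tdx_operand_busy(err))) {
  		/* Kick all vCPUs out of the TDX module and retry once. */
  		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
  		err = tdh_mem_range_block(&kvm_tdx->td, gfn_to_gpa(gfn), level,
  					  &entry, &level_state);
  	}
  	if (KVM_BUG_ON(err, kvm))
  		return -EIO;

  	/* Increment the TD's epoch and force vCPUs to re-enter. */
  	tdh_mem_track(&kvm_tdx->td);
  	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

  	/* With DPAMT, tdh_phymem_pamt_add() for the new pages happens here. */
  	err = tdh_mem_page_demote(&kvm_tdx->td, gfn_to_gpa(gfn), level,
  				  new_sept_page, &entry, &level_state);
  	return KVM_BUG_ON(err, kvm) ? -EIO : 0;
  }
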
7. Page merging (page promotion)
Promotion is disallowed, because:
- The current TDX module requires all 4KB leafs to be either all PENDING
or all ACCEPTED before a successful promotion to 2MB. This requirement
prevents successful page merging after partially converting a 2MB
range from private to shared and then back to private, which is the
primary scenario necessitating page promotion.
- tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
TDX module. Consequently, handling BUSY errors is complex, as page
merging typically occurs in the fault path under shared mmu_lock.
- The limited amount of initial private memory (typically ~4MB) means the
need for page merging during TD build time is minimal.
8. Precise zapping of private memory
Since TDX requires guest's ACCEPT operation on host's mapping of private
memory, zapping private memory for guest_memfd punch hole and
private-to-shared conversion must be precise and preceded by splitting
private memory.
Patch 16 serves this purpose and is the only patch in the TDX huge page
series sensitive to guest_memfd's implementation changes.
9. DPAMT
Currently, DPAMT's involvement with TDX huge pages is limited to page
splitting.
As shown in the following call stack, DPAMT pages used by splitting are
pre-allocated and queued in the per-VM external split cache. They are
dequeued and consumed in tdx_sept_split_private_spte().
kvm_split_cross_boundary_leafs
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
tdp_mmu_split_huge_pages_root
(*) 1) tdp_mmu_alloc_sp_for_split()
+-----2.1) need_topup_external_split_cache(): check if enough pages in
| the external split cache. Go to 3 if pages are enough.
| +--2.2) topup_external_split_cache(): preallocate/enqueue pages in
| | the external split cache.
| | 3) tdp_mmu_split_huge_page
| | tdp_mmu_link_sp
| | tdp_mmu_iter_set_spte
| |(**) tdp_mmu_set_spte
| | split_external_spte
| | kvm_x86_call(split_external_spte)
| | tdx_sept_split_private_spte
| | 3.1) BLOCK, TRACK
+--+-------------------3.2) Dequeue PAMT pages from the external split
| | cache for the new sept page
| | 3.3) PAMT_ADD for the new sept page
+--+-------------------3.4) Dequeue PAMT pages from the external split
cache for the 2MB guest private memory.
3.5) DEMOTE.
3.6) Update PAMT refcount of the 2MB guest private
memory.
(*) The write mmu_lock is held across the check for enough pages in the
cache in step 2.1 and the page dequeuing in steps 3.2 and 3.4, so
dequeuing is guaranteed to find enough pages in the cache.
(**) A spinlock prealloc_split_cache_lock is used inside the TDX's cache
implementation to protect page enqueuing in step 2.2 and page
dequeuing in steps 3.2 and 3.4.
10. TDX does not hold page ref count
TDX previously held a folio refcount and didn't release it if it failed
to zap the S-EPT. However, guest_memfd's in-place conversion feature
requires that TDX not hold folio refcounts. Several approaches were
explored. RFC v2 ultimately avoided holding page refcounts in TDX and
generates a KVM_BUG_ON() on S-EPT zapping failure without propagating
the failure back to guest_memfd or notifying guest_memfd through
out-of-band methods, considering the complexity and that S-EPT zapping
failure is currently only possible due to KVM or TDX module bugs.
This approach was acked by Sean and the patch to drop holding TDX page
refcount was pulled separately.
Base
----
This is based on the latest WIP hugetlb-based guest_memfd code [1], with
"Rework preparation/population" series v2 [2] and DPAMT v4 [3] rebased on
it. For the full stack see [4].
Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
(1) Compilation error due to missing symbol export of
hugetlb_restructuring_free_folio().
(2) guest_memfd splits backend folios when the folio is still mapped as
huge in KVM (which breaks KVM's basic assumption that EPT mapping size
should not exceed the backend folio size).
(3) guest_memfd is incapable of merging folios to huge for
shared-to-private conversions.
(4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned,
given that shared pages can only be mapped at 4KB.
So, this series also depends on the four fixup patches included in [4]:
[FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
[FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion
[FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
[FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
(lkp sent me some more gmem compilation errors. I ignored them as I didn't
encounter them with my config and env).
Testing
-------
We currently don't have a QEMU that works with the latest in-place
conversion uABI. We plan to increase testing before asking for merging.
This revision is tested via TDX selftests (enhanced to support the in-place
conversion uABI).
Please check the TDX selftests included in tree [4] (under section "Not for
upstream"[9]) that work with the in-place conversion uABI, specifically the
selftest tdx_vm_huge_page for huge page testing.
These selftests require running under "modprobe kvm vm_memory_attributes=N".
The 2MB mapping count can be checked via "/sys/kernel/debug/kvm/pages_2m".
Note #1: Since TDX does not yet enable in-place copy of init memory
regions, userspace needs to follow the sequence below for init
memory regions to work (also shown in commit "KVM: selftests: TDX:
Switch init mem to gmem with MMAP flag" [10] in tree [4]):
1) Create guest_memfd with MMAP flag for the init memory region.
2) Set the GFNs for the init memory region to shared and copy
initial data to the mmap'ed HVA of the guest_memfd.
3) Create a separate temporary shared memory backend for the init
memory region, and copy initial data from the guest_memfd HVA
to the temporary shared memory backend.
4) Convert the GFNs for the init memory region to private.
5) Invoke ioctl KVM_TDX_INIT_MEM_REGION, passing the HVA of the
temporary shared memory backend as source addr and GPAs of the
init memory region.
6) Free the temporary shared memory backend.
Note #2: This series disables TDX huge pages on TDX modules without feature
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY. For testing TDX
huge pages on those unsupported TDX modules (i.e., before
TDX_1.5.28.00.972), please cherry-pick the workaround patch
"x86/virt/tdx: Loop for TDX_INTERRUPTED_RESTARTABLE in
tdh_mem_page_demote()" [11] contained in [4].
Thanks
Yan
[0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
[1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
[2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
[3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
[4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
[5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
[6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
[7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
[8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
[9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
[10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
[11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
Edgecombe, Rick P (1):
KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror
root
Isaku Yamahata (1):
KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table
splitting
Kiryl Shutsemau (4):
KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB
KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting
x86/tdx: Add/Remove DPAMT pages for guest private memory to demote
x86/tdx: Pass guest memory's PFN info to demote for updating
pamt_refcount
Xiaoyao Li (1):
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Yan Zhao (17):
x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge
pages
x86/tdx: Introduce tdx_quirk_reset_folio() to reset private huge pages
x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
KVM: TDX: Enable huge page splitting under write mmu_lock
KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX
KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
KVM: x86: Introduce hugepage_set_guest_inhibit()
KVM: TDX: Honor the guest's accept level contained in an EPT violation
KVM: Change the return type of gfn_handler_t() from bool to int
KVM: x86: Split cross-boundary mirror leafs for
KVM_SET_MEMORY_ATTRIBUTES
KVM: guest_memfd: Split for punch hole and private-to-shared
conversion
x86/virt/tdx: Add loud warning when tdx_pamt_put() fails.
KVM: x86: Introduce per-VM external cache for splitting
KVM: TDX: Implement per-VM external cache for splitting in TDX
KVM: TDX: Turn on PG_LEVEL_2M
arch/arm64/kvm/mmu.c | 8 +-
arch/loongarch/kvm/mmu.c | 8 +-
arch/mips/kvm/mmu.c | 6 +-
arch/powerpc/kvm/book3s.c | 4 +-
arch/powerpc/kvm/e500_mmu_host.c | 8 +-
arch/riscv/kvm/mmu.c | 12 +-
arch/x86/include/asm/kvm-x86-ops.h | 4 +
arch/x86/include/asm/kvm_host.h | 22 +++
arch/x86/include/asm/tdx.h | 21 +-
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 90 +++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 204 +++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.h | 3 +
arch/x86/kvm/vmx/tdx.c | 299 ++++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.h | 5 +
arch/x86/kvm/vmx/tdx_arch.h | 3 +
arch/x86/virt/vmx/tdx/tdx.c | 164 +++++++++++++---
arch/x86/virt/vmx/tdx/tdx.h | 1 +
include/linux/kvm_host.h | 14 +-
virt/kvm/guest_memfd.c | 67 +++++++
virt/kvm/kvm_main.c | 44 +++--
21 files changed, 851 insertions(+), 139 deletions(-)
--
2.43.2
On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> - EPT mapping size and folio size
>
> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>
> Therefore, there are sanity checks in the SEAMCALL wrappers in patches 1-5 to follow this rule, either in tdh_mem_page_aug() for mapping (patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3), tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5) for unmapping.
>
> However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd [1] splits backend folios ahead of notifying KVM for unmapping. So, this series also relies on the fixup patch [8] to notify KVM of unmapping before splitting the backend folio during the memory conversion ioctl.

I think the major issue here is that if splitting fails there is no way to undo the unmapping [1]. How should KVM/VMM/guest handle the case where a guest requested conversion to shared, the conversion failed and the memory is no longer mapped as private?

[1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/

> Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
>
> (1) Compilation error due to missing symbol export of hugetlb_restructuring_free_folio().
>
> (2) guest_memfd splits backend folios when the folio is still mapped as huge in KVM (which breaks KVM's basic assumption that EPT mapping size should not exceed the backend folio size).
>
> (3) guest_memfd is incapable of merging folios to huge for shared-to-private conversions.
>
> (4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned, given that shared pages can only be mapped at 4KB.
>
> So, this series also depends on the four fixup patches included in [4]:
>
> [FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
> [FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion
> [FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
> [FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
>
> (lkp sent me some more gmem compilation errors. I ignored them as I didn't encounter them with my config and env).
>
> ...
>
> [0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
> [1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
> [2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
> [3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
> [5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
> [6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
> [7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
> [8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
> [9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
> [10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
> [11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
Vishal Annapurve <vannapurve@google.com> writes:

> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>
>> - EPT mapping size and folio size
>>
>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>

I'm not familiar with this rule and would like to find out more. Why is this rule imposed?

Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?

In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?

Specifically for TDX, does the folio size actually matter relative to the mapping, or is it more about contiguity than the folio size?

>> Therefore, there are sanity checks in the SEAMCALL wrappers in patches 1-5 to follow this rule, either in tdh_mem_page_aug() for mapping (patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3), tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5) for unmapping.
>>
>> However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd [1] splits backend folios ahead of notifying KVM for unmapping. So, this series also relies on the fixup patch [8] to notify KVM of unmapping before splitting the backend folio during the memory conversion ioctl.
>
> I think the major issue here is that if splitting fails there is no way to undo the unmapping [1]. How should KVM/VMM/guest handle the case where a guest requested conversion to shared, the conversion failed and the memory is no longer mapped as private?
>
> [1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/

Unmapping was supposed to be the point of no return in the conversion process. (This might have changed since we last discussed this. The link [1] from Vishal is where it was discussed.)

The new/current plan is that in the conversion process we'll do anything that might fail first, and then commit the conversion, beginning with zapping, and so zapping is the point of no return. (I think you also suggested this before, but back then I couldn't see a way to separate out the steps cleanly.)

Here are the conversion steps in what we're trying now (leaving out the TDX EPT splitting first):

1. Allocate enough memory for updating the attributes maple tree.
2a. Only for shared->private conversions: unmap from host page table, check for safe refcounts.
2b. Only for private->shared conversions: split folios (note: split only, no merges). Split can fail since HVO needs to be undone, and that requires allocations.
3. Invalidate begin.
4. Zap from stage 2 page tables: this is the point of no return; before this, we must be sure nothing after will fail.
5. Update the attributes maple tree using allocated memory from step 1.
6. Invalidate end.
7. Only for shared->private conversions: merge folios, making sure that merging does not fail (should not, since there are no allocations, only folio aka metadata updates).

Updating the maple tree before calling the folio merge function allows the merge function to look up the *updated* maple tree.

I'm thinking to insert the call to EPT splitting after invalidate begin (3) since EPT splitting is not undoable. However, that will be after folio splitting, hence my earlier question on whether it's a hard rule based on folio size, or based on memory contiguity. Would that work?

>> Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
>>
>> (1) Compilation error due to missing symbol export of hugetlb_restructuring_free_folio().
>>
>> (2) guest_memfd splits backend folios when the folio is still mapped as huge in KVM (which breaks KVM's basic assumption that EPT mapping size should not exceed the backend folio size).
>>
>> (3) guest_memfd is incapable of merging folios to huge for shared-to-private conversions.
>>
>> (4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned, given that shared pages can only be mapped at 4KB.
>>
>> So, this series also depends on the four fixup patches included in [4]:

Thank you for these fixes!

>> [FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
>> [FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion

Thanks for catching this, Vishal also found this in a very recent internal review. Our fix for this is to first apply the new state before doing the folio merge. See the flow described above.

>> [FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
>> [FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
>>
>> (lkp sent me some more gmem compilation errors. I ignored them as I didn't encounter them with my config and env).
>>
>> ...
>>
>> [0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
>> [1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
>> [2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
>> [3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
>> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
>> [5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
>> [6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
>> [7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
>> [8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
>> [9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
>> [10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
>> [11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
On Tue, Jan 06, 2026, Ackerley Tng wrote:
> Vishal Annapurve <vannapurve@google.com> writes:
>
>> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>
>>> - EPT mapping size and folio size
>>>
>>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>>
>
> I'm not familiar with this rule and would like to find out more. Why is this rule imposed?

Because it's the only sane way to safely map memory into the guest? :-D

> Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?

Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still is) strictly bound by the host mapping size. That handles contiguous addresses, but it _also_ handles contiguous protections (e.g. RWX) and other attributes.

> In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?

Not really? Why would the folio be split if the memory _and its attributes_ are fully contiguous? If the attributes are mixed, KVM must not create a mapping spanning mixed ranges, i.e. with multiple folios.
Sean Christopherson <seanjc@google.com> writes:

> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> Vishal Annapurve <vannapurve@google.com> writes:
>>
>>> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>>
>>>> - EPT mapping size and folio size
>>>>
>>>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>>>
>>
>> I'm not familiar with this rule and would like to find out more. Why is this rule imposed?
>
> Because it's the only sane way to safely map memory into the guest? :-D
>
>> Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?
>
> Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still is) strictly bound by the host mapping size. That handles contiguous addresses, but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>
>> In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?
>
> Not really? Why would the folio be split if the memory _and its attributes_ are fully contiguous? If the attributes are mixed, KVM must not create a mapping spanning mixed ranges, i.e. with multiple folios.

The folio can be split if any (or all) of the pages in a huge page range are shared (in the CoCo sense). So in a 1G block of memory, even if the attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio would be split, and the split folios are necessary for tracking users of shared pages using struct page refcounts.

However the split folios in that 1G range are still fully contiguous.

The process of conversion will split the EPT entries soon after the folios are split so the rule remains upheld.

I guess perhaps the question is, is it okay if the folios are smaller than the mapping while conversion is in progress? Does the order matter (split page table entries first vs split folios first)?
On Tue, Jan 06, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
>> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>>> Vishal Annapurve <vannapurve@google.com> writes:
>>>
>>>> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>>>
>>>>> - EPT mapping size and folio size
>>>>>
>>>>> This series is built upon the rule in KVM that the mapping size in the KVM-managed secondary MMU is no larger than the backend folio size.
>>>>>
>>>
>>> I'm not familiar with this rule and would like to find out more. Why is this rule imposed?
>>
>> Because it's the only sane way to safely map memory into the guest? :-D
>>
>>> Is this rule there just because traditionally folio sizes also define the limit of contiguity, and so the mapping size must not be greater than folio size in case the block of memory represented by the folio is not contiguous?
>>
>> Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still is) strictly bound by the host mapping size. That handles contiguous addresses, but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>>
>>> In guest_memfd's case, even if the folio is split (just for refcount tracking purposes on private to shared conversion), the memory is still contiguous up to the original folio's size. Will the contiguity address the concerns?
>>
>> Not really? Why would the folio be split if the memory _and its attributes_ are fully contiguous? If the attributes are mixed, KVM must not create a mapping spanning mixed ranges, i.e. with multiple folios.
>
> The folio can be split if any (or all) of the pages in a huge page range are shared (in the CoCo sense). So in a 1G block of memory, even if the attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio would be split, and the split folios are necessary for tracking users of shared pages using struct page refcounts.

Ahh, that's what the refcounting was referring to. Gotcha.

> However the split folios in that 1G range are still fully contiguous.
>
> The process of conversion will split the EPT entries soon after the folios are split so the rule remains upheld.
>
> I guess perhaps the question is, is it okay if the folios are smaller than the mapping while conversion is in progress? Does the order matter (split page table entries first vs split folios first)?

Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we do (a) nothing, (b) change the refcounting, or (c) add support for mapping multiple folios in one page, probably comes down to which option provides "good enough" performance without incurring too much complexity.
Sean Christopherson <seanjc@google.com> writes:

> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we

Sean, I'd like to clarify this.

> do (a) nothing,

What does do nothing mean here?

In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.

The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.

> (b) change the refcounting, or

I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.

> (c) add support for mapping multiple folios in one page,

Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.

Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?

> probably comes down to which option provides "good enough" performance without incurring too much complexity.
On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we
>
> Sean, I'd like to clarify this.
>
>> do (a) nothing,
>
> What does do nothing mean here?
>
> In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.
>
> The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.
>
>> (b) change the refcounting, or
>
> I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.
>
>> (c) add support for mapping multiple folios in one page,
>
> Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.
>
> Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?

The rule is not "unenforced/unwritten". In fact, it's the de facto standard in KVM.

For non-gmem cases, KVM uses the mapping size in the primary MMU as the max mapping size in the secondary MMU, while the primary MMU does not create a mapping larger than the backend folio size. When splitting the backend folio, the Linux kernel unmaps the folio from both the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).

On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger than folio sizes, splitting folios while they are still mapped in the IOMMU stage-2 page table is not permitted due to the extra folio refcount held by the IOMMU.

For gmem cases, KVM also does not create mappings larger than the folio size allocated from gmem. This is why the TDX huge page series relies on gmem's ability to allocate huge folios.

We really need to be careful if we hope to break this long-established rule.

>> probably comes down to which option provides "good enough" performance without incurring too much complexity.
On Wed, Jan 14, 2026, Yan Zhao wrote:
> On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we
>>
>> Sean, I'd like to clarify this.
>>
>>> do (a) nothing,
>>
>> What does do nothing mean here?

Don't support hugepages for shared mappings, at least for now (as Rick pointed out, doing nothing now doesn't mean we can't do something in the future).

>> In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.
>>
>> The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.
>>
>>> (b) change the refcounting, or
>>
>> I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.
>>
>>> (c) add support for mapping multiple folios in one page,
>>
>> Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.
>>
>> Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?
>
> The rule is not "unenforced/unwritten". In fact, it's the de facto standard in KVM.

Ya, more or less.

The rules aren't formally documented because the overarching rule is very simple: KVM must not map memory into the guest that the guest shouldn't have access to. That falls firmly into the "well, duh" category, and so it's not written down anywhere :-)

How exactly KVM has honored that rule has varied over the years, and still varies between architectures. In the past KVM x86 special cased HugeTLB and THP, but that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU page tables.

But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).

> For non-gmem cases, KVM uses the mapping size in the primary MMU as the max mapping size in the secondary MMU, while the primary MMU does not create a mapping larger than the backend folio size.

Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a driver _could_ split a folio (no idea why it would) but map the entire thing into userspace, and then userspace could hand off that memory to KVM.

So I would say _KVM's_ rule isn't so much "mapping size <= folio size", it's that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.

But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger than the folio size.

> When splitting the backend folio, the Linux kernel unmaps the folio from both the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
>
> On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger than folio sizes, splitting folios while they are still mapped in the IOMMU stage-2 page table is not permitted due to the extra folio refcount held by the IOMMU.
>
> For gmem cases, KVM also does not create mappings larger than the folio size allocated from gmem. This is why the TDX huge page series relies on gmem's ability to allocate huge folios.
>
> We really need to be careful if we hope to break this long-established rule.

+100 to being careful, but at the same time I don't think we should get _too_ fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there might not be a folio, if guest_memfd stopped using folios, then the entire discussion becomes moot.

And as above, the long-standing rule isn't about the implementation details so much as it is about KVM's behavior. If the simplest solution to support huge guest_memfd pages is to decouple the max order from the folio, then so be it.

That said, I'd very much like to get a sense of the alternatives, because at the end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, and naively, tying that to the folio seems like an easy solution.
On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> On Wed, Jan 14, 2026, Yan Zhao wrote:
>> On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
>>> Sean Christopherson <seanjc@google.com> writes:
>>>
>>>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is conceptually totally fine, i.e. I'm not totally opposed to adding support for mapping multiple guest_memfd folios with a single hugepage. As to whether we
>>>
>>> Sean, I'd like to clarify this.
>>>
>>>> do (a) nothing,
>>>
>>> What does do nothing mean here?
>
> Don't support hugepages for shared mappings, at least for now (as Rick pointed out, doing nothing now doesn't mean we can't do something in the future).
>
>>> In this patch series the TDX functions do sanity checks ensuring that mapping size <= folio size. IIUC the checks at mapping time, like in tdh_mem_page_aug() would be fine since at the time of mapping, the mapping size <= folio size, but we'd be in trouble at the time of zapping, since that's when mapping sizes > folio sizes get discovered.
>>>
>>> The sanity checks are in principle in direct conflict with allowing mapping of multiple guest_memfd folios at hugepage level.
>>>
>>>> (b) change the refcounting, or
>>>
>>> I think this is pretty hard unless something changes in core MM that allows refcounting to be customizable by the FS. guest_memfd would love to have that, but customizable refcounting is going to hurt refcounting performance throughout the kernel.
>>>
>>>> (c) add support for mapping multiple folios in one page,
>>>
>>> Where would the changes need to be made, IIUC there aren't any checks currently elsewhere in KVM to ensure that mapping size <= folio size, other than the sanity checks in the TDX code proposed in this series.
>>>
>>> Does any support need to be added, or is it about amending the unenforced/unwritten rule from "mapping size <= folio size" to "mapping size <= contiguous memory size"?
>>
>> The rule is not "unenforced/unwritten". In fact, it's the de facto standard in KVM.
>
> Ya, more or less.
>
> The rules aren't formally documented because the overarching rule is very simple: KVM must not map memory into the guest that the guest shouldn't have access to. That falls firmly into the "well, duh" category, and so it's not written down anywhere :-)
>
> How exactly KVM has honored that rule has varied over the years, and still varies between architectures. In the past KVM x86 special cased HugeTLB and THP, but that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU page tables.
>
> But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).
>
>> For non-gmem cases, KVM uses the mapping size in the primary MMU as the max mapping size in the secondary MMU, while the primary MMU does not create a mapping larger than the backend folio size.
>
> Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a driver _could_ split a folio (no idea why it would) but map the entire thing into userspace, and then userspace could hand off that memory to KVM.
>
> So I would say _KVM's_ rule isn't so much "mapping size <= folio size", it's that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
>
> But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger than the folio size.

Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the backend. Indeed, a driver can create a huge mapping in primary MMU for the VM_PFNMAP range with multiple discontiguous pages, if it wants.

But this occurs before KVM creates the mapping. Per my understanding, pages under VM_PFNMAP are pinned, so it looks like there're no splits after they are mapped into the primary MMU.

So, out of curiosity, do you know why linux kernel needs to unmap mappings from both primary and secondary MMUs, and check folio refcount before performing folio splitting?

>> When splitting the backend folio, the Linux kernel unmaps the folio from both the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
>>
>> On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger than folio sizes, splitting folios while they are still mapped in the IOMMU stage-2 page table is not permitted due to the extra folio refcount held by the IOMMU.
>>
>> For gmem cases, KVM also does not create mappings larger than the folio size allocated from gmem. This is why the TDX huge page series relies on gmem's ability to allocate huge folios.
>>
>> We really need to be careful if we hope to break this long-established rule.
>
> +100 to being careful, but at the same time I don't think we should get _too_ fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there might not be a folio, if guest_memfd stopped using folios, then the entire discussion becomes moot.
>
> And as above, the long-standing rule isn't about the implementation details so much as it is about KVM's behavior. If the simplest solution to support huge guest_memfd pages is to decouple the max order from the folio, then so be it.
>
> That said, I'd very much like to get a sense of the alternatives, because at the end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, and naively, tying that to the folio seems like an easy solution.

Thanks for the explanation.

Alternatively, how do you feel about the approach of splitting S-EPT first before splitting folios? If guest_memfd always splits 1GB folios to 2MB first and only splits the converted range to 4KB, splitting S-EPT before splitting folios should not introduce too much overhead. Then, we can defer the folio size problem until guest_memfd stops using folios.

If the decision is to stop relying on folios for unmapping now, do you think the following changes are reasonable for the TDX huge page series?

- Add WARN_ON_ONCE() to assert that pages are in a single folio in tdh_mem_page_aug().
- Do not assert that pages are in a single folio in tdh_phymem_page_wbinvd_hkid(). (or just assert pfn_valid() for each page?) Could you please give me guidance on https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
- Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.
On Wed, Jan 14, 2026, Yan Zhao wrote: > On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote: > > On Wed, Jan 14, 2026, Yan Zhao wrote: > > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max > > > mapping size in the secondary MMU, while the primary MMU does not create a > > > mapping larger than the backend folio size. > > > > Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a > > driver _could_ split a folio (no idea why it would) but map the entire thing into > > userspace, and then userspace could have off that memory to KVM. > > > > So I wouldn't say _KVM's_ rule isn't so much "mapping size <= folio size", it's > > that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's > > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping > > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch. > > > > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger > > than the folio size. > > Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the > backend. Indeed, a driver can create a huge mapping in primary MMU for the > VM_PFNMAP range with multiple discontiguous pages, if it wants. > > But this occurs before KVM creates the mapping. Per my understanding, pages > under VM_PFNMAP are pinned, Nope. Only the driver that owns the VMAs knows what sits behind the PFN and the lifecycle rules for that memory. That last point is *very* important. Even if the PFNs shoved into VM_PFNMAP VMAs have an associated "struct page", that doesn't mean the "struct page" is refcounted, i.e. can be pinned. That detail was the heart of "KVM: Stop grabbing references to PFNMAP'd pages" overhaul[*]. To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (priveleged) userspace to "do the right thing", the secondary MMU needs to be tied into mmu_notifiers, so that modifications to the mappings in the primary MMU are reflected into the secondary MMU. [*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com > so it looks like there're no splits after they are mapped into the primary MMU. > > So, out of curiosity, do you know why linux kernel needs to unmap mappings from > both primary and secondary MMUs, and check folio refcount before performing > folio splitting? Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, if something is going through the effort of splitting a folio, then odds are very, very good that the new folios can't be safely mapped as a contiguous hugepage. Limiting mapping sizes to folios makes the rules/behavior straightfoward for core MM to implement, and for drivers/users to understand. Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to communicate the maximum mapping size; folios are the "currency" for doing so. And then for edge cases that want to map a split folio as a hugepage (if any such edge cases exist), thus take on the responsibility of managing the lifecycle of the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. > > > When splitting the backend folio, the Linux kernel unmaps the folio from both > > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier). 
> > > > > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger > > > than folio sizes, splitting folios while they are still mapped in the IOMMU > > > stage-2 page table is not permitted due to the extra folio refcount held by the > > > IOMMU. > > > > > > For gmem cases, KVM also does not create mappings larger than the folio size > > > allocated from gmem. This is why the TDX huge page series relies on gmem's > > > ability to allocate huge folios. > > > > > > We really need to be careful if we hope to break this long-established rule. > > > > +100 to being careful, but at the same time I don't think we should get _too_ > > fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there > > might not be a folio, if guest_memfd stopped using folios, then the entire > > discussion becomes moot. > > > > And as above, the long-standing rule isn't about the implementation details so > > much as it is about KVM's behavior. If the simplest solution to support huge > > guest_memfd pages is to decouple the max order from the folio, then so be it. > > > > That said, I'd very much like to get a sense of the alternatives, because at the > > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, > > and naively, tying that to the folio seems like an easy solution. > Thanks for the explanation. > > Alternatively, how do you feel about the approach of splitting S-EPT first > before splitting folios? > If guest_memfd always splits 1GB folios to 2MB first and only splits the > converted range to 4KB, splitting S-EPT before splitting folios should not > introduce too much overhead. Then, we can defer the folio size problem until > guest_memfd stops using folios. > > If the decision is to stop relying on folios for unmapping now, do you think > the following changes are reasonable for the TDX huge page series? > > - Add WARN_ON_ONCE() to assert that pages are in a single folio in > tdh_mem_page_aug(). > - Do not assert that pages are in a single folio in > tdh_phymem_page_wbinvd_hkid(). (or just assert of pfn_valid() for each page?) > Could you please give me guidance on > https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com. > - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error. Ok, with the disclaimer that I hadn't actually looked at the patches in this series before now... TDX absolutely should not be doing _anything_ with folios. I am *very* strongly opposed to TDX assuming that memory is backed by refcounted "struct page", and thus can use folios to glean the maximum mapping size. guest_memfd is _the_ owner of that information. guest_memfd needs to explicitly _tell_ the rest of KVM what the maximum mapping size is; arch code should not infer that size from a folio. And that code+behavior already exists in the form of kvm_gmem_mapping_order() and its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level. IIUC, the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because tdx_clflush_page() ultimately requires a "struct page". That is absolutely ridiculous and not acceptable. CLFLUSH takes a virtual address, there is *zero* reason tdh_mem_page_aug() needs to require/assume a struct page. Dave may feel differently, but I am not going to budge on this. I am not going to bake in assumptions throughout KVM about memory being backed by page+folio. 
We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to PFNMAP'd pages" series; I am NOT reintroducing such assumptions. NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
On Wed, Jan 14, 2026 at 07:26:44AM -0800, Sean Christopherson wrote: > On Wed, Jan 14, 2026, Yan Zhao wrote: > > On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote: > > > On Wed, Jan 14, 2026, Yan Zhao wrote: > > > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max > > > > mapping size in the secondary MMU, while the primary MMU does not create a > > > > mapping larger than the backend folio size. > > > > > > Super strictly speaking, this might not hold true for VM_PFNMAP memory. E.g. a > > > driver _could_ split a folio (no idea why it would) but map the entire thing into > > > userspace, and then userspace could have off that memory to KVM. > > > > > > So I wouldn't say _KVM's_ rule isn't so much "mapping size <= folio size", it's > > > that "KVM mapping size <= primary MMU mapping size", at least for x86. Arm's > > > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping > > > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch. > > > > > > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger > > > than the folio size. > > > > Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the > > backend. Indeed, a driver can create a huge mapping in primary MMU for the > > VM_PFNMAP range with multiple discontiguous pages, if it wants. > > > > But this occurs before KVM creates the mapping. Per my understanding, pages > > under VM_PFNMAP are pinned, > > Nope. Only the driver that owns the VMAs knows what sits behind the PFN and the > lifecycle rules for that memory. > > That last point is *very* important. Even if the PFNs shoved into VM_PFNMAP VMAs > have an associated "struct page", that doesn't mean the "struct page" is refcounted, > i.e. can be pinned. That detail was the heart of "KVM: Stop grabbing references to > PFNMAP'd pages" overhaul[*]. > > To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (priveleged) > userspace to "do the right thing", the secondary MMU needs to be tied into > mmu_notifiers, so that modifications to the mappings in the primary MMU are > reflected into the secondary MMU. You are right! It maps tail page of a !compound huge page, which is not refcounted. > [*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com > > > so it looks like there're no splits after they are mapped into the primary MMU. > > > > So, out of curiosity, do you know why linux kernel needs to unmap mappings from > > both primary and secondary MMUs, and check folio refcount before performing > > folio splitting? > > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, > if something is going through the effort of splitting a folio, then odds are very, > very good that the new folios can't be safely mapped as a contiguous hugepage. > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core > MM to implement, and for drivers/users to understand. > > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to > communicate the maximum mapping size; folios are the "currency" for doing so. > > And then for edge cases that want to map a split folio as a hugepage (if any such > edge cases exist), thus take on the responsibility of managing the lifecycle of > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. Thanks for the explanation. 
> > > > When splitting the backend folio, the Linux kernel unmaps the folio from both > > > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier). > > > > > > > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger > > > > than folio sizes, splitting folios while they are still mapped in the IOMMU > > > > stage-2 page table is not permitted due to the extra folio refcount held by the > > > > IOMMU. > > > > > > > > For gmem cases, KVM also does not create mappings larger than the folio size > > > > allocated from gmem. This is why the TDX huge page series relies on gmem's > > > > ability to allocate huge folios. > > > > > > > > We really need to be careful if we hope to break this long-established rule. > > > > > > +100 to being careful, but at the same time I don't think we should get _too_ > > > fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there > > > might not be a folio, if guest_memfd stopped using folios, then the entire > > > discussion becomes moot. > > > > > > And as above, the long-standing rule isn't about the implementation details so > > > much as it is about KVM's behavior. If the simplest solution to support huge > > > guest_memfd pages is to decouple the max order from the folio, then so be it. > > > > > > That said, I'd very much like to get a sense of the alternatives, because at the > > > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, > > > and naively, tying that to the folio seems like an easy solution. > > Thanks for the explanation. > > > > Alternatively, how do you feel about the approach of splitting S-EPT first > > before splitting folios? > > If guest_memfd always splits 1GB folios to 2MB first and only splits the > > converted range to 4KB, splitting S-EPT before splitting folios should not > > introduce too much overhead. Then, we can defer the folio size problem until > > guest_memfd stops using folios. > > > > If the decision is to stop relying on folios for unmapping now, do you think > > the following changes are reasonable for the TDX huge page series? > > > > - Add WARN_ON_ONCE() to assert that pages are in a single folio in > > tdh_mem_page_aug(). > > - Do not assert that pages are in a single folio in > > tdh_phymem_page_wbinvd_hkid(). (or just assert of pfn_valid() for each page?) > > Could you please give me guidance on > > https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com. > > - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error. > > Ok, with the disclaimer that I hadn't actually looked at the patches in this > series before now... > > TDX absolutely should not be doing _anything_ with folios. I am *very* strongly > opposed to TDX assuming that memory is backed by refcounted "struct page", and > thus can use folios to glean the maximum mapping size. > > guest_memfd is _the_ owner of that information. guest_memfd needs to explicitly > _tell_ the rest of KVM what the maximum mapping size is; arch code should not > infer that size from a folio. > > And that code+behavior already exists in the form of kvm_gmem_mapping_order() and > its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level. IIUC, > the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because > tdx_clflush_page() ultimately requires a "struct page". That is absolutely > ridiculous and not acceptable. CLFLUSH takes a virtual address, there is *zero* > reason tdh_mem_page_aug() needs to require/assume a struct page. 
Not really. Per my understanding, tdx_mem_page_aug() requires "struct page" (and checks folios for huge pages) because the SEAMCALL wrapper APIs are not currently built into KVM. Since they may have callers other than KVM, some sanity checking in case the caller does something incorrect seems necessary (e.g., in case the caller provides an out-of-range struct page or a page with !pfn_valid() PFN). This is similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in __folio_split(). With tdx_mem_page_aug() ensuring page validity and contiguity, invoking the local static function tdx_clflush_page() page by page looks good to me. Alternatively, we could convert tdx_clflush_page() to tdx_clflush_cache_range(), which receives a VA. However, I'm not sure if my understanding is correct now, especially since it seems like everyone thinks the SEAMCALL wrapper APIs should trust the caller, assuming they are KVM-specific. > Dave may feel differently, but I am not going to budge on this. I am not going > to bake in assumptions throughout KVM about memory being backed by page+folio. > We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to > PFNMAP'd pages" series; I am NOT reintroducing such assumptions. > > NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
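For illustration, a rough sketch of the VA-based flush helper mentioned above, assuming the per-page flavor already forwards to the existing clflush_cache_range() helper. The tdx_clflush_cache_range() name and signature are assumptions:

#include <asm/cacheflush.h>

/*
 * Sketch only: a range-based variant that takes a direct-map VA and length,
 * so callers never need a struct page.
 */
static void tdx_clflush_cache_range(void *va, unsigned int size)
{
	clflush_cache_range(va, size);
}

/* e.g. tdx_clflush_cache_range(__va(PFN_PHYS(pfn)), PAGE_SIZE); */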
On Thu, Jan 15, 2026, Yan Zhao wrote: > On Wed, Jan 14, 2026 at 07:26:44AM -0800, Sean Christopherson wrote: > > Ok, with the disclaimer that I hadn't actually looked at the patches in this > > series before now... > > > > TDX absolutely should not be doing _anything_ with folios. I am *very* strongly > > opposed to TDX assuming that memory is backed by refcounted "struct page", and > > thus can use folios to glean the maximum mapping size. > > > > guest_memfd is _the_ owner of that information. guest_memfd needs to explicitly > > _tell_ the rest of KVM what the maximum mapping size is; arch code should not > > infer that size from a folio. > > > > And that code+behavior already exists in the form of kvm_gmem_mapping_order() and > > its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level. IIUC, > > the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because > > tdx_clflush_page() ultimately requires a "struct page". That is absolutely > > ridiculous and not acceptable. CLFLUSH takes a virtual address, there is *zero* > > reason tdh_mem_page_aug() needs to require/assume a struct page. > Not really. > > Per my understanding, tdx_mem_page_aug() requires "struct page" (and checks > folios for huge pages) because the SEAMCALL wrapper APIs are not currently built > into KVM. Since they may have callers other than KVM, some sanity checking in > case the caller does something incorrect seems necessary (e.g., in case the > caller provides an out-of-range struct page or a page with !pfn_valid() PFN). As I mentioned in my reply to Dave, I don't object to reasonable sanity checks. > This is similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in > __folio_split(). No, it's not. __folio_split() is verifying that the input for the exact one thing it's doing, splitting a huge folio, matches what the function is being asked to do. TDX requiring guest_memfd to back everything with struct page, and to only use single, huge folios to map hugepages into the guest is making completely unnecessary assumptions about guest_memfd and KVM MMU implementation details. > With tdx_mem_page_aug() ensuring page validity and contiguity, It absolutely does not. - If guest_memfd unmaps the direct map[*], CLFLUSH will fault and panic the kernel. - If the PFN isn't backed by struct page, tdx_mem_page_aug() will hit a NULL pointer deref. - If the PFN is backed by struct page, but the page is managed by something other than guest_memfd or core MM, all bets are off. [*] https://lore.kernel.org/all/20260114134510.1835-1-kalyazin@amazon.com > invoking the local static function tdx_clflush_page() page by page looks good to > me. Alternatively, we could convert tdx_clflush_page() to > tdx_clflush_cache_range(), which receives a VA. > > However, I'm not sure if my understanding is correct now, especially since it > seems like everyone thinks the SEAMCALL wrapper APIs should trust the caller, > assuming they are KVM-specific. It's all kernel code. Implying that KVM is somehow untrusted is absurd.
On 1/14/26 07:26, Sean Christopherson wrote: ... > Dave may feel differently, but I am not going to budge on this. I am not going > to bake in assumptions throughout KVM about memory being backed by page+folio. > We _just_ cleaned up that mess in the aformentioned "Stop grabbing references to > PFNMAP'd pages" series, I am NOT reintroducing such assumptions. > > NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn. 'struct page' gives us two things: One is the type safety, but I'm pretty flexible on how that's implemented as long as it's not a raw u64 getting passed around everywhere. The second thing is a (near) guarantee that the backing memory is RAM. Not only RAM, but RAM that the TDX module knows about and has a PAMT and TDMR and all that TDX jazz. We've also done things like stopping memory hotplug because you can't amend TDX page metadata at runtime. So we prevent new 'struct pages' from coming into existence. So 'struct page' is a quite useful choke point for TDX. I'd love to hear more about how guest_memfd is going to tie all the pieces together and give the same straightforward guarantees without leaning on the core mm the same way we do now.
On Wed, Jan 14, 2026, Dave Hansen wrote:
> On 1/14/26 07:26, Sean Christopherson wrote:
> ...
> > Dave may feel differently, but I am not going to budge on this. I am not going
> > to bake in assumptions throughout KVM about memory being backed by page+folio.
> > We _just_ cleaned up that mess in the aformentioned "Stop grabbing references to
> > PFNMAP'd pages" series, I am NOT reintroducing such assumptions.
> >
> > NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
>
> 'struct page' gives us two things: One is the type safety, but I'm
> pretty flexible on how that's implemented as long as it's not a raw u64
> getting passed around everywhere.
I don't necessarily disagree on the type safety front, but for the specific code
in question, any type safety is a facade. Everything leading up to the TDX code
is dealing with raw PFNs and/or PTEs. Then the TDX code assumes that the PFN
being mapped into the guest is backed by a struct page, and that the folio size
is consistent with @level, without _any_ checks whatsoever. This is providing
the exact opposite of safety.
static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
			    enum pg_level level, kvm_pfn_t pfn)
{
	int tdx_level = pg_level_to_tdx_sept_level(level);
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct page *page = pfn_to_page(pfn); <==================
	struct folio *folio = page_folio(page);
	gpa_t gpa = gfn_to_gpa(gfn);
	u64 entry, level_state;
	u64 err;

	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
			       folio_page_idx(folio, page), &entry, &level_state);
	...
}
I've no objection if e.g. tdh_mem_page_aug() wants to sanity check that a PFN
is backed by a struct page with a valid refcount, it's code like that above that
I don't want.
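As a contrast with the snippet above, a hedged sketch of the kind of check that would be reasonable inside the wrapper itself; the helper name is made up, and what exactly counts as a "valid refcount" is left open in this thread:

/*
 * Illustrative only: verify the pfn maps to a refcounted struct page before
 * anything dereferences it, rather than assuming it does.
 */
static bool tdx_pfn_is_refcounted_page(unsigned long pfn)
{
	if (!pfn_valid(pfn))
		return false;

	return folio_ref_count(page_folio(pfn_to_page(pfn))) > 0;
}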
> The second thing is a (near) guarantee that the backing memory is RAM.
> Not only RAM, but RAM that the TDX module knows about and has a PAMT and
> TDMR and all that TDX jazz.
I'm not at all opposed to backing guest_memfd with "struct page", quite the
opposite. What I don't want is to bake assumptions into KVM code that doesn't
_require_ struct page, because that has caused KVM immense pain in the past.
And I'm strongly opposed to KVM special-casing TDX or anything else, precisely
because we struggled through all that pain so that KVM would work better with
memory that isn't backed by "struct page", or more specifically, memory that has
an associated "struct page", but isn't managed by core MM, e.g. isn't refcounted.
> We've also done things like stopping memory hotplug because you can't
> amend TDX page metadata at runtime. So we prevent new 'struct pages'
> from coming into existence. So 'struct page' is a quite useful choke
> point for TDX.
>
> I'd love to hear more about how guest_memfd is going to tie all the
> pieces together and give the same straightforward guarantees without
> leaning on the core mm the same way we do now.
I don't think guest_memfd needs to be different, and that's not what I'm advocating.
What I don't want is to make KVM TDX's handling of memory different from the rest
of KVM and KVM's MMU(s).
On 1/14/26 16:19, Sean Christopherson wrote:
>> 'struct page' gives us two things: One is the type safety, but I'm
>> pretty flexible on how that's implemented as long as it's not a raw u64
>> getting passed around everywhere.
> I don't necessarily disagree on the type safety front, but for the specific code
> in question, any type safety is a facade. Everything leading up to the TDX code
> is dealing with raw PFNs and/or PTEs. Then the TDX code assumes that the PFN
> being mapped into the guest is backed by a struct page, and that the folio size
> is consistent with @level, without _any_ checks whatsoever. This is providing
> the exact opposite of safety.
>
> static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> 			    enum pg_level level, kvm_pfn_t pfn)
> {
> 	int tdx_level = pg_level_to_tdx_sept_level(level);
> 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> 	struct page *page = pfn_to_page(pfn); <==================
I of course agree that this is fundamentally unsafe, it's just not
necessarily bad code.
I hope we both agree that this could be made _more_ safe by, for
instance, making sure the page is in a zone, pfn_valid(), and a few more
things.
In a perfect world, these conversions would happen at a well-defined
layer (KVM=>TDX) and in relatively few places. That layer transition is
where the sanity checks happen. It's super useful to have:
struct page *kvm_pfn_to_tdx_private_page(kvm_pfn_t pfn)
{
	struct page *page = pfn_to_page(pfn);

#ifdef DEBUG
	WARN_ON_ONCE(!pfn_valid(pfn));
	// page must be from a "file"???
	WARN_ON_ONCE(!page_mapping(page));
	WARN_ON_ONCE(...);
#endif
	return page;
}
*EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s
are compiled out, this explicitly lays out the assumptions and it means
someone reading TDX code has an easier time comprehending it.
It's also not a crime to do the *same* checking on kvm_pfn_t and not
have a type transition. I just like the idea of changing the type so
that the transition line is clear and the concept is carried (forced,
even) through the layers of helpers.
On Fri, Jan 16, 2026, Dave Hansen wrote:
> On 1/14/26 16:19, Sean Christopherson wrote:
> >> 'struct page' gives us two things: One is the type safety, but I'm
> >> pretty flexible on how that's implemented as long as it's not a raw u64
> >> getting passed around everywhere.
> > I don't necessarily disagree on the type safety front, but for the specific code
> > in question, any type safety is a facade. Everything leading up to the TDX code
> > is dealing with raw PFNs and/or PTEs. Then the TDX code assumes that the PFN
> > being mapped into the guest is backed by a struct page, and that the folio size
> > is consistent with @level, without _any_ checks whatsoever. This is providing
> > the exact opposite of safety.
> >
> > static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > 			    enum pg_level level, kvm_pfn_t pfn)
> > {
> > 	int tdx_level = pg_level_to_tdx_sept_level(level);
> > 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > 	struct page *page = pfn_to_page(pfn); <==================
>
> I of course agree that this is fundamentally unsafe, it's just not
> necessarily bad code.
>
> I hope we both agree that this could be made _more_ safe by, for
> instance, making sure the page is in a zone, pfn_valid(), and a few more
> things.
>
> In a perfect world, these conversions would happen at a well-defined
> layer (KVM=>TDX) and in relatively few places. That layer transition is
> where the sanity checks happen. It's super useful to have:
>
> struct page *kvm_pfn_to_tdx_private_page(kvm_pfn_t pfn)
> {
> 	struct page *page = pfn_to_page(pfn);
>
> #ifdef DEBUG
> 	WARN_ON_ONCE(!pfn_valid(pfn));
> 	// page must be from a "file"???
> 	WARN_ON_ONCE(!page_mapping(page));
> 	WARN_ON_ONCE(...);
> #endif
> 	return page;
> }
>
> *EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s
> are compiled out, this explicitly lays out the assumptions and it means
> someone reading TDX code has an easier time comprehending it.
I object to the existence of those assumptions. Why the blazes does TDX care
how KVM and guest_memfd manage memory? If you want to assert that the pfn is
compatible with TDX, then by all means. But I am NOT accepting any more KVM code
that assumes TDX memory is backed by refcounted struct page. If I had been paying
more attention when the initial TDX series landed, I would have NAK'd that too.
tdh_mem_page_aug() is just an absurdly slow way of writing a PTE. It doesn't
_need_ the pfn to be backed by a struct page, at all. IMO, what you're asking for
is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and
__kvm_tdp_mmu_write_spte(). No thanks.
> It's also not a crime to do the *same* checking on kvm_pfn_t and not
> have a type transition. I just like the idea of changing the type so
> that the transition line is clear and the concept is carried (forced,
> even) through the layers of helpers.
On 1/16/26 09:14, Sean Christopherson wrote: ... >> *EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s >> are compiled out, this explicitly lays out the assumptions and it means >> someone reading TDX code has an easier idea comprehending it. > I object to the existence of those assumptions. Why the blazes does TDX care > how KVM and guest_memfd manages memory? For me, it's because TDX can't take arbitrary kvm_pfn_t's for private memory. It's got to be able to be converted in the hardware, and also have allocated TDX metadata (PAMT) that the TDX module was handed at module init time. I thought kvm_pfn_t might, for instance be pointing over to a shared page or MMIO. Those can't be used for TD private memory. I think it's a pretty useful convention to know that the generic, flexible kvm_pfn_t has been winnowed down to a more restrictive type that is what TDX needs. But, honestly, my big aversion was to u64's everywhere. I can certainly live with a few kvm_pfn_t's in the TDX code. It doesn't have to be 'struct page'. > If you want to assert that the pfn is compatible with TDX, then by > all means. But I am NOT accepting any more KVM code that assumes > TDX memory is backed by refcounted struct page. If I had been > paying more attention when the initial TDX series landed, I would > have NAK'd that too. I'm kinda surprised by that. The only memory we support handing into TDs for private memory is refcounted struct page. I can imagine us being able to do this with DAX pages in the near future, but those have 'struct page' too, and I think they're refcounted pretty normally now as well. The TDX module initialization is pretty tied to NUMA nodes, too. If it's in a NUMA node, the TDX module is told about it and it also universally gets a 'struct page'. Is there some kind of memory that I'm missing? What else *is* there? :) > tdh_mem_page_aug() is just an absurdly slow way of writing a PTE. It doesn't > _need_ the pfn to be backed a struct page, at all. IMO, what you're asking for > is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and > __kvm_tdp_mmu_write_spte(). No thanks. Which part is absurdly slow? page_to_phys()? Isn't that just a shift by an immediate and a subtraction of an immediate? Yeah, the subtraction immediate is chonky so the instruction is big. But at a quick glance I'm not seeing anything absurd.
On Fri, Jan 16, 2026, Dave Hansen wrote:
> On 1/16/26 09:14, Sean Christopherson wrote:
> > If you want to assert that the pfn is compatible with TDX, then by
> > all means. But I am NOT accepting any more KVM code that assumes
> > TDX memory is backed by refcounted struct page. If I had been
> > paying more attention when the initial TDX series landed, I would
> > have NAK'd that too.
> I'm kinda surprised by that. The only memory we support handing into TDs
> for private memory is refcounted struct page. I can imagine us being
> able to do this with DAX pages in the near future, but those have
> 'struct page' too, and I think they're refcounted pretty normally now as
> well.
>
> The TDX module initialization is pretty tied to NUMA nodes, too. If it's
> in a NUMA node, the TDX module is told about it and it also universally
> gets a 'struct page'.
>
> Is there some kind of memory that I'm missing? What else *is* there? :)
I don't want to special case TDX on the backend of KVM's MMU. There's already
waaaay too much code and complexity in KVM that exists purely for S-EPT. Baking
in assumptions on how _exactly_ KVM is managing guest memory goes too far.
The reason I'm so hostile towards struct page is that, as evidenced by this series
and a ton of historical KVM code, assuming that memory is backed by struct page is
a _very_ slippery slope towards code that is extremely nasty to unwind later on.
E.g. see all of the effort that ended up going into commit ce7b5695397b ("KVM: TDX:
Drop superfluous page pinning in S-EPT management"). And in this series, the
constraints that will be placed on guest_memfd if TDX assumes hugepages will always
be covered in a single folio. Untangling KVM's historical (non-TDX) messes around
struct page took us something like two years.
And so to avoid introducing similar messes in the future, I don't want KVM's MMU
to make _any_ references to struct page when it comes to mapping memory into the
guest unless it's absolutely necessary, e.g. to put a reference when KVM _knows_
it acquired a refcounted page via gup() (and ideally we'd kill even that, e.g.
by telling gup() not to bump the refcount in the first place).
> > tdh_mem_page_aug() is just an absurdly slow way of writing a PTE. It doesn't
> > _need_ the pfn to be backed a struct page, at all. IMO, what you're asking for
> > is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and
> > __kvm_tdp_mmu_write_spte(). No thanks.
>
> Which part is absurdly slow?
The SEAMCALL itself. I'm saying that TDH_MEM_PAGE_AUG is really just the S-EPT
version of "make this PTE PRESENT", and that piling on sanity checks that aren't
fundamental to TDX shouldn't be done when KVM is writing PTEs.
In other words, something like this is totally fine:
KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));
but this is not:
WARN_ON_ONCE(!page_mapping(pfn_to_page(pfn)));
On 1/16/26 11:59, Sean Christopherson wrote: > The SEAMCALL itself. I'm saying that TDH_MEM_PAGE_AUG is really just the S-EPT > version of "make this PTE PRESENT", and that piling on sanity checks that aren't > fundamental to TDX shouldn't be done when KVM is writing PTEs. > > In other words, something like this is totally fine: > > KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn)); > > but this is not: > > WARN_ON_ONCE(!page_mapping(pfn_to_page(pfn))); OK, I think I've got a better idea what you're aiming for. I think that's totally doable going forward.
On Wed, 2026-01-14 at 16:19 -0800, Sean Christopherson wrote: > I've no objection if e.g. tdh_mem_page_aug() wants to sanity check > that a PFN is backed by a struct page with a valid refcount, it's > code like that above that I don't want. Dave wants safety for the TDX pages getting handed to the module. Sean doesn’t want TDX code to have specialness about requiring struct page. I think they are not too much at odds actually. Here are two ideas to get both: 1. Have the TDX module do the checking Yan points out that we could possibly rely on TDX_OPERAND_ADDR_RANGE_ERROR to detect operating on the wrong type of memory. We would have to make sure this will check everything we need going forward. 2. Invent a new tdx_page_t type. We could have a mk_foo()-type helpers that does the checks and converts kvm_pfn_t to TDX’s verified type. The helper can do the checks that it is valid TDX capable memory. Then there is one place that does the conversion. It will be easy to change the verification method if we ever need to. One benefit is that struct page has already been a problem for other reasons [0]. To work around that issue we had to keep duplicate formats of the TDVPR page and lose the standardization of how we handle pages in the TDX code. This is perfectly functional, but a bit annoying. But (2) is inventing a new type, which is somewhat disagreeable too. I’m thinking maybe explore 1 first, with the eventual goal of moving everything to some type of pfn type to unify with the rest of KVM. Either KVM’s or the normal one. But before we do that, can we settle on what sanity checks we want: 1. Page is TDX capable memory 2. ... I think that is it? There was some discussion of refcount checking. I think we don’t need it? [0] https://lore.kernel.org/kvm/20250910144453.1389652-1-dave.hansen@linux.intel.com/#r
On Fri, Jan 16, 2026, Rick P Edgecombe wrote: > On Wed, 2026-01-14 at 16:19 -0800, Sean Christopherson wrote: > > I've no objection if e.g. tdh_mem_page_aug() wants to sanity check > > that a PFN is backed by a struct page with a valid refcount, it's > > code like that above that I don't want. > > Dave wants safety for the TDX pages getting handed to the module. Define "safety". As I stressed earlier, blindly retrieving a "struct page" and dereferencing that pointer is the exact opposite of safe. > 2. Invent a new tdx_page_t type. Still doesn't provide meaningful safety. Regardless of what type gets passed into the low level tdh_*() helpers, it's going to require KVM to effectively cast a bare pfn, because I am completely against passing anything other than a SPTE to tdx_sept_set_private_spte(). > 1. Page is TDX capable memory That's fine by me, but that's _very_ different than what was proposed here.
On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > > Dave wants safety for the TDX pages getting handed to the module. > > Define "safety". As I stressed earlier, blinding retrieving a > "struct page" and dereferencing that pointer is the exact opposite of > safe. I think we had two problems. 1. Passing in raw PA's via u64 led to buggy code. IIRC we had a bug with this that was caught before it went upstream. So a page needs a real type of some sort. 2. Work was done on the tip side to prevent non-TDX capable memory from entering the page allocator. With that in place, by requiring struct page, TDX code can know that it is getting the type of memory it worked hard to guarantee was good. You are saying that shifting a PFN to a struct page blindly doesn't actually guarantee that it meets those requirements. Makes sense. For (1) we can just use any old type I think - pfn_t, etc. As we discussed in the base series. For (2) we need to check that the memory came from the page allocator, or otherwise is valid TDX memory somehow. That is at least the only check that makes sense to me. There was some discussion about refcounts somewhere in this thread. I don't think it's arch/x86's worry. Then Yan was saying something last night that I didn't quite follow. We said, let's just resume the discussion on the list. So she might suggest another check. > > > 2. Invent a new tdx_page_t type. > > Still doesn't provide meaningful safety. Regardless of what type > gets passed into the low level tdh_*() helpers, it's going to require > KVM to effectively cast a bare pfn, because I am completely against > passing anything other than a SPTE to tdx_sept_set_private_spte(). I'm not sure I was clear, like: 1. A raw PFN gets passed in to the conversion helper in arch/x86. 2. The helper does the check that it is TDX capable memory, or anything it cares to check about memory safety, then returns the new type to KVM. 3. KVM uses the type as an argument to any seamcall that requires TDX capable memory. > > > 1. Page is TDX capable memory > > That's fine by me, but that's _very_ different than what was proposed > here. Proposed by me just now or the series? We are trying to find a new solution now.
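A sketch of the three-step flow outlined above, assuming the checked type is a trivial wrapper. The type name, the helper name, and the "TDX capable" check itself are placeholders for whatever is eventually settled on:

/* Step 2: one place in arch/x86 that vets a raw PFN and hands back a typed value. */
typedef struct { unsigned long pfn; } tdx_pfn_t;

static inline tdx_pfn_t tdx_vet_pfn(unsigned long pfn)
{
	/* Placeholder for the real "this is TDX capable memory" check. */
	WARN_ON_ONCE(!pfn_valid(pfn));
	return (tdx_pfn_t){ .pfn = pfn };
}

/* Step 3: SEAMCALL wrappers that need TDX capable memory take tdx_pfn_t, not a bare u64. */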
On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote: > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > > > Dave wants safety for the TDX pages getting handed to the module. > > > > Define "safety". As I stressed earlier, blinding retrieving a > > "struct page" and dereferencing that pointer is the exact opposite of > > safe. > > I think we had two problems. > > 1. Passing in raw PA's via u64 led to buggy code. IIRC we had a bug > with this that was caught before it went upstream. So a page needs a > real type of some sort. > > 2. Work was done on the tip side to prevent non-TDX capable memory from > entering the page allocator. With that in place, by requiring struct > page, TDX code can know that it is getting the type of memory it worked > hard to guarantee was good. > > You are saying that shifting a PFN to a struct page blindly doesn't > actually guarantee that it meets those requirements. Makes sense. > > For (1) we can just use any old type I think - pfn_t, etc. As we > discussed in the base series. > > For (2) we need to check that the memory came from the page allocator, > or otherwise is valid TDX memory somehow. That is at least the only > check that makes sense to me. > > There was some discussion about refcounts somewhere in this thread. I > don't think it's arch/x86's worry. Then Yan was saying something last > night that I didn't quite follow. We said, let's just resume the > discussion on the list. So she might suggest another check. Hmm, I previously had a concern about passing "struct page *" as the SEAMCALL wrapper parameter. For example, when we do sanity checks for valid TDX memory in tdh_mem_page_aug(), we need to do the sanity check on every page, right? However, with base_page + npages, it's not easy to get the ith page's pointer without first ensuring the pages are contained in a single folio. It would also be superfluous if we first get base_pfn from base_page, and then derive the ith page from base_pfn + i. IIUC, this concern should be gone as Dave has agreed to use "pfn" as the SEAMCALL parameter [1]? Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM for every pfn of a huge mapping? Or should we keep the sanity check inside the SEAMCALL wrappers? BTW, I have another question about the SEAMCALL wrapper implementation, as Kai also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure? By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust the input or trigger WARN_ON() by itself; By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document this requirement for the SEAMCALL wrappers and have the caller invoke the API correctly. So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred approach, right? [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/ [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com
On Mon, Jan 19, 2026, Yan Zhao wrote: > On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote: > > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > IIUC, this concern should be gone as Dave has agreed to use "pfn" as the > SEAMCALL parameter [1]? > Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM > for every pfn of a huge mapping? Or should we keep the sanity check inside the > SEAMCALL wrappers? I don't have a strong preference. But if it goes in KVM, definitely guard it with KVM_MMU_WARN_ON(). > BTW, I have another question about the SEAMCALL wrapper implementation, as Kai > also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available > to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID > or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure? Why not both? But maybe TDX_SW_ERROR instead of TDX_OPERAND_INVALID? If an API has a defined contract and/or set of expectations, and those expectations aren't met by the caller, then a WARN is justified. But the failure still needs to be communicated to the caller. > By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust > the input or trigger WARN_ON() by itself; > By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document > this requirement for the SEAMCALL wrappers and have the caller invoke the API > correctly. Document what exactly? Most of this should be common sense. E.g. we don't generally document that pointers must be non-NULL, because that goes without saying 99.9% of the time. IMO, that holds true here as well. E.g. trying to map memory into a TDX guest that isn't convertible is obviously a bug, I don't see any value in formally documenting that requirement. > So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred > approach, right? > > [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/ > [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com
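A minimal sketch of the "warn and still return an error" pattern discussed above, using the 2MB alignment expectation on demote as the example. The helper name and the bare-pfn signature are assumptions, not the real wrapper:

/* Sketch of the entry check only; the real wrapper obviously takes more arguments. */
u64 tdh_mem_page_demote_check(unsigned long pfn)
{
	/* Caller bug: the huge page being demoted must be 2MB aligned. */
	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PTRS_PER_PTE)))
		return TDX_SW_ERROR;

	return 0;
}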
On Fri, Jan 30, 2026 at 07:32:48AM -0800, Sean Christopherson wrote: > On Mon, Jan 19, 2026, Yan Zhao wrote: > > On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote: > > > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote: > > IIUC, this concern should be gone as Dave has agreed to use "pfn" as the > > SEAMCALL parameter [1]? > > Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM > > for every pfn of a huge mapping? Or should we keep the sanity check inside the > > SEAMCALL wrappers? > > I don't have a strong preference. But if it goes in KVM, definitely guard it with > KVM_MMU_WARN_ON(). Thank you for your insights, Sean! > > BTW, I have another question about the SEAMCALL wrapper implementation, as Kai > > also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available > > to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID > > or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure? > > Why not both? But maybe TDX_SW_ERROR instead of TDX_OPERAND_INVALID? Hmm, I previously returned TDX_OPERAND_INVALID for non-aligned base PFN. TDX_SW_ERROR is also ok if we want to indicate that passing an invalid PFN is a software error. (I had tdh_mem_page_demote() return TDX_SW_ERROR when an incompatible TDX module is used, i.e., when !tdx_supports_demote_nointerrupt()). > If an API has a defined contract and/or set of expectations, and those expectations > aren't met by the caller, then a WARN is justified. But the failure still needs > to be communicated to the caller. Ok. The reason for 'not both' is that there's already TDX_BUG_ON_2() in KVM after the SEAMCALL wrapper returns a non-BUSY error. I'm not sure if having double WARN_ON_ONCE() calls is good, so I intended to let the caller decide whether to warn. > > By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust > > the input or trigger WARN_ON() by itself; > > By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document > > this requirement for the SEAMCALL wrappers and have the caller invoke the API > > correctly. > > Document what exactly? Most of this should be common sense. E.g. we don't generally > document that pointers must be non-NULL, because that goes without saying 99.9% > of the time. Document the SEAMCALL wrapper's expectations. e.g., for demote, a PFN must be 2MB-aligned, or the caller must not invoke tdh_mem_page_demote() if a TDX module does not support feature ENHANCED_DEMOTE_INTERRUPTIBILITY... > IMO, that holds true here as well. E.g. trying to map memory into a TDX guest > that isn't convertible is obviously a bug, I don't see any value in formally > documenting that requirement. Do we need a comment for documentation above the tdh_mem_page_demote() API? > > So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred > > approach, right? > > > > > [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/ > > [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com >
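For the documentation question above, something as short as the following would probably suffice; the wording is illustrative only and just restates the two expectations raised in this thread:

/*
 * tdh_mem_page_demote() - Demote one huge S-EPT mapping into 512 smaller ones.
 *
 * The caller must pass a 2MB (or 1GB) aligned PFN, and must not call this
 * wrapper at all if the TDX module lacks ENHANCED_DEMOTE_INTERRUPTIBILITY,
 * i.e. if !tdx_supports_demote_nointerrupt().  Violations WARN and return
 * TDX_SW_ERROR.
 */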
Sean Christopherson <seanjc@google.com> writes: >> >> [...snip...] >> >> > +100 to being careful, but at the same time I don't think we should get _too_ >> > fixated on the guest_memfd folio size. E.g. similar to VM_PFNMAP, where there >> > might not be a folio, if guest_memfd stopped using folios, then the entire >> > discussion becomes moot. +1, IMO the usage of folios on the guest_memfd <-> KVM boundary (kvm_gmem_get_pfn()) is transitional, hopefully we get to a point where guest_memfd will pass KVM pfn, order and no folios. >> > And as above, the long-standing rule isn't about the implementation details so >> > much as it is about KVM's behavior. If the simplest solution to support huge >> > guest_memfd pages is to decouple the max order from the folio, then so be it. >> > >> > That said, I'd very much like to get a sense of the alternatives, because at the >> > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_, >> > and naively, tying that to the folio seems like an easy solution. The upcoming attributes maple tree allows a lookup from guest_memfd index to contiguous range, so the max mapping size (at least guest_memfd's contribution to max mapping level, to be augmented by contribution from lpage_info etc) would be the contiguous range in the xarray containing the index, clamped to guest_memfd page size bounds (both for huge pages and regular PAGE_SIZE pages). The lookup complexity is mainly the maple tree lookup complexity. This lookup happens on mapping and on trying to recover to the largest mapping level, both of which shouldn't happen super often, so I think this should be pretty good for now. This max mapping size is currently memoized as folio size with all the folio splitting work, but memoizing into a folio is expensive (struct pages/folios are big). Hopefully guest_memfd gets to a point where it also supports non-struct page backed memory, and that would save us a bunch more memory. >> >> [...snip...] >> >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from >> both primary and secondary MMUs, and check folio refcount before performing >> folio splitting? > > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, > if something is going through the effort of splitting a folio, then odds are very, > very good that the new folios can't be safely mapped as a contiguous hugepage. > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core > MM to implement, and for drivers/users to understand. > > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to > communicate the maximum mapping size; folios are the "currency" for doing so. > > And then for edge cases that want to map a split folio as a hugepage (if any such > edge cases exist), thus take on the responsibility of managing the lifecycle of > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. > Here's my understanding, hope it helps: there might also be a practical/simpler reason for first unmapping then check refcounts, and then splitting folios, and guest_memfd kind of does the same thing. Folio splitting races with lots of other things in the kernel, and the folio lock isn't super useful because the lock itself is going to be split up. Folio splitting wants all users to stop using this folio, so one big source of users is mappings. Hence, get those mappers (both primary and secondary MMUs) to unmap. 
Core-mm-managed mappings take a refcount, so those refcounts go away. Of the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does unmap as requested, so that still falls in line with "stop using this folio". I think the refcounting check isn't actually necessary if all users of folios STOP using the folio on request (via mmu notifiers or otherwise). Unfortunately, there are other users other than mappers. The best way to find these users is to check the refcount. The refcount check is asking "how many other users are left?" and if the number of users is as expected (just the filemap, or whatever else is expected), then splitting can go ahead, since the splitting code is now confident the remaining users won't try and use the folio metadata while splitting is happening. guest_memfd does a modified version of that on shared to private conversions. guest_memfd will unmap from host userspace page tables for the same reason, mainly to tell all the host users to unmap. The unmapping also triggers mmu notifiers so the stage 2 mappings also go away (TBD if this should be skipped) and this is okay because they're shared pages. guest usage will just map them back in on any failure and it doesn't break guests. At this point all the mappers are gone, then guest_memfd checks refcounts to make sure that guest_memfd itself is the only user of the folio. If the refcount is as expected, guest_memfd is confident to continue with splitting folios, since other folio accesses will be locked out by the filemap invalidate lock. The one main guest_memfd folio user that won't go away on an unmap call is if the folios get pinned for IOMMU access. In this case, guest_memfd fails the conversion and returns an error to userspace so userspace can sort out the IOMMU unpinning. As for private to shared conversions, folio merging would require the same thing that nobody else is using the folios (the folio metadata). guest_memfd skips that check because for private memory, KVM is the only other user, and guest_memfd knows KVM doesn't use folio metadata once the memory is mapped for the guest. >> >> [...snip...] >>
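Going back to the attributes maple tree lookup described at the top of this message: a very rough sketch of that lookup, assuming a tree keyed by guest_memfd index whose entries cover runs of identical attributes. The helper name, tree layout, and locking are assumptions and may not match the eventual implementation:

#include <linux/maple_tree.h>
#include <linux/align.h>
#include <linux/rcupdate.h>

/*
 * Largest order such that the naturally aligned block around @index sits
 * entirely inside the attribute range containing @index, capped at
 * @max_order (the page-size bound from the backing allocation).
 */
static int gmem_max_mapping_order(struct maple_tree *attrs, unsigned long index,
				  int max_order)
{
	MA_STATE(mas, attrs, index, index);
	unsigned long first, last, bsz;
	int order;

	rcu_read_lock();
	mas_walk(&mas);
	first = mas.index;
	last = mas.last;
	rcu_read_unlock();

	for (order = 0; order < max_order; order++) {
		bsz = 1UL << (order + 1);
		if (ALIGN_DOWN(index, bsz) < first ||
		    ALIGN_DOWN(index, bsz) + bsz - 1 > last)
			break;
	}

	return order;
}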
On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote: > Sean Christopherson <seanjc@google.com> writes: > >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from > >> both primary and secondary MMUs, and check folio refcount before performing > >> folio splitting? > > > > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, > > if something is going through the effort of splitting a folio, then odds are very, > > very good that the new folios can't be safely mapped as a contiguous hugepage. > > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core > > MM to implement, and for drivers/users to understand. > > > > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to > > communicate the maximum mapping size; folios are the "currency" for doing so. > > > > And then for edge cases that want to map a split folio as a hugepage (if any such > > edge cases exist), thus take on the responsibility of managing the lifecycle of > > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. > > > > Here's my understanding, hope it helps: there might also be a > practical/simpler reason for first unmapping then check refcounts, and > then splitting folios, and guest_memfd kind of does the same thing. > > Folio splitting races with lots of other things in the kernel, and the > folio lock isn't super useful because the lock itself is going to be > split up. > > Folio splitting wants all users to stop using this folio, so one big > source of users is mappings. Hence, get those mappers (both primary and > secondary MMUs) to unmap. > > Core-mm-managed mappings take a refcount, so those refcounts go away. Of > the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does > unmap as requested, so that still falls in line with "stop using this > folio". > > I think the refcounting check isn't actually necessary if all users of > folios STOP using the folio on request (via mmu notifiers or > otherwise). Unfortunately, there are other users other than mappers. The > best way to find these users is to check the refcount. The refcount > check is asking "how many other users are left?" and if the number of > users is as expected (just the filemap, or whatever else is expected), > then splitting can go ahead, since the splitting code is now confident > the remaining users won't try and use the folio metadata while splitting > is happening. > > > guest_memfd does a modified version of that on shared to private > conversions. guest_memfd will unmap from host userspace page tables for > the same reason, mainly to tell all the host users to unmap. The > unmapping also triggers mmu notifiers so the stage 2 mappings also go > away (TBD if this should be skipped) and this is okay because they're > shared pages. guest usage will just map them back in on any failure and > it doesn't break guests. > > At this point all the mappers are gone, then guest_memfd checks > refcounts to make sure that guest_memfd itself is the only user of the > folio. If the refcount is as expected, guest_memfd is confident to > continue with splitting folios, since other folio accesses will be > locked out by the filemap invalidate lock. > > The one main guest_memfd folio user that won't go away on an unmap call > is if the folios get pinned for IOMMU access. In this case, guest_memfd > fails the conversion and returns an error to userspace so userspace can > sort out the IOMMU unpinning. 
> > > As for private to shared conversions, folio merging would require the > same thing that nobody else is using the folios (the folio > metadata). guest_memfd skips that check because for private memory, KVM > is the only other user, and guest_memfd knows KVM doesn't use folio > metadata once the memory is mapped for the guest. Ok. That makes sense. Thanks for the explanation. It looks like guest_memfd also rules out concurrent folio metadata access by holding the filemap_invalidate_lock. BTW: Could that potentially cause guest soft lockup due to holding the filemap_invalidate_lock for too long?
Yan Zhao <yan.y.zhao@intel.com> writes: > On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote: >> Sean Christopherson <seanjc@google.com> writes: >> >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from >> >> both primary and secondary MMUs, and check folio refcount before performing >> >> folio splitting? >> > >> > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd, >> > if something is going through the effort of splitting a folio, then odds are very, >> > very good that the new folios can't be safely mapped as a contiguous hugepage. >> > Limiting mapping sizes to folios makes the rules/behavior straightfoward for core >> > MM to implement, and for drivers/users to understand. >> > >> > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to >> > communicate the maximum mapping size; folios are the "currency" for doing so. >> > >> > And then for edge cases that want to map a split folio as a hugepage (if any such >> > edge cases exist), thus take on the responsibility of managing the lifecycle of >> > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality. >> > >> >> Here's my understanding, hope it helps: there might also be a >> practical/simpler reason for first unmapping then check refcounts, and >> then splitting folios, and guest_memfd kind of does the same thing. >> >> Folio splitting races with lots of other things in the kernel, and the >> folio lock isn't super useful because the lock itself is going to be >> split up. >> >> Folio splitting wants all users to stop using this folio, so one big >> source of users is mappings. Hence, get those mappers (both primary and >> secondary MMUs) to unmap. >> >> Core-mm-managed mappings take a refcount, so those refcounts go away. Of >> the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does >> unmap as requested, so that still falls in line with "stop using this >> folio". >> >> I think the refcounting check isn't actually necessary if all users of >> folios STOP using the folio on request (via mmu notifiers or >> otherwise). Unfortunately, there are other users other than mappers. The >> best way to find these users is to check the refcount. The refcount >> check is asking "how many other users are left?" and if the number of >> users is as expected (just the filemap, or whatever else is expected), >> then splitting can go ahead, since the splitting code is now confident >> the remaining users won't try and use the folio metadata while splitting >> is happening. >> >> >> guest_memfd does a modified version of that on shared to private >> conversions. guest_memfd will unmap from host userspace page tables for >> the same reason, mainly to tell all the host users to unmap. The >> unmapping also triggers mmu notifiers so the stage 2 mappings also go >> away (TBD if this should be skipped) and this is okay because they're >> shared pages. guest usage will just map them back in on any failure and >> it doesn't break guests. >> >> At this point all the mappers are gone, then guest_memfd checks >> refcounts to make sure that guest_memfd itself is the only user of the >> folio. If the refcount is as expected, guest_memfd is confident to >> continue with splitting folios, since other folio accesses will be >> locked out by the filemap invalidate lock. >> >> The one main guest_memfd folio user that won't go away on an unmap call >> is if the folios get pinned for IOMMU access. 
In this case, guest_memfd >> fails the conversion and returns an error to userspace so userspace can >> sort out the IOMMU unpinning. >> >> >> As for private to shared conversions, folio merging would require the >> same thing that nobody else is using the folios (the folio >> metadata). guest_memfd skips that check because for private memory, KVM >> is the only other user, and guest_memfd knows KVM doesn't use folio >> metadata once the memory is mapped for the guest. > Ok. That makes sense. Thanks for the explanation. > It looks like guest_memfd also rules out concurrent folio metadata access by > holding the filemap_invalidate_lock. > > BTW: Could that potentially cause guest soft lockup due to holding the > filemap_invalidate_lock for too long? Yes, potentially. You mean because the vCPUs are all blocked on page faults, right? We can definitely optimize later, perhaps lock by guest_memfd index ranges.
On Tue, 2026-01-06 at 15:43 -0800, Sean Christopherson wrote:
> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage. As to whether we
> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> multiple folios in one page, probably comes down to which option provides "good
> enough" performance without incurring too much complexity.

Can we add "whether we can push it off to the future" to the considerations
list? The in-flight gmem stuff is pretty complex and this doesn't seem to have
an ABI intersection.
On Wed, Jan 07, 2026, Rick P Edgecombe wrote:
> On Tue, 2026-01-06 at 15:43 -0800, Sean Christopherson wrote:
> > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > mapping multiple guest_memfd folios with a single hugepage. As to whether we
> > do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> > multiple folios in one page, probably comes down to which option provides "good
> > enough" performance without incurring too much complexity.
>
> Can we add "whether we can push it off to the future" to the considerations
> list? The in-flight gmem stuff is pretty complex and this doesn't seem to have
> an ABI intersection.

Ya, for sure. The only wrinkle I can think of is if the refcounting somehow
bleeds into userspace, but that seems like it'd be a flaw on its own.
On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> >
> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> Vishal Annapurve <vannapurve@google.com> writes:
> > >>
> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >> >>
> > >> >> - EPT mapping size and folio size
> > >> >>
> > >> >> This series is built upon the rule in KVM that the mapping size in the
> > >> >> KVM-managed secondary MMU is no larger than the backend folio size.
> > >> >>
> > >>
> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> this rule imposed?
> > >
> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >
> > >> Is this rule there just because traditionally folio sizes also define the
> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> size in case the block of memory represented by the folio is not contiguous?
> > >
> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still
> > > is) strictly bound by the host mapping size. That's handles contiguous addresses,
> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >
> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> tracking purposese on private to shared conversion), the memory is still
> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> the concerns?
> > >
> > > Not really? Why would the folio be split if the memory _and its attributes_ are
> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping
> > > spanning mixed ranges, i.e. with multiple folios.
> >
> > The folio can be split if any (or all) of the pages in a huge page range
> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > would be split, and the split folios are necessary for tracking users of
> > shared pages using struct page refcounts.
>
> Ahh, that's what the refcounting was referring to. Gotcha.
>
> > However the split folios in that 1G range are still fully contiguous.
> >
> > The process of conversion will split the EPT entries soon after the
> > folios are split so the rule remains upheld.
Overall, I don't think allowing folios smaller than the mappings while
conversion is in progress brings enough benefit.
Cons:
(1) TDX's zapping callback has no idea whether the zapping is caused by an
in-progress private-to-shared conversion or other reasons. It also has no
idea if the attributes of the underlying folios remain unchanged during an
in-progress private-to-shared conversion. Even if the assertion Ackerley
mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
callback for in-progress private-to-shared conversion alone (which would
increase TDX's dependency on guest_memfd's specific implementation even if
it's feasible).
Removing the sanity checks entirely from TDX's zapping callback is confusing
and would set a false expectation in KVM -- what if a huge folio is
incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
others) in other conditions? And then do we still need the check in TDX's
mapping callback? If not, does it mean TDX huge pages can stop relying on
guest_memfd's ability to allocate huge folios, as KVM could still create
huge mappings as long as small folios are physically contiguous with
homogeneous memory attributes?
(2) Allowing folios smaller than the mapping would require splitting S-EPT in
kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
invalidate lock held in __kvm_gmem_set_attributes() could guard against
concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
error-prone. (This may also apply to kvm_gmem_migrate_folio().)
Pro: Not zapping private memory until the conversion succeeds is a real benefit.
However, could we achieve this benefit in other ways? For example, is it
possible to ensure hugetlb_restructuring_split_folio() can't fail, by ensuring
split_entries() can't fail (via pre-allocation?) and by disabling the
hugetlb_vmemmap optimization? (hugetlb_vmemmap conversion is very slow in my
observation, so I always disable it.) Or by pre-allocating for
vmemmap_remap_alloc()?
Dropping TDX's sanity check should only be our last resort. IMHO, zapping
private memory before the conversion succeeds is still better than introducing
a mess between folio size and mapping size.
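For what it's worth, the "pre-allocate so the split can't fail" idea above could take a prepare/commit shape roughly like the following. Every structure and helper here is hypothetical, not an existing HugeTLB or guest_memfd API; the point is only that all allocations happen where failure is still harmless:

/* Purely illustrative; no real HugeTLB/guest_memfd interfaces. */
struct split_charge {
	struct list_head pt_pages;	/* pages for split_entries() */
	struct list_head vmemmap_pages;	/* pages for vmemmap_remap_alloc() */
};

static int gmem_prepare_folio_split(struct split_charge *charge,
				    unsigned int order)
{
	/* May fail; nothing has been modified yet, so the conversion can
	 * simply abort here with -ENOMEM. */
	return gmem_prealloc_split_charge(charge, order);
}

static void gmem_commit_folio_split(struct folio *folio,
				    struct split_charge *charge)
{
	/* Must not fail: only consumes pages dequeued from @charge. */
	gmem_do_split_folio(folio, charge);
}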
> > I guess perhaps the question is, is it okay if the folios are smaller
> > than the mapping while conversion is in progress? Does the order matter
> > (split page table entries first vs split folios first)?
>
> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage. As to whether we
> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> multiple folios in one page, probably comes down to which option provides "good
> enough" performance without incurring too much complexity.
Yan Zhao <yan.y.zhao@intel.com> writes: > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: >> On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > Sean Christopherson <seanjc@google.com> writes: >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > >> Vishal Annapurve <vannapurve@google.com> writes: >> > >> >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: >> > >> >> >> > >> >> - EPT mapping size and folio size >> > >> >> >> > >> >> This series is built upon the rule in KVM that the mapping size in the >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. >> > >> >> >> > >> >> > >> I'm not familiar with this rule and would like to find out more. Why is >> > >> this rule imposed? >> > > >> > > Because it's the only sane way to safely map memory into the guest? :-D >> > > >> > >> Is this rule there just because traditionally folio sizes also define the >> > >> limit of contiguity, and so the mapping size must not be greater than folio >> > >> size in case the block of memory represented by the folio is not contiguous? >> > > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. >> > > >> > >> In guest_memfd's case, even if the folio is split (just for refcount >> > >> tracking purposese on private to shared conversion), the memory is still >> > >> contiguous up to the original folio's size. Will the contiguity address >> > >> the concerns? >> > > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping >> > > spanning mixed ranges, i.e. with multiple folios. >> > >> > The folio can be split if any (or all) of the pages in a huge page range >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio >> > would be split, and the split folios are necessary for tracking users of >> > shared pages using struct page refcounts. >> >> Ahh, that's what the refcounting was referring to. Gotcha. >> >> > However the split folios in that 1G range are still fully contiguous. >> > >> > The process of conversion will split the EPT entries soon after the >> > folios are split so the rule remains upheld. Correction here: If we go with splitting from 1G to 4K uniformly on sharing, only the EPT entries around the shared 4K folio will have their page table entries split, so many of the EPT entries will be at 2M level though the folios are 4K sized. This would be last beyond the conversion process. > Overall, I don't think allowing folios smaller than the mappings while > conversion is in progress brings enough benefit. > I'll look into making the restructuring process always succeed, but off the top of my head that's hard because 1. HugeTLB Vmemmap Optimization code would have to be refactored to use pre-allocated pages, which is refactoring deep in HugeTLB code 2. If we want to split non-uniformly such that only the folios that are shared are 4K, and the remaining folios are as large as possible (PMD sized as much as possible), it gets complex to figure out how many pages to allocate ahead of time. So it's complex and will probably delay HugeTLB+conversion support even more! 
> Cons: > (1) TDX's zapping callback has no idea whether the zapping is caused by an > in-progress private-to-shared conversion or other reasons. It also has no > idea if the attributes of the underlying folios remain unchanged during an > in-progress private-to-shared conversion. Even if the assertion Ackerley > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > callback for in-progress private-to-shared conversion alone (which would > increase TDX's dependency on guest_memfd's specific implementation even if > it's feasible). > > Removing the sanity checks entirely in TDX's zapping callback is confusing > and would show a bad/false expectation from KVM -- what if a huge folio is > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > others) in other conditions? And then do we still need the check in TDX's > mapping callback? If not, does it mean TDX huge pages can stop relying on > guest_memfd's ability to allocate huge folios, as KVM could still create > huge mappings as long as small folios are physically contiguous with > homogeneous memory attributes? > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > invalidate lock held in __kvm_gmem_set_attributes() could guard against > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > I think the central question I have among all the above is what TDX needs to actually care about (putting aside what KVM's folio size/memory contiguity vs mapping level rule for a while). I think TDX code can check what it cares about (if required to aid debugging, as Dave suggested). Does TDX actually care about folio sizes, or does it actually care about memory contiguity and alignment? Separately, KVM could also enforce the folio size/memory contiguity vs mapping level rule, but TDX code shouldn't enforce KVM's rules. So if the check is deemed necessary, it still shouldn't be in TDX code, I think. > Pro: Preventing zapping private memory until conversion is successful is good. > > However, could we achieve this benefit in other ways? For example, is it > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > optimization? (hugetlb_vmemmap conversion is super slow according to my > observation and I always disable it). HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a huge VM, multiplied by a large number of hosts, this is not a trivial amount of memory. It's one of the key reasons why we are using HugeTLB in guest_memfd in the first place, other than to be able to get high level page table mappings. We want this in production. > Or pre-allocation for > vmemmap_remap_alloc()? > Will investigate if this is possible as mentioned above. Thanks for the suggestion again! > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > private memory before conversion succeeds is still better than introducing the > mess between folio size and mapping size. > >> > I guess perhaps the question is, is it okay if the folios are smaller >> > than the mapping while conversion is in progress? Does the order matter >> > (split page table entries first vs split folios first)? 
>> >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is >> conceptually totally fine, i.e. I'm not totally opposed to adding support for >> mapping multiple guest_memfd folios with a single hugepage. As to whether we >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping >> multiple folios in one page, probably comes down to which option provides "good >> enough" performance without incurring too much complexity.
On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > Yan Zhao <yan.y.zhao@intel.com> writes: > > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > Sean Christopherson <seanjc@google.com> writes: > >> > > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > >> > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > >> > >> >> > >> > >> >> - EPT mapping size and folio size > >> > >> >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > >> > >> >> > >> > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is > >> > >> this rule imposed? > >> > > > >> > > Because it's the only sane way to safely map memory into the guest? :-D > >> > > > >> > >> Is this rule there just because traditionally folio sizes also define the > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > >> > >> size in case the block of memory represented by the folio is not contiguous? > >> > > > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > >> > > > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > >> > >> tracking purposese on private to shared conversion), the memory is still > >> > >> contiguous up to the original folio's size. Will the contiguity address > >> > >> the concerns? > >> > > > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > >> > > spanning mixed ranges, i.e. with multiple folios. > >> > > >> > The folio can be split if any (or all) of the pages in a huge page range > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > >> > would be split, and the split folios are necessary for tracking users of > >> > shared pages using struct page refcounts. > >> > >> Ahh, that's what the refcounting was referring to. Gotcha. > >> > >> > However the split folios in that 1G range are still fully contiguous. > >> > > >> > The process of conversion will split the EPT entries soon after the > >> > folios are split so the rule remains upheld. > > Correction here: If we go with splitting from 1G to 4K uniformly on > sharing, only the EPT entries around the shared 4K folio will have their > page table entries split, so many of the EPT entries will be at 2M level > though the folios are 4K sized. This would be last beyond the conversion > process. > > > Overall, I don't think allowing folios smaller than the mappings while > > conversion is in progress brings enough benefit. > > > > I'll look into making the restructuring process always succeed, but off > the top of my head that's hard because > > 1. HugeTLB Vmemmap Optimization code would have to be refactored to > use pre-allocated pages, which is refactoring deep in HugeTLB code > > 2. 
If we want to split non-uniformly such that only the folios that are > shared are 4K, and the remaining folios are as large as possible (PMD > sized as much as possible), it gets complex to figure out how many > pages to allocate ahead of time. > > So it's complex and will probably delay HugeTLB+conversion support even > more! > > > Cons: > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > > in-progress private-to-shared conversion or other reasons. It also has no > > idea if the attributes of the underlying folios remain unchanged during an > > in-progress private-to-shared conversion. Even if the assertion Ackerley > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > > callback for in-progress private-to-shared conversion alone (which would > > increase TDX's dependency on guest_memfd's specific implementation even if > > it's feasible). > > > > Removing the sanity checks entirely in TDX's zapping callback is confusing > > and would show a bad/false expectation from KVM -- what if a huge folio is > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > > others) in other conditions? And then do we still need the check in TDX's > > mapping callback? If not, does it mean TDX huge pages can stop relying on > > guest_memfd's ability to allocate huge folios, as KVM could still create > > huge mappings as long as small folios are physically contiguous with > > homogeneous memory attributes? > > > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > > > > I think the central question I have among all the above is what TDX > needs to actually care about (putting aside what KVM's folio size/memory > contiguity vs mapping level rule for a while). > > I think TDX code can check what it cares about (if required to aid > debugging, as Dave suggested). Does TDX actually care about folio sizes, > or does it actually care about memory contiguity and alignment? TDX cares about memory contiguity. A single folio ensures memory contiguity. Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a contiguous range larger than the page's folio range. Additionally, we don't split private mappings in kvm_gmem_error_folio(). If smaller folios are allowed, splitting private mapping is required there. (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it possible for splitting a huge folio to fail partially, without merging the huge folio back or further zapping?). Not sure if there're other edge cases we're still missing. > Separately, KVM could also enforce the folio size/memory contiguity vs > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > the check is deemed necessary, it still shouldn't be in TDX code, I > think. > > > Pro: Preventing zapping private memory until conversion is successful is good. > > > > However, could we achieve this benefit in other ways? For example, is it > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > > split_entries() can't fail (via pre-allocation?) 
and disabling hugetlb_vmemmap > > optimization? (hugetlb_vmemmap conversion is super slow according to my > > observation and I always disable it). > > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > huge VM, multiplied by a large number of hosts, this is not a trivial > amount of memory. It's one of the key reasons why we are using HugeTLB > in guest_memfd in the first place, other than to be able to get high > level page table mappings. We want this in production. > > > Or pre-allocation for > > vmemmap_remap_alloc()? > > > > Will investigate if this is possible as mentioned above. Thanks for the > suggestion again! > > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > > private memory before conversion succeeds is still better than introducing the > > mess between folio size and mapping size. > > > >> > I guess perhaps the question is, is it okay if the folios are smaller > >> > than the mapping while conversion is in progress? Does the order matter > >> > (split page table entries first vs split folios first)? > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > >> multiple folios in one page, probably comes down to which option provides "good > >> enough" performance without incurring too much complexity. >
On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > > Yan Zhao <yan.y.zhao@intel.com> writes: > > > > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > Sean Christopherson <seanjc@google.com> writes: > > >> > > > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > > >> > >> > > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > >> > >> >> > > >> > >> >> - EPT mapping size and folio size > > >> > >> >> > > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > > >> > >> >> > > >> > >> > > >> > >> I'm not familiar with this rule and would like to find out more. Why is > > >> > >> this rule imposed? > > >> > > > > >> > > Because it's the only sane way to safely map memory into the guest? :-D > > >> > > > > >> > >> Is this rule there just because traditionally folio sizes also define the > > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > > >> > >> size in case the block of memory represented by the folio is not contiguous? > > >> > > > > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > > >> > > > > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > > >> > >> tracking purposese on private to shared conversion), the memory is still > > >> > >> contiguous up to the original folio's size. Will the contiguity address > > >> > >> the concerns? > > >> > > > > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > > >> > > spanning mixed ranges, i.e. with multiple folios. > > >> > > > >> > The folio can be split if any (or all) of the pages in a huge page range > > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > > >> > would be split, and the split folios are necessary for tracking users of > > >> > shared pages using struct page refcounts. > > >> > > >> Ahh, that's what the refcounting was referring to. Gotcha. > > >> > > >> > However the split folios in that 1G range are still fully contiguous. > > >> > > > >> > The process of conversion will split the EPT entries soon after the > > >> > folios are split so the rule remains upheld. > > > > Correction here: If we go with splitting from 1G to 4K uniformly on > > sharing, only the EPT entries around the shared 4K folio will have their > > page table entries split, so many of the EPT entries will be at 2M level > > though the folios are 4K sized. This would be last beyond the conversion > > process. > > > > > Overall, I don't think allowing folios smaller than the mappings while > > > conversion is in progress brings enough benefit. > > > > > > > I'll look into making the restructuring process always succeed, but off > > the top of my head that's hard because > > > > 1. 
HugeTLB Vmemmap Optimization code would have to be refactored to > > use pre-allocated pages, which is refactoring deep in HugeTLB code > > > > 2. If we want to split non-uniformly such that only the folios that are > > shared are 4K, and the remaining folios are as large as possible (PMD > > sized as much as possible), it gets complex to figure out how many > > pages to allocate ahead of time. > > > > So it's complex and will probably delay HugeTLB+conversion support even > > more! > > > > > Cons: > > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > > > in-progress private-to-shared conversion or other reasons. It also has no > > > idea if the attributes of the underlying folios remain unchanged during an > > > in-progress private-to-shared conversion. Even if the assertion Ackerley > > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > > > callback for in-progress private-to-shared conversion alone (which would > > > increase TDX's dependency on guest_memfd's specific implementation even if > > > it's feasible). > > > > > > Removing the sanity checks entirely in TDX's zapping callback is confusing > > > and would show a bad/false expectation from KVM -- what if a huge folio is > > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > > > others) in other conditions? And then do we still need the check in TDX's > > > mapping callback? If not, does it mean TDX huge pages can stop relying on > > > guest_memfd's ability to allocate huge folios, as KVM could still create > > > huge mappings as long as small folios are physically contiguous with > > > homogeneous memory attributes? > > > > > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > > > > > > > I think the central question I have among all the above is what TDX > > needs to actually care about (putting aside what KVM's folio size/memory > > contiguity vs mapping level rule for a while). > > > > I think TDX code can check what it cares about (if required to aid > > debugging, as Dave suggested). Does TDX actually care about folio sizes, > > or does it actually care about memory contiguity and alignment? > TDX cares about memory contiguity. A single folio ensures memory contiguity. In this slightly unusual case, I think the guarantee needed here is that as long as a range is mapped into SEPT entries, guest_memfd ensures that the complete range stays private. i.e. I think it should be safe to rely on guest_memfd here, irrespective of the folio sizes: 1) KVM TDX stack should be able to reclaim the complete range when unmapping. 2) KVM TDX stack can assume that as long as memory is mapped in SEPT entries, guest_memfd will not let host userspace mappings to access guest private memory. > > Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > contiguous range larger than the page's folio range. What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > Additionally, we don't split private mappings in kvm_gmem_error_folio(). 
> If smaller folios are allowed, splitting private mapping is required there. Yes, I believe splitting private mappings will be invoked to ensure that the whole huge folio is not unmapped from KVM due to an error on just a 4K page. Is that a problem? If splitting fails, the implementation can fall back to completely zapping the folio range. > (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > possible for splitting a huge folio to fail partially, without merging the huge > folio back or further zapping?). Yes, splitting can fail partially, but guest_memfd will not make the ranges available to host userspace and derivatives until: 1) The complete range to be converted is split to 4K granularity. 2) The complete range to be converted is zapped from KVM EPT mappings. > Not sure if there're other edge cases we're still missing. > > > Separately, KVM could also enforce the folio size/memory contiguity vs > > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > > the check is deemed necessary, it still shouldn't be in TDX code, I > > think. > > > > > Pro: Preventing zapping private memory until conversion is successful is good. > > > > > > However, could we achieve this benefit in other ways? For example, is it > > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > > > optimization? (hugetlb_vmemmap conversion is super slow according to my > > > observation and I always disable it). > > > > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > > huge VM, multiplied by a large number of hosts, this is not a trivial > > amount of memory. It's one of the key reasons why we are using HugeTLB > > in guest_memfd in the first place, other than to be able to get high > > level page table mappings. We want this in production. > > > > > Or pre-allocation for > > > vmemmap_remap_alloc()? > > > > > > > Will investigate if this is possible as mentioned above. Thanks for the > > suggestion again! > > > > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > > > private memory before conversion succeeds is still better than introducing the > > > mess between folio size and mapping size. > > > > > >> > I guess perhaps the question is, is it okay if the folios are smaller > > >> > than the mapping while conversion is in progress? Does the order matter > > >> > (split page table entries first vs split folios first)? > > >> > > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > > >> multiple folios in one page, probably comes down to which option provides "good > > >> enough" performance without incurring too much complexity. > >
Vishal Annapurve <vannapurve@google.com> writes: > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: >> >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: >> > Yan Zhao <yan.y.zhao@intel.com> writes: >> > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > >> > Sean Christopherson <seanjc@google.com> writes: >> > >> > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes: >> > >> > >> >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: >> > >> > >> >> >> > >> > >> >> - EPT mapping size and folio size >> > >> > >> >> >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. >> > >> > >> >> >> > >> > >> >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is >> > >> > >> this rule imposed? >> > >> > > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D >> > >> > > >> > >> > >> Is this rule there just because traditionally folio sizes also define the >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio >> > >> > >> size in case the block of memory represented by the folio is not contiguous? >> > >> > > >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still >> > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. >> > >> > > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount >> > >> > >> tracking purposese on private to shared conversion), the memory is still >> > >> > >> contiguous up to the original folio's size. Will the contiguity address >> > >> > >> the concerns? >> > >> > > >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping >> > >> > > spanning mixed ranges, i.e. with multiple folios. >> > >> > >> > >> > The folio can be split if any (or all) of the pages in a huge page range >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio >> > >> > would be split, and the split folios are necessary for tracking users of >> > >> > shared pages using struct page refcounts. >> > >> >> > >> Ahh, that's what the refcounting was referring to. Gotcha. >> > >> >> > >> > However the split folios in that 1G range are still fully contiguous. >> > >> > >> > >> > The process of conversion will split the EPT entries soon after the >> > >> > folios are split so the rule remains upheld. >> > >> > Correction here: If we go with splitting from 1G to 4K uniformly on >> > sharing, only the EPT entries around the shared 4K folio will have their >> > page table entries split, so many of the EPT entries will be at 2M level >> > though the folios are 4K sized. This would be last beyond the conversion >> > process. >> > >> > > Overall, I don't think allowing folios smaller than the mappings while >> > > conversion is in progress brings enough benefit. 
>> > > >> > >> > I'll look into making the restructuring process always succeed, but off >> > the top of my head that's hard because >> > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to >> > use pre-allocated pages, which is refactoring deep in HugeTLB code >> > >> > 2. If we want to split non-uniformly such that only the folios that are >> > shared are 4K, and the remaining folios are as large as possible (PMD >> > sized as much as possible), it gets complex to figure out how many >> > pages to allocate ahead of time. >> > >> > So it's complex and will probably delay HugeTLB+conversion support even >> > more! >> > >> > > Cons: >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an >> > > in-progress private-to-shared conversion or other reasons. It also has no >> > > idea if the attributes of the underlying folios remain unchanged during an >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping >> > > callback for in-progress private-to-shared conversion alone (which would >> > > increase TDX's dependency on guest_memfd's specific implementation even if >> > > it's feasible). >> > > >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing >> > > and would show a bad/false expectation from KVM -- what if a huge folio is >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or >> > > others) in other conditions? And then do we still need the check in TDX's >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on >> > > guest_memfd's ability to allocate huge folios, as KVM could still create >> > > huge mappings as long as small folios are physically contiguous with >> > > homogeneous memory attributes? >> > > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). >> > > >> > >> > I think the central question I have among all the above is what TDX >> > needs to actually care about (putting aside what KVM's folio size/memory >> > contiguity vs mapping level rule for a while). >> > >> > I think TDX code can check what it cares about (if required to aid >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, >> > or does it actually care about memory contiguity and alignment? >> TDX cares about memory contiguity. A single folio ensures memory contiguity. > > In this slightly unusual case, I think the guarantee needed here is > that as long as a range is mapped into SEPT entries, guest_memfd > ensures that the complete range stays private. > > i.e. I think it should be safe to rely on guest_memfd here, > irrespective of the folio sizes: > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > entries, guest_memfd will not let host userspace mappings to access > guest private memory. > >> >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a >> contiguous range larger than the page's folio range. 
> > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > >> >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). >> If smaller folios are allowed, splitting private mapping is required there. It was discussed before that for memory failure handling, we will want to split huge pages, we will get to it! The trouble is that guest_memfd took the page from HugeTLB (unlike buddy or HugeTLB which manages memory from the ground up), so we'll still need to figure out it's okay to let HugeTLB deal with it when freeing, and when I last looked, HugeTLB doesn't actually deal with poisoned folios on freeing, so there's more work to do on the HugeTLB side. This is a good point, although IIUC it is a separate issue. The need to split private mappings on memory failure is not for confidentiality in the TDX sense but to ensure that the guest doesn't use the failed memory. In that case, contiguity is broken by the failed memory. The folio is split, the private EPTs are split. The folio size should still not be checked in TDX code. guest_memfd knows contiguity got broken, so guest_memfd calls TDX code to split the EPTs. > > Yes, I believe splitting private mappings will be invoked to ensure > that the whole huge folio is not unmapped from KVM due to an error on > just a 4K page. Is that a problem? > > If splitting fails, the implementation can fall back to completely > zapping the folio range. > >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it >> possible for splitting a huge folio to fail partially, without merging the huge >> folio back or further zapping?). The current stance is to allow splitting failures and not undo that splitting failure, so there's no merge back to fix the splitting failure. (Not set in stone yet, I think merging back could turn out to be a requirement from the mm side, which comes with more complexity in restructuring logic.) If it is not merged back on a split failure, the pages are still contiguous, the pages are guaranteed contiguous while they are owned by guest_memfd (even in the case of memory failure, if I get my way :P) so TDX can still trust that. I think you're worried that on split failure some folios are split, but the private EPTs for those are not split, but the memory for those unsplit private EPTs are still contiguous, and on split failure we quit early so guest_memfd still tracks the ranges as private. Privateness and contiguity are preserved so I think TDX should be good with that? The TD can still run. IIUC it is part of the plan that on splitting failure, conversion ioctl returns failure, guest is informed of conversion failure so that it can do whatever it should do to clean up. > > Yes, splitting can fail partially, but guest_memfd will not make the > ranges available to host userspace and derivatives until: > 1) The complete range to be converted is split to 4K granularity. > 2) The complete range to be converted is zapped from KVM EPT mappings. > >> Not sure if there're other edge cases we're still missing. >> As you said, at the core TDX is concerned about contiguity of the memory ranges (start_addr, length) that it was given. Contiguity is guaranteed by guest_memfd while the folio is in guest_memfd ownership up to the boundaries of the original folio, before any restructuring. So if we're looking for edge cases, I think they would be around truncation. Can't think of anything now. 
(guest_memfd will also ensure truncation of anything less than the original size of the folio before restructuring is blocked, regardless of the current size of the folio) >> > Separately, KVM could also enforce the folio size/memory contiguity vs >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if >> > the check is deemed necessary, it still shouldn't be in TDX code, I >> > think. >> > >> > > Pro: Preventing zapping private memory until conversion is successful is good. >> > > >> > > However, could we achieve this benefit in other ways? For example, is it >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my >> > > observation and I always disable it). >> > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a >> > huge VM, multiplied by a large number of hosts, this is not a trivial >> > amount of memory. It's one of the key reasons why we are using HugeTLB >> > in guest_memfd in the first place, other than to be able to get high >> > level page table mappings. We want this in production. >> > >> > > Or pre-allocation for >> > > vmemmap_remap_alloc()? >> > > >> > >> > Will investigate if this is possible as mentioned above. Thanks for the >> > suggestion again! >> > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping >> > > private memory before conversion succeeds is still better than introducing the >> > > mess between folio size and mapping size. >> > > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller >> > >> > than the mapping while conversion is in progress? Does the order matter >> > >> > (split page table entries first vs split folios first)? >> > >> >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping >> > >> multiple folios in one page, probably comes down to which option provides "good >> > >> enough" performance without incurring too much complexity. >> >
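As a rough sketch of the kind of folio-independent sanity check TDX code could keep (names illustrative): a pfn-plus-level mapping is physically contiguous by construction, so the only thing TDX can verify locally is alignment; the "single owner for the whole range" part is exactly the guarantee that has to come from guest_memfd (or from the folio-size rule):

/* Illustrative only: alignment is what the mapping SEAMCALLs require of the
 * HPA; ownership/contiguity of the backing allocation cannot be checked here
 * and is guest_memfd's guarantee. */
static bool tdx_mapping_args_sane(kvm_pfn_t pfn, int level)
{
	unsigned long nr_pages = KVM_PAGES_PER_HPAGE(level);

	return IS_ALIGNED(pfn, nr_pages);
}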
On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote: > Vishal Annapurve <vannapurve@google.com> writes: > > > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > >> > >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > >> > Yan Zhao <yan.y.zhao@intel.com> writes: > >> > > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > >> > Sean Christopherson <seanjc@google.com> writes: > >> > >> > > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > >> > >> > >> > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > >> > >> > >> >> > >> > >> > >> >> - EPT mapping size and folio size > >> > >> > >> >> > >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > >> > >> > >> >> > >> > >> > >> > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is > >> > >> > >> this rule imposed? > >> > >> > > > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D > >> > >> > > > >> > >> > >> Is this rule there just because traditionally folio sizes also define the > >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > >> > >> > >> size in case the block of memory represented by the folio is not contiguous? > >> > >> > > > >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > >> > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > >> > >> > > > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > >> > >> > >> tracking purposese on private to shared conversion), the memory is still > >> > >> > >> contiguous up to the original folio's size. Will the contiguity address > >> > >> > >> the concerns? > >> > >> > > > >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > >> > >> > > spanning mixed ranges, i.e. with multiple folios. > >> > >> > > >> > >> > The folio can be split if any (or all) of the pages in a huge page range > >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > >> > >> > would be split, and the split folios are necessary for tracking users of > >> > >> > shared pages using struct page refcounts. > >> > >> > >> > >> Ahh, that's what the refcounting was referring to. Gotcha. > >> > >> > >> > >> > However the split folios in that 1G range are still fully contiguous. > >> > >> > > >> > >> > The process of conversion will split the EPT entries soon after the > >> > >> > folios are split so the rule remains upheld. > >> > > >> > Correction here: If we go with splitting from 1G to 4K uniformly on > >> > sharing, only the EPT entries around the shared 4K folio will have their > >> > page table entries split, so many of the EPT entries will be at 2M level > >> > though the folios are 4K sized. This would be last beyond the conversion > >> > process. 
> >> > > >> > > Overall, I don't think allowing folios smaller than the mappings while > >> > > conversion is in progress brings enough benefit. > >> > > > >> > > >> > I'll look into making the restructuring process always succeed, but off > >> > the top of my head that's hard because > >> > > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to > >> > use pre-allocated pages, which is refactoring deep in HugeTLB code > >> > > >> > 2. If we want to split non-uniformly such that only the folios that are > >> > shared are 4K, and the remaining folios are as large as possible (PMD > >> > sized as much as possible), it gets complex to figure out how many > >> > pages to allocate ahead of time. > >> > > >> > So it's complex and will probably delay HugeTLB+conversion support even > >> > more! > >> > > >> > > Cons: > >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > >> > > in-progress private-to-shared conversion or other reasons. It also has no > >> > > idea if the attributes of the underlying folios remain unchanged during an > >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley > >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > >> > > callback for in-progress private-to-shared conversion alone (which would > >> > > increase TDX's dependency on guest_memfd's specific implementation even if > >> > > it's feasible). > >> > > > >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing > >> > > and would show a bad/false expectation from KVM -- what if a huge folio is > >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > >> > > others) in other conditions? And then do we still need the check in TDX's > >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on > >> > > guest_memfd's ability to allocate huge folios, as KVM could still create > >> > > huge mappings as long as small folios are physically contiguous with > >> > > homogeneous memory attributes? > >> > > > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > >> > > > >> > > >> > I think the central question I have among all the above is what TDX > >> > needs to actually care about (putting aside what KVM's folio size/memory > >> > contiguity vs mapping level rule for a while). > >> > > >> > I think TDX code can check what it cares about (if required to aid > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, > >> > or does it actually care about memory contiguity and alignment? > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. > > > > In this slightly unusual case, I think the guarantee needed here is > > that as long as a range is mapped into SEPT entries, guest_memfd > > ensures that the complete range stays private. > > > > i.e. I think it should be safe to rely on guest_memfd here, > > irrespective of the folio sizes: > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. 
> > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > > entries, guest_memfd will not let host userspace mappings to access > > guest private memory. > > > >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > >> contiguous range larger than the page's folio range. > > > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > >> If smaller folios are allowed, splitting private mapping is required there. > > It was discussed before that for memory failure handling, we will want > to split huge pages, we will get to it! The trouble is that guest_memfd > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > from the ground up), so we'll still need to figure out it's okay to let > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > doesn't actually deal with poisoned folios on freeing, so there's more > work to do on the HugeTLB side. > > This is a good point, although IIUC it is a separate issue. The need to > split private mappings on memory failure is not for confidentiality in > the TDX sense but to ensure that the guest doesn't use the failed > memory. In that case, contiguity is broken by the failed memory. The > folio is split, the private EPTs are split. The folio size should still > not be checked in TDX code. guest_memfd knows contiguity got broken, so > guest_memfd calls TDX code to split the EPTs. Hmm, maybe the key is that we need to split S-EPT first before allowing guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the folio splitting. This is better than performing folio splitting while it's mapped as huge in S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in kvm_gmem_error_folio() would still trigger the over-zapping issue. In the primary MMU, it follows the rule of unmapping a folio before splitting, truncating, or migrating a folio. For S-EPT, considering the cost of zapping more ranges than necessary, maybe a trade-off is to always split S-EPT before allowing backend folio splitting. Does this look good to you? So, to convert a 2MB range from private to shared, even though guest_memfd will eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, don't split the backend folio. Even if folio splitting may fail later, it just leaves split S-EPT mappings, which matters little, especially after we support S-EPT promotion later. The benefit is that we don't need to worry even in the case when guest_memfd splits a 1GB folio directly to 4KB granularity, potentially introducing the over-zapping issue later. > > Yes, I believe splitting private mappings will be invoked to ensure > > that the whole huge folio is not unmapped from KVM due to an error on > > just a 4K page. Is that a problem? > > > > If splitting fails, the implementation can fall back to completely > > zapping the folio range. > > > >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > >> possible for splitting a huge folio to fail partially, without merging the huge > >> folio back or further zapping?). > > The current stance is to allow splitting failures and not undo that > splitting failure, so there's no merge back to fix the splitting > failure. 
(Not set in stone yet, I think merging back could turn out to > be a requirement from the mm side, which comes with more complexity in > restructuring logic.) > > If it is not merged back on a split failure, the pages are still > contiguous, the pages are guaranteed contiguous while they are owned by > guest_memfd (even in the case of memory failure, if I get my way :P) so > TDX can still trust that. > > I think you're worried that on split failure some folios are split, but > the private EPTs for those are not split, but the memory for those > unsplit private EPTs are still contiguous, and on split failure we quit > early so guest_memfd still tracks the ranges as private. > > Privateness and contiguity are preserved so I think TDX should be good > with that? The TD can still run. IIUC it is part of the plan that on > splitting failure, conversion ioctl returns failure, guest is informed > of conversion failure so that it can do whatever it should do to clean > up. As above, what about the idea of always requesting KVM to split S-EPT before guest_memfd splits a folio? I think splitting S-EPT first is already required for all cases anyway, except for the private-to-shared conversion of a full 2MB or 1GB range. Requesting S-EPT splitting when it's about to do folio splitting is better than leaving huge mappings with split folios and having to patch things up here and there, just to make the single case of private-to-shared conversion easier. > > Yes, splitting can fail partially, but guest_memfd will not make the > > ranges available to host userspace and derivatives until: > > 1) The complete range to be converted is split to 4K granularity. > > 2) The complete range to be converted is zapped from KVM EPT mappings. > > > >> Not sure if there're other edge cases we're still missing. > >> > > As you said, at the core TDX is concerned about contiguity of the memory > ranges (start_addr, length) that it was given. Contiguity is guaranteed > by guest_memfd while the folio is in guest_memfd ownership up to the > boundaries of the original folio, before any restructuring. So if we're > looking for edge cases, I think they would be around > truncation. Can't think of anything now. Potentially, folio migration, if we support it in the future. > (guest_memfd will also ensure truncation of anything less than the > original size of the folio before restructuring is blocked, regardless > of the current size of the folio) > >> > Separately, KVM could also enforce the folio size/memory contiguity vs > >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > >> > the check is deemed necessary, it still shouldn't be in TDX code, I > >> > think. > >> > > >> > > Pro: Preventing zapping private memory until conversion is successful is good. > >> > > > >> > > However, could we achieve this benefit in other ways? For example, is it > >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my > >> > > observation and I always disable it). > >> > > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > >> > huge VM, multiplied by a large number of hosts, this is not a trivial > >> > amount of memory. It's one of the key reasons why we are using HugeTLB > >> > in guest_memfd in the first place, other than to be able to get high > >> > level page table mappings. 
We want this in production. > >> > > >> > > Or pre-allocation for > >> > > vmemmap_remap_alloc()? > >> > > > >> > > >> > Will investigate if this is possible as mentioned above. Thanks for the > >> > suggestion again! > >> > > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > >> > > private memory before conversion succeeds is still better than introducing the > >> > > mess between folio size and mapping size. > >> > > > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller > >> > >> > than the mapping while conversion is in progress? Does the order matter > >> > >> > (split page table entries first vs split folios first)? > >> > >> > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > >> > >> multiple folios in one page, probably comes down to which option provides "good > >> > >> enough" performance without incurring too much complexity. > >> > >
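To make the ordering proposed above concrete, here is a minimal sketch (not code from this series; kvm_split_private_mappings() and gmem_split_folio() are hypothetical names standing in for whatever KVM/guest_memfd helpers end up being used): split the S-EPT down to the target granularity first, and only restructure the backend folio if that succeeds.

/*
 * Hypothetical sketch of "split S-EPT before splitting the backend folio".
 * Helper names are illustrative only; error handling is elided.
 */
static int gmem_convert_split(struct kvm *kvm, gfn_t gfn, unsigned long npages)
{
	int r;

	/* 1. Split the private S-EPT mappings first. */
	r = kvm_split_private_mappings(kvm, gfn, npages);	/* hypothetical */
	if (r)
		return r;	/* the huge folio is left intact */

	/*
	 * 2. Only then split the guest_memfd folio.  If this later fails, the
	 *    worst case is split S-EPT mappings left behind, which promotion
	 *    can repair once it is supported.
	 */
	return gmem_split_folio(kvm, gfn, npages);		/* hypothetical */
}

If the S-EPT split fails, the conversion ioctl can return the failure to userspace without ever touching the folio, so the huge folio and the huge mapping stay consistent.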
On Mon, Jan 12, 2026 at 09:39:39AM +0800, Yan Zhao wrote: > On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote: > > Vishal Annapurve <vannapurve@google.com> writes: > > > > > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > >> > > >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote: > > >> > Yan Zhao <yan.y.zhao@intel.com> writes: > > >> > > > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote: > > >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > >> > Sean Christopherson <seanjc@google.com> writes: > > >> > >> > > > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote: > > >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes: > > >> > >> > >> > > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > >> > >> > >> >> > > >> > >> > >> >> - EPT mapping size and folio size > > >> > >> > >> >> > > >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the > > >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size. > > >> > >> > >> >> > > >> > >> > >> > > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is > > >> > >> > >> this rule imposed? > > >> > >> > > > > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D > > >> > >> > > > > >> > >> > >> Is this rule there just because traditionally folio sizes also define the > > >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio > > >> > >> > >> size in case the block of memory represented by the folio is not contiguous? > > >> > >> > > > > >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still > > >> > >> > > is) strictly bound by the host mapping size. That's handles contiguous addresses, > > >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes. > > >> > >> > > > > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount > > >> > >> > >> tracking purposese on private to shared conversion), the memory is still > > >> > >> > >> contiguous up to the original folio's size. Will the contiguity address > > >> > >> > >> the concerns? > > >> > >> > > > > >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are > > >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping > > >> > >> > > spanning mixed ranges, i.e. with multiple folios. > > >> > >> > > > >> > >> > The folio can be split if any (or all) of the pages in a huge page range > > >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the > > >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio > > >> > >> > would be split, and the split folios are necessary for tracking users of > > >> > >> > shared pages using struct page refcounts. > > >> > >> > > >> > >> Ahh, that's what the refcounting was referring to. Gotcha. > > >> > >> > > >> > >> > However the split folios in that 1G range are still fully contiguous. > > >> > >> > > > >> > >> > The process of conversion will split the EPT entries soon after the > > >> > >> > folios are split so the rule remains upheld. 
> > >> > > > >> > Correction here: If we go with splitting from 1G to 4K uniformly on > > >> > sharing, only the EPT entries around the shared 4K folio will have their > > >> > page table entries split, so many of the EPT entries will be at 2M level > > >> > though the folios are 4K sized. This would be last beyond the conversion > > >> > process. > > >> > > > >> > > Overall, I don't think allowing folios smaller than the mappings while > > >> > > conversion is in progress brings enough benefit. > > >> > > > > >> > > > >> > I'll look into making the restructuring process always succeed, but off > > >> > the top of my head that's hard because > > >> > > > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to > > >> > use pre-allocated pages, which is refactoring deep in HugeTLB code > > >> > > > >> > 2. If we want to split non-uniformly such that only the folios that are > > >> > shared are 4K, and the remaining folios are as large as possible (PMD > > >> > sized as much as possible), it gets complex to figure out how many > > >> > pages to allocate ahead of time. > > >> > > > >> > So it's complex and will probably delay HugeTLB+conversion support even > > >> > more! > > >> > > > >> > > Cons: > > >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an > > >> > > in-progress private-to-shared conversion or other reasons. It also has no > > >> > > idea if the attributes of the underlying folios remain unchanged during an > > >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley > > >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping > > >> > > callback for in-progress private-to-shared conversion alone (which would > > >> > > increase TDX's dependency on guest_memfd's specific implementation even if > > >> > > it's feasible). > > >> > > > > >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing > > >> > > and would show a bad/false expectation from KVM -- what if a huge folio is > > >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or > > >> > > others) in other conditions? And then do we still need the check in TDX's > > >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on > > >> > > guest_memfd's ability to allocate huge folios, as KVM could still create > > >> > > huge mappings as long as small folios are physically contiguous with > > >> > > homogeneous memory attributes? > > >> > > > > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in > > >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the > > >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against > > >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks > > >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially). > > >> > > > > >> > > > >> > I think the central question I have among all the above is what TDX > > >> > needs to actually care about (putting aside what KVM's folio size/memory > > >> > contiguity vs mapping level rule for a while). > > >> > > > >> > I think TDX code can check what it cares about (if required to aid > > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, > > >> > or does it actually care about memory contiguity and alignment? > > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. 
> > > > > > In this slightly unusual case, I think the guarantee needed here is > > > that as long as a range is mapped into SEPT entries, guest_memfd > > > ensures that the complete range stays private. > > > > > > i.e. I think it should be safe to rely on guest_memfd here, > > > irrespective of the folio sizes: > > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > > > entries, guest_memfd will not let host userspace mappings to access > > > guest private memory. > > > > > >> > > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > > >> contiguous range larger than the page's folio range. > > > > > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > > > > >> > > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > >> If smaller folios are allowed, splitting private mapping is required there. > > > > It was discussed before that for memory failure handling, we will want > > to split huge pages, we will get to it! The trouble is that guest_memfd > > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > > from the ground up), so we'll still need to figure out it's okay to let > > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > > doesn't actually deal with poisoned folios on freeing, so there's more > > work to do on the HugeTLB side. > > > > This is a good point, although IIUC it is a separate issue. The need to > > split private mappings on memory failure is not for confidentiality in > > the TDX sense but to ensure that the guest doesn't use the failed > > memory. In that case, contiguity is broken by the failed memory. The > > folio is split, the private EPTs are split. The folio size should still > > not be checked in TDX code. guest_memfd knows contiguity got broken, so > > guest_memfd calls TDX code to split the EPTs. > > Hmm, maybe the key is that we need to split S-EPT first before allowing > guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > folio splitting. > > This is better than performing folio splitting while it's mapped as huge in > S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > kvm_gmem_error_folio() would still trigger the over-zapping issue. > > In the primary MMU, it follows the rule of unmapping a folio before splitting, > truncating, or migrating a folio. For S-EPT, considering the cost of zapping > more ranges than necessary, maybe a trade-off is to always split S-EPT before > allowing backend folio splitting. > > Does this look good to you? So, the flow of converting 0-4KB from private to shared in a 1GB folio in guest_memfd is: a. If guest_memfd splits 1GB to 2MB first: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. 2. split folio 3. zap the 0-4KB mapping. b. If guest_memfd splits 1GB to 4KB directly: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. 2. split folio 3. zap the 0-4KB mapping. The flow of converting 0-2MB from private to shared in a 1GB folio in guest_memfd is: a. If guest_memfd splits 1GB to 2MB first: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. 2. split folio 3. zap the 0-2MB mapping. b. 
If guest_memfd splits 1GB to 4KB directly: 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. 2. split folio 3. zap the 0-2MB mapping. > So, to convert a 2MB range from private to shared, even though guest_memfd will > eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, > don't split the backend folio. > > Even if folio splitting may fail later, it just leaves split S-EPT mappings, > which matters little, especially after we support S-EPT promotion later. > > The benefit is that we don't need to worry even in the case when guest_memfd > splits a 1GB folio directly to 4KB granularity, potentially introducing the > over-zapping issue later. > > > > Yes, I believe splitting private mappings will be invoked to ensure > > > that the whole huge folio is not unmapped from KVM due to an error on > > > just a 4K page. Is that a problem? > > > > > > If splitting fails, the implementation can fall back to completely > > > zapping the folio range. > > > > > >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > > >> possible for splitting a huge folio to fail partially, without merging the huge > > >> folio back or further zapping?). > > > > The current stance is to allow splitting failures and not undo that > > splitting failure, so there's no merge back to fix the splitting > > failure. (Not set in stone yet, I think merging back could turn out to > > be a requirement from the mm side, which comes with more complexity in > > restructuring logic.) > > > > If it is not merged back on a split failure, the pages are still > > contiguous, the pages are guaranteed contiguous while they are owned by > > guest_memfd (even in the case of memory failure, if I get my way :P) so > > TDX can still trust that. > > > > I think you're worried that on split failure some folios are split, but > > the private EPTs for those are not split, but the memory for those > > unsplit private EPTs are still contiguous, and on split failure we quit > > early so guest_memfd still tracks the ranges as private. > > > > Privateness and contiguity are preserved so I think TDX should be good > > with that? The TD can still run. IIUC it is part of the plan that on > > splitting failure, conversion ioctl returns failure, guest is informed > > of conversion failure so that it can do whatever it should do to clean > > up. > As above, what about the idea of always requesting KVM to split S-EPT before > guest_memfd splits a folio? > > I think splitting S-EPT first is already required for all cases anyway, except > for the private-to-shared conversion of a full 2MB or 1GB range. > > Requesting S-EPT splitting when it's about to do folio splitting is better than > leaving huge mappings with split folios and having to patch things up here and > there, just to make the single case of private-to-shared conversion easier. > > > > Yes, splitting can fail partially, but guest_memfd will not make the > > > ranges available to host userspace and derivatives until: > > > 1) The complete range to be converted is split to 4K granularity. > > > 2) The complete range to be converted is zapped from KVM EPT mappings. > > > > > >> Not sure if there're other edge cases we're still missing. > > >> > > > > As you said, at the core TDX is concerned about contiguity of the memory > > ranges (start_addr, length) that it was given. 
Contiguity is guaranteed > > by guest_memfd while the folio is in guest_memfd ownership up to the > > boundaries of the original folio, before any restructuring. So if we're > > looking for edge cases, I think they would be around > > truncation. Can't think of anything now. > Potentially, folio migration, if we support it in the future. > > > (guest_memfd will also ensure truncation of anything less than the > > original size of the folio before restructuring is blocked, regardless > > of the current size of the folio) > > >> > Separately, KVM could also enforce the folio size/memory contiguity vs > > >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if > > >> > the check is deemed necessary, it still shouldn't be in TDX code, I > > >> > think. > > >> > > > >> > > Pro: Preventing zapping private memory until conversion is successful is good. > > >> > > > > >> > > However, could we achieve this benefit in other ways? For example, is it > > >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring > > >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap > > >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my > > >> > > observation and I always disable it). > > >> > > > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a > > >> > huge VM, multiplied by a large number of hosts, this is not a trivial > > >> > amount of memory. It's one of the key reasons why we are using HugeTLB > > >> > in guest_memfd in the first place, other than to be able to get high > > >> > level page table mappings. We want this in production. > > >> > > > >> > > Or pre-allocation for > > >> > > vmemmap_remap_alloc()? > > >> > > > > >> > > > >> > Will investigate if this is possible as mentioned above. Thanks for the > > >> > suggestion again! > > >> > > > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping > > >> > > private memory before conversion succeeds is still better than introducing the > > >> > > mess between folio size and mapping size. > > >> > > > > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller > > >> > >> > than the mapping while conversion is in progress? Does the order matter > > >> > >> > (split page table entries first vs split folios first)? > > >> > >> > > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is > > >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for > > >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we > > >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping > > >> > >> multiple folios in one page, probably comes down to which option provides "good > > >> > >> enough" performance without incurring too much complexity. > > >> > > >
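To make the split granularity in flows (a)/(b) above concrete, here is a small stand-alone model (illustrative only, not kernel code): within a 1GB private region, any 2MB block that overlaps the range being converted to shared gets 4KB S-EPT entries, while the rest can stay at (or be split to) 2MB.

/* Stand-alone model of the granularity in the flows above; illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define SZ_2M	(2UL << 20)
#define SZ_1G	(1UL << 30)

/* A 2MB block needs 4KB S-EPT entries iff it overlaps the converted range. */
static bool block_needs_4k(unsigned long conv_start, unsigned long conv_len,
			   unsigned long block_off)
{
	return block_off < conv_start + conv_len &&
	       conv_start < block_off + SZ_2M;
}

int main(void)
{
	/* As in the first flow: converting bytes 0-4KB of a 1GB folio. */
	unsigned long conv_start = 0, conv_len = 4096, off;
	unsigned long blocks_4k = 0, blocks_2m = 0;

	for (off = 0; off < SZ_1G; off += SZ_2M) {
		if (block_needs_4k(conv_start, conv_len, off))
			blocks_4k++;	/* split S-EPT to 4KB here */
		else
			blocks_2m++;	/* can stay at (or be split to) 2MB */
	}
	printf("2MB blocks needing 4KB S-EPT: %lu, blocks that can stay 2MB: %lu\n",
	       blocks_4k, blocks_2m);
	return 0;
}

For the 0-4KB conversion this reports 1 block at 4KB granularity and 511 blocks that can remain 2MB, matching flow (a); flow (b) differs only in that guest_memfd splits the folio itself all the way down to 4KB.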
Yan Zhao <yan.y.zhao@intel.com> writes: >> > >> > I think the central question I have among all the above is what TDX >> > >> > needs to actually care about (putting aside what KVM's folio size/memory >> > >> > contiguity vs mapping level rule for a while). >> > >> > >> > >> > I think TDX code can check what it cares about (if required to aid >> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, >> > >> > or does it actually care about memory contiguity and alignment? >> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. >> > > >> > > In this slightly unusual case, I think the guarantee needed here is >> > > that as long as a range is mapped into SEPT entries, guest_memfd >> > > ensures that the complete range stays private. >> > > >> > > i.e. I think it should be safe to rely on guest_memfd here, >> > > irrespective of the folio sizes: >> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. >> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT >> > > entries, guest_memfd will not let host userspace mappings to access >> > > guest private memory. >> > > >> > >> >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer >> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a >> > >> contiguous range larger than the page's folio range. >> > > >> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? >> > > Please let us know what you think of this too, why not parametrize using page and nr_pages? >> > >> >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). >> > >> If smaller folios are allowed, splitting private mapping is required there. >> > >> > It was discussed before that for memory failure handling, we will want >> > to split huge pages, we will get to it! The trouble is that guest_memfd >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory >> > from the ground up), so we'll still need to figure out it's okay to let >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB >> > doesn't actually deal with poisoned folios on freeing, so there's more >> > work to do on the HugeTLB side. >> > >> > This is a good point, although IIUC it is a separate issue. The need to >> > split private mappings on memory failure is not for confidentiality in >> > the TDX sense but to ensure that the guest doesn't use the failed >> > memory. In that case, contiguity is broken by the failed memory. The >> > folio is split, the private EPTs are split. The folio size should still >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so >> > guest_memfd calls TDX code to split the EPTs. >> >> Hmm, maybe the key is that we need to split S-EPT first before allowing >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the >> folio splitting. >> >> This is better than performing folio splitting while it's mapped as huge in >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in >> kvm_gmem_error_folio() would still trigger the over-zapping issue. >> Let's put memory failure handling aside for now since for now it zaps the entire huge page, so there's no impact on ordering between S-EPT and folio split. 
>> In the primary MMU, it follows the rule of unmapping a folio before splitting, >> truncating, or migrating a folio. For S-EPT, considering the cost of zapping >> more ranges than necessary, maybe a trade-off is to always split S-EPT before >> allowing backend folio splitting. >> The mapping size <= folio size rule (for KVM and the primary MMU) is there because it is the safe way to map memory into the guest because a folio implies contiguity. Folios are basically a core MM concept so it makes sense that the primary MMU relies on that. IIUC the core of the rule isn't folio sizes, it's memory contiguity. guest_memfd guarantees memory contiguity, and KVM should be able to rely on guest_memfd's guarantee, especially since guest_memfd is virtualiation-first, and KVM first. I think rules from the primary MMU are a good reference, but we shouldn't copy rules from the primary MMU, and KVM can rely on guest_memfd's guarantee of memory contiguity. >> Does this look good to you? > So, the flow of converting 0-4KB from private to shared in a 1GB folio in > guest_memfd is: > > a. If guest_memfd splits 1GB to 2MB first: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > 2. split folio > 3. zap the 0-4KB mapping. > > b. If guest_memfd splits 1GB to 4KB directly: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > 2. split folio > 3. zap the 0-4KB mapping. > > The flow of converting 0-2MB from private to shared in a 1GB folio in > guest_memfd is: > > a. If guest_memfd splits 1GB to 2MB first: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > 2. split folio > 3. zap the 0-2MB mapping. > > b. If guest_memfd splits 1GB to 4KB directly: > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > 2. split folio > 3. zap the 0-2MB mapping. > >> So, to convert a 2MB range from private to shared, even though guest_memfd will >> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, >> don't split the backend folio. >> >> Even if folio splitting may fail later, it just leaves split S-EPT mappings, >> which matters little, especially after we support S-EPT promotion later. >> I didn't consider leaving split S-EPT mappings since there is a performance impact. Let me think about this a little. Meanwhile, if the folios are split before the S-EPTs are split, as long as huge folios worth of memory are guaranteed contiguous by guest_memfd for KVM, what are the problems you see?
On Mon, Jan 12, 2026 at 11:56:01AM -0800, Ackerley Tng wrote: > Yan Zhao <yan.y.zhao@intel.com> writes: > > >> > >> > I think the central question I have among all the above is what TDX > >> > >> > needs to actually care about (putting aside what KVM's folio size/memory > >> > >> > contiguity vs mapping level rule for a while). > >> > >> > > >> > >> > I think TDX code can check what it cares about (if required to aid > >> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes, > >> > >> > or does it actually care about memory contiguity and alignment? > >> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity. > >> > > > >> > > In this slightly unusual case, I think the guarantee needed here is > >> > > that as long as a range is mapped into SEPT entries, guest_memfd > >> > > ensures that the complete range stays private. > >> > > > >> > > i.e. I think it should be safe to rely on guest_memfd here, > >> > > irrespective of the folio sizes: > >> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > >> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > >> > > entries, guest_memfd will not let host userspace mappings to access > >> > > guest private memory. > >> > > > >> > >> > >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > >> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > >> > >> contiguous range larger than the page's folio range. > >> > > > >> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > >> > > > > Please let us know what you think of this too, why not parametrize using > page and nr_pages? With (struct page*, unsigned long nr_pages) pair, IMHO, a warning when the entire range is not fully contained in a folio is still necessary. I expressed the concern here: https://lore.kernel.org/kvm/aWRfVOZpTUdYJ+7C@yzhao56-desk.sh.intel.com/ > >> > >> > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > >> > >> If smaller folios are allowed, splitting private mapping is required there. > >> > > >> > It was discussed before that for memory failure handling, we will want > >> > to split huge pages, we will get to it! The trouble is that guest_memfd > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > >> > from the ground up), so we'll still need to figure out it's okay to let > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > >> > doesn't actually deal with poisoned folios on freeing, so there's more > >> > work to do on the HugeTLB side. > >> > > >> > This is a good point, although IIUC it is a separate issue. The need to > >> > split private mappings on memory failure is not for confidentiality in > >> > the TDX sense but to ensure that the guest doesn't use the failed > >> > memory. In that case, contiguity is broken by the failed memory. The > >> > folio is split, the private EPTs are split. The folio size should still > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so > >> > guest_memfd calls TDX code to split the EPTs. > >> > >> Hmm, maybe the key is that we need to split S-EPT first before allowing > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > >> folio splitting. 
> >> > >> This is better than performing folio splitting while it's mapped as huge in > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > >> kvm_gmem_error_folio() would still trigger the over-zapping issue. > >> > > Let's put memory failure handling aside for now since for now it zaps > the entire huge page, so there's no impact on ordering between S-EPT and > folio split. Relying on guest_memfd's specific implementation is not a good thing. e.g., > > Given there's a version of guest_memfd allocating folios from buddy. > 1. KVM maps a 2MB folio in a 2MB mapping. > 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the > 2MB folio partially split. > 3. Memory failure occurs on one of the split folios. > 4. When splitting S-EPT fails, the over-zapping issue is still there. > > >> In the primary MMU, it follows the rule of unmapping a folio before splitting, > >> truncating, or migrating a folio. For S-EPT, considering the cost of zapping > >> more ranges than necessary, maybe a trade-off is to always split S-EPT before > >> allowing backend folio splitting. > >> > > The mapping size <= folio size rule (for KVM and the primary MMU) is > there because it is the safe way to map memory into the guest because a > folio implies contiguity. Folios are basically a core MM concept so it > makes sense that the primary MMU relies on that. So, why does the primary MMU need to unmap and check the refcount before folio splitting? > IIUC the core of the rule isn't folio sizes, it's memory > contiguity. guest_memfd guarantees memory contiguity, and KVM should be > able to rely on guest_memfd's guarantee, especially since guest_memfd is > virtualiation-first, and KVM first. > > I think rules from the primary MMU are a good reference, but we > shouldn't copy rules from the primary MMU, and KVM can rely on > guest_memfd's guarantee of memory contiguity. > > >> Does this look good to you? > > So, the flow of converting 0-4KB from private to shared in a 1GB folio in > > guest_memfd is: > > > > a. If guest_memfd splits 1GB to 2MB first: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > > 2. split folio > > 3. zap the 0-4KB mapping. > > > > b. If guest_memfd splits 1GB to 4KB directly: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > > 2. split folio > > 3. zap the 0-4KB mapping. > > > > The flow of converting 0-2MB from private to shared in a 1GB folio in > > guest_memfd is: > > > > a. If guest_memfd splits 1GB to 2MB first: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range. > > 2. split folio > > 3. zap the 0-2MB mapping. > > > > b. If guest_memfd splits 1GB to 4KB directly: > > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range. > > 2. split folio > > 3. zap the 0-2MB mapping. > > > >> So, to convert a 2MB range from private to shared, even though guest_memfd will > >> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails, > >> don't split the backend folio. > >> > >> Even if folio splitting may fail later, it just leaves split S-EPT mappings, > >> which matters little, especially after we support S-EPT promotion later. > >> > > I didn't consider leaving split S-EPT mappings since there is a > performance impact. Let me think about this a little.
> > Meanwhile, if the folios are split before the S-EPTs are split, as long > as huge folios worth of memory are guaranteed contiguous by guest_memfd > for KVM, what are the problems you see? Hmm. As in the reply at https://lore.kernel.org/kvm/aV4hAfPZXfKKB+7i@yzhao56-desk.sh.intel.com/, there are pros and cons. I'll defer to the maintainers' decision.
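On the (struct page *, unsigned long nr_pages) interface mentioned above, the containment warning Yan is describing could look roughly like the sketch below (illustrative only, not from the series), using the existing page_folio(), folio_page_idx() and folio_nr_pages() helpers from <linux/mm.h>:

/*
 * Sketch of the containment check: with a (page, nr_pages) interface, warn
 * if the range spills past the folio that the first page belongs to.
 */
static bool range_within_folio(struct page *page, unsigned long nr_pages)
{
	struct folio *folio = page_folio(page);

	return folio_page_idx(folio, page) + nr_pages <= folio_nr_pages(folio);
}

A wrapper such as tdh_phymem_page_wbinvd_hkid() could then do WARN_ON_ONCE(!range_within_folio(page, nr_pages)) before issuing the SEAMCALL, which keeps the debugging value of the current per-folio checks without hard-coding a folio-size assumption into the mapping path.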
On Mon, Jan 12, 2026 at 10:13 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > >> > >> > > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > >> > >> If smaller folios are allowed, splitting private mapping is required there. > > >> > > > >> > It was discussed before that for memory failure handling, we will want > > >> > to split huge pages, we will get to it! The trouble is that guest_memfd > > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > > >> > from the ground up), so we'll still need to figure out it's okay to let > > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > > >> > doesn't actually deal with poisoned folios on freeing, so there's more > > >> > work to do on the HugeTLB side. > > >> > > > >> > This is a good point, although IIUC it is a separate issue. The need to > > >> > split private mappings on memory failure is not for confidentiality in > > >> > the TDX sense but to ensure that the guest doesn't use the failed > > >> > memory. In that case, contiguity is broken by the failed memory. The > > >> > folio is split, the private EPTs are split. The folio size should still > > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so > > >> > guest_memfd calls TDX code to split the EPTs. > > >> > > >> Hmm, maybe the key is that we need to split S-EPT first before allowing > > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > > >> folio splitting. > > >> > > >> This is better than performing folio splitting while it's mapped as huge in > > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > > >> kvm_gmem_error_folio() would still trigger the over-zapping issue. > > >> > > > > Let's put memory failure handling aside for now since for now it zaps > > the entire huge page, so there's no impact on ordering between S-EPT and > > folio split. > Relying on guest_memfd's specific implemenation is not a good thing. e.g., > > Given there's a version of guest_memfd allocating folios from buddy. > 1. KVM maps a 2MB folio in a 2MB mappings. > 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the > 2MB folio partially split. > 3. Memory failure occurs on one of the split folio. > 4. When splitting S-EPT fails, the over-zapping issue is still there. > Why is overzapping an issue? Memory failure is supposed to be a rare occurrence and if there is no memory to handle the splitting, I don't see any other choice than overzapping. IIUC splitting the huge page range (in 1G -> 4K scenario) requires even more memory than just splitting cross-boundary leaves and has a higher chance of failing. i.e. Whether the folio is split first or the SEPTs, there is always a chance of failure leading to over-zapping. I don't see value in optimizing rare failures within rarer memory failure handling codepaths which are supposed to make best-effort decisions anyway.
On Tue, Jan 13, 2026 at 08:40:11AM -0800, Vishal Annapurve wrote: > On Mon, Jan 12, 2026 at 10:13 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > >> > >> > > > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > > >> > >> If smaller folios are allowed, splitting private mapping is required there. > > > >> > > > > >> > It was discussed before that for memory failure handling, we will want > > > >> > to split huge pages, we will get to it! The trouble is that guest_memfd > > > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory > > > >> > from the ground up), so we'll still need to figure out it's okay to let > > > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB > > > >> > doesn't actually deal with poisoned folios on freeing, so there's more > > > >> > work to do on the HugeTLB side. > > > >> > > > > >> > This is a good point, although IIUC it is a separate issue. The need to > > > >> > split private mappings on memory failure is not for confidentiality in > > > >> > the TDX sense but to ensure that the guest doesn't use the failed > > > >> > memory. In that case, contiguity is broken by the failed memory. The > > > >> > folio is split, the private EPTs are split. The folio size should still > > > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so > > > >> > guest_memfd calls TDX code to split the EPTs. > > > >> > > > >> Hmm, maybe the key is that we need to split S-EPT first before allowing > > > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the > > > >> folio splitting. > > > >> > > > >> This is better than performing folio splitting while it's mapped as huge in > > > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split > > > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in > > > >> kvm_gmem_error_folio() would still trigger the over-zapping issue. > > > >> > > > > > > Let's put memory failure handling aside for now since for now it zaps > > > the entire huge page, so there's no impact on ordering between S-EPT and > > > folio split. > > Relying on guest_memfd's specific implemenation is not a good thing. e.g., > > > > Given there's a version of guest_memfd allocating folios from buddy. > > 1. KVM maps a 2MB folio in a 2MB mappings. > > 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the > > 2MB folio partially split. > > 3. Memory failure occurs on one of the split folio. > > 4. When splitting S-EPT fails, the over-zapping issue is still there. > > > > Why is overzapping an issue? > Memory failure is supposed to be a rare occurrence and if there is no > memory to handle the splitting, I don't see any other choice than > overzapping. IIUC splitting the huge page range (in 1G -> 4K scenario) > requires even more memory than just splitting cross-boundary leaves > and has a higher chance of failing. > > i.e. Whether the folio is split first or the SEPTs, there is always a > chance of failure leading to over-zapping. I don't see value in Hmm. If the split occurs after memory failure, yes, splitting S-EPT first also has chance of over-zapping. But if the split occurs during private-to-shared conversion for the non-conversion range, when memory failure later occurs on the split folio, over-zapping can be avoided. > optimizing rare failures within rarer memory failure handling > codepaths which are supposed to make best-effort decisions anyway. 
I agree it's of low priority. Just not sure if there're edge cases besides this one.
On Fri, Jan 9, 2026 at 8:12 AM Vishal Annapurve <vannapurve@google.com> wrote: > > > > > > > > > > > I think the central question I have among all the above is what TDX > > > needs to actually care about (putting aside what KVM's folio size/memory > > > contiguity vs mapping level rule for a while). > > > > > > I think TDX code can check what it cares about (if required to aid > > > debugging, as Dave suggested). Does TDX actually care about folio sizes, > > > or does it actually care about memory contiguity and alignment? > > TDX cares about memory contiguity. A single folio ensures memory contiguity. > > In this slightly unusual case, I think the guarantee needed here is > that as long as a range is mapped into SEPT entries, guest_memfd > ensures that the complete range stays private. > > i.e. I think it should be safe to rely on guest_memfd here, > irrespective of the folio sizes: > 1) KVM TDX stack should be able to reclaim the complete range when unmapping. > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT > entries, guest_memfd will not let host userspace mappings to access > guest private memory. > > > > > Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer > > reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a > > contiguous range larger than the page's folio range. > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair? > > > > > Additionally, we don't split private mappings in kvm_gmem_error_folio(). > > If smaller folios are allowed, splitting private mapping is required there. > > Yes, I believe splitting private mappings will be invoked to ensure > that the whole huge folio is not unmapped from KVM due to an error on > just a 4K page. Is that a problem? > > If splitting fails, the implementation can fall back to completely > zapping the folio range. I forgot to mention that this is a future improvement that will introduce hugetlb memory failure handling and is not covered by Ackerley's current set of patches. > > > (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it > > possible for splitting a huge folio to fail partially, without merging the huge > > folio back or further zapping?). > > Yes, splitting can fail partially, but guest_memfd will not make the > ranges available to host userspace and derivatives until: > 1) The complete range to be converted is split to 4K granularity. > 2) The complete range to be converted is zapped from KVM EPT mappings. > > > Not sure if there're other edge cases we're still missing.
On Tue, Jan 06, 2026, Yan Zhao wrote: > This is v3 of the TDX huge page series. The full stack is available at [4]. Nope, that's different code. > [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3 E.g. this branch doesn't have [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote: > On Tue, Jan 06, 2026, Yan Zhao wrote: > > This is v3 of the TDX huge page series. The full stack is available at [4]. > > Nope, that's different code. I double-checked. It's the correct code. See https://github.com/intel-staging/tdx/commits/huge_page_v3. However, I did make a mistake with the section name, which says "=== Beginning of section "TDX huge page v2" ===", but it's actually v3! [facepalm] > > [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3 > > E.g. this branch doesn't have > > [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion It's here: https://github.com/intel-staging/tdx/commit/d0f7465a7f56f10b5bcc9593351c32b37be073a4
On Fri, Jan 16, 2026, Yan Zhao wrote: > On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote: > > On Tue, Jan 06, 2026, Yan Zhao wrote: > > > This is v3 of the TDX huge page series. The full stack is available at [4]. > > > > Nope, that's different code. > I double-checked. It's the correct code. > See https://github.com/intel-staging/tdx/commits/huge_page_v3. Argh, and I even double-checked before complaining, but apparently I screwed up twice. On triple-checking, I do see the same code as the patches. *sigh* Sorry.
On Fri, Jan 16, 2026 at 06:46:09AM -0800, Sean Christopherson wrote: > On Fri, Jan 16, 2026, Yan Zhao wrote: > > On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote: > > > On Tue, Jan 06, 2026, Yan Zhao wrote: > > > > This is v3 of the TDX huge page series. The full stack is available at [4]. > > > > > > Nope, that's different code. > > I double-checked. It's the correct code. > > See https://github.com/intel-staging/tdx/commits/huge_page_v3. > > Argh, and I even double-checked before complaining, but apparently I screwed up > twice. On triple-checking, I do see the same code as the patches. *sigh* > > Sorry. No problem. Thanks for downloading the code and reviewing it!