Reply-To: Sean Christopherson
Date: Wed, 28 Jan 2026 17:15:16 -0800
In-Reply-To: <20260129011517.3545883-1-seanjc@google.com>
References: <20260129011517.3545883-1-seanjc@google.com>
X-Mailer: git-send-email 2.53.0.rc1.217.geba53bf80e-goog
Message-ID: <20260129011517.3545883-45-seanjc@google.com>
Subject: [RFC PATCH v5 44/45] KVM: x86/mmu: Add support for splitting S-EPT hugepages on conversion
From: Sean Christopherson
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86@kernel.org,
	Kiryl Shutsemau, Sean Christopherson, Paolo Bonzini
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	Kai Huang, Rick Edgecombe, Yan Zhao, Vishal Annapurve, Ackerley Tng,
	Sagi Shahar, Binbin Wu, Xiaoyao Li, Isaku Yamahata

Add support for splitting S-EPT hugepages in preparation for converting a
subset of a hugepage to be shared, as KVM must precisely zap/remove S-EPT
entries to avoid clobbering guest memory (the lifetime of guest private
memory is tied to the S-EPT).  I.e. KVM needs to first split a hugepage so
that only the to-be-converted small pages can be zapped.

To avoid unnecessary work, e.g. if only the tail/end page of a massive
region isn't aligned to the conversion, explicitly detect unaligned head
and tail pages relative to the max page size supported by KVM, i.e.
head/tail pages that will undergo partial conversion.

To support splitting an S-EPT hugepage without a vCPU, add a per-VM PAMT
cache, along with a mutex to guard the cache.  Using a mutex, e.g. versus
a spinlock, is important as it allows KVM to allocate memory *without*
dropping the lock, i.e. so that the PAMT cache can be topped up as needed
without needing to juggle arch.tdp_mmu_external_cache_lock.

Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h |  8 +++-
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 72 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c          | 34 +++++++++++++---
 arch/x86/kvm/vmx/tdx.h          |  2 +
 5 files changed, 107 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 385f1cf32d70..54dea90a53dc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1563,6 +1563,12 @@ struct kvm_arch {
	 * the code to do so.
	 */
	spinlock_t tdp_mmu_pages_lock;
+
+	/*
+	 * Protect the per-VM cache of pre-allocated pages used to populate the
+	 * Dynamic PAMT when splitting S-EPT huge pages without a vCPU.
+	 */
+	struct mutex tdp_mmu_external_cache_lock;
 #endif /* CONFIG_X86_64 */
 
	/*
@@ -1861,7 +1867,7 @@ struct kvm_x86_ops {
			      u64 new_spte, enum pg_level level);
	void (*reclaim_external_sp)(struct kvm *kvm, gfn_t gfn,
				    struct kvm_mmu_page *sp);
-	int (*topup_external_cache)(struct kvm_vcpu *vcpu, int min);
+	int (*topup_external_cache)(struct kvm *kvm, struct kvm_vcpu *vcpu, int min);
 
 
	bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c2765bfc8492..62bf6bec2df2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -606,7 +606,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
		if (r)
			return r;
 
-		r = kvm_x86_call(topup_external_cache)(vcpu, PT64_ROOT_MAX_LEVEL);
+		r = kvm_x86_call(topup_external_cache)(vcpu->kvm, vcpu, PT64_ROOT_MAX_LEVEL);
		if (r)
			return r;
	}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c46ebdacdb50..3181406c5e0b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1447,7 +1447,8 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
	return spte_set;
 }
 
-static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct tdp_iter *iter)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
+						       struct tdp_iter *iter)
 {
	struct kvm_mmu_page *sp;
 
@@ -1464,7 +1465,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct tdp_iter *iter)
		if (!sp->external_spt)
			goto err_external_spt;
 
-		if (kvm_x86_call(topup_external_cache)(kvm_get_running_vcpu(), 1))
+		if (kvm_x86_call(topup_external_cache)(kvm, kvm_get_running_vcpu(), 1))
			goto err_external_split;
	}
 
@@ -1556,7 +1557,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
		else
			write_unlock(&kvm->mmu_lock);
 
-		sp = tdp_mmu_alloc_sp_for_split(&iter);
+		sp = tdp_mmu_alloc_sp_for_split(kvm, &iter);
 
		if (shared)
			read_lock(&kvm->mmu_lock);
@@ -1631,9 +1632,74 @@ int kvm_tdp_mmu_split_huge_pages(struct kvm_vcpu *vcpu, gfn_t start, gfn_t end,
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_tdp_mmu_split_huge_pages);
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_CONVERT
+static int __tdp_mmu_split_mirror_huge_pages(struct kvm *kvm,
+					     struct kvm_mmu_page *root,
+					     gfn_t gfn, int target_level)
+{
+	gfn_t end = gfn + KVM_PAGES_PER_HPAGE(target_level + 1);
+
+	return tdp_mmu_split_huge_pages_root(kvm, root, gfn, end, target_level, false);
+}
+
+static int tdp_mmu_split_mirror_huge_pages(struct kvm *kvm,
+					   struct kvm_mmu_page *root,
+					   gfn_t start, gfn_t end, int level)
+{
+
+	gfn_t head = gfn_round_for_level(start, level + 1);
+	gfn_t tail = gfn_round_for_level(end, level + 1);
+	int r;
+
+	if (head != start) {
+		r = __tdp_mmu_split_mirror_huge_pages(kvm, root, head, level);
+		if (r)
+			return r;
+	}
+
+	if (tail != end && (head != tail || head == start)) {
+		r = __tdp_mmu_split_mirror_huge_pages(kvm, root, tail, level);
+		if (r)
+			return r;
+	}
+
+	return 0;
+}
+
 int kvm_arch_gmem_convert(struct kvm *kvm, gfn_t start, gfn_t end,
			  bool to_private)
 {
+	struct kvm_mmu_page *root;
+	int r;
+
+	/*
+	 * When converting from private=>shared, KVM must first split potential
+	 * hugepages, as KVM mustn't overzap private mappings for TDX guests,
+	 * i.e. must zap _exactly_ [start, end).  Split potential hugepages at
+	 * the head and tail of the to-be-converted (and thus zapped) range so
+	 * that KVM doesn't overzap due to dropping a hugepage that doesn't
+	 * fall wholly inside the range.
+	 */
+	if (to_private || !kvm_has_mirrored_tdp(kvm))
+		return 0;
+
+	/*
+	 * Acquire the external cache lock, a.k.a. the Dynamic PAMT lock, to
+	 * protect the per-VM cache of pre-allocated pages used to populate the
+	 * Dynamic PAMT when splitting S-EPT huge pages.
+	 */
+	guard(mutex)(&kvm->arch.tdp_mmu_external_cache_lock);
+
+	guard(write_lock)(&kvm->mmu_lock);
+
+	/*
+	 * TODO: Also split from PG_LEVEL_1G => PG_LEVEL_2M when KVM supports
+	 * 1GiB S-EPT pages.
+	 */
+	__for_each_tdp_mmu_root_yield_safe(kvm, root, 0, KVM_MIRROR_ROOTS) {
+		r = tdp_mmu_split_mirror_huge_pages(kvm, root, start, end, PG_LEVEL_4K);
+		if (r)
+			return r;
+	}
	return 0;
 }
 #endif /* CONFIG_HAVE_KVM_ARCH_GMEM_CONVERT */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 098954f5e07c..774d395e5c73 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -607,6 +607,8 @@ void tdx_vm_destroy(struct kvm *kvm)
 {
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
+	tdx_free_pamt_cache(&kvm_tdx->pamt_cache);
+
	tdx_reclaim_td_control_pages(kvm);
 
	kvm_tdx->state = TD_STATE_UNINITIALIZED;
@@ -629,6 +631,8 @@ int tdx_vm_init(struct kvm *kvm)
 {
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
+	tdx_init_pamt_cache(&kvm_tdx->pamt_cache);
+
	kvm->arch.has_protected_state = true;
	/*
	 * TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
@@ -1621,15 +1625,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
-static int tdx_topup_external_pamt_cache(struct kvm_vcpu *vcpu, int min)
+static struct tdx_pamt_cache *tdx_get_pamt_cache(struct kvm *kvm,
+						 struct kvm_vcpu *vcpu)
 {
+	if (KVM_BUG_ON(vcpu && vcpu->kvm != kvm, kvm))
+		return NULL;
+
+	if (vcpu)
+		return &to_tdx(vcpu)->pamt_cache;
+
+	lockdep_assert_held(&kvm->arch.tdp_mmu_external_cache_lock);
+	return &to_kvm_tdx(kvm)->pamt_cache;
+}
+
+static int tdx_topup_external_pamt_cache(struct kvm *kvm,
+					 struct kvm_vcpu *vcpu, int min)
+{
+	struct tdx_pamt_cache *pamt_cache;
+
	if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
		return 0;
 
-	if (WARN_ON_ONCE(!vcpu))
+	pamt_cache = tdx_get_pamt_cache(kvm, vcpu);
+	if (!pamt_cache)
		return -EIO;
 
-	return tdx_topup_pamt_cache(&to_tdx(vcpu)->pamt_cache, min);
+	return tdx_topup_pamt_cache(pamt_cache, min);
 }
 
 static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
@@ -1792,8 +1813,8 @@ static struct page *tdx_spte_to_external_spt(struct kvm *kvm, gfn_t gfn,
 static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
					u64 new_spte, enum pg_level level)
 {
-	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_pamt_cache *pamt_cache;
	gpa_t gpa = gfn_to_gpa(gfn);
	u64 err, entry, level_state;
	struct page *external_spt;
@@ -1804,7 +1825,8 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
	if (!external_spt)
		return -EIO;
 
-	if (KVM_BUG_ON(!vcpu || vcpu->kvm != kvm, kvm))
+	pamt_cache = tdx_get_pamt_cache(kvm, kvm_get_running_vcpu());
+	if (!pamt_cache)
		return -EIO;
 
	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
@@ -1816,7 +1838,7 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 
	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
			      level, spte_to_pfn(old_spte), external_spt,
-			      &to_tdx(vcpu)->pamt_cache, &entry, &level_state);
+			      pamt_cache, &entry, &level_state);
	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm))
		return -EIO;
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index f444fc84d93b..57d7e70ffe7d 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -48,6 +48,8 @@ struct kvm_tdx {
	 * Set/unset is protected with kvm->mmu_lock.
	 */
	bool wait_for_sept_zap;
+
+	struct tdx_pamt_cache pamt_cache;
 };
 
 /* TDX module vCPU states */
-- 
2.53.0.rc1.217.geba53bf80e-goog
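
For illustration only (not part of the patch): a minimal userspace sketch of
the head/tail rounding that tdp_mmu_split_mirror_huge_pages() performs to
decide which hugepages straddle the conversion range and thus need splitting.
It assumes 4KiB base pages and 2MiB hugepages; the helper name
round_down_to_hpage() and the example gfns are invented for this sketch.

#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_2M_HPAGE 512ULL	/* 2MiB / 4KiB, in gfn units */

/* Round a gfn down to the first gfn of the 2MiB hugepage containing it. */
static uint64_t round_down_to_hpage(uint64_t gfn)
{
	return gfn & ~(PAGES_PER_2M_HPAGE - 1);
}

int main(void)
{
	/* Convert gfns [0x300, 0x500): both ends land inside 2MiB hugepages. */
	uint64_t start = 0x300, end = 0x500;
	uint64_t head = round_down_to_hpage(start);
	uint64_t tail = round_down_to_hpage(end);

	/* Unaligned head => split the hugepage containing 'start'. */
	if (head != start)
		printf("split hugepage at gfn 0x%llx\n", (unsigned long long)head);

	/* Unaligned tail => split, unless it's the hugepage already split above. */
	if (tail != end && (head != tail || head == start))
		printf("split hugepage at gfn 0x%llx\n", (unsigned long long)tail);

	return 0;
}

With these inputs the sketch reports splits at gfns 0x200 and 0x400, i.e. only
the two hugepages that straddle the conversion boundaries are split; hugepages
that fall wholly inside [start, end) are left intact and simply zapped.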