From: Sean Christopherson
Date: Thu, 16 Oct 2025 17:32:33 -0700
Subject: [PATCH v3 15/25] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries
Message-ID: <20251017003244.186495-16-seanjc@google.com>
In-Reply-To: <20251017003244.186495-1-seanjc@google.com>
References: <20251017003244.186495-1-seanjc@google.com>
To: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao, Huacai Chen, Madhavan Srinivasan, Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou, Christian Borntraeger, Janosch Frank, Claudio Imbrenda, Sean Christopherson, Paolo Bonzini, "Kirill A. Shutemov"
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, kvm@vger.kernel.org, loongarch@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, x86@kernel.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao, Vishal Annapurve, Rick Edgecombe, Ackerley Tng, Binbin Wu

When populating the initial memory image for a TDX guest, ADD pages to the TD as part of establishing the mappings in the mirror EPT, as opposed to creating the mappings and then doing ADD after the fact.

Doing ADD in the S-EPT callbacks eliminates the need to track "premapped" pages, as the mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails, KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT). Eliminating the hole where the M-EPT can have a mapping that doesn't exist in the S-EPT in turn obviates the need to handle errors that are unique to encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).

Keeping the M-EPT and S-EPT synchronized also eliminates the need to check for unconsumed "premap" entries during tdx_td_finalize(), as there simply can't be any such entries. Dropping that check in particular reduces the overall cognitive load, as the management of nr_premapped with respect to removal of S-EPT entries is _very_ subtle. E.g. successful removal of an S-EPT entry after it completed ADD doesn't adjust nr_premapped, but it's not clear why that's "ok" but having half-baked entries is not (it's not truly "ok" in that removing pages from the image will likely prevent the guest from booting, but from KVM's perspective it's "ok").

Doing ADD in the S-EPT path requires passing an argument via a scratch field, but the current approach of tracking the number of "premapped" pages effectively does the same. And the "premapped" counter is much more dangerous, as it doesn't have a singular lock to protect its usage, since nr_premapped can be modified as soon as mmu_lock is dropped, at least in theory. I.e. nr_premapped is guarded by slots_lock, but only for "happy" paths.
Note, this approach was used/tried at various points in TDX development, but was ultimately discarded due to a desire to avoid stashing temporary state in kvm_tdx. But as above, KVM ended up with such state anyways, and fully committing to using temporary state provides better access rules (100% guarded by slots_lock), and makes several edge cases flat out impossible.

Note #2, continue to extend the measurement outside of mmu_lock, as it's a slow operation (typically 16 SEAMCALLs per page whose data is included in the measurement), and doesn't *need* to be done under mmu_lock, e.g. for consistency purposes. However, MR.EXTEND isn't _that_ slow, e.g. ~1ms latency to measure a full page, so if it needs to be done under mmu_lock in the future, e.g. because KVM gains a flow that can remove S-EPT entries during KVM_TDX_INIT_MEM_REGION, then extending the measurement can also be moved into the S-EPT mapping path (again, only if absolutely necessary).

P.S. _If_ MR.EXTEND is moved into the S-EPT path, take care not to return an error up the stack if TDH_MR_EXTEND fails, as removing the M-EPT entry but not the S-EPT entry would result in inconsistent state!

Reviewed-by: Rick Edgecombe
Signed-off-by: Sean Christopherson
Reviewed-by: Kai Huang
---
 arch/x86/kvm/vmx/tdx.c | 114 ++++++++++++++---------------------------
 arch/x86/kvm/vmx/tdx.h |   8 ++-
 2 files changed, 45 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f4bab75d3ffb..76030461c8f7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1583,6 +1583,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
+static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			    kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
+	gpa_t gpa = gfn_to_gpa(gfn);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
+	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
+		return -EIO;
+
+	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
+			       kvm_tdx->page_add_src, &entry, &level_state);
+	if (unlikely(tdx_operand_busy(err)))
+		return -EBUSY;
+
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
+		return -EIO;
+	}
+
+	return 0;
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 			    kvm_pfn_t pfn)
 {
@@ -1624,19 +1650,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	/*
 	 * If the TD isn't finalized/runnable, then userspace is initializing
-	 * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of
-	 * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
-	 * KVM_TDX_FINALIZE_VM checks the counter to ensure all mapped pages
-	 * have been added to the image, to prevent running the TD with a
-	 * valid mapping in the mirror EPT, but not in the S-EPT.
+	 * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
 	 */
-	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
-		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-			return -EIO;
-
-		atomic64_inc(&kvm_tdx->nr_premapped);
-		return 0;
-	}
+	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
+		return tdx_mem_page_add(kvm, gfn, level, pfn);
 
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
@@ -1662,39 +1679,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-/*
- * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
- * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
- * successfully.
- *
- * Since tdh_mem_sept_add() must have been invoked successfully before a
- * non-leaf entry present in the mirrored page table, the SEPT ZAP related
- * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
- * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
- * SEPT.
- *
- * Further check if the returned entry from SEPT walking is with RWX permissions
- * to filter out anything unexpected.
- *
- * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
- * level_state returned from a SEAMCALL error is the same as that passed into
- * the SEAMCALL.
- */
-static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
-					     u64 entry, int level)
-{
-	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
-		return false;
-
-	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
-		return false;
-
-	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
-		return false;
-
-	return true;
-}
-
 static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, struct page *page)
 {
@@ -1714,12 +1698,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 		tdx_no_vcpus_enter_stop(kvm);
 	}
-	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
-			return -EIO;
-
-		return 0;
-	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
@@ -2825,12 +2803,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 
 	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
 		return -EINVAL;
-	/*
-	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
-	 * TDH.MEM.PAGE.ADD().
-	 */
-	if (atomic64_read(&kvm_tdx->nr_premapped))
-		return -EINVAL;
 
 	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
 	if (tdx_operand_busy(cmd->hw_error))
@@ -3127,6 +3099,9 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	struct page *src_page;
 	int ret, i;
 
+	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
+		return -EIO;
+
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
 	 * source page has been swapped out or unmapped in primary memory.
@@ -3137,22 +3112,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
+	kvm_tdx->page_add_src = src_page;
 	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
-	if (ret < 0)
-		goto out;
+	kvm_tdx->page_add_src = NULL;
 
-	ret = 0;
-	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
-			       src_page, &entry, &level_state);
-	if (err) {
-		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
-		goto out;
-	}
+	put_page(src_page);
 
-	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
-
-	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
-		goto out;
+	if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
+		return ret;
 
 	/*
 	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
@@ -3165,14 +3132,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
 		if (KVM_BUG_ON(err, kvm)) {
 			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
-			ret = -EIO;
-			goto out;
+			return -EIO;
 		}
 	}
 
-out:
-	put_page(src_page);
-	return ret;
+	return 0;
 }
 
 static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..1b00adbbaf77 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -36,8 +36,12 @@ struct kvm_tdx {
 
 	struct tdx_td td;
 
-	/* For KVM_TDX_INIT_MEM_REGION. */
-	atomic64_t nr_premapped;
+	/*
+	 * Scratch pointer used to pass the source page to tdx_mem_page_add.
+	 * Protected by slots_lock, and non-NULL only when mapping a private
+	 * pfn via tdx_gmem_post_populate().
+	 */
+	struct page *page_add_src;
 
 	/*
 	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
-- 
2.51.0.858.gf9c4a03a3a-goog
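
For reference, a condensed sketch (not part of the applied patch; the wrapper name is invented for illustration) of how the populate path hands the source page to the S-EPT callback through the slots_lock-protected scratch field after this change. It simply restates the tdx_gmem_post_populate() flow from the diff above, with the MR.EXTEND loop and detailed error handling elided.

    /*
     * Illustrative only; condenses the tdx_gmem_post_populate() flow from
     * the diff above.  Assumes the caller holds kvm->slots_lock and has
     * already pinned @src_page.
     */
    static int tdx_populate_page_sketch(struct kvm_vcpu *vcpu, gfn_t gfn,
                                        kvm_pfn_t pfn, struct page *src_page)
    {
            struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
            int ret;

            /* Stash the source page for tdx_mem_page_add() to consume. */
            kvm_tdx->page_add_src = src_page;

            /*
             * Mapping the mirror EPT entry invokes the
             * tdx_sept_set_private_spte() callback, which does
             * TDH.MEM.PAGE.ADD while the TD isn't yet RUNNABLE.
             */
            ret = kvm_tdp_mmu_map_private_pfn(vcpu, gfn, pfn);

            /* Clear the scratch pointer before slots_lock is dropped. */
            kvm_tdx->page_add_src = NULL;

            put_page(src_page);
            return ret;
    }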