From nobody Mon Jun 8 05:25:48 2026
Reply-To: Sean Christopherson
Date: Fri, 5 Jun 2026 10:46:10 -0700
In-Reply-To: <20260605174611.2222504-1-seanjc@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20260605174611.2222504-1-seanjc@google.com>
X-Mailer: git-send-email 2.54.0.1032.g2f8565e1d1-goog
Message-ID: <20260605174611.2222504-2-seanjc@google.com>
Subject: [PATCH 1/2] KVM: x86/mmu: Recursively zap orphaned nested TDP shadow pages on emulated writes
From: Sean Christopherson
To: Sean Christopherson, Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed, Jim Mattson, James Houghton
Content-Type: text/plain; charset="utf-8"

Recursively zap orphaned nested TDP shadow pages when emulating a guest
write to a shadowed page table, regardless of whether or not the associated
(parent) shadow page will be zapped, e.g. due to detected write-flooding.
This plugs a hole where KVM fails to reclaim defunct, unsync shadow pages
for select L1 hypervisor patterns.

Commit 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when
zapping last/only parent") modified KVM to recursively zap synchronized
shadow pages (KVM already recursively zaps unsync children) when a child is
orphaned. But that fix effectively applied the logic only to
kvm_mmu_page_unlink_children(), i.e. KVM only performs the recursive zap
when it is already zapping a parent SP and processing its children.

If L1 zaps SPTEs bottom-up (4KiB => 2MiB => ...), as KVM's TDP MMU does
with CONFIG_KVM_PROVE_MMU=n since commit 8ca983631f3c ("KVM: x86/mmu: Zap
invalidated TDP MMU roots at 4KiB granularity"), then KVM (as L0) will leak
upwards of 4 shadow pages per GiB of L2 guest memory.

Over hundreds or thousands of L2 boots, if the VM is "lucky" enough to
escape write-flooding detection, i.e. not trigger reclaim of the orphaned
shadow pages by dumb luck, then it's possible to end up with tens or even
hundreds of thousands of unsync shadow pages and associated rmap entries.
Polluting the hash table and rmaps with a horde of stale entries can
eventually degrade L2 guest boot time by an order of magnitude, especially
if there is any antagonistic activity in the host, i.e. anything that will
contend for mmu_lock and/or needs to walk rmaps.

With "top"-down zapping, where "top" is 1GiB or above, L0 KVM is
effectively limited to leaking 4 shadow pages per 256GiB of memory, as
KVM's write-flooding detection will kick in on the third write to an L1 TDP
PUD, and thus recursively zap the entire 256GiB range of the parent PGD.
I.e. even though L1 KVM still recursively zaps 2MiB => 4KiB SPTEs when
zapping each 1GiB SPTE, KVM only gets through two of the 1GiB SPTEs before
dropping everything.

E.g. hacking tracing into L0 KVM's kvm_mmu_track_write(), the top-down
zapping of L1's TDP MMU for an L2 with 16GiB of memory leads to:

  gpa = 107407000, old = 800000010741bd07, new = 8000000000000000, level = 3, flood = 0
  gpa = 10741b000, old = 8000000112fb2d07, new = 80000000000001a0, level = 2, flood = 0
  gpa = 10741b008, old = 800000012509cd07, new = 80000000000001a0, level = 2, flood = 1
  gpa = 10741b010, old = 80000001114b9d07, new = 80000000000001a0, level = 2, flood = 2
  gpa = 107407008, old = 8000000112fb5d07, new = 8000000000000000, level = 3, flood = 1
  gpa = 112fb5298, old = 8000000106f43d07, new = 80000000000001a0, level = 2, flood = 0
  gpa = 112fb52a0, old = 8000000106f4dd07, new = 80000000000001a0, level = 2, flood = 1
  gpa = 112fb5ea0, old = 8000000120490d07, new = 80000000000001a0, level = 2, flood = 2
  gpa = 107407010, old = 8000000106df2d07, new = 8000000000000000, level = 3, flood = 2
  gpa = 107410000, old = 8000000107408d07, new = 8000000000000000, level = 5, flood = 0
  gpa = 107408000, old = 8000000107407d07, new = 80000000000001a0, level = 4, flood = 0

Contrast that with a bottom-up zap, which
effectively allows all 2MiB SPTEs in L1 to leak their children.

  gpa = 167939000, old = 800000011c8f4d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 167939020, old = 8000000104407d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 167939028, old = 800000011ed20d07, new = 8000000000000000, level = 2, flood = 2
  gpa = 118c70bb0, old = 8000000167ab9d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118c70bb8, old = 8000000163913d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118c70de8, old = 800000011cc9dd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 160be7fb0, old = 800000011d322d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 160be7fb8, old = 8000000126b1bd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 1634ab000, old = 800000010e984d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1634ab008, old = 800000016879fd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1634ab010, old = 800000016879ed07, new = 8000000000000000, level = 2, flood = 2
  gpa = 11e3f1e48, old = 8000000168a33d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 11e3f1e50, old = 80000001664dcd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1167eacb8, old = 8000000166544d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1167eacc0, old = 800000015c16bd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1689e89b8, old = 800000015f296d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1689e89c0, old = 8000000167ca8d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 107b35eb8, old = 8000000161e71d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 107b35ec0, old = 8000000118cf3d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118cf2d48, old = 8000000118cf1d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118cf2d50, old = 8000000118cf0d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118dcb770, old = 8000000118dcad07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118dcb778, old = 8000000118dc9d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118dc87e8, old = 8000000126997d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118dc87f0, old = 8000000126996d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 126995148, old = 8000000126994d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 126995150, old = 8000000103477d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1034764c8, old = 8000000103475d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1034764d0, old = 8000000103474d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 10ea4b788, old = 800000010ea4ad07, new = 8000000000000000, level = 2, flood = 0
  gpa = 10ea4b790, old = 800000010ea49d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 10ea48928, old = 800000011a5bfd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 10ea48930, old = 800000011a5bed07, new = 8000000000000000, level = 2, flood = 1
  gpa = 11a5bd0d8, old = 800000011a5bcd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 11a5bd0e0, old = 800000011d323d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 122ce2b40, old = 800000011fe0bd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 122ce2b48, old = 800000010e985d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 122ce2b50, old = 8000000161c9dd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 16864c000, old = 8000000167939d07, new = 8000000000000000, level = 3, flood = 0
  gpa = 16864c008, old = 8000000118c70d07, new = 8000000000000000, level = 3, flood = 1
  gpa = 16864c010, old = 80000001688a6d07, new = 8000000000000000, level = 3, flood = 2
  gpa = 11c8f7000, old = 80000001608a7d07, new = 8000000000000000, level = 5, flood = 0
  gpa = 1608a7000, old = 800000016864cd07, new = 80000000000001a0, level = 4, flood = 0

Note, in the shadow MMU, "level" describes the level a shadow page "points"
at, not the level of its associated SPTE. I.e. when write-flooding of 1GiB
PUD entries is detected, KVM recursively zaps shadow pages covering 256GiB
worth of memory. And as shown above, KVM's write-flooding detection
operates at all levels, so a single PMD (in L1) can effectively only leak
two unsync children (4KiB shadow pages) before it gets recursively zapped.
As a result, for the top-down zap, L0 KVM will leak at most 4 unsync shadow
pages per 256GiB of L2 memory.

The top-down zap also makes it more likely that L1 will self-heal (to some
extent), as any shadow pages that are "rediscovered" by future runs of L2
can get reclaimed by a recursive zap, whereas bottom-up zapping orphans
shadow pages over and over.

Note, in theory, there is some risk of over-zapping, e.g. due to zapping a
large branch of the paging tree that L1 is only temporarily removing. In
practice, the usage patterns of hypervisors are highly unlikely to trigger
false positives. E.g. temporarily changing paging protections is typically
done at the leaf, not on a non-leaf entry. And if the L1 hypervisor is
updating large swaths of PTEs, e.g. to (temporarily?) remove chunks of
memory from L2, then L0 KVM's write-flooding detection will kick in, and
the children would be zapped anyways.
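
For readers unfamiliar with the flood counter, the "third write" behaviour
described above can be modelled in a few lines of C. This is only a toy
sketch, not KVM's implementation: struct toy_shadow_page, toy_track_write()
and TOY_FLOOD_THRESHOLD are hypothetical names, and the threshold of 3 is an
assumption taken from the "third write" wording in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the threshold of 3 comes from "the third write" in the text. */
#define TOY_FLOOD_THRESHOLD 3

/* Hypothetical stand-in for a shadow page; not KVM's struct kvm_mmu_page. */
struct toy_shadow_page {
	int flood_count;	/* emulated writes recorded so far */
};

/*
 * Record one emulated write to the guest page table shadowed by @sp.
 * Returns true once the page has seen TOY_FLOOD_THRESHOLD writes, i.e.
 * when this model would zap the page and, recursively, its children.
 */
static bool toy_track_write(struct toy_shadow_page *sp)
{
	assert(sp != NULL);
	return ++sp->flood_count >= TOY_FLOOD_THRESHOLD;
}
```

Calling toy_track_write() three times on the same zero-initialized page
returns false, false, then true. In this model, a parent that is written
only once or twice is never flagged, which is the window the bottom-up zap
falls into.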

Fixes: 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent")
Cc: Yosry Ahmed
Cc: Jim Mattson
Cc: James Houghton
Signed-off-by: Sean Christopherson
Reviewed-by: Jim Mattson
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b8f2edf2cfeb..9368a71336fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6376,7 +6376,7 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 
 		while (npte--) {
 			entry = *spte;
-			mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL);
+			mmu_page_zap_pte(vcpu->kvm, sp, spte, &invalid_list);
 			if (gentry && sp->role.level != PG_LEVEL_4K)
 				++vcpu->kvm->stat.mmu_pde_zapped;
 			if (is_shadow_present_pte(entry))
-- 
2.54.0.1032.g2f8565e1d1-goog

From nobody Mon Jun 8 05:25:48 2026
Reply-To: Sean Christopherson
Date: Fri, 5 Jun 2026 10:46:11 -0700
In-Reply-To: <20260605174611.2222504-1-seanjc@google.com>
Mime-Version: 1.0
References: <20260605174611.2222504-1-seanjc@google.com>
X-Mailer: git-send-email 2.54.0.1032.g2f8565e1d1-goog
Message-ID: <20260605174611.2222504-3-seanjc@google.com>
Subject: [PATCH 2/2] KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
From: Sean Christopherson
To: Sean Christopherson, Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed, Jim Mattson, James Houghton
Content-Type: text/plain; charset="utf-8"

Turn arch.n_used_mmu_pages into a stat, mmu_shadow_pages, as the number of
live shadow pages is arguably _the_ most critical datapoint when it comes
to analyzing the shadow MMU.

Before the TDP MMU came along, i.e. when the shadow MMU was the only MMU,
explicitly tracking the number of shadow pages wasn't as interesting,
because the same information could more or less be gleaned from the
pages_{1g,2m,4k} stats.
But with the TDP MMU, where the shadow MMU is only used for nested TDP, it
becomes extremely difficult, if not impossible, to determine which SPTEs
are coming from the TDP MMU, and which are coming from the shadow MMU.

E.g. when triaging/debugging shadow MMU performance issues due to "too many
shadow pages", being able to observe that 99%+ of all shadow pages are
unsync is critical to being able to deduce that KVM is effectively leaking
shadow pages.

Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu/mmu.c          | 14 +++++++-------
 arch/x86/kvm/mmu/mmutrace.h     |  2 +-
 arch/x86/kvm/x86.c              |  1 +
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3886b536c8a5..be84e4d2405e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1701,6 +1701,7 @@ struct kvm_vm_stat {
 	u64 mmu_recycled;
 	u64 mmu_cache_miss;
 	u64 mmu_unsync;
+	u64 mmu_shadow_pages;
 	union {
 		struct {
 			atomic64_t pages_4k;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9368a71336fe..3839aef6819b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1801,13 +1801,13 @@ static void kvm_mmu_check_sptes_at_free(struct kvm_mmu_page *sp)
 
 static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm->arch.n_used_mmu_pages++;
+	kvm->stat.mmu_shadow_pages++;
 	kvm_account_pgtable_pages((void *)sp->spt, +1);
 }
 
 static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm->arch.n_used_mmu_pages--;
+	kvm->stat.mmu_shadow_pages--;
 	kvm_account_pgtable_pages((void *)sp->spt, -1);
 }
 
@@ -2833,9 +2833,9 @@ static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
 
 static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm)
 {
-	if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages)
+	if (kvm->arch.n_max_mmu_pages > kvm->stat.mmu_shadow_pages)
 		return kvm->arch.n_max_mmu_pages -
-			kvm->arch.n_used_mmu_pages;
+			kvm->stat.mmu_shadow_pages;
 
 	return 0;
 }
@@ -2871,11 +2871,11 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 {
 	write_lock(&kvm->mmu_lock);
 
-	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
-		kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
+	if (kvm->stat.mmu_shadow_pages > goal_nr_mmu_pages) {
+		kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->stat.mmu_shadow_pages -
 						  goal_nr_mmu_pages);
 
-		goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
+		goal_nr_mmu_pages = kvm->stat.mmu_shadow_pages;
 	}
 
 	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index fa01719baf8d..8354d9f39777 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -303,7 +303,7 @@ TRACE_EVENT(
 
 	TP_fast_assign(
 		__entry->mmu_valid_gen = kvm->arch.mmu_valid_gen;
-		__entry->mmu_used_pages = kvm->arch.n_used_mmu_pages;
+		__entry->mmu_used_pages = kvm->stat.mmu_shadow_pages;
 	),
 
 	TP_printk("kvm-mmu-valid-gen %u used_pages %x",
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd68a5bad0c6..e4cbecaa105d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -244,6 +244,7 @@ const struct kvm_stats_desc kvm_vm_stats_desc[] = {
 	STATS_DESC_COUNTER(VM, mmu_recycled),
 	STATS_DESC_COUNTER(VM, mmu_cache_miss),
 	STATS_DESC_ICOUNTER(VM, mmu_unsync),
+	STATS_DESC_ICOUNTER(VM, mmu_shadow_pages),
 	STATS_DESC_ICOUNTER(VM, pages_4k),
 	STATS_DESC_ICOUNTER(VM, pages_2m),
 	STATS_DESC_ICOUNTER(VM, pages_1g),
-- 
2.54.0.1032.g2f8565e1d1-goog
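
As a rough illustration of the triage use case in patch 2, the "99%+ of all
shadow pages are unsync" check could be computed from the mmu_unsync and
mmu_shadow_pages counters along these lines. This is a userspace sketch
under stated assumptions: toy_unsync_percent() and toy_looks_like_leak() are
hypothetical helpers, the 99% cutoff is borrowed from the changelog's
wording, and how the two counters get read from the host is deliberately
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Percentage of live shadow pages that are unsync, rounded down.
 * @unsync and @total are assumed to be one consistent snapshot of the
 * mmu_unsync and mmu_shadow_pages stats. Both are page counts, so
 * unsync * 100 is assumed not to overflow 64 bits.
 */
static unsigned int toy_unsync_percent(uint64_t unsync, uint64_t total)
{
	assert(unsync <= total);
	if (total == 0)
		return 0;
	return (unsigned int)(unsync * 100 / total);
}

/* Assumption: 99% is the "suspicious" cutoff, per the "99%+" remark. */
static bool toy_looks_like_leak(uint64_t unsync, uint64_t total)
{
	return total != 0 && toy_unsync_percent(unsync, total) >= 99;
}
```

A VM with no shadow pages at all is reported as "not leaking" rather than
dividing by zero, and the rounding is downward so that a ratio just under
the cutoff is not flagged.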