From nobody Thu Apr  2 07:43:45 2026
From: Wei-Lin Chang <weilin.chang@arm.com>
To: linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Cc: Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oupton@kernel.org>,
	Joey Gouly <joey.gouly@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Wei-Lin Chang <weilin.chang@arm.com>
Subject: [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap
Date: Mon, 30 Mar 2026 11:06:30 +0100
Message-ID: <20260330100633.2817076-2-weilin.chang@arm.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20260330100633.2817076-1-weilin.chang@arm.com>
References: <20260330100633.2817076-1-weilin.chang@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Currently we are forced to fully unmap every shadow stage-2 of a VM when
unmapping a page from the canonical stage-2, for example during an MMU
notifier call. This is because we do not track which canonical IPAs are
mapped in the shadow stage-2 page tables, so there is no way to know
what to unmap.

Create a per kvm_s2_mmu maple tree to track canonical IPA range ->
nested IPA range, so that it is possible to partially unmap shadow
stage-2 when a canonical IPA range is unmapped. The algorithm is simple
and conservative:

At each shadow stage-2 map, insert the nested IPA range into the maple
tree, with the canonical IPA range as the key. If the canonical IPA
range doesn't overlap with existing ranges in the tree, insert as is,
and a reverse mapping for this range is established. But if the
canonical IPA range overlaps with any existing ranges in the tree, erase
those existing ranges, and create a new range that spans all the
overlapping ranges including the input range. At the same time, mark
this new spanning canonical IPA range as "polluted", indicating that we
have lost track of the nested IPA ranges that map to it.

A 64-bit maple tree entry is large enough to hold both the nested IPA
and the polluted status (stored as a bit called UNKNOWN_IPA), so no
memory is allocated beyond what the maple tree needs internally.
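
As a rough userspace illustration of that payload layout (not kernel
code), the sketch below packs and unpacks an entry. The helper names
revmap_encode(), revmap_is_polluted() and revmap_nested_ipa() are
hypothetical and do not exist in this series; masking the IPA on encode
is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as the patch: bits 55-12 nested IPA, bit 0 polluted. */
#define NESTED_IPA_MASK	(((UINT64_C(1) << 44) - 1) << 12)
#define UNKNOWN_IPA	UINT64_C(1)

/* Hypothetical helper: a polluted entry carries no nested IPA at all. */
static uint64_t revmap_encode(uint64_t nested_ipa, int polluted)
{
	if (polluted)
		return UNKNOWN_IPA;

	/* Assumption: drop the page offset and any bits above bit 55. */
	return nested_ipa & NESTED_IPA_MASK;
}

/* Hypothetical helper: nonzero if the entry lost track of its nested IPA. */
static int revmap_is_polluted(uint64_t entry)
{
	return (entry & UNKNOWN_IPA) != 0;
}

/* Hypothetical helper: only meaningful for unpolluted entries. */
static uint64_t revmap_nested_ipa(uint64_t entry)
{
	return entry & NESTED_IPA_MASK;
}
```

Because the nested IPA is page aligned, bit 0 is free to carry the
polluted flag, which is what lets a single entry hold both.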

Example:
|||| means existing range, ---- means empty range

input:            $$$$$$$$$$$$$$$$$$$$$$$$$$
tree:  --||||-----|||||||---------||||||||||-----------

free overlaps:
       --||||------------------------------------------
insert spanning range:
       --||||-----||||||||||||||||||||||||||-----------
                  ^^^^^^^^polluted!^^^^^^^^^
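
The merge step in the example can be sketched in userspace C with a
small fixed-size array standing in for the maple tree. Everything here
(struct revmap, revmap_record(), MAX_RANGES) is hypothetical and only
models the algorithm described above, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES	16
#define UNKNOWN_IPA	UINT64_C(1)

struct range { uint64_t start, last, entry; };	/* [start, last] inclusive */
struct revmap { struct range r[MAX_RANGES]; size_t n; };

/* Record canonical [start, last] -> nested_ipa; returns -1 if full. */
static int revmap_record(struct revmap *m, uint64_t start, uint64_t last,
			 uint64_t nested_ipa)
{
	uint64_t entry = nested_ipa;
	size_t i = 0;

	while (i < m->n) {
		struct range *cur = &m->r[i];

		if (cur->last < start || cur->start > last) {
			i++;		/* no overlap with this range */
			continue;
		}
		/* Same unpolluted mapping again: nothing to update. */
		if (entry != UNKNOWN_IPA && cur->entry == nested_ipa &&
		    cur->start == start && cur->last == last)
			return 0;

		/* Overlap: grow the span, drop the old range, pollute. */
		if (cur->start < start)
			start = cur->start;
		if (cur->last > last)
			last = cur->last;
		entry = UNKNOWN_IPA;
		*cur = m->r[--m->n];
		i = 0;			/* the span grew, so rescan */
	}

	if (m->n == MAX_RANGES)
		return -1;
	m->r[m->n++] = (struct range){ start, last, entry };
	return 0;
}
```

Recording a range that straddles two existing ranges leaves a single
polluted range covering all three, as in the diagram.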

With the reverse map in place, when a canonical IPA range gets unmapped,
search each s2 mmu's maple tree for the affected canonical IPA ranges
and act based on their polluted status:

polluted -> fall back and fully invalidate the current shadow stage-2,
            also clear the tree
not polluted -> unmap the nested IPA range, and remove the reverse map
                entry
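
A minimal userspace sketch of this decision, again with an array in
place of the maple tree. struct shadow_s2 and unmap_canonical() are
hypothetical stand-ins; the real code calls kvm_stage2_unmap_range(),
whereas the sketch only counts the bytes it would have unmapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNKNOWN_IPA	UINT64_C(1)

struct range { uint64_t start, last, entry; };	/* [start, last] inclusive */

/* Hypothetical stand-in for one shadow stage-2. */
struct shadow_s2 { uint64_t unmapped_bytes; int fully_invalidated; };

static void unmap_canonical(struct shadow_s2 *s2, struct range *r, size_t *n,
			    uint64_t start, uint64_t last)
{
	size_t i = 0;

	while (i < *n) {
		if (r[i].last < start || r[i].start > last) {
			i++;		/* not affected by this unmap */
			continue;
		}
		if (r[i].entry & UNKNOWN_IPA) {
			/* Polluted: give up, drop the map and everything. */
			*n = 0;
			s2->fully_invalidated = 1;
			return;
		}
		/* Not polluted: unmap the whole entry and forget it. */
		s2->unmapped_bytes += r[i].last - r[i].start + 1;
		r[i] = r[--*n];
	}
}
```

Note that an unpolluted entry is unmapped in full even if the unmap
range only partially covers it, which keeps the scheme conservative.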

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 arch/arm64/include/asm/kvm_host.h   |   3 +
 arch/arm64/include/asm/kvm_nested.h |   4 +
 arch/arm64/kvm/mmu.c                |  27 +++++--
 arch/arm64/kvm/nested.c             | 112 +++++++++++++++++++++++++++-
 4 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm=
_host.h
index 8545811e2238..1d0db7f268cc 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -217,6 +217,9 @@ struct kvm_s2_mmu {
 	 */
 	bool	nested_stage2_enabled;
=20
+	/* canonical IPA to nested IPA range lookup, protected by kvm.mmu_lock */
+	struct maple_tree nested_revmap_mt;
+
 #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
 	struct dentry *shadow_pt_debugfs_dentry;
 #endif
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/k=
vm_nested.h
index 091544e6af44..4d09d567d7f9 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -76,6 +76,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u=
16 vmid,
 				       const union tlbi_info *info,
 				       void (*)(struct kvm_s2_mmu *,
 						const union tlbi_info *));
+extern int kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
+				    gpa_t fault_gpa, size_t map_size);
 extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
 extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
=20
@@ -164,6 +166,8 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vc=
pu,
 				    struct kvm_s2_trans *trans);
 extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
 extern void kvm_nested_s2_wp(struct kvm *kvm);
+extern void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t =
size,
+				       bool may_block);
 extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
 extern void kvm_nested_s2_flush(struct kvm *kvm);
=20
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 17d64a1e11e5..6beb07d817c8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1107,8 +1107,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 		free_percpu(mmu->last_vcpu_ran);
 	}
=20
-	if (kvm_is_nested_s2_mmu(kvm, mmu))
+	if (kvm_is_nested_s2_mmu(kvm, mmu)) {
+		mtree_destroy(&mmu->nested_revmap_mt);
 		kvm_init_nested_s2_mmu(mmu);
+	}
=20
 	write_unlock(&kvm->mmu_lock);
=20
@@ -1625,6 +1627,13 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_ad=
dr_t fault_ipa,
 		goto out_unlock;
 	}
=20
+	if (nested) {
+		ret =3D kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
+					       fault_ipa, PAGE_SIZE);
+		if (ret)
+			goto out_unlock;
+	}
+
 	ret =3D KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE,
 						 __pfn_to_phys(pfn), prot,
 						 memcache, flags);
@@ -1922,6 +1931,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phy=
s_addr_t fault_ipa,
 		prot &=3D ~KVM_NV_GUEST_MAP_SZ;
 		ret =3D KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot,=
 flags);
 	} else {
+		if (nested) {
+			ret =3D kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
+						       fault_ipa, vma_pagesize);
+			if (ret)
+				goto out_unlock;
+		}
 		ret =3D KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
 					     memcache, flags);
@@ -2223,14 +2238,16 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
=20
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
+	gpa_t gpa =3D range->start << PAGE_SHIFT;
+	size_t size =3D (range->end - range->start) << PAGE_SHIFT;
+	bool may_block =3D range->may_block;
+
 	if (!kvm->arch.mmu.pgt)
 		return false;
=20
-	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
-			     (range->end - range->start) << PAGE_SHIFT,
-			     range->may_block);
+	__unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
+	kvm_unmap_gfn_range_nested(kvm, gpa, size, may_block);
=20
-	kvm_nested_s2_unmap(kvm, range->may_block);
 	return false;
 }
=20
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 883b6c1008fb..53392cc7dbae 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -7,6 +7,7 @@
 #include <linux/bitfield.h>
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
=20
 #include <asm/fixmap.h>
 #include <asm/kvm_arm.h>
@@ -43,6 +44,16 @@ struct vncr_tlb {
  */
 #define S2_MMU_PER_VCPU		2
=20
+/*
+ * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
+ * layout:
+ *
+ * bits 55-12: nested IPA bits 55-12
+ * bit 0: polluted, 1 for polluted, 0 for not
+ */
+#define NESTED_IPA_MASK		GENMASK_ULL(55, 12)
+#define UNKNOWN_IPA		BIT(0)
+
 void kvm_init_nested(struct kvm *kvm)
 {
 	kvm->arch.nested_mmus =3D NULL;
@@ -769,12 +780,54 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kv=
m_vcpu *vcpu)
 	return s2_mmu;
 }
=20
+int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
+			     gpa_t fault_ipa, size_t map_size)
+{
+	struct maple_tree *mt =3D &mmu->nested_revmap_mt;
+	gpa_t start =3D ipa;
+	gpa_t end =3D ipa + map_size - 1;
+	u64 entry, new_entry =3D 0;
+	int r =3D 0;
+
+	lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
+
+	MA_STATE(mas, mt, start, end);
+	entry =3D (u64)mas_find_range(&mas, end);
+
+	if (entry) {
+		/* maybe just a perm update... */
+		if (!(entry & UNKNOWN_IPA) && mas.index =3D=3D start &&
+		    mas.last =3D=3D end &&
+		    fault_ipa =3D=3D (entry & NESTED_IPA_MASK))
+			goto out;
+		/*
+		 * Remove every overlapping range, then create a "polluted"
+		 * range that spans all these ranges and store it.
+		 */
+		while (entry && mas.index <=3D end) {
+			start =3D min(mas.index, start);
+			end =3D max(mas.last, end);
+			mas_erase(&mas);
+			entry =3D (u64)mas_find_range(&mas, end);
+		}
+		new_entry |=3D UNKNOWN_IPA;
+	} else {
+		new_entry |=3D fault_ipa;
+	}
+
+	mas_set_range(&mas, start, end);
+	r =3D mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+out:
+	return r;
+}
+
 void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 {
 	/* CnP being set denotes an invalid entry */
 	mmu->tlb_vttbr =3D VTTBR_CNP_BIT;
 	mmu->nested_stage2_enabled =3D false;
 	atomic_set(&mmu->refcnt, 0);
+	mt_init(&mmu->nested_revmap_mt);
 }
=20
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
@@ -1150,6 +1203,60 @@ void kvm_nested_s2_wp(struct kvm *kvm)
 	kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
 }
=20
+static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
+				  size_t unmap_size, bool may_block)
+{
+	struct maple_tree *mt =3D &mmu->nested_revmap_mt;
+	gpa_t start =3D gpa;
+	gpa_t end =3D gpa + unmap_size - 1;
+	u64 entry;
+	size_t entry_size;
+
+	MA_STATE(mas, mt, gpa, end);
+	entry =3D (u64)mas_find_range(&mas, end);
+
+	while (entry && mas.index <=3D end) {
+		start =3D mas.last + 1;
+		entry_size =3D mas.last - mas.index + 1;
+		/*
+		 * Give up and invalidate this s2 mmu if the unmap range
+		 * touches any polluted range.
+		 */
+		if (entry & UNKNOWN_IPA) {
+			mtree_destroy(mt);
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu),
+					       may_block);
+			return;
+		}
+		mas_erase(&mas);
+		kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
+				       may_block);
+		/*
+		 * Other maple tree operations during preemption could render
+		 * this ma_state invalid, so reset it.
+		 */
+		mas_set_range(&mas, start, end);
+		entry =3D (u64)mas_find_range(&mas, end);
+	}
+}
+
+void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
+				bool may_block)
+{
+	int i;
+
+	if (!kvm->arch.nested_mmus_size)
+		return;
+
+	/* TODO: accelerate this using mt of canonical s2 mmu */
+	for (i =3D 0; i < kvm->arch.nested_mmus_size; i++) {
+		struct kvm_s2_mmu *mmu =3D &kvm->arch.nested_mmus[i];
+
+		if (kvm_s2_mmu_valid(mmu))
+			unmap_mmu_ipa_range(mmu, gpa, size, may_block);
+	}
+}
+
 void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 {
 	int i;
@@ -1162,8 +1269,10 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_b=
lock)
 	for (i =3D 0; i < kvm->arch.nested_mmus_size; i++) {
 		struct kvm_s2_mmu *mmu =3D &kvm->arch.nested_mmus[i];
=20
-		if (kvm_s2_mmu_valid(mmu))
+		if (kvm_s2_mmu_valid(mmu)) {
+			mtree_destroy(&mmu->nested_revmap_mt);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
+		}
 	}
=20
 	kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
@@ -1848,6 +1957,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
=20
 		write_lock(&vcpu->kvm->mmu_lock);
 		if (mmu->pending_unmap) {
+			mtree_destroy(&mmu->nested_revmap_mt);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
 			mmu->pending_unmap =3D false;
 		}
--=20
2.43.0
From nobody Thu Apr  2 07:43:45 2026
From: Wei-Lin Chang <weilin.chang@arm.com>
To: linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Cc: Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oupton@kernel.org>,
	Joey Gouly <joey.gouly@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Wei-Lin Chang <weilin.chang@arm.com>
Subject: [PATCH 2/4] KVM: arm64: nv: Accelerate canonical IPA unmapping with
 canonical s2 mmu maple tree
Date: Mon, 30 Mar 2026 11:06:31 +0100
Message-ID: <20260330100633.2817076-3-weilin.chang@arm.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20260330100633.2817076-1-weilin.chang@arm.com>
References: <20260330100633.2817076-1-weilin.chang@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Checking every nested mmu during canonical IPA unmapping is slow,
especially when there are many valid nested mmus. We can leverage the
unused maple tree in the canonical kvm_s2_mmu to accelerate this
process.

At stage-2 fault time, in addition to recording the reverse map, also
add an entry to the canonical s2 mmu's maple tree, with the canonical
IPA range as the key and the nested s2 mmu that took the fault encoded
in the entry.

With this acceleration tree in place, at canonical IPA unmap time we can
look up the nested mmus affected by the unmap directly, instead of
checking every nested mmu.

In terms of encoding the nested mmus in the entry, there are 62 bits
available for each entry (bits 1 and 0 are reserved by the maple tree).
Each bit represents a group of nested mmus whose size depends on the
total number of nested mmus and grows in powers of 2, for example:

total nested mmus: 1-62    -> each bit represents: 1 nested mmu
                   63-124  ->                      2 nested mmus
                   125-248 ->                      4 nested mmus
                   ...                             ...
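
The table can be reproduced with a few lines of userspace C. The
function names below are hypothetical; the arithmetic is meant to match
what the patch does in kvm_vcpu_init_nested(), s2_mmu_to_accel_bit()
and accel_bit_to_s2_mmu(), with plain array indices instead of
struct kvm_s2_mmu pointers.

```c
#include <assert.h>

#define ACCEL_BITS	62	/* bits 63-2 of an entry are usable */

/* Smallest power such that 62 << power bits cover nr_mmus. */
static int mmus_per_bit_power(unsigned int nr_mmus)
{
	int power = 0;

	while (ACCEL_BITS * (1u << power) < nr_mmus)
		power++;

	return power;
}

/* Bit that represents the nested mmu at this array index. */
static int mmu_index_to_accel_bit(unsigned int index, int power)
{
	return (int)(index >> power) + 2;
}

/* Index of the first nested mmu in the group a bit stands for. */
static unsigned int accel_bit_to_first_index(int bit, int power)
{
	return (unsigned int)(bit - 2) << power;
}
```

With 124 nested mmus, for example, the power is 1, so mmus 122 and 123
share bit 63 and an unmap that finds bit 63 set has to visit both.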

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |   1 +
 arch/arm64/kvm/mmu.c              |   5 +-
 arch/arm64/kvm/nested.c           | 166 ++++++++++++++++++++++++++++--
 3 files changed, 163 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm=
_host.h
index 1d0db7f268cc..06f83bb7ff1d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -321,6 +321,7 @@ struct kvm_arch {
 	struct kvm_s2_mmu *nested_mmus;
 	size_t nested_mmus_size;
 	int nested_mmus_next;
+	int mmus_per_bit_power;
=20
 	/* Interrupt controller */
 	struct vgic_dist	vgic;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6beb07d817c8..2b413d3dc790 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1009,6 +1009,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s=
2_mmu *mmu, unsigned long t
 	if (kvm_is_nested_s2_mmu(kvm, mmu))
 		kvm_init_nested_s2_mmu(mmu);
=20
+	mt_init(&mmu->nested_revmap_mt);
+
 	return 0;
=20
 out_destroy_pgtable:
@@ -1107,8 +1109,9 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 		free_percpu(mmu->last_vcpu_ran);
 	}
=20
+	mtree_destroy(&mmu->nested_revmap_mt);
+
 	if (kvm_is_nested_s2_mmu(kvm, mmu)) {
-		mtree_destroy(&mmu->nested_revmap_mt);
 		kvm_init_nested_s2_mmu(mmu);
 	}
=20
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 53392cc7dbae..c7d00cb40ba5 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -80,7 +80,7 @@ int kvm_vcpu_init_nested(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm =3D vcpu->kvm;
 	struct kvm_s2_mmu *tmp;
-	int num_mmus, ret =3D 0;
+	int num_mmus, power =3D 0, ret =3D 0;
=20
 	if (test_bit(KVM_ARM_VCPU_HAS_EL2_E2H0, kvm->arch.vcpu_features) &&
 	    !cpus_have_final_cap(ARM64_HAS_HCR_NV1))
@@ -131,6 +131,25 @@ int kvm_vcpu_init_nested(struct kvm_vcpu *vcpu)
=20
 	kvm->arch.nested_mmus_size =3D num_mmus;
=20
+	/*
+	 * Calculate how many s2 mmus are represented by each bit in the
+	 * acceleration maple tree entries.
+	 *
+	 * power =3D=3D 0 -> 1 s2 mmu
+	 * power =3D=3D 1 -> 2 s2 mmus
+	 * power =3D=3D 2 -> 4 s2 mmus
+	 * power =3D=3D 3 -> 8 s2 mmus
+	 * etc.
+	 *
+	 * We use only the top 62 bits in the canonical s2 mmu maple tree
+	 * entries; bits 0 and 1 are not used, since maple trees reserve values
+	 * with bit patterns ending in 10 that are also smaller than 4096.
+	 */
+	while (62 * (1 << power) < kvm->arch.nested_mmus_size)
+		power++;
+
+	kvm->arch.mmus_per_bit_power =3D power;
+
 	return 0;
 }
=20
@@ -780,6 +799,119 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kv=
m_vcpu *vcpu)
 	return s2_mmu;
 }
=20
+static int s2_mmu_to_accel_bit(struct kvm_s2_mmu *mmu)
+{
+	BUG_ON(&mmu->arch->mmu =3D=3D mmu);
+
+	int index =3D mmu - mmu->arch->nested_mmus;
+	int power =3D mmu->arch->mmus_per_bit_power;
+
+	return (index >> power) + 2;
+}
+
+/* this returns the first s2 mmu from the span */
+static struct kvm_s2_mmu *accel_bit_to_s2_mmu(struct kvm *kvm, int bit)
+{
+	int power =3D kvm->arch.mmus_per_bit_power;
+	int index =3D (bit - 2) << power;
+
+	BUG_ON(index >=3D kvm->arch.nested_mmus_size);
+
+	return &kvm->arch.nested_mmus[index];
+}
+
+static void accel_clear_mmu_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
+				  size_t size)
+{
+	struct maple_tree *mt =3D &mmu->arch->mmu.nested_revmap_mt;
+	int bit =3D s2_mmu_to_accel_bit(mmu);
+	void *entry, *new_entry;
+	gpa_t start =3D gpa;
+	gpa_t end =3D gpa + size - 1;
+
+	if (mmu->arch->mmus_per_bit_power > 0) {
+		/* sadly nothing we can do here... */
+		return;
+	}
+
+	MA_STATE(mas, mt, start, end);
+
+	entry =3D mas_find_range(&mas, end);
+	BUG_ON(!entry);
+
+	/*
+	 * 1. Ranges smaller than the queried range should not exist, because
+	 *    for the same mmu, the same ranges are added in both the accel mt
+	 *    and the mmu's mt at fault time.
+	 *
+	 * 2. Ranges larger than the queried range could exist, since
+	 *    another mmu could have a range mapped on top.
+	 *    However in this case we don't know whether there are other
+	 *    smaller ranges in this larger range that belong to this same
+	 *    mmu, so we can't just remove the bit.
+	 */
+	if (mas.index =3D=3D start && mas.last =3D=3D end) {
+		new_entry =3D (void *)((unsigned long)entry & ~BIT(bit));
+		/*
+		 * This naturally clears the range from the mt if
+		 * new_entry =3D=3D 0.
+		 */
+		mas_store_gfp(&mas, new_entry, GFP_KERNEL_ACCOUNT);
+	}
+}
+
+static void accel_clear_mmu(struct kvm_s2_mmu *mmu)
+{
+	struct maple_tree *mt =3D &mmu->arch->mmu.nested_revmap_mt;
+	int bit =3D s2_mmu_to_accel_bit(mmu);
+	void *entry, *new_entry;
+
+	if (mmu->arch->mmus_per_bit_power > 0) {
+		/* sadly nothing we can do here... */
+		return;
+	}
+
+	MA_STATE(mas, mt, 0, ULONG_MAX);
+
+	mas_for_each(&mas, entry, ULONG_MAX) {
+		new_entry =3D (void *)((unsigned long)entry & ~BIT(bit));
+		/*
+		 * This naturally clears the range from the mt if
+		 * new_entry =3D=3D 0.
+		 */
+		mas_store_gfp(&mas, new_entry, GFP_KERNEL_ACCOUNT);
+	}
+}
+
+static int record_accel(struct kvm_s2_mmu *mmu, gpa_t gpa,
+			       size_t map_size)
+{
+	struct maple_tree *mt =3D &mmu->arch->mmu.nested_revmap_mt;
+	gpa_t start =3D gpa;
+	gpa_t end =3D gpa + map_size - 1;
+	u64 entry, new_entry =3D 0;
+
+	MA_STATE(mas, mt, start, end);
+	entry =3D (u64)mas_find_range(&mas, end);
+
+	/*
+	 * OR every overlapping range's entry, then create a
+	 * range that spans all these ranges and store it.
+	 */
+	while (entry && mas.index <=3D end) {
+		start =3D min(mas.index, start);
+		end =3D max(mas.last, end);
+		new_entry |=3D entry;
+		mas_erase(&mas);
+		entry =3D (u64)mas_find_range(&mas, end);
+	}
+
+	new_entry |=3D BIT(s2_mmu_to_accel_bit(mmu));
+	mas_set_range(&mas, start, end);
+
+	return mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+}
+
 int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
 			     gpa_t fault_ipa, size_t map_size)
 {
@@ -792,6 +924,11 @@ int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_=
mmu *mmu,
 	lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
=20
 	MA_STATE(mas, mt, start, end);
+
+	r =3D record_accel(mmu, ipa, map_size);
+	if (r)
+		goto out;
+
 	entry =3D (u64)mas_find_range(&mas, end);
=20
 	if (entry) {
@@ -827,7 +964,6 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 	mmu->tlb_vttbr =3D VTTBR_CNP_BIT;
 	mmu->nested_stage2_enabled =3D false;
 	atomic_set(&mmu->refcnt, 0);
-	mt_init(&mmu->nested_revmap_mt);
 }
=20
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
@@ -1224,11 +1360,13 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *=
mmu, gpa_t gpa,
 		 */
 		if (entry & UNKNOWN_IPA) {
 			mtree_destroy(mt);
+			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu),
 					       may_block);
 			return;
 		}
 		mas_erase(&mas);
+		accel_clear_mmu_range(mmu, mas.index, entry_size);
 		kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
 				       may_block);
 		/*
@@ -1243,17 +1381,27 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *=
mmu, gpa_t gpa,
 void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
 				bool may_block)
 {
-	int i;
+	struct maple_tree *mt =3D &kvm->arch.mmu.nested_revmap_mt;
+	gpa_t start =3D gpa;
+	gpa_t end =3D gpa + size - 1;
+	u64 entry;
+	int bit, i =3D 0;
+	int power =3D kvm->arch.mmus_per_bit_power;
+	struct kvm_s2_mmu *mmu;
+	MA_STATE(mas, mt, start, end);
=20
 	if (!kvm->arch.nested_mmus_size)
 		return;
=20
-	/* TODO: accelerate this using mt of canonical s2 mmu */
-	for (i =3D 0; i < kvm->arch.nested_mmus_size; i++) {
-		struct kvm_s2_mmu *mmu =3D &kvm->arch.nested_mmus[i];
+	entry =3D (u64)mas_find_range(&mas, end);
=20
-		if (kvm_s2_mmu_valid(mmu))
-			unmap_mmu_ipa_range(mmu, gpa, size, may_block);
+	while (entry && mas.index <=3D end) {
+		for_each_set_bit(bit, (unsigned long *)&entry, 64) {
+			mmu =3D accel_bit_to_s2_mmu(kvm, bit);
+			for (i =3D 0; i < (1 << power); i++)
+				unmap_mmu_ipa_range(mmu + i, gpa, size, may_block);
+		}
+		entry =3D (u64)mas_find_range(&mas, end);
 	}
 }
=20
@@ -1274,6 +1422,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_bl=
ock)
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
 		}
 	}
+	mtree_destroy(&kvm->arch.mmu.nested_revmap_mt);
=20
 	kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
 }
@@ -1958,6 +2107,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
 		write_lock(&vcpu->kvm->mmu_lock);
 		if (mmu->pending_unmap) {
 			mtree_destroy(&mmu->nested_revmap_mt);
+			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
 			mmu->pending_unmap =3D false;
 		}
--=20
2.43.0
From nobody Thu Apr  2 07:43:45 2026
From: Wei-Lin Chang <weilin.chang@arm.com>
To: linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Cc: Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oupton@kernel.org>,
	Joey Gouly <joey.gouly@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Wei-Lin Chang <weilin.chang@arm.com>
Subject: [PATCH 3/4] KVM: arm64: nv: Remove reverse map entries during TLBI
 handling
Date: Mon, 30 Mar 2026 11:06:32 +0100
Message-ID: <20260330100633.2817076-4-weilin.chang@arm.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20260330100633.2817076-1-weilin.chang@arm.com>
References: <20260330100633.2817076-1-weilin.chang@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

When a guest hypervisor issues a TLBI for a specific IPA range, KVM
unmaps that range from all the affected shadow stage-2s. This is also an
opportunity to remove the matching reverse map entries, which lowers the
probability of creating polluted reverse map ranges at subsequent
stage-2 faults.

However, the TLBI ranges are specified in nested IPA, so in order to
locate the affected ranges in the reverse map maple tree, which is a
mapping from canonical IPA to nested IPA, we can only iterate through
the entire tree and check each entry.
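
The per-entry check applied during that walk can be sketched in
userspace C as below. revmap_entry_covered_by_tlbi() is a hypothetical
name; the logic is intended to mirror the condition used in
kvm_remove_nested_revmap(), where an entry is dropped only if it is
unpolluted and its nested IPA range lies entirely within
[addr, addr + size).

```c
#include <assert.h>
#include <stdint.h>

#define NESTED_IPA_MASK	(((UINT64_C(1) << 44) - 1) << 12)
#define UNKNOWN_IPA	UINT64_C(1)

/*
 * canon_start/canon_last: the canonical IPA key of the entry (inclusive).
 * entry: the stored payload. addr/size: the TLBI range in nested IPA.
 */
static int revmap_entry_covered_by_tlbi(uint64_t canon_start,
					uint64_t canon_last, uint64_t entry,
					uint64_t addr, uint64_t size)
{
	uint64_t len, nested_start, nested_end;

	/* Polluted entries have no known nested IPA to compare against. */
	if (entry & UNKNOWN_IPA)
		return 0;

	len = canon_last - canon_start + 1;
	nested_start = entry & NESTED_IPA_MASK;
	nested_end = nested_start + len;	/* exclusive */

	return nested_start >= addr && nested_end <= addr + size;
}
```

Entries that only partially overlap the TLBI range are kept, so the
reverse map never claims less than what may still be mapped.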

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 arch/arm64/include/asm/kvm_nested.h |  1 +
 arch/arm64/kvm/nested.c             | 29 +++++++++++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c           |  3 +++
 3 files changed, 33 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/k=
vm_nested.h
index 4d09d567d7f9..376619cdc9d5 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -76,6 +76,7 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u=
16 vmid,
 				       const union tlbi_info *info,
 				       void (*)(struct kvm_s2_mmu *,
 						const union tlbi_info *));
+extern void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 addr, u64=
 size);
 extern int kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
 				    gpa_t fault_gpa, size_t map_size);
 extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index c7d00cb40ba5..125fa21ca2e7 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -912,6 +912,35 @@ static int record_accel(struct kvm_s2_mmu *mmu, gpa_t =
gpa,
 	return mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
 }
=20
+void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 addr, u64 size)
+{
+	/*
+	 * Iterate through the mt of this mmu, remove all unpolluted canonical
+	 * ipa ranges that maps to ranges that are strictly within
+	 * [addr, addr + size).
+	 */
+	struct maple_tree *mt =3D &mmu->nested_revmap_mt;
+	void *entry;
+	u64 nested_ipa, nested_ipa_end, addr_end =3D addr + size;
+	size_t revmap_size;
+
+	MA_STATE(mas, mt, 0, ULONG_MAX);
+
+	mas_for_each(&mas, entry, ULONG_MAX) {
+		if ((u64)entry & UNKNOWN_IPA)
+			continue;
+
+		revmap_size =3D mas.last - mas.index + 1;
+		nested_ipa =3D (u64)entry & NESTED_IPA_MASK;
+		nested_ipa_end =3D nested_ipa + revmap_size;
+
+		if (nested_ipa >=3D addr && nested_ipa_end <=3D addr_end) {
+			accel_clear_mmu_range(mmu, mas.index, revmap_size);
+			mas_erase(&mas);
+		}
+	}
+}
+
 int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
 			     gpa_t fault_ipa, size_t map_size)
 {
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e1001544d4f4..c7af0eac9ee4 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -4006,6 +4006,7 @@ union tlbi_info {
 static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
 			       const union tlbi_info *info)
 {
+	kvm_remove_nested_revmap(mmu, info->range.start, info->range.size);
 	/*
 	 * The unmap operation is allowed to drop the MMU lock and block, which
 	 * means that @mmu could be used for a different context than the one
@@ -4104,6 +4105,8 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
 	max_size =3D compute_tlb_inval_range(mmu, info->ipa.addr);
 	base_addr &=3D ~(max_size - 1);
=20
+	kvm_remove_nested_revmap(mmu, base_addr, max_size);
+
 	/*
 	 * See comment in s2_mmu_unmap_range() for why this is allowed to
 	 * reschedule.
--=20
2.43.0
From nobody Thu Apr  2 07:43:46 2026
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by smtp.subspace.kernel.org (Postfix) with ESMTP id 19CEB3B894A
	for <linux-kernel@vger.kernel.org>; Mon, 30 Mar 2026 10:07:42 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=217.140.110.172
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1774865264; cv=none;
 b=a/XdGQroYTW5KEEi2LvFVWey5BCxHQIyudQGqTK85RsSrIYoH7RoGL9pY57KI81rBwI2ZUfoPQQ/fbltGog5fStenR8SSjN5SjaAElct+hDX1O3wRwSiC0VdFusk03L2/Fx0IWtARvbDlgT4EcMaALAOVN2V0lkLaMfHxvVWYx8=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1774865264; c=relaxed/simple;
	bh=Vb9xxtHsTHwRbzD7GU8OESZOKbk4eyUOZTuHVncJqrw=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=K1myravoVD4herLAXbdy/pTlMO7YtwiMMIBANZr5BKe/mApTR2pKyUus+a/C7RNY9jSrOA+0K9mcXqa3t10FvgnKeiNFcX+VFCiocEnMBIqzkkbu9c71XtywsuDb9EZG1TLBoVlsPnZdz6xpX1JL0xfrVHOXhL7TavPfFNMX7Nk=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=arm.com;
 spf=pass smtp.mailfrom=arm.com;
 dkim=pass (1024-bit key) header.d=arm.com header.i=@arm.com
 header.b=f/3+x1/2; arc=none smtp.client-ip=217.140.110.172
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=arm.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=arm.com header.i=@arm.com
 header.b="f/3+x1/2"
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id B266A1BF3;
	Mon, 30 Mar 2026 03:07:36 -0700 (PDT)
Received: from workstation-e142269.cambridge.arm.com
 (usa-sjc-imap-foss1.foss.arm.com [10.121.207.14])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E77803F915;
	Mon, 30 Mar 2026 03:07:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=arm.com; s=foss;
	t=1774865262; bh=Vb9xxtHsTHwRbzD7GU8OESZOKbk4eyUOZTuHVncJqrw=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=f/3+x1/2VERKOhutWBXZBELrTUAbk7W/MrKR6o8yuXOk1TMKjLgjirYbc08fs+Zbu
	 DvVg3CpQ4Henpk4kT7JVgqTOQk7BVnW9IdOaY596wWkD1qUYgZPvEUkdEkI8PuWhyM
	 hpNiKCxVjYDxUYh9/bvpDXEflvOGQ5RDWYdv4fmg=
From: Wei-Lin Chang <weilin.chang@arm.com>
To: linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Cc: Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oupton@kernel.org>,
	Joey Gouly <joey.gouly@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Wei-Lin Chang <weilin.chang@arm.com>
Subject: [PATCH 4/4] KVM: arm64: nv: Create nested IPA direct map to speed up
 reverse map removal
Date: Mon, 30 Mar 2026 11:06:33 +0100
Message-ID: <20260330100633.2817076-5-weilin.chang@arm.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20260330100633.2817076-1-weilin.chang@arm.com>
References: <20260330100633.2817076-1-weilin.chang@arm.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Iterating through the whole reverse map to find which entries to remove
when handling guest hypervisor TLBIs is inefficient, since every TLBI
walks the entire tree however small its range is. Create a direct map
that goes from nested IPA to canonical IPA so that the canonical IPA
range affected by the TLBI can be quickly determined, then remove the
entries in the reverse map accordingly.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
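For illustration only, a minimal user-space sketch of the lookup
described above. It is a toy model, not the kernel code: two sorted
arrays of non-overlapping ranges stand in for the two maple trees, and
struct range_entry, first_overlap() and remove_by_direct_map() are
hypothetical names chosen for this example. Unlike the patch, a missing
reverse map entry is skipped here rather than treated as a bug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One range in either toy map: [start, start + size) -> target. */
struct range_entry {
	uint64_t start;  /* key: nested IPA (direct), canonical (reverse) */
	uint64_t target; /* payload: start address on the other side */
	uint64_t size;   /* length of the range in bytes */
	bool polluted;   /* only meaningful in the reverse map */
	bool live;       /* false once the entry has been erased */
};

/* Index of the first entry ending after addr, in a start-sorted array. */
static size_t first_overlap(const struct range_entry *m, size_t n,
			    uint64_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (m[mid].start + m[mid].size <= addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Erase matching pairs for a TLBI of [addr, addr + size). */
static size_t remove_by_direct_map(struct range_entry *direct, size_t nd,
				   struct range_entry *rev, size_t nr,
				   uint64_t addr, uint64_t size)
{
	uint64_t addr_end = addr + size;
	size_t removed = 0;

	assert(size > 0);

	/* Only direct map entries overlapping the TLBI range are visited. */
	for (size_t i = first_overlap(direct, nd, addr);
	     i < nd && direct[i].start < addr_end; i++) {
		struct range_entry *d = &direct[i];
		size_t j;

		if (!d->live)
			continue;
		/* Condition 1: the TLBI range covers the whole nested range. */
		if (d->start < addr || d->start + d->size > addr_end)
			continue;

		/* Follow the payload to the reverse map entry. */
		j = first_overlap(rev, nr, d->target);
		if (j == nr || !rev[j].live || rev[j].start > d->target)
			continue;
		/* Condition 2: the reverse map entry is not polluted. */
		if (rev[j].polluted)
			continue;

		rev[j].live = false;
		d->live = false;
		removed++;
	}
	return removed;
}
```

Only the entries that overlap the TLBI range are visited, and each one
costs a single extra lookup in the reverse map, instead of a walk over
every reverse map entry.
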
 arch/arm64/include/asm/kvm_host.h |   3 +
 arch/arm64/kvm/mmu.c              |   2 +
 arch/arm64/kvm/nested.c           | 131 ++++++++++++++++++++----------
 3 files changed, 95 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm=
_host.h
index 06f83bb7ff1d..6b0858805530 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -220,6 +220,9 @@ struct kvm_s2_mmu {
 	/* canonical IPA to nested IPA range lookup, protected by kvm.mmu_lock */
 	struct maple_tree nested_revmap_mt;
=20
+	/* nested IPA to canonical IPA range lookup, protected by kvm.mmu_lock */
+	struct maple_tree nested_direct_mt;
+
 #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
 	struct dentry *shadow_pt_debugfs_dentry;
 #endif
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2b413d3dc790..9f27a9669ec9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1010,6 +1010,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s=
2_mmu *mmu, unsigned long t
 		kvm_init_nested_s2_mmu(mmu);
=20
 	mt_init(&mmu->nested_revmap_mt);
+	mt_init(&mmu->nested_direct_mt);
=20
 	return 0;
=20
@@ -1112,6 +1113,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 	mtree_destroy(&mmu->nested_revmap_mt);
=20
 	if (kvm_is_nested_s2_mmu(kvm, mmu)) {
+		mtree_destroy(&mmu->nested_direct_mt);
 		kvm_init_nested_s2_mmu(mmu);
 	}
=20
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 125fa21ca2e7..4c96130abf82 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -45,13 +45,12 @@ struct vncr_tlb {
 #define S2_MMU_PER_VCPU		2
=20
 /*
- * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
- * layout:
+ * Per shadow S2 reverse & direct map maple tree payload layout:
  *
- * bits 55-12: nested IPA bits 55-12
- * bit 0: polluted, 1 for polluted, 0 for not
+ * bits 55-12: {nested, canonical} IPA bits 55-12
+ * bit 0: polluted, 1 for polluted, 0 for not, only used in reverse map
  */
-#define NESTED_IPA_MASK		GENMASK_ULL(55, 12)
+#define ADDR_MASK		GENMASK_ULL(55, 12)
 #define UNKNOWN_IPA		BIT(0)
=20
 void kvm_init_nested(struct kvm *kvm)
@@ -915,74 +914,118 @@ static int record_accel(struct kvm_s2_mmu *mmu, gpa_=
t gpa,
 void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 addr, u64 size)
 {
 	/*
-	 * Iterate through the mt of this mmu, remove all unpolluted canonical
-	 * ipa ranges that maps to ranges that are strictly within
-	 * [addr, addr + size).
+	 * For each range in direct_mt that overlaps the range being TLBI'd,
+	 * [addr, addr + size), remove the reverse map entry AND its
+	 * corresponding direct map entry together, when these conditions
+	 * are met:
+	 *
+	 * 1. The TLBI range completely covers the stored nested IPA range.
+	 * 2. The reverse map is not polluted. This ensures the reverse map
+	 *    and the direct map are 1:1.
 	 */
-	struct maple_tree *mt =3D &mmu->nested_revmap_mt;
-	void *entry;
-	u64 nested_ipa, nested_ipa_end, addr_end =3D addr + size;
-	size_t revmap_size;
+	struct maple_tree *direct_mt =3D &mmu->nested_direct_mt;
+	struct maple_tree *revmap_mt =3D &mmu->nested_revmap_mt;
+	gpa_t nested_ipa_start =3D addr;
+	gpa_t nested_ipa_end =3D addr + size - 1;
+	u64 entry_ipa, entry_nested_ipa;
+	u64 ipa, ipa_end;
=20
-	MA_STATE(mas, mt, 0, ULONG_MAX);
+	MA_STATE(mas_nested_ipa, direct_mt, nested_ipa_start, nested_ipa_end);
+	entry_ipa =3D (u64)mas_find_range(&mas_nested_ipa, nested_ipa_end);
=20
-	mas_for_each(&mas, entry, ULONG_MAX) {
-		if ((u64)entry & UNKNOWN_IPA)
-			continue;
+	while (entry_ipa && mas_nested_ipa.index <=3D nested_ipa_end) {
+		ipa =3D entry_ipa & ADDR_MASK;
+		ipa_end =3D ipa + mas_nested_ipa.last - mas_nested_ipa.index;
=20
-		revmap_size =3D mas.last - mas.index + 1;
-		nested_ipa =3D (u64)entry & NESTED_IPA_MASK;
-		nested_ipa_end =3D nested_ipa + revmap_size;
+		/* Use ipa range to find the corresponding entry in revmap. */
+		MA_STATE(mas_ipa, revmap_mt, ipa, ipa_end);
+		entry_nested_ipa =3D (u64)mas_find_range(&mas_ipa, ipa_end);
=20
-		if (nested_ipa >=3D addr && nested_ipa_end <=3D addr_end) {
-			accel_clear_mmu_range(mmu, mas.index, revmap_size);
-			mas_erase(&mas);
+		/*
+		 * Reverse and direct map are created together at s2 faults,
+		 * thus every direct map range should also have a corresponding
+		 * reverse map range, however that can be polluted.
+		 */
+		BUG_ON(!entry_nested_ipa);
+
+		/* The two conditions outlined above. */
+		if (!(entry_nested_ipa & UNKNOWN_IPA) &&
+		    mas_nested_ipa.index >=3D addr &&
+		    mas_nested_ipa.last <=3D nested_ipa_end) {
+			/*
+			 * If the reverse map isn't polluted, the direct and
+			 * reverse map are expected to be 1:1, thus they must
+			 * have the same size.
+			 */
+			BUG_ON(mas_ipa.last - mas_ipa.index !=3D
+			       mas_nested_ipa.last - mas_nested_ipa.index);
+
+			accel_clear_mmu_range(mmu, mas_ipa.index,
+					      mas_ipa.last - mas_ipa.index + 1);
+			mas_erase(&mas_ipa);
+			mas_erase(&mas_nested_ipa);
 		}
+		entry_ipa =3D (u64)mas_find_range(&mas_nested_ipa, nested_ipa_end);
 	}
 }
=20
 int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
 			     gpa_t fault_ipa, size_t map_size)
 {
-	struct maple_tree *mt =3D &mmu->nested_revmap_mt;
-	gpa_t start =3D ipa;
-	gpa_t end =3D ipa + map_size - 1;
+	struct maple_tree *direct_mt =3D &mmu->nested_direct_mt;
+	struct maple_tree *revmap_mt =3D &mmu->nested_revmap_mt;
+	gpa_t ipa_start =3D ipa;
+	gpa_t ipa_end =3D ipa + map_size - 1;
+	gpa_t fault_ipa_end =3D fault_ipa + map_size - 1;
 	u64 entry, new_entry =3D 0;
 	int r =3D 0;
=20
 	lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
=20
-	MA_STATE(mas, mt, start, end);
+	MA_STATE(mas_ipa, revmap_mt, ipa_start, ipa_end);
+	MA_STATE(mas_nested_ipa, direct_mt, fault_ipa, fault_ipa_end);
=20
 	r =3D record_accel(mmu, ipa, map_size);
 	if (r)
 		goto out;
=20
-	entry =3D (u64)mas_find_range(&mas, end);
+	r =3D mas_store_gfp(&mas_nested_ipa, (void *)ipa, GFP_KERNEL_ACCOUNT);
+	/*
+	 * In the case of direct map store failure, don't clean up
+	 * record_accel()'s successfully installed accel mt entry. Keeping
+	 * it is fine as it will just cause us to check a few more s2 mmus
+	 * in the mmu notifier.
+	 */
+	if (r)
+		goto out;
+
+	entry =3D (u64)mas_find_range(&mas_ipa, ipa_end);
=20
 	if (entry) {
 		/* maybe just a perm update... */
-		if (!(entry & UNKNOWN_IPA) && mas.index =3D=3D start &&
-		    mas.last =3D=3D end &&
-		    fault_ipa =3D=3D (entry & NESTED_IPA_MASK))
+		if (!(entry & UNKNOWN_IPA) && mas_ipa.index =3D=3D ipa_start &&
+		    mas_ipa.last =3D=3D ipa_end &&
+		    fault_ipa =3D=3D (entry & ADDR_MASK))
 			goto out;
 		/*
 		 * Remove every overlapping range, then create a "polluted"
 		 * range that spans all these ranges and store it.
 		 */
-		while (entry && mas.index <=3D end) {
-			start =3D min(mas.index, start);
-			end =3D max(mas.last, end);
-			mas_erase(&mas);
-			entry =3D (u64)mas_find_range(&mas, end);
+		while (entry && mas_ipa.index <=3D ipa_end) {
+			ipa_start =3D min(mas_ipa.index, ipa_start);
+			ipa_end =3D max(mas_ipa.last, ipa_end);
+			mas_erase(&mas_ipa);
+			entry =3D (u64)mas_find_range(&mas_ipa, ipa_end);
 		}
 		new_entry |=3D UNKNOWN_IPA;
 	} else {
 		new_entry |=3D fault_ipa;
 	}
=20
-	mas_set_range(&mas, start, end);
-	r =3D mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+	mas_set_range(&mas_ipa, ipa_start, ipa_end);
+	r =3D mas_store_gfp(&mas_ipa, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+	if (r)
+		mas_erase(&mas_nested_ipa);
 out:
 	return r;
 }
@@ -1371,13 +1414,14 @@ void kvm_nested_s2_wp(struct kvm *kvm)
 static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
 				  size_t unmap_size, bool may_block)
 {
-	struct maple_tree *mt =3D &mmu->nested_revmap_mt;
+	struct maple_tree *direct_mt =3D &mmu->nested_direct_mt;
+	struct maple_tree *revmap_mt =3D &mmu->nested_revmap_mt;
 	gpa_t start =3D gpa;
 	gpa_t end =3D gpa + unmap_size - 1;
 	u64 entry;
 	size_t entry_size;
=20
-	MA_STATE(mas, mt, gpa, end);
+	MA_STATE(mas, revmap_mt, gpa, end);
 	entry =3D (u64)mas_find_range(&mas, end);
=20
 	while (entry && mas.index <=3D end) {
@@ -1388,15 +1432,18 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *=
mmu, gpa_t gpa,
 		 * touches any polluted range.
 		 */
 		if (entry & UNKNOWN_IPA) {
-			mtree_destroy(mt);
+			mtree_destroy(direct_mt);
+			mtree_destroy(revmap_mt);
 			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu),
 					       may_block);
 			return;
 		}
+		/* not polluted, direct map and reverse map must be 1:1 */
+		mtree_erase(direct_mt, entry & ADDR_MASK);
 		mas_erase(&mas);
 		accel_clear_mmu_range(mmu, mas.index, entry_size);
-		kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
+		kvm_stage2_unmap_range(mmu, entry & ADDR_MASK, entry_size,
 				       may_block);
 		/*
 		 * Other maple tree operations during preemption could render
@@ -1447,6 +1494,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_bl=
ock)
 		struct kvm_s2_mmu *mmu =3D &kvm->arch.nested_mmus[i];
=20
 		if (kvm_s2_mmu_valid(mmu)) {
+			mtree_destroy(&mmu->nested_direct_mt);
 			mtree_destroy(&mmu->nested_revmap_mt);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
 		}
@@ -2135,6 +2183,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
=20
 		write_lock(&vcpu->kvm->mmu_lock);
 		if (mmu->pending_unmap) {
+			mtree_destroy(&mmu->nested_direct_mt);
 			mtree_destroy(&mmu->nested_revmap_mt);
 			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
--=20
2.43.0