[PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission

Gavin Shan posted 1 patch 2 years, 3 months ago
[PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Gavin Shan 2 years, 3 months ago
We observed a soft lockup on the host in a specific scenario: the
host, running on Ampere's Altra Max CPU, has a 64KB base page size,
while the guest has a 4KB base page size, 64 vCPUs and 13GB of memory.
The guest's memory is backed by 512MB huge pages via hugetlbfs. Before
the soft lockup is raised on the host, all 64 vCPUs trap into the host
simultaneously on permission faults, each requesting that execute
permission be added to the corresponding PMD entry. While handling
these parallel requests, the instruction cache for the 512MB huge page
is invalidated by mm_ops->icache_inval_pou() in stage2_attr_walker() on
64 hardware CPUs. Unfortunately, instruction cache invalidation on one
CPU interferes with that on another CPU at the hardware level. In the
worst case, mm_ops->icache_inval_pou() takes 37 seconds to finish.

So we can't scale out the handling of permission faults at will. They
need to be serialized to some extent with the help of an interval tree
that tracks the IPA ranges currently being serviced. For an incoming
permission fault, the vCPU is asked to bail out and retry if its IPA
range is already being serviced, since the vCPU can't proceed with its
execution anyway.

Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
Cc: stable@vger.kernel.org # v6.2+
Reported-by: Yihuang Yu <yihyu@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/kvm_host.h     |  4 ++
 arch/arm64/include/asm/kvm_pgtable.h  |  3 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  4 +-
 arch/arm64/kvm/hyp/pgtable.c          | 25 +++++++++---
 arch/arm64/kvm/mmu.c                  | 55 ++++++++++++++++++++++++++-
 5 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index d3dd05bbfe23..a457720b5caf 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -175,6 +175,10 @@ struct kvm_s2_mmu {
 	struct kvm_mmu_memory_cache split_page_cache;
 	uint64_t split_page_chunk_size;
 
+	/* Page fault ranges */
+	struct mutex		fault_ranges_mutex;
+	struct rb_root_cached	fault_ranges;
+
 	struct kvm_arch *arch;
 };
 
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 929d355eae0a..dca0bf81616f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -149,7 +149,8 @@ struct kvm_pgtable_mm_ops {
 	void*		(*phys_to_virt)(phys_addr_t phys);
 	phys_addr_t	(*virt_to_phys)(void *addr);
 	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
-	void		(*icache_inval_pou)(void *addr, size_t size);
+	int		(*icache_inval_pou)(struct kvm_s2_mmu *mmu,
+					    void *addr, u64 ipa, size_t size);
 };
 
 /**
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 9d703441278b..9bbe7c641770 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -223,10 +223,12 @@ static void clean_dcache_guest_page(void *va, size_t size)
 	hyp_fixmap_unmap();
 }
 
-static void invalidate_icache_guest_page(void *va, size_t size)
+static int invalidate_icache_guest_page(struct kvm_s2_mmu *mmu,
+					void *va, u64 ipa, size_t size)
 {
 	__invalidate_icache_guest_page(hyp_fixmap_map(__hyp_pa(va)), size);
 	hyp_fixmap_unmap();
+	return 0;
 }
 
 int kvm_guest_prepare_stage2(struct pkvm_hyp_vm *vm, void *pgd)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index f7a93ef29250..fabfdb4d1e00 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -875,6 +875,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
 	u64 granule = kvm_granule_size(ctx->level);
 	struct kvm_pgtable *pgt = data->mmu->pgt;
 	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+	int ret;
 
 	if (!stage2_leaf_mapping_allowed(ctx, data))
 		return -E2BIG;
@@ -903,8 +904,14 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
 					       granule);
 
 	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
-	    stage2_pte_executable(new))
-		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
+	    stage2_pte_executable(new)) {
+		ret = mm_ops->icache_inval_pou(data->mmu,
+					       kvm_pte_follow(new, mm_ops),
+					       ALIGN_DOWN(ctx->addr, granule),
+					       granule);
+		if (ret)
+			return ret;
+	}
 
 	stage2_make_pte(ctx, new);
 
@@ -1101,6 +1108,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 }
 
 struct stage2_attr_data {
+	struct kvm_s2_mmu		*mmu;
 	kvm_pte_t			attr_set;
 	kvm_pte_t			attr_clr;
 	kvm_pte_t			pte;
@@ -1113,6 +1121,8 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	kvm_pte_t pte = ctx->old;
 	struct stage2_attr_data *data = ctx->arg;
 	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+	u64 granule = kvm_granule_size(ctx->level);
+	int ret;
 
 	if (!kvm_pte_valid(ctx->old))
 		return -EAGAIN;
@@ -1133,9 +1143,13 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		 * stage-2 PTE if we are going to add executable permission.
 		 */
 		if (mm_ops->icache_inval_pou &&
-		    stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old))
-			mm_ops->icache_inval_pou(kvm_pte_follow(pte, mm_ops),
-						  kvm_granule_size(ctx->level));
+		    stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old)) {
+			ret = mm_ops->icache_inval_pou(data->mmu,
+					kvm_pte_follow(pte, mm_ops),
+					ALIGN_DOWN(ctx->addr, granule), granule);
+			if (ret)
+				return ret;
+		}
 
 		if (!stage2_try_set_pte(ctx, pte))
 			return -EAGAIN;
@@ -1152,6 +1166,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 	int ret;
 	kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
 	struct stage2_attr_data data = {
+		.mmu		= pgt->mmu,
 		.attr_set	= attr_set & attr_mask,
 		.attr_clr	= attr_clr & attr_mask,
 	};
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d3b4feed460c..a778f48beb56 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -267,9 +267,58 @@ static void clean_dcache_guest_page(void *va, size_t size)
 	__clean_dcache_guest_page(va, size);
 }
 
-static void invalidate_icache_guest_page(void *va, size_t size)
+static struct interval_tree_node *add_fault_range(struct kvm_s2_mmu *mmu,
+						  u64 ipa, size_t size)
 {
+	struct interval_tree_node *node;
+	unsigned long start = ipa, end = start + size - 1; /* inclusive */
+
+	mutex_lock(&mmu->fault_ranges_mutex);
+
+	node = interval_tree_iter_first(&mmu->fault_ranges, start, end);
+	if (node) {
+		node = NULL;
+		goto unlock;
+	}
+
+	node = kzalloc(sizeof(*node), GFP_KERNEL_ACCOUNT);
+	if (!node)
+		goto unlock;
+
+	node->start = start;
+	node->last = end;
+	interval_tree_insert(node, &mmu->fault_ranges);
+
+unlock:
+	mutex_unlock(&mmu->fault_ranges_mutex);
+	return node;
+}
+
+static void remove_fault_range(struct kvm_s2_mmu *mmu,
+			       struct interval_tree_node *node)
+{
+	mutex_lock(&mmu->fault_ranges_mutex);
+
+	interval_tree_remove(node, &mmu->fault_ranges);
+	kfree(node);
+
+	mutex_unlock(&mmu->fault_ranges_mutex);
+}
+
+
+static int invalidate_icache_guest_page(struct kvm_s2_mmu *mmu,
+					void *va, u64 ipa, size_t size)
+{
+	struct interval_tree_node *node;
+
+	node = add_fault_range(mmu, ipa, size);
+	if (!node)
+		return -EAGAIN;
+
 	__invalidate_icache_guest_page(va, size);
+	remove_fault_range(mmu, node);
+
+	return 0;
 }
 
 /*
@@ -859,6 +908,10 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
 	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
 
+	/* Initialize the page fault ranges */
+	mutex_init(&mmu->fault_ranges_mutex);
+	mmu->fault_ranges = RB_ROOT_CACHED;
+
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
 	return 0;
-- 
2.41.0
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Marc Zyngier 2 years, 3 months ago
Hi Gavin,

On Mon, 04 Sep 2023 08:28:26 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> We observed soft-lockup on the host in a specific scenario where
> the host on Ampere's Altra Max CPU has 64KB base page size and the
> guest has 4KB base page size, 64 vCPUs and 13GB memory. The guest's
> memory is backed by 512MB huge pages via hugetlbfs. All the 64 vCPUs
> are simultaneously trapped into the host due to permission page faults,
> to request adding the execution permission to the corresponding PMD
> entry, before the soft-lockup is raised on the host. On handling the
> parallel requests, the instruction cache for the 512MB huge page is
> invalidated by mm_ops->icache_inval_pou() in stage2_attr_walker() on
> 64 hardware CPUs. Unfortunately, the instruction cache invalidation
> on one CPU interfere with that on another CPU in the hardware level.
> It takes 37 seconds for mm_ops->icache_inval_pou() to finish in the
> worst case.

What really annoys me is that we keep piling all sorts of hacks just to
paper over the fact that this HW is absolutely terrible (cue Russell's
text replication patches). Why stick so many CPUs in a single system
if the interconnect is unable to scale?

We already have added non-shareable invalidation to cope with similar
issues, and this is only adding insult to injury. What does Ampere
say about this?

> 
> So we can't scale out to handle the permission faults at will.

The *HW* cannot scale. Bad HW.

> They
> need to be serialized to some extent with the help of a interval tree,
> to track IPA ranges, currently under service. For the incoming permission
> faults, the vCPU is asked to bail for a retry if its IPA range is being
> served since the vCPU can't proceed its execution.
> 
> Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
> Cc: stable@vger.kernel.org # v6.2+

Please drop these two tags. There is no correctness problem in what
you describe, only HW that does not scale.

> Reported-by: Yihuang Yu <yihyu@redhat.com>
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/include/asm/kvm_host.h     |  4 ++
>  arch/arm64/include/asm/kvm_pgtable.h  |  3 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  4 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 25 +++++++++---
>  arch/arm64/kvm/mmu.c                  | 55 ++++++++++++++++++++++++++-
>  5 files changed, 83 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index d3dd05bbfe23..a457720b5caf 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -175,6 +175,10 @@ struct kvm_s2_mmu {
>  	struct kvm_mmu_memory_cache split_page_cache;
>  	uint64_t split_page_chunk_size;
>  
> +	/* Page fault ranges */
> +	struct mutex		fault_ranges_mutex;
> +	struct rb_root_cached	fault_ranges;
> +

How is that going to work for NV, where we have multiple concurrent
IPA spaces for the same VM, each with its own stage-2?

And this begs another question: what happens when *another* VM does
the same thing? If the HW gets wedged by concurrent vcpus in the same
VM, it may get just as wedged by a concurrent VM...

What happens if the vcpus perform icache invalidation *in the guest*?

>  	struct kvm_arch *arch;
>  };
>  
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 929d355eae0a..dca0bf81616f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -149,7 +149,8 @@ struct kvm_pgtable_mm_ops {
>  	void*		(*phys_to_virt)(phys_addr_t phys);
>  	phys_addr_t	(*virt_to_phys)(void *addr);
>  	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
> -	void		(*icache_inval_pou)(void *addr, size_t size);
> +	int		(*icache_inval_pou)(struct kvm_s2_mmu *mmu,
> +					    void *addr, u64 ipa, size_t size);
>  };
>  
>  /**
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 9d703441278b..9bbe7c641770 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -223,10 +223,12 @@ static void clean_dcache_guest_page(void *va, size_t size)
>  	hyp_fixmap_unmap();
>  }
>  
> -static void invalidate_icache_guest_page(void *va, size_t size)
> +static int invalidate_icache_guest_page(struct kvm_s2_mmu *mmu,
> +					void *va, u64 ipa, size_t size)
>  {
>  	__invalidate_icache_guest_page(hyp_fixmap_map(__hyp_pa(va)), size);
>  	hyp_fixmap_unmap();
> +	return 0;
>  }
>  
>  int kvm_guest_prepare_stage2(struct pkvm_hyp_vm *vm, void *pgd)
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index f7a93ef29250..fabfdb4d1e00 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -875,6 +875,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>  	u64 granule = kvm_granule_size(ctx->level);
>  	struct kvm_pgtable *pgt = data->mmu->pgt;
>  	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> +	int ret;
>  
>  	if (!stage2_leaf_mapping_allowed(ctx, data))
>  		return -E2BIG;
> @@ -903,8 +904,14 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>  					       granule);
>  
>  	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
> -	    stage2_pte_executable(new))
> -		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
> +	    stage2_pte_executable(new)) {
> +		ret = mm_ops->icache_inval_pou(data->mmu,
> +					       kvm_pte_follow(new, mm_ops),
> +					       ALIGN_DOWN(ctx->addr, granule),
> +					       granule);
> +		if (ret)
> +			return ret;
> +	}
>  
>  	stage2_make_pte(ctx, new);
>  
> @@ -1101,6 +1108,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
>  }
>  
>  struct stage2_attr_data {
> +	struct kvm_s2_mmu		*mmu;
>  	kvm_pte_t			attr_set;
>  	kvm_pte_t			attr_clr;
>  	kvm_pte_t			pte;
> @@ -1113,6 +1121,8 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  	kvm_pte_t pte = ctx->old;
>  	struct stage2_attr_data *data = ctx->arg;
>  	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> +	u64 granule = kvm_granule_size(ctx->level);
> +	int ret;
>  
>  	if (!kvm_pte_valid(ctx->old))
>  		return -EAGAIN;
> @@ -1133,9 +1143,13 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  		 * stage-2 PTE if we are going to add executable permission.
>  		 */
>  		if (mm_ops->icache_inval_pou &&
> -		    stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old))
> -			mm_ops->icache_inval_pou(kvm_pte_follow(pte, mm_ops),
> -						  kvm_granule_size(ctx->level));
> +		    stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old)) {
> +			ret = mm_ops->icache_inval_pou(data->mmu,
> +					kvm_pte_follow(pte, mm_ops),
> +					ALIGN_DOWN(ctx->addr, granule), granule);
> +			if (ret)
> +				return ret;
> +		}
>  
>  		if (!stage2_try_set_pte(ctx, pte))
>  			return -EAGAIN;
> @@ -1152,6 +1166,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
>  	int ret;
>  	kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
>  	struct stage2_attr_data data = {
> +		.mmu		= pgt->mmu,
>  		.attr_set	= attr_set & attr_mask,
>  		.attr_clr	= attr_clr & attr_mask,
>  	};
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d3b4feed460c..a778f48beb56 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -267,9 +267,58 @@ static void clean_dcache_guest_page(void *va, size_t size)
>  	__clean_dcache_guest_page(va, size);
>  }
>  
> -static void invalidate_icache_guest_page(void *va, size_t size)
> +static struct interval_tree_node *add_fault_range(struct kvm_s2_mmu *mmu,
> +						  u64 ipa, size_t size)
>  {
> +	struct interval_tree_node *node;
> +	unsigned long start = ipa, end = start + size - 1; /* inclusive */

Maybe just rename 'ipa' to 'start'?

> +
> +	mutex_lock(&mmu->fault_ranges_mutex);

Don't we hold the rwlock at this stage, which actively prevents
things like taking a mutex? Have you tested this with lockdep?

> +
> +	node = interval_tree_iter_first(&mmu->fault_ranges, start, end);
> +	if (node) {
> +		node = NULL;
> +		goto unlock;
> +	}
> +
> +	node = kzalloc(sizeof(*node), GFP_KERNEL_ACCOUNT);
> +	if (!node)
> +		goto unlock;

That's not going to work -- why do you think we avoid all dynamic
allocation on this path?

> +
> +	node->start = start;
> +	node->last = end;
> +	interval_tree_insert(node, &mmu->fault_ranges);
> +
> +unlock:
> +	mutex_unlock(&mmu->fault_ranges_mutex);
> +	return node;
> +}
> +
> +static void remove_fault_range(struct kvm_s2_mmu *mmu,
> +			       struct interval_tree_node *node)
> +{
> +	mutex_lock(&mmu->fault_ranges_mutex);
> +
> +	interval_tree_remove(node, &mmu->fault_ranges);
> +	kfree(node);
> +
> +	mutex_unlock(&mmu->fault_ranges_mutex);
> +}
> +
> +
> +static int invalidate_icache_guest_page(struct kvm_s2_mmu *mmu,
> +					void *va, u64 ipa, size_t size)
> +{
> +	struct interval_tree_node *node;
> +
> +	node = add_fault_range(mmu, ipa, size);
> +	if (!node)
> +		return -EAGAIN;
> +
>  	__invalidate_icache_guest_page(va, size);
> +	remove_fault_range(mmu, node);

Given that we add and remove the node in the same function, why can't
we allocate the node on the stack instead?
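
A minimal userspace sketch of that idea, not code from this thread. The
names struct fault_range, struct fault_ranges and fake_icache_inval() are
hypothetical stand-ins; a linked list replaces the kernel interval tree to
keep it short, and a C11 atomic_flag spinlock stands in for whatever
non-sleeping lock would be acceptable under the rwlock. The node lives in
the caller's stack frame, so nothing is allocated on the fault path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct interval_tree_node. */
struct fault_range {
	uint64_t start;
	uint64_t last;			/* inclusive */
	struct fault_range *next;
};

/* Hypothetical stand-in for the per-MMU tracking state. */
struct fault_ranges {
	atomic_flag lock;		/* non-sleeping lock (assumption) */
	struct fault_range *head;
};

static unsigned long nr_invalidations;

/* Placeholder for the real cache maintenance; it only counts calls. */
static void fake_icache_inval(uint64_t ipa, size_t size)
{
	(void)ipa;
	(void)size;
	nr_invalidations++;
}

static void ranges_lock(struct fault_ranges *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void ranges_unlock(struct fault_ranges *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Returns false if [start, start + size - 1] overlaps a range in flight. */
static bool add_fault_range(struct fault_ranges *r, struct fault_range *node,
			    uint64_t start, size_t size)
{
	uint64_t last = start + size - 1;
	struct fault_range *it;
	bool ok = true;

	ranges_lock(r);
	for (it = r->head; it; it = it->next) {
		if (it->start <= last && start <= it->last) {
			ok = false;
			break;
		}
	}
	if (ok) {
		node->start = start;
		node->last = last;
		node->next = r->head;
		r->head = node;
	}
	ranges_unlock(r);
	return ok;
}

static void remove_fault_range(struct fault_ranges *r, struct fault_range *node)
{
	struct fault_range **pp;

	ranges_lock(r);
	for (pp = &r->head; *pp; pp = &(*pp)->next) {
		if (*pp == node) {
			*pp = node->next;
			break;
		}
	}
	ranges_unlock(r);
}

/* The node is on this function's stack and is unlinked before returning. */
static int invalidate_range(struct fault_ranges *r, uint64_t ipa, size_t size)
{
	struct fault_range node;

	if (!add_fault_range(r, &node, ipa, size))
		return -1;	/* caller retries, like -EAGAIN in the patch */

	fake_icache_inval(ipa, size);
	remove_fault_range(r, &node);
	return 0;
}
```

The on-stack node is only safe because it is removed from the list before
invalidate_range() returns; any path that returned early without unlinking
it would leave a dangling pointer behind.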

> +
> +	return 0;
>  }
>  
>  /*
> @@ -859,6 +908,10 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>  	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
>  	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
>  
> +	/* Initialize the page fault ranges */
> +	mutex_init(&mmu->fault_ranges_mutex);
> +	mmu->fault_ranges = RB_ROOT_CACHED;
> +
>  	mmu->pgt = pgt;
>  	mmu->pgd_phys = __pa(pgt->pgd);
>  	return 0;

I really wonder why we should add this complexity, rather than doing a
full i-cache invalidation instead. Given what you describe, I don't
think it would be worse.
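
For a rough sense of scale, here is a back-of-the-envelope sketch. It is
not from any patch in this thread, and both constants are assumptions: a
64-byte step per by-line invalidation, and a purely hypothetical cut-off
above which a single invalidate-all is taken to be cheaper than walking
the range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_512M			(512ULL << 20)
#define ASSUMED_LINE_SIZE	64ULL		/* assumption: bytes per by-line op */
#define ASSUMED_MAX_BY_LINE	(2ULL << 20)	/* hypothetical cut-off in bytes */

/* Number of by-line invalidation operations needed to cover size bytes. */
static uint64_t nr_line_ops(uint64_t size)
{
	return (size + ASSUMED_LINE_SIZE - 1) / ASSUMED_LINE_SIZE;
}

/* True when one invalidate-all is assumed cheaper than walking the range. */
static bool use_full_invalidation(uint64_t size)
{
	return size > ASSUMED_MAX_BY_LINE;
}
```

Under these assumptions a 512MB block needs 8388608 by-line operations on
each CPU, and 64 CPUs doing that concurrently issue 536870912 of them,
whereas a full invalidation is a single operation per CPU.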

	M.

-- 
Without deviation from the norm, progress is not possible.
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Gavin Shan 2 years, 3 months ago
Hi Marc,

On 9/4/23 18:22, Marc Zyngier wrote:
> On Mon, 04 Sep 2023 08:28:26 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>>
>> We observed soft-lockup on the host in a specific scenario where
>> the host on Ampere's Altra Max CPU has 64KB base page size and the
>> guest has 4KB base page size, 64 vCPUs and 13GB memory. The guest's
>> memory is backed by 512MB huge pages via hugetlbfs. All the 64 vCPUs
>> are simultaneously trapped into the host due to permission page faults,
>> to request adding the execution permission to the corresponding PMD
>> entry, before the soft-lockup is raised on the host. On handling the
>> parallel requests, the instruction cache for the 512MB huge page is
>> invalidated by mm_ops->icache_inval_pou() in stage2_attr_walker() on
>> 64 hardware CPUs. Unfortunately, the instruction cache invalidation
>> on one CPU interfere with that on another CPU in the hardware level.
>> It takes 37 seconds for mm_ops->icache_inval_pou() to finish in the
>> worst case.
> 
> What really annoys me is that we keep piling all sort of hacks just to
> paper over the fact that this HW is absolutely terrible (cue Russell's
> text replication patches). Why stick so many CPUs in a single system
> if the interconnect is unable to scale?
> 
> We already have added non-shareable invalidation to cope with similar
> issues, and this is only adding insult to injury. What does Ampere
> say about this?
> 

Could you please provide a link to Russell's text replication patches?
Another intention of posting this patch is to seek advice from the Ampere
folks. Maybe it's a known issue to them.

>>
>> So we can't scale out to handle the permission faults at will.
> 
> The *HW* cannot scale. Bad HW.
> 

Ok.

>> They
>> need to be serialized to some extent with the help of a interval tree,
>> to track IPA ranges, currently under service. For the incoming permission
>> faults, the vCPU is asked to bail for a retry if its IPA range is being
>> served since the vCPU can't proceed its execution.
>>
>> Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
>> Cc: stable@vger.kernel.org # v6.2+
> 
> Please drop these two tags. There is no correctness problem in what
> you describe, only HW that does not scale.
> 

Ok.

>> Reported-by: Yihuang Yu <yihyu@redhat.com>
>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   arch/arm64/include/asm/kvm_host.h     |  4 ++
>>   arch/arm64/include/asm/kvm_pgtable.h  |  3 +-
>>   arch/arm64/kvm/hyp/nvhe/mem_protect.c |  4 +-
>>   arch/arm64/kvm/hyp/pgtable.c          | 25 +++++++++---
>>   arch/arm64/kvm/mmu.c                  | 55 ++++++++++++++++++++++++++-
>>   5 files changed, 83 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index d3dd05bbfe23..a457720b5caf 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -175,6 +175,10 @@ struct kvm_s2_mmu {
>>   	struct kvm_mmu_memory_cache split_page_cache;
>>   	uint64_t split_page_chunk_size;
>>   
>> +	/* Page fault ranges */
>> +	struct mutex		fault_ranges_mutex;
>> +	struct rb_root_cached	fault_ranges;
>> +
> 
> How is that going to work for NV, where we have multiple concurrent
> IPA spaces for the same VM, each with its own stage-2?
> 
> And this begs another question: what happens when *another* VM does
> the same thing? If the HW gets wedged by concurrent vcpus in the same
> VM, it may get just as wedged by a concurrent VM...
> 
> What happens if the vcpus perform icache invalidation *in the guest*?
> 

Right, I forgot the NV case when the patch was posted. Frankly, I'm still
learning the spec to understand how NV works. I spent some time on the NV
patches posted a while ago. If I understood the code correctly, those
multiple concurrent IPA spaces (struct kvm_s2_mmu) are multiplexed to create
the shadow stage-2 page table. Since the fault ranges and the serialization
they impose only make sense to the host hypervisor, I don't think it's going
to work for the NV case.

   https://lore.kernel.org/kvmarm/20230515173103.1017669-28-maz@kernel.org/

>>   	struct kvm_arch *arch;
>>   };
>>   
>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
>> index 929d355eae0a..dca0bf81616f 100644
>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>> @@ -149,7 +149,8 @@ struct kvm_pgtable_mm_ops {
>>   	void*		(*phys_to_virt)(phys_addr_t phys);
>>   	phys_addr_t	(*virt_to_phys)(void *addr);
>>   	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
>> -	void		(*icache_inval_pou)(void *addr, size_t size);
>> +	int		(*icache_inval_pou)(struct kvm_s2_mmu *mmu,
>> +					    void *addr, u64 ipa, size_t size);
>>   };
>>   
>>   /**
>> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
>> index 9d703441278b..9bbe7c641770 100644
>> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
>> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
>> @@ -223,10 +223,12 @@ static void clean_dcache_guest_page(void *va, size_t size)
>>   	hyp_fixmap_unmap();
>>   }
>>   
>> -static void invalidate_icache_guest_page(void *va, size_t size)
>> +static int invalidate_icache_guest_page(struct kvm_s2_mmu *mmu,
>> +					void *va, u64 ipa, size_t size)
>>   {
>>   	__invalidate_icache_guest_page(hyp_fixmap_map(__hyp_pa(va)), size);
>>   	hyp_fixmap_unmap();
>> +	return 0;
>>   }
>>   
>>   int kvm_guest_prepare_stage2(struct pkvm_hyp_vm *vm, void *pgd)
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index f7a93ef29250..fabfdb4d1e00 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -875,6 +875,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>>   	u64 granule = kvm_granule_size(ctx->level);
>>   	struct kvm_pgtable *pgt = data->mmu->pgt;
>>   	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
>> +	int ret;
>>   
>>   	if (!stage2_leaf_mapping_allowed(ctx, data))
>>   		return -E2BIG;
>> @@ -903,8 +904,14 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>>   					       granule);
>>   
>>   	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
>> -	    stage2_pte_executable(new))
>> -		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
>> +	    stage2_pte_executable(new)) {
>> +		ret = mm_ops->icache_inval_pou(data->mmu,
>> +					       kvm_pte_follow(new, mm_ops),
>> +					       ALIGN_DOWN(ctx->addr, granule),
>> +					       granule);
>> +		if (ret)
>> +			return ret;
>> +	}
>>   
>>   	stage2_make_pte(ctx, new);
>>   
>> @@ -1101,6 +1108,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
>>   }
>>   
>>   struct stage2_attr_data {
>> +	struct kvm_s2_mmu		*mmu;
>>   	kvm_pte_t			attr_set;
>>   	kvm_pte_t			attr_clr;
>>   	kvm_pte_t			pte;
>> @@ -1113,6 +1121,8 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
>>   	kvm_pte_t pte = ctx->old;
>>   	struct stage2_attr_data *data = ctx->arg;
>>   	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
>> +	u64 granule = kvm_granule_size(ctx->level);
>> +	int ret;
>>   
>>   	if (!kvm_pte_valid(ctx->old))
>>   		return -EAGAIN;
>> @@ -1133,9 +1143,13 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
>>   		 * stage-2 PTE if we are going to add executable permission.
>>   		 */
>>   		if (mm_ops->icache_inval_pou &&
>> -		    stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old))
>> -			mm_ops->icache_inval_pou(kvm_pte_follow(pte, mm_ops),
>> -						  kvm_granule_size(ctx->level));
>> +		    stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old)) {
>> +			ret = mm_ops->icache_inval_pou(data->mmu,
>> +					kvm_pte_follow(pte, mm_ops),
>> +					ALIGN_DOWN(ctx->addr, granule), granule);
>> +			if (ret)
>> +				return ret;
>> +		}
>>   
>>   		if (!stage2_try_set_pte(ctx, pte))
>>   			return -EAGAIN;
>> @@ -1152,6 +1166,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
>>   	int ret;
>>   	kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
>>   	struct stage2_attr_data data = {
>> +		.mmu		= pgt->mmu,
>>   		.attr_set	= attr_set & attr_mask,
>>   		.attr_clr	= attr_clr & attr_mask,
>>   	};
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index d3b4feed460c..a778f48beb56 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -267,9 +267,58 @@ static void clean_dcache_guest_page(void *va, size_t size)
>>   	__clean_dcache_guest_page(va, size);
>>   }
>>   
>> -static void invalidate_icache_guest_page(void *va, size_t size)
>> +static struct interval_tree_node *add_fault_range(struct kvm_s2_mmu *mmu,
>> +						  u64 ipa, size_t size)
>>   {
>> +	struct interval_tree_node *node;
>> +	unsigned long start = ipa, end = start + size - 1; /* inclusive */
> 
> Maybe just rename 'ipa' to 'start'?
> 

Yes.

>> +
>> +	mutex_lock(&mmu->fault_ranges_mutex);
> 
> Don't we hold the rwlock at this at this stage, which actively
> prevents things like taking a mutex? Have you tested this with
> lockdep?
> 

The rwlock (kvm->mmu_lock) is held by readers or writers, depending on
the function call path, and you're right, a lockdep issue exists:

[  161.600875] =============================
[  161.604872] [ BUG: Invalid wait context ]
[  161.608870] 6.5.0-gavin+ #11 Not tainted
[  161.612781] -----------------------------
[  161.616778] qemu-system-aar/3598 is trying to lock:
[  161.621644] ffff8000b0947dd0 (&mmu->fault_ranges_mutex){....}-{3:3}, at: invalidate_icache_guest_page+0x48/0x160
[  161.631813] other info that might help us debug this:
[  161.636852] context-{4:4}
[  161.639460] 4 locks held by qemu-system-aar/3598:
[  161.644151]  #0: ffff3fff93b1b1e8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0xa08
[  161.652494]  #1: ffff8000b0948660 (&kvm->srcu){.+.+}-{0:0}, at: kvm_handle_guest_abort+0xf0/0x4f8
[  161.661357]  #2: ffff8000b0947018 (&(kvm)->mmu_lock){++++}-{2:2}, at: user_mem_abort+0x1b4/0x7d8
[  161.670132]  #3: ffffc23c11b09230 (rcu_read_lock){....}-{1:2}, at: kvm_pgtable_walk+0x114/0x228
[  161.678821] stack backtrace:
[  161.681690] CPU: 133 PID: 3598 Comm: qemu-system-aar Not tainted 6.5.0-gavin+ #11
[  161.689160] Hardware name: FOXCONN Mt. Collins/Mt. Collins, BIOS 0ACOC017 (SCP: 1.08.20210825) 10/22/2021
[  161.698713] Call trace:
[  161.701148]  dump_backtrace+0x9c/0x120
[  161.704886]  show_stack+0x1c/0x30
[  161.708189]  dump_stack_lvl+0xe0/0x168
[  161.711927]  dump_stack+0x14/0x20
[  161.715230]  __lock_acquire+0x828/0x968
[  161.719056]  lock_acquire.part.0+0xec/0x268
[  161.723227]  lock_acquire+0x94/0x150
[  161.726791]  __mutex_lock+0x98/0x830
[  161.730355]  mutex_lock_nested+0x28/0x38
[  161.734266]  invalidate_icache_guest_page+0x48/0x160
[  161.739219]  stage2_map_walker_try_leaf+0x140/0x1e8
[  161.744085]  stage2_map_walker+0xf8/0x1a0
[  161.748082]  kvm_pgtable_visitor_cb.isra.0+0x3c/0x78
[  161.753035]  __kvm_pgtable_visit+0x188/0x268
[  161.757292]  __kvm_pgtable_walk+0x90/0xc8
[  161.761290]  __kvm_pgtable_visit+0xb4/0x268
[  161.765461]  __kvm_pgtable_walk+0x90/0xc8
[  161.769458]  kvm_pgtable_walk+0xd4/0x228
[  161.773369]  kvm_pgtable_stage2_map+0x110/0x130
[  161.777888]  user_mem_abort+0x67c/0x7d8
[  161.781712]  kvm_handle_guest_abort+0x3e4/0x4f8
[  161.786231]  handle_exit+0x70/0x1c8
[  161.789709]  kvm_arch_vcpu_ioctl_run+0x29c/0x668
[  161.794315]  kvm_vcpu_ioctl+0x300/0xa08
[  161.798139]  __arm64_sys_ioctl+0xa8/0xf0
[  161.802051]  invoke_syscall.constprop.0+0x7c/0xd0
[  161.806745]  do_el0_svc+0xb4/0xd0
[  161.810049]  el0_svc+0x68/0x278
[  161.813179]  el0t_64_sync_handler+0x134/0x150
[  161.817524]  el0t_64_sync+0x17c/0x180


>> +
>> +	node = interval_tree_iter_first(&mmu->fault_ranges, start, end);
>> +	if (node) {
>> +		node = NULL;
>> +		goto unlock;
>> +	}
>> +
>> +	node = kzalloc(sizeof(*node), GFP_KERNEL_ACCOUNT);
>> +	if (!node)
>> +		goto unlock;
> 
> That's not going to work -- why do you think we avoid all dynamic
> allocation on this path?
> 

Right, dynamic allocation should be avoided in this path, and the node can
be put on the stack, as you suggested later.

>> +
>> +	node->start = start;
>> +	node->last = end;
>> +	interval_tree_insert(node, &mmu->fault_ranges);
>> +
>> +unlock:
>> +	mutex_unlock(&mmu->fault_ranges_mutex);
>> +	return node;
>> +}
>> +
>> +static void remove_fault_range(struct kvm_s2_mmu *mmu,
>> +			       struct interval_tree_node *node)
>> +{
>> +	mutex_lock(&mmu->fault_ranges_mutex);
>> +
>> +	interval_tree_remove(node, &mmu->fault_ranges);
>> +	kfree(node);
>> +
>> +	mutex_unlock(&mmu->fault_ranges_mutex);
>> +}
>> +
>> +
>> +static int invalidate_icache_guest_page(struct kvm_s2_mmu *mmu,
>> +					void *va, u64 ipa, size_t size)
>> +{
>> +	struct interval_tree_node *node;
>> +
>> +	node = add_fault_range(mmu, ipa, size);
>> +	if (!node)
>> +		return -EAGAIN;
>> +
>>   	__invalidate_icache_guest_page(va, size);
>> +	remove_fault_range(mmu, node);
> 
> Given that we add and remove the node in the same function, why can't
> we allocate the node on the stack instead?
> 

Yes, definitely.

>> +
>> +	return 0;
>>   }
>>   
>>   /*
>> @@ -859,6 +908,10 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>   	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
>>   	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
>>   
>> +	/* Initialize the page fault ranges */
>> +	mutex_init(&mmu->fault_ranges_mutex);
>> +	mmu->fault_ranges = RB_ROOT_CACHED;
>> +
>>   	mmu->pgt = pgt;
>>   	mmu->pgd_phys = __pa(pgt->pgd);
>>   	return 0;
> 
> I really wonder why we should add this complexity, rather than doing a
> full i-cache invalidation instead. Given what you describe, I don't
> think it would be worse.
> 

Yes, Oliver already shared the patch and it worked for me. So please ignore
this patch since the code suggested by Oliver is more sensible and robust,
especially for the NV case.

Thanks,
Gavin
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Oliver Upton 2 years, 3 months ago
Gavin,

On Mon, Sep 04, 2023 at 05:28:26PM +1000, Gavin Shan wrote:
> We observed soft-lockup on the host in a specific scenario where
> the host on Ampere's Altra Max CPU has 64KB base page size and the
> guest has 4KB base page size, 64 vCPUs and 13GB memory. The guest's
> memory is backed by 512MB huge pages via hugetlbfs. All the 64 vCPUs
> are simultaneously trapped into the host due to permission page faults,
> to request adding the execution permission to the corresponding PMD
> entry, before the soft-lockup is raised on the host. On handling the
> parallel requests, the instruction cache for the 512MB huge page is
> invalidated by mm_ops->icache_inval_pou() in stage2_attr_walker() on
> 64 hardware CPUs. Unfortunately, the instruction cache invalidation
> on one CPU interfere with that on another CPU in the hardware level.
> It takes 37 seconds for mm_ops->icache_inval_pou() to finish in the
> worst case.
> 
> So we can't scale out to handle the permission faults at will. They
> need to be serialized to some extent with the help of a interval tree,

Parallel permission faults are not the cause of the soft lockups
you observe. The real issue is the volume of invalidations that are
happening under the hood.

Take a look at __invalidate_icache_guest_page() -- we invalidate the
icache by VA regardless of the size of the range. 512MB / 64B = 8388608
invalidation operations. Yes, multiple threads doing these invalidations
in parallel makes the issue more pronounced as they bottleneck at the
Miscellaneous node in the interconnect, but we should really do
something about our invalidation strategy instead.

The approach you propose adds a fairly complex serialization mechanic
_and_ unfairly penalizes systems that do not require explicit icache
invalidation (i.e. FEAT_DIC).

I have a patch for the invalidation issue that I've been needing to
send out for a while, could you please give this a go and see if it
addresses the soft lockups you observe? If so, I can clean it up and
send it as a patch. At minimum, MAX_TLBI_OPS needs to be renamed to hint
at the common thread (DVM) between I$ and TLB invalidations.

--
Thanks,
Oliver

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 96a80e8f6226..fd23644c9988 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -117,6 +117,7 @@ alternative_cb_end
 #include <asm/cache.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
+#include <asm/tlbflush.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_host.h>
 
@@ -224,15 +225,38 @@ static inline void __clean_dcache_guest_page(void *va, size_t size)
 	kvm_flush_dcache_to_poc(va, size);
 }
 
+static inline u32 __icache_line_size(void)
+{
+	u8 iminline;
+	u64 ctr;
+
+	asm volatile(ALTERNATIVE_CB("movz %0, #0\n"
+				    "movk %0, #0, lsl #16\n"
+				    "movk %0, #0, lsl #32\n"
+				    "movk %0, #0, lsl #48\n",
+				    ARM64_ALWAYS_SYSTEM,
+				    kvm_compute_final_ctr_el0)
+		     : "=r" (ctr));
+
+	iminline = SYS_FIELD_GET(CTR_EL0, IminLine, ctr);
+	return 4 << iminline;
+}
+
 static inline void __invalidate_icache_guest_page(void *va, size_t size)
 {
+	size_t nr_lines = size / __icache_line_size();
+
 	if (icache_is_aliasing()) {
 		/* any kind of VIPT cache */
 		icache_inval_all_pou();
 	} else if (read_sysreg(CurrentEL) != CurrentEL_EL1 ||
 		   !icache_is_vpipt()) {
 		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
-		icache_inval_pou((unsigned long)va, (unsigned long)va + size);
+		if (nr_lines > MAX_TLBI_OPS)
+			icache_inval_all_pou();
+		else
+			icache_inval_pou((unsigned long)va,
+					 (unsigned long)va + size);
 	}
 }
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Gavin Shan 2 years, 3 months ago
Hi Oliver,

On 9/4/23 18:04, Oliver Upton wrote:
> On Mon, Sep 04, 2023 at 05:28:26PM +1000, Gavin Shan wrote:
>> We observed soft-lockup on the host in a specific scenario where
>> the host on Ampere's Altra Max CPU has 64KB base page size and the
>> guest has 4KB base page size, 64 vCPUs and 13GB memory. The guest's
>> memory is backed by 512MB huge pages via hugetlbfs. All the 64 vCPUs
>> are simultaneously trapped into the host due to permission page faults,
>> to request adding the execution permission to the corresponding PMD
>> entry, before the soft-lockup is raised on the host. On handling the
>> parallel requests, the instruction cache for the 512MB huge page is
>> invalidated by mm_ops->icache_inval_pou() in stage2_attr_walker() on
>> 64 hardware CPUs. Unfortunately, the instruction cache invalidation
>> on one CPU interfere with that on another CPU in the hardware level.
>> It takes 37 seconds for mm_ops->icache_inval_pou() to finish in the
>> worst case.
>>
>> So we can't scale out to handle the permission faults at will. They
>> need to be serialized to some extent with the help of a interval tree,
> 
> Parallel permission faults is not the cause of the soft lockups
> you observe. The real issue is the volume of invalidations that are
> happening under the hood.
> 
> Take a look at __invalidate_icache_guest_page() -- we invalidate the
> icache by VA regardless of the size of the range. 512M / 64b = 8388608
> invalidation operations. Yes, multiple threads doing these invalidations
> in parallel makes the issue more pronounced as they bottleneck at the
> Miscellaneous node in the interconnect, but we should really do
> something about our invalidation strategy instead.
> 
> The approach you propose adds a fairly complex serialization mechanic
> _and_ unfairly penalizes systems that do not require explicit icache
> invalidation (i.e. FEAT_DIC).
> 
> I have a patch for the invalidation issue that I've been needing to
> send out for a while, could you please give this a go and see if it
> addresses the soft lockups you observe? If so, I can clean it up and
> send it as a patch. At minimum, MAX_TLBI_OPS needs to be renamed to hint
> at the common thread (DVM) between I$ and TLB invalidations.
> 

I generally agree that the issue is caused by a hardware limitation:
invalidating the instruction cache for a 512MB huge page takes too long,
and parallel invalidation on multiple CPUs makes it worse. The issue was
actually reported against our downstream kernel after support for parallel
page fault handling was included. That is to say, we didn't see the issue
(soft lockup on the host) when the page fault handlers were serialized.

Yeah, my patch is too complex and it disregards FEAT_DIC. FEAT_DIC isn't
available on Ampere's Altra and Altra Max CPUs, though FEAT_IDC is supported
on both models.

Thanks a lot for your patch, which is much simpler. With it applied, I no
longer see the soft lockup. However, we potentially have an icache thrashing
issue since icache_inval_all_pou() invalidates all icache lines, so I think
it's critical to choose a sensible threshold (MAX_TLBI_OPS). I measured the
time consumed by various operations on Ampere's Altra and Altra Max models,
shown below, which may help in choosing that threshold.

   Operation                     Altra    Altra Max
   ------------------------------------------------
   icache_inval_all_pou          153us         71us
   icache_inval_pou(64KB)         18us          8us
   icache_inval_pou(512MB)   1130744us     579132us

> --
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 96a80e8f6226..fd23644c9988 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -117,6 +117,7 @@ alternative_cb_end
>   #include <asm/cache.h>
>   #include <asm/cacheflush.h>
>   #include <asm/mmu_context.h>
> +#include <asm/tlbflush.h>
>   #include <asm/kvm_emulate.h>
>   #include <asm/kvm_host.h>
>   
> @@ -224,15 +225,38 @@ static inline void __clean_dcache_guest_page(void *va, size_t size)
>   	kvm_flush_dcache_to_poc(va, size);
>   }
>   
> +static inline u32 __icache_line_size(void)
> +{
> +	u8 iminline;
> +	u64 ctr;
> +
> +	asm volatile(ALTERNATIVE_CB("movz %0, #0\n"
> +				    "movk %0, #0, lsl #16\n"
> +				    "movk %0, #0, lsl #32\n"
> +				    "movk %0, #0, lsl #48\n",
> +				    ARM64_ALWAYS_SYSTEM,
> +				    kvm_compute_final_ctr_el0)
> +		     : "=r" (ctr));
> +
> +	iminline = SYS_FIELD_GET(CTR_EL0, IminLine, ctr);
> +	return 4 << iminline;
> +}
> +
>   static inline void __invalidate_icache_guest_page(void *va, size_t size)
>   {
> +	size_t nr_lines = size / __icache_line_size();
> +
>   	if (icache_is_aliasing()) {
>   		/* any kind of VIPT cache */
>   		icache_inval_all_pou();
>   	} else if (read_sysreg(CurrentEL) != CurrentEL_EL1 ||
>   		   !icache_is_vpipt()) {
>   		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
> -		icache_inval_pou((unsigned long)va, (unsigned long)va + size);
> +		if (nr_lines > MAX_TLBI_OPS)
> +			icache_inval_all_pou();
> +		else
> +			icache_inval_pou((unsigned long)va,
> +					 (unsigned long)va + size);
>   	}
>   }
>   

I'm not sure it's worth pulling @iminline from CTR_EL0 since the line size is
almost always 64 bytes. @size is guaranteed to be PAGE_SIZE or PMD_SIZE aligned.
Maybe we can just aggressively do something like below, disregarding the icache
thrashing. That way the code is simplified further.

     if (size > PAGE_SIZE) {
         icache_inval_all_pou();
     } else {
         icache_inval_pou((unsigned long)va,
                          (unsigned long)va + size);
     }   // braces are still needed

---

I'd like to take the chance to ask one question that isn't related to this issue.
It seems we handle icache/dcache coherence differently for stage-1 and stage-2
page table entries. Why don't we need to clean the dcache for stage-2, as we do
in the stage-1 case?

   // stage-1 page table                                       // stage-2 page table
   __set_pte_at                                                invalidate_icache_guest_page
   __sync_icache_dcache                                        __invalidate_icache_guest_page
   sync_icache_aliases                                         icache_inval_pou
   caches_clean_inval_pou                                        invalidate_icache_by_line  // !ARM64_HAS_CACHE_DIC
   caches_clean_inval_pou_macro
     dcache_by_line_op          // !ARM64_HAS_CACHE_IDC
     invalidate_icache_by_line  // !ARM64_HAS_CACHE_DIC

Thanks,
Gavin
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Oliver Upton 2 years, 3 months ago
On Tue, Sep 05, 2023 at 10:06:14AM +1000, Gavin Shan wrote:

[...]

> >   static inline void __invalidate_icache_guest_page(void *va, size_t size)
> >   {
> > +	size_t nr_lines = size / __icache_line_size();
> > +
> >   	if (icache_is_aliasing()) {
> >   		/* any kind of VIPT cache */
> >   		icache_inval_all_pou();
> >   	} else if (read_sysreg(CurrentEL) != CurrentEL_EL1 ||
> >   		   !icache_is_vpipt()) {
> >   		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
> > -		icache_inval_pou((unsigned long)va, (unsigned long)va + size);
> > +		if (nr_lines > MAX_TLBI_OPS)
> > +			icache_inval_all_pou();
> > +		else
> > +			icache_inval_pou((unsigned long)va,
> > +					 (unsigned long)va + size);
> >   	}
> >   }
> 
> I'm not sure if it's worthy to pull the @iminline from CTR_EL0 since it's almost
> fixed to 64-bytes. 

I firmly disagree. The architecture allows implementers to select a
different minimum line size, and non-64B systems _do_ exist in the wild.
Furthermore, some implementers have decided to glue together cores with
mismatched line sizes too...

Though we could avoid some headache by normalizing on 64B, the cold
reality of the ecosystem requires that we go out of our way to
accommodate ~any design choice allowed by the architecture.

> @size is guranteed to be PAGE_SIZE or PMD_SIZE aligned. Maybe
> we can just aggressively do something like below, disregarding the icache thrashing.
> In this way, the code is further simplified.
> 
>     if (size > PAGE_SIZE) {
>         icache_inval_all_pou();
>     } else {
>         icache_inval_pou((unsigned long)va,
>                          (unsigned long)va + size);
>     }                                                          // parantheses is still needed

This could work too but we already have a kernel heuristic for limiting
the amount of broadcast invalidations, which is MAX_TLBI_OPS. I don't
want to introduce a second, KVM-specific hack to address the exact same
thing.

> I'm leveraging the chance to ask one question, which isn't related to the issue.
> It seems we're doing the icache/dcache coherence differently for stage1 and stage-2
> page table entries. The question is why we needn't to clean the dcache for stage-2,
> as we're doing for the stage-1 case?

KVM always does its required dcache maintenance (if any) on the first
translation abort to a given IPA. On systems w/o FEAT_DIC, we lazily
grant execute permissions as an optimization to avoid unnecessary icache
invalidations, which as you've seen tends to be a bit of a sore spot.

Between the two faults, we've effectively guaranteed that any
host-initiated writes to the PA are visible to the guest on both the I
and D side. Any CMOs for making guest-initiated writes coherent after
the translation fault are the sole responsibility of the guest.

-- 
Thanks,
Oliver
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Gavin Shan 2 years, 3 months ago
On 9/6/23 04:06, Oliver Upton wrote:
> On Tue, Sep 05, 2023 at 10:06:14AM +1000, Gavin Shan wrote:
> 
> [...]
> 
>>>    static inline void __invalidate_icache_guest_page(void *va, size_t size)
>>>    {
>>> +	size_t nr_lines = size / __icache_line_size();
>>> +
>>>    	if (icache_is_aliasing()) {
>>>    		/* any kind of VIPT cache */
>>>    		icache_inval_all_pou();
>>>    	} else if (read_sysreg(CurrentEL) != CurrentEL_EL1 ||
>>>    		   !icache_is_vpipt()) {
>>>    		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
>>> -		icache_inval_pou((unsigned long)va, (unsigned long)va + size);
>>> +		if (nr_lines > MAX_TLBI_OPS)
>>> +			icache_inval_all_pou();
>>> +		else
>>> +			icache_inval_pou((unsigned long)va,
>>> +					 (unsigned long)va + size);
>>>    	}
>>>    }
>>
>> I'm not sure if it's worthy to pull the @iminline from CTR_EL0 since it's almost
>> fixed to 64-bytes.
> 
> I firmly disagree. The architecture allows implementers to select a
> different minimum line size, and non-64b systems _do_ exist in the wild.
> Furthermore, some implementers have decided to glue together cores with
> mismatched line sizes too...
> 
> Though we could avoid some headache by normalizing on 64b, the cold
> reality of the ecosystem requires that we go out of our way to
> accomodate ~any design choice allowed by the architecture.
> 

It seems I didn't make myself clear enough. My concern is that, with your
change, ctr_el0 is read twice in the following path, though I doubt anybody
cares. Since it's a hot path, every bit of performance gain counts.

   invalidate_icache_guest_page
   __invalidate_icache_guest_page   // first read on ctr_el0, with your code changes
   icache_inval_pou(va, va + size)
   invalidate_icache_by_line
     icache_line_size               // second read on ctr_el0


>> @size is guranteed to be PAGE_SIZE or PMD_SIZE aligned. Maybe
>> we can just aggressively do something like below, disregarding the icache thrashing.
>> In this way, the code is further simplified.
>>
>>      if (size > PAGE_SIZE) {
>>          icache_inval_all_pou();
>>      } else {
>>          icache_inval_pou((unsigned long)va,
>>                           (unsigned long)va + size);
>>      }                                                          // parantheses is still needed
> 
> This could work too but we already have a kernel heuristic for limiting
> the amount of broadcast invalidations, which is MAX_TLBI_OPS. I don't
> want to introduce a second, KVM-specific hack to address the exact same
> thing.
> 

Ok. I was confused at first glance since the TLB isn't related to the icache.
I think it's fine to reuse MAX_TLBI_OPS here, but a comment may be needed.
Oliver, could you please send a formal patch for your changes?

>> I'm leveraging the chance to ask one question, which isn't related to the issue.
>> It seems we're doing the icache/dcache coherence differently for stage1 and stage-2
>> page table entries. The question is why we needn't to clean the dcache for stage-2,
>> as we're doing for the stage-1 case?
> 
> KVM always does its required dcache maintenance (if any) on the first
> translation abort to a given IPA. On systems w/o FEAT_DIC, we lazily
> grant execute permissions as an optimization to avoid unnecessary icache
> invalidations, which as you've seen tends to be a bit of a sore spot.
> 
> Between the two faults, we've effectively guaranteed that any
> host-initiated writes to the PA are visible to the guest on both the I
> and D side. Any CMOs for making guest-initiated writes coherent after
> the translation fault are the sole responsibility of the guest.
> 

Nice, thanks a lot for the explanation.

Thanks,
Gavin
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Oliver Upton 2 years, 3 months ago
Gavin,

On Wed, Sep 06, 2023 at 08:26:24AM +1000, Gavin Shan wrote:

[...]

> It seems I didn't make it clear enough. The reason why I had the concern
> to avoid reading ctr_el0 is we read ctr_el0 for twice in the following path,
> but I doubt if anybody cares. Since it's a hot path, each bit of performance
> gain will count.
> 
>   invalidate_icache_guest_page
>   __invalidate_icache_guest_page   // first read on ctr_el0, with your code changes
>   icache_inval_pou(va, va + size)
>   invalidate_icache_by_line
>     icache_line_size               // second read on ctr_el0

That can be addressed by shoving the check deep into
invalidate_icache_by_line, which would benefit _all_ use cases of
I-cache invalidation by VA. I haven't completely made up my mind about
that, though, because of the consequences of a global invalidation.

> > > @size is guranteed to be PAGE_SIZE or PMD_SIZE aligned. Maybe
> > > we can just aggressively do something like below, disregarding the icache thrashing.
> > > In this way, the code is further simplified.
> > > 
> > >      if (size > PAGE_SIZE) {
> > >          icache_inval_all_pou();
> > >      } else {
> > >          icache_inval_pou((unsigned long)va,
> > >                           (unsigned long)va + size);
> > >      }                                                          // parantheses is still needed
> > 
> > This could work too but we already have a kernel heuristic for limiting
> > the amount of broadcast invalidations, which is MAX_TLBI_OPS. I don't
> > want to introduce a second, KVM-specific hack to address the exact same
> > thing.
> > 
> 
> Ok. I was confused at the first glance since TLB isn't relevant to icache.
> I think it's fine to reuse MAX_TLBI_OPS here, but a comment may be needed.
> Oliver, could you please send a formal patch for your changes?

Yeah, I think I may have said it before, but this thing needs to be
called 'MAX_DVM_OPS'. I-cache invalidations and TLB invalidations become
DVMOps (Distributed Virtual Memory) in terms of CHI, which pile up at the
miscellaneous node in the mesh.

Give me a day or two to convince myself of the right way to go about
this and I'll send out what I have.

-- 
Thanks,
Oliver
Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
Posted by Gavin Shan 2 years, 3 months ago
Hi Oliver,

On 9/7/23 02:29, Oliver Upton wrote:
> On Wed, Sep 06, 2023 at 08:26:24AM +1000, Gavin Shan wrote:
> 
> [...]
> 
>> It seems I didn't make it clear enough. The reason why I had the concern
>> to avoid reading ctr_el0 is we read ctr_el0 for twice in the following path,
>> but I doubt if anybody cares. Since it's a hot path, each bit of performance
>> gain will count.
>>
>>    invalidate_icache_guest_page
>>    __invalidate_icache_guest_page   // first read on ctr_el0, with your code changes
>>    icache_inval_pou(va, va + size)
>>    invalidate_icache_by_line
>>      icache_line_size               // second read on ctr_el0
> 
> That can be addressed by shoving the check deep into
> invalidate_icache_by_line, which would benefit _all_ use cases of
> I-cache invalidation by VA. I haven't completely made up my mind about
> that, though, because of the consequences of a global invalidation.
> 

Yes, of course.

>>>> @size is guranteed to be PAGE_SIZE or PMD_SIZE aligned. Maybe
>>>> we can just aggressively do something like below, disregarding the icache thrashing.
>>>> In this way, the code is further simplified.
>>>>
>>>>       if (size > PAGE_SIZE) {
>>>>           icache_inval_all_pou();
>>>>       } else {
>>>>           icache_inval_pou((unsigned long)va,
>>>>                            (unsigned long)va + size);
>>>>       }                                                          // parantheses is still needed
>>>
>>> This could work too but we already have a kernel heuristic for limiting
>>> the amount of broadcast invalidations, which is MAX_TLBI_OPS. I don't
>>> want to introduce a second, KVM-specific hack to address the exact same
>>> thing.
>>>
>>
>> Ok. I was confused at the first glance since TLB isn't relevant to icache.
>> I think it's fine to reuse MAX_TLBI_OPS here, but a comment may be needed.
>> Oliver, could you please send a formal patch for your changes?
> 
> Yeah, I think I may have said it before, but this thing needs to be
> called 'MAX_DVM_OPS'. I-cache invalidations and TLB invalidations become
> DVMOps (Distributed Virtual Memory) in terms of CHI, which pile up at the
> miscellaneous node in the mesh.
> 
> Give me a day or two to convince myself of the right way to go about
> this and I'll send out what I have.
> 

Ok. 'MAX_DVM_OPS' sounds good, and the term is new to me anyway. Oliver,
please let me know if you don't have time for this and need me to post
the formal patches based on your code :)

Thanks,
Gavin