[RFC PATCH 0/1] KVM-arm: Optimize cache flush by only flushing on vcpu0

Posted by Jiayuan Liang 8 months ago
This is an RFC patch to optimize cache flushing behavior in KVM/arm64.

When toggling cache state in a multi-vCPU guest, we currently flush the VM's
stage2 page tables on every vCPU that transitions cache state. This leads to
redundant cache flushes during guest boot, as each vCPU performs the same
flush operation.

In a typical guest boot sequence, vcpu0 is the first to enable caches, and
other vCPUs follow afterward. By the time secondary vCPUs enable their caches,
the flush performed by vcpu0 has already ensured cache coherency for the
entire VM.

I propose to optimize this by performing the stage2_flush_vm() operation only
on vcpu0, which is sufficient to maintain cache coherency while eliminating
redundant flushes on the other vCPUs. This can improve performance during
guest boot in multi-vCPU configurations.

I'm submitting this as RFC because:
1. This is my first contribution to the KVM/arm64 subsystem
2. I want to confirm if this approach is architecturally sound
3. I'd like feedback on potential corner cases I may have missed:
   - Could there be scenarios where secondary vCPUs need their own flushes?
   - Is the assumption about vcpu0 always being first valid?

Implementation details:
- The patch identifies vcpu0 by checking if vcpu->vcpu_id == 0

Testing with a 64-core VM with 128GB memory using hugepages shows dramatic
performance improvements, reducing busybox boot time from 33s to 5s.

I'd appreciate any feedback on the correctness and approach of this optimization.

Jiayuan Liang (1):
  KVM: arm: Optimize cache flush by only flushing on vcpu0

 arch/arm64/kvm/mmu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)


base-commit: fc96b232f8e7c0a6c282f47726b2ff6a5fb341d2
--
2.43.0
Re: [RFC PATCH 0/1] KVM-arm: Optimize cache flush by only flushing on vcpu0
Posted by Marc Zyngier 8 months ago
On Fri, 18 Apr 2025 11:22:43 +0100,
Jiayuan Liang <ljykernel@163.com> wrote:
> 
> This is an RFC patch to optimize cache flushing behavior in KVM/arm64.
> 
> When toggling cache state in a multi-vCPU guest, we currently flush the VM's
> stage2 page tables on every vCPU that transitions cache state. This leads to
> redundant cache flushes during guest boot, as each vCPU performs the same
> flush operation.
> 
> In a typical guest boot sequence, vcpu0 is the first to enable caches, and
> other vCPUs follow afterward. By the time secondary vCPUs enable their caches,
> the flush performed by vcpu0 has already ensured cache coherency for the
> entire VM.

The most immediate issue I can spot is that vcpu0 is not special.
There is nothing that says vcpu0 will be the first switching its MMU
on, nor that vcpu0 will ever be running. I guess what you would want
instead is that the *first* vcpu that enables its MMU performs the
CMOs, while the others may not have to.

But even then, this changes a behaviour some guests *may* be relying
on, which is that what they have written while their MMU was off is
visible with the MMU on, without the guest doing any CMO of its own.

A lot of this stuff comes from the days where we were mostly running
32bit guests, some of which had (and still have) pretty bad
assumptions (set/way operations being one of them).

64bit guests *should* be much better behaved, and I wonder whether we
could actually drop the whole thing altogether for those. Something
like the hack below.

But this requires testing and more thought than I'm prepared to put in on
a day off... ;-)

Thanks,

	M.

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index bd020fc28aa9c..9d05e65433916 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -85,9 +85,11 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	 * For non-FWB CPUs, we trap VM ops (HCR_EL2.TVM) until M+C
 	 * get set in SCTLR_EL1 such that we can detect when the guest
 	 * MMU gets turned on and do the necessary cache maintenance
-	 * then.
+	 * then. Limit this dance to 32bit guests, assuming that 64bit
+	 * guests are reasonably behaved.
 	 */
-	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
+	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
+	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TVM;
 }
 

-- 
Jazz isn't dead. It just smells funny.