Testing a cpumask for a CPU being clear just before setting that same
CPU is useless because the end result is always the same: the CPU is
set.

While at it, switch the CPU setter to the non-atomic version. Atomicity
is useless here because sev_writeback_caches() ends up with a plain
for_each_cpu() loop in smp_call_function_many_cond(), which is not
atomic by nature.

Fixes: 6f38f8c57464 ("KVM: SVM: Flush cache only on CPUs running SEV guest")
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 arch/x86/kvm/svm/sev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 49d7557de8bc..8170674d39c1 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3498,8 +3498,7 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
 	 * have encrypted, dirty data in the cache, and flush caches only for
 	 * CPUs that have entered the guest.
 	 */
-	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
-		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
+	__cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
 
 	/* Assign the asid allocated with this SEV guest */
 	svm->asid = asid;
--
2.43.0

On Mon, Aug 11, 2025, Yury Norov wrote:
> Testing a cpumask for a CPU being clear just before setting that same
> CPU is useless because the end result is always the same: the CPU is
> set.

No, it is not useless. Blindly writing to the variable will
unnecessarily bounce the cacheline, and this is a hot path.

> While at it, switch the CPU setter to the non-atomic version.
> Atomicity is useless here

No, atomicity isn't useless here either. Dropping atomicity could
result in a CPU's bit being lost. I.e. the atomic accesses aren't for
the benefit of smp_call_function_many_cond(); the writes are atomic so
that multiple vCPUs can concurrently update the mask without needing
additional protection.

> because sev_writeback_caches() ends up with a plain
> for_each_cpu() loop in smp_call_function_many_cond(), which is not
> atomic by nature.

That's fine. As noted in sev_writeback_caches(), if a vCPU could be
running, then the caller is responsible for ensuring that all vCPUs
flush caches before the memory being reclaimed is fully freed. Those
guarantees are provided by KVM's MMU.

sev_writeback_caches() => smp_call_function_many_cond() could hit false
positives, i.e. trigger WBINVD on CPUs that couldn't possibly have
accessed the memory being reclaimed, but such false positives are
functionally benign, and are "intended" in the sense that we chose to
prioritize simplicity over precision.

> Fixes: 6f38f8c57464 ("KVM: SVM: Flush cache only on CPUs running SEV guest")
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
>  arch/x86/kvm/svm/sev.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 49d7557de8bc..8170674d39c1 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3498,8 +3498,7 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
>  	 * have encrypted, dirty data in the cache, and flush caches only for
>  	 * CPUs that have entered the guest.
>  	 */
> -	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
> -		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
> +	__cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
>
>  	/* Assign the asid allocated with this SEV guest */
>  	svm->asid = asid;
> --
> 2.43.0
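
The lost-update hazard Sean describes is easy to demonstrate in
isolation. Below is a hypothetical userspace sketch (pthreads, nothing
from the kernel tree): two "vCPU" threads perform the plain
load/OR/store that a non-atomic __cpumask_set_cpu() boils down to, and
one thread's store can erase the bit the other just set:

/*
 * Hypothetical userspace sketch of the lost-update race; build with
 * "gcc -pthread".  Neither the names nor the code come from KVM.
 */
#include <pthread.h>
#include <stdio.h>

static unsigned long have_run_mask;	/* stands in for have_run_cpus */

static void *vcpu_entry(void *arg)
{
	unsigned long bit = 1UL << (long)arg;
	unsigned long old;

	old = have_run_mask;		/* load */
	have_run_mask = old | bit;	/* store: may clobber a bit set
					 * concurrently by the other thread */
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, vcpu_entry, (void *)0L);
	pthread_create(&t1, NULL, vcpu_entry, (void *)1L);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* An atomic set_bit() per thread would always yield 0x3 here. */
	printf("have_run_mask = %#lx\n", have_run_mask);
	return 0;
}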

On Mon, Aug 11, 2025 at 01:45:46PM -0700, Sean Christopherson wrote:
> On Mon, Aug 11, 2025, Yury Norov wrote:
> > Testing a cpumask for a CPU being clear just before setting that
> > same CPU is useless because the end result is always the same: the
> > CPU is set.
>
> No, it is not useless. Blindly writing to the variable will
> unnecessarily bounce the cacheline, and this is a hot path.

How hot is that path? How bad is the cache contention? Is there any
evidence that the conditional cpumask_set_cpu() is worth the effort?

The original patch doesn't discuss that at all, and without a comment
the code just looks buggy.

> > While at it, switch the CPU setter to the non-atomic version.
> > Atomicity is useless here
>
> No, atomicity isn't useless here either. Dropping atomicity could
> result in a CPU's bit being lost. I.e. the atomic accesses aren't for
> the benefit of smp_call_function_many_cond(); the writes are atomic
> so that multiple vCPUs can concurrently update the mask without
> needing additional protection.

OK, I see. Something heavy hit my head before I decided to drop
atomicity there.

> > because sev_writeback_caches() ends up with a plain
> > for_each_cpu() loop in smp_call_function_many_cond(), which is not
> > atomic by nature.
>
> That's fine. As noted in sev_writeback_caches(), if a vCPU could be
> running, then the caller is responsible for ensuring that all vCPUs
> flush caches before the memory being reclaimed is fully freed. Those
> guarantees are provided by KVM's MMU.
>
> sev_writeback_caches() => smp_call_function_many_cond() could hit
> false positives, i.e. trigger WBINVD on CPUs that couldn't possibly
> have accessed the memory being reclaimed, but such false positives
> are functionally benign, and are "intended" in the sense that we
> chose to prioritize simplicity over precision.

So, I don't object to dropping the patch, but it would be really nice
to have this

	if (!cpumask_test_cpu())
		cpumask_set_cpu()

pattern explained, and even better supported with performance numbers.

Thanks,
Yury

On Mon, Aug 11, 2025, Yury Norov wrote:
> On Mon, Aug 11, 2025 at 01:45:46PM -0700, Sean Christopherson wrote:
> > On Mon, Aug 11, 2025, Yury Norov wrote:
> > > Testing a cpumask for a CPU being clear just before setting that
> > > same CPU is useless because the end result is always the same:
> > > the CPU is set.
> >
> > No, it is not useless. Blindly writing to the variable will
> > unnecessarily bounce the cacheline, and this is a hot path.
>
> How hot is that path?

Very, it gets hit on every VM-Exit => VM-Entry. For context, putting a
single printk anywhere in KVM's exit=>entry path can completely prevent
forward progress in the guest (for some workloads/patterns).

> How bad is the cache contention?

I would expect it to be "fatally" bad for some workloads and setups.
Not literally fatal, but bad enough that it would require an urgent
fix.

> Is there any evidence that the conditional cpumask_set_cpu() is worth
> the effort?

I don't have evidence for this specific code flow, but there is plenty
of evidence that shows that generating atomic accesses, especially
across sockets, can have a significant negative impact on performance.

I didn't ask for performance numbers for optimizing setting the mask
because (a) I know the VM-Entry path can be extremely hot, (b) I know
that dueling atomics can be hugely problematic, and (c) I don't see the
separate test + set logic as being at all notable in terms of effort.

> The original patch doesn't discuss that at all, and without a comment
> the code just looks buggy.

FWIW, there was discussion in a previous version of the series, but no
hard numbers on the perf impact.

https://lore.kernel.org/all/Z75se_OZQvaeQE-4@google.com

> > > While at it, switch the CPU setter to the non-atomic version.
> > > Atomicity is useless here
> >
> > No, atomicity isn't useless here either. Dropping atomicity could
> > result in a CPU's bit being lost. I.e. the atomic accesses aren't
> > for the benefit of smp_call_function_many_cond(); the writes are
> > atomic so that multiple vCPUs can concurrently update the mask
> > without needing additional protection.
>
> OK, I see. Something heavy hit my head before I decided to drop
> atomicity there.
>
> > > because sev_writeback_caches() ends up with a plain
> > > for_each_cpu() loop in smp_call_function_many_cond(), which is
> > > not atomic by nature.
> >
> > That's fine. As noted in sev_writeback_caches(), if a vCPU could be
> > running, then the caller is responsible for ensuring that all vCPUs
> > flush caches before the memory being reclaimed is fully freed.
> > Those guarantees are provided by KVM's MMU.
> >
> > sev_writeback_caches() => smp_call_function_many_cond() could hit
> > false positives, i.e. trigger WBINVD on CPUs that couldn't possibly
> > have accessed the memory being reclaimed, but such false positives
> > are functionally benign, and are "intended" in the sense that we
> > chose to prioritize simplicity over precision.
>
> So, I don't object to dropping the patch, but it would be really nice
> to have this
>
> 	if (!cpumask_test_cpu())
> 		cpumask_set_cpu()
>
> pattern explained, and even better supported with performance numbers.

I can definitely add a comment, and I might try to gather numbers out
of curiosity, but as above, I just don't see this as something that
needs to be investigated with any urgency.
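
For reference, this is the shape of the code the dropped patch wanted
to simplify, together with the kind of comment Sean offers to add (the
comment wording below is illustrative, not from any actual commit):

	/*
	 * Check before setting: this path runs on every VM-Entry, and
	 * an unconditional atomic RMW would bounce the have_run_cpus
	 * cacheline between every CPU running vCPUs of this VM.  The
	 * read-only test keeps the line shared on the hot path; the
	 * atomic write happens at most once per CPU.
	 */
	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);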

On Mon, Aug 11, 2025 at 03:04:50PM -0700, Sean Christopherson wrote:
> On Mon, Aug 11, 2025, Yury Norov wrote:
> > On Mon, Aug 11, 2025 at 01:45:46PM -0700, Sean Christopherson wrote:
> > > On Mon, Aug 11, 2025, Yury Norov wrote:
> > > > Testing a cpumask for a CPU being clear just before setting
> > > > that same CPU is useless because the end result is always the
> > > > same: the CPU is set.
> > >
> > > No, it is not useless. Blindly writing to the variable will
> > > unnecessarily bounce the cacheline, and this is a hot path.
> >
> > How hot is that path?
>
> Very, it gets hit on every VM-Exit => VM-Entry. For context, putting
> a single printk anywhere in KVM's exit=>entry path can completely
> prevent forward progress in the guest (for some workloads/patterns).
>
> > How bad is the cache contention?
>
> I would expect it to be "fatally" bad for some workloads and setups.
> Not literally fatal, but bad enough that it would require an urgent
> fix.
>
> > Is there any evidence that the conditional cpumask_set_cpu() is
> > worth the effort?
>
> I don't have evidence for this specific code flow, but there is
> plenty of evidence that shows that generating atomic accesses,
> especially across sockets, can have a significant negative impact on
> performance.
>
> I didn't ask for performance numbers for optimizing setting the mask
> because (a) I know the VM-Entry path can be extremely hot, (b) I know
> that dueling atomics can be hugely problematic, and (c) I don't see
> the separate test + set logic as being at all notable in terms of
> effort.
>
> > The original patch doesn't discuss that at all, and without a
> > comment the code just looks buggy.
>
> FWIW, there was discussion in a previous version of the series, but
> no hard numbers on the perf impact.
>
> https://lore.kernel.org/all/Z75se_OZQvaeQE-4@google.com

OK, I see. So, as I said, I don't insist on moving this patch forward.
Let's drop it if you think the cache contention is critical.

I probably need to think in the opposite direction: if some code
trashes caches by concurrently accessing a whole cache line just to
set a single bit, and people try to minimize that with a conditional
set/clear_bit(), we should make that pattern as effective as we can,
and make it part of the API.

Thanks for the discussion, Sean!
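
The cpumask API extension Yury floats might look roughly like the
following; the helper names and semantics are speculative, as no such
interfaces existed at the time of this thread:

/*
 * Speculative sketch only: conditional setters/clearers that fold the
 * test-before-modify idiom into the cpumask API, so callers get the
 * cacheline-friendly behavior without open-coding it.
 */
static __always_inline void cpumask_set_cpu_lazy(int cpu, struct cpumask *dstp)
{
	/* Skip the atomic RMW, and the cacheline bounce, if already set. */
	if (!cpumask_test_cpu(cpu, dstp))
		cpumask_set_cpu(cpu, dstp);
}

static __always_inline void cpumask_clear_cpu_lazy(int cpu, struct cpumask *dstp)
{
	/* Likewise for clearing: write only when the bit is actually set. */
	if (cpumask_test_cpu(cpu, dstp))
		cpumask_clear_cpu(cpu, dstp);
}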