Testing a cpumask for a CPU being clear just before setting that same
CPU is useless because the end result is always the same: the CPU is
set.

While at it, switch the CPU setter to the non-atomic version. Atomicity
is useless here because sev_writeback_caches() ends up with a plain
for_each_cpu() loop in smp_call_function_many_cond(), which is not
atomic by nature.

Fixes: 6f38f8c57464 ("KVM: SVM: Flush cache only on CPUs running SEV guest")
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 arch/x86/kvm/svm/sev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 49d7557de8bc..8170674d39c1 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3498,8 +3498,7 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
 	 * have encrypted, dirty data in the cache, and flush caches only for
 	 * CPUs that have entered the guest.
 	 */
-	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
-		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
+	__cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
 
 	/* Assign the asid allocated with this SEV guest */
 	svm->asid = asid;
--
2.43.0

On Mon, Aug 11, 2025, Yury Norov wrote:
> Testing a cpumask for a CPU being clear just before setting that same
> CPU is useless because the end result is always the same: the CPU is
> set.

No, it is not useless. Blindly writing to the variable will
unnecessarily bounce the cacheline, and this is a hot path.

> While at it, switch the CPU setter to the non-atomic version.
> Atomicity is useless here

No, atomicity isn't useless here either. Dropping atomicity could
result in a CPU's bit being lost. I.e. the atomic accesses aren't for
the benefit of smp_call_function_many_cond(); the writes are atomic so
that multiple vCPUs can concurrently update the mask without needing
additional protection.

> because sev_writeback_caches() ends up with a plain
> for_each_cpu() loop in smp_call_function_many_cond(), which is not
> atomic by nature.

That's fine. As noted in sev_writeback_caches(), if a vCPU could be
running, then the caller is responsible for ensuring that all vCPUs
flush caches before the memory being reclaimed is fully freed. Those
guarantees are provided by KVM's MMU.

sev_writeback_caches() => smp_call_function_many_cond() could hit false
positives, i.e. trigger WBINVD on CPUs that couldn't possibly have
accessed the memory being reclaimed, but such false positives are
functionally benign, and are "intended" in the sense that we chose to
prioritize simplicity over precision.

> Fixes: 6f38f8c57464 ("KVM: SVM: Flush cache only on CPUs running SEV guest")
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
>  arch/x86/kvm/svm/sev.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 49d7557de8bc..8170674d39c1 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3498,8 +3498,7 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
>  	 * have encrypted, dirty data in the cache, and flush caches only for
>  	 * CPUs that have entered the guest.
>  	 */
> -	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
> -		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
> +	__cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
>
>  	/* Assign the asid allocated with this SEV guest */
>  	svm->asid = asid;
> --
> 2.43.0
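
The lost-update hazard Sean describes is easy to demonstrate in
isolation. Below is a hypothetical userspace sketch (pthreads, nothing
from the kernel tree): two "vCPU" threads perform the plain
load/OR/store that a non-atomic __cpumask_set_cpu() boils down to, and
one thread's store can erase the bit the other just set:

/*
 * Hypothetical userspace sketch of the lost-update race; build with
 * "gcc -pthread".  Neither the names nor the code come from KVM.
 */
#include <pthread.h>
#include <stdio.h>

static unsigned long have_run_mask;	/* stands in for have_run_cpus */

static void *vcpu_entry(void *arg)
{
	unsigned long bit = 1UL << (long)arg;
	unsigned long old;

	old = have_run_mask;		/* load */
	have_run_mask = old | bit;	/* store: may clobber a bit set
					 * concurrently by the other thread */
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, vcpu_entry, (void *)0L);
	pthread_create(&t1, NULL, vcpu_entry, (void *)1L);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* An atomic set_bit() per thread would always yield 0x3 here. */
	printf("have_run_mask = %#lx\n", have_run_mask);
	return 0;
}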

On Mon, Aug 11, 2025 at 01:45:46PM -0700, Sean Christopherson wrote:
> On Mon, Aug 11, 2025, Yury Norov wrote:
> > Testing a cpumask for a CPU being clear just before setting that
> > same CPU is useless because the end result is always the same: the
> > CPU is set.
>
> No, it is not useless. Blindly writing to the variable will
> unnecessarily bounce the cacheline, and this is a hot path.

How hot is that path? How bad is the cache contention? Is there any
evidence that the conditional cpumask_set_cpu() is worth the effort?

The original patch doesn't discuss that at all, and without a comment
the code just looks buggy.

> > While at it, switch the CPU setter to the non-atomic version.
> > Atomicity is useless here
>
> No, atomicity isn't useless here either. Dropping atomicity could
> result in a CPU's bit being lost. I.e. the atomic accesses aren't for
> the benefit of smp_call_function_many_cond(); the writes are atomic
> so that multiple vCPUs can concurrently update the mask without
> needing additional protection.

OK, I see. Something heavy hit my head before I decided to drop
atomicity there.

> > because sev_writeback_caches() ends up with a plain
> > for_each_cpu() loop in smp_call_function_many_cond(), which is not
> > atomic by nature.
>
> That's fine. As noted in sev_writeback_caches(), if a vCPU could be
> running, then the caller is responsible for ensuring that all vCPUs
> flush caches before the memory being reclaimed is fully freed. Those
> guarantees are provided by KVM's MMU.
>
> sev_writeback_caches() => smp_call_function_many_cond() could hit
> false positives, i.e. trigger WBINVD on CPUs that couldn't possibly
> have accessed the memory being reclaimed, but such false positives
> are functionally benign, and are "intended" in the sense that we
> chose to prioritize simplicity over precision.

So, I don't object to dropping the patch, but it would be really nice
to have this

	if (!cpumask_test_cpu())
		cpumask_set_cpu()

pattern explained, and even better supported with performance numbers.

Thanks,
Yury

On Mon, Aug 11, 2025, Yury Norov wrote:
> On Mon, Aug 11, 2025 at 01:45:46PM -0700, Sean Christopherson wrote:
> > On Mon, Aug 11, 2025, Yury Norov wrote:
> > > Testing a cpumask for a CPU being clear just before setting that
> > > same CPU is useless because the end result is always the same:
> > > the CPU is set.
> >
> > No, it is not useless. Blindly writing to the variable will
> > unnecessarily bounce the cacheline, and this is a hot path.
>
> How hot is that path?

Very, it gets hit on every VM-Exit => VM-Entry. For context, putting a
single printk anywhere in KVM's exit=>entry path can completely prevent
forward progress in the guest (for some workloads/patterns).

> How bad is the cache contention?

I would expect it to be "fatally" bad for some workloads and setups.
Not literally fatal, but bad enough that it would require an urgent
fix.

> Is there any evidence that the conditional cpumask_set_cpu() is worth
> the effort?

I don't have evidence for this specific code flow, but there is plenty
of evidence that shows that generating atomic accesses, especially
across sockets, can have a significant negative impact on performance.

I didn't ask for performance numbers for optimizing setting the mask
because (a) I know the VM-Entry path can be extremely hot, (b) I know
that dueling atomics can be hugely problematic, and (c) I don't see the
separate test + set logic as being at all notable in terms of effort.

> The original patch doesn't discuss that at all, and without a comment
> the code just looks buggy.

FWIW, there was discussion in a previous version of the series, but no
hard numbers on the perf impact.

https://lore.kernel.org/all/Z75se_OZQvaeQE-4@google.com

> > > While at it, switch the CPU setter to the non-atomic version.
> > > Atomicity is useless here
> >
> > No, atomicity isn't useless here either. Dropping atomicity could
> > result in a CPU's bit being lost. I.e. the atomic accesses aren't
> > for the benefit of smp_call_function_many_cond(); the writes are
> > atomic so that multiple vCPUs can concurrently update the mask
> > without needing additional protection.
>
> OK, I see. Something heavy hit my head before I decided to drop
> atomicity there.
>
> > > because sev_writeback_caches() ends up with a plain
> > > for_each_cpu() loop in smp_call_function_many_cond(), which is
> > > not atomic by nature.
> >
> > That's fine. As noted in sev_writeback_caches(), if a vCPU could be
> > running, then the caller is responsible for ensuring that all vCPUs
> > flush caches before the memory being reclaimed is fully freed.
> > Those guarantees are provided by KVM's MMU.
> >
> > sev_writeback_caches() => smp_call_function_many_cond() could hit
> > false positives, i.e. trigger WBINVD on CPUs that couldn't possibly
> > have accessed the memory being reclaimed, but such false positives
> > are functionally benign, and are "intended" in the sense that we
> > chose to prioritize simplicity over precision.
>
> So, I don't object to dropping the patch, but it would be really nice
> to have this
>
> 	if (!cpumask_test_cpu())
> 		cpumask_set_cpu()
>
> pattern explained, and even better supported with performance numbers.

I can definitely add a comment, and I might try to gather numbers out
of curiosity, but as above, I just don't see this as something that
needs to be investigated with any urgency.
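
For reference, this is the shape of the code the dropped patch wanted
to simplify, together with the kind of comment Sean offers to add (the
comment wording below is illustrative, not from any actual commit):

	/*
	 * Check before setting: this path runs on every VM-Entry, and
	 * an unconditional atomic RMW would bounce the have_run_cpus
	 * cacheline between every CPU running vCPUs of this VM.  The
	 * read-only test keeps the line shared on the hot path; the
	 * atomic write happens at most once per CPU.
	 */
	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);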

On Mon, Aug 11, 2025 at 03:04:50PM -0700, Sean Christopherson wrote:
> On Mon, Aug 11, 2025, Yury Norov wrote:
> > On Mon, Aug 11, 2025 at 01:45:46PM -0700, Sean Christopherson wrote:
> > > On Mon, Aug 11, 2025, Yury Norov wrote:
> > > > Testing a cpumask for a CPU being clear just before setting
> > > > that same CPU is useless because the end result is always the
> > > > same: the CPU is set.
> > >
> > > No, it is not useless. Blindly writing to the variable will
> > > unnecessarily bounce the cacheline, and this is a hot path.
> >
> > How hot is that path?
>
> Very, it gets hit on every VM-Exit => VM-Entry. For context, putting
> a single printk anywhere in KVM's exit=>entry path can completely
> prevent forward progress in the guest (for some workloads/patterns).
>
> > How bad is the cache contention?
>
> I would expect it to be "fatally" bad for some workloads and setups.
> Not literally fatal, but bad enough that it would require an urgent
> fix.
>
> > Is there any evidence that the conditional cpumask_set_cpu() is
> > worth the effort?
>
> I don't have evidence for this specific code flow, but there is
> plenty of evidence that shows that generating atomic accesses,
> especially across sockets, can have a significant negative impact on
> performance.
>
> I didn't ask for performance numbers for optimizing setting the mask
> because (a) I know the VM-Entry path can be extremely hot, (b) I know
> that dueling atomics can be hugely problematic, and (c) I don't see
> the separate test + set logic as being at all notable in terms of
> effort.
>
> > The original patch doesn't discuss that at all, and without a
> > comment the code just looks buggy.
>
> FWIW, there was discussion in a previous version of the series, but
> no hard numbers on the perf impact.
>
> https://lore.kernel.org/all/Z75se_OZQvaeQE-4@google.com

OK, I see. So, as I said, I don't insist on moving this patch forward.
Let's drop it if you think the cache contention is critical.

I probably need to think in the opposite direction: if some code
trashes caches by concurrently accessing a whole cache line just to
set a single bit, and people try to minimize that with a conditional
set/clear_bit(), we should make that pattern as effective as we can,
and make it part of the API.

Thanks for the discussion, Sean!
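
The cpumask API extension Yury floats might look roughly like the
following; the helper names and semantics are speculative, as no such
interfaces existed at the time of this thread:

/*
 * Speculative sketch only: conditional setters/clearers that fold the
 * test-before-modify idiom into the cpumask API, so callers get the
 * cacheline-friendly behavior without open-coding it.
 */
static __always_inline void cpumask_set_cpu_lazy(int cpu, struct cpumask *dstp)
{
	/* Skip the atomic RMW, and the cacheline bounce, if already set. */
	if (!cpumask_test_cpu(cpu, dstp))
		cpumask_set_cpu(cpu, dstp);
}

static __always_inline void cpumask_clear_cpu_lazy(int cpu, struct cpumask *dstp)
{
	/* Likewise for clearing: write only when the bit is actually set. */
	if (cpumask_test_cpu(cpu, dstp))
		cpumask_clear_cpu(cpu, dstp);
}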