On busy multi-threaded workloads, there can be significant contention
on the mm_cpumask at context switch time.
Reduce that contention by updating mm_cpumask lazily, setting the CPU bit
at context switch time (if not already set), and clearing the CPU bit at
the first TLB flush sent to a CPU where the process isn't running.
When a flurry of TLB flushes for a process happens, only the first one
will be sent to CPUs where the process isn't running. The others will
be sent to CPUs where the process is currently running.
On an AMD Milan system with 36 cores, there is a noticeable difference:
$ hackbench --groups 20 --loops 10000
Before: ~4.5s +/- 0.1s
After: ~4.2s +/- 0.1s
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d..f19f6378cabf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -606,18 +606,15 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		cond_mitigation(tsk);
 
 		/*
-		 * Stop remote flushes for the previous mm.
-		 * Skip kernel threads; we never send init_mm TLB flushing IPIs,
-		 * but the bitmap manipulation can cause cache line contention.
+		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
+		 * mm_cpumask can be expensive under contention. The CPU
+		 * will be removed lazily at TLB flush time.
 		 */
-		if (prev != &init_mm) {
-			VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu,
-						mm_cpumask(prev)));
-			cpumask_clear_cpu(cpu, mm_cpumask(prev));
-		}
+		VM_WARN_ON_ONCE(prev != &init_mm && !cpumask_test_cpu(cpu,
+				mm_cpumask(prev)));
 
 		/* Start receiving IPIs and then read tlb_gen (and LAM below) */
-		if (next != &init_mm)
+		if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
@@ -761,8 +758,10 @@ static void flush_tlb_func(void *info)
 		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
 
 		/* Can only happen on remote CPUs */
-		if (f->mm && f->mm != loaded_mm)
+		if (f->mm && f->mm != loaded_mm) {
+			cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
 			return;
+		}
 	}
 
 	if (unlikely(loaded_mm == &init_mm))
--
2.45.2
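Condensed into one place, the protocol the two hunks implement looks roughly
like this (a simplified C-style sketch of the patch above, not literal kernel
code; ASID bookkeeping and the lazy-TLB paths are omitted):

	/* Context switch (switch_mm_irqs_off): only ever set the bit, and
	 * only when it is not already set, so the common case is a plain
	 * read instead of an atomic RMW on a contended cache line.
	 * prev's bit is deliberately left alone. */
	if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
		cpumask_set_cpu(cpu, mm_cpumask(next));

	/* Remote flush (flush_tlb_func): a CPU that receives a flush IPI for
	 * an mm it is no longer running clears itself lazily, so later
	 * flushes for that mm stop targeting this CPU. */
	if (f->mm && f->mm != loaded_mm) {
		cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
		return;
	}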
On 11/8/24 11:31, Rik van Riel wrote:
> 		/* Can only happen on remote CPUs */
> -		if (f->mm && f->mm != loaded_mm)
> +		if (f->mm && f->mm != loaded_mm) {
> +			cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
>  			return;
> +		}

We could also stick a new flush event type in here that says it received
an IPI for a non-loaded mm to see how common those are.
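A minimal sketch of what that instrumentation could look like, assuming a
hypothetical NR_TLB_REMOTE_FLUSH_STALE_MM event (the name is made up here;
it would need a matching entry in enum vm_event_item and vmstat_text, next
to the existing NR_TLB_REMOTE_FLUSH_RECEIVED one):

		/* Can only happen on remote CPUs */
		if (f->mm && f->mm != loaded_mm) {
			/* Hypothetical counter to gauge how often this path hits */
			count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_STALE_MM);
			cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
			return;
		}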
On 11/8/24 11:31, Rik van Riel wrote:
> On busy multi-threaded workloads, there can be significant contention
> on the mm_cpumask at context switch time.
>
> Reduce that contention by updating mm_cpumask lazily, setting the CPU bit
> at context switch time (if not already set), and clearing the CPU bit at
> the first TLB flush sent to a CPU where the process isn't running.
>
> When a flurry of TLB flushes for a process happens, only the first one
> will be sent to CPUs where the process isn't running. The others will
> be sent to CPUs where the process is currently running.

So I guess it comes down to balancing:

	The cpumask_clear_cpu() happens on every mm switch which can be
	thousands of times a second. But it's _relatively_ cheap: dozens
	or a couple hundred cycles.

with:

	Skipping the cpumask_clear_cpu() will cause more TLB flushes. It
	can cause at most one extra TLB flush for each time a process is
	migrated off a CPU and never returns. This is _relatively_
	expensive: on the order of thousands of cycles to send and
	receive an IPI.

Migrations are obviously the enemy here, but they're the enemy for lots
of _other_ reasons too, which is a really nice property.

The only thing I can think of that really worries me is some kind of
forked worker model where before this patch you would have:

 * fork()
 * run on CPU A
 * ... migrate to CPU B
 * malloc()/free(), needs to flush B only
 * exit()

and after:

 * fork()
 * run on CPU A
 * ... migrate to CPU B
 * malloc()/free(), needs to flush A+B, including IPI
 * exit()

Where that IPI wasn't needed at *all* before. But that's totally
contrived.

So I think this is the kind of thing we'd want to apply to -rc1 and let
the robots poke at it for a few weeks. But it does seem like a sound
idea to me.
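To put rough, purely illustrative numbers on that balance: at 5,000 mm
switches per second and ~100 cycles per cpumask_clear_cpu(), the eager
clearing costs on the order of 500,000 cycles per second. At ~5,000 cycles
to send and handle one IPI, the lazy scheme only loses if it causes more
than about 100 extra stale-CPU IPIs per second, i.e. if tasks migrate away
and never come back at a rate comparable to the context switch rate.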
On Fri, 2024-11-08 at 12:31 -0800, Dave Hansen wrote:
>
> The only thing I can think of that really worries me is some kind of
> forked worker model where before this patch you would have:
> ...
> Where that IPI wasn't needed at *all* before. But that's totally
> contrived.
>
> So I think this is the kind of thing we'd want to apply to -rc1 and
> let the robots poke at it for a few weeks. But it does seem like a
> sound idea to me.
>
I am definitely hoping the robot will find something to throw at this
workload that I didn't think of.

Most of the workloads here are either single threaded processes, or
heavily multi-threaded processes.

For the worker process case, I would expect us to COW and flush a number
of pages in close time proximity to each other. In that case the first
IPI may get sent to an unnecessary CPU, but future IPIs in that batch
should not be.

If we don't send many invalidation IPIs at all, we probably don't care
that much. If we do send a bunch, they often seem to happen in bursts.

I don't know if there are workloads where we send them frequently, but
not in bursts.

-- 
All Rights Reversed.
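To make the batch case concrete (hypothetical CPU numbers, purely to
illustrate the behaviour with the patch applied):

 * the task last ran on CPU 3, then migrated to CPU 5; CPU 3's bit is
   still set in mm_cpumask
 * COW flush #1 of the batch IPIs both CPU 3 and CPU 5; CPU 3 finds the
   mm is not loaded, clears its own bit, and skips the flush
 * COW flushes #2..N of the batch only see CPU 5 in mm_cpumask, so no
   further IPIs go to CPU 3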
On 11/8/24 11:31, Rik van Riel wrote:
> 		/* Start receiving IPIs and then read tlb_gen (and LAM below) */
> -		if (next != &init_mm)
> +		if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
>  			cpumask_set_cpu(cpu, mm_cpumask(next));
>  		next_tlb_gen = atomic64_read(&next->context.tlb_gen);

If we're worried about contention on mm_cpumask(), then this hunk makes
sense independently of the lazy updating. We might want to take this
hunk forward before we do the rest because this seems like a no-brainer.
On Fri, 2024-11-08 at 12:03 -0800, Dave Hansen wrote:
> On 11/8/24 11:31, Rik van Riel wrote:
> > 		/* Start receiving IPIs and then read tlb_gen (and LAM below) */
> > -		if (next != &init_mm)
> > +		if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
> >  			cpumask_set_cpu(cpu, mm_cpumask(next));
> >  		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
> If we're worried about contention on mm_cpumask(), then this hunk makes
> sense independently of the lazy updating. We might want to take this
> hunk forward before we do the rest because this seems like a no-brainer.
>
If we always clear the CPU in the mm_cpumask when prev != next,
wouldn't that result in that CPU's bit being clear (and needing
to be set) for next when prev != next?

What am I missing?

-- 
All Rights Reversed.
On 11/8/24 12:07, Rik van Riel wrote:
> On Fri, 2024-11-08 at 12:03 -0800, Dave Hansen wrote:
>> On 11/8/24 11:31, Rik van Riel wrote:
>>> 		/* Start receiving IPIs and then read tlb_gen (and LAM below) */
>>> -		if (next != &init_mm)
>>> +		if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
>>>  			cpumask_set_cpu(cpu, mm_cpumask(next));
>>>  		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>>
>> If we're worried about contention on mm_cpumask(), then this hunk makes
>> sense independently of the lazy updating. We might want to take this
>> hunk forward before we do the rest because this seems like a no-brainer.
>>
> If we always clear the CPU in the mm_cpumask when prev != next,
> wouldn't that result in that CPU's bit being clear (and needing
> to be set) for next when prev != next?
>
> What am I missing?

Oh, good point. This is all in the (prev != next) block so yeah, the
bit _has_ to be clear. Silly me.