[v1] x86/mm/tlb: avoid reading mm_tlb_gen when possible

[PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

Posted by Nadav Amit 4 years, 3 months ago

From: Nadav Amit <namit@vmware.com>

On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
contended and reading it should (arguably) be avoided as much as
possible.

Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
even when it is not necessary (e.g., the mm was already switched).
This is wasteful.

Moreover, one of the existing optimizations is to read mm's tlb_gen to
see if there are additional in-flight TLB invalidations and flush the
entire TLB in such a case. However, if the request's tlb_gen was already
flushed, the benefit of checking the mm's tlb_gen is likely to be offset
by the overhead of the check itself.

Running will-it-scale with tlb_flush1_threads show a considerable
benefit on 56-core Skylake (up to +24%):

threads		Baseline (v5.17+)	+Patch
1		159960			160202
5		310808			308378 (-0.7%)
10		479110			490728
15		526771			562528
20		534495			587316
25		547462			628296
30		579616			666313
35		594134			701814
40		612288			732967
45		617517			749727
50		637476			735497
55		614363			778913 (+24%)

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>

--

Note: The benchmarked kernels include Dave's revert of commit
6035152d8eeb ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for
tlb_is_not_lazy()
---
 arch/x86/mm/tlb.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 300b11e45792..6d7c69526051 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -733,10 +733,10 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
 	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
+	u64 mm_tlb_gen;
 
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
@@ -770,6 +770,22 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	if (f->new_tlb_gen <= local_tlb_gen) {
+		/*
+		 * We are already up to date in respect to f->new_tlb_gen.
+		 * While the core might be still behind mm_tlb_gen, checking
+		 * mm_tlb_gen unnecessarily would have negative caching effects
+		 * so avoid it.
+		 */
+		return;
+	}
+
+	/*
+	 * Defer mm_tlb_gen reading as long as possible to avoid cache
+	 * contention.
+	 */
+	mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+
 	if (unlikely(local_tlb_gen == mm_tlb_gen)) {
 		/*
 		 * There's nothing to do: we're already up to date.  This can
-- 
2.25.1

Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

Posted by Dave Hansen 4 years ago

On 3/22/22 15:07, Nadav Amit wrote:
> +	if (f->new_tlb_gen <= local_tlb_gen) {
> +		/*
> +		 * We are already up to date in respect to f->new_tlb_gen.
> +		 * While the core might be still behind mm_tlb_gen, checking
> +		 * mm_tlb_gen unnecessarily would have negative caching effects
> +		 * so avoid it.
> +		 */
> +		return;
> +	}
> +

Nit: There's at least one "we" in here that needs to get fixed up.  I'll
plan to do that when I apply it, but a v2 with that fixed and Peter's
ack added might save me five minutes.

Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

Posted by Nadav Amit 4 years ago

On Jun 6, 2022, at 8:29 AM, Dave Hansen <dave.hansen@intel.com> wrote:

> ⚠ External Email
> 
> On 3/22/22 15:07, Nadav Amit wrote:
>> +     if (f->new_tlb_gen <= local_tlb_gen) {
>> +             /*
>> +              * We are already up to date in respect to f->new_tlb_gen.
>> +              * While the core might be still behind mm_tlb_gen, checking
>> +              * mm_tlb_gen unnecessarily would have negative caching effects
>> +              * so avoid it.
>> +              */
>> +             return;
>> +     }
>> +
> 
> Nit: There's at least one "we" in here that needs to get fixed up.  I'll
> plan to do that when I apply it, but a v2 with that fixed and Peter's
> ack added might save me five minutes.

No good deed goes unpunished.

I’ll send v2 later today.

Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

Posted by Peter Zijlstra 4 years, 2 months ago

On Tue, Mar 22, 2022 at 10:07:57PM +0000, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
> contended and reading it should (arguably) be avoided as much as
> possible.
> 
> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
> even when it is not necessary (e.g., the mm was already switched).
> This is wasteful.
> 
> Moreover, one of the existing optimizations is to read mm's tlb_gen to
> see if there are additional in-flight TLB invalidations and flush the
> entire TLB in such a case. However, if the request's tlb_gen was already
> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
> by the overhead of the check itself.
> 
> Running will-it-scale with tlb_flush1_threads show a considerable
> benefit on 56-core Skylake (up to +24%):
> 
> threads		Baseline (v5.17+)	+Patch
> 1		159960			160202
> 5		310808			308378 (-0.7%)
> 10		479110			490728
> 15		526771			562528
> 20		534495			587316
> 25		547462			628296
> 30		579616			666313
> 35		594134			701814
> 40		612288			732967
> 45		617517			749727
> 50		637476			735497
> 55		614363			778913 (+24%)
> 

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

Posted by Nadav Amit 4 years ago

On Mar 28, 2022, at 3:35 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Mar 22, 2022 at 10:07:57PM +0000, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>> contended and reading it should (arguably) be avoided as much as
>> possible.
>> 
>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>> even when it is not necessary (e.g., the mm was already switched).
>> This is wasteful.
>> 
>> Moreover, one of the existing optimizations is to read mm's tlb_gen to
>> see if there are additional in-flight TLB invalidations and flush the
>> entire TLB in such a case. However, if the request's tlb_gen was already
>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>> by the overhead of the check itself.
>> 
>> Running will-it-scale with tlb_flush1_threads show a considerable
>> benefit on 56-core Skylake (up to +24%):
>> 
>> threads		Baseline (v5.17+)	+Patch
>> 1		159960			160202
>> 5		310808			308378 (-0.7%)
>> 10		479110			490728
>> 15		526771			562528
>> 20		534495			587316
>> 25		547462			628296
>> 30		579616			666313
>> 35		594134			701814
>> 40		612288			732967
>> 45		617517			749727
>> 50		637476			735497
>> 55		614363			778913 (+24%)
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Ping?