[PATCh 0/3] x86,tlb: context switch optimizations

[PATCh 0/3] x86,tlb: context switch optimizations
Posted by Rik van Riel 2 weeks ago
While profiling switch_mm_irqs_off with several workloads,
it appears there are two hot spots that probably don't need
to be there.

The first is the atomic clearing and setting of the current
CPU in prev's and next's mm_cpumask. This can create a large
amount of cache line contention. On a web server, these two
together take about 17% of the CPU time spent in switch_mm_irqs_off.

We should be able to avoid much of the cache line thrashing
by clearing bits in mm_cpumask lazily, at the time of the first
TLB flush sent to a process; subsequent TLB flushes can then
be more narrowly targeted.

A second cause of overhead seems to be the cpumask_test_cpu
inside the WARN_ON_ONCE in the prev == next branch of
switch_mm_irqs_off.

This warning never ever seems to fire, even on a very large
fleet, so it may be best to hide that behind CONFIG_DEBUG_VM.
With the web server workload, this is also about 17% of
switch_mm_irqs_off.
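
Concretely, hiding the check could look something like this (an
illustrative sketch of the idea, not necessarily the exact patch):

	/*
	 * Only pay for the mm_cpumask sanity check on debug kernels;
	 * on production kernels this warning has never been observed
	 * to fire and the cpumask_test_cpu() is pure overhead.
	 */
	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
	    WARN_ON_ONCE(prev != &init_mm &&
			 !cpumask_test_cpu(cpu, mm_cpumask(next))))
		cpumask_set_cpu(cpu, mm_cpumask(next));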
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Borislav Petkov 1 week, 3 days ago
On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> While profiling switch_mm_irqs_off with several workloads,
> it appears there are two hot spots that probably don't need
> to be there.

One of those three is causing the below here, zapping them from tip.

[    3.050931] smpboot: x86: Booting SMP configuration:
[    3.054476] .... node  #0, CPUs:          #1   #2   #3   #4   #5   #6   #7   #8   #9  #10  #11  #12  #13  #14  #15  #16  #17  #18  #19  #20
[    3.166533] Callback from call_rcu_tasks() invoked.
[    3.178695]   #21  #22  #23  #24  #25  #26  #27  #28  #29  #30  #31
[    3.186469] ------------[ cut here ]------------
[    3.186469] WARNING: CPU: 16 PID: 97 at kernel/smp.c:807 smp_call_function_many_cond+0x188/0x720
[    3.186469] Modules linked in:
[    3.186469] CPU: 16 UID: 0 PID: 97 Comm: cpuhp/16 Not tainted 6.12.0-rc7+ #1
[    3.186469] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
[    3.186469] RIP: 0010:smp_call_function_many_cond+0x188/0x720
[    3.186469] Code: 96 54 91 01 85 d2 0f 84 f7 fe ff ff 65 8b 05 37 8c e3 7e 85 c0 0f 85 e8 fe ff ff 65 8b 05 0c 89 e3 7e 85 c0 0f 85 d9 fe ff ff <0f> 0b e9 d2 fe ff ff 65 f7 05 4e c5 e4 7e ff ff ff 7f 0f 85 a6 fe
[    3.186469] RSP: 0018:ffffc90000dbfbe8 EFLAGS: 00010046
[    3.186469] RAX: 0000000000000000 RBX: ffffffff8109eeb0 RCX: 0000000000000000
[    3.186469] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff824908dd
[    3.186469] RBP: ffff889003235380 R08: ffffffff8109eeb0 R09: 0000000000000006
[    3.186469] R10: 0000000000000013 R11: 0000000000000005 R12: 0000000000000010
[    3.186469] R13: ffff88810006a480 R14: ffff889003235380 R15: 0000000000000010
[    3.186469] FS:  0000000000000000(0000) GS:ffff889003200000(0000) knlGS:0000000000000000
[    3.186469] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.186469] CR2: 0000000000000000 CR3: 0000000002a3a001 CR4: 0000000000770ef0
[    3.186469] PKRU: 55555554
[    3.186469] Call Trace:
[    3.186469]  <TASK>
[    3.186469]  ? __warn+0xa1/0x1c0
[    3.186469]  ? smp_call_function_many_cond+0x188/0x720
[    3.186469]  ? report_bug+0x1b5/0x1e0
[    3.186469]  ? handle_bug+0x58/0x90
[    3.186469]  ? exc_invalid_op+0x17/0x70
[    3.186469]  ? asm_exc_invalid_op+0x16/0x20
[    3.186469]  ? __pfx_tlb_is_not_lazy+0x10/0x10
[    3.186469]  ? __pfx_tlb_is_not_lazy+0x10/0x10
[    3.186469]  ? smp_call_function_many_cond+0x188/0x720
[    3.186469]  ? smp_call_function_many_cond+0x2a/0x720
[    3.186469]  ? __pte_offset_map_lock+0xa4/0x190
[    3.186469]  ? __pfx_flush_tlb_func+0x10/0x10
[    3.186469]  ? srso_alias_return_thunk+0x5/0xfbef5
[    3.186469]  ? lock_acquire+0x11a/0x350
[    3.186469]  ? __pfx_flush_tlb_func+0x10/0x10
[    3.186469]  ? __pfx_tlb_is_not_lazy+0x10/0x10
[    3.186469]  on_each_cpu_cond_mask+0x50/0x90
[    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
[    3.186469]  ? cpu_bugs_smt_update+0x14/0x1f0
[    3.186469]  __text_poke+0x366/0x5d0
[    3.186469]  ? __pfx_text_poke_memcpy+0x10/0x10
[    3.186469]  ? cpu_bugs_smt_update+0x14/0x1f0
[    3.186469]  text_poke_bp_batch+0xa1/0x3d0
[    3.186469]  ? mptcp_pm_get_add_addr_signal_max+0x10/0x30
[    3.186469]  ? arch_jump_label_transform_queue+0x55/0x80
[    3.186469]  ? __pfx_sched_cpu_activate+0x10/0x10
[    3.186469]  text_poke_finish+0x1b/0x30
[    3.186469]  arch_jump_label_transform_apply+0x18/0x30
[    3.186469]  static_key_slow_inc_cpuslocked+0x55/0xa0
[    3.186469]  ? srso_alias_return_thunk+0x5/0xfbef5
[    3.186469]  sched_cpu_activate+0x45/0x190
[    3.186469]  ? __pfx_sched_cpu_activate+0x10/0x10
[    3.186469]  cpuhp_invoke_callback+0x159/0x6b0
[    3.186469]  ? cpuhp_thread_fun+0x81/0x290
[    3.186469]  cpuhp_thread_fun+0x203/0x290
[    3.186469]  ? cpuhp_thread_fun+0x81/0x290
[    3.186469]  ? smpboot_thread_fn+0x2b/0x260
[    3.186469]  smpboot_thread_fn+0x1ae/0x260
[    3.186469]  ? __pfx_smpboot_thread_fn+0x10/0x10
[    3.186469]  kthread+0xee/0x120
[    3.186469]  ? __pfx_kthread+0x10/0x10
[    3.186469]  ret_from_fork+0x4c/0x60
[    3.186469]  ? __pfx_kthread+0x10/0x10
[    3.186469]  ret_from_fork_asm+0x1a/0x30
[    3.186469]  </TASK>
[    3.186469] irq event stamp: 122
[    3.186469] hardirqs last  enabled at (121): [<ffffffff81106ff7>] balance_push_set+0xe7/0x110
[    3.186469] hardirqs last disabled at (122): [<ffffffff81048129>] __text_poke+0x489/0x5d0
[    3.186469] softirqs last  enabled at (0): [<ffffffff810b1ae5>] copy_process+0x9f5/0x2a70
[    3.186469] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    3.186469] ---[ end trace 0000000000000000 ]---

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Rik van Riel 1 week, 3 days ago
On Wed, 13 Nov 2024 10:55:50 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > While profiling switch_mm_irqs_off with several workloads,
> > it appears there are two hot spots that probably don't need
> > to be there.  
> 
> One of those three is causing the below here, zapping them from tip.
> 

TL;DR: __text_poke ends up sending IPIs with interrupts disabled.

> [    3.186469]  on_each_cpu_cond_mask+0x50/0x90
> [    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
> [    3.186469]  ? cpu_bugs_smt_update+0x14/0x1f0
> [    3.186469]  __text_poke+0x366/0x5d0

Here is an alternative to avoid __text_poke() from calling
on_each_cpu_cond_mask() with IRQs disabled:

---8<---
From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@surriel.com>
Date: Wed, 13 Nov 2024 09:51:16 -0500
Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to next use

Instead of doing a TLB flush of the poking_mm after we have
already switched back to the prev mm, we can simply increment
the tlb_gen for the poking_mm at unuse time.

This will cause switch_mm_irqs_off to flush the TLB next time
it loads the poking_mm, in the unlikely case that poking_mm still
has an ASID on that CPU by then.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/kernel/alternative.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8..f3caf5bc4df9 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1830,6 +1830,9 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
 	lockdep_assert_irqs_disabled();
 	switch_mm_irqs_off(NULL, prev_state.mm, current);
 
+	/* Force a TLB flush next time poking_mm is used. */
+	inc_mm_tlb_gen(poking_mm);
+
 	/*
 	 * Restore the breakpoints if they were disabled before the temporary mm
 	 * was loaded.
@@ -1940,14 +1943,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	unuse_temporary_mm(prev);
 
-	/*
-	 * Flushing the TLB might involve IPIs, which would require enabled
-	 * IRQs, but not if the mm is not used, as it is in this point.
-	 */
-	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
-			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
-			   PAGE_SHIFT, false);
-
 	if (func == text_poke_memcpy) {
 		/*
 		 * If the text does not match what we just wrote then something is
-- 
2.45.2
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Peter Zijlstra 1 week, 2 days ago
On Wed, Nov 13, 2024 at 09:55:57AM -0500, Rik van Riel wrote:
>  arch/x86/kernel/alternative.c | 11 +++--------
>  1 file changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index d17518ca19b8..f3caf5bc4df9 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1830,6 +1830,9 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
>  	lockdep_assert_irqs_disabled();
>  	switch_mm_irqs_off(NULL, prev_state.mm, current);
>  
> +	/* Force a TLB flush next time poking_mm is used. */
> +	inc_mm_tlb_gen(poking_mm);
> +
>  	/*
>  	 * Restore the breakpoints if they were disabled before the temporary mm
>  	 * was loaded.
> @@ -1940,14 +1943,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
>  	 */
>  	unuse_temporary_mm(prev);
>  
> -	/*
> -	 * Flushing the TLB might involve IPIs, which would require enabled
> -	 * IRQs, but not if the mm is not used, as it is in this point.
> -	 */
> -	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
> -			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
> -			   PAGE_SHIFT, false);
> -
>  	if (func == text_poke_memcpy) {
>  		/*
>  		 * If the text does not match what we just wrote then something is

So I really don't like this.

Yes, it avoids the immediate problem, but we're now sending IPIs where we
shouldn't be.

Fundamentally this whole text_poke thing is set up such that only a
single CPU will have this magical mapping with the aliases. Having it
send IPIs is crazy.
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Peter Zijlstra 1 week, 2 days ago
On Thu, Nov 14, 2024 at 12:36:17PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 09:55:57AM -0500, Rik van Riel wrote:
> >  arch/x86/kernel/alternative.c | 11 +++--------
> >  1 file changed, 3 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index d17518ca19b8..f3caf5bc4df9 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -1830,6 +1830,9 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
> >  	lockdep_assert_irqs_disabled();
> >  	switch_mm_irqs_off(NULL, prev_state.mm, current);
> >  
> > +	/* Force a TLB flush next time poking_mm is used. */
> > +	inc_mm_tlb_gen(poking_mm);
> > +
> >  	/*
> >  	 * Restore the breakpoints if they were disabled before the temporary mm
> >  	 * was loaded.
> > @@ -1940,14 +1943,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
> >  	 */
> >  	unuse_temporary_mm(prev);
> >  
> > -	/*
> > -	 * Flushing the TLB might involve IPIs, which would require enabled
> > -	 * IRQs, but not if the mm is not used, as it is in this point.
> > -	 */
> > -	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
> > -			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
> > -			   PAGE_SHIFT, false);
> > -
> >  	if (func == text_poke_memcpy) {
> >  		/*
> >  		 * If the text does not match what we just wrote then something is
> 
> So I really don't like this.
> 
> Yes, it avoids the immediate problem, but we're now sending IPIs where we
> shouldn't be.
> 
> Fundamentally this whole text_poke thing is set up such that only a
> single CPU will have this magical mapping with the aliases. Having it
> send IPIs is crazy.

I'm thinking the better fix is to make unuse_temporary_mm() explicitly
clear the bit if we don't want switch_mm() to do it.
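
Something along these lines, perhaps (a rough, untested sketch):

	static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
	{
		lockdep_assert_irqs_disabled();
		switch_mm_irqs_off(NULL, prev_state.mm, current);

		/*
		 * If switch_mm() no longer clears the bit for us, drop
		 * this CPU from poking_mm's mm_cpumask explicitly, so a
		 * later flush_tlb_mm_range(poking_mm, ...) stays local
		 * and has no reason to send IPIs.
		 */
		cpumask_clear_cpu(smp_processor_id(), mm_cpumask(poking_mm));

		/* ... restore breakpoints as before ... */
	}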
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Ingo Molnar 1 week, 2 days ago
* Rik van Riel <riel@surriel.com> wrote:

> On Wed, 13 Nov 2024 10:55:50 +0100
> Borislav Petkov <bp@alien8.de> wrote:
> 
> > On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > While profiling switch_mm_irqs_off with several workloads,
> > > it appears there are two hot spots that probably don't need
> > > to be there.  
> > 
> > One of those three is causing the below here, zapping them from tip.
> > 
> 
> TL;DR: __text_poke ends up sending IPIs with interrupts disabled.
> 
> > [    3.186469]  on_each_cpu_cond_mask+0x50/0x90
> > [    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
> > [    3.186469]  ? cpu_bugs_smt_update+0x14/0x1f0
> > [    3.186469]  __text_poke+0x366/0x5d0
> 
> Here is an alternative to avoid __text_poke() from calling
> on_each_cpu_cond_mask() with IRQs disabled:
> 
> ---8<---
> From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@surriel.com>
> Date: Wed, 13 Nov 2024 09:51:16 -0500
> Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to next use

I'd argue *both* of your patches improve the code, right?

Mind sending an updated series? It might not make it into the merge window,
but these look like good changes to me.

Thanks,

	Ingo
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Rik van Riel 1 week, 2 days ago
On Thu, 2024-11-14 at 10:52 +0100, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@surriel.com> wrote:
> 
> > On Wed, 13 Nov 2024 10:55:50 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> > 
> > > On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > > While profiling switch_mm_irqs_off with several workloads,
> > > > it appears there are two hot spots that probably don't need
> > > > to be there.  
> > > 
> > > One of those three is causing the below here, zapping them from
> > > tip.
> > > 
> > 
> > TL;DR: __text_poke ends up sending IPIs with interrupts disabled.
> > 
> > > [    3.186469]  on_each_cpu_cond_mask+0x50/0x90
> > > [    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
> > > [    3.186469]  ? cpu_bugs_smt_update+0x14/0x1f0
> > > [    3.186469]  __text_poke+0x366/0x5d0
> > 
> > Here is an alternative to avoid __text_poke() from calling
> > on_each_cpu_cond_mask() with IRQs disabled:
> > 
> > ---8<---
> > From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00
> > 2001
> > From: Rik van Riel <riel@surriel.com>
> > Date: Wed, 13 Nov 2024 09:51:16 -0500
> > Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to
> > next use
> 
> I'd argue *both* of your patches improve the code, right?
> 

We have 3 possible solutions, and I think we only need one of them.

> Mind sending an updated series? It might not make it into the merge
> window,
> but these look like good changes to me.

I would be happy to send a new series, but it would be good if
we agreed on what solution we wanted :)

1) Move the interrupt re-enabling up (probably not this one?)

2) Explicitly clear the mm_cpumask bit in unuse_temporary_mm()

3) Have unuse_temporary_mm increment the mm's tlb_gen, since that
   is the only thing flush_tlb_mm_range really does for an MM
   without any bits set in the mm_cpumask.

Which one would you guys prefer?

-- 
All Rights Reversed.
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Peter Zijlstra 1 week, 2 days ago
On Thu, Nov 14, 2024 at 09:27:25AM -0500, Rik van Riel wrote:

> 1) Move the interrupt re-enabling up (probably not this one?)

Correct, that one is wrong for it results in IPIs that we don't want or
need.

> 2) Explicitly clear the mm_cpumask bit in unuse_temporary_mm()
> 
> 3) Have unuse_temporary_mm increment the mm's tlb_gen, since that
>    is the only thing flush_tlb_mm_range really does for an MM
>    without any bits set in the mm_cpumask.

So flush_tlb_mm_range() has an 'mm == loaded_mm' case, which does a
local flush. I *think* we're not hitting that because switch_mm() does a
write to loaded_mm() just before this.
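
For reference, the dispatch in question looks roughly like this
(paraphrased, not the exact kernel source):

	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
		/* Remote CPUs in the mask: may send IPIs, needs IRQs on. */
		flush_tlb_multi(mm_cpumask(mm), info);
	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
		/* The mm is loaded here: a purely CPU-local flush. */
		lockdep_assert_irqs_enabled();
		local_irq_disable();
		flush_tlb_func(info);
		local_irq_enable();
	}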

But I don't think we want to proliferate the logic contained in
flush_tlb_mm_range() further than we have to.

So my preference goes to 2, as that seems to be the safest option.
Notably, text_poke() is not concerned with performance much.
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Peter Zijlstra 1 week, 2 days ago
On Thu, Nov 14, 2024 at 10:52:48AM +0100, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@surriel.com> wrote:
> 
> > On Wed, 13 Nov 2024 10:55:50 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> > 
> > > On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > > While profiling switch_mm_irqs_off with several workloads,
> > > > it appears there are two hot spots that probably don't need
> > > > to be there.  
> > > 
> > > One of those three is causing the below here, zapping them from tip.
> > > 
> > 
> > TL;DR: __text_poke ends up sending IPIs with interrupts disabled.
> > 
> > > [    3.186469]  on_each_cpu_cond_mask+0x50/0x90
> > > [    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
> > > [    3.186469]  ? cpu_bugs_smt_update+0x14/0x1f0
> > > [    3.186469]  __text_poke+0x366/0x5d0
> > 
> > Here is an alternative to avoid __text_poke() from calling
> > on_each_cpu_cond_mask() with IRQs disabled:
> > 
> > ---8<---
> > From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00 2001
> > From: Rik van Riel <riel@surriel.com>
> > Date: Wed, 13 Nov 2024 09:51:16 -0500
> > Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to next use
> 
> I'd argue *both* of your patches improve the code, right?

No, please don't.
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Rik van Riel 1 week, 3 days ago
On Wed, 13 Nov 2024 10:55:50 +0100
Borislav Petkov <bp@alien8.de> wrote:
On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > While profiling switch_mm_irqs_off with several workloads,
> > it appears there are two hot spots that probably don't need
> > to be there.
> 
> One of those three is causing the below here, zapping them from tip.
> 

This is interesting, and unexpected.

> [    3.186469] ------------[ cut here ]------------
> [    3.186469] WARNING: CPU: 16 PID: 97 at kernel/smp.c:807
> smp_call_function_many_cond+0x188/0x720

This is the lockdep_assert_irqs_enabled() from this branch:

        if (cpu_online(this_cpu) && !oops_in_progress &&
            !early_boot_irqs_disabled)
                lockdep_assert_irqs_enabled();

> [    3.186469] Call Trace:
> [    3.186469]  <TASK>
> [    3.186469]  on_each_cpu_cond_mask+0x50/0x90
> [    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
> [    3.186469]  __text_poke+0x366/0x5d0

... and sure enough, it looks like __text_poke() calls
flush_tlb_mm_range() with IRQs disabled!

> [    3.186469]  text_poke_bp_batch+0xa1/0x3d0
> [    3.186469]  text_poke_finish+0x1b/0x30
> [    3.186469]  arch_jump_label_transform_apply+0x18/0x30
> [    3.186469]  static_key_slow_inc_cpuslocked+0x55/0xa0
...

I have no good explanation for why that lockdep_assert_irqs_enabled()
would not be firing without my patches applied.

We obviously should not be sending out any IPIs with IRQs disabled.

However, __text_poke has been sending IPIs with interrupts disabled
for 4 years now! No wonder we see deadlocks involving __text_poke
on a semi-regular basis.

Should we move the local_irq_restore() in __text_poke() up a few lines,
like in the patch below?

Alternatively, should we explicitly clear the mm_cpumask in unuse_temporary_mm,
to make sure that mm never has any bits set in mm_cpumask?

Or, since we do not flush the TLB for the poking_mm until AFTER we have switched
back to the prev mm, should we simply always switch to the poking_mm in a way
that involves flushing the TLB? That way we won't even have to flush the entry
after unuse...

What is the best approach here?
---8<---
From a2e7c517bbd2cf108fc14c51449bf8e53e314b53 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@surriel.com>
Date: Wed, 13 Nov 2024 09:19:39 -0500
Subject: [PATCH] x86,alternatives: re-enable interrupts before sending TLB flush IPI

__text_poke() calls flush_tlb_mm_range() to flush the mapping of
the text poke address. However, it does so with interrupts disabled,
which can cause a deadlock.

We do occasionally observe deadlocks involving __text_poke(), but
not frequently enough to spend much time debugging them.

Borislav triggered this bug while testing a different patch, which
lazily clears bits from the mm_cpumask. That results in more bits being
set when __text_poke() calls flush_tlb_mm_range(), which in turn
triggers the lockdep_assert_irqs_enabled() in smp_call_function_many_cond().

Avoid sending IPIs with IRQs disabled by re-enabling IRQs earlier.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@kernel.org
Fixes: 7cf494270424 ("x86: expand irq-off region in text_poke()")
---
 arch/x86/kernel/alternative.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8..f71d84249f6e 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1940,6 +1940,9 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	unuse_temporary_mm(prev);
 
+	/* Re-enable interrupts before sending an IPI. */
+	local_irq_restore(flags);
+
 	/*
 	 * Flushing the TLB might involve IPIs, which would require enabled
 	 * IRQs, but not if the mm is not used, as it is in this point.
@@ -1956,7 +1959,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 		BUG_ON(memcmp(addr, src, len));
 	}
 
-	local_irq_restore(flags);
 	pte_unmap_unlock(ptep, ptl);
 	return addr;
 }
-- 
2.45.2
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Peter Zijlstra 1 week, 2 days ago
On Wed, Nov 13, 2024 at 09:38:26AM -0500, Rik van Riel wrote:
> On Wed, 13 Nov 2024 10:55:50 +0100
> Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > While profiling switch_mm_irqs_off with several workloads,
> > > it appears there are two hot spots that probably don't need
> > > to be there.
> > 
> > One of those three is causing the below here, zapping them from tip.
> > 
> 
> This is interesting, and unexpected.
> 
> > [    3.186469] ------------[ cut here ]------------
> > [    3.186469] WARNING: CPU: 16 PID: 97 at kernel/smp.c:807
> > smp_call_function_many_cond+0x188/0x720
> 
> This is the lockdep_assert_irqs_enabled() from this branch:
> 
>         if (cpu_online(this_cpu) && !oops_in_progress &&
>             !early_boot_irqs_disabled)
>                 lockdep_assert_irqs_enabled();
> 
> > [    3.186469] Call Trace:
> > [    3.186469]  <TASK>
> > [    3.186469]  on_each_cpu_cond_mask+0x50/0x90
> > [    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
> > [    3.186469]  __text_poke+0x366/0x5d0
> 
> ... and sure enough, it looks like __text_poke() calls
> flush_tlb_mm_range() with IRQs disabled!
> 
> > [    3.186469]  text_poke_bp_batch+0xa1/0x3d0
> > [    3.186469]  text_poke_finish+0x1b/0x30
> > [    3.186469]  arch_jump_label_transform_apply+0x18/0x30
> > [    3.186469]  static_key_slow_inc_cpuslocked+0x55/0xa0
> ...
> 
> I have no good explanation for why that lockdep_assert_irqs_enabled()
> would not be firing without my patches applied.
> 
> We obviously should not be sending out any IPIs with IRQs disabled.
> 
> However, __text_poke has been sending IPIs with interrupts disabled
> for 4 years now! No wonder we see deadlocks involving __text_poke
> on a semi-regular basis.

I don't think we have. Isn't the problem with patch 1, where
unuse_temporary_mm() now fails to clear the bit, with the direct result
being that flush_tlb_mm_range() now thinks it should be doing IPIs,
where previously it was a strictly CPU-local affair?
Re: [PATCh 0/3] x86,tlb: context switch optimizations
Posted by Ingo Molnar 1 week, 3 days ago
* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > While profiling switch_mm_irqs_off with several workloads,
> > it appears there are two hot spots that probably don't need
> > to be there.
> 
> One of those three is causing the below here, zapping them from tip.

I've zapped the final two commits from tip:x86/mm that are the likely source of the 
regression.

Thanks,

	Ingo