As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
the Branch History Buffer (BHB). On Alder Lake and newer parts this
sequence is not sufficient because it doesn't clear enough entries. This
was not an issue because these CPUs have a hardware control (BHI_DIS_S)
that mitigates BHI in the kernel.

The BHI variant of VMSCAPE requires isolating branch history between guests
and userspace. Note that there is no equivalent hardware control for
userspace. To effectively isolate branch history on newer CPUs,
clear_bhb_loop() should execute a sufficient number of branches to clear
the larger BHB.

Dynamically set the loop count of clear_bhb_loop() such that it is
effective on newer CPUs too. Use the hardware control enumeration
X86_FEATURE_BHI_CTRL to select the appropriate loop count.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 21 ++++++++++++++++-----
arch/x86/net/bpf_jit_comp.c | 7 -------
2 files changed, 16 insertions(+), 12 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3a180a36ca0e..8128e00ca73f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1535,8 +1535,17 @@ SYM_CODE_END(rewind_stack_and_make_dead)
SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
+ /* BPF caller may require %rax to be preserved */
+ push %rax
mov %rsp, %rbp
- movl $5, %ecx
+
+ /*
+ * Between the long and short version of BHB clear sequence, just the
+ * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
+ */
+ ALTERNATIVE "movb $5, %al", \
+ "movb $12, %al", X86_FEATURE_BHI_CTRL
+
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
jmp 5f
@@ -1556,16 +1565,18 @@ SYM_FUNC_START(clear_bhb_loop)
* This should be ideally be: .skip 32 - (.Lret2 - 2f), 0xcc
* but some Clang versions (e.g. 18) don't like this.
*/
- .skip 32 - 18, 0xcc
-2: movl $5, %eax
+ .skip 32 - 14, 0xcc
+2: ALTERNATIVE "movb $5, %ah", \
+ "movb $7, %ah", X86_FEATURE_BHI_CTRL
3: jmp 4f
nop
-4: sub $1, %eax
+4: sub $1, %ah
jnz 3b
- sub $1, %ecx
+ sub $1, %al
jnz 1b
.Lret2: RET
5:
+ pop %rax
pop %rbp
RET
SYM_FUNC_END(clear_bhb_loop)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 63d6c9fa5e80..e2cceabb23e8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1614,11 +1614,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
u8 *func;
if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
- /* The clearing sequence clobbers eax and ecx. */
- EMIT1(0x50); /* push rax */
- EMIT1(0x51); /* push rcx */
- ip += 2;
-
func = (u8 *)clear_bhb_loop;
ip += x86_call_depth_emit_accounting(&prog, func, ip);
@@ -1626,8 +1621,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
return -EINVAL;
/* Don't speculate past this until BHB is cleared */
EMIT_LFENCE();
- EMIT1(0x59); /* pop rcx */
- EMIT1(0x58); /* pop rax */
}
/* Insert IBHF instruction */
if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&
--
2.34.1
On Tue, Mar 24, 2026 at 11:19 AM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
[...]
> @@ -1535,8 +1535,17 @@ SYM_CODE_END(rewind_stack_and_make_dead)
>  SYM_FUNC_START(clear_bhb_loop)
>         ANNOTATE_NOENDBR
>         push %rbp
> +       /* BPF caller may require %rax to be preserved */
> +       push %rax

Shouldn't the "push %rax" come after "mov %rsp, %rbp"?

>         mov %rsp, %rbp
> -       movl $5, %ecx
On Wed, 25 Mar 2026 10:50:58 -0700
Jim Mattson <jmattson@google.com> wrote:

> On Tue, Mar 24, 2026 at 11:19 AM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
[...]
> >  SYM_FUNC_START(clear_bhb_loop)
> >         ANNOTATE_NOENDBR
> >         push %rbp
> > +       /* BPF caller may require %rax to be preserved */

Since you need a new version change that to 'all registers preserved'.

> > +       push %rax
>
> Shouldn't the "push %rax" come after "mov %rsp, %rbp"?

Or delete the stack frame :-)
It is only there for the stack trace-back code.

	David

> >         mov %rsp, %rbp
> > -       movl $5, %ecx
On Wed, Mar 25, 2026 at 07:41:46PM +0000, David Laight wrote:
> On Wed, 25 Mar 2026 10:50:58 -0700
> Jim Mattson <jmattson@google.com> wrote:
[...]
> > > +       /* BPF caller may require %rax to be preserved */
>
> Since you need a new version change that to 'all registers preserved'.

Ya, that's more accurate.

> > > +       push %rax
> >
> > Shouldn't the "push %rax" come after "mov %rsp, %rbp"?
>
> Or delete the stack frame :-)
> It is only there for the stack trace-back code.

Hmm, let's keep the stack frame, it might help debug.
On Wed, Mar 25, 2026 at 10:50:58AM -0700, Jim Mattson wrote:
> On Tue, Mar 24, 2026 at 11:19 AM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
[...]
> >  SYM_FUNC_START(clear_bhb_loop)
> >         ANNOTATE_NOENDBR
> >         push %rbp
> > +       /* BPF caller may require %rax to be preserved */
> > +       push %rax
>
> Shouldn't the "push %rax" come after "mov %rsp, %rbp"?

Right, thanks for catching that.

> >         mov %rsp, %rbp
> > -       movl $5, %ecx
On Tue, Mar 24, 2026 at 11:16:51AM -0700, Pawan Gupta wrote:
> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> the Branch History Buffer (BHB). On Alder Lake and newer parts this
> sequence is not sufficient because it doesn't clear enough entries. This
> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> that mitigates BHI in kernel.
>
> BHI variant of VMSCAPE requires isolating branch history between guests and
> userspace. Note that there is no equivalent hardware control for userspace.
> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> should execute sufficient number of branches to clear a larger BHB.
>
> Dynamically set the loop count of clear_bhb_loop() such that it is
> effective on newer CPUs too. Use the hardware control enumeration
> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
> arch/x86/entry/entry_64.S | 21 ++++++++++++++++-----
> arch/x86/net/bpf_jit_comp.c | 7 -------
> 2 files changed, 16 insertions(+), 12 deletions(-)
Ok, pls tell me why this below doesn't work?
The additional indirection makes even the BHB loop code simpler.
(I didn't pay too much attention to the labels, 2: is probably weird there).
---
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3a180a36ca0e..95c7ed9afbbe 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1532,11 +1532,13 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* Note, callers should use a speculation barrier like LFENCE immediately after
* a call to this function to ensure BHB is cleared before indirect branches.
*/
-SYM_FUNC_START(clear_bhb_loop)
+SYM_FUNC_START(__clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
+ /* BPF caller may require %rax to be preserved */
+ push %rax
mov %rsp, %rbp
- movl $5, %ecx
+
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
jmp 5f
@@ -1557,17 +1559,17 @@ SYM_FUNC_START(clear_bhb_loop)
* but some Clang versions (e.g. 18) don't like this.
*/
.skip 32 - 18, 0xcc
-2: movl $5, %eax
+2:
3: jmp 4f
nop
-4: sub $1, %eax
+4: sub $1, %rsi
jnz 3b
- sub $1, %ecx
+ sub $1, %rdi
jnz 1b
.Lret2: RET
5:
+ pop %rax
pop %rbp
RET
-SYM_FUNC_END(clear_bhb_loop)
-EXPORT_SYMBOL_FOR_KVM(clear_bhb_loop)
-STACK_FRAME_NON_STANDARD(clear_bhb_loop)
+SYM_FUNC_END(__clear_bhb_loop)
+STACK_FRAME_NON_STANDARD(__clear_bhb_loop)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 70b377fcbc1c..a9f406941e11 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -390,6 +390,7 @@ extern void write_ibpb(void);
#ifdef CONFIG_X86_64
extern void clear_bhb_loop(void);
+extern void __clear_bhb_loop(unsigned int a, unsigned int b);
#endif
extern void (*x86_return_thunk)(void);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 83f51cab0b1e..c41b0548cf2a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -3735,3 +3735,11 @@ void __warn_thunk(void)
{
WARN_ONCE(1, "Unpatched return thunk in use. This should not happen!\n");
}
+
+void clear_bhb_loop(void)
+{
+ if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL))
+ __clear_bhb_loop(12, 7);
+ else
+ __clear_bhb_loop(5, 5);
+}
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 63d6c9fa5e80..e2cceabb23e8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1614,11 +1614,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
u8 *func;
if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
- /* The clearing sequence clobbers eax and ecx. */
- EMIT1(0x50); /* push rax */
- EMIT1(0x51); /* push rcx */
- ip += 2;
-
func = (u8 *)clear_bhb_loop;
ip += x86_call_depth_emit_accounting(&prog, func, ip);
@@ -1626,8 +1621,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
return -EINVAL;
/* Don't speculate past this until BHB is cleared */
EMIT_LFENCE();
- EMIT1(0x59); /* pop rcx */
- EMIT1(0x58); /* pop rax */
}
/* Insert IBHF instruction */
if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Mar 24, 2026 at 09:59:30PM +0100, Borislav Petkov wrote:
> On Tue, Mar 24, 2026 at 11:16:51AM -0700, Pawan Gupta wrote:
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > that mitigates BHI in kernel.
> >
> > BHI variant of VMSCAPE requires isolating branch history between guests and
> > userspace. Note that there is no equivalent hardware control for userspace.
> > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > should execute sufficient number of branches to clear a larger BHB.
> >
> > Dynamically set the loop count of clear_bhb_loop() such that it is
> > effective on newer CPUs too. Use the hardware control enumeration
> > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> >
> > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > ---
> > arch/x86/entry/entry_64.S | 21 ++++++++++++++++-----
> > arch/x86/net/bpf_jit_comp.c | 7 -------
> > 2 files changed, 16 insertions(+), 12 deletions(-)
>
> Ok, pls tell me why this below doesn't work?
>
> The additional indirection makes even the BHB loop code simpler.
>
> (I didn't pay too much attention to the labels, 2: is probably weird there).
>
> ---
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 3a180a36ca0e..95c7ed9afbbe 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1532,11 +1532,13 @@ SYM_CODE_END(rewind_stack_and_make_dead)
> * Note, callers should use a speculation barrier like LFENCE immediately after
> * a call to this function to ensure BHB is cleared before indirect branches.
> */
> -SYM_FUNC_START(clear_bhb_loop)
> +SYM_FUNC_START(__clear_bhb_loop)
> ANNOTATE_NOENDBR
> push %rbp
> + /* BPF caller may require %rax to be preserved */
> + push %rax
> mov %rsp, %rbp
> - movl $5, %ecx
> +
> ANNOTATE_INTRA_FUNCTION_CALL
> call 1f
> jmp 5f
> @@ -1557,17 +1559,17 @@ SYM_FUNC_START(clear_bhb_loop)
> * but some Clang versions (e.g. 18) don't like this.
> */
> .skip 32 - 18, 0xcc
> -2: movl $5, %eax
> +2:
> 3: jmp 4f
> nop
> -4: sub $1, %eax
> +4: sub $1, %rsi
%rsi needs to be reloaded with $inner_loop_count once per outer loop
iteration. We probably need another register to hold that.
> jnz 3b
> - sub $1, %ecx
> + sub $1, %rdi
> jnz 1b
> .Lret2: RET
> 5:
> + pop %rax
> pop %rbp
> RET
> -SYM_FUNC_END(clear_bhb_loop)
> -EXPORT_SYMBOL_FOR_KVM(clear_bhb_loop)
> -STACK_FRAME_NON_STANDARD(clear_bhb_loop)
> +SYM_FUNC_END(__clear_bhb_loop)
> +STACK_FRAME_NON_STANDARD(__clear_bhb_loop)
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 70b377fcbc1c..a9f406941e11 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -390,6 +390,7 @@ extern void write_ibpb(void);
>
> #ifdef CONFIG_X86_64
> extern void clear_bhb_loop(void);
> +extern void __clear_bhb_loop(unsigned int a, unsigned int b);
> #endif
>
> extern void (*x86_return_thunk)(void);
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 83f51cab0b1e..c41b0548cf2a 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -3735,3 +3735,11 @@ void __warn_thunk(void)
> {
> WARN_ONCE(1, "Unpatched return thunk in use. This should not happen!\n");
> }
> +
> +void clear_bhb_loop(void)
> +{
> + if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL))
> + __clear_bhb_loop(12, 7);
> + else
> + __clear_bhb_loop(5, 5);
> +}
This is cleaner. A few things to consider: CLEAR_BRANCH_HISTORY, which
calls clear_bhb_loop(), would be calling into C code very early during
kernel entry. The code generated here may vary based on the compiler, and
any indirect branch here would be a security risk. This also needs to be
noinstr so that it can't be hijacked by kprobes or ftrace.
At kernel entry, calling into C before mitigations are applied is risky.
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 63d6c9fa5e80..e2cceabb23e8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1614,11 +1614,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
> u8 *func;
>
> if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> - /* The clearing sequence clobbers eax and ecx. */
> - EMIT1(0x50); /* push rax */
> - EMIT1(0x51); /* push rcx */
> - ip += 2;
> -
> func = (u8 *)clear_bhb_loop;
Although the call to clear_bhb_loop() will be inserted at the end of the
BPF program before it returns, I am not sure it is safe to assume that
trashing registers in the path clear_bhb_loop() -> __clear_bhb_loop() is
okay, especially when we don't know what code the compiler generated for
clear_bhb_loop(). BPF experts would know better?
On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:
> This is cleaner. A few things to consider are, CLEAR_BRANCH_HISTORY that
> calls clear_bhb_loop() would be calling into C code very early during the
> kernel entry. The code generated here may vary based on the compiler. Any
> indirect branch here would be security risk. This needs to be noinstr so
> that it can't be hijacked by probes and ftraces.
>
> At kernel entry, calling into C before mitigations are applied is risky.
You can write the above function in asm if you prefer - should still be
easier.
> Although call to clear_bhb_loop() will be inserted at the end of the BPF
> program before it returns, I am not sure if it is safe to assume that
> trashing registers in the path clear_bhb_loop() -> __clear_bhb_loop() is
> okay? Especially, when we don't know what code compiler generated for
> clear_bhb_loop(). BPF experts would know better?
The compiler would preserve the regs. If you write it in asm and you adhere to
the C ABI, you could preserve them too. Shouldn't be too many.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Mar 25, 2026 at 09:37:59PM +0100, Borislav Petkov wrote:
> On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:
> > This is cleaner. A few things to consider are, CLEAR_BRANCH_HISTORY that
> > calls clear_bhb_loop() would be calling into C code very early during the
> > kernel entry. The code generated here may vary based on the compiler. Any
> > indirect branch here would be security risk. This needs to be noinstr so
> > that it can't be hijacked by probes and ftraces.
> >
> > At kernel entry, calling into C before mitigations are applied is risky.
>
> You can write the above function in asm if you prefer - should still be
> easier.

I believe the equivalent for cpu_feature_enabled() in asm is the
ALTERNATIVE. Please let me know if I am missing something.

Regarding your intent to move the loop count selection out of the BHB
sequence, below is what I could come up with. It is not as pretty as the C
version, but it is trying to achieve something similar:

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ecae3cef9d8c..54c65b0a3f65 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1494,6 +1494,20 @@ SYM_CODE_START_NOALIGN(rewind_stack_and_make_dead)
 SYM_CODE_END(rewind_stack_and_make_dead)
 .popsection

+/*
+ * Between the long and short version of BHB clear sequence, just the
+ * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
+ */
+#define BHB_SHORT_LOOP_OUTER	5
+#define BHB_SHORT_LOOP_INNER	5
+
+#define BHB_LONG_LOOP_OUTER	12
+#define BHB_LONG_LOOP_INNER	7
+
+#define BHB_MOVB(type, reg)						\
+	ALTERNATIVE __stringify(movb $BHB_SHORT_LOOP_##type, reg),	\
+		    __stringify(movb $BHB_LONG_LOOP_##type, reg), X86_FEATURE_BHI_CTRL
+
 /*
  * This sequence executes branches in order to remove user branch information
  * from the branch history tracker in the Branch Predictor, therefore removing
@@ -1540,12 +1554,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
 	/* BPF caller may require all registers to be preserved */
 	push %rax

-	/*
-	 * Between the long and short version of BHB clear sequence, just the
-	 * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
-	 */
-	ALTERNATIVE "movb $5, %al", \
-		    "movb $12, %al", X86_FEATURE_BHI_CTRL
+	BHB_MOVB(OUTER, %al)

 	ANNOTATE_INTRA_FUNCTION_CALL
 	call 1f
@@ -1567,8 +1576,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
 	 * but some Clang versions (e.g. 18) don't like this.
 	 */
 	.skip 32 - 14, 0xcc
-2:	ALTERNATIVE "movb $5, %ah", \
-		    "movb $7, %ah", X86_FEATURE_BHI_CTRL
+2:	BHB_MOVB(INNER, %ah)
 3:	jmp 4f
 	nop
 4:	sub $1, %ah

Below is how the disassembly looks:

clear_bhb_loop_nofence:
	...
	call 1f
	jmp 5f

	// BHB_MOVB(OUTER, %al)
	mov $0x5,%al
On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> I believe the equivalent for cpu_feature_enabled() in asm is the
> ALTERNATIVE. Please let me know if I am missing something.
Yes, you are.
The point is that you don't want to stick those alternative calls inside some
magic bhb_loop function but hand them in from the outside, as function
arguments.
Basically what I did.
Then you were worried about this being C code and it had to be noinstr... So
that outer function can be rewritten in asm, I think, and still keep it well
separate.
I'll try to rewrite it once I get a free minute, and see how it looks.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, 26 Mar 2026 11:01:20 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > I believe the equivalent for cpu_feature_enabled() in asm is the
> > ALTERNATIVE. Please let me know if I am missing something.
>
> Yes, you are.
>
> The point is that you don't want to stick those alternative calls inside some
> magic bhb_loop function but hand them in from the outside, as function
> arguments.
>
> Basically what I did.
>
> Then you were worried about this being C code and it had to be noinstr... So
> that outer function can be rewritten in asm, I think, and still keep it well
> separate.
>
> I'll try to rewrite it once I get a free minute, and see how it looks.

I think someone tried getting C code to write the values to global data
and getting the asm to read them.
That got discounted because it split things between two largely unrelated
files.

I think the BPF code would need significant refactoring to call a C
function.

	David
On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> On Thu, 26 Mar 2026 11:01:20 +0100
> Borislav Petkov <bp@alien8.de> wrote:
>
> > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > ALTERNATIVE. Please let me know if I am missing something.
> >
> > Yes, you are.
> >
> > The point is that you don't want to stick those alternative calls inside some
> > magic bhb_loop function but hand them in from the outside, as function
> > arguments.
> >
> > Basically what I did.
> >
> > Then you were worried about this being C code and it had to be noinstr... So
> > that outer function can be rewritten in asm, I think, and still keep it well
> > separate.
> >
> > I'll try to rewrite it once I get a free minute, and see how it looks.
> >
>
> I think someone tried getting C code to write the values to global data
> and getting the asm to read them.
> That got discounted because it spilt things between two largely unrelated files.
The implementation with global variables wasn't that bad, let me revive it.

This part ties the sequence to the BHI mitigation, which is not ideal
(because VMSCAPE also uses it), but it does seem like a cleaner option:
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
static void __init bhi_update_mitigation(void)
{
+ if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
+ bhi_seq_outer_loop = 5;
+ bhi_seq_inner_loop = 5;
+ }
+
I believe this can be moved to somewhere common to all mitigations.
> I think the BPF code would need significant refactoring to call a C function.
Ya, true. Will use globals and keep clear_bhb_loop() in asm.
On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > On Thu, 26 Mar 2026 11:01:20 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> >
> > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > ALTERNATIVE. Please let me know if I am missing something.
> > >
> > > Yes, you are.
> > >
> > > The point is that you don't want to stick those alternative calls inside some
> > > magic bhb_loop function but hand them in from the outside, as function
> > > arguments.
> > >
> > > Basically what I did.
> > >
> > > Then you were worried about this being C code and it had to be noinstr... So
> > > that outer function can be rewritten in asm, I think, and still keep it well
> > > separate.
> > >
> > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > >
> >
> > I think someone tried getting C code to write the values to global data
> > and getting the asm to read them.
> > That got discounted because it spilt things between two largely unrelated files.
>
>
> The implementation with global variables wasn't that bad, let me revive it.
>
> This part which ties sequence to BHI mitigation, which is not ideal,
> (because VMSCAPE also uses it) it does seems a cleaner option.
>
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
>
> static void __init bhi_update_mitigation(void)
> {
> + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> + bhi_seq_outer_loop = 5;
> + bhi_seq_inner_loop = 5;
> + }
> +
>
> I believe this can be moved to somewhere common to all mitigations.
>
> > I think the BPF code would need significant refactoring to call a C function.
>
> Ya, true. Will use globals and keep clear_bhb_loop() in asm.
While testing this approach, I noticed that syscalls were suffering an 8%
regression on ICX for Native BHI mitigation:
$ perf bench syscall basic -l 100000000
Bisection pointed to the change for using 8-bit registers (al/ah replacing
eax/ecx) as the main contributor to the regression. (Global variables added
a bit, but within noise).
Further digging revealed a strange behavior: using %ah for the inner loop
was causing the regression; interchanging %al and %ah in the loops
(for movb and sub) eliminated the regression.
<clear_bhb_loop_nofence>:
movb bhb_seq_outer_loop(%rip), %al
call 1f
jmp 5f
1: call 2f
.Lret1: RET
2: movb bhb_seq_inner_loop(%rip), %ah
3: jmp 4f
nop
4: sub $1, %ah <---- No regression with %al here
jnz 3b
sub $1, %al
jnz 1b
My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
inner loop, which is executed a greater number of times, is likely making
the difference. A perf profile is needed to confirm this.

Never imagined a register selection could make an 8% difference in
performance! Anyway, will update the patch with this finding.
On Fri, 27 Mar 2026 17:42:56 -0700
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > >
> > > > Yes, you are.
> > > >
> > > > The point is that you don't want to stick those alternative calls inside some
> > > > magic bhb_loop function but hand them in from the outside, as function
> > > > arguments.
> > > >
> > > > Basically what I did.
> > > >
> > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > separate.
> > > >
> > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > >
> > >
> > > I think someone tried getting C code to write the values to global data
> > > and getting the asm to read them.
> > > That got discounted because it spilt things between two largely unrelated files.
> >
> >
> > The implementation with global variables wasn't that bad, let me revive it.
> >
> > This part which ties sequence to BHI mitigation, which is not ideal,
> > (because VMSCAPE also uses it) it does seems a cleaner option.
> >
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> >
> > static void __init bhi_update_mitigation(void)
> > {
> > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > + bhi_seq_outer_loop = 5;
> > + bhi_seq_inner_loop = 5;
> > + }
> > +
> >
> > I believe this can be moved to somewhere common to all mitigations.
> >
> > > I think the BPF code would need significant refactoring to call a C function.
> >
> > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
>
> While testing this approach, I noticed that syscalls were suffering an 8%
> regression on ICX for Native BHI mitigation:
>
> $ perf bench syscall basic -l 100000000
>
> Bisection pointed to the change for using 8-bit registers (al/ah replacing
> eax/ecx) as the main contributor to the regression. (Global variables added
> a bit, but within noise).
>
> Further digging revealed a strange behavior, using %ah for the inner loop
> was causing the regression, interchanging %al and %ah in the loops
> (for movb and sub) eliminated the regression.
>
> <clear_bhb_loop_nofence>:
>
> movb bhb_seq_outer_loop(%rip), %al
>
> call 1f
> jmp 5f
> 1: call 2f
> .Lret1: RET
> 2: movb bhb_seq_inner_loop(%rip), %ah
> 3: jmp 4f
> nop
> 4: sub $1, %ah <---- No regression with %al here
> jnz 3b
> sub $1, %al
> jnz 1b
>
> My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> inner loop, which is executed more number of times is likely making the
> difference. A perf profile is needed to confirm this.
I bet it is also CPU dependent - it is quite likely that there isn't
any special hardware to support partial writes of %ah, so it ends up taking
a slow path (possibly even a microcoded one, to get an 8% regression).
As well as swapping %al <-> %ah try changing the outer loop decrement to
sub $0x100, %ax
since %al is zero that will set the z flag the same.
I've just hacked a test into some test code I've got.
I'm not seeing any unexpected costs on either zen-5 or haswell.
So it may be more subtle.
David
>
> Never imagined a register selection can make an 8% difference in
> performance! Anyways, will update the patch with this finding.
On Sat, Mar 28, 2026 at 10:08:37AM +0000, David Laight wrote:
> On Fri, 27 Mar 2026 17:42:56 -0700
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> > On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > > Borislav Petkov <bp@alien8.de> wrote:
> > > >
> > > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > > >
> > > > > Yes, you are.
> > > > >
> > > > > The point is that you don't want to stick those alternative calls inside some
> > > > > magic bhb_loop function but hand them in from the outside, as function
> > > > > arguments.
> > > > >
> > > > > Basically what I did.
> > > > >
> > > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > > separate.
> > > > >
> > > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > > >
> > > >
> > > > I think someone tried getting C code to write the values to global data
> > > > and getting the asm to read them.
> > > > That got discounted because it spilt things between two largely unrelated files.
> > >
> > >
> > > The implementation with global variables wasn't that bad, let me revive it.
> > >
> > > This part which ties sequence to BHI mitigation, which is not ideal,
> > > (because VMSCAPE also uses it) it does seems a cleaner option.
> > >
> > > --- a/arch/x86/kernel/cpu/bugs.c
> > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> > >
> > > static void __init bhi_update_mitigation(void)
> > > {
> > > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > > + bhi_seq_outer_loop = 5;
> > > + bhi_seq_inner_loop = 5;
> > > + }
> > > +
> > >
> > > I believe this can be moved to somewhere common to all mitigations.
> > >
> > > > I think the BPF code would need significant refactoring to call a C function.
> > >
> > > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
> >
> > While testing this approach, I noticed that syscalls were suffering an 8%
> > regression on ICX for Native BHI mitigation:
> >
> > $ perf bench syscall basic -l 100000000
> >
> > Bisection pointed to the change for using 8-bit registers (al/ah replacing
> > eax/ecx) as the main contributor to the regression. (Global variables added
> > a bit, but within noise).
> >
> > Further digging revealed a strange behavior, using %ah for the inner loop
> > was causing the regression, interchanging %al and %ah in the loops
> > (for movb and sub) eliminated the regression.
> >
> > <clear_bhb_loop_nofence>:
> >
> > movb bhb_seq_outer_loop(%rip), %al
> >
> > call 1f
> > jmp 5f
> > 1: call 2f
> > .Lret1: RET
> > 2: movb bhb_seq_inner_loop(%rip), %ah
> > 3: jmp 4f
> > nop
> > 4: sub $1, %ah <---- No regression with %al here
> > jnz 3b
> > sub $1, %al
> > jnz 1b
> >
> > My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> > inner loop, which is executed more number of times is likely making the
> > difference. A perf profile is needed to confirm this.
>
> I bet it is also CPU dependant - it is quite likely that there isn't
> any special hardware to support partial writes of %ah so it ends up taking
> a slow path (possibly even a microcoded one to get an 8% regression).
Strangely, %ah in the inner loop incurs fewer uops and has fewer branch
misses, yet takes more cycles. Below is the perf data for the sequence on a
Rocket Lake (similar observation on ICX and EMR):
Event                   %al inner        %ah inner        Delta
----------------------  ---------------  ---------------  ----------
cycles                  776,775,020      972,322,384      +25.2%
instructions/cycle      1.23             0.98             -20.3%
branch-misses           4,792,502        560,449          -88.3%
uops_issued.any         768,019,010      696,888,357      -9.3%
time elapsed            0.1627s          0.2048s          +25.9%
Time elapsed directly correlates with the increase in cycles.
> As well as swapping %al <-> %ah try changing the outer loop decrement to
> sub $0x100, %ax
> since %al is zero that will set the z flag the same.
Unfortunately, using "sub $0x100, %ax" (with %al as the inner loop) isn't
better than just using "sub $1, %ah" in the outer loop:
Event                   %al inner        + sub %ax        Delta
----------------------  ---------------  ---------------  ----------
cycles                  776,775,020      813,372,036      +4.7%
instructions/cycle      1.23             1.17             -4.5%
branch-misses           4,792,502        7,610,323        +58.8%
uops_issued.any         768,019,010      827,465,137      +7.7%
time elapsed            0.1627s          0.1707s          +4.9%
> I've just hacked a test into some test code I've got.
> I'm not seeing an unexpected costs on either zen-5 or haswell.
> So it may be more subtle.
This is puzzling, but at least it is evident that using %al for the inner
loop seems to be the best option. In summary:
Variant    Cycles        Uops Issued   Branch Misses
-------    ----------    -----------   -------------
%al        776M          768M          4.8M           (fastest)
%ah        972M (+25%)   697M (-9%)    560K (-88%)    (fewer uops + misses, yet slowest)
sub %ax    813M (+5%)    827M (+8%)    7.6M (+59%)    (most uops + misses)
On Wed, 1 Apr 2026 01:12:36 -0700
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Sat, Mar 28, 2026 at 10:08:37AM +0000, David Laight wrote:
> > On Fri, 27 Mar 2026 17:42:56 -0700
> > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > > On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > > > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > > > Borislav Petkov <bp@alien8.de> wrote:
> > > > >
> > > > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > > > >
> > > > > > Yes, you are.
> > > > > >
> > > > > > The point is that you don't want to stick those alternative calls inside some
> > > > > > magic bhb_loop function but hand them in from the outside, as function
> > > > > > arguments.
> > > > > >
> > > > > > Basically what I did.
> > > > > >
> > > > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > > > separate.
> > > > > >
> > > > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > > > >
> > > > >
> > > > > I think someone tried getting C code to write the values to global data
> > > > > and getting the asm to read them.
> > > > > That got discounted because it spilt things between two largely unrelated files.
> > > >
> > > >
> > > > The implementation with global variables wasn't that bad, let me revive it.
> > > >
> > > > This part which ties sequence to BHI mitigation, which is not ideal,
> > > > (because VMSCAPE also uses it) it does seems a cleaner option.
> > > >
> > > > --- a/arch/x86/kernel/cpu/bugs.c
> > > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> > > >
> > > > static void __init bhi_update_mitigation(void)
> > > > {
> > > > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > > > + bhi_seq_outer_loop = 5;
> > > > + bhi_seq_inner_loop = 5;
> > > > + }
> > > > +
> > > >
> > > > I believe this can be moved to somewhere common to all mitigations.
> > > >
> > > > > I think the BPF code would need significant refactoring to call a C function.
> > > >
> > > > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
> > >
> > > While testing this approach, I noticed that syscalls were suffering an 8%
> > > regression on ICX for Native BHI mitigation:
> > >
> > > $ perf bench syscall basic -l 100000000
> > >
> > > Bisection pointed to the change for using 8-bit registers (al/ah replacing
> > > eax/ecx) as the main contributor to the regression. (Global variables added
> > > a bit, but within noise).
> > >
> > > Further digging revealed a strange behavior, using %ah for the inner loop
> > > was causing the regression, interchanging %al and %ah in the loops
> > > (for movb and sub) eliminated the regression.
> > >
> > > <clear_bhb_loop_nofence>:
> > >
> > > movb bhb_seq_outer_loop(%rip), %al
> > >
> > > call 1f
> > > jmp 5f
> > > 1: call 2f
> > > .Lret1: RET
> > > 2: movb bhb_seq_inner_loop(%rip), %ah
> > > 3: jmp 4f
> > > nop
> > > 4: sub $1, %ah <---- No regression with %al here
> > > jnz 3b
> > > sub $1, %al
> > > jnz 1b
> > >
> > > My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> > > inner loop, which is executed more number of times is likely making the
> > > difference. A perf profile is needed to confirm this.
> >
> > I bet it is also CPU dependant - it is quite likely that there isn't
> > any special hardware to support partial writes of %ah so it ends up taking
> > a slow path (possibly even a microcoded one to get an 8% regression).
>
> Strangely, %ah in the inner loop incurs less uops and has fewer branch
> misses, yet takes more cycles. Below is the perf data for the sequence on a
> Rocket Lake (similar observation on ICX and EMR):
>
> Event                   %al inner        %ah inner        Delta
> ----------------------  ---------------  ---------------  ----------
> cycles                  776,775,020      972,322,384      +25.2%
> instructions/cycle      1.23             0.98             -20.3%
> branch-misses           4,792,502        560,449          -88.3%
> uops_issued.any         768,019,010      696,888,357      -9.3%
> time elapsed            0.1627s          0.2048s          +25.9%
>
> Time elapsed directly correlates with the increase in cycles.
That might be consistent with the %ah accesses (probably writes)
being very slow/synchronising.
So you are getting a full cpu stall instead of speculative execution
of the following instructions - which must include a lot of mis-predicted
branches.
> > As well as swapping %al <-> %ah try changing the outer loop decrement to
> > sub $0x100, %ax
> > since %al is zero that will set the z flag the same.
>
> Unfortunately, using "sub $0x100, %ax"(with %al as inner loop) isn't better
> than just using "sub $1, %ah" in the outer loop:
>
> Event                   %al inner        + sub %ax        Delta
> ----------------------  ---------------  ---------------  ----------
> cycles                  776,775,020      813,372,036      +4.7%
> instructions/cycle      1.23             1.17             -4.5%
> branch-misses           4,792,502        7,610,323        +58.8%
> uops_issued.any         768,019,010      827,465,137      +7.7%
> time elapsed            0.1627s          0.1707s          +4.9%
That is even more interesting.
The 'sub %ax' version has more uops and more branch-misses.
Looks like the extra cost of the %ah access is less than the cost
of the extra mis-predicted branches.
Makes me wonder where a version that uses %cl fits?
(Or use a zero-extending read and %eax/%ecx - likely to be the same.)
I'll bet 'one beer' that is nearest the 'sub %ax' version.
David
>
> > I've just hacked a test into some test code I've got.
> > I'm not seeing an unexpected costs on either zen-5 or haswell.
> > So it may be more subtle.
>
> This is puzzling, but atleast it is evident that using %al for the inner
> loop seems to be the best option. In summary:
>
> Variant    Cycles        Uops Issued   Branch Misses
> -------    ----------    -----------   -------------
> %al        776M          768M          4.8M           (fastest)
> %ah        972M (+25%)   697M (-9%)    560K (-88%)    (fewer uops + misses, yet slowest)
> sub %ax    813M (+5%)    827M (+8%)    7.6M (+59%)    (most uops + misses)
On Wed, Apr 01, 2026 at 10:02:00AM +0100, David Laight wrote:
> > > As well as swapping %al <-> %ah try changing the outer loop decrement to
> > > sub $0x100, %ax
> > > since %al is zero that will set the z flag the same.
> >
> > Unfortunately, using "sub $0x100, %ax"(with %al as inner loop) isn't better
> > than just using "sub $1, %ah" in the outer loop:
> >
> > Event                   %al inner        + sub %ax        Delta
> > ----------------------  ---------------  ---------------  ----------
> > cycles                  776,775,020      813,372,036      +4.7%
> > instructions/cycle      1.23             1.17             -4.5%
> > branch-misses           4,792,502        7,610,323        +58.8%
> > uops_issued.any         768,019,010      827,465,137      +7.7%
> > time elapsed            0.1627s          0.1707s          +4.9%
>
> That is even more interesting.
> The 'sub %ax' version has more uops and more branch-misses.
> Looks like the extra cost of the %ah access is less than the cost
> of the extra mis-predicted branches.
>
> Makes me wonder where a version that uses %cl fits?
> (Or use a zero-extending read and %eax/%ecx - likely to be the same.)
> I'll bet 'one beer' that is nearest the 'sub %ax' version.
%cl didn't make a noticeable difference, but ...
Event                   %al/%ah          %al/%cl          Delta
                        (inner/outer)    (inner/outer)
----------------------  ---------------  ---------------  ----------
cycles                  776,380,149      778,294,183      +0.2%
instructions/cycle      1.23             1.22             -0.4%
branch-misses           4,986,437        5,679,599        +13.9%
uops_issued.any         773,223,387      765,724,878      -1.0%
time elapsed            0.1631s          0.1637s          +0.4%
... there are meaningful gains with 32-bit registers:
Event                   %al/%ah          %eax/%ecx        Delta
                        (inner/outer)    (inner/outer)
----------------------  ---------------  ---------------  ----------
cycles                  776,380,149      706,331,177      -9.0%
instructions/cycle      1.23             1.35             +9.9%
branch-misses           4,986,437        6,089,306        +22.1%
uops_issued.any         773,223,387      774,539,522      +0.2%
time elapsed            0.1631s          0.1482s          -9.1%
These values are for userspace tests with immediates. Next, I will test how
they perform with memory loads in the kernel. Before we finalize, these
uarch nuances need to be tested on a variety of CPUs.
On Thu, 26 Mar 2026 01:39:34 -0700
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:

> On Wed, Mar 25, 2026 at 09:37:59PM +0100, Borislav Petkov wrote:
> > On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:
> > > This is cleaner. A few things to consider are, CLEAR_BRANCH_HISTORY that
> > > calls clear_bhb_loop() would be calling into C code very early during the
> > > kernel entry. The code generated here may vary based on the compiler. Any
> > > indirect branch here would be security risk. This needs to be noinstr so
> > > that it can't be hijacked by probes and ftraces.
> > >
> > > At kernel entry, calling into C before mitigations are applied is risky.
> >
> > You can write the above function in asm if you prefer - should still be
> > easier.
>
> I believe the equivalent for cpu_feature_enabled() in asm is the
> ALTERNATIVE. Please let me know if I am missing something.
>
> Regarding your intent to move the loop count selection out of the BHB
> sequence, below is what I could come up. It is not as pretty as the C
> version, but it is trying to achieve something similar:

I think that fails on being harder to read and longer.
So no real benefit.

I believe this code has to be asm because it is required to execute
specific instructions in a specific order - you can't trust the C
compiler to do that for you.

	David

>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index ecae3cef9d8c..54c65b0a3f65 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1494,6 +1494,20 @@ SYM_CODE_START_NOALIGN(rewind_stack_and_make_dead)
>  SYM_CODE_END(rewind_stack_and_make_dead)
>  .popsection
>
> +/*
> + * Between the long and short version of BHB clear sequence, just the
> + * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
> + */
> +#define BHB_SHORT_LOOP_OUTER	5
> +#define BHB_SHORT_LOOP_INNER	5
> +
> +#define BHB_LONG_LOOP_OUTER	12
> +#define BHB_LONG_LOOP_INNER	7
> +
> +#define BHB_MOVB(type, reg)						\
> +	ALTERNATIVE __stringify(movb $BHB_SHORT_LOOP_##type, reg),	\
> +		    __stringify(movb $BHB_LONG_LOOP_##type, reg), X86_FEATURE_BHI_CTRL
> +
>  /*
>   * This sequence executes branches in order to remove user branch information
>   * from the branch history tracker in the Branch Predictor, therefore removing
> @@ -1540,12 +1554,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
> 	/* BPF caller may require all registers to be preserved */
> 	push %rax
>
> -	/*
> -	 * Between the long and short version of BHB clear sequence, just the
> -	 * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
> -	 */
> -	ALTERNATIVE "movb $5, %al", \
> -		    "movb $12, %al", X86_FEATURE_BHI_CTRL
> +	BHB_MOVB(OUTER, %al)
>
> 	ANNOTATE_INTRA_FUNCTION_CALL
> 	call	1f
> 	jmp	5f
> @@ -1567,8 +1576,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
> 	 * but some Clang versions (e.g. 18) don't like this.
> 	 */
> 	.skip 32 - 14, 0xcc
> -2:	ALTERNATIVE "movb $5, %ah", \
> -		    "movb $7, %ah", X86_FEATURE_BHI_CTRL
> +2:	BHB_MOVB(INNER, %ah)
> 3:	jmp	4f
> 	nop
> 4:	sub	$1, %ah
>
>
> Below is how the disassembly looks like:
>
> clear_bhb_loop_nofence:
> ...
> 	call 1f
> 	jmp 5f
> 	// BHB_MOVB(OUTER, %al)
> 	mov $0x5,%al
On Wed, 25 Mar 2026 21:37:59 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:

...

> > Although call to clear_bhb_loop() will be inserted at the end of the BPF
> > program before it returns, I am not sure if it is safe to assume that
> > trashing registers in the path clear_bhb_loop() -> __clear_bhb_loop() is
> > okay? Especially, when we don't know what code compiler generated for
> > clear_bhb_loop(). BPF experts would know better?
>
> The compiler would preserve the regs. If you write it in asm and you adhere to
> the C ABI, you could preserve them too. Shouldn't be too many.

The BPF code that calls it doesn't use the C ABI - it just puts
a call instruction in the code it generates.
Hence all registers must be preserved.

	David

>
> Thx.
>