As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
the Branch History Buffer (BHB). On Alder Lake and newer parts this
sequence is not sufficient because it doesn't clear enough entries. This
was not an issue because these CPUs have a hardware control (BHI_DIS_S)
that mitigates BHI in the kernel.

The BHI variant of VMSCAPE requires isolating branch history between guests
and userspace. Note that there is no equivalent hardware control for
userspace. To effectively isolate branch history on newer CPUs,
clear_bhb_loop() should execute a sufficient number of branches to clear
the larger BHB.

Dynamically set the loop count of clear_bhb_loop() such that it is
effective on newer CPUs too. Use the hardware control enumeration
X86_FEATURE_BHI_CTRL to select the appropriate loop count.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 21 ++++++++++++++++-----
arch/x86/net/bpf_jit_comp.c | 7 -------
2 files changed, 16 insertions(+), 12 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3a180a36ca0e..8128e00ca73f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1535,8 +1535,17 @@ SYM_CODE_END(rewind_stack_and_make_dead)
SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
+ /* BPF caller may require %rax to be preserved */
+ push %rax
mov %rsp, %rbp
- movl $5, %ecx
+
+ /*
+ * Between the long and short version of BHB clear sequence, just the
+ * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
+ */
+ ALTERNATIVE "movb $5, %al", \
+ "movb $12, %al", X86_FEATURE_BHI_CTRL
+
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
jmp 5f
@@ -1556,16 +1565,18 @@ SYM_FUNC_START(clear_bhb_loop)
* This should be ideally be: .skip 32 - (.Lret2 - 2f), 0xcc
* but some Clang versions (e.g. 18) don't like this.
*/
- .skip 32 - 18, 0xcc
-2: movl $5, %eax
+ .skip 32 - 14, 0xcc
+2: ALTERNATIVE "movb $5, %ah", \
+ "movb $7, %ah", X86_FEATURE_BHI_CTRL
3: jmp 4f
nop
-4: sub $1, %eax
+4: sub $1, %ah
jnz 3b
- sub $1, %ecx
+ sub $1, %al
jnz 1b
.Lret2: RET
5:
+ pop %rax
pop %rbp
RET
SYM_FUNC_END(clear_bhb_loop)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 63d6c9fa5e80..e2cceabb23e8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1614,11 +1614,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
u8 *func;
if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
- /* The clearing sequence clobbers eax and ecx. */
- EMIT1(0x50); /* push rax */
- EMIT1(0x51); /* push rcx */
- ip += 2;
-
func = (u8 *)clear_bhb_loop;
ip += x86_call_depth_emit_accounting(&prog, func, ip);
@@ -1626,8 +1621,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
return -EINVAL;
/* Don't speculate past this until BHB is cleared */
EMIT_LFENCE();
- EMIT1(0x59); /* pop rcx */
- EMIT1(0x58); /* pop rax */
}
/* Insert IBHF instruction */
if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&
--
2.34.1
On Tue, Mar 24, 2026 at 11:19 AM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
[...]
> @@ -1535,8 +1535,17 @@ SYM_CODE_END(rewind_stack_and_make_dead)
>  SYM_FUNC_START(clear_bhb_loop)
>         ANNOTATE_NOENDBR
>         push %rbp
> +       /* BPF caller may require %rax to be preserved */
> +       push %rax

Shouldn't the "push %rax" come after "mov %rsp, %rbp"?

>         mov %rsp, %rbp
> -       movl $5, %ecx
On Wed, 25 Mar 2026 10:50:58 -0700
Jim Mattson <jmattson@google.com> wrote:

> On Tue, Mar 24, 2026 at 11:19 AM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
[...]
> >  SYM_FUNC_START(clear_bhb_loop)
> >         ANNOTATE_NOENDBR
> >         push %rbp
> > +       /* BPF caller may require %rax to be preserved */

Since you need a new version change that to 'all registers preserved'.

> > +       push %rax
>
> Shouldn't the "push %rax" come after "mov %rsp, %rbp"?

Or delete the stack frame :-)
It is only there for the stack trace-back code.

	David

> >         mov %rsp, %rbp
> > -       movl $5, %ecx
On Wed, Mar 25, 2026 at 07:41:46PM +0000, David Laight wrote:
> On Wed, 25 Mar 2026 10:50:58 -0700
> Jim Mattson <jmattson@google.com> wrote:
[...]
> > > +       /* BPF caller may require %rax to be preserved */
>
> Since you need a new version change that to 'all registers preserved'.

Ya, that's more accurate.

> > > +       push %rax
> >
> > Shouldn't the "push %rax" come after "mov %rsp, %rbp"?
>
> Or delete the stack frame :-)
> It is only there for the stack trace-back code.

Hmm, let's keep the stack frame, it might help debug.
On Wed, Mar 25, 2026 at 10:50:58AM -0700, Jim Mattson wrote:
> On Tue, Mar 24, 2026 at 11:19 AM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
[...]
> >  SYM_FUNC_START(clear_bhb_loop)
> >         ANNOTATE_NOENDBR
> >         push %rbp
> > +       /* BPF caller may require %rax to be preserved */
> > +       push %rax
>
> Shouldn't the "push %rax" come after "mov %rsp, %rbp"?

Right, thanks for catching that.

> >         mov %rsp, %rbp
> > -       movl $5, %ecx
On Tue, Mar 24, 2026 at 11:16:51AM -0700, Pawan Gupta wrote:
> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> the Branch History Buffer (BHB). On Alder Lake and newer parts this
> sequence is not sufficient because it doesn't clear enough entries. This
> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> that mitigates BHI in kernel.
>
> BHI variant of VMSCAPE requires isolating branch history between guests and
> userspace. Note that there is no equivalent hardware control for userspace.
> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> should execute sufficient number of branches to clear a larger BHB.
>
> Dynamically set the loop count of clear_bhb_loop() such that it is
> effective on newer CPUs too. Use the hardware control enumeration
> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
> arch/x86/entry/entry_64.S | 21 ++++++++++++++++-----
> arch/x86/net/bpf_jit_comp.c | 7 -------
> 2 files changed, 16 insertions(+), 12 deletions(-)
Ok, pls tell me why this below doesn't work?
The additional indirection makes even the BHB loop code simpler.
(I didn't pay too much attention to the labels, 2: is probably weird there).
---
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3a180a36ca0e..95c7ed9afbbe 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1532,11 +1532,13 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* Note, callers should use a speculation barrier like LFENCE immediately after
* a call to this function to ensure BHB is cleared before indirect branches.
*/
-SYM_FUNC_START(clear_bhb_loop)
+SYM_FUNC_START(__clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
+ /* BPF caller may require %rax to be preserved */
+ push %rax
mov %rsp, %rbp
- movl $5, %ecx
+
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
jmp 5f
@@ -1557,17 +1559,17 @@ SYM_FUNC_START(clear_bhb_loop)
* but some Clang versions (e.g. 18) don't like this.
*/
.skip 32 - 18, 0xcc
-2: movl $5, %eax
+2:
3: jmp 4f
nop
-4: sub $1, %eax
+4: sub $1, %rsi
jnz 3b
- sub $1, %ecx
+ sub $1, %rdi
jnz 1b
.Lret2: RET
5:
+ pop %rax
pop %rbp
RET
-SYM_FUNC_END(clear_bhb_loop)
-EXPORT_SYMBOL_FOR_KVM(clear_bhb_loop)
-STACK_FRAME_NON_STANDARD(clear_bhb_loop)
+SYM_FUNC_END(__clear_bhb_loop)
+STACK_FRAME_NON_STANDARD(__clear_bhb_loop)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 70b377fcbc1c..a9f406941e11 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -390,6 +390,7 @@ extern void write_ibpb(void);
#ifdef CONFIG_X86_64
extern void clear_bhb_loop(void);
+extern void __clear_bhb_loop(unsigned int a, unsigned int b);
#endif
extern void (*x86_return_thunk)(void);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 83f51cab0b1e..c41b0548cf2a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -3735,3 +3735,11 @@ void __warn_thunk(void)
{
WARN_ONCE(1, "Unpatched return thunk in use. This should not happen!\n");
}
+
+void clear_bhb_loop(void)
+{
+ if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL))
+ __clear_bhb_loop(12, 7);
+ else
+ __clear_bhb_loop(5, 5);
+}
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 63d6c9fa5e80..e2cceabb23e8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1614,11 +1614,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
u8 *func;
if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
- /* The clearing sequence clobbers eax and ecx. */
- EMIT1(0x50); /* push rax */
- EMIT1(0x51); /* push rcx */
- ip += 2;
-
func = (u8 *)clear_bhb_loop;
ip += x86_call_depth_emit_accounting(&prog, func, ip);
@@ -1626,8 +1621,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
return -EINVAL;
/* Don't speculate past this until BHB is cleared */
EMIT_LFENCE();
- EMIT1(0x59); /* pop rcx */
- EMIT1(0x58); /* pop rax */
}
/* Insert IBHF instruction */
if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Mar 24, 2026 at 09:59:30PM +0100, Borislav Petkov wrote:
> On Tue, Mar 24, 2026 at 11:16:51AM -0700, Pawan Gupta wrote:
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > that mitigates BHI in kernel.
> >
> > BHI variant of VMSCAPE requires isolating branch history between guests and
> > userspace. Note that there is no equivalent hardware control for userspace.
> > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > should execute sufficient number of branches to clear a larger BHB.
> >
> > Dynamically set the loop count of clear_bhb_loop() such that it is
> > effective on newer CPUs too. Use the hardware control enumeration
> > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> >
> > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > ---
> > arch/x86/entry/entry_64.S | 21 ++++++++++++++++-----
> > arch/x86/net/bpf_jit_comp.c | 7 -------
> > 2 files changed, 16 insertions(+), 12 deletions(-)
>
> Ok, pls tell me why this below doesn't work?
>
> The additional indirection makes even the BHB loop code simpler.
>
> (I didn't pay too much attention to the labels, 2: is probably weird there).
>
> ---
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 3a180a36ca0e..95c7ed9afbbe 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1532,11 +1532,13 @@ SYM_CODE_END(rewind_stack_and_make_dead)
> * Note, callers should use a speculation barrier like LFENCE immediately after
> * a call to this function to ensure BHB is cleared before indirect branches.
> */
> -SYM_FUNC_START(clear_bhb_loop)
> +SYM_FUNC_START(__clear_bhb_loop)
> ANNOTATE_NOENDBR
> push %rbp
> + /* BPF caller may require %rax to be preserved */
> + push %rax
> mov %rsp, %rbp
> - movl $5, %ecx
> +
> ANNOTATE_INTRA_FUNCTION_CALL
> call 1f
> jmp 5f
> @@ -1557,17 +1559,17 @@ SYM_FUNC_START(clear_bhb_loop)
> * but some Clang versions (e.g. 18) don't like this.
> */
> .skip 32 - 18, 0xcc
> -2: movl $5, %eax
> +2:
> 3: jmp 4f
> nop
> -4: sub $1, %eax
> +4: sub $1, %rsi
%rsi needs to be reloaded with $inner_loop_count once per outer loop
iteration. We probably need another register to hold that.
> jnz 3b
> - sub $1, %ecx
> + sub $1, %rdi
> jnz 1b
> .Lret2: RET
> 5:
> + pop %rax
> pop %rbp
> RET
> -SYM_FUNC_END(clear_bhb_loop)
> -EXPORT_SYMBOL_FOR_KVM(clear_bhb_loop)
> -STACK_FRAME_NON_STANDARD(clear_bhb_loop)
> +SYM_FUNC_END(__clear_bhb_loop)
> +STACK_FRAME_NON_STANDARD(__clear_bhb_loop)
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 70b377fcbc1c..a9f406941e11 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -390,6 +390,7 @@ extern void write_ibpb(void);
>
> #ifdef CONFIG_X86_64
> extern void clear_bhb_loop(void);
> +extern void __clear_bhb_loop(unsigned int a, unsigned int b);
> #endif
>
> extern void (*x86_return_thunk)(void);
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 83f51cab0b1e..c41b0548cf2a 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -3735,3 +3735,11 @@ void __warn_thunk(void)
> {
> WARN_ONCE(1, "Unpatched return thunk in use. This should not happen!\n");
> }
> +
> +void clear_bhb_loop(void)
> +{
> + if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL))
> + __clear_bhb_loop(12, 7);
> + else
> + __clear_bhb_loop(5, 5);
> +}
This is cleaner. A few things to consider: CLEAR_BRANCH_HISTORY, which
calls clear_bhb_loop(), would be calling into C code very early during
kernel entry. The code generated here may vary based on the compiler, and
any indirect branch here would be a security risk. This also needs to be
noinstr so that it can't be hijacked by kprobes or ftrace.
At kernel entry, calling into C before mitigations are applied is risky.
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 63d6c9fa5e80..e2cceabb23e8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1614,11 +1614,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
> u8 *func;
>
> if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> - /* The clearing sequence clobbers eax and ecx. */
> - EMIT1(0x50); /* push rax */
> - EMIT1(0x51); /* push rcx */
> - ip += 2;
> -
> func = (u8 *)clear_bhb_loop;
Although the call to clear_bhb_loop() will be inserted at the end of the
BPF program before it returns, I am not sure it is safe to assume that
trashing registers in the path clear_bhb_loop() -> __clear_bhb_loop() is
okay, especially when we don't know what code the compiler generated for
clear_bhb_loop(). BPF experts would know better?
On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:
> This is cleaner. A few things to consider are, CLEAR_BRANCH_HISTORY that
> calls clear_bhb_loop() would be calling into C code very early during the
> kernel entry. The code generated here may vary based on the compiler. Any
> indirect branch here would be security risk. This needs to be noinstr so
> that it can't be hijacked by probes and ftraces.
>
> At kernel entry, calling into C before mitigations are applied is risky.
You can write the above function in asm if you prefer - should still be
easier.
> Although call to clear_bhb_loop() will be inserted at the end of the BPF
> program before it returns, I am not sure if it is safe to assume that
> trashing registers in the path clear_bhb_loop() -> __clear_bhb_loop() is
> okay? Especially, when we don't know what code compiler generated for
> clear_bhb_loop(). BPF experts would know better?
The compiler would preserve the regs. If you write it in asm and you adhere to
the C ABI, you could preserve them too. Shouldn't be too many.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Mar 25, 2026 at 09:37:59PM +0100, Borislav Petkov wrote:
> On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:
> > This is cleaner. A few things to consider are, CLEAR_BRANCH_HISTORY that
> > calls clear_bhb_loop() would be calling into C code very early during the
> > kernel entry. The code generated here may vary based on the compiler. Any
> > indirect branch here would be security risk. This needs to be noinstr so
> > that it can't be hijacked by probes and ftraces.
> >
> > At kernel entry, calling into C before mitigations are applied is risky.
>
> You can write the above function in asm if you prefer - should still be
> easier.

I believe the equivalent for cpu_feature_enabled() in asm is the
ALTERNATIVE. Please let me know if I am missing something.

Regarding your intent to move the loop count selection out of the BHB
sequence, below is what I could come up with. It is not as pretty as the C
version, but it is trying to achieve something similar:

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ecae3cef9d8c..54c65b0a3f65 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1494,6 +1494,20 @@ SYM_CODE_START_NOALIGN(rewind_stack_and_make_dead)
 SYM_CODE_END(rewind_stack_and_make_dead)
 .popsection

+/*
+ * Between the long and short version of BHB clear sequence, just the
+ * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
+ */
+#define BHB_SHORT_LOOP_OUTER	5
+#define BHB_SHORT_LOOP_INNER	5
+
+#define BHB_LONG_LOOP_OUTER	12
+#define BHB_LONG_LOOP_INNER	7
+
+#define BHB_MOVB(type, reg)						\
+	ALTERNATIVE __stringify(movb $BHB_SHORT_LOOP_##type, reg),	\
+		    __stringify(movb $BHB_LONG_LOOP_##type, reg), X86_FEATURE_BHI_CTRL
+
 /*
  * This sequence executes branches in order to remove user branch information
  * from the branch history tracker in the Branch Predictor, therefore removing
@@ -1540,12 +1554,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
 	/* BPF caller may require all registers to be preserved */
 	push %rax

-	/*
-	 * Between the long and short version of BHB clear sequence, just the
-	 * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
-	 */
-	ALTERNATIVE "movb $5, %al", \
-		    "movb $12, %al", X86_FEATURE_BHI_CTRL
+	BHB_MOVB(OUTER, %al)

 	ANNOTATE_INTRA_FUNCTION_CALL
 	call 1f
@@ -1567,8 +1576,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
 	 * but some Clang versions (e.g. 18) don't like this.
 	 */
 	.skip 32 - 14, 0xcc
-2:	ALTERNATIVE "movb $5, %ah", \
-		    "movb $7, %ah", X86_FEATURE_BHI_CTRL
+2:	BHB_MOVB(INNER, %ah)
 3:	jmp 4f
 	nop
 4:	sub $1, %ah

Below is how the disassembly looks:

clear_bhb_loop_nofence:
	...
	call 1f
	jmp 5f

	// BHB_MOVB(OUTER, %al)
	mov $0x5,%al
On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> I believe the equivalent for cpu_feature_enabled() in asm is the
> ALTERNATIVE. Please let me know if I am missing something.
Yes, you are.
The point is that you don't want to stick those alternative calls inside some
magic bhb_loop function but hand them in from the outside, as function
arguments.
Basically what I did.
Then you were worried about this being C code and it had to be noinstr... So
that outer function can be rewritten in asm, I think, and still keep it well
separate.
I'll try to rewrite it once I get a free minute, and see how it looks.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, 26 Mar 2026 11:01:20 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > I believe the equivalent for cpu_feature_enabled() in asm is the
> > ALTERNATIVE. Please let me know if I am missing something.
>
> Yes, you are.
>
> The point is that you don't want to stick those alternative calls inside some
> magic bhb_loop function but hand them in from the outside, as function
> arguments.
>
> Basically what I did.
>
> Then you were worried about this being C code and it had to be noinstr... So
> that outer function can be rewritten in asm, I think, and still keep it well
> separate.
>
> I'll try to rewrite it once I get a free minute, and see how it looks.

I think someone tried getting C code to write the values to global data
and getting the asm to read them.
That got discounted because it split things between two largely unrelated
files.

I think the BPF code would need significant refactoring to call a C
function.

	David
On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> On Thu, 26 Mar 2026 11:01:20 +0100
> Borislav Petkov <bp@alien8.de> wrote:
>
> > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > ALTERNATIVE. Please let me know if I am missing something.
> >
> > Yes, you are.
> >
> > The point is that you don't want to stick those alternative calls inside some
> > magic bhb_loop function but hand them in from the outside, as function
> > arguments.
> >
> > Basically what I did.
> >
> > Then you were worried about this being C code and it had to be noinstr... So
> > that outer function can be rewritten in asm, I think, and still keep it well
> > separate.
> >
> > I'll try to rewrite it once I get a free minute, and see how it looks.
> >
>
> I think someone tried getting C code to write the values to global data
> and getting the asm to read them.
> That got discounted because it spilt things between two largely unrelated files.
The implementation with global variables wasn't that bad, let me revive it.

This part ties the sequence to the BHI mitigation, which is not ideal
(because VMSCAPE also uses it), but it does seem like a cleaner option:
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
static void __init bhi_update_mitigation(void)
{
+ if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
+ bhi_seq_outer_loop = 5;
+ bhi_seq_inner_loop = 5;
+ }
+
I believe this can be moved to somewhere common to all mitigations.
> I think the BPF code would need significant refactoring to call a C function.
Ya, true. Will use globals and keep clear_bhb_loop() in asm.
On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > On Thu, 26 Mar 2026 11:01:20 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> >
> > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > ALTERNATIVE. Please let me know if I am missing something.
> > >
> > > Yes, you are.
> > >
> > > The point is that you don't want to stick those alternative calls inside some
> > > magic bhb_loop function but hand them in from the outside, as function
> > > arguments.
> > >
> > > Basically what I did.
> > >
> > > Then you were worried about this being C code and it had to be noinstr... So
> > > that outer function can be rewritten in asm, I think, and still keep it well
> > > separate.
> > >
> > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > >
> >
> > I think someone tried getting C code to write the values to global data
> > and getting the asm to read them.
> > That got discounted because it spilt things between two largely unrelated files.
>
>
> The implementation with global variables wasn't that bad, let me revive it.
>
> This part which ties sequence to BHI mitigation, which is not ideal,
> (because VMSCAPE also uses it) it does seems a cleaner option.
>
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
>
> static void __init bhi_update_mitigation(void)
> {
> + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> + bhi_seq_outer_loop = 5;
> + bhi_seq_inner_loop = 5;
> + }
> +
>
> I believe this can be moved to somewhere common to all mitigations.
>
> > I think the BPF code would need significant refactoring to call a C function.
>
> Ya, true. Will use globals and keep clear_bhb_loop() in asm.
While testing this approach, I noticed that syscalls were suffering an 8%
regression on ICX for Native BHI mitigation:
$ perf bench syscall basic -l 100000000
Bisection pointed to the change for using 8-bit registers (al/ah replacing
eax/ecx) as the main contributor to the regression. (Global variables added
a bit, but within noise).
Further digging revealed a strange behavior: using %ah for the inner loop
was causing the regression; interchanging %al and %ah in the loops
(for movb and sub) eliminated the regression.
<clear_bhb_loop_nofence>:
movb bhb_seq_outer_loop(%rip), %al
call 1f
jmp 5f
1: call 2f
.Lret1: RET
2: movb bhb_seq_inner_loop(%rip), %ah
3: jmp 4f
nop
4: sub $1, %ah <---- No regression with %al here
jnz 3b
sub $1, %al
jnz 1b
My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
inner loop, which is executed a greater number of times, is likely making
the difference. A perf profile is needed to confirm this.

Never imagined a register selection could make an 8% difference in
performance! Anyway, will update the patch with this finding.
On Fri, 27 Mar 2026 17:42:56 -0700
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > >
> > > > Yes, you are.
> > > >
> > > > The point is that you don't want to stick those alternative calls inside some
> > > > magic bhb_loop function but hand them in from the outside, as function
> > > > arguments.
> > > >
> > > > Basically what I did.
> > > >
> > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > separate.
> > > >
> > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > >
> > >
> > > I think someone tried getting C code to write the values to global data
> > > and getting the asm to read them.
> > > That got discounted because it spilt things between two largely unrelated files.
> >
> >
> > The implementation with global variables wasn't that bad, let me revive it.
> >
> > This part which ties sequence to BHI mitigation, which is not ideal,
> > (because VMSCAPE also uses it) it does seems a cleaner option.
> >
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> >
> > static void __init bhi_update_mitigation(void)
> > {
> > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > + bhi_seq_outer_loop = 5;
> > + bhi_seq_inner_loop = 5;
> > + }
> > +
> >
> > I believe this can be moved to somewhere common to all mitigations.
> >
> > > I think the BPF code would need significant refactoring to call a C function.
> >
> > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
>
> While testing this approach, I noticed that syscalls were suffering an 8%
> regression on ICX for Native BHI mitigation:
>
> $ perf bench syscall basic -l 100000000
>
> Bisection pointed to the change for using 8-bit registers (al/ah replacing
> eax/ecx) as the main contributor to the regression. (Global variables added
> a bit, but within noise).
>
> Further digging revealed a strange behavior, using %ah for the inner loop
> was causing the regression, interchanging %al and %ah in the loops
> (for movb and sub) eliminated the regression.
>
> <clear_bhb_loop_nofence>:
>
> movb bhb_seq_outer_loop(%rip), %al
>
> call 1f
> jmp 5f
> 1: call 2f
> .Lret1: RET
> 2: movb bhb_seq_inner_loop(%rip), %ah
> 3: jmp 4f
> nop
> 4: sub $1, %ah <---- No regression with %al here
> jnz 3b
> sub $1, %al
> jnz 1b
>
> My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> inner loop, which is executed more number of times is likely making the
> difference. A perf profile is needed to confirm this.
I bet it is also CPU dependent - it is quite likely that there isn't
any special hardware to support partial writes of %ah, so it ends up taking
a slow path (possibly even a microcoded one, to get an 8% regression).
As well as swapping %al <-> %ah try changing the outer loop decrement to
sub $0x100, %ax
since %al is zero that will set the z flag the same.
I've just hacked a test into some test code I've got.
I'm not seeing any unexpected costs on either zen-5 or haswell.
So it may be more subtle.
David
>
> Never imagined a register selection can make an 8% difference in
> performance! Anyways, will update the patch with this finding.
On Sat, Mar 28, 2026 at 10:08:37AM +0000, David Laight wrote:
> On Fri, 27 Mar 2026 17:42:56 -0700
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> > On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > > Borislav Petkov <bp@alien8.de> wrote:
> > > >
> > > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > > >
> > > > > Yes, you are.
> > > > >
> > > > > The point is that you don't want to stick those alternative calls inside some
> > > > > magic bhb_loop function but hand them in from the outside, as function
> > > > > arguments.
> > > > >
> > > > > Basically what I did.
> > > > >
> > > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > > separate.
> > > > >
> > > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > > >
> > > >
> > > > I think someone tried getting C code to write the values to global data
> > > > and getting the asm to read them.
> > > > That got discounted because it spilt things between two largely unrelated files.
> > >
> > >
> > > The implementation with global variables wasn't that bad, let me revive it.
> > >
> > > This part which ties sequence to BHI mitigation, which is not ideal,
> > > (because VMSCAPE also uses it) it does seems a cleaner option.
> > >
> > > --- a/arch/x86/kernel/cpu/bugs.c
> > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> > >
> > > static void __init bhi_update_mitigation(void)
> > > {
> > > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > > + bhi_seq_outer_loop = 5;
> > > + bhi_seq_inner_loop = 5;
> > > + }
> > > +
> > >
> > > I believe this can be moved to somewhere common to all mitigations.
> > >
> > > > I think the BPF code would need significant refactoring to call a C function.
> > >
> > > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
> >
> > While testing this approach, I noticed that syscalls were suffering an 8%
> > regression on ICX for Native BHI mitigation:
> >
> > $ perf bench syscall basic -l 100000000
> >
> > Bisection pointed to the change for using 8-bit registers (al/ah replacing
> > eax/ecx) as the main contributor to the regression. (Global variables added
> > a bit, but within noise).
> >
> > Further digging revealed a strange behavior, using %ah for the inner loop
> > was causing the regression, interchanging %al and %ah in the loops
> > (for movb and sub) eliminated the regression.
> >
> > <clear_bhb_loop_nofence>:
> >
> > movb bhb_seq_outer_loop(%rip), %al
> >
> > call 1f
> > jmp 5f
> > 1: call 2f
> > .Lret1: RET
> > 2: movb bhb_seq_inner_loop(%rip), %ah
> > 3: jmp 4f
> > nop
> > 4: sub $1, %ah <---- No regression with %al here
> > jnz 3b
> > sub $1, %al
> > jnz 1b
> >
> > My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> > inner loop, which is executed more number of times is likely making the
> > difference. A perf profile is needed to confirm this.
>
> I bet it is also CPU dependant - it is quite likely that there isn't
> any special hardware to support partial writes of %ah so it ends up taking
> a slow path (possibly even a microcoded one to get an 8% regression).
Strangely, %ah in the inner loop incurs fewer uops and has fewer branch
misses, yet takes more cycles. Below is the perf data for the sequence on a
Rocket Lake (similar observation on ICX and EMR):
Event                   %al inner        %ah inner        Delta
----------------------  ---------------  ---------------  ----------
cycles                  776,775,020      972,322,384      +25.2%
instructions/cycle      1.23             0.98             -20.3%
branch-misses           4,792,502        560,449          -88.3%
uops_issued.any         768,019,010      696,888,357      -9.3%
time elapsed            0.1627s          0.2048s          +25.9%
Time elapsed directly correlates with the increase in cycles.
> As well as swapping %al <-> %ah try changing the outer loop decrement to
> sub $0x100, %ax
> since %al is zero that will set the z flag the same.
Unfortunately, using "sub $0x100, %ax" (with %al as the inner loop) isn't
better than just using "sub $1, %ah" in the outer loop:
Event                   %al inner        + sub %ax        Delta
----------------------  ---------------  ---------------  ----------
cycles                  776,775,020      813,372,036      +4.7%
instructions/cycle      1.23             1.17             -4.5%
branch-misses           4,792,502        7,610,323        +58.8%
uops_issued.any         768,019,010      827,465,137      +7.7%
time elapsed            0.1627s          0.1707s          +4.9%
> I've just hacked a test into some test code I've got.
> I'm not seeing an unexpected costs on either zen-5 or haswell.
> So it may be more subtle.
This is puzzling, but at least it is evident that using %al for the inner
loop seems to be the best option. In summary:
Variant    Cycles        Uops Issued   Branch Misses
-------    ----------    -----------   -------------
%al        776M          768M          4.8M           (fastest)
%ah        972M (+25%)   697M (-9%)    560K (-88%)    (fewer uops + misses, yet slowest)
sub %ax    813M (+5%)    827M (+8%)    7.6M (+59%)    (most uops + misses)
On Wed, 1 Apr 2026 01:12:36 -0700
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Sat, Mar 28, 2026 at 10:08:37AM +0000, David Laight wrote:
> > On Fri, 27 Mar 2026 17:42:56 -0700
> > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > > On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > > > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > > > Borislav Petkov <bp@alien8.de> wrote:
> > > > >
> > > > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > > > >
> > > > > > Yes, you are.
> > > > > >
> > > > > > The point is that you don't want to stick those alternative calls inside some
> > > > > > magic bhb_loop function but hand them in from the outside, as function
> > > > > > arguments.
> > > > > >
> > > > > > Basically what I did.
> > > > > >
> > > > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > > > separate.
> > > > > >
> > > > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > > > >
> > > > >
> > > > > I think someone tried getting C code to write the values to global data
> > > > > and getting the asm to read them.
> > > > > That got discounted because it spilt things between two largely unrelated files.
> > > >
> > > >
> > > > The implementation with global variables wasn't that bad, let me revive it.
> > > >
> > > > This part which ties sequence to BHI mitigation, which is not ideal,
> > > > (because VMSCAPE also uses it) it does seems a cleaner option.
> > > >
> > > > --- a/arch/x86/kernel/cpu/bugs.c
> > > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> > > >
> > > > static void __init bhi_update_mitigation(void)
> > > > {
> > > > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > > > + bhi_seq_outer_loop = 5;
> > > > + bhi_seq_inner_loop = 5;
> > > > + }
> > > > +
> > > >
> > > > I believe this can be moved to somewhere common to all mitigations.
> > > >
> > > > > I think the BPF code would need significant refactoring to call a C function.
> > > >
> > > > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
> > >
> > > While testing this approach, I noticed that syscalls were suffering an 8%
> > > regression on ICX for Native BHI mitigation:
> > >
> > > $ perf bench syscall basic -l 100000000
> > >
> > > Bisection pointed to the change for using 8-bit registers (al/ah replacing
> > > eax/ecx) as the main contributor to the regression. (Global variables added
> > > a bit, but within noise).
> > >
> > > Further digging revealed a strange behavior, using %ah for the inner loop
> > > was causing the regression, interchanging %al and %ah in the loops
> > > (for movb and sub) eliminated the regression.
> > >
> > > <clear_bhb_loop_nofence>:
> > >
> > > movb bhb_seq_outer_loop(%rip), %al
> > >
> > > call 1f
> > > jmp 5f
> > > 1: call 2f
> > > .Lret1: RET
> > > 2: movb bhb_seq_inner_loop(%rip), %ah
> > > 3: jmp 4f
> > > nop
> > > 4: sub $1, %ah <---- No regression with %al here
> > > jnz 3b
> > > sub $1, %al
> > > jnz 1b
> > >
> > > My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> > > inner loop, which is executed more number of times is likely making the
> > > difference. A perf profile is needed to confirm this.
> >
> > I bet it is also CPU dependant - it is quite likely that there isn't
> > any special hardware to support partial writes of %ah so it ends up taking
> > a slow path (possibly even a microcoded one to get an 8% regression).
>
> Strangely, %ah in the inner loop incurs less uops and has fewer branch
> misses, yet takes more cycles. Below is the perf data for the sequence on a
> Rocket Lake (similar observation on ICX and EMR):
>
> Event                   %al inner        %ah inner        Delta
> ----------------------  ---------------  ---------------  ----------
> cycles                  776,775,020      972,322,384      +25.2%
> instructions/cycle      1.23             0.98             -20.3%
> branch-misses           4,792,502        560,449          -88.3%
> uops_issued.any         768,019,010      696,888,357      -9.3%
> time elapsed            0.1627s          0.2048s          +25.9%
>
> Time elapsed directly correlates with the increase in cycles.
That might be consistent with the %ah accesses (probably writes)
being very slow/synchronising.
So you are getting a full cpu stall instead of speculative execution
of the following instructions - which must include a lot of mis-predicted
branches.
> > As well as swapping %al <-> %ah try changing the outer loop decrement to
> > sub $0x100, %ax
> > since %al is zero that will set the z flag the same.
>
> Unfortunately, using "sub $0x100, %ax"(with %al as inner loop) isn't better
> than just using "sub $1, %ah" in the outer loop:
>
> Event                   %al inner        + sub %ax        Delta
> ----------------------  ---------------  ---------------  ----------
> cycles                  776,775,020      813,372,036      +4.7%
> instructions/cycle      1.23             1.17             -4.5%
> branch-misses           4,792,502        7,610,323        +58.8%
> uops_issued.any         768,019,010      827,465,137      +7.7%
> time elapsed            0.1627s          0.1707s          +4.9%
That is even more interesting.
The 'sub %ax' version has more uops and more branch-misses.
Looks like the extra cost of the %ah access is less than the cost
of the extra mis-predicted branches.
Makes me wonder where a version that uses %cl fits?
(Or use a zero-extending read and %eax/%ecx - likely to be the same.)
I'll bet 'one beer' that is nearest the 'sub %ax' version.
David
>
> > I've just hacked a test into some test code I've got.
> > I'm not seeing an unexpected costs on either zen-5 or haswell.
> > So it may be more subtle.
>
> This is puzzling, but atleast it is evident that using %al for the inner
> loop seems to be the best option. In summary:
>
> Variant    Cycles        Uops Issued   Branch Misses
> -------    ----------    -----------   -------------
> %al        776M          768M          4.8M           (fastest)
> %ah        972M (+25%)   697M (-9%)    560K (-88%)    (fewer uops + misses, yet slowest)
> sub %ax    813M (+5%)    827M (+8%)    7.6M (+59%)    (most uops + misses)
On Wed, Apr 01, 2026 at 10:02:00AM +0100, David Laight wrote:
> > > As well as swapping %al <-> %ah try changing the outer loop decrement to
> > > sub $0x100, %ax
> > > since %al is zero that will set the z flag the same.
> >
> > Unfortunately, using "sub $0x100, %ax"(with %al as inner loop) isn't better
> > than just using "sub $1, %ah" in the outer loop:
> >
> > Event                   %al inner        + sub %ax        Delta
> > ----------------------  ---------------  ---------------  ----------
> > cycles                  776,775,020      813,372,036      +4.7%
> > instructions/cycle      1.23             1.17             -4.5%
> > branch-misses           4,792,502        7,610,323        +58.8%
> > uops_issued.any         768,019,010      827,465,137      +7.7%
> > time elapsed            0.1627s          0.1707s          +4.9%
>
> That is even more interesting.
> The 'sub %ax' version has more uops and more branch-misses.
> Looks like the extra cost of the %ah access is less than the cost
> of the extra mis-predicted branches.
>
> Makes me wonder where a version that uses %cl fits?
> (Or use a zero-extending read and %eax/%ecx - likely to be the same.)
> I'll bet 'one beer' that is nearest the 'sub %ax' version.
%cl didn't make a noticeable difference, but ...
Event                   %al/%ah          %al/%cl          Delta
                        (inner/outer)    (inner/outer)
----------------------  ---------------  ---------------  ----------
cycles                  776,380,149      778,294,183      +0.2%
instructions/cycle      1.23             1.22             -0.4%
branch-misses           4,986,437        5,679,599        +13.9%
uops_issued.any         773,223,387      765,724,878      -1.0%
time elapsed            0.1631s          0.1637s          +0.4%
... there are meaningful gains with 32-bit registers:
Event                   %al/%ah          %eax/%ecx        Delta
                        (inner/outer)    (inner/outer)
----------------------  ---------------  ---------------  ----------
cycles                  776,380,149      706,331,177      -9.0%
instructions/cycle      1.23             1.35             +9.9%
branch-misses           4,986,437        6,089,306        +22.1%
uops_issued.any         773,223,387      774,539,522      +0.2%
time elapsed            0.1631s          0.1482s          -9.1%
These values are for userspace tests with immediates. Next, I will test how
they perform with memory loads in the kernel. Before we finalize, these
uarch nuances need to be tested on a variety of CPUs.
On Thu, 26 Mar 2026 01:39:34 -0700
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:

> On Wed, Mar 25, 2026 at 09:37:59PM +0100, Borislav Petkov wrote:
> > On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:
> > > This is cleaner. A few things to consider are, CLEAR_BRANCH_HISTORY that
> > > calls clear_bhb_loop() would be calling into C code very early during the
> > > kernel entry. The code generated here may vary based on the compiler. Any
> > > indirect branch here would be security risk. This needs to be noinstr so
> > > that it can't be hijacked by probes and ftraces.
> > >
> > > At kernel entry, calling into C before mitigations are applied is risky.
> >
> > You can write the above function in asm if you prefer - should still be
> > easier.
>
> I believe the equivalent for cpu_feature_enabled() in asm is the
> ALTERNATIVE. Please let me know if I am missing something.
>
> Regarding your intent to move the loop count selection out of the BHB
> sequence, below is what I could come up. It is not as pretty as the C
> version, but it is trying to achieve something similar:

I think that fails on being harder to read and longer.
So no real benefit.

I believe this code has to be asm because it is required to execute
specific instructions in a specific order - you can't trust the C
compiler to do that for you.

	David

>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index ecae3cef9d8c..54c65b0a3f65 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1494,6 +1494,20 @@ SYM_CODE_START_NOALIGN(rewind_stack_and_make_dead)
>  SYM_CODE_END(rewind_stack_and_make_dead)
>  .popsection
>
> +/*
> + * Between the long and short version of BHB clear sequence, just the
> + * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
> + */
> +#define BHB_SHORT_LOOP_OUTER	5
> +#define BHB_SHORT_LOOP_INNER	5
> +
> +#define BHB_LONG_LOOP_OUTER	12
> +#define BHB_LONG_LOOP_INNER	7
> +
> +#define BHB_MOVB(type, reg)						\
> +	ALTERNATIVE __stringify(movb $BHB_SHORT_LOOP_##type, reg),	\
> +		    __stringify(movb $BHB_LONG_LOOP_##type, reg), X86_FEATURE_BHI_CTRL
> +
>  /*
>   * This sequence executes branches in order to remove user branch information
>   * from the branch history tracker in the Branch Predictor, therefore removing
> @@ -1540,12 +1554,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
> 	/* BPF caller may require all registers to be preserved */
> 	push %rax
>
> -	/*
> -	 * Between the long and short version of BHB clear sequence, just the
> -	 * loop count differs based on BHI_CTRL, see Intel's BHI guidance.
> -	 */
> -	ALTERNATIVE "movb $5, %al", \
> -		    "movb $12, %al", X86_FEATURE_BHI_CTRL
> +	BHB_MOVB(OUTER, %al)
>
> 	ANNOTATE_INTRA_FUNCTION_CALL
> 	call	1f
> 	jmp	5f
> @@ -1567,8 +1576,7 @@ SYM_FUNC_START(clear_bhb_loop_nofence)
> 	 * but some Clang versions (e.g. 18) don't like this.
> 	 */
> 	.skip 32 - 14, 0xcc
> -2:	ALTERNATIVE "movb $5, %ah", \
> -		    "movb $7, %ah", X86_FEATURE_BHI_CTRL
> +2:	BHB_MOVB(INNER, %ah)
> 3:	jmp	4f
> 	nop
> 4:	sub	$1, %ah
>
>
> Below is how the disassembly looks like:
>
> clear_bhb_loop_nofence:
> ...
> 	call 1f
> 	jmp 5f
> 	// BHB_MOVB(OUTER, %al)
> 	mov $0x5,%al
On Wed, 25 Mar 2026 21:37:59 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Mar 24, 2026 at 03:13:08PM -0700, Pawan Gupta wrote:

...

> > Although call to clear_bhb_loop() will be inserted at the end of the BPF
> > program before it returns, I am not sure if it is safe to assume that
> > trashing registers in the path clear_bhb_loop() -> __clear_bhb_loop() is
> > okay? Especially, when we don't know what code compiler generated for
> > clear_bhb_loop(). BPF experts would know better?
>
> The compiler would preserve the regs. If you write it in asm and you adhere to
> the C ABI, you could preserve them too. Shouldn't be too many.

The BPF code that calls it doesn't use the C ABI - it just puts
a call instruction in the code it generates.
Hence all registers must be preserved.

	David

>
> Thx.
>