As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
the Branch History Buffer (BHB). On Alder Lake and newer parts this
sequence is not sufficient because it doesn't clear enough entries. This
was not an issue because these CPUs have a hardware control (BHI_DIS_S)
that mitigates BHI in the kernel.

The BHI variant of VMSCAPE requires isolating branch history between guests
and userspace. Note that there is no equivalent hardware control for
userspace. To effectively isolate branch history on newer CPUs,
clear_bhb_loop() should execute a sufficient number of branches to clear a
larger BHB.

Dynamically set the loop count of clear_bhb_loop() such that it is
effective on newer CPUs too. Use the hardware control enumeration
X86_FEATURE_BHI_CTRL to select the appropriate loop count.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
mov %rsp, %rbp
- movl $5, %ecx
+
+ /* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
+ ALTERNATIVE "movl $5, %ecx; movl $5, %edx", \
+ "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
+
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
jmp 5f
@@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
* but some Clang versions (e.g. 18) don't like this.
*/
.skip 32 - 18, 0xcc
-2: movl $5, %eax
+2: movl %edx, %eax
3: jmp 4f
nop
4: sub $1, %eax
--
2.34.1
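
For readers following the arithmetic, here is a rough C model of how many
branch instructions the two nested loops execute for each pair of counts in
the patch above. It is only a sketch: the function names are hypothetical,
it assumes the loop body is laid out as in upstream entry_64.S (label 1 is
"call 2f", which the hunks do not show), and it counts executed branch
instructions rather than anything about the BHB size of a particular CPU.

```c
#include <assert.h>

/*
 * Hypothetical model, not kernel code: count the branch instructions
 * executed inside the two nested loops of clear_bhb_loop().
 *
 * Per outer iteration: one "call 2f", then per inner iteration one
 * "jmp 4f" and one "jnz 3b", then one "jnz 1b".
 */
unsigned int bhb_loop_branches(unsigned int outer, unsigned int inner)
{
	assert(outer > 0 && inner > 0);
	return outer * (1 + 2 * inner + 1);
}

/* Loop counts from the patch, selected by X86_FEATURE_BHI_CTRL. */
unsigned int bhb_loop_branches_for(int has_bhi_ctrl)
{
	return has_bhi_ctrl ? bhb_loop_branches(12, 7)
			    : bhb_loop_branches(5, 5);
}
```

Under this counting convention the 5/5 pair executes 60 loop branches and
the 12/7 pair executes 192; the call into the loop and the RETs on the way
out are deliberately left out.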
On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> the Branch History Buffer (BHB). On Alder Lake and newer parts this
> sequence is not sufficient because it doesn't clear enough entries. This
> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> that mitigates BHI in kernel.
>
> BHI variant of VMSCAPE requires isolating branch history between guests and
> userspace. Note that there is no equivalent hardware control for userspace.
> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> should execute sufficient number of branches to clear a larger BHB.
>
> Dynamically set the loop count of clear_bhb_loop() such that it is
> effective on newer CPUs too. Use the hardware control enumeration
> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
have global variables for the loop counts I haven't really seen the
code so I couldn't have given my RB on something which I haven't seen
but did agree with in principle.

Now that I have seen the code I'm willing to give my:

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>

> ---
> arch/x86/entry/entry_64.S | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
> ANNOTATE_NOENDBR
> push %rbp
> mov %rsp, %rbp
> - movl $5, %ecx
> +
> + /* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> + ALTERNATIVE "movl $5, %ecx; movl $5, %edx", \
> + "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL

nit: Just

> +
> ANNOTATE_INTRA_FUNCTION_CALL
> call 1f
> jmp 5f
> @@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
> * but some Clang versions (e.g. 18) don't like this.
> */
> .skip 32 - 18, 0xcc
> -2: movl $5, %eax
> +2: movl %edx, %eax
> 3: jmp 4f
> nop
> 4: sub $1, %eax
>
On Wed, Dec 10, 2025 at 02:31:31PM +0200, Nikolay Borisov wrote:
>
>
> On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > that mitigates BHI in kernel.
> >
> > BHI variant of VMSCAPE requires isolating branch history between guests and
> > userspace. Note that there is no equivalent hardware control for userspace.
> > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > should execute sufficient number of branches to clear a larger BHB.
> >
> > Dynamically set the loop count of clear_bhb_loop() such that it is
> > effective on newer CPUs too. Use the hardware control enumeration
> > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> >
> > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>
> nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
> have global variables for the loop counts I haven't' really seen the code so
> I couldn't have given my RB on something which I haven't seen but did agree
> with in principle.

The tag got applied from v4, but yes the patch got updated since:

https://lore.kernel.org/all/8b657ef2-d9a7-4424-987d-111beb477727@suse.com/

> Now that I have seen the code I'm willing to give my :
>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>

Thanks.

> > ---
> > arch/x86/entry/entry_64.S | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
> > ANNOTATE_NOENDBR
> > push %rbp
> > mov %rsp, %rbp
> > - movl $5, %ecx
> > +
> > + /* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> > + ALTERNATIVE "movl $5, %ecx; movl $5, %edx", \
> > + "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
>
> nit: Just

Will do:

/* Just loop count differs based on BHI_CTRL, see Intel's BHI guidance */

> > +
> > ANNOTATE_INTRA_FUNCTION_CALL
> > call 1f
> > jmp 5f
> > @@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
> > * but some Clang versions (e.g. 18) don't like this.
> > */
> > .skip 32 - 18, 0xcc
> > -2: movl $5, %eax
> > +2: movl %edx, %eax
> > 3: jmp 4f
> > nop
> > 4: sub $1, %eax
> >
>
On Wed, 10 Dec 2025 14:31:31 +0200
Nikolay Borisov <nik.borisov@suse.com> wrote:
> On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > that mitigates BHI in kernel.
> >
> > BHI variant of VMSCAPE requires isolating branch history between guests and
> > userspace. Note that there is no equivalent hardware control for userspace.
> > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > should execute sufficient number of branches to clear a larger BHB.
> >
> > Dynamically set the loop count of clear_bhb_loop() such that it is
> > effective on newer CPUs too. Use the hardware control enumeration
> > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> >
> > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>
> nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
> have global variables for the loop counts I haven't' really seen the
> code so I couldn't have given my RB on something which I haven't seen
> but did agree with in principle.
I thought the plan was to use global variables rather than ALTERNATIVE.
The performance of this code is dominated by the loop.
I also found this code in arch/x86/net/bpf_jit_comp.c:
if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
/* The clearing sequence clobbers eax and ecx. */
EMIT1(0x50); /* push rax */
EMIT1(0x51); /* push rcx */
ip += 2;
func = (u8 *)clear_bhb_loop;
ip += x86_call_depth_emit_accounting(&prog, func, ip);
if (emit_call(&prog, func, ip))
return -EINVAL;
EMIT1(0x59); /* pop rcx */
EMIT1(0x58); /* pop rax */
}
which appears to assume that only rax and rcx are changed.
Since all the counts are small, there is nothing stopping the code
using the 8-bit registers %al, %ah, %cl and %ch.
There are probably some schemes that only need one register.
eg two separate ALTERNATIVE blocks.
David
>
> Now that I have seen the code I'm willing to give my :
>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > ---
> > arch/x86/entry/entry_64.S | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
> > ANNOTATE_NOENDBR
> > push %rbp
> > mov %rsp, %rbp
> > - movl $5, %ecx
> > +
> > + /* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> > + ALTERNATIVE "movl $5, %ecx; movl $5, %edx", \
> > + "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
>
> nit: Just
>
> > +
> > ANNOTATE_INTRA_FUNCTION_CALL
> > call 1f
> > jmp 5f
> > @@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
> > * but some Clang versions (e.g. 18) don't like this.
> > */
> > .skip 32 - 18, 0xcc
> > -2: movl $5, %eax
> > +2: movl %edx, %eax
> > 3: jmp 4f
> > nop
> > 4: sub $1, %eax
> >
>
>
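
To make the clobber concern above concrete, here is a small C sketch of the
byte sequence the quoted JIT snippet produces around the call. It is not the
kernel's code: emit1(), emit_call_rel32() and emit_bhb_barrier() are
hypothetical stand-ins, and call-depth accounting and the LFENCE are left
out. The opcode bytes (0x50/0x51 for push rax/rcx, 0x59/0x58 for pop
rcx/rax, 0xE8 for call rel32) are the standard x86-64 encodings. Only rax
and rcx are saved here, so a clearing sequence that also writes edx would
clobber a register the caller does not expect to lose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the JIT's EMIT1(): append one byte. */
size_t emit1(uint8_t *buf, size_t len, uint8_t byte)
{
	assert(buf != NULL);
	buf[len] = byte;
	return len + 1;
}

/*
 * Emit "call rel32" (opcode 0xE8). The displacement is relative to the
 * address of the next instruction, i.e. ip + 5.
 */
size_t emit_call_rel32(uint8_t *buf, size_t len, uint64_t ip, uint64_t target)
{
	int32_t disp = (int32_t)(target - (ip + 5));

	len = emit1(buf, len, 0xE8);
	memcpy(buf + len, &disp, sizeof(disp));	/* little-endian host assumed */
	return len + sizeof(disp);
}

/* Save rax/rcx, call the clearing sequence, restore in reverse order. */
size_t emit_bhb_barrier(uint8_t *buf, uint64_t ip, uint64_t clear_fn)
{
	size_t len = 0;

	len = emit1(buf, len, 0x50);	/* push rax */
	len = emit1(buf, len, 0x51);	/* push rcx */
	ip += 2;			/* the call sits 2 bytes after the pushes */
	len = emit_call_rel32(buf, len, ip, clear_fn);
	len = emit1(buf, len, 0x59);	/* pop rcx */
	len = emit1(buf, len, 0x58);	/* pop rax */
	return len;
}
```

The "ip += 2" mirrors the quoted snippet: the two one-byte pushes move the
call two bytes further on, and the relative displacement has to be computed
from the call's real address.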
On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:
> On Wed, 10 Dec 2025 14:31:31 +0200
> Nikolay Borisov <nik.borisov@suse.com> wrote:
>
> > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > sequence is not sufficient because it doesn't clear enough entries. This
> > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > that mitigates BHI in kernel.
> > >
> > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > userspace. Note that there is no equivalent hardware control for userspace.
> > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > should execute sufficient number of branches to clear a larger BHB.
> > >
> > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > effective on newer CPUs too. Use the hardware control enumeration
> > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > >
> > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> >
> > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
> > have global variables for the loop counts I haven't' really seen the
> > code so I couldn't have given my RB on something which I haven't seen
> > but did agree with in principle.
>
> I thought the plan was to use global variables rather than ALTERNATIVE.
> The performance of this code is dominated by the loop.
Using globals was much more involved, requiring changes in at least 3 files.
The current ALTERNATIVE approach is much simpler and avoids additional
handling to make sure that globals are set correctly for all mitigation
modes of BHI and VMSCAPE.
[ BTW, I am travelling on a vacation and will be intermittently checking my
emails. ]
> I also found this code in arch/x86/net/bpf_jit_comp.c:
> if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> /* The clearing sequence clobbers eax and ecx. */
> EMIT1(0x50); /* push rax */
> EMIT1(0x51); /* push rcx */
> ip += 2;
>
> func = (u8 *)clear_bhb_loop;
> ip += x86_call_depth_emit_accounting(&prog, func, ip);
>
> if (emit_call(&prog, func, ip))
> return -EINVAL;
> EMIT1(0x59); /* pop rcx */
> EMIT1(0x58); /* pop rax */
> }
> which appears to assume that only rax and rcx are changed.
> Since all the counts are small, there is nothing stopping the code
> using the 8-bit registers %al, %ah, %cl and %ch.
Thanks for catching this.
> There are probably some schemes that only need one register.
> eg two separate ALTERNATIVE blocks.
Also, I think it is better to use a callee-saved register like rbx to avoid
callers having to save/restore registers. Something like below:
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9f6f4a7c5baf..ca4a34ce314a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1535,11 +1535,12 @@ SYM_CODE_END(rewind_stack_and_make_dead)
SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
+ push %rbx
mov %rsp, %rbp
/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
- ALTERNATIVE "movl $5, %ecx; movl $5, %edx", \
- "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
+ ALTERNATIVE "movb $5, %bl", \
+ "movb $12, %bl", X86_FEATURE_BHI_CTRL
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
@@ -1561,15 +1562,17 @@ SYM_FUNC_START(clear_bhb_loop)
* but some Clang versions (e.g. 18) don't like this.
*/
.skip 32 - 18, 0xcc
-2: movl %edx, %eax
+2: ALTERNATIVE "movb $5, %bh", \
+ "movb $7, %bh", X86_FEATURE_BHI_CTRL
3: jmp 4f
nop
-4: sub $1, %eax
+4: sub $1, %bh
jnz 3b
- sub $1, %ecx
+ sub $1, %bl
jnz 1b
.Lret2: RET
5:
+ pop %rbx
pop %rbp
RET
SYM_FUNC_END(clear_bhb_loop)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index c1ec14c55911..823b3f613774 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1593,11 +1593,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
u8 *func;
if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
- /* The clearing sequence clobbers eax and ecx. */
- EMIT1(0x50); /* push rax */
- EMIT1(0x51); /* push rcx */
- ip += 2;
-
func = (u8 *)clear_bhb_loop;
ip += x86_call_depth_emit_accounting(&prog, func, ip);
@@ -1605,8 +1600,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
return -EINVAL;
/* Don't speculate past this until BHB is cleared */
EMIT_LFENCE();
- EMIT1(0x59); /* pop rcx */
- EMIT1(0x58); /* pop rax */
}
/* Insert IBHF instruction */
if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&
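
As a sanity check on the single-register idea in the diff above, here is a
C model of two 8-bit counters living in the low and high byte of one 16-bit
value, the way %bl and %bh share %bx. The struct and helper names are
hypothetical and this is only a sketch of the countdown logic, not of the
branch behaviour; it shows that reloading and decrementing the high byte
never disturbs the outer count held in the low byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: %bl is the low byte of bx, %bh the high byte. */
struct bx_model {
	uint16_t bx;
	unsigned int inner_iterations;
};

uint8_t get_bl(uint16_t bx) { return (uint8_t)(bx & 0xff); }
uint8_t get_bh(uint16_t bx) { return (uint8_t)(bx >> 8); }

uint16_t set_bl(uint16_t bx, uint8_t v)
{
	return (uint16_t)((bx & 0xff00) | v);
}

uint16_t set_bh(uint16_t bx, uint8_t v)
{
	return (uint16_t)((bx & 0x00ff) | ((uint16_t)v << 8));
}

struct bx_model run_counters(uint8_t outer, uint8_t inner)
{
	struct bx_model m = { 0, 0 };

	assert(outer > 0 && inner > 0);
	m.bx = set_bl(m.bx, outer);			/* movb $outer, %bl */
	do {
		m.bx = set_bh(m.bx, inner);		/* 2: movb $inner, %bh */
		do {
			/* 4: sub $1, %bh */
			m.bx = set_bh(m.bx, (uint8_t)(get_bh(m.bx) - 1));
			m.inner_iterations++;
		} while (get_bh(m.bx) != 0);		/* jnz 3b */
		/* sub $1, %bl */
		m.bx = set_bl(m.bx, (uint8_t)(get_bl(m.bx) - 1));
	} while (get_bl(m.bx) != 0);			/* jnz 1b */
	return m;
}
```

With the counts from the diff, the model runs the inner body 25 times for
5/5 and 84 times for 12/7, and both bytes end at zero.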
On Sun, 14 Dec 2025 10:38:27 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:
> > On Wed, 10 Dec 2025 14:31:31 +0200
> > Nikolay Borisov <nik.borisov@suse.com> wrote:
> >
> > > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > that mitigates BHI in kernel.
> > > >
> > > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > > userspace. Note that there is no equivalent hardware control for userspace.
> > > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > > should execute sufficient number of branches to clear a larger BHB.
> > > >
> > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > >
> > > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > >
> > > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
> > > have global variables for the loop counts I haven't' really seen the
> > > code so I couldn't have given my RB on something which I haven't seen
> > > but did agree with in principle.
> >
> > I thought the plan was to use global variables rather than ALTERNATIVE.
> > The performance of this code is dominated by the loop.
>
> Using globals was much more involved, requiring changes in atleast 3 files.
> The current ALTERNATIVE approach is much simpler and avoids additional
> handling to make sure that globals are set correctly for all mitigation
> modes of BHI and VMSCAPE.
>
> [ BTW, I am travelling on a vacation and will be intermittently checking my
> emails. ]
>
> > I also found this code in arch/x86/net/bpf_jit_comp.c:
> > if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> > /* The clearing sequence clobbers eax and ecx. */
> > EMIT1(0x50); /* push rax */
> > EMIT1(0x51); /* push rcx */
> > ip += 2;
> >
> > func = (u8 *)clear_bhb_loop;
> > ip += x86_call_depth_emit_accounting(&prog, func, ip);
> >
> > if (emit_call(&prog, func, ip))
> > return -EINVAL;
> > EMIT1(0x59); /* pop rcx */
> > EMIT1(0x58); /* pop rax */
> > }
> > which appears to assume that only rax and rcx are changed.
> > Since all the counts are small, there is nothing stopping the code
> > using the 8-bit registers %al, %ah, %cl and %ch.
>
> Thanks for catching this.
I was trying to find where it was called from.
Failed to find the one on system call entry...
> > There are probably some schemes that only need one register.
> > eg two separate ALTERNATIVE blocks.
>
> Also, I think it is better to use a callee-saved register like rbx to avoid
> callers having to save/restore registers. Something like below:
I'm not sure.
%ax is the return value so can be 'trashed' by a normal function call.
But if the bpf code is saving %ax then it isn't expecting a normal call.
OTOH if you are going to save the register in clear_bhb_loop you might
as well use %ax to get the slightly shorter instructions for %al.
(I think 'movb' comes out shorter - as if it really matters.)
Definitely worth a comment that it must save all registers.
I also wonder if it needs to set up a stack frame?
Again, the code is so slow it won't matter.
David
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 9f6f4a7c5baf..ca4a34ce314a 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1535,11 +1535,12 @@ SYM_CODE_END(rewind_stack_and_make_dead)
> SYM_FUNC_START(clear_bhb_loop)
> ANNOTATE_NOENDBR
> push %rbp
> + push %rbx
> mov %rsp, %rbp
>
> /* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> - ALTERNATIVE "movl $5, %ecx; movl $5, %edx", \
> - "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
> + ALTERNATIVE "movb $5, %bl", \
> + "movb $12, %bl", X86_FEATURE_BHI_CTRL
>
> ANNOTATE_INTRA_FUNCTION_CALL
> call 1f
> @@ -1561,15 +1562,17 @@ SYM_FUNC_START(clear_bhb_loop)
> * but some Clang versions (e.g. 18) don't like this.
> */
> .skip 32 - 18, 0xcc
> -2: movl %edx, %eax
> +2: ALTERNATIVE "movb $5, %bh", \
> + "movb $7, %bh", X86_FEATURE_BHI_CTRL
> 3: jmp 4f
> nop
> -4: sub $1, %eax
> +4: sub $1, %bh
> jnz 3b
> - sub $1, %ecx
> + sub $1, %bl
> jnz 1b
> .Lret2: RET
> 5:
> + pop %rbx
> pop %rbp
> RET
> SYM_FUNC_END(clear_bhb_loop)
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index c1ec14c55911..823b3f613774 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1593,11 +1593,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
> u8 *func;
>
> if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> - /* The clearing sequence clobbers eax and ecx. */
> - EMIT1(0x50); /* push rax */
> - EMIT1(0x51); /* push rcx */
> - ip += 2;
> -
> func = (u8 *)clear_bhb_loop;
> ip += x86_call_depth_emit_accounting(&prog, func, ip);
>
> @@ -1605,8 +1600,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
> return -EINVAL;
> /* Don't speculate past this until BHB is cleared */
> EMIT_LFENCE();
> - EMIT1(0x59); /* pop rcx */
> - EMIT1(0x58); /* pop rax */
> }
> /* Insert IBHF instruction */
> if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&
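
On the instruction-length remark above ("slightly shorter instructions for
%al"), the following C table lists the encodings I would expect an
assembler to pick for the counter instructions under discussion. The table
and the insn_len() helper are illustrative, not taken from the thread; the
bytes follow the standard x86-64 opcode map (B0+r/B8+r for mov immediate,
2C for the short "sub imm8, %al" form, 80 /5 and 83 /5 otherwise) and are
worth confirming with objdump on a real build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct insn_encoding {
	const char *text;	/* AT&T syntax, as written in the thread */
	uint8_t bytes[5];
	size_t len;
};

/* Expected encodings; an assumption to be checked against objdump output. */
const struct insn_encoding bhb_counter_insns[] = {
	{ "movl $5, %eax", { 0xb8, 0x05, 0x00, 0x00, 0x00 }, 5 },
	{ "movb $5, %al",  { 0xb0, 0x05 }, 2 },
	{ "movb $5, %bl",  { 0xb3, 0x05 }, 2 },
	{ "movb $5, %bh",  { 0xb7, 0x05 }, 2 },
	{ "sub $1, %eax",  { 0x83, 0xe8, 0x01 }, 3 },
	{ "sub $1, %al",   { 0x2c, 0x01 }, 2 },
	{ "sub $1, %bl",   { 0x80, 0xeb, 0x01 }, 3 },
	{ "sub $1, %bh",   { 0x80, 0xef, 0x01 }, 3 },
};

/* Return the encoded length of a listed instruction, or 0 if unknown. */
size_t insn_len(const char *text)
{
	size_t i;
	size_t n = sizeof(bhb_counter_insns) / sizeof(bhb_counter_insns[0]);

	assert(text != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(bhb_counter_insns[i].text, text) == 0)
			return bhb_counter_insns[i].len;
	}
	return 0;
}
```

If these encodings hold, the byte-sized mov saves three bytes over movl
regardless of the register, while only %al gets the two-byte sub; %bl and
%bh need three bytes, the same as "sub $1, %eax".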
On Sun, Dec 14, 2025 at 07:02:33PM +0000, David Laight wrote:
> On Sun, 14 Dec 2025 10:38:27 -0800
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> > On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:
> > > On Wed, 10 Dec 2025 14:31:31 +0200
> > > Nikolay Borisov <nik.borisov@suse.com> wrote:
> > >
> > > > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > > that mitigates BHI in kernel.
> > > > >
> > > > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > > > userspace. Note that there is no equivalent hardware control for userspace.
> > > > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > > > should execute sufficient number of branches to clear a larger BHB.
> > > > >
> > > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > > >
> > > > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > > >
> > > > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
> > > > have global variables for the loop counts I haven't' really seen the
> > > > code so I couldn't have given my RB on something which I haven't seen
> > > > but did agree with in principle.
> > >
> > > I thought the plan was to use global variables rather than ALTERNATIVE.
> > > The performance of this code is dominated by the loop.
> >
> > Using globals was much more involved, requiring changes in atleast 3 files.
> > The current ALTERNATIVE approach is much simpler and avoids additional
> > handling to make sure that globals are set correctly for all mitigation
> > modes of BHI and VMSCAPE.
> >
> > [ BTW, I am travelling on a vacation and will be intermittently checking my
> > emails. ]
> >
> > > I also found this code in arch/x86/net/bpf_jit_comp.c:
> > > if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> > > /* The clearing sequence clobbers eax and ecx. */
> > > EMIT1(0x50); /* push rax */
> > > EMIT1(0x51); /* push rcx */
> > > ip += 2;
> > >
> > > func = (u8 *)clear_bhb_loop;
> > > ip += x86_call_depth_emit_accounting(&prog, func, ip);
> > >
> > > if (emit_call(&prog, func, ip))
> > > return -EINVAL;
> > > EMIT1(0x59); /* pop rcx */
> > > EMIT1(0x58); /* pop rax */
> > > }
> > > which appears to assume that only rax and rcx are changed.
> > > Since all the counts are small, there is nothing stopping the code
> > > using the 8-bit registers %al, %ah, %cl and %ch.
> >
> > Thanks for catching this.
>
> I was trying to find where it was called from.
> Failed to find the one on system call entry...
The macro CLEAR_BRANCH_HISTORY calls clear_bhb_loop() at system call entry.
> > > There are probably some schemes that only need one register.
> > > eg two separate ALTERNATIVE blocks.
> >
> > Also, I think it is better to use a callee-saved register like rbx to avoid
> > callers having to save/restore registers. Something like below:
>
> I'm not sure.
> %ax is the return value so can be 'trashed' by a normal function call.
> But if the bpf code is saving %ax then it isn't expecting a normal call.
The BHB clear sequence is executed at the end of the BPF JITted code, and
%rax is likely the return value of the BPF program. So, saving/restoring
%rax around the sequence makes sense to me.
> OTOH if you are going to save the register in clear_bhb_loop you might
> as well use %ax to get the slightly shorter instructions for %al.
> (I think 'movb' comes out shorter - as if it really matters.)
%rbx is a callee-saved register so it felt more intuitive to save/restore
it in clear_bhb_loop(). But, I can use %ax if you feel strongly.
> Definitely worth a comment that it must save all resisters.
Yes, will add a comment.
> I also wonder if it needs to setup a stack frame?
I don't know if that's necessary, objtool doesn't complain because
clear_bhb_loop() is marked STACK_FRAME_NON_STANDARD.
> Again, the code is so slow it won't matter.
>
> David
On Mon, 15 Dec 2025 10:01:36 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Sun, Dec 14, 2025 at 07:02:33PM +0000, David Laight wrote:
> > On Sun, 14 Dec 2025 10:38:27 -0800
> > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > > On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:
> > > > On Wed, 10 Dec 2025 14:31:31 +0200
> > > > Nikolay Borisov <nik.borisov@suse.com> wrote:
> > > >
> > > > > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > > > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > > > that mitigates BHI in kernel.
> > > > > >
> > > > > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > > > > userspace. Note that there is no equivalent hardware control for userspace.
> > > > > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > > > > should execute sufficient number of branches to clear a larger BHB.
> > > > > >
> > > > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > > > >
> > > > > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > > > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > > > >
> > > > > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
> > > > > have global variables for the loop counts I haven't' really seen the
> > > > > code so I couldn't have given my RB on something which I haven't seen
> > > > > but did agree with in principle.
> > > >
> > > > I thought the plan was to use global variables rather than ALTERNATIVE.
> > > > The performance of this code is dominated by the loop.
> > >
> > > Using globals was much more involved, requiring changes in atleast 3 files.
> > > The current ALTERNATIVE approach is much simpler and avoids additional
> > > handling to make sure that globals are set correctly for all mitigation
> > > modes of BHI and VMSCAPE.
> > >
> > > [ BTW, I am travelling on a vacation and will be intermittently checking my
> > > emails. ]
> > >
> > > > I also found this code in arch/x86/net/bpf_jit_comp.c:
> > > > if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> > > > /* The clearing sequence clobbers eax and ecx. */
> > > > EMIT1(0x50); /* push rax */
> > > > EMIT1(0x51); /* push rcx */
> > > > ip += 2;
> > > >
> > > > func = (u8 *)clear_bhb_loop;
> > > > ip += x86_call_depth_emit_accounting(&prog, func, ip);
> > > >
> > > > if (emit_call(&prog, func, ip))
> > > > return -EINVAL;
> > > > EMIT1(0x59); /* pop rcx */
> > > > EMIT1(0x58); /* pop rax */
> > > > }
> > > > which appears to assume that only rax and rcx are changed.
> > > > Since all the counts are small, there is nothing stopping the code
> > > > using the 8-bit registers %al, %ah, %cl and %ch.
> > >
> > > Thanks for catching this.
> >
> > I was trying to find where it was called from.
> > Failed to find the one on system call entry...
>
> The macro CLEAR_BRANCH_HISTORY calls clear_bhb_loop() at system call entry.
I didn't look very hard :-)
>
> > > > There are probably some schemes that only need one register.
> > > > eg two separate ALTERNATIVE blocks.
> > >
> > > Also, I think it is better to use a callee-saved register like rbx to avoid
> > > callers having to save/restore registers. Something like below:
> >
> > I'm not sure.
> > %ax is the return value so can be 'trashed' by a normal function call.
> > But if the bpf code is saving %ax then it isn't expecting a normal call.
>
> BHB clear sequence is executed at the end of the BPF JITted code, and %rax
> is likely the return value of the BPF program. So, saving/restoring %rax
> around the sequence makes sense to me.
>
> > OTOH if you are going to save the register in clear_bhb_loop you might
> > as well use %ax to get the slightly shorter instructions for %al.
> > (I think 'movb' comes out shorter - as if it really matters.)
>
> %rbx is a callee-saved register so it felt more intuitive to save/restore
> it in clear_bhb_loop(). But, I can use %ax if you feel strongly.
If you are going to save a register it might as well be %ax.
Otherwise someone will wonder why you picked a different one.
>
> > Definitely worth a comment that it must save all resisters.
>
> Yes, will add a comment.
>
> > I also wonder if it needs to setup a stack frame?
>
> I don't know if thats necessary, objtool doesn't complain because
> clear_bhb_loop() is marked STACK_FRAME_NON_STANDARD.
In some senses it is a leaf function - and the compiler doesn't create
stack frames for those (by default).
Provided objtool isn't confused by all the call instructions it probably
doesn't matter.
David
>
> > Again, the code is so slow it won't matter.
> >
> > David
On 10.12.25 г. 15:35 ч., David Laight wrote:
> On Wed, 10 Dec 2025 14:31:31 +0200
> Nikolay Borisov <nik.borisov@suse.com> wrote:
>
>> On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
>>> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
>>> the Branch History Buffer (BHB). On Alder Lake and newer parts this
>>> sequence is not sufficient because it doesn't clear enough entries. This
>>> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
>>> that mitigates BHI in kernel.
>>>
>>> BHI variant of VMSCAPE requires isolating branch history between guests and
>>> userspace. Note that there is no equivalent hardware control for userspace.
>>> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
>>> should execute sufficient number of branches to clear a larger BHB.
>>>
>>> Dynamically set the loop count of clear_bhb_loop() such that it is
>>> effective on newer CPUs too. Use the hardware control enumeration
>>> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>>>
>>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>>> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
>>> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>>
>> nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
>> have global variables for the loop counts I haven't' really seen the
>> code so I couldn't have given my RB on something which I haven't seen
>> but did agree with in principle.
>
> I thought the plan was to use global variables rather than ALTERNATIVE.
> The performance of this code is dominated by the loop.

Generally yes and I was on the verge of calling this out, however what
stopped me is the fact that the global variables are going to be set
"somewhere else" whilst with the current approach everything is contained
within the clear_bhb_loop function. Both ways have their merit but I don't
want to endlessly bikeshed.

<snip>