[v6] VMSCAPE optimization for BHI variant

[PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by Pawan Gupta 2 months, 1 week ago

As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
the Branch History Buffer (BHB). On Alder Lake and newer parts this
sequence is not sufficient because it doesn't clear enough entries. This
was not an issue because these CPUs have a hardware control (BHI_DIS_S)
that mitigates BHI in the kernel.

The BHI variant of VMSCAPE requires isolating branch history between guests
and userspace. Note that there is no equivalent hardware control for
userspace. To effectively isolate branch history on newer CPUs,
clear_bhb_loop() should execute a sufficient number of branches to clear a
larger BHB.

Dynamically set the loop count of clear_bhb_loop() such that it is
effective on newer CPUs too. Use the hardware control enumeration
X86_FEATURE_BHI_CTRL to select the appropriate loop count.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry_64.S | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
 	ANNOTATE_NOENDBR
 	push	%rbp
 	mov	%rsp, %rbp
-	movl	$5, %ecx
+
+	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
+	ALTERNATIVE "movl $5,  %ecx; movl $5, %edx",	\
+		    "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
+
 	ANNOTATE_INTRA_FUNCTION_CALL
 	call	1f
 	jmp	5f
@@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
 	 * but some Clang versions (e.g. 18) don't like this.
 	 */
 	.skip 32 - 18, 0xcc
-2:	movl	$5, %eax
+2:	movl	%edx, %eax
 3:	jmp	4f
 	nop
 4:	sub	$1, %eax

-- 
2.34.1
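
The count selection described in the commit message can be modelled in plain
userspace C. This is only a sketch, not kernel code: the struct and both
function names are hypothetical, and the has_bhi_ctrl flag stands in for the
X86_FEATURE_BHI_CTRL check that ALTERNATIVE patches in at boot. The 5/5 and
12/7 values are taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the two loop counts; not a kernel type. */
struct bhb_loop_counts {
	unsigned int outer;	/* value the patch loads into %ecx */
	unsigned int inner;	/* value the patch loads into %edx */
};

/* Mirrors the ALTERNATIVE in the patch: 12/7 with BHI_CTRL, else 5/5. */
static struct bhb_loop_counts pick_bhb_loop_counts(bool has_bhi_ctrl)
{
	struct bhb_loop_counts c = { 5, 5 };

	if (has_bhi_ctrl) {
		c.outer = 12;
		c.inner = 7;
	}
	return c;
}

/*
 * Count how many passes the inner countdown makes in total. The inner
 * counter is reloaded on every outer pass, as label 2 does in the asm.
 */
static unsigned int count_inner_passes(struct bhb_loop_counts c)
{
	unsigned int passes = 0;

	assert(c.outer > 0 && c.inner > 0);
	for (unsigned int o = c.outer; o != 0; o--)
		for (unsigned int i = c.inner; i != 0; i--)
			passes++;
	return passes;
}
```

In this model the inner countdown runs 5 * 5 = 25 times without BHI_CTRL and
12 * 7 = 84 times with it, which is the sense in which the longer sequence
executes more branches.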

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by Borislav Petkov 2 weeks, 1 day ago

On Mon, Dec 01, 2025 at 10:19:14PM -0800, Pawan Gupta wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
>  	ANNOTATE_NOENDBR
>  	push	%rbp
>  	mov	%rsp, %rbp
> -	movl	$5, %ecx
> +
> +	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> +	ALTERNATIVE "movl $5,  %ecx; movl $5, %edx",	\
> +		    "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL

Why isn't this written like this:

in C:

clear_bhb_loop:

	if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL))
		__clear_bhb_loop(12, 7);
	else
		__clear_bhb_loop(5, 5);

and then the __-version is asm and it gets those two arguments from %rdi and
%rsi instead of more hard-coded, error-prone register-diddling alternative
gunk?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
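
The wrapper shape suggested in this reply can be sketched in userspace C. The
names below are hypothetical stand-ins: model_clear_bhb_loop_args() plays the
role of the asm helper and only records its arguments, and the has_bhi_ctrl
parameter replaces cpu_feature_enabled(X86_FEATURE_BHI_CTRL), which is not
available outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Last arguments seen by the stand-in helper, so the dispatch can be checked. */
static unsigned int last_outer, last_inner;

/* Stand-in for the asm helper that would take its counts in %rdi and %rsi. */
static void model_clear_bhb_loop_args(unsigned int outer, unsigned int inner)
{
	assert(outer > 0 && inner > 0);
	last_outer = outer;
	last_inner = inner;
}

/* C-level dispatch: the feature check picks the counts, the helper does the work. */
static void model_clear_bhb_loop(bool has_bhi_ctrl)
{
	if (has_bhi_ctrl)
		model_clear_bhb_loop_args(12, 7);
	else
		model_clear_bhb_loop_args(5, 5);
}
```

The point of this shape is that the counts become ordinary function
arguments, so the asm body no longer depends on which scratch registers an
ALTERNATIVE happened to load.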

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by Nikolay Borisov 2 months ago


On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> the Branch History Buffer (BHB). On Alder Lake and newer parts this
> sequence is not sufficient because it doesn't clear enough entries. This
> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> that mitigates BHI in kernel.
> 
> BHI variant of VMSCAPE requires isolating branch history between guests and
> userspace. Note that there is no equivalent hardware control for userspace.
> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> should execute sufficient number of branches to clear a larger BHB.
> 
> Dynamically set the loop count of clear_bhb_loop() such that it is
> effective on newer CPUs too. Use the hardware control enumeration
> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> 
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

nit: My RB tag is incorrect. While I did agree with Dave's suggestion to
have global variables for the loop counts, I hadn't actually seen the
code, so I couldn't have given my RB on something I had only agreed with
in principle.

Now that I have seen the code I'm willing to give my:

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> ---
>   arch/x86/entry/entry_64.S | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
>   	ANNOTATE_NOENDBR
>   	push	%rbp
>   	mov	%rsp, %rbp
> -	movl	$5, %ecx
> +
> +	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> +	ALTERNATIVE "movl $5,  %ecx; movl $5, %edx",	\
> +		    "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL

nit: Just

> +
>   	ANNOTATE_INTRA_FUNCTION_CALL
>   	call	1f
>   	jmp	5f
> @@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
>   	 * but some Clang versions (e.g. 18) don't like this.
>   	 */
>   	.skip 32 - 18, 0xcc
> -2:	movl	$5, %eax
> +2:	movl	%edx, %eax
>   3:	jmp	4f
>   	nop
>   4:	sub	$1, %eax
>

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by Pawan Gupta 1 month, 3 weeks ago

On Wed, Dec 10, 2025 at 02:31:31PM +0200, Nikolay Borisov wrote:
> 
> 
> On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > that mitigates BHI in kernel.
> > 
> > BHI variant of VMSCAPE requires isolating branch history between guests and
> > userspace. Note that there is no equivalent hardware control for userspace.
> > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > should execute sufficient number of branches to clear a larger BHB.
> > 
> > Dynamically set the loop count of clear_bhb_loop() such that it is
> > effective on newer CPUs too. Use the hardware control enumeration
> > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > 
> > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> 
> nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
> have global variables for the loop counts I haven't' really seen the code so
> I couldn't have given my RB on something which I haven't seen but did agree
> with in principle.

The tag got applied from v4, but yes the patch got updated since:

https://lore.kernel.org/all/8b657ef2-d9a7-4424-987d-111beb477727@suse.com/

> Now that I have seen the code I'm willing to give my :
> 
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>

Thanks.

> > ---
> >   arch/x86/entry/entry_64.S | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
> >   	ANNOTATE_NOENDBR
> >   	push	%rbp
> >   	mov	%rsp, %rbp
> > -	movl	$5, %ecx
> > +
> > +	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> > +	ALTERNATIVE "movl $5,  %ecx; movl $5, %edx",	\
> > +		    "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
> 
> nit: Just

Will do:

	/* Just loop count differs based on BHI_CTRL, see Intel's BHI guidance */

> > +
> >   	ANNOTATE_INTRA_FUNCTION_CALL
> >   	call	1f
> >   	jmp	5f
> > @@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
> >   	 * but some Clang versions (e.g. 18) don't like this.
> >   	 */
> >   	.skip 32 - 18, 0xcc
> > -2:	movl	$5, %eax
> > +2:	movl	%edx, %eax
> >   3:	jmp	4f
> >   	nop
> >   4:	sub	$1, %eax
> > 
>

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by David Laight 2 months ago

On Wed, 10 Dec 2025 14:31:31 +0200
Nikolay Borisov <nik.borisov@suse.com> wrote:

> On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > that mitigates BHI in kernel.
> > 
> > BHI variant of VMSCAPE requires isolating branch history between guests and
> > userspace. Note that there is no equivalent hardware control for userspace.
> > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > should execute sufficient number of branches to clear a larger BHB.
> > 
> > Dynamically set the loop count of clear_bhb_loop() such that it is
> > effective on newer CPUs too. Use the hardware control enumeration
> > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > 
> > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>  
> 
> nit: My RB tag is incorrect, while I did agree with Dave's suggestion to 
> have global variables for the loop counts I haven't' really seen the 
> code so I couldn't have given my RB on something which I haven't seen 
> but did agree with in principle.

I thought the plan was to use global variables rather than ALTERNATIVE.
The performance of this code is dominated by the loop.

I also found this code in arch/x86/net/bpf_jit_comp.c:
	if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
		/* The clearing sequence clobbers eax and ecx. */
		EMIT1(0x50); /* push rax */
		EMIT1(0x51); /* push rcx */
		ip += 2;

		func = (u8 *)clear_bhb_loop;
		ip += x86_call_depth_emit_accounting(&prog, func, ip);

		if (emit_call(&prog, func, ip))
			return -EINVAL;
		EMIT1(0x59); /* pop rcx */
		EMIT1(0x58); /* pop rax */
	}
which appears to assume that only rax and rcx are changed.
Since all the counts are small, there is nothing stopping the code
using the 8-bit registers %al, %ah, %cl and %ch.

There are probably some schemes that only need one register.
eg two separate ALTERNATIVE blocks.

	David

> 
> Now that I have seen the code I'm willing to give my :
> 
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > ---
> >   arch/x86/entry/entry_64.S | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 886f86790b4467347031bc27d3d761d5cc286da1..9f6f4a7c5baf1fe4e3ab18b11e25e2fbcc77489d 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
> >   	ANNOTATE_NOENDBR
> >   	push	%rbp
> >   	mov	%rsp, %rbp
> > -	movl	$5, %ecx
> > +
> > +	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> > +	ALTERNATIVE "movl $5,  %ecx; movl $5, %edx",	\
> > +		    "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL  
> 
> nit: Just
> 
> > +
> >   	ANNOTATE_INTRA_FUNCTION_CALL
> >   	call	1f
> >   	jmp	5f
> > @@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
> >   	 * but some Clang versions (e.g. 18) don't like this.
> >   	 */
> >   	.skip 32 - 18, 0xcc
> > -2:	movl	$5, %eax
> > +2:	movl	%edx, %eax
> >   3:	jmp	4f
> >   	nop
> >   4:	sub	$1, %eax
> >   
> 
>

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by Pawan Gupta 1 month, 3 weeks ago

On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:
> On Wed, 10 Dec 2025 14:31:31 +0200
> Nikolay Borisov <nik.borisov@suse.com> wrote:
> 
> > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
> > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > sequence is not sufficient because it doesn't clear enough entries. This
> > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > that mitigates BHI in kernel.
> > > 
> > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > userspace. Note that there is no equivalent hardware control for userspace.
> > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > should execute sufficient number of branches to clear a larger BHB.
> > > 
> > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > effective on newer CPUs too. Use the hardware control enumeration
> > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > 
> > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>  
> > 
> > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to 
> > have global variables for the loop counts I haven't' really seen the 
> > code so I couldn't have given my RB on something which I haven't seen 
> > but did agree with in principle.
> 
> I thought the plan was to use global variables rather than ALTERNATIVE.
> The performance of this code is dominated by the loop.

Using globals was much more involved, requiring changes in at least 3 files.
The current ALTERNATIVE approach is much simpler and avoids additional
handling to make sure that globals are set correctly for all mitigation
modes of BHI and VMSCAPE.

[ BTW, I am travelling on a vacation and will be intermittently checking my
  emails. ]

> I also found this code in arch/x86/net/bpf_jit_comp.c:
> 	if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> 		/* The clearing sequence clobbers eax and ecx. */
> 		EMIT1(0x50); /* push rax */
> 		EMIT1(0x51); /* push rcx */
> 		ip += 2;
> 
> 		func = (u8 *)clear_bhb_loop;
> 		ip += x86_call_depth_emit_accounting(&prog, func, ip);
> 
> 		if (emit_call(&prog, func, ip))
> 			return -EINVAL;
> 		EMIT1(0x59); /* pop rcx */
> 		EMIT1(0x58); /* pop rax */
> 	}
> which appears to assume that only rax and rcx are changed.
> Since all the counts are small, there is nothing stopping the code
> using the 8-bit registers %al, %ah, %cl and %ch.

Thanks for catching this.

> There are probably some schemes that only need one register.
> eg two separate ALTERNATIVE blocks.

Also, I think it is better to use a callee-saved register like rbx to avoid
callers having to save/restore registers. Something like below:

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9f6f4a7c5baf..ca4a34ce314a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1535,11 +1535,12 @@ SYM_CODE_END(rewind_stack_and_make_dead)
 SYM_FUNC_START(clear_bhb_loop)
 	ANNOTATE_NOENDBR
 	push	%rbp
+	push	%rbx
 	mov	%rsp, %rbp
 
 	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
-	ALTERNATIVE "movl $5,  %ecx; movl $5, %edx",	\
-		    "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
+	ALTERNATIVE "movb $5,  %bl",	\
+		    "movb $12, %bl", X86_FEATURE_BHI_CTRL
 
 	ANNOTATE_INTRA_FUNCTION_CALL
 	call	1f
@@ -1561,15 +1562,17 @@ SYM_FUNC_START(clear_bhb_loop)
 	 * but some Clang versions (e.g. 18) don't like this.
 	 */
 	.skip 32 - 18, 0xcc
-2:	movl	%edx, %eax
+2:	ALTERNATIVE "movb $5, %bh",	\
+		    "movb $7, %bh", X86_FEATURE_BHI_CTRL
 3:	jmp	4f
 	nop
-4:	sub	$1, %eax
+4:	sub	$1, %bh
 	jnz	3b
-	sub	$1, %ecx
+	sub	$1, %bl
 	jnz	1b
 .Lret2:	RET
 5:
+	pop	%rbx
 	pop	%rbp
 	RET
 SYM_FUNC_END(clear_bhb_loop)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index c1ec14c55911..823b3f613774 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1593,11 +1593,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
 	u8 *func;
 
 	if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
-		/* The clearing sequence clobbers eax and ecx. */
-		EMIT1(0x50); /* push rax */
-		EMIT1(0x51); /* push rcx */
-		ip += 2;
-
 		func = (u8 *)clear_bhb_loop;
 		ip += x86_call_depth_emit_accounting(&prog, func, ip);
 
@@ -1605,8 +1600,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
 			return -EINVAL;
 		/* Don't speculate past this until BHB is cleared */
 		EMIT_LFENCE();
-		EMIT1(0x59); /* pop rcx */
-		EMIT1(0x58); /* pop rax */
 	}
 	/* Insert IBHF instruction */
 	if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&
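
The single-register scheme in the diff above keeps the outer count in %bl and
the inner count in %bh. A userspace C model of that idea, assuming a 16-bit
variable standing in for %bx, might look like the sketch below. The function
name is hypothetical and has_bhi_ctrl again replaces the boot-time
ALTERNATIVE selection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Models %bx: the low byte plays %bl (outer count) and the high byte plays
 * %bh (inner count). Returns the number of inner countdown passes.
 */
static unsigned int model_one_register_loop(bool has_bhi_ctrl)
{
	uint16_t bx = 0;
	unsigned int passes = 0;

	/* movb $5 or $12, %bl */
	bx = (uint16_t)((bx & 0xff00u) | (has_bhi_ctrl ? 12u : 5u));
	do {
		/* movb $5 or $7, %bh: reloaded on every outer pass */
		bx = (uint16_t)((bx & 0x00ffu) | ((has_bhi_ctrl ? 7u : 5u) << 8));
		do {
			passes++;
			bx = (uint16_t)(bx - 0x0100u);	/* sub $1, %bh */
		} while ((bx >> 8) != 0);		/* jnz 3b */

		/* sub $1, %bl: only the low byte changes */
		bx = (uint16_t)((bx & 0xff00u) | ((bx - 1u) & 0x00ffu));
	} while ((bx & 0x00ffu) != 0);			/* jnz 1b */

	assert(bx == 0);	/* both byte counters have run down to zero */
	return passes;
}
```

Both counts fit comfortably in a byte, so one saved and restored register is
enough to hold the whole loop state.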

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by David Laight 1 month, 3 weeks ago

On Sun, 14 Dec 2025 10:38:27 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:

> On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:
> > On Wed, 10 Dec 2025 14:31:31 +0200
> > Nikolay Borisov <nik.borisov@suse.com> wrote:
> >   
> > > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:  
> > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > that mitigates BHI in kernel.
> > > > 
> > > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > > userspace. Note that there is no equivalent hardware control for userspace.
> > > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > > should execute sufficient number of branches to clear a larger BHB.
> > > > 
> > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > > 
> > > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>    
> > > 
> > > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to 
> > > have global variables for the loop counts I haven't' really seen the 
> > > code so I couldn't have given my RB on something which I haven't seen 
> > > but did agree with in principle.  
> > 
> > I thought the plan was to use global variables rather than ALTERNATIVE.
> > The performance of this code is dominated by the loop.  
> 
> Using globals was much more involved, requiring changes in atleast 3 files.
> The current ALTERNATIVE approach is much simpler and avoids additional
> handling to make sure that globals are set correctly for all mitigation
> modes of BHI and VMSCAPE.
> 
> [ BTW, I am travelling on a vacation and will be intermittently checking my
>   emails. ]
> 
> > I also found this code in arch/x86/net/bpf_jit_comp.c:
> > 	if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> > 		/* The clearing sequence clobbers eax and ecx. */
> > 		EMIT1(0x50); /* push rax */
> > 		EMIT1(0x51); /* push rcx */
> > 		ip += 2;
> > 
> > 		func = (u8 *)clear_bhb_loop;
> > 		ip += x86_call_depth_emit_accounting(&prog, func, ip);
> > 
> > 		if (emit_call(&prog, func, ip))
> > 			return -EINVAL;
> > 		EMIT1(0x59); /* pop rcx */
> > 		EMIT1(0x58); /* pop rax */
> > 	}
> > which appears to assume that only rax and rcx are changed.
> > Since all the counts are small, there is nothing stopping the code
> > using the 8-bit registers %al, %ah, %cl and %ch.  
> 
> Thanks for catching this.

I was trying to find where it was called from.
Failed to find the one on system call entry...

> > There are probably some schemes that only need one register.
> > eg two separate ALTERNATIVE blocks.  
> 
> Also, I think it is better to use a callee-saved register like rbx to avoid
> callers having to save/restore registers. Something like below:

I'm not sure.
%ax is the return value so can be 'trashed' by a normal function call.
But if the bpf code is saving %ax then it isn't expecting a normal call.
OTOH if you are going to save the register in clear_bhb_loop you might
as well use %ax to get the slightly shorter instructions for %al.
(I think 'movb' comes out shorter - as if it really matters.)

Definitely worth a comment that it must save all registers.

I also wonder if it needs to set up a stack frame?
Again, the code is so slow it won't matter.

	David


> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 9f6f4a7c5baf..ca4a34ce314a 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1535,11 +1535,12 @@ SYM_CODE_END(rewind_stack_and_make_dead)
>  SYM_FUNC_START(clear_bhb_loop)
>  	ANNOTATE_NOENDBR
>  	push	%rbp
> +	push	%rbx
>  	mov	%rsp, %rbp
>  
>  	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
> -	ALTERNATIVE "movl $5,  %ecx; movl $5, %edx",	\
> -		    "movl $12, %ecx; movl $7, %edx", X86_FEATURE_BHI_CTRL
> +	ALTERNATIVE "movb $5,  %bl",	\
> +		    "movb $12, %bl", X86_FEATURE_BHI_CTRL
>  
>  	ANNOTATE_INTRA_FUNCTION_CALL
>  	call	1f
> @@ -1561,15 +1562,17 @@ SYM_FUNC_START(clear_bhb_loop)
>  	 * but some Clang versions (e.g. 18) don't like this.
>  	 */
>  	.skip 32 - 18, 0xcc
> -2:	movl	%edx, %eax
> +2:	ALTERNATIVE "movb $5, %bh",	\
> +		    "movb $7, %bh", X86_FEATURE_BHI_CTRL
>  3:	jmp	4f
>  	nop
> -4:	sub	$1, %eax
> +4:	sub	$1, %bh
>  	jnz	3b
> -	sub	$1, %ecx
> +	sub	$1, %bl
>  	jnz	1b
>  .Lret2:	RET
>  5:
> +	pop	%rbx
>  	pop	%rbp
>  	RET
>  SYM_FUNC_END(clear_bhb_loop)
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index c1ec14c55911..823b3f613774 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1593,11 +1593,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
>  	u8 *func;
>  
>  	if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> -		/* The clearing sequence clobbers eax and ecx. */
> -		EMIT1(0x50); /* push rax */
> -		EMIT1(0x51); /* push rcx */
> -		ip += 2;
> -
>  		func = (u8 *)clear_bhb_loop;
>  		ip += x86_call_depth_emit_accounting(&prog, func, ip);
>  
> @@ -1605,8 +1600,6 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
>  			return -EINVAL;
>  		/* Don't speculate past this until BHB is cleared */
>  		EMIT_LFENCE();
> -		EMIT1(0x59); /* pop rcx */
> -		EMIT1(0x58); /* pop rax */
>  	}
>  	/* Insert IBHF instruction */
>  	if ((cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP) &&

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by Pawan Gupta 1 month, 3 weeks ago

On Sun, Dec 14, 2025 at 07:02:33PM +0000, David Laight wrote:
> On Sun, 14 Dec 2025 10:38:27 -0800
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> 
> > On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:
> > > On Wed, 10 Dec 2025 14:31:31 +0200
> > > Nikolay Borisov <nik.borisov@suse.com> wrote:
> > >   
> > > > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:  
> > > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > > that mitigates BHI in kernel.
> > > > > 
> > > > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > > > userspace. Note that there is no equivalent hardware control for userspace.
> > > > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > > > should execute sufficient number of branches to clear a larger BHB.
> > > > > 
> > > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > > > 
> > > > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>    
> > > > 
> > > > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to 
> > > > have global variables for the loop counts I haven't' really seen the 
> > > > code so I couldn't have given my RB on something which I haven't seen 
> > > > but did agree with in principle.  
> > > 
> > > I thought the plan was to use global variables rather than ALTERNATIVE.
> > > The performance of this code is dominated by the loop.  
> > 
> > Using globals was much more involved, requiring changes in atleast 3 files.
> > The current ALTERNATIVE approach is much simpler and avoids additional
> > handling to make sure that globals are set correctly for all mitigation
> > modes of BHI and VMSCAPE.
> > 
> > [ BTW, I am travelling on a vacation and will be intermittently checking my
> >   emails. ]
> > 
> > > I also found this code in arch/x86/net/bpf_jit_comp.c:
> > > 	if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> > > 		/* The clearing sequence clobbers eax and ecx. */
> > > 		EMIT1(0x50); /* push rax */
> > > 		EMIT1(0x51); /* push rcx */
> > > 		ip += 2;
> > > 
> > > 		func = (u8 *)clear_bhb_loop;
> > > 		ip += x86_call_depth_emit_accounting(&prog, func, ip);
> > > 
> > > 		if (emit_call(&prog, func, ip))
> > > 			return -EINVAL;
> > > 		EMIT1(0x59); /* pop rcx */
> > > 		EMIT1(0x58); /* pop rax */
> > > 	}
> > > which appears to assume that only rax and rcx are changed.
> > > Since all the counts are small, there is nothing stopping the code
> > > using the 8-bit registers %al, %ah, %cl and %ch.  
> > 
> > Thanks for catching this.
> 
> I was trying to find where it was called from.
> Failed to find the one on system call entry...

The macro CLEAR_BRANCH_HISTORY calls clear_bhb_loop() at system call entry.

> > > There are probably some schemes that only need one register.
> > > eg two separate ALTERNATIVE blocks.  
> > 
> > Also, I think it is better to use a callee-saved register like rbx to avoid
> > callers having to save/restore registers. Something like below:
> 
> I'm not sure.
> %ax is the return value so can be 'trashed' by a normal function call.
> But if the bpf code is saving %ax then it isn't expecting a normal call.

BHB clear sequence is executed at the end of the BPF JITted code, and %rax
is likely the return value of the BPF program. So, saving/restoring %rax
around the sequence makes sense to me.

> OTOH if you are going to save the register in clear_bhb_loop you might
> as well use %ax to get the slightly shorter instructions for %al.
> (I think 'movb' comes out shorter - as if it really matters.)

%rbx is a callee-saved register so it felt more intuitive to save/restore
it in clear_bhb_loop(). But, I can use %ax if you feel strongly.

> Definitely worth a comment that it must save all resisters.

Yes, will add a comment.

> I also wonder if it needs to setup a stack frame?

I don't know if that's necessary; objtool doesn't complain because
clear_bhb_loop() is marked STACK_FRAME_NON_STANDARD.

> Again, the code is so slow it won't matter.
> 
> 	David

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by David Laight 1 month, 3 weeks ago

On Mon, 15 Dec 2025 10:01:36 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:

> On Sun, Dec 14, 2025 at 07:02:33PM +0000, David Laight wrote:
> > On Sun, 14 Dec 2025 10:38:27 -0800
> > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> >   
> > > On Wed, Dec 10, 2025 at 01:35:42PM +0000, David Laight wrote:  
> > > > On Wed, 10 Dec 2025 14:31:31 +0200
> > > > Nikolay Borisov <nik.borisov@suse.com> wrote:
> > > >     
> > > > > On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:    
> > > > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> > > > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > > > that mitigates BHI in kernel.
> > > > > > 
> > > > > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > > > > userspace. Note that there is no equivalent hardware control for userspace.
> > > > > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > > > > should execute sufficient number of branches to clear a larger BHB.
> > > > > > 
> > > > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > > > > 
> > > > > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > > Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> > > > > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>      
> > > > > 
> > > > > nit: My RB tag is incorrect, while I did agree with Dave's suggestion to 
> > > > > have global variables for the loop counts I haven't' really seen the 
> > > > > code so I couldn't have given my RB on something which I haven't seen 
> > > > > but did agree with in principle.    
> > > > 
> > > > I thought the plan was to use global variables rather than ALTERNATIVE.
> > > > The performance of this code is dominated by the loop.    
> > > 
> > > Using globals was much more involved, requiring changes in atleast 3 files.
> > > The current ALTERNATIVE approach is much simpler and avoids additional
> > > handling to make sure that globals are set correctly for all mitigation
> > > modes of BHI and VMSCAPE.
> > > 
> > > [ BTW, I am travelling on a vacation and will be intermittently checking my
> > >   emails. ]
> > >   
> > > > I also found this code in arch/x86/net/bpf_jit_comp.c:
> > > > 	if (cpu_feature_enabled(X86_FEATURE_CLEAR_BHB_LOOP)) {
> > > > 		/* The clearing sequence clobbers eax and ecx. */
> > > > 		EMIT1(0x50); /* push rax */
> > > > 		EMIT1(0x51); /* push rcx */
> > > > 		ip += 2;
> > > > 
> > > > 		func = (u8 *)clear_bhb_loop;
> > > > 		ip += x86_call_depth_emit_accounting(&prog, func, ip);
> > > > 
> > > > 		if (emit_call(&prog, func, ip))
> > > > 			return -EINVAL;
> > > > 		EMIT1(0x59); /* pop rcx */
> > > > 		EMIT1(0x58); /* pop rax */
> > > > 	}
> > > > which appears to assume that only rax and rcx are changed.
> > > > Since all the counts are small, there is nothing stopping the code
> > > > using the 8-bit registers %al, %ah, %cl and %ch.    
> > > 
> > > Thanks for catching this.  
> > 
> > I was trying to find where it was called from.
> > Failed to find the one on system call entry...  
> 
> The macro CLEAR_BRANCH_HISTORY calls clear_bhb_loop() at system call entry.

I didn't look very hard :-)

> 
> > > > There are probably some schemes that only need one register.
> > > > eg two separate ALTERNATIVE blocks.    
> > > 
> > > Also, I think it is better to use a callee-saved register like rbx to avoid
> > > callers having to save/restore registers. Something like below:  
> > 
> > I'm not sure.
> > %ax is the return value so can be 'trashed' by a normal function call.
> > But if the bpf code is saving %ax then it isn't expecting a normal call.  
> 
> BHB clear sequence is executed at the end of the BPF JITted code, and %rax
> is likely the return value of the BPF program. So, saving/restoring %rax
> around the sequence makes sense to me.
> 
> > OTOH if you are going to save the register in clear_bhb_loop you might
> > as well use %ax to get the slightly shorter instructions for %al.
> > (I think 'movb' comes out shorter - as if it really matters.)  
> 
> %rbx is a callee-saved register so it felt more intuitive to save/restore
> it in clear_bhb_loop(). But, I can use %ax if you feel strongly.

If you are going to save a register it might as well be %ax.
Otherwise someone will wonder why you picked a different one.

> 
> > Definitely worth a comment that it must save all resisters.  
> 
> Yes, will add a comment.
> 
> > I also wonder if it needs to setup a stack frame?  
> 
> I don't know if thats necessary, objtool doesn't complain because
> clear_bhb_loop() is marked STACK_FRAME_NON_STANDARD.

In some senses it is a leaf function - and the compiler doesn't create
stack frames for those (by default).

Provided objtool isn't confused by all the call instructions it probably
doesn't matter.

	David

> 
> > Again, the code is so slow it won't matter.
> > 
> > 	David

Re: [PATCH v6 2/9] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

Posted by Nikolay Borisov 2 months ago


On 10.12.25 г. 15:35 ч., David Laight wrote:
> On Wed, 10 Dec 2025 14:31:31 +0200
> Nikolay Borisov <nik.borisov@suse.com> wrote:
> 
>> On 2.12.25 г. 8:19 ч., Pawan Gupta wrote:
>>> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
>>> the Branch History Buffer (BHB). On Alder Lake and newer parts this
>>> sequence is not sufficient because it doesn't clear enough entries. This
>>> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
>>> that mitigates BHI in kernel.
>>>
>>> BHI variant of VMSCAPE requires isolating branch history between guests and
>>> userspace. Note that there is no equivalent hardware control for userspace.
>>> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
>>> should execute sufficient number of branches to clear a larger BHB.
>>>
>>> Dynamically set the loop count of clear_bhb_loop() such that it is
>>> effective on newer CPUs too. Use the hardware control enumeration
>>> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>>>
>>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>>> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
>>> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>>
>> nit: My RB tag is incorrect, while I did agree with Dave's suggestion to
>> have global variables for the loop counts I haven't' really seen the
>> code so I couldn't have given my RB on something which I haven't seen
>> but did agree with in principle.
> 
> I thought the plan was to use global variables rather than ALTERNATIVE.
> The performance of this code is dominated by the loop.

Generally yes and I was on the verge of calling this out, however what 
stopped me is the fact that the global variables are going to be set 
"somewhere else" whilst with the current approach everything is 
contained within the clear_bhb_loop function. Both ways have their merit 
but I don't want to endlessly bikeshed.

<snip>