As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
the Branch History Buffer (BHB). On Alder Lake and newer parts this
sequence is not sufficient because it doesn't clear enough entries. This
has not been an issue so far because these CPUs have a hardware control
(BHI_DIS_S) that mitigates BHI in the kernel.
The BHI variant of VMSCAPE requires isolating branch history between guests
and userspace. Note that there is no equivalent hardware control for
userspace. To effectively isolate branch history on newer CPUs,
clear_bhb_loop() should execute a sufficient number of branches to clear
the larger BHB.
Dynamically set the loop count of clear_bhb_loop() such that it is
effective on newer CPUs too. Use the hardware control enumeration
X86_FEATURE_BHI_CTRL to select the appropriate loop count.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a2891af416c874349c065160708752c41bc6ba36..6cddd7d6dc927704357d97346fcbb34559608888 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1563,17 +1563,20 @@ SYM_CODE_END(rewind_stack_and_make_dead)
.endm
/*
- * This should be used on parts prior to Alder Lake. Newer parts should use the
- * BHI_DIS_S hardware control instead. If a pre-Alder Lake part is being
- * virtualized on newer hardware the VMM should protect against BHI attacks by
- * setting BHI_DIS_S for the guests.
+ * For protecting the kernel against BHI, this should be used on parts prior to
+ * Alder Lake. Although the sequence works on newer parts, BHI_DIS_S hardware
+ * control has lower overhead, and should be used instead. If a pre-Alder Lake
+ * part is being virtualized on newer hardware the VMM should protect against
+ * BHI attacks by setting BHI_DIS_S for the guests.
*/
SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
mov %rsp, %rbp
- CLEAR_BHB_LOOP_SEQ 5, 5
+ /* loop count differs based on CPU-gen, see Intel's BHI guidance */
+ ALTERNATIVE __stringify(CLEAR_BHB_LOOP_SEQ 5, 5), \
+ __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
pop %rbp
RET
--
2.34.1
On 11/19/25 22:18, Pawan Gupta wrote:
> - CLEAR_BHB_LOOP_SEQ 5, 5
> + /* loop count differs based on CPU-gen, see Intel's BHI guidance */
> + ALTERNATIVE (CLEAR_BHB_LOOP_SEQ 5, 5), \
> + __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
There are a million ways to skin this cat. But I'm not sure I really
like the end result here. It seems a little overkill to use ALTERNATIVE
to rewrite a whole sequence just to patch two constants in there.
What if the CLEAR_BHB_LOOP_SEQ just took its inner and outer loop counts
as register arguments? Then this would look more like:
ALTERNATIVE "mov $5, %rdi; mov $5, %rsi",
"mov $12, %rdi; mov $7, %rsi",
...
CLEAR_BHB_LOOP_SEQ
Or, even global variables:
mov outer_loop_count(%rip), %rdi
mov inner_loop_count(%rip), %rsi
and then have some C code somewhere that does:
if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
outer_loop_count = 5;
inner_loop_count = 5;
} else {
outer_loop_count = 12;
inner_loop_count = 7;
}
... and I'm sure I got something wrong in there like flipping the
inner/outer counts, and I'm not even thinking about the variable types.
But, basically, I think I want to avoid as much logic as possible in
assembly. I also think we should reserve ALTERNATIVE for things that
truly need it, like things that are truly performance sensitive or that
can't reach out and poke at variables.
Peter Z. usually has good instincts on these things, so I'm curious what
he thinks of all this.
On Fri, Nov 21, 2025 at 08:40:44AM -0800, Dave Hansen wrote:
> On 11/19/25 22:18, Pawan Gupta wrote:
> > -	CLEAR_BHB_LOOP_SEQ 5, 5
> > +	/* loop count differs based on CPU-gen, see Intel's BHI guidance */
> > +	ALTERNATIVE (CLEAR_BHB_LOOP_SEQ 5, 5), \
> > +		    __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
>
> There are a million ways to skin this cat. But I'm not sure I really
> like the end result here. It seems a little overkill to use ALTERNATIVE
> to rewrite a whole sequence just to patch two constants in there.
>
> What if the CLEAR_BHB_LOOP_SEQ just took its inner and outer loop counts
> as register arguments? Then this would look more like:
>
> 	ALTERNATIVE "mov $5, %rdi; mov $5, %rsi",
> 		    "mov $12, %rdi; mov $7, %rsi",
> 		    ...
>
> 	CLEAR_BHB_LOOP_SEQ

Following this idea, the loop count can be set via ALTERNATIVE within
clear_bhb_loop() itself. The outer count %ecx is already set outside the
loops. The only change to the sequence would be to also store the inner
count in a register, and reload %eax from it.

---
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 886f86790b44..e4863d6d3217 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
 	ANNOTATE_NOENDBR
 	push	%rbp
 	mov	%rsp, %rbp
-	movl	$5, %ecx
+
+	/* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
+	ALTERNATIVE "movl $5, %ecx; movl $5, %edx;", \
+		    "movl $12, %ecx; movl $7, %edx;", X86_FEATURE_BHI_CTRL
+
 	ANNOTATE_INTRA_FUNCTION_CALL
 	call	1f
 	jmp	5f
@@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
 	 * but some Clang versions (e.g. 18) don't like this.
 	 */
 	.skip 32 - 18, 0xcc
-2:	movl	$5, %eax
+2:	movl	%edx, %eax
 3:	jmp	4f
 	nop
 4:	sub	$1, %eax
On 11/21/25 18:40, Dave Hansen wrote:
> On 11/19/25 22:18, Pawan Gupta wrote:
>> - CLEAR_BHB_LOOP_SEQ 5, 5
>> + /* loop count differs based on CPU-gen, see Intel's BHI guidance */
>> + ALTERNATIVE (CLEAR_BHB_LOOP_SEQ 5, 5), \
>> + __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
>
> There are a million ways to skin this cat. But I'm not sure I really
> like the end result here. It seems a little overkill to use ALTERNATIVE
> to rewrite a whole sequence just to patch two constants in there.
>
> What if the CLEAR_BHB_LOOP_SEQ just took its inner and outer loop counts
> as register arguments? Then this would look more like:
>
> ALTERNATIVE "mov $5, %rdi; mov $5, %rsi",
> "mov $12, %rdi; mov $7, %rsi",
> ...
>
> CLEAR_BHB_LOOP_SEQ
>
> Or, even global variables:
>
> mov outer_loop_count(%rip), %rdi
> mov inner_loop_count(%rip), %rsi
nit: FWIW I find this rather tacky, because the way the registers are
being used (although they do follow the x86-64 calling convention) is
obfuscated in the macro itself.
>
> and then have some C code somewhere that does:
>
> if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> outer_loop_count = 5;
> inner_loop_count = 5;
> } else {
> outer_loop_count = 12;
> inner_loop_count = 7;
> }
OTOH: the global variable approach seems saner as in the macro you'd
have direct reference to them and so it will be more obvious how things
are setup.
<snip>
On 11/21/25 08:45, Nikolay Borisov wrote:
> OTOH: the global variable approach seems saner as in the macro you'd
> have direct reference to them and so it will be more obvious how things
> are setup.

Oh, yeah, duh. You don't need to pass the variables in registers. They
could just be read directly.
On Fri, Nov 21, 2025 at 08:50:17AM -0800, Dave Hansen wrote:
> On 11/21/25 08:45, Nikolay Borisov wrote:
> > OTOH: the global variable approach seems saner as in the macro you'd
> > have direct reference to them and so it will be more obvious how things
> > are setup.
>
> Oh, yeah, duh. You don't need to pass the variables in registers. They
> could just be read directly.

IIUC, global variables would introduce extra memory loads that may slow
things down. I will try to measure their impact. I think those global
variables should be in the .entry.text section to play well with PTI.

Also I was preferring constants because load values from global variables
may also be subject to speculation. Although any speculation should be
corrected before an indirect branch is executed because of the LFENCE after
the sequence.
On 11/21/25 10:16, Pawan Gupta wrote:
> On Fri, Nov 21, 2025 at 08:50:17AM -0800, Dave Hansen wrote:
>> On 11/21/25 08:45, Nikolay Borisov wrote:
>>> OTOH: the global variable approach seems saner as in the macro you'd
>>> have direct reference to them and so it will be more obvious how things
>>> are setup.
>>
>> Oh, yeah, duh. You don't need to pass the variables in registers. They
>> could just be read directly.
>
> IIUC, global variables would introduce extra memory loads that may slow
> things down. I will try to measure their impact. I think those global
> variables should be in the .entry.text section to play well with PTI.

Really? I didn't look exhaustively, but CLEAR_BRANCH_HISTORY seems to
get called pretty close to where the assembly jumps into C. Long after
we're running on the kernel CR3.

> Also I was preferring constants because load values from global variables
> may also be subject to speculation. Although any speculation should be
> corrected before an indirect branch is executed because of the LFENCE after
> the sequence.

I guess that's a theoretical problem, but it's not a practical one.

So I think we have 4-ish options at this point:

1. Generate the long and short sequences independently and in their
   entirety and ALTERNATIVE between them (the original patch)
2. Store the inner/outer loop counts in registers and:
   2a. Load those registers from variables
   2b. Load them from ALTERNATIVES
3. Store the inner/outer loop counts in variables in memory
On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
> On 11/21/25 10:16, Pawan Gupta wrote:
> > IIUC, global variables would introduce extra memory loads that may slow
> > things down. I will try to measure their impact. I think those global
> > variables should be in the .entry.text section to play well with PTI.
>
> Really? I didn't look exhaustively, but CLEAR_BRANCH_HISTORY seems to
> get called pretty close to where the assembly jumps into C. Long after
> we're running on the kernel CR3.

You are right. PTI is not a concern here.

> > Also I was preferring constants because load values from global variables
> > may also be subject to speculation. Although any speculation should be
> > corrected before an indirect branch is executed because of the LFENCE after
> > the sequence.
>
> I guess that's a theoretical problem, but it's not a practical one.

Probably yes. But, load from memory would certainly be slower compared to
immediates.

> So I think we have 4-ish options at this point:
>
> 1. Generate the long and short sequences independently and in their
>    entirety and ALTERNATIVE between them (the original patch)
> 2. Store the inner/outer loop counts in registers and:
>    2a. Load those registers from variables
>    2b. Load them from ALTERNATIVES

Both of these look to be good options to me.

2b. would be my first preference, because it keeps the loop counts as
inline constants. The resulting sequence stays the same as it is today.

> 3. Store the inner/outer loop counts in variables in memory

I could be wrong, but this will likely have non-zero impact on performance.
I am afraid to cause any regressions in BHI mitigation. That is why I
preferred the least invasive approach in my previous attempts.
On Fri, 21 Nov 2025 13:26:27 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:

> On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
...
> > So I think we have 4-ish options at this point:
> >
> > 1. Generate the long and short sequences independently and in their
> >    entirety and ALTERNATIVE between them (the original patch)
> > 2. Store the inner/outer loop counts in registers and:
> >    2a. Load those registers from variables
> >    2b. Load them from ALTERNATIVES
>
> Both of these look to be good options to me.
>
> 2b. would be my first preference, because it keeps the loop counts as
> inline constants. The resulting sequence stays the same as it is today.
>
> > 3. Store the inner/outer loop counts in variables in memory
>
> I could be wrong, but this will likely have non-zero impact on performance.
> I am afraid to cause any regressions in BHI mitigation. That is why I
> preferred the least invasive approach in my previous attempts.

Surely it won't be significant compared to the cost of the loop itself.
That is the bit that really kills performance.

For subtle reasons one of the mitigations that slows kernel entry caused
a doubling of the execution time of a largely single-threaded task that
spends almost all its time in userspace!
(I thought I'd disabled it at compile time - but the config option
changed underneath me...)

	David
On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> On Fri, 21 Nov 2025 13:26:27 -0800
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
...
> > I could be wrong, but this will likely have non-zero impact on performance.
> > I am afraid to cause any regressions in BHI mitigation. That is why I
> > preferred the least invasive approach in my previous attempts.
>
> Surely it won't be significant compared to the cost of the loop itself.
> That is the bit that really kills performance.

Correct, recent data suggests the same.

> For subtle reasons one of the mitigations that slows kernel entry caused
> a doubling of the execution time of a largely single-threaded task that
> spends almost all its time in userspace!
> (I thought I'd disabled it at compile time - but the config option
> changed underneath me...)

That is surprising. If it's okay, could you please share more details about
this application? Or any other way I can reproduce this?
On Mon, 24 Nov 2025 11:31:26 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
...
> > For subtle reasons one of the mitigations that slows kernel entry caused
> > a doubling of the execution time of a largely single-threaded task that
> > spends almost all its time in userspace!
> > (I thought I'd disabled it at compile time - but the config option
> > changed underneath me...)
>
> That is surprising. If its okay, could you please share more details about
> this application? Or any other way I can reproduce this?
The 'trigger' program is a multi-threaded program that wakes up every 10ms
to process RTP and TDM audio data.
So we have a low RT priority process with one thread per cpu.
Since they are RT they usually get scheduled on the same cpu as last time.
I think this simple program will have the desired effect:
A main process that does:
syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
start_time += 1sec;
for (n = 1; n < num_cpu; n++)
pthread_create(thread_code, start_time);
thread_code(start_time);
with:
thread_code(ts)
{
	for (;;) {
		ts += 10ms;
		syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
		do_work();
	}
}
So all the threads wake up at exactly the same time every 10ms.
(You need to use syscall(), don't look at what glibc does.)
On my system the program wasn't doing anything, so do_work() was empty.
What matters is whether all the threads end up running at the same time.
I managed that using pthread_broadcast(), but the clock code above
ought to be worse (and I've since changed the daemon to work that way
to avoid all these issues with pthread_broadcast() being sequential
and threads not running because the target cpu is running an ISR or
just looping in kernel).
The process that gets 'hit' is anything cpu bound.
Even a shell loop (e.g. i=0; while :; do i=$((i+1)); done) will do.
Without the 'trigger' program, it will (mostly) sit on one cpu and the
clock frequency of that cpu will increase to (say) 3GHz while the others
all run at 800MHz.
But the 'trigger' program runs threads on all the cpu at the same time.
So the 'hit' program is pre-empted and is later rescheduled on a
different cpu - running at 800MHz.
The cpu speed increases, but 10ms later it gets bounced again.
The real issue is that the cpu speed is associated with the cpu, not
the process running on it.
David
On 11/21/25 13:26, Pawan Gupta wrote:
> On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
>> On 11/21/25 10:16, Pawan Gupta wrote:
...
>>> Also I was preferring constants because load values from global variables
>>> may also be subject to speculation. Although any speculation should be
>>> corrected before an indirect branch is executed because of the LFENCE after
>>> the sequence.
>>
>> I guess that's a theoretical problem, but it's not a practical one.
>
> Probably yes. But, load from memory would certainly be slower compared to
> immediates.

Yeah, but it's literally two bytes of data that can almost certainly be
shoved in a cacheline that's also being read on kernel entry. I suspect
it would be hard to show a delta between a memory load and an immediate.

I'd love to see some actual data.

>> So I think we have 4-ish options at this point:
>>
>> 1. Generate the long and short sequences independently and in their
>>    entirety and ALTERNATIVE between them (the original patch)
>> 2. Store the inner/outer loop counts in registers and:
>>    2a. Load those registers from variables
>>    2b. Load them from ALTERNATIVES
>
> Both of these look to be good options to me.
>
> 2b. would be my first preference, because it keeps the loop counts as
> inline constants. The resulting sequence stays the same as it is today.
>
>> 3. Store the inner/outer loop counts in variables in memory
>
> I could be wrong, but this will likely have non-zero impact on performance.
> I am afraid to cause any regressions in BHI mitigation. That is why I
> preferred the least invasive approach in my previous attempts.

Your magic 8-ball and my crystal ball seem to be disagreeing today. Time
for science!
On Fri, Nov 21, 2025 at 01:36:37PM -0800, Dave Hansen wrote:
> On 11/21/25 13:26, Pawan Gupta wrote:
...
> > Probably yes. But, load from memory would certainly be slower compared to
> > immediates.
>
> Yeah, but it's literally two bytes of data that can almost certainly be
> shoved in a cacheline that's also being read on kernel entry. I suspect
> it would be hard to show a delta between a memory load and an immediate.
>
> I'd love to see some actual data.

You were right, the perf-tool profiling and the Unixbench results show no
meaningful difference between the two approaches. I was irrationally biased
towards immediates. Making the loop count global.
On 11/20/25 08:18, Pawan Gupta wrote:
> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> the Branch History Buffer (BHB). On Alder Lake and newer parts this
> sequence is not sufficient because it doesn't clear enough entries. This
> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> that mitigates BHI in kernel.
>
> BHI variant of VMSCAPE requires isolating branch history between guests and
> userspace. Note that there is no equivalent hardware control for userspace.
> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> should execute sufficient number of branches to clear a larger BHB.
>
> Dynamically set the loop count of clear_bhb_loop() such that it is
> effective on newer CPUs too. Use the hardware control enumeration
> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>