[v1] RE: [patch 00/38] x86/retbleed: Call depth tracking mitigation

RE: [patch 00/38] x86/retbleed: Call depth tracking mitigation

Posted by David Laight 3 years, 8 months ago

From: Linus Torvalds
> Sent: 21 July 2022 19:07
...
>  (b) since you have that r10 use anyway, why can't you just generate the simpler
> 
>         movl $-IMM,%r10d
>         addl -4(%calldest),%r10d
> 
>      instead? You only need ZF anyway.
> 
>      Maybe you need to add some "r10 is clobbered" thing, I don't know.
> 
> But again: I don't know llvm, so the above is basically me just doing
> the "pattern matching monkey" thing.
> 
>              Linus

Since: "If the callee is a variadic function, then the number of floating
point arguments passed to the function in vector registers must be provided
by the caller in the AL register."

And since that never happens in the kernel, you can use %eax instead
of %r10d.
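
For readers unfamiliar with the rule quoted above, here is a minimal C sketch of the kind of call it covers: a variadic function that receives floating point arguments. The function name sum_doubles is hypothetical and not from the thread. On x86-64 SysV a compiler would normally load %al (with an upper bound on the vector registers used) just before calling such a function, which is why %eax is otherwise free at an indirect call site.

```c
#include <stdarg.h>

/*
 * Hypothetical variadic callee taking 'count' double arguments.
 * Under the x86-64 SysV ABI the caller sets %al before the call to an
 * upper bound on the number of vector registers carrying arguments.
 */
static double sum_doubles(int count, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);   /* each double arrives via the va_list */
    va_end(ap);

    return total;
}
```

A call such as sum_doubles(3, 1.0, 2.0, 3.5) is the userspace case where %al is non-zero; kernel code does not pass floating point varargs, which is the premise of the suggestion.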

Even in userspace %al can be set non-zero after the signature check.

If you are willing to cut the signature down to 26 bits, ensure
that one of the bytes of -IMM (or ~IMM if you use xor) is 0xcc,
and jump back to that byte on error, the check becomes:
	movl	$-IMM,%eax
1:	addl	-4(%calldest),%eax
	jnz	1b-1	// or -2, -3, -4
	add	$num_fp_args,%eax	// If needed non-zero
	call	%calldest

I think that adds 10 bytes to the call site.
Although with retpoline thunks (and no fp varargs calls)
all but the initial movl can go into the thunk.
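
The arithmetic behind the sequence above can be modelled in C. This is only a sketch under stated assumptions: the helper names are hypothetical, and the reading of "26 bits" as 24 payload bits plus a 2-bit choice of which immediate byte holds 0xcc (int3) is an interpretation, not something the mail spells out. It assumes the little-endian imm32 of the movl sits directly before label 1, so immediate byte i is at 1b-4+i.

```c
#include <stdint.h>

/*
 * Build the 32-bit value used as -IMM: three payload bytes plus one
 * byte forced to 0xcc (int3). cc_index (0..3) is the little-endian
 * byte position of the 0xcc byte. Hypothetical encoding.
 */
static uint32_t make_neg_imm(uint32_t payload24, unsigned cc_index)
{
    uint32_t out = 0;
    unsigned shift = 0;

    for (unsigned i = 0; i < 4; i++) {
        if (i == cc_index) {
            out |= (uint32_t)0xcc << (8 * i);
        } else {
            out |= ((payload24 >> shift) & 0xffu) << (8 * i);
            shift += 8;
        }
    }
    return out;
}

/*
 * Model of: movl $-IMM,%eax ; addl -4(%calldest),%eax
 * The add wraps modulo 2^32, so the result is zero (ZF set) only
 * when the value stored before the callee equals IMM.
 */
static int signature_matches(uint32_t neg_imm, uint32_t callee_hash)
{
    uint32_t eax = neg_imm;

    eax += callee_hash;
    return eax == 0;
}

/* Distance from label 1 back to the 0xcc byte: 1 for "1b-1", 4 for "1b-4". */
static unsigned int3_back_offset(unsigned cc_index)
{
    return 4 - cc_index;
}
```

For example, make_neg_imm(0x123456, 3) places 0xcc in the top byte, which corresponds to the "jnz 1b-1" form; the other three positions give the -2, -3 and -4 variants.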

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: [patch 00/38] x86/retbleed: Call depth tracking mitigation

Posted by Peter Zijlstra 3 years, 8 months ago

On Thu, Jul 21, 2022 at 10:01:12PM +0000, David Laight wrote:

> Since: "If the callee is a variadic function, then the number of floating
> point arguments passed to the function in vector registers must be provided
> by the caller in the AL register."
> 
> And since that never happens in the kernel, you can use %eax instead
> of %r10d.

Except there's the AMD BTC thing and we should (compiler patch seems
MIA) have an unconditional: 'xor %eax,%eax' in front of every function
call.

(The official mitigation strategy was CALL; LFENCE IIRC, but that's so
horrible nobody is actually considering that)

Yes, the suggested sequence ends with rax being zero, but since we start
the speculation before that result is computed that's not good enough I
suspect.

RE: [patch 00/38] x86/retbleed: Call depth tracking mitigation

Posted by David Laight 3 years, 8 months ago

From: Peter Zijlstra
> Sent: 22 July 2022 12:03
> 
> On Thu, Jul 21, 2022 at 10:01:12PM +0000, David Laight wrote:
> 
> > Since: "If the callee is a variadic function, then the number of floating
> > point arguments passed to the function in vector registers must be provided
> > by the caller in the AL register."
> >
> > And since that never happens in the kernel, you can use %eax instead
> > of %r10d.
> 
> Except there's the AMD BTC thing and we should (compiler patch seems
> MIA) have an unconditional: 'xor %eax,%eax' in front of every function
> call.

I've just read https://www.amd.com/system/files/documents/technical-guidance-for-mitigating-branch-type-confusion_v7_20220712.pdf

It doesn't seem to suggest clearing registers except as a vague
'might help' before a function return (to limit what the speculated
code can do).

The only advantage I can think of for 'xor ax,ax' is that it is done as
a register rename - and isn't dependent on older instructions.
So it might reduce some pipeline stalls.

I'm guessing that someone might find a 'gadget' that depends on %eax
and it may be possible to find somewhere that leaves an arbitrary
value in it.
It is also about the only register that isn't live!

> (The official mitigation strategy was CALL; LFENCE IIRC, but that's so
> horrible nobody is actually considering that)
> 
> Yes, the suggested sequence ends with rax being zero, but since we start
> the speculation before that result is computed that's not good enough I
> suspect.

The speculated code can't use the 'wrong' %eax value.
The only problem is that reading from -4(%r11) is likely to be a
D$ miss giving plenty of time for the cpu to execute 'crap'.
But I'm not sure a later 'xor ax,ax' helps.
(OTOH this is all horrid and makes my brain hurt.)

AFAICT with BTC you 'just lose'.
I thought it was bad enough that some cpu used the BTB for predicted
conditional jumps - but using it to decide 'this must be a branch
instruction' seems especially broken.

Seems the best thing to do with those cpu is to run an embedded
system with a busybox+buildroot userspace where almost everything
runs as root :-)

	David
