[PATCH 11/14] llvm: kCFI pointer stuff

Posted by Peter Zijlstra 2 months ago
Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with
a u8 bitmask of pointer arguments. This should really be under a new
compiler flag, dependent on both x86_64 and kCFI.

Per the comment, the bitmask represents the register-based arguments
as the first 6 bits, and bit 6 is used to cover all stack-based
arguments. The high bit is used for invalid values.

The purpose is to put a store dependency on the set registers, thereby
blocking speculation paths that would otherwise exploit their value.


Note1:

This implementation simply sets the bit for any pointer type. A better
implementation would only set the bit for any argument that is
dereferenced in the function body.

This better implementation would also capture things like:

  void foo(unsigned long addr, void *args)
  {
    u32 t = *(u32 *)addr;
    bar(t, args);
  }

Which would set bit0 while leaving bit1 unset -- the exact opposite
of the implementation below.

Notably, addr *is* dereferenced, even though it is not a pointer on
entry, while args is a pointer but is not dereferenced, only passed on
to bar -- if bar uses it, it gets to deal with it.

Note2:

Do we want to make this a u32 to keep room for all registers? AFAICT
the current use is only concerned with the argument registers and
those are limited to 6 for the C ABI, but custom (assembly) functions
could use things outside of that.
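The encoding described above, together with Note1's caveat, can be mirrored in a small plain-C sketch (the helper name and boolean-array interface here are mine, for illustration; the actual compiler-side code is in the diff below):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the patch's encoding: bit i is set when argument i is a
 * pointer, with i clamped to 6 so bit6 covers all stack-based args.
 * 0x7f marks vararg/unknown configurations; bit7 stays free for
 * error states. */
static uint8_t kcfi_ptr_mask(const bool *arg_is_ptr, int nargs, bool vararg)
{
	uint8_t val = 0;
	int i;

	if (vararg)
		return 0x7f;

	for (i = 0; i < nargs; i++) {
		if (arg_is_ptr[i])
			val |= 1 << (i < 6 ? i : 6);
	}
	return val;
}
```

For the foo(unsigned long addr, void *args) example in Note1 this yields 0x02 -- only bit1 set -- matching the point that the mask tracks pointer types, not actual dereferences.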

---
diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
index 73c745062096..42dcbc40ab4b 100644
--- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
+++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
@@ -143,11 +143,28 @@ void X86AsmPrinter::EmitKCFITypePadding(const MachineFunction &MF,
   // one. Otherwise, just pad with nops. The X86::MOV32ri instruction emitted
   // in X86AsmPrinter::emitKCFITypeId is 5 bytes long.
   if (HasType)
-    PrefixBytes += 5;
+    PrefixBytes += 7;
 
   emitNops(offsetToAlignment(PrefixBytes, MF.getAlignment()));
 }
 
+static uint8_t getKCFIPointerArgs(const Function &F)
+{
+  uint8_t val = 0;
+
+  if (F.isVarArg())
+    return 0x7f;
+
+  for (int i = 0; i < F.arg_size() ; i++) {
+    Argument *A = F.getArg(i);
+    Type *T = A->getType();
+    if (T->getTypeID() == Type::PointerTyID)
+      val |= 1 << std::min(i, 6);
+  }
+
+  return val;
+}
+
 /// emitKCFITypeId - Emit the KCFI type information in architecture specific
 /// format.
 void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
@@ -183,6 +200,26 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
                               .addReg(X86::EAX)
                               .addImm(MaskKCFIType(Type->getZExtValue())));
 
+  // Extend the kCFI meta-data with a byte that has a bit set for each argument
+  // register that contains a pointer. Specifically for x86_64, which has 6
+  // argument registers:
+  //
+  //   bit0 - rdi
+  //   bit1 - rsi
+  //   bit2 - rdx
+  //   bit3 - rcx
+  //   bit4 - r8
+  //   bit5 - r9
+  //
+  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
+  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
+  // error states.
+  //
+  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
+  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
+		  .addReg(X86::AL)
+		  .addImm(getKCFIPointerArgs(F)));
+
   if (MAI->hasDotTypeDotSizeDirective()) {
     MCSymbol *EndSym = OutContext.createTempSymbol("cfi_func_end");
     OutStreamer->emitLabel(EndSym);
diff --git a/llvm/lib/Target/X86/X86MCInstLower.cpp b/llvm/lib/Target/X86/X86MCInstLower.cpp
index cbb012161524..c0776ef78153 100644
--- a/llvm/lib/Target/X86/X86MCInstLower.cpp
+++ b/llvm/lib/Target/X86/X86MCInstLower.cpp
@@ -897,7 +897,7 @@ void X86AsmPrinter::LowerKCFI_CHECK(const MachineInstr &MI) {
                               .addReg(AddrReg)
                               .addImm(1)
                               .addReg(X86::NoRegister)
-                              .addImm(-(PrefixNops + 4))
+                              .addImm(-(PrefixNops + 6))
                               .addReg(X86::NoRegister));
 
   MCSymbol *Pass = OutContext.createTempSymbol();
RE: [PATCH 11/14] llvm: kCFI pointer stuff
Posted by Constable, Scott D 4 weeks, 1 day ago
> Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with a u8 bitmask of pointer arguments. This should really be under a new compiler flag, dependent on both x86_64 and kCFI.
> 
> Per the comment, the bitmask represents the register based arguments as the first 6 bits and bit 6 is used to cover all stack based arguments. The high bit is used for invalid values.
> 
> The purpose is to put a store dependency on the set registers, thereby blocking speculation paths that would otherwise exploit their value.

Given the ongoing discussion on [PATCH 13/14] where there is a growing consensus that all arguments (not just pointers) should be poisoned after a misprediction, a different encoding scheme would be needed. I believe there are 8 possibilities, which correspond to the function's arity:

0: Function takes 0 args
1: Function takes 1 arg
2: Function takes 2 args
3: Function takes 3 args
4: Function takes 4 args
5: Function takes 5 args
6: Function takes 6 args
7: Function takes >6 args

These possibilities can be encoded with 3 bits. I suspect that it might actually be beneficial to steal 3 bits from the u32 kCFI hash (either by using a smaller 29-bit hash or by truncating the 32-bit hash down to 29 bits). This scheme would arguably strengthen both kCFI and FineIBT by partitioning CFI edges such that a j-arity function cannot call a k-arity function unless j=k (or unless j>6 and k>6); the current 32-bit kCFI hash does not prevent, for example, a 2-arity fptr from calling a 3-arity target if the kCFI hashes collide. The disadvantage of the 29-bit hash is that it would increase the probability of collisions within each arity, but on the other hand the total number of functions of each given arity is much smaller than the total number of functions of all arities.
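A hedged sketch of what this packing could look like (the helper names and the exact bit layout are assumptions for illustration; the proposal above only specifies stealing 3 bits for arity):

```c
#include <stdint.h>

/* Pack a 29-bit kCFI hash with a 3-bit arity: 0-6 encode the exact
 * argument count, 7 means "more than 6 args". Putting the hash in the
 * high bits is an arbitrary choice here. */
static uint32_t kcfi_pack_arity(uint32_t hash29, unsigned int nargs)
{
	uint32_t arity = nargs > 6 ? 7 : nargs;

	return (hash29 & 0x1fffffff) << 3 | arity;
}

static unsigned int kcfi_arity(uint32_t id)
{
	return id & 7;
}

static uint32_t kcfi_hash(uint32_t id)
{
	return id >> 3;
}
```

With a layout like this, a 2-arity caller can never match a 3-arity target even on a full hash collision, because the low 3 bits always differ.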

Regards,

Scott Constable
RE: [PATCH 11/14] llvm: kCFI pointer stuff
Posted by Constable, Scott D 4 weeks ago
> > Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with a u8 bitmask of pointer arguments. This should really be under a new compiler flag, dependent on both x86_64 and kCFI.
> > 
> > Per the comment, the bitmask represents the register based arguments as the first 6 bits and bit 6 is used to cover all stack based arguments. The high bit is used for invalid values.
> > 
> > The purpose is to put a store dependency on the set registers, thereby blocking speculation paths that would otherwise exploit their value.
> 
> Given the ongoing discussion on [PATCH 13/14] where there is a growing consensus that all arguments (not just pointers) should be poisoned after a misprediction, a different encoding scheme would be needed. I believe there are 8 possibilities, which correspond to the function's arity:
> 
> 0: Function takes 0 args
> 1: Function takes 1 arg
> 2: Function takes 2 args
> 3: Function takes 3 args
> 4: Function takes 4 args
> 5: Function takes 5 args
> 6: Function takes 6 args
> 7: Function takes >6 args
>
> These possibilities can be encoded with 3 bits. I suspect that it might actually be beneficial to steal 3 bits from the u32 kCFI hash (either by using a smaller 29-bit hash or by truncating the 32-bit hash down to 29 bits). This scheme would arguably strengthen both kCFI and FineIBT by partitioning CFI edges such that a j-arity function cannot call a k-arity function unless j=k (or unless j>6 and k>6); the current 32-bit kCFI hash does not prevent, for example, a 2-arity fptr from calling a 3-arity target if the kCFI hashes collide. The disadvantage of the 29-bit hash is that it would increase the probability of collisions within each arity, but on the other hand the total number of functions of each given arity is much smaller than the total number of functions of all arities.

I have done some additional analysis on my Noble kernel, which suggests that the proposed 29-bit hash with 3-bit arity will be strictly more secure than the existing 32-bit hash. Consider: my kernel has 141,617 total indirect call targets, with 10,903 unique function types. With a 32-bit kCFI hash, the expected number of collisions is 2^-32 * (10903 C 2) = 0.01383765. Then I scanned the kernel to identify the number of unique function types for each arity, and computed the corresponding expected number of collisions within each arity, assuming a 29-bit hash:

# Args  Total targets  Unique types  Expected collisions
0       12682          32            0.00000092
1       42981          2492          0.00578125
2       37657          3775          0.01326841
3       29436          2547          0.00603931
4       12343          1169          0.00127162
5       4137           519           0.00025038
6       1700           221           0.00004528
>6      681            148           0.00002026

(Sorry if the formatting became weird after copying from Excel)

Hence, even the arity (2) with the largest number of unique function types (3775) has a lower expected value for 29-bit collisions (0.01326841) than the expected value for 32-bit collisions (0.01383765).

Regards,

Scott Constable
Re: [PATCH 11/14] llvm: kCFI pointer stuff
Posted by Alexei Starovoitov 2 months ago
On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with
> a u8 bitmask of pointer arguments. This should really be under a new
> compiler flag, dependent on both x86_64 and kCFI.
>
> Per the comment, the bitmask represents the register based arguments
> as the first 6 bits and bit 6 is used to cover all stack based
> arguments. The high bit is used for invalid values.
>
> The purpose is to put a store dependency on the set registers, thereby
> blocking speculation paths that would otherwise exploit their value.
>
>
> Note1:
>
> This implementation simply sets the bit for any pointer type. A better
> implementation would only set the bit for any argument that is
> dereferenced in the function body.
>
> This better implementation would also capture things like:
>
>   void foo(unsigned long addr, void *args)
>   {
>     u32 t = *(u32 *)addr;
>     bar(t, args);
>   }
>
> Which, in contrast to the implementation below, would set bit0 while
> leaving bit1 unset -- the exact opposite of this implementation.
>
> Notably, addr *is* dereferenced, even though it is not a pointer on
> entry, while args is a pointer, but is not dereferenced but passed on
> to bar -- if bar uses it, it gets to deal with it.
>
> Note2:
>
> Do we want to make this a u32 to keep room for all registers? AFAICT
> the current use is only concerned with the argument registers and
> those are limited to 6 for the C ABI, but custom (assembly) functions
> could use things outside of that.
>
> ---
> diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
> index 73c745062096..42dcbc40ab4b 100644
> --- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
> +++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
> @@ -143,11 +143,28 @@ void X86AsmPrinter::EmitKCFITypePadding(const MachineFunction &MF,
>    // one. Otherwise, just pad with nops. The X86::MOV32ri instruction emitted
>    // in X86AsmPrinter::emitKCFITypeId is 5 bytes long.
>    if (HasType)
> -    PrefixBytes += 5;
> +    PrefixBytes += 7;
>
>    emitNops(offsetToAlignment(PrefixBytes, MF.getAlignment()));
>  }
>
> +static uint8_t getKCFIPointerArgs(const Function &F)
> +{
> +  uint8_t val = 0;
> +
> +  if (F.isVarArg())
> +    return 0x7f;
> +
> +  for (int i = 0; i < F.arg_size() ; i++) {
> +    Argument *A = F.getArg(i);
> +    Type *T = A->getType();
> +    if (T->getTypeID() == Type::PointerTyID)
> +      val |= 1 << std::min(i, 6);
> +  }
> +
> +  return val;
> +}
> +
>  /// emitKCFITypeId - Emit the KCFI type information in architecture specific
>  /// format.
>  void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
> @@ -183,6 +200,26 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
>                                .addReg(X86::EAX)
>                                .addImm(MaskKCFIType(Type->getZExtValue())));
>
> +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> +  // register that contains a pointer. Specifically for x86_64, which has 6
> +  // argument registers:
> +  //
> +  //   bit0 - rdi
> +  //   bit1 - rsi
> +  //   bit2 - rdx
> +  //   bit3 - rcx
> +  //   bit4 - r8
> +  //   bit5 - r9
> +  //
> +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> +  // error states.
> +  //
> +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> +                 .addReg(X86::AL)
> +                 .addImm(getKCFIPointerArgs(F)));

If I'm reading this correctly it will be an 8-bit move which
doesn't clear upper bits.
If consumer is in assembly it's ok-ish,
but it's an argument to __bhi_args_foo functions,
so should be properly zero extended per call convention.
Re: [PATCH 11/14] llvm: kCFI pointer stuff
Posted by Peter Zijlstra 1 month, 4 weeks ago
On Sun, Sep 29, 2024 at 10:53:05AM -0700, Alexei Starovoitov wrote:

> > +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> > +  // register that contains a pointer. Specifically for x86_64, which has 6
> > +  // argument registers:
> > +  //
> > +  //   bit0 - rdi
> > +  //   bit1 - rsi
> > +  //   bit2 - rdx
> > +  //   bit3 - rcx
> > +  //   bit4 - r8
> > +  //   bit5 - r9
> > +  //
> > +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> > +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> > +  // error states.
> > +  //
> > +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> > +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> > +                 .addReg(X86::AL)
> > +                 .addImm(getKCFIPointerArgs(F)));
> 
> If I'm reading this correctly it will be an 8-bit move which
> doesn't clear upper bits.
> If consumer is in assembly it's ok-ish,
> but it's an argument to __bhi_args_foo functions,
> so should be properly zero extended per call convention.

These kCFI 'instructions' are never executed. Their sole purpose is to
encode the immediates. They are instructions because they live in .text
and having them this way makes disassembly work nicely. As such, we've
taken to using the 1 byte move instruction to carry them with the least
amount of bytes.

The consumer is the kernel instruction decoder, we take the immediate
and use that.
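A hedged sketch of that consumer side (names and layout checks are mine, not kernel code): with the patch applied, the __cfi_foo prefix is `movl $hash, %eax` (opcode 0xB8 + imm32, 5 bytes) followed by `movb $mask, %al` (opcode 0xB0 + imm8, 2 bytes), so the decoder only has to pull the two immediates back out:

```c
#include <stdint.h>

struct kcfi_meta {
	uint32_t hash;    /* kCFI signature hash */
	uint8_t ptr_mask; /* pointer-argument bitmask */
};

/* Decode the two immediates from a __cfi_foo prefix. The real kernel
 * uses its generic instruction decoder; this hand-rolled opcode check
 * is an illustration only. */
static int kcfi_decode_prefix(const uint8_t *p, struct kcfi_meta *m)
{
	if (p[0] != 0xb8 || p[5] != 0xb0)
		return -1;	/* not the expected movl/movb pair */

	m->hash = (uint32_t)p[1] | (uint32_t)p[2] << 8 |
		  (uint32_t)p[3] << 16 | (uint32_t)p[4] << 24;
	m->ptr_mask = p[6];
	return 0;
}
```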
Re: [PATCH 11/14] llvm: kCFI pointer stuff
Posted by Alexei Starovoitov 1 month, 4 weeks ago
On Mon, Sep 30, 2024 at 1:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Sep 29, 2024 at 10:53:05AM -0700, Alexei Starovoitov wrote:
>
> > > +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> > > +  // register that contains a pointer. Specifically for x86_64, which has 6
> > > +  // argument registers:
> > > +  //
> > > +  //   bit0 - rdi
> > > +  //   bit1 - rsi
> > > +  //   bit2 - rdx
> > > +  //   bit3 - rcx
> > > +  //   bit4 - r8
> > > +  //   bit5 - r9
> > > +  //
> > > +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> > > +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> > > +  // error states.
> > > +  //
> > > +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> > > +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> > > +                 .addReg(X86::AL)
> > > +                 .addImm(getKCFIPointerArgs(F)));
> >
> > If I'm reading this correctly it will be an 8-bit move which
> > doesn't clear upper bits.
> > If consumer is in assembly it's ok-ish,
> > but it's an argument to __bhi_args_foo functions,
> > so should be properly zero extended per call convention.
>
> These kCFI 'instructions' are never executed. Their sole purpose is to
> encode the immediates. They are instructions because they live in .text
> and having them this way makes disassembly work nicely. As such, we've
> taken to using the 1 byte move instruction to carry them with the least
> amount of bytes.
>
> The consumer is the kernel instruction decoder, we take the immediate
> and use that.

I see... and after decoding imm bits in mov %al insn the kernel will
insert a call to corresponding __bhi_args_* stub that will use
cmovne on corresponding register(s) to sanitize the value?
That was difficult to grasp.
A design doc would have helped.

I wonder whether this whole complexity is worth it vs
always calling __bhi_args_all()
Re: [PATCH 11/14] llvm: kCFI pointer stuff
Posted by Peter Zijlstra 1 month, 4 weeks ago
On Mon, Sep 30, 2024 at 09:59:11AM -0700, Alexei Starovoitov wrote:
> On Mon, Sep 30, 2024 at 1:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sun, Sep 29, 2024 at 10:53:05AM -0700, Alexei Starovoitov wrote:
> >
> > > > +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> > > > +  // register that contains a pointer. Specifically for x86_64, which has 6
> > > > +  // argument registers:
> > > > +  //
> > > > +  //   bit0 - rdi
> > > > +  //   bit1 - rsi
> > > > +  //   bit2 - rdx
> > > > +  //   bit3 - rcx
> > > > +  //   bit4 - r8
> > > > +  //   bit5 - r9
> > > > +  //
> > > > +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> > > > +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> > > > +  // error states.
> > > > +  //
> > > > +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> > > > +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> > > > +                 .addReg(X86::AL)
> > > > +                 .addImm(getKCFIPointerArgs(F)));
> > >
> > > If I'm reading this correctly it will be an 8-bit move which
> > > doesn't clear upper bits.
> > > If consumer is in assembly it's ok-ish,
> > > but it's an argument to __bhi_args_foo functions,
> > > so should be properly zero extended per call convention.
> >
> > These kCFI 'instructions' are never executed. Their sole purpose is to
> > encode the immediates. They are instructions because they live in .text
> > and having them this way makes disassembly work nicely. As such, we've
> > taken to using the 1 byte move instruction to carry them with the least
> > amount of bytes.
> >
> > The consumer is the kernel instruction decoder, we take the immediate
> > and use that.
> 
> I see... and after decoding imm bits in mov %al insn the kernel will
> insert a call to corresponding __bhi_args_* stub that will use
> cmovne on corresponding register(s) to sanitize the value?
> That was difficult to grasp.
> A design doc would have helped.

Does something like this help?

diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
index 31d19c815f99..b6e7e79e79c6 100644
--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -44,11 +44,28 @@
  *   call *%r11
  *
  *
+ * IBT+:
+ *
+ * foo:
+ *   endbr64 / ud1 0(%eax), %edx
+ *   ... code here ...
+ *   ret
+ *
+ * direct caller:
+ *   call foo+4
+ *
+ * indirect caller:
+ *   lea foo(%rip), %r11
+ *   ...
+ *   call *%r11
+ *
+ *
  * kCFI:
  *
  * __cfi_foo:
  *   movl $0x12345678, %eax	# kCFI signature hash
- *				# 11 nops when CONFIG_CALL_PADDING
+ *   movb $0x12, %al		# kCFI pointer argument mask
+ *				# 9 nops when CONFIG_CALL_PADDING
  * foo:
  *   endbr64			# when IBT
  *   ... code here ...
@@ -91,6 +108,57 @@
  *   nop4
  *   call *%r11
  *
+ *
+ * FineIBT+:
+ *
+ * __cfi_foo:
+ *   endbr64
+ *   subl 0x12345678, %r10d
+ *   jz   foo
+ *   ud2
+ *   nop
+ * foo:
+ *   ud1 0(%eax), %edx		# was endbr64
+ * foo_4:
+ *   ... code here ...
+ *   ret
+ *
+ * direct caller:
+ *   call foo+4
+ *
+ * indirect caller:
+ *   lea foo(%rip), %r11
+ *   ...
+ *   movl $0x12345678, %r10d
+ *   subl $16, %r11
+ *   nop4
+ *   call *%r11
+ *
+ *
+ * FineIBT-BHI:
+ *
+ * __cfi_foo:
+ *   endbr64
+ *   subl 0x12345678, %r10d
+ *   jz   foo-1
+ *   ud2
+ * foo-1:
+ *   call __bhi_args_XXX	# depends on kCFI pointer argument mask
+ * foo+4:
+ *   ... code here ...
+ *   ret
+ *
+ * direct caller:
+ *   call foo+4
+ *
+ * indirect caller:
+ *   lea foo(%rip), %r11
+ *   ...
+ *   movl $0x12345678, %r10d
+ *   subl $16, %r11
+ *   nop4
+ *   call *%r11
+ *
  */
 enum cfi_mode {
 	CFI_AUTO,	/* FineIBT if hardware has IBT, otherwise kCFI */




> I wonder whether this whole complexity is worth it vs
> always calling __bhi_args_all()

That's one for Scott to answer; I think always doing _all will hurt
especially badly because it includes rsp.


Re: [PATCH 11/14] llvm: kCFI pointer stuff
Posted by Alexei Starovoitov 1 month, 3 weeks ago
On Tue, Oct 1, 2024 at 3:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Does something like this help?

Yep. Thanks.

> diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
> index 31d19c815f99..b6e7e79e79c6 100644
> --- a/arch/x86/include/asm/cfi.h
> +++ b/arch/x86/include/asm/cfi.h
> @@ -44,11 +44,28 @@
>   *   call *%r11
>   *
>   *
> + * IBT+:
> + *
> + * foo:
> + *   endbr64 / ud1 0(%eax), %edx
> + *   ... code here ...
> + *   ret
> + *
> + * direct caller:
> + *   call foo+4
> + *
> + * indirect caller:
> + *   lea foo(%rip), %r11
> + *   ...
> + *   call *%r11
> + *
> + *
>   * kCFI:
>   *
>   * __cfi_foo:
>   *   movl $0x12345678, %eax    # kCFI signature hash
> - *                             # 11 nops when CONFIG_CALL_PADDING
> + *   movb $0x12, %al           # kCFI pointer argument mask
> + *                             # 9 nops when CONFIG_CALL_PADDING
>   * foo:
>   *   endbr64                   # when IBT
>   *   ... code here ...
> @@ -91,6 +108,57 @@
>   *   nop4
>   *   call *%r11
>   *
> + *
> + * FineIBT+:
> + *
> + * __cfi_foo:
> + *   endbr64
> + *   subl 0x12345678, %r10d
> + *   jz   foo

should it be 'jz foo_4' ?
Otherwise it will trap after endbr64 sealing.

> + *   ud2
> + *   nop
> + * foo:
> + *   ud1 0(%eax), %edx         # was endbr64
> + * foo_4:
> + *   ... code here ...
> + *   ret
> + *
> + * direct caller:
> + *   call foo+4
> + *
> + * indirect caller:
> + *   lea foo(%rip), %r11
> + *   ...
> + *   movl $0x12345678, %r10d
> + *   subl $16, %r11
> + *   nop4
> + *   call *%r11
> + *
> + *
> + * FineIBT-BHI:
> + *
> + * __cfi_foo:
> + *   endbr64
> + *   subl 0x12345678, %r10d
> + *   jz   foo-1
> + *   ud2
> + * foo-1:
> + *   call __bhi_args_XXX       # depends on kCFI pointer argument mask
> + * foo+4:
> + *   ... code here ...
> + *   ret
> + *
> + * direct caller:
> + *   call foo+4
> + *
> + * indirect caller:
> + *   lea foo(%rip), %r11
> + *   ...
> + *   movl $0x12345678, %r10d
> + *   subl $16, %r11
> + *   nop4
> + *   call *%r11
> + *
>   */
>  enum cfi_mode {
>         CFI_AUTO,       /* FineIBT if hardware has IBT, otherwise kCFI */
>
>
>
>
> > I wonder whether this whole complexity is worth it vs
> > always calling __bhi_args_all()
>
> That's one for Scott to answer; I think always doing _all will hurt
> especially bad because it includes rsp.

But why cmovne %rsp ?
Because some pointers are passed on the stack ?
but %rsp itself isn't involved in speculation.
load/store from stack can still speculate regardless of cmovne %rsp ?
Or is it acting like a barrier for all subsequent access from stack
including things like 'push rbp' in the function prologue?

Overall it feels like a lot of complexity for a 'security checkbox'.
If it hurts perf so much regardless, the extra % here and there will
be ignored by security obsessed people, and folks who care about
performance won't be enabling it anyway.

btw the alternative to hacking compilers is to get information
about pointers in function arguments from BTF.
It's all there. No need to encode it via movb $0x12, %al.

$ bpftool btf dump file vmlinux format c|grep pick_next_task
    struct task_struct * (*pick_next_task)(struct rq *, struct task_struct *);