Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with
a u8 bitmask of pointer arguments. This should really be under a new
compiler flag, dependent on both x86_64 and kCFI.
Per the comment, the bitmask represents the register-based arguments
in the first 6 bits, and bit 6 is used to cover all stack-based
arguments. The high bit is used for invalid values.
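For example, a hypothetical void foo(void *a, long b, void *c) would
get mask 0x05: bit0 (a in rdi) and bit2 (c in rdx) set, bit1 (b in
rsi) clear.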
The purpose is to put a store dependency on the set registers, thereby
blocking speculation paths that would otherwise exploit their value.
Note1:
This implementation simply sets the bit for any pointer type. A better
implementation would set the bit only for arguments that are actually
dereferenced in the function body.
This better implementation would also capture things like:
void foo(unsigned long addr, void *args)
{
	u32 t = *(u32 *)addr;
	bar(t, args);
}
Which would set bit0 while leaving bit1 unset -- the exact opposite of
the implementation below. Notably, addr *is* dereferenced, even though
it is not a pointer on entry, while args is a pointer but is never
dereferenced, only passed on to bar -- if bar uses it, bar gets to
deal with it.
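As a rough illustration, a minimal sketch of such a scan -- assuming a
naive direct-use heuristic with no alias analysis and no inttoptr
tracking (so it would still miss the addr case above); the helper name
is made up:

#include "llvm/IR/Argument.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

// Hypothetical helper: report an argument only when the body directly
// loads from or stores through it. A real implementation would also
// have to chase casts, GEPs, phis, and integer arguments that get
// inttoptr'd before the dereference.
static bool isArgDereferenced(const Argument &A) {
  for (const User *U : A.users()) {
    if (const auto *LI = dyn_cast<LoadInst>(U))
      if (LI->getPointerOperand() == &A)
        return true;
    if (const auto *SI = dyn_cast<StoreInst>(U))
      if (SI->getPointerOperand() == &A)
        return true;
  }
  return false;
}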
Note2:
Do we want to make this a u32 to keep room for all registers? AFAICT
the current use is only concerned with the argument registers and
those are limited to 6 for the C ABI, but custom (assembly) functions
could use things outside of that.
---
diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
index 73c745062096..42dcbc40ab4b 100644
--- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
+++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
@@ -143,11 +143,28 @@ void X86AsmPrinter::EmitKCFITypePadding(const MachineFunction &MF,
  // one. Otherwise, just pad with nops. The X86::MOV32ri instruction emitted
  // in X86AsmPrinter::emitKCFITypeId is 5 bytes long.
  if (HasType)
-    PrefixBytes += 5;
+    PrefixBytes += 7;

  emitNops(offsetToAlignment(PrefixBytes, MF.getAlignment()));
}

+static uint8_t getKCFIPointerArgs(const Function &F)
+{
+  uint8_t val = 0;
+
+  if (F.isVarArg())
+    return 0x7f;
+
+  for (int i = 0; i < (int)F.arg_size(); i++) {
+    Argument *A = F.getArg(i);
+    Type *T = A->getType();
+    if (T->getTypeID() == Type::PointerTyID)
+      val |= 1 << std::min(i, 6);
+  }
+
+  return val;
+}
+
/// emitKCFITypeId - Emit the KCFI type information in architecture specific
/// format.
void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
@@ -183,6 +200,26 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
                             .addReg(X86::EAX)
                             .addImm(MaskKCFIType(Type->getZExtValue())));

+  // Extend the kCFI meta-data with a byte that has a bit set for each argument
+  // register that contains a pointer. Specifically for x86_64, which has 6
+  // argument registers:
+  //
+  //   bit0 - rdi
+  //   bit1 - rsi
+  //   bit2 - rdx
+  //   bit3 - rcx
+  //   bit4 - r8
+  //   bit5 - r9
+  //
+  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
+  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
+  // error states.
+  //
+  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
+  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
+                              .addReg(X86::AL)
+                              .addImm(getKCFIPointerArgs(F)));
+
if (MAI->hasDotTypeDotSizeDirective()) {
MCSymbol *EndSym = OutContext.createTempSymbol("cfi_func_end");
OutStreamer->emitLabel(EndSym);
diff --git a/llvm/lib/Target/X86/X86MCInstLower.cpp b/llvm/lib/Target/X86/X86MCInstLower.cpp
index cbb012161524..c0776ef78153 100644
--- a/llvm/lib/Target/X86/X86MCInstLower.cpp
+++ b/llvm/lib/Target/X86/X86MCInstLower.cpp
@@ -897,7 +897,7 @@ void X86AsmPrinter::LowerKCFI_CHECK(const MachineInstr &MI) {
                                   .addReg(AddrReg)
                                   .addImm(1)
                                   .addReg(X86::NoRegister)
-                                  .addImm(-(PrefixNops + 4))
+                                  .addImm(-(PrefixNops + 6))
                                   .addReg(X86::NoRegister));

  MCSymbol *Pass = OutContext.createTempSymbol();
> Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with
> a u8 bitmask of pointer arguments. This should really be under a new
> compiler flag, dependent on both x86_64 and kCFI.
>
> Per the comment, the bitmask represents the register-based arguments
> in the first 6 bits, and bit 6 is used to cover all stack-based
> arguments. The high bit is used for invalid values.
>
> The purpose is to put a store dependency on the set registers, thereby
> blocking speculation paths that would otherwise exploit their value.

Given the ongoing discussion on [PATCH 13/14], where there is a
growing consensus that all arguments (not just pointers) should be
poisoned after a misprediction, a different encoding scheme would be
needed. I believe there are 8 possibilities, which correspond to the
function's arity:

0: Function takes 0 args
1: Function takes 1 arg
2: Function takes 2 args
3: Function takes 3 args
4: Function takes 4 args
5: Function takes 5 args
6: Function takes 6 args
7: Function takes >6 args

These possibilities can be encoded with 3 bits. I suspect that it
might actually be beneficial to steal 3 bits from the u32 kCFI hash
(either by using a smaller 29-bit hash or by truncating the 32-bit
hash down to 29 bits). This scheme would arguably strengthen both kCFI
and FineIBT by partitioning CFI edges such that a j-arity function
cannot call a k-arity target unless j=k (or unless j>6 and k>6); the
current 32-bit kCFI hash does not prevent, for example, a 2-arity fptr
from calling a 3-arity target if the kCFI hashes collide. The
disadvantage of the 29-bit hash is that it would increase the
probability of collisions within each arity, but on the other hand the
total number of functions of each given arity is much smaller than the
total number of functions of all arities.

Regards,

Scott Constable
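To make the proposed packing concrete, a purely illustrative helper --
the name and the exact bit placement are assumptions, not from any
posted patch:

#include <algorithm>
#include <cstdint>
#include "llvm/IR/Function.h"

// Hypothetical: truncate the kCFI hash to 29 bits and put the arity
// class (saturated at 7, which also covers varargs and >6 args) in
// the top 3 bits. Two functions then share a type id only when both
// the 29-bit hash and the arity class match.
static uint32_t packKCFITypeId(uint32_t Hash, const llvm::Function &F) {
  uint32_t Arity =
      F.isVarArg() ? 7u : std::min<uint32_t>(F.arg_size(), 7u);
  return (Hash >> 3) | (Arity << 29);
}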
> These possibilities can be encoded with 3 bits. I suspect that it
> might actually be beneficial to steal 3 bits from the u32 kCFI hash
> (either by using a smaller 29-bit hash or by truncating the 32-bit
> hash down to 29 bits).
[...]

I have done some additional analysis on my Noble kernel, which
suggests that the proposed 29-bit hash with 3-bit arity will, if
anything, be more secure than the existing 32-bit hash. Consider:

My kernel has 141,617 total indirect call targets, with 10,903 unique
function types. With a 32-bit kCFI hash, the expected number of
collisions is 2^-32 * (10903 choose 2) = 0.01383765.

Then I scanned the kernel to identify the number of unique function
types for each arity, and computed the corresponding expected number
of collisions within each arity, assuming a 29-bit hash:

# Args	total targets	unique types	expected collisions
0	12682		32		0.00000092
1	42981		2492		0.00578125
2	37657		3775		0.01326841
3	29436		2547		0.00603931
4	12343		1169		0.00127162
5	4137		519		0.00025038
6	1700		221		0.00004528
more	681		148		0.00002026

Hence, even the arity (2) with the largest number of unique function
types (3775) has a lower expected number of 29-bit collisions
(0.01326841) than the expected number of 32-bit collisions over all
types (0.01383765).

Regards,

Scott Constable
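The numbers above follow the usual birthday approximation
E ~= n*(n-1)/2 / 2^bits. A small sketch that reproduces the per-arity
column, purely as a check:

#include <cstdio>

int main() {
  // Unique function types per arity, from the table above.
  const struct { const char *arity; double n; } rows[] = {
      {"0", 32},   {"1", 2492}, {"2", 3775}, {"3", 2547},
      {"4", 1169}, {"5", 519},  {"6", 221},  {"more", 148},
  };
  const double buckets = 536870912.0; /* 2^29 */
  for (const auto &r : rows)
    std::printf("%-4s %.8f\n", r.arity, r.n * (r.n - 1) / 2 / buckets);
  return 0;
}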
On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with
> a u8 bitmask of pointer arguments. This should really be under a new
> compiler flag, dependent on both x86_64 and kCFI.
>
> Per the comment, the bitmask represents the register-based arguments
> in the first 6 bits, and bit 6 is used to cover all stack-based
> arguments. The high bit is used for invalid values.
>
> The purpose is to put a store dependency on the set registers, thereby
> blocking speculation paths that would otherwise exploit their value.
>
[...]
>
> +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> +                              .addReg(X86::AL)
> +                              .addImm(getKCFIPointerArgs(F)));

If I'm reading this correctly it will be an 8-bit move which doesn't
clear the upper bits. If the consumer is in assembly it's ok-ish, but
it's an argument to the __bhi_args_foo functions, so it should be
properly zero extended per the calling convention.
On Sun, Sep 29, 2024 at 10:53:05AM -0700, Alexei Starovoitov wrote:

> > +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> > +                              .addReg(X86::AL)
> > +                              .addImm(getKCFIPointerArgs(F)));
>
> If I'm reading this correctly it will be an 8-bit move which doesn't
> clear the upper bits. If the consumer is in assembly it's ok-ish, but
> it's an argument to the __bhi_args_foo functions, so it should be
> properly zero extended per the calling convention.

These kCFI 'instructions' are never executed. Their sole purpose is to
encode the immediates. They are instructions because they live in
.text, and having them this way makes disassembly work nicely. As
such, we've taken to using the 1-byte move instruction to carry them
with the least amount of bytes.

The consumer is the kernel instruction decoder; we take the immediate
and use that.
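To make that concrete, a hedged sketch of the consumer side, assuming
the exact prefix layout emitted above (0xb8 = movl $imm32, %eax and
0xb0 = movb $imm8, %al); the helper name is made up, and real code
would go through the kernel's instruction decoder rather than raw byte
matching:

/*
 * Illustrative only: pull the argument bitmask back out of the
 * never-executed __cfi_foo prefix. Layout per the patch:
 *
 *   b8 <4-byte hash>   movl $hash, %eax
 *   b0 <1-byte mask>   movb $mask, %al
 */
static unsigned char cfi_decode_arg_mask(const unsigned char *prefix)
{
	if (prefix[0] != 0xb8)		/* movl $imm32, %eax */
		return 0x80;		/* bit7: invalid */
	if (prefix[5] != 0xb0)		/* movb $imm8, %al */
		return 0x80;
	return prefix[6];
}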
On Mon, Sep 30, 2024 at 1:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> These kCFI 'instructions' are never executed. Their sole purpose is to
> encode the immediates. They are instructions because they live in
> .text, and having them this way makes disassembly work nicely. As
> such, we've taken to using the 1-byte move instruction to carry them
> with the least amount of bytes.
>
> The consumer is the kernel instruction decoder; we take the immediate
> and use that.

I see... and after decoding the imm bits in the mov %al insn, the
kernel will insert a call to the corresponding __bhi_args_* stub that
will use cmovne on the corresponding register(s) to sanitize the
value?

That was difficult to grasp. A design doc would have helped.

I wonder whether this whole complexity is worth it vs always calling
__bhi_args_all().
On Mon, Sep 30, 2024 at 09:59:11AM -0700, Alexei Starovoitov wrote:

> I see... and after decoding the imm bits in the mov %al insn, the
> kernel will insert a call to the corresponding __bhi_args_* stub that
> will use cmovne on the corresponding register(s) to sanitize the
> value?
>
> That was difficult to grasp. A design doc would have helped.

Does something like this help?

diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
index 31d19c815f99..b6e7e79e79c6 100644
--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -44,11 +44,28 @@
 *	call *%r11
 *
 *
+ * IBT+:
+ *
+ * foo:
+ *	endbr64 / ud1 0(%eax), %edx
+ *	... code here ...
+ *	ret
+ *
+ * direct caller:
+ *	call foo+4
+ *
+ * indirect caller:
+ *	lea foo(%rip), %r11
+ *	...
+ *	call *%r11
+ *
+ *
 * kCFI:
 *
 * __cfi_foo:
 *	movl $0x12345678, %eax		# kCFI signature hash
- *					# 11 nops when CONFIG_CALL_PADDING
+ *	movb $0x12, %al			# kCFI pointer argument mask
+ *					# 9 nops when CONFIG_CALL_PADDING
 * foo:
 *	endbr64				# when IBT
 *	... code here ...
@@ -91,6 +108,57 @@
 *	nop4
 *	call *%r11
 *
+ *
+ * FineIBT+:
+ *
+ * __cfi_foo:
+ *	endbr64
+ *	subl 0x12345678, %r10d
+ *	jz foo
+ *	ud2
+ *	nop
+ * foo:
+ *	ud1 0(%eax), %edx		# was endbr64
+ * foo_4:
+ *	... code here ...
+ *	ret
+ *
+ * direct caller:
+ *	call foo+4
+ *
+ * indirect caller:
+ *	lea foo(%rip), %r11
+ *	...
+ *	movl $0x12345678, %r10d
+ *	subl $16, %r11
+ *	nop4
+ *	call *%r11
+ *
+ *
+ * FineIBT-BHI:
+ *
+ * __cfi_foo:
+ *	endbr64
+ *	subl 0x12345678, %r10d
+ *	jz foo-1
+ *	ud2
+ * foo-1:
+ *	call __bhi_args_XXX		# depends on kCFI pointer argument mask
+ * foo+4:
+ *	... code here ...
+ *	ret
+ *
+ * direct caller:
+ *	call foo+4
+ *
+ * indirect caller:
+ *	lea foo(%rip), %r11
+ *	...
+ *	movl $0x12345678, %r10d
+ *	subl $16, %r11
+ *	nop4
+ *	call *%r11
+ *
 */
enum cfi_mode {
	CFI_AUTO,	/* FineIBT if hardware has IBT, otherwise kCFI */

> I wonder whether this whole complexity is worth it vs always calling
> __bhi_args_all().

That's one for Scott to answer; I think always doing _all will hurt
especially bad because it includes rsp.
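To sketch the stub selection implied by "call __bhi_args_XXX" -- all
names here are assumed for illustration, not the actual kernel
symbols; the real FineIBT rewriting lives elsewhere in the kernel:

typedef void (*bhi_stub_t)(void);

/* Hypothetical: one scrubbing stub per 7-bit argument mask. */
extern bhi_stub_t __bhi_args_table[128];
extern void __bhi_args_all(void);

static bhi_stub_t cfi_select_bhi_stub(unsigned char mask)
{
	if (mask & 0x80)		/* bit7: invalid/unknown encoding */
		return __bhi_args_all;	/* scrub everything */
	return __bhi_args_table[mask & 0x7f];
}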
On Tue, Oct 1, 2024 at 3:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Does something like this help?

Yep. Thanks.

> + * FineIBT+:
> + *
> + * __cfi_foo:
> + *	endbr64
> + *	subl 0x12345678, %r10d
> + *	jz foo

Should it be 'jz foo_4'? Otherwise it will trap after endbr64 sealing.

> + *	ud2
> + *	nop
> + * foo:
> + *	ud1 0(%eax), %edx		# was endbr64
> + * foo_4:
> + *	... code here ...
> + *	ret
[...]
> > I wonder whether this whole complexity is worth it vs always calling
> > __bhi_args_all().
>
> That's one for Scott to answer; I think always doing _all will hurt
> especially bad because it includes rsp.

But why cmovne %rsp? Because some pointers are passed on the stack?
But %rsp itself isn't involved in speculation. Loads and stores from
the stack can still speculate regardless of a cmovne %rsp? Or is it
acting like a barrier for all subsequent stack accesses, including
things like 'push rbp' in the function prologue?

Overall it feels like a lot of complexity for a 'security checkbox'.
If it hurts perf so much regardless, the extra % here and there will
be ignored by the security obsessed people, and the folks who care
about performance won't be enabling it anyway.

btw, the alternative to hacking compilers is to get the information
about pointers in function arguments from BTF. It's all there. No need
to encode it via movb $0x12, %al.

$ bpftool btf dump file vmlinux format c | grep pick_next_task
	struct task_struct * (*pick_next_task)(struct rq *, struct task_struct *);
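For what it's worth, a hedged sketch of deriving the same mask from
vmlinux BTF with libbpf -- the function name and the mask layout
mirror the patch above; typedef/const chasing and vararg detection are
elided:

#include <bpf/btf.h>

/* Illustrative only: compute the pointer-argument mask for a given
 * FUNC_PROTO type id from BTF instead of .text metadata. */
static unsigned char btf_ptr_arg_mask(const struct btf *btf, __u32 proto_id)
{
	const struct btf_type *t = btf__type_by_id(btf, proto_id);
	const struct btf_param *p;
	unsigned char mask = 0;
	int i;

	if (!t || !btf_is_func_proto(t))
		return 0x80;			/* bit7: invalid */
	p = btf_params(t);
	for (i = 0; i < btf_vlen(t); i++) {
		const struct btf_type *a = btf__type_by_id(btf, p[i].type);
		if (a && btf_is_ptr(a))
			mask |= 1 << (i < 6 ? i : 6);
	}
	return mask;
}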