Dynamic mitigations

[RFC PATCH 31/56] x86/alternative: Prepend nops with retpolines

Posted by David Kaplan 3 months, 4 weeks ago

When patching retpolines, nops may be required for padding such as when
turning a 5-byte direct call into a 2-byte indirect call.  Previously,
these were appended at the end so the code becomes "call *reg;nop;nop;nop"
for example.  This was fine because it's always going from a larger
instruction to a smaller one.

But this is a problem if the sequence is transformed from a 2-byte indirect
to the 5-byte direct call version at runtime because when the called
function returns, it will be in the middle of the 5-byte call instruction.

To fix this, prepend the nops instead of appending them.  Consequently, the
return site of the called function is always the same.

For indirect jmps this is potentially slightly less efficient compared to
appending nops, but indirect jmps are so rare this hardly seems worth
optimizing.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
---
 arch/x86/kernel/alternative.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8ee5ff547357..7a1f17078581 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -854,6 +854,21 @@ static bool cpu_wants_indirect_its_thunk_at(unsigned long addr, int reg)
 
 #endif /* CONFIG_MITIGATION_ITS */
 
+static void prepend_nops(u8 *bytes, int curlen, int neededlen)
+{
+	u8 newbytes[16];
+	int pad = neededlen - curlen;
+
+	/* Fill padding bytes with NOP. */
+	memset(newbytes, BYTES_NOP1, pad);
+
+	/* Copy the new instruction in. */
+	memcpy(newbytes + pad, bytes, curlen);
+
+	/* And write the final result back out to bytes. */
+	memcpy(bytes, newbytes, neededlen);
+}
+
 /*
  * Rewrite the compiler generated retpoline thunk calls.
  *
@@ -942,10 +957,16 @@ static int patch_retpoline(void *addr, struct insn *insn, u8 *bytes)
 		return ret;
 	i += ret;
 
-	for (; i < insn->length;)
-		bytes[i++] = BYTES_NOP1;
+	/*
+	 * Prepend the instruction with NOPs.  These are prepended, instead of
+	 * appended so the return site does not change.  This is necessary when
+	 * re-patching retpolines at runtime, such as via
+	 * CONFIG_DYNAMIC_MITIGATIONS, but do it always since the performance is
+	 * the same either way (other than for JMP, but those are very rare).
+	 */
+	prepend_nops(bytes, i, insn->length);
 
-	return i;
+	return insn->length;
 }
 
 /*
-- 
2.34.1

Re: [RFC PATCH 31/56] x86/alternative: Prepend nops with retpolines

Posted by Peter Zijlstra 3 months, 3 weeks ago

On Mon, Oct 13, 2025 at 09:34:19AM -0500, David Kaplan wrote:
> When patching retpolines, nops may be required for padding such as when
> turning a 5-byte direct call into a 2-byte indirect call.  Previously,
> these were appended at the end so the code becomes "call *reg;nop;nop;nop"
> for example.  This was fine because it's always going from a larger
> instruction to a smaller one.
> 
> But this is a problem if the sequence is transformed from a 2-byte indirect
> to the 5-byte direct call version at runtime because when the called
> function returns, it will be in the middle of the 5-byte call instruction.
> 
> To fix this, prepend the nops instead of appending them.  Consequently, the
> return site of the called function is always the same.
> 

So this results in:

NOP3; call *%r11

And you're saying a task can be on the other side of that call and then
return lines up. But what if the task is preempted right after that
NOP3?

Same for all the alternative patching; what ensures no task is currently
having a register state that is in the middle of things?

Re: [RFC PATCH 31/56] x86/alternative: Prepend nops with retpolines

Posted by Peter Zijlstra 3 months, 3 weeks ago

On Thu, Oct 16, 2025 at 01:07:17PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 13, 2025 at 09:34:19AM -0500, David Kaplan wrote:
> > When patching retpolines, nops may be required for padding such as when
> > turning a 5-byte direct call into a 2-byte indirect call.  Previously,
> > these were appended at the end so the code becomes "call *reg;nop;nop;nop"
> > for example.  This was fine because it's always going from a larger
> > instruction to a smaller one.
> > 
> > But this is a problem if the sequence is transformed from a 2-byte indirect
> > to the 5-byte direct call version at runtime because when the called
> > function returns, it will be in the middle of the 5-byte call instruction.
> > 
> > To fix this, prepend the nops instead of appending them.  Consequently, the
> > return site of the called function is always the same.
> > 
> 
> So this results in:
> 
> NOP3; call *%r11
> 
> And you're saying a task can be on the other side of that call and then
> return lines up. But what if the task is preempted right after that
> NOP3?
> 
> Same for all the alternative patching; what ensures no task is currently
> having a register state that is in the middle of things?

Ah, I found it, you freeze everything, which puts it at safe points.

Re: [RFC PATCH 31/56] x86/alternative: Prepend nops with retpolines

Posted by Peter Zijlstra 3 months, 3 weeks ago

On Thu, Oct 16, 2025 at 01:07:17PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 13, 2025 at 09:34:19AM -0500, David Kaplan wrote:
> > When patching retpolines, nops may be required for padding such as when
> > turning a 5-byte direct call into a 2-byte indirect call.  Previously,
> > these were appended at the end so the code becomes "call *reg;nop;nop;nop"
> > for example.  This was fine because it's always going from a larger
> > instruction to a smaller one.
> > 
> > But this is a problem if the sequence is transformed from a 2-byte indirect
> > to the 5-byte direct call version at runtime because when the called
> > function returns, it will be in the middle of the 5-byte call instruction.
> > 
> > To fix this, prepend the nops instead of appending them.  Consequently, the
> > return site of the called function is always the same.
> > 
> 
> So this results in:
> 
> NOP3; call *%r11

Also possible:

  lfence; call *r11

(which is why we needed 6 bytes for reg>8)

> And you're saying a task can be on the other side of that call and then
> return lines up. But what if the task is preempted right after that
> NOP3?
> 
> Same for all the alternative patching; what ensures no task is currently
> having a register state that is in the middle of things?

Re: [RFC PATCH 31/56] x86/alternative: Prepend nops with retpolines

Posted by Peter Zijlstra 3 months, 3 weeks ago

On Mon, Oct 13, 2025 at 09:34:19AM -0500, David Kaplan wrote:
> When patching retpolines, nops may be required for padding such as when
> turning a 5-byte direct call into a 2-byte indirect call.  Previously,
> these were appended at the end so the code becomes "call *reg;nop;nop;nop"
> for example.  This was fine because it's always going from a larger
> instruction to a smaller one.
> 
> But this is a problem if the sequence is transformed from a 2-byte indirect
> to the 5-byte direct call version at runtime because when the called
> function returns, it will be in the middle of the 5-byte call instruction.
> 
> To fix this, prepend the nops instead of appending them.  Consequently, the
> return site of the called function is always the same.
> 
> For indirect jmps this is potentially slightly less efficient compared to
> appending nops, but indirect jmps are so rare this hardly seems worth
> optimizing.

Durr.. 

So Retpoline sites can be 5 or 6 bytes, depending on the register. 

Also, I suppose at this point I would prefer prefix stuffing
over multiple instructions. The prefix stuffing ensures it stays a
single instruction.

Alas, some micro-archs have significant decode penalties for >3
prefixes, and filling 6 bytes will need 4 prefixes :-(

Re: [RFC PATCH 31/56] x86/alternative: Prepend nops with retpolines

Posted by Peter Zijlstra 3 months, 3 weeks ago

On Thu, Oct 16, 2025 at 12:32:33PM +0200, Peter Zijlstra wrote:

> Alas, some micro-archs have significant decode penalties for >3
> prefixes, and filling 6 bytes will need 4 prefixes :-(

N/m this, because if it has 6 bytes, then the indirect call will need 3
bytes and we're back to needing 3 prefixes.