[Kees; I'm hoping this is now good-to-go via your hardening tree? Please shout
if you think there is more work to be done here!]

Hi All,

As I reported at [1], kstack offset randomisation suffers from a couple of bugs
and, on arm64 at least, the performance is poor. This series attempts to fix
both; patch 1 provides back-portable fixes for the functional bugs. Patches 2-3
propose a performance improvement approach.

I've looked at a few different options but ultimately decided that Jeremy's
original prng approach is the fastest. I made the argument that this approach is
secure "enough" in the RFC [2] and the responses indicated agreement.

More details in the commit logs.


Performance
===========

Mean and tail performance of 3 "small" syscalls was measured. Each syscall was
made 10 million times and each invocation individually measured and binned.
These results have low noise so I'm confident that they are trustworthy.

The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
the performance cost of turning it on without any changes to the implementation,
then the reduced performance cost of turning it on with my changes applied.

**NOTE**: The below results were generated using the RFC patches but there is
no meaningful change, so the numbers are still valid.

arm64 (AWS Graviton3):

+-----------------+--------------+-------------+---------------+
| Benchmark       | Result Class | v6.18-rc5   | per-task-prng |
|                 |              | rndstack-on |               |
+=================+==============+=============+===============+
| syscall/getpid  | mean (ns)    | (R) 15.62%  | (R) 3.43%     |
|                 | p99 (ns)     | (R) 155.01% | (R) 3.20%     |
|                 | p99.9 (ns)   | (R) 156.71% | (R) 2.93%     |
+-----------------+--------------+-------------+---------------+
| syscall/getppid | mean (ns)    | (R) 14.09%  | (R) 2.12%     |
|                 | p99 (ns)     | (R) 152.81% | 1.55%         |
|                 | p99.9 (ns)   | (R) 153.67% | 1.77%         |
+-----------------+--------------+-------------+---------------+
| syscall/invalid | mean (ns)    | (R) 13.89%  | (R) 3.32%     |
|                 | p99 (ns)     | (R) 165.82% | (R) 3.51%     |
|                 | p99.9 (ns)   | (R) 168.83% | (R) 3.77%     |
+-----------------+--------------+-------------+---------------+

Because arm64 was previously using get_random_u16(), it was expensive when it
didn't have any buffered bits and had to call into the crng. That's what caused
the enormous tail latency.

x86 (AWS Sapphire Rapids):

+-----------------+--------------+-------------+---------------+
| Benchmark       | Result Class | v6.18-rc5   | per-task-prng |
|                 |              | rndstack-on |               |
+=================+==============+=============+===============+
| syscall/getpid  | mean (ns)    | (R) 13.32%  | (R) 4.60%     |
|                 | p99 (ns)     | (R) 13.38%  | (R) 18.08%    |
|                 | p99.9 (ns)   | 16.26%      | (R) 19.38%    |
+-----------------+--------------+-------------+---------------+
| syscall/getppid | mean (ns)    | (R) 11.96%  | (R) 5.26%     |
|                 | p99 (ns)     | (R) 11.83%  | (R) 8.35%     |
|                 | p99.9 (ns)   | (R) 11.42%  | (R) 22.37%    |
+-----------------+--------------+-------------+---------------+
| syscall/invalid | mean (ns)    | (R) 10.58%  | (R) 2.91%     |
|                 | p99 (ns)     | (R) 10.51%  | (R) 4.36%     |
|                 | p99.9 (ns)   | (R) 10.35%  | (R) 21.97%    |
+-----------------+--------------+-------------+---------------+

I was surprised to see that the baseline cost on x86 is 10-12% since it is just
using rdtsc. But as I say, I believe the results are accurate.
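To make the mechanism concrete, here is a minimal sketch of the unified random
source the series moves towards. This is illustrative only: kstack_rand() is a
hypothetical helper name, while prandom_u32_state(), struct rnd_state and the
kstack_rnd_state seeding in init/main.c are taken from the existing code and
this cover letter; the authoritative implementation is in the patches.

/*
 * Illustrative sketch, not the actual patch. A per-CPU Tausworthe
 * generator replaces the per-arch entropy sources (get_random_u16()
 * on arm64, rdtsc() on x86), so the syscall fast path never has to
 * refill from the crng.
 */
#include <linux/prandom.h>
#include <linux/percpu.h>

/* Seeded once at boot via a late_initcall() in init/main.c. */
static DEFINE_PER_CPU(struct rnd_state, kstack_rnd_state);

static __always_inline u32 kstack_rand(void)
{
	/* One PRNG step: a per-cpu load plus a handful of ALU ops. */
	return prandom_u32_state(raw_cpu_ptr(&kstack_rnd_state));
}

The returned value would then be masked down (KSTACK_OFFSET_MAX() in
randomize_kstack.h) and fed to the alloca that offsets the stack pointer on
syscall entry, as the existing add_random_kstack_offset() machinery does today.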
Changes since v3 (RFC) [4]
==========================
- Patch 1: Fixed typo in commit log (per David L)
- Patch 2: Reinstated prandom_u32_state() as out-of-line function, which
  forwards to inline version (per David L)
- Patch 3: Added supplementary info about benefits of removing
  choose_random_kstack_offset() (per Mark R)

Changes since v2 (RFC) [3]
==========================
- Moved late_initcall() to initialize kstack_rnd_state out of
  randomize_kstack.h and into main.c. (issue noticed by kernel test robot)

Changes since v1 (RFC) [2]
==========================
- Introduced patch 2 to make prandom_u32_state() __always_inline (needed
  since it's called from noinstr code)
- In patch 3, prng is now per-cpu instead of per-task (per Ard)

[1] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
[2] https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20251215163520.1144179-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/20260102131156.3265118-1-ryan.roberts@arm.com/

Thanks,
Ryan

Ryan Roberts (3):
  randomize_kstack: Maintain kstack_offset per task
  prandom: Add __always_inline version of prandom_u32_state()
  randomize_kstack: Unify random source across arches

 arch/Kconfig                         |  5 ++-
 arch/arm64/kernel/syscall.c          | 11 ------
 arch/loongarch/kernel/syscall.c      | 11 ------
 arch/powerpc/kernel/syscall.c        | 12 -------
 arch/riscv/kernel/traps.c            | 12 -------
 arch/s390/include/asm/entry-common.h |  8 -----
 arch/x86/include/asm/entry-common.h  | 12 -------
 include/linux/prandom.h              | 20 +++++++++++
 include/linux/randomize_kstack.h     | 54 +++++++++++-----------
 init/main.c                          |  9 ++++-
 kernel/fork.c                        |  1 +
 lib/random32.c                       |  8 +----
 12 files changed, 52 insertions(+), 111 deletions(-)

-- 
2.43.0
On Mon, Jan 19, 2026 at 01:01:07PM +0000, Ryan Roberts wrote:
> As I reported at [1], kstack offset randomisation suffers from a couple of bugs
> and, on arm64 at least, the performance is poor. This series attempts to fix
> both; patch 1 provides back-portable fixes for the functional bugs. Patches 2-3
> propose a performance improvement approach.
> 
> I've looked at a few different options but ultimately decided that Jeremy's
> original prng approach is the fastest. I made the argument that this approach is
> secure "enough" in the RFC [2] and the responses indicated agreement.
> 
> More details in the commit logs.
> 
> 
> Performance
> ===========
> 
> Mean and tail performance of 3 "small" syscalls was measured. Each syscall was
> made 10 million times and each invocation individually measured and binned.
> These results have low noise so I'm confident that they are trustworthy.
> 
> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> the performance cost of turning it on without any changes to the implementation,
> then the reduced performance cost of turning it on with my changes applied.

This adds 16 instructions to the system call fast path on s390; however,
some quick measurements show that executing this extra code is within the
noise performance-wise.

Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
On 1/19/26 05:01, Ryan Roberts wrote:
> x86 (AWS Sapphire Rapids):
> +-----------------+--------------+-------------+---------------+
> | Benchmark       | Result Class | v6.18-rc5   | per-task-prng |
> |                 |              | rndstack-on |               |
> +=================+==============+=============+===============+
> | syscall/getpid  | mean (ns)    | (R) 13.32%  | (R) 4.60%     |
> |                 | p99 (ns)     | (R) 13.38%  | (R) 18.08%    |
> |                 | p99.9 (ns)   | 16.26%      | (R) 19.38%    |

Like you noted, this is surprising. This would be a good thing to make
sure it goes in very early after -rc1 and gets plenty of wide testing.

But I don't see any problems with the approach, and the move to common
code looks like a big win as well:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
On January 19, 2026 8:00:00 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
>On 1/19/26 05:01, Ryan Roberts wrote:
>> x86 (AWS Sapphire Rapids):
>> +-----------------+--------------+-------------+---------------+
>> | Benchmark       | Result Class | v6.18-rc5   | per-task-prng |
>> |                 |              | rndstack-on |               |
>> +=================+==============+=============+===============+
>> | syscall/getpid  | mean (ns)    | (R) 13.32%  | (R) 4.60%     |
>> |                 | p99 (ns)     | (R) 13.38%  | (R) 18.08%    |
>> |                 | p99.9 (ns)   | 16.26%      | (R) 19.38%    |
>
>Like you noted, this is surprising. This would be a good thing to make
>sure it goes in very early after -rc1 and gets plenty of wide testing.

Right, we are pretty late in the dev cycle (rc6). It would be prudent to
get this into -next after the coming rc1 (1 month from now).

On the other hand, the changes are pretty "binary" in the sense that
mistakes should be VERY visible right away. Would it be better to take
this into -next immediately instead?

>But I don't see any problems with the approach, and the move to common
>code looks like a big win as well:

Agreed; I think it's looking great.

-- 
Kees Cook
On 19/01/2026 16:44, Kees Cook wrote:
> 
> 
> On January 19, 2026 8:00:00 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
>> On 1/19/26 05:01, Ryan Roberts wrote:
>>> x86 (AWS Sapphire Rapids):
>>> +-----------------+--------------+-------------+---------------+
>>> | Benchmark       | Result Class | v6.18-rc5   | per-task-prng |
>>> |                 |              | rndstack-on |               |
>>> +=================+==============+=============+===============+
>>> | syscall/getpid  | mean (ns)    | (R) 13.32%  | (R) 4.60%     |
>>> |                 | p99 (ns)     | (R) 13.38%  | (R) 18.08%    |
>>> |                 | p99.9 (ns)   | 16.26%      | (R) 19.38%    |
>>
>> Like you noted, this is surprising. This would be a good thing to make
>> sure it goes in very early after -rc1 and gets plenty of wide testing.
> 
> Right, we are pretty late in the dev cycle (rc6). It would be prudent to
> get this into -next after the coming rc1 (1 month from now).
> 
> On the other hand, the changes are pretty "binary" in the sense that
> mistakes should be VERY visible right away. Would it be better to take
> this into -next immediately instead?

I don't think this question was really addressed to me, but I'll give my opinion
anyway; I agree it's pretty binary - it will either work or it will explode.
I've tested on arm64 and x86_64 so I have high confidence that it works. If you
get it into -next ASAP it has 3 weeks to soak before the merge window opens
right? (Linus said he would do an -rc8 this cycle). That feels like enough time
to me. But it's your tree ;-)

Thanks,
Ryan

> 
>> But I don't see any problems with the approach, and the move to common
>> code looks like a big win as well:
> 
> Agreed; I think it's looking great.
> 
On 1/20/26 08:32, Ryan Roberts wrote:
> I don't think this question was really addressed to me, but I'll give my opinion
> anyway; I agree it's pretty binary - it will either work or it will explode.
> I've tested on arm64 and x86_64 so I have high confidence that it works. If you
> get it into -next ASAP it has 3 weeks to soak before the merge window opens
> right? (Linus said he would do an -rc8 this cycle). That feels like enough time
> to me. But it's your tree 😉

First of all, thank you for testing it on x86! Having that one data
point where it helped performance is super valuable.

I'm more worried that it's going to regress performance somewhere and
then it's going to be a pain to back out. I'm not super worried about
functional regressions.
On Tue, 20 Jan 2026 08:37:43 -0800 Dave Hansen <dave.hansen@intel.com> wrote:
> On 1/20/26 08:32, Ryan Roberts wrote:
> > I don't think this question was really addressed to me, but I'll give my opinion
> > anyway; I agree it's pretty binary - it will either work or it will explode.
> > I've tested on arm64 and x86_64 so I have high confidence that it works. If you
> > get it into -next ASAP it has 3 weeks to soak before the merge window opens
> > right? (Linus said he would do an -rc8 this cycle). That feels like enough time
> > to me. But it's your tree 😉
> 
> First of all, thank you for testing it on x86! Having that one data
> point where it helped performance is super valuable.
> 
> I'm more worried that it's going to regress performance somewhere and
> then it's going to be a pain to back out. I'm not super worried about
> functional regressions.

Unlikely; on x86 'rdtsc' is ~20 clocks on Intel CPUs and even slower on
AMD (according to Agner). (That is serialised against another rdtsc
rather than against other instructions.)

Whereas the four TAUSWORTHE() steps are independent so they can execute
in parallel. IIRC each is a memory read and 5 ALU instructions - not
much at all. The slow bit will be the cache miss on the per-cpu data.

You lose a clock at the end because gcc will compile a | b | c | d as
(((a | b) | c) | d) not ((a | b) | (c | d)).

I think someone reported the 'new' version being faster on x86; that
might be why.

	David
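As a small, self-contained illustration of David's last point about the final
combine: the names below are stand-ins (struct rnd_state_sketch is not the
kernel's struct rnd_state), and prandom_u32_state() in lib/random32.c actually
XORs the four lanes rather than ORing them, but the dependency-depth argument
is the same.

#include <linux/types.h>

/* Hypothetical stand-in for the four Tausworthe lanes in struct rnd_state. */
struct rnd_state_sketch {
	u32 s1, s2, s3, s4;
};

static u32 combine_chained(const struct rnd_state_sketch *s)
{
	/* Parsed as (((s1 ^ s2) ^ s3) ^ s4): a dependency chain of depth 3. */
	return s->s1 ^ s->s2 ^ s->s3 ^ s->s4;
}

static u32 combine_balanced(const struct rnd_state_sketch *s)
{
	/*
	 * (s1 ^ s2) and (s3 ^ s4) are independent and can issue in the same
	 * cycle, so the tree is only two deep - roughly the one clock David
	 * mentions.
	 */
	return (s->s1 ^ s->s2) ^ (s->s3 ^ s->s4);
}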
On 20/01/2026 16:37, Dave Hansen wrote:
> On 1/20/26 08:32, Ryan Roberts wrote:
>> I don't think this question was really addressed to me, but I'll give my opinion
>> anyway; I agree it's pretty binary - it will either work or it will explode.
>> I've tested on arm64 and x86_64 so I have high confidence that it works. If you
>> get it into -next ASAP it has 3 weeks to soak before the merge window opens
>> right? (Linus said he would do an -rc8 this cycle). That feels like enough time
>> to me. But it's your tree 😉
> 
> First of all, thank you for testing it on x86! Having that one data
> point where it helped performance is super valuable.
> 
> I'm more worried that it's going to regress performance somewhere and
> then it's going to be a pain to back out. I'm not super worried about
> functional regressions.

Fair enough. Let's go slow then.
On 1/19/26 08:44, Kees Cook wrote:
>> Like you noted, this is surprising. This would be a good thing to
>> make sure it goes in very early after -rc1 and gets plenty of wide
>> testing.
> Right, we are pretty late in the dev cycle (rc6). It would be
> prudent to get this into -next after the coming rc1 (1 month from
> now).
> 
> On the other hand, the changes are pretty "binary" in the sense that
> mistakes should be VERY visible right away. Would it be better to
> take this into -next immediately instead?

I think it can go into -next ASAP. It's just a matter of when it goes
to Linus.