[RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ard Biesheuvel 4 days, 12 hours ago
From: Ard Biesheuvel <ardb@kernel.org>

Ryan reports that get_random_u16() is dominant in the performance
profiling of syscall entry when kstack randomization is enabled [0].

This is the reason many architectures rely on a counter instead, and
that, in turn, is the reason for the convoluted way the (pseudo-)entropy
is gathered and recorded in a per-CPU variable.

Let's try to make the get_random_uXX() fast path faster, and switch to
get_random_u8() so that we'll hit the slow path 2x less often. Then,
wire it up in the syscall entry path, replacing the per-CPU variable,
making the logic at syscall exit redundant.

[0] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
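
(For illustration, a minimal sketch of the lockless fast-path idea described
above; the struct layout, field names and batch size are assumptions, not the
actual patches:)

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/types.h>

/*
 * Illustrative per-CPU batch: 'state' packs a generation count in the high
 * 32 bits and the consume position in the low 32 bits, so a single
 * cmpxchg64_local() both claims a byte and notices a concurrent refill from
 * interrupt context on the same CPU.
 */
struct example_batch_u8 {
        u8  entropy[192];       /* refilled from the base CRNG; size assumed */
        u64 state;              /* [ generation (hi 32) | position (lo 32) ]  */
};

/*
 * Returns true and hands out one byte on the fast path; false means the
 * batch is exhausted and the caller must take the existing locked slow path
 * to refill it (and bump the generation).
 */
static bool example_get_random_u8_fast(struct example_batch_u8 *b, u8 *out)
{
        u64 old = READ_ONCE(b->state);

        for (;;) {
                u32 pos = (u32)old;
                u64 prev;

                if (pos >= sizeof(b->entropy))
                        return false;

                /* cmpxchg64_local(ptr, old, new): store iff *ptr == old */
                prev = cmpxchg64_local(&b->state, old, old + 1);
                if (prev == old) {
                        *out = b->entropy[pos];
                        return true;
                }
                old = prev;     /* raced with an IRQ on this CPU, retry */
        }
}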

Cc: Kees Cook <kees@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>

Ard Biesheuvel (6):
  hexagon: Wire up cmpxchg64_local() to generic implementation
  arc: Wire up cmpxchg64_local() to generic implementation
  random: Use u32 to keep track of batched entropy generation
  random: Use a lockless fast path for get_random_uXX()
  random: Plug race in preceding patch
  randomize_kstack: Use get_random_u8() at entry for entropy

 arch/Kconfig                       |  9 ++--
 arch/arc/include/asm/cmpxchg.h     |  3 ++
 arch/hexagon/include/asm/cmpxchg.h |  4 ++
 drivers/char/random.c              | 49 ++++++++++++++------
 include/linux/randomize_kstack.h   | 36 ++------------
 init/main.c                        |  1 -
 6 files changed, 49 insertions(+), 53 deletions(-)


base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
-- 
2.52.0.107.ga0afd4fd5b-goog
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ryan Roberts 4 days, 9 hours ago
On 27/11/2025 09:22, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> Ryan reports that get_random_u16() is dominant in the performance
> profiling of syscall entry when kstack randomization is enabled [0].
> 
> This is the reason many architectures rely on a counter instead, and
> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> is gathered and recorded in a per-CPU variable.
> 
> Let's try to make the get_random_uXX() fast path faster, and switch to
> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> wire it up in the syscall entry path, replacing the per-CPU variable,
> making the logic at syscall exit redundant.

I ran the same set of syscall benchmarks for this series as I've done for my 
series. 

The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
performance cost of turning it on without any changes to the implementation,
then the reduced performance cost of turning it on with my changes applied, and 
finally cost of turning it on with Ard's changes applied:

arm64 (AWS Graviton3):
+-----------------+--------------+-------------+---------------+-----------------+
| Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
|                 |              | rndstack-on |               |                 |
+=================+==============+=============+===============+=================+
| syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
|                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
|                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
+-----------------+--------------+-------------+---------------+-----------------+
| syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
|                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
|                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
+-----------------+--------------+-------------+---------------+-----------------+
| syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
|                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
|                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
+-----------------+--------------+-------------+---------------+-----------------+

So this fixes the tail problem. I guess get_random_u8() only takes the slow path 
every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure 
that fully explains it though.

But it's still a 10% cost on average.

Personally I think 10% syscall cost is too much to pay for 6 bits of stack 
randomisation. 3% is better, but still higher than we would all prefer, I'm sure.

Thanks,
Ryan

> 
> [0] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
> 
> Cc: Kees Cook <kees@kernel.org>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jeremy Linton <jeremy.linton@arm.com>
> Cc: Catalin Marinas <Catalin.Marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>
> 
> Ard Biesheuvel (6):
>   hexagon: Wire up cmpxchg64_local() to generic implementation
>   arc: Wire up cmpxchg64_local() to generic implementation
>   random: Use u32 to keep track of batched entropy generation
>   random: Use a lockless fast path for get_random_uXX()
>   random: Plug race in preceding patch
>   randomize_kstack: Use get_random_u8() at entry for entropy
> 
>  arch/Kconfig                       |  9 ++--
>  arch/arc/include/asm/cmpxchg.h     |  3 ++
>  arch/hexagon/include/asm/cmpxchg.h |  4 ++
>  drivers/char/random.c              | 49 ++++++++++++++------
>  include/linux/randomize_kstack.h   | 36 ++------------
>  init/main.c                        |  1 -
>  6 files changed, 49 insertions(+), 53 deletions(-)
> 
> 
> base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ard Biesheuvel 4 days, 9 hours ago
On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Ryan reports that get_random_u16() is dominant in the performance
> > profiling of syscall entry when kstack randomization is enabled [0].
> >
> > This is the reason many architectures rely on a counter instead, and
> > that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> > is gathered and recorded in a per-CPU variable.
> >
> > Let's try to make the get_random_uXX() fast path faster, and switch to
> > get_random_u8() so that we'll hit the slow path 2x less often. Then,
> > wire it up in the syscall entry path, replacing the per-CPU variable,
> > making the logic at syscall exit redundant.
>
> I ran the same set of syscall benchmarks for this series as I've done for my
> series.
>

Thanks!


> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> performance cost of turning it on without any changes to the implementation,
> then the reduced performance cost of turning it on with my changes applied, and
> finally cost of turning it on with Ard's changes applied:
>
> arm64 (AWS Graviton3):
> +-----------------+--------------+-------------+---------------+-----------------+
> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> |                 |              | rndstack-on |               |                 |
> +=================+==============+=============+===============+=================+
> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> +-----------------+--------------+-------------+---------------+-----------------+
> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> +-----------------+--------------+-------------+---------------+-----------------+
> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> +-----------------+--------------+-------------+---------------+-----------------+
>

What does the (R) mean?

> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> that fully explains it though.
>
> But it's still a 10% cost on average.
>
> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>

Interesting!

So the only thing that get_random_u8() does that could explain the
delta is calling into the scheduler on preempt_enable(), given that it
does very little beyond that.

Would you mind repeating this experiment after changing the
put_cpu_var() to preempt_enable_no_resched(), to test this theory?
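
(Spelled out, the suggested experiment would look roughly like the following,
reusing the illustrative names from the sketch under the cover letter; the
per-CPU variable and the slow-path helper are assumptions, not the actual
code:)

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(struct example_batch_u8, example_batched_entropy_u8);

/* hypothetical locked slow path that refills the batch and returns a byte */
u8 example_refill_and_get_u8(struct example_batch_u8 *b);

u8 example_get_random_u8(void)
{
        struct example_batch_u8 *batch;
        u8 ret;

        batch = &get_cpu_var(example_batched_entropy_u8);       /* preempt off */
        if (!example_get_random_u8_fast(batch, &ret))
                ret = example_refill_and_get_u8(batch);

        /*
         * was: put_cpu_var(example_batched_entropy_u8), i.e. preempt_enable();
         * the experiment: skip the reschedule check on the way out.
         */
        preempt_enable_no_resched();
        return ret;
}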
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ryan Roberts 4 days, 7 hours ago
On 27/11/2025 12:28, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> Ryan reports that get_random_u16() is dominant in the performance
>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>
>>> This is the reason many architectures rely on a counter instead, and
>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>> is gathered and recorded in a per-CPU variable.
>>>
>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>> making the logic at syscall exit redundant.
>>
>> I ran the same set of syscall benchmarks for this series as I've done for my
>> series.
>>
> 
> Thanks!
> 
> 
>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>> performance cost of turning it on without any changes to the implementation,
>> then the reduced performance cost of turning it on with my changes applied, and
>> finally cost of turning it on with Ard's changes applied:
>>
>> arm64 (AWS Graviton3):
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>> |                 |              | rndstack-on |               |                 |
>> +=================+==============+=============+===============+=================+
>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>>
> 
> What does the (R) mean?
> 
>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>> that fully explains it though.
>>
>> But it's still a 10% cost on average.
>>
>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>
> 
> Interesting!
> 
> So the only thing that get_random_u8() does that could explain the
> delta is calling into the scheduler on preempt_enable(), given that it
> does very little beyond that.
> 
> Would you mind repeating this experiment after changing the
> put_cpu_var() to preempt_enable_no_resched(), to test this theory?

This has no impact on performance.
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ard Biesheuvel 4 days, 6 hours ago
On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2025 12:28, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> >>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>
> >>> Ryan reports that get_random_u16() is dominant in the performance
> >>> profiling of syscall entry when kstack randomization is enabled [0].
> >>>
> >>> This is the reason many architectures rely on a counter instead, and
> >>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> >>> is gathered and recorded in a per-CPU variable.
> >>>
> >>> Let's try to make the get_random_uXX() fast path faster, and switch to
> >>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> >>> wire it up in the syscall entry path, replacing the per-CPU variable,
> >>> making the logic at syscall exit redundant.
> >>
> >> I ran the same set of syscall benchmarks for this series as I've done for my
> >> series.
> >>
> >
> > Thanks!
> >
> >
> >> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> >> performance cost of turning it on without any changes to the implementation,
> >> then the reduced performance cost of turning it on with my changes applied, and
> >> finally cost of turning it on with Ard's changes applied:
> >>
> >> arm64 (AWS Graviton3):
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> >> |                 |              | rndstack-on |               |                 |
> >> +=================+==============+=============+===============+=================+
> >> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> >> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> >> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> >> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> >> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> >> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> >> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >>
> >
> > What does the (R) mean?
> >
> >> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> >> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> >> that fully explains it though.
> >>
> >> But it's still a 10% cost on average.
> >>
> >> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> >> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> >>
> >
> > Interesting!
> >
> > So the only thing that get_random_u8() does that could explain the
> > delta is calling into the scheduler on preempt_enable(), given that it
> > does very little beyond that.
> >
> > Would you mind repeating this experiment after changing the
> > put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>
> This has no impact on performance.
>

Thanks. But this is really rather surprising: what else could be
taking up that time, given that on the fast path, there are only some
loads and stores to the buffer, and a cmpxchg64_local(). Could it be
the latter that is causing so much latency? I suppose the local
cmpxchg() semantics don't really exist on arm64, and this uses the
exact same LSE instruction that would be used for an ordinary
cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.

In any case, there is no debate that your code is faster on arm64. I
also think that using prandom for this purpose is perfectly fine, even
without reseeding: with a 2^113 period and only 6 observable bits per
32 bit sample, predicting the next value reliably is maybe not
impossible, but hardly worth the extensive effort, given that we're
not generating cryptographic keys here.

So the question is really whether we want to dedicate 16 bytes per
task for this. I wouldn't mind personally, but it is something our
internal QA engineers tend to obsess over.
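
(For context, a minimal sketch of what the per-task prandom approach amounts
to; the struct, field name and masking below are assumptions, not Ryan's
actual series. struct rnd_state is the 16 bytes in question:)

#include <linux/prandom.h>
#include <linux/types.h>

/*
 * Hypothetical 16-byte per-task PRNG state (struct rnd_state is 4 x u32),
 * seeded once per task; in the real series it would live in or alongside
 * task_struct.
 */
struct example_task_rnd {
        struct rnd_state rnd;
};

/* Draw the handful of stack-offset bits at syscall entry. */
static inline u32 example_kstack_offset(struct example_task_rnd *s)
{
        return prandom_u32_state(&s->rnd) & 0x3f;       /* ~6 usable bits */
}

Seeding at fork (e.g. via prandom_seed_state()) is left out of the sketch.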
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ryan Roberts 4 days, 5 hours ago
On 27/11/2025 15:03, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>
>>>>> Ryan reports that get_random_u16() is dominant in the performance
>>>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>>>
>>>>> This is the reason many architectures rely on a counter instead, and
>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>>>> is gathered and recorded in a per-CPU variable.
>>>>>
>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>>>> making the logic at syscall exit redundant.
>>>>
>>>> I ran the same set of syscall benchmarks for this series as I've done for my
>>>> series.
>>>>
>>>
>>> Thanks!
>>>
>>>
>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>>>> performance cost of turning it on without any changes to the implementation,
>>>> then the reduced performance cost of turning it on with my changes applied, and
>>>> finally cost of turning it on with Ard's changes applied:
>>>>
>>>> arm64 (AWS Graviton3):
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>>>> |                 |              | rndstack-on |               |                 |
>>>> +=================+==============+=============+===============+=================+
>>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>
>>>
>>> What does the (R) mean?
>>>
>>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>>>> that fully explains it though.
>>>>
>>>> But it's still a 10% cost on average.
>>>>
>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>>>
>>>
>>> Interesting!
>>>
>>> So the only thing that get_random_u8() does that could explain the
>>> delta is calling into the scheduler on preempt_enable(), given that it
>>> does very little beyond that.
>>>
>>> Would you mind repeating this experiment after changing the
>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>>
>> This has no impact on performance.
>>
> 
> Thanks. But this is really rather surprising: what else could be
> taking up that time, given that on the fast path, there are only some
> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> the latter that is causing so much latency? I suppose the local
> cmpxchg() semantics don't really exist on arm64, and this uses the
> exact same LSE instruction that would be used for an ordinary
> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
> 
> In any case, there is no debate that your code is faster on arm64. 

The results I have for x86 show it's faster than the rdtsc too, although that's
also somewhat surprising. I'll run your series on x86 to get the equivalent data.

> I
> also think that using prandom for this purpose is perfectly fine, even
> without reseeding: with a 2^113 period and only 6 observable bits per
> 32 bit sample, predicting the next value reliably is maybe not
> impossible, but hardly worth the extensive effort, given that we're
> not generating cryptographic keys here.
> 
> So the question is really whether we want to dedicate 16 bytes per
> task for this. I wouldn't mind personally, but it is something our
> internal QA engineers tend to obsess over.

Yeah that's a good point. Is this something we could potentially keep at the
start of the kstack? Is there any precedent for keeping state there at the
moment? For arm64, I know there is a general feeling that 16K for the stack is more
than enough (but we are stuck with it because 8K isn't quite enough). So it
would be "for free". I guess it would be tricky to do this in an arch-agnostic
way though...

Thanks,
Ryan
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ard Biesheuvel 3 days, 11 hours ago
On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2025 12:28, Ard Biesheuvel wrote:
> >>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> >>>>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>>>
> >>>>> Ryan reports that get_random_u16() is dominant in the performance
> >>>>> profiling of syscall entry when kstack randomization is enabled [0].
> >>>>>
> >>>>> This is the reason many architectures rely on a counter instead, and
> >>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> >>>>> is gathered and recorded in a per-CPU variable.
> >>>>>
> >>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
> >>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> >>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
> >>>>> making the logic at syscall exit redundant.
> >>>>
> >>>> I ran the same set of syscall benchmarks for this series as I've done for my
> >>>> series.
> >>>>
> >>>
> >>> Thanks!
> >>>
> >>>
> >>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> >>>> performance cost of turning it on without any changes to the implementation,
> >>>> then the reduced performance cost of turning it on with my changes applied, and
> >>>> finally cost of turning it on with Ard's changes applied:
> >>>>
> >>>> arm64 (AWS Graviton3):
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> >>>> |                 |              | rndstack-on |               |                 |
> >>>> +=================+==============+=============+===============+=================+
> >>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> >>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> >>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> >>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> >>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> >>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> >>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>
> >>>
> >>> What does the (R) mean?
> >>>
> >>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> >>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> >>>> that fully explains it though.
> >>>>
> >>>> But it's still a 10% cost on average.
> >>>>
> >>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> >>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> >>>>
> >>>
> >>> Interesting!
> >>>
> >>> So the only thing that get_random_u8() does that could explain the
> >>> delta is calling into the scheduler on preempt_enable(), given that it
> >>> does very little beyond that.
> >>>
> >>> Would you mind repeating this experiment after changing the
> >>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
> >>
> >> This has no impact on performance.
> >>
> >
> > Thanks. But this is really rather surprising: what else could be
> > taking up that time, given that on the fast path, there are only some
> > loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> > the latter that is causing so much latency? I suppose the local
> > cmpxchg() semantics don't really exist on arm64, and this uses the
> > exact same LSE instruction that would be used for an ordinary
> > cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
> >
> > In any case, there is no debate that your code is faster on arm64.
>
> The results I have for x86 show it's faster than the rdtsc too, although that's
> also somewhat surprising. I'll run your series on x86 to get the equivalent data.
>

OK, brown paper bag time ...

I swapped the order of the 'old' and 'new' cmpxchg64_local()
arguments, resulting in some very odd behavior. I think this explains
why the tail latency was eliminated entirely, which is bizarre.

The speedup is also more modest now (~2x), which may still be
worthwhile, but likely insufficient for the kstack randomization case.

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2
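
(For anyone following along, the argument-order semantics in question, as a
self-contained illustration rather than the actual code:)

#include <linux/atomic.h>
#include <linux/types.h>

/*
 * cmpxchg64_local(ptr, old, new) stores 'new' only when *ptr == 'old' and
 * always returns the previous value of *ptr.  With 'old' and 'new' swapped,
 * the store (almost) never happens, so the batch position never advances:
 * the slow path is never taken and the same bytes get handed out over and
 * over, which is one plausible reading of the symptoms above.
 */
static bool example_claim_slot(u64 *state, u64 old)
{
        return cmpxchg64_local(state, old, old + 1) == old;     /* correct */
     /* return cmpxchg64_local(state, old + 1, old) == old;        the bug */
}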
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ryan Roberts 3 days, 11 hours ago
On 28/11/2025 10:07, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
>>> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
>>>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>>>
>>>>>>> Ryan reports that get_random_u16() is dominant in the performance
>>>>>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>>>>>
>>>>>>> This is the reason many architectures rely on a counter instead, and
>>>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>>>>>> is gathered and recorded in a per-CPU variable.
>>>>>>>
>>>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>>>>>> making the logic at syscall exit redundant.
>>>>>>
>>>>>> I ran the same set of syscall benchmarks for this series as I've done for my
>>>>>> series.
>>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>>>>>> performance cost of turning it on without any changes to the implementation,
>>>>>> then the reduced performance cost of turning it on with my changes applied, and
>>>>>> finally cost of turning it on with Ard's changes applied:
>>>>>>
>>>>>> arm64 (AWS Graviton3):
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>>>>>> |                 |              | rndstack-on |               |                 |
>>>>>> +=================+==============+=============+===============+=================+
>>>>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>>>>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>>>>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>>>>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>>>>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>>>>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>>>>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>>
>>>>>
>>>>> What does the (R) mean?
>>>>>
>>>>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>>>>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>>>>>> that fully explains it though.
>>>>>>
>>>>>> But it's still a 10% cost on average.
>>>>>>
>>>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>>>>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>>>>>
>>>>>
>>>>> Interesting!
>>>>>
>>>>> So the only thing that get_random_u8() does that could explain the
>>>>> delta is calling into the scheduler on preempt_enable(), given that it
>>>>> does very little beyond that.
>>>>>
>>>>> Would you mind repeating this experiment after changing the
>>>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>>>>
>>>> This has no impact on performance.
>>>>
>>>
>>> Thanks. But this is really rather surprising: what else could be
>>> taking up that time, given that on the fast path, there are only some
>>> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
>>> the latter that is causing so much latency? I suppose the local
>>> cmpxchg() semantics don't really exist on arm64, and this uses the
>>> exact same LSE instruction that would be used for an ordinary
>>> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
>>>
>>> In any case, there is no debate that your code is faster on arm64.
>>
>> The results I have for x86 show it's faster than the rdtsc too, although that's
>> also somewhat surprising. I'll run your series on x86 to get the equivalent data.
>>
> 
> OK, brown paper bag time ...
> 
> I swapped the order of the 'old' and 'new' cmpxchg64_local()
> arguments, resulting in some very odd behavior. I think this explains
> why the tail latency was eliminated entirely, which is bizarre.
> 
> The speedup is also more modest now (~2x), which may still be
> worthwhile, but likely insufficient for the kstack randomization case.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2

Ahh, so we were never taking the slow path. That would definitely explain it.

I had a go at running this on x86 but couldn't even get the kernel to boot on my
AWS Sapphire Rapids instance. Unfortunately I don't have access to the serial
console so can't tell why it failed. But I used the exact same procedure and
baseline for other runs so the only difference is your change.

I wonder if this issue somehow breaks the boot on that platform?

Anyway, I'll chuck this update at the benchmarks, but probably won't be until
next week...
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ard Biesheuvel 3 days, 10 hours ago
On Fri, 28 Nov 2025 at 11:32, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2025 10:07, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> >>> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
> >>>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> >>>>>>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>>>>>
> >>>>>>> Ryan reports that get_random_u16() is dominant in the performance
> >>>>>>> profiling of syscall entry when kstack randomization is enabled [0].
> >>>>>>>
> >>>>>>> This is the reason many architectures rely on a counter instead, and
> >>>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> >>>>>>> is gathered and recorded in a per-CPU variable.
> >>>>>>>
> >>>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
> >>>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> >>>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
> >>>>>>> making the logic at syscall exit redundant.
> >>>>>>
> >>>>>> I ran the same set of syscall benchmarks for this series as I've done for my
> >>>>>> series.
> >>>>>>
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>>
> >>>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> >>>>>> performance cost of turning it on without any changes to the implementation,
> >>>>>> then the reduced performance cost of turning it on with my changes applied, and
> >>>>>> finally cost of turning it on with Ard's changes applied:
> >>>>>>
> >>>>>> arm64 (AWS Graviton3):
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> >>>>>> |                 |              | rndstack-on |               |                 |
> >>>>>> +=================+==============+=============+===============+=================+
> >>>>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> >>>>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> >>>>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> >>>>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> >>>>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> >>>>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> >>>>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>>
> >>>>>
> >>>>> What does the (R) mean?
> >>>>>
> >>>>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> >>>>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> >>>>>> that fully explains it though.
> >>>>>>
> >>>>>> But it's still a 10% cost on average.
> >>>>>>
> >>>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> >>>>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> >>>>>>
> >>>>>
> >>>>> Interesting!
> >>>>>
> >>>>> So the only thing that get_random_u8() does that could explain the
> >>>>> delta is calling into the scheduler on preempt_enable(), given that it
> >>>>> does very little beyond that.
> >>>>>
> >>>>> Would you mind repeating this experiment after changing the
> >>>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
> >>>>
> >>>> This has no impact on performance.
> >>>>
> >>>
> >>> Thanks. But this is really rather surprising: what else could be
> >>> taking up that time, given that on the fast path, there are only some
> >>> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> >>> the latter that is causing so much latency? I suppose the local
> >>> cmpxchg() semantics don't really exist on arm64, and this uses the
> >>> exact same LSE instruction that would be used for an ordinary
> >>> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
> >>>
> >>> In any case, there is no debate that your code is faster on arm64.
> >>
> >> The results I have for x86 show it's faster than the rdtsc too, although that's
> >> also somewhat surprising. I'll run your series on x86 to get the equivalent data.
> >>
> >
> > OK, brown paper bag time ...
> >
> > I swapped the order of the 'old' and 'new' cmpxchg64_local()
> > arguments, resulting in some very odd behavior. I think this explains
> > why the tail latency was eliminated entirely, which is bizarre.
> >
> > The speedup is also more modest now (~2x), which may still be
> > worthwhile, but likely insufficient for the kstack randomization case.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2
>
> Ahh, so we were never taking the slow path. That would definitely explain it.
>
> I had a go at running this on x86 but couldn't even get the kernel to boot on my
> AWS Sapphire Rapids instance. Unfortunately I don't have access to the serial
> console so can't tell why it failed. But I used the exact same procedure and
> baseline for other runs so the only difference is your change.
>
> I wonder if this issue somehow breaks the boot on that platform?
>

It does. That is how I noticed it myself :-)

init_cea_offsets() fails to make progress because
get_random_u32_below() returns the same value every time.
Interestingly, it didn't trigger with KASLR disabled, which made the
debugging sessions all the more fun ...
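
(To illustrate why a constant return value wedges that path: a generic sketch
of the pick-until-unique pattern, not the actual init_cea_offsets() code:)

#include <linux/random.h>
#include <linux/types.h>

/*
 * Each CPU retries until it draws an offset that no earlier CPU already
 * uses.  If get_random_u32_below() keeps returning the same value, the
 * second CPU can never find a free offset and early boot spins forever.
 */
static void example_assign_offsets(u32 *offsets, int nr_cpus, u32 max)
{
        int cpu, prev;
        u32 off;

        for (cpu = 0; cpu < nr_cpus; cpu++) {
again:
                off = get_random_u32_below(max);
                for (prev = 0; prev < cpu; prev++)
                        if (offsets[prev] == off)
                                goto again;
                offsets[cpu] = off;
        }
}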

> Anyway, I'll chuck this update at the benchmarks, but probably won't be until
> next week...
>

Sure, thanks.
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Mark Rutland 4 days, 4 hours ago
On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> > So the question is really whether we want to dedicate 16 bytes per
> > task for this. I wouldn't mind personally, but it is something our
> > internal QA engineers tend to obsess over.
> 
> Yeah that's a good point.

I think it's a fair point that some people will obsess over this, but
I think the concern is misplaced.

I know that people were very happy for the kernel FPSIMD context to
disappear from task_struct, but 16 bytes is a fair amount smaller, and
I'm pretty sure we can offset that with a small/moderate amount of work.

AFAICT there are extant holes in task_struct that could easily account
for 16 bytes. I can also see a few ways to rework arm64's thread_info
and thread_struct (which are both embedded within task_struct) to save
some space.

> Is this something we could potentially keep at the start of the
> kstack? Is there any precedent for keeping state there at the moment?
> For arm64, I know there is a general feeling that 16K for the stack
> is more than enough (but we are stuck with it because 8K isn't quite
> enough). So it would be "for free". I guess it would be tricky to do
> this in an arch-agnostic way though...

We went out of our way to stop playing silly games like that when we
moved thread_info into task_struct; please let's not bring that back.

Mark.
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ard Biesheuvel 4 days, 2 hours ago
On Thu, 27 Nov 2025 at 17:58, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
> > On 27/11/2025 15:03, Ard Biesheuvel wrote:
> > > So the question is really whether we want to dedicate 16 bytes per
> > > task for this. I wouldn't mind personally, but it is something our
> > > internal QA engineers tend to obsess over.
> >
> > Yeah that's a good point.
>
> I think it's a fair point that some people will obsess over this, but
> I think the concern is misplaced.
>
> I know that people were very happy for the kernel FPSIMD context to
> disappear from task_struct, but 16 bytes is a fair amount smaller, and
> I'm pretty sure we can offset that with a small/moderate amount of work.
>
> AFAICT there are extant holes in task_struct that could easily account
> for 16 bytes. I can also see a few ways to rework arm64's thread_info
> and thread_struct (which are both embedded within task_struct) to save
> some space.
>

Oh, I completely agree. But it is going to come up one way or the other.

> > Is this something we could potentially keep at the start of the
> > kstack? Is there any precedent for keeping state there at the moment?
> > For arm64, I know there is a general feeling that 16K for the stack
> > is more than enough (but we are stuck with it because 8K isn't quite
> > enough). So it would be "for free". I guess it would be tricky to do
> > this in an arch-agnostic way though...
>
> We went out of our way to stop playing silly games like that when we
> moved thread_info into task_struct; please let's not bring that back.
>

Agreed. (after just having moved the kernel mode FP/SIMD buffer from
task_struct to the kernel mode stack :-))
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ryan Roberts 3 days, 10 hours ago
On 27/11/2025 19:01, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 17:58, Mark Rutland <mark.rutland@arm.com> wrote:
>>
>> On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
>>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
>>>> So the question is really whether we want to dedicate 16 bytes per
>>>> task for this. I wouldn't mind personally, but it is something our
>>>> internal QA engineers tend to obsess over.
>>>
>>> Yeah that's a good point.
>>
>> I think it's a fair point that some people will obsess over this, but
>> I think the concern is misplaced.
>>
>> I know that people were very happy for the kernel FPSIMD context to
>> disappear from task_struct, but 16 bytes is a fair amount smaller, and
>> I'm pretty sure we can offset that with a small/moderate amount of work.
>>
>> AFAICT there are extant holes in task_struct that could easily account
>> for 16 bytes. I can also see a few ways to rework arm64's thread_info
>> and thread_struct (which are both embedded within task_struct) to save
>> some space.
>>
> 
> Oh, I completely agree. But it is going to come up one way or the other.

I'm always terrified of changing the layout of those god structs for fear of
accidentally breaking some cacheline clustering-based micro optimization.
Putting new variables into existing holes is one thing, but rearranging existing
data scares me - perhaps I'm being too cautious. I assumed there wouldn't be an
existing hole big enough for 16 bytes.

> 
>>> Is this something we could potentially keep at the start of the
>>> kstack? Is there any precedent for keeping state there at the moment?
>>> For arm64, I know there is a general feeling that 16K for the stack
>>> is more than enough (but we are stuck with it because 8K isn't quite
>>> enough). So it would be "for free". I guess it would be tricky to do
>>> this in an arch-agnostic way though...
>>
>> We went out of our way to stop playing silly games like that when we
>> moved thread_info into task_struct; please let's not bring that back.
>>

OK fair enough.

> 
> Agreed. (after just having moved the kernel mode FP/SIMD buffer from
> task_struct to the kernel mode stack :-))
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Mark Rutland 3 days, 9 hours ago
On Fri, Nov 28, 2025 at 10:36:13AM +0000, Ryan Roberts wrote:
> On 27/11/2025 19:01, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 17:58, Mark Rutland <mark.rutland@arm.com> wrote:
> >>
> >> On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
> >>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> >>>> So the question is really whether we want to dedicate 16 bytes per
> >>>> task for this. I wouldn't mind personally, but it is something our
> >>>> internal QA engineers tend to obsess over.
> >>>
> >>> Yeah that's a good point.
> >>
> >> I think it's a fair point that some people will obsess over this, but
> >> I think the concern is misplaced.
> >>
> >> I know that people were very happy for the kernel FPSIMD context to
> >> disappear from task_struct, but 16 bytes is a fair amount smaller, and
> >> I'm pretty sure we can offset that with a small/moderate amount of work.
> >>
> >> AFAICT there are extant holes in task_struct that could easily account
> >> for 16 bytes. I can also see a few ways to rework arm64's thread_info
> >> and thread_struct (which are both embedded within task_struct) to save
> >> some space.
> > 
> > Oh, I completely agree. But it is going to come up one way or the other.
> 
> I'm always terrified of changing the layout of those god structs for fear of
> accidentally breaking some cacheline clustering-based micro optimization.
> Putting new variables into existing holes is one thing, but rearranging existing
> data scares me - perhaps I'm being too cautious. I assumed there wouldn't be an
> existing hole big enough for 16 bytes.

FWIW, ignoring holes, the trailing padding is pretty big. In v6.18-rc1
defconfig task_struct appears to have ~40 bytes of padding due to
64-byte alignment. So (in that configuration) adding 16 bytes doesn't
actually increase the size of the structure.

I have a few specific changes in mind which could amortize 16 bytes, so
even if this turns out to be an issue, we can make good.

For example, I'd like to change arm64's FPSIMD/SVE/SME context switch
to remove the opportunistic reuse of context after A->B->A migration.
That would remove the need for 'fpsimd_cpu' and 'kernel_fpsimd_cpu' in
thread struct (which is embedded within task struct, at the end), saving
8 bytes.

If we change the way we encode the 'vl' and 'vl_onexec' array elements,
we can shrink those from unsigned int down to u8, which would save 12
bytes.

Mark.
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ard Biesheuvel 4 days, 5 hours ago
On Thu, 27 Nov 2025 at 16:03, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 27/11/2025 12:28, Ard Biesheuvel wrote:
> > > On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >>
> > >> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> > >>> From: Ard Biesheuvel <ardb@kernel.org>
> > >>>
> > >>> Ryan reports that get_random_u16() is dominant in the performance
> > >>> profiling of syscall entry when kstack randomization is enabled [0].
> > >>>
> > >>> This is the reason many architectures rely on a counter instead, and
> > >>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> > >>> is gathered and recorded in a per-CPU variable.
> > >>>
> > >>> Let's try to make the get_random_uXX() fast path faster, and switch to
> > >>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> > >>> wire it up in the syscall entry path, replacing the per-CPU variable,
> > >>> making the logic at syscall exit redundant.
> > >>
> > >> I ran the same set of syscall benchmarks for this series as I've done for my
> > >> series.
> > >>
> > >
> > > Thanks!
> > >
> > >
> > >> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> > >> performance cost of turning it on without any changes to the implementation,
> > >> then the reduced performance cost of turning it on with my changes applied, and
> > >> finally cost of turning it on with Ard's changes applied:
> > >>
> > >> arm64 (AWS Graviton3):
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> > >> |                 |              | rndstack-on |               |                 |
> > >> +=================+==============+=============+===============+=================+
> > >> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> > >> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> > >> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> > >> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> > >> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> > >> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> > >> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >>
> > >
> > > What does the (R) mean?
> > >
> > >> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> > >> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> > >> that fully explains it though.
> > >>
> > >> But it's still a 10% cost on average.
> > >>
> > >> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> > >> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> > >>
> > >
> > > Interesting!
> > >
> > > So the only thing that get_random_u8() does that could explain the
> > > delta is calling into the scheduler on preempt_enable(), given that it
> > > does very little beyond that.
> > >
> > > Would you mind repeating this experiment after changing the
> > > put_cpu_var() to preempt_enable_no_resched(), to test this theory?
> >
> > This has no impact on performance.
> >
>
> Thanks. But this is really rather surprising: what else could be
> taking up that time, given that on the fast path, there are only some
> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> the latter that is causing so much latency? I suppose the local
> cmpxchg() semantics don't really exist on arm64, and this uses the
> exact same LSE instruction that would be used for an ordinary
> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
>

FWIW, my naive get_random_u8() benchmark slows down by 3x on x86 if I
replace cmpxchg64_local() with cmpxchg64(), so I suspect the above
comparison will look different on x86 too.
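
(For reference, the naive benchmark is roughly of this shape; an illustrative
timing loop with assumed names, not the actual harness:)

#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/random.h>

/*
 * Time N back-to-back get_random_u8() calls; replacing the internal
 * cmpxchg64_local() with a full cmpxchg64() is what produced the ~3x
 * slowdown on x86 mentioned above.
 */
static void example_bench_get_random_u8(unsigned long n)
{
        ktime_t t0 = ktime_get();
        unsigned long i;
        u8 acc = 0;

        for (i = 0; i < n; i++)
                acc ^= get_random_u8();

        pr_info("get_random_u8 x %lu: %lld ns (acc=%u)\n",
                n, ktime_to_ns(ktime_sub(ktime_get(), t0)), acc);
}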
Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
Posted by Ryan Roberts 4 days, 8 hours ago
On 27/11/2025 12:28, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> Ryan reports that get_random_u16() is dominant in the performance
>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>
>>> This is the reason many architectures rely on a counter instead, and
>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>> is gathered and recorded in a per-CPU variable.
>>>
>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>> making the logic at syscall exit redundant.
>>
>> I ran the same set of syscall benchmarks for this series as I've done for my
>> series.
>>
> 
> Thanks!
> 
> 
>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>> performance cost of turning it on without any changes to the implementation,
>> then the reduced performance cost of turning it on with my changes applied, and
>> finally cost of turning it on with Ard's changes applied:
>>
>> arm64 (AWS Graviton3):
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>> |                 |              | rndstack-on |               |                 |
>> +=================+==============+=============+===============+=================+
>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>>
> 
> What does the (R) mean?

(R) is statistically significant regression
(I) is statistically significant improvement

Where "statistically significant" is where the 95% confidence intervals of the
baseline and comparson do not overlap.
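
(In other words, roughly this check, as an illustration rather than the actual
analysis tooling:)

#include <stdbool.h>

/*
 * Two results are flagged (R)/(I) only when their 95% confidence intervals
 * [mean - ci, mean + ci] do not overlap.
 */
static bool example_significant(double mean_a, double ci_a,
                                double mean_b, double ci_b)
{
        return (mean_a + ci_a) < (mean_b - ci_b) ||
               (mean_b + ci_b) < (mean_a - ci_a);
}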

> 
>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>> that fully explains it though.
>>
>> But it's still a 10% cost on average.
>>
>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>
> 
> Interesting!
> 
> So the only thing that get_random_u8() does that could explain the
> delta is calling into the scheduler on preempt_enable(), given that it
> does very little beyond that.
> 
> Would you mind repeating this experiment after changing the
> put_cpu_var() to preempt_enable_no_resched(), to test this theory?

Yep it's queued.