[RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Ryan Roberts 1 month ago
Hi All,

This is an RFC for my implementation of an idea from James Morse to avoid
broadcasting TLBIs to remote CPUs when it can be proven that no remote CPU
could ever have observed the pgtable entry for the TLB entry being
invalidated. It turns out that x86 does something similar in principle.
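
Roughly, the shape of the change is as follows (a simplified illustration,
not the code from the patches; mm_is_local_only() stands in for the real
per-mm tracking the series adds):

/*
 * Simplified sketch: use a local (non-broadcast) TLBI when the mm has
 * only ever been active on this CPU, otherwise fall back to the
 * inner-shareable (broadcast) variant. __tlbi(), __TLBI_VADDR() and
 * ASID() are the existing helpers in arch/arm64/include/asm/tlbflush.h;
 * mm_is_local_only() is hypothetical.
 */
static inline void flush_tlb_page_local_or_bcast(struct mm_struct *mm,
						 unsigned long uaddr)
{
	unsigned long arg = __TLBI_VADDR(uaddr, ASID(mm));

	dsb(ishst);			/* order the pgtable update first */
	if (mm_is_local_only(mm))
		__tlbi(vale1, arg);	/* local CPU only, no DVM traffic */
	else
		__tlbi(vale1is, arg);	/* broadcast to inner-shareable */
	dsb(ish);			/* wait for completion */
}

(The local-only path could in principle relax the barriers to the nsh
variants; the sketch keeps them conservative.)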

The primary feedback I'm looking for is: is this actually correct and safe?
James and I both believe it to be, but further validation would be useful.

Beyond that, the next question is: does it actually improve performance?
stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
do a much better job of sustaining the overall number of "tlb shootdowns per
second" after the change:

+------------+--------------------------+--------------------------+--------------------------+
|            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
+------------+-------------+------------+-------------+------------+-------------+------------+
| nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |     ops/sec |    ops/sec |
|            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
+------------+-------------+------------+-------------+------------+-------------+------------+
|          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
|          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
|          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
|         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
|         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
|         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
|        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
+------------+-------------+------------+-------------+------------+-------------+------------+

But looking at real-world benchmarks, I haven't yet found anything where it
makes a huge difference. When compiling the kernel, it reduces kernel time by
~2.2%, but overall wall time remains the same. I'd be interested in any
suggestions for workloads where this might prove valuable.

All mm selftests have been run and no regressions are observed. Applies on
v6.17-rc3.

Thanks,
Ryan


Ryan Roberts (2):
  arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro
  arm64: tlbflush: Don't broadcast if mm was only active on local cpu

 arch/arm64/include/asm/mmu.h         |  12 +++
 arch/arm64/include/asm/mmu_context.h |   2 +
 arch/arm64/include/asm/tlbflush.h    | 116 ++++++++++++++++++++++++---
 arch/arm64/mm/context.c              |  30 ++++++-
 4 files changed, 145 insertions(+), 15 deletions(-)

--
2.43.0
Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Huang, Ying 3 weeks, 2 days ago
Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi All,
>
> This is an RFC for my implementation of an idea from James Morse to avoid
> broadcasting TLBIs to remote CPUs when it can be proven that no remote CPU
> could ever have observed the pgtable entry for the TLB entry being
> invalidated. It turns out that x86 does something similar in principle.
>
> The primary feedback I'm looking for is: is this actually correct and safe?
> James and I both believe it to be, but further validation would be useful.
>
> Beyond that, the next question is: does it actually improve performance?
> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
> do a much better job of sustaining the overall number of "tlb shootdowns per
> second" after the change:
>
> +------------+--------------------------+--------------------------+--------------------------+
> |            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |     ops/sec |    ops/sec |
> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
> +------------+-------------+------------+-------------+------------+-------------+------------+
>
> But looking at real-world benchmarks, I haven't yet found anything where it
> makes a huge difference. When compiling the kernel, it reduces kernel time by
> ~2.2%, but overall wall time remains the same. I'd be interested in any
> suggestions for workloads where this might prove valuable.
>
> All mm selftests have been run and no regressions are observed. Applies on
> v6.17-rc3.

I have used redis (a single-threaded in-memory database) to test the
patchset on an ARM server.  32 redis-server processes are run on NUMA
node 1 to amplify the overhead of TLBI broadcast, and 32
memtier-benchmark processes are run on NUMA node 0 accordingly.
Snapshotting is triggered constantly in redis-server, which fork()s,
saves the in-memory database to disk, and exit()s, so that COW in the
redis-server triggers a large number of TLBIs.  Basically, this tests
the performance of redis-server during snapshotting.  The test time is
about 300s.  Test results show that the benchmark score improves by
~4.5% with the patchset.
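
For reference, the COW pattern exercised during snapshotting is roughly
the following (a rough userspace sketch of the pattern, not the actual
test harness):

#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define DB_SIZE	(256UL << 20)		/* stand-in for the redis dataset */

int main(void)
{
	char *db = malloc(DB_SIZE);
	pid_t pid;
	size_t i;

	memset(db, 1, DB_SIZE);		/* populate the "database" */

	pid = fork();			/* snapshot: pages become shared COW */
	if (pid == 0)
		_exit(0);		/* child: save to disk (elided), exit */

	/*
	 * Parent keeps serving writes: the first write to each
	 * write-protected page takes a COW fault, and replacing the
	 * read-only pgtable entry requires a TLB invalidation.
	 */
	for (i = 0; i < DB_SIZE; i += 4096)
		db[i]++;

	waitpid(pid, NULL, 0);
	return 0;
}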

Feel free to add my

Tested-by: Huang Ying <ying.huang@linux.alibaba.com>

in the future versions.

---
Best Regards,
Huang, Ying
Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Ryan Roberts 3 weeks, 2 days ago
On 10/09/2025 11:57, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Hi All,
>>
>> This is an RFC for my implementation of an idea from James Morse to avoid
>> broadcasting TLBIs to remote CPUs when it can be proven that no remote CPU
>> could ever have observed the pgtable entry for the TLB entry being
>> invalidated. It turns out that x86 does something similar in principle.
>>
>> The primary feedback I'm looking for is: is this actually correct and safe?
>> James and I both believe it to be, but further validation would be useful.
>>
>> Beyond that, the next question is: does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
>> +------------+--------------------------+--------------------------+--------------------------+
>> |            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |     ops/sec |    ops/sec |
>> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
>> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
>> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
>> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
>> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
>> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
>> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference. When compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
>>
>> All mm selftests have been run and no regressions are observed. Applies on
>> v6.17-rc3.
> 
> I have used redis (a single-threaded in-memory database) to test the
> patchset on an ARM server.  32 redis-server processes are run on NUMA
> node 1 to amplify the overhead of TLBI broadcast, and 32
> memtier-benchmark processes are run on NUMA node 0 accordingly.
> Snapshotting is triggered constantly in redis-server, which fork()s,
> saves the in-memory database to disk, and exit()s, so that COW in the
> redis-server triggers a large number of TLBIs.  Basically, this tests
> the performance of redis-server during snapshotting.  The test time is
> about 300s.  Test results show that the benchmark score improves by
> ~4.5% with the patchset.
> 
> Feel free to add my
> 
> Tested-by: Huang Ying <ying.huang@linux.alibaba.com>
> 
> in the future versions.

Thanks for this - very useful!

> 
> ---
> Best Regards,
> Huang, Ying
Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Huang, Ying 1 month ago
Hi, Ryan,

Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi All,
>
> This is an RFC for my implementation of an idea from James Morse to avoid
> broadcasting TLBIs to remote CPUs when it can be proven that no remote CPU
> could ever have observed the pgtable entry for the TLB entry being
> invalidated. It turns out that x86 does something similar in principle.
>
> The primary feedback I'm looking for is: is this actually correct and safe?
> James and I both believe it to be, but further validation would be useful.
>
> Beyond that, the next question is: does it actually improve performance?
> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
> do a much better job of sustaining the overall number of "tlb shootdowns per
> second" after the change:
>
> +------------+--------------------------+--------------------------+--------------------------+
> |            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |     ops/sec |    ops/sec |
> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
> +------------+-------------+------------+-------------+------------+-------------+------------+
>
> But looking at real-world benchmarks, I haven't yet found anything where it
> makes a huge difference. When compiling the kernel, it reduces kernel time by
> ~2.2%, but overall wall time remains the same. I'd be interested in any
> suggestions for workloads where this might prove valuable.
>
> All mm selftests have been run and no regressions are observed. Applies on
> v6.17-rc3.

Thanks for working on this.

Several previous TLBI broadcast optimizations have been tried before;
I've Cc'ed the original authors for discussion.  Some workloads showed
good improvement:

https://lore.kernel.org/lkml/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
https://lore.kernel.org/all/20200203201745.29986-1-aarcange@redhat.com/

See especially the following mail:

https://lore.kernel.org/all/20200314031609.GB2250@redhat.com/

---
Best Regards,
Huang, Ying
Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Christoph Lameter (Ampere) 2 weeks, 3 days ago
On Wed, 3 Sep 2025, Huang, Ying wrote:

> https://lore.kernel.org/all/20200314031609.GB2250@redhat.com/

This patch is part of the Red Hat 8 ARM kernels.
Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Catalin Marinas 1 month ago
On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
> Beyond that, the next question is: does it actually improve performance?
> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
> do a much better job of sustaining the overall number of "tlb shootdowns per
> second" after the change:
> 
> +------------+--------------------------+--------------------------+--------------------------+
> |            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |     ops/sec |    ops/sec |
> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> 
> But looking at real-world benchmarks, I haven't yet found anything where it
> makes a huge difference. When compiling the kernel, it reduces kernel time by
> ~2.2%, but overall wall time remains the same. I'd be interested in any
> suggestions for workloads where this might prove valuable.

I suspect it's highly dependent on hardware and how it handles the DVM
messages. There were some old proposals from Fujitsu:

https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/

Christoph Lameter (Ampere) also followed with some refactoring in this
area to allow a boot-configurable way to do TLBI via IS ops or IPI:

https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/

(for some reason, the patches did not make it to the list; I have them
in my inbox if you are interested)

I don't remember any real-world workload, more like hand-crafted
mprotect() loops.

Anyway, I think the approach in your series doesn't have downsides; it's
fairly clean and addresses some low-hanging fruit. For multi-threaded
workloads where a flush_tlb_mm() is cheaper than a series of per-page
TLBIs, I think we can wait for that hardware to be phased out. The TLBI
range operations should significantly reduce the DVM messages between
CPUs.
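
For illustration, with hypothetical wrappers rather than the kernel's real
helpers (in the actual code this is driven via __flush_tlb_range_op()):

	/* without FEAT_TLBIRANGE: one DVM broadcast message per page */
	for (addr = start; addr < end; addr += PAGE_SIZE)
		tlbi_page_is(addr, asid);	/* hypothetical wrapper */

	/* with FEAT_TLBIRANGE: a single broadcast covers the range */
	tlbi_range_is(start, end, asid);	/* hypothetical wrapper */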

-- 
Catalin
Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Christoph Lameter (Ampere) 2 weeks, 3 days ago
On Tue, 2 Sep 2025, Catalin Marinas wrote:

> Christoph Lameter (Ampere) also followed with some refactoring in this
> area to allow a boot-configurable way to do TLBI via IS ops or IPI:
>
> https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
>
> (for some reason, the patches did not make it to the list; I have them
> in my inbox if you are interested)

The patchset is available from kernel.org. Sorry about the mailing list
problems at that time.

https://git.kernel.org/pub/scm/linux/kernel/git/christoph/linux.git/log/?h=tlb
Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
Posted by Ryan Roberts 1 month ago
On 02/09/2025 17:47, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
>> Beyond that, the next question is: does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
>> +------------+--------------------------+--------------------------+--------------------------+
>> |            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |     ops/sec |    ops/sec |
>> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
>> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
>> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
>> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
>> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
>> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
>> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference. When compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
> 
> I suspect it's highly dependent on hardware and how it handles the DVM
> messages. There were some old proposals from Fujitsu:
> 
> https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
> 
> Christoph Lameter (Ampere) also followed with some refactoring in this
> area to allow a boot-configurable way to do TLBI via IS ops or IPI:
> 
> https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
> 
> (for some reason, the patches did not make it to the list; I have them
> in my inbox if you are interested)
> 
> I don't remember any real-world workload, more like hand-crafted
> mprotect() loops.
> 
> Anyway, I think the approach in your series doesn't have downsides; it's
> fairly clean and addresses some low-hanging fruit. For multi-threaded
> workloads where a flush_tlb_mm() is cheaper than a series of per-page
> TLBIs, I think we can wait for that hardware to be phased out. The TLBI
> range operations should significantly reduce the DVM messages between
> CPUs.

I'll gather some more numbers and try to make a case for merging it then. I
don't really want to add complexity if there is no clear value.

Thanks for the review.

Thanks,
Ryan