Hi All,

This is an RFC for my implementation of an idea from James Morse to avoid
broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
have ever observed the pgtable entry for the TLB entry that is being
invalidated. It turns out that x86 does something similar in principle.

The primary feedback I'm looking for is: is this actually correct and safe?
James and I both believe it to be, but it would be useful to get further
validation.

Beyond that, the next question is: does it actually improve performance?
stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
do a much better job of sustaining the overall number of "tlb shootdowns per
second" after the change:

+------------+--------------------------+--------------------------+--------------------------+
|            |     Baseline (v6.15)     |        tlbi local        |       Improvement        |
+------------+-------------+------------+-------------+------------+-------------+------------+
| nr_threads |   ops/sec   |  ops/sec   |   ops/sec   |  ops/sec   |   ops/sec   |  ops/sec   |
|            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
+------------+-------------+------------+-------------+------------+-------------+------------+
|          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
|          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
|          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
|         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
|         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
|         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
|        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
+------------+-------------+------------+-------------+------------+-------------+------------+

But looking at real-world benchmarks, I haven't yet found anything where it
makes a huge difference: when compiling the kernel, it reduces kernel time by
~2.2%, but overall wall time remains the same. I'd be interested in any
suggestions for workloads where this might prove valuable.

All mm selftests have been run and no regressions are observed. Applies on
v6.17-rc3.

Thanks,
Ryan

Ryan Roberts (2):
  arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro
  arm64: tlbflush: Don't broadcast if mm was only active on local cpu

 arch/arm64/include/asm/mmu.h         |  12 +++
 arch/arm64/include/asm/mmu_context.h |   2 +
 arch/arm64/include/asm/tlbflush.h    | 116 ++++++++++++++++++++++++---
 arch/arm64/mm/context.c              |  30 ++++++-
 4 files changed, 145 insertions(+), 15 deletions(-)

--
2.43.0
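To make the mechanism concrete, here is a minimal sketch of the decision the
series introduces (illustrative only, not the actual patch: the
mm_is_local_only() helper is hypothetical, and the mm_cpumask()-based tracking
shown is an assumption, since the real series adds its own state to the mm
context per the diffstat above). The flush path issues a non-shareable TLBI
when the mm is known never to have been active on a remote CPU, and the usual
broadcast (Inner Shareable) TLBI otherwise:

```c
#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical helper: has this mm only ever been active on the current
 * CPU? The mm_cpumask()-based tracking is an illustrative assumption;
 * preemption/migration races are glossed over here and must be handled
 * by the real implementation.
 */
static inline bool mm_is_local_only(struct mm_struct *mm)
{
	return cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()));
}

/* Sketch of a single-page flush choosing local vs broadcast invalidation. */
static inline void sketch_flush_tlb_page(struct vm_area_struct *vma,
					 unsigned long uaddr)
{
	unsigned long addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));

	dsb(ishst);			/* order the PTE update before the TLBI */
	if (mm_is_local_only(vma->vm_mm))
		__tlbi(vale1, addr);	/* local CPU only, no DVM broadcast */
	else
		__tlbi(vale1is, addr);	/* broadcast to the Inner Shareable domain */
	dsb(ish);			/* wait for the invalidation to complete */
	/* The real kernel code also issues __tlbi_user() for the KPTI user ASID. */
}
```

The safety question raised above is exactly about the first branch: it is only
valid if no remote CPU can still hold (or speculatively fetch) a TLB entry for
this mm, which is what the series sets out to prove.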
Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi All,
>
> This is an RFC for my implementation of an idea from James Morse to avoid
> broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
> have ever observed the pgtable entry for the TLB entry that is being
> invalidated. It turns out that x86 does something similar in principle.
>
[...]
>
> All mm selftests have been run and no regressions are observed. Applies on
> v6.17-rc3.

I have used redis (a single-threaded in-memory database) to test the
patchset on an ARM server. 32 redis-server processes are run on NUMA
node 1 to amplify the overhead of TLBI broadcasts, and 32
memtier-benchmark processes are run on NUMA node 0 accordingly.
Snapshotting is triggered constantly in redis-server: it fork()s, the
child saves the in-memory database to disk and exit()s, so COW in
redis-server triggers a large number of TLBIs. Basically, this tests
the performance of redis-server during snapshotting. The test time is
about 300s. Test results show that the benchmark score improves by
~4.5% with the patchset.

Feel free to add my

Tested-by: Huang Ying <ying.huang@linux.alibaba.com>

in future versions.

---
Best Regards,
Huang, Ying
On 10/09/2025 11:57, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
[...]
>
> I have used redis (a single-threaded in-memory database) to test the
> patchset on an ARM server. 32 redis-server processes are run on NUMA
> node 1 to amplify the overhead of TLBI broadcasts, and 32
> memtier-benchmark processes are run on NUMA node 0 accordingly.
> Snapshotting is triggered constantly in redis-server: it fork()s, the
> child saves the in-memory database to disk and exit()s, so COW in
> redis-server triggers a large number of TLBIs. Basically, this tests
> the performance of redis-server during snapshotting. The test time is
> about 300s. Test results show that the benchmark score improves by
> ~4.5% with the patchset.
>
> Feel free to add my
>
> Tested-by: Huang Ying <ying.huang@linux.alibaba.com>
>
> in future versions.

Thanks for this - very useful!

> ---
> Best Regards,
> Huang, Ying
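For reference, the snapshot workload described above boils down to a
fork()-plus-COW pattern. A minimal standalone sketch of that pattern
(hypothetical sizes and iteration counts, not the actual test harness) is
shown below: fork() write-protects the parent's PTEs, and every subsequent
parent write breaks COW, so both steps generate TLB invalidations.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DB_SIZE		(256UL << 20)	/* 256 MiB stand-in for the in-memory database */
#define SNAPSHOTS	16

int main(void)
{
	char *db = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (db == MAP_FAILED)
		return 1;
	memset(db, 1, DB_SIZE);			/* populate the "database" */

	for (int i = 0; i < SNAPSHOTS; i++) {
		pid_t pid = fork();		/* write-protects the parent's PTEs -> TLBI */
		if (pid == 0) {
			/* Child: "save" the snapshot by walking it, then exit. */
			volatile long sum = 0;
			for (size_t off = 0; off < DB_SIZE; off += 4096)
				sum += db[off];
			_exit(0);
		}
		/* Parent keeps writing: each write breaks COW, changing a PTE -> TLBI. */
		for (size_t off = 0; off < DB_SIZE; off += 4096)
			db[off]++;
		waitpid(pid, NULL, 0);
	}
	munmap(db, DB_SIZE);
	return 0;
}
```

Since redis-server is single-threaded, its mm tends to be active on only one
CPU at a time, which is the kind of case the series can satisfy with local
(non-broadcast) invalidations, subject to the exact tracking the patches use.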
Hi, Ryan,

Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi All,
>
> This is an RFC for my implementation of an idea from James Morse to avoid
> broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
> have ever observed the pgtable entry for the TLB entry that is being
> invalidated. It turns out that x86 does something similar in principle.
>
[...]
>
> All mm selftests have been run and no regressions are observed. Applies on
> v6.17-rc3.

Thanks for working on this. Several previous TLBI broadcast
optimizations have been tried before; I have Cc'ed the original authors
for discussion. Some workloads showed good improvement:

https://lore.kernel.org/lkml/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
https://lore.kernel.org/all/20200203201745.29986-1-aarcange@redhat.com/

See especially the following mail:

https://lore.kernel.org/all/20200314031609.GB2250@redhat.com/

---
Best Regards,
Huang, Ying
On Wed, 3 Sep 2025, Huang, Ying wrote:

> https://lore.kernel.org/all/20200314031609.GB2250@redhat.com/

This patch is part of the Red Hat 8 ARM kernels.
On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
> Beyond that, the next question is: does it actually improve performance?
> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
> do a much better job of sustaining the overall number of "tlb shootdowns per
> second" after the change:
>
[...]
>
> But looking at real-world benchmarks, I haven't yet found anything where it
> makes a huge difference: when compiling the kernel, it reduces kernel time by
> ~2.2%, but overall wall time remains the same. I'd be interested in any
> suggestions for workloads where this might prove valuable.

I suspect it's highly dependent on hardware and how it handles the DVM
messages. There were some old proposals from Fujitsu:

https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/

Christoph Lameter (Ampere) also followed with some refactoring in this
area to allow a boot-configurable way to do TLBI via IS ops or IPI:

https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/

(for some reason, the patches did not make it to the list; I have them
in my inbox if you are interested)

I don't remember any real-world workload, more like hand-crafted
mprotect() loops.

Anyway, I think the approach in your series doesn't have downsides: it's
fairly clean and addresses some low-hanging fruit. For multi-threaded
workloads where a flush_tlb_mm() is cheaper than a series of per-page
TLBIs, I think we can wait for that hardware to be phased out. The TLBI
range operations should significantly reduce the DVM messages between
CPUs.

--
Catalin
On Tue, 2 Sep 2025, Catalin Marinas wrote:

> Christoph Lameter (Ampere) also followed with some refactoring in this
> area to allow a boot-configurable way to do TLBI via IS ops or IPI:
>
> https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
>
> (for some reason, the patches did not make it to the list; I have them
> in my inbox if you are interested)

The patchset is available from kernel.org. Sorry about the mailing list
problems at that time.

https://git.kernel.org/pub/scm/linux/kernel/git/christoph/linux.git/log/?h=tlb
On 02/09/2025 17:47, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
>> Beyond that, the next question is: does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
[...]
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference: when compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
>
> I suspect it's highly dependent on hardware and how it handles the DVM
> messages. There were some old proposals from Fujitsu:
>
> https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
>
> Christoph Lameter (Ampere) also followed with some refactoring in this
> area to allow a boot-configurable way to do TLBI via IS ops or IPI:
>
> https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
>
> (for some reason, the patches did not make it to the list; I have them
> in my inbox if you are interested)
>
> I don't remember any real-world workload, more like hand-crafted
> mprotect() loops.
>
> Anyway, I think the approach in your series doesn't have downsides: it's
> fairly clean and addresses some low-hanging fruit. For multi-threaded
> workloads where a flush_tlb_mm() is cheaper than a series of per-page
> TLBIs, I think we can wait for that hardware to be phased out. The TLBI
> range operations should significantly reduce the DVM messages between
> CPUs.

I'll gather some more numbers and try to make a case for merging it
then. I don't really want to add complexity if there is no clear value.

Thanks for the review.

Thanks,
Ryan