A multi-threaded customer workload with a large memory footprint uses
fork()/exec() to run some external programs every tens of seconds.
When running the workload on an arm64 server machine, quite a few CPU
cycles are observed to be spent in the TLB flushing functions, while
on an x86_64 server machine they are not. This makes the performance
on arm64 much worse than that on x86_64.

While the workload is running, after fork()/exec() write-protects all
pages in the parent process, memory writes in the parent process will
cause write protection faults. The page fault handler will then make
the PTE/PDE writable if the page can be reused, which is almost always
true in this workload. On arm64, to avoid the write protection fault
on other CPUs, the page fault handler flushes the TLB globally with a
TLBI broadcast after changing the PTE/PDE. However, this isn't always
necessary. Firstly, it's safe to leave some stall read-only TLB
entries as long as they are eventually flushed. Secondly, it's quite
possible that the original read-only PTE/PDEs aren't cached in the
remote TLBs at all if the memory footprint is large. In fact, on
x86_64, the page fault handler doesn't flush the remote TLBs in this
situation, which benefits performance a lot.

To improve the performance on arm64, make the write protection fault
handler flush the TLB locally instead of globally via TLBI broadcast
after making the PTE/PDE writable. If there are stall read-only TLB
entries in the remote CPUs, the page fault handler on these CPUs will
regard the page fault as spurious and flush the stall TLB entries.
To test the patchset, usemem.c from vm-scalability
(https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
was extended to support calling fork()/exec() periodically (this
change has been merged). To mimic the
behavior of the customer workload, run usemem with 4 threads, access
100GB memory, and call fork()/exec() every 40 seconds. Test results
show that with the patchset the usemem score improves by ~40.6%, and
the cycles% of the TLB flush functions drops from ~50.5% to ~0.3% in
the perf profile.
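
For reference, the fork()/exec() pattern described above can be mimicked
with a minimal standalone sketch along the following lines. This is not
the actual usemem.c; the thread count, per-thread buffer size and the
exec'ed program ("/bin/true") are placeholders chosen for illustration.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NR_THREADS      4
#define BUF_SIZE        (1UL << 30)     /* scaled down from the 100GB test */
#define FORK_INTERVAL   40              /* seconds between fork()/exec() */

/*
 * Keep dirtying anonymous memory; after each fork() these writes hit
 * write-protected (COW) PTEs in the parent and go through the write
 * protection fault path discussed above.
 */
static void *writer(void *p)
{
        volatile char *buf = p;

        for (;;)
                for (size_t i = 0; i < BUF_SIZE; i += 4096)
                        buf[i]++;
}

int main(void)
{
        pthread_t tid[NR_THREADS];

        for (int i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, writer, calloc(1, BUF_SIZE));

        for (;;) {
                sleep(FORK_INTERVAL);
                pid_t pid = fork();
                if (pid == 0) {
                        execl("/bin/true", "true", (char *)NULL);
                        _exit(1);
                }
                if (pid > 0)
                        waitpid(pid, NULL, 0);
        }
}
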
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
arch/arm64/include/asm/pgtable.h | 14 +++++---
arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++
arch/arm64/mm/contpte.c | 3 +-
arch/arm64/mm/fault.c | 2 +-
4 files changed, 67 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index aa89c2e67ebc..35bae2e4bcfe 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/*
- * Outside of a few very special situations (e.g. hibernation), we always
- * use broadcast TLB invalidation instructions, therefore a spurious page
- * fault on one CPU which has been handled concurrently by another CPU
- * does not need to perform additional invalidation.
+ * We use local TLB invalidation instruction when reusing page in
+ * write protection fault handler to avoid TLBI broadcast in the hot
+ * path. This will cause spurious page faults if stall read-only TLB
+ * entries exist.
*/
-#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
+#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
+ local_flush_tlb_page_nonotify(vma, address)
+
+#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
+ local_flush_tlb_page_nonotify(vma, address)
/*
* ZERO_PAGE is a global shared page that is always zero: used
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 18a5dc0c9a54..651b31fd18bb 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
* cannot be easily determined, the value TLBI_TTL_UNKNOWN will
* perform a non-hinted invalidation.
*
+ * local_flush_tlb_page(vma, addr)
+ * Local variant of flush_tlb_page(). Stale TLB entries may
+ * remain in remote CPUs.
+ *
+ * local_flush_tlb_page_nonotify(vma, addr)
+ * Same as local_flush_tlb_page() except MMU notifier will not be
+ * called.
+ *
+ * local_flush_tlb_contpte_range(vma, start, end)
+ * Invalidate the virtual-address range '[start, end)' mapped with
+ * contpte on local CPU for the user address space corresponding
+ * to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
*
* Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
* on top of these routines, since that is our interface to the mmu_gather
@@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}
+static inline void __local_flush_tlb_page_nonotify_nosync(
+ struct mm_struct *mm, unsigned long uaddr)
+{
+ unsigned long addr;
+
+ dsb(nshst);
+ addr = __TLBI_VADDR(uaddr, ASID(mm));
+ __tlbi(vale1, addr);
+ __tlbi_user(vale1, addr);
+}
+
+static inline void local_flush_tlb_page_nonotify(
+ struct vm_area_struct *vma, unsigned long uaddr)
+{
+ __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
+ dsb(nsh);
+}
+
+static inline void local_flush_tlb_page(struct vm_area_struct *vma,
+ unsigned long uaddr)
+{
+ __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
+ mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
+ (uaddr & PAGE_MASK) + PAGE_SIZE);
+ dsb(nsh);
+}
+
static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
unsigned long uaddr)
{
@@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
dsb(ish);
}
+static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ unsigned long asid, pages;
+
+ start = round_down(start, PAGE_SIZE);
+ end = round_up(end, PAGE_SIZE);
+ pages = (end - start) >> PAGE_SHIFT;
+
+ dsb(nshst);
+ asid = ASID(vma->vm_mm);
+ __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid,
+ 3, true, lpa2_is_enabled());
+ mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
+ dsb(nsh);
+}
+
static inline void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index c0557945939c..0f9bbb7224dc 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
__ptep_set_access_flags(vma, addr, ptep, entry, 0);
if (dirty)
- __flush_tlb_range(vma, start_addr, addr,
- PAGE_SIZE, true, 3);
+ local_flush_tlb_contpte_range(vma, start_addr, addr);
} else {
__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index d816ff44faff..22f54f5afe3f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
/* Invalidate a stale read-only entry */
if (dirty)
- flush_tlb_page(vma, address);
+ local_flush_tlb_page(vma, address);
return 1;
}
--
2.39.5
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index aa89c2e67ebc..35bae2e4bcfe 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> /*
> - * Outside of a few very special situations (e.g. hibernation), we always
> - * use broadcast TLB invalidation instructions, therefore a spurious page
> - * fault on one CPU which has been handled concurrently by another CPU
> - * does not need to perform additional invalidation.
> + * We use local TLB invalidation instruction when reusing page in
> + * write protection fault handler to avoid TLBI broadcast in the hot
> + * path. This will cause spurious page faults if stall read-only TLB
> + * entries exist.
> */
> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
> + local_flush_tlb_page_nonotify(vma, address)
> +
> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
> + local_flush_tlb_page_nonotify(vma, address)
>
> /*
> * ZERO_PAGE is a global shared page that is always zero: used
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 18a5dc0c9a54..651b31fd18bb 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will
> * perform a non-hinted invalidation.
> *
> + * local_flush_tlb_page(vma, addr)
> + * Local variant of flush_tlb_page(). Stale TLB entries may
> + * remain in remote CPUs.
> + *
> + * local_flush_tlb_page_nonotify(vma, addr)
> + * Same as local_flush_tlb_page() except MMU notifier will not be
> + * called.
> + *
> + * local_flush_tlb_contpte_range(vma, start, end)
> + * Invalidate the virtual-address range '[start, end)' mapped with
> + * contpte on local CPU for the user address space corresponding
> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
> *
> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
> * on top of these routines, since that is our interface to the mmu_gather
> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> }
>
> +static inline void __local_flush_tlb_page_nonotify_nosync(
> + struct mm_struct *mm, unsigned long uaddr)
> +{
> + unsigned long addr;
> +
> + dsb(nshst);
We were issuing dsb(ishst) even for the nosync case, likely to ensure
PTE visibility across cores. However, since set_ptes already includes a
dsb(ishst) in __set_pte_complete(), does this mean we’re being overly
cautious in __flush_tlb_page_nosync() in many cases?
static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
unsigned long uaddr)
{
unsigned long addr;
dsb(ishst);
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
(uaddr & PAGE_MASK) +
PAGE_SIZE);
}
On the other hand, __ptep_set_access_flags() doesn’t seem to use
set_ptes(), so there’s no guarantee the updated PTEs are visible to all
cores. If a remote CPU later encounters a page fault and performs a TLB
invalidation, will it still see a stable PTE?
> + addr = __TLBI_VADDR(uaddr, ASID(mm));
> + __tlbi(vale1, addr);
> + __tlbi_user(vale1, addr);
> +}
> +
> +static inline void local_flush_tlb_page_nonotify(
> + struct vm_area_struct *vma, unsigned long uaddr)
> +{
> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
> + dsb(nsh);
> +}
> +
> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
> + unsigned long uaddr)
> +{
> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
> + (uaddr & PAGE_MASK) + PAGE_SIZE);
> + dsb(nsh);
> +}
> +
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> unsigned long uaddr)
> {
> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
> dsb(ish);
> }
>
We already have functions like
__flush_tlb_page_nosync() and __flush_tlb_range_nosync().
Is there a way to factor out or extract their common parts?
Is it because of the differences in barriers that this extraction of
common code isn’t feasible?
Thanks
Barry
Hi Barry, Huang,
On 22/10/2025 05:08, Barry Song wrote:
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index aa89c2e67ebc..35bae2e4bcfe 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> /*
>> - * Outside of a few very special situations (e.g. hibernation), we always
>> - * use broadcast TLB invalidation instructions, therefore a spurious page
>> - * fault on one CPU which has been handled concurrently by another CPU
>> - * does not need to perform additional invalidation.
>> + * We use local TLB invalidation instruction when reusing page in
>> + * write protection fault handler to avoid TLBI broadcast in the hot
>> + * path. This will cause spurious page faults if stall read-only TLB
>> + * entries exist.
>> */
>> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
>> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
>> + local_flush_tlb_page_nonotify(vma, address)
>> +
>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
>> + local_flush_tlb_page_nonotify(vma, address)
>>
>> /*
>> * ZERO_PAGE is a global shared page that is always zero: used
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 18a5dc0c9a54..651b31fd18bb 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
>> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will
>> * perform a non-hinted invalidation.
>> *
>> + * local_flush_tlb_page(vma, addr)
>> + * Local variant of flush_tlb_page(). Stale TLB entries may
>> + * remain in remote CPUs.
>> + *
>> + * local_flush_tlb_page_nonotify(vma, addr)
>> + * Same as local_flush_tlb_page() except MMU notifier will not be
>> + * called.
>> + *
>> + * local_flush_tlb_contpte_range(vma, start, end)
>> + * Invalidate the virtual-address range '[start, end)' mapped with
>> + * contpte on local CPU for the user address space corresponding
>> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
>> *
>> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
>> * on top of these routines, since that is our interface to the mmu_gather
>> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>> }
>>
>> +static inline void __local_flush_tlb_page_nonotify_nosync(
>> + struct mm_struct *mm, unsigned long uaddr)
>> +{
>> + unsigned long addr;
>> +
>> + dsb(nshst);
I've skimmed through this thread so apologies if I've missed some of the
detail, but thought it might be useful to give my opinion as a summary...
> We were issuing dsb(ishst) even for the nosync case, likely to ensure
> PTE visibility across cores.
The leading dsb prior to issuing the tlbi is to ensure that the HW table
walker(s) will always see the new pte immediately after the tlbi completes.
Without it, you could end up with the old value immediately re-cached after the
tlbi completes. So if you are broadcasting the tlbi, the dsb needs to be to ish.
If you're doing local invalidation, then nsh is sufficient.
"nosync" is just saying that we will not wait for the tlbi to complete. You
still need to issue the leading dsb to ensure that the table walkers see the
latest pte once the tlbi (eventually) completes.
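
To make the ordering concrete, here is the local sequence from the patch
restated with comments; the wrapper name below is made up for illustration,
while the macros (dsb(), __tlbi(), __tlbi_user(), __TLBI_VADDR(), ASID())
are the ones the patch already uses.

/* Illustrative only: the write-protect reuse path, local variant. */
static inline void wp_reuse_fixup_local(struct mm_struct *mm,
                                        unsigned long uaddr)
{
        unsigned long addr;

        /* The caller has already written the new (writable) PTE under the PTL. */

        /*
         * Make the PTE store observable to this CPU's HW table walker
         * before the TLBI below completes, so the old entry cannot be
         * immediately re-cached (nshst is enough for a local TLBI).
         */
        dsb(nshst);

        /* Local, non-broadcast invalidation of the stale read-only entry. */
        addr = __TLBI_VADDR(uaddr, ASID(mm));
        __tlbi(vale1, addr);
        __tlbi_user(vale1, addr);

        /* Wait for the local TLBI to finish before returning to the fault. */
        dsb(nsh);
}
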
> However, since set_ptes already includes a
> dsb(ishst) in __set_pte_complete(), does this mean we’re being overly
> cautious in __flush_tlb_page_nosync() in many cases?
We only issue a dsb in __set_pte_complete() for kernel ptes. We elide for user
ptes because we can safely take a fault (for the case where we transition
invalid->valid) for user mappings and that race will resolve itself with the
help of the PTL. For valid->valid or valid->invalid, there will be an associated
tlb flush, which has the barrier.
>
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> unsigned long uaddr)
> {
> unsigned long addr;
>
> dsb(ishst);
> addr = __TLBI_VADDR(uaddr, ASID(mm));
> __tlbi(vale1is, addr);
> __tlbi_user(vale1is, addr);
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
> (uaddr & PAGE_MASK) +
> PAGE_SIZE);
> }
>
> On the other hand, __ptep_set_access_flags() doesn’t seem to use
> set_ptes(), so there’s no guarantee the updated PTEs are visible to all
> cores. If a remote CPU later encounters a page fault and performs a TLB
> invalidation, will it still see a stable PTE?
Yes, because the reads and writes are done under the PTL; that synchronizes the
memory for us.
You were discussing the potential value of upgrading the leading dsb from nshst
to ishst during the discussion. IMHO that's neither required nor desirable - the
memory synchronization is handled by the PTL. Overall, this optimization relies
on the premise that synchronizing with remote CPUs is expensive and races are
rare, so we should keep everything local for as long as possible and not worry
about micro-optimizing the efficiency of the race case.
>
>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>> + __tlbi(vale1, addr);
>> + __tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> + struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> + unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> + (uaddr & PAGE_MASK) + PAGE_SIZE);
>> + dsb(nsh);
>> +}
>> +
>> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> unsigned long uaddr)
>> {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>> dsb(ish);
>> }
>>
>
> We already have functions like
> __flush_tlb_page_nosync() and __flush_tlb_range_nosync().
> Is there a way to factor out or extract their common parts?
>
> Is it because of the differences in barriers that this extraction of
> common code isn’t feasible?
I've proposed re-working these functions to add independent flags for
sync/nosync, local/broadcast and notify/nonotify. I think that will clean it up
quite a bit. But I was going to wait for this to land first. And also, Will has
an RFC for some other tlbflush API cleanup (converting it to C functions) so
might want to wait for or incorporate that too.
Thanks,
Ryan
>
> Thanks
> Barry
Hi, Barry,
Barry Song <21cnbao@gmail.com> writes:
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index aa89c2e67ebc..35bae2e4bcfe 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> /*
>> - * Outside of a few very special situations (e.g. hibernation), we always
>> - * use broadcast TLB invalidation instructions, therefore a spurious page
>> - * fault on one CPU which has been handled concurrently by another CPU
>> - * does not need to perform additional invalidation.
>> + * We use local TLB invalidation instruction when reusing page in
>> + * write protection fault handler to avoid TLBI broadcast in the hot
>> + * path. This will cause spurious page faults if stall read-only TLB
>> + * entries exist.
>> */
>> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
>> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
>> + local_flush_tlb_page_nonotify(vma, address)
>> +
>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
>> + local_flush_tlb_page_nonotify(vma, address)
>>
>> /*
>> * ZERO_PAGE is a global shared page that is always zero: used
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 18a5dc0c9a54..651b31fd18bb 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
>> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will
>> * perform a non-hinted invalidation.
>> *
>> + * local_flush_tlb_page(vma, addr)
>> + * Local variant of flush_tlb_page(). Stale TLB entries may
>> + * remain in remote CPUs.
>> + *
>> + * local_flush_tlb_page_nonotify(vma, addr)
>> + * Same as local_flush_tlb_page() except MMU notifier will not be
>> + * called.
>> + *
>> + * local_flush_tlb_contpte_range(vma, start, end)
>> + * Invalidate the virtual-address range '[start, end)' mapped with
>> + * contpte on local CPU for the user address space corresponding
>> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
>> *
>> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
>> * on top of these routines, since that is our interface to the mmu_gather
>> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>> }
>>
>> +static inline void __local_flush_tlb_page_nonotify_nosync(
>> + struct mm_struct *mm, unsigned long uaddr)
>> +{
>> + unsigned long addr;
>> +
>> + dsb(nshst);
>
> We were issuing dsb(ishst) even for the nosync case, likely to ensure
> PTE visibility across cores. However, since set_ptes already includes a
> dsb(ishst) in __set_pte_complete(), does this mean we’re being overly
> cautious in __flush_tlb_page_nosync() in many cases?
>
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> unsigned long uaddr)
> {
> unsigned long addr;
>
> dsb(ishst);
> addr = __TLBI_VADDR(uaddr, ASID(mm));
> __tlbi(vale1is, addr);
> __tlbi_user(vale1is, addr);
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
> (uaddr & PAGE_MASK) +
> PAGE_SIZE);
> }
IIUC, _nosync() here means it doesn't synchronize with the following
code. It still synchronizes with the preceding code, mainly the page
table change. And yes, there may be room to improve this.
> On the other hand, __ptep_set_access_flags() doesn’t seem to use
> set_ptes(), so there’s no guarantee the updated PTEs are visible to all
> cores. If a remote CPU later encounters a page fault and performs a TLB
> invalidation, will it still see a stable PTE?
I don't think so. We just flush the local TLB in the
local_flush_tlb_page() family of functions, so we only need to
guarantee that the page table changes are visible to the local page
table walker. If a page fault occurs on a remote CPU, we will call
local_flush_tlb_page() on the remote CPU.
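
For context, the generic code that ends up doing this on the remote CPU
looks roughly like the following (paraphrased and simplified from
handle_pte_fault() in mm/memory.c); with this patch,
flush_tlb_fix_spurious_fault() on arm64 expands to
local_flush_tlb_page_nonotify():

        /* Simplified sketch, not the verbatim mm/memory.c code. */
        entry = pte_mkyoung(entry);
        if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                                  vmf->flags & FAULT_FLAG_WRITE)) {
                /* The PTE changed here; the arch code flushed the local TLB. */
                update_mmu_cache_range(vmf, vmf->vma, vmf->address, vmf->pte, 1);
        } else if (vmf->flags & FAULT_FLAG_WRITE) {
                /*
                 * The PTE was already writable (fixed up on another CPU):
                 * treat the fault as spurious and drop the stale local TLB
                 * entry, now with a local (non-broadcast) invalidation.
                 */
                flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, vmf->pte);
        }
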
>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>> + __tlbi(vale1, addr);
>> + __tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> + struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> + unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> + (uaddr & PAGE_MASK) + PAGE_SIZE);
>> + dsb(nsh);
>> +}
>> +
>> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> unsigned long uaddr)
>> {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>> dsb(ish);
>> }
>>
>
> We already have functions like
> __flush_tlb_page_nosync() and __flush_tlb_range_nosync().
> Is there a way to factor out or extract their common parts?
>
> Is it because of the differences in barriers that this extraction of
> common code isn’t feasible?
Yes. It's a good idea to do some code cleanup to reduce the code
duplication. Ryan has plans to work on this.
---
Best Regards,
Huang, Ying
> >
> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> > unsigned long uaddr)
> > {
> > unsigned long addr;
> >
> > dsb(ishst);
> > addr = __TLBI_VADDR(uaddr, ASID(mm));
> > __tlbi(vale1is, addr);
> > __tlbi_user(vale1is, addr);
> > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
> > (uaddr & PAGE_MASK) +
> > PAGE_SIZE);
> > }
>
> IIUC, _nosync() here means doesn't synchronize with the following code.
> It still synchronizes with the previous code, mainly the page table
> changing. And, Yes. There may be room to improve this.
>
> > On the other hand, __ptep_set_access_flags() doesn’t seem to use
> > set_ptes(), so there’s no guarantee the updated PTEs are visible to all
> > cores. If a remote CPU later encounters a page fault and performs a TLB
> > invalidation, will it still see a stable PTE?
>
> I don't think so. We just flush local TLB in local_flush_tlb_page()
> family functions. So, we only needs to guarantee the page table changes
> are available for the local page table walking. If a page fault occurs
> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
>
My concern is that:
We don’t have a dsb(ish) to ensure the PTE page table is visible to remote
CPUs, since you’re using dsb(nsh). So even if a remote CPU performs
local_flush_tlb_page(), the memory may not be synchronized yet, and it could
still see the old PTE.
Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:
>> >
>> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> > unsigned long uaddr)
>> > {
>> > unsigned long addr;
>> >
>> > dsb(ishst);
>> > addr = __TLBI_VADDR(uaddr, ASID(mm));
>> > __tlbi(vale1is, addr);
>> > __tlbi_user(vale1is, addr);
>> > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>> > (uaddr & PAGE_MASK) +
>> > PAGE_SIZE);
>> > }
>>
>> IIUC, _nosync() here means doesn't synchronize with the following code.
>> It still synchronizes with the previous code, mainly the page table
>> changing. And, Yes. There may be room to improve this.
>>
>> > On the other hand, __ptep_set_access_flags() doesn’t seem to use
>> > set_ptes(), so there’s no guarantee the updated PTEs are visible to all
>> > cores. If a remote CPU later encounters a page fault and performs a TLB
>> > invalidation, will it still see a stable PTE?
>>
>> I don't think so. We just flush local TLB in local_flush_tlb_page()
>> family functions. So, we only needs to guarantee the page table changes
>> are available for the local page table walking. If a page fault occurs
>> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
>>
>
> My concern is that:
>
> We don’t have a dsb(ish) to ensure the PTE page table is visible to remote
> CPUs, since you’re using dsb(nsh). So even if a remote CPU performs
> local_flush_tlb_page(), the memory may not be synchronized yet, and it could
> still see the old PTE.
So, do you think that after the load/store unit of the remote CPU has
seen the new PTE, the page table walker could still see the old PTE? I
doubt it. Even if so, the worst case is one extra spurious page fault?
If the possibility of the worst case is low enough, that should be OK.

Additionally, the page table lock is held when writing the PTE on this
CPU and re-reading the PTE on the remote CPU. That provides some memory
ordering guarantee too.
---
Best Regards,
Huang, Ying
On Wed, Oct 22, 2025 at 10:02 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> >> >
> >> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> >> > unsigned long uaddr)
> >> > {
> >> > unsigned long addr;
> >> >
> >> > dsb(ishst);
> >> > addr = __TLBI_VADDR(uaddr, ASID(mm));
> >> > __tlbi(vale1is, addr);
> >> > __tlbi_user(vale1is, addr);
> >> > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
> >> > (uaddr & PAGE_MASK) +
> >> > PAGE_SIZE);
> >> > }
> >>
> >> IIUC, _nosync() here means doesn't synchronize with the following code.
> >> It still synchronizes with the previous code, mainly the page table
> >> changing. And, Yes. There may be room to improve this.
> >>
> >> > On the other hand, __ptep_set_access_flags() doesn’t seem to use
> >> > set_ptes(), so there’s no guarantee the updated PTEs are visible to all
> >> > cores. If a remote CPU later encounters a page fault and performs a TLB
> >> > invalidation, will it still see a stable PTE?
> >>
> >> I don't think so. We just flush local TLB in local_flush_tlb_page()
> >> family functions. So, we only needs to guarantee the page table changes
> >> are available for the local page table walking. If a page fault occurs
> >> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
> >>
> >
> > My concern is that:
> >
> > We don’t have a dsb(ish) to ensure the PTE page table is visible to remote
> > CPUs, since you’re using dsb(nsh). So even if a remote CPU performs
> > local_flush_tlb_page(), the memory may not be synchronized yet, and it could
> > still see the old PTE.
>
> So, do you think that after the load/store unit of the remote CPU have
> seen the new PTE, the page table walker could still see the old PTE? I
Without a barrier in the ish domain, remote CPUs’ load/store units may not
see the new PTE written by the first CPU performing the reuse.
That’s why we need a barrier in the ish domain to ensure the PTE is
actually visible across the SMP domain. A store instruction doesn’t guarantee
that the data is immediately visible to other CPUs — at least not for load
instructions.
Though, I’m not entirely sure about the page table walker case.
> doubt it. Even if so, the worse case is one extra spurious page fault?
> If the possibility of the worst case is low enough, that should be OK.
CPU0:                                          CPU1:

write pte;

do local tlbi;

                                               page fault;
                                               do local tlbi; -> still old PTE

pte visible to CPU1
>
> Additionally, the page table lock is held when writing PTE on this CPU
> and re-reading PTE on the remote CPU. That provides some memory order
> guarantee too.
Right, the PTL might take care of it automatically.
Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:
> On Wed, Oct 22, 2025 at 10:02 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> >> >
>> >> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> >> > unsigned long uaddr)
>> >> > {
>> >> > unsigned long addr;
>> >> >
>> >> > dsb(ishst);
>> >> > addr = __TLBI_VADDR(uaddr, ASID(mm));
>> >> > __tlbi(vale1is, addr);
>> >> > __tlbi_user(vale1is, addr);
>> >> > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>> >> > (uaddr & PAGE_MASK) +
>> >> > PAGE_SIZE);
>> >> > }
>> >>
>> >> IIUC, _nosync() here means doesn't synchronize with the following code.
>> >> It still synchronizes with the previous code, mainly the page table
>> >> changing. And, Yes. There may be room to improve this.
>> >>
>> >> > On the other hand, __ptep_set_access_flags() doesn’t seem to use
>> >> > set_ptes(), so there’s no guarantee the updated PTEs are visible to all
>> >> > cores. If a remote CPU later encounters a page fault and performs a TLB
>> >> > invalidation, will it still see a stable PTE?
>> >>
>> >> I don't think so. We just flush local TLB in local_flush_tlb_page()
>> >> family functions. So, we only needs to guarantee the page table changes
>> >> are available for the local page table walking. If a page fault occurs
>> >> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
>> >>
>> >
>> > My concern is that:
>> >
>> > We don’t have a dsb(ish) to ensure the PTE page table is visible to remote
>> > CPUs, since you’re using dsb(nsh). So even if a remote CPU performs
>> > local_flush_tlb_page(), the memory may not be synchronized yet, and it could
>> > still see the old PTE.
>>
>> So, do you think that after the load/store unit of the remote CPU have
>> seen the new PTE, the page table walker could still see the old PTE? I
>
> Without a barrier in the ish domain, remote CPUs’ load/store units may not
> see the new PTE written by the first CPU performing the reuse.
>
> That’s why we need a barrier in the ish domain to ensure the PTE is
> actually visible across the SMP domain. A store instruction doesn’t guarantee
> that the data is immediately visible to other CPUs — at least not for load
> instructions.
>
> Though, I’m not entirely sure about the page table walker case.
>
>> doubt it. Even if so, the worse case is one extra spurious page fault?
>> If the possibility of the worst case is low enough, that should be OK.
>
> CPU0: CPU1:
>
> write pte;
>
> do local tlbi;
>
> page fault;
> do local tlbi; -> still old PTE
>
> pte visible to CPU1
With PTL, this becomes
CPU0:                                   CPU1:

page fault                              page fault
lock PTL
write PTE
do local tlbi
unlock PTL
                                        lock PTL      <- pte visible to CPU 1
                                        read PTE      <- new PTE
                                        do local tlbi <- new PTE
                                        unlock PTL
>> Additionally, the page table lock is held when writing PTE on this CPU
>> and re-reading PTE on the remote CPU. That provides some memory order
>> guarantee too.
>
> Right, the PTL might take care of it automatically.
---
Best Regards,
Huang, Ying
>
> With PTL, this becomes
>
> CPU0: CPU1:
>
> page fault page fault
> lock PTL
> write PTE
> do local tlbi
> unlock PTL
> lock PTL <- pte visible to CPU 1
> read PTE <- new PTE
> do local tlbi <- new PTE
> unlock PTL
I agree. Yet the ish barrier can still avoid the page faults during CPU0's PTL.
CPU0:                                    CPU1:

lock PTL

write pte;
Issue ish barrier
do local tlbi;

                                         No page fault occurs if tlb misses

unlock PTL

Otherwise, it could be:

CPU0:                                    CPU1:

lock PTL

write pte;
Issue nsh barrier
do local tlbi;

                                         page fault occurs if tlb misses

unlock PTL
Not quite sure if adding an ish right after the PTE modification has any
noticeable performance impact on the test? I assume the most expensive part
is still the tlbi broadcast dsb, not the PTE memory sync barrier?
Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:

>> With PTL, this becomes
>>
>> CPU0:                                   CPU1:
>>
>> page fault                              page fault
>> lock PTL
>> write PTE
>> do local tlbi
>> unlock PTL
>>                                         lock PTL      <- pte visible to CPU 1
>>                                         read PTE      <- new PTE
>>                                         do local tlbi <- new PTE
>>                                         unlock PTL
>
> I agree. Yet the ish barrier can still avoid the page faults during CPU0's PTL.

IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
memory writing (visible to other CPUs). TBH, I suspect that this is the
case.

> CPU0:                                    CPU1:
>
> lock PTL
>
> write pte;
> Issue ish barrier
> do local tlbi;
>
>                                          No page fault occurs if tlb misses
>
> unlock PTL
>
> Otherwise, it could be:
>
> CPU0:                                    CPU1:
>
> lock PTL
>
> write pte;
> Issue nsh barrier
> do local tlbi;
>
>                                          page fault occurs if tlb misses
>
> unlock PTL
>
> Not quite sure if adding an ish right after the PTE modification has any
> noticeable performance impact on the test? I assume the most expensive part
> is still the tlbi broadcast dsb, not the PTE memory sync barrier?

---
Best Regards,
Huang, Ying
On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> > I agree. Yet the ish barrier can still avoid the page faults during CPU0's PTL.
>
> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> memory writing (visible to other CPUs). TBH, I suspect that this is the
> case.

Why? In any case, nsh is not a smp domain.

I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
are visible
to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
it seems like overkill.

Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>
>> >
>> > I agree. Yet the ish barrier can still avoid the page faults during CPU0's PTL.
>>
>> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
>> memory writing (visible to other CPUs). TBH, I suspect that this is the
>> case.
>
> Why? In any case, nsh is not a smp domain.

I think dsb(ish) will be slower than dsb(nsh) in theory. I guess that
dsb just waits for the memory write to be visible in the specified
shareability domain instead of making the write faster.

> I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> are visible

dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).

> to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
> it seems like overkill.

dsb(ish) here is used for the tlbi(XXis) broadcast. It waits until the page
table change is visible to the page table walker of the remote CPU.

---
Best Regards,
Huang, Ying
On Wed, Oct 22, 2025 at 11:34 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> > <ying.huang@linux.alibaba.com> wrote:
> >
> >> >
> >> > I agree. Yet the ish barrier can still avoid the page faults during CPU0's PTL.
> >>
> >> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> >> memory writing (visible to other CPUs). TBH, I suspect that this is the
> >> case.
> >
> > Why? In any case, nsh is not a smp domain.
>
> I think dsb(ish) will be slower than dsb(nsh) in theory. I guess that
> dsb just waits for the memory write to be visible in the specified
> shareability domain instead of making the write faster.
>
> > I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> > are visible
>
> dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).
>
> > to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
> > it seems like overkill.
>
> dsb(ish) here is used for the tlbi(XXis) broadcast. It waits until the page
> table change is visible to the page table walker of the remote CPU.

It seems we're aligned on all points[1], although I'm not sure whether
you have data comparing A and B.

A:
write pte
don't broadcast pte
tlbi
don't broadcast tlbi

with

B:
write pte
broadcast pte
tlbi
don't broadcast tlbi

I guess the gain comes from "don't broadcast tlbi"?
With B, we should be able to share much of the existing code.

[1]
https://lore.kernel.org/linux-mm/20251013092038.6963-1-ying.huang@linux.alibaba.com/T/#m54312d4914c69aa550bee7df36711c03a4280c52

Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 11:34 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
>> > <ying.huang@linux.alibaba.com> wrote:
>> >
>> >> >
>> >> > I agree. Yet the ish barrier can still avoid the page faults during CPU0's PTL.
>> >>
>> >> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
>> >> memory writing (visible to other CPUs). TBH, I suspect that this is the
>> >> case.
>> >
>> > Why? In any case, nsh is not a smp domain.
>>
>> I think dsb(ish) will be slower than dsb(nsh) in theory. I guess that
>> dsb just waits for the memory write to be visible in the specified
>> shareability domain instead of making the write faster.
>>
>> > I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
>> > are visible
>>
>> dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).
>>
>> > to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
>> > it seems like overkill.
>>
>> dsb(ish) here is used for the tlbi(XXis) broadcast. It waits until the page
>> table change is visible to the page table walker of the remote CPU.
>
> It seems we're aligned on all points[1], although I'm not sure whether
> you have data comparing A and B.
>
> A:
> write pte
> don't broadcast pte
> tlbi
> don't broadcast tlbi
>
> with
>
> B:
> write pte
> broadcast pte

I suspect that pte will be broadcast, DVM broadcast isn't used for
the memory coherency IIUC.

> tlbi
> don't broadcast tlbi
>
> I guess the gain comes from "don't broadcast tlbi"?
> With B, we should be able to share much of the existing code.

Ryan has plans to reduce the code duplication with the current
solution.

> [1]
> https://lore.kernel.org/linux-mm/20251013092038.6963-1-ying.huang@linux.alibaba.com/T/#m54312d4914c69aa550bee7df36711c03a4280c52

---
Best Regards,
Huang, Ying
> >
> > A:
> > write pte
> > don't broadcast pte
> > tlbi
> > don't broadcast tlbi
> >
> > with
> >
> > B:
> > write pte
> > broadcast pte
>
> I suspect that pte will be broadcast, DVM broadcast isn't used for
> the memory coherency IIUC.

I guess you're right. By "broadcast," I actually meant the PTE becoming visible
to other CPUs. With a dsb(ish) before the tlbi, other cores' TLBs can load the
new PTE after their TLB entry is shot down. But as you said, if the hardware
doesn't propagate the updated PTE faster, it doesn't seem to help reduce page
faults.

As a side note, I'm curious about the data between dsb(nsh) and dsb(ish) on
your platform. Perhaps because the number of CPU cores is small, I didn't see
any noticeable difference between them on phones.

> > tlbi
> > don't broadcast tlbi
> >
> > I guess the gain comes from "don't broadcast tlbi"?
> > With B, we should be able to share much of the existing code.
>
> Ryan has plans to reduce the code duplication with the current
> solution.

Ok.

Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:

>> >
>> > A:
>> > write pte
>> > don't broadcast pte
>> > tlbi
>> > don't broadcast tlbi
>> >
>> > with
>> >
>> > B:
>> > write pte
>> > broadcast pte
>>
>> I suspect that pte will be broadcast, DVM broadcast isn't used for
>> the memory coherency IIUC.
>
> I guess you're right. By "broadcast," I actually meant the PTE becoming visible
> to other CPUs. With a dsb(ish) before the tlbi, other cores' TLBs can load the
> new PTE after their TLB entry is shot down. But as you said, if the hardware
> doesn't propagate the updated PTE faster, it doesn't seem to help reduce page
> faults.
>
> As a side note, I'm curious about the data between dsb(nsh) and dsb(ish) on
> your platform. Perhaps because the number of CPU cores is small, I didn't see
> any noticeable difference between them on phones.

Sure. I can give it a try. Can you share the test case?

>>
>> > tlbi
>> > don't broadcast tlbi
>> >
>> > I guess the gain comes from "don't broadcast tlbi"?
>> > With B, we should be able to share much of the existing code.
>>
>> Ryan has plans to reduce the code duplication with the current
>> solution.
>
> Ok.

---
Best Regards,
Huang, Ying
On Wed, Oct 22, 2025 at 10:55 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
> >
> > > I agree. Yet the ish barrier can still avoid the page faults during CPU0's PTL.
> >
> > IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> > memory writing (visible to other CPUs). TBH, I suspect that this is the
> > case.
>
> Why? In any case, nsh is not a smp domain.
>
> I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> are visible
> to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
> it seems like overkill.

On second thought, the PTE/page table walker might not be a typical SMP
sync case, so a dmb may not be sufficient; we are not dealing with
standard load/store instruction sequences across multiple threads.

In any case, my point is that dsb(ish) might be slightly slower than
your dsb(nsh), but it makes the PTE visible to other CPUs earlier and
helps avoid some page faults after we've written the PTE. However, if
your current nsh version actually provides better performance, even
when multiple threads may access the data simultaneously, it should be
completely fine.

Now you are:

write pte
don't broadcast pte
tlbi
don't broadcast tlbi

we might be:

write pte
broadcast pte
tlbi
don't broadcast tlbi

Thanks
Barry
On 13/10/2025 10:20, Huang Ying wrote:
> A multi-thread customer workload with large memory footprint uses
> fork()/exec() to run some external programs every tens seconds. When
> running the workload on an arm64 server machine, it's observed that
> quite some CPU cycles are spent in the TLB flushing functions. While
> running the workload on the x86_64 server machine, it's not. This
> causes the performance on arm64 to be much worse than that on x86_64.
>
> During the workload running, after fork()/exec() write-protects all
> pages in the parent process, memory writing in the parent process
> will cause a write protection fault. Then the page fault handler
> will make the PTE/PDE writable if the page can be reused, which is
> almost always true in the workload. On arm64, to avoid the write
> protection fault on other CPUs, the page fault handler flushes the TLB
> globally with TLBI broadcast after changing the PTE/PDE. However, this
> isn't always necessary. Firstly, it's safe to leave some stall
nit: You keep using the word "stall" here and in the code. I think you mean "stale"?
> read-only TLB entries as long as they will be flushed finally.
> Secondly, it's quite possible that the original read-only PTE/PDEs
> aren't cached in remote TLB at all if the memory footprint is large.
> In fact, on x86_64, the page fault handler doesn't flush the remote
> TLB in this situation, which benefits the performance a lot.
>
> To improve the performance on arm64, make the write protection fault
> handler flush the TLB locally instead of globally via TLBI broadcast
> after making the PTE/PDE writable. If there are stall read-only TLB
> entries in the remote CPUs, the page fault handler on these CPUs will
> regard the page fault as spurious and flush the stall TLB entries.
>
> To test the patchset, make the usemem.c from vm-scalability
> (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git).
> support calling fork()/exec() periodically (merged). To mimic the
> behavior of the customer workload, run usemem with 4 threads, access
> 100GB memory, and call fork()/exec() every 40 seconds. Test results
> show that with the patchset the score of usemem improves ~40.6%. The
> cycles% of TLB flush functions reduces from ~50.5% to ~0.3% in perf
> profile.
>
> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Yicong Yang <yangyicong@hisilicon.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Kevin Brodsky <kevin.brodsky@arm.com>
> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
> arch/arm64/include/asm/pgtable.h | 14 +++++---
> arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++
> arch/arm64/mm/contpte.c | 3 +-
> arch/arm64/mm/fault.c | 2 +-
> 4 files changed, 67 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index aa89c2e67ebc..35bae2e4bcfe 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> /*
> - * Outside of a few very special situations (e.g. hibernation), we always
> - * use broadcast TLB invalidation instructions, therefore a spurious page
> - * fault on one CPU which has been handled concurrently by another CPU
> - * does not need to perform additional invalidation.
> + * We use local TLB invalidation instruction when reusing page in
> + * write protection fault handler to avoid TLBI broadcast in the hot
> + * path. This will cause spurious page faults if stall read-only TLB
> + * entries exist.
> */
> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
> + local_flush_tlb_page_nonotify(vma, address)
> +
> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
> + local_flush_tlb_page_nonotify(vma, address)
>
> /*
> * ZERO_PAGE is a global shared page that is always zero: used
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 18a5dc0c9a54..651b31fd18bb 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will
> * perform a non-hinted invalidation.
> *
> + * local_flush_tlb_page(vma, addr)
> + * Local variant of flush_tlb_page(). Stale TLB entries may
> + * remain in remote CPUs.
> + *
> + * local_flush_tlb_page_nonotify(vma, addr)
> + * Same as local_flush_tlb_page() except MMU notifier will not be
> + * called.
> + *
> + * local_flush_tlb_contpte_range(vma, start, end)
> + * Invalidate the virtual-address range '[start, end)' mapped with
> + * contpte on local CPU for the user address space corresponding
> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
> *
> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
> * on top of these routines, since that is our interface to the mmu_gather
> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> }
>
> +static inline void __local_flush_tlb_page_nonotify_nosync(
> + struct mm_struct *mm, unsigned long uaddr)
> +{
> + unsigned long addr;
> +
> + dsb(nshst);
> + addr = __TLBI_VADDR(uaddr, ASID(mm));
> + __tlbi(vale1, addr);
> + __tlbi_user(vale1, addr);
> +}
> +
> +static inline void local_flush_tlb_page_nonotify(
> + struct vm_area_struct *vma, unsigned long uaddr)
> +{
> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
> + dsb(nsh);
> +}
> +
> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
> + unsigned long uaddr)
> +{
> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
> + (uaddr & PAGE_MASK) + PAGE_SIZE);
> + dsb(nsh);
> +}
> +
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> unsigned long uaddr)
> {
> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
> dsb(ish);
> }
>
> +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
This would be clearer as an API if it was like this:
static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
unsigned long uaddr)
i.e. the user doesn't set the range - it's implicitly CONT_PTE_SIZE starting at
round_down(uaddr, PAGE_SIZE).
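
(For illustration, a minimal sketch of that suggested shape, built on the
local_flush_tlb_contpte_range() helper added by this patch; only the
signature and semantics come from the suggestion above, the body is a
guess:)

static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
                                           unsigned long uaddr)
{
        unsigned long start = round_down(uaddr, PAGE_SIZE);

        local_flush_tlb_contpte_range(vma, start, start + CONT_PTE_SIZE);
}
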
Thanks,
Ryan
> +{
> + unsigned long asid, pages;
> +
> + start = round_down(start, PAGE_SIZE);
> + end = round_up(end, PAGE_SIZE);
> + pages = (end - start) >> PAGE_SHIFT;
> +
> + dsb(nshst);
> + asid = ASID(vma->vm_mm);
> + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid,
> + 3, true, lpa2_is_enabled());
> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
> + dsb(nsh);
> +}
> +
> static inline void flush_tlb_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end)
> {
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index c0557945939c..0f9bbb7224dc 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> __ptep_set_access_flags(vma, addr, ptep, entry, 0);
>
> if (dirty)
> - __flush_tlb_range(vma, start_addr, addr,
> - PAGE_SIZE, true, 3);
> + local_flush_tlb_contpte_range(vma, start_addr, addr);
> } else {
> __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index d816ff44faff..22f54f5afe3f 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>
> /* Invalidate a stale read-only entry */
> if (dirty)
> - flush_tlb_page(vma, address);
> + local_flush_tlb_page(vma, address);
> return 1;
> }
>
Hi, Ryan,
Thanks for comments!
Ryan Roberts <ryan.roberts@arm.com> writes:
> On 13/10/2025 10:20, Huang Ying wrote:
>> A multi-thread customer workload with large memory footprint uses
>> fork()/exec() to run some external programs every tens seconds. When
>> running the workload on an arm64 server machine, it's observed that
>> quite some CPU cycles are spent in the TLB flushing functions. While
>> running the workload on the x86_64 server machine, it's not. This
>> causes the performance on arm64 to be much worse than that on x86_64.
>>
>> During the workload running, after fork()/exec() write-protects all
>> pages in the parent process, memory writing in the parent process
>> will cause a write protection fault. Then the page fault handler
>> will make the PTE/PDE writable if the page can be reused, which is
>> almost always true in the workload. On arm64, to avoid the write
>> protection fault on other CPUs, the page fault handler flushes the TLB
>> globally with TLBI broadcast after changing the PTE/PDE. However, this
>> isn't always necessary. Firstly, it's safe to leave some stall
>
> nit: You keep using the word "stall" here and in the code. I think you mean "stale"?
OOPS, my poor English :-(
Yes, it should be "stale". Thanks for pointing this out, will fix it in
the future versions.
>> read-only TLB entries as long as they will be flushed finally.
>> Secondly, it's quite possible that the original read-only PTE/PDEs
>> aren't cached in remote TLB at all if the memory footprint is large.
>> In fact, on x86_64, the page fault handler doesn't flush the remote
>> TLB in this situation, which benefits the performance a lot.
>>
>> To improve the performance on arm64, make the write protection fault
>> handler flush the TLB locally instead of globally via TLBI broadcast
>> after making the PTE/PDE writable. If there are stall read-only TLB
>> entries in the remote CPUs, the page fault handler on these CPUs will
>> regard the page fault as spurious and flush the stall TLB entries.
>>
>> To test the patchset, make the usemem.c from vm-scalability
>> (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git).
>> support calling fork()/exec() periodically (merged). To mimic the
>> behavior of the customer workload, run usemem with 4 threads, access
>> 100GB memory, and call fork()/exec() every 40 seconds. Test results
>> show that with the patchset the score of usemem improves ~40.6%. The
>> cycles% of TLB flush functions reduces from ~50.5% to ~0.3% in perf
>> profile.
>>
>> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Yang Shi <yang@os.amperecomputing.com>
>> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>> Cc: Yicong Yang <yangyicong@hisilicon.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Kevin Brodsky <kevin.brodsky@arm.com>
>> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> ---
>> arch/arm64/include/asm/pgtable.h | 14 +++++---
>> arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++
>> arch/arm64/mm/contpte.c | 3 +-
>> arch/arm64/mm/fault.c | 2 +-
>> 4 files changed, 67 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index aa89c2e67ebc..35bae2e4bcfe 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> /*
>> - * Outside of a few very special situations (e.g. hibernation), we always
>> - * use broadcast TLB invalidation instructions, therefore a spurious page
>> - * fault on one CPU which has been handled concurrently by another CPU
>> - * does not need to perform additional invalidation.
>> + * We use local TLB invalidation instruction when reusing page in
>> + * write protection fault handler to avoid TLBI broadcast in the hot
>> + * path. This will cause spurious page faults if stall read-only TLB
>> + * entries exist.
>> */
>> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
>> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
>> + local_flush_tlb_page_nonotify(vma, address)
>> +
>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
>> + local_flush_tlb_page_nonotify(vma, address)
>>
>> /*
>> * ZERO_PAGE is a global shared page that is always zero: used
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 18a5dc0c9a54..651b31fd18bb 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
>> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will
>> * perform a non-hinted invalidation.
>> *
>> + * local_flush_tlb_page(vma, addr)
>> + * Local variant of flush_tlb_page(). Stale TLB entries may
>> + * remain in remote CPUs.
>> + *
>> + * local_flush_tlb_page_nonotify(vma, addr)
>> + * Same as local_flush_tlb_page() except MMU notifier will not be
>> + * called.
>> + *
>> + * local_flush_tlb_contpte_range(vma, start, end)
>> + * Invalidate the virtual-address range '[start, end)' mapped with
>> + * contpte on local CPU for the user address space corresponding
>> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
>> *
>> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
>> * on top of these routines, since that is our interface to the mmu_gather
>> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>> }
>>
>> +static inline void __local_flush_tlb_page_nonotify_nosync(
>> + struct mm_struct *mm, unsigned long uaddr)
>> +{
>> + unsigned long addr;
>> +
>> + dsb(nshst);
>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>> + __tlbi(vale1, addr);
>> + __tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> + struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> + unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> + (uaddr & PAGE_MASK) + PAGE_SIZE);
>> + dsb(nsh);
>> +}
>> +
>> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> unsigned long uaddr)
>> {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>> dsb(ish);
>> }
>>
>> +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma,
>> + unsigned long start, unsigned long end)
>
> This would be clearer as an API if it was like this:
>
> static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
> unsigned long uaddr)
>
> i.e. the user doesn't set the range - it's implicitly CONT_PTE_SIZE starting at
> round_down(uaddr, PAGE_SIZE).
Sure. Will do this.
> Thanks,
> Ryan
>
>> +{
>> + unsigned long asid, pages;
>> +
>> + start = round_down(start, PAGE_SIZE);
>> + end = round_up(end, PAGE_SIZE);
>> + pages = (end - start) >> PAGE_SHIFT;
>> +
>> + dsb(nshst);
>> + asid = ASID(vma->vm_mm);
>> + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid,
>> + 3, true, lpa2_is_enabled());
>> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>> + dsb(nsh);
>> +}
>> +
>> static inline void flush_tlb_range(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end)
>> {
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index c0557945939c..0f9bbb7224dc 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> __ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>
>> if (dirty)
>> - __flush_tlb_range(vma, start_addr, addr,
>> - PAGE_SIZE, true, 3);
>> + local_flush_tlb_contpte_range(vma, start_addr, addr);
>> } else {
>> __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index d816ff44faff..22f54f5afe3f 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>>
>> /* Invalidate a stale read-only entry */
>> if (dirty)
>> - flush_tlb_page(vma, address);
>> + local_flush_tlb_page(vma, address);
>> return 1;
>> }
>>
---
Best Regards,
Huang, Ying