[PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid

Ryan Roberts posted 1 patch 6 months, 2 weeks ago
arch/arm64/include/asm/tlbflush.h | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
[PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Ryan Roberts 6 months, 2 weeks ago
Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
a parallel reclaim leaving stale TLB entries") describes a race that,
prior to the commit, could occur between reclaim and operations such as
mprotect() when using reclaim's tlbbatch mechanism. See that commit for
details but the summary is:

"""
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.

He described the race as follows:

	CPU0				CPU1
	----				----
					user accesses memory using RW PTE
					[PTE now cached in TLB]
	try_to_unmap_one()
	==> ptep_get_and_clear()
	==> set_tlb_ubc_flush_pending()
					mprotect(addr, PROT_READ)
					==> change_pte_range()
					==> [ PTE non-present - no flush ]

					user writes using cached RW PTE
	...

	try_to_unmap_flush()
"""

The solution was to insert flush_tlb_batched_pending() in mprotect() and
friends to explicitly drain any pending reclaim TLB flushes. In the
modern version of this solution, arch_flush_tlb_batched_pending() is
called to do that synchronisation.

arm64's tlbbatch implementation simply issues TLBIs at queue-time
(arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
of the batch to wait for all the issued TLBIs to complete.
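
That is, the hooks have roughly the following shape (a simplified sketch;
the exact argument lists and range helper are abbreviated from
arch/arm64/include/asm/tlbflush.h and vary a little by kernel version):

static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                             struct mm_struct *mm,
                                             unsigned long start, unsigned long end)
{
        /* Issue the TLBI(s) for the range now, but elide the trailing DSB. */
        __flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
        /* A single DSB at the end of the batch waits for all those TLBIs. */
        dsb(ish);
}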

Now, the Arm ARM states:

"""
The completion of the TLB maintenance instruction is guaranteed only by
the execution of a DSB by the observer that performed the TLB
maintenance instruction. The execution of a DSB by a different observer
does not have this effect, even if the DSB is known to be executed after
the TLB maintenance instruction is observed by that different observer.
"""

arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
requirement because they are called from the same task (either kswapd or
the caller of madvise(MADV_PAGEOUT)), so either they run on the same CPU
or, if the task was migrated in between, __switch_to() contains an extra
dsb(ish).
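
For the migration case, the dsb(ish) being relied upon is the one in the
arm64 context-switch path. Roughly, abbreviated from
arch/arm64/kernel/process.c with the per-task state switching elided:

struct task_struct *__switch_to(struct task_struct *prev,
                                struct task_struct *next)
{
        /* ... save/restore of per-task state elided ... */

        /*
         * Complete any pending TLB or cache maintenance on this CPU in case
         * the thread migrates to a different CPU.
         */
        dsb(ish);

        /* the actual register switch to the next task */
        return cpu_switch_to(prev, next);
}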

HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
a dsb(ish). But this may be running on a CPU remote from the one that
issued the outstanding TLBIs, so there is no architectural guarantee of
synchronisation. Therefore we are still vulnerable to the theoretical
race described in commit 3ea277194daa ("mm, mprotect: flush TLB if
potentially racing with a parallel reclaim leaving stale TLB entries").

Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
This aligns with what the other arches that implement the tlbbatch
feature do.
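
For context, flush_tlb_mm() on arm64 looks roughly like this (abbreviated
from arch/arm64/include/asm/tlbflush.h). The key point is that the
ASID-wide broadcast TLBI is issued by *this* CPU and completed by *this*
CPU's trailing dsb(ish), so the stale entries are gone regardless of
whether the remote CPU's earlier TLBIs have completed yet:

static inline void flush_tlb_mm(struct mm_struct *mm)
{
        unsigned long asid;

        dsb(ishst);
        asid = __TLBI_VADDR(0, ASID(mm));
        __tlbi(aside1is, asid);         /* invalidate all entries for this ASID */
        __tlbi_user(aside1is, asid);
        dsb(ish);                       /* executed by the same CPU as the TLBIs above */
        mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}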

Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index eba1a98657f1..7d564c2a126f 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 }

 /*
- * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
- * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
- * flush_tlb_batched_pending().
+ * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
+ * all the previously issued TLBIs targeting mm have completed. But since we
+ * can be executing on a remote CPU, a DSB cannot guarantee this like it can
+ * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
  */
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 {
-	dsb(ish);
+	flush_tlb_mm(mm);
 }

 /*
--
2.43.0
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Barry Song 6 months, 2 weeks ago
On Sat, May 31, 2025 at 3:24 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> a parallel reclaim leaving stale TLB entries") describes a race that,
> prior to the commit, could occur between reclaim and operations such as
> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> details but the summary is:
>
> """
> Nadav Amit identified a theoritical race between page reclaim and
> mprotect due to TLB flushes being batched outside of the PTL being held.
>
> He described the race as follows:
>
>         CPU0                            CPU1
>         ----                            ----
>                                         user accesses memory using RW PTE
>                                         [PTE now cached in TLB]
>         try_to_unmap_one()
>         ==> ptep_get_and_clear()
>         ==> set_tlb_ubc_flush_pending()
>                                         mprotect(addr, PROT_READ)
>                                         ==> change_pte_range()
>                                         ==> [ PTE non-present - no flush ]
>
>                                         user writes using cached RW PTE
>         ...
>
>         try_to_unmap_flush()
> """
>
> The solution was to insert flush_tlb_batched_pending() in mprotect() and
> friends to explcitly drain any pending reclaim TLB flushes. In the
> modern version of this solution, arch_flush_tlb_batched_pending() is
> called to do that synchronisation.
>
> arm64's tlbbatch implementation simply issues TLBIs at queue-time
> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
> of the batch to wait for all the issued TLBIs to complete.
>
> Now, the Arm ARM states:
>
> """
> The completion of the TLB maintenance instruction is guaranteed only by
> the execution of a DSB by the observer that performed the TLB
> maintenance instruction. The execution of a DSB by a different observer
> does not have this effect, even if the DSB is known to be executed after
> the TLB maintenance instruction is observed by that different observer.
> """
>
> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
> requirement because they are called from the same task (either kswapd or
> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
> if the task was migrated, __switch_to() contains an extra dsb(ish).
>
> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
> a dsb(ish). But this may be running on a CPU remote from the one that
> issued the outstanding TLBIs. So there is no architectural gurantee of
> synchonization. Therefore we are still vulnerable to the theoretical
> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
> potentially racing with a parallel reclaim leaving stale TLB entries").

Hi Ryan,

Sorry for bringing up another question late, but your explanation made me
reconsider whether I might also be wrong in arch_tlbbatch_flush(). Specifically,
try_to_unmap_flush() needs to ensure that all TLBI operations from other CPUs
for all involved memory contexts have completed. However, as you pointed
out, a DSB ISH alone cannot guarantee this.

This makes me wonder if we should take inspiration from RISC-V or x86 and use a
cpumask to track all CPUs that have pending TLBIs. Then, we could use IPIs to
explicitly request those CPUs to issue DSB ISH, ensuring their TLB
invalidations are fully completed.

I mean something similar to what x86 and RISC-V do, but with a simpler
approach: just issuing DSB ISH on the relevant CPUs.

static void __ipi_flush_tlbi(void *info)
{
        /* Runs on each CPU with pending TLBIs: a local DSB completes them. */
        dsb(ish);
}

void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
        /* wait=true: don't return until every targeted CPU has run its DSB. */
        on_each_cpu_mask(&batch->cpumask, __ipi_flush_tlbi, NULL, true);
        cpumask_clear(&batch->cpumask);
}

Sorry for the mess I made earlier.

>
> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
> This aligns with what the other arches that implement the tlbbatch
> feature do.
>
> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index eba1a98657f1..7d564c2a126f 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>  }
>
>  /*
> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> - * flush_tlb_batched_pending().
> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
> + * all the previously issued TLBIs targeting mm have completed. But since we
> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>   */
>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>  {
> -       dsb(ish);
> +       flush_tlb_mm(mm);
>  }
>
>  /*
> --
> 2.43.0
>

Thanks
Barry
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Ryan Roberts 6 months, 2 weeks ago
On 03/06/2025 10:49, Barry Song wrote:
> On Sat, May 31, 2025 at 3:24 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
>> a parallel reclaim leaving stale TLB entries") describes a race that,
>> prior to the commit, could occur between reclaim and operations such as
>> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
>> details but the summary is:
>>
>> """
>> Nadav Amit identified a theoritical race between page reclaim and
>> mprotect due to TLB flushes being batched outside of the PTL being held.
>>
>> He described the race as follows:
>>
>>         CPU0                            CPU1
>>         ----                            ----
>>                                         user accesses memory using RW PTE
>>                                         [PTE now cached in TLB]
>>         try_to_unmap_one()
>>         ==> ptep_get_and_clear()
>>         ==> set_tlb_ubc_flush_pending()
>>                                         mprotect(addr, PROT_READ)
>>                                         ==> change_pte_range()
>>                                         ==> [ PTE non-present - no flush ]
>>
>>                                         user writes using cached RW PTE
>>         ...
>>
>>         try_to_unmap_flush()
>> """
>>
>> The solution was to insert flush_tlb_batched_pending() in mprotect() and
>> friends to explcitly drain any pending reclaim TLB flushes. In the
>> modern version of this solution, arch_flush_tlb_batched_pending() is
>> called to do that synchronisation.
>>
>> arm64's tlbbatch implementation simply issues TLBIs at queue-time
>> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
>> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
>> of the batch to wait for all the issued TLBIs to complete.
>>
>> Now, the Arm ARM states:
>>
>> """
>> The completion of the TLB maintenance instruction is guaranteed only by
>> the execution of a DSB by the observer that performed the TLB
>> maintenance instruction. The execution of a DSB by a different observer
>> does not have this effect, even if the DSB is known to be executed after
>> the TLB maintenance instruction is observed by that different observer.
>> """
>>
>> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
>> requirement because they are called from the same task (either kswapd or
>> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
>> if the task was migrated, __switch_to() contains an extra dsb(ish).
>>
>> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
>> a dsb(ish). But this may be running on a CPU remote from the one that
>> issued the outstanding TLBIs. So there is no architectural gurantee of
>> synchonization. Therefore we are still vulnerable to the theoretical
>> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
>> potentially racing with a parallel reclaim leaving stale TLB entries").
> 
> Hi Ryan,
> 
> Sorry for bringing up another question late, but your explanation made me
> reconsider whether I might also be wrong in arch_tlbbatch_flush(). Specifically,
> try_to_unmap_flush() needs to ensure that all TLBI operations from other CPUs
> for all involved memory contexts have completed. However, as you pointed
> out, a DSB ISH alone cannot guarantee this.

Hmm, does try_to_unmap_flush() actually need to ensure that *all* pending TLBIs
are completed, or does it only need to ensure that the TLBIs previously issued
by the same instance of shrink_folio_list() are completed?

I understood it to be the latter and therefore thought the current
arch_tlbbatch_flush() was safe.

If another instance has concurrently queued up some TLBIs using tlbbatch (e.g.
via MADV_PAGEOUT) then the first instance would never have seen those PTEs, so
I don't think it matters that the TLB flush is still pending? Perhaps I'm wrong...

> 
> This makes me wonder if we should take inspiration from RISC-V or x86 and use a
> cpumask to track all CPUs that have pending TLBIs. Then, we could use IPIs to
> explicitly request those CPUs to issue DSB ISH, ensuring their TLB
> invalidations are fully completed.
> 
> I mean something similar to what x86 and RISC-V do, but using just a
> simpler approach like issuing DSB ISH on the relevant CPUs.
> 
> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> {
>         on_each_cpu_mask(&batch->cpumask, __ipi_flush_tlbi, NULL, NULL);
>         cpumask_clear(&batch->cpumask);
> }
> 
> static void __ipi_flush_tlbi(void *info)
> {
>         dsb(ish);
> }

My initial instinct is yuk :). I guess we would need to do a bunch of
benchmarking if this turns out to be needed. But it would be good to avoid if
possible. Let's figure out if we definitely have a race first...

Thanks,
Ryan

> 
> Sorry for the mess I made earlier.

No worries, it happens.

> 
>>
>> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
>> This aligns with what the other arches that implement the tlbbatch
>> feature do.
>>
>> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index eba1a98657f1..7d564c2a126f 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>  }
>>
>>  /*
>> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
>> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
>> - * flush_tlb_batched_pending().
>> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
>> + * all the previously issued TLBIs targeting mm have completed. But since we
>> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
>> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>>   */
>>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>>  {
>> -       dsb(ish);
>> +       flush_tlb_mm(mm);
>>  }
>>
>>  /*
>> --
>> 2.43.0
>>
> 
> Thanks
> Barry

Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Barry Song 6 months, 2 weeks ago
On Tue, Jun 3, 2025 at 10:32 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/06/2025 10:49, Barry Song wrote:
> > On Sat, May 31, 2025 at 3:24 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> >> a parallel reclaim leaving stale TLB entries") describes a race that,
> >> prior to the commit, could occur between reclaim and operations such as
> >> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> >> details but the summary is:
> >>
> >> """
> >> Nadav Amit identified a theoritical race between page reclaim and
> >> mprotect due to TLB flushes being batched outside of the PTL being held.
> >>
> >> He described the race as follows:
> >>
> >>         CPU0                            CPU1
> >>         ----                            ----
> >>                                         user accesses memory using RW PTE
> >>                                         [PTE now cached in TLB]
> >>         try_to_unmap_one()
> >>         ==> ptep_get_and_clear()
> >>         ==> set_tlb_ubc_flush_pending()
> >>                                         mprotect(addr, PROT_READ)
> >>                                         ==> change_pte_range()
> >>                                         ==> [ PTE non-present - no flush ]
> >>
> >>                                         user writes using cached RW PTE
> >>         ...
> >>
> >>         try_to_unmap_flush()
> >> """
> >>
> >> The solution was to insert flush_tlb_batched_pending() in mprotect() and
> >> friends to explcitly drain any pending reclaim TLB flushes. In the
> >> modern version of this solution, arch_flush_tlb_batched_pending() is
> >> called to do that synchronisation.
> >>
> >> arm64's tlbbatch implementation simply issues TLBIs at queue-time
> >> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
> >> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
> >> of the batch to wait for all the issued TLBIs to complete.
> >>
> >> Now, the Arm ARM states:
> >>
> >> """
> >> The completion of the TLB maintenance instruction is guaranteed only by
> >> the execution of a DSB by the observer that performed the TLB
> >> maintenance instruction. The execution of a DSB by a different observer
> >> does not have this effect, even if the DSB is known to be executed after
> >> the TLB maintenance instruction is observed by that different observer.
> >> """
> >>
> >> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
> >> requirement because they are called from the same task (either kswapd or
> >> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
> >> if the task was migrated, __switch_to() contains an extra dsb(ish).
> >>
> >> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
> >> a dsb(ish). But this may be running on a CPU remote from the one that
> >> issued the outstanding TLBIs. So there is no architectural gurantee of
> >> synchonization. Therefore we are still vulnerable to the theoretical
> >> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
> >> potentially racing with a parallel reclaim leaving stale TLB entries").
> >
> > Hi Ryan,
> >
> > Sorry for bringing up another question late, but your explanation made me
> > reconsider whether I might also be wrong in arch_tlbbatch_flush(). Specifically,
> > try_to_unmap_flush() needs to ensure that all TLBI operations from other CPUs
> > for all involved memory contexts have completed. However, as you pointed
> > out, a DSB ISH alone cannot guarantee this.
>
> Hmm, does try_to_unmap_flush() actually need to ensure that *all* pending TLBIs
> are completed, or does it only need to ensure that the TLBIs previously issued
> by the same instance of shrink_folio_list() are completed?
>
> I understood it to be the latter and therefore thought the current
> arch_tlbbatch_flush() was safe.
>
> If another instance has concurrently queued up some TLBIs using tlbbatch (i.e.
> MADV_PAGEOUT) then the first instance would never see those PTEs so I don't
> think it matters that the TLB flush is still pending? Perhaps I'm wrong...

You're right — I got a bit scared that day. As long as each thread can
flush the pending TLBIs for its own folio list, it's all good.

>
> >
> > This makes me wonder if we should take inspiration from RISC-V or x86 and use a
> > cpumask to track all CPUs that have pending TLBIs. Then, we could use IPIs to
> > explicitly request those CPUs to issue DSB ISH, ensuring their TLB
> > invalidations are fully completed.
> >
> > I mean something similar to what x86 and RISC-V do, but using just a
> > simpler approach like issuing DSB ISH on the relevant CPUs.
> >
> > void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > {
> >         on_each_cpu_mask(&batch->cpumask, __ipi_flush_tlbi, NULL, NULL);
> >         cpumask_clear(&batch->cpumask);
> > }
> >
> > static void __ipi_flush_tlbi(void *info)
> > {
> >         dsb(ish);
> > }
>
> My initial instinct is yuk :). I guess we would need to do a bunch of
> benchmarking if this is needed. But would be good to avoid if possible. Let's
> figure out if we definitely have a race first...
>
> Thanks,
> Ryan
>
> >
> > Sorry for the mess I made earlier.
>
> No worries, it happens.
>
> >
> >>
> >> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
> >> This aligns with what the other arches that implement the tlbbatch
> >> feature do.
> >>
> >> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
> >>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> >> index eba1a98657f1..7d564c2a126f 100644
> >> --- a/arch/arm64/include/asm/tlbflush.h
> >> +++ b/arch/arm64/include/asm/tlbflush.h
> >> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>  }
> >>
> >>  /*
> >> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> >> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> >> - * flush_tlb_batched_pending().
> >> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
> >> + * all the previously issued TLBIs targeting mm have completed. But since we
> >> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
> >> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
> >>   */
> >>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> >>  {
> >> -       dsb(ish);
> >> +       flush_tlb_mm(mm);
> >>  }
> >>
> >>  /*
> >> --
> >> 2.43.0
> >>
> >

Thanks
Barry
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Will Deacon 6 months, 2 weeks ago
On Fri, 30 May 2025 16:23:47 +0100, Ryan Roberts wrote:
> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> a parallel reclaim leaving stale TLB entries") describes a race that,
> prior to the commit, could occur between reclaim and operations such as
> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> details but the summary is:
> 
> """
> Nadav Amit identified a theoritical race between page reclaim and
> mprotect due to TLB flushes being batched outside of the PTL being held.
> 
> [...]

Applied to arm64 (for-next/core), thanks!

[1/1] arm64/mm: Close theoretical race where stale TLB entry remains valid
      https://git.kernel.org/arm64/c/4b634918384c

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Will Deacon 6 months, 2 weeks ago
On Fri, May 30, 2025 at 04:23:47PM +0100, Ryan Roberts wrote:
> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> a parallel reclaim leaving stale TLB entries") describes a race that,
> prior to the commit, could occur between reclaim and operations such as
> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> details but the summary is:
> 
> """
> Nadav Amit identified a theoritical race between page reclaim and
> mprotect due to TLB flushes being batched outside of the PTL being held.
> 
> He described the race as follows:
> 
> 	CPU0				CPU1
> 	----				----
> 					user accesses memory using RW PTE
> 					[PTE now cached in TLB]
> 	try_to_unmap_one()
> 	==> ptep_get_and_clear()
> 	==> set_tlb_ubc_flush_pending()
> 					mprotect(addr, PROT_READ)
> 					==> change_pte_range()
> 					==> [ PTE non-present - no flush ]
> 
> 					user writes using cached RW PTE
> 	...
> 
> 	try_to_unmap_flush()
> """
> 
> The solution was to insert flush_tlb_batched_pending() in mprotect() and
> friends to explcitly drain any pending reclaim TLB flushes. In the
> modern version of this solution, arch_flush_tlb_batched_pending() is
> called to do that synchronisation.
> 
> arm64's tlbbatch implementation simply issues TLBIs at queue-time
> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
> of the batch to wait for all the issued TLBIs to complete.
> 
> Now, the Arm ARM states:
> 
> """
> The completion of the TLB maintenance instruction is guaranteed only by
> the execution of a DSB by the observer that performed the TLB
> maintenance instruction. The execution of a DSB by a different observer
> does not have this effect, even if the DSB is known to be executed after
> the TLB maintenance instruction is observed by that different observer.
> """
> 
> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
> requirement because they are called from the same task (either kswapd or
> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
> if the task was migrated, __switch_to() contains an extra dsb(ish).
> 
> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
> a dsb(ish). But this may be running on a CPU remote from the one that
> issued the outstanding TLBIs. So there is no architectural gurantee of
> synchonization. Therefore we are still vulnerable to the theoretical
> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
> potentially racing with a parallel reclaim leaving stale TLB entries").
> 
> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
> This aligns with what the other arches that implement the tlbbatch
> feature do.
> 
> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")

Barry -- it would be great if you could re-run some of the benchmarks
from that commit with this fix applied.

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index eba1a98657f1..7d564c2a126f 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>  }
> 
>  /*
> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> - * flush_tlb_batched_pending().
> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
> + * all the previously issued TLBIs targeting mm have completed. But since we
> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>   */
>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>  {
> -	dsb(ish);
> +	flush_tlb_mm(mm);
>  }

Thanks, Ryan. I'll pick this as a fix, but perhaps the core code should
do this given that all the architectures selecting
ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH now have an identical implementation
of arch_flush_tlb_batched_pending()?

Will
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Ryan Roberts 6 months, 2 weeks ago
On 02/06/2025 13:00, Will Deacon wrote:
> On Fri, May 30, 2025 at 04:23:47PM +0100, Ryan Roberts wrote:
>> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
>> a parallel reclaim leaving stale TLB entries") describes a race that,
>> prior to the commit, could occur between reclaim and operations such as
>> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
>> details but the summary is:
>>
>> """
>> Nadav Amit identified a theoritical race between page reclaim and
>> mprotect due to TLB flushes being batched outside of the PTL being held.
>>
>> He described the race as follows:
>>
>> 	CPU0				CPU1
>> 	----				----
>> 					user accesses memory using RW PTE
>> 					[PTE now cached in TLB]
>> 	try_to_unmap_one()
>> 	==> ptep_get_and_clear()
>> 	==> set_tlb_ubc_flush_pending()
>> 					mprotect(addr, PROT_READ)
>> 					==> change_pte_range()
>> 					==> [ PTE non-present - no flush ]
>>
>> 					user writes using cached RW PTE
>> 	...
>>
>> 	try_to_unmap_flush()
>> """
>>
>> The solution was to insert flush_tlb_batched_pending() in mprotect() and
>> friends to explcitly drain any pending reclaim TLB flushes. In the
>> modern version of this solution, arch_flush_tlb_batched_pending() is
>> called to do that synchronisation.
>>
>> arm64's tlbbatch implementation simply issues TLBIs at queue-time
>> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
>> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
>> of the batch to wait for all the issued TLBIs to complete.
>>
>> Now, the Arm ARM states:
>>
>> """
>> The completion of the TLB maintenance instruction is guaranteed only by
>> the execution of a DSB by the observer that performed the TLB
>> maintenance instruction. The execution of a DSB by a different observer
>> does not have this effect, even if the DSB is known to be executed after
>> the TLB maintenance instruction is observed by that different observer.
>> """
>>
>> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
>> requirement because they are called from the same task (either kswapd or
>> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
>> if the task was migrated, __switch_to() contains an extra dsb(ish).
>>
>> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
>> a dsb(ish). But this may be running on a CPU remote from the one that
>> issued the outstanding TLBIs. So there is no architectural gurantee of
>> synchonization. Therefore we are still vulnerable to the theoretical
>> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
>> potentially racing with a parallel reclaim leaving stale TLB entries").
>>
>> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
>> This aligns with what the other arches that implement the tlbbatch
>> feature do.
>>
>> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> 
> Barry -- it would be great if you could re-run some of the benchmarks
> from that commit with this fix applied.

Worth rerunning if possible, but I would guess that those benchmarks will still
show a similar improvement because they are measuring the cost of doing the
TLB flushing. But with the fix, there is an extra cost that those benchmarks
probably won't measure: subsequent work within the target mm will have no VAs
cached in the TLB, so the miss rate will be much higher.

> 
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index eba1a98657f1..7d564c2a126f 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>  }
>>
>>  /*
>> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
>> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
>> - * flush_tlb_batched_pending().
>> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
>> + * all the previously issued TLBIs targeting mm have completed. But since we
>> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
>> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>>   */
>>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>>  {
>> -	dsb(ish);
>> +	flush_tlb_mm(mm);
>>  }
> 
> Thanks, Ryan. I'll pick this as a fix, but perhaps the core code should
> do this given that all the architectures selecting
> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH now have an identical implementation
> of arch_flush_tlb_batched_pending()?

Ha, yes... in fact it looks like that's what it did prior to commit db6c1f6f236d
("mm/tlbbatch: introduce arch_flush_tlb_batched_pending()").

I'll do that tidy-up once this fix appears in mm-unstable.
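
Something along these lines, i.e. fold the flush back into the generic helper
in mm/rmap.c and drop the per-arch hooks (a sketch of the direction only, not
the actual tidy-up patch):

void flush_tlb_batched_pending(struct mm_struct *mm)
{
        int batch = atomic_read(&mm->tlb_flush_batched);
        int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
        int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

        if (pending != flushed) {
                /* was arch_flush_tlb_batched_pending(mm); now common for all arches */
                flush_tlb_mm(mm);
                /*
                 * If the new TLB flushing is pending during flushing, leave
                 * mm->tlb_flush_batched as is, to avoid losing flushing.
                 */
                atomic_cmpxchg(&mm->tlb_flush_batched, batch,
                               pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
        }
}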

Thanks,
Ryan


> 
> Will
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Barry Song 6 months, 2 weeks ago
On Tue, Jun 3, 2025 at 2:00 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 02/06/2025 13:00, Will Deacon wrote:
> > On Fri, May 30, 2025 at 04:23:47PM +0100, Ryan Roberts wrote:
> >> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> >> a parallel reclaim leaving stale TLB entries") describes a race that,
> >> prior to the commit, could occur between reclaim and operations such as
> >> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> >> details but the summary is:
> >>
> >> """
> >> Nadav Amit identified a theoritical race between page reclaim and
> >> mprotect due to TLB flushes being batched outside of the PTL being held.
> >>
> >> He described the race as follows:
> >>
> >>      CPU0                            CPU1
> >>      ----                            ----
> >>                                      user accesses memory using RW PTE
> >>                                      [PTE now cached in TLB]
> >>      try_to_unmap_one()
> >>      ==> ptep_get_and_clear()
> >>      ==> set_tlb_ubc_flush_pending()
> >>                                      mprotect(addr, PROT_READ)
> >>                                      ==> change_pte_range()
> >>                                      ==> [ PTE non-present - no flush ]
> >>
> >>                                      user writes using cached RW PTE
> >>      ...
> >>
> >>      try_to_unmap_flush()
> >> """
> >>
> >> The solution was to insert flush_tlb_batched_pending() in mprotect() and
> >> friends to explcitly drain any pending reclaim TLB flushes. In the
> >> modern version of this solution, arch_flush_tlb_batched_pending() is
> >> called to do that synchronisation.
> >>
> >> arm64's tlbbatch implementation simply issues TLBIs at queue-time
> >> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
> >> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
> >> of the batch to wait for all the issued TLBIs to complete.
> >>
> >> Now, the Arm ARM states:
> >>
> >> """
> >> The completion of the TLB maintenance instruction is guaranteed only by
> >> the execution of a DSB by the observer that performed the TLB
> >> maintenance instruction. The execution of a DSB by a different observer
> >> does not have this effect, even if the DSB is known to be executed after
> >> the TLB maintenance instruction is observed by that different observer.
> >> """
> >>
> >> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
> >> requirement because they are called from the same task (either kswapd or
> >> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
> >> if the task was migrated, __switch_to() contains an extra dsb(ish).
> >>
> >> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
> >> a dsb(ish). But this may be running on a CPU remote from the one that
> >> issued the outstanding TLBIs. So there is no architectural gurantee of
> >> synchonization. Therefore we are still vulnerable to the theoretical
> >> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
> >> potentially racing with a parallel reclaim leaving stale TLB entries").
> >>
> >> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
> >> This aligns with what the other arches that implement the tlbbatch
> >> feature do.
> >>
> >> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> >
> > Barry -- it would be great if you could re-run some of the benchmarks
> > from that commit with this fix applied.
>
> Worth rerunning if possible, but I would guess that those benchmarks will still
> show the similar improvement because they are measuring the cost of doing the
> TLB flushing. But with the fix, there is an extra cost that those benchmarks
> probably won't measure; subsequent work within the target mm will have no VAs
> cached in the TLB so the miss rate will be much higher.

Right, not sure if we have a suitable benchmark to measure that
side effect, but I assume reclamation speed matters more than TLB
warmth when we're actively reclaiming memory.

This was originally introduced in commit 3ea277194daae
("mm, mprotect: flush TLB if potentially racing with a parallel
reclaim leaving stale TLB entries").

Cc'ing Mel to see if he has any comments.

>
> >
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
> >>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> >> index eba1a98657f1..7d564c2a126f 100644
> >> --- a/arch/arm64/include/asm/tlbflush.h
> >> +++ b/arch/arm64/include/asm/tlbflush.h
> >> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>  }
> >>
> >>  /*
> >> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> >> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> >> - * flush_tlb_batched_pending().
> >> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
> >> + * all the previously issued TLBIs targeting mm have completed. But since we
> >> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
> >> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
> >>   */
> >>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> >>  {
> >> -    dsb(ish);
> >> +    flush_tlb_mm(mm);
> >>  }
> >
> > Thanks, Ryan. I'll pick this as a fix, but perhaps the core code should
> > do this given that all the architectures selecting
> > ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH now have an identical implementation
> > of arch_flush_tlb_batched_pending()?
>
> Ha, yes... infact it looks like that's what it did prior to commit db6c1f6f236d
> ("mm/tlbbatch: introduce arch_flush_tlb_batched_pending()").

Yep, it was just a flush_tlb_mm(mm) inside flush_tlb_batched_pending().

>
> I'll do that tidy up once this fix appears in mm-unstable.
>
> Thanks,
> Ryan
>
>
> >
> > Will
>

Thanks
Barry
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Dev Jain 6 months, 2 weeks ago
On 30/05/25 8:53 pm, Ryan Roberts wrote:
> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> a parallel reclaim leaving stale TLB entries") describes a race that,
> prior to the commit, could occur between reclaim and operations such as
> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> details but the summary is:
>
> """
> Nadav Amit identified a theoritical race between page reclaim and
> mprotect due to TLB flushes being batched outside of the PTL being held.
>
> He described the race as follows:
>
> 	CPU0				CPU1
> 	----				----
> 					user accesses memory using RW PTE
> 					[PTE now cached in TLB]
> 	try_to_unmap_one()
> 	==> ptep_get_and_clear()
> 	==> set_tlb_ubc_flush_pending()
> 					mprotect(addr, PROT_READ)
> 					==> change_pte_range()
> 					==> [ PTE non-present - no flush ]
>
> 					user writes using cached RW PTE
> 	...
>
> 	try_to_unmap_flush()
> """
>
> The solution was to insert flush_tlb_batched_pending() in mprotect() and
> friends to explcitly drain any pending reclaim TLB flushes. In the
> modern version of this solution, arch_flush_tlb_batched_pending() is
> called to do that synchronisation.
>
> arm64's tlbbatch implementation simply issues TLBIs at queue-time
> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
> of the batch to wait for all the issued TLBIs to complete.
>
> Now, the Arm ARM states:
>
> """
> The completion of the TLB maintenance instruction is guaranteed only by
> the execution of a DSB by the observer that performed the TLB
> maintenance instruction. The execution of a DSB by a different observer
> does not have this effect, even if the DSB is known to be executed after
> the TLB maintenance instruction is observed by that different observer.
> """
>
> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
> requirement because they are called from the same task (either kswapd or
> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
> if the task was migrated, __switch_to() contains an extra dsb(ish).
>
> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
> a dsb(ish). But this may be running on a CPU remote from the one that
> issued the outstanding TLBIs. So there is no architectural gurantee of
> synchonization. Therefore we are still vulnerable to the theoretical
> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
> potentially racing with a parallel reclaim leaving stale TLB entries").
>
> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
> This aligns with what the other arches that implement the tlbbatch
> feature do.
>
> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")

Do we need Cc stable?

The patch logic looks good to me, but again, will leave it to the experts : )

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/include/asm/tlbflush.h | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index eba1a98657f1..7d564c2a126f 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>   }
>
>   /*
> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> - * flush_tlb_batched_pending().
> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
> + * all the previously issued TLBIs targeting mm have completed. But since we
> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>    */
>   static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>   {
> -	dsb(ish);
> +	flush_tlb_mm(mm);
>   }
>
>   /*
> --
> 2.43.0
>
>
Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Ryan Roberts 6 months, 2 weeks ago
On 02/06/2025 05:59, Dev Jain wrote:
> 
> On 30/05/25 8:53 pm, Ryan Roberts wrote:
>> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
>> a parallel reclaim leaving stale TLB entries") describes a race that,
>> prior to the commit, could occur between reclaim and operations such as
>> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
>> details but the summary is:
>>
>> """
>> Nadav Amit identified a theoritical race between page reclaim and
>> mprotect due to TLB flushes being batched outside of the PTL being held.
>>
>> He described the race as follows:
>>
>>     CPU0                CPU1
>>     ----                ----
>>                         user accesses memory using RW PTE
>>                         [PTE now cached in TLB]
>>     try_to_unmap_one()
>>     ==> ptep_get_and_clear()
>>     ==> set_tlb_ubc_flush_pending()
>>                         mprotect(addr, PROT_READ)
>>                         ==> change_pte_range()
>>                         ==> [ PTE non-present - no flush ]
>>
>>                         user writes using cached RW PTE
>>     ...
>>
>>     try_to_unmap_flush()
>> """
>>
>> The solution was to insert flush_tlb_batched_pending() in mprotect() and
>> friends to explcitly drain any pending reclaim TLB flushes. In the
>> modern version of this solution, arch_flush_tlb_batched_pending() is
>> called to do that synchronisation.
>>
>> arm64's tlbbatch implementation simply issues TLBIs at queue-time
>> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
>> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
>> of the batch to wait for all the issued TLBIs to complete.
>>
>> Now, the Arm ARM states:
>>
>> """
>> The completion of the TLB maintenance instruction is guaranteed only by
>> the execution of a DSB by the observer that performed the TLB
>> maintenance instruction. The execution of a DSB by a different observer
>> does not have this effect, even if the DSB is known to be executed after
>> the TLB maintenance instruction is observed by that different observer.
>> """
>>
>> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
>> requirement because they are called from the same task (either kswapd or
>> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
>> if the task was migrated, __switch_to() contains an extra dsb(ish).
>>
>> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
>> a dsb(ish). But this may be running on a CPU remote from the one that
>> issued the outstanding TLBIs. So there is no architectural gurantee of
>> synchonization. Therefore we are still vulnerable to the theoretical
>> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
>> potentially racing with a parallel reclaim leaving stale TLB entries").
>>
>> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
>> This aligns with what the other arches that implement the tlbbatch
>> feature do.
>>
>> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> 
> Do we need Cc stable?

Yeah, good point. Assuming I haven't missed something critical that means this
isn't an issue in practice...

> 
> The patch logic looks good to me, but again, will leave it to the experts : )
> 
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   arch/arm64/include/asm/tlbflush.h | 9 +++++----
>>   1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index eba1a98657f1..7d564c2a126f 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>   }
>>
>>   /*
>> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
>> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
>> - * flush_tlb_batched_pending().
>> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
>> + * all the previously issued TLBIs targeting mm have completed. But since we
>> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
>> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>>    */
>>   static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>>   {
>> -    dsb(ish);
>> +    flush_tlb_mm(mm);
>>   }
>>
>>   /*
>> -- 
>> 2.43.0
>>
>>

Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid
Posted by Barry Song 6 months, 2 weeks ago
On Fri, May 30, 2025 at 11:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> a parallel reclaim leaving stale TLB entries") describes a race that,
> prior to the commit, could occur between reclaim and operations such as
> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> details but the summary is:
>
> """
> Nadav Amit identified a theoritical race between page reclaim and
> mprotect due to TLB flushes being batched outside of the PTL being held.
>
> He described the race as follows:
>
>         CPU0                            CPU1
>         ----                            ----
>                                         user accesses memory using RW PTE
>                                         [PTE now cached in TLB]
>         try_to_unmap_one()
>         ==> ptep_get_and_clear()
>         ==> set_tlb_ubc_flush_pending()
>                                         mprotect(addr, PROT_READ)
>                                         ==> change_pte_range()
>                                         ==> [ PTE non-present - no flush ]
>
>                                         user writes using cached RW PTE
>         ...
>
>         try_to_unmap_flush()
> """
>
> The solution was to insert flush_tlb_batched_pending() in mprotect() and
> friends to explcitly drain any pending reclaim TLB flushes. In the
> modern version of this solution, arch_flush_tlb_batched_pending() is
> called to do that synchronisation.
>
> arm64's tlbbatch implementation simply issues TLBIs at queue-time
> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
> of the batch to wait for all the issued TLBIs to complete.
>
> Now, the Arm ARM states:
>
> """
> The completion of the TLB maintenance instruction is guaranteed only by
> the execution of a DSB by the observer that performed the TLB
> maintenance instruction. The execution of a DSB by a different observer
> does not have this effect, even if the DSB is known to be executed after
> the TLB maintenance instruction is observed by that different observer.
> """
>
> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
> requirement because they are called from the same task (either kswapd or
> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
> if the task was migrated, __switch_to() contains an extra dsb(ish).
>
> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
> a dsb(ish). But this may be running on a CPU remote from the one that
> issued the outstanding TLBIs. So there is no architectural gurantee of
> synchonization. Therefore we are still vulnerable to the theoretical
> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
> potentially racing with a parallel reclaim leaving stale TLB entries").
>
> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
> This aligns with what the other arches that implement the tlbbatch
> feature do.

Thanks, Ryan. I’m not the ARM expert on this modification,
but your explanation seems reasonable to me.
I’ll leave the judgment to Catalin, Will, and Mark.

>
> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index eba1a98657f1..7d564c2a126f 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>  }
>
>  /*
> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> - * flush_tlb_batched_pending().
> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
> + * all the previously issued TLBIs targeting mm have completed. But since we
> + * can be executing on a remote CPU, a DSB cannot guarrantee this like it can
> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>   */
>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>  {
> -       dsb(ish);
> +       flush_tlb_mm(mm);
>  }
>
>  /*
> --
> 2.43.0
>

Thanks
Barry