[RFC PATCH v1 0/4] riscv: mm: Defer tlb flush to context_switch
Posted by Xu Lu 3 months, 1 week ago
When we need to flush the TLB of a remote CPU, there is no need to send an
IPI if the target CPU is not currently using the ASID we want to flush.
Instead, we can cache the TLB flush info in a percpu buffer and defer the
flush to the next context_switch.
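
[Editor's note] The mechanism described above can be sketched in
user-space C. All names, sizes, and types below are hypothetical
illustrations of the idea, not code from the actual patches:

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS        4
#define FLUSH_QUEUE_SZ 8

struct flush_req {            /* one deferred TLB flush request */
    unsigned long asid;
    unsigned long start, size;
};

struct flush_queue {
    struct flush_req reqs[FLUSH_QUEUE_SZ];
    size_t len;
};

static struct flush_queue percpu_queue[NR_CPUS];
static unsigned long loaded_asid[NR_CPUS];   /* ASID active on each CPU */

/* Returns true if an IPI is still required: either the target CPU is
 * currently running the ASID, or the queue overflowed and we must fall
 * back to an immediate flush. Otherwise the request is queued. */
static bool defer_remote_flush(int cpu, unsigned long asid,
                               unsigned long start, unsigned long size)
{
    struct flush_queue *q = &percpu_queue[cpu];

    if (loaded_asid[cpu] == asid)
        return true;                 /* ASID is live: IPI now */
    if (q->len == FLUSH_QUEUE_SZ)
        return true;                 /* queue full: fall back to IPI */

    q->reqs[q->len++] = (struct flush_req){ asid, start, size };
    return false;                    /* flush deferred, no IPI sent */
}

/* Drain the local queue at context_switch time; returns how many
 * deferred requests were applied. A real implementation would issue
 * sfence.vma for each request here. */
static size_t drain_flush_queue(int cpu)
{
    size_t n = percpu_queue[cpu].len;

    percpu_queue[cpu].len = 0;
    return n;
}
```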

This reduces the number of IPIs caused by TLB flushes:

* ltp - mmapstress01
Before: ~108k
After: ~46k

Future plans for the next version:

- This patch series reduces IPIs by deferring TLB flushes to
context_switch. It does not clear the mm_cpumask of the target mm_struct.
In the next version, I will apply a threshold to the number of ASIDs
maintained in each CPU's TLB. Once the threshold is exceeded, the ASID
that has been unused for the longest time will be flushed out, and the
current CPU will be cleared from that mm's mm_cpumask.
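
[Editor's note] The planned threshold/eviction could look roughly like
the following LRU sketch. The threshold value and all names are
illustrative, not from the series:

```c
#include <stddef.h>

#define ASID_THRESHOLD 4   /* max ASIDs tracked per CPU (illustrative) */

struct asid_slot {
    unsigned long asid;       /* 0 = slot unused */
    unsigned long last_used;  /* logical timestamp */
};

static struct asid_slot slots[ASID_THRESHOLD];
static unsigned long clock_tick;

/* Record that @asid was just scheduled on this CPU. Returns the ASID
 * evicted to stay under the threshold, or 0 if nothing was evicted.
 * The caller would flush the victim ASID and clear this CPU from the
 * corresponding mm's mm_cpumask. */
static unsigned long touch_asid(unsigned long asid)
{
    size_t i, free_slot = ASID_THRESHOLD, lru = 0;

    ++clock_tick;
    for (i = 0; i < ASID_THRESHOLD; i++) {
        if (slots[i].asid == asid) {
            slots[i].last_used = clock_tick;
            return 0;                         /* already tracked */
        }
        if (slots[i].asid == 0)
            free_slot = i;
        else if (slots[i].last_used < slots[lru].last_used)
            lru = i;
    }
    if (free_slot < ASID_THRESHOLD) {         /* still below threshold */
        slots[free_slot] = (struct asid_slot){ asid, clock_tick };
        return 0;
    }
    /* threshold exceeded: evict the least recently used ASID */
    unsigned long victim = slots[lru].asid;
    slots[lru] = (struct asid_slot){ asid, clock_tick };
    return victim;
}
```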

Thanks in advance for your comments.

Xu Lu (4):
  riscv: mm: Introduce percpu loaded_asid
  riscv: mm: Introduce percpu tlb flush queue
  riscv: mm: Enqueue tlbflush info if task is not running on target cpu
  riscv: mm: Perform tlb flush during context_switch

 arch/riscv/include/asm/mmu_context.h |  1 +
 arch/riscv/include/asm/tlbflush.h    |  4 ++
 arch/riscv/mm/context.c              | 10 ++++
 arch/riscv/mm/tlbflush.c             | 76 +++++++++++++++++++++++++++-
 4 files changed, 90 insertions(+), 1 deletion(-)

-- 
2.20.1
Re: [RFC PATCH v1 0/4] riscv: mm: Defer tlb flush to context_switch
Posted by Guo Ren 3 months ago
On Thu, Oct 30, 2025 at 9:57 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
>
> When need to flush tlb of a remote cpu, there is no need to send an IPI
> if the target cpu is not using the asid we want to flush. Instead, we
> can cache the tlb flush info in percpu buffer, and defer the tlb flush
> to the next context_switch.
>
> This reduces the number of IPI due to tlb flush:
>
> * ltp - mmapstress01
> Before: ~108k
> After: ~46k

Could you add the results for these two test cases to the next version?

* lmbench - lat_pagefault
* lmbench - lat_mmap

Thank you!



-- 
Best Regards
 Guo Ren
Re: [External] Re: [RFC PATCH v1 0/4] riscv: mm: Defer tlb flush to context_switch
Posted by Xu Lu 3 months ago
On Fri, Nov 7, 2025 at 9:56 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 9:57 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
> >
> > When need to flush tlb of a remote cpu, there is no need to send an IPI
> > if the target cpu is not using the asid we want to flush. Instead, we
> > can cache the tlb flush info in percpu buffer, and defer the tlb flush
> > to the next context_switch.
> >
> > This reduces the number of IPI due to tlb flush:
> >
> > * ltp - mmapstress01
> > Before: ~108k
> > After: ~46k
>
> Could you add the results for these two test cases to the next version?
>
> * lmbench - lat_pagefault
> * lmbench - lat_mmap

Roger that. Thanks for the suggestion.

Re: [RFC PATCH v1 0/4] riscv: mm: Defer tlb flush to context_switch
Posted by Guo Ren 3 months ago
On Thu, Oct 30, 2025 at 9:57 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
>
> When need to flush tlb of a remote cpu, there is no need to send an IPI
> if the target cpu is not using the asid we want to flush. Instead, we
> can cache the tlb flush info in percpu buffer, and defer the tlb flush
> to the next context_switch.
>
> This reduces the number of IPI due to tlb flush:
>
> * ltp - mmapstress01
> Before: ~108k
> After: ~46k
Great result!

I have some questions:
1. Do we need an accurate per-address flush via a new queue of
flush_tlb_range_data? Why not flush the whole ASID?
2. If we reuse the context_tlb_flush_pending mechanism, could
mmapstress01 get a better result than ~46k?
3. If we hit the kernel address space, we must use an IPI flush
immediately, but I didn't see your patches consider that case, or am I
wrong?



-- 
Best Regards
 Guo Ren
Re: [External] Re: [RFC PATCH v1 0/4] riscv: mm: Defer tlb flush to context_switch
Posted by Xu Lu 3 months ago
Hi Guo Ren,

On Mon, Nov 3, 2025 at 11:44 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 9:57 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
> >
> > When need to flush tlb of a remote cpu, there is no need to send an IPI
> > if the target cpu is not using the asid we want to flush. Instead, we
> > can cache the tlb flush info in percpu buffer, and defer the tlb flush
> > to the next context_switch.
> >
> > This reduces the number of IPI due to tlb flush:
> >
> > * ltp - mmapstress01
> > Before: ~108k
> > After: ~46k
> Great result!
>
> I've some questions:
> 1. Do we need an accurate address flush by a new queue of
> flush_tlb_range_data? Why not flush the whole asid?

Flushing the whole address space may cause subsequent TLB misses.
Consider this case: a single user-mode thread frequently runs on the
target hart. When that thread falls asleep and the CPU context switches
to the idle thread, another thread of the same process, running on
another hart, modifies a mapping and needs to perform a TLB flush. The
first thread would then encounter a large number of TLB misses when it
resumes. I want to balance the IPI count against the TLB misses.
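
[Editor's note] One way to express that balance is the usual
range-vs-full heuristic, similar in spirit to the existing
tlb_flush_all_threshold tuning in arch/riscv/mm/tlbflush.c. The constant
and names below are illustrative:

```c
#include <stdbool.h>

#define FLUSH_ALL_THRESHOLD 64UL   /* pages; illustrative tuning knob */

/* Widen a deferred request to a full-ASID flush only when per-page
 * flushes would likely cost more than the refill misses they avoid. */
static bool use_full_asid_flush(unsigned long size, unsigned long stride)
{
    return size / stride > FLUSH_ALL_THRESHOLD;
}
```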

> 2. If we reuse the context_tlb_flush_pending mechanism, could
> mmapstress01 gain the result better than ~46k?

Besides lazy TLB flushing, another way to reduce IPI overhead is to
clean mm_cpumask, and it does give a better result for mmapstress01. I
have sent a patch[1] which clears mm_cpumask whenever all TLB entries of
a certain ASID are flushed, and it reduces the IPI count from ~98k to
268.

As mentioned in the previous email, in the next version I will add the
mm_cpumask clearing procedure. Specifically, I will flush all TLB
entries of an ASID and clear it from mm_cpumask whenever it hasn't been
scheduled for enough context switches.

[1] https://lore.kernel.org/all/20250827131444.23893-3-luxu.kernel@bytedance.com/

> 3. If we meet the kernel address space, we must use IPI flush
> immediately, but I didn't see your patch consider that case, or am I
> wrong?

Nice catch! I forgot to add the kernel ASID check in the
should_ipi_flush function. I will add it in the next version.

I have considered canceling the IPI and deferring the TLB flush to the
next time the target hart enters S-mode, when the target hart is
currently running in user mode. But there are too many kernel entry
points to consider, especially now that we have SSE. For kernel TLB
flushes, it may be more secure to send an IPI. Thanks.
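
[Editor's note] The missing check discussed above could be sketched as
follows: kernel mappings are global rather than tagged with a user ASID,
so a kernel-range flush must IPI immediately instead of being queued.
The function body is an illustration, not the actual patch code
(FLUSH_TLB_NO_ASID mirrors the marker used in arch/riscv):

```c
#include <stdbool.h>

#define FLUSH_TLB_NO_ASID (~0UL)   /* marker for kernel/global flushes */

static unsigned long loaded_asid;  /* ASID active on the target CPU */

static bool should_ipi_flush(unsigned long asid)
{
    /* Kernel/global flushes can never be deferred safely. */
    if (asid == FLUSH_TLB_NO_ASID)
        return true;

    /* User flushes need an IPI only if the ASID is live on the target. */
    return loaded_asid == asid;
}
```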

Best Regards,
Xu Lu
