[PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers
Posted by Lance Yang 5 days, 4 hours ago
When freeing or unsharing page tables we send an IPI to synchronize with
concurrent lockless page table walkers (e.g. GUP-fast). Today we broadcast
that IPI to all CPUs, which is costly on large machines and hurts RT
workloads[1].

This series makes those IPIs targeted. We track which CPUs are currently
doing a lockless page table walk for a given mm (per-CPU
active_lockless_pt_walk_mm). When we need to sync, we only IPI those CPUs.
GUP-fast and perf_get_page_size() set/clear the tracker around their walk;
tlb_remove_table_sync_mm() uses it and replaces the previous broadcast in
the free/unshare paths.
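
A minimal sketch of the idea (only active_lockless_pt_walk_mm and
tlb_remove_table_sync_mm() are names from the series; the surrounding
helpers, includes, and exact synchronization details are illustrative
assumptions, not the patches themselves):

#include <linux/mm_types.h>
#include <linux/percpu.h>
#include <linux/smp.h>

/* Which mm (if any) this CPU is currently walking locklessly. */
static DEFINE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);

/* Called by GUP-fast / perf_get_page_size() with IRQs disabled around the
 * walk, so the sync IPI below cannot slip in between publish and walk. */
static inline void lockless_pt_walk_begin(struct mm_struct *mm)
{
	this_cpu_write(active_lockless_pt_walk_mm, mm);
}

static inline void lockless_pt_walk_end(void)
{
	this_cpu_write(active_lockless_pt_walk_mm, NULL);
}

/* Empty handler: receiving the IPI is the synchronization point. */
static void pt_walk_sync_ipi(void *ignored)
{
}

void tlb_remove_table_sync_mm(struct mm_struct *mm)
{
	int cpu;

	/* IPI only the CPUs walking this mm right now, not every CPU. */
	for_each_online_cpu(cpu) {
		if (per_cpu(active_lockless_pt_walk_mm, cpu) == mm)
			smp_call_function_single(cpu, pt_walk_sync_ipi,
						 NULL, 1);
	}
}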

On x86, when the TLB flush path already sends IPIs (native without INVLPGB,
or KVM), the extra sync IPI is redundant. We add a property on pv_mmu_ops
so each backend can declare whether its flush_tlb_multi() sends real IPIs; if
so, tlb_remove_table_sync_mm() is a no-op. We also have tlb_flush() pass
both freed_tables and unshared_tables so lazy-TLB CPUs get IPIs during
hugetlb unshare.
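
Roughly, the short-circuit could look like this (flush_tlb_multi_sends_ipi
is a placeholder name; the series stores an equivalent property on
pv_mmu_ops, filled in once by each backend at init):

/* Illustrative only: placeholder for the pv_mmu_ops property. */
static bool flush_tlb_multi_sends_ipi;

void tlb_remove_table_sync_mm(struct mm_struct *mm)
{
	/*
	 * If flush_tlb_multi() is backed by real IPIs (native without
	 * INVLPGB, or KVM), freeing/unsharing the tables has already
	 * interrupted every lockless walker via the TLB flush, so the
	 * extra sync IPI is redundant.
	 */
	if (flush_tlb_multi_sends_ipi)
		return;

	/* ... otherwise fall back to the targeted IPIs sketched above ... */
}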

David Hildenbrand did the initial implementation. I built on his work and
relied on off-list discussions to push it further. Thanks a lot, David!

[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/

v3 -> v4:
- Rework based on David's two-step direction and per-CPU idea:
  1) Targeted IPIs: per-CPU variable when entering/leaving lockless page
     table walk; tlb_remove_table_sync_mm() IPIs only those CPUs.
  2) On x86, pv_mmu_ops property set at init to skip the extra sync when
     flush_tlb_multi() already sends IPIs.
  https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/
- https://lore.kernel.org/linux-mm/20260106120303.38124-1-lance.yang@linux.dev/

v2 -> v3:
- Complete rewrite: use dynamic IPI tracking instead of static checks
  (per Dave Hansen, thanks!)
- Track IPIs via mmu_gather: native_flush_tlb_multi() sets flag when
  actually sending IPIs
- Motivation for skipping redundant IPIs explained by David:
  https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
- https://lore.kernel.org/linux-mm/20251229145245.85452-1-lance.yang@linux.dev/

v1 -> v2:
- Fix cover letter encoding to resolve send-email issues. Apologies for
  any email flood caused by the failed send attempts :(

RFC -> v1:
- Use a callback function in pv_mmu_ops instead of comparing function
  pointers (per David)
- Embed the check directly in tlb_remove_table_sync_one() instead of
  requiring every caller to check explicitly (per David)
- Move tlb_table_flush_implies_ipi_broadcast() outside of
  CONFIG_MMU_GATHER_RCU_TABLE_FREE to fix a build error on architectures
  that don't enable this config.
  https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
- https://lore.kernel.org/linux-mm/20251213080038.10917-1-lance.yang@linux.dev/

Lance Yang (3):
  mm: use targeted IPIs for TLB sync with lockless page table walkers
  mm: switch callers to tlb_remove_table_sync_mm()
  x86/tlb: add architecture-specific TLB IPI optimization support

 arch/x86/hyperv/mmu.c                 |  5 ++
 arch/x86/include/asm/paravirt.h       |  5 ++
 arch/x86/include/asm/paravirt_types.h |  6 +++
 arch/x86/include/asm/tlb.h            | 20 +++++++-
 arch/x86/kernel/kvm.c                 |  6 +++
 arch/x86/kernel/paravirt.c            | 18 +++++++
 arch/x86/kernel/smpboot.c             |  1 +
 arch/x86/xen/mmu_pv.c                 |  2 +
 include/asm-generic/tlb.h             | 28 +++++++++--
 include/linux/mm.h                    | 34 +++++++++++++
 kernel/events/core.c                  |  2 +
 mm/gup.c                              |  2 +
 mm/khugepaged.c                       |  2 +-
 mm/mmu_gather.c                       | 69 ++++++++++++++++++++++++---
 14 files changed, 187 insertions(+), 13 deletions(-)

-- 
2.49.0
Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers
Posted by Peter Zijlstra 5 days, 2 hours ago
On Mon, Feb 02, 2026 at 03:45:54PM +0800, Lance Yang wrote:
> When freeing or unsharing page tables we send an IPI to synchronize with
> concurrent lockless page table walkers (e.g. GUP-fast). Today we broadcast
> that IPI to all CPUs, which is costly on large machines and hurts RT
> workloads[1].
> 
> This series makes those IPIs targeted. We track which CPUs are currently
> doing a lockless page table walk for a given mm (per-CPU
> active_lockless_pt_walk_mm). When we need to sync, we only IPI those CPUs.
> GUP-fast and perf_get_page_size() set/clear the tracker around their walk;
> tlb_remove_table_sync_mm() uses it and replaces the previous broadcast in
> the free/unshare paths.

I'm confused. This only happens when !PT_RECLAIM, because if PT_RECLAIM
__tlb_remove_table_one() actually uses RCU.

So why are you making things more expensive for no reason?