When freeing or unsharing page tables, we send an IPI to synchronize with
concurrent lockless page table walkers (e.g. GUP-fast). Today that IPI is
broadcast to all CPUs, which is costly on large machines and hurts RT
workloads[1].
This series makes those IPIs targeted. A per-CPU variable,
active_lockless_pt_walk_mm, records which mm (if any) each CPU is
currently walking locklessly. When we need to synchronize, we IPI only the
CPUs whose tracker matches that mm. GUP-fast and perf_get_page_size()
set/clear the tracker around their walks; tlb_remove_table_sync_mm()
consults it and replaces the previous broadcast in the free/unshare paths.
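
For reference, the core of the mechanism can be sketched roughly as below.
This is simplified illustration, not the exact patch code: the begin/end
helper names are made up here, though active_lockless_pt_walk_mm and
tlb_remove_table_sync_mm() follow the series.

#include <linux/mm_types.h>
#include <linux/percpu.h>
#include <linux/smp.h>

/* Which mm (if any) this CPU is currently walking locklessly. */
static DEFINE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);

/* Illustrative helper names; the walkers bracket their walk with these. */
static inline void lockless_pt_walk_begin(struct mm_struct *mm)
{
        this_cpu_write(active_lockless_pt_walk_mm, mm);
        /* Order the tracker write before the page table reads. */
        smp_mb();
}

static inline void lockless_pt_walk_end(void)
{
        smp_mb();
        this_cpu_write(active_lockless_pt_walk_mm, NULL);
}

static void tlb_remove_table_smp_sync(void *arg)
{
        /* IPI delivery alone serializes us against the walker. */
}

/* Replaces the old broadcast: IPI only CPUs walking this mm. */
void tlb_remove_table_sync_mm(struct mm_struct *mm)
{
        int cpu;

        for_each_online_cpu(cpu) {
                if (per_cpu(active_lockless_pt_walk_mm, cpu) == mm)
                        smp_call_function_single(cpu,
                                        tlb_remove_table_smp_sync,
                                        NULL, 1);
        }
}

Since GUP-fast runs with IRQs disabled, the IPI cannot be delivered until
the walker leaves its critical section, which is what provides the
synchronization; the tracker merely narrows who gets IPIed.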
On x86, when the TLB flush path already sends IPIs (native without INVLPGB,
or KVM), the extra sync IPI is redundant. We add a property to pv_mmu_ops
so each backend can declare whether its flush_tlb_multi() sends real IPIs;
if so, tlb_remove_table_sync_mm() becomes a no-op. We also make tlb_flush()
pass both freed_tables and unshared_tables so that lazy-TLB CPUs get IPIs
during hugetlb unshare.
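
The shape of the x86 short-circuit is roughly the following. Again a
sketch under assumptions: the bool field name is illustrative, and
tlb_table_flush_implies_ipi_broadcast() is the helper name from an earlier
revision of this series, which may have changed since.

/*
 * One new property on pv_mmu_ops, set once at init by each backend
 * (native, KVM, Hyper-V, Xen).
 */
struct pv_mmu_ops {
        /* ... existing members ... */

        /* Does flush_tlb_multi() send real IPIs to remote CPUs? */
        bool flush_tlb_multi_implies_ipi;       /* illustrative name */
};

static inline bool tlb_table_flush_implies_ipi_broadcast(void)
{
        return pv_ops.mmu.flush_tlb_multi_implies_ipi;
}

void tlb_remove_table_sync_mm(struct mm_struct *mm)
{
        /*
         * If the preceding TLB flush already IPIed every CPU that
         * could be mid-walk, a second sync IPI adds nothing.
         */
        if (tlb_table_flush_implies_ipi_broadcast())
                return;

        /* ... targeted IPI loop from the sketch above ... */
}

Having tlb_flush() see both freed_tables and unshared_tables keeps this
short-circuit safe in the hugetlb unshare path, since lazy-TLB CPUs are
then included in the flush IPIs.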
David Hildenbrand did the initial implementation. I built on his work and
relied on off-list discussions to push it further - thanks a lot, David!
[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
v3 -> v4:
- Rework based on David's two-step direction and per-CPU idea:
1) Targeted IPIs: per-CPU variable when entering/leaving lockless page
table walk; tlb_remove_table_sync_mm() IPIs only those CPUs.
2) On x86, pv_mmu_ops property set at init to skip the extra sync when
flush_tlb_multi() already sends IPIs.
https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/
- https://lore.kernel.org/linux-mm/20260106120303.38124-1-lance.yang@linux.dev/
v2 -> v3:
- Complete rewrite: use dynamic IPI tracking instead of static checks
(per Dave Hansen, thanks!)
- Track IPIs via mmu_gather: native_flush_tlb_multi() sets flag when
actually sending IPIs
- Motivation for skipping redundant IPIs explained by David:
https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
- https://lore.kernel.org/linux-mm/20251229145245.85452-1-lance.yang@linux.dev/
v1 -> v2:
- Fix cover letter encoding to resolve send-email issues. Apologies for
any email flood caused by the failed send attempts :(
RFC -> v1:
- Use a callback function in pv_mmu_ops instead of comparing function
pointers (per David)
- Embed the check directly in tlb_remove_table_sync_one() instead of
requiring every caller to check explicitly (per David)
- Move tlb_table_flush_implies_ipi_broadcast() outside of
CONFIG_MMU_GATHER_RCU_TABLE_FREE to fix build error on architectures
that don't enable this config.
https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
- https://lore.kernel.org/linux-mm/20251213080038.10917-1-lance.yang@linux.dev/
Lance Yang (3):
mm: use targeted IPIs for TLB sync with lockless page table walkers
mm: switch callers to tlb_remove_table_sync_mm()
x86/tlb: add architecture-specific TLB IPI optimization support
arch/x86/hyperv/mmu.c | 5 ++
arch/x86/include/asm/paravirt.h | 5 ++
arch/x86/include/asm/paravirt_types.h | 6 +++
arch/x86/include/asm/tlb.h | 20 +++++++-
arch/x86/kernel/kvm.c | 6 +++
arch/x86/kernel/paravirt.c | 18 +++++++
arch/x86/kernel/smpboot.c | 1 +
arch/x86/xen/mmu_pv.c | 2 +
include/asm-generic/tlb.h | 28 +++++++++--
include/linux/mm.h | 34 +++++++++++++
kernel/events/core.c | 2 +
mm/gup.c | 2 +
mm/khugepaged.c | 2 +-
mm/mmu_gather.c | 69 ++++++++++++++++++++++++---
14 files changed, 187 insertions(+), 13 deletions(-)
--
2.49.0