From nobody Wed Apr 15 13:15:10 2026
From: Lance Yang
To: akpm@linux-foundation.org
Cc: peterz@infradead.org, david@kernel.org, dave.hansen@intel.com,
 dave.hansen@linux.intel.com, ypodemsk@redhat.com, hughd@google.com,
 will@kernel.org, aneesh.kumar@kernel.org, npiggin@gmail.com,
 tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org,
 hpa@zytor.com, arnd@arndb.de, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
 baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com,
 ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
 shy828301@gmail.com, riel@surriel.com, jannh@google.com, jgross@suse.com,
 seanjc@google.com, pbonzini@redhat.com, boris.ostrovsky@oracle.com,
 virtualization@lists.linux.dev, kvm@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, ioworker0@gmail.com, Lance Yang
Subject: [PATCH v6 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs
Date: Wed, 4 Mar 2026 10:10:43 +0800
Message-ID: <20260304021046.18550-2-lance.yang@linux.dev>
In-Reply-To: <20260304021046.18550-1-lance.yang@linux.dev>
References: <20260304021046.18550-1-lance.yang@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Lance Yang

When page table operations require synchronization with software/lockless
walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
TLB (tlb->freed_tables or tlb->unshared_tables).

On architectures where the TLB flush already sends IPIs to all target CPUs,
the subsequent sync IPI broadcast is redundant. This is not only costly on
large systems where it disrupts all CPUs even for single-process page table
operations, but has also been reported to hurt RT workloads[1].

Introduce tlb_table_flush_implies_ipi_broadcast() to check if the prior TLB
flush already provided the necessary synchronization. When true, the sync
calls can early-return.

A few cases rely on this synchronization:

1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
   of the PMD table for other purposes in the last remaining user after
   unsharing.

2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
   and (possibly) freeing the page table / re-depositing it.

Currently always returns false (no behavior change). The follow-up patch
will enable the optimization for x86.

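The override-and-early-return pattern described above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: model_flush_sends_ipis, model_sync_broadcasts and model_tlb_remove_table_sync_one() are hypothetical stand-ins, and the counter takes the place of the real smp_call_function() broadcast.

```c
#include <stdbool.h>

/* Hypothetical knob standing in for "the arch TLB flush already sends IPIs". */
static bool model_flush_sends_ipis;

/*
 * Arch-style override: defining the macro to its own name tells the
 * generic code below that an implementation already exists.
 */
#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
static inline bool tlb_table_flush_implies_ipi_broadcast(void)
{
	return model_flush_sends_ipis;
}

/* Generic default, compiled only when no override was defined. */
#ifndef tlb_table_flush_implies_ipi_broadcast
static inline bool tlb_table_flush_implies_ipi_broadcast(void)
{
	return false;
}
#endif

/* Counts simulated sync broadcasts (assumption: replaces smp_call_function()). */
static int model_sync_broadcasts;

static void model_tlb_remove_table_sync_one(void)
{
	/* Skip the broadcast when the prior flush already synchronized. */
	if (tlb_table_flush_implies_ipi_broadcast())
		return;

	model_sync_broadcasts++;
}
```

With the knob left false, every call to the model sync function bumps the counter, which matches the "no behavior change" default. Once the knob is true, the call returns early and the counter stays put.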
[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
[2] https://lore.kernel.org/linux-mm/6a364356-5fea-4a6c-b959-ba3b22ce9c88@kernel.org/
[3] https://lore.kernel.org/linux-mm/2cb4503d-3a3f-4f6c-8038-7b3d1c74b3c2@kernel.org/

Suggested-by: David Hildenbrand (Arm)
Signed-off-by: Lance Yang
---
 include/asm-generic/tlb.h | 17 +++++++++++++++++
 mm/mmu_gather.c           | 15 +++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index bdcc2778ac64..cb41cc6a0024 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -240,6 +240,23 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 }
 #endif /* CONFIG_MMU_GATHER_TABLE_FREE */
 
+/**
+ * tlb_table_flush_implies_ipi_broadcast - does TLB flush imply IPI sync
+ *
+ * When page table operations require synchronization with software/lockless
+ * walkers, they flush the TLB (tlb->freed_tables or tlb->unshared_tables)
+ * then call tlb_remove_table_sync_{one,rcu}(). If the flush already sent
+ * IPIs to all CPUs, the sync call is redundant.
+ *
+ * Returns false by default. Architectures can override by defining this.
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return false;
+}
+#endif
+
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 /*
  * This allows an architecture that does not use the linux page-tables for
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 3985d856de7f..37a6a711c37e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -283,6 +283,14 @@ void tlb_remove_table_sync_one(void)
 	 * It is however sufficient for software page-table walkers that rely on
 	 * IRQ disabling.
 	 */
+
+	/*
+	 * Skip IPI if the preceding TLB flush already synchronized with
+	 * all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
 	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
 }
 
@@ -312,6 +320,13 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
  */
 void tlb_remove_table_sync_rcu(void)
 {
+	/*
+	 * Skip RCU wait if the preceding TLB flush already synchronized
+	 * with all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
 	synchronize_rcu();
 }
 
-- 
2.49.0

From nobody Wed Apr 15 13:15:10 2026
From: Lance Yang
To: akpm@linux-foundation.org
Cc: peterz@infradead.org, david@kernel.org, dave.hansen@intel.com,
 dave.hansen@linux.intel.com, ypodemsk@redhat.com, hughd@google.com,
 will@kernel.org, aneesh.kumar@kernel.org, npiggin@gmail.com,
 tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org,
 hpa@zytor.com, arnd@arndb.de, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
 baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com,
 ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
 shy828301@gmail.com, riel@surriel.com, jannh@google.com, jgross@suse.com,
 seanjc@google.com, pbonzini@redhat.com, boris.ostrovsky@oracle.com,
 virtualization@lists.linux.dev, kvm@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, ioworker0@gmail.com, Lance Yang
Subject: [PATCH v6 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
Date: Wed, 4 Mar 2026 10:10:44 +0800
Message-ID: <20260304021046.18550-3-lance.yang@linux.dev>
In-Reply-To: <20260304021046.18550-1-lance.yang@linux.dev>
References: <20260304021046.18550-1-lance.yang@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Lance Yang

Enable the optimization introduced in the previous patch for x86.

Add pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast to track whether
flush_tlb_multi() sends real IPIs. Initialize it once in
native_pv_tlb_init() during boot.

tlb_table_flush_implies_ipi_broadcast() tests the tlb_ipi_broadcast_key
static key. On CONFIG_PARAVIRT systems, the key follows the pv_ops
property. On non-PARAVIRT, native_pv_tlb_init() sets it directly from the
INVLPGB check.

PV backends (KVM, Xen, Hyper-V) typically have their own implementations
and don't call native_flush_tlb_multi() directly, so they cannot be trusted
to provide the IPI guarantees we need. They keep the property false.

Two-step plan as David suggested[1]:

Step 1 (this patch): Skip redundant sync when we're 100% certain the TLB
flush sent IPIs. INVLPGB is excluded because when supported, we cannot
guarantee IPIs were sent, keeping it clean and simple.

Step 2 (future work): Send targeted IPIs only to CPUs actually doing
software/lockless page table walks, benefiting all architectures.

Regarding Step 2, it obviously only applies to setups where Step 1 does
not apply: like x86 with INVLPGB or arm64.

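The boot-time decision described above can be summarized as a small truth function. The sketch below is a userspace model under stated assumptions, not the kernel implementation: model_key_enabled() and its three parameters are hypothetical names that mirror the checks native_pv_tlb_init() performs in the CONFIG_PARAVIRT case.

```c
#include <stdbool.h>

/*
 * Returns whether the optimization would be switched on at boot.
 *
 * pv_property_set: a PV backend already set
 *                  flush_tlb_multi_implies_ipi_broadcast
 * flush_is_native: flush_tlb_multi still points at the native implementation
 * has_invlpgb:     the CPU uses INVLPGB broadcast invalidation
 */
static bool model_key_enabled(bool pv_property_set, bool flush_is_native,
			      bool has_invlpgb)
{
	/* A backend that vouches for real IPIs is respected as-is. */
	if (pv_property_set)
		return true;

	/* Otherwise only the native flush without INVLPGB is trusted. */
	return flush_is_native && !has_invlpgb;
}
```

For a non-PARAVIRT build the same function applies with pv_property_set fixed to false and flush_is_native fixed to true, which leaves only the INVLPGB check.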
[1] https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/

Suggested-by: David Hildenbrand (Arm)
Signed-off-by: Lance Yang
---
 arch/x86/include/asm/paravirt_types.h |  5 +++++
 arch/x86/include/asm/smp.h            |  3 +++
 arch/x86/include/asm/tlb.h            | 16 +++++++++++++++-
 arch/x86/include/asm/tlbflush.h       |  3 +++
 arch/x86/kernel/paravirt.c            | 20 ++++++++++++++++++++
 arch/x86/kernel/smpboot.c             |  1 +
 arch/x86/mm/tlb.c                     | 14 ++++++++++++++
 7 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 9bcf6bce88f6..ec01268f2e3e 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -112,6 +112,11 @@ struct pv_mmu_ops {
 	void (*flush_tlb_multi)(const struct cpumask *cpus,
 				const struct flush_tlb_info *info);
 
+	/*
+	 * True if flush_tlb_multi() sends real IPIs to all target CPUs.
+	 */
+	bool flush_tlb_multi_implies_ipi_broadcast;
+
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
 	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 84951572ab81..ef1fe0cc4c73 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -105,6 +105,9 @@ void native_smp_prepare_boot_cpu(void);
 void smp_prepare_cpus_common(void);
 void native_smp_prepare_cpus(unsigned int max_cpus);
 void native_smp_cpus_done(unsigned int max_cpus);
+
+void __init native_pv_tlb_init(void);
+
 int common_cpu_up(unsigned int cpunum, struct task_struct *tidle);
 int native_kick_ap(unsigned int cpu, struct task_struct *tidle);
 int native_cpu_disable(void);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..532578c5a2e7 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,19 @@
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
 #include
 #include
 #include
 #include
+#include
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return static_branch_likely(&tlb_ipi_broadcast_key);
+}
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -20,7 +29,12 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/*
+	 * Pass both freed_tables and unshared_tables so that lazy-TLB CPUs
+	 * also receive IPIs during unsharing page tables.
+	 */
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
 }
 
 static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5a3cdc439e38..a1b5efef3b90 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -18,6 +19,8 @@
 
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
+DECLARE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
 void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL	-1UL
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a6ed52cae003..c8decadf16e0 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -154,6 +154,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
+	.mmu.flush_tlb_multi_implies_ipi_broadcast = false,
 
 	.mmu.exit_mmap		= paravirt_nop,
 	.mmu.notify_page_enc_status_changed	= paravirt_nop,
@@ -221,3 +222,22 @@ NOKPROBE_SYMBOL(native_load_idt);
 
 EXPORT_SYMBOL(pv_ops);
 EXPORT_SYMBOL_GPL(pv_info);
+
+void __init native_pv_tlb_init(void)
+{
+	/*
+	 * If PV backend already set the property, respect it.
+	 * Otherwise, check if native TLB flush sends real IPIs to all target
+	 * CPUs (i.e., not using INVLPGB broadcast invalidation).
+	 */
+	if (pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast) {
+		static_branch_enable(&tlb_ipi_broadcast_key);
+		return;
+	}
+
+	if (pv_ops.mmu.flush_tlb_multi == native_flush_tlb_multi &&
+	    !cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = true;
+		static_branch_enable(&tlb_ipi_broadcast_key);
+	}
+}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..3cdb04162843 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1167,6 +1167,7 @@ void __init native_smp_prepare_boot_cpu(void)
 	switch_gdt_and_percpu_base(me);
 
 	native_pv_lock_init();
+	native_pv_tlb_init();
 }
 
 void __init native_smp_cpus_done(unsigned int max_cpus)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 621e09d049cb..7b1acfb97782 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -26,6 +26,8 @@
 
 #include "mm_internal.h"
 
+DEFINE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
 #ifdef CONFIG_PARAVIRT
 # define STATIC_NOPV
 #else
@@ -1834,3 +1836,15 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+#ifndef CONFIG_PARAVIRT
+void __init native_pv_tlb_init(void)
+{
+	/*
+	 * For non-PARAVIRT builds, check if native TLB flush sends real IPIs
+	 * (i.e., not using INVLPGB broadcast invalidation).
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		static_branch_enable(&tlb_ipi_broadcast_key);
+}
+#endif
-- 
2.49.0
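
The tlb_flush() change in arch/x86/include/asm/tlb.h passes tlb->freed_tables || tlb->unshared_tables to flush_tlb_mm_range() so that lazy-TLB CPUs also receive IPIs. A rough userspace model of that effect follows. It assumes, as the patch comment implies, that a flush without the flag may skip CPUs in lazy-TLB mode. struct model_cpu, model_flush() and model_tlb_flush() are hypothetical names, not kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_cpu {
	bool lazy_tlb;		/* CPU is in lazy-TLB mode for this mm */
	int ipis_received;	/* simulated flush IPIs delivered so far */
};

/* Assumption: lazy-TLB CPUs are skipped unless page tables changed. */
static void model_flush(struct model_cpu *cpus, size_t n, bool tables_changed)
{
	for (size_t i = 0; i < n; i++) {
		if (cpus[i].lazy_tlb && !tables_changed)
			continue;
		cpus[i].ipis_received++;
	}
}

/* Mirrors the patched tlb_flush(): either flag forces IPIs to all CPUs. */
static void model_tlb_flush(struct model_cpu *cpus, size_t n,
			    bool freed_tables, bool unshared_tables)
{
	model_flush(cpus, n, freed_tables || unshared_tables);
}
```

In this model a flush with only unshared_tables set still reaches the lazy-TLB CPU. That is the property the early return in tlb_remove_table_sync_one() relies on once the static key is enabled.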