From: Lance Yang
To: akpm@linux-foundation.org
Cc: peterz@infradead.org, david@kernel.org, dave.hansen@intel.com, dave.hansen@linux.intel.com, ypodemsk@redhat.com, hughd@google.com, will@kernel.org, aneesh.kumar@kernel.org, npiggin@gmail.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com, arnd@arndb.de, lorenzo.stoakes@oracle.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org, shy828301@gmail.com, riel@surriel.com, jannh@google.com, jgross@suse.com, seanjc@google.com, pbonzini@redhat.com, boris.ostrovsky@oracle.com, virtualization@lists.linux.dev, kvm@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, ioworker0@gmail.com, Lance Yang
Subject: [PATCH v5 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs
Date: Mon, 2 Mar 2026 14:30:35 +0800
Message-ID: <20260302063048.9479-2-lance.yang@linux.dev>
In-Reply-To: <20260302063048.9479-1-lance.yang@linux.dev>
References: <20260302063048.9479-1-lance.yang@linux.dev>

From: Lance Yang

When page table operations require synchronization with software/lockless
walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the TLB
(tlb->freed_tables or tlb->unshared_tables). On architectures where the TLB
flush already sends IPIs to all target CPUs, the subsequent sync IPI
broadcast is redundant. This is not only costly on large systems, where it
disrupts all CPUs even for single-process page table operations, but has
also been reported to hurt RT workloads[1].

Introduce tlb_table_flush_implies_ipi_broadcast() to check whether the
prior TLB flush already provided the necessary synchronization. When it
returns true, the sync calls can return early.

A few cases rely on this synchronization:

1) hugetlb PMD unshare[2]: the problem is not the freeing but the reuse of
   the PMD table for other purposes by the last remaining user after
   unsharing.

2) khugepaged collapse[3]: ensure there is no concurrent GUP-fast before
   collapsing and (possibly) freeing the page table / re-depositing it.

The helper currently always returns false, so there is no behavior change.
The follow-up patch enables the optimization for x86.
[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
[2] https://lore.kernel.org/linux-mm/6a364356-5fea-4a6c-b959-ba3b22ce9c88@kernel.org/
[3] https://lore.kernel.org/linux-mm/2cb4503d-3a3f-4f6c-8038-7b3d1c74b3c2@kernel.org/

Suggested-by: David Hildenbrand (Arm)
Signed-off-by: Lance Yang
---
 include/asm-generic/tlb.h | 17 +++++++++++++++++
 mm/mmu_gather.c           | 15 +++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index bdcc2778ac64..cb41cc6a0024 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -240,6 +240,23 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 }
 #endif /* CONFIG_MMU_GATHER_TABLE_FREE */
 
+/**
+ * tlb_table_flush_implies_ipi_broadcast - does TLB flush imply IPI sync
+ *
+ * When page table operations require synchronization with software/lockless
+ * walkers, they flush the TLB (tlb->freed_tables or tlb->unshared_tables)
+ * then call tlb_remove_table_sync_{one,rcu}(). If the flush already sent
+ * IPIs to all CPUs, the sync call is redundant.
+ *
+ * Returns false by default. Architectures can override by defining this.
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return false;
+}
+#endif
+
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 /*
  * This allows an architecture that does not use the linux page-tables for
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 3985d856de7f..37a6a711c37e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -283,6 +283,14 @@ void tlb_remove_table_sync_one(void)
	 * It is however sufficient for software page-table walkers that rely on
	 * IRQ disabling.
	 */
+
+	/*
+	 * Skip IPI if the preceding TLB flush already synchronized with
+	 * all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
 }
 
@@ -312,6 +320,13 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
  */
 void tlb_remove_table_sync_rcu(void)
 {
+	/*
+	 * Skip RCU wait if the preceding TLB flush already synchronized
+	 * with all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
	synchronize_rcu();
 }
 
-- 
2.49.0
From: Lance Yang
To: akpm@linux-foundation.org
Cc: peterz@infradead.org, david@kernel.org, dave.hansen@intel.com, dave.hansen@linux.intel.com, ypodemsk@redhat.com, hughd@google.com, will@kernel.org, aneesh.kumar@kernel.org, npiggin@gmail.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com, arnd@arndb.de, lorenzo.stoakes@oracle.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org, shy828301@gmail.com, riel@surriel.com, jannh@google.com, jgross@suse.com, seanjc@google.com, pbonzini@redhat.com, boris.ostrovsky@oracle.com, virtualization@lists.linux.dev, kvm@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, ioworker0@gmail.com, Lance Yang
Subject: [PATCH v5 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
Date: Mon, 2 Mar 2026 14:30:36 +0800
Message-ID: <20260302063048.9479-3-lance.yang@linux.dev>
In-Reply-To: <20260302063048.9479-1-lance.yang@linux.dev>
References: <20260302063048.9479-1-lance.yang@linux.dev>

From: Lance Yang

Enable the optimization introduced in the previous patch for x86.

Add pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast to track whether
flush_tlb_multi() sends real IPIs, and initialize it once in
native_pv_tlb_init() during boot. On CONFIG_PARAVIRT systems,
tlb_table_flush_implies_ipi_broadcast() reads the pv_ops property; on
non-PARAVIRT systems it directly checks for INVLPGB.

PV backends (KVM, Xen, Hyper-V) typically have their own implementations
and don't call native_flush_tlb_multi() directly, so they cannot be trusted
to provide the IPI guarantees we need. They keep the property false.

Two-step plan, as David suggested[1]:

Step 1 (this patch): skip the redundant sync only when we are 100% certain
the TLB flush sent IPIs. INVLPGB is excluded because, when it is supported,
we cannot guarantee IPIs were sent; this keeps things clean and simple.

Step 2 (future work): send targeted IPIs only to CPUs actually doing
software/lockless page table walks, benefiting all architectures.

Step 2 naturally applies only to setups where Step 1 does not, such as x86
with INVLPGB, or arm64.
[1] https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/

Suggested-by: David Hildenbrand (Arm)
Signed-off-by: Lance Yang
---
 arch/x86/include/asm/paravirt_types.h |  5 +++++
 arch/x86/include/asm/smp.h            |  7 +++++++
 arch/x86/include/asm/tlb.h            | 20 +++++++++++++++++++-
 arch/x86/kernel/paravirt.c            | 16 ++++++++++++++++
 arch/x86/kernel/smpboot.c             |  1 +
 5 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 9bcf6bce88f6..ec01268f2e3e 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -112,6 +112,11 @@ struct pv_mmu_ops {
	void (*flush_tlb_multi)(const struct cpumask *cpus,
				const struct flush_tlb_info *info);
 
+	/*
+	 * True if flush_tlb_multi() sends real IPIs to all target CPUs.
+	 */
+	bool flush_tlb_multi_implies_ipi_broadcast;
+
	/* Hook for intercepting the destruction of an mm_struct. */
	void (*exit_mmap)(struct mm_struct *mm);
	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 84951572ab81..4ac175414ac1 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -105,6 +105,13 @@ void native_smp_prepare_boot_cpu(void);
 void smp_prepare_cpus_common(void);
 void native_smp_prepare_cpus(unsigned int max_cpus);
 void native_smp_cpus_done(unsigned int max_cpus);
+
+#ifdef CONFIG_PARAVIRT
+void __init native_pv_tlb_init(void);
+#else
+static inline void native_pv_tlb_init(void) { }
+#endif
+
 int common_cpu_up(unsigned int cpunum, struct task_struct *tidle);
 int native_kick_ap(unsigned int cpu, struct task_struct *tidle);
 int native_cpu_disable(void);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..87ef7147eac8 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,23 @@
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
 #include
 #include
 #include
 #include
+#include
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+	return pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast;
+#else
+	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+#endif
+}
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -20,7 +33,12 @@ static inline void tlb_flush(struct mmu_gather *tlb)
		end = tlb->end;
	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/*
+	 * Pass both freed_tables and unshared_tables so that lazy-TLB CPUs
+	 * also receive IPIs during unsharing page tables.
+	 */
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
 }
 
 static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a6ed52cae003..b681b8319295 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -154,6 +154,7 @@ struct paravirt_patch_template pv_ops = {
	.mmu.flush_tlb_kernel = native_flush_tlb_global,
	.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
	.mmu.flush_tlb_multi = native_flush_tlb_multi,
+	.mmu.flush_tlb_multi_implies_ipi_broadcast = false,
 
	.mmu.exit_mmap = paravirt_nop,
	.mmu.notify_page_enc_status_changed = paravirt_nop,
@@ -221,3 +222,18 @@ NOKPROBE_SYMBOL(native_load_idt);
 
 EXPORT_SYMBOL(pv_ops);
 EXPORT_SYMBOL_GPL(pv_info);
+
+void __init native_pv_tlb_init(void)
+{
+	/*
+	 * If PV backend already set the property, respect it.
+	 * Otherwise, check if native TLB flush sends real IPIs to all target
+	 * CPUs (i.e., not using INVLPGB broadcast invalidation).
+	 */
+	if (pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast)
+		return;
+
+	if (pv_ops.mmu.flush_tlb_multi == native_flush_tlb_multi &&
+	    !cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = true;
+}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..3cdb04162843 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1167,6 +1167,7 @@ void __init native_smp_prepare_boot_cpu(void)
	switch_gdt_and_percpu_base(me);
 
	native_pv_lock_init();
+	native_pv_tlb_init();
 }
 
 void __init native_smp_cpus_done(unsigned int max_cpus)
-- 
2.49.0