From: Valentin Schneider
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin", Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, "Paul E. McKenney", Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, "David S.
Miller", Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar, Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde
Subject: [RFC PATCH v8 10/10] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs when tracking CR3 switches
Date: Tue, 24 Mar 2026 10:48:01 +0100
Message-ID: <20260324094801.3092968-11-vschneid@redhat.com>
In-Reply-To: <20260324094801.3092968-1-vschneid@redhat.com>
References: <20260324094801.3092968-1-vschneid@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Previous commits have added a software signal that tracks which CR3
(kernel or user) is in use on any given CPU. Combined with:

 o the CR3 switch itself being a flush for non-global mappings
 o global mappings under kPTI being limited to the CEA and entry text

we now have a way to safely defer (kernel) TLB flush IPIs targeting
NOHZ_FULL CPUs executing in userspace (i.e. with the user CR3 loaded).

When sending a kernel TLB flush IPI to a NOHZ_FULL CPU, check whether it
is using the user CR3; if it is, do not interrupt it and instead rely on
the implicit flush done by the CR3 write that happens when it switches
back to the kernel CR3.
Signed-off-by: Valentin Schneider
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 34 ++++++++++++++++++++++++++-------
 mm/vmalloc.c                    | 30 ++++++++++++++++++++++++-----
 3 files changed, 53 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3b3aceee701e6..8bae150206665 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -22,6 +22,7 @@ DECLARE_PER_CPU_PAGE_ALIGNED(bool, kernel_cr3_loaded);
 #endif
 
 void __flush_tlb_all(void);
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 #define TLB_FLUSH_ALL		-1UL
 #define TLB_GENERATION_INVALID	0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f5b93e01e3472..e08f16474f074 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -1530,23 +1531,24 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+static void kernel_tlb_flush_all(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_flush_all();
 	else
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond, do_flush_tlb_all, NULL, 1);
 }
 
-static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+static void kernel_tlb_flush_range(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_kernel_range_flush(info);
 	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond, do_kernel_range_flush, info, 1);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond, unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
 
@@ -1556,13 +1558,31 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 				  TLB_GENERATION_INVALID);
 
 	if (info->end == TLB_FLUSH_ALL)
-		kernel_tlb_flush_all(info);
+		kernel_tlb_flush_all(cond, info);
 	else
-		kernel_tlb_flush_range(info);
+		kernel_tlb_flush_range(cond, info);
 
 	put_flush_tlb_info();
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+#ifdef CONFIG_TRACK_CR3
+static bool flush_tlb_kernel_cond(int cpu, void *info)
+{
+	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
+	       per_cpu(kernel_cr3_loaded, cpu);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(flush_tlb_kernel_cond, start, end);
+}
+#endif
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e286c2d2068cb..55b7bafe26016 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -501,6 +501,26 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is a recipe for disaster.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -514,7 +534,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2366,7 +2386,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 
 	nr_purge_nodes = cpumask_weight(&purge_nodes);
 	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 
 		/* One extra worker is per a lazy_max_pages() full set minus one. */
 		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2469,7 +2489,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2916,7 +2936,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2981,7 +3001,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
 
-- 
2.52.0