From: Valentin Schneider
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: "Peter Zijlstra (Intel)", Nicolas Saenz Julienne, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin",
 Andy Lutomirski, Arnaldo Carvalho de Melo, Josh Poimboeuf,
 Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, "Paul E. McKenney",
 Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
 "David S. Miller", Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
 Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
 Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
 Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
 Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
 Shrikanth Hegde
Subject: [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches
Date: Tue, 24 Mar 2026 10:48:00 +0100
Message-ID: <20260324094801.3092968-10-vschneid@redhat.com>
In-Reply-To: <20260324094801.3092968-1-vschneid@redhat.com>
References: <20260324094801.3092968-1-vschneid@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them
vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning
from text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is
deferred. In other words, to safely defer this synchronization, any
kernel instruction leading to the execution of the deferred instruction
sync must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early
entry code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code is noinstr, which gets rid of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is
even brought up), which is before any IPI deferral can happen.
This leaves us with static keys and static calls. Any static key used
in early entry code should only ever be forever-enabled at boot time,
IOW __ro_after_init (pretty much like alternatives). Exceptions to that
will now be caught by objtool.

The deferred instruction sync is the CR3 RMW done as part of kPTI when
switching to the kernel page table; per SDM vol 2, chapter 4.3 - Move
to/from control registers:

```
MOV CR* instructions, except for MOV CR8, are serializing instructions.
```

Leverage the new kernel_cr3_loaded signal and the kPTI CR3 RMW to defer
sync_core() IPIs targeting NOHZ_FULL CPUs.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Nicolas Saenz Julienne
Signed-off-by: Valentin Schneider
---
 arch/x86/include/asm/text-patching.h |  5 ++++
 arch/x86/kernel/alternative.c        | 34 +++++++++++++++++++++++-----
 arch/x86/kernel/kprobes/core.c       |  4 ++--
 arch/x86/kernel/kprobes/opt.c        |  4 ++--
 arch/x86/kernel/module.c             |  2 +-
 5 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index f2d142a0a862e..628e80f8318cd 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,11 @@ extern void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t i
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void smp_text_poke_sync_each_cpu(void);
+#ifdef CONFIG_TRACK_CR3
+extern void smp_text_poke_sync_each_cpu_deferrable(void);
+#else
+#define smp_text_poke_sync_each_cpu_deferrable smp_text_poke_sync_each_cpu
+#endif
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 #define text_poke_copy text_poke_copy
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 28518371d8bf3..f3af77d7c533c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -6,6 +6,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -13,6 +14,7 @@
 #include
 #include
 #include
+#include

 int __read_mostly alternatives_patched;

@@ -2706,11 +2708,29 @@ static void do_sync_core(void *info)
 	sync_core();
 }

+static void __smp_text_poke_sync_each_cpu(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void smp_text_poke_sync_each_cpu(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__smp_text_poke_sync_each_cpu(NULL);
+}
+
+#ifdef CONFIG_TRACK_CR3
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
+	       per_cpu(kernel_cr3_loaded, cpu);
 }

+void smp_text_poke_sync_each_cpu_deferrable(void)
+{
+	__smp_text_poke_sync_each_cpu(do_sync_core_defer_cond);
+}
+#endif
+
 /*
  * NOTE: crazy scheme to allow patching Jcc.d32 but not increase the size of
  * this thing. When len == 6 everything is prefixed with 0x0f and we map
@@ -2914,11 +2934,13 @@ void smp_text_poke_batch_finish(void)
	 * First step: add a INT3 trap to the address that will be patched.
	 */
	for (i = 0; i < text_poke_array.nr_entries; i++) {
-		text_poke_array.vec[i].old = *(u8 *)text_poke_addr(&text_poke_array.vec[i]);
-		text_poke(text_poke_addr(&text_poke_array.vec[i]), &int3, INT3_INSN_SIZE);
+		void *addr = text_poke_addr(&text_poke_array.vec[i]);
+
+		text_poke_array.vec[i].old = *((u8 *)addr);
+		text_poke(addr, &int3, INT3_INSN_SIZE);
	}

-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();

	/*
	 * Second step: update all but the first byte of the patched range.
@@ -2980,7 +3002,7 @@ void smp_text_poke_batch_finish(void)
	 * not necessary and we'd be safe even without it. But
	 * better safe than sorry (plus there's not only Intel).
	 */
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 }

 /*
@@ -3001,7 +3023,7 @@ void smp_text_poke_batch_finish(void)
 	}

 	if (do_sync)
-		smp_text_poke_sync_each_cpu();
+		smp_text_poke_sync_each_cpu_deferrable();

 	/*
	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index c1fac3a9fecc2..61a93ba30f255 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -789,7 +789,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;

 	text_poke(p->addr, &int3, 1);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }

@@ -799,7 +799,7 @@ void arch_disarm_kprobe(struct kprobe *p)

 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 }

 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 6f826a00eca29..3b3be66da320c 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -509,11 +509,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 		       JMP32_INSN_SIZE - INT3_INSN_SIZE);

 	text_poke(addr, new, INT3_INSN_SIZE);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 	text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();

 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 11c45ce42694c..0894b1f38de77 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -209,7 +209,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				      write, apply);

 	if (!early) {
-		smp_text_poke_sync_each_cpu();
+		smp_text_poke_sync_each_cpu_deferrable();
 		mutex_unlock(&text_mutex);
 	}

-- 
2.52.0