From: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
 linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
 linux-riscv@lists.infradead.org, linux-perf-users@vger.kernel.org,
 kvm@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-modules@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 rcu@vger.kernel.org, linux-hardening@vger.kernel.org,
 linux-kselftest@vger.kernel.org, bpf@vger.kernel.org
Cc: "Peter Zijlstra (Intel)", Nicolas Saenz Julienne, Juri Lelli,
 Marcelo Tosatti, Yair Podemsky, Josh Poimboeuf, Daniel Wagner,
 Petr Tesarik, Nicolas Saenz Julienne, Frederic Weisbecker,
 "Paul E. McKenney", Dave Hansen, Sean Christopherson, Juergen Gross,
 Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list,
 Russell King, Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
 Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org,
 "H. Peter Anvin", Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
 Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, "Liang, Kan",
 Pawan Gupta, Paolo Bonzini, Arnd Bergmann, Jason Baron, Steven Rostedt,
 Ard Biesheuvel, Luis Chamberlain, Petr Pavlu, Sami Tolvanen,
 Daniel Gomez, Naveen N Rao, Anil S Keshavamurthy, "David S. Miller",
 Masami Hiramatsu, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
 Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
 Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman, Kees Cook,
 Shuah Khan, Masahiro Yamada, Alice Ryhl, Miguel Ojeda,
 "Mike Rapoport (Microsoft)", Rong Xu, Rafael Aquini, Song Liu,
 Andrii Nakryiko, Dan Carpenter, Brian Gerst, "Kirill A. Shutemov",
 Benjamin Berg, Vishal Annapurve, Randy Dunlap, John Stultz, Tiezhu Yang
Subject: [PATCH v5 25/25] context_tracking,x86: Defer kernel text patching IPIs
Date: Tue, 29 Apr 2025 13:32:42 +0200
Message-ID: <20250429113242.998312-26-vschneid@redhat.com>
In-Reply-To: <20250429113242.998312-1-vschneid@redhat.com>
References: <20250429113242.998312-1-vschneid@redhat.com>

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them
vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning
from text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is
deferred. In other words, to safely defer this synchronization, any
kernel instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early
entry code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets
rid of the probes. Alternatives are safe, because they are patched at
boot time (before SMP is even brought up), which is before any IPI
deferral can happen.

This leaves us with static keys and static calls.

Any static key used in early entry code should only ever be enabled at
boot time and never flipped afterwards, IOW it should be __ro_after_init
(pretty much like alternatives). Exceptions have to be explicitly marked
as allowed in .noinstr, and flipping them will always generate an IPI.
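As an illustration, here is a minimal sketch of what such a
boot-time-only key can look like. This is not part of the patch and the
key/initcall names are made up; DEFINE_STATIC_KEY_FALSE_RO() is the
stock jump_label helper that places the key in __ro_after_init data:

    #include <linux/jump_label.h>
    #include <linux/init.h>

    /* Hypothetical key; __ro_after_init, so only writable during boot. */
    static DEFINE_STATIC_KEY_FALSE_RO(early_entry_feature);

    static int __init early_entry_feature_setup(void)
    {
            /* Hypothetical boot-time decision; the key is never flipped again. */
            if (IS_ENABLED(CONFIG_X86_64))
                    static_branch_enable(&early_entry_feature);
            return 0;
    }
    early_initcall(early_entry_feature_setup);

Once mark_rodata_ro() has run, a later static_branch_enable() on such a
key would fault on the read-only mapping, which is exactly the property
that makes it safe to test in deferred-IPI entry code.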
The same applies to static calls - they should only be updated at boot
time, or be manually marked as an exception.

Objtool is now able to point at static keys/calls that don't respect
this, and all static keys/calls used in early entry code have now been
verified as behaving appropriately.

Leverage the new context_tracking infrastructure to defer sync_core()
IPIs to a target CPU's next kernel entry. (A conceptual sketch of the
flush side is appended after the patch.)

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Nicolas Saenz Julienne
Signed-off-by: Valentin Schneider
---
 arch/x86/include/asm/context_tracking_work.h |  6 ++-
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 39 +++++++++++++++++---
 arch/x86/kernel/kprobes/core.c               |  4 +-
 arch/x86/kernel/kprobes/opt.c                |  4 +-
 arch/x86/kernel/module.c                     |  2 +-
 include/asm-generic/sections.h               | 15 ++++++++
 include/linux/context_tracking_work.h        |  4 +-
 8 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5f3b2d0977235..485b32881fde5 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
        switch (work) {
-       case CT_WORK_n:
-               // Do work...
+       case CT_WORK_SYNC:
+               sync_core();
                break;
        case CT_WORK_MAX:
                WARN_ON_ONCE(true);
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index ab9e143ec9fea..9dfa46f721c1d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 #define text_poke_copy text_poke_copy
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index bf82c6f7d6905..8c73ac6243809 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/context_tracking.h>
 #include <linux/execmem.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
@@ -2450,9 +2451,24 @@ static void do_sync_core(void *info)
        sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+       return !ct_set_cpu_work(cpu, CT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+       on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-       on_each_cpu(do_sync_core, NULL, 1);
+       __text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+       __text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -2623,6 +2639,7 @@ static int tp_vec_nr;
  */
 static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+       smp_cond_func_t cond = do_sync_core_defer_cond;
        unsigned char int3 = INT3_INSN_OPCODE;
        unsigned int i;
        int do_sync;
@@ -2658,11 +2675,21 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
         * First step: add a int3 trap to the address that will be patched.
         */
        for (i = 0; i < nr_entries; i++) {
-               tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
-               text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
+               void *addr = text_poke_addr(&tp[i]);
+
+               /*
+                * There's no safe way to defer IPIs for patching text in
+                * .noinstr, record whether there is at least one such poke.
+                */
+               if (is_kernel_noinstr_text((unsigned long)addr) ||
+                   is_module_noinstr_text_address((unsigned long)addr))
+                       cond = NULL;
+
+               tp[i].old = *((u8 *)addr);
+               text_poke(addr, &int3, INT3_INSN_SIZE);
        }
 
-       text_poke_sync();
+       __text_poke_sync(cond);
 
        /*
         * Second step: update all but the first byte of the patched range.
@@ -2724,7 +2751,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
                 * not necessary and we'd be safe even without it. But
                 * better safe than sorry (plus there's not only Intel).
                 */
-               text_poke_sync();
+               __text_poke_sync(cond);
        }
 
        /*
@@ -2745,7 +2772,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
        }
 
        if (do_sync)
-               text_poke_sync();
+               __text_poke_sync(cond);
 
        /*
         * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 09608fd936876..687e6805b7511 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -808,7 +808,7 @@ void arch_arm_kprobe(struct kprobe *p)
        u8 int3 = INT3_INSN_OPCODE;
 
        text_poke(p->addr, &int3, 1);
-       text_poke_sync();
+       text_poke_sync_deferrable();
        perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -818,7 +818,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
        perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
        text_poke(p->addr, &p->opcode, 1);
-       text_poke_sync();
+       text_poke_sync_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 36d6809c6c9e1..b2ce4d9c3ba56 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
                        JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
        text_poke(addr, new, INT3_INSN_SIZE);
-       text_poke_sync();
+       text_poke_sync_deferrable();
        text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE,
                  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-       text_poke_sync();
+       text_poke_sync_deferrable();
 
        perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index a7998f3517017..d89c9de0ca9f5 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -206,7 +206,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
                                   write, apply);
 
        if (!early) {
-               text_poke_sync();
+               text_poke_sync_deferrable();
                mutex_unlock(&text_mutex);
        }
 
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index 0755bc39b0d80..7d2403014010e 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr)
               addr < (unsigned long)_einittext;
 }
 
+
+/**
+ * is_kernel_noinstr_text - checks if the pointer address is located in the
+ *                          .noinstr section
+ *
+ * @addr: address to check
+ *
+ * Returns: true if the address is located in .noinstr, false otherwise.
+ */
+static inline bool is_kernel_noinstr_text(unsigned long addr)
+{
+       return addr >= (unsigned long)__noinstr_text_start &&
+              addr < (unsigned long)__noinstr_text_end;
+}
+
 /**
  * __is_kernel_text - checks if the pointer address is located in the
  *                    .text section
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index c68245f8d77c5..2facc621be067 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include <linux/bitops.h>
 
 enum {
-       CT_WORK_n_OFFSET,
+       CT_WORK_SYNC_OFFSET,
        CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-       CT_WORK_n = BIT(CT_WORK_n_OFFSET),
+       CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET),
        CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET)
 };
 
-- 
2.49.0
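P.S. The flush side referenced in the changelog is not in this patch -
it comes from the earlier context_tracking patches of the series.
Conceptually it behaves like the sketch below; the function name and
loop shape here are illustrative assumptions, not the series' exact
code:

    #include <linux/bitops.h>   /* for_each_set_bit(), BIT() */

    /*
     * Rough sketch only: on the next kernel entry, before any patchable
     * instruction can be reached, the CPU picks up its pending
     * deferred-work bits and runs each one - e.g. CT_WORK_SYNC ends up
     * in sync_core() via arch_context_tracking_work() above.
     */
    static void ct_work_flush_sketch(unsigned long work)
    {
            int bit;

            for_each_set_bit(bit, &work, CT_WORK_MAX_OFFSET)
                    arch_context_tracking_work(BIT(bit));
    }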