From nobody Thu Dec 18 23:01:08 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21CE5C77B7A for ; Sat, 3 Jun 2023 20:08:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230466AbjFCUH6 (ORCPT ); Sat, 3 Jun 2023 16:07:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51320 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230284AbjFCUHw (ORCPT ); Sat, 3 Jun 2023 16:07:52 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BF540E6E for ; Sat, 3 Jun 2023 13:07:24 -0700 (PDT) Message-ID: <20230603200459.657036052@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1685822817; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=OuVg+D0jb4HgNSp9ye4aSRcei5b/KvEx573nuovh52M=; b=2c8fwqxhQVPxRmc3T8a6DTenMMCD77hCal25+SdWF5dospABEuvQdNa7fPQ200NSevXmyM 2bDOr9OSVJYCqteJH5Z+xRmCIOVx3922ppsxRicluntiRO3yPjnPA0Yl+3oRiY2Hh6ni4F pNJH/J8rkLA8watPvcCHMG2tj2xVutVwgAQ0ietWzQyg3zIv455V6Yg4emvRqH6IbPolyq hkOs04DC1n0f2IrBNjmVvl0BT9wknSawj9wG6YB/6ocgtaZxkpMw02/TqpAGZuH0SAWKhz YJUaMh0n3ZEwmv2cxXw/MSLlmL2nml5smSvYPiRJuKJ9JComN1YswOAv9Oqhyw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1685822817; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=OuVg+D0jb4HgNSp9ye4aSRcei5b/KvEx573nuovh52M=; b=LxAcG02Z6CBSPAUoOkIBhVDAchgubOdBNxyjsaJizVfW84GwfH0L7pCA0KKKeVC9waPHh9 0B2NSgeAuIKCbKCw== From: Thomas Gleixner To: LKML Cc: 
x86@kernel.org, Ashok Raj , Dave Hansen , Tony Luck , Arjan van de Veen , Peter Zijlstra , Eric Biederman Subject: [patch 1/6] x86/smp: Remove pointless wmb() from native_stop_other_cpus() References: <20230603193439.502645149@linutronix.de> MIME-Version: 1.0 Date: Sat, 3 Jun 2023 22:06:56 +0200 (CEST) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The wmb() after the successful atomic_cmpxchg() is complete voodoo along with the comment stating "sync above data before sending IRQ". There is no "above" data except for the atomic_t stopping_cpu which has just been acquired. The reboot IPI handler does not check any data and unconditionally disables the CPU. Remove this cargo cult barrier. Signed-off-by: Thomas Gleixner --- arch/x86/kernel/smp.c | 3 --- 1 file changed, 3 deletions(-) --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -171,9 +171,6 @@ static void native_stop_other_cpus(int w if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) !=3D -1) return; =20 - /* sync above data before sending IRQ */ - wmb(); - apic_send_IPI_allbutself(REBOOT_VECTOR); =20 /* From nobody Thu Dec 18 23:01:08 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E2DAEC7EE32 for ; Sat, 3 Jun 2023 20:07:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230421AbjFCUHz (ORCPT ); Sat, 3 Jun 2023 16:07:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51318 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230225AbjFCUHw (ORCPT ); Sat, 3 Jun 2023 16:07:52 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) 
with ESMTPS id DC9CAE75 for ; Sat, 3 Jun 2023 13:07:24 -0700 (PDT) Message-ID: <20230603200459.717231106@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1685822818; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=I7I2EKoHeo/byjwgLE4HOlJ+wED05lzaQv+BpbzI77U=; b=IUrevXDafhxQpLh0SuAslVuhafF3e085DF4EC0XDwrP8KirhfW70LxmVA0yQBkAN089fV1 xgb04EKn2nBiZfcWHV5dd+hXBzRkvKtJjfnzqmLOKKeyZVQVx0Hijqxe8a+fqRITiQK6Uy 6rwJNJk4Uekr62dY1ag1j7pylEXmWsP/PGDQGhVrJCilWws69mT0bMHRlqrXlowbUT/xG+ DhArg3tMUWupx+ywukQnAg37tt6FcMEsgrjQINvqgOrZSqe3Z7TtL20/Dze4ygSRVAQQtv 3pFA2DdRTl+ylJ35vrK5DBRkV7SbvE6mb+QakAAnipAa23n6w4vFP6viBqi/1w== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1685822818; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=I7I2EKoHeo/byjwgLE4HOlJ+wED05lzaQv+BpbzI77U=; b=NSZ+nkDqpgXAzT5LDYYjC92Jhe41A25OJcBNQ50l8zJv063WLDMznbCPeL21GpuzRk1VZ7 ZKh5N99eICdZnGDw== From: Thomas Gleixner To: LKML Cc: x86@kernel.org, Ashok Raj , Dave Hansen , Tony Luck , Arjan van de Veen , Peter Zijlstra , Eric Biederman Subject: [patch 2/6] x86/smp: Acquire stopping_cpu unconditionally References: <20230603193439.502645149@linutronix.de> MIME-Version: 1.0 Date: Sat, 3 Jun 2023 22:06:58 +0200 (CEST) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" There is no reason to acquire the stopping_cpu atomic_t only when there is more than one online CPU. Make it unconditional to prepare for fixing the kexec() problem when there are present but "offline" CPUs which play dead in mwait_play_dead(). 
They need to be brought out of mwait before kexec() as kexec() can overwrite text, pagetables, stacks and the monitored cacheline of the original kernel. The latter causes mwait to resume execution, which wreaks havoc on the kexec kernel and usually results in triple faults. Move the acquire out of the num_online_cpus() > 1 condition so the upcoming 'kick mwait' fixup is properly protected. Signed-off-by: Thomas Gleixner --- arch/x86/kernel/smp.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -152,6 +152,10 @@ static void native_stop_other_cpus(int w if (reboot_force) return; =20 + /* Only proceed if this is the first CPU to reach this code */ + if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) !=3D -1) + return; + /* * Use an own vector here because smp_call_function * does lots of things not suitable in a panic situation. @@ -167,10 +171,6 @@ static void native_stop_other_cpus(int w * finish their work before we force them off with the NMI. */ if (num_online_cpus() > 1) { - /* did someone beat us here? 
*/ - if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) !=3D -1) - return; - apic_send_IPI_allbutself(REBOOT_VECTOR); =20 /* From nobody Thu Dec 18 23:01:08 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C1772C7EE24 for ; Sat, 3 Jun 2023 20:09:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231252AbjFCUJE (ORCPT ); Sat, 3 Jun 2023 16:09:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51570 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231628AbjFCUIp (ORCPT ); Sat, 3 Jun 2023 16:08:45 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BCACFE60 for ; Sat, 3 Jun 2023 13:07:53 -0700 (PDT) Message-ID: <20230603200459.775471968@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1685822820; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=fh2uTBEHG28Fh31y7f2tmR7yGLyMLoAyt+FF1TixMCg=; b=1nVnHokyojIXwrcsLwEvNrRA4MA/Gp4A/mBpI6uAEV1v/jCxYGSnkpIOK13V1qC9G1cZ68 E0Wt+8KqnQK18DnJRPKdRxwf91S4QA5Y42Tq1WmTqsOF/yqIlLllKO5pQ4bFhhY3wTqQgr 4Px0WvA8jO3bRnsrseb5mKoj67Vo4TPrNsdaa7Dwj1VLeq0KvQPdBOCvei4ulO0CvRwoue 5bdcxZywog18NE2kVcZJv5vK/CVo4mGvrgvPcQdw786B/16F0MRU9jxZ0PFOlLKF711UiT vLQs43veq6Pkof2vLWl6tpJjJYljK/tnGppsn6lnxiP2WWUF4zrYmoP5GdpuUQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1685822820; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=fh2uTBEHG28Fh31y7f2tmR7yGLyMLoAyt+FF1TixMCg=; 
b=ZYmnES9mtLf+uuLZaMIpIFu5yZDJIJKyATWf0PD4/i7pu+IKi4nksO4gNyQ6S9IsqkhH1t YllL+o1xr/zXinCg== From: Thomas Gleixner To: LKML Cc: x86@kernel.org, Ashok Raj , Dave Hansen , Tony Luck , Arjan van de Veen , Peter Zijlstra , Eric Biederman Subject: [patch 3/6] x86/smp: Use dedicated cache-line for mwait_play_dead() References: <20230603193439.502645149@linutronix.de> MIME-Version: 1.0 Date: Sat, 3 Jun 2023 22:07:00 +0200 (CEST) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Monitoring idletask::thread_info::flags in mwait_play_dead() has been an obvious choice as all that is needed is a cache line which is not written by other CPUs. But there is a use case where a "dead" CPU needs to be brought out of that mwait(): kexec(). The CPU needs to be brought out of mwait before kexec() as kexec() can overwrite text, pagetables, stacks and the monitored cacheline of the original kernel. The latter causes mwait to resume execution, which wreaks havoc on the kexec kernel and usually results in triple faults. Use dedicated per-CPU storage to prepare for that. Signed-off-by: Thomas Gleixner --- arch/x86/kernel/smpboot.c | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -101,6 +101,17 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map); DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info); EXPORT_PER_CPU_SYMBOL(cpu_info); =20 +struct mwait_cpu_dead { + unsigned int control; + unsigned int status; +}; + +/* + * Cache line aligned data for mwait_play_dead(). Separate on purpose so + * that it's unlikely to be touched by other CPUs. + */ +static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead); + /* Logical package management. 
We might want to allocate that dynamically = */ unsigned int __max_logical_packages __read_mostly; EXPORT_SYMBOL(__max_logical_packages); @@ -1758,10 +1769,10 @@ EXPORT_SYMBOL_GPL(cond_wakeup_cpu0); */ static inline void mwait_play_dead(void) { + struct mwait_cpu_dead *md =3D this_cpu_ptr(&mwait_cpu_dead); unsigned int eax, ebx, ecx, edx; unsigned int highest_cstate =3D 0; unsigned int highest_subcstate =3D 0; - void *mwait_ptr; int i; =20 if (boot_cpu_data.x86_vendor =3D=3D X86_VENDOR_AMD || @@ -1796,13 +1807,6 @@ static inline void mwait_play_dead(void) (highest_subcstate - 1); } =20 - /* - * This should be a memory location in a cache line which is - * unlikely to be touched by other processors. The actual - * content is immaterial as it is not actually modified in any way. - */ - mwait_ptr =3D &current_thread_info()->flags; - wbinvd(); =20 while (1) { @@ -1814,9 +1818,9 @@ static inline void mwait_play_dead(void) * case where we return around the loop. */ mb(); - clflush(mwait_ptr); + clflush(md); mb(); - __monitor(mwait_ptr, 0, 0); + __monitor(md, 0, 0); mb(); __mwait(eax, 0); From nobody Thu Dec 18 23:01:08 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 15D7AC77B7A for ; Sat, 3 Jun 2023 20:08:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231185AbjFCUII (ORCPT ); Sat, 3 Jun 2023 16:08:08 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51388 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230354AbjFCUHy (ORCPT ); Sat, 3 Jun 2023 16:07:54 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2FAD0E7C for ; Sat, 3 Jun 2023 13:07:27 -0700 (PDT) Message-ID: 
<20230603200459.832650526@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1685822822; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=Fs0P0A9TSWn0oyonYhmHKR1qTOJzaX+9a0aZnA35Xew=; b=YcLzt5/GmXgL2qhyMdnLZYvvSqnm1V0oSMFWfGN5BbPewLCX6KGY2QHL43L/CRkBbYGx+v dDQRsHWuj9YT0KJaXDV+zuKFovHQreRSVOyGClifoGOAeav/CClGTaRMLs+J1RuIicIeww DEeVyuJtx9PYa210EIfynEdKKw/nLbTvjp33vZfS/6szwa2FH5hUFm6dKXRTjyUt/Wn3lJ hM1xCnX9D0e8vN98wPiygqZwim6dF26K4HDigqtIsf/5FTRD/CqYV45X0jBGUeaY5VkzrO 7j3Lh3JZQO0hbQe7QYrgEa5tIUiXshedRssU3nnGzWnYwtL1Kjijt8xjws7Faw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1685822822; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=Fs0P0A9TSWn0oyonYhmHKR1qTOJzaX+9a0aZnA35Xew=; b=Ztbv9vSgQGBSlikkYdtonAPME4i/mpM/kxlv819NZNopN+WQu8I48mG0bzS/k3aLbLaxzR /skYXqWXhNhoj5AQ== From: Thomas Gleixner To: LKML Cc: x86@kernel.org, Ashok Raj , Dave Hansen , Tony Luck , Arjan van de Veen , Peter Zijlstra , Eric Biederman , Ashok Raj Subject: [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage References: <20230603193439.502645149@linutronix.de> MIME-Version: 1.0 Date: Sat, 3 Jun 2023 22:07:01 +0200 (CEST) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" TLDR: It's a mess. When kexec() is executed on a system with "offline" CPUs, which are parked in mwait_play_dead() it can end up in a triple fault during the bootup of the kexec kernel or cause hard to diagnose data corruption. 
The reason is that kexec() eventually overwrites the previous kernel's text, page tables, data and stack. If it writes to the cache line which is monitored by a previously offlined CPU, MWAIT resumes execution and ends up executing the wrong text, dereferencing overwritten page tables or corrupting the kexec kernel's data. Cure this by bringing the offline CPUs out of MWAIT into HLT. Write to the monitored cache line of each offline CPU, which makes MWAIT resume execution. The written control word tells the offline CPUs to issue HLT, which does not have the MWAIT problem. That does not help if a stray NMI, MCE or SMI hits the offline CPUs, as those bring them out of HLT. A follow-up change will put them into INIT, which protects at least against NMI and SMI. Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix t= he legacy case") Reported-by: Ashok Raj Signed-off-by: Thomas Gleixner Reviewed-by: Ashok Raj Tested-by: Ashok Raj --- arch/x86/include/asm/smp.h | 2 + arch/x86/kernel/smp.c | 21 +++++++--------- arch/x86/kernel/smpboot.c | 59 ++++++++++++++++++++++++++++++++++++++++= +++++ 3 files changed, 71 insertions(+), 11 deletions(-) --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu); int wbinvd_on_all_cpus(void); void cond_wakeup_cpu0(void); =20 +void smp_kick_mwait_play_dead(void); + void native_smp_send_reschedule(int cpu); void native_send_call_func_ipi(const struct cpumask *mask); void native_send_call_func_single_ipi(int cpu); --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -21,6 +21,7 @@ #include #include #include +#include =20 #include #include @@ -156,19 +157,17 @@ static void native_stop_other_cpus(int w if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) !=3D -1) return; =20 - /* - * Use an own vector here because smp_call_function - * does lots of things not suitable in a panic situation. 
- */ + /* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */ + if (kexec_in_progress) + smp_kick_mwait_play_dead(); =20 /* - * We start by using the REBOOT_VECTOR irq. - * The irq is treated as a sync point to allow critical - * regions of code on other cpus to release their spin locks - * and re-enable irqs. Jumping straight to an NMI might - * accidentally cause deadlocks with further shutdown/panic - * code. By syncing, we give the cpus up to one second to - * finish their work before we force them off with the NMI. + * Start by using the REBOOT_VECTOR. That acts as a sync point to + * allow critical regions of code on other cpus to leave their + * critical regions. Jumping straight to an NMI might accidentally + * cause deadlocks with further shutdown code. This gives the CPUs + * up to one second to finish their work before forcing them off + * with the NMI. */ if (num_online_cpus() > 1) { apic_send_IPI_allbutself(REBOOT_VECTOR); --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -53,6 +53,7 @@ #include #include #include +#include #include #include #include @@ -106,6 +107,9 @@ struct mwait_cpu_dead { unsigned int status; }; =20 +#define CPUDEAD_MWAIT_WAIT 0xDEADBEEF +#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEAD + /* * Cache line aligned data for mwait_play_dead(). Separate on purpose so * that it's unlikely to be touched by other CPUs. 
@@ -173,6 +177,10 @@ static void smp_callin(void) { int cpuid; =20 + /* Mop up eventual mwait_play_dead() wreckage */ + this_cpu_write(mwait_cpu_dead.status, 0); + this_cpu_write(mwait_cpu_dead.control, 0); + /* * If waken up by an INIT in an 82489DX configuration * cpu_callout_mask guarantees we don't get here before @@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void) (highest_subcstate - 1); } =20 + /* Set up state for the kexec() hack below */ + md->status =3D CPUDEAD_MWAIT_WAIT; + md->control =3D CPUDEAD_MWAIT_WAIT; + wbinvd(); =20 while (1) { @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void) mb(); __mwait(eax, 0); =20 + if (READ_ONCE(md->control) =3D=3D CPUDEAD_MWAIT_KEXEC_HLT) { + /* + * Kexec is about to happen. Don't go back into mwait() as + * the kexec kernel might overwrite text and data including + * page tables and stack. So mwait() would resume when the + * monitor cache line is written to and then the CPU goes + * south due to overwritten text, page tables and stack. + * + * Note: This does _NOT_ protect against a stray MCE, NMI, + * SMI. They will resume execution at the instruction + * following the HLT instruction and run into the problem + * which this is trying to prevent. + */ + WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT); + while(1) + native_halt(); + } + cond_wakeup_cpu0(); } } =20 +/* + * Kick all "offline" CPUs out of mwait on kexec(). See comment in + * mwait_play_dead(). + */ +void smp_kick_mwait_play_dead(void) +{ + u32 newstate =3D CPUDEAD_MWAIT_KEXEC_HLT; + struct mwait_cpu_dead *md; + unsigned int cpu, i; + + for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) { + md =3D per_cpu_ptr(&mwait_cpu_dead, cpu); + + /* Does it sit in mwait_play_dead() ? 
*/ + if (READ_ONCE(md->status) !=3D CPUDEAD_MWAIT_WAIT) + continue; + + /* Wait maximal 5ms */ + for (i =3D 0; READ_ONCE(md->status) !=3D newstate && i < 1000; i++) { + /* Bring it out of mwait */ + WRITE_ONCE(md->control, newstate); + udelay(5); + } + + if (READ_ONCE(md->status) !=3D newstate) + pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu); + } +} + void __noreturn hlt_play_dead(void) { if (__this_cpu_read(cpu_info.x86) >=3D 4) From nobody Thu Dec 18 23:01:08 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7C803C77B7A for ; Sat, 3 Jun 2023 20:08:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229487AbjFCUI4 (ORCPT ); Sat, 3 Jun 2023 16:08:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52156 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231874AbjFCUIq (ORCPT ); Sat, 3 Jun 2023 16:08:46 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 147BC1B1 for ; Sat, 3 Jun 2023 13:07:55 -0700 (PDT) Message-ID: <20230603200459.889612295@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1685822823; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=Y46eTbENCwKeEacqCP+RntAtFJd0WeB3KmTST49uhjE=; b=uzaFJoAicXXB1OMPQ6488zyFUrKvJ+MPs3GtqQZR8wA3AR9I/3Bpk6mhRkDLuzKTs0S4Mq HgrPcsOBgoErTzLeUSC/+IpopPE7KNuzZCR8kUEoUnRMIcNCZOrA+kJnl3wahqoDoXP5bv DrZ/THD2rPTj1pIBOL0YG+eKuE8anGCStrObe7EVv8TL4WvGnMjIcp3agjPnvm31M6qXjd WRNpJPqUo1CCMo+OmYXl4HAD4OnMS44D5N6COtOgUAzOw3uKdGQAYcO7GZ3JMvTMCMzFPb 
IkkWRnekR6wHvnZnLUbMJT7qPoMcPsjFGh89ILg8vcX7AzMWH13sAmDaid2HbA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1685822823; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=Y46eTbENCwKeEacqCP+RntAtFJd0WeB3KmTST49uhjE=; b=8FveJwVoVmiP1lJU+Lhh3x5CU6ewAbCz36APqoBREfMpFa74nq/tEGtvLd+8xDNEsCSGe+ MJtScKp7oSMZFqBw== From: Thomas Gleixner To: LKML Cc: x86@kernel.org, Ashok Raj , Dave Hansen , Tony Luck , Arjan van de Veen , Peter Zijlstra , Eric Biederman Subject: [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function References: <20230603193439.502645149@linutronix.de> MIME-Version: 1.0 Date: Sat, 3 Jun 2023 22:07:03 +0200 (CEST) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" INIT is a safer place to park CPUs during kexec(). Split the INIT assert/deassert sequence out so it can be reused. Signed-off-by: Thomas Gleixner --- arch/x86/kernel/smpboot.c | 51 +++++++++++++++++++----------------------= ----- 1 file changed, 22 insertions(+), 29 deletions(-) --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -853,47 +853,40 @@ wakeup_secondary_cpu_via_nmi(int apicid, return (send_status | accept_status); } =20 -static int -wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip) +static void send_init_sequence(int phys_apicid) { - unsigned long send_status =3D 0, accept_status =3D 0; - int maxlvt, num_starts, j; - - maxlvt =3D lapic_get_maxlvt(); + int maxlvt =3D lapic_get_maxlvt(); =20 - /* - * Be paranoid about clearing APIC errors. - */ + /* Be paranoid about clearing APIC errors. */ if (APIC_INTEGRATED(boot_cpu_apic_version)) { - if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */ + /* Due to the Pentium erratum 3AP. 
*/ + if (maxlvt > 3) apic_write(APIC_ESR, 0); apic_read(APIC_ESR); } =20 - pr_debug("Asserting INIT\n"); - - /* - * Turn INIT on target chip - */ - /* - * Send IPI - */ - apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, - phys_apicid); - - pr_debug("Waiting for send to finish...\n"); - send_status =3D safe_apic_wait_icr_idle(); + /* Assert INIT on the target CPU */ + apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_= apicid); + safe_apic_wait_icr_idle(); =20 udelay(init_udelay); =20 - pr_debug("Deasserting INIT\n"); - - /* Target chip */ - /* Send IPI */ + /* Deassert INIT on the target CPU */ apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid); + safe_apic_wait_icr_idle(); +} + +/* + * Wake up AP by INIT, INIT, STARTUP sequence. + */ +static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long st= art_eip) +{ + unsigned long send_status =3D 0, accept_status =3D 0; + int maxlvt, num_starts, j; + + preempt_disable(); =20 - pr_debug("Waiting for send to finish...\n"); - send_status =3D safe_apic_wait_icr_idle(); + send_init_sequence(phys_apicid); =20 mb(); From nobody Thu Dec 18 23:01:08 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 30662C7EE2E for ; Sat, 3 Jun 2023 20:09:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229717AbjFCUI6 (ORCPT ); Sat, 3 Jun 2023 16:08:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51530 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231863AbjFCUIq (ORCPT ); Sat, 3 Jun 2023 16:08:46 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 693E4E69 for ; Sat, 3 Jun 2023 13:07:54 -0700 (PDT) 
Message-ID: <20230603200459.947733085@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1685822825; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=58ciVfBbzjxuvXyJZlI2x4w/lpUE1kObMgg65Z8puj4=; b=k0RLuIJfu2acp3uJige/wnB1o3b00dIeIQUtTTHPmJUvBflBrt59Udj5QRO++OiMNS4JEs 9Y5njrqjzdJmXr3CBhRkWOQvJj9YfbabsAvF5atOh17zeYQVA+pXV1r/ApNQJJSX0gsrp0 xANnVDlXQzFANAA2WFVU7w0MOTn4aZ4B75QFXY0qq+0jcBOJPlxNyYiBt3cL0SgxvfEjN1 hfTbA9Awl+dwJ1N9oCkYmRb5PBuZBX4APwrLeYXe+i6hBbbQMH4zpyDB9+U/qqUK21B5Bg ZXmYJNn6RYMN8OyfCiLTsBsOhMgrM1Ubsdw5fh6f2grzels7iZGIk/Er3cxUWg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1685822825; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=58ciVfBbzjxuvXyJZlI2x4w/lpUE1kObMgg65Z8puj4=; b=ukQiSp2yRu+wh7kqBZofjrfC6122zjd2/ElbhiDKxg4ad5K2CCUaIUqZRe9ViVeyGK8gmv j1xamtJMAH4PJ8Bg== From: Thomas Gleixner To: LKML Cc: x86@kernel.org, Ashok Raj , Dave Hansen , Tony Luck , Arjan van de Veen , Peter Zijlstra , Eric Biederman Subject: [patch 6/6] x86/smp: Put CPUs into INIT on shutdown if possible References: <20230603193439.502645149@linutronix.de> MIME-Version: 1.0 Date: Sat, 3 Jun 2023 22:07:04 +0200 (CEST) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can resume execution due to NMI, SMI and MCE, which has the same issue as the MWAIT loop. Kicking the secondary CPUs into INIT makes this safe against NMI and SMI. A broadcast MCE will take the machine down, but a broadcast MCE which makes HLT resume and execute overwritten text, pagetables or data will end up in a disaster too. 
So choose the lesser of two evils and kick the secondary CPUs into INIT unless the system has installed special wakeup mechanisms which are not using INIT. Signed-off-by: Thomas Gleixner Reviewed-by: Ashok Raj --- arch/x86/include/asm/smp.h | 2 ++ arch/x86/kernel/smp.c | 38 +++++++++++++++++++++++++++++--------- arch/x86/kernel/smpboot.c | 19 +++++++++++++++++++ 3 files changed, 50 insertions(+), 9 deletions(-) --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -139,6 +139,8 @@ void native_send_call_func_ipi(const str void native_send_call_func_single_ipi(int cpu); void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle); =20 +bool smp_park_nonboot_cpus_in_init(void); + void smp_store_boot_cpu_info(void); void smp_store_cpu_info(int id); =20 --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -130,7 +130,7 @@ static int smp_stop_nmi_callback(unsigne } =20 /* - * this function calls the 'stop' function on all other CPUs in the system. + * Disable virtualization, APIC etc. and park the CPU in a HLT loop */ DEFINE_IDTENTRY_SYSVEC(sysvec_reboot) { @@ -147,8 +147,7 @@ static int register_stop_handler(void) =20 static void native_stop_other_cpus(int wait) { - unsigned long flags; - unsigned long timeout; + unsigned long flags, timeout; =20 if (reboot_force) return; @@ -164,10 +163,10 @@ static void native_stop_other_cpus(int w /* * Start by using the REBOOT_VECTOR. That acts as a sync point to * allow critical regions of code on other cpus to leave their - * critical regions. Jumping straight to an NMI might accidentally - * cause deadlocks with further shutdown code. This gives the CPUs - * up to one second to finish their work before forcing them off - * with the NMI. + * critical regions. Jumping straight to NMI or INIT might + * accidentally cause deadlocks with further shutdown code. This + * gives the CPUs up to one second to finish their work before + * forcing them off with the NMI or INIT. 
*/ if (num_online_cpus() > 1) { apic_send_IPI_allbutself(REBOOT_VECTOR); @@ -175,7 +174,7 @@ static void native_stop_other_cpus(int w /* * Don't wait longer than a second for IPI completion. The * wait request is not checked here because that would - * prevent an NMI shutdown attempt in case that not all + * prevent an NMI/INIT shutdown in case that not all * CPUs reach shutdown state. */ timeout =3D USEC_PER_SEC; @@ -183,7 +182,27 @@ static void native_stop_other_cpus(int w udelay(1); } =20 - /* if the REBOOT_VECTOR didn't work, try with the NMI */ + /* + * Park all nonboot CPUs in INIT including offline CPUs, if + * possible. That's a safe place where they can't resume execution + * of HLT and then execute the HLT loop from overwritten text or + * page tables. + * + * The only downside is a broadcast MCE, but up to the point where + * the kexec() kernel brought all APs online again an MCE will just + * make HLT resume and handle the MCE. The machine crashes and burns + * due to overwritten text, page tables and data. So there is a + * choice between fire and frying pan. The result is pretty much + * the same. Choose frying pan until x86 provides a sane mechanism + * to park a CPU. + */ + if (smp_park_nonboot_cpus_in_init()) + goto done; + + /* + * If park with INIT was not possible and the REBOOT_VECTOR didn't + * take all secondary CPUs offline, try with the NMI. 
+ */ if (num_online_cpus() > 1) { /* * If NMI IPI is enabled, try to register the stop handler @@ -208,6 +227,7 @@ static void native_stop_other_cpus(int w udelay(1); } =20 +done: local_irq_save(flags); disable_local_APIC(); mcheck_cpu_clear(this_cpu_ptr(&cpu_info)); --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1467,6 +1467,25 @@ void arch_thaw_secondary_cpus_end(void) cache_aps_init(); } =20 +bool smp_park_nonboot_cpus_in_init(void) +{ + unsigned int cpu, this_cpu =3D smp_processor_id(); + unsigned int apicid; + + if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu) + return false; + + for_each_present_cpu(cpu) { + if (cpu =3D=3D this_cpu) + continue; + apicid =3D apic->cpu_present_to_apicid(cpu); + if (apicid =3D=3D BAD_APICID) + continue; + send_init_sequence(apicid); + } + return true; +} + /* * Early setup to make printk work. */