From nobody Sun Dec 28 21:16:58 2025
From: "Kirill A. Shutemov"
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org
Cc: "Rafael J. Wysocki", Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, "Kalra, Ashish", Sean Christopherson,
	"Huang, Kai", Baoquan He, kexec@lists.infradead.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov"
Subject: [PATCHv4 01/14] x86/acpi: Extract ACPI MADT wakeup code into a separate file
Date: Tue, 5 Dec 2023 03:44:57 +0300
Message-ID: <20231205004510.27164-2-kirill.shutemov@linux.intel.com>
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.

Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Kuppuswamy Sathyanarayanan
Acked-by: Kai Huang
---
 arch/x86/Kconfig                   |  7 +++
 arch/x86/include/asm/acpi.h        |  5 ++
 arch/x86/kernel/acpi/Makefile      | 11 ++--
 arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
 arch/x86/kernel/acpi/madt_wakeup.c | 81 ++++++++++++++++++++++++++++
 5 files changed, 100 insertions(+), 90 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c456c9b1fc7c..969dca563077 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1104,6 +1104,13 @@ config X86_LOCAL_APIC
 	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
 	select IRQ_DOMAIN_HIERARCHY
 
+config X86_ACPI_MADT_WAKEUP
+	def_bool y
+	depends on X86_64
+	depends on ACPI
+	depends on SMP
+	depends on X86_LOCAL_APIC
+
 config X86_IO_APIC
 	def_bool y
 	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index f896eed4516c..2625b915ae7f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -76,6 +76,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
 
 #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
 
+union acpi_subtable_headers;
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end);
+
 /*
  * Check if the CPU can handle C2 and deeper
  */
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index fc17b3f136fe..8c7329c88a75 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -1,11 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_ACPI)		+= boot.o
-obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
-obj-$(CONFIG_ACPI_APEI)		+= apei.o
-obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
+obj-$(CONFIG_ACPI)			+= boot.o
+obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_APEI)			+= apei.o
+obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
-obj-y				+= cstate.o
+obj-y					+= cstate.o
 endif
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 1a0dd80d81ac..171d86fe71ef 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -67,13 +67,6 @@ static bool has_lapic_cpus __initdata;
 static bool acpi_support_online_capable;
 #endif
 
-#ifdef CONFIG_X86_64
-/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
-/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
-#endif
-
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -369,60 +362,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
-{
-	/*
-	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
-	 *
-	 * Wakeup of secondary CPUs is fully serialized in the core code.
-	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
-	 */
-	if (!acpi_mp_wake_mailbox) {
-		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
-						sizeof(*acpi_mp_wake_mailbox),
-						MEMREMAP_WB);
-	}
-
-	/*
-	 * Mailbox memory is shared between the firmware and OS. Firmware will
-	 * listen on mailbox command address, and once it receives the wakeup
-	 * command, the CPU associated with the given apicid will be booted.
-	 *
-	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
-	 * firmware before the wakeup command is visible. smp_store_release()
-	 * ensures ordering and visibility.
-	 */
-	acpi_mp_wake_mailbox->apic_id = apicid;
-	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
-	smp_store_release(&acpi_mp_wake_mailbox->command,
-			  ACPI_MP_WAKE_COMMAND_WAKEUP);
-
-	/*
-	 * Wait for the CPU to wake up.
-	 *
-	 * The CPU being woken up is essentially in a spin loop waiting to be
-	 * woken up. It should not take long for it wake up and acknowledge by
-	 * zeroing out ->command.
-	 *
-	 * ACPI specification doesn't provide any guidance on how long kernel
-	 * has to wait for a wake up acknowledgement. It also doesn't provide
-	 * a way to cancel a wake up request if it takes too long.
-	 *
-	 * In TDX environment, the VMM has control over how long it takes to
-	 * wake up secondary. It can postpone scheduling secondary vCPU
-	 * indefinitely. Giving up on wake up request and reporting error opens
-	 * possible attack vector for VMM: it can wake up a secondary CPU when
-	 * kernel doesn't expect it. Wait until positive result of the wake up
-	 * request.
-	 */
-	while (READ_ONCE(acpi_mp_wake_mailbox->command))
-		cpu_relax();
-
-	return 0;
-}
-#endif /* CONFIG_X86_64 */
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1159,29 +1098,6 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
-				     const unsigned long end)
-{
-	struct acpi_madt_multiproc_wakeup *mp_wake;
-
-	if (!IS_ENABLED(CONFIG_SMP))
-		return -ENODEV;
-
-	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
-	if (BAD_MADT_ENTRY(mp_wake, end))
-		return -EINVAL;
-
-	acpi_table_print_madt_entry(&header->common);
-
-	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
-
-	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
-
-	return 0;
-}
-#endif /* CONFIG_X86_64 */
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1378,7 +1294,7 @@ static void __init acpi_process_madt(void)
 				smp_found_config = 1;
 			}
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_ACPI_MADT_WAKEUP
 			/*
 			 * Parse MADT MP Wake entry.
 			 */
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
new file mode 100644
index 000000000000..f4be492b7e4c
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -0,0 +1,81 @@
+#include
+#include
+#include
+#include
+#include
+
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+
+static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
+{
+	/*
+	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
+	 *
+	 * Wakeup of secondary CPUs is fully serialized in the core code.
+	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
+	 */
+	if (!acpi_mp_wake_mailbox) {
+		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+						sizeof(*acpi_mp_wake_mailbox),
+						MEMREMAP_WB);
+	}
+
+	/*
+	 * Mailbox memory is shared between the firmware and OS. Firmware will
+	 * listen on mailbox command address, and once it receives the wakeup
+	 * command, the CPU associated with the given apicid will be booted.
+	 *
+	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
+	 * firmware before the wakeup command is visible. smp_store_release()
+	 * ensures ordering and visibility.
+	 */
+	acpi_mp_wake_mailbox->apic_id = apicid;
+	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * Wait for the CPU to wake up.
+	 *
+	 * The CPU being woken up is essentially in a spin loop waiting to be
+	 * woken up. It should not take long for it wake up and acknowledge by
+	 * zeroing out ->command.
+	 *
+	 * ACPI specification doesn't provide any guidance on how long kernel
+	 * has to wait for a wake up acknowledgement. It also doesn't provide
+	 * a way to cancel a wake up request if it takes too long.
+	 *
+	 * In TDX environment, the VMM has control over how long it takes to
+	 * wake up secondary. It can postpone scheduling secondary vCPU
+	 * indefinitely. Giving up on wake up request and reporting error opens
+	 * possible attack vector for VMM: it can wake up a secondary CPU when
+	 * kernel doesn't expect it. Wait until positive result of the wake up
+	 * request.
+	 */
+	while (READ_ONCE(acpi_mp_wake_mailbox->command))
+		cpu_relax();
+
+	return 0;
+}
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+
+	return 0;
+}
-- 
2.41.0

From nobody Sun Dec 28 21:16:58 2025
From: "Kirill A. Shutemov"
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org
Cc: "Rafael J. Wysocki", Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, "Kalra, Ashish", Sean Christopherson,
	"Huang, Kai", Baoquan He, kexec@lists.infradead.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov"
Subject: [PATCHv4 02/14] x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
Date: Tue, 5 Dec 2023 03:44:58 +0300
Message-ID: <20231205004510.27164-3-kirill.shutemov@linux.intel.com>
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>

acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once
during ACPI MADT init and never changed afterwards.

Signed-off-by: Kirill A. Shutemov
Acked-by: Kai Huang
---
 arch/x86/kernel/acpi/madt_wakeup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index f4be492b7e4c..38ffd4524e44 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -5,10 +5,10 @@
 #include
 
 /* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
+static u64 acpi_mp_wake_mailbox_paddr __ro_after_init;
 
 /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __ro_after_init;
 
 static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
 {
-- 
2.41.0

From nobody Sun Dec 28 21:16:58 2025
From: "Kirill A. Shutemov"
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org
Cc: "Rafael J. Wysocki", Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, "Kalra, Ashish", Sean Christopherson,
	"Huang, Kai", Baoquan He, kexec@lists.infradead.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov"
Subject: [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported
Date: Tue, 5 Dec 2023 03:44:59 +0300
Message-ID: <20231205004510.27164-4-kirill.shutemov@linux.intel.com>
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>

The ACPI MADT mailbox wakeup method does not allow a CPU to be offlined
after it has been woken up.

Currently, CPU offlining is prevented based on the confidential
computing attribute, which is set for Intel TDX. But TDX is not the only
possible user of the wakeup method. The MADT wakeup can be implemented
outside of a confidential computing environment. Offline support is a
property of the wakeup method, not the CoCo implementation.

Introduce cpu_hotplug_disable_offlining() that can be called to indicate
that CPU offlining should be disabled.

This function is going to replace CC_ATTR_HOTPLUG_DISABLED for the ACPI
MADT wakeup method.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Thomas Gleixner
---
 include/linux/cpu.h |  2 ++
 kernel/cpu.c        | 13 ++++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index fc8094419084..46f2e34a0c5e 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -134,6 +134,7 @@ extern void cpus_read_lock(void);
 extern void cpus_read_unlock(void);
 extern int cpus_read_trylock(void);
 extern void lockdep_assert_cpus_held(void);
+extern void cpu_hotplug_disable_offlining(void);
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 void clear_tasks_mm_cpumask(int cpu);
@@ -149,6 +150,7 @@ static inline void cpus_read_lock(void) { }
 static inline void cpus_read_unlock(void) { }
 static inline int cpus_read_trylock(void) { return true; }
 static inline void lockdep_assert_cpus_held(void) { }
+static inline void cpu_hotplug_disable_offlining(void) { }
 static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a86972a91991..af8034ccda8e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -484,6 +484,8 @@ static int cpu_hotplug_disabled;
 
 DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);
 
+static bool cpu_hotplug_offline_disabled;
+
 void cpus_read_lock(void)
 {
 	percpu_down_read(&cpu_hotplug_lock);
@@ -543,6 +545,14 @@ static void lockdep_release_cpus_lock(void)
 	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
 }
 
+/* Declare CPU offlining not supported */
+void cpu_hotplug_disable_offlining(void)
+{
+	cpu_maps_update_begin();
+	cpu_hotplug_offline_disabled = true;
+	cpu_maps_update_done();
+}
+
 /*
  * Wait for currently running CPU hotplug operations to complete (if any) and
  * disable future CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
@@ -1522,7 +1532,8 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) ||
+	    cpu_hotplug_offline_disabled)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.41.0

From nobody Sun Dec 28 21:16:58 2025
From: "Kirill A. Shutemov"
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org
Cc: "Rafael J. Wysocki", Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, "Kalra, Ashish", Sean Christopherson,
	"Huang, Kai", Baoquan He, kexec@lists.infradead.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov"
Subject: [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
Date: Tue, 5 Dec 2023 03:45:00 +0300
Message-ID: <20231205004510.27164-5-kirill.shutemov@linux.intel.com>
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>

The ACPI MADT wakeup method does not allow a CPU to be offlined after it
has been woken up.

Currently, hotplug is prevented based on the confidential computing
attribute, which is set for Intel TDX. But TDX is not the only possible
user of the wakeup method.

Disable CPU offlining on ACPI MADT wakeup enumeration.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Thomas Gleixner
---
 arch/x86/coco/core.c               |  1 -
 arch/x86/kernel/acpi/madt_wakeup.c |  3 +++
 include/linux/cc_platform.h        | 10 ----------
 kernel/cpu.c                       |  3 +--
 4 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..f07c3bb7deab 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,7 +20,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
 {
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
-	case CC_ATTR_HOTPLUG_DISABLED:
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
 	case CC_ATTR_MEM_ENCRYPT:
 		return true;
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 38ffd4524e44..f7e33cea1be5 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,4 +1,5 @@
 #include
+#include
 #include
 #include
 #include
@@ -75,6 +76,8 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
 
+	cpu_hotplug_disable_offlining();
+
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
 	return 0;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd1c12f..d08dd65b5c43 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -80,16 +80,6 @@ enum cc_attr {
 	 * using AMD SEV-SNP features.
 	 */
 	CC_ATTR_GUEST_SEV_SNP,
-
-	/**
-	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
-	 *
-	 * The platform/OS is running as a guest/virtual machine does not
-	 * support CPU hotplug feature.
-	 *
-	 * Examples include TDX Guest.
-	 */
-	CC_ATTR_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index af8034ccda8e..a9e1628cebbb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1532,8 +1532,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) ||
-	    cpu_hotplug_offline_disabled)
+	if (cpu_hotplug_offline_disabled)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.41.0

From nobody Sun Dec 28 21:16:58 2025
From: "Kirill A. Shutemov"
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org
Cc: "Rafael J. Wysocki", Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, "Kalra, Ashish", Sean Christopherson,
	"Huang, Kai", Baoquan He, kexec@lists.infradead.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov", Vitaly Kuznetsov, Paolo Bonzini, Wanpeng Li
Subject: [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled
Date: Tue, 5 Dec 2023 03:45:01 +0300
Message-ID: <20231205004510.27164-6-kirill.shutemov@linux.intel.com>
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>

kvm_guest_cpu_offline() tries to disable kvmclock regardless of whether
it is present in the VM.
This leads to a write to an MSR that doesn't exist on some
configurations, namely in a TDX guest:

	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)

kvmclock enabling is gated by the CLOCKSOURCE and CLOCKSOURCE2 KVM
paravirt features.

Do not disable kvmclock if it was not enabled.

Signed-off-by: Kirill A. Shutemov
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
Reviewed-by: Sean Christopherson
Reviewed-by: Vitaly Kuznetsov
Cc: Paolo Bonzini
Cc: Wanpeng Li
---
 arch/x86/kernel/kvmclock.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..f2fff625576d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -24,8 +24,8 @@
 
 static int kvmclock __initdata = 1;
 static int kvmclock_vsyscall __initdata = 1;
-static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static int msr_kvm_system_time __ro_after_init;
+static int msr_kvm_wall_clock __ro_after_init;
 static u64 kvm_sched_clock_offset __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
@@ -195,7 +195,8 @@ static void kvm_setup_secondary_clock(void)
 
 void kvmclock_disable(void)
 {
-	native_write_msr(msr_kvm_system_time, 0, 0);
+	if (msr_kvm_system_time)
+		native_write_msr(msr_kvm_system_time, 0, 0);
 }
 
 static void __init kvmclock_init_mem(void)
@@ -294,7 +295,10 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-	} else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+	} else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+	} else {
 		return;
 	}
 
--=20 2.41.0 From nobody Sun Dec 28 21:16:58 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 22494C4167B for ; Tue, 5 Dec 2023 00:46:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234910AbjLEAqD (ORCPT ); Mon, 4 Dec 2023 19:46:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40134 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1343632AbjLEAp2 (ORCPT ); Mon, 4 Dec 2023 19:45:28 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AD0D2A4 for ; Mon, 4 Dec 2023 16:45:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737134; x=1733273134; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=wYs3fU4C0jZd+PnKZxc0e8e/KDE7b/5Kd3uwb7sJ62o=; b=Qb2rTBYqT1YCgBPpUKZffe+VXslh7ruFlAP0N7wHYiza4djqc0gwNBa7 5rJV836ZOyxlgYWcqULyqj+61zpDSKDk4E8K2xWEHP1OyOZTH83Luu+iu g5V2vJRDPyS7w3jqkCBmjVPwlI4Zp/xwyohmzWAfpZ+meYfqP6piTJrKc 7LKMt+3wZJGpqNvAOY/58/qshBeVEuLwCl+6UwhICz6zFN2cy/a/IKAvx pYHaQoDcy3grocjYsfFN+J0DKVSxAvgdfNiA0t84rQm9u2A+DRHsi4Udu v39zl6ZQ+Oq3Kskx+jaPI56/g3dCFp/nSOlTLnwxZ1v6MLe+AADFtvBiW Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688656" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688656" Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:34 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="888704402" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="888704402" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) 
([10.252.61.240]) by fmsmga002-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:29 -0800
Received: by box.shutemov.name (Postfix, from userid 1000)
	id 67CE110A446; Tue, 5 Dec 2023 03:45:20 +0300 (+03)
From: "Kirill A. Shutemov"
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org
Cc: "Rafael J. Wysocki", Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, "Kalra, Ashish", Sean Christopherson,
	"Huang, Kai", Baoquan He, kexec@lists.infradead.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov"
Subject: [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
Date: Tue, 5 Dec 2023 03:45:02 +0300
Message-ID: <20231205004510.27164-7-kirill.shutemov@linux.intel.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

TDX guests are not allowed to clear CR4.MCE. An attempt to clear it
leads to #VE.

Use alternatives to keep the flag set during kexec for TDX guests.

The change doesn't affect non-TDX-guest environments.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Kai Huang
---
 arch/x86/kernel/relocate_kernel_64.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..cd6a53667c6b 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,12 +145,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	 * Set cr4 to a known state:
 	 * - physical address extension enabled
 	 * - 5-level paging, if it was enabled before
+	 * - Machine check exception on TDX guest.
Clearing MCE is not allowed + * in TDX guests. */ movl $X86_CR4_PAE, %eax testq $X86_CR4_LA57, %r13 jz 1f orl $X86_CR4_LA57, %eax 1: + ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %eax), X86_FEATURE_TDX_GUEST movq %rax, %cr4 =20 jmp 1f --=20 2.41.0 From nobody Sun Dec 28 21:16:58 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E0C51C4167B for ; Tue, 5 Dec 2023 00:46:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346402AbjLEAqM (ORCPT ); Mon, 4 Dec 2023 19:46:12 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40156 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1343679AbjLEAp3 (ORCPT ); Mon, 4 Dec 2023 19:45:29 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D460FF for ; Mon, 4 Dec 2023 16:45:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737135; x=1733273135; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=NCrHDHYqaRsJTQsJlXRJcHU/RO2MzFqbmR6RTmdv2Ks=; b=d9NAO5uHaRG4oXsUaR5nJOJXg+hypracdenodTRqjiuo4KUoIRU2senC XcIndgk4s+zJi/jASTcVrRSYUBt1QFwFJhZplj/s/LUlz4P8ow7n72eQw 8DgHtfu7HGqgoU3QZ1h3IeASHFGXO2MyLu2EcCorMNpzbwfycUKAmM0Hg GrINEPHfGDxNzYX3wwGmvIxdM52M9eAcY/lTMXPiHtob5kyoPyYAGDNZa xhTAWbp0YnhI0wNqCaK4xEFpkDdCOrrmbG20g9DZ0ZAhY5EJ4IqO3XXt3 B/pW7D0te26w2VHS1nv1amvdOKUdyynnmNodZnJTMZAZdBuEyDOJK1soG Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688677" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688677" Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 
16:45:34 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="888704408"
X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="888704408"
Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240])
	by fmsmga002-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800
Received: by box.shutemov.name (Postfix, from userid 1000)
	id 72EEE10A447; Tue, 5 Dec 2023 03:45:20 +0300 (+03)
From: "Kirill A. Shutemov"
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org
Cc: "Rafael J. Wysocki", Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, "Kalra, Ashish", Sean Christopherson,
	"Huang, Kai", Baoquan He, kexec@lists.infradead.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov"
Subject: [PATCHv4 07/14] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
Date: Tue, 5 Dec 2023 03:45:03 +0300
Message-ID: <20231205004510.27164-8-kirill.shutemov@linux.intel.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

TDX is going to have more than one reason to fail
enc_status_change_prepare().

Change the callback to return errno instead of assuming -EIO;
enc_status_change_finish() is changed too, to keep the interface
symmetric.

Signed-off-by: Kirill A. Shutemov
---
 arch/x86/coco/tdx/tdx.c         | 20 +++++++++++---------
 arch/x86/hyperv/ivm.c           |  9 +++------
 arch/x86/include/asm/x86_init.h |  4 ++--
 arch/x86/kernel/x86_init.c      |  4 ++--
 arch/x86/mm/mem_encrypt_amd.c   |  8 ++++----
 arch/x86/mm/pat/set_memory.c    |  9 +++++----
 6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1b5d17a9f70d..2d90043a0e91 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -797,28 +797,30 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	return true;
 }
 
-static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
-					  bool enc)
+static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+					 bool enc)
 {
 	/*
 	 * Only handle shared->private conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (enc)
-		return tdx_enc_status_changed(vaddr, numpages, enc);
-	return true;
+	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+		return -EIO;
+
+	return 0;
 }
 
-static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 					 bool enc)
 {
 	/*
 	 * Only handle private->shared conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (!enc)
-		return tdx_enc_status_changed(vaddr, numpages, enc);
-	return true;
+	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+		return -EIO;
+
+	return 0;
 }
 
 void __init tdx_early_init(void)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 02e55237d919..2e1be1afeebe 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -510,13 +510,12 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
  * with host. This function works as wrap of hv_mark_gpa_visibility()
  * with memory base and size.
  */
-static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
+static int hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
 {
 	enum hv_mem_host_visibility visibility = enc ?
 			VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
 	u64 *pfn_array;
 	int ret = 0;
-	bool result = true;
 	int i, pfn;
 
 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
@@ -530,17 +529,15 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
 		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
 			ret = hv_mark_gpa_visibility(pfn, pfn_array,
 						     visibility);
-			if (ret) {
-				result = false;
+			if (ret)
 				goto err_free_pfn_array;
-			}
 			pfn = 0;
 		}
 	}
 
 err_free_pfn_array:
 	kfree(pfn_array);
-	return result;
+	return ret;
 }
 
 static bool hv_vtom_tlb_flush_required(bool private)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c878616a18b8..c9503fe2d13a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -150,8 +150,8 @@ struct x86_init_acpi {
  * @enc_cache_flush_required	Returns true if a cache flush is needed before changing page encryption status
  */
 struct x86_guest {
-	bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
-	bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
+	int (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
+	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3b4773..f0f54e109eb9 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -131,8 +131,8 @@ struct x86_cpuinit_ops x86_cpuinit = {
 
 static void default_nmi_init(void) { };
 
-static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
+static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
+static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
 static bool enc_tlb_flush_required_noop(bool enc) { return false; }
 static bool enc_cache_flush_required_noop(void) { return false; }
 static bool is_private_mmio_noop(u64 addr) {return false; }
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index a68f2dda0948..6cf6cc8ae6a6 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -282,7 +282,7 @@ static void enc_dec_hypercall(unsigned long vaddr, unsigned long size, bool enc)
 #endif
 }
 
-static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
 {
 	/*
 	 * To maintain the security guarantees of SEV-SNP guests, make sure
@@ -291,11 +291,11 @@ static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool
 	if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
 		snp_set_memory_shared(vaddr, npages);
 
-	return true;
+	return 0;
 }
 
 /* Return true unconditionally: return value doesn't matter for the SEV side */
-static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
 {
 	/*
 	 * After memory is mapped encrypted in the page table, validate it
@@ -307,7 +307,7 @@ static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool e
 	if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
 		enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);
 
-	return true;
+	return 0;
 }
 
 static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
diff --git
a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index bda9f129835e..6fbf22d5fa56 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2152,8 +2152,9 @@ static int __set_memory_enc_pgtable(unsigned long add= r, int numpages, bool enc) cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required()); =20 /* Notify hypervisor that we are about to set/clr encryption attribute. */ - if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc)) - return -EIO; + ret =3D x86_platform.guest.enc_status_change_prepare(addr, numpages, enc); + if (ret) + return ret; =20 ret =3D __change_page_attr_set_clr(&cpa, 1); =20 @@ -2168,8 +2169,8 @@ static int __set_memory_enc_pgtable(unsigned long add= r, int numpages, bool enc) =20 /* Notify hypervisor that we have successfully set/clr encryption attribu= te. */ if (!ret) { - if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc)) - ret =3D -EIO; + ret =3D x86_platform.guest.enc_status_change_finish(addr, + numpages, enc); } =20 return ret; --=20 2.41.0 From nobody Sun Dec 28 21:16:58 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2BA82C4167B for ; Tue, 5 Dec 2023 00:46:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376317AbjLEAqW (ORCPT ); Mon, 4 Dec 2023 19:46:22 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60356 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234800AbjLEApg (ORCPT ); Mon, 4 Dec 2023 19:45:36 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9B6C310F for ; Mon, 4 Dec 2023 16:45:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; 
s=Intel; t=1701737136; x=1733273136; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=7MgS2tz+eXb3t+LyzYLE/xkq/zAhJTq68EvRtEFLASY=; b=h/2DWoQWkX45VJv1byn7pMDOhmAtcyjOwYlZCxaBpH6PMXzcDZJgsNLR T18KJ6FZI8IXgdVMwJsRxwwwG71o6QeumVAxEKfArLN6MqCjQyw8pSYN+ Cajon8sbvSooktnuSkkugH4j0SAMP3VGQZgMfHaFkiwxstG/32qOSz4U4 Z1Q1khz4fVkCYNw47Sg5AY9U/PxY8xll7iHL0HPtcBPHgXhBtwllSA77o gKeSQKanOgFzi8XwlNNs3q2FWuNbl29rB7sum5Msm+DtdCHihY0ZZIKbz jpy8vzL/kW0D2Ev3PbZ+2AkFIAsBLy/QHuCv7Xmk0E55tjesqSEzTze0b g==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688715" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688715" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:35 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="944067949" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="944067949" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800 Received: by box.shutemov.name (Postfix, from userid 1000) id 7E6FD10A448; Tue, 5 Dec 2023 03:45:20 +0300 (+03) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov"
Subject: [PATCHv4 08/14] x86/mm: Return correct level from lookup_address() if pte is none
Date: Tue, 5 Dec 2023 03:45:04 +0300
Message-ID: <20231205004510.27164-9-kirill.shutemov@linux.intel.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

lookup_address() only returns the correct page table level for an entry
if the entry is not none.

Make the helper always return the correct 'level'. This allows
implementing an iterator over kernel page tables using lookup_address().

Add one more entry to enum pg_level to indicate the size of the VA range
covered by one PGD entry in 5-level paging mode.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Rick Edgecombe
---
 arch/x86/include/asm/pgtable_types.h | 1 +
 arch/x86/mm/pat/set_memory.c         | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0b748ee16b3d..3f648ffdfbe5 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -548,6 +548,7 @@ enum pg_level {
 	PG_LEVEL_2M,
 	PG_LEVEL_1G,
 	PG_LEVEL_512G,
+	PG_LEVEL_256T,
 	PG_LEVEL_NUM
 };
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6fbf22d5fa56..01f827eb8e80 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -666,32 +666,32 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 
-	*level = PG_LEVEL_NONE;
+	*level = PG_LEVEL_256T;
 
 	if (pgd_none(*pgd))
 		return NULL;
 
+	*level = PG_LEVEL_512G;
 	p4d = p4d_offset(pgd, address);
 	if (p4d_none(*p4d))
 		return NULL;
 
-	*level = PG_LEVEL_512G;
 	if (p4d_large(*p4d) || !p4d_present(*p4d))
 		return (pte_t
*)p4d; =20 + *level =3D PG_LEVEL_1G; pud =3D pud_offset(p4d, address); if (pud_none(*pud)) return NULL; =20 - *level =3D PG_LEVEL_1G; if (pud_large(*pud) || !pud_present(*pud)) return (pte_t *)pud; =20 + *level =3D PG_LEVEL_2M; pmd =3D pmd_offset(pud, address); if (pmd_none(*pmd)) return NULL; =20 - *level =3D PG_LEVEL_2M; if (pmd_large(*pmd) || !pmd_present(*pmd)) return (pte_t *)pmd; =20 --=20 2.41.0 From nobody Sun Dec 28 21:16:58 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 289C3C4167B for ; Tue, 5 Dec 2023 00:46:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346390AbjLEAqK (ORCPT ); Mon, 4 Dec 2023 19:46:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40144 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1343675AbjLEAp2 (ORCPT ); Mon, 4 Dec 2023 19:45:28 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 02C11C4 for ; Mon, 4 Dec 2023 16:45:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737134; x=1733273134; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=RP/Va3aoGG5zb/Y76A/szY23yia8vizTXE1CbaIASh0=; b=KJ8zsBnVxD9AsmsdE6Q99B38wUuo1LXVyQp78+X3YfkB3V7PptrkLrAT GzrzG/92uHhQ+s9Zg1OKO7TLKjqls+FYDaAFp7zgDirfVsWUxMoX1s9qO WPz8QiBq9ZGFaevdItCl5DHDCX3piq5jXXl7igw7E4WuoiQW48WVA0+00 EbkoNkkkgQfFHaA2mf+m5lavkB4S5mcX3ppvExsAtWN3s5431j+TlXzCN OTDC1LWumfQCuG60715jN+O9v1/ci3jKszzQ3euUG4RrxVc2PpIXAznJ3 RBx9Jq7prt+IIsfDOBrareb7vr8mo9S3rvAfa8OH+/GvkEt9OrCx2XvlD g==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688666" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; 
d="scan'208";a="392688666" Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:34 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="888704404" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="888704404" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240]) by fmsmga002-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800 Received: by box.shutemov.name (Postfix, from userid 1000) id 8959010A44A; Tue, 5 Dec 2023 03:45:20 +0300 (+03) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCHv4 09/14] x86/tdx: Account shared memory Date: Tue, 5 Dec 2023 03:45:05 +0300 Message-ID: <20231205004510.27164-10-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" The kernel will convert all shared memory back to private during kexec. The direct mapping page tables will provide information on which memory is shared. It is extremely important to convert all shared memory. If a page is missed, it will cause the second kernel to crash when it accesses it. Keep track of the number of shared pages. 
This will allow for cross-checking against the shared information in the
direct mapping and reporting if the shared bit is lost.

Include a debugfs interface that allows for the check to be performed at
any point.

Signed-off-by: Kirill A. Shutemov
---
 arch/x86/coco/tdx/tdx.c | 69 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 2d90043a0e91..fcc159497554 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,6 +5,7 @@
 #define pr_fmt(fmt) "tdx: " fmt
 
 #include
+#include
 #include
 #include
 #include
@@ -37,6 +38,13 @@
 
 #define TDREPORT_SUBTYPE_0 0
 
+static atomic_long_t nr_shared;
+
+static inline bool pte_decrypted(pte_t pte)
+{
+	return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __noreturn __tdx_hypercall_failed(void)
 {
@@ -820,6 +828,11 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
 		return -EIO;
 
+	if (enc)
+		atomic_long_sub(numpages, &nr_shared);
+	else
+		atomic_long_add(numpages, &nr_shared);
+
 	return 0;
 }
 
@@ -895,3 +908,59 @@ void __init tdx_early_init(void)
 
 	pr_info("Guest detected\n");
 }
+
+#ifdef CONFIG_DEBUG_FS
+static int tdx_shared_memory_show(struct seq_file *m, void *p)
+{
+	unsigned long addr, end;
+	unsigned long found = 0;
+
+	addr = PAGE_OFFSET;
+	end = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte))
+			found += size / PAGE_SIZE;
+
+		addr += size;
+
+		cond_resched();
+	}
+
+	seq_printf(m, "Number of shared pages in kernel page tables: %16lu\n",
+		   found);
+	seq_printf(m, "Number of pages accounted as shared: %16ld\n",
+		   atomic_long_read(&nr_shared));
+	return 0;
+}
+
+static int tdx_shared_memory_open(struct inode *inode, struct file *file) +{ + return single_open(file, tdx_shared_memory_show, NULL); +} + +static const struct file_operations tdx_shared_memory_fops =3D { + .open =3D tdx_shared_memory_open, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D single_release, +}; + +static __init int debug_tdx_shared_memory(void) +{ + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) + return 0; + + debugfs_create_file("tdx_shared_memory", S_IRUSR, arch_debugfs_dir, + NULL, &tdx_shared_memory_fops); + return 0; +} +fs_initcall(debug_tdx_shared_memory); +#endif --=20 2.41.0 From nobody Sun Dec 28 21:16:58 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 61F3BC4167B for ; Tue, 5 Dec 2023 00:46:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376272AbjLEAqQ (ORCPT ); Mon, 4 Dec 2023 19:46:16 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40184 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231627AbjLEApa (ORCPT ); Mon, 4 Dec 2023 19:45:30 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 496D2101 for ; Mon, 4 Dec 2023 16:45:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737136; x=1733273136; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YHWGA0FuyU0K8kQA0J/DX6ZgopiBD6JnQLF5bZFmbyE=; b=Fd4Xsdbo6W0b+7GZd0VV7ftC6tx/UKJ2eaYnrdSyWU4HL6VEXJdESeoj IwN0v/I+HIA1xOq2plKX6ZKGHgyIr8szvxedPViptnOffTs9ik1QB5W/c Cg3J28Ondttd3QwbAXj/o8Rt6avodz3Er/ZrkEZrx6Zj8gQvemdeaaNii p1Zda2sjucPzcdzfm31/sERFSgNbLqsbXsSkHc57XEMbFYhk9/z0VJU3S 
YDbhKR+8pctkSIaJ67mRLWgSoqzQwYtQHyagi7U+RUwocwfqBvdwdmzbo 5V3AJ0dmeV5VKBA8qA9f4zTmOICLD3hgaX+sUfMKoHM3/1ocw11YwtlNl A==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688705" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688705" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:35 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="944067946" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="944067946" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800 Received: by box.shutemov.name (Postfix, from userid 1000) id 94C0110A44C; Tue, 5 Dec 2023 03:45:20 +0300 (+03) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec Date: Tue, 5 Dec 2023 03:45:06 +0300 Message-ID: <20231205004510.27164-11-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" TDX guests allocate shared buffers to perform I/O. 
This is done by allocating pages normally from the buddy allocator and
converting them to shared with set_memory_decrypted().

The second kernel has no idea which memory was converted this way. It
only sees E820_TYPE_RAM.

Accessing shared memory via a private mapping is fatal. It leads to an
unrecoverable TD exit.

On kexec, walk the direct mapping and convert all shared memory back to
private. This makes all RAM private again, so the second kernel may use
it normally.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Rick Edgecombe
---
 arch/x86/coco/tdx/kexec.c       |   0
 arch/x86/coco/tdx/tdx.c         | 120 +++++++++++++++++++++++++++++++-
 arch/x86/include/asm/x86_init.h |   1 +
 arch/x86/kernel/crash.c         |   4 ++
 arch/x86/kernel/reboot.c        |  10 +++
 5 files changed, 133 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c

diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index fcc159497554..46355ea9f4cf 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -6,14 +6,17 @@
 
 #include
 #include
+#include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
 
 /* MMIO direction */
 #define EPT_READ 0
@@ -40,6 +43,9 @@
 
 static atomic_long_t nr_shared;
 
+static atomic_t conversions_in_progress;
+static bool conversion_allowed = true;
+
 static inline bool pte_decrypted(pte_t pte)
 {
 	return cc_mkdec(pte_val(pte)) == pte_val(pte);
@@ -725,6 +731,14 @@ static bool tdx_tlb_flush_required(bool private)
 
 static bool tdx_cache_flush_required(void)
 {
+	/*
+	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
+	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
+	 * implicit conversion to shared.
+	 */
+	if (!conversion_allowed)
+		return false;
+
 	/*
 	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
 	 * TDX doesn't have such capability.
@@ -808,12 +822,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
 					 bool enc)
 {
+	atomic_inc(&conversions_in_progress);
+
+	/*
+	 * Check after bumping conversions_in_progress to serialize
+	 * against tdx_shutdown().
+	 */
+	if (!conversion_allowed) {
+		atomic_dec(&conversions_in_progress);
+		return -EBUSY;
+	}
+
 	/*
 	 * Only handle shared->private conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	return 0;
 }
@@ -825,17 +852,104 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	 * Only handle private->shared conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	if (enc)
 		atomic_long_sub(numpages, &nr_shared);
 	else
 		atomic_long_add(numpages, &nr_shared);
 
+	atomic_dec(&conversions_in_progress);
+
 	return 0;
 }
 
+static void tdx_kexec_unshare_mem(bool crash)
+{
+	unsigned long addr, end;
+	long found = 0, shared;
+
+	/* Stop new private<->shared conversions */
+	conversion_allowed = false;
+
+	/*
+	 * Crash kernel reaches here with interrupts disabled: can't wait for
+	 * conversions to finish.
+	 *
+	 * If race happened, just report and proceed.
+	 */
+	if (!crash) {
+		unsigned long timeout;
+
+		/*
+		 * Wait for in-flight conversions to complete.
+		 *
+		 * Do not wait more than 30 seconds.
+		 */
+		timeout = 30 * USEC_PER_SEC;
+		while (atomic_read(&conversions_in_progress) && timeout--)
+			udelay(1);
+	}
+
+	if (atomic_read(&conversions_in_progress))
+		pr_warn("Failed to finish shared<->private conversions\n");
+
+	/*
+	 * Walk direct mapping and convert all shared memory back to private,
+	 */
+
+	addr = PAGE_OFFSET;
+	end = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte)) {
+			int pages = size / PAGE_SIZE;
+
+			/*
+			 * Touching memory with shared bit set triggers implicit
+			 * conversion to shared.
+			 *
+			 * Make sure nobody touches the shared range from
+			 * now on.
+			 *
+			 * Bypass unmapping for crash scenario. Unmapping
+			 * requires sleepable context, but in crash case kernel
+			 * hits the code path with interrupts disabled.
+			 * It shouldn't be a problem as all secondary CPUs are
+			 * down and kernel runs with interrupts disabled, so
+			 * there is no room for race.
+			 */
+			if (!crash)
+				set_memory_np(addr, pages);
+
+			if (!tdx_enc_status_changed(addr, pages, true)) {
+				pr_err("Failed to unshare range %#lx-%#lx\n",
+				       addr, addr + size);
+			}
+
+			found += pages;
+		}
+
+		addr += size;
+	}
+
+	shared = atomic_long_read(&nr_shared);
+	if (shared != found) {
+		pr_err("shared page accounting is off\n");
+		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+	}
+}
+
 void __init tdx_early_init(void)
 {
 	struct tdx_module_args args = {
@@ -895,6 +1009,8 @@ void __init tdx_early_init(void)
 	x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
 
+	x86_platform.guest.enc_kexec_unshare_mem = tdx_kexec_unshare_mem;
+
 	/*
 	 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 	 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c9503fe2d13a..917358821a31 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -154,6 +154,7 @@ struct x86_guest {
 	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
+	void (*enc_kexec_unshare_mem)(bool crash);
 };
 
 /**
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c92d88680dbf..1618224775f5 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include
 
 /* Used while preparing memory map entries for second kernel */
 struct crash_memmap_data {
@@ -107,6 +108,9 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 
 	crash_smp_send_stop();
 
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+		x86_platform.guest.enc_kexec_unshare_mem(true);
+
 	cpu_emergency_disable_virtualization();
 
 	/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 830425e6d38e..c81afffaa954 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -31,6 +32,7 @@
 #include
 #include
 #include
+#include
 
 /*
  * Power off function, if any
@@ -716,6 +718,14 @@ static void native_machine_emergency_restart(void)
 
 void native_machine_shutdown(void)
 {
+	/*
+	 * Call enc_kexec_unshare_mem() while all CPUs are still active and
+	 * interrupts are enabled. This will allow all in-flight memory
+	 * conversions to finish cleanly before unsharing all memory.
+ */ + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) && kexec_in_progress) + x86_platform.guest.enc_kexec_unshare_mem(false); + /* Stop the cpus and apics */ #ifdef CONFIG_X86_IO_APIC /* --=20 2.41.0 From nobody Sun Dec 28 21:16:58 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F37EFC10DCE for ; Tue, 5 Dec 2023 00:46:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346429AbjLEAqd (ORCPT ); Mon, 4 Dec 2023 19:46:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60388 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234828AbjLEApg (ORCPT ); Mon, 4 Dec 2023 19:45:36 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 38CAF116 for ; Mon, 4 Dec 2023 16:45:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737138; x=1733273138; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YaJcwdfsbT/u6tJs5uRbj4NUFmy4HkVgm/EwY5jjC6U=; b=F6IQpxEkhFTBIwCR8svjMWMRi3APMIdb3u1bGxLya4UaflqAsfoIwTWs 0fWnVzogu0i0zGjdrhB/RCXbmo9vuEEeAm0iuzSEihLybW4fnXewhrjSu 8/rQe8a1C1a/Y6D0jab+W9r9RieutvmvtoEozr7ItlN6lw4Di1nWjKy11 j5wXnkXAFHUfi2mdW4NftVNY8N7gQs14ViJb9BRbw4ALYfOuGUlbmwoZz PtSA0nvato+30k5YddLkckoog08DdLI3yUdddB7DqtrbvtObnPp4FY6PB HiZwF8dthXvVRQx33zWQyb3uOpgi9ZqnIgc48FNZqSoF8L5hKMNyZk2oG w==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688743" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688743" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:36 -0800 X-ExtLoop1: 1 X-IronPort-AV: 
E=McAfee;i="6600,9927,10914"; a="944067956" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="944067956" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800 Received: by box.shutemov.name (Postfix, from userid 1000) id 9FC0810A44E; Tue, 5 Dec 2023 03:45:20 +0300 (+03) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCHv4 11/14] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges Date: Tue, 5 Dec 2023 03:45:07 +0300 Message-ID: <20231205004510.27164-12-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" e820__end_of_ram_pfn() is used to calculate max_pfn which, among other things, determines where the direct mapping ends. Any memory above max_pfn is not going to be present in the direct mapping. e820__end_of_ram_pfn() finds the end of RAM based on the highest E820_TYPE_RAM range. But it doesn't include E820_TYPE_ACPI ranges in the calculation. Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI tables and might be required by the kernel to function properly. Usually the problem is hidden because there is some E820_TYPE_RAM memory above E820_TYPE_ACPI.
But crashkernel only presents pre-allocated crash memory as E820_TYPE_RAM on boot. If the preallocated range is small, it can fit under the last E820_TYPE_ACPI range. Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover E820_TYPE_ACPI memory. The problem was discovered while debugging kexec for a TDX guest. A TDX guest uses E820_TYPE_ACPI to store the unaccepted memory bitmap and pass it between the kernels on kexec. Signed-off-by: Kirill A. Shutemov --- arch/x86/kernel/e820.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index fb8cf953380d..99c80680dc9e 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -827,7 +827,7 @@ u64 __init e820__memblock_alloc_reserved(u64 size, u64 = align) /* * Find the highest page frame number we have available */ -static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e82= 0_type type) +static unsigned long __init e820_end_ram_pfn(unsigned long limit_pfn) { int i; unsigned long last_pfn =3D 0; @@ -838,7 +838,8 @@ static unsigned long __init e820_end_pfn(unsigned long = limit_pfn, enum e820_type unsigned long start_pfn; unsigned long end_pfn; =20 - if (entry->type !=3D type) + if (entry->type !=3D E820_TYPE_RAM && + entry->type !=3D E820_TYPE_ACPI) continue; =20 start_pfn =3D entry->addr >> PAGE_SHIFT; @@ -864,12 +865,12 @@ static unsigned long __init e820_end_pfn(unsigned lon= g limit_pfn, enum e820_type =20 unsigned long __init e820__end_of_ram_pfn(void) { - return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM); + return e820_end_ram_pfn(MAX_ARCH_PFN); } =20 unsigned long __init e820__end_of_low_ram_pfn(void) { - return e820_end_pfn(1UL << (32 - PAGE_SHIFT), E820_TYPE_RAM); + return e820_end_ram_pfn(1UL << (32 - PAGE_SHIFT)); } =20 static void __init early_panic(char *msg) --=20 2.41.0 From nobody Sun Dec 28 21:16:59 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 51A2AC4167B for ; Tue, 5 Dec 2023 00:46:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376361AbjLEAq3 (ORCPT ); Mon, 4 Dec 2023 19:46:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60384 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234814AbjLEApg (ORCPT ); Mon, 4 Dec 2023 19:45:36 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 44C7D119 for ; Mon, 4 Dec 2023 16:45:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737138; x=1733273138; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=hDA/+p+nxM87pJNv9Ukdh1cz+IjWMOLujDQdnMnWdrA=; b=dSj8TyhoTN9rZ3u1fCma5ZsSdILXI1OB0cUUdTmOxcFMKE+3+aCcPn1z gGZ5yIf0nMdyIQlRp3sTTwoEVgVXXLgdSUHvQzFKKroQzTBiEDMO21cYC /Pmuzd1XGSQNsqaGmYrhsMtd+g3NSfO9/7gbVueg3zyAbgT5RD4SNW8yN X1D1jzILF8uwWy0qEzCDHRaq6eDTTYyRzu9s7Cdsct5lCTzbY3su22/xL Kjm2Kx5FLSr+47gje1KPHeWfeaqo3DTSp8HEFCAwsetDKo4Ltq+9PlOqT pnjnhcmMGqFYddOIYBmKAgO+Ur4Qr7JFZ6Jft+2Jfst5OCLdvWnLWwLrX Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688735" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688735" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:36 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="944067953" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="944067953" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800 Received: 
by box.shutemov.name (Postfix, from userid 1000) id AB59210A44F; Tue, 5 Dec 2023 03:45:20 +0300 (+03) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCHv4 12/14] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure Date: Tue, 5 Dec 2023 03:45:08 +0300 Message-ID: <20231205004510.27164-13-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" To prepare for the addition of support for MADT wakeup structure version 1, it is necessary to provide more appropriate names for the fields in the structure. The field 'mailbox_version' renamed as 'version'. This field signifies the version of the structure and the related protocols, rather than the version of the mailbox. This field has not been utilized in the code thus far. The field 'base_address' renamed as 'mailbox_address' to clarify the kind of address it represents. In version 1, the structure includes the reset vector address. Clear and distinct naming helps to prevent any confusion. Signed-off-by: Kirill A. 
Shutemov Reviewed-by: Kai Huang Reviewed-by: Kuppuswamy Sathyanarayanan --- arch/x86/kernel/acpi/madt_wakeup.c | 2 +- include/acpi/actbl2.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt= _wakeup.c index f7e33cea1be5..386adbb03094 100644 --- a/arch/x86/kernel/acpi/madt_wakeup.c +++ b/arch/x86/kernel/acpi/madt_wakeup.c @@ -74,7 +74,7 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers= *header, =20 acpi_table_print_madt_entry(&header->common); =20 - acpi_mp_wake_mailbox_paddr =3D mp_wake->base_address; + acpi_mp_wake_mailbox_paddr =3D mp_wake->mailbox_address; =20 cpu_hotplug_disable_offlining(); =20 diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h index 3751ae69432f..23b4cfb640fc 100644 --- a/include/acpi/actbl2.h +++ b/include/acpi/actbl2.h @@ -1109,9 +1109,9 @@ struct acpi_madt_generic_translator { =20 struct acpi_madt_multiproc_wakeup { struct acpi_subtable_header header; - u16 mailbox_version; + u16 version; u32 reserved; /* reserved - must be zero */ - u64 base_address; + u64 mailbox_address; }; =20 #define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE 2032 --=20 2.41.0 From nobody Sun Dec 28 21:16:59 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DF0DFC4167B for ; Tue, 5 Dec 2023 00:46:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234970AbjLEAqH (ORCPT ); Mon, 4 Dec 2023 19:46:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40172 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1343693AbjLEApa (ORCPT ); Mon, 4 Dec 2023 19:45:30 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 37C26A4 for ; Mon, 4 Dec 
2023 16:45:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737136; x=1733273136; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=X/WHs4jb2k1fyD6oT320fhQc0wtdLRiTKuUWbSP8aYs=; b=WKISbf+hEU7czwXPrDDeSg/IZZzV/rYH9B9wnom8GiSCzZ6bhRHl/AKL 0pACJNwQTVh1GU5DuFLYpbF7ytqUxypSJknXxrYcBjutH2HmYv9HPTNNO VWpnYt1Fk55raoXEAHXLaGOjPEsK3wkfz2kJ8BlfDOqAe0dXB58ywCH6o 13T70juYR9lEtgSrh908y01g/fu/xpsNR5COAk1juLtek6pUvMqZdmo6T W6F5+fsmGMIjSltBD48b24pCRLYaZE8z65F0mwiND9F5EwheVtjrzDyK2 RiIWIKmpKRwjNUNhz8P6XGI3wXjJR6GuMxEKQWl/uyg2GRHNnHMEY80oi w==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688687" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688687" Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:35 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="888704411" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="888704411" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240]) by fmsmga002-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800 Received: by box.shutemov.name (Postfix, from userid 1000) id B75CB10A450; Tue, 5 Dec 2023 03:45:20 +0300 (+03) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" Subject: [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case Date: Tue, 5 Dec 2023 03:45:09 +0300 Message-ID: <20231205004510.27164-14-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" ACPI MADT doesn't allow offlining a CPU after it has been woken up. This limits kexec: the second kernel won't be able to use more than one CPU. At this point acpi_mp_wake_mailbox_paddr already holds the mailbox address, and acpi_wakeup_cpu() will use it to bring up secondary CPUs. Zero out the mailbox address in the ACPI MADT wakeup structure to indicate that the mailbox is not usable. This prevents the kexec()-ed kernel from reading a valid mailbox, which in turn limits the kexec()-ed kernel to the boot CPU. This is a Linux-specific protocol and is not reflected in the ACPI spec. Booting the second kernel with a single CPU is enough to cover the most common case for kexec -- kdump. Signed-off-by: Kirill A. Shutemov Reviewed-by: Kai Huang Reviewed-by: Kuppuswamy Sathyanarayanan --- arch/x86/kernel/acpi/madt_wakeup.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt= _wakeup.c index 386adbb03094..5d92d12f1042 100644 --- a/arch/x86/kernel/acpi/madt_wakeup.c +++ b/arch/x86/kernel/acpi/madt_wakeup.c @@ -13,6 +13,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_m= p_wake_mailbox __ro_afte =20 static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip) { + if (!acpi_mp_wake_mailbox_paddr) { + pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. 
Booting wi= th kexec?\n"); + return -EOPNOTSUPP; + } + /* * Remap mailbox memory only for the first call to acpi_wakeup_cpu(). * @@ -78,6 +83,23 @@ int __init acpi_parse_mp_wake(union acpi_subtable_header= s *header, =20 cpu_hotplug_disable_offlining(); =20 + /* + * ACPI MADT doesn't allow to offline CPU after it got woke up. + * It limits kexec: the second kernel won't be able to use more than + * one CPU. + * + * Now acpi_mp_wake_mailbox_paddr already has the mailbox address. + * The acpi_wakeup_cpu() will use it to bring up secondary cpus. + * + * Zero out mailbox address in the ACPI MADT wakeup structure to + * indicate that the mailbox is not usable. This prevents the + * kexec()-ed kernel from reading a vaild mailbox, which in turn + * makes the kexec()-ed kernel only be able to use the boot CPU. + * + * This is Linux-specific protocol and not reflected in ACPI spec. + */ + mp_wake->mailbox_address =3D 0; + apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu); =20 return 0; --=20 2.41.0 From nobody Sun Dec 28 21:16:59 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6E321C4167B for ; Tue, 5 Dec 2023 00:46:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229668AbjLEAq0 (ORCPT ); Mon, 4 Dec 2023 19:46:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60398 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234805AbjLEApg (ORCPT ); Mon, 4 Dec 2023 19:45:36 -0500 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 48EBA11F for ; Mon, 4 Dec 2023 16:45:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701737138; x=1733273138; 
h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=8b6nnzvT7MJKGtUPhN9ZDFjx6caST117rFTmJst0Njg=; b=foXu8RPGhqzF6GxRPuTrqcqsd7OcHA6AnW4eXr2ihnXU4wGYKe+rVBHp Ex3thwpFcbpnUXs6pINcWZ0HA77uF3EADobnW832oFUB49OKPyFbmk8PZ x8iW3I6YsQSlBjgA5ONH0STX5xkiW9bf8x+mGLDg3Yy5NGk9QRyAookCe h/gBle8ZlmyEuFZs/HluYvBBXND/ZSrC4NUnxpwxzyuq//HeOPWqQd0OV s61JwAcxvnPnPebCb5Mrz5vxlUiMnaeQNjp2/26SD1RMKNmXAWKL5/XZI JvRVyKSvcmEzoU3TnxuD9todC2AZ5T7IrrPnJ1H0RG0qDeFs6ObpxZfB8 A==; X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="392688751" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="392688751" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:36 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10914"; a="944067958" X-IronPort-AV: E=Sophos;i="6.04,251,1695711600"; d="scan'208";a="944067958" Received: from abijaz-mobl2.ger.corp.intel.com (HELO box.shutemov.name) ([10.252.61.240]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 16:45:30 -0800 Received: by box.shutemov.name (Postfix, from userid 1000) id C31E410A454; Tue, 5 Dec 2023 03:45:20 +0300 (+03) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" Subject: [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Date: Tue, 5 Dec 2023 03:45:10 +0300 Message-ID: <20231205004510.27164-15-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> References: <20231205004510.27164-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" MADT Multiprocessor Wakeup structure version 1 adds support for CPU offlining: BIOS provides a reset vector to which the CPU has to jump to offline itself. The new TEST mailbox command can be used to test that the CPU was offlined successfully and that BIOS has control over it. Add CPU offlining support for the ACPI MADT wakeup method by implementing custom cpu_die, play_dead and stop_other_cpus SMP operations. CPU offlining makes it possible to hand over secondary CPUs over kexec, not limiting the second kernel to a single CPU. The change conforms to the approved ACPI spec change proposal. See the Link. Signed-off-by: Kirill A. 
Shutemov Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher --- arch/x86/include/asm/smp.h | 1 + arch/x86/kernel/acpi/Makefile | 2 +- arch/x86/kernel/acpi/madt_playdead.S | 21 ++ arch/x86/kernel/acpi/madt_wakeup.c | 295 +++++++++++++++++++++++++-- arch/x86/kernel/reboot.c | 12 +- include/acpi/actbl2.h | 15 +- 6 files changed, 321 insertions(+), 25 deletions(-) create mode 100644 arch/x86/kernel/acpi/madt_playdead.S diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index 4fab2ed454f3..3c8efba86d5c 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -38,6 +38,7 @@ struct smp_ops { int (*cpu_disable)(void); void (*cpu_die)(unsigned int cpu); void (*play_dead)(void); + void (*crash_play_dead)(void); =20 void (*send_call_func_ipi)(const struct cpumask *mask); void (*send_call_func_single_ipi)(int cpu); diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile index 8c7329c88a75..37b1f28846de 100644 --- a/arch/x86/kernel/acpi/Makefile +++ b/arch/x86/kernel/acpi/Makefile @@ -4,7 +4,7 @@ obj-$(CONFIG_ACPI) +=3D boot.o obj-$(CONFIG_ACPI_SLEEP) +=3D sleep.o wakeup_$(BITS).o obj-$(CONFIG_ACPI_APEI) +=3D apei.o obj-$(CONFIG_ACPI_CPPC_LIB) +=3D cppc.o -obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) +=3D madt_wakeup.o +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) +=3D madt_wakeup.o madt_playdead.o =20 ifneq ($(CONFIG_ACPI_PROCESSOR),) obj-y +=3D cstate.o diff --git a/arch/x86/kernel/acpi/madt_playdead.S b/arch/x86/kernel/acpi/ma= dt_playdead.S new file mode 100644 index 000000000000..68f83865a1e3 --- /dev/null +++ b/arch/x86/kernel/acpi/madt_playdead.S @@ -0,0 +1,21 @@ +#include +#include +#include +#include + + .text + .align PAGE_SIZE +SYM_FUNC_START(asm_acpi_mp_play_dead) + /* Turn off global entries. Following CR3 write will flush them. 
*/ + movq %cr4, %rdx + andq $~(X86_CR4_PGE), %rdx + movq %rdx, %cr4 + + /* Switch to identity mapping */ + movq %rsi, %rax + movq %rax, %cr3 + + /* Jump to reset vector */ + ANNOTATE_RETPOLINE_SAFE + jmp *%rdi +SYM_FUNC_END(asm_acpi_mp_play_dead) diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt= _wakeup.c index 5d92d12f1042..f8cf7a048743 100644 --- a/arch/x86/kernel/acpi/madt_wakeup.c +++ b/arch/x86/kernel/acpi/madt_wakeup.c @@ -1,9 +1,18 @@ #include #include +#include #include +#include +#include +#include +#include #include #include +#include +#include +#include #include +#include =20 /* Physical address of the Multiprocessor Wakeup Structure mailbox */ static u64 acpi_mp_wake_mailbox_paddr __ro_after_init; @@ -11,6 +20,228 @@ static u64 acpi_mp_wake_mailbox_paddr __ro_after_init; /* Virtual address of the Multiprocessor Wakeup Structure mailbox */ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __r= o_after_init; =20 +static u64 acpi_mp_pgd __ro_after_init; +static u64 acpi_mp_reset_vector_paddr __ro_after_init; + +void asm_acpi_mp_play_dead(u64 reset_vector, u64 pgd_pa); + +static void crash_acpi_mp_play_dead(void) +{ + asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr, + acpi_mp_pgd); +} + +static void acpi_mp_play_dead(void) +{ + play_dead_common(); + asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr, + acpi_mp_pgd); +} + +static void acpi_mp_cpu_die(unsigned int cpu) +{ + u32 apicid =3D per_cpu(x86_cpu_to_apicid, cpu); + unsigned long timeout; + + /* + * Use TEST mailbox command to prove that BIOS got control over + * the CPU before declaring it dead. + * + * BIOS has to clear 'command' field of the mailbox. + */ + acpi_mp_wake_mailbox->apic_id =3D apicid; + smp_store_release(&acpi_mp_wake_mailbox->command, + ACPI_MP_WAKE_COMMAND_TEST); + + /* Don't wait longer than a second. 
*/ + timeout =3D USEC_PER_SEC; + while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--) + udelay(1); +} + +static void acpi_mp_stop_other_cpus(int wait) +{ + smp_shutdown_nonboot_cpus(smp_processor_id()); +} + +/* The argument is required to match type of x86_mapping_info::alloc_pgt_p= age */ +static void __init *alloc_pgt_page(void *dummy) +{ + return memblock_alloc(PAGE_SIZE, PAGE_SIZE); +} + +/* + * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at + * the same place as in the kernel page tables. asm_acpi_mp_play_dead() sw= itches + * to the identity mapping and the function has be present at the same spo= t in + * the virtual address space before and after switching page tables. + */ +static int __init init_transition_pgtable(pgd_t *pgd) +{ + pgprot_t prot =3D PAGE_KERNEL_EXEC_NOENC; + unsigned long vaddr, paddr; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + vaddr =3D (unsigned long)asm_acpi_mp_play_dead; + pgd +=3D pgd_index(vaddr); + if (!pgd_present(*pgd)) { + p4d =3D (p4d_t *)alloc_pgt_page(NULL); + if (!p4d) + return -ENOMEM; + set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE)); + } + p4d =3D p4d_offset(pgd, vaddr); + if (!p4d_present(*p4d)) { + pud =3D (pud_t *)alloc_pgt_page(NULL); + if (!pud) + return -ENOMEM; + set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE)); + } + pud =3D pud_offset(p4d, vaddr); + if (!pud_present(*pud)) { + pmd =3D (pmd_t *)alloc_pgt_page(NULL); + if (!pmd) + return -ENOMEM; + set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE)); + } + pmd =3D pmd_offset(pud, vaddr); + if (!pmd_present(*pmd)) { + pte =3D (pte_t *)alloc_pgt_page(NULL); + if (!pte) + return -ENOMEM; + set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE)); + } + pte =3D pte_offset_kernel(pmd, vaddr); + + paddr =3D __pa(vaddr); + set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot)); + + return 0; +} + +static void __init free_pte(pmd_t *pmd) +{ + pte_t *pte =3D pte_offset_kernel(pmd, 0); + + memblock_free(pte, PAGE_SIZE); +} + +static void __init 
free_pmd(pud_t *pud) +{ + pmd_t *pmd =3D pmd_offset(pud, 0); + int i; + + for (i =3D 0; i < PTRS_PER_PMD; i++) { + if (!pmd_present(pmd[i])) + continue; + + if (pmd_leaf(pmd[i])) + continue; + + free_pte(&pmd[i]); + } + + memblock_free(pmd, PAGE_SIZE); +} + +static void __init free_pud(p4d_t *p4d) +{ + pud_t *pud =3D pud_offset(p4d, 0); + int i; + + for (i =3D 0; i < PTRS_PER_PUD; i++) { + if (!pud_present(pud[i])) + continue; + + if (pud_leaf(pud[i])) + continue; + + free_pmd(&pud[i]); + } + + memblock_free(pud, PAGE_SIZE); +} + +static void __init free_p4d(pgd_t *pgd) +{ + p4d_t *p4d =3D p4d_offset(pgd, 0); + int i; + + for (i =3D 0; i < PTRS_PER_P4D; i++) { + if (!p4d_present(p4d[i])) + continue; + + free_pud(&p4d[i]); + } + + if (pgtable_l5_enabled()) + memblock_free(p4d, PAGE_SIZE); +} + +static void __init free_pgd(pgd_t *pgd) +{ + int i; + + for (i =3D 0; i < PTRS_PER_PGD; i++) { + if (!pgd_present(pgd[i])) + continue; + + free_p4d(&pgd[i]); + } + + memblock_free(pgd, PAGE_SIZE); +} + +static int __init acpi_mp_setup_reset(u64 reset_vector) +{ + pgd_t *pgd; + struct x86_mapping_info info =3D { + .alloc_pgt_page =3D alloc_pgt_page, + .page_flag =3D __PAGE_KERNEL_LARGE_EXEC, + .kernpg_flag =3D _KERNPG_TABLE_NOENC, + }; + + pgd =3D alloc_pgt_page(NULL); + if (!pgd) + return -ENOMEM; + + for (int i =3D 0; i < nr_pfn_mapped; i++) { + unsigned long mstart, mend; + + mstart =3D pfn_mapped[i].start << PAGE_SHIFT; + mend =3D pfn_mapped[i].end << PAGE_SHIFT; + if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) { + free_pgd(pgd); + return -ENOMEM; + } + } + + if (kernel_ident_mapping_init(&info, pgd, + PAGE_ALIGN_DOWN(reset_vector), + PAGE_ALIGN(reset_vector + 1))) { + free_pgd(pgd); + return -ENOMEM; + } + + if (init_transition_pgtable(pgd)) { + free_pgd(pgd); + return -ENOMEM; + } + + smp_ops.play_dead =3D acpi_mp_play_dead; + smp_ops.crash_play_dead =3D crash_acpi_mp_play_dead; + smp_ops.cpu_die =3D acpi_mp_cpu_die; + smp_ops.stop_other_cpus =3D 
acpi_mp_stop_other_cpus; + + acpi_mp_reset_vector_paddr =3D reset_vector; + acpi_mp_pgd =3D __pa(pgd); + + return 0; +} + static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip) { if (!acpi_mp_wake_mailbox_paddr) { @@ -68,37 +299,63 @@ static int acpi_wakeup_cpu(u32 apicid, unsigned long s= tart_ip) return 0; } =20 +static void acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *m= p_wake) +{ + cpu_hotplug_disable_offlining(); + + /* + * Zero out mailbox address in the ACPI MADT wakeup structure + * to indicate that the mailbox is not usable. This prevents + * the kexec()-ed kernel from reading a vaild mailbox, which in + * turn makes the kexec()-ed kernel only be able to use the boot + * CPU. + * + * This is Linux-specific protocol and not reflected in ACPI spec. + * + * acpi_mp_wake_mailbox_paddr already has the mailbox address. + * The acpi_wakeup_cpu() will use it to bring up secondary cpus for + * the current kernel. + */ + mp_wake->mailbox_address =3D 0; +} + int __init acpi_parse_mp_wake(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_multiproc_wakeup *mp_wake; =20 mp_wake =3D (struct acpi_madt_multiproc_wakeup *)header; - if (BAD_MADT_ENTRY(mp_wake, end)) + + /* + * Cannot use the standard BAD_MADT_ENTRY() to sanity check the @m= p_wake + * entry. 'sizeof (struct acpi_madt_multiproc_wakeup)' can be lar= ger + * than the actual size of the MP wakeup entry in ACPI table becau= se the + * 'reset_vector' is only available in the V1 MP wakeup structure. + */ + if (!mp_wake) + return -EINVAL; + if (end - (unsigned long)mp_wake < ACPI_MADT_MP_WAKEUP_SIZE_V0) + return -EINVAL; + if (mp_wake->header.length < ACPI_MADT_MP_WAKEUP_SIZE_V0) return -EINVAL; =20 acpi_table_print_madt_entry(&header->common); =20 acpi_mp_wake_mailbox_paddr =3D mp_wake->mailbox_address; =20 - cpu_hotplug_disable_offlining(); - - /* - * ACPI MADT doesn't allow to offline CPU after it got woke up. 
-	 * It limits kexec: the second kernel won't be able to use more than
-	 * one CPU.
-	 *
-	 * Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
-	 * The acpi_wakeup_cpu() will use it to bring up secondary cpus.
-	 *
-	 * Zero out mailbox address in the ACPI MADT wakeup structure to
-	 * indicate that the mailbox is not usable. This prevents the
-	 * kexec()-ed kernel from reading a vaild mailbox, which in turn
-	 * makes the kexec()-ed kernel only be able to use the boot CPU.
-	 *
-	 * This is Linux-specific protocol and not reflected in ACPI spec.
-	 */
-	mp_wake->mailbox_address = 0;
+	if (mp_wake->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1 &&
+	    mp_wake->header.length >= ACPI_MADT_MP_WAKEUP_SIZE_V1) {
+		if (acpi_mp_setup_reset(mp_wake->reset_vector)) {
+			pr_warn("Failed to setup MADT reset vector\n");
+			acpi_mp_disable_offlining(mp_wake);
+		}
+	} else {
+		/*
+		 * CPU offlining requires version 1 of the ACPI MADT wakeup
+		 * structure.
+		 */
+		acpi_mp_disable_offlining(mp_wake);
+	}
 
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index c81afffaa954..99e6ab552da0 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -878,10 +878,14 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
 	cpu_emergency_disable_virtualization();
 
 	atomic_dec(&waiting_for_crash_ipi);
-	/* Assume hlt works */
-	halt();
-	for (;;)
-		cpu_relax();
+
+	if (smp_ops.crash_play_dead) {
+		smp_ops.crash_play_dead();
+	} else {
+		halt();
+		for (;;)
+			cpu_relax();
+	}
 
 	return NMI_HANDLED;
 }
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index 23b4cfb640fc..8348bf46a648 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -1112,8 +1112,20 @@ struct acpi_madt_multiproc_wakeup {
 	u16 version;
 	u32 reserved;		/* reserved - must be zero */
 	u64 mailbox_address;
+	u64 reset_vector;
 };
 
+/* Values for Version field above */
+
+enum
acpi_madt_multiproc_wakeup_version {
+	ACPI_MADT_MP_WAKEUP_VERSION_NONE = 0,
+	ACPI_MADT_MP_WAKEUP_VERSION_V1 = 1,
+	ACPI_MADT_MP_WAKEUP_VERSION_RESERVED = 2, /* 2 and greater are reserved */
+};
+
+#define ACPI_MADT_MP_WAKEUP_SIZE_V0	16
+#define ACPI_MADT_MP_WAKEUP_SIZE_V1	24
+
 #define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE 2032
 #define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE 2048
 
@@ -1126,7 +1138,8 @@ struct acpi_madt_multiproc_wakeup_mailbox {
 	u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE];	/* reserved for firmware use */
 };
 
-#define ACPI_MP_WAKE_COMMAND_WAKEUP 1
+#define ACPI_MP_WAKE_COMMAND_WAKEUP	1
+#define ACPI_MP_WAKE_COMMAND_TEST	2
 
 /* 17: CPU Core Interrupt Controller (ACPI 6.5) */
 
-- 
2.41.0