Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  The TDX specs are available in [1].

This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately [2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed [3].  KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.

This series doesn't aim to support all functionalities (e.g. exposing the
TDX module via /sysfs), and doesn't aim to resolve everything perfectly.
In particular, the implementation of how to choose "TDX-usable" memory
and of memory hotplug handling is simple: this series just makes sure
all pages in the page allocator are TDX memory.

A better solution, suggested by Kirill, is similar to the per-node
memory encryption flag series [4].  Similarly, a per-node TDX flag can
be added so that "TDX-capable" and "non-TDX-capable" nodes can co-exist.
By exposing the TDX flag to userspace via /sysfs, userspace can then
use NUMA APIs to bind TDX guests to the "TDX-capable" nodes.

For more information please refer to the "Kernel policy on TDX memory"
and "Memory hotplug" sections below.  Huang, Ying is working on this
"per-node TDX flag" support and will post another series independently.

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

Also, other optimizations will be posted as follow-ups once this initial
TDX support is upstreamed.

Hi Dave, Dan, Kirill, Ying (and Intel reviewers),

Please kindly help to review, and I would appreciate Reviewed-by or
Acked-by tags if the patches look good to you.

This series has been reviewed by Isaku, who is developing the KVM TDX
patches.  Kirill has also reviewed a couple of patches.

I would also highly appreciate it if anyone else could help to review
this series.

----- Changelog history: ------

- v6 -> v7:

 - Added memory hotplug support.
 - Changed how to choose the list of "TDX-usable" memory regions from
   kernel boot time to TDX module initialization time.
 - Addressed comments received in previous versions (Andi/Dave).
 - Improved the commit message and the comments of the kexec() support
   patch, and made the patch handle returning PAMTs back to the kernel
   when TDX module initialization fails.  Please also see the "kexec()"
   section below.
 - Changed the documentation patch accordingly.
 - For all others please see individual patch changelog history.

- v5 -> v6:

 - Removed ACPI CPU/memory hotplug patches (Intel internal discussion).
 - Removed the patch to disable driver-managed memory hotplug (Intel
   internal discussion).
 - Added one patch to introduce an enum type for TDX supported page size
   levels to replace the hard-coded values in TDX guest code (Dave).
 - Added one patch to make TDX depend on X2APIC being enabled (Dave).
 - Added one patch to build all boot-time present memory regions as TDX
   memory during kernel boot.
 - Added Reviewed-by tags from others to some patches.
 - For all others please see individual patch changelog history.

- v4 -> v5:

 This is essentially a resend of v4.  Sorry I forgot to consult
 get_maintainer.pl when sending out v4, so I forgot to add the linux-acpi
 and linux-mm mailing lists and the relevant people for 4 new patches.

 There are also very minor code and commit message updates from v4:

 - Rebased to latest tip/x86/tdx.
 - Fixed a checkpatch issue that I missed in v4.
 - Removed an obsolete comment that I missed in patch 6.
 - Very minor update to the commit message of patch 12.

 For other changes to individual patches since v3, please refer to the
 changelog history of individual patches (I just used v3 -> v5 since
 there's basically no code change in v4).

- v3 -> v4 (addressed Dave's comments, and other comments from others):

 - Simplified SEAMRR and TDX KeyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver-managed memory
   hotplug.
 - Removed tdx_detect(), and only use a single tdx_init().
 - Removed detecting the TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the patch that added a boot-time command line option to
   disable TDX.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

- v2 -> v3:

 - Addressed comments from Isaku.
 - Fixed a memory leak and an unnecessary function argument in the patch
   to configure the key for the global KeyID (patch 17).
 - Enhanced the patch to get TDX module and CMR information a little bit
   (patch 09).
 - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
   - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add the TDX host kernel support
   material to Documentation/x86/tdx.rst together with the TDX guest
   stuff, instead of a standalone file (patch 21).
 - Very minor improvements in commit messages.

- RFC (v1) -> v2:
 - Rebased to Kirill's latest TDX guest code.
 - Fixed two issues that are related to finding all RAM memory regions
   based on e820.
 - Minor improvements to comments and commit messages.

v6:
https://lore.kernel.org/linux-mm/cover.1666824663.git.kai.huang@intel.com/T/

v5:
https://lore.kernel.org/lkml/cover.1655894131.git.kai.huang@intel.com/T/

v3:
https://lore.kernel.org/lkml/68484e168226037c3a25b6fb983b052b26ab3ec1.camel@intel.com/T/

v2:
https://lore.kernel.org/lkml/cover.1647167475.git.kai.huang@intel.com/T/

RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in the new
isolated range as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyIDs as TDX private KeyIDs, which are only accessible within the SEAM
mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to that of the TDX module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized.  This series assumes the TDX module is loaded
by the BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter 13, "Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function, tdx_enable(), to allow the caller to
initialize TDX at runtime:

        if (tdx_enable())
                goto no_tdx;
        // TDX is ready to create TD guests.

This approach has the following pros:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is only consumed
when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
kernel.

3) It is more flexible for supporting "TDX module runtime update" (not
in this series).  After updating to a new module at runtime, the kernel
needs to go through the initialization process again.

2. CPU hotplug

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS
should never support hotpluggable CPU devices and/or deliver ACPI CPU
hotplug events to the kernel.  This series doesn't handle physical
(ACPI) CPU hotplug at all, but depends on the BIOS to behave correctly.

Note TDX works with logical CPU online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as
usable for TDX private memory.

The initial support of TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM in the core-mm at the time of initializing the TDX module as
TDX memory, to guarantee that all pages in the page allocator are TDX
pages.

4. Memory hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed during the
module's runtime.  No more "TDX-usable" memory can be added to the TDX
module after that.

To achieve the above guarantee that all pages in the page allocator are
TDX pages, this series simply chooses to reject any non-TDX-usable
memory in memory hotplug.

This _will_ be enhanced in the future after the first submission.  The
direction we are heading in is to allow adding/onlining non-TDX memory
to separate NUMA nodes so that both "TDX-capable" nodes and
"non-TDX-capable" nodes can co-exist.  The TDX flag can be exposed to
userspace via /sysfs so userspace can bind TDX guests to "TDX-capable"
nodes via NUMA ABIs.

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal, but depends on the BIOS to behave correctly.

5. Kexec()

There are two problems with using kexec() to boot a new kernel when the
old kernel has enabled TDX: 1) part of the memory pages are still TDX
private pages (i.e. metadata used by the TDX module, and any TDX guest
memory if kexec() happens while a TDX guest is alive); 2) there might be
dirty cachelines associated with TDX private pages.

Just like SME, TDX hosts require special cache flushing before kexec().
Similar to the SME handling, the kernel uses wbinvd() to flush the cache
in stop_this_cpu() when TDX is enabled.

This series doesn't convert all TDX private pages back to normal, due to
the following considerations:

1) The kernel doesn't have existing infrastructure to track which pages
   are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
   them (cache flush + using MOVDIR64B to clear the page) in kexec() can
   be time consuming.
3) The new kernel will almost exclusively use KeyID 0 to access memory.
   KeyID 0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd-party
   kernel ever supports MKTME, it should use MOVDIR64B to clear the page
   with the new MKTME KeyID (just like TDX does) before using it.

Also, if the old kernel has enabled TDX, the new kernel cannot use TDX
again.  When the new kernel goes through the TDX module initialization
process, it will fail immediately at the first step.

Ideally, it would be better to shut down the TDX module in kexec(), but
there's no guarantee that CPUs are in VMX operation during kexec(), so
just leave the module open.

== Reference ==

[1]: TDX specs
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

[2]: KVM TDX basic feature support
https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/

[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/lkml/20220915142913.2213336-1-chao.p.peng@linux.intel.com/T/

[4]: per-node memory encryption flag
https://lore.kernel.org/linux-mm/20221007155323.ue4cdthkilfy4lbd@box.shutemov.name/t/


Kai Huang (20):
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Detect TDX during kernel boot
  x86/virt/tdx: Disable TDX if X2APIC is not enabled
  x86/virt/tdx: Add skeleton to initialize TDX on demand
  x86/virt/tdx: Implement functions to make SEAMCALL
  x86/virt/tdx: Shut down TDX module in case of error
  x86/virt/tdx: Do TDX module global initialization
  x86/virt/tdx: Do logical-cpu scope TDX module initialization
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Create TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Set up reserved areas for all TDMRs
  x86/virt/tdx: Reserve TDX module global KeyID
  x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  Documentation/x86: Add documentation for TDX host support

 Documentation/x86/tdx.rst        |  181 +++-
 arch/x86/Kconfig                 |   15 +
 arch/x86/Makefile                |    2 +
 arch/x86/coco/tdx/tdx.c          |    6 +-
 arch/x86/include/asm/tdx.h       |   30 +
 arch/x86/kernel/process.c        |    8 +-
 arch/x86/mm/init_64.c            |   10 +
 arch/x86/virt/Makefile           |    2 +
 arch/x86/virt/vmx/Makefile       |    2 +
 arch/x86/virt/vmx/tdx/Makefile   |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S |   52 ++
 arch/x86/virt/vmx/tdx/tdx.c      | 1422 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      |  118 +++
 arch/x86/virt/vmx/tdx/tdxcall.S  |   19 +-
 14 files changed, 1852 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h


base-commit: 00e07cfbdf0b232f7553f0175f8f4e8d792f7e90
-- 
2.38.1
...
space from the MKTME architecture for crypto-protection to VMs.  The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

TDX doesn't trust the BIOS.  During machine boot, TDX verifies that the
TDX private KeyIDs are consistently and correctly programmed by the BIOS
across all CPU packages before it enables TDX on any CPU core.  A valid
TDX private KeyID range on the BSP indicates TDX has been enabled by the
BIOS; otherwise the BIOS is buggy.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests.  The TDX module will be initialized at
runtime by the user (i.e. KVM) on demand.

Add a new early_initcall(tdx_init) to do TDX early boot initialization.
Only detect TDX private KeyIDs for now; some other early checks will
follow up.  Also add a new function to report whether TDX has been
enabled by the BIOS (i.e. the TDX private KeyID range is valid).
Kexec() will also need it to determine whether it needs to flush dirty
cachelines that are associated with any TDX private KeyIDs before
booting to the new kernel.

To start supporting TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a new Kconfig option, CONFIG_INTEL_TDX_HOST,
to opt in to TDX host kernel support (to distinguish it from TDX guest
kernel support).  So far KVM is the only user of TDX, so make the new
config option depend on KVM_INTEL.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
 - No change.

v5 -> v6:
 - Removed SEAMRR detection to make code simpler.
 - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
 - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).

---
 arch/x86/Kconfig               | 12 +++++
 arch/x86/Makefile              |  2 +
 arch/x86/include/asm/tdx.h     |  7 +++
 arch/x86/virt/Makefile         |  2 +
 arch/x86/virt/vmx/Makefile     |  2 +
 arch/x86/virt/vmx/tdx/Makefile |  2 +
 arch/x86/virt/vmx/tdx/tdx.c    | 95 ++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h    | 15 ++++++
 8 files changed, 137 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -XXX,XX +XXX,XX @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	depends on KVM_INTEL
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in host kernel to run protected VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -XXX,XX +XXX,XX @@ archheaders:
 
 libs-y += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI) += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else	/* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -XXX,XX +XXX,XX @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -XXX,XX +XXX,XX @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -XXX,XX +XXX,XX @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trusted Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/printk.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+#include "tdx.h"
+
+static u32 tdx_keyid_start __ro_after_init;
+static u32 tdx_keyid_num __ro_after_init;
+
+/*
+ * Detect TDX private KeyIDs to see whether TDX has been enabled by the
+ * BIOS.  Both initializing the TDX module and running TDX guest require
+ * TDX private KeyID.
+ *
+ * TDX doesn't trust BIOS.  TDX verifies all configurations from BIOS
+ * are correct before enabling TDX on any core.  TDX requires the BIOS
+ * to correctly and consistently program TDX private KeyIDs on all CPU
178 | + * packages. Unless there is a BIOS bug, detecting a valid TDX private | ||
179 | + * KeyID range on BSP indicates TDX has been enabled by the BIOS. If | ||
180 | + * there's such BIOS bug, it will be caught later when initializing the | ||
181 | + * TDX module. | ||
182 | + */ | ||
183 | +static int __init detect_tdx(void) | ||
184 | +{ | 123 | +{ |
124 | + u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids; | ||
185 | + int ret; | 125 | + int ret; |
186 | + | 126 | + |
187 | + /* | 127 | + /* |
188 | + * IA32_MKTME_KEYID_PARTITIONING: | 128 | + * IA32_MKTME_KEYID_PARTITIONING: |
189 | + * Bit [31:0]: Number of MKTME KeyIDs. | 129 | + * Bit [31:0]: Number of MKTME KeyIDs. |
190 | + * Bit [63:32]: Number of TDX private KeyIDs. | 130 | + * Bit [63:32]: Number of TDX private KeyIDs. |
191 | + */ | 131 | + */ |
192 | + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &tdx_keyid_start, | 132 | + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids, |
193 | + &tdx_keyid_num); | 133 | + &_nr_tdx_keyids); |
194 | + if (ret) | 134 | + if (ret) |
195 | + return -ENODEV; | 135 | + return -ENODEV; |
196 | + | 136 | + |
197 | + if (!tdx_keyid_num) | 137 | + if (!_nr_tdx_keyids) |
198 | + return -ENODEV; | 138 | + return -ENODEV; |
199 | + | 139 | + |
200 | + /* | 140 | + /* TDX KeyIDs start after the last MKTME KeyID. */ |
201 | + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private | 141 | + _tdx_keyid_start = _nr_mktme_keyids + 1; |
202 | + * KeyIDs start after the last MKTME KeyID. | ||
203 | + */ | ||
204 | + tdx_keyid_start++; | ||
205 | + | 142 | + |
206 | + pr_info("TDX enabled by BIOS. TDX private KeyID range: [%u, %u)\n", | 143 | + *tdx_keyid_start = _tdx_keyid_start; |
207 | + tdx_keyid_start, tdx_keyid_start + tdx_keyid_num); | 144 | + *nr_tdx_keyids = _nr_tdx_keyids; |
208 | + | 145 | + |
209 | + return 0; | 146 | + return 0; |
210 | +} | 147 | +} |
211 | + | 148 | + |
212 | +static void __init clear_tdx(void) | ||
213 | +{ | ||
214 | + tdx_keyid_start = tdx_keyid_num = 0; | ||
215 | +} | ||
216 | + | ||
217 | +static int __init tdx_init(void) | 149 | +static int __init tdx_init(void) |
218 | +{ | 150 | +{ |
219 | + if (detect_tdx()) | 151 | + u32 tdx_keyid_start, nr_tdx_keyids; |
220 | + return -ENODEV; | 152 | + int err; |
153 | + | ||
154 | + err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids); | ||
155 | + if (err) | ||
156 | + return err; | ||
157 | + | ||
158 | + pr_info("BIOS enabled: private KeyID range [%u, %u)\n", | ||
159 | + tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids); | ||
221 | + | 160 | + |
222 | + /* | 161 | + /* |
223 | + * Initializing the TDX module requires one TDX private KeyID. | 162 | + * The TDX module itself requires one 'global KeyID' to protect |
224 | + * If there's only one TDX KeyID then after module initialization | 163 | + * its metadata. If there's only one TDX KeyID, there won't be |
225 | + * KVM won't be able to run any TDX guest, which makes the whole | 164 | + * any left for TDX guests thus there's no point to enable TDX |
226 | + * thing worthless. Just disable TDX in this case. | 165 | + * at all. |
227 | + */ | 166 | + */ |
228 | + if (tdx_keyid_num < 2) { | 167 | + if (nr_tdx_keyids < 2) { |
229 | + pr_info("Disable TDX as there's only one TDX private KeyID available.\n"); | 168 | + pr_err("initialization failed: too few private KeyIDs available.\n"); |
230 | + goto no_tdx; | 169 | + return -ENODEV; |
231 | + } | 170 | + } |
232 | + | 171 | + |
172 | + /* | ||
173 | + * Just use the first TDX KeyID as the 'global KeyID' and | ||
174 | + * leave the rest for TDX guests. | ||
175 | + */ | ||
176 | + tdx_global_keyid = tdx_keyid_start; | ||
177 | + tdx_guest_keyid_start = tdx_keyid_start + 1; | ||
178 | + tdx_nr_guest_keyids = nr_tdx_keyids - 1; | ||
179 | + | ||
233 | + return 0; | 180 | + return 0; |
234 | +no_tdx: | ||
235 | + clear_tdx(); | ||
236 | + return -ENODEV; | ||
237 | +} | 181 | +} |
238 | +early_initcall(tdx_init); | 182 | +early_initcall(tdx_init); |
239 | + | 183 | + |
240 | +/* Return whether the BIOS has enabled TDX */ | 184 | +/* Return whether the BIOS has enabled TDX */ |
241 | +bool platform_tdx_enabled(void) | 185 | +bool platform_tdx_enabled(void) |
242 | +{ | 186 | +{ |
243 | + return !!tdx_keyid_num; | 187 | + return !!tdx_global_keyid; |
244 | +} | 188 | +} |
245 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
246 | new file mode 100644 | ||
247 | index XXXXXXX..XXXXXXX | ||
248 | --- /dev/null | ||
249 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
250 | @@ -XXX,XX +XXX,XX @@ | ||
251 | +/* SPDX-License-Identifier: GPL-2.0 */ | ||
252 | +#ifndef _X86_VIRT_TDX_H | ||
253 | +#define _X86_VIRT_TDX_H | ||
254 | + | ||
255 | +/* | ||
256 | + * This file contains both macros and data structures defined by the TDX | ||
257 | + * architecture and Linux defined software data structures and functions. | ||
258 | + * The two should not be mixed together for better readability. The | ||
259 | + * architectural definitions come first. | ||
260 | + */ | ||
261 | + | ||
262 | +/* MSR to report KeyID partitioning between MKTME and TDX */ | ||
263 | +#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 | ||
264 | + | ||
265 | +#endif | ||
266 | -- | 189 | -- |
267 | 2.38.1 | 190 | 2.41.0 |
... | ... | ||
---|---|---|---|
4 | page. However currently try_accept_one() uses hard-coded magic values. | 4 | page. However currently try_accept_one() uses hard-coded magic values. |
5 | 5 | ||
6 | Define TDX supported page sizes as macros and get rid of the hard-coded | 6 | Define TDX supported page sizes as macros and get rid of the hard-coded |
7 | values in try_accept_one(). TDX host support will need to use them too. | 7 | values in try_accept_one(). TDX host support will need to use them too. |
8 | 8 | ||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
9 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | 10 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
10 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 11 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> |
12 | Reviewed-by: David Hildenbrand <david@redhat.com> | ||
11 | --- | 13 | --- |
14 | arch/x86/coco/tdx/tdx-shared.c | 6 +++--- | ||
15 | arch/x86/include/asm/shared/tdx.h | 5 +++++ | ||
16 | 2 files changed, 8 insertions(+), 3 deletions(-) | ||
12 | 17 | ||
13 | v6 -> v7: | 18 | diff --git a/arch/x86/coco/tdx/tdx-shared.c b/arch/x86/coco/tdx/tdx-shared.c |
14 | |||
15 | - Removed the helper to convert kernel page level to TDX page level. | ||
16 | - Changed to use macro to define TDX supported page sizes. | ||
17 | |||
18 | --- | ||
19 | arch/x86/coco/tdx/tdx.c | 6 +++--- | ||
20 | arch/x86/include/asm/tdx.h | 9 +++++++++ | ||
21 | 2 files changed, 12 insertions(+), 3 deletions(-) | ||
22 | |||
23 | diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c | ||
24 | index XXXXXXX..XXXXXXX 100644 | 19 | index XXXXXXX..XXXXXXX 100644 |
25 | --- a/arch/x86/coco/tdx/tdx.c | 20 | --- a/arch/x86/coco/tdx/tdx-shared.c |
26 | +++ b/arch/x86/coco/tdx/tdx.c | 21 | +++ b/arch/x86/coco/tdx/tdx-shared.c |
27 | @@ -XXX,XX +XXX,XX @@ static bool try_accept_one(phys_addr_t *start, unsigned long len, | 22 | @@ -XXX,XX +XXX,XX @@ static unsigned long try_accept_one(phys_addr_t start, unsigned long len, |
28 | */ | 23 | */ |
29 | switch (pg_level) { | 24 | switch (pg_level) { |
30 | case PG_LEVEL_4K: | 25 | case PG_LEVEL_4K: |
31 | - page_size = 0; | 26 | - page_size = 0; |
32 | + page_size = TDX_PS_4K; | 27 | + page_size = TDX_PS_4K; |
... | ... | ||
38 | case PG_LEVEL_1G: | 33 | case PG_LEVEL_1G: |
39 | - page_size = 2; | 34 | - page_size = 2; |
40 | + page_size = TDX_PS_1G; | 35 | + page_size = TDX_PS_1G; |
41 | break; | 36 | break; |
42 | default: | 37 | default: |
43 | return false; | 38 | return 0; |
44 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 39 | diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h |
45 | index XXXXXXX..XXXXXXX 100644 | 40 | index XXXXXXX..XXXXXXX 100644 |
46 | --- a/arch/x86/include/asm/tdx.h | 41 | --- a/arch/x86/include/asm/shared/tdx.h |
47 | +++ b/arch/x86/include/asm/tdx.h | 42 | +++ b/arch/x86/include/asm/shared/tdx.h |
48 | @@ -XXX,XX +XXX,XX @@ | 43 | @@ -XXX,XX +XXX,XX @@ |
49 | 44 | (TDX_RDX | TDX_RBX | TDX_RSI | TDX_RDI | TDX_R8 | TDX_R9 | \ | |
50 | #ifndef __ASSEMBLY__ | 45 | TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15) |
51 | 46 | ||
52 | +/* | 47 | +/* TDX supported page sizes from the TDX module ABI. */ |
53 | + * TDX supported page sizes (4K/2M/1G). | ||
54 | + * | ||
55 | + * Those values are part of the TDX module ABI. Do not change them. | ||
56 | + */ | ||
57 | +#define TDX_PS_4K 0 | 48 | +#define TDX_PS_4K 0 |
58 | +#define TDX_PS_2M 1 | 49 | +#define TDX_PS_2M 1 |
59 | +#define TDX_PS_1G 2 | 50 | +#define TDX_PS_1G 2 |
60 | + | 51 | + |
61 | /* | 52 | #ifndef __ASSEMBLY__ |
62 | * Used to gather the output registers values of the TDCALL and SEAMCALL | 53 | |
63 | * instructions when requesting services from the TDX module. | 54 | #include <linux/compiler_attributes.h> |
64 | -- | 55 | -- |
65 | 2.38.1 | 56 | 2.41.0 |
1 | The MMIO/xAPIC interface has some problems, most notably the APIC LEAK | 1 | TDX capable platforms are locked to X2APIC mode and cannot fall back to |
---|---|---|---|
2 | [1]. This bug allows an attacker to use the APIC MMIO interface to | 2 | the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support |
3 | extract data from the SGX enclave. | 3 | requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC. |
4 | 4 | ||
5 | TDX is not immune from this either. Early check X2APIC and disable TDX | ||
6 | if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC. | ||
7 | |||
8 | [1]: https://aepicleak.com/aepicleak.pdf | ||
9 | |||
10 | Link: https://lore.kernel.org/lkml/d6ffb489-7024-ff74-bd2f-d1e06573bb82@intel.com/ | ||
11 | Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ | 5 | Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ |
12 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 6 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
7 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> | ||
8 | Reviewed-by: David Hildenbrand <david@redhat.com> | ||
9 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
13 | --- | 10 | --- |
14 | 11 | arch/x86/Kconfig | 1 + | |
15 | v6 -> v7: | 12 | 1 file changed, 1 insertion(+) |
16 | - Changed to use "Link" for the two lore links to get rid of checkpatch | ||
17 | warning. | ||
18 | |||
19 | --- | ||
20 | arch/x86/Kconfig | 1 + | ||
21 | arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++ | ||
22 | 2 files changed, 12 insertions(+) | ||
23 | 13 | ||
24 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig | 14 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig |
25 | index XXXXXXX..XXXXXXX 100644 | 15 | index XXXXXXX..XXXXXXX 100644 |
26 | --- a/arch/x86/Kconfig | 16 | --- a/arch/x86/Kconfig |
27 | +++ b/arch/x86/Kconfig | 17 | +++ b/arch/x86/Kconfig |
... | ... | ||
31 | depends on KVM_INTEL | 21 | depends on KVM_INTEL |
32 | + depends on X86_X2APIC | 22 | + depends on X86_X2APIC |
33 | help | 23 | help |
34 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious | 24 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious |
35 | host and certain physical attacks. This option enables necessary TDX | 25 | host and certain physical attacks. This option enables necessary TDX |
36 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | ||
37 | index XXXXXXX..XXXXXXX 100644 | ||
38 | --- a/arch/x86/virt/vmx/tdx/tdx.c | ||
39 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | ||
40 | @@ -XXX,XX +XXX,XX @@ | ||
41 | #include <linux/printk.h> | ||
42 | #include <asm/msr-index.h> | ||
43 | #include <asm/msr.h> | ||
44 | +#include <asm/apic.h> | ||
45 | #include <asm/tdx.h> | ||
46 | #include "tdx.h" | ||
47 | |||
48 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | ||
49 | goto no_tdx; | ||
50 | } | ||
51 | |||
52 | + /* | ||
53 | + * TDX requires X2APIC being enabled to prevent potential data | ||
54 | + * leak via APIC MMIO registers. Just disable TDX if not using | ||
55 | + * X2APIC. | ||
56 | + */ | ||
57 | + if (!x2apic_enabled()) { | ||
58 | + pr_info("Disable TDX as X2APIC is not enabled.\n"); | ||
59 | + goto no_tdx; | ||
60 | + } | ||
61 | + | ||
62 | return 0; | ||
63 | no_tdx: | ||
64 | clear_tdx(); | ||
65 | -- | 26 | -- |
66 | 2.38.1 | 27 | 2.41.0 |
New patch | |||
---|---|---|---|
1 | TDX memory has integrity and confidentiality protections. Violations of | ||
2 | this integrity protection are supposed to only affect TDX operations and | ||
3 | are never supposed to affect the host kernel itself. In other words, | ||
4 | the host kernel should never, itself, see machine checks induced by the | ||
5 | TDX integrity hardware. | ||
1 | 6 | ||
7 | Alas, the first few generations of TDX hardware have an erratum. A | ||
8 | partial write to a TDX private memory cacheline will silently "poison" | ||
9 | the line. Subsequent reads will consume the poison and generate a | ||
10 | machine check. According to the TDX hardware spec, neither of these | ||
11 | things should have happened. | ||
12 | |||
13 | Virtually all kernel memory access operations happen in full |
14 | cachelines. In practice, writing a "byte" of memory usually reads a 64 | ||
15 | byte cacheline of memory, modifies it, then writes the whole line back. | ||
16 | Those operations do not trigger this problem. | ||
17 | |||
18 | This problem is triggered by "partial" writes where a write transaction | ||
19 | of less than a cacheline lands at the memory controller. The CPU does |
20 | these via non-temporal write instructions (like MOVNTI), or through | ||
21 | UC/WC memory mappings. The issue can also be triggered away from the | ||
22 | CPU by devices doing partial writes via DMA. | ||
23 | |||
24 | With this erratum, there are additional things that need to be done. To |
25 | prepare for those changes, add a CPU bug bit to indicate this erratum. | ||
26 | Note this bug reflects the hardware, thus it is detected regardless of |
27 | whether the kernel is built with TDX support or not. | ||
28 | |||
29 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
30 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
31 | Reviewed-by: David Hildenbrand <david@redhat.com> | ||
32 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> | ||
33 | --- | ||
34 | |||
35 | v13 -> v14: | ||
36 | - Use "To prepare for ___, add ___" in changelog (Dave) | ||
37 | - Added Dave's tag. | ||
38 | |||
39 | v12 -> v13: | ||
40 | - Added David's tag. | ||
41 | |||
42 | v11 -> v12: | ||
43 | - Added Kirill's tag | ||
44 | - Changed to detect the erratum in early_init_intel() (Kirill) | ||
45 | |||
46 | v10 -> v11: | ||
47 | - New patch | ||
48 | |||
49 | --- | ||
50 | arch/x86/include/asm/cpufeatures.h | 1 + | ||
51 | arch/x86/kernel/cpu/intel.c | 17 +++++++++++++++++ | ||
52 | 2 files changed, 18 insertions(+) | ||
53 | |||
54 | diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h | ||
55 | index XXXXXXX..XXXXXXX 100644 | ||
56 | --- a/arch/x86/include/asm/cpufeatures.h | ||
57 | +++ b/arch/x86/include/asm/cpufeatures.h | ||
58 | @@ -XXX,XX +XXX,XX @@ | ||
59 | #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ | ||
60 | #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */ | ||
61 | #define X86_BUG_GDS X86_BUG(30) /* CPU is affected by Gather Data Sampling */ | ||
62 | +#define X86_BUG_TDX_PW_MCE X86_BUG(31) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */ | ||
63 | |||
64 | /* BUG word 2 */ | ||
65 | #define X86_BUG_SRSO X86_BUG(1*32 + 0) /* AMD SRSO bug */ | ||
66 | diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c | ||
67 | index XXXXXXX..XXXXXXX 100644 | ||
68 | --- a/arch/x86/kernel/cpu/intel.c | ||
69 | +++ b/arch/x86/kernel/cpu/intel.c | ||
70 | @@ -XXX,XX +XXX,XX @@ static bool bad_spectre_microcode(struct cpuinfo_x86 *c) | ||
71 | return false; | ||
72 | } | ||
73 | |||
74 | +static void check_tdx_erratum(struct cpuinfo_x86 *c) | ||
75 | +{ | ||
76 | + /* | ||
77 | + * These CPUs have an erratum. A partial write from non-TD | ||
78 | + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX | ||
79 | + * private memory poisons that memory, and a subsequent read of | ||
80 | + * that memory triggers #MC. | ||
81 | + */ | ||
82 | + switch (c->x86_model) { | ||
83 | + case INTEL_FAM6_SAPPHIRERAPIDS_X: | ||
84 | + case INTEL_FAM6_EMERALDRAPIDS_X: | ||
85 | + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE); | ||
86 | + } | ||
87 | +} | ||
88 | + | ||
89 | static void early_init_intel(struct cpuinfo_x86 *c) | ||
90 | { | ||
91 | u64 misc_enable; | ||
92 | @@ -XXX,XX +XXX,XX @@ static void early_init_intel(struct cpuinfo_x86 *c) | ||
93 | */ | ||
94 | if (detect_extended_topology_early(c) < 0) | ||
95 | detect_ht_early(c); | ||
96 | + | ||
97 | + check_tdx_erratum(c); | ||
98 | } | ||
99 | |||
100 | static void bsp_init_intel(struct cpuinfo_x86 *c) | ||
101 | -- | ||
102 | 2.41.0 |
New patch | |||
---|---|---|---|
1 | Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons | ||
2 | as RDRAND. Use the kernel RDRAND retry logic for them. | ||
1 | 3 | ||
4 | There are three __seamcall*() variants. Do the SEAMCALL retry in common | ||
5 | code and add a wrapper for each of them. | ||
6 | |||
7 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
8 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
9 | --- | ||
10 | |||
11 | v13 -> v14: | ||
12 | - Use real function sc_retry() instead of using macros. (Dave) | ||
13 | - Added Kirill's tag. | ||
14 | |||
15 | v12 -> v13: | ||
16 | - New implementation due to TDCALL assembly series. | ||
17 | --- | ||
18 | arch/x86/include/asm/tdx.h | 26 ++++++++++++++++++++++++++ | ||
19 | 1 file changed, 26 insertions(+) | ||
20 | |||
21 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | ||
22 | index XXXXXXX..XXXXXXX 100644 | ||
23 | --- a/arch/x86/include/asm/tdx.h | ||
24 | +++ b/arch/x86/include/asm/tdx.h | ||
25 | @@ -XXX,XX +XXX,XX @@ | ||
26 | #define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP) | ||
27 | #define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD) | ||
28 | |||
29 | +/* | ||
30 | + * TDX module SEAMCALL leaf function error codes | ||
31 | + */ | ||
32 | +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL | ||
33 | + | ||
34 | #ifndef __ASSEMBLY__ | ||
35 | |||
36 | /* | ||
37 | @@ -XXX,XX +XXX,XX @@ u64 __seamcall(u64 fn, struct tdx_module_args *args); | ||
38 | u64 __seamcall_ret(u64 fn, struct tdx_module_args *args); | ||
39 | u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args); | ||
40 | |||
41 | +#include <asm/archrandom.h> | ||
42 | + | ||
43 | +typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args); | ||
44 | + | ||
45 | +static inline u64 sc_retry(sc_func_t func, u64 fn, | ||
46 | + struct tdx_module_args *args) | ||
47 | +{ | ||
48 | + int retry = RDRAND_RETRY_LOOPS; | ||
49 | + u64 ret; | ||
50 | + | ||
51 | + do { | ||
52 | + ret = func(fn, args); | ||
53 | + } while (ret == TDX_RND_NO_ENTROPY && --retry); | ||
54 | + | ||
55 | + return ret; | ||
56 | +} | ||
57 | + | ||
58 | +#define seamcall(_fn, _args) sc_retry(__seamcall, (_fn), (_args)) | ||
59 | +#define seamcall_ret(_fn, _args) sc_retry(__seamcall_ret, (_fn), (_args)) | ||
60 | +#define seamcall_saved_ret(_fn, _args) sc_retry(__seamcall_saved_ret, (_fn), (_args)) | ||
61 | + | ||
62 | bool platform_tdx_enabled(void); | ||
63 | #else | ||
64 | static inline bool platform_tdx_enabled(void) { return false; } | ||
65 | -- | ||
66 | 2.41.0 |
New patch | |||
---|---|---|---|
1 | The SEAMCALLs involved during the TDX module initialization are not | ||
2 | expected to fail. In fact, they are not expected to return any non-zero | ||
3 | code (except the "running out of entropy" error, which can be handled |
4 | internally already). | ||
1 | 5 | ||
6 | Add yet another set of SEAMCALL wrappers, which treats any non-zero |
7 | return code as an error, to support printing SEAMCALL errors upon failure |
8 | for module initialization. Note the TDX module initialization doesn't | ||
9 | use the _saved_ret() variant, thus no wrapper is added for it. |
10 | |||
11 | SEAMCALL assembly can also return kernel-defined error codes for three | ||
12 | special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't | ||
13 | loaded; 3) CPU isn't in VMX operation. Whether they can legally happen | ||
14 | depends on the caller, so leave to the caller to print error message | ||
15 | when desired. | ||
16 | |||
17 | Also convert the SEAMCALL error codes to the kernel error codes in the | ||
18 | new wrappers so that each SEAMCALL caller doesn't have to repeat the | ||
19 | conversion. | ||
20 | |||
21 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
22 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
23 | --- | ||
24 | |||
25 | v13 -> v14: | ||
26 | - Use real functions to replace macros. (Dave) | ||
27 | - Moved printing error message for special error code to the caller | ||
28 | (internal) | ||
29 | - Added Kirill's tag | ||
30 | |||
31 | v12 -> v13: | ||
32 | - New implementation due to TDCALL assembly series. | ||
33 | |||
34 | --- | ||
35 | arch/x86/include/asm/tdx.h | 1 + | ||
36 | arch/x86/virt/vmx/tdx/tdx.c | 52 +++++++++++++++++++++++++++++++++++++ | ||
37 | 2 files changed, 53 insertions(+) | ||
38 | |||
39 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | ||
40 | index XXXXXXX..XXXXXXX 100644 | ||
41 | --- a/arch/x86/include/asm/tdx.h | ||
42 | +++ b/arch/x86/include/asm/tdx.h | ||
43 | @@ -XXX,XX +XXX,XX @@ | ||
44 | /* | ||
45 | * TDX module SEAMCALL leaf function error codes | ||
46 | */ | ||
47 | +#define TDX_SUCCESS 0ULL | ||
48 | #define TDX_RND_NO_ENTROPY 0x8000020300000000ULL | ||
49 | |||
50 | #ifndef __ASSEMBLY__ | ||
51 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | ||
52 | index XXXXXXX..XXXXXXX 100644 | ||
53 | --- a/arch/x86/virt/vmx/tdx/tdx.c | ||
54 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | ||
55 | @@ -XXX,XX +XXX,XX @@ static u32 tdx_global_keyid __ro_after_init; | ||
56 | static u32 tdx_guest_keyid_start __ro_after_init; | ||
57 | static u32 tdx_nr_guest_keyids __ro_after_init; | ||
58 | |||
59 | +typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); | ||
60 | + | ||
61 | +static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) | ||
62 | +{ | ||
63 | + pr_err("SEAMCALL (0x%llx) failed: 0x%llx\n", fn, err); | ||
64 | +} | ||
65 | + | ||
66 | +static inline void seamcall_err_ret(u64 fn, u64 err, | ||
67 | + struct tdx_module_args *args) | ||
68 | +{ | ||
69 | + seamcall_err(fn, err, args); | ||
70 | + pr_err("RCX 0x%llx RDX 0x%llx R8 0x%llx R9 0x%llx R10 0x%llx R11 0x%llx\n", | ||
71 | + args->rcx, args->rdx, args->r8, args->r9, | ||
72 | + args->r10, args->r11); | ||
73 | +} | ||
74 | + | ||
75 | +static inline void seamcall_err_saved_ret(u64 fn, u64 err, | ||
76 | + struct tdx_module_args *args) | ||
77 | +{ | ||
78 | + seamcall_err_ret(fn, err, args); | ||
79 | + pr_err("RBX 0x%llx RDI 0x%llx RSI 0x%llx R12 0x%llx R13 0x%llx R14 0x%llx R15 0x%llx\n", | ||
80 | + args->rbx, args->rdi, args->rsi, args->r12, | ||
81 | + args->r13, args->r14, args->r15); | ||
82 | +} | ||
83 | + | ||
84 | +static inline int sc_retry_prerr(sc_func_t func, sc_err_func_t err_func, | ||
85 | + u64 fn, struct tdx_module_args *args) | ||
86 | +{ | ||
87 | + u64 sret = sc_retry(func, fn, args); | ||
88 | + | ||
89 | + if (sret == TDX_SUCCESS) | ||
90 | + return 0; | ||
91 | + | ||
92 | + if (sret == TDX_SEAMCALL_VMFAILINVALID) | ||
93 | + return -ENODEV; | ||
94 | + | ||
95 | + if (sret == TDX_SEAMCALL_GP) | ||
96 | + return -EOPNOTSUPP; | ||
97 | + | ||
98 | + if (sret == TDX_SEAMCALL_UD) | ||
99 | + return -EACCES; | ||
100 | + | ||
101 | + err_func(fn, sret, args); | ||
102 | + return -EIO; | ||
103 | +} | ||
104 | + | ||
105 | +#define seamcall_prerr(__fn, __args) \ | ||
106 | + sc_retry_prerr(__seamcall, seamcall_err, (__fn), (__args)) | ||
107 | + | ||
108 | +#define seamcall_prerr_ret(__fn, __args) \ | ||
109 | + sc_retry_prerr(__seamcall_ret, seamcall_err_ret, (__fn), (__args)) | ||
110 | + | ||
111 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, | ||
112 | u32 *nr_tdx_keyids) | ||
113 | { | ||
114 | -- | ||
115 | 2.41.0 |
1 | Before the TDX module can be used to create and run TDX guests, it must | 1 | To enable TDX the kernel needs to initialize TDX from two perspectives: |
---|---|---|---|
2 | be loaded and properly initialized. The TDX module is expected to be | 2 | 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready |
3 | loaded by the BIOS, and to be initialized by the kernel. | 3 | to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL |
4 | 4 | on one logical cpu before the kernel wants to make any other SEAMCALLs | |
5 | TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). The host | 5 | on that cpu (including those involved during module initialization and |
6 | kernel communicates with the TDX module via a new SEAMCALL instruction. | 6 | running TDX guests). |
7 | The TDX module implements a set of SEAMCALL leaf functions to allow the | ||
8 | host kernel to initialize it. | ||
9 | 7 | ||
10 | The TDX module can be initialized only once in its lifetime. Instead | 8 | The TDX module can be initialized only once in its lifetime. Instead |
11 | of always initializing it at boot time, this implementation chooses an | 9 | of always initializing it at boot time, this implementation chooses an |
12 | "on demand" approach to initialize TDX until there is a real need (e.g | 10 | "on demand" approach to initialize TDX until there is a real need (e.g |
13 | when requested by KVM). This approach has below pros: | 11 | when requested by KVM). This approach has below pros: |
14 | 12 | ||
15 | 1) It avoids consuming the memory that must be allocated by kernel and | 13 | 1) It avoids consuming the memory that must be allocated by kernel and |
16 | given to the TDX module as metadata (~1/256th of the TDX-usable memory), | 14 | given to the TDX module as metadata (~1/256th of the TDX-usable memory), |
17 | and also saves the CPU cycles of initializing the TDX module (and the | 15 | and also saves the CPU cycles of initializing the TDX module (and the |
18 | metadata) when TDX is not used at all. | 16 | metadata) when TDX is not used at all. |
19 | 17 | ||
20 | 2) It is more flexible to support TDX module runtime updating in the | 18 | 2) The TDX module design allows it to be updated while the system is |
21 | future (after updating the TDX module, it needs to be initialized | 19 | running. The update procedure shares quite a few steps with this "on |
22 | again). | 20 | demand" initialization mechanism. The hope is that much of "on demand" |
23 | 21 | mechanism can be shared with a future "update" mechanism. A boot-time | |
24 | 3) It avoids having to do a "temporary" solution to handle VMXON in the | 22 | TDX module implementation would not be able to share much code with the |
25 | core (non-KVM) kernel for now. This is because SEAMCALL requires CPU | 23 | update mechanism. |
26 | being in VMX operation (VMXON is done), but currently only KVM handles | 24 | |
27 | VMXON. Adding VMXON support to the core kernel isn't trivial. More | 25 | 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM |
28 | importantly, from long-term a reference-based approach is likely needed | 26 | code mucks with VMX enabling. If the TDX module were to be initialized |
29 | in the core kernel as more kernel components are likely needed to | 27 | separately from KVM (like at boot), the boot code would need to be |
30 | support TDX as well. Allow KVM to initialize the TDX module avoids | 28 | taught how to muck with VMX enabling and KVM would need to be taught how |
31 | having to handle VMXON during kernel boot for now. | 29 | to cope with that. Making KVM itself responsible for TDX initialization |
32 | 30 | lets the rest of the kernel stay blissfully unaware of VMX. | |
33 | Add a placeholder tdx_enable() to detect and initialize the TDX module | 31 | |
34 | on demand, with a state machine protected by mutex to support concurrent | 32 | Similar to module initialization, also make the per-cpu initialization |
35 | calls from multiple callers. | 33 | "on demand" as it also depends on VMX being enabled. |
36 | 34 | ||
37 | The TDX module will be initialized in multi-steps defined by the TDX | 35 | Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX |
38 | module: | 36 | module and enable TDX on local cpu respectively. For now tdx_enable() |
39 | 37 | is a placeholder. The TODO list will be pared down as functionality is | |
40 | 1) Global initialization; | 38 | added. |
41 | 2) Logical-CPU scope initialization; | 39 | |
42 | 3) Enumerate the TDX module capabilities and platform configuration; | 40 | Export both tdx_cpu_enable() and tdx_enable() for KVM use. |
43 | 4) Configure the TDX module about TDX usable memory ranges and global | 41 | |
44 | KeyID information; | 42 | In tdx_enable() use a state machine protected by mutex to make sure the |
45 | 5) Package-scope configuration for the global KeyID; | 43 | initialization will only be done once, as tdx_enable() can be called |
46 | 6) Initialize usable memory ranges based on 4). | 44 | multiple times (i.e. KVM module can be reloaded) and may be called |
47 | 45 | concurrently by other kernel components in the future. | |
48 | The TDX module can also be shut down at any time during its lifetime. | 46 | |
49 | In case of any error during the initialization process, shut down the | 47 | The per-cpu initialization on each cpu can only be done once during the |
50 | module. It's pointless to leave the module in any intermediate state | 48 | module's life time. Use a per-cpu variable to track its status to make |
51 | during the initialization. | 49 | sure it is only done once in tdx_cpu_enable(). |
52 | 50 | ||
53 | Both logical CPU scope initialization and shutting down the TDX module | 51 | Also, a SEAMCALL to do TDX module global initialization must be done |
54 | require calling SEAMCALL on all boot-time present CPUs. For simplicity | 52 | once on any logical cpu before any per-cpu initialization SEAMCALL. Do |
55 | just temporarily disable CPU hotplug during the module initialization. | 53 | it inside tdx_cpu_enable() too (if it hasn't been done). |
56 | 54 | ||
57 | Note TDX architecturally doesn't support physical CPU hot-add/removal. | 55 | tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The |
58 | A non-buggy BIOS should never support ACPI CPU hot-add/removal. This | 56 | per-cpu initialization must be done before those SEAMCALLs are invoked |
59 | implementation doesn't explicitly handle ACPI CPU hot-add/removal but | 57 | on some cpu. To keep things simple, in tdx_cpu_enable(), always do the |
60 | depends on the BIOS to do the right thing. | 58 | per-cpu initialization regardless of whether the TDX module has been |
61 | 59 | initialized or not. And in tdx_enable(), don't call tdx_cpu_enable() | |
62 | Reviewed-by: Chao Gao <chao.gao@intel.com> | 60 | but assume the caller has disabled CPU hotplug, done VMXON and |
61 | tdx_cpu_enable() on all online cpus before calling tdx_enable(). | ||
62 | |||
63 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 63 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
64 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
64 | --- | 65 | --- |
65 | 66 | ||
66 | v6 -> v7: | 67 | v13 -> v14: |
67 | - No change. | 68 | - Use lockdep_assert_irqs_off() in try_init_model_global() (Nikolay), |
68 | 69 | but still keep the comment (Kirill). | |
69 | v5 -> v6: | 70 | - Add code to print "module not loaded" in the first SEAMCALL. |
70 | - Added code to set status to TDX_MODULE_NONE if TDX module is not | 71 | - If SYS.INIT fails, stop calling LP.INIT in other tdx_cpu_enable()s. |
71 | loaded (Chao) | 72 | - Added Kirill's tag |
72 | - Added Chao's Reviewed-by. | 73 | |
73 | - Improved comments around cpus_read_lock(). | 74 | v12 -> v13: |
74 | 75 | - Made tdx_cpu_enable() always be called with IRQ disabled via IPI | |
75 | - v3->v5 (no feedback on v4): | 76 | function call (Peter, Kirill). |
76 | - Removed the check that SEAMRR and TDX KeyID have been detected on | 77 | |
77 | all present cpus. | 78 | v11 -> v12: |
78 | - Removed tdx_detect(). | 79 | - Simplified TDX module global init and lp init status tracking (David). |
79 | - Added num_online_cpus() to MADT-enabled CPUs check within the CPU | 80 | - Added comment around try_init_module_global() for using |
80 | hotplug lock and return early with error message. | 81 | raw_spin_lock() (Dave). |
81 | - Improved dmesg printing for TDX module detection and initialization. | 82 | - Added one sentence to changelog to explain why to expose tdx_enable() |
83 | and tdx_cpu_enable() (Dave). | ||
84 | - Simplified comments around tdx_enable() and tdx_cpu_enable() to use |
85 | lockdep_assert_*() instead. (Dave) | ||
86 | - Removed redundant "TDX" in error message (Dave). |
87 | |||
88 | v10 -> v11: | ||
89 | - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off. |
90 | - Return the actual error code for tdx_enable() instead of -EINVAL. | ||
91 | - Added Isaku's Reviewed-by. | ||
92 | |||
93 | v9 -> v10: | ||
94 | - Merged the patch to handle per-cpu initialization to this patch to | ||
95 | tell the story better. | ||
96 | - Changed how to handle the per-cpu initialization to only provide a | ||
97 | tdx_cpu_enable() function to let the user of TDX to do it when the | ||
98 | user wants to run TDX code on a certain cpu. | ||
99 | - Changed tdx_enable() to not call cpus_read_lock() explicitly, but | ||
100 | call lockdep_assert_cpus_held() to assume the caller has done that. | ||
101 | - Improved comments around tdx_enable() and tdx_cpu_enable(). | ||
102 | - Improved changelog to tell the story better accordingly. | ||
103 | |||
104 | v8 -> v9: | ||
105 | - Removed detailed TODO list in the changelog (Dave). | ||
106 | - Added back steps to do module global initialization and per-cpu | ||
107 | initialization in the TODO list comment. | ||
108 | - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h | ||
109 | |||
110 | v7 -> v8: | ||
111 | - Refined changelog (Dave). | ||
112 | - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave). | ||
113 | - Add a "TODO list" comment in init_tdx_module() to list all steps of | ||
114 | initializing the TDX Module to tell the story (Dave). | ||
115 | - Made tdx_enable() universally return -EINVAL, and removed nonsense |
116 | comments (Dave). | ||
117 | - Simplified __tdx_enable() to only handle success or failure. | ||
118 | - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR | ||
119 | - Removed TDX_MODULE_NONE (not loaded) as it is not necessary. | ||
120 | - Improved comments (Dave). | ||
121 | - Pointed out 'tdx_module_status' is software thing (Dave). | ||
122 | |||
123 | ... | ||
82 | 124 | ||
83 | --- | 125 | --- |
84 | arch/x86/include/asm/tdx.h | 2 + | 126 | arch/x86/include/asm/tdx.h | 4 + |
85 | arch/x86/virt/vmx/tdx/tdx.c | 150 ++++++++++++++++++++++++++++++++++++ | 127 | arch/x86/virt/vmx/tdx/tdx.c | 167 ++++++++++++++++++++++++++++++++++++ |
86 | 2 files changed, 152 insertions(+) | 128 | arch/x86/virt/vmx/tdx/tdx.h | 30 +++++++ |
129 | 3 files changed, 201 insertions(+) | ||
130 | create mode 100644 arch/x86/virt/vmx/tdx/tdx.h | ||
87 | 131 | ||
88 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 132 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h |
89 | index XXXXXXX..XXXXXXX 100644 | 133 | index XXXXXXX..XXXXXXX 100644 |
90 | --- a/arch/x86/include/asm/tdx.h | 134 | --- a/arch/x86/include/asm/tdx.h |
91 | +++ b/arch/x86/include/asm/tdx.h | 135 | +++ b/arch/x86/include/asm/tdx.h |
92 | @@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, | 136 | @@ -XXX,XX +XXX,XX @@ static inline u64 sc_retry(sc_func_t func, u64 fn, |
93 | 137 | #define seamcall_saved_ret(_fn, _args) sc_retry(__seamcall_saved_ret, (_fn), (_args)) | |
94 | #ifdef CONFIG_INTEL_TDX_HOST | 138 | |
95 | bool platform_tdx_enabled(void); | 139 | bool platform_tdx_enabled(void); |
140 | +int tdx_cpu_enable(void); | ||
96 | +int tdx_enable(void); | 141 | +int tdx_enable(void); |
97 | #else /* !CONFIG_INTEL_TDX_HOST */ | 142 | #else |
98 | static inline bool platform_tdx_enabled(void) { return false; } | 143 | static inline bool platform_tdx_enabled(void) { return false; } |
144 | +static inline int tdx_cpu_enable(void) { return -ENODEV; } | ||
99 | +static inline int tdx_enable(void) { return -ENODEV; } | 145 | +static inline int tdx_enable(void) { return -ENODEV; } |
100 | #endif /* CONFIG_INTEL_TDX_HOST */ | 146 | #endif /* CONFIG_INTEL_TDX_HOST */ |
101 | 147 | ||
102 | #endif /* !__ASSEMBLY__ */ | 148 | #endif /* !__ASSEMBLY__ */ |
103 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 149 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
104 | index XXXXXXX..XXXXXXX 100644 | 150 | index XXXXXXX..XXXXXXX 100644 |
105 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 151 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
106 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 152 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
107 | @@ -XXX,XX +XXX,XX @@ | 153 | @@ -XXX,XX +XXX,XX @@ |
108 | #include <linux/types.h> | ||
109 | #include <linux/init.h> | 154 | #include <linux/init.h> |
155 | #include <linux/errno.h> | ||
110 | #include <linux/printk.h> | 156 | #include <linux/printk.h> |
157 | +#include <linux/cpu.h> | ||
158 | +#include <linux/spinlock.h> | ||
159 | +#include <linux/percpu-defs.h> | ||
111 | +#include <linux/mutex.h> | 160 | +#include <linux/mutex.h> |
112 | +#include <linux/cpu.h> | ||
113 | +#include <linux/cpumask.h> | ||
114 | #include <asm/msr-index.h> | 161 | #include <asm/msr-index.h> |
115 | #include <asm/msr.h> | 162 | #include <asm/msr.h> |
116 | #include <asm/apic.h> | ||
117 | #include <asm/tdx.h> | 163 | #include <asm/tdx.h> |
118 | #include "tdx.h" | 164 | +#include "tdx.h" |
119 | 165 | ||
120 | +/* TDX module status during initialization */ | 166 | static u32 tdx_global_keyid __ro_after_init; |
121 | +enum tdx_module_status_t { | 167 | static u32 tdx_guest_keyid_start __ro_after_init; |
122 | + /* TDX module hasn't been detected and initialized */ | 168 | static u32 tdx_nr_guest_keyids __ro_after_init; |
123 | + TDX_MODULE_UNKNOWN, | 169 | |
124 | + /* TDX module is not loaded */ | 170 | +static DEFINE_PER_CPU(bool, tdx_lp_initialized); |
125 | + TDX_MODULE_NONE, | 171 | + |
126 | + /* TDX module is initialized */ | ||
127 | + TDX_MODULE_INITIALIZED, | ||
128 | + /* TDX module is shut down due to initialization error */ | ||
129 | + TDX_MODULE_SHUTDOWN, | ||
130 | +}; | ||
131 | + | ||
132 | static u32 tdx_keyid_start __ro_after_init; | ||
133 | static u32 tdx_keyid_num __ro_after_init; | ||
134 | |||
135 | +static enum tdx_module_status_t tdx_module_status; | 172 | +static enum tdx_module_status_t tdx_module_status; |
136 | +/* Prevent concurrent attempts on TDX detection and initialization */ | ||
137 | +static DEFINE_MUTEX(tdx_module_lock); | 173 | +static DEFINE_MUTEX(tdx_module_lock); |
138 | + | 174 | + |
139 | /* | 175 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); |
140 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | 176 | |
141 | * BIOS. Both initializing the TDX module and running TDX guest require | 177 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) |
142 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void) | 178 | @@ -XXX,XX +XXX,XX @@ static inline int sc_retry_prerr(sc_func_t func, sc_err_func_t err_func, |
143 | { | 179 | #define seamcall_prerr_ret(__fn, __args) \ |
144 | return !!tdx_keyid_num; | 180 | sc_retry_prerr(__seamcall_ret, seamcall_err_ret, (__fn), (__args)) |
145 | } | 181 | |
146 | + | ||
147 | +/* | 182 | +/* |
148 | + * Detect and initialize the TDX module. | 183 | + * Do the module global initialization once and return its result. |
149 | + * | 184 | + * It can be done on any cpu. It's always called with interrupts |
150 | + * Return -ENODEV when the TDX module is not loaded, 0 when it | 185 | + * disabled. |
151 | + * is successfully initialized, or another error when it fails to | 186 | + * disabled. |
152 | + * initialize. | 187 | +static int try_init_module_global(void) |
153 | + */ | 188 | +{ |
154 | +static int init_tdx_module(void) | 189 | + struct tdx_module_args args = {}; |
155 | +{ | 190 | + static DEFINE_RAW_SPINLOCK(sysinit_lock); |
156 | + /* The TDX module hasn't been detected */ | 191 | + static bool sysinit_done; |
157 | + return -ENODEV; | 192 | + static int sysinit_ret; |
158 | +} | 193 | + |
159 | + | 194 | + lockdep_assert_irqs_disabled(); |
160 | +static void shutdown_tdx_module(void) | 195 | + |
161 | +{ | 196 | + raw_spin_lock(&sysinit_lock); |
162 | + /* TODO: Shut down the TDX module */ | 197 | + |
163 | +} | 198 | + if (sysinit_done) |
164 | + | 199 | + goto out; |
165 | +static int __tdx_enable(void) | 200 | + |
166 | +{ | 201 | + /* RCX is module attributes and all bits are reserved */ |
167 | + int ret; | 202 | + args.rcx = 0; |
203 | + sysinit_ret = seamcall_prerr(TDH_SYS_INIT, &args); | ||
168 | + | 204 | + |
169 | + /* | 205 | + /* |
170 | + * Initializing the TDX module requires doing SEAMCALL on all | 206 | + * The first SEAMCALL also detects the TDX module, thus |
171 | + * boot-time present CPUs. For simplicity temporarily disable | 207 | + * it can fail due to the TDX module is not loaded. |
172 | + * CPU hotplug to prevent any CPU from going offline during | 208 | + * Dump message to let the user know. |
173 | + * the initialization. | ||
174 | + */ | 209 | + */ |
175 | + cpus_read_lock(); | 210 | + if (sysinit_ret == -ENODEV) |
176 | + | 211 | + pr_err("module not loaded\n"); |
177 | + /* | 212 | + |
178 | + * Check whether all boot-time present CPUs are online and | 213 | + sysinit_done = true; |
179 | + * return early with a message so the user can be aware. | ||
180 | + * | ||
181 | + * Note a non-buggy BIOS should never support physical (ACPI) | ||
182 | + * CPU hotplug when TDX is enabled, and all boot-time present | ||
183 | + * CPU should be enabled in MADT, so there should be no | ||
184 | + * disabled_cpus and num_processors won't change at runtime | ||
185 | + * either. | ||
186 | + */ | ||
187 | + if (disabled_cpus || num_online_cpus() != num_processors) { | ||
188 | + pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n"); | ||
189 | + ret = -EINVAL; | ||
190 | + goto out; | ||
191 | + } | ||
192 | + | ||
193 | + ret = init_tdx_module(); | ||
194 | + if (ret == -ENODEV) { | ||
195 | + pr_info("TDX module is not loaded.\n"); | ||
196 | + tdx_module_status = TDX_MODULE_NONE; | ||
197 | + goto out; | ||
198 | + } | ||
199 | + | ||
200 | + /* | ||
201 | + * Shut down the TDX module in case of any error during the | ||
202 | + * initialization process. It's meaningless to leave the TDX | ||
203 | + * module in any middle state of the initialization process. | ||
204 | + * | ||
205 | + * Shutting down the module also requires doing SEAMCALL on all | ||
206 | + * MADT-enabled CPUs. Do it while CPU hotplug is disabled. | ||
207 | + * | ||
208 | + * Return all errors during the initialization as -EFAULT as the | ||
209 | + * module is always shut down. | ||
210 | + */ | ||
211 | + if (ret) { | ||
212 | + pr_info("Failed to initialize TDX module. Shut it down.\n"); | ||
213 | + shutdown_tdx_module(); | ||
214 | + tdx_module_status = TDX_MODULE_SHUTDOWN; | ||
215 | + ret = -EFAULT; | ||
216 | + goto out; | ||
217 | + } | ||
218 | + | ||
219 | + pr_info("TDX module initialized.\n"); | ||
220 | + tdx_module_status = TDX_MODULE_INITIALIZED; | ||
221 | +out: | 214 | +out: |
222 | + cpus_read_unlock(); | 215 | + raw_spin_unlock(&sysinit_lock); |
223 | + | 216 | + return sysinit_ret; |
224 | + return ret; | ||
225 | +} | 217 | +} |
226 | + | 218 | + |
227 | +/** | 219 | +/** |
228 | + * tdx_enable - Enable TDX by initializing the TDX module | 220 | + * tdx_cpu_enable - Enable TDX on local cpu |
229 | + * | 221 | + * |
230 | + * Caller to make sure all CPUs are online and in VMX operation before | 222 | + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module |
231 | + * calling this function. CPU hotplug is temporarily disabled internally | 223 | + * global initialization SEAMCALL if not done) on local cpu to make this |
232 | + * to prevent any cpu from going offline. | 224 | + * cpu ready to run any other SEAMCALLs. |
233 | + * | 225 | + * |
234 | + * This function can be called in parallel by multiple callers. | 226 | + * Always call this function via IPI function calls. |
235 | + * | 227 | + * |
236 | + * Return: | 228 | + * Return 0 on success, otherwise errors. |
237 | + * | 229 | + */ |
238 | + * * 0: The TDX module has been successfully initialized. | 230 | +int tdx_cpu_enable(void) |
239 | + * * -ENODEV: The TDX module is not loaded, or TDX is not supported. | 231 | +{ |
240 | + * * -EINVAL: The TDX module cannot be initialized due to certain | 232 | + struct tdx_module_args args = {}; |
241 | + * conditions are not met (i.e. when not all MADT-enabled | ||
242 | + * CPUs are online). | 232 | + struct tdx_module_args args = {}; |
243 | + * * -EFAULT: Other internal fatal errors, or the TDX module is in | ||
244 | + * shutdown mode due to it failed to initialize in previous | ||
245 | + * attempts. | ||
246 | + */ | ||
247 | +int tdx_enable(void) | ||
248 | +{ | ||
249 | + int ret; | 233 | + int ret; |
250 | + | 234 | + |
251 | + if (!platform_tdx_enabled()) | 235 | + if (!platform_tdx_enabled()) |
252 | + return -ENODEV; | 236 | + return -ENODEV; |
253 | + | 237 | + |
238 | + lockdep_assert_irqs_disabled(); | ||
239 | + | ||
240 | + if (__this_cpu_read(tdx_lp_initialized)) | ||
241 | + return 0; | ||
242 | + | ||
243 | + /* | ||
244 | + * The TDX module global initialization is the very first step | ||
245 | + * to enable TDX. Need to do it first (if it hasn't been done) | ||
246 | + * before the per-cpu initialization. | ||
247 | + */ | ||
248 | + ret = try_init_module_global(); | ||
249 | + if (ret) | ||
250 | + return ret; | ||
251 | + | ||
252 | + ret = seamcall_prerr(TDH_SYS_LP_INIT, &args); | ||
253 | + if (ret) | ||
254 | + return ret; | ||
255 | + | ||
256 | + __this_cpu_write(tdx_lp_initialized, true); | ||
257 | + | ||
258 | + return 0; | ||
259 | +} | ||
260 | +EXPORT_SYMBOL_GPL(tdx_cpu_enable); | ||
261 | + | ||
262 | +static int init_tdx_module(void) | ||
263 | +{ | ||
264 | + /* | ||
265 | + * TODO: | ||
266 | + * | ||
267 | + * - Get TDX module information and TDX-capable memory regions. | ||
268 | + * - Build the list of TDX-usable memory regions. | ||
269 | + * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
270 | + * all TDX-usable memory regions. | ||
271 | + * - Configure the TDMRs and the global KeyID to the TDX module. | ||
272 | + * - Configure the global KeyID on all packages. | ||
273 | + * - Initialize all TDMRs. | ||
274 | + * | ||
275 | + * Return error before all steps are done. | ||
276 | + */ | ||
277 | + return -EINVAL; | ||
278 | +} | ||
279 | + | ||
280 | +static int __tdx_enable(void) | ||
281 | +{ | ||
282 | + int ret; | ||
283 | + | ||
284 | + ret = init_tdx_module(); | ||
285 | + if (ret) { | ||
286 | + pr_err("module initialization failed (%d)\n", ret); | ||
287 | + tdx_module_status = TDX_MODULE_ERROR; | ||
288 | + return ret; | ||
289 | + } | ||
290 | + | ||
291 | + pr_info("module initialized\n"); | ||
292 | + tdx_module_status = TDX_MODULE_INITIALIZED; | ||
293 | + | ||
294 | + return 0; | ||
295 | +} | ||
296 | + | ||
297 | +/** | ||
298 | + * tdx_enable - Enable TDX module to make it ready to run TDX guests | ||
299 | + * | ||
300 | + * This function assumes the caller has: 1) held read lock of CPU hotplug | ||
301 | + * lock to prevent any new cpu from becoming online; 2) done both VMXON | ||
302 | + * and tdx_cpu_enable() on all online cpus. | ||
303 | + * | ||
304 | + * This function can be called in parallel by multiple callers. | ||
305 | + * | ||
306 | + * Return 0 if TDX is enabled successfully, otherwise error. | ||
307 | + */ | ||
308 | +int tdx_enable(void) | ||
309 | +{ | ||
310 | + int ret; | ||
311 | + | ||
312 | + if (!platform_tdx_enabled()) | ||
313 | + return -ENODEV; | ||
314 | + | ||
315 | + lockdep_assert_cpus_held(); | ||
316 | + | ||
254 | + mutex_lock(&tdx_module_lock); | 317 | + mutex_lock(&tdx_module_lock); |
255 | + | 318 | + |
256 | + switch (tdx_module_status) { | 319 | + switch (tdx_module_status) { |
257 | + case TDX_MODULE_UNKNOWN: | 320 | + case TDX_MODULE_UNINITIALIZED: |
258 | + ret = __tdx_enable(); | 321 | + ret = __tdx_enable(); |
259 | + break; | 322 | + break; |
260 | + case TDX_MODULE_NONE: | ||
261 | + ret = -ENODEV; | ||
262 | + break; | ||
263 | + case TDX_MODULE_INITIALIZED: | 323 | + case TDX_MODULE_INITIALIZED: |
324 | + /* Already initialized, great, tell the caller. */ | ||
264 | + ret = 0; | 325 | + ret = 0; |
265 | + break; | 326 | + break; |
266 | + default: | 327 | + default: |
267 | + WARN_ON_ONCE(tdx_module_status != TDX_MODULE_SHUTDOWN); | 328 | + /* Failed to initialize in the previous attempts */ |
268 | + ret = -EFAULT; | 329 | + ret = -EINVAL; |
269 | + break; | 330 | + break; |
270 | + } | 331 | + } |
271 | + | 332 | + |
272 | + mutex_unlock(&tdx_module_lock); | 333 | + mutex_unlock(&tdx_module_lock); |
273 | + | 334 | + |
274 | + return ret; | 335 | + return ret; |
275 | +} | 336 | +} |
276 | +EXPORT_SYMBOL_GPL(tdx_enable); | 337 | +EXPORT_SYMBOL_GPL(tdx_enable); |
338 | + | ||
339 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, | ||
340 | u32 *nr_tdx_keyids) | ||
341 | { | ||
342 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
343 | new file mode 100644 | ||
344 | index XXXXXXX..XXXXXXX | ||
345 | --- /dev/null | ||
346 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
347 | @@ -XXX,XX +XXX,XX @@ | ||
348 | +/* SPDX-License-Identifier: GPL-2.0 */ | ||
349 | +#ifndef _X86_VIRT_TDX_H | ||
350 | +#define _X86_VIRT_TDX_H | ||
351 | + | ||
352 | +/* | ||
353 | + * This file contains both macros and data structures defined by the TDX | ||
354 | + * architecture and Linux defined software data structures and functions. | ||
355 | + * The two should not be mixed together for better readability. The | ||
356 | + * architectural definitions come first. | ||
357 | + */ | ||
358 | + | ||
359 | +/* | ||
360 | + * TDX module SEAMCALL leaf functions | ||
361 | + */ | ||
362 | +#define TDH_SYS_INIT 33 | ||
363 | +#define TDH_SYS_LP_INIT 35 | ||
364 | + | ||
365 | +/* | ||
366 | + * Do not put any hardware-defined TDX structure representations below | ||
367 | + * this comment! | ||
368 | + */ | ||
369 | + | ||
370 | +/* Kernel defined TDX module status during module initialization. */ | ||
371 | +enum tdx_module_status_t { | ||
372 | + TDX_MODULE_UNINITIALIZED, | ||
373 | + TDX_MODULE_INITIALIZED, | ||
374 | + TDX_MODULE_ERROR | ||
375 | +}; | ||
376 | + | ||
377 | +#endif | ||
277 | -- | 378 | -- |
278 | 2.38.1 | 379 | 2.41.0 |
1 | Start to transit out the "multi-steps" to initialize the TDX module. | ||
---|---|---|---|
2 | |||
1 | TDX provides increased levels of memory confidentiality and integrity. | 3 | TDX provides increased levels of memory confidentiality and integrity. |
2 | This requires special hardware support for features like memory | 4 | This requires special hardware support for features like memory |
3 | encryption and storage of memory integrity checksums. Not all memory | 5 | encryption and storage of memory integrity checksums. Not all memory |
4 | satisfies these requirements. | 6 | satisfies these requirements. |
5 | 7 | ||
6 | As a result, TDX introduced the concept of a "Convertible Memory Region" | 8 | As a result, TDX introduced the concept of a "Convertible Memory Region" |
7 | (CMR). During boot, the firmware builds a list of all of the memory | 9 | (CMR). During boot, the firmware builds a list of all of the memory |
8 | ranges which can provide the TDX security guarantees. The list of these | 10 | ranges which can provide the TDX security guarantees. |
9 | ranges, along with TDX module information, is available to the kernel by | 11 | |
10 | querying the TDX module via TDH.SYS.INFO SEAMCALL. | 12 | CMRs tell the kernel which memory is TDX compatible. The kernel takes |
11 | 13 | CMRs (plus a little more metadata) and constructs "TD Memory Regions" | |
12 | The host kernel can choose whether or not to use all convertible memory | 14 | (TDMRs). TDMRs let the kernel grant TDX protections to some or all of |
13 | regions as TDX-usable memory. Before the TDX module is ready to create | 15 | the CMR areas. |
14 | any TDX guests, the kernel needs to configure the TDX-usable memory | 16 | |
15 | regions by passing an array of "TD Memory Regions" (TDMRs) to the TDX | 17 | The TDX module also reports necessary information to let the kernel |
16 | module. Constructing the TDMR array requires information of both the | 18 | build TDMRs and run TDX guests in structure 'tdsysinfo_struct'. The |
17 | TDX module (TDSYSINFO_STRUCT) and the Convertible Memory Regions. Call | 19 | list of CMRs, along with the TDX module information, is available to |
18 | TDH.SYS.INFO to get this information as a preparation. | 20 | the kernel by querying the TDX module. |
19 | 21 | ||
20 | Use static variables for both TDSYSINFO_STRUCT and CMR array to avoid | 22 | As a preparation to construct TDMRs, get the TDX module information and |
21 | having to pass them as function arguments when constructing the TDMR | 23 | the list of CMRs. Print out CMRs to help the user decode which memory |
22 | array. And they are too big to be put on the stack anyway. Also, KVM | 24 | regions are TDX convertible. |
23 | needs to use the TDSYSINFO_STRUCT to create TDX guests. | 25 | |
24 | 26 | The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot | |
27 | of info about the TDX module. Fully define the entire structure, but | ||
28 | only use the fields necessary to build the TDMRs and pr_info() some | ||
29 | basics about the module. The rest of the fields will get used by KVM. | ||
30 | |||
31 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
25 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 32 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
26 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 33 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
27 | --- | 34 | --- |
28 | 35 | ||
29 | v6 -> v7: | 36 | v13 -> v14: |
30 | - Simplified the check of CMRs due to the fact that TDX actually | 37 | - Added Kirill's tag. |
31 | verifies CMRs (that are passed by the BIOS) before enabling TDX. | 38 | |
32 | - Changed the function name from check_cmrs() -> trim_empty_cmrs(). | 39 | v12 -> v13: |
33 | - Added CMR page aligned check so that later patch can just get the PFN | 40 | - Allocate TDSYSINFO and CMR array separately. (Kirill) |
34 | using ">> PAGE_SHIFT". | 41 | - Added comment around TDH.SYS.INFO. (Peter) |
35 | 42 | ||
36 | v5 -> v6: | 43 | v11 -> v12: |
37 | - Added to also print TDX module's attribute (Isaku). | 44 | - Changed to use dynamic allocation for TDSYSINFO_STRUCT and CMR array |
38 | - Removed all arguments in tdx_gete_sysinfo() to use static variables | 45 | (Kirill). |
39 | of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used | 46 | - Keep SEAMCALL leaf macro definitions in order (Kirill) |
40 | directly in other functions in later patches. | 47 | - Removed is_cmr_empty() but open code directly (David) |
41 | - Added Isaku's Reviewed-by. | 48 | - 'atribute' -> 'attribute' (David) |
42 | 49 | ||
43 | - v3 -> v5 (no feedback on v4): | 50 | v10 -> v11: |
44 | - Renamed sanitize_cmrs() to check_cmrs(). | 51 | - No change. |
45 | - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array | 52 | |
46 | actual size returned by TDH.SYS.INFO. | 53 | v9 -> v10: |
47 | - Changed -EFAULT to -EINVAL in couple places. | 54 | - Added back "start to transit out..." as now per-cpu init has been |
48 | - Added comments around tdx_sysinfo and tdx_cmr_array saying they are | 55 | moved out from tdx_enable(). |
49 | used by TDH.SYS.INFO ABI. | 56 | |
50 | - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function | 57 | v8 -> v9: |
51 | arguments in tdx_get_sysinfo(). | 58 | - Removed "start to trransit out ..." part in changelog since this patch |
52 | - Changed to only print BIOS-CMR when check_cmrs() fails. | 59 | is no longer the first step anymore. |
60 | - Changed to declare 'tdsysinfo' and 'cmr_array' as local static, and | ||
61 | changed changelog accordingly (Dave). | ||
62 | - Improved changelog to explain why to declare 'tdsysinfo_struct' in | ||
63 | full but only use a few members of them (Dave). | ||
64 | |||
65 | v7 -> v8: (Dave) | ||
66 | - Improved changelog to tell this is the first patch to transit out the | ||
67 | "multi-steps" init_tdx_module(). | ||
68 | - Removed all CMR check/trim code but to depend on later SEAMCALL. | ||
69 | - Variable 'vertical alignment' in print TDX module information. | ||
70 | - Added DECLARE_PADDED_STRUCT() for padded structure. | ||
71 | - Made tdx_sysinfo and tdx_cmr_array[] to be function local variable | ||
72 | (and rename them accordingly), and added -Wframe-larger-than=4096 flag | ||
73 | to silence the build warning. | ||
74 | |||
75 | ... | ||
53 | 76 | ||
54 | --- | 77 | --- |
55 | arch/x86/virt/vmx/tdx/tdx.c | 125 ++++++++++++++++++++++++++++++++++++ | 78 | arch/x86/virt/vmx/tdx/tdx.c | 94 ++++++++++++++++++++++++++++++++++++- |
56 | arch/x86/virt/vmx/tdx/tdx.h | 61 ++++++++++++++++++ | 79 | arch/x86/virt/vmx/tdx/tdx.h | 64 +++++++++++++++++++++++++ |
57 | 2 files changed, 186 insertions(+) | 80 | 2 files changed, 156 insertions(+), 2 deletions(-) |
58 | 81 | ||
59 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 82 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
60 | index XXXXXXX..XXXXXXX 100644 | 83 | index XXXXXXX..XXXXXXX 100644 |
61 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 84 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
62 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 85 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
63 | @@ -XXX,XX +XXX,XX @@ | 86 | @@ -XXX,XX +XXX,XX @@ |
64 | #include <linux/cpumask.h> | 87 | #include <linux/spinlock.h> |
65 | #include <linux/smp.h> | 88 | #include <linux/percpu-defs.h> |
66 | #include <linux/atomic.h> | 89 | #include <linux/mutex.h> |
67 | +#include <linux/align.h> | 90 | +#include <linux/slab.h> |
91 | +#include <linux/math.h> | ||
68 | #include <asm/msr-index.h> | 92 | #include <asm/msr-index.h> |
69 | #include <asm/msr.h> | 93 | #include <asm/msr.h> |
70 | #include <asm/apic.h> | 94 | +#include <asm/page.h> |
71 | @@ -XXX,XX +XXX,XX @@ static enum tdx_module_status_t tdx_module_status; | 95 | #include <asm/tdx.h> |
72 | /* Prevent concurrent attempts on TDX detection and initialization */ | 96 | #include "tdx.h" |
73 | static DEFINE_MUTEX(tdx_module_lock); | 97 | |
74 | 98 | @@ -XXX,XX +XXX,XX @@ int tdx_cpu_enable(void) | |
75 | +/* Below two are used in TDH.SYS.INFO SEAMCALL ABI */ | ||
76 | +static struct tdsysinfo_struct tdx_sysinfo; | ||
77 | +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT); | ||
78 | +static int tdx_cmr_num; | ||
79 | + | ||
80 | /* | ||
81 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | ||
82 | * BIOS. Both initializing the TDX module and running TDX guest require | ||
83 | @@ -XXX,XX +XXX,XX @@ static int tdx_module_init_cpus(void) | ||
84 | return atomic_read(&sc.err); | ||
85 | } | 99 | } |
86 | 100 | EXPORT_SYMBOL_GPL(tdx_cpu_enable); | |
87 | +static inline bool is_cmr_empty(struct cmr_info *cmr) | 101 | |
88 | +{ | 102 | +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs) |
89 | + return !cmr->size; | ||
90 | +} | ||
91 | + | ||
92 | +static inline bool is_cmr_ok(struct cmr_info *cmr) | ||
93 | +{ | ||
94 | + /* CMR must be page aligned */ | ||
95 | + return IS_ALIGNED(cmr->base, PAGE_SIZE) && | ||
96 | + IS_ALIGNED(cmr->size, PAGE_SIZE); | ||
97 | +} | ||
98 | + | ||
99 | +static void print_cmrs(struct cmr_info *cmr_array, int cmr_num, | ||
100 | + const char *name) | ||
101 | +{ | 103 | +{ |
102 | + int i; | 104 | + int i; |
103 | + | 105 | + |
104 | + for (i = 0; i < cmr_num; i++) { | 106 | + for (i = 0; i < nr_cmrs; i++) { |
105 | + struct cmr_info *cmr = &cmr_array[i]; | 107 | + struct cmr_info *cmr = &cmr_array[i]; |
106 | + | 108 | + |
107 | + pr_info("%s : [0x%llx, 0x%llx)\n", name, | 109 | + /* |
108 | + cmr->base, cmr->base + cmr->size); | 110 | + * The array of CMRs reported via TDH.SYS.INFO can |
111 | + * contain tail empty CMRs. Don't print them. | ||
112 | + */ | ||
113 | + if (!cmr->size) | ||
114 | + break; | ||
115 | + | ||
116 | + pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base, | ||
117 | + cmr->base + cmr->size); | ||
109 | + } | 118 | + } |
110 | +} | 119 | +} |
111 | + | 120 | + |
112 | +/* Check CMRs reported by TDH.SYS.INFO, and trim tail empty CMRs. */ | 121 | +static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo, |
113 | +static int trim_empty_cmrs(struct cmr_info *cmr_array, int *actual_cmr_num) | 122 | + struct cmr_info *cmr_array) |
114 | +{ | 123 | +{ |
115 | + struct cmr_info *cmr; | 124 | + struct tdx_module_args args = {}; |
116 | + int i, cmr_num; | 125 | + int ret; |
117 | + | 126 | + |
118 | + /* | 127 | + /* |
119 | + * Intel TDX module spec, 20.7.3 CMR_INFO: | 128 | + * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array |
129 | + * to the buffers provided by the kernel (via RCX and R8 | ||
130 | + * respectively). The buffer size of the TDSYSINFO_STRUCT | ||
131 | + * (via RDX) and the maximum entries of the CMR array (via R9) | ||
132 | + * passed to this SEAMCALL must be at least the size of | ||
133 | + * TDSYSINFO_STRUCT and MAX_CMRS respectively. | ||
120 | + * | 134 | + * |
121 | + * TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry | 135 | + * Upon a successful return, R9 contains the actual entries |
122 | + * array of CMR_INFO entries. The CMRs are sorted from the | 136 | + * written to the CMR array. |
123 | + * lowest base address to the highest base address, and they | ||
124 | + * are non-overlapping. | ||
125 | + * | ||
126 | + * This implies that BIOS may generate invalid empty entries | ||
127 | + * if total CMRs are less than 32. Need to skip them manually. | ||
128 | + * | ||
129 | + * CMR also must be 4K aligned. TDX doesn't trust BIOS. TDX | ||
130 | + * actually verifies CMRs before it gets enabled, so anything | ||
131 | + * that doesn't meet the above means a kernel bug (or TDX is broken). | |
132 | + */ | 137 | + */ |
133 | + cmr = &cmr_array[0]; | 138 | + args.rcx = __pa(tdsysinfo); |
134 | + /* There must be at least one valid CMR */ | 139 | + args.rdx = TDSYSINFO_STRUCT_SIZE; |
135 | + if (WARN_ON_ONCE(is_cmr_empty(cmr) || !is_cmr_ok(cmr))) | 140 | + args.r8 = __pa(cmr_array); |
136 | + goto err; | 141 | + args.r9 = MAX_CMRS; |
137 | + | 142 | + ret = seamcall_prerr_ret(TDH_SYS_INFO, &args); |
138 | + cmr_num = *actual_cmr_num; | ||
139 | + for (i = 1; i < cmr_num; i++) { | ||
140 | + struct cmr_info *cmr = &cmr_array[i]; | ||
141 | + struct cmr_info *prev_cmr = NULL; | ||
142 | + | ||
143 | + /* Skip further empty CMRs */ | ||
144 | + if (is_cmr_empty(cmr)) | ||
145 | + break; | ||
146 | + | ||
147 | + /* | ||
148 | + * Do sanity check anyway to make sure CMRs: | ||
149 | + * - are 4K aligned | ||
150 | + * - don't overlap | ||
151 | + * - are in address ascending order. | ||
152 | + */ | ||
153 | + if (WARN_ON_ONCE(!is_cmr_ok(cmr))) | ||
154 | + goto err; | ||
155 | + | ||
156 | + prev_cmr = &cmr_array[i - 1]; | ||
157 | + if (WARN_ON_ONCE((prev_cmr->base + prev_cmr->size) > | ||
158 | + cmr->base)) | ||
159 | + goto err; | ||
160 | + } | ||
161 | + | ||
162 | + /* Update the actual number of CMRs */ | ||
163 | + *actual_cmr_num = i; | ||
164 | + | ||
165 | + /* Print kernel checked CMRs */ | ||
166 | + print_cmrs(cmr_array, *actual_cmr_num, "Kernel-checked-CMR"); | ||
167 | + | ||
168 | + return 0; | ||
169 | +err: | ||
170 | + pr_info("[TDX broken?]: Invalid CMRs detected\n"); | |
171 | + print_cmrs(cmr_array, *actual_cmr_num, "BIOS-CMR"); | |
172 | + return -EINVAL; | ||
173 | +} | ||
174 | + | ||
175 | +static int tdx_get_sysinfo(void) | ||
176 | +{ | ||
177 | + struct tdx_module_output out; | ||
178 | + int ret; | ||
179 | + | ||
180 | + BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE); | ||
181 | + | ||
182 | + ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE, | ||
183 | + __pa(tdx_cmr_array), MAX_CMRS, NULL, &out); | ||
184 | + if (ret) | 143 | + if (ret) |
185 | + return ret; | 144 | + return ret; |
186 | + | 145 | + |
187 | + /* R9 contains the actual entries written to the CMR array. */ | 146 | + pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u", |
188 | + tdx_cmr_num = out.r9; | 147 | + tdsysinfo->attributes, tdsysinfo->vendor_id, |
189 | + | 148 | + tdsysinfo->major_version, tdsysinfo->minor_version, |
190 | + pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u", | 149 | + tdsysinfo->build_date, tdsysinfo->build_num); |
191 | + tdx_sysinfo.attributes, tdx_sysinfo.vendor_id, | 150 | + |
192 | + tdx_sysinfo.major_version, tdx_sysinfo.minor_version, | 151 | + print_cmrs(cmr_array, args.r9); |
193 | + tdx_sysinfo.build_date, tdx_sysinfo.build_num); | 152 | + |
194 | + | 153 | + return 0; |
195 | + /* | ||
196 | + * trim_empty_cmrs() updates the actual number of CMRs by | ||
197 | + * dropping all tail empty CMRs. | ||
198 | + */ | ||
199 | + return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num); | ||
200 | +} | 154 | +} |
201 | + | 155 | + |
202 | /* | 156 | static int init_tdx_module(void) |
203 | * Detect and initialize the TDX module. | 157 | { |
204 | * | 158 | + struct tdsysinfo_struct *tdsysinfo; |
205 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 159 | + struct cmr_info *cmr_array; |
206 | if (ret) | 160 | + int tdsysinfo_size; |
207 | goto out; | 161 | + int cmr_array_size; |
208 | 162 | + int ret; | |
209 | + ret = tdx_get_sysinfo(); | 163 | + |
164 | + tdsysinfo_size = round_up(TDSYSINFO_STRUCT_SIZE, | ||
165 | + TDSYSINFO_STRUCT_ALIGNMENT); | ||
166 | + tdsysinfo = kzalloc(tdsysinfo_size, GFP_KERNEL); | ||
167 | + if (!tdsysinfo) | ||
168 | + return -ENOMEM; | ||
169 | + | ||
170 | + cmr_array_size = sizeof(struct cmr_info) * MAX_CMRS; | ||
171 | + cmr_array_size = round_up(cmr_array_size, CMR_INFO_ARRAY_ALIGNMENT); | ||
172 | + cmr_array = kzalloc(cmr_array_size, GFP_KERNEL); | ||
173 | + if (!cmr_array) { | ||
174 | + kfree(tdsysinfo); | ||
175 | + return -ENOMEM; | ||
176 | + } | ||
177 | + | ||
178 | + | ||
179 | + /* Get the TDSYSINFO_STRUCT and CMRs from the TDX module. */ | ||
180 | + ret = get_tdx_sysinfo(tdsysinfo, cmr_array); | ||
210 | + if (ret) | 181 | + if (ret) |
211 | + goto out; | 182 | + goto out; |
212 | + | 183 | + |
213 | /* | 184 | /* |
214 | * Return -EINVAL until all steps of TDX module initialization | 185 | * TODO: |
215 | * process are done. | 186 | * |
187 | - * - Get TDX module information and TDX-capable memory regions. | ||
188 | * - Build the list of TDX-usable memory regions. | ||
189 | * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
190 | * all TDX-usable memory regions. | ||
191 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
192 | * | ||
193 | * Return error before all steps are done. | ||
194 | */ | ||
195 | - return -EINVAL; | ||
196 | + ret = -EINVAL; | ||
197 | +out: | ||
198 | + /* | ||
199 | + * For now both @sysinfo and @cmr_array are only used during | ||
200 | + * module initialization, so always free them. | ||
201 | + */ | ||
202 | + kfree(tdsysinfo); | ||
203 | + kfree(cmr_array); | ||
204 | + return ret; | ||
205 | } | ||
206 | |||
207 | static int __tdx_enable(void) | ||
216 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 208 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
217 | index XXXXXXX..XXXXXXX 100644 | 209 | index XXXXXXX..XXXXXXX 100644 |
218 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 210 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
219 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 211 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
212 | @@ -XXX,XX +XXX,XX @@ | ||
213 | #ifndef _X86_VIRT_TDX_H | ||
214 | #define _X86_VIRT_TDX_H | ||
215 | |||
216 | +#include <linux/types.h> | ||
217 | +#include <linux/stddef.h> | ||
218 | +#include <linux/compiler_attributes.h> | ||
219 | + | ||
220 | /* | ||
221 | * This file contains both macros and data structures defined by the TDX | ||
222 | * architecture and Linux defined software data structures and functions. | ||
220 | @@ -XXX,XX +XXX,XX @@ | 223 | @@ -XXX,XX +XXX,XX @@ |
221 | /* | 224 | /* |
222 | * TDX module SEAMCALL leaf functions | 225 | * TDX module SEAMCALL leaf functions |
223 | */ | 226 | */ |
224 | +#define TDH_SYS_INFO 32 | 227 | +#define TDH_SYS_INFO 32 |
225 | #define TDH_SYS_INIT 33 | 228 | #define TDH_SYS_INIT 33 |
226 | #define TDH_SYS_LP_INIT 35 | 229 | #define TDH_SYS_LP_INIT 35 |
227 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
228 | 230 | ||
229 | +struct cmr_info { | 231 | +struct cmr_info { |
230 | + u64 base; | 232 | + u64 base; |
231 | + u64 size; | 233 | + u64 size; |
232 | +} __packed; | 234 | +} __packed; |
233 | + | 235 | + |
234 | +#define MAX_CMRS 32 | 236 | +#define MAX_CMRS 32 |
235 | +#define CMR_INFO_ARRAY_ALIGNMENT 512 | 237 | +#define CMR_INFO_ARRAY_ALIGNMENT 512 |
236 | + | 238 | + |
237 | +struct cpuid_config { | 239 | +struct cpuid_config { |
238 | + u32 leaf; | 240 | + u32 leaf; |
239 | + u32 sub_leaf; | 241 | + u32 sub_leaf; |
... | ... | ||
244 | +} __packed; | 246 | +} __packed; |
245 | + | 247 | + |
246 | +#define TDSYSINFO_STRUCT_SIZE 1024 | 248 | +#define TDSYSINFO_STRUCT_SIZE 1024 |
247 | +#define TDSYSINFO_STRUCT_ALIGNMENT 1024 | 249 | +#define TDSYSINFO_STRUCT_ALIGNMENT 1024 |
248 | + | 250 | + |
251 | +/* | ||
252 | + * The size of this structure itself is flexible. The actual structure | ||
253 | + * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE bytes | ||
254 | + * and TDSYSINFO_STRUCT_ALIGNMENT bytes aligned. | ||
255 | + */ | ||
249 | +struct tdsysinfo_struct { | 256 | +struct tdsysinfo_struct { |
250 | + /* TDX-SEAM Module Info */ | 257 | + /* TDX-SEAM Module Info */ |
251 | + u32 attributes; | 258 | + u32 attributes; |
252 | + u32 vendor_id; | 259 | + u32 vendor_id; |
253 | + u32 build_date; | 260 | + u32 build_date; |
... | ... | ||
273 | + u64 xfam_fixed1; | 280 | + u64 xfam_fixed1; |
274 | + u8 reserved4[32]; | 281 | + u8 reserved4[32]; |
275 | + u32 num_cpuid_config; | 282 | + u32 num_cpuid_config; |
276 | + /* | 283 | + /* |
277 | + * The actual number of CPUID_CONFIG depends on above | 284 | + * The actual number of CPUID_CONFIG depends on above |
278 | + * 'num_cpuid_config'. The size of 'struct tdsysinfo_struct' | 285 | + * 'num_cpuid_config'. |
279 | + * is 1024B defined by TDX architecture. Use a union with | ||
280 | + * specific padding to make 'sizeof(struct tdsysinfo_struct)' | ||
281 | + * equal to 1024. | ||
282 | + */ | 286 | + */ |
283 | + union { | 287 | + DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); |
284 | + struct cpuid_config cpuid_configs[0]; | 288 | +} __packed; |
285 | + u8 reserved5[892]; | ||
286 | + }; | ||
287 | +} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT); | ||
288 | + | 289 | + |
289 | /* | 290 | /* |
290 | * Do not put any hardware-defined TDX structure representations below | 291 | * Do not put any hardware-defined TDX structure representations below |
291 | * this comment! | 292 | * this comment! |
292 | -- | 293 | -- |
293 | 2.38.1 | 294 | 2.41.0 | diff view generated by jsdifflib |
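Taken together, the CMR rules enforced in the hunk above (entries must be 4K-aligned, sorted in ascending order, non-overlapping, with empty entries only at the tail) amount to a small interval check. The sketch below re-implements that check as standalone userspace C for illustration; the struct layout mirrors the patch, but the function is a simplified stand-in for trim_empty_cmrs(), not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the hardware-defined CMR entry from tdx.h (base/size in bytes). */
struct cmr_info {
	uint64_t base;
	uint64_t size;
};

#define CMR_PAGE_SIZE	4096ULL

/* A CMR is well-formed when both its base and size are 4K-aligned. */
static bool is_cmr_ok(const struct cmr_info *cmr)
{
	return !(cmr->base % CMR_PAGE_SIZE) && !(cmr->size % CMR_PAGE_SIZE);
}

/*
 * Validate a TDH.SYS.INFO-style CMR array: the first entry must be
 * non-empty, every non-empty entry must be 4K-aligned, and entries must
 * be sorted ascending without overlap.  Tail empty entries are trimmed.
 * Returns the number of valid CMRs, or -1 if the array is malformed.
 */
static int check_and_trim_cmrs(const struct cmr_info *cmrs, int nr)
{
	int i;

	if (nr < 1 || !cmrs[0].size || !is_cmr_ok(&cmrs[0]))
		return -1;

	for (i = 1; i < nr; i++) {
		if (!cmrs[i].size)
			break;	/* tail empty entries start here */
		if (!is_cmr_ok(&cmrs[i]))
			return -1;
		/* Previous CMR must end at or before this one starts. */
		if (cmrs[i - 1].base + cmrs[i - 1].size > cmrs[i].base)
			return -1;
	}
	return i;
}
```

With a 32-entry array where only the first two entries are populated, the function returns 2; a misaligned or overlapping entry makes it fail, matching the WARN_ON_ONCE() paths in the patch.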
1 | TDX reports a list of "Convertible Memory Region" (CMR) to indicate all | 1 | As a step of initializing the TDX module, the kernel needs to tell the |
---|---|---|---|
2 | memory regions that can possibly be used by the TDX module, but they are | 2 | TDX module which memory regions can be used by the TDX module as TDX |
3 | not automatically usable to the TDX module. As a step of initializing | 3 | guest memory. |
4 | the TDX module, the kernel needs to choose a list of memory regions (out | 4 | |
5 | from convertible memory regions) that the TDX module can use and pass | 5 | TDX reports a list of "Convertible Memory Region" (CMR) to tell the |
6 | those regions to the TDX module. Once this is done, those "TDX-usable" | 6 | kernel which memory is TDX compatible. The kernel needs to build a list |
7 | memory regions are fixed during module's lifetime. No more TDX-usable | 7 | of memory regions (out of CMRs) as "TDX-usable" memory and pass them to |
8 | memory can be added to the TDX module after that. | 8 | the TDX module. Once this is done, those "TDX-usable" memory regions |
9 | 9 | are fixed during module's lifetime. | |
10 | The initial support of TDX guests will only allocate TDX guest memory | 10 | |
11 | from the global page allocator. To keep things simple, this initial | 11 | To keep things simple, assume that all TDX-protected memory will come |
12 | implementation simply guarantees all pages in the page allocator are TDX | 12 | from the page allocator. Make sure all pages in the page allocator |
13 | memory. To achieve this, use all system memory in the core-mm at the | 13 | *are* TDX-usable memory. |
14 | time of initializing the TDX module as TDX memory, and in the meantime, | 14 | |
15 | refuse to add any non-TDX-memory in the memory hotplug. | 15 | As TDX-usable memory is a fixed configuration, take a snapshot of the |
16 | 16 | memory configuration from memblocks at the time of module initialization | |
17 | Specifically, walk through all memory regions managed by memblock and | 17 | (memblocks are modified on memory hotplug). This snapshot is used to |
18 | add them to a global list of "TDX-usable" memory regions, which is a | 18 | enable TDX support for *this* memory configuration only. Use a memory |
19 | fixed list after the module initialization (or empty if initialization | 19 | hotplug notifier to ensure that no other RAM can be added outside of |
20 | fails). To reject non-TDX-memory in memory hotplug, add an additional | 20 | this configuration. |
21 | check in arch_add_memory() to check whether the new region is covered by | 21 | |
22 | any region in the "TDX-usable" memory region list. | 22 | This approach requires all memblock memory regions at the time of module |
23 | 23 | initialization to be TDX convertible memory to work, otherwise module | |
24 | Note this requires all memory regions in memblock are TDX convertible | 24 | initialization will fail in a later SEAMCALL when passing those regions |
25 | memory when initializing the TDX module. This is true in practice if no | 25 | to the module. This approach works when all boot-time "system RAM" is |
26 | new memory has been hot-added before initializing the TDX module, since | 26 | TDX convertible memory, and no non-TDX-convertible memory is hot-added |
27 | in practice all boot-time present DIMM is TDX convertible memory. If | 27 | to the core-mm before module initialization. |
28 | any new memory has been hot-added, then initializing the TDX module will | 28 | |
29 | fail because that memory region is not covered by any CMR. | 29 | For instance, on the first generation of TDX machines, both CXL memory |
30 | 30 | and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add | |
31 | This can be enhanced in the future, i.e. by allowing adding non-TDX | 31 | any CXL memory or NVDIMM to the core-mm before module initialization |
32 | memory to a separate NUMA node. In this case, the "TDX-capable" nodes | 32 | will result in failure to initialize the module. The SEAMCALL error |
33 | and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace | 33 | code will be available in the dmesg to help the user understand the |
34 | needs to guarantee memory pages for TDX guests are always allocated from | 34 | failure. |
35 | the "TDX-capable" nodes. | ||
36 | |||
37 | Note TDX assumes convertible memory is always physically present during | ||
38 | machine's runtime. A non-buggy BIOS should never support hot-removal of | ||
39 | any convertible memory. This implementation doesn't handle ACPI memory | ||
40 | removal but depends on the BIOS to behave correctly. | ||
41 | 35 | ||
42 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 36 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
37 | Reviewed-by: "Huang, Ying" <ying.huang@intel.com> | ||
38 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | ||
39 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> | ||
40 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
43 | --- | 41 | --- |
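The hotplug policy described in the changelog above — new memory is accepted only if it falls entirely inside one of the stashed "TDX-usable" regions — is a plain interval-coverage test. A minimal userspace sketch of that test follows; an array stands in for the kernel's @tdx_memlist linked list, and all names here are illustrative, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the patch's struct tdx_memblock (PFN range). */
struct tdx_region {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

/*
 * Return true if [start_pfn, end_pfn) is fully covered by a single
 * stashed region.  A range spanning a gap between regions is rejected,
 * which is what lets the memory hotplug path refuse non-TDX memory.
 */
static bool range_is_tdx_memory(const struct tdx_region *regions, int nr,
				unsigned long start_pfn, unsigned long end_pfn)
{
	for (int i = 0; i < nr; i++) {
		if (start_pfn >= regions[i].start_pfn &&
		    end_pfn <= regions[i].end_pfn)
			return true;
	}
	return false;
}
```

Because the stashed regions come straight from memblock, a hot-added range can never legitimately straddle two of them, which is why covering by a single region is sufficient.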
44 | 42 | ||
45 | v6 -> v7: | 43 | v13 -> v14: |
46 | - Changed to use all system memory in memblock at the time of | 44 | - No change |
47 | initializing the TDX module as TDX memory | 45 | |
48 | - Added memory hotplug support | 46 | v12 -> v13: |
47 | - Avoided using " ? : " in tdx_memory_notifier(). (Peter) | ||
48 | |||
49 | v11 -> v12: | ||
50 | - Added tags from Dave/Kirill. | ||
51 | |||
52 | v10 -> v11: | ||
53 | - Added Isaku's Reviewed-by. | ||
54 | |||
55 | v9 -> v10: | ||
56 | - Moved empty @tdx_memlist check out of is_tdx_memory() to make the | ||
57 | logic better. | ||
58 | - Added Ying's Reviewed-by. | ||
59 | |||
60 | v8 -> v9: | ||
61 | - Replace "The initial support ..." with timeless sentence in both | ||
62 | changelog and comments(Dave). | ||
63 | - Fix run-on sentence in changelog, and senstence to explain why to | ||
64 | stash off memblock (Dave). | ||
65 | - Tried to improve why to choose this approach and how it work in | ||
66 | changelog based on Dave's suggestion. | ||
67 | - Many other comments enhancement (Dave). | ||
68 | |||
69 | v7 -> v8: | ||
70 | - Trimed down changelog (Dave). | ||
71 | - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series | ||
72 | (Ying). | ||
73 | - Moved memory hotplug handling from add_arch_memory() to | ||
74 | memory_notifier (Dan/David). | ||
75 | - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave). | ||
76 | - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave). | ||
77 | - Removed pfn_covered_by_cmr() check as no code to trim CMRs now. | ||
78 | - Improve the comment around first 1MB (Dave). | ||
79 | - Added a comment around reserve_real_mode() to point out TDX code | ||
80 | relies on first 1MB being reserved (Ying). | ||
81 | - Added comment to explain why the new online memory range cannot | ||
82 | cross multiple TDX memory blocks (Dave). | ||
83 | - Improved other comments (Dave). | ||
49 | 84 | ||
50 | --- | 85 | --- |
51 | arch/x86/Kconfig | 1 + | 86 | arch/x86/Kconfig | 1 + |
52 | arch/x86/include/asm/tdx.h | 3 + | 87 | arch/x86/kernel/setup.c | 2 + |
53 | arch/x86/mm/init_64.c | 10 ++ | 88 | arch/x86/virt/vmx/tdx/tdx.c | 162 +++++++++++++++++++++++++++++++++++- |
54 | arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++ | 89 | arch/x86/virt/vmx/tdx/tdx.h | 6 ++ |
55 | 4 files changed, 197 insertions(+) | 90 | 4 files changed, 170 insertions(+), 1 deletion(-) |
56 | 91 | ||
57 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig | 92 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig |
58 | index XXXXXXX..XXXXXXX 100644 | 93 | index XXXXXXX..XXXXXXX 100644 |
59 | --- a/arch/x86/Kconfig | 94 | --- a/arch/x86/Kconfig |
60 | +++ b/arch/x86/Kconfig | 95 | +++ b/arch/x86/Kconfig |
... | ... | ||
64 | depends on X86_X2APIC | 99 | depends on X86_X2APIC |
65 | + select ARCH_KEEP_MEMBLOCK | 100 | + select ARCH_KEEP_MEMBLOCK |
66 | help | 101 | help |
67 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious | 102 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious |
68 | host and certain physical attacks. This option enables necessary TDX | 103 | host and certain physical attacks. This option enables necessary TDX |
69 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 104 | diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c |
70 | index XXXXXXX..XXXXXXX 100644 | 105 | index XXXXXXX..XXXXXXX 100644 |
71 | --- a/arch/x86/include/asm/tdx.h | 106 | --- a/arch/x86/kernel/setup.c |
72 | +++ b/arch/x86/include/asm/tdx.h | 107 | +++ b/arch/x86/kernel/setup.c |
73 | @@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, | 108 | @@ -XXX,XX +XXX,XX @@ void __init setup_arch(char **cmdline_p) |
74 | #ifdef CONFIG_INTEL_TDX_HOST | 109 | * |
75 | bool platform_tdx_enabled(void); | 110 | * Moreover, on machines with SandyBridge graphics or in setups that use |
76 | int tdx_enable(void); | 111 | * crashkernel the entire 1M is reserved anyway. |
77 | +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn); | 112 | + * |
78 | #else /* !CONFIG_INTEL_TDX_HOST */ | 113 | + * Note the host kernel TDX also requires the first 1MB being reserved. |
79 | static inline bool platform_tdx_enabled(void) { return false; } | 114 | */ |
80 | static inline int tdx_enable(void) { return -ENODEV; } | 115 | x86_platform.realmode_reserve(); |
81 | +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn, | 116 | |
82 | + unsigned long end_pfn) { return true; } | ||
83 | #endif /* CONFIG_INTEL_TDX_HOST */ | ||
84 | |||
85 | #endif /* !__ASSEMBLY__ */ | ||
86 | diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c | ||
87 | index XXXXXXX..XXXXXXX 100644 | ||
88 | --- a/arch/x86/mm/init_64.c | ||
89 | +++ b/arch/x86/mm/init_64.c | ||
90 | @@ -XXX,XX +XXX,XX @@ | ||
91 | #include <asm/uv/uv.h> | ||
92 | #include <asm/setup.h> | ||
93 | #include <asm/ftrace.h> | ||
94 | +#include <asm/tdx.h> | ||
95 | |||
96 | #include "mm_internal.h" | ||
97 | |||
98 | @@ -XXX,XX +XXX,XX @@ int arch_add_memory(int nid, u64 start, u64 size, | ||
99 | unsigned long start_pfn = start >> PAGE_SHIFT; | ||
100 | unsigned long nr_pages = size >> PAGE_SHIFT; | ||
101 | |||
102 | + /* | ||
103 | + * For now if TDX is enabled, all pages in the page allocator | ||
104 | + * must be TDX memory, which is a fixed set of memory regions | ||
105 | + * that are passed to the TDX module. Reject the new region | ||
106 | + * if it is not TDX memory to guarantee above is true. | ||
107 | + */ | ||
108 | + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages)) | ||
109 | + return -EINVAL; | ||
110 | + | ||
111 | init_memory_mapping(start, start + size, params->pgprot); | ||
112 | |||
113 | return add_pages(nid, start_pfn, nr_pages, params); | ||
114 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 117 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
115 | index XXXXXXX..XXXXXXX 100644 | 118 | index XXXXXXX..XXXXXXX 100644 |
116 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 119 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
117 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 120 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
118 | @@ -XXX,XX +XXX,XX @@ | 121 | @@ -XXX,XX +XXX,XX @@ |
119 | #include <linux/smp.h> | 122 | #include <linux/mutex.h> |
120 | #include <linux/atomic.h> | 123 | #include <linux/slab.h> |
121 | #include <linux/align.h> | 124 | #include <linux/math.h> |
122 | +#include <linux/list.h> | 125 | +#include <linux/list.h> |
123 | +#include <linux/slab.h> | ||
124 | +#include <linux/memblock.h> | 126 | +#include <linux/memblock.h> |
127 | +#include <linux/memory.h> | ||
125 | +#include <linux/minmax.h> | 128 | +#include <linux/minmax.h> |
126 | +#include <linux/sizes.h> | 129 | +#include <linux/sizes.h> |
130 | +#include <linux/pfn.h> | ||
127 | #include <asm/msr-index.h> | 131 | #include <asm/msr-index.h> |
128 | #include <asm/msr.h> | 132 | #include <asm/msr.h> |
129 | #include <asm/apic.h> | 133 | #include <asm/page.h> |
130 | @@ -XXX,XX +XXX,XX @@ enum tdx_module_status_t { | 134 | @@ -XXX,XX +XXX,XX @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized); |
131 | TDX_MODULE_SHUTDOWN, | 135 | static enum tdx_module_status_t tdx_module_status; |
132 | }; | 136 | static DEFINE_MUTEX(tdx_module_lock); |
133 | 137 | ||
134 | +struct tdx_memblock { | 138 | +/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ |
135 | + struct list_head list; | ||
136 | + unsigned long start_pfn; | ||
137 | + unsigned long end_pfn; | ||
138 | + int nid; | ||
139 | +}; | ||
140 | + | ||
141 | static u32 tdx_keyid_start __ro_after_init; | ||
142 | static u32 tdx_keyid_num __ro_after_init; | ||
143 | |||
144 | @@ -XXX,XX +XXX,XX @@ static struct tdsysinfo_struct tdx_sysinfo; | ||
145 | static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT); | ||
146 | static int tdx_cmr_num; | ||
147 | |||
148 | +/* All TDX-usable memory regions */ | ||
149 | +static LIST_HEAD(tdx_memlist); | 139 | +static LIST_HEAD(tdx_memlist); |
150 | + | 140 | + |
151 | /* | 141 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); |
152 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | 142 | |
153 | * BIOS. Both initializing the TDX module and running TDX guest require | 143 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) |
154 | @@ -XXX,XX +XXX,XX @@ static int tdx_get_sysinfo(void) | 144 | @@ -XXX,XX +XXX,XX @@ static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo, |
155 | return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num); | 145 | return 0; |
156 | } | 146 | } |
157 | 147 | ||
158 | +/* Check whether the given pfn range is covered by any CMR or not. */ | ||
159 | +static bool pfn_range_covered_by_cmr(unsigned long start_pfn, | ||
160 | + unsigned long end_pfn) | ||
161 | +{ | ||
162 | + int i; | ||
163 | + | ||
164 | + for (i = 0; i < tdx_cmr_num; i++) { | ||
165 | + struct cmr_info *cmr = &tdx_cmr_array[i]; | ||
166 | + unsigned long cmr_start_pfn; | ||
167 | + unsigned long cmr_end_pfn; | ||
168 | + | ||
169 | + cmr_start_pfn = cmr->base >> PAGE_SHIFT; | ||
170 | + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT; | ||
171 | + | ||
172 | + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn) | ||
173 | + return true; | ||
174 | + } | ||
175 | + | ||
176 | + return false; | ||
177 | +} | ||
178 | + | ||
179 | +/* | 148 | +/* |
180 | + * Add a memory region on a given node as a TDX memory block. The caller | 149 | + * Add a memory region as a TDX memory block. The caller must make sure |
181 | + * to make sure all memory regions are added in address ascending order | 150 | + * all memory regions are added in address ascending order and don't |
182 | + * and don't overlap. | 151 | + * overlap. |
183 | + */ | 152 | + */ |
184 | +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn, | 153 | +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, |
185 | + int nid) | 154 | + unsigned long end_pfn) |
186 | +{ | 155 | +{ |
187 | + struct tdx_memblock *tmb; | 156 | + struct tdx_memblock *tmb; |
188 | + | 157 | + |
189 | + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL); | 158 | + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL); |
190 | + if (!tmb) | 159 | + if (!tmb) |
191 | + return -ENOMEM; | 160 | + return -ENOMEM; |
192 | + | 161 | + |
193 | + INIT_LIST_HEAD(&tmb->list); | 162 | + INIT_LIST_HEAD(&tmb->list); |
194 | + tmb->start_pfn = start_pfn; | 163 | + tmb->start_pfn = start_pfn; |
195 | + tmb->end_pfn = end_pfn; | 164 | + tmb->end_pfn = end_pfn; |
196 | + tmb->nid = nid; | 165 | + |
197 | + | 166 | + /* @tmb_list is protected by mem_hotplug_lock */ |
198 | + list_add_tail(&tmb->list, &tdx_memlist); | 167 | + list_add_tail(&tmb->list, tmb_list); |
199 | + return 0; | 168 | + return 0; |
200 | +} | 169 | +} |
201 | + | 170 | + |
202 | +static void free_tdx_memory(void) | 171 | +static void free_tdx_memlist(struct list_head *tmb_list) |
203 | +{ | 172 | +{ |
204 | + while (!list_empty(&tdx_memlist)) { | 173 | + /* @tmb_list is protected by mem_hotplug_lock */ |
205 | + struct tdx_memblock *tmb = list_first_entry(&tdx_memlist, | 174 | + while (!list_empty(tmb_list)) { |
175 | + struct tdx_memblock *tmb = list_first_entry(tmb_list, | ||
206 | + struct tdx_memblock, list); | 176 | + struct tdx_memblock, list); |
207 | + | 177 | + |
208 | + list_del(&tmb->list); | 178 | + list_del(&tmb->list); |
209 | + kfree(tmb); | 179 | + kfree(tmb); |
210 | + } | 180 | + } |
211 | +} | 181 | +} |
212 | + | 182 | + |
213 | +/* | 183 | +/* |
214 | + * Add all memblock memory regions to the @tdx_memlist as TDX memory. | 184 | + * Ensure that all memblock memory regions are convertible to TDX |
215 | + * Must be called when get_online_mems() is called by the caller. | 185 | + * memory. Once this has been established, stash the memblock |
186 | + * ranges off in a secondary structure because memblock is modified | ||
187 | + * in memory hotplug while TDX memory regions are fixed. | ||
216 | + */ | 188 | + */ |
217 | +static int build_tdx_memory(void) | 189 | +static int build_tdx_memlist(struct list_head *tmb_list) |
218 | +{ | 190 | +{ |
219 | + unsigned long start_pfn, end_pfn; | 191 | + unsigned long start_pfn, end_pfn; |
220 | + int i, nid, ret; | 192 | + int i, ret; |
221 | + | 193 | + |
222 | + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { | 194 | + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { |
223 | + /* | 195 | + /* |
224 | + * The first 1MB may not be reported as TDX convertible | 196 | + * The first 1MB is not reported as TDX convertible memory. |
225 | + * memory. Manually exclude them as TDX memory. | 197 | + * Although the first 1MB is always reserved and won't end up |
226 | + * | 198 | + * to the page allocator, it is still in memblock's memory |
227 | + * This is fine as the first 1MB is already reserved in | 199 | + * regions. Skip them manually to exclude them as TDX memory. |
228 | + * reserve_real_mode() and won't end up to ZONE_DMA as | ||
229 | + * free page anyway. | ||
230 | + */ | 200 | + */ |
231 | + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT); | 201 | + start_pfn = max(start_pfn, PHYS_PFN(SZ_1M)); |
232 | + if (start_pfn >= end_pfn) | 202 | + if (start_pfn >= end_pfn) |
233 | + continue; | 203 | + continue; |
234 | + | ||
235 | + /* Verify memory is truly TDX convertible memory */ | ||
236 | + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) { | ||
237 | + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n", | |
238 | + start_pfn << PAGE_SHIFT, | ||
239 | + end_pfn << PAGE_SHIFT); | ||
240 | + return -EINVAL; | ||
241 | + } | ||
242 | + | 204 | + |
243 | + /* | 205 | + /* |
244 | + * Add the memory regions as TDX memory. The regions in | 206 | + * Add the memory regions as TDX memory. The regions in |
245 | + * memblock has already guaranteed they are in address | 207 | + * memblock has already guaranteed they are in address |
246 | + * ascending order and don't overlap. | 208 | + * ascending order and don't overlap. |
247 | + */ | 209 | + */ |
248 | + ret = add_tdx_memblock(start_pfn, end_pfn, nid); | 210 | + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); |
249 | + if (ret) | 211 | + if (ret) |
250 | + goto err; | 212 | + goto err; |
251 | + } | 213 | + } |
252 | + | 214 | + |
253 | + return 0; | 215 | + return 0; |
254 | +err: | 216 | +err: |
255 | + free_tdx_memory(); | 217 | + free_tdx_memlist(tmb_list); |
256 | + return ret; | 218 | + return ret; |
257 | +} | 219 | +} |
258 | + | 220 | + |
259 | /* | 221 | static int init_tdx_module(void) |
260 | * Detect and initialize the TDX module. | 222 | { |
261 | * | 223 | struct tdsysinfo_struct *tdsysinfo; |
262 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 224 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
263 | if (ret) | 225 | if (ret) |
264 | goto out; | 226 | goto out; |
265 | 227 | ||
266 | + /* | 228 | + /* |
267 | + * All memory regions that can be used by the TDX module must be | 229 | + * To keep things simple, assume that all TDX-protected memory |
268 | + * passed to the TDX module during the module initialization. | 230 | + * will come from the page allocator. Make sure all pages in the |
269 | + * Once this is done, all "TDX-usable" memory regions are fixed | 231 | + * page allocator are TDX-usable memory. |
270 | + * during module's runtime. | ||
271 | + * | 232 | + * |
272 | + * The initial support of TDX guests only allocates memory from | 233 | + * Build the list of "TDX-usable" memory regions which cover all |
273 | + * the global page allocator. To keep things simple, for now | 234 | + * pages in the page allocator to guarantee that. Do it while |
274 | + * just make sure all pages in the page allocator are TDX memory. | 235 | + * holding mem_hotplug_lock read-lock as the memory hotplug code |
275 | + * | 236 | + * path reads the @tdx_memlist to reject any new memory. |
276 | + * To achieve this, use all system memory in the core-mm at the | ||
277 | + * time of initializing the TDX module as TDX memory, and in the | ||
278 | + * meantime, reject any new memory in memory hot-add. | ||
279 | + * | ||
280 | + * This works because, in practice, all boot-time present DIMMs | ||
281 | + * are TDX convertible memory. However, if any new memory is | ||
282 | + * hot-added before initializing the TDX module, the initialization | ||
283 | + * will fail because that memory is not covered by a CMR. | ||
284 | + * | ||
285 | + * This can be enhanced in the future, i.e. by allowing adding or | ||
286 | + * onlining non-TDX memory to a separate node, in which case the | ||
287 | + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist | ||
288 | + * together -- the userspace/kernel just needs to make sure pages | ||
289 | + * for TDX guests must come from those "TDX-capable" nodes. | ||
290 | + * | ||
291 | + * Build the list of TDX memory regions as mentioned above so | ||
292 | + * they can be passed to the TDX module later. | ||
293 | + */ | 237 | + */ |
294 | + get_online_mems(); | 238 | + get_online_mems(); |
295 | + | 239 | + |
296 | + ret = build_tdx_memory(); | 240 | + ret = build_tdx_memlist(&tdx_memlist); |
297 | + if (ret) | 241 | + if (ret) |
298 | + goto out; | 242 | + goto out_put_tdxmem; |
243 | + | ||
299 | /* | 244 | /* |
300 | * Return -EINVAL until all steps of TDX module initialization | 245 | * TODO: |
301 | * process are done. | 246 | * |
247 | - * - Build the list of TDX-usable memory regions. | ||
248 | * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
249 | * all TDX-usable memory regions. | ||
250 | * - Configure the TDMRs and the global KeyID to the TDX module. | ||
251 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
252 | * Return error before all steps are done. | ||
302 | */ | 253 | */ |
303 | ret = -EINVAL; | 254 | ret = -EINVAL; |
255 | +out_put_tdxmem: | ||
256 | + /* | ||
257 | + * @tdx_memlist is written here and read at memory hotplug time. | ||
258 | + * Lock out memory hotplug code while building it. | ||
259 | + */ | ||
260 | + put_online_mems(); | ||
304 | out: | 261 | out: |
305 | + /* | 262 | /* |
306 | + * Memory hotplug checks the hot-added memory region against the | 263 | * For now both @sysinfo and @cmr_array are only used during |
307 | + * @tdx_memlist to see if the region is TDX memory. | 264 | @@ -XXX,XX +XXX,XX @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start, |
308 | + * | 265 | return 0; |
309 | + * Do put_online_mems() here to make sure any modification to | ||
310 | + * @tdx_memlist is done while holding the memory hotplug read | ||
311 | + * lock, so that the memory hotplug path can just check the | ||
312 | + * @tdx_memlist w/o holding the @tdx_module_lock which may cause | ||
313 | + * deadlock. | ||
314 | + */ | ||
315 | + put_online_mems(); | ||
316 | return ret; | ||
317 | } | 266 | } |
318 | 267 | ||
319 | @@ -XXX,XX +XXX,XX @@ int tdx_enable(void) | 268 | +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn) |
320 | return ret; | ||
321 | } | ||
322 | EXPORT_SYMBOL_GPL(tdx_enable); | ||
323 | + | ||
324 | +/* | ||
325 | + * Check whether the given range is TDX memory. Must be called between | ||
326 | + * mem_hotplug_begin()/mem_hotplug_done(). | ||
327 | + */ | ||
328 | +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn) | ||
329 | +{ | 269 | +{ |
330 | + struct tdx_memblock *tmb; | 270 | + struct tdx_memblock *tmb; |
331 | + | 271 | + |
332 | + /* Empty list means TDX isn't enabled successfully */ | 272 | + /* |
333 | + if (list_empty(&tdx_memlist)) | 273 | + * This check assumes that the start_pfn<->end_pfn range does not |
334 | + return true; | 274 | + * cross multiple @tdx_memlist entries. A single memory online |
335 | + | 275 | + * event across multiple memblocks (from which @tdx_memlist |
276 | + * entries are derived at the time of module initialization) is | ||
277 | + * not possible. This is because memory offline/online is done | ||
278 | + * on granularity of 'struct memory_block', and the hotpluggable | ||
279 | + * memory region (one memblock) must be a multiple of the memory_block size. | ||
280 | + */ | ||
336 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 281 | + list_for_each_entry(tmb, &tdx_memlist, list) { |
337 | + /* | ||
338 | + * The new range is TDX memory if it is fully covered | ||
339 | + * by any TDX memory block. | ||
340 | + */ | ||
341 | + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn) | 282 | + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn) |
342 | + return true; | 283 | + return true; |
343 | + } | 284 | + } |
344 | + return false; | 285 | + return false; |
345 | +} | 286 | +} |
287 | + | ||
288 | +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action, | ||
289 | + void *v) | ||
290 | +{ | ||
291 | + struct memory_notify *mn = v; | ||
292 | + | ||
293 | + if (action != MEM_GOING_ONLINE) | ||
294 | + return NOTIFY_OK; | ||
295 | + | ||
296 | + /* | ||
297 | + * Empty list means TDX isn't enabled. Allow any memory | ||
298 | + * to go online. | ||
299 | + */ | ||
300 | + if (list_empty(&tdx_memlist)) | ||
301 | + return NOTIFY_OK; | ||
302 | + | ||
303 | + /* | ||
304 | + * The TDX memory configuration is static and can not be | ||
305 | + * changed. Reject onlining any memory which is outside of | ||
306 | + * the static configuration whether it supports TDX or not. | ||
307 | + */ | ||
308 | + if (is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages)) | ||
309 | + return NOTIFY_OK; | ||
310 | + | ||
311 | + return NOTIFY_BAD; | ||
312 | +} | ||
313 | + | ||
314 | +static struct notifier_block tdx_memory_nb = { | ||
315 | + .notifier_call = tdx_memory_notifier, | ||
316 | +}; | ||
317 | + | ||
318 | static int __init tdx_init(void) | ||
319 | { | ||
320 | u32 tdx_keyid_start, nr_tdx_keyids; | ||
321 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | ||
322 | return -ENODEV; | ||
323 | } | ||
324 | |||
325 | + err = register_memory_notifier(&tdx_memory_nb); | ||
326 | + if (err) { | ||
327 | + pr_err("initialization failed: register_memory_notifier() failed (%d)\n", | ||
328 | + err); | ||
329 | + return -ENODEV; | ||
330 | + } | ||
331 | + | ||
332 | /* | ||
333 | * Just use the first TDX KeyID as the 'global KeyID' and | ||
334 | * leave the rest for TDX guests. | ||
335 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
336 | index XXXXXXX..XXXXXXX 100644 | ||
337 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
338 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
339 | @@ -XXX,XX +XXX,XX @@ enum tdx_module_status_t { | ||
340 | TDX_MODULE_ERROR | ||
341 | }; | ||
342 | |||
343 | +struct tdx_memblock { | ||
344 | + struct list_head list; | ||
345 | + unsigned long start_pfn; | ||
346 | + unsigned long end_pfn; | ||
347 | +}; | ||
348 | + | ||
349 | #endif | ||
346 | -- | 350 | -- |
347 | 2.38.1 | 351 | 2.41.0 | diff view generated by jsdifflib |
1 | After the kernel selects all TDX-usable memory regions, the kernel needs | ||
---|---|---|---|
2 | to pass those regions to the TDX module via data structure "TD Memory | ||
3 | Region" (TDMR). | ||
4 | |||
5 | Add a placeholder to construct a list of TDMRs (in multiple steps) to | ||
6 | cover all TDX-usable memory regions. | ||
7 | |||
8 | === Long Version === | ||
9 | |||
1 | TDX provides increased levels of memory confidentiality and integrity. | 10 | TDX provides increased levels of memory confidentiality and integrity. |
2 | This requires special hardware support for features like memory | 11 | This requires special hardware support for features like memory |
3 | encryption and storage of memory integrity checksums. Not all memory | 12 | encryption and storage of memory integrity checksums. Not all memory |
4 | satisfies these requirements. | 13 | satisfies these requirements. |
5 | 14 | ||
6 | As a result, the TDX introduced the concept of a "Convertible Memory | 15 | As a result, TDX introduced the concept of a "Convertible Memory Region" |
7 | Region" (CMR). During boot, the firmware builds a list of all of the | 16 | (CMR). During boot, the firmware builds a list of all of the memory |
8 | memory ranges which can provide the TDX security guarantees. The list | 17 | ranges which can provide the TDX security guarantees. The list of these |
9 | of these ranges is available to the kernel by querying the TDX module. | 18 | ranges is available to the kernel by querying the TDX module. |
10 | 19 | ||
11 | The TDX architecture needs additional metadata to record things like | 20 | The TDX architecture needs additional metadata to record things like |
12 | which TD guest "owns" a given page of memory. This metadata essentially | 21 | which TD guest "owns" a given page of memory. This metadata essentially |
13 | serves as the 'struct page' for the TDX module. The space for this | 22 | serves as the 'struct page' for the TDX module. The space for this |
14 | metadata is not reserved by the hardware up front and must be allocated | 23 | metadata is not reserved by the hardware up front and must be allocated |
... | ... | ||
36 | CMR - Firmware-enumerated physical ranges that support TDX. CMRs are | 45 | CMR - Firmware-enumerated physical ranges that support TDX. CMRs are |
37 | 4K aligned. | 46 | 4K aligned. |
38 | TDMR - Physical address range which is chosen by the kernel to support | 47 | TDMR - Physical address range which is chosen by the kernel to support |
39 | TDX. 1G granularity and alignment required. Each TDMR has | 48 | TDX. 1G granularity and alignment required. Each TDMR has |
40 | reserved areas where TDX memory holes and overlapping PAMTs can | 49 | reserved areas where TDX memory holes and overlapping PAMTs can |
41 | be put into. | 50 | be represented. |
42 | PAMT - Physically contiguous TDX metadata. One table for each page size | 51 | PAMT - Physically contiguous TDX metadata. One table for each page size |
43 | per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G | 52 | per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G |
44 | PAMT. | 53 | PAMT. |
45 | 54 | ||
46 | As one step of initializing the TDX module, the kernel configures | 55 | As one step of initializing the TDX module, the kernel configures |
47 | TDX-usable memory regions by passing an array of TDMRs to the TDX module. | 56 | TDX-usable memory regions by passing a list of TDMRs to the TDX module. |
48 | 57 | ||
49 | Constructing the array of TDMRs consists of the following steps: | 58 | Constructing the list of TDMRs consists of the following steps: |
50 | 59 | ||
51 | 1) Create TDMRs to cover all memory regions that the TDX module can use; | 60 | 1) Fill out TDMRs to cover all memory regions that the TDX module will |
52 | 2) Allocate and set up PAMT for each TDMR; | 61 | use for TD memory. |
53 | 3) Set up reserved areas for each TDMR. | 62 | 2) Allocate and set up PAMT for each TDMR. |
54 | 63 | 3) Designate reserved areas for each TDMR. | |
55 | Add a placeholder to construct TDMRs to do the above steps after all | 64 | |
56 | TDX memory regions are verified to be truly convertible. Always free | 65 | Add a placeholder to construct TDMRs to do the above steps. To keep |
57 | TDMRs at the end of the initialization (whether successful or not) | 66 | things simple, just allocate enough space to hold the maximum number of |
58 | as TDMRs are only used during the initialization. | 67 | TDMRs up front. Always free the buffer of TDMRs since they are only |
59 | 68 | used during module initialization. | |
69 | |||
70 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
60 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 71 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
61 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 72 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> |
73 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
62 | --- | 74 | --- |
75 | |||
76 | v13 -> v14: | ||
77 | - No change. | ||
78 | |||
79 | v12 -> v13: | ||
80 | - No change. | ||
81 | |||
82 | v11 -> v12: | ||
83 | - Added tags from Dave/Kirill. | ||
84 | |||
85 | v10 -> v11: | ||
86 | - Changed to keep TDMRs after module initialization to deal with TDX | ||
87 | erratum in future patches. | ||
88 | |||
89 | v9 -> v10: | ||
90 | - Changed the TDMR list from static variable back to local variable as | ||
91 | now TDX module isn't disabled when tdx_cpu_enable() fails. | ||
92 | |||
93 | v8 -> v9: | ||
94 | - Changes around 'struct tdmr_info_list' (Dave): | ||
95 | - Moved the declaration from tdx.c to tdx.h. | ||
96 | - Renamed 'first_tdmr' to 'tdmrs'. | ||
97 | - 'nr_tdmrs' -> 'nr_consumed_tdmrs'. | ||
98 | - Changed 'tdmrs' to 'void *'. | ||
99 | - Improved comments for all structure members. | ||
100 | - Added a missing empty line in alloc_tdmr_list() (Dave). | ||
101 | |||
102 | v7 -> v8: | ||
103 | - Improved changelog to tell this is one step of "TODO list" in | ||
104 | init_tdx_module(). | ||
105 | - Other changelog improvement suggested by Dave (with "Create TDMRs" to | ||
106 | "Fill out TDMRs" to align with the code). | ||
107 | - Added a "TODO list" comment to lay out the steps to construct TDMRs, | ||
108 | following the same idea of "TODO list" in tdx_module_init(). | ||
109 | - Introduced 'struct tdmr_info_list' (Dave) | ||
110 | - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to | ||
111 | simplify getting TDMR by given index, and reduce passing arguments | ||
112 | around functions. | ||
113 | - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally | ||
114 | uses tdmr_size_single() (Dave). | ||
115 | - tdmr_num -> nr_tdmrs (Dave). | ||
63 | 116 | ||
64 | v6 -> v7: | 117 | v6 -> v7: |
65 | - Improved commit message to explain 'int' overflow cannot happen | 118 | - Improved commit message to explain 'int' overflow cannot happen |
66 | in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave. | 119 | in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave. |
67 | 120 | ||
68 | v5 -> v6: | 121 | ... |
69 | - construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is | ||
70 | used instead of memblock. | ||
71 | - Added Isaku's Reviewed-by. | ||
72 | |||
73 | - v3 -> v5 (no feedback on v4): | ||
74 | - Moved calculating TDMR size to this patch. | ||
75 | - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs | ||
76 | once, instead of allocating each TDMR individually. | ||
77 | - Removed "crypto protection" in the changelog. | ||
78 | - -EFAULT -> -EINVAL in couple of places. | ||
79 | |||
80 | 122 | ||
81 | --- | 123 | --- |
82 | arch/x86/virt/vmx/tdx/tdx.c | 83 +++++++++++++++++++++++++++++++++++++ | 124 | arch/x86/virt/vmx/tdx/tdx.c | 97 ++++++++++++++++++++++++++++++++++++- |
83 | arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++++ | 125 | arch/x86/virt/vmx/tdx/tdx.h | 32 ++++++++++++ |
84 | 2 files changed, 106 insertions(+) | 126 | 2 files changed, 127 insertions(+), 2 deletions(-) |
85 | 127 | ||
86 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 128 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
87 | index XXXXXXX..XXXXXXX 100644 | 129 | index XXXXXXX..XXXXXXX 100644 |
88 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 130 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
89 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 131 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
90 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memory(void) | 132 | @@ -XXX,XX +XXX,XX @@ |
133 | #include <linux/minmax.h> | ||
134 | #include <linux/sizes.h> | ||
135 | #include <linux/pfn.h> | ||
136 | +#include <linux/align.h> | ||
137 | #include <asm/msr-index.h> | ||
138 | #include <asm/msr.h> | ||
139 | #include <asm/page.h> | ||
140 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memlist(struct list_head *tmb_list) | ||
91 | return ret; | 141 | return ret; |
92 | } | 142 | } |
93 | 143 | ||
94 | +/* Calculate the actual TDMR_INFO size */ | 144 | +/* Calculate the actual TDMR size */ |
95 | +static inline int cal_tdmr_size(void) | 145 | +static int tdmr_size_single(u16 max_reserved_per_tdmr) |
96 | +{ | 146 | +{ |
97 | + int tdmr_sz; | 147 | + int tdmr_sz; |
98 | + | 148 | + |
99 | + /* | 149 | + /* |
100 | + * The actual size of TDMR_INFO depends on the maximum number | 150 | + * The actual size of TDMR depends on the maximum |
101 | + * of reserved areas. | 151 | + * number of reserved areas. |
102 | + * | ||
103 | + * Note: for TDX1.0 the max_reserved_per_tdmr is 16, and | ||
104 | + * TDMR_INFO size is aligned up to 512-byte. Even if it is | ||
105 | + * extended in the future, it would be insane if TDMR_INFO | ||
106 | + * becomes larger than 4K. The tdmr_sz here should never | ||
107 | + * overflow. | ||
108 | + */ | 152 | + */ |
109 | + tdmr_sz = sizeof(struct tdmr_info); | 153 | + tdmr_sz = sizeof(struct tdmr_info); |
110 | + tdmr_sz += sizeof(struct tdmr_reserved_area) * | 154 | + tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr; |
111 | + tdx_sysinfo.max_reserved_per_tdmr; | 155 | + |
112 | + | ||
113 | + /* | ||
114 | + * TDX requires each TDMR_INFO to be 512-byte aligned. Always | ||
115 | + * round up TDMR_INFO size to the 512-byte boundary. | ||
116 | + */ | ||
117 | + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT); | 156 | + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT); |
118 | +} | 157 | +} |
119 | + | 158 | + |
120 | +static struct tdmr_info *alloc_tdmr_array(int *array_sz) | 159 | +static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list, |
160 | + struct tdsysinfo_struct *sysinfo) | ||
121 | +{ | 161 | +{ |
122 | + /* | 162 | + size_t tdmr_sz, tdmr_array_sz; |
123 | + * TDX requires each TDMR_INFO to be 512-byte aligned. | 163 | + void *tdmr_array; |
124 | + * Use alloc_pages_exact() to allocate all TDMRs at once. | 164 | + |
125 | + * Each TDMR_INFO will still be 512-byte aligned since | 165 | + tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr); |
126 | + * cal_tdmr_size() always returns 512-byte aligned size. | 166 | + tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs; |
127 | + */ | 167 | + |
128 | + *array_sz = cal_tdmr_size() * tdx_sysinfo.max_tdmrs; | 168 | + /* |
129 | + | 169 | + * To keep things simple, allocate all TDMRs together. |
130 | + /* | 170 | + * The buffer needs to be physically contiguous to make |
131 | + * Zero the buffer so 'struct tdmr_info::size' can be | 171 | + * sure each TDMR is physically contiguous. |
132 | + * used to determine whether a TDMR is valid. | 172 | + */ |
173 | + tdmr_array = alloc_pages_exact(tdmr_array_sz, | ||
174 | + GFP_KERNEL | __GFP_ZERO); | ||
175 | + if (!tdmr_array) | ||
176 | + return -ENOMEM; | ||
177 | + | ||
178 | + tdmr_list->tdmrs = tdmr_array; | ||
179 | + | ||
180 | + /* | ||
181 | + * Keep the size of TDMR to find the target TDMR | ||
182 | + * at a given index in the TDMR list. | ||
183 | + */ | ||
184 | + tdmr_list->tdmr_sz = tdmr_sz; | ||
185 | + tdmr_list->max_tdmrs = sysinfo->max_tdmrs; | ||
186 | + tdmr_list->nr_consumed_tdmrs = 0; | ||
187 | + | ||
188 | + return 0; | ||
189 | +} | ||
190 | + | ||
191 | +static void free_tdmr_list(struct tdmr_info_list *tdmr_list) | ||
192 | +{ | ||
193 | + free_pages_exact(tdmr_list->tdmrs, | ||
194 | + tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); | ||
195 | +} | ||
196 | + | ||
197 | +/* | ||
198 | + * Construct a list of TDMRs on the preallocated space in @tdmr_list | ||
199 | + * to cover all TDX memory regions in @tmb_list based on the TDX module | ||
200 | + * information in @sysinfo. | ||
201 | + */ | ||
202 | +static int construct_tdmrs(struct list_head *tmb_list, | ||
203 | + struct tdmr_info_list *tdmr_list, | ||
204 | + struct tdsysinfo_struct *sysinfo) | ||
205 | +{ | ||
206 | + /* | ||
207 | + * TODO: | ||
133 | + * | 208 | + * |
134 | + * Note: for TDX1.0 the max_tdmrs is 64 and TDMR_INFO size | 209 | + * - Fill out TDMRs to cover all TDX memory regions. |
135 | + * is 512-byte. Even if they are extended in the future, it | 210 | + * - Allocate and set up PAMTs for each TDMR. |
136 | + * would be insane if the total size exceeds 4MB. | 211 | + * - Designate reserved areas for each TDMR. |
137 | + */ | 212 | + * |
138 | + return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO); | 213 | + * Return -EINVAL until constructing TDMRs is done |
139 | +} | 214 | + */ |
140 | + | ||
141 | +/* | ||
142 | + * Construct an array of TDMRs to cover all TDX memory ranges. | ||
143 | + * The actual number of TDMRs is kept to @tdmr_num. | ||
144 | + */ | ||
145 | +static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | ||
146 | +{ | ||
147 | + /* Return -EINVAL until constructing TDMRs is done */ | ||
148 | + return -EINVAL; | 215 | + return -EINVAL; |
149 | +} | 216 | +} |
150 | + | 217 | + |
151 | /* | ||
152 | * Detect and initialize the TDX module. | ||
153 | * | ||
154 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memory(void) | ||
155 | */ | ||
156 | static int init_tdx_module(void) | 218 | static int init_tdx_module(void) |
157 | { | 219 | { |
158 | + struct tdmr_info *tdmr_array; | 220 | struct tdsysinfo_struct *tdsysinfo; |
159 | + int tdmr_array_sz; | 221 | + struct tdmr_info_list tdmr_list; |
160 | + int tdmr_num; | 222 | struct cmr_info *cmr_array; |
161 | int ret; | 223 | int tdsysinfo_size; |
162 | 224 | int cmr_array_size; | |
163 | /* | ||
164 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 225 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
165 | ret = build_tdx_memory(); | ||
166 | if (ret) | 226 | if (ret) |
167 | goto out; | 227 | goto out_put_tdxmem; |
168 | + | 228 | |
169 | + /* Prepare enough space to construct TDMRs */ | 229 | + /* Allocate enough space for constructing TDMRs */ |
170 | + tdmr_array = alloc_tdmr_array(&tdmr_array_sz); | 230 | + ret = alloc_tdmr_list(&tdmr_list, tdsysinfo); |
171 | + if (!tdmr_array) { | 231 | + if (ret) |
172 | + ret = -ENOMEM; | 232 | + goto out_free_tdxmem; |
173 | + goto out_free_tdx_mem; | 233 | + |
174 | + } | 234 | + /* Cover all TDX-usable memory regions in TDMRs */ |
175 | + | 235 | + ret = construct_tdmrs(&tdx_memlist, &tdmr_list, tdsysinfo); |
176 | + /* Construct TDMRs to cover all TDX memory ranges */ | ||
177 | + ret = construct_tdmrs(tdmr_array, &tdmr_num); | ||
178 | + if (ret) | 236 | + if (ret) |
179 | + goto out_free_tdmrs; | 237 | + goto out_free_tdmrs; |
180 | + | 238 | + |
181 | /* | 239 | /* |
182 | * Return -EINVAL until all steps of TDX module initialization | 240 | * TODO: |
183 | * process are done. | 241 | * |
242 | - * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
243 | - * all TDX-usable memory regions. | ||
244 | * - Configure the TDMRs and the global KeyID to the TDX module. | ||
245 | * - Configure the global KeyID on all packages. | ||
246 | * - Initialize all TDMRs. | ||
247 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
248 | * Return error before all steps are done. | ||
184 | */ | 249 | */ |
185 | ret = -EINVAL; | 250 | ret = -EINVAL; |
186 | +out_free_tdmrs: | 251 | +out_free_tdmrs: |
187 | + /* | 252 | + /* |
188 | + * The array of TDMRs is freed whether or not the initialization | 253 | + * Always free the buffer of TDMRs as they are only used during |
189 | + * succeeded. They are not needed anymore after the | ||
190 | + * module initialization. | 254 | + * module initialization. |
191 | + */ | 255 | + */ |
192 | + free_pages_exact(tdmr_array, tdmr_array_sz); | 256 | + free_tdmr_list(&tdmr_list); |
193 | +out_free_tdx_mem: | 257 | +out_free_tdxmem: |
194 | + if (ret) | 258 | + if (ret) |
195 | + free_tdx_memory(); | 259 | + free_tdx_memlist(&tdx_memlist); |
196 | out: | 260 | out_put_tdxmem: |
197 | /* | 261 | /* |
198 | * Memory hotplug checks the hot-added memory region against the | 262 | * @tdx_memlist is written here and read at memory hotplug time. |
199 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 263 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
200 | index XXXXXXX..XXXXXXX 100644 | 264 | index XXXXXXX..XXXXXXX 100644 |
201 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 265 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
202 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 266 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
203 | @@ -XXX,XX +XXX,XX @@ struct tdsysinfo_struct { | 267 | @@ -XXX,XX +XXX,XX @@ struct tdsysinfo_struct { |
204 | }; | 268 | DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); |
205 | } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT); | 269 | } __packed; |
206 | 270 | ||
207 | +struct tdmr_reserved_area { | 271 | +struct tdmr_reserved_area { |
208 | + u64 offset; | 272 | + u64 offset; |
209 | + u64 size; | 273 | + u64 size; |
210 | +} __packed; | 274 | +} __packed; |
... | ... | ||
222 | + u64 pamt_4k_size; | 286 | + u64 pamt_4k_size; |
223 | + /* | 287 | + /* |
224 | + * Actual number of reserved areas depends on | 288 | + * Actual number of reserved areas depends on |
225 | + * 'struct tdsysinfo_struct'::max_reserved_per_tdmr. | 289 | + * 'struct tdsysinfo_struct'::max_reserved_per_tdmr. |
226 | + */ | 290 | + */ |
227 | + struct tdmr_reserved_area reserved_areas[0]; | 291 | + DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas); |
228 | +} __packed __aligned(TDMR_INFO_ALIGNMENT); | 292 | +} __packed __aligned(TDMR_INFO_ALIGNMENT); |
229 | + | 293 | + |
230 | /* | 294 | /* |
231 | * Do not put any hardware-defined TDX structure representations below | 295 | * Do not put any hardware-defined TDX structure representations below |
232 | * this comment! | 296 | * this comment! |
297 | @@ -XXX,XX +XXX,XX @@ struct tdx_memblock { | ||
298 | unsigned long end_pfn; | ||
299 | }; | ||
300 | |||
301 | +struct tdmr_info_list { | ||
302 | + void *tdmrs; /* Flexible array to hold 'tdmr_info's */ | ||
303 | + int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */ | ||
304 | + | ||
305 | + /* Metadata for finding target 'tdmr_info' and freeing @tdmrs */ | ||
306 | + int tdmr_sz; /* Size of one 'tdmr_info' */ | ||
307 | + int max_tdmrs; /* How many 'tdmr_info's are allocated */ | ||
308 | +}; | ||
309 | + | ||
310 | #endif | ||
233 | -- | 311 | -- |
234 | 2.38.1 | 312 | 2.41.0 |
1 | The kernel configures TDX-usable memory regions by passing an array of | 1 | Start to work through the multiple steps needed to construct a list of "TD Memory |
---|---|---|---|
2 | "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the | 2 | Regions" (TDMRs) to cover all TDX-usable memory regions. |
3 | information of the base/size of a memory region, the base/size of the | 3 | |
4 | The kernel configures TDX-usable memory regions by passing a list of | ||
5 | TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains | ||
6 | the information of the base/size of a memory region, the base/size of the | ||
4 | associated Physical Address Metadata Table (PAMT) and a list of reserved | 7 | associated Physical Address Metadata Table (PAMT) and a list of reserved |
5 | areas in the region. | 8 | areas in the region. |
6 | 9 | ||
7 | Create a number of TDMRs to cover all TDX memory regions. To keep it | 10 | Do the first step to fill out a number of TDMRs to cover all TDX memory |
8 | simple, always try to create one TDMR for each memory region. As the | 11 | regions. To keep it simple, always try to use one TDMR for each memory |
9 | first step only set up the base/size for each TDMR. | 12 | region. As the first step only set up the base/size for each TDMR. |
10 | 13 | ||
11 | Each TDMR must be 1G aligned and the size must be in 1G granularity. | 14 | Each TDMR must be 1G aligned and the size must be in 1G granularity. |
12 | This implies that one TDMR could cover multiple memory regions. If a | 15 | This implies that one TDMR could cover multiple memory regions. If a |
13 | memory region spans a 1GB boundary and the first part is already | 16 | memory region spans a 1GB boundary and the first part is already |
14 | covered by the previous TDMR, just create a new TDMR for the remaining | 17 | covered by the previous TDMR, just use a new TDMR for the remaining |
15 | part. | 18 | part. |
16 | 19 | ||
17 | TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs | 20 | TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs |
18 | are consumed but there is more memory region to cover. | 21 | are consumed but there is more memory region to cover. |
19 | 22 | ||
23 | There are fancier things that could be done like trying to merge | ||
24 | adjacent TDMRs. This would allow more pathological memory layouts to be | ||
25 | supported. But, current systems are not even close to exhausting the | ||
26 | existing TDMR resources in practice. For now, keep it simple. | ||
27 | |||
20 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 28 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
29 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
30 | Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> | ||
31 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
21 | --- | 32 | --- |
22 | 33 | ||
23 | v6 -> v7: | 34 | v13 -> v14: |
35 | - No change | ||
36 | |||
37 | v12 -> v13: | ||
38 | - Added Yuan's tag. | ||
39 | |||
40 | v11 -> v12: | ||
41 | - Improved comments around looping over TDX memblock to create TDMRs. | ||
42 | (Dave). | ||
43 | - Added code to pr_warn() when consumed TDMRs reaching maximum TDMRs | ||
44 | (Dave). | ||
45 | - BIT_ULL(30) -> SZ_1G (Kirill) | ||
46 | - Removed unused TDMR_PFN_ALIGNMENT (Sathy) | ||
47 | - Added tags from Kirill/Sathy | ||
48 | |||
49 | v10 -> v11: | ||
50 | - No update | ||
51 | |||
52 | v9 -> v10: | ||
24 | - No change. | 53 | - No change. |
25 | 54 | ||
26 | v5 -> v6: | 55 | v8 -> v9: |
27 | - Rebase due to using 'tdx_memblock' instead of memblock. | 56 | |
28 | 57 | - Added the last paragraph in the changelog (Dave). | |
29 | - v3 -> v5 (no feedback on v4): | 58 | - Removed unnecessary type cast in tdmr_entry() (Dave). |
30 | - Removed allocating TDMR individually. | ||
31 | - Improved changelog by using Dave's words. | ||
32 | - Made TDMR_START() and TDMR_END() as static inline function. | ||
33 | 59 | ||
34 | --- | 60 | --- |
35 | arch/x86/virt/vmx/tdx/tdx.c | 104 +++++++++++++++++++++++++++++++++++- | 61 | arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++++- |
36 | 1 file changed, 103 insertions(+), 1 deletion(-) | 62 | arch/x86/virt/vmx/tdx/tdx.h | 3 ++ |
63 | 2 files changed, 105 insertions(+), 1 deletion(-) | ||
37 | 64 | ||
38 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 65 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
39 | index XXXXXXX..XXXXXXX 100644 | 66 | index XXXXXXX..XXXXXXX 100644 |
40 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 67 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
41 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 68 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
42 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memory(void) | 69 | @@ -XXX,XX +XXX,XX @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list) |
43 | return ret; | 70 | tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); |
44 | } | 71 | } |
45 | 72 | ||
46 | +/* TDMR must be 1gb aligned */ | 73 | +/* Get the TDMR from the list at the given index. */ |
47 | +#define TDMR_ALIGNMENT BIT_ULL(30) | 74 | +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, |
48 | +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT) | 75 | + int idx) |
49 | + | 76 | +{ |
50 | +/* Align up and down the address to TDMR boundary */ | 77 | + int tdmr_info_offset = tdmr_list->tdmr_sz * idx; |
78 | + | ||
79 | + return (void *)tdmr_list->tdmrs + tdmr_info_offset; | ||
80 | +} | ||
81 | + | ||
82 | +#define TDMR_ALIGNMENT SZ_1G | ||
51 | +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT) | 83 | +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT) |
52 | +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT) | 84 | +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT) |
53 | + | ||
54 | +static inline u64 tdmr_start(struct tdmr_info *tdmr) | ||
55 | +{ | ||
56 | + return tdmr->base; | ||
57 | +} | ||
58 | + | 85 | + |
59 | +static inline u64 tdmr_end(struct tdmr_info *tdmr) | 86 | +static inline u64 tdmr_end(struct tdmr_info *tdmr) |
60 | +{ | 87 | +{ |
61 | + return tdmr->base + tdmr->size; | 88 | + return tdmr->base + tdmr->size; |
62 | +} | 89 | +} |
63 | + | 90 | + |
64 | /* Calculate the actual TDMR_INFO size */ | ||
65 | static inline int cal_tdmr_size(void) | ||
66 | { | ||
67 | @@ -XXX,XX +XXX,XX @@ static struct tdmr_info *alloc_tdmr_array(int *array_sz) | ||
68 | return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO); | ||
69 | } | ||
70 | |||
71 | +static struct tdmr_info *tdmr_array_entry(struct tdmr_info *tdmr_array, | ||
72 | + int idx) | ||
73 | +{ | ||
74 | + return (struct tdmr_info *)((unsigned long)tdmr_array + | ||
75 | + cal_tdmr_size() * idx); | ||
76 | +} | ||
77 | + | ||
78 | +/* | 91 | +/* |
79 | + * Create TDMRs to cover all TDX memory regions. The actual number | 92 | + * Take the memory referenced in @tmb_list and populate the |
80 | + * of TDMRs is set to @tdmr_num. | 93 | + * preallocated @tdmr_list, following all the special alignment |
94 | + * and size rules for TDMR. | ||
81 | + */ | 95 | + */ |
82 | +static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 96 | +static int fill_out_tdmrs(struct list_head *tmb_list, |
97 | + struct tdmr_info_list *tdmr_list) | ||
83 | +{ | 98 | +{ |
84 | + struct tdx_memblock *tmb; | 99 | + struct tdx_memblock *tmb; |
85 | + int tdmr_idx = 0; | 100 | + int tdmr_idx = 0; |
86 | + | 101 | + |
87 | + /* | 102 | + /* |
88 | + * Loop over TDX memory regions and create TDMRs to cover them. | 103 | + * Loop over TDX memory regions and fill out TDMRs to cover them. |
89 | + * To keep it simple, always try to use one TDMR to cover | 104 | + * To keep it simple, always try to use one TDMR to cover one |
90 | + * one memory region. | 105 | + * memory region. |
106 | + * | ||
107 | + * In practice TDX supports at least 64 TDMRs. A 2-socket system | ||
108 | + * typically only consumes less than 10 of those. This code is | ||
109 | + * dumb and simple and may use more TDMRs than is strictly | ||
110 | + * required. | ||
91 | + */ | 111 | + */ |
92 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 112 | + list_for_each_entry(tmb, tmb_list, list) { |
93 | + struct tdmr_info *tdmr; | 113 | + struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx); |
94 | + u64 start, end; | 114 | + u64 start, end; |
95 | + | 115 | + |
96 | + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx); | 116 | + start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn)); |
97 | + start = TDMR_ALIGN_DOWN(tmb->start_pfn << PAGE_SHIFT); | 117 | + end = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn)); |
98 | + end = TDMR_ALIGN_UP(tmb->end_pfn << PAGE_SHIFT); | ||
99 | + | 118 | + |
100 | + /* | 119 | + /* |
101 | + * If the current TDMR's size hasn't been initialized, | 120 | + * A valid size indicates the current TDMR has already |
102 | + * it is a new TDMR to cover the new memory region. | 121 | + * been filled out to cover the previous memory region(s). |
103 | + * Otherwise, the current TDMR has already covered the | ||
104 | + * previous memory region. In the latter case, check | ||
105 | + * whether the current memory region has been fully or | ||
106 | + * partially covered by the current TDMR, since TDMR is | ||
107 | + * 1G aligned. | ||
108 | + */ | 122 | + */ |
109 | + if (tdmr->size) { | 123 | + if (tdmr->size) { |
110 | + /* | 124 | + /* |
111 | + * Loop to the next memory region if the current | 125 | + * Loop to the next if the current memory region |
112 | + * block has already been fully covered by the | 126 | + * has already been fully covered. |
113 | + * current TDMR. | ||
114 | + */ | 127 | + */ |
115 | + if (end <= tdmr_end(tdmr)) | 128 | + if (end <= tdmr_end(tdmr)) |
116 | + continue; | 129 | + continue; |
117 | + | 130 | + |
118 | + /* | 131 | + /* Otherwise, skip the already covered part. */ |
119 | + * If part of the current memory region has | ||
120 | + * already been covered by the current TDMR, | ||
121 | + * skip the already covered part. | ||
122 | + */ | ||
123 | + if (start < tdmr_end(tdmr)) | 132 | + if (start < tdmr_end(tdmr)) |
124 | + start = tdmr_end(tdmr); | 133 | + start = tdmr_end(tdmr); |
125 | + | 134 | + |
126 | + /* | 135 | + /* |
127 | + * Create a new TDMR to cover the current memory | 136 | + * Create a new TDMR to cover the current memory |
128 | + * region, or the remaining part of it. | 137 | + * region, or the remaining part of it. |
129 | + */ | 138 | + */ |
130 | + tdmr_idx++; | 139 | + tdmr_idx++; |
131 | + if (tdmr_idx >= tdx_sysinfo.max_tdmrs) | 140 | + if (tdmr_idx >= tdmr_list->max_tdmrs) { |
132 | + return -E2BIG; | 141 | + pr_warn("initialization failed: TDMRs exhausted.\n"); |
133 | + | 142 | + return -ENOSPC; |
134 | + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx); | 143 | + } |
144 | + | ||
145 | + tdmr = tdmr_entry(tdmr_list, tdmr_idx); | ||
135 | + } | 146 | + } |
136 | + | 147 | + |
137 | + tdmr->base = start; | 148 | + tdmr->base = start; |
138 | + tdmr->size = end - start; | 149 | + tdmr->size = end - start; |
139 | + } | 150 | + } |
140 | + | 151 | + |
141 | + /* @tdmr_idx is always the index of last valid TDMR. */ | 152 | + /* @tdmr_idx is always the index of the last valid TDMR. */ |
142 | + *tdmr_num = tdmr_idx + 1; | 153 | + tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1; |
154 | + | ||
155 | + /* | ||
156 | + * Warn early that the kernel is about to run out of TDMRs. | ||
157 | + * | ||
158 | + * This is an indication that TDMR allocation has to be | ||
159 | + * reworked to be smarter to not run into an issue. | ||
160 | + */ | ||
161 | + if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN) | ||
162 | + pr_warn("consumed TDMRs reaching limit: %d used out of %d\n", | ||
163 | + tdmr_list->nr_consumed_tdmrs, | ||
164 | + tdmr_list->max_tdmrs); | ||
143 | + | 165 | + |
144 | + return 0; | 166 | + return 0; |
145 | +} | 167 | +} |
146 | + | 168 | + |
147 | /* | 169 | /* |
148 | * Construct an array of TDMRs to cover all TDX memory ranges. | 170 | * Construct a list of TDMRs on the preallocated space in @tdmr_list |
149 | * The actual number of TDMRs is kept to @tdmr_num. | 171 | * to cover all TDX memory regions in @tmb_list based on the TDX module |
150 | */ | 172 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, |
151 | static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 173 | struct tdmr_info_list *tdmr_list, |
174 | struct tdsysinfo_struct *sysinfo) | ||
152 | { | 175 | { |
153 | + int ret; | 176 | + int ret; |
154 | + | 177 | + |
155 | + ret = create_tdmrs(tdmr_array, tdmr_num); | 178 | + ret = fill_out_tdmrs(tmb_list, tdmr_list); |
156 | + if (ret) | 179 | + if (ret) |
157 | + goto err; | 180 | + return ret; |
158 | + | 181 | + |
159 | /* Return -EINVAL until constructing TDMRs is done */ | 182 | /* |
160 | - return -EINVAL; | 183 | * TODO: |
161 | + ret = -EINVAL; | 184 | * |
162 | +err: | 185 | - * - Fill out TDMRs to cover all TDX memory regions. |
163 | + return ret; | 186 | * - Allocate and set up PAMTs for each TDMR. |
164 | } | 187 | * - Designate reserved areas for each TDMR. |
165 | 188 | * | |
166 | /* | 189 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
190 | index XXXXXXX..XXXXXXX 100644 | ||
191 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
192 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
193 | @@ -XXX,XX +XXX,XX @@ struct tdx_memblock { | ||
194 | unsigned long end_pfn; | ||
195 | }; | ||
196 | |||
197 | +/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */ | ||
198 | +#define TDMR_NR_WARN 4 | ||
199 | + | ||
200 | struct tdmr_info_list { | ||
201 | void *tdmrs; /* Flexible array to hold 'tdmr_info's */ | ||
202 | int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */ | ||
167 | -- | 203 | -- |
168 | 2.38.1 | 204 | 2.41.0 |
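The fill_out_tdmrs() algorithm in the patch above can be modeled in isolation. The sketch below is a simplified userspace approximation, not the kernel code: the struct and helper names mirror the patch but the types are cut down, and it assumes the input regions are sorted in ascending address order and non-overlapping (which the kernel's memblock-derived list guarantees). Each region is widened to 1G-aligned boundaries, and a new TDMR is started only when the current one cannot absorb the widened range.

```c
#include <assert.h>
#include <stdint.h>

#define TDMR_ALIGNMENT (1ULL << 30)                      /* TDMRs are 1G aligned */
#define TDMR_ALIGN_DOWN(x) ((x) & ~(TDMR_ALIGNMENT - 1))
#define TDMR_ALIGN_UP(x)   TDMR_ALIGN_DOWN((x) + TDMR_ALIGNMENT - 1)

struct region { uint64_t start, end; };  /* [start, end) physical range */
struct tdmr   { uint64_t base, size; };

/*
 * Cover each region with 1G-aligned TDMRs.  Regions must be sorted in
 * ascending address order and must not overlap; @t must be zeroed by
 * the caller.  Returns the number of TDMRs consumed, or -1 when
 * @max_tdmrs is exhausted.
 */
int fill_out_tdmrs(const struct region *r, int nr, struct tdmr *t, int max_tdmrs)
{
    int idx = 0;

    for (int i = 0; i < nr; i++) {
        uint64_t start = TDMR_ALIGN_DOWN(r[i].start);
        uint64_t end   = TDMR_ALIGN_UP(r[i].end);

        if (t[idx].size) {
            uint64_t cur_end = t[idx].base + t[idx].size;

            /* Region already fully covered by the current TDMR. */
            if (end <= cur_end)
                continue;
            /* Skip the part the current TDMR already covers. */
            if (start < cur_end)
                start = cur_end;
            /* Start a new TDMR for the remaining part. */
            if (++idx >= max_tdmrs)
                return -1;
        }
        t[idx].base = start;
        t[idx].size = end - start;
    }
    return idx + 1;
}
```

Two regions landing in the same aligned gigabyte share one TDMR, while regions in different gigabytes each get their own, which is why a typical 2-socket system stays well under the module's minimum of 64 TDMRs.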
1 | The TDX module uses additional metadata to record things like which | 1 | The TDX module uses additional metadata to record things like which |
---|---|---|---|
2 | guest "owns" a given page of memory. This metadata, referred as | 2 | guest "owns" a given page of memory. This metadata, referred as |
3 | Physical Address Metadata Table (PAMT), essentially serves as the | 3 | Physical Address Metadata Table (PAMT), essentially serves as the |
4 | 'struct page' for the TDX module. PAMTs are not reserved by hardware | 4 | 'struct page' for the TDX module. PAMTs are not reserved by hardware |
5 | up front. They must be allocated by the kernel and then given to the | 5 | up front. They must be allocated by the kernel and then given to the |
6 | TDX module. | 6 | TDX module during module initialization. |
7 | 7 | ||
8 | TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" | 8 | TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" |
9 | (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must | 9 | (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must |
10 | be a physically contiguous area from a Convertible Memory Region (CMR). | 10 | be a physically contiguous area from a Convertible Memory Region (CMR). |
11 | However, the PAMTs which track pages in one TDMR do not need to reside | 11 | However, the PAMTs which track pages in one TDMR do not need to reside |
... | ... | ||
14 | that particular TDMR. | 14 | that particular TDMR. |
15 | 15 | ||
16 | Use alloc_contig_pages() since PAMT must be a physically contiguous area | 16 | Use alloc_contig_pages() since PAMT must be a physically contiguous area |
17 | and it may be potentially large (~1/256th of the size of the given TDMR). | 17 | and it may be potentially large (~1/256th of the size of the given TDMR). |
18 | The downside is alloc_contig_pages() may fail at runtime. One (bad) | 18 | The downside is alloc_contig_pages() may fail at runtime. One (bad) |
19 | mitigation is to launch a TD guest early during system boot to get those | 19 | mitigation is to launch a TDX guest early during system boot to get |
20 | PAMTs allocated early, but the only way to fix this is to add a boot | 20 | those PAMTs allocated early, but the only way to fix this is to add a |
21 | option to allocate or reserve PAMTs during kernel boot. | 21 | boot option to allocate or reserve PAMTs during kernel boot. |
22 | |||
23 | It is imperfect but will be improved on later. | ||
22 | 24 | ||
23 | TDX only supports a limited number of reserved areas per TDMR to cover | 25 | TDX only supports a limited number of reserved areas per TDMR to cover |
24 | both PAMTs and memory holes within the given TDMR. If many PAMTs are | 26 | both PAMTs and memory holes within the given TDMR. If many PAMTs are |
25 | allocated within a single TDMR, the reserved areas may not be sufficient | 27 | allocated within a single TDMR, the reserved areas may not be sufficient |
26 | to cover all of them. | 28 | to cover all of them. |
... | ... | ||
31 | the total number of reserved areas consumed for PAMTs. | 33 | the total number of reserved areas consumed for PAMTs. |
32 | - Try to first allocate PAMT from the local node of the TDMR for better | 34 | - Try to first allocate PAMT from the local node of the TDMR for better |
33 | NUMA locality. | 35 | NUMA locality. |
34 | 36 | ||
35 | Also dump out how many pages are allocated for PAMTs when the TDX module | 37 | Also dump out how many pages are allocated for PAMTs when the TDX module |
36 | is initialized successfully. | 38 | is initialized successfully. This helps answer the eternal "where did |
37 | 39 | all my memory go?" questions. | |
40 | |||
41 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
38 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 42 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
39 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 43 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> |
44 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
45 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
40 | --- | 46 | --- |
47 | |||
48 | v13 -> v14: | ||
49 | - No change | ||
50 | |||
51 | v12 -> v13: | ||
52 | - Added Kirill and Yuan's tag. | ||
53 | - Removed unintended space. (Yuan) | ||
54 | |||
55 | v11 -> v12: | ||
56 | - Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill) | ||
57 | - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill) | ||
58 | - Changed tdmr_get_pamt() to return base and size instead of base_pfn | ||
59 | and npages and related code directly (Dave). | ||
60 | - Simplified PAMT kb counting. (Dave) | ||
61 | - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave) | ||
62 | |||
63 | v10 -> v11: | ||
64 | - No update | ||
65 | |||
66 | v9 -> v10: | ||
67 | - Removed code change in disable_tdx_module() as it doesn't exist | ||
68 | anymore. | ||
69 | |||
70 | v8 -> v9: | ||
71 | - Added TDX_PS_NR macro instead of open-coding (Dave). | ||
72 | - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave). | ||
73 | - Changed to print out PAMTs in "KBs" instead of "pages" (Dave). | ||
74 | - Added Dave's Reviewed-by. | ||
75 | |||
76 | v7 -> v8: (Dave) | ||
77 | - Changelog: | ||
78 | - Added a sentence to state PAMT allocation will be improved. | ||
79 | - Others suggested by Dave. | ||
80 | - Moved 'nid' of 'struct tdx_memblock' to this patch. | ||
81 | - Improved comments around tdmr_get_nid(). | ||
82 | - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid(). | ||
83 | - Other changes due to 'struct tdmr_info_list'. | ||
41 | 84 | ||
42 | v6 -> v7: | 85 | v6 -> v7: |
43 | - Changes due to using macros instead of 'enum' for TDX supported page | 86 | - Changes due to using macros instead of 'enum' for TDX supported page |
44 | sizes. | 87 | sizes. |
45 | 88 | ||
... | ... | ||
49 | - Improved comment around tdmr_get_nid() (Dave). | 92 | - Improved comment around tdmr_get_nid() (Dave). |
50 | - Improved comment in tdmr_set_up_pamt() around breaking the PAMT | 93 | - Improved comment in tdmr_set_up_pamt() around breaking the PAMT |
51 | into PAMTs for 4K/2M/1G (Dave). | 94 | into PAMTs for 4K/2M/1G (Dave). |
52 | - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave). | 95 | - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave). |
53 | 96 | ||
54 | - v3 -> v5 (no feedback on v4): | ||
55 | - Used memblock to get the NUMA node for given TDMR. | ||
56 | - Removed tdmr_get_pamt_sz() helper but use open-code instead. | ||
57 | - Changed to use 'switch .. case..' for each TDX supported page size in | ||
58 | tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()). | ||
59 | - Added printing out memory used for PAMT allocation when TDX module is | ||
60 | initialized successfully. | ||
61 | - Explained downside of alloc_contig_pages() in changelog. | ||
62 | - Addressed other minor comments. | ||
63 | |||
64 | |||
65 | --- | 97 | --- |
66 | arch/x86/Kconfig | 1 + | 98 | arch/x86/Kconfig | 1 + |
67 | arch/x86/virt/vmx/tdx/tdx.c | 191 ++++++++++++++++++++++++++++++++++++ | 99 | arch/x86/include/asm/shared/tdx.h | 1 + |
68 | 2 files changed, 192 insertions(+) | 100 | arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++- |
101 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | ||
102 | 4 files changed, 213 insertions(+), 5 deletions(-) | ||
69 | 103 | ||
70 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig | 104 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig |
71 | index XXXXXXX..XXXXXXX 100644 | 105 | index XXXXXXX..XXXXXXX 100644 |
72 | --- a/arch/x86/Kconfig | 106 | --- a/arch/x86/Kconfig |
73 | +++ b/arch/x86/Kconfig | 107 | +++ b/arch/x86/Kconfig |
... | ... | ||
77 | select ARCH_KEEP_MEMBLOCK | 111 | select ARCH_KEEP_MEMBLOCK |
78 | + depends on CONTIG_ALLOC | 112 | + depends on CONTIG_ALLOC |
79 | help | 113 | help |
80 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious | 114 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious |
81 | host and certain physical attacks. This option enables necessary TDX | 115 | host and certain physical attacks. This option enables necessary TDX |
116 | diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h | ||
117 | index XXXXXXX..XXXXXXX 100644 | ||
118 | --- a/arch/x86/include/asm/shared/tdx.h | ||
119 | +++ b/arch/x86/include/asm/shared/tdx.h | ||
120 | @@ -XXX,XX +XXX,XX @@ | ||
121 | #define TDX_PS_4K 0 | ||
122 | #define TDX_PS_2M 1 | ||
123 | #define TDX_PS_1G 2 | ||
124 | +#define TDX_PS_NR (TDX_PS_1G + 1) | ||
125 | |||
126 | #ifndef __ASSEMBLY__ | ||
127 | |||
82 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 128 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
83 | index XXXXXXX..XXXXXXX 100644 | 129 | index XXXXXXX..XXXXXXX 100644 |
84 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 130 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
85 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 131 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
86 | @@ -XXX,XX +XXX,XX @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 132 | @@ -XXX,XX +XXX,XX @@ static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo, |
133 | * overlap. | ||
134 | */ | ||
135 | static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, | ||
136 | - unsigned long end_pfn) | ||
137 | + unsigned long end_pfn, int nid) | ||
138 | { | ||
139 | struct tdx_memblock *tmb; | ||
140 | |||
141 | @@ -XXX,XX +XXX,XX @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, | ||
142 | INIT_LIST_HEAD(&tmb->list); | ||
143 | tmb->start_pfn = start_pfn; | ||
144 | tmb->end_pfn = end_pfn; | ||
145 | + tmb->nid = nid; | ||
146 | |||
147 | /* @tmb_list is protected by mem_hotplug_lock */ | ||
148 | list_add_tail(&tmb->list, tmb_list); | ||
149 | @@ -XXX,XX +XXX,XX @@ static void free_tdx_memlist(struct list_head *tmb_list) | ||
150 | static int build_tdx_memlist(struct list_head *tmb_list) | ||
151 | { | ||
152 | unsigned long start_pfn, end_pfn; | ||
153 | - int i, ret; | ||
154 | + int i, nid, ret; | ||
155 | |||
156 | - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { | ||
157 | + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { | ||
158 | /* | ||
159 | * The first 1MB is not reported as TDX convertible memory. | ||
160 | * Although the first 1MB is always reserved and won't end up | ||
161 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memlist(struct list_head *tmb_list) | ||
162 | * memblock has already guaranteed they are in address | ||
163 | * ascending order and don't overlap. | ||
164 | */ | ||
165 | - ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); | ||
166 | + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid); | ||
167 | if (ret) | ||
168 | goto err; | ||
169 | } | ||
170 | @@ -XXX,XX +XXX,XX @@ static int fill_out_tdmrs(struct list_head *tmb_list, | ||
87 | return 0; | 171 | return 0; |
88 | } | 172 | } |
89 | 173 | ||
90 | +/* | 174 | +/* |
91 | + * Calculate PAMT size given a TDMR and a page size. The returned | 175 | + * Calculate PAMT size given a TDMR and a page size. The returned |
92 | + * PAMT size is always aligned up to 4K page boundary. | 176 | + * PAMT size is always aligned up to 4K page boundary. |
93 | + */ | 177 | + */ |
94 | +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz) | 178 | +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz, |
179 | + u16 pamt_entry_size) | ||
95 | +{ | 180 | +{ |
96 | + unsigned long pamt_sz, nr_pamt_entries; | 181 | + unsigned long pamt_sz, nr_pamt_entries; |
97 | + | 182 | + |
98 | + switch (pgsz) { | 183 | + switch (pgsz) { |
99 | + case TDX_PS_4K: | 184 | + case TDX_PS_4K: |
... | ... | ||
108 | + default: | 193 | + default: |
109 | + WARN_ON_ONCE(1); | 194 | + WARN_ON_ONCE(1); |
110 | + return 0; | 195 | + return 0; |
111 | + } | 196 | + } |
112 | + | 197 | + |
113 | + pamt_sz = nr_pamt_entries * tdx_sysinfo.pamt_entry_size; | 198 | + pamt_sz = nr_pamt_entries * pamt_entry_size; |
114 | + /* TDX requires PAMT size must be 4K aligned */ | 199 | + /* TDX requires PAMT size must be 4K aligned */ |
115 | + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE); | 200 | + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE); |
116 | + | 201 | + |
117 | + return pamt_sz; | 202 | + return pamt_sz; |
118 | +} | 203 | +} |
119 | + | 204 | + |
120 | +/* | 205 | +/* |
121 | + * Pick a NUMA node on which to allocate this TDMR's metadata. | 206 | + * Locate a NUMA node which should hold the allocation of the @tdmr |
122 | + * | 207 | + * PAMT. This node will have some memory covered by the TDMR. The |
123 | + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might | 208 | + * relative amount of memory covered is not considered. |
124 | + * not be. If the TDMR covers more than one node, just use the _first_ | ||
125 | + * one. This can lead to small areas of off-node metadata for some | ||
126 | + * memory. | ||
127 | + */ | 209 | + */ |
128 | +static int tdmr_get_nid(struct tdmr_info *tdmr) | 210 | +static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list) |
129 | +{ | 211 | +{ |
130 | + struct tdx_memblock *tmb; | 212 | + struct tdx_memblock *tmb; |
131 | + | 213 | + |
132 | + /* Find the first memory region covered by the TDMR */ | 214 | + /* |
133 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 215 | + * A TDMR must cover at least part of one TMB. That TMB will end |
134 | + if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT)) | 216 | + * after the TDMR begins. But, that TMB may have started before |
217 | + * the TDMR. Find the next 'tmb' that _ends_ after this TDMR | ||
218 | + * begins. Ignore 'tmb' start addresses. They are irrelevant. | ||
219 | + */ | ||
220 | + list_for_each_entry(tmb, tmb_list, list) { | ||
221 | + if (tmb->end_pfn > PHYS_PFN(tdmr->base)) | ||
135 | + return tmb->nid; | 222 | + return tmb->nid; |
136 | + } | 223 | + } |
137 | + | 224 | + |
138 | + /* | 225 | + /* |
139 | + * Fall back to allocating the TDMR's metadata from node 0 when | 226 | + * Fall back to allocating the TDMR's metadata from node 0 when |
140 | + * no TDX memory block can be found. This should never happen | 227 | + * no TDX memory block can be found. This should never happen |
141 | + * since TDMRs originate from TDX memory blocks. | 228 | + * since TDMRs originate from TDX memory blocks. |
142 | + */ | 229 | + */ |
143 | + WARN_ON_ONCE(1); | 230 | + pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n", |
231 | + tdmr->base, tdmr_end(tdmr)); | ||
144 | + return 0; | 232 | + return 0; |
145 | +} | 233 | +} |
146 | + | 234 | + |
147 | +static int tdmr_set_up_pamt(struct tdmr_info *tdmr) | 235 | +/* |
148 | +{ | 236 | + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list |
149 | + unsigned long pamt_base[TDX_PS_1G + 1]; | 237 | + * within @tdmr, and set up PAMTs for @tdmr. |
150 | + unsigned long pamt_size[TDX_PS_1G + 1]; | 238 | + */ |
239 | +static int tdmr_set_up_pamt(struct tdmr_info *tdmr, | ||
240 | + struct list_head *tmb_list, | ||
241 | + u16 pamt_entry_size) | ||
242 | +{ | ||
243 | + unsigned long pamt_base[TDX_PS_NR]; | ||
244 | + unsigned long pamt_size[TDX_PS_NR]; | ||
151 | + unsigned long tdmr_pamt_base; | 245 | + unsigned long tdmr_pamt_base; |
152 | + unsigned long tdmr_pamt_size; | 246 | + unsigned long tdmr_pamt_size; |
153 | + struct page *pamt; | 247 | + struct page *pamt; |
154 | + int pgsz, nid; | 248 | + int pgsz, nid; |
155 | + | 249 | + |
156 | + nid = tdmr_get_nid(tdmr); | 250 | + nid = tdmr_get_nid(tdmr, tmb_list); |
157 | + | 251 | + |
158 | + /* | 252 | + /* |
159 | + * Calculate the PAMT size for each TDX supported page size | 253 | + * Calculate the PAMT size for each TDX supported page size |
160 | + * and the total PAMT size. | 254 | + * and the total PAMT size. |
161 | + */ | 255 | + */ |
162 | + tdmr_pamt_size = 0; | 256 | + tdmr_pamt_size = 0; |
163 | + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) { | 257 | + for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) { |
164 | + pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz); | 258 | + pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz, |
259 | + pamt_entry_size); | ||
165 | + tdmr_pamt_size += pamt_size[pgsz]; | 260 | + tdmr_pamt_size += pamt_size[pgsz]; |
166 | + } | 261 | + } |
167 | + | 262 | + |
168 | + /* | 263 | + /* |
169 | + * Allocate one chunk of physically contiguous memory for all | 264 | + * Allocate one chunk of physically contiguous memory for all |
... | ... | ||
178 | + /* | 273 | + /* |
179 | + * Break the contiguous allocation back up into the | 274 | + * Break the contiguous allocation back up into the |
180 | + * individual PAMTs for each page size. | 275 | + * individual PAMTs for each page size. |
181 | + */ | 276 | + */ |
182 | + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT; | 277 | + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT; |
183 | + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) { | 278 | + for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) { |
184 | + pamt_base[pgsz] = tdmr_pamt_base; | 279 | + pamt_base[pgsz] = tdmr_pamt_base; |
185 | + tdmr_pamt_base += pamt_size[pgsz]; | 280 | + tdmr_pamt_base += pamt_size[pgsz]; |
186 | + } | 281 | + } |
187 | + | 282 | + |
188 | + tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; | 283 | + tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; |
... | ... | ||
193 | + tdmr->pamt_1g_size = pamt_size[TDX_PS_1G]; | 288 | + tdmr->pamt_1g_size = pamt_size[TDX_PS_1G]; |
194 | + | 289 | + |
195 | + return 0; | 290 | + return 0; |
196 | +} | 291 | +} |
197 | + | 292 | + |
198 | +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn, | 293 | +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base, |
199 | + unsigned long *pamt_npages) | 294 | + unsigned long *pamt_size) |
200 | +{ | 295 | +{ |
201 | + unsigned long pamt_base, pamt_sz; | 296 | + unsigned long pamt_bs, pamt_sz; |
202 | + | 297 | + |
203 | + /* | 298 | + /* |
204 | + * The PAMT was allocated in one contiguous unit. The 4K PAMT | 299 | + * The PAMT was allocated in one contiguous unit. The 4K PAMT |
205 | + * should always point to the beginning of that allocation. | 300 | + * should always point to the beginning of that allocation. |
206 | + */ | 301 | + */ |
207 | + pamt_base = tdmr->pamt_4k_base; | 302 | + pamt_bs = tdmr->pamt_4k_base; |
208 | + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size; | 303 | + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size; |
209 | + | 304 | + |
210 | + *pamt_pfn = pamt_base >> PAGE_SHIFT; | 305 | + WARN_ON_ONCE((pamt_bs & ~PAGE_MASK) || (pamt_sz & ~PAGE_MASK)); |
211 | + *pamt_npages = pamt_sz >> PAGE_SHIFT; | 306 | + |
307 | + *pamt_base = pamt_bs; | ||
308 | + *pamt_size = pamt_sz; | ||
212 | +} | 309 | +} |
213 | + | 310 | + |
214 | +static void tdmr_free_pamt(struct tdmr_info *tdmr) | 311 | +static void tdmr_free_pamt(struct tdmr_info *tdmr) |
215 | +{ | 312 | +{ |
216 | + unsigned long pamt_pfn, pamt_npages; | 313 | + unsigned long pamt_base, pamt_size; |
217 | + | 314 | + |
218 | + tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages); | 315 | + tdmr_get_pamt(tdmr, &pamt_base, &pamt_size); |
219 | + | 316 | + |
220 | + /* Do nothing if PAMT hasn't been allocated for this TDMR */ | 317 | + /* Do nothing if PAMT hasn't been allocated for this TDMR */ |
221 | + if (!pamt_npages) | 318 | + if (!pamt_size) |
222 | + return; | 319 | + return; |
223 | + | 320 | + |
224 | + if (WARN_ON_ONCE(!pamt_pfn)) | 321 | + if (WARN_ON_ONCE(!pamt_base)) |
225 | + return; | 322 | + return; |
226 | + | 323 | + |
227 | + free_contig_range(pamt_pfn, pamt_npages); | 324 | + free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT); |
228 | +} | 325 | +} |
229 | + | 326 | + |
230 | +static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num) | 327 | +static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) |
231 | +{ | 328 | +{ |
232 | + int i; | 329 | + int i; |
233 | + | 330 | + |
234 | + for (i = 0; i < tdmr_num; i++) | 331 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) |
235 | + tdmr_free_pamt(tdmr_array_entry(tdmr_array, i)); | 332 | + tdmr_free_pamt(tdmr_entry(tdmr_list, i)); |
236 | +} | 333 | +} |
237 | + | 334 | + |
238 | +/* Allocate and set up PAMTs for all TDMRs */ | 335 | +/* Allocate and set up PAMTs for all TDMRs */ |
239 | +static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num) | 336 | +static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, |
337 | + struct list_head *tmb_list, | ||
338 | + u16 pamt_entry_size) | ||
240 | +{ | 339 | +{ |
241 | + int i, ret = 0; | 340 | + int i, ret = 0; |
242 | + | 341 | + |
243 | + for (i = 0; i < tdmr_num; i++) { | 342 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
244 | + ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i)); | 343 | + ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list, |
344 | + pamt_entry_size); | ||
245 | + if (ret) | 345 | + if (ret) |
246 | + goto err; | 346 | + goto err; |
247 | + } | 347 | + } |
248 | + | 348 | + |
249 | + return 0; | 349 | + return 0; |
250 | +err: | 350 | +err: |
251 | + tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 351 | + tdmrs_free_pamt_all(tdmr_list); |
252 | + return ret; | 352 | + return ret; |
253 | +} | 353 | +} |
254 | + | 354 | + |
255 | +static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array, | 355 | +static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) |
256 | + int tdmr_num) | 356 | +{ |
257 | +{ | 357 | + unsigned long pamt_size = 0; |
258 | + unsigned long pamt_npages = 0; | ||
259 | + int i; | 358 | + int i; |
260 | + | 359 | + |
261 | + for (i = 0; i < tdmr_num; i++) { | 360 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
262 | + unsigned long pfn, npages; | 361 | + unsigned long base, size; |
263 | + | 362 | + |
264 | + tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn, &npages); | 363 | + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size); |
265 | + pamt_npages += npages; | 364 | + pamt_size += size; |
266 | + } | 365 | + } |
267 | + | 366 | + |
268 | + return pamt_npages; | 367 | + return pamt_size / 1024; |
269 | +} | 368 | +} |
270 | + | 369 | + |
271 | /* | 370 | /* |
272 | * Construct an array of TDMRs to cover all TDX memory ranges. | 371 | * Construct a list of TDMRs on the preallocated space in @tdmr_list |
273 | * The actual number of TDMRs is kept to @tdmr_num. | 372 | * to cover all TDX memory regions in @tmb_list based on the TDX module |
274 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 373 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, |
275 | if (ret) | 374 | if (ret) |
276 | goto err; | 375 | return ret; |
277 | 376 | ||
278 | + ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num); | 377 | + ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, |
378 | + sysinfo->pamt_entry_size); | ||
279 | + if (ret) | 379 | + if (ret) |
280 | + goto err; | 380 | + return ret; |
281 | + | 381 | /* |
282 | /* Return -EINVAL until constructing TDMRs is done */ | 382 | * TODO: |
283 | ret = -EINVAL; | 383 | * |
284 | + tdmrs_free_pamt_all(tdmr_array, *tdmr_num); | 384 | - * - Allocate and set up PAMTs for each TDMR. |
285 | err: | 385 | * - Designate reserved areas for each TDMR. |
286 | return ret; | 386 | * |
287 | } | 387 | * Return -EINVAL until constructing TDMRs is done |
288 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 388 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
289 | * process are done. | 389 | * Return error before all steps are done. |
290 | */ | 390 | */ |
291 | ret = -EINVAL; | 391 | ret = -EINVAL; |
292 | + if (ret) | 392 | + if (ret) |
293 | + tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 393 | + tdmrs_free_pamt_all(&tdmr_list); |
294 | + else | 394 | + else |
295 | + pr_info("%lu pages allocated for PAMT.\n", | 395 | + pr_info("%lu KBs allocated for PAMT\n", |
296 | + tdmrs_count_pamt_pages(tdmr_array, tdmr_num)); | 396 | + tdmrs_count_pamt_kb(&tdmr_list)); |
297 | out_free_tdmrs: | 397 | out_free_tdmrs: |
298 | /* | 398 | /* |
299 | * The array of TDMRs is freed no matter the initialization is | 399 | * Always free the buffer of TDMRs as they are only used during |
400 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
401 | index XXXXXXX..XXXXXXX 100644 | ||
402 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
403 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
404 | @@ -XXX,XX +XXX,XX @@ struct tdx_memblock { | ||
405 | struct list_head list; | ||
406 | unsigned long start_pfn; | ||
407 | unsigned long end_pfn; | ||
408 | + int nid; | ||
409 | }; | ||
410 | |||
411 | /* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */ | ||
300 | -- | 412 | -- |
301 | 2.38.1 | 413 | 2.41.0 | diff view generated by jsdifflib |
1 | As the last step of constructing TDMRs, set up reserved areas for all | 1 | As the last step of constructing TDMRs, populate reserved areas for all |
---|---|---|---|
2 | TDMRs. For each TDMR, put all memory holes within this TDMR into the | 2 | TDMRs. For each TDMR, put all memory holes within this TDMR into the |
3 | reserved areas. For all PAMTs that overlap with this TDMR, put | 3 | reserved areas. For all PAMTs that overlap with this TDMR, put |
4 | all the overlapping parts into reserved areas too. | 4 | all the overlapping parts into reserved areas too. |
5 | 5 | ||
6 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
6 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 7 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
7 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 8 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
9 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
8 | --- | 10 | --- |
9 | 11 | ||
10 | v6 -> v7: | 12 | v13 -> v14: |
13 | - No change | ||
14 | |||
15 | v12 -> v13: | ||
16 | - Added Yuan's tag. | ||
17 | |||
18 | v11 -> v12: | ||
19 | - Code change due to tdmr_get_pamt() change from returning pfn/npages to | ||
20 | base/size | ||
21 | - Added Kirill's tag | ||
22 | |||
23 | v10 -> v11: | ||
24 | - No update | ||
25 | |||
26 | v9 -> v10: | ||
11 | - No change. | 27 | - No change. |
12 | 28 | ||
13 | v5 -> v6: | 29 | v8 -> v9: |
14 | - Rebase due to using 'tdx_memblock' instead of memblock. | 30 | - Added comment around 'tdmr_add_rsvd_area()' to point out it doesn't do |
15 | - Split tdmr_set_up_rsvd_areas() into two functions to handle memory | 31 | optimization to save reserved areas. (Dave). |
16 | hole and PAMT respectively. | 32 | |
17 | - Added Isaku's Reviewed-by. | 33 | v7 -> v8: (Dave) |
18 | 34 | - "set_up" -> "populate" in function name change (Dave). | |
35 | - Improved comment suggested by Dave. | ||
36 | - Other changes due to 'struct tdmr_info_list'. | ||
19 | 37 | ||
20 | --- | 38 | --- |
21 | arch/x86/virt/vmx/tdx/tdx.c | 190 +++++++++++++++++++++++++++++++++++- | 39 | arch/x86/virt/vmx/tdx/tdx.c | 217 ++++++++++++++++++++++++++++++++++-- |
22 | 1 file changed, 188 insertions(+), 2 deletions(-) | 40 | 1 file changed, 209 insertions(+), 8 deletions(-) |
23 | 41 | ||
24 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 42 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
25 | index XXXXXXX..XXXXXXX 100644 | 43 | index XXXXXXX..XXXXXXX 100644 |
26 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 44 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
27 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 45 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
28 | @@ -XXX,XX +XXX,XX @@ | 46 | @@ -XXX,XX +XXX,XX @@ |
29 | #include <linux/memblock.h> | ||
30 | #include <linux/minmax.h> | ||
31 | #include <linux/sizes.h> | 47 | #include <linux/sizes.h> |
48 | #include <linux/pfn.h> | ||
49 | #include <linux/align.h> | ||
32 | +#include <linux/sort.h> | 50 | +#include <linux/sort.h> |
33 | #include <asm/msr-index.h> | 51 | #include <asm/msr-index.h> |
34 | #include <asm/msr.h> | 52 | #include <asm/msr.h> |
35 | #include <asm/apic.h> | 53 | #include <asm/page.h> |
36 | @@ -XXX,XX +XXX,XX @@ static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array, | 54 | @@ -XXX,XX +XXX,XX @@ static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) |
37 | return pamt_npages; | 55 | return pamt_size / 1024; |
38 | } | 56 | } |
39 | 57 | ||
40 | +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, | 58 | +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr, |
41 | + u64 addr, u64 size) | 59 | + u64 size, u16 max_reserved_per_tdmr) |
42 | +{ | 60 | +{ |
43 | + struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas; | 61 | + struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas; |
44 | + int idx = *p_idx; | 62 | + int idx = *p_idx; |
45 | + | 63 | + |
46 | + /* Reserved area must be 4K aligned in offset and size */ | 64 | + /* Reserved area must be 4K aligned in offset and size */ |
47 | + if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK)) | 65 | + if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK)) |
48 | + return -EINVAL; | 66 | + return -EINVAL; |
49 | + | 67 | + |
50 | + /* Cannot exceed maximum reserved areas supported by TDX */ | 68 | + if (idx >= max_reserved_per_tdmr) { |
51 | + if (idx >= tdx_sysinfo.max_reserved_per_tdmr) | 69 | + pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n", |
52 | + return -E2BIG; | 70 | + tdmr->base, tdmr_end(tdmr)); |
53 | + | 71 | + return -ENOSPC; |
72 | + } | ||
73 | + | ||
74 | + /* | ||
75 | + * Consume one reserved area per call. Make no effort to | ||
76 | + * optimize or reduce the number of reserved areas which are | ||
77 | + * consumed by contiguous reserved areas, for instance. | ||
78 | + */ | ||
54 | + rsvd_areas[idx].offset = addr - tdmr->base; | 79 | + rsvd_areas[idx].offset = addr - tdmr->base; |
55 | + rsvd_areas[idx].size = size; | 80 | + rsvd_areas[idx].size = size; |
56 | + | 81 | + |
57 | + *p_idx = idx + 1; | 82 | + *p_idx = idx + 1; |
58 | + | 83 | + |
59 | + return 0; | 84 | + return 0; |
60 | +} | 85 | +} |
61 | + | 86 | + |
62 | +static int tdmr_set_up_memory_hole_rsvd_areas(struct tdmr_info *tdmr, | 87 | +/* |
63 | + int *rsvd_idx) | 88 | + * Go through @tmb_list to find holes between memory areas. If any of |
89 | + * those holes fall within @tdmr, set up a TDMR reserved area to cover | ||
90 | + * the hole. | ||
91 | + */ | ||
92 | +static int tdmr_populate_rsvd_holes(struct list_head *tmb_list, | ||
93 | + struct tdmr_info *tdmr, | ||
94 | + int *rsvd_idx, | ||
95 | + u16 max_reserved_per_tdmr) | ||
64 | +{ | 96 | +{ |
65 | + struct tdx_memblock *tmb; | 97 | + struct tdx_memblock *tmb; |
66 | + u64 prev_end; | 98 | + u64 prev_end; |
67 | + int ret; | 99 | + int ret; |
68 | + | 100 | + |
69 | + /* Mark holes between memory regions as reserved */ | 101 | + /* |
70 | + prev_end = tdmr_start(tdmr); | 102 | + * Start looking for reserved blocks at the |
71 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 103 | + * beginning of the TDMR. |
104 | + */ | ||
105 | + prev_end = tdmr->base; | ||
106 | + list_for_each_entry(tmb, tmb_list, list) { | ||
72 | + u64 start, end; | 107 | + u64 start, end; |
73 | + | 108 | + |
74 | + start = tmb->start_pfn << PAGE_SHIFT; | 109 | + start = PFN_PHYS(tmb->start_pfn); |
75 | + end = tmb->end_pfn << PAGE_SHIFT; | 110 | + end = PFN_PHYS(tmb->end_pfn); |
76 | + | 111 | + |
77 | + /* Break if this region is after the TDMR */ | 112 | + /* Break if this region is after the TDMR */ |
78 | + if (start >= tdmr_end(tdmr)) | 113 | + if (start >= tdmr_end(tdmr)) |
79 | + break; | 114 | + break; |
80 | + | 115 | + |
81 | + /* Exclude regions before this TDMR */ | 116 | + /* Exclude regions before this TDMR */ |
82 | + if (end < tdmr_start(tdmr)) | 117 | + if (end < tdmr->base) |
83 | + continue; | 118 | + continue; |
84 | + | 119 | + |
85 | + /* | 120 | + /* |
86 | + * Skip if no hole exists before this region. "<=" is | 121 | + * Skip over memory areas that |
87 | + * used because one memory region might span two TDMRs | 122 | + * have already been dealt with. |
88 | + * (when the previous TDMR covers part of this region). | ||
89 | + * In this case the start address of this region is | ||
90 | + * smaller than the start address of the second TDMR. | ||
91 | + * | ||
92 | + * Update the prev_end to the end of this region where | ||
93 | + * the possible memory hole starts. | ||
94 | + */ | 123 | + */ |
95 | + if (start <= prev_end) { | 124 | + if (start <= prev_end) { |
96 | + prev_end = end; | 125 | + prev_end = end; |
97 | + continue; | 126 | + continue; |
98 | + } | 127 | + } |
99 | + | 128 | + |
100 | + /* Add the hole before this region */ | 129 | + /* Add the hole before this region */ |
101 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, | 130 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, |
102 | + start - prev_end); | 131 | + start - prev_end, |
132 | + max_reserved_per_tdmr); | ||
103 | + if (ret) | 133 | + if (ret) |
104 | + return ret; | 134 | + return ret; |
105 | + | 135 | + |
106 | + prev_end = end; | 136 | + prev_end = end; |
107 | + } | 137 | + } |
108 | + | 138 | + |
109 | + /* Add the hole after the last region if it exists. */ | 139 | + /* Add the hole after the last region if it exists. */ |
110 | + if (prev_end < tdmr_end(tdmr)) { | 140 | + if (prev_end < tdmr_end(tdmr)) { |
111 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, | 141 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, |
112 | + tdmr_end(tdmr) - prev_end); | 142 | + tdmr_end(tdmr) - prev_end, |
113 | + if (ret) | 143 | + max_reserved_per_tdmr); |
114 | + return ret; | 144 | + if (ret) |
115 | + } | 145 | + return ret; |
116 | + | 146 | + } |
117 | + return 0; | 147 | + |
118 | +} | 148 | + return 0; |
119 | + | 149 | +} |
120 | +static int tdmr_set_up_pamt_rsvd_areas(struct tdmr_info *tdmr, int *rsvd_idx, | 150 | + |
121 | + struct tdmr_info *tdmr_array, | 151 | +/* |
122 | + int tdmr_num) | 152 | + * Go through @tdmr_list to find all PAMTs. If any of those PAMTs |
153 | + * overlaps with @tdmr, set up a TDMR reserved area to cover the | ||
154 | + * overlapping part. | ||
155 | + */ | ||
156 | +static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list, | ||
157 | + struct tdmr_info *tdmr, | ||
158 | + int *rsvd_idx, | ||
159 | + u16 max_reserved_per_tdmr) | ||
123 | +{ | 160 | +{ |
124 | + int i, ret; | 161 | + int i, ret; |
125 | + | 162 | + |
126 | + /* | 163 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
127 | + * If any PAMT overlaps with this TDMR, the overlapping part | 164 | + struct tdmr_info *tmp = tdmr_entry(tdmr_list, i); |
128 | + * must also be put to the reserved area too. Walk over all | 165 | + unsigned long pamt_base, pamt_size, pamt_end; |
129 | + * TDMRs to find out those overlapping PAMTs and put them to | 166 | + |
130 | + * reserved areas. | 167 | + tdmr_get_pamt(tmp, &pamt_base, &pamt_size); |
131 | + */ | ||
132 | + for (i = 0; i < tdmr_num; i++) { | ||
133 | + struct tdmr_info *tmp = tdmr_array_entry(tdmr_array, i); | ||
134 | + unsigned long pamt_start_pfn, pamt_npages; | ||
135 | + u64 pamt_start, pamt_end; | ||
136 | + | ||
137 | + tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages); | ||
138 | + /* Each TDMR must already have PAMT allocated */ | 168 | + /* Each TDMR must already have PAMT allocated */ |
139 | + WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn); | 169 | + WARN_ON_ONCE(!pamt_size || !pamt_base); |
140 | + | 170 | + |
141 | + pamt_start = pamt_start_pfn << PAGE_SHIFT; | 171 | + pamt_end = pamt_base + pamt_size; |
142 | + pamt_end = pamt_start + (pamt_npages << PAGE_SHIFT); | ||
143 | + | ||
144 | + /* Skip PAMTs outside of the given TDMR */ | 172 | + /* Skip PAMTs outside of the given TDMR */ |
145 | + if ((pamt_end <= tdmr_start(tdmr)) || | 173 | + if ((pamt_end <= tdmr->base) || |
146 | + (pamt_start >= tdmr_end(tdmr))) | 174 | + (pamt_base >= tdmr_end(tdmr))) |
147 | + continue; | 175 | + continue; |
148 | + | 176 | + |
149 | + /* Only mark the part within the TDMR as reserved */ | 177 | + /* Only mark the part within the TDMR as reserved */ |
150 | + if (pamt_start < tdmr_start(tdmr)) | 178 | + if (pamt_base < tdmr->base) |
151 | + pamt_start = tdmr_start(tdmr); | 179 | + pamt_base = tdmr->base; |
152 | + if (pamt_end > tdmr_end(tdmr)) | 180 | + if (pamt_end > tdmr_end(tdmr)) |
153 | + pamt_end = tdmr_end(tdmr); | 181 | + pamt_end = tdmr_end(tdmr); |
154 | + | 182 | + |
155 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start, | 183 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_base, |
156 | + pamt_end - pamt_start); | 184 | + pamt_end - pamt_base, |
185 | + max_reserved_per_tdmr); | ||
157 | + if (ret) | 186 | + if (ret) |
158 | + return ret; | 187 | + return ret; |
159 | + } | 188 | + } |
160 | + | 189 | + |
161 | + return 0; | 190 | + return 0; |
... | ... | ||
170 | + if (r1->offset + r1->size <= r2->offset) | 199 | + if (r1->offset + r1->size <= r2->offset) |
171 | + return -1; | 200 | + return -1; |
172 | + if (r1->offset >= r2->offset + r2->size) | 201 | + if (r1->offset >= r2->offset + r2->size) |
173 | + return 1; | 202 | + return 1; |
174 | + | 203 | + |
175 | + /* Reserved areas cannot overlap. The caller should guarantee. */ | 204 | + /* Reserved areas cannot overlap. The caller must guarantee. */ |
176 | + WARN_ON_ONCE(1); | 205 | + WARN_ON_ONCE(1); |
177 | + return -1; | 206 | + return -1; |
178 | +} | 207 | +} |
179 | + | 208 | + |
180 | +/* Set up reserved areas for a TDMR, including memory holes and PAMTs */ | 209 | +/* |
181 | +static int tdmr_set_up_rsvd_areas(struct tdmr_info *tdmr, | 210 | + * Populate reserved areas for the given @tdmr, including memory holes |
182 | + struct tdmr_info *tdmr_array, | 211 | + * (via @tmb_list) and PAMTs (via @tdmr_list). |
183 | + int tdmr_num) | 212 | + */ |
213 | +static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr, | ||
214 | + struct list_head *tmb_list, | ||
215 | + struct tdmr_info_list *tdmr_list, | ||
216 | + u16 max_reserved_per_tdmr) | ||
184 | +{ | 217 | +{ |
185 | + int ret, rsvd_idx = 0; | 218 | + int ret, rsvd_idx = 0; |
186 | + | 219 | + |
187 | + /* Put all memory holes within the TDMR into reserved areas */ | 220 | + ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx, |
188 | + ret = tdmr_set_up_memory_hole_rsvd_areas(tdmr, &rsvd_idx); | 221 | + max_reserved_per_tdmr); |
189 | + if (ret) | 222 | + if (ret) |
190 | + return ret; | 223 | + return ret; |
191 | + | 224 | + |
192 | + /* Put all (overlapping) PAMTs within the TDMR into reserved areas */ | 225 | + ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx, |
193 | + ret = tdmr_set_up_pamt_rsvd_areas(tdmr, &rsvd_idx, tdmr_array, tdmr_num); | 226 | + max_reserved_per_tdmr); |
194 | + if (ret) | 227 | + if (ret) |
195 | + return ret; | 228 | + return ret; |
196 | + | 229 | + |
197 | + /* TDX requires reserved areas listed in address ascending order */ | 230 | + /* TDX requires reserved areas listed in address ascending order */ |
198 | + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area), | 231 | + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area), |
199 | + rsvd_area_cmp_func, NULL); | 232 | + rsvd_area_cmp_func, NULL); |
200 | + | 233 | + |
201 | + return 0; | 234 | + return 0; |
202 | +} | 235 | +} |
203 | + | 236 | + |
204 | +static int tdmrs_set_up_rsvd_areas_all(struct tdmr_info *tdmr_array, | 237 | +/* |
205 | + int tdmr_num) | 238 | + * Populate reserved areas for all TDMRs in @tdmr_list, including memory |
239 | + * holes (via @tmb_list) and PAMTs. | ||
240 | + */ | ||
241 | +static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list, | ||
242 | + struct list_head *tmb_list, | ||
243 | + u16 max_reserved_per_tdmr) | ||
206 | +{ | 244 | +{ |
207 | + int i; | 245 | + int i; |
208 | + | 246 | + |
209 | + for (i = 0; i < tdmr_num; i++) { | 247 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
210 | + int ret; | 248 | + int ret; |
211 | + | 249 | + |
212 | + ret = tdmr_set_up_rsvd_areas(tdmr_array_entry(tdmr_array, i), | 250 | + ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i), |
213 | + tdmr_array, tdmr_num); | 251 | + tmb_list, tdmr_list, max_reserved_per_tdmr); |
214 | + if (ret) | 252 | + if (ret) |
215 | + return ret; | 253 | + return ret; |
216 | + } | 254 | + } |
217 | + | 255 | + |
218 | + return 0; | 256 | + return 0; |
219 | +} | 257 | +} |
220 | + | 258 | + |
221 | /* | 259 | /* |
222 | * Construct an array of TDMRs to cover all TDX memory ranges. | 260 | * Construct a list of TDMRs on the preallocated space in @tdmr_list |
223 | * The actual number of TDMRs is kept to @tdmr_num. | 261 | * to cover all TDX memory regions in @tmb_list based on the TDX module |
224 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 262 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, |
263 | sysinfo->pamt_entry_size); | ||
225 | if (ret) | 264 | if (ret) |
226 | goto err; | 265 | return ret; |
227 | 266 | - /* | |
228 | - /* Return -EINVAL until constructing TDMRs is done */ | 267 | - * TODO: |
229 | - ret = -EINVAL; | 268 | - * |
230 | + ret = tdmrs_set_up_rsvd_areas_all(tdmr_array, *tdmr_num); | 269 | - * - Designate reserved areas for each TDMR. |
270 | - * | ||
271 | - * Return -EINVAL until constructing TDMRs is done | ||
272 | - */ | ||
273 | - return -EINVAL; | ||
274 | + | ||
275 | + ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list, | ||
276 | + sysinfo->max_reserved_per_tdmr); | ||
231 | + if (ret) | 277 | + if (ret) |
232 | + goto err_free_pamts; | 278 | + tdmrs_free_pamt_all(tdmr_list); |
233 | + | 279 | + |
234 | + return 0; | 280 | + return ret; |
235 | +err_free_pamts: | 281 | } |
236 | tdmrs_free_pamt_all(tdmr_array, *tdmr_num); | 282 | |
237 | err: | 283 | static int init_tdx_module(void) |
238 | return ret; | ||
239 | -- | 284 | -- |
240 | 2.38.1 | 285 | 2.41.0 | diff view generated by jsdifflib |
1 | After the TDX-usable memory regions are constructed in an array of TDMRs | 1 | The TDX module uses a private KeyID as the "global KeyID" for mapping |
---|---|---|---|
2 | and the global KeyID is reserved, configure them to the TDX module using | 2 | things like the PAMT and other TDX metadata. This KeyID has already |
3 | TDH.SYS.CONFIG SEAMCALL. TDH.SYS.CONFIG can only be called once and can | 3 | been reserved when detecting TDX during the kernel early boot. |
4 | be done on any logical cpu. | ||
5 | 4 | ||
5 | After the list of "TD Memory Regions" (TDMRs) has been constructed to | ||
6 | cover all TDX-usable memory regions, the next step is to pass them to | ||
7 | the TDX module together with the global KeyID. | ||
8 | |||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
6 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 10 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
7 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 11 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
12 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
8 | --- | 13 | --- |
9 | arch/x86/virt/vmx/tdx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++ | 14 | |
15 | v13 -> v14: | ||
16 | - No change | ||
17 | |||
18 | v12 -> v13: | ||
19 | - Added Yuan's tag. | ||
20 | |||
21 | v11 -> v12: | ||
22 | - Added Kirill's tag | ||
23 | |||
24 | v10 -> v11: | ||
25 | - No update | ||
26 | |||
27 | v9 -> v10: | ||
28 | - Code change due to changing static 'tdx_tdmr_list' to local 'tdmr_list'. | ||
29 | |||
30 | v8 -> v9: | ||
31 | - Improved changelog to explain why initializing TDMRs can take a long | ||
32 | time (Dave). | ||
33 | - Improved comments around 'next-to-initialize' address (Dave). | ||
34 | |||
35 | v7 -> v8: (Dave) | ||
36 | - Changelog: | ||
37 | - explicitly call out this is the last step of TDX module initialization. | ||
38 | - Trimmed down changelog by removing SEAMCALL name and details. | ||
39 | - Removed/trimmed down unnecessary comments. | ||
40 | - Other changes due to 'struct tdmr_info_list'. | ||
41 | |||
42 | v6 -> v7: | ||
43 | - Removed need_resched() check. -- Andi. | ||
44 | |||
45 | --- | ||
46 | arch/x86/virt/vmx/tdx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++- | ||
10 | arch/x86/virt/vmx/tdx/tdx.h | 2 ++ | 47 | arch/x86/virt/vmx/tdx/tdx.h | 2 ++ |
11 | 2 files changed, 39 insertions(+) | 48 | 2 files changed, 44 insertions(+), 1 deletion(-) |
12 | 49 | ||
13 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 50 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
14 | index XXXXXXX..XXXXXXX 100644 | 51 | index XXXXXXX..XXXXXXX 100644 |
15 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 52 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
16 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 53 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
17 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 54 | @@ -XXX,XX +XXX,XX @@ |
55 | #include <linux/pfn.h> | ||
56 | #include <linux/align.h> | ||
57 | #include <linux/sort.h> | ||
58 | +#include <linux/log2.h> | ||
59 | #include <asm/msr-index.h> | ||
60 | #include <asm/msr.h> | ||
61 | #include <asm/page.h> | ||
62 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, | ||
18 | return ret; | 63 | return ret; |
19 | } | 64 | } |
20 | 65 | ||
21 | +static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num, | 66 | +static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) |
22 | + u64 global_keyid) | ||
23 | +{ | 67 | +{ |
68 | + struct tdx_module_args args = {}; | ||
24 | + u64 *tdmr_pa_array; | 69 | + u64 *tdmr_pa_array; |
25 | + int i, array_sz; | 70 | + size_t array_sz; |
26 | + u64 ret; | 71 | + int i, ret; |
27 | + | 72 | + |
28 | + /* | 73 | + /* |
29 | + * TDMR_INFO entries are configured to the TDX module via an | 74 | + * TDMRs are passed to the TDX module via an array of physical |
30 | + * array of the physical address of each TDMR_INFO. TDX module | 75 | + * addresses of each TDMR. The array itself also has certain |
31 | + * requires the array itself to be 512-byte aligned. Round up | 76 | + * alignment requirement. |
32 | + * the array size to 512-byte aligned so the buffer allocated | ||
33 | + * by kzalloc() will meet the alignment requirement. | ||
34 | + */ | 77 | + */ |
35 | + array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT); | 78 | + array_sz = tdmr_list->nr_consumed_tdmrs * sizeof(u64); |
79 | + array_sz = roundup_pow_of_two(array_sz); | ||
80 | + if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT) | ||
81 | + array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT; | ||
82 | + | ||
36 | + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL); | 83 | + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL); |
37 | + if (!tdmr_pa_array) | 84 | + if (!tdmr_pa_array) |
38 | + return -ENOMEM; | 85 | + return -ENOMEM; |
39 | + | 86 | + |
40 | + for (i = 0; i < tdmr_num; i++) | 87 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) |
41 | + tdmr_pa_array[i] = __pa(tdmr_array_entry(tdmr_array, i)); | 88 | + tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i)); |
42 | + | 89 | + |
43 | + ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_num, | 90 | + args.rcx = __pa(tdmr_pa_array); |
44 | + global_keyid, 0, NULL, NULL); | 91 | + args.rdx = tdmr_list->nr_consumed_tdmrs; |
92 | + args.r8 = global_keyid; | ||
93 | + ret = seamcall_prerr(TDH_SYS_CONFIG, &args); | ||
45 | + | 94 | + |
46 | + /* Free the array as it is not required anymore. */ | 95 | + /* Free the array as it is not required anymore. */ |
47 | + kfree(tdmr_pa_array); | 96 | + kfree(tdmr_pa_array); |
48 | + | 97 | + |
49 | + return ret; | 98 | + return ret; |
50 | +} | 99 | +} |
51 | + | 100 | + |
52 | /* | 101 | static int init_tdx_module(void) |
53 | * Detect and initialize the TDX module. | 102 | { |
54 | * | 103 | struct tdsysinfo_struct *tdsysinfo; |
55 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 104 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
56 | */ | 105 | if (ret) |
57 | tdx_global_keyid = tdx_keyid_start; | 106 | goto out_free_tdmrs; |
58 | 107 | ||
59 | + /* Pass the TDMRs and the global KeyID to the TDX module */ | 108 | + /* Pass the TDMRs and the global KeyID to the TDX module */ |
60 | + ret = config_tdx_module(tdmr_array, tdmr_num, tdx_global_keyid); | 109 | + ret = config_tdx_module(&tdmr_list, tdx_global_keyid); |
61 | + if (ret) | 110 | + if (ret) |
62 | + goto out_free_pamts; | 111 | + goto out_free_pamts; |
63 | + | 112 | + |
64 | /* | 113 | /* |
65 | * Return -EINVAL until all steps of TDX module initialization | 114 | * TODO: |
66 | * process are done. | 115 | * |
116 | - * - Configure the TDMRs and the global KeyID to the TDX module. | ||
117 | * - Configure the global KeyID on all packages. | ||
118 | * - Initialize all TDMRs. | ||
119 | * | ||
120 | * Return error before all steps are done. | ||
67 | */ | 121 | */ |
68 | ret = -EINVAL; | 122 | ret = -EINVAL; |
69 | +out_free_pamts: | 123 | +out_free_pamts: |
70 | if (ret) | 124 | if (ret) |
71 | tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 125 | tdmrs_free_pamt_all(&tdmr_list); |
72 | else | 126 | else |
73 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 127 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
74 | index XXXXXXX..XXXXXXX 100644 | 128 | index XXXXXXX..XXXXXXX 100644 |
75 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 129 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
76 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 130 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
77 | @@ -XXX,XX +XXX,XX @@ | 131 | @@ -XXX,XX +XXX,XX @@ |
132 | #define TDH_SYS_INFO 32 | ||
78 | #define TDH_SYS_INIT 33 | 133 | #define TDH_SYS_INIT 33 |
79 | #define TDH_SYS_LP_INIT 35 | 134 | #define TDH_SYS_LP_INIT 35 |
80 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
81 | +#define TDH_SYS_CONFIG 45 | 135 | +#define TDH_SYS_CONFIG 45 |
82 | 136 | ||
83 | struct cmr_info { | 137 | struct cmr_info { |
84 | u64 base; | 138 | u64 base; |
85 | @@ -XXX,XX +XXX,XX @@ struct tdmr_reserved_area { | 139 | @@ -XXX,XX +XXX,XX @@ struct tdmr_reserved_area { |
... | ... | ||
89 | +#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512 | 143 | +#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512 |
90 | 144 | ||
91 | struct tdmr_info { | 145 | struct tdmr_info { |
92 | u64 base; | 146 | u64 base; |
93 | -- | 147 | -- |
94 | 2.38.1 | 148 | 2.41.0 | diff view generated by jsdifflib |
1 | After the array of TDMRs and the global KeyID are configured to the TDX | 1 | After the list of TDMRs and the global KeyID are configured to the TDX |
---|---|---|---|
2 | module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID | 2 | module, the kernel needs to configure the key of the global KeyID on all |
3 | on all packages. | 3 | packages using TDH.SYS.KEY.CONFIG. |
4 | 4 | ||
5 | TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And | 5 | This SEAMCALL cannot run in parallel on different cpus. Loop all online |
6 | it cannot run concurrently on different CPUs. Implement a helper to | 6 | cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of |
7 | run SEAMCALL on one cpu for each package one by one, and use it to | 7 | each package. |
8 | configure the global KeyID on all packages. | 8 | |
9 | To keep things simple, this implementation takes no affirmative steps to | ||
10 | online cpus to make sure there's at least one cpu for each package. The | ||
11 | callers (i.e. KVM) can ensure success by making sure sufficient CPUs | ||
12 | are online. | ||
9 | 13 | ||
10 | Intel hardware doesn't guarantee cache coherency across different | 14 | Intel hardware doesn't guarantee cache coherency across different |
11 | KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated | 15 | KeyIDs. The PAMTs are transitioning from being used by the kernel |
12 | with KeyID 0) before the TDX module uses the global KeyID to access the | 16 | mapping (KeyID 0) to the TDX module's "global KeyID" mapping. |
13 | PAMT. Following the TDX module specification, flush cache before | 17 | |
14 | configuring the global KeyID on all packages. | 18 | This means that the kernel must flush any dirty KeyID-0 PAMT cachelines |
15 | 19 | before the TDX module uses the global KeyID to access the PAMTs. | |
16 | Given the PAMT size can be large (~1/256th of system RAM), just use | 20 | Otherwise, if those dirty cachelines were written back, they would |
17 | WBINVD on all CPUs to flush. | 21 | corrupt the TDX module's metadata. Aside: This corruption would be |
18 | 22 | detected by the memory integrity hardware on the next read of the memory | |
19 | Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have | 23 | with the global KeyID. The result would likely be fatal to the system |
20 | used the global KeyID to write any PAMT. Therefore, use WBINVD | 24 | but would not impact TDX security. |
21 | to flush cache before freeing the PAMTs back to the kernel. Note using | 25 | |
22 | MOVDIR64B (which changes the page's associated KeyID from the old TDX | 26 | Following the TDX module specification, flush cache before configuring |
23 | private KeyID back to KeyID 0, which is used by the kernel) to clear | 27 | the global KeyID on all packages. Given the PAMT size can be large |
24 | PAMTs isn't needed, as KeyID 0 doesn't support integrity check. | 28 | (~1/256th of system RAM), just use WBINVD on all CPUs to flush. |
25 | 29 | ||
30 | If TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the | ||
31 | global KeyID to write the PAMTs. Therefore, use WBINVD to flush cache | ||
32 | before returning the PAMTs back to the kernel. Also convert all PAMTs | ||
33 | back to normal by using MOVDIR64B as suggested by the TDX module spec, | ||
34 | although on the platform without the "partial write machine check" | ||
35 | erratum it's OK to leave PAMTs as is. | ||
36 | |||
37 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
26 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 38 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
27 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 39 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
40 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
28 | --- | 41 | --- |
29 | 42 | ||
30 | v6 -> v7: | 43 | v13 -> v14: |
31 | - Improved changelog and comment to explain why MOVDIR64B isn't used | 44 | - No change |
32 | when returning PAMTs back to the kernel. | 45 | |
46 | v12 -> v13: | ||
47 | - Added Yuan's tag. | ||
48 | |||
49 | v11 -> v12: | ||
50 | - Added Kirill's tag | ||
51 | - Improved changelog (Nikolay) | ||
52 | |||
53 | v10 -> v11: | ||
54 | - Convert PAMTs back to normal when module initialization fails. | ||
55 | - Fixed an error in changelog | ||
56 | |||
57 | v9 -> v10: | ||
58 | - Changed to use 'smp_call_on_cpu()' directly to do key configuration. | ||
59 | |||
60 | v8 -> v9: | ||
61 | - Improved changelog (Dave). | ||
62 | - Improved comments to explain the function to configure global KeyID | ||
63 | "takes no affirmative action to online any cpu". (Dave). | ||
64 | - Improved other comments suggested by Dave. | ||
65 | |||
66 | v7 -> v8: (Dave) | ||
67 | - Changelog changes: | ||
68 | - Point out this is the step of "multi-steps" of init_tdx_module(). | ||
69 | - Removed MOVDIR64B part. | ||
70 | - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT. | ||
71 | - Changed to loop over online cpus and use smp_call_function_single() | ||
72 | directly as the patch to shut down TDX module has been removed. | ||
73 | - Removed MOVDIR64B part in comment. | ||
33 | 74 | ||
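The reset_tdx_pages() helper in the diff below walks each PAMT region in 64-byte strides, issuing one MOVDIR64B per cacheline to convert the pages back to normal. The following userspace sketch models just that stride arithmetic; it is not kernel code, and memcpy() merely stands in for the MOVDIR64B instruction (names like reset_region() are made up for illustration):

```c
#include <assert.h>
#include <string.h>

#define CACHELINE 64UL

/* A zeroed 64-byte source line, like the kernel's ZERO_PAGE(0). */
static const unsigned char zero_line[CACHELINE];

/* Clear a region one cacheline at a time; memcpy() stands in for
 * MOVDIR64B.  Returns how many 64-byte stores were issued. */
unsigned long reset_region(unsigned char *buf, unsigned long size)
{
    unsigned long off, nops = 0;

    for (off = 0; off < size; off += CACHELINE) {
        memcpy(buf + off, zero_line, CACHELINE);
        nops++;
    }
    return nops;
}
```

For a real 4KB page this loop would issue 4096 / 64 = 64 stores; the kernel version additionally ends with a memory barrier because MOVDIR64B uses the WC protocol.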
34 | --- | 75 | --- |
35 | arch/x86/virt/vmx/tdx/tdx.c | 89 ++++++++++++++++++++++++++++++++++++- | 76 | arch/x86/virt/vmx/tdx/tdx.c | 130 +++++++++++++++++++++++++++++++++++- |
36 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 77 | arch/x86/virt/vmx/tdx/tdx.h | 1 + |
37 | 2 files changed, 88 insertions(+), 2 deletions(-) | 78 | 2 files changed, 129 insertions(+), 2 deletions(-) |
38 | 79 | ||
39 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 80 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
40 | index XXXXXXX..XXXXXXX 100644 | 81 | index XXXXXXX..XXXXXXX 100644 |
41 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 82 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
42 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 83 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
43 | @@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | 84 | @@ -XXX,XX +XXX,XX @@ |
44 | on_each_cpu(seamcall_smp_call_function, sc, true); | 85 | #include <asm/msr-index.h> |
45 | } | 86 | #include <asm/msr.h> |
87 | #include <asm/page.h> | ||
88 | +#include <asm/special_insns.h> | ||
89 | #include <asm/tdx.h> | ||
90 | #include "tdx.h" | ||
91 | |||
92 | @@ -XXX,XX +XXX,XX @@ static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base, | ||
93 | *pamt_size = pamt_sz; | ||
94 | } | ||
95 | |||
96 | -static void tdmr_free_pamt(struct tdmr_info *tdmr) | ||
97 | +static void tdmr_do_pamt_func(struct tdmr_info *tdmr, | ||
98 | + void (*pamt_func)(unsigned long base, unsigned long size)) | ||
99 | { | ||
100 | unsigned long pamt_base, pamt_size; | ||
101 | |||
102 | @@ -XXX,XX +XXX,XX @@ static void tdmr_free_pamt(struct tdmr_info *tdmr) | ||
103 | if (WARN_ON_ONCE(!pamt_base)) | ||
104 | return; | ||
105 | |||
106 | + (*pamt_func)(pamt_base, pamt_size); | ||
107 | +} | ||
108 | + | ||
109 | +static void free_pamt(unsigned long pamt_base, unsigned long pamt_size) | ||
110 | +{ | ||
111 | free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT); | ||
112 | } | ||
113 | |||
114 | +static void tdmr_free_pamt(struct tdmr_info *tdmr) | ||
115 | +{ | ||
116 | + tdmr_do_pamt_func(tdmr, free_pamt); | ||
117 | +} | ||
118 | + | ||
119 | static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) | ||
120 | { | ||
121 | int i; | ||
122 | @@ -XXX,XX +XXX,XX @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, | ||
123 | return ret; | ||
124 | } | ||
46 | 125 | ||
47 | +/* | 126 | +/* |
48 | + * Call one SEAMCALL on one (any) cpu for each physical package in | 127 | + * Convert TDX private pages back to normal by using MOVDIR64B to |
49 | + * serialized way. Return immediately in case of any error if | 128 | + * clear these pages. Note this function doesn't flush cache of |
50 | + * SEAMCALL fails on any cpu. | 129 | + * these TDX private pages. The caller should make sure of that. |
130 | + */ | ||
131 | +static void reset_tdx_pages(unsigned long base, unsigned long size) | ||
132 | +{ | ||
133 | + const void *zero_page = (const void *)page_address(ZERO_PAGE(0)); | ||
134 | + unsigned long phys, end; | ||
135 | + | ||
136 | + end = base + size; | ||
137 | + for (phys = base; phys < end; phys += 64) | ||
138 | + movdir64b(__va(phys), zero_page); | ||
139 | + | ||
140 | + /* | ||
141 | + * MOVDIR64B uses WC protocol. Use memory barrier to | ||
142 | + * make sure any later user of these pages sees the | ||
143 | + * updated data. | ||
144 | + */ | ||
145 | + mb(); | ||
146 | +} | ||
147 | + | ||
148 | +static void tdmr_reset_pamt(struct tdmr_info *tdmr) | ||
149 | +{ | ||
150 | + tdmr_do_pamt_func(tdmr, reset_tdx_pages); | ||
151 | +} | ||
152 | + | ||
153 | +static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list) | ||
154 | +{ | ||
155 | + int i; | ||
156 | + | ||
157 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) | ||
158 | + tdmr_reset_pamt(tdmr_entry(tdmr_list, i)); | ||
159 | +} | ||
160 | + | ||
161 | static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) | ||
162 | { | ||
163 | unsigned long pamt_size = 0; | ||
164 | @@ -XXX,XX +XXX,XX @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) | ||
165 | return ret; | ||
166 | } | ||
167 | |||
168 | +static int do_global_key_config(void *data) | ||
169 | +{ | ||
170 | + struct tdx_module_args args = {}; | ||
171 | + | ||
172 | + return seamcall_prerr(TDH_SYS_KEY_CONFIG, &args); | ||
173 | +} | ||
174 | + | ||
175 | +/* | ||
176 | + * Attempt to configure the global KeyID on all physical packages. | ||
51 | + * | 177 | + * |
52 | + * Note for serialized calls 'struct seamcall_ctx::err' doesn't have | 178 | + * This requires running code on at least one CPU in each package. If a |
53 | + * to be atomic, but for simplicity just reuse it instead of adding | 179 | + * package has no online CPUs, that code will not run and TDX module |
54 | + * a new one. | 180 | + * initialization (TDMR initialization) will fail. |
181 | + * | ||
182 | + * This code takes no affirmative steps to online CPUs. Callers (aka. | ||
183 | + * KVM) can ensure success by ensuring sufficient CPUs are online for | ||
184 | + * this to succeed. | ||
55 | + */ | 185 | + */ |
56 | +static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc) | 186 | +static int config_global_keyid(void) |
57 | +{ | 187 | +{ |
58 | + cpumask_var_t packages; | 188 | + cpumask_var_t packages; |
59 | + int cpu, ret = 0; | 189 | + int cpu, ret = -EINVAL; |
60 | + | 190 | + |
61 | + if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) | 191 | + if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) |
62 | + return -ENOMEM; | 192 | + return -ENOMEM; |
63 | + | 193 | + |
64 | + for_each_online_cpu(cpu) { | 194 | + for_each_online_cpu(cpu) { |
65 | + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu), | 195 | + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu), |
66 | + packages)) | 196 | + packages)) |
67 | + continue; | 197 | + continue; |
68 | + | 198 | + |
69 | + ret = smp_call_function_single(cpu, seamcall_smp_call_function, | ||
70 | + sc, true); | ||
71 | + if (ret) | ||
72 | + break; | ||
73 | + | ||
74 | + /* | 199 | + /* |
75 | + * Doesn't have to use atomic_read(), but it doesn't | 200 | + * TDH.SYS.KEY.CONFIG cannot run concurrently on |
76 | + * hurt either. | 201 | + * different cpus, so just do it one by one. |
77 | + */ | 202 | + */ |
78 | + ret = atomic_read(&sc->err); | 203 | + ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true); |
79 | + if (ret) | 204 | + if (ret) |
80 | + break; | 205 | + break; |
81 | + } | 206 | + } |
82 | + | 207 | + |
83 | + free_cpumask_var(packages); | 208 | + free_cpumask_var(packages); |
84 | + return ret; | 209 | + return ret; |
85 | +} | 210 | +} |
86 | + | 211 | + |
87 | static int tdx_module_init_cpus(void) | 212 | static int init_tdx_module(void) |
88 | { | 213 | { |
89 | struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT }; | 214 | struct tdsysinfo_struct *tdsysinfo; |
90 | @@ -XXX,XX +XXX,XX @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num, | ||
91 | return ret; | ||
92 | } | ||
93 | |||
94 | +static int config_global_keyid(void) | ||
95 | +{ | ||
96 | + struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG }; | ||
97 | + | ||
98 | + /* | ||
99 | + * Configure the key of the global KeyID on all packages by | ||
100 | + * calling TDH.SYS.KEY.CONFIG on all packages in a serialized | ||
101 | + * way as it cannot run concurrently on different CPUs. | ||
102 | + * | ||
103 | + * TDH.SYS.KEY.CONFIG may fail with entropy error (which is | ||
104 | + * a recoverable error). Assume this is exceedingly rare and | ||
105 | + * just return error if encountered instead of retrying. | ||
106 | + */ | ||
107 | + return seamcall_on_each_package_serialized(&sc); | ||
108 | +} | ||
109 | + | ||
110 | /* | ||
111 | * Detect and initialize the TDX module. | ||
112 | * | ||
113 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 215 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
114 | if (ret) | 216 | if (ret) |
115 | goto out_free_pamts; | 217 | goto out_free_pamts; |
116 | 218 | ||
117 | + /* | 219 | + /* |
118 | + * Hardware doesn't guarantee cache coherency across different | 220 | + * Hardware doesn't guarantee cache coherency across different |
119 | + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines | 221 | + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines |
120 | + * (associated with KeyID 0) before the TDX module can use the | 222 | + * (associated with KeyID 0) before the TDX module can use the |
121 | + * global KeyID to access the PAMT. Given PAMTs are potentially | 223 | + * global KeyID to access the PAMT. Given PAMTs are potentially |
122 | + * large (~1/256th of system RAM), just use WBINVD on all cpus | 224 | + * large (~1/256th of system RAM), just use WBINVD on all cpus |
123 | + * to flush the cache. | 225 | + * to flush the cache. |
124 | + * | ||
125 | + * Follow the TDX spec to flush cache before configuring the | ||
126 | + * global KeyID on all packages. | ||
127 | + */ | 226 | + */ |
128 | + wbinvd_on_all_cpus(); | 227 | + wbinvd_on_all_cpus(); |
129 | + | 228 | + |
130 | + /* Config the key of global KeyID on all packages */ | 229 | + /* Config the key of global KeyID on all packages */ |
131 | + ret = config_global_keyid(); | 230 | + ret = config_global_keyid(); |
132 | + if (ret) | 231 | + if (ret) |
133 | + goto out_free_pamts; | 232 | + goto out_reset_pamts; |
134 | + | 233 | + |
135 | /* | 234 | /* |
136 | * Return -EINVAL until all steps of TDX module initialization | 235 | * TODO: |
137 | * process are done. | 236 | * |
237 | - * - Configure the global KeyID on all packages. | ||
238 | * - Initialize all TDMRs. | ||
239 | * | ||
240 | * Return error before all steps are done. | ||
138 | */ | 241 | */ |
139 | ret = -EINVAL; | 242 | ret = -EINVAL; |
140 | out_free_pamts: | 243 | +out_reset_pamts: |
141 | - if (ret) | ||
142 | + if (ret) { | 244 | + if (ret) { |
143 | + /* | 245 | + /* |
144 | + * Part of PAMT may already have been initialized by | 246 | + * Part of PAMTs may already have been initialized by the |
145 | + * TDX module. Flush cache before returning PAMT back | 247 | + * TDX module. Flush cache before returning PAMTs back |
146 | + * to the kernel. | 248 | + * to the kernel. |
147 | + * | ||
148 | + * Note there's no need to do MOVDIR64B (which changes | ||
149 | + * the page's associated KeyID from the old TDX private | ||
150 | + * KeyID back to KeyID 0, which is used by the kernel), | ||
151 | + * as KeyID 0 doesn't support integrity check. | ||
152 | + */ | 249 | + */ |
153 | + wbinvd_on_all_cpus(); | 250 | + wbinvd_on_all_cpus(); |
154 | tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 251 | + /* |
155 | - else | 252 | + * According to the TDX hardware spec, if the platform |
156 | + } else | 253 | + * doesn't have the "partial write machine check" |
157 | pr_info("%lu pages allocated for PAMT.\n", | 254 | + * erratum, any kernel read/write will never cause #MC |
158 | tdmrs_count_pamt_pages(tdmr_array, tdmr_num)); | 255 | + * in kernel space, thus it's OK to not convert PAMTs |
159 | out_free_tdmrs: | 256 | + * back to normal. But do the conversion anyway here |
257 | + * as suggested by the TDX spec. | ||
258 | + */ | ||
259 | + tdmrs_reset_pamt_all(&tdmr_list); | ||
260 | + } | ||
261 | out_free_pamts: | ||
262 | if (ret) | ||
263 | tdmrs_free_pamt_all(&tdmr_list); | ||
264 | @@ -XXX,XX +XXX,XX @@ static int __tdx_enable(void) | ||
265 | * lock to prevent any new cpu from becoming online; 2) done both VMXON | ||
266 | * and tdx_cpu_enable() on all online cpus. | ||
267 | * | ||
268 | + * This function requires there's at least one online cpu for each CPU | ||
269 | + * package to succeed. | ||
270 | + * | ||
271 | * This function can be called in parallel by multiple callers. | ||
272 | * | ||
273 | * Return 0 if TDX is enabled successfully, otherwise error. | ||
160 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 274 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
161 | index XXXXXXX..XXXXXXX 100644 | 275 | index XXXXXXX..XXXXXXX 100644 |
162 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 276 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
163 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 277 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
164 | @@ -XXX,XX +XXX,XX @@ | 278 | @@ -XXX,XX +XXX,XX @@ |
... | ... | ||
168 | +#define TDH_SYS_KEY_CONFIG 31 | 282 | +#define TDH_SYS_KEY_CONFIG 31 |
169 | #define TDH_SYS_INFO 32 | 283 | #define TDH_SYS_INFO 32 |
170 | #define TDH_SYS_INIT 33 | 284 | #define TDH_SYS_INIT 33 |
171 | #define TDH_SYS_LP_INIT 35 | 285 | #define TDH_SYS_LP_INIT 35 |
172 | -- | 286 | -- |
173 | 2.38.1 | 287 | 2.41.0 | diff view generated by jsdifflib |
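The config_global_keyid() loop above runs TDH.SYS.KEY.CONFIG serially on one (any) online CPU per physical package, using a cpumask to skip packages it has already visited. A minimal userspace sketch of that deduplication follows; the cpu_to_package[] table is a hypothetical stand-in for topology_physical_package_id(), and the counter stands in for the serialized smp_call_on_cpu() invocation:

```c
#include <assert.h>

#define NR_CPUS 8

/* Hypothetical topology: which physical package each CPU lives in. */
const int cpu_to_package[NR_CPUS] = { 0, 0, 1, 1, 0, 1, 2, 2 };

/* Count how many per-package "SEAMCALLs" the loop issues:
 * exactly one for each distinct package. */
int count_package_calls(void)
{
    unsigned long seen = 0;   /* bitmask standing in for the cpumask */
    int cpu, calls = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        unsigned long bit = 1UL << cpu_to_package[cpu];

        if (seen & bit)
            continue;          /* this package is already configured */
        seen |= bit;
        calls++;               /* stand-in for smp_call_on_cpu() */
    }
    return calls;
}
```

With the sample table above (packages 0, 1 and 2), the loop makes three calls no matter how many CPUs each package has online, which is why a package with no online CPUs would make the real initialization fail.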
1 | Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the | 1 | After the global KeyID has been configured on all packages, initialize |
---|---|---|---|
2 | TDX initialization. | 2 | all TDMRs to make all TDX-usable memory regions that are passed to the |
3 | TDX module become usable. | ||
3 | 4 | ||
4 | All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before | 5 | This is the last step of initializing the TDX module. |
5 | the memory pages can be used by the TDX module. The time to initialize | ||
6 | TDMR is proportional to the size of the TDMR because TDH.SYS.TDMR.INIT | ||
7 | internally initializes the PAMT entries using the global KeyID. | ||
8 | 6 | ||
9 | To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only | 7 | Initializing TDMRs can be time consuming on large memory systems as it |
10 | initializes an (implementation-specific) subset of PAMT entries of one | 8 | involves initializing all metadata entries for all pages that can be |
11 | TDMR in one invocation. The caller needs to call TDH.SYS.TDMR.INIT | 9 | used by TDX guests. Initializing different TDMRs can be parallelized. |
12 | iteratively until all PAMT entries of the given TDMR are initialized. | 10 | For now to keep it simple, just initialize all TDMRs one by one. It can |
11 | be enhanced in the future. | ||
13 | 12 | ||
14 | TDH.SYS.TDMR.INITs can run concurrently on multiple CPUs as long as they | 13 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
15 | are initializing different TDMRs. To keep it simple, just initialize | 14 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
16 | all TDMRs one by one. On a 2-socket machine with 2.2G CPUs and 64GB | 15 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
17 | memory, each TDH.SYS.TDMR.INIT roughly takes couple of microseconds on | 16 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> |
18 | average, and it takes roughly dozens of milliseconds to complete the | 17 | --- |
19 | initialization of all TDMRs while system is idle. | ||
20 | 18 | ||
21 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 19 | v13 -> v14: |
22 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 20 | - No change |
23 | --- | 21 | |
22 | v12 -> v13: | ||
23 | - Added Yuan's tag. | ||
24 | |||
25 | v11 -> v12: | ||
26 | - Added Kirill's tag | ||
27 | |||
28 | v10 -> v11: | ||
29 | - No update | ||
30 | |||
31 | v9 -> v10: | ||
32 | - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'. | ||
33 | |||
34 | v8 -> v9: | ||
35 | - Improved changelog to explain why initializing TDMRs can take a long | ||
36 | time (Dave). | ||
37 | - Improved comments around 'next-to-initialize' address (Dave). | ||
38 | |||
39 | v7 -> v8: (Dave) | ||
40 | - Changelog: | ||
41 | - explicitly call out this is the last step of TDX module initialization. | ||
42 | - Trimed down changelog by removing SEAMCALL name and details. | ||
43 | - Removed/trimmed down unnecessary comments. | ||
44 | - Other changes due to 'struct tdmr_info_list'. | ||
24 | 45 | ||
25 | v6 -> v7: | 46 | v6 -> v7: |
26 | - Removed need_resched() check. -- Andi. | 47 | - Removed need_resched() check. -- Andi. |
27 | 48 | ||
28 | --- | 49 | --- |
29 | arch/x86/virt/vmx/tdx/tdx.c | 69 ++++++++++++++++++++++++++++++++++--- | 50 | arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++----- |
30 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 51 | arch/x86/virt/vmx/tdx/tdx.h | 1 + |
31 | 2 files changed, 65 insertions(+), 5 deletions(-) | 52 | 2 files changed, 53 insertions(+), 8 deletions(-) |
32 | 53 | ||
33 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 54 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
34 | index XXXXXXX..XXXXXXX 100644 | 55 | index XXXXXXX..XXXXXXX 100644 |
35 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 56 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
36 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 57 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
37 | @@ -XXX,XX +XXX,XX @@ static int config_global_keyid(void) | 58 | @@ -XXX,XX +XXX,XX @@ static int config_global_keyid(void) |
38 | return seamcall_on_each_package_serialized(&sc); | 59 | return ret; |
39 | } | 60 | } |
40 | 61 | ||
41 | +/* Initialize one TDMR */ | ||
42 | +static int init_tdmr(struct tdmr_info *tdmr) | 62 | +static int init_tdmr(struct tdmr_info *tdmr) |
43 | +{ | 63 | +{ |
44 | + u64 next; | 64 | + u64 next; |
45 | + | 65 | + |
46 | + /* | 66 | + /* |
47 | + * Initializing PAMT entries might be time-consuming (in | 67 | + * Initializing a TDMR can be time consuming. To avoid long |
48 | + * proportion to the size of the requested TDMR). To avoid long | 68 | + * SEAMCALLs, the TDX module may only initialize a part of the |
49 | + * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes | 69 | + * TDMR in each call. |
50 | + * an (implementation-defined) subset of PAMT entries in one | ||
51 | + * invocation. | ||
52 | + * | ||
53 | + * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries | ||
54 | + * of the requested TDMR are initialized (if next-to-initialize | ||
55 | + * address matches the end address of the TDMR). | ||
56 | + */ | 70 | + */ |
57 | + do { | 71 | + do { |
58 | + struct tdx_module_output out; | 72 | + struct tdx_module_args args = { |
73 | + .rcx = tdmr->base, | ||
74 | + }; | ||
59 | + int ret; | 75 | + int ret; |
60 | + | 76 | + |
61 | + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL, | 77 | + ret = seamcall_prerr_ret(TDH_SYS_TDMR_INIT, &args); |
62 | + &out); | ||
63 | + if (ret) | 78 | + if (ret) |
64 | + return ret; | 79 | + return ret; |
65 | + /* | 80 | + /* |
66 | + * RDX contains 'next-to-initialize' address if | 81 | + * RDX contains 'next-to-initialize' address if |
67 | + * TDH.SYS.TDMR.INIT succeeded. | 82 | + * TDH.SYS.TDMR.INIT did not fully complete and
83 | + * should be retried. | ||
68 | + */ | 84 | + */ |
69 | + next = out.rdx; | 85 | + next = args.rdx; |
70 | + /* Allow scheduling when needed */ | ||
71 | + cond_resched(); | 86 | + cond_resched(); |
87 | + /* Keep making SEAMCALLs until the TDMR is done */ | ||
72 | + } while (next < tdmr->base + tdmr->size); | 88 | + } while (next < tdmr->base + tdmr->size); |
73 | + | 89 | + |
74 | + return 0; | 90 | + return 0; |
75 | +} | 91 | +} |
76 | + | 92 | + |
77 | +/* Initialize all TDMRs */ | 93 | +static int init_tdmrs(struct tdmr_info_list *tdmr_list) |
78 | +static int init_tdmrs(struct tdmr_info *tdmr_array, int tdmr_num) | ||
79 | +{ | 94 | +{ |
80 | + int i; | 95 | + int i; |
81 | + | 96 | + |
82 | + /* | 97 | + /* |
83 | + * Initialize TDMRs one-by-one for simplicity, though the TDX | 98 | + * This operation is costly. It can be parallelized, |
84 | + * architecture does allow different TDMRs to be initialized in | 99 | + * but keep it simple for now. |
85 | + * parallel on multiple CPUs. Parallel initialization could | ||
86 | + * be added later when the time spent in the serialized scheme | ||
87 | + * becomes a real concern. | ||
88 | + */ | 100 | + */ |
89 | + for (i = 0; i < tdmr_num; i++) { | 101 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
90 | + int ret; | 102 | + int ret; |
91 | + | 103 | + |
92 | + ret = init_tdmr(tdmr_array_entry(tdmr_array, i)); | 104 | + ret = init_tdmr(tdmr_entry(tdmr_list, i)); |
93 | + if (ret) | 105 | + if (ret) |
94 | + return ret; | 106 | + return ret; |
95 | + } | 107 | + } |
96 | + | 108 | + |
97 | + return 0; | 109 | + return 0; |
98 | +} | 110 | +} |
99 | + | 111 | + |
100 | /* | 112 | static int init_tdx_module(void) |
101 | * Detect and initialize the TDX module. | 113 | { |
102 | * | 114 | struct tdsysinfo_struct *tdsysinfo; |
103 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 115 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
104 | if (ret) | 116 | if (ret) |
105 | goto out_free_pamts; | 117 | goto out_reset_pamts; |
106 | 118 | ||
107 | - /* | 119 | - /* |
108 | - * Return -EINVAL until all steps of TDX module initialization | 120 | - * TODO: |
109 | - * process are done. | 121 | - * |
122 | - * - Initialize all TDMRs. | ||
123 | - * | ||
124 | - * Return error before all steps are done. | ||
110 | - */ | 125 | - */ |
111 | - ret = -EINVAL; | 126 | - ret = -EINVAL; |
112 | + /* Initialize TDMRs to complete the TDX module initialization */ | 127 | + /* Initialize TDMRs to complete the TDX module initialization */ |
113 | + ret = init_tdmrs(tdmr_array, tdmr_num); | 128 | + ret = init_tdmrs(&tdmr_list); |
114 | + if (ret) | 129 | out_reset_pamts: |
115 | + goto out_free_pamts; | ||
116 | + | ||
117 | out_free_pamts: | ||
118 | if (ret) { | 130 | if (ret) { |
119 | /* | 131 | /* |
120 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 132 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
121 | index XXXXXXX..XXXXXXX 100644 | 133 | index XXXXXXX..XXXXXXX 100644 |
122 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 134 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
123 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 135 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
124 | @@ -XXX,XX +XXX,XX @@ | 136 | @@ -XXX,XX +XXX,XX @@ |
125 | #define TDH_SYS_INFO 32 | 137 | #define TDH_SYS_INFO 32 |
126 | #define TDH_SYS_INIT 33 | 138 | #define TDH_SYS_INIT 33 |
127 | #define TDH_SYS_LP_INIT 35 | 139 | #define TDH_SYS_LP_INIT 35 |
128 | +#define TDH_SYS_TDMR_INIT 36 | 140 | +#define TDH_SYS_TDMR_INIT 36 |
129 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
130 | #define TDH_SYS_CONFIG 45 | 141 | #define TDH_SYS_CONFIG 45 |
131 | 142 | ||
143 | struct cmr_info { | ||
132 | -- | 144 | -- |
133 | 2.38.1 | 145 | 2.41.0 |
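The init_tdmr() loop above captures the retry-until-done pattern: TDH.SYS.TDMR.INIT only initializes part of a TDMR per call and reports a next-to-initialize address in RDX, so the caller repeats until that address reaches the end of the TDMR. The sketch below models the loop in userspace C; the CHUNK size and the tdmr_init_step() stub are assumptions for illustration, not the real SEAMCALL:

```c
#include <assert.h>

#define CHUNK 64UL   /* bytes the stub "module" initializes per call */

/* Stub for TDH.SYS.TDMR.INIT: returns the next-to-initialize address. */
unsigned long tdmr_init_step(unsigned long next)
{
    return next + CHUNK;
}

/* Mirror of the init_tdmr() loop; also counts how many calls it took. */
int init_one_tdmr(unsigned long base, unsigned long size, int *ncalls)
{
    unsigned long next = base;

    do {
        next = tdmr_init_step(next);
        (*ncalls)++;
        /* the kernel calls cond_resched() here between SEAMCALLs */
    } while (next < base + size);

    return 0;
}
```

A 256-byte "TDMR" with a 64-byte step takes four calls; the real module uses an implementation-defined step, which is exactly why the loop keys off the returned address rather than a fixed count.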
1 | There are two problems in terms of using kexec() to boot to a new kernel | 1 | There are two problems in terms of using kexec() to boot to a new kernel |
---|---|---|---|
2 | when the old kernel has enabled TDX: 1) Part of the memory pages are | 2 | when the old kernel has enabled TDX: 1) Part of the memory pages are |
3 | still TDX private pages (i.e. metadata used by the TDX module, and any | 3 | still TDX private pages; 2) There might be dirty cachelines associated |
4 | TDX guest memory if kexec() happens when there's any TDX guest alive). | 4 | with TDX private pages. |
5 | 2) There might be dirty cachelines associated with TDX private pages. | ||
6 | 5 | ||
7 | Because the hardware doesn't guarantee cache coherency among different | 6 | The first problem doesn't matter on the platforms w/o the "partial write |
8 | KeyIDs, the old kernel needs to flush cache (of those TDX private pages) | 7 | machine check" erratum. KeyID 0 doesn't have integrity check. If the |
9 | before booting to the new kernel. Also, reading a TDX private page using | 8 | new kernel wants to use any non-zero KeyID, it needs to convert the
10 | any shared non-TDX KeyID with integrity-check enabled can trigger #MC. | 9 | memory to that KeyID and such conversion would work from any KeyID. |
11 | Therefore ideally, the kernel should convert all TDX private pages back | ||
12 | to normal before booting to the new kernel. | ||
13 | 10 | ||
14 | However, this implementation doesn't convert TDX private pages back to | 11 | However the old kernel needs to guarantee there's no dirty cacheline |
15 | normal in kexec() because of below considerations: | 12 | left behind before booting to the new kernel to avoid silent corruption |
13 | from later cacheline writeback (Intel hardware doesn't guarantee cache | ||
14 | coherency across different KeyIDs). | ||
16 | 15 | ||
17 | 1) The kernel doesn't have existing infrastructure to track which pages | 16 | There are two things that the old kernel needs to do to achieve that: |
18 | are TDX private pages. | ||
19 | 2) The number of TDX private pages can be large, and converting all of | ||
20 | them (cache flush + using MOVDIR64B to clear the page) in kexec() can | ||
21 | be time consuming. | ||
22 | 3) The new kernel will almost only use KeyID 0 to access memory. KeyID | ||
23 | 0 doesn't support integrity-check, so it's OK. | ||
24 | 4) The kernel doesn't (and may never) support MKTME. If any 3rd party | ||
25 | kernel ever supports MKTME, it should do MOVDIR64B to clear the page | ||
26 | with the new MKTME KeyID (just like TDX does) before using it. | ||
27 | 17 | ||
28 | Therefore, this implementation just flushes cache to make sure there are | 18 | 1) Stop accessing TDX private memory mappings: |
29 | no stale dirty cachelines associated with any TDX private KeyIDs before | 19 | a. Stop making TDX module SEAMCALLs (TDX global KeyID); |
30 | booting to the new kernel, otherwise they may silently corrupt the new | 20 | b. Stop TDX guests from running (per-guest TDX KeyID). |
31 | kernel. | 21 | 2) Flush any cachelines from previous TDX private KeyID writes. |
32 | 22 | ||
33 | Following SME support, use wbinvd() to flush cache in stop_this_cpu(). | 23 | For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME |
24 | support. And in this way 1) happens for free as there's no TDX activity | ||
25 | between wbinvd() and the native_halt(). | ||
26 | |||
27 | Flushing cache in stop_this_cpu() only flushes cache on remote cpus. On | ||
28 | the rebooting cpu which does kexec(), unlike SME which does the cache | ||
29 | flush in relocate_kernel(), flush the cache right after stopping remote | ||
30 | cpus in machine_shutdown(). | ||
31 | |||
32 | There are two reasons to do so: 1) For TDX there's no need to defer | ||
33 | cache flush to relocate_kernel() because all TDX activities have been | ||
34 | stopped. 2) On the platforms with the above erratum the kernel must | ||
35 | convert all TDX private pages back to normal before booting to the new | ||
36 | kernel in kexec(), and flushing cache early allows the kernel to convert | ||
37 | memory early rather than having to muck with the relocate_kernel() | ||
38 | assembly. | ||
39 | |||
34 | Theoretically, cache flush is only needed when the TDX module has been | 40 | Theoretically, cache flush is only needed when the TDX module has been |
35 | initialized. However initializing the TDX module is done on demand at | 41 | initialized. However initializing the TDX module is done on demand at |
36 | runtime, and it takes a mutex to read the module status. Just check | 42 | runtime, and it takes a mutex to read the module status. Just check |
37 | whether TDX is enabled by BIOS instead to flush cache. | 43 | whether TDX is enabled by the BIOS instead to flush cache. |
38 | 44 | ||
39 | Also, the current TDX module doesn't play nicely with kexec(). The TDX | 45 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
40 | module can only be initialized once during its lifetime, and there is no | ||
41 | ABI to reset the module to give a new clean slate to the new kernel. | ||
42 | Therefore ideally, if the TDX module is ever initialized, it's better | ||
43 | to shut it down. The new kernel won't be able to use TDX anyway (as it | ||
44 | needs to go through the TDX module initialization process which will | ||
45 | fail immediately at the first step). | ||
46 | |||
47 | However, shutting down the TDX module requires all CPUs being in VMX | ||
48 | operation, but there's no such guarantee as kexec() can happen at any | ||
49 | time (i.e. when KVM is not even loaded). So just do nothing but leave | ||
50 | leave the TDX module open. | ||
51 | |||
52 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 46 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
53 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 47 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
54 | --- | 48 | --- |
55 | 49 | ||
56 | v6 -> v7: | 50 | v13 -> v14: |
57 | - Improved changelog to explain why don't convert TDX private pages back | 51 | - No change |
58 | to normal. | ||
59 | 52 | ||
60 | --- | 53 | --- |
61 | arch/x86/kernel/process.c | 8 +++++++- | 54 | arch/x86/kernel/process.c | 8 +++++++- |
62 | 1 file changed, 7 insertions(+), 1 deletion(-) | 55 | arch/x86/kernel/reboot.c | 15 +++++++++++++++ |
56 | 2 files changed, 22 insertions(+), 1 deletion(-) | ||
63 | 57 | ||
64 | diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c | 58 | diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c |
65 | index XXXXXXX..XXXXXXX 100644 | 59 | index XXXXXXX..XXXXXXX 100644 |
66 | --- a/arch/x86/kernel/process.c | 60 | --- a/arch/x86/kernel/process.c |
67 | +++ b/arch/x86/kernel/process.c | 61 | +++ b/arch/x86/kernel/process.c |
68 | @@ -XXX,XX +XXX,XX @@ void __noreturn stop_this_cpu(void *dummy) | 62 | @@ -XXX,XX +XXX,XX @@ void __noreturn stop_this_cpu(void *dummy) |
69 | * | 63 | * |
70 | * Test the CPUID bit directly because the machine might've cleared | 64 | * Test the CPUID bit directly because the machine might've cleared |
71 | * X86_FEATURE_SME due to cmdline options. | 65 | * X86_FEATURE_SME due to cmdline options. |
72 | + * | 66 | + * |
73 | + * Similar to SME, if the TDX module is ever initialized, the | 67 | + * The TDX module or guests might have left dirty cachelines |
74 | + * cachelines associated with any TDX private KeyID must be flushed | 68 | + * behind. Flush them to avoid corruption from later writeback. |
75 | + * before transiting to the new kernel. The TDX module is initialized | 69 | + * Note that this flushes on all systems where TDX is possible, |
76 | + * on demand, and it takes the mutex to read its status. Just check | 70 | + * but does not actually check that TDX was in use. |
77 | + * whether TDX is enabled by BIOS instead to flush cache. | ||
78 | */ | 71 | */ |
79 | - if (cpuid_eax(0x8000001f) & BIT(0)) | 72 | - if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0))) |
80 | + if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled()) | 73 | + if ((c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0))) |
74 | + || platform_tdx_enabled()) | ||
81 | native_wbinvd(); | 75 | native_wbinvd(); |
82 | for (;;) { | 76 | |
83 | /* | 77 | /* |
78 | diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c | ||
79 | index XXXXXXX..XXXXXXX 100644 | ||
80 | --- a/arch/x86/kernel/reboot.c | ||
81 | +++ b/arch/x86/kernel/reboot.c | ||
82 | @@ -XXX,XX +XXX,XX @@ | ||
83 | #include <asm/realmode.h> | ||
84 | #include <asm/x86_init.h> | ||
85 | #include <asm/efi.h> | ||
86 | +#include <asm/tdx.h> | ||
87 | |||
88 | /* | ||
89 | * Power off function, if any | ||
90 | @@ -XXX,XX +XXX,XX @@ void native_machine_shutdown(void) | ||
91 | local_irq_disable(); | ||
92 | stop_other_cpus(); | ||
93 | #endif | ||
94 | + /* | ||
95 | + * stop_other_cpus() has flushed all dirty cachelines of TDX | ||
96 | + * private memory on remote cpus. Unlike SME, which does the | ||
97 | + * cache flush on _this_ cpu in the relocate_kernel(), flush | ||
98 | + * the cache for _this_ cpu here. This is because on the | ||
99 | + * platforms with "partial write machine check" erratum the | ||
100 | + * kernel needs to convert all TDX private pages back to normal | ||
101 | + * before booting to the new kernel in kexec(), and the cache | ||
102 | + * flush must be done before that. If the kernel took SME's way, | ||
103 | + * it would have to muck with the relocate_kernel() assembly to | ||
104 | + * do memory conversion. | ||
105 | + */ | ||
106 | + if (platform_tdx_enabled()) | ||
107 | + native_wbinvd(); | ||
108 | |||
109 | lapic_shutdown(); | ||
110 | restore_boot_irq_mode(); | ||
84 | -- | 111 | -- |
85 | 2.38.1 | 112 | 2.41.0 | diff view generated by jsdifflib |
1 | TDX module initialization requires using one TDX private KeyID as the | 1 | On platforms with the "partial write machine check" erratum, |
---|---|---|---|
2 | global KeyID to protect the TDX module metadata. The global KeyID is | 2 | kexec() needs to convert all TDX private pages back to normal before |
3 | configured to the TDX module along with TDMRs. | 3 | booting to the new kernel. Otherwise, the new kernel may get an unexpected |
4 | machine check. | ||
4 | 5 | ||
5 | Just reserve the first TDX private KeyID as the global KeyID. Keep the | 6 | There's no existing infrastructure to track TDX private pages. Keep |
6 | global KeyID as a static variable as KVM will need to use it too. | 7 | TDMRs when module initialization is successful so that they can be used |
8 | to find PAMTs. | ||
7 | 9 | ||
8 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | ||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 10 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
11 | Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> | ||
12 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
10 | --- | 13 | --- |
11 | arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++ | 14 | |
12 | 1 file changed, 9 insertions(+) | 15 | v13 -> v14: |
16 | - "Change to keep" -> "Keep" (Kirill) | ||
17 | - Add Kirill/Rick's tags | ||
18 | |||
19 | v12 -> v13: | ||
20 | - Split "improve error handling" part out as a separate patch. | ||
21 | |||
22 | v11 -> v12 (new patch): | ||
23 | - Defer keeping TDMRs logic to this patch for better review | ||
24 | - Improved error handling logic (Nikolay/Kirill in patch 15) | ||
25 | |||
26 | --- | ||
27 | arch/x86/virt/vmx/tdx/tdx.c | 24 +++++++++++------------- | ||
28 | 1 file changed, 11 insertions(+), 13 deletions(-) | ||
13 | 29 | ||
14 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 30 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
15 | index XXXXXXX..XXXXXXX 100644 | 31 | index XXXXXXX..XXXXXXX 100644 |
16 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 32 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
17 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 33 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
18 | @@ -XXX,XX +XXX,XX @@ static int tdx_cmr_num; | 34 | @@ -XXX,XX +XXX,XX @@ static DEFINE_MUTEX(tdx_module_lock); |
19 | /* All TDX-usable memory regions */ | 35 | /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ |
20 | static LIST_HEAD(tdx_memlist); | 36 | static LIST_HEAD(tdx_memlist); |
21 | 37 | ||
22 | +/* TDX module global KeyID. Used in TDH.SYS.CONFIG ABI. */ | 38 | +static struct tdmr_info_list tdx_tdmr_list; |
23 | +static u32 tdx_global_keyid; | ||
24 | + | 39 | + |
25 | /* | 40 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); |
26 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | 41 | |
27 | * BIOS. Both initializing the TDX module and running TDX guest require | 42 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) |
43 | @@ -XXX,XX +XXX,XX @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list) | ||
44 | static int init_tdx_module(void) | ||
45 | { | ||
46 | struct tdsysinfo_struct *tdsysinfo; | ||
47 | - struct tdmr_info_list tdmr_list; | ||
48 | struct cmr_info *cmr_array; | ||
49 | int tdsysinfo_size; | ||
50 | int cmr_array_size; | ||
28 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 51 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
52 | goto out_put_tdxmem; | ||
53 | |||
54 | /* Allocate enough space for constructing TDMRs */ | ||
55 | - ret = alloc_tdmr_list(&tdmr_list, tdsysinfo); | ||
56 | + ret = alloc_tdmr_list(&tdx_tdmr_list, tdsysinfo); | ||
57 | if (ret) | ||
58 | goto out_free_tdxmem; | ||
59 | |||
60 | /* Cover all TDX-usable memory regions in TDMRs */ | ||
61 | - ret = construct_tdmrs(&tdx_memlist, &tdmr_list, tdsysinfo); | ||
62 | + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, tdsysinfo); | ||
29 | if (ret) | 63 | if (ret) |
30 | goto out_free_tdmrs; | 64 | goto out_free_tdmrs; |
31 | 65 | ||
32 | + /* | 66 | /* Pass the TDMRs and the global KeyID to the TDX module */ |
33 | + * Reserve the first TDX KeyID as global KeyID to protect | 67 | - ret = config_tdx_module(&tdmr_list, tdx_global_keyid); |
34 | + * TDX module metadata. | 68 | + ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid); |
35 | + */ | 69 | if (ret) |
36 | + tdx_global_keyid = tdx_keyid_start; | 70 | goto out_free_pamts; |
37 | + | 71 | |
38 | /* | 72 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
39 | * Return -EINVAL until all steps of TDX module initialization | 73 | goto out_reset_pamts; |
40 | * process are done. | 74 | |
75 | /* Initialize TDMRs to complete the TDX module initialization */ | ||
76 | - ret = init_tdmrs(&tdmr_list); | ||
77 | + ret = init_tdmrs(&tdx_tdmr_list); | ||
78 | out_reset_pamts: | ||
79 | if (ret) { | ||
80 | /* | ||
81 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
82 | * back to normal. But do the conversion anyway here | ||
83 | * as suggested by the TDX spec. | ||
84 | */ | ||
85 | - tdmrs_reset_pamt_all(&tdmr_list); | ||
86 | + tdmrs_reset_pamt_all(&tdx_tdmr_list); | ||
87 | } | ||
88 | out_free_pamts: | ||
89 | if (ret) | ||
90 | - tdmrs_free_pamt_all(&tdmr_list); | ||
91 | + tdmrs_free_pamt_all(&tdx_tdmr_list); | ||
92 | else | ||
93 | pr_info("%lu KBs allocated for PAMT\n", | ||
94 | - tdmrs_count_pamt_kb(&tdmr_list)); | ||
95 | + tdmrs_count_pamt_kb(&tdx_tdmr_list)); | ||
96 | out_free_tdmrs: | ||
97 | - /* | ||
98 | - * Always free the buffer of TDMRs as they are only used during | ||
99 | - * module initialization. | ||
100 | - */ | ||
101 | - free_tdmr_list(&tdmr_list); | ||
102 | + if (ret) | ||
103 | + free_tdmr_list(&tdx_tdmr_list); | ||
104 | out_free_tdxmem: | ||
105 | if (ret) | ||
106 | free_tdx_memlist(&tdx_memlist); | ||
41 | -- | 107 | -- |
42 | 2.38.1 | 108 | 2.41.0 | diff view generated by jsdifflib |
1 | The first step of initializing the module is to call TDH.SYS.INIT once | 1 | With TDMRs being kept upon successful TDX module initialization, only |
---|---|---|---|
2 | on any logical cpu to do module global initialization. Do the module | 2 | put_online_mems() and freeing the buffers of the TDSYSINFO_STRUCT and |
3 | global initialization. | 3 | the CMR array still need to be done even when module initialization is |
4 | successful. On the other hand, all other four "out_*" labels before | ||
5 | them explicitly check the return value and only clean up when module | ||
6 | initialization fails. | ||
4 | 7 | ||
5 | It also detects the TDX module, as seamcall() returns -ENODEV when the | 8 | This isn't ideal. Make all other four "out_*" labels only reachable |
6 | module is not loaded. | 9 | when module initialization fails to improve the readability of error |
10 | handling. Rename them from "out_*" to "err_*" to reflect the fact. | ||
7 | 11 | ||
8 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 12 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
13 | Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> | ||
14 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
9 | --- | 15 | --- |
10 | 16 | ||
11 | v6 -> v7: | 17 | v13 -> v14: |
12 | - Improved changelog. | 18 | - Fix spell typo (Rick) |
19 | - Add Kirill/Rick's tags | ||
20 | |||
21 | v12 -> v13: | ||
22 | - New patch to improve error handling. (Kirill, Nikolay) | ||
13 | 23 | ||
14 | --- | 24 | --- |
15 | arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++-- | 25 | arch/x86/virt/vmx/tdx/tdx.c | 67 +++++++++++++++++++------------------ |
16 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 26 | 1 file changed, 34 insertions(+), 33 deletions(-) |
17 | 2 files changed, 18 insertions(+), 2 deletions(-) | ||
18 | 27 | ||
19 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 28 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
20 | index XXXXXXX..XXXXXXX 100644 | 29 | index XXXXXXX..XXXXXXX 100644 |
21 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 30 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
22 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 31 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
23 | @@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | 32 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
24 | */ | 33 | /* Allocate enough space for constructing TDMRs */ |
25 | static int init_tdx_module(void) | 34 | ret = alloc_tdmr_list(&tdx_tdmr_list, tdsysinfo); |
26 | { | 35 | if (ret) |
27 | - /* The TDX module hasn't been detected */ | 36 | - goto out_free_tdxmem; |
28 | - return -ENODEV; | 37 | + goto err_free_tdxmem; |
29 | + int ret; | 38 | |
39 | /* Cover all TDX-usable memory regions in TDMRs */ | ||
40 | ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, tdsysinfo); | ||
41 | if (ret) | ||
42 | - goto out_free_tdmrs; | ||
43 | + goto err_free_tdmrs; | ||
44 | |||
45 | /* Pass the TDMRs and the global KeyID to the TDX module */ | ||
46 | ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid); | ||
47 | if (ret) | ||
48 | - goto out_free_pamts; | ||
49 | + goto err_free_pamts; | ||
50 | |||
51 | /* | ||
52 | * Hardware doesn't guarantee cache coherency across different | ||
53 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
54 | /* Config the key of global KeyID on all packages */ | ||
55 | ret = config_global_keyid(); | ||
56 | if (ret) | ||
57 | - goto out_reset_pamts; | ||
58 | + goto err_reset_pamts; | ||
59 | |||
60 | /* Initialize TDMRs to complete the TDX module initialization */ | ||
61 | ret = init_tdmrs(&tdx_tdmr_list); | ||
62 | -out_reset_pamts: | ||
63 | - if (ret) { | ||
64 | - /* | ||
65 | - * Part of PAMTs may already have been initialized by the | ||
66 | - * TDX module. Flush cache before returning PAMTs back | ||
67 | - * to the kernel. | ||
68 | - */ | ||
69 | - wbinvd_on_all_cpus(); | ||
70 | - /* | ||
71 | - * According to the TDX hardware spec, if the platform | ||
72 | - * doesn't have the "partial write machine check" | ||
73 | - * erratum, any kernel read/write will never cause #MC | ||
74 | - * in kernel space, thus it's OK to not convert PAMTs | ||
75 | - * back to normal. But do the conversion anyway here | ||
76 | - * as suggested by the TDX spec. | ||
77 | - */ | ||
78 | - tdmrs_reset_pamt_all(&tdx_tdmr_list); | ||
79 | - } | ||
80 | -out_free_pamts: | ||
81 | if (ret) | ||
82 | - tdmrs_free_pamt_all(&tdx_tdmr_list); | ||
83 | - else | ||
84 | - pr_info("%lu KBs allocated for PAMT\n", | ||
85 | - tdmrs_count_pamt_kb(&tdx_tdmr_list)); | ||
86 | -out_free_tdmrs: | ||
87 | - if (ret) | ||
88 | - free_tdmr_list(&tdx_tdmr_list); | ||
89 | -out_free_tdxmem: | ||
90 | - if (ret) | ||
91 | - free_tdx_memlist(&tdx_memlist); | ||
92 | + goto err_reset_pamts; | ||
30 | + | 93 | + |
94 | + pr_info("%lu KBs allocated for PAMT\n", | ||
95 | + tdmrs_count_pamt_kb(&tdx_tdmr_list)); | ||
96 | + | ||
97 | out_put_tdxmem: | ||
98 | /* | ||
99 | * @tdx_memlist is written here and read at memory hotplug time. | ||
100 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
101 | kfree(tdsysinfo); | ||
102 | kfree(cmr_array); | ||
103 | return ret; | ||
104 | + | ||
105 | +err_reset_pamts: | ||
31 | + /* | 106 | + /* |
32 | + * Call TDH.SYS.INIT to do the global initialization of | 107 | + * Part of PAMTs may already have been initialized by the |
33 | + * the TDX module. It also detects the module. | 108 | + * TDX module. Flush cache before returning PAMTs back |
109 | + * to the kernel. | ||
34 | + */ | 110 | + */ |
35 | + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); | 111 | + wbinvd_on_all_cpus(); |
36 | + if (ret) | ||
37 | + goto out; | ||
38 | + | ||
39 | + /* | 112 | + /* |
40 | + * Return -EINVAL until all steps of TDX module initialization | 113 | + * According to the TDX hardware spec, if the platform |
41 | + * process are done. | 114 | + * doesn't have the "partial write machine check" |
115 | + * erratum, any kernel read/write will never cause #MC | ||
116 | + * in kernel space, thus it's OK to not convert PAMTs | ||
117 | + * back to normal. But do the conversion anyway here | ||
118 | + * as suggested by the TDX spec. | ||
42 | + */ | 119 | + */ |
43 | + ret = -EINVAL; | 120 | + tdmrs_reset_pamt_all(&tdx_tdmr_list); |
44 | +out: | 121 | +err_free_pamts: |
45 | + return ret; | 122 | + tdmrs_free_pamt_all(&tdx_tdmr_list); |
123 | +err_free_tdmrs: | ||
124 | + free_tdmr_list(&tdx_tdmr_list); | ||
125 | +err_free_tdxmem: | ||
126 | + free_tdx_memlist(&tdx_memlist); | ||
127 | + /* Do things irrelevant to module initialization result */ | ||
128 | + goto out_put_tdxmem; | ||
46 | } | 129 | } |
47 | 130 | ||
48 | static void shutdown_tdx_module(void) | 131 | static int __tdx_enable(void) |
49 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
50 | index XXXXXXX..XXXXXXX 100644 | ||
51 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
52 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
53 | @@ -XXX,XX +XXX,XX @@ | ||
54 | /* | ||
55 | * TDX module SEAMCALL leaf functions | ||
56 | */ | ||
57 | +#define TDH_SYS_INIT 33 | ||
58 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
59 | |||
60 | /* | ||
61 | -- | 132 | -- |
62 | 2.38.1 | 133 | 2.41.0 | diff view generated by jsdifflib |
1 | TDX supports shutting down the TDX module at any time during its | 1 | The first few generations of TDX hardware have an erratum. A partial |
---|---|---|---|
2 | lifetime. After the module is shut down, no further TDX module SEAMCALL | 2 | write to a TDX private memory cacheline will silently "poison" the |
3 | leaf functions can be made to the module on any logical cpu. | 3 | line. Subsequent reads will consume the poison and generate a machine |
4 | 4 | check. According to the TDX hardware spec, neither of these things | |
5 | Shut down the TDX module in case of any error during the initialization | 5 | should have happened. |
6 | process. It's pointless to leave the TDX module in some middle state. | 6 | |
7 | 7 | == Background == | |
8 | Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all | 8 | |
9 | BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different | 9 | Virtually all kernel memory access operations happen in full |
10 | CPUs. Implement a mechanism to run SEAMCALL concurrently on all online | 10 | cachelines. In practice, writing a "byte" of memory usually reads a 64 |
11 | CPUs and use it to shut down the module. Later logical-cpu scope module | 11 | byte cacheline of memory, modifies it, then writes the whole line back. |
12 | initialization will use it too. | 12 | Those operations do not trigger this problem. |
13 | 13 | ||
14 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 14 | This problem is triggered by "partial" writes where a write transaction |
15 | of less than a cacheline lands at the memory controller. The CPU does | ||
16 | these via non-temporal write instructions (like MOVNTI), or through | ||
17 | UC/WC memory mappings. The issue can also be triggered away from the | ||
18 | CPU by devices doing partial writes via DMA. | ||
19 | |||
20 | == Problem == | ||
21 | |||
22 | A fast warm reset doesn't reset TDX private memory. Kexec() can also | ||
23 | boot into the new kernel directly. Thus if the old kernel has enabled | ||
24 | TDX on the platform with this erratum, the new kernel may get an unexpected | ||
25 | machine check. | ||
26 | |||
27 | Note that w/o this erratum any kernel read/write on TDX private memory | ||
28 | should never cause a machine check, thus it's OK for the old kernel to | ||
29 | leave TDX private pages as is. | ||
30 | |||
31 | == Solution == | ||
32 | |||
33 | In short, with this erratum, the kernel needs to explicitly convert all | ||
34 | TDX private pages back to normal to give the new kernel a clean slate | ||
35 | after kexec(). The BIOS is also expected to disable fast warm reset as | ||
36 | a workaround to this erratum, thus this implementation doesn't try to | ||
37 | reset TDX private memory for the reboot case in the kernel but depend on | ||
38 | the BIOS to enable the workaround. | ||
39 | |||
40 | Convert TDX private pages back to normal after all remote cpus have been | ||
41 | stopped and cache flush has been done on all cpus, when no more TDX | ||
42 | activity can happen further. Do it in machine_kexec() to avoid the | ||
43 | additional overhead to the normal reboot/shutdown as the kernel depends | ||
44 | on the BIOS to disable fast warm reset for the reboot case. | ||
45 | |||
46 | For now TDX private memory can only be PAMT pages. It would be ideal to | ||
47 | cover all types of TDX private memory here, but there are practical | ||
48 | problems to do so: | ||
49 | |||
50 | 1) There's no existing infrastructure to track TDX private pages; | ||
51 | 2) It's not feasible to query the TDX module about page type because VMX | ||
52 | has already been stopped when KVM receives the reboot notifier, plus | ||
53 | the result from the TDX module may not be accurate (e.g., the remote | ||
54 | CPU could be stopped right before MOVDIR64B). | ||
55 | |||
56 | One temporary solution is to blindly convert all memory pages, but it's | ||
57 | also problematic, because not all pages are mapped as writable | ||
58 | in the direct mapping. It can be done by switching to the identical | ||
59 | mapping created for kexec() or a new page table, but the complexity | ||
60 | looks overkill. | ||
61 | |||
62 | Therefore, rather than doing something dramatic, only reset PAMT pages | ||
63 | here. Other kernel components which use TDX need to do the conversion | ||
64 | on their own by intercepting the rebooting/shutdown notifier (KVM | ||
65 | already does that). | ||
66 | |||
67 | Note kexec() can happen at any time, including when TDX module is being | ||
68 | initialized. Register a TDX reboot notifier callback to stop further TDX | ||
69 | module initialization. If there's any ongoing module initialization, | ||
70 | wait until it finishes. This makes sure the TDX module status is stable | ||
71 | after the reboot notifier callback, and the later kexec() code can read | ||
72 | module status to decide whether PAMTs are stable and available. | ||
73 | |||
74 | Also stop further TDX module initialization in case of machine shutdown | ||
75 | and halt, but not limited to kexec(), as there's no reason to do so in | ||
76 | these cases either. | ||
77 | |||
15 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 78 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
79 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
16 | --- | 80 | --- |
17 | 81 | ||
18 | v6 -> v7: | 82 | v13 -> v14: |
19 | - No change. | 83 | - Skip resetting TDX private memory when preserve_context is true (Rick) |
20 | 84 | - Use reboot notifier to stop TDX module initialization at early time of | |
21 | v5 -> v6: | 85 | kexec() to make module status stable, to avoid using a new variable |
22 | - Removed the seamcall() wrapper to previous patch (Dave). | 86 | and memory barrier (which is tricky to review). |
23 | 87 | - Added Kirill's tag | |
24 | - v3 -> v5 (no feedback on v4): | 88 | |
25 | - Added a wrapper of __seamcall() to print error code if SEAMCALL fails. | 89 | v12 -> v13: |
26 | - Made the seamcall_on_each_cpu() void. | 90 | - Improve comments to explain why barrier is needed and ignore WBINVD. |
27 | - Removed 'seamcall_ret' and 'tdx_module_out' from | 91 | (Dave) |
28 | 'struct seamcall_ctx', as they must be local variable. | 92 | - Improve comments to document memory ordering. (Nikolay) |
29 | - Added the comments to tdx_init() and one paragraph to changelog to | 93 | - Made comments/changelog slightly more concise. |
30 | explain the caller should handle VMXON. | 94 | |
31 | - Called out after shut down, no "TDX module" SEAMCALL can be made. | 95 | v11 -> v12: |
96 | - Changed comment/changelog to say kernel doesn't try to handle fast | ||
97 | warm reset but depends on BIOS to enable workaround (Kirill) | ||
98 | - Added a new tdx_may_has_private_mem to indicate system may have TDX | ||
99 | private memory and PAMTs/TDMRs are stable to access. (Dave). | ||
100 | - Use atomic_t for tdx_may_has_private_mem for build-in memory barrier | ||
101 | (Dave) | ||
102 | - Changed calling x86_platform.memory_shutdown() to calling | ||
103 | tdx_reset_memory() directly from machine_kexec() to avoid overhead to | ||
104 | normal reboot case. | ||
105 | |||
106 | v10 -> v11: | ||
107 | - New patch | ||
32 | 108 | ||
33 | --- | 109 | --- |
34 | arch/x86/virt/vmx/tdx/tdx.c | 43 +++++++++++++++++++++++++++++++++---- | 110 | arch/x86/include/asm/tdx.h | 2 + |
35 | arch/x86/virt/vmx/tdx/tdx.h | 5 +++++ | 111 | arch/x86/kernel/machine_kexec_64.c | 16 ++++++ |
36 | 2 files changed, 44 insertions(+), 4 deletions(-) | 112 | arch/x86/virt/vmx/tdx/tdx.c | 92 ++++++++++++++++++++++++++++++ |
37 | 113 | 3 files changed, 110 insertions(+) | |
114 | |||
115 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | ||
116 | index XXXXXXX..XXXXXXX 100644 | ||
117 | --- a/arch/x86/include/asm/tdx.h | ||
118 | +++ b/arch/x86/include/asm/tdx.h | ||
119 | @@ -XXX,XX +XXX,XX @@ static inline u64 sc_retry(sc_func_t func, u64 fn, | ||
120 | bool platform_tdx_enabled(void); | ||
121 | int tdx_cpu_enable(void); | ||
122 | int tdx_enable(void); | ||
123 | +void tdx_reset_memory(void); | ||
124 | #else | ||
125 | static inline bool platform_tdx_enabled(void) { return false; } | ||
126 | static inline int tdx_cpu_enable(void) { return -ENODEV; } | ||
127 | static inline int tdx_enable(void) { return -ENODEV; } | ||
128 | +static inline void tdx_reset_memory(void) { } | ||
129 | #endif /* CONFIG_INTEL_TDX_HOST */ | ||
130 | |||
131 | #endif /* !__ASSEMBLY__ */ | ||
132 | diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c | ||
133 | index XXXXXXX..XXXXXXX 100644 | ||
134 | --- a/arch/x86/kernel/machine_kexec_64.c | ||
135 | +++ b/arch/x86/kernel/machine_kexec_64.c | ||
136 | @@ -XXX,XX +XXX,XX @@ | ||
137 | #include <asm/setup.h> | ||
138 | #include <asm/set_memory.h> | ||
139 | #include <asm/cpu.h> | ||
140 | +#include <asm/tdx.h> | ||
141 | |||
142 | #ifdef CONFIG_ACPI | ||
143 | /* | ||
144 | @@ -XXX,XX +XXX,XX @@ void machine_kexec(struct kimage *image) | ||
145 | void *control_page; | ||
146 | int save_ftrace_enabled; | ||
147 | |||
148 | + /* | ||
149 | + * For platforms with TDX "partial write machine check" erratum, | ||
150 | + * all TDX private pages need to be converted back to normal | ||
151 | + * before booting to the new kernel, otherwise the new kernel | ||
152 | + * may get unexpected machine check. | ||
153 | + * | ||
154 | + * But skip this when preserve_context is on. The second kernel | ||
155 | + * shouldn't write to the first kernel's memory anyway. Skipping | ||
156 | + * this also avoids killing TDX in the first kernel, which would | ||
157 | + * require more complicated handling. | ||
158 | + */ | ||
159 | #ifdef CONFIG_KEXEC_JUMP | ||
160 | if (image->preserve_context) | ||
161 | save_processor_state(); | ||
162 | + else | ||
163 | + tdx_reset_memory(); | ||
164 | +#else | ||
165 | + tdx_reset_memory(); | ||
166 | #endif | ||
167 | |||
168 | save_ftrace_enabled = __ftrace_enabled_save(); | ||
38 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 169 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
39 | index XXXXXXX..XXXXXXX 100644 | 170 | index XXXXXXX..XXXXXXX 100644 |
40 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 171 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
41 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 172 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
42 | @@ -XXX,XX +XXX,XX @@ | 173 | @@ -XXX,XX +XXX,XX @@ |
43 | #include <linux/mutex.h> | 174 | #include <linux/align.h> |
44 | #include <linux/cpu.h> | 175 | #include <linux/sort.h> |
45 | #include <linux/cpumask.h> | 176 | #include <linux/log2.h> |
46 | +#include <linux/smp.h> | 177 | +#include <linux/reboot.h> |
47 | +#include <linux/atomic.h> | ||
48 | #include <asm/msr-index.h> | 178 | #include <asm/msr-index.h> |
49 | #include <asm/msr.h> | 179 | #include <asm/msr.h> |
50 | #include <asm/apic.h> | 180 | #include <asm/page.h> |
51 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void) | 181 | @@ -XXX,XX +XXX,XX @@ static LIST_HEAD(tdx_memlist); |
52 | return !!tdx_keyid_num; | 182 | |
183 | static struct tdmr_info_list tdx_tdmr_list; | ||
184 | |||
185 | +static bool tdx_rebooting; | ||
186 | + | ||
187 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); | ||
188 | |||
189 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) | ||
190 | @@ -XXX,XX +XXX,XX @@ static int __tdx_enable(void) | ||
191 | { | ||
192 | int ret; | ||
193 | |||
194 | + if (tdx_rebooting) | ||
195 | + return -EAGAIN; | ||
196 | + | ||
197 | ret = init_tdx_module(); | ||
198 | if (ret) { | ||
199 | pr_err("module initialization failed (%d)\n", ret); | ||
200 | @@ -XXX,XX +XXX,XX @@ int tdx_enable(void) | ||
53 | } | 201 | } |
202 | EXPORT_SYMBOL_GPL(tdx_enable); | ||
54 | 203 | ||
55 | +/* | 204 | +/* |
56 | + * Data structure to make SEAMCALL on multiple CPUs concurrently. | 205 | + * Convert TDX private pages back to normal on platforms with |
57 | + * @err is set to -EFAULT when SEAMCALL fails on any cpu. | 206 | + * "partial write machine check" erratum. |
207 | + * | ||
208 | + * Called from machine_kexec() before booting to the new kernel. | ||
58 | + */ | 209 | + */ |
59 | +struct seamcall_ctx { | 210 | +void tdx_reset_memory(void) |
60 | + u64 fn; | 211 | +{ |
61 | + u64 rcx; | 212 | + if (!platform_tdx_enabled()) |
62 | + u64 rdx; | 213 | + return; |
63 | + u64 r8; | 214 | + |
64 | + u64 r9; | 215 | + /* |
65 | + atomic_t err; | 216 | + * Kernel read/write to TDX private memory doesn't |
217 | + * cause machine check on hardware w/o this erratum. | ||
218 | + */ | ||
219 | + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) | ||
220 | + return; | ||
221 | + | ||
222 | + /* Called from kexec() when only rebooting cpu is alive */ | ||
223 | + WARN_ON_ONCE(num_online_cpus() != 1); | ||
224 | + | ||
225 | + /* | ||
226 | + * tdx_reboot_notifier() waits until ongoing TDX module | ||
227 | + * initialization to finish, and module initialization is | ||
228 | + * rejected after that. Therefore @tdx_module_status is | ||
229 | + * stable here and can be read w/o holding lock. | ||
230 | + */ | ||
231 | + if (tdx_module_status != TDX_MODULE_INITIALIZED) | ||
232 | + return; | ||
233 | + | ||
234 | + /* | ||
235 | + * Convert PAMTs back to normal. All other cpus are already | ||
236 | + * dead and TDMRs/PAMTs are stable. | ||
237 | + * | ||
238 | + * Ideally it's better to cover all types of TDX private pages | ||
239 | + * here, but it's impractical: | ||
240 | + * | ||
241 | + * - There's no existing infrastructure to tell whether a page | ||
242 | + * is TDX private memory or not. | ||
243 | + * | ||
244 | + * - Using SEAMCALL to query TDX module isn't feasible either: | ||
245 | + * - VMX has been turned off by reaching here so SEAMCALL | ||
246 | + * cannot be made; | ||
247 | + * - Even if a SEAMCALL can be made, the result from TDX module may | ||
248 | + * not be accurate (e.g., remote CPU can be stopped while | ||
249 | + * the kernel is in the middle of reclaiming TDX private | ||
250 | + * page and doing MOVDIR64B). | ||
251 | + * | ||
252 | + * One temporary solution could be just converting all memory | ||
253 | + * pages, but it's problematic too, because not all pages are | ||
254 | + * mapped as writable in direct mapping. It can be done by | ||
255 | + * switching to the identical mapping for kexec() or a new page | ||
256 | + * table which maps all pages as writable, but the complexity is | ||
257 | + * overkill. | ||
258 | + * | ||
259 | + * Thus instead of doing something dramatic to convert all pages, | ||
260 | + * only convert PAMTs here. Other kernel components which use | ||
261 | + * TDX need to do the conversion on their own by intercepting the | ||
262 | + * rebooting/shutdown notifier (KVM already does that). | ||
263 | + */ | ||
264 | + tdmrs_reset_pamt_all(&tdx_tdmr_list); | ||
265 | +} | ||
266 | + | ||
267 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, | ||
268 | u32 *nr_tdx_keyids) | ||
269 | { | ||
270 | @@ -XXX,XX +XXX,XX @@ static struct notifier_block tdx_memory_nb = { | ||
271 | .notifier_call = tdx_memory_notifier, | ||
272 | }; | ||
273 | |||
274 | +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode, | ||
275 | + void *unused) | ||
276 | +{ | ||
277 | + /* Wait ongoing TDX initialization to finish */ | ||
278 | + mutex_lock(&tdx_module_lock); | ||
279 | + tdx_rebooting = true; | ||
280 | + mutex_unlock(&tdx_module_lock); | ||
281 | + | ||
282 | + return NOTIFY_OK; | ||
283 | +} | ||
284 | + | ||
285 | +static struct notifier_block tdx_reboot_nb = { | ||
286 | + .notifier_call = tdx_reboot_notifier, | ||
66 | +}; | 287 | +}; |
67 | + | 288 | + |
68 | /* | 289 | static int __init tdx_init(void) |
69 | * Wrapper of __seamcall() to convert SEAMCALL leaf function error code | ||
70 | * to kernel error code. @seamcall_ret and @out contain the SEAMCALL | ||
71 | * leaf function return code and the additional output respectively if | ||
72 | * not NULL. | ||
73 | */ | ||
74 | -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | ||
75 | - u64 *seamcall_ret, | ||
76 | - struct tdx_module_output *out) | ||
77 | +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | ||
78 | + u64 *seamcall_ret, struct tdx_module_output *out) | ||
79 | { | 290 | { |
80 | u64 sret; | 291 | u32 tdx_keyid_start, nr_tdx_keyids; |
81 | 292 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | |
82 | @@ -XXX,XX +XXX,XX @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | 293 | return -ENODEV; |
83 | } | 294 | } |
84 | } | 295 | |
85 | 296 | + err = register_reboot_notifier(&tdx_reboot_nb); | |
86 | +static void seamcall_smp_call_function(void *data) | 297 | + if (err) { |
87 | +{ | 298 | + pr_err("initialization failed: register_reboot_notifier() failed (%d)\n", |
88 | + struct seamcall_ctx *sc = data; | 299 | + err); |
89 | + int ret; | 300 | + unregister_memory_notifier(&tdx_memory_nb); |
90 | + | 301 | + return -ENODEV; |
91 | + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL); | 302 | + } |
92 | + if (ret) | 303 | + |
93 | + atomic_set(&sc->err, -EFAULT); | 304 | /* |
94 | +} | 305 | * Just use the first TDX KeyID as the 'global KeyID' and |
95 | + | 306 | * leave the rest for TDX guests. |
96 | +/* | ||
97 | + * Call the SEAMCALL on all online CPUs concurrently. Caller to check | ||
98 | + * @sc->err to determine whether any SEAMCALL failed on any cpu. | ||
99 | + */ | ||
100 | +static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | ||
101 | +{ | ||
102 | + on_each_cpu(seamcall_smp_call_function, sc, true); | ||
103 | +} | ||
104 | + | ||
105 | /* | ||
106 | * Detect and initialize the TDX module. | ||
107 | * | ||
108 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
109 | |||
110 | static void shutdown_tdx_module(void) | ||
111 | { | ||
112 | - /* TODO: Shut down the TDX module */ | ||
113 | + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN }; | ||
114 | + | ||
115 | + seamcall_on_each_cpu(&sc); | ||
116 | } | ||
117 | |||
118 | static int __tdx_enable(void) | ||
119 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
120 | index XXXXXXX..XXXXXXX 100644 | ||
121 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
122 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
123 | @@ -XXX,XX +XXX,XX @@ | ||
124 | /* MSR to report KeyID partitioning between MKTME and TDX */ | ||
125 | #define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 | ||
126 | |||
127 | +/* | ||
128 | + * TDX module SEAMCALL leaf functions | ||
129 | + */ | ||
130 | +#define TDH_SYS_LP_SHUTDOWN 44 | ||
131 | + | ||
132 | /* | ||
133 | * Do not put any hardware-defined TDX structure representations below | ||
134 | * this comment! | ||
135 | -- | 307 | -- |
136 | 2.38.1 | 308 | 2.41.0 | diff view generated by jsdifflib |
1 | After the global module initialization, the next step is logical-cpu | 1 | TDX cannot survive S3 and deeper states. The hardware resets and |
---|---|---|---|
2 | scope module initialization. Logical-cpu initialization requires | 2 | disables TDX completely when the platform goes to S3 or deeper. Both TDX |
3 | calling TDH.SYS.LP.INIT on all BIOS-enabled CPUs. This SEAMCALL can run | 3 | guests and the TDX module get destroyed permanently. |
4 | concurrently on all CPUs. | ||
5 | 4 | ||
6 | Use the helper introduced for shutting down the module to do logical-cpu | 5 | The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to |
7 | scope initialization. | 6 | support hibernation. The kernel also maintains TDX states to track |
7 | whether it has been initialized and its metadata resource, etc. After | ||
8 | resuming from S3 or hibernation, these TDX states won't be correct | ||
9 | anymore. | ||
10 | |||
11 | Theoretically, the kernel can do more complicated things like resetting | ||
12 | TDX internal states and TDX module metadata before going to S3 or | ||
13 | deeper, and re-initialize the TDX module after resuming, etc., but there is | ||
14 | no way to save/restore TDX guests for now. | ||
15 | |||
16 | Until TDX supports full save and restore of TDX guests, there is no big | ||
17 | value in handling the TDX module for suspend and hibernation alone. To make | ||
18 | things simple, just choose to make TDX mutually exclusive with S3 and | ||
19 | hibernation. | ||
20 | |||
21 | Note the TDX module is initialized at runtime. To avoid having to deal | ||
22 | with the fuss of determining TDX state at runtime, just choose TDX vs S3 | ||
23 | and hibernation at kernel early boot. It's a bad user experience if the | ||
24 | choice of TDX and S3/hibernation is done at runtime anyway, i.e., the | ||
25 | user can experience being able to do S3/hibernation but later becoming | ||
26 | unable to due to TDX being enabled. | ||
27 | |||
28 | Disable TDX in kernel early boot when hibernation is available, and give | ||
29 | a message telling the user to disable hibernation via kernel command | ||
30 | line in order to use TDX. Currently there's no mechanism exposed by the | ||
31 | hibernation code to allow other kernel code to disable hibernation once | ||
32 | for all. | ||
33 | |||
34 | Disable ACPI S3 by setting acpi_suspend_lowlevel function pointer to | ||
35 | NULL when TDX is enabled by the BIOS. This avoids having to modify the | ||
36 | ACPI code to disable ACPI S3 in other ways. | ||
37 | |||
38 | Also give a message telling the user to disable TDX in the BIOS in order | ||
39 | to use ACPI S3. A new kernel command line can be added in the future if | ||
40 | there's a need to let the user disable TDX host via kernel command line. | ||
8 | 41 | ||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 42 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
43 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
10 | --- | 44 | --- |
11 | arch/x86/virt/vmx/tdx/tdx.c | 14 ++++++++++++++ | 45 | |
12 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 46 | v13 -> v14: |
13 | 2 files changed, 15 insertions(+) | 47 | - New patch |
48 | |||
49 | --- | ||
50 | arch/x86/virt/vmx/tdx/tdx.c | 23 +++++++++++++++++++++++ | ||
51 | 1 file changed, 23 insertions(+) | ||
14 | 52 | ||
15 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 53 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
16 | index XXXXXXX..XXXXXXX 100644 | 54 | index XXXXXXX..XXXXXXX 100644 |
17 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 55 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
18 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 56 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
19 | @@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | 57 | @@ -XXX,XX +XXX,XX @@ |
20 | on_each_cpu(seamcall_smp_call_function, sc, true); | 58 | #include <linux/sort.h> |
21 | } | 59 | #include <linux/log2.h> |
22 | 60 | #include <linux/reboot.h> | |
23 | +static int tdx_module_init_cpus(void) | 61 | +#include <linux/suspend.h> |
24 | +{ | 62 | #include <asm/msr-index.h> |
25 | + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT }; | 63 | #include <asm/msr.h> |
64 | #include <asm/page.h> | ||
65 | #include <asm/special_insns.h> | ||
66 | +#include <asm/acpi.h> | ||
67 | #include <asm/tdx.h> | ||
68 | #include "tdx.h" | ||
69 | |||
70 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | ||
71 | return -ENODEV; | ||
72 | } | ||
73 | |||
74 | +#define HIBERNATION_MSG \ | ||
75 | + "Disable TDX because hibernation is available. Use 'nohibernate' command line to disable hibernation." | ||
76 | + /* | ||
77 | + * Note hibernation_available() can vary when it is called at | ||
78 | + * runtime as it checks secretmem_active() and cxl_mem_active() | ||
79 | + * which can both vary at runtime. But here at early_init() they | ||
80 | + * both cannot return true, thus when hibernation_available() | ||
81 | + * returns false here, hibernation is disabled by either | ||
82 | + * 'nohibernate' or LOCKDOWN_HIBERNATION security lockdown, | ||
83 | + * which are both permanent. | ||
84 | + */ | ||
85 | + if (hibernation_available()) { | ||
86 | + pr_err("initialization failed: %s\n", HIBERNATION_MSG); | ||
87 | + return -ENODEV; | ||
88 | + } | ||
26 | + | 89 | + |
27 | + seamcall_on_each_cpu(&sc); | 90 | err = register_memory_notifier(&tdx_memory_nb); |
28 | + | 91 | if (err) { |
29 | + return atomic_read(&sc.err); | 92 | pr_err("initialization failed: register_memory_notifier() failed (%d)\n", |
30 | +} | 93 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) |
31 | + | 94 | return -ENODEV; |
32 | /* | 95 | } |
33 | * Detect and initialize the TDX module. | 96 | |
34 | * | 97 | +#ifdef CONFIG_ACPI |
35 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 98 | + pr_info("Disable ACPI S3 suspend. Turn off TDX in the BIOS to use ACPI S3.\n"); |
36 | if (ret) | 99 | + acpi_suspend_lowlevel = NULL; |
37 | goto out; | 100 | +#endif |
38 | |||
39 | + /* Logical-cpu scope initialization */ | ||
40 | + ret = tdx_module_init_cpus(); | ||
41 | + if (ret) | ||
42 | + goto out; | ||
43 | + | 101 | + |
44 | /* | 102 | /* |
45 | * Return -EINVAL until all steps of TDX module initialization | 103 | * Just use the first TDX KeyID as the 'global KeyID' and |
46 | * process are done. | 104 | * leave the rest for TDX guests. |
47 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
48 | index XXXXXXX..XXXXXXX 100644 | ||
49 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
50 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
51 | @@ -XXX,XX +XXX,XX @@ | ||
52 | * TDX module SEAMCALL leaf functions | ||
53 | */ | ||
54 | #define TDH_SYS_INIT 33 | ||
55 | +#define TDH_SYS_LP_INIT 35 | ||
56 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
57 | |||
58 | /* | ||
59 | -- | 105 | -- |
60 | 2.38.1 | 106 | 2.41.0 |
1 | TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This | 1 | The first few generations of TDX hardware have an erratum. Triggering |
---|---|---|---|
2 | mode runs only the TDX module itself or other code to load the TDX | 2 | it in Linux requires some kind of kernel bug involving relatively exotic |
3 | module. | 3 | memory writes to TDX private memory and will manifest via |
4 | 4 | spurious-looking machine checks when reading the affected memory. | |
5 | The host kernel communicates with SEAM software via a new SEAMCALL | 5 | |
6 | instruction. This is conceptually similar to a guest->host hypercall, | 6 | == Background == |
7 | except it is made from the host to SEAM software instead. | 7 | |
8 | 8 | Virtually all kernel memory access operations happen in full |
9 | The TDX module defines a set of SEAMCALL leaf functions to allow the | 9 | cachelines. In practice, writing a "byte" of memory usually reads a 64 |
10 | host to initialize it, and to create and run protected VMs. SEAMCALL | 10 | byte cacheline of memory, modifies it, then writes the whole line back. |
11 | leaf functions use an ABI different from the x86-64 system-v ABI. | 11 | Those operations do not trigger this problem. |
12 | Instead, they share the same ABI with the TDCALL leaf functions. | 12 | |
13 | 13 | This problem is triggered by "partial" writes where a write transaction | |
14 | Implement a function __seamcall() to allow the host to make SEAMCALL | 14 | of less than a cacheline lands at the memory controller. The CPU does |
15 | to SEAM software using the TDX_MODULE_CALL macro which is the common | 15 | these via non-temporal write instructions (like MOVNTI), or through |
16 | assembly for both SEAMCALL and TDCALL. | 16 | UC/WC memory mappings. The issue can also be triggered away from the |
17 | 17 | CPU by devices doing partial writes via DMA. | |
18 | SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when | 18 | |
19 | CPU is not in VMX operation. The current TDX_MODULE_CALL macro doesn't | 19 | == Problem == |
20 | handle any of them. There's no way to check whether the CPU is in VMX | 20 | |
21 | operation or not. | 21 | A partial write to a TDX private memory cacheline will silently "poison" |
22 | 22 | the line. Subsequent reads will consume the poison and generate a | |
23 | Initializing the TDX module is done at runtime on demand, and it depends | 23 | machine check. According to the TDX hardware spec, neither of these |
24 | on the caller to ensure CPU is in VMX operation before making SEAMCALL. | 24 | things should have happened. |
25 | To avoid getting Oops when the caller mistakenly tries to initialize the | 25 | |
26 | TDX module when CPU is not in VMX operation, extend the TDX_MODULE_CALL | 26 | To add insult to injury, the Linux machine check code will present these as a |
27 | macro to handle #UD (and also #GP, which can theoretically still happen | 27 | literal "Hardware error" when they were, in fact, a software-triggered |
28 | when TDX isn't actually enabled by the BIOS, i.e. due to BIOS bug). | 28 | issue. |
29 | 29 | ||
30 | Introduce two new TDX error codes for #UD and #GP respectively so the | 30 | == Solution == |
31 | caller can distinguish. Also, opportunistically put the new TDX error | 31 | |
32 | codes and the existing TDX_SEAMCALL_VMFAILINVALID into INTEL_TDX_HOST | 32 | In the end, this issue is hard to trigger. Rather than do something |
33 | Kconfig option as they are only used when it is on. | 33 | rash (and incomplete) like unmap TDX private memory from the direct map, |
34 | 34 | improve the machine check handler. | |
35 | As __seamcall() can potentially return multiple error codes, besides the | 35 | |
36 | actual SEAMCALL leaf function return code, also introduce a wrapper | 36 | Currently, the #MC handler doesn't distinguish whether the memory is |
37 | function seamcall() to convert the __seamcall() error code to the kernel | 37 | TDX private memory or not, but just dumps, for instance, the below message: |
38 | error code, so the caller doesn't need to duplicate the code to check | 38 | |
39 | return value of __seamcall() and return kernel error code accordingly. | 39 | [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134 |
40 | [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0} | ||
41 | ... | ||
42 | [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii' | ||
43 | [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel | ||
44 | [...] Kernel panic - not syncing: Fatal local machine check | ||
45 | |||
46 | Which says "Hardware Error" and "Data load in unrecoverable area of | ||
47 | kernel". | ||
48 | |||
49 | Ideally, it's better for the log to say "software bug around TDX private | ||
50 | memory" instead of "Hardware Error". But in reality the real hardware | ||
51 | memory error can happen, and sadly such software-triggered #MC cannot be | ||
52 | distinguished from the real hardware error. Also, the error message is | ||
53 | used by userspace tool 'mcelog' to parse, so changing the output may | ||
54 | break userspace. | ||
55 | |||
56 | So keep the "Hardware Error". The "Data load in unrecoverable area of | ||
57 | kernel" is also helpful, so keep it too. | ||
58 | |||
59 | Instead of modifying above error log, improve the error log by printing | ||
60 | additional TDX related message to make the log like: | ||
61 | |||
62 | ... | ||
63 | [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel | ||
64 | [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug. | ||
65 | |||
66 | Adding this additional message requires determination of whether the | ||
67 | memory page is TDX private memory. There is no existing infrastructure | ||
68 | to do that. Add an interface to query the TDX module to fill this gap. | ||
69 | |||
70 | == Impact == | ||
71 | |||
72 | This issue requires some kind of kernel bug to trigger. | ||
73 | |||
74 | TDX private memory should never be mapped UC/WC. A partial write | ||
75 | originating from these mappings would require *two* bugs, first mapping | ||
76 | the wrong page, then writing the wrong memory. It would also be | ||
77 | detectable using traditional memory corruption techniques like | ||
78 | DEBUG_PAGEALLOC. | ||
79 | |||
80 | MOVNTI (and friends) could cause this issue with something like a simple | ||
81 | buffer overrun or use-after-free on the direct map. It should also be | ||
82 | detectable with normal debug techniques. | ||
83 | |||
84 | The one place where this might get nasty would be if the CPU read data | ||
85 | then wrote back the same data. That would trigger this problem but | ||
86 | would not, for instance, set off mechanisms like slab redzoning because | ||
87 | it doesn't actually corrupt data. | ||
88 | |||
89 | With an IOMMU at least, the DMA exposure is similar to the UC/WC issue. | ||
90 | TDX private memory would first need to be incorrectly mapped into the | ||
91 | I/O space and then a later DMA to that mapping would actually cause the | ||
92 | poisoning event. | ||
40 | 93 | ||
41 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 94 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
95 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
96 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
42 | --- | 97 | --- |
43 | 98 | ||
44 | v6 -> v7: | 99 | v13 -> v14: |
45 | - No change. | 100 | - No change |
46 | 101 | ||
47 | v5 -> v6: | 102 | v12 -> v13: |
48 | - Added code to handle #UD and #GP (Dave). | 103 | - Added Kirill and Yuan's tag. |
49 | - Moved the seamcall() wrapper function to this patch, and used a | 104 | |
50 | temporary __always_unused to avoid compile warning (Dave). | 105 | v11 -> v12: |
51 | 106 | - Simplified #MC message (Dave/Kirill) | |
52 | - v3 -> v5 (no feedback on v4): | 107 | - Slightly improved some comments. |
53 | - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the | 108 | |
54 | SEAMCALL itself fails. | 109 | v10 -> v11: |
55 | - Improve the changelog. | 110 | - New patch |
56 | 111 | ||
57 | --- | 112 | --- |
58 | arch/x86/include/asm/tdx.h | 9 ++++++ | 113 | arch/x86/include/asm/tdx.h | 2 + |
59 | arch/x86/virt/vmx/tdx/Makefile | 2 +- | 114 | arch/x86/kernel/cpu/mce/core.c | 33 +++++++++++ |
60 | arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++ | 115 | arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++ |
61 | arch/x86/virt/vmx/tdx/tdx.c | 42 ++++++++++++++++++++++++++ | 116 | arch/x86/virt/vmx/tdx/tdx.h | 5 ++ |
62 | arch/x86/virt/vmx/tdx/tdx.h | 8 +++++ | 117 | 4 files changed, 143 insertions(+) |
63 | arch/x86/virt/vmx/tdx/tdxcall.S | 19 ++++++++++-- | ||
64 | 6 files changed, 129 insertions(+), 3 deletions(-) | ||
65 | create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S | ||
66 | 118 | ||
67 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 119 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h |
68 | index XXXXXXX..XXXXXXX 100644 | 120 | index XXXXXXX..XXXXXXX 100644 |
69 | --- a/arch/x86/include/asm/tdx.h | 121 | --- a/arch/x86/include/asm/tdx.h |
70 | +++ b/arch/x86/include/asm/tdx.h | 122 | +++ b/arch/x86/include/asm/tdx.h |
123 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void); | ||
124 | int tdx_cpu_enable(void); | ||
125 | int tdx_enable(void); | ||
126 | void tdx_reset_memory(void); | ||
127 | +bool tdx_is_private_mem(unsigned long phys); | ||
128 | #else | ||
129 | static inline bool platform_tdx_enabled(void) { return false; } | ||
130 | static inline int tdx_cpu_enable(void) { return -ENODEV; } | ||
131 | static inline int tdx_enable(void) { return -ENODEV; } | ||
132 | static inline void tdx_reset_memory(void) { } | ||
133 | +static inline bool tdx_is_private_mem(unsigned long phys) { return false; } | ||
134 | #endif /* CONFIG_INTEL_TDX_HOST */ | ||
135 | |||
136 | #endif /* !__ASSEMBLY__ */ | ||
137 | diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c | ||
138 | index XXXXXXX..XXXXXXX 100644 | ||
139 | --- a/arch/x86/kernel/cpu/mce/core.c | ||
140 | +++ b/arch/x86/kernel/cpu/mce/core.c | ||
71 | @@ -XXX,XX +XXX,XX @@ | 141 | @@ -XXX,XX +XXX,XX @@ |
72 | #include <asm/ptrace.h> | 142 | #include <asm/mce.h> |
73 | #include <asm/shared/tdx.h> | 143 | #include <asm/msr.h> |
74 | 144 | #include <asm/reboot.h> | |
75 | +#ifdef CONFIG_INTEL_TDX_HOST | 145 | +#include <asm/tdx.h> |
76 | + | 146 | |
77 | +#include <asm/trapnr.h> | 147 | #include "internal.h" |
78 | + | 148 | |
79 | /* | 149 | @@ -XXX,XX +XXX,XX @@ static void wait_for_panic(void) |
80 | * SW-defined error codes. | 150 | panic("Panicing machine check CPU died"); |
81 | * | 151 | } |
82 | @@ -XXX,XX +XXX,XX @@ | 152 | |
83 | #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40)) | 153 | +static const char *mce_memory_info(struct mce *m) |
84 | #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000)) | 154 | +{ |
85 | 155 | + if (!m || !mce_is_memory_error(m) || !mce_usable_address(m)) | |
86 | +#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP) | 156 | + return NULL; |
87 | +#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD) | 157 | + |
88 | + | 158 | + /* |
89 | +#endif | 159 | + * Certain initial generations of TDX-capable CPUs have an |
90 | + | 160 | + * erratum. A kernel non-temporal partial write to TDX private |
91 | #ifndef __ASSEMBLY__ | 161 | + * memory poisons that memory, and a subsequent read of that |
92 | 162 | + * memory triggers #MC. | |
93 | /* | 163 | + * |
94 | diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile | 164 | + * However such #MC caused by software cannot be distinguished |
95 | index XXXXXXX..XXXXXXX 100644 | 165 | + * from the real hardware #MC. Just print additional message |
96 | --- a/arch/x86/virt/vmx/tdx/Makefile | 166 | + * to show such #MC may be result of the CPU erratum. |
97 | +++ b/arch/x86/virt/vmx/tdx/Makefile | 167 | + */ |
98 | @@ -XXX,XX +XXX,XX @@ | 168 | + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) |
99 | # SPDX-License-Identifier: GPL-2.0-only | 169 | + return NULL; |
100 | -obj-y += tdx.o | 170 | + |
101 | +obj-y += tdx.o seamcall.o | 171 | + return !tdx_is_private_mem(m->addr) ? NULL : |
102 | diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S | 172 | + "TDX private memory error. Possible kernel bug."; |
103 | new file mode 100644 | 173 | +} |
104 | index XXXXXXX..XXXXXXX | 174 | + |
105 | --- /dev/null | 175 | static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) |
106 | +++ b/arch/x86/virt/vmx/tdx/seamcall.S | 176 | { |
107 | @@ -XXX,XX +XXX,XX @@ | 177 | struct llist_node *pending; |
108 | +/* SPDX-License-Identifier: GPL-2.0 */ | 178 | struct mce_evt_llist *l; |
109 | +#include <linux/linkage.h> | 179 | int apei_err = 0; |
110 | +#include <asm/frame.h> | 180 | + const char *memmsg; |
111 | + | 181 | |
112 | +#include "tdxcall.S" | 182 | /* |
113 | + | 183 | * Allow instrumentation around external facilities usage. Not that it |
114 | +/* | 184 | @@ -XXX,XX +XXX,XX @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) |
115 | + * __seamcall() - Host-side interface functions to SEAM software module | 185 | } |
116 | + * (the P-SEAMLDR or the TDX module). | 186 | if (exp) |
117 | + * | 187 | pr_emerg(HW_ERR "Machine check: %s\n", exp); |
118 | + * Transform function call register arguments into the SEAMCALL register | 188 | + /* |
119 | + * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails, | 189 | + * Confidential computing platforms such as TDX platforms |
120 | + * or the completion status of the SEAMCALL leaf function. Additional | 190 | + * may occur MCE due to incorrect access to confidential |
121 | + * output operands are saved in @out (if it is provided by the caller). | 191 | + * memory. Print additional information for such error. |
122 | + * | 192 | + */ |
123 | + *------------------------------------------------------------------------- | 193 | + memmsg = mce_memory_info(final); |
124 | + * SEAMCALL ABI: | 194 | + if (memmsg) |
125 | + *------------------------------------------------------------------------- | 195 | + pr_emerg(HW_ERR "Machine check: %s\n", memmsg); |
126 | + * Input Registers: | 196 | + |
127 | + * | 197 | if (!fake_panic) { |
128 | + * RAX - SEAMCALL Leaf number. | 198 | if (panic_timeout == 0) |
129 | + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers. | 199 | panic_timeout = mca_cfg.panic_timeout; |
130 | + * | ||
131 | + * Output Registers: | ||
132 | + * | ||
133 | + * RAX - SEAMCALL completion status code. | ||
134 | + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers. | ||
135 | + * | ||
136 | + *------------------------------------------------------------------------- | ||
137 | + * | ||
138 | + * __seamcall() function ABI: | ||
139 | + * | ||
140 | + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX | ||
141 | + * @rcx (RSI) - Input parameter 1, moved to RCX | ||
142 | + * @rdx (RDX) - Input parameter 2, moved to RDX | ||
143 | + * @r8 (RCX) - Input parameter 3, moved to R8 | ||
144 | + * @r9 (R8) - Input parameter 4, moved to R9 | ||
145 | + * | ||
146 | + * @out (R9) - struct tdx_module_output pointer | ||
147 | + * stored temporarily in R12 (not | ||
148 | + * used by the P-SEAMLDR or the TDX | ||
149 | + * module). It can be NULL. | ||
150 | + * | ||
151 | + * Return (via RAX) the completion status of the SEAMCALL, or | ||
152 | + * TDX_SEAMCALL_VMFAILINVALID. | ||
153 | + */ | ||
154 | +SYM_FUNC_START(__seamcall) | ||
155 | + FRAME_BEGIN | ||
156 | + TDX_MODULE_CALL host=1 | ||
157 | + FRAME_END | ||
158 | + RET | ||
159 | +SYM_FUNC_END(__seamcall) | ||
160 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 200 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
161 | index XXXXXXX..XXXXXXX 100644 | 201 | index XXXXXXX..XXXXXXX 100644 |
162 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 202 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
163 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 203 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
164 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void) | 204 | @@ -XXX,XX +XXX,XX @@ void tdx_reset_memory(void) |
165 | return !!tdx_keyid_num; | 205 | tdmrs_reset_pamt_all(&tdx_tdmr_list); |
166 | } | 206 | } |
167 | 207 | ||
208 | +static bool is_pamt_page(unsigned long phys) | ||
209 | +{ | ||
210 | + struct tdmr_info_list *tdmr_list = &tdx_tdmr_list; | ||
211 | + int i; | ||
212 | + | ||
213 | + /* | ||
214 | + * This function is called from #MC handler, and theoretically | ||
215 | + * it could run in parallel with the TDX module initialization | ||
216 | + * on other logical cpus. But it's not OK to hold mutex here | ||
217 | + * so just blindly check module status to make sure PAMTs/TDMRs | ||
218 | + * are stable to access. | ||
219 | + * | ||
220 | + * This may return inaccurate result in rare cases, e.g., when | ||
221 | + * #MC happens on a PAMT page during module initialization, but | ||
222 | + * this is fine as #MC handler doesn't need a 100% accurate | ||
223 | + * result. | ||
224 | + */ | ||
225 | + if (tdx_module_status != TDX_MODULE_INITIALIZED) | ||
226 | + return false; | ||
227 | + | ||
228 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { | ||
229 | + unsigned long base, size; | ||
230 | + | ||
231 | + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size); | ||
232 | + | ||
233 | + if (phys >= base && phys < (base + size)) | ||
234 | + return true; | ||
235 | + } | ||
236 | + | ||
237 | + return false; | ||
238 | +} | ||
239 | + | ||
168 | +/* | 240 | +/* |
169 | + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code | 241 | + * Return whether the memory page at the given physical address is TDX |
170 | + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL | 242 | + * private memory or not. Called from #MC handler do_machine_check(). |
171 | + * leaf function return code and the additional output respectively if | 243 | + * |
172 | + * not NULL. | 244 | + * Note this function may not return an accurate result in rare cases. |
245 | + * This is fine as the #MC handler doesn't need a 100% accurate result, | ||
246 | + * because it cannot distinguish #MC between software bug and real | ||
247 | + * hardware error anyway. | ||
173 | + */ | 248 | + */ |
174 | +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | 249 | +bool tdx_is_private_mem(unsigned long phys) |
175 | + u64 *seamcall_ret, | ||
176 | + struct tdx_module_output *out) | ||
177 | +{ | 250 | +{ |
251 | + struct tdx_module_args args = { | ||
252 | + .rcx = phys & PAGE_MASK, | ||
253 | + }; | ||
178 | + u64 sret; | 254 | + u64 sret; |
179 | + | 255 | + |
180 | + sret = __seamcall(fn, rcx, rdx, r8, r9, out); | 256 | + if (!platform_tdx_enabled()) |
181 | + | 257 | + return false; |
182 | + /* Save SEAMCALL return code if caller wants it */ | 258 | + |
183 | + if (seamcall_ret) | 259 | + /* Get page type from the TDX module */ |
184 | + *seamcall_ret = sret; | 260 | + sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args); |
185 | + | 261 | + /* |
186 | + /* SEAMCALL was successful */ | 262 | + * Handle the case that CPU isn't in VMX operation. |
187 | + if (!sret) | 263 | + * |
188 | + return 0; | 264 | + * KVM guarantees no VM is running (thus no TDX guest) |
189 | + | 265 | + * when there's any online CPU isn't in VMX operation. |
190 | + switch (sret) { | 266 | + * This means there will be no TDX guest private memory |
191 | + case TDX_SEAMCALL_GP: | 267 | + * and Secure-EPT pages. However the TDX module may have |
192 | + /* | 268 | + * been initialized and the memory page could be PAMT. |
193 | + * platform_tdx_enabled() is checked to be true | 269 | + */ |
194 | + * before making any SEAMCALL. | 270 | + if (sret == TDX_SEAMCALL_UD) |
195 | + */ | 271 | + return is_pamt_page(phys); |
196 | + WARN_ON_ONCE(1); | 272 | + |
197 | + fallthrough; | 273 | + /* |
198 | + case TDX_SEAMCALL_VMFAILINVALID: | 274 | + * Any other failure means: |
199 | + /* Return -ENODEV if the TDX module is not loaded. */ | 275 | + * |
200 | + return -ENODEV; | 276 | + * 1) TDX module not loaded; or |
201 | + case TDX_SEAMCALL_UD: | 277 | + * 2) Memory page isn't managed by the TDX module. |
202 | + /* Return -EINVAL if CPU isn't in VMX operation. */ | 278 | + * |
203 | + return -EINVAL; | 279 | + * In either case, the memory page cannot be a TDX |
280 | + * private page. | ||
281 | + */ | ||
282 | + if (sret) | ||
283 | + return false; | ||
284 | + | ||
285 | + /* | ||
286 | + * SEAMCALL was successful -- read page type (via RCX): | ||
287 | + * | ||
288 | + * - PT_NDA: Page is not used by the TDX module | ||
289 | + * - PT_RSVD: Reserved for Non-TDX use | ||
290 | + * - Others: Page is used by the TDX module | ||
291 | + * | ||
292 | + * Note PAMT pages are marked as PT_RSVD but they are also TDX | ||
293 | + * private memory. | ||
294 | + * | ||
295 | + * Note: Even page type is PT_NDA, the memory page could still | ||
296 | + * be associated with TDX private KeyID if the kernel hasn't | ||
297 | + * explicitly used MOVDIR64B to clear the page. Assume KVM | ||
298 | + * always does that after reclaiming any private page from TDX | ||
299 | + * gusets. | ||
300 | + */ | ||
301 | + switch (args.rcx) { | ||
302 | + case PT_NDA: | ||
303 | + return false; | ||
304 | + case PT_RSVD: | ||
305 | + return is_pamt_page(phys); | ||
204 | + default: | 306 | + default: |
205 | + /* Return -EIO if the actual SEAMCALL leaf failed. */ | 307 | + return true; |
206 | + return -EIO; | ||
207 | + } | 308 | + } |
208 | +} | 309 | +} |
209 | + | 310 | + |
210 | /* | 311 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, |
211 | * Detect and initialize the TDX module. | 312 | u32 *nr_tdx_keyids) |
212 | * | 313 | { |
213 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 314 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
214 | index XXXXXXX..XXXXXXX 100644 | 315 | index XXXXXXX..XXXXXXX 100644 |
215 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 316 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
216 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 317 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
217 | @@ -XXX,XX +XXX,XX @@ | 318 | @@ -XXX,XX +XXX,XX @@ |
218 | /* MSR to report KeyID partitioning between MKTME and TDX */ | 319 | /* |
219 | #define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 | 320 | * TDX module SEAMCALL leaf functions |
220 | 321 | */ | |
221 | +/* | 322 | +#define TDH_PHYMEM_PAGE_RDMD 24 |
222 | + * Do not put any hardware-defined TDX structure representations below | 323 | #define TDH_SYS_KEY_CONFIG 31 |
223 | + * this comment! | 324 | #define TDH_SYS_INFO 32 |
224 | + */ | 325 | #define TDH_SYS_INIT 33 |
225 | + | ||
226 | +struct tdx_module_output; | ||
227 | +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | ||
228 | + struct tdx_module_output *out); | ||
229 | #endif | ||
230 | diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S | ||
231 | index XXXXXXX..XXXXXXX 100644 | ||
232 | --- a/arch/x86/virt/vmx/tdx/tdxcall.S | ||
233 | +++ b/arch/x86/virt/vmx/tdx/tdxcall.S | ||
234 | @@ -XXX,XX +XXX,XX @@ | 326 | @@ -XXX,XX +XXX,XX @@ |
235 | /* SPDX-License-Identifier: GPL-2.0 */ | 327 | #define TDH_SYS_TDMR_INIT 36 |
236 | #include <asm/asm-offsets.h> | 328 | #define TDH_SYS_CONFIG 45 |
237 | #include <asm/tdx.h> | 329 | |
238 | +#include <asm/asm.h> | 330 | +/* TDX page types */ |
239 | 331 | +#define PT_NDA 0x0 | |
240 | /* | 332 | +#define PT_RSVD 0x1 |
241 | * TDCALL and SEAMCALL are supported in Binutils >= 2.36. | 333 | + |
242 | @@ -XXX,XX +XXX,XX @@ | 334 | struct cmr_info { |
243 | /* Leave input param 2 in RDX */ | 335 | u64 base; |
244 | 336 | u64 size; | |
245 | .if \host | ||
246 | +1: | ||
247 | seamcall | ||
248 | /* | ||
249 | * SEAMCALL instruction is essentially a VMExit from VMX root | ||
250 | @@ -XXX,XX +XXX,XX @@ | ||
251 | * This value will never be used as actual SEAMCALL error code as | ||
252 | * it is from the Reserved status code class. | ||
253 | */ | ||
254 | - jnc .Lno_vmfailinvalid | ||
255 | + jnc .Lseamcall_out | ||
256 | mov $TDX_SEAMCALL_VMFAILINVALID, %rax | ||
257 | -.Lno_vmfailinvalid: | ||
258 | + jmp .Lseamcall_out | ||
259 | +2: | ||
260 | + /* | ||
261 | + * SEAMCALL caused #GP or #UD. If we reach here, %eax contains
262 | + * the trap number. Convert the trap number to the TDX error | ||
263 | + * code by setting TDX_SW_ERROR to the high 32-bits of %rax. | ||
264 | + * | ||
265 | + * Note we cannot OR TDX_SW_ERROR directly into %rax, as the OR
266 | + * instruction accepts at most a 32-bit immediate.
267 | + */ | ||
268 | + mov $TDX_SW_ERROR, %r12 | ||
269 | + orq %r12, %rax | ||
270 | |||
271 | + _ASM_EXTABLE_FAULT(1b, 2b) | ||
272 | +.Lseamcall_out: | ||
273 | .else | ||
274 | tdcall | ||
275 | .endif | ||
276 | -- | 337 | -- |
277 | 2.38.1 | 338 | 2.41.0 | diff view generated by jsdifflib |
... | ... | ||
---|---|---|---|
6 | materials under it, and add a new menu for TDX host kernel support. | 6 | materials under it, and add a new menu for TDX host kernel support. |
7 | 7 | ||
8 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 8 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
9 | --- | 9 | --- |
10 | 10 | ||
11 | v6 -> v7: | 11 | - Added new sections for "Erratum" and "TDX vs S3/hibernation" |
12 | - Changed "TDX Memory Policy" and "Kexec()" sections. | ||
13 | 12 | ||
14 | --- | 13 | --- |
15 | Documentation/x86/tdx.rst | 181 +++++++++++++++++++++++++++++++++++--- | 14 | Documentation/arch/x86/tdx.rst | 217 +++++++++++++++++++++++++++++++-- |
16 | 1 file changed, 170 insertions(+), 11 deletions(-) | 15 | 1 file changed, 206 insertions(+), 11 deletions(-) |
17 | 16 | ||
18 | diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst | 17 | diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst |
19 | index XXXXXXX..XXXXXXX 100644 | 18 | index XXXXXXX..XXXXXXX 100644 |
20 | --- a/Documentation/x86/tdx.rst | 19 | --- a/Documentation/arch/x86/tdx.rst |
21 | +++ b/Documentation/x86/tdx.rst | 20 | +++ b/Documentation/arch/x86/tdx.rst |
22 | @@ -XXX,XX +XXX,XX @@ encrypting the guest memory. In TDX, a special module running in a special | 21 | @@ -XXX,XX +XXX,XX @@ encrypting the guest memory. In TDX, a special module running in a special |
23 | mode sits between the host and the guest and manages the guest/host | 22 | mode sits between the host and the guest and manages the guest/host |
24 | separation. | 23 | separation. |
25 | 24 | ||
26 | +TDX Host Kernel Support | 25 | +TDX Host Kernel Support |
... | ... | ||
46 | +----------------------- | 45 | +----------------------- |
47 | + | 46 | + |
48 | +The kernel detects TDX by detecting TDX private KeyIDs during kernel | 47 | +The kernel detects TDX by detecting TDX private KeyIDs during kernel |
49 | +boot. Below dmesg shows when TDX is enabled by BIOS:: | 48 | +boot. Below dmesg shows when TDX is enabled by BIOS:: |
50 | + | 49 | + |
51 | + [..] tdx: TDX enabled by BIOS. TDX private KeyID range: [16, 64). | 50 | + [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64) |
52 | + | 51 | + |
53 | +TDX module detection and initialization | 52 | +TDX module initialization |
54 | +--------------------------------------- | 53 | +--------------------------------------- |
55 | + | ||
56 | +There is no CPUID or MSR to detect the TDX module. The kernel detects it | ||
57 | +by initializing it. | ||
58 | + | 54 | + |
59 | +The kernel talks to the TDX module via the new SEAMCALL instruction. The | 55 | +The kernel talks to the TDX module via the new SEAMCALL instruction. The |
60 | +TDX module implements SEAMCALL leaf functions to allow the kernel to | 56 | +TDX module implements SEAMCALL leaf functions to allow the kernel to |
61 | +initialize it. | 57 | +initialize it. |
58 | + | ||
59 | +If the TDX module isn't loaded, the SEAMCALL instruction fails with a | ||
60 | +special error. In this case the kernel fails the module initialization | ||
61 | +and reports the module isn't loaded:: | ||
62 | + | ||
63 | + [..] virt/tdx: module not loaded | ||
62 | + | 64 | + |
63 | +Initializing the TDX module consumes roughly 1/256th of system RAM to | 65 | +Initializing the TDX module consumes roughly 1/256th of system RAM to
64 | +use it as 'metadata' for the TDX memory. It also takes additional CPU | 66 | +use it as 'metadata' for the TDX memory. It also takes additional CPU
65 | +time to initialize that metadata along with the TDX module itself. | 67 | +time to initialize that metadata along with the TDX module itself.
66 | +Neither is trivial. The kernel initializes the TDX module at runtime on | 68 | +Neither is trivial. The kernel initializes the TDX module at runtime on
67 | +demand. The caller calls tdx_enable() to initialize the TDX module:: | 69 | +demand.
68 | + | 70 | + |
71 | +Besides initializing the TDX module, a per-cpu initialization SEAMCALL | ||
72 | +must be done on one cpu before any other SEAMCALLs can be made on that | ||
73 | +cpu. | ||
74 | + | ||
75 | +The kernel provides two functions, tdx_enable() and tdx_cpu_enable() to | ||
76 | +allow the user of TDX to enable the TDX module and enable TDX on the
77 | +local cpu.
78 | + | ||
79 | +Making SEAMCALL requires the CPU already being in VMX operation (VMXON | ||
80 | +has been done). For now both tdx_enable() and tdx_cpu_enable() don't | ||
81 | +handle VMXON internally, but depend on the caller to guarantee that.
82 | + | ||
83 | +To enable TDX, the caller of TDX should: 1) hold the read lock of the
84 | +CPU hotplug lock; 2) do VMXON and tdx_cpu_enable() on all online cpus
85 | +successfully; 3) call tdx_enable(). For example::
86 | + | ||
87 | + cpus_read_lock(); | ||
88 | + on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, true);
69 | + ret = tdx_enable(); | 89 | + ret = tdx_enable(); |
90 | + cpus_read_unlock(); | ||
70 | + if (ret) | 91 | + if (ret) |
71 | + goto no_tdx; | 92 | + goto no_tdx; |
72 | + // TDX is ready to use | 93 | + // TDX is ready to use |
73 | + | 94 | + |
74 | +Initializing the TDX module requires all logical CPUs to be online. | 95 | +And the caller of TDX must guarantee tdx_cpu_enable() has been done
75 | +tdx_enable() internally temporarily disables CPU hotplug to prevent any | 96 | +successfully on a cpu before running any other SEAMCALL on that cpu.
76 | +CPU from going offline, but the caller still needs to guarantee all | 97 | +A typical usage is to do both VMXON and tdx_cpu_enable() in the CPU
77 | +present CPUs are online before calling tdx_enable(). | 98 | +hotplug online callback, and refuse the onlining if tdx_cpu_enable() fails.
78 | + | 99 | + |
79 | +Also, tdx_enable() requires that all CPUs are already in VMX operation | 100 | +Users can consult dmesg to see whether the TDX module has been initialized.
80 | +(requirement of making SEAMCALL). Currently, tdx_enable() doesn't handle | ||
81 | +VMXON internally, but depends on the caller to guarantee that. So far | ||
82 | +KVM is the only user of TDX and KVM already handles VMXON. | ||
83 | + | ||
84 | +Users can consult dmesg to see the presence of the TDX module, and whether
85 | +it has been initialized. | ||
86 | + | ||
87 | +If the TDX module is not loaded, dmesg shows below:: | ||
88 | + | ||
89 | + [..] tdx: TDX module is not loaded. | ||
90 | + | 101 | + |
91 | +If the TDX module is initialized successfully, dmesg shows something | 102 | +If the TDX module is initialized successfully, dmesg shows something |
92 | +like below:: | 103 | +like below:: |
93 | + | 104 | + |
94 | + [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160 | 105 | + [..] virt/tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160 |
95 | + [..] tdx: 65667 pages allocated for PAMT. | 106 | + [..] virt/tdx: 262668 KBs allocated for PAMT |
96 | + [..] tdx: TDX module initialized. | 107 | + [..] virt/tdx: module initialized |
97 | + | 108 | + |
98 | +If the TDX module failed to initialize, dmesg shows below:: | 109 | +If the TDX module failed to initialize, dmesg also shows it failed to |
99 | + | 110 | +initialize:: |
100 | + [..] tdx: Failed to initialize TDX module. Shut it down. | 111 | + |
112 | + [..] virt/tdx: module initialization failed ... | ||
101 | + | 113 | + |
102 | +TDX Interaction to Other Kernel Components | 114 | +TDX Interaction to Other Kernel Components |
103 | +------------------------------------------ | 115 | +------------------------------------------ |
104 | + | 116 | + |
105 | +TDX Memory Policy | 117 | +TDX Memory Policy |
106 | +~~~~~~~~~~~~~~~~~ | 118 | +~~~~~~~~~~~~~~~~~ |
107 | + | 119 | + |
108 | +TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all | 120 | +TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
109 | +memory regions that can possibly be used by the TDX module, but they are | 121 | +kernel which memory is TDX compatible. The kernel needs to build a list |
110 | +not automatically usable to the TDX module. As a step of initializing | 122 | +of memory regions (out of CMRs) as "TDX-usable" memory and pass those |
111 | +the TDX module, the kernel needs to choose a list of memory regions (out | 123 | +regions to the TDX module. Once this is done, those "TDX-usable" memory |
112 | +from convertible memory regions) that the TDX module can use and pass | 124 | +regions are fixed during the module's lifetime.
113 | +those regions to the TDX module. Once this is done, those "TDX-usable" | ||
114 | +memory regions are fixed during module's lifetime. No more TDX-usable | ||
115 | +memory can be added to the TDX module after that. | ||
116 | + | 125 | + |
117 | +To keep things simple, currently the kernel simply guarantees all pages | 126 | +To keep things simple, currently the kernel simply guarantees all pages |
118 | +in the page allocator are TDX memory. Specifically, the kernel uses all | 127 | +in the page allocator are TDX memory. Specifically, the kernel uses all |
119 | +system memory in the core-mm at the time of initializing the TDX module | 128 | +system memory in the core-mm at the time of initializing the TDX module |
120 | +as TDX memory, and in the meantime, refuses to add any non-TDX-memory in | 129 | +as TDX memory, and in the meantime, refuses to online any non-TDX-memory
121 | +the memory hotplug. | 130 | +in the memory hotplug. |
122 | + | 131 | + |
123 | +This can be enhanced in the future, e.g. by allowing non-TDX memory to | 132 | +Physical Memory Hotplug
124 | +be added to a separate NUMA node. In this case, the "TDX-capable" nodes | 133 | +~~~~~~~~~~~~~~~~~~~~~~~
125 | +and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace | ||
126 | +needs to guarantee memory pages for TDX guests are always allocated from | ||
127 | +the "TDX-capable" nodes. | ||
128 | + | 134 | + |
129 | +Note TDX assumes convertible memory is always physically present during | 135 | +Note TDX assumes convertible memory is always physically present during |
130 | +machine's runtime. A non-buggy BIOS should never support hot-removal of | 136 | +machine's runtime. A non-buggy BIOS should never support hot-removal of |
131 | +any convertible memory. This implementation doesn't handle ACPI memory | 137 | +any convertible memory. This implementation doesn't handle ACPI memory |
132 | +removal but depends on the BIOS to behave correctly. | 138 | +removal but depends on the BIOS to behave correctly. |
133 | + | 139 | + |
134 | +CPU Hotplug | 140 | +CPU Hotplug |
135 | +~~~~~~~~~~~ | 141 | +~~~~~~~~~~~ |
142 | + | ||
143 | +The TDX module requires that the per-cpu initialization SEAMCALL
144 | +(TDH.SYS.LP.INIT) be done on one cpu before any other SEAMCALLs can be
145 | +made on that cpu, including those involved in the module initialization.
146 | + | ||
147 | +The kernel provides tdx_cpu_enable() to let the user of TDX do it when
148 | +the user wants to use a new cpu for a TDX task.
136 | + | 149 | + |
137 | +TDX doesn't support physical (ACPI) CPU hotplug. During machine boot, | 150 | +TDX doesn't support physical (ACPI) CPU hotplug. During machine boot, |
138 | +TDX verifies all boot-time present logical CPUs are TDX compatible before | 151 | +TDX verifies all boot-time present logical CPUs are TDX compatible before |
139 | +enabling TDX. A non-buggy BIOS should never support hot-add/removal of | 152 | +enabling TDX. A non-buggy BIOS should never support hot-add/removal of |
140 | +physical CPU. Currently the kernel doesn't handle physical CPU hotplug, | 153 | +physical CPU. Currently the kernel doesn't handle physical CPU hotplug, |
... | ... | ||
146 | +Kexec() | 159 | +Kexec() |
147 | +~~~~~~~ | 160 | +~~~~~~~ |
148 | + | 161 | + |
149 | +There are two problems in terms of using kexec() to boot to a new kernel | 162 | +There are two problems in terms of using kexec() to boot to a new kernel |
150 | +when the old kernel has enabled TDX: 1) Part of the memory pages are | 163 | +when the old kernel has enabled TDX: 1) Part of the memory pages are |
151 | +still TDX private pages (i.e. metadata used by the TDX module, and any | 164 | +still TDX private pages; 2) There might be dirty cachelines associated |
152 | +TDX guest memory if kexec() is executed when there are live TDX guests). | 165 | +with TDX private pages.
153 | +2) There might be dirty cachelines associated with TDX private pages. | 166 | + |
154 | + | 167 | +The first problem doesn't matter. KeyID 0 doesn't have an integrity
155 | +Because the hardware doesn't guarantee cache coherency among different | 168 | +check. Even if the new kernel wants to use a non-zero KeyID, it needs
156 | +KeyIDs, the old kernel needs to flush cache (of TDX private pages) | 169 | +to convert the memory to that KeyID, and such conversion works from any KeyID.
157 | +before booting to the new kernel. Also, the kernel doesn't convert all | 170 | + |
158 | +TDX private pages back to normal because of the below considerations: | 171 | +However, the old kernel needs to guarantee there's no dirty cacheline
159 | + | 172 | +left behind before booting to the new kernel to avoid silent corruption |
160 | +1) The kernel doesn't have existing infrastructure to track which pages | 173 | +from later cacheline writeback (Intel hardware doesn't guarantee cache |
161 | + are TDX private pages. | 174 | +coherency across different KeyIDs).
162 | +2) The number of TDX private pages can be large, and converting all of | 175 | + |
163 | + them (cache flush + using MOVDIR64B to clear the page) can be time | 176 | +Similar to AMD SME, the kernel just uses wbinvd() to flush cache before |
164 | + consuming. | 177 | +booting to the new kernel. |
165 | +3) The new kernel will almost only use KeyID 0 to access memory. KeyID | 178 | + |
166 | + 0 doesn't support integrity-check, so it's OK. | 179 | +Erratum |
167 | +4) The kernel doesn't (and may never) support MKTME. If any 3rd party | 180 | +~~~~~~~ |
168 | + kernel ever supports MKTME, it should do MOVDIR64B to clear the page | 181 | + |
169 | + with the new MKTME KeyID (just like TDX does) before using it. | 182 | +The first few generations of TDX hardware have an erratum. A partial |
170 | + | 183 | +write to a TDX private memory cacheline will silently "poison" the |
171 | +The current TDX module architecture doesn't play nicely with kexec(). | 184 | +line. Subsequent reads will consume the poison and generate a machine |
172 | +The TDX module can only be initialized once during its lifetime, and | 185 | +check. |
173 | +there is no SEAMCALL to reset the module to give a new clean slate to | 186 | + |
174 | +the new kernel. Therefore, ideally, if the module is ever initialized, | 187 | +A partial write is a memory write where a write transaction of less than |
175 | +it's better to shut down the module. The new kernel won't be able to | 188 | +cacheline lands at the memory controller. The CPU does these via |
176 | +use TDX anyway (as it needs to go through the TDX module initialization | 189 | +non-temporal write instructions (like MOVNTI), or through UC/WC memory |
177 | +process which will fail immediately at the first step). | 190 | +mappings. Devices can also do partial writes via DMA. |
178 | + | 191 | + |
179 | +However, there's no guarantee the CPU is in VMX operation during kexec(), so | 192 | +Theoretically, a kernel bug could do a partial write to TDX private memory
180 | +it's impractical to shut down the module. Currently, the kernel just | 193 | +and trigger an unexpected machine check. What's more, the machine check
181 | +leaves the module in an open state. | 194 | +code will present these as "Hardware error" when they were, in fact, a
195 | +software-triggered issue. That said, this issue is hard to trigger in practice.
196 | + | ||
197 | +If the platform has this erratum, the kernel does two additional things:
198 | +1) it resets TDX private pages using MOVDIR64B in kexec() before booting
199 | +to the new kernel; 2) it prints an additional message in the machine
200 | +check handler to tell the user that the machine check may be caused by a
201 | +kernel bug on TDX private memory.
202 | + | ||
203 | +Interaction vs S3 and deeper states | ||
204 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
205 | + | ||
206 | +TDX cannot survive S3 and deeper states. The hardware resets and
207 | +disables TDX completely when the platform goes to S3 and deeper. Both TDX
208 | +guests and the TDX module get destroyed permanently. | ||
209 | + | ||
210 | +The kernel uses S3 for suspend-to-ram, and S4 and deeper states for
211 | +hibernation. Currently, for simplicity, the kernel chooses to make TDX | ||
212 | +mutually exclusive with S3 and hibernation. | ||
213 | + | ||
214 | +In most cases, the user needs to add the 'nohibernation' kernel command
215 | +line parameter in order to use TDX. S3 is disabled during early kernel boot if TDX is
216 | +detected. The user needs to turn off TDX in the BIOS in order to use S3. | ||
182 | + | 217 | + |
183 | +TDX Guest Support | 218 | +TDX Guest Support |
184 | +================= | 219 | +================= |
185 | Since the host cannot directly access guest registers or memory, much | 220 | Since the host cannot directly access guest registers or memory, much |
186 | normal functionality of a hypervisor must be moved into the guest. This is | 221 | normal functionality of a hypervisor must be moved into the guest. This is |
... | ... | ||
283 | +------------------------- | 318 | +------------------------- |
284 | 319 | ||
285 | All TDX guest memory starts out as private at boot. This memory can not | 320 | All TDX guest memory starts out as private at boot. This memory can not |
286 | be accessed by the hypervisor. However, some kernel users like device | 321 | be accessed by the hypervisor. However, some kernel users like device |
287 | -- | 322 | -- |
288 | 2.38.1 | 323 | 2.41.0