Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  The TDX specs are available in [1].

This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately [2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed [3].  KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.

This series doesn't aim to support all functionalities (e.g. exposing the
TDX module via /sysfs), and doesn't aim to resolve everything perfectly.
In particular, the implementation of how to choose "TDX-usable" memory
and of memory hotplug handling is simple: this series just makes sure
all pages in the page allocator are TDX memory.

A better solution, suggested by Kirill, is similar to the per-node
memory encryption flag series [4].  Similarly, a per-node TDX flag can
be added so that "TDX-capable" and "non-TDX-capable" nodes can co-exist.
By exposing the TDX flag to userspace via /sysfs, userspace can then
use NUMA APIs to bind TDX guests to the "TDX-capable" nodes.

For more information please refer to the "Kernel policy on TDX memory"
and "Memory hotplug" sections below.  Huang, Ying is working on this
"per-node TDX flag" support and will post another series independently.

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

Also, other optimizations will be posted as follow-ups once this initial
TDX support is upstreamed.

Hi Dave, Dan, Kirill, Ying (and Intel reviewers),

Please kindly help to review, and I would appreciate Reviewed-by or
Acked-by tags if the patches look good to you.

This series has been reviewed by Isaku, who is developing the KVM TDX
patches.  Kirill has also reviewed a couple of patches.

I would also highly appreciate it if anyone else could help to review
this series.

----- Changelog history: ------

- v6 -> v7:

 - Added memory hotplug support.
 - Changed how to choose the list of "TDX-usable" memory regions from
   kernel boot time to TDX module initialization time.
 - Addressed comments received in previous versions (Andi/Dave).
 - Improved the commit message and the comments of the kexec() support
   patch, and made the patch handle returning PAMTs back to the kernel
   when TDX module initialization fails.  Please also see the "kexec()"
   section below.
 - Changed the documentation patch accordingly.
 - For all others please see individual patch changelog history.

- v5 -> v6:

 - Removed ACPI CPU/memory hotplug patches (Intel internal discussion).
 - Removed the patch to disable driver-managed memory hotplug (Intel
   internal discussion).
 - Added one patch to introduce an enum type for TDX supported page size
   levels to replace the hard-coded values in TDX guest code (Dave).
 - Added one patch to make TDX depend on X2APIC being enabled (Dave).
 - Added one patch to build all boot-time present memory regions as TDX
   memory during kernel boot.
 - Added Reviewed-by tags from others to some patches.
 - For all others please see individual patch changelog history.

- v4 -> v5:

 This is essentially a resend of v4.  Sorry I forgot to consult
 get_maintainer.pl when sending out v4, so I forgot to add the linux-acpi
 and linux-mm mailing lists and the relevant people for 4 new patches.

 There are also very minor code and commit message updates from v4:

 - Rebased to latest tip/x86/tdx.
 - Fixed a checkpatch issue that I missed in v4.
 - Removed an obsolete comment that I missed in patch 6.
 - Very minor update to the commit message of patch 12.

 For other changes to individual patches since v3, please refer to the
 changelog history of individual patches (I just used v3 -> v5 since
 there's basically no code change in v4).

- v3 -> v4 (addressed Dave's comments, and other comments from others):

 - Simplified SEAMRR and TDX KeyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver-managed memory
   hotplug.
 - Removed tdx_detect(), and only use a single tdx_init().
 - Removed detecting the TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the patch that added a boot-time command line option to
   disable TDX.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

- v2 -> v3:

 - Addressed comments from Isaku.
 - Fixed a memory leak and an unnecessary function argument in the patch
   to configure the key for the global KeyID (patch 17).
 - Enhanced the patch to get TDX module and CMR information a little bit
   (patch 09).
 - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
   - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add the TDX host kernel support
   material to Documentation/x86/tdx.rst together with the TDX guest
   stuff, instead of a standalone file (patch 21).
 - Very minor improvements in commit messages.

- RFC (v1) -> v2:
 - Rebased to Kirill's latest TDX guest code.
 - Fixed two issues that are related to finding all RAM memory regions
   based on e820.
 - Minor improvements to comments and commit messages.

v6:
https://lore.kernel.org/linux-mm/cover.1666824663.git.kai.huang@intel.com/T/

v5:
https://lore.kernel.org/lkml/cover.1655894131.git.kai.huang@intel.com/T/

v3:
https://lore.kernel.org/lkml/68484e168226037c3a25b6fb983b052b26ab3ec1.camel@intel.com/T/

v2:
https://lore.kernel.org/lkml/cover.1647167475.git.kai.huang@intel.com/T/

RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in the new
isolated range as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyIDs as TDX private KeyIDs, which are only accessible within the SEAM
mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to that of the TDX module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized.  This series assumes the TDX module is loaded
by the BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter 13, "Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function, tdx_enable(), to allow the caller to
initialize TDX at runtime:

        if (tdx_enable())
                goto no_tdx;
        // TDX is ready to create TD guests.

This approach has the following pros:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is only consumed
when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
kernel.

3) It is more flexible for supporting "TDX module runtime update" (not
in this series).  After updating to a new module at runtime, the kernel
needs to go through the initialization process again.

2. CPU hotplug

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS
should never support hotpluggable CPU devices and/or deliver ACPI CPU
hotplug events to the kernel.  This series doesn't handle physical
(ACPI) CPU hotplug at all, but depends on the BIOS to behave correctly.

Note TDX works with logical CPU online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as
usable for TDX private memory.

The initial support of TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM in the core-mm at the time of initializing the TDX module as
TDX memory, to guarantee that all pages in the page allocator are TDX
pages.

4. Memory hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed during the
module's runtime.  No more "TDX-usable" memory can be added to the TDX
module after that.

To achieve the above guarantee that all pages in the page allocator are
TDX pages, this series simply chooses to reject any non-TDX-usable
memory in memory hotplug.

This _will_ be enhanced in the future after the first submission.  The
direction we are heading in is to allow adding/onlining non-TDX memory
to separate NUMA nodes so that both "TDX-capable" nodes and
"non-TDX-capable" nodes can co-exist.  The TDX flag can be exposed to
userspace via /sysfs so userspace can bind TDX guests to "TDX-capable"
nodes via NUMA ABIs.

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal, but depends on the BIOS to behave correctly.

5. Kexec()

There are two problems with using kexec() to boot a new kernel when the
old kernel has enabled TDX: 1) part of the memory pages are still TDX
private pages (i.e. metadata used by the TDX module, and any TDX guest
memory if kexec() happens while a TDX guest is alive); 2) there might be
dirty cachelines associated with TDX private pages.

Just like SME, TDX hosts require special cache flushing before kexec().
Similar to the SME handling, the kernel uses wbinvd() to flush the cache
in stop_this_cpu() when TDX is enabled.

This series doesn't convert all TDX private pages back to normal, due to
the following considerations:

1) The kernel doesn't have existing infrastructure to track which pages
   are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
   them (cache flush + using MOVDIR64B to clear the page) in kexec() can
   be time consuming.
3) The new kernel will almost exclusively use KeyID 0 to access memory.
   KeyID 0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd-party
   kernel ever supports MKTME, it should use MOVDIR64B to clear the page
   with the new MKTME KeyID (just like TDX does) before using it.

Also, if the old kernel has enabled TDX, the new kernel cannot use TDX
again.  When the new kernel goes through the TDX module initialization
process, it will fail immediately at the first step.

Ideally, it would be better to shut down the TDX module in kexec(), but
there's no guarantee that CPUs are in VMX operation during kexec(), so
just leave the module open.

== Reference ==

[1]: TDX specs
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

[2]: KVM TDX basic feature support
https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/

[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/lkml/20220915142913.2213336-1-chao.p.peng@linux.intel.com/T/

[4]: per-node memory encryption flag
https://lore.kernel.org/linux-mm/20221007155323.ue4cdthkilfy4lbd@box.shutemov.name/t/


Kai Huang (20):
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Detect TDX during kernel boot
  x86/virt/tdx: Disable TDX if X2APIC is not enabled
  x86/virt/tdx: Add skeleton to initialize TDX on demand
  x86/virt/tdx: Implement functions to make SEAMCALL
  x86/virt/tdx: Shut down TDX module in case of error
  x86/virt/tdx: Do TDX module global initialization
  x86/virt/tdx: Do logical-cpu scope TDX module initialization
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Create TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Set up reserved areas for all TDMRs
  x86/virt/tdx: Reserve TDX module global KeyID
  x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  Documentation/x86: Add documentation for TDX host support

 Documentation/x86/tdx.rst        |  181 +++-
 arch/x86/Kconfig                 |   15 +
 arch/x86/Makefile                |    2 +
 arch/x86/coco/tdx/tdx.c          |    6 +-
 arch/x86/include/asm/tdx.h       |   30 +
 arch/x86/kernel/process.c        |    8 +-
 arch/x86/mm/init_64.c            |   10 +
 arch/x86/virt/Makefile           |    2 +
 arch/x86/virt/vmx/Makefile       |    2 +
 arch/x86/virt/vmx/tdx/Makefile   |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S |   52 ++
 arch/x86/virt/vmx/tdx/tdx.c      | 1422 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      |  118 +++
 arch/x86/virt/vmx/tdx/tdxcall.S  |   19 +-
 14 files changed, 1852 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h


base-commit: 00e07cfbdf0b232f7553f0175f8f4e8d792f7e90
-- 
2.38.1
...
space from the MKTME architecture for crypto-protection to VMs.  The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

TDX doesn't trust the BIOS.  During machine boot, TDX verifies that the
TDX private KeyIDs are consistently and correctly programmed by the BIOS
across all CPU packages before it enables TDX on any CPU core.  A valid
TDX private KeyID range on the BSP indicates TDX has been enabled by the
BIOS; otherwise the BIOS is buggy.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests.  The TDX module will be initialized at
runtime by the user (i.e. KVM) on demand.

Add a new early_initcall(tdx_init) to do TDX early boot initialization.
Only detect TDX private KeyIDs for now; some other early checks will
follow up.  Also add a new function to report whether TDX has been
enabled by the BIOS (i.e. the TDX private KeyID range is valid).
Kexec() will also need it to determine whether it needs to flush dirty
cachelines that are associated with any TDX private KeyIDs before
booting to the new kernel.

To start supporting TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a new Kconfig option, CONFIG_INTEL_TDX_HOST,
to opt in to TDX host kernel support (to distinguish it from TDX guest
kernel support).  So far KVM is the only user of TDX, so make the new
config option depend on KVM_INTEL.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
 - No change.

v5 -> v6:
 - Removed SEAMRR detection to make code simpler.
 - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
 - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).

---
 arch/x86/Kconfig               | 12 +++++
 arch/x86/Makefile              |  2 +
 arch/x86/include/asm/tdx.h     |  7 +++
 arch/x86/virt/Makefile         |  2 +
 arch/x86/virt/vmx/Makefile     |  2 +
 arch/x86/virt/vmx/tdx/Makefile |  2 +
 arch/x86/virt/vmx/tdx/tdx.c    | 95 ++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h    | 15 ++++++
 8 files changed, 137 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -XXX,XX +XXX,XX @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	depends on KVM_INTEL
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in host kernel to run protected VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -XXX,XX +XXX,XX @@ archheaders:
 
 libs-y += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI) += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else	/* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -XXX,XX +XXX,XX @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -XXX,XX +XXX,XX @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -XXX,XX +XXX,XX @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trusted Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/printk.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+#include "tdx.h"
+
+static u32 tdx_keyid_start __ro_after_init;
+static u32 tdx_keyid_num __ro_after_init;
+
+/*
+ * Detect TDX private KeyIDs to see whether TDX has been enabled by the
+ * BIOS.  Both initializing the TDX module and running TDX guest require
+ * TDX private KeyID.
+ *
+ * TDX doesn't trust BIOS.  TDX verifies all configurations from BIOS
+ * are correct before enabling TDX on any core.  TDX requires the BIOS
+ * to correctly and consistently program TDX private KeyIDs on all CPU
178 | + * packages. Unless there is a BIOS bug, detecting a valid TDX private | ||
179 | + * KeyID range on BSP indicates TDX has been enabled by the BIOS. If | ||
180 | + * there's such BIOS bug, it will be caught later when initializing the | ||
181 | + * TDX module. | ||
182 | + */ | ||
183 | +static int __init detect_tdx(void) | ||
184 | +{ | 123 | +{ |
124 | + u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids; | ||
185 | + int ret; | 125 | + int ret; |
186 | + | 126 | + |
187 | + /* | 127 | + /* |
188 | + * IA32_MKTME_KEYID_PARTITIONING: | 128 | + * IA32_MKTME_KEYID_PARTITIONING: |
189 | + * Bit [31:0]: Number of MKTME KeyIDs. | 129 | + * Bit [31:0]: Number of MKTME KeyIDs. |
190 | + * Bit [63:32]: Number of TDX private KeyIDs. | 130 | + * Bit [63:32]: Number of TDX private KeyIDs. |
191 | + */ | 131 | + */ |
192 | + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &tdx_keyid_start, | 132 | + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids, |
193 | + &tdx_keyid_num); | 133 | + &_nr_tdx_keyids); |
194 | + if (ret) | 134 | + if (ret) |
195 | + return -ENODEV; | 135 | + return -ENODEV; |
196 | + | 136 | + |
197 | + if (!tdx_keyid_num) | 137 | + if (!_nr_tdx_keyids) |
198 | + return -ENODEV; | 138 | + return -ENODEV; |
199 | + | 139 | + |
200 | + /* | 140 | + /* TDX KeyIDs start after the last MKTME KeyID. */ |
201 | + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private | 141 | + _tdx_keyid_start = _nr_mktme_keyids + 1; |
202 | + * KeyIDs start after the last MKTME KeyID. | ||
203 | + */ | ||
204 | + tdx_keyid_start++; | ||
205 | + | 142 | + |
206 | + pr_info("TDX enabled by BIOS. TDX private KeyID range: [%u, %u)\n", | 143 | + *tdx_keyid_start = _tdx_keyid_start; |
207 | + tdx_keyid_start, tdx_keyid_start + tdx_keyid_num); | 144 | + *nr_tdx_keyids = _nr_tdx_keyids; |
208 | + | 145 | + |
209 | + return 0; | 146 | + return 0; |
210 | +} | 147 | +} |
211 | + | 148 | + |
212 | +static void __init clear_tdx(void) | ||
213 | +{ | ||
214 | + tdx_keyid_start = tdx_keyid_num = 0; | ||
215 | +} | ||
216 | + | ||
217 | +static int __init tdx_init(void) | 149 | +static int __init tdx_init(void) |
218 | +{ | 150 | +{ |
219 | + if (detect_tdx()) | 151 | + u32 tdx_keyid_start, nr_tdx_keyids; |
220 | + return -ENODEV; | 152 | + int err; |
153 | + | ||
154 | + err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids); | ||
155 | + if (err) | ||
156 | + return err; | ||
157 | + | ||
158 | + pr_info("BIOS enabled: private KeyID range [%u, %u)\n", | ||
159 | + tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids); | ||
221 | + | 160 | + |
222 | + /* | 161 | + /* |
223 | + * Initializing the TDX module requires one TDX private KeyID. | 162 | + * The TDX module itself requires one 'global KeyID' to protect |
224 | + * If there's only one TDX KeyID then after module initialization | 163 | + * its metadata. If there's only one TDX KeyID, there won't be |
225 | + * KVM won't be able to run any TDX guest, which makes the whole | 164 | + * any left for TDX guests thus there's no point to enable TDX |
226 | + * thing worthless. Just disable TDX in this case. | 165 | + * at all. |
227 | + */ | 166 | + */ |
228 | + if (tdx_keyid_num < 2) { | 167 | + if (nr_tdx_keyids < 2) { |
229 | + pr_info("Disable TDX as there's only one TDX private KeyID available.\n"); | 168 | + pr_err("initialization failed: too few private KeyIDs available.\n"); |
230 | + goto no_tdx; | 169 | + return -ENODEV; |
231 | + } | 170 | + } |
232 | + | 171 | + |
172 | + /* | ||
173 | + * Just use the first TDX KeyID as the 'global KeyID' and | ||
174 | + * leave the rest for TDX guests. | ||
175 | + */ | ||
176 | + tdx_global_keyid = tdx_keyid_start; | ||
177 | + tdx_guest_keyid_start = tdx_keyid_start + 1; | ||
178 | + tdx_nr_guest_keyids = nr_tdx_keyids - 1; | ||
179 | + | ||
233 | + return 0; | 180 | + return 0; |
234 | +no_tdx: | ||
235 | + clear_tdx(); | ||
236 | + return -ENODEV; | ||
237 | +} | 181 | +} |
238 | +early_initcall(tdx_init); | 182 | +early_initcall(tdx_init); |
239 | + | 183 | + |
240 | +/* Return whether the BIOS has enabled TDX */ | 184 | +/* Return whether the BIOS has enabled TDX */ |
241 | +bool platform_tdx_enabled(void) | 185 | +bool platform_tdx_enabled(void) |
242 | +{ | 186 | +{ |
243 | + return !!tdx_keyid_num; | 187 | + return !!tdx_global_keyid; |
244 | +} | 188 | +} |
245 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
246 | new file mode 100644 | ||
247 | index XXXXXXX..XXXXXXX | ||
248 | --- /dev/null | ||
249 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
250 | @@ -XXX,XX +XXX,XX @@ | ||
251 | +/* SPDX-License-Identifier: GPL-2.0 */ | ||
252 | +#ifndef _X86_VIRT_TDX_H | ||
253 | +#define _X86_VIRT_TDX_H | ||
254 | + | ||
255 | +/* | ||
256 | + * This file contains both macros and data structures defined by the TDX | ||
257 | + * architecture and Linux defined software data structures and functions. | ||
258 | + * The two should not be mixed together for better readability. The | ||
259 | + * architectural definitions come first. | ||
260 | + */ | ||
261 | + | ||
262 | +/* MSR to report KeyID partitioning between MKTME and TDX */ | ||
263 | +#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 | ||
264 | + | ||
265 | +#endif | ||
266 | -- | 189 | -- |
267 | 2.38.1 | 190 | 2.41.0 |
... | ... | ||
---|---|---|---|
4 | page. However currently try_accept_one() uses hard-coded magic values. | 4 | page. However currently try_accept_one() uses hard-coded magic values. |
5 | 5 | ||
6 | Define TDX supported page sizes as macros and get rid of the hard-coded | 6 | Define TDX supported page sizes as macros and get rid of the hard-coded |
7 | values in try_accept_one(). TDX host support will need to use them too. | 7 | values in try_accept_one(). TDX host support will need to use them too. |
8 | 8 | ||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
9 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | 10 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
10 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 11 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> |
12 | Reviewed-by: David Hildenbrand <david@redhat.com> | ||
11 | --- | 13 | --- |
14 | arch/x86/coco/tdx/tdx-shared.c | 6 +++--- | ||
15 | arch/x86/include/asm/shared/tdx.h | 5 +++++ | ||
16 | 2 files changed, 8 insertions(+), 3 deletions(-) | ||
12 | 17 | ||
13 | v6 -> v7: | 18 | diff --git a/arch/x86/coco/tdx/tdx-shared.c b/arch/x86/coco/tdx/tdx-shared.c |
14 | |||
15 | - Removed the helper to convert kernel page level to TDX page level. | ||
16 | - Changed to use macro to define TDX supported page sizes. | ||
17 | |||
18 | --- | ||
19 | arch/x86/coco/tdx/tdx.c | 6 +++--- | ||
20 | arch/x86/include/asm/tdx.h | 9 +++++++++ | ||
21 | 2 files changed, 12 insertions(+), 3 deletions(-) | ||
22 | |||
23 | diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c | ||
24 | index XXXXXXX..XXXXXXX 100644 | 19 | index XXXXXXX..XXXXXXX 100644 |
25 | --- a/arch/x86/coco/tdx/tdx.c | 20 | --- a/arch/x86/coco/tdx/tdx-shared.c |
26 | +++ b/arch/x86/coco/tdx/tdx.c | 21 | +++ b/arch/x86/coco/tdx/tdx-shared.c |
27 | @@ -XXX,XX +XXX,XX @@ static bool try_accept_one(phys_addr_t *start, unsigned long len, | 22 | @@ -XXX,XX +XXX,XX @@ static unsigned long try_accept_one(phys_addr_t start, unsigned long len, |
28 | */ | 23 | */ |
29 | switch (pg_level) { | 24 | switch (pg_level) { |
30 | case PG_LEVEL_4K: | 25 | case PG_LEVEL_4K: |
31 | - page_size = 0; | 26 | - page_size = 0; |
32 | + page_size = TDX_PS_4K; | 27 | + page_size = TDX_PS_4K; |
... | ... | ||
38 | case PG_LEVEL_1G: | 33 | case PG_LEVEL_1G: |
39 | - page_size = 2; | 34 | - page_size = 2; |
40 | + page_size = TDX_PS_1G; | 35 | + page_size = TDX_PS_1G; |
41 | break; | 36 | break; |
42 | default: | 37 | default: |
43 | return false; | 38 | return 0; |
44 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 39 | diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h |
45 | index XXXXXXX..XXXXXXX 100644 | 40 | index XXXXXXX..XXXXXXX 100644 |
46 | --- a/arch/x86/include/asm/tdx.h | 41 | --- a/arch/x86/include/asm/shared/tdx.h |
47 | +++ b/arch/x86/include/asm/tdx.h | 42 | +++ b/arch/x86/include/asm/shared/tdx.h |
48 | @@ -XXX,XX +XXX,XX @@ | 43 | @@ -XXX,XX +XXX,XX @@ |
49 | 44 | (TDX_RDX | TDX_RBX | TDX_RSI | TDX_RDI | TDX_R8 | TDX_R9 | \ | |
50 | #ifndef __ASSEMBLY__ | 45 | TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15) |
51 | 46 | ||
52 | +/* | 47 | +/* TDX supported page sizes from the TDX module ABI. */ |
53 | + * TDX supported page sizes (4K/2M/1G). | ||
54 | + * | ||
55 | + * Those values are part of the TDX module ABI. Do not change them. | ||
56 | + */ | ||
57 | +#define TDX_PS_4K 0 | 48 | +#define TDX_PS_4K 0 |
58 | +#define TDX_PS_2M 1 | 49 | +#define TDX_PS_2M 1 |
59 | +#define TDX_PS_1G 2 | 50 | +#define TDX_PS_1G 2 |
60 | + | 51 | + |
61 | /* | 52 | #ifndef __ASSEMBLY__ |
62 | * Used to gather the output registers values of the TDCALL and SEAMCALL | 53 | |
63 | * instructions when requesting services from the TDX module. | 54 | #include <linux/compiler_attributes.h> |
64 | -- | 55 | -- |
65 | 2.38.1 | 56 | 2.41.0 |
1 | The MMIO/xAPIC interface has some problems, most notably the APIC LEAK | 1 | TDX capable platforms are locked to X2APIC mode and cannot fall back to |
---|---|---|---|
2 | [1]. This bug allows an attacker to use the APIC MMIO interface to | 2 | the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support |
3 | extract data from the SGX enclave. | 3 | requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC. |
4 | 4 | ||
5 | TDX is not immune from this either. Early check X2APIC and disable TDX | ||
6 | if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC. | ||
7 | |||
8 | [1]: https://aepicleak.com/aepicleak.pdf | ||
9 | |||
10 | Link: https://lore.kernel.org/lkml/d6ffb489-7024-ff74-bd2f-d1e06573bb82@intel.com/ | ||
11 | Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ | 5 | Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ |
12 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 6 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
7 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> | ||
8 | Reviewed-by: David Hildenbrand <david@redhat.com> | ||
9 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
13 | --- | 10 | --- |
14 | 11 | arch/x86/Kconfig | 1 + | |
15 | v6 -> v7: | 12 | 1 file changed, 1 insertion(+) |
16 | - Changed to use "Link" for the two lore links to get rid of checkpatch | ||
17 | warning. | ||
18 | |||
19 | --- | ||
20 | arch/x86/Kconfig | 1 + | ||
21 | arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++ | ||
22 | 2 files changed, 12 insertions(+) | ||
23 | 13 | ||
24 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig | 14 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig |
25 | index XXXXXXX..XXXXXXX 100644 | 15 | index XXXXXXX..XXXXXXX 100644 |
26 | --- a/arch/x86/Kconfig | 16 | --- a/arch/x86/Kconfig |
27 | +++ b/arch/x86/Kconfig | 17 | +++ b/arch/x86/Kconfig |
... | ... | ||
31 | depends on KVM_INTEL | 21 | depends on KVM_INTEL |
32 | + depends on X86_X2APIC | 22 | + depends on X86_X2APIC |
33 | help | 23 | help |
34 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious | 24 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious |
35 | host and certain physical attacks. This option enables necessary TDX | 25 | host and certain physical attacks. This option enables necessary TDX |
36 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | ||
37 | index XXXXXXX..XXXXXXX 100644 | ||
38 | --- a/arch/x86/virt/vmx/tdx/tdx.c | ||
39 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | ||
40 | @@ -XXX,XX +XXX,XX @@ | ||
41 | #include <linux/printk.h> | ||
42 | #include <asm/msr-index.h> | ||
43 | #include <asm/msr.h> | ||
44 | +#include <asm/apic.h> | ||
45 | #include <asm/tdx.h> | ||
46 | #include "tdx.h" | ||
47 | |||
48 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | ||
49 | goto no_tdx; | ||
50 | } | ||
51 | |||
52 | + /* | ||
53 | + * TDX requires X2APIC being enabled to prevent potential data | ||
54 | + * leak via APIC MMIO registers. Just disable TDX if not using | ||
55 | + * X2APIC. | ||
56 | + */ | ||
57 | + if (!x2apic_enabled()) { | ||
58 | + pr_info("Disable TDX as X2APIC is not enabled.\n"); | ||
59 | + goto no_tdx; | ||
60 | + } | ||
61 | + | ||
62 | return 0; | ||
63 | no_tdx: | ||
64 | clear_tdx(); | ||
65 | -- | 26 | -- |
66 | 2.38.1 | 27 | 2.41.0 |
New patch | |||
---|---|---|---|
1 | TDX memory has integrity and confidentiality protections. Violations of | ||
2 | this integrity protection are supposed to only affect TDX operations and | ||
3 | are never supposed to affect the host kernel itself. In other words, | ||
4 | the host kernel should never, itself, see machine checks induced by the | ||
5 | TDX integrity hardware. | ||
1 | 6 | ||
7 | Alas, the first few generations of TDX hardware have an erratum. A | ||
8 | partial write to a TDX private memory cacheline will silently "poison" | ||
9 | the line. Subsequent reads will consume the poison and generate a | ||
10 | machine check. According to the TDX hardware spec, neither of these | ||
11 | things should have happened. | ||
12 | |||
13 | Virtually all kernel memory access operations happen in full |
14 | cachelines. In practice, writing a "byte" of memory usually reads a 64 | ||
15 | byte cacheline of memory, modifies it, then writes the whole line back. | ||
16 | Those operations do not trigger this problem. | ||
17 | |||
18 | This problem is triggered by "partial" writes where a write transaction | ||
19 | of less than a cacheline lands at the memory controller. The CPU does |
20 | these via non-temporal write instructions (like MOVNTI), or through | ||
21 | UC/WC memory mappings. The issue can also be triggered away from the | ||
22 | CPU by devices doing partial writes via DMA. | ||
23 | |||
24 | With this erratum, there are additional things that need to be done. To |
25 | prepare for those changes, add a CPU bug bit to indicate this erratum. | ||
26 | Note this bug reflects the hardware, thus it is detected regardless of |
27 | whether the kernel is built with TDX support or not. | ||
28 | |||
29 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
30 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
31 | Reviewed-by: David Hildenbrand <david@redhat.com> | ||
32 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> | ||
33 | --- | ||
34 | |||
35 | v13 -> v14: | ||
36 | - Use "To prepare for ___, add ___" in changelog (Dave) | ||
37 | - Added Dave's tag. | ||
38 | |||
39 | v12 -> v13: | ||
40 | - Added David's tag. | ||
41 | |||
42 | v11 -> v12: | ||
43 | - Added Kirill's tag | ||
44 | - Changed to detect the erratum in early_init_intel() (Kirill) | ||
45 | |||
46 | v10 -> v11: | ||
47 | - New patch | ||
48 | |||
49 | --- | ||
50 | arch/x86/include/asm/cpufeatures.h | 1 + | ||
51 | arch/x86/kernel/cpu/intel.c | 17 +++++++++++++++++ | ||
52 | 2 files changed, 18 insertions(+) | ||
53 | |||
54 | diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h | ||
55 | index XXXXXXX..XXXXXXX 100644 | ||
56 | --- a/arch/x86/include/asm/cpufeatures.h | ||
57 | +++ b/arch/x86/include/asm/cpufeatures.h | ||
58 | @@ -XXX,XX +XXX,XX @@ | ||
59 | #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ | ||
60 | #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */ | ||
61 | #define X86_BUG_GDS X86_BUG(30) /* CPU is affected by Gather Data Sampling */ | ||
62 | +#define X86_BUG_TDX_PW_MCE X86_BUG(31) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */ | ||
63 | |||
64 | /* BUG word 2 */ | ||
65 | #define X86_BUG_SRSO X86_BUG(1*32 + 0) /* AMD SRSO bug */ | ||
66 | diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c | ||
67 | index XXXXXXX..XXXXXXX 100644 | ||
68 | --- a/arch/x86/kernel/cpu/intel.c | ||
69 | +++ b/arch/x86/kernel/cpu/intel.c | ||
70 | @@ -XXX,XX +XXX,XX @@ static bool bad_spectre_microcode(struct cpuinfo_x86 *c) | ||
71 | return false; | ||
72 | } | ||
73 | |||
74 | +static void check_tdx_erratum(struct cpuinfo_x86 *c) | ||
75 | +{ | ||
76 | + /* | ||
77 | + * These CPUs have an erratum. A partial write from non-TD | ||
78 | + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX | ||
79 | + * private memory poisons that memory, and a subsequent read of | ||
80 | + * that memory triggers #MC. | ||
81 | + */ | ||
82 | + switch (c->x86_model) { | ||
83 | + case INTEL_FAM6_SAPPHIRERAPIDS_X: | ||
84 | + case INTEL_FAM6_EMERALDRAPIDS_X: | ||
85 | + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE); | ||
86 | + } | ||
87 | +} | ||
88 | + | ||
89 | static void early_init_intel(struct cpuinfo_x86 *c) | ||
90 | { | ||
91 | u64 misc_enable; | ||
92 | @@ -XXX,XX +XXX,XX @@ static void early_init_intel(struct cpuinfo_x86 *c) | ||
93 | */ | ||
94 | if (detect_extended_topology_early(c) < 0) | ||
95 | detect_ht_early(c); | ||
96 | + | ||
97 | + check_tdx_erratum(c); | ||
98 | } | ||
99 | |||
100 | static void bsp_init_intel(struct cpuinfo_x86 *c) | ||
101 | -- | ||
102 | 2.41.0 |
New patch | |||
---|---|---|---|
1 | Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons | ||
2 | as RDRAND. Use the kernel RDRAND retry logic for them. | ||
1 | 3 | ||
4 | There are three __seamcall*() variants. Do the SEAMCALL retry in common | ||
5 | code and add a wrapper for each of them. | ||
6 | |||
7 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
8 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
9 | --- | ||
10 | |||
11 | v13 -> v14: | ||
12 | - Use real function sc_retry() instead of using macros. (Dave) | ||
13 | - Added Kirill's tag. | ||
14 | |||
15 | v12 -> v13: | ||
16 | - New implementation due to TDCALL assembly series. | ||
17 | --- | ||
18 | arch/x86/include/asm/tdx.h | 26 ++++++++++++++++++++++++++ | ||
19 | 1 file changed, 26 insertions(+) | ||
20 | |||
21 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | ||
22 | index XXXXXXX..XXXXXXX 100644 | ||
23 | --- a/arch/x86/include/asm/tdx.h | ||
24 | +++ b/arch/x86/include/asm/tdx.h | ||
25 | @@ -XXX,XX +XXX,XX @@ | ||
26 | #define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP) | ||
27 | #define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD) | ||
28 | |||
29 | +/* | ||
30 | + * TDX module SEAMCALL leaf function error codes | ||
31 | + */ | ||
32 | +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL | ||
33 | + | ||
34 | #ifndef __ASSEMBLY__ | ||
35 | |||
36 | /* | ||
37 | @@ -XXX,XX +XXX,XX @@ u64 __seamcall(u64 fn, struct tdx_module_args *args); | ||
38 | u64 __seamcall_ret(u64 fn, struct tdx_module_args *args); | ||
39 | u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args); | ||
40 | |||
41 | +#include <asm/archrandom.h> | ||
42 | + | ||
43 | +typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args); | ||
44 | + | ||
45 | +static inline u64 sc_retry(sc_func_t func, u64 fn, | ||
46 | + struct tdx_module_args *args) | ||
47 | +{ | ||
48 | + int retry = RDRAND_RETRY_LOOPS; | ||
49 | + u64 ret; | ||
50 | + | ||
51 | + do { | ||
52 | + ret = func(fn, args); | ||
53 | + } while (ret == TDX_RND_NO_ENTROPY && --retry); | ||
54 | + | ||
55 | + return ret; | ||
56 | +} | ||
57 | + | ||
58 | +#define seamcall(_fn, _args) sc_retry(__seamcall, (_fn), (_args)) | ||
59 | +#define seamcall_ret(_fn, _args) sc_retry(__seamcall_ret, (_fn), (_args)) | ||
60 | +#define seamcall_saved_ret(_fn, _args) sc_retry(__seamcall_saved_ret, (_fn), (_args)) | ||
61 | + | ||
62 | bool platform_tdx_enabled(void); | ||
63 | #else | ||
64 | static inline bool platform_tdx_enabled(void) { return false; } | ||
65 | -- | ||
66 | 2.41.0 |
New patch | |||
---|---|---|---|
1 | The SEAMCALLs involved during the TDX module initialization are not | ||
2 | expected to fail. In fact, they are not expected to return any non-zero | ||
3 | code (except the "running out of entropy" error, which can be handled |
4 | internally already). | ||
1 | 5 | ||
6 | Add yet another set of SEAMCALL wrappers, which treats any non-zero |
7 | return code as an error, to support printing SEAMCALL errors upon failure |
8 | for module initialization. Note the TDX module initialization doesn't | ||
9 | use the _saved_ret() variant, thus no wrapper is added for it. |
10 | |||
11 | SEAMCALL assembly can also return kernel-defined error codes for three | ||
12 | special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't | ||
13 | loaded; 3) CPU isn't in VMX operation. Whether they can legally happen | ||
14 | depends on the caller, so leave to the caller to print error message | ||
15 | when desired. | ||
16 | |||
17 | Also convert the SEAMCALL error codes to the kernel error codes in the | ||
18 | new wrappers so that each SEAMCALL caller doesn't have to repeat the | ||
19 | conversion. | ||
20 | |||
21 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
22 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
23 | --- | ||
24 | |||
25 | v13 -> v14: | ||
26 | - Use real functions to replace macros. (Dave) | ||
27 | - Moved printing error message for special error code to the caller | ||
28 | (internal) | ||
29 | - Added Kirill's tag | ||
30 | |||
31 | v12 -> v13: | ||
32 | - New implementation due to TDCALL assembly series. | ||
33 | |||
34 | --- | ||
35 | arch/x86/include/asm/tdx.h | 1 + | ||
36 | arch/x86/virt/vmx/tdx/tdx.c | 52 +++++++++++++++++++++++++++++++++++++ | ||
37 | 2 files changed, 53 insertions(+) | ||
38 | |||
39 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | ||
40 | index XXXXXXX..XXXXXXX 100644 | ||
41 | --- a/arch/x86/include/asm/tdx.h | ||
42 | +++ b/arch/x86/include/asm/tdx.h | ||
43 | @@ -XXX,XX +XXX,XX @@ | ||
44 | /* | ||
45 | * TDX module SEAMCALL leaf function error codes | ||
46 | */ | ||
47 | +#define TDX_SUCCESS 0ULL | ||
48 | #define TDX_RND_NO_ENTROPY 0x8000020300000000ULL | ||
49 | |||
50 | #ifndef __ASSEMBLY__ | ||
51 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | ||
52 | index XXXXXXX..XXXXXXX 100644 | ||
53 | --- a/arch/x86/virt/vmx/tdx/tdx.c | ||
54 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | ||
55 | @@ -XXX,XX +XXX,XX @@ static u32 tdx_global_keyid __ro_after_init; | ||
56 | static u32 tdx_guest_keyid_start __ro_after_init; | ||
57 | static u32 tdx_nr_guest_keyids __ro_after_init; | ||
58 | |||
59 | +typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); | ||
60 | + | ||
61 | +static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) | ||
62 | +{ | ||
63 | + pr_err("SEAMCALL (0x%llx) failed: 0x%llx\n", fn, err); | ||
64 | +} | ||
65 | + | ||
66 | +static inline void seamcall_err_ret(u64 fn, u64 err, | ||
67 | + struct tdx_module_args *args) | ||
68 | +{ | ||
69 | + seamcall_err(fn, err, args); | ||
70 | + pr_err("RCX 0x%llx RDX 0x%llx R8 0x%llx R9 0x%llx R10 0x%llx R11 0x%llx\n", | ||
71 | + args->rcx, args->rdx, args->r8, args->r9, | ||
72 | + args->r10, args->r11); | ||
73 | +} | ||
74 | + | ||
75 | +static inline void seamcall_err_saved_ret(u64 fn, u64 err, | ||
76 | + struct tdx_module_args *args) | ||
77 | +{ | ||
78 | + seamcall_err_ret(fn, err, args); | ||
79 | + pr_err("RBX 0x%llx RDI 0x%llx RSI 0x%llx R12 0x%llx R13 0x%llx R14 0x%llx R15 0x%llx\n", | ||
80 | + args->rbx, args->rdi, args->rsi, args->r12, | ||
81 | + args->r13, args->r14, args->r15); | ||
82 | +} | ||
83 | + | ||
84 | +static inline int sc_retry_prerr(sc_func_t func, sc_err_func_t err_func, | ||
85 | + u64 fn, struct tdx_module_args *args) | ||
86 | +{ | ||
87 | + u64 sret = sc_retry(func, fn, args); | ||
88 | + | ||
89 | + if (sret == TDX_SUCCESS) | ||
90 | + return 0; | ||
91 | + | ||
92 | + if (sret == TDX_SEAMCALL_VMFAILINVALID) | ||
93 | + return -ENODEV; | ||
94 | + | ||
95 | + if (sret == TDX_SEAMCALL_GP) | ||
96 | + return -EOPNOTSUPP; | ||
97 | + | ||
98 | + if (sret == TDX_SEAMCALL_UD) | ||
99 | + return -EACCES; | ||
100 | + | ||
101 | + err_func(fn, sret, args); | ||
102 | + return -EIO; | ||
103 | +} | ||
104 | + | ||
105 | +#define seamcall_prerr(__fn, __args) \ | ||
106 | + sc_retry_prerr(__seamcall, seamcall_err, (__fn), (__args)) | ||
107 | + | ||
108 | +#define seamcall_prerr_ret(__fn, __args) \ | ||
109 | + sc_retry_prerr(__seamcall_ret, seamcall_err_ret, (__fn), (__args)) | ||
110 | + | ||
111 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, | ||
112 | u32 *nr_tdx_keyids) | ||
113 | { | ||
114 | -- | ||
115 | 2.41.0 |
1 | Before the TDX module can be used to create and run TDX guests, it must | 1 | To enable TDX the kernel needs to initialize TDX from two perspectives: |
---|---|---|---|
2 | be loaded and properly initialized. The TDX module is expected to be | 2 | 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready |
3 | loaded by the BIOS, and to be initialized by the kernel. | 3 | to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL |
4 | 4 | on one logical cpu before the kernel wants to make any other SEAMCALLs | |
5 | TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). The host | 5 | on that cpu (including those involved during module initialization and |
6 | kernel communicates with the TDX module via a new SEAMCALL instruction. | 6 | running TDX guests). |
7 | The TDX module implements a set of SEAMCALL leaf functions to allow the | ||
8 | host kernel to initialize it. | ||
9 | 7 | ||
10 | The TDX module can be initialized only once in its lifetime. Instead | 8 | The TDX module can be initialized only once in its lifetime. Instead |
11 | of always initializing it at boot time, this implementation chooses an | 9 | of always initializing it at boot time, this implementation chooses an |
12 | "on demand" approach to initialize TDX until there is a real need (e.g | 10 | "on demand" approach to initialize TDX until there is a real need (e.g |
13 | when requested by KVM). This approach has below pros: | 11 | when requested by KVM). This approach has below pros: |
14 | 12 | ||
15 | 1) It avoids consuming the memory that must be allocated by kernel and | 13 | 1) It avoids consuming the memory that must be allocated by kernel and |
16 | given to the TDX module as metadata (~1/256th of the TDX-usable memory), | 14 | given to the TDX module as metadata (~1/256th of the TDX-usable memory), |
17 | and also saves the CPU cycles of initializing the TDX module (and the | 15 | and also saves the CPU cycles of initializing the TDX module (and the |
18 | metadata) when TDX is not used at all. | 16 | metadata) when TDX is not used at all. |
19 | 17 | ||
20 | 2) It is more flexible to support TDX module runtime updating in the | 18 | 2) The TDX module design allows it to be updated while the system is |
21 | future (after updating the TDX module, it needs to be initialized | 19 | running. The update procedure shares quite a few steps with this "on |
22 | again). | 20 | demand" initialization mechanism. The hope is that much of "on demand" |
23 | 21 | mechanism can be shared with a future "update" mechanism. A boot-time | |
24 | 3) It avoids having to do a "temporary" solution to handle VMXON in the | 22 | TDX module implementation would not be able to share much code with the |
25 | core (non-KVM) kernel for now. This is because SEAMCALL requires CPU | 23 | update mechanism. |
26 | being in VMX operation (VMXON is done), but currently only KVM handles | 24 | |
27 | VMXON. Adding VMXON support to the core kernel isn't trivial. More | 25 | 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM |
28 | importantly, from long-term a reference-based approach is likely needed | 26 | code mucks with VMX enabling. If the TDX module were to be initialized |
29 | in the core kernel as more kernel components are likely needed to | 27 | separately from KVM (like at boot), the boot code would need to be |
30 | support TDX as well. Allow KVM to initialize the TDX module avoids | 28 | taught how to muck with VMX enabling and KVM would need to be taught how |
31 | having to handle VMXON during kernel boot for now. | 29 | to cope with that. Making KVM itself responsible for TDX initialization |
32 | 30 | lets the rest of the kernel stay blissfully unaware of VMX. | |
33 | Add a placeholder tdx_enable() to detect and initialize the TDX module | 31 | |
34 | on demand, with a state machine protected by mutex to support concurrent | 32 | Similar to module initialization, also make the per-cpu initialization |
35 | calls from multiple callers. | 33 | "on demand" as it also depends on VMX being enabled. |
36 | 34 | ||
37 | The TDX module will be initialized in multi-steps defined by the TDX | 35 | Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX |
38 | module: | 36 | module and enable TDX on local cpu respectively. For now tdx_enable() |
39 | 37 | is a placeholder. The TODO list will be pared down as functionality is | |
40 | 1) Global initialization; | 38 | added. |
41 | 2) Logical-CPU scope initialization; | 39 | |
42 | 3) Enumerate the TDX module capabilities and platform configuration; | 40 | Export both tdx_cpu_enable() and tdx_enable() for KVM use. |
43 | 4) Configure the TDX module about TDX usable memory ranges and global | 41 | |
44 | KeyID information; | 42 | In tdx_enable() use a state machine protected by mutex to make sure the |
45 | 5) Package-scope configuration for the global KeyID; | 43 | initialization will only be done once, as tdx_enable() can be called |
46 | 6) Initialize usable memory ranges based on 4). | 44 | multiple times (i.e. KVM module can be reloaded) and may be called |
47 | 45 | concurrently by other kernel components in the future. | |
48 | The TDX module can also be shut down at any time during its lifetime. | 46 | |
49 | In case of any error during the initialization process, shut down the | 47 | The per-cpu initialization on each cpu can only be done once during the |
50 | module. It's pointless to leave the module in any intermediate state | 48 | module's life time. Use a per-cpu variable to track its status to make |
51 | during the initialization. | 49 | sure it is only done once in tdx_cpu_enable(). |
52 | 50 | ||
53 | Both logical CPU scope initialization and shutting down the TDX module | 51 | Also, a SEAMCALL to do TDX module global initialization must be done |
54 | require calling SEAMCALL on all boot-time present CPUs. For simplicity | 52 | once on any logical cpu before any per-cpu initialization SEAMCALL. Do |
55 | just temporarily disable CPU hotplug during the module initialization. | 53 | it inside tdx_cpu_enable() too (if it hasn't been done). |
56 | 54 | ||
57 | Note TDX architecturally doesn't support physical CPU hot-add/removal. | 55 | tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The |
58 | A non-buggy BIOS should never support ACPI CPU hot-add/removal. This | 56 | per-cpu initialization must be done before those SEAMCALLs are invoked |
59 | implementation doesn't explicitly handle ACPI CPU hot-add/removal but | 57 | on some cpu. To keep things simple, in tdx_cpu_enable(), always do the |
60 | depends on the BIOS to do the right thing. | 58 | per-cpu initialization regardless of whether the TDX module has been |
61 | 59 | initialized or not. And in tdx_enable(), don't call tdx_cpu_enable() | |
62 | Reviewed-by: Chao Gao <chao.gao@intel.com> | 60 | but assume the caller has disabled CPU hotplug, done VMXON and |
61 | tdx_cpu_enable() on all online cpus before calling tdx_enable(). | ||
62 | |||
63 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 63 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
64 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
64 | --- | 65 | --- |
65 | 66 | ||
66 | v6 -> v7: | 67 | v13 -> v14: |
67 | - No change. | 68 | - Use lockdep_assert_irqs_off() in try_init_model_global() (Nikolay), |
68 | 69 | but still keep the comment (Kirill). | |
69 | v5 -> v6: | 70 | - Add code to print "module not loaded" in the first SEAMCALL. |
70 | - Added code to set status to TDX_MODULE_NONE if TDX module is not | 71 | - If SYS.INIT fails, stop calling LP.INIT in other tdx_cpu_enable()s. |
71 | loaded (Chao) | 72 | - Added Kirill's tag |
72 | - Added Chao's Reviewed-by. | 73 | |
73 | - Improved comments around cpus_read_lock(). | 74 | v12 -> v13: |
74 | 75 | - Made tdx_cpu_enable() always be called with IRQ disabled via IPI | |
75 | - v3->v5 (no feedback on v4): | 76 | function call (Peter, Kirill). |
76 | - Removed the check that SEAMRR and TDX KeyID have been detected on | 77 | |
77 | all present cpus. | 78 | v11 -> v12: |
78 | - Removed tdx_detect(). | 79 | - Simplified TDX module global init and lp init status tracking (David). |
79 | - Added num_online_cpus() to MADT-enabled CPUs check within the CPU | 80 | - Added comment around try_init_module_global() for using |
80 | hotplug lock and return early with error message. | 81 | raw_spin_lock() (Dave). |
81 | - Improved dmesg printing for TDX module detection and initialization. | 82 | - Added one sentence to changelog to explain why to expose tdx_enable() |
83 | and tdx_cpu_enable() (Dave). | ||
84 | - Simplified comments around tdx_enable() and tdx_cpu_enable() to use |
85 | lockdep_assert_*() instead. (Dave) | ||
86 | - Removed redundant "TDX" in error message (Dave). |
87 | |||
88 | v10 -> v11: | ||
89 | - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off. |
90 | - Return the actual error code for tdx_enable() instead of -EINVAL. | ||
91 | - Added Isaku's Reviewed-by. | ||
92 | |||
93 | v9 -> v10: | ||
94 | - Merged the patch to handle per-cpu initialization to this patch to | ||
95 | tell the story better. | ||
96 | - Changed how to handle the per-cpu initialization to only provide a | ||
97 | tdx_cpu_enable() function to let the user of TDX to do it when the | ||
98 | user wants to run TDX code on a certain cpu. | ||
99 | - Changed tdx_enable() to not call cpus_read_lock() explicitly, but | ||
100 | call lockdep_assert_cpus_held() to assume the caller has done that. | ||
101 | - Improved comments around tdx_enable() and tdx_cpu_enable(). | ||
102 | - Improved changelog to tell the story better accordingly. | ||
103 | |||
104 | v8 -> v9: | ||
105 | - Removed detailed TODO list in the changelog (Dave). | ||
106 | - Added back steps to do module global initialization and per-cpu | ||
107 | initialization in the TODO list comment. | ||
108 | - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h | ||
109 | |||
110 | v7 -> v8: | ||
111 | - Refined changelog (Dave). | ||
112 | - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave). | ||
113 | - Add a "TODO list" comment in init_tdx_module() to list all steps of | ||
114 | initializing the TDX Module to tell the story (Dave). | ||
115 | - Made tdx_enable() universally return -EINVAL, and removed nonsense |
116 | comments (Dave). | ||
117 | - Simplified __tdx_enable() to only handle success or failure. | ||
118 | - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR | ||
119 | - Removed TDX_MODULE_NONE (not loaded) as it is not necessary. | ||
120 | - Improved comments (Dave). | ||
121 | - Pointed out 'tdx_module_status' is software thing (Dave). | ||
122 | |||
123 | ... | ||
82 | 124 | ||
83 | --- | 125 | --- |
84 | arch/x86/include/asm/tdx.h | 2 + | 126 | arch/x86/include/asm/tdx.h | 4 + |
85 | arch/x86/virt/vmx/tdx/tdx.c | 150 ++++++++++++++++++++++++++++++++++++ | 127 | arch/x86/virt/vmx/tdx/tdx.c | 167 ++++++++++++++++++++++++++++++++++++ |
86 | 2 files changed, 152 insertions(+) | 128 | arch/x86/virt/vmx/tdx/tdx.h | 30 +++++++ |
129 | 3 files changed, 201 insertions(+) | ||
130 | create mode 100644 arch/x86/virt/vmx/tdx/tdx.h | ||
87 | 131 | ||
88 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 132 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h |
89 | index XXXXXXX..XXXXXXX 100644 | 133 | index XXXXXXX..XXXXXXX 100644 |
90 | --- a/arch/x86/include/asm/tdx.h | 134 | --- a/arch/x86/include/asm/tdx.h |
91 | +++ b/arch/x86/include/asm/tdx.h | 135 | +++ b/arch/x86/include/asm/tdx.h |
92 | @@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, | 136 | @@ -XXX,XX +XXX,XX @@ static inline u64 sc_retry(sc_func_t func, u64 fn, |
93 | 137 | #define seamcall_saved_ret(_fn, _args) sc_retry(__seamcall_saved_ret, (_fn), (_args)) | |
94 | #ifdef CONFIG_INTEL_TDX_HOST | 138 | |
95 | bool platform_tdx_enabled(void); | 139 | bool platform_tdx_enabled(void); |
140 | +int tdx_cpu_enable(void); | ||
96 | +int tdx_enable(void); | 141 | +int tdx_enable(void); |
97 | #else /* !CONFIG_INTEL_TDX_HOST */ | 142 | #else |
98 | static inline bool platform_tdx_enabled(void) { return false; } | 143 | static inline bool platform_tdx_enabled(void) { return false; } |
144 | +static inline int tdx_cpu_enable(void) { return -ENODEV; } | ||
99 | +static inline int tdx_enable(void) { return -ENODEV; } | 145 | +static inline int tdx_enable(void) { return -ENODEV; } |
100 | #endif /* CONFIG_INTEL_TDX_HOST */ | 146 | #endif /* CONFIG_INTEL_TDX_HOST */ |
101 | 147 | ||
102 | #endif /* !__ASSEMBLY__ */ | 148 | #endif /* !__ASSEMBLY__ */ |
103 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 149 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
104 | index XXXXXXX..XXXXXXX 100644 | 150 | index XXXXXXX..XXXXXXX 100644 |
105 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 151 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
106 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 152 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
107 | @@ -XXX,XX +XXX,XX @@ | 153 | @@ -XXX,XX +XXX,XX @@ |
108 | #include <linux/types.h> | ||
109 | #include <linux/init.h> | 154 | #include <linux/init.h> |
155 | #include <linux/errno.h> | ||
110 | #include <linux/printk.h> | 156 | #include <linux/printk.h> |
157 | +#include <linux/cpu.h> | ||
158 | +#include <linux/spinlock.h> | ||
159 | +#include <linux/percpu-defs.h> | ||
111 | +#include <linux/mutex.h> | 160 | +#include <linux/mutex.h> |
112 | +#include <linux/cpu.h> | ||
113 | +#include <linux/cpumask.h> | ||
114 | #include <asm/msr-index.h> | 161 | #include <asm/msr-index.h> |
115 | #include <asm/msr.h> | 162 | #include <asm/msr.h> |
116 | #include <asm/apic.h> | ||
117 | #include <asm/tdx.h> | 163 | #include <asm/tdx.h> |
118 | #include "tdx.h" | 164 | +#include "tdx.h" |
119 | 165 | ||
120 | +/* TDX module status during initialization */ | 166 | static u32 tdx_global_keyid __ro_after_init; |
121 | +enum tdx_module_status_t { | 167 | static u32 tdx_guest_keyid_start __ro_after_init; |
122 | + /* TDX module hasn't been detected and initialized */ | 168 | static u32 tdx_nr_guest_keyids __ro_after_init; |
123 | + TDX_MODULE_UNKNOWN, | 169 | |
124 | + /* TDX module is not loaded */ | 170 | +static DEFINE_PER_CPU(bool, tdx_lp_initialized); |
125 | + TDX_MODULE_NONE, | 171 | + |
126 | + /* TDX module is initialized */ | ||
127 | + TDX_MODULE_INITIALIZED, | ||
128 | + /* TDX module is shut down due to initialization error */ | ||
129 | + TDX_MODULE_SHUTDOWN, | ||
130 | +}; | ||
131 | + | ||
132 | static u32 tdx_keyid_start __ro_after_init; | ||
133 | static u32 tdx_keyid_num __ro_after_init; | ||
134 | |||
135 | +static enum tdx_module_status_t tdx_module_status; | 172 | +static enum tdx_module_status_t tdx_module_status; |
136 | +/* Prevent concurrent attempts on TDX detection and initialization */ | ||
137 | +static DEFINE_MUTEX(tdx_module_lock); | 173 | +static DEFINE_MUTEX(tdx_module_lock); |
138 | + | 174 | + |
139 | /* | 175 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); |
140 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | 176 | |
141 | * BIOS. Both initializing the TDX module and running TDX guest require | 177 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) |
142 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void) | 178 | @@ -XXX,XX +XXX,XX @@ static inline int sc_retry_prerr(sc_func_t func, sc_err_func_t err_func, |
143 | { | 179 | #define seamcall_prerr_ret(__fn, __args) \ |
144 | return !!tdx_keyid_num; | 180 | sc_retry_prerr(__seamcall_ret, seamcall_err_ret, (__fn), (__args)) |
145 | } | 181 | |
146 | + | ||
147 | +/* | 182 | +/* |
148 | + * Detect and initialize the TDX module. | 183 | + * Do the module global initialization once and return its result. |
149 | + * | 184 | + * It can be done on any cpu. It's always called with interrupts |
150 | + * Return -ENODEV when the TDX module is not loaded, 0 when it | 185 | + * disabled. |
151 | + * is successfully initialized, or another error when it fails to | 186 | + * disabled. |
152 | + * initialize. | 187 | +static int try_init_module_global(void) |
153 | + */ | 188 | +{ |
154 | +static int init_tdx_module(void) | 189 | + struct tdx_module_args args = {}; |
155 | +{ | 190 | + static DEFINE_RAW_SPINLOCK(sysinit_lock); |
156 | + /* The TDX module hasn't been detected */ | 191 | + static bool sysinit_done; |
157 | + return -ENODEV; | 192 | + static int sysinit_ret; |
158 | +} | 193 | + |
159 | + | 194 | + lockdep_assert_irqs_disabled(); |
160 | +static void shutdown_tdx_module(void) | 195 | + |
161 | +{ | 196 | + raw_spin_lock(&sysinit_lock); |
162 | + /* TODO: Shut down the TDX module */ | 197 | + |
163 | +} | 198 | + if (sysinit_done) |
164 | + | 199 | + goto out; |
165 | +static int __tdx_enable(void) | 200 | + |
166 | +{ | 201 | + /* RCX is module attributes and all bits are reserved */ |
167 | + int ret; | 202 | + args.rcx = 0; |
203 | + sysinit_ret = seamcall_prerr(TDH_SYS_INIT, &args); | ||
168 | + | 204 | + |
169 | + /* | 205 | + /* |
170 | + * Initializing the TDX module requires doing SEAMCALL on all | 206 | + * The first SEAMCALL also detects the TDX module, thus |
171 | + * boot-time present CPUs. For simplicity temporarily disable | 207 | + * it can fail due to the TDX module is not loaded. |
172 | + * CPU hotplug to prevent any CPU from going offline during | 208 | + * Dump message to let the user know. |
173 | + * the initialization. | ||
174 | + */ | 209 | + */ |
175 | + cpus_read_lock(); | 210 | + if (sysinit_ret == -ENODEV) |
176 | + | 211 | + pr_err("module not loaded\n"); |
177 | + /* | 212 | + |
178 | + * Check whether all boot-time present CPUs are online and | 213 | + sysinit_done = true; |
179 | + * return early with a message so the user can be aware. | ||
180 | + * | ||
181 | + * Note a non-buggy BIOS should never support physical (ACPI) | ||
182 | + * CPU hotplug when TDX is enabled, and all boot-time present | ||
183 | + * CPU should be enabled in MADT, so there should be no | ||
184 | + * disabled_cpus and num_processors won't change at runtime | ||
185 | + * either. | ||
186 | + */ | ||
187 | + if (disabled_cpus || num_online_cpus() != num_processors) { | ||
188 | + pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n"); | ||
189 | + ret = -EINVAL; | ||
190 | + goto out; | ||
191 | + } | ||
192 | + | ||
193 | + ret = init_tdx_module(); | ||
194 | + if (ret == -ENODEV) { | ||
195 | + pr_info("TDX module is not loaded.\n"); | ||
196 | + tdx_module_status = TDX_MODULE_NONE; | ||
197 | + goto out; | ||
198 | + } | ||
199 | + | ||
200 | + /* | ||
201 | + * Shut down the TDX module in case of any error during the | ||
202 | + * initialization process. It's meaningless to leave the TDX | ||
203 | + * module in any middle state of the initialization process. | ||
204 | + * | ||
205 | + * Shutting down the module also requires doing SEAMCALL on all | ||
206 | + * MADT-enabled CPUs. Do it while CPU hotplug is disabled. | ||
207 | + * | ||
208 | + * Return all errors during the initialization as -EFAULT as the | ||
209 | + * module is always shut down. | ||
210 | + */ | ||
211 | + if (ret) { | ||
212 | + pr_info("Failed to initialize TDX module. Shut it down.\n"); | ||
213 | + shutdown_tdx_module(); | ||
214 | + tdx_module_status = TDX_MODULE_SHUTDOWN; | ||
215 | + ret = -EFAULT; | ||
216 | + goto out; | ||
217 | + } | ||
218 | + | ||
219 | + pr_info("TDX module initialized.\n"); | ||
220 | + tdx_module_status = TDX_MODULE_INITIALIZED; | ||
221 | +out: | 214 | +out: |
222 | + cpus_read_unlock(); | 215 | + raw_spin_unlock(&sysinit_lock); |
223 | + | 216 | + return sysinit_ret; |
224 | + return ret; | ||
225 | +} | 217 | +} |
226 | + | 218 | + |
227 | +/** | 219 | +/** |
228 | + * tdx_enable - Enable TDX by initializing the TDX module | 220 | + * tdx_cpu_enable - Enable TDX on local cpu |
229 | + * | 221 | + * |
230 | + * Caller to make sure all CPUs are online and in VMX operation before | 222 | + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module |
231 | + * calling this function. CPU hotplug is temporarily disabled internally | 223 | + * global initialization SEAMCALL if not done) on local cpu to make this |
232 | + * to prevent any cpu from going offline. | 224 | + * cpu ready to run any other SEAMCALLs. |
233 | + * | 225 | + * |
234 | + * This function can be called in parallel by multiple callers. | 226 | + * Always call this function via IPI function calls. |
235 | + * | 227 | + * |
236 | + * Return: | 228 | + * Return 0 on success, otherwise errors. |
237 | + * | 229 | + */ |
238 | + * * 0: The TDX module has been successfully initialized. | 230 | +int tdx_cpu_enable(void) |
239 | + * * -ENODEV: The TDX module is not loaded, or TDX is not supported. | 231 | +{ |
240 | + * * -EINVAL: The TDX module cannot be initialized due to certain | 232 | + struct tdx_module_args args = {}; |
241 | + * conditions are not met (i.e. when not all MADT-enabled | ||
242 | + * CPUs are online). | 232 | + struct tdx_module_args args = {}; |
243 | + * * -EFAULT: Other internal fatal errors, or the TDX module is in | ||
244 | + * shutdown mode due to it failed to initialize in previous | ||
245 | + * attempts. | ||
246 | + */ | ||
247 | +int tdx_enable(void) | ||
248 | +{ | ||
249 | + int ret; | 233 | + int ret; |
250 | + | 234 | + |
251 | + if (!platform_tdx_enabled()) | 235 | + if (!platform_tdx_enabled()) |
252 | + return -ENODEV; | 236 | + return -ENODEV; |
253 | + | 237 | + |
238 | + lockdep_assert_irqs_disabled(); | ||
239 | + | ||
240 | + if (__this_cpu_read(tdx_lp_initialized)) | ||
241 | + return 0; | ||
242 | + | ||
243 | + /* | ||
244 | + * The TDX module global initialization is the very first step | ||
245 | + * to enable TDX. Need to do it first (if it hasn't been done) | ||
246 | + * before the per-cpu initialization. | ||
247 | + */ | ||
248 | + ret = try_init_module_global(); | ||
249 | + if (ret) | ||
250 | + return ret; | ||
251 | + | ||
252 | + ret = seamcall_prerr(TDH_SYS_LP_INIT, &args); | ||
253 | + if (ret) | ||
254 | + return ret; | ||
255 | + | ||
256 | + __this_cpu_write(tdx_lp_initialized, true); | ||
257 | + | ||
258 | + return 0; | ||
259 | +} | ||
260 | +EXPORT_SYMBOL_GPL(tdx_cpu_enable); | ||
261 | + | ||
262 | +static int init_tdx_module(void) | ||
263 | +{ | ||
264 | + /* | ||
265 | + * TODO: | ||
266 | + * | ||
267 | + * - Get TDX module information and TDX-capable memory regions. | ||
268 | + * - Build the list of TDX-usable memory regions. | ||
269 | + * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
270 | + * all TDX-usable memory regions. | ||
271 | + * - Configure the TDMRs and the global KeyID to the TDX module. | ||
272 | + * - Configure the global KeyID on all packages. | ||
273 | + * - Initialize all TDMRs. | ||
274 | + * | ||
275 | + * Return error before all steps are done. | ||
276 | + */ | ||
277 | + return -EINVAL; | ||
278 | +} | ||
279 | + | ||
280 | +static int __tdx_enable(void) | ||
281 | +{ | ||
282 | + int ret; | ||
283 | + | ||
284 | + ret = init_tdx_module(); | ||
285 | + if (ret) { | ||
286 | + pr_err("module initialization failed (%d)\n", ret); | ||
287 | + tdx_module_status = TDX_MODULE_ERROR; | ||
288 | + return ret; | ||
289 | + } | ||
290 | + | ||
291 | + pr_info("module initialized\n"); | ||
292 | + tdx_module_status = TDX_MODULE_INITIALIZED; | ||
293 | + | ||
294 | + return 0; | ||
295 | +} | ||
296 | + | ||
297 | +/** | ||
298 | + * tdx_enable - Enable TDX module to make it ready to run TDX guests | ||
299 | + * | ||
300 | + * This function assumes the caller has: 1) held read lock of CPU hotplug | ||
301 | + * lock to prevent any new cpu from becoming online; 2) done both VMXON | ||
302 | + * and tdx_cpu_enable() on all online cpus. | ||
303 | + * | ||
304 | + * This function can be called in parallel by multiple callers. | ||
305 | + * | ||
306 | + * Return 0 if TDX is enabled successfully, otherwise error. | ||
307 | + */ | ||
308 | +int tdx_enable(void) | ||
309 | +{ | ||
310 | + int ret; | ||
311 | + | ||
312 | + if (!platform_tdx_enabled()) | ||
313 | + return -ENODEV; | ||
314 | + | ||
315 | + lockdep_assert_cpus_held(); | ||
316 | + | ||
254 | + mutex_lock(&tdx_module_lock); | 317 | + mutex_lock(&tdx_module_lock); |
255 | + | 318 | + |
256 | + switch (tdx_module_status) { | 319 | + switch (tdx_module_status) { |
257 | + case TDX_MODULE_UNKNOWN: | 320 | + case TDX_MODULE_UNINITIALIZED: |
258 | + ret = __tdx_enable(); | 321 | + ret = __tdx_enable(); |
259 | + break; | 322 | + break; |
260 | + case TDX_MODULE_NONE: | ||
261 | + ret = -ENODEV; | ||
262 | + break; | ||
263 | + case TDX_MODULE_INITIALIZED: | 323 | + case TDX_MODULE_INITIALIZED: |
324 | + /* Already initialized, great, tell the caller. */ | ||
264 | + ret = 0; | 325 | + ret = 0; |
265 | + break; | 326 | + break; |
266 | + default: | 327 | + default: |
267 | + WARN_ON_ONCE(tdx_module_status != TDX_MODULE_SHUTDOWN); | 328 | + /* Failed to initialize in the previous attempts */ |
268 | + ret = -EFAULT; | 329 | + ret = -EINVAL; |
269 | + break; | 330 | + break; |
270 | + } | 331 | + } |
271 | + | 332 | + |
272 | + mutex_unlock(&tdx_module_lock); | 333 | + mutex_unlock(&tdx_module_lock); |
273 | + | 334 | + |
274 | + return ret; | 335 | + return ret; |
275 | +} | 336 | +} |
276 | +EXPORT_SYMBOL_GPL(tdx_enable); | 337 | +EXPORT_SYMBOL_GPL(tdx_enable); |
338 | + | ||
339 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, | ||
340 | u32 *nr_tdx_keyids) | ||
341 | { | ||
342 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
343 | new file mode 100644 | ||
344 | index XXXXXXX..XXXXXXX | ||
345 | --- /dev/null | ||
346 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
347 | @@ -XXX,XX +XXX,XX @@ | ||
348 | +/* SPDX-License-Identifier: GPL-2.0 */ | ||
349 | +#ifndef _X86_VIRT_TDX_H | ||
350 | +#define _X86_VIRT_TDX_H | ||
351 | + | ||
352 | +/* | ||
353 | + * This file contains both macros and data structures defined by the TDX | ||
354 | + * architecture and Linux defined software data structures and functions. | ||
355 | + * The two should not be mixed together for better readability. The | ||
356 | + * architectural definitions come first. | ||
357 | + */ | ||
358 | + | ||
359 | +/* | ||
360 | + * TDX module SEAMCALL leaf functions | ||
361 | + */ | ||
362 | +#define TDH_SYS_INIT 33 | ||
363 | +#define TDH_SYS_LP_INIT 35 | ||
364 | + | ||
365 | +/* | ||
366 | + * Do not put any hardware-defined TDX structure representations below | ||
367 | + * this comment! | ||
368 | + */ | ||
369 | + | ||
370 | +/* Kernel defined TDX module status during module initialization. */ | ||
371 | +enum tdx_module_status_t { | ||
372 | + TDX_MODULE_UNINITIALIZED, | ||
373 | + TDX_MODULE_INITIALIZED, | ||
374 | + TDX_MODULE_ERROR | ||
375 | +}; | ||
376 | + | ||
377 | +#endif | ||
277 | -- | 378 | -- |
278 | 2.38.1 | 379 | 2.41.0 |
1 | Start to transit out the "multi-steps" to initialize the TDX module. | ||
---|---|---|---|
2 | |||
1 | TDX provides increased levels of memory confidentiality and integrity. | 3 | TDX provides increased levels of memory confidentiality and integrity. |
2 | This requires special hardware support for features like memory | 4 | This requires special hardware support for features like memory |
3 | encryption and storage of memory integrity checksums. Not all memory | 5 | encryption and storage of memory integrity checksums. Not all memory |
4 | satisfies these requirements. | 6 | satisfies these requirements. |
5 | 7 | ||
6 | As a result, TDX introduced the concept of a "Convertible Memory Region" | 8 | As a result, TDX introduced the concept of a "Convertible Memory Region" |
7 | (CMR). During boot, the firmware builds a list of all of the memory | 9 | (CMR). During boot, the firmware builds a list of all of the memory |
8 | ranges which can provide the TDX security guarantees. The list of these | 10 | ranges which can provide the TDX security guarantees. |
9 | ranges, along with TDX module information, is available to the kernel by | 11 | |
10 | querying the TDX module via TDH.SYS.INFO SEAMCALL. | 12 | CMRs tell the kernel which memory is TDX compatible. The kernel takes |
11 | 13 | CMRs (plus a little more metadata) and constructs "TD Memory Regions" | |
12 | The host kernel can choose whether or not to use all convertible memory | 14 | (TDMRs). TDMRs let the kernel grant TDX protections to some or all of |
13 | regions as TDX-usable memory. Before the TDX module is ready to create | 15 | the CMR areas. |
14 | any TDX guests, the kernel needs to configure the TDX-usable memory | 16 | |
15 | regions by passing an array of "TD Memory Regions" (TDMRs) to the TDX | 17 | The TDX module also reports necessary information to let the kernel |
16 | module. Constructing the TDMR array requires information of both the | 18 | build TDMRs and run TDX guests in structure 'tdsysinfo_struct'. The |
17 | TDX module (TDSYSINFO_STRUCT) and the Convertible Memory Regions. Call | 19 | list of CMRs, along with the TDX module information, is available to |
18 | TDH.SYS.INFO to get this information as a preparation. | 20 | the kernel by querying the TDX module. |
19 | 21 | ||
20 | Use static variables for both TDSYSINFO_STRUCT and CMR array to avoid | 22 | As a preparation to construct TDMRs, get the TDX module information and |
21 | having to pass them as function arguments when constructing the TDMR | 23 | the list of CMRs. Print out CMRs to help the user decode which memory |
22 | array. And they are too big to be put on the stack anyway. Also, KVM | 24 | regions are TDX convertible. |
23 | needs to use the TDSYSINFO_STRUCT to create TDX guests. | 25 | |
24 | 26 | The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot | |
27 | of info about the TDX module. Fully define the entire structure, but | ||
28 | only use the fields necessary to build the TDMRs and pr_info() some | ||
29 | basics about the module. The rest of the fields will get used by KVM. | ||
30 | |||
31 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
25 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 32 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
26 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 33 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
27 | --- | 34 | --- |
28 | 35 | ||
29 | v6 -> v7: | 36 | v13 -> v14: |
30 | - Simplified the check of CMRs due to the fact that TDX actually | 37 | - Added Kirill's tag. |
31 | verifies CMRs (that are passed by the BIOS) before enabling TDX. | 38 | |
32 | - Changed the function name from check_cmrs() -> trim_empty_cmrs(). | 39 | v12 -> v13: |
33 | - Added CMR page aligned check so that later patch can just get the PFN | 40 | - Allocate TDSYSINFO and CMR array separately. (Kirill) |
34 | using ">> PAGE_SHIFT". | 41 | - Added comment around TDH.SYS.INFO. (Peter) |
35 | 42 | ||
36 | v5 -> v6: | 43 | v11 -> v12: |
37 | - Added to also print TDX module's attribute (Isaku). | 44 | - Changed to use dynamic allocation for TDSYSINFO_STRUCT and CMR array |
38 | - Removed all arguments in tdx_gete_sysinfo() to use static variables | 45 | (Kirill). |
39 | of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used | 46 | - Keep SEAMCALL leaf macro definitions in order (Kirill) |
40 | directly in other functions in later patches. | 47 | - Removed is_cmr_empty() but open code directly (David) |
41 | - Added Isaku's Reviewed-by. | 48 | - 'atribute' -> 'attribute' (David) |
42 | 49 | ||
43 | - v3 -> v5 (no feedback on v4): | 50 | v10 -> v11: |
44 | - Renamed sanitize_cmrs() to check_cmrs(). | 51 | - No change. |
45 | - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array | 52 | |
46 | actual size returned by TDH.SYS.INFO. | 53 | v9 -> v10: |
47 | - Changed -EFAULT to -EINVAL in couple places. | 54 | - Added back "start to transit out..." as now per-cpu init has been |
48 | - Added comments around tdx_sysinfo and tdx_cmr_array saying they are | 55 | moved out from tdx_enable(). |
49 | used by TDH.SYS.INFO ABI. | 56 | |
50 | - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function | 57 | v8 -> v9: |
51 | arguments in tdx_get_sysinfo(). | 58 | - Removed "start to trransit out ..." part in changelog since this patch |
52 | - Changed to only print BIOS-CMR when check_cmrs() fails. | 59 | is no longer the first step anymore. |
60 | - Changed to declare 'tdsysinfo' and 'cmr_array' as local static, and | ||
61 | changed changelog accordingly (Dave). | ||
62 | - Improved changelog to explain why to declare 'tdsysinfo_struct' in | ||
63 | full but only use a few members of them (Dave). | ||
64 | |||
65 | v7 -> v8: (Dave) | ||
66 | - Improved changelog to tell this is the first patch to transit out the | ||
67 | "multi-steps" init_tdx_module(). | ||
68 | - Removed all CMR check/trim code but to depend on later SEAMCALL. | ||
69 | - Variable 'vertical alignment' in print TDX module information. | ||
70 | - Added DECLARE_PADDED_STRUCT() for padded structure. | ||
71 | - Made tdx_sysinfo and tdx_cmr_array[] to be function local variable | ||
72 | (and rename them accordingly), and added -Wframe-larger-than=4096 flag | ||
73 | to silence the build warning. | ||
74 | |||
75 | ... | ||
53 | 76 | ||
54 | --- | 77 | --- |
55 | arch/x86/virt/vmx/tdx/tdx.c | 125 ++++++++++++++++++++++++++++++++++++ | 78 | arch/x86/virt/vmx/tdx/tdx.c | 94 ++++++++++++++++++++++++++++++++++++- |
56 | arch/x86/virt/vmx/tdx/tdx.h | 61 ++++++++++++++++++ | 79 | arch/x86/virt/vmx/tdx/tdx.h | 64 +++++++++++++++++++++++++ |
57 | 2 files changed, 186 insertions(+) | 80 | 2 files changed, 156 insertions(+), 2 deletions(-) |
58 | 81 | ||
59 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 82 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
60 | index XXXXXXX..XXXXXXX 100644 | 83 | index XXXXXXX..XXXXXXX 100644 |
61 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 84 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
62 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 85 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
63 | @@ -XXX,XX +XXX,XX @@ | 86 | @@ -XXX,XX +XXX,XX @@ |
64 | #include <linux/cpumask.h> | 87 | #include <linux/spinlock.h> |
65 | #include <linux/smp.h> | 88 | #include <linux/percpu-defs.h> |
66 | #include <linux/atomic.h> | 89 | #include <linux/mutex.h> |
67 | +#include <linux/align.h> | 90 | +#include <linux/slab.h> |
91 | +#include <linux/math.h> | ||
68 | #include <asm/msr-index.h> | 92 | #include <asm/msr-index.h> |
69 | #include <asm/msr.h> | 93 | #include <asm/msr.h> |
70 | #include <asm/apic.h> | 94 | +#include <asm/page.h> |
71 | @@ -XXX,XX +XXX,XX @@ static enum tdx_module_status_t tdx_module_status; | 95 | #include <asm/tdx.h> |
72 | /* Prevent concurrent attempts on TDX detection and initialization */ | 96 | #include "tdx.h" |
73 | static DEFINE_MUTEX(tdx_module_lock); | 97 | |
74 | 98 | @@ -XXX,XX +XXX,XX @@ int tdx_cpu_enable(void) | |
75 | +/* Below two are used in TDH.SYS.INFO SEAMCALL ABI */ | ||
76 | +static struct tdsysinfo_struct tdx_sysinfo; | ||
77 | +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT); | ||
78 | +static int tdx_cmr_num; | ||
79 | + | ||
80 | /* | ||
81 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | ||
82 | * BIOS. Both initializing the TDX module and running TDX guest require | ||
83 | @@ -XXX,XX +XXX,XX @@ static int tdx_module_init_cpus(void) | ||
84 | return atomic_read(&sc.err); | ||
85 | } | 99 | } |
86 | 100 | EXPORT_SYMBOL_GPL(tdx_cpu_enable); | |
87 | +static inline bool is_cmr_empty(struct cmr_info *cmr) | 101 | |
88 | +{ | 102 | +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs) |
89 | + return !cmr->size; | ||
90 | +} | ||
91 | + | ||
92 | +static inline bool is_cmr_ok(struct cmr_info *cmr) | ||
93 | +{ | ||
94 | + /* CMR must be page aligned */ | ||
95 | + return IS_ALIGNED(cmr->base, PAGE_SIZE) && | ||
96 | + IS_ALIGNED(cmr->size, PAGE_SIZE); | ||
97 | +} | ||
98 | + | ||
99 | +static void print_cmrs(struct cmr_info *cmr_array, int cmr_num, | ||
100 | + const char *name) | ||
101 | +{ | 103 | +{ |
102 | + int i; | 104 | + int i; |
103 | + | 105 | + |
104 | + for (i = 0; i < cmr_num; i++) { | 106 | + for (i = 0; i < nr_cmrs; i++) { |
105 | + struct cmr_info *cmr = &cmr_array[i]; | 107 | + struct cmr_info *cmr = &cmr_array[i]; |
106 | + | 108 | + |
107 | + pr_info("%s : [0x%llx, 0x%llx)\n", name, | 109 | + /* |
108 | + cmr->base, cmr->base + cmr->size); | 110 | + * The array of CMRs reported via TDH.SYS.INFO can |
111 | + * contain tail empty CMRs. Don't print them. | ||
112 | + */ | ||
113 | + if (!cmr->size) | ||
114 | + break; | ||
115 | + | ||
116 | + pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base, | ||
117 | + cmr->base + cmr->size); | ||
109 | + } | 118 | + } |
110 | +} | 119 | +} |
111 | + | 120 | + |
112 | +/* Check CMRs reported by TDH.SYS.INFO, and trim tail empty CMRs. */ | 121 | +static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo, |
113 | +static int trim_empty_cmrs(struct cmr_info *cmr_array, int *actual_cmr_num) | 122 | + struct cmr_info *cmr_array) |
114 | +{ | 123 | +{ |
115 | + struct cmr_info *cmr; | 124 | + struct tdx_module_args args = {}; |
116 | + int i, cmr_num; | 125 | + int ret; |
117 | + | 126 | + |
118 | + /* | 127 | + /* |
119 | + * Intel TDX module spec, 20.7.3 CMR_INFO: | 128 | + * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array |
129 | + * to the buffers provided by the kernel (via RCX and R8 | ||
130 | + * respectively). The buffer size of the TDSYSINFO_STRUCT | ||
131 | + * (via RDX) and the maximum entries of the CMR array (via R9) | ||
132 | + * passed to this SEAMCALL must be at least the size of | ||
133 | + * TDSYSINFO_STRUCT and MAX_CMRS respectively. | ||
120 | + * | 134 | + * |
121 | + * TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry | 135 | + * Upon a successful return, R9 contains the actual entries |
122 | + * array of CMR_INFO entries. The CMRs are sorted from the | 136 | + * written to the CMR array. |
123 | + * lowest base address to the highest base address, and they | ||
124 | + * are non-overlapping. | ||
125 | + * | ||
126 | + * This implies that BIOS may generate invalid empty entries | ||
127 | + * if total CMRs are less than 32. Need to skip them manually. | ||
128 | + * | ||
129 | + * CMR also must be 4K aligned. TDX doesn't trust BIOS. TDX | ||
130 | + * actually verifies CMRs before it gets enabled, so anything | ||
131 | + * that doesn't meet the above means a kernel bug (or TDX is broken). | |
132 | + */ | 137 | + */ |
133 | + cmr = &cmr_array[0]; | 138 | + args.rcx = __pa(tdsysinfo); |
134 | + /* There must be at least one valid CMR */ | 139 | + args.rdx = TDSYSINFO_STRUCT_SIZE; |
135 | + if (WARN_ON_ONCE(is_cmr_empty(cmr) || !is_cmr_ok(cmr))) | 140 | + args.r8 = __pa(cmr_array); |
136 | + goto err; | 141 | + args.r9 = MAX_CMRS; |
137 | + | 142 | + ret = seamcall_prerr_ret(TDH_SYS_INFO, &args); |
138 | + cmr_num = *actual_cmr_num; | ||
139 | + for (i = 1; i < cmr_num; i++) { | ||
140 | + struct cmr_info *cmr = &cmr_array[i]; | ||
141 | + struct cmr_info *prev_cmr = NULL; | ||
142 | + | ||
143 | + /* Skip further empty CMRs */ | ||
144 | + if (is_cmr_empty(cmr)) | ||
145 | + break; | ||
146 | + | ||
147 | + /* | ||
148 | + * Do sanity check anyway to make sure CMRs: | ||
149 | + * - are 4K aligned | ||
150 | + * - don't overlap | ||
151 | + * - are in address ascending order. | ||
152 | + */ | ||
153 | + if (WARN_ON_ONCE(!is_cmr_ok(cmr))) | ||
154 | + goto err; | ||
155 | + | ||
156 | + prev_cmr = &cmr_array[i - 1]; | ||
157 | + if (WARN_ON_ONCE((prev_cmr->base + prev_cmr->size) > | ||
158 | + cmr->base)) | ||
159 | + goto err; | ||
160 | + } | ||
161 | + | ||
162 | + /* Update the actual number of CMRs */ | ||
163 | + *actual_cmr_num = i; | ||
164 | + | ||
165 | + /* Print kernel checked CMRs */ | ||
166 | + print_cmrs(cmr_array, *actual_cmr_num, "Kernel-checked-CMR"); | ||
167 | + | ||
168 | + return 0; | ||
169 | +err: | ||
170 | + pr_info("[TDX broken?]: Invalid CMRs detected\n"); | |
171 | + print_cmrs(cmr_array, *actual_cmr_num, "BIOS-CMR"); | |
172 | + return -EINVAL; | ||
173 | +} | ||
174 | + | ||
175 | +static int tdx_get_sysinfo(void) | ||
176 | +{ | ||
177 | + struct tdx_module_output out; | ||
178 | + int ret; | ||
179 | + | ||
180 | + BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE); | ||
181 | + | ||
182 | + ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE, | ||
183 | + __pa(tdx_cmr_array), MAX_CMRS, NULL, &out); | ||
184 | + if (ret) | 143 | + if (ret) |
185 | + return ret; | 144 | + return ret; |
186 | + | 145 | + |
187 | + /* R9 contains the actual entries written to the CMR array. */ | 146 | + pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u", |
188 | + tdx_cmr_num = out.r9; | 147 | + tdsysinfo->attributes, tdsysinfo->vendor_id, |
189 | + | 148 | + tdsysinfo->major_version, tdsysinfo->minor_version, |
190 | + pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u", | 149 | + tdsysinfo->build_date, tdsysinfo->build_num); |
191 | + tdx_sysinfo.attributes, tdx_sysinfo.vendor_id, | 150 | + |
192 | + tdx_sysinfo.major_version, tdx_sysinfo.minor_version, | 151 | + print_cmrs(cmr_array, args.r9); |
193 | + tdx_sysinfo.build_date, tdx_sysinfo.build_num); | 152 | + |
194 | + | 153 | + return 0; |
195 | + /* | ||
196 | + * trim_empty_cmrs() updates the actual number of CMRs by | ||
197 | + * dropping all tail empty CMRs. | ||
198 | + */ | ||
199 | + return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num); | ||
200 | +} | 154 | +} |
201 | + | 155 | + |
202 | /* | 156 | static int init_tdx_module(void) |
203 | * Detect and initialize the TDX module. | 157 | { |
204 | * | 158 | + struct tdsysinfo_struct *tdsysinfo; |
205 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 159 | + struct cmr_info *cmr_array; |
206 | if (ret) | 160 | + int tdsysinfo_size; |
207 | goto out; | 161 | + int cmr_array_size; |
208 | 162 | + int ret; | |
209 | + ret = tdx_get_sysinfo(); | 163 | + |
164 | + tdsysinfo_size = round_up(TDSYSINFO_STRUCT_SIZE, | ||
165 | + TDSYSINFO_STRUCT_ALIGNMENT); | ||
166 | + tdsysinfo = kzalloc(tdsysinfo_size, GFP_KERNEL); | ||
167 | + if (!tdsysinfo) | ||
168 | + return -ENOMEM; | ||
169 | + | ||
170 | + cmr_array_size = sizeof(struct cmr_info) * MAX_CMRS; | ||
171 | + cmr_array_size = round_up(cmr_array_size, CMR_INFO_ARRAY_ALIGNMENT); | ||
172 | + cmr_array = kzalloc(cmr_array_size, GFP_KERNEL); | ||
173 | + if (!cmr_array) { | ||
174 | + kfree(tdsysinfo); | ||
175 | + return -ENOMEM; | ||
176 | + } | ||
177 | + | ||
178 | + | ||
179 | + /* Get the TDSYSINFO_STRUCT and CMRs from the TDX module. */ | ||
180 | + ret = get_tdx_sysinfo(tdsysinfo, cmr_array); | ||
210 | + if (ret) | 181 | + if (ret) |
211 | + goto out; | 182 | + goto out; |
212 | + | 183 | + |
213 | /* | 184 | /* |
214 | * Return -EINVAL until all steps of TDX module initialization | 185 | * TODO: |
215 | * process are done. | 186 | * |
187 | - * - Get TDX module information and TDX-capable memory regions. | ||
188 | * - Build the list of TDX-usable memory regions. | ||
189 | * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
190 | * all TDX-usable memory regions. | ||
191 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
192 | * | ||
193 | * Return error before all steps are done. | ||
194 | */ | ||
195 | - return -EINVAL; | ||
196 | + ret = -EINVAL; | ||
197 | +out: | ||
198 | + /* | ||
199 | + * For now both @sysinfo and @cmr_array are only used during | ||
200 | + * module initialization, so always free them. | ||
201 | + */ | ||
202 | + kfree(tdsysinfo); | ||
203 | + kfree(cmr_array); | ||
204 | + return ret; | ||
205 | } | ||
206 | |||
207 | static int __tdx_enable(void) | ||
216 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 208 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
217 | index XXXXXXX..XXXXXXX 100644 | 209 | index XXXXXXX..XXXXXXX 100644 |
218 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 210 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
219 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 211 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
212 | @@ -XXX,XX +XXX,XX @@ | ||
213 | #ifndef _X86_VIRT_TDX_H | ||
214 | #define _X86_VIRT_TDX_H | ||
215 | |||
216 | +#include <linux/types.h> | ||
217 | +#include <linux/stddef.h> | ||
218 | +#include <linux/compiler_attributes.h> | ||
219 | + | ||
220 | /* | ||
221 | * This file contains both macros and data structures defined by the TDX | ||
222 | * architecture and Linux defined software data structures and functions. | ||
220 | @@ -XXX,XX +XXX,XX @@ | 223 | @@ -XXX,XX +XXX,XX @@ |
221 | /* | 224 | /* |
222 | * TDX module SEAMCALL leaf functions | 225 | * TDX module SEAMCALL leaf functions |
223 | */ | 226 | */ |
224 | +#define TDH_SYS_INFO 32 | 227 | +#define TDH_SYS_INFO 32 |
225 | #define TDH_SYS_INIT 33 | 228 | #define TDH_SYS_INIT 33 |
226 | #define TDH_SYS_LP_INIT 35 | 229 | #define TDH_SYS_LP_INIT 35 |
227 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
228 | 230 | ||
229 | +struct cmr_info { | 231 | +struct cmr_info { |
230 | + u64 base; | 232 | + u64 base; |
231 | + u64 size; | 233 | + u64 size; |
232 | +} __packed; | 234 | +} __packed; |
233 | + | 235 | + |
234 | +#define MAX_CMRS 32 | 236 | +#define MAX_CMRS 32 |
235 | +#define CMR_INFO_ARRAY_ALIGNMENT 512 | 237 | +#define CMR_INFO_ARRAY_ALIGNMENT 512 |
236 | + | 238 | + |
237 | +struct cpuid_config { | 239 | +struct cpuid_config { |
238 | + u32 leaf; | 240 | + u32 leaf; |
239 | + u32 sub_leaf; | 241 | + u32 sub_leaf; |
... | ... | ||
244 | +} __packed; | 246 | +} __packed; |
245 | + | 247 | + |
246 | +#define TDSYSINFO_STRUCT_SIZE 1024 | 248 | +#define TDSYSINFO_STRUCT_SIZE 1024 |
247 | +#define TDSYSINFO_STRUCT_ALIGNMENT 1024 | 249 | +#define TDSYSINFO_STRUCT_ALIGNMENT 1024 |
248 | + | 250 | + |
251 | +/* | ||
252 | + * The size of this structure itself is flexible. The actual structure | ||
253 | + * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE bytes | ||
254 | + * and TDSYSINFO_STRUCT_ALIGNMENT bytes aligned. | ||
255 | + */ | ||
249 | +struct tdsysinfo_struct { | 256 | +struct tdsysinfo_struct { |
250 | + /* TDX-SEAM Module Info */ | 257 | + /* TDX-SEAM Module Info */ |
251 | + u32 attributes; | 258 | + u32 attributes; |
252 | + u32 vendor_id; | 259 | + u32 vendor_id; |
253 | + u32 build_date; | 260 | + u32 build_date; |
... | ... | ||
273 | + u64 xfam_fixed1; | 280 | + u64 xfam_fixed1; |
274 | + u8 reserved4[32]; | 281 | + u8 reserved4[32]; |
275 | + u32 num_cpuid_config; | 282 | + u32 num_cpuid_config; |
276 | + /* | 283 | + /* |
277 | + * The actual number of CPUID_CONFIG depends on above | 284 | + * The actual number of CPUID_CONFIG depends on above |
278 | + * 'num_cpuid_config'. The size of 'struct tdsysinfo_struct' | 285 | + * 'num_cpuid_config'. |
279 | + * is 1024B defined by TDX architecture. Use a union with | ||
280 | + * specific padding to make 'sizeof(struct tdsysinfo_struct)' | ||
281 | + * equal to 1024. | ||
282 | + */ | 286 | + */ |
283 | + union { | 287 | + DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); |
284 | + struct cpuid_config cpuid_configs[0]; | 288 | +} __packed; |
285 | + u8 reserved5[892]; | ||
286 | + }; | ||
287 | +} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT); | ||
288 | + | 289 | + |
289 | /* | 290 | /* |
290 | * Do not put any hardware-defined TDX structure representations below | 291 | * Do not put any hardware-defined TDX structure representations below |
291 | * this comment! | 292 | * this comment! |
292 | -- | 293 | -- |
293 | 2.38.1 | 294 | 2.41.0 | diff view generated by jsdifflib |
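Taken together, the CMR rules enforced in the hunk above (entries must be 4K-aligned, sorted in ascending order, non-overlapping, with empty entries only at the tail) amount to a small interval check. The sketch below re-implements that check as standalone userspace C for illustration; the struct layout mirrors the patch, but the function is a simplified stand-in for trim_empty_cmrs(), not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the hardware-defined CMR entry from tdx.h (base/size in bytes). */
struct cmr_info {
	uint64_t base;
	uint64_t size;
};

#define CMR_PAGE_SIZE	4096ULL

/* A CMR is well-formed when both its base and size are 4K-aligned. */
static bool is_cmr_ok(const struct cmr_info *cmr)
{
	return !(cmr->base % CMR_PAGE_SIZE) && !(cmr->size % CMR_PAGE_SIZE);
}

/*
 * Validate a TDH.SYS.INFO-style CMR array: the first entry must be
 * non-empty, every non-empty entry must be 4K-aligned, and entries must
 * be sorted ascending without overlap.  Tail empty entries are trimmed.
 * Returns the number of valid CMRs, or -1 if the array is malformed.
 */
static int check_and_trim_cmrs(const struct cmr_info *cmrs, int nr)
{
	int i;

	if (nr < 1 || !cmrs[0].size || !is_cmr_ok(&cmrs[0]))
		return -1;

	for (i = 1; i < nr; i++) {
		if (!cmrs[i].size)
			break;	/* tail empty entries start here */
		if (!is_cmr_ok(&cmrs[i]))
			return -1;
		/* Previous CMR must end at or before this one starts. */
		if (cmrs[i - 1].base + cmrs[i - 1].size > cmrs[i].base)
			return -1;
	}
	return i;
}
```

With a 32-entry array where only the first two entries are populated, the function returns 2; a misaligned or overlapping entry makes it fail, matching the WARN_ON_ONCE() paths in the patch.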
1 | TDX reports a list of "Convertible Memory Region" (CMR) to indicate all | 1 | As a step of initializing the TDX module, the kernel needs to tell the |
---|---|---|---|
2 | memory regions that can possibly be used by the TDX module, but they are | 2 | TDX module which memory regions can be used by the TDX module as TDX |
3 | not automatically usable to the TDX module. As a step of initializing | 3 | guest memory. |
4 | the TDX module, the kernel needs to choose a list of memory regions (out | 4 | |
5 | from convertible memory regions) that the TDX module can use and pass | 5 | TDX reports a list of "Convertible Memory Region" (CMR) to tell the |
6 | those regions to the TDX module. Once this is done, those "TDX-usable" | 6 | kernel which memory is TDX compatible. The kernel needs to build a list |
7 | memory regions are fixed during module's lifetime. No more TDX-usable | 7 | of memory regions (out of CMRs) as "TDX-usable" memory and pass them to |
8 | memory can be added to the TDX module after that. | 8 | the TDX module. Once this is done, those "TDX-usable" memory regions |
9 | 9 | are fixed during module's lifetime. | |
10 | The initial support of TDX guests will only allocate TDX guest memory | 10 | |
11 | from the global page allocator. To keep things simple, this initial | 11 | To keep things simple, assume that all TDX-protected memory will come |
12 | implementation simply guarantees all pages in the page allocator are TDX | 12 | from the page allocator. Make sure all pages in the page allocator |
13 | memory. To achieve this, use all system memory in the core-mm at the | 13 | *are* TDX-usable memory. |
14 | time of initializing the TDX module as TDX memory, and in the meantime, | 14 | |
15 | refuse to add any non-TDX-memory in the memory hotplug. | 15 | As TDX-usable memory is a fixed configuration, take a snapshot of the |
16 | 16 | memory configuration from memblocks at the time of module initialization | |
17 | Specifically, walk through all memory regions managed by memblock and | 17 | (memblocks are modified on memory hotplug). This snapshot is used to |
18 | add them to a global list of "TDX-usable" memory regions, which is a | 18 | enable TDX support for *this* memory configuration only. Use a memory |
19 | fixed list after the module initialization (or empty if initialization | 19 | hotplug notifier to ensure that no other RAM can be added outside of |
20 | fails). To reject non-TDX-memory in memory hotplug, add an additional | 20 | this configuration. |
21 | check in arch_add_memory() to check whether the new region is covered by | 21 | |
22 | any region in the "TDX-usable" memory region list. | 22 | This approach requires all memblock memory regions at the time of module |
23 | 23 | initialization to be TDX convertible memory to work, otherwise module | |
24 | Note this requires all memory regions in memblock are TDX convertible | 24 | initialization will fail in a later SEAMCALL when passing those regions |
25 | memory when initializing the TDX module. This is true in practice if no | 25 | to the module. This approach works when all boot-time "system RAM" is |
26 | new memory has been hot-added before initializing the TDX module, since | 26 | TDX convertible memory, and no non-TDX-convertible memory is hot-added |
27 | in practice all boot-time present DIMM is TDX convertible memory. If | 27 | to the core-mm before module initialization. |
28 | any new memory has been hot-added, then initializing the TDX module will | 28 | |
29 | fail because that memory region is not covered by any CMR. | 29 | For instance, on the first generation of TDX machines, both CXL memory |
30 | 30 | and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add | |
31 | This can be enhanced in the future, i.e. by allowing adding non-TDX | 31 | any CXL memory or NVDIMM to the core-mm before module initialization |
32 | memory to a separate NUMA node. In this case, the "TDX-capable" nodes | 32 | will result in failure to initialize the module. The SEAMCALL error |
33 | and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace | 33 | code will be available in the dmesg to help the user understand the |
34 | needs to guarantee memory pages for TDX guests are always allocated from | 34 | failure. |
35 | the "TDX-capable" nodes. | ||
36 | |||
37 | Note TDX assumes convertible memory is always physically present during | ||
38 | machine's runtime. A non-buggy BIOS should never support hot-removal of | ||
39 | any convertible memory. This implementation doesn't handle ACPI memory | ||
40 | removal but depends on the BIOS to behave correctly. | ||
41 | 35 | ||
42 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 36 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
37 | Reviewed-by: "Huang, Ying" <ying.huang@intel.com> | ||
38 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | ||
39 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> | ||
40 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
43 | --- | 41 | --- |
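The hotplug policy described in the changelog above — new memory is accepted only if it falls entirely inside one of the stashed "TDX-usable" regions — is a plain interval-coverage test. A minimal userspace sketch of that test follows; an array stands in for the kernel's @tdx_memlist linked list, and all names here are illustrative, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the patch's struct tdx_memblock (PFN range). */
struct tdx_region {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

/*
 * Return true if [start_pfn, end_pfn) is fully covered by a single
 * stashed region.  A range spanning a gap between regions is rejected,
 * which is what lets the memory hotplug path refuse non-TDX memory.
 */
static bool range_is_tdx_memory(const struct tdx_region *regions, int nr,
				unsigned long start_pfn, unsigned long end_pfn)
{
	for (int i = 0; i < nr; i++) {
		if (start_pfn >= regions[i].start_pfn &&
		    end_pfn <= regions[i].end_pfn)
			return true;
	}
	return false;
}
```

Because the stashed regions come straight from memblock, a hot-added range can never legitimately straddle two of them, which is why covering by a single region is sufficient.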
44 | 42 | ||
45 | v6 -> v7: | 43 | v13 -> v14: |
46 | - Changed to use all system memory in memblock at the time of | 44 | - No change |
47 | initializing the TDX module as TDX memory | 45 | |
48 | - Added memory hotplug support | 46 | v12 -> v13: |
47 | - Avoided using " ? : " in tdx_memory_notifier(). (Peter) | ||
48 | |||
49 | v11 -> v12: | ||
50 | - Added tags from Dave/Kirill. | ||
51 | |||
52 | v10 -> v11: | ||
53 | - Added Isaku's Reviewed-by. | ||
54 | |||
55 | v9 -> v10: | ||
56 | - Moved empty @tdx_memlist check out of is_tdx_memory() to make the | ||
57 | logic better. | ||
58 | - Added Ying's Reviewed-by. | ||
59 | |||
60 | v8 -> v9: | ||
61 | - Replace "The initial support ..." with timeless sentence in both | ||
62 | changelog and comments(Dave). | ||
63 | - Fix run-on sentence in changelog, and senstence to explain why to | ||
64 | stash off memblock (Dave). | ||
65 | - Tried to improve why to choose this approach and how it work in | ||
66 | changelog based on Dave's suggestion. | ||
67 | - Many other comments enhancement (Dave). | ||
68 | |||
69 | v7 -> v8: | ||
70 | - Trimed down changelog (Dave). | ||
71 | - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series | ||
72 | (Ying). | ||
73 | - Moved memory hotplug handling from add_arch_memory() to | ||
74 | memory_notifier (Dan/David). | ||
75 | - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave). | ||
76 | - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave). | ||
77 | - Removed pfn_covered_by_cmr() check as no code to trim CMRs now. | ||
78 | - Improve the comment around first 1MB (Dave). | ||
79 | - Added a comment around reserve_real_mode() to point out TDX code | ||
80 | relies on first 1MB being reserved (Ying). | ||
81 | - Added comment to explain why the new online memory range cannot | ||
82 | cross multiple TDX memory blocks (Dave). | ||
83 | - Improved other comments (Dave). | ||
49 | 84 | ||
50 | --- | 85 | --- |
51 | arch/x86/Kconfig | 1 + | 86 | arch/x86/Kconfig | 1 + |
52 | arch/x86/include/asm/tdx.h | 3 + | 87 | arch/x86/kernel/setup.c | 2 + |
53 | arch/x86/mm/init_64.c | 10 ++ | 88 | arch/x86/virt/vmx/tdx/tdx.c | 162 +++++++++++++++++++++++++++++++++++- |
54 | arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++ | 89 | arch/x86/virt/vmx/tdx/tdx.h | 6 ++ |
55 | 4 files changed, 197 insertions(+) | 90 | 4 files changed, 170 insertions(+), 1 deletion(-) |
56 | 91 | ||
57 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig | 92 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig |
58 | index XXXXXXX..XXXXXXX 100644 | 93 | index XXXXXXX..XXXXXXX 100644 |
59 | --- a/arch/x86/Kconfig | 94 | --- a/arch/x86/Kconfig |
60 | +++ b/arch/x86/Kconfig | 95 | +++ b/arch/x86/Kconfig |
... | ... | ||
64 | depends on X86_X2APIC | 99 | depends on X86_X2APIC |
65 | + select ARCH_KEEP_MEMBLOCK | 100 | + select ARCH_KEEP_MEMBLOCK |
66 | help | 101 | help |
67 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious | 102 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious |
68 | host and certain physical attacks. This option enables necessary TDX | 103 | host and certain physical attacks. This option enables necessary TDX |
69 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 104 | diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c |
70 | index XXXXXXX..XXXXXXX 100644 | 105 | index XXXXXXX..XXXXXXX 100644 |
71 | --- a/arch/x86/include/asm/tdx.h | 106 | --- a/arch/x86/kernel/setup.c |
72 | +++ b/arch/x86/include/asm/tdx.h | 107 | +++ b/arch/x86/kernel/setup.c |
73 | @@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, | 108 | @@ -XXX,XX +XXX,XX @@ void __init setup_arch(char **cmdline_p) |
74 | #ifdef CONFIG_INTEL_TDX_HOST | 109 | * |
75 | bool platform_tdx_enabled(void); | 110 | * Moreover, on machines with SandyBridge graphics or in setups that use |
76 | int tdx_enable(void); | 111 | * crashkernel the entire 1M is reserved anyway. |
77 | +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn); | 112 | + * |
78 | #else /* !CONFIG_INTEL_TDX_HOST */ | 113 | + * Note the host kernel TDX also requires the first 1MB being reserved. |
79 | static inline bool platform_tdx_enabled(void) { return false; } | 114 | */ |
80 | static inline int tdx_enable(void) { return -ENODEV; } | 115 | x86_platform.realmode_reserve(); |
81 | +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn, | 116 | |
82 | + unsigned long end_pfn) { return true; } | ||
83 | #endif /* CONFIG_INTEL_TDX_HOST */ | ||
84 | |||
85 | #endif /* !__ASSEMBLY__ */ | ||
86 | diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c | ||
87 | index XXXXXXX..XXXXXXX 100644 | ||
88 | --- a/arch/x86/mm/init_64.c | ||
89 | +++ b/arch/x86/mm/init_64.c | ||
90 | @@ -XXX,XX +XXX,XX @@ | ||
91 | #include <asm/uv/uv.h> | ||
92 | #include <asm/setup.h> | ||
93 | #include <asm/ftrace.h> | ||
94 | +#include <asm/tdx.h> | ||
95 | |||
96 | #include "mm_internal.h" | ||
97 | |||
98 | @@ -XXX,XX +XXX,XX @@ int arch_add_memory(int nid, u64 start, u64 size, | ||
99 | unsigned long start_pfn = start >> PAGE_SHIFT; | ||
100 | unsigned long nr_pages = size >> PAGE_SHIFT; | ||
101 | |||
102 | + /* | ||
103 | + * For now if TDX is enabled, all pages in the page allocator | ||
104 | + * must be TDX memory, which is a fixed set of memory regions | ||
105 | + * that are passed to the TDX module. Reject the new region | ||
106 | + * if it is not TDX memory to guarantee above is true. | ||
107 | + */ | ||
108 | + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages)) | ||
109 | + return -EINVAL; | ||
110 | + | ||
111 | init_memory_mapping(start, start + size, params->pgprot); | ||
112 | |||
113 | return add_pages(nid, start_pfn, nr_pages, params); | ||
114 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 117 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
115 | index XXXXXXX..XXXXXXX 100644 | 118 | index XXXXXXX..XXXXXXX 100644 |
116 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 119 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
117 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 120 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
118 | @@ -XXX,XX +XXX,XX @@ | 121 | @@ -XXX,XX +XXX,XX @@ |
119 | #include <linux/smp.h> | 122 | #include <linux/mutex.h> |
120 | #include <linux/atomic.h> | 123 | #include <linux/slab.h> |
121 | #include <linux/align.h> | 124 | #include <linux/math.h> |
122 | +#include <linux/list.h> | 125 | +#include <linux/list.h> |
123 | +#include <linux/slab.h> | ||
124 | +#include <linux/memblock.h> | 126 | +#include <linux/memblock.h> |
127 | +#include <linux/memory.h> | ||
125 | +#include <linux/minmax.h> | 128 | +#include <linux/minmax.h> |
126 | +#include <linux/sizes.h> | 129 | +#include <linux/sizes.h> |
130 | +#include <linux/pfn.h> | ||
127 | #include <asm/msr-index.h> | 131 | #include <asm/msr-index.h> |
128 | #include <asm/msr.h> | 132 | #include <asm/msr.h> |
129 | #include <asm/apic.h> | 133 | #include <asm/page.h> |
130 | @@ -XXX,XX +XXX,XX @@ enum tdx_module_status_t { | 134 | @@ -XXX,XX +XXX,XX @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized); |
131 | TDX_MODULE_SHUTDOWN, | 135 | static enum tdx_module_status_t tdx_module_status; |
132 | }; | 136 | static DEFINE_MUTEX(tdx_module_lock); |
133 | 137 | ||
134 | +struct tdx_memblock { | 138 | +/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ |
135 | + struct list_head list; | ||
136 | + unsigned long start_pfn; | ||
137 | + unsigned long end_pfn; | ||
138 | + int nid; | ||
139 | +}; | ||
140 | + | ||
141 | static u32 tdx_keyid_start __ro_after_init; | ||
142 | static u32 tdx_keyid_num __ro_after_init; | ||
143 | |||
144 | @@ -XXX,XX +XXX,XX @@ static struct tdsysinfo_struct tdx_sysinfo; | ||
145 | static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT); | ||
146 | static int tdx_cmr_num; | ||
147 | |||
148 | +/* All TDX-usable memory regions */ | ||
149 | +static LIST_HEAD(tdx_memlist); | 139 | +static LIST_HEAD(tdx_memlist); |
150 | + | 140 | + |
151 | /* | 141 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); |
152 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | 142 | |
153 | * BIOS. Both initializing the TDX module and running TDX guest require | 143 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) |
154 | @@ -XXX,XX +XXX,XX @@ static int tdx_get_sysinfo(void) | 144 | @@ -XXX,XX +XXX,XX @@ static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo, |
155 | return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num); | 145 | return 0; |
156 | } | 146 | } |
157 | 147 | ||
158 | +/* Check whether the given pfn range is covered by any CMR or not. */ | ||
159 | +static bool pfn_range_covered_by_cmr(unsigned long start_pfn, | ||
160 | + unsigned long end_pfn) | ||
161 | +{ | ||
162 | + int i; | ||
163 | + | ||
164 | + for (i = 0; i < tdx_cmr_num; i++) { | ||
165 | + struct cmr_info *cmr = &tdx_cmr_array[i]; | ||
166 | + unsigned long cmr_start_pfn; | ||
167 | + unsigned long cmr_end_pfn; | ||
168 | + | ||
169 | + cmr_start_pfn = cmr->base >> PAGE_SHIFT; | ||
170 | + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT; | ||
171 | + | ||
172 | + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn) | ||
173 | + return true; | ||
174 | + } | ||
175 | + | ||
176 | + return false; | ||
177 | +} | ||
178 | + | ||
179 | +/* | 148 | +/* |
180 | + * Add a memory region on a given node as a TDX memory block. The caller | 149 | + * Add a memory region as a TDX memory block. The caller must make sure |
181 | + * to make sure all memory regions are added in address ascending order | 150 | + * all memory regions are added in address ascending order and don't |
182 | + * and don't overlap. | 151 | + * overlap. |
183 | + */ | 152 | + */ |
184 | +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn, | 153 | +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, |
185 | + int nid) | 154 | + unsigned long end_pfn) |
186 | +{ | 155 | +{ |
187 | + struct tdx_memblock *tmb; | 156 | + struct tdx_memblock *tmb; |
188 | + | 157 | + |
189 | + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL); | 158 | + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL); |
190 | + if (!tmb) | 159 | + if (!tmb) |
191 | + return -ENOMEM; | 160 | + return -ENOMEM; |
192 | + | 161 | + |
193 | + INIT_LIST_HEAD(&tmb->list); | 162 | + INIT_LIST_HEAD(&tmb->list); |
194 | + tmb->start_pfn = start_pfn; | 163 | + tmb->start_pfn = start_pfn; |
195 | + tmb->end_pfn = end_pfn; | 164 | + tmb->end_pfn = end_pfn; |
196 | + tmb->nid = nid; | 165 | + |
197 | + | 166 | + /* @tmb_list is protected by mem_hotplug_lock */ |
198 | + list_add_tail(&tmb->list, &tdx_memlist); | 167 | + list_add_tail(&tmb->list, tmb_list); |
199 | + return 0; | 168 | + return 0; |
200 | +} | 169 | +} |
201 | + | 170 | + |
202 | +static void free_tdx_memory(void) | 171 | +static void free_tdx_memlist(struct list_head *tmb_list) |
203 | +{ | 172 | +{ |
204 | + while (!list_empty(&tdx_memlist)) { | 173 | + /* @tmb_list is protected by mem_hotplug_lock */ |
205 | + struct tdx_memblock *tmb = list_first_entry(&tdx_memlist, | 174 | + while (!list_empty(tmb_list)) { |
175 | + struct tdx_memblock *tmb = list_first_entry(tmb_list, | ||
206 | + struct tdx_memblock, list); | 176 | + struct tdx_memblock, list); |
207 | + | 177 | + |
208 | + list_del(&tmb->list); | 178 | + list_del(&tmb->list); |
209 | + kfree(tmb); | 179 | + kfree(tmb); |
210 | + } | 180 | + } |
211 | +} | 181 | +} |
212 | + | 182 | + |
213 | +/* | 183 | +/* |
214 | + * Add all memblock memory regions to the @tdx_memlist as TDX memory. | 184 | + * Ensure that all memblock memory regions are convertible to TDX |
215 | + * Must be called when get_online_mems() is called by the caller. | 185 | + * memory. Once this has been established, stash the memblock |
186 | + * ranges off in a secondary structure because memblock is modified | ||
187 | + * in memory hotplug while TDX memory regions are fixed. | ||
216 | + */ | 188 | + */ |
217 | +static int build_tdx_memory(void) | 189 | +static int build_tdx_memlist(struct list_head *tmb_list) |
218 | +{ | 190 | +{ |
219 | + unsigned long start_pfn, end_pfn; | 191 | + unsigned long start_pfn, end_pfn; |
220 | + int i, nid, ret; | 192 | + int i, ret; |
221 | + | 193 | + |
222 | + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { | 194 | + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { |
223 | + /* | 195 | + /* |
224 | + * The first 1MB may not be reported as TDX convertible | 196 | + * The first 1MB is not reported as TDX convertible memory. |
225 | + * memory. Manually exclude them as TDX memory. | 197 | + * Although the first 1MB is always reserved and won't end up |
226 | + * | 198 | + * to the page allocator, it is still in memblock's memory |
227 | + * This is fine as the first 1MB is already reserved in | 199 | + * regions. Skip them manually to exclude them as TDX memory. |
228 | + * reserve_real_mode() and won't end up to ZONE_DMA as | ||
229 | + * free page anyway. | ||
230 | + */ | 200 | + */ |
231 | + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT); | 201 | + start_pfn = max(start_pfn, PHYS_PFN(SZ_1M)); |
232 | + if (start_pfn >= end_pfn) | 202 | + if (start_pfn >= end_pfn) |
233 | + continue; | 203 | + continue; |
234 | + | ||
235 | + /* Verify memory is truly TDX convertible memory */ | ||
236 | + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) { | ||
237 | + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n", | |
238 | + start_pfn << PAGE_SHIFT, | ||
239 | + end_pfn << PAGE_SHIFT); | ||
240 | + return -EINVAL; | ||
241 | + } | ||
242 | + | 204 | + |
243 | + /* | 205 | + /* |
244 | + * Add the memory regions as TDX memory. The regions in | 206 | + * Add the memory regions as TDX memory. The regions in |
245 | + * memblock has already guaranteed they are in address | 207 | + * memblock has already guaranteed they are in address |
246 | + * ascending order and don't overlap. | 208 | + * ascending order and don't overlap. |
247 | + */ | 209 | + */ |
248 | + ret = add_tdx_memblock(start_pfn, end_pfn, nid); | 210 | + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); |
249 | + if (ret) | 211 | + if (ret) |
250 | + goto err; | 212 | + goto err; |
251 | + } | 213 | + } |
252 | + | 214 | + |
253 | + return 0; | 215 | + return 0; |
254 | +err: | 216 | +err: |
255 | + free_tdx_memory(); | 217 | + free_tdx_memlist(tmb_list); |
256 | + return ret; | 218 | + return ret; |
257 | +} | 219 | +} |
258 | + | 220 | + |
259 | /* | 221 | static int init_tdx_module(void) |
260 | * Detect and initialize the TDX module. | 222 | { |
261 | * | 223 | struct tdsysinfo_struct *tdsysinfo; |
262 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 224 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
263 | if (ret) | 225 | if (ret) |
264 | goto out; | 226 | goto out; |
265 | 227 | ||
266 | + /* | 228 | + /* |
267 | + * All memory regions that can be used by the TDX module must be | 229 | + * To keep things simple, assume that all TDX-protected memory |
268 | + * passed to the TDX module during the module initialization. | 230 | + * will come from the page allocator. Make sure all pages in the |
269 | + * Once this is done, all "TDX-usable" memory regions are fixed | 231 | + * page allocator are TDX-usable memory. |
270 | + * during module's runtime. | ||
271 | + * | 232 | + * |
272 | + * The initial support of TDX guests only allocates memory from | 233 | + * Build the list of "TDX-usable" memory regions which cover all |
273 | + * the global page allocator. To keep things simple, for now | 234 | + * pages in the page allocator to guarantee that. Do it while |
274 | + * just make sure all pages in the page allocator are TDX memory. | 235 | + * holding mem_hotplug_lock read-lock as the memory hotplug code |
275 | + * | 236 | + * path reads the @tdx_memlist to reject any new memory. |
276 | + * To achieve this, use all system memory in the core-mm at the | ||
277 | + * time of initializing the TDX module as TDX memory, and in the | ||
278 | + * meantime, reject any new memory in memory hot-add. | ||
279 | + * | ||
280 | + * This works because, in practice, all boot-time present DIMMs | ||
281 | + * are TDX convertible memory. However, if any new memory is | ||
282 | + * hot-added before initializing the TDX module, the initialization | ||
283 | + * will fail because that memory is not covered by a CMR. | ||
284 | + * | ||
285 | + * This can be enhanced in the future, i.e. by allowing adding or | ||
286 | + * onlining non-TDX memory to a separate node, in which case the | ||
287 | + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist | ||
288 | + * together -- the userspace/kernel just needs to make sure pages | ||
289 | + * for TDX guests must come from those "TDX-capable" nodes. | ||
290 | + * | ||
291 | + * Build the list of TDX memory regions as mentioned above so | ||
292 | + * they can be passed to the TDX module later. | ||
293 | + */ | 237 | + */ |
294 | + get_online_mems(); | 238 | + get_online_mems(); |
295 | + | 239 | + |
296 | + ret = build_tdx_memory(); | 240 | + ret = build_tdx_memlist(&tdx_memlist); |
297 | + if (ret) | 241 | + if (ret) |
298 | + goto out; | 242 | + goto out_put_tdxmem; |
243 | + | ||
299 | /* | 244 | /* |
300 | * Return -EINVAL until all steps of TDX module initialization | 245 | * TODO: |
301 | * process are done. | 246 | * |
247 | - * - Build the list of TDX-usable memory regions. | ||
248 | * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
249 | * all TDX-usable memory regions. | ||
250 | * - Configure the TDMRs and the global KeyID to the TDX module. | ||
251 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
252 | * Return error before all steps are done. | ||
302 | */ | 253 | */ |
303 | ret = -EINVAL; | 254 | ret = -EINVAL; |
255 | +out_put_tdxmem: | ||
256 | + /* | ||
257 | + * @tdx_memlist is written here and read at memory hotplug time. | ||
258 | + * Lock out memory hotplug code while building it. | ||
259 | + */ | ||
260 | + put_online_mems(); | ||
304 | out: | 261 | out: |
305 | + /* | 262 | /* |
306 | + * Memory hotplug checks the hot-added memory region against the | 263 | * For now both @sysinfo and @cmr_array are only used during |
307 | + * @tdx_memlist to see if the region is TDX memory. | 264 | @@ -XXX,XX +XXX,XX @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start, |
308 | + * | 265 | return 0; |
309 | + * Do put_online_mems() here to make sure any modification to | ||
310 | + * @tdx_memlist is done while holding the memory hotplug read | ||
311 | + * lock, so that the memory hotplug path can just check the | ||
312 | + * @tdx_memlist w/o holding the @tdx_module_lock which may cause | ||
313 | + * deadlock. | ||
314 | + */ | ||
315 | + put_online_mems(); | ||
316 | return ret; | ||
317 | } | 266 | } |
318 | 267 | ||
319 | @@ -XXX,XX +XXX,XX @@ int tdx_enable(void) | 268 | +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn) |
320 | return ret; | ||
321 | } | ||
322 | EXPORT_SYMBOL_GPL(tdx_enable); | ||
323 | + | ||
324 | +/* | ||
325 | + * Check whether the given range is TDX memory. Must be called between | ||
326 | + * mem_hotplug_begin()/mem_hotplug_done(). | ||
327 | + */ | ||
328 | +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn) | ||
329 | +{ | 269 | +{ |
330 | + struct tdx_memblock *tmb; | 270 | + struct tdx_memblock *tmb; |
331 | + | 271 | + |
332 | + /* Empty list means TDX isn't enabled successfully */ | 272 | + /* |
333 | + if (list_empty(&tdx_memlist)) | 273 | + * This check assumes that the start_pfn<->end_pfn range does not |
334 | + return true; | 274 | + * cross multiple @tdx_memlist entries. A single memory online |
335 | + | 275 | + * event across multiple memblocks (from which @tdx_memlist |
276 | + * entries are derived at the time of module initialization) is | ||
277 | + * not possible. This is because memory offline/online is done | ||
278 | + * on granularity of 'struct memory_block', and the hotpluggable | ||
279 | + * memory region (one memblock) must be a multiple of the memory_block size. | ||
280 | + */ | ||
336 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 281 | + list_for_each_entry(tmb, &tdx_memlist, list) { |
337 | + /* | ||
338 | + * The new range is TDX memory if it is fully covered | ||
339 | + * by any TDX memory block. | ||
340 | + */ | ||
341 | + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn) | 282 | + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn) |
342 | + return true; | 283 | + return true; |
343 | + } | 284 | + } |
344 | + return false; | 285 | + return false; |
345 | +} | 286 | +} |
287 | + | ||
288 | +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action, | ||
289 | + void *v) | ||
290 | +{ | ||
291 | + struct memory_notify *mn = v; | ||
292 | + | ||
293 | + if (action != MEM_GOING_ONLINE) | ||
294 | + return NOTIFY_OK; | ||
295 | + | ||
296 | + /* | ||
297 | + * Empty list means TDX isn't enabled. Allow any memory | ||
298 | + * to go online. | ||
299 | + */ | ||
300 | + if (list_empty(&tdx_memlist)) | ||
301 | + return NOTIFY_OK; | ||
302 | + | ||
303 | + /* | ||
304 | + * The TDX memory configuration is static and can not be | ||
305 | + * changed. Reject onlining any memory which is outside of | ||
306 | + * the static configuration whether it supports TDX or not. | ||
307 | + */ | ||
308 | + if (is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages)) | ||
309 | + return NOTIFY_OK; | ||
310 | + | ||
311 | + return NOTIFY_BAD; | ||
312 | +} | ||
313 | + | ||
314 | +static struct notifier_block tdx_memory_nb = { | ||
315 | + .notifier_call = tdx_memory_notifier, | ||
316 | +}; | ||
317 | + | ||
318 | static int __init tdx_init(void) | ||
319 | { | ||
320 | u32 tdx_keyid_start, nr_tdx_keyids; | ||
321 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | ||
322 | return -ENODEV; | ||
323 | } | ||
324 | |||
325 | + err = register_memory_notifier(&tdx_memory_nb); | ||
326 | + if (err) { | ||
327 | + pr_err("initialization failed: register_memory_notifier() failed (%d)\n", | ||
328 | + err); | ||
329 | + return -ENODEV; | ||
330 | + } | ||
331 | + | ||
332 | /* | ||
333 | * Just use the first TDX KeyID as the 'global KeyID' and | ||
334 | * leave the rest for TDX guests. | ||
335 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
336 | index XXXXXXX..XXXXXXX 100644 | ||
337 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
338 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
339 | @@ -XXX,XX +XXX,XX @@ enum tdx_module_status_t { | ||
340 | TDX_MODULE_ERROR | ||
341 | }; | ||
342 | |||
343 | +struct tdx_memblock { | ||
344 | + struct list_head list; | ||
345 | + unsigned long start_pfn; | ||
346 | + unsigned long end_pfn; | ||
347 | +}; | ||
348 | + | ||
349 | #endif | ||
346 | -- | 350 | -- |
347 | 2.38.1 | 351 | 2.41.0 | diff view generated by jsdifflib |
1 | After the kernel selects all TDX-usable memory regions, the kernel needs | ||
---|---|---|---|
2 | to pass those regions to the TDX module via data structure "TD Memory | ||
3 | Region" (TDMR). | ||
4 | |||
5 | Add a placeholder to construct a list of TDMRs (in multiple steps) to | ||
6 | cover all TDX-usable memory regions. | ||
7 | |||
8 | === Long Version === | ||
9 | |||
1 | TDX provides increased levels of memory confidentiality and integrity. | 10 | TDX provides increased levels of memory confidentiality and integrity. |
2 | This requires special hardware support for features like memory | 11 | This requires special hardware support for features like memory |
3 | encryption and storage of memory integrity checksums. Not all memory | 12 | encryption and storage of memory integrity checksums. Not all memory |
4 | satisfies these requirements. | 13 | satisfies these requirements. |
5 | 14 | ||
6 | As a result, the TDX introduced the concept of a "Convertible Memory | 15 | As a result, TDX introduced the concept of a "Convertible Memory Region" |
7 | Region" (CMR). During boot, the firmware builds a list of all of the | 16 | (CMR). During boot, the firmware builds a list of all of the memory |
8 | memory ranges which can provide the TDX security guarantees. The list | 17 | ranges which can provide the TDX security guarantees. The list of these |
9 | of these ranges is available to the kernel by querying the TDX module. | 18 | ranges is available to the kernel by querying the TDX module. |
10 | 19 | ||
11 | The TDX architecture needs additional metadata to record things like | 20 | The TDX architecture needs additional metadata to record things like |
12 | which TD guest "owns" a given page of memory. This metadata essentially | 21 | which TD guest "owns" a given page of memory. This metadata essentially |
13 | serves as the 'struct page' for the TDX module. The space for this | 22 | serves as the 'struct page' for the TDX module. The space for this |
14 | metadata is not reserved by the hardware up front and must be allocated | 23 | metadata is not reserved by the hardware up front and must be allocated |
... | ... | ||
36 | CMR - Firmware-enumerated physical ranges that support TDX. CMRs are | 45 | CMR - Firmware-enumerated physical ranges that support TDX. CMRs are |
37 | 4K aligned. | 46 | 4K aligned. |
38 | TDMR - Physical address range which is chosen by the kernel to support | 47 | TDMR - Physical address range which is chosen by the kernel to support |
39 | TDX. 1G granularity and alignment required. Each TDMR has | 48 | TDX. 1G granularity and alignment required. Each TDMR has |
40 | reserved areas where TDX memory holes and overlapping PAMTs can | 49 | reserved areas where TDX memory holes and overlapping PAMTs can |
41 | be put into. | 50 | be represented. |
42 | PAMT - Physically contiguous TDX metadata. One table for each page size | 51 | PAMT - Physically contiguous TDX metadata. One table for each page size |
43 | per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G | 52 | per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G |
44 | PAMT. | 53 | PAMT. |
45 | 54 | ||
46 | As one step of initializing the TDX module, the kernel configures | 55 | As one step of initializing the TDX module, the kernel configures |
47 | TDX-usable memory regions by passing an array of TDMRs to the TDX module. | 56 | TDX-usable memory regions by passing a list of TDMRs to the TDX module. |
48 | 57 | ||
49 | Constructing the array of TDMRs consists of the following steps: | 58 | Constructing the list of TDMRs consists of the following steps: |
50 | 59 | ||
51 | 1) Create TDMRs to cover all memory regions that the TDX module can use; | 60 | 1) Fill out TDMRs to cover all memory regions that the TDX module will |
52 | 2) Allocate and set up PAMT for each TDMR; | 61 | use for TD memory. |
53 | 3) Set up reserved areas for each TDMR. | 62 | 2) Allocate and set up PAMT for each TDMR. |
54 | 63 | 3) Designate reserved areas for each TDMR. | |
55 | Add a placeholder to construct TDMRs to do the above steps after all | 64 | |
56 | TDX memory regions are verified to be truly convertible. Always free | 65 | Add a placeholder to construct TDMRs to do the above steps. To keep |
57 | TDMRs at the end of the initialization (whether successful or not) | 66 | things simple, just allocate enough space to hold the maximum number of |
58 | as TDMRs are only used during the initialization. | 67 | TDMRs up front. Always free the buffer of TDMRs since they are only |
59 | 68 | used during module initialization. | |
69 | |||
70 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
60 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 71 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
61 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 72 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> |
73 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
62 | --- | 74 | --- |
75 | |||
76 | v13 -> v14: | ||
77 | - No change. | ||
78 | |||
79 | v12 -> v13: | ||
80 | - No change. | ||
81 | |||
82 | v11 -> v12: | ||
83 | - Added tags from Dave/Kirill. | ||
84 | |||
85 | v10 -> v11: | ||
86 | - Changed to keep TDMRs after module initialization to deal with TDX | ||
87 | erratum in future patches. | ||
88 | |||
89 | v9 -> v10: | ||
90 | - Changed the TDMR list from static variable back to local variable as | ||
91 | now TDX module isn't disabled when tdx_cpu_enable() fails. | ||
92 | |||
93 | v8 -> v9: | ||
94 | - Changes around 'struct tdmr_info_list' (Dave): | ||
95 | - Moved the declaration from tdx.c to tdx.h. | ||
96 | - Renamed 'first_tdmr' to 'tdmrs'. | ||
97 | - 'nr_tdmrs' -> 'nr_consumed_tdmrs'. | ||
98 | - Changed 'tdmrs' to 'void *'. | ||
99 | - Improved comments for all structure members. | ||
100 | - Added a missing empty line in alloc_tdmr_list() (Dave). | ||
101 | |||
102 | v7 -> v8: | ||
103 | - Improved changelog to tell this is one step of "TODO list" in | ||
104 | init_tdx_module(). | ||
105 | - Other changelog improvement suggested by Dave (with "Create TDMRs" to | ||
106 | "Fill out TDMRs" to align with the code). | ||
107 | - Added a "TODO list" comment to lay out the steps to construct TDMRs, | ||
108 | following the same idea of "TODO list" in tdx_module_init(). | ||
109 | - Introduced 'struct tdmr_info_list' (Dave) | ||
110 | - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to | ||
111 | simplify getting TDMR by given index, and reduce passing arguments | ||
112 | around functions. | ||
113 | - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally | ||
114 | uses tdmr_size_single() (Dave). | ||
115 | - tdmr_num -> nr_tdmrs (Dave). | ||
63 | 116 | ||
64 | v6 -> v7: | 117 | v6 -> v7: |
65 | - Improved commit message to explain 'int' overflow cannot happen | 118 | - Improved commit message to explain 'int' overflow cannot happen |
66 | in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave. | 119 | in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave. |
67 | 120 | ||
68 | v5 -> v6: | 121 | ... |
69 | - construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is | ||
70 | used instead of memblock. | ||
71 | - Added Isaku's Reviewed-by. | ||
72 | |||
73 | - v3 -> v5 (no feedback on v4): | ||
74 | - Moved calculating TDMR size to this patch. | ||
75 | - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs | ||
76 | once, instead of allocating each TDMR individually. | ||
77 | - Removed "crypto protection" in the changelog. | ||
78 | - -EFAULT -> -EINVAL in couple of places. | ||
79 | |||
80 | 122 | ||
81 | --- | 123 | --- |
82 | arch/x86/virt/vmx/tdx/tdx.c | 83 +++++++++++++++++++++++++++++++++++++ | 124 | arch/x86/virt/vmx/tdx/tdx.c | 97 ++++++++++++++++++++++++++++++++++++- |
83 | arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++++ | 125 | arch/x86/virt/vmx/tdx/tdx.h | 32 ++++++++++++ |
84 | 2 files changed, 106 insertions(+) | 126 | 2 files changed, 127 insertions(+), 2 deletions(-) |
85 | 127 | ||
86 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 128 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
87 | index XXXXXXX..XXXXXXX 100644 | 129 | index XXXXXXX..XXXXXXX 100644 |
88 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 130 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
89 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 131 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
90 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memory(void) | 132 | @@ -XXX,XX +XXX,XX @@ |
133 | #include <linux/minmax.h> | ||
134 | #include <linux/sizes.h> | ||
135 | #include <linux/pfn.h> | ||
136 | +#include <linux/align.h> | ||
137 | #include <asm/msr-index.h> | ||
138 | #include <asm/msr.h> | ||
139 | #include <asm/page.h> | ||
140 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memlist(struct list_head *tmb_list) | ||
91 | return ret; | 141 | return ret; |
92 | } | 142 | } |
93 | 143 | ||
94 | +/* Calculate the actual TDMR_INFO size */ | 144 | +/* Calculate the actual TDMR size */ |
95 | +static inline int cal_tdmr_size(void) | 145 | +static int tdmr_size_single(u16 max_reserved_per_tdmr) |
96 | +{ | 146 | +{ |
97 | + int tdmr_sz; | 147 | + int tdmr_sz; |
98 | + | 148 | + |
99 | + /* | 149 | + /* |
100 | + * The actual size of TDMR_INFO depends on the maximum number | 150 | + * The actual size of TDMR depends on the maximum |
101 | + * of reserved areas. | 151 | + * number of reserved areas. |
102 | + * | ||
103 | + * Note: for TDX1.0 the max_reserved_per_tdmr is 16, and | ||
104 | + * TDMR_INFO size is aligned up to 512-byte. Even if it is | ||
105 | + * extended in the future, it would be insane if TDMR_INFO | ||
106 | + * becomes larger than 4K. The tdmr_sz here should never | ||
107 | + * overflow. | ||
108 | + */ | 152 | + */ |
109 | + tdmr_sz = sizeof(struct tdmr_info); | 153 | + tdmr_sz = sizeof(struct tdmr_info); |
110 | + tdmr_sz += sizeof(struct tdmr_reserved_area) * | 154 | + tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr; |
111 | + tdx_sysinfo.max_reserved_per_tdmr; | 155 | + |
112 | + | ||
113 | + /* | ||
114 | + * TDX requires each TDMR_INFO to be 512-byte aligned. Always | ||
115 | + * round up TDMR_INFO size to the 512-byte boundary. | ||
116 | + */ | ||
117 | + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT); | 156 | + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT); |
118 | +} | 157 | +} |
119 | + | 158 | + |
120 | +static struct tdmr_info *alloc_tdmr_array(int *array_sz) | 159 | +static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list, |
160 | + struct tdsysinfo_struct *sysinfo) | ||
121 | +{ | 161 | +{ |
122 | + /* | 162 | + size_t tdmr_sz, tdmr_array_sz; |
123 | + * TDX requires each TDMR_INFO to be 512-byte aligned. | 163 | + void *tdmr_array; |
124 | + * Use alloc_pages_exact() to allocate all TDMRs at once. | 164 | + |
125 | + * Each TDMR_INFO will still be 512-byte aligned since | 165 | + tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr); |
126 | + * cal_tdmr_size() always returns 512-byte aligned size. | 166 | + tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs; |
127 | + */ | 167 | + |
128 | + *array_sz = cal_tdmr_size() * tdx_sysinfo.max_tdmrs; | 168 | + /* |
129 | + | 169 | + * To keep things simple, allocate all TDMRs together. |
130 | + /* | 170 | + * The buffer needs to be physically contiguous to make |
131 | + * Zero the buffer so 'struct tdmr_info::size' can be | 171 | + * sure each TDMR is physically contiguous. |
132 | + * used to determine whether a TDMR is valid. | 172 | + */ |
173 | + tdmr_array = alloc_pages_exact(tdmr_array_sz, | ||
174 | + GFP_KERNEL | __GFP_ZERO); | ||
175 | + if (!tdmr_array) | ||
176 | + return -ENOMEM; | ||
177 | + | ||
178 | + tdmr_list->tdmrs = tdmr_array; | ||
179 | + | ||
180 | + /* | ||
181 | + * Keep the size of TDMR to find the target TDMR | ||
182 | + * at a given index in the TDMR list. | ||
183 | + */ | ||
184 | + tdmr_list->tdmr_sz = tdmr_sz; | ||
185 | + tdmr_list->max_tdmrs = sysinfo->max_tdmrs; | ||
186 | + tdmr_list->nr_consumed_tdmrs = 0; | ||
187 | + | ||
188 | + return 0; | ||
189 | +} | ||
190 | + | ||
191 | +static void free_tdmr_list(struct tdmr_info_list *tdmr_list) | ||
192 | +{ | ||
193 | + free_pages_exact(tdmr_list->tdmrs, | ||
194 | + tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); | ||
195 | +} | ||
196 | + | ||
197 | +/* | ||
198 | + * Construct a list of TDMRs on the preallocated space in @tdmr_list | ||
199 | + * to cover all TDX memory regions in @tmb_list based on the TDX module | ||
200 | + * information in @sysinfo. | ||
201 | + */ | ||
202 | +static int construct_tdmrs(struct list_head *tmb_list, | ||
203 | + struct tdmr_info_list *tdmr_list, | ||
204 | + struct tdsysinfo_struct *sysinfo) | ||
205 | +{ | ||
206 | + /* | ||
207 | + * TODO: | ||
133 | + * | 208 | + * |
134 | + * Note: for TDX1.0 the max_tdmrs is 64 and TDMR_INFO size | 209 | + * - Fill out TDMRs to cover all TDX memory regions. |
135 | + * is 512-byte. Even if they are extended in the future, it | 210 | + * - Allocate and set up PAMTs for each TDMR. |
136 | + * would be insane if the total size exceeds 4MB. | 211 | + * - Designate reserved areas for each TDMR. |
137 | + */ | 212 | + * |
138 | + return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO); | 213 | + * Return -EINVAL until constructing TDMRs is done |
139 | +} | 214 | + */ |
140 | + | ||
141 | +/* | ||
142 | + * Construct an array of TDMRs to cover all TDX memory ranges. | ||
143 | + * The actual number of TDMRs is kept to @tdmr_num. | ||
144 | + */ | ||
145 | +static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | ||
146 | +{ | ||
147 | + /* Return -EINVAL until constructing TDMRs is done */ | ||
148 | + return -EINVAL; | 215 | + return -EINVAL; |
149 | +} | 216 | +} |
150 | + | 217 | + |
151 | /* | ||
152 | * Detect and initialize the TDX module. | ||
153 | * | ||
154 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memory(void) | ||
155 | */ | ||
156 | static int init_tdx_module(void) | 218 | static int init_tdx_module(void) |
157 | { | 219 | { |
158 | + struct tdmr_info *tdmr_array; | 220 | struct tdsysinfo_struct *tdsysinfo; |
159 | + int tdmr_array_sz; | 221 | + struct tdmr_info_list tdmr_list; |
160 | + int tdmr_num; | 222 | struct cmr_info *cmr_array; |
161 | int ret; | 223 | int tdsysinfo_size; |
162 | 224 | int cmr_array_size; | |
163 | /* | ||
164 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 225 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
165 | ret = build_tdx_memory(); | ||
166 | if (ret) | 226 | if (ret) |
167 | goto out; | 227 | goto out_put_tdxmem; |
168 | + | 228 | |
169 | + /* Prepare enough space to construct TDMRs */ | 229 | + /* Allocate enough space for constructing TDMRs */ |
170 | + tdmr_array = alloc_tdmr_array(&tdmr_array_sz); | 230 | + ret = alloc_tdmr_list(&tdmr_list, tdsysinfo); |
171 | + if (!tdmr_array) { | 231 | + if (ret) |
172 | + ret = -ENOMEM; | 232 | + goto out_free_tdxmem; |
173 | + goto out_free_tdx_mem; | 233 | + |
174 | + } | 234 | + /* Cover all TDX-usable memory regions in TDMRs */ |
175 | + | 235 | + ret = construct_tdmrs(&tdx_memlist, &tdmr_list, tdsysinfo); |
176 | + /* Construct TDMRs to cover all TDX memory ranges */ | ||
177 | + ret = construct_tdmrs(tdmr_array, &tdmr_num); | ||
178 | + if (ret) | 236 | + if (ret) |
179 | + goto out_free_tdmrs; | 237 | + goto out_free_tdmrs; |
180 | + | 238 | + |
181 | /* | 239 | /* |
182 | * Return -EINVAL until all steps of TDX module initialization | 240 | * TODO: |
183 | * process are done. | 241 | * |
242 | - * - Construct a list of "TD Memory Regions" (TDMRs) to cover | ||
243 | - * all TDX-usable memory regions. | ||
244 | * - Configure the TDMRs and the global KeyID to the TDX module. | ||
245 | * - Configure the global KeyID on all packages. | ||
246 | * - Initialize all TDMRs. | ||
247 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
248 | * Return error before all steps are done. | ||
184 | */ | 249 | */ |
185 | ret = -EINVAL; | 250 | ret = -EINVAL; |
186 | +out_free_tdmrs: | 251 | +out_free_tdmrs: |
187 | + /* | 252 | + /* |
188 | + * The array of TDMRs is freed whether or not the initialization | 253 | + * Always free the buffer of TDMRs as they are only used during |
189 | + * succeeded. They are not needed anymore after the | ||
190 | + * module initialization. | 254 | + * module initialization. |
191 | + */ | 255 | + */ |
192 | + free_pages_exact(tdmr_array, tdmr_array_sz); | 256 | + free_tdmr_list(&tdmr_list); |
193 | +out_free_tdx_mem: | 257 | +out_free_tdxmem: |
194 | + if (ret) | 258 | + if (ret) |
195 | + free_tdx_memory(); | 259 | + free_tdx_memlist(&tdx_memlist); |
196 | out: | 260 | out_put_tdxmem: |
197 | /* | 261 | /* |
198 | * Memory hotplug checks the hot-added memory region against the | 262 | * @tdx_memlist is written here and read at memory hotplug time. |
199 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 263 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
200 | index XXXXXXX..XXXXXXX 100644 | 264 | index XXXXXXX..XXXXXXX 100644 |
201 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 265 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
202 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 266 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
203 | @@ -XXX,XX +XXX,XX @@ struct tdsysinfo_struct { | 267 | @@ -XXX,XX +XXX,XX @@ struct tdsysinfo_struct { |
204 | }; | 268 | DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); |
205 | } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT); | 269 | } __packed; |
206 | 270 | ||
207 | +struct tdmr_reserved_area { | 271 | +struct tdmr_reserved_area { |
208 | + u64 offset; | 272 | + u64 offset; |
209 | + u64 size; | 273 | + u64 size; |
210 | +} __packed; | 274 | +} __packed; |
... | ... | ||
222 | + u64 pamt_4k_size; | 286 | + u64 pamt_4k_size; |
223 | + /* | 287 | + /* |
224 | + * Actual number of reserved areas depends on | 288 | + * Actual number of reserved areas depends on |
225 | + * 'struct tdsysinfo_struct'::max_reserved_per_tdmr. | 289 | + * 'struct tdsysinfo_struct'::max_reserved_per_tdmr. |
226 | + */ | 290 | + */ |
227 | + struct tdmr_reserved_area reserved_areas[0]; | 291 | + DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas); |
228 | +} __packed __aligned(TDMR_INFO_ALIGNMENT); | 292 | +} __packed __aligned(TDMR_INFO_ALIGNMENT); |
229 | + | 293 | + |
230 | /* | 294 | /* |
231 | * Do not put any hardware-defined TDX structure representations below | 295 | * Do not put any hardware-defined TDX structure representations below |
232 | * this comment! | 296 | * this comment! |
297 | @@ -XXX,XX +XXX,XX @@ struct tdx_memblock { | ||
298 | unsigned long end_pfn; | ||
299 | }; | ||
300 | |||
301 | +struct tdmr_info_list { | ||
302 | + void *tdmrs; /* Flexible array to hold 'tdmr_info's */ | ||
303 | + int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */ | ||
304 | + | ||
305 | + /* Metadata for finding target 'tdmr_info' and freeing @tdmrs */ | ||
306 | + int tdmr_sz; /* Size of one 'tdmr_info' */ | ||
307 | + int max_tdmrs; /* How many 'tdmr_info's are allocated */ | ||
308 | +}; | ||
309 | + | ||
310 | #endif | ||
233 | -- | 311 | -- |
234 | 2.38.1 | 312 | 2.41.0 |
1 | The kernel configures TDX-usable memory regions by passing an array of | 1 | Start to work through the multiple steps needed to construct a list of "TD Memory |
---|---|---|---|
2 | "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the | 2 | Regions" (TDMRs) to cover all TDX-usable memory regions. |
3 | information of the base/size of a memory region, the base/size of the | 3 | |
4 | The kernel configures TDX-usable memory regions by passing a list of | ||
5 | TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains | ||
6 | the information of the base/size of a memory region, the base/size of the | ||
4 | associated Physical Address Metadata Table (PAMT) and a list of reserved | 7 | associated Physical Address Metadata Table (PAMT) and a list of reserved |
5 | areas in the region. | 8 | areas in the region. |
6 | 9 | ||
7 | Create a number of TDMRs to cover all TDX memory regions. To keep it | 10 | Do the first step to fill out a number of TDMRs to cover all TDX memory |
8 | simple, always try to create one TDMR for each memory region. As the | 11 | regions. To keep it simple, always try to use one TDMR for each memory |
9 | first step only set up the base/size for each TDMR. | 12 | region. As the first step only set up the base/size for each TDMR. |
10 | 13 | ||
11 | Each TDMR must be 1G aligned and the size must be in 1G granularity. | 14 | Each TDMR must be 1G aligned and the size must be in 1G granularity. |
12 | This implies that one TDMR could cover multiple memory regions. If a | 15 | This implies that one TDMR could cover multiple memory regions. If a |
13 | memory region spans a 1GB boundary and the first part is already | 16 | memory region spans a 1GB boundary and the first part is already |
14 | covered by the previous TDMR, just create a new TDMR for the remaining | 17 | covered by the previous TDMR, just use a new TDMR for the remaining |
15 | part. | 18 | part. |
16 | 19 | ||
17 | TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs | 20 | TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs |
18 | are consumed but there is more memory region to cover. | 21 | are consumed but there is more memory region to cover. |
19 | 22 | ||
23 | There are fancier things that could be done like trying to merge | ||
24 | adjacent TDMRs. This would allow more pathological memory layouts to be | ||
25 | supported. But, current systems are not even close to exhausting the | ||
26 | existing TDMR resources in practice. For now, keep it simple. | ||
27 | |||
20 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 28 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
29 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
30 | Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> | ||
31 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
21 | --- | 32 | --- |
22 | 33 | ||
23 | v6 -> v7: | 34 | v13 -> v14: |
35 | - No change | ||
36 | |||
37 | v12 -> v13: | ||
38 | - Added Yuan's tag. | ||
39 | |||
40 | v11 -> v12: | ||
41 | - Improved comments around looping over TDX memblock to create TDMRs. | ||
42 | (Dave). | ||
43 | - Added code to pr_warn() when consumed TDMRs reaching maximum TDMRs | ||
44 | (Dave). | ||
45 | - BIT_ULL(30) -> SZ_1G (Kirill) | ||
46 | - Removed unused TDMR_PFN_ALIGNMENT (Sathy) | ||
47 | - Added tags from Kirill/Sathy | ||
48 | |||
49 | v10 -> v11: | ||
50 | - No update | ||
51 | |||
52 | v9 -> v10: | ||
24 | - No change. | 53 | - No change. |
25 | 54 | ||
26 | v5 -> v6: | 55 | v8 -> v9: |
27 | - Rebase due to using 'tdx_memblock' instead of memblock. | 56 | |
28 | 57 | - Added the last paragraph in the changelog (Dave). | |
29 | - v3 -> v5 (no feedback on v4): | 58 | - Removed unnecessary type cast in tdmr_entry() (Dave). |
30 | - Removed allocating TDMR individually. | ||
31 | - Improved changelog by using Dave's words. | ||
32 | - Made TDMR_START() and TDMR_END() as static inline function. | ||
33 | 59 | ||
34 | --- | 60 | --- |
35 | arch/x86/virt/vmx/tdx/tdx.c | 104 +++++++++++++++++++++++++++++++++++- | 61 | arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++++- |
36 | 1 file changed, 103 insertions(+), 1 deletion(-) | 62 | arch/x86/virt/vmx/tdx/tdx.h | 3 ++ |
63 | 2 files changed, 105 insertions(+), 1 deletion(-) | ||
37 | 64 | ||
38 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 65 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
39 | index XXXXXXX..XXXXXXX 100644 | 66 | index XXXXXXX..XXXXXXX 100644 |
40 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 67 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
41 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 68 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
42 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memory(void) | 69 | @@ -XXX,XX +XXX,XX @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list) |
43 | return ret; | 70 | tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); |
44 | } | 71 | } |
45 | 72 | ||
46 | +/* TDMR must be 1gb aligned */ | 73 | +/* Get the TDMR from the list at the given index. */ |
47 | +#define TDMR_ALIGNMENT BIT_ULL(30) | 74 | +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, |
48 | +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT) | 75 | + int idx) |
49 | + | 76 | +{ |
50 | +/* Align up and down the address to TDMR boundary */ | 77 | + int tdmr_info_offset = tdmr_list->tdmr_sz * idx; |
78 | + | ||
79 | + return (void *)tdmr_list->tdmrs + tdmr_info_offset; | ||
80 | +} | ||
81 | + | ||
82 | +#define TDMR_ALIGNMENT SZ_1G | ||
51 | +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT) | 83 | +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT) |
52 | +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT) | 84 | +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT) |
53 | + | ||
54 | +static inline u64 tdmr_start(struct tdmr_info *tdmr) | ||
55 | +{ | ||
56 | + return tdmr->base; | ||
57 | +} | ||
58 | + | 85 | + |
59 | +static inline u64 tdmr_end(struct tdmr_info *tdmr) | 86 | +static inline u64 tdmr_end(struct tdmr_info *tdmr) |
60 | +{ | 87 | +{ |
61 | + return tdmr->base + tdmr->size; | 88 | + return tdmr->base + tdmr->size; |
62 | +} | 89 | +} |
63 | + | 90 | + |
64 | /* Calculate the actual TDMR_INFO size */ | ||
65 | static inline int cal_tdmr_size(void) | ||
66 | { | ||
67 | @@ -XXX,XX +XXX,XX @@ static struct tdmr_info *alloc_tdmr_array(int *array_sz) | ||
68 | return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO); | ||
69 | } | ||
70 | |||
71 | +static struct tdmr_info *tdmr_array_entry(struct tdmr_info *tdmr_array, | ||
72 | + int idx) | ||
73 | +{ | ||
74 | + return (struct tdmr_info *)((unsigned long)tdmr_array + | ||
75 | + cal_tdmr_size() * idx); | ||
76 | +} | ||
77 | + | ||
78 | +/* | 91 | +/* |
79 | + * Create TDMRs to cover all TDX memory regions. The actual number | 92 | + * Take the memory referenced in @tmb_list and populate the |
80 | + * of TDMRs is set to @tdmr_num. | 93 | + * preallocated @tdmr_list, following all the special alignment |
94 | + * and size rules for TDMR. | ||
81 | + */ | 95 | + */ |
82 | +static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 96 | +static int fill_out_tdmrs(struct list_head *tmb_list, |
97 | + struct tdmr_info_list *tdmr_list) | ||
83 | +{ | 98 | +{ |
84 | + struct tdx_memblock *tmb; | 99 | + struct tdx_memblock *tmb; |
85 | + int tdmr_idx = 0; | 100 | + int tdmr_idx = 0; |
86 | + | 101 | + |
87 | + /* | 102 | + /* |
88 | + * Loop over TDX memory regions and create TDMRs to cover them. | 103 | + * Loop over TDX memory regions and fill out TDMRs to cover them. |
89 | + * To keep it simple, always try to use one TDMR to cover | 104 | + * To keep it simple, always try to use one TDMR to cover one |
90 | + * one memory region. | 105 | + * memory region. |
106 | + * | ||
107 | + * In practice TDX supports at least 64 TDMRs. A 2-socket system | ||
108 | + * typically only consumes less than 10 of those. This code is | ||
109 | + * dumb and simple and may use more TDMRs than is strictly | ||
110 | + * required. | ||
91 | + */ | 111 | + */ |
92 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 112 | + list_for_each_entry(tmb, tmb_list, list) { |
93 | + struct tdmr_info *tdmr; | 113 | + struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx); |
94 | + u64 start, end; | 114 | + u64 start, end; |
95 | + | 115 | + |
96 | + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx); | 116 | + start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn)); |
97 | + start = TDMR_ALIGN_DOWN(tmb->start_pfn << PAGE_SHIFT); | 117 | + end = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn)); |
98 | + end = TDMR_ALIGN_UP(tmb->end_pfn << PAGE_SHIFT); | ||
99 | + | 118 | + |
100 | + /* | 119 | + /* |
101 | + * If the current TDMR's size hasn't been initialized, | 120 | + * A valid size indicates the current TDMR has already |
102 | + * it is a new TDMR to cover the new memory region. | 121 | + * been filled out to cover the previous memory region(s). |
103 | + * Otherwise, the current TDMR has already covered the | ||
104 | + * previous memory region. In the latter case, check | ||
105 | + * whether the current memory region has been fully or | ||
106 | + * partially covered by the current TDMR, since TDMR is | ||
107 | + * 1G aligned. | ||
108 | + */ | 122 | + */ |
109 | + if (tdmr->size) { | 123 | + if (tdmr->size) { |
110 | + /* | 124 | + /* |
111 | + * Loop to the next memory region if the current | 125 | + * Loop to the next if the current memory region |
112 | + * block has already been fully covered by the | 126 | + * has already been fully covered. |
113 | + * current TDMR. | ||
114 | + */ | 127 | + */ |
115 | + if (end <= tdmr_end(tdmr)) | 128 | + if (end <= tdmr_end(tdmr)) |
116 | + continue; | 129 | + continue; |
117 | + | 130 | + |
118 | + /* | 131 | + /* Otherwise, skip the already covered part. */ |
119 | + * If part of the current memory region has | ||
120 | + * already been covered by the current TDMR, | ||
121 | + * skip the already covered part. | ||
122 | + */ | ||
123 | + if (start < tdmr_end(tdmr)) | 132 | + if (start < tdmr_end(tdmr)) |
124 | + start = tdmr_end(tdmr); | 133 | + start = tdmr_end(tdmr); |
125 | + | 134 | + |
126 | + /* | 135 | + /* |
127 | + * Create a new TDMR to cover the current memory | 136 | + * Create a new TDMR to cover the current memory |
128 | + * region, or the remaining part of it. | 137 | + * region, or the remaining part of it. |
129 | + */ | 138 | + */ |
130 | + tdmr_idx++; | 139 | + tdmr_idx++; |
131 | + if (tdmr_idx >= tdx_sysinfo.max_tdmrs) | 140 | + if (tdmr_idx >= tdmr_list->max_tdmrs) { |
132 | + return -E2BIG; | 141 | + pr_warn("initialization failed: TDMRs exhausted.\n"); |
133 | + | 142 | + return -ENOSPC; |
134 | + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx); | 143 | + } |
144 | + | ||
145 | + tdmr = tdmr_entry(tdmr_list, tdmr_idx); | ||
135 | + } | 146 | + } |
136 | + | 147 | + |
137 | + tdmr->base = start; | 148 | + tdmr->base = start; |
138 | + tdmr->size = end - start; | 149 | + tdmr->size = end - start; |
139 | + } | 150 | + } |
140 | + | 151 | + |
141 | + /* @tdmr_idx is always the index of last valid TDMR. */ | 152 | + /* @tdmr_idx is always the index of the last valid TDMR. */ |
142 | + *tdmr_num = tdmr_idx + 1; | 153 | + tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1; |
154 | + | ||
155 | + /* | ||
156 | + * Warn early that the kernel is about to run out of TDMRs. | ||
157 | + * | ||
158 | + * This is an indication that TDMR allocation has to be | ||
159 | + * reworked to be smarter to not run into an issue. | ||
160 | + */ | ||
161 | + if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN) | ||
162 | + pr_warn("consumed TDMRs reaching limit: %d used out of %d\n", | ||
163 | + tdmr_list->nr_consumed_tdmrs, | ||
164 | + tdmr_list->max_tdmrs); | ||
143 | + | 165 | + |
144 | + return 0; | 166 | + return 0; |
145 | +} | 167 | +} |
146 | + | 168 | + |
147 | /* | 169 | /* |
148 | * Construct an array of TDMRs to cover all TDX memory ranges. | 170 | * Construct a list of TDMRs on the preallocated space in @tdmr_list |
149 | * The actual number of TDMRs is kept to @tdmr_num. | 171 | * to cover all TDX memory regions in @tmb_list based on the TDX module |
150 | */ | 172 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, |
151 | static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 173 | struct tdmr_info_list *tdmr_list, |
174 | struct tdsysinfo_struct *sysinfo) | ||
152 | { | 175 | { |
153 | + int ret; | 176 | + int ret; |
154 | + | 177 | + |
155 | + ret = create_tdmrs(tdmr_array, tdmr_num); | 178 | + ret = fill_out_tdmrs(tmb_list, tdmr_list); |
156 | + if (ret) | 179 | + if (ret) |
157 | + goto err; | 180 | + return ret; |
158 | + | 181 | + |
159 | /* Return -EINVAL until constructing TDMRs is done */ | 182 | /* |
160 | - return -EINVAL; | 183 | * TODO: |
161 | + ret = -EINVAL; | 184 | * |
162 | +err: | 185 | - * - Fill out TDMRs to cover all TDX memory regions. |
163 | + return ret; | 186 | * - Allocate and set up PAMTs for each TDMR. |
164 | } | 187 | * - Designate reserved areas for each TDMR. |
165 | 188 | * | |
166 | /* | 189 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
190 | index XXXXXXX..XXXXXXX 100644 | ||
191 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
192 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
193 | @@ -XXX,XX +XXX,XX @@ struct tdx_memblock { | ||
194 | unsigned long end_pfn; | ||
195 | }; | ||
196 | |||
197 | +/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */ | ||
198 | +#define TDMR_NR_WARN 4 | ||
199 | + | ||
200 | struct tdmr_info_list { | ||
201 | void *tdmrs; /* Flexible array to hold 'tdmr_info's */ | ||
202 | int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */ | ||
167 | -- | 203 | -- |
168 | 2.38.1 | 204 | 2.41.0 |
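The fill_out_tdmrs() algorithm in the patch above can be modeled in isolation. The sketch below is a simplified userspace approximation, not the kernel code: the struct and helper names mirror the patch but the types are cut down, and it assumes the input regions are sorted in ascending address order and non-overlapping (which the kernel's memblock-derived list guarantees). Each region is widened to 1G-aligned boundaries, and a new TDMR is started only when the current one cannot absorb the widened range.

```c
#include <assert.h>
#include <stdint.h>

#define TDMR_ALIGNMENT (1ULL << 30)                      /* TDMRs are 1G aligned */
#define TDMR_ALIGN_DOWN(x) ((x) & ~(TDMR_ALIGNMENT - 1))
#define TDMR_ALIGN_UP(x)   TDMR_ALIGN_DOWN((x) + TDMR_ALIGNMENT - 1)

struct region { uint64_t start, end; };  /* [start, end) physical range */
struct tdmr   { uint64_t base, size; };

/*
 * Cover each region with 1G-aligned TDMRs.  Regions must be sorted in
 * ascending address order and must not overlap; @t must be zeroed by
 * the caller.  Returns the number of TDMRs consumed, or -1 when
 * @max_tdmrs is exhausted.
 */
int fill_out_tdmrs(const struct region *r, int nr, struct tdmr *t, int max_tdmrs)
{
    int idx = 0;

    for (int i = 0; i < nr; i++) {
        uint64_t start = TDMR_ALIGN_DOWN(r[i].start);
        uint64_t end   = TDMR_ALIGN_UP(r[i].end);

        if (t[idx].size) {
            uint64_t cur_end = t[idx].base + t[idx].size;

            /* Region already fully covered by the current TDMR. */
            if (end <= cur_end)
                continue;
            /* Skip the part the current TDMR already covers. */
            if (start < cur_end)
                start = cur_end;
            /* Start a new TDMR for the remaining part. */
            if (++idx >= max_tdmrs)
                return -1;
        }
        t[idx].base = start;
        t[idx].size = end - start;
    }
    return idx + 1;
}
```

Two regions landing in the same aligned gigabyte share one TDMR, while regions in different gigabytes each get their own, which is why a typical 2-socket system stays well under the module's minimum of 64 TDMRs.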
1 | The TDX module uses additional metadata to record things like which | 1 | The TDX module uses additional metadata to record things like which |
---|---|---|---|
2 | guest "owns" a given page of memory. This metadata, referred as | 2 | guest "owns" a given page of memory. This metadata, referred as |
3 | Physical Address Metadata Table (PAMT), essentially serves as the | 3 | Physical Address Metadata Table (PAMT), essentially serves as the |
4 | 'struct page' for the TDX module. PAMTs are not reserved by hardware | 4 | 'struct page' for the TDX module. PAMTs are not reserved by hardware |
5 | up front. They must be allocated by the kernel and then given to the | 5 | up front. They must be allocated by the kernel and then given to the |
6 | TDX module. | 6 | TDX module during module initialization. |
7 | 7 | ||
8 | TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" | 8 | TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" |
9 | (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must | 9 | (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must |
10 | be a physically contiguous area from a Convertible Memory Region (CMR). | 10 | be a physically contiguous area from a Convertible Memory Region (CMR). |
11 | However, the PAMTs which track pages in one TDMR do not need to reside | 11 | However, the PAMTs which track pages in one TDMR do not need to reside |
... | ... | ||
14 | that particular TDMR. | 14 | that particular TDMR. |
15 | 15 | ||
16 | Use alloc_contig_pages() since PAMT must be a physically contiguous area | 16 | Use alloc_contig_pages() since PAMT must be a physically contiguous area |
17 | and it may be potentially large (~1/256th of the size of the given TDMR). | 17 | and it may be potentially large (~1/256th of the size of the given TDMR). |
18 | The downside is alloc_contig_pages() may fail at runtime. One (bad) | 18 | The downside is alloc_contig_pages() may fail at runtime. One (bad) |
19 | mitigation is to launch a TD guest early during system boot to get those | 19 | mitigation is to launch a TDX guest early during system boot to get |
20 | PAMTs allocated early, but the only way to fix this is to add a boot | 20 | those PAMTs allocated early, but the only way to fix this is to add a |
21 | option to allocate or reserve PAMTs during kernel boot. | 21 | boot option to allocate or reserve PAMTs during kernel boot. |
22 | |||
23 | It is imperfect but will be improved on later. | ||
22 | 24 | ||
23 | TDX only supports a limited number of reserved areas per TDMR to cover | 25 | TDX only supports a limited number of reserved areas per TDMR to cover |
24 | both PAMTs and memory holes within the given TDMR. If many PAMTs are | 26 | both PAMTs and memory holes within the given TDMR. If many PAMTs are |
25 | allocated within a single TDMR, the reserved areas may not be sufficient | 27 | allocated within a single TDMR, the reserved areas may not be sufficient |
26 | to cover all of them. | 28 | to cover all of them. |
... | ... | ||
31 | the total number of reserved areas consumed for PAMTs. | 33 | the total number of reserved areas consumed for PAMTs. |
32 | - Try to first allocate PAMT from the local node of the TDMR for better | 34 | - Try to first allocate PAMT from the local node of the TDMR for better |
33 | NUMA locality. | 35 | NUMA locality. |
34 | 36 | ||
35 | Also dump out how many pages are allocated for PAMTs when the TDX module | 37 | Also dump out how many pages are allocated for PAMTs when the TDX module |
36 | is initialized successfully. | 38 | is initialized successfully. This helps answer the eternal "where did |
37 | 39 | all my memory go?" questions. | |
40 | |||
41 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
38 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 42 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
39 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 43 | Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> |
44 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
45 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
40 | --- | 46 | --- |
47 | |||
48 | v13 -> v14: | ||
49 | - No change | ||
50 | |||
51 | v12 -> v13: | ||
52 | - Added Kirill and Yuan's tag. | ||
53 | - Removed unintended space. (Yuan) | ||
54 | |||
55 | v11 -> v12: | ||
56 | - Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill) | ||
57 | - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill) | ||
58 | - Changed tdmr_get_pamt() to return base and size instead of base_pfn | ||
59 | and npages and related code directly (Dave). | ||
60 | - Simplified PAMT kb counting. (Dave) | ||
61 | - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave) | ||
62 | |||
63 | v10 -> v11: | ||
64 | - No update | ||
65 | |||
66 | v9 -> v10: | ||
67 | - Removed code change in disable_tdx_module() as it doesn't exist | ||
68 | anymore. | ||
69 | |||
70 | v8 -> v9: | ||
71 | - Added TDX_PS_NR macro instead of open-coding (Dave). | ||
72 | - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave). | ||
73 | - Changed to print out PAMTs in "KBs" instead of "pages" (Dave). | ||
74 | - Added Dave's Reviewed-by. | ||
75 | |||
76 | v7 -> v8: (Dave) | ||
77 | - Changelog: | ||
78 | - Added a sentence to state PAMT allocation will be improved. | ||
79 | - Others suggested by Dave. | ||
80 | - Moved 'nid' of 'struct tdx_memblock' to this patch. | ||
81 | - Improved comments around tdmr_get_nid(). | ||
82 | - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid(). | ||
83 | - Other changes due to 'struct tdmr_info_list'. | ||
41 | 84 | ||
42 | v6 -> v7: | 85 | v6 -> v7: |
43 | - Changes due to using macros instead of 'enum' for TDX supported page | 86 | - Changes due to using macros instead of 'enum' for TDX supported page |
44 | sizes. | 87 | sizes. |
45 | 88 | ||
... | ... | ||
49 | - Improved comment around tdmr_get_nid() (Dave). | 92 | - Improved comment around tdmr_get_nid() (Dave). |
50 | - Improved comment in tdmr_set_up_pamt() around breaking the PAMT | 93 | - Improved comment in tdmr_set_up_pamt() around breaking the PAMT |
51 | into PAMTs for 4K/2M/1G (Dave). | 94 | into PAMTs for 4K/2M/1G (Dave). |
52 | - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave). | 95 | - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave). |
53 | 96 | ||
54 | - v3 -> v5 (no feedback on v4): | ||
55 | - Used memblock to get the NUMA node for given TDMR. | ||
56 | - Removed tdmr_get_pamt_sz() helper but use open-code instead. | ||
57 | - Changed to use 'switch .. case..' for each TDX supported page size in | ||
58 | tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()). | ||
59 | - Added printing out memory used for PAMT allocation when TDX module is | ||
60 | initialized successfully. | ||
61 | - Explained downside of alloc_contig_pages() in changelog. | ||
62 | - Addressed other minor comments. | ||
63 | |||
64 | |||
65 | --- | 97 | --- |
66 | arch/x86/Kconfig | 1 + | 98 | arch/x86/Kconfig | 1 + |
67 | arch/x86/virt/vmx/tdx/tdx.c | 191 ++++++++++++++++++++++++++++++++++++ | 99 | arch/x86/include/asm/shared/tdx.h | 1 + |
68 | 2 files changed, 192 insertions(+) | 100 | arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++- |
101 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | ||
102 | 4 files changed, 213 insertions(+), 5 deletions(-) | ||
69 | 103 | ||
70 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig | 104 | diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig |
71 | index XXXXXXX..XXXXXXX 100644 | 105 | index XXXXXXX..XXXXXXX 100644 |
72 | --- a/arch/x86/Kconfig | 106 | --- a/arch/x86/Kconfig |
73 | +++ b/arch/x86/Kconfig | 107 | +++ b/arch/x86/Kconfig |
... | ... | ||
77 | select ARCH_KEEP_MEMBLOCK | 111 | select ARCH_KEEP_MEMBLOCK |
78 | + depends on CONTIG_ALLOC | 112 | + depends on CONTIG_ALLOC |
79 | help | 113 | help |
80 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious | 114 | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious |
81 | host and certain physical attacks. This option enables necessary TDX | 115 | host and certain physical attacks. This option enables necessary TDX |
116 | diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h | ||
117 | index XXXXXXX..XXXXXXX 100644 | ||
118 | --- a/arch/x86/include/asm/shared/tdx.h | ||
119 | +++ b/arch/x86/include/asm/shared/tdx.h | ||
120 | @@ -XXX,XX +XXX,XX @@ | ||
121 | #define TDX_PS_4K 0 | ||
122 | #define TDX_PS_2M 1 | ||
123 | #define TDX_PS_1G 2 | ||
124 | +#define TDX_PS_NR (TDX_PS_1G + 1) | ||
125 | |||
126 | #ifndef __ASSEMBLY__ | ||
127 | |||
82 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 128 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
83 | index XXXXXXX..XXXXXXX 100644 | 129 | index XXXXXXX..XXXXXXX 100644 |
84 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 130 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
85 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 131 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
86 | @@ -XXX,XX +XXX,XX @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 132 | @@ -XXX,XX +XXX,XX @@ static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo, |
133 | * overlap. | ||
134 | */ | ||
135 | static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, | ||
136 | - unsigned long end_pfn) | ||
137 | + unsigned long end_pfn, int nid) | ||
138 | { | ||
139 | struct tdx_memblock *tmb; | ||
140 | |||
141 | @@ -XXX,XX +XXX,XX @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, | ||
142 | INIT_LIST_HEAD(&tmb->list); | ||
143 | tmb->start_pfn = start_pfn; | ||
144 | tmb->end_pfn = end_pfn; | ||
145 | + tmb->nid = nid; | ||
146 | |||
147 | /* @tmb_list is protected by mem_hotplug_lock */ | ||
148 | list_add_tail(&tmb->list, tmb_list); | ||
149 | @@ -XXX,XX +XXX,XX @@ static void free_tdx_memlist(struct list_head *tmb_list) | ||
150 | static int build_tdx_memlist(struct list_head *tmb_list) | ||
151 | { | ||
152 | unsigned long start_pfn, end_pfn; | ||
153 | - int i, ret; | ||
154 | + int i, nid, ret; | ||
155 | |||
156 | - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { | ||
157 | + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { | ||
158 | /* | ||
159 | * The first 1MB is not reported as TDX convertible memory. | ||
160 | * Although the first 1MB is always reserved and won't end up | ||
161 | @@ -XXX,XX +XXX,XX @@ static int build_tdx_memlist(struct list_head *tmb_list) | ||
162 | * memblock has already guaranteed they are in address | ||
163 | * ascending order and don't overlap. | ||
164 | */ | ||
165 | - ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); | ||
166 | + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid); | ||
167 | if (ret) | ||
168 | goto err; | ||
169 | } | ||
170 | @@ -XXX,XX +XXX,XX @@ static int fill_out_tdmrs(struct list_head *tmb_list, | ||
87 | return 0; | 171 | return 0; |
88 | } | 172 | } |
89 | 173 | ||
90 | +/* | 174 | +/* |
91 | + * Calculate PAMT size given a TDMR and a page size. The returned | 175 | + * Calculate PAMT size given a TDMR and a page size. The returned |
92 | + * PAMT size is always aligned up to 4K page boundary. | 176 | + * PAMT size is always aligned up to 4K page boundary. |
93 | + */ | 177 | + */ |
94 | +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz) | 178 | +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz, |
179 | + u16 pamt_entry_size) | ||
95 | +{ | 180 | +{ |
96 | + unsigned long pamt_sz, nr_pamt_entries; | 181 | + unsigned long pamt_sz, nr_pamt_entries; |
97 | + | 182 | + |
98 | + switch (pgsz) { | 183 | + switch (pgsz) { |
99 | + case TDX_PS_4K: | 184 | + case TDX_PS_4K: |
... | ... | ||
108 | + default: | 193 | + default: |
109 | + WARN_ON_ONCE(1); | 194 | + WARN_ON_ONCE(1); |
110 | + return 0; | 195 | + return 0; |
111 | + } | 196 | + } |
112 | + | 197 | + |
113 | + pamt_sz = nr_pamt_entries * tdx_sysinfo.pamt_entry_size; | 198 | + pamt_sz = nr_pamt_entries * pamt_entry_size; |
114 | + /* TDX requires PAMT size must be 4K aligned */ | 199 | + /* TDX requires PAMT size must be 4K aligned */ |
115 | + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE); | 200 | + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE); |
116 | + | 201 | + |
117 | + return pamt_sz; | 202 | + return pamt_sz; |
118 | +} | 203 | +} |
119 | + | 204 | + |
120 | +/* | 205 | +/* |
121 | + * Pick a NUMA node on which to allocate this TDMR's metadata. | 206 | + * Locate a NUMA node which should hold the allocation of the @tdmr |
122 | + * | 207 | + * PAMT. This node will have some memory covered by the TDMR. The |
123 | + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might | 208 | + * relative amount of memory covered is not considered. |
124 | + * not be. If the TDMR covers more than one node, just use the _first_ | ||
125 | + * one. This can lead to small areas of off-node metadata for some | ||
126 | + * memory. | ||
127 | + */ | 209 | + */ |
128 | +static int tdmr_get_nid(struct tdmr_info *tdmr) | 210 | +static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list) |
129 | +{ | 211 | +{ |
130 | + struct tdx_memblock *tmb; | 212 | + struct tdx_memblock *tmb; |
131 | + | 213 | + |
132 | + /* Find the first memory region covered by the TDMR */ | 214 | + /* |
133 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 215 | + * A TDMR must cover at least part of one TMB. That TMB will end |
134 | + if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT)) | 216 | + * after the TDMR begins. But, that TMB may have started before |
217 | + * the TDMR. Find the next 'tmb' that _ends_ after this TDMR | ||
218 | + * begins. Ignore 'tmb' start addresses. They are irrelevant. | ||
219 | + */ | ||
220 | + list_for_each_entry(tmb, tmb_list, list) { | ||
221 | + if (tmb->end_pfn > PHYS_PFN(tdmr->base)) | ||
135 | + return tmb->nid; | 222 | + return tmb->nid; |
136 | + } | 223 | + } |
137 | + | 224 | + |
138 | + /* | 225 | + /* |
139 | + * Fall back to allocating the TDMR's metadata from node 0 when | 226 | + * Fall back to allocating the TDMR's metadata from node 0 when |
140 | + * no TDX memory block can be found. This should never happen | 227 | + * no TDX memory block can be found. This should never happen |
141 | + * since TDMRs originate from TDX memory blocks. | 228 | + * since TDMRs originate from TDX memory blocks. |
142 | + */ | 229 | + */ |
143 | + WARN_ON_ONCE(1); | 230 | + pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n", |
231 | + tdmr->base, tdmr_end(tdmr)); | ||
144 | + return 0; | 232 | + return 0; |
145 | +} | 233 | +} |
146 | + | 234 | + |
147 | +static int tdmr_set_up_pamt(struct tdmr_info *tdmr) | 235 | +/* |
148 | +{ | 236 | + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list |
149 | + unsigned long pamt_base[TDX_PS_1G + 1]; | 237 | + * within @tdmr, and set up PAMTs for @tdmr. |
150 | + unsigned long pamt_size[TDX_PS_1G + 1]; | 238 | + */ |
239 | +static int tdmr_set_up_pamt(struct tdmr_info *tdmr, | ||
240 | + struct list_head *tmb_list, | ||
241 | + u16 pamt_entry_size) | ||
242 | +{ | ||
243 | + unsigned long pamt_base[TDX_PS_NR]; | ||
244 | + unsigned long pamt_size[TDX_PS_NR]; | ||
151 | + unsigned long tdmr_pamt_base; | 245 | + unsigned long tdmr_pamt_base; |
152 | + unsigned long tdmr_pamt_size; | 246 | + unsigned long tdmr_pamt_size; |
153 | + struct page *pamt; | 247 | + struct page *pamt; |
154 | + int pgsz, nid; | 248 | + int pgsz, nid; |
155 | + | 249 | + |
156 | + nid = tdmr_get_nid(tdmr); | 250 | + nid = tdmr_get_nid(tdmr, tmb_list); |
157 | + | 251 | + |
158 | + /* | 252 | + /* |
159 | + * Calculate the PAMT size for each TDX supported page size | 253 | + * Calculate the PAMT size for each TDX supported page size |
160 | + * and the total PAMT size. | 254 | + * and the total PAMT size. |
161 | + */ | 255 | + */ |
162 | + tdmr_pamt_size = 0; | 256 | + tdmr_pamt_size = 0; |
163 | + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) { | 257 | + for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) { |
164 | + pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz); | 258 | + pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz, |
259 | + pamt_entry_size); | ||
165 | + tdmr_pamt_size += pamt_size[pgsz]; | 260 | + tdmr_pamt_size += pamt_size[pgsz]; |
166 | + } | 261 | + } |
167 | + | 262 | + |
168 | + /* | 263 | + /* |
169 | + * Allocate one chunk of physically contiguous memory for all | 264 | + * Allocate one chunk of physically contiguous memory for all |
... | ... | ||
178 | + /* | 273 | + /* |
179 | + * Break the contiguous allocation back up into the | 274 | + * Break the contiguous allocation back up into the |
180 | + * individual PAMTs for each page size. | 275 | + * individual PAMTs for each page size. |
181 | + */ | 276 | + */ |
182 | + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT; | 277 | + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT; |
183 | + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) { | 278 | + for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) { |
184 | + pamt_base[pgsz] = tdmr_pamt_base; | 279 | + pamt_base[pgsz] = tdmr_pamt_base; |
185 | + tdmr_pamt_base += pamt_size[pgsz]; | 280 | + tdmr_pamt_base += pamt_size[pgsz]; |
186 | + } | 281 | + } |
187 | + | 282 | + |
188 | + tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; | 283 | + tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; |
... | ... | ||
193 | + tdmr->pamt_1g_size = pamt_size[TDX_PS_1G]; | 288 | + tdmr->pamt_1g_size = pamt_size[TDX_PS_1G]; |
194 | + | 289 | + |
195 | + return 0; | 290 | + return 0; |
196 | +} | 291 | +} |
197 | + | 292 | + |
198 | +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn, | 293 | +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base, |
199 | + unsigned long *pamt_npages) | 294 | + unsigned long *pamt_size) |
200 | +{ | 295 | +{ |
201 | + unsigned long pamt_base, pamt_sz; | 296 | + unsigned long pamt_bs, pamt_sz; |
202 | + | 297 | + |
203 | + /* | 298 | + /* |
204 | + * The PAMT was allocated in one contiguous unit. The 4K PAMT | 299 | + * The PAMT was allocated in one contiguous unit. The 4K PAMT |
205 | + * should always point to the beginning of that allocation. | 300 | + * should always point to the beginning of that allocation. |
206 | + */ | 301 | + */ |
207 | + pamt_base = tdmr->pamt_4k_base; | 302 | + pamt_bs = tdmr->pamt_4k_base; |
208 | + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size; | 303 | + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size; |
209 | + | 304 | + |
210 | + *pamt_pfn = pamt_base >> PAGE_SHIFT; | 305 | + WARN_ON_ONCE((pamt_bs & ~PAGE_MASK) || (pamt_sz & ~PAGE_MASK)); |
211 | + *pamt_npages = pamt_sz >> PAGE_SHIFT; | 306 | + |
307 | + *pamt_base = pamt_bs; | ||
308 | + *pamt_size = pamt_sz; | ||
212 | +} | 309 | +} |
213 | + | 310 | + |
214 | +static void tdmr_free_pamt(struct tdmr_info *tdmr) | 311 | +static void tdmr_free_pamt(struct tdmr_info *tdmr) |
215 | +{ | 312 | +{ |
216 | + unsigned long pamt_pfn, pamt_npages; | 313 | + unsigned long pamt_base, pamt_size; |
217 | + | 314 | + |
218 | + tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages); | 315 | + tdmr_get_pamt(tdmr, &pamt_base, &pamt_size); |
219 | + | 316 | + |
220 | + /* Do nothing if PAMT hasn't been allocated for this TDMR */ | 317 | + /* Do nothing if PAMT hasn't been allocated for this TDMR */ |
221 | + if (!pamt_npages) | 318 | + if (!pamt_size) |
222 | + return; | 319 | + return; |
223 | + | 320 | + |
224 | + if (WARN_ON_ONCE(!pamt_pfn)) | 321 | + if (WARN_ON_ONCE(!pamt_base)) |
225 | + return; | 322 | + return; |
226 | + | 323 | + |
227 | + free_contig_range(pamt_pfn, pamt_npages); | 324 | + free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT); |
228 | +} | 325 | +} |
229 | + | 326 | + |
230 | +static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num) | 327 | +static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) |
231 | +{ | 328 | +{ |
232 | + int i; | 329 | + int i; |
233 | + | 330 | + |
234 | + for (i = 0; i < tdmr_num; i++) | 331 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) |
235 | + tdmr_free_pamt(tdmr_array_entry(tdmr_array, i)); | 332 | + tdmr_free_pamt(tdmr_entry(tdmr_list, i)); |
236 | +} | 333 | +} |
237 | + | 334 | + |
238 | +/* Allocate and set up PAMTs for all TDMRs */ | 335 | +/* Allocate and set up PAMTs for all TDMRs */ |
239 | +static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num) | 336 | +static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, |
337 | + struct list_head *tmb_list, | ||
338 | + u16 pamt_entry_size) | ||
240 | +{ | 339 | +{ |
241 | + int i, ret = 0; | 340 | + int i, ret = 0; |
242 | + | 341 | + |
243 | + for (i = 0; i < tdmr_num; i++) { | 342 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
244 | + ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i)); | 343 | + ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list, |
344 | + pamt_entry_size); | ||
245 | + if (ret) | 345 | + if (ret) |
246 | + goto err; | 346 | + goto err; |
247 | + } | 347 | + } |
248 | + | 348 | + |
249 | + return 0; | 349 | + return 0; |
250 | +err: | 350 | +err: |
251 | + tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 351 | + tdmrs_free_pamt_all(tdmr_list); |
252 | + return ret; | 352 | + return ret; |
253 | +} | 353 | +} |
254 | + | 354 | + |
255 | +static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array, | 355 | +static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) |
256 | + int tdmr_num) | 356 | +{ |
257 | +{ | 357 | + unsigned long pamt_size = 0; |
258 | + unsigned long pamt_npages = 0; | ||
259 | + int i; | 358 | + int i; |
260 | + | 359 | + |
261 | + for (i = 0; i < tdmr_num; i++) { | 360 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
262 | + unsigned long pfn, npages; | 361 | + unsigned long base, size; |
263 | + | 362 | + |
264 | + tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn, &npages); | 363 | + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size); |
265 | + pamt_npages += npages; | 364 | + pamt_size += size; |
266 | + } | 365 | + } |
267 | + | 366 | + |
268 | + return pamt_npages; | 367 | + return pamt_size / 1024; |
269 | +} | 368 | +} |
270 | + | 369 | + |
271 | /* | 370 | /* |
272 | * Construct an array of TDMRs to cover all TDX memory ranges. | 371 | * Construct a list of TDMRs on the preallocated space in @tdmr_list |
273 | * The actual number of TDMRs is kept to @tdmr_num. | 372 | * to cover all TDX memory regions in @tmb_list based on the TDX module |
274 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 373 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, |
275 | if (ret) | 374 | if (ret) |
276 | goto err; | 375 | return ret; |
277 | 376 | ||
278 | + ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num); | 377 | + ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, |
378 | + sysinfo->pamt_entry_size); | ||
279 | + if (ret) | 379 | + if (ret) |
280 | + goto err; | 380 | + return ret; |
281 | + | 381 | /* |
282 | /* Return -EINVAL until constructing TDMRs is done */ | 382 | * TODO: |
283 | ret = -EINVAL; | 383 | * |
284 | + tdmrs_free_pamt_all(tdmr_array, *tdmr_num); | 384 | - * - Allocate and set up PAMTs for each TDMR. |
285 | err: | 385 | * - Designate reserved areas for each TDMR. |
286 | return ret; | 386 | * |
287 | } | 387 | * Return -EINVAL until constructing TDMRs is done |
288 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 388 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
289 | * process are done. | 389 | * Return error before all steps are done. |
290 | */ | 390 | */ |
291 | ret = -EINVAL; | 391 | ret = -EINVAL; |
292 | + if (ret) | 392 | + if (ret) |
293 | + tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 393 | + tdmrs_free_pamt_all(&tdmr_list); |
294 | + else | 394 | + else |
295 | + pr_info("%lu pages allocated for PAMT.\n", | 395 | + pr_info("%lu KBs allocated for PAMT\n", |
296 | + tdmrs_count_pamt_pages(tdmr_array, tdmr_num)); | 396 | + tdmrs_count_pamt_kb(&tdmr_list)); |
297 | out_free_tdmrs: | 397 | out_free_tdmrs: |
298 | /* | 398 | /* |
299 | * The array of TDMRs is freed no matter the initialization is | 399 | * Always free the buffer of TDMRs as they are only used during |
400 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
401 | index XXXXXXX..XXXXXXX 100644 | ||
402 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
403 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
404 | @@ -XXX,XX +XXX,XX @@ struct tdx_memblock { | ||
405 | struct list_head list; | ||
406 | unsigned long start_pfn; | ||
407 | unsigned long end_pfn; | ||
408 | + int nid; | ||
409 | }; | ||
410 | |||
411 | /* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */ | ||
300 | -- | 412 | -- |
301 | 2.38.1 | 413 | 2.41.0 | diff view generated by jsdifflib |
1 | As the last step of constructing TDMRs, set up reserved areas for all | 1 | As the last step of constructing TDMRs, populate reserved areas for all |
---|---|---|---|
2 | TDMRs. For each TDMR, put all memory holes within this TDMR into the | 2 | TDMRs. For each TDMR, put all memory holes within this TDMR into the |
3 | reserved areas. For all PAMTs that overlap with this TDMR, put | 3 | reserved areas. For all PAMTs that overlap with this TDMR, put |
4 | all the overlapping parts into reserved areas too. | 4 | all the overlapping parts into reserved areas too. |
5 | 5 | ||
6 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
6 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 7 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
7 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 8 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
9 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
8 | --- | 10 | --- |
9 | 11 | ||
10 | v6 -> v7: | 12 | v13 -> v14: |
13 | - No change | ||
14 | |||
15 | v12 -> v13: | ||
16 | - Added Yuan's tag. | ||
17 | |||
18 | v11 -> v12: | ||
19 | - Code change due to tdmr_get_pamt() change from returning pfn/npages to | ||
20 | base/size | ||
21 | - Added Kirill's tag | ||
22 | |||
23 | v10 -> v11: | ||
24 | - No update | ||
25 | |||
26 | v9 -> v10: | ||
11 | - No change. | 27 | - No change. |
12 | 28 | ||
13 | v5 -> v6: | 29 | v8 -> v9: |
14 | - Rebase due to using 'tdx_memblock' instead of memblock. | 30 | - Added comment around 'tdmr_add_rsvd_area()' to point out it doesn't do |
15 | - Split tdmr_set_up_rsvd_areas() into two functions to handle memory | 31 | optimization to save reserved areas. (Dave). |
16 | hole and PAMT respectively. | 32 | |
17 | - Added Isaku's Reviewed-by. | 33 | v7 -> v8: (Dave) |
18 | 34 | - "set_up" -> "populate" in function name change (Dave). | |
35 | - Improved comment suggested by Dave. | ||
36 | - Other changes due to 'struct tdmr_info_list'. | ||
19 | 37 | ||
20 | --- | 38 | --- |
21 | arch/x86/virt/vmx/tdx/tdx.c | 190 +++++++++++++++++++++++++++++++++++- | 39 | arch/x86/virt/vmx/tdx/tdx.c | 217 ++++++++++++++++++++++++++++++++++-- |
22 | 1 file changed, 188 insertions(+), 2 deletions(-) | 40 | 1 file changed, 209 insertions(+), 8 deletions(-) |
23 | 41 | ||
24 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 42 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
25 | index XXXXXXX..XXXXXXX 100644 | 43 | index XXXXXXX..XXXXXXX 100644 |
26 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 44 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
27 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 45 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
28 | @@ -XXX,XX +XXX,XX @@ | 46 | @@ -XXX,XX +XXX,XX @@ |
29 | #include <linux/memblock.h> | ||
30 | #include <linux/minmax.h> | ||
31 | #include <linux/sizes.h> | 47 | #include <linux/sizes.h> |
48 | #include <linux/pfn.h> | ||
49 | #include <linux/align.h> | ||
32 | +#include <linux/sort.h> | 50 | +#include <linux/sort.h> |
33 | #include <asm/msr-index.h> | 51 | #include <asm/msr-index.h> |
34 | #include <asm/msr.h> | 52 | #include <asm/msr.h> |
35 | #include <asm/apic.h> | 53 | #include <asm/page.h> |
36 | @@ -XXX,XX +XXX,XX @@ static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array, | 54 | @@ -XXX,XX +XXX,XX @@ static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) |
37 | return pamt_npages; | 55 | return pamt_size / 1024; |
38 | } | 56 | } |
39 | 57 | ||
40 | +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, | 58 | +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr, |
41 | + u64 addr, u64 size) | 59 | + u64 size, u16 max_reserved_per_tdmr) |
42 | +{ | 60 | +{ |
43 | + struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas; | 61 | + struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas; |
44 | + int idx = *p_idx; | 62 | + int idx = *p_idx; |
45 | + | 63 | + |
46 | + /* Reserved area must be 4K aligned in offset and size */ | 64 | + /* Reserved area must be 4K aligned in offset and size */ |
47 | + if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK)) | 65 | + if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK)) |
48 | + return -EINVAL; | 66 | + return -EINVAL; |
49 | + | 67 | + |
50 | + /* Cannot exceed maximum reserved areas supported by TDX */ | 68 | + if (idx >= max_reserved_per_tdmr) { |
51 | + if (idx >= tdx_sysinfo.max_reserved_per_tdmr) | 69 | + pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n", |
52 | + return -E2BIG; | 70 | + tdmr->base, tdmr_end(tdmr)); |
53 | + | 71 | + return -ENOSPC; |
72 | + } | ||
73 | + | ||
74 | + /* | ||
75 | + * Consume one reserved area per call. Make no effort to | ||
76 | + * optimize or reduce the number of reserved areas which are | ||
77 | + * consumed by contiguous reserved areas, for instance. | ||
78 | + */ | ||
54 | + rsvd_areas[idx].offset = addr - tdmr->base; | 79 | + rsvd_areas[idx].offset = addr - tdmr->base; |
55 | + rsvd_areas[idx].size = size; | 80 | + rsvd_areas[idx].size = size; |
56 | + | 81 | + |
57 | + *p_idx = idx + 1; | 82 | + *p_idx = idx + 1; |
58 | + | 83 | + |
59 | + return 0; | 84 | + return 0; |
60 | +} | 85 | +} |
61 | + | 86 | + |
62 | +static int tdmr_set_up_memory_hole_rsvd_areas(struct tdmr_info *tdmr, | 87 | +/* |
63 | + int *rsvd_idx) | 88 | + * Go through @tmb_list to find holes between memory areas. If any of |
89 | + * those holes fall within @tdmr, set up a TDMR reserved area to cover | ||
90 | + * the hole. | ||
91 | + */ | ||
92 | +static int tdmr_populate_rsvd_holes(struct list_head *tmb_list, | ||
93 | + struct tdmr_info *tdmr, | ||
94 | + int *rsvd_idx, | ||
95 | + u16 max_reserved_per_tdmr) | ||
64 | +{ | 96 | +{ |
65 | + struct tdx_memblock *tmb; | 97 | + struct tdx_memblock *tmb; |
66 | + u64 prev_end; | 98 | + u64 prev_end; |
67 | + int ret; | 99 | + int ret; |
68 | + | 100 | + |
69 | + /* Mark holes between memory regions as reserved */ | 101 | + /* |
70 | + prev_end = tdmr_start(tdmr); | 102 | + * Start looking for reserved blocks at the |
71 | + list_for_each_entry(tmb, &tdx_memlist, list) { | 103 | + * beginning of the TDMR. |
104 | + */ | ||
105 | + prev_end = tdmr->base; | ||
106 | + list_for_each_entry(tmb, tmb_list, list) { | ||
72 | + u64 start, end; | 107 | + u64 start, end; |
73 | + | 108 | + |
74 | + start = tmb->start_pfn << PAGE_SHIFT; | 109 | + start = PFN_PHYS(tmb->start_pfn); |
75 | + end = tmb->end_pfn << PAGE_SHIFT; | 110 | + end = PFN_PHYS(tmb->end_pfn); |
76 | + | 111 | + |
77 | + /* Break if this region is after the TDMR */ | 112 | + /* Break if this region is after the TDMR */ |
78 | + if (start >= tdmr_end(tdmr)) | 113 | + if (start >= tdmr_end(tdmr)) |
79 | + break; | 114 | + break; |
80 | + | 115 | + |
81 | + /* Exclude regions before this TDMR */ | 116 | + /* Exclude regions before this TDMR */ |
82 | + if (end < tdmr_start(tdmr)) | 117 | + if (end < tdmr->base) |
83 | + continue; | 118 | + continue; |
84 | + | 119 | + |
85 | + /* | 120 | + /* |
86 | + * Skip if no hole exists before this region. "<=" is | 121 | + * Skip over memory areas that |
87 | + * used because one memory region might span two TDMRs | 122 | + * have already been dealt with. |
88 | + * (when the previous TDMR covers part of this region). | ||
89 | + * In this case the start address of this region is | ||
90 | + * smaller than the start address of the second TDMR. | ||
91 | + * | ||
92 | + * Update the prev_end to the end of this region where | ||
93 | + * the possible memory hole starts. | ||
94 | + */ | 123 | + */ |
95 | + if (start <= prev_end) { | 124 | + if (start <= prev_end) { |
96 | + prev_end = end; | 125 | + prev_end = end; |
97 | + continue; | 126 | + continue; |
98 | + } | 127 | + } |
99 | + | 128 | + |
100 | + /* Add the hole before this region */ | 129 | + /* Add the hole before this region */ |
101 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, | 130 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, |
102 | + start - prev_end); | 131 | + start - prev_end, |
132 | + max_reserved_per_tdmr); | ||
103 | + if (ret) | 133 | + if (ret) |
104 | + return ret; | 134 | + return ret; |
105 | + | 135 | + |
106 | + prev_end = end; | 136 | + prev_end = end; |
107 | + } | 137 | + } |
108 | + | 138 | + |
109 | + /* Add the hole after the last region if it exists. */ | 139 | + /* Add the hole after the last region if it exists. */ |
110 | + if (prev_end < tdmr_end(tdmr)) { | 140 | + if (prev_end < tdmr_end(tdmr)) { |
111 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, | 141 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, |
112 | + tdmr_end(tdmr) - prev_end); | 142 | + tdmr_end(tdmr) - prev_end, |
113 | + if (ret) | 143 | + max_reserved_per_tdmr); |
114 | + return ret; | 144 | + if (ret) |
115 | + } | 145 | + return ret; |
116 | + | 146 | + } |
117 | + return 0; | 147 | + |
118 | +} | 148 | + return 0; |
119 | + | 149 | +} |
120 | +static int tdmr_set_up_pamt_rsvd_areas(struct tdmr_info *tdmr, int *rsvd_idx, | 150 | + |
121 | + struct tdmr_info *tdmr_array, | 151 | +/* |
122 | + int tdmr_num) | 152 | + * Go through @tdmr_list to find all PAMTs. If any of those PAMTs |
153 | + * overlaps with @tdmr, set up a TDMR reserved area to cover the | ||
154 | + * overlapping part. | ||
155 | + */ | ||
156 | +static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list, | ||
157 | + struct tdmr_info *tdmr, | ||
158 | + int *rsvd_idx, | ||
159 | + u16 max_reserved_per_tdmr) | ||
123 | +{ | 160 | +{ |
124 | + int i, ret; | 161 | + int i, ret; |
125 | + | 162 | + |
126 | + /* | 163 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
127 | + * If any PAMT overlaps with this TDMR, the overlapping part | 164 | + struct tdmr_info *tmp = tdmr_entry(tdmr_list, i); |
128 | + * must also be put to the reserved area too. Walk over all | 165 | + unsigned long pamt_base, pamt_size, pamt_end; |
129 | + * TDMRs to find out those overlapping PAMTs and put them to | 166 | + |
130 | + * reserved areas. | 167 | + tdmr_get_pamt(tmp, &pamt_base, &pamt_size); |
131 | + */ | ||
132 | + for (i = 0; i < tdmr_num; i++) { | ||
133 | + struct tdmr_info *tmp = tdmr_array_entry(tdmr_array, i); | ||
134 | + unsigned long pamt_start_pfn, pamt_npages; | ||
135 | + u64 pamt_start, pamt_end; | ||
136 | + | ||
137 | + tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages); | ||
138 | + /* Each TDMR must already have PAMT allocated */ | 168 | + /* Each TDMR must already have PAMT allocated */ |
139 | + WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn); | 169 | + WARN_ON_ONCE(!pamt_size || !pamt_base); |
140 | + | 170 | + |
141 | + pamt_start = pamt_start_pfn << PAGE_SHIFT; | 171 | + pamt_end = pamt_base + pamt_size; |
142 | + pamt_end = pamt_start + (pamt_npages << PAGE_SHIFT); | ||
143 | + | ||
144 | + /* Skip PAMTs outside of the given TDMR */ | 172 | + /* Skip PAMTs outside of the given TDMR */ |
145 | + if ((pamt_end <= tdmr_start(tdmr)) || | 173 | + if ((pamt_end <= tdmr->base) || |
146 | + (pamt_start >= tdmr_end(tdmr))) | 174 | + (pamt_base >= tdmr_end(tdmr))) |
147 | + continue; | 175 | + continue; |
148 | + | 176 | + |
149 | + /* Only mark the part within the TDMR as reserved */ | 177 | + /* Only mark the part within the TDMR as reserved */ |
150 | + if (pamt_start < tdmr_start(tdmr)) | 178 | + if (pamt_base < tdmr->base) |
151 | + pamt_start = tdmr_start(tdmr); | 179 | + pamt_base = tdmr->base; |
152 | + if (pamt_end > tdmr_end(tdmr)) | 180 | + if (pamt_end > tdmr_end(tdmr)) |
153 | + pamt_end = tdmr_end(tdmr); | 181 | + pamt_end = tdmr_end(tdmr); |
154 | + | 182 | + |
155 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start, | 183 | + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_base, |
156 | + pamt_end - pamt_start); | 184 | + pamt_end - pamt_base, |
185 | + max_reserved_per_tdmr); | ||
157 | + if (ret) | 186 | + if (ret) |
158 | + return ret; | 187 | + return ret; |
159 | + } | 188 | + } |
160 | + | 189 | + |
161 | + return 0; | 190 | + return 0; |
... | ... | ||
170 | + if (r1->offset + r1->size <= r2->offset) | 199 | + if (r1->offset + r1->size <= r2->offset) |
171 | + return -1; | 200 | + return -1; |
172 | + if (r1->offset >= r2->offset + r2->size) | 201 | + if (r1->offset >= r2->offset + r2->size) |
173 | + return 1; | 202 | + return 1; |
174 | + | 203 | + |
175 | + /* Reserved areas cannot overlap. The caller should guarantee. */ | 204 | + /* Reserved areas cannot overlap. The caller must guarantee. */ |
176 | + WARN_ON_ONCE(1); | 205 | + WARN_ON_ONCE(1); |
177 | + return -1; | 206 | + return -1; |
178 | +} | 207 | +} |
179 | + | 208 | + |
180 | +/* Set up reserved areas for a TDMR, including memory holes and PAMTs */ | 209 | +/* |
181 | +static int tdmr_set_up_rsvd_areas(struct tdmr_info *tdmr, | 210 | + * Populate reserved areas for the given @tdmr, including memory holes |
182 | + struct tdmr_info *tdmr_array, | 211 | + * (via @tmb_list) and PAMTs (via @tdmr_list). |
183 | + int tdmr_num) | 212 | + */ |
213 | +static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr, | ||
214 | + struct list_head *tmb_list, | ||
215 | + struct tdmr_info_list *tdmr_list, | ||
216 | + u16 max_reserved_per_tdmr) | ||
184 | +{ | 217 | +{ |
185 | + int ret, rsvd_idx = 0; | 218 | + int ret, rsvd_idx = 0; |
186 | + | 219 | + |
187 | + /* Put all memory holes within the TDMR into reserved areas */ | 220 | + ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx, |
188 | + ret = tdmr_set_up_memory_hole_rsvd_areas(tdmr, &rsvd_idx); | 221 | + max_reserved_per_tdmr); |
189 | + if (ret) | 222 | + if (ret) |
190 | + return ret; | 223 | + return ret; |
191 | + | 224 | + |
192 | + /* Put all (overlapping) PAMTs within the TDMR into reserved areas */ | 225 | + ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx, |
193 | + ret = tdmr_set_up_pamt_rsvd_areas(tdmr, &rsvd_idx, tdmr_array, tdmr_num); | 226 | + max_reserved_per_tdmr); |
194 | + if (ret) | 227 | + if (ret) |
195 | + return ret; | 228 | + return ret; |
196 | + | 229 | + |
197 | + /* TDX requires reserved areas listed in address ascending order */ | 230 | + /* TDX requires reserved areas listed in address ascending order */ |
198 | + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area), | 231 | + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area), |
199 | + rsvd_area_cmp_func, NULL); | 232 | + rsvd_area_cmp_func, NULL); |
200 | + | 233 | + |
201 | + return 0; | 234 | + return 0; |
202 | +} | 235 | +} |
203 | + | 236 | + |
204 | +static int tdmrs_set_up_rsvd_areas_all(struct tdmr_info *tdmr_array, | 237 | +/* |
205 | + int tdmr_num) | 238 | + * Populate reserved areas for all TDMRs in @tdmr_list, including memory |
239 | + * holes (via @tmb_list) and PAMTs. | ||
240 | + */ | ||
241 | +static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list, | ||
242 | + struct list_head *tmb_list, | ||
243 | + u16 max_reserved_per_tdmr) | ||
206 | +{ | 244 | +{ |
207 | + int i; | 245 | + int i; |
208 | + | 246 | + |
209 | + for (i = 0; i < tdmr_num; i++) { | 247 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
210 | + int ret; | 248 | + int ret; |
211 | + | 249 | + |
212 | + ret = tdmr_set_up_rsvd_areas(tdmr_array_entry(tdmr_array, i), | 250 | + ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i), |
213 | + tdmr_array, tdmr_num); | 251 | + tmb_list, tdmr_list, max_reserved_per_tdmr); |
214 | + if (ret) | 252 | + if (ret) |
215 | + return ret; | 253 | + return ret; |
216 | + } | 254 | + } |
217 | + | 255 | + |
218 | + return 0; | 256 | + return 0; |
219 | +} | 257 | +} |
220 | + | 258 | + |
221 | /* | 259 | /* |
222 | * Construct an array of TDMRs to cover all TDX memory ranges. | 260 | * Construct a list of TDMRs on the preallocated space in @tdmr_list |
223 | * The actual number of TDMRs is kept to @tdmr_num. | 261 | * to cover all TDX memory regions in @tmb_list based on the TDX module |
224 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 262 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, |
263 | sysinfo->pamt_entry_size); | ||
225 | if (ret) | 264 | if (ret) |
226 | goto err; | 265 | return ret; |
227 | 266 | - /* | |
228 | - /* Return -EINVAL until constructing TDMRs is done */ | 267 | - * TODO: |
229 | - ret = -EINVAL; | 268 | - * |
230 | + ret = tdmrs_set_up_rsvd_areas_all(tdmr_array, *tdmr_num); | 269 | - * - Designate reserved areas for each TDMR. |
270 | - * | ||
271 | - * Return -EINVAL until constructing TDMRs is done | ||
272 | - */ | ||
273 | - return -EINVAL; | ||
274 | + | ||
275 | + ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list, | ||
276 | + sysinfo->max_reserved_per_tdmr); | ||
231 | + if (ret) | 277 | + if (ret) |
232 | + goto err_free_pamts; | 278 | + tdmrs_free_pamt_all(tdmr_list); |
233 | + | 279 | + |
234 | + return 0; | 280 | + return ret; |
235 | +err_free_pamts: | 281 | } |
236 | tdmrs_free_pamt_all(tdmr_array, *tdmr_num); | 282 | |
237 | err: | 283 | static int init_tdx_module(void) |
238 | return ret; | ||
239 | -- | 284 | -- |
240 | 2.38.1 | 285 | 2.41.0 | diff view generated by jsdifflib |
1 | After the TDX-usable memory regions are constructed in an array of TDMRs | 1 | The TDX module uses a private KeyID as the "global KeyID" for mapping |
---|---|---|---|
2 | and the global KeyID is reserved, configure them to the TDX module using | 2 | things like the PAMT and other TDX metadata. This KeyID has already |
3 | TDH.SYS.CONFIG SEAMCALL. TDH.SYS.CONFIG can only be called once and can | 3 | been reserved when detecting TDX during the kernel early boot. |
4 | be done on any logical cpu. | ||
5 | 4 | ||
5 | After the list of "TD Memory Regions" (TDMRs) has been constructed to | ||
6 | cover all TDX-usable memory regions, the next step is to pass them to | ||
7 | the TDX module together with the global KeyID. | ||
8 | |||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
6 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 10 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
7 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 11 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
12 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
8 | --- | 13 | --- |
9 | arch/x86/virt/vmx/tdx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++ | 14 | |
15 | v13 -> v14: | ||
16 | - No change | ||
17 | |||
18 | v12 -> v13: | ||
19 | - Added Yuan's tag. | ||
20 | |||
21 | v11 -> v12: | ||
22 | - Added Kirill's tag | ||
23 | |||
24 | v10 -> v11: | ||
25 | - No update | ||
26 | |||
27 | v9 -> v10: | ||
28 | - Code change due to changing static 'tdx_tdmr_list' to local 'tdmr_list'. | ||
29 | |||
30 | v8 -> v9: | ||
31 | - Improved changelog to explain why initializing TDMRs can take a long | ||
32 | time (Dave). | ||
33 | - Improved comments around 'next-to-initialize' address (Dave). | ||
34 | |||
35 | v7 -> v8: (Dave) | ||
36 | - Changelog: | ||
37 | - explicitly call out this is the last step of TDX module initialization. | ||
38 | - Trimmed down changelog by removing SEAMCALL name and details. | ||
39 | - Removed/trimmed down unnecessary comments. | ||
40 | - Other changes due to 'struct tdmr_info_list'. | ||
41 | |||
42 | v6 -> v7: | ||
43 | - Removed need_resched() check. -- Andi. | ||
44 | |||
45 | --- | ||
46 | arch/x86/virt/vmx/tdx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++- | ||
10 | arch/x86/virt/vmx/tdx/tdx.h | 2 ++ | 47 | arch/x86/virt/vmx/tdx/tdx.h | 2 ++ |
11 | 2 files changed, 39 insertions(+) | 48 | 2 files changed, 44 insertions(+), 1 deletion(-) |
12 | 49 | ||
13 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 50 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
14 | index XXXXXXX..XXXXXXX 100644 | 51 | index XXXXXXX..XXXXXXX 100644 |
15 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 52 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
16 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 53 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
17 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num) | 54 | @@ -XXX,XX +XXX,XX @@ |
55 | #include <linux/pfn.h> | ||
56 | #include <linux/align.h> | ||
57 | #include <linux/sort.h> | ||
58 | +#include <linux/log2.h> | ||
59 | #include <asm/msr-index.h> | ||
60 | #include <asm/msr.h> | ||
61 | #include <asm/page.h> | ||
62 | @@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list, | ||
18 | return ret; | 63 | return ret; |
19 | } | 64 | } |
20 | 65 | ||
21 | +static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num, | 66 | +static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) |
22 | + u64 global_keyid) | ||
23 | +{ | 67 | +{ |
68 | + struct tdx_module_args args = {}; | ||
24 | + u64 *tdmr_pa_array; | 69 | + u64 *tdmr_pa_array; |
25 | + int i, array_sz; | 70 | + size_t array_sz; |
26 | + u64 ret; | 71 | + int i, ret; |
27 | + | 72 | + |
28 | + /* | 73 | + /* |
29 | + * TDMR_INFO entries are configured to the TDX module via an | 74 | + * TDMRs are passed to the TDX module via an array of physical |
30 | + * array of the physical address of each TDMR_INFO. TDX module | 75 | + * addresses of each TDMR. The array itself also has certain |
31 | + * requires the array itself to be 512-byte aligned. Round up | 76 | + * alignment requirement. |
32 | + * the array size to 512-byte aligned so the buffer allocated | ||
33 | + * by kzalloc() will meet the alignment requirement. | ||
34 | + */ | 77 | + */ |
35 | + array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT); | 78 | + array_sz = tdmr_list->nr_consumed_tdmrs * sizeof(u64); |
79 | + array_sz = roundup_pow_of_two(array_sz); | ||
80 | + if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT) | ||
81 | + array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT; | ||
82 | + | ||
36 | + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL); | 83 | + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL); |
37 | + if (!tdmr_pa_array) | 84 | + if (!tdmr_pa_array) |
38 | + return -ENOMEM; | 85 | + return -ENOMEM; |
39 | + | 86 | + |
40 | + for (i = 0; i < tdmr_num; i++) | 87 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) |
41 | + tdmr_pa_array[i] = __pa(tdmr_array_entry(tdmr_array, i)); | 88 | + tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i)); |
42 | + | 89 | + |
43 | + ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_num, | 90 | + args.rcx = __pa(tdmr_pa_array); |
44 | + global_keyid, 0, NULL, NULL); | 91 | + args.rdx = tdmr_list->nr_consumed_tdmrs; |
92 | + args.r8 = global_keyid; | ||
93 | + ret = seamcall_prerr(TDH_SYS_CONFIG, &args); | ||
45 | + | 94 | + |
46 | + /* Free the array as it is not required anymore. */ | 95 | + /* Free the array as it is not required anymore. */ |
47 | + kfree(tdmr_pa_array); | 96 | + kfree(tdmr_pa_array); |
48 | + | 97 | + |
49 | + return ret; | 98 | + return ret; |
50 | +} | 99 | +} |
51 | + | 100 | + |
52 | /* | 101 | static int init_tdx_module(void) |
53 | * Detect and initialize the TDX module. | 102 | { |
54 | * | 103 | struct tdsysinfo_struct *tdsysinfo; |
55 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 104 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
56 | */ | 105 | if (ret) |
57 | tdx_global_keyid = tdx_keyid_start; | 106 | goto out_free_tdmrs; |
58 | 107 | ||
59 | + /* Pass the TDMRs and the global KeyID to the TDX module */ | 108 | + /* Pass the TDMRs and the global KeyID to the TDX module */ |
60 | + ret = config_tdx_module(tdmr_array, tdmr_num, tdx_global_keyid); | 109 | + ret = config_tdx_module(&tdmr_list, tdx_global_keyid); |
61 | + if (ret) | 110 | + if (ret) |
62 | + goto out_free_pamts; | 111 | + goto out_free_pamts; |
63 | + | 112 | + |
64 | /* | 113 | /* |
65 | * Return -EINVAL until all steps of TDX module initialization | 114 | * TODO: |
66 | * process are done. | 115 | * |
116 | - * - Configure the TDMRs and the global KeyID to the TDX module. | ||
117 | * - Configure the global KeyID on all packages. | ||
118 | * - Initialize all TDMRs. | ||
119 | * | ||
120 | * Return error before all steps are done. | ||
67 | */ | 121 | */ |
68 | ret = -EINVAL; | 122 | ret = -EINVAL; |
69 | +out_free_pamts: | 123 | +out_free_pamts: |
70 | if (ret) | 124 | if (ret) |
71 | tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 125 | tdmrs_free_pamt_all(&tdmr_list); |
72 | else | 126 | else |
73 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 127 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
74 | index XXXXXXX..XXXXXXX 100644 | 128 | index XXXXXXX..XXXXXXX 100644 |
75 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 129 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
76 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 130 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
77 | @@ -XXX,XX +XXX,XX @@ | 131 | @@ -XXX,XX +XXX,XX @@ |
132 | #define TDH_SYS_INFO 32 | ||
78 | #define TDH_SYS_INIT 33 | 133 | #define TDH_SYS_INIT 33 |
79 | #define TDH_SYS_LP_INIT 35 | 134 | #define TDH_SYS_LP_INIT 35 |
80 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
81 | +#define TDH_SYS_CONFIG 45 | 135 | +#define TDH_SYS_CONFIG 45 |
82 | 136 | ||
83 | struct cmr_info { | 137 | struct cmr_info { |
84 | u64 base; | 138 | u64 base; |
85 | @@ -XXX,XX +XXX,XX @@ struct tdmr_reserved_area { | 139 | @@ -XXX,XX +XXX,XX @@ struct tdmr_reserved_area { |
... | ... | ||
89 | +#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512 | 143 | +#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512 |
90 | 144 | ||
91 | struct tdmr_info { | 145 | struct tdmr_info { |
92 | u64 base; | 146 | u64 base; |
93 | -- | 147 | -- |
94 | 2.38.1 | 148 | 2.41.0 | diff view generated by jsdifflib |
1 | After the array of TDMRs and the global KeyID are configured to the TDX | 1 | After the list of TDMRs and the global KeyID are configured to the TDX |
---|---|---|---|
2 | module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID | 2 | module, the kernel needs to configure the key of the global KeyID on all |
3 | on all packages. | 3 | packages using TDH.SYS.KEY.CONFIG. |
4 | 4 | ||
5 | TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And | 5 | This SEAMCALL cannot run in parallel on different cpus. Loop all online |
6 | it cannot run concurrently on different CPUs. Implement a helper to | 6 | cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of |
7 | run SEAMCALL on one cpu for each package one by one, and use it to | 7 | each package. |
8 | configure the global KeyID on all packages. | 8 | |
9 | To keep things simple, this implementation takes no affirmative steps to | ||
10 | online cpus to make sure there's at least one cpu for each package. The | ||
11 | callers (i.e. KVM) can ensure success by making sure sufficient CPUs | ||
12 | are online. | ||
9 | 13 | ||
10 | Intel hardware doesn't guarantee cache coherency across different | 14 | Intel hardware doesn't guarantee cache coherency across different |
11 | KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated | 15 | KeyIDs. The PAMTs are transitioning from being used by the kernel |
12 | with KeyID 0) before the TDX module uses the global KeyID to access the | 16 | mapping (KeyID 0) to the TDX module's "global KeyID" mapping. |
13 | PAMT. Following the TDX module specification, flush cache before | 17 | |
14 | configuring the global KeyID on all packages. | 18 | This means that the kernel must flush any dirty KeyID-0 PAMT cachelines |
15 | 19 | before the TDX module uses the global KeyID to access the PAMTs. | |
16 | Given the PAMT size can be large (~1/256th of system RAM), just use | 20 | Otherwise, if those dirty cachelines were written back, they would |
17 | WBINVD on all CPUs to flush. | 21 | corrupt the TDX module's metadata. Aside: This corruption would be |
18 | 22 | detected by the memory integrity hardware on the next read of the memory | |
19 | Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have | 23 | with the global KeyID. The result would likely be fatal to the system |
20 | used the global KeyID to write any PAMT. Therefore, use WBINVD | 24 | but would not impact TDX security. |
21 | to flush cache before freeing the PAMTs back to the kernel. Note using | 25 | |
22 | MOVDIR64B (which changes the page's associated KeyID from the old TDX | 26 | Following the TDX module specification, flush cache before configuring |
23 | private KeyID back to KeyID 0, which is used by the kernel) to clear | 27 | the global KeyID on all packages. Given the PAMT size can be large |
24 | PAMTs isn't needed, as KeyID 0 doesn't support integrity check. | 28 | (~1/256th of system RAM), just use WBINVD on all CPUs to flush. |
25 | 29 | ||
30 | If TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the | ||
31 | global KeyID to write the PAMTs. Therefore, use WBINVD to flush cache | ||
32 | before returning the PAMTs back to the kernel. Also convert all PAMTs | ||
33 | back to normal by using MOVDIR64B as suggested by the TDX module spec, | ||
34 | although on the platform without the "partial write machine check" | ||
35 | erratum it's OK to leave PAMTs as is. | ||
36 | |||
37 | Signed-off-by: Kai Huang <kai.huang@intel.com> | ||
26 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 38 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
27 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 39 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
40 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
28 | --- | 41 | --- |
29 | 42 | ||
30 | v6 -> v7: | 43 | v13 -> v14: |
31 | - Improved changelog and comment to explain why MOVDIR64B isn't used | 44 | - No change |
32 | when returning PAMTs back to the kernel. | 45 | |
46 | v12 -> v13: | ||
47 | - Added Yuan's tag. | ||
48 | |||
49 | v11 -> v12: | ||
50 | - Added Kirill's tag | ||
51 | - Improved changelog (Nikolay) | ||
52 | |||
53 | v10 -> v11: | ||
54 | - Convert PAMTs back to normal when module initialization fails. | ||
55 | - Fixed an error in changelog | ||
56 | |||
57 | v9 -> v10: | ||
58 | - Changed to use 'smp_call_on_cpu()' directly to do key configuration. | ||
59 | |||
60 | v8 -> v9: | ||
61 | - Improved changelog (Dave). | ||
62 | - Improved comments to explain the function to configure global KeyID | ||
63 | "takes no affirmative action to online any cpu". (Dave). | ||
64 | - Improved other comments suggested by Dave. | ||
65 | |||
66 | v7 -> v8: (Dave) | ||
67 | - Changelog changes: | ||
68 | - Point out this is the step of "multi-steps" of init_tdx_module(). | ||
69 | - Removed MOVDIR64B part. | ||
70 | - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT. | ||
71 | - Changed to loop over online cpus and use smp_call_function_single() | ||
72 | directly as the patch to shut down TDX module has been removed. | ||
73 | - Removed MOVDIR64B part in comment. | ||
33 | 74 | ||
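The reset_tdx_pages() helper in the diff below walks each PAMT region in 64-byte strides, issuing one MOVDIR64B per cacheline to convert the pages back to normal. The following userspace sketch models just that stride arithmetic; it is not kernel code, and memcpy() merely stands in for the MOVDIR64B instruction (names like reset_region() are made up for illustration):

```c
#include <assert.h>
#include <string.h>

#define CACHELINE 64UL

/* A zeroed 64-byte source line, like the kernel's ZERO_PAGE(0). */
static const unsigned char zero_line[CACHELINE];

/* Clear a region one cacheline at a time; memcpy() stands in for
 * MOVDIR64B.  Returns how many 64-byte stores were issued. */
unsigned long reset_region(unsigned char *buf, unsigned long size)
{
    unsigned long off, nops = 0;

    for (off = 0; off < size; off += CACHELINE) {
        memcpy(buf + off, zero_line, CACHELINE);
        nops++;
    }
    return nops;
}
```

For a real 4KB page this loop would issue 4096 / 64 = 64 stores; the kernel version additionally ends with a memory barrier because MOVDIR64B uses the WC protocol.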
34 | --- | 75 | --- |
35 | arch/x86/virt/vmx/tdx/tdx.c | 89 ++++++++++++++++++++++++++++++++++++- | 76 | arch/x86/virt/vmx/tdx/tdx.c | 130 +++++++++++++++++++++++++++++++++++- |
36 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 77 | arch/x86/virt/vmx/tdx/tdx.h | 1 + |
37 | 2 files changed, 88 insertions(+), 2 deletions(-) | 78 | 2 files changed, 129 insertions(+), 2 deletions(-) |
38 | 79 | ||
39 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 80 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
40 | index XXXXXXX..XXXXXXX 100644 | 81 | index XXXXXXX..XXXXXXX 100644 |
41 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 82 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
42 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 83 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
43 | @@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | 84 | @@ -XXX,XX +XXX,XX @@ |
44 | on_each_cpu(seamcall_smp_call_function, sc, true); | 85 | #include <asm/msr-index.h> |
45 | } | 86 | #include <asm/msr.h> |
87 | #include <asm/page.h> | ||
88 | +#include <asm/special_insns.h> | ||
89 | #include <asm/tdx.h> | ||
90 | #include "tdx.h" | ||
91 | |||
92 | @@ -XXX,XX +XXX,XX @@ static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base, | ||
93 | *pamt_size = pamt_sz; | ||
94 | } | ||
95 | |||
96 | -static void tdmr_free_pamt(struct tdmr_info *tdmr) | ||
97 | +static void tdmr_do_pamt_func(struct tdmr_info *tdmr, | ||
98 | + void (*pamt_func)(unsigned long base, unsigned long size)) | ||
99 | { | ||
100 | unsigned long pamt_base, pamt_size; | ||
101 | |||
102 | @@ -XXX,XX +XXX,XX @@ static void tdmr_free_pamt(struct tdmr_info *tdmr) | ||
103 | if (WARN_ON_ONCE(!pamt_base)) | ||
104 | return; | ||
105 | |||
106 | + (*pamt_func)(pamt_base, pamt_size); | ||
107 | +} | ||
108 | + | ||
109 | +static void free_pamt(unsigned long pamt_base, unsigned long pamt_size) | ||
110 | +{ | ||
111 | free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT); | ||
112 | } | ||
113 | |||
114 | +static void tdmr_free_pamt(struct tdmr_info *tdmr) | ||
115 | +{ | ||
116 | + tdmr_do_pamt_func(tdmr, free_pamt); | ||
117 | +} | ||
118 | + | ||
119 | static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) | ||
120 | { | ||
121 | int i; | ||
122 | @@ -XXX,XX +XXX,XX @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, | ||
123 | return ret; | ||
124 | } | ||
46 | 125 | ||
47 | +/* | 126 | +/* |
48 | + * Call one SEAMCALL on one (any) cpu for each physical package in | 127 | + * Convert TDX private pages back to normal by using MOVDIR64B to |
49 | + * serialized way. Return immediately in case of any error if | 128 | + * clear these pages. Note this function doesn't flush cache of |
50 | + * SEAMCALL fails on any cpu. | 129 | + * these TDX private pages. The caller should make sure of that. |
130 | + */ | ||
131 | +static void reset_tdx_pages(unsigned long base, unsigned long size) | ||
132 | +{ | ||
133 | + const void *zero_page = (const void *)page_address(ZERO_PAGE(0)); | ||
134 | + unsigned long phys, end; | ||
135 | + | ||
136 | + end = base + size; | ||
137 | + for (phys = base; phys < end; phys += 64) | ||
138 | + movdir64b(__va(phys), zero_page); | ||
139 | + | ||
140 | + /* | ||
141 | + * MOVDIR64B uses WC protocol. Use memory barrier to | ||
142 | + * make sure any later user of these pages sees the | ||
143 | + * updated data. | ||
144 | + */ | ||
145 | + mb(); | ||
146 | +} | ||
147 | + | ||
148 | +static void tdmr_reset_pamt(struct tdmr_info *tdmr) | ||
149 | +{ | ||
150 | + tdmr_do_pamt_func(tdmr, reset_tdx_pages); | ||
151 | +} | ||
152 | + | ||
153 | +static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list) | ||
154 | +{ | ||
155 | + int i; | ||
156 | + | ||
157 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) | ||
158 | + tdmr_reset_pamt(tdmr_entry(tdmr_list, i)); | ||
159 | +} | ||
160 | + | ||
161 | static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) | ||
162 | { | ||
163 | unsigned long pamt_size = 0; | ||
164 | @@ -XXX,XX +XXX,XX @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) | ||
165 | return ret; | ||
166 | } | ||
167 | |||
168 | +static int do_global_key_config(void *data) | ||
169 | +{ | ||
170 | + struct tdx_module_args args = {}; | ||
171 | + | ||
172 | + return seamcall_prerr(TDH_SYS_KEY_CONFIG, &args); | ||
173 | +} | ||
174 | + | ||
175 | +/* | ||
176 | + * Attempt to configure the global KeyID on all physical packages. | ||
51 | + * | 177 | + * |
52 | + * Note for serialized calls 'struct seamcall_ctx::err' doesn't have | 178 | + * This requires running code on at least one CPU in each package. If a |
53 | + * to be atomic, but for simplicity just reuse it instead of adding | 179 | + * package has no online CPUs, that code will not run and TDX module |
54 | + * a new one. | 180 | + * initialization (TDMR initialization) will fail. |
181 | + * | ||
182 | + * This code takes no affirmative steps to online CPUs. Callers (aka. | ||
183 | + * KVM) can ensure success by ensuring sufficient CPUs are online for | ||
184 | + * this to succeed. | ||
55 | + */ | 185 | + */ |
56 | +static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc) | 186 | +static int config_global_keyid(void) |
57 | +{ | 187 | +{ |
58 | + cpumask_var_t packages; | 188 | + cpumask_var_t packages; |
59 | + int cpu, ret = 0; | 189 | + int cpu, ret = -EINVAL; |
60 | + | 190 | + |
61 | + if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) | 191 | + if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) |
62 | + return -ENOMEM; | 192 | + return -ENOMEM; |
63 | + | 193 | + |
64 | + for_each_online_cpu(cpu) { | 194 | + for_each_online_cpu(cpu) { |
65 | + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu), | 195 | + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu), |
66 | + packages)) | 196 | + packages)) |
67 | + continue; | 197 | + continue; |
68 | + | 198 | + |
69 | + ret = smp_call_function_single(cpu, seamcall_smp_call_function, | ||
70 | + sc, true); | ||
71 | + if (ret) | ||
72 | + break; | ||
73 | + | ||
74 | + /* | 199 | + /* |
75 | + * Doesn't have to use atomic_read(), but it doesn't | 200 | + * TDH.SYS.KEY.CONFIG cannot run concurrently on |
76 | + * hurt either. | 201 | + * different cpus, so just do it one by one. |
77 | + */ | 202 | + */ |
78 | + ret = atomic_read(&sc->err); | 203 | + ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true); |
79 | + if (ret) | 204 | + if (ret) |
80 | + break; | 205 | + break; |
81 | + } | 206 | + } |
82 | + | 207 | + |
83 | + free_cpumask_var(packages); | 208 | + free_cpumask_var(packages); |
84 | + return ret; | 209 | + return ret; |
85 | +} | 210 | +} |
86 | + | 211 | + |
87 | static int tdx_module_init_cpus(void) | 212 | static int init_tdx_module(void) |
88 | { | 213 | { |
89 | struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT }; | 214 | struct tdsysinfo_struct *tdsysinfo; |
90 | @@ -XXX,XX +XXX,XX @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num, | ||
91 | return ret; | ||
92 | } | ||
93 | |||
94 | +static int config_global_keyid(void) | ||
95 | +{ | ||
96 | + struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG }; | ||
97 | + | ||
98 | + /* | ||
99 | + * Configure the key of the global KeyID on all packages by | ||
100 | + * calling TDH.SYS.KEY.CONFIG on all packages in a serialized | ||
101 | + * way as it cannot run concurrently on different CPUs. | ||
102 | + * | ||
103 | + * TDH.SYS.KEY.CONFIG may fail with entropy error (which is | ||
104 | + * a recoverable error). Assume this is exceedingly rare and | ||
105 | + * just return error if encountered instead of retrying. | ||
106 | + */ | ||
107 | + return seamcall_on_each_package_serialized(&sc); | ||
108 | +} | ||
109 | + | ||
110 | /* | ||
111 | * Detect and initialize the TDX module. | ||
112 | * | ||
113 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 215 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
114 | if (ret) | 216 | if (ret) |
115 | goto out_free_pamts; | 217 | goto out_free_pamts; |
116 | 218 | ||
117 | + /* | 219 | + /* |
118 | + * Hardware doesn't guarantee cache coherency across different | 220 | + * Hardware doesn't guarantee cache coherency across different |
119 | + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines | 221 | + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines |
120 | + * (associated with KeyID 0) before the TDX module can use the | 222 | + * (associated with KeyID 0) before the TDX module can use the |
121 | + * global KeyID to access the PAMT. Given PAMTs are potentially | 223 | + * global KeyID to access the PAMT. Given PAMTs are potentially |
122 | + * large (~1/256th of system RAM), just use WBINVD on all cpus | 224 | + * large (~1/256th of system RAM), just use WBINVD on all cpus |
123 | + * to flush the cache. | 225 | + * to flush the cache. |
124 | + * | ||
125 | + * Follow the TDX spec to flush cache before configuring the | ||
126 | + * global KeyID on all packages. | ||
127 | + */ | 226 | + */ |
128 | + wbinvd_on_all_cpus(); | 227 | + wbinvd_on_all_cpus(); |
129 | + | 228 | + |
130 | + /* Config the key of global KeyID on all packages */ | 229 | + /* Config the key of global KeyID on all packages */ |
131 | + ret = config_global_keyid(); | 230 | + ret = config_global_keyid(); |
132 | + if (ret) | 231 | + if (ret) |
133 | + goto out_free_pamts; | 232 | + goto out_reset_pamts; |
134 | + | 233 | + |
135 | /* | 234 | /* |
136 | * Return -EINVAL until all steps of TDX module initialization | 235 | * TODO: |
137 | * process are done. | 236 | * |
237 | - * - Configure the global KeyID on all packages. | ||
238 | * - Initialize all TDMRs. | ||
239 | * | ||
240 | * Return error before all steps are done. | ||
138 | */ | 241 | */ |
139 | ret = -EINVAL; | 242 | ret = -EINVAL; |
140 | out_free_pamts: | 243 | +out_reset_pamts: |
141 | - if (ret) | ||
142 | + if (ret) { | 244 | + if (ret) { |
143 | + /* | 245 | + /* |
144 | + * Part of PAMT may already have been initialized by | 246 | + * Part of PAMTs may already have been initialized by the |
145 | + * TDX module. Flush cache before returning PAMT back | 247 | + * TDX module. Flush cache before returning PAMTs back |
146 | + * to the kernel. | 248 | + * to the kernel. |
147 | + * | ||
148 | + * Note there's no need to do MOVDIR64B (which changes | ||
149 | + * the page's associated KeyID from the old TDX private | ||
150 | + * KeyID back to KeyID 0, which is used by the kernel), | ||
151 | + * as KeyID 0 doesn't support integrity check. | ||
152 | + */ | 249 | + */ |
153 | + wbinvd_on_all_cpus(); | 250 | + wbinvd_on_all_cpus(); |
154 | tdmrs_free_pamt_all(tdmr_array, tdmr_num); | 251 | + /* |
155 | - else | 252 | + * According to the TDX hardware spec, if the platform |
156 | + } else | 253 | + * doesn't have the "partial write machine check" |
157 | pr_info("%lu pages allocated for PAMT.\n", | 254 | + * erratum, any kernel read/write will never cause #MC |
158 | tdmrs_count_pamt_pages(tdmr_array, tdmr_num)); | 255 | + * in kernel space, thus it's OK to not convert PAMTs |
159 | out_free_tdmrs: | 256 | + * back to normal. But do the conversion anyway here |
257 | + * as suggested by the TDX spec. | ||
258 | + */ | ||
259 | + tdmrs_reset_pamt_all(&tdmr_list); | ||
260 | + } | ||
261 | out_free_pamts: | ||
262 | if (ret) | ||
263 | tdmrs_free_pamt_all(&tdmr_list); | ||
264 | @@ -XXX,XX +XXX,XX @@ static int __tdx_enable(void) | ||
265 | * lock to prevent any new cpu from becoming online; 2) done both VMXON | ||
266 | * and tdx_cpu_enable() on all online cpus. | ||
267 | * | ||
268 | + * This function requires there's at least one online cpu for each CPU | ||
269 | + * package to succeed. | ||
270 | + * | ||
271 | * This function can be called in parallel by multiple callers. | ||
272 | * | ||
273 | * Return 0 if TDX is enabled successfully, otherwise error. | ||
160 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 274 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
161 | index XXXXXXX..XXXXXXX 100644 | 275 | index XXXXXXX..XXXXXXX 100644 |
162 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 276 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
163 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 277 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
164 | @@ -XXX,XX +XXX,XX @@ | 278 | @@ -XXX,XX +XXX,XX @@ |
... | ... | ||
168 | +#define TDH_SYS_KEY_CONFIG 31 | 282 | +#define TDH_SYS_KEY_CONFIG 31 |
169 | #define TDH_SYS_INFO 32 | 283 | #define TDH_SYS_INFO 32 |
170 | #define TDH_SYS_INIT 33 | 284 | #define TDH_SYS_INIT 33 |
171 | #define TDH_SYS_LP_INIT 35 | 285 | #define TDH_SYS_LP_INIT 35 |
172 | -- | 286 | -- |
173 | 2.38.1 | 287 | 2.41.0 | diff view generated by jsdifflib |
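The config_global_keyid() loop above runs TDH.SYS.KEY.CONFIG serially on one (any) online CPU per physical package, using a cpumask to skip packages it has already visited. A minimal userspace sketch of that deduplication follows; the cpu_to_package[] table is a hypothetical stand-in for topology_physical_package_id(), and the counter stands in for the serialized smp_call_on_cpu() invocation:

```c
#include <assert.h>

#define NR_CPUS 8

/* Hypothetical topology: which physical package each CPU lives in. */
const int cpu_to_package[NR_CPUS] = { 0, 0, 1, 1, 0, 1, 2, 2 };

/* Count how many per-package "SEAMCALLs" the loop issues:
 * exactly one for each distinct package. */
int count_package_calls(void)
{
    unsigned long seen = 0;   /* bitmask standing in for the cpumask */
    int cpu, calls = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        unsigned long bit = 1UL << cpu_to_package[cpu];

        if (seen & bit)
            continue;          /* this package is already configured */
        seen |= bit;
        calls++;               /* stand-in for smp_call_on_cpu() */
    }
    return calls;
}
```

With the sample table above (packages 0, 1 and 2), the loop makes three calls no matter how many CPUs each package has online, which is why a package with no online CPUs would make the real initialization fail.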
1 | Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the | 1 | After the global KeyID has been configured on all packages, initialize |
---|---|---|---|
2 | TDX initialization. | 2 | all TDMRs to make all TDX-usable memory regions that are passed to the |
3 | TDX module become usable. | ||
3 | 4 | ||
4 | All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before | 5 | This is the last step of initializing the TDX module. |
5 | the memory pages can be used by the TDX module. The time to initialize | ||
6 | TDMR is proportional to the size of the TDMR because TDH.SYS.TDMR.INIT | ||
7 | internally initializes the PAMT entries using the global KeyID. | ||
8 | 6 | ||
9 | To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only | 7 | Initializing TDMRs can be time consuming on large memory systems as it |
10 | initializes an (implementation-specific) subset of PAMT entries of one | 8 | involves initializing all metadata entries for all pages that can be |
11 | TDMR in one invocation. The caller needs to call TDH.SYS.TDMR.INIT | 9 | used by TDX guests. Initializing different TDMRs can be parallelized. |
12 | iteratively until all PAMT entries of the given TDMR are initialized. | 10 | For now to keep it simple, just initialize all TDMRs one by one. It can |
11 | be enhanced in the future. | ||
13 | 12 | ||
14 | TDH.SYS.TDMR.INITs can run concurrently on multiple CPUs as long as they | 13 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
15 | are initializing different TDMRs. To keep it simple, just initialize | 14 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
16 | all TDMRs one by one. On a 2-socket machine with 2.2G CPUs and 64GB | 15 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
17 | memory, each TDH.SYS.TDMR.INIT roughly takes couple of microseconds on | 16 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> |
18 | average, and it takes roughly dozens of milliseconds to complete the | 17 | --- |
19 | initialization of all TDMRs while system is idle. | ||
20 | 18 | ||
21 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 19 | v13 -> v14: |
22 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 20 | - No change |
23 | --- | 21 | |
22 | v12 -> v13: | ||
23 | - Added Yuan's tag. | ||
24 | |||
25 | v11 -> v12: | ||
26 | - Added Kirill's tag | ||
27 | |||
28 | v10 -> v11: | ||
29 | - No update | ||
30 | |||
31 | v9 -> v10: | ||
32 | - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'. | ||
33 | |||
34 | v8 -> v9: | ||
35 | - Improved changelog to explain why initializing TDMRs can take a long | ||
36 | time (Dave). | ||
37 | - Improved comments around 'next-to-initialize' address (Dave). | ||
38 | |||
39 | v7 -> v8: (Dave) | ||
40 | - Changelog: | ||
41 | - explicitly call out this is the last step of TDX module initialization. | ||
42 | - Trimed down changelog by removing SEAMCALL name and details. | ||
43 | - Removed/trimmed down unnecessary comments. | ||
44 | - Other changes due to 'struct tdmr_info_list'. | ||
24 | 45 | ||
25 | v6 -> v7: | 46 | v6 -> v7: |
26 | - Removed need_resched() check. -- Andi. | 47 | - Removed need_resched() check. -- Andi. |
27 | 48 | ||
28 | --- | 49 | --- |
29 | arch/x86/virt/vmx/tdx/tdx.c | 69 ++++++++++++++++++++++++++++++++++--- | 50 | arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++----- |
30 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 51 | arch/x86/virt/vmx/tdx/tdx.h | 1 + |
31 | 2 files changed, 65 insertions(+), 5 deletions(-) | 52 | 2 files changed, 53 insertions(+), 8 deletions(-) |
32 | 53 | ||
33 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 54 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
34 | index XXXXXXX..XXXXXXX 100644 | 55 | index XXXXXXX..XXXXXXX 100644 |
35 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 56 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
36 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 57 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
37 | @@ -XXX,XX +XXX,XX @@ static int config_global_keyid(void) | 58 | @@ -XXX,XX +XXX,XX @@ static int config_global_keyid(void) |
38 | return seamcall_on_each_package_serialized(&sc); | 59 | return ret; |
39 | } | 60 | } |
40 | 61 | ||
41 | +/* Initialize one TDMR */ | ||
42 | +static int init_tdmr(struct tdmr_info *tdmr) | 62 | +static int init_tdmr(struct tdmr_info *tdmr) |
43 | +{ | 63 | +{ |
44 | + u64 next; | 64 | + u64 next; |
45 | + | 65 | + |
46 | + /* | 66 | + /* |
47 | + * Initializing PAMT entries might be time-consuming (in | 67 | + * Initializing a TDMR can be time consuming. To avoid long |
48 | + * proportion to the size of the requested TDMR). To avoid long | 68 | + * SEAMCALLs, the TDX module may only initialize a part of the |
49 | + * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes | 69 | + * TDMR in each call. |
50 | + * an (implementation-defined) subset of PAMT entries in one | ||
51 | + * invocation. | ||
52 | + * | ||
53 | + * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries | ||
54 | + * of the requested TDMR are initialized (if next-to-initialize | ||
55 | + * address matches the end address of the TDMR). | ||
56 | + */ | 70 | + */ |
57 | + do { | 71 | + do { |
58 | + struct tdx_module_output out; | 72 | + struct tdx_module_args args = { |
73 | + .rcx = tdmr->base, | ||
74 | + }; | ||
59 | + int ret; | 75 | + int ret; |
60 | + | 76 | + |
61 | + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL, | 77 | + ret = seamcall_prerr_ret(TDH_SYS_TDMR_INIT, &args); |
62 | + &out); | ||
63 | + if (ret) | 78 | + if (ret) |
64 | + return ret; | 79 | + return ret; |
65 | + /* | 80 | + /* |
66 | + * RDX contains 'next-to-initialize' address if | 81 | + * RDX contains 'next-to-initialize' address if |
67 | + * TDH.SYS.TDMR.INIT succeeded. | 82 | + * TDH.SYS.TDMR.INIT did not fully complete and
83 | + * should be retried. | ||
68 | + */ | 84 | + */ |
69 | + next = out.rdx; | 85 | + next = args.rdx; |
70 | + /* Allow scheduling when needed */ | ||
71 | + cond_resched(); | 86 | + cond_resched(); |
87 | + /* Keep making SEAMCALLs until the TDMR is done */ | ||
72 | + } while (next < tdmr->base + tdmr->size); | 88 | + } while (next < tdmr->base + tdmr->size); |
73 | + | 89 | + |
74 | + return 0; | 90 | + return 0; |
75 | +} | 91 | +} |
76 | + | 92 | + |
77 | +/* Initialize all TDMRs */ | 93 | +static int init_tdmrs(struct tdmr_info_list *tdmr_list) |
78 | +static int init_tdmrs(struct tdmr_info *tdmr_array, int tdmr_num) | ||
79 | +{ | 94 | +{ |
80 | + int i; | 95 | + int i; |
81 | + | 96 | + |
82 | + /* | 97 | + /* |
83 | + * Initialize TDMRs one-by-one for simplicity, though the TDX | 98 | + * This operation is costly. It can be parallelized, |
84 | + * architecture does allow different TDMRs to be initialized in | 99 | + * but keep it simple for now. |
85 | + * parallel on multiple CPUs. Parallel initialization could | ||
86 | + * be added later when the time spent in the serialized scheme | ||
87 | + * becomes a real concern. | ||
88 | + */ | 100 | + */ |
89 | + for (i = 0; i < tdmr_num; i++) { | 101 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { |
90 | + int ret; | 102 | + int ret; |
91 | + | 103 | + |
92 | + ret = init_tdmr(tdmr_array_entry(tdmr_array, i)); | 104 | + ret = init_tdmr(tdmr_entry(tdmr_list, i)); |
93 | + if (ret) | 105 | + if (ret) |
94 | + return ret; | 106 | + return ret; |
95 | + } | 107 | + } |
96 | + | 108 | + |
97 | + return 0; | 109 | + return 0; |
98 | +} | 110 | +} |
99 | + | 111 | + |
100 | /* | 112 | static int init_tdx_module(void) |
101 | * Detect and initialize the TDX module. | 113 | { |
102 | * | 114 | struct tdsysinfo_struct *tdsysinfo; |
103 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 115 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
104 | if (ret) | 116 | if (ret) |
105 | goto out_free_pamts; | 117 | goto out_reset_pamts; |
106 | 118 | ||
107 | - /* | 119 | - /* |
108 | - * Return -EINVAL until all steps of TDX module initialization | 120 | - * TODO: |
109 | - * process are done. | 121 | - * |
122 | - * - Initialize all TDMRs. | ||
123 | - * | ||
124 | - * Return error before all steps are done. | ||
110 | - */ | 125 | - */ |
111 | - ret = -EINVAL; | 126 | - ret = -EINVAL; |
112 | + /* Initialize TDMRs to complete the TDX module initialization */ | 127 | + /* Initialize TDMRs to complete the TDX module initialization */ |
113 | + ret = init_tdmrs(tdmr_array, tdmr_num); | 128 | + ret = init_tdmrs(&tdmr_list); |
114 | + if (ret) | 129 | out_reset_pamts: |
115 | + goto out_free_pamts; | ||
116 | + | ||
117 | out_free_pamts: | ||
118 | if (ret) { | 130 | if (ret) { |
119 | /* | 131 | /* |
120 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 132 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
121 | index XXXXXXX..XXXXXXX 100644 | 133 | index XXXXXXX..XXXXXXX 100644 |
122 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 134 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
123 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 135 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
124 | @@ -XXX,XX +XXX,XX @@ | 136 | @@ -XXX,XX +XXX,XX @@ |
125 | #define TDH_SYS_INFO 32 | 137 | #define TDH_SYS_INFO 32 |
126 | #define TDH_SYS_INIT 33 | 138 | #define TDH_SYS_INIT 33 |
127 | #define TDH_SYS_LP_INIT 35 | 139 | #define TDH_SYS_LP_INIT 35 |
128 | +#define TDH_SYS_TDMR_INIT 36 | 140 | +#define TDH_SYS_TDMR_INIT 36 |
129 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
130 | #define TDH_SYS_CONFIG 45 | 141 | #define TDH_SYS_CONFIG 45 |
131 | 142 | ||
143 | struct cmr_info { | ||
132 | -- | 144 | -- |
133 | 2.38.1 | 145 | 2.41.0 |
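The init_tdmr() loop above captures the retry-until-done pattern: TDH.SYS.TDMR.INIT only initializes part of a TDMR per call and reports a next-to-initialize address in RDX, so the caller repeats until that address reaches the end of the TDMR. The sketch below models the loop in userspace C; the CHUNK size and the tdmr_init_step() stub are assumptions for illustration, not the real SEAMCALL:

```c
#include <assert.h>

#define CHUNK 64UL   /* bytes the stub "module" initializes per call */

/* Stub for TDH.SYS.TDMR.INIT: returns the next-to-initialize address. */
unsigned long tdmr_init_step(unsigned long next)
{
    return next + CHUNK;
}

/* Mirror of the init_tdmr() loop; also counts how many calls it took. */
int init_one_tdmr(unsigned long base, unsigned long size, int *ncalls)
{
    unsigned long next = base;

    do {
        next = tdmr_init_step(next);
        (*ncalls)++;
        /* the kernel calls cond_resched() here between SEAMCALLs */
    } while (next < base + size);

    return 0;
}
```

A 256-byte "TDMR" with a 64-byte step takes four calls; the real module uses an implementation-defined step, which is exactly why the loop keys off the returned address rather than a fixed count.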
1 | There are two problems in terms of using kexec() to boot to a new kernel | 1 | There are two problems in terms of using kexec() to boot to a new kernel |
---|---|---|---|
2 | when the old kernel has enabled TDX: 1) Part of the memory pages are | 2 | when the old kernel has enabled TDX: 1) Part of the memory pages are |
3 | still TDX private pages (i.e. metadata used by the TDX module, and any | 3 | still TDX private pages; 2) There might be dirty cachelines associated |
4 | TDX guest memory if kexec() happens when there's any TDX guest alive). | 4 | with TDX private pages. |
5 | 2) There might be dirty cachelines associated with TDX private pages. | ||
6 | 5 | ||
7 | Because the hardware doesn't guarantee cache coherency among different | 6 | The first problem doesn't matter on the platforms w/o the "partial write |
8 | KeyIDs, the old kernel needs to flush cache (of those TDX private pages) | 7 | machine check" erratum. KeyID 0 doesn't have integrity check. If the |
9 | before booting to the new kernel. Also, reading a TDX private page using | 8 | new kernel wants to use any non-zero KeyID, it needs to convert the
10 | any shared non-TDX KeyID with integrity-check enabled can trigger #MC. | 9 | memory to that KeyID and such conversion would work from any KeyID. |
11 | Therefore ideally, the kernel should convert all TDX private pages back | ||
12 | to normal before booting to the new kernel. | ||
13 | 10 | ||
14 | However, this implementation doesn't convert TDX private pages back to | 11 | However the old kernel needs to guarantee there's no dirty cacheline |
15 | normal in kexec() because of below considerations: | 12 | left behind before booting to the new kernel to avoid silent corruption |
13 | from later cacheline writeback (Intel hardware doesn't guarantee cache | ||
14 | coherency across different KeyIDs). | ||
16 | 15 | ||
17 | 1) The kernel doesn't have existing infrastructure to track which pages | 16 | There are two things that the old kernel needs to do to achieve that: |
18 | are TDX private pages. | ||
19 | 2) The number of TDX private pages can be large, and converting all of | ||
20 | them (cache flush + using MOVDIR64B to clear the page) in kexec() can | ||
21 | be time consuming. | ||
22 | 3) The new kernel will almost only use KeyID 0 to access memory. KeyID | ||
23 | 0 doesn't support integrity-check, so it's OK. | ||
24 | 4) The kernel doesn't (and may never) support MKTME. If any 3rd party | ||
25 | kernel ever supports MKTME, it should do MOVDIR64B to clear the page | ||
26 | with the new MKTME KeyID (just like TDX does) before using it. | ||
27 | 17 | ||
28 | Therefore, this implementation just flushes cache to make sure there are | 18 | 1) Stop accessing TDX private memory mappings: |
29 | no stale dirty cachelines associated with any TDX private KeyIDs before | 19 | a. Stop making TDX module SEAMCALLs (TDX global KeyID); |
30 | booting to the new kernel, otherwise they may silently corrupt the new | 20 | b. Stop TDX guests from running (per-guest TDX KeyID). |
31 | kernel. | 21 | 2) Flush any cachelines from previous TDX private KeyID writes. |
32 | 22 | ||
33 | Following SME support, use wbinvd() to flush cache in stop_this_cpu(). | 23 | For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME |
24 | support. And in this way 1) happens for free as there's no TDX activity | ||
25 | between wbinvd() and the native_halt(). | ||
26 | |||
27 | Flushing cache in stop_this_cpu() only flushes cache on remote cpus. On | ||
28 | the rebooting cpu which does kexec(), unlike SME which does the cache | ||
29 | flush in relocate_kernel(), flush the cache right after stopping remote | ||
30 | cpus in machine_shutdown(). | ||
31 | |||
32 | There are two reasons to do so: 1) For TDX there's no need to defer | ||
33 | cache flush to relocate_kernel() because all TDX activities have been | ||
34 | stopped. 2) On the platforms with the above erratum the kernel must | ||
35 | convert all TDX private pages back to normal before booting to the new | ||
36 | kernel in kexec(), and flushing cache early allows the kernel to convert | ||
37 | memory early rather than having to muck with the relocate_kernel() | ||
38 | assembly. | ||
39 | |||
34 | Theoretically, cache flush is only needed when the TDX module has been | 40 | Theoretically, cache flush is only needed when the TDX module has been |
35 | initialized. However initializing the TDX module is done on demand at | 41 | initialized. However initializing the TDX module is done on demand at |
36 | runtime, and it takes a mutex to read the module status. Just check | 42 | runtime, and it takes a mutex to read the module status. Just check |
37 | whether TDX is enabled by BIOS instead to flush cache. | 43 | whether TDX is enabled by the BIOS instead to flush cache. |
38 | 44 | ||
39 | Also, the current TDX module doesn't play nicely with kexec(). The TDX | 45 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
40 | module can only be initialized once during its lifetime, and there is no | ||
41 | ABI to reset the module to give a new clean slate to the new kernel. | ||
42 | Therefore ideally, if the TDX module is ever initialized, it's better | ||
43 | to shut it down. The new kernel won't be able to use TDX anyway (as it | ||
44 | needs to go through the TDX module initialization process which will | ||
45 | fail immediately at the first step). | ||
46 | |||
47 | However, shutting down the TDX module requires all CPUs being in VMX | ||
48 | operation, but there's no such guarantee as kexec() can happen at any | ||
49 | time (i.e. when KVM is not even loaded). So just do nothing but leave | ||
50 | leave the TDX module open. | ||
51 | |||
52 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 46 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> |
53 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 47 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
54 | --- | 48 | --- |
55 | 49 | ||
56 | v6 -> v7: | 50 | v13 -> v14: |
57 | - Improved changelog to explain why don't convert TDX private pages back | 51 | - No change |
58 | to normal. | ||
59 | 52 | ||
60 | --- | 53 | --- |
61 | arch/x86/kernel/process.c | 8 +++++++- | 54 | arch/x86/kernel/process.c | 8 +++++++- |
62 | 1 file changed, 7 insertions(+), 1 deletion(-) | 55 | arch/x86/kernel/reboot.c | 15 +++++++++++++++ |
56 | 2 files changed, 22 insertions(+), 1 deletion(-) | ||
63 | 57 | ||
64 | diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c | 58 | diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c |
65 | index XXXXXXX..XXXXXXX 100644 | 59 | index XXXXXXX..XXXXXXX 100644 |
66 | --- a/arch/x86/kernel/process.c | 60 | --- a/arch/x86/kernel/process.c |
67 | +++ b/arch/x86/kernel/process.c | 61 | +++ b/arch/x86/kernel/process.c |
68 | @@ -XXX,XX +XXX,XX @@ void __noreturn stop_this_cpu(void *dummy) | 62 | @@ -XXX,XX +XXX,XX @@ void __noreturn stop_this_cpu(void *dummy) |
69 | * | 63 | * |
70 | * Test the CPUID bit directly because the machine might've cleared | 64 | * Test the CPUID bit directly because the machine might've cleared |
71 | * X86_FEATURE_SME due to cmdline options. | 65 | * X86_FEATURE_SME due to cmdline options. |
72 | + * | 66 | + * |
73 | + * Similar to SME, if the TDX module is ever initialized, the | 67 | + * The TDX module or guests might have left dirty cachelines |
74 | + * cachelines associated with any TDX private KeyID must be flushed | 68 | + * behind. Flush them to avoid corruption from later writeback. |
75 | + * before transiting to the new kernel. The TDX module is initialized | 69 | + * Note that this flushes on all systems where TDX is possible, |
76 | + * on demand, and it takes the mutex to read its status. Just check | 70 | + * but does not actually check that TDX was in use. |
77 | + * whether TDX is enabled by BIOS instead to flush cache. | ||
78 | */ | 71 | */ |
79 | - if (cpuid_eax(0x8000001f) & BIT(0)) | 72 | - if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0))) |
80 | + if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled()) | 73 | + if ((c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0))) |
74 | + || platform_tdx_enabled()) | ||
81 | native_wbinvd(); | 75 | native_wbinvd(); |
82 | for (;;) { | 76 | |
83 | /* | 77 | /* |
78 | diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c | ||
79 | index XXXXXXX..XXXXXXX 100644 | ||
80 | --- a/arch/x86/kernel/reboot.c | ||
81 | +++ b/arch/x86/kernel/reboot.c | ||
82 | @@ -XXX,XX +XXX,XX @@ | ||
83 | #include <asm/realmode.h> | ||
84 | #include <asm/x86_init.h> | ||
85 | #include <asm/efi.h> | ||
86 | +#include <asm/tdx.h> | ||
87 | |||
88 | /* | ||
89 | * Power off function, if any | ||
90 | @@ -XXX,XX +XXX,XX @@ void native_machine_shutdown(void) | ||
91 | local_irq_disable(); | ||
92 | stop_other_cpus(); | ||
93 | #endif | ||
94 | + /* | ||
95 | + * stop_other_cpus() has flushed all dirty cachelines of TDX | ||
96 | + * private memory on remote cpus. Unlike SME, which does the | ||
97 | + * cache flush on _this_ cpu in the relocate_kernel(), flush | ||
98 | + * the cache for _this_ cpu here. This is because on the | ||
99 | + * platforms with "partial write machine check" erratum the | ||
100 | + * kernel needs to convert all TDX private pages back to normal | ||
101 | + * before booting to the new kernel in kexec(), and the cache | ||
102 | + * flush must be done before that. If the kernel took SME's way, | ||
103 | + * it would have to muck with the relocate_kernel() assembly to | ||
104 | + * do memory conversion. | ||
105 | + */ | ||
106 | + if (platform_tdx_enabled()) | ||
107 | + native_wbinvd(); | ||
108 | |||
109 | lapic_shutdown(); | ||
110 | restore_boot_irq_mode(); | ||
84 | -- | 111 | -- |
85 | 2.38.1 | 112 | 2.41.0 | diff view generated by jsdifflib |
1 | TDX module initialization requires using one TDX private KeyID as the | 1 | On platforms with the "partial write machine check" erratum, |
---|---|---|---|
2 | global KeyID to protect the TDX module metadata. The global KeyID is | 2 | kexec() needs to convert all TDX private pages back to normal before |
3 | configured to the TDX module along with TDMRs. | 3 | booting to the new kernel. Otherwise, the new kernel may get an unexpected |
4 | machine check. | ||
4 | 5 | ||
5 | Just reserve the first TDX private KeyID as the global KeyID. Keep the | 6 | There's no existing infrastructure to track TDX private pages. Keep |
6 | global KeyID as a static variable as KVM will need to use it too. | 7 | TDMRs when module initialization is successful so that they can be used |
8 | to find PAMTs. | ||
7 | 9 | ||
8 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | ||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 10 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
11 | Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> | ||
12 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
10 | --- | 13 | --- |
11 | arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++ | 14 | |
12 | 1 file changed, 9 insertions(+) | 15 | v13 -> v14: |
16 | - "Change to keep" -> "Keep" (Kirill) | ||
17 | - Add Kirill/Rick's tags | ||
18 | |||
19 | v12 -> v13: | ||
20 | - Split "improve error handling" part out as a separate patch. | ||
21 | |||
22 | v11 -> v12 (new patch): | ||
23 | - Defer keeping TDMRs logic to this patch for better review | ||
24 | - Improved error handling logic (Nikolay/Kirill in patch 15) | ||
25 | |||
26 | --- | ||
27 | arch/x86/virt/vmx/tdx/tdx.c | 24 +++++++++++------------- | ||
28 | 1 file changed, 11 insertions(+), 13 deletions(-) | ||
13 | 29 | ||
14 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 30 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
15 | index XXXXXXX..XXXXXXX 100644 | 31 | index XXXXXXX..XXXXXXX 100644 |
16 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 32 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
17 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 33 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
18 | @@ -XXX,XX +XXX,XX @@ static int tdx_cmr_num; | 34 | @@ -XXX,XX +XXX,XX @@ static DEFINE_MUTEX(tdx_module_lock); |
19 | /* All TDX-usable memory regions */ | 35 | /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ |
20 | static LIST_HEAD(tdx_memlist); | 36 | static LIST_HEAD(tdx_memlist); |
21 | 37 | ||
22 | +/* TDX module global KeyID. Used in TDH.SYS.CONFIG ABI. */ | 38 | +static struct tdmr_info_list tdx_tdmr_list; |
23 | +static u32 tdx_global_keyid; | ||
24 | + | 39 | + |
25 | /* | 40 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); |
26 | * Detect TDX private KeyIDs to see whether TDX has been enabled by the | 41 | |
27 | * BIOS. Both initializing the TDX module and running TDX guest require | 42 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) |
43 | @@ -XXX,XX +XXX,XX @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list) | ||
44 | static int init_tdx_module(void) | ||
45 | { | ||
46 | struct tdsysinfo_struct *tdsysinfo; | ||
47 | - struct tdmr_info_list tdmr_list; | ||
48 | struct cmr_info *cmr_array; | ||
49 | int tdsysinfo_size; | ||
50 | int cmr_array_size; | ||
28 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 51 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
52 | goto out_put_tdxmem; | ||
53 | |||
54 | /* Allocate enough space for constructing TDMRs */ | ||
55 | - ret = alloc_tdmr_list(&tdmr_list, tdsysinfo); | ||
56 | + ret = alloc_tdmr_list(&tdx_tdmr_list, tdsysinfo); | ||
57 | if (ret) | ||
58 | goto out_free_tdxmem; | ||
59 | |||
60 | /* Cover all TDX-usable memory regions in TDMRs */ | ||
61 | - ret = construct_tdmrs(&tdx_memlist, &tdmr_list, tdsysinfo); | ||
62 | + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, tdsysinfo); | ||
29 | if (ret) | 63 | if (ret) |
30 | goto out_free_tdmrs; | 64 | goto out_free_tdmrs; |
31 | 65 | ||
32 | + /* | 66 | /* Pass the TDMRs and the global KeyID to the TDX module */ |
33 | + * Reserve the first TDX KeyID as global KeyID to protect | 67 | - ret = config_tdx_module(&tdmr_list, tdx_global_keyid); |
34 | + * TDX module metadata. | 68 | + ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid); |
35 | + */ | 69 | if (ret) |
36 | + tdx_global_keyid = tdx_keyid_start; | 70 | goto out_free_pamts; |
37 | + | 71 | |
38 | /* | 72 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
39 | * Return -EINVAL until all steps of TDX module initialization | 73 | goto out_reset_pamts; |
40 | * process are done. | 74 | |
75 | /* Initialize TDMRs to complete the TDX module initialization */ | ||
76 | - ret = init_tdmrs(&tdmr_list); | ||
77 | + ret = init_tdmrs(&tdx_tdmr_list); | ||
78 | out_reset_pamts: | ||
79 | if (ret) { | ||
80 | /* | ||
81 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
82 | * back to normal. But do the conversion anyway here | ||
83 | * as suggested by the TDX spec. | ||
84 | */ | ||
85 | - tdmrs_reset_pamt_all(&tdmr_list); | ||
86 | + tdmrs_reset_pamt_all(&tdx_tdmr_list); | ||
87 | } | ||
88 | out_free_pamts: | ||
89 | if (ret) | ||
90 | - tdmrs_free_pamt_all(&tdmr_list); | ||
91 | + tdmrs_free_pamt_all(&tdx_tdmr_list); | ||
92 | else | ||
93 | pr_info("%lu KBs allocated for PAMT\n", | ||
94 | - tdmrs_count_pamt_kb(&tdmr_list)); | ||
95 | + tdmrs_count_pamt_kb(&tdx_tdmr_list)); | ||
96 | out_free_tdmrs: | ||
97 | - /* | ||
98 | - * Always free the buffer of TDMRs as they are only used during | ||
99 | - * module initialization. | ||
100 | - */ | ||
101 | - free_tdmr_list(&tdmr_list); | ||
102 | + if (ret) | ||
103 | + free_tdmr_list(&tdx_tdmr_list); | ||
104 | out_free_tdxmem: | ||
105 | if (ret) | ||
106 | free_tdx_memlist(&tdx_memlist); | ||
41 | -- | 107 | -- |
42 | 2.38.1 | 108 | 2.41.0 | diff view generated by jsdifflib |
1 | The first step of initializing the module is to call TDH.SYS.INIT once | 1 | With TDMRs being kept upon successful TDX module initialization, only |
---|---|---|---|
2 | on any logical cpu to do module global initialization. Do the module | 2 | put_online_mems() and freeing the buffers of the TDSYSINFO_STRUCT and |
3 | global initialization. | 3 | the CMR array still need to be done even when module initialization is |
4 | successful. On the other hand, all other four "out_*" labels before | ||
5 | them explicitly check the return value and only clean up when module | ||
6 | initialization fails. | ||
4 | 7 | ||
5 | It also detects the TDX module, as seamcall() returns -ENODEV when the | 8 | This isn't ideal. Make all other four "out_*" labels only reachable |
6 | module is not loaded. | 9 | when module initialization fails to improve the readability of error |
10 | handling. Rename them from "out_*" to "err_*" to reflect the fact. | ||
7 | 11 | ||
8 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 12 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
13 | Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> | ||
14 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
9 | --- | 15 | --- |
10 | 16 | ||
11 | v6 -> v7: | 17 | v13 -> v14: |
12 | - Improved changelog. | 18 | - Fix spell typo (Rick) |
19 | - Add Kirill/Rick's tags | ||
20 | |||
21 | v12 -> v13: | ||
22 | - New patch to improve error handling. (Kirill, Nikolay) | ||
13 | 23 | ||
14 | --- | 24 | --- |
15 | arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++-- | 25 | arch/x86/virt/vmx/tdx/tdx.c | 67 +++++++++++++++++++------------------ |
16 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 26 | 1 file changed, 34 insertions(+), 33 deletions(-) |
17 | 2 files changed, 18 insertions(+), 2 deletions(-) | ||
18 | 27 | ||
19 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 28 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
20 | index XXXXXXX..XXXXXXX 100644 | 29 | index XXXXXXX..XXXXXXX 100644 |
21 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 30 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
22 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 31 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
23 | @@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | 32 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) |
24 | */ | 33 | /* Allocate enough space for constructing TDMRs */ |
25 | static int init_tdx_module(void) | 34 | ret = alloc_tdmr_list(&tdx_tdmr_list, tdsysinfo); |
26 | { | 35 | if (ret) |
27 | - /* The TDX module hasn't been detected */ | 36 | - goto out_free_tdxmem; |
28 | - return -ENODEV; | 37 | + goto err_free_tdxmem; |
29 | + int ret; | 38 | |
39 | /* Cover all TDX-usable memory regions in TDMRs */ | ||
40 | ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, tdsysinfo); | ||
41 | if (ret) | ||
42 | - goto out_free_tdmrs; | ||
43 | + goto err_free_tdmrs; | ||
44 | |||
45 | /* Pass the TDMRs and the global KeyID to the TDX module */ | ||
46 | ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid); | ||
47 | if (ret) | ||
48 | - goto out_free_pamts; | ||
49 | + goto err_free_pamts; | ||
50 | |||
51 | /* | ||
52 | * Hardware doesn't guarantee cache coherency across different | ||
53 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
54 | /* Config the key of global KeyID on all packages */ | ||
55 | ret = config_global_keyid(); | ||
56 | if (ret) | ||
57 | - goto out_reset_pamts; | ||
58 | + goto err_reset_pamts; | ||
59 | |||
60 | /* Initialize TDMRs to complete the TDX module initialization */ | ||
61 | ret = init_tdmrs(&tdx_tdmr_list); | ||
62 | -out_reset_pamts: | ||
63 | - if (ret) { | ||
64 | - /* | ||
65 | - * Part of PAMTs may already have been initialized by the | ||
66 | - * TDX module. Flush cache before returning PAMTs back | ||
67 | - * to the kernel. | ||
68 | - */ | ||
69 | - wbinvd_on_all_cpus(); | ||
70 | - /* | ||
71 | - * According to the TDX hardware spec, if the platform | ||
72 | - * doesn't have the "partial write machine check" | ||
73 | - * erratum, any kernel read/write will never cause #MC | ||
74 | - * in kernel space, thus it's OK to not convert PAMTs | ||
75 | - * back to normal. But do the conversion anyway here | ||
76 | - * as suggested by the TDX spec. | ||
77 | - */ | ||
78 | - tdmrs_reset_pamt_all(&tdx_tdmr_list); | ||
79 | - } | ||
80 | -out_free_pamts: | ||
81 | if (ret) | ||
82 | - tdmrs_free_pamt_all(&tdx_tdmr_list); | ||
83 | - else | ||
84 | - pr_info("%lu KBs allocated for PAMT\n", | ||
85 | - tdmrs_count_pamt_kb(&tdx_tdmr_list)); | ||
86 | -out_free_tdmrs: | ||
87 | - if (ret) | ||
88 | - free_tdmr_list(&tdx_tdmr_list); | ||
89 | -out_free_tdxmem: | ||
90 | - if (ret) | ||
91 | - free_tdx_memlist(&tdx_memlist); | ||
92 | + goto err_reset_pamts; | ||
30 | + | 93 | + |
94 | + pr_info("%lu KBs allocated for PAMT\n", | ||
95 | + tdmrs_count_pamt_kb(&tdx_tdmr_list)); | ||
96 | + | ||
97 | out_put_tdxmem: | ||
98 | /* | ||
99 | * @tdx_memlist is written here and read at memory hotplug time. | ||
100 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
101 | kfree(tdsysinfo); | ||
102 | kfree(cmr_array); | ||
103 | return ret; | ||
104 | + | ||
105 | +err_reset_pamts: | ||
31 | + /* | 106 | + /* |
32 | + * Call TDH.SYS.INIT to do the global initialization of | 107 | + * Part of PAMTs may already have been initialized by the |
33 | + * the TDX module. It also detects the module. | 108 | + * TDX module. Flush cache before returning PAMTs back |
109 | + * to the kernel. | ||
34 | + */ | 110 | + */ |
35 | + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); | 111 | + wbinvd_on_all_cpus(); |
36 | + if (ret) | ||
37 | + goto out; | ||
38 | + | ||
39 | + /* | 112 | + /* |
40 | + * Return -EINVAL until all steps of TDX module initialization | 113 | + * According to the TDX hardware spec, if the platform |
41 | + * process are done. | 114 | + * doesn't have the "partial write machine check" |
115 | + * erratum, any kernel read/write will never cause #MC | ||
116 | + * in kernel space, thus it's OK to not convert PAMTs | ||
117 | + * back to normal. But do the conversion anyway here | ||
118 | + * as suggested by the TDX spec. | ||
42 | + */ | 119 | + */ |
43 | + ret = -EINVAL; | 120 | + tdmrs_reset_pamt_all(&tdx_tdmr_list); |
44 | +out: | 121 | +err_free_pamts: |
45 | + return ret; | 122 | + tdmrs_free_pamt_all(&tdx_tdmr_list); |
123 | +err_free_tdmrs: | ||
124 | + free_tdmr_list(&tdx_tdmr_list); | ||
125 | +err_free_tdxmem: | ||
126 | + free_tdx_memlist(&tdx_memlist); | ||
127 | + /* Do things irrelevant to module initialization result */ | ||
128 | + goto out_put_tdxmem; | ||
46 | } | 129 | } |
47 | 130 | ||
48 | static void shutdown_tdx_module(void) | 131 | static int __tdx_enable(void) |
49 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
50 | index XXXXXXX..XXXXXXX 100644 | ||
51 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
52 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
53 | @@ -XXX,XX +XXX,XX @@ | ||
54 | /* | ||
55 | * TDX module SEAMCALL leaf functions | ||
56 | */ | ||
57 | +#define TDH_SYS_INIT 33 | ||
58 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
59 | |||
60 | /* | ||
61 | -- | 132 | -- |
62 | 2.38.1 | 133 | 2.41.0 | diff view generated by jsdifflib |
1 | TDX supports shutting down the TDX module at any time during its | 1 | The first few generations of TDX hardware have an erratum. A partial |
---|---|---|---|
2 | lifetime. After the module is shut down, no further TDX module SEAMCALL | 2 | write to a TDX private memory cacheline will silently "poison" the |
3 | leaf functions can be made to the module on any logical cpu. | 3 | line. Subsequent reads will consume the poison and generate a machine |
4 | 4 | check. According to the TDX hardware spec, neither of these things | |
5 | Shut down the TDX module in case of any error during the initialization | 5 | should have happened. |
6 | process. It's pointless to leave the TDX module in some middle state. | 6 | |
7 | 7 | == Background == | |
8 | Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all | 8 | |
9 | BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different | 9 | Virtually all kernel memory access operations happen in full |
10 | CPUs. Implement a mechanism to run SEAMCALL concurrently on all online | 10 | cachelines. In practice, writing a "byte" of memory usually reads a 64 |
11 | CPUs and use it to shut down the module. Later logical-cpu scope module | 11 | byte cacheline of memory, modifies it, then writes the whole line back. |
12 | initialization will use it too. | 12 | Those operations do not trigger this problem. |
13 | 13 | ||
14 | Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> | 14 | This problem is triggered by "partial" writes where a write transaction |
15 | of less than a cacheline lands at the memory controller. The CPU does | ||
16 | these via non-temporal write instructions (like MOVNTI), or through | ||
17 | UC/WC memory mappings. The issue can also be triggered away from the | ||
18 | CPU by devices doing partial writes via DMA. | ||
19 | |||
20 | == Problem == | ||
21 | |||
22 | A fast warm reset doesn't reset TDX private memory. Kexec() can also | ||
23 | boot into the new kernel directly. Thus if the old kernel has enabled | ||
24 | TDX on the platform with this erratum, the new kernel may get an unexpected | ||
25 | machine check. | ||
26 | |||
27 | Note that w/o this erratum any kernel read/write on TDX private memory | ||
28 | should never cause a machine check, thus it's OK for the old kernel to | ||
29 | leave TDX private pages as is. | ||
30 | |||
31 | == Solution == | ||
32 | |||
33 | In short, with this erratum, the kernel needs to explicitly convert all | ||
34 | TDX private pages back to normal to give the new kernel a clean slate | ||
35 | after kexec(). The BIOS is also expected to disable fast warm reset as | ||
36 | a workaround to this erratum, thus this implementation doesn't try to | ||
37 | reset TDX private memory for the reboot case in the kernel but depend on | ||
38 | the BIOS to enable the workaround. | ||
39 | |||
40 | Convert TDX private pages back to normal after all remote cpus have been | ||
41 | stopped and cache flush has been done on all cpus, when no more TDX | ||
42 | activity can happen further. Do it in machine_kexec() to avoid the | ||
43 | additional overhead to the normal reboot/shutdown as the kernel depends | ||
44 | on the BIOS to disable fast warm reset for the reboot case. | ||
45 | |||
46 | For now TDX private memory can only be PAMT pages. It would be ideal to | ||
47 | cover all types of TDX private memory here, but there are practical | ||
48 | problems to do so: | ||
49 | |||
50 | 1) There's no existing infrastructure to track TDX private pages; | ||
51 | 2) It's not feasible to query the TDX module about page type because VMX | ||
52 | has already been stopped when KVM receives the reboot notifier, plus | ||
53 | the result from the TDX module may not be accurate (e.g., the remote | ||
54 | CPU could be stopped right before MOVDIR64B). | ||
55 | |||
56 | One temporary solution is to blindly convert all memory pages, but it's | ||
57 | also problematic, because not all pages are mapped as writable | ||
58 | in the direct mapping. It can be done by switching to the identical | ||
59 | mapping created for kexec() or a new page table, but the complexity | ||
60 | looks overkill. | ||
61 | |||
62 | Therefore, rather than doing something dramatic, only reset PAMT pages | ||
63 | here. Other kernel components which use TDX need to do the conversion | ||
64 | on their own by intercepting the rebooting/shutdown notifier (KVM | ||
65 | already does that). | ||
66 | |||
67 | Note kexec() can happen at any time, including when TDX module is being | ||
68 | initialized. Register a TDX reboot notifier callback to stop further TDX | ||
69 | module initialization. If there's any ongoing module initialization, | ||
70 | wait until it finishes. This makes sure the TDX module status is stable | ||
71 | after the reboot notifier callback, and the later kexec() code can read | ||
72 | module status to decide whether PAMTs are stable and available. | ||
73 | |||
74 | Also stop further TDX module initialization in case of machine shutdown | ||
75 | and halt, but not limited to kexec(), as there's no reason to do so in | ||
76 | these cases either. | ||
77 | |||
15 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 78 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
79 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
16 | --- | 80 | --- |
17 | 81 | ||
18 | v6 -> v7: | 82 | v13 -> v14: |
19 | - No change. | 83 | - Skip resetting TDX private memory when preserve_context is true (Rick) |
20 | 84 | - Use reboot notifier to stop TDX module initialization at early time of | |
21 | v5 -> v6: | 85 | kexec() to make module status stable, to avoid using a new variable |
22 | - Removed the seamcall() wrapper to previous patch (Dave). | 86 | and memory barrier (which is tricky to review). |
23 | 87 | - Added Kirill's tag | |
24 | - v3 -> v5 (no feedback on v4): | 88 | |
25 | - Added a wrapper of __seamcall() to print error code if SEAMCALL fails. | 89 | v12 -> v13: |
26 | - Made the seamcall_on_each_cpu() void. | 90 | - Improve comments to explain why barrier is needed and ignore WBINVD. |
27 | - Removed 'seamcall_ret' and 'tdx_module_out' from | 91 | (Dave) |
28 | 'struct seamcall_ctx', as they must be local variable. | 92 | - Improve comments to document memory ordering. (Nikolay) |
29 | - Added the comments to tdx_init() and one paragraph to changelog to | 93 | - Made comments/changelog slightly more concise. |
30 | explain the caller should handle VMXON. | 94 | |
31 | - Called out after shut down, no "TDX module" SEAMCALL can be made. | 95 | v11 -> v12: |
96 | - Changed comment/changelog to say kernel doesn't try to handle fast | ||
97 | warm reset but depends on BIOS to enable workaround (Kirill) | ||
98 | - Added a new tdx_may_has_private_mem to indicate system may have TDX | ||
99 | private memory and PAMTs/TDMRs are stable to access. (Dave). | ||
100 | - Use atomic_t for tdx_may_has_private_mem for build-in memory barrier | ||
101 | (Dave) | ||
102 | - Changed calling x86_platform.memory_shutdown() to calling | ||
103 | tdx_reset_memory() directly from machine_kexec() to avoid overhead to | ||
104 | normal reboot case. | ||
105 | |||
106 | v10 -> v11: | ||
107 | - New patch | ||
32 | 108 | ||
33 | --- | 109 | --- |
34 | arch/x86/virt/vmx/tdx/tdx.c | 43 +++++++++++++++++++++++++++++++++---- | 110 | arch/x86/include/asm/tdx.h | 2 + |
35 | arch/x86/virt/vmx/tdx/tdx.h | 5 +++++ | 111 | arch/x86/kernel/machine_kexec_64.c | 16 ++++++ |
36 | 2 files changed, 44 insertions(+), 4 deletions(-) | 112 | arch/x86/virt/vmx/tdx/tdx.c | 92 ++++++++++++++++++++++++++++++ |
37 | 113 | 3 files changed, 110 insertions(+) | |
114 | |||
115 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | ||
116 | index XXXXXXX..XXXXXXX 100644 | ||
117 | --- a/arch/x86/include/asm/tdx.h | ||
118 | +++ b/arch/x86/include/asm/tdx.h | ||
119 | @@ -XXX,XX +XXX,XX @@ static inline u64 sc_retry(sc_func_t func, u64 fn, | ||
120 | bool platform_tdx_enabled(void); | ||
121 | int tdx_cpu_enable(void); | ||
122 | int tdx_enable(void); | ||
123 | +void tdx_reset_memory(void); | ||
124 | #else | ||
125 | static inline bool platform_tdx_enabled(void) { return false; } | ||
126 | static inline int tdx_cpu_enable(void) { return -ENODEV; } | ||
127 | static inline int tdx_enable(void) { return -ENODEV; } | ||
128 | +static inline void tdx_reset_memory(void) { } | ||
129 | #endif /* CONFIG_INTEL_TDX_HOST */ | ||
130 | |||
131 | #endif /* !__ASSEMBLY__ */ | ||
132 | diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c | ||
133 | index XXXXXXX..XXXXXXX 100644 | ||
134 | --- a/arch/x86/kernel/machine_kexec_64.c | ||
135 | +++ b/arch/x86/kernel/machine_kexec_64.c | ||
136 | @@ -XXX,XX +XXX,XX @@ | ||
137 | #include <asm/setup.h> | ||
138 | #include <asm/set_memory.h> | ||
139 | #include <asm/cpu.h> | ||
140 | +#include <asm/tdx.h> | ||
141 | |||
142 | #ifdef CONFIG_ACPI | ||
143 | /* | ||
144 | @@ -XXX,XX +XXX,XX @@ void machine_kexec(struct kimage *image) | ||
145 | void *control_page; | ||
146 | int save_ftrace_enabled; | ||
147 | |||
148 | + /* | ||
149 | + * For platforms with TDX "partial write machine check" erratum, | ||
150 | + * all TDX private pages need to be converted back to normal | ||
151 | + * before booting to the new kernel, otherwise the new kernel | ||
152 | + * may get unexpected machine check. | ||
153 | + * | ||
154 | + * But skip this when preserve_context is on. The second kernel | ||
155 | + * shouldn't write to the first kernel's memory anyway. Skipping | ||
156 | + * this also avoids killing TDX in the first kernel, which would | ||
157 | + * require more complicated handling. | ||
158 | + */ | ||
159 | #ifdef CONFIG_KEXEC_JUMP | ||
160 | if (image->preserve_context) | ||
161 | save_processor_state(); | ||
162 | + else | ||
163 | + tdx_reset_memory(); | ||
164 | +#else | ||
165 | + tdx_reset_memory(); | ||
166 | #endif | ||
167 | |||
168 | save_ftrace_enabled = __ftrace_enabled_save(); | ||
38 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 169 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
39 | index XXXXXXX..XXXXXXX 100644 | 170 | index XXXXXXX..XXXXXXX 100644 |
40 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 171 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
41 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 172 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
42 | @@ -XXX,XX +XXX,XX @@ | 173 | @@ -XXX,XX +XXX,XX @@ |
43 | #include <linux/mutex.h> | 174 | #include <linux/align.h> |
44 | #include <linux/cpu.h> | 175 | #include <linux/sort.h> |
45 | #include <linux/cpumask.h> | 176 | #include <linux/log2.h> |
46 | +#include <linux/smp.h> | 177 | +#include <linux/reboot.h> |
47 | +#include <linux/atomic.h> | ||
48 | #include <asm/msr-index.h> | 178 | #include <asm/msr-index.h> |
49 | #include <asm/msr.h> | 179 | #include <asm/msr.h> |
50 | #include <asm/apic.h> | 180 | #include <asm/page.h> |
51 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void) | 181 | @@ -XXX,XX +XXX,XX @@ static LIST_HEAD(tdx_memlist); |
52 | return !!tdx_keyid_num; | 182 | |
183 | static struct tdmr_info_list tdx_tdmr_list; | ||
184 | |||
185 | +static bool tdx_rebooting; | ||
186 | + | ||
187 | typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); | ||
188 | |||
189 | static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) | ||
190 | @@ -XXX,XX +XXX,XX @@ static int __tdx_enable(void) | ||
191 | { | ||
192 | int ret; | ||
193 | |||
194 | + if (tdx_rebooting) | ||
195 | + return -EAGAIN; | ||
196 | + | ||
197 | ret = init_tdx_module(); | ||
198 | if (ret) { | ||
199 | pr_err("module initialization failed (%d)\n", ret); | ||
200 | @@ -XXX,XX +XXX,XX @@ int tdx_enable(void) | ||
53 | } | 201 | } |
202 | EXPORT_SYMBOL_GPL(tdx_enable); | ||
54 | 203 | ||
55 | +/* | 204 | +/* |
56 | + * Data structure to make SEAMCALL on multiple CPUs concurrently. | 205 | + * Convert TDX private pages back to normal on platforms with |
57 | + * @err is set to -EFAULT when SEAMCALL fails on any cpu. | 206 | + * "partial write machine check" erratum. |
207 | + * | ||
208 | + * Called from machine_kexec() before booting to the new kernel. | ||
58 | + */ | 209 | + */ |
59 | +struct seamcall_ctx { | 210 | +void tdx_reset_memory(void) |
60 | + u64 fn; | 211 | +{ |
61 | + u64 rcx; | 212 | + if (!platform_tdx_enabled()) |
62 | + u64 rdx; | 213 | + return; |
63 | + u64 r8; | 214 | + |
64 | + u64 r9; | 215 | + /* |
65 | + atomic_t err; | 216 | + * Kernel read/write to TDX private memory doesn't |
217 | + * cause machine check on hardware w/o this erratum. | ||
218 | + */ | ||
219 | + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) | ||
220 | + return; | ||
221 | + | ||
222 | + /* Called from kexec() when only rebooting cpu is alive */ | ||
223 | + WARN_ON_ONCE(num_online_cpus() != 1); | ||
224 | + | ||
225 | + /* | ||
226 | + * tdx_reboot_notifier() waits until ongoing TDX module | ||
227 | + * initialization to finish, and module initialization is | ||
228 | + * rejected after that. Therefore @tdx_module_status is | ||
229 | + * stable here and can be read w/o holding lock. | ||
230 | + */ | ||
231 | + if (tdx_module_status != TDX_MODULE_INITIALIZED) | ||
232 | + return; | ||
233 | + | ||
234 | + /* | ||
235 | + * Convert PAMTs back to normal. All other cpus are already | ||
236 | + * dead and TDMRs/PAMTs are stable. | ||
237 | + * | ||
238 | + * Ideally it's better to cover all types of TDX private pages | ||
239 | + * here, but it's impractical: | ||
240 | + * | ||
241 | + * - There's no existing infrastructure to tell whether a page | ||
242 | + * is TDX private memory or not. | ||
243 | + * | ||
244 | + * - Using SEAMCALL to query TDX module isn't feasible either: | ||
245 | + * - VMX has been turned off by reaching here so SEAMCALL | ||
246 | + * cannot be made; | ||
247 | + * - Even if a SEAMCALL can be made, the result from TDX module may | ||
248 | + * not be accurate (e.g., remote CPU can be stopped while | ||
249 | + * the kernel is in the middle of reclaiming TDX private | ||
250 | + * page and doing MOVDIR64B). | ||
251 | + * | ||
252 | + * One temporary solution could be just converting all memory | ||
253 | + * pages, but it's problematic too, because not all pages are | ||
254 | + * mapped as writable in direct mapping. It can be done by | ||
255 | + * switching to the identical mapping for kexec() or a new page | ||
256 | + * table which maps all pages as writable, but the complexity is | ||
257 | + * overkill. | ||
258 | + * | ||
259 | + * Thus instead of doing something dramatic to convert all pages, | ||
260 | + * only convert PAMTs here. Other kernel components which use | ||
261 | + * TDX need to do the conversion on their own by intercepting the | ||
262 | + * rebooting/shutdown notifier (KVM already does that). | ||
263 | + */ | ||
264 | + tdmrs_reset_pamt_all(&tdx_tdmr_list); | ||
265 | +} | ||
266 | + | ||
267 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, | ||
268 | u32 *nr_tdx_keyids) | ||
269 | { | ||
270 | @@ -XXX,XX +XXX,XX @@ static struct notifier_block tdx_memory_nb = { | ||
271 | .notifier_call = tdx_memory_notifier, | ||
272 | }; | ||
273 | |||
274 | +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode, | ||
275 | + void *unused) | ||
276 | +{ | ||
277 | + /* Wait ongoing TDX initialization to finish */ | ||
278 | + mutex_lock(&tdx_module_lock); | ||
279 | + tdx_rebooting = true; | ||
280 | + mutex_unlock(&tdx_module_lock); | ||
281 | + | ||
282 | + return NOTIFY_OK; | ||
283 | +} | ||
284 | + | ||
285 | +static struct notifier_block tdx_reboot_nb = { | ||
286 | + .notifier_call = tdx_reboot_notifier, | ||
66 | +}; | 287 | +}; |
67 | + | 288 | + |
68 | /* | 289 | static int __init tdx_init(void) |
69 | * Wrapper of __seamcall() to convert SEAMCALL leaf function error code | ||
70 | * to kernel error code. @seamcall_ret and @out contain the SEAMCALL | ||
71 | * leaf function return code and the additional output respectively if | ||
72 | * not NULL. | ||
73 | */ | ||
74 | -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | ||
75 | - u64 *seamcall_ret, | ||
76 | - struct tdx_module_output *out) | ||
77 | +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | ||
78 | + u64 *seamcall_ret, struct tdx_module_output *out) | ||
79 | { | 290 | { |
80 | u64 sret; | 291 | u32 tdx_keyid_start, nr_tdx_keyids; |
81 | 292 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | |
82 | @@ -XXX,XX +XXX,XX @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | 293 | return -ENODEV; |
83 | } | 294 | } |
84 | } | 295 | |
85 | 296 | + err = register_reboot_notifier(&tdx_reboot_nb); | |
86 | +static void seamcall_smp_call_function(void *data) | 297 | + if (err) { |
87 | +{ | 298 | + pr_err("initialization failed: register_reboot_notifier() failed (%d)\n", |
88 | + struct seamcall_ctx *sc = data; | 299 | + err); |
89 | + int ret; | 300 | + unregister_memory_notifier(&tdx_memory_nb); |
90 | + | 301 | + return -ENODEV; |
91 | + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL); | 302 | + } |
92 | + if (ret) | 303 | + |
93 | + atomic_set(&sc->err, -EFAULT); | 304 | /* |
94 | +} | 305 | * Just use the first TDX KeyID as the 'global KeyID' and |
95 | + | 306 | * leave the rest for TDX guests. |
96 | +/* | ||
97 | + * Call the SEAMCALL on all online CPUs concurrently. Caller to check | ||
98 | + * @sc->err to determine whether any SEAMCALL failed on any cpu. | ||
99 | + */ | ||
100 | +static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | ||
101 | +{ | ||
102 | + on_each_cpu(seamcall_smp_call_function, sc, true); | ||
103 | +} | ||
104 | + | ||
105 | /* | ||
106 | * Detect and initialize the TDX module. | ||
107 | * | ||
108 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | ||
109 | |||
110 | static void shutdown_tdx_module(void) | ||
111 | { | ||
112 | - /* TODO: Shut down the TDX module */ | ||
113 | + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN }; | ||
114 | + | ||
115 | + seamcall_on_each_cpu(&sc); | ||
116 | } | ||
117 | |||
118 | static int __tdx_enable(void) | ||
119 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
120 | index XXXXXXX..XXXXXXX 100644 | ||
121 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
122 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
123 | @@ -XXX,XX +XXX,XX @@ | ||
124 | /* MSR to report KeyID partitioning between MKTME and TDX */ | ||
125 | #define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 | ||
126 | |||
127 | +/* | ||
128 | + * TDX module SEAMCALL leaf functions | ||
129 | + */ | ||
130 | +#define TDH_SYS_LP_SHUTDOWN 44 | ||
131 | + | ||
132 | /* | ||
133 | * Do not put any hardware-defined TDX structure representations below | ||
134 | * this comment! | ||
135 | -- | 307 | -- |
136 | 2.38.1 | 308 | 2.41.0 | diff view generated by jsdifflib |
1 | After the global module initialization, the next step is logical-cpu | 1 | TDX cannot survive S3 and deeper states. The hardware resets and |
---|---|---|---|
2 | scope module initialization. Logical-cpu initialization requires | 2 | disables TDX completely when the platform goes to S3 or deeper. Both TDX |
3 | calling TDH.SYS.LP.INIT on all BIOS-enabled CPUs. This SEAMCALL can run | 3 | guests and the TDX module get destroyed permanently. |
4 | concurrently on all CPUs. | ||
5 | 4 | ||
6 | Use the helper introduced for shutting down the module to do logical-cpu | 5 | The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to |
7 | scope initialization. | 6 | support hibernation. The kernel also maintains TDX states to track |
7 | whether it has been initialized and its metadata resource, etc. After | ||
8 | resuming from S3 or hibernation, these TDX states won't be correct | ||
9 | anymore. | ||
10 | |||
11 | Theoretically, the kernel can do more complicated things like resetting | ||
12 | TDX internal states and TDX module metadata before going to S3 or | ||
13 | deeper, and re-initialize the TDX module after resuming, etc., but there is | ||
14 | no way to save/restore TDX guests for now. | ||
15 | |||
16 | Until TDX supports full save and restore of TDX guests, there is no big | ||
17 | value in handling the TDX module for suspend and hibernation alone. To make | ||
18 | things simple, just choose to make TDX mutually exclusive with S3 and | ||
19 | hibernation. | ||
20 | |||
21 | Note the TDX module is initialized at runtime. To avoid having to deal | ||
22 | with the fuss of determining TDX state at runtime, just choose TDX vs S3 | ||
23 | and hibernation at kernel early boot. It's a bad user experience if the | ||
24 | choice of TDX and S3/hibernation is done at runtime anyway, i.e., the | ||
25 | user can experience being able to do S3/hibernation but later becoming | ||
26 | unable to due to TDX being enabled. | ||
27 | |||
28 | Disable TDX in kernel early boot when hibernation is available, and give | ||
29 | a message telling the user to disable hibernation via kernel command | ||
30 | line in order to use TDX. Currently there's no mechanism exposed by the | ||
31 | hibernation code to allow other kernel code to disable hibernation once | ||
32 | for all. | ||
33 | |||
34 | Disable ACPI S3 by setting acpi_suspend_lowlevel function pointer to | ||
35 | NULL when TDX is enabled by the BIOS. This avoids having to modify the | ||
36 | ACPI code to disable ACPI S3 in other ways. | ||
37 | |||
38 | Also give a message telling the user to disable TDX in the BIOS in order | ||
39 | to use ACPI S3. A new kernel command line can be added in the future if | ||
40 | there's a need to let the user disable TDX host via kernel command line. | ||
8 | 41 | ||
9 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 42 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
43 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
10 | --- | 44 | --- |
11 | arch/x86/virt/vmx/tdx/tdx.c | 14 ++++++++++++++ | 45 | |
12 | arch/x86/virt/vmx/tdx/tdx.h | 1 + | 46 | v13 -> v14: |
13 | 2 files changed, 15 insertions(+) | 47 | - New patch |
48 | |||
49 | --- | ||
50 | arch/x86/virt/vmx/tdx/tdx.c | 23 +++++++++++++++++++++++ | ||
51 | 1 file changed, 23 insertions(+) | ||
14 | 52 | ||
15 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 53 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
16 | index XXXXXXX..XXXXXXX 100644 | 54 | index XXXXXXX..XXXXXXX 100644 |
17 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 55 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
18 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 56 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
19 | @@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc) | 57 | @@ -XXX,XX +XXX,XX @@ |
20 | on_each_cpu(seamcall_smp_call_function, sc, true); | 58 | #include <linux/sort.h> |
21 | } | 59 | #include <linux/log2.h> |
22 | 60 | #include <linux/reboot.h> | |
23 | +static int tdx_module_init_cpus(void) | 61 | +#include <linux/suspend.h> |
24 | +{ | 62 | #include <asm/msr-index.h> |
25 | + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT }; | 63 | #include <asm/msr.h> |
64 | #include <asm/page.h> | ||
65 | #include <asm/special_insns.h> | ||
66 | +#include <asm/acpi.h> | ||
67 | #include <asm/tdx.h> | ||
68 | #include "tdx.h" | ||
69 | |||
70 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) | ||
71 | return -ENODEV; | ||
72 | } | ||
73 | |||
74 | +#define HIBERNATION_MSG \ | ||
75 | + "Disable TDX because hibernation is available. Use 'nohibernate' command line to disable hibernation." | ||
76 | + /* | ||
77 | + * Note hibernation_available() can vary when it is called at | ||
78 | + * runtime as it checks secretmem_active() and cxl_mem_active() | ||
79 | + * which can both vary at runtime. But here at early_init() they | ||
80 | + * both cannot return true, thus when hibernation_available() | ||
81 | + * returns false here, hibernation is disabled by either | ||
82 | + * 'nohibernate' or LOCKDOWN_HIBERNATION security lockdown, | ||
83 | + * which are both permanent. | ||
84 | + */ | ||
85 | + if (hibernation_available()) { | ||
86 | + pr_err("initialization failed: %s\n", HIBERNATION_MSG); | ||
87 | + return -ENODEV; | ||
88 | + } | ||
26 | + | 89 | + |
27 | + seamcall_on_each_cpu(&sc); | 90 | err = register_memory_notifier(&tdx_memory_nb); |
28 | + | 91 | if (err) { |
29 | + return atomic_read(&sc.err); | 92 | pr_err("initialization failed: register_memory_notifier() failed (%d)\n", |
30 | +} | 93 | @@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void) |
31 | + | 94 | return -ENODEV; |
32 | /* | 95 | } |
33 | * Detect and initialize the TDX module. | 96 | |
34 | * | 97 | +#ifdef CONFIG_ACPI |
35 | @@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void) | 98 | + pr_info("Disable ACPI S3 suspend. Turn off TDX in the BIOS to use ACPI S3.\n"); |
36 | if (ret) | 99 | + acpi_suspend_lowlevel = NULL; |
37 | goto out; | 100 | +#endif |
38 | |||
39 | + /* Logical-cpu scope initialization */ | ||
40 | + ret = tdx_module_init_cpus(); | ||
41 | + if (ret) | ||
42 | + goto out; | ||
43 | + | 101 | + |
44 | /* | 102 | /* |
45 | * Return -EINVAL until all steps of TDX module initialization | 103 | * Just use the first TDX KeyID as the 'global KeyID' and |
46 | * process are done. | 104 | * leave the rest for TDX guests. |
47 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | ||
48 | index XXXXXXX..XXXXXXX 100644 | ||
49 | --- a/arch/x86/virt/vmx/tdx/tdx.h | ||
50 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | ||
51 | @@ -XXX,XX +XXX,XX @@ | ||
52 | * TDX module SEAMCALL leaf functions | ||
53 | */ | ||
54 | #define TDH_SYS_INIT 33 | ||
55 | +#define TDH_SYS_LP_INIT 35 | ||
56 | #define TDH_SYS_LP_SHUTDOWN 44 | ||
57 | |||
58 | /* | ||
59 | -- | 105 | -- |
60 | 2.38.1 | 106 | 2.41.0 |
1 | TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This | 1 | The first few generations of TDX hardware have an erratum. Triggering |
---|---|---|---|
2 | mode runs only the TDX module itself or other code to load the TDX | 2 | it in Linux requires some kind of kernel bug involving relatively exotic |
3 | module. | 3 | memory writes to TDX private memory and will manifest via |
4 | 4 | spurious-looking machine checks when reading the affected memory. | |
5 | The host kernel communicates with SEAM software via a new SEAMCALL | 5 | |
6 | instruction. This is conceptually similar to a guest->host hypercall, | 6 | == Background == |
7 | except it is made from the host to SEAM software instead. | 7 | |
8 | 8 | Virtually all kernel memory access operations happen in full |
9 | The TDX module defines a set of SEAMCALL leaf functions to allow the | 9 | cachelines. In practice, writing a "byte" of memory usually reads a 64 |
10 | host to initialize it, and to create and run protected VMs. SEAMCALL | 10 | byte cacheline of memory, modifies it, then writes the whole line back. |
11 | leaf functions use an ABI different from the x86-64 system-v ABI. | 11 | Those operations do not trigger this problem. |
12 | Instead, they share the same ABI with the TDCALL leaf functions. | 12 | |
13 | 13 | This problem is triggered by "partial" writes where a write transaction | |
14 | Implement a function __seamcall() to allow the host to make SEAMCALL | 14 | of less than a cacheline lands at the memory controller. The CPU does |
15 | to SEAM software using the TDX_MODULE_CALL macro which is the common | 15 | these via non-temporal write instructions (like MOVNTI), or through |
16 | assembly for both SEAMCALL and TDCALL. | 16 | UC/WC memory mappings. The issue can also be triggered away from the |
17 | 17 | CPU by devices doing partial writes via DMA. | |
18 | SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when | 18 | |
19 | CPU is not in VMX operation. The current TDX_MODULE_CALL macro doesn't | 19 | == Problem == |
20 | handle any of them. There's no way to check whether the CPU is in VMX | 20 | |
21 | operation or not. | 21 | A partial write to a TDX private memory cacheline will silently "poison" |
22 | 22 | the line. Subsequent reads will consume the poison and generate a | |
23 | Initializing the TDX module is done at runtime on demand, and it depends | 23 | machine check. According to the TDX hardware spec, neither of these |
24 | on the caller to ensure CPU is in VMX operation before making SEAMCALL. | 24 | things should have happened. |
25 | To avoid getting Oops when the caller mistakenly tries to initialize the | 25 | |
26 | TDX module when CPU is not in VMX operation, extend the TDX_MODULE_CALL | 26 | To add insult to injury, the Linux machine check code will present these as a |
27 | macro to handle #UD (and also #GP, which can theoretically still happen | 27 | literal "Hardware error" when they were, in fact, a software-triggered |
28 | when TDX isn't actually enabled by the BIOS, i.e. due to BIOS bug). | 28 | issue. |
29 | 29 | ||
30 | Introduce two new TDX error codes for #UD and #GP respectively so the | 30 | == Solution == |
31 | caller can distinguish. Also, opportunistically put the new TDX error | 31 | |
32 | codes and the existing TDX_SEAMCALL_VMFAILINVALID into INTEL_TDX_HOST | 32 | In the end, this issue is hard to trigger. Rather than do something |
33 | Kconfig option as they are only used when it is on. | 33 | rash (and incomplete) like unmap TDX private memory from the direct map, |
34 | 34 | improve the machine check handler. | |
35 | As __seamcall() can potentially return multiple error codes, besides the | 35 | |
36 | actual SEAMCALL leaf function return code, also introduce a wrapper | 36 | Currently, the #MC handler doesn't distinguish whether the memory is |
37 | function seamcall() to convert the __seamcall() error code to the kernel | 37 | TDX private memory or not, but just dumps, for instance, the below message: |
38 | error code, so the caller doesn't need to duplicate the code to check | 38 | |
39 | return value of __seamcall() and return kernel error code accordingly. | 39 | [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134 |
40 | [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0} | ||
41 | ... | ||
42 | [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii' | ||
43 | [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel | ||
44 | [...] Kernel panic - not syncing: Fatal local machine check | ||
45 | |||
46 | Which says "Hardware Error" and "Data load in unrecoverable area of | ||
47 | kernel". | ||
48 | |||
49 | Ideally, it's better for the log to say "software bug around TDX private | ||
50 | memory" instead of "Hardware Error". But in reality the real hardware | ||
51 | memory error can happen, and sadly such software-triggered #MC cannot be | ||
52 | distinguished from the real hardware error. Also, the error message is | ||
53 | used by userspace tool 'mcelog' to parse, so changing the output may | ||
54 | break userspace. | ||
55 | |||
56 | So keep the "Hardware Error". The "Data load in unrecoverable area of | ||
57 | kernel" is also helpful, so keep it too. | ||
58 | |||
59 | Instead of modifying above error log, improve the error log by printing | ||
60 | additional TDX related message to make the log like: | ||
61 | |||
62 | ... | ||
63 | [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel | ||
64 | [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug. | ||
65 | |||
66 | Adding this additional message requires determination of whether the | ||
67 | memory page is TDX private memory. There is no existing infrastructure | ||
68 | to do that. Add an interface to query the TDX module to fill this gap. | ||
69 | |||
70 | == Impact == | ||
71 | |||
72 | This issue requires some kind of kernel bug to trigger. | ||
73 | |||
74 | TDX private memory should never be mapped UC/WC. A partial write | ||
75 | originating from these mappings would require *two* bugs, first mapping | ||
76 | the wrong page, then writing the wrong memory. It would also be | ||
77 | detectable using traditional memory corruption techniques like | ||
78 | DEBUG_PAGEALLOC. | ||
79 | |||
80 | MOVNTI (and friends) could cause this issue with something like a simple | ||
81 | buffer overrun or use-after-free on the direct map. It should also be | ||
82 | detectable with normal debug techniques. | ||
83 | |||
84 | The one place where this might get nasty would be if the CPU read data | ||
85 | then wrote back the same data. That would trigger this problem but | ||
86 | would not, for instance, set off mechanisms like slab redzoning because | ||
87 | it doesn't actually corrupt data. | ||
88 | |||
89 | With an IOMMU at least, the DMA exposure is similar to the UC/WC issue. | ||
90 | TDX private memory would first need to be incorrectly mapped into the | ||
91 | I/O space and then a later DMA to that mapping would actually cause the | ||
92 | poisoning event. | ||
40 | 93 | ||
41 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 94 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
95 | Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> | ||
96 | Reviewed-by: Yuan Yao <yuan.yao@intel.com> | ||
42 | --- | 97 | --- |
43 | 98 | ||
44 | v6 -> v7: | 99 | v13 -> v14: |
45 | - No change. | 100 | - No change |
46 | 101 | ||
47 | v5 -> v6: | 102 | v12 -> v13: |
48 | - Added code to handle #UD and #GP (Dave). | 103 | - Added Kirill and Yuan's tag. |
49 | - Moved the seamcall() wrapper function to this patch, and used a | 104 | |
50 | temporary __always_unused to avoid compile warning (Dave). | 105 | v11 -> v12: |
51 | 106 | - Simplified #MC message (Dave/Kirill) | |
52 | - v3 -> v5 (no feedback on v4): | 107 | - Slightly improved some comments. |
53 | - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the | 108 | |
54 | SEAMCALL itself fails. | 109 | v10 -> v11: |
55 | - Improve the changelog. | 110 | - New patch |
56 | 111 | ||
57 | --- | 112 | --- |
58 | arch/x86/include/asm/tdx.h | 9 ++++++ | 113 | arch/x86/include/asm/tdx.h | 2 + |
59 | arch/x86/virt/vmx/tdx/Makefile | 2 +- | 114 | arch/x86/kernel/cpu/mce/core.c | 33 +++++++++++ |
60 | arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++ | 115 | arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++ |
61 | arch/x86/virt/vmx/tdx/tdx.c | 42 ++++++++++++++++++++++++++ | 116 | arch/x86/virt/vmx/tdx/tdx.h | 5 ++ |
62 | arch/x86/virt/vmx/tdx/tdx.h | 8 +++++ | 117 | 4 files changed, 143 insertions(+) |
63 | arch/x86/virt/vmx/tdx/tdxcall.S | 19 ++++++++++-- | ||
64 | 6 files changed, 129 insertions(+), 3 deletions(-) | ||
65 | create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S | ||
66 | 118 | ||
67 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h | 119 | diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h |
68 | index XXXXXXX..XXXXXXX 100644 | 120 | index XXXXXXX..XXXXXXX 100644 |
69 | --- a/arch/x86/include/asm/tdx.h | 121 | --- a/arch/x86/include/asm/tdx.h |
70 | +++ b/arch/x86/include/asm/tdx.h | 122 | +++ b/arch/x86/include/asm/tdx.h |
123 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void); | ||
124 | int tdx_cpu_enable(void); | ||
125 | int tdx_enable(void); | ||
126 | void tdx_reset_memory(void); | ||
127 | +bool tdx_is_private_mem(unsigned long phys); | ||
128 | #else | ||
129 | static inline bool platform_tdx_enabled(void) { return false; } | ||
130 | static inline int tdx_cpu_enable(void) { return -ENODEV; } | ||
131 | static inline int tdx_enable(void) { return -ENODEV; } | ||
132 | static inline void tdx_reset_memory(void) { } | ||
133 | +static inline bool tdx_is_private_mem(unsigned long phys) { return false; } | ||
134 | #endif /* CONFIG_INTEL_TDX_HOST */ | ||
135 | |||
136 | #endif /* !__ASSEMBLY__ */ | ||
137 | diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c | ||
138 | index XXXXXXX..XXXXXXX 100644 | ||
139 | --- a/arch/x86/kernel/cpu/mce/core.c | ||
140 | +++ b/arch/x86/kernel/cpu/mce/core.c | ||
71 | @@ -XXX,XX +XXX,XX @@ | 141 | @@ -XXX,XX +XXX,XX @@ |
72 | #include <asm/ptrace.h> | 142 | #include <asm/mce.h> |
73 | #include <asm/shared/tdx.h> | 143 | #include <asm/msr.h> |
74 | 144 | #include <asm/reboot.h> | |
75 | +#ifdef CONFIG_INTEL_TDX_HOST | 145 | +#include <asm/tdx.h> |
76 | + | 146 | |
77 | +#include <asm/trapnr.h> | 147 | #include "internal.h" |
78 | + | 148 | |
79 | /* | 149 | @@ -XXX,XX +XXX,XX @@ static void wait_for_panic(void) |
80 | * SW-defined error codes. | 150 | panic("Panicing machine check CPU died"); |
81 | * | 151 | } |
82 | @@ -XXX,XX +XXX,XX @@ | 152 | |
83 | #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40)) | 153 | +static const char *mce_memory_info(struct mce *m) |
84 | #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000)) | 154 | +{ |
85 | 155 | + if (!m || !mce_is_memory_error(m) || !mce_usable_address(m)) | |
86 | +#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP) | 156 | + return NULL; |
87 | +#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD) | 157 | + |
88 | + | 158 | + /* |
89 | +#endif | 159 | + * Certain initial generations of TDX-capable CPUs have an |
90 | + | 160 | + * erratum. A kernel non-temporal partial write to TDX private |
91 | #ifndef __ASSEMBLY__ | 161 | + * memory poisons that memory, and a subsequent read of that |
92 | 162 | + * memory triggers #MC. | |
93 | /* | 163 | + * |
94 | diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile | 164 | + * However such #MC caused by software cannot be distinguished |
95 | index XXXXXXX..XXXXXXX 100644 | 165 | + * from the real hardware #MC. Just print additional message |
96 | --- a/arch/x86/virt/vmx/tdx/Makefile | 166 | + * to show such #MC may be result of the CPU erratum. |
97 | +++ b/arch/x86/virt/vmx/tdx/Makefile | 167 | + */ |
98 | @@ -XXX,XX +XXX,XX @@ | 168 | + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) |
99 | # SPDX-License-Identifier: GPL-2.0-only | 169 | + return NULL; |
100 | -obj-y += tdx.o | 170 | + |
101 | +obj-y += tdx.o seamcall.o | 171 | + return !tdx_is_private_mem(m->addr) ? NULL : |
102 | diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S | 172 | + "TDX private memory error. Possible kernel bug."; |
103 | new file mode 100644 | 173 | +} |
104 | index XXXXXXX..XXXXXXX | 174 | + |
105 | --- /dev/null | 175 | static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) |
106 | +++ b/arch/x86/virt/vmx/tdx/seamcall.S | 176 | { |
107 | @@ -XXX,XX +XXX,XX @@ | 177 | struct llist_node *pending; |
108 | +/* SPDX-License-Identifier: GPL-2.0 */ | 178 | struct mce_evt_llist *l; |
109 | +#include <linux/linkage.h> | 179 | int apei_err = 0; |
110 | +#include <asm/frame.h> | 180 | + const char *memmsg; |
111 | + | 181 | |
112 | +#include "tdxcall.S" | 182 | /* |
113 | + | 183 | * Allow instrumentation around external facilities usage. Not that it |
114 | +/* | 184 | @@ -XXX,XX +XXX,XX @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) |
115 | + * __seamcall() - Host-side interface functions to SEAM software module | 185 | } |
116 | + * (the P-SEAMLDR or the TDX module). | 186 | if (exp) |
117 | + * | 187 | pr_emerg(HW_ERR "Machine check: %s\n", exp); |
118 | + * Transform function call register arguments into the SEAMCALL register | 188 | + /* |
119 | + * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails, | 189 | + * Confidential computing platforms such as TDX platforms |
120 | + * or the completion status of the SEAMCALL leaf function. Additional | 190 | + * may occur MCE due to incorrect access to confidential |
121 | + * output operands are saved in @out (if it is provided by the caller). | 191 | + * memory. Print additional information for such error. |
122 | + * | 192 | + */ |
123 | + *------------------------------------------------------------------------- | 193 | + memmsg = mce_memory_info(final); |
124 | + * SEAMCALL ABI: | 194 | + if (memmsg) |
125 | + *------------------------------------------------------------------------- | 195 | + pr_emerg(HW_ERR "Machine check: %s\n", memmsg); |
126 | + * Input Registers: | 196 | + |
127 | + * | 197 | if (!fake_panic) { |
128 | + * RAX - SEAMCALL Leaf number. | 198 | if (panic_timeout == 0) |
129 | + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers. | 199 | panic_timeout = mca_cfg.panic_timeout; |
130 | + * | ||
131 | + * Output Registers: | ||
132 | + * | ||
133 | + * RAX - SEAMCALL completion status code. | ||
134 | + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers. | ||
135 | + * | ||
136 | + *------------------------------------------------------------------------- | ||
137 | + * | ||
138 | + * __seamcall() function ABI: | ||
139 | + * | ||
140 | + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX | ||
141 | + * @rcx (RSI) - Input parameter 1, moved to RCX | ||
142 | + * @rdx (RDX) - Input parameter 2, moved to RDX | ||
143 | + * @r8 (RCX) - Input parameter 3, moved to R8 | ||
144 | + * @r9 (R8) - Input parameter 4, moved to R9 | ||
145 | + * | ||
146 | + * @out (R9) - struct tdx_module_output pointer | ||
147 | + * stored temporarily in R12 (not | ||
148 | + * used by the P-SEAMLDR or the TDX | ||
149 | + * module). It can be NULL. | ||
150 | + * | ||
151 | + * Return (via RAX) the completion status of the SEAMCALL, or | ||
152 | + * TDX_SEAMCALL_VMFAILINVALID. | ||
153 | + */ | ||
154 | +SYM_FUNC_START(__seamcall) | ||
155 | + FRAME_BEGIN | ||
156 | + TDX_MODULE_CALL host=1 | ||
157 | + FRAME_END | ||
158 | + RET | ||
159 | +SYM_FUNC_END(__seamcall) | ||
160 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c | 200 | diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c |
161 | index XXXXXXX..XXXXXXX 100644 | 201 | index XXXXXXX..XXXXXXX 100644 |
162 | --- a/arch/x86/virt/vmx/tdx/tdx.c | 202 | --- a/arch/x86/virt/vmx/tdx/tdx.c |
163 | +++ b/arch/x86/virt/vmx/tdx/tdx.c | 203 | +++ b/arch/x86/virt/vmx/tdx/tdx.c |
164 | @@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void) | 204 | @@ -XXX,XX +XXX,XX @@ void tdx_reset_memory(void) |
165 | return !!tdx_keyid_num; | 205 | tdmrs_reset_pamt_all(&tdx_tdmr_list); |
166 | } | 206 | } |
167 | 207 | ||
208 | +static bool is_pamt_page(unsigned long phys) | ||
209 | +{ | ||
210 | + struct tdmr_info_list *tdmr_list = &tdx_tdmr_list; | ||
211 | + int i; | ||
212 | + | ||
213 | + /* | ||
214 | + * This function is called from #MC handler, and theoretically | ||
215 | + * it could run in parallel with the TDX module initialization | ||
216 | + * on other logical cpus. But it's not OK to hold mutex here | ||
217 | + * so just blindly check module status to make sure PAMTs/TDMRs | ||
218 | + * are stable to access. | ||
219 | + * | ||
220 | + * This may return inaccurate result in rare cases, e.g., when | ||
221 | + * #MC happens on a PAMT page during module initialization, but | ||
222 | + * this is fine as #MC handler doesn't need a 100% accurate | ||
223 | + * result. | ||
224 | + */ | ||
225 | + if (tdx_module_status != TDX_MODULE_INITIALIZED) | ||
226 | + return false; | ||
227 | + | ||
228 | + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { | ||
229 | + unsigned long base, size; | ||
230 | + | ||
231 | + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size); | ||
232 | + | ||
233 | + if (phys >= base && phys < (base + size)) | ||
234 | + return true; | ||
235 | + } | ||
236 | + | ||
237 | + return false; | ||
238 | +} | ||
239 | + | ||
168 | +/* | 240 | +/* |
169 | + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code | 241 | + * Return whether the memory page at the given physical address is TDX |
170 | + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL | 242 | + * private memory or not. Called from #MC handler do_machine_check(). |
171 | + * leaf function return code and the additional output respectively if | 243 | + * |
172 | + * not NULL. | 244 | + * Note this function may not return an accurate result in rare cases. |
245 | + * This is fine as the #MC handler doesn't need a 100% accurate result, | ||
246 | + * because it cannot distinguish #MC between software bug and real | ||
247 | + * hardware error anyway. | ||
173 | + */ | 248 | + */ |
174 | +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | 249 | +bool tdx_is_private_mem(unsigned long phys) |
175 | + u64 *seamcall_ret, | ||
176 | + struct tdx_module_output *out) | ||
177 | +{ | 250 | +{ |
251 | + struct tdx_module_args args = { | ||
252 | + .rcx = phys & PAGE_MASK, | ||
253 | + }; | ||
178 | + u64 sret; | 254 | + u64 sret; |
179 | + | 255 | + |
180 | + sret = __seamcall(fn, rcx, rdx, r8, r9, out); | 256 | + if (!platform_tdx_enabled()) |
181 | + | 257 | + return false; |
182 | + /* Save SEAMCALL return code if caller wants it */ | 258 | + |
183 | + if (seamcall_ret) | 259 | + /* Get page type from the TDX module */ |
184 | + *seamcall_ret = sret; | 260 | + sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args); |
185 | + | 261 | + /* |
186 | + /* SEAMCALL was successful */ | 262 | + * Handle the case that CPU isn't in VMX operation. |
187 | + if (!sret) | 263 | + * |
188 | + return 0; | 264 | + * KVM guarantees no VM is running (thus no TDX guest) |
189 | + | 265 | + * when there's any online CPU isn't in VMX operation. |
190 | + switch (sret) { | 266 | + * This means there will be no TDX guest private memory |
191 | + case TDX_SEAMCALL_GP: | 267 | + * and Secure-EPT pages. However the TDX module may have |
192 | + /* | 268 | + * been initialized and the memory page could be PAMT. |
193 | + * platform_tdx_enabled() is checked to be true | 269 | + */ |
194 | + * before making any SEAMCALL. | 270 | + if (sret == TDX_SEAMCALL_UD) |
195 | + */ | 271 | + return is_pamt_page(phys); |
196 | + WARN_ON_ONCE(1); | 272 | + |
197 | + fallthrough; | 273 | + /* |
198 | + case TDX_SEAMCALL_VMFAILINVALID: | 274 | + * Any other failure means: |
199 | + /* Return -ENODEV if the TDX module is not loaded. */ | 275 | + * |
200 | + return -ENODEV; | 276 | + * 1) TDX module not loaded; or |
201 | + case TDX_SEAMCALL_UD: | 277 | + * 2) Memory page isn't managed by the TDX module. |
202 | + /* Return -EINVAL if CPU isn't in VMX operation. */ | 278 | + * |
203 | + return -EINVAL; | 279 | + * In either case, the memory page cannot be a TDX |
280 | + * private page. | ||
281 | + */ | ||
282 | + if (sret) | ||
283 | + return false; | ||
284 | + | ||
285 | + /* | ||
286 | + * SEAMCALL was successful -- read page type (via RCX): | ||
287 | + * | ||
288 | + * - PT_NDA: Page is not used by the TDX module | ||
289 | + * - PT_RSVD: Reserved for Non-TDX use | ||
290 | + * - Others: Page is used by the TDX module | ||
291 | + * | ||
292 | + * Note PAMT pages are marked as PT_RSVD but they are also TDX | ||
293 | + * private memory. | ||
294 | + * | ||
295 | + * Note: Even page type is PT_NDA, the memory page could still | ||
296 | + * be associated with TDX private KeyID if the kernel hasn't | ||
297 | + * explicitly used MOVDIR64B to clear the page. Assume KVM | ||
298 | + * always does that after reclaiming any private page from TDX | ||
299 | + * gusets. | ||
300 | + */ | ||
301 | + switch (args.rcx) { | ||
302 | + case PT_NDA: | ||
303 | + return false; | ||
304 | + case PT_RSVD: | ||
305 | + return is_pamt_page(phys); | ||
204 | + default: | 306 | + default: |
205 | + /* Return -EIO if the actual SEAMCALL leaf failed. */ | 307 | + return true; |
206 | + return -EIO; | ||
207 | + } | 308 | + } |
208 | +} | 309 | +} |
209 | + | 310 | + |
210 | /* | 311 | static int __init record_keyid_partitioning(u32 *tdx_keyid_start, |
211 | * Detect and initialize the TDX module. | 312 | u32 *nr_tdx_keyids) |
212 | * | 313 | { |
213 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h | 314 | diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h |
214 | index XXXXXXX..XXXXXXX 100644 | 315 | index XXXXXXX..XXXXXXX 100644 |
215 | --- a/arch/x86/virt/vmx/tdx/tdx.h | 316 | --- a/arch/x86/virt/vmx/tdx/tdx.h |
216 | +++ b/arch/x86/virt/vmx/tdx/tdx.h | 317 | +++ b/arch/x86/virt/vmx/tdx/tdx.h |
217 | @@ -XXX,XX +XXX,XX @@ | 318 | @@ -XXX,XX +XXX,XX @@ |
218 | /* MSR to report KeyID partitioning between MKTME and TDX */ | 319 | /* |
219 | #define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 | 320 | * TDX module SEAMCALL leaf functions |
220 | 321 | */ | |
221 | +/* | 322 | +#define TDH_PHYMEM_PAGE_RDMD 24 |
222 | + * Do not put any hardware-defined TDX structure representations below | 323 | #define TDH_SYS_KEY_CONFIG 31 |
223 | + * this comment! | 324 | #define TDH_SYS_INFO 32 |
224 | + */ | 325 | #define TDH_SYS_INIT 33 |
225 | + | ||
226 | +struct tdx_module_output; | ||
227 | +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, | ||
228 | + struct tdx_module_output *out); | ||
229 | #endif | ||
230 | diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S | ||
231 | index XXXXXXX..XXXXXXX 100644 | ||
232 | --- a/arch/x86/virt/vmx/tdx/tdxcall.S | ||
233 | +++ b/arch/x86/virt/vmx/tdx/tdxcall.S | ||
234 | @@ -XXX,XX +XXX,XX @@ | 326 | @@ -XXX,XX +XXX,XX @@ |
235 | /* SPDX-License-Identifier: GPL-2.0 */ | 327 | #define TDH_SYS_TDMR_INIT 36 |
236 | #include <asm/asm-offsets.h> | 328 | #define TDH_SYS_CONFIG 45 |
237 | #include <asm/tdx.h> | 329 | |
238 | +#include <asm/asm.h> | 330 | +/* TDX page types */ |
239 | 331 | +#define PT_NDA 0x0 | |
240 | /* | 332 | +#define PT_RSVD 0x1 |
241 | * TDCALL and SEAMCALL are supported in Binutils >= 2.36. | 333 | + |
242 | @@ -XXX,XX +XXX,XX @@ | 334 | struct cmr_info { |
243 | /* Leave input param 2 in RDX */ | 335 | u64 base; |
244 | 336 | u64 size; | |
245 | .if \host | ||
246 | +1: | ||
247 | seamcall | ||
248 | /* | ||
249 | * SEAMCALL instruction is essentially a VMExit from VMX root | ||
250 | @@ -XXX,XX +XXX,XX @@ | ||
251 | * This value will never be used as actual SEAMCALL error code as | ||
252 | * it is from the Reserved status code class. | ||
253 | */ | ||
254 | - jnc .Lno_vmfailinvalid | ||
255 | + jnc .Lseamcall_out | ||
256 | mov $TDX_SEAMCALL_VMFAILINVALID, %rax | ||
257 | -.Lno_vmfailinvalid: | ||
258 | + jmp .Lseamcall_out | ||
259 | +2: | ||
260 | + /* | ||
261 | + * SEAMCALL caused #GP or #UD. If we reach here, %eax contains
262 | + * the trap number. Convert the trap number to the TDX error | ||
263 | + * code by setting TDX_SW_ERROR to the high 32-bits of %rax. | ||
264 | + * | ||
265 | + * Note we cannot OR TDX_SW_ERROR directly into %rax, as the OR
266 | + * instruction accepts at most a 32-bit immediate.
267 | + */ | ||
268 | + mov $TDX_SW_ERROR, %r12 | ||
269 | + orq %r12, %rax | ||
270 | |||
271 | + _ASM_EXTABLE_FAULT(1b, 2b) | ||
272 | +.Lseamcall_out: | ||
273 | .else | ||
274 | tdcall | ||
275 | .endif | ||
276 | -- | 337 | -- |
277 | 2.38.1 | 338 | 2.41.0 | diff view generated by jsdifflib |
... | ... | ||
---|---|---|---|
6 | materials under it, and add a new menu for TDX host kernel support. | 6 | materials under it, and add a new menu for TDX host kernel support. |
7 | 7 | ||
8 | Signed-off-by: Kai Huang <kai.huang@intel.com> | 8 | Signed-off-by: Kai Huang <kai.huang@intel.com> |
9 | --- | 9 | --- |
10 | 10 | ||
11 | v6 -> v7: | 11 | - Added new sections for "Erratum" and "TDX vs S3/hibernation" |
12 | - Changed "TDX Memory Policy" and "Kexec()" sections. | ||
13 | 12 | ||
14 | --- | 13 | --- |
15 | Documentation/x86/tdx.rst | 181 +++++++++++++++++++++++++++++++++++--- | 14 | Documentation/arch/x86/tdx.rst | 217 +++++++++++++++++++++++++++++++-- |
16 | 1 file changed, 170 insertions(+), 11 deletions(-) | 15 | 1 file changed, 206 insertions(+), 11 deletions(-) |
17 | 16 | ||
18 | diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst | 17 | diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst |
19 | index XXXXXXX..XXXXXXX 100644 | 18 | index XXXXXXX..XXXXXXX 100644 |
20 | --- a/Documentation/x86/tdx.rst | 19 | --- a/Documentation/arch/x86/tdx.rst |
21 | +++ b/Documentation/x86/tdx.rst | 20 | +++ b/Documentation/arch/x86/tdx.rst |
22 | @@ -XXX,XX +XXX,XX @@ encrypting the guest memory. In TDX, a special module running in a special | 21 | @@ -XXX,XX +XXX,XX @@ encrypting the guest memory. In TDX, a special module running in a special |
23 | mode sits between the host and the guest and manages the guest/host | 22 | mode sits between the host and the guest and manages the guest/host |
24 | separation. | 23 | separation. |
25 | 24 | ||
26 | +TDX Host Kernel Support | 25 | +TDX Host Kernel Support |
... | ... | ||
46 | +----------------------- | 45 | +----------------------- |
47 | + | 46 | + |
48 | +The kernel detects TDX by detecting TDX private KeyIDs during kernel | 47 | +The kernel detects TDX by detecting TDX private KeyIDs during kernel |
49 | +boot. Below dmesg shows when TDX is enabled by BIOS:: | 48 | +boot. Below dmesg shows when TDX is enabled by BIOS:: |
50 | + | 49 | + |
51 | + [..] tdx: TDX enabled by BIOS. TDX private KeyID range: [16, 64). | 50 | + [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64) |
52 | + | 51 | + |
53 | +TDX module detection and initialization | 52 | +TDX module initialization |
54 | +--------------------------------------- | 53 | +--------------------------------------- |
55 | + | ||
56 | +There is no CPUID or MSR to detect the TDX module. The kernel detects it | ||
57 | +by initializing it. | ||
58 | + | 54 | + |
59 | +The kernel talks to the TDX module via the new SEAMCALL instruction. The | 55 | +The kernel talks to the TDX module via the new SEAMCALL instruction. The |
60 | +TDX module implements SEAMCALL leaf functions to allow the kernel to | 56 | +TDX module implements SEAMCALL leaf functions to allow the kernel to |
61 | +initialize it. | 57 | +initialize it. |
58 | + | ||
59 | +If the TDX module isn't loaded, the SEAMCALL instruction fails with a | ||
60 | +special error. In this case the kernel fails the module initialization | ||
61 | +and reports the module isn't loaded:: | ||
62 | + | ||
63 | + [..] virt/tdx: module not loaded | ||
62 | + | 64 | + |
63 | +Initializing the TDX module consumes roughly 1/256th of system RAM to | 65 | +Initializing the TDX module consumes roughly 1/256th of system RAM to
64 | +use it as 'metadata' for the TDX memory. It also takes additional CPU | 66 | +use it as 'metadata' for the TDX memory. It also takes additional CPU
65 | +time to initialize that metadata along with the TDX module itself. | 67 | +time to initialize that metadata along with the TDX module itself.
66 | +Neither is trivial. The kernel initializes the TDX module at runtime on | 68 | +Neither is trivial. The kernel initializes the TDX module at runtime on
67 | +demand. The caller calls tdx_enable() to initialize the TDX module:: | 69 | +demand.
68 | + | 70 | + |
71 | +Besides initializing the TDX module, a per-cpu initialization SEAMCALL | ||
72 | +must be done on one cpu before any other SEAMCALLs can be made on that | ||
73 | +cpu. | ||
74 | + | ||
75 | +The kernel provides two functions, tdx_enable() and tdx_cpu_enable() to | ||
76 | +allow the user of TDX to enable the TDX module and enable TDX on the
77 | +local cpu.
78 | + | ||
79 | +Making SEAMCALL requires the CPU already being in VMX operation (VMXON | ||
80 | +has been done). For now both tdx_enable() and tdx_cpu_enable() don't | ||
81 | +handle VMXON internally, but depend on the caller to guarantee that.
82 | + | ||
83 | +To enable TDX, the caller of TDX should: 1) hold the read lock of the
84 | +CPU hotplug lock; 2) do VMXON and tdx_cpu_enable() on all online cpus
85 | +successfully; 3) call tdx_enable(). For example::
86 | + | ||
87 | + cpus_read_lock(); | ||
88 | + on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, true);
69 | + ret = tdx_enable(); | 89 | + ret = tdx_enable(); |
90 | + cpus_read_unlock(); | ||
70 | + if (ret) | 91 | + if (ret) |
71 | + goto no_tdx; | 92 | + goto no_tdx; |
72 | + // TDX is ready to use | 93 | + // TDX is ready to use |
73 | + | 94 | + |
74 | +Initializing the TDX module requires all logical CPUs to be online. | 95 | +And the caller of TDX must guarantee tdx_cpu_enable() has been done
75 | +tdx_enable() internally temporarily disables CPU hotplug to prevent any | 96 | +successfully on a cpu before running any other SEAMCALL on that cpu.
76 | +CPU from going offline, but the caller still needs to guarantee all | 97 | +A typical usage is to do both VMXON and tdx_cpu_enable() in the CPU
77 | +present CPUs are online before calling tdx_enable(). | 98 | +hotplug online callback, and refuse the onlining if tdx_cpu_enable() fails.
78 | + | 99 | + |
79 | +Also, tdx_enable() requires that all CPUs are already in VMX operation | 100 | +Users can consult dmesg to see whether the TDX module has been initialized.
80 | +(requirement of making SEAMCALL). Currently, tdx_enable() doesn't handle | ||
81 | +VMXON internally, but depends on the caller to guarantee that. So far | ||
82 | +KVM is the only user of TDX and KVM already handles VMXON. | ||
83 | + | ||
84 | +Users can consult dmesg to see the presence of the TDX module, and whether
85 | +it has been initialized. | ||
86 | + | ||
87 | +If the TDX module is not loaded, dmesg shows below:: | ||
88 | + | ||
89 | + [..] tdx: TDX module is not loaded. | ||
90 | + | 101 | + |
91 | +If the TDX module is initialized successfully, dmesg shows something | 102 | +If the TDX module is initialized successfully, dmesg shows something |
92 | +like below:: | 103 | +like below:: |
93 | + | 104 | + |
94 | + [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160 | 105 | + [..] virt/tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160 |
95 | + [..] tdx: 65667 pages allocated for PAMT. | 106 | + [..] virt/tdx: 262668 KBs allocated for PAMT |
96 | + [..] tdx: TDX module initialized. | 107 | + [..] virt/tdx: module initialized |
97 | + | 108 | + |
98 | +If the TDX module failed to initialize, dmesg shows below:: | 109 | +If the TDX module failed to initialize, dmesg also shows it failed to |
99 | + | 110 | +initialize:: |
100 | + [..] tdx: Failed to initialize TDX module. Shut it down. | 111 | + |
112 | + [..] virt/tdx: module initialization failed ... | ||
101 | + | 113 | + |
102 | +TDX Interaction to Other Kernel Components | 114 | +TDX Interaction to Other Kernel Components |
103 | +------------------------------------------ | 115 | +------------------------------------------ |
104 | + | 116 | + |
105 | +TDX Memory Policy | 117 | +TDX Memory Policy |
106 | +~~~~~~~~~~~~~~~~~ | 118 | +~~~~~~~~~~~~~~~~~ |
107 | + | 119 | + |
108 | +TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all | 120 | +TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
109 | +memory regions that can possibly be used by the TDX module, but they are | 121 | +kernel which memory is TDX compatible. The kernel needs to build a list |
110 | +not automatically usable to the TDX module. As a step of initializing | 122 | +of memory regions (out of CMRs) as "TDX-usable" memory and pass those |
111 | +the TDX module, the kernel needs to choose a list of memory regions (out | 123 | +regions to the TDX module. Once this is done, those "TDX-usable" memory |
112 | +from convertible memory regions) that the TDX module can use and pass | 124 | +regions are fixed during the module's lifetime.
113 | +those regions to the TDX module. Once this is done, those "TDX-usable" | ||
114 | +memory regions are fixed during module's lifetime. No more TDX-usable | ||
115 | +memory can be added to the TDX module after that. | ||
116 | + | 125 | + |
117 | +To keep things simple, currently the kernel simply guarantees all pages | 126 | +To keep things simple, currently the kernel simply guarantees all pages |
118 | +in the page allocator are TDX memory. Specifically, the kernel uses all | 127 | +in the page allocator are TDX memory. Specifically, the kernel uses all |
119 | +system memory in the core-mm at the time of initializing the TDX module | 128 | +system memory in the core-mm at the time of initializing the TDX module |
120 | +as TDX memory, and in the meantime, refuses to add any non-TDX-memory in | 129 | +as TDX memory, and in the meantime, refuses to online any non-TDX-memory
121 | +the memory hotplug. | 130 | +in the memory hotplug. |
122 | + | 131 | + |
123 | +This can be enhanced in the future, e.g. by allowing non-TDX memory to | 132 | +Physical Memory Hotplug
124 | +be added to a separate NUMA node. In this case, the "TDX-capable" nodes | 133 | +~~~~~~~~~~~~~~~~~~~~~~~
125 | +and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace | ||
126 | +needs to guarantee memory pages for TDX guests are always allocated from | ||
127 | +the "TDX-capable" nodes. | ||
128 | + | 134 | + |
129 | +Note TDX assumes convertible memory is always physically present during | 135 | +Note TDX assumes convertible memory is always physically present during |
130 | +machine's runtime. A non-buggy BIOS should never support hot-removal of | 136 | +machine's runtime. A non-buggy BIOS should never support hot-removal of |
131 | +any convertible memory. This implementation doesn't handle ACPI memory | 137 | +any convertible memory. This implementation doesn't handle ACPI memory |
132 | +removal but depends on the BIOS to behave correctly. | 138 | +removal but depends on the BIOS to behave correctly. |
133 | + | 139 | + |
134 | +CPU Hotplug | 140 | +CPU Hotplug |
135 | +~~~~~~~~~~~ | 141 | +~~~~~~~~~~~ |
142 | + | ||
143 | +The TDX module requires that the per-cpu initialization SEAMCALL
144 | +(TDH.SYS.LP.INIT) be done on one cpu before any other SEAMCALLs can be
145 | +made on that cpu, including those involved in the module initialization.
146 | + | ||
147 | +The kernel provides tdx_cpu_enable() to let the user of TDX do it when
148 | +the user wants to use a new cpu for a TDX task.
136 | + | 149 | + |
137 | +TDX doesn't support physical (ACPI) CPU hotplug. During machine boot, | 150 | +TDX doesn't support physical (ACPI) CPU hotplug. During machine boot, |
138 | +TDX verifies all boot-time present logical CPUs are TDX compatible before | 151 | +TDX verifies all boot-time present logical CPUs are TDX compatible before |
139 | +enabling TDX. A non-buggy BIOS should never support hot-add/removal of | 152 | +enabling TDX. A non-buggy BIOS should never support hot-add/removal of |
140 | +physical CPU. Currently the kernel doesn't handle physical CPU hotplug, | 153 | +physical CPU. Currently the kernel doesn't handle physical CPU hotplug, |
... | ... | ||
146 | +Kexec() | 159 | +Kexec() |
147 | +~~~~~~~ | 160 | +~~~~~~~ |
148 | + | 161 | + |
149 | +There are two problems in terms of using kexec() to boot to a new kernel | 162 | +There are two problems in terms of using kexec() to boot to a new kernel |
150 | +when the old kernel has enabled TDX: 1) Part of the memory pages are | 163 | +when the old kernel has enabled TDX: 1) Part of the memory pages are |
151 | +still TDX private pages (i.e. metadata used by the TDX module, and any | 164 | +still TDX private pages; 2) There might be dirty cachelines associated |
152 | +TDX guest memory if kexec() is executed when there are live TDX guests). | 165 | +with TDX private pages.
153 | +2) There might be dirty cachelines associated with TDX private pages. | 166 | + |
154 | + | 167 | +The first problem doesn't matter. KeyID 0 doesn't have an integrity
155 | +Because the hardware doesn't guarantee cache coherency among different | 168 | +check. Even if the new kernel wants to use a non-zero KeyID, it needs
156 | +KeyIDs, the old kernel needs to flush cache (of TDX private pages) | 169 | +to convert the memory to that KeyID, and such conversion works from any KeyID.
157 | +before booting to the new kernel. Also, the kernel doesn't convert all | 170 | + |
158 | +TDX private pages back to normal because of the below considerations: | 171 | +However, the old kernel needs to guarantee there's no dirty cacheline
159 | + | 172 | +left behind before booting to the new kernel to avoid silent corruption |
160 | +1) The kernel doesn't have existing infrastructure to track which pages | 173 | +from later cacheline writeback (Intel hardware doesn't guarantee cache |
161 | + are TDX private pages. | 174 | +coherency across different KeyIDs).
162 | +2) The number of TDX private pages can be large, and converting all of | 175 | + |
163 | + them (cache flush + using MOVDIR64B to clear the page) can be time | 176 | +Similar to AMD SME, the kernel just uses wbinvd() to flush cache before |
164 | + consuming. | 177 | +booting to the new kernel. |
165 | +3) The new kernel will almost only use KeyID 0 to access memory. KeyID | 178 | + |
166 | + 0 doesn't support integrity-check, so it's OK. | 179 | +Erratum |
167 | +4) The kernel doesn't (and may never) support MKTME. If any 3rd party | 180 | +~~~~~~~ |
168 | + kernel ever supports MKTME, it should do MOVDIR64B to clear the page | 181 | + |
169 | + with the new MKTME KeyID (just like TDX does) before using it. | 182 | +The first few generations of TDX hardware have an erratum. A partial |
170 | + | 183 | +write to a TDX private memory cacheline will silently "poison" the |
171 | +The current TDX module architecture doesn't play nicely with kexec(). | 184 | +line. Subsequent reads will consume the poison and generate a machine |
172 | +The TDX module can only be initialized once during its lifetime, and | 185 | +check. |
173 | +there is no SEAMCALL to reset the module to give a new clean slate to | 186 | + |
174 | +the new kernel. Therefore, ideally, if the module is ever initialized, | 187 | +A partial write is a memory write where a write transaction of less than |
175 | +it's better to shut down the module. The new kernel won't be able to | 188 | +cacheline lands at the memory controller. The CPU does these via |
176 | +use TDX anyway (as it needs to go through the TDX module initialization | 189 | +non-temporal write instructions (like MOVNTI), or through UC/WC memory |
177 | +process which will fail immediately at the first step). | 190 | +mappings. Devices can also do partial writes via DMA. |
178 | + | 191 | + |
179 | +However, there's no guarantee the CPU is in VMX operation during kexec(), so | 192 | +Theoretically, a kernel bug could do a partial write to TDX private memory
180 | +it's impractical to shut down the module. Currently, the kernel just | 193 | +and trigger an unexpected machine check. What's more, the machine check
181 | +leaves the module in an open state. | 194 | +code will present these as "Hardware error" when they were, in fact, a
195 | +software-triggered issue. That said, this issue is hard to trigger in practice.
196 | + | ||
197 | +If the platform has this erratum, the kernel does two additional things:
198 | +1) it resets TDX private pages using MOVDIR64B in kexec() before booting
199 | +to the new kernel; 2) it prints an additional message in the machine
200 | +check handler to tell the user that the machine check may be caused by a
201 | +kernel bug on TDX private memory.
202 | + | ||
203 | +Interaction vs S3 and deeper states | ||
204 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
205 | + | ||
206 | +TDX cannot survive S3 and deeper states. The hardware resets and
207 | +disables TDX completely when the platform goes to S3 and deeper. Both TDX
208 | +guests and the TDX module get destroyed permanently. | ||
209 | + | ||
210 | +The kernel uses S3 for suspend-to-ram, and S4 and deeper states for
211 | +hibernation. Currently, for simplicity, the kernel chooses to make TDX | ||
212 | +mutually exclusive with S3 and hibernation. | ||
213 | + | ||
214 | +In most cases, the user needs to add the 'nohibernation' kernel command
215 | +line parameter in order to use TDX. S3 is disabled during early kernel boot if TDX is
216 | +detected. The user needs to turn off TDX in the BIOS in order to use S3. | ||
182 | + | 217 | + |
183 | +TDX Guest Support | 218 | +TDX Guest Support |
184 | +================= | 219 | +================= |
185 | Since the host cannot directly access guest registers or memory, much | 220 | Since the host cannot directly access guest registers or memory, much |
186 | normal functionality of a hypervisor must be moved into the guest. This is | 221 | normal functionality of a hypervisor must be moved into the guest. This is |
... | ... | ||
283 | +------------------------- | 318 | +------------------------- |
284 | 319 | ||
285 | All TDX guest memory starts out as private at boot. This memory can not | 320 | All TDX guest memory starts out as private at boot. This memory can not |
286 | be accessed by the hypervisor. However, some kernel users like device | 321 | be accessed by the hypervisor. However, some kernel users like device |
287 | -- | 322 | -- |
288 | 2.38.1 | 323 | 2.41.0