Hi all,

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  TDX specs are available in [1].

This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately[2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed[3].  KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.

This series doesn't aim to support all functionalities (e.g. exposing the
TDX module via /sysfs), and doesn't aim to resolve everything perfectly.
In particular, the implementation of how to choose "TDX-usable" memory
and of memory hotplug handling is simple: this series just makes sure
all pages in the page allocator are TDX memory.

A better solution, suggested by Kirill, is similar to the per-node memory
encryption flag in [4].  Similarly, a per-node TDX flag can be added so
both "TDX-capable" and "non-TDX-capable" nodes can co-exist.  With the
TDX flag exposed to userspace via /sysfs, userspace can then use NUMA
APIs to bind TDX guests to the "TDX-capable" nodes.

For more information please refer to the "Kernel policy on TDX memory"
and "Memory hotplug" sections below.  Huang, Ying is working on this
"per-node TDX flag" support and will post another series independently.

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

Also, other optimizations will be posted as follow-ups once this initial
TDX support is upstreamed.

Hi Dave, Dan, Kirill, Ying (and Intel reviewers),

Please kindly help to review, and I would appreciate Reviewed-by or
Acked-by tags if the patches look good to you.

This series has been reviewed by Isaku, who is developing the KVM TDX
patches.  Kirill has also reviewed a couple of patches.

I would also highly appreciate it if anyone else could help to review
this series.
----- Changelog history: ------

- v6 -> v7:

 - Added memory hotplug support.
 - Changed how to choose the list of "TDX-usable" memory regions from
   kernel boot time to TDX module initialization time.
 - Addressed comments received in previous versions (Andi/Dave).
 - Improved the commit message and the comments of the kexec() support
   patch, and the patch now handles returning PAMTs back to the kernel
   when TDX module initialization fails.  Please also see the "Kexec()"
   section below.
 - Changed the documentation patch accordingly.
 - For all others please see the individual patch changelog history.

- v5 -> v6:

 - Removed ACPI CPU/memory hotplug patches (Intel internal discussion).
 - Removed the patch to disable driver-managed memory hotplug (Intel
   internal discussion).
 - Added one patch to introduce an enum type for the TDX supported page
   size levels, replacing the hard-coded values in TDX guest code (Dave).
 - Added one patch to make TDX depend on X2APIC being enabled (Dave).
 - Added one patch to build all boot-time present memory regions as TDX
   memory during kernel boot.
 - Added Reviewed-by tags from others to some patches.
 - For all others please see the individual patch changelog history.

- v4 -> v5:

 This is essentially a resend of v4.  Sorry I forgot to consult
 get_maintainer.pl when sending out v4, so I forgot to add the linux-acpi
 and linux-mm mailing lists and the relevant people for 4 new patches.

 There are also very minor code and commit message updates from v4:

 - Rebased to latest tip/x86/tdx.
 - Fixed a checkpatch issue that I missed in v4.
 - Removed an obsolete comment that I missed in patch 6.
 - Very minor update to the commit message of patch 12.

 For other changes to individual patches since v3, please refer to the
 changelog history of the individual patches (I just used v3 -> v5 since
 there's basically no code change in v4).

- v3 -> v4 (addressed Dave's comments, and other comments from others):

 - Simplified SEAMRR and TDX KeyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver-managed memory
   hotplug.
 - Removed tdx_detect() and only use a single tdx_init().
 - Removed detecting the TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the patch adding a boot-time command line to disable TDX.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

- v2 -> v3:

 - Addressed comments from Isaku:
  - Fixed a memory leak and an unnecessary function argument in the patch
    to configure the key for the global KeyID (patch 17).
  - Enhanced the patch to get TDX module and CMR information a little bit
    (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slightly improved the commit message of patch 03.
  - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
    seamrr_enabled() (patch 04).
  - Changed the documentation patch to add the TDX host kernel support
    materials to Documentation/x86/tdx.rst together with the TDX guest
    stuff, instead of using a standalone file (patch 21).
 - Very minor improvements in commit messages.

- RFC (v1) -> v2:

 - Rebased to Kirill's latest TDX guest code.
 - Fixed two issues related to finding all RAM memory regions based
   on e820.
 - Minor improvements to comments and commit messages.

v6:
https://lore.kernel.org/linux-mm/cover.1666824663.git.kai.huang@intel.com/T/

v5:
https://lore.kernel.org/lkml/cover.1655894131.git.kai.huang@intel.com/T/

v3:
https://lore.kernel.org/lkml/68484e168226037c3a25b6fb983b052b26ab3ec1.camel@intel.com/T/

v2:
https://lore.kernel.org/lkml/cover.1647167475.git.kai.huang@intel.com/T/

RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/
== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in the new
isolated region as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyIDs as TDX private KeyIDs, which are only accessible within the SEAM
mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to the TDX module's.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized.  This series assumes the TDX module is loaded
by the BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter "13. Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".
== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function tdx_enable() to allow the caller to
initialize TDX at runtime:

        if (tdx_enable())
                goto no_tdx;
        // TDX is ready to create TD guests.

This approach has the below pros (a sketch of a caller following this
contract is shown after this list):

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand allows this memory to be consumed
only when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
kernel.

3) It is more flexible for supporting "TDX module runtime update" (not
in this series).  After updating to a new module at runtime, the kernel
needs to go through the initialization process again.
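For illustration only, a KVM-style caller could look roughly like the
sketch below.  This is not code from this series: kvm_like_enable_tdx()
and enable_vmx_on_all_cpus() are hypothetical placeholder names, and the
only real entry point is tdx_enable(); the sketch just shows the
contract that VMXON is the caller's job and CPU hotplug is blocked
around initialization.

	/*
	 * Hypothetical caller-side sketch: enable VMX first (KVM already
	 * knows how), then initialize the TDX module on demand.
	 */
	static int kvm_like_enable_tdx(void)
	{
		int ret;

		cpus_read_lock();	/* keep the CPU set stable during init */

		ret = enable_vmx_on_all_cpus();	/* placeholder for KVM's VMXON */
		if (!ret)
			ret = tdx_enable();	/* provided by this series */

		cpus_read_unlock();
		return ret;
	}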
2. CPU hotplug

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS
should never support hotpluggable CPU devices and/or deliver ACPI CPU
hotplug events to the kernel.  This series doesn't handle physical
(ACPI) CPU hotplug at all but depends on the BIOS to behave correctly.

Note TDX works with CPU logical online/offline, thus this series still
allows logical CPU online/offline.
3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as
usable for TDX private memory.

The initial support for TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM in the core-mm at the time of initializing the TDX module as
TDX memory, to guarantee that all pages in the page allocator are TDX
pages.  (A simplified sketch of collecting these regions follows this
section.)
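As a rough illustration of that policy, the regions handed to the TDX
module can be collected from memblock at module-initialization time.
This is a simplified sketch, not the series' actual code:
tdx_memory_add_region() is a stand-in for the real list-building helper.

	#include <linux/memblock.h>
	#include <linux/pfn.h>

	/*
	 * Simplified sketch: walk all memory regions known to the page
	 * allocator and record each one as "TDX-usable" memory.
	 */
	static int __init build_tdx_memlist_sketch(void)
	{
		unsigned long start_pfn, end_pfn;
		int i, ret;

		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
			/* Stand-in for the real helper in this series: */
			ret = tdx_memory_add_region(PFN_PHYS(start_pfn),
						    PFN_PHYS(end_pfn));
			if (ret)
				return ret;
		}
		return 0;
	}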
4. Memory Hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed for the module's
runtime.  No more "TDX-usable" memory can be added to the TDX module
after that.

To achieve the above "guarantee all pages in the page allocator are TDX
pages", this series simply chooses to reject any non-TDX-usable memory
in memory hotplug.  (A rough sketch of such a hotplug check follows this
section.)

This _will_ be enhanced in the future after the first submission.  The
direction we are heading is to allow adding/onlining non-TDX memory to
separate NUMA nodes so that both "TDX-capable" and "non-TDX-capable"
nodes can co-exist.  The TDX flag can be exposed to userspace via /sysfs
so userspace can bind TDX guests to "TDX-capable" nodes via NUMA ABIs.

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal but depends on the BIOS to behave correctly.
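A minimal sketch of the rejection described above, using the generic
memory-hotplug notifier.  is_tdx_memory() is a stand-in for the series'
real range check; the actual patch differs in details.

	#include <linux/memory.h>
	#include <linux/notifier.h>

	/*
	 * Illustrative only: refuse to online memory that is not fully
	 * covered by the fixed set of "TDX-usable" regions.
	 */
	static int tdx_memory_notifier(struct notifier_block *nb,
				       unsigned long action, void *v)
	{
		struct memory_notify *mn = v;

		if (action != MEM_GOING_ONLINE)
			return NOTIFY_OK;

		return is_tdx_memory(mn->start_pfn,
				     mn->start_pfn + mn->nr_pages) ?
				NOTIFY_OK : NOTIFY_BAD;
	}

	static struct notifier_block tdx_memory_nb = {
		.notifier_call = tdx_memory_notifier,
	};

	/* Registered once during TDX init:
	 * register_memory_notifier(&tdx_memory_nb);
	 */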
5. Kexec()

There are two problems with using kexec() to boot into a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages (i.e. metadata used by the TDX module, and any
TDX guest memory if kexec() happens while a TDX guest is alive).
2) There might be dirty cachelines associated with TDX private pages.

Just like SME, TDX hosts require special cache flushing before kexec().
Similar to the SME handling, the kernel uses wbinvd() to flush the cache
in stop_this_cpu() when TDX is enabled, as sketched below.

This series doesn't convert all TDX private pages back to normal, due to
the below considerations:

1) The kernel doesn't have existing infrastructure to track which pages
are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
them (cache flush + using MOVDIR64B to clear the page) in kexec() can
be time consuming.
3) The new kernel will almost exclusively use KeyID 0 to access memory.
KeyID 0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
kernel ever supports MKTME, it should do MOVDIR64B to clear the page
with the new MKTME KeyID (just like TDX does) before using it.

Also, if the old kernel ever enables TDX, the new kernel cannot use TDX
again.  When the new kernel goes through the TDX module initialization
process it will fail immediately at the first step.

Ideally, it would be better to shut down the TDX module in kexec(), but
there's no guarantee that CPUs are in VMX operation during kexec(), so
just leave the module open.
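A minimal sketch of the cache-flush logic described above, assuming the
platform_tdx_enabled() helper introduced later in this series; the real
patch modifies stop_this_cpu() and differs in details.

	/*
	 * Sketch only: flush caches on the stop path if this kernel may
	 * have created dirty cachelines of TDX private memory, similar
	 * to the existing SME handling.
	 */
	static void stop_this_cpu_cache_flush_sketch(void)
	{
		if (platform_tdx_enabled())
			native_wbinvd();
	}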
== Reference ==

[1]: TDX specs
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

[2]: KVM TDX basic feature support
https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/

[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/lkml/20220915142913.2213336-1-chao.p.peng@linux.intel.com/T/

[4]: per-node memory encryption flag
https://lore.kernel.org/linux-mm/20221007155323.ue4cdthkilfy4lbd@box.shutemov.name/t/
This version mainly adds a new patch to handle the TDX vs S3/hibernation
interaction.  In short, TDX cannot survive when the platform goes to S3
or deeper states.  TDX gets completely reset upon this, and both TDX
guests and the TDX module are destroyed.  Please refer to the new patch
(21).

Other changes from v13 -> v14:
 - Addressed comments received in v13 (Rick/Nikolay/Dave).
   - SEAMCALL patches, skeleton patch, kexec patch
 - Some minor updates based on internal discussion.
 - Added received Reviewed-by tags (thanks!).
 - Updated the documentation patch to reflect new changes.

Please see each individual patch for its specific change history.

Hi Dave,

In this version all patches (except the documentation one) now have at
least Kirill's Reviewed-by tag.  Could you help to take a look?

And again, thanks everyone for reviewing and helping with this series.

v13:
https://lore.kernel.org/lkml/cover.1692962263.git.kai.huang@intel.com/T/


Kai Huang (23):
  x86/virt/tdx: Detect TDX during kernel boot
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  x86/cpu: Detect TDX partial write machine check erratum
  x86/virt/tdx: Handle SEAMCALL no entropy error in common code
  x86/virt/tdx: Add SEAMCALL error printing for module initialization
  x86/virt/tdx: Add skeleton to enable TDX on demand
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Designate reserved areas for all TDMRs
  x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/kexec: Flush cache of TDX private memory
  x86/virt/tdx: Keep TDMRs when module initialization is successful
  x86/virt/tdx: Improve readability of module initialization error
    handling
  x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  x86/virt/tdx: Handle TDX interaction with ACPI S3 and deeper states
  x86/mce: Improve error log of kernel space TDX #MC due to erratum
  Documentation/x86: Add documentation for TDX host support

 Documentation/arch/x86/tdx.rst     |  217 +++-
 arch/x86/Kconfig                   |    3 +
 arch/x86/coco/tdx/tdx-shared.c     |    6 +-
 arch/x86/include/asm/cpufeatures.h |    1 +
 arch/x86/include/asm/msr-index.h   |    3 +
 arch/x86/include/asm/shared/tdx.h  |    6 +
 arch/x86/include/asm/tdx.h         |   39 +
 arch/x86/kernel/cpu/intel.c        |   17 +
 arch/x86/kernel/cpu/mce/core.c     |   33 +
 arch/x86/kernel/machine_kexec_64.c |   16 +
 arch/x86/kernel/process.c          |    8 +-
 arch/x86/kernel/reboot.c           |   15 +
 arch/x86/kernel/setup.c            |    2 +
 arch/x86/virt/vmx/tdx/Makefile     |    2 +-
 arch/x86/virt/vmx/tdx/tdx.c        | 1587 ++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h        |  145 +++
 16 files changed, 2084 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h


base-commit: 9ee4318c157b9802589b746cc340bae3142d984c
--
2.41.0
...
space from the MKTME architecture for crypto-protection to VMs.  The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

During machine boot, TDX microcode verifies that the BIOS programmed TDX
private KeyIDs consistently and correctly across all CPU packages.  The
MSRs are locked in this state after verification.  This is why
MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration: it
indicates not just that the hardware supports TDX, but that all the
boot-time security checks passed.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests.  The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.

Add a new early_initcall(tdx_init) to detect TDX by detecting TDX
private KeyIDs.  Also add a function to report whether TDX is enabled by
the BIOS.  Similar to AMD SME, kexec() will use it to determine whether
a cache flush is needed.

The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
own protection.  Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---

v13 -> v14:
 - "tdx:" -> "virt/tdx:" (internal)
 - Add Dave's tag

---
 arch/x86/include/asm/msr-index.h |  3 ++
 arch/x86/include/asm/tdx.h       |  4 ++
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/tdx.c      | 90 ++++++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -XXX,XX +XXX,XX @@
 #define MSR_RELOAD_PMC0			0x000014c1
 #define MSR_RELOAD_FIXED_CTR0		0x00001309
 
+/* KeyID partitioning between MKTME and TDX */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
+
 /*
  * AMD64 MSRs. Not complete. See the architecture manual for a more
  * complete list.
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 u64 __seamcall(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
+
+bool platform_tdx_enabled(void);
+#else
+static inline bool platform_tdx_enabled(void) { return false; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -XXX,XX +XXX,XX @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y += seamcall.o
+obj-y += seamcall.o tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2023 Intel Corporation.
+ *
+ * Intel Trusted Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"virt/tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/printk.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+
+static u32 tdx_global_keyid __ro_after_init;
+static u32 tdx_guest_keyid_start __ro_after_init;
+static u32 tdx_nr_guest_keyids __ro_after_init;
+
+static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
+					    u32 *nr_tdx_keyids)
+{
+	u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
+	int ret;
+
+	/*
+	 * IA32_MKTME_KEYID_PARTITIONING:
+	 *   Bit [31:0]:	Number of MKTME KeyIDs.
+	 *   Bit [63:32]:	Number of TDX private KeyIDs.
+	 */
+	ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
+			&_nr_tdx_keyids);
+	if (ret)
+		return -ENODEV;
+
+	if (!_nr_tdx_keyids)
+		return -ENODEV;
+
+	/* TDX KeyIDs start after the last MKTME KeyID. */
+	_tdx_keyid_start = _nr_mktme_keyids + 1;
+
+	*tdx_keyid_start = _tdx_keyid_start;
+	*nr_tdx_keyids = _nr_tdx_keyids;
+
+	return 0;
+}
+
+static int __init tdx_init(void)
+{
+	u32 tdx_keyid_start, nr_tdx_keyids;
+	int err;
+
+	err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids);
+	if (err)
+		return err;
+
+	pr_info("BIOS enabled: private KeyID range [%u, %u)\n",
+			tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
+
+	/*
+	 * The TDX module itself requires one 'global KeyID' to protect
+	 * its metadata.  If there's only one TDX KeyID, there won't be
+	 * any left for TDX guests thus there's no point to enable TDX
+	 * at all.
+	 */
+	if (nr_tdx_keyids < 2) {
+		pr_err("initialization failed: too few private KeyIDs available.\n");
+		return -ENODEV;
+	}
+
+	/*
+	 * Just use the first TDX KeyID as the 'global KeyID' and
+	 * leave the rest for TDX guests.
+	 */
+	tdx_global_keyid = tdx_keyid_start;
+	tdx_guest_keyid_start = tdx_keyid_start + 1;
+	tdx_nr_guest_keyids = nr_tdx_keyids - 1;
+
+	return 0;
+}
+early_initcall(tdx_init);
+
+/* Return whether the BIOS has enabled TDX */
+bool platform_tdx_enabled(void)
+{
+	return !!tdx_global_keyid;
+}
--
2.41.0
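To make the KeyID partitioning concrete, here is a small worked example
with hypothetical MSR values (the real numbers come from
MSR_IA32_MKTME_KEYID_PARTITIONING on the actual platform):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t nr_mktme_keyids = 96;	/* MSR bits [31:0], assumed  */
		uint32_t nr_tdx_keyids   = 32;	/* MSR bits [63:32], assumed */

		/* KeyID 0 is TME; TDX KeyIDs start after the last MKTME KeyID. */
		uint32_t tdx_keyid_start = nr_mktme_keyids + 1;

		printf("TDX private KeyID range: [%u, %u)\n",
		       tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
		/* Prints: TDX private KeyID range: [97, 129) */
		return 0;
	}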
...
page.  However currently try_accept_one() uses hard-coded magic values.

Define TDX supported page sizes as macros and get rid of the hard-coded
values in try_accept_one().  TDX host support will need to use them too.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/coco/tdx/tdx-shared.c    | 6 +++---
 arch/x86/include/asm/shared/tdx.h | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx-shared.c b/arch/x86/coco/tdx/tdx-shared.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/coco/tdx/tdx-shared.c
+++ b/arch/x86/coco/tdx/tdx-shared.c
@@ -XXX,XX +XXX,XX @@ static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
 	 */
 	switch (pg_level) {
 	case PG_LEVEL_4K:
-		page_size = 0;
+		page_size = TDX_PS_4K;
...
 	case PG_LEVEL_1G:
-		page_size = 2;
+		page_size = TDX_PS_1G;
 		break;
 	default:
 		return 0;
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -XXX,XX +XXX,XX @@
 	(TDX_RDX | TDX_RBX | TDX_RSI | TDX_RDI | TDX_R8 | TDX_R9 | \
 	 TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15)
 
+/* TDX supported page sizes from the TDX module ABI. */
+#define TDX_PS_4K	0
+#define TDX_PS_2M	1
+#define TDX_PS_1G	2
+
 #ifndef __ASSEMBLY__
 
 #include <linux/compiler_attributes.h>
--
2.41.0

TDX capable platforms are locked to X2APIC mode and cannot fall back to
the legacy xAPIC mode when TDX is enabled by the BIOS.  TDX host support
requires x2APIC.  Make INTEL_TDX_HOST depend on X86_X2APIC.

Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
...
 	depends on KVM_INTEL
+	depends on X86_X2APIC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks. This option enables necessary TDX
--
2.41.0
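Circling back to the page-size macros from the earlier patch, the host
side can iterate over the TDX page sizes, e.g. when sizing the per-page
metadata (PAMT) for each supported page size.  This is an illustrative
sketch only; TDX_PS_NR is a hypothetical end marker, not defined by the
patch above.

	#define TDX_PS_NR	(TDX_PS_1G + 1)	/* hypothetical */

	static unsigned long total_pamt_pages(unsigned long pamt_pages[])
	{
		unsigned long total = 0;
		int pgsz;

		/* One PAMT array entry per supported TDX page size. */
		for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++)
			total += pamt_pages[pgsz];

		return total;
	}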
TDX memory has integrity and confidentiality protections.  Violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself.  In other words,
the host kernel should never, itself, see machine checks induced by the
TDX integrity hardware.

Alas, the first few generations of TDX hardware have an erratum.  A
partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

Virtually all kernel memory access operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

With this erratum, there are additional things that need to be done.  To
prepare for those changes, add a CPU bug bit to indicate this erratum.
Note this bug reflects the hardware, thus it is detected regardless of
whether the kernel is built with TDX support or not.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---

v13 -> v14:
 - Use "To prepare for ___, add ___" in changelog (Dave)
 - Added Dave's tag.

v12 -> v13:
 - Added David's tag.

v11 -> v12:
 - Added Kirill's tag
 - Changed to detect the erratum in early_init_intel() (Kirill)

v10 -> v11:
 - New patch

---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/intel.c        | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -XXX,XX +XXX,XX @@
 #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
 #define X86_BUG_SMT_RSB			X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
 #define X86_BUG_GDS			X86_BUG(30) /* CPU is affected by Gather Data Sampling */
+#define X86_BUG_TDX_PW_MCE		X86_BUG(31) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
 
 /* BUG word 2 */
 #define X86_BUG_SRSO			X86_BUG(1*32 + 0) /* AMD SRSO bug */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -XXX,XX +XXX,XX @@ static bool bad_spectre_microcode(struct cpuinfo_x86 *c)
 	return false;
 }
 
+static void check_tdx_erratum(struct cpuinfo_x86 *c)
+{
+	/*
+	 * These CPUs have an erratum.  A partial write from non-TD
+	 * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
+	 * private memory poisons that memory, and a subsequent read of
+	 * that memory triggers #MC.
+	 */
+	switch (c->x86_model) {
+	case INTEL_FAM6_SAPPHIRERAPIDS_X:
+	case INTEL_FAM6_EMERALDRAPIDS_X:
+		setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
+	}
+}
+
 static void early_init_intel(struct cpuinfo_x86 *c)
 {
 	u64 misc_enable;
@@ -XXX,XX +XXX,XX @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	 */
 	if (detect_extended_topology_early(c) < 0)
 		detect_ht_early(c);
+
+	check_tdx_erratum(c);
 }
 
 static void bsp_init_intel(struct cpuinfo_x86 *c)
--
2.41.0
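Once the bug bit is set, later code can key off it.  A tiny illustrative
helper (not from this patch; the kexec patch later in the series is
where this actually matters):

	/*
	 * Illustrative only: decide whether TDX private memory must be
	 * reset (e.g. via MOVDIR64B) before kexec() on affected parts.
	 */
	static bool tdx_needs_private_mem_reset(void)
	{
		return boot_cpu_has_bug(X86_BUG_TDX_PW_MCE);
	}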
Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons
as RDRAND.  Use the kernel RDRAND retry logic for them.

There are three __seamcall*() variants.  Do the SEAMCALL retry in common
code and add a wrapper for each of them.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
 - Use real function sc_retry() instead of using macros. (Dave)
 - Added Kirill's tag.

v12 -> v13:
 - New implementation due to TDCALL assembly series.

---
 arch/x86/include/asm/tdx.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@
 #define TDX_SEAMCALL_GP			(TDX_SW_ERROR | X86_TRAP_GP)
 #define TDX_SEAMCALL_UD			(TDX_SW_ERROR | X86_TRAP_UD)
 
+/*
+ * TDX module SEAMCALL leaf function error codes
+ */
+#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
+
 #ifndef __ASSEMBLY__
 
 /*
@@ -XXX,XX +XXX,XX @@ u64 __seamcall(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 
+#include <asm/archrandom.h>
+
+typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
+
+static inline u64 sc_retry(sc_func_t func, u64 fn,
+			   struct tdx_module_args *args)
+{
+	int retry = RDRAND_RETRY_LOOPS;
+	u64 ret;
+
+	do {
+		ret = func(fn, args);
+	} while (ret == TDX_RND_NO_ENTROPY && --retry);
+
+	return ret;
+}
+
+#define seamcall(_fn, _args)		sc_retry(__seamcall, (_fn), (_args))
+#define seamcall_ret(_fn, _args)	sc_retry(__seamcall_ret, (_fn), (_args))
+#define seamcall_saved_ret(_fn, _args)	sc_retry(__seamcall_saved_ret, (_fn), (_args))
+
 bool platform_tdx_enabled(void);
 #else
 static inline bool platform_tdx_enabled(void) { return false; }
--
2.41.0
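A short usage sketch of the wrapper above.  TDH_SYS_HYPOTHETICAL is a
made-up leaf number used purely to show the calling convention; real
leaf numbers come from the TDX module spec.

	#define TDH_SYS_HYPOTHETICAL	42ULL	/* made-up leaf number */

	static u64 do_example_seamcall(void)
	{
		struct tdx_module_args args = {};

		/* Retries automatically if the module ran out of entropy. */
		return seamcall(TDH_SYS_HYPOTHETICAL, &args);
	}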
The SEAMCALLs involved during the TDX module initialization are not
expected to fail.  In fact, they are not expected to return any non-zero
code (except the "running out of entropy" error, which can be handled
internally already).

Add yet another set of SEAMCALL wrappers, which treats all non-zero
return codes as errors, to support printing SEAMCALL errors upon failure
during module initialization.  Note the TDX module initialization
doesn't use the _saved_ret() variant, thus no wrapper is added for it.

SEAMCALL assembly can also return kernel-defined error codes for three
special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't
loaded; 3) CPU isn't in VMX operation.  Whether they can legally happen
depends on the caller, so leave it to the caller to print an error
message when desired.

Also convert the SEAMCALL error codes to kernel error codes in the new
wrappers so that each SEAMCALL caller doesn't have to repeat the
conversion.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
 - Use real functions to replace macros. (Dave)
 - Moved printing error message for special error code to the caller
   (internal)
 - Added Kirill's tag

v12 -> v13:
 - New implementation due to TDCALL assembly series.

---
 arch/x86/include/asm/tdx.h  |  1 +
 arch/x86/virt/vmx/tdx/tdx.c | 52 +++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@
 /*
  * TDX module SEAMCALL leaf function error codes
  */
+#define TDX_SUCCESS		0ULL
 #define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
 
 #ifndef __ASSEMBLY__
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
+typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
+
+static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
+{
+	pr_err("SEAMCALL (0x%llx) failed: 0x%llx\n", fn, err);
+}
+
+static inline void seamcall_err_ret(u64 fn, u64 err,
+				    struct tdx_module_args *args)
+{
+	seamcall_err(fn, err, args);
+	pr_err("RCX 0x%llx RDX 0x%llx R8 0x%llx R9 0x%llx R10 0x%llx R11 0x%llx\n",
+			args->rcx, args->rdx, args->r8, args->r9,
+			args->r10, args->r11);
+}
+
+static inline void seamcall_err_saved_ret(u64 fn, u64 err,
+					  struct tdx_module_args *args)
+{
+	seamcall_err_ret(fn, err, args);
+	pr_err("RBX 0x%llx RDI 0x%llx RSI 0x%llx R12 0x%llx R13 0x%llx R14 0x%llx R15 0x%llx\n",
+			args->rbx, args->rdi, args->rsi, args->r12,
+			args->r13, args->r14, args->r15);
+}
+
+static inline int sc_retry_prerr(sc_func_t func, sc_err_func_t err_func,
+				 u64 fn, struct tdx_module_args *args)
+{
+	u64 sret = sc_retry(func, fn, args);
+
+	if (sret == TDX_SUCCESS)
+		return 0;
+
+	if (sret == TDX_SEAMCALL_VMFAILINVALID)
+		return -ENODEV;
+
+	if (sret == TDX_SEAMCALL_GP)
+		return -EOPNOTSUPP;
+
+	if (sret == TDX_SEAMCALL_UD)
+		return -EACCES;
+
+	err_func(fn, sret, args);
+	return -EIO;
+}
+
+#define seamcall_prerr(__fn, __args)						\
+	sc_retry_prerr(__seamcall, seamcall_err, (__fn), (__args))
+
+#define seamcall_prerr_ret(__fn, __args)					\
+	sc_retry_prerr(__seamcall_ret, seamcall_err_ret, (__fn), (__args))
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
--
2.41.0
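An illustrative caller of the printing wrapper, reusing the made-up
TDH_SYS_HYPOTHETICAL leaf from the earlier sketch.  On failure the
wrapper has already printed the SEAMCALL error and converted it to a
kernel errno (-ENODEV / -EOPNOTSUPP / -EACCES / -EIO):

	static int init_step_example(void)
	{
		struct tdx_module_args args = {};

		/* Any non-zero module return code becomes a kernel errno. */
		return seamcall_prerr(TDH_SYS_HYPOTHETICAL, &args);
	}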
1
Before the TDX module can be used to create and run TDX guests, it must
1
To enable TDX the kernel needs to initialize TDX from two perspectives:
2
be loaded and properly initialized. The TDX module is expected to be
2
1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
3
loaded by the BIOS, and to be initialized by the kernel.
3
to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
4
4
on one logical cpu before the kernel wants to make any other SEAMCALLs
5
TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). The host
5
on that cpu (including those involved during module initialization and
6
kernel communicates with the TDX module via a new SEAMCALL instruction.
6
running TDX guests).
7
The TDX module implements a set of SEAMCALL leaf functions to allow the
8
host kernel to initialize it.
9
7
10
The TDX module can be initialized only once in its lifetime. Instead
8
The TDX module can be initialized only once in its lifetime. Instead
11
of always initializing it at boot time, this implementation chooses an
9
of always initializing it at boot time, this implementation chooses an
12
"on demand" approach to initialize TDX until there is a real need (e.g
10
"on demand" approach to initialize TDX until there is a real need (e.g
13
when requested by KVM). This approach has below pros:
11
when requested by KVM). This approach has below pros:
14
12
15
1) It avoids consuming the memory that must be allocated by kernel and
13
1) It avoids consuming the memory that must be allocated by kernel and
16
given to the TDX module as metadata (~1/256th of the TDX-usable memory),
14
given to the TDX module as metadata (~1/256th of the TDX-usable memory),
17
and also saves the CPU cycles of initializing the TDX module (and the
15
and also saves the CPU cycles of initializing the TDX module (and the
18
metadata) when TDX is not used at all.
16
metadata) when TDX is not used at all.
19
17
20
2) It is more flexible to support TDX module runtime updating in the
18
2) The TDX module design allows it to be updated while the system is
21
future (after updating the TDX module, it needs to be initialized
19
running. The update procedure shares quite a few steps with this "on
22
again).
20
demand" initialization mechanism. The hope is that much of "on demand"
23
21
mechanism can be shared with a future "update" mechanism. A boot-time
24
3) It avoids having to do a "temporary" solution to handle VMXON in the
22
TDX module implementation would not be able to share much code with the
25
core (non-KVM) kernel for now. This is because SEAMCALL requires CPU
23
update mechanism.
26
being in VMX operation (VMXON is done), but currently only KVM handles
24
27
VMXON. Adding VMXON support to the core kernel isn't trivial. More
25
3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM
28
importantly, from long-term a reference-based approach is likely needed
26
code mucks with VMX enabling. If the TDX module were to be initialized
29
in the core kernel as more kernel components are likely needed to
27
separately from KVM (like at boot), the boot code would need to be
30
support TDX as well. Allow KVM to initialize the TDX module avoids
28
taught how to muck with VMX enabling and KVM would need to be taught how
31
having to handle VMXON during kernel boot for now.
29
to cope with that. Making KVM itself responsible for TDX initialization
32
30
lets the rest of the kernel stay blissfully unaware of VMX.

Similar to module initialization, also make the per-cpu initialization
"on demand" as it also depends on VMX being enabled.

Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
module and to enable TDX on the local cpu respectively. For now
tdx_enable() is a placeholder. The TODO list will be pared down as
functionality is added.

Export both tdx_cpu_enable() and tdx_enable() for KVM use.

In tdx_enable() use a state machine protected by mutex to make sure the
initialization will only be done once, as tdx_enable() can be called
multiple times (i.e. the KVM module can be reloaded) and may be called
concurrently by other kernel components in the future.

The per-cpu initialization on each cpu can only be done once during the
module's life time. Use a per-cpu variable to track its status to make
sure it is only done once in tdx_cpu_enable().

Also, a SEAMCALL to do TDX module global initialization must be done
once on any logical cpu before any per-cpu initialization SEAMCALL. Do
it inside tdx_cpu_enable() too (if it hasn't been done).

tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The
per-cpu initialization must be done before those SEAMCALLs are invoked
on some cpu. To keep things simple, in tdx_cpu_enable(), always do the
per-cpu initialization regardless of whether the TDX module has been
initialized or not. And in tdx_enable(), don't call tdx_cpu_enable()
but assume the caller has disabled CPU hotplug, done VMXON and
tdx_cpu_enable() on all online cpus before calling tdx_enable().
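
To make that calling contract concrete, below is a minimal sketch of
what a user like KVM might do. It is illustrative only: do_vmxon() is
a hypothetical stand-in for the caller's own VMX-enabling code, and
only tdx_cpu_enable() and tdx_enable() come from this patch:

    static void vmxon_and_tdx_cpu_enable(void *err)
    {
        do_vmxon();        /* hypothetical: caller's VMXON code */

        /* Runs via IPI, so IRQs are disabled as required. */
        if (tdx_cpu_enable())
            atomic_inc((atomic_t *)err);
    }

    static int enable_tdx(void)
    {
        atomic_t err = ATOMIC_INIT(0);
        int ret = -EINVAL;

        cpus_read_lock();    /* no cpu can go offline or come online */
        on_each_cpu(vmxon_and_tdx_cpu_enable, &err, 1);
        if (!atomic_read(&err))
            ret = tdx_enable();
        cpus_read_unlock();

        return ret;
    }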

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- Use lockdep_assert_irqs_off() in try_init_module_global() (Nikolay),
  but still keep the comment (Kirill).
- Add code to print "module not loaded" in the first SEAMCALL.
- If SYS.INIT fails, stop calling LP.INIT in other tdx_cpu_enable()s.
- Added Kirill's tag.

v12 -> v13:
- Made tdx_cpu_enable() always be called with IRQ disabled via IPI
  function call (Peter, Kirill).

v11 -> v12:
- Simplified TDX module global init and lp init status tracking (David).
- Added comment around try_init_module_global() for using
  raw_spin_lock() (Dave).
- Added one sentence to changelog to explain why to expose tdx_enable()
  and tdx_cpu_enable() (Dave).
- Simplified comments around tdx_enable() and tdx_cpu_enable() to use
  lockdep_assert_*() instead (Dave).
- Removed redundant "TDX" in error message (Dave).

v10 -> v11:
- Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
- Return the actual error code for tdx_enable() instead of -EINVAL.
- Added Isaku's Reviewed-by.

v9 -> v10:
- Merged the patch to handle per-cpu initialization into this patch to
  tell the story better.
- Changed how to handle the per-cpu initialization to only provide a
  tdx_cpu_enable() function to let the user of TDX do it when the user
  wants to run TDX code on a certain cpu.
- Changed tdx_enable() to not call cpus_read_lock() explicitly, but
  call lockdep_assert_cpus_held() to assume the caller has done that.
- Improved comments around tdx_enable() and tdx_cpu_enable().
- Improved changelog to tell the story better accordingly.

v8 -> v9:
- Removed detailed TODO list in the changelog (Dave).
- Added back steps to do module global initialization and per-cpu
  initialization in the TODO list comment.
- Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h.

v7 -> v8:
- Refined changelog (Dave).
- Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
- Added a "TODO list" comment in init_tdx_module() to list all steps of
  initializing the TDX module to tell the story (Dave).
- Made tdx_enable() universally return -EINVAL, and removed nonsense
  comments (Dave).
- Simplified __tdx_enable() to only handle success or failure.
- TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR.
- Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
- Improved comments (Dave).
- Pointed out 'tdx_module_status' is a software construct (Dave).

...

---
 arch/x86/include/asm/tdx.h  |   4 +
 arch/x86/virt/vmx/tdx/tdx.c | 167 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  30 +++++++
 3 files changed, 201 insertions(+)
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
 #define seamcall_saved_ret(_fn, _args)    sc_retry(__seamcall_saved_ret, (_fn), (_args))

 bool platform_tdx_enabled(void);
+int tdx_cpu_enable(void);
+int tdx_enable(void);
 #else
 static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_cpu_enable(void) { return -ENODEV; }
+static inline int tdx_enable(void) { return -ENODEV; }
 #endif    /* CONFIG_INTEL_TDX_HOST */

 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
 #include <linux/init.h>
 #include <linux/errno.h>
 #include <linux/printk.h>
+#include <linux/cpu.h>
+#include <linux/spinlock.h>
+#include <linux/percpu-defs.h>
+#include <linux/mutex.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
+#include "tdx.h"

 static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;

+static DEFINE_PER_CPU(bool, tdx_lp_initialized);
+
+static enum tdx_module_status_t tdx_module_status;
+static DEFINE_MUTEX(tdx_module_lock);
+
 typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);

 static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -XXX,XX +XXX,XX @@ static inline int sc_retry_prerr(sc_func_t func, sc_err_func_t err_func,
 #define seamcall_prerr_ret(__fn, __args)                    \
     sc_retry_prerr(__seamcall_ret, seamcall_err_ret, (__fn), (__args))

+/*
+ * Do the module global initialization once and return its result.
+ * It can be done on any cpu. It's always called with interrupts
+ * disabled.
+ */
+static int try_init_module_global(void)
+{
+    struct tdx_module_args args = {};
+    static DEFINE_RAW_SPINLOCK(sysinit_lock);
+    static bool sysinit_done;
+    static int sysinit_ret;
+
+    lockdep_assert_irqs_disabled();
+
+    raw_spin_lock(&sysinit_lock);
+
+    if (sysinit_done)
+        goto out;
+
+    /* RCX is module attributes and all bits are reserved */
+    args.rcx = 0;
+    sysinit_ret = seamcall_prerr(TDH_SYS_INIT, &args);
+
+    /*
+     * The first SEAMCALL also detects the TDX module, thus
+     * it can fail due to the TDX module not being loaded.
+     * Dump message to let the user know.
+     */
+    if (sysinit_ret == -ENODEV)
+        pr_err("module not loaded\n");
+
+    sysinit_done = true;
+out:
+    raw_spin_unlock(&sysinit_lock);
+    return sysinit_ret;
+}
+
+/**
+ * tdx_cpu_enable - Enable TDX on local cpu
+ *
+ * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
+ * global initialization SEAMCALL if not done) on local cpu to make this
+ * cpu be ready to run any other SEAMCALLs.
+ *
+ * Always call this function via IPI function calls.
+ *
+ * Return 0 on success, otherwise errors.
+ */
+int tdx_cpu_enable(void)
+{
+    struct tdx_module_args args = {};
+    int ret;
+
+    if (!platform_tdx_enabled())
+        return -ENODEV;
+
+    lockdep_assert_irqs_disabled();
+
+    if (__this_cpu_read(tdx_lp_initialized))
+        return 0;
+
+    /*
+     * The TDX module global initialization is the very first step
+     * to enable TDX. Need to do it first (if it hasn't been done)
+     * before the per-cpu initialization.
+     */
+    ret = try_init_module_global();
+    if (ret)
+        return ret;
+
+    ret = seamcall_prerr(TDH_SYS_LP_INIT, &args);
+    if (ret)
+        return ret;
+
+    __this_cpu_write(tdx_lp_initialized, true);
+
+    return 0;
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_enable);
+
+static int init_tdx_module(void)
+{
+    /*
+     * TODO:
+     *
+     * - Get TDX module information and TDX-capable memory regions.
+     * - Build the list of TDX-usable memory regions.
+     * - Construct a list of "TD Memory Regions" (TDMRs) to cover
+     *   all TDX-usable memory regions.
+     * - Configure the TDMRs and the global KeyID to the TDX module.
+     * - Configure the global KeyID on all packages.
+     * - Initialize all TDMRs.
+     *
+     * Return error before all steps are done.
+     */
+    return -EINVAL;
+}
+
+static int __tdx_enable(void)
+{
+    int ret;
+
+    ret = init_tdx_module();
+    if (ret) {
+        pr_err("module initialization failed (%d)\n", ret);
+        tdx_module_status = TDX_MODULE_ERROR;
+        return ret;
+    }
+
+    pr_info("module initialized\n");
+    tdx_module_status = TDX_MODULE_INITIALIZED;
+
+    return 0;
+}
+
+/**
+ * tdx_enable - Enable TDX module to make it ready to run TDX guests
+ *
+ * This function assumes the caller has: 1) held read lock of CPU hotplug
+ * lock to prevent any new cpu from becoming online; 2) done both VMXON
+ * and tdx_cpu_enable() on all online cpus.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return 0 if TDX is enabled successfully, otherwise error.
+ */
+int tdx_enable(void)
+{
+    int ret;
+
+    if (!platform_tdx_enabled())
+        return -ENODEV;
+
+    lockdep_assert_cpus_held();
+
+    mutex_lock(&tdx_module_lock);
+
+    switch (tdx_module_status) {
+    case TDX_MODULE_UNINITIALIZED:
+        ret = __tdx_enable();
+        break;
+    case TDX_MODULE_INITIALIZED:
+        /* Already initialized, great, tell the caller. */
+        ret = 0;
+        break;
+    default:
+        /* Failed to initialize in the previous attempts */
+        ret = -EINVAL;
+        break;
+    }
+
+    mutex_unlock(&tdx_module_lock);
+
+    return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_enable);
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
                      u32 *nr_tdx_keyids)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability. The
+ * architectural definitions come first.
+ */
+
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_INIT        33
+#define TDH_SYS_LP_INIT        35
+
+/*
+ * Do not put any hardware-defined TDX structure representations below
+ * this comment!
+ */
+
+/* Kernel defined TDX module status during module initialization. */
+enum tdx_module_status_t {
+    TDX_MODULE_UNINITIALIZED,
+    TDX_MODULE_INITIALIZED,
+    TDX_MODULE_ERROR
+};
+
+#endif
--
2.41.0

Start to transit out the "multi-steps" to initialize the TDX module.

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums. Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR). During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.

CMRs tell the kernel which memory is TDX compatible. The kernel takes
CMRs (plus a little more metadata) and constructs "TD Memory Regions"
(TDMRs). TDMRs let the kernel grant TDX protections to some or all of
the CMR areas.

The TDX module also reports necessary information to let the kernel
build TDMRs and run TDX guests in structure 'tdsysinfo_struct'. The
list of CMRs, along with the TDX module information, is available to
the kernel by querying the TDX module.

As a preparation to construct TDMRs, get the TDX module information and
the list of CMRs. Print out CMRs to help the user to decode which
memory regions are TDX convertible.

The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot
of info about the TDX module. Fully define the entire structure, but
only use the fields necessary to build the TDMRs and pr_info() some
basics about the module. The rest of the fields will get used by KVM.
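
As an illustration of how those remaining fields can be consumed later,
below is a sketch of walking the flexible array at the tail of
'tdsysinfo_struct' (as defined in the diff below). This is not code
from this patch; print_cpuid_configs() is a hypothetical helper:

    static void print_cpuid_configs(struct tdsysinfo_struct *tdsysinfo)
    {
        u32 i;

        /* 'num_cpuid_config' bounds the trailing flexible array. */
        for (i = 0; i < tdsysinfo->num_cpuid_config; i++) {
            struct cpuid_config *c = &tdsysinfo->cpuid_configs[i];

            pr_info("CPUID config: leaf 0x%x, sub-leaf 0x%x\n",
                c->leaf, c->sub_leaf);
        }
    }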

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- Added Kirill's tag.

v12 -> v13:
- Allocate TDSYSINFO and CMR array separately. (Kirill)
- Added comment around TDH.SYS.INFO. (Peter)

v11 -> v12:
- Changed to use dynamic allocation for TDSYSINFO_STRUCT and CMR array
  (Kirill).
- Keep SEAMCALL leaf macro definitions in order (Kirill).
- Removed is_cmr_empty() but open code directly (David).
- 'atribute' -> 'attribute' (David).

v10 -> v11:
- No change.

v9 -> v10:
- Added back "start to transit out..." as now per-cpu init has been
  moved out from tdx_enable().

v8 -> v9:
- Removed "start to transit out ..." part in changelog since this
  patch is no longer the first step anymore.
- Changed to declare 'tdsysinfo' and 'cmr_array' as local static, and
  changed changelog accordingly (Dave).
- Improved changelog to explain why to declare 'tdsysinfo_struct' in
  full but only use a few members of them (Dave).

v7 -> v8: (Dave)
- Improved changelog to tell this is the first patch to transit out the
  "multi-steps" init_tdx_module().
- Removed all CMR check/trim code but to depend on later SEAMCALL.
- Variable 'vertical alignment' in print TDX module information.
- Added DECLARE_PADDED_STRUCT() for padded structure.
- Made tdx_sysinfo and tdx_cmr_array[] to be function local variables
  (and renamed them accordingly), and added -Wframe-larger-than=4096
  flag to silence the build warning.

...

---
 arch/x86/virt/vmx/tdx/tdx.c | 94 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 64 +++++++++++++++++++++++++
 2 files changed, 156 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
 #include <linux/spinlock.h>
 #include <linux/percpu-defs.h>
 #include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/math.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
+#include <asm/page.h>
 #include <asm/tdx.h>
 #include "tdx.h"

@@ -XXX,XX +XXX,XX @@ int tdx_cpu_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_cpu_enable);

+static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
+{
+    int i;
+
+    for (i = 0; i < nr_cmrs; i++) {
+        struct cmr_info *cmr = &cmr_array[i];
+
+        /*
+         * The array of CMRs reported via TDH.SYS.INFO can
+         * contain tail empty CMRs. Don't print them.
+         */
+        if (!cmr->size)
+            break;
+
+        pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
+                cmr->base + cmr->size);
+    }
+}
+
+static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo,
+             struct cmr_info *cmr_array)
+{
+    struct tdx_module_args args = {};
+    int ret;
+
+    /*
+     * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array
+     * to the buffers provided by the kernel (via RCX and R8
+     * respectively). The buffer size of the TDSYSINFO_STRUCT
+     * (via RDX) and the maximum entries of the CMR array (via R9)
+     * passed to this SEAMCALL must be at least the size of
+     * TDSYSINFO_STRUCT and MAX_CMRS respectively.
+     *
+     * Upon a successful return, R9 contains the actual entries
+     * written to the CMR array.
+     */
+    args.rcx = __pa(tdsysinfo);
+    args.rdx = TDSYSINFO_STRUCT_SIZE;
+    args.r8 = __pa(cmr_array);
+    args.r9 = MAX_CMRS;
+    ret = seamcall_prerr_ret(TDH_SYS_INFO, &args);
+    if (ret)
+        return ret;
+
+    pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
+        tdsysinfo->attributes, tdsysinfo->vendor_id,
+        tdsysinfo->major_version, tdsysinfo->minor_version,
+        tdsysinfo->build_date, tdsysinfo->build_num);
+
+    print_cmrs(cmr_array, args.r9);
+
+    return 0;
+}
+
 static int init_tdx_module(void)
 {
+    struct tdsysinfo_struct *tdsysinfo;
+    struct cmr_info *cmr_array;
+    int tdsysinfo_size;
+    int cmr_array_size;
+    int ret;
+
+    tdsysinfo_size = round_up(TDSYSINFO_STRUCT_SIZE,
+            TDSYSINFO_STRUCT_ALIGNMENT);
+    tdsysinfo = kzalloc(tdsysinfo_size, GFP_KERNEL);
+    if (!tdsysinfo)
+        return -ENOMEM;
+
+    cmr_array_size = sizeof(struct cmr_info) * MAX_CMRS;
+    cmr_array_size = round_up(cmr_array_size, CMR_INFO_ARRAY_ALIGNMENT);
+    cmr_array = kzalloc(cmr_array_size, GFP_KERNEL);
+    if (!cmr_array) {
+        kfree(tdsysinfo);
+        return -ENOMEM;
+    }
+
+    /* Get the TDSYSINFO_STRUCT and CMRs from the TDX module. */
+    ret = get_tdx_sysinfo(tdsysinfo, cmr_array);
+    if (ret)
+        goto out;
+
     /*
      * TODO:
      *
-     * - Get TDX module information and TDX-capable memory regions.
      * - Build the list of TDX-usable memory regions.
      * - Construct a list of "TD Memory Regions" (TDMRs) to cover
      *   all TDX-usable memory regions.
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
      *
      * Return error before all steps are done.
      */
-    return -EINVAL;
+    ret = -EINVAL;
+out:
+    /*
+     * For now both @sysinfo and @cmr_array are only used during
+     * module initialization, so always free them.
+     */
+    kfree(tdsysinfo);
+    kfree(cmr_array);
+    return ret;
 }

 static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@
 #ifndef _X86_VIRT_TDX_H
 #define _X86_VIRT_TDX_H

+#include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/compiler_attributes.h>
+
 /*
  * This file contains both macros and data structures defined by the TDX
  * architecture and Linux defined software data structures and functions.
@@ -XXX,XX +XXX,XX @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INFO        32
 #define TDH_SYS_INIT        33
 #define TDH_SYS_LP_INIT        35

+struct cmr_info {
+    u64    base;
+    u64    size;
+} __packed;
+
+#define MAX_CMRS    32
+#define CMR_INFO_ARRAY_ALIGNMENT    512
+
+struct cpuid_config {
+    u32    leaf;
+    u32    sub_leaf;
...
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE        1024
+#define TDSYSINFO_STRUCT_ALIGNMENT    1024
+
+/*
+ * The size of this structure itself is flexible. The actual structure
+ * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE bytes
+ * and TDSYSINFO_STRUCT_ALIGNMENT bytes aligned.
+ */
+struct tdsysinfo_struct {
+    /* TDX-SEAM Module Info */
+    u32    attributes;
+    u32    vendor_id;
+    u32    build_date;
...
+    u64    xfam_fixed1;
+    u8    reserved4[32];
+    u32    num_cpuid_config;
+    /*
+     * The actual number of CPUID_CONFIG depends on above
+     * 'num_cpuid_config'.
+     */
+    DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
+} __packed;
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
--
2.41.0

As a step of initializing the TDX module, the kernel needs to tell the
TDX module which memory regions can be used by the TDX module as TDX
guest memory.

TDX reports a list of "Convertible Memory Region" (CMR) to tell the
kernel which memory is TDX compatible. The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module. Once this is done, those "TDX-usable" memory regions
are fixed during module's lifetime.

To keep things simple, assume that all TDX-protected memory will come
from the page allocator. Make sure all pages in the page allocator
*are* TDX-usable memory.

As TDX-usable memory is a fixed configuration, take a snapshot of the
memory configuration from memblocks at the time of module initialization
(memblocks are modified on memory hotplug). This snapshot is used to
enable TDX support for *this* memory configuration only. Use a memory
hotplug notifier to ensure that no other RAM can be added outside of
this configuration.
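
To make the notifier idea concrete, below is a sketch of the shape it
can take. This is illustrative only and may differ from the series'
actual code; is_tdx_memory() (a name mentioned in the changelog below)
checks a pfn range against the @tdx_memlist built at module
initialization:

    static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
    {
        struct tdx_memblock *tmb;

        /*
         * This check assumes that the start_pfn<->end_pfn range does
         * not cross multiple @tdx_memlist entries. A single memory
         * online event across multiple memblocks (from which the
         * @tdx_memlist entries are derived) is not possible.
         */
        list_for_each_entry(tmb, &tdx_memlist, list) {
            if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
                return true;
        }
        return false;
    }

    static int tdx_memory_notifier(struct notifier_block *nb,
                       unsigned long action, void *v)
    {
        struct memory_notify *mn = v;

        if (action != MEM_GOING_ONLINE)
            return NOTIFY_OK;

        /* An empty @tdx_memlist means TDX isn't enabled. */
        if (list_empty(&tdx_memlist))
            return NOTIFY_OK;

        /*
         * The TDX memory configuration is static. Reject onlining
         * any memory which is outside of the static configuration,
         * whether it supports TDX or not.
         */
        if (is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages))
            return NOTIFY_OK;

        return NOTIFY_BAD;
    }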
22
This approach requires all memblock memory regions at the time of module
23
23
initialization to be TDX convertible memory to work, otherwise module
24
Note this requires all memory regions in memblock are TDX convertible
24
initialization will fail in a later SEAMCALL when passing those regions
25
memory when initializing the TDX module. This is true in practice if no
25
to the module. This approach works when all boot-time "system RAM" is
26
new memory has been hot-added before initializing the TDX module, since
26
TDX convertible memory, and no non-TDX-convertible memory is hot-added
27
in practice all boot-time present DIMM is TDX convertible memory. If
27
to the core-mm before module initialization.
28
any new memory has been hot-added, then initializing the TDX module will
28
29
fail due to that memory region is not covered by CMR.
29
For instance, on the first generation of TDX machines, both CXL memory
30
30
and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add
31
This can be enhanced in the future, i.e. by allowing adding non-TDX
31
any CXL memory or NVDIMM to the core-mm before module initialization
32
memory to a separate NUMA node. In this case, the "TDX-capable" nodes
32
will result in failure to initialize the module. The SEAMCALL error
33
and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
33
code will be available in the dmesg to help user to understand the
34
needs to guarantee memory pages for TDX guests are always allocated from
34
failure.
35
the "TDX-capable" nodes.
36
37
Note TDX assumes convertible memory is always physically present during
38
machine's runtime. A non-buggy BIOS should never support hot-removal of
39
any convertible memory. This implementation doesn't handle ACPI memory
40
removal but depends on the BIOS to behave correctly.
41
35
42
Signed-off-by: Kai Huang <kai.huang@intel.com>
36
Signed-off-by: Kai Huang <kai.huang@intel.com>
37
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
38
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
39
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
40
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
43
---
41
---
44
42
45
v6 -> v7:
43
v13 -> v14:
46
- Changed to use all system memory in memblock at the time of
44
- No change
47
initializing the TDX module as TDX memory
45
48
- Added memory hotplug support
46
v12 -> v13:
47
- Avoided using " ? : " in tdx_memory_notifier(). (Peter)
48
49
v11 -> v12:
50
- Added tags from Dave/Kirill.
51
52
v10 -> v11:
53
- Added Isaku's Reviewed-by.
54
55
v9 -> v10:
56
- Moved empty @tdx_memlist check out of is_tdx_memory() to make the
57
logic better.
58
- Added Ying's Reviewed-by.
59
60
v8 -> v9:
61
- Replace "The initial support ..." with timeless sentence in both
62
changelog and comments(Dave).
63
- Fix run-on sentence in changelog, and senstence to explain why to
64
stash off memblock (Dave).
65
- Tried to improve why to choose this approach and how it work in
66
changelog based on Dave's suggestion.
67
- Many other comments enhancement (Dave).
68
69
v7 -> v8:
70
- Trimed down changelog (Dave).
71
- Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
72
(Ying).
73
- Moved memory hotplug handling from add_arch_memory() to
74
memory_notifier (Dan/David).
75
- Removed 'nid' from 'struct tdx_memblock' to later patch (Dave).
76
- {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave).
77
- Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
78
- Improve the comment around first 1MB (Dave).
79
- Added a comment around reserve_real_mode() to point out TDX code
80
relies on first 1MB being reserved (Ying).
81
- Added comment to explain why the new online memory range cannot
82
cross multiple TDX memory blocks (Dave).
83
- Improved other comments (Dave).
49
84
50
---
85
---
51
arch/x86/Kconfig | 1 +
86
arch/x86/Kconfig | 1 +
52
arch/x86/include/asm/tdx.h | 3 +
87
arch/x86/kernel/setup.c | 2 +
53
arch/x86/mm/init_64.c | 10 ++
88
arch/x86/virt/vmx/tdx/tdx.c | 162 +++++++++++++++++++++++++++++++++++-
54
arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
89
arch/x86/virt/vmx/tdx/tdx.h | 6 ++
55
4 files changed, 197 insertions(+)
90
4 files changed, 170 insertions(+), 1 deletion(-)
56
91
57
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
92
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
58
index XXXXXXX..XXXXXXX 100644
93
index XXXXXXX..XXXXXXX 100644
59
--- a/arch/x86/Kconfig
94
--- a/arch/x86/Kconfig
60
+++ b/arch/x86/Kconfig
95
+++ b/arch/x86/Kconfig
...
...
64
    depends on X86_X2APIC
99
    depends on X86_X2APIC
65
+    select ARCH_KEEP_MEMBLOCK
100
+    select ARCH_KEEP_MEMBLOCK
66
    help
101
    help
67
     Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
102
     Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
68
     host and certain physical attacks. This option enables necessary TDX
103
     host and certain physical attacks. This option enables necessary TDX
69
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
104
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
70
index XXXXXXX..XXXXXXX 100644
105
index XXXXXXX..XXXXXXX 100644
71
--- a/arch/x86/include/asm/tdx.h
106
--- a/arch/x86/kernel/setup.c
72
+++ b/arch/x86/include/asm/tdx.h
107
+++ b/arch/x86/kernel/setup.c
73
@@ -XXX,XX +XXX,XX @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
108
@@ -XXX,XX +XXX,XX @@ void __init setup_arch(char **cmdline_p)
74
#ifdef CONFIG_INTEL_TDX_HOST
109
     *
75
bool platform_tdx_enabled(void);
110
     * Moreover, on machines with SandyBridge graphics or in setups that use
76
int tdx_enable(void);
111
     * crashkernel the entire 1M is reserved anyway.
77
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
112
+     *
78
#else    /* !CONFIG_INTEL_TDX_HOST */
113
+     * Note the host kernel TDX also requires the first 1MB being reserved.
79
static inline bool platform_tdx_enabled(void) { return false; }
114
     */
80
static inline int tdx_enable(void) { return -ENODEV; }
115
    x86_platform.realmode_reserve();
81
+static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
116
82
+        unsigned long end_pfn) { return true; }
83
#endif    /* CONFIG_INTEL_TDX_HOST */
84
85
#endif /* !__ASSEMBLY__ */
86
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
87
index XXXXXXX..XXXXXXX 100644
88
--- a/arch/x86/mm/init_64.c
89
+++ b/arch/x86/mm/init_64.c
90
@@ -XXX,XX +XXX,XX @@
91
#include <asm/uv/uv.h>
92
#include <asm/setup.h>
93
#include <asm/ftrace.h>
94
+#include <asm/tdx.h>
95
96
#include "mm_internal.h"
97
98
@@ -XXX,XX +XXX,XX @@ int arch_add_memory(int nid, u64 start, u64 size,
99
    unsigned long start_pfn = start >> PAGE_SHIFT;
100
    unsigned long nr_pages = size >> PAGE_SHIFT;
101
102
+    /*
103
+     * For now if TDX is enabled, all pages in the page allocator
104
+     * must be TDX memory, which is a fixed set of memory regions
105
+     * that are passed to the TDX module. Reject the new region
106
+     * if it is not TDX memory to guarantee above is true.
107
+     */
108
+    if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
109
+        return -EINVAL;
110
+
111
    init_memory_mapping(start, start + size, params->pgprot);
112
113
    return add_pages(nid, start_pfn, nr_pages, params);
114
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
117
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
115
index XXXXXXX..XXXXXXX 100644
118
index XXXXXXX..XXXXXXX 100644
116
--- a/arch/x86/virt/vmx/tdx/tdx.c
119
--- a/arch/x86/virt/vmx/tdx/tdx.c
117
+++ b/arch/x86/virt/vmx/tdx/tdx.c
120
+++ b/arch/x86/virt/vmx/tdx/tdx.c
118
@@ -XXX,XX +XXX,XX @@
121
@@ -XXX,XX +XXX,XX @@
119
#include <linux/smp.h>
122
#include <linux/mutex.h>
120
#include <linux/atomic.h>
123
#include <linux/slab.h>
121
#include <linux/align.h>
124
#include <linux/math.h>
122
+#include <linux/list.h>
125
+#include <linux/list.h>
123
+#include <linux/slab.h>
124
+#include <linux/memblock.h>
126
+#include <linux/memblock.h>
127
+#include <linux/memory.h>
125
+#include <linux/minmax.h>
128
+#include <linux/minmax.h>
126
+#include <linux/sizes.h>
129
+#include <linux/sizes.h>
130
+#include <linux/pfn.h>
127
#include <asm/msr-index.h>
131
#include <asm/msr-index.h>
128
#include <asm/msr.h>
132
#include <asm/msr.h>
129
#include <asm/apic.h>
133
#include <asm/page.h>
130
@@ -XXX,XX +XXX,XX @@ enum tdx_module_status_t {
134
@@ -XXX,XX +XXX,XX @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
131
    TDX_MODULE_SHUTDOWN,
135
static enum tdx_module_status_t tdx_module_status;
132
};
136
static DEFINE_MUTEX(tdx_module_lock);
133
137
134
+struct tdx_memblock {
138
+/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
135
+    struct list_head list;
136
+    unsigned long start_pfn;
137
+    unsigned long end_pfn;
138
+    int nid;
139
+};
140
+
141
static u32 tdx_keyid_start __ro_after_init;
142
static u32 tdx_keyid_num __ro_after_init;
143
144
@@ -XXX,XX +XXX,XX @@ static struct tdsysinfo_struct tdx_sysinfo;
145
static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
146
static int tdx_cmr_num;
147
148
+/* All TDX-usable memory regions */
149
+static LIST_HEAD(tdx_memlist);
139
+static LIST_HEAD(tdx_memlist);
150
+
140
+
151
/*
141
typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
152
* Detect TDX private KeyIDs to see whether TDX has been enabled by the
142
153
* BIOS. Both initializing the TDX module and running TDX guest require
143
static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
154
@@ -XXX,XX +XXX,XX @@ static int tdx_get_sysinfo(void)
144
@@ -XXX,XX +XXX,XX @@ static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo,
155
    return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
145
    return 0;
156
}
146
}
157
147
158
+/* Check whether the given pfn range is covered by any CMR or not. */
159
+static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
160
+                 unsigned long end_pfn)
161
+{
162
+    int i;
163
+
164
+    for (i = 0; i < tdx_cmr_num; i++) {
165
+        struct cmr_info *cmr = &tdx_cmr_array[i];
166
+        unsigned long cmr_start_pfn;
167
+        unsigned long cmr_end_pfn;
168
+
169
+        cmr_start_pfn = cmr->base >> PAGE_SHIFT;
170
+        cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
171
+
172
+        if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
173
+            return true;
174
+    }
175
+
176
+    return false;
177
+}
178
+
179
+/*
148
+/*
180
+ * Add a memory region on a given node as a TDX memory block. The caller
149
+ * Add a memory region as a TDX memory block. The caller must make sure
181
+ * to make sure all memory regions are added in address ascending order
150
+ * all memory regions are added in address ascending order and don't
182
+ * and don't overlap.
151
+ * overlap.
183
+ */
152
+ */
184
+static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
153
+static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
185
+             int nid)
154
+             unsigned long end_pfn)
186
+{
155
+{
187
+    struct tdx_memblock *tmb;
156
+    struct tdx_memblock *tmb;
188
+
157
+
189
+    tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
158
+    tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
190
+    if (!tmb)
159
+    if (!tmb)
191
+        return -ENOMEM;
160
+        return -ENOMEM;
192
+
161
+
193
+    INIT_LIST_HEAD(&tmb->list);
162
+    INIT_LIST_HEAD(&tmb->list);
194
+    tmb->start_pfn = start_pfn;
163
+    tmb->start_pfn = start_pfn;
195
+    tmb->end_pfn = end_pfn;
164
+    tmb->end_pfn = end_pfn;
196
+    tmb->nid = nid;
165
+
197
+
166
+    /* @tmb_list is protected by mem_hotplug_lock */
198
+    list_add_tail(&tmb->list, &tdx_memlist);
167
+    list_add_tail(&tmb->list, tmb_list);
199
+    return 0;
168
+    return 0;
200
+}
169
+}
201
+
170
+
202
+static void free_tdx_memory(void)
171
+static void free_tdx_memlist(struct list_head *tmb_list)
203
+{
172
+{
204
+    while (!list_empty(&tdx_memlist)) {
173
+    /* @tmb_list is protected by mem_hotplug_lock */
205
+        struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
174
+    while (!list_empty(tmb_list)) {
175
+        struct tdx_memblock *tmb = list_first_entry(tmb_list,
206
+                struct tdx_memblock, list);
176
+                struct tdx_memblock, list);
207
+
177
+
208
+        list_del(&tmb->list);
178
+        list_del(&tmb->list);
209
+        kfree(tmb);
179
+        kfree(tmb);
210
+    }
180
+    }
211
+}
181
+}
212
+
182
+
213
+/*
183
+/*
214
+ * Add all memblock memory regions to the @tdx_memlist as TDX memory.
184
+ * Ensure that all memblock memory regions are convertible to TDX
215
+ * Must be called when get_online_mems() is called by the caller.
185
+ * memory. Once this has been established, stash the memblock
186
+ * ranges off in a secondary structure because memblock is modified
187
+ * in memory hotplug while TDX memory regions are fixed.
216
+ */
188
+ */
217
+static int build_tdx_memory(void)
189
+static int build_tdx_memlist(struct list_head *tmb_list)
218
+{
190
+{
219
+    unsigned long start_pfn, end_pfn;
191
+    unsigned long start_pfn, end_pfn;
220
+    int i, nid, ret;
192
+    int i, ret;
221
+
193
+
222
+    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
194
+    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
223
+        /*
195
+        /*
224
+         * The first 1MB may not be reported as TDX convertible
196
+         * The first 1MB is not reported as TDX convertible memory.
225
+         * memory. Manually exclude them as TDX memory.
197
+         * Although the first 1MB is always reserved and won't end up
226
+         *
198
+         * to the page allocator, it is still in memblock's memory
227
+         * This is fine as the first 1MB is already reserved in
199
+         * regions. Skip them manually to exclude them as TDX memory.
228
+         * reserve_real_mode() and won't end up to ZONE_DMA as
229
+         * free page anyway.
230
+         */
200
+         */
231
+        start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
201
+        start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
232
+        if (start_pfn >= end_pfn)
202
+        if (start_pfn >= end_pfn)
233
+            continue;
203
+            continue;
234
+
235
+        /* Verify memory is truly TDX convertible memory */
236
+        if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
237
+            pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
238
+                    start_pfn << PAGE_SHIFT,
239
+                    end_pfn << PAGE_SHIFT);
240
+            return -EINVAL;
241
+        }
242
+
204
+
243
+        /*
205
+        /*
244
+         * Add the memory regions as TDX memory. The regions in
206
+         * Add the memory regions as TDX memory. The regions in
245
+         * memblock has already guaranteed they are in address
207
+         * memblock has already guaranteed they are in address
246
+         * ascending order and don't overlap.
208
+         * ascending order and don't overlap.
247
+         */
209
+         */
248
+        ret = add_tdx_memblock(start_pfn, end_pfn, nid);
210
+        ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
249
+        if (ret)
211
+        if (ret)
250
+            goto err;
212
+            goto err;
251
+    }
213
+    }
252
+
214
+
253
+    return 0;
215
+    return 0;
254
+err:
216
+err:
255
+    free_tdx_memory();
217
+    free_tdx_memlist(tmb_list);
256
+    return ret;
218
+    return ret;
257
+}
219
+}
258
+
220
+
259
/*
221
static int init_tdx_module(void)
260
* Detect and initialize the TDX module.
222
{
261
*
223
    struct tdsysinfo_struct *tdsysinfo;
262
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
224
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
263
    if (ret)
225
    if (ret)
264
        goto out;
226
        goto out;
265
227
266
+    /*
228
+    /*
267
+     * All memory regions that can be used by the TDX module must be
229
+     * To keep things simple, assume that all TDX-protected memory
268
+     * passed to the TDX module during the module initialization.
230
+     * will come from the page allocator. Make sure all pages in the
269
+     * Once this is done, all "TDX-usable" memory regions are fixed
231
+     * page allocator are TDX-usable memory.
270
+     * during module's runtime.
271
+     *
232
+     *
272
+     * The initial support of TDX guests only allocates memory from
233
+     * Build the list of "TDX-usable" memory regions which cover all
273
+     * the global page allocator. To keep things simple, for now
234
+     * pages in the page allocator to guarantee that. Do it while
274
+     * just make sure all pages in the page allocator are TDX memory.
235
+     * holding mem_hotplug_lock read-lock as the memory hotplug code
275
+     *
236
+     * path reads the @tdx_memlist to reject any new memory.
276
+     * To achieve this, use all system memory in the core-mm at the
277
+     * time of initializing the TDX module as TDX memory, and at the
278
+     * meantime, reject any new memory in memory hot-add.
279
+     *
280
+     * This works as in practice, all boot-time present DIMM is TDX
281
+     * convertible memory. However if any new memory is hot-added
282
+     * before initializing the TDX module, the initialization will
283
+     * fail due to that memory is not covered by CMR.
284
+     *
285
+     * This can be enhanced in the future, i.e. by allowing adding or
286
+     * onlining non-TDX memory to a separate node, in which case the
287
+     * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
288
+     * together -- the userspace/kernel just needs to make sure pages
289
+     * for TDX guests must come from those "TDX-capable" nodes.
290
+     *
291
+     * Build the list of TDX memory regions as mentioned above so
292
+     * they can be passed to the TDX module later.
293
+     */
237
+     */
294
+    get_online_mems();
238
+    get_online_mems();
295
+
239
+
296
+    ret = build_tdx_memory();
240
+    ret = build_tdx_memlist(&tdx_memlist);
297
+    if (ret)
241
+    if (ret)
298
+        goto out;
242
+        goto out_put_tdxmem;
243
+
299
    /*
244
    /*
300
     * Return -EINVAL until all steps of TDX module initialization
245
     * TODO:
301
     * process are done.
246
     *
247
-     * - Build the list of TDX-usable memory regions.
248
     * - Construct a list of "TD Memory Regions" (TDMRs) to cover
249
     * all TDX-usable memory regions.
250
     * - Configure the TDMRs and the global KeyID to the TDX module.
251
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
252
     * Return error before all steps are done.
302
     */
253
     */
303
    ret = -EINVAL;
254
    ret = -EINVAL;
255
+out_put_tdxmem:
256
+    /*
257
+     * @tdx_memlist is written here and read at memory hotplug time.
258
+     * Lock out memory hotplug code while building it.
259
+     */
260
+    put_online_mems();
304
out:
261
out:
305
+    /*
262
    /*
306
+     * Memory hotplug checks the hot-added memory region against the
263
     * For now both @sysinfo and @cmr_array are only used during
307
+     * @tdx_memlist to see if the region is TDX memory.
264
@@ -XXX,XX +XXX,XX @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
308
+     *
265
    return 0;
309
+     * Do put_online_mems() here to make sure any modification to
310
+     * @tdx_memlist is done while holding the memory hotplug read
311
+     * lock, so that the memory hotplug path can just check the
312
+     * @tdx_memlist w/o holding the @tdx_module_lock which may cause
313
+     * deadlock.
314
+     */
315
+    put_online_mems();
316
    return ret;
317
}
266
}
318
267
319
@@ -XXX,XX +XXX,XX @@ int tdx_enable(void)
268
+static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
320
    return ret;
321
}
322
EXPORT_SYMBOL_GPL(tdx_enable);
323
+
324
+/*
325
+ * Check whether the given range is TDX memory. Must be called between
326
+ * mem_hotplug_begin()/mem_hotplug_done().
327
+ */
328
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
329
+{
269
+{
330
+    struct tdx_memblock *tmb;
270
+    struct tdx_memblock *tmb;
331
+
271
+
332
+    /* Empty list means TDX isn't enabled successfully */
272
+    /*
333
+    if (list_empty(&tdx_memlist))
273
+     * This check assumes that the start_pfn<->end_pfn range does not
334
+        return true;
274
+     * cross multiple @tdx_memlist entries. A single memory online
335
+
275
+     * event across multiple memblocks (from which @tdx_memlist
276
+     * entries are derived at the time of module initialization) is
277
+     * not possible. This is because memory offline/online is done
278
+     * on granularity of 'struct memory_block', and the hotpluggable
279
+     * memory region (one memblock) must be multiple of memory_block.
280
+     */
336
+    list_for_each_entry(tmb, &tdx_memlist, list) {
281
+    list_for_each_entry(tmb, &tdx_memlist, list) {
337
+        /*
338
+         * The new range is TDX memory if it is fully covered
339
+         * by any TDX memory block.
340
+         */
341
+        if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
282
+        if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
342
+            return true;
283
+            return true;
343
+    }
284
+    }
344
+    return false;
285
+    return false;
345
+}
286
+}
287
+
288
+static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
289
+             void *v)
290
+{
291
+    struct memory_notify *mn = v;
292
+
293
+    if (action != MEM_GOING_ONLINE)
294
+        return NOTIFY_OK;
295
+
296
+    /*
297
+     * Empty list means TDX isn't enabled. Allow any memory
298
+     * to go online.
299
+     */
300
+    if (list_empty(&tdx_memlist))
301
+        return NOTIFY_OK;
302
+
303
+    /*
304
+     * The TDX memory configuration is static and can not be
305
+     * changed. Reject onlining any memory which is outside of
306
+     * the static configuration whether it supports TDX or not.
307
+     */
308
+    if (is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages))
309
+        return NOTIFY_OK;
310
+
311
+    return NOTIFY_BAD;
312
+}
313
+
314
+static struct notifier_block tdx_memory_nb = {
315
+    .notifier_call = tdx_memory_notifier,
316
+};
317
+
318
static int __init tdx_init(void)
319
{
320
    u32 tdx_keyid_start, nr_tdx_keyids;
321
@@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void)
322
        return -ENODEV;
323
    }
324
325
+    err = register_memory_notifier(&tdx_memory_nb);
326
+    if (err) {
327
+        pr_err("initialization failed: register_memory_notifier() failed (%d)\n",
328
+                err);
329
+        return -ENODEV;
330
+    }
331
+
332
    /*
333
     * Just use the first TDX KeyID as the 'global KeyID' and
334
     * leave the rest for TDX guests.
335
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
336
index XXXXXXX..XXXXXXX 100644
337
--- a/arch/x86/virt/vmx/tdx/tdx.h
338
+++ b/arch/x86/virt/vmx/tdx/tdx.h
339
@@ -XXX,XX +XXX,XX @@ enum tdx_module_status_t {
340
    TDX_MODULE_ERROR
341
};
342
343
+struct tdx_memblock {
344
+    struct list_head list;
345
+    unsigned long start_pfn;
346
+    unsigned long end_pfn;
347
+};
348
+
349
#endif
346
--
350
--
347
2.38.1
351
2.41.0
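
The containment rule enforced by tdx_memory_notifier() above is easy to
check in isolation. Below is a minimal userspace sketch (not part of the
patch) of the same check: the struct is a simplified stand-in for
'struct tdx_memblock' and the two region values are made up for
illustration.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for 'struct tdx_memblock'; PFN values are made up. */
struct range { unsigned long start_pfn, end_pfn; };

static const struct range tdx_memlist[] = {
    { 0x100, 0x80000 },     /* hypothetical boot-time region */
    { 0x100000, 0x180000 }, /* hypothetical region on a second node */
};

/* Same rule as is_tdx_memory(): the range must sit fully inside one entry. */
static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
{
    for (unsigned int i = 0; i < sizeof(tdx_memlist) / sizeof(tdx_memlist[0]); i++) {
        if (start_pfn >= tdx_memlist[i].start_pfn &&
            end_pfn <= tdx_memlist[i].end_pfn)
            return true;
    }
    return false;
}

int main(void)
{
    /* An online event inside the first region -> allowed (NOTIFY_OK) */
    printf("%d\n", is_tdx_memory(0x1000, 0x9000));      /* prints 1 */
    /* A hot-added DIMM outside both regions -> rejected (NOTIFY_BAD) */
    printf("%d\n", is_tdx_memory(0x200000, 0x208000));  /* prints 0 */
    return 0;
}

Because the check only rejects ranges at MEM_GOING_ONLINE time, an empty
list (TDX not initialized) never blocks hotplug, as the notifier's early
list_empty() return makes explicit.
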
After the kernel selects all TDX-usable memory regions, the kernel needs
to pass those regions to the TDX module via the data structure "TD Memory
Region" (TDMR).

Add a placeholder to construct a list of TDMRs (in multiple steps) to
cover all TDX-usable memory regions.

=== Long Version ===

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums. Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR). During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees. The list of these
ranges is available to the kernel by querying the TDX module.

The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory. This metadata essentially
serves as the 'struct page' for the TDX module. The space for this
metadata is not reserved by the hardware up front and must be allocated
...
CMR - Firmware-enumerated physical ranges that support TDX. CMRs are
 4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
 TDX. 1G granularity and alignment required. Each TDMR has
 reserved areas where TDX memory holes and overlapping PAMTs can
 be represented.
PAMT - Physically contiguous TDX metadata. One table for each page size
 per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G
 PAMT.

As one step of initializing the TDX module, the kernel configures
TDX-usable memory regions by passing a list of TDMRs to the TDX module.

Constructing the list of TDMRs consists of the following steps:

1) Fill out TDMRs to cover all memory regions that the TDX module will
 use for TD memory.
2) Allocate and set up PAMT for each TDMR.
3) Designate reserved areas for each TDMR.

Add a placeholder to construct TDMRs to do the above steps. To keep
things simple, just allocate enough space to hold the maximum number of
TDMRs up front. Always free the buffer of TDMRs since they are only
used during module initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- No change.

v12 -> v13:
- No change.

v11 -> v12:
- Added tags from Dave/Kirill.

v10 -> v11:
- Changed to keep TDMRs after module initialization to deal with TDX
 erratum in future patches.

v9 -> v10:
- Changed the TDMR list from static variable back to local variable as
 now TDX module isn't disabled when tdx_cpu_enable() fails.

v8 -> v9:
- Changes around 'struct tdmr_info_list' (Dave):
 - Moved the declaration from tdx.c to tdx.h.
 - Renamed 'first_tdmr' to 'tdmrs'.
 - 'nr_tdmrs' -> 'nr_consumed_tdmrs'.
 - Changed 'tdmrs' to 'void *'.
 - Improved comments for all structure members.
- Added a missing empty line in alloc_tdmr_list() (Dave).

v7 -> v8:
- Improved changelog to tell this is one step of the "TODO list" in
 init_tdx_module().
- Other changelog improvements suggested by Dave (with "Create TDMRs" to
 "Fill out TDMRs" to align with the code).
- Added a "TODO list" comment to lay out the steps to construct TDMRs,
 following the same idea of the "TODO list" in tdx_module_init().
- Introduced 'struct tdmr_info_list' (Dave)
 - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to
 simplify getting a TDMR by a given index, and reduce passing arguments
 around functions.
 - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally
 use tdmr_size_single() (Dave).
 - tdmr_num -> nr_tdmrs (Dave).

v6 -> v7:
- Improved commit message to explain 'int' overflow cannot happen
 in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave.

v5 -> v6:
- construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is
 used instead of memblock.
- Added Isaku's Reviewed-by.

v3 -> v5 (no feedback on v4):
- Moved calculating TDMR size to this patch.
- Changed to use alloc_pages_exact() to allocate the buffer for all TDMRs
 once, instead of allocating each TDMR individually.
- Removed "crypto protection" in the changelog.
- -EFAULT -> -EINVAL in a couple of places.

---
 arch/x86/virt/vmx/tdx/tdx.c | 97 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 32 ++++++++++++
 2 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
 #include <linux/minmax.h>
 #include <linux/sizes.h>
 #include <linux/pfn.h>
+#include <linux/align.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/page.h>
@@ -XXX,XX +XXX,XX @@ static int build_tdx_memlist(struct list_head *tmb_list)
    return ret;
}

+/* Calculate the actual TDMR size */
+static int tdmr_size_single(u16 max_reserved_per_tdmr)
+{
+    int tdmr_sz;
+
+    /*
+     * The actual size of TDMR depends on the maximum
+     * number of reserved areas.
+     */
+    tdmr_sz = sizeof(struct tdmr_info);
+    tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr;
+
+    return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+}
+
+static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
+             struct tdsysinfo_struct *sysinfo)
+{
+    size_t tdmr_sz, tdmr_array_sz;
+    void *tdmr_array;
+
+    tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr);
+    tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs;
+
+    /*
+     * To keep things simple, allocate all TDMRs together.
+     * The buffer needs to be physically contiguous to make
+     * sure each TDMR is physically contiguous.
+     */
+    tdmr_array = alloc_pages_exact(tdmr_array_sz,
+            GFP_KERNEL | __GFP_ZERO);
+    if (!tdmr_array)
+        return -ENOMEM;
+
+    tdmr_list->tdmrs = tdmr_array;
+
+    /*
+     * Keep the size of one TDMR to find the target TDMR
+     * at a given index in the TDMR list.
+     */
+    tdmr_list->tdmr_sz = tdmr_sz;
+    tdmr_list->max_tdmrs = sysinfo->max_tdmrs;
+    tdmr_list->nr_consumed_tdmrs = 0;
+
+    return 0;
+}
+
+static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
+{
+    free_pages_exact(tdmr_list->tdmrs,
+            tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
+}
+
+/*
+ * Construct a list of TDMRs on the preallocated space in @tdmr_list
+ * to cover all TDX memory regions in @tmb_list based on the TDX module
+ * information in @sysinfo.
+ */
+static int construct_tdmrs(struct list_head *tmb_list,
+             struct tdmr_info_list *tdmr_list,
+             struct tdsysinfo_struct *sysinfo)
+{
+    /*
+     * TODO:
+     *
+     * - Fill out TDMRs to cover all TDX memory regions.
+     * - Allocate and set up PAMTs for each TDMR.
+     * - Designate reserved areas for each TDMR.
+     *
+     * Return -EINVAL until constructing TDMRs is done.
+     */
+    return -EINVAL;
+}
+
static int init_tdx_module(void)
{
    struct tdsysinfo_struct *tdsysinfo;
+    struct tdmr_info_list tdmr_list;
    struct cmr_info *cmr_array;
    int tdsysinfo_size;
    int cmr_array_size;
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
    if (ret)
        goto out_put_tdxmem;

+    /* Allocate enough space for constructing TDMRs */
+    ret = alloc_tdmr_list(&tdmr_list, tdsysinfo);
+    if (ret)
+        goto out_free_tdxmem;
+
+    /* Cover all TDX-usable memory regions in TDMRs */
+    ret = construct_tdmrs(&tdx_memlist, &tdmr_list, tdsysinfo);
+    if (ret)
+        goto out_free_tdmrs;
+
    /*
     * TODO:
     *
-     * - Construct a list of "TD Memory Regions" (TDMRs) to cover
-     * all TDX-usable memory regions.
     * - Configure the TDMRs and the global KeyID to the TDX module.
     * - Configure the global KeyID on all packages.
     * - Initialize all TDMRs.
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
     * Return error before all steps are done.
     */
    ret = -EINVAL;
+out_free_tdmrs:
+    /*
+     * Always free the buffer of TDMRs as they are only used during
+     * module initialization.
+     */
+    free_tdmr_list(&tdmr_list);
+out_free_tdxmem:
+    if (ret)
+        free_tdx_memlist(&tdx_memlist);
out_put_tdxmem:
    /*
     * @tdx_memlist is written here and read at memory hotplug time.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@ struct tdsysinfo_struct {
    DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
} __packed;

+struct tdmr_reserved_area {
+    u64 offset;
+    u64 size;
+} __packed;
...
+    u64 pamt_4k_size;
+    /*
+     * Actual number of reserved areas depends on
+     * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+     */
+    DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
/*
 * Do not put any hardware-defined TDX structure representations below
 * this comment!
@@ -XXX,XX +XXX,XX @@ struct tdx_memblock {
    unsigned long end_pfn;
};

+struct tdmr_info_list {
+    void *tdmrs;    /* Flexible array to hold 'tdmr_info's */
+    int nr_consumed_tdmrs;    /* How many 'tdmr_info's are in use */
+
+    /* Metadata for finding target 'tdmr_info' and freeing @tdmrs */
+    int tdmr_sz;    /* Size of one 'tdmr_info' */
+    int max_tdmrs;    /* How many 'tdmr_info's are allocated */
+};
+
#endif
--
2.41.0
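
For a sense of the sizes involved in alloc_tdmr_list() above: with the
TDX 1.0 values mentioned in an earlier revision of this patch (16
reserved areas per TDMR, 512-byte TDMR_INFO alignment, 64 TDMRs), the
physically contiguous buffer stays small. A standalone sketch of the
arithmetic, assuming a 64-byte fixed header for 'struct tdmr_info' and
16-byte reserved-area entries (the real max_tdmrs and
max_reserved_per_tdmr come from the module's 'tdsysinfo_struct'):

#include <stdio.h>
#include <stdint.h>

#define TDMR_INFO_ALIGNMENT 512ULL
#define ALIGN(x, a)         (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
    /* Assumed TDX 1.0 numbers, for illustration only. */
    uint64_t hdr = 64, rsvd_entry = 16, max_rsvd = 16, max_tdmrs = 64;

    /* 64 + 16*16 = 320 bytes, rounded up to the 512-byte boundary */
    uint64_t tdmr_sz = ALIGN(hdr + rsvd_entry * max_rsvd, TDMR_INFO_ALIGNMENT);

    printf("one TDMR_INFO: %llu bytes\n",       /* 512 */
           (unsigned long long)tdmr_sz);
    printf("whole TDMR buffer: %llu bytes\n",   /* 32768, i.e. 32KB */
           (unsigned long long)(tdmr_sz * max_tdmrs));
    return 0;
}

Since every per-TDMR slot is rounded up to TDMR_INFO_ALIGNMENT, indexing
the buffer with 'tdmr_sz * idx' (as tdmr_entry() does in the next patch)
also keeps each entry 512-byte aligned.
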
Start to work through the multiple steps needed to construct a list of
"TD Memory Regions" (TDMRs) covering all TDX-usable memory regions.

The kernel configures TDX-usable memory regions by passing a list of
"TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the
information of the base/size of a memory region, the base/size of the
associated Physical Address Metadata Table (PAMT) and a list of reserved
areas in the region.

Do the first step to fill out a number of TDMRs to cover all TDX memory
regions. To keep it simple, always try to use one TDMR for each memory
region. As the first step, only set up the base/size for each TDMR.

Each TDMR must be 1G aligned and the size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions. If a
memory region spans the 1GB boundary and the former part is already
covered by the previous TDMR, just use a new TDMR for the remaining
part.

TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs
are consumed but there are more memory regions to cover.

There are fancier things that could be done like trying to merge
adjacent TDMRs. This would allow more pathological memory layouts to be
supported. But, current systems are not even close to exhausting the
existing TDMR resources in practice. For now, keep it simple.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---

v13 -> v14:
- No change.

v12 -> v13:
- Added Yuan's tag.

v11 -> v12:
- Improved comments around looping over TDX memblocks to create TDMRs.
 (Dave).
- Added code to pr_warn() when consumed TDMRs reach the maximum TDMRs
 (Dave).
- BIT_ULL(30) -> SZ_1G (Kirill)
- Removed unused TDMR_PFN_ALIGNMENT (Sathy)
- Added tags from Kirill/Sathy

v10 -> v11:
- No update.

v9 -> v10:
- No change.

v8 -> v9:
- Added the last paragraph in the changelog (Dave).
- Removed unnecessary type cast in tdmr_entry() (Dave).

v6 -> v7:
- No change.

v5 -> v6:
- Rebase due to using 'tdx_memblock' instead of memblock.

v3 -> v5 (no feedback on v4):
- Removed allocating TDMRs individually.
- Improved changelog by using Dave's words.
- Made TDMR_START() and TDMR_END() static inline functions.

---
 arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 3 ++
 2 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
            tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
}

+/* Get the TDMR from the list at the given index. */
+static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
+                 int idx)
+{
+    int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
+
+    return (void *)tdmr_list->tdmrs + tdmr_info_offset;
+}
+
+#define TDMR_ALIGNMENT        SZ_1G
+#define TDMR_ALIGN_DOWN(_addr)    ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr)    ALIGN((_addr), TDMR_ALIGNMENT)
+
+static inline u64 tdmr_end(struct tdmr_info *tdmr)
+{
+    return tdmr->base + tdmr->size;
+}
+
+/*
+ * Take the memory referenced in @tmb_list and populate the
+ * preallocated @tdmr_list, following all the special alignment
+ * and size rules for TDMR.
+ */
+static int fill_out_tdmrs(struct list_head *tmb_list,
+             struct tdmr_info_list *tdmr_list)
+{
+    struct tdx_memblock *tmb;
+    int tdmr_idx = 0;
+
+    /*
+     * Loop over TDX memory regions and fill out TDMRs to cover them.
+     * To keep it simple, always try to use one TDMR to cover one
+     * memory region.
+     *
+     * In practice TDX supports at least 64 TDMRs. A 2-socket system
+     * typically consumes fewer than 10 of those. This code is
+     * dumb and simple and may use more TDMRs than is strictly
+     * required.
+     */
+    list_for_each_entry(tmb, tmb_list, list) {
+        struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+        u64 start, end;
+
+        start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
+        end = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));
+
+        /*
+         * A valid size indicates the current TDMR has already
+         * been filled out to cover the previous memory region(s).
+         */
+        if (tdmr->size) {
+            /*
+             * Loop to the next if the current memory region
+             * has already been fully covered.
+             */
+            if (end <= tdmr_end(tdmr))
+                continue;
+
+            /* Otherwise, skip the already covered part. */
+            if (start < tdmr_end(tdmr))
+                start = tdmr_end(tdmr);
+
+            /*
+             * Create a new TDMR to cover the current memory
+             * region, or the remaining part of it.
+             */
+            tdmr_idx++;
+            if (tdmr_idx >= tdmr_list->max_tdmrs) {
+                pr_warn("initialization failed: TDMRs exhausted.\n");
+                return -ENOSPC;
+            }
+
+            tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+        }
+
+        tdmr->base = start;
+        tdmr->size = end - start;
+    }
+
+    /* @tdmr_idx is always the index of the last valid TDMR. */
+    tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1;
+
+    /*
+     * Warn early that the kernel is about to run out of TDMRs.
+     *
+     * This is an indication that TDMR allocation has to be
+     * reworked to be smarter to not run into an issue.
+     */
+    if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN)
+        pr_warn("consumed TDMRs reaching limit: %d used out of %d\n",
+                tdmr_list->nr_consumed_tdmrs,
+                tdmr_list->max_tdmrs);
+
+    return 0;
+}
+
/*
 * Construct a list of TDMRs on the preallocated space in @tdmr_list
 * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list,
             struct tdmr_info_list *tdmr_list,
             struct tdsysinfo_struct *sysinfo)
{
+    int ret;
+
+    ret = fill_out_tdmrs(tmb_list, tdmr_list);
+    if (ret)
+        return ret;
+
    /*
     * TODO:
     *
-     * - Fill out TDMRs to cover all TDX memory regions.
     * - Allocate and set up PAMTs for each TDMR.
     * - Designate reserved areas for each TDMR.
     *
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@ struct tdx_memblock {
    unsigned long end_pfn;
};

+/* Warn if the kernel has less than TDMR_NR_WARN TDMRs left after allocation */
+#define TDMR_NR_WARN 4
+
struct tdmr_info_list {
    void *tdmrs;    /* Flexible array to hold 'tdmr_info's */
    int nr_consumed_tdmrs;    /* How many 'tdmr_info's are in use */
--
2.41.0
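
The 1G alignment rules described above can be worked through with
concrete numbers. A standalone sketch (mirroring TDMR_ALIGN_DOWN()/
TDMR_ALIGN_UP(); the region addresses are made up) shows one TDMR
absorbing a neighboring region after rounding:

#include <stdio.h>
#include <stdint.h>

#define SZ_1G            0x40000000ULL
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))
#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
    /* Hypothetical physical memory regions, for illustration only. */
    uint64_t r1_start = 0x00000000ULL, r1_end = 0x60000000ULL; /* 0 - 1.5G  */
    uint64_t r2_start = 0x70000000ULL, r2_end = 0x80000000ULL; /* 1.75G - 2G */

    /* TDMR for region 1 after 1G rounding: [0, 2G) */
    uint64_t t1_base = ALIGN_DOWN(r1_start, SZ_1G);
    uint64_t t1_end  = ALIGN_UP(r1_end, SZ_1G);

    /* Region 2 rounds to [1G, 2G): already inside the first TDMR */
    uint64_t t2_base = ALIGN_DOWN(r2_start, SZ_1G);
    uint64_t t2_end  = ALIGN_UP(r2_end, SZ_1G);

    printf("TDMR1: [0x%llx, 0x%llx)\n",
           (unsigned long long)t1_base, (unsigned long long)t1_end);
    printf("region 2 rounded: [0x%llx, 0x%llx), covered by TDMR1: %d\n",
           (unsigned long long)t2_base, (unsigned long long)t2_end,
           t2_base >= t1_base && t2_end <= t1_end);
    return 0;
}

This is exactly the case the "if (tdmr->size)" branch of fill_out_tdmrs()
handles: the second region rounds into space the previous TDMR already
covers, so no new TDMR is consumed for it.
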
The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory. This metadata, referred to as the
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module. PAMTs are not reserved by hardware
up front. They must be allocated by the kernel and then given to the
TDX module during module initialization.

TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
...
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is alloc_contig_pages() may fail at runtime. One (bad)
mitigation is to launch a TDX guest early during system boot to get
those PAMTs allocated early, but the only real fix is to add a boot
option to allocate or reserve PAMTs during kernel boot.

It is imperfect but will be improved on later.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR. If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.
...
the total number of reserved areas consumed for PAMTs.
- Try to first allocate PAMT from the local node of the TDMR for better
 NUMA locality.

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully. This helps answer the eternal "where did
all my memory go?" questions.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---

v13 -> v14:
- No change.

v12 -> v13:
- Added Kirill's and Yuan's tags.
- Removed unintended space. (Yuan)

v11 -> v12:
- Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill)
- "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill)
- Changed tdmr_get_pamt() to return base and size instead of base_pfn
 and npages and related code directly (Dave).
- Simplified PAMT kb counting. (Dave)
- tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave)

v10 -> v11:
- No update.

v9 -> v10:
- Removed code change in disable_tdx_module() as it doesn't exist
 anymore.

v8 -> v9:
- Added TDX_PS_NR macro instead of open-coding (Dave).
- Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave).
- Changed to print out PAMTs in "KBs" instead of "pages" (Dave).
- Added Dave's Reviewed-by.

v7 -> v8: (Dave)
- Changelog:
 - Added a sentence to state PAMT allocation will be improved.
 - Others suggested by Dave.
- Moved 'nid' of 'struct tdx_memblock' to this patch.
- Improved comments around tdmr_get_nid().
- WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
- Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
- Changes due to using macros instead of 'enum' for TDX supported page
 sizes.
...
- Improved comment around tdmr_get_nid() (Dave).
- Improved comment in tdmr_set_up_pamt() around breaking the PAMT
 into PAMTs for 4K/2M/1G (Dave).
- tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).

v3 -> v5 (no feedback on v4):
- Used memblock to get the NUMA node for a given TDMR.
- Removed the tdmr_get_pamt_sz() helper and open-coded it instead.
- Changed to use 'switch .. case ..' for each TDX supported page size in
 tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
- Added printing out memory used for PAMT allocation when the TDX module
 is initialized successfully.
- Explained the downside of alloc_contig_pages() in the changelog.
- Addressed other minor comments.

---
 arch/x86/Kconfig | 1 +
 arch/x86/include/asm/shared/tdx.h | 1 +
 arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 1 +
 4 files changed, 213 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
...
    select ARCH_KEEP_MEMBLOCK
+    depends on CONTIG_ALLOC
    help
     Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
     host and certain physical attacks. This option enables necessary TDX
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -XXX,XX +XXX,XX @@
#define TDX_PS_4K    0
#define TDX_PS_2M    1
#define TDX_PS_1G    2
+#define TDX_PS_NR    (TDX_PS_1G + 1)

#ifndef __ASSEMBLY__

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo,
 * overlap.
 */
static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
-             unsigned long end_pfn)
+             unsigned long end_pfn, int nid)
{
    struct tdx_memblock *tmb;

@@ -XXX,XX +XXX,XX @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
    INIT_LIST_HEAD(&tmb->list);
    tmb->start_pfn = start_pfn;
    tmb->end_pfn = end_pfn;
+    tmb->nid = nid;

    /* @tmb_list is protected by mem_hotplug_lock */
    list_add_tail(&tmb->list, tmb_list);
@@ -XXX,XX +XXX,XX @@ static void free_tdx_memlist(struct list_head *tmb_list)
static int build_tdx_memlist(struct list_head *tmb_list)
{
    unsigned long start_pfn, end_pfn;
-    int i, ret;
+    int i, nid, ret;

-    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
        /*
         * The first 1MB is not reported as TDX convertible memory.
         * Although the first 1MB is always reserved and won't end up
@@ -XXX,XX +XXX,XX @@ static int build_tdx_memlist(struct list_head *tmb_list)
         * memblock has already guaranteed they are in address
         * ascending order and don't overlap.
         */
-        ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+        ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
        if (ret)
            goto err;
    }
@@ -XXX,XX +XXX,XX @@ static int fill_out_tdmrs(struct list_head *tmb_list,
    return 0;
}

+/*
+ * Calculate the PAMT size given a TDMR and a page size. The returned
+ * PAMT size is always aligned up to the 4K page boundary.
+ */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
+                 u16 pamt_entry_size)
+{
+    unsigned long pamt_sz, nr_pamt_entries;
+
+    switch (pgsz) {
+    case TDX_PS_4K:
...
+    default:
+        WARN_ON_ONCE(1);
+        return 0;
+    }
+
+    pamt_sz = nr_pamt_entries * pamt_entry_size;
+    /* TDX requires the PAMT size to be 4K aligned */
+    pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+    return pamt_sz;
+}
+
+/*
+ * Locate a NUMA node which should hold the allocation of the @tdmr
+ * PAMT. This node will have some memory covered by the TDMR. The
+ * relative amount of memory covered is not considered.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list)
+{
+    struct tdx_memblock *tmb;
+
+    /*
+     * A TDMR must cover at least part of one TMB. That TMB will end
+     * after the TDMR begins. But, that TMB may have started before
+     * the TDMR. Find the next 'tmb' that _ends_ after this TDMR
+     * begins. Ignore 'tmb' start addresses. They are irrelevant.
+     */
+    list_for_each_entry(tmb, tmb_list, list) {
+        if (tmb->end_pfn > PHYS_PFN(tdmr->base))
+            return tmb->nid;
+    }
+
+    /*
+     * Fall back to allocating the TDMR's metadata from node 0 when
+     * no TDX memory block can be found. This should never happen
+     * since TDMRs originate from TDX memory blocks.
+     */
+    pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, falling back to node 0.\n",
+            tdmr->base, tdmr_end(tdmr));
+    return 0;
+}
+
+/*
+ * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
+ * within @tdmr, and set up PAMTs for @tdmr.
+ */
+static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
+             struct list_head *tmb_list,
+             u16 pamt_entry_size)
+{
+    unsigned long pamt_base[TDX_PS_NR];
+    unsigned long pamt_size[TDX_PS_NR];
+    unsigned long tdmr_pamt_base;
+    unsigned long tdmr_pamt_size;
+    struct page *pamt;
+    int pgsz, nid;
+
+    nid = tdmr_get_nid(tdmr, tmb_list);
+
+    /*
+     * Calculate the PAMT size for each TDX supported page size
+     * and the total PAMT size.
+     */
+    tdmr_pamt_size = 0;
+    for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
+        pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
+                    pamt_entry_size);
+        tdmr_pamt_size += pamt_size[pgsz];
+    }
+
+    /*
+     * Allocate one chunk of physically contiguous memory for all
...
+    /*
+     * Break the contiguous allocation back up into the
+     * individual PAMTs for each page size.
+     */
+    tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+    for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
+        pamt_base[pgsz] = tdmr_pamt_base;
+        tdmr_pamt_base += pamt_size[pgsz];
+    }
+
+    tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
...
+    tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
+
+    return 0;
+}
+
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base,
+             unsigned long *pamt_size)
+{
+    unsigned long pamt_bs, pamt_sz;
+
+    /*
+     * The PAMT was allocated in one contiguous unit. The 4K PAMT
+     * should always point to the beginning of that allocation.
+     */
+    pamt_bs = tdmr->pamt_4k_base;
+    pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+    WARN_ON_ONCE((pamt_bs & ~PAGE_MASK) || (pamt_sz & ~PAGE_MASK));
+
+    *pamt_base = pamt_bs;
+    *pamt_size = pamt_sz;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+    unsigned long pamt_base, pamt_size;
+
+    tdmr_get_pamt(tdmr, &pamt_base, &pamt_size);
+
+    /* Do nothing if PAMT hasn't been allocated for this TDMR */
+    if (!pamt_size)
+        return;
+
+    if (WARN_ON_ONCE(!pamt_base))
+        return;
+
+    free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
+{
+    int i;
+
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
+        tdmr_free_pamt(tdmr_entry(tdmr_list, i));
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
+                 struct list_head *tmb_list,
+                 u16 pamt_entry_size)
+{
+    int i, ret = 0;
+
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+        ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
+                pamt_entry_size);
+        if (ret)
+            goto err;
+    }
+
+    return 0;
+err:
+    tdmrs_free_pamt_all(tdmr_list);
+    return ret;
+}
+
+static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
+{
+    unsigned long pamt_size = 0;
+    int i;
+
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+        unsigned long base, size;
+
+        tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+        pamt_size += size;
+    }
+
+    return pamt_size / 1024;
+}
+
/*
 * Construct a list of TDMRs on the preallocated space in @tdmr_list
* The actual number of TDMRs is kept to @tdmr_num.
372
* to cover all TDX memory regions in @tmb_list based on the TDX module
274
@@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
373
@@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list,
275
    if (ret)
374
    if (ret)
276
        goto err;
375
        return ret;
277
376
278
+    ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num);
377
+    ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
378
+            sysinfo->pamt_entry_size);
279
+    if (ret)
379
+    if (ret)
280
+        goto err;
380
+        return ret;
281
+
381
    /*
282
    /* Return -EINVAL until constructing TDMRs is done */
382
     * TODO:
283
    ret = -EINVAL;
383
     *
284
+    tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
384
-     * - Allocate and set up PAMTs for each TDMR.
285
err:
385
     * - Designate reserved areas for each TDMR.
286
    return ret;
386
     *
287
}
387
     * Return -EINVAL until constructing TDMRs is done
288
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
388
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
289
     * process are done.
389
     * Return error before all steps are done.
290
     */
390
     */
291
    ret = -EINVAL;
391
    ret = -EINVAL;
292
+    if (ret)
392
+    if (ret)
293
+        tdmrs_free_pamt_all(tdmr_array, tdmr_num);
393
+        tdmrs_free_pamt_all(&tdmr_list);
294
+    else
394
+    else
295
+        pr_info("%lu pages allocated for PAMT.\n",
395
+        pr_info("%lu KBs allocated for PAMT\n",
296
+                tdmrs_count_pamt_pages(tdmr_array, tdmr_num));
396
+                tdmrs_count_pamt_kb(&tdmr_list));
297
out_free_tdmrs:
397
out_free_tdmrs:
298
    /*
398
    /*
299
     * The array of TDMRs is freed no matter the initialization is
399
     * Always free the buffer of TDMRs as they are only used during
400
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
401
index XXXXXXX..XXXXXXX 100644
402
--- a/arch/x86/virt/vmx/tdx/tdx.h
403
+++ b/arch/x86/virt/vmx/tdx/tdx.h
404
@@ -XXX,XX +XXX,XX @@ struct tdx_memblock {
405
    struct list_head list;
406
    unsigned long start_pfn;
407
    unsigned long end_pfn;
408
+    int nid;
409
};
410
411
/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
300
--
412
--
301
2.38.1
413
2.41.0
diff view generated by jsdifflib
1
As the last step of constructing TDMRs, set up reserved areas for all
1
As the last step of constructing TDMRs, populate reserved areas for all
2
TDMRs. For each TDMR, put all memory holes within this TDMR to the
2
TDMRs. For each TDMR, put all memory holes within this TDMR to the
3
reserved areas. And for all PAMTs which overlap with this TDMR, put
3
reserved areas. And for all PAMTs which overlap with this TDMR, put
4
all the overlapping parts to reserved areas too.
4
all the overlapping parts to reserved areas too.
5
5
6
Signed-off-by: Kai Huang <kai.huang@intel.com>
6
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
7
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
7
Signed-off-by: Kai Huang <kai.huang@intel.com>
8
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
9
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
8
---
10
---
9
11
10
v6 -> v7:
12
v13 -> v14:
13
- No change
14
15
v12 -> v13:
16
- Added Yuan's tag.
17
18
v11 -> v12:
19
- Code change due to tdmr_get_pamt() change from returning pfn/npages to
20
base/size
21
- Added Kirill's tag
22
23
v10 -> v11:
24
- No update
25
26
v9 -> v10:
11
- No change.
27
- No change.
12
28
13
v5 -> v6:
29
v8 -> v9:
14
- Rebase due to using 'tdx_memblock' instead of memblock.
30
- Added comment around 'tdmr_add_rsvd_area()' to point out it doesn't do
15
- Split tdmr_set_up_rsvd_areas() into two functions to handle memory
31
optimization to save reserved areas. (Dave).
16
hole and PAMT respectively.
32
17
- Added Isaku's Reviewed-by.
33
v7 -> v8: (Dave)
18
34
- "set_up" -> "populate" in function name change (Dave).
35
- Improved comment suggested by Dave.
36
- Other changes due to 'struct tdmr_info_list'.
19
37
20
---
38
---
21
arch/x86/virt/vmx/tdx/tdx.c | 190 +++++++++++++++++++++++++++++++++++-
39
arch/x86/virt/vmx/tdx/tdx.c | 217 ++++++++++++++++++++++++++++++++++--
22
1 file changed, 188 insertions(+), 2 deletions(-)
40
1 file changed, 209 insertions(+), 8 deletions(-)
23
41
24
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
42
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
25
index XXXXXXX..XXXXXXX 100644
43
index XXXXXXX..XXXXXXX 100644
26
--- a/arch/x86/virt/vmx/tdx/tdx.c
44
--- a/arch/x86/virt/vmx/tdx/tdx.c
27
+++ b/arch/x86/virt/vmx/tdx/tdx.c
45
+++ b/arch/x86/virt/vmx/tdx/tdx.c
28
@@ -XXX,XX +XXX,XX @@
46
@@ -XXX,XX +XXX,XX @@
29
#include <linux/memblock.h>
30
#include <linux/minmax.h>
31
#include <linux/sizes.h>
47
#include <linux/sizes.h>
48
#include <linux/pfn.h>
49
#include <linux/align.h>
32
+#include <linux/sort.h>
50
+#include <linux/sort.h>
33
#include <asm/msr-index.h>
51
#include <asm/msr-index.h>
34
#include <asm/msr.h>
52
#include <asm/msr.h>
35
#include <asm/apic.h>
53
#include <asm/page.h>
36
@@ -XXX,XX +XXX,XX @@ static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array,
54
@@ -XXX,XX +XXX,XX @@ static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
37
    return pamt_npages;
55
    return pamt_size / 1024;
38
}
56
}
39
57
40
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx,
58
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
41
+             u64 addr, u64 size)
59
+             u64 size, u16 max_reserved_per_tdmr)
42
+{
60
+{
43
+    struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
61
+    struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
44
+    int idx = *p_idx;
62
+    int idx = *p_idx;
45
+
63
+
46
+    /* Reserved area must be 4K aligned in offset and size */
64
+    /* Reserved area must be 4K aligned in offset and size */
47
+    if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
65
+    if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
48
+        return -EINVAL;
66
+        return -EINVAL;
49
+
67
+
50
+    /* Cannot exceed maximum reserved areas supported by TDX */
68
+    if (idx >= max_reserved_per_tdmr) {
51
+    if (idx >= tdx_sysinfo.max_reserved_per_tdmr)
69
+        pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n",
52
+        return -E2BIG;
70
+                tdmr->base, tdmr_end(tdmr));
53
+
71
+        return -ENOSPC;
72
+    }
73
+
74
+    /*
75
+     * Consume one reserved area per call. Make no effort to
76
+     * optimize or reduce the number of reserved areas which are
77
+     * consumed by contiguous reserved areas, for instance.
78
+     */
54
+    rsvd_areas[idx].offset = addr - tdmr->base;
79
+    rsvd_areas[idx].offset = addr - tdmr->base;
55
+    rsvd_areas[idx].size = size;
80
+    rsvd_areas[idx].size = size;
56
+
81
+
57
+    *p_idx = idx + 1;
82
+    *p_idx = idx + 1;
58
+
83
+
59
+    return 0;
84
+    return 0;
60
+}
85
+}
61
+
86
+
62
+static int tdmr_set_up_memory_hole_rsvd_areas(struct tdmr_info *tdmr,
87
+/*
63
+                     int *rsvd_idx)
88
+ * Go through @tmb_list to find holes between memory areas. If any of
89
+ * those holes fall within @tdmr, set up a TDMR reserved area to cover
90
+ * the hole.
91
+ */
92
+static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
93
+                 struct tdmr_info *tdmr,
94
+                 int *rsvd_idx,
95
+                 u16 max_reserved_per_tdmr)
64
+{
96
+{
65
+    struct tdx_memblock *tmb;
97
+    struct tdx_memblock *tmb;
66
+    u64 prev_end;
98
+    u64 prev_end;
67
+    int ret;
99
+    int ret;
68
+
100
+
69
+    /* Mark holes between memory regions as reserved */
101
+    /*
70
+    prev_end = tdmr_start(tdmr);
102
+     * Start looking for reserved blocks at the
71
+    list_for_each_entry(tmb, &tdx_memlist, list) {
103
+     * beginning of the TDMR.
104
+     */
105
+    prev_end = tdmr->base;
106
+    list_for_each_entry(tmb, tmb_list, list) {
72
+        u64 start, end;
107
+        u64 start, end;
73
+
108
+
74
+        start = tmb->start_pfn << PAGE_SHIFT;
109
+        start = PFN_PHYS(tmb->start_pfn);
75
+        end = tmb->end_pfn << PAGE_SHIFT;
110
+        end = PFN_PHYS(tmb->end_pfn);
76
+
111
+
77
+        /* Break if this region is after the TDMR */
112
+        /* Break if this region is after the TDMR */
78
+        if (start >= tdmr_end(tdmr))
113
+        if (start >= tdmr_end(tdmr))
79
+            break;
114
+            break;
80
+
115
+
81
+        /* Exclude regions before this TDMR */
116
+        /* Exclude regions before this TDMR */
82
+        if (end < tdmr_start(tdmr))
117
+        if (end < tdmr->base)
83
+            continue;
118
+            continue;
84
+
119
+
85
+        /*
120
+        /*
86
+         * Skip if no hole exists before this region. "<=" is
121
+         * Skip over memory areas that
87
+         * used because one memory region might span two TDMRs
122
+         * have already been dealt with.
88
+         * (when the previous TDMR covers part of this region).
89
+         * In this case the start address of this region is
90
+         * smaller than the start address of the second TDMR.
91
+         *
92
+         * Update the prev_end to the end of this region where
93
+         * the possible memory hole starts.
94
+         */
123
+         */
95
+        if (start <= prev_end) {
124
+        if (start <= prev_end) {
96
+            prev_end = end;
125
+            prev_end = end;
97
+            continue;
126
+            continue;
98
+        }
127
+        }
99
+
128
+
100
+        /* Add the hole before this region */
129
+        /* Add the hole before this region */
101
+        ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
130
+        ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
102
+                start - prev_end);
131
+                start - prev_end,
132
+                max_reserved_per_tdmr);
103
+        if (ret)
133
+        if (ret)
104
+            return ret;
134
+            return ret;
105
+
135
+
106
+        prev_end = end;
136
+        prev_end = end;
107
+    }
137
+    }
108
+
138
+
109
+    /* Add the hole after the last region if it exists. */
139
+    /* Add the hole after the last region if it exists. */
110
+    if (prev_end < tdmr_end(tdmr)) {
140
+    if (prev_end < tdmr_end(tdmr)) {
111
+        ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
141
+        ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
112
+                tdmr_end(tdmr) - prev_end);
142
+                tdmr_end(tdmr) - prev_end,
113
+        if (ret)
143
+                max_reserved_per_tdmr);
114
+            return ret;
144
+        if (ret)
115
+    }
145
+            return ret;
116
+
146
+    }
117
+    return 0;
147
+
118
+}
148
+    return 0;
119
+
149
+}
120
+static int tdmr_set_up_pamt_rsvd_areas(struct tdmr_info *tdmr, int *rsvd_idx,
150
+
121
+                 struct tdmr_info *tdmr_array,
151
+/*
122
+                 int tdmr_num)
152
+ * Go through @tdmr_list to find all PAMTs. If any of those PAMTs
153
+ * overlaps with @tdmr, set up a TDMR reserved area to cover the
154
+ * overlapping part.
155
+ */
156
+static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list,
157
+                 struct tdmr_info *tdmr,
158
+                 int *rsvd_idx,
159
+                 u16 max_reserved_per_tdmr)
123
+{
160
+{
124
+    int i, ret;
161
+    int i, ret;
125
+
162
+
126
+    /*
163
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
127
+     * If any PAMT overlaps with this TDMR, the overlapping part
164
+        struct tdmr_info *tmp = tdmr_entry(tdmr_list, i);
128
+     * must also be put to the reserved area too. Walk over all
165
+        unsigned long pamt_base, pamt_size, pamt_end;
129
+     * TDMRs to find out those overlapping PAMTs and put them to
166
+
130
+     * reserved areas.
167
+        tdmr_get_pamt(tmp, &pamt_base, &pamt_size);
131
+     */
132
+    for (i = 0; i < tdmr_num; i++) {
133
+        struct tdmr_info *tmp = tdmr_array_entry(tdmr_array, i);
134
+        unsigned long pamt_start_pfn, pamt_npages;
135
+        u64 pamt_start, pamt_end;
136
+
137
+        tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages);
138
+        /* Each TDMR must already have PAMT allocated */
168
+        /* Each TDMR must already have PAMT allocated */
139
+        WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn);
169
+        WARN_ON_ONCE(!pamt_size || !pamt_base);
140
+
170
+
141
+        pamt_start = pamt_start_pfn << PAGE_SHIFT;
171
+        pamt_end = pamt_base + pamt_size;
142
+        pamt_end = pamt_start + (pamt_npages << PAGE_SHIFT);
143
+
144
+        /* Skip PAMTs outside of the given TDMR */
172
+        /* Skip PAMTs outside of the given TDMR */
145
+        if ((pamt_end <= tdmr_start(tdmr)) ||
173
+        if ((pamt_end <= tdmr->base) ||
146
+                (pamt_start >= tdmr_end(tdmr)))
174
+                (pamt_base >= tdmr_end(tdmr)))
147
+            continue;
175
+            continue;
148
+
176
+
149
+        /* Only mark the part within the TDMR as reserved */
177
+        /* Only mark the part within the TDMR as reserved */
150
+        if (pamt_start < tdmr_start(tdmr))
178
+        if (pamt_base < tdmr->base)
151
+            pamt_start = tdmr_start(tdmr);
179
+            pamt_base = tdmr->base;
152
+        if (pamt_end > tdmr_end(tdmr))
180
+        if (pamt_end > tdmr_end(tdmr))
153
+            pamt_end = tdmr_end(tdmr);
181
+            pamt_end = tdmr_end(tdmr);
154
+
182
+
155
+        ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start,
183
+        ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_base,
156
+                pamt_end - pamt_start);
184
+                pamt_end - pamt_base,
185
+                max_reserved_per_tdmr);
157
+        if (ret)
186
+        if (ret)
158
+            return ret;
187
+            return ret;
159
+    }
188
+    }
160
+
189
+
161
+    return 0;
190
+    return 0;
...
...
170
+    if (r1->offset + r1->size <= r2->offset)
199
+    if (r1->offset + r1->size <= r2->offset)
171
+        return -1;
200
+        return -1;
172
+    if (r1->offset >= r2->offset + r2->size)
201
+    if (r1->offset >= r2->offset + r2->size)
173
+        return 1;
202
+        return 1;
174
+
203
+
175
+    /* Reserved areas cannot overlap. The caller should guarantee. */
204
+    /* Reserved areas cannot overlap. The caller must guarantee. */
176
+    WARN_ON_ONCE(1);
205
+    WARN_ON_ONCE(1);
177
+    return -1;
206
+    return -1;
178
+}
207
+}
179
+
208
+
180
+/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
209
+/*
181
+static int tdmr_set_up_rsvd_areas(struct tdmr_info *tdmr,
210
+ * Populate reserved areas for the given @tdmr, including memory holes
182
+                 struct tdmr_info *tdmr_array,
211
+ * (via @tmb_list) and PAMTs (via @tdmr_list).
183
+                 int tdmr_num)
212
+ */
213
+static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
214
+                 struct list_head *tmb_list,
215
+                 struct tdmr_info_list *tdmr_list,
216
+                 u16 max_reserved_per_tdmr)
184
+{
217
+{
185
+    int ret, rsvd_idx = 0;
218
+    int ret, rsvd_idx = 0;
186
+
219
+
187
+    /* Put all memory holes within the TDMR into reserved areas */
220
+    ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
188
+    ret = tdmr_set_up_memory_hole_rsvd_areas(tdmr, &rsvd_idx);
221
+            max_reserved_per_tdmr);
189
+    if (ret)
222
+    if (ret)
190
+        return ret;
223
+        return ret;
191
+
224
+
192
+    /* Put all (overlapping) PAMTs within the TDMR into reserved areas */
225
+    ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx,
193
+    ret = tdmr_set_up_pamt_rsvd_areas(tdmr, &rsvd_idx, tdmr_array, tdmr_num);
226
+            max_reserved_per_tdmr);
194
+    if (ret)
227
+    if (ret)
195
+        return ret;
228
+        return ret;
196
+
229
+
197
+    /* TDX requires reserved areas listed in address ascending order */
230
+    /* TDX requires reserved areas listed in address ascending order */
198
+    sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
231
+    sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
199
+            rsvd_area_cmp_func, NULL);
232
+            rsvd_area_cmp_func, NULL);
200
+
233
+
201
+    return 0;
234
+    return 0;
202
+}
235
+}
203
+
236
+
204
+static int tdmrs_set_up_rsvd_areas_all(struct tdmr_info *tdmr_array,
237
+/*
205
+                 int tdmr_num)
238
+ * Populate reserved areas for all TDMRs in @tdmr_list, including memory
239
+ * holes (via @tmb_list) and PAMTs.
240
+ */
241
+static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
242
+                     struct list_head *tmb_list,
243
+                     u16 max_reserved_per_tdmr)
206
+{
244
+{
207
+    int i;
245
+    int i;
208
+
246
+
209
+    for (i = 0; i < tdmr_num; i++) {
247
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
210
+        int ret;
248
+        int ret;
211
+
249
+
212
+        ret = tdmr_set_up_rsvd_areas(tdmr_array_entry(tdmr_array, i),
250
+        ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
213
+                tdmr_array, tdmr_num);
251
+                tmb_list, tdmr_list, max_reserved_per_tdmr);
214
+        if (ret)
252
+        if (ret)
215
+            return ret;
253
+            return ret;
216
+    }
254
+    }
217
+
255
+
218
+    return 0;
256
+    return 0;
219
+}
257
+}
220
+
258
+
221
/*
259
/*
222
* Construct an array of TDMRs to cover all TDX memory ranges.
260
* Construct a list of TDMRs on the preallocated space in @tdmr_list
223
* The actual number of TDMRs is kept to @tdmr_num.
261
* to cover all TDX memory regions in @tmb_list based on the TDX module
224
@@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
262
@@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list,
263
            sysinfo->pamt_entry_size);
225
    if (ret)
264
    if (ret)
226
        goto err;
265
        return ret;
227
266
-    /*
228
-    /* Return -EINVAL until constructing TDMRs is done */
267
-     * TODO:
229
-    ret = -EINVAL;
268
-     *
230
+    ret = tdmrs_set_up_rsvd_areas_all(tdmr_array, *tdmr_num);
269
-     * - Designate reserved areas for each TDMR.
270
-     *
271
-     * Return -EINVAL until constructing TDMRs is done
272
-     */
273
-    return -EINVAL;
274
+
275
+    ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
276
+            sysinfo->max_reserved_per_tdmr);
231
+    if (ret)
277
+    if (ret)
232
+        goto err_free_pamts;
278
+        tdmrs_free_pamt_all(tdmr_list);
233
+
279
+
234
+    return 0;
280
+    return ret;
235
+err_free_pamts:
281
}
236
    tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
282
237
err:
283
static int init_tdx_module(void)
238
    return ret;
239
--
284
--
240
2.38.1
285
2.41.0
diff view generated by jsdifflib
1
After the TDX-usable memory regions are constructed in an array of TDMRs
1
The TDX module uses a private KeyID as the "global KeyID" for mapping
2
and the global KeyID is reserved, configure them to the TDX module using
2
things like the PAMT and other TDX metadata. This KeyID has already
3
TDH.SYS.CONFIG SEAMCALL. TDH.SYS.CONFIG can only be called once and can
3
been reserved when detecting TDX during the kernel early boot.
4
be done on any logical cpu.
5
4
5
After the list of "TD Memory Regions" (TDMRs) has been constructed to
6
cover all TDX-usable memory regions, the next step is to pass them to
7
the TDX module together with the global KeyID.
8
9
Signed-off-by: Kai Huang <kai.huang@intel.com>
6
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
10
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
7
Signed-off-by: Kai Huang <kai.huang@intel.com>
11
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
12
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
8
---
13
---
9
arch/x86/virt/vmx/tdx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
14
15
v13 -> v14:
16
- No change
17
18
v12 -> v13:
19
- Added Yuan's tag.
20
21
v11 -> v12:
22
- Added Kirill's tag
23
24
v10 -> v11:
25
- No update
26
27
v9 -> v10:
28
- Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.
29
30
v8 -> v9:
31
- Improved changlog to explain why initializing TDMRs can take long
32
time (Dave).
33
- Improved comments around 'next-to-initialize' address (Dave).
34
35
v7 -> v8: (Dave)
36
- Changelog:
37
- explicitly call out this is the last step of TDX module initialization.
38
- Trimed down changelog by removing SEAMCALL name and details.
39
- Removed/trimmed down unnecessary comments.
40
- Other changes due to 'struct tdmr_info_list'.
41
42
v6 -> v7:
43
- Removed need_resched() check. -- Andi.
44
45
---
46
arch/x86/virt/vmx/tdx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++-
10
arch/x86/virt/vmx/tdx/tdx.h | 2 ++
47
arch/x86/virt/vmx/tdx/tdx.h | 2 ++
11
2 files changed, 39 insertions(+)
48
2 files changed, 44 insertions(+), 1 deletion(-)
12
49
13
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
50
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
14
index XXXXXXX..XXXXXXX 100644
51
index XXXXXXX..XXXXXXX 100644
15
--- a/arch/x86/virt/vmx/tdx/tdx.c
52
--- a/arch/x86/virt/vmx/tdx/tdx.c
16
+++ b/arch/x86/virt/vmx/tdx/tdx.c
53
+++ b/arch/x86/virt/vmx/tdx/tdx.c
17
@@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
54
@@ -XXX,XX +XXX,XX @@
55
#include <linux/pfn.h>
56
#include <linux/align.h>
57
#include <linux/sort.h>
58
+#include <linux/log2.h>
59
#include <asm/msr-index.h>
60
#include <asm/msr.h>
61
#include <asm/page.h>
62
@@ -XXX,XX +XXX,XX @@ static int construct_tdmrs(struct list_head *tmb_list,
18
    return ret;
63
    return ret;
19
}
64
}
20
65
21
+static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
66
+static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
22
+             u64 global_keyid)
23
+{
67
+{
68
+    struct tdx_module_args args = {};
24
+    u64 *tdmr_pa_array;
69
+    u64 *tdmr_pa_array;
25
+    int i, array_sz;
70
+    size_t array_sz;
26
+    u64 ret;
71
+    int i, ret;
27
+
72
+
28
+    /*
73
+    /*
29
+     * TDMR_INFO entries are configured to the TDX module via an
74
+     * TDMRs are passed to the TDX module via an array of physical
30
+     * array of the physical address of each TDMR_INFO. TDX module
75
+     * addresses of each TDMR. The array itself also has certain
31
+     * requires the array itself to be 512-byte aligned. Round up
76
+     * alignment requirement.
32
+     * the array size to 512-byte aligned so the buffer allocated
33
+     * by kzalloc() will meet the alignment requirement.
34
+     */
77
+     */
35
+    array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
78
+    array_sz = tdmr_list->nr_consumed_tdmrs * sizeof(u64);
79
+    array_sz = roundup_pow_of_two(array_sz);
80
+    if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
81
+        array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;
82
+
36
+    tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
83
+    tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
37
+    if (!tdmr_pa_array)
84
+    if (!tdmr_pa_array)
38
+        return -ENOMEM;
85
+        return -ENOMEM;
39
+
86
+
40
+    for (i = 0; i < tdmr_num; i++)
87
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
41
+        tdmr_pa_array[i] = __pa(tdmr_array_entry(tdmr_array, i));
88
+        tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
42
+
89
+
43
+    ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_num,
90
+    args.rcx = __pa(tdmr_pa_array);
44
+                global_keyid, 0, NULL, NULL);
91
+    args.rdx = tdmr_list->nr_consumed_tdmrs;
92
+    args.r8 = global_keyid;
93
+    ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
45
+
94
+
46
+    /* Free the array as it is not required anymore. */
95
+    /* Free the array as it is not required anymore. */
47
+    kfree(tdmr_pa_array);
96
+    kfree(tdmr_pa_array);
48
+
97
+
49
+    return ret;
98
+    return ret;
50
+}
99
+}
51
+
100
+
52
/*
101
static int init_tdx_module(void)
53
* Detect and initialize the TDX module.
102
{
54
*
103
    struct tdsysinfo_struct *tdsysinfo;
55
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
104
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
56
     */
105
    if (ret)
57
    tdx_global_keyid = tdx_keyid_start;
106
        goto out_free_tdmrs;
58
107
59
+    /* Pass the TDMRs and the global KeyID to the TDX module */
108
+    /* Pass the TDMRs and the global KeyID to the TDX module */
60
+    ret = config_tdx_module(tdmr_array, tdmr_num, tdx_global_keyid);
109
+    ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
61
+    if (ret)
110
+    if (ret)
62
+        goto out_free_pamts;
111
+        goto out_free_pamts;
63
+
112
+
64
    /*
113
    /*
65
     * Return -EINVAL until all steps of TDX module initialization
114
     * TODO:
66
     * process are done.
115
     *
116
-     * - Configure the TDMRs and the global KeyID to the TDX module.
117
     * - Configure the global KeyID on all packages.
118
     * - Initialize all TDMRs.
119
     *
120
     * Return error before all steps are done.
67
     */
121
     */
68
    ret = -EINVAL;
122
    ret = -EINVAL;
69
+out_free_pamts:
123
+out_free_pamts:
70
    if (ret)
124
    if (ret)
71
        tdmrs_free_pamt_all(tdmr_array, tdmr_num);
125
        tdmrs_free_pamt_all(&tdmr_list);
72
    else
126
    else
73
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
127
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
74
index XXXXXXX..XXXXXXX 100644
128
index XXXXXXX..XXXXXXX 100644
75
--- a/arch/x86/virt/vmx/tdx/tdx.h
129
--- a/arch/x86/virt/vmx/tdx/tdx.h
76
+++ b/arch/x86/virt/vmx/tdx/tdx.h
130
+++ b/arch/x86/virt/vmx/tdx/tdx.h
77
@@ -XXX,XX +XXX,XX @@
131
@@ -XXX,XX +XXX,XX @@
132
#define TDH_SYS_INFO        32
78
#define TDH_SYS_INIT        33
133
#define TDH_SYS_INIT        33
79
#define TDH_SYS_LP_INIT        35
134
#define TDH_SYS_LP_INIT        35
80
#define TDH_SYS_LP_SHUTDOWN    44
81
+#define TDH_SYS_CONFIG        45
135
+#define TDH_SYS_CONFIG        45
82
136
83
struct cmr_info {
137
struct cmr_info {
84
    u64    base;
138
    u64    base;
85
@@ -XXX,XX +XXX,XX @@ struct tdmr_reserved_area {
139
@@ -XXX,XX +XXX,XX @@ struct tdmr_reserved_area {
...
...
89
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT    512
143
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT    512
90
144
91
struct tdmr_info {
145
struct tdmr_info {
92
    u64 base;
146
    u64 base;
93
--
147
--
94
2.38.1
148
2.41.0
diff view generated by jsdifflib
1
After the array of TDMRs and the global KeyID are configured to the TDX
1
After the list of TDMRs and the global KeyID are configured to the TDX
2
module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
2
module, the kernel needs to configure the key of the global KeyID on all
3
on all packages.
3
packages using TDH.SYS.KEY.CONFIG.
4
4
5
TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And
5
This SEAMCALL cannot run parallel on different cpus. Loop all online
6
it cannot run concurrently on different CPUs. Implement a helper to
6
cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of
7
run SEAMCALL on one cpu for each package one by one, and use it to
7
each package.
8
configure the global KeyID on all packages.
8
9
To keep things simple, this implementation takes no affirmative steps to
10
online cpus to make sure there's at least one cpu for each package. The
11
callers (aka. KVM) can ensure success by ensuring sufficient CPUs are
12
online for this to succeed.
9
13
10
Intel hardware doesn't guarantee cache coherency across different
14
Intel hardware doesn't guarantee cache coherency across different
11
KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
15
KeyIDs. The PAMTs are transitioning from being used by the kernel
12
with KeyID 0) before the TDX module uses the global KeyID to access the
16
mapping (KeyId 0) to the TDX module's "global KeyID" mapping.
13
PAMT. Following the TDX module specification, flush cache before
17
14
configuring the global KeyID on all packages.
18
This means that the kernel must flush any dirty KeyID-0 PAMT cachelines
15
19
before the TDX module uses the global KeyID to access the PAMTs.
16
Given the PAMT size can be large (~1/256th of system RAM), just use
20
Otherwise, if those dirty cachelines were written back, they would
17
WBINVD on all CPUs to flush.
21
corrupt the TDX module's metadata. Aside: This corruption would be
18
22
detected by the memory integrity hardware on the next read of the memory
19
Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
23
with the global KeyID. The result would likely be fatal to the system
20
used the global KeyID to write any PAMT. Therefore, need to use WBINVD
24
but would not impact TDX security.
21
to flush cache before freeing the PAMTs back to the kernel. Note using
25
22
MOVDIR64B (which changes the page's associated KeyID from the old TDX
26
Following the TDX module specification, flush cache before configuring
23
private KeyID back to KeyID 0, which is used by the kernel) to clear
27
the global KeyID on all packages. Given the PAMT size can be large
24
PMATs isn't needed, as the KeyID 0 doesn't support integrity check.
28
(~1/256th of system RAM), just use WBINVD on all CPUs to flush.
25
29
30
If TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the
31
global KeyID to write the PAMTs. Therefore, use WBINVD to flush cache
32
before returning the PAMTs back to the kernel. Also convert all PAMTs
33
back to normal by using MOVDIR64B as suggested by the TDX module spec,
34
although on the platform without the "partial write machine check"
35
erratum it's OK to leave PAMTs as is.
36
37
Signed-off-by: Kai Huang <kai.huang@intel.com>
26
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
38
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
27
Signed-off-by: Kai Huang <kai.huang@intel.com>
39
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
40
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
28
---
41
---
29
42
30
v6 -> v7:
43
v13 -> v14:
31
- Improved changelong and comment to explain why MOVDIR64B isn't used
44
- No change
32
when returning PAMTs back to the kernel.
45
46
v12 -> v13:
47
- Added Yuan's tag.
48
49
v11 -> v12:
50
- Added Kirill's tag
51
- Improved changelog (Nikolay)
52
53
v10 -> v11:
54
- Convert PAMTs back to normal when module initialization fails.
55
- Fixed an error in changelog
56
57
v9 -> v10:
58
- Changed to use 'smp_call_on_cpu()' directly to do key configuration.
59
60
v8 -> v9:
61
- Improved changelog (Dave).
62
- Improved comments to explain the function to configure global KeyID
63
"takes no affirmative action to online any cpu". (Dave).
64
- Improved other comments suggested by Dave.
65
66
v7 -> v8: (Dave)
67
- Changelog changes:
68
- Point out this is the step of "multi-steps" of init_tdx_module().
69
- Removed MOVDIR64B part.
70
- Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT.
71
- Changed to loop over online cpus and use smp_call_function_single()
72
directly as the patch to shut down TDX module has been removed.
73
- Removed MOVDIR64B part in comment.
33
74
34
---
75
---
35
arch/x86/virt/vmx/tdx/tdx.c | 89 ++++++++++++++++++++++++++++++++++++-
76
arch/x86/virt/vmx/tdx/tdx.c | 130 +++++++++++++++++++++++++++++++++++-
36
arch/x86/virt/vmx/tdx/tdx.h | 1 +
77
arch/x86/virt/vmx/tdx/tdx.h | 1 +
37
2 files changed, 88 insertions(+), 2 deletions(-)
78
2 files changed, 129 insertions(+), 2 deletions(-)
38
79
39
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
80
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
40
index XXXXXXX..XXXXXXX 100644
81
index XXXXXXX..XXXXXXX 100644
41
--- a/arch/x86/virt/vmx/tdx/tdx.c
82
--- a/arch/x86/virt/vmx/tdx/tdx.c
42
+++ b/arch/x86/virt/vmx/tdx/tdx.c
83
+++ b/arch/x86/virt/vmx/tdx/tdx.c
43
@@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
84
@@ -XXX,XX +XXX,XX @@
44
    on_each_cpu(seamcall_smp_call_function, sc, true);
85
#include <asm/msr-index.h>
45
}
86
#include <asm/msr.h>
87
#include <asm/page.h>
88
+#include <asm/special_insns.h>
89
#include <asm/tdx.h>
90
#include "tdx.h"
91
92
@@ -XXX,XX +XXX,XX @@ static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base,
93
    *pamt_size = pamt_sz;
94
}
95
96
-static void tdmr_free_pamt(struct tdmr_info *tdmr)
97
+static void tdmr_do_pamt_func(struct tdmr_info *tdmr,
98
+        void (*pamt_func)(unsigned long base, unsigned long size))
99
{
100
    unsigned long pamt_base, pamt_size;
101
102
@@ -XXX,XX +XXX,XX @@ static void tdmr_free_pamt(struct tdmr_info *tdmr)
103
    if (WARN_ON_ONCE(!pamt_base))
104
        return;
105
106
+    (*pamt_func)(pamt_base, pamt_size);
107
+}
108
+
109
+static void free_pamt(unsigned long pamt_base, unsigned long pamt_size)
110
+{
111
    free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT);
112
}
113
114
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
115
+{
116
+    tdmr_do_pamt_func(tdmr, free_pamt);
117
+}
118
+
119
static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
120
{
121
    int i;
122
@@ -XXX,XX +XXX,XX @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
123
    return ret;
124
}
46
125
47
+/*
126
+/*
48
+ * Call one SEAMCALL on one (any) cpu for each physical package in
127
+ * Convert TDX private pages back to normal by using MOVDIR64B to
49
+ * serialized way. Return immediately in case of any error if
128
+ * clear these pages. Note this function doesn't flush cache of
50
+ * SEAMCALL fails on any cpu.
129
+ * these TDX private pages. The caller should make sure of that.
130
+ */
131
+static void reset_tdx_pages(unsigned long base, unsigned long size)
132
+{
133
+    const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
134
+    unsigned long phys, end;
135
+
136
+    end = base + size;
137
+    for (phys = base; phys < end; phys += 64)
138
+        movdir64b(__va(phys), zero_page);
139
+
140
+    /*
141
+     * MOVDIR64B uses WC protocol. Use memory barrier to
142
+     * make sure any later user of these pages sees the
143
+     * updated data.
144
+     */
145
+    mb();
146
+}
147
+
148
+static void tdmr_reset_pamt(struct tdmr_info *tdmr)
149
+{
150
+    tdmr_do_pamt_func(tdmr, reset_tdx_pages);
151
+}
152
+
153
+static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list)
154
+{
155
+    int i;
156
+
157
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
158
+        tdmr_reset_pamt(tdmr_entry(tdmr_list, i));
159
+}
160
+
161
static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
162
{
163
    unsigned long pamt_size = 0;
164
@@ -XXX,XX +XXX,XX @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
165
    return ret;
166
}
167
168
+static int do_global_key_config(void *data)
169
+{
170
+    struct tdx_module_args args = {};
171
+
172
+    return seamcall_prerr(TDH_SYS_KEY_CONFIG, &args);
173
+}
174
+
175
+/*
176
+ * Attempt to configure the global KeyID on all physical packages.
51
+ *
177
+ *
52
+ * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
178
+ * This requires running code on at least one CPU in each package. If a
53
+ * to be atomic, but for simplicity just reuse it instead of adding
179
+ * package has no online CPUs, that code will not run and TDX module
54
+ * a new one.
180
+ * initialization (TDMR initialization) will fail.
181
+ *
182
+ * This code takes no affirmative steps to online CPUs. Callers (aka.
183
+ * KVM) can ensure success by ensuring sufficient CPUs are online for
184
+ * this to succeed.
55
+ */
185
+ */
56
+static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
186
+static int config_global_keyid(void)
57
+{
187
+{
58
+    cpumask_var_t packages;
188
+    cpumask_var_t packages;
59
+    int cpu, ret = 0;
189
+    int cpu, ret = -EINVAL;
60
+
190
+
61
+    if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
191
+    if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
62
+        return -ENOMEM;
192
+        return -ENOMEM;
63
+
193
+
64
+    for_each_online_cpu(cpu) {
194
+    for_each_online_cpu(cpu) {
65
+        if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
195
+        if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
66
+                    packages))
196
+                    packages))
67
+            continue;
197
+            continue;
68
+
198
+
69
+        ret = smp_call_function_single(cpu, seamcall_smp_call_function,
70
+                sc, true);
71
+        if (ret)
72
+            break;
73
+
74
+        /*
199
+        /*
75
+         * Doesn't have to use atomic_read(), but it doesn't
200
+         * TDH.SYS.KEY.CONFIG cannot run concurrently on
76
+         * hurt either.
201
+         * different cpus, so just do it one by one.
77
+         */
202
+         */
78
+        ret = atomic_read(&sc->err);
203
+        ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true);
79
+        if (ret)
204
+        if (ret)
80
+            break;
205
+            break;
81
+    }
206
+    }
82
+
207
+
83
+    free_cpumask_var(packages);
208
+    free_cpumask_var(packages);
84
+    return ret;
209
+    return ret;
85
+}
210
+}
86
+
211
+
87
static int tdx_module_init_cpus(void)
212
static int init_tdx_module(void)
88
{
213
{
89
    struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
214
    struct tdsysinfo_struct *tdsysinfo;
90
@@ -XXX,XX +XXX,XX @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
91
    return ret;
92
}
93
94
+static int config_global_keyid(void)
95
+{
96
+    struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
97
+
98
+    /*
99
+     * Configure the key of the global KeyID on all packages by
100
+     * calling TDH.SYS.KEY.CONFIG on all packages in a serialized
101
+     * way as it cannot run concurrently on different CPUs.
102
+     *
103
+     * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
104
+     * a recoverable error). Assume this is exceedingly rare and
105
+     * just return error if encountered instead of retrying.
106
+     */
107
+    return seamcall_on_each_package_serialized(&sc);
108
+}
109
+
110
/*
111
* Detect and initialize the TDX module.
112
*
113
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
215
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
114
    if (ret)
216
    if (ret)
115
        goto out_free_pamts;
217
        goto out_free_pamts;
116
218
117
+    /*
219
+    /*
118
+     * Hardware doesn't guarantee cache coherency across different
220
+     * Hardware doesn't guarantee cache coherency across different
119
+     * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
221
+     * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
120
+     * (associated with KeyID 0) before the TDX module can use the
222
+     * (associated with KeyID 0) before the TDX module can use the
121
+     * global KeyID to access the PAMT. Given PAMTs are potentially
223
+     * global KeyID to access the PAMT. Given PAMTs are potentially
122
+     * large (~1/256th of system RAM), just use WBINVD on all cpus
224
+     * large (~1/256th of system RAM), just use WBINVD on all cpus
123
+     * to flush the cache.
225
+     * to flush the cache.
124
+     *
125
+     * Follow the TDX spec to flush cache before configuring the
126
+     * global KeyID on all packages.
127
+     */
226
+     */
128
+    wbinvd_on_all_cpus();
227
+    wbinvd_on_all_cpus();
129
+
228
+
130
+    /* Config the key of global KeyID on all packages */
229
+    /* Config the key of global KeyID on all packages */
131
+    ret = config_global_keyid();
230
+    ret = config_global_keyid();
132
+    if (ret)
231
+    if (ret)
133
+        goto out_free_pamts;
232
+        goto out_reset_pamts;
134
+
233
+
135
    /*
234
    /*
136
     * Return -EINVAL until all steps of TDX module initialization
235
     * TODO:
137
     * process are done.
236
     *
237
-     * - Configure the global KeyID on all packages.
238
     * - Initialize all TDMRs.
239
     *
240
     * Return error before all steps are done.
138
     */
241
     */
139
    ret = -EINVAL;
242
    ret = -EINVAL;
140
out_free_pamts:
243
+out_reset_pamts:
141
-    if (ret)
142
+    if (ret) {
244
+    if (ret) {
143
+        /*
245
+        /*
144
+         * Part of PAMT may already have been initialized by
246
+         * Part of PAMTs may already have been initialized by the
145
+         * TDX module. Flush cache before returning PAMT back
247
+         * TDX module. Flush cache before returning PAMTs back
146
+         * to the kernel.
248
+         * to the kernel.
147
+         *
148
+         * Note there's no need to do MOVDIR64B (which changes
149
+         * the page's associated KeyID from the old TDX private
150
+         * KeyID back to KeyID 0, which is used by the kernel),
151
+         * as KeyID 0 doesn't support integrity check.
152
+         */
249
+         */
153
+        wbinvd_on_all_cpus();
250
+        wbinvd_on_all_cpus();
154
        tdmrs_free_pamt_all(tdmr_array, tdmr_num);
251
+        /*
155
-    else
252
+         * According to the TDX hardware spec, if the platform
156
+    } else
253
+         * doesn't have the "partial write machine check"
157
        pr_info("%lu pages allocated for PAMT.\n",
254
+         * erratum, any kernel read/write will never cause #MC
158
                tdmrs_count_pamt_pages(tdmr_array, tdmr_num));
255
+         * in kernel space, thus it's OK to not convert PAMTs
159
out_free_tdmrs:
256
+         * back to normal. But do the conversion anyway here
257
+         * as suggested by the TDX spec.
258
+         */
259
+        tdmrs_reset_pamt_all(&tdmr_list);
260
+    }
261
out_free_pamts:
262
    if (ret)
263
        tdmrs_free_pamt_all(&tdmr_list);
264
@@ -XXX,XX +XXX,XX @@ static int __tdx_enable(void)
265
* lock to prevent any new cpu from becoming online; 2) done both VMXON
266
* and tdx_cpu_enable() on all online cpus.
267
*
268
+ * This function requires there's at least one online cpu for each CPU
269
+ * package to succeed.
270
+ *
271
* This function can be called in parallel by multiple callers.
272
*
273
* Return 0 if TDX is enabled successfully, otherwise error.
160
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
274
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
161
index XXXXXXX..XXXXXXX 100644
275
index XXXXXXX..XXXXXXX 100644
162
--- a/arch/x86/virt/vmx/tdx/tdx.h
276
--- a/arch/x86/virt/vmx/tdx/tdx.h
163
+++ b/arch/x86/virt/vmx/tdx/tdx.h
277
+++ b/arch/x86/virt/vmx/tdx/tdx.h
164
@@ -XXX,XX +XXX,XX @@
278
@@ -XXX,XX +XXX,XX @@
...
...
168
+#define TDH_SYS_KEY_CONFIG    31
282
+#define TDH_SYS_KEY_CONFIG    31
169
#define TDH_SYS_INFO        32
283
#define TDH_SYS_INFO        32
170
#define TDH_SYS_INIT        33
284
#define TDH_SYS_INIT        33
171
#define TDH_SYS_LP_INIT        35
285
#define TDH_SYS_LP_INIT        35
172
--
286
--
173
2.38.1
287
2.41.0
diff view generated by jsdifflib
1
Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
1
After the global KeyID has been configured on all packages, initialize
2
TDX initialization.
2
all TDMRs to make all TDX-usable memory regions that are passed to the
3
TDX module become usable.
3
4
4
All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before
5
This is the last step of initializing the TDX module.
5
the memory pages can be used by the TDX module. The time to initialize
6
TDMR is proportional to the size of the TDMR because TDH.SYS.TDMR.INIT
7
internally initializes the PAMT entries using the global KeyID.
8
6
9
To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only
7
Initializing TDMRs can be time consuming on large memory systems as it
10
initializes an (implementation-specific) subset of PAMT entries of one
8
involves initializing all metadata entries for all pages that can be
11
TDMR in one invocation. The caller needs to call TDH.SYS.TDMR.INIT
9
used by TDX guests. Initializing different TDMRs can be parallelized.
12
iteratively until all PAMT entries of the given TDMR are initialized.
10
For now to keep it simple, just initialize all TDMRs one by one. It can
11
be enhanced in the future.
13
12
14
TDH.SYS.TDMR.INITs can run concurrently on multiple CPUs as long as they
13
Signed-off-by: Kai Huang <kai.huang@intel.com>
15
are initializing different TDMRs. To keep it simple, just initialize
14
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
16
all TDMRs one by one. On a 2-socket machine with 2.2G CPUs and 64GB
15
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
17
memory, each TDH.SYS.TDMR.INIT roughly takes couple of microseconds on
16
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
18
average, and it takes roughly dozens of milliseconds to complete the
17
---
19
initialization of all TDMRs while system is idle.
20
18
21
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
19
v13 -> v14:
22
Signed-off-by: Kai Huang <kai.huang@intel.com>
20
- No change
23
---
21
22
v12 -> v13:
23
- Added Yuan's tag.
24
25
v11 -> v12:
26
- Added Kirill's tag
27
28
v10 -> v11:
29
- No update
30
31
v9 -> v10:
32
- Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.
33
34
v8 -> v9:
35
- Improved changlog to explain why initializing TDMRs can take long
36
time (Dave).
37
- Improved comments around 'next-to-initialize' address (Dave).
38
39
v7 -> v8: (Dave)
40
- Changelog:
41
- explicitly call out this is the last step of TDX module initialization.
42
- Trimed down changelog by removing SEAMCALL name and details.
43
- Removed/trimmed down unnecessary comments.
44
- Other changes due to 'struct tdmr_info_list'.
24
45
25
v6 -> v7:
46
v6 -> v7:
26
- Removed need_resched() check. -- Andi.
47
- Removed need_resched() check. -- Andi.
27
48
28
---
49
---
29
arch/x86/virt/vmx/tdx/tdx.c | 69 ++++++++++++++++++++++++++++++++++---
50
arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++-----
30
arch/x86/virt/vmx/tdx/tdx.h | 1 +
51
arch/x86/virt/vmx/tdx/tdx.h | 1 +
31
2 files changed, 65 insertions(+), 5 deletions(-)
52
2 files changed, 53 insertions(+), 8 deletions(-)
32
53
33
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
54
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
34
index XXXXXXX..XXXXXXX 100644
55
index XXXXXXX..XXXXXXX 100644
35
--- a/arch/x86/virt/vmx/tdx/tdx.c
56
--- a/arch/x86/virt/vmx/tdx/tdx.c
36
+++ b/arch/x86/virt/vmx/tdx/tdx.c
57
+++ b/arch/x86/virt/vmx/tdx/tdx.c
37
@@ -XXX,XX +XXX,XX @@ static int config_global_keyid(void)
58
@@ -XXX,XX +XXX,XX @@ static int config_global_keyid(void)
38
    return seamcall_on_each_package_serialized(&sc);
59
    return ret;
39
}
60
}
40
61
41
+/* Initialize one TDMR */
42
+static int init_tdmr(struct tdmr_info *tdmr)
62
+static int init_tdmr(struct tdmr_info *tdmr)
43
+{
63
+{
44
+    u64 next;
64
+    u64 next;
45
+
65
+
46
+    /*
66
+    /*
47
+     * Initializing PAMT entries might be time-consuming (in
67
+     * Initializing a TDMR can be time consuming. To avoid long
48
+     * proportion to the size of the requested TDMR). To avoid long
68
+     * SEAMCALLs, the TDX module may only initialize a part of the
49
+     * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes
69
+     * TDMR in each call.
50
+     * an (implementation-defined) subset of PAMT entries in one
51
+     * invocation.
52
+     *
53
+     * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries
54
+     * of the requested TDMR are initialized (if next-to-initialize
55
+     * address matches the end address of the TDMR).
56
+     */
70
+     */
57
+    do {
71
+    do {
58
+        struct tdx_module_output out;
72
+        struct tdx_module_args args = {
73
+            .rcx = tdmr->base,
74
+        };
59
+        int ret;
75
+        int ret;
60
+
76
+
61
+        ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
77
+        ret = seamcall_prerr_ret(TDH_SYS_TDMR_INIT, &args);
62
+                &out);
63
+        if (ret)
78
+        if (ret)
64
+            return ret;
79
+            return ret;
65
+        /*
80
+        /*
66
+         * RDX contains 'next-to-initialize' address if
81
+         * RDX contains 'next-to-initialize' address if
67
+         * TDH.SYS.TDMR.INT succeeded.
82
+         * TDH.SYS.TDMR.INIT did not fully complete and
83
+         * should be retried.
68
+         */
84
+         */
69
+        next = out.rdx;
85
+        next = args.rdx;
70
+        /* Allow scheduling when needed */
71
+        cond_resched();
86
+        cond_resched();
87
+        /* Keep making SEAMCALLs until the TDMR is done */
72
+    } while (next < tdmr->base + tdmr->size);
88
+    } while (next < tdmr->base + tdmr->size);
73
+
89
+
74
+    return 0;
90
+    return 0;
75
+}
91
+}
76
+
92
+
77
+/* Initialize all TDMRs */
93
+static int init_tdmrs(struct tdmr_info_list *tdmr_list)
78
+static int init_tdmrs(struct tdmr_info *tdmr_array, int tdmr_num)
79
+{
94
+{
80
+    int i;
95
+    int i;
81
+
96
+
82
+    /*
97
+    /*
83
+     * Initialize TDMRs one-by-one for simplicity, though the TDX
98
+     * This operation is costly. It can be parallelized,
84
+     * architecture does allow different TDMRs to be initialized in
99
+     * but keep it simple for now.
85
+     * parallel on multiple CPUs. Parallel initialization could
86
+     * be added later when the time spent in the serialized scheme
87
+     * becomes a real concern.
88
+     */
100
+     */
89
+    for (i = 0; i < tdmr_num; i++) {
101
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
90
+        int ret;
102
+        int ret;
91
+
103
+
92
+        ret = init_tdmr(tdmr_array_entry(tdmr_array, i));
104
+        ret = init_tdmr(tdmr_entry(tdmr_list, i));
93
+        if (ret)
105
+        if (ret)
94
+            return ret;
106
+            return ret;
95
+    }
107
+    }
96
+
108
+
97
+    return 0;
109
+    return 0;
98
+}
110
+}
99
+
111
+
100
/*
112
static int init_tdx_module(void)
101
* Detect and initialize the TDX module.
113
{
102
*
114
    struct tdsysinfo_struct *tdsysinfo;
103
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
115
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
104
    if (ret)
116
    if (ret)
105
        goto out_free_pamts;
117
        goto out_reset_pamts;
106
118
107
-    /*
119
-    /*
108
-     * Return -EINVAL until all steps of TDX module initialization
120
-     * TODO:
109
-     * process are done.
121
-     *
122
-     * - Initialize all TDMRs.
123
-     *
124
-     * Return error before all steps are done.
110
-     */
125
-     */
111
-    ret = -EINVAL;
126
-    ret = -EINVAL;
112
+    /* Initialize TDMRs to complete the TDX module initialization */
127
+    /* Initialize TDMRs to complete the TDX module initialization */
113
+    ret = init_tdmrs(tdmr_array, tdmr_num);
128
+    ret = init_tdmrs(&tdmr_list);
114
+    if (ret)
129
out_reset_pamts:
115
+        goto out_free_pamts;
116
+
117
out_free_pamts:
118
    if (ret) {
130
    if (ret) {
119
        /*
131
        /*
120
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
132
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
121
index XXXXXXX..XXXXXXX 100644
133
index XXXXXXX..XXXXXXX 100644
122
--- a/arch/x86/virt/vmx/tdx/tdx.h
134
--- a/arch/x86/virt/vmx/tdx/tdx.h
123
+++ b/arch/x86/virt/vmx/tdx/tdx.h
135
+++ b/arch/x86/virt/vmx/tdx/tdx.h
124
@@ -XXX,XX +XXX,XX @@
136
@@ -XXX,XX +XXX,XX @@
125
#define TDH_SYS_INFO        32
137
#define TDH_SYS_INFO        32
126
#define TDH_SYS_INIT        33
138
#define TDH_SYS_INIT        33
127
#define TDH_SYS_LP_INIT        35
139
#define TDH_SYS_LP_INIT        35
128
+#define TDH_SYS_TDMR_INIT    36
140
+#define TDH_SYS_TDMR_INIT    36
129
#define TDH_SYS_LP_SHUTDOWN    44
130
#define TDH_SYS_CONFIG        45
141
#define TDH_SYS_CONFIG        45
131
142
143
struct cmr_info {
132
--
144
--
133
2.38.1
145
2.41.0
diff view generated by jsdifflib
There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages (i.e. metadata used by the TDX module, and any
TDX guest memory if kexec() happens when there's any TDX guest alive).
2) There might be dirty cachelines associated with TDX private pages.

Because the hardware doesn't guarantee cache coherency among different
KeyIDs, the old kernel needs to flush the cache (of those TDX private
pages) before booting to the new kernel. Also, reading a TDX private
page using any shared non-TDX KeyID with integrity-check enabled can
trigger #MC. Therefore ideally, the kernel should convert all TDX
private pages back to normal before booting to the new kernel.

However, this implementation doesn't convert TDX private pages back to
normal in kexec() because of the considerations below:

1) The kernel doesn't have existing infrastructure to track which pages
are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
them (cache flush + using MOVDIR64B to clear the page) in kexec() can
be time consuming.
3) The new kernel will almost only use KeyID 0 to access memory. KeyID
0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME. If any 3rd party
kernel ever supports MKTME, it should do MOVDIR64B to clear the page
with the new MKTME KeyID (just like TDX does) before using it.

Therefore, this implementation just flushes the cache to make sure there
are no stale dirty cachelines associated with any TDX private KeyIDs
before booting to the new kernel, otherwise they may silently corrupt
the new kernel.

Following SME support, use wbinvd() to flush cache in stop_this_cpu().

Theoretically, cache flush is only needed when the TDX module has been
initialized. However initializing the TDX module is done on demand at
runtime, and it takes a mutex to read the module status. Just check
whether TDX is enabled by BIOS instead to flush cache.

Also, the current TDX module doesn't play nicely with kexec(). The TDX
module can only be initialized once during its lifetime, and there is no
ABI to reset the module to give a new clean slate to the new kernel.
Therefore ideally, if the TDX module is ever initialized, it's better
to shut it down. The new kernel won't be able to use TDX anyway (as it
needs to go through the TDX module initialization process which will
fail immediately at the first step).

However, shutting down the TDX module requires all CPUs being in VMX
operation, but there's no such guarantee as kexec() can happen at any
time (i.e. when KVM is not even loaded). So just do nothing but leave
the TDX module open.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
- Improved changelog to explain why we don't convert TDX private pages
back to normal.

---
arch/x86/kernel/process.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -XXX,XX +XXX,XX @@ void __noreturn stop_this_cpu(void *dummy)
     *
     * Test the CPUID bit directly because the machine might've cleared
     * X86_FEATURE_SME due to cmdline options.
+     *
+     * Similar to SME, if the TDX module is ever initialized, the
+     * cachelines associated with any TDX private KeyID must be flushed
+     * before transiting to the new kernel. The TDX module is initialized
+     * on demand, and it takes the mutex to read its status. Just check
+     * whether TDX is enabled by BIOS instead to flush cache.
     */
-    if (cpuid_eax(0x8000001f) & BIT(0))
+    if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
        native_wbinvd();
    for (;;) {
        /*
--
2.38.1

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages; 2) There might be dirty cachelines associated
with TDX private pages.

The first problem doesn't matter on the platforms w/o the "partial write
machine check" erratum. KeyID 0 doesn't have integrity check. If the
new kernel wants to use any non-zero KeyID, it needs to convert the
memory to that KeyID and such conversion would work from any KeyID.

However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

1) Stop accessing TDX private memory mappings:
a. Stop making TDX module SEAMCALLs (TDX global KeyID);
b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME
support. And in this way 1) happens for free as there's no TDX activity
between wbinvd() and the native_halt().

Flushing cache in stop_this_cpu() only flushes cache on remote cpus. On
the rebooting cpu which does kexec(), unlike SME which does the cache
flush in relocate_kernel(), flush the cache right after stopping remote
cpus in machine_shutdown().

There are two reasons to do so: 1) For TDX there's no need to defer
cache flush to relocate_kernel() because all TDX activities have been
stopped. 2) On the platforms with the above erratum the kernel must
convert all TDX private pages back to normal before booting to the new
kernel in kexec(), and flushing cache early allows the kernel to convert
memory early rather than having to muck with the relocate_kernel()
assembly.

Theoretically, cache flush is only needed when the TDX module has been
initialized. However initializing the TDX module is done on demand at
runtime, and it takes a mutex to read the module status. Just check
whether TDX is enabled by the BIOS instead to flush cache.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- No change

---
arch/x86/kernel/process.c | 8 +++++++-
arch/x86/kernel/reboot.c | 15 +++++++++++++++
2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -XXX,XX +XXX,XX @@ void __noreturn stop_this_cpu(void *dummy)
     *
     * Test the CPUID bit directly because the machine might've cleared
     * X86_FEATURE_SME due to cmdline options.
+     *
+     * The TDX module or guests might have left dirty cachelines
+     * behind. Flush them to avoid corruption from later writeback.
+     * Note that this flushes on all systems where TDX is possible,
+     * but does not actually check that TDX was in use.
     */
-    if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+    if ((c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+            || platform_tdx_enabled())
        native_wbinvd();

    /*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -XXX,XX +XXX,XX @@
#include <asm/realmode.h>
#include <asm/x86_init.h>
#include <asm/efi.h>
+#include <asm/tdx.h>

/*
* Power off function, if any
@@ -XXX,XX +XXX,XX @@ void native_machine_shutdown(void)
    local_irq_disable();
    stop_other_cpus();
#endif
+    /*
+     * stop_other_cpus() has flushed all dirty cachelines of TDX
+     * private memory on remote cpus. Unlike SME, which does the
+     * cache flush on _this_ cpu in the relocate_kernel(), flush
+     * the cache for _this_ cpu here. This is because on the
+     * platforms with "partial write machine check" erratum the
+     * kernel needs to convert all TDX private pages back to normal
+     * before booting to the new kernel in kexec(), and the cache
+     * flush must be done before that. If the kernel took SME's way,
+     * it would have to muck with the relocate_kernel() assembly to
+     * do memory conversion.
+     */
+    if (platform_tdx_enabled())
+        native_wbinvd();

    lapic_shutdown();
    restore_boot_irq_mode();
--
2.41.0
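As a reference for the extended-leaf guard added in v14's stop_this_cpu()
hunk above: CPUID leaf 0x8000001f must not be queried unless the CPU
reports it as supported, which is what the c->extended_cpuid_level check
does. A minimal standalone sketch of the same check (userspace C, not
kernel code; the SME feature bit really is EAX bit 0 of leaf 0x8000001f):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/*
	 * Leaf 0x80000000 reports the highest supported extended leaf,
	 * mirroring the kernel's c->extended_cpuid_level guard.
	 */
	if (!__get_cpuid(0x80000000, &eax, &ebx, &ecx, &edx) ||
	    eax < 0x8000001f) {
		puts("CPUID leaf 0x8000001f not implemented");
		return 0;
	}

	__get_cpuid(0x8000001f, &eax, &ebx, &ecx, &edx);
	printf("SME feature bit (leaf 0x8000001f, EAX bit 0): %u\n", eax & 1);
	return 0;
}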
TDX module initialization requires using one TDX private KeyID as the
global KeyID to protect the TDX module metadata. The global KeyID is
configured to the TDX module along with TDMRs.

Just reserve the first TDX private KeyID as the global KeyID. Keep the
global KeyID as a static variable as KVM will need to use it too.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static int tdx_cmr_num;
/* All TDX-usable memory regions */
static LIST_HEAD(tdx_memlist);

+/* TDX module global KeyID. Used in TDH.SYS.CONFIG ABI. */
+static u32 tdx_global_keyid;
+
/*
* Detect TDX private KeyIDs to see whether TDX has been enabled by the
* BIOS. Both initializing the TDX module and running TDX guest require
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
    if (ret)
        goto out_free_tdmrs;

+    /*
+     * Reserve the first TDX KeyID as global KeyID to protect
+     * TDX module metadata.
+     */
+    tdx_global_keyid = tdx_keyid_start;
+
    /*
     * Return -EINVAL until all steps of TDX module initialization
     * process are done.
--
2.38.1

On the platforms with the "partial write machine check" erratum, the
kexec() needs to convert all TDX private pages back to normal before
booting to the new kernel. Otherwise, the new kernel may get an
unexpected machine check.

There's no existing infrastructure to track TDX private pages. Keep
TDMRs when module initialization is successful so that they can be used
to find PAMTs.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- "Change to keep" -> "Keep" (Kirill)
- Add Kirill/Rick's tags

v12 -> v13:
- Split "improve error handling" part out as a separate patch.

v11 -> v12 (new patch):
- Defer keeping TDMRs logic to this patch for better review
- Improved error handling logic (Nikolay/Kirill in patch 15)

---
arch/x86/virt/vmx/tdx/tdx.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static DEFINE_MUTEX(tdx_module_lock);
/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
static LIST_HEAD(tdx_memlist);

+static struct tdmr_info_list tdx_tdmr_list;
+
typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);

static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -XXX,XX +XXX,XX @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
static int init_tdx_module(void)
{
    struct tdsysinfo_struct *tdsysinfo;
-    struct tdmr_info_list tdmr_list;
    struct cmr_info *cmr_array;
    int tdsysinfo_size;
    int cmr_array_size;
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
        goto out_put_tdxmem;

    /* Allocate enough space for constructing TDMRs */
-    ret = alloc_tdmr_list(&tdmr_list, tdsysinfo);
+    ret = alloc_tdmr_list(&tdx_tdmr_list, tdsysinfo);
    if (ret)
        goto out_free_tdxmem;

    /* Cover all TDX-usable memory regions in TDMRs */
-    ret = construct_tdmrs(&tdx_memlist, &tdmr_list, tdsysinfo);
+    ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, tdsysinfo);
    if (ret)
        goto out_free_tdmrs;

    /* Pass the TDMRs and the global KeyID to the TDX module */
-    ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
+    ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
    if (ret)
        goto out_free_pamts;

@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
        goto out_reset_pamts;

    /* Initialize TDMRs to complete the TDX module initialization */
-    ret = init_tdmrs(&tdmr_list);
+    ret = init_tdmrs(&tdx_tdmr_list);
out_reset_pamts:
    if (ret) {
        /*
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
         * back to normal. But do the conversion anyway here
         * as suggested by the TDX spec.
         */
-        tdmrs_reset_pamt_all(&tdmr_list);
+        tdmrs_reset_pamt_all(&tdx_tdmr_list);
    }
out_free_pamts:
    if (ret)
-        tdmrs_free_pamt_all(&tdmr_list);
+        tdmrs_free_pamt_all(&tdx_tdmr_list);
    else
        pr_info("%lu KBs allocated for PAMT\n",
-                tdmrs_count_pamt_kb(&tdmr_list));
+                tdmrs_count_pamt_kb(&tdx_tdmr_list));
out_free_tdmrs:
-    /*
-     * Always free the buffer of TDMRs as they are only used during
-     * module initialization.
-     */
-    free_tdmr_list(&tdmr_list);
+    if (ret)
+        free_tdmr_list(&tdx_tdmr_list);
out_free_tdxmem:
    if (ret)
        free_tdx_memlist(&tdx_memlist);
--
2.41.0
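The point of keeping @tdx_tdmr_list alive after successful initialization
is that each TDMR records where its PAMT was allocated, so later code can
test whether a physical address falls inside any PAMT. A minimal sketch of
that lookup, with hypothetical types standing in for the real
tdmr_info_list/tdmr_get_pamt() bookkeeping:

#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's kept TDMR/PAMT bookkeeping. */
struct pamt_region {
	unsigned long base;	/* physical base of the PAMT */
	unsigned long size;	/* size in bytes */
};

struct tdmr_list {
	int nr_consumed_tdmrs;
	struct pamt_region pamts[64];
};

/* Return true if @phys falls inside any kept PAMT region. */
static bool phys_is_pamt(const struct tdmr_list *list, unsigned long phys)
{
	for (int i = 0; i < list->nr_consumed_tdmrs; i++) {
		const struct pamt_region *p = &list->pamts[i];

		if (phys >= p->base && phys < p->base + p->size)
			return true;
	}

	return false;
}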
The first step of initializing the module is to call TDH.SYS.INIT once
on any logical cpu to do the module global initialization.

It also detects the TDX module, as seamcall() returns -ENODEV when the
module is not loaded.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
- Improved changelog.

---
arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++--
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
*/
static int init_tdx_module(void)
{
-    /* The TDX module hasn't been detected */
-    return -ENODEV;
+    int ret;
+
+    /*
+     * Call TDH.SYS.INIT to do the global initialization of
+     * the TDX module. It also detects the module.
+     */
+    ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+    if (ret)
+        goto out;
+
+    /*
+     * Return -EINVAL until all steps of TDX module initialization
+     * process are done.
+     */
+    ret = -EINVAL;
+out:
+    return ret;
}

static void shutdown_tdx_module(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@
/*
* TDX module SEAMCALL leaf functions
*/
+#define TDH_SYS_INIT        33
#define TDH_SYS_LP_SHUTDOWN    44

/*
--
2.38.1

With TDMRs being kept upon successful TDX module initialization, only
put_online_mems() and freeing the buffers of the TDSYSINFO_STRUCT and
the CMR array still need to be done even when module initialization is
successful. On the other hand, the four other "out_*" labels before
them explicitly check the return value and only clean up when module
initialization fails.

This isn't ideal. Make the four other "out_*" labels only reachable
when module initialization fails to improve the readability of error
handling. Rename them from "out_*" to "err_*" to reflect the fact.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- Fix spelling typo (Rick)
- Add Kirill/Rick's tags

v12 -> v13:
- New patch to improve error handling. (Kirill, Nikolay)

---
arch/x86/virt/vmx/tdx/tdx.c | 67 +++++++++++++++++++------------------
1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
    /* Allocate enough space for constructing TDMRs */
    ret = alloc_tdmr_list(&tdx_tdmr_list, tdsysinfo);
    if (ret)
-        goto out_free_tdxmem;
+        goto err_free_tdxmem;

    /* Cover all TDX-usable memory regions in TDMRs */
    ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, tdsysinfo);
    if (ret)
-        goto out_free_tdmrs;
+        goto err_free_tdmrs;

    /* Pass the TDMRs and the global KeyID to the TDX module */
    ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
    if (ret)
-        goto out_free_pamts;
+        goto err_free_pamts;

    /*
     * Hardware doesn't guarantee cache coherency across different
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
    /* Config the key of global KeyID on all packages */
    ret = config_global_keyid();
    if (ret)
-        goto out_reset_pamts;
+        goto err_reset_pamts;

    /* Initialize TDMRs to complete the TDX module initialization */
    ret = init_tdmrs(&tdx_tdmr_list);
-out_reset_pamts:
-    if (ret) {
-        /*
-         * Part of PAMTs may already have been initialized by the
-         * TDX module. Flush cache before returning PAMTs back
-         * to the kernel.
-         */
-        wbinvd_on_all_cpus();
-        /*
-         * According to the TDX hardware spec, if the platform
-         * doesn't have the "partial write machine check"
-         * erratum, any kernel read/write will never cause #MC
-         * in kernel space, thus it's OK to not convert PAMTs
-         * back to normal. But do the conversion anyway here
-         * as suggested by the TDX spec.
-         */
-        tdmrs_reset_pamt_all(&tdx_tdmr_list);
-    }
-out_free_pamts:
    if (ret)
-        tdmrs_free_pamt_all(&tdx_tdmr_list);
-    else
-        pr_info("%lu KBs allocated for PAMT\n",
-                tdmrs_count_pamt_kb(&tdx_tdmr_list));
-out_free_tdmrs:
-    if (ret)
-        free_tdmr_list(&tdx_tdmr_list);
-out_free_tdxmem:
-    if (ret)
-        free_tdx_memlist(&tdx_memlist);
+        goto err_reset_pamts;
+
+    pr_info("%lu KBs allocated for PAMT\n",
+            tdmrs_count_pamt_kb(&tdx_tdmr_list));
+
out_put_tdxmem:
    /*
     * @tdx_memlist is written here and read at memory hotplug time.
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
    kfree(tdsysinfo);
    kfree(cmr_array);
    return ret;
+
+err_reset_pamts:
+    /*
+     * Part of PAMTs may already have been initialized by the
+     * TDX module. Flush cache before returning PAMTs back
+     * to the kernel.
+     */
+    wbinvd_on_all_cpus();
+    /*
+     * According to the TDX hardware spec, if the platform
+     * doesn't have the "partial write machine check"
+     * erratum, any kernel read/write will never cause #MC
+     * in kernel space, thus it's OK to not convert PAMTs
+     * back to normal. But do the conversion anyway here
+     * as suggested by the TDX spec.
+     */
+    tdmrs_reset_pamt_all(&tdx_tdmr_list);
+err_free_pamts:
+    tdmrs_free_pamt_all(&tdx_tdmr_list);
+err_free_tdmrs:
+    free_tdmr_list(&tdx_tdmr_list);
+err_free_tdxmem:
+    free_tdx_memlist(&tdx_memlist);
+    /* Do things irrelevant to module initialization result */
+    goto out_put_tdxmem;
}

static int __tdx_enable(void)
--
2.41.0
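The control flow this patch converges on (the success path falls through
to shared cleanup that always runs, while the "err_*" labels are reachable
only on failure and unwind in reverse order of setup) is a common kernel
idiom. A minimal self-contained sketch of the same shape, with
illustrative names and userspace allocations standing in for the real
resources:

#include <stdio.h>
#include <stdlib.h>

static int setup_everything(void)
{
	int ret = 0;
	void *bufs, *table;

	bufs = malloc(128);
	if (!bufs)
		return -1;

	table = malloc(256);
	if (!table) {
		ret = -1;
		goto err_free_bufs;
	}

	/*
	 * Success: keep the resources (as the TDMRs are kept above) and
	 * run only the cleanup that is needed regardless of the outcome.
	 */
	goto out;

err_free_bufs:
	free(bufs);
out:
	/* Shared tail: runs for both success and failure. */
	printf("setup %s\n", ret ? "failed" : "succeeded");
	return ret;
}

int main(void)
{
	return setup_everything() ? EXIT_FAILURE : EXIT_SUCCESS;
}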
TDX supports shutting down the TDX module at any time during its
lifetime. After the module is shut down, no further TDX module SEAMCALL
leaf functions can be made to the module on any logical cpu.

Shut down the TDX module in case of any error during the initialization
process. It's pointless to leave the TDX module in some middle state.

Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
CPUs and use it to shut down the module. Later logical-cpu scope module
initialization will use it too.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
- No change.

v5 -> v6:
- Removed the seamcall() wrapper to previous patch (Dave).

- v3 -> v5 (no feedback on v4):
- Added a wrapper of __seamcall() to print error code if SEAMCALL fails.
- Made the seamcall_on_each_cpu() void.
- Removed 'seamcall_ret' and 'tdx_module_out' from
'struct seamcall_ctx', as they must be local variables.
- Added the comments to tdx_init() and one paragraph to changelog to
explain the caller should handle VMXON.
- Called out after shut down, no "TDX module" SEAMCALL can be made.

---
arch/x86/virt/vmx/tdx/tdx.c | 43 +++++++++++++++++++++++++++++++++----
arch/x86/virt/vmx/tdx/tdx.h | 5 +++++
2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
#include <linux/mutex.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/apic.h>
@@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void)
    return !!tdx_keyid_num;
}

+/*
+ * Data structure to make SEAMCALL on multiple CPUs concurrently.
+ * @err is set to -EFAULT when SEAMCALL fails on any cpu.
+ */
+struct seamcall_ctx {
+    u64 fn;
+    u64 rcx;
+    u64 rdx;
+    u64 r8;
+    u64 r9;
+    atomic_t err;
+};
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
* leaf function return code and the additional output respectively if
* not NULL.
*/
-static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-                 u64 *seamcall_ret,
-                 struct tdx_module_output *out)
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+         u64 *seamcall_ret, struct tdx_module_output *out)
{
    u64 sret;

@@ -XXX,XX +XXX,XX @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
    }
}

+static void seamcall_smp_call_function(void *data)
+{
+    struct seamcall_ctx *sc = data;
+    int ret;
+
+    ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
+    if (ret)
+        atomic_set(&sc->err, -EFAULT);
+}
+
+/*
+ * Call the SEAMCALL on all online CPUs concurrently. Caller to check
+ * @sc->err to determine whether any SEAMCALL failed on any cpu.
+ */
+static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+    on_each_cpu(seamcall_smp_call_function, sc, true);
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)

static void shutdown_tdx_module(void)
{
-    /* TODO: Shut down the TDX module */
+    struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+    seamcall_on_each_cpu(&sc);
}

static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@
/* MSR to report KeyID partitioning between MKTME and TDX */
#define MSR_IA32_MKTME_KEYID_PARTITIONING    0x00000087

+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN    44
+
/*
* Do not put any hardware-defined TDX structure representations below
* this comment!
--
2.38.1

The first few generations of TDX hardware have an erratum. A partial
write to a TDX private memory cacheline will silently "poison" the
line. Subsequent reads will consume the poison and generate a machine
check. According to the TDX hardware spec, neither of these things
should have happened.

== Background ==

Virtually all kernel memory access operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than a cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

== Problem ==

A fast warm reset doesn't reset TDX private memory. Kexec() can also
boot into the new kernel directly. Thus if the old kernel has enabled
TDX on the platform with this erratum, the new kernel may get an
unexpected machine check.

Note that w/o this erratum any kernel read/write on TDX private memory
should never cause machine check, thus it's OK for the old kernel to
leave TDX private pages as is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal to give the new kernel a clean slate
after kexec(). The BIOS is also expected to disable fast warm reset as
a workaround to this erratum, thus this implementation doesn't try to
reset TDX private memory for the reboot case in the kernel but depends
on the BIOS to enable the workaround.

Convert TDX private pages back to normal after all remote cpus have been
stopped and cache flush has been done on all cpus, when no more TDX
activity can happen further. Do it in machine_kexec() to avoid the
additional overhead to the normal reboot/shutdown as the kernel depends
on the BIOS to disable fast warm reset for the reboot case.

For now TDX private memory can only be PAMT pages. It would be ideal to
cover all types of TDX private memory here, but there are practical
problems to do so:

1) There's no existing infrastructure to track TDX private pages;
2) It's not feasible to query the TDX module about page type because VMX
has already been stopped when KVM receives the reboot notifier, plus
the result from the TDX module may not be accurate (e.g., the remote
CPU could be stopped right before MOVDIR64B).

One temporary solution is to blindly convert all memory pages, but it's
problematic to do so too, because not all pages are mapped as writable
in the direct mapping. It can be done by switching to the identity
mapping created for kexec() or a new page table, but the complexity
looks overkill.

Therefore, rather than doing something dramatic, only reset PAMT pages
here. Other kernel components which use TDX need to do the conversion
on their own by intercepting the rebooting/shutdown notifier (KVM
already does that).

Note kexec() can happen at any time, including when the TDX module is
being initialized. Register a TDX reboot notifier callback to stop
further TDX module initialization. If there's any ongoing module
initialization, wait until it finishes. This makes sure the TDX module
status is stable after the reboot notifier callback, and the later
kexec() code can read the module status to decide whether PAMTs are
stable and available.

Also stop further TDX module initialization in case of machine shutdown
and halt, not just kexec(), as there's no reason to allow it in those
cases either.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- Skip resetting TDX private memory when preserve_context is true (Rick)
- Use reboot notifier to stop TDX module initialization at early time of
kexec() to make module status stable, to avoid using a new variable
and memory barrier (which is tricky to review).
- Added Kirill's tag

v12 -> v13:
- Improve comments to explain why barrier is needed and ignore WBINVD.
(Dave)
- Improve comments to document memory ordering. (Nikolay)
- Made comments/changelog slightly more concise.

v11 -> v12:
- Changed comment/changelog to say kernel doesn't try to handle fast
warm reset but depends on BIOS to enable workaround (Kirill)
- Added a new tdx_may_has_private_mem to indicate system may have TDX
private memory and PAMTs/TDMRs are stable to access. (Dave).
- Use atomic_t for tdx_may_has_private_mem for built-in memory barrier
(Dave)
- Changed calling x86_platform.memory_shutdown() to calling
tdx_reset_memory() directly from machine_kexec() to avoid overhead to
the normal reboot case.

v10 -> v11:
- New patch

---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/kernel/machine_kexec_64.c | 16 ++++++
arch/x86/virt/vmx/tdx/tdx.c | 92 ++++++++++++++++++++++++++++++
3 files changed, 110 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
bool platform_tdx_enabled(void);
int tdx_cpu_enable(void);
int tdx_enable(void);
+void tdx_reset_memory(void);
#else
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
+static inline void tdx_reset_memory(void) { }
#endif    /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -XXX,XX +XXX,XX @@
#include <asm/setup.h>
#include <asm/set_memory.h>
#include <asm/cpu.h>
+#include <asm/tdx.h>

#ifdef CONFIG_ACPI
/*
@@ -XXX,XX +XXX,XX @@ void machine_kexec(struct kimage *image)
    void *control_page;
    int save_ftrace_enabled;

+    /*
+     * For platforms with the TDX "partial write machine check" erratum,
+     * all TDX private pages need to be converted back to normal
+     * before booting to the new kernel, otherwise the new kernel
+     * may get an unexpected machine check.
+     *
+     * But skip this when preserve_context is on. The second kernel
+     * shouldn't write to the first kernel's memory anyway. Skipping
+     * this also avoids killing TDX in the first kernel, which would
+     * require more complicated handling.
+     */
#ifdef CONFIG_KEXEC_JUMP
    if (image->preserve_context)
        save_processor_state();
+    else
+        tdx_reset_memory();
+#else
+    tdx_reset_memory();
#endif

    save_ftrace_enabled = __ftrace_enabled_save();
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
#include <linux/align.h>
#include <linux/sort.h>
#include <linux/log2.h>
+#include <linux/reboot.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/page.h>
@@ -XXX,XX +XXX,XX @@ static LIST_HEAD(tdx_memlist);

static struct tdmr_info_list tdx_tdmr_list;

+static bool tdx_rebooting;
+
typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);

static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -XXX,XX +XXX,XX @@ static int __tdx_enable(void)
{
    int ret;

+    if (tdx_rebooting)
+        return -EAGAIN;
+
    ret = init_tdx_module();
    if (ret) {
        pr_err("module initialization failed (%d)\n", ret);
@@ -XXX,XX +XXX,XX @@ int tdx_enable(void)
}
EXPORT_SYMBOL_GPL(tdx_enable);

+/*
+ * Convert TDX private pages back to normal on platforms with
+ * "partial write machine check" erratum.
+ *
+ * Called from machine_kexec() before booting to the new kernel.
+ */
+void tdx_reset_memory(void)
+{
+    if (!platform_tdx_enabled())
+        return;
+
+    /*
+     * Kernel read/write to TDX private memory doesn't
+     * cause machine check on hardware w/o this erratum.
+     */
+    if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+        return;
+
+    /* Called from kexec() when only the rebooting cpu is alive */
+    WARN_ON_ONCE(num_online_cpus() != 1);
+
+    /*
+     * tdx_reboot_notifier() waits for any ongoing TDX module
+     * initialization to finish, and module initialization is
+     * rejected after that. Therefore @tdx_module_status is
+     * stable here and can be read w/o holding the lock.
+     */
+    if (tdx_module_status != TDX_MODULE_INITIALIZED)
+        return;
+
+    /*
+     * Convert PAMTs back to normal. All other cpus are already
+     * dead and TDMRs/PAMTs are stable.
+     *
+     * Ideally it's better to cover all types of TDX private pages
+     * here, but it's impractical:
+     *
+     * - There's no existing infrastructure to tell whether a page
+     * is TDX private memory or not.
+     *
+     * - Using SEAMCALL to query the TDX module isn't feasible either:
+     * - VMX has been turned off by reaching here so SEAMCALL
+     * cannot be made;
+     * - Even if a SEAMCALL could be made, the result from the TDX
+     * module may not be accurate (e.g., a remote CPU can be stopped
+     * while the kernel is in the middle of reclaiming a TDX private
+     * page and doing MOVDIR64B).
+     *
+     * One temporary solution could be just converting all memory
+     * pages, but it's problematic too, because not all pages are
+     * mapped as writable in the direct mapping. It can be done by
+     * switching to the identity mapping for kexec() or a new page
+     * table which maps all pages as writable, but the complexity is
+     * overkill.
+     *
+     * Thus instead of doing something dramatic to convert all pages,
+     * only convert PAMTs here. Other kernel components which use
+     * TDX need to do the conversion on their own by intercepting the
+     * rebooting/shutdown notifier (KVM already does that).
+     */
+    tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
+
static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
                     u32 *nr_tdx_keyids)
{
@@ -XXX,XX +XXX,XX @@ static struct notifier_block tdx_memory_nb = {
    .notifier_call = tdx_memory_notifier,
};

+static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
+             void *unused)
+{
+    /* Wait for ongoing TDX module initialization to finish */
+    mutex_lock(&tdx_module_lock);
+    tdx_rebooting = true;
+    mutex_unlock(&tdx_module_lock);
+
+    return NOTIFY_OK;
+}
+
+static struct notifier_block tdx_reboot_nb = {
+    .notifier_call = tdx_reboot_notifier,
+};
+
static int __init tdx_init(void)
{
    u32 tdx_keyid_start, nr_tdx_keyids;
@@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void)
        return -ENODEV;
    }

+    err = register_reboot_notifier(&tdx_reboot_nb);
+    if (err) {
+        pr_err("initialization failed: register_reboot_notifier() failed (%d)\n",
+                err);
+        unregister_memory_notifier(&tdx_memory_nb);
+        return -ENODEV;
+    }
+
    /*
     * Just use the first TDX KeyID as the 'global KeyID' and
     * leave the rest for TDX guests.
--
2.41.0
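To make the "partial write" from the changelog above concrete: a 4-byte
MOVNTI store sends a less-than-cacheline write transaction toward the
memory controller, unlike an ordinary store that gets merged into a
full-line writeback. A standalone userspace sketch (illustrative only;
on ordinary memory this is of course harmless) using the SSE2
non-temporal store intrinsic:

/* Build with: gcc -O2 -msse2 movnti_demo.c */
#include <emmintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int *line;

	/* A cacheline-aligned 64-byte buffer. */
	if (posix_memalign((void **)&line, 64, 64))
		return 1;

	/* Ordinary store: read-modify-write of the whole cacheline. */
	line[0] = 1;

	/*
	 * Non-temporal store (MOVNTI): a 4-byte write transaction that
	 * bypasses the cache and can land at the memory controller as a
	 * partial-cacheline write -- the access pattern that poisons TDX
	 * private memory on the affected parts.
	 */
	_mm_stream_si32(&line[1], 2);
	_mm_sfence();	/* order the non-temporal store */

	printf("%d %d\n", line[0], line[1]);
	free(line);
	return 0;
}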
After the global module initialization, the next step is logical-cpu
scope module initialization. Logical-cpu initialization requires
calling TDH.SYS.LP.INIT on all BIOS-enabled CPUs. This SEAMCALL can run
concurrently on all CPUs.

Use the helper introduced for shutting down the module to do logical-cpu
scope initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/virt/vmx/tdx/tdx.c | 14 ++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 15 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
    on_each_cpu(seamcall_smp_call_function, sc, true);
}

+static int tdx_module_init_cpus(void)
+{
+    struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
+
+    seamcall_on_each_cpu(&sc);
+
+    return atomic_read(&sc.err);
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -XXX,XX +XXX,XX @@ static int init_tdx_module(void)
    if (ret)
        goto out;

+    /* Logical-cpu scope initialization */
+    ret = tdx_module_init_cpus();
+    if (ret)
+        goto out;
+
    /*
     * Return -EINVAL until all steps of TDX module initialization
     * process are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@
* TDX module SEAMCALL leaf functions
*/
#define TDH_SYS_INIT        33
+#define TDH_SYS_LP_INIT        35
#define TDH_SYS_LP_SHUTDOWN    44

/*
--
2.38.1

TDX cannot survive S3 and deeper states. The hardware resets and
disables TDX completely when the platform goes to S3 or deeper. Both
TDX guests and the TDX module get destroyed permanently.

The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to
support hibernation. The kernel also maintains TDX states to track
whether it has been initialized and its metadata resource, etc. After
resuming from S3 or hibernation, these TDX states won't be correct
anymore.

Theoretically, the kernel can do more complicated things like resetting
TDX internal states and TDX module metadata before going to S3 or
deeper, and re-initializing the TDX module after resuming, etc, but
there is no way to save/restore TDX guests for now.

Until TDX supports full save and restore of TDX guests, there is little
value in handling the TDX module alone across suspend and hibernation.
To make things simple, just choose to make TDX mutually exclusive with
S3 and hibernation.

Note the TDX module is initialized at runtime. To avoid having to deal
with the fuss of determining TDX state at runtime, just choose between
TDX and S3/hibernation at early kernel boot. Deciding at runtime would
be a bad user experience anyway, i.e., the user could find
S3/hibernation working at first but becoming unavailable later because
TDX got enabled.

Disable TDX in kernel early boot when hibernation is available, and give
a message telling the user to disable hibernation via the kernel command
line in order to use TDX. Currently there's no mechanism exposed by the
hibernation code to allow other kernel code to disable hibernation once
and for all.

Disable ACPI S3 by setting the acpi_suspend_lowlevel function pointer to
NULL when TDX is enabled by the BIOS. This avoids having to modify the
ACPI code to disable ACPI S3 in other ways.

Also give a message telling the user to disable TDX in the BIOS in order
to use ACPI S3. A new kernel command line option can be added in the
future if there's a need to let the user disable the TDX host via the
kernel command line.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v13 -> v14:
- New patch

---
arch/x86/virt/vmx/tdx/tdx.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@
#include <linux/sort.h>
#include <linux/log2.h>
#include <linux/reboot.h>
+#include <linux/suspend.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/page.h>
#include <asm/special_insns.h>
+#include <asm/acpi.h>
#include <asm/tdx.h>
#include "tdx.h"

@@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void)
        return -ENODEV;
    }

+#define HIBERNATION_MSG        \
+    "Disable TDX due to hibernation is available. Use 'nohibernate' command line to disable hibernation."
+    /*
+     * Note hibernation_available() can vary when it is called at
+     * runtime as it checks secretmem_active() and cxl_mem_active()
+     * which can both vary at runtime. But here at early_init() they
+     * both cannot return true, thus when hibernation_available()
+     * returns false here, hibernation is disabled by either
+     * 'nohibernate' or LOCKDOWN_HIBERNATION security lockdown,
+     * which are both permanent.
+     */
+    if (hibernation_available()) {
+        pr_err("initialization failed: %s\n", HIBERNATION_MSG);
+        return -ENODEV;
+    }
+
    err = register_memory_notifier(&tdx_memory_nb);
    if (err) {
        pr_err("initialization failed: register_memory_notifier() failed (%d)\n",
@@ -XXX,XX +XXX,XX @@ static int __init tdx_init(void)
        return -ENODEV;
    }

+#ifdef CONFIG_ACPI
+    pr_info("Disable ACPI S3 suspend. Turn off TDX in the BIOS to use ACPI S3.\n");
+    acpi_suspend_lowlevel = NULL;
+#endif
+
    /*
     * Just use the first TDX KeyID as the 'global KeyID' and
     * leave the rest for TDX guests.
--
2.41.0
TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction. This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

The TDX module defines a set of SEAMCALL leaf functions to allow the
host to initialize it, and to create and run protected VMs. SEAMCALL
leaf functions use an ABI different from the x86-64 system-v ABI.
Instead, they share the same ABI with the TDCALL leaf functions.

Implement a function __seamcall() to allow the host to make SEAMCALL
to SEAM software using the TDX_MODULE_CALL macro which is the common
assembly for both SEAMCALL and TDCALL.

The SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD
when the CPU is not in VMX operation. The current TDX_MODULE_CALL macro
doesn't handle either of them. There's no way to check whether the CPU
is in VMX operation or not.

Initializing the TDX module is done at runtime on demand, and it depends
on the caller to ensure the CPU is in VMX operation before making a
SEAMCALL. To avoid getting an Oops when the caller mistakenly tries to
initialize the TDX module when the CPU is not in VMX operation, extend
the TDX_MODULE_CALL macro to handle #UD (and also #GP, which can
theoretically still happen when TDX isn't actually enabled by the BIOS,
i.e. due to a BIOS bug).

Introduce two new TDX error codes for #UD and #GP respectively so the
caller can distinguish them. Also, opportunistically put the new TDX
error codes and the existing TDX_SEAMCALL_VMFAILINVALID under the
INTEL_TDX_HOST Kconfig option as they are only used when it is on.

As __seamcall() can potentially return multiple error codes, besides the
actual SEAMCALL leaf function return code, also introduce a wrapper
function seamcall() to convert the __seamcall() error code to a kernel
error code, so the caller doesn't need to duplicate the code that checks
the return value of __seamcall() and returns a kernel error code
accordingly.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
- No change.

v5 -> v6:
- Added code to handle #UD and #GP (Dave).
- Moved the seamcall() wrapper function to this patch, and used a
temporary __always_unused to avoid compile warning (Dave).

- v3 -> v5 (no feedback on v4):
- Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
SEAMCALL itself fails.
- Improve the changelog.

---
arch/x86/include/asm/tdx.h | 9 ++++++
arch/x86/virt/vmx/tdx/Makefile | 2 +-
arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 42 ++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 8 +++++
arch/x86/virt/vmx/tdx/tdxcall.S | 19 ++++++++++--
6 files changed, 129 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@
#include <asm/ptrace.h>
#include <asm/shared/tdx.h>

+#ifdef CONFIG_INTEL_TDX_HOST
+
+#include <asm/trapnr.h>
+
/*
* SW-defined error codes.
*
@@ -XXX,XX +XXX,XX @@
#define TDX_SW_ERROR            (TDX_ERROR | GENMASK_ULL(47, 40))
#define TDX_SEAMCALL_VMFAILINVALID    (TDX_SW_ERROR | _UL(0xFFFF0000))

+#define TDX_SEAMCALL_GP            (TDX_SW_ERROR | X86_TRAP_GP)
+#define TDX_SEAMCALL_UD            (TDX_SW_ERROR | X86_TRAP_UD)
+
+#endif
+
#ifndef __ASSEMBLY__

/*
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -XXX,XX +XXX,XX @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-y += tdx.o
+obj-y += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -XXX,XX +XXX,XX @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ *         (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function. Additional
+ * output operands are saved in @out (if it is provided by the caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ *             stored temporarily in R12 (not
+ *             used by the P-SEAMLDR or the TDX
+ *             module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+    FRAME_BEGIN
+    TDX_MODULE_CALL host=1
+    FRAME_END
+    RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void)
    return !!tdx_keyid_num;
}

+/*
+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
+ * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
+ * leaf function return code and the additional output respectively if
+ * not NULL.
+ */
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+                 u64 *seamcall_ret,
--
2.38.1

The first few generations of TDX hardware have an erratum. Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.

== Background ==

Virtually all kernel memory access operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than a cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine check code will present these
as a literal "Hardware error" when they were, in fact, a
software-triggered issue.

== Solution ==

In the end, this issue is hard to trigger. Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not but just dumps, for instance, the message
below:

[...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
[...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
    ...
[...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] Kernel panic - not syncing: Fatal local machine check

Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, it's better for the log to say "software bug around TDX private
memory" instead of "Hardware Error". But in reality a real hardware
memory error can happen, and sadly such a software-triggered #MC cannot
be distinguished from the real hardware error. Also, the error message
is parsed by the userspace tool 'mcelog', so changing the output may
break userspace.

So keep the "Hardware Error". The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying the above error log, improve it by printing an
additional TDX-related message to make the log look like:

...
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determining whether the memory
page is TDX private memory. There is no existing infrastructure to do
that. Add an interface to query the TDX module to fill this gap.

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC. A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory. It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map. It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data. That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---

v13 -> v14:
- No change

v12 -> v13:
- Added Kirill and Yuan's tag.

v11 -> v12:
- Simplified #MC message (Dave/Kirill)
- Slightly improved some comments.

v10 -> v11:
- New patch

---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/kernel/cpu/mce/core.c | 33 +++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 5 ++
4 files changed, 143 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -XXX,XX +XXX,XX @@ bool platform_tdx_enabled(void);
int tdx_cpu_enable(void);
int tdx_enable(void);
void tdx_reset_memory(void);
+bool tdx_is_private_mem(unsigned long phys);
#else
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
static inline void tdx_reset_memory(void) { }
+static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
#endif    /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -XXX,XX +XXX,XX @@
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/reboot.h>
+#include <asm/tdx.h>

#include "internal.h"

@@ -XXX,XX +XXX,XX @@ static void wait_for_panic(void)
    panic("Panicing machine check CPU died");
}

+static const char *mce_memory_info(struct mce *m)
+{
+    if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
+        return NULL;
+
+    /*
+     * Certain initial generations of TDX-capable CPUs have an
+     * erratum. A kernel non-temporal partial write to TDX private
+     * memory poisons that memory, and a subsequent read of that
+     * memory triggers #MC.
+     *
+     * However such #MC caused by software cannot be distinguished
+     * from a real hardware #MC. Just print an additional message
+     * to show that such a #MC may be the result of the CPU erratum.
+     */
+    if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+        return NULL;
+
+    return !tdx_is_private_mem(m->addr) ? NULL :
+        "TDX private memory error. Possible kernel bug.";
+}
+
static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
{
    struct llist_node *pending;
    struct mce_evt_llist *l;
    int apei_err = 0;
+    const char *memmsg;

    /*
     * Allow instrumentation around external facilities usage. Not that it
@@ -XXX,XX +XXX,XX @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
    }
    if (exp)
        pr_emerg(HW_ERR "Machine check: %s\n", exp);
+    /*
+     * Confidential computing platforms such as TDX platforms
+     * may raise an MCE due to incorrect access to confidential
+     * memory. Print additional information for such errors.
+     */
+    memmsg = mce_memory_info(final);
+    if (memmsg)
+        pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
    if (!fake_panic) {
        if (panic_timeout == 0)
            panic_timeout = mca_cfg.panic_timeout;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ void tdx_reset_memory(void)
    tdmrs_reset_pamt_all(&tdx_tdmr_list);
}

+static bool is_pamt_page(unsigned long phys)
+{
+    struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
+    int i;
+
+    /*
+     * This function is called from the #MC handler, and theoretically
+     * it could run in parallel with the TDX module initialization
+     * on other logical cpus. But it's not OK to hold the mutex here,
+     * so just blindly check the module status to make sure PAMTs/TDMRs
+     * are stable to access.
+     *
+     * This may return an inaccurate result in rare cases, e.g., when
+     * #MC happens on a PAMT page during module initialization, but
+     * this is fine as the #MC handler doesn't need a 100% accurate
+     * result.
+     */
+    if (tdx_module_status != TDX_MODULE_INITIALIZED)
+        return false;
+
+    for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+        unsigned long base, size;
+
+        tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+
+        if (phys >= base && phys < (base + size))
+            return true;
+    }
+
+    return false;
+}
+
+/*
+ * Return whether the memory page at the given physical address is TDX
+ * private memory or not. Called from #MC handler do_machine_check().
+ *
+ * Note this function may not return an accurate result in rare cases.
+ * This is fine as the #MC handler doesn't need a 100% accurate result,
+ * because it cannot distinguish #MC between software bug and real
+ * hardware error anyway.
+ */
+bool tdx_is_private_mem(unsigned long phys)
176
+                 struct tdx_module_output *out)
177
+{
250
+{
251
+    struct tdx_module_args args = {
252
+        .rcx = phys & PAGE_MASK,
253
+    };
178
+    u64 sret;
254
+    u64 sret;
179
+
255
+
180
+    sret = __seamcall(fn, rcx, rdx, r8, r9, out);
256
+    if (!platform_tdx_enabled())
181
+
257
+        return false;
182
+    /* Save SEAMCALL return code if caller wants it */
258
+
183
+    if (seamcall_ret)
259
+    /* Get page type from the TDX module */
184
+        *seamcall_ret = sret;
260
+    sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
185
+
261
+    /*
186
+    /* SEAMCALL was successful */
262
+     * Handle the case that CPU isn't in VMX operation.
187
+    if (!sret)
263
+     *
188
+        return 0;
264
+     * KVM guarantees no VM is running (thus no TDX guest)
189
+
265
+     * when there's any online CPU isn't in VMX operation.
190
+    switch (sret) {
266
+     * This means there will be no TDX guest private memory
191
+    case TDX_SEAMCALL_GP:
267
+     * and Secure-EPT pages. However the TDX module may have
192
+        /*
268
+     * been initialized and the memory page could be PAMT.
193
+         * platform_tdx_enabled() is checked to be true
269
+     */
194
+         * before making any SEAMCALL.
270
+    if (sret == TDX_SEAMCALL_UD)
195
+         */
271
+        return is_pamt_page(phys);
196
+        WARN_ON_ONCE(1);
272
+
197
+        fallthrough;
273
+    /*
198
+    case TDX_SEAMCALL_VMFAILINVALID:
274
+     * Any other failure means:
199
+        /* Return -ENODEV if the TDX module is not loaded. */
275
+     *
200
+        return -ENODEV;
276
+     * 1) TDX module not loaded; or
201
+    case TDX_SEAMCALL_UD:
277
+     * 2) Memory page isn't managed by the TDX module.
202
+        /* Return -EINVAL if CPU isn't in VMX operation. */
278
+     *
203
+        return -EINVAL;
279
+     * In either case, the memory page cannot be a TDX
280
+     * private page.
281
+     */
282
+    if (sret)
283
+        return false;
284
+
285
+    /*
286
+     * SEAMCALL was successful -- read page type (via RCX):
287
+     *
288
+     * - PT_NDA:    Page is not used by the TDX module
289
+     * - PT_RSVD:    Reserved for Non-TDX use
290
+     * - Others:    Page is used by the TDX module
291
+     *
292
+     * Note PAMT pages are marked as PT_RSVD but they are also TDX
293
+     * private memory.
294
+     *
295
+     * Note: Even page type is PT_NDA, the memory page could still
296
+     * be associated with TDX private KeyID if the kernel hasn't
297
+     * explicitly used MOVDIR64B to clear the page. Assume KVM
298
+     * always does that after reclaiming any private page from TDX
299
+     * gusets.
300
+     */
301
+    switch (args.rcx) {
302
+    case PT_NDA:
303
+        return false;
304
+    case PT_RSVD:
305
+        return is_pamt_page(phys);
204
+    default:
306
+    default:
205
+        /* Return -EIO if the actual SEAMCALL leaf failed. */
307
+        return true;
206
+        return -EIO;
207
+    }
308
+    }
208
+}
309
+}
209
+
310
+
210
/*
311
static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
211
* Detect and initialize the TDX module.
312
                     u32 *nr_tdx_keyids)
212
*
313
{
213
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
314
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
214
index XXXXXXX..XXXXXXX 100644
315
index XXXXXXX..XXXXXXX 100644
215
--- a/arch/x86/virt/vmx/tdx/tdx.h
316
--- a/arch/x86/virt/vmx/tdx/tdx.h
216
+++ b/arch/x86/virt/vmx/tdx/tdx.h
317
+++ b/arch/x86/virt/vmx/tdx/tdx.h
217
@@ -XXX,XX +XXX,XX @@
318
@@ -XXX,XX +XXX,XX @@
218
/* MSR to report KeyID partitioning between MKTME and TDX */
319
/*
219
#define MSR_IA32_MKTME_KEYID_PARTITIONING    0x00000087
320
* TDX module SEAMCALL leaf functions
220
321
*/
221
+/*
322
+#define TDH_PHYMEM_PAGE_RDMD    24
222
+ * Do not put any hardware-defined TDX structure representations below
323
#define TDH_SYS_KEY_CONFIG    31
223
+ * this comment!
324
#define TDH_SYS_INFO        32
224
+ */
325
#define TDH_SYS_INIT        33
225
+
226
+struct tdx_module_output;
227
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
228
+     struct tdx_module_output *out);
229
#endif
230
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
231
index XXXXXXX..XXXXXXX 100644
232
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
233
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
234
@@ -XXX,XX +XXX,XX @@
326
@@ -XXX,XX +XXX,XX @@
235
/* SPDX-License-Identifier: GPL-2.0 */
327
#define TDH_SYS_TDMR_INIT    36
236
#include <asm/asm-offsets.h>
328
#define TDH_SYS_CONFIG        45
237
#include <asm/tdx.h>
329
238
+#include <asm/asm.h>
330
+/* TDX page types */
239
331
+#define    PT_NDA        0x0
240
/*
332
+#define    PT_RSVD        0x1
241
* TDCALL and SEAMCALL are supported in Binutils >= 2.36.
333
+
242
@@ -XXX,XX +XXX,XX @@
334
struct cmr_info {
243
    /* Leave input param 2 in RDX */
335
    u64    base;
244
336
    u64    size;
245
    .if \host
246
+1:
247
    seamcall
248
    /*
249
     * SEAMCALL instruction is essentially a VMExit from VMX root
250
@@ -XXX,XX +XXX,XX @@
251
     * This value will never be used as actual SEAMCALL error code as
252
     * it is from the Reserved status code class.
253
     */
254
-    jnc .Lno_vmfailinvalid
255
+    jnc .Lseamcall_out
256
    mov $TDX_SEAMCALL_VMFAILINVALID, %rax
257
-.Lno_vmfailinvalid:
258
+    jmp .Lseamcall_out
259
+2:
260
+    /*
261
+     * SEAMCALL caused #GP or #UD. By reaching here %eax contains
262
+     * the trap number. Convert the trap number to the TDX error
263
+     * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
264
+     *
265
+     * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
266
+     * only accepts 32-bit immediate at most.
267
+     */
268
+    mov $TDX_SW_ERROR, %r12
269
+    orq %r12, %rax
270
271
+    _ASM_EXTABLE_FAULT(1b, 2b)
272
+.Lseamcall_out:
273
    .else
274
    tdcall
275
    .endif
276
--
337
--
277
2.38.1
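
(For illustration only: a minimal sketch of how a later patch in the
series might consume the seamcall() wrapper above.  The TDH_SYS_INFO
usage and the TDSYSINFO_STRUCT_SIZE/MAX_CMRS names are assumptions
based on later patches, not part of this patch.)

	/* Hypothetical caller sketch -- not part of this patch. */
	static int example_tdx_get_sysinfo(u64 sysinfo_pa, u64 cmr_array_pa)
	{
		struct tdx_module_output out;
		int ret;

		/* Leaf-specific inputs go in rcx/rdx/r8/r9, outputs come back in @out */
		ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
			       cmr_array_pa, MAX_CMRS, NULL, &out);
		if (ret)
			return ret;	/* -ENODEV, -EINVAL or -EIO, as mapped above */

		/* e.g. out.r9 is assumed to hold the number of CMRs returned */
		return 0;
	}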
...
+static const char *mce_memory_info(struct mce *m)
+{
+	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
+		return NULL;
+
+	/*
+	 * Certain initial generations of TDX-capable CPUs have an
+	 * erratum.  A kernel non-temporal partial write to TDX private
+	 * memory poisons that memory, and a subsequent read of that
+	 * memory triggers #MC.
+	 *
+	 * However such #MC caused by software cannot be distinguished
+	 * from the real hardware #MC.  Just print an additional message
+	 * to show that such a #MC may be the result of the CPU erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return NULL;
+
+	return !tdx_is_private_mem(m->addr) ? NULL :
+		"TDX private memory error. Possible kernel bug.";
+}
+
static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
{
	struct llist_node *pending;
	struct mce_evt_llist *l;
	int apei_err = 0;
+	const char *memmsg;

	/*
	 * Allow instrumentation around external facilities usage. Not that it
@@ -XXX,XX +XXX,XX @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
	}
	if (exp)
		pr_emerg(HW_ERR "Machine check: %s\n", exp);
+	/*
+	 * Confidential computing platforms such as TDX platforms
+	 * may encounter an MCE due to incorrect access to confidential
+	 * memory.  Print additional information for such an error.
+	 */
+	memmsg = mce_memory_info(final);
+	if (memmsg)
+		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
	if (!fake_panic) {
		if (panic_timeout == 0)
			panic_timeout = mca_cfg.panic_timeout;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -XXX,XX +XXX,XX @@ void tdx_reset_memory(void)
	tdmrs_reset_pamt_all(&tdx_tdmr_list);
}

+static bool is_pamt_page(unsigned long phys)
+{
+	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
+	int i;
+
+	/*
+	 * This function is called from #MC handler, and theoretically
+	 * it could run in parallel with the TDX module initialization
+	 * on other logical cpus.  But it's not OK to hold mutex here
+	 * so just blindly check module status to make sure PAMTs/TDMRs
+	 * are stable to access.
+	 *
+	 * This may return inaccurate result in rare cases, e.g., when
+	 * #MC happens on a PAMT page during module initialization, but
+	 * this is fine as #MC handler doesn't need a 100% accurate
+	 * result.
+	 */
+	if (tdx_module_status != TDX_MODULE_INITIALIZED)
+		return false;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		unsigned long base, size;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+
+		if (phys >= base && phys < (base + size))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Return whether the memory page at the given physical address is TDX
+ * private memory or not.  Called from #MC handler do_machine_check().
+ *
+ * Note this function may not return an accurate result in rare cases.
+ * This is fine as the #MC handler doesn't need a 100% accurate result,
+ * because it cannot distinguish #MC between software bug and real
+ * hardware error anyway.
+ */
+bool tdx_is_private_mem(unsigned long phys)
+{
+	struct tdx_module_args args = {
+		.rcx = phys & PAGE_MASK,
+	};
+	u64 sret;
+
+	if (!platform_tdx_enabled())
+		return false;
+
+	/* Get page type from the TDX module */
+	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+
+	/*
+	 * Handle the case that CPU isn't in VMX operation.
+	 *
+	 * KVM guarantees no VM is running (thus no TDX guest) when any
+	 * online CPU isn't in VMX operation.  This means there will be
+	 * no TDX guest private memory and Secure-EPT pages.  However
+	 * the TDX module may have been initialized and the memory page
+	 * could be PAMT.
+	 */
+	if (sret == TDX_SEAMCALL_UD)
+		return is_pamt_page(phys);
+
+	/*
+	 * Any other failure means:
+	 *
+	 * 1) TDX module not loaded; or
+	 * 2) Memory page isn't managed by the TDX module.
+	 *
+	 * In either case, the memory page cannot be a TDX
+	 * private page.
+	 */
+	if (sret)
+		return false;
+
+	/*
+	 * SEAMCALL was successful -- read page type (via RCX):
+	 *
+	 *  - PT_NDA:	Page is not used by the TDX module
+	 *  - PT_RSVD:	Reserved for Non-TDX use
+	 *  - Others:	Page is used by the TDX module
+	 *
+	 * Note PAMT pages are marked as PT_RSVD but they are also TDX
+	 * private memory.
+	 *
+	 * Note: Even if the page type is PT_NDA, the memory page could
+	 * still be associated with the TDX private KeyID if the kernel
+	 * hasn't explicitly used MOVDIR64B to clear the page.  Assume
+	 * KVM always does that after reclaiming any private page from
+	 * TDX guests.
+	 */
+	switch (args.rcx) {
+	case PT_NDA:
+		return false;
+	case PT_RSVD:
+		return is_pamt_page(phys);
+	default:
+		return true;
+	}
+}
+
static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
					    u32 *nr_tdx_keyids)
{
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -XXX,XX +XXX,XX @@
/*
 * TDX module SEAMCALL leaf functions
 */
+#define TDH_PHYMEM_PAGE_RDMD	24
#define TDH_SYS_KEY_CONFIG	31
#define TDH_SYS_INFO		32
#define TDH_SYS_INIT		33
@@ -XXX,XX +XXX,XX @@
#define TDH_SYS_TDMR_INIT	36
#define TDH_SYS_CONFIG		45

+/* TDX page types */
+#define	PT_NDA		0x0
+#define	PT_RSVD		0x1
+
struct cmr_info {
	u64	base;
	u64	size;
--
2.41.0
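
(To make the erratum handling above concrete: an illustrative,
deliberately-buggy sketch of the kind of access mce_memory_info() helps
diagnose.  'private_va' as a direct-map address of a TDX private page
is hypothetical; never do this in real code.)

	static void buggy_partial_write(void *private_va)
	{
		unsigned long zero = 0;

		/* MOVNTI does a non-temporal (partial) write, silently poisoning the line */
		asm volatile("movnti %1, %0"
			     : "=m" (*(unsigned long *)private_va)
			     : "r" (zero));

		/* A subsequent read consumes the poison and raises #MC */
		READ_ONCE(*(unsigned long *)private_va);
	}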
...
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:

---
 Documentation/x86/tdx.rst | 181 +++++++++++++++++++++++++++++++++++---
 1 file changed, 170 insertions(+), 11 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index XXXXXXX..XXXXXXX 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -XXX,XX +XXX,XX @@ encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

+TDX Host Kernel Support
...
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot.  The dmesg below shows TDX being enabled by the BIOS::
+
+ [..] tdx: TDX enabled by BIOS. TDX private KeyID range: [16, 64).
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect the TDX module.  The kernel detects
+it by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly ~1/256th of system RAM as
+'metadata' for the TDX memory.  It also takes additional CPU time to
+initialize the metadata along with the TDX module itself.  Neither is
+trivial.  The kernel initializes the TDX module at runtime on demand.
+The caller needs to call tdx_enable() to initialize the TDX module::
+
+ ret = tdx_enable();
+ if (ret)
+     goto no_tdx;
+ // TDX is ready to use
+
+Initializing the TDX module requires all logical CPUs to be online.
+tdx_enable() internally temporarily disables CPU hotplug to prevent any
+CPU from going offline, but the caller still needs to guarantee all
+present CPUs are online before calling tdx_enable().
+
+Also, tdx_enable() requires all CPUs to already be in VMX operation (a
+requirement of making SEAMCALL).  Currently, tdx_enable() doesn't handle
+VMXON internally, but depends on the caller to guarantee that.  So far
+KVM is the only user of TDX and KVM already handles VMXON.
+
+The user can consult dmesg to see the presence of the TDX module, and
+whether it has been initialized.
+
+If the TDX module is not loaded, dmesg shows the following::
+
+ [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below::
+
+ [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+ [..] tdx: 65667 pages allocated for PAMT.
+ [..] tdx: TDX module initialized.
+
+If the TDX module failed to initialize, dmesg shows the following::
+
+ [..] tdx: Failed to initialize TDX module. Shut it down.
+
+TDX Interaction with Other Kernel Components
+--------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all
+memory regions that can possibly be used by the TDX module, but they are
+not automatically usable by the TDX module.  As a step of initializing
+the TDX module, the kernel needs to choose a list of memory regions (out
+of the convertible memory regions) that the TDX module can use, and pass
+those regions to the TDX module.  Once this is done, those "TDX-usable"
+memory regions are fixed during the module's lifetime.  No more TDX-usable
+memory can be added to the TDX module after that.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory.  Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and in the meantime, refuses to add any non-TDX memory in
+the memory hotplug.
+
+This can be enhanced in the future, i.e. by allowing adding non-TDX
+memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
+and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
+needs to guarantee memory pages for TDX guests are always allocated from
+the "TDX-capable" nodes.
+
+Note TDX assumes convertible memory is always physically present during
+the machine's runtime.  A non-buggy BIOS should never support hot-removal
+of any convertible memory.  This implementation doesn't handle ACPI
+memory removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
+physical CPUs.  Currently the kernel doesn't handle physical CPU hotplug,
...
+Kexec()
+~~~~~~~
+
+There are two problems with using kexec() to boot to a new kernel when
+the old kernel has enabled TDX: 1) Part of the memory pages are still
+TDX private pages (i.e. metadata used by the TDX module, and any TDX
+guest memory if kexec() is executed when there are live TDX guests);
+2) There might be dirty cachelines associated with TDX private pages.
+
+Because the hardware doesn't guarantee cache coherency among different
+KeyIDs, the old kernel needs to flush the cache (of TDX private pages)
+before booting to the new kernel.  Also, the kernel doesn't convert all
+TDX private pages back to normal because of the following considerations:
+
+1) The kernel doesn't have existing infrastructure to track which pages
+ are TDX private pages.
+2) The number of TDX private pages can be large, and converting all of
+ them (cache flush + using MOVDIR64B to clear the page) can be time
+ consuming.
+3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
+ 0 doesn't support integrity-check, so it's OK.
+4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
+ kernel ever supports MKTME, it should do MOVDIR64B to clear the page
+ with the new MKTME KeyID (just like TDX does) before using it.
+
+The current TDX module architecture doesn't play nicely with kexec().
+The TDX module can only be initialized once during its lifetime, and
+there is no SEAMCALL to reset the module to give a new clean slate to
+the new kernel.  Therefore, ideally, if the module is ever initialized,
+it's better to shut down the module.  The new kernel won't be able to
+use TDX anyway (as it needs to go through the TDX module initialization
+process, which will fail immediately at the first step).
+
+However, there's no guarantee the CPU is in VMX operation during kexec(),
+so it's impractical to shut down the module.  Currently, the kernel just
+leaves the module in open state.
+
+TDX Guest Support
+=================
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
...
+-------------------------

All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
--
2.38.1
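
(A rough sketch of what "flush cache before booting to the new kernel"
above could translate to, modeled on the existing AMD SME handling on
the reboot/kexec path.  The hook placement is an assumption; only
platform_tdx_enabled() comes from this series.)

	/* Assumed hook on the stop-this-cpu path during kexec/reboot. */
	static void flush_cache_for_kexec(void)
	{
		/*
		 * Dirty cachelines of TDX private pages were written with
		 * non-zero KeyIDs, and hardware doesn't guarantee cache
		 * coherency across KeyIDs.  Flush before the new kernel
		 * reuses the memory with KeyID 0.
		 */
		if (platform_tdx_enabled())
			native_wbinvd();
	}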
...
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- Added new sections for "Erratum" and "TDX vs S3/hibernation"
- Changed "TDX Memory Policy" and "Kexec()" sections.
---
 Documentation/arch/x86/tdx.rst | 217 +++++++++++++++++++++++++++++++--
 1 file changed, 206 insertions(+), 11 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index XXXXXXX..XXXXXXX 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -XXX,XX +XXX,XX @@ encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

+TDX Host Kernel Support
...
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot.  The dmesg below shows TDX being enabled by the BIOS::
+
+ [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
+
+TDX module initialization
+---------------------------------------
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+If the TDX module isn't loaded, the SEAMCALL instruction fails with a
+special error.  In this case the kernel fails the module initialization
+and reports the module isn't loaded::
+
+ [..] virt/tdx: module not loaded
+
+Initializing the TDX module consumes roughly ~1/256th of system RAM as
+'metadata' for the TDX memory.  It also takes additional CPU time to
+initialize the metadata along with the TDX module itself.  Neither is
+trivial.  The kernel initializes the TDX module at runtime on demand.
+
+Besides initializing the TDX module, a per-cpu initialization SEAMCALL
+must be done on one cpu before any other SEAMCALLs can be made on that
+cpu.
+
+The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
+allow the user of TDX to enable the TDX module and enable TDX on the
+local cpu.
+
+Making a SEAMCALL requires the CPU to already be in VMX operation (VMXON
+has been done).  For now both tdx_enable() and tdx_cpu_enable() don't
+handle VMXON internally, but depend on the caller to guarantee that.
+
+To enable TDX, the caller of TDX should: 1) hold the read lock of the
+CPU hotplug lock; 2) do VMXON and tdx_cpu_enable() on all online cpus
+successfully; 3) call tdx_enable().  For example::
+
+ cpus_read_lock();
+ on_each_cpu(vmxon_and_tdx_cpu_enable());
+ ret = tdx_enable();
+ cpus_read_unlock();
+ if (ret)
+     goto no_tdx;
+ // TDX is ready to use
+
+The caller of TDX must also guarantee tdx_cpu_enable() has been
+successfully done on a cpu before it wants to run any other SEAMCALL
+on that cpu.  A typical usage is to do both VMXON and tdx_cpu_enable()
+in the CPU hotplug online callback, and refuse to online the cpu if
+tdx_cpu_enable() fails.
+
+The user can consult dmesg to see whether the TDX module has been
+initialized.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below::
+
+ [..] virt/tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+ [..] virt/tdx: 262668 KBs allocated for PAMT
+ [..] virt/tdx: module initialized
+
+If the TDX module failed to initialize, dmesg also shows it failed to
+initialize::
+
+ [..] virt/tdx: module initialization failed ...
+
+TDX Interaction with Other Kernel Components
+--------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
+kernel which memory is TDX compatible.  The kernel needs to build a list
+of memory regions (out of CMRs) as "TDX-usable" memory and pass those
+regions to the TDX module.  Once this is done, those "TDX-usable" memory
+regions are fixed during the module's lifetime.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory.  Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and in the meantime, refuses to online any non-TDX memory
+in the memory hotplug.
+
+Physical Memory Hotplug
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Note TDX assumes convertible memory is always physically present during
+the machine's runtime.  A non-buggy BIOS should never support hot-removal
+of any convertible memory.  This implementation doesn't handle ACPI
+memory removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+The TDX module requires that the per-cpu initialization SEAMCALL
+(TDH.SYS.LP.INIT) be done on one cpu before any other SEAMCALLs can be
+made on that cpu, including those involved during the module
+initialization.
+
+The kernel provides tdx_cpu_enable() to let the user of TDX do it when
+it wants to use a new cpu for a TDX task.
+
+TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
+physical CPUs.  Currently the kernel doesn't handle physical CPU hotplug,
...
+Kexec()
+~~~~~~~
+
+There are two problems with using kexec() to boot to a new kernel when
+the old kernel has enabled TDX: 1) Part of the memory pages are still
+TDX private pages; 2) There might be dirty cachelines associated with
+TDX private pages.
+
+The first problem doesn't matter.  KeyID 0 doesn't have integrity check.
+Even if the new kernel wants to use any non-zero KeyID, it needs to
+convert the memory to that KeyID, and such conversion would work from
+any KeyID.
+
+However the old kernel needs to guarantee there's no dirty cacheline
+left behind before booting to the new kernel to avoid silent corruption
+from later cacheline writeback (Intel hardware doesn't guarantee cache
+coherency across different KeyIDs).
+
+Similar to AMD SME, the kernel just uses wbinvd() to flush the cache
+before booting to the new kernel.
+
+Erratum
+~~~~~~~
+
+The first few generations of TDX hardware have an erratum.  A partial
+write to a TDX private memory cacheline will silently "poison" the
+line.  Subsequent reads will consume the poison and generate a machine
+check.
+
+A partial write is a memory write where a write transaction of less than
+a cacheline lands at the memory controller.  The CPU does these via
+non-temporal write instructions (like MOVNTI), or through UC/WC memory
+mappings.  Devices can also do partial writes via DMA.
+
+Theoretically, a kernel bug could do a partial write to TDX private
+memory and trigger an unexpected machine check.  What's more, the machine
+check code will present these as "Hardware error" when they were, in
+fact, a software-triggered issue.  But in the end, this issue is hard
+to trigger.
+
+If the platform has this erratum, the kernel does additional things:
+1) resetting TDX private pages using MOVDIR64B in kexec() before booting
+to the new kernel; 2) printing an additional message in the machine check
+handler to tell the user that the machine check may be caused by a kernel
+bug on TDX private memory.
+
+Interaction vs S3 and deeper states
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TDX cannot survive S3 and deeper states.  The hardware resets and
+disables TDX completely when the platform goes to S3 and deeper.  Both
+TDX guests and the TDX module get destroyed permanently.
+
+The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
+hibernation.  Currently, for simplicity, the kernel chooses to make TDX
+mutually exclusive with S3 and hibernation.
+
+In most cases, the user needs to add 'nohibernation' to the kernel
+command line in order to use TDX.  S3 is disabled during kernel early
+boot if TDX is detected.  The user needs to turn off TDX in the BIOS in
+order to use S3.
+
+TDX Guest Support
+=================
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
...
+-------------------------

All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
--
2.41.0
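
(The "typical usage" described in the new CPU Hotplug text could look
roughly like the sketch below.  vmxon_on_this_cpu()/vmxoff_on_this_cpu()
are hypothetical caller-side helpers; only tdx_cpu_enable() is provided
by this series.)

	static int tdx_user_cpu_online(unsigned int cpu)
	{
		int ret;

		ret = vmxon_on_this_cpu();	/* hypothetical: enter VMX operation */
		if (ret)
			return ret;

		ret = tdx_cpu_enable();		/* per-cpu TDH.SYS.LP.INIT */
		if (ret)
			vmxoff_on_this_cpu();	/* hypothetical */

		return ret;	/* non-zero refuses to online this cpu */
	}

	/* e.g. during the TDX user's init: */
	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/tdx_user:online",
				tdx_user_cpu_online, NULL);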
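
(A quick sanity check on the "~1/256th of system RAM" metadata cost
quoted in both versions of the document: the newer example dmesg reports
262668 KBs of PAMT, which matches a machine with roughly 64GB of
TDX-usable memory, since 64GB / 256 = 256MB = 262144 KBs, with the small
remainder being per-TDMR bookkeeping.  The older posting printed the
same quantity as "65667 pages", i.e. 65667 * 4KB = 262668 KBs.)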