UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.
Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.
The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.
Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 would have to be modified on every page acceptance.
That leads to table fragmentation, and there's a limited number of
entries in the e820 table.
Another option is to mark such memory as usable in e820 and track whether
the range has been accepted in a bitmap. One bit in the bitmap represents
2MiB in the address space: one 4k page is enough to track 64GiB of
physical address space.
In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
Any unaccepted memory that is not aligned to 2M gets accepted upfront.
The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for a 4G TDX VM and ~4x faster for 64G.
TDX-specific code is isolated from the core of the unaccepted memory
support. This should make it possible to plug in different implementations
of unaccepted memory, such as SEV-SNP.
The tree can be found here:
https://github.com/intel/tdx.git guest-unaccepted-memory
v7:
- Rework meminfo counter to use PageUnaccepted() and move to generic code;
- Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
- Add Reviewed-by from David;
v6:
- Fix load_unaligned_zeropad() on machine with unaccepted memory;
- Clear PageUnaccepted() on merged pages, leaving it only on head;
- Clarify error handling in allocate_e820();
- Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
- Disable kexec at boottime instead of build conflict;
- Rebased to tip/master;
- Spelling fixes;
- Add Reviewed-by from Mike and David;
v5:
- Updates comments and commit messages;
 - Explain options for unaccepted memory handling;
 - Expose amount of unaccepted memory in /proc/meminfo;
- Adjust check in page_expected_state();
- Fix error code handling in allocate_e820();
- Centralize __pa()/__va() definitions in the boot stub;
- Avoid includes from the main kernel in the boot stub;
- Use an existing hole in boot_param for unaccepted_memory, instead of adding
to the end of the structure;
 - Extract allocate_unaccepted_memory() from allocate_e820();
 - Complain if there's unaccepted memory, but the kernel does not support it;
- Fix vmstat counter;
- Split up few preparatory patches;
- Random readability adjustments;
v4:
 - PageBuddyUnaccepted() -> PageUnaccepted();
- Use separate page_type, not shared with offline;
- Rework interface between core-mm and arch code;
- Adjust commit messages;
- Ack from Mike;
Kirill A. Shutemov (14):
x86/boot: Centralize __pa()/__va() definitions
mm: Add support for unaccepted memory
mm: Report unaccepted memory in meminfo
efi/x86: Get full memory map in allocate_e820()
x86/boot: Add infrastructure required for unaccepted memory support
efi/x86: Implement support for unaccepted memory
x86/boot/compressed: Handle unaccepted memory
x86/mm: Reserve unaccepted memory bitmap
x86/mm: Provide helpers for unaccepted memory
x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
x86: Disable kexec if system has unaccepted memory
x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
boot stub
x86/tdx: Refactor try_accept_one()
x86/tdx: Add unaccepted memory support
Documentation/x86/zero-page.rst | 1 +
arch/x86/Kconfig | 1 +
arch/x86/boot/bitops.h | 40 ++++++++
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/align.h | 14 +++
arch/x86/boot/compressed/bitmap.c | 43 ++++++++
arch/x86/boot/compressed/bitmap.h | 49 +++++++++
arch/x86/boot/compressed/bits.h | 36 +++++++
arch/x86/boot/compressed/compiler.h | 9 ++
arch/x86/boot/compressed/efi.h | 1 +
arch/x86/boot/compressed/find.c | 54 ++++++++++
arch/x86/boot/compressed/find.h | 80 +++++++++++++++
arch/x86/boot/compressed/ident_map_64.c | 8 --
arch/x86/boot/compressed/kaslr.c | 35 ++++---
arch/x86/boot/compressed/math.h | 37 +++++++
arch/x86/boot/compressed/mem.c | 111 ++++++++++++++++++++
arch/x86/boot/compressed/minmax.h | 61 +++++++++++
arch/x86/boot/compressed/misc.c | 6 ++
arch/x86/boot/compressed/misc.h | 15 +++
arch/x86/boot/compressed/pgtable_types.h | 25 +++++
arch/x86/boot/compressed/sev.c | 2 -
arch/x86/boot/compressed/tdx.c | 78 ++++++++++++++
arch/x86/coco/tdx/tdx.c | 94 ++++++++---------
arch/x86/include/asm/page.h | 3 +
arch/x86/include/asm/shared/tdx.h | 47 +++++++++
arch/x86/include/asm/tdx.h | 19 ----
arch/x86/include/asm/unaccepted_memory.h | 16 +++
arch/x86/include/uapi/asm/bootparam.h | 2 +-
arch/x86/kernel/e820.c | 10 ++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/unaccepted_memory.c | 123 +++++++++++++++++++++++
drivers/base/node.c | 7 ++
drivers/firmware/efi/Kconfig | 14 +++
drivers/firmware/efi/efi.c | 1 +
drivers/firmware/efi/libstub/x86-stub.c | 103 ++++++++++++++++---
fs/proc/meminfo.c | 5 +
include/linux/efi.h | 3 +-
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 31 ++++++
mm/internal.h | 12 +++
mm/memblock.c | 9 ++
mm/page_alloc.c | 96 +++++++++++++++++-
mm/vmstat.c | 1 +
43 files changed, 1191 insertions(+), 115 deletions(-)
create mode 100644 arch/x86/boot/compressed/align.h
create mode 100644 arch/x86/boot/compressed/bitmap.c
create mode 100644 arch/x86/boot/compressed/bitmap.h
create mode 100644 arch/x86/boot/compressed/bits.h
create mode 100644 arch/x86/boot/compressed/compiler.h
create mode 100644 arch/x86/boot/compressed/find.c
create mode 100644 arch/x86/boot/compressed/find.h
create mode 100644 arch/x86/boot/compressed/math.h
create mode 100644 arch/x86/boot/compressed/mem.c
create mode 100644 arch/x86/boot/compressed/minmax.h
create mode 100644 arch/x86/boot/compressed/pgtable_types.h
create mode 100644 arch/x86/include/asm/unaccepted_memory.h
create mode 100644 arch/x86/mm/unaccepted_memory.c
--
2.35.1
This series adds SEV-SNP support for unaccepted memory to the patch series
titled:
[PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. Additionally, the page state change operations are not
optimized under Linux, since it was expected that all memory had been
validated already, resulting in poor performance when adding basic
support for unaccepted memory.
This series consists of six patches:
- Two pre-patch fixes which can be taken regardless of this series.
- A pre-patch to switch from a kmalloc()'d page state change structure
to a (smaller) stack-based page state change structure.
- A pre-patch to allow the use of the early boot GHCB in the core kernel
path.
- A pre-patch to allow for use of 2M page state change requests and 2M
page validation.
- SNP support for unaccepted memory.
The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-unaccepted-memory
---
Changes since v4:
 - Two fixes for cases where an unsigned int is used as the number of
   pages to process: it must be converted to an unsigned long before
   being used to calculate ending addresses, otherwise a value >= 0x100000
   results in adding 0 in the calculation.
- Commit message and comment updates.
Changes since v3:
- Reworks the PSC process to greatly improve performance:
- Optimize the PSC process to use 2M pages when applicable.
- Optimize the page validation process to use 2M pages when applicable.
- Use the early GHCB in both the decompression phase and core kernel
boot phase in order to minimize the use of the MSR protocol. The MSR
protocol only allows for a single 4K page to be updated at a time.
- Move the ghcb_percpu_ready flag into the sev_config structure and
rename it to ghcbs_initialized.
Changes since v2:
- Improve code comments in regards to when to use the per-CPU GHCB vs
the MSR protocol and why a single global value is valid for both
the BSP and APs.
- Add a comment related to the number of PSC entries and how it can
impact the size of the struct and, therefore, stack usage.
- Add a WARN_ON_ONCE() for invoking vmgexit_psc() when per-CPU GHCBs
haven't been created or registered, yet.
- Use the compiler support for clearing the PSC struct instead of
issuing memset().
Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
structure.
Tom Lendacky (6):
x86/sev: Fix calculation of end address based on number of pages
x86/sev: Fix calculation of end address based on number of pages
x86/sev: Put PSC struct on the stack in prep for unaccepted memory
support
x86/sev: Allow for use of the early boot GHCB for PSC requests
x86/sev: Use large PSC requests if applicable
x86/sev: Add SNP-specific unaccepted memory support
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +
arch/x86/boot/compressed/sev.c | 54 ++++++-
arch/x86/boot/compressed/sev.h | 23 +++
arch/x86/include/asm/sev-common.h | 9 +-
arch/x86/include/asm/sev.h | 7 +
arch/x86/kernel/sev-shared.c | 104 +++++++++++++
arch/x86/kernel/sev.c | 250 +++++++++++++-----------------
arch/x86/mm/unaccepted_memory.c | 4 +
9 files changed, 307 insertions(+), 148 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
--
2.37.3
When calculating an end address based on an unsigned int number of pages,
the number of pages must be cast to an unsigned long so that any value
greater than or equal to 0x100000 does not result in zero after the shift.
Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/kernel/sev.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..cac56540929d 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -649,7 +649,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
int rc;
vaddr = vaddr & PAGE_MASK;
- vaddr_end = vaddr + (npages << PAGE_SHIFT);
+ vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
while (vaddr < vaddr_end) {
rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
@@ -666,7 +666,7 @@ static void __init early_set_pages_state(unsigned long paddr, unsigned int npage
u64 val;
paddr = paddr & PAGE_MASK;
- paddr_end = paddr + (npages << PAGE_SHIFT);
+ paddr_end = paddr + ((unsigned long)npages << PAGE_SHIFT);
while (paddr < paddr_end) {
/*
--
2.37.3
>
> When calculating an end address based on an unsigned int number of pages,
> the number of pages must be cast to an unsigned long so that any value
> greater than or equal to 0x100000 does not result in zero after the shift.
>
> Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Tested-by: Dionna Glaze <dionnaglaze@google.com>
--
-Dionna Glaze, PhD (she/her)
On 9/27/22 10:04, Tom Lendacky wrote:
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -649,7 +649,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
> int rc;
>
> vaddr = vaddr & PAGE_MASK;
> - vaddr_end = vaddr + (npages << PAGE_SHIFT);
> + vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
Could we please just fix the fragile typing that cascaded down to this
point?
Shouldn't 'npages' in this interface be a long?
> struct x86_guest {
> void (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
> bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
> bool (*enc_tlb_flush_required)(bool enc);
> bool (*enc_cache_flush_required)(void);
> };
On 9/27/22 12:10, Dave Hansen wrote:
> On 9/27/22 10:04, Tom Lendacky wrote:
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -649,7 +649,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
>> int rc;
>>
>> vaddr = vaddr & PAGE_MASK;
>> - vaddr_end = vaddr + (npages << PAGE_SHIFT);
>> + vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
>
> Could we please just fix the fragile typing that cascaded down to this
> point?
>
> Shouldn't 'npages' in this interface be a long?
I'll take a look at that.
Thanks,
Tom
>
>> struct x86_guest {
>> void (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
>> bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
>> bool (*enc_tlb_flush_required)(bool enc);
>> bool (*enc_cache_flush_required)(void);
>> };
When calculating an end address based on an unsigned int number of pages,
the number of pages must be cast to an unsigned long so that any value
greater than or equal to 0x100000 does not result in zero after the shift.
Fixes: dc3f3d2474b8 ("x86/mm: Validate memory when changing the C-bit")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/kernel/sev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index cac56540929d..c90a47c39f6b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -875,7 +875,7 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
panic("SNP: failed to allocate memory for PSC descriptor\n");
vaddr = vaddr & PAGE_MASK;
- vaddr_end = vaddr + (npages << PAGE_SHIFT);
+ vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
while (vaddr < vaddr_end) {
/* Calculate the last vaddr that fits in one struct snp_psc_desc. */
--
2.37.3
In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.
The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.
If the reduction in PSC entries results in any kind of performance issue
(that is not seen at the moment), use of a larger static PSC struct, with
fallback to the smaller stack version, can be investigated.
For more background info on this decision, see the subthread in the Link:
tag below.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com
---
arch/x86/include/asm/sev-common.h | 9 +++++++--
arch/x86/kernel/sev.c | 10 ++--------
2 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..8ddfdbe521d4 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -106,8 +106,13 @@ enum psc_op {
#define GHCB_HV_FT_SNP BIT_ULL(0)
#define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)
-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY 253
+/*
+ * SNP Page State Change NAE event
+ * The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure, which
+ * is a local stack variable in set_pages_state(). Do not increase this value
+ * without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY 64
struct psc_hdr {
u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c90a47c39f6b..664a4de91757 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -868,11 +868,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
unsigned long vaddr_end, next_vaddr;
- struct snp_psc_desc *desc;
-
- desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
- if (!desc)
- panic("SNP: failed to allocate memory for PSC descriptor\n");
+ struct snp_psc_desc desc;
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
@@ -882,12 +878,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
next_vaddr = min_t(unsigned long, vaddr_end,
(VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
- __set_pages_state(desc, vaddr, next_vaddr, op);
+ __set_pages_state(&desc, vaddr, next_vaddr, op);
vaddr = next_vaddr;
}
-
- kfree(desc);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
--
2.37.3
Using a GHCB for a page state change (as opposed to the MSR protocol)
allows for multiple pages to be processed in a single request. In
preparation for early PSC requests in support of unaccepted memory, update
the invocation of vmgexit_psc() to be able to use the early boot GHCB and
not just the per-CPU GHCB structure.
In order to use the proper GHCB (early boot vs per-CPU), set a flag that
indicates when the per-CPU GHCBs are available and registered. For APs,
the per-CPU GHCBs are created before they are started and registered upon
startup, so this flag can be used globally for the BSP and APs instead of
creating a per-CPU flag. This will allow for a significant reduction in
the number of MSR protocol page state change requests when accepting
memory.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/kernel/sev.c | 61 +++++++++++++++++++++++++++----------------
1 file changed, 38 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 664a4de91757..0b958d77abb4 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -117,7 +117,19 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
struct sev_config {
__u64 debug : 1,
- __reserved : 63;
+
+ /*
+ * A flag used by __set_pages_state() that indicates when the
+ * per-CPU GHCB has been created and registered and thus can be
+ * used by the BSP instead of the early boot GHCB.
+ *
+ * For APs, the per-CPU GHCB is created before they are started
+ * and registered upon startup, so this flag can be used globally
+ * for the BSP and APs.
+ */
+ ghcbs_initialized : 1,
+
+ __reserved : 62;
};
static struct sev_config sev_cfg __read_mostly;
@@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
}
}
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -742,26 +754,13 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
{
int cur_entry, end_entry, ret = 0;
struct snp_psc_desc *data;
- struct ghcb_state state;
struct es_em_ctxt ctxt;
- unsigned long flags;
- struct ghcb *ghcb;
- /*
- * __sev_get_ghcb() needs to run with IRQs disabled because it is using
- * a per-CPU GHCB.
- */
- local_irq_save(flags);
-
- ghcb = __sev_get_ghcb(&state);
- if (!ghcb) {
- ret = 1;
- goto out_unlock;
- }
+ vc_ghcb_invalidate(ghcb);
/* Copy the input desc into GHCB shared buffer */
data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -818,20 +817,18 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
}
out:
- __sev_put_ghcb(&state);
-
-out_unlock:
- local_irq_restore(flags);
-
return ret;
}
static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
unsigned long vaddr_end, int op)
{
+ struct ghcb_state state;
struct psc_hdr *hdr;
struct psc_entry *e;
+ unsigned long flags;
unsigned long pfn;
+ struct ghcb *ghcb;
int i;
hdr = &data->hdr;
@@ -861,8 +858,20 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
i++;
}
- if (vmgexit_psc(data))
+ local_irq_save(flags);
+
+ if (sev_cfg.ghcbs_initialized)
+ ghcb = __sev_get_ghcb(&state);
+ else
+ ghcb = boot_ghcb;
+
+ if (!ghcb || vmgexit_psc(ghcb, data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ if (sev_cfg.ghcbs_initialized)
+ __sev_put_ghcb(&state);
+
+ local_irq_restore(flags);
}
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
@@ -870,6 +879,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
unsigned long vaddr_end, next_vaddr;
struct snp_psc_desc desc;
+ /* Use the MSR protocol when a GHCB is not available. */
+ if (!boot_ghcb)
+ return early_set_pages_state(__pa(vaddr), npages, op);
+
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
@@ -1248,6 +1261,8 @@ void setup_ghcb(void)
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
snp_register_per_cpu_ghcb();
+ sev_cfg.ghcbs_initialized = true;
+
return;
}
--
2.37.3
In advance of providing support for unaccepted memory, request 2M Page
State Change (PSC) requests when the address range allows for it. By using
a 2M page size, more PSC operations can be handled in a single request to
the hypervisor. The hypervisor will determine if it can accommodate the
larger request by checking the mapping in the nested page table. If mapped
as a large page, then the 2M page request can be performed, otherwise the
2M page request will be broken down into 512 4K page requests. This is
still more efficient than having the guest perform multiple PSC requests
in order to process the 512 4K pages.
In conjunction with the 2M PSC requests, attempt to perform the associated
PVALIDATE instruction of the page using the 2M page size. If PVALIDATE
fails with a size mismatch, then fall back to validating 512 4K pages. To
do this, page validation is modified to work with the PSC structure and
not just a virtual address range.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/sev.h | 4 ++
arch/x86/kernel/sev.c | 125 ++++++++++++++++++++++++-------------
2 files changed, 84 insertions(+), 45 deletions(-)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..0007ab04ac5f 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -79,11 +79,15 @@ extern void vc_no_ghcb(void);
extern void vc_boot_ghcb(void);
extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
+/* PVALIDATE return codes */
+#define PVALIDATE_FAIL_SIZEMISMATCH 6
+
/* Software defined (when rFlags.CF = 1) */
#define PVALIDATE_FAIL_NOUPDATE 255
/* RMP page size */
#define RMP_PG_SIZE_4K 0
+#define RMP_PG_SIZE_2M 1
#define RMPADJUST_VMSA_PAGE_BIT BIT(16)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 0b958d77abb4..eabb8dd5be5b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,32 +655,58 @@ static u64 __init get_jump_table_addr(void)
return ret;
}
-static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool validate)
+static void pvalidate_pages(struct snp_psc_desc *desc)
{
- unsigned long vaddr_end;
+ struct psc_entry *e;
+ unsigned long vaddr;
+ unsigned int size;
+ unsigned int i;
+ bool validate;
int rc;
- vaddr = vaddr & PAGE_MASK;
- vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
+ for (i = 0; i <= desc->hdr.end_entry; i++) {
+ e = &desc->entries[i];
+
+ vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+ size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+ rc = pvalidate(vaddr, size, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc)
+ break;
+ }
+ }
- while (vaddr < vaddr_end) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-
- vaddr = vaddr + PAGE_SIZE;
}
}
-static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+ unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
+ int ret;
+
+ vaddr = vaddr & PAGE_MASK;
paddr = paddr & PAGE_MASK;
paddr_end = paddr + ((unsigned long)npages << PAGE_SHIFT);
while (paddr < paddr_end) {
+ if (op == SNP_PAGE_STATE_SHARED) {
+ /* Page validation must be rescinded before changing to shared */
+ ret = pvalidate(vaddr, RMP_PG_SIZE_4K, false);
+ if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+ goto e_term;
+ }
+
/*
* Use the MSR protocol because this function can be called before
* the GHCB is established.
@@ -701,7 +727,15 @@ static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum
paddr, GHCB_MSR_PSC_RESP_VAL(val)))
goto e_term;
- paddr = paddr + PAGE_SIZE;
+ if (op == SNP_PAGE_STATE_PRIVATE) {
+ /* Page validation must be performed after changing to private */
+ ret = pvalidate(vaddr, RMP_PG_SIZE_4K, true);
+ if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+ goto e_term;
+ }
+
+ vaddr += PAGE_SIZE;
+ paddr += PAGE_SIZE;
}
return;
@@ -720,10 +754,7 @@ void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long padd
* Ask the hypervisor to mark the memory pages as private in the RMP
* table.
*/
- early_set_pages_state(paddr, npages, SNP_PAGE_STATE_PRIVATE);
-
- /* Validate the memory pages after they've been added in the RMP table. */
- pvalidate_pages(vaddr, npages, true);
+ early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_PRIVATE);
}
void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
@@ -732,11 +763,8 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
return;
- /* Invalidate the memory pages before they are marked shared in the RMP table. */
- pvalidate_pages(vaddr, npages, false);
-
/* Ask hypervisor to mark the memory pages shared in the RMP table. */
- early_set_pages_state(paddr, npages, SNP_PAGE_STATE_SHARED);
+ early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
}
void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op)
@@ -820,10 +848,11 @@ static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
return ret;
}
-static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
- unsigned long vaddr_end, int op)
+static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
+ unsigned long vaddr_end, int op)
{
struct ghcb_state state;
+ bool use_large_entry;
struct psc_hdr *hdr;
struct psc_entry *e;
unsigned long flags;
@@ -837,27 +866,37 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
memset(data, 0, sizeof(*data));
i = 0;
- while (vaddr < vaddr_end) {
- if (is_vmalloc_addr((void *)vaddr))
+ while (vaddr < vaddr_end && i < ARRAY_SIZE(data->entries)) {
+ hdr->end_entry = i;
+
+ if (is_vmalloc_addr((void *)vaddr)) {
pfn = vmalloc_to_pfn((void *)vaddr);
- else
+ use_large_entry = false;
+ } else {
pfn = __pa(vaddr) >> PAGE_SHIFT;
+ use_large_entry = true;
+ }
e->gfn = pfn;
e->operation = op;
- hdr->end_entry = i;
- /*
- * Current SNP implementation doesn't keep track of the RMP page
- * size so use 4K for simplicity.
- */
- e->pagesize = RMP_PG_SIZE_4K;
+ if (use_large_entry && IS_ALIGNED(vaddr, PMD_PAGE_SIZE) &&
+ (vaddr_end - vaddr) >= PMD_PAGE_SIZE) {
+ e->pagesize = RMP_PG_SIZE_2M;
+ vaddr += PMD_PAGE_SIZE;
+ } else {
+ e->pagesize = RMP_PG_SIZE_4K;
+ vaddr += PAGE_SIZE;
+ }
- vaddr = vaddr + PAGE_SIZE;
e++;
i++;
}
+ /* Page validation must be rescinded before changing to shared */
+ if (op == SNP_PAGE_STATE_SHARED)
+ pvalidate_pages(data);
+
local_irq_save(flags);
if (sev_cfg.ghcbs_initialized)
@@ -865,6 +904,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
else
ghcb = boot_ghcb;
+ /* Invoke the hypervisor to perform the page state changes */
if (!ghcb || vmgexit_psc(ghcb, data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
@@ -872,29 +912,28 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
__sev_put_ghcb(&state);
local_irq_restore(flags);
+
+ /* Page validation must be performed after changing to private */
+ if (op == SNP_PAGE_STATE_PRIVATE)
+ pvalidate_pages(data);
+
+ return vaddr;
}
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
- unsigned long vaddr_end, next_vaddr;
struct snp_psc_desc desc;
+ unsigned long vaddr_end;
/* Use the MSR protocol when a GHCB is not available. */
if (!boot_ghcb)
- return early_set_pages_state(__pa(vaddr), npages, op);
+ return early_set_pages_state(vaddr, __pa(vaddr), npages, op);
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
- while (vaddr < vaddr_end) {
- /* Calculate the last vaddr that fits in one struct snp_psc_desc. */
- next_vaddr = min_t(unsigned long, vaddr_end,
- (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
-
- __set_pages_state(&desc, vaddr, next_vaddr, op);
-
- vaddr = next_vaddr;
- }
+ while (vaddr < vaddr_end)
+ vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -902,8 +941,6 @@ void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
return;
- pvalidate_pages(vaddr, npages, false);
-
set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
}
@@ -913,8 +950,6 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
return;
set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
-
- pvalidate_pages(vaddr, npages, true);
}
static int snp_set_vmsa(void *va, bool vmsa)
--
2.37.3
Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.
The process of accepting memory under SNP involves invoking the hypervisor
to change the page state to private in the RMP table and then issuing a
PVALIDATE instruction to validate (accept) the page.
Since the boot path and the core kernel paths perform similar operations,
move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c
to avoid code duplication.
Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +
arch/x86/boot/compressed/sev.c | 54 ++++++++++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++
arch/x86/include/asm/sev.h | 3 +
arch/x86/kernel/sev-shared.c | 104 +++++++++++++++++++++++++++++
arch/x86/kernel/sev.c | 112 ++++----------------------------
arch/x86/mm/unaccepted_memory.c | 4 ++
8 files changed, 205 insertions(+), 99 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select UNACCEPTED_MEMORY
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
#include "find.h"
#include "math.h"
#include "tdx.h"
+#include "sev.h"
#include <asm/shared/tdx.h>
#define PMD_SHIFT 21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
/* Platform-specific memory-acceptance call goes here */
if (is_tdx_guest())
tdx_accept_memory(start, end);
+ else if (sev_snp_enabled())
+ snp_accept_memory(start, end);
else
error("Cannot accept memory: unknown platform\n");
}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..22da65c96b47 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
/* Include code for early handlers */
#include "../../kernel/sev-shared.c"
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
@@ -181,6 +181,58 @@ static bool early_setup_ghcb(void)
return true;
}
+static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
+ phys_addr_t pa, phys_addr_t pa_end)
+{
+ struct psc_hdr *hdr;
+ struct psc_entry *e;
+ unsigned int i;
+
+ hdr = &desc->hdr;
+ memset(hdr, 0, sizeof(*hdr));
+
+ e = desc->entries;
+
+ i = 0;
+ while (pa < pa_end && i < VMGEXIT_PSC_MAX_ENTRY) {
+ hdr->end_entry = i;
+
+ e->gfn = pa >> PAGE_SHIFT;
+ e->operation = SNP_PAGE_STATE_PRIVATE;
+ if (IS_ALIGNED(pa, PMD_PAGE_SIZE) && (pa_end - pa) >= PMD_PAGE_SIZE) {
+ e->pagesize = RMP_PG_SIZE_2M;
+ pa += PMD_PAGE_SIZE;
+ } else {
+ e->pagesize = RMP_PG_SIZE_4K;
+ pa += PAGE_SIZE;
+ }
+
+ e++;
+ i++;
+ }
+
+ if (vmgexit_psc(boot_ghcb, desc))
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ pvalidate_pages(desc);
+
+ return pa;
+}
+
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ struct snp_psc_desc desc = {};
+ unsigned int i;
+ phys_addr_t pa;
+
+ if (!boot_ghcb && !early_setup_ghcb())
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ pa = start;
+ while (pa < end)
+ pa = __snp_accept_memory(&desc, pa, end);
+}
+
void sev_es_shutdown_ghcb(void)
{
if (!boot_ghcb)
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0007ab04ac5f..9297aab0c79e 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -206,6 +206,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -230,6 +231,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
#endif
#endif
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index b478edf43bec..7ac7857da2b8 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -12,6 +12,9 @@
#ifndef __BOOT_COMPRESSED
#define error(v) pr_err(v)
#define has_cpuflag(f) boot_cpu_has(f)
+#else
+#undef WARN
+#define WARN(condition...)
#endif
/* I/O parameters for CPUID-related helpers */
@@ -998,3 +1001,104 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
cpuid_ext_range_max = fn->eax;
}
}
+
+static void pvalidate_pages(struct snp_psc_desc *desc)
+{
+ struct psc_entry *e;
+ unsigned long vaddr;
+ unsigned int size;
+ unsigned int i;
+ bool validate;
+ int rc;
+
+ for (i = 0; i <= desc->hdr.end_entry; i++) {
+ e = &desc->entries[i];
+
+ vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+ size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+ rc = pvalidate(vaddr, size, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc)
+ break;
+ }
+ }
+
+ if (rc) {
+ WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
+ }
+ }
+}
+
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
+{
+ int cur_entry, end_entry, ret = 0;
+ struct snp_psc_desc *data;
+ struct es_em_ctxt ctxt;
+
+ vc_ghcb_invalidate(ghcb);
+
+ /* Copy the input desc into GHCB shared buffer */
+ data = (struct snp_psc_desc *)ghcb->shared_buffer;
+ memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
+
+ /*
+ * As per the GHCB specification, the hypervisor can resume the guest
+ * before processing all the entries. Check whether all the entries
+ * are processed. If not, then keep retrying. Note, the hypervisor
+ * will update the data memory directly to indicate the status, so
+ * reference the data->hdr everywhere.
+ *
+ * The strategy here is to wait for the hypervisor to change the page
+ * state in the RMP table before guest accesses the memory pages. If the
+ * page state change was not successful, then later memory access will
+ * result in a crash.
+ */
+ cur_entry = data->hdr.cur_entry;
+ end_entry = data->hdr.end_entry;
+
+ while (data->hdr.cur_entry <= data->hdr.end_entry) {
+ ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
+
+ /* This will advance the shared buffer data points to. */
+ ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
+
+ /*
+ * Page State Change VMGEXIT can pass error code through
+ * exit_info_2.
+ */
+ if (ret || ghcb->save.sw_exit_info_2) {
+ WARN(1, "SNP: PSC failed ret=%d exit_info_2=%llx\n",
+ ret, ghcb->save.sw_exit_info_2);
+ ret = 1;
+ goto out;
+ }
+
+ /* Verify that reserved bit is not set */
+ if (data->hdr.reserved) {
+ WARN(1, "Reserved bit is set in the PSC header\n");
+ ret = 1;
+ goto out;
+ }
+
+ /*
+ * Sanity check that entry processing is not going backwards.
+ * This will happen only if hypervisor is tricking us.
+ */
+ if (data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry) {
+ WARN(1, "SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
+ end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry);
+ ret = 1;
+ goto out;
+ }
+ }
+
+out:
+ return ret;
+}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index eabb8dd5be5b..48440933bde2 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,38 +655,6 @@ static u64 __init get_jump_table_addr(void)
return ret;
}
-static void pvalidate_pages(struct snp_psc_desc *desc)
-{
- struct psc_entry *e;
- unsigned long vaddr;
- unsigned int size;
- unsigned int i;
- bool validate;
- int rc;
-
- for (i = 0; i <= desc->hdr.end_entry; i++) {
- e = &desc->entries[i];
-
- vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
- size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
- validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
-
- rc = pvalidate(vaddr, size, validate);
- if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
- unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
-
- for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
- if (rc)
- break;
- }
- }
-
- if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
- sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
- }
-}
-
static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned int npages, enum psc_op op)
{
@@ -782,72 +750,6 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}
-static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
-{
- int cur_entry, end_entry, ret = 0;
- struct snp_psc_desc *data;
- struct es_em_ctxt ctxt;
-
- vc_ghcb_invalidate(ghcb);
-
- /* Copy the input desc into GHCB shared buffer */
- data = (struct snp_psc_desc *)ghcb->shared_buffer;
- memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
-
- /*
- * As per the GHCB specification, the hypervisor can resume the guest
- * before processing all the entries. Check whether all the entries
- * are processed. If not, then keep retrying. Note, the hypervisor
- * will update the data memory directly to indicate the status, so
- * reference the data->hdr everywhere.
- *
- * The strategy here is to wait for the hypervisor to change the page
- * state in the RMP table before guest accesses the memory pages. If the
- * page state change was not successful, then later memory access will
- * result in a crash.
- */
- cur_entry = data->hdr.cur_entry;
- end_entry = data->hdr.end_entry;
-
- while (data->hdr.cur_entry <= data->hdr.end_entry) {
- ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
-
- /* This will advance the shared buffer data points to. */
- ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
-
- /*
- * Page State Change VMGEXIT can pass error code through
- * exit_info_2.
- */
- if (WARN(ret || ghcb->save.sw_exit_info_2,
- "SNP: PSC failed ret=%d exit_info_2=%llx\n",
- ret, ghcb->save.sw_exit_info_2)) {
- ret = 1;
- goto out;
- }
-
- /* Verify that reserved bit is not set */
- if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
- ret = 1;
- goto out;
- }
-
- /*
- * Sanity check that entry processing is not going backwards.
- * This will happen only if hypervisor is tricking us.
- */
- if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
-"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
- end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
- ret = 1;
- goto out;
- }
- }
-
-out:
- return ret;
-}
-
static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
unsigned long vaddr_end, int op)
{
@@ -952,6 +854,20 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long vaddr;
+ unsigned int npages;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ vaddr = (unsigned long)__va(start);
+ npages = (end - start) >> PAGE_SHIFT;
+
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
#include <asm/setup.h>
#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
/* Protects unaccepted memory bitmap */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
tdx_accept_memory(range_start * PMD_SIZE,
range_end * PMD_SIZE);
+ } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ snp_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
} else {
panic("Cannot accept memory: unknown platform\n");
}
--
2.37.3
This series adds SEV-SNP support for unaccepted memory to the patch series
titled:
[PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. Additionally, the page state change operations are not
optimized under Linux since it was expected that all memory had been
validated already, resulting in poor performance when adding basic
support for unaccepted memory.
So this series consists of four patches:
- A pre-patch to switch from a kmalloc()'d page state change structure
to a (smaller) stack-based page state change structure.
- A pre-patch to allow the use of the early boot GHCB in the core kernel
path.
- A pre-patch to allow for use of 2M page state change requests and 2M
page validation.
- SNP support for unaccepted memory.
The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-unaccepted-memory
---
Changes since v3:
- Rework the PSC process to greatly improve performance:
- Optimize the PSC process to use 2M pages when applicable.
- Optimize the page validation process to use 2M pages when applicable.
- Use the early GHCB in both the decompression phase and core kernel
boot phase in order to minimize the use of the MSR protocol. The MSR
protocol only allows for a single 4K page to be updated at a time.
- Move the ghcb_percpu_ready flag into the sev_config structure and
rename it to ghcbs_initialized.
Changes since v2:
- Improve code comments in regards to when to use the per-CPU GHCB vs
the MSR protocol and why a single global value is valid for both
the BSP and APs.
- Add a comment related to the number of PSC entries and how it can
impact the size of the struct and, therefore, stack usage.
- Add a WARN_ON_ONCE() for invoking vmgexit_psc() when per-CPU GHCBs
haven't been created or registered, yet.
- Use the compiler support for clearing the PSC struct instead of
issuing memset().
Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
structure.
Tom Lendacky (4):
x86/sev: Put PSC struct on the stack in prep for unaccepted memory
support
x86/sev: Allow for use of the early boot GHCB for PSC requests
x86/sev: Use large PSC requests if applicable
x86/sev: Add SNP-specific unaccepted memory support
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +
arch/x86/boot/compressed/sev.c | 54 ++++++-
arch/x86/boot/compressed/sev.h | 23 +++
arch/x86/include/asm/sev-common.h | 9 +-
arch/x86/include/asm/sev.h | 7 +
arch/x86/kernel/sev-shared.c | 104 +++++++++++++
arch/x86/kernel/sev.c | 246 +++++++++++++-----------------
arch/x86/mm/unaccepted_memory.c | 4 +
9 files changed, 305 insertions(+), 146 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
--
2.37.2
In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.
The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.
If the reduction in PSC entries results in any kind of performance issue
(that is not seen at the moment), use of a larger static PSC struct, with
fallback to the smaller stack version, can be investigated.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/sev-common.h | 9 +++++++--
arch/x86/kernel/sev.c | 10 ++--------
2 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..6c3d61c5f6a3 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -106,8 +106,13 @@ enum psc_op {
#define GHCB_HV_FT_SNP BIT_ULL(0)
#define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)
-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY 253
+/*
+ * SNP Page State Change NAE event
+ * The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure,
+ * which is a local variable (stack usage) in set_pages_state(). Do not
+ * increase this value without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY 64
struct psc_hdr {
u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..d18a580dd048 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -868,11 +868,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
unsigned long vaddr_end, next_vaddr;
- struct snp_psc_desc *desc;
-
- desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
- if (!desc)
- panic("SNP: failed to allocate memory for PSC descriptor\n");
+ struct snp_psc_desc desc;
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +878,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
next_vaddr = min_t(unsigned long, vaddr_end,
(VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
- __set_pages_state(desc, vaddr, next_vaddr, op);
+ __set_pages_state(&desc, vaddr, next_vaddr, op);
vaddr = next_vaddr;
}
-
- kfree(desc);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
--
2.37.2
On Thu, Aug 25, 2022 at 09:23:14AM -0500, Tom Lendacky wrote:
> In advance of providing support for unaccepted memory, switch from using
> kmalloc() for allocating the Page State Change (PSC) structure to using a
> local variable that lives on the stack. This is needed to avoid a possible
> recursive call into set_pages_state() if the kmalloc() call requires
> (more) memory to be accepted, which would result in a hang.
>
> The current size of the PSC struct is 2,032 bytes. To make the struct more
> stack friendly, reduce the number of PSC entries from 253 down to 64,
> resulting in a size of 520 bytes. This is a nice compromise on struct size
> and total PSC requests while still allowing parallel PSC operations across
> vCPUs.
>
> If the reduction in PSC entries results in any kind of performance issue
> (that is not seen at the moment), use of a larger static PSC struct, with
> fallback to the smaller stack version, can be investigated.
"For more background info on this decision see the subthread in the Link
tag below."
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com
> ---
> arch/x86/include/asm/sev-common.h | 9 +++++++--
> arch/x86/kernel/sev.c | 10 ++--------
> 2 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index b8357d6ecd47..6c3d61c5f6a3 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -106,8 +106,13 @@ enum psc_op {
> #define GHCB_HV_FT_SNP BIT_ULL(0)
> #define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)
>
> -/* SNP Page State Change NAE event */
> -#define VMGEXIT_PSC_MAX_ENTRY 253
> +/*
> + * SNP Page State Change NAE event
> + * The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure,
> + * which is a local variable (stack usage) in set_pages_state(). Do not
... which is a local stack variable...
> + * increase this value without evaluating the impact to stack usage.
> + */
...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Using a GHCB for a page state change (as opposed to the MSR protocol)
allows for multiple pages to be processed in a single request. In prep
for early PSC requests in support of unaccepted memory, update the
invocation of vmgexit_psc() to be able to use the early boot GHCB and not
just the per-CPU GHCB structure.
In order to use the proper GHCB (early boot vs per-CPU), set a flag that
indicates when the per-CPU GHCBs are available and registered. For APs,
the per-CPU GHCBs are created before they are started and registered upon
startup, so this flag can be used globally for the BSP and APs instead of
creating a per-CPU flag. This will allow for a significant reduction in
the number of MSR protocol page state change requests when accepting
memory.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/kernel/sev.c | 61 +++++++++++++++++++++++++++----------------
1 file changed, 38 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index d18a580dd048..a5f02b6b099b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -117,7 +117,19 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
struct sev_config {
__u64 debug : 1,
- __reserved : 63;
+
+ /*
+ * A flag used by __set_pages_state() that indicates when the
+ * per-CPU GHCB has been created and registered and thus can be
+ * used by the BSP instead of the early boot GHCB.
+ *
+ * For APs, the per-CPU GHCB is created before they are started
+ * and registered upon startup, so this flag can be used globally
+ * for the BSP and APs.
+ */
+ ghcbs_initialized : 1,
+
+ __reserved : 62;
};
static struct sev_config sev_cfg __read_mostly;
@@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
}
}
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -742,26 +754,13 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
{
int cur_entry, end_entry, ret = 0;
struct snp_psc_desc *data;
- struct ghcb_state state;
struct es_em_ctxt ctxt;
- unsigned long flags;
- struct ghcb *ghcb;
- /*
- * __sev_get_ghcb() needs to run with IRQs disabled because it is using
- * a per-CPU GHCB.
- */
- local_irq_save(flags);
-
- ghcb = __sev_get_ghcb(&state);
- if (!ghcb) {
- ret = 1;
- goto out_unlock;
- }
+ vc_ghcb_invalidate(ghcb);
/* Copy the input desc into GHCB shared buffer */
data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -818,20 +817,18 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
}
out:
- __sev_put_ghcb(&state);
-
-out_unlock:
- local_irq_restore(flags);
-
return ret;
}
static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
unsigned long vaddr_end, int op)
{
+ struct ghcb_state state;
struct psc_hdr *hdr;
struct psc_entry *e;
+ unsigned long flags;
unsigned long pfn;
+ struct ghcb *ghcb;
int i;
hdr = &data->hdr;
@@ -861,8 +858,20 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
i++;
}
- if (vmgexit_psc(data))
+ local_irq_save(flags);
+
+ if (sev_cfg.ghcbs_initialized)
+ ghcb = __sev_get_ghcb(&state);
+ else
+ ghcb = boot_ghcb;
+
+ if (!ghcb || vmgexit_psc(ghcb, data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ if (sev_cfg.ghcbs_initialized)
+ __sev_put_ghcb(&state);
+
+ local_irq_restore(flags);
}
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
@@ -870,6 +879,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
unsigned long vaddr_end, next_vaddr;
struct snp_psc_desc desc;
+ /* Use the MSR protocol when a GHCB is not available. */
+ if (!boot_ghcb)
+ return early_set_pages_state(__pa(vaddr), npages, op);
+
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -1248,6 +1261,8 @@ void setup_ghcb(void)
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
snp_register_per_cpu_ghcb();
+ sev_cfg.ghcbs_initialized = true;
+
return;
}
--
2.37.2
In advance of providing support for unaccepted memory, request 2M Page
State Change (PSC) requests when the address range allows for it. By using
a 2M page size, more PSC operations can be handled in a single request to
the hypervisor. The hypervisor will determine if it can accommodate the
larger request by checking the mapping in the nested page table. If mapped
as a large page, then the 2M page request can be performed, otherwise the
2M page request will be broken down into 512 4K page requests. This is
still more efficient than having the guest perform multiple PSC requests
in order to process the 512 4K pages.
In conjunction with the 2M PSC requests, attempt to perform the associated
PVALIDATE instruction of the page using the 2M page size. If PVALIDATE
fails with a size mismatch, then fall back to validating 512 4K pages. To
do this, page validation is modified to work with the PSC structure and
not just a virtual address range.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/sev.h | 4 ++
arch/x86/kernel/sev.c | 125 ++++++++++++++++++++++++-------------
2 files changed, 84 insertions(+), 45 deletions(-)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..0007ab04ac5f 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -79,11 +79,15 @@ extern void vc_no_ghcb(void);
extern void vc_boot_ghcb(void);
extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
+/* PVALIDATE return codes */
+#define PVALIDATE_FAIL_SIZEMISMATCH 6
+
/* Software defined (when rFlags.CF = 1) */
#define PVALIDATE_FAIL_NOUPDATE 255
/* RMP page size */
#define RMP_PG_SIZE_4K 0
+#define RMP_PG_SIZE_2M 1
#define RMPADJUST_VMSA_PAGE_BIT BIT(16)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a5f02b6b099b..a744f7f2e72b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,32 +655,58 @@ static u64 __init get_jump_table_addr(void)
return ret;
}
-static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool validate)
+static void pvalidate_pages(struct snp_psc_desc *desc)
{
- unsigned long vaddr_end;
+ struct psc_entry *e;
+ unsigned long vaddr;
+ unsigned int size;
+ unsigned int i;
+ bool validate;
int rc;
- vaddr = vaddr & PAGE_MASK;
- vaddr_end = vaddr + (npages << PAGE_SHIFT);
+ for (i = 0; i <= desc->hdr.end_entry; i++) {
+ e = &desc->entries[i];
+
+ vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+ size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+ rc = pvalidate(vaddr, size, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc)
+ break;
+ }
+ }
- while (vaddr < vaddr_end) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-
- vaddr = vaddr + PAGE_SIZE;
}
}
-static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+ unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
+ int ret;
+
+ vaddr = vaddr & PAGE_MASK;
paddr = paddr & PAGE_MASK;
paddr_end = paddr + (npages << PAGE_SHIFT);
while (paddr < paddr_end) {
+ if (op == SNP_PAGE_STATE_SHARED) {
+ /* Page validation must be rescinded before changing to shared */
+ ret = pvalidate(vaddr, RMP_PG_SIZE_4K, false);
+ if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+ goto e_term;
+ }
+
/*
* Use the MSR protocol because this function can be called before
* the GHCB is established.
@@ -701,7 +727,15 @@ static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum
paddr, GHCB_MSR_PSC_RESP_VAL(val)))
goto e_term;
- paddr = paddr + PAGE_SIZE;
+ if (op == SNP_PAGE_STATE_PRIVATE) {
+ /* Page validation must be performed after changing to private */
+ ret = pvalidate(vaddr, RMP_PG_SIZE_4K, true);
+ if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+ goto e_term;
+ }
+
+ vaddr += PAGE_SIZE;
+ paddr += PAGE_SIZE;
}
return;
@@ -720,10 +754,7 @@ void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long padd
* Ask the hypervisor to mark the memory pages as private in the RMP
* table.
*/
- early_set_pages_state(paddr, npages, SNP_PAGE_STATE_PRIVATE);
-
- /* Validate the memory pages after they've been added in the RMP table. */
- pvalidate_pages(vaddr, npages, true);
+ early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_PRIVATE);
}
void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
@@ -732,11 +763,8 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
return;
- /* Invalidate the memory pages before they are marked shared in the RMP table. */
- pvalidate_pages(vaddr, npages, false);
-
/* Ask hypervisor to mark the memory pages shared in the RMP table. */
- early_set_pages_state(paddr, npages, SNP_PAGE_STATE_SHARED);
+ early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
}
void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op)
@@ -820,10 +848,11 @@ static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
return ret;
}
-static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
- unsigned long vaddr_end, int op)
+static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
+ unsigned long vaddr_end, int op)
{
struct ghcb_state state;
+ bool use_large_entry;
struct psc_hdr *hdr;
struct psc_entry *e;
unsigned long flags;
@@ -837,27 +866,37 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
memset(data, 0, sizeof(*data));
i = 0;
- while (vaddr < vaddr_end) {
- if (is_vmalloc_addr((void *)vaddr))
+ while (vaddr < vaddr_end && i < ARRAY_SIZE(data->entries)) {
+ hdr->end_entry = i;
+
+ if (is_vmalloc_addr((void *)vaddr)) {
pfn = vmalloc_to_pfn((void *)vaddr);
- else
+ use_large_entry = false;
+ } else {
pfn = __pa(vaddr) >> PAGE_SHIFT;
+ use_large_entry = true;
+ }
e->gfn = pfn;
e->operation = op;
- hdr->end_entry = i;
- /*
- * Current SNP implementation doesn't keep track of the RMP page
- * size so use 4K for simplicity.
- */
- e->pagesize = RMP_PG_SIZE_4K;
+ if (use_large_entry && IS_ALIGNED(vaddr, PMD_PAGE_SIZE) &&
+ (vaddr_end - vaddr) >= PMD_PAGE_SIZE) {
+ e->pagesize = RMP_PG_SIZE_2M;
+ vaddr += PMD_PAGE_SIZE;
+ } else {
+ e->pagesize = RMP_PG_SIZE_4K;
+ vaddr += PAGE_SIZE;
+ }
- vaddr = vaddr + PAGE_SIZE;
e++;
i++;
}
+ /* Page validation must be rescinded before changing to shared */
+ if (op == SNP_PAGE_STATE_SHARED)
+ pvalidate_pages(data);
+
local_irq_save(flags);
if (sev_cfg.ghcbs_initialized)
@@ -865,6 +904,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
else
ghcb = boot_ghcb;
+ /* Invoke the hypervisor to perform the page state changes */
if (!ghcb || vmgexit_psc(ghcb, data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
@@ -872,29 +912,28 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
__sev_put_ghcb(&state);
local_irq_restore(flags);
+
+ /* Page validation must be performed after changing to private */
+ if (op == SNP_PAGE_STATE_PRIVATE)
+ pvalidate_pages(data);
+
+ return vaddr;
}
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
- unsigned long vaddr_end, next_vaddr;
struct snp_psc_desc desc;
+ unsigned long vaddr_end;
/* Use the MSR protocol when a GHCB is not available. */
if (!boot_ghcb)
- return early_set_pages_state(__pa(vaddr), npages, op);
+ return early_set_pages_state(vaddr, __pa(vaddr), npages, op);
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
- while (vaddr < vaddr_end) {
- /* Calculate the last vaddr that fits in one struct snp_psc_desc. */
- next_vaddr = min_t(unsigned long, vaddr_end,
- (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
-
- __set_pages_state(&desc, vaddr, next_vaddr, op);
-
- vaddr = next_vaddr;
- }
+ while (vaddr < vaddr_end)
+ vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -902,8 +941,6 @@ void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
return;
- pvalidate_pages(vaddr, npages, false);
-
set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
}
@@ -913,8 +950,6 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
return;
set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
-
- pvalidate_pages(vaddr, npages, true);
}
static int snp_set_vmsa(void *va, bool vmsa)
--
2.37.2
Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.
The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.
Since the boot path and the core kernel paths perform similar operations,
move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c
to avoid code duplication.
Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +
arch/x86/boot/compressed/sev.c | 54 ++++++++++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++
arch/x86/include/asm/sev.h | 3 +
arch/x86/kernel/sev-shared.c | 104 +++++++++++++++++++++++++++++
arch/x86/kernel/sev.c | 112 ++++----------------------------
arch/x86/mm/unaccepted_memory.c | 4 ++
8 files changed, 205 insertions(+), 99 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select UNACCEPTED_MEMORY
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
#include "find.h"
#include "math.h"
#include "tdx.h"
+#include "sev.h"
#include <asm/shared/tdx.h>
#define PMD_SHIFT 21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
/* Platform-specific memory-acceptance call goes here */
if (is_tdx_guest())
tdx_accept_memory(start, end);
+ else if (sev_snp_enabled())
+ snp_accept_memory(start, end);
else
error("Cannot accept memory: unknown platform\n");
}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..22da65c96b47 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
/* Include code for early handlers */
#include "../../kernel/sev-shared.c"
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
@@ -181,6 +181,58 @@ static bool early_setup_ghcb(void)
return true;
}
+static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
+ phys_addr_t pa, phys_addr_t pa_end)
+{
+ struct psc_hdr *hdr;
+ struct psc_entry *e;
+ unsigned int i;
+
+ hdr = &desc->hdr;
+ memset(hdr, 0, sizeof(*hdr));
+
+ e = desc->entries;
+
+ i = 0;
+ while (pa < pa_end && i < VMGEXIT_PSC_MAX_ENTRY) {
+ hdr->end_entry = i;
+
+ e->gfn = pa >> PAGE_SHIFT;
+ e->operation = SNP_PAGE_STATE_PRIVATE;
+ if (IS_ALIGNED(pa, PMD_PAGE_SIZE) && (pa_end - pa) >= PMD_PAGE_SIZE) {
+ e->pagesize = RMP_PG_SIZE_2M;
+ pa += PMD_PAGE_SIZE;
+ } else {
+ e->pagesize = RMP_PG_SIZE_4K;
+ pa += PAGE_SIZE;
+ }
+
+ e++;
+ i++;
+ }
+
+ if (vmgexit_psc(boot_ghcb, desc))
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ pvalidate_pages(desc);
+
+ return pa;
+}
+
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ struct snp_psc_desc desc = {};
+ unsigned int i;
+ phys_addr_t pa;
+
+ if (!boot_ghcb && !early_setup_ghcb())
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ pa = start;
+ while (pa < end)
+ pa = __snp_accept_memory(&desc, pa, end);
+}
+
void sev_es_shutdown_ghcb(void)
{
if (!boot_ghcb)
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0007ab04ac5f..9297aab0c79e 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -206,6 +206,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -230,6 +231,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
#endif
#endif
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index b478edf43bec..7ac7857da2b8 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -12,6 +12,9 @@
#ifndef __BOOT_COMPRESSED
#define error(v) pr_err(v)
#define has_cpuflag(f) boot_cpu_has(f)
+#else
+#undef WARN
+#define WARN(condition...)
#endif
/* I/O parameters for CPUID-related helpers */
@@ -998,3 +1001,104 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
cpuid_ext_range_max = fn->eax;
}
}
+
+static void pvalidate_pages(struct snp_psc_desc *desc)
+{
+ struct psc_entry *e;
+ unsigned long vaddr;
+ unsigned int size;
+ unsigned int i;
+ bool validate;
+ int rc;
+
+ for (i = 0; i <= desc->hdr.end_entry; i++) {
+ e = &desc->entries[i];
+
+ vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+ size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+ rc = pvalidate(vaddr, size, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc)
+ break;
+ }
+ }
+
+ if (rc) {
+ WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
+ }
+ }
+}
+
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
+{
+ int cur_entry, end_entry, ret = 0;
+ struct snp_psc_desc *data;
+ struct es_em_ctxt ctxt;
+
+ vc_ghcb_invalidate(ghcb);
+
+ /* Copy the input desc into GHCB shared buffer */
+ data = (struct snp_psc_desc *)ghcb->shared_buffer;
+ memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
+
+ /*
+ * As per the GHCB specification, the hypervisor can resume the guest
+ * before processing all the entries. Check whether all the entries
+ * are processed. If not, then keep retrying. Note, the hypervisor
+ * will update the data memory directly to indicate the status, so
+ * reference the data->hdr everywhere.
+ *
+ * The strategy here is to wait for the hypervisor to change the page
+ * state in the RMP table before guest accesses the memory pages. If the
+ * page state change was not successful, then later memory access will
+ * result in a crash.
+ */
+ cur_entry = data->hdr.cur_entry;
+ end_entry = data->hdr.end_entry;
+
+ while (data->hdr.cur_entry <= data->hdr.end_entry) {
+ ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
+
+ /* This will advance the shared buffer data points to. */
+ ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
+
+ /*
+ * Page State Change VMGEXIT can pass error code through
+ * exit_info_2.
+ */
+ if (ret || ghcb->save.sw_exit_info_2) {
+ WARN(1, "SNP: PSC failed ret=%d exit_info_2=%llx\n",
+ ret, ghcb->save.sw_exit_info_2);
+ ret = 1;
+ goto out;
+ }
+
+ /* Verify that reserved bit is not set */
+ if (data->hdr.reserved) {
+ WARN(1, "Reserved bit is set in the PSC header\n");
+ ret = 1;
+ goto out;
+ }
+
+ /*
+ * Sanity check that entry processing is not going backwards.
+ * This will happen only if hypervisor is tricking us.
+ */
+ if (data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry) {
+ WARN(1, "SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
+ end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry);
+ ret = 1;
+ goto out;
+ }
+ }
+
+out:
+ return ret;
+}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a744f7f2e72b..abdf431622ea 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,38 +655,6 @@ static u64 __init get_jump_table_addr(void)
return ret;
}
-static void pvalidate_pages(struct snp_psc_desc *desc)
-{
- struct psc_entry *e;
- unsigned long vaddr;
- unsigned int size;
- unsigned int i;
- bool validate;
- int rc;
-
- for (i = 0; i <= desc->hdr.end_entry; i++) {
- e = &desc->entries[i];
-
- vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
- size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
- validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
-
- rc = pvalidate(vaddr, size, validate);
- if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
- unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
-
- for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
- if (rc)
- break;
- }
- }
-
- if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
- sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
- }
-}
-
static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned int npages, enum psc_op op)
{
@@ -782,72 +750,6 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}
-static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
-{
- int cur_entry, end_entry, ret = 0;
- struct snp_psc_desc *data;
- struct es_em_ctxt ctxt;
-
- vc_ghcb_invalidate(ghcb);
-
- /* Copy the input desc into GHCB shared buffer */
- data = (struct snp_psc_desc *)ghcb->shared_buffer;
- memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
-
- /*
- * As per the GHCB specification, the hypervisor can resume the guest
- * before processing all the entries. Check whether all the entries
- * are processed. If not, then keep retrying. Note, the hypervisor
- * will update the data memory directly to indicate the status, so
- * reference the data->hdr everywhere.
- *
- * The strategy here is to wait for the hypervisor to change the page
- * state in the RMP table before guest accesses the memory pages. If the
- * page state change was not successful, then later memory access will
- * result in a crash.
- */
- cur_entry = data->hdr.cur_entry;
- end_entry = data->hdr.end_entry;
-
- while (data->hdr.cur_entry <= data->hdr.end_entry) {
- ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
-
- /* This will advance the shared buffer data points to. */
- ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
-
- /*
- * Page State Change VMGEXIT can pass error code through
- * exit_info_2.
- */
- if (WARN(ret || ghcb->save.sw_exit_info_2,
- "SNP: PSC failed ret=%d exit_info_2=%llx\n",
- ret, ghcb->save.sw_exit_info_2)) {
- ret = 1;
- goto out;
- }
-
- /* Verify that reserved bit is not set */
- if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
- ret = 1;
- goto out;
- }
-
- /*
- * Sanity check that entry processing is not going backwards.
- * This will happen only if hypervisor is tricking us.
- */
- if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
-"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
- end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
- ret = 1;
- goto out;
- }
- }
-
-out:
- return ret;
-}
-
static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
unsigned long vaddr_end, int op)
{
@@ -952,6 +854,20 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long vaddr;
+ unsigned int npages;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ vaddr = (unsigned long)__va(start);
+ npages = (end - start) >> PAGE_SHIFT;
+
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
#include <asm/setup.h>
#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
/* Protects unaccepted memory bitmap */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
tdx_accept_memory(range_start * PMD_SIZE,
range_end * PMD_SIZE);
+ } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ snp_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
} else {
panic("Cannot accept memory: unknown platform\n");
}
--
2.37.2
>
> Add SNP-specific hooks to the unaccepted memory support in the boot
> path (__accept_memory()) and the core kernel (accept_memory()) in order
> to support booting SNP guests when unaccepted memory is present. Without
> this support, SNP guests will fail to boot and/or panic() when unaccepted
> memory is present in the EFI memory map.
>
> The process of accepting memory under SNP involves invoking the hypervisor
> to perform a page state change for the page to private memory and then
> issuing a PVALIDATE instruction to accept the page.

Thanks for this update! Tests show the boot performance shaves off a
good few seconds over eager acceptance, and it'll get better when we
have on-demand pinning.

The uncaught #VC exception is still there for 256GB machines and larger though.

--
-Dionna Glaze, PhD (she/her)
On 8/25/22 17:10, Dionna Amalie Glaze wrote:
>>
>> Add SNP-specific hooks to the unaccepted memory support in the boot
>> path (__accept_memory()) and the core kernel (accept_memory()) in order
>> to support booting SNP guests when unaccepted memory is present. Without
>> this support, SNP guests will fail to boot and/or panic() when unaccepted
>> memory is present in the EFI memory map.
>>
>> The process of accepting memory under SNP involves invoking the hypervisor
>> to perform a page state change for the page to private memory and then
>> issuing a PVALIDATE instruction to accept the page.
>
> Thanks for this update! Tests show the boot performance shaves off a
> good few seconds over eager acceptance, and it'll get better when we
> have on-demand pinning.
>
> The uncaught #VC exception is still there for 256GB machines and larger though.

Any chance of getting a stack trace when this occurs, e.g. adding a
WARN_ON() in vc_handle_exitcode() (assuming it happens when logging is
enabled)?

Thanks,
Tom
This series adds SEV-SNP support for unaccepted memory to the patch series
titled:
[PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This led to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:
- A pre-patch to switch from a kmalloc()'d page state change structure
to a per-CPU page state change structure.
- SNP support for unaccepted memory.
The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-unaccepted-memory
---
Tom Lendacky (2):
x86/sev: Use per-CPU PSC structure in prep for unaccepted memory
support
x86/sev: Add SNP-specific unaccepted memory support
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 ++
arch/x86/boot/compressed/sev.c | 10 ++++-
arch/x86/boot/compressed/sev.h | 23 ++++++++++
arch/x86/include/asm/sev.h | 3 ++
arch/x86/kernel/sev.c | 76 ++++++++++++++++++++++++---------
arch/x86/mm/unaccepted_memory.c | 4 ++
7 files changed, 98 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
--
2.36.1
In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change structure to using a
per-CPU structure. This is needed to avoid a possible recursive call into
set_pages_state() if the allocation requires (more) memory to be accepted,
which would result in a hang.
Protect the use of the per-CPU structure by disabling interrupts during
memory acceptance. Since the set_pages_state() path is the only path into
vmgexit_psc(), rename vmgexit_psc() to __vmgexit_psc() and remove the
calls to disable interrupts which are now performed by set_pages_state().
Even with interrupts disabled, an NMI can be raised while performing
memory acceptance. The NMI could then cause further memory acceptance to
be performed. To prevent corruption of the per-CPU structure, use the PSC
MSR protocol in this situation.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/kernel/sev.c | 60 ++++++++++++++++++++++++++++---------------
1 file changed, 39 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..1f7f6205c4f6 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -104,6 +104,15 @@ struct sev_es_runtime_data {
* is currently unsupported in SEV-ES guests.
*/
unsigned long dr7;
+
+ /*
+ * Page State Change structure for use when accepting memory or when
+ * changing page state. Interrupts are disabled when using the structure
+ * but an NMI could still be raised, so use a flag to indicate when the
+ * structure is in use and use the MSR protocol in these cases.
+ */
+ struct snp_psc_desc psc_desc;
+ bool psc_active;
};
struct ghcb_state {
@@ -660,7 +669,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
}
}
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -742,26 +751,17 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int __vmgexit_psc(struct snp_psc_desc *desc)
{
int cur_entry, end_entry, ret = 0;
struct snp_psc_desc *data;
struct ghcb_state state;
struct es_em_ctxt ctxt;
- unsigned long flags;
struct ghcb *ghcb;
- /*
- * __sev_get_ghcb() needs to run with IRQs disabled because it is using
- * a per-CPU GHCB.
- */
- local_irq_save(flags);
-
ghcb = __sev_get_ghcb(&state);
- if (!ghcb) {
- ret = 1;
- goto out_unlock;
- }
+ if (!ghcb)
+ return 1;
/* Copy the input desc into GHCB shared buffer */
data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -820,9 +820,6 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
out:
__sev_put_ghcb(&state);
-out_unlock:
- local_irq_restore(flags);
-
return ret;
}
@@ -861,18 +858,32 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
i++;
}
- if (vmgexit_psc(data))
+ if (__vmgexit_psc(data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
}
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
unsigned long vaddr_end, next_vaddr;
+ struct sev_es_runtime_data *data;
struct snp_psc_desc *desc;
+ unsigned long flags;
- desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
- if (!desc)
- panic("SNP: failed to allocate memory for PSC descriptor\n");
+ /* Disable interrupts since a per-CPU PSC and per-CPU GHCB are used. */
+ local_irq_save(flags);
+
+ data = this_cpu_read(runtime_data);
+ if (!data || data->psc_active) {
+ /* No per-CPU PSC or it is active, use the MSR protocol. */
+ early_set_pages_state(__pa(vaddr), npages, op);
+ goto out;
+ }
+
+ /* Mark the PSC in use. */
+ data->psc_active = true;
+ barrier();
+
+ desc = &data->psc_desc;
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -887,7 +898,12 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
vaddr = next_vaddr;
}
- kfree(desc);
+ /* Mark the PSC no longer in use. */
+ barrier();
+ data->psc_active = false;
+
+out:
+ local_irq_restore(flags);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1339,6 +1355,8 @@ static void __init alloc_runtime_data(int cpu)
panic("Can't allocate SEV-ES runtime data");
per_cpu(runtime_data, cpu) = data;
+
+ data->psc_active = false;
}
static void __init init_ghcb(int cpu)
--
2.36.1
On 7/29/22 07:01, Tom Lendacky wrote:
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..1f7f6205c4f6 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -104,6 +104,15 @@ struct sev_es_runtime_data {
> * is currently unsupported in SEV-ES guests.
> */
> unsigned long dr7;
> +
> + /*
> + * Page State Change structure for use when accepting memory or when
> + * changing page state. Interrupts are disabled when using the structure
> + * but an NMI could still be raised, so use a flag to indicate when the
> + * structure is in use and use the MSR protocol in these cases.
> + */
> + struct snp_psc_desc psc_desc;
> + bool psc_active;
> };
This thing:
struct snp_psc_desc {
struct psc_hdr hdr;
struct psc_entry entries[VMGEXIT_PSC_MAX_ENTRY];
} __packed;
is 16k, right? Being per-cpu, this might eat up a MB or two of memory
on a big server?
Considering that runtime acceptance is already single-threaded[1] *and*
there's a fallback method, why not just have a single copy of this
guarded by a single lock?
1.
https://lore.kernel.org/all/20220614120231.48165-10-kirill.shutemov@linux.intel.com/
On 7/29/22 09:18, Dave Hansen wrote:
> On 7/29/22 07:01, Tom Lendacky wrote:
>> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
>> index c05f0124c410..1f7f6205c4f6 100644
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -104,6 +104,15 @@ struct sev_es_runtime_data {
>> * is currently unsupported in SEV-ES guests.
>> */
>> unsigned long dr7;
>> +
>> + /*
>> + * Page State Change structure for use when accepting memory or when
>> + * changing page state. Interrupts are disabled when using the structure
>> + * but an NMI could still be raised, so use a flag to indicate when the
>> + * structure is in use and use the MSR protocol in these cases.
>> + */
>> + struct snp_psc_desc psc_desc;
>> + bool psc_active;
>> };
>
> This thing:
>
> struct snp_psc_desc {
> struct psc_hdr hdr;
> struct psc_entry entries[VMGEXIT_PSC_MAX_ENTRY];
> } __packed;
>
> is 16k, right? Being per-cpu, this might eat up a MB or two of memory
> on a big server?
It's just under 2K, 2,032 bytes.
>
> Considering that runtime acceptance is already single-threaded[1] *and*
> there's a fallback method, why not just have a single copy of this
> guarded by a single lock?
This function is called for more than just memory acceptance. It's also
called for any changes from or to private or shared, which isn't
single-threaded.
Thanks,
Tom
>
> 1.
> https://lore.kernel.org/all/20220614120231.48165-10-kirill.shutemov@linux.intel.com/
On 7/29/22 07:25, Tom Lendacky wrote:
>> Considering that runtime acceptance is already single-threaded[1] *and*
>> there's a fallback method, why not just have a single copy of this
>> guarded by a single lock?
>
> This function is called for more than just memory acceptance. It's also
> called for any changes from or to private or shared, which isn't
> single-threaded.

I think this tidbit from the changelog threw me off:

> Protect the use of the per-CPU structure by disabling interrupts during
> memory acceptance.

Could you please revise that to accurately capture the impact of this
change?
On 7/29/22 14:08, Dave Hansen wrote:
> On 7/29/22 07:25, Tom Lendacky wrote:
>>> Considering that runtime acceptance is already single-threaded[1] *and*
>>> there's a fallback method, why not just have a single copy of this
>>> guarded by a single lock?
>>
>> This function is called for more than just memory acceptance. It's also
>> called for any changes from or to private or shared, which isn't
>> single-threaded.
>
> I think this tidbit from the changelog threw me off:
>
>> Protect the use of the per-CPU structure by disabling interrupts during
>> memory acceptance.
>
> Could you please revise that to accurately capture the impact of this
> change?

Is s/memory acceptance/page state changes/ enough of what you are
looking for or something more?

Thanks,
Tom
On 7/29/22 12:22, Tom Lendacky wrote:
>> I think this tidbit from the changelog threw me off:
>>
>>> Protect the use of the per-CPU structure by disabling interrupts during
>>> memory acceptance.
>>
>> Could you please revise that to accurately capture the impact of this
>> change?
>
> Is s/memory acceptance/page state changes/ enough of what you are
> looking for or something more?

That, plus a reminder of when "page state changes" are performed would
be nice. How frequent are they? Are they performance sensitive?
That'll help us decide if the design here is appropriate or not.
On 7/29/22 14:28, Dave Hansen wrote:
> On 7/29/22 12:22, Tom Lendacky wrote:
>>> I think this tidbit from the changelog threw me off:
>>>
>>>> Protect the use of the per-CPU structure by disabling interrupts during
>>>> memory acceptance.
>>>
>>> Could you please revise that to accurately capture the impact of this
>>> change?
>>
>> Is s/memory acceptance/page state changes/ enough of what you are
>> looking for or something more?
>
> That, plus a reminder of when "page state changes" are performed would
> be nice. How frequent are they? Are they performance sensitive?
> That'll help us decide if the design here is appropriate or not.

Without submitting a v2, here's what the updated paragraph would look like:

  Page state changes occur whenever DMA memory is allocated or memory
  needs to be shared with the hypervisor (kvmclock, attestation reports,
  etc.). A per-CPU structure is chosen over a single PSC structure
  protected with a lock because these changes can be initiated from
  interrupt or soft-interrupt context (e.g. the NVMe driver). Protect the
  use of the per-CPU structure by disabling interrupts during page state
  changes. Since the set_pages_state() path is the only path into
  vmgexit_psc(), rename vmgexit_psc() to __vmgexit_psc() and remove the
  calls to disable interrupts which are now performed by
  set_pages_state().

Hopefully there aren't a lot of page state changes occurring once a
system has booted, so maybe a static struct with a lock would work. I am
a bit worried about an NMI occurring during a page state change that
requires a lock. I suppose, in_nmi() can be used to detect that and go
the MSR protocol route to avoid a deadlock. I can investigate that if
the 2K-extra per-CPU is not desired.

Thanks,
Tom
This series adds SEV-SNP support for unaccepted memory to the patch series
titled:
[PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This led to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:
- A pre-patch to switch from a kmalloc()'d page state change structure
to a static page state change structure proteced with access protected
by a spinlock.
- SNP support for unaccepted memory.
The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-unaccepted-memory
---
This is what the static structure / spinlock method looks like. Let me
know if this approach is preferred over the per-CPU structure. If so,
I'll submit this as a v2.
Thanks,
Tom
Tom Lendacky (2):
x86/sev: Use per-CPU PSC structure in prep for unaccepted memory
support
x86/sev: Add SNP-specific unaccepted memory support
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 ++
arch/x86/boot/compressed/sev.c | 10 ++++-
arch/x86/boot/compressed/sev.h | 23 +++++++++++
arch/x86/include/asm/sev.h | 3 ++
arch/x86/kernel/sev.c | 71 ++++++++++++++++++++++-----------
arch/x86/mm/unaccepted_memory.c | 4 ++
7 files changed, 91 insertions(+), 24 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
--
2.36.1
In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
static structure. This is needed to avoid a possible recursive call into
set_pages_state() if the kmalloc() call requires (more) memory to be
accepted, which would result in a hang.
Page state changes occur whenever DMA memory is allocated or memory needs
to be shared with the hypervisor (kvmclock, attestation reports, etc.).
Since most page state changes occur early in boot and are limited in
number, a single static PSC structure is used and protected by a spin
lock with interrupts disabled.
Even with interrupts disabled, an NMI can be raised while performing
memory acceptance. The NMI could then cause further memory acceptance to
be performed. To prevent a deadlock, use the MSR protocol if executing in
an NMI context.
Since the set_pages_state() path is the only path into vmgexit_psc(),
rename vmgexit_psc() to __vmgexit_psc() and remove the calls to disable
interrupts which are now performed by set_pages_state().
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/kernel/sev.c | 55 +++++++++++++++++++++++++------------------
1 file changed, 32 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..84d94fd2ec53 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
*/
static struct ghcb *boot_ghcb __section(".data");
+/* Flag to indicate when the first per-CPU GHCB is registered */
+static bool ghcb_percpu_ready __section(".data");
+
/* Bitmap of SEV features supported by the hypervisor */
static u64 sev_hv_features __ro_after_init;
@@ -122,6 +125,15 @@ struct sev_config {
static struct sev_config sev_cfg __read_mostly;
+/*
+ * Page State Change structure for use when accepting memory or when changing
+ * page state. Use is protected by a spinlock with interrupts disabled, but an
+ * NMI could still be raised, so check if running in an NMI and use the MSR
+ * protocol in these cases.
+ */
+static struct snp_psc_desc psc_desc;
+static DEFINE_SPINLOCK(psc_desc_lock);
+
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
}
}
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -742,26 +754,17 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int __vmgexit_psc(struct snp_psc_desc *desc)
{
int cur_entry, end_entry, ret = 0;
struct snp_psc_desc *data;
struct ghcb_state state;
struct es_em_ctxt ctxt;
- unsigned long flags;
struct ghcb *ghcb;
- /*
- * __sev_get_ghcb() needs to run with IRQs disabled because it is using
- * a per-CPU GHCB.
- */
- local_irq_save(flags);
-
ghcb = __sev_get_ghcb(&state);
- if (!ghcb) {
- ret = 1;
- goto out_unlock;
- }
+ if (!ghcb)
+ return 1;
/* Copy the input desc into GHCB shared buffer */
data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -820,9 +823,6 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
out:
__sev_put_ghcb(&state);
-out_unlock:
- local_irq_restore(flags);
-
return ret;
}
@@ -861,18 +861,25 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
i++;
}
- if (vmgexit_psc(data))
+ if (__vmgexit_psc(data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
}
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
unsigned long vaddr_end, next_vaddr;
- struct snp_psc_desc *desc;
+ unsigned long flags;
- desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
- if (!desc)
- panic("SNP: failed to allocate memory for PSC descriptor\n");
+ /*
+ * Use the MSR protocol when either:
+ * - executing in an NMI to avoid any possibility of a deadlock
+ * - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
+ * uses the per-CPU GHCB.
+ */
+ if (in_nmi() || !ghcb_percpu_ready)
+ return early_set_pages_state(__pa(vaddr), npages, op);
+
+ spin_lock_irqsave(&psc_desc_lock, flags);
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +889,12 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
next_vaddr = min_t(unsigned long, vaddr_end,
(VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
- __set_pages_state(desc, vaddr, next_vaddr, op);
+ __set_pages_state(&psc_desc, vaddr, next_vaddr, op);
vaddr = next_vaddr;
}
- kfree(desc);
+ spin_unlock_irqrestore(&psc_desc_lock, flags);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1254,6 +1261,8 @@ void setup_ghcb(void)
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
snp_register_per_cpu_ghcb();
+ ghcb_percpu_ready = true;
+
return;
}
--
2.36.1
On 8/3/22 13:11, Tom Lendacky wrote:
Of course, I'll fix the subject if submitting this for real... ugh.
Thanks,
Tom
> In advance of providing support for unaccepted memory, switch from using
> kmalloc() for allocating the Page State Change (PSC) structure to using a
> static structure. This is needed to avoid a possible recursive call into
> set_pages_state() if the kmalloc() call requires (more) memory to be
> accepted, which would result in a hang.
>
> Page state changes occur whenever DMA memory is allocated or memory needs
> to be shared with the hypervisor (kvmclock, attestation reports, etc.).
> Since most page state changes occur early in boot and are limited in
> number, a single static PSC structure is used and protected by a spin
> lock with interrupts disabled.
>
> Even with interrupts disabled, an NMI can be raised while performing
> memory acceptance. The NMI could then cause further memory acceptance to
> be performed. To prevent a deadlock, use the MSR protocol if executing in
> an NMI context.
>
> Since the set_pages_state() path is the only path into vmgexit_psc(),
> rename vmgexit_psc() to __vmgexit_psc() and remove the calls to disable
> interrupts which are now performed by set_pages_state().
>
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
> arch/x86/kernel/sev.c | 55 +++++++++++++++++++++++++------------------
> 1 file changed, 32 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..84d94fd2ec53 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
> */
> static struct ghcb *boot_ghcb __section(".data");
>
> +/* Flag to indicate when the first per-CPU GHCB is registered */
> +static bool ghcb_percpu_ready __section(".data");
> +
> /* Bitmap of SEV features supported by the hypervisor */
> static u64 sev_hv_features __ro_after_init;
>
> @@ -122,6 +125,15 @@ struct sev_config {
>
> static struct sev_config sev_cfg __read_mostly;
>
> +/*
> + * Page State Change structure for use when accepting memory or when changing
> + * page state. Use is protected by a spinlock with interrupts disabled, but an
> + * NMI could still be raised, so check if running in an NMI and use the MSR
> + * protocol in these cases.
> + */
> +static struct snp_psc_desc psc_desc;
> +static DEFINE_SPINLOCK(psc_desc_lock);
> +
> static __always_inline bool on_vc_stack(struct pt_regs *regs)
> {
> unsigned long sp = regs->sp;
> @@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
> }
> }
>
> -static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
> +static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
> {
> unsigned long paddr_end;
> u64 val;
> @@ -742,26 +754,17 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
> WARN(1, "invalid memory op %d\n", op);
> }
>
> -static int vmgexit_psc(struct snp_psc_desc *desc)
> +static int __vmgexit_psc(struct snp_psc_desc *desc)
> {
> int cur_entry, end_entry, ret = 0;
> struct snp_psc_desc *data;
> struct ghcb_state state;
> struct es_em_ctxt ctxt;
> - unsigned long flags;
> struct ghcb *ghcb;
>
> - /*
> - * __sev_get_ghcb() needs to run with IRQs disabled because it is using
> - * a per-CPU GHCB.
> - */
> - local_irq_save(flags);
> -
> ghcb = __sev_get_ghcb(&state);
> - if (!ghcb) {
> - ret = 1;
> - goto out_unlock;
> - }
> + if (!ghcb)
> + return 1;
>
> /* Copy the input desc into GHCB shared buffer */
> data = (struct snp_psc_desc *)ghcb->shared_buffer;
> @@ -820,9 +823,6 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
> out:
> __sev_put_ghcb(&state);
>
> -out_unlock:
> - local_irq_restore(flags);
> -
> return ret;
> }
>
> @@ -861,18 +861,25 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
> i++;
> }
>
> - if (vmgexit_psc(data))
> + if (__vmgexit_psc(data))
> sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
> }
>
> static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
> {
> unsigned long vaddr_end, next_vaddr;
> - struct snp_psc_desc *desc;
> + unsigned long flags;
>
> - desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
> - if (!desc)
> - panic("SNP: failed to allocate memory for PSC descriptor\n");
> + /*
> + * Use the MSR protocol when either:
> + * - executing in an NMI to avoid any possibility of a deadlock
> + * - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
> + * uses the per-CPU GHCB.
> + */
> + if (in_nmi() || !ghcb_percpu_ready)
> + return early_set_pages_state(__pa(vaddr), npages, op);
> +
> + spin_lock_irqsave(&psc_desc_lock, flags);
>
> vaddr = vaddr & PAGE_MASK;
> vaddr_end = vaddr + (npages << PAGE_SHIFT);
> @@ -882,12 +889,12 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
> next_vaddr = min_t(unsigned long, vaddr_end,
> (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
>
> - __set_pages_state(desc, vaddr, next_vaddr, op);
> + __set_pages_state(&psc_desc, vaddr, next_vaddr, op);
>
> vaddr = next_vaddr;
> }
>
> - kfree(desc);
> + spin_unlock_irqrestore(&psc_desc_lock, flags);
> }
>
> void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
> @@ -1254,6 +1261,8 @@ void setup_ghcb(void)
> if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
> snp_register_per_cpu_ghcb();
>
> + ghcb_percpu_ready = true;
> +
> return;
> }
>
On 8/3/22 11:11, Tom Lendacky wrote:
> +	/*
> +	 * Use the MSR protocol when either:
> +	 *   - executing in an NMI to avoid any possibility of a deadlock
> +	 *   - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
> +	 *     uses the per-CPU GHCB.
> +	 */
> +	if (in_nmi() || !ghcb_percpu_ready)
> +		return early_set_pages_state(__pa(vaddr), npages, op);
> +
> +	spin_lock_irqsave(&psc_desc_lock, flags);

Would it be simpler to just do a spin_trylock_irqsave()? You fall back
to early_set_pages_state() whenever you can't acquire the lock.

That avoids even having to know what the situations are where you
_might_ recurse. If it recurses, the trylock will just naturally fail.
You simply can't have bugs where the "(in_nmi() || !ghcb_percpu_ready)"
conditional was wrong.
On 8/3/22 13:17, Dave Hansen wrote:
> On 8/3/22 11:11, Tom Lendacky wrote:
>> +	/*
>> +	 * Use the MSR protocol when either:
>> +	 *   - executing in an NMI to avoid any possibility of a deadlock
>> +	 *   - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
>> +	 *     uses the per-CPU GHCB.
>> +	 */
>> +	if (in_nmi() || !ghcb_percpu_ready)
>> +		return early_set_pages_state(__pa(vaddr), npages, op);
>> +
>> +	spin_lock_irqsave(&psc_desc_lock, flags);
>
> Would it be simpler to just do a spin_trylock_irqsave()? You fall back
> to early_set_pages_state() whenever you can't acquire the lock.

I was looking at that and can definitely go that route if this approach
is preferred.

Thanks,
Tom

> That avoids even having to know what the situations are where you
> _might_ recurse. If it recurses, the trylock will just naturally fail.
> You simply can't have bugs where the "(in_nmi() || !ghcb_percpu_ready)"
> conditional was wrong.
On 8/3/22 11:21, Tom Lendacky wrote:
>> Would it be simpler to just do a spin_trylock_irqsave()? You fall back
>> to early_set_pages_state() whenever you can't acquire the lock.
>
> I was looking at that and can definitely go that route if this approach
> is preferred.

I prefer it for sure.

This whole iteration does look good to me versus the per-cpu version, so
I say go ahead with doing this for v2 once you wait a bit for any more
feedback.
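For readers following the thread, the spin_trylock_irqsave() fallback
being discussed can be modeled in a few lines of plain userspace C. This
is only a sketch of the control flow, not kernel code: the lock, the
counters, and all function names below are illustrative stand-ins for
psc_desc_lock, the GHCB-batched path, and the MSR-protocol path.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace model of the trylock-fallback idea: if the PSC lock is
 * already held (recursion, NMI, ...), fall back to the slow
 * one-page-at-a-time MSR protocol instead of spinning/deadlocking. */
static atomic_flag psc_lock = ATOMIC_FLAG_INIT;
static int batched_psc_calls;	/* requests served by the batched path */
static int msr_psc_pages;	/* pages pushed through the MSR fallback */

static void model_set_pages_state(unsigned int npages)
{
	/* trylock: test_and_set() returns true if the lock was held */
	if (atomic_flag_test_and_set(&psc_lock)) {
		msr_psc_pages += npages;	/* fallback path */
		return;
	}
	batched_psc_calls++;			/* locked, batched path */
	atomic_flag_clear(&psc_lock);
}

/* Simulate a nested request arriving while the lock is held. */
static void model_nested_request(void)
{
	assert(!atomic_flag_test_and_set(&psc_lock));
	model_set_pages_state(4);	/* must not deadlock: uses fallback */
	atomic_flag_clear(&psc_lock);
}
```

The point Dave makes falls out directly: the nested caller never has to
know *why* the lock was contended; the failed trylock routes it to the
fallback automatically.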
On 8/3/22 13:24, Dave Hansen wrote:
> On 8/3/22 11:21, Tom Lendacky wrote:
>>> Would it be simpler to just do a spin_trylock_irqsave()? You fall back
>>> to early_set_pages_state() whenever you can't acquire the lock.
>>
>> I was looking at that and can definitely go that route if this approach
>> is preferred.
>
> I prefer it for sure.
>
> This whole iteration does look good to me versus the per-cpu version, so
> I say go ahead with doing this for v2 once you wait a bit for any more
> feedback.

I'm still concerned about the whole spinlock and performance. What if I
reduce the number of entries in the PSC structure to, say, 64, which
reduces the size of the struct to 520 bytes? Any issue if that is put on
the stack instead? It definitely makes things less complicated and feels
like a good compromise on the size vs the number of PSC VMGEXIT
requests.

Thanks,
Tom
On 8/3/22 14:03, Tom Lendacky wrote:
>> This whole iteration does look good to me versus the per-cpu version, so
>> I say go ahead with doing this for v2 once you wait a bit for any more
>> feedback.
>
> I'm still concerned about the whole spinlock and performance. What if I
> reduce the number of entries in the PSC structure to, say, 64, which
> reduces the size of the struct to 520 bytes? Any issue if that is put on
> the stack instead? It definitely makes things less complicated and feels
> like a good compromise on the size vs the number of PSC VMGEXIT
> requests.

That would be fine too.

But, I doubt there will be any real performance issues coming out of
this. As bad as this MSR thing is, I suspect it's not half as disastrous
as the global spinlock in Kirill's patches.

Also, private<->shared page conversions are *NOT* common from what I can
tell. There are a few pages converted at boot, but most of the
guest<->host communications are through the swiotlb pages, which are
static.

Are there other things that SEV uses this structure for that I'm missing?
On 8/3/22 16:18, Dave Hansen wrote:
> On 8/3/22 14:03, Tom Lendacky wrote:
>>> This whole iteration does look good to me versus the per-cpu version, so
>>> I say go ahead with doing this for v2 once you wait a bit for any more
>>> feedback.
>>
>> I'm still concerned about the whole spinlock and performance. What if I
>> reduce the number of entries in the PSC structure to, say, 64, which
>> reduces the size of the struct to 520 bytes? Any issue if that is put on
>> the stack instead? It definitely makes things less complicated and feels
>> like a good compromise on the size vs the number of PSC VMGEXIT
>> requests.
>
> That would be fine too.

Ok.

> But, I doubt there will be any real performance issues coming out of
> this. As bad as this MSR thing is, I suspect it's not half as disastrous
> as the global spinlock in Kirill's patches.
>
> Also, private<->shared page conversions are *NOT* common from what I can
> tell. There are a few pages converted at boot, but most of the
> guest<->host communications are through the swiotlb pages, which are
> static.

Generally, that's true. But, e.g., a dma_alloc_coherent() actually
doesn't go through SWIOTLB, but instead allocates the pages and makes
them shared, which results in a page state change. The NVMe driver was
calling that API a lot. In this case, though, the NVMe driver was
running in IRQ context and set_memory_decrypted() could sleep, so an
unencrypted DMA memory pool was created to work around the sleeping
issue and reduce the page state changes. It's just things like that
that make me wary.

Thanks,
Tom

> Are there other things that SEV uses this structure for that I'm missing?
On 8/3/22 14:34, Tom Lendacky wrote:
>> Also, private<->shared page conversions are *NOT* common from what I can
>> tell. There are a few pages converted at boot, but most of the
>> guest<->host communications are through the swiotlb pages, which are
>> static.
>
> Generally, that's true. But, e.g., a dma_alloc_coherent() actually
> doesn't go through SWIOTLB, but instead allocates the pages and makes
> them shared, which results in a page state change. The NVMe driver was
> calling that API a lot. In this case, though, the NVMe driver was
> running in IRQ context and set_memory_decrypted() could sleep, so an
> unencrypted DMA memory pool was created to work around the sleeping
> issue and reduce the page state changes. It's just things like that
> that make me wary.

Interesting. Is that a real passthrough NVMe device or the hypervisor
presenting a virtual one that just happens to use the NVMe driver?

I'm pretty sure the TDX folks have been banking on having very few page
state changes. But, part of that at least is their expectation of
relying heavily on virtio.

I wonder if their expectations are accurate, or if, once TDX gets out
into the real world, their hopes will be dashed.
On 8/3/22 16:48, Dave Hansen wrote:
> On 8/3/22 14:34, Tom Lendacky wrote:
>>> Also, private<->shared page conversions are *NOT* common from what I can
>>> tell. There are a few pages converted at boot, but most of the
>>> guest<->host communications are through the swiotlb pages, which are
>>> static.
>>
>> Generally, that's true. But, e.g., a dma_alloc_coherent() actually
>> doesn't go through SWIOTLB, but instead allocates the pages and makes
>> them shared, which results in a page state change. The NVMe driver was
>> calling that API a lot. In this case, though, the NVMe driver was
>> running in IRQ context and set_memory_decrypted() could sleep, so an
>> unencrypted DMA memory pool was created to work around the sleeping
>> issue and reduce the page state changes. It's just things like that
>> that make me wary.
>
> Interesting. Is that a real passthrough NVMe device or the hypervisor
> presenting a virtual one that just happens to use the NVMe driver?

Hmmm... not sure, possibly the latter. I just knew that whatever it was,
the NVMe driver was being used.

Thanks,
Tom

> I'm pretty sure the TDX folks have been banking on having very few page
> state changes. But, part of that at least is their expectation of
> relying heavily on virtio.
>
> I wonder if their expectations are accurate, or if, once TDX gets out
> into the real world, their hopes will be dashed.
Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.
The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.
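The two-step sequence described above can be sketched as a tiny
userspace model. To be clear about the assumptions: real page state
changes go through the GHCB/MSR protocols and PVALIDATE is a CPU
instruction that updates RMP state; the enum, struct, and helper names
below are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of accepting one 4K page under SEV-SNP: the hypervisor must
 * first flip the page to private (the Page State Change), and only then
 * can the guest's PVALIDATE succeed. */
enum rmp_state { RMP_SHARED, RMP_PRIVATE };

struct model_page {
	enum rmp_state state;
	bool validated;		/* set by PVALIDATE, i.e. "accepted" */
};

static bool model_pvalidate(struct model_page *p)
{
	if (p->state != RMP_PRIVATE)
		return false;	/* real PVALIDATE fails on a shared page */
	p->validated = true;
	return true;
}

static bool model_accept_page(struct model_page *p)
{
	p->state = RMP_PRIVATE;		/* step 1: PSC to private */
	return model_pvalidate(p);	/* step 2: PVALIDATE accepts it */
}
```

The ordering matters: attempting the validate step on a still-shared
page fails, which is why the PSC must come first.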
Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +++
arch/x86/boot/compressed/sev.c | 10 +++++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++++++++++++++++++
arch/x86/include/asm/sev.h | 3 +++
arch/x86/kernel/sev.c | 16 ++++++++++++++++
arch/x86/mm/unaccepted_memory.c | 4 ++++
7 files changed, 59 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/boot/compressed/sev.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select UNACCEPTED_MEMORY
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
#include "find.h"
#include "math.h"
#include "tdx.h"
+#include "sev.h"
#include <asm/shared/tdx.h>
#define PMD_SHIFT 21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
/* Platform-specific memory-acceptance call goes here */
if (is_tdx_guest())
tdx_accept_memory(start, end);
+ else if (sev_snp_enabled())
+ snp_accept_memory(start, end);
else
error("Cannot accept memory: unknown platform\n");
}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
/* Include code for early handlers */
#include "../../kernel/sev-shared.c"
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ while (end > start) {
+ snp_set_page_private(start);
+ start += PAGE_SIZE;
+ }
+}
+
static bool early_setup_ghcb(void)
{
if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
#endif
#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 84d94fd2ec53..db74c38babf7 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -917,6 +917,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
pvalidate_pages(vaddr, npages, true);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long vaddr;
+ unsigned int npages;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ vaddr = (unsigned long)__va(start);
+ npages = (end - start) >> PAGE_SHIFT;
+
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+ pvalidate_pages(vaddr, npages, true);
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
#include <asm/setup.h>
#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
/* Protects unaccepted memory bitmap */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
tdx_accept_memory(range_start * PMD_SIZE,
range_end * PMD_SIZE);
+ } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ snp_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
} else {
panic("Cannot accept memory: unknown platform\n");
}
--
2.36.1
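The index arithmetic in the accept_memory() hunk above (bitmap bit ->
2M chunk -> physical range -> 4K page count) can be checked with a few
lines of plain C. PMD_SIZE (2M) and the 4K page size match the kernel's
constants; the helper name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE	(2ULL << 20)	/* one unaccepted-memory bit covers 2M */
#define PAGE_SZ		4096ULL		/* 4K pages handed to set_pages_state() */

/* How many 4K pages snp_accept_memory() walks for a bitmap range:
 * range_start/range_end are bit indices, each covering PMD_SIZE bytes. */
static uint64_t npages_for_range(uint64_t range_start, uint64_t range_end)
{
	uint64_t start = range_start * PMD_SIZE;
	uint64_t end = range_end * PMD_SIZE;

	return (end - start) / PAGE_SZ;
}
```

So each bitmap bit accepted translates into 512 4K pages of page state
change plus PVALIDATE work.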
Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.
The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.
Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +++
arch/x86/boot/compressed/sev.c | 10 +++++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++++++++++++++++++
arch/x86/include/asm/sev.h | 3 +++
arch/x86/kernel/sev.c | 16 ++++++++++++++++
arch/x86/mm/unaccepted_memory.c | 4 ++++
7 files changed, 59 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/boot/compressed/sev.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select UNACCEPTED_MEMORY
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
#include "find.h"
#include "math.h"
#include "tdx.h"
+#include "sev.h"
#include <asm/shared/tdx.h>
#define PMD_SHIFT 21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
/* Platform-specific memory-acceptance call goes here */
if (is_tdx_guest())
tdx_accept_memory(start, end);
+ else if (sev_snp_enabled())
+ snp_accept_memory(start, end);
else
error("Cannot accept memory: unknown platform\n");
}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
/* Include code for early handlers */
#include "../../kernel/sev-shared.c"
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ while (end > start) {
+ snp_set_page_private(start);
+ start += PAGE_SIZE;
+ }
+}
+
static bool early_setup_ghcb(void)
{
if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
#endif
#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1f7f6205c4f6..289764e3a0b5 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -926,6 +926,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
pvalidate_pages(vaddr, npages, true);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long vaddr;
+ unsigned int npages;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ vaddr = (unsigned long)__va(start);
+ npages = (end - start) >> PAGE_SHIFT;
+
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+ pvalidate_pages(vaddr, npages, true);
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
#include <asm/setup.h>
#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
/* Protects unaccepted memory bitmap */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
tdx_accept_memory(range_start * PMD_SIZE,
range_end * PMD_SIZE);
+ } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ snp_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
} else {
panic("Cannot accept memory: unknown platform\n");
}
--
2.36.1
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
> 	select INSTRUCTION_DECODER
> 	select ARCH_HAS_CC_PLATFORM
> 	select X86_MEM_ENCRYPT
> +	select UNACCEPTED_MEMORY
> 	help
> 	  Say yes to enable support for the encryption of system memory.
> 	  This requires an AMD processor that supports Secure Memory
At the risk of starting another centithread like on Kirill's patches for
unaccepted memory, I think this needs to be brought up.
By making unaccepted_memory an option rather than a dependency, we get
into an inescapable situation of always needing to know, from within the
firmware, whether or not the guest OS will support unaccepted memory. I
think that makes a UEFI specification change necessary.
If we don't make this configurable, and indeed make it a dependency,
then we can say SEV-SNP implies that the firmware should create
unaccepted memory. We can work around the short gap of support between
kernel versions.
What are your thoughts on dependency versus UEFI spec change to allow
this configuration to be negotiated with the firmware?
--
-Dionna Glaze, PhD (she/her)
>
> +void snp_accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> + unsigned long vaddr;
> + unsigned int npages;
> +
> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
> + return;
> +
> + vaddr = (unsigned long)__va(start);
> + npages = (end - start) >> PAGE_SHIFT;
> +
> + set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
> +
> + pvalidate_pages(vaddr, npages, true);
> +}
My testing of this patch shows that a significant amount of time is
spent using the MSR protocol to change page state, in such a
significant fashion that it's slower than eagerly accepting all
memory. The difference gets worse as the RAM size goes up, so I think
there's some phase problem with the GHCB protocol not getting used
early enough?
--
-Dionna Glaze, PhD (she/her)
On 8/22/22 19:24, Dionna Amalie Glaze wrote:
>>
>> +void snp_accept_memory(phys_addr_t start, phys_addr_t end)
>> +{
>> + unsigned long vaddr;
>> + unsigned int npages;
>> +
>> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>> + return;
>> +
>> + vaddr = (unsigned long)__va(start);
>> + npages = (end - start) >> PAGE_SHIFT;
>> +
>> + set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
>> +
>> + pvalidate_pages(vaddr, npages, true);
>> +}
>
> My testing of this patch shows that a significant amount of time is
> spent using the MSR protocol to change page state, in such a
> significant fashion that it's slower than eagerly accepting all
> memory. The difference gets worse as the RAM size goes up, so I think
> there's some phase problem with the GHCB protocol not getting used
> early enough?
Thank you for testing. Let me see what I can find. I might have to rework
Brijesh's original patches more to make use of the early boot GHCB in
order to cut down on the number of MSR protocol requests.
Thanks,
Tom
>
This series adds SEV-SNP support for unaccepted memory to the patch series
titled:
[PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This led to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:
- A pre-patch to switch from a kmalloc()'d page state change structure
to a (smaller) stack-based page state change structure.
- SNP support for unaccepted memory.
The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-unaccepted-memory
---
Changes since v2:
- Improve code comments in regards to when to use the per-CPU GHCB vs
the MSR protocol and why a single global value is valid for both
the BSP and APs.
- Add a comment related to the number of PSC entries and how it can
impact the size of the struct and, therefore, stack usage.
- Add a WARN_ON_ONCE() for invoking vmgexit_psc() when per-CPU GHCBs
haven't been created or registered, yet.
- Use the compiler support for clearing the PSC struct instead of
issuing memset().
Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
structure.
Tom Lendacky (2):
x86/sev: Put PSC struct on the stack in prep for unaccepted memory
support
x86/sev: Add SNP-specific unaccepted memory support
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 ++
arch/x86/boot/compressed/sev.c | 10 ++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++++++++++
arch/x86/include/asm/sev-common.h | 9 ++++--
arch/x86/include/asm/sev.h | 3 ++
arch/x86/kernel/sev.c | 48 +++++++++++++++++++++++++------
arch/x86/mm/unaccepted_memory.c | 4 +++
8 files changed, 90 insertions(+), 11 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
--
2.36.1
In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.
The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.
Also, since set_pages_state() uses the per-CPU GHCB, add a static variable
that indicates when per-CPU GHCBs are available. Until they are available,
the GHCB MSR protocol is used to perform page state changes. For APs, the
per-CPU GHCB is created before they are started and registered upon
startup, so this flag can be used globally for the BSP and APs instead of
creating a per-CPU flag.
If either the reduction in PSC entries or the use of the MSR protocol
until the per-CPU GHCBs are available results in any kind of performance
issue (none is seen at the moment), then a larger static PSC struct with
fallback to the smaller stack version, or use of the boot GHCB prior to
per-CPU GHCBs, can be investigated.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/sev-common.h | 9 +++++++--
arch/x86/kernel/sev.c | 32 +++++++++++++++++++++++--------
2 files changed, 31 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..6c3d61c5f6a3 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -106,8 +106,13 @@ enum psc_op {
#define GHCB_HV_FT_SNP BIT_ULL(0)
#define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)
-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY 253
+/*
+ * SNP Page State Change NAE event
+ * The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure,
+ * which is a local variable (stack usage) in set_pages_state(). Do not
+ * increase this value without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY 64
struct psc_hdr {
u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..40268ce97aad 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,6 +66,17 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
*/
static struct ghcb *boot_ghcb __section(".data");
+/*
+ * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
+ * been created and registered and thus can be used instead of using the MSR
+ * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
+ * which only works with a per-CPU GHCB.
+ *
+ * For APs, the per-CPU GHCB is created before they are started and registered
+ * upon startup, so this flag can be used globally for the BSP and APs.
+ */
+static bool ghcb_percpu_ready __section(".data");
+
/* Bitmap of SEV features supported by the hypervisor */
static u64 sev_hv_features __ro_after_init;
@@ -660,7 +671,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
}
}
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -751,6 +762,8 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
unsigned long flags;
struct ghcb *ghcb;
+ WARN_ON_ONCE(!ghcb_percpu_ready);
+
/*
* __sev_get_ghcb() needs to run with IRQs disabled because it is using
* a per-CPU GHCB.
@@ -868,11 +881,14 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
unsigned long vaddr_end, next_vaddr;
- struct snp_psc_desc *desc;
+ struct snp_psc_desc desc = {};
- desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
- if (!desc)
- panic("SNP: failed to allocate memory for PSC descriptor\n");
+ /*
+ * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
+ * since vmgexit_psc() uses the per-CPU GHCB.
+ */
+ if (!ghcb_percpu_ready)
+ return early_set_pages_state(__pa(vaddr), npages, op);
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +898,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
next_vaddr = min_t(unsigned long, vaddr_end,
(VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
- __set_pages_state(desc, vaddr, next_vaddr, op);
+ __set_pages_state(&desc, vaddr, next_vaddr, op);
vaddr = next_vaddr;
}
-
- kfree(desc);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1254,6 +1268,8 @@ void setup_ghcb(void)
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
snp_register_per_cpu_ghcb();
+ ghcb_percpu_ready = true;
+
return;
}
--
2.36.1
On Mon, Aug 15, 2022 at 10:57:42AM -0500, Tom Lendacky wrote:
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..40268ce97aad 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,6 +66,17 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
> */
> static struct ghcb *boot_ghcb __section(".data");
>
> +/*
> + * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
> + * been created and registered and thus can be used instead of using the MSR
> + * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
> + * which only works with a per-CPU GHCB.
> + *
> + * For APs, the per-CPU GHCB is created before they are started and registered
> + * upon startup, so this flag can be used globally for the BSP and APs.
> + */
Ok, better, thanks!
> +static bool ghcb_percpu_ready __section(".data");
However, it reads really weird if you have "percpu" in the name of a
variable which is not per CPU...
Let's just call it "ghcbs_initialized" and be done with it.
And I still hate the whole thing ofc.
Do this ontop (and I knew we had a flags thing already):
(And yes, __read_mostly is in the .data section too).
---
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 40268ce97aad..5b3afbf26349 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,17 +66,6 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
*/
static struct ghcb *boot_ghcb __section(".data");
-/*
- * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
- * been created and registered and thus can be used instead of using the MSR
- * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
- * which only works with a per-CPU GHCB.
- *
- * For APs, the per-CPU GHCB is created before they are started and registered
- * upon startup, so this flag can be used globally for the BSP and APs.
- */
-static bool ghcb_percpu_ready __section(".data");
-
/* Bitmap of SEV features supported by the hypervisor */
static u64 sev_hv_features __ro_after_init;
@@ -128,7 +117,18 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
struct sev_config {
__u64 debug : 1,
- __reserved : 63;
+
+ /*
+ * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
+ * been created and registered and thus can be used instead of using the MSR
+ * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
+ * which only works with a per-CPU GHCB.
+ *
+ * For APs, the per-CPU GHCB is created before they are started and registered
+ * upon startup, so this flag can be used globally for the BSP and APs.
+ */
+ ghcbs_initialized : 1,
+ __reserved : 62;
};
static struct sev_config sev_cfg __read_mostly;
@@ -762,7 +762,7 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
unsigned long flags;
struct ghcb *ghcb;
- WARN_ON_ONCE(!ghcb_percpu_ready);
+ WARN_ON_ONCE(!sev_cfg.ghcbs_initialized);
/*
* __sev_get_ghcb() needs to run with IRQs disabled because it is using
@@ -887,7 +887,7 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
* Use the MSR protocol when the per-CPU GHCBs are not yet registered,
* since vmgexit_psc() uses the per-CPU GHCB.
*/
- if (!ghcb_percpu_ready)
+ if (!sev_cfg.ghcbs_initialized)
return early_set_pages_state(__pa(vaddr), npages, op);
vaddr = vaddr & PAGE_MASK;
@@ -1268,7 +1268,7 @@ void setup_ghcb(void)
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
snp_register_per_cpu_ghcb();
- ghcb_percpu_ready = true;
+ sev_cfg.ghcbs_initialized = true;
return;
}
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 8/17/22 11:08, Borislav Petkov wrote:
> On Mon, Aug 15, 2022 at 10:57:42AM -0500, Tom Lendacky wrote:
>> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
>> index c05f0124c410..40268ce97aad 100644
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -66,6 +66,17 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>> */
>> static struct ghcb *boot_ghcb __section(".data");
>>
>> +/*
>> + * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
>> + * been created and registered and thus can be used instead of using the MSR
>> + * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
>> + * which only works with a per-CPU GHCB.
>> + *
>> + * For APs, the per-CPU GHCB is created before they are started and registered
>> + * upon startup, so this flag can be used globally for the BSP and APs.
>> + */
>
> Ok, better, thanks!
>
>> +static bool ghcb_percpu_ready __section(".data");
>
> However, it reads really weird if you have "percpu" in the name of a
> variable which is not per CPU...
>
> Let's just call it "ghcbs_initialized" and be done with it.
>
> And I still hate the whole thing ofc.
>
> Do this ontop (and I knew we had a flags thing already):
>
> (And yes, __read_mostly is in the .data section too).
Cool, will do.
Thanks,
Tom
>
> ---
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 40268ce97aad..5b3afbf26349 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,17 +66,6 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
> */
> static struct ghcb *boot_ghcb __section(".data");
>
> -/*
> - * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
> - * been created and registered and thus can be used instead of using the MSR
> - * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
> - * which only works with a per-CPU GHCB.
> - *
> - * For APs, the per-CPU GHCB is created before they are started and registered
> - * upon startup, so this flag can be used globally for the BSP and APs.
> - */
> -static bool ghcb_percpu_ready __section(".data");
> -
> /* Bitmap of SEV features supported by the hypervisor */
> static u64 sev_hv_features __ro_after_init;
>
> @@ -128,7 +117,18 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
>
> struct sev_config {
> __u64 debug : 1,
> - __reserved : 63;
> +
> + /*
> + * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
> + * been created and registered and thus can be used instead of using the MSR
> + * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
> + * which only works with a per-CPU GHCB.
> + *
> + * For APs, the per-CPU GHCB is created before they are started and registered
> + * upon startup, so this flag can be used globally for the BSP and APs.
> + */
> + ghcbs_initialized : 1,
> + __reserved : 62;
> };
>
> static struct sev_config sev_cfg __read_mostly;
> @@ -762,7 +762,7 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
> unsigned long flags;
> struct ghcb *ghcb;
>
> - WARN_ON_ONCE(!ghcb_percpu_ready);
> + WARN_ON_ONCE(!sev_cfg.ghcbs_initialized);
>
> /*
> * __sev_get_ghcb() needs to run with IRQs disabled because it is using
> @@ -887,7 +887,7 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
> * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
> * since vmgexit_psc() uses the per-CPU GHCB.
> */
> - if (!ghcb_percpu_ready)
> + if (!sev_cfg.ghcbs_initialized)
> return early_set_pages_state(__pa(vaddr), npages, op);
>
> vaddr = vaddr & PAGE_MASK;
> @@ -1268,7 +1268,7 @@ void setup_ghcb(void)
> if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
> snp_register_per_cpu_ghcb();
>
> - ghcb_percpu_ready = true;
> + sev_cfg.ghcbs_initialized = true;
>
> return;
> }
>
Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.
The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.
Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +++
arch/x86/boot/compressed/sev.c | 10 +++++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++++++++++++++++++
arch/x86/include/asm/sev.h | 3 +++
arch/x86/kernel/sev.c | 16 ++++++++++++++++
arch/x86/mm/unaccepted_memory.c | 4 ++++
7 files changed, 59 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/boot/compressed/sev.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select UNACCEPTED_MEMORY
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
#include "find.h"
#include "math.h"
#include "tdx.h"
+#include "sev.h"
#include <asm/shared/tdx.h>
#define PMD_SHIFT 21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
/* Platform-specific memory-acceptance call goes here */
if (is_tdx_guest())
tdx_accept_memory(start, end);
+ else if (sev_snp_enabled())
+ snp_accept_memory(start, end);
else
error("Cannot accept memory: unknown platform\n");
}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
/* Include code for early handlers */
#include "../../kernel/sev-shared.c"
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ while (end > start) {
+ snp_set_page_private(start);
+ start += PAGE_SIZE;
+ }
+}
+
static bool early_setup_ghcb(void)
{
if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
#endif
#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 40268ce97aad..d71740f54277 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -924,6 +924,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
pvalidate_pages(vaddr, npages, true);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long vaddr;
+ unsigned int npages;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ vaddr = (unsigned long)__va(start);
+ npages = (end - start) >> PAGE_SHIFT;
+
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+ pvalidate_pages(vaddr, npages, true);
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
#include <asm/setup.h>
#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
/* Protects unaccepted memory bitmap */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
tdx_accept_memory(range_start * PMD_SIZE,
range_end * PMD_SIZE);
+ } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ snp_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
} else {
panic("Cannot accept memory: unknown platform\n");
}
--
2.36.1
On Mon, Aug 15, 2022 at 10:57:43AM -0500, Tom Lendacky wrote:
> Add SNP-specific hooks to the unaccepted memory support in the boot
> path (__accept_memory()) and the core kernel (accept_memory()) in order
> to support booting SNP guests when unaccepted memory is present. Without
> this support, SNP guests will fail to boot and/or panic() when unaccepted
> memory is present in the EFI memory map.
>
> The process of accepting memory under SNP involves invoking the hypervisor
> to perform a page state change for the page to private memory and then
> issuing a PVALIDATE instruction to accept the page.
>
> Create the new header file arch/x86/boot/compressed/sev.h because adding
> the function declaration to any of the existing SEV related header files
> pulls in too many other header files, causing the build to fail.
>
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/boot/compressed/mem.c | 3 +++
> arch/x86/boot/compressed/sev.c | 10 +++++++++-
> arch/x86/boot/compressed/sev.h | 23 +++++++++++++++++++++++
> arch/x86/include/asm/sev.h | 3 +++
> arch/x86/kernel/sev.c | 16 ++++++++++++++++
> arch/x86/mm/unaccepted_memory.c | 4 ++++
> 7 files changed, 59 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/boot/compressed/sev.h
Looks mostly ok to me...
> diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> index 730c4677e9db..d4b06c862094 100644
> --- a/arch/x86/boot/compressed/sev.c
> +++ b/arch/x86/boot/compressed/sev.c
> @@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
> /* Include code for early handlers */
> #include "../../kernel/sev-shared.c"
>
> -static inline bool sev_snp_enabled(void)
> +bool sev_snp_enabled(void)
> {
> return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
> }
This is another one of my pet peeves and now it even gets exported but
it is the early decompressor crap so I won't even try to mention cc_*
helpers...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
This series adds SEV-SNP support for unaccepted memory to the patch series
titled:
[PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This led to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:
- A pre-patch to switch from a kmalloc()'d page state change structure
to a (smaller) stack-based page state change structure.
- SNP support for unaccepted memory.
The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-unaccepted-memory
---
Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
structure.
Tom Lendacky (2):
x86/sev: Put PSC struct on the stack in prep for unaccepted memory
support
x86/sev: Add SNP-specific unaccepted memory support
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +++
arch/x86/boot/compressed/sev.c | 10 +++++++-
arch/x86/boot/compressed/sev.h | 23 ++++++++++++++++++
arch/x86/include/asm/sev-common.h | 2 +-
arch/x86/include/asm/sev.h | 3 +++
arch/x86/kernel/sev.c | 40 ++++++++++++++++++++++++-------
arch/x86/mm/unaccepted_memory.c | 4 ++++
8 files changed, 76 insertions(+), 10 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h
--
2.36.1
In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.
The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests.
Also, since set_pages_state() uses the per-CPU GHCB, add a static variable
that indicates when per-CPU GHCBs are available. Until they are available,
the GHCB MSR protocol is used to perform page state changes.
If either the reduction in PSC entries or the use of the MSR protocol
until the per-CPU GHCBs are available results in any kind of performance
issue (none is seen at the moment), then a larger static PSC struct with
fallback to the smaller stack version, or use of the boot GHCB prior to
per-CPU GHCBs, can be investigated.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/sev-common.h | 2 +-
arch/x86/kernel/sev.c | 24 ++++++++++++++++--------
2 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..6f7268a817fc 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -107,7 +107,7 @@ enum psc_op {
#define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)
/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY 253
+#define VMGEXIT_PSC_MAX_ENTRY 64
struct psc_hdr {
u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..275aa890611f 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
*/
static struct ghcb *boot_ghcb __section(".data");
+/* Flag to indicate when the first per-CPU GHCB is registered */
+static bool ghcb_percpu_ready __section(".data");
+
/* Bitmap of SEV features supported by the hypervisor */
static u64 sev_hv_features __ro_after_init;
@@ -660,7 +663,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
}
}
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -868,11 +871,16 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
{
unsigned long vaddr_end, next_vaddr;
- struct snp_psc_desc *desc;
+ struct snp_psc_desc desc;
+
+ /*
+ * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
+ * since vmgexit_psc() uses the per-CPU GHCB.
+ */
+ if (!ghcb_percpu_ready)
+ return early_set_pages_state(__pa(vaddr), npages, op);
- desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
- if (!desc)
- panic("SNP: failed to allocate memory for PSC descriptor\n");
+ memset(&desc, 0, sizeof(desc));
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +890,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
next_vaddr = min_t(unsigned long, vaddr_end,
(VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
- __set_pages_state(desc, vaddr, next_vaddr, op);
+ __set_pages_state(&desc, vaddr, next_vaddr, op);
vaddr = next_vaddr;
}
-
- kfree(desc);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1254,6 +1260,8 @@ void setup_ghcb(void)
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
snp_register_per_cpu_ghcb();
+ ghcb_percpu_ready = true;
+
return;
}
--
2.36.1
On Mon, Aug 08, 2022 at 12:16:24PM -0500, Tom Lendacky wrote:
> In advance of providing support for unaccepted memory, switch from using
> kmalloc() for allocating the Page State Change (PSC) structure to using a
> local variable that lives on the stack. This is needed to avoid a possible
> recursive call into set_pages_state() if the kmalloc() call requires
> (more) memory to be accepted, which would result in a hang.
I don't understand: kmalloc() allocates memory which is unaccepted?
> The current size of the PSC struct is 2,032 bytes. To make the struct more
> stack friendly, reduce the number of PSC entries from 253 down to 64,
> resulting in a size of 520 bytes. This is a nice compromise on struct size
> and total PSC requests.
Why can't you simply allocate that one PSC page once at boot, accept the
memory for it and use it throughout? Under locking, ofc, if multiple PSC
calls need to happen in parallel...
Instead of limiting the PSC req size.
> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
> if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
> snp_register_per_cpu_ghcb();
>
> + ghcb_percpu_ready = true;
You know how I can't stand those random boolean vars stating something
has been initialized?
Can't you at least use some field in struct ghcb.reserved_1[] or so
which the spec can provide to OS use so that FW doesn't touch it?
And then stick a "percpu_ready" bit there.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 8/12/22 08:03, Borislav Petkov wrote:
> On Mon, Aug 08, 2022 at 12:16:24PM -0500, Tom Lendacky wrote:
>> In advance of providing support for unaccepted memory, switch from using
>> kmalloc() for allocating the Page State Change (PSC) structure to using a
>> local variable that lives on the stack. This is needed to avoid a possible
>> recursive call into set_pages_state() if the kmalloc() call requires
>> (more) memory to be accepted, which would result in a hang.
>
> I don't understand: kmalloc() allocates memory which is unaccepted?

In order to satisfy the kmalloc() some memory has to be accepted. So it
tries to accept some additional memory, but we're already in the accept
memory path... deadlock.

>> The current size of the PSC struct is 2,032 bytes. To make the struct more
>> stack friendly, reduce the number of PSC entries from 253 down to 64,
>> resulting in a size of 520 bytes. This is a nice compromise on struct size
>> and total PSC requests.
>
> Why can't you simply allocate that one PSC page once at boot, accept the
> memory for it and use it throughout? Under locking, ofc, if multiple PSC
> calls need to happen in parallel...
>
> Instead of limiting the PSC req size.

There was a whole discussion on this and I would prefer to keep the
ability to parallelize PSC without locking.

>> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
>> 	if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>> 		snp_register_per_cpu_ghcb();
>>
>> +	ghcb_percpu_ready = true;
>
> You know how I can't stand those random boolean vars stating something
> has been initialized?
>
> Can't you at least use some field in struct ghcb.reserved_1[] or so
> which the spec can provide to OS use so that FW doesn't touch it?

Well when we don't know which GHCB is in use, using that reserved area in
the GHCB doesn't help. Also, I don't want to update the GHCB
specification for a single bit that is only required because of the way
Linux went about establishing the GHCB usage.

Thanks,
Tom

> And then stick a "percpu_ready" bit there.
>
> Thx.
On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
> There was a whole discussion on this
Pointer to it?
> and I would prefer to keep the ability to parallelize PSC without
> locking.
So smaller, on-stack PSC but lockless is still better than a bigger one
but with synchronized accesses to it?
> Well when we don't know which GHCB is in use, using that reserved area in
> the GHCB doesn't help.
What do you mean?
The one which you read with
data = this_cpu_read(runtime_data);
in snp_register_per_cpu_ghcb() is the one you register.
> Also, I don't want to update the GHCB specification for a single bit
> that is only required because of the way Linux went about establishing
> the GHCB usage.
Linux?
You mean, you did it this way: 885689e47dfa1499b756a07237eb645234d93cf9
:-)
"The runtime handler needs one GHCB per-CPU. Set them up and map them
unencrypted."
Why does that handler need one GHCB per CPU?
As to the field, I was thinking along the lines of
struct ghcb.vendor_flags
field which each virt vendor can use however they like.
It might be overkill but a random bool ain't pretty either. Especially
if those things start getting added for all kinds of other things.
If anything, you could make this a single u64 sev_flags which can at
least collect all that gunk in one variable ... at least...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 8/12/22 09:33, Borislav Petkov wrote:
> On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
>> There was a whole discussion on this
>
> Pointer to it?

It starts here: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com/

>> and I would prefer to keep the ability to parallelize PSC without
>> locking.
>
> So smaller, on-stack PSC but lockless is still better than a bigger one
> but with synchronized accesses to it?
>
>> Well when we don't know which GHCB is in use, using that reserved area in
>> the GHCB doesn't help.
>
> What do you mean?
>
> The one which you read with
>
> 	data = this_cpu_read(runtime_data);

Memory acceptance is called before the per-CPU GHCBs have been allocated
and so you would actually be using the early boot GHCB. And that is
decided based on the #VC handler that is invoked - but in this case we're
not coming through the #VC handler to accept memory.

> in snp_register_per_cpu_ghcb() is the one you register.
>
>> Also, I don't want to update the GHCB specification for a single bit
>> that is only required because of the way Linux went about establishing
>> the GHCB usage.
>
> Linux?
>
> You mean, you did it this way: 885689e47dfa1499b756a07237eb645234d93cf9
>
> :-)

Well Joerg re-worked all that quite a bit. And with the SNP support, the
added requirement of registering the GHCB changed which GHCB could be
used. So even when the per-CPU GHCB is allocated, it can't be used until
it is registered, which depends on when the #VC handler is changed from
the boot #VC handler to the runtime #VC handler.

> "The runtime handler needs one GHCB per-CPU. Set them up and map them
> unencrypted."
>
> Why does that handler need one GHCB per CPU?

Each vCPU can be handling a #VC and you don't want to be serializing on a
single GHCB.

Thanks,
Tom

> As to the field, I was thinking along the lines of
>
> 	struct ghcb.vendor_flags
>
> field which each virt vendor can use however they like.
>
> It might be overkill but a random bool ain't pretty either. Especially
> if those things start getting added for all kinds of other things.
>
> If anything, you could make this a single u64 sev_flags which can at
> least collect all that gunk in one variable ... at least...
>
> Thx.
On Fri, Aug 12, 2022 at 09:51:41AM -0500, Tom Lendacky wrote:
> On 8/12/22 09:33, Borislav Petkov wrote:
> > On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
> > > There was a whole discussion on this
> >
> > Pointer to it?
>
> It starts here: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com/
So how come none of the rationale for the on-stack decision vs a single
buffer with a spinlock protection hasn't made it to this patch?
We need to have the reason why this thing is changed documented
somewhere.
> > So smaller, on-stack PSC but lockless is still better than a bigger one
> > but with synchronized accesses to it?
That thing.
That decision for on-stack buffer needs explaining why.
> > > Well when we don't know which GHCB is in use, using that reserved area in
> > > the GHCB doesn't help.
> >
> > What do you mean?
> >
> > The one which you read with
> >
> > data = this_cpu_read(runtime_data);
>
> Memory acceptance is called before the per-CPU GHCBs have been allocated
> and so you would actually be using the early boot GHCB. And that is decided
> based on the #VC handler that is invoked - but in this case we're not
> coming through the #VC handler to accept memory.
But then ghcb_percpu_ready needs to be a per-CPU variable too! Because
it is set right after snp_register_per_cpu_ghcb() which works on the
*per-CPU* GHCB.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 8/13/22 14:40, Borislav Petkov wrote:
> On Fri, Aug 12, 2022 at 09:51:41AM -0500, Tom Lendacky wrote:
>> On 8/12/22 09:33, Borislav Petkov wrote:
>>> On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
>>>> There was a whole discussion on this
>>>
>>> Pointer to it?
>>
>> It starts here: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com/
>
> So how come none of the rationale for the on-stack decision vs a single
> buffer with a spinlock protection hasn't made it to this patch?
>
> We need to have the reason why this thing is changed documented
> somewhere.

Yup, was all being addressed in v3 based on Dave's comments.

>>> So smaller, on-stack PSC but lockless is still better than a bigger one
>>> but with synchronized accesses to it?
>
> That thing.
>
> That decision for on-stack buffer needs explaining why.
>
>>>> Well when we don't know which GHCB is in use, using that reserved area in
>>>> the GHCB doesn't help.
>>>
>>> What do you mean?
>>>
>>> The one which you read with
>>>
>>> 	data = this_cpu_read(runtime_data);
>>
>> Memory acceptance is called before the per-CPU GHCBs have been allocated
>> and so you would actually be using the early boot GHCB. And that is
>> decided based on the #VC handler that is invoked - but in this case
>> we're not coming through the #VC handler to accept memory.
>
> But then ghcb_percpu_ready needs to be a per-CPU variable too! Because
> it is set right after snp_register_per_cpu_ghcb() which works on the
> *per-CPU* GHCB.

No, and the code comment will explain this. Since the APs only ever use
the per-CPU GHCB there is no concern as to when there is a switch over
from the early boot GHCB to the per-CPU GHCB, so a single global variable
is all that is needed.

I'll send out v3 soon.

Thanks,
Tom
On 8/8/22 10:16, Tom Lendacky wrote:
...
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index b8357d6ecd47..6f7268a817fc 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -107,7 +107,7 @@ enum psc_op {
> #define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)
>
> /* SNP Page State Change NAE event */
> -#define VMGEXIT_PSC_MAX_ENTRY 253
> +#define VMGEXIT_PSC_MAX_ENTRY 64
In general, the stack-based allocation looks fine. It might be worth a
comment in there to make it clear that this can consume stack space.
> struct psc_hdr {
> u16 cur_entry;
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..275aa890611f 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
> */
> static struct ghcb *boot_ghcb __section(".data");
>
> +/* Flag to indicate when the first per-CPU GHCB is registered */
> +static bool ghcb_percpu_ready __section(".data");
So, there's a code path that can't be entered until this is set? Seems
like the least we can do is annotate that path with a
WARN_ON_ONCE(!ghcb_percpu_ready).
Also, how does having _one_ global variable work for indicating the
state of multiple per-cpu structures? The code doesn't seem to delay
setting this variable until after _all_ of the per-cpu state is ready.
> /* Bitmap of SEV features supported by the hypervisor */
> static u64 sev_hv_features __ro_after_init;
>
> @@ -660,7 +663,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
> }
> }
>
> -static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
> +static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
> {
> unsigned long paddr_end;
> u64 val;
> @@ -868,11 +871,16 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
> static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
> {
> unsigned long vaddr_end, next_vaddr;
> - struct snp_psc_desc *desc;
> + struct snp_psc_desc desc;
> +
> + /*
> + * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
> + * since vmgexit_psc() uses the per-CPU GHCB.
> + */
> + if (!ghcb_percpu_ready)
> + return early_set_pages_state(__pa(vaddr), npages, op);
>
> - desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
> - if (!desc)
> - panic("SNP: failed to allocate memory for PSC descriptor\n");
> + memset(&desc, 0, sizeof(desc));
Why is this using memset()? The compiler should be smart enough to
delay initializing 'desc' until after the return with this kind of
construct:
struct snp_psc_desc desc = {};
if (foo)
return;
// use 'desc' here
The compiler *knows* there is no access to 'desc' before the if().
> vaddr = vaddr & PAGE_MASK;
> vaddr_end = vaddr + (npages << PAGE_SHIFT);
> @@ -882,12 +890,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
> next_vaddr = min_t(unsigned long, vaddr_end,
> (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
>
> - __set_pages_state(desc, vaddr, next_vaddr, op);
> + __set_pages_state(&desc, vaddr, next_vaddr, op);
>
> vaddr = next_vaddr;
> }
> -
> - kfree(desc);
> }
>
> void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
> if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
> snp_register_per_cpu_ghcb();
>
> + ghcb_percpu_ready = true;
> +
> return;
> }
>
On 8/8/22 16:43, Dave Hansen wrote:
> On 8/8/22 10:16, Tom Lendacky wrote:
> ...
>> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
>> index b8357d6ecd47..6f7268a817fc 100644
>> --- a/arch/x86/include/asm/sev-common.h
>> +++ b/arch/x86/include/asm/sev-common.h
>> @@ -107,7 +107,7 @@ enum psc_op {
>> #define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)
>>
>> /* SNP Page State Change NAE event */
>> -#define VMGEXIT_PSC_MAX_ENTRY 253
>> +#define VMGEXIT_PSC_MAX_ENTRY 64
>
> In general, the stack-based allocation looks fine. It might be worth a
> comment in there to make it clear that this can consume stack space.
I'll add that.
>
>> struct psc_hdr {
>> u16 cur_entry;
>> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
>> index c05f0124c410..275aa890611f 100644
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>> */
>> static struct ghcb *boot_ghcb __section(".data");
>>
>> +/* Flag to indicate when the first per-CPU GHCB is registered */
>> +static bool ghcb_percpu_ready __section(".data");
>
> So, there's a code path that can't be entered until this is set? Seems
> like the least we can do it annotate that path with a
> WARN_ON_ONCE(!ghcb_percpu_ready).
Sure, that can be added. Right now the only function that calls
vmgexit_psc() is covered (set_pages_state()/__set_pages_state()) and is
doing the right thing. But I guess if anything is added in the future,
that will provide details on what happened.
>
> Also, how does having _one_ global variable work for indicating the
> state of multiple per-cpu structures? The code doesn't seem to delay
> setting this variable until after _all_ of the per-cpu state is ready.
All of the per-CPU GHCBs are allocated during the BSP boot, before any AP
is started. The APs only ever run the kernel_exc_vmm_communication() #VC
handler and only ever use the per-CPU version of the GHCB, never the early
boot version. This is based on the initial_vc_handler being switched to
the runtime #VC handler, kernel_exc_vmm_communication.
The trigger for the switch over for the BSP from the early boot GHCB to
the per-CPU GHCB is during setup_ghcb() after the initial_vc_handler has
been switched to kernel_exc_vmm_communication, which is just after the
per-CPU allocations. By putting the setting of the ghcb_percpu_ready in
setup_ghcb(), it indicates that the BSP per-CPU GHCB has been registered
and can be used.
>
>> /* Bitmap of SEV features supported by the hypervisor */
>> static u64 sev_hv_features __ro_after_init;
>>
>> @@ -660,7 +663,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
>> }
>> }
>>
>> -static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
>> +static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
>> {
>> unsigned long paddr_end;
>> u64 val;
>> @@ -868,11 +871,16 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
>> static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>> {
>> unsigned long vaddr_end, next_vaddr;
>> - struct snp_psc_desc *desc;
>> + struct snp_psc_desc desc;
>> +
>> + /*
>> + * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
>> + * since vmgexit_psc() uses the per-CPU GHCB.
>> + */
>> + if (!ghcb_percpu_ready)
>> + return early_set_pages_state(__pa(vaddr), npages, op);
>>
>> - desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
>> - if (!desc)
>> - panic("SNP: failed to allocate memory for PSC descriptor\n");
>> + memset(&desc, 0, sizeof(desc));
>
> Why is this using memset()? The compiler should be smart enough to
> delay initializing 'desc' until after the return with this kind of
> construct:
>
> struct snp_psc_desc desc = {};
> if (foo)
> return;
> // use 'desc' here
>
> The compiler *knows* there is no access to 'desc' before the if().
Yup, I can change that.
Thanks,
Tom
>
>
>> vaddr = vaddr & PAGE_MASK;
>> vaddr_end = vaddr + (npages << PAGE_SHIFT);
>> @@ -882,12 +890,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>> next_vaddr = min_t(unsigned long, vaddr_end,
>> (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
>>
>> - __set_pages_state(desc, vaddr, next_vaddr, op);
>> + __set_pages_state(&desc, vaddr, next_vaddr, op);
>>
>> vaddr = next_vaddr;
>> }
>> -
>> - kfree(desc);
>> }
>>
>> void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
>> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
>> if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>> snp_register_per_cpu_ghcb();
>>
>> + ghcb_percpu_ready = true;
>> +
>> return;
>> }
>>
>
On 8/8/22 15:18, Tom Lendacky wrote:
>>> +/* Flag to indicate when the first per-CPU GHCB is registered */
>>> +static bool ghcb_percpu_ready __section(".data");
>>
>> So, there's a code path that can't be entered until this is set? Seems
>> like the least we can do it annotate that path with a
>> WARN_ON_ONCE(!ghcb_percpu_ready).
>
> Sure, that can be added. Right now the only function that calls
> vmgexit_psc() is covered (set_pages_state()/__set_pages_state()) and is
> doing the right thing. But I guess if anything is added in the future,
> that will provide details on what happened.
>
>>
>> Also, how does having _one_ global variable work for indicating the
>> state of multiple per-cpu structures? The code doesn't seem to delay
>> setting this variable until after _all_ of the per-cpu state is ready.
>
> All of the per-CPU GHCBs are allocated during the BSP boot, before any
> AP is started. The APs only ever run the kernel_exc_vmm_communication()
> #VC handler and only ever use the per-CPU version of the GHCB, never the
> early boot version. This is based on the initial_vc_handler being
> switched to the runtime #VC handler, kernel_exc_vmm_communication.
>
> The trigger for the switch over for the BSP from the early boot GHCB to
> the per-CPU GHCB is during setup_ghcb() after the initial_vc_handler has
> been switched to kernel_exc_vmm_communication, which is just after the
> per-CPU allocations. By putting the setting of the ghcb_percpu_ready in
> setup_ghcb(), it indicates that the BSP per-CPU GHCB has been registered
> and can be used.
That description makes the proposed comment even more confusing:
/* Flag to indicate when the first per-CPU GHCB is registered */
The important thing is that this variable is only _useful_ for the boot
CPU. After the boot CPU has allocated space for _itself_, it can then
go and stop using the MSR-based method.
The reason it's set after "the first" is because "the first" is also the
boot CPU, but referring to it as the "the first" is a bit oblique.
Maybe something like this:
/*
* Set after the boot CPU's GHCB is registered. At that point,
* it can be used for calls instead of MSRs.
*/
On 8/8/22 17:33, Dave Hansen wrote:
> On 8/8/22 15:18, Tom Lendacky wrote:
>>>> +/* Flag to indicate when the first per-CPU GHCB is registered */
>>>> +static bool ghcb_percpu_ready __section(".data");
>>>
>>> So, there's a code path that can't be entered until this is set? Seems
>>> like the least we can do it annotate that path with a
>>> WARN_ON_ONCE(!ghcb_percpu_ready).
>>
>> Sure, that can be added. Right now the only function that calls
>> vmgexit_psc() is covered (set_pages_state()/__set_pages_state()) and is
>> doing the right thing. But I guess if anything is added in the future,
>> that will provide details on what happened.
>>
>>>
>>> Also, how does having _one_ global variable work for indicating the
>>> state of multiple per-cpu structures? The code doesn't seem to delay
>>> setting this variable until after _all_ of the per-cpu state is ready.
>>
>> All of the per-CPU GHCBs are allocated during the BSP boot, before any
>> AP is started. The APs only ever run the kernel_exc_vmm_communication()
>> #VC handler and only ever use the per-CPU version of the GHCB, never the
>> early boot version. This is based on the initial_vc_handler being
>> switched to the runtime #VC handler, kernel_exc_vmm_communication.
>>
>> The trigger for the switch over for the BSP from the early boot GHCB to
>> the per-CPU GHCB is during setup_ghcb() after the initial_vc_handler has
>> been switched to kernel_exc_vmm_communication, which is just after the
>> per-CPU allocations. By putting the setting of the ghcb_percpu_ready in
>> setup_ghcb(), it indicates that the BSP per-CPU GHCB has been registered
>> and can be used.
>
> That description makes the proposed comment even more confusing:
>
> /* Flag to indicate when the first per-CPU GHCB is registered */
>
> The important thing is that this variable is only _useful_ for the boot
> CPU. After the boot CPU has allocated space for _itself_, it can then
> go and stop using the MSR-based method.
>
> The reason it's set after "the first" is because "the first" is also the
> boot CPU, but referring to it as the "the first" is a bit oblique.
> Maybe something like this:
>
> /*
> * Set after the boot CPU's GHCB is registered. At that point,
> * it can be used for calls instead of MSRs.
> */
Sure, I'll work on the wording.
Thanks,
Tom
Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.
The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.
Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/mem.c | 3 +++
arch/x86/boot/compressed/sev.c | 10 +++++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++++++++++++++++++
arch/x86/include/asm/sev.h | 3 +++
arch/x86/kernel/sev.c | 16 ++++++++++++++++
arch/x86/mm/unaccepted_memory.c | 4 ++++
7 files changed, 59 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/boot/compressed/sev.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select UNACCEPTED_MEMORY
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
#include "find.h"
#include "math.h"
#include "tdx.h"
+#include "sev.h"
#include <asm/shared/tdx.h>
#define PMD_SHIFT 21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
/* Platform-specific memory-acceptance call goes here */
if (is_tdx_guest())
tdx_accept_memory(start, end);
+ else if (sev_snp_enabled())
+ snp_accept_memory(start, end);
else
error("Cannot accept memory: unknown platform\n");
}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
/* Include code for early handlers */
#include "../../kernel/sev-shared.c"
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ while (end > start) {
+ snp_set_page_private(start);
+ start += PAGE_SIZE;
+ }
+}
+
static bool early_setup_ghcb(void)
{
if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
#endif
#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 275aa890611f..f64805fa5dcb 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -916,6 +916,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
pvalidate_pages(vaddr, npages, true);
}
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long vaddr;
+ unsigned int npages;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ vaddr = (unsigned long)__va(start);
+ npages = (end - start) >> PAGE_SHIFT;
+
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+ pvalidate_pages(vaddr, npages, true);
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
#include <asm/setup.h>
#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
/* Protects unaccepted memory bitmap */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
tdx_accept_memory(range_start * PMD_SIZE,
range_end * PMD_SIZE);
+ } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ snp_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
} else {
panic("Cannot accept memory: unknown platform\n");
}
--
2.36.1
On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
>
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual
> Machine platform.
>
> Accepting memory is costly and it makes the VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until the memory is needed. It lowers boot time and reduces
> memory overhead.
>
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via the memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 has to be modified on every page acceptance. It leads
> to table fragmentation, but there's a limited number of entries in the
> e820 table.
>
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents
> 2MiB in the address space: one 4k page is enough to track 64GiB of
> physical address space.
>
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- it needs 256MiB to handle 4PiB of the address
> space.
>
> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
>
> The approach lowers boot time substantially. Boot to shell is ~2.5x
> faster for a 4G TDX VM and ~4x faster for 64G.
>
> TDX-specific code is isolated from the core of the unaccepted memory
> support. It is supposed to help plug in different implementations of
> unaccepted memory, such as SEV-SNP.
>
> The tree can be found here:
>
> https://github.com/intel/tdx.git guest-unaccepted-memory

Hi Kirill,

I have a couple of questions about this feature, mainly about how cloud
customers can use it. I assume that since this is a confidential compute
feature, a large number of the users of these patches will be cloud
customers using TDX and SNP.

One issue I see with these patches is how we as a cloud provider know
whether a customer's Linux image supports this feature. If the image
doesn't have these patches, UEFI needs to fully validate the memory; if
the image does, we can use this new protocol. In GCE we supply our VMs
with a version of the EDK2 FW and the customer doesn't have input into
which UEFI we run; as far as I can tell from the Azure SNP VM
documentation, it seems very similar. We need to somehow tell our UEFI
in the VM what to do based on the image.

The current way I can see to solve this would be to have our customers
give us metadata about their VM's image, but this seems burdensome on
our customers (I assume we'll have more features which both UEFI and
kernel need to support in order to be turned on, like this one) and
error-prone: if a customer incorrectly labels their image, it may fail
to boot.

Has there been any discussion about how to solve this? My naive thought
was: what if UEFI and the kernel had some sort of feature negotiation?
Maybe that could happen via an extension to exit boot services or a UEFI
runtime driver; I'm not sure what's best here, just some ideas.
> v7:
>  - Rework meminfo counter to use PageUnaccepted() and move to generic code;
>  - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
>  - Add Reviewed-by from David;
> v6:
>  - Fix load_unaligned_zeropad() on machine with unaccepted memory;
>  - Clear PageUnaccepted() on merged pages, leaving it only on head;
>  - Clarify error handling in allocate_e820();
>  - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
>  - Disable kexec at boottime instead of build conflict;
>  - Rebased to tip/master;
>  - Spelling fixes;
>  - Add Reviewed-by from Mike and David;
> v5:
>  - Updates comments and commit messages;
>    + Explain options for unaccepted memory handling;
>  - Expose amount of unaccepted memory in /proc/meminfo
>  - Adjust check in page_expected_state();
>  - Fix error code handling in allocate_e820();
>  - Centralize __pa()/__va() definitions in the boot stub;
>  - Avoid includes from the main kernel in the boot stub;
>  - Use an existing hole in boot_param for unaccepted_memory, instead of adding
>    to the end of the structure;
>  - Extract allocate_unaccepted_memory() form allocate_e820();
>  - Complain if there's unaccepted memory, but kernel does not support it;
>  - Fix vmstat counter;
>  - Split up few preparatory patches;
>  - Random readability adjustments;
> v4:
>  - PageBuddyUnaccepted() -> PageUnaccepted;
>  - Use separate page_type, not shared with offline;
>  - Rework interface between core-mm and arch code;
>  - Adjust commit messages;
>  - Ack from Mike;
>
> Kirill A. Shutemov (14):
>   x86/boot: Centralize __pa()/__va() definitions
>   mm: Add support for unaccepted memory
>   mm: Report unaccepted memory in meminfo
>   efi/x86: Get full memory map in allocate_e820()
>   x86/boot: Add infrastructure required for unaccepted memory support
>   efi/x86: Implement support for unaccepted memory
>   x86/boot/compressed: Handle unaccepted memory
>   x86/mm: Reserve unaccepted memory bitmap
>   x86/mm: Provide helpers for unaccepted memory
>   x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
>   x86: Disable kexec if system has unaccepted memory
>   x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
>     boot stub
>   x86/tdx: Refactor try_accept_one()
>   x86/tdx: Add unaccepted memory support
>
>  Documentation/x86/zero-page.rst          |   1 +
>  arch/x86/Kconfig                         |   1 +
>  arch/x86/boot/bitops.h                   |  40 ++++++++
>  arch/x86/boot/compressed/Makefile        |   1 +
>  arch/x86/boot/compressed/align.h         |  14 +++
>  arch/x86/boot/compressed/bitmap.c        |  43 ++++++++
>  arch/x86/boot/compressed/bitmap.h        |  49 +++++++++
>  arch/x86/boot/compressed/bits.h          |  36 +++++++
>  arch/x86/boot/compressed/compiler.h      |   9 ++
>  arch/x86/boot/compressed/efi.h           |   1 +
>  arch/x86/boot/compressed/find.c          |  54 ++++++++++
>  arch/x86/boot/compressed/find.h          |  80 +++++++++++++++
>  arch/x86/boot/compressed/ident_map_64.c  |   8 --
>  arch/x86/boot/compressed/kaslr.c         |  35 ++++---
>  arch/x86/boot/compressed/math.h          |  37 +++++++
>  arch/x86/boot/compressed/mem.c           | 111 ++++++++++++++++++++
>  arch/x86/boot/compressed/minmax.h        |  61 +++++++++++
>  arch/x86/boot/compressed/misc.c          |   6 ++
>  arch/x86/boot/compressed/misc.h          |  15 +++
>  arch/x86/boot/compressed/pgtable_types.h |  25 +++++
>  arch/x86/boot/compressed/sev.c           |   2 -
>  arch/x86/boot/compressed/tdx.c           |  78 ++++++++++++++
>  arch/x86/coco/tdx/tdx.c                  |  94 ++++++++---------
>  arch/x86/include/asm/page.h              |   3 +
>  arch/x86/include/asm/shared/tdx.h        |  47 +++++++++
>  arch/x86/include/asm/tdx.h               |  19 ----
>  arch/x86/include/asm/unaccepted_memory.h |  16 +++
>  arch/x86/include/uapi/asm/bootparam.h    |   2 +-
>  arch/x86/kernel/e820.c                   |  10 ++
>  arch/x86/mm/Makefile                     |   2 +
>  arch/x86/mm/unaccepted_memory.c          | 123 +++++++++++++++++++++++
>  drivers/base/node.c                      |   7 ++
>  drivers/firmware/efi/Kconfig             |  14 +++
>  drivers/firmware/efi/efi.c               |   1 +
>  drivers/firmware/efi/libstub/x86-stub.c  | 103 ++++++++++++++++---
>  fs/proc/meminfo.c                        |   5 +
>  include/linux/efi.h                      |   3 +-
>  include/linux/mmzone.h                   |   1 +
>  include/linux/page-flags.h               |  31 ++++++
>  mm/internal.h                            |  12 +++
>  mm/memblock.c                            |   9 ++
>  mm/page_alloc.c                          |  96 +++++++++++++++++-
>  mm/vmstat.c                              |   1 +
>  43 files changed, 1191 insertions(+), 115 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/align.h
>  create mode 100644 arch/x86/boot/compressed/bitmap.c
>  create mode 100644 arch/x86/boot/compressed/bitmap.h
>  create mode 100644 arch/x86/boot/compressed/bits.h
>  create mode 100644 arch/x86/boot/compressed/compiler.h
>  create mode 100644 arch/x86/boot/compressed/find.c
>  create mode 100644 arch/x86/boot/compressed/find.h
>  create mode 100644 arch/x86/boot/compressed/math.h
>  create mode 100644 arch/x86/boot/compressed/mem.c
>  create mode 100644 arch/x86/boot/compressed/minmax.h
>  create mode 100644 arch/x86/boot/compressed/pgtable_types.h
>  create mode 100644 arch/x86/include/asm/unaccepted_memory.h
>  create mode 100644 arch/x86/mm/unaccepted_memory.c
>
> --
> 2.35.1
On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > UEFI Specification version 2.9 introduces the concept of memory acceptance: some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, require memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific to the Virtual Machine platform.
> >
> > Accepting memory is costly and it makes the VMM allocate memory for the accepted guest physical address range. It's better to postpone memory acceptance until memory is needed. It lowers boot time and reduces memory overhead.
> >
> > The kernel needs to know what memory has been accepted. Firmware communicates this information via the memory map: a new memory type -- EFI_UNACCEPTED_MEMORY -- indicates such memory.
> >
> > Range-based tracking works fine for firmware, but it gets bulky for the kernel: e820 has to be modified on every page acceptance. It leads to table fragmentation, but there's a limited number of entries in the e820 table.
> >
> > Another option is to mark such memory as usable in e820 and track whether the range has been accepted in a bitmap. One bit in the bitmap represents 2MiB in the address space: one 4k page is enough to track 64GiB of physical address space.
> >
> > In the worst-case scenario -- a huge hole in the middle of the address space -- it needs 256MiB to handle 4PiB of the address space.
> >
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> >
> > The approach lowers boot time substantially. Boot to shell is ~2.5x faster for a 4G TDX VM and ~4x faster for 64G.
> >
> > TDX-specific code is isolated from the core of unaccepted memory support. It is supposed to help plug in different implementations of unaccepted memory, such as SEV-SNP.
> >
> > The tree can be found here:
> >
> > https://github.com/intel/tdx.git guest-unaccepted-memory
>
> Hi Kirill,
>
> I have a couple questions about this feature mainly about how cloud customers can use this, I assume since this is a confidential compute feature a large number of the users of these patches will be cloud customers using TDX and SNP. One issue I see with these patches is how do we as a cloud provider know whether a customer's linux image supports this feature, if the image doesn't have these patches UEFI needs to fully validate the memory, if the image does we can use this new protocol. In GCE we supply our VMs with a version of the EDK2 FW and the customer doesn't input into which UEFI we run, as far as I can tell from the Azure SNP VM documentation it seems very similar. We need to somehow tell our UEFI in the VM what to do based on the image.
>
> The current way I can see to solve this issue would be to have our customers give us metadata about their VM's image but this seems kinda burdensome on our customers (I assume we'll have more features which both UEFI and kernel need to both support inorder to be turned on like this one) and error-prone, if a customer incorrectly labels their image it may fail to boot. Has there been any discussion about how to solve this? My naive thoughts were what if UEFI and Kernel had some sort of feature negotiation. Maybe that could happen via an extension to exit boot services or a UEFI runtime driver, I'm not sure what's best here just some ideas.

Just as an idea, we can put info into UTS_VERSION, which can be read from the built bzImage. We have info on SMP and preemption there already.
Patch below does this:

$ file arch/x86/boot/bzImage
arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 5.19.0-rc3-00016-g2f6aa48e28d9-dirty (kas@box) #2300 SMP PREEMPT_DYNAMIC UNACCEPTED_MEMORY Mon Jun 27 14:23:04 , RO-rootFS, swap_dev 0XC, Normal VGA

Note UNACCEPTED_MEMORY in the output.

Probably we want to have there info on which flavour of unaccepted memory is supported (TDX/SNP/whatever). It is a bit more tricky.

Any opinion?

diff --git a/init/Makefile b/init/Makefile
index d82623d7fc8e..6688ea43e6bf 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -32,7 +32,7 @@ quiet_cmd_compile.h = CHK $@
 	$(CONFIG_SHELL) $(srctree)/scripts/mkcompile_h $@ \
 	"$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT_BUILD)" \
 	"$(CONFIG_PREEMPT_DYNAMIC)" "$(CONFIG_PREEMPT_RT)" \
-	"$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
+	"$(CONFIG_UNACCEPTED_MEMORY)" "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
 
 include/generated/compile.h: FORCE
 	$(call cmd,compile.h)
diff --git a/scripts/mkcompile_h b/scripts/mkcompile_h
index ca40a5258c87..efacfecad699 100755
--- a/scripts/mkcompile_h
+++ b/scripts/mkcompile_h
@@ -7,8 +7,9 @@ SMP=$3
 PREEMPT=$4
 PREEMPT_DYNAMIC=$5
 PREEMPT_RT=$6
-CC_VERSION="$7"
-LD=$8
+UNACCEPTED_MEMORY=$7
+CC_VERSION="$8"
+LD=$9
 
 # Do not expand names
 set -f
@@ -51,6 +52,10 @@ elif [ -n "$PREEMPT" ] ; then
 	CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"
 fi
 
+if [ -n "$UNACCEPTED_MEMORY" ] ; then
+	CONFIG_FLAGS="$CONFIG_FLAGS UNACCEPTED_MEMORY"
+fi
+
 # Truncate to maximum length
 UTS_LEN=64
 UTS_VERSION="$(echo $UTS_VERSION $CONFIG_FLAGS $TIMESTAMP | cut -b -$UTS_LEN)"

--
Kirill A. Shutemov
On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > UEFI Specification version 2.9 introduces the concept of memory acceptance: some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, requiring memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific for the Virtual Machine platform.
> > >
> > > Accepting memory is costly and it makes VMM allocate memory for the accepted guest physical address range. It's better to postpone memory acceptance until memory is needed. It lowers boot time and reduces memory overhead.
> > >
> > > The kernel needs to know what memory has been accepted. Firmware communicates this information via memory map: a new memory type -- EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > >
> > > Range-based tracking works fine for firmware, but it gets bulky for the kernel: e820 has to be modified on every page acceptance. It leads to table fragmentation, but there's a limited number of entries in the e820 table.
> > >
> > > Another option is to mark such memory as usable in e820 and track if the range has been accepted in a bitmap. One bit in the bitmap represents 2MiB in the address space: one 4k page is enough to track 64GiB of physical address space.
> > >
> > > In the worst-case scenario -- a huge hole in the middle of the address space -- it needs 256MiB to handle 4PiB of the address space.
> > >
> > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > >
> > > The approach lowers boot time substantially. Boot to shell is ~2.5x faster for 4G TDX VM and ~4x faster for 64G.
> > >
> > > TDX-specific code isolated from the core of unaccepted memory support. It supposed to help to plug-in different implementation of unaccepted memory such as SEV-SNP.
> > >
> > > The tree can be found here:
> > >
> > > https://github.com/intel/tdx.git guest-unaccepted-memory
> >
> > Hi Kirill,
> >
> > I have a couple questions about this feature mainly about how cloud customers can use this, I assume since this is a confidential compute feature a large number of the users of these patches will be cloud customers using TDX and SNP. One issue I see with these patches is how do we as a cloud provider know whether a customer's linux image supports this feature, if the image doesn't have these patches UEFI needs to fully validate the memory, if the image does we can use this new protocol. In GCE we supply our VMs with a version of the EDK2 FW and the customer doesn't input into which UEFI we run, as far as I can tell from the Azure SNP VM documentation it seems very similar. We need to somehow tell our UEFI in the VM what to do based on the image.
> >
> > The current way I can see to solve this issue would be to have our customers give us metadata about their VM's image but this seems kinda burdensome on our customers (I assume we'll have more features which both UEFI and kernel need to both support inorder to be turned on like this one) and error-prone, if a customer incorrectly labels their image it may fail to boot. Has there been any discussion about how to solve this? My naive thoughts were what if UEFI and Kernel had some sort of feature negotiation. Maybe that could happen via an extension to exit boot services or a UEFI runtime driver, I'm not sure what's best here just some ideas.
>
> Just as an idea, we can put info into UTS_VERSION which can be read from the built bzImage. We have info on SMP and preeption there already.
Instead of hacking this into the binary, couldn't we define a protocol that the kernel will call from the EFI stub (before EBS()) to identify itself as an image that understands unaccepted memory, and knows how to deal with it?

That way, the firmware can accept all the memory on behalf of the OS at ExitBootServices() time, unless the OS has indicated there is no need to do so.

> Patch below does this:
>
> $ file arch/x86/boot/bzImage
> arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 5.19.0-rc3-00016-g2f6aa48e28d9-dirty (kas@box) #2300 SMP PREEMPT_DYNAMIC UNACCEPTED_MEMORY Mon Jun 27 14:23:04 , RO-rootFS, swap_dev 0XC, Normal VGA
>
> Note UNACCEPTED_MEMORY in the output.
>
> Probably we want to have there info on which flavour of unaccepted memory is supported (TDX/SNP/whatever). It is a bit more tricky.
>
> Any opinion?
>
> diff --git a/init/Makefile b/init/Makefile
> index d82623d7fc8e..6688ea43e6bf 100644
> --- a/init/Makefile
> +++ b/init/Makefile
> @@ -32,7 +32,7 @@ quiet_cmd_compile.h = CHK $@
>  	$(CONFIG_SHELL) $(srctree)/scripts/mkcompile_h $@ \
>  	"$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT_BUILD)" \
>  	"$(CONFIG_PREEMPT_DYNAMIC)" "$(CONFIG_PREEMPT_RT)" \
> -	"$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
> +	"$(CONFIG_UNACCEPTED_MEMORY)" "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
>
>  include/generated/compile.h: FORCE
>  	$(call cmd,compile.h)
> diff --git a/scripts/mkcompile_h b/scripts/mkcompile_h
> index ca40a5258c87..efacfecad699 100755
> --- a/scripts/mkcompile_h
> +++ b/scripts/mkcompile_h
> @@ -7,8 +7,9 @@ SMP=$3
>  PREEMPT=$4
>  PREEMPT_DYNAMIC=$5
>  PREEMPT_RT=$6
> -CC_VERSION="$7"
> -LD=$8
> +UNACCEPTED_MEMORY=$7
> +CC_VERSION="$8"
> +LD=$9
>
>  # Do not expand names
>  set -f
> @@ -51,6 +52,10 @@ elif [ -n "$PREEMPT" ] ; then
>  	CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"
>  fi
>
> +if [ -n "$UNACCEPTED_MEMORY" ] ; then
> +	CONFIG_FLAGS="$CONFIG_FLAGS UNACCEPTED_MEMORY"
> +fi
> +
>  # Truncate to maximum length
>  UTS_LEN=64
>  UTS_VERSION="$(echo $UTS_VERSION $CONFIG_FLAGS $TIMESTAMP | cut -b -$UTS_LEN)"
>
> --
> Kirill A. Shutemov
On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > UEFI Specification version 2.9 introduces the concept of memory acceptance: some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, requiring memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific for the Virtual Machine platform.
> > > >
> > > > Accepting memory is costly and it makes VMM allocate memory for the accepted guest physical address range. It's better to postpone memory acceptance until memory is needed. It lowers boot time and reduces memory overhead.
> > > >
> > > > The kernel needs to know what memory has been accepted. Firmware communicates this information via memory map: a new memory type -- EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > >
> > > > Range-based tracking works fine for firmware, but it gets bulky for the kernel: e820 has to be modified on every page acceptance. It leads to table fragmentation, but there's a limited number of entries in the e820 table.
> > > >
> > > > Another option is to mark such memory as usable in e820 and track if the range has been accepted in a bitmap. One bit in the bitmap represents 2MiB in the address space: one 4k page is enough to track 64GiB of physical address space.
> > > >
> > > > In the worst-case scenario -- a huge hole in the middle of the address space -- it needs 256MiB to handle 4PiB of the address space.
> > > >
> > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > >
> > > > The approach lowers boot time substantially. Boot to shell is ~2.5x faster for 4G TDX VM and ~4x faster for 64G.
> > > >
> > > > TDX-specific code isolated from the core of unaccepted memory support. It supposed to help to plug-in different implementation of unaccepted memory such as SEV-SNP.
> > > >
> > > > The tree can be found here:
> > > >
> > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > >
> > > Hi Kirill,
> > >
> > > I have a couple questions about this feature mainly about how cloud customers can use this, I assume since this is a confidential compute feature a large number of the users of these patches will be cloud customers using TDX and SNP. One issue I see with these patches is how do we as a cloud provider know whether a customer's linux image supports this feature, if the image doesn't have these patches UEFI needs to fully validate the memory, if the image does we can use this new protocol. In GCE we supply our VMs with a version of the EDK2 FW and the customer doesn't input into which UEFI we run, as far as I can tell from the Azure SNP VM documentation it seems very similar. We need to somehow tell our UEFI in the VM what to do based on the image.
> > >
> > > The current way I can see to solve this issue would be to have our customers give us metadata about their VM's image but this seems kinda burdensome on our customers (I assume we'll have more features which both UEFI and kernel need to both support inorder to be turned on like this one) and error-prone, if a customer incorrectly labels their image it may fail to boot. Has there been any discussion about how to solve this? My naive thoughts were what if UEFI and Kernel had some sort of feature negotiation. Maybe that could happen via an extension to exit boot services or a UEFI runtime driver, I'm not sure what's best here just some ideas.
> >
> > Just as an idea, we can put info into UTS_VERSION which can be read from the built bzImage. We have info on SMP and preeption there already.
>
> Instead of hacking this into the binary, couldn't we define a protocol that the kernel will call from the EFI stub (before EBS()) to identify itself as an image that understands unaccepted memory, and knows how to deal with it?
>
> That way, the firmware can accept all the memory on behalf of the OS at ExitBootServices() time, unless the OS has indicated there is no need to do so.

I agree it would be better. But I think it would require a change to the EFI spec, no?

--
Kirill A. Shutemov
On Mon, Jun 27, 2022 at 6:22 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> > On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > > > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > > > <kirill.shutemov@linux.intel.com> wrote:
> > > > >
> > > > > UEFI Specification version 2.9 introduces the concept of memory acceptance: some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, requiring memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific for the Virtual Machine platform.
> > > > >
> > > > > Accepting memory is costly and it makes VMM allocate memory for the accepted guest physical address range. It's better to postpone memory acceptance until memory is needed. It lowers boot time and reduces memory overhead.
> > > > >
> > > > > The kernel needs to know what memory has been accepted. Firmware communicates this information via memory map: a new memory type -- EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > > >
> > > > > Range-based tracking works fine for firmware, but it gets bulky for the kernel: e820 has to be modified on every page acceptance. It leads to table fragmentation, but there's a limited number of entries in the e820 table.
> > > > >
> > > > > Another option is to mark such memory as usable in e820 and track if the range has been accepted in a bitmap. One bit in the bitmap represents 2MiB in the address space: one 4k page is enough to track 64GiB of physical address space.
> > > > >
> > > > > In the worst-case scenario -- a huge hole in the middle of the address space -- it needs 256MiB to handle 4PiB of the address space.
> > > > >
> > > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > > >
> > > > > The approach lowers boot time substantially. Boot to shell is ~2.5x faster for 4G TDX VM and ~4x faster for 64G.
> > > > >
> > > > > TDX-specific code isolated from the core of unaccepted memory support. It supposed to help to plug-in different implementation of unaccepted memory such as SEV-SNP.
> > > > >
> > > > > The tree can be found here:
> > > > >
> > > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > > >
> > > > Hi Kirill,
> > > >
> > > > I have a couple questions about this feature mainly about how cloud customers can use this, I assume since this is a confidential compute feature a large number of the users of these patches will be cloud customers using TDX and SNP. One issue I see with these patches is how do we as a cloud provider know whether a customer's linux image supports this feature, if the image doesn't have these patches UEFI needs to fully validate the memory, if the image does we can use this new protocol. In GCE we supply our VMs with a version of the EDK2 FW and the customer doesn't input into which UEFI we run, as far as I can tell from the Azure SNP VM documentation it seems very similar. We need to somehow tell our UEFI in the VM what to do based on the image.
> > > > The current way I can see to solve this issue would be to have our customers give us metadata about their VM's image but this seems kinda burdensome on our customers (I assume we'll have more features which both UEFI and kernel need to both support inorder to be turned on like this one) and error-prone, if a customer incorrectly labels their image it may fail to boot. Has there been any discussion about how to solve this? My naive thoughts were what if UEFI and Kernel had some sort of feature negotiation. Maybe that could happen via an extension to exit boot services or a UEFI runtime driver, I'm not sure what's best here just some ideas.
> > >
> > > Just as an idea, we can put info into UTS_VERSION which can be read from the built bzImage. We have info on SMP and preeption there already.
> >
> > Instead of hacking this into the binary, couldn't we define a protocol that the kernel will call from the EFI stub (before EBS()) to identify itself as an image that understands unaccepted memory, and knows how to deal with it?
> >
> > That way, the firmware can accept all the memory on behalf of the OS at ExitBootServices() time, unless the OS has indicated there is no need to do so.
>
> I agree it would be better. But I think it would require change to EFI spec, no?

Could this somehow be amended onto the UEFI Specification version 2.9 change which added all of the unaccepted memory features?

> --
> Kirill A. Shutemov
On Mon, 27 Jun 2022 at 18:17, Peter Gonda <pgonda@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 6:22 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > > > > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > > > > <kirill.shutemov@linux.intel.com> wrote:
> > > > > >
> > > > > > UEFI Specification version 2.9 introduces the concept of memory acceptance: some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, requiring memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific for the Virtual Machine platform.
> > > > > >
> > > > > > Accepting memory is costly and it makes VMM allocate memory for the accepted guest physical address range. It's better to postpone memory acceptance until memory is needed. It lowers boot time and reduces memory overhead.
> > > > > >
> > > > > > The kernel needs to know what memory has been accepted. Firmware communicates this information via memory map: a new memory type -- EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > > > >
> > > > > > Range-based tracking works fine for firmware, but it gets bulky for the kernel: e820 has to be modified on every page acceptance. It leads to table fragmentation, but there's a limited number of entries in the e820 table.
> > > > > >
> > > > > > Another option is to mark such memory as usable in e820 and track if the range has been accepted in a bitmap. One bit in the bitmap represents 2MiB in the address space: one 4k page is enough to track 64GiB of physical address space.
> > > > > >
> > > > > > In the worst-case scenario -- a huge hole in the middle of the address space -- it needs 256MiB to handle 4PiB of the address space.
> > > > > >
> > > > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > > > >
> > > > > > The approach lowers boot time substantially. Boot to shell is ~2.5x faster for 4G TDX VM and ~4x faster for 64G.
> > > > > >
> > > > > > TDX-specific code isolated from the core of unaccepted memory support. It supposed to help to plug-in different implementation of unaccepted memory such as SEV-SNP.
> > > > > >
> > > > > > The tree can be found here:
> > > > > >
> > > > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > > > >
> > > > > Hi Kirill,
> > > > >
> > > > > I have a couple questions about this feature mainly about how cloud customers can use this, I assume since this is a confidential compute feature a large number of the users of these patches will be cloud customers using TDX and SNP. One issue I see with these patches is how do we as a cloud provider know whether a customer's linux image supports this feature, if the image doesn't have these patches UEFI needs to fully validate the memory, if the image does we can use this new protocol. In GCE we supply our VMs with a version of the EDK2 FW and the customer doesn't input into which UEFI we run, as far as I can tell from the Azure SNP VM documentation it seems very similar. We need to somehow tell our UEFI in the VM what to do based on the image.
> > > > > The current way I can see to solve this issue would be to have our customers give us metadata about their VM's image but this seems kinda burdensome on our customers (I assume we'll have more features which both UEFI and kernel need to both support inorder to be turned on like this one) and error-prone, if a customer incorrectly labels their image it may fail to boot. Has there been any discussion about how to solve this? My naive thoughts were what if UEFI and Kernel had some sort of feature negotiation. Maybe that could happen via an extension to exit boot services or a UEFI runtime driver, I'm not sure what's best here just some ideas.
> > > >
> > > > Just as an idea, we can put info into UTS_VERSION which can be read from the built bzImage. We have info on SMP and preeption there already.
> > >
> > > Instead of hacking this into the binary, couldn't we define a protocol that the kernel will call from the EFI stub (before EBS()) to identify itself as an image that understands unaccepted memory, and knows how to deal with it?
> > >
> > > That way, the firmware can accept all the memory on behalf of the OS at ExitBootServices() time, unless the OS has indicated there is no need to do so.
> >
> > I agree it would be better. But I think it would require change to EFI spec, no?
>
> Could this somehow be amended on to the UEFI Specification version 2.9 change which added all of the unaccepted memory features?

Why would this need a change in the EFI spec? Not every EFI protocol needs to be in the spec.
On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > >
> > > > > Just as an idea, we can put info into UTS_VERSION which can be read from the built bzImage. We have info on SMP and preeption there already.
> > > >
> > > > Instead of hacking this into the binary, couldn't we define a protocol that the kernel will call from the EFI stub (before EBS()) to identify itself as an image that understands unaccepted memory, and knows how to deal with it?
> > > >
> > > > That way, the firmware can accept all the memory on behalf of the OS at ExitBootServices() time, unless the OS has indicated there is no need to do so.
> > >
> > > I agree it would be better. But I think it would require change to EFI spec, no?
> >
> > Could this somehow be amended on to the UEFI Specification version 2.9 change which added all of the unaccepted memory features?
>
> Why would this need a change in the EFI spec? Not every EFI protocol needs to be in the spec.

My EFI knowledge is shallow. Do we do this in other cases?

--
Kirill A. Shutemov
On Tue, 28 Jun 2022 at 00:38, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > > >
> > > > > > Just as an idea, we can put info into UTS_VERSION which can be read from the built bzImage. We have info on SMP and preeption there already.
> > > > >
> > > > > Instead of hacking this into the binary, couldn't we define a protocol that the kernel will call from the EFI stub (before EBS()) to identify itself as an image that understands unaccepted memory, and knows how to deal with it?
> > > > >
> > > > > That way, the firmware can accept all the memory on behalf of the OS at ExitBootServices() time, unless the OS has indicated there is no need to do so.
> > > >
> > > > I agree it would be better. But I think it would require change to EFI spec, no?
> > >
> > > Could this somehow be amended on to the UEFI Specification version 2.9 change which added all of the unaccepted memory features?
> >
> > Why would this need a change in the EFI spec? Not every EFI protocol needs to be in the spec.
>
> My EFI knowledge is shallow. Do we do this in other cases?

The E in EFI means 'extensible', and the whole design of a protocol database using GUIDs as identifiers (which will not collide and therefore need no a priori coordination when defining them) is intended to allow extensions to be defined and implemented in a distributed manner.

Of course, it would be fantastic if we could converge on a protocol that all flavors of confidential compute can use, across different OSes, so it is generally good if a protocol is defined in *some* shared specification. But this doesn't have to be the EFI spec.
On Tue, Jun 28, 2022 at 07:17:00PM +0200, Ard Biesheuvel wrote:
> The E in EFI means 'extensible', and the whole design of a protocol
> database using GUIDs as identifiers (which will not collide and
> therefore need no a priori coordination when defining them) is
> intended to allow extensions to be defined and implemented in a
> distributed manner.
>
> Of course, it would be fantastic if we can converge on a protocol that
> all flavors of confidential compute can use, across different OSes, so
> it is generally good if a protocol is defined in *some* shared
> specification. But this doesn't have to be the EFI spec.

I've talked with our firmware expert today and I think we have a problem
with the approach where the kernel declares support of unaccepted memory.

This approach doesn't work if we include a bootloader in the picture: if
EBS() is called by the bootloader, we still cannot know if the target
kernel supports unaccepted memory and we are back to square one.

I think we should make it obvious from a kernel image if it supports
unaccepted memory (with UTS_VERSION or some other way).

Any comments?

--
 Kirill A. Shutemov
> I've talked with our firmware expert today and I think we have a problem
> with the approach where the kernel declares support of unaccepted memory.
>

Is this Jiewen Yao? I've been trying to design the UEFI spec change
with him. The bootloader problem he commented on this morning was
something I wasn't fully considering.

> This approach doesn't work if we include a bootloader in the picture: if
> EBS() is called by the bootloader, we still cannot know if the target
> kernel supports unaccepted memory and we are back to square one.
>
> I think we should make it obvious from a kernel image if it supports
> unaccepted memory (with UTS_VERSION or some other way).
>
> Any comments?

Is this binary parsing trick already used in EDK2? If not, I wouldn't
want to introduce an ABI-solidifying requirement like that.

A bit more cumbersome, but more flexible, way to enable the feature is
an idea I had in a meeting today: make unaccepted memory support a
feature-enabling EFI driver installed to the EFI system partition.

* The first time you boot (setup mode), you install an EFI driver that
  just sets a feature Pcd to true (using a custom protocol as Ard had
  suggested above).
* The second time you boot, if the feature Pcd is true, then the UEFI
  is free to not accept memory and to use the unaccepted memory type.
  The bootloader will run after unaccepted memory has been allowed
  already, so there is no accept-all event.

The default behavior will be to accept all memory when GetMemoryMap is
called unless the feature Pcd is set to true.

We can then say this driver isn't needed once some new generation of
this technology comes along and we can require unaccepted memory
support as part of that technology's baseline, or we manage to update
the UEFI spec to have a GetMemoryMapEx which has unaccepted memory
support baked in and the bootloaders all know to use it.

The cloud experience will be, "is boot slow? Install this EFI driver
from the cloud service provider" to tell the UEFI to enable unaccepted
memory.

--
-Dionna Glaze, PhD (she/her)
Hey

I posted my comment on Bugzilla:
https://bugzilla.tianocore.org/show_bug.cgi?id=3987

Let's continue the EDKII/UEFI-related discussion there.

Thank you
Yao, Jiewen

> -----Original Message-----
> From: Dionna Amalie Glaze <dionnaglaze@google.com>
> Sent: Tuesday, July 19, 2022 7:32 AM
> To: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Subject: Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted
> memory
> > I think we should make it obvious from a kernel image if it supports
> > unaccepted memory (with UTS_VERSION or some other way).
>

Something I didn't address in my previous email: how would the UEFI
know where the kernel is, to parse this UTS_VERSION out, when it's
booting a bootloader before Linux gets booted?

--
-Dionna Glaze, PhD (she/her)
> > > I think we should make it obvious from a kernel image if it supports
> > > unaccepted memory (with UTS_VERSION or some other way).
> >
> Something I didn't address in my previous email: how would the UEFI
> know where the kernel is, to parse this UTS_VERSION out, when it's
> booting a bootloader before Linux gets booted?
>

How about instead of the limited resource of UTS_VERSION, we add a
SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
be easier to parse out and more extensible in the future.
https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot

This can contain a bitmap of a number of features that we currently
need manual tagging for, such as SEV guest support, SEV-SNP guest
support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
(CONFIG_UNACCEPTED_MEMORY, SEV-SNP). The VMM, UEFI, or boot loader can
read these from the images/kernels and have the appropriate behavior.

--
-Dionna Glaze, PhD (she/her)
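[Editor's sketch of the proposal above. The struct setup_data layout matches the documented x86 boot protocol; the SETUP_BOOT_FEATURES type value and the BOOT_FEAT_* bits are hypothetical, invented here for illustration only, since no such setup_data type was ever merged.]

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Real layout, from arch/x86/include/uapi/asm/bootparam.h. */
struct setup_data {
	uint64_t next;   /* physical address of the next node, 0 = end of list */
	uint32_t type;   /* SETUP_* identifier */
	uint32_t len;    /* length of data[] in bytes */
	uint8_t  data[];
};

/* Hypothetical values, for illustration only. */
#define SETUP_BOOT_FEATURES      0x1000
#define BOOT_FEAT_SEV_GUEST      (1ULL << 0)
#define BOOT_FEAT_SEV_SNP_GUEST  (1ULL << 1)
#define BOOT_FEAT_TDX_GUEST      (1ULL << 2)
#define BOOT_FEAT_UNACCEPTED_MEM (1ULL << 3)

/* Build a SETUP_BOOT_FEATURES node into caller-provided storage. */
static void fill_boot_features(struct setup_data *sd, uint64_t features)
{
	sd->next = 0;
	sd->type = SETUP_BOOT_FEATURES;
	sd->len = sizeof(features);
	memcpy(sd->data, &features, sizeof(features));
}

/* What a VMM/UEFI/boot loader would check before deciding to pre-accept memory. */
static int supports_unaccepted_memory(const struct setup_data *sd)
{
	uint64_t features;

	if (sd->type != SETUP_BOOT_FEATURES || sd->len < sizeof(features))
		return 0;
	memcpy(&features, sd->data, sizeof(features));
	return !!(features & BOOT_FEAT_UNACCEPTED_MEM);
}
```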
On Tue, Jul 19, 2022 at 11:29:32AM -0700, Dionna Amalie Glaze wrote:
> How about instead of the limited resource of UTS_VERSION, we add a
> SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
> be easier to parse out and more extensible in the future.
> https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot
>
> This can contain a bitmap of a number of features that we currently
> need manual tagging for, such as SEV guest support, SEV-SNP guest
> support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
> (CONFIG_UNACCEPTED_MEMORY, SEV-SNP).
> The VMM, UEFI, or boot loader can read these from the images/kernels
> and have the appropriate behavior.
I think for stuff like that you want loadflags or xloadflags in the
setup header.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
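[Editor's sketch of the xloadflags idea. The header offsets below (0x202 for the "HdrS" magic, 0x236 for the 16-bit xloadflags field, boot protocol >= 2.12) are the ones documented in Documentation/x86/boot.rst; the XLF_UNACCEPTED_MEMORY bit is hypothetical and does not exist in the boot protocol.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_MAGIC_OFF   0x202  /* 4-byte "HdrS" signature */
#define XLOADFLAGS_OFF  0x236  /* __u16 xloadflags, protocol >= 2.12 */

#define XLF_UNACCEPTED_MEMORY (1 << 7)  /* hypothetical bit */

/* Could a loader conclude, from the raw bzImage bytes, that the kernel
 * accepts memory itself? Returns 0 for images without a usable header. */
static int image_accepts_memory(const uint8_t *image, size_t size)
{
	uint16_t xlf;

	if (size < XLOADFLAGS_OFF + 2)
		return 0;
	if (memcmp(image + HDR_MAGIC_OFF, "HdrS", 4) != 0)
		return 0;  /* no boot protocol header present */
	/* xloadflags is little-endian in the image. */
	xlf = image[XLOADFLAGS_OFF] | (uint16_t)(image[XLOADFLAGS_OFF + 1] << 8);
	return !!(xlf & XLF_UNACCEPTED_MEMORY);
}
```

This is exactly the kind of file-representation parsing Ard objects to below, but it shows how cheap the check itself would be for a VMM that holds the image bytes.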
On Tue, 19 Jul 2022 at 21:14, Borislav Petkov <bp@alien8.de> wrote:
>
> I think for stuff like that you want loadflags or xloadflags in the
> setup header.
>

Please, no. Let's not invent Linux/x86 specific hacks to infer whether
or not the kernel is capable of accepting memory when it is perfectly
capable of telling us directly. We will surely need something
analogous on other architectures in the future as well, so the setup
header is definitely not the right place for this.

The 'bootloader that calls EBS()' case does not apply to Linux, and
given that we are talking specifically about confidential computing
VMs here, we can afford to be normative and define something generic
that works well for us.

So let's define a way for the EFI stub to signal to the firmware
(before EBS()) that it will take control of accepting memory. The
'bootloader that calls EBS()' case can invent something along the
lines of what has been proposed in this thread to infer the
capabilities of the kernel (and decide what to signal to the
firmware). But we have no need for this additional complexity on
Linux.
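[Editor's toy model of the handshake described above, with invented names and simplified stand-in state rather than real EDK2/EFI code. The only point it illustrates is the default-flip: the firmware pre-accepts everything at ExitBootServices() unless the EFI stub claimed the job first.]

```c
#include <stdbool.h>

/* Firmware-side state; in reality this would live inside the firmware. */
static bool os_will_accept_memory = false;

/*
 * Hypothetical vendor protocol the EFI stub would locate and call
 * before ExitBootServices() to claim responsibility for acceptance.
 */
static void memory_acceptance_protocol_claim(void)
{
	os_will_accept_memory = true;
}

/*
 * Firmware decision at ExitBootServices() time: pre-accept all memory
 * on the OS's behalf only if the OS never claimed the job.
 */
static bool firmware_accepts_all_memory_at_ebs(void)
{
	return !os_will_accept_memory;
}
```

An old kernel (or a bootloader that never forwards the call) simply never invokes the protocol, so it still gets fully accepted memory; an enlightened kernel claims acceptance and boots fast.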
On Tue, Jul 19, 2022 at 10:45:06PM +0200, Ard Biesheuvel wrote:
> So let's define a way for the EFI stub to signal to the firmware
> (before EBS()) that it will take control of accepting memory. The
> 'bootloader that calls EBS()' case can invent something along the
> lines of what has been proposed in this thread to infer the
> capabilities of the kernel (and decide what to signal to the
> firmware). But we have no need for this additional complexity on
> Linux.
To tell you the truth, I've been perusing this thread from the sidelines
and am wondering why does this need this special dance at all?
If EFI takes control of accepting memory, then when the guest kernel
boots, it'll find all memory accepted and not do anything.
If EFI doesn't accept memory, then the guest kernel will boot and do the
accepting itself.
So either I'm missing something or we're overengineering this for no
good reason...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 7/19/22 14:23, Borislav Petkov wrote:
> To tell you the truth, I've been perusing this thread from the sidelines
> and am wondering why does this need this special dance at all?
>
> If EFI takes control of accepting memory, then when the guest kernel
> boots, it'll find all memory accepted and not do anything.
>
> If EFI doesn't accept memory, then the guest kernel will boot and do the
> accepting itself.
>
> So either I'm missing something or we're overengineering this for no
> good reason...

They're trying to design something that can (forever) handle guests that
might not be able to accept memory. It's based on the idea that
*something* needs to assume control and EFI doesn't have enough
information to assume control.

I wish we didn't need all this complexity, though.

There are three entities that can influence how much memory is accepted:

1. The host
2. The guest firmware
3. The guest kernel (or bootloader or something after the firmware)

This whole thread is about how #2 and #3 talk to each other and make
sure *someone* does it.

I kinda think we should just take the guest firmware out of the picture.
There are only going to be a few versions of the kernel that can boot
under TDX (or SEV-SNP) and *can't* handle unaccepted memory. It seems a
bit silly to design this whole interface for a few versions of the OS
that TDX folks tell me can't be used anyway.

I think we should just say if you want to run an OS that doesn't have
unaccepted memory support, you can either:

1. Deal with that at the host level configuration
2. Boot some intermediate thing like a bootloader that does acceptance
   before running the stupid^Wunenlightened OS
3. Live with the 4GB of pre-accepted memory you get with no OS work.

Yeah, this isn't convenient for some hosts. But, really, this is
preferable to doing an EFI/OS dance until the end of time.
On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> They're trying to design something that can (forever) handle guests that
> might not be able to accept memory.
Wait, what?
If you can't modify those guests to teach them to accept memory, how do
you add TDX or SNP guest support to them?
I.e., you need to modify the guests and then you can add memory
acceptance. Basically, your point below...
> It's based on the idea that *something* needs to assume control and
> EFI doesn't have enough information to assume control.
>
> I wish we didn't need all this complexity, though.
>
> There are three entities that can influence how much memory is accepted:
>
> 1. The host
> 2. The guest firmware
> 3. The guest kernel (or bootloader or something after the firmware)
>
> This whole thread is about how #2 and #3 talk to each other and make
> sure *someone* does it.
>
> I kinda think we should just take the guest firmware out of the picture.
> There are only going to be a few versions of the kernel that can boot
> under TDX (or SEV-SNP) and *can't* handle unaccepted memory. It seems a
> bit silly to design this whole interface for a few versions of the OS
> that TDX folks tell me can't be used anyway.
>
> I think we should just say if you want to run an OS that doesn't have
> unaccepted memory support, you can either:
>
> 1. Deal with that at the host level configuration
> 2. Boot some intermediate thing like a bootloader that does acceptance
> before running the stupid^Wunenlightened OS
> 3. Live with the 4GB of pre-accepted memory you get with no OS work.
>
> Yeah, this isn't convenient for some hosts. But, really, this is
> preferable to doing an EFI/OS dance until the end of time.
Ack. Definitely.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 7/19/22 14:50, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
>> They're trying to design something that can (forever) handle guests that
>> might not be able to accept memory.
>
> Wait, what?
>
> If you can't modify those guests to teach them to accept memory, how do
> you add TDX or SNP guest support to them?

Mainline today, for instance, doesn't have unaccepted memory support for
TDX or SEV-SNP guests. But, they both still boot fine because folks
either configure it on the host side not to *have* any unaccepted
memory. Or, they just live with the small (4GB??) amount of
pre-accepted memory, which is fine for testing things.
On Tue, Jul 19, 2022 at 3:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Mainline today, for instance, doesn't have unaccepted memory support for
> TDX or SEV-SNP guests. But, they both still boot fine because folks
> either configure it on the host side not to *have* any unaccepted
> memory. Or, they just live with the small (4GB??) amount of
> pre-accepted memory, which is fine for testing things.

For us (Google cloud), "1. Deal with that at the host level
configuration" looks like:
https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#guest-os-features

In other words, we have to tag images with "feature tags" to
distinguish which images have kernels that support which features.
Part of the reason we need to do it this way is that we use a single
guest firmware (i.e., guest UEFI) that lives outside of the image.
These feature tags are a mess to keep track of.

All that being said, I can totally see the upstream perspective being
"not our problem". It's hard to argue with that :-). A few more
thoughts:

- If the guest-side patches weren't upstream before this patch set to
  handle unaccepted memory, you're all definitely right that this
  isn't a real issue. (Maybe it still isn't...)
- Do we anticipate (many) more features for confidential compute in
  the future that require code in both the guest FW and guest kernel?
  If yes, then designing a FW-kernel feature negotiation could be
  useful beyond this situation.
- Dave's suggestion to "2. Boot some intermediate thing like a
  bootloader that does acceptance ..." is pretty clever! So if
  upstream thinks this FW-kernel negotiation is not a good direction,
  maybe we (Google) can pursue this idea to avoid introducing yet
  another tag on our images.

Thank you all for this discussion.

Thanks,
Marc
On 7/19/22 17:26, Marc Orr wrote:
> - Dave's suggestion to "2. Boot some intermediate thing like a
> bootloader that does acceptance ..." is pretty clever! So if upstream
> thinks this FW-kernel negotiation is not a good direction, maybe we
> (Google) can pursue this idea to avoid introducing yet another tag on
> our images.

I'm obviously speaking only for myself here and not for "upstream" as a
whole, but I clearly don't like the FW/kernel negotiation thing. It's a
permanent pain in our necks to solve a very temporary problem.
On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
>
> I'm obviously speaking only for myself here and not for "upstream" as a
> whole, but I clearly don't like the FW/kernel negotiation thing. It's a
> permanent pain in our necks to solve a very temporary problem.

EFI is basically our existing embodiment of this fw/kernel negotiation
thing, and iff we need it, I have no objection to using it for this
purpose, i.e., to allow the firmware to infer whether or not it should
accept all available memory on behalf of the OS before exiting boot
services. But if we don't need this, even better.

What I strongly object to is inventing a new bespoke way for the
firmware to make inferences about the capabilities of the image by
inspecting fields in the file representation of the image (which is
not guaranteed by EFI to be identical to its in-memory representation,
as, e.g., the PE/COFF header could be omitted by a loader without
violating the spec).

As for the intermediate thing: yes, that would be a valuable thing to
have in OVMF (and I will gladly take EDK2 patches that implement
this). However, I'm not sure how you decide whether or not this thing
should be active; doesn't that just move the problem around?
On Sat, Jul 23, 2022 at 01:14:07PM +0200, Ard Biesheuvel wrote:
> EFI is basically our existing embodiment of this fw/kernel negotiation
> thing, and iff we need it, I have no objection to using it for this
> purpose, i.e., to allow the firmware to infer whether or not it should
> accept all available memory on behalf of the OS before exiting boot
> services. But if we don't need this, even better.

FW/kernel negotiation does not work if there's a boot loader in the
middle that does ExitBootServices(). By the time the kernel can
announce whether it supports unaccepted memory, there's nobody to
announce to.

--
 Kiryl Shutsemau / Kirill A. Shutemov
On Tue, 9 Aug 2022 at 13:11, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> FW/kernel negotiation does not work if there's a boot loader in the
> middle that does ExitBootServices(). By the time the kernel can
> announce whether it supports unaccepted memory, there's nobody to
> announce to.
>

Why would you want to support such bootloaders for TDX anyway? TDX
heavily relies on measured boot abstractions and other things that are
heavily tied to firmware.
On Tue, Aug 09, 2022 at 01:36:00PM +0200, Ard Biesheuvel wrote:
> Why would you want to support such bootloaders for TDX anyway? TDX
> heavily relies on measured boot abstractions and other things that are
> heavily tied to firmware.

I don't understand it either. And, yet, there's demand for it.

--
 Kiryl Shutsemau / Kirill A. Shutemov
> > Why would you want to support such bootloaders for TDX anyway? TDX
> > heavily relies on measured boot abstractions and other things that are
> > heavily tied to firmware.
>
> I don't understand it either. And, yet, there's demand for it.
>

I think there's no good solution for this bad upgrade path that the
UEFI spec stuck us with, so I think I'm going to stick to what many
folks have suggested: just have the host require external information.

What this means is that at VM creation time, the user has to specify
an extra flag so that all memory is accepted in firmware before
booting the guest OS. Failure to provide the flag leads to the
unfortunate outcome that the VM only has access to the lower 4GB of
RAM. We can only hope that the VM OOMs shortly after they start up the
machine and the user reads an FAQ telling them to add this flag.

I'll do a round of appeals to distributions to include this patch set
and AMD's follow-up that defines accept_memory for SEV-SNP, to reduce
the time that people need to know about this flag.

--
-Dionna Glaze, PhD (she/her)
> What I strongly object to is inventing a new bespoke way for the
> firmware to make inferences about the capabilities of the image by
> inspecting fields in the file representation of the image (which is
> not guaranteed by EFI to be identical to its in-memory representation,
> as, e.g., the PE/COFF header could be omitted by a loader without
> violating the spec)
>
> As for the intermediate thing: yes, that would be a valuable thing to
> have in OVMF (and I will gladly take EDK2 patches that implement
> this). However, I'm not sure how you decide whether or not this thing
> should be active or not, doesn't that just move the problem around?

This does just move the problem around, but it makes correct behavior
the default instead of silently ignoring most of the VM's memory and
booting regularly. I have the driver mostly written to change the
behavior to accept all by default unless a driver has been installed
to set a particular boolean to make it not. Still, that's yet another
thing, as you say.

I agree with everyone that this situation just stinks. "Can't you just
boot it?" was asked before, and yes we can, but at the scale of a CSP
managing anybody's image uploads, that not-insignificant cost has to
be paid by someone. It's a hard problem to route the image to the
right kind of machine that's expected to be able to run it... it's a
big ol' mess.

One thing is for sure: these patches shouldn't be blocked by the "how
do we detect it" question. I'm glad to see so much engagement with
this problem, but I fear I might have delayed its progress towards a
merge. I know AMD has a follow-up to add SEV-SNP accept_memory support
to finish this all up. I'll try to get the ear of all the
distributions that are tracking towards providing SEV-SNP-supported
images for CSPs, to get them on the release that includes these
patches. I'll also see about upstreaming that EFI driver and EDK2
changes in case there's a slip in the kernel release and we need this
workaround.

--
-Dionna Glaze, PhD (she/her)
On Tue, Jul 19, 2022 at 05:26:21PM -0700, Marc Orr wrote:
> These feature tags are a mess to keep track of.
Well, looking at those tags, it doesn't look like you'll stop using them
anytime soon.
And once all the required SNP/TDX features are part of the guest image,
- including unaccepted memory - if anything, you'll have less tags.
:-)
> - Do we anticipate (many) more features for confidential compute in
> the future that require code in both the guest FW and guest kernel? If
> yes, then designing a FW-kernel feature negotiation could be useful
> beyond this situation.
Good question.
> - Dave's suggestion to "2. Boot some intermediate thing like a
> bootloader that does acceptance ..." is pretty clever! So if upstream
> thinks this FW-kernel negotiation is not a good direction, maybe we
> (Google) can pursue this idea to avoid introducing yet another tag on
> our images.
Are those tags really that nasty so that you guys are looking at
upstream changes just to avoid them?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Jul 19, 2022 at 10:44 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 19, 2022 at 05:26:21PM -0700, Marc Orr wrote:
> > These feature tags are a mess to keep track of.
>
> Well, looking at those tags, it doesn't look like you'll stop using them
> anytime soon.
>
> And once all the required SNP/TDX features are part of the guest image,
> - including unaccepted memory - if anything, you'll have fewer tags.
>
> :-)
Yeah, once all of the features are a part of the guest image AND any
older images with SNP/TDX minus the features are deprecated. I agree.
> > - Do we anticipate (many) more features for confidential compute in
> > the future that require code in both the guest FW and guest kernel? If
> > yes, then designing a FW-kernel feature negotiation could be useful
> > beyond this situation.
>
> Good question.
>
> > - Dave's suggestion to "2. Boot some intermediate thing like a
> > bootloader that does acceptance ..." is pretty clever! So if upstream
> > thinks this FW-kernel negotiation is not a good direction, maybe we
> > (Google) can pursue this idea to avoid introducing yet another tag on
> > our images.
>
> Are those tags really that nasty so that you guys are looking at
> upstream changes just to avoid them?
Generally, no. But the problem with tags is that distros tag their
images wrong sometimes. And that leads to problems. For example, I
just got a bug assigned to me yesterday about some ARM image tagged as
SEV_CAPABLE. Oops. Lol :-). (Though, I'm pretty sure we won't try to
boot an ARM image on a non-ARM host anyway; but it's still wrong...)
That being said, this lazy accept problem is sort of a special case,
since it requires deploying code to the guest FW and the guest kernel.
I'm still relatively new at all of this, but other than the
SNP/TDX-enlightenment patches themselves, I haven't really seen any
other examples of this. So that goes back to my previous question. Is
this going to happen a lot more? If not, I can definitely see value in
the argument to skip the complexity of the FW/kernel feature
negotiation.
Another thing I thought of since my last reply, that's mostly an
internal solution to this problem on our side: Going back to Dave's
10k-foot view of the different angles of how to solve this. For "1.
Deal with that at the host level configuration", I'm thinking we could
tag the images with their internal guest kernel version. For example,
if an image has a 5.15 kernel, then we could have a `KERNEL_5_15` tag.
This would then allow us to have logic in the guest FW like:
if (guest_kernel_is_at_least(/*major=*/5, /*minor=*/15))
        enable_lazy_accept = true;
One detail I actually missed in all of this, is how the guest image
tag gets propagated into the guest FW in this approach. (Apologies for
this, as that's a pretty big oversight on my part.) Dionna: Have you
thought about this? Presumably this requires some sort of paravirt for
the guest to ask the host. And for any paravirt interface, now we need
to think about if it degrades the security of the confidential VMs.
Though, using it to get the kernel version to decide whether or not to
accept the memory within the guest UEFI or mark it as unaccepted seems
fine from a security angle to me.
Also, tagging images with their underlying kernel versions still seems
susceptible to mis-labeling. But this seems like it can be mostly
"fixed" via automation (e.g., write a tool to boot the guest and ask
it what its kernel version is and use the result to attach the tag).
Also, tagging the images with their kernel version seems like a much
more general solution to these sorts of issues.
Thoughts?
On Wed, Jul 20, 2022 at 10:03:40AM -0700, Marc Orr wrote:
> Generally, no. But the problem with tags is that distros tag their
> images wrong sometimes. And that leads to problems. For example, I
> just got a bug assigned to me yesterday about some ARM image tagged as
> SEV_CAPABLE. Oops. Lol :-). (Though, I'm pretty sure we won't try to
> boot an ARM image on a non-ARM host anyway; but it's still wrong...)
Yeah, even if, let it crash'n'burn - people will notice pretty quickly.
> That being said, this lazy accept problem is sort of a special case,
> since it requires deploying code to the guest FW and the guest kernel.
> I'm still relatively new at all of this, but other than the
> SNP/TDX-enlightenment patches themselves, I haven't really seen any
> other examples of this. So that goes back to my previous question. Is
> this going to happen a lot more?
Good question.
Unfortunately, not even the architects of coco could give you an answer
because, as you see yourself, those additional features like memory
acceptance, live migration, etc keep changing - the whole coco thing is
pretty much a moving target.
For example, if someone comes along and says, err, see, I have this live
migration helper and that thing runs as an EFI executable and it is so
much better...
Not saying it'll happen but it could. I hope you're catching my drift.
> If not, I can definitely see value in the argument to skip the
> complexity of the FW/kernel feature negotiation.
>
> Another thing I thought of since my last reply, that's mostly an
> internal solution to this problem on our side: Going back to Dave's
> 10k-foot view of the different angles of how to solve this. For "1.
> Deal with that at the host level configuration", I'm thinking we could
> tag the images with their internal guest kernel version. For example,
> if an image has a 5.15 kernel, then we could have a `KERNEL_5_15` tag.
> This would then allow us to have logic in the guest FW like:
>
> if (guest_kernel_is_at_least(/*major=*/5, /*minor=*/15))
>         enable_lazy_accept = true;
Well, I don't want to spoil your idea but imagine distros like SLE or
others backport features into old kernels. All of a sudden 5.14 or older
can do memory acceptance too. And then that version-based scheme falls
apart.
So I'm guessing it would probably be better to explicitly tag distro
images. Thing is, once all needed support gets in, you can drop the tags
and simply say, you don't support those old images anymore and assume
all required support is there and implicit...
> Also, tagging images with their underlying kernel versions still seems
> susceptible to mis-labeling. But this seems like it can be mostly
> "fixed" via automation (e.g., write a tool to boot the guest and ask
> it what its kernel version is and use the result to attach the tag).
I'll do you one better: boot the image and check for all required
features and produce tags. Or do not accept the image as a possible coco
image. And so on.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 7/19/22 17:02, Dave Hansen wrote:
> On 7/19/22 14:50, Borislav Petkov wrote:
>> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
>>> They're trying to design something that can (forever) handle guests that
>>> might not be able to accept memory.
>>
>> Wait, what?
>>
>> If you can't modify those guests to teach them to accept memory, how do
>> you add TDX or SNP guest support to them?
>
> Mainline today, for instance, doesn't have unaccepted memory support for
> TDX or SEV-SNP guests. But, they both still boot fine because folks
> either configure it on the host side not to *have* any unaccepted
> memory. Or, they just live with the small (4GB??) amount of
> pre-accepted memory, which is fine for testing things.

Today, for SEV-SNP, OVMF accepts all of the memory in advance of
booting the kernel.

Thanks,
Tom
On Tue, Jul 19, 2022 at 11:50:57PM +0200, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> > They're trying to design something that can (forever) handle guests that
> > might not be able to accept memory.
>
> Wait, what?
>
> If you can't modify those guests to teach them to accept memory, how do
> you add TDX or SNP guest support to them?
>
> I.e., you need to modify the guests and then you can add memory
> acceptance. Basically, your point below...
>
> > It's based on the idea that *something* needs to assume control and
> > EFI doesn't have enough information to assume control.
> >
> > I wish we didn't need all this complexity, though.
> >
> > There are three entities that can influence how much memory is accepted:
> >
> > 1. The host
> > 2. The guest firmware
> > 3. The guest kernel (or bootloader or something after the firmware)
> >
> > This whole thread is about how #2 and #3 talk to each other and make
> > sure *someone* does it.
> >
> > I kinda think we should just take the guest firmware out of the picture.
> > There are only going to be a few versions of the kernel that can boot
> > under TDX (or SEV-SNP) and *can't* handle unaccepted memory. It seems a
> > bit silly to design this whole interface for a few versions of the OS
> > that TDX folks tell me can't be used anyway.
> >
> > I think we should just say if you want to run an OS that doesn't have
> > unaccepted memory support, you can either:
> >
> > 1. Deal with that at the host level configuration
> > 2. Boot some intermediate thing like a bootloader that does acceptance
> >    before running the stupid^Wunenlightened OS
> > 3. Live with the 4GB of pre-accepted memory you get with no OS work.
> >
> > Yeah, this isn't convenient for some hosts. But, really, this is
> > preferable to doing an EFI/OS dance until the end of time.
>
> Ack. Definitely.

I like it too, as it is a no-code solution :P

Peter, I'm pretty sure unaccepted memory support hits upstream well
before TDX gets adopted widely in production. I think it is pretty
reasonable to deal with it on the host side in the meanwhile.

Any objections?

--
Kirill A. Shutemov
On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific for the Virtual
> > Machine platform.
> >
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > The kernel needs to know what memory has been accepted. Firmware
> > communicates this information via memory map: a new memory type --
> > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> >
> > Range-based tracking works fine for firmware, but it gets bulky for
> > the kernel: e820 has to be modified on every page acceptance. It leads
> > to table fragmentation, but there's a limited number of entries in the
> > e820 table.
> >
> > Another option is to mark such memory as usable in e820 and track if the
> > range has been accepted in a bitmap. One bit in the bitmap represents
> > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > physical address space.
> >
> > In the worst-case scenario -- a huge hole in the middle of the
> > address space -- it needs 256MiB to handle 4PiB of the address
> > space.
> >
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> >
> > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > faster for 4G TDX VM and ~4x faster for 64G.
> >
> > TDX-specific code is isolated from the core of the unaccepted memory
> > support. It is supposed to help plug in different implementations of
> > unaccepted memory, such as SEV-SNP.
> >
> > The tree can be found here:
> >
> > https://github.com/intel/tdx.git guest-unaccepted-memory
>
> Hi Kirill,
>
> I have a couple questions about this feature mainly about how cloud
> customers can use this, I assume since this is a confidential compute
> feature a large number of the users of these patches will be cloud
> customers using TDX and SNP. One issue I see with these patches is how
> do we as a cloud provider know whether a customer's linux image
> supports this feature, if the image doesn't have these patches UEFI
> needs to fully validate the memory, if the image does we can use this
> new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> and the customer doesn't input into which UEFI we run, as far as I can
> tell from the Azure SNP VM documentation it seems very similar. We
> need to somehow tell our UEFI in the VM what to do based on the image.
> The current way I can see to solve this issue would be to have our
> customers give us metadata about their VM's image but this seems kinda
> burdensome on our customers (I assume we'll have more features which
> both UEFI and kernel need to support in order to be turned on, like
> this one) and error-prone, if a customer incorrectly labels their
> image, it may fail to boot. Has there been any discussion about how to
> solve this? My naive thoughts were what if UEFI and Kernel had some
> sort of feature negotiation. Maybe that could happen via an extension
> to exit boot services or a UEFI runtime driver, I'm not sure what's
> best here just some ideas.
Not sure if you've seen this thread or not, but there's also been some
discussion around this in the context of the UEFI support:
https://patchew.org/EDK2/cover.1654420875.git.min.m.xu@intel.com/cce5ea2aaaeddd9ce9df6fa7ac1ef52976c5c7e6.1654420876.git.min.m.xu@intel.com/#20220608061805.vvsjiqt55rqnl3fw@sirius.home.kraxel.org
2 things being discussed there really, which I think roughly boil down
to:
1) how to configure OVMF to enable/disable lazy acceptance
- compile time option most likely: accept-all/accept-minimum/accept-1GB
2) how to introduce an automatic mode in the future where OVMF does the
right thing based on what the guest supports. Gerd floated the idea of
tying it to ExitBootServices as well, but not sure there's a solid
plan on what to do here yet.
If that's accurate, it seems like the only 'safe' option is to disable it via
#1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
Right Thing.
Users who know their VMs implement lazy acceptance can force it on via
accept-all OVMF compile option.
-Mike
On Fri, Jun 24, 2022 at 11:41 AM Michael Roth <michael.roth@amd.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > [...]
>
> Not sure if you've seen this thread or not, but there's also been some
> discussion around this in the context of the UEFI support:
>
> https://patchew.org/EDK2/cover.1654420875.git.min.m.xu@intel.com/cce5ea2aaaeddd9ce9df6fa7ac1ef52976c5c7e6.1654420876.git.min.m.xu@intel.com/#20220608061805.vvsjiqt55rqnl3fw@sirius.home.kraxel.org
>
> 2 things being discussed there really, which I think roughly boil down
> to:
>
> 1) how to configure OVMF to enable/disable lazy acceptance
>    - compile time option most likely: accept-all/accept-minimum/accept-1GB
>
> 2) how to introduce an automatic mode in the future where OVMF does the
>    right thing based on what the guest supports. Gerd floated the idea of
>    tying it to ExitBootServices as well, but not sure there's a solid
>    plan on what to do here yet.
>
> If that's accurate, it seems like the only 'safe' option is to disable it via
> #1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
> Right Thing.
>
> Users who know their VMs implement lazy acceptance can force it on via
> accept-all OVMF compile option.
>
> -Mike

Thanks for this, Mike! I will bring this to the EDK2 community. The
issue for us is our users use a GCE-built EDK2, not their own compiled
version, so they don't have the choice. Reading the Azure docs it seems
the same for them, and for AWS, so I don't know how often customers
actually get to bring their own firmware.
On Fri, Jun 24, 2022 at 12:40:57PM -0500, Michael Roth wrote:
> 1) how to configure OVMF to enable/disable lazy acceptance
>    - compile time option most likely: accept-all/accept-minimum/accept-1GB
>
> 2) how to introduce an automatic mode in the future where OVMF does the
>    right thing based on what the guest supports. Gerd floated the idea of
>    tying it to ExitBootServices as well, but not sure there's a solid
>    plan on what to do here yet.
>
> If that's accurate, it seems like the only 'safe' option is to disable it via
> #1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
> Right Thing.
>
> Users who know their VMs implement lazy acceptance can force it on via
> accept-all OVMF compile option.

accept-min / accept-X, I mean.
Peter, is your enter key broken? You seem to be typing all your text in
a single unreadable paragraph.

On 6/24/22 09:37, Peter Gonda wrote:
> if a customer incorrectly labels their image it may fail to boot..

You're saying that firmware basically has two choices:

1. Accept all the memory up front and boot slowly, but reliably
2. Use this "unaccepted memory" mechanism, boot fast, but risk that the
   VM loses a bunch of memory.

If the guest can't even boot because of a lack of memory, then the
pre-accepted chunk is probably too small in the first place.

If the customer screws up, they lose a bunch of the RAM they paid for.
That seems like a rather self-correcting problem to me.
On Fri, Jun 24, 2022 at 9:57 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Peter, is your enter key broken? You seem to be typing all your text in
> a single unreadable paragraph.
>
> On 6/24/22 09:37, Peter Gonda wrote:
> > if a customer incorrectly labels their image it may fail to boot..
>
> You're saying that firmware basically has two choices:
> 1. Accept all the memory up front and boot slowly, but reliably
> 2. Use this "unaccepted memory" mechanism, boot fast, but risk that the
>    VM loses a bunch of memory.
>
> If the guest can't even boot because of a lack of memory, then the
> pre-accepted chunk is probably too small in the first place.
>
> If the customer screws up, they lose a bunch of the RAM they paid for.
> That seems like a rather self-correcting problem to me.

I think Peter's point is a little more nuanced than that. Once lazy
accept goes into the guest firmware -- without the feature negotiation
that Peter is suggesting -- cloud providers now have a bookkeeping
problem. Which images have kernels that can boot from a guest firmware
that doesn't pre-validate all the guest memory?

The way we've been solving similar bookkeeping problems up to now
(e.g., which guests can run with CVM features like TDX/SEV enabled?
which SEV guests can live migrate?) is as follows. We tag images with
feature tags. But this is sort of a hack. And not a great one. It's
confusing to customers, hard for the cloud service provider to
support, and easy to mess up.

It would be better if the guest FW knew whether or not the kernel it
was going to launch supported lazy accept. That being said, this does
seem like a difficult problem to solve, since it's sort of backward
from how things work, in that when the guest firmware wants to switch
between pre-validating all memory vs. minimizing what it pre-validates,
the guest kernel is not running yet! But if there is some way to do
this, it would be a huge improvement over the current status quo of
pushing the feature negotiation up to the cloud service provider and
ultimately the cloud customer.
On 6/24/22 10:06, Marc Orr wrote:
> I think Peter's point is a little more nuanced than that. Once lazy
> accept goes into the guest firmware -- without the feature negotiation
> that Peter is suggesting -- cloud providers now have a bookkeeping
> problem. Which images have kernels that can boot from a guest firmware
> that doesn't pre-validate all the guest memory?

Hold on a sec though...

Is this a matter of

	can boot from a guest firmware that doesn't pre-validate all the
	guest memory?

or

	can boot from a guest firmware that doesn't pre-validate all the
	guest memory ... with access to all of that guest's RAM?

In other words, are we talking about "fails to boot" or "can't see all
the RAM"?
On Fri, Jun 24, 2022 at 10:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:06, Marc Orr wrote:
> > I think Peter's point is a little more nuanced than that. Once lazy
> > accept goes into the guest firmware -- without the feature negotiation
> > that Peter is suggesting -- cloud providers now have a bookkeeping
> > problem. Which images have kernels that can boot from a guest firmware
> > that doesn't pre-validate all the guest memory?
>
> Hold on a sec though...
>
> Is this a matter of
>
> 	can boot from a guest firmware that doesn't pre-validate all the
> 	guest memory?
>
> or
>
> 	can boot from a guest firmware that doesn't pre-validate all the
> 	guest memory ... with access to all of that guest's RAM?
>
> In other words, are we talking about "fails to boot" or "can't see all
> the RAM"?

Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
won't have access to all of the memory that the customer is paying
for. But that's still bad. If the customer buys a 96 GB VM and can
only see 4GB because their kernel doesn't have these patches, they're
going to be confused and frustrated.
On 6/24/22 10:19, Marc Orr wrote:
>> Is this a matter of
>>
>> 	can boot from a guest firmware that doesn't pre-validate all the
>> 	guest memory?
>>
>> or
>>
>> 	can boot from a guest firmware that doesn't pre-validate all the
>> 	guest memory ... with access to all of that guest's RAM?
>>
>> In other words, are we talking about "fails to boot" or "can't see all
>> the RAM"?
> Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> won't have access to all of the memory that the customer is paying
> for. But that's still bad. If the customer buys a 96 GB VM and can
> only see 4GB because their kernel doesn't have these patches, they're
> going to be confused and frustrated.

They'll at least be a _bit_ less angry and frustrated than if they were
staring at a blank screen. ;) But, yeah, I totally get the point.

How big is the window going to be where we have guests that can have
unaccepted memory, but don't have acceptance support? For TDX, it's
looking like it'll probably _just_ be 5.19. Is TDX on 5.19 in shape
that cloud providers can deploy it? Or, is stuff like lack of
attestation a deal breaker?
On Fri, Jun 24, 2022 at 11:47 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:19, Marc Orr wrote:
> > [...]
> > Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> > won't have access to all of the memory that the customer is paying
> > for. But that's still bad. If the customer buys a 96 GB VM and can
> > only see 4GB because their kernel doesn't have these patches, they're
> > going to be confused and frustrated.
>
> They'll at least be a _bit_ less angry and frustrated than if they were
> staring at a blank screen. ;) But, yeah, I totally get the point.

Ha! Well, we do have that issue in some cases. If you try to run an
SEV VM with an image that doesn't support SEV you will just get a
blank serial screen. If we had something like this back then, the FW
could have surfaced a nice error to the user, but that's history now.

> How big is the window going to be where we have guests that can have
> unaccepted memory, but don't have acceptance support? For TDX, it's
> looking like it'll probably _just_ be 5.19. Is TDX on 5.19 in shape
> that cloud providers can deploy it? Or, is stuff like lack of
> attestation a deal breaker?

This is complicated because distros don't run upstream linux versions.
If I understand correctly (I see some distro emails on here so please
correct me), distros normally maintain forks which they backport
things into. So I cannot answer this question. It is possible that a
hypothetical distro backports only the SNP/TDX initial patches and
doesn't take these for many releases. I am more familiar with SNP, and
it does have some attestation support in the first patch sets.

Also, I should have been more clear: I don't want to try and hold up
this feature, but instead discuss a future usability add-on feature.
On 6/24/22 11:10, Peter Gonda wrote:
>> How big is the window going to be where we have guests that can have
>> unaccepted memory, but don't have acceptance support? For TDX, it's
>> looking like it'll probably _just_ be 5.19. Is TDX on 5.19 in shape
>> that cloud providers can deploy it? Or, is stuff like lack of
>> attestation a deal breaker?
> This is complicated because distros don't run upstream linux versions.
> If I understand correctly (I see some distro emails on here so please
> correct me) distros normally maintain forks which they backport things
> into. So I cannot answer this question. It is possible that a
> hypothetical distro backports only the SNP/TDX initial patches and
> doesn't take these for many releases.

Distros could also backport a bare-bones version of this set that
doesn't do anything fancy and just synchronously accepts the memory at
boot. No bitmap, no page allocator changes. It'll slow boot down, but
is better than having no RAM.
On Fri, Jun 24, 2022 at 11:19 AM Marc Orr <marcorr@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > [...]
> > In other words, are we talking about "fails to boot" or "can't see all
> > the RAM"?
>
> Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> won't have access to all of the memory that the customer is paying
> for. But that's still bad. If the customer buys a 96 GB VM and can
> only see 4GB because their kernel doesn't have these patches, they're
> going to be confused and frustrated.

The other error case, which might be more confusing to the customer,
is their kernel does have these patches, there is some
misconfiguration, and their VM boots slowly because the FW uses the
accept-all-memory approach.
> > Peter, is your enter key broken? You seem to be typing all your text in
> > a single unreadable paragraph.

Sorry, I will try to format better in the future.

> > You're saying that firmware basically has two choices:
> > 1. Accept all the memory up front and boot slowly, but reliably
> > 2. Use this "unaccepted memory" mechanism, boot fast, but risk that the
> >    VM loses a bunch of memory.

That's right. Given that the first round of SNP guest patches are in,
but this work to support unaccepted memory for SNP is not, we assume
we will have distros that support SNP without this "unaccepted memory"
feature.

On Fri, Jun 24, 2022 at 11:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:06, Marc Orr wrote:
> > [...]
>
> In other words, are we talking about "fails to boot" or "can't see all
> the RAM"?

Yes, I'm sorry, I was mistaken. If FW uses unaccepted memory but the
kernel doesn't support it, the VM should still boot but will fail to
utilize all of its given RAM.

> > If the customer screws up, they lose a bunch of the RAM they paid for.
> > That seems like a rather self-correcting problem to me.

Providing customers with an easy-to-use product is a problem for us,
the cloud provider; encoding foot-guns doesn't sound like what's best
for the user here. I wanted to bring this up since it seems like a
problem most vendors/users of SNP and TDX would run into. We can of
course figure this out internally if no one else sees this as an
issue.