[PATCH RFC v1] hw/i386: place setup_data at fixed place in memory

Jason A. Donenfeld posted 1 patch 1 week, 1 day ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20220803170235.1312978-1-Jason@zx2c4.com
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Eduardo Habkost <eduardo@habkost.net>, "Michael S. Tsirkin" <mst@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
hw/i386/x86.c | 38 +++++++++++++++++++++-----------------
1 file changed, 21 insertions(+), 17 deletions(-)
[PATCH RFC v1] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week, 1 day ago
The boot parameter header refers to setup_data at an absolute address,
and each setup_data refers to the next setup_data at an absolute address
too. Currently QEMU simply puts the setup_datas right after the kernel
image, and since the kernel_image is loaded at prot_addr -- a fixed
address knowable to QEMU apriori -- the setup_data absolute address
winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.

This mostly works fine, so long as the kernel image really is loaded at
prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
generally EFI doesn't give a good way of predicting where it's going to
load the kernel. So when it loads it at some address != prot_addr, the
absolute addresses in setup_data now point somewhere bogus, causing
crashes when EFI stub tries to follow the next link.

Fix this by placing setup_data at some fixed place in memory, not as
part of the kernel image, and then pointing the setup_data absolute
address to that fixed place in memory. This way, even if OVMF or other
chains relocate the kernel image, the boot parameter still points to the
correct absolute address.

=== NOTE NOTE NOTE NOTE NOTE ===
This commit is currently garbage! It fixes the boot test case, but it
just picks the address 0x10000000. That's probably not a good idea. If
somebody with some x86 architectural knowledge could let me know a
better reserved place to put this, that'd be very appreciated.

Fixes: 3cbeb52467 ("hw/i386: add device tree support")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/x86.c | 38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 050eedc0c8..0b0083b345 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -773,9 +773,9 @@ void x86_load_linux(X86MachineState *x86ms,
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size, setup_data_offset;
+    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
     uint32_t initrd_max;
-    uint8_t header[8192], *setup, *kernel;
+    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
     hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
     FILE *f;
     char *vmode;
@@ -1048,6 +1048,8 @@ void x86_load_linux(X86MachineState *x86ms,
     }
     fclose(f);
 
+#define SETUP_DATA_PHYS_BASE 0x10000000
+
     /* append dtb to kernel */
     if (dtb_filename) {
         if (protocol < 0x209) {
@@ -1062,34 +1064,36 @@ void x86_load_linux(X86MachineState *x86ms,
             exit(1);
         }
 
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
-        kernel = g_realloc(kernel, kernel_size);
-
-
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = SETUP_DATA_PHYS_BASE + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
-
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
     if (!legacy_no_rng_seed) {
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
-        kernel = g_realloc(kernel, kernel_size);
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + SETUP_RNG_SEED;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = SETUP_DATA_PHYS_BASE + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
     }
 
-    /* Offset 0x250 is a pointer to the first setup_data link. */
-    stq_p(header + 0x250, first_setup_data);
+    if (first_setup_data) {
+            /* Offset 0x250 is a pointer to the first setup_data link. */
+            stq_p(header + 0x250, first_setup_data);
+            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
+                         SETUP_DATA_PHYS_BASE, NULL, NULL, NULL, NULL, false);
+    }
+
 
     /*
      * If we're starting an encrypted VM, it will be OVMF based, which uses the
-- 
2.35.1


Re: [PATCH RFC v1] hw/i386: place setup_data at fixed place in memory
Posted by Michael S. Tsirkin 1 week, 1 day ago
On Wed, Aug 03, 2022 at 07:02:35PM +0200, Jason A. Donenfeld wrote:
> The boot parameter header refers to setup_data at an absolute address,
> and each setup_data refers to the next setup_data at an absolute address
> too. Currently QEMU simply puts the setup_datas right after the kernel
> image, and since the kernel_image is loaded at prot_addr -- a fixed
> address knowable to QEMU apriori -- the setup_data absolute address
> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> 
> This mostly works fine, so long as the kernel image really is loaded at
> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> generally EFI doesn't give a good way of predicting where it's going to
> load the kernel. So when it loads it at some address != prot_addr, the
> absolute addresses in setup_data now point somewhere bogus, causing
> crashes when EFI stub tries to follow the next link.
> 
> Fix this by placing setup_data at some fixed place in memory, not as
> part of the kernel image, and then pointing the setup_data absolute
> address to that fixed place in memory. This way, even if OVMF or other
> chains relocate the kernel image, the boot parameter still points to the
> correct absolute address.
> 
> === NOTE NOTE NOTE NOTE NOTE ===
> This commit is currently garbage! It fixes the boot test case, but it
> just picks the address 0x10000000. That's probably not a good idea. If
> somebody with some x86 architectural knowledge could let me know a
> better reserved place to put this, that'd be very appreciated.
> 
> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  hw/i386/x86.c | 38 +++++++++++++++++++++-----------------
>  1 file changed, 21 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 050eedc0c8..0b0083b345 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -773,9 +773,9 @@ void x86_load_linux(X86MachineState *x86ms,
>      bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
>      uint16_t protocol;
>      int setup_size, kernel_size, cmdline_size;
> -    int dtb_size, setup_data_offset;
> +    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
>      uint32_t initrd_max;
> -    uint8_t header[8192], *setup, *kernel;
> +    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
>      hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
>      FILE *f;
>      char *vmode;
> @@ -1048,6 +1048,8 @@ void x86_load_linux(X86MachineState *x86ms,
>      }
>      fclose(f);
>  
> +#define SETUP_DATA_PHYS_BASE 0x10000000
> +
>      /* append dtb to kernel */
>      if (dtb_filename) {
>          if (protocol < 0x209) {
> @@ -1062,34 +1064,36 @@ void x86_load_linux(X86MachineState *x86ms,
>              exit(1);
>          }
>  
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
> -        kernel = g_realloc(kernel, kernel_size);
> -
> -
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = SETUP_DATA_PHYS_BASE + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_DTB);
>          setup_data->len = cpu_to_le32(dtb_size);
> -
>          load_image_size(dtb_filename, setup_data->data, dtb_size);
>      }
>  
>      if (!legacy_no_rng_seed) {
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> -        kernel = g_realloc(kernel, kernel_size);
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + SETUP_RNG_SEED;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = SETUP_DATA_PHYS_BASE + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
>          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
>          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>      }
>  
> -    /* Offset 0x250 is a pointer to the first setup_data link. */
> -    stq_p(header + 0x250, first_setup_data);
> +    if (first_setup_data) {
> +            /* Offset 0x250 is a pointer to the first setup_data link. */
> +            stq_p(header + 0x250, first_setup_data);
> +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> +                         SETUP_DATA_PHYS_BASE, NULL, NULL, NULL, NULL, false);
> +    }
> +
>

Allocating memory on x86 is tricky business.  Can we maybe use bios-linker-loader
with COMMAND_WRITE_POINTER to get an address from firmware?

  
>      /*
>       * If we're starting an encrypted VM, it will be OVMF based, which uses the
> -- 
> 2.35.1
Re: [PATCH RFC v1] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week, 1 day ago
Hi Michael,

On Wed, Aug 03, 2022 at 06:25:39PM -0400, Michael S. Tsirkin wrote:
> > -    /* Offset 0x250 is a pointer to the first setup_data link. */
> > -    stq_p(header + 0x250, first_setup_data);
> > +    if (first_setup_data) {
> > +            /* Offset 0x250 is a pointer to the first setup_data link. */
> > +            stq_p(header + 0x250, first_setup_data);
> > +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> > +                         SETUP_DATA_PHYS_BASE, NULL, NULL, NULL, NULL, false);
> > +    }
> > +
> >
> 
> Allocating memory on x86 is tricky business.  Can we maybe use bios-linker-loader
> with COMMAND_WRITE_POINTER to get an address from firmware?

Hmm. Is BIOSLinker even available to us at this stage in preparation?

One thing to note is that this memory doesn't really need to be
persistent. It's only used extreeeemely early in boot. So it could be
somewhere that gets used/remapped later on.

Jason
Re: [PATCH RFC v1] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week, 1 day ago
Hey again,

On Thu, Aug 04, 2022 at 12:50:50AM +0200, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Aug 03, 2022 at 06:25:39PM -0400, Michael S. Tsirkin wrote:
> > > -    /* Offset 0x250 is a pointer to the first setup_data link. */
> > > -    stq_p(header + 0x250, first_setup_data);
> > > +    if (first_setup_data) {
> > > +            /* Offset 0x250 is a pointer to the first setup_data link. */
> > > +            stq_p(header + 0x250, first_setup_data);
> > > +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> > > +                         SETUP_DATA_PHYS_BASE, NULL, NULL, NULL, NULL, false);
> > > +    }
> > > +
> > >
> > 
> > Allocating memory on x86 is tricky business.  Can we maybe use bios-linker-loader
> > with COMMAND_WRITE_POINTER to get an address from firmware?
> 
> Hmm. Is BIOSLinker even available to us at this stage in preparation?
> 
> One thing to note is that this memory doesn't really need to be
> persistent. It's only used extreeeemely early in boot. So it could be
> somewhere that gets used/remapped later on.

Actually, it's possible there's one place that's already available, and
that this isn't so bad after all. In my tests, this seems to be working
in a wide variety of configurations. I'll send a v2.

Jason
[PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week, 1 day ago
The boot parameter header refers to setup_data at an absolute address,
and each setup_data refers to the next setup_data at an absolute address
too. Currently QEMU simply puts the setup_datas right after the kernel
image, and since the kernel_image is loaded at prot_addr -- a fixed
address knowable to QEMU apriori -- the setup_data absolute address
winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.

This mostly works fine, so long as the kernel image really is loaded at
prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
generally EFI doesn't give a good way of predicting where it's going to
load the kernel. So when it loads it at some address != prot_addr, the
absolute addresses in setup_data now point somewhere bogus, causing
crashes when EFI stub tries to follow the next link.

Fix this by placing setup_data at some fixed place in memory, relative
to real_addr, not as part of the kernel image, and then pointing the
setup_data absolute address to that fixed place in memory. This way,
even if OVMF or other chains relocate the kernel image, the boot
parameter still points to the correct absolute address.

Fixes: 3cbeb52467 ("hw/i386: add device tree support")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/x86.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 050eedc0c8..8b853abf38 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -760,36 +760,36 @@ static bool load_elfboot(const char *kernel_filename,
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ENTRY, pvh_start_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, mh_load_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, elf_kernel_size);
 
     return true;
 }
 
 void x86_load_linux(X86MachineState *x86ms,
                     FWCfgState *fw_cfg,
                     int acpi_data_size,
                     bool pvh_enabled,
                     bool legacy_no_rng_seed)
 {
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size, setup_data_offset;
+    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
     uint32_t initrd_max;
-    uint8_t header[8192], *setup, *kernel;
-    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
+    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
+    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base;
     FILE *f;
     char *vmode;
     MachineState *machine = MACHINE(x86ms);
     struct setup_data *setup_data;
     const char *kernel_filename = machine->kernel_filename;
     const char *initrd_filename = machine->initrd_filename;
     const char *dtb_filename = machine->dtb;
     const char *kernel_cmdline = machine->kernel_cmdline;
     SevKernelLoaderContext sev_load_ctx = {};
     enum { RNG_SEED_LENGTH = 32 };
 
     /* Align to 16 bytes as a paranoia measure */
     cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
 
     /* load the kernel header */
     f = fopen(kernel_filename, "rb");
@@ -886,32 +886,33 @@ void x86_load_linux(X86MachineState *x86ms,
     if (protocol < 0x200 || !(header[0x211] & 0x01)) {
         /* Low kernel */
         real_addr    = 0x90000;
         cmdline_addr = 0x9a000 - cmdline_size;
         prot_addr    = 0x10000;
     } else if (protocol < 0x202) {
         /* High but ancient kernel */
         real_addr    = 0x90000;
         cmdline_addr = 0x9a000 - cmdline_size;
         prot_addr    = 0x100000;
     } else {
         /* High and recent kernel */
         real_addr    = 0x10000;
         cmdline_addr = 0x20000;
         prot_addr    = 0x100000;
     }
+    setup_data_base = real_addr + 0x8000;
 
     /* highest address for loading the initrd */
     if (protocol >= 0x20c &&
         lduw_p(header + 0x236) & XLF_CAN_BE_LOADED_ABOVE_4G) {
         /*
          * Linux has supported initrd up to 4 GB for a very long time (2007,
          * long before XLF_CAN_BE_LOADED_ABOVE_4G which was added in 2013),
          * though it only sets initrd_max to 2 GB to "work around bootloader
          * bugs". Luckily, QEMU firmware(which does something like bootloader)
          * has supported this.
          *
          * It's believed that if XLF_CAN_BE_LOADED_ABOVE_4G is set, initrd can
          * be loaded into any address.
          *
          * In addition, initrd_max is uint32_t simply because QEMU doesn't
          * support the 64-bit boot protocol (specifically the ext_ramdisk_image
@@ -1049,60 +1050,61 @@ void x86_load_linux(X86MachineState *x86ms,
     fclose(f);
 
     /* append dtb to kernel */
     if (dtb_filename) {
         if (protocol < 0x209) {
             fprintf(stderr, "qemu: Linux kernel too old to load a dtb\n");
             exit(1);
         }
 
         dtb_size = get_image_size(dtb_filename);
         if (dtb_size <= 0) {
             fprintf(stderr, "qemu: error reading dtb %s: %s\n",
                     dtb_filename, strerror(errno));
             exit(1);
         }
 
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
-        kernel = g_realloc(kernel, kernel_size);
-
-
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
-
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
     if (!legacy_no_rng_seed) {
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
-        kernel = g_realloc(kernel, kernel_size);
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
     }
 
-    /* Offset 0x250 is a pointer to the first setup_data link. */
-    stq_p(header + 0x250, first_setup_data);
+    if (first_setup_data) {
+            /* Offset 0x250 is a pointer to the first setup_data link. */
+            stq_p(header + 0x250, first_setup_data);
+            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
+                         setup_data_base, NULL, NULL, NULL, NULL, false);
+    }
 
     /*
      * If we're starting an encrypted VM, it will be OVMF based, which uses the
      * efi stub for booting and doesn't require any values to be placed in the
      * kernel header.  We therefore don't update the header so the hash of the
      * kernel on the other side of the fw_cfg interface matches the hash of the
      * file the user passed in.
      */
     if (!sev_enabled()) {
         memcpy(setup, header, MIN(sizeof(header), setup_size));
     }
 
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
     fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
     sev_load_ctx.kernel_data = (char *)kernel;
-- 
2.35.1


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Daniel P. Berrangé 1 week ago
On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> The boot parameter header refers to setup_data at an absolute address,
> and each setup_data refers to the next setup_data at an absolute address
> too. Currently QEMU simply puts the setup_datas right after the kernel
> image, and since the kernel_image is loaded at prot_addr -- a fixed
> address knowable to QEMU apriori -- the setup_data absolute address
> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> 
> This mostly works fine, so long as the kernel image really is loaded at
> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> generally EFI doesn't give a good way of predicting where it's going to
> load the kernel. So when it loads it at some address != prot_addr, the
> absolute addresses in setup_data now point somewhere bogus, causing
> crashes when EFI stub tries to follow the next link.
> 
> Fix this by placing setup_data at some fixed place in memory, relative
> to real_addr, not as part of the kernel image, and then pointing the
> setup_data absolute address to that fixed place in memory. This way,
> even if OVMF or other chains relocate the kernel image, the boot
> parameter still points to the correct absolute address.
> 
> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  hw/i386/x86.c | 38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 050eedc0c8..8b853abf38 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c


>      if (!legacy_no_rng_seed) {
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> -        kernel = g_realloc(kernel, kernel_size);
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = setup_data_base + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
>          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
>          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>      }
>  
> -    /* Offset 0x250 is a pointer to the first setup_data link. */
> -    stq_p(header + 0x250, first_setup_data);
> +    if (first_setup_data) {
> +            /* Offset 0x250 is a pointer to the first setup_data link. */
> +            stq_p(header + 0x250, first_setup_data);
> +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> +                         setup_data_base, NULL, NULL, NULL, NULL, false);
> +    }

The boot measurements with AMD SEV now succeed, but I'm a little
worried about the implications of adding this ROM, when a few lines
later here we're discarding the 'header' changes for AMD SEV. Is
this still going to operate correctly in the guest OS if we've
discarded the header changes below ?

>      /*
>       * If we're starting an encrypted VM, it will be OVMF based, which uses the
>       * efi stub for booting and doesn't require any values to be placed in the
>       * kernel header.  We therefore don't update the header so the hash of the
>       * kernel on the other side of the fw_cfg interface matches the hash of the
>       * file the user passed in.
>       */
>      if (!sev_enabled()) {
>          memcpy(setup, header, MIN(sizeof(header), setup_size));
>      }
>  
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
>      fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
>      sev_load_ctx.kernel_data = (char *)kernel;
> -- 
> 2.35.1
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
Hi Daniel,

On Thu, Aug 4, 2022 at 2:54 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> > The boot parameter header refers to setup_data at an absolute address,
> > and each setup_data refers to the next setup_data at an absolute address
> > too. Currently QEMU simply puts the setup_datas right after the kernel
> > image, and since the kernel_image is loaded at prot_addr -- a fixed
> > address knowable to QEMU apriori -- the setup_data absolute address
> > winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> >
> > This mostly works fine, so long as the kernel image really is loaded at
> > prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> > generally EFI doesn't give a good way of predicting where it's going to
> > load the kernel. So when it loads it at some address != prot_addr, the
> > absolute addresses in setup_data now point somewhere bogus, causing
> > crashes when EFI stub tries to follow the next link.
> >
> > Fix this by placing setup_data at some fixed place in memory, relative
> > to real_addr, not as part of the kernel image, and then pointing the
> > setup_data absolute address to that fixed place in memory. This way,
> > even if OVMF or other chains relocate the kernel image, the boot
> > parameter still points to the correct absolute address.
> >
> > Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> > Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <richard.henderson@linaro.org>
> > Cc: Peter Maydell <peter.maydell@linaro.org>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: linux-efi@vger.kernel.org
> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > ---
> >  hw/i386/x86.c | 38 ++++++++++++++++++++------------------
> >  1 file changed, 20 insertions(+), 18 deletions(-)
> >
> > diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> > index 050eedc0c8..8b853abf38 100644
> > --- a/hw/i386/x86.c
> > +++ b/hw/i386/x86.c
>
>
> >      if (!legacy_no_rng_seed) {
> > -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> > -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> > -        kernel = g_realloc(kernel, kernel_size);
> > -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> > +        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
> > +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> > +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
> >          setup_data->next = cpu_to_le64(first_setup_data);
> > -        first_setup_data = prot_addr + setup_data_offset;
> > +        first_setup_data = setup_data_base + setup_data_total_len;
> > +        setup_data_total_len += setup_data_item_len;
> >          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
> >          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
> >          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
> >      }
> >
> > -    /* Offset 0x250 is a pointer to the first setup_data link. */
> > -    stq_p(header + 0x250, first_setup_data);
> > +    if (first_setup_data) {
> > +            /* Offset 0x250 is a pointer to the first setup_data link. */
> > +            stq_p(header + 0x250, first_setup_data);
> > +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> > +                         setup_data_base, NULL, NULL, NULL, NULL, false);
> > +    }
>
> The boot measurements with AMD SEV now succeed, but I'm a little
> worried about the implications of adding this ROM, when a few lines
> later here we're discarding the 'header' changes for AMD SEV. Is
> this still going to operate correctly in the guest OS if we've
> discarded the header changes below ?

I'll add a !sev_enabled() condition to that block too, so it also
skips adding the ROM, for v3.

Jason
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Michael S. Tsirkin 1 week ago
On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> The boot parameter header refers to setup_data at an absolute address,
> and each setup_data refers to the next setup_data at an absolute address
> too. Currently QEMU simply puts the setup_datas right after the kernel
> image, and since the kernel_image is loaded at prot_addr -- a fixed
> address knowable to QEMU apriori -- the setup_data absolute address
> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> 
> This mostly works fine, so long as the kernel image really is loaded at
> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> generally EFI doesn't give a good way of predicting where it's going to
> load the kernel. So when it loads it at some address != prot_addr, the
> absolute addresses in setup_data now point somewhere bogus, causing
> crashes when EFI stub tries to follow the next link.
> 
> Fix this by placing setup_data at some fixed place in memory, relative
> to real_addr, not as part of the kernel image, and then pointing the
> setup_data absolute address to that fixed place in memory. This way,
> even if OVMF or other chains relocate the kernel image, the boot
> parameter still points to the correct absolute address.
> 
> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Didn't read the patch yet.
Adding Laszlo.

> ---
>  hw/i386/x86.c | 38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 050eedc0c8..8b853abf38 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -760,36 +760,36 @@ static bool load_elfboot(const char *kernel_filename,
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ENTRY, pvh_start_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, mh_load_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, elf_kernel_size);
>  
>      return true;
>  }
>  
>  void x86_load_linux(X86MachineState *x86ms,
>                      FWCfgState *fw_cfg,
>                      int acpi_data_size,
>                      bool pvh_enabled,
>                      bool legacy_no_rng_seed)
>  {
>      bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
>      uint16_t protocol;
>      int setup_size, kernel_size, cmdline_size;
> -    int dtb_size, setup_data_offset;
> +    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
>      uint32_t initrd_max;
> -    uint8_t header[8192], *setup, *kernel;
> -    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
> +    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
> +    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base;
>      FILE *f;
>      char *vmode;
>      MachineState *machine = MACHINE(x86ms);
>      struct setup_data *setup_data;
>      const char *kernel_filename = machine->kernel_filename;
>      const char *initrd_filename = machine->initrd_filename;
>      const char *dtb_filename = machine->dtb;
>      const char *kernel_cmdline = machine->kernel_cmdline;
>      SevKernelLoaderContext sev_load_ctx = {};
>      enum { RNG_SEED_LENGTH = 32 };
>  
>      /* Align to 16 bytes as a paranoia measure */
>      cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
>  
>      /* load the kernel header */
>      f = fopen(kernel_filename, "rb");
> @@ -886,32 +886,33 @@ void x86_load_linux(X86MachineState *x86ms,
>      if (protocol < 0x200 || !(header[0x211] & 0x01)) {
>          /* Low kernel */
>          real_addr    = 0x90000;
>          cmdline_addr = 0x9a000 - cmdline_size;
>          prot_addr    = 0x10000;
>      } else if (protocol < 0x202) {
>          /* High but ancient kernel */
>          real_addr    = 0x90000;
>          cmdline_addr = 0x9a000 - cmdline_size;
>          prot_addr    = 0x100000;
>      } else {
>          /* High and recent kernel */
>          real_addr    = 0x10000;
>          cmdline_addr = 0x20000;
>          prot_addr    = 0x100000;
>      }
> +    setup_data_base = real_addr + 0x8000;
>  
>      /* highest address for loading the initrd */
>      if (protocol >= 0x20c &&
>          lduw_p(header + 0x236) & XLF_CAN_BE_LOADED_ABOVE_4G) {
>          /*
>           * Linux has supported initrd up to 4 GB for a very long time (2007,
>           * long before XLF_CAN_BE_LOADED_ABOVE_4G which was added in 2013),
>           * though it only sets initrd_max to 2 GB to "work around bootloader
>           * bugs". Luckily, QEMU firmware(which does something like bootloader)
>           * has supported this.
>           *
>           * It's believed that if XLF_CAN_BE_LOADED_ABOVE_4G is set, initrd can
>           * be loaded into any address.
>           *
>           * In addition, initrd_max is uint32_t simply because QEMU doesn't
>           * support the 64-bit boot protocol (specifically the ext_ramdisk_image
> @@ -1049,60 +1050,61 @@ void x86_load_linux(X86MachineState *x86ms,
>      fclose(f);
>  
>      /* append dtb to kernel */
>      if (dtb_filename) {
>          if (protocol < 0x209) {
>              fprintf(stderr, "qemu: Linux kernel too old to load a dtb\n");
>              exit(1);
>          }
>  
>          dtb_size = get_image_size(dtb_filename);
>          if (dtb_size <= 0) {
>              fprintf(stderr, "qemu: error reading dtb %s: %s\n",
>                      dtb_filename, strerror(errno));
>              exit(1);
>          }
>  
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
> -        kernel = g_realloc(kernel, kernel_size);
> -
> -
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = setup_data_base + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_DTB);
>          setup_data->len = cpu_to_le32(dtb_size);
> -
>          load_image_size(dtb_filename, setup_data->data, dtb_size);
>      }
>  
>      if (!legacy_no_rng_seed) {
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> -        kernel = g_realloc(kernel, kernel_size);
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = setup_data_base + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
>          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
>          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>      }
>  
> -    /* Offset 0x250 is a pointer to the first setup_data link. */
> -    stq_p(header + 0x250, first_setup_data);
> +    if (first_setup_data) {
> +            /* Offset 0x250 is a pointer to the first setup_data link. */
> +            stq_p(header + 0x250, first_setup_data);
> +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> +                         setup_data_base, NULL, NULL, NULL, NULL, false);
> +    }
>  
>      /*
>       * If we're starting an encrypted VM, it will be OVMF based, which uses the
>       * efi stub for booting and doesn't require any values to be placed in the
>       * kernel header.  We therefore don't update the header so the hash of the
>       * kernel on the other side of the fw_cfg interface matches the hash of the
>       * file the user passed in.
>       */
>      if (!sev_enabled()) {
>          memcpy(setup, header, MIN(sizeof(header), setup_size));
>      }
>  
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
>      fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
>      sev_load_ctx.kernel_data = (char *)kernel;
> -- 
> 2.35.1
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 1 week ago
On 08/04/22 09:03, Michael S. Tsirkin wrote:
> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
>> The boot parameter header refers to setup_data at an absolute address,
>> and each setup_data refers to the next setup_data at an absolute address
>> too. Currently QEMU simply puts the setup_datas right after the kernel
>> image, and since the kernel_image is loaded at prot_addr -- a fixed
>> address knowable to QEMU apriori -- the setup_data absolute address
>> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
>>
>> This mostly works fine, so long as the kernel image really is loaded at
>> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
>> generally EFI doesn't give a good way of predicting where it's going to
>> load the kernel. So when it loads it at some address != prot_addr, the
>> absolute addresses in setup_data now point somewhere bogus, causing
>> crashes when EFI stub tries to follow the next link.
>>
>> Fix this by placing setup_data at some fixed place in memory, relative
>> to real_addr, not as part of the kernel image, and then pointing the
>> setup_data absolute address to that fixed place in memory. This way,
>> even if OVMF or other chains relocate the kernel image, the boot
>> parameter still points to the correct absolute address.
>>
>> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
>> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Richard Henderson <richard.henderson@linaro.org>
>> Cc: Peter Maydell <peter.maydell@linaro.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Daniel P. Berrangé <berrange@redhat.com>
>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: linux-efi@vger.kernel.org
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> 
> Didn't read the patch yet.
> Adding Laszlo.

As I said in
<http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
I don't believe that the setup_data chaining described in
<https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
for UEFI guests at all, with QEMU pre-populating the links with GPAs.

However, rather than introducing a new info channel, or reusing an
existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
for the sake of this feature, I suggest simply disabling this feature
for UEFI guests. setup_data chaining has not been necessary for UEFI
guests for years (this is the first time I've heard about it in more
than a decade), and the particular use case (provide guests with good
random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.

... Now, here's my problem: microvm, and Xen.

As far as I can tell, QEMU can determine (it already does determine)
whether the guest uses UEFI or not, for the "pc" and "q35" machine
types. But not for microvm or Xen!

Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
derive that

  pflash_cfi01_get_blk(pcms->flash[0])

returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
is only valid after the inital part of pc_system_firmware_init() has run
("Map legacy -drive if=pflash to machine properties"), but that is not a
problem, given the following call tree:

  pc_memory_init()            [hw/i386/pc.c]
    pc_system_firmware_init() [hw/i386/pc_sysfw.c]
    x86_load_linux()          [hw/i386/x86.c]

That is, when we call x86_load_linux() from pc_memory_init(), we can set
the last argument of x86_load_linux() from *both* the compat knob *and*
pflash_cfi01_get_blk(pcms->flash[0]).

Unluckily however, we have two more x86_load_linux() calls where this
does not work.

The first is xen_load_linux() [hw/i386/pc.c], which is used for "xen HVM
direct kernel boot" [hw/i386/pc_piix.c] -- there we certainly don't use
flash. I don't know if Xen HVM direct kernel boot is possible with UEFI.
If so, we have a problem, if UEFI is out of the question there, then
enabling setup_data passing is fine.

The other problematic x86_load_linux() call is in MicroVM --
microvm_memory_init() [hw/i386/microvm.c]. This is a real problem
unfortunately, as MicroVM can be used with SeaBIOS, QBoot, and even
OVMF, *but* even in the last case, it runs OVMF *without* flash (just
from ROM). So not only is MicroVM not a descendant of the PC Machine
Class (so it has no access to PC Machine State either -- such as
pcms->flash), it never *uses* flash in fact.

Which is a big problem for my idea, because QEMU has no way of
identifying whether microvm is going to boot a custom SeaBIOS binary
(where the current setup_data chaining is OK) or a custom OVMF binary
(where the current setup_data chaining crashes the guest kernel).

So I thought that for pc and q35, I should be able to propose a fix,
based on:

  pflash_cfi01_get_blk(pcms->flash[0])

but it turns out I don't know what to do about Xen; and worse, for
MicroVM, it's currently *impossible* for QEMU to tell apart UEFI from
other guest firmwares.

For now I suggest either reverting the original patch, or at least not
enabling the knob by default for any machine types. In particular, when
using MicroVM, the user must leave the knob disabled when direct booting
a kernel on OVMF, and the user may or may not enable the knob when
direct booting a kernel on SeaBIOS.

Thanks
Laszlo
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Daniel P. Berrangé 1 week ago
On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
> On 08/04/22 09:03, Michael S. Tsirkin wrote:
> > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> >> The boot parameter header refers to setup_data at an absolute address,
> >> and each setup_data refers to the next setup_data at an absolute address
> >> too. Currently QEMU simply puts the setup_datas right after the kernel
> >> image, and since the kernel_image is loaded at prot_addr -- a fixed
> >> address knowable to QEMU apriori -- the setup_data absolute address
> >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> >>
> >> This mostly works fine, so long as the kernel image really is loaded at
> >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> >> generally EFI doesn't give a good way of predicting where it's going to
> >> load the kernel. So when it loads it at some address != prot_addr, the
> >> absolute addresses in setup_data now point somewhere bogus, causing
> >> crashes when EFI stub tries to follow the next link.
> >>
> >> Fix this by placing setup_data at some fixed place in memory, relative
> >> to real_addr, not as part of the kernel image, and then pointing the
> >> setup_data absolute address to that fixed place in memory. This way,
> >> even if OVMF or other chains relocate the kernel image, the boot
> >> parameter still points to the correct absolute address.
> >>
> >> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >> Cc: Richard Henderson <richard.henderson@linaro.org>
> >> Cc: Peter Maydell <peter.maydell@linaro.org>
> >> Cc: Michael S. Tsirkin <mst@redhat.com>
> >> Cc: Daniel P. Berrangé <berrange@redhat.com>
> >> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >> Cc: Ard Biesheuvel <ardb@kernel.org>
> >> Cc: linux-efi@vger.kernel.org
> >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > 
> > Didn't read the patch yet.
> > Adding Laszlo.
> 
> As I said in
> <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
> I don't believe that the setup_data chaining described in
> <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
> for UEFI guests at all, with QEMU pre-populating the links with GPAs.
> 
> However, rather than introducing a new info channel, or reusing an
> existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
> for the sake of this feature, I suggest simply disabling this feature
> for UEFI guests. setup_data chaining has not been necessary for UEFI
> guests for years (this is the first time I've heard about it in more
> than a decade), and the particular use case (provide guests with good
> random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
> 
> ... Now, here's my problem: microvm, and Xen.
> 
> As far as I can tell, QEMU can determine (it already does determine)
> whether the guest uses UEFI or not, for the "pc" and "q35" machine
> types. But not for microvm or Xen!
> 
> Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
> derive that
> 
>   pflash_cfi01_get_blk(pcms->flash[0])
> 
> returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
> is only valid after the inital part of pc_system_firmware_init() has run
> ("Map legacy -drive if=pflash to machine properties"), but that is not a
> problem, given the following call tree:

I don't beleve that's a valid check anymore since Gerd introduced the
ability to load UEFI via -bios, for UEFI builds without persistent
variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )


> Which is a big problem for my idea, because QEMU has no way of
> identifying whether microvm is going to boot a custom SeaBIOS binary
> (where the current setup_data chaining is OK) or a custom OVMF binary
> (where the current setup_data chaining crashes the guest kernel).
> 
> So I thought that for pc and q35, I should be able to propose a fix,
> based on:
> 
>   pflash_cfi01_get_blk(pcms->flash[0])
> 
> but it turns out I don't know what to do about Xen; and worse, for
> MicroVM, it's currently *impossible* for QEMU to tell apart UEFI from
> other guest firmwares.

Yep, and ultimately the inability to distinguish UEFI vs other firmware
is arguably correct by design, as the QEMU <-> firmware interface is
supposed to be arbitrarily pluggable for any firmware implementation
not  limited to merely UEFI + seabios.

> For now I suggest either reverting the original patch, or at least not
> enabling the knob by default for any machine types. In particular, when
> using MicroVM, the user must leave the knob disabled when direct booting
> a kernel on OVMF, and the user may or may not enable the knob when
> direct booting a kernel on SeaBIOS.

Having it opt-in via a knob would defeat Jason's goal of having the seed
available automatically.



With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
Hi Daniel,

On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> is arguably correct by design, as the QEMU <-> firmware interface is
> supposed to be arbitrarily pluggable for any firmware implementation
> not  limited to merely UEFI + seabios.

Indeed, I agree with this.

> 
> > For now I suggest either reverting the original patch, or at least not
> > enabling the knob by default for any machine types. In particular, when
> > using MicroVM, the user must leave the knob disabled when direct booting
> > a kernel on OVMF, and the user may or may not enable the knob when
> > direct booting a kernel on SeaBIOS.
> 
> Having it opt-in via a knob would defeat Jason's goal of having the seed
> available automatically.

Yes, adding a knob is absolutely out of the question.

It also doesn't actually solve the problem: this triggers when QEMU
passes a DTB too. It's not just for the new RNG seed thing. This bug
isn't new.

Jason

Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Daniel P. Berrangé 1 week ago
On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> Hi Daniel,
> 
> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > is arguably correct by design, as the QEMU <-> firmware interface is
> > supposed to be arbitrarily pluggable for any firmware implementation
> > not  limited to merely UEFI + seabios.
> 
> Indeed, I agree with this.
> 
> > 
> > > For now I suggest either reverting the original patch, or at least not
> > > enabling the knob by default for any machine types. In particular, when
> > > using MicroVM, the user must leave the knob disabled when direct booting
> > > a kernel on OVMF, and the user may or may not enable the knob when
> > > direct booting a kernel on SeaBIOS.
> > 
> > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > available automatically.
> 
> Yes, adding a knob is absolutely out of the question.
> 
> It also doesn't actually solve the problem: this triggers when QEMU
> passes a DTB too. It's not just for the new RNG seed thing. This bug
> isn't new.

In the other thread I also mentioned that this RNG Seed addition has
caused a bug with AMD SEV too, making boot measurement attestation
fail because the kernel blob passed to the firmware no longer matches
what the tenant expects, due to the injected seed.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Ard Biesheuvel 1 week ago
On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > Hi Daniel,
> >
> > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > supposed to be arbitrarily pluggable for any firmware implementation
> > > not  limited to merely UEFI + seabios.
> >
> > Indeed, I agree with this.
> >
> > >
> > > > For now I suggest either reverting the original patch, or at least not
> > > > enabling the knob by default for any machine types. In particular, when
> > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > direct booting a kernel on SeaBIOS.
> > >
> > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > available automatically.
> >
> > Yes, adding a knob is absolutely out of the question.
> >
> > It also doesn't actually solve the problem: this triggers when QEMU
> > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > isn't new.
>
> In the other thread I also mentioned that this RNG Seed addition has
> caused a bug with AMD SEV too, making boot measurement attestation
> fail because the kernel blob passed to the firmware no longer matches
> what the tenant expects, due to the injected seed.
>

I was actually expecting this to be an issue in the
signing/attestation department as well, and you just confirmed my
suspicion.

But does this mean that populating the setup_data pointer is out of
the question altogether? Or only that putting the setup_data linked
list nodes inside the image is a problem?
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 1 week ago
On 08/04/22 14:16, Ard Biesheuvel wrote:
> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
>>> Hi Daniel,
>>>
>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
>>>> is arguably correct by design, as the QEMU <-> firmware interface is
>>>> supposed to be arbitrarily pluggable for any firmware implementation
>>>> not  limited to merely UEFI + seabios.
>>>
>>> Indeed, I agree with this.
>>>
>>>>
>>>>> For now I suggest either reverting the original patch, or at least not
>>>>> enabling the knob by default for any machine types. In particular, when
>>>>> using MicroVM, the user must leave the knob disabled when direct booting
>>>>> a kernel on OVMF, and the user may or may not enable the knob when
>>>>> direct booting a kernel on SeaBIOS.
>>>>
>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
>>>> available automatically.
>>>
>>> Yes, adding a knob is absolutely out of the question.
>>>
>>> It also doesn't actually solve the problem: this triggers when QEMU
>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
>>> isn't new.
>>
>> In the other thread I also mentioned that this RNG Seed addition has
>> caused a bug with AMD SEV too, making boot measurement attestation
>> fail because the kernel blob passed to the firmware no longer matches
>> what the tenant expects, due to the injected seed.
>>
> 
> I was actually expecting this to be an issue in the
> signing/attestation department as well, and you just confirmed my
> suspicion.
> 
> But does this mean that populating the setup_data pointer is out of
> the question altogether? Or only that putting the setup_data linked
> list nodes inside the image is a problem?

QEMU already has to inject a whole bunch of stuff into confidential
computing guests. The way it's done (IIRC) is that the non-compressed,
trailing part of pflash (basically where the reset vector code lives
too) is populated at OVMF build time with a chain of GUID-ed structures,
and fields of those structures are filled in (at OVMF build time) from
various fixed PCDs. The fixed PCDs in turn are populated from the FD
files, using various MEMFD regions. When QEMU launches the guest, it can
parse the GPAs from the on-disk pflash image (traversing the list of
GUID-ed structs), at which addresses the guest firmware will then expect
the various crypto artifacts to be injected.

The point is that "who's in control" is reversed. The firmware exposes
(at build time) at what GPAs it can accept data injections, and QEMU
follows that. Of course the firmware ensures that nothing else in the
firmware will try to own those GPAs.

The only thing that needed to be hard-coded when this feature was
introduced was the "entry point", that is, the flash offset at which
QEMU starts the GUID-ed structure traversal.

AMD and IBM developers can likely much better describe this mechanism,
as I've not dealt with it in over a year. The QEMU side code is in
"hw/i386/pc_sysfw_ovmf.c".

We can make setup_data chaining work with OVMF, but the whole chain
should be located in a GPA range that OVMF dictates.


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
>
> On 08/04/22 14:16, Ard Biesheuvel wrote:
> > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>
> >> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> >>> Hi Daniel,
> >>>
> >>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> >>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> >>>> is arguably correct by design, as the QEMU <-> firmware interface is
> >>>> supposed to be arbitrarily pluggable for any firmware implementation
> >>>> not  limited to merely UEFI + seabios.
> >>>
> >>> Indeed, I agree with this.
> >>>
> >>>>
> >>>>> For now I suggest either reverting the original patch, or at least not
> >>>>> enabling the knob by default for any machine types. In particular, when
> >>>>> using MicroVM, the user must leave the knob disabled when direct booting
> >>>>> a kernel on OVMF, and the user may or may not enable the knob when
> >>>>> direct booting a kernel on SeaBIOS.
> >>>>
> >>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
> >>>> available automatically.
> >>>
> >>> Yes, adding a knob is absolutely out of the question.
> >>>
> >>> It also doesn't actually solve the problem: this triggers when QEMU
> >>> passes a DTB too. It's not just for the new RNG seed thing. This bug
> >>> isn't new.
> >>
> >> In the other thread I also mentioned that this RNG Seed addition has
> >> caused a bug with AMD SEV too, making boot measurement attestation
> >> fail because the kernel blob passed to the firmware no longer matches
> >> what the tenant expects, due to the injected seed.
> >>
> >
> > I was actually expecting this to be an issue in the
> > signing/attestation department as well, and you just confirmed my
> > suspicion.
> >
> > But does this mean that populating the setup_data pointer is out of
> > the question altogether? Or only that putting the setup_data linked
> > list nodes inside the image is a problem?
>
> QEMU already has to inject a whole bunch of stuff into confidential
> computing guests. The way it's done (IIRC) is that the non-compressed,
> trailing part of pflash (basically where the reset vector code lives
> too) is populated at OVMF build time with a chain of GUID-ed structures,
> and fields of those structures are filled in (at OVMF build time) from
> various fixed PCDs. The fixed PCDs in turn are populated from the FD
> files, using various MEMFD regions. When QEMU launches the guest, it can
> parse the GPAs from the on-disk pflash image (traversing the list of
> GUID-ed structs), at which addresses the guest firmware will then expect
> the various crypto artifacts to be injected.
>
> The point is that "who's in control" is reversed. The firmware exposes
> (at build time) at what GPAs it can accept data injections, and QEMU
> follows that. Of course the firmware ensures that nothing else in the
> firmware will try to own those GPAs.
>
> The only thing that needed to be hard-coded when this feature was
> introduced was the "entry point", that is, the flash offset at which
> QEMU starts the GUID-ed structure traversal.
>
> AMD and IBM developers can likely much better describe this mechanism,
> as I've not dealt with it in over a year. The QEMU side code is in
> "hw/i386/pc_sysfw_ovmf.c".
>
> We can make setup_data chaining work with OVMF, but the whole chain
> should be located in a GPA range that OVMF dictates.

It sounds like what you describe is pretty OVMF-specific though,
right? Do we want to tie things together so tightly like that?

Given we only need 48 bytes or so, isn't there a more subtle place we
could just throw this in ram that doesn't need such complex
coordination?

Jason
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 1 week ago
On 08/04/22 15:28, Jason A. Donenfeld wrote:
> On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> On 08/04/22 14:16, Ard Biesheuvel wrote:
>>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>>
>>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
>>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
>>>>>> is arguably correct by design, as the QEMU <-> firmware interface is
>>>>>> supposed to be arbitrarily pluggable for any firmware implementation
>>>>>> not  limited to merely UEFI + seabios.
>>>>>
>>>>> Indeed, I agree with this.
>>>>>
>>>>>>
>>>>>>> For now I suggest either reverting the original patch, or at least not
>>>>>>> enabling the knob by default for any machine types. In particular, when
>>>>>>> using MicroVM, the user must leave the knob disabled when direct booting
>>>>>>> a kernel on OVMF, and the user may or may not enable the knob when
>>>>>>> direct booting a kernel on SeaBIOS.
>>>>>>
>>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
>>>>>> available automatically.
>>>>>
>>>>> Yes, adding a knob is absolutely out of the question.
>>>>>
>>>>> It also doesn't actually solve the problem: this triggers when QEMU
>>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
>>>>> isn't new.
>>>>
>>>> In the other thread I also mentioned that this RNG Seed addition has
>>>> caused a bug with AMD SEV too, making boot measurement attestation
>>>> fail because the kernel blob passed to the firmware no longer matches
>>>> what the tenant expects, due to the injected seed.
>>>>
>>>
>>> I was actually expecting this to be an issue in the
>>> signing/attestation department as well, and you just confirmed my
>>> suspicion.
>>>
>>> But does this mean that populating the setup_data pointer is out of
>>> the question altogether? Or only that putting the setup_data linked
>>> list nodes inside the image is a problem?
>>
>> QEMU already has to inject a whole bunch of stuff into confidential
>> computing guests. The way it's done (IIRC) is that the non-compressed,
>> trailing part of pflash (basically where the reset vector code lives
>> too) is populated at OVMF build time with a chain of GUID-ed structures,
>> and fields of those structures are filled in (at OVMF build time) from
>> various fixed PCDs. The fixed PCDs in turn are populated from the FD
>> files, using various MEMFD regions. When QEMU launches the guest, it can
>> parse the GPAs from the on-disk pflash image (traversing the list of
>> GUID-ed structs), at which addresses the guest firmware will then expect
>> the various crypto artifacts to be injected.
>>
>> The point is that "who's in control" is reversed. The firmware exposes
>> (at build time) at what GPAs it can accept data injections, and QEMU
>> follows that. Of course the firmware ensures that nothing else in the
>> firmware will try to own those GPAs.
>>
>> The only thing that needed to be hard-coded when this feature was
>> introduced was the "entry point", that is, the flash offset at which
>> QEMU starts the GUID-ed structure traversal.
>>
>> AMD and IBM developers can likely much better describe this mechanism,
>> as I've not dealt with it in over a year. The QEMU side code is in
>> "hw/i386/pc_sysfw_ovmf.c".
>>
>> We can make setup_data chaining work with OVMF, but the whole chain
>> should be located in a GPA range that OVMF dictates.
> 
> It sounds like what you describe is pretty OVMF-specific though,
> right?

Yes, completely OVMF specific.

> Do we want to tie things together so tightly like that?

It depends on what the goal is:

- do we want setup_data chaining work generally?

- or do we want only the random seed injection to stop crashing OVMF guests?

In the latter case, we only need to teach QEMU to reliably recognize
OVMF from the on-disk firmware binary (regardless of pflash use), and
then not expose any setup_data in guest RAM. The UEFI guest kernel will
use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no
particular seed otherwise.

(This is the "papering over" case, which I don't think is necessarily
wrong; only it should be robust. I thought pflash would be usable for
that; turns out it isn't. Thus, we could perhaps try identifying a
firmware binary as "OVMF" by contents.)

In the former case though, I think the "tight coupling" is unavoidable.
As long as setup_data is needed by the kernel extremely early, no
"firmware hiding" or "firmware independence" is in place yet.

> Given we only need 48 bytes or so, isn't there a more subtle place we
> could just throw this in ram that doesn't need such complex
> coordination?

These tricks add up and go wrong after a while. The pedantic
reservations in the firmware have proved necessary.

IIUC, with v2, the setup_data_base address would (most frequently) be 96
KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
not reserve away the area, UefiCpuPkg or other drivers might allocate an
overlapping chunk, even if only temporarily. That might not break the
firmware, but it could overwrite the random seed. If the guest kernel
somehow consumed that seed (rather than getting one from
EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?),
that could potentially destroy the guest's security, without any
obvious/immediate signs.

Laszlo


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
Hey Laszlo,

On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote:
> - do we want setup_data chaining work generally?
> 
> - or do we want only the random seed injection to stop crashing OVMF guests?

Preferably the first - generally. Which brings us to your point:
 
> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> These tricks add up and go wrong after a while. The pedantic
> reservations in the firmware have proved necessary.
> 
> IIUC, with v2, the setup_data_base address would (most frequently) be 96
> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
> not reserve away the area, UefiCpuPkg or other drivers might allocate an
> overlapping chunk, even if only temporarily. That might not break the
> firmware, but it could overwrite the random seed. 

Yea, so we don't want an address that something else might overwrite. So
my question is: isn't there some 48 bytes or so available in some low
address (or maybe a high one?) that is traditionally reserved for some
hardware function, and so software doesn't use it, but it turns out QEMU
doesn't use it for anything either, so we can get away placing it at
that address? It seems like there *ought* to be something like that. I
just don't (yet) know what it is...

Jason
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 6 days, 18 hours ago
On 08/05/22 00:56, Jason A. Donenfeld wrote:
> Hey Laszlo,
> 
> On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote:
>> - do we want setup_data chaining work generally?
>>
>> - or do we want only the random seed injection to stop crashing OVMF guests?
> 
> Preferably the first - generally. Which brings us to your point:
>  
>>> Given we only need 48 bytes or so, isn't there a more subtle place we
>>> could just throw this in ram that doesn't need such complex
>>> coordination?
>>
>> These tricks add up and go wrong after a while. The pedantic
>> reservations in the firmware have proved necessary.
>>
>> IIUC, with v2, the setup_data_base address would (most frequently) be 96
>> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
>> not reserve away the area, UefiCpuPkg or other drivers might allocate an
>> overlapping chunk, even if only temporarily. That might not break the
>> firmware, but it could overwrite the random seed. 
> 
> Yea, so we don't want an address that something else might overwrite. So
> my question is: isn't there some 48 bytes or so available in some low
> address (or maybe a high one?) that is traditionally reserved for some
> hardware function, and so software doesn't use it, but it turns out QEMU
> doesn't use it for anything either, so we can get away placing it at
> that address? It seems like there *ought* to be something like that. I
> just don't (yet) know what it is...

I don't know of any such "hidden pocket", unfortunately. On the other
hand, low-level edk2 drivers (usually dealing with x86 intricacies, such
as MTRRs, CPU bringup, ...) have repeatedly surprised me with their
handling of low memory.

Laszlo
[PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
The boot parameter header refers to setup_data at an absolute address,
and each setup_data refers to the next setup_data at an absolute address
too. Currently QEMU simply puts the setup_datas right after the kernel
image, and since the kernel_image is loaded at prot_addr -- a fixed
address knowable to QEMU apriori -- the setup_data absolute address
winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.

This mostly works fine, so long as the kernel image really is loaded at
prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
generally EFI doesn't give a good way of predicting where it's going to
load the kernel. So when it loads it at some address != prot_addr, the
absolute addresses in setup_data now point somewhere bogus, causing
crashes when EFI stub tries to follow the next link.

Fix this by placing setup_data at some fixed place in memory, not as
part of the kernel image, and then pointing the setup_data absolute
address to that fixed place in memory. This way, even if OVMF or other
chains relocate the kernel image, the boot parameter still points to the
correct absolute address.

For this, an unused part of the hardware mapped area is used, which
isn't used by anything else.

Fixes: 3cbeb52467 ("hw/i386: add device tree support")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Laszlo Ersek <lersek@redhat.com>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/x86.c | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 050eedc0c8..3affef3277 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -773,10 +773,10 @@ void x86_load_linux(X86MachineState *x86ms,
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size, setup_data_offset;
+    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
     uint32_t initrd_max;
-    uint8_t header[8192], *setup, *kernel;
-    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
+    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
+    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base;
     FILE *f;
     char *vmode;
     MachineState *machine = MACHINE(x86ms);
@@ -899,6 +899,8 @@ void x86_load_linux(X86MachineState *x86ms,
         cmdline_addr = 0x20000;
         prot_addr    = 0x100000;
     }
+    /* Nothing else uses this part of the hardware mapped region */
+    setup_data_base = 0xfffff - 0x1000;
 
     /* highest address for loading the initrd */
     if (protocol >= 0x20c &&
@@ -1062,34 +1064,35 @@ void x86_load_linux(X86MachineState *x86ms,
             exit(1);
         }
 
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
-        kernel = g_realloc(kernel, kernel_size);
-
-
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
-
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
     if (!legacy_no_rng_seed) {
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
-        kernel = g_realloc(kernel, kernel_size);
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
     }
 
-    /* Offset 0x250 is a pointer to the first setup_data link. */
-    stq_p(header + 0x250, first_setup_data);
+    if (first_setup_data && !sev_enabled()) {
+            /* Offset 0x250 is a pointer to the first setup_data link. */
+            stq_p(header + 0x250, first_setup_data);
+            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
+                         setup_data_base, NULL, NULL, NULL, NULL, false);
+    }
 
     /*
      * If we're starting an encrypted VM, it will be OVMF based, which uses the
-- 
2.35.1


Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Paolo Bonzini 6 days, 16 hours ago
On 8/5/22 01:04, Jason A. Donenfeld wrote:
> +    /* Nothing else uses this part of the hardware mapped region */
> +    setup_data_base = 0xfffff - 0x1000;

Isn't this where the BIOS lives?  I don't think this works.

Does it work to place setup_data at the end of the cmdline file instead 
of having it at the end of the kernel file?  This way the first item 
will be at 0x20000 + cmdline_size.

Paolo
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 6 days, 11 hours ago
Hi Paolo,

On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > +    /* Nothing else uses this part of the hardware mapped region */
> > +    setup_data_base = 0xfffff - 0x1000;
> 
> Isn't this where the BIOS lives?  I don't think this works.

That's the segment dedicated to ROM and hardware mapped addresses. So
that's a place to put ROM material. No actual software will use it.

Jason
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 2 days, 12 hours ago
Hey Paolo,

On Fri, Aug 05, 2022 at 02:47:27PM +0200, Jason A. Donenfeld wrote:
> Hi Paolo,
> 
> On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> > On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > > +    /* Nothing else uses this part of the hardware mapped region */
> > > +    setup_data_base = 0xfffff - 0x1000;
> > 
> > Isn't this where the BIOS lives?  I don't think this works.
> 
> That's the segment dedicated to ROM and hardware mapped addresses. So
> that's a place to put ROM material. No actual software will use it.
> 
> Jason

Unless I've misread the thread, I don't think there are any remaining
objections, right? Can we try merging this and seeing if it fixes the
issue for good?

Jason
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Michael S. Tsirkin 2 days, 10 hours ago
On Tue, Aug 09, 2022 at 02:17:23PM +0200, Jason A. Donenfeld wrote:
> Hey Paolo,
> 
> On Fri, Aug 05, 2022 at 02:47:27PM +0200, Jason A. Donenfeld wrote:
> > Hi Paolo,
> > 
> > On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> > > On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > > > +    /* Nothing else uses this part of the hardware mapped region */
> > > > +    setup_data_base = 0xfffff - 0x1000;
> > > 
> > > Isn't this where the BIOS lives?  I don't think this works.
> > 
> > That's the segment dedicated to ROM and hardware mapped addresses. So
> > that's a place to put ROM material. No actual software will use it.
> > 
> > Jason
> 
> Unless I've misread the thread, I don't think there are any remaining
> objections, right? Can we try merging this and seeing if it fixes the
> issue for good?
> 
> Jason

Laszlo commented here:
https://lore.kernel.org/r/fa0601e4-acf5-0ce8-9277-4d90d046b53e%40redhat.com

-- 
MST
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Daniel P. Berrangé 2 days, 10 hours ago
On Tue, Aug 09, 2022 at 10:07:44AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 09, 2022 at 02:17:23PM +0200, Jason A. Donenfeld wrote:
> > Hey Paolo,
> > 
> > On Fri, Aug 05, 2022 at 02:47:27PM +0200, Jason A. Donenfeld wrote:
> > > Hi Paolo,
> > > 
> > > On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
> > > > On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > > > > +    /* Nothing else uses this part of the hardware mapped region */
> > > > > +    setup_data_base = 0xfffff - 0x1000;
> > > > 
> > > > Isn't this where the BIOS lives?  I don't think this works.
> > > 
> > > That's the segment dedicated to ROM and hardware mapped addresses. So
> > > that's a place to put ROM material. No actual software will use it.
> > > 
> > > Jason
> > 
> > Unless I've misread the thread, I don't think there are any remaining
> > objections, right? Can we try merging this and seeing if it fixes the
> > issue for good?
> > 
> > Jason
> 
> Laszlo commented here:
> https://lore.kernel.org/r/fa0601e4-acf5-0ce8-9277-4d90d046b53e%40redhat.com

It is 7.1.0 rc2 date today, which leaves ideally only one rc remaining
before GA release.

The discussion still taking place in this thread does not fill me with
confidence that we're going to have a *well tested* solution before GA.
Even if we agree on a patch, are we really going to have confidence
in it being reliable if we've only got a week of testing ?

IMHO we're at the point where we should just disable the RNG feature
for 7.1.0, and gives ourselves time to come up with a solution in 7.2.0
that can be properly tested without the time pressure of release deadlines.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 6 days, 11 hours ago
On 08/05/22 14:47, Jason A. Donenfeld wrote:
> Hi Paolo,
> 
> On Fri, Aug 05, 2022 at 10:10:02AM +0200, Paolo Bonzini wrote:
>> On 8/5/22 01:04, Jason A. Donenfeld wrote:
>>> +    /* Nothing else uses this part of the hardware mapped region */
>>> +    setup_data_base = 0xfffff - 0x1000;
>>
>> Isn't this where the BIOS lives?  I don't think this works.
> 
> That's the segment dedicated to ROM and hardware mapped addresses. So
> that's a place to put ROM material. No actual software will use it.

... accordingly (I think), when the guest tries to read it, it will see the ROM MemoryRegion that QEMU places there, not RAM contents.

"info mtree" QEMU monitor command output (excerpt), while OVMF is in the Boot Device Selection phase (well, I left it waiting in the Setup TUI):

address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000a0000-00000000000affff (prio 2, ram): alias vga.chain4 @vga.vram 0000000000000000-000000000000ffff
      00000000000a0000-00000000000bffff (prio 1, i/o): vga-lowmem
      00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
      00000000000e0000-00000000000fffff (prio 1, rom): isa-bios

flat view ("info mtree -f"):

FlatView #1
 AS "memory", root: system
 AS "cpu-memory-0", root: system
 AS "cpu-memory-1", root: system
 AS "cpu-memory-2", root: system
 AS "cpu-memory-3", root: system
 AS "mch", root: bus master container
 AS "ICH9-LPC", root: bus master container
 AS "ich9-ahci", root: bus master container
 AS "ICH9-SMB", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "pcie-root-port", root: bus master container
 AS "qemu-xhci", root: bus master container
 AS "virtio-scsi-pci", root: bus master container
 AS "virtio-serial-pci", root: bus master container
 AS "virtio-net-pci", root: bus master container
 AS "VGA", root: bus master container
 AS "virtio-balloon-pci", root: bus master container
 AS "virtio-rng-pci", root: bus master container
 Root memory region: system
  0000000000000000-000000000002ffff (prio 0, ram): pc.ram KVM
  0000000000030000-000000000004ffff (prio 1, i/o): smbase-blackhole
  0000000000050000-000000000009ffff (prio 0, ram): pc.ram @0000000000050000 KVM
  00000000000a0000-00000000000affff (prio 1, ram): vga.vram KVM
  00000000000b0000-00000000000bffff (prio 1, i/o): vga-lowmem @0000000000010000
  00000000000c0000-00000000000c3fff (prio 0, rom): pc.ram @00000000000c0000 KVM
  00000000000c4000-00000000000dffff (prio 1, rom): pc.rom @0000000000004000 KVM
  00000000000e0000-00000000000fffff (prio 1, rom): isa-bios KVM


Laszlo
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Ard Biesheuvel 6 days, 13 hours ago
On Fri, 5 Aug 2022 at 10:10, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > +    /* Nothing else uses this part of the hardware mapped region */
> > +    setup_data_base = 0xfffff - 0x1000;
>
> Isn't this where the BIOS lives?  I don't think this works.
>
> Does it work to place setup_data at the end of the cmdline file instead
> of having it at the end of the kernel file?  This way the first item
> will be at 0x20000 + cmdline_size.
>

Does QEMU always allocate the command line statically like that?
AFAIK, OVMF never accesses that memory to read the command line, it
uses fw_cfg to copy it into a buffer it allocates itself. And I guess
that implies that this region could be clobbered by OVMF unless it is
told to preserve it.
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Paolo Bonzini 6 days, 7 hours ago
On 8/5/22 13:08, Ard Biesheuvel wrote:
>>
>> Does it work to place setup_data at the end of the cmdline file instead
>> of having it at the end of the kernel file?  This way the first item
>> will be at 0x20000 + cmdline_size.
>>
> Does QEMU always allocate the command line statically like that?
> AFAIK, OVMF never accesses that memory to read the command line, it
> uses fw_cfg to copy it into a buffer it allocates itself. And I guess
> that implies that this region could be clobbered by OVMF unless it is
> told to preserve it.

No it's not. :(  It also goes to gBS->AllocatePages in the end.

At this point it seems to me that without extra changes the whole 
setup_data concept is dead on arrival for OVMF.  In principle there's no 
reason why the individual setup_data items couldn't include interior 
pointers, meaning that the setup_data _has_ to be at the address 
provided in fw_cfg by QEMU.

One way to "fix" it would be for OVMF to overwrite the pointer to the 
head of the list, so that the kernel ignores the setup data provided by 
QEMU. Another way would be to put it in the command line fw_cfg blob and 
teach OVMF to use a fixed address for the command line.  Both are ugly, 
and both are also broken for new QEMU / old OVMF.

In any case, I don't think this should be fixed so close to the release. 
  We have two possibilities:

1) if we believe "build setup_data in QEMU" is a feasible design that 
only needs more yak shaving, we can keep the code in, but disabled by 
default, and sort it out in 7.2.

2) if we go for an alternative design, it needs to be reverted.  For 
example the randomness could be in _another_ fw_cfg file, and the 
linuxboot DMA can patch it in the setup_data.


With (2) the OVMF breakage would be limited to -dtb, which more or less 
nobody cares about, and we can just look the other way.

Paolo
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Michael S. Tsirkin 2 days, 15 hours ago
On Fri, Aug 05, 2022 at 07:29:29PM +0200, Paolo Bonzini wrote:
> On 8/5/22 13:08, Ard Biesheuvel wrote:
> > > 
> > > Does it work to place setup_data at the end of the cmdline file instead
> > > of having it at the end of the kernel file?  This way the first item
> > > will be at 0x20000 + cmdline_size.
> > > 
> > Does QEMU always allocate the command line statically like that?
> > AFAIK, OVMF never accesses that memory to read the command line, it
> > uses fw_cfg to copy it into a buffer it allocates itself. And I guess
> > that implies that this region could be clobbered by OVMF unless it is
> > told to preserve it.
> 
> No it's not. :(  It also goes to gBS->AllocatePages in the end.
> 
> At this point it seems to me that without extra changes the whole setup_data
> concept is dead on arrival for OVMF.  In principle there's no reason why the
> individual setup_data items couldn't include interior pointers, meaning that
> the setup_data _has_ to be at the address provided in fw_cfg by QEMU.
> 
> One way to "fix" it would be for OVMF to overwrite the pointer to the head
> of the list, so that the kernel ignores the setup data provided by QEMU.
> Another way would be to put it in the command line fw_cfg blob and teach
> OVMF to use a fixed address for the command line.  Both are ugly, and both
> are also broken for new QEMU / old OVMF.
> 
> In any case, I don't think this should be fixed so close to the release.  We
> have two possibilities:
> 
> 1) if we believe "build setup_data in QEMU" is a feasible design that only
> needs more yak shaving, we can keep the code in, but disabled by default,
> and sort it out in 7.2.
> 
> 2) if we go for an alternative design, it needs to be reverted.  For example
> the randomness could be in _another_ fw_cfg file, and the linuxboot DMA can
> patch it in the setup_data.
> 
> 
> With (2) the OVMF breakage would be limited to -dtb, which more or less
> nobody cares about, and we can just look the other way.
> 
> Paolo


So IIUC you retract your pc: add property for Linux setup_data random
number seed then? It's neither of the two options above.

-- 
MST
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Paolo Bonzini 2 days, 10 hours ago
On 8/9/22 11:17, Michael S. Tsirkin wrote:
>> 1) if we believe "build setup_data in QEMU" is a feasible design that only
>> needs more yak shaving, we can keep the code in, but disabled by default,
>> and sort it out in 7.2.
>>
>> 2) if we go for an alternative design, it needs to be reverted.  For example
>> the randomness could be in _another_ fw_cfg file, and the linuxboot DMA can
>> patch it in the setup_data.
>>
>> With (2) the OVMF breakage would be limited to -dtb, which more or less
>> nobody cares about, and we can just look the other way.
> 
> So IIUC you retract your pc: add property for Linux setup_data random
> number seed then? It's neither of the two options above.

That one would be a base for (1).

Another choice (3) is to put a pointer to the first setup_data in a new 
fw_cfg entry, and let the option ROMs place it in the header.

In any case, as Laszlo said this [PATCH v3] does not work because 
0xf0000 is mapped as ROM (and if it worked, it would have the same 
problem as the first 640K).

Paolo
Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory
Posted by Ard Biesheuvel 6 days, 6 hours ago
On Fri, 5 Aug 2022 at 19:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/5/22 13:08, Ard Biesheuvel wrote:
> >>
> >> Does it work to place setup_data at the end of the cmdline file instead
> >> of having it at the end of the kernel file?  This way the first item
> >> will be at 0x20000 + cmdline_size.
> >>
> > Does QEMU always allocate the command line statically like that?
> > AFAIK, OVMF never accesses that memory to read the command line, it
> > uses fw_cfg to copy it into a buffer it allocates itself. And I guess
> > that implies that this region could be clobbered by OVMF unless it is
> > told to preserve it.
>
> No it's not. :(  It also goes to gBS->AllocatePages in the end.
>
> At this point it seems to me that without extra changes the whole
> setup_data concept is dead on arrival for OVMF.  In principle there's no
> reason why the individual setup_data items couldn't include interior
> pointers, meaning that the setup_data _has_ to be at the address
> provided in fw_cfg by QEMU.
>

AIUI, the setup_data nodes are appended at the end, so they are not
covered by the setup_data fw_cfg file but the kernel one.

> One way to "fix" it would be for OVMF to overwrite the pointer to the
> head of the list, so that the kernel ignores the setup data provided by
> QEMU. Another way would be to put it in the command line fw_cfg blob and
> teach OVMF to use a fixed address for the command line.  Both are ugly,
> and both are also broken for new QEMU / old OVMF.
>

This is the 'pure EFI' boot path in OVMF, which means that the
firmware does not rely on definitions of struct bootparams or struct
setup_header at all. Introducing that dependency just for this is
something I'd really prefer to avoid.

> In any case, I don't think this should be fixed so close to the release.
>   We have two possibilities:
>
> 1) if we believe "build setup_data in QEMU" is a feasible design that
> only needs more yak shaving, we can keep the code in, but disabled by
> default, and sort it out in 7.2.
>

As I argued before, conflating the 'file' representation with the
'memory' representation like this is fundamentally flawed. fw_cfg
happily DMA's those files anywhere you like, so their contents should
not be position dependent like this.

So Jason's fix gets us halfway there, although we now pass information
to the kernel that is not covered by signatures or measurements, where
the setup_data pointer itself is. This means you can replace a single
SETUP_RNG_SEED node in memory with a whole set of SETUP_xxx nodes that
might be rigged to manipulate the boot in a way that measured boot
won't detect.

This is perhaps a bit of a stretch, and arguably only a problem if
secure or measured boot are enabled to begin with, in which case we
could impose additional policy on the use of setup_data. But still ...

> 2) if we go for an alternative design, it needs to be reverted.  For
> example the randomness could be in _another_ fw_cfg file, and the
> linuxboot DMA can patch it in the setup_data.
>
>
> With (2) the OVMF breakage would be limited to -dtb, which more or less
> nobody cares about, and we can just look the other way.
>
> Paolo
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 1 week ago
On 08/04/22 15:56, Laszlo Ersek wrote:
> On 08/04/22 15:28, Jason A. Donenfeld wrote:
>> On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
>>>
>>> On 08/04/22 14:16, Ard Biesheuvel wrote:
>>>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>>>
>>>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
>>>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
>>>>>>> is arguably correct by design, as the QEMU <-> firmware interface is
>>>>>>> supposed to be arbitrarily pluggable for any firmware implementation
>>>>>>> not  limited to merely UEFI + seabios.
>>>>>>
>>>>>> Indeed, I agree with this.
>>>>>>
>>>>>>>
>>>>>>>> For now I suggest either reverting the original patch, or at least not
>>>>>>>> enabling the knob by default for any machine types. In particular, when
>>>>>>>> using MicroVM, the user must leave the knob disabled when direct booting
>>>>>>>> a kernel on OVMF, and the user may or may not enable the knob when
>>>>>>>> direct booting a kernel on SeaBIOS.
>>>>>>>
>>>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
>>>>>>> available automatically.
>>>>>>
>>>>>> Yes, adding a knob is absolutely out of the question.
>>>>>>
>>>>>> It also doesn't actually solve the problem: this triggers when QEMU
>>>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
>>>>>> isn't new.
>>>>>
>>>>> In the other thread I also mentioned that this RNG Seed addition has
>>>>> caused a bug with AMD SEV too, making boot measurement attestation
>>>>> fail because the kernel blob passed to the firmware no longer matches
>>>>> what the tenant expects, due to the injected seed.
>>>>>
>>>>
>>>> I was actually expecting this to be an issue in the
>>>> signing/attestation department as well, and you just confirmed my
>>>> suspicion.
>>>>
>>>> But does this mean that populating the setup_data pointer is out of
>>>> the question altogether? Or only that putting the setup_data linked
>>>> list nodes inside the image is a problem?
>>>
>>> QEMU already has to inject a whole bunch of stuff into confidential
>>> computing guests. The way it's done (IIRC) is that the non-compressed,
>>> trailing part of pflash (basically where the reset vector code lives
>>> too) is populated at OVMF build time with a chain of GUID-ed structures,
>>> and fields of those structures are filled in (at OVMF build time) from
>>> various fixed PCDs. The fixed PCDs in turn are populated from the FD
>>> files, using various MEMFD regions. When QEMU launches the guest, it can
>>> parse the GPAs from the on-disk pflash image (traversing the list of
>>> GUID-ed structs), at which addresses the guest firmware will then expect
>>> the various crypto artifacts to be injected.
>>>
>>> The point is that "who's in control" is reversed. The firmware exposes
>>> (at build time) at what GPAs it can accept data injections, and QEMU
>>> follows that. Of course the firmware ensures that nothing else in the
>>> firmware will try to own those GPAs.
>>>
>>> The only thing that needed to be hard-coded when this feature was
>>> introduced was the "entry point", that is, the flash offset at which
>>> QEMU starts the GUID-ed structure traversal.
>>>
>>> AMD and IBM developers can likely much better describe this mechanism,
>>> as I've not dealt with it in over a year. The QEMU side code is in
>>> "hw/i386/pc_sysfw_ovmf.c".
>>>
>>> We can make setup_data chaining work with OVMF, but the whole chain
>>> should be located in a GPA range that OVMF dictates.
>>
>> It sounds like what you describe is pretty OVMF-specific though,
>> right?
> 
> Yes, completely OVMF specific.
> 
>> Do we want to tie things together so tightly like that?
> 
> It depends on what the goal is:
> 
> - do we want setup_data chaining work generally?
> 
> - or do we want only the random seed injection to stop crashing OVMF guests?
> 
> In the latter case, we only need to teach QEMU to reliably recognize
> OVMF from the on-disk firmware binary (regardless of pflash use), and
> then not expose any setup_data in guest RAM. The UEFI guest kernel will
> use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no
> particular seed otherwise.
> 
> (This is the "papering over" case, which I don't think is necessarily
> wrong; only it should be robust. I thought pflash would be usable for
> that; turns out it isn't. Thus, we could perhaps try identifying a
> firmware binary as "OVMF" by contents.)
> 
> In the former case though, I think the "tight coupling" is unavoidable.
> As long as setup_data is needed by the kernel extremely early, no
> "firmware hiding" or "firmware independence" is in place yet.
> 
>> Given we only need 48 bytes or so, isn't there a more subtle place we
>> could just throw this in ram that doesn't need such complex
>> coordination?
> 
> These tricks add up and go wrong after a while. The pedantic
> reservations in the firmware have proved necessary.
> 
> IIUC, with v2, the setup_data_base address would (most frequently) be 96
> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
> not reserve away the area, UefiCpuPkg or other drivers might allocate an
> overlapping chunk, even if only temporarily. That might not break the
> firmware, but it could overwrite the random seed. If the guest kernel
> somehow consumed that seed (rather than getting one from
> EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?),
> that could potentially destroy the guest's security, without any
> obvious/immediate signs.

I must add however that I feel very much out of place making
authoritative-sounding statements like this. The above is my honest
opinion, and my (unpleasant) writing style, but it's just that: my
opinion & style. So much has happened in edk2 and QEMU since last summer
(without me following them closely) that I feel uncomfortable speaking
on them. Take whatever I say with a grain of salt, and definitely
research all options.

Laszlo


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Daniel P. Berrangé 1 week ago
On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote:
> On 08/04/22 15:28, Jason A. Donenfeld wrote:
> > On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
> >>
> >> On 08/04/22 14:16, Ard Biesheuvel wrote:
> >>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>>>
> >>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> >>>>> Hi Daniel,
> >>>>>
> >>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> >>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> >>>>>> is arguably correct by design, as the QEMU <-> firmware interface is
> >>>>>> supposed to be arbitrarily pluggable for any firmware implementation
> >>>>>> not  limited to merely UEFI + seabios.
> >>>>>
> >>>>> Indeed, I agree with this.
> >>>>>
> >>>>>>
> >>>>>>> For now I suggest either reverting the original patch, or at least not
> >>>>>>> enabling the knob by default for any machine types. In particular, when
> >>>>>>> using MicroVM, the user must leave the knob disabled when direct booting
> >>>>>>> a kernel on OVMF, and the user may or may not enable the knob when
> >>>>>>> direct booting a kernel on SeaBIOS.
> >>>>>>
> >>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
> >>>>>> available automatically.
> >>>>>
> >>>>> Yes, adding a knob is absolutely out of the question.
> >>>>>
> >>>>> It also doesn't actually solve the problem: this triggers when QEMU
> >>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
> >>>>> isn't new.
> >>>>
> >>>> In the other thread I also mentioned that this RNG Seed addition has
> >>>> caused a bug with AMD SEV too, making boot measurement attestation
> >>>> fail because the kernel blob passed to the firmware no longer matches
> >>>> what the tenant expects, due to the injected seed.
> >>>>
> >>>
> >>> I was actually expecting this to be an issue in the
> >>> signing/attestation department as well, and you just confirmed my
> >>> suspicion.
> >>>
> >>> But does this mean that populating the setup_data pointer is out of
> >>> the question altogether? Or only that putting the setup_data linked
> >>> list nodes inside the image is a problem?
> >>
> >> QEMU already has to inject a whole bunch of stuff into confidential
> >> computing guests. The way it's done (IIRC) is that the non-compressed,
> >> trailing part of pflash (basically where the reset vector code lives
> >> too) is populated at OVMF build time with a chain of GUID-ed structures,
> >> and fields of those structures are filled in (at OVMF build time) from
> >> various fixed PCDs. The fixed PCDs in turn are populated from the FD
> >> files, using various MEMFD regions. When QEMU launches the guest, it can
> >> parse the GPAs from the on-disk pflash image (traversing the list of
> >> GUID-ed structs), at which addresses the guest firmware will then expect
> >> the various crypto artifacts to be injected.
> >>
> >> The point is that "who's in control" is reversed. The firmware exposes
> >> (at build time) at what GPAs it can accept data injections, and QEMU
> >> follows that. Of course the firmware ensures that nothing else in the
> >> firmware will try to own those GPAs.
> >>
> >> The only thing that needed to be hard-coded when this feature was
> >> introduced was the "entry point", that is, the flash offset at which
> >> QEMU starts the GUID-ed structure traversal.
> >>
> >> AMD and IBM developers can likely much better describe this mechanism,
> >> as I've not dealt with it in over a year. The QEMU side code is in
> >> "hw/i386/pc_sysfw_ovmf.c".
> >>
> >> We can make setup_data chaining work with OVMF, but the whole chain
> >> should be located in a GPA range that OVMF dictates.
> > 
> > It sounds like what you describe is pretty OVMF-specific though,
> > right?
> 
> Yes, completely OVMF specific.
> 
> > Do we want to tie things together so tightly like that?
> 
> It depends on what the goal is:
> 
> - do we want setup_data chaining work generally?
> 
> - or do we want only the random seed injection to stop crashing OVMF guests?
> 
> In the latter case, we only need to teach QEMU to reliably recognize
> OVMF from the on-disk firmware binary (regardless of pflash use), and
> then not expose any setup_data in guest RAM. The UEFI guest kernel will
> use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no
> particular seed otherwise.
> 
> (This is the "papering over" case, which I don't think is necessarily
> wrong; only it should be robust. I thought pflash would be usable for
> that; turns out it isn't. Thus, we could perhaps try identifying a
> firmware binary as "OVMF" by contents.)
> 
> In the former case though, I think the "tight coupling" is unavoidable.
> As long as setup_data is needed by the kernel extremely early, no
> "firmware hiding" or "firmware independence" is in place yet.
> 
> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> These tricks add up and go wrong after a while. The pedantic
> reservations in the firmware have proved necessary.
> 
> IIUC, with v2, the setup_data_base address would (most frequently) be 96
> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
> not reserve away the area, UefiCpuPkg or other drivers might allocate an
> overlapping chunk, even if only temporarily. That might not break the
> firmware, but it could overwrite the random seed. If the guest kernel
> somehow consumed that seed (rather than getting one from
> EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?),
> that could potentially destroy the guest's security, without any
> obvious/immediate signs.

The guest kernel shouldn't load this random seed even if it is
provided by the hypervisor, when running in a confidential
virutalization environment like SEV/TDX or equiv for other arches.
It can't trust the hypervisor to have actually used random data
for populating the seed value. On x86 it would have to rely on
the hardware RDRAND/RDSEED for an initial seed.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
Hi Ard,

On Thu, Aug 4, 2022 at 2:16 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > > Hi Daniel,
> > >
> > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > > supposed to be arbitrarily pluggable for any firmware implementation
> > > > not  limited to merely UEFI + seabios.
> > >
> > > Indeed, I agree with this.
> > >
> > > >
> > > > > For now I suggest either reverting the original patch, or at least not
> > > > > enabling the knob by default for any machine types. In particular, when
> > > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > > direct booting a kernel on SeaBIOS.
> > > >
> > > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > > available automatically.
> > >
> > > Yes, adding a knob is absolutely out of the question.
> > >
> > > It also doesn't actually solve the problem: this triggers when QEMU
> > > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > > isn't new.
> >
> > In the other thread I also mentioned that this RNG Seed addition has
> > caused a bug with AMD SEV too, making boot measurement attestation
> > fail because the kernel blob passed to the firmware no longer matches
> > what the tenant expects, due to the injected seed.
> >
>
> I was actually expecting this to be an issue in the
> signing/attestation department as well, and you just confirmed my
> suspicion.
>
> But does this mean that populating the setup_data pointer is out of
> the question altogether? Or only that putting the setup_data linked
> list nodes inside the image is a problem?

If you look at the v2 patch, populating boot_param->setup_data winds
up being a fixed value. So even if that part was a problem (though I
don't think it is), it won't be with the v2 patch, since it's always
the same.

Jason
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
On Thu, Aug 4, 2022 at 2:17 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Ard,
>
> On Thu, Aug 4, 2022 at 2:16 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > > > Hi Daniel,
> > > >
> > > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > > > supposed to be arbitrarily pluggable for any firmware implementation
> > > > > not  limited to merely UEFI + seabios.
> > > >
> > > > Indeed, I agree with this.
> > > >
> > > > >
> > > > > > For now I suggest either reverting the original patch, or at least not
> > > > > > enabling the knob by default for any machine types. In particular, when
> > > > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > > > direct booting a kernel on SeaBIOS.
> > > > >
> > > > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > > > available automatically.
> > > >
> > > > Yes, adding a knob is absolutely out of the question.
> > > >
> > > > It also doesn't actually solve the problem: this triggers when QEMU
> > > > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > > > isn't new.
> > >
> > > In the other thread I also mentioned that this RNG Seed addition has
> > > caused a bug with AMD SEV too, making boot measurement attestation
> > > fail because the kernel blob passed to the firmware no longer matches
> > > what the tenant expects, due to the injected seed.
> > >
> >
> > I was actually expecting this to be an issue in the
> > signing/attestation department as well, and you just confirmed my
> > suspicion.
> >
> > But does this mean that populating the setup_data pointer is out of
> > the question altogether? Or only that putting the setup_data linked
> > list nodes inside the image is a problem?
>
> If you look at the v2 patch, populating boot_param->setup_data winds
> up being a fixed value. So even if that part was a problem (though I
> don't think it is), it won't be with the v2 patch, since it's always
> the same.

Actually, `setup` isn't even modified if SEV is being used anyway. So
really, the approach of this v2 -- of not modifying the kernel image
-- should fix that issue, no matter what.

Jason
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Ard Biesheuvel 1 week ago
On Thu, 4 Aug 2022 at 11:25, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
> > On 08/04/22 09:03, Michael S. Tsirkin wrote:
> > > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> > >> The boot parameter header refers to setup_data at an absolute address,
> > >> and each setup_data refers to the next setup_data at an absolute address
> > >> too. Currently QEMU simply puts the setup_datas right after the kernel
> > >> image, and since the kernel_image is loaded at prot_addr -- a fixed
> > >> address knowable to QEMU apriori -- the setup_data absolute address
> > >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> > >>
> > >> This mostly works fine, so long as the kernel image really is loaded at
> > >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> > >> generally EFI doesn't give a good way of predicting where it's going to
> > >> load the kernel. So when it loads it at some address != prot_addr, the
> > >> absolute addresses in setup_data now point somewhere bogus, causing
> > >> crashes when EFI stub tries to follow the next link.
> > >>
> > >> Fix this by placing setup_data at some fixed place in memory, relative
> > >> to real_addr, not as part of the kernel image, and then pointing the
> > >> setup_data absolute address to that fixed place in memory. This way,
> > >> even if OVMF or other chains relocate the kernel image, the boot
> > >> parameter still points to the correct absolute address.
> > >>
> > >> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> > >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> > >> Cc: Richard Henderson <richard.henderson@linaro.org>
> > >> Cc: Peter Maydell <peter.maydell@linaro.org>
> > >> Cc: Michael S. Tsirkin <mst@redhat.com>
> > >> Cc: Daniel P. Berrangé <berrange@redhat.com>
> > >> Cc: Gerd Hoffmann <kraxel@redhat.com>
> > >> Cc: Ard Biesheuvel <ardb@kernel.org>
> > >> Cc: linux-efi@vger.kernel.org
> > >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > >
> > > Didn't read the patch yet.
> > > Adding Laszlo.
> >
> > As I said in
> > <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
> > I don't believe that the setup_data chaining described in
> > <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
> > for UEFI guests at all, with QEMU pre-populating the links with GPAs.
> >
> > However, rather than introducing a new info channel, or reusing an
> > existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
> > for the sake of this feature, I suggest simply disabling this feature
> > for UEFI guests. setup_data chaining has not been necessary for UEFI
> > guests for years (this is the first time I've heard about it in more
> > than a decade), and the particular use case (provide guests with good
> > random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
> >
> > ... Now, here's my problem: microvm, and Xen.
> >
> > As far as I can tell, QEMU can determine (it already does determine)
> > whether the guest uses UEFI or not, for the "pc" and "q35" machine
> > types. But not for microvm or Xen!
> >
> > Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
> > derive that
> >
> >   pflash_cfi01_get_blk(pcms->flash[0])
> >
> > returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
> > is only valid after the inital part of pc_system_firmware_init() has run
> > ("Map legacy -drive if=pflash to machine properties"), but that is not a
> > problem, given the following call tree:
>
> I don't beleve that's a valid check anymore since Gerd introduced the
> ability to load UEFI via -bios, for UEFI builds without persistent
> variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )
>

I think there is a fundamental flaw in the QEMU logic where it adds
setup_data nodes to the *file* representation of the kernel image.

IOW, loading the kernei image at address x, creating setup_data nodes
in memory relative to x and then invoking the kernel image directly
(as kexec does as well, AIUI) is perfectly fine.

Managing a file system abstraction and a generic interface to load its
contents, and using it to load the kernel image anywhere in memory is
also fine, and OVMF -kernel relies on this.

It is the combination of the two that explodes, unsurprisingly. Making
inferences about which kind of firmware is invoking the file loading
interface, and gating this behavior accordingly just papers over the
problem.

Jason and I have been discussing this over IRC, and there are
essentially 3 places we could address this:
1) in the Linux/x86 EFI stub, which has a 'pure EFI' entrypoint and a
'EFI handover protocol' entrypoint, and we could simply wipe the
setup_data pointer in the former;
2) in OVMF's kernel loader, where we could 'fix up' the setup_data field
3) in QEMU, which [as I laid out above] does something it shouldn't be doing.

I strongly object to 2), as it would involve teaching OVMF's 'pure
EFI' boot path about the intricacies of struct boot_params etc, which
defeats the purpose of having a generic EFI boot path (Note that much
of my work on the Linux/EFI subsystem over the past years has been
focused on getting rid of all the arch-specific hacks and layering
violations and making as much of the EFI code in Linux arch-agnostic
and adhere to EFI principles and APIs)

Given that neither SETUP_DTB nor SETUP_RNG_SEED are particularly
relevant for pure EFI boot, clearing setup_data in the pure EFI
entrypoint in the EFI stub is not unreasonable, but not very future
proof, given that it limits how we might use setup_data for other
purposes in the future (although one might argue we could just drop
code again from the stub when new uses of setup_data are introduced
into the kernel)

However, our conclusion was that it is really QEMU that is doing
something strange here, so IMHO, fixing QEMU is really the most
suitable approach.

-- 
Ard.
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 1 week ago
On 08/04/22 12:26, Ard Biesheuvel wrote:
> On Thu, 4 Aug 2022 at 11:25, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
>>> On 08/04/22 09:03, Michael S. Tsirkin wrote:
>>>> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
>>>>> The boot parameter header refers to setup_data at an absolute address,
>>>>> and each setup_data refers to the next setup_data at an absolute address
>>>>> too. Currently QEMU simply puts the setup_datas right after the kernel
>>>>> image, and since the kernel_image is loaded at prot_addr -- a fixed
>>>>> address knowable to QEMU apriori -- the setup_data absolute address
>>>>> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
>>>>>
>>>>> This mostly works fine, so long as the kernel image really is loaded at
>>>>> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
>>>>> generally EFI doesn't give a good way of predicting where it's going to
>>>>> load the kernel. So when it loads it at some address != prot_addr, the
>>>>> absolute addresses in setup_data now point somewhere bogus, causing
>>>>> crashes when EFI stub tries to follow the next link.
>>>>>
>>>>> Fix this by placing setup_data at some fixed place in memory, relative
>>>>> to real_addr, not as part of the kernel image, and then pointing the
>>>>> setup_data absolute address to that fixed place in memory. This way,
>>>>> even if OVMF or other chains relocate the kernel image, the boot
>>>>> parameter still points to the correct absolute address.
>>>>>
>>>>> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
>>>>> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>> Cc: Richard Henderson <richard.henderson@linaro.org>
>>>>> Cc: Peter Maydell <peter.maydell@linaro.org>
>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>> Cc: Daniel P. Berrangé <berrange@redhat.com>
>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
>>>>> Cc: linux-efi@vger.kernel.org
>>>>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>>>>
>>>> Didn't read the patch yet.
>>>> Adding Laszlo.
>>>
>>> As I said in
>>> <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
>>> I don't believe that the setup_data chaining described in
>>> <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
>>> for UEFI guests at all, with QEMU pre-populating the links with GPAs.
>>>
>>> However, rather than introducing a new info channel, or reusing an
>>> existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
>>> for the sake of this feature, I suggest simply disabling this feature
>>> for UEFI guests. setup_data chaining has not been necessary for UEFI
>>> guests for years (this is the first time I've heard about it in more
>>> than a decade), and the particular use case (provide guests with good
>>> random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
>>>
>>> ... Now, here's my problem: microvm, and Xen.
>>>
>>> As far as I can tell, QEMU can determine (it already does determine)
>>> whether the guest uses UEFI or not, for the "pc" and "q35" machine
>>> types. But not for microvm or Xen!
>>>
>>> Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
>>> derive that
>>>
>>>   pflash_cfi01_get_blk(pcms->flash[0])
>>>
>>> returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
>>> is only valid after the inital part of pc_system_firmware_init() has run
>>> ("Map legacy -drive if=pflash to machine properties"), but that is not a
>>> problem, given the following call tree:
>>
>> I don't beleve that's a valid check anymore since Gerd introduced the
>> ability to load UEFI via -bios, for UEFI builds without persistent
>> variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )
>>
> 
> I think there is a fundamental flaw in the QEMU logic where it adds
> setup_data nodes to the *file* representation of the kernel image.
> 
> IOW, loading the kernei image at address x, creating setup_data nodes
> in memory relative to x and then invoking the kernel image directly
> (as kexec does as well, AIUI) is perfectly fine.
> 
> Managing a file system abstraction and a generic interface to load its
> contents, and using it to load the kernel image anywhere in memory is
> also fine, and OVMF -kernel relies on this.
> 
> It is the combination of the two that explodes, unsurprisingly. Making
> inferences about which kind of firmware is invoking the file loading
> interface, and gating this behavior accordingly just papers over the
> problem.
> 
> Jason and I have been discussing this over IRC, and there are
> essentially 3 places we could address this:
> 1) in the Linux/x86 EFI stub, which has a 'pure EFI' entrypoint and a
> 'EFI handover protocol' entrypoint, and we could simply wipe the
> setup_data pointer in the former;
> 2) in OVMF's kernel loader, where we could 'fix up' the setup_data field
> 3) in QEMU, which [as I laid out above] does something it shouldn't be doing.
> 
> I strongly object to 2), as it would involve teaching OVMF's 'pure
> EFI' boot path about the intricacies of struct boot_params etc, which
> defeats the purpose of having a generic EFI boot path (Note that much
> of my work on the Linux/EFI subsystem over the past years has been
> focused on getting rid of all the arch-specific hacks and layering
> violations and making as much of the EFI code in Linux arch-agnostic
> and adhere to EFI principles and APIs)
> 
> Given that neither SETUP_DTB nor SETUP_RNG_SEED are particularly
> relevant for pure EFI boot, clearing setup_data in the pure EFI
> entrypoint in the EFI stub is not unreasonable, but not very future
> proof, given that it limits how we might use setup_data for other
> purposes in the future (although one might argue we could just drop
> code again from the stub when new uses of setup_data are introduced
> into the kernel)
> 
> However, our conclusion was that it is really QEMU that is doing
> something strange here, so IMHO, fixing QEMU is really the most
> suitable approach.
> 

I'm not sure if this can be solved arch-independently and
firmware-independently.

ACPI and SMBIOS look "independent" of arch and firmware if you squint,
but that's only because the specs describe the entry points for
different firmwares differently, thereby hiding the problem from the
rest of the world. Also because the various virtual firmwares receive
the ACPI and SMBIOS contents from the VMM(s) differently. However, ACPI
and SMBIOS are probably too late for the kernel to consume the random seed.

A different example for information that's needed really early is the
serial port's location (see "earlyprintk" e.g.). That gets passed on the
command line (and the usual values strongly differ between arm64 and
x864_64). But, I guess passing the random seed on the kernel cmdline is
considered insecure.

None of the existing info passing methods seem early enough, generic
enough, and secure enough (at the same time)...


Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
Hi Laszlo,

On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
> None of the existing info passing methods seem early enough, generic
> enough, and secure enough (at the same time)...

Can you look at the v2 patch? It seems to work on every configuration I
throw at it. Keep in mind that setup_data is only used very, very early.
I can think of a few other places to put it too, looking at the x86
memory map, that will survive long enough.

I think this might actually be a straightforwardly solvable problem if
you think about it more basically.

Jason
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Laszlo,
>
> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
> > None of the existing info passing methods seem early enough, generic
> > enough, and secure enough (at the same time)...
>
> Can you look at the v2 patch? It seems to work on every configuration I
> throw at it. Keep in mind that setup_data is only used very, very early.
> I can think of a few other places to put it too, looking at the x86
> memory map, that will survive long enough.
>
> I think this might actually be a straightforwardly solvable problem if
> you think about it more basically.

And just to put things in perspective here... We only need like 48
bytes or something at some easy fixed address. That's not much. That's
*got* to be a fairly tractable problem. If v2 has issues, I can't see
why there wouldn't be a different easy place to put a meger 48 bytes
of stuff that then is allowed to be wiped out after early boot.

Jason
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 1 week ago
On 08/04/22 14:47, Jason A. Donenfeld wrote:
> On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>
>> Hi Laszlo,
>>
>> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
>>> None of the existing info passing methods seem early enough, generic
>>> enough, and secure enough (at the same time)...
>>
>> Can you look at the v2 patch? It seems to work on every configuration I
>> throw at it. Keep in mind that setup_data is only used very, very early.
>> I can think of a few other places to put it too, looking at the x86
>> memory map, that will survive long enough.
>>
>> I think this might actually be a straightforwardly solvable problem if
>> you think about it more basically.
> 
> And just to put things in perspective here... We only need like 48
> bytes or something at some easy fixed address. That's not much. That's
> *got* to be a fairly tractable problem. If v2 has issues, I can't see
> why there wouldn't be a different easy place to put a meger 48 bytes
> of stuff that then is allowed to be wiped out after early boot.

I've looked at v2. It still relies on passing information from QEMU to
the guest kernel through guest RAM such that the whole firmware
execution takes place in-between, without the firmware knowing anything
about that particular area -- effectively treating it as free system
RAM. Such exceptions are time bombs.

We *have* used hard-coded addresses, sometimes they are unavoidable, but
then they are open-coded in both QEMU and the firmware, and some early
part of the firmware takes care to either move the data to a "safe"
place, or to cover it in-place with a kind of reservation that prevents
other parts of the firmware from trampling over it. I've debugged
mistakes (memory corruption) when such reservation was forgotten; it's
not fun.

In short, I have nothing against the QEMU patch, but then the current
OvmfPkg maintainers should accept a patch for the firmware too, for
protecting the area from later firmware components, as early as possible.

Laszlo
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Laszlo Ersek 1 week ago
On 08/04/22 15:16, Laszlo Ersek wrote:
> On 08/04/22 14:47, Jason A. Donenfeld wrote:
>> On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>>
>>> Hi Laszlo,
>>>
>>> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
>>>> None of the existing info passing methods seem early enough, generic
>>>> enough, and secure enough (at the same time)...
>>>
>>> Can you look at the v2 patch? It seems to work on every configuration I
>>> throw at it. Keep in mind that setup_data is only used very, very early.
>>> I can think of a few other places to put it too, looking at the x86
>>> memory map, that will survive long enough.
>>>
>>> I think this might actually be a straightforwardly solvable problem if
>>> you think about it more basically.
>>
>> And just to put things in perspective here... We only need like 48
>> bytes or something at some easy fixed address. That's not much. That's
>> *got* to be a fairly tractable problem. If v2 has issues, I can't see
>> why there wouldn't be a different easy place to put a meger 48 bytes
>> of stuff that then is allowed to be wiped out after early boot.
> 
> I've looked at v2. It still relies on passing information from QEMU to
> the guest kernel through guest RAM such that the whole firmware
> execution takes place in-between, without the firmware knowing anything
> about that particular area -- effectively treating it as free system
> RAM. Such exceptions are time bombs.
> 
> We *have* used hard-coded addresses, sometimes they are unavoidable, but
> then they are open-coded in both QEMU and the firmware, and some early
> part of the firmware takes care to either move the data to a "safe"
> place, or to cover it in-place with a kind of reservation that prevents
> other parts of the firmware from trampling over it. I've debugged
> mistakes (memory corruption) when such reservation was forgotten; it's
> not fun.

Reference:

https://github.com/tianocore/edk2/commit/ad43bc6b2e

> 
> In short, I have nothing against the QEMU patch, but then the current
> OvmfPkg maintainers should accept a patch for the firmware too, for
> protecting the area from later firmware components, as early as possible.
> 
> Laszlo
>
Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory
Posted by Jason A. Donenfeld 1 week ago
Hi Laszlo,

On Thu, Aug 4, 2022 at 3:17 PM Laszlo Ersek <lersek@redhat.com> wrote:
>
> On 08/04/22 14:47, Jason A. Donenfeld wrote:
> > On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >>
> >> Hi Laszlo,
> >>
> >> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
> >>> None of the existing info passing methods seem early enough, generic
> >>> enough, and secure enough (at the same time)...
> >>
> >> Can you look at the v2 patch? It seems to work on every configuration I
> >> throw at it. Keep in mind that setup_data is only used very, very early.
> >> I can think of a few other places to put it too, looking at the x86
> >> memory map, that will survive long enough.
> >>
> >> I think this might actually be a straightforwardly solvable problem if
> >> you think about it more basically.
> >
> > And just to put things in perspective here... We only need like 48
> > bytes or something at some easy fixed address. That's not much. That's
> > *got* to be a fairly tractable problem. If v2 has issues, I can't see
> > why there wouldn't be a different easy place to put a meger 48 bytes
> > of stuff that then is allowed to be wiped out after early boot.
>
> I've looked at v2. It still relies on passing information from QEMU to
> the guest kernel through guest RAM such that the whole firmware
> execution takes place in-between, without the firmware knowing anything
> about that particular area -- effectively treating it as free system
> RAM. Such exceptions are time bombs.
>
> We *have* used hard-coded addresses, sometimes they are unavoidable, but
> then they are open-coded in both QEMU and the firmware, and some early
> part of the firmware takes care to either move the data to a "safe"
> place, or to cover it in-place with a kind of reservation that prevents
> other parts of the firmware from trampling over it. I've debugged
> mistakes (memory corruption) when such reservation was forgotten; it's
> not fun.
>
> In short, I have nothing against the QEMU patch, but then the current
> OvmfPkg maintainers should accept a patch for the firmware too, for
> protecting the area from later firmware components, as early as possible.

What you say mostly makes sense. Though, I was wondering if there's
some unused space (maybe a historical low address) that nothing
touches because it's always been considered reserved. Some kind of
ancient video data region, for example. We only need 48 bytes after
all... My hope was that somebody with a lot of deep x86 knowledge
would be able to smell their way to something that's always good like
that. (Alternatively, there's my real_addr thing from v2.)

Jason