[v1] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU

[PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Joao Martins 4 years ago

It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, expect for a
tiny region before the 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid i.e. not reserved by
IOMMU on behalf of some specific devices or platform-defined
restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
 -EINVAL.

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])

We already account for the 4G hole, albeit if the guest is big
enough we will fail to allocate a guest with  >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

[*] there is another reserved region unrelated to HT that exists
in the 256T boundaru in Fam 17h according to Errata #1286,
documeted also in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence it
cannot be used either as GPAs. On those cases rather than
establishing the start of ram-above-4g to be 4G, relocate instead
to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, we also add the the reserved HT
e820 range as reserved.

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
 target/i386/cpu.h |  4 +++
 2 files changed, 70 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 7de0e87f4a3f..b060aedd38f3 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void relocate_4g(MachineState *machine, PCMachineState *pcms)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    ram_addr_t device_mem_size = 0;
+    uint32_t eax, vendor[3];
+
+    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
+    if (!IS_AMD_VENDOR(vendor)) {
+        return;
+    }
+
+    if (pcmc->has_reserved_memory &&
+       (machine->ram_size < machine->maxram_size)) {
+        device_mem_size = machine->maxram_size - machine->ram_size;
+    }
+
+    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
+         device_mem_size) < AMD_HT_START) {
+        return;
+    }
+
+    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
 
     linux_boot = (machine->kernel_filename != NULL);
 
+    relocate_4g(machine, pcms);
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
@@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
+
+    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
+        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+    }
+
     if (x86ms->above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
         memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 9911d7c8711b..1acebc569b02 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
 #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
                          (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
                          (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
+#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
+                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
+                         (vendor[2]) == CPUID_VENDOR_AMD_3)
+
 
 #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
-- 
2.17.2

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Igor Mammedov 3 years, 12 months ago

On Mon,  7 Feb 2022 20:24:20 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, expect for a
> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundaru in Fam 17h according to Errata #1286,
> documeted also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done, we also add the the reserved HT
> e820 range as reserved.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
>  target/i386/cpu.h |  4 +++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 7de0e87f4a3f..b060aedd38f3 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void relocate_4g(MachineState *machine, PCMachineState *pcms)

perhaps rename it to x86_update_above_4g_mem_start() ?


> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    ram_addr_t device_mem_size = 0;
> +    uint32_t eax, vendor[3];
> +
> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> +    if (!IS_AMD_VENDOR(vendor)) {
> +        return;
> +    }
> +
> +    if (pcmc->has_reserved_memory &&
> +       (machine->ram_size < machine->maxram_size)) {
> +        device_mem_size = machine->maxram_size - machine->ram_size;
> +    }
> +
> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> +         device_mem_size) < AMD_HT_START) {
should it account for sgx as well?

what if above sum ends up right before AMD_HT_START,
and exit without adjusting above_4g_mem_start, but
pci64 hole eventually will fall into HT range?
Is it expected behaviour?

> +        return;
> +    }
> +
> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> +    relocate_4g(machine, pcms);
> +
>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +
> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +    }

btw: do we have to add reservation record for HT zone, why?

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 9911d7c8711b..1acebc569b02 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
> +
>  
>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Joao Martins 3 years, 12 months ago


On 2/14/22 14:53, Igor Mammedov wrote:
> On Mon,  7 Feb 2022 20:24:20 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, expect for a
>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>>  -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with  >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
>>
>> [*] there is another reserved region unrelated to HT that exists
>> in the 256T boundaru in Fam 17h according to Errata #1286,
>> documeted also in "Open-Source Register Reference for AMD Family
>> 17h Processors (PUB)"
>>
>> When creating the region above 4G, take into account that on AMD
>> platforms the HyperTransport range is reserved and hence it
>> cannot be used either as GPAs. On those cases rather than
>> establishing the start of ram-above-4g to be 4G, relocate instead
>> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
>> Topology", for more information on the underlying restriction of
>> IOVAs.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> 0000000000000000-000000007fffffff (prio 0, i/o):
>> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
>> 0000010000000000-000001ff7fffffff (prio 0, i/o):
>> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
>>
>> If the relocation is done, we also add the the reserved HT
>> e820 range as reserved.
>>
>> Suggested-by: Igor Mammedov <imammedo@redhat.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
>>  target/i386/cpu.h |  4 +++
>>  2 files changed, 70 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 7de0e87f4a3f..b060aedd38f3 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
>>  #define PC_ROM_ALIGN       0x800
>>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>>  
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void relocate_4g(MachineState *machine, PCMachineState *pcms)
> 
> perhaps rename it to x86_update_above_4g_mem_start() ?
> 
Yeap.

>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    ram_addr_t device_mem_size = 0;
>> +    uint32_t eax, vendor[3];
>> +
>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>> +    if (!IS_AMD_VENDOR(vendor)) {
>> +        return;
>> +    }
>> +
>> +    if (pcmc->has_reserved_memory &&
>> +       (machine->ram_size < machine->maxram_size)) {
>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>> +    }
>> +
>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>> +         device_mem_size) < AMD_HT_START) {
> should it account for sgx as well?
> 
Yes, I missed that one.

> what if above sum ends up right before AMD_HT_START,
> and exit without adjusting above_4g_mem_start, but
> pci64 hole eventually will fall into HT range?
> Is it expected behaviour?
> 
No -- it should not be any reserved range really.

And I was at two minds on this one, whether to advertise *always*
the 1T hole, regardless of relocation. Or account the size
we advertise for the pci64 hole and make that part of the equation
above. Although that has the flaw that the firmware at admin request
may pick some ludricous number (limited by maxphysaddr).

>> +        return;
>> +    }
>> +
>> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
>>  
>>      linux_boot = (machine->kernel_filename != NULL);
>>  
>> +    relocate_4g(machine, pcms);
>> +
>>      /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>> @@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
>>                               0, x86ms->below_4g_mem_size);
>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>> +
>> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
>> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +    }
> 
> btw: do we have to add reservation record for HT zone, why?
> 

This is what real hardware does fwiw :D

I understand that we you do the relocation, firmware /should/
pick the first address after RAM as start of pci64 hole, and hence
past the HT range.

But if felt that for correctness we would tell the guest that this
range cannot be used regardless. I take that perhaps you're thinking
that you omit the E820_RESERVED just like the 4G hole?

>>      if (x86ms->above_4g_mem_size > 0) {
>>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index 9911d7c8711b..1acebc569b02 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
>>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
>>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
>>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
>> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
>> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
>> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
>> +
>>  
>>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
>

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Igor Mammedov 3 years, 12 months ago

On Mon, 14 Feb 2022 15:05:00 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/14/22 14:53, Igor Mammedov wrote:
> > On Mon,  7 Feb 2022 20:24:20 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> It is assumed that the whole GPA space is available to be DMA
> >> addressable, within a given address space limit, expect for a
> >> tiny region before the 4G. Since Linux v5.4, VFIO validates
> >> whether the selected GPA is indeed valid i.e. not reserved by
> >> IOMMU on behalf of some specific devices or platform-defined
> >> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
> >>  -EINVAL.
> >>
> >> AMD systems with an IOMMU are examples of such platforms and
> >> particularly may only have these ranges as allowed:
> >>
> >> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> >>
> >> We already account for the 4G hole, albeit if the guest is big
> >> enough we will fail to allocate a guest with  >1010G due to the
> >> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> >>
> >> [*] there is another reserved region unrelated to HT that exists
> >> in the 256T boundaru in Fam 17h according to Errata #1286,
> >> documeted also in "Open-Source Register Reference for AMD Family
> >> 17h Processors (PUB)"
> >>
> >> When creating the region above 4G, take into account that on AMD
> >> platforms the HyperTransport range is reserved and hence it
> >> cannot be used either as GPAs. On those cases rather than
> >> establishing the start of ram-above-4g to be 4G, relocate instead
> >> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> >> Topology", for more information on the underlying restriction of
> >> IOVAs.
> >>
> >> After accounting for the 1Tb hole on AMD hosts, mtree should
> >> look like:
> >>
> >> 0000000000000000-000000007fffffff (prio 0, i/o):
> >> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> >> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> >> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> >>
> >> If the relocation is done, we also add the the reserved HT
> >> e820 range as reserved.
> >>
> >> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  target/i386/cpu.h |  4 +++
> >>  2 files changed, 70 insertions(+)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index 7de0e87f4a3f..b060aedd38f3 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
> >>  #define PC_ROM_ALIGN       0x800
> >>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
> >>  
> >> +/*
> >> + * AMD systems with an IOMMU have an additional hole close to the
> >> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> >> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> >> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> >> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> >> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> >> + * The ranges reserved for Hyper-Transport are:
> >> + *
> >> + * FD_0000_0000h - FF_FFFF_FFFFh
> >> + *
> >> + * The ranges represent the following:
> >> + *
> >> + * Base Address   Top Address  Use
> >> + *
> >> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> >> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> >> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> >> + * FD_F910_0000h FD_F91F_FFFFh System Management
> >> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> >> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> >> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> >> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> >> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> >> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> >> + *
> >> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> >> + * Table 3: Special Address Controls (GPA) for more information.
> >> + */
> >> +#define AMD_HT_START         0xfd00000000UL
> >> +#define AMD_HT_END           0xffffffffffUL
> >> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> >> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> >> +
> >> +static void relocate_4g(MachineState *machine, PCMachineState *pcms)  
> > 
> > perhaps rename it to x86_update_above_4g_mem_start() ?
> >   
> Yeap.
> 
> >> +{
> >> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >> +    ram_addr_t device_mem_size = 0;
> >> +    uint32_t eax, vendor[3];
> >> +
> >> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >> +    if (!IS_AMD_VENDOR(vendor)) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (pcmc->has_reserved_memory &&
> >> +       (machine->ram_size < machine->maxram_size)) {
> >> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >> +    }
> >> +
> >> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> >> +         device_mem_size) < AMD_HT_START) {  
> > should it account for sgx as well?
> >   
> Yes, I missed that one.
> 
> > what if above sum ends up right before AMD_HT_START,
> > and exit without adjusting above_4g_mem_start, but
> > pci64 hole eventually will fall into HT range?
> > Is it expected behaviour?
> >   
> No -- it should not be any reserved range really.
> 
> And I was at two minds on this one, whether to advertise *always*
> the 1T hole, regardless of relocation. Or account the size
> we advertise for the pci64 hole and make that part of the equation
> above. Although that has the flaw that the firmware at admin request
> may pick some ludricous number (limited by maxphysaddr).

it this point we have only pci64 hole size (machine property),
so I'd include that in equation to make firmware assign
pci64 aperture above HT range.

as for checking maxphysaddr, we can only check 'default' PCI hole
range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
and hard error on it.

I don't know what behavior should be if firmware tries to program
PCI64 hole beyond supported phys-bits.

Michael, Gerd
  perhaps you can suggest something here

 
> >> +        return;
> >> +    }
> >> +
> >> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> >> +}
> >> +
> >>  void pc_memory_init(PCMachineState *pcms,
> >>                      MemoryRegion *system_memory,
> >>                      MemoryRegion *rom_memory,
> >> @@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
> >>  
> >>      linux_boot = (machine->kernel_filename != NULL);
> >>  
> >> +    relocate_4g(machine, pcms);
> >> +
> >>      /*
> >>       * Split single memory region and use aliases to address portions of it,
> >>       * done for backwards compatibility with older qemus.
> >> @@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
> >>                               0, x86ms->below_4g_mem_size);
> >>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> >>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> >> +
> >> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> >> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >> +    }  
> > 
> > btw: do we have to add reservation record for HT zone, why?
> >   
> 
> This is what real hardware does fwiw :D
> 
> I understand that we you do the relocation, firmware /should/
> pick the first address after RAM as start of pci64 hole, and hence
> past the HT range.
> 
> But if felt that for correctness we would tell the guest that this
> range cannot be used regardless. I take that perhaps you're thinking
> that you omit the E820_RESERVED just like the 4G hole?

It's fine to advertise region as reserved, for completeness wit real HW
and in case if we ever allow guest OS to reconfigure PCI64
hole aperture it should prevent guest picking reserved range
(theoretically, whether it actually would work or not I'm not sure)

> 
> >>      if (x86ms->above_4g_mem_size > 0) {
> >>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> >>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> >> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> >> index 9911d7c8711b..1acebc569b02 100644
> >> --- a/target/i386/cpu.h
> >> +++ b/target/i386/cpu.h
> >> @@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
> >>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> >>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> >>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> >> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> >> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> >> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
> >> +
> >>  
> >>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
> >>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */  
> >   
>

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Gerd Hoffmann 3 years, 11 months ago

  Hi,

> I don't know what behavior should be if firmware tries to program
> PCI64 hole beyond supported phys-bits.

Well, you are basically f*cked.

Unfortunately there is no reliable way to figure what phys-bits actually
is.  Because of that the firmware (both seabios and edk2) tries to place
the pci64 hole as low as possible.

The long version:

qemu advertises phys-bits=40 to the guest by default.  Probably because
this is what the first amd opteron processors had, assuming that it
would be a safe default.  Then intel came, releasing processors with
phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
Boom.

End result is that edk2 uses a 32G pci64 window by default, which is
placed at the first 32G border beyond normal ram.  So for virtual
machines with up to ~ 30G ram (including reservations for memory
hotplug) the pci64 hole covers 32G -> 64G in guest physical address
space, which is low enough that it works on hardware with phys-bits=36.

If your VM has more than 32G of memory the pci64 hole will move and
phys-bits=36 isn't enough any more, but given that you probably only do
that on more beefy hosts which can take >= 64G of RAM and have a larger
physical address space this heuristic works good enough in practice.

Changing phys-bits behavior has been discussed on and off since years.
It's tricky to change for live migration compatibility reasons.

We got the host-phys-bits and host-phys-bits-limit properties, which
solve some of the phys-bits problems.

 * host-phys-bits=on makes sure the phys-bits advertised to the guest
   actually works.  It's off by default though for backward
   compatibility reasons (except microvm).  Also because turning it on
   breaks live migration of machines between hosts with different
   phys-bits.

 * host-phys-bits-limit can be used to tweak phys-bits to
   be lower than what the host supports.  Which can be used for
   live migration compatibility, i.e. if you have a pool of machines
   where some have 36 and some 39 you can limit phys-bits to 36 so
   live migration from 39 hosts to 36 hosts works.

What is missing:

 * Some way for the firmware to get a phys-bits value it can actually
   use.  One possible way would be to have a paravirtual bit somewhere
   telling whenever host-phys-bits is enabled or not.

If edk2 could figure what the usable (guest) physical address space
actually is it could:

  (a) make sure it never crosses that limit, and
  (b) pick better defaults, for example make the pci64 hole larger
      than 32G in case the available address space allows that.

take care,
  Gerd

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Joao Martins 3 years, 11 months ago

On 2/15/22 09:53, Gerd Hoffmann wrote:
> What is missing:
> 
>  * Some way for the firmware to get a phys-bits value it can actually
>    use.  One possible way would be to have a paravirtual bit somewhere
>    telling whenever host-phys-bits is enabled or not.
> 
If we are not talking about *very old* processors... isn't what already
advertised in CPUID.80000008 EAX enough? That's the maxphysaddr width
on x86, which on qemu we do set it with the phys-bits value;

	Joao

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Gerd Hoffmann 3 years, 11 months ago

On Tue, Feb 15, 2022 at 07:37:40PM +0000, Joao Martins wrote:
> On 2/15/22 09:53, Gerd Hoffmann wrote:
> > What is missing:
> > 
> >  * Some way for the firmware to get a phys-bits value it can actually
> >    use.  One possible way would be to have a paravirtual bit somewhere
> >    telling whenever host-phys-bits is enabled or not.
> > 
> If we are not talking about *very old* processors... isn't what already
> advertised in CPUID.80000008 EAX enough? That's the maxphysaddr width
> on x86, which on qemu we do set it with the phys-bits value;

Sigh.  Nope.  Did you read the complete reply?

Problem is this is not reliable.  With host-phys-bits=off (default) qemu
allows to set phys-bits to whatever value you want, including values
larger than what the host actually supports.  Which renders
CPUID.80000008.EAX unusable.  To make things even worse:  The default
value (phys-bits=40) is larger than what many intel boxes support.

host-phys-bits=on fixes that.  It makes guest-phys-bits == host-phys-bits
by default, and also enforces guest-phys-bits <= host-phys-bits.
So with host-phys-bits=on the guest can actually use CPUID.80000008.EAX
to figure how big the guest physical address space is.

Problem is the guest doesn't know whenever host-phys-bits is enabled or
not ...

So the options to fix that are:

  (1) Make the host-phys-bits option visible to the guest (as suggested
      above), or
  (2) Advertise a _reliable_ phys-bits value to the guest somehow.

take care,
  Gerd

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Joao Martins 3 years, 11 months ago

On 2/16/22 08:19, Gerd Hoffmann wrote:
> On Tue, Feb 15, 2022 at 07:37:40PM +0000, Joao Martins wrote:
>> On 2/15/22 09:53, Gerd Hoffmann wrote:
>>> What is missing:
>>>
>>>  * Some way for the firmware to get a phys-bits value it can actually
>>>    use.  One possible way would be to have a paravirtual bit somewhere
>>>    telling whenever host-phys-bits is enabled or not.
>>>
>> If we are not talking about *very old* processors... isn't what already
>> advertised in CPUID.80000008 EAX enough? That's the maxphysaddr width
>> on x86, which on qemu we do set it with the phys-bits value;
> 
> Sigh.  Nope.  Did you read the complete reply?
> 
Yes, I did.

What I overlooked was the emphasis you had on desktops (qemu default bigger than
host-advertised), where I am thinking mostly in servers.

> Problem is this is not reliable.  With host-phys-bits=off (default) qemu
> allows to set phys-bits to whatever value you want, including values
> larger than what the host actually supports.  Which renders
> CPUID.80000008.EAX unusable. 

I am seeing from another angle, which the way to convey the phys-bits is
via this CPUID leaf is *maybe* enough (like real hardware). But we are setting
with a bigger value than we should have (or in other words ... not honoring
our physical boundary).

> To make things even worse:  The default
> value (phys-bits=40) is larger than what many intel boxes support.
> 
> host-phys-bits=on fixes that.  It makes guest-phys-bits == host-phys-bits
> by default, and also enforces guest-phys-bits <= host-phys-bits.
> So with host-phys-bits=on the guest can actually use CPUID.80000008.EAX
> to figure how big the guest physical address space is.
> 
Your 2 paragraphs sound like it's two different things, but +host-phys-bits
just sets CPUID.80000008.EAX with the host CPUID equivalent leaf/register
value. Which yes, it makes it reliable, but the way to convey is still the
same. That is regardless, of phys-bits=bogus-bigger-than-host-number,
host-phys-bits=on or host-phys-bits-limit=N.

> Problem is the guest doesn't know whenever host-phys-bits is enabled or
> not ...
> 
> So the options to fix that are:
> 
>   (1) Make the host-phys-bits option visible to the guest (as suggested
>       above), or
>   (2) Advertise a _reliable_ phys-bits value to the guest somehow.

What I am not enterily sure from (1) is the value on having a 'guest phys-bits'
and a 'host phys-bits' *exposed to the guest* when it seems we are picking the wrong
value for guests. It seems unnecessary redirection (compared to real hw) unless
there's a use-case in keeping both that I am probably missing.

	Joao

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Gerd Hoffmann 3 years, 11 months ago

  Hi,

> What I overlooked was the emphasis you had on desktops (qemu default bigger than
> host-advertised), where I am thinking mostly in servers.

Yep, on servers you have the reverse problem that phys-bits=40 might be
too small for very large guests.

> > To make things even worse:  The default
> > value (phys-bits=40) is larger than what many intel boxes support.
> > 
> > host-phys-bits=on fixes that.  It makes guest-phys-bits == host-phys-bits
> > by default, and also enforces guest-phys-bits <= host-phys-bits.
> > So with host-phys-bits=on the guest can actually use CPUID.80000008.EAX
> > to figure how big the guest physical address space is.
> > 
> Your 2 paragraphs sound like it's two different things, but +host-phys-bits
> just sets CPUID.80000008.EAX with the host CPUID equivalent leaf/register
> value. Which yes, it makes it reliable, but the way to convey is still the
> same. That is regardless, of phys-bits=bogus-bigger-than-host-number,
> host-phys-bits=on or host-phys-bits-limit=N.

Yep, it's CPUID.80000008.EAX in all cases.

> > Problem is the guest doesn't know whenever host-phys-bits is enabled or
> > not ...
> > 
> > So the options to fix that are:
> > 
> >   (1) Make the host-phys-bits option visible to the guest (as suggested
> >       above), or
> >   (2) Advertise a _reliable_ phys-bits value to the guest somehow.
> 
> What I am not enterily sure from (1) is the value on having a 'guest phys-bits'
> and a 'host phys-bits' *exposed to the guest* when it seems we are picking the wrong
> value for guests.

Well, the guest can read CPUID.80000008.EAX but as of today the guest
just doesn't know whenever it's bogus or not.

> It seems unnecessary redirection (compared to real hw) unless
> there's a use-case in keeping both that I am probably missing.

Well, the use case is backward compatibility.  If we want make better
use of the guest physical address space on newer qemu without breaking
compatibility with older qemu versions the firmware needs to do
something along the lines of:

    if (i-can-trust-phys-bits) {
        // new qemu
        read CPUID.80000008.EAX and go with that
    } else {
        // old qemu
        use current heuristic, placing stuff as low as possible.
    }

And for that to actually work new qemu needs to expose something to the
guest which can be used to evaluate the "i-can-trust-phys-bits"
expression.  A guest-readable bit somewhere which is 1 for
host-phys-bits=on and 0 otherwise would do the trick, but there are
surely other ways to solve the problem.

take care,
  Gerd

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Daniel P. Berrangé 3 years, 11 months ago

On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > I don't know what behavior should be if firmware tries to program
> > PCI64 hole beyond supported phys-bits.
> 
> Well, you are basically f*cked.
> 
> Unfortunately there is no reliable way to figure what phys-bits actually
> is.  Because of that the firmware (both seabios and edk2) tries to place
> the pci64 hole as low as possible.
> 
> The long version:
> 
> qemu advertises phys-bits=40 to the guest by default.  Probably because
> this is what the first amd opteron processors had, assuming that it
> would be a safe default.  Then intel came, releasing processors with
> phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> Boom.
> 
> End result is that edk2 uses a 32G pci64 window by default, which is
> placed at the first 32G border beyond normal ram.  So for virtual
> machines with up to ~ 30G ram (including reservations for memory
> hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> space, which is low enough that it works on hardware with phys-bits=36.
> 
> If your VM has more than 32G of memory the pci64 hole will move and
> phys-bits=36 isn't enough any more, but given that you probably only do
> that on more beefy hosts which can take >= 64G of RAM and have a larger
> physical address space this heuristic works good enough in practice.
> 
> Changing phys-bits behavior has been discussed on and off since years.
> It's tricky to change for live migration compatibility reasons.
> 
> We got the host-phys-bits and host-phys-bits-limit properties, which
> solve some of the phys-bits problems.
> 
>  * host-phys-bits=on makes sure the phys-bits advertised to the guest
>    actually works.  It's off by default though for backward
>    compatibility reasons (except microvm).  Also because turning it on
>    breaks live migration of machines between hosts with different
>    phys-bits.

RHEL has shipped with host-phys-bits=on in its machine types
sinec RHEL-7. If it is good enough for RHEL machine types
for 8 years, IMHO, it is a sign that its reasonable to do the
same with upstream for new machine types.


>  * host-phys-bits-limit can be used to tweak phys-bits to
>    be lower than what the host supports.  Which can be used for
>    live migration compatibility, i.e. if you have a pool of machines
>    where some have 36 and some 39 you can limit phys-bits to 36 so
>    live migration from 39 hosts to 36 hosts works.

RHEL machine types have set this to host-phys-bits-limit=48
since RHEL-8 days, to avoid accidentally enabling 5-level
paging in guests without explicit user opt-in.

> What is missing:
> 
>  * Some way for the firmware to get a phys-bits value it can actually
>    use.  One possible way would be to have a paravirtual bit somewhere
>    telling whenever host-phys-bits is enabled or not.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Dr. David Alan Gilbert 3 years, 11 months ago

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:
> >   Hi,
> > 
> > > I don't know what behavior should be if firmware tries to program
> > > PCI64 hole beyond supported phys-bits.
> > 
> > Well, you are basically f*cked.
> > 
> > Unfortunately there is no reliable way to figure what phys-bits actually
> > is.  Because of that the firmware (both seabios and edk2) tries to place
> > the pci64 hole as low as possible.
> > 
> > The long version:
> > 
> > qemu advertises phys-bits=40 to the guest by default.  Probably because
> > this is what the first amd opteron processors had, assuming that it
> > would be a safe default.  Then intel came, releasing processors with
> > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> > Boom.
> > 
> > End result is that edk2 uses a 32G pci64 window by default, which is
> > placed at the first 32G border beyond normal ram.  So for virtual
> > machines with up to ~ 30G ram (including reservations for memory
> > hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> > space, which is low enough that it works on hardware with phys-bits=36.
> > 
> > If your VM has more than 32G of memory the pci64 hole will move and
> > phys-bits=36 isn't enough any more, but given that you probably only do
> > that on more beefy hosts which can take >= 64G of RAM and have a larger
> > physical address space this heuristic works good enough in practice.
> > 
> > Changing phys-bits behavior has been discussed on and off since years.
> > It's tricky to change for live migration compatibility reasons.
> > 
> > We got the host-phys-bits and host-phys-bits-limit properties, which
> > solve some of the phys-bits problems.
> > 
> >  * host-phys-bits=on makes sure the phys-bits advertised to the guest
> >    actually works.  It's off by default though for backward
> >    compatibility reasons (except microvm).  Also because turning it on
> >    breaks live migration of machines between hosts with different
> >    phys-bits.
> 
> RHEL has shipped with host-phys-bits=on in its machine types
> sinec RHEL-7. If it is good enough for RHEL machine types
> for 8 years, IMHO, it is a sign that its reasonable to do the
> same with upstream for new machine types.

And the upstream code is now pretty much identical except for the
default;  note that for TCG you do need to keep to 40 I think.

Dave
> 
> >  * host-phys-bits-limit can be used to tweak phys-bits to
> >    be lower than what the host supports.  Which can be used for
> >    live migration compatibility, i.e. if you have a pool of machines
> >    where some have 36 and some 39 you can limit phys-bits to 36 so
> >    live migration from 39 hosts to 36 hosts works.
> 
> RHEL machine types have set this to host-phys-bits-limit=48
> since RHEL-8 days, to avoid accidentally enabling 5-level
> paging in guests without explicit user opt-in.
> 
> > What is missing:
> > 
> >  * Some way for the firmware to get a phys-bits value it can actually
> >    use.  One possible way would be to have a paravirtual bit somewhere
> >    telling whenever host-phys-bits is enabled or not.
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Igor Mammedov 3 years, 11 months ago

On Mon, 21 Feb 2022 13:15:40 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:  
> > >   Hi,
> > >   
> > > > I don't know what behavior should be if firmware tries to program
> > > > PCI64 hole beyond supported phys-bits.  
> > > 
> > > Well, you are basically f*cked.
> > > 
> > > Unfortunately there is no reliable way to figure what phys-bits actually
> > > is.  Because of that the firmware (both seabios and edk2) tries to place
> > > the pci64 hole as low as possible.
> > > 
> > > The long version:
> > > 
> > > qemu advertises phys-bits=40 to the guest by default.  Probably because
> > > this is what the first amd opteron processors had, assuming that it
> > > would be a safe default.  Then intel came, releasing processors with
> > > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> > > Boom.
> > > 
> > > End result is that edk2 uses a 32G pci64 window by default, which is
> > > placed at the first 32G border beyond normal ram.  So for virtual
> > > machines with up to ~ 30G ram (including reservations for memory
> > > hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> > > space, which is low enough that it works on hardware with phys-bits=36.
> > > 
> > > If your VM has more than 32G of memory the pci64 hole will move and
> > > phys-bits=36 isn't enough any more, but given that you probably only do
> > > that on more beefy hosts which can take >= 64G of RAM and have a larger
> > > physical address space this heuristic works good enough in practice.
> > > 
> > > Changing phys-bits behavior has been discussed on and off since years.
> > > It's tricky to change for live migration compatibility reasons.
> > > 
> > > We got the host-phys-bits and host-phys-bits-limit properties, which
> > > solve some of the phys-bits problems.
> > > 
> > >  * host-phys-bits=on makes sure the phys-bits advertised to the guest
> > >    actually works.  It's off by default though for backward
> > >    compatibility reasons (except microvm).  Also because turning it on
> > >    breaks live migration of machines between hosts with different
> > >    phys-bits.  
> > 
> > RHEL has shipped with host-phys-bits=on in its machine types
> > sinec RHEL-7. If it is good enough for RHEL machine types
> > for 8 years, IMHO, it is a sign that its reasonable to do the
> > same with upstream for new machine types.  
> 
> And the upstream code is now pretty much identical except for the
> default;  note that for TCG you do need to keep to 40 I think.

will TCG work with 40bits on host that supports less than that?

Also quick look at host-phys-bits shows that it affects only 'host'
cpu model and is NOP for all other models.
If it's so than we probably need to expand it's scope to other cpu
models to cap them at actually supported range.

> 
> Dave
> >   
> > >  * host-phys-bits-limit can be used to tweak phys-bits to
> > >    be lower than what the host supports.  Which can be used for
> > >    live migration compatibility, i.e. if you have a pool of machines
> > >    where some have 36 and some 39 you can limit phys-bits to 36 so
> > >    live migration from 39 hosts to 36 hosts works.  
> > 
> > RHEL machine types have set this to host-phys-bits-limit=48
> > since RHEL-8 days, to avoid accidentally enabling 5-level
> > paging in guests without explicit user opt-in.
> >   
> > > What is missing:
> > > 
> > >  * Some way for the firmware to get a phys-bits value it can actually
> > >    use.  One possible way would be to have a paravirtual bit somewhere
> > >    telling whenever host-phys-bits is enabled or not.  
> > 
> > 
> > Regards,
> > Daniel
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > 
> >

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Dr. David Alan Gilbert 3 years, 11 months ago

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Mon, 21 Feb 2022 13:15:40 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:  
> > > >   Hi,
> > > >   
> > > > > I don't know what behavior should be if firmware tries to program
> > > > > PCI64 hole beyond supported phys-bits.  
> > > > 
> > > > Well, you are basically f*cked.
> > > > 
> > > > Unfortunately there is no reliable way to figure what phys-bits actually
> > > > is.  Because of that the firmware (both seabios and edk2) tries to place
> > > > the pci64 hole as low as possible.
> > > > 
> > > > The long version:
> > > > 
> > > > qemu advertises phys-bits=40 to the guest by default.  Probably because
> > > > this is what the first amd opteron processors had, assuming that it
> > > > would be a safe default.  Then intel came, releasing processors with
> > > > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> > > > Boom.
> > > > 
> > > > End result is that edk2 uses a 32G pci64 window by default, which is
> > > > placed at the first 32G border beyond normal ram.  So for virtual
> > > > machines with up to ~ 30G ram (including reservations for memory
> > > > hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> > > > space, which is low enough that it works on hardware with phys-bits=36.
> > > > 
> > > > If your VM has more than 32G of memory the pci64 hole will move and
> > > > phys-bits=36 isn't enough any more, but given that you probably only do
> > > > that on more beefy hosts which can take >= 64G of RAM and have a larger
> > > > physical address space this heuristic works good enough in practice.
> > > > 
> > > > Changing phys-bits behavior has been discussed on and off since years.
> > > > It's tricky to change for live migration compatibility reasons.
> > > > 
> > > > We got the host-phys-bits and host-phys-bits-limit properties, which
> > > > solve some of the phys-bits problems.
> > > > 
> > > >  * host-phys-bits=on makes sure the phys-bits advertised to the guest
> > > >    actually works.  It's off by default though for backward
> > > >    compatibility reasons (except microvm).  Also because turning it on
> > > >    breaks live migration of machines between hosts with different
> > > >    phys-bits.  
> > > 
> > > RHEL has shipped with host-phys-bits=on in its machine types
> > > sinec RHEL-7. If it is good enough for RHEL machine types
> > > for 8 years, IMHO, it is a sign that its reasonable to do the
> > > same with upstream for new machine types.  
> > 
> > And the upstream code is now pretty much identical except for the
> > default;  note that for TCG you do need to keep to 40 I think.
> 
> will TCG work with 40bits on host that supports less than that?
> 
> Also quick look at host-phys-bits shows that it affects only 'host'
> cpu model and is NOP for all other models.
> If it's so than we probably need to expand it's scope to other cpu
> models to cap them at actually supported range.

(We shouldn't really bring TCG oddities into this series!)

As I remember it effectively gets it from the accelerator, and TCG being
portable, there's no portable way of reading the phys-bits.

Whether it would work, hmm.  I'm assuming the host OS would stop you
allocating a huge ram block, so it shouldn't break from that.
But then the guest address translation is done in software, not using
the host MMU, so I think the guests view of addressing should be able
to be larger than the host. (Unless you try things like vfio/iommu on
tcg, which I'm told does work in some combos).

Dave


> > 
> > Dave
> > >   
> > > >  * host-phys-bits-limit can be used to tweak phys-bits to
> > > >    be lower than what the host supports.  Which can be used for
> > > >    live migration compatibility, i.e. if you have a pool of machines
> > > >    where some have 36 and some 39 you can limit phys-bits to 36 so
> > > >    live migration from 39 hosts to 36 hosts works.  
> > > 
> > > RHEL machine types have set this to host-phys-bits-limit=48
> > > since RHEL-8 days, to avoid accidentally enabling 5-level
> > > paging in guests without explicit user opt-in.
> > >   
> > > > What is missing:
> > > > 
> > > >  * Some way for the firmware to get a phys-bits value it can actually
> > > >    use.  One possible way would be to have a paravirtual bit somewhere
> > > >    telling whenever host-phys-bits is enabled or not.  
> > > 
> > > 
> > > Regards,
> > > Daniel
> > > -- 
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > > 
> > >   
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Gerd Hoffmann 3 years, 11 months ago

  Hi,

> > And the upstream code is now pretty much identical except for the
> > default;  note that for TCG you do need to keep to 40 I think.
> 
> will TCG work with 40bits on host that supports less than that?

When I understand things correctly the problem is that the phys-bits
limit applies to the npt/ept tables too, effectively restricting guest
physical address space to host physical address space.

TCG is not affected by that and should work just fine.

Not sure what happens if you turn off npt/ept and run on softmmu.
Possibly that works fine too.

> Also quick look at host-phys-bits shows that it affects only 'host'
> cpu model and is NOP for all other models.

I don't think so.  microvm forces host-phys-bits=on and that works with
all cpu models.

take care,
  Gerd

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Igor Mammedov 3 years, 11 months ago

On Tue, 22 Feb 2022 10:42:55 +0100
Gerd Hoffmann <kraxel@redhat.com> wrote:

>   Hi,
> 
> > > And the upstream code is now pretty much identical except for the
> > > default;  note that for TCG you do need to keep to 40 I think.  
> > 
> > will TCG work with 40bits on host that supports less than that?  
> 
> When I understand things correctly the problem is that the phys-bits
> limit applies to the npt/ept tables too, effectively restricting guest
> physical address space to host physical address space.
> 
> TCG is not affected by that and should work just fine.
> 
> Not sure what happens if you turn off npt/ept and run on softmmu.
> Possibly that works fine too.
> 
> > Also quick look at host-phys-bits shows that it affects only 'host'
> > cpu model and is NOP for all other models.  
> 
> I don't think so.  microvm forces host-phys-bits=on and that works with
> all cpu models.

I just don't see how host-phys-bits can work for other than 'host' cpu model.
It's true that property is available for all cpu models, but the field it sets
is only used in target/i386/host-cpu.c, the same applies to host-phys-bits-limit.
Am I missing something?

> 
> take care,
>   Gerd
>

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Dr. David Alan Gilbert 3 years, 11 months ago

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Tue, 22 Feb 2022 10:42:55 +0100
> Gerd Hoffmann <kraxel@redhat.com> wrote:
> 
> >   Hi,
> > 
> > > > And the upstream code is now pretty much identical except for the
> > > > default;  note that for TCG you do need to keep to 40 I think.  
> > > 
> > > will TCG work with 40bits on host that supports less than that?  
> > 
> > When I understand things correctly the problem is that the phys-bits
> > limit applies to the npt/ept tables too, effectively restricting guest
> > physical address space to host physical address space.
> > 
> > TCG is not affected by that and should work just fine.
> > 
> > Not sure what happens if you turn off npt/ept and run on softmmu.
> > Possibly that works fine too.
> > 
> > > Also quick look at host-phys-bits shows that it affects only 'host'
> > > cpu model and is NOP for all other models.  
> > 
> > I don't think so.  microvm forces host-phys-bits=on and that works with
> > all cpu models.
> 
> I just don't see how host-phys-bits can work for other than 'host' cpu model.
> It's true that property is available for all cpu models, but the field it sets
> is only used in target/i386/host-cpu.c, the same applies to host-phys-bits-limit.
> Am I missing something?

The hook in kvm/kvm-cpu.c kvm_cpu_realizefn:

    /*
     * The realize order is important, since x86_cpu_realize() checks if
     * nothing else has been set by the user (or by accelerators) in
     * cpu->ucode_rev and cpu->phys_bits, and updates the CPUID results in
     * mwait.ecx.
     * This accel realization code also assumes cpu features are already expanded.
     *
     * realize order:
     *
     * x86_cpu_realize():
     *  -> x86_cpu_expand_features()
     *  -> cpu_exec_realizefn():
     *            -> accel_cpu_realizefn()
     *               kvm_cpu_realizefn() -> host_cpu_realizefn()
     *  -> check/update ucode_rev, phys_bits, mwait
     */

Dave

> > 
> > take care,
> >   Gerd
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Igor Mammedov 3 years, 11 months ago

On Wed, 23 Feb 2022 09:16:51 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Igor Mammedov (imammedo@redhat.com) wrote:
> > On Tue, 22 Feb 2022 10:42:55 +0100
> > Gerd Hoffmann <kraxel@redhat.com> wrote:
> >   
> > >   Hi,
> > >   
> > > > > And the upstream code is now pretty much identical except for the
> > > > > default;  note that for TCG you do need to keep to 40 I think.    
> > > > 
> > > > will TCG work with 40bits on host that supports less than that?    
> > > 
> > > When I understand things correctly the problem is that the phys-bits
> > > limit applies to the npt/ept tables too, effectively restricting guest
> > > physical address space to host physical address space.
> > > 
> > > TCG is not affected by that and should work just fine.
> > > 
> > > Not sure what happens if you turn off npt/ept and run on softmmu.
> > > Possibly that works fine too.
> > >   
> > > > Also quick look at host-phys-bits shows that it affects only 'host'
> > > > cpu model and is NOP for all other models.    
> > > 
> > > I don't think so.  microvm forces host-phys-bits=on and that works with
> > > all cpu models.  
> > 
> > I just don't see how host-phys-bits can work for other than 'host' cpu model.
> > It's true that property is available for all cpu models, but the field it sets
> > is only used in target/i386/host-cpu.c, the same applies to host-phys-bits-limit.
> > Am I missing something?  
> 
> The hook in kvm/kvm-cpu.c kvm_cpu_realizefn:
> 
>     /*
>      * The realize order is important, since x86_cpu_realize() checks if
>      * nothing else has been set by the user (or by accelerators) in
>      * cpu->ucode_rev and cpu->phys_bits, and updates the CPUID results in
>      * mwait.ecx.
>      * This accel realization code also assumes cpu features are already expanded.
>      *
>      * realize order:
>      *
>      * x86_cpu_realize():
>      *  -> x86_cpu_expand_features()
>      *  -> cpu_exec_realizefn():
>      *            -> accel_cpu_realizefn()
>      *               kvm_cpu_realizefn() -> host_cpu_realizefn()
>      *  -> check/update ucode_rev, phys_bits, mwait
>      */

Thanks,
I didn't expect host_cpu_realizefn being called from elsewhere
beside of cpu model it belongs to or models inherited from it.

> 
> Dave
> 
> > > 
> > > take care,
> > >   Gerd
> > >   
> >

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Joao Martins 3 years, 11 months ago

On 2/14/22 15:31, Igor Mammedov wrote:
> On Mon, 14 Feb 2022 15:05:00 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/14/22 14:53, Igor Mammedov wrote:
>>> On Mon,  7 Feb 2022 20:24:20 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> +{
>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> +    ram_addr_t device_mem_size = 0;
>>>> +    uint32_t eax, vendor[3];
>>>> +
>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (pcmc->has_reserved_memory &&
>>>> +       (machine->ram_size < machine->maxram_size)) {
>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>>> +    }
>>>> +
>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>>>> +         device_mem_size) < AMD_HT_START) {  
>>>   
>> And I was at two minds on this one, whether to advertise *always*
>> the 1T hole, regardless of relocation. Or account the size
>> we advertise for the pci64 hole and make that part of the equation
>> above. Although that has the flaw that the firmware at admin request
>> may pick some ludricous number (limited by maxphysaddr).
> 
> it this point we have only pci64 hole size (machine property),
> so I'd include that in equation to make firmware assign
> pci64 aperture above HT range.
> 
> as for checking maxphysaddr, we can only check 'default' PCI hole
> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
> and hard error on it.
> 

Igor, in the context of your comment above, I'll be introducing another
preparatory patch that adds up pci_hole64_size to pc_memory_init() such
that all used/max physaddr space checks are consolidated in pc_memory_init().

To that end, the changes involve mainly moves the the pcihost qdev creation
to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
needs slightly more of a dance to extract that from i440fx_init() and also
because most i440fx state is private (hence the new helper for size). But
the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
it is just to extra pci_hole64_size from the object + user passed args (-global etc).

Raw staging changes below the scissors mark so far.

-->8--

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index b2e43eba1106..902977081350 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms)
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory)
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size)
 {
     int linux_boot, i;
     MemoryRegion *option_rom_mr;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index d9b344248dac..5a608e30e28f 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    uint64_t hole64_size;
+    DeviceState *i440fx_dev;

     /*
      * Calculate ram split, for memory below and above 4G.  It's a bit
@@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine,
         pci_memory = g_new(MemoryRegion, 1);
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
+        i440fx_dev = qdev_new(host_type);
+        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
+        i440fx_dev = NULL;
+        hole64_size = 0;
     }

     pc_guest_info_init(pcms);
@@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine,
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         pc_memory_init(pcms, system_memory,
-                       rom_memory, &ram_memory);
+                       rom_memory, &ram_memory, hole64_size);
     } else {
         pc_system_flash_cleanup_unused(pcms);
         if (machine->kernel_filename != NULL) {
@@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine,

         pci_bus = i440fx_init(host_type,
                               pci_type,
-                              &i440fx_state,
+                              i440fx_dev, &i440fx_state,
                               system_memory, system_io, machine->ram_size,
                               x86ms->below_4g_mem_size,
                               x86ms->above_4g_mem_size,
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 1780f79bc127..b7cf44d4755e 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -203,12 +203,13 @@ static void pc_q35_init(MachineState *machine)
                             pcms->smbios_entry_point_type);
     }

-    /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
-
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));

+    /* allocate ram and load rom/bios */
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
+                   q35_host->mch.pci_hole64_size);
+
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
                              OBJECT(ram_memory), NULL);
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index e08716142b6e..c5cc28250d5c 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -237,7 +237,15 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
     }
 }

+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
+{
+        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
+
+        return i440fx->pci_hole64_size;
+}
+
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
+                    DeviceState *dev,
                     PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
@@ -247,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_address_space,
                     MemoryRegion *ram_memory)
 {
-    DeviceState *dev;
     PCIBus *b;
     PCIDevice *d;
     PCIHostState *s;
@@ -255,7 +262,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
     unsigned i;
     I440FXState *i440fx;

-    dev = qdev_new(host_type);
     s = PCI_HOST_BRIDGE(dev);
     b = pci_root_bus_new(dev, NULL, pci_address_space,
                          address_space_io, 0, TYPE_PCI_BUS);
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 9c9f4ac74810..d8b9c4ebd748 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory);
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size);
 uint64_t pc_pci_hole64_start(void);
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(struct PCMachineState *pcms,
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index f068aaba8fda..1299d6a2b0e4 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -36,7 +36,7 @@ struct PCII440FXState {
 #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"

 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
-                    PCII440FXState **pi440fx_state,
+                    DeviceState *dev, PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
@@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_memory,
                     MemoryRegion *ram_memory);

+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);

 #endif

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Igor Mammedov 3 years, 11 months ago

On Fri, 18 Feb 2022 17:12:21 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/14/22 15:31, Igor Mammedov wrote:
> > On Mon, 14 Feb 2022 15:05:00 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 2/14/22 14:53, Igor Mammedov wrote:  
> >>> On Mon,  7 Feb 2022 20:24:20 +0000
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>>> +{
> >>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>> +    ram_addr_t device_mem_size = 0;
> >>>> +    uint32_t eax, vendor[3];
> >>>> +
> >>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    if (pcmc->has_reserved_memory &&
> >>>> +       (machine->ram_size < machine->maxram_size)) {
> >>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >>>> +    }
> >>>> +
> >>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> >>>> +         device_mem_size) < AMD_HT_START) {    
> >>>     
> >> And I was at two minds on this one, whether to advertise *always*
> >> the 1T hole, regardless of relocation. Or account the size
> >> we advertise for the pci64 hole and make that part of the equation
> >> above. Although that has the flaw that the firmware at admin request
> >> may pick some ludricous number (limited by maxphysaddr).  
> > 
> > it this point we have only pci64 hole size (machine property),
> > so I'd include that in equation to make firmware assign
> > pci64 aperture above HT range.
> > 
> > as for checking maxphysaddr, we can only check 'default' PCI hole
> > range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
> > and hard error on it.
> >   
> 
> Igor, in the context of your comment above, I'll be introducing another
> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
> that all used/max physaddr space checks are consolidated in pc_memory_init().
>
> To that end, the changes involve mainly moves the the pcihost qdev creation
> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
> needs slightly more of a dance to extract that from i440fx_init() and also
> because most i440fx state is private (hence the new helper for size). But
> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
> it is just to extra pci_hole64_size from the object + user passed args (-global etc).

Shuffling init order is looks too intrusive and in practice
quite risky.
How about moving maxphysaddr check to pc_machine_done() instead?
(this way you won't have to move pcihost around)


> Raw staging changes below the scissors mark so far.
> 
> -->8--  
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index b2e43eba1106..902977081350 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms)
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> -                    MemoryRegion **ram_memory)
> +                    MemoryRegion **ram_memory,
> +                    uint64_t pci_hole64_size)
>  {
>      int linux_boot, i;
>      MemoryRegion *option_rom_mr;
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index d9b344248dac..5a608e30e28f 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine,
>      MemoryRegion *pci_memory;
>      MemoryRegion *rom_memory;
>      ram_addr_t lowmem;
> +    uint64_t hole64_size;
> +    DeviceState *i440fx_dev;
> 
>      /*
>       * Calculate ram split, for memory below and above 4G.  It's a bit
> @@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine,
>          pci_memory = g_new(MemoryRegion, 1);
>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>          rom_memory = pci_memory;
> +        i440fx_dev = qdev_new(host_type);
> +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
>      } else {
>          pci_memory = NULL;
>          rom_memory = system_memory;
> +        i440fx_dev = NULL;
> +        hole64_size = 0;
>      }
> 
>      pc_guest_info_init(pcms);
> @@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine,
>      /* allocate ram and load rom/bios */
>      if (!xen_enabled()) {
>          pc_memory_init(pcms, system_memory,
> -                       rom_memory, &ram_memory);
> +                       rom_memory, &ram_memory, hole64_size);
>      } else {
>          pc_system_flash_cleanup_unused(pcms);
>          if (machine->kernel_filename != NULL) {
> @@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine,
> 
>          pci_bus = i440fx_init(host_type,
>                                pci_type,
> -                              &i440fx_state,
> +                              i440fx_dev, &i440fx_state,
>                                system_memory, system_io, machine->ram_size,
>                                x86ms->below_4g_mem_size,
>                                x86ms->above_4g_mem_size,
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 1780f79bc127..b7cf44d4755e 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -203,12 +203,13 @@ static void pc_q35_init(MachineState *machine)
>                              pcms->smbios_entry_point_type);
>      }
> 
> -    /* allocate ram and load rom/bios */
> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
> -
>      /* create pci host bus */
>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
> 
> +    /* allocate ram and load rom/bios */
> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
> +                   q35_host->mch.pci_hole64_size);
> +
>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>                               OBJECT(ram_memory), NULL);
> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> index e08716142b6e..c5cc28250d5c 100644
> --- a/hw/pci-host/i440fx.c
> +++ b/hw/pci-host/i440fx.c
> @@ -237,7 +237,15 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>      }
>  }
> 
> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
> +{
> +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
> +
> +        return i440fx->pci_hole64_size;
> +}
> +
>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> +                    DeviceState *dev,
>                      PCII440FXState **pi440fx_state,
>                      MemoryRegion *address_space_mem,
>                      MemoryRegion *address_space_io,
> @@ -247,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>                      MemoryRegion *pci_address_space,
>                      MemoryRegion *ram_memory)
>  {
> -    DeviceState *dev;
>      PCIBus *b;
>      PCIDevice *d;
>      PCIHostState *s;
> @@ -255,7 +262,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>      unsigned i;
>      I440FXState *i440fx;
> 
> -    dev = qdev_new(host_type);
>      s = PCI_HOST_BRIDGE(dev);
>      b = pci_root_bus_new(dev, NULL, pci_address_space,
>                           address_space_io, 0, TYPE_PCI_BUS);
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 9c9f4ac74810..d8b9c4ebd748 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> -                    MemoryRegion **ram_memory);
> +                    MemoryRegion **ram_memory,
> +                    uint64_t pci_hole64_size);
>  uint64_t pc_pci_hole64_start(void);
>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>  void pc_basic_device_init(struct PCMachineState *pcms,
> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
> index f068aaba8fda..1299d6a2b0e4 100644
> --- a/include/hw/pci-host/i440fx.h
> +++ b/include/hw/pci-host/i440fx.h
> @@ -36,7 +36,7 @@ struct PCII440FXState {
>  #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
> 
>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> -                    PCII440FXState **pi440fx_state,
> +                    DeviceState *dev, PCII440FXState **pi440fx_state,
>                      MemoryRegion *address_space_mem,
>                      MemoryRegion *address_space_io,
>                      ram_addr_t ram_size,
> @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>                      MemoryRegion *pci_memory,
>                      MemoryRegion *ram_memory);
> 
> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
> 
>  #endif
>

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Joao Martins 3 years, 11 months ago


On 2/21/22 06:58, Igor Mammedov wrote:
> On Fri, 18 Feb 2022 17:12:21 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 2/14/22 15:31, Igor Mammedov wrote:
>>> On Mon, 14 Feb 2022 15:05:00 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 2/14/22 14:53, Igor Mammedov wrote:  
>>>>> On Mon,  7 Feb 2022 20:24:20 +0000
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>> +{
>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>> +    uint32_t eax, vendor[3];
>>>>>> +
>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (pcmc->has_reserved_memory &&
>>>>>> +       (machine->ram_size < machine->maxram_size)) {
>>>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>>>>> +    }
>>>>>> +
>>>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>>>>>> +         device_mem_size) < AMD_HT_START) {    
>>>>>     
>>>> And I was at two minds on this one, whether to advertise *always*
>>>> the 1T hole, regardless of relocation. Or account the size
>>>> we advertise for the pci64 hole and make that part of the equation
>>>> above. Although that has the flaw that the firmware at admin request
>>>> may pick some ludricous number (limited by maxphysaddr).  
>>>
>>> it this point we have only pci64 hole size (machine property),
>>> so I'd include that in equation to make firmware assign
>>> pci64 aperture above HT range.
>>>
>>> as for checking maxphysaddr, we can only check 'default' PCI hole
>>> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
>>> and hard error on it.
>>>   
>>
>> Igor, in the context of your comment above, I'll be introducing another
>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
>> that all used/max physaddr space checks are consolidated in pc_memory_init().
>>
>> To that end, the changes involve mainly moves the the pcihost qdev creation
>> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
>> needs slightly more of a dance to extract that from i440fx_init() and also
>> because most i440fx state is private (hence the new helper for size). But
>> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
>> it is just to extra pci_hole64_size from the object + user passed args (-global etc).
> 
> Shuffling init order is looks too intrusive and in practice
> quite risky.

Yeah, it is an intrusive change sadly. Although, why would you consider it
risky (curious)? We are "only" moving this:

	qdev_new(host_type);

... located at the very top of i440fx_init() and called at the top of q35_host
initilization to be instead before pc_memory_init(). And that means that an instance of an
object gets made and its properties initialized i.e. @instance_init of q35 and i440fx and
its properties. I don't see a particular dependence in PC code to tell that this
would affect its surroundings parts.

The actual pcihost-related initialization is still kept entirely unchanged.

> How about moving maxphysaddr check to pc_machine_done() instead?
> (this way you won't have to move pcihost around)
> 
I can move it. But be there will be a slight disconnect between what pc_memory_init()
checks against "max used address"  between ... dictating if the 4G mem start should change
to 1T or not ...  and when the phys-bits check is actually made which includes the pci hole64.

For example, we create a guest with maxram 1009G (which 4G mem start would get at
unchanged) and then the pci_hole64 goes likely assigned across the rest until 1023G (i.e.
across the HT region). Here it would need an extra check and fail if pci_hole64 crosses
the HT region. Whereby if it was added in pc_memory_init() then we could just relocate to
1T and the guest creation would still proceed.

> 
>> Raw staging changes below the scissors mark so far.
>>
>> -->8--  
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index b2e43eba1106..902977081350 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms)
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> -                    MemoryRegion **ram_memory)
>> +                    MemoryRegion **ram_memory,
>> +                    uint64_t pci_hole64_size)
>>  {
>>      int linux_boot, i;
>>      MemoryRegion *option_rom_mr;
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index d9b344248dac..5a608e30e28f 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine,
>>      MemoryRegion *pci_memory;
>>      MemoryRegion *rom_memory;
>>      ram_addr_t lowmem;
>> +    uint64_t hole64_size;
>> +    DeviceState *i440fx_dev;
>>
>>      /*
>>       * Calculate ram split, for memory below and above 4G.  It's a bit
>> @@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine,
>>          pci_memory = g_new(MemoryRegion, 1);
>>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>>          rom_memory = pci_memory;
>> +        i440fx_dev = qdev_new(host_type);
>> +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
>>      } else {
>>          pci_memory = NULL;
>>          rom_memory = system_memory;
>> +        i440fx_dev = NULL;
>> +        hole64_size = 0;
>>      }
>>
>>      pc_guest_info_init(pcms);
>> @@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine,
>>      /* allocate ram and load rom/bios */
>>      if (!xen_enabled()) {
>>          pc_memory_init(pcms, system_memory,
>> -                       rom_memory, &ram_memory);
>> +                       rom_memory, &ram_memory, hole64_size);
>>      } else {
>>          pc_system_flash_cleanup_unused(pcms);
>>          if (machine->kernel_filename != NULL) {
>> @@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine,
>>
>>          pci_bus = i440fx_init(host_type,
>>                                pci_type,
>> -                              &i440fx_state,
>> +                              i440fx_dev, &i440fx_state,
>>                                system_memory, system_io, machine->ram_size,
>>                                x86ms->below_4g_mem_size,
>>                                x86ms->above_4g_mem_size,
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 1780f79bc127..b7cf44d4755e 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -203,12 +203,13 @@ static void pc_q35_init(MachineState *machine)
>>                              pcms->smbios_entry_point_type);
>>      }
>>
>> -    /* allocate ram and load rom/bios */
>> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>> -
>>      /* create pci host bus */
>>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>>
>> +    /* allocate ram and load rom/bios */
>> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
>> +                   q35_host->mch.pci_hole64_size);
>> +
>>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>>                               OBJECT(ram_memory), NULL);
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index e08716142b6e..c5cc28250d5c 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -237,7 +237,15 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>>      }
>>  }
>>
>> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
>> +{
>> +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
>> +
>> +        return i440fx->pci_hole64_size;
>> +}
>> +
>>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>> +                    DeviceState *dev,
>>                      PCII440FXState **pi440fx_state,
>>                      MemoryRegion *address_space_mem,
>>                      MemoryRegion *address_space_io,
>> @@ -247,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                      MemoryRegion *pci_address_space,
>>                      MemoryRegion *ram_memory)
>>  {
>> -    DeviceState *dev;
>>      PCIBus *b;
>>      PCIDevice *d;
>>      PCIHostState *s;
>> @@ -255,7 +262,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>      unsigned i;
>>      I440FXState *i440fx;
>>
>> -    dev = qdev_new(host_type);
>>      s = PCI_HOST_BRIDGE(dev);
>>      b = pci_root_bus_new(dev, NULL, pci_address_space,
>>                           address_space_io, 0, TYPE_PCI_BUS);
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 9c9f4ac74810..d8b9c4ebd748 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> -                    MemoryRegion **ram_memory);
>> +                    MemoryRegion **ram_memory,
>> +                    uint64_t pci_hole64_size);
>>  uint64_t pc_pci_hole64_start(void);
>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>>  void pc_basic_device_init(struct PCMachineState *pcms,
>> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
>> index f068aaba8fda..1299d6a2b0e4 100644
>> --- a/include/hw/pci-host/i440fx.h
>> +++ b/include/hw/pci-host/i440fx.h
>> @@ -36,7 +36,7 @@ struct PCII440FXState {
>>  #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
>>
>>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>> -                    PCII440FXState **pi440fx_state,
>> +                    DeviceState *dev, PCII440FXState **pi440fx_state,
>>                      MemoryRegion *address_space_mem,
>>                      MemoryRegion *address_space_io,
>>                      ram_addr_t ram_size,
>> @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                      MemoryRegion *pci_memory,
>>                      MemoryRegion *ram_memory);
>>
>> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
>>
>>  #endif
>>
>

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Joao Martins 3 years, 11 months ago

On 2/21/22 15:28, Joao Martins wrote:
> On 2/21/22 06:58, Igor Mammedov wrote:
>> On Fri, 18 Feb 2022 17:12:21 +0000
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> On 2/14/22 15:31, Igor Mammedov wrote:
>>>> On Mon, 14 Feb 2022 15:05:00 +0000
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 2/14/22 14:53, Igor Mammedov wrote:  
>>>>>> On Mon,  7 Feb 2022 20:24:20 +0000
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>>> +{
>>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>> +
>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (pcmc->has_reserved_memory &&
>>>>>>> +       (machine->ram_size < machine->maxram_size)) {
>>>>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>>>>>>> +         device_mem_size) < AMD_HT_START) {    
>>>>>>     
>>>>> And I was at two minds on this one, whether to advertise *always*
>>>>> the 1T hole, regardless of relocation. Or account the size
>>>>> we advertise for the pci64 hole and make that part of the equation
>>>>> above. Although that has the flaw that the firmware at admin request
>>>>> may pick some ludricous number (limited by maxphysaddr).  
>>>>
>>>> it this point we have only pci64 hole size (machine property),
>>>> so I'd include that in equation to make firmware assign
>>>> pci64 aperture above HT range.
>>>>
>>>> as for checking maxphysaddr, we can only check 'default' PCI hole
>>>> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
>>>> and hard error on it.
>>>>   
>>>
>>> Igor, in the context of your comment above, I'll be introducing another
>>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
>>> that all used/max physaddr space checks are consolidated in pc_memory_init().
>>>
>>> To that end, the changes involve mainly moves the the pcihost qdev creation
>>> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
>>> needs slightly more of a dance to extract that from i440fx_init() and also
>>> because most i440fx state is private (hence the new helper for size). But
>>> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
>>> it is just to extra pci_hole64_size from the object + user passed args (-global etc).
>>
>> Shuffling init order is looks too intrusive and in practice
>> quite risky.
> 
> Yeah, it is an intrusive change sadly. Although, why would you consider it
> risky (curious)? We are "only" moving this:
> 
> 	qdev_new(host_type);
> 
> ... located at the very top of i440fx_init() and called at the top of q35_host
> initilization to be instead before pc_memory_init(). And that means that an instance of an
> object gets made and its properties initialized i.e. @instance_init of q35 and i440fx and
> its properties. I don't see a particular dependence in PC code to tell that this
> would affect its surroundings parts.
> 
> The actual pcihost-related initialization is still kept entirely unchanged.
> 
>> How about moving maxphysaddr check to pc_machine_done() instead?
>> (this way you won't have to move pcihost around)
>>
> I can move it. But be there will be a slight disconnect between what pc_memory_init()
> checks against "max used address"  between ... dictating if the 4G mem start should change
> to 1T or not ...  and when the phys-bits check is actually made which includes the pci hole64.
> 
> For example, we create a guest with maxram 1009G (which 4G mem start would get at
> unchanged) and then the pci_hole64 goes likely assigned across the rest until 1023G (i.e.
> across the HT region). Here it would need an extra check and fail if pci_hole64 crosses
> the HT region. Whereby if it was added in pc_memory_init() then we could just relocate to
> 1T and the guest creation would still proceed.
> 
Actually, on a second thought, not having the pci_hole64_size
to pc_memory_init() to instead introduce it in pc_machine_done() to
include pci_hole64_size looks like a half-step :( because otherwise the user
needs to play games on how much it should include as -m|-object-memory*
number to force it to relocate to 1T and avoid the guest creation
failure.

So either we:

1) consider pci_hole64_size in pc_memory_init() and move default
pci-hole64-size (sort of what I was proposing in this last exchange)

2) keep the maxphysaddr check as is (but generic in pc_memory_init()
and disregarding pci-hole64-size) and advertise the actual 1T reserved hole
(regardless of above-4G relocation) letting BIOS consider reserved regions
when picking hole64-start.

Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable

Posted by Igor Mammedov 3 years, 11 months ago

On Tue, 22 Feb 2022 11:00:55 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/21/22 15:28, Joao Martins wrote:
> > On 2/21/22 06:58, Igor Mammedov wrote:  
> >> On Fri, 18 Feb 2022 17:12:21 +0000
> >> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>  
> >>> On 2/14/22 15:31, Igor Mammedov wrote:  
> >>>> On Mon, 14 Feb 2022 15:05:00 +0000
> >>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>> On 2/14/22 14:53, Igor Mammedov wrote:    
> >>>>>> On Mon,  7 Feb 2022 20:24:20 +0000
> >>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>>> +{
> >>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>> +    ram_addr_t device_mem_size = 0;
> >>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>> +
> >>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>> +        return;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    if (pcmc->has_reserved_memory &&
> >>>>>>> +       (machine->ram_size < machine->maxram_size)) {
> >>>>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> >>>>>>> +         device_mem_size) < AMD_HT_START) {      
> >>>>>>       
> >>>>> And I was at two minds on this one, whether to advertise *always*
> >>>>> the 1T hole, regardless of relocation. Or account the size
> >>>>> we advertise for the pci64 hole and make that part of the equation
> >>>>> above. Although that has the flaw that the firmware at admin request
> >>>>> may pick some ludricous number (limited by maxphysaddr).    
> >>>>
> >>>> it this point we have only pci64 hole size (machine property),
> >>>> so I'd include that in equation to make firmware assign
> >>>> pci64 aperture above HT range.
> >>>>
> >>>> as for checking maxphysaddr, we can only check 'default' PCI hole
> >>>> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
> >>>> and hard error on it.
> >>>>     
> >>>
> >>> Igor, in the context of your comment above, I'll be introducing another
> >>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
> >>> that all used/max physaddr space checks are consolidated in pc_memory_init().
> >>>
> >>> To that end, the changes involve mainly moves the the pcihost qdev creation
> >>> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
> >>> needs slightly more of a dance to extract that from i440fx_init() and also
> >>> because most i440fx state is private (hence the new helper for size). But
> >>> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
> >>> it is just to extra pci_hole64_size from the object + user passed args (-global etc).  
> >>
> >> Shuffling init order is looks too intrusive and in practice
> >> quite risky.  
> > 
> > Yeah, it is an intrusive change sadly. Although, why would you consider it
> > risky (curious)? We are "only" moving this:
> > 
> > 	qdev_new(host_type);
> > 
> > ... located at the very top of i440fx_init() and called at the top of q35_host
> > initilization to be instead before pc_memory_init(). And that means that an instance of an
> > object gets made and its properties initialized i.e. @instance_init of q35 and i440fx and
> > its properties. I don't see a particular dependence in PC code to tell that this
> > would affect its surroundings parts.

I don't see anything wrong her as well
(but I'm probably overcautious since more often than not
changing initialization order, has broken things in non obvious ways)

> > 
> > The actual pcihost-related initialization is still kept entirely unchanged.
> >   
> >> How about moving maxphysaddr check to pc_machine_done() instead?
> >> (this way you won't have to move pcihost around)
> >>  
> > I can move it. But be there will be a slight disconnect between what pc_memory_init()
> > checks against "max used address"  between ... dictating if the 4G mem start should change
> > to 1T or not ...  and when the phys-bits check is actually made which includes the pci hole64.
> > 
> > For example, we create a guest with maxram 1009G (which 4G mem start would get at
> > unchanged) and then the pci_hole64 goes likely assigned across the rest until 1023G (i.e.
> > across the HT region). Here it would need an extra check and fail if pci_hole64 crosses
> > the HT region. Whereby if it was added in pc_memory_init() then we could just relocate to
> > 1T and the guest creation would still proceed.
> >   
> Actually, on a second thought, not having the pci_hole64_size
> to pc_memory_init() to instead introduce it in pc_machine_done() to
> include pci_hole64_size looks like a half-step :( because otherwise the user
> needs to play games on how much it should include as -m|-object-memory*
> number to force it to relocate to 1T and avoid the guest creation
> failure.
> 
> So either we:
> 
> 1) consider pci_hole64_size in pc_memory_init() and move default
> pci-hole64-size (sort of what I was proposing in this last exchange)

ok, lets go with this approach 

> 
> 2) keep the maxphysaddr check as is (but generic in pc_memory_init()
> and disregarding pci-hole64-size) and advertise the actual 1T reserved hole
> (regardless of above-4G relocation) letting BIOS consider reserved regions
> when picking hole64-start.
>

[PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState
[PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
[PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
[PATCH RFCv2 4/4] i386/pc: Restrict AMD-only enforcing of valid IOVAs to new machine type