[PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU

Joao Martins posted 6 patches 2 years, 10 months ago
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20210622154905.30858-1-joao.m.martins@oracle.com
Maintainers: Richard Henderson <richard.henderson@linaro.org>, Eduardo Habkost <ehabkost@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, "Michael S. Tsirkin" <mst@redhat.com>, Igor Mammedov <imammedo@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
hw/i386/acpi-build.c  |  22 ++++-
hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
hw/i386/pc_piix.c     |   2 +
hw/i386/pc_q35.c      |   2 +
hw/pci-host/i440fx.c  |   4 +-
hw/pci-host/q35.c     |   4 +-
include/hw/i386/pc.h  |  62 ++++++++++++-
include/hw/i386/x86.h |   4 +
target/i386/cpu.h     |   3 +
9 files changed, 288 insertions(+), 21 deletions(-)
[PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Joao Martins 2 years, 10 months ago
Hey,

This series lets Qemu properly spawn i386 guests with >= 1Tb of memory with
VFIO, particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA in a DMA_MAP ioctl is valid and
will return -EINVAL in those cases. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with an IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses, located
at FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.
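
For reference, a minimal standalone sketch of that window (the constant names
are mine, not from the series):

#include <stdbool.h>
#include <stdint.h>

/* HyperTransport reserved window on AMD hosts (illustrative names). */
#define AMD_HT_START 0xFD00000000ULL    /* FD_0000_0000h */
#define AMD_HT_END   0xFFFFFFFFFFULL    /* FF_FFFF_FFFFh (inclusive) */

/* True if [addr, addr + size) overlaps the HyperTransport hole. */
static bool overlaps_amd_ht_hole(uint64_t addr, uint64_t size)
{
    return addr <= AMD_HT_END && addr + size > AMD_HT_START;
}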

VFIO DMA_MAP calls in this IOVA address range fail this check and hence return
-EINVAL, consequently failing the creation of guests bigger than 1010G. Example
of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)

Prior to v5.4 we could map using these IOVAs, *but* that's still not the right
thing to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST)
or spurious guest VF failures from the resulting IOMMU target abort (see Errata
1155 [2]), as documented in the links below.

This series tries to address that by dealing with this AMD-specific 1Tb hole,
similarly to how we deal with the 4G hole today in x86 in general. It is split
as follows:

* patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
           which is also used in other parts of pc/acpi besides MR creation.
           The allowed IOVA ranges *only* change if it's an AMD host, so there
           is no change for Intel. We walk the allowed ranges for memory above
           4G and add an E820_RESERVED entry every time we find a hole (which
           is at the 1TB boundary). A rough sketch of this layout follows the
           patch list below.

           NOTE: For the purposes of this RFC, I rely on cpuid in hw/i386/pc.c,
           but I understand that it doesn't cover the non-x86 host case running
           TCG.

           Additionally, as an alternative to the hardcoded ranges we use today,
           VFIO could advertise the platform's valid IOVA ranges without
           requiring a PCI device to be added to the vfio container. That would
           mean fetching the valid IOVA ranges from VFIO rather than hardcoding
           them. Sadly, that wouldn't work for older hypervisors.

* patch 2 - 5: cover the remaining parts of the surrounding memory map,
               particularly ACPI SRAT ranges, CMOS, hotplug, as well as the
               PCI 64-bit hole.

* patch 6: Add a machine property which is disabled for older machine types (<=6.0)
	   to keep things as is.
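
As a rough, standalone illustration of the layout patch 1 aims for (not code
from the series; only the HyperTransport range above is taken from the cover
letter, the rest is made up for the sketch):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define GiB          (1ULL << 30)
#define AMD_HT_START 0xFD00000000ULL    /* start of the HT hole */
#define AMD_HT_END   0x10000000000ULL   /* 1 TiB, exclusive end of the hole */

/*
 * Lay out the above-4G portion of guest RAM, skipping the HyperTransport
 * hole and reporting the hole itself as reserved.
 */
static void layout_above_4g(uint64_t above_4g_mem_size)
{
    uint64_t start = 4 * GiB;
    uint64_t end = start + above_4g_mem_size;

    if (end <= AMD_HT_START) {
        printf("RAM:      0x%" PRIx64 "-0x%" PRIx64 "\n", start, end);
        return;
    }

    /* RAM below the hole, the reserved hole itself, then the spill-over. */
    printf("RAM:      0x%" PRIx64 "-0x%" PRIx64 "\n", start, AMD_HT_START);
    printf("RESERVED: 0x%" PRIx64 "-0x%" PRIx64 "\n", AMD_HT_START, AMD_HT_END);
    printf("RAM:      0x%" PRIx64 "-0x%" PRIx64 "\n",
           AMD_HT_END, AMD_HT_END + (end - AMD_HT_START));
}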

The 'consequence' of this approach is that we may need more than the default
phys-bits: e.g. a guest with 1024G will have ~13G placed after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40.
I can't figure out a reasonable way to establish the required phys-bits for
the memory map dynamically, especially considering that today there's already
a precedent of depending on the user to pick the right value of phys-bits
(regardless of this series).
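
A quick back-of-the-envelope sketch of the phys-bits point (standalone C, not
code from the series): ~13G spilling past the 1 TiB mark pushes the highest
guest physical address beyond what 40 bits can express.

#include <stdint.h>
#include <stdio.h>

/* Smallest number of physical address bits able to address up to maxaddr. */
static unsigned required_phys_bits(uint64_t maxaddr)
{
    unsigned bits = 1;

    while (bits < 64 && (1ULL << bits) < maxaddr) {
        bits++;
    }
    return bits;
}

int main(void)
{
    /* ~13 GiB of a 1024G guest ends up relocated above the 1 TiB boundary. */
    uint64_t max_gpa = (1ULL << 40) + (13ULL << 30);

    printf("phys-bits needed: %u\n", required_phys_bits(max_gpa));  /* 41 */
    return 0;
}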

Additionally, the reserved region is always added regardless of whether VFIO
devices are present, in order to cover the VFIO device hotplug case.

Other options considered:

a) Consider the reserved range part of RAM, and just mark it as
E820_RESERVED without an SPA allocated for it. A -m 1024G guest would then
only allocate 1010G of RAM and the remainder would be marked reserved.
This is not what we do today for the 4G hole, i.e. the RAM actually
allocated is the value specified by the user and is thus all available
to the guest (IIUC).

b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but
done at a later stage, when the RAM memory regions are already allocated at
the invalid GPAs. That alone wouldn't fix the way MRs are laid out, which is
where the problem fundamentally lies.

The proposed approach in this series works regardless of the kernel, and is
relevant for both old and new Qemu.

Open to alternatives/comments/suggestions that I should pursue instead.

	Joao

[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Joao Martins (6):
  i386/pc: Account IOVA reserved ranges above 4G boundary
  i386/pc: Round up the hotpluggable memory within valid IOVA ranges
  pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
  i386/pc: Keep PCI 64-bit hole within usable IOVA space
  i386/acpi: Fix SRAT ranges in accordance to usable IOVA
  i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs

 hw/i386/acpi-build.c  |  22 ++++-
 hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
 hw/i386/pc_piix.c     |   2 +
 hw/i386/pc_q35.c      |   2 +
 hw/pci-host/i440fx.c  |   4 +-
 hw/pci-host/q35.c     |   4 +-
 include/hw/i386/pc.h  |  62 ++++++++++++-
 include/hw/i386/x86.h |   4 +
 target/i386/cpu.h     |   3 +
 9 files changed, 288 insertions(+), 21 deletions(-)

-- 
2.17.1


Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Alex Williamson 2 years, 10 months ago
On Tue, 22 Jun 2021 16:48:59 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Hey,
> 
> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
> when running on AMD systems with an IOMMU.
> 
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> 
> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> of the failure:
> 
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> 	failed to setup container for group 258: memory listener initialization failed:
> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> 
> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> as documented on the links down below.
> 
> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
> as following:
> 
> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>            which gets used too in other parts of pc/acpi besides MR creation. The
> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> 	   Intel. We walk the allowed ranges for memory above 4G, and
> 	   add a E820_RESERVED type everytime we find a hole (which is at the
> 	   1TB boundary).
> 	   
> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> 	   understand that it doesn't cover the non-x86 host case running TCG.
> 
> 	   Additionally, an alternative to hardcoded ranges as we do today,
> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
> 	   requiring to have a PCI device added in the vfio container. That would
> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.


$ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
0x00000000fee00000 0x00000000feefffff msi
0x000000fd00000000 0x000000ffffffffff reserved

Ideally we might take that into account on all hosts, but of course
then we run into massive compatibility issues when we consider
migration.  We run into similar problems when people try to assign
devices to non-x86 TCG hosts, where the arch doesn't have a natural
memory hole overlapping the msi range.

The issue here is similar to trying to find a set of supported CPU
flags across hosts: QEMU only has visibility to the host where it runs,
so an upper level tool needs to be able to pass through information about
compatibility to all possible migration targets.  Towards that end, we
should probably have command line options that allow specifying either
specific usable or reserved GPA address ranges.  For example something
like:
	--reserved-mem-ranges=host

Or explicitly:

	--reserved-mem-ranges=13G@1010G,1M@4078M
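
Hypothetically, a <size>@<base> token like the ones above could be parsed along
these lines (a standalone sketch with invented names; neither the option nor
this parser exists in the series):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a "<size>@<base>" token such as "13G@1010G" (K/M/G suffixes only). */
static bool parse_reserved_range(const char *tok, uint64_t *size, uint64_t *base)
{
    uint64_t vals[2];
    char *end;

    for (int i = 0; i < 2; i++) {
        vals[i] = strtoull(tok, &end, 10);
        switch (*end) {
        case 'G': vals[i] <<= 30; end++; break;
        case 'M': vals[i] <<= 20; end++; break;
        case 'K': vals[i] <<= 10; end++; break;
        }
        if (i == 0 && *end != '@') {
            return false;
        }
        tok = end + 1;
    }
    *size = vals[0];
    *base = vals[1];
    return true;
}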

> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
> 
> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
> 	   to keep things as is.
> 
> The 'consequence' of this approach is that we may need more than the default
> phys-bits e.g. a guest with 1024G, will have ~13G be put after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40.
> I can't figure a reasonable way to establish the required phys-bits we
> need for the memory map in a dynamic way, especially considering that
> today there's already a precedent to depend on the user to pick the right value
> of phys-bits (regardless of this series).
> 
> Additionally, the reserved region is always added regardless of whether we have
> VFIO devices to cover the VFIO device hotplug case.

Various migration issues as you note later in the series.

> Other options considered:
> 
> a) Consider the reserved range part of RAM, and just marking it as
> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
> only allocate 1010G of RAM and the remaining would be marked reserved.
> This is not how what we do today for the 4G hole i.e. the RAM
> actually allocated is the value specified by the user and thus RAM available
> to the guest (IIUC).
> 
> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
> later stage when RAM mrs are already allocated at the invalid GPAs. Albeit that
> alone wouldn't fix the way MRs are laid out which is where fundamentally the
> problem is.

Data corruption with b) should the guest ever use memory within this
range as a DMA target.  Thanks,

Alex
 
> The proposed approach in this series works regardless of the kernel, and
> relevant for old and new Qemu.
> 
> Open to alternatives/comments/suggestions that I should pursue instead.
> 
> 	Joao
> 
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> 
> Joao Martins (6):
>   i386/pc: Account IOVA reserved ranges above 4G boundary
>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
> 
>  hw/i386/acpi-build.c  |  22 ++++-
>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
>  hw/i386/pc_piix.c     |   2 +
>  hw/i386/pc_q35.c      |   2 +
>  hw/pci-host/i440fx.c  |   4 +-
>  hw/pci-host/q35.c     |   4 +-
>  include/hw/i386/pc.h  |  62 ++++++++++++-
>  include/hw/i386/x86.h |   4 +
>  target/i386/cpu.h     |   3 +
>  9 files changed, 288 insertions(+), 21 deletions(-)
> 


Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by David Edmondson 2 years, 10 months ago
On Tuesday, 2021-06-22 at 15:16:29 -06, Alex Williamson wrote:

>> 	   Additionally, an alternative to hardcoded ranges as we do today,
>> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
>> 	   requiring to have a PCI device added in the vfio container. That would
>> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
>> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.
>
>
> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> 0x00000000fee00000 0x00000000feefffff msi
> 0x000000fd00000000 0x000000ffffffffff reserved
>
> Ideally we might take that into account on all hosts, but of course
> then we run into massive compatibility issues when we consider
> migration.  We run into similar problems when people try to assign
> devices to non-x86 TCG hosts, where the arch doesn't have a natural
> memory hole overlapping the msi range.
>
> The issue here is similar to trying to find a set of supported CPU
> flags across hosts, QEMU only has visibility to the host where it runs,
> an upper level tool needs to be able to pass through information about
> compatibility to all possible migration targets.  Towards that end, we
> should probably have command line options that either allow to specify
> specific usable or reserved GPA address ranges.  For example something
> like:
> 	--reserved-mem-ranges=host
>
> Or explicitly:
>
> 	--reserved-mem-ranges=13G@1010G,1M@4078M

Would this not naturally be a property of a machine model?

dme.
-- 
Seems I'm not alone at being alone.

Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Alex Williamson 2 years, 10 months ago
On Wed, 23 Jun 2021 08:40:56 +0100
David Edmondson <dme@dme.org> wrote:

> On Tuesday, 2021-06-22 at 15:16:29 -06, Alex Williamson wrote:
> 
> >> 	   Additionally, an alternative to hardcoded ranges as we do today,
> >> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
> >> 	   requiring to have a PCI device added in the vfio container. That would
> >> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
> >> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.  
> >
> >
> > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > 0x00000000fee00000 0x00000000feefffff msi
> > 0x000000fd00000000 0x000000ffffffffff reserved
> >
> > Ideally we might take that into account on all hosts, but of course
> > then we run into massive compatibility issues when we consider
> > migration.  We run into similar problems when people try to assign
> > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > memory hole overlapping the msi range.
> >
> > The issue here is similar to trying to find a set of supported CPU
> > flags across hosts, QEMU only has visibility to the host where it runs,
> > an upper level tool needs to be able to pass through information about
> > compatibility to all possible migration targets.  Towards that end, we
> > should probably have command line options that either allow to specify
> > specific usable or reserved GPA address ranges.  For example something
> > like:
> > 	--reserved-mem-ranges=host
> >
> > Or explicitly:
> >
> > 	--reserved-mem-ranges=13G@1010G,1M@4078M  
> 
> Would this not naturally be a property of a machine model?

That's an option too, maybe a better one given that we already know how
to manage different machine types.  We probably could not adopt QEMU
v6.1 versions of q35 and 440fx to include this reserved range given the
physical address width downsides Joao mentions.  The above was meant to
be more of an explicit method to match the VM address space to the host
versus the implicit mechanism of the machine type.  A new machine type
at least makes the decision a user policy rather than a magic artifact
of the processor vendor where the VM was first created, but it also
doesn't have the flexibility of the extra command line options.  Thanks,

Alex


Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Joao Martins 2 years, 10 months ago
On 6/22/21 10:16 PM, Alex Williamson wrote:
> On Tue, 22 Jun 2021 16:48:59 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> Hey,
>>
>> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
>> when running on AMD systems with an IOMMU.
>>
>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>
>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
>> of the failure:
>>
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>> 	failed to setup container for group 258: memory listener initialization failed:
>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>
>> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>> as documented on the links down below.
>>
>> This series tries to address that by dealing with this AMD-specific 1Tb hole,
>> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
>> as following:
>>
>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>            which gets used too in other parts of pc/acpi besides MR creation. The
>> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
>> 	   Intel. We walk the allowed ranges for memory above 4G, and
>> 	   add a E820_RESERVED type everytime we find a hole (which is at the
>> 	   1TB boundary).
>> 	   
>> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>> 	   understand that it doesn't cover the non-x86 host case running TCG.
>>
>> 	   Additionally, an alternative to hardcoded ranges as we do today,
>> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
>> 	   requiring to have a PCI device added in the vfio container. That would
>> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
>> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.
> 
> 
> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> 0x00000000fee00000 0x00000000feefffff msi
> 0x000000fd00000000 0x000000ffffffffff reserved
> 
Yeap, I am aware.

The VFIO advertising extension was suggested because we already advertise the above
info, albeit only behind a non-empty vfio container; e.g. we seem to use it in
collect_usable_iova_ranges().

> Ideally we might take that into account on all hosts, but of course
> then we run into massive compatibility issues when we consider
> migration.  We run into similar problems when people try to assign
> devices to non-x86 TCG hosts, where the arch doesn't have a natural
> memory hole overlapping the msi range.
> 
> The issue here is similar to trying to find a set of supported CPU
> flags across hosts, QEMU only has visibility to the host where it runs,
> an upper level tool needs to be able to pass through information about
> compatibility to all possible migration targets.

I agree with your generic sentiment (and idea) but are we sure this is really something as
dynamic/needing a common denominator like CPU features? The memory map looks to be deeply
embedded in the devices (ARM) or machine model (x86) that we pass in and doesn't change very
often. pc/q35 is one very good example, because it hasn't changed since its inception [a
decade?] (and this limitation is there only for multi-socket AMD machines with an IOMMU and
more than 1Tb). Additionally, there might be architectural impositions like on x86, e.g.
CMOS seems to tie in with memory above certain boundaries. Unless by migration targets you
also mean covering migration between Intel and AMD hosts (which may need to keep the
reserved range in the common denominator nonetheless).

> Towards that end, we
> should probably have command line options that either allow to specify
> specific usable or reserved GPA address ranges.  For example something
> like:
> 	--reserved-mem-ranges=host
> 
> Or explicitly:
> 
> 	--reserved-mem-ranges=13G@1010G,1M@4078M
> 
I like the added option, particularly because it lets everyone work around similar issues.
I remember a series before that had similar issues on ARM (but I can't remember now what it
was).

>> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
>> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
>>
>> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
>> 	   to keep things as is.
>>
>> The 'consequence' of this approach is that we may need more than the default
>> phys-bits e.g. a guest with 1024G, will have ~13G be put after the 1TB
>> address, consequently needing 41 phys-bits as opposed to the default of 40.
>> I can't figure a reasonable way to establish the required phys-bits we
>> need for the memory map in a dynamic way, especially considering that
>> today there's already a precedent to depend on the user to pick the right value
>> of phys-bits (regardless of this series).
>>
>> Additionally, the reserved region is always added regardless of whether we have
>> VFIO devices to cover the VFIO device hotplug case.
> 
> Various migration issues as you note later in the series.
> 
/me nods

>> Other options considered:
>>
>> a) Consider the reserved range part of RAM, and just marking it as
>> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
>> only allocate 1010G of RAM and the remaining would be marked reserved.
>> This is not how what we do today for the 4G hole i.e. the RAM
>> actually allocated is the value specified by the user and thus RAM available
>> to the guest (IIUC).
>>
>> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
>> later stage when RAM mrs are already allocated at the invalid GPAs. Albeit that
>> alone wouldn't fix the way MRs are laid out which is where fundamentally the
>> problem is.
> 
> Data corruption with b) should the guest ever use memory within this
> range as a DMA target.  Thanks,
> 
Yeap.

> Alex
>  
>> The proposed approach in this series works regardless of the kernel, and
>> relevant for old and new Qemu.
>>
>> Open to alternatives/comments/suggestions that I should pursue instead.
>>
>> 	Joao
>>
>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
>>
>> Joao Martins (6):
>>   i386/pc: Account IOVA reserved ranges above 4G boundary
>>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
>>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
>>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
>>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
>>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
>>
>>  hw/i386/acpi-build.c  |  22 ++++-
>>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
>>  hw/i386/pc_piix.c     |   2 +
>>  hw/i386/pc_q35.c      |   2 +
>>  hw/pci-host/i440fx.c  |   4 +-
>>  hw/pci-host/q35.c     |   4 +-
>>  include/hw/i386/pc.h  |  62 ++++++++++++-
>>  include/hw/i386/x86.h |   4 +
>>  target/i386/cpu.h     |   3 +
>>  9 files changed, 288 insertions(+), 21 deletions(-)
>>
> 

Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Igor Mammedov 2 years, 10 months ago
On Wed, 23 Jun 2021 10:30:29 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/22/21 10:16 PM, Alex Williamson wrote:
> > On Tue, 22 Jun 2021 16:48:59 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> Hey,
> >>
> >> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
> >> when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> >> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> >> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
> >>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> >> of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> >> 	failed to setup container for group 258: memory listener initialization failed:
> >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> >> as documented on the links down below.
> >>
> >> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> >> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
> >> as following:
> >>
> >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> >>            which gets used too in other parts of pc/acpi besides MR creation. The
> >> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> >> 	   Intel. We walk the allowed ranges for memory above 4G, and
> >> 	   add a E820_RESERVED type everytime we find a hole (which is at the
> >> 	   1TB boundary).
> >> 	   
> >> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> >> 	   understand that it doesn't cover the non-x86 host case running TCG.
> >>
> >> 	   Additionally, an alternative to hardcoded ranges as we do today,
> >> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
> >> 	   requiring to have a PCI device added in the vfio container. That would
> >> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
> >> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.  
> > 
> > 
> > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > 0x00000000fee00000 0x00000000feefffff msi
> > 0x000000fd00000000 0x000000ffffffffff reserved
> >   
> Yeap, I am aware.
> 
> The VFIO advertising extension was just because we already advertise the above info,
> although behind a non-empty vfio container e.g. we seem to use that for example in
> collect_usable_iova_ranges().
> 
> > Ideally we might take that into account on all hosts, but of course
> > then we run into massive compatibility issues when we consider
> > migration.  We run into similar problems when people try to assign
> > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > memory hole overlapping the msi range.
> > 
> > The issue here is similar to trying to find a set of supported CPU
> > flags across hosts, QEMU only has visibility to the host where it runs,
> > an upper level tool needs to be able to pass through information about
> > compatibility to all possible migration targets.  
> 
> I agree with your generic sentiment (and idea) but are we sure this is really something as
> dynamic/needing-denominator like CPU Features? The memory map looks to be deeply embedded
> in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
> pc/q35 is one very good example, because this hasn't changed since it's inception [a
> decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
> with more than 1Tb). Additionally, there might be architectural impositions like on x86
> e.g. CMOS seems to tie in with memory above certain boundaries. Unless by a migration
> targets, you mean to also cover you migrate between Intel and AMD hosts (which may need to
> keep the reserved range nonetheless in the common denominator)
> 
> > Towards that end, we
> > should probably have command line options that either allow to specify
> > specific usable or reserved GPA address ranges.  For example something
> > like:
> > 	--reserved-mem-ranges=host
> > 
> > Or explicitly:
> > 
> > 	--reserved-mem-ranges=13G@1010G,1M@4078M

if we can do without adding any option at all, it would be even better,
since user/mgmt won't need to care about it either.

> >   
> I like the added option, particularly because it lets everyone workaround similar issues.
> I remember a series before that had similar issues on ARM (but can't remember now what it
> was).
> 
> >> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
> >> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
> >>
> >> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
> >> 	   to keep things as is.
> >>
> >> The 'consequence' of this approach is that we may need more than the default
> >> phys-bits e.g. a guest with 1024G, will have ~13G be put after the 1TB
> >> address, consequently needing 41 phys-bits as opposed to the default of 40.
> >> I can't figure a reasonable way to establish the required phys-bits we
> >> need for the memory map in a dynamic way, especially considering that
> >> today there's already a precedent to depend on the user to pick the right value
> >> of phys-bits (regardless of this series).
> >>
> >> Additionally, the reserved region is always added regardless of whether we have
> >> VFIO devices to cover the VFIO device hotplug case.  
> > 
> > Various migration issues as you note later in the series.
> >   
> /me nods
> 
> >> Other options considered:
> >>
> >> a) Consider the reserved range part of RAM, and just marking it as
> >> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
> >> only allocate 1010G of RAM and the remaining would be marked reserved.
> >> This is not how what we do today for the 4G hole i.e. the RAM
> >> actually allocated is the value specified by the user and thus RAM available
> >> to the guest (IIUC).

It's partially true; we don't care about small MMIO regions that
overlay on top of low memory. But concealing RAM behind a large PCI
hole would be a significant waste (especially when we are speaking
about the PCI hole below 4GB).

I wonder how it works on real hardware, i.e. does the memory controller
remap physical RAM at the 1T hole region, just hide it, or simply not
place any DIMMs there?


> >> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
> >> later stage when RAM mrs are already allocated at the invalid GPAs. Albeit that
> >> alone wouldn't fix the way MRs are laid out which is where fundamentally the
> >> problem is.  
> > 
> > Data corruption with b) should the guest ever use memory within this
> > range as a DMA target.  Thanks,
> >   
> Yeap.
> 
> > Alex
> >    
> >> The proposed approach in this series works regardless of the kernel, and
> >> relevant for old and new Qemu.
> >>
> >> Open to alternatives/comments/suggestions that I should pursue instead.
> >>
> >> 	Joao
> >>
> >> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> >> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> >>
> >> Joao Martins (6):
> >>   i386/pc: Account IOVA reserved ranges above 4G boundary
> >>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
> >>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
> >>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
> >>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
> >>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
> >>
> >>  hw/i386/acpi-build.c  |  22 ++++-
> >>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
> >>  hw/i386/pc_piix.c     |   2 +
> >>  hw/i386/pc_q35.c      |   2 +
> >>  hw/pci-host/i440fx.c  |   4 +-
> >>  hw/pci-host/q35.c     |   4 +-
> >>  include/hw/i386/pc.h  |  62 ++++++++++++-
> >>  include/hw/i386/x86.h |   4 +
> >>  target/i386/cpu.h     |   3 +
> >>  9 files changed, 288 insertions(+), 21 deletions(-)
> >>  
> >   
> 


Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Joao Martins 2 years, 10 months ago

On 6/23/21 12:58 PM, Igor Mammedov wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/22/21 10:16 PM, Alex Williamson wrote:
>>> On Tue, 22 Jun 2021 16:48:59 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> Hey,
>>>>
>>>> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
>>>> when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>>>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>>>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>>>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>>>> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>>>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>>>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
>>>> of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>>>> 	failed to setup container for group 258: memory listener initialization failed:
>>>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
>>>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>>>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>>>> as documented on the links down below.
>>>>
>>>> This series tries to address that by dealing with this AMD-specific 1Tb hole,
>>>> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
>>>> as following:
>>>>
>>>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>>>            which gets used too in other parts of pc/acpi besides MR creation. The
>>>> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
>>>> 	   Intel. We walk the allowed ranges for memory above 4G, and
>>>> 	   add a E820_RESERVED type everytime we find a hole (which is at the
>>>> 	   1TB boundary).
>>>> 	   
>>>> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>>>> 	   understand that it doesn't cover the non-x86 host case running TCG.
>>>>
>>>> 	   Additionally, an alternative to hardcoded ranges as we do today,
>>>> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
>>>> 	   requiring to have a PCI device added in the vfio container. That would
>>>> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
>>>> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.  
>>>
>>>
>>> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
>>> 0x00000000fee00000 0x00000000feefffff msi
>>> 0x000000fd00000000 0x000000ffffffffff reserved
>>>   
>> Yeap, I am aware.
>>
>> The VFIO advertising extension was just because we already advertise the above info,
>> although behind a non-empty vfio container e.g. we seem to use that for example in
>> collect_usable_iova_ranges().
>>
>>> Ideally we might take that into account on all hosts, but of course
>>> then we run into massive compatibility issues when we consider
>>> migration.  We run into similar problems when people try to assign
>>> devices to non-x86 TCG hosts, where the arch doesn't have a natural
>>> memory hole overlapping the msi range.
>>>
>>> The issue here is similar to trying to find a set of supported CPU
>>> flags across hosts, QEMU only has visibility to the host where it runs,
>>> an upper level tool needs to be able to pass through information about
>>> compatibility to all possible migration targets.  
>>
>> I agree with your generic sentiment (and idea) but are we sure this is really something as
>> dynamic/needing-denominator like CPU Features? The memory map looks to be deeply embedded
>> in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
>> pc/q35 is one very good example, because this hasn't changed since it's inception [a
>> decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
>> with more than 1Tb). Additionally, there might be architectural impositions like on x86
>> e.g. CMOS seems to tie in with memory above certain boundaries. Unless by a migration
>> targets, you mean to also cover you migrate between Intel and AMD hosts (which may need to
>> keep the reserved range nonetheless in the common denominator)
>>
>>> Towards that end, we
>>> should probably have command line options that either allow to specify
>>> specific usable or reserved GPA address ranges.  For example something
>>> like:
>>> 	--reserved-mem-ranges=host
>>>
>>> Or explicitly:
>>>
>>> 	--reserved-mem-ranges=13G@1010G,1M@4078M
> 
> if we can do without adding any option at all it will be even better
> since user/mgmt won't need to care about it as well.
> 
Yeap.

But should folks want the --reserved-mem-ranges approach, perhaps the default ought to be
'host', and let the user customize the memmap should they deem it necessary.

>>>   
>> I like the added option, particularly because it lets everyone workaround similar issues.
>> I remember a series before that had similar issues on ARM (but can't remember now what it
>> was).
>>
>>>> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
>>>> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
>>>>
>>>> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
>>>> 	   to keep things as is.
>>>>
>>>> The 'consequence' of this approach is that we may need more than the default
>>>> phys-bits e.g. a guest with 1024G, will have ~13G be put after the 1TB
>>>> address, consequently needing 41 phys-bits as opposed to the default of 40.
>>>> I can't figure a reasonable way to establish the required phys-bits we
>>>> need for the memory map in a dynamic way, especially considering that
>>>> today there's already a precedent to depend on the user to pick the right value
>>>> of phys-bits (regardless of this series).
>>>>
>>>> Additionally, the reserved region is always added regardless of whether we have
>>>> VFIO devices to cover the VFIO device hotplug case.  
>>>
>>> Various migration issues as you note later in the series.
>>>   
>> /me nods
>>
>>>> Other options considered:
>>>>
>>>> a) Consider the reserved range part of RAM, and just marking it as
>>>> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
>>>> only allocate 1010G of RAM and the remaining would be marked reserved.
>>>> This is not how what we do today for the 4G hole i.e. the RAM
>>>> actually allocated is the value specified by the user and thus RAM available
>>>> to the guest (IIUC).
> 
> it's partially true, we don't care about small MMIO regions that
> overlay on top of low memory. But concealing RAM behind large PCI
> hole would be a significant waste (especially when we are speaking
> about PCI hole below 4GB)
> 
> I wonder how it works on real hardware?
> i.e. does memory controller remap physical RAM at 1T hole region, just hides it
> or just doesn't place any DIMMs there?
> 
In real hardware you lose 12G of RAM IIUC (and the range is marked as reserved). But
whether said range is actually backed by an SPA might depend on the CPU family. Suravee
is CCed and can probably keep me honest here :)

>>>> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
>>>> later stage when RAM mrs are already allocated at the invalid GPAs. Albeit that
>>>> alone wouldn't fix the way MRs are laid out which is where fundamentally the
>>>> problem is.  
>>>
>>> Data corruption with b) should the guest ever use memory within this
>>> range as a DMA target.  Thanks,
>>>   
>> Yeap.
>>
>>> Alex
>>>    
>>>> The proposed approach in this series works regardless of the kernel, and
>>>> relevant for old and new Qemu.
>>>>
>>>> Open to alternatives/comments/suggestions that I should pursue instead.
>>>>
>>>> 	Joao
>>>>
>>>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>>>> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
>>>>
>>>> Joao Martins (6):
>>>>   i386/pc: Account IOVA reserved ranges above 4G boundary
>>>>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
>>>>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
>>>>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
>>>>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
>>>>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
>>>>
>>>>  hw/i386/acpi-build.c  |  22 ++++-
>>>>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
>>>>  hw/i386/pc_piix.c     |   2 +
>>>>  hw/i386/pc_q35.c      |   2 +
>>>>  hw/pci-host/i440fx.c  |   4 +-
>>>>  hw/pci-host/q35.c     |   4 +-
>>>>  include/hw/i386/pc.h  |  62 ++++++++++++-
>>>>  include/hw/i386/x86.h |   4 +
>>>>  target/i386/cpu.h     |   3 +
>>>>  9 files changed, 288 insertions(+), 21 deletions(-)
>>>>  
>>>   
>>
> 

Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Alex Williamson 2 years, 10 months ago
On Wed, 23 Jun 2021 10:30:29 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/22/21 10:16 PM, Alex Williamson wrote:
> > On Tue, 22 Jun 2021 16:48:59 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> Hey,
> >>
> >> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
> >> when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> >> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> >> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
> >>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> >> of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> >> 	failed to setup container for group 258: memory listener initialization failed:
> >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> >> as documented on the links down below.
> >>
> >> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> >> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
> >> as following:
> >>
> >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> >>            which gets used too in other parts of pc/acpi besides MR creation. The
> >> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> >> 	   Intel. We walk the allowed ranges for memory above 4G, and
> >> 	   add a E820_RESERVED type everytime we find a hole (which is at the
> >> 	   1TB boundary).
> >> 	   
> >> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> >> 	   understand that it doesn't cover the non-x86 host case running TCG.
> >>
> >> 	   Additionally, an alternative to hardcoded ranges as we do today,
> >> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
> >> 	   requiring to have a PCI device added in the vfio container. That would
> >> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
> >> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.  
> > 
> > 
> > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > 0x00000000fee00000 0x00000000feefffff msi
> > 0x000000fd00000000 0x000000ffffffffff reserved
> >   
> Yeap, I am aware.
> 
> The VFIO advertising extension was just because we already advertise the above info,
> although behind a non-empty vfio container e.g. we seem to use that for example in
> collect_usable_iova_ranges().

VFIO can't guess what groups you'll use to mark reserved ranges in an
empty container.  Each group might have unique ranges.  A container
enforcing ranges unrelated to the groups/devices in use doesn't make
sense.
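
For context, the per-group ranges are what the reserved_regions sysfs files
quoted earlier expose, and they can be read without any container involvement.
A minimal standalone sketch (error handling elided):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Print the reserved regions of one IOMMU group, e.g. group 258. */
static void print_reserved_regions(int group)
{
    char path[64], type[32];
    uint64_t start, end;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/kernel/iommu_groups/%d/reserved_regions", group);
    f = fopen(path, "r");
    if (!f) {
        return;
    }
    /* Each line is "<start> <end> <type>", as in the msi/reserved output above. */
    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s", &start, &end, type) == 3) {
        printf("group %d: 0x%016" PRIx64 "-0x%016" PRIx64 " (%s)\n",
               group, start, end, type);
    }
    fclose(f);
}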
 
> > Ideally we might take that into account on all hosts, but of course
> > then we run into massive compatibility issues when we consider
> > migration.  We run into similar problems when people try to assign
> > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > memory hole overlapping the msi range.
> > 
> > The issue here is similar to trying to find a set of supported CPU
> > flags across hosts, QEMU only has visibility to the host where it runs,
> > an upper level tool needs to be able to pass through information about
> > compatibility to all possible migration targets.  
> 
> I agree with your generic sentiment (and idea) but are we sure this is really something as
> dynamic/needing-denominator like CPU Features? The memory map looks to be deeply embedded
> in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
> pc/q35 is one very good example, because this hasn't changed since it's inception [a
> decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
> with more than 1Tb). Additionally, there might be architectural impositions like on x86
> e.g. CMOS seems to tie in with memory above certain boundaries. Unless by a migration
> targets, you mean to also cover you migrate between Intel and AMD hosts (which may need to
> keep the reserved range nonetheless in the common denominator)

I like the flexibility that being able to specify reserved ranges would
provide, but I agree that the machine memory map is usually deeply
embedded into the arch code and would probably be difficult to
generalize.  Cross vendor migration should be a consideration and only
an inter-system management policy could specify the importance of that.

Perhaps as David mentioned, this is really a machine type issue, where
the address width downsides you've noted might be sufficient reason
to introduce a new machine type that includes this memory hole.  That
would likely be the more traditional solution to this issue.  Thanks,

Alex


Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Dr. David Alan Gilbert 2 years, 10 months ago
* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> > On 6/22/21 10:16 PM, Alex Williamson wrote:
> > > On Tue, 22 Jun 2021 16:48:59 +0100
> > > Joao Martins <joao.m.martins@oracle.com> wrote:
> > >   
> > >> Hey,
> > >>
> > >> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
> > >> when running on AMD systems with an IOMMU.
> > >>
> > >> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> > >> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> > >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> > >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> > >> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> > >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> > >>
> > >> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
> > >>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> > >> of the failure:
> > >>
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> > >> 	failed to setup container for group 258: memory listener initialization failed:
> > >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> > >>
> > >> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> > >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> > >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> > >> as documented on the links down below.
> > >>
> > >> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> > >> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
> > >> as following:
> > >>
> > >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> > >>            which gets used too in other parts of pc/acpi besides MR creation. The
> > >> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> > >> 	   Intel. We walk the allowed ranges for memory above 4G, and
> > >> 	   add a E820_RESERVED type everytime we find a hole (which is at the
> > >> 	   1TB boundary).
> > >> 	   
> > >> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> > >> 	   understand that it doesn't cover the non-x86 host case running TCG.
> > >>
> > >> 	   Additionally, an alternative to hardcoded ranges as we do today,
> > >> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
> > >> 	   requiring to have a PCI device added in the vfio container. That would
> > >> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
> > >> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.  
> > > 
> > > 
> > > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > > 0x00000000fee00000 0x00000000feefffff msi
> > > 0x000000fd00000000 0x000000ffffffffff reserved
> > >   
> > Yeap, I am aware.
> > 
> > The VFIO advertising extension was just because we already advertise the above info,
> > although behind a non-empty vfio container e.g. we seem to use that for example in
> > collect_usable_iova_ranges().
> 
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container.  Each group might have unique ranges.  A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
>  
> > > Ideally we might take that into account on all hosts, but of course
> > > then we run into massive compatibility issues when we consider
> > > migration.  We run into similar problems when people try to assign
> > > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > > memory hole overlapping the msi range.
> > > 
> > > The issue here is similar to trying to find a set of supported CPU
> > > flags across hosts, QEMU only has visibility to the host where it runs,
> > > an upper level tool needs to be able to pass through information about
> > > compatibility to all possible migration targets.  
> > 
> > I agree with your generic sentiment (and idea) but are we sure this is really something as
> > dynamic/needing-denominator like CPU Features? The memory map looks to be deeply embedded
> > in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
> > pc/q35 is one very good example, because this hasn't changed since it's inception [a
> > decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
> > with more than 1Tb). Additionally, there might be architectural impositions like on x86
> > e.g. CMOS seems to tie in with memory above certain boundaries. Unless by a migration
> > targets, you mean to also cover you migrate between Intel and AMD hosts (which may need to
> > keep the reserved range nonetheless in the common denominator)
> 
> I like the flexibility that being able to specify reserved ranges would
> provide, but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize.  Cross vendor migration should be a consideration and only
> an inter-system management policy could specify the importance of that.

On x86 at least, the cross vendor part doesn't seem to be an issue; I
wouldn't expect an Intel->AMD migration to work reliably anyway.

> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole.  That
> would likely be the more traditional solution to this issue.  Thanks,

To me this seems a combination of machine type+CPU model; perhaps what
we're looking at here is having a list of holes, which can be
contributed to by any of:
  a) The machine type
  b) The CPU model
  c) An extra command line option like you list

Dave

> Alex
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Posted by Joao Martins 2 years, 10 months ago

On 6/23/21 8:27 PM, Alex Williamson wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/22/21 10:16 PM, Alex Williamson wrote:
>>> On Tue, 22 Jun 2021 16:48:59 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> Hey,
>>>>
>>>> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
>>>> when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>>>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>>>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>>>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>>>> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>>>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>>>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
>>>> of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>>>> 	failed to setup container for group 258: memory listener initialization failed:
>>>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
>>>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>>>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>>>> as documented on the links down below.
>>>>
>>>> This series tries to address that by dealing with this AMD-specific 1Tb hole,
>>>> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
>>>> as following:
>>>>
>>>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>>>            which gets used too in other parts of pc/acpi besides MR creation. The
>>>> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
>>>> 	   Intel. We walk the allowed ranges for memory above 4G, and
>>>> 	   add a E820_RESERVED type everytime we find a hole (which is at the
>>>> 	   1TB boundary).
>>>> 	   
>>>> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>>>> 	   understand that it doesn't cover the non-x86 host case running TCG.
>>>>
>>>> 	   Additionally, an alternative to hardcoded ranges as we do today,
>>>> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
>>>> 	   requiring to have a PCI device added in the vfio container. That would
>>>> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
>>>> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.  
>>>
>>>
>>> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
>>> 0x00000000fee00000 0x00000000feefffff msi
>>> 0x000000fd00000000 0x000000ffffffffff reserved
>>>   
>> Yeap, I am aware.
>>
>> The VFIO advertising extension was just because we already advertise the above info,
>> although behind a non-empty vfio container e.g. we seem to use that for example in
>> collect_usable_iova_ranges().
> 
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container.  Each group might have unique ranges.  A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
>  
Hmm OK, I see.

The suggestion/point was because the AMD IOMMU seems to mark these regions reserved for
both the MSI range and HyperTransport regardless of device/group
specifics. See amd_iommu_get_resv_regions().

So I thought that something else for advertising platform-wide reserved ranges would be
more appropriate than replicating the same info across the various groups'
reserved_regions sysfs entries.

>>> Ideally we might take that into account on all hosts, but of course
>>> then we run into massive compatibility issues when we consider
>>> migration.  We run into similar problems when people try to assign
>>> devices to non-x86 TCG hosts, where the arch doesn't have a natural
>>> memory hole overlapping the msi range.
>>>
>>> The issue here is similar to trying to find a set of supported CPU
>>> flags across hosts, QEMU only has visibility to the host where it runs,
>>> an upper level tool needs to be able to pass through information about
>>> compatibility to all possible migration targets.  
>>
>> I agree with your generic sentiment (and idea) but are we sure this is really something as
>> dynamic/needing-denominator like CPU Features? The memory map looks to be deeply embedded
>> in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
>> pc/q35 is one very good example, because this hasn't changed since it's inception [a
>> decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
>> with more than 1Tb). Additionally, there might be architectural impositions like on x86
>> e.g. CMOS seems to tie in with memory above certain boundaries. Unless by a migration
>> targets, you mean to also cover you migrate between Intel and AMD hosts (which may need to
>> keep the reserved range nonetheless in the common denominator)
> 
> I like the flexibility that being able to specify reserved ranges would
> provide, 

/me nods

> but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize.  Cross vendor migration should be a consideration and only
> an inter-system management policy could specify the importance of that.
> 
> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole.  That
> would likely be the more traditional solution to this issue.

Maybe there could be a generic facility that stores/manages the reserved ranges, and then
the different machines can provide a default set depending on the running target
heuristics (AMD x86, generic x86, ARM, etc). To some extent that means tracking reserved
ranges rather than tracking usable IOVAs as I do here (and moving that somewhere that's
not x86 specific).
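
Something along these lines, purely as an illustration (names and layout are
invented here, not part of this series):

#include <stdint.h>

#define MAX_RESERVED_RANGES 8

typedef struct ReservedRange {
    uint64_t base;
    uint64_t size;
    const char *source;    /* "machine-type", "cpu-model", "user", ... */
} ReservedRange;

typedef struct GPAReservedMap {
    ReservedRange ranges[MAX_RESERVED_RANGES];
    int nr_ranges;
} GPAReservedMap;

/* Machine, CPU model or user code registers a hole in the guest GPA space. */
static inline void gpa_reserved_map_add(GPAReservedMap *map, uint64_t base,
                                        uint64_t size, const char *source)
{
    if (map->nr_ranges < MAX_RESERVED_RANGES) {
        map->ranges[map->nr_ranges++] = (ReservedRange){ base, size, source };
    }
}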