[PATCH] x86/hvmloader: don't set xenpci MMIO BAR as UC in MTRR

Roger Pau Monne posted 1 patch 5 months ago
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20250530092314.27306-1-roger.pau@citrix.com
tools/firmware/hvmloader/config.h |  2 +-
tools/firmware/hvmloader/pci.c    | 40 +++++++++++++++++++++++++++++--
tools/firmware/hvmloader/util.c   |  2 +-
3 files changed, 40 insertions(+), 4 deletions(-)
[PATCH] x86/hvmloader: don't set xenpci MMIO BAR as UC in MTRR
Posted by Roger Pau Monne 5 months ago
The Xen PCI device (vendor ID 0x5853) exposed to x86 HVM guests doesn't
have the functionality of a traditional PCI device.  The exposed MIO BAR is
used by some guests (including Linux) as a safe place to map foreign
memory, including the grant table itself.

Traditionally BARs from devices have the uncacheable (UC) cache attribute
from the MTRR, to ensure correct functionality of such devices.  hvmloader
mimics this behaviour and sets the MTRR attributes of both the low and high
PCI MMIO windows (where BARs of PCI devices reside) as UC in MTRR.

This however causes performance issues for the users of the Xen PCI device
BAR, as for the purposes of mapping remote memory there's no need to use
the UC attribute.  On Intel systems this is worked around by using iPAT,
that allows the hypervisor to force the effective cache attribute of a p2m
entry regardless of the guest PAT value.  AMD however doesn't have an
equivalent of iPAT, and guest PAT values are always considered.

Linux commit:

41925b105e34 xen: replace xen_remap() with memremap()

Attempted to mitigate this by forcing mappings of the grant-table to use
the write-back (WB) cache attribute.  However Linux memremap() takes MTRRs
into account to calculate which PAT type to use, and seeing the MTRR cache
attribute for the region being UC the PAT also ends up as UC, regardless of
the caller having requested WB.

As a workaround to allow current Linux to map the grant-table as WB using
memremap() special case the Xen PCI device BAR in hvmloader and don't set
its cache attribute as UC.  Such workaround in hvmloader should also be
paired with a fix for Linux so it attempts to change the MTRR of the Xen
PCI device BAR to WB by itself.

Overall, the long term solution would be to provide the guest with a safe
range in the guest physical address space where mappings to foreign pages
can be created.

Some vif throughput performance figures provided by Anthoine from a 8
vCPUs, 4GB of RAM HVM guest(s) running on AMD hardware:

Without this patch:
vm -> dom0: 1.1Gb/s
vm -> vm:   5.0Gb/s

With the patch:
vm -> dom0: 4.5Gb/s
vm -> vm:   7.0Gb/s

Reported-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I don't think the ACPI tables builder consume the PCI window size
information, I'm not seeing any consumer of the acpi_info->pci_{min,len}
fields, yet I've keep them covering the xenpci device BAR, hence the
adjustment to hvmloader_acpi_build_tables() part of this patch.
---
 tools/firmware/hvmloader/config.h |  2 +-
 tools/firmware/hvmloader/pci.c    | 40 +++++++++++++++++++++++++++++--
 tools/firmware/hvmloader/util.c   |  2 +-
 3 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/tools/firmware/hvmloader/config.h b/tools/firmware/hvmloader/config.h
index 6e1da137d779..c159db30eea9 100644
--- a/tools/firmware/hvmloader/config.h
+++ b/tools/firmware/hvmloader/config.h
@@ -58,7 +58,7 @@ extern uint32_t *cpu_to_apicid;
 #define ACPI_TIS_HDR_ADDRESS 0xFED40F00UL
 
 extern uint32_t pci_mem_start;
-extern const uint32_t pci_mem_end;
+extern uint32_t pci_mem_end;
 extern uint64_t pci_hi_mem_start, pci_hi_mem_end;
 
 extern bool acpi_enabled;
diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
index c3c61ca060a6..7c339cb8b9f7 100644
--- a/tools/firmware/hvmloader/pci.c
+++ b/tools/firmware/hvmloader/pci.c
@@ -30,7 +30,7 @@
 #include <xen/hvm/e820.h>
 
 uint32_t pci_mem_start = HVM_BELOW_4G_MMIO_START;
-const uint32_t pci_mem_end = RESERVED_MEMBASE;
+uint32_t pci_mem_end = RESERVED_MEMBASE;
 uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
 
 /*
@@ -281,6 +281,42 @@ void pci_setup(void)
             if ( bar_sz == 0 )
                 continue;
 
+            if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
+                   PCI_BASE_ADDRESS_SPACE_MEMORY) &&
+                 vendor_id == 0x5853 &&
+                 (device_id == 0x0001 || device_id == 0x0002) )
+            {
+                if ( is_64bar )
+                {
+                     printf("xenpci dev %02x:%x unexpected MMIO 64bit BAR%u\n",
+                            devfn >> 3, devfn & 7, bar);
+                     continue;
+                }
+
+                if ( bar_sz > pci_mem_end ||
+                     ((pci_mem_end - bar_sz) & ~(bar_sz - 1)) < pci_mem_start )
+                {
+                     printf("xenpci dev %02x:%x BAR%u size %llx overflows low PCI hole\n",
+                            devfn >> 3, devfn & 7, bar, bar_sz);
+                     continue;
+                }
+
+                /* Put unconditionally at the end of the low PCI MMIO hole. */
+                pci_mem_end -= bar_sz;
+                pci_mem_end &= ~(bar_sz - 1);
+                bar_data &= ~PCI_BASE_ADDRESS_MEM_MASK;
+                bar_data |= pci_mem_end;
+                pci_writel(devfn, bar_reg, bar_data);
+                pci_devfn_decode_type[devfn] |= PCI_COMMAND_MEMORY;
+
+                /* Prefix BAR address with a 0 to match format used below. */
+                printf("pci dev %02x:%x bar %02x size "PRIllx": 0%08x\n",
+                       devfn >> 3, devfn & 7, bar_reg,
+                       PRIllx_arg(bar_sz), bar_data);
+
+                continue;
+            }
+
             for ( i = 0; i < nr_bars; i++ )
                 if ( bars[i].bar_sz < bar_sz )
                     break;
@@ -320,7 +356,7 @@ void pci_setup(void)
         }
 
         /* Enable bus master for this function later */
-        pci_devfn_decode_type[devfn] = PCI_COMMAND_MASTER;
+        pci_devfn_decode_type[devfn] |= PCI_COMMAND_MASTER;
     }
 
     if ( mmio_hole_size )
diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 2d07ce129013..fd107818da35 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -874,7 +874,7 @@ void hvmloader_acpi_build_tables(struct acpi_config *config,
         config->table_flags |= ACPI_HAS_HPET;
 
     config->pci_start = pci_mem_start;
-    config->pci_len = pci_mem_end - pci_mem_start;
+    config->pci_len = RESERVED_MEMBASE - pci_mem_start;
     if ( pci_hi_mem_end > pci_hi_mem_start )
     {
         config->pci_hi_start = pci_hi_mem_start;
-- 
2.49.0


Re: [PATCH] x86/hvmloader: don't set xenpci MMIO BAR as UC in MTRR
Posted by Jan Beulich 5 months ago
On 30.05.2025 11:23, Roger Pau Monne wrote:
> The Xen PCI device (vendor ID 0x5853) exposed to x86 HVM guests doesn't
> have the functionality of a traditional PCI device.  The exposed MIO BAR is
> used by some guests (including Linux) as a safe place to map foreign
> memory, including the grant table itself.
> 
> Traditionally BARs from devices have the uncacheable (UC) cache attribute
> from the MTRR, to ensure correct functionality of such devices.  hvmloader
> mimics this behaviour and sets the MTRR attributes of both the low and high
> PCI MMIO windows (where BARs of PCI devices reside) as UC in MTRR.
> 
> This however causes performance issues for the users of the Xen PCI device
> BAR, as for the purposes of mapping remote memory there's no need to use
> the UC attribute.  On Intel systems this is worked around by using iPAT,
> that allows the hypervisor to force the effective cache attribute of a p2m
> entry regardless of the guest PAT value.  AMD however doesn't have an
> equivalent of iPAT, and guest PAT values are always considered.
> 
> Linux commit:
> 
> 41925b105e34 xen: replace xen_remap() with memremap()
> 
> Attempted to mitigate this by forcing mappings of the grant-table to use
> the write-back (WB) cache attribute.  However Linux memremap() takes MTRRs
> into account to calculate which PAT type to use, and seeing the MTRR cache
> attribute for the region being UC the PAT also ends up as UC, regardless of
> the caller having requested WB.
> 
> As a workaround to allow current Linux to map the grant-table as WB using
> memremap() special case the Xen PCI device BAR in hvmloader and don't set
> its cache attribute as UC.

Can we (fully compatibly) make such a change? IOW do we know all possible
guests would be at least unaffected (ideally affected positively)? Imo ...

>  Such workaround in hvmloader should also be
> paired with a fix for Linux so it attempts to change the MTRR of the Xen
> PCI device BAR to WB by itself.
> 
> Overall, the long term solution would be to provide the guest with a safe
> range in the guest physical address space where mappings to foreign pages
> can be created.

... this is the only viable path.

> Some vif throughput performance figures provided by Anthoine from a 8
> vCPUs, 4GB of RAM HVM guest(s) running on AMD hardware:
> 
> Without this patch:
> vm -> dom0: 1.1Gb/s
> vm -> vm:   5.0Gb/s
> 
> With the patch:
> vm -> dom0: 4.5Gb/s
> vm -> vm:   7.0Gb/s
> 
> Reported-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> I don't think the ACPI tables builder consume the PCI window size
> information, I'm not seeing any consumer of the acpi_info->pci_{min,len}
> fields, yet I've keep them covering the xenpci device BAR, hence the
> adjustment to hvmloader_acpi_build_tables() part of this patch.

acpi_build_tables() copies the field, and the comment ahead of struct
acpi_info clarifies where the uses are: It's the PLEN field, which does
have a use in dsdt.asl. Aiui the change you make is therefore a necessary
one.

Jan

Re: [PATCH] x86/hvmloader: don't set xenpci MMIO BAR as UC in MTRR
Posted by Roger Pau Monné 5 months ago
On Mon, Jun 02, 2025 at 11:46:52AM +0200, Jan Beulich wrote:
> On 30.05.2025 11:23, Roger Pau Monne wrote:
> > The Xen PCI device (vendor ID 0x5853) exposed to x86 HVM guests doesn't
> > have the functionality of a traditional PCI device.  The exposed MIO BAR is
> > used by some guests (including Linux) as a safe place to map foreign
> > memory, including the grant table itself.
> > 
> > Traditionally BARs from devices have the uncacheable (UC) cache attribute
> > from the MTRR, to ensure correct functionality of such devices.  hvmloader
> > mimics this behaviour and sets the MTRR attributes of both the low and high
> > PCI MMIO windows (where BARs of PCI devices reside) as UC in MTRR.
> > 
> > This however causes performance issues for the users of the Xen PCI device
> > BAR, as for the purposes of mapping remote memory there's no need to use
> > the UC attribute.  On Intel systems this is worked around by using iPAT,
> > that allows the hypervisor to force the effective cache attribute of a p2m
> > entry regardless of the guest PAT value.  AMD however doesn't have an
> > equivalent of iPAT, and guest PAT values are always considered.
> > 
> > Linux commit:
> > 
> > 41925b105e34 xen: replace xen_remap() with memremap()
> > 
> > Attempted to mitigate this by forcing mappings of the grant-table to use
> > the write-back (WB) cache attribute.  However Linux memremap() takes MTRRs
> > into account to calculate which PAT type to use, and seeing the MTRR cache
> > attribute for the region being UC the PAT also ends up as UC, regardless of
> > the caller having requested WB.
> > 
> > As a workaround to allow current Linux to map the grant-table as WB using
> > memremap() special case the Xen PCI device BAR in hvmloader and don't set
> > its cache attribute as UC.
> 
> Can we (fully compatibly) make such a change? IOW do we know all possible
> guests would be at least unaffected (ideally affected positively)? Imo ...

Is there any other possible usage for the xenpci MMIO BAR?  My
understanding is it was introduced for this specific purpose; to
signal a safe place to map the grant-table or foreign mappings, which
in both cases want to have an effective WB cache attribute.

> >  Such workaround in hvmloader should also be
> > paired with a fix for Linux so it attempts to change the MTRR of the Xen
> > PCI device BAR to WB by itself.
> > 
> > Overall, the long term solution would be to provide the guest with a safe
> > range in the guest physical address space where mappings to foreign pages
> > can be created.
> 
> ... this is the only viable path.

I agree, however this will take more time to materialize IMO.  Needs a
patch to Linux, plus possible backports, and then distros picking the
updates.

While I agree this needs fixing in Linux, I don't see any downsides of
doing the workaround in hvmloader also, as a faster way to get it
deployed with just a Xen update.

> > Some vif throughput performance figures provided by Anthoine from a 8
> > vCPUs, 4GB of RAM HVM guest(s) running on AMD hardware:
> > 
> > Without this patch:
> > vm -> dom0: 1.1Gb/s
> > vm -> vm:   5.0Gb/s
> > 
> > With the patch:
> > vm -> dom0: 4.5Gb/s
> > vm -> vm:   7.0Gb/s
> > 
> > Reported-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > I don't think the ACPI tables builder consume the PCI window size
> > information, I'm not seeing any consumer of the acpi_info->pci_{min,len}
> > fields, yet I've keep them covering the xenpci device BAR, hence the
> > adjustment to hvmloader_acpi_build_tables() part of this patch.
> 
> acpi_build_tables() copies the field, and the comment ahead of struct
> acpi_info clarifies where the uses are: It's the PLEN field, which does
> have a use in dsdt.asl. Aiui the change you make is therefore a necessary
> one.

Right, thanks.  I was grepping for 'Field("BIOS"', instead of 'Field(BIOS'.

Roger.

Re: [PATCH] x86/hvmloader: don't set xenpci MMIO BAR as UC in MTRR
Posted by Jan Beulich 5 months ago
On 02.06.2025 16:27, Roger Pau Monné wrote:
> On Mon, Jun 02, 2025 at 11:46:52AM +0200, Jan Beulich wrote:
>> On 30.05.2025 11:23, Roger Pau Monne wrote:
>>> The Xen PCI device (vendor ID 0x5853) exposed to x86 HVM guests doesn't
>>> have the functionality of a traditional PCI device.  The exposed MIO BAR is
>>> used by some guests (including Linux) as a safe place to map foreign
>>> memory, including the grant table itself.
>>>
>>> Traditionally BARs from devices have the uncacheable (UC) cache attribute
>>> from the MTRR, to ensure correct functionality of such devices.  hvmloader
>>> mimics this behaviour and sets the MTRR attributes of both the low and high
>>> PCI MMIO windows (where BARs of PCI devices reside) as UC in MTRR.
>>>
>>> This however causes performance issues for the users of the Xen PCI device
>>> BAR, as for the purposes of mapping remote memory there's no need to use
>>> the UC attribute.  On Intel systems this is worked around by using iPAT,
>>> that allows the hypervisor to force the effective cache attribute of a p2m
>>> entry regardless of the guest PAT value.  AMD however doesn't have an
>>> equivalent of iPAT, and guest PAT values are always considered.
>>>
>>> Linux commit:
>>>
>>> 41925b105e34 xen: replace xen_remap() with memremap()
>>>
>>> Attempted to mitigate this by forcing mappings of the grant-table to use
>>> the write-back (WB) cache attribute.  However Linux memremap() takes MTRRs
>>> into account to calculate which PAT type to use, and seeing the MTRR cache
>>> attribute for the region being UC the PAT also ends up as UC, regardless of
>>> the caller having requested WB.
>>>
>>> As a workaround to allow current Linux to map the grant-table as WB using
>>> memremap() special case the Xen PCI device BAR in hvmloader and don't set
>>> its cache attribute as UC.
>>
>> Can we (fully compatibly) make such a change? IOW do we know all possible
>> guests would be at least unaffected (ideally affected positively)? Imo ...
> 
> Is there any other possible usage for the xenpci MMIO BAR?

How do you / we know?

>  My
> understanding is it was introduced for this specific purpose; to
> signal a safe place to map the grant-table or foreign mappings, which
> in both cases want to have an effective WB cache attribute.

It's a hack. It's relatively easy to imagine that someone might have built
a 2nd hack on top of this 1st one.

>>>  Such workaround in hvmloader should also be
>>> paired with a fix for Linux so it attempts to change the MTRR of the Xen
>>> PCI device BAR to WB by itself.
>>>
>>> Overall, the long term solution would be to provide the guest with a safe
>>> range in the guest physical address space where mappings to foreign pages
>>> can be created.
>>
>> ... this is the only viable path.
> 
> I agree, however this will take more time to materialize IMO.  Needs a
> patch to Linux, plus possible backports, and then distros picking the
> updates.
> 
> While I agree this needs fixing in Linux, I don't see any downsides of
> doing the workaround in hvmloader also, as a faster way to get it
> deployed with just a Xen update.

The (maybe merely theoretical) downside is that we may regress something
somewhere. Hence why I think that for starters this new behavior should
be optional, default off. Later, once proven to work in practice for a
fair while, we could then consider changing the default. (Then again I
certainly realize that adding yet another control is quite a bit of extra
effort, too.)

Jan

Re: [PATCH] x86/hvmloader: don't set xenpci MMIO BAR as UC in MTRR
Posted by Anthoine Bourgeois 5 months ago
On Fri, May 30, 2025 at 11:23:14AM +0200, Roger Pau Monne wrote:
>The Xen PCI device (vendor ID 0x5853) exposed to x86 HVM guests doesn't
>have the functionality of a traditional PCI device.  The exposed MIO BAR is
>used by some guests (including Linux) as a safe place to map foreign
>memory, including the grant table itself.
>
>Traditionally BARs from devices have the uncacheable (UC) cache attribute
>from the MTRR, to ensure correct functionality of such devices.  hvmloader
>mimics this behaviour and sets the MTRR attributes of both the low and high
>PCI MMIO windows (where BARs of PCI devices reside) as UC in MTRR.
>
>This however causes performance issues for the users of the Xen PCI device
>BAR, as for the purposes of mapping remote memory there's no need to use
>the UC attribute.  On Intel systems this is worked around by using iPAT,
>that allows the hypervisor to force the effective cache attribute of a p2m
>entry regardless of the guest PAT value.  AMD however doesn't have an
>equivalent of iPAT, and guest PAT values are always considered.
>
>Linux commit:
>
>41925b105e34 xen: replace xen_remap() with memremap()
>
>Attempted to mitigate this by forcing mappings of the grant-table to use
>the write-back (WB) cache attribute.  However Linux memremap() takes MTRRs
>into account to calculate which PAT type to use, and seeing the MTRR cache
>attribute for the region being UC the PAT also ends up as UC, regardless of
>the caller having requested WB.
>
>As a workaround to allow current Linux to map the grant-table as WB using
>memremap() special case the Xen PCI device BAR in hvmloader and don't set
>its cache attribute as UC.  Such workaround in hvmloader should also be
>paired with a fix for Linux so it attempts to change the MTRR of the Xen
>PCI device BAR to WB by itself.
>
>Overall, the long term solution would be to provide the guest with a safe
>range in the guest physical address space where mappings to foreign pages
>can be created.
>
>Some vif throughput performance figures provided by Anthoine from a 8
>vCPUs, 4GB of RAM HVM guest(s) running on AMD hardware:
>
>Without this patch:
>vm -> dom0: 1.1Gb/s
>vm -> vm:   5.0Gb/s
>
>With the patch:
>vm -> dom0: 4.5Gb/s
>vm -> vm:   7.0Gb/s
>
>Reported-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
>Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Also
Tested-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>