QEMU Inter-VM Shared Memory (ivshmem) is designed to share a memory
region between guest and host. The host creates a file and passes it to
QEMU, which presents it to the guest via PCI BAR#2. The guest userspace
can map /sys/bus/pci/devices/0000:01:02.3/resource2(_wc) to use the region
without having a guest driver for the device at all.
The problem is that, since this is a PCI resource, the PCI sysfs code
reasonably enforces:
- no caching when mapped via "resourceN" (PTE::PCD on x86) or
- write-through when mapped via "resourceN_wc" (PTE::PWT on x86).
As a result, host writes are seen by the guest immediately
(as the region is just a mapped file) but it takes quite some time for
the host to see the guest's non-cached writes.
Add a quirk to always map ivshmem's BAR2 as cacheable (==write-back), as
ivshmem is backed by RAM anyway.
(Re)use the already defined but so far unused IORESOURCE_CACHEABLE flag.
This does not affect other ways of mapping a PCI BAR; a driver can use
memremap() for this functionality.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
What is this IORESOURCE_CACHEABLE for actually?
Anyway, the alternatives are:
1. add a new node in sysfs - "resourceN_wb" - for mapping as writeback
but this requires changing existing (and likely old) userspace tools;
2. fix the kernel to strictly follow /proc/mtrr (now it is rather
a recommendation) but Documentation/arch/x86/mtrr.rst says it is replaced
with PAT which does not seem to allow overriding caching for specific
devices (==MMIO ranges).
---
drivers/pci/mmap.c | 6 ++++++
drivers/pci/quirks.c | 8 ++++++++
2 files changed, 14 insertions(+)
diff --git a/drivers/pci/mmap.c b/drivers/pci/mmap.c
index 8da3347a95c4..8495bee08fae 100644
--- a/drivers/pci/mmap.c
+++ b/drivers/pci/mmap.c
@@ -35,6 +35,7 @@ int pci_mmap_resource_range(struct pci_dev *pdev, int bar,
if (write_combine)
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
else
+ else if (!(pci_resource_flags(pdev, bar) & IORESOURCE_CACHEABLE))
vma->vm_page_prot = pgprot_device(vma->vm_page_prot);
if (mmap_state == pci_mmap_io) {
@@ -46,6 +47,11 @@ int pci_mmap_resource_range(struct pci_dev *pdev, int bar,
vma->vm_ops = &pci_phys_vm_ops;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_CACHEABLE)
+ return remap_pfn_range_notrack(vma, vma->vm_start, vma->vm_pgoff,
+ vma->vm_end - vma->vm_start,
+ vma->vm_page_prot);
+
return io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
vma->vm_end - vma->vm_start,
vma->vm_page_prot);
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index d7f4ee634263..858869ec6612 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6335,3 +6335,11 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
#endif
+
+static void pci_ivshmem_writeback(struct pci_dev *dev)
+{
+ struct resource *r = &dev->resource[2];
+
+ r->flags |= IORESOURCE_CACHEABLE;
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1110, pci_ivshmem_writeback);
--
2.49.0
On Thu, Jun 12, 2025 at 06:22:33PM +1000, Alexey Kardashevskiy wrote:
> QEMU Inter-VM Shared Memory (ivshmem) is designed to share a memory
> region between guest and host. The host creates a file and passes it to
> QEMU, which presents it to the guest via PCI BAR#2. The guest userspace
> can map /sys/bus/pci/devices/0000:01:02.3/resource2(_wc) to use the region
> without having a guest driver for the device at all.
>
> The problem is that, since this is a PCI resource, the PCI sysfs code
> reasonably enforces:

Ok, so I read it up until now and can't continue because all I hear is a big
honking HACK alarm here!

Shared memory which is presented to a guest via PCI BAR?!?

Can it get any more ugly than this?

I hope I'm missing an important aspect here...

> diff --git a/drivers/pci/mmap.c b/drivers/pci/mmap.c
> index 8da3347a95c4..8495bee08fae 100644
> --- a/drivers/pci/mmap.c
> +++ b/drivers/pci/mmap.c
> @@ -35,6 +35,7 @@ int pci_mmap_resource_range(struct pci_dev *pdev, int bar,
> 	if (write_combine)
> 		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> 	else
> +	else if (!(pci_resource_flags(pdev, bar) & IORESOURCE_CACHEABLE))
  	^^^^^^

This can't build.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
On 13/9/25 02:49, Borislav Petkov wrote:
> On Thu, Jun 12, 2025 at 06:22:33PM +1000, Alexey Kardashevskiy wrote:
>> QEMU Inter-VM Shared Memory (ivshmem) is designed to share a memory
>> region between guest and host. The host creates a file and passes it to
>> QEMU, which presents it to the guest via PCI BAR#2. The guest userspace
>> can map /sys/bus/pci/devices/0000:01:02.3/resource2(_wc) to use the region
>> without having a guest driver for the device at all.
>>
>> The problem is that, since this is a PCI resource, the PCI sysfs code
>> reasonably enforces:
>
> Ok, so I read it up until now and can't continue because all I hear is a big
> honking HACK alarm here!

It is :)

> Shared memory which is presented to a guest via PCI BAR?!?
>
> Can it get any more ugly than this?
>
> I hope I'm missing an important aspect here...

yeah, sadly, there is one - people are actually using it, for, like,
a decade, and not exactly keen on changing those user space tools :)
Hence "RFC".

>> diff --git a/drivers/pci/mmap.c b/drivers/pci/mmap.c
>> index 8da3347a95c4..8495bee08fae 100644
>> --- a/drivers/pci/mmap.c
>> +++ b/drivers/pci/mmap.c
>> @@ -35,6 +35,7 @@ int pci_mmap_resource_range(struct pci_dev *pdev, int bar,
>> 	if (write_combine)
>> 		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
>> 	else
>> +	else if (!(pci_resource_flags(pdev, bar) & IORESOURCE_CACHEABLE))
> ^^^^^^
>
> This can't build.

Why? Compiles and works just fine.

-- 
Alexey
On Mon, Sep 15, 2025 at 05:43:33PM +1000, Alexey Kardashevskiy wrote:
> yeah, sadly, there is one - people are actually using it, for, like,
> a decade, and not exactly keen on changing those user space tools :) Hence
> "RFC".

Then this commit message needs a lot more explanation as it is something the
kernel should support apparently when running as a guest...

But then, if it has been used for a decade already, why do you need that
quirk now? No one has noticed in 10 years time...?

> > > 	else
> > > +	else if (!(pci_resource_flags(pdev, bar) & IORESOURCE_CACHEABLE))
> > ^^^^^^
> >
> > This can't build.
>
> Why? Compiles and works just fine.

Here's a more detailed explanation:

https://lore.kernel.org/r/202506131608.QlkxUPnI-lkp@intel.com

You haven't replaced the "else" with "else if" but left it there.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Wrong email for Nikunj :) And I missed the KVM ml. Sorry for the noise.


On 12/6/25 18:22, Alexey Kardashevskiy wrote:
> QEMU Inter-VM Shared Memory (ivshmem) is designed to share a memory
> region between guest and host. The host creates a file and passes it to
> QEMU, which presents it to the guest via PCI BAR#2. The guest userspace
> can map /sys/bus/pci/devices/0000:01:02.3/resource2(_wc) to use the region
> without having a guest driver for the device at all.
[...]

-- 
Alexey
Ping? Thanks,


On 12/6/25 18:27, Alexey Kardashevskiy wrote:
> Wrong email for Nikunj :) And I missed the KVM ml. Sorry for the noise.
>
> On 12/6/25 18:22, Alexey Kardashevskiy wrote:
>> QEMU Inter-VM Shared Memory (ivshmem) is designed to share a memory
>> region between guest and host.
[...]

-- 
Alexey
On Tue, Jun 24, 2025 at 11:42:47AM +1000, Alexey Kardashevskiy wrote:
> Ping? Thanks,
>
> On 12/6/25 18:27, Alexey Kardashevskiy wrote:
> > On 12/6/25 18:22, Alexey Kardashevskiy wrote:
[...]
> > > Add a quirk to always map ivshmem's BAR2 as cacheable (==write-back), as
> > > ivshmem is backed by RAM anyway.
> > > (Re)use the already defined but so far unused IORESOURCE_CACHEABLE flag.

It just makes me nervous to change the semantics of the sysfs attribute, even
if the user knows what it is expecting. Now the "resourceN_wc" essentially
becomes "resourceN_wb", which goes against the rule of sysfs I'm afraid.

[...]
> > > 1. add a new node in sysfs - "resourceN_wb" - for mapping as writeback
> > > but this requires changing existing (and likely old) userspace tools;

I guess this would be the cleanest approach. The old tools can continue to
suffer from the performance issue and the new tools can work faster.

- Mani

-- 
மணிவண்ணன் சதாசிவம்
On 26/6/25 02:28, Manivannan Sadhasivam wrote:
> On Tue, Jun 24, 2025 at 11:42:47AM +1000, Alexey Kardashevskiy wrote:
>> Ping? Thanks,
[...]
>>>> Add a quirk to always map ivshmem's BAR2 as cacheable (==write-back), as
>>>> ivshmem is backed by RAM anyway.
>>>> (Re)use the already defined but so far unused IORESOURCE_CACHEABLE flag.
>
> It just makes me nervous to change the semantics of the sysfs attribute, even
> if the user knows what it is expecting.

On 1) Intel 2) without VFIO, the user already gets this semantic. Which
seems... alright?

> Now the "resourceN_wc" essentially becomes
> "resourceN_wb", which goes against the rule of sysfs I'm afraid.

What is this rule?

[...]
>>>> 1. add a new node in sysfs - "resourceN_wb" - for mapping as writeback
>>>> but this requires changing existing (and likely old) userspace tools;
>
> I guess this would be the cleanest approach. The old tools can continue to
> suffer from the performance issue and the new tools can work faster.

Well yes, but the only possible user of this is going to be ivshmem, as every
other cache-coherent thing has a driver which can pick any sort of caching
policy, and nobody will ever want a slow ivshmem because there will be no
added benefit.

I can send a patch if we get consensus on this though. Thanks,

-- 
Alexey