dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
for the MSI target address on every interrupt and tears it down again
via dw_pcie_ep_unmap_addr().
On systems that heavily use the AXI bridge interface (for example when
the integrated eDMA engine is active), this means the outbound iATU
registers are updated while traffic is in flight. The DesignWare
endpoint spec warns that updating iATU registers in this situation is
not supported, and the behavior is undefined.
Under high MSI and eDMA load this pattern results in occasional bogus
outbound transactions and IOMMU faults such as:
ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
followed by the system becoming unresponsive. This is the actual output
observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
There is no need to reprogram the iATU region used for MSI on every
interrupt. The host-provided MSI address is stable while MSI is enabled,
and the endpoint driver already dedicates a scratch buffer for MSI
generation.
Cache the aligned MSI address and map size, program the outbound iATU
once, and keep the window enabled. Subsequent interrupts only perform a
write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
the hot path and fixing the lockups seen under load.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
.../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
drivers/pci/controller/dwc/pcie-designware.h | 5 ++
2 files changed, 47 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index 3780a9bd6f79..ef8ded34d9ab 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
struct dw_pcie_ep *ep = epc_get_drvdata(epc);
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+ /*
+ * Tear down the dedicated outbound window used for MSI
+ * generation. This avoids leaking an iATU window across
+ * endpoint stop/start cycles.
+ */
+ if (ep->msi_iatu_mapped) {
+ dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
+ ep->msi_iatu_mapped = false;
+ }
+
dw_pcie_stop_link(pci);
}
@@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
- ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
- map_size);
- if (ret)
- return ret;
- writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
+ /*
+ * Program the outbound iATU once and keep it enabled.
+ *
+ * The spec warns that updating iATU registers while there are
+ * operations in flight on the AXI bridge interface is not
+ * supported, so we avoid reprogramming the region on every MSI,
+ * specifically unmapping immediately after writel().
+ */
+ if (!ep->msi_iatu_mapped) {
+ ret = dw_pcie_ep_map_addr(epc, func_no, 0,
+ ep->msi_mem_phys, msg_addr,
+ map_size);
+ if (ret)
+ return ret;
- dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
+ ep->msi_iatu_mapped = true;
+ ep->msi_msg_addr = msg_addr;
+ ep->msi_map_size = map_size;
+ } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
+ ep->msi_map_size != map_size)) {
+ /*
+ * The host changed the MSI target address or the required
+ * mapping size. Reprogramming the iATU at runtime is unsafe
+ * on this controller, so bail out instead of trying to update
+ * the existing region.
+ */
+ return -EINVAL;
+ }
+
+ writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
return 0;
}
@@ -1268,6 +1301,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
INIT_LIST_HEAD(&ep->func_list);
INIT_LIST_HEAD(&ep->ib_map_list);
spin_lock_init(&ep->ib_map_lock);
+ ep->msi_iatu_mapped = false;
+ ep->msi_msg_addr = 0;
+ ep->msi_map_size = 0;
epc = devm_pci_epc_create(dev, &epc_ops);
if (IS_ERR(epc)) {
diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
index 269a9fe0501f..1770a2318557 100644
--- a/drivers/pci/controller/dwc/pcie-designware.h
+++ b/drivers/pci/controller/dwc/pcie-designware.h
@@ -481,6 +481,11 @@ struct dw_pcie_ep {
void __iomem *msi_mem;
phys_addr_t msi_mem_phys;
struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
+
+ /* MSI outbound iATU state */
+ bool msi_iatu_mapped;
+ u64 msi_msg_addr;
+ size_t msi_map_size;
};
struct dw_pcie_ops {
--
2.48.1
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
>
When claiming spec violation, you should quote the spec reference such as the
spec version, section, and actual wording snippet.
> Under high MSI and eDMA load this pattern results in occasional bogus
> outbound transactions and IOMMU faults such as:
>
> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>
> followed by the system becoming unresponsive. This is the actual output
> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>
> There is no need to reprogram the iATU region used for MSI on every
> interrupt. The host-provided MSI address is stable while MSI is enabled,
> and the endpoint driver already dedicates a scratch buffer for MSI
> generation.
>
> Cache the aligned MSI address and map size, program the outbound iATU
> once, and keep the window enabled. Subsequent interrupts only perform a
> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> the hot path and fixing the lockups seen under load.
>
iATU windows are very limited (just 8 in some cases), so I don't like allocating
fixed windows for MSIs.
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> 2 files changed, 47 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 3780a9bd6f79..ef8ded34d9ab 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>
> + /*
> + * Tear down the dedicated outbound window used for MSI
> + * generation. This avoids leaking an iATU window across
> + * endpoint stop/start cycles.
> + */
> + if (ep->msi_iatu_mapped) {
> + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = false;
> + }
> +
> dw_pcie_stop_link(pci);
> }
>
> @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
>
> msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> - map_size);
> - if (ret)
> - return ret;
>
> - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> + /*
> + * Program the outbound iATU once and keep it enabled.
> + *
> + * The spec warns that updating iATU registers while there are
> + * operations in flight on the AXI bridge interface is not
> + * supported, so we avoid reprogramming the region on every MSI,
> + * specifically unmapping immediately after writel().
> + */
> + if (!ep->msi_iatu_mapped) {
> + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> + ep->msi_mem_phys, msg_addr,
> + map_size);
> + if (ret)
> + return ret;
>
> - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = true;
> + ep->msi_msg_addr = msg_addr;
> + ep->msi_map_size = map_size;
> + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> + ep->msi_map_size != map_size)) {
> + /*
> + * The host changed the MSI target address or the required
> + * mapping size. Reprogramming the iATU at runtime is unsafe
> + * on this controller, so bail out instead of trying to update
> + * the existing region.
> + */
I'd prefer having some sort of locking to program the iATU registers during
runtime instead of bailing out.
- Mani
--
மணிவண்ணன் சதாசிவம்
On Fri, Dec 12, 2025 at 12:38:02PM +0900, Manivannan Sadhasivam wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > for the MSI target address on every interrupt and tears it down again
> > via dw_pcie_ep_unmap_addr().
> >
> > On systems that heavily use the AXI bridge interface (for example when
> > the integrated eDMA engine is active), this means the outbound iATU
> > registers are updated while traffic is in flight. The DesignWare
> > endpoint spec warns that updating iATU registers in this situation is
> > not supported, and the behavior is undefined.
> >
>
> When claiming spec violation, you should quote the spec reference such as the
> spec version, section, and actual wording snippet.
Thank you for the review and sorry about my late response.
The relevant wording is from the DW EPC databook 5.40a - 3.10.6.1 iATU
Outbound Programming Overview:
"Dynamic iATU Programming with AXI Bridge Module - You must not update the
iATU registers while operations are in progress on the AXI bridge slave
interface."
Niklas had pointed this out earlier and posted a stand-alone patch that
includes the reference and quote:
https://lore.kernel.org/all/20251210071358.2267494-2-cassel@kernel.org/
>
> > Under high MSI and eDMA load this pattern results in occasional bogus
> > outbound transactions and IOMMU faults such as:
> >
> > ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
> >
> > followed by the system becoming unresponsive. This is the actual output
> > observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
> >
> > There is no need to reprogram the iATU region used for MSI on every
> > interrupt. The host-provided MSI address is stable while MSI is enabled,
> > and the endpoint driver already dedicates a scratch buffer for MSI
> > generation.
> >
> > Cache the aligned MSI address and map size, program the outbound iATU
> > once, and keep the window enabled. Subsequent interrupts only perform a
> > write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> > the hot path and fixing the lockups seen under load.
> >
>
> iATU windows are very limited (just 8 in some cases), so I don't like allocating
> fixed windows for MSIs.
Do you think there is a generic way to resolve the issue without an OB iATU
window? If so, I'd be happy to pursue it. I wonder whether the MSI sideband
interface is something that can be used generically from software.
>
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> > drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> > 2 files changed, 47 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 3780a9bd6f79..ef8ded34d9ab 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> > struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >
> > + /*
> > + * Tear down the dedicated outbound window used for MSI
> > + * generation. This avoids leaking an iATU window across
> > + * endpoint stop/start cycles.
> > + */
> > + if (ep->msi_iatu_mapped) {
> > + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = false;
> > + }
> > +
> > dw_pcie_stop_link(pci);
> > }
> >
> > @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> > msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
> >
> > msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> > - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> > - map_size);
> > - if (ret)
> > - return ret;
> >
> > - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> > + /*
> > + * Program the outbound iATU once and keep it enabled.
> > + *
> > + * The spec warns that updating iATU registers while there are
> > + * operations in flight on the AXI bridge interface is not
> > + * supported, so we avoid reprogramming the region on every MSI,
> > + * specifically unmapping immediately after writel().
> > + */
> > + if (!ep->msi_iatu_mapped) {
> > + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> > + ep->msi_mem_phys, msg_addr,
> > + map_size);
> > + if (ret)
> > + return ret;
> >
> > - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = true;
> > + ep->msi_msg_addr = msg_addr;
> > + ep->msi_map_size = map_size;
> > + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> > + ep->msi_map_size != map_size)) {
> > + /*
> > + * The host changed the MSI target address or the required
> > + * mapping size. Reprogramming the iATU at runtime is unsafe
> > + * on this controller, so bail out instead of trying to update
> > + * the existing region.
> > + */
>
> I'd prefer having some sort of locking to program the iATU registers during
> runtime instead of bailing out.
Here, does the "locking" mean any mechanism to ensure the quiescence that
allows safe reprogramming?
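For example, is it something along the lines of the rough sketch below that
you have in mind? (ep->msi_lock is hypothetical and does not exist in the
current code; also, a lock like this only serializes callers of
dw_pcie_ep_raise_msi_irq() against each other, it does not quiesce eDMA
traffic on the AXI bridge, so I am not sure it is sufficient by itself.)

        /* Rough sketch only: serialize the map/write/unmap sequence. */
        unsigned long flags;
        int ret;

        /* ep->msi_lock would be a new spinlock_t, initialized in dw_pcie_ep_init() */
        spin_lock_irqsave(&ep->msi_lock, flags);

        ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
                                  map_size);
        if (ret) {
                spin_unlock_irqrestore(&ep->msi_lock, flags);
                return ret;
        }

        writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);

        dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);

        spin_unlock_irqrestore(&ep->msi_lock, flags);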
Thank you,
Koichiro
>
> - Mani
>
> --
> மணிவண்ணன் சதாசிவம்
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
>
> Under high MSI and eDMA load this pattern results in occasional bogus
> outbound transactions and IOMMU faults such as:
>
> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>
> followed by the system becoming unresponsive. This is the actual output
> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>
> There is no need to reprogram the iATU region used for MSI on every
> interrupt. The host-provided MSI address is stable while MSI is enabled,
> and the endpoint driver already dedicates a scratch buffer for MSI
> generation.
>
> Cache the aligned MSI address and map size, program the outbound iATU
> once, and keep the window enabled. Subsequent interrupts only perform a
> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> the hot path and fixing the lockups seen under load.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
>  .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
>  drivers/pci/controller/dwc/pcie-designware.h | 5 ++
>  2 files changed, 47 insertions(+), 6 deletions(-)
>
I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
not modify dw_pcie_ep_raise_msix_irq()
both functions call dw_pcie_ep_map_addr() before doing the writel(),
so I think they should be treated the same.
I do however understand that it is a bit wasteful to dedicate one
outbound iATU for MSI and one outbound iATU for MSI-X, as the PCI
spec does not allow both of them to be enabled at the same time,
see:
6.1.4 MSI and MSI-X Operation § in PCIe 6.0 spec:
"A Function is permitted to implement both MSI and MSI-X,
but system software is prohibited from enabling both at the
same time. If system software enables both at the same time,
the behavior is undefined."
I guess the problem is that some EPF drivers, even if only
one capability can be enabled (MSI/MSI-X), call both
pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
To fill in the number of MSI/MSI-X irqs.
While other EPF drivers only call either pci_epc_set_msi() or
pci_epc_set_msix(), depending on the IRQ type that will actually
be used:
https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
I think both versions is okay, just because the number of IRQs
is filled in for both MSI/MSI-X, AFAICT, only one of them will
get enabled.
I guess it might be hard for an EPC driver to know which capability
that is currently enabled, as to enable a capability is only a config
space write by the host side.
I guess in most real hardware, e.g. a NIC device, you do an
"enable engine"/"stop engine" type of write to a BAR.
Perhaps we should have similar callbacks in struct pci_epc_ops ?
My thinking is that after "start engine", an EPC driver could read
the MSI and MSI-X capabilities, to see which is enabled.
As it should not be allowed to change between MSI and MSI-X without
doing a "stop engine" first.
Kind regards,
Niklas
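PS: a rough sketch of the kind of check I have in mind. The helper itself is
hypothetical (nothing like it exists in pcie-designware-ep.c today), but
dw_pcie_ep_get_func_from_ep()/dw_pcie_ep_readw_dbi() and the per-function
msix_cap offset are the ones dw_pcie_ep_raise_msix_irq() already uses (an MSI
variant would look the same with msi_cap/PCI_MSI_FLAGS):

static bool dw_pcie_ep_msix_enabled(struct dw_pcie_ep *ep, u8 func_no)
{
        struct dw_pcie_ep_func *ep_func;
        u32 val;

        ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
        if (!ep_func || !ep_func->msix_cap)
                return false;

        /* Read the MSI-X Message Control word written by the host. */
        val = dw_pcie_ep_readw_dbi(ep, func_no,
                                   ep_func->msix_cap + PCI_MSIX_FLAGS);

        return val & PCI_MSIX_FLAGS_ENABLE;
}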
On 12/8/2025 1:27 PM, Niklas Cassel wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
>> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
>> for the MSI target address on every interrupt and tears it down again
>> via dw_pcie_ep_unmap_addr().
>>
>> On systems that heavily use the AXI bridge interface (for example when
>> the integrated eDMA engine is active), this means the outbound iATU
>> registers are updated while traffic is in flight. The DesignWare
>> endpoint spec warns that updating iATU registers in this situation is
>> not supported, and the behavior is undefined.
>>
>> Under high MSI and eDMA load this pattern results in occasional bogus
>> outbound transactions and IOMMU faults such as:
>>
>> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>>
>> followed by the system becoming unresponsive. This is the actual output
>> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>>
>> There is no need to reprogram the iATU region used for MSI on every
>> interrupt. The host-provided MSI address is stable while MSI is enabled,
>> and the endpoint driver already dedicates a scratch buffer for MSI
>> generation.
>>
>> Cache the aligned MSI address and map size, program the outbound iATU
>> once, and keep the window enabled. Subsequent interrupts only perform a
>> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
>> the hot path and fixing the lockups seen under load.
>>
>> Signed-off-by: Koichiro Den <den@valinux.co.jp>
>> ---
>>  .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
>>  drivers/pci/controller/dwc/pcie-designware.h | 5 ++
>>  2 files changed, 47 insertions(+), 6 deletions(-)
>>
> I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
> not modify dw_pcie_ep_raise_msix_irq()
>
> both functions call dw_pcie_ep_map_addr() before doing the writel(),
> so I think they should be treated the same.
>
> I do however understand that it is a bit wasteful to dedicate one
> outbound iATU for MSI and one outbound iATU for MSI-X, as the PCI
> spec does not allow both of them to be enabled at the same time,
> see:
>
> 6.1.4 MSI and MSI-X Operation § in PCIe 6.0 spec:
> "A Function is permitted to implement both MSI and MSI-X,
> but system software is prohibited from enabling both at the
> same time. If system software enables both at the same time,
> the behavior is undefined."
>
> I guess the problem is that some EPF drivers, even if only
> one capability can be enabled (MSI/MSI-X), call both
> pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
>
> To fill in the number of MSI/MSI-X irqs.
>
> While other EPF drivers only call either pci_epc_set_msi() or
> pci_epc_set_msix(), depending on the IRQ type that will actually
> be used:
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
>
> I think both versions is okay, just because the number of IRQs
> is filled in for both MSI/MSI-X, AFAICT, only one of them will
> get enabled.
>
> I guess it might be hard for an EPC driver to know which capability
> that is currently enabled, as to enable a capability is only a config
> space write by the host side.
As the host is the one which enables MSI/MSIX, it should be better the
controller driver takes this decision and the EPF driver just sends only
raise_irq. Because technically, host can disable MSI and enable MSIX at
runtime also.
In the controller driver, it can check which is enabled and choose b/w
MSIX/MSI/Legacy.
- Krishna Chaitanya.
> I guess in most real hardware, e.g. a NIC device, you do an
> "enable engine"/"stop engine" type of write to a BAR.
>
> Perhaps we should have similar callbacks in struct pci_epc_ops ?
>
> My thinking is that after "start engine", an EPC driver could read
> the MSI and MSI-X capabilities, to see which is enabled.
> As it should not be allowed to change between MSI and MSI-X without
> doing a "stop engine" first.
>
> Kind regards,
> Niklas
>
On Mon, Dec 22, 2025 at 10:40:12AM +0530, Krishna Chaitanya Chundru wrote:
> On 12/8/2025 1:27 PM, Niklas Cassel wrote:
> > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> >
> > I guess the problem is that some EPF drivers, even if only
> > one capability can be enabled (MSI/MSI-X), call both
> > pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
> > https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
> >
> > To fill in the number of MSI/MSI-X irqs.
> >
> > While other EPF drivers only call either pci_epc_set_msi() or
> > pci_epc_set_msix(), depending on the IRQ type that will actually
> > be used:
> > https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
> >
> > I think both versions is okay, just because the number of IRQs
> > is filled in for both MSI/MSI-X, AFAICT, only one of them will
> > get enabled.
> >
> > I guess it might be hard for an EPC driver to know which capability
> > that is currently enabled, as to enable a capability is only a config
> > space write by the host side.
> As the host is the one which enables MSI/MSIX, it should be better the
> controller driver takes this decision and the EPF driver just sends only
> raise_irq. Because technically, host can disable MSI and enable MSIX at
> runtime also.
>
> In the controller driver, it can check which is enabled and choose b/w
> MSIX/MSI/Legacy.
I'm not sure if I'm following, but if by "the controller driver", you
mean the EPC driver, and not the host side driver, how can the EPC
driver know how many interrupts a specific EPF driver wants to use?
From the kdoc to pci_epc_set_msi(), the nr_irqs parameter is defined as:
@nr_irqs: number of MSI interrupts required by the EPF
https://github.com/torvalds/linux/blob/v6.19-rc2/drivers/pci/endpoint/pci-epc-core.c#L305
Anyway, I posted Koichiro's patch here:
https://lore.kernel.org/linux-pci/20251210071358.2267494-2-cassel@kernel.org/
See my comment:
pci-epf-test does change between MSI and MSI-X without calling
dw_pcie_ep_stop(), however, the msg_addr address written by the host
will be the same address, at least when using a Linux host using a DWC
based controller. If another host ends up using different msg_addr for
MSI and MSI-X, then I think that we will need to modify pci-epf-test to
call a function when changing IRQ type, such that pcie-designware-ep.c
can tear down the MSI/MSI-X mapping.
So if we want to improve things, I think we need to modify the EPF drivers
to call a function when changing the IRQ type. The EPF driver should know
which IRQ type that is currently in use (see e.g. nvme_epf->irq_type in
drivers/nvme/target/pci-epf.c).
Additionally, I don't think that the host side should be allowed to change
the IRQ type (using e.g. setpci) when the EPF driver is in a "running state".
I think things will break badly if you e.g. try to do this on an PCIe
connected network card while the network card is in use.
Kind regards,
Niklas
On 12/22/2025 1:20 PM, Niklas Cassel wrote:
> On Mon, Dec 22, 2025 at 10:40:12AM +0530, Krishna Chaitanya Chundru wrote:
>> On 12/8/2025 1:27 PM, Niklas Cassel wrote:
>>> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
>>>
>>> I guess the problem is that some EPF drivers, even if only
>>> one capability can be enabled (MSI/MSI-X), call both
>>> pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
>>> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
>>>
>>> To fill in the number of MSI/MSI-X irqs.
>>>
>>> While other EPF drivers only call either pci_epc_set_msi() or
>>> pci_epc_set_msix(), depending on the IRQ type that will actually
>>> be used:
>>> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
>>>
>>> I think both versions is okay, just because the number of IRQs
>>> is filled in for both MSI/MSI-X, AFAICT, only one of them will
>>> get enabled.
>>>
>>> I guess it might be hard for an EPC driver to know which capability
>>> that is currently enabled, as to enable a capability is only a config
>>> space write by the host side.
>> As the host is the one which enables MSI/MSIX, it should be better the
>> controller driver takes this decision and the EPF driver just sends only
>> raise_irq. Because technically, host can disable MSI and enable MSIX at
>> runtime also.
>>
>> In the controller driver, it can check which is enabled and choose b/w
>> MSIX/MSI/Legacy.
> I'm not sure if I'm following, but if by "the controller driver", you
> mean the EPC driver, and not the host side driver, how can the EPC
> driver know how many interrupts a specific EPF driver wants to use?
I meant the dwc drivers here.
Set msi & set msix still need to be called from the EPF driver only to tell
how many interrupts they want to configure etc.
>
> From the kdoc to pci_epc_set_msi(), the nr_irqs parameter is defined as:
> @nr_irqs: number of MSI interrupts required by the EPF
> https://github.com/torvalds/linux/blob/v6.19-rc2/drivers/pci/endpoint/pci-epc-core.c#L305
>
> Anyway, I posted Koichiro's patch here:
> https://lore.kernel.org/linux-pci/20251210071358.2267494-2-cassel@kernel.org/
I will comment on that patch.
>
> See my comment:
> pci-epf-test does change between MSI and MSI-X without calling
> dw_pcie_ep_stop(), however, the msg_addr address written by the host
> will be the same address, at least when using a Linux host using a DWC
> based controller. If another host ends up using different msg_addr for
> MSI and MSI-X, then I think that we will need to modify pci-epf-test to
> call a function when changing IRQ type, such that pcie-designware-ep.c
> can tear down the MSI/MSI-X mapping.
Maybe for arm based systems we are getting same address but for x86 based
systems it is not guaranteed that you will get same address.
> So if we want to improve things, I think we need to modify the EPF drivers
> to call a function when changing the IRQ type. The EPF driver should know
> which IRQ type that is currently in use (see e.g. nvme_epf->irq_type in
> drivers/nvme/target/pci-epf.c).
My suggestion is let EPF driver call raise_irq with the vector number then
the dwc driver can raise IRQ based on which IRQ host enables it.
> Additionally, I don't think that the host side should be allowed to change
> the IRQ type (using e.g. setpci) when the EPF driver is in a "running state".
In the host driver itself they can choose to change it by using
pci_alloc_irq_vectors
<https://elixir.bootlin.com/linux/v6.18.2/C/ident/pci_alloc_irq_vectors>,
Currently it is not present but in future someone can change it, as spec
didn't say you can't update it.
> I think things will break badly if you e.g. try to do this on an PCIe
> connected network card while the network card is in use.
I agree on this.
I just want to highlight there is possibility of this in future, if someone
comes up with a clean logic.
- Krishna Chaitanya.
>
> Kind regards,
> Niklas
>
On Mon, Dec 22, 2025 at 01:44:02PM +0530, Krishna Chaitanya Chundru wrote:
>
> On 12/22/2025 1:20 PM, Niklas Cassel wrote:
> > On Mon, Dec 22, 2025 at 10:40:12AM +0530, Krishna Chaitanya Chundru wrote:
> > > On 12/8/2025 1:27 PM, Niklas Cassel wrote:
> > > > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > >
> > > > I guess the problem is that some EPF drivers, even if only
> > > > one capability can be enabled (MSI/MSI-X), call both
> > > > pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
> > > > https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
> > > >
> > > > To fill in the number of MSI/MSI-X irqs.
> > > >
> > > > While other EPF drivers only call either pci_epc_set_msi() or
> > > > pci_epc_set_msix(), depending on the IRQ type that will actually
> > > > be used:
> > > > https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
> > > >
> > > > I think both versions is okay, just because the number of IRQs
> > > > is filled in for both MSI/MSI-X, AFAICT, only one of them will
> > > > get enabled.
> > > >
> > > > I guess it might be hard for an EPC driver to know which capability
> > > > that is currently enabled, as to enable a capability is only a config
> > > > space write by the host side.
> > > As the host is the one which enables MSI/MSIX, it should be better the
> > > controller driver takes this decision and the EPF driver just sends only
> > > raise_irq. Because technically, host can disable MSI and enable MSIX at
> > > runtime also.
> > >
> > > In the controller driver, it can check which is enabled and choose b/w
> > > MSIX/MSI/Legacy.
> > I'm not sure if I'm following, but if by "the controller driver", you
> > mean the EPC driver, and not the host side driver, how can the EPC
> > driver know how many interrupts a specific EPF driver wants to use?
> I meant the dwc drivers here.
> Set msi & set msix still need to be called from the EPF driver only to tell
> how many interrupts they want to configure etc.
Please leave a newline before and after your reply to make it readable in
text based clients, which some of the poor folks like me still use.
> >
> > From the kdoc to pci_epc_set_msi(), the nr_irqs parameter is defined as:
> > @nr_irqs: number of MSI interrupts required by the EPF
> > https://github.com/torvalds/linux/blob/v6.19-rc2/drivers/pci/endpoint/pci-epc-core.c#L305
> >
> > Anyway, I posted Koichiro's patch here:
> > https://lore.kernel.org/linux-pci/20251210071358.2267494-2-cassel@kernel.org/
> I will comment on that patch.
> >
> > See my comment:
> > pci-epf-test does change between MSI and MSI-X without calling
> > dw_pcie_ep_stop(), however, the msg_addr address written by the host
> > will be the same address, at least when using a Linux host using a DWC
> > based controller. If another host ends up using different msg_addr for
> > MSI and MSI-X, then I think that we will need to modify pci-epf-test to
> > call a function when changing IRQ type, such that pcie-designware-ep.c
> > can tear down the MSI/MSI-X mapping.
> Maybe for arm based systems we are getting same address but for x86 based
> systems it is not guaranteed that you will get same address.
> > So if we want to improve things, I think we need to modify the EPF drivers
> > to call a function when changing the IRQ type. The EPF driver should know
> > which IRQ type that is currently in use (see e.g. nvme_epf->irq_type in
> > drivers/nvme/target/pci-epf.c).
> My suggestion is let EPF driver call raise_irq with the vector number then
> the dwc driver can raise IRQ based on which IRQ host enables it.
> > Additionally, I don't think that the host side should be allowed to change
> > the IRQ type (using e.g. setpci) when the EPF driver is in a "running state".
> In the host driver itself they can choose to change it by using
> pci_alloc_irq_vectors
> <https://elixir.bootlin.com/linux/v6.18.2/C/ident/pci_alloc_irq_vectors>,
> Currently it is not present but in future someone can change it, as spec
> didn't say you can't update it.
The spec has some wording on it (though not very clear) in r6.0, sec 6.1.4
"System software initializes the message address and message data (from here
on referred to as the “vector”) during device configuration, allocating one
or more vectors to each MSI-capable Function."
This *sounds* like the MSI/MSI-X initialization should happen during device
configuration.
> > I think things will break badly if you e.g. try to do this on an PCIe
> > connected network card while the network card is in use.
> I agree on this.
>
> I just want to highlight there is possibility of this in future, if someone
> comes up with a clean logic.
>
I don't know if this is even possible. For example, I don't think a host is
allowed to reattach a device which was using MSI to a VM which only uses
MSI-X during live device migration in virtualization world. I bet the device
might not perform as expected if that happens.
- Mani
--
மணிவண்ணன் சதாசிவம்
On Mon, Dec 08, 2025 at 08:57:19AM +0100, Niklas Cassel wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
>
> I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
> not modify dw_pcie_ep_raise_msix_irq()
>
> both functions call dw_pcie_ep_map_addr() before doing the writel(),
> so I think they should be treated the same.
Btw, when using nvmet-pci-epf:
drivers/nvme/target/pci-epf.c
With queue depth == 32, I get:
[ 106.585811] arm-smmu-v3 fc900000.iommu: event 0x10 received:
[ 106.586349] arm-smmu-v3 fc900000.iommu: 0x0000010000000010
[ 106.586846] arm-smmu-v3 fc900000.iommu: 0x0000020000000000
[ 106.587341] arm-smmu-v3 fc900000.iommu: 0x000000090000f040
[ 106.587841] arm-smmu-v3 fc900000.iommu: 0x0000000000000000
[ 106.588335] arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
[ 106.589383] arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
(If I only run with queue depth == 1, I cannot trigger this IOMMU error.)
So, I really think that this patch is important, as it solves a real
problem also for the nvmet-pci-epf driver.
With this patch applied, I cannot trigger the IOMMU error,
so I really think that you should send this as a standalone patch.
I still think that we need to modify dw_pcie_ep_raise_msix_irq() in a similar
way.
Perhaps instead of:
        if (!ep->msi_iatu_mapped) {
                ret = dw_pcie_ep_map_addr(epc, func_no, 0,
                                          ep->msi_mem_phys, msg_addr,
                                          map_size);
                if (ret)
                        return ret;
                ep->msi_iatu_mapped = true;
                ep->msi_msg_addr = msg_addr;
                ep->msi_map_size = map_size;
        } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
                                ep->msi_map_size != map_size)) {
                /*
                 * The host changed the MSI target address or the required
                 * mapping size. Reprogramming the iATU at runtime is unsafe
                 * on this controller, so bail out instead of trying to update
                 * the existing region.
                 */
                return -EINVAL;
        }
        writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
We could modify both dw_pcie_ep_raise_msix_irq and dw_pcie_ep_raise_msi_irq
to do something like:
        if (!ep->msi_iatu_mapped) {
                ret = dw_pcie_ep_map_addr(epc, func_no, 0,
                                          ep->msi_mem_phys, msg_addr,
                                          map_size);
                if (ret)
                        return ret;
                ep->msi_iatu_mapped = true;
                ep->msi_msg_addr = msg_addr;
                ep->msi_map_size = map_size;
        } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
                                ep->msi_map_size != map_size)) {
                /*
                 * Host driver might have changed from MSI to MSI-X
                 * or the other way around.
                 */
                dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
                ret = dw_pcie_ep_map_addr(epc, func_no, 0,
                                          ep->msi_mem_phys, msg_addr,
                                          map_size);
                if (ret)
                        return ret;
        }
        writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
I think that should work without needing to introduce any
.start_engine() / .stop_engine() APIs.
Kind regards,
Niklas
On Tue, Dec 09, 2025 at 09:15:57AM +0100, Niklas Cassel wrote:
> On Mon, Dec 08, 2025 at 08:57:19AM +0100, Niklas Cassel wrote:
> > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> >
> > I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
> > not modify dw_pcie_ep_raise_msix_irq()
> >
> > both functions call dw_pcie_ep_map_addr() before doing the writel(),
> > so I think they should be treated the same.
>
> Btw, when using nvmet-pci-epf:
> drivers/nvme/target/pci-epf.c
>
> With queue depth == 32, I get:
> [ 106.585811] arm-smmu-v3 fc900000.iommu: event 0x10 received:
> [ 106.586349] arm-smmu-v3 fc900000.iommu: 0x0000010000000010
> [ 106.586846] arm-smmu-v3 fc900000.iommu: 0x0000020000000000
> [ 106.587341] arm-smmu-v3 fc900000.iommu: 0x000000090000f040
> [ 106.587841] arm-smmu-v3 fc900000.iommu: 0x0000000000000000
> [ 106.588335] arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
> [ 106.589383] arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
>
> (If I only run with queue depth == 1, I cannot trigger this IOMMU error.)
>
>
> So, I really think that this patch is important, as it solves a real
> problem also for the nvmet-pci-epf driver.
>
>
> With this patch applied, I cannot trigger the IOMMU error,
> so I really think that you should send this as a standalone patch.
>
>
> I still think that we need to modify dw_pcie_ep_raise_msix_irq() in a similar
> way.
Sorry about my late response, and thank you for handling this:
https://lore.kernel.org/all/20251210071358.2267494-2-cassel@kernel.org/
Koichiro
>
>
> Perhaps instead of:
>
>
>         if (!ep->msi_iatu_mapped) {
>                 ret = dw_pcie_ep_map_addr(epc, func_no, 0,
>                                           ep->msi_mem_phys, msg_addr,
>                                           map_size);
>                 if (ret)
>                         return ret;
>
>                 ep->msi_iatu_mapped = true;
>                 ep->msi_msg_addr = msg_addr;
>                 ep->msi_map_size = map_size;
>         } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
>                                 ep->msi_map_size != map_size)) {
>                 /*
>                  * The host changed the MSI target address or the required
>                  * mapping size. Reprogramming the iATU at runtime is unsafe
>                  * on this controller, so bail out instead of trying to update
>                  * the existing region.
>                  */
>                 return -EINVAL;
>         }
>
>         writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
>
>
>
> We could modify both dw_pcie_ep_raise_msix_irq and dw_pcie_ep_raise_msi_irq
> to do something like:
>
>
>
>         if (!ep->msi_iatu_mapped) {
>                 ret = dw_pcie_ep_map_addr(epc, func_no, 0,
>                                           ep->msi_mem_phys, msg_addr,
>                                           map_size);
>                 if (ret)
>                         return ret;
>
>                 ep->msi_iatu_mapped = true;
>                 ep->msi_msg_addr = msg_addr;
>                 ep->msi_map_size = map_size;
>         } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
>                                 ep->msi_map_size != map_size)) {
>                 /*
>                  * Host driver might have changed from MSI to MSI-X
>                  * or the other way around.
>                  */
>                 dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
>                 ret = dw_pcie_ep_map_addr(epc, func_no, 0,
>                                           ep->msi_mem_phys, msg_addr,
>                                           map_size);
>                 if (ret)
>                         return ret;
>         }
>
>         writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
>
>
>
> I think that should work without needing to introduce any
> .start_engine() / .stop_engine() APIs.
>
>
>
> Kind regards,
> Niklas
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
Please reference a specific section in the EP databook, and the specific
EP databook version that you are using.
This patch appears to address quite a serious issue, so I think that you
should submit it as a standalone patch, and not as part of a series.
(Especially not as part of an RFC which can take quite long before it is
even submitted as a normal (non-RFC) series.)
Kind regards,
Niklas
On Tue, Dec 02, 2025 at 07:32:31AM +0100, Niklas Cassel wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > for the MSI target address on every interrupt and tears it down again
> > via dw_pcie_ep_unmap_addr().
> >
> > On systems that heavily use the AXI bridge interface (for example when
> > the integrated eDMA engine is active), this means the outbound iATU
> > registers are updated while traffic is in flight. The DesignWare
> > endpoint spec warns that updating iATU registers in this situation is
> > not supported, and the behavior is undefined.
>
> Please reference a specific section in the EP databook, and the specific
> EP databook version that you are using.
Ok, the section I was referring to in the commit message is:
DW EPC databook 5.40a - 3.10.6.1 iATU Outbound Programming Overview
"Caution: Dynamic iATU Programming with AXI Bridge Module You must not update
the iATU registers while operations are in progress on the AXI bridge slave
interface."
>
> This patch appears to address quite a serious issue, so I think that you
> should submit it as a standalone patch, and not as part of a series.
>
> (Especially not as part of an RFC which can take quite long before it is
> even submitted as a normal (non-RFC) series.)
Makes sense, thank you for the guidance.
Koichiro
>
> Kind regards,
> Niklas
On Wed, Dec 03, 2025 at 05:30:52PM +0900, Koichiro Den wrote:
> On Tue, Dec 02, 2025 at 07:32:31AM +0100, Niklas Cassel wrote:
> > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > > for the MSI target address on every interrupt and tears it down again
> > > via dw_pcie_ep_unmap_addr().
> > >
> > > On systems that heavily use the AXI bridge interface (for example when
> > > the integrated eDMA engine is active), this means the outbound iATU
> > > registers are updated while traffic is in flight. The DesignWare
> > > endpoint spec warns that updating iATU registers in this situation is
> > > not supported, and the behavior is undefined.
> >
> > Please reference a specific section in the EP databook, and the specific
> > EP databook version that you are using.
>
> Ok, the section I was referring to in the commit message is:
>
> DW EPC databook 5.40a - 3.10.6.1 iATU Outbound Programming Overview
> "Caution: Dynamic iATU Programming with AXI Bridge Module You must not update
> the iATU registers while operations are in progress on the AXI bridge slave
> interface."
Please add this text to the commit message when sending a proper patch.
Nit: I think it is "DW EP databook" and not "DW EPC databook".
However, if what you are suggesting is true, that would have an implication
for all PCI EPF drivers.
E.g. the MHI EPF driver:
https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L394-L395
https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L323-L324
uses either the eDMA (without calling pci_epc_map_addr()) or MMIO
(which does call pci_epc_map_addr(), which will update the iATU registers),
depending on the I/O size.
And I assume that the MHI bus can have multiple outgoing reads/writes
at the same time.
If what you are suggesting is true, AFAICT, any EPF driver that could have
multiple outgoing transactions occurring at the same time, can not be allowed
to have calls to pci_epc_map_addr().
Which would mean that, even if we change dw_pcie_ep_raise_msix_irq() and
dw_pcie_ep_raise_msi_irq() to not call map_addr() after
dw_pcie_ep_init_registers(), we could never have an EPF driver that mixes
MMIO and DMA. (Or even has multiple outgoing MMIO transactions, as that
requires updating iATU registers.)
Kind regards,
Niklas
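PS: to make the concern concrete, the generic MMIO path I mean looks roughly
like the sketch below (simplified, error handling trimmed; epf, buf,
host_pci_addr and size are placeholders, this is not the actual pci-epf-mhi
code). Every such sequence reprograms an outbound iATU window, possibly while
eDMA transfers are in flight:

        void __iomem *vaddr;
        phys_addr_t phys;
        int ret;

        vaddr = pci_epc_mem_alloc_addr(epf->epc, &phys, size);
        if (!vaddr)
                return -ENOMEM;

        /* Programs an outbound iATU window: "dynamic iATU programming". */
        ret = pci_epc_map_addr(epf->epc, epf->func_no, epf->vfunc_no,
                               phys, host_pci_addr, size);
        if (ret)
                goto free;

        /* MMIO write to the host through the window. */
        memcpy_toio(vaddr, buf, size);

        /* Tears the window down again: another iATU register update. */
        pci_epc_unmap_addr(epf->epc, epf->func_no, epf->vfunc_no, phys);
free:
        pci_epc_mem_free_addr(epf->epc, phys, vaddr, size);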
On Wed, Dec 03, 2025 at 11:19:13AM +0100, Niklas Cassel wrote:
> On Wed, Dec 03, 2025 at 05:30:52PM +0900, Koichiro Den wrote:
> > On Tue, Dec 02, 2025 at 07:32:31AM +0100, Niklas Cassel wrote:
> > > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > > > for the MSI target address on every interrupt and tears it down again
> > > > via dw_pcie_ep_unmap_addr().
> > > >
> > > > On systems that heavily use the AXI bridge interface (for example when
> > > > the integrated eDMA engine is active), this means the outbound iATU
> > > > registers are updated while traffic is in flight. The DesignWare
> > > > endpoint spec warns that updating iATU registers in this situation is
> > > > not supported, and the behavior is undefined.
> > >
> > > Please reference a specific section in the EP databook, and the specific
> > > EP databook version that you are using.
> >
> > Ok, the section I was referring to in the commit message is:
> >
> > DW EPC databook 5.40a - 3.10.6.1 iATU Outbound Programming Overview
> > "Caution: Dynamic iATU Programming with AXI Bridge Module You must not update
> > the iATU registers while operations are in progress on the AXI bridge slave
> > interface."
>
> Please add this text to the commit message when sending a proper patch.
>
> Nit: I think it is "DW EP databook" and not "DW EPC databook".
Thanks for pointing it out. Noted.
>
> However, if what you are suggesting is true, that would have an implication
> for all PCI EPF drivers.
>
> E.g. the MHI EPF driver:
> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L394-L395
> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L323-L324
>
> uses either the eDMA (without calling pci_epc_map_addr()) or MMIO
> (which does call pci_epc_map_addr(), which will update the iATU registers),
> depending on the I/O size.
>
> And I assume that the MHI bus can have multiple outgoing reads/writes
> at the same time.
>
> If what you are suggesting is true, AFAICT, any EPF driver that could have
> multiple outgoing transactions occurring at the same time, can not be allowed
> to have calls to pci_epc_map_addr().
>
> Which would mean that, even if we change dw_pcie_ep_raise_msix_irq() and
> dw_pcie_ep_raise_msi_irq() to not call map_addr() after
> dw_pcie_ep_init_registers(), we could never have an EPF driver that mixes
> MMIO and DMA. (Or even has multiple outgoing MMIO transactions, as that
> requires updating iATU registers.)
I understand. That's a very good point.
I'm not really sure whether the issue this patch attempts to address is
SoC-specific (i.e. observable only on R-Car S4), but I still think it's a
good idea to conform to the databook and avoid the Caution. It is also still
somewhat speculative on my side, as I have not been able to verify what is
happening at the hardware level.
Koichiro
>
> Kind regards,
> Niklas
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
>
> Under high MSI and eDMA load this pattern results in occasional bogus
> outbound transactions and IOMMU faults such as:
>
> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>
I agree we needn't map/unmap MSI every time. But I think there should be a
logic problem behind this. An IOMMU-reported error means the page table was
already removed, but you still try to access it after that. You'd better find
where the MSI memory is accessed after dw_pcie_ep_unmap_addr().
dw_pcie_ep_unmap_addr() uses writel(), which uses dma_dmb() before changing
the register, so the previous write should be completed before the ATU
register is written.
Frank
> followed by the system becoming unresponsive. This is the actual output
> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>
> There is no need to reprogram the iATU region used for MSI on every
> interrupt. The host-provided MSI address is stable while MSI is enabled,
> and the endpoint driver already dedicates a scratch buffer for MSI
> generation.
>
> Cache the aligned MSI address and map size, program the outbound iATU
> once, and keep the window enabled. Subsequent interrupts only perform a
> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> the hot path and fixing the lockups seen under load.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> 2 files changed, 47 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 3780a9bd6f79..ef8ded34d9ab 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>
> + /*
> + * Tear down the dedicated outbound window used for MSI
> + * generation. This avoids leaking an iATU window across
> + * endpoint stop/start cycles.
> + */
> + if (ep->msi_iatu_mapped) {
> + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = false;
> + }
> +
> dw_pcie_stop_link(pci);
> }
>
> @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
>
> msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> - map_size);
> - if (ret)
> - return ret;
>
> - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> + /*
> + * Program the outbound iATU once and keep it enabled.
> + *
> + * The spec warns that updating iATU registers while there are
> + * operations in flight on the AXI bridge interface is not
> + * supported, so we avoid reprogramming the region on every MSI,
> + * specifically unmapping immediately after writel().
> + */
> + if (!ep->msi_iatu_mapped) {
> + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> + ep->msi_mem_phys, msg_addr,
> + map_size);
> + if (ret)
> + return ret;
>
> - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = true;
> + ep->msi_msg_addr = msg_addr;
> + ep->msi_map_size = map_size;
> + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> + ep->msi_map_size != map_size)) {
> + /*
> + * The host changed the MSI target address or the required
> + * mapping size. Reprogramming the iATU at runtime is unsafe
> + * on this controller, so bail out instead of trying to update
> + * the existing region.
> + */
> + return -EINVAL;
> + }
> +
> + writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
>
> return 0;
> }
> @@ -1268,6 +1301,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
> INIT_LIST_HEAD(&ep->func_list);
> INIT_LIST_HEAD(&ep->ib_map_list);
> spin_lock_init(&ep->ib_map_lock);
> + ep->msi_iatu_mapped = false;
> + ep->msi_msg_addr = 0;
> + ep->msi_map_size = 0;
>
> epc = devm_pci_epc_create(dev, &epc_ops);
> if (IS_ERR(epc)) {
> diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> index 269a9fe0501f..1770a2318557 100644
> --- a/drivers/pci/controller/dwc/pcie-designware.h
> +++ b/drivers/pci/controller/dwc/pcie-designware.h
> @@ -481,6 +481,11 @@ struct dw_pcie_ep {
> void __iomem *msi_mem;
> phys_addr_t msi_mem_phys;
> struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
> +
> + /* MSI outbound iATU state */
> + bool msi_iatu_mapped;
> + u64 msi_msg_addr;
> + size_t msi_map_size;
> };
>
> struct dw_pcie_ops {
> --
> 2.48.1
>
On Mon, Dec 01, 2025 at 03:41:38PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > for the MSI target address on every interrupt and tears it down again
> > via dw_pcie_ep_unmap_addr().
> >
> > On systems that heavily use the AXI bridge interface (for example when
> > the integrated eDMA engine is active), this means the outbound iATU
> > registers are updated while traffic is in flight. The DesignWare
> > endpoint spec warns that updating iATU registers in this situation is
> > not supported, and the behavior is undefined.
> >
> > Under high MSI and eDMA load this pattern results in occasional bogus
> > outbound transactions and IOMMU faults such as:
> >
> > ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
> >
>
> I agree we needn't map/unmap MSI every time. But I think there should be a
> logic problem behind this. An IOMMU-reported error means the page table was
> already removed, but you still try to access it after that. You'd better find
> where the MSI memory is accessed after dw_pcie_ep_unmap_addr().
I don't see any other callers that access the MSI region after
dw_pcie_ep_unmap_addr(), but I might be missing something. Also, even if I
serialize dw_pcie_ep_raise_msi_irq() invocations, the problem still
appears.
A couple of details I forgot to describe in the commit message:
(1). The IOMMU error is only reported on the RC side.
(2). Sometimes there is no IOMMU error printed and the board just freezes (becomes unresponsive).
The faulting iova is 0xfe000000. The iova 0xfe000000 is the base of
"addr_space" for R-Car S4 in EP mode:
https://github.com/jonmason/ntb/blob/68113d260674/arch/arm64/boot/dts/renesas/r8a779f0.dtsi#L847
So it looks like the EP sometimes issue MWr at "addr_space" base (offset 0),
the RC forwards it to its IOMMU (IPMMUHC) and that faults. My working theory
is that when the iATU registers are updated under heavy DMA load, the DAR of
some in-flight transfer can get corrupted to 0xfe000000. That would match one
possible symptom of the undefined behaviour that the DW EPC spec warns about
when changing iATU configuration under load.
-Koichiro
>
> dw_pcie_ep_unmap_addr() uses writel(), which uses dma_dmb() before changing
> the register, so the previous write should be completed before the ATU
> register is written.
>
> Frank
>
> > followed by the system becoming unresponsive. This is the actual output
> > observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
> >
> > There is no need to reprogram the iATU region used for MSI on every
> > interrupt. The host-provided MSI address is stable while MSI is enabled,
> > and the endpoint driver already dedicates a scratch buffer for MSI
> > generation.
> >
> > Cache the aligned MSI address and map size, program the outbound iATU
> > once, and keep the window enabled. Subsequent interrupts only perform a
> > write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> > the hot path and fixing the lockups seen under load.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> > drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> > 2 files changed, 47 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 3780a9bd6f79..ef8ded34d9ab 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> > struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >
> > + /*
> > + * Tear down the dedicated outbound window used for MSI
> > + * generation. This avoids leaking an iATU window across
> > + * endpoint stop/start cycles.
> > + */
> > + if (ep->msi_iatu_mapped) {
> > + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = false;
> > + }
> > +
> > dw_pcie_stop_link(pci);
> > }
> >
> > @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> > msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
> >
> > msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> > - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> > - map_size);
> > - if (ret)
> > - return ret;
> >
> > - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> > + /*
> > + * Program the outbound iATU once and keep it enabled.
> > + *
> > + * The spec warns that updating iATU registers while there are
> > + * operations in flight on the AXI bridge interface is not
> > + * supported, so we avoid reprogramming the region on every MSI,
> > + * specifically unmapping immediately after writel().
> > + */
> > + if (!ep->msi_iatu_mapped) {
> > + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> > + ep->msi_mem_phys, msg_addr,
> > + map_size);
> > + if (ret)
> > + return ret;
> >
> > - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = true;
> > + ep->msi_msg_addr = msg_addr;
> > + ep->msi_map_size = map_size;
> > + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> > + ep->msi_map_size != map_size)) {
> > + /*
> > + * The host changed the MSI target address or the required
> > + * mapping size. Reprogramming the iATU at runtime is unsafe
> > + * on this controller, so bail out instead of trying to update
> > + * the existing region.
> > + */
> > + return -EINVAL;
> > + }
> > +
> > + writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> >
> > return 0;
> > }
> > @@ -1268,6 +1301,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
> > INIT_LIST_HEAD(&ep->func_list);
> > INIT_LIST_HEAD(&ep->ib_map_list);
> > spin_lock_init(&ep->ib_map_lock);
> > + ep->msi_iatu_mapped = false;
> > + ep->msi_msg_addr = 0;
> > + ep->msi_map_size = 0;
> >
> > epc = devm_pci_epc_create(dev, &epc_ops);
> > if (IS_ERR(epc)) {
> > diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> > index 269a9fe0501f..1770a2318557 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware.h
> > +++ b/drivers/pci/controller/dwc/pcie-designware.h
> > @@ -481,6 +481,11 @@ struct dw_pcie_ep {
> > void __iomem *msi_mem;
> > phys_addr_t msi_mem_phys;
> > struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
> > +
> > + /* MSI outbound iATU state */
> > + bool msi_iatu_mapped;
> > + u64 msi_msg_addr;
> > + size_t msi_map_size;
> > };
> >
> > struct dw_pcie_ops {
> > --
> > 2.48.1
> >
Hello Koichiro,

On Tue, Dec 02, 2025 at 03:35:36PM +0900, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 03:41:38PM -0500, Frank Li wrote:
> > I agree we needn't map/unmap MSI every time. But I think there may be a
> > logic problem behind this. [...]
>
> [...]
>
> So it looks like the EP sometimes issues an MWr at the "addr_space" base
> (offset 0), the RC forwards it to its IOMMU (IPMMUHC) and that faults. My
> working theory is that when the iATU registers are updated under heavy DMA
> load, the DAR of some in-flight transfer can get corrupted to 0xfe000000.
> That would match one possible symptom of the undefined behaviour that the
> DW EPC spec warns about when changing iATU configuration under load.

For your information, in the NVMe PCI EPF driver:
https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429

we take a mutex around the dmaengine_slave_config() and dma_sync_wait()
calls. Without the mutex, allowing multiple outstanding transfers led to
corruption: because dmaengine_slave_config() specifies the src/dst address,
a concurrent call would affect other in-flight DMA transfers, leaving them
with invalid src/dst addresses.

Having a mutex so that we can only have one outstanding transfer solves
these issues, but is obviously very bad for performance.

I did try to add DMA_MEMCPY support to the dw-edma driver:
https://lore.kernel.org/linux-pci/20241217160448.199310-4-cassel@kernel.org/

since that would allow us to specify both the src and dst address in a
single dmaengine function call (so that we would no longer need a mutex).

However, because the eDMA hardware (at least for EDMA_LEGACY_UNROLL) does
not support PCI-to-PCI transfers, only PCI to local DDR or local DDR to
PCI, using prep_memcpy() is wrong, as it does not take a direction:
https://lore.kernel.org/linux-pci/Z4jf2s5SaUu3wdJi@ryzen/

If we want to improve the dw-edma driver so that an EPF driver can have
multiple outstanding transfers, I think the best way forward would be to
create a new _prep_slave_memcpy() or similar that does take a direction,
and thus does not require dmaengine_slave_config() to be called before
every _prep_slave_memcpy() call.

Kind regards,
Niklas
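For reference, a minimal sketch of the serialized transfer pattern described
above, assuming a hypothetical epf_dma_chan wrapper and epf_dma_transfer()
helper; only the dmaengine calls are existing API:

#include <linux/dmaengine.h>
#include <linux/errno.h>
#include <linux/mutex.h>

struct epf_dma_chan {
	struct dma_chan *chan;
	struct mutex lock;	/* serializes config + prep + wait */
};

static int epf_dma_transfer(struct epf_dma_chan *edc, dma_addr_t dma_local,
			    u64 pci_addr, size_t len,
			    enum dma_transfer_direction dir)
{
	struct dma_slave_config config = {};
	struct dma_async_tx_descriptor *desc;
	dma_cookie_t cookie;
	int ret = 0;

	config.direction = dir;
	if (dir == DMA_MEM_TO_DEV)
		config.dst_addr = pci_addr;	/* local DDR -> PCI */
	else
		config.src_addr = pci_addr;	/* PCI -> local DDR */

	/*
	 * The slave config is per-channel state: without this lock a
	 * concurrent caller could overwrite the src/dst address of an
	 * in-flight transfer.
	 */
	mutex_lock(&edc->lock);

	ret = dmaengine_slave_config(edc->chan, &config);
	if (ret)
		goto out_unlock;

	desc = dmaengine_prep_slave_single(edc->chan, dma_local, len, dir,
					   DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
	if (!desc) {
		ret = -EIO;
		goto out_unlock;
	}

	cookie = dmaengine_submit(desc);
	ret = dma_submit_error(cookie);
	if (ret)
		goto out_unlock;

	dma_async_issue_pending(edc->chan);

	if (dma_sync_wait(edc->chan, cookie) != DMA_COMPLETE)
		ret = -EIO;

out_unlock:
	mutex_unlock(&edc->lock);
	return ret;
}

The per-channel mutex buys correctness at the cost of a single outstanding
transfer per channel, which is the performance problem mentioned above.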
On Tue, Dec 02, 2025 at 10:32:19AM +0100, Niklas Cassel wrote:
> Hello Koichiro,
>
> [...]
>
> For your information, in the NVMe PCI EPF driver:
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
>
> we take a mutex around the dmaengine_slave_config() and dma_sync_wait()
> calls. Without the mutex, allowing multiple outstanding transfers led to
> corruption: because dmaengine_slave_config() specifies the src/dst address,
> a concurrent call would affect other in-flight DMA transfers, leaving them
> with invalid src/dst addresses.
>
> Having a mutex so that we can only have one outstanding transfer solves
> these issues, but is obviously very bad for performance.

Thank you for the info, it helps a lot.

As for DW eDMA, it seems that at least dmaengine_slave_config() and
dmaengine_prep_slave_sg() need to be executed under an exclusive
per-dma_chan lock. In hindsight, [RFC PATCH v2 20/27] should have done so,
although I still think it's unrelated to the particular issue addressed by
this commit.

For testing, I tried adding dma_sync_wait() and taking a per-dma_chan mutex
around the dmaengine_slave_config() ~ dma_sync_wait() sequence, but the
problem (which often shows up as the IOMMU fault at 0xfe000000) remains
under heavy load if I revert this commit. The diff for the experiment is a
bit large (+117, -89), so I'm not posting it here, but I can do so if that
would be useful.

> I did try to add DMA_MEMCPY support to the dw-edma driver:
> https://lore.kernel.org/linux-pci/20241217160448.199310-4-cassel@kernel.org/
>
> since that would allow us to specify both the src and dst address in a
> single dmaengine function call (so that we would no longer need a mutex).
>
> However, because the eDMA hardware (at least for EDMA_LEGACY_UNROLL) does
> not support PCI-to-PCI transfers, only PCI to local DDR or local DDR to
> PCI, using prep_memcpy() is wrong, as it does not take a direction:
> https://lore.kernel.org/linux-pci/Z4jf2s5SaUu3wdJi@ryzen/

Interesting, I didn't know that.

> If we want to improve the dw-edma driver so that an EPF driver can have
> multiple outstanding transfers, I think the best way forward would be to
> create a new _prep_slave_memcpy() or similar that does take a direction,
> and thus does not require dmaengine_slave_config() to be called before
> every _prep_slave_memcpy() call.

Would dmaengine_prep_slave_single_config(), which Frank told us about in
this thread, be sufficient?

Thanks for reviewing,
Koichiro
On Tue, Dec 02, 2025 at 10:32:19AM +0100, Niklas Cassel wrote:
> Hello Koichiro,
>
> [...]
>
> If we want to improve the dw-edma driver so that an EPF driver can have
> multiple outstanding transfers, I think the best way forward would be to
> create a new _prep_slave_memcpy() or similar that does take a direction,
> and thus does not require dmaengine_slave_config() to be called before
> every _prep_slave_memcpy() call.

Thanks. We are also considering putting the slave config information into
prep_sg(single), i.e.

dmaengine_prep_slave_sg(single)_config(..., struct dma_slave_config *config)

to simplify error handling in the DMA transfer functions:

dmaengine_prep_slave_single_config(struct dma_chan *chan, size_t size,
				   unsigned long flags,
				   struct dma_slave_config *config)

The config already includes the src and dst addresses.

Frank
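For illustration, a rough usage sketch of the interface proposed above.
dmaengine_prep_slave_single_config() does not exist in mainline; the
prototype below merely restates the proposal, and epf_dma_transfer_proposed()
is a hypothetical caller:

#include <linux/dmaengine.h>
#include <linux/errno.h>

/* Proposed API (not in mainline): both addresses travel in the config. */
struct dma_async_tx_descriptor *
dmaengine_prep_slave_single_config(struct dma_chan *chan, size_t size,
				   unsigned long flags,
				   struct dma_slave_config *config);

static int epf_dma_transfer_proposed(struct dma_chan *chan,
				     dma_addr_t dma_local, u64 pci_addr,
				     size_t len,
				     enum dma_transfer_direction dir)
{
	struct dma_slave_config config = {
		.direction = dir,
		.src_addr = (dir == DMA_DEV_TO_MEM) ? pci_addr : dma_local,
		.dst_addr = (dir == DMA_DEV_TO_MEM) ? dma_local : pci_addr,
	};
	struct dma_async_tx_descriptor *desc;
	dma_cookie_t cookie;

	/* Config and prep are a single call, so no per-channel lock is needed. */
	desc = dmaengine_prep_slave_single_config(chan, len,
						  DMA_CTRL_ACK | DMA_PREP_INTERRUPT,
						  &config);
	if (!desc)
		return -EIO;

	cookie = dmaengine_submit(desc);
	if (dma_submit_error(cookie))
		return -EIO;

	dma_async_issue_pending(chan);

	return dma_sync_wait(chan, cookie) == DMA_COMPLETE ? 0 : -EIO;
}

Because the per-transfer addresses no longer live in shared per-channel
state, multiple transfers could be outstanding on one channel concurrently.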