When lan966x operates as a PCIe endpoint, the driver currently uses
register-based I/O for frame injection and extraction. This approach is
functional but slow, topping out at around 33 Mbps on an Intel x86 host
with a lan966x PCIe card.
This series adds FDMA (Frame DMA) support for the PCIe path. When
operating as a PCIe endpoint, the internal FDMA engine on lan966x cannot
directly access host memory, so DMA buffers are allocated as contiguous
coherent memory and mapped through the PCIe Address Translation Unit
(ATU). The ATU provides outbound windows that translate internal FDMA
addresses to PCIe bus addresses, allowing the FDMA engine to read and
write host memory. Because the ATU requires contiguous address regions,
page_pool and normal per-page DMA mappings cannot be used. Instead,
frames are transferred using memcpy between the ATU-mapped buffers and
the network stack. With this, throughput increases from ~33 Mbps to ~620
Mbps for default MTU.
Patches 1-2 prepare the shared FDMA library: patch 1 renames the
contiguous dataptr helpers for clarity, and patch 2 adds PCIe ATU region
management and coherent DMA allocation with ATU mapping.
Patches 3-5 refactor the lan966x FDMA code to support both platform and
PCIe paths: extracting the LLP register write into a helper, exporting
shared functions, and introducing an ops dispatch table selected at
probe time.
Patch 6 adds the core PCIe FDMA implementation with RX/TX using
contiguous ATU-mapped buffers. Patches 7 and 8 extend it with MTU
change and XDP support respectively.
Patches 9-10 update the lan966x PCI device tree overlay to extend the
cpu register mapping to cover the ATU register space and add the FDMA
interrupt.
To: Andrew Lunn <andrew+netdev@lunn.ch>
To: David S. Miller <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Horatiu Vultur <horatiu.vultur@microchip.com>
To: Steen Hegelund <steen.hegelund@microchip.com>
To: UNGLinuxDriver@microchip.com
To: Alexei Starovoitov <ast@kernel.org>
To: Daniel Borkmann <daniel@iogearbox.net>
To: Jesper Dangaard Brouer <hawk@kernel.org>
To: John Fastabend <john.fastabend@gmail.com>
To: Stanislav Fomichev <sdf@fomichev.me>
To: Herve Codina <herve.codina@bootlin.com>
To: Arnd Bergmann <arnd@arndb.de>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
---
Daniel Machon (10):
net: microchip: fdma: rename contiguous dataptr helpers
net: microchip: fdma: add PCIe ATU support
net: lan966x: add FDMA LLP register write helper
net: lan966x: export FDMA helpers for reuse
net: lan966x: add FDMA ops dispatch for PCIe support
net: lan966x: add PCIe FDMA support
net: lan966x: add PCIe FDMA MTU change support
net: lan966x: add PCIe FDMA XDP support
misc: lan966x-pci: dts: extend cpu reg to cover PCIE DBI space
misc: lan966x-pci: dts: add fdma interrupt to overlay
drivers/misc/lan966x_pci.dtso | 5 +-
drivers/net/ethernet/microchip/fdma/Makefile | 4 +
drivers/net/ethernet/microchip/fdma/fdma_api.c | 33 ++
drivers/net/ethernet/microchip/fdma/fdma_api.h | 25 +-
drivers/net/ethernet/microchip/fdma/fdma_pci.c | 177 +++++++
drivers/net/ethernet/microchip/fdma/fdma_pci.h | 41 ++
drivers/net/ethernet/microchip/lan966x/Makefile | 4 +
.../net/ethernet/microchip/lan966x/lan966x_fdma.c | 51 +-
.../ethernet/microchip/lan966x/lan966x_fdma_pci.c | 551 +++++++++++++++++++++
.../net/ethernet/microchip/lan966x/lan966x_main.c | 47 +-
.../net/ethernet/microchip/lan966x/lan966x_main.h | 46 ++
.../net/ethernet/microchip/lan966x/lan966x_regs.h | 1 +
.../net/ethernet/microchip/lan966x/lan966x_xdp.c | 6 +
13 files changed, 949 insertions(+), 42 deletions(-)
---
base-commit: 9ac76f3d0bb2940db3a9684d596b9c8f301ef315
change-id: 20260313-lan966x-pci-fdma-94ed485d23fa
Best regards,
--
Daniel Machon <daniel.machon@microchip.com>
Hi Daniel,

On Fri, 20 Mar 2026 16:00:56 +0100
Daniel Machon <daniel.machon@microchip.com> wrote:

> When lan966x operates as a PCIe endpoint, the driver currently uses
> register-based I/O for frame injection and extraction. [...]

Thanks a lot for the series taking care of DMA and ATU in PCIe variants.

I have tested the whole series on both my ARM and x86 systems.

Doing a simple wget on my x86 system, I moved from 3.8MB/s to 11.2MB/s and
so the improvement is obvious.

Tested-by: Herve Codina <herve.codina@bootlin.com>

Best regards,
Hervé
Hi Daniel,

On Mon, 23 Mar 2026 15:52:04 +0100
Herve Codina <herve.codina@bootlin.com> wrote:

> Hi Daniel,
>
> On Fri, 20 Mar 2026 16:00:56 +0100
> Daniel Machon <daniel.machon@microchip.com> wrote:
>
> > When lan966x operates as a PCIe endpoint, the driver currently uses
> > register-based I/O for frame injection and extraction. [...]
>
> Thanks a lot for the series taking care of DMA and ATU in PCIe variants.
>
> I have tested the whole series on both my ARM and x86 systems.
>
> Doing a simple wget on my x86 system, I moved from 3.8MB/s to 11.2MB/s and
> so the improvement is obvious.
>
> Tested-by: Herve Codina <herve.codina@bootlin.com>

Hum, I think I found an issue.

If I remove the lan966x_pci module (modprobe -r lan966x_pci), and reload
it (modprobe lan966x_pci), the board is not working.

The system performs DHCP requests. Those requests are served by my PC
(observed with Wireshark) but the system doesn't see those answers.
Indeed, it continues to perform DHCP requests.

Looks like the lan966x_pci module removal leaves the board in a bad state.

Without the series applied, DHCP request answers from my PC are seen by the
system after any module unloading / reloading.

Do you have any ideas of what could be wrong?

Best regards,
Hervé
Hi Hervé,

> Hum, I think I found an issue.
>
> If I remove the lan966x_pci module (modprobe -r lan966x_pci), and reload
> it (modprobe lan966x_pci), the board is not working.
>
> The system performs DHCP requests. Those requests are served by my PC
> (observed with Wireshark) but the system doesn't see those answers.
> Indeed, it continues to perform DHCP requests.
>
> Looks like the lan966x_pci module removal leaves the board in a bad state.
>
> Without the series applied, DHCP request answers from my PC are seen by the
> system after any module unloading / reloading.
>
> Do you have any ideas of what could be wrong?

Thanks for testing this!

As part of my testing I did unload/load the lan966x_switch module to ensure the
ATU was properly reset and reconfigured, and that seemed to work fine. I must
admit, I did not try with the lan966x_pci module.

From what I hear, when you are in the bad state, TX is still working, so it's an
RX issue. Could be the interrupt is not firing, so the napi poll is not
scheduled.

Anyway, I will have a look at it during the week. Will let you know.

/Daniel
Hi Daniel,
On Mon, 23 Mar 2026 20:40:59 +0100
Daniel Machon <daniel.machon@microchip.com> wrote:
> Hi Hervé,
>
> > Hi Daniel,
> >
> > On Mon, 23 Mar 2026 15:52:04 +0100
> > Herve Codina <herve.codina@bootlin.com> wrote:
> >
> > > Hi Daniel,
> > >
> > > On Fri, 20 Mar 2026 16:00:56 +0100
> > > Daniel Machon <daniel.machon@microchip.com> wrote:
> > >
> > > > When lan966x operates as a PCIe endpoint, the driver currently uses
> > > > register-based I/O for frame injection and extraction. [...]
> > > >
> > >
> > > Thanks a lot for the series taking care of DMA and ATU in PCIe variants.
> > >
> > > I have tested the whole series on both my ARM and x86 systems.
> > >
> > > Doing a simple wget on my x86 system, I moved from 3.8MB/s to 11.2MB/s and
> > > so the improvement is obvious.
> > >
> > > Tested-by: Herve Codina <herve.codina@bootlin.com>
> > >
> >
> > Hum, I think I found an issue.
> >
> > If I remove the lan966x_pci module (modprobe -r lan966x_pci), and reload
> > it (modprobe lan966x_pci), the board is not working.
> >
> > The system performs DHCP requests. Those requests are served by my PC (observed
> > with Wireshark) but the system doesn't see those answers. Indeed, it continues
> > to perform DHCP requests.
> >
> > Looks like the lan966x_pci module removal leaves the board in a bad state.
> >
> > Without the series applied, DHCP request answers from my PC are seen by the
> > system after any module unloading / reloading.
> >
> > Do you have any ideas of what could be wrong?
> >
> > Best regards,
> > Hervé
>
> Thanks for testing this!
>
> As part of my testing I did unload/load the lan966x_switch module to ensure the
> ATU was properly reset and reconfigured, and that seemed to work fine. I must
> admit, I did not try with the lan966x_pci module.
>
> From what I hear, when you are in the bad state, TX is still working, so it's an
> RX issue. Could be the interrupt is not firing, so the napi poll is not
> scheduled.
Yes, confirmed that the issue is on Rx path. Tx data were received by my PC.
>
> Anyway, I will have a look at it during the week. Will let you know.
>
Some more interesting information.
I tested lan966x_pci module unloading and re-loading on my ARM system.
On this system the following traces are present when I reload the lan966x_pci
module. Those traces were not present on my x86 system.
[ 104.715031] ------------[ cut here ]------------
[ 104.719746] Unexpected error: 64, error_type: 1073741824
[ 104.725217] WARNING: drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c:558 at lan966x_fdma_irq_handler+0xe0/0x12c [lan966x_switch], CPU#0: swapper/0/0
[ 104.739119] Modules linked in: lan966x_pci irq_lan966x_oic reset_microchip_sparx5 pinctrl_ocelot lan966x_serdes mdio_mscc_miim lan966x_switch rtc_ds1307 marvell [last unloaded: lan966x_pci]
[ 104.756250] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-rc1-00010-gfa357f2a6a00 #565 PREEMPT
[ 104.765743] Hardware name: Marvell Armada 3720 Development Board DB-88F3720-DDR3 (DT)
[ 104.773579] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 104.780551] pc : lan966x_fdma_irq_handler+0xe0/0x12c [lan966x_switch]
[ 104.787046] lr : lan966x_fdma_irq_handler+0xe0/0x12c [lan966x_switch]
[ 104.793538] sp : ffff800082a7bc00
[ 104.796860] x29: ffff800082a7bc00 x28: ffff8000819abd00 x27: ffff800081a57757
[ 104.804033] x26: ffff8000819a5d88 x25: 0000000000012400 x24: ffff800081a57758
[ 104.811205] x23: 0000000000000038 x22: ffff00000f70ec00 x21: 0000000040000000
[ 104.818376] x20: ffff00000f7a8080 x19: 0000000000000040 x18: 00000000ffffffff
[ 104.825548] x17: ffff7fff9e7d6000 x16: ffff800082a78000 x15: ffff800102a7b837
[ 104.832719] x14: 0000000000000000 x13: 0000000000000000 x12: 3031203a65707974
[ 104.839891] x11: ffff8000819ca758 x10: 0000000000000018 x9 : ffff8000819ca758
[ 104.847062] x8 : 00000000ffffefff x7 : ffff800081a22758 x6 : 00000000fffff000
[ 104.854233] x5 : ffff00001fea1588 x4 : 0000000000000000 x3 : 0000000000000027
[ 104.861404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8000819abd00
[ 104.868576] Call trace:
[ 104.871034] lan966x_fdma_irq_handler+0xe0/0x12c [lan966x_switch] (P)
[ 104.877531] __handle_irq_event_percpu+0xa0/0x4c4
[ 104.882261] handle_irq_event+0x4c/0xf8
[ 104.886117] handle_level_irq+0xec/0x17c
[ 104.890064] handle_irq_desc+0x40/0x58
[ 104.893831] generic_handle_domain_irq+0x18/0x24
[ 104.898467] lan966x_oic_irq_handler_domain+0x64/0xb0 [irq_lan966x_oic]
[ 104.905105] lan966x_oic_irq_handler+0x34/0xb4 [irq_lan966x_oic]
[ 104.911131] handle_irq_desc+0x40/0x58
[ 104.914900] generic_handle_domain_irq+0x18/0x24
[ 104.919536] pci_dev_irq_handler+0x1c/0x30 [lan966x_pci]
[ 104.924871] __handle_irq_event_percpu+0xa0/0x4c4
[ 104.929594] handle_irq_event+0x4c/0xf8
[ 104.933449] handle_level_irq+0xec/0x17c
[ 104.937394] handle_irq_desc+0x40/0x58
[ 104.941162] generic_handle_domain_irq+0x18/0x24
[ 104.945797] advk_pcie_irq_handler+0x160/0x390
[ 104.950260] __handle_irq_event_percpu+0xa0/0x4c4
[ 104.954983] handle_irq_event+0x4c/0xf8
[ 104.958839] handle_fasteoi_irq+0x108/0x20c
[ 104.963043] handle_irq_desc+0x40/0x58
[ 104.966811] generic_handle_domain_irq+0x18/0x24
[ 104.971447] gic_handle_irq+0x4c/0x110
[ 104.975214] call_on_irq_stack+0x30/0x48
[ 104.979156] do_interrupt_handler+0x80/0x84
[ 104.983360] el1_interrupt+0x3c/0x60
[ 104.986959] el1h_64_irq_handler+0x18/0x24
[ 104.991076] el1h_64_irq+0x6c/0x70
[ 104.994495] default_idle_call+0x80/0x138 (P)
[ 104.998870] do_idle+0x220/0x290
[ 105.002121] cpu_startup_entry+0x34/0x3c
[ 105.006064] rest_init+0xf8/0x188
[ 105.009398] start_kernel+0x818/0x8ec
[ 105.013085] __primary_switched+0x88/0x90
[ 105.017119] irq event stamp: 60740
[ 105.020529] hardirqs last enabled at (60739): [<ffff800080fb37bc>] default_idle_call+0x7c/0x138
[ 105.029328] hardirqs last disabled at (60740): [<ffff800080fab5c4>] enter_from_kernel_mode+0x10/0x3c
[ 105.038477] softirqs last enabled at (60728): [<ffff8000800cb774>] handle_softirqs+0x624/0x63c
[ 105.047193] softirqs last disabled at (60711): [<ffff8000800102d8>] __do_softirq+0x14/0x20
[ 105.055471] ---[ end trace 0000000000000000 ]---
[ 105.060274] ------------[ cut here ]------------
[ 105.064963] Unexpected error: 64, error_type: 1073741824
[ 105.070440] WARNING: drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c:558 at lan966x_fdma_irq_handler+0xe0/0x12c [lan966x_switch], CPU#0: swapper/0/0
...
[ 105.443891] ------------[ cut here ]------------
[ 105.448536] Unexpected error: 64, error_type: 0
[ 105.453235] WARNING: drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c:558 at lan966x_fdma_irq_handler+0xe0/0x12c [lan966x_switch], CPU#0: swapper/0/0
...
And after them my ARM system doesn't see the answer to a ping command
replied by my PC (I don't use DHCP with my ARM system).
I don't know why those traces are not present on my x86 (config, race
condition, other reason) but anyway, they can help to find some clues
about what's going on.
Of course, feel free to ask me some more tests or anything else I can do
to help on this topic.
Best regards,
Hervé
Hi Hervé,
> Hi Daniel,
>
> On Mon, 23 Mar 2026 20:40:59 +0100
> Daniel Machon <daniel.machon@microchip.com> wrote:
>
> > Hi Hervé,
> >
> > > Hi Daniel,
> > >
> > > On Mon, 23 Mar 2026 15:52:04 +0100
> > > Herve Codina <herve.codina@bootlin.com> wrote:
> > >
> > > > Hi Daniel,
> > > >
> > > > On Fri, 20 Mar 2026 16:00:56 +0100
> > > > Daniel Machon <daniel.machon@microchip.com> wrote:
> > > >
> > > > > When lan966x operates as a PCIe endpoint, the driver currently uses
> > > > > register-based I/O for frame injection and extraction. [...]
> > > > >
> > > >
> > > > Thanks a lot for the series taking care of DMA and ATU in PCIe variants.
> > > >
> > > > I have tested the whole series on both my ARM and x86 systems.
> > > >
> > > > Doing a simple wget on my x86 system, I moved from 3.8MB/s to 11.2MB/s and
> > > > so the improvement is obvious.
> > > >
> > > > Tested-by: Herve Codina <herve.codina@bootlin.com>
> > > >
> > >
> > > Hum, I think I found an issue.
> > >
> > > If I remove the lan966x_pci module (modprobe -r lan966x_pci), and reload
> > > it (modprobe lan966x_pci), the board is not working.
> > >
> > > The system performs DHCP requests. Those requests are served by my PC (observed
> > > with Wireshark) but the system doesn't see those answers. Indeed, it continues
> > > to perform DHCP requests.
> > >
> > > Looks like the lan966x_pci module removal leaves the board in a bad state.
> > >
> > > Without the series applied, DHCP request answers from my PC are seen by the
> > > system after any module unloading / reloading.
> > >
> > > Do you have any ideas of what could be wrong?
> > >
> > > Best regards,
> > > Hervé
> >
> > Thanks for testing this!
> >
> > As part of my testing I did unload/load the lan966x_switch module to ensure the
> > ATU was properly reset and reconfigured, and that seemed to work fine. I must
> > admit, I did not try with the lan966x_pci module.
> >
> > From what I hear, when you are in the bad state, TX is still working, so it's an
> > RX issue. Could be the interrupt is not firing, so the napi poll is not
> > scheduled.
>
> Yes, confirmed that the issue is on Rx path. Tx data were received by my PC.
>
> >
> > Anyway, I will have a look at it during the week. Will let you know.
> >
>
> Some more interesting information.
>
> I tested lan966x_pci module unloading and re-loading on my ARM system.
>
> On this system the following traces are present when I reload the lan966x_pci
> module. Those traces were not present on my x86 system.
>
> [ 104.715031] ------------[ cut here ]------------
> [ 104.719746] Unexpected error: 64, error_type: 1073741824
> [ 104.725217] WARNING: drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c:558 at lan966x_fdma_irq_handler+0xe0/0x12c [lan966x_switch], CPU#0: swapper/0/0
> ...
>
> And after them my ARM system doesn't see the answer to a ping command
> replied by my PC (I don't use DHCP with my ARM system).
>
> I don't know why those traces are not present on my x86 (config, race
> condition, other reason) but anyway, they can help to find some clues
> about what's going on.
>
> Of course, feel free to ask me some more tests or anything else I can do
> to help on this topic.
>
> Best regards,
> Hervé
As I mentioned, doing rmmod on the lan966x_switch module followed by modprobe
lan966x_switch works fine. This is because neither the switch core nor the FDMA
engine is reset, so the two remain in sync.
When the lan966x_pci module is removed and reloaded (what you did), the DT
overlay is re-applied, which causes the reset controller
(reset-microchip-sparx5) to re-probe. During probe, it performs a GCB soft reset
that resets the switch core, but protects the CPU domain from the reset. The
FDMA engine is part of the CPU domain, so it is not reset.
This leaves the switch core in a reset state while the FDMA
retains state from the previous driver instance. When the switch driver
subsequently probes and activates the FDMA channels, the two are out of
sync, and the FDMA immediately reports extraction errors.
There's actually an FDMA register called NRESET that resets the FDMA controller
state. Setting it in the FDMA init path causes traffic to work correctly on
lan966x_pci reload, but it does not get rid of the FDMA splats you posted above.
They get queued up between the switch core reset (done in the reset controller)
and the FDMA enabling. I tried different approaches to drain or flush the
queues, but the splats won't go away entirely.
The only thing that seems to work consistently is to *not* do the soft reset in
the reset controller for the PCI path. The soft reset is actually the problem:
it only resets the switch core while protecting the CPU domain (including FDMA),
causing a desync.
A simple fix could be (in reset-microchip-sparx5.c):
+static bool mchp_reset_is_pci(struct device *dev)
+{
+        for (dev = dev->parent; dev; dev = dev->parent) {
+                if (dev_is_pci(dev))
+                        return true;
+        }
+        return false;
+}
-        /* Issue the reset very early, our actual reset callback is a noop. */
-        err = sparx5_switch_reset(ctx);
-        if (err)
-                return err;
+        /* Issue the reset very early, our actual reset callback is a noop.
+         *
+         * On the PCI path, skip the reset. The endpoint is already in
+         * power-on reset state on the first probe. On subsequent probes
+         * (after driver reload), resetting the switch core while the FDMA
+         * retains state (CPU domain is protected from the soft reset)
+         * causes the two to go out of sync, leading to FDMA extraction
+         * errors.
+         */
+        if (!mchp_reset_is_pci(&pdev->dev)) {
+                err = sparx5_switch_reset(ctx);
+                if (err)
+                        return err;
+        }
Could you test it and see if it helps the problem on your side?
/Daniel
Hi Daniel,
On Thu, 26 Mar 2026 16:48:33 +0100
Daniel Machon <daniel.machon@microchip.com> wrote:
...
>
> ...
>
> Could you test it and see if it helps the problem on your side?
>
I have tested it on my ARM and x86 systems. It fixes the lan966x_pci module
unloading / reloading issue.
However, another regression is present. After a reboot, without a power
off/on cycle, the board does not work (tested on both my ARM and x86 systems).
According to your explanation, this makes sense.
IMHO, the problem is that we cannot make the assumption that "The endpoint
is already in power-on reset state on the first probe". That's not true
when you just call the reboot command.
Best regards,
Hervé
Hi Hervé,
> Hi Daniel,
>
> On Thu, 26 Mar 2026 16:48:33 +0100
> Daniel Machon <daniel.machon@microchip.com> wrote:
>
> ...
>
> >
> > ...
> >
>
> I have tested it on my ARM and x86 systems. It fixes the lan966x_pci module
> unloading / reloading issue.
>
> However, another regression is present. After a reboot, without a power
> off/on cycle, the board does not work (tested on both my ARM and x86
> systems).
>
> According to your explanation, this makes sense.
>
> IMHO, the problem is that we cannot make the assumption that "The endpoint
> is already in power-on reset state on the first probe". That's not true
> when you just call the reboot command.
>
> Best regards,
> Hervé
Again, thanks for testing.
Agreed, that makes sense.
I will continue experimenting with the FDMA reset and see if I can do an FDMA
reset on switch driver probe, while not getting any intermediate FDMA errors.
After spring break, that is :)
/Daniel