drivers/spi/spi-fsl-dspi.c | 233 ++++++++++++++++++++++++++++++++------------- 1 file changed, 166 insertions(+), 67 deletions(-)
Improve usability of target mode by reporting FIFO errors and increasing the buffer size when DMA is used. While we're touching DMA stuff also switch to non-coherent memory, although this is unrelated to target mode. The first commit is marked as a fix because it can fix intermittent issues with existing transfers, rather than the later fixes which improve larger than FIFO target mode transfers which would have never worked. With the combination of the commit to increase the DMA buffer size and the commit to use non-coherent memory, the host mode performance figures are as follows on S32G3: # spidev_test --device /dev/spidev1.0 --bpw 8 --size <test_size> --cpha --iter 10000000 --speed 10000000 Coherent (4096 byte transfers): 6534 kbps Non-coherent: 7347 kbps Coherent (16 byte transfers): 447 kbps Non-coherent: 448 kbps Just for comparison running the same test in XSPI mode: 4096 byte transfers: 2143 kbps 16 byte transfers: 637 kbps These tests required hacking S32G3 to use DMA in host mode, although the figures should be representative of target mode too where DMA is used. And the other devices that use DMA in host mode should see similar improvements. Signed-off-by: James Clark <james.clark@linaro.org> --- Changes in v4: - Fix !CONFIG_DMA_ENGINE build (and actually test it this time) - Reword completion counter comment - Reword some commit messages - Reset tries in dspi_poll() for each transfer - Check for fifo errors in dspi_poll() before checking for completion - Link to v3: https://lore.kernel.org/r/20250624-james-nxp-spi-dma-v3-0-e7d574f5f62c@linaro.org Changes in v3: - Stub out DMA functions in the driver so no-DMA builds work - Link to v2: https://lore.kernel.org/r/20250613-james-nxp-spi-dma-v2-0-017eecf24aab@linaro.org Changes in v2: - Store status in cur_msg->status rather than adding xfer_status - Show exact underflow/overflow flags in error message - Rate limit error messages - Add a comment about resetting the completion counter prior to transfer - Rename dspi_is_fifo_overflow() -> dspi_fifo_error() - Add performance figures to cover letter - Rebase onto https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git/for-next to avoid some conflicts - Link to v1: https://lore.kernel.org/r/20250609-james-nxp-spi-dma-v1-0-2b831e714be2@linaro.org --- James Clark (5): spi: spi-fsl-dspi: Clear completion counter before initiating transfer spi: spi-fsl-dspi: Store status directly in cur_msg->status spi: spi-fsl-dspi: Stub out DMA functions spi: spi-fsl-dspi: Use non-coherent memory for DMA spi: spi-fsl-dspi: Report FIFO overflows as errors Larisa Grigore (1): spi: spi-fsl-dspi: Increase DMA buffer size drivers/spi/spi-fsl-dspi.c | 233 ++++++++++++++++++++++++++++++++------------- 1 file changed, 166 insertions(+), 67 deletions(-) --- base-commit: 4f326fa6236787ca516ea6eab8e5e9dc5c236f03 change-id: 20250522-james-nxp-spi-dma-a997ebebfb6b Best regards, -- James Clark <james.clark@linaro.org>
On Fri, Jun 27, 2025 at 11:21:36AM +0100, James Clark wrote: > Improve usability of target mode by reporting FIFO errors and increasing > the buffer size when DMA is used. While we're touching DMA stuff also > switch to non-coherent memory, although this is unrelated to target > mode. > > The first commit is marked as a fix because it can fix intermittent > issues with existing transfers, rather than the later fixes which > improve larger than FIFO target mode transfers which would have never > worked. > > With the combination of the commit to increase the DMA buffer size and > the commit to use non-coherent memory, the host mode performance figures > are as follows on S32G3: > > # spidev_test --device /dev/spidev1.0 --bpw 8 --size <test_size> --cpha --iter 10000000 --speed 10000000 > > Coherent (4096 byte transfers): 6534 kbps > Non-coherent: 7347 kbps > > Coherent (16 byte transfers): 447 kbps > Non-coherent: 448 kbps > > Just for comparison running the same test in XSPI mode: > > 4096 byte transfers: 2143 kbps > 16 byte transfers: 637 kbps > > These tests required hacking S32G3 to use DMA in host mode, although > the figures should be representative of target mode too where DMA is > used. And the other devices that use DMA in host mode should see similar > improvements. > > Signed-off-by: James Clark <james.clark@linaro.org> > --- My test numbers on LS1028A: Baseline XSPI (unmodified driver): $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 rate: tx 2710.6kbps, rx 2710.6kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 rate: tx 3217.5kbps, rx 3217.5kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 rate: tx 5118.4kbps, rx 5118.4kbps Baseline DMA (modified just DSPI_XSPI_MODE -> DSPI_DMA_MODE): $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 rate: tx 1359.5kbps, rx 1359.5kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 rate: tx 1461.1kbps, rx 1461.1kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 rate: tx 1664.6kbps, rx 1664.6kbps Intermediary LS1028A DMA mode (using non-coherent buffers but still small DMA buffers, i.e. holding just 1 FIFO size worth of data): $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 rate: tx 1345.1kbps, rx 1345.1kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 rate: tx 1522.5kbps, rx 1522.5kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 rate: tx 1690.8kbps, rx 1690.8kbps Final LS1028A DMA mode (with the patch to send large messages as a single DMA buffer applied): $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 rate: tx 2247.0kbps, rx 2247.0kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 rate: tx 3477.4kbps, rx 3477.4kbps $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 rate: tx 8978.4kbps, rx 8978.4kbps So after your patch set, DMA mode on LS1028A becomes more performant and should replace XSPI. This is an outstanding result. That can be done as follow-up work.
On 30/06/2025 4:26 pm, Vladimir Oltean wrote: > On Fri, Jun 27, 2025 at 11:21:36AM +0100, James Clark wrote: >> Improve usability of target mode by reporting FIFO errors and increasing >> the buffer size when DMA is used. While we're touching DMA stuff also >> switch to non-coherent memory, although this is unrelated to target >> mode. >> >> The first commit is marked as a fix because it can fix intermittent >> issues with existing transfers, rather than the later fixes which >> improve larger than FIFO target mode transfers which would have never >> worked. >> >> With the combination of the commit to increase the DMA buffer size and >> the commit to use non-coherent memory, the host mode performance figures >> are as follows on S32G3: >> >> # spidev_test --device /dev/spidev1.0 --bpw 8 --size <test_size> --cpha --iter 10000000 --speed 10000000 >> >> Coherent (4096 byte transfers): 6534 kbps >> Non-coherent: 7347 kbps >> >> Coherent (16 byte transfers): 447 kbps >> Non-coherent: 448 kbps >> >> Just for comparison running the same test in XSPI mode: >> >> 4096 byte transfers: 2143 kbps >> 16 byte transfers: 637 kbps >> >> These tests required hacking S32G3 to use DMA in host mode, although >> the figures should be representative of target mode too where DMA is >> used. And the other devices that use DMA in host mode should see similar >> improvements. >> >> Signed-off-by: James Clark <james.clark@linaro.org> >> --- > > My test numbers on LS1028A: > > Baseline XSPI (unmodified driver): > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 > rate: tx 2710.6kbps, rx 2710.6kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 > rate: tx 3217.5kbps, rx 3217.5kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 > rate: tx 5118.4kbps, rx 5118.4kbps > > Baseline DMA (modified just DSPI_XSPI_MODE -> DSPI_DMA_MODE): > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 > rate: tx 1359.5kbps, rx 1359.5kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 > rate: tx 1461.1kbps, rx 1461.1kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 > rate: tx 1664.6kbps, rx 1664.6kbps > > Intermediary LS1028A DMA mode (using non-coherent buffers but still > small DMA buffers, i.e. holding just 1 FIFO size worth of data): > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 > rate: tx 1345.1kbps, rx 1345.1kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 > rate: tx 1522.5kbps, rx 1522.5kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 > rate: tx 1690.8kbps, rx 1690.8kbps > > Final LS1028A DMA mode (with the patch to send large messages as a > single DMA buffer applied): > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000 > rate: tx 2247.0kbps, rx 2247.0kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000 > rate: tx 3477.4kbps, rx 3477.4kbps > $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000 > rate: tx 8978.4kbps, rx 8978.4kbps > > So after your patch set, DMA mode on LS1028A becomes more performant and > should replace XSPI. This is an outstanding result. That can be done as > follow-up work. I wonder if latency could be higher despite increased throughput? It probably wouldn't be a big enough increase that anyone would care. And based on the structure of the driver if throughput is higher the latency might even be lower.
On Tue, Jul 01, 2025 at 01:42:46PM +0100, James Clark wrote: > I wonder if latency could be higher despite increased throughput? It > probably wouldn't be a big enough increase that anyone would care. And based > on the structure of the driver if throughput is higher the latency might > even be lower. Actually, I do have a metric for that, sort of. I have a SPI-controlled Ethernet switch with support for IEEE 1588, and synchronizing its hardware clock over SPI benefits greatly from having a high precision software timestamping point for the SPI transfers themselves. Essentially, with XSPI FIFO mode we are able to provide a timestamping granularity of $(FIFO size) words, see the spi_take_timestamp_pre() and spi_take_timestamp_post() calls. Whereas with DMA, we let the core take a message-level software timestamp which is much coarser, because at driver level we can't guarantee a much more precise transmission time interval for a particular requested byte. See __spi_pump_transfer_message(). If you're not familiar with phc2sys, an interpretation of the logs below is as follows. phc2sys synchronizes the sw2p0 (target) clock to CLOCK_REALTIME (the source clock). "delay" is the time it took for the kernel to read the target clock once, and the system clock twice (before and after). When software timestamps the SPI transfer that reads the hardware time, this is called a "cross timestamp". The smaller and less jittery this delay, the more stable the cross-timestamp and the better will software be able to discipline the target clock (aka the smaller the offsets will be). Before: $ phc2sys -s CLOCK_REALTIME -c sw2p0 -O 0 -m phc2sys[38.432]: sw2p0 sys offset -1741272972548124929 s0 freq +0 delay 6720 phc2sys[39.434]: sw2p0 sys offset -1741272972548179141 s1 freq -54094 delay 5960 phc2sys[40.436]: sw2p0 sys offset 190 s2 freq -53904 delay 6001 phc2sys[41.437]: sw2p0 sys offset 306 s2 freq -53731 delay 6520 phc2sys[42.438]: sw2p0 sys offset 275 s2 freq -53670 delay 6401 phc2sys[43.441]: sw2p0 sys offset 350 s2 freq -53513 delay 6881 phc2sys[44.442]: sw2p0 sys offset -302 s2 freq -54060 delay 6321 phc2sys[45.444]: sw2p0 sys offset 35 s2 freq -53814 delay 6761 phc2sys[46.446]: sw2p0 sys offset -103 s2 freq -53941 delay 6481 phc2sys[47.447]: sw2p0 sys offset -43 s2 freq -53912 delay 6361 phc2sys[48.450]: sw2p0 sys offset 314 s2 freq -53568 delay 6960 phc2sys[49.451]: sw2p0 sys offset -310 s2 freq -54098 delay 6441 phc2sys[50.453]: sw2p0 sys offset -86 s2 freq -53967 delay 6321 phc2sys[51.455]: sw2p0 sys offset -5 s2 freq -53911 delay 6401 phc2sys[52.457]: sw2p0 sys offset -2 s2 freq -53910 delay 6320 phc2sys[53.458]: sw2p0 sys offset 77 s2 freq -53832 delay 6400 phc2sys[54.459]: sw2p0 sys offset -112 s2 freq -53997 delay 6240 phc2sys[55.461]: sw2p0 sys offset 66 s2 freq -53853 delay 6480 phc2sys[56.463]: sw2p0 sys offset -33 s2 freq -53932 delay 6441 phc2sys[57.465]: sw2p0 sys offset -33 s2 freq -53942 delay 6441 phc2sys[58.467]: sw2p0 sys offset 17 s2 freq -53902 delay 6440 phc2sys[59.468]: sw2p0 sys offset -14 s2 freq -53928 delay 6520 phc2sys[60.470]: sw2p0 sys offset -133 s2 freq -54051 delay 6281 phc2sys[61.472]: sw2p0 sys offset 8 s2 freq -53950 delay 6400 phc2sys[62.473]: sw2p0 sys offset 25 s2 freq -53931 delay 6400 phc2sys[63.474]: sw2p0 sys offset -113 s2 freq -54061 delay 6040 phc2sys[64.476]: sw2p0 sys offset 44 s2 freq -53938 delay 6281 phc2sys[65.477]: sw2p0 sys offset -17 s2 freq -53986 delay 6320 phc2sys[66.479]: sw2p0 sys offset -86 s2 freq -54060 delay 5841 phc2sys[67.480]: sw2p0 sys offset 141 s2 freq -53859 delay 6361 phc2sys[68.481]: sw2p0 sys offset -11 s2 freq -53968 delay 6320 phc2sys[69.483]: sw2p0 sys offset -15 s2 freq -53976 delay 6321 phc2sys[70.484]: sw2p0 sys offset -109 s2 freq -54074 delay 5960 phc2sys[71.486]: sw2p0 sys offset 115 s2 freq -53883 delay 6520 phc2sys[72.488]: sw2p0 sys offset -86 s2 freq -54049 delay 6280 phc2sys[73.489]: sw2p0 sys offset 234 s2 freq -53755 delay 6801 phc2sys[74.491]: sw2p0 sys offset -219 s2 freq -54138 delay 6361 ^Cphc2sys[74.923]: sw2p0 sys offset -174 s2 freq -54159 delay 6360 After: $ phc2sys -s CLOCK_REALTIME -c sw2p0 -O 0 -m phc2sys[753.479]: sw2p0 sys offset 1882248595 s0 freq +32000000 delay 150440 phc2sys[754.482]: sw2p0 sys offset 1850232103 s1 freq +46787 delay 141960 phc2sys[755.483]: sw2p0 sys offset -33278 s2 freq +13509 delay 143160 phc2sys[756.485]: sw2p0 sys offset -5074 s2 freq +31729 delay 150040 phc2sys[757.486]: sw2p0 sys offset 11060 s2 freq +46341 delay 140240 phc2sys[758.488]: sw2p0 sys offset 4804 s2 freq +43403 delay 151320 phc2sys[759.489]: sw2p0 sys offset 10358 s2 freq +50398 delay 141879 phc2sys[760.491]: sw2p0 sys offset 409 s2 freq +43557 delay 148840 phc2sys[761.493]: sw2p0 sys offset 3863 s2 freq +47133 delay 143360 phc2sys[762.494]: sw2p0 sys offset 259 s2 freq +44688 delay 145840 phc2sys[763.496]: sw2p0 sys offset 1849 s2 freq +46356 delay 141000 phc2sys[764.497]: sw2p0 sys offset -1800 s2 freq +43262 delay 144160 phc2sys[765.499]: sw2p0 sys offset -184 s2 freq +44338 delay 139880 phc2sys[766.501]: sw2p0 sys offset -1677 s2 freq +42790 delay 146120 phc2sys[767.502]: sw2p0 sys offset 2529 s2 freq +46492 delay 141040 phc2sys[768.504]: sw2p0 sys offset -4368 s2 freq +40354 delay 151240 phc2sys[769.505]: sw2p0 sys offset 1112 s2 freq +44524 delay 147680 phc2sys[770.507]: sw2p0 sys offset 3002 s2 freq +46747 delay 142960 phc2sys[771.509]: sw2p0 sys offset -899 s2 freq +43747 delay 145440 phc2sys[772.510]: sw2p0 sys offset -2003 s2 freq +42373 delay 148360 phc2sys[773.512]: sw2p0 sys offset 3675 s2 freq +47450 delay 141440 phc2sys[774.514]: sw2p0 sys offset -1417 s2 freq +43461 delay 144960 phc2sys[775.515]: sw2p0 sys offset 802 s2 freq +45255 delay 142559 phc2sys[776.517]: sw2p0 sys offset 1368 s2 freq +46061 delay 140040 phc2sys[777.518]: sw2p0 sys offset -1897 s2 freq +43207 delay 141840 phc2sys[778.520]: sw2p0 sys offset -774 s2 freq +43761 delay 141680 phc2sys[779.522]: sw2p0 sys offset -1715 s2 freq +42587 delay 145199 phc2sys[780.523]: sw2p0 sys offset 4045 s2 freq +47833 delay 134839 phc2sys[781.525]: sw2p0 sys offset -4809 s2 freq +40192 delay 146840 phc2sys[782.526]: sw2p0 sys offset 363 s2 freq +43922 delay 144759 phc2sys[783.528]: sw2p0 sys offset 3328 s2 freq +46996 delay 140240 phc2sys[784.530]: sw2p0 sys offset -293 s2 freq +44373 delay 142480 phc2sys[785.531]: sw2p0 sys offset 46 s2 freq +44624 delay 142000 phc2sys[786.533]: sw2p0 sys offset -3422 s2 freq +41170 delay 148080 phc2sys[787.534]: sw2p0 sys offset 2932 s2 freq +46497 delay 140720 phc2sys[788.536]: sw2p0 sys offset -1961 s2 freq +42484 delay 147040 phc2sys[789.537]: sw2p0 sys offset -945 s2 freq +42912 delay 149160 phc2sys[790.539]: sw2p0 sys offset 3221 s2 freq +46794 delay 143040 phc2sys[791.541]: sw2p0 sys offset 41 s2 freq +44580 delay 144160 phc2sys[792.542]: sw2p0 sys offset -748 s2 freq +43804 delay 145120 Here, the synchronization offsets in DMA mode are an order of magnitude worse, so yeah, initial enthusiasm definitely curbed now. For me, what matters is not the latency itself, but the ability to cross-timestamp one byte within the SPI transfer with high granularity, and for the uncertainty of that timestamp to be as small and constant as possible. For that reason, I can post a third output log, taken in XSPI FIFO mode but with "ctlr->ptp_sts_supported = true" removed. That causes the core to take message-level software timestamps, which are a better indicator of latency. You can see that in FIFO mode, the minimum is much smaller (108 us) but the spread is larger (the maximum is 209 us). In DMA mode, the latencies are much more stable. But despite this, XSPI is still better for the ability to zoom in on the particular byte of interest. $ phc2sys -s CLOCK_REALTIME -c sw2p0 -O 0 -m phc2sys[246.568]: sw2p0 sys offset 2872475 s0 freq -88840 delay 131332 phc2sys[247.571]: sw2p0 sys offset 2874267 s1 freq -87052 delay 194739 phc2sys[248.572]: sw2p0 sys offset 71966 s2 freq -15086 delay 114971 phc2sys[249.573]: sw2p0 sys offset 34792 s2 freq -30670 delay 108331 phc2sys[250.575]: sw2p0 sys offset -39553 s2 freq -94578 delay 208580 phc2sys[251.577]: sw2p0 sys offset 50369 s2 freq -16521 delay 107410 phc2sys[252.578]: sw2p0 sys offset 1597 s2 freq -50183 delay 128292 phc2sys[253.579]: sw2p0 sys offset 6685 s2 freq -44616 delay 107810 phc2sys[254.581]: sw2p0 sys offset -4102 s2 freq -53397 delay 108530 phc2sys[255.582]: sw2p0 sys offset -7256 s2 freq -57782 delay 112051 phc2sys[256.584]: sw2p0 sys offset -2910 s2 freq -55613 delay 108610 phc2sys[257.586]: sw2p0 sys offset -52981 s2 freq -106557 delay 209460 phc2sys[258.587]: sw2p0 sys offset 49914 s2 freq -19556 delay 107130 phc2sys[259.589]: sw2p0 sys offset -29913 s2 freq -84409 delay 195699 phc2sys[260.591]: sw2p0 sys offset 42439 s2 freq -21031 delay 110411 phc2sys[261.592]: sw2p0 sys offset 3048 s2 freq -47690 delay 120571 phc2sys[262.594]: sw2p0 sys offset -853 s2 freq -50676 delay 113291 phc2sys[263.596]: sw2p0 sys offset -35260 s2 freq -85339 delay 173937 phc2sys[264.597]: sw2p0 sys offset 26479 s2 freq -34178 delay 110570 phc2sys[265.599]: sw2p0 sys offset -36802 s2 freq -89516 delay 195699 phc2sys[266.601]: sw2p0 sys offset 39945 s2 freq -23809 delay 110571 phc2sys[267.603]: sw2p0 sys offset -32036 s2 freq -83807 delay 194858 phc2sys[268.604]: sw2p0 sys offset 37721 s2 freq -23661 delay 110570 phc2sys[269.606]: sw2p0 sys offset 5110 s2 freq -44955 delay 112571 phc2sys[270.607]: sw2p0 sys offset -3526 s2 freq -52058 delay 109570 phc2sys[271.608]: sw2p0 sys offset -7856 s2 freq -57446 delay 112491 phc2sys[272.610]: sw2p0 sys offset -5259 s2 freq -57206 delay 112051 phc2sys[273.612]: sw2p0 sys offset -43272 s2 freq -96797 delay 194178 phc2sys[274.613]: sw2p0 sys offset 40708 s2 freq -25798 delay 108291 phc2sys[275.615]: sw2p0 sys offset -38753 s2 freq -93047 delay 208900 phc2sys[276.616]: sw2p0 sys offset 47948 s2 freq -17972 delay 111050 phc2sys[277.618]: sw2p0 sys offset 10692 s2 freq -40843 delay 111131 phc2sys[278.619]: sw2p0 sys offset -2179 s2 freq -50507 delay 108530 phc2sys[279.620]: sw2p0 sys offset -8143 s2 freq -57124 delay 111571 phc2sys[280.623]: sw2p0 sys offset -49486 s2 freq -100910 delay 199179 phc2sys[281.625]: sw2p0 sys offset -3684 s2 freq -69954 delay 199419 phc2sys[282.626]: sw2p0 sys offset 54475 s2 freq -12900 delay 111651 phc2sys[283.628]: sw2p0 sys offset -36562 s2 freq -87595 delay 209420 ^Cphc2sys[284.181]: sw2p0 sys offset -11239 s2 freq -73240 delay 194499
On Tue, Jul 01, 2025 at 04:57:47PM +0300, Vladimir Oltean wrote: > Here, the synchronization offsets in DMA mode are an order of magnitude > worse, so yeah, initial enthusiasm definitely curbed now. > For me, what matters is not the latency itself, but the ability to > cross-timestamp one byte within the SPI transfer with high granularity, > and for the uncertainty of that timestamp to be as small and constant as > possible. This is sounding like a copybreak type situation with the mode selected depending on how big the transfer is, that's a very common pattern. I've not looked how easy it is to flip this hardware between modes though.
On Tue, Jul 01, 2025 at 03:36:21PM +0100, Mark Brown wrote: > On Tue, Jul 01, 2025 at 04:57:47PM +0300, Vladimir Oltean wrote: > > > Here, the synchronization offsets in DMA mode are an order of magnitude > > worse, so yeah, initial enthusiasm definitely curbed now. > > > For me, what matters is not the latency itself, but the ability to > > cross-timestamp one byte within the SPI transfer with high granularity, > > and for the uncertainty of that timestamp to be as small and constant as > > possible. > > This is sounding like a copybreak type situation with the mode selected > depending on how big the transfer is, that's a very common pattern. > I've not looked how easy it is to flip this hardware between modes > though. I suppose one could try using FIFO mode for transfers which request timestamping and DMA for transfers which don't. I don't have an insight into what impact that will have on the driver, but I suspect at the very least one will have to transform "DSPI_DMA_MODE" into "dspi->dma_available" and "dspi->dma_in_use", and reconfigure the SPI_RSER register (interrupt routing, to DMA engine or to CPUs) at every transfer rather than at dspi_init(). The question is whether you would be willing to see and maintain such complexity increase, when AFAIK, the LS1028A FIFO mode passes its requirements.
On Tue, Jul 01, 2025 at 05:53:12PM +0300, Vladimir Oltean wrote: > I suppose one could try using FIFO mode for transfers which request > timestamping and DMA for transfers which don't. I don't have an insight > into what impact that will have on the driver, but I suspect at the very > least one will have to transform "DSPI_DMA_MODE" into "dspi->dma_available" > and "dspi->dma_in_use", and reconfigure the SPI_RSER register (interrupt > routing, to DMA engine or to CPUs) at every transfer rather than at dspi_init(). > The question is whether you would be willing to see and maintain such > complexity increase, when AFAIK, the LS1028A FIFO mode passes its > requirements. Switching between modes is incredibly common, usually between PIO (for very short transfers) and DMA, that's no problem. Factoring in timestamping seems like a reasonable signal I guess, might trip someone who was trying to benchmark things up but probably not normal users.
On Tue, Jul 01, 2025 at 04:16:50PM +0100, Mark Brown wrote: > On Tue, Jul 01, 2025 at 05:53:12PM +0300, Vladimir Oltean wrote: > > > I suppose one could try using FIFO mode for transfers which request > > timestamping and DMA for transfers which don't. I don't have an insight > > into what impact that will have on the driver, but I suspect at the very > > least one will have to transform "DSPI_DMA_MODE" into "dspi->dma_available" > > and "dspi->dma_in_use", and reconfigure the SPI_RSER register (interrupt > > routing, to DMA engine or to CPUs) at every transfer rather than at dspi_init(). > > > The question is whether you would be willing to see and maintain such > > complexity increase, when AFAIK, the LS1028A FIFO mode passes its > > requirements. > > Switching between modes is incredibly common, usually between PIO (for > very short transfers) and DMA, that's no problem. Factoring in > timestamping seems like a reasonable signal I guess, might trip someone > who was trying to benchmark things up but probably not normal users. Ah, ok, I vaguely remember something being discussed about can_dma() on previous iterations of this patch, but in a different context. Then that's an avenue to explore, I guess. Looking at that method's prototype, I suppose dspi could simply return can_dma = false "if (xfer->ptp_sts)" (timestamp requested), i.e. no core involvement in the decision process at all? Sounds interesting, but I can't promise I'll follow up with patches.
On Tue, Jul 01, 2025 at 06:24:33PM +0300, Vladimir Oltean wrote: > On Tue, Jul 01, 2025 at 04:16:50PM +0100, Mark Brown wrote: > > Switching between modes is incredibly common, usually between PIO (for > > very short transfers) and DMA, that's no problem. Factoring in > > timestamping seems like a reasonable signal I guess, might trip someone > > who was trying to benchmark things up but probably not normal users. > Ah, ok, I vaguely remember something being discussed about can_dma() > on previous iterations of this patch, but in a different context. > Then that's an avenue to explore, I guess. Looking at that method's > prototype, I suppose dspi could simply return can_dma = false "if (xfer->ptp_sts)" > (timestamp requested), i.e. no core involvement in the decision process at all? Yes, exactly. It can base the decision on whatever amuses it about the transfer.
On Tue, Jul 01, 2025 at 01:42:46PM +0100, James Clark wrote: > On 30/06/2025 4:26 pm, Vladimir Oltean wrote: > > On Fri, Jun 27, 2025 at 11:21:36AM +0100, James Clark wrote: > > So after your patch set, DMA mode on LS1028A becomes more performant and > > should replace XSPI. This is an outstanding result. That can be done as > > follow-up work. > I wonder if latency could be higher despite increased throughput? It > probably wouldn't be a big enough increase that anyone would care. And based > on the structure of the driver if throughput is higher the latency might > even be lower. Some CAN users are likely to care, some of them care a lot about throughput so will be happy but AIUI some are very focused on round trip times. That's a fairly specialist use case though.
On Fri, 27 Jun 2025 11:21:36 +0100, James Clark wrote: > Improve usability of target mode by reporting FIFO errors and increasing > the buffer size when DMA is used. While we're touching DMA stuff also > switch to non-coherent memory, although this is unrelated to target > mode. > > The first commit is marked as a fix because it can fix intermittent > issues with existing transfers, rather than the later fixes which > improve larger than FIFO target mode transfers which would have never > worked. > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next Thanks! [1/6] spi: spi-fsl-dspi: Clear completion counter before initiating transfer commit: fa60c094c19b97e103d653f528f8d9c178b6a5f5 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
© 2016 - 2025 Red Hat, Inc.