[v4] spi: spi-fsl-dspi: Target mode improvements

[PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by James Clark 3 months, 1 week ago

Improve usability of target mode by reporting FIFO errors and increasing
the buffer size when DMA is used. While we're touching DMA stuff also
switch to non-coherent memory, although this is unrelated to target
mode.

The first commit is marked as a fix because it can fix intermittent
issues with existing transfers, rather than the later fixes which
improve larger than FIFO target mode transfers which would have never
worked.

With the combination of the commit to increase the DMA buffer size and
the commit to use non-coherent memory, the host mode performance figures
are as follows on S32G3:

  # spidev_test --device /dev/spidev1.0 --bpw 8 --size <test_size> --cpha --iter 10000000 --speed 10000000

  Coherent (4096 byte transfers): 6534 kbps
  Non-coherent:                   7347 kbps

  Coherent (16 byte transfers):    447 kbps
  Non-coherent:                    448 kbps

Just for comparison running the same test in XSPI mode:

  4096 byte transfers:            2143 kbps
  16 byte transfers:               637 kbps

These tests required hacking S32G3 to use DMA in host mode, although
the figures should be representative of target mode too where DMA is
used. And the other devices that use DMA in host mode should see similar
improvements.

Signed-off-by: James Clark <james.clark@linaro.org>
---
Changes in v4:
- Fix !CONFIG_DMA_ENGINE build (and actually test it this time)
- Reword completion counter comment
- Reword some commit messages
- Reset tries in dspi_poll() for each transfer
- Check for fifo errors in dspi_poll() before checking for completion
- Link to v3: https://lore.kernel.org/r/20250624-james-nxp-spi-dma-v3-0-e7d574f5f62c@linaro.org

Changes in v3:
- Stub out DMA functions in the driver so no-DMA builds work
- Link to v2: https://lore.kernel.org/r/20250613-james-nxp-spi-dma-v2-0-017eecf24aab@linaro.org

Changes in v2:
- Store status in cur_msg->status rather than adding xfer_status
- Show exact underflow/overflow flags in error message
- Rate limit error messages
- Add a comment about resetting the completion counter prior to transfer
- Rename dspi_is_fifo_overflow() -> dspi_fifo_error()
- Add performance figures to cover letter
- Rebase onto https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git/for-next
  to avoid some conflicts
- Link to v1: https://lore.kernel.org/r/20250609-james-nxp-spi-dma-v1-0-2b831e714be2@linaro.org

---
James Clark (5):
      spi: spi-fsl-dspi: Clear completion counter before initiating transfer
      spi: spi-fsl-dspi: Store status directly in cur_msg->status
      spi: spi-fsl-dspi: Stub out DMA functions
      spi: spi-fsl-dspi: Use non-coherent memory for DMA
      spi: spi-fsl-dspi: Report FIFO overflows as errors

Larisa Grigore (1):
      spi: spi-fsl-dspi: Increase DMA buffer size

 drivers/spi/spi-fsl-dspi.c | 233 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 166 insertions(+), 67 deletions(-)
---
base-commit: 4f326fa6236787ca516ea6eab8e5e9dc5c236f03
change-id: 20250522-james-nxp-spi-dma-a997ebebfb6b

Best regards,
-- 
James Clark <james.clark@linaro.org>

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Vladimir Oltean 3 months, 1 week ago

On Fri, Jun 27, 2025 at 11:21:36AM +0100, James Clark wrote:
> Improve usability of target mode by reporting FIFO errors and increasing
> the buffer size when DMA is used. While we're touching DMA stuff also
> switch to non-coherent memory, although this is unrelated to target
> mode.
> 
> The first commit is marked as a fix because it can fix intermittent
> issues with existing transfers, rather than the later fixes which
> improve larger than FIFO target mode transfers which would have never
> worked.
> 
> With the combination of the commit to increase the DMA buffer size and
> the commit to use non-coherent memory, the host mode performance figures
> are as follows on S32G3:
> 
>   # spidev_test --device /dev/spidev1.0 --bpw 8 --size <test_size> --cpha --iter 10000000 --speed 10000000
> 
>   Coherent (4096 byte transfers): 6534 kbps
>   Non-coherent:                   7347 kbps
> 
>   Coherent (16 byte transfers):    447 kbps
>   Non-coherent:                    448 kbps
> 
> Just for comparison running the same test in XSPI mode:
> 
>   4096 byte transfers:            2143 kbps
>   16 byte transfers:               637 kbps
> 
> These tests required hacking S32G3 to use DMA in host mode, although
> the figures should be representative of target mode too where DMA is
> used. And the other devices that use DMA in host mode should see similar
> improvements.
> 
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---

My test numbers on LS1028A:

Baseline XSPI (unmodified driver):
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
rate: tx 2710.6kbps, rx 2710.6kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
rate: tx 3217.5kbps, rx 3217.5kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
rate: tx 5118.4kbps, rx 5118.4kbps

Baseline DMA (modified just DSPI_XSPI_MODE -> DSPI_DMA_MODE):
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
rate: tx 1359.5kbps, rx 1359.5kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
rate: tx 1461.1kbps, rx 1461.1kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
rate: tx 1664.6kbps, rx 1664.6kbps

Intermediary LS1028A DMA mode (using non-coherent buffers but still
small DMA buffers, i.e. holding just 1 FIFO size worth of data):
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
rate: tx 1345.1kbps, rx 1345.1kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
rate: tx 1522.5kbps, rx 1522.5kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
rate: tx 1690.8kbps, rx 1690.8kbps

Final LS1028A DMA mode (with the patch to send large messages as a
single DMA buffer applied):
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
rate: tx 2247.0kbps, rx 2247.0kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
rate: tx 3477.4kbps, rx 3477.4kbps
$ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
rate: tx 8978.4kbps, rx 8978.4kbps

So after your patch set, DMA mode on LS1028A becomes more performant and
should replace XSPI. This is an outstanding result. That can be done as
follow-up work.

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by James Clark 3 months, 1 week ago


On 30/06/2025 4:26 pm, Vladimir Oltean wrote:
> On Fri, Jun 27, 2025 at 11:21:36AM +0100, James Clark wrote:
>> Improve usability of target mode by reporting FIFO errors and increasing
>> the buffer size when DMA is used. While we're touching DMA stuff also
>> switch to non-coherent memory, although this is unrelated to target
>> mode.
>>
>> The first commit is marked as a fix because it can fix intermittent
>> issues with existing transfers, rather than the later fixes which
>> improve larger than FIFO target mode transfers which would have never
>> worked.
>>
>> With the combination of the commit to increase the DMA buffer size and
>> the commit to use non-coherent memory, the host mode performance figures
>> are as follows on S32G3:
>>
>>    # spidev_test --device /dev/spidev1.0 --bpw 8 --size <test_size> --cpha --iter 10000000 --speed 10000000
>>
>>    Coherent (4096 byte transfers): 6534 kbps
>>    Non-coherent:                   7347 kbps
>>
>>    Coherent (16 byte transfers):    447 kbps
>>    Non-coherent:                    448 kbps
>>
>> Just for comparison running the same test in XSPI mode:
>>
>>    4096 byte transfers:            2143 kbps
>>    16 byte transfers:               637 kbps
>>
>> These tests required hacking S32G3 to use DMA in host mode, although
>> the figures should be representative of target mode too where DMA is
>> used. And the other devices that use DMA in host mode should see similar
>> improvements.
>>
>> Signed-off-by: James Clark <james.clark@linaro.org>
>> ---
> 
> My test numbers on LS1028A:
> 
> Baseline XSPI (unmodified driver):
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
> rate: tx 2710.6kbps, rx 2710.6kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
> rate: tx 3217.5kbps, rx 3217.5kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
> rate: tx 5118.4kbps, rx 5118.4kbps
> 
> Baseline DMA (modified just DSPI_XSPI_MODE -> DSPI_DMA_MODE):
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
> rate: tx 1359.5kbps, rx 1359.5kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
> rate: tx 1461.1kbps, rx 1461.1kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
> rate: tx 1664.6kbps, rx 1664.6kbps
> 
> Intermediary LS1028A DMA mode (using non-coherent buffers but still
> small DMA buffers, i.e. holding just 1 FIFO size worth of data):
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
> rate: tx 1345.1kbps, rx 1345.1kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
> rate: tx 1522.5kbps, rx 1522.5kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
> rate: tx 1690.8kbps, rx 1690.8kbps
> 
> Final LS1028A DMA mode (with the patch to send large messages as a
> single DMA buffer applied):
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 8 --cpha --iter 10000000 --speed 10000000
> rate: tx 2247.0kbps, rx 2247.0kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 16 --cpha --iter 10000000 --speed 10000000
> rate: tx 3477.4kbps, rx 3477.4kbps
> $ ./spidev_test --device /dev/spidev2.1 --bpw 8 --size 4096 --cpha --iter 10000000 --speed 10000000
> rate: tx 8978.4kbps, rx 8978.4kbps
> 
> So after your patch set, DMA mode on LS1028A becomes more performant and
> should replace XSPI. This is an outstanding result. That can be done as
> follow-up work.

I wonder if latency could be higher despite increased throughput? It 
probably wouldn't be a big enough increase that anyone would care. And 
based on the structure of the driver if throughput is higher the latency 
might even be lower.

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Vladimir Oltean 3 months, 1 week ago

On Tue, Jul 01, 2025 at 01:42:46PM +0100, James Clark wrote:
> I wonder if latency could be higher despite increased throughput? It
> probably wouldn't be a big enough increase that anyone would care. And based
> on the structure of the driver if throughput is higher the latency might
> even be lower.

Actually, I do have a metric for that, sort of. I have a SPI-controlled
Ethernet switch with support for IEEE 1588, and synchronizing its
hardware clock over SPI benefits greatly from having a high precision
software timestamping point for the SPI transfers themselves.

Essentially, with XSPI FIFO mode we are able to provide a timestamping
granularity of $(FIFO size) words, see the spi_take_timestamp_pre() and
spi_take_timestamp_post() calls. Whereas with DMA, we let the core take
a message-level software timestamp which is much coarser, because at
driver level we can't guarantee a much more precise transmission time
interval for a particular requested byte. See __spi_pump_transfer_message().

If you're not familiar with phc2sys, an interpretation of the logs below
is as follows.

phc2sys synchronizes the sw2p0 (target) clock to CLOCK_REALTIME (the
source clock). "delay" is the time it took for the kernel to read the
target clock once, and the system clock twice (before and after).
When software timestamps the SPI transfer that reads the hardware time,
this is called a "cross timestamp". The smaller and less jittery this
delay, the more stable the cross-timestamp and the better will software
be able to discipline the target clock (aka the smaller the offsets will
be).

Before:

$ phc2sys -s CLOCK_REALTIME -c sw2p0 -O 0 -m
phc2sys[38.432]: sw2p0 sys offset -1741272972548124929 s0 freq      +0 delay   6720
phc2sys[39.434]: sw2p0 sys offset -1741272972548179141 s1 freq  -54094 delay   5960
phc2sys[40.436]: sw2p0 sys offset       190 s2 freq  -53904 delay   6001
phc2sys[41.437]: sw2p0 sys offset       306 s2 freq  -53731 delay   6520
phc2sys[42.438]: sw2p0 sys offset       275 s2 freq  -53670 delay   6401
phc2sys[43.441]: sw2p0 sys offset       350 s2 freq  -53513 delay   6881
phc2sys[44.442]: sw2p0 sys offset      -302 s2 freq  -54060 delay   6321
phc2sys[45.444]: sw2p0 sys offset        35 s2 freq  -53814 delay   6761
phc2sys[46.446]: sw2p0 sys offset      -103 s2 freq  -53941 delay   6481
phc2sys[47.447]: sw2p0 sys offset       -43 s2 freq  -53912 delay   6361
phc2sys[48.450]: sw2p0 sys offset       314 s2 freq  -53568 delay   6960
phc2sys[49.451]: sw2p0 sys offset      -310 s2 freq  -54098 delay   6441
phc2sys[50.453]: sw2p0 sys offset       -86 s2 freq  -53967 delay   6321
phc2sys[51.455]: sw2p0 sys offset        -5 s2 freq  -53911 delay   6401
phc2sys[52.457]: sw2p0 sys offset        -2 s2 freq  -53910 delay   6320
phc2sys[53.458]: sw2p0 sys offset        77 s2 freq  -53832 delay   6400
phc2sys[54.459]: sw2p0 sys offset      -112 s2 freq  -53997 delay   6240
phc2sys[55.461]: sw2p0 sys offset        66 s2 freq  -53853 delay   6480
phc2sys[56.463]: sw2p0 sys offset       -33 s2 freq  -53932 delay   6441
phc2sys[57.465]: sw2p0 sys offset       -33 s2 freq  -53942 delay   6441
phc2sys[58.467]: sw2p0 sys offset        17 s2 freq  -53902 delay   6440
phc2sys[59.468]: sw2p0 sys offset       -14 s2 freq  -53928 delay   6520
phc2sys[60.470]: sw2p0 sys offset      -133 s2 freq  -54051 delay   6281
phc2sys[61.472]: sw2p0 sys offset         8 s2 freq  -53950 delay   6400
phc2sys[62.473]: sw2p0 sys offset        25 s2 freq  -53931 delay   6400
phc2sys[63.474]: sw2p0 sys offset      -113 s2 freq  -54061 delay   6040
phc2sys[64.476]: sw2p0 sys offset        44 s2 freq  -53938 delay   6281
phc2sys[65.477]: sw2p0 sys offset       -17 s2 freq  -53986 delay   6320
phc2sys[66.479]: sw2p0 sys offset       -86 s2 freq  -54060 delay   5841
phc2sys[67.480]: sw2p0 sys offset       141 s2 freq  -53859 delay   6361
phc2sys[68.481]: sw2p0 sys offset       -11 s2 freq  -53968 delay   6320
phc2sys[69.483]: sw2p0 sys offset       -15 s2 freq  -53976 delay   6321
phc2sys[70.484]: sw2p0 sys offset      -109 s2 freq  -54074 delay   5960
phc2sys[71.486]: sw2p0 sys offset       115 s2 freq  -53883 delay   6520
phc2sys[72.488]: sw2p0 sys offset       -86 s2 freq  -54049 delay   6280
phc2sys[73.489]: sw2p0 sys offset       234 s2 freq  -53755 delay   6801
phc2sys[74.491]: sw2p0 sys offset      -219 s2 freq  -54138 delay   6361
^Cphc2sys[74.923]: sw2p0 sys offset      -174 s2 freq  -54159 delay   6360

After:

$ phc2sys -s CLOCK_REALTIME -c sw2p0 -O 0 -m
phc2sys[753.479]: sw2p0 sys offset 1882248595 s0 freq +32000000 delay 150440
phc2sys[754.482]: sw2p0 sys offset 1850232103 s1 freq  +46787 delay 141960
phc2sys[755.483]: sw2p0 sys offset    -33278 s2 freq  +13509 delay 143160
phc2sys[756.485]: sw2p0 sys offset     -5074 s2 freq  +31729 delay 150040
phc2sys[757.486]: sw2p0 sys offset     11060 s2 freq  +46341 delay 140240
phc2sys[758.488]: sw2p0 sys offset      4804 s2 freq  +43403 delay 151320
phc2sys[759.489]: sw2p0 sys offset     10358 s2 freq  +50398 delay 141879
phc2sys[760.491]: sw2p0 sys offset       409 s2 freq  +43557 delay 148840
phc2sys[761.493]: sw2p0 sys offset      3863 s2 freq  +47133 delay 143360
phc2sys[762.494]: sw2p0 sys offset       259 s2 freq  +44688 delay 145840
phc2sys[763.496]: sw2p0 sys offset      1849 s2 freq  +46356 delay 141000
phc2sys[764.497]: sw2p0 sys offset     -1800 s2 freq  +43262 delay 144160
phc2sys[765.499]: sw2p0 sys offset      -184 s2 freq  +44338 delay 139880
phc2sys[766.501]: sw2p0 sys offset     -1677 s2 freq  +42790 delay 146120
phc2sys[767.502]: sw2p0 sys offset      2529 s2 freq  +46492 delay 141040
phc2sys[768.504]: sw2p0 sys offset     -4368 s2 freq  +40354 delay 151240
phc2sys[769.505]: sw2p0 sys offset      1112 s2 freq  +44524 delay 147680
phc2sys[770.507]: sw2p0 sys offset      3002 s2 freq  +46747 delay 142960
phc2sys[771.509]: sw2p0 sys offset      -899 s2 freq  +43747 delay 145440
phc2sys[772.510]: sw2p0 sys offset     -2003 s2 freq  +42373 delay 148360
phc2sys[773.512]: sw2p0 sys offset      3675 s2 freq  +47450 delay 141440
phc2sys[774.514]: sw2p0 sys offset     -1417 s2 freq  +43461 delay 144960
phc2sys[775.515]: sw2p0 sys offset       802 s2 freq  +45255 delay 142559
phc2sys[776.517]: sw2p0 sys offset      1368 s2 freq  +46061 delay 140040
phc2sys[777.518]: sw2p0 sys offset     -1897 s2 freq  +43207 delay 141840
phc2sys[778.520]: sw2p0 sys offset      -774 s2 freq  +43761 delay 141680
phc2sys[779.522]: sw2p0 sys offset     -1715 s2 freq  +42587 delay 145199
phc2sys[780.523]: sw2p0 sys offset      4045 s2 freq  +47833 delay 134839
phc2sys[781.525]: sw2p0 sys offset     -4809 s2 freq  +40192 delay 146840
phc2sys[782.526]: sw2p0 sys offset       363 s2 freq  +43922 delay 144759
phc2sys[783.528]: sw2p0 sys offset      3328 s2 freq  +46996 delay 140240
phc2sys[784.530]: sw2p0 sys offset      -293 s2 freq  +44373 delay 142480
phc2sys[785.531]: sw2p0 sys offset        46 s2 freq  +44624 delay 142000
phc2sys[786.533]: sw2p0 sys offset     -3422 s2 freq  +41170 delay 148080
phc2sys[787.534]: sw2p0 sys offset      2932 s2 freq  +46497 delay 140720
phc2sys[788.536]: sw2p0 sys offset     -1961 s2 freq  +42484 delay 147040
phc2sys[789.537]: sw2p0 sys offset      -945 s2 freq  +42912 delay 149160
phc2sys[790.539]: sw2p0 sys offset      3221 s2 freq  +46794 delay 143040
phc2sys[791.541]: sw2p0 sys offset        41 s2 freq  +44580 delay 144160
phc2sys[792.542]: sw2p0 sys offset      -748 s2 freq  +43804 delay 145120

Here, the synchronization offsets in DMA mode are an order of magnitude
worse, so yeah, initial enthusiasm definitely curbed now.

For me, what matters is not the latency itself, but the ability to
cross-timestamp one byte within the SPI transfer with high granularity,
and for the uncertainty of that timestamp to be as small and constant as
possible.

For that reason, I can post a third output log, taken in XSPI FIFO mode
but with "ctlr->ptp_sts_supported = true" removed. That causes the core
to take message-level software timestamps, which are a better indicator
of latency.

You can see that in FIFO mode, the minimum is much smaller (108 us) but
the spread is larger (the maximum is 209 us). In DMA mode, the latencies
are much more stable. But despite this, XSPI is still better for the
ability to zoom in on the particular byte of interest.

$ phc2sys -s CLOCK_REALTIME -c sw2p0 -O 0 -m
phc2sys[246.568]: sw2p0 sys offset   2872475 s0 freq  -88840 delay 131332
phc2sys[247.571]: sw2p0 sys offset   2874267 s1 freq  -87052 delay 194739
phc2sys[248.572]: sw2p0 sys offset     71966 s2 freq  -15086 delay 114971
phc2sys[249.573]: sw2p0 sys offset     34792 s2 freq  -30670 delay 108331
phc2sys[250.575]: sw2p0 sys offset    -39553 s2 freq  -94578 delay 208580
phc2sys[251.577]: sw2p0 sys offset     50369 s2 freq  -16521 delay 107410
phc2sys[252.578]: sw2p0 sys offset      1597 s2 freq  -50183 delay 128292
phc2sys[253.579]: sw2p0 sys offset      6685 s2 freq  -44616 delay 107810
phc2sys[254.581]: sw2p0 sys offset     -4102 s2 freq  -53397 delay 108530
phc2sys[255.582]: sw2p0 sys offset     -7256 s2 freq  -57782 delay 112051
phc2sys[256.584]: sw2p0 sys offset     -2910 s2 freq  -55613 delay 108610
phc2sys[257.586]: sw2p0 sys offset    -52981 s2 freq -106557 delay 209460
phc2sys[258.587]: sw2p0 sys offset     49914 s2 freq  -19556 delay 107130
phc2sys[259.589]: sw2p0 sys offset    -29913 s2 freq  -84409 delay 195699
phc2sys[260.591]: sw2p0 sys offset     42439 s2 freq  -21031 delay 110411
phc2sys[261.592]: sw2p0 sys offset      3048 s2 freq  -47690 delay 120571
phc2sys[262.594]: sw2p0 sys offset      -853 s2 freq  -50676 delay 113291
phc2sys[263.596]: sw2p0 sys offset    -35260 s2 freq  -85339 delay 173937
phc2sys[264.597]: sw2p0 sys offset     26479 s2 freq  -34178 delay 110570
phc2sys[265.599]: sw2p0 sys offset    -36802 s2 freq  -89516 delay 195699
phc2sys[266.601]: sw2p0 sys offset     39945 s2 freq  -23809 delay 110571
phc2sys[267.603]: sw2p0 sys offset    -32036 s2 freq  -83807 delay 194858
phc2sys[268.604]: sw2p0 sys offset     37721 s2 freq  -23661 delay 110570
phc2sys[269.606]: sw2p0 sys offset      5110 s2 freq  -44955 delay 112571
phc2sys[270.607]: sw2p0 sys offset     -3526 s2 freq  -52058 delay 109570
phc2sys[271.608]: sw2p0 sys offset     -7856 s2 freq  -57446 delay 112491
phc2sys[272.610]: sw2p0 sys offset     -5259 s2 freq  -57206 delay 112051
phc2sys[273.612]: sw2p0 sys offset    -43272 s2 freq  -96797 delay 194178
phc2sys[274.613]: sw2p0 sys offset     40708 s2 freq  -25798 delay 108291
phc2sys[275.615]: sw2p0 sys offset    -38753 s2 freq  -93047 delay 208900
phc2sys[276.616]: sw2p0 sys offset     47948 s2 freq  -17972 delay 111050
phc2sys[277.618]: sw2p0 sys offset     10692 s2 freq  -40843 delay 111131
phc2sys[278.619]: sw2p0 sys offset     -2179 s2 freq  -50507 delay 108530
phc2sys[279.620]: sw2p0 sys offset     -8143 s2 freq  -57124 delay 111571
phc2sys[280.623]: sw2p0 sys offset    -49486 s2 freq -100910 delay 199179
phc2sys[281.625]: sw2p0 sys offset     -3684 s2 freq  -69954 delay 199419
phc2sys[282.626]: sw2p0 sys offset     54475 s2 freq  -12900 delay 111651
phc2sys[283.628]: sw2p0 sys offset    -36562 s2 freq  -87595 delay 209420
^Cphc2sys[284.181]: sw2p0 sys offset    -11239 s2 freq  -73240 delay 194499

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Mark Brown 3 months, 1 week ago

On Tue, Jul 01, 2025 at 04:57:47PM +0300, Vladimir Oltean wrote:

> Here, the synchronization offsets in DMA mode are an order of magnitude
> worse, so yeah, initial enthusiasm definitely curbed now.

> For me, what matters is not the latency itself, but the ability to
> cross-timestamp one byte within the SPI transfer with high granularity,
> and for the uncertainty of that timestamp to be as small and constant as
> possible.

This is sounding like a copybreak type situation with the mode selected
depending on how big the transfer is, that's a very common pattern.
I've not looked how easy it is to flip this hardware between modes
though.

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Vladimir Oltean 3 months, 1 week ago

On Tue, Jul 01, 2025 at 03:36:21PM +0100, Mark Brown wrote:
> On Tue, Jul 01, 2025 at 04:57:47PM +0300, Vladimir Oltean wrote:
> 
> > Here, the synchronization offsets in DMA mode are an order of magnitude
> > worse, so yeah, initial enthusiasm definitely curbed now.
> 
> > For me, what matters is not the latency itself, but the ability to
> > cross-timestamp one byte within the SPI transfer with high granularity,
> > and for the uncertainty of that timestamp to be as small and constant as
> > possible.
> 
> This is sounding like a copybreak type situation with the mode selected
> depending on how big the transfer is, that's a very common pattern.
> I've not looked how easy it is to flip this hardware between modes
> though.

I suppose one could try using FIFO mode for transfers which request
timestamping and DMA for transfers which don't. I don't have an insight
into what impact that will have on the driver, but I suspect at the very
least one will have to transform "DSPI_DMA_MODE" into "dspi->dma_available"
and "dspi->dma_in_use", and reconfigure the SPI_RSER register (interrupt
routing, to DMA engine or to CPUs) at every transfer rather than at dspi_init().

The question is whether you would be willing to see and maintain such
complexity increase, when AFAIK, the LS1028A FIFO mode passes its
requirements.

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Mark Brown 3 months, 1 week ago

On Tue, Jul 01, 2025 at 05:53:12PM +0300, Vladimir Oltean wrote:

> I suppose one could try using FIFO mode for transfers which request
> timestamping and DMA for transfers which don't. I don't have an insight
> into what impact that will have on the driver, but I suspect at the very
> least one will have to transform "DSPI_DMA_MODE" into "dspi->dma_available"
> and "dspi->dma_in_use", and reconfigure the SPI_RSER register (interrupt
> routing, to DMA engine or to CPUs) at every transfer rather than at dspi_init().

> The question is whether you would be willing to see and maintain such
> complexity increase, when AFAIK, the LS1028A FIFO mode passes its
> requirements.

Switching between modes is incredibly common, usually between PIO (for
very short transfers) and DMA, that's no problem.  Factoring in
timestamping seems like a reasonable signal I guess, might trip someone
who was trying to benchmark things up but probably not normal users.

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Vladimir Oltean 3 months, 1 week ago

On Tue, Jul 01, 2025 at 04:16:50PM +0100, Mark Brown wrote:
> On Tue, Jul 01, 2025 at 05:53:12PM +0300, Vladimir Oltean wrote:
> 
> > I suppose one could try using FIFO mode for transfers which request
> > timestamping and DMA for transfers which don't. I don't have an insight
> > into what impact that will have on the driver, but I suspect at the very
> > least one will have to transform "DSPI_DMA_MODE" into "dspi->dma_available"
> > and "dspi->dma_in_use", and reconfigure the SPI_RSER register (interrupt
> > routing, to DMA engine or to CPUs) at every transfer rather than at dspi_init().
> 
> > The question is whether you would be willing to see and maintain such
> > complexity increase, when AFAIK, the LS1028A FIFO mode passes its
> > requirements.
> 
> Switching between modes is incredibly common, usually between PIO (for
> very short transfers) and DMA, that's no problem.  Factoring in
> timestamping seems like a reasonable signal I guess, might trip someone
> who was trying to benchmark things up but probably not normal users.

Ah, ok, I vaguely remember something being discussed about can_dma()
on previous iterations of this patch, but in a different context.
Then that's an avenue to explore, I guess. Looking at that method's
prototype, I suppose dspi could simply return can_dma = false "if (xfer->ptp_sts)"
(timestamp requested), i.e. no core involvement in the decision process at all?
Sounds interesting, but I can't promise I'll follow up with patches.

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Mark Brown 3 months, 1 week ago

On Tue, Jul 01, 2025 at 06:24:33PM +0300, Vladimir Oltean wrote:
> On Tue, Jul 01, 2025 at 04:16:50PM +0100, Mark Brown wrote:

> > Switching between modes is incredibly common, usually between PIO (for
> > very short transfers) and DMA, that's no problem.  Factoring in
> > timestamping seems like a reasonable signal I guess, might trip someone
> > who was trying to benchmark things up but probably not normal users.

> Ah, ok, I vaguely remember something being discussed about can_dma()
> on previous iterations of this patch, but in a different context.
> Then that's an avenue to explore, I guess. Looking at that method's
> prototype, I suppose dspi could simply return can_dma = false "if (xfer->ptp_sts)"
> (timestamp requested), i.e. no core involvement in the decision process at all?

Yes, exactly.  It can base the decision on whatever amuses it about the
transfer.

Re: [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Mark Brown 3 months, 1 week ago

On Tue, Jul 01, 2025 at 01:42:46PM +0100, James Clark wrote:
> On 30/06/2025 4:26 pm, Vladimir Oltean wrote:
> > On Fri, Jun 27, 2025 at 11:21:36AM +0100, James Clark wrote:

> > So after your patch set, DMA mode on LS1028A becomes more performant and
> > should replace XSPI. This is an outstanding result. That can be done as
> > follow-up work.

> I wonder if latency could be higher despite increased throughput? It
> probably wouldn't be a big enough increase that anyone would care. And based
> on the structure of the driver if throughput is higher the latency might
> even be lower.

Some CAN users are likely to care, some of them care a lot about
throughput so will be happy but AIUI some are very focused on round trip
times.  That's a fairly specialist use case though.

Re: (subset) [PATCH v4 0/6] spi: spi-fsl-dspi: Target mode improvements

Posted by Mark Brown 3 months, 1 week ago

On Fri, 27 Jun 2025 11:21:36 +0100, James Clark wrote:
> Improve usability of target mode by reporting FIFO errors and increasing
> the buffer size when DMA is used. While we're touching DMA stuff also
> switch to non-coherent memory, although this is unrelated to target
> mode.
> 
> The first commit is marked as a fix because it can fix intermittent
> issues with existing transfers, rather than the later fixes which
> improve larger than FIFO target mode transfers which would have never
> worked.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/6] spi: spi-fsl-dspi: Clear completion counter before initiating transfer
      commit: fa60c094c19b97e103d653f528f8d9c178b6a5f5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark