Hi,
ntb_netdev currently hard-codes a single NTB transport queue pair, which
means the datapath effectively runs as a single-queue netdev regardless
of available CPUs / parallel flows.
The longer-term motivation here is throughput scale-out: allow
ntb_netdev to grow beyond the single-QP bottleneck and make it possible
to spread TX/RX work across multiple queue pairs as link speeds and core
counts keep increasing.
Multi-queue also unlocks the standard networking knobs on top of it. In
particular, once the device exposes multiple TX queues, qdisc/tc can
steer flows/traffic classes into different queues (via
skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
familiar way.
This series is a small plumbing step towards that direction:
1) Introduce a per-queue context object (struct ntb_netdev_queue) and
move queue-pair state out of struct ntb_netdev. Probe creates queue
pairs in a loop and configures the netdev queue counts to match the
number that was successfully created.
2) Expose ntb_num_queues as a module parameter to request multiple
queue pairs at probe time. The value is clamped to 1..64 and kept
read-only for now (no runtime reconfiguration).
3) Report the active queue-pair count via ethtool -l (get_channels),
so users can confirm the device configuration from user space.
Compatibility:
- Default remains ntb_num_queues=1, so behaviour is unchanged unless
the user explicitly requests more queues.
Kernel base:
- ntb-next latest:
commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
disable path")
Usage (example):
- modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
- ethtool -l <ifname> # Patch 3 takes care of it
Patch summary:
1/3 net: ntb_netdev: Introduce per-queue context
2/3 net: ntb_netdev: Make queue pair count configurable
3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
Testing / results:
Environment / command line:
- 2x R-Car S4 Spider boards
"Kernel base" (see above) + this series
- For TCP load:
[RC] $ sudo iperf3 -s
[EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
- For UDP load:
[RC] $ sudo iperf3 -s
[EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
Before (without this series):
TCP / UDP : 602 Mbps / 598 Mbps
Before (ntb_num_queues=1):
TCP / UDP : 588 Mbps / 605 Mbps
After (ntb_num_queues=2):
TCP / UDP : 602 Mbps / 598 Mbps
Notes:
In my current test environment, enabling multiple queue pairs does
not improve throughput. The receive-side memcpy in ntb_transport is
the dominant cost and limits scaling at present.
Still, this series lays the groundwork for future scaling, for
example once a transport backend is introduced that avoids memcpy
to/from PCI memory space on both ends (see the superseded RFC
series:
https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
Best regards,
Koichiro
Koichiro Den (3):
net: ntb_netdev: Introduce per-queue context
net: ntb_netdev: Make queue pair count configurable
net: ntb_netdev: Expose queue pair count via ethtool -l
drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
1 file changed, 228 insertions(+), 98 deletions(-)
--
2.51.0
On Wed, 25 Feb 2026 00:28:06 +0900 Koichiro Den wrote:
> Usage (example):
>   - modprobe ntb_netdev ntb_num_queues=<N>  # Patch 2 takes care of it
>   - ethtool -l <ifname>                     # Patch 3 takes care of it

Module parameters are not a very user friendly choice for uAPI.
You use ethtool -l for GET, what's the challenge with implementing SET
via ethtool -L?
On Wed, Feb 25, 2026 at 07:50:04PM -0800, Jakub Kicinski wrote:
> On Wed, 25 Feb 2026 00:28:06 +0900 Koichiro Den wrote:
> > Usage (example):
> >   - modprobe ntb_netdev ntb_num_queues=<N>  # Patch 2 takes care of it
> >   - ethtool -l <ifname>                     # Patch 3 takes care of it
>
> Module parameters are not a very user friendly choice for uAPI.
> You use ethtool -l for GET, what's the challenge with implementing SET
> via ethtool -L?

Thanks for the comment, Jakub.

There is no technical limitation. I didn't include SET support simply to
keep the initial series as minimal as possible. However, you're right:
adding such a module parameter would also make it part of the uAPI and
unnecessarily hard to remove later. It's better to implement SET from
the beginning.

Dave, let me respin the series with SET support and drop the module
parameter. Please let me know if you have any objections.

Best regards,
Koichiro
On 2/24/26 8:28 AM, Koichiro Den wrote:
> Hi,
>
> ntb_netdev currently hard-codes a single NTB transport queue pair, which
> means the datapath effectively runs as a single-queue netdev regardless
> of available CPUs / parallel flows.
>
> The longer-term motivation here is throughput scale-out: allow
> ntb_netdev to grow beyond the single-QP bottleneck and make it possible
> to spread TX/RX work across multiple queue pairs as link speeds and core
> counts keep increasing.
>
> Multi-queue also unlocks the standard networking knobs on top of it. In
> particular, once the device exposes multiple TX queues, qdisc/tc can
> steer flows/traffic classes into different queues (via
> skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
> familiar way.
>
> This series is a small plumbing step towards that direction:
>
> 1) Introduce a per-queue context object (struct ntb_netdev_queue) and
> move queue-pair state out of struct ntb_netdev. Probe creates queue
> pairs in a loop and configures the netdev queue counts to match the
> number that was successfully created.
>
> 2) Expose ntb_num_queues as a module parameter to request multiple
> queue pairs at probe time. The value is clamped to 1..64 and kept
> read-only for now (no runtime reconfiguration).
>
> 3) Report the active queue-pair count via ethtool -l (get_channels),
> so users can confirm the device configuration from user space.
>
> Compatibility:
> - Default remains ntb_num_queues=1, so behaviour is unchanged unless
> the user explicitly requests more queues.
>
> Kernel base:
> - ntb-next latest:
> commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
> disable path")
>
> Usage (example):
> - modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
> - ethtool -l <ifname> # Patch 3 takes care of it
>
> Patch summary:
> 1/3 net: ntb_netdev: Introduce per-queue context
> 2/3 net: ntb_netdev: Make queue pair count configurable
> 3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
>
> Testing / results:
> Environment / command line:
> - 2x R-Car S4 Spider boards
> "Kernel base" (see above) + this series
> - For TCP load:
> [RC] $ sudo iperf3 -s
> [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
> - For UDP load:
> [RC] $ sudo iperf3 -s
> [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
>
> Before (without this series):
> TCP / UDP : 602 Mbps / 598 Mbps
>
> Before (ntb_num_queues=1):
> TCP / UDP : 588 Mbps / 605 Mbps
What accounts for the dip in TCP performance?
>
> After (ntb_num_queues=2):
> TCP / UDP : 602 Mbps / 598 Mbps
>
> Notes:
> In my current test environment, enabling multiple queue pairs does
> not improve throughput. The receive-side memcpy in ntb_transport is
> the dominant cost and limits scaling at present.
>
> Still, this series lays the groundwork for future scaling, for
> example once a transport backend is introduced that avoids memcpy
> to/from PCI memory space on both ends (see the superseded RFC
> series:
> https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
>
>
> Best regards,
> Koichiro
>
> Koichiro Den (3):
> net: ntb_netdev: Introduce per-queue context
> net: ntb_netdev: Make queue pair count configurable
> net: ntb_netdev: Expose queue pair count via ethtool -l
>
> drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
> 1 file changed, 228 insertions(+), 98 deletions(-)
>
for the series
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
On Tue, Feb 24, 2026 at 09:20:35AM -0700, Dave Jiang wrote:
>
>
> On 2/24/26 8:28 AM, Koichiro Den wrote:
> > Hi,
> >
> > ntb_netdev currently hard-codes a single NTB transport queue pair, which
> > means the datapath effectively runs as a single-queue netdev regardless
> > of available CPUs / parallel flows.
> >
> > The longer-term motivation here is throughput scale-out: allow
> > ntb_netdev to grow beyond the single-QP bottleneck and make it possible
> > to spread TX/RX work across multiple queue pairs as link speeds and core
> > counts keep increasing.
> >
> > Multi-queue also unlocks the standard networking knobs on top of it. In
> > particular, once the device exposes multiple TX queues, qdisc/tc can
> > steer flows/traffic classes into different queues (via
> > skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
> > familiar way.
> >
> > This series is a small plumbing step towards that direction:
> >
> > 1) Introduce a per-queue context object (struct ntb_netdev_queue) and
> > move queue-pair state out of struct ntb_netdev. Probe creates queue
> > pairs in a loop and configures the netdev queue counts to match the
> > number that was successfully created.
> >
> > 2) Expose ntb_num_queues as a module parameter to request multiple
> > queue pairs at probe time. The value is clamped to 1..64 and kept
> > read-only for now (no runtime reconfiguration).
> >
> > 3) Report the active queue-pair count via ethtool -l (get_channels),
> > so users can confirm the device configuration from user space.
> >
> > Compatibility:
> > - Default remains ntb_num_queues=1, so behaviour is unchanged unless
> > the user explicitly requests more queues.
> >
> > Kernel base:
> > - ntb-next latest:
> > commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
> > disable path")
> >
> > Usage (example):
> > - modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
> > - ethtool -l <ifname> # Patch 3 takes care of it
> >
> > Patch summary:
> > 1/3 net: ntb_netdev: Introduce per-queue context
> > 2/3 net: ntb_netdev: Make queue pair count configurable
> > 3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
> >
> > Testing / results:
> > Environment / command line:
> > - 2x R-Car S4 Spider boards
> > "Kernel base" (see above) + this series
> > - For TCP load:
> > [RC] $ sudo iperf3 -s
> > [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
> > - For UDP load:
> > [RC] $ sudo iperf3 -s
> > [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
> >
> > Before (without this series):
> > TCP / UDP : 602 Mbps / 598 Mbps
> >
> > Before (ntb_num_queues=1):
> > TCP / UDP : 588 Mbps / 605 Mbps
>
> What accounts for the dip in TCP performance?
I believe this is within normal run-to-run variance. To be sure, I repeated the
TCP tests multiple times. The aggregated results are:
+------+----------+------------------+------------------+
| | Baseline | ntb_num_queues=1 | ntb_num_queues=2 |
+------+----------+------------------+------------------+
| Mean | 599.5 | 595.2 (-0.7%) | 600.4 (+0.2%) |
| Min | 590 | 590 (+0.0%) | 593 (+0.5%) |
| Max | 605 | 604 (-0.2%) | 605 (+0.0%) |
| Med | 602 | 593 | 601.5 |
| SD | 5.84 | 6.01 | 4.12 |
+------+----------+------------------+------------------+
On my setup (2x R-Car S4 Spider), I do not observe any statistically meaningful
improvement or degradation. For completeness, here is the raw data:
.----------------------------- Baseline (without this series)
: .----------------- ntb_num_queues=1
: : .---- ntb_num_queues=2
: : :
#1 601 Mbps 604 Mbps 601 Mbps
#2 604 Mbps 604 Mbps 603 Mbps
#3 592 Mbps 590 Mbps 600 Mbps
#4 593 Mbps 593 Mbps 603 Mbps
#5 605 Mbps 591 Mbps 605 Mbps
#6 590 Mbps 603 Mbps 602 Mbps
#7 605 Mbps 590 Mbps 596 Mbps
#8 598 Mbps 594 Mbps 593 Mbps
#9 603 Mbps 590 Mbps 605 Mbps
#10 604 Mbps 593 Mbps 596 Mbps
To see a tangible performance gain, another patch series I submitted yesterday
is also relevant:
[PATCH 00/10] NTB: epf: Enable per-doorbell bit handling while keeping legacy offset
https://lore.kernel.org/all/20260224133459.1741537-1-den@valinux.co.jp/
With that series applied as well, and with irq smp_affinity properly adjusted,
the results become:
After (ntb_num_queues=2 + the other series also applied):
TCP / UDP : 1.15 Gbps / 1.18 Gbps
In that sense, that series is also important groundwork from a performance
perspective. Since that work touches NTB-tree code, I'd appreciate it if you
could also have a look at that series.
Side note: R-Car S4 Spider has limited BAR resources. Although BAR2 is
resizable, ~2 MiB appears to be the practical ceiling for arbitrary mappings in
this setup, so I haven't tested larger ntb_num_queues=<N> values. On platforms
with more BAR space, sufficient CPUs for memcpy, or sufficient DMA channels for
DMA memcpy available to ntb_transport, further scaling with larger <N> values
should be possible.
Thanks,
Koichiro
>
> >
> > After (ntb_num_queues=2):
> > TCP / UDP : 602 Mbps / 598 Mbps
> >
> > Notes:
> > In my current test environment, enabling multiple queue pairs does
> > not improve throughput. The receive-side memcpy in ntb_transport is
> > the dominant cost and limits scaling at present.
> >
> > Still, this series lays the groundwork for future scaling, for
> > example once a transport backend is introduced that avoids memcpy
> > to/from PCI memory space on both ends (see the superseded RFC
> > series:
> > https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
> >
> >
> > Best regards,
> > Koichiro
> >
> > Koichiro Den (3):
> > net: ntb_netdev: Introduce per-queue context
> > net: ntb_netdev: Make queue pair count configurable
> > net: ntb_netdev: Expose queue pair count via ethtool -l
> >
> > drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
> > 1 file changed, 228 insertions(+), 98 deletions(-)
> >
>
> for the series
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
On 2/24/26 8:36 PM, Koichiro Den wrote:
> On Tue, Feb 24, 2026 at 09:20:35AM -0700, Dave Jiang wrote:
>>
>>
>> On 2/24/26 8:28 AM, Koichiro Den wrote:
>>> Hi,
>>>
>>> ntb_netdev currently hard-codes a single NTB transport queue pair, which
>>> means the datapath effectively runs as a single-queue netdev regardless
>>> of available CPUs / parallel flows.
>>>
>>> The longer-term motivation here is throughput scale-out: allow
>>> ntb_netdev to grow beyond the single-QP bottleneck and make it possible
>>> to spread TX/RX work across multiple queue pairs as link speeds and core
>>> counts keep increasing.
>>>
>>> Multi-queue also unlocks the standard networking knobs on top of it. In
>>> particular, once the device exposes multiple TX queues, qdisc/tc can
>>> steer flows/traffic classes into different queues (via
>>> skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
>>> familiar way.
>>>
>>> This series is a small plumbing step towards that direction:
>>>
>>> 1) Introduce a per-queue context object (struct ntb_netdev_queue) and
>>> move queue-pair state out of struct ntb_netdev. Probe creates queue
>>> pairs in a loop and configures the netdev queue counts to match the
>>> number that was successfully created.
>>>
>>> 2) Expose ntb_num_queues as a module parameter to request multiple
>>> queue pairs at probe time. The value is clamped to 1..64 and kept
>>> read-only for now (no runtime reconfiguration).
>>>
>>> 3) Report the active queue-pair count via ethtool -l (get_channels),
>>> so users can confirm the device configuration from user space.
>>>
>>> Compatibility:
>>> - Default remains ntb_num_queues=1, so behaviour is unchanged unless
>>> the user explicitly requests more queues.
>>>
>>> Kernel base:
>>> - ntb-next latest:
>>> commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
>>> disable path")
>>>
>>> Usage (example):
>>> - modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
>>> - ethtool -l <ifname> # Patch 3 takes care of it
>>>
>>> Patch summary:
>>> 1/3 net: ntb_netdev: Introduce per-queue context
>>> 2/3 net: ntb_netdev: Make queue pair count configurable
>>> 3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
>>>
>>> Testing / results:
>>> Environment / command line:
>>> - 2x R-Car S4 Spider boards
>>> "Kernel base" (see above) + this series
>>> - For TCP load:
>>> [RC] $ sudo iperf3 -s
>>> [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
>>> - For UDP load:
>>> [RC] $ sudo iperf3 -s
>>> [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
>>>
>>> Before (without this series):
>>> TCP / UDP : 602 Mbps / 598 Mbps
>>>
>>> Before (ntb_num_queues=1):
>>> TCP / UDP : 588 Mbps / 605 Mbps
>>
>> What accounts for the dip in TCP performance?
>
> I believe this is within normal run-to-run variance. To be sure, I repeated the
> TCP tests multiple times. The aggregated results are:
>
> +------+----------+------------------+------------------+
> | | Baseline | ntb_num_queues=1 | ntb_num_queues=2 |
> +------+----------+------------------+------------------+
> | Mean | 599.5 | 595.2 (-0.7%) | 600.4 (+0.2%) |
> | Min | 590 | 590 (+0.0%) | 593 (+0.5%) |
> | Max | 605 | 604 (-0.2%) | 605 (+0.0%) |
> | Med | 602 | 593 | 601.5 |
> | SD | 5.84 | 6.01 | 4.12 |
> +------+----------+------------------+------------------+
>
> On my setup (2x R-Car S4 Spider), I do not observe any statistically meaningful
> improvement or degradation. For completeness, here is the raw data:
>
> .----------------------------- Baseline (without this series)
> : .----------------- ntb_num_queues=1
> : : .---- ntb_num_queues=2
> : : :
> #1 601 Mbps 604 Mbps 601 Mbps
> #2 604 Mbps 604 Mbps 603 Mbps
> #3 592 Mbps 590 Mbps 600 Mbps
> #4 593 Mbps 593 Mbps 603 Mbps
> #5 605 Mbps 591 Mbps 605 Mbps
> #6 590 Mbps 603 Mbps 602 Mbps
> #7 605 Mbps 590 Mbps 596 Mbps
> #8 598 Mbps 594 Mbps 593 Mbps
> #9 603 Mbps 590 Mbps 605 Mbps
> #10 604 Mbps 593 Mbps 596 Mbps
>
> To see a tangible performance gain, another patch series I submitted yesterday
> is also relevant:
>
> [PATCH 00/10] NTB: epf: Enable per-doorbell bit handling while keeping legacy offset
> https://lore.kernel.org/all/20260224133459.1741537-1-den@valinux.co.jp/
>
> With that series applied as well, and with irq smp_affinity properly adjusted,
> the results become:
>
> After (ntb_num_queues=2 + the other series also applied):
> TCP / UDP : 1.15 Gbps / 1.18 Gbps
>
> In that sense, that series is also important groundwork from a performance
> perspective. Since that work touches NTB-tree code, I'd appreciate it if you
> could also have a look at that series.
>
> Side note: R-Car S4 Spider has limited BAR resources. Although BAR2 is
> resizable, ~2 MiB appears to be the practical ceiling for arbitrary mappings in
> this setup, so I haven't tested larger ntb_num_queues=<N> values. On platforms
> with more BAR space, sufficient CPUs for memcpy, or sufficient DMA channels for
> DMA memcpy available to ntb_transport, further scaling with larger <N> values
> should be possible.
Thanks for the data. I'll take a look at the other series.
>
> Thanks,
> Koichiro
>
>>
>>>
>>> After (ntb_num_queues=2):
>>> TCP / UDP : 602 Mbps / 598 Mbps
>>>
>>> Notes:
>>> In my current test environment, enabling multiple queue pairs does
>>> not improve throughput. The receive-side memcpy in ntb_transport is
>>> the dominant cost and limits scaling at present.
>>>
>>> Still, this series lays the groundwork for future scaling, for
>>> example once a transport backend is introduced that avoids memcpy
>>> to/from PCI memory space on both ends (see the superseded RFC
>>> series:
>>> https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
>>>
>>>
>>> Best regards,
>>> Koichiro
>>>
>>> Koichiro Den (3):
>>> net: ntb_netdev: Introduce per-queue context
>>> net: ntb_netdev: Make queue pair count configurable
>>> net: ntb_netdev: Expose queue pair count via ethtool -l
>>>
>>> drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
>>> 1 file changed, 228 insertions(+), 98 deletions(-)
>>>
>>
>> for the series
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>