Hi,
ntb_netdev currently hard-codes a single NTB transport queue pair, which
means the datapath effectively runs as a single-queue netdev regardless
of available CPUs / parallel flows.
The longer-term motivation here is throughput scale-out: allow
ntb_netdev to grow beyond the single-QP bottleneck and make it possible
to spread TX/RX work across multiple queue pairs as link speeds and core
counts keep increasing.
Multi-queue also unlocks the standard networking knobs on top of it. In
particular, once the device exposes multiple TX queues, qdisc/tc can
steer flows/traffic classes into different queues (via
skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
familiar way.
This series is a small plumbing step towards that direction:
1) Introduce a per-queue context object (struct ntb_netdev_queue) and
move queue-pair state out of struct ntb_netdev. Probe creates queue
pairs in a loop and configures the netdev queue counts to match the
number that was successfully created.
2) Expose ntb_num_queues as a module parameter to request multiple
queue pairs at probe time. The value is clamped to 1..64 and kept
read-only for now (no runtime reconfiguration).
3) Report the active queue-pair count via ethtool -l (get_channels),
so users can confirm the device configuration from user space.
Compatibility:
- Default remains ntb_num_queues=1, so behaviour is unchanged unless
the user explicitly requests more queues.
Kernel base:
- ntb-next latest:
commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
disable path")
Usage (example):
- modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
- ethtool -l <ifname> # Patch 3 takes care of it
Patch summary:
1/3 net: ntb_netdev: Introduce per-queue context
2/3 net: ntb_netdev: Make queue pair count configurable
3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
Testing / results:
Environment / command line:
- 2x R-Car S4 Spider boards
"Kernel base" (see above) + this series
- For TCP load:
[RC] $ sudo iperf3 -s
[EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
- For UDP load:
[RC] $ sudo iperf3 -s
[EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
Before (without this series):
TCP / UDP : 602 Mbps / 598 Mbps
Before (ntb_num_queues=1):
TCP / UDP : 588 Mbps / 605 Mbps
After (ntb_num_queues=2):
TCP / UDP : 602 Mbps / 598 Mbps
Notes:
In my current test environment, enabling multiple queue pairs does
not improve throughput. The receive-side memcpy in ntb_transport is
the dominant cost and limits scaling at present.
Still, this series lays the groundwork for future scaling, for
example once a transport backend is introduced that avoids memcpy
to/from PCI memory space on both ends (see the superseded RFC
series:
https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
Best regards,
Koichiro
Koichiro Den (3):
net: ntb_netdev: Introduce per-queue context
net: ntb_netdev: Make queue pair count configurable
net: ntb_netdev: Expose queue pair count via ethtool -l
drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
1 file changed, 228 insertions(+), 98 deletions(-)
--
2.51.0
On Wed, 25 Feb 2026 00:28:06 +0900 Koichiro Den wrote:
> Usage (example):
>   - modprobe ntb_netdev ntb_num_queues=<N>  # Patch 2 takes care of it
>   - ethtool -l <ifname>                     # Patch 3 takes care of it

Module parameters are not a very user friendly choice for uAPI.
You use ethtool -l for GET, what's the challenge with implementing SET
via ethtool -L?
On Wed, Feb 25, 2026 at 07:50:04PM -0800, Jakub Kicinski wrote:
> On Wed, 25 Feb 2026 00:28:06 +0900 Koichiro Den wrote:
> > Usage (example):
> >   - modprobe ntb_netdev ntb_num_queues=<N>  # Patch 2 takes care of it
> >   - ethtool -l <ifname>                     # Patch 3 takes care of it
>
> Module parameters are not a very user friendly choice for uAPI.
> You use ethtool -l for GET, what's the challenge with implementing SET
> via ethtool -L?

Thanks for the comment, Jakub.

There is no technical limitation. I didn't include SET support simply to
keep the initial series as minimal as possible. However, you're right:
adding such a module parameter would also make it part of the uAPI and
unnecessarily hard to remove later. It's better to implement SET from
the beginning.

Dave, let me respin the series with SET support and drop the module
parameter. Please let me know if you have any objections.

Best regards,
Koichiro
On 2/24/26 8:28 AM, Koichiro Den wrote:
> Hi,
>
> ntb_netdev currently hard-codes a single NTB transport queue pair, which
> means the datapath effectively runs as a single-queue netdev regardless
> of available CPUs / parallel flows.
>
> The longer-term motivation here is throughput scale-out: allow
> ntb_netdev to grow beyond the single-QP bottleneck and make it possible
> to spread TX/RX work across multiple queue pairs as link speeds and core
> counts keep increasing.
>
> Multi-queue also unlocks the standard networking knobs on top of it. In
> particular, once the device exposes multiple TX queues, qdisc/tc can
> steer flows/traffic classes into different queues (via
> skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
> familiar way.
>
> This series is a small plumbing step towards that direction:
>
> 1) Introduce a per-queue context object (struct ntb_netdev_queue) and
> move queue-pair state out of struct ntb_netdev. Probe creates queue
> pairs in a loop and configures the netdev queue counts to match the
> number that was successfully created.
>
> 2) Expose ntb_num_queues as a module parameter to request multiple
> queue pairs at probe time. The value is clamped to 1..64 and kept
> read-only for now (no runtime reconfiguration).
>
> 3) Report the active queue-pair count via ethtool -l (get_channels),
> so users can confirm the device configuration from user space.
>
> Compatibility:
> - Default remains ntb_num_queues=1, so behaviour is unchanged unless
> the user explicitly requests more queues.
>
> Kernel base:
> - ntb-next latest:
> commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
> disable path")
>
> Usage (example):
> - modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
> - ethtool -l <ifname> # Patch 3 takes care of it
>
> Patch summary:
> 1/3 net: ntb_netdev: Introduce per-queue context
> 2/3 net: ntb_netdev: Make queue pair count configurable
> 3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
>
> Testing / results:
> Environment / command line:
> - 2x R-Car S4 Spider boards
> "Kernel base" (see above) + this series
> - For TCP load:
> [RC] $ sudo iperf3 -s
> [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
> - For UDP load:
> [RC] $ sudo iperf3 -s
> [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
>
> Before (without this series):
> TCP / UDP : 602 Mbps / 598 Mbps
>
> Before (ntb_num_queues=1):
> TCP / UDP : 588 Mbps / 605 Mbps
What accounts for the dip in TCP performance?
>
> After (ntb_num_queues=2):
> TCP / UDP : 602 Mbps / 598 Mbps
>
> Notes:
> In my current test environment, enabling multiple queue pairs does
> not improve throughput. The receive-side memcpy in ntb_transport is
> the dominant cost and limits scaling at present.
>
> Still, this series lays the groundwork for future scaling, for
> example once a transport backend is introduced that avoids memcpy
> to/from PCI memory space on both ends (see the superseded RFC
> series:
> https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
>
>
> Best regards,
> Koichiro
>
> Koichiro Den (3):
> net: ntb_netdev: Introduce per-queue context
> net: ntb_netdev: Make queue pair count configurable
> net: ntb_netdev: Expose queue pair count via ethtool -l
>
> drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
> 1 file changed, 228 insertions(+), 98 deletions(-)
>
for the series
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
On Tue, Feb 24, 2026 at 09:20:35AM -0700, Dave Jiang wrote:
>
>
> On 2/24/26 8:28 AM, Koichiro Den wrote:
> > Hi,
> >
> > ntb_netdev currently hard-codes a single NTB transport queue pair, which
> > means the datapath effectively runs as a single-queue netdev regardless
> > of available CPUs / parallel flows.
> >
> > The longer-term motivation here is throughput scale-out: allow
> > ntb_netdev to grow beyond the single-QP bottleneck and make it possible
> > to spread TX/RX work across multiple queue pairs as link speeds and core
> > counts keep increasing.
> >
> > Multi-queue also unlocks the standard networking knobs on top of it. In
> > particular, once the device exposes multiple TX queues, qdisc/tc can
> > steer flows/traffic classes into different queues (via
> > skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
> > familiar way.
> >
> > This series is a small plumbing step towards that direction:
> >
> > 1) Introduce a per-queue context object (struct ntb_netdev_queue) and
> > move queue-pair state out of struct ntb_netdev. Probe creates queue
> > pairs in a loop and configures the netdev queue counts to match the
> > number that was successfully created.
> >
> > 2) Expose ntb_num_queues as a module parameter to request multiple
> > queue pairs at probe time. The value is clamped to 1..64 and kept
> > read-only for now (no runtime reconfiguration).
> >
> > 3) Report the active queue-pair count via ethtool -l (get_channels),
> > so users can confirm the device configuration from user space.
> >
> > Compatibility:
> > - Default remains ntb_num_queues=1, so behaviour is unchanged unless
> > the user explicitly requests more queues.
> >
> > Kernel base:
> > - ntb-next latest:
> > commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
> > disable path")
> >
> > Usage (example):
> > - modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
> > - ethtool -l <ifname> # Patch 3 takes care of it
> >
> > Patch summary:
> > 1/3 net: ntb_netdev: Introduce per-queue context
> > 2/3 net: ntb_netdev: Make queue pair count configurable
> > 3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
> >
> > Testing / results:
> > Environment / command line:
> > - 2x R-Car S4 Spider boards
> > "Kernel base" (see above) + this series
> > - For TCP load:
> > [RC] $ sudo iperf3 -s
> > [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
> > - For UDP load:
> > [RC] $ sudo iperf3 -s
> > [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
> >
> > Before (without this series):
> > TCP / UDP : 602 Mbps / 598 Mbps
> >
> > Before (ntb_num_queues=1):
> > TCP / UDP : 588 Mbps / 605 Mbps
>
> What accounts for the dip in TCP performance?
I believe this is within normal run-to-run variance. To be sure, I repeated the
TCP tests multiple times. The aggregated results are:
+------+----------+------------------+------------------+
| | Baseline | ntb_num_queues=1 | ntb_num_queues=2 |
+------+----------+------------------+------------------+
| Mean | 599.5 | 595.2 (-0.7%) | 600.4 (+0.2%) |
| Min | 590 | 590 (+0.0%) | 593 (+0.5%) |
| Max | 605 | 604 (-0.2%) | 605 (+0.0%) |
| Med | 602 | 593 | 601.5 |
| SD | 5.84 | 6.01 | 4.12 |
+------+----------+------------------+------------------+
On my setup (2x R-Car S4 Spider), I do not observe any statistically meaningful
improvement or degradation. For completeness, here is the raw data:
.----------------------------- Baseline (without this series)
: .----------------- ntb_num_queues=1
: : .---- ntb_num_queues=2
: : :
#1 601 Mbps 604 Mbps 601 Mbps
#2 604 Mbps 604 Mbps 603 Mbps
#3 592 Mbps 590 Mbps 600 Mbps
#4 593 Mbps 593 Mbps 603 Mbps
#5 605 Mbps 591 Mbps 605 Mbps
#6 590 Mbps 603 Mbps 602 Mbps
#7 605 Mbps 590 Mbps 596 Mbps
#8 598 Mbps 594 Mbps 593 Mbps
#9 603 Mbps 590 Mbps 605 Mbps
#10 604 Mbps 593 Mbps 596 Mbps
To see a tangible performance gain, another patch series I submitted yesterday
is also relevant:
[PATCH 00/10] NTB: epf: Enable per-doorbell bit handling while keeping legacy offset
https://lore.kernel.org/all/20260224133459.1741537-1-den@valinux.co.jp/
With that series applied as well, and with irq smp_affinity properly adjusted,
the results become:
After (ntb_num_queues=2 + the other series also applied):
TCP / UDP : 1.15 Gbps / 1.18 Gbps
In that sense, that series is also important groundwork from a performance
perspective. Since that work touches NTB-tree code, I'd appreciate it if you
could also have a look at that series.
Side note: R-Car S4 Spider has limited BAR resources. Although BAR2 is
resizable, ~2 MiB appears to be the practical ceiling for arbitrary mappings in
this setup, so I haven't tested larger ntb_num_queues=<N> values. On platforms
with more BAR space, sufficient CPUs for memcpy, or sufficient DMA channels for
DMA memcpy available to ntb_transport, further scaling with larger <N> values
should be possible.
Thanks,
Koichiro
>
> >
> > After (ntb_num_queues=2):
> > TCP / UDP : 602 Mbps / 598 Mbps
> >
> > Notes:
> > In my current test environment, enabling multiple queue pairs does
> > not improve throughput. The receive-side memcpy in ntb_transport is
> > the dominant cost and limits scaling at present.
> >
> > Still, this series lays the groundwork for future scaling, for
> > example once a transport backend is introduced that avoids memcpy
> > to/from PCI memory space on both ends (see the superseded RFC
> > series:
> > https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
> >
> >
> > Best regards,
> > Koichiro
> >
> > Koichiro Den (3):
> > net: ntb_netdev: Introduce per-queue context
> > net: ntb_netdev: Make queue pair count configurable
> > net: ntb_netdev: Expose queue pair count via ethtool -l
> >
> > drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
> > 1 file changed, 228 insertions(+), 98 deletions(-)
> >
>
> for the series
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
On 2/24/26 8:36 PM, Koichiro Den wrote:
> On Tue, Feb 24, 2026 at 09:20:35AM -0700, Dave Jiang wrote:
>>
>>
>> On 2/24/26 8:28 AM, Koichiro Den wrote:
>>> Hi,
>>>
>>> ntb_netdev currently hard-codes a single NTB transport queue pair, which
>>> means the datapath effectively runs as a single-queue netdev regardless
>>> of available CPUs / parallel flows.
>>>
>>> The longer-term motivation here is throughput scale-out: allow
>>> ntb_netdev to grow beyond the single-QP bottleneck and make it possible
>>> to spread TX/RX work across multiple queue pairs as link speeds and core
>>> counts keep increasing.
>>>
>>> Multi-queue also unlocks the standard networking knobs on top of it. In
>>> particular, once the device exposes multiple TX queues, qdisc/tc can
>>> steer flows/traffic classes into different queues (via
>>> skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
>>> familiar way.
>>>
>>> This series is a small plumbing step towards that direction:
>>>
>>> 1) Introduce a per-queue context object (struct ntb_netdev_queue) and
>>> move queue-pair state out of struct ntb_netdev. Probe creates queue
>>> pairs in a loop and configures the netdev queue counts to match the
>>> number that was successfully created.
>>>
>>> 2) Expose ntb_num_queues as a module parameter to request multiple
>>> queue pairs at probe time. The value is clamped to 1..64 and kept
>>> read-only for now (no runtime reconfiguration).
>>>
>>> 3) Report the active queue-pair count via ethtool -l (get_channels),
>>> so users can confirm the device configuration from user space.
>>>
>>> Compatibility:
>>> - Default remains ntb_num_queues=1, so behaviour is unchanged unless
>>> the user explicitly requests more queues.
>>>
>>> Kernel base:
>>> - ntb-next latest:
>>> commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
>>> disable path")
>>>
>>> Usage (example):
>>> - modprobe ntb_netdev ntb_num_queues=<N> # Patch 2 takes care of it
>>> - ethtool -l <ifname> # Patch 3 takes care of it
>>>
>>> Patch summary:
>>> 1/3 net: ntb_netdev: Introduce per-queue context
>>> 2/3 net: ntb_netdev: Make queue pair count configurable
>>> 3/3 net: ntb_netdev: Expose queue pair count via ethtool -l
>>>
>>> Testing / results:
>>> Environment / command line:
>>> - 2x R-Car S4 Spider boards
>>> "Kernel base" (see above) + this series
>>> - For TCP load:
>>> [RC] $ sudo iperf3 -s
>>> [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
>>> - For UDP load:
>>> [RC] $ sudo iperf3 -s
>>> [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4
>>>
>>> Before (without this series):
>>> TCP / UDP : 602 Mbps / 598 Mbps
>>>
>>> Before (ntb_num_queues=1):
>>> TCP / UDP : 588 Mbps / 605 Mbps
>>
>> What accounts for the dip in TCP performance?
>
> I believe this is within normal run-to-run variance. To be sure, I repeated the
> TCP tests multiple times. The aggregated results are:
>
> +------+----------+------------------+------------------+
> | | Baseline | ntb_num_queues=1 | ntb_num_queues=2 |
> +------+----------+------------------+------------------+
> | Mean | 599.5 | 595.2 (-0.7%) | 600.4 (+0.2%) |
> | Min | 590 | 590 (+0.0%) | 593 (+0.5%) |
> | Max | 605 | 604 (-0.2%) | 605 (+0.0%) |
> | Med | 602 | 593 | 601.5 |
> | SD | 5.84 | 6.01 | 4.12 |
> +------+----------+------------------+------------------+
>
> On my setup (2x R-Car S4 Spider), I do not observe any statistically meaningful
> improvement or degradation. For completeness, here is the raw data:
>
> .----------------------------- Baseline (without this series)
> : .----------------- ntb_num_queues=1
> : : .---- ntb_num_queues=2
> : : :
> #1 601 Mbps 604 Mbps 601 Mbps
> #2 604 Mbps 604 Mbps 603 Mbps
> #3 592 Mbps 590 Mbps 600 Mbps
> #4 593 Mbps 593 Mbps 603 Mbps
> #5 605 Mbps 591 Mbps 605 Mbps
> #6 590 Mbps 603 Mbps 602 Mbps
> #7 605 Mbps 590 Mbps 596 Mbps
> #8 598 Mbps 594 Mbps 593 Mbps
> #9 603 Mbps 590 Mbps 605 Mbps
> #10 604 Mbps 593 Mbps 596 Mbps
>
> To see a tangible performance gain, another patch series I submitted yesterday
> is also relevant:
>
> [PATCH 00/10] NTB: epf: Enable per-doorbell bit handling while keeping legacy offset
> https://lore.kernel.org/all/20260224133459.1741537-1-den@valinux.co.jp/
>
> With that series applied as well, and with irq smp_affinity properly adjusted,
> the results become:
>
> After (ntb_num_queues=2 + the other series also applied):
> TCP / UDP : 1.15 Gbps / 1.18 Gbps
>
> In that sense, that series is also important groundwork from a performance
> perspective. Since that work touches NTB-tree code, I'd appreciate it if you
> could also have a look at that series.
>
> Side note: R-Car S4 Spider has limited BAR resources. Although BAR2 is
> resizable, ~2 MiB appears to be the practical ceiling for arbitrary mappings in
> this setup, so I haven't tested larger ntb_num_queues=<N> values. On platforms
> with more BAR space, sufficient CPUs for memcpy, or sufficient DMA channels for
> DMA memcpy available to ntb_transport, further scaling with larger <N> values
> should be possible.
Thanks for the data. I'll take a look at the other series.
>
> Thanks,
> Koichiro
>
>>
>>>
>>> After (ntb_num_queues=2):
>>> TCP / UDP : 602 Mbps / 598 Mbps
>>>
>>> Notes:
>>> In my current test environment, enabling multiple queue pairs does
>>> not improve throughput. The receive-side memcpy in ntb_transport is
>>> the dominant cost and limits scaling at present.
>>>
>>> Still, this series lays the groundwork for future scaling, for
>>> example once a transport backend is introduced that avoids memcpy
>>> to/from PCI memory space on both ends (see the superseded RFC
>>> series:
>>> https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/).
>>>
>>>
>>> Best regards,
>>> Koichiro
>>>
>>> Koichiro Den (3):
>>> net: ntb_netdev: Introduce per-queue context
>>> net: ntb_netdev: Make queue pair count configurable
>>> net: ntb_netdev: Expose queue pair count via ethtool -l
>>>
>>> drivers/net/ntb_netdev.c | 326 +++++++++++++++++++++++++++------------
>>> 1 file changed, 228 insertions(+), 98 deletions(-)
>>>
>>
>> for the series
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>