If RPS is enabled, all packets with a CPU flow hint are enqueued to the
target CPU's input_pkt_queue and process_backlog() is scheduled on that
CPU to dequeue and process the packets. If ARFS has already steered the
packets to the correct CPU, this additional queuing is unnecessary and
the spinlocks involved incur significant CPU overhead.
In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
so, bypass input_pkt_queue and immediately process the packet(s) on the
current CPU.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
net/core/dev.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a7642..714a47897c75 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5855,11 +5855,11 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-		if (cpu >= 0) {
+		if (cpu >= 0 && cpu != smp_processor_id()) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
 			return ret;
 		}
 	}
@@ -5884,15 +5884,17 @@ void netif_receive_skb_list_internal(struct list_head *head)
 	list_splice_init(&sublist, head);
 
 	rcu_read_lock();
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
+		int curr_cpu = smp_processor_id();
+
 		list_for_each_entry_safe(skb, next, head, list) {
 			struct rps_dev_flow voidflow, *rflow = &voidflow;
 			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-			if (cpu >= 0) {
+			if (cpu >= 0 && cpu != curr_cpu) {
 				/* Will be handled, remove from list */
 				skb_list_del_init(skb);
 				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			}
 		}
 	}
--
2.45.2
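
For context before the replies, a much-simplified sketch of the backlog
handoff the commit message describes. This is a paraphrase, not the actual
kernel code: the backlog limit checks, flow-limit accounting, drop path,
and queue-tail bookkeeping are all elided, and the structure is only
approximate.

/* Paraphrase of the RPS handoff: the producer takes the target CPU's
 * input_pkt_queue lock to enqueue, and process_backlog() on that CPU
 * takes the same lock again to dequeue. This per-packet lock traffic
 * is the overhead the patch above tries to avoid when the packet is
 * already on the right CPU.
 */
static int enqueue_to_backlog_sketch(struct sk_buff *skb, int cpu,
				     unsigned int *qtail)
{
	struct softnet_data *sd = &per_cpu(softnet_data, cpu);
	unsigned long flags;

	rps_lock_irqsave(sd, &flags);		/* the contended spinlock */
	__skb_queue_tail(&sd->input_pkt_queue, skb);
	/* ... record *qtail, and schedule the backlog NAPI (possibly
	 * via an IPI to the target CPU) if not already scheduled ...
	 */
	rps_unlock_irq_restore(sd, &flags);
	return NET_RX_SUCCESS;
}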
On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos <csander@purestorage.com> wrote:
>
> If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> target CPU's input_pkt_queue and process_backlog() is scheduled on that
> CPU to dequeue and process the packets. If ARFS has already steered the
> packets to the correct CPU, this additional queuing is unnecessary and
> the spinlocks involved incur significant CPU overhead.
>
> In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> so, bypass input_pkt_queue and immediately process the packet(s) on the
> current CPU.
>
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>

Current implementation was a conscious choice. This has been discussed
several times.

By processing packets inline, you are actually increasing latencies of
packets queued to other cpus.
On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Current implementation was a conscious choice. This has been discussed
> several times.
>
> By processing packets inline, you are actually increasing latencies of
> packets queued to other cpus.

Sorry, I wasn't aware of these prior discussions. I take it you are
referring to threads like
https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/ ?
I see what you mean about the latency penalty for packets that do
require cross-CPU steering.

Do you have an alternate suggestion for how to avoid the overhead of
acquiring a spinlock for every packet? The atomic instruction in
rps_lock_irq_disable() called from process_backlog() is consuming 5%
of our CPU time. For our use case, we don't really want software RPS;
we are expecting ARFS to steer all high-bandwidth traffic to the
desired CPUs. We would happily turn off software RPS entirely if we
could, which seems like it would avoid the concerns about higher
latency for packets that need to be steered to a different CPU. But
my understanding is that using ARFS requires RPS to be enabled
(rps_sock_flow_entries set globally and rps_flow_cnt set on each
queue), which enables these rps_needed static branches. Is that
correct? If so, would you be open to adding a sysctl that disables
software RPS and relies upon ARFS to do the packet steering?

Thanks,
Caleb
On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
>
> Do you have an alternate suggestion for how to avoid the overhead of
> acquiring a spinlock for every packet? The atomic instruction in
> rps_lock_irq_disable() called from process_backlog() is consuming 5%
> of our CPU time. [...] If so, would you be open to adding a sysctl
> that disables software RPS and relies upon ARFS to do the packet
> steering?

A sysctl will not avoid the fundamental issue.
Why not instead address the past feedback ?

Can you test the following ?

diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a76424d7dadcc8374aa5b11dff44a4b46..7a5a7f1a4b7c3cbd105ecfc076377f25929729eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5842,6 +5842,21 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
+#ifdef CONFIG_RPS
+static bool net_must_use_backlog(int tcpu)
+{
+	if (tcpu < 0)
+		return false;
+	if (tcpu != smp_processor_id())
+		return true;
+	/* target cpu is ourself. We must use our backlog
+	 * if we have deferred IPI or packets.
+	 */
+	return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
+	       this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
+}
+#endif
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -5857,7 +5872,7 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-		if (cpu >= 0) {
+		if (net_must_use_backlog(cpu)) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
 			return ret;
@@ -5890,7 +5905,7 @@ void netif_receive_skb_list_internal(struct list_head *head)
 			struct rps_dev_flow voidflow, *rflow = &voidflow;
 			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-			if (cpu >= 0) {
+			if (net_must_use_backlog(cpu)) {
 				/* Will be handled, remove from list */
 				skb_list_del_init(skb);
 				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
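
To make the intent of the helper concrete, here is a reading of it as a
decision table (a gloss on the patch above, not Eric's wording):

/* net_must_use_backlog(tcpu), case by case:
 *
 * tcpu < 0:                      false -- no RPS steering decision,
 *                                so keep processing inline, as before
 * tcpu != current CPU:           true  -- must hand off cross-CPU via
 *                                the target CPU's backlog
 * tcpu == current CPU, pending   true  -- queue behind packets already
 * IPIs or queued packets:        in input_pkt_queue rather than
 *                                leapfrogging them
 * tcpu == current CPU, backlog   false -- safe to process inline and
 * idle:                          skip the backlog spinlock entirely
 */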
On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> A sysctl will not avoid the fundamental issue.
> Why not instead address the past feedback ?
> Can you test the following ?
>
> +#ifdef CONFIG_RPS
> +static bool net_must_use_backlog(int tcpu)
> +{
> +	if (tcpu < 0)
> +		return false;
> +	if (tcpu != smp_processor_id())
> +		return true;
> +	/* target cpu is ourself. We must use our backlog
> +	 * if we have deferred IPI or packets.
> +	 */
> +	return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
> +	       this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
> +}
> +#endif

Hi Eric,
Looking at this patch again, I am wondering why the tcpu < 0 case is
treated differently from the tcpu == smp_processor_id() case.

If I understand correctly, packets without a CPU flow hint are allowed
to be processed immediately on the current CPU. They will leapfrog
packets that other CPUs have queued onto this CPU via RPS and any RPS
IPIs waiting to be issued to other CPUs. I see this is the behavior in
the current code too, but why don't the same concerns about higher
latency for packets steered cross-CPU apply?

Thanks,
Caleb
On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> A sysctl will not avoid the fundamental issue.

Sorry if my suggestion was unclear. I mean that we would ideally like
to use only hardware ARFS for packet steering, and disable software
RPS.

In our testing, ARFS reliably steers packets to the desired CPUs. (Our
application has long-lived TCP sockets, each processed on a single
thread affinitized to one of the interrupt CPUs for the Ethernet
device.) In the off chance that ARFS doesn't steer the packet to the
correct CPU, we would rather just process it on the CPU that receives
it instead of going through the RPS queues. If software RPS is never
used, then there wouldn't be any concerns about higher latency for
RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
computation is also not cheap, so it would be nice to skip it too.
Basically, we want to program ARFS but skip these
static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
way to do that currently. (Please let me know if it's already possible
to do that.)

> Why not instead address the past feedback ?
> Can you test the following ?

Sure, I will test the performance on our setup with this patch. Still,
we would prefer to skip the get_rps_cpu() computation and these extra
checks, and just process the packets on the CPUs they arrive on.

Thanks for your help,
Caleb
On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@purestorage.com> wrote:
>
> Basically, we want to program ARFS but skip these
> static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> way to do that currently. (Please let me know if it's already possible
> to do that.)

I see now that get_rps_cpu() still needs to be called even if software
RPS isn't going to be used, because it's also responsible for calling
set_rps_cpu() to program ARFS. So it looks like avoiding the calls to
get_rps_cpu() entirely is not possible.

> > Why not instead address the past feedback ?
> > Can you test the following ?
>
> Sure, I will test the performance on our setup with this patch. Still,
> we would prefer to skip the get_rps_cpu() computation and these extra
> checks, and just process the packets on the CPUs they arrive on.

Hi Eric,
I tried out your patch and it seems to work equally well at
eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
added checks contribute a bit of overhead in
netif_receive_skb_list_internal(): it has increased from 0.15% to
0.28% of CPU time. But that's definitely worth it to remove the 5%
hotspot acquiring the RPS spinlock.

Feel free to add:
Tested-by: Caleb Sander Mateos <csander@purestorage.com>

Some minor comments on the patch:
- Probably should be using either skb_queue_len_lockless() or
  skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
  the RPS lock is not held at this point
- Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
  a single this_cpu_ptr(&softnet_data) call

Thanks,
Caleb
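
For illustration, a minimal sketch of the variant these two review
comments describe (hypothetical code combining both suggestions; whether
it is actually an improvement is exactly what the reply below addresses):

static bool net_must_use_backlog(int tcpu)
{
	struct softnet_data *sd;

	if (tcpu < 0)
		return false;
	if (tcpu != smp_processor_id())
		return true;
	/* One per-CPU pointer lookup, and the lockless queue helper
	 * instead of reaching into sk_buff_head internals.
	 */
	sd = this_cpu_ptr(&softnet_data);
	return sd->rps_ipi_list != NULL ||
	       !skb_queue_empty_lockless(&sd->input_pkt_queue);
}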
On Mon, Nov 4, 2024 at 7:42 PM Caleb Sander <csander@purestorage.com> wrote:
>
> I see now that get_rps_cpu() still needs to be called even if software
> RPS isn't going to be used, because it's also responsible for calling
> set_rps_cpu() to program ARFS. So it looks like avoiding the calls to
> get_rps_cpu() entirely is not possible.

Yes, this is the reason. Things would be different if ARFS was done at
dev_queue_xmit() time.

> Some minor comments on the patch:
> - Probably should be using either skb_queue_len_lockless() or
>   skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
>   the RPS lock is not held at this point

this_cpu_read() has implicit READ_ONCE() semantics.

> - Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
>   a single this_cpu_ptr(&softnet_data) call

Double check the generated assembly first :)

this_cpu_read() is faster than going through
this_cpu_ptr(&softnet_data), at least on x86, thanks to the %gs: prefix.

No temporary register is needed to compute and hold
this_cpu_ptr(&softnet_data).
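
Roughly, the difference Eric is describing looks like this on x86-64
(illustrative instruction sequences, not actual compiler output; offsets
are placeholders):

/* this_cpu_read(softnet_data.rps_ipi_list)
 *   -> a single %gs:-relative load:
 *      movq %gs:softnet_data+OFF(%rip), %rax
 *
 * struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 * ... sd->rps_ipi_list ...
 *   -> first materialize the per-CPU base in a temporary register,
 *      then dereference it:
 *      movq %gs:this_cpu_off(%rip), %rdx
 *      leaq softnet_data(%rdx), %rdx
 *      movq OFF(%rdx), %rax
 */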
On Mon, Nov 4, 2024 at 10:58 AM Eric Dumazet <edumazet@google.com> wrote:
>
> this_cpu_read() has implicit READ_ONCE() semantics.

Okay. Seems like it would be nicer not to reach into the sk_buff_head
internals, but up to you.

> Double check the generated assembly first :)
>
> this_cpu_read() is faster than going through
> this_cpu_ptr(&softnet_data), at least on x86, thanks to the %gs: prefix.
>
> No temporary register is needed to compute and hold
> this_cpu_ptr(&softnet_data).

Got it, thanks.