[PATCH net-next v2 0/4] net: route: improve route hinting
Posted by Leone Fernando 1 week, 5 days ago
In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
sub-system. The hinting optimization improves performance by reusing
previously found dsts instead of looking them up for each skb.
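
To illustrate, here is a toy userspace model of the hint (names are
made up; this is not the kernel code). In the kernel, ip_list_rcv_finish()
keeps the previous skb as a hint and ip_route_use_hint() reuses its dst
when the next packet's daddr matches, so a run of packets to one
destination costs a single lookup:

/* toy model: n packets to one daddr cost a single route lookup */
#include <stdint.h>
#include <stdio.h>

struct dst { uint32_t daddr; };		/* stand-in for struct dst_entry */

static struct dst *fib_lookup_slow(uint32_t daddr)
{
	static struct dst d;
	d.daddr = daddr;		/* pretend this walks the fib trie */
	return &d;
}

int main(void)
{
	uint32_t batch[] = { 0x0a000001, 0x0a000001, 0x0a000001, 0x0a000002 };
	struct dst *hint = NULL;
	int lookups = 0;

	for (size_t i = 0; i < sizeof(batch) / sizeof(batch[0]); i++) {
		if (!hint || hint->daddr != batch[i]) {	/* hint miss */
			hint = fib_lookup_slow(batch[i]);
			lookups++;
		}
		/* otherwise: hint hit, reuse the previous dst */
	}
	printf("%d lookups for 4 packets\n", lookups);	/* prints 2 */
	return 0;
}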

This patch series introduces a generalized version of the hinting mechanism that
can "remember" a larger number of dsts. This reduces the number of dst
lookups for frequently encountered daddrs.

Before diving into the code and the benchmarking results, it's important
to address the deletion of the old route cache [2] and why
this solution is different. The original cache was complicated,
vulnerable to DoS attacks and had unstable performance.

The new input dst_cache is much simpler thanks to its lazy approach,
improving performance without the overhead of the removed cache
implementation. Instead of using timers and GC, invalid entries are
deleted lazily when they are looked up.
The dsts are stored in a simple, lightweight, static hash table. This
keeps lookup times fast yet stable, preventing DoS upon cache misses.
The new input dst_cache implementation is built on top of the existing
dst_cache code, which provides fast, lockless per-CPU behavior.
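
Below is a toy model of the input cache lookup (slot count, key fields
and names are illustrative; the real API is the input_dst_cache addition
to include/net/dst_cache.h). The probe is constant-time, so a miss costs
the same as a hit, and expired entries are dropped at lookup time:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 512			/* fixed size: no growth, no GC */

struct cached_rt {
	uint32_t daddr, saddr;
	uint8_t  tos;
	bool     valid;
	/* the kernel entry also holds the dst pointer and its cookie */
};

/* the kernel keeps one such table per CPU, so lookups stay lockless */
static struct cached_rt cache[CACHE_SLOTS];

static size_t slot(uint32_t daddr, uint32_t saddr, uint8_t tos)
{
	return (daddr ^ saddr ^ tos) % CACHE_SLOTS;
}

struct cached_rt *cache_lookup(uint32_t daddr, uint32_t saddr, uint8_t tos,
			       bool dst_is_expired)
{
	struct cached_rt *e = &cache[slot(daddr, saddr, tos)];

	if (!e->valid || e->daddr != daddr || e->saddr != saddr ||
	    e->tos != tos)
		return NULL;		/* miss: fall back to a fib lookup */

	if (dst_is_expired) {		/* kernel: dst cookie/obsolete check */
		e->valid = false;	/* lazy deletion: no timers, no GC */
		return NULL;
	}
	return e;
}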

The measurement setup consists of 2 machines with mlx5 100Gbit NICs.
I sent small UDP packets with 5000 daddrs (10x the cache size) from one
machine to the other while also varying the saddr and the tos. I set
an iptables rule to drop the packets after routing. The receiving
machine's CPU (i9) was saturated.

Thanks a lot to David Ahern for all the help and guidance!

I measured the rx PPS using ifpps and the per-queue PPS using ethtool -S.
These are the results:

Total PPS:
mainline              patched                   delta
  Kpps                  Kpps                      %
  6903                  8105                    17.41

Per-Queue PPS:
Queue          mainline         patched
  0             345775          411780
  1             345252          414387
  2             347724          407501
  3             346232          413456
  4             347271          412088
  5             346808          400910
  6             346243          406699
  7             346484          409104
  8             342731          404612
  9             344068          407558
  10            345832          409558
  11            346296          409935
  12            346900          399084
  13            345980          404513
  14            347244          405136
  15            346801          408752
  16            345984          410865
  17            346632          405752
  18            346064          407539
  19            344861          408364
 total          6921182         8157593

I also verified that the number of packets caught by the iptables rule
matches the measured PPS.

TCP throughput was not affected by the patch; below is the iperf3 output:
       mainline                                     patched 
15.4 GBytes 13.2 Gbits/sec                  15.5 GBytes 13.2 Gbits/sec

[1] https://lore.kernel.org/netdev/cover.1574252982.git.pabeni@redhat.com/
[2] https://lore.kernel.org/netdev/20120720.142502.1144557295933737451.davem@davemloft.net/

v1->v2:
- fix bitwise cast warning
- improve measurement setup

v1:
- fix typo while allocating per-cpu cache
- set IPSKB_DOREDIRECT correctly when using a dst from the dst_cache
- always compile dst_cache

RFC-v2:
- remove unnecessary macro
- move inline to .h file

RFC-v1: https://lore.kernel.org/netdev/d951b371-4138-4bda-a1c5-7606a28c81f0@gmail.com/
RFC-v2: https://lore.kernel.org/netdev/3a17c86d-08a5-46d2-8622-abc13d4a411e@gmail.com/

Leone Fernando (4):
  net: route: expire rt if the dst it holds is expired
  net: dst_cache: add input_dst_cache API
  net: route: always compile dst_cache
  net: route: replace route hints with input_dst_cache

 drivers/net/Kconfig        |   1 -
 include/net/dst_cache.h    |  68 +++++++++++++++++++
 include/net/dst_metadata.h |   2 -
 include/net/ip_tunnels.h   |   2 -
 include/net/route.h        |   6 +-
 net/Kconfig                |   4 --
 net/core/Makefile          |   3 +-
 net/core/dst.c             |   4 --
 net/core/dst_cache.c       | 132 +++++++++++++++++++++++++++++++++++++
 net/ipv4/Kconfig           |   1 -
 net/ipv4/ip_input.c        |  58 ++++++++--------
 net/ipv4/ip_tunnel_core.c  |   4 --
 net/ipv4/route.c           |  75 +++++++++++++++------
 net/ipv4/udp_tunnel_core.c |   4 --
 net/ipv6/Kconfig           |   4 --
 net/ipv6/ip6_udp_tunnel.c  |   4 --
 net/netfilter/nft_tunnel.c |   2 -
 net/openvswitch/Kconfig    |   1 -
 net/sched/act_tunnel_key.c |   2 -
 19 files changed, 291 insertions(+), 86 deletions(-)

-- 
2.34.1
Re: [PATCH net-next v2 0/4] net: route: improve route hinting
Posted by Eric Dumazet 1 week, 5 days ago
On Tue, May 7, 2024 at 2:43 PM Leone Fernando <leone4fernando@gmail.com> wrote:
>
> In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
> sub-system. The hinting optimization improves performance by reusing
> previously found dsts instead of looking them up for each skb.
>
> This patch series introduces a generalized version of the hinting mechanism that
> can "remember" a larger number of dsts. This reduces the number of dst
> lookups for frequently encountered daddrs.
>
> Before diving into the code and the benchmarking results, it's important
> to address the deletion of the old route cache [2] and why
> this solution is different. The original cache was complicated,
> vulnerable to DoS attacks and had unstable performance.
>
> The new input dst_cache is much simpler thanks to its lazy approach,
> improving performance without the overhead of the removed cache
> implementation. Instead of using timers and GC, invalid entries are
> deleted lazily when they are looked up.
> The dsts are stored in a simple, lightweight, static hash table. This
> keeps lookup times fast yet stable, preventing DoS upon cache misses.
> The new input dst_cache implementation is built on top of the existing
> dst_cache code, which provides fast, lockless per-CPU behavior.
>
> The measurement setup consists of 2 machines with mlx5 100Gbit NICs.
> I sent small UDP packets with 5000 daddrs (10x the cache size) from one
> machine to the other while also varying the saddr and the tos. I set
> an iptables rule to drop the packets after routing. The receiving
> machine's CPU (i9) was saturated.
>
> Thanks a lot to David Ahern for all the help and guidance!
>
> I measured the rx PPS using ifpps and the per-queue PPS using ethtool -S.
> These are the results:

How are device dismantles taken into account?

I am currently tracking a bug in dst_cache, triggering sometimes when
running pmtu.sh selftest.

Apparently, dst_cache_per_cpu_dst_set() can cache dsts that have no
dst->rt_uncached linkage.

There is no cleanup (at least in vxlan) to make sure cached dsts are
either freed or their dst->dev changed.

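For reference, the setter just grabs a reference and remembers the dst;
nothing ever revisits the entry when the underlying device is dismantled
(from net/core/dst_cache.c, roughly):

static void dst_cache_per_cpu_dst_set(struct dst_cache_pcpu *dst_cache,
				      struct dst_entry *dst, u32 cookie)
{
	dst_release(dst_cache->dst);
	if (dst)
		dst_hold(dst);	/* holds the dst (and its dev) indefinitely */

	dst_cache->cookie = cookie;
	dst_cache->dst = dst;
}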

TEST: ipv6: cleanup of cached exceptions - nexthop objects          [ OK ]
[ 1001.344490] vxlan: __vxlan_fdb_free calling dst_cache_destroy(ffff8f12422cbb90)
[ 1001.345253] dst_cache_destroy dst_cache=ffff8f12422cbb90 ->cache=0000417580008d30
[ 1001.378615] vxlan: __vxlan_fdb_free calling dst_cache_destroy(ffff8f12471e31d0)
[ 1001.379260] dst_cache_destroy dst_cache=ffff8f12471e31d0 ->cache=0000417580008608
[ 1011.349730] unregister_netdevice: waiting for veth_A-R1 to become free. Usage count = 7
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
[ 1011.350562]      dst_alloc+0x76/0x160
[ 1011.350562]      ip6_dst_alloc+0x25/0x80
[ 1011.350562]      ip6_pol_route+0x2a8/0x450
[ 1011.350562]      ip6_pol_route_output+0x1f/0x30
[ 1011.350562]      fib6_rule_lookup+0x163/0x270
[ 1011.350562]      ip6_route_output_flags+0xda/0x190
[ 1011.350562]      ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
[ 1011.350562]      ip6_dst_lookup_flow+0x47/0xa0
[ 1011.350562]      udp_tunnel6_dst_lookup+0x158/0x210
[ 1011.350562]      vxlan_xmit_one+0x4c6/0x1550 [vxlan]
[ 1011.350562]      vxlan_xmit+0x535/0x1500 [vxlan]
[ 1011.350562]      dev_hard_start_xmit+0x7b/0x1e0
[ 1011.350562]      __dev_queue_xmit+0x20c/0xe40
[ 1011.350562]      arp_xmit+0x1d/0x50
[ 1011.350562]      arp_send_dst+0x7f/0xa0
[ 1011.350562]      arp_solicit+0xf6/0x2f0
[ 1011.350562]
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 3/6 users at
[ 1011.350562]      dst_alloc+0x76/0x160
[ 1011.350562]      ip6_dst_alloc+0x25/0x80
[ 1011.350562]      ip6_pol_route+0x2a8/0x450
[ 1011.350562]      ip6_pol_route_output+0x1f/0x30
[ 1011.350562]      fib6_rule_lookup+0x163/0x270
[ 1011.350562]      ip6_route_output_flags+0xda/0x190
[ 1011.350562]      ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
[ 1011.350562]      ip6_dst_lookup_flow+0x47/0xa0
[ 1011.350562]      udp_tunnel6_dst_lookup+0x158/0x210
[ 1011.350562]      vxlan_xmit_one+0x4c6/0x1550 [vxlan]
[ 1011.350562]      vxlan_xmit+0x535/0x1500 [vxlan]
[ 1011.350562]      dev_hard_start_xmit+0x7b/0x1e0
[ 1011.350562]      __dev_queue_xmit+0x20c/0xe40
[ 1011.350562]      ip6_finish_output2+0x2ea/0x6e0
[ 1011.350562]      ip6_finish_output+0x143/0x320
[ 1011.350562]      ip6_output+0x74/0x140
[ 1011.350562]
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
[ 1011.350562]      netdev_get_by_index+0xc0/0xe0
[ 1011.350562]      fib6_nh_init+0x1a9/0xa90
[ 1011.350562]      rtm_new_nexthop+0x6fa/0x1580
[ 1011.350562]      rtnetlink_rcv_msg+0x155/0x3e0
[ 1011.350562]      netlink_rcv_skb+0x61/0x110
[ 1011.350562]      rtnetlink_rcv+0x19/0x20
[ 1011.350562]      netlink_unicast+0x23f/0x380
[ 1011.350562]      netlink_sendmsg+0x1fc/0x430
[ 1011.350562]      ____sys_sendmsg+0x2ef/0x320
[ 1011.350562]      ___sys_sendmsg+0x86/0xd0
[ 1011.350562]      __sys_sendmsg+0x67/0xc0
[ 1011.350562]      __x64_sys_sendmsg+0x21/0x30
[ 1011.350562]      x64_sys_call+0x252/0x2030
[ 1011.350562]      do_syscall_64+0x6c/0x190
[ 1011.350562]      entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1011.350562]
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
[ 1011.350562]      ipv6_add_dev+0x136/0x530
[ 1011.350562]      addrconf_notify+0x19d/0x770
[ 1011.350562]      notifier_call_chain+0x65/0xd0
[ 1011.350562]      raw_notifier_call_chain+0x1a/0x20
[ 1011.350562]      call_netdevice_notifiers_info+0x54/0x90
[ 1011.350562]      register_netdevice+0x61e/0x790
[ 1011.350562]      veth_newlink+0x230/0x440
[ 1011.350562]      __rtnl_newlink+0x7d2/0xaa0
[ 1011.350562]      rtnl_newlink+0x4c/0x70
[ 1011.350562]      rtnetlink_rcv_msg+0x155/0x3e0
[ 1011.350562]      netlink_rcv_skb+0x61/0x110
[ 1011.350562]      rtnetlink_rcv+0x19/0x20
[ 1011.350562]      netlink_unicast+0x23f/0x380
[ 1011.350562]      netlink_sendmsg+0x1fc/0x430
[ 1011.350562]      ____sys_sendmsg+0x2ef/0x320
[ 1011.350562]      ___sys_sendmsg+0x86/0xd0
[ 1011.350562]
Re: [PATCH net-next v2 0/4] net: route: improve route hinting
Posted by Leone Fernando 4 days, 2 hours ago
> On Tue, May 7, 2024 at 2:43 PM Leone Fernando <leone4fernando@gmail.com> wrote:
>>
>> In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
>> sub-system. The hinting optimization improves performance by reusing
>> previously found dsts instead of looking them up for each skb.
>>
>> This patch series introduces a generalized version of the hinting mechanism that
>> can "remember" a larger number of dsts. This reduces the number of dst
>> lookups for frequently encountered daddrs.
>>
>> Before diving into the code and the benchmarking results, it's important
>> to address the deletion of the old route cache [2] and why
>> this solution is different. The original cache was complicated,
>> vulnerable to DoS attacks and had unstable performance.
>>
>> The new input dst_cache is much simpler thanks to its lazy approach,
>> improving performance without the overhead of the removed cache
>> implementation. Instead of using timers and GC, invalid entries are
>> deleted lazily when they are looked up.
>> The dsts are stored in a simple, lightweight, static hash table. This
>> keeps lookup times fast yet stable, preventing DoS upon cache misses.
>> The new input dst_cache implementation is built on top of the existing
>> dst_cache code, which provides fast, lockless per-CPU behavior.
>>
>> The measurement setup consists of 2 machines with mlx5 100Gbit NICs.
>> I sent small UDP packets with 5000 daddrs (10x the cache size) from one
>> machine to the other while also varying the saddr and the tos. I set
>> an iptables rule to drop the packets after routing. The receiving
>> machine's CPU (i9) was saturated.
>>
>> Thanks a lot to David Ahern for all the help and guidance!
>>
>> I measured the rx PPS using ifpps and the per-queue PPS using ethtool -S.
>> These are the results:
> 
> How are device dismantles taken into account?
>
> I am currently tracking a bug in dst_cache, triggering sometimes when
> running pmtu.sh selftest.
>
> Apparently, dst_cache_per_cpu_dst_set() can cache dsts that have no
> dst->rt_uncached linkage.

The dst_cache_input introduced in this series caches input routes that
are owned by the fib tree. These routes have rt_uncached linkage, so I
think this bug will not replicate to dst_cache_input.
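
For reference, unregistering a device walks the per-cpu uncached lists
and re-points such routes at blackhole_netdev, so they cannot pin the
device (abbreviated sketch of rt_flush_dev() in net/ipv4/route.c):

void rt_flush_dev(struct net_device *dev)
{
	struct rtable *rt;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct uncached_list *ul = &per_cpu(rt_uncached_list, cpu);

		spin_lock_bh(&ul->lock);
		list_for_each_entry(rt, &ul->head, dst.rt_uncached) {
			if (rt->dst.dev != dev)
				continue;
			/* route keeps working; device ref is transferred */
			rt->dst.dev = blackhole_netdev;
			netdev_ref_replace(dev, blackhole_netdev,
					   &rt->dst.dev_tracker, GFP_ATOMIC);
		}
		spin_unlock_bh(&ul->lock);
	}
}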

> There is no cleanup (at least in vxlan) to make sure cached dsts are
> either freed or their dst->dev changed.
> 
> 
> TEST: ipv6: cleanup of cached exceptions - nexthop objects          [ OK ]
> [ 1001.344490] vxlan: __vxlan_fdb_free calling dst_cache_destroy(ffff8f12422cbb90)
> [ 1001.345253] dst_cache_destroy dst_cache=ffff8f12422cbb90 ->cache=0000417580008d30
> [ 1001.378615] vxlan: __vxlan_fdb_free calling dst_cache_destroy(ffff8f12471e31d0)
> [ 1001.379260] dst_cache_destroy dst_cache=ffff8f12471e31d0 ->cache=0000417580008608
> [ 1011.349730] unregister_netdevice: waiting for veth_A-R1 to become free. Usage count = 7
> [ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
> [ 1011.350562]      dst_alloc+0x76/0x160
> [ 1011.350562]      ip6_dst_alloc+0x25/0x80
> [ 1011.350562]      ip6_pol_route+0x2a8/0x450
> [ 1011.350562]      ip6_pol_route_output+0x1f/0x30
> [ 1011.350562]      fib6_rule_lookup+0x163/0x270
> [ 1011.350562]      ip6_route_output_flags+0xda/0x190
> [ 1011.350562]      ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
> [ 1011.350562]      ip6_dst_lookup_flow+0x47/0xa0
> [ 1011.350562]      udp_tunnel6_dst_lookup+0x158/0x210
> [ 1011.350562]      vxlan_xmit_one+0x4c6/0x1550 [vxlan]
> [ 1011.350562]      vxlan_xmit+0x535/0x1500 [vxlan]
> [ 1011.350562]      dev_hard_start_xmit+0x7b/0x1e0
> [ 1011.350562]      __dev_queue_xmit+0x20c/0xe40
> [ 1011.350562]      arp_xmit+0x1d/0x50
> [ 1011.350562]      arp_send_dst+0x7f/0xa0
> [ 1011.350562]      arp_solicit+0xf6/0x2f0
> [ 1011.350562]
> [ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 3/6 users at
> [ 1011.350562]      dst_alloc+0x76/0x160
> [ 1011.350562]      ip6_dst_alloc+0x25/0x80
> [ 1011.350562]      ip6_pol_route+0x2a8/0x450
> [ 1011.350562]      ip6_pol_route_output+0x1f/0x30
> [ 1011.350562]      fib6_rule_lookup+0x163/0x270
> [ 1011.350562]      ip6_route_output_flags+0xda/0x190
> [ 1011.350562]      ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
> [ 1011.350562]      ip6_dst_lookup_flow+0x47/0xa0
> [ 1011.350562]      udp_tunnel6_dst_lookup+0x158/0x210
> [ 1011.350562]      vxlan_xmit_one+0x4c6/0x1550 [vxlan]
> [ 1011.350562]      vxlan_xmit+0x535/0x1500 [vxlan]
> [ 1011.350562]      dev_hard_start_xmit+0x7b/0x1e0
> [ 1011.350562]      __dev_queue_xmit+0x20c/0xe40
> [ 1011.350562]      ip6_finish_output2+0x2ea/0x6e0
> [ 1011.350562]      ip6_finish_output+0x143/0x320
> [ 1011.350562]      ip6_output+0x74/0x140
> [ 1011.350562]
> [ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
> [ 1011.350562]      netdev_get_by_index+0xc0/0xe0
> [ 1011.350562]      fib6_nh_init+0x1a9/0xa90
> [ 1011.350562]      rtm_new_nexthop+0x6fa/0x1580
> [ 1011.350562]      rtnetlink_rcv_msg+0x155/0x3e0
> [ 1011.350562]      netlink_rcv_skb+0x61/0x110
> [ 1011.350562]      rtnetlink_rcv+0x19/0x20
> [ 1011.350562]      netlink_unicast+0x23f/0x380
> [ 1011.350562]      netlink_sendmsg+0x1fc/0x430
> [ 1011.350562]      ____sys_sendmsg+0x2ef/0x320
> [ 1011.350562]      ___sys_sendmsg+0x86/0xd0
> [ 1011.350562]      __sys_sendmsg+0x67/0xc0
> [ 1011.350562]      __x64_sys_sendmsg+0x21/0x30
> [ 1011.350562]      x64_sys_call+0x252/0x2030
> [ 1011.350562]      do_syscall_64+0x6c/0x190
> [ 1011.350562]      entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1011.350562]
> [ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
> [ 1011.350562]      ipv6_add_dev+0x136/0x530
> [ 1011.350562]      addrconf_notify+0x19d/0x770
> [ 1011.350562]      notifier_call_chain+0x65/0xd0
> [ 1011.350562]      raw_notifier_call_chain+0x1a/0x20
> [ 1011.350562]      call_netdevice_notifiers_info+0x54/0x90
> [ 1011.350562]      register_netdevice+0x61e/0x790
> [ 1011.350562]      veth_newlink+0x230/0x440
> [ 1011.350562]      __rtnl_newlink+0x7d2/0xaa0
> [ 1011.350562]      rtnl_newlink+0x4c/0x70
> [ 1011.350562]      rtnetlink_rcv_msg+0x155/0x3e0
> [ 1011.350562]      netlink_rcv_skb+0x61/0x110
> [ 1011.350562]      rtnetlink_rcv+0x19/0x20
> [ 1011.350562]      netlink_unicast+0x23f/0x380
> [ 1011.350562]      netlink_sendmsg+0x1fc/0x430
> [ 1011.350562]      ____sys_sendmsg+0x2ef/0x320
> [ 1011.350562]      ___sys_sendmsg+0x86/0xd0
> [ 1011.350562]