:p
atchew
Login
The initial intension was to increase the limits, but some fixes were needed, then when looking at the code around, other fixes had to be added too. So now I ended up with 8 fixes (including one early exit while at it), followed by 3 patches increasing limits, 3 to validate the modifications, one small improvement in the selftests, and some renaming. The patches with a Fixes tag are for -net, the rest for net-next. The last patch contains multiple renaming: please tell me what you think before I start splitting in smaller commits. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- Matthieu Baerts (NGI0) (16): mptcp: pm: resched blocked ADD_ADDR quicker mptcp: pm: allow retransmitting ADD_ADDR with ID 0 mptcp: pm: retrans ADD_ADDR: free sk if last mptcp: pm: retrans ADD_ADDR: always decrease sk refcount mptcp: pm: retrans ADD_ADDR: skip inactive subflows mptcp: pm: retrans ADD_ADDR: return early if no retrans mptcp: pm: prio: skip closed subflows selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl mptcp: pm: in-kernel: explicitly limit batches to array size mptcp: pm: in-kernel: increase all limits to 64 mptcp: pm: in-kernel: increase endpoints limit selftests: mptcp: join: allow changing ifaces nr per test selftests: mptcp: join: validate 8x8 subflows selftests: mptcp: pm: validate new limits selftests: mptcp: pm: use simpler send/recv forms mptcp: pm: clearer ADD_ADDR related helpers names net/mptcp/options.c | 2 +- net/mptcp/pm.c | 165 +++++++++++++----------- net/mptcp/pm_kernel.c | 55 ++++---- net/mptcp/pm_userspace.c | 4 +- net/mptcp/protocol.h | 17 +-- net/mptcp/subflow.c | 4 +- tools/testing/selftests/net/mptcp/mptcp_join.sh | 33 ++++- tools/testing/selftests/net/mptcp/pm_netlink.sh | 62 +++++---- tools/testing/selftests/net/mptcp/pm_nl_ctl.c | 7 +- 9 files changed, 205 insertions(+), 144 deletions(-) --- base-commit: 7fe1c6148c0a11ed9a1fcfaf70cf380cf890c1bf change-id: 20260403-mptcp-inc-limits-ce9811024066 Best regards, -- Matthieu Baerts (NGI0) <matttbe@kernel.org>
When an ADD_ADDR needs to be retransmitted and another one has already been prepared -- e.g. multiple ADD_ADDRs have been sent in a row and need to be retransmitted later -- this additional retransmission will need to wait. In this case, the timer was reset to TCP_RTO_MAX / 8, which is ~15 seconds. This delay is unnecessary long: it should just be rescheduled at the next opportunity, e.g. after the retransmission timeout. Without this modification, some issues can be seen from time to time in the selftests when multiple ADD_ADDRs are sent, and the host takes time to process them, e.g. the "signal addresses, ADD_ADDR timeout" MPTCP Join selftest, especially with a debug kernel config. Note that on older kernels, 'timeout' is not available. It should be enough to replace it by one second (HZ). Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) if (!entry->addr.id) return; - if (mptcp_pm_should_add_signal_addr(msk)) { - sk_reset_timer(sk, timer, jiffies + TCP_RTO_MAX / 8); - goto out; - } - timeout = mptcp_adjust_add_addr_timeout(msk); if (!timeout) goto out; + if (mptcp_pm_should_add_signal_addr(msk)) { + sk_reset_timer(sk, timer, jiffies + timeout); + goto out; + } + spin_lock_bh(&msk->pm.lock); if (!mptcp_pm_should_add_signal_addr(msk)) { -- 2.53.0
ADD_ADDR can be sent for the ID 0, which corresponds to the local address and port linked to the initial subflow. Indeed, this address could be removed, and re-added later on, e.g. what is done in the "delete re-add signal" MPTCP Join selftests. So no reason to ignore it. Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) if (inet_sk_state_load(sk) == TCP_CLOSE) return; - if (!entry->addr.id) - return; - timeout = mptcp_adjust_add_addr_timeout(msk); if (!timeout) goto out; -- 2.53.0
When an ADD_ADDR is retransmitted, the sk is held in sk_reset_timer(), and released at the end. If at that moment, it was the last reference being held, the sk would not be freed. sock_put() should then be called instead of __sock_put(). Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) mptcp_pm_subflow_established(msk); out: - __sock_put(sk); + sock_put(sk); } struct mptcp_pm_add_entry * -- 2.53.0
When an ADD_ADDR is retransmitted, the sk is held in sk_reset_timer(). It should then be released in all cases at the end. Some (unlikely) checks were returning directly instead of calling sock_put() to decrease the refcount. Jump to the 'out' label to fix this potential leak. While at it, regroup the conditions, and explicitly mark them as "unlikely". Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) pr_debug("msk=%p\n", msk); - if (!msk) - return; - - if (inet_sk_state_load(sk) == TCP_CLOSE) - return; + if (unlikely(!msk || inet_sk_state_load(sk) == TCP_CLOSE)) + goto out; timeout = mptcp_adjust_add_addr_timeout(msk); if (!timeout) -- 2.53.0
When looking at the maximum RTO amongst the subflows, inactive subflows were taken into account: that includes stale ones, and the initial one if it has been already been closed. Unusable subflows are now simply skipped. Stale ones are used as an alternative: if there are only stale ones, to take their maximum RTO and avoid to eventually fallback to net.mptcp.add_addr_timeout, which is set to 2 minutes by default. Fixes: 30549eebc4d8 ("mptcp: make ADD_ADDR retransmission timeout adaptive") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk) const struct net *net = sock_net((struct sock *)msk); unsigned int rto = mptcp_get_add_addr_timeout(net); struct mptcp_subflow_context *subflow; - unsigned int max = 0; + unsigned int max = 0, max_stale = 0; mptcp_for_each_subflow(msk, subflow) { struct sock *ssk = mptcp_subflow_tcp_sock(subflow); struct inet_connection_sock *icsk = inet_csk(ssk); - if (icsk->icsk_rto > max) + if (!__mptcp_subflow_active(subflow)) + continue; + + if (unlikely(subflow->stale)) { + if (icsk->icsk_rto > max_stale) + max_stale = icsk->icsk_rto; + } else { max = icsk->icsk_rto; + } } - if (max && max < rto) - rto = max; + if (max) + return max < rto ? max : rto; - return rto; + return max_stale && max_stale < rto ? max_stale : rto; } static void mptcp_pm_add_timer(struct timer_list *timer) -- 2.53.0
No need to iterate over all subflows if there is no retransmission needed. Exit early in this case then. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk) struct mptcp_subflow_context *subflow; unsigned int max = 0, max_stale = 0; + if (!rto) + return 0; + mptcp_for_each_subflow(msk, subflow) { struct sock *ssk = mptcp_subflow_tcp_sock(subflow); struct inet_connection_sock *icsk = inet_csk(ssk); -- 2.53.0
When sending an MP_PRIO, closed subflows needs to be skipped. This fixes the case where the initial subflow got closed, re-opened later, then an MP_PRIO is needed for the same local address. Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support") Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk, struct sock *ssk = mptcp_subflow_tcp_sock(subflow); struct mptcp_addr_info local, remote; + if (!__mptcp_subflow_active(subflow)) + continue; + mptcp_local_address((struct sock_common *)ssk, &local); if (!mptcp_addresses_equal(&local, addr, addr->port)) continue; -- 2.53.0
When pm_netlink.sh is executed with '-i', 'ip mptcp' is used instead of 'pm_nl_ctl'. IPRoute2 doesn't support the 'unknown' flag, which has only been added to 'pm_nl_ctl' for this specific check: to ensure that the kernel ignores such unsupported flag. No reason to add this flag to 'ip mptcp'. Then, this check should be skipped when 'ip mptcp' is used. Fixes: 29f4801e9c8d ("selftests: mptcp: pm: ensure unknown flags are ignored") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/pm_netlink.sh | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/pm_netlink.sh b/tools/testing/selftests/net/mptcp/pm_netlink.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/pm_netlink.sh +++ b/tools/testing/selftests/net/mptcp/pm_netlink.sh @@ -XXX,XX +XXX,XX @@ check "show_endpoints" \ flush_endpoint check "show_endpoints" "" "flush addrs" -add_endpoint 10.0.1.1 flags unknown -check "show_endpoints" "$(format_endpoints "1,10.0.1.1")" "ignore unknown flags" -flush_endpoint +# "unknown" flag is only supported by pm_nl_ctl +if ! mptcp_lib_is_ip_mptcp; then + add_endpoint 10.0.1.1 flags unknown + check "show_endpoints" "$(format_endpoints "1,10.0.1.1")" \ + "ignore unknown flags" + flush_endpoint +fi set_limits 9 1 2>/dev/null check "get_limits" "${default_limits}" "rcv addrs above hard limit" -- 2.53.0
The in-kernel PM can create subflows in reply to ADD_ADDR by batch of maximum 8 subflows for the moment. Same when adding new "subflow" endpoints with the fullmesh flag. This limit is linked to the arrays used during these steps. There was no explicit limit to the arrays size (8), because the limit of extra subflows is the same (8). It seems safer to use an explicit limit, but also these two sizes are going to be different in the next commit. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ fill_remote_addr(struct mptcp_sock *msk, struct mptcp_addr_info *local, static unsigned int fill_remote_addresses_fullmesh(struct mptcp_sock *msk, struct mptcp_addr_info *local, - struct mptcp_addr_info *addrs) + struct mptcp_addr_info *addrs, + int addrs_size) { u8 limit_extra_subflows = mptcp_pm_get_limit_extra_subflows(msk); bool deny_id0 = READ_ONCE(msk->pm.remote_deny_join_id0); @@ -XXX,XX +XXX,XX @@ fill_remote_addresses_fullmesh(struct mptcp_sock *msk, msk->pm.extra_subflows++; i++; - if (msk->pm.extra_subflows >= limit_extra_subflows) + if (msk->pm.extra_subflows >= limit_extra_subflows || + i == addrs_size) break; } @@ -XXX,XX +XXX,XX @@ fill_remote_addresses_fullmesh(struct mptcp_sock *msk, */ static unsigned int fill_remote_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *local, - bool fullmesh, struct mptcp_addr_info *addrs) + bool fullmesh, struct mptcp_addr_info *addrs, + int addrs_size) { /* Non-fullmesh: fill in the single entry corresponding to the primary * MPC subflow remote address, and return 1, corresponding to 1 entry. @@ -XXX,XX +XXX,XX @@ fill_remote_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *local, return fill_remote_addr(msk, local, addrs); /* Fullmesh endpoint: fill all possible remote addresses */ - return fill_remote_addresses_fullmesh(msk, local, addrs); + return fill_remote_addresses_fullmesh(msk, local, addrs, addrs_size); } static struct mptcp_pm_addr_entry * @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk) else /* local_addr_used is not decr for ID 0 */ msk->pm.local_addr_used++; - nr = fill_remote_addresses_vec(msk, &local.addr, fullmesh, addrs); + nr = fill_remote_addresses_vec(msk, &local.addr, fullmesh, + addrs, ARRAY_SIZE(addrs)); if (nr == 0) continue; @@ -XXX,XX +XXX,XX @@ static unsigned int fill_local_addresses_vec_fullmesh(struct mptcp_sock *msk, struct mptcp_addr_info *remote, struct mptcp_pm_local *locals, + int locals_size, bool c_flag_case) { u8 limit_extra_subflows = mptcp_pm_get_limit_extra_subflows(msk); @@ -XXX,XX +XXX,XX @@ fill_local_addresses_vec_fullmesh(struct mptcp_sock *msk, msk->pm.extra_subflows++; i++; - if (msk->pm.extra_subflows >= limit_extra_subflows) + if (msk->pm.extra_subflows >= limit_extra_subflows || + i == locals_size) break; } rcu_read_unlock(); @@ -XXX,XX +XXX,XX @@ fill_local_laminar_endp(struct mptcp_sock *msk, struct mptcp_addr_info *remote, static unsigned int fill_local_addresses_vec_c_flag(struct mptcp_sock *msk, struct mptcp_addr_info *remote, - struct mptcp_pm_local *locals) + struct mptcp_pm_local *locals, + int locals_size) { u8 limit_extra_subflows = mptcp_pm_get_limit_extra_subflows(msk); struct pm_nl_pernet *pernet = pm_nl_get_pernet_from_msk(msk); @@ -XXX,XX +XXX,XX @@ fill_local_addresses_vec_c_flag(struct mptcp_sock *msk, msk->pm.extra_subflows++; i++; - if (msk->pm.extra_subflows >= limit_extra_subflows) + if (msk->pm.extra_subflows >= limit_extra_subflows || + i == locals_size) break; } @@ -XXX,XX +XXX,XX @@ fill_local_address_any(struct mptcp_sock *msk, struct mptcp_addr_info *remote, */ static unsigned int fill_local_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *remote, - struct mptcp_pm_local *locals) + struct mptcp_pm_local *locals, int locals_size) { bool c_flag_case = remote->id && mptcp_pm_add_addr_c_flag_case(msk); /* If there is at least one MPTCP endpoint with a fullmesh flag */ if (mptcp_pm_get_endp_fullmesh_max(msk)) return fill_local_addresses_vec_fullmesh(msk, remote, locals, + locals_size, c_flag_case); /* If there is at least one MPTCP endpoint with a laminar flag */ @@ -XXX,XX +XXX,XX @@ fill_local_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *remote, * limits are used -- accepting no ADD_ADDR -- and use subflow endpoints */ if (c_flag_case) - return fill_local_addresses_vec_c_flag(msk, remote, locals); + return fill_local_addresses_vec_c_flag(msk, remote, locals, + locals_size); /* No special case: fill in the single 'IPADDRANY' local address */ return fill_local_address_any(msk, remote, &locals[0]); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_nl_add_addr_received(struct mptcp_sock *msk) /* connect to the specified remote address, using whatever * local address the routing configuration will pick. */ - nr = fill_local_addresses_vec(msk, &remote, locals); + nr = fill_local_addresses_vec(msk, &remote, locals, ARRAY_SIZE(locals)); if (nr == 0) return; -- 2.53.0
This means switching the maximum from 8 to 64 for the number of subflows and accepted ADD_ADDR. The previous limit of 8 subflows makes sense in most cases. Using more subflows will very likely *not* improve the situation, and could even decrease the performances. But there are no technical limitations nor performance impact to raise this limit, so let's do it: this will allow people with very specific use-cases, and researchers to easily create more subflows, and measure the performance impact by themselves. The theoretical limit is 255 -- the ID is written in a u8 on the wire -- but 64 is more than enough. With so many subflows, it will be costly to iterate over all of them when operations are done in bottom half. Note that the in-kernel PM will continue to create subflows in reply to ADD_ADDR by batch of maximum 8 subflows. Same when adding new "subflow" endpoints with the fullmesh flag. Increasing those batch limits would have a memory impact. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/434 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ struct pm_nl_pernet { }; #define MPTCP_PM_ADDR_MAX 8 +#define MPTCP_PM_SUBFLOWS_MAX 64 static struct pm_nl_pernet *pm_nl_get_pernet(const struct net *net) { @@ -XXX,XX +XXX,XX @@ static int parse_limit(struct genl_info *info, int id, unsigned int *limit) return 0; *limit = nla_get_u32(attr); - if (*limit > MPTCP_PM_ADDR_MAX) { + if (*limit > MPTCP_PM_SUBFLOWS_MAX) { NL_SET_ERR_MSG_ATTR_FMT(info->extack, attr, "limit greater than maximum (%u)", - MPTCP_PM_ADDR_MAX); + MPTCP_PM_SUBFLOWS_MAX); return -EINVAL; } return 0; -- 2.53.0
The endpoints are managed in a list which was limited to 8 entries. This limit can be too small in some cases: by having the same limit as the number of subflows, it might not allow creating all expected subflows when having a mix of v4 and v6 addresses that can all use MPTCP on v4/v6 only networks. While increasing the limit above the new subflows one, why not using the technical limit: 254. Indeed, the endpoint will each have an ID that will be used on the wire, limited to u8, and the ID 0 is reserved to the initial subflow. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ static int mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet, */ if (pernet->next_id == MPTCP_PM_MAX_ADDR_ID) pernet->next_id = 1; - if (pernet->endpoints >= MPTCP_PM_ADDR_MAX) { + if (pernet->endpoints == MPTCP_PM_MAX_ADDR_ID) { ret = -ERANGE; goto out; } -- 2.53.0
By default, 4 network interfaces are created per subtest in a dedicated net namespace. Each netns has a dedicated pair of v4 and v6 addresses. Future tests will need more. Simply always creating more network interfaces per test will increase the execution time for all other tests, for no other benefits. So now it is possible to change this number only when needed, by setting ifaces_nr when calling 'reset', e.g. ifaces_nr=8 reset "Subtest title" Note that it might also be interesting to decrease the default value to 2 to reduce the setup time, especially when a debug kernel config is being used. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/mptcp_join.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -XXX,XX +XXX,XX @@ unset fastclose unset fullmesh unset speed unset bind_addr +unset ifaces_nr unset join_syn_rej unset join_csum_ns1 unset join_csum_ns2 @@ -XXX,XX +XXX,XX @@ init_partial() # ns1eth4 ns2eth4 local i - for i in $(seq 1 4); do + for i in $(seq 1 "${ifaces_nr:-4}"); do ip link add ns1eth$i netns "$ns1" type veth peer name ns2eth$i netns "$ns2" ip -net "$ns1" addr add 10.0.$i.1/24 dev ns1eth$i ip -net "$ns1" addr add dead:beef:$i::1/64 dev ns1eth$i nodad @@ -XXX,XX +XXX,XX @@ init_partial() init_shapers() { local i - for i in $(seq 1 4); do + for i in $(seq 1 "${ifaces_nr:-4}"); do tc -n $ns1 qdisc add dev ns1eth$i root netem rate 20mbit delay 1ms tc -n $ns2 qdisc add dev ns2eth$i root netem rate 20mbit delay 1ms done -- 2.53.0
The limits have been recently increased, it is required to validate that having 64 subflows is allowed. Here, both the client and the server have 8 network interfaces. The server has 8 endpoints marked as 'signal' to announce all its v4 addresses. The client also has 8 endpoints, but marked as 'subflow' and 'fullmesh' in order to create 8 subflows to each address announced by the server. This means 63 additional subflows will be created after the initial one. If it is not possible to increase the limits to 64, it means an older kernel version is being used, and the test is skipped. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/mptcp_join.sh | 28 +++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -XXX,XX +XXX,XX @@ reset_with_tcp_filter() fi } +# For kernel supporting limits above 8 +# $1: title ; $2,4: addrs limit ns1,2 ; $3,5: subflows limit ns1,2 +reset_with_high_limits() +{ + reset "${1}" || return 1 + + if ! pm_nl_set_limits "${ns1}" "${2}" "${3}" 2>/dev/null || + ! pm_nl_set_limits "${ns2}" "${4}" "${5}" 2>/dev/null; then + mark_as_skipped "unable to set the limits to ${*:2}" + return 1 + fi +} + # $1: err msg fail_test() { @@ -XXX,XX +XXX,XX @@ fullmesh_tests() chk_prio_nr 0 1 1 0 chk_rm_nr 0 1 fi + + # fullmesh in 8x8 to create 63 additional subflows + if ifaces_nr=8 reset_with_high_limits "fullmesh 8x8" 64 64 64 64; then + # higher chance to loose ADD_ADDR: allow retransmissions + ip netns exec $ns1 sysctl -q net.mptcp.add_addr_timeout=1 + local i + for i in $(seq 1 8); do + pm_nl_add_endpoint $ns2 10.0.$i.2 flags subflow,fullmesh + pm_nl_add_endpoint $ns1 10.0.$i.1 flags signal + done + speed=slow \ + run_tests $ns1 $ns2 10.0.1.1 + chk_join_nr 63 63 63 + fi + } fastclose_tests() -- 2.53.0
These limits have been recently updated, from 8 to: - 64 for the subflows and accepted add_addr - 255 for the MPTCP endpoints These modifications validate the new limits, but are also compatible with the previous ones, to be able to continue to validate stable kernel using the last version of the selftests. That's why new variables are now used instead of hard-coded values. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/pm_netlink.sh | 52 +++++++++++++++---------- 1 file changed, 31 insertions(+), 21 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/pm_netlink.sh b/tools/testing/selftests/net/mptcp/pm_netlink.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/pm_netlink.sh +++ b/tools/testing/selftests/net/mptcp/pm_netlink.sh @@ -XXX,XX +XXX,XX @@ get_limits() { fi } +get_limits_nb() { + if mptcp_lib_is_ip_mptcp; then + ip -n "${ns1}" mptcp limits | awk '{ print $2" "$4 }' + else + ip netns exec "${ns1}" ./pm_nl_ctl limits | \ + awk '{ printf "%s ", $2 }' + fi +} + format_endpoints() { mptcp_lib_pm_nl_format_endpoints "${@}" } @@ -XXX,XX +XXX,XX @@ check "get_endpoint 2" "" "simple del addr" check "show_endpoints" \ "$(format_endpoints "1,10.0.1.1" \ "3,10.0.1.3,signal backup")" "dump addrs after del" +add_endpoint 10.0.1.2 id 2 add_endpoint 10.0.1.3 2>/dev/null check "get_endpoint 4" "" "duplicate addr" @@ -XXX,XX +XXX,XX @@ check "get_endpoint 4" "" "duplicate addr" add_endpoint 10.0.1.4 flags signal check "get_endpoint 4" "$(format_endpoints "4,10.0.1.4,signal")" "id addr increment" -for i in $(seq 5 9); do - add_endpoint "10.0.1.${i}" flags signal >/dev/null 2>&1 -done -check "get_endpoint 9" "$(format_endpoints "9,10.0.1.9,signal")" "hard addr limit" -check "get_endpoint 10" "" "above hard addr limit" +read -r -a default_limits_nb <<< "$(get_limits_nb)" +# limits have been increased: from 8 to 64 for subflows/add_addr & 255 for endp +if mptcp_lib_expect_all_features || set_limits 9 9 2>/dev/null; then + max_endp=255 + max_limits=64 +else + max_endp=8 + max_limits=8 +fi +set_limits "${default_limits_nb[@]}" -del_endpoint 9 -for i in $(seq 10 255); do - add_endpoint 10.0.0.9 id "${i}" - del_endpoint "${i}" +for i in $(seq 5 ${max_endp}); do + add_endpoint "10.0.0.${i}" id "${i}" done -check "show_endpoints" \ - "$(format_endpoints "1,10.0.1.1" \ - "3,10.0.1.3,signal backup" \ - "4,10.0.1.4,signal" \ - "5,10.0.1.5,signal" \ - "6,10.0.1.6,signal" \ - "7,10.0.1.7,signal" \ - "8,10.0.1.8,signal")" "id limit" +check "get_endpoint ${max_endp}" \ + "$(format_endpoints "${max_endp},10.0.0.${max_endp}")" "id limit" + +check "add_endpoint '10.0.1.1' &>/dev/null && echo 'no error'" "" \ + "above hard addr limit" flush_endpoint check "show_endpoints" "" "flush addrs" @@ -XXX,XX +XXX,XX @@ if ! mptcp_lib_is_ip_mptcp; then flush_endpoint fi -set_limits 9 1 2>/dev/null +set_limits $((max_limits + 1)) 1 2>/dev/null check "get_limits" "${default_limits}" "rcv addrs above hard limit" -set_limits 1 9 2>/dev/null +set_limits 1 $((max_limits + 1)) 2>/dev/null check "get_limits" "${default_limits}" "subflows above hard limit" -set_limits 8 8 +set_limits ${max_limits} ${max_limits} flush_endpoint ## to make sure it doesn't affect the limits -check "get_limits" "$(format_limits 8 8)" "set limits" +check "get_limits" "$(format_limits ${max_limits} ${max_limits})" "set limits" flush_endpoint add_endpoint 10.0.1.1 -- 2.53.0
Instead of sendto() and recvfrom() which the NL address that was already provided before. Just simpler and easier to read. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/pm_nl_ctl.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/pm_nl_ctl.c b/tools/testing/selftests/net/mptcp/pm_nl_ctl.c index XXXXXXX..XXXXXXX 100644 --- a/tools/testing/selftests/net/mptcp/pm_nl_ctl.c +++ b/tools/testing/selftests/net/mptcp/pm_nl_ctl.c @@ -XXX,XX +XXX,XX @@ static int capture_events(int fd, int event_group) /* do a netlink command and, if max > 0, fetch the reply ; nh's size >1024B */ static int do_nl_req(int fd, struct nlmsghdr *nh, int len, int max) { - struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK }; - socklen_t addr_len; void *data = nh; int rem, ret; int err = 0; @@ -XXX,XX +XXX,XX @@ static int do_nl_req(int fd, struct nlmsghdr *nh, int len, int max) } nh->nlmsg_len = len; - ret = sendto(fd, data, len, 0, (void *)&nladdr, sizeof(nladdr)); + ret = send(fd, data, len, 0); if (ret != len) error(1, errno, "send netlink: %uB != %uB\n", ret, len); - addr_len = sizeof(nladdr); - rem = ret = recvfrom(fd, data, max, 0, (void *)&nladdr, &addr_len); + rem = ret = recv(fd, data, max, 0); if (ret < 0) error(1, errno, "recv netlink: %uB\n", ret); -- 2.53.0
Here is a suggestion, and if it is OK, I will split this in multiple commits: it is not the first time the 'add' and 'anno' names to describe ADD_ADDR related functions are confusing. Eric already pointed that in [1]. I started by renaming only the internal helper names, then while at it, I tried to uniform that. WDYT? Link: https://lore.kernel.org/20251117100745.1913963-1-edumazet@google.com [1] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/options.c | 2 +- net/mptcp/pm.c | 122 ++++++++++++++++++++++++----------------------- net/mptcp/pm_kernel.c | 16 +++---- net/mptcp/pm_userspace.c | 4 +- net/mptcp/protocol.h | 17 +++---- net/mptcp/subflow.c | 4 +- 6 files changed, 84 insertions(+), 81 deletions(-) diff --git a/net/mptcp/options.c b/net/mptcp/options.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -XXX,XX +XXX,XX @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb) MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_ADDADDR); } else { mptcp_pm_add_addr_echoed(msk, &mp_opt.addr); - mptcp_pm_del_add_timer(msk, &mp_opt.addr, true); + mptcp_pm_del_add_addr_timer(msk, &mp_opt.addr, true); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_ECHOADD); } diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ #define ADD_ADDR_RETRANS_MAX 3 -struct mptcp_pm_add_entry { +struct mptcp_pm_add_addr { struct list_head list; struct mptcp_addr_info addr; u8 retrans_times; - struct timer_list add_timer; + struct timer_list timer; struct mptcp_sock *sock; struct rcu_head rcu; }; @@ -XXX,XX +XXX,XX @@ bool mptcp_lookup_subflow_by_saddr(const struct list_head *list, return false; } -static struct mptcp_pm_add_entry * -mptcp_lookup_anno_list_by_saddr(const struct mptcp_sock *msk, - const struct mptcp_addr_info *addr) +static struct mptcp_pm_add_addr * +mptcp_lookup_add_addr_by_saddr(const struct mptcp_sock *msk, + const struct mptcp_addr_info *addr) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *add_addr; lockdep_assert_held(&msk->pm.lock); - list_for_each_entry(entry, &msk->pm.anno_list, list) { - if (mptcp_addresses_equal(&entry->addr, addr, true)) - return entry; + list_for_each_entry(add_addr, &msk->pm.anno_list, list) { + if (mptcp_addresses_equal(&add_addr->addr, addr, true)) + return add_addr; } return NULL; } -bool mptcp_remove_anno_list_by_saddr(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr) +bool mptcp_remove_add_addr_by_saddr(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *add_addr; bool ret; - entry = mptcp_pm_del_add_timer(msk, addr, false); - ret = entry; - kfree_rcu(entry, rcu); + add_addr = mptcp_pm_del_add_addr_timer(msk, addr, false); + ret = add_addr; + kfree_rcu(add_addr, rcu); return ret; } -bool mptcp_pm_sport_in_anno_list(struct mptcp_sock *msk, const struct sock *sk) +bool mptcp_pm_sport_in_add_addr_list(struct mptcp_sock *msk, + const struct sock *sk) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *add_addr; struct mptcp_addr_info saddr; bool ret = false; mptcp_local_address((struct sock_common *)sk, &saddr); spin_lock_bh(&msk->pm.lock); - list_for_each_entry(entry, &msk->pm.anno_list, list) { - if (mptcp_addresses_equal(&entry->addr, &saddr, true)) { + list_for_each_entry(add_addr, &msk->pm.anno_list, list) { + if (mptcp_addresses_equal(&add_addr->addr, &saddr, true)) { ret = true; goto out; } @@ -XXX,XX +XXX,XX @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk) return max_stale && max_stale < rto ? max_stale : rto; } -static void mptcp_pm_add_timer(struct timer_list *timer) +static void mptcp_pm_add_addr_timer(struct timer_list *timer) { - struct mptcp_pm_add_entry *entry = timer_container_of(entry, timer, - add_timer); - struct mptcp_sock *msk = entry->sock; + struct mptcp_pm_add_addr *add_addr = timer_container_of(add_addr, timer, + timer); + struct mptcp_sock *msk = add_addr->sock; struct sock *sk = (struct sock *)msk; unsigned int timeout; @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) spin_lock_bh(&msk->pm.lock); if (!mptcp_pm_should_add_signal_addr(msk)) { - pr_debug("retransmit ADD_ADDR id=%d\n", entry->addr.id); - mptcp_pm_announce_addr(msk, &entry->addr, false); + pr_debug("retransmit ADD_ADDR id=%d\n", add_addr->addr.id); + mptcp_pm_announce_addr(msk, &add_addr->addr, false); mptcp_pm_add_addr_send_ack(msk); - entry->retrans_times++; + add_addr->retrans_times++; } - if (entry->retrans_times < ADD_ADDR_RETRANS_MAX) + if (add_addr->retrans_times < ADD_ADDR_RETRANS_MAX) sk_reset_timer(sk, timer, - jiffies + (timeout << entry->retrans_times)); + jiffies + (timeout << add_addr->retrans_times)); spin_unlock_bh(&msk->pm.lock); - if (entry->retrans_times == ADD_ADDR_RETRANS_MAX) + if (add_addr->retrans_times == ADD_ADDR_RETRANS_MAX) mptcp_pm_subflow_established(msk); out: sock_put(sk); } -struct mptcp_pm_add_entry * -mptcp_pm_del_add_timer(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr, bool check_id) +struct mptcp_pm_add_addr * +mptcp_pm_del_add_addr_timer(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr, bool check_id) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *add_addr; struct sock *sk = (struct sock *)msk; bool stop_timer = false; rcu_read_lock(); spin_lock_bh(&msk->pm.lock); - entry = mptcp_lookup_anno_list_by_saddr(msk, addr); - if (entry && (!check_id || entry->addr.id == addr->id)) { - entry->retrans_times = ADD_ADDR_RETRANS_MAX; + add_addr = mptcp_lookup_add_addr_by_saddr(msk, addr); + if (add_addr && (!check_id || add_addr->addr.id == addr->id)) { + add_addr->retrans_times = ADD_ADDR_RETRANS_MAX; stop_timer = true; } - if (!check_id && entry) - list_del(&entry->list); + if (!check_id && add_addr) + list_del(&add_addr->list); spin_unlock_bh(&msk->pm.lock); /* Note: entry might have been removed by another thread. * We hold rcu_read_lock() to ensure it is not freed under us. */ if (stop_timer) - sk_stop_timer_sync(sk, &entry->add_timer); + sk_stop_timer_sync(sk, &add_addr->timer); rcu_read_unlock(); - return entry; + return add_addr; } -bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr) +bool mptcp_pm_alloc_add_addr_list(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr) { - struct mptcp_pm_add_entry *add_entry = NULL; + struct mptcp_pm_add_addr *add_addr = NULL; struct sock *sk = (struct sock *)msk; unsigned int timeout; lockdep_assert_held(&msk->pm.lock); - add_entry = mptcp_lookup_anno_list_by_saddr(msk, addr); + add_addr = mptcp_lookup_add_addr_by_saddr(msk, addr); - if (add_entry) { + if (add_addr) { if (WARN_ON_ONCE(mptcp_pm_is_kernel(msk))) return false; goto reset_timer; } - add_entry = kmalloc_obj(*add_entry, GFP_ATOMIC); - if (!add_entry) + add_addr = kmalloc_obj(*add_addr, GFP_ATOMIC); + if (!add_addr) return false; - list_add(&add_entry->list, &msk->pm.anno_list); + list_add(&add_addr->list, &msk->pm.anno_list); - add_entry->addr = *addr; - add_entry->sock = msk; - add_entry->retrans_times = 0; + add_addr->addr = *addr; + add_addr->sock = msk; + add_addr->retrans_times = 0; - timer_setup(&add_entry->add_timer, mptcp_pm_add_timer, 0); + timer_setup(&add_addr->timer, mptcp_pm_add_addr_timer, 0); reset_timer: timeout = mptcp_adjust_add_addr_timeout(msk); if (timeout) - sk_reset_timer(sk, &add_entry->add_timer, jiffies + timeout); + sk_reset_timer(sk, &add_addr->timer, + jiffies + timeout); return true; } -static void mptcp_pm_free_anno_list(struct mptcp_sock *msk) +static void mptcp_pm_free_add_addr_list(struct mptcp_sock *msk) { - struct mptcp_pm_add_entry *entry, *tmp; + struct mptcp_pm_add_addr *add_addr, *tmp; struct sock *sk = (struct sock *)msk; LIST_HEAD(free_list); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_free_anno_list(struct mptcp_sock *msk) list_splice_init(&msk->pm.anno_list, &free_list); spin_unlock_bh(&msk->pm.lock); - list_for_each_entry_safe(entry, tmp, &free_list, list) { - sk_stop_timer_sync(sk, &entry->add_timer); - kfree_rcu(entry, rcu); + list_for_each_entry_safe(add_addr, tmp, &free_list, list) { + sk_stop_timer_sync(sk, &add_addr->timer); + kfree_rcu(add_addr, rcu); } } @@ -XXX,XX +XXX,XX @@ void mptcp_pm_add_addr_echoed(struct mptcp_sock *msk, spin_lock_bh(&pm->lock); - if (mptcp_lookup_anno_list_by_saddr(msk, addr) && READ_ONCE(pm->work_pending)) + if (mptcp_lookup_add_addr_by_saddr(msk, addr) && READ_ONCE(pm->work_pending)) mptcp_pm_schedule_work(msk, MPTCP_PM_SUBFLOW_ESTABLISHED); spin_unlock_bh(&pm->lock); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_ops_release(struct mptcp_sock *msk) void mptcp_pm_destroy(struct mptcp_sock *msk) { - mptcp_pm_free_anno_list(msk); + mptcp_pm_free_add_addr_list(msk); mptcp_pm_ops_release(msk); } diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk) /* If the alloc fails, we are on memory pressure, not worth * continuing, and trying to create subflows. */ - if (!mptcp_pm_alloc_anno_list(msk, &local.addr)) + if (!mptcp_pm_alloc_add_addr_list(msk, &local.addr)) return; __clear_bit(local.addr.id, msk->pm.id_avail_bitmap); @@ -XXX,XX +XXX,XX @@ int mptcp_pm_nl_add_addr_doit(struct sk_buff *skb, struct genl_info *info) return ret; } -static void mptcp_pm_remove_anno_addr(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr, - bool force) +static void mptcp_pm_remove_add_addr(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr, + bool force) { struct mptcp_rm_list list = { .nr = 0 }; bool announced; list.ids[list.nr++] = mptcp_endp_get_local_id(msk, addr); - announced = mptcp_remove_anno_list_by_saddr(msk, addr); + announced = mptcp_remove_add_addr_by_saddr(msk, addr); if (announced || force) { spin_lock_bh(&msk->pm.lock); if (announced) @@ -XXX,XX +XXX,XX @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net, lock_sock(sk); remove_subflow = mptcp_lookup_subflow_by_saddr(&msk->conn_list, addr); - mptcp_pm_remove_anno_addr(msk, addr, remove_subflow && - !(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT)); + mptcp_pm_remove_add_addr(msk, addr, remove_subflow && + !(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT)); list.ids[0] = mptcp_endp_get_local_id(msk, addr); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_flush_addrs_and_subflows(struct mptcp_sock *msk, slist.ids[slist.nr++] = mptcp_endp_get_local_id(msk, &entry->addr); if (alist.nr < MPTCP_RM_IDS_MAX && - mptcp_remove_anno_list_by_saddr(msk, &entry->addr)) + mptcp_remove_add_addr_by_saddr(msk, &entry->addr)) alist.ids[alist.nr++] = mptcp_endp_get_local_id(msk, &entry->addr); } diff --git a/net/mptcp/pm_userspace.c b/net/mptcp/pm_userspace.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_userspace.c +++ b/net/mptcp/pm_userspace.c @@ -XXX,XX +XXX,XX @@ int mptcp_pm_nl_announce_doit(struct sk_buff *skb, struct genl_info *info) lock_sock(sk); spin_lock_bh(&msk->pm.lock); - if (mptcp_pm_alloc_anno_list(msk, &addr_val.addr)) { + if (mptcp_pm_alloc_add_addr_list(msk, &addr_val.addr)) { msk->pm.add_addr_signaled++; mptcp_pm_announce_addr(msk, &addr_val.addr, false); mptcp_pm_addr_send_ack(msk); @@ -XXX,XX +XXX,XX @@ void mptcp_pm_remove_addr_entry(struct mptcp_sock *msk, int anno_nr = 0; /* only delete if either announced or matching a subflow */ - if (mptcp_remove_anno_list_by_saddr(msk, &entry->addr)) + if (mptcp_remove_add_addr_by_saddr(msk, &entry->addr)) anno_nr++; else if (!mptcp_lookup_subflow_by_saddr(&msk->conn_list, &entry->addr)) return; diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -XXX,XX +XXX,XX @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk, struct mptcp_addr_info *addr, struct mptcp_addr_info *rem, u8 bkup); -bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr); -bool mptcp_pm_sport_in_anno_list(struct mptcp_sock *msk, const struct sock *sk); -struct mptcp_pm_add_entry * -mptcp_pm_del_add_timer(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr, bool check_id); +bool mptcp_pm_alloc_add_addr_list(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr); +bool mptcp_pm_sport_in_add_addr_list(struct mptcp_sock *msk, + const struct sock *sk); +struct mptcp_pm_add_addr * +mptcp_pm_del_add_addr_timer(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr, bool check_id); bool mptcp_lookup_subflow_by_saddr(const struct list_head *list, const struct mptcp_addr_info *saddr); -bool mptcp_remove_anno_list_by_saddr(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr); +bool mptcp_remove_add_addr_by_saddr(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr); int mptcp_pm_nl_set_flags(struct mptcp_pm_addr_entry *local, struct genl_info *info); int mptcp_userspace_pm_set_flags(struct mptcp_pm_addr_entry *local, diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -XXX,XX +XXX,XX @@ static int subflow_check_req(struct request_sock *req, pr_debug("syn inet_sport=%d %d\n", ntohs(inet_sk(sk_listener)->inet_sport), ntohs(inet_sk((struct sock *)subflow_req->msk)->inet_sport)); - if (!mptcp_pm_sport_in_anno_list(subflow_req->msk, sk_listener)) { + if (!mptcp_pm_sport_in_add_addr_list(subflow_req->msk, sk_listener)) { SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_MISMATCHPORTSYNRX); subflow_add_reset_reason(skb, MPTCP_RST_EPROHIBIT); return -EPERM; @@ -XXX,XX +XXX,XX @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, pr_debug("ack inet_sport=%d %d\n", ntohs(inet_sk(sk)->inet_sport), ntohs(inet_sk((struct sock *)owner)->inet_sport)); - if (!mptcp_pm_sport_in_anno_list(owner, sk)) { + if (!mptcp_pm_sport_in_add_addr_list(owner, sk)) { SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_MISMATCHPORTACKRX); subflow_add_reset_reason(skb, MPTCP_RST_EPROHIBIT); goto dispose_child; -- 2.53.0
The initial intension was to increase the limits, but some fixes were needed, then when looking at the code around, other fixes had to be added too. So now I ended up with 11 fixes (including one early exit while at it), followed by 4 patches increasing limits, 3 to validate the modifications, one small improvement in the selftests, and some renaming. The patches with a Fixes tag are for -net, the rest for net-next. The last patch contains multiple renaming: please tell me what you think before I start splitting in smaller commits. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- Changes in v4: - patch 5: reset timer_done when the timer is reset. - patch 10: new: selftests: correctly catch error in 'check output'. - patch 16: add example with init_shapers to silence AI reviews - Patch 18: clearer way to check errors (without buggy '&&') - Link to v3: https://patch.msgid.link/20260413-mptcp-inc-limits-v3-0-dd36c9360432@kernel.org Changes in v3: - patch 1: new: retransmit ADD_ADDR for ID 0 with right ID. - patch 5: moved below + support calling sk_free(). - patch 13: new: allow flushing more than 8 endpoints. - patch 12: clarify single batch of 8 subflows as known limit. - patch 16: fix typo in comment. - Link to v2: https://patch.msgid.link/20260410-mptcp-inc-limits-v2-0-5402209f05d3@kernel.org Changes in v2: - patch 2: new: fix potential data-race. - re-order patches 1 to 5 and use shorter prefix (ADD_ADDR rtx). - patch 6: restore accidentally deleted icsk->icsk_rto > max check. - patch 16: fix already present checkpatch warning. - Link to v1: https://patch.msgid.link/20260409-mptcp-inc-limits-v1-0-0e45fa30d914@kernel.org --- Matthieu Baerts (NGI0) (20): mptcp: pm: kernel: correctly retransmit ADD_ADDR ID 0 mptcp: pm: ADD_ADDR rtx: fix potential data-race mptcp: pm: ADD_ADDR rtx: allow ID 0 mptcp: pm: ADD_ADDR rtx: always decrease sk refcount mptcp: pm: ADD_ADDR rtx: free sk if last mptcp: pm: ADD_ADDR rtx: resched blocked ADD_ADDR quicker mptcp: pm: ADD_ADDR rtx: skip inactive subflows mptcp: pm: retrans ADD_ADDR: return early if no retrans mptcp: pm: prio: skip closed subflows selftests: mptcp: check output: catch cmd errors selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl mptcp: pm: in-kernel: explicitly limit batches to array size mptcp: pm: in-kernel: increase all limits to 64 mptcp: pm: kernel: allow flushing more than 8 endpoints mptcp: pm: in-kernel: increase endpoints limit selftests: mptcp: join: allow changing ifaces nr per test selftests: mptcp: join: validate 8x8 subflows selftests: mptcp: pm: validate new limits selftests: mptcp: pm: use simpler send/recv forms mptcp: pm: clearer ADD_ADDR related helpers names net/mptcp/options.c | 2 +- net/mptcp/pm.c | 184 +++++++++++++----------- net/mptcp/pm_kernel.c | 106 +++++++++----- net/mptcp/pm_userspace.c | 6 +- net/mptcp/protocol.h | 19 +-- net/mptcp/subflow.c | 4 +- tools/testing/selftests/net/mptcp/mptcp_join.sh | 33 ++++- tools/testing/selftests/net/mptcp/mptcp_lib.sh | 8 +- tools/testing/selftests/net/mptcp/pm_netlink.sh | 74 ++++++---- tools/testing/selftests/net/mptcp/pm_nl_ctl.c | 8 +- 10 files changed, 272 insertions(+), 172 deletions(-) --- base-commit: b5a5b4d1643659280a5566b761679182ef7a8b25 change-id: 20260403-mptcp-inc-limits-ce9811024066 Best regards, -- Matthieu Baerts (NGI0) <matttbe@kernel.org>
When adding the ADD_ADDR to the list, the address including the IP, port and ID are copied. On the other hand, when the endpoint corresponds to the one from the initial subflow, the ID is set to 0, as specified by the MPTCP protocol. The issue is that the ID was reset after having copied the ID in the ADD_ADDR entry. So the retransmission was done, but using a different ID than the initial one. Fixes: 8b8ed1b429f8 ("mptcp: pm: reuse ID 0 after delete and re-add") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk) /* check first for announce */ if (msk->pm.add_addr_signaled < endp_signal_max) { + u8 endp_id; + /* due to racing events on both ends we can reach here while * previous add address is still running: if we invoke now * mptcp_pm_announce_addr(), that will fail and the @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk) if (!select_signal_address(pernet, msk, &local)) goto subflow; + /* Special case for ID0: set the correct ID */ + endp_id = local.addr.id; + if (endp_id == msk->mpc_endpoint_id) + local.addr.id = 0; + /* If the alloc fails, we are on memory pressure, not worth * continuing, and trying to create subflows. */ if (!mptcp_pm_alloc_anno_list(msk, &local.addr)) return; - __clear_bit(local.addr.id, msk->pm.id_avail_bitmap); + __clear_bit(endp_id, msk->pm.id_avail_bitmap); msk->pm.add_addr_signaled++; - /* Special case for ID0: set the correct ID */ - if (local.addr.id == msk->mpc_endpoint_id) - local.addr.id = 0; - mptcp_pm_announce_addr(msk, &local.addr, false); mptcp_pm_addr_send_ack(msk); -- 2.53.0
This mptcp_pm_add_timer() helper is executed as a timer callback in softirq context. To avoid any data races, the socket lock needs to be held with bh_lock_sock(). If the socket is in use, retry again soon after, similar to what is done with the keepalive timer. Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) if (!entry->addr.id) return; + bh_lock_sock(sk); + if (sock_owned_by_user(sk)) { + /* Try again later. */ + sk_reset_timer(sk, timer, jiffies + HZ / 20); + goto out; + } + if (mptcp_pm_should_add_signal_addr(msk)) { sk_reset_timer(sk, timer, jiffies + TCP_RTO_MAX / 8); goto out; @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) mptcp_pm_subflow_established(msk); out: + bh_unlock_sock(sk); __sock_put(sk); } -- 2.53.0
ADD_ADDR can be sent for the ID 0, which corresponds to the local address and port linked to the initial subflow. Indeed, this address could be removed, and re-added later on, e.g. what is done in the "delete re-add signal" MPTCP Join selftests. So no reason to ignore it. Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) if (inet_sk_state_load(sk) == TCP_CLOSE) return; - if (!entry->addr.id) - return; - bh_lock_sock(sk); if (sock_owned_by_user(sk)) { /* Try again later. */ -- 2.53.0
When an ADD_ADDR is retransmitted, the sk is held in sk_reset_timer(). It should then be released in all cases at the end. Some (unlikely) checks were returning directly instead of calling sock_put() to decrease the refcount. Jump to the 'out' label to fix this potential leak. While at it, drop the '!msk' check which cannot happen because it is never reset, and explicitly mark the remaining one as "unlikely". Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- v3: remove '!msk' check: cannot be true. --- net/mptcp/pm.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) pr_debug("msk=%p\n", msk); - if (!msk) - return; - - if (inet_sk_state_load(sk) == TCP_CLOSE) - return; + if (unlikely(inet_sk_state_load(sk) == TCP_CLOSE)) + goto exit; bh_lock_sock(sk); if (sock_owned_by_user(sk)) { @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) out: bh_unlock_sock(sk); +exit: __sock_put(sk); } -- 2.53.0
When an ADD_ADDR is retransmitted, the sk is held in sk_reset_timer(), and released at the end. If at that moment, it was the last reference being held, the sk would not be freed. sock_put() should then be called instead of __sock_put(). But that's not enough: if it is the last reference, sock_put() will call sk_free(), which will end up calling sk_stop_timer_sync() on the same timer, and waiting indefinitely to finish. So it is needed to mark that the timer is done, and no need to call sk_stop_timer_sync(). Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- v3: support calling sk_free() from the timer handler. Note: I'm not very happy with this patch, it looks too big. Did I miss a simpler way? v4: init timer_done after 'reset_timer' label to handle cases where the sysctl is changed in between. --- net/mptcp/pm.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ struct mptcp_pm_add_entry { struct list_head list; struct mptcp_addr_info addr; u8 retrans_times; + bool timer_done; struct timer_list add_timer; struct mptcp_sock *sock; struct rcu_head rcu; @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) add_timer); struct mptcp_sock *msk = entry->sock; struct sock *sk = (struct sock *)msk; - unsigned int timeout; + unsigned int timeout = 0; pr_debug("msk=%p\n", msk); - if (unlikely(inet_sk_state_load(sk) == TCP_CLOSE)) - goto exit; - bh_lock_sock(sk); + if (unlikely(inet_sk_state_load(sk) == TCP_CLOSE)) + goto out; + if (sock_owned_by_user(sk)) { /* Try again later. */ - sk_reset_timer(sk, timer, jiffies + HZ / 20); + timeout = HZ / 20; goto out; } if (mptcp_pm_should_add_signal_addr(msk)) { - sk_reset_timer(sk, timer, jiffies + TCP_RTO_MAX / 8); + timeout = TCP_RTO_MAX / 8; goto out; } @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) } if (entry->retrans_times < ADD_ADDR_RETRANS_MAX) - sk_reset_timer(sk, timer, - jiffies + (timeout << entry->retrans_times)); + timeout <<= entry->retrans_times; + else + timeout = 0; spin_unlock_bh(&msk->pm.lock); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) mptcp_pm_subflow_established(msk); out: + if (timeout) + sk_reset_timer(sk, timer, jiffies + timeout); + else + /* if sock_put calls sk_free: avoid waiting for this timer */ + entry->timer_done = true; bh_unlock_sock(sk); -exit: - __sock_put(sk); + sock_put(sk); } struct mptcp_pm_add_entry * @@ -XXX,XX +XXX,XX @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk, timer_setup(&add_entry->add_timer, mptcp_pm_add_timer, 0); reset_timer: + add_entry->timer_done = false; timeout = mptcp_adjust_add_addr_timeout(msk); if (timeout) sk_reset_timer(sk, &add_entry->add_timer, jiffies + timeout); + else + add_entry->timer_done = true; return true; } @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_free_anno_list(struct mptcp_sock *msk) spin_unlock_bh(&msk->pm.lock); list_for_each_entry_safe(entry, tmp, &free_list, list) { - sk_stop_timer_sync(sk, &entry->add_timer); + if (!entry->timer_done) + sk_stop_timer_sync(sk, &entry->add_timer); kfree_rcu(entry, rcu); } } -- 2.53.0
When an ADD_ADDR needs to be retransmitted and another one has already been prepared -- e.g. multiple ADD_ADDRs have been sent in a row and need to be retransmitted later -- this additional retransmission will need to wait. In this case, the timer was reset to TCP_RTO_MAX / 8, which is ~15 seconds. This delay is unnecessary long: it should just be rescheduled at the next opportunity, e.g. after the retransmission timeout. Without this modification, some issues can be seen from time to time in the selftests when multiple ADD_ADDRs are sent, and the host takes time to process them, e.g. the "signal addresses, ADD_ADDR timeout" MPTCP Join selftest, especially with a debug kernel config. Note that on older kernels, 'timeout' is not available. It should be enough to replace it by one second (HZ). Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) goto out; } - if (mptcp_pm_should_add_signal_addr(msk)) { - timeout = TCP_RTO_MAX / 8; - goto out; - } - timeout = mptcp_adjust_add_addr_timeout(msk); - if (!timeout) + if (!timeout || mptcp_pm_should_add_signal_addr(msk)) goto out; spin_lock_bh(&msk->pm.lock); -- 2.53.0
When looking at the maximum RTO amongst the subflows, inactive subflows were taken into account: that includes stale ones, and the initial one if it has been already been closed. Unusable subflows are now simply skipped. Stale ones are used as an alternative: if there are only stale ones, to take their maximum RTO and avoid to eventually fallback to net.mptcp.add_addr_timeout, which is set to 2 minutes by default. Fixes: 30549eebc4d8 ("mptcp: make ADD_ADDR retransmission timeout adaptive") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- v2: restore accidentally deleted icsk->icsk_rto > max check --- net/mptcp/pm.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk) const struct net *net = sock_net((struct sock *)msk); unsigned int rto = mptcp_get_add_addr_timeout(net); struct mptcp_subflow_context *subflow; - unsigned int max = 0; + unsigned int max = 0, max_stale = 0; mptcp_for_each_subflow(msk, subflow) { struct sock *ssk = mptcp_subflow_tcp_sock(subflow); struct inet_connection_sock *icsk = inet_csk(ssk); - if (icsk->icsk_rto > max) + if (!__mptcp_subflow_active(subflow)) + continue; + + if (unlikely(subflow->stale)) { + if (icsk->icsk_rto > max_stale) + max_stale = icsk->icsk_rto; + } else if (icsk->icsk_rto > max) { max = icsk->icsk_rto; + } } - if (max && max < rto) - rto = max; + if (max) + return max < rto ? max : rto; - return rto; + return max_stale && max_stale < rto ? max_stale : rto; } static void mptcp_pm_add_timer(struct timer_list *timer) -- 2.53.0
No need to iterate over all subflows if there is no retransmission needed. Exit early in this case then. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk) struct mptcp_subflow_context *subflow; unsigned int max = 0, max_stale = 0; + if (!rto) + return 0; + mptcp_for_each_subflow(msk, subflow) { struct sock *ssk = mptcp_subflow_tcp_sock(subflow); struct inet_connection_sock *icsk = inet_csk(ssk); -- 2.53.0
When sending an MP_PRIO, closed subflows needs to be skipped. This fixes the case where the initial subflow got closed, re-opened later, then an MP_PRIO is needed for the same local address. Note that explicit MP_PRIO cannot be sent during the 3WHS, so it is fine to use __mptcp_subflow_active(). Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support") Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk, struct sock *ssk = mptcp_subflow_tcp_sock(subflow); struct mptcp_addr_info local, remote; + if (!__mptcp_subflow_active(subflow)) + continue; + mptcp_local_address((struct sock_common *)ssk, &local); if (!mptcp_addresses_equal(&local, addr, addr->port)) continue; -- 2.53.0
Using '${?}' inside the if-statement to check the returned value from the command that was evaluated as part of the if-statement is not correct: here, '${?}' will be linked to the previous instruction, not the one that is expected here (${cmd}). Instead, simply mark the error, except if an error is expected. If that's the case, 1 can be passed as the 4th argument of this helper. Two checks from pm_netlink.sh expect an error. Note that we could expect a specific returned value, but the checks currently expecting an error can be used with 'ip mptcp' or 'pm_nl_ctl', and these two tools don't return the same error code. Fixes: 2d0c1d27ea4e ("selftests: mptcp: add mptcp_lib_check_output helper") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/mptcp_lib.sh | 8 ++++---- tools/testing/selftests/net/mptcp/pm_netlink.sh | 8 +++++--- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/mptcp_lib.sh b/tools/testing/selftests/net/mptcp/mptcp_lib.sh index XXXXXXX..XXXXXXX 100644 --- a/tools/testing/selftests/net/mptcp/mptcp_lib.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_lib.sh @@ -XXX,XX +XXX,XX @@ mptcp_lib_wait_local_port_listen() { wait_local_port_listen "${@}" "tcp" } +# $1: error file, $2: cmd, $3: expected msg, [$4: expected error] mptcp_lib_check_output() { local err="${1}" local cmd="${2}" local expected="${3}" + local exp_error="${4:-0}" local cmd_ret=0 local out - if ! out=$(${cmd} 2>"${err}"); then - cmd_ret=${?} - fi + out=$(${cmd} 2>"${err}") || cmd_ret=1 - if [ ${cmd_ret} -ne 0 ]; then + if [ "${cmd_ret}" != "${exp_error}" ]; then mptcp_lib_pr_fail "command execution '${cmd}' stderr" cat "${err}" return 2 diff --git a/tools/testing/selftests/net/mptcp/pm_netlink.sh b/tools/testing/selftests/net/mptcp/pm_netlink.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/pm_netlink.sh +++ b/tools/testing/selftests/net/mptcp/pm_netlink.sh @@ -XXX,XX +XXX,XX @@ check() local cmd="$1" local expected="$2" local msg="$3" + local no_err="$4" local rc=0 mptcp_lib_print_title "$msg" - mptcp_lib_check_output "${err}" "${cmd}" "${expected}" || rc=${?} + mptcp_lib_check_output "${err}" "${cmd}" "${expected}" "${no_err}" || + rc=${?} if [ ${rc} -eq 2 ]; then mptcp_lib_result_fail "${msg} # error ${rc}" ret=${KSFT_FAIL} @@ -XXX,XX +XXX,XX @@ check "show_endpoints" \ "3,10.0.1.3,signal backup")" "dump addrs" del_endpoint 2 -check "get_endpoint 2" "" "simple del addr" +check "get_endpoint 2" "" "simple del addr" 1 check "show_endpoints" \ "$(format_endpoints "1,10.0.1.1" \ "3,10.0.1.3,signal backup")" "dump addrs after del" add_endpoint 10.0.1.3 2>/dev/null -check "get_endpoint 4" "" "duplicate addr" +check "get_endpoint 4" "" "duplicate addr" 1 add_endpoint 10.0.1.4 flags signal check "get_endpoint 4" "$(format_endpoints "4,10.0.1.4,signal")" "id addr increment" -- 2.53.0
When pm_netlink.sh is executed with '-i', 'ip mptcp' is used instead of 'pm_nl_ctl'. IPRoute2 doesn't support the 'unknown' flag, which has only been added to 'pm_nl_ctl' for this specific check: to ensure that the kernel ignores such unsupported flag. No reason to add this flag to 'ip mptcp'. Then, this check should be skipped when 'ip mptcp' is used. Fixes: 29f4801e9c8d ("selftests: mptcp: pm: ensure unknown flags are ignored") Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/pm_netlink.sh | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/pm_netlink.sh b/tools/testing/selftests/net/mptcp/pm_netlink.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/pm_netlink.sh +++ b/tools/testing/selftests/net/mptcp/pm_netlink.sh @@ -XXX,XX +XXX,XX @@ check "show_endpoints" \ flush_endpoint check "show_endpoints" "" "flush addrs" -add_endpoint 10.0.1.1 flags unknown -check "show_endpoints" "$(format_endpoints "1,10.0.1.1")" "ignore unknown flags" -flush_endpoint +# "unknown" flag is only supported by pm_nl_ctl +if ! mptcp_lib_is_ip_mptcp; then + add_endpoint 10.0.1.1 flags unknown + check "show_endpoints" "$(format_endpoints "1,10.0.1.1")" \ + "ignore unknown flags" + flush_endpoint +fi set_limits 9 1 2>/dev/null check "get_limits" "${default_limits}" "rcv addrs above hard limit" -- 2.53.0
The in-kernel PM can create subflows in reply to ADD_ADDR by batch of maximum 8 subflows for the moment. Same when adding new "subflow" endpoints with the fullmesh flag. This limit is linked to the arrays used during these steps. There was no explicit limit to the arrays size (8), because the limit of extra subflows is the same (8). It seems safer to use an explicit limit, but also these two sizes are going to be different in the next commit. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ fill_remote_addr(struct mptcp_sock *msk, struct mptcp_addr_info *local, static unsigned int fill_remote_addresses_fullmesh(struct mptcp_sock *msk, struct mptcp_addr_info *local, - struct mptcp_addr_info *addrs) + struct mptcp_addr_info *addrs, + int addrs_size) { u8 limit_extra_subflows = mptcp_pm_get_limit_extra_subflows(msk); bool deny_id0 = READ_ONCE(msk->pm.remote_deny_join_id0); @@ -XXX,XX +XXX,XX @@ fill_remote_addresses_fullmesh(struct mptcp_sock *msk, msk->pm.extra_subflows++; i++; - if (msk->pm.extra_subflows >= limit_extra_subflows) + if (msk->pm.extra_subflows >= limit_extra_subflows || + i == addrs_size) break; } @@ -XXX,XX +XXX,XX @@ fill_remote_addresses_fullmesh(struct mptcp_sock *msk, */ static unsigned int fill_remote_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *local, - bool fullmesh, struct mptcp_addr_info *addrs) + bool fullmesh, struct mptcp_addr_info *addrs, + int addrs_size) { /* Non-fullmesh: fill in the single entry corresponding to the primary * MPC subflow remote address, and return 1, corresponding to 1 entry. @@ -XXX,XX +XXX,XX @@ fill_remote_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *local, return fill_remote_addr(msk, local, addrs); /* Fullmesh endpoint: fill all possible remote addresses */ - return fill_remote_addresses_fullmesh(msk, local, addrs); + return fill_remote_addresses_fullmesh(msk, local, addrs, addrs_size); } static struct mptcp_pm_addr_entry * @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk) else /* local_addr_used is not decr for ID 0 */ msk->pm.local_addr_used++; - nr = fill_remote_addresses_vec(msk, &local.addr, fullmesh, addrs); + nr = fill_remote_addresses_vec(msk, &local.addr, fullmesh, + addrs, ARRAY_SIZE(addrs)); if (nr == 0) continue; @@ -XXX,XX +XXX,XX @@ static unsigned int fill_local_addresses_vec_fullmesh(struct mptcp_sock *msk, struct mptcp_addr_info *remote, struct mptcp_pm_local *locals, + int locals_size, bool c_flag_case) { u8 limit_extra_subflows = mptcp_pm_get_limit_extra_subflows(msk); @@ -XXX,XX +XXX,XX @@ fill_local_addresses_vec_fullmesh(struct mptcp_sock *msk, msk->pm.extra_subflows++; i++; - if (msk->pm.extra_subflows >= limit_extra_subflows) + if (msk->pm.extra_subflows >= limit_extra_subflows || + i == locals_size) break; } rcu_read_unlock(); @@ -XXX,XX +XXX,XX @@ fill_local_laminar_endp(struct mptcp_sock *msk, struct mptcp_addr_info *remote, static unsigned int fill_local_addresses_vec_c_flag(struct mptcp_sock *msk, struct mptcp_addr_info *remote, - struct mptcp_pm_local *locals) + struct mptcp_pm_local *locals, + int locals_size) { u8 limit_extra_subflows = mptcp_pm_get_limit_extra_subflows(msk); struct pm_nl_pernet *pernet = pm_nl_get_pernet_from_msk(msk); @@ -XXX,XX +XXX,XX @@ fill_local_addresses_vec_c_flag(struct mptcp_sock *msk, msk->pm.extra_subflows++; i++; - if (msk->pm.extra_subflows >= limit_extra_subflows) + if (msk->pm.extra_subflows >= limit_extra_subflows || + i == locals_size) break; } @@ -XXX,XX +XXX,XX @@ fill_local_address_any(struct mptcp_sock *msk, struct mptcp_addr_info *remote, */ static unsigned int fill_local_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *remote, - struct mptcp_pm_local *locals) + struct mptcp_pm_local *locals, int locals_size) { bool c_flag_case = remote->id && mptcp_pm_add_addr_c_flag_case(msk); /* If there is at least one MPTCP endpoint with a fullmesh flag */ if (mptcp_pm_get_endp_fullmesh_max(msk)) return fill_local_addresses_vec_fullmesh(msk, remote, locals, + locals_size, c_flag_case); /* If there is at least one MPTCP endpoint with a laminar flag */ @@ -XXX,XX +XXX,XX @@ fill_local_addresses_vec(struct mptcp_sock *msk, struct mptcp_addr_info *remote, * limits are used -- accepting no ADD_ADDR -- and use subflow endpoints */ if (c_flag_case) - return fill_local_addresses_vec_c_flag(msk, remote, locals); + return fill_local_addresses_vec_c_flag(msk, remote, locals, + locals_size); /* No special case: fill in the single 'IPADDRANY' local address */ return fill_local_address_any(msk, remote, &locals[0]); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_nl_add_addr_received(struct mptcp_sock *msk) /* connect to the specified remote address, using whatever * local address the routing configuration will pick. */ - nr = fill_local_addresses_vec(msk, &remote, locals); + nr = fill_local_addresses_vec(msk, &remote, locals, ARRAY_SIZE(locals)); if (nr == 0) return; -- 2.53.0
This means switching the maximum from 8 to 64 for the number of subflows and accepted ADD_ADDR. The previous limit of 8 subflows makes sense in most cases. Using more subflows will very likely *not* improve the situation, and could even decrease the performances. But there are no technical limitations nor performance impact to raise this limit, so let's do it: this will allow people with very specific use-cases, and researchers to easily create more subflows, and measure the performance impact by themselves. The theoretical limit is 255 -- the ID is written in a u8 on the wire -- but 64 is more than enough. With so many subflows, it will be costly to iterate over all of them when operations are done in bottom half. Note that the in-kernel PM will continue to create subflows in reply to ADD_ADDR with a single batch of maximum 8 subflows. Same when adding new "subflow" endpoints with the fullmesh flag. Increasing those batch limits would have a memory impact, and it looks fine not to cover these cases with larger batches for the moment. If more is needed later, the position of the last subflow from the list could be remembered, and the list iteration could continue later. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/434 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ struct pm_nl_pernet { }; #define MPTCP_PM_ADDR_MAX 8 +#define MPTCP_PM_SUBFLOWS_MAX 64 static struct pm_nl_pernet *pm_nl_get_pernet(const struct net *net) { @@ -XXX,XX +XXX,XX @@ static int parse_limit(struct genl_info *info, int id, unsigned int *limit) return 0; *limit = nla_get_u32(attr); - if (*limit > MPTCP_PM_ADDR_MAX) { + if (*limit > MPTCP_PM_SUBFLOWS_MAX) { NL_SET_ERR_MSG_ATTR_FMT(info->extack, attr, "limit greater than maximum (%u)", - MPTCP_PM_ADDR_MAX); + MPTCP_PM_SUBFLOWS_MAX); return -EINVAL; } return 0; -- 2.53.0
The mptcp_rm_list structure contains an array of IDs of 8 entries: to be able to send a RM_ADDR with 8 IDs. This limitation was OK so far because there could maximum 8 endpoints. But this is going to change in the next commit. To cope with that, if one of the arrays is full, the iteration stops, the lists are processed, then the iteration continues where it previously stopped. Note that if there are many endpoints to remove, and multiple RM_ADDR to send, it might be more likely that some of these RM_ADDRs are dropped or lost. This is a known limitation: RM_ADDR are not retransmitted in MPTCPv1. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 38 +++++++++++++++++++++++++++----------- 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ int mptcp_pm_nl_del_addr_doit(struct sk_buff *skb, struct genl_info *info) } static void mptcp_pm_flush_addrs_and_subflows(struct mptcp_sock *msk, - struct list_head *rm_list) + struct list_head *rm_list, + struct mptcp_pm_addr_entry *entry) { - struct mptcp_rm_list alist = { .nr = 0 }, slist = { .nr = 0 }; - struct mptcp_pm_addr_entry *entry; + struct mptcp_rm_list alist, slist; + bool more; - list_for_each_entry(entry, rm_list, list) { - if (slist.nr < MPTCP_RM_IDS_MAX && - mptcp_lookup_subflow_by_saddr(&msk->conn_list, &entry->addr)) +again: + alist.nr = 0; + slist.nr = 0; + more = false; + + entry = list_prepare_entry(entry, rm_list, list); + list_for_each_entry_continue(entry, rm_list, list) { + if (mptcp_lookup_subflow_by_saddr(&msk->conn_list, &entry->addr)) slist.ids[slist.nr++] = mptcp_endp_get_local_id(msk, &entry->addr); - if (alist.nr < MPTCP_RM_IDS_MAX && - mptcp_remove_anno_list_by_saddr(msk, &entry->addr)) + if (mptcp_remove_anno_list_by_saddr(msk, &entry->addr)) alist.ids[alist.nr++] = mptcp_endp_get_local_id(msk, &entry->addr); + + if (slist.nr == MPTCP_RM_IDS_MAX || + alist.nr == MPTCP_RM_IDS_MAX) { + more = !list_is_last(&entry->list, rm_list); + break; + } } spin_lock_bh(&msk->pm.lock); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_flush_addrs_and_subflows(struct mptcp_sock *msk, if (slist.nr) mptcp_pm_rm_subflow(msk, &slist); /* Reset counters: maybe some subflows have been removed before */ - bitmap_fill(msk->pm.id_avail_bitmap, MPTCP_PM_MAX_ADDR_ID + 1); - msk->pm.local_addr_used = 0; + if (!more) { + bitmap_fill(msk->pm.id_avail_bitmap, MPTCP_PM_MAX_ADDR_ID + 1); + msk->pm.local_addr_used = 0; + } spin_unlock_bh(&msk->pm.lock); + + if (more) + goto again; } static void mptcp_nl_flush_addrs_list(struct net *net, @@ -XXX,XX +XXX,XX @@ static void mptcp_nl_flush_addrs_list(struct net *net, if (!mptcp_pm_is_userspace(msk)) { lock_sock(sk); - mptcp_pm_flush_addrs_and_subflows(msk, rm_list); + mptcp_pm_flush_addrs_and_subflows(msk, rm_list, NULL); release_sock(sk); } -- 2.53.0
The endpoints are managed in a list which was limited to 8 entries. This limit can be too small in some cases: by having the same limit as the number of subflows, it might not allow creating all expected subflows when having a mix of v4 and v6 addresses that can all use MPTCP on v4/v6 only networks. While increasing the limit above the new subflows one, why not using the technical limit: 254. Indeed, the endpoint will each have an ID that will be used on the wire, limited to u8, and the ID 0 is reserved to the initial subflow. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/pm_kernel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ static int mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet, */ if (pernet->next_id == MPTCP_PM_MAX_ADDR_ID) pernet->next_id = 1; - if (pernet->endpoints >= MPTCP_PM_ADDR_MAX) { + if (pernet->endpoints == MPTCP_PM_MAX_ADDR_ID) { ret = -ERANGE; goto out; } -- 2.53.0
By default, 4 network interfaces are created per subtest in a dedicated net namespace. Each netns has a dedicated pair of v4 and v6 addresses. Future tests will need more. Simply always creating more network interfaces per test will increase the execution time for all other tests, for no other benefits. So now it is possible to change this number only when needed, by setting ifaces_nr when calling 'reset' and 'init_shapers', e.g. ifaces_nr=8 reset "Subtest title" ifaces_nr=8 init_shapers Note that it might also be interesting to decrease the default value to 2 to reduce the setup time, especially when a debug kernel config is being used. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/mptcp_join.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -XXX,XX +XXX,XX @@ unset fastclose unset fullmesh unset speed unset bind_addr +unset ifaces_nr unset join_syn_rej unset join_csum_ns1 unset join_csum_ns2 @@ -XXX,XX +XXX,XX @@ init_partial() # ns1eth4 ns2eth4 local i - for i in $(seq 1 4); do + for i in $(seq 1 "${ifaces_nr:-4}"); do ip link add ns1eth$i netns "$ns1" type veth peer name ns2eth$i netns "$ns2" ip -net "$ns1" addr add 10.0.$i.1/24 dev ns1eth$i ip -net "$ns1" addr add dead:beef:$i::1/64 dev ns1eth$i nodad @@ -XXX,XX +XXX,XX @@ init_partial() init_shapers() { local i - for i in $(seq 1 4); do + for i in $(seq 1 "${ifaces_nr:-4}"); do tc -n $ns1 qdisc add dev ns1eth$i root netem rate 20mbit delay 1ms tc -n $ns2 qdisc add dev ns2eth$i root netem rate 20mbit delay 1ms done -- 2.53.0
The limits have been recently increased, it is required to validate that having 64 subflows is allowed. Here, both the client and the server have 8 network interfaces. The server has 8 endpoints marked as 'signal' to announce all its v4 addresses. The client also has 8 endpoints, but marked as 'subflow' and 'fullmesh' in order to create 8 subflows to each address announced by the server. This means 63 additional subflows will be created after the initial one. If it is not possible to increase the limits to 64, it means an older kernel version is being used, and the test is skipped. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- tools/testing/selftests/net/mptcp/mptcp_join.sh | 28 +++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -XXX,XX +XXX,XX @@ reset_with_tcp_filter() fi } +# For kernel supporting limits above 8 +# $1: title ; $2,4: addrs limit ns1,2 ; $3,5: subflows limit ns1,2 +reset_with_high_limits() +{ + reset "${1}" || return 1 + + if ! pm_nl_set_limits "${ns1}" "${2}" "${3}" 2>/dev/null || + ! pm_nl_set_limits "${ns2}" "${4}" "${5}" 2>/dev/null; then + mark_as_skipped "unable to set the limits to ${*:2}" + return 1 + fi +} + # $1: err msg fail_test() { @@ -XXX,XX +XXX,XX @@ fullmesh_tests() chk_prio_nr 0 1 1 0 chk_rm_nr 0 1 fi + + # fullmesh in 8x8 to create 63 additional subflows + if ifaces_nr=8 reset_with_high_limits "fullmesh 8x8" 64 64 64 64; then + # higher chance to lose ADD_ADDR: allow retransmissions + ip netns exec $ns1 sysctl -q net.mptcp.add_addr_timeout=1 + local i + for i in $(seq 1 8); do + pm_nl_add_endpoint $ns2 10.0.$i.2 flags subflow,fullmesh + pm_nl_add_endpoint $ns1 10.0.$i.1 flags signal + done + speed=slow \ + run_tests $ns1 $ns2 10.0.1.1 + chk_join_nr 63 63 63 + fi + } fastclose_tests() -- 2.53.0
These limits have been recently updated, from 8 to: - 64 for the subflows and accepted add_addr - 255 for the MPTCP endpoints These modifications validate the new limits, but are also compatible with the previous ones, to be able to continue to validate stable kernel using the last version of the selftests. That's why new variables are now used instead of hard-coded values. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- v4: avoid using "&&" with check(), use a clearer way instead. --- tools/testing/selftests/net/mptcp/pm_netlink.sh | 56 +++++++++++++++---------- 1 file changed, 35 insertions(+), 21 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/pm_netlink.sh b/tools/testing/selftests/net/mptcp/pm_netlink.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/pm_netlink.sh +++ b/tools/testing/selftests/net/mptcp/pm_netlink.sh @@ -XXX,XX +XXX,XX @@ get_limits() { fi } +get_limits_nb() { + if mptcp_lib_is_ip_mptcp; then + ip -n "${ns1}" mptcp limits | awk '{ print $2" "$4 }' + else + ip netns exec "${ns1}" ./pm_nl_ctl limits | \ + awk '{ printf "%s ", $2 }' + fi +} + format_endpoints() { mptcp_lib_pm_nl_format_endpoints "${@}" } @@ -XXX,XX +XXX,XX @@ check "get_endpoint 2" "" "simple del addr" 1 check "show_endpoints" \ "$(format_endpoints "1,10.0.1.1" \ "3,10.0.1.3,signal backup")" "dump addrs after del" +add_endpoint 10.0.1.2 id 2 add_endpoint 10.0.1.3 2>/dev/null check "get_endpoint 4" "" "duplicate addr" 1 @@ -XXX,XX +XXX,XX @@ check "get_endpoint 4" "" "duplicate addr" 1 add_endpoint 10.0.1.4 flags signal check "get_endpoint 4" "$(format_endpoints "4,10.0.1.4,signal")" "id addr increment" -for i in $(seq 5 9); do - add_endpoint "10.0.1.${i}" flags signal >/dev/null 2>&1 -done -check "get_endpoint 9" "$(format_endpoints "9,10.0.1.9,signal")" "hard addr limit" -check "get_endpoint 10" "" "above hard addr limit" +read -r -a default_limits_nb <<< "$(get_limits_nb)" +# limits have been increased: from 8 to 64 for subflows/add_addr & 255 for endp +if mptcp_lib_expect_all_features || set_limits 9 9 2>/dev/null; then + max_endp=255 + max_limits=64 +else + max_endp=8 + max_limits=8 +fi +set_limits "${default_limits_nb[@]}" -del_endpoint 9 -for i in $(seq 10 255); do - add_endpoint 10.0.0.9 id "${i}" - del_endpoint "${i}" +for i in $(seq 5 ${max_endp}); do + add_endpoint "10.0.0.${i}" id "${i}" done -check "show_endpoints" \ - "$(format_endpoints "1,10.0.1.1" \ - "3,10.0.1.3,signal backup" \ - "4,10.0.1.4,signal" \ - "5,10.0.1.5,signal" \ - "6,10.0.1.6,signal" \ - "7,10.0.1.7,signal" \ - "8,10.0.1.8,signal")" "id limit" +check "get_endpoint ${max_endp}" \ + "$(format_endpoints "${max_endp},10.0.0.${max_endp}")" "id limit" + +if add_endpoint '10.0.0.1' &>/dev/null; then + hardlimit="no error" +else + hardlimit="error" +fi +check "echo ${hardlimit}" "error" "above hard addr limit" flush_endpoint check "show_endpoints" "" "flush addrs" @@ -XXX,XX +XXX,XX @@ if ! mptcp_lib_is_ip_mptcp; then flush_endpoint fi -set_limits 9 1 2>/dev/null +set_limits $((max_limits + 1)) 1 2>/dev/null check "get_limits" "${default_limits}" "rcv addrs above hard limit" -set_limits 1 9 2>/dev/null +set_limits 1 $((max_limits + 1)) 2>/dev/null check "get_limits" "${default_limits}" "subflows above hard limit" -set_limits 8 8 +set_limits ${max_limits} ${max_limits} flush_endpoint ## to make sure it doesn't affect the limits -check "get_limits" "$(format_limits 8 8)" "set limits" +check "get_limits" "$(format_limits ${max_limits} ${max_limits})" "set limits" flush_endpoint add_endpoint 10.0.1.1 -- 2.53.0
Instead of sendto() and recvfrom() which the NL address that was already provided before. Just simpler and easier to read without the to/from variants. While at it, fix a checkpatch warning by removing multiple assignments. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- v2: fix already present checkpatch warning. --- tools/testing/selftests/net/mptcp/pm_nl_ctl.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/pm_nl_ctl.c b/tools/testing/selftests/net/mptcp/pm_nl_ctl.c index XXXXXXX..XXXXXXX 100644 --- a/tools/testing/selftests/net/mptcp/pm_nl_ctl.c +++ b/tools/testing/selftests/net/mptcp/pm_nl_ctl.c @@ -XXX,XX +XXX,XX @@ static int capture_events(int fd, int event_group) /* do a netlink command and, if max > 0, fetch the reply ; nh's size >1024B */ static int do_nl_req(int fd, struct nlmsghdr *nh, int len, int max) { - struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK }; - socklen_t addr_len; void *data = nh; int rem, ret; int err = 0; @@ -XXX,XX +XXX,XX @@ static int do_nl_req(int fd, struct nlmsghdr *nh, int len, int max) } nh->nlmsg_len = len; - ret = sendto(fd, data, len, 0, (void *)&nladdr, sizeof(nladdr)); + ret = send(fd, data, len, 0); if (ret != len) error(1, errno, "send netlink: %uB != %uB\n", ret, len); - addr_len = sizeof(nladdr); - rem = ret = recvfrom(fd, data, max, 0, (void *)&nladdr, &addr_len); + ret = recv(fd, data, max, 0); if (ret < 0) error(1, errno, "recv netlink: %uB\n", ret); + rem = ret; /* Beware: the NLMSG_NEXT macro updates the 'rem' argument */ for (; NLMSG_OK(nh, rem); nh = NLMSG_NEXT(nh, rem)) { if (nh->nlmsg_type == NLMSG_DONE) -- 2.53.0
Here is a suggestion, and if it is OK, I will split this in multiple commits: it is not the first time the 'add' and 'anno' names to describe ADD_ADDR related functions are confusing. Eric already pointed that in [1]. I started by renaming only the internal helper names, then while at it, I tried to uniform everything linked to ADD_ADDR. WDYT? Link: https://lore.kernel.org/20251117100745.1913963-1-edumazet@google.com [1] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- net/mptcp/options.c | 2 +- net/mptcp/pm.c | 133 ++++++++++++++++++++++++----------------------- net/mptcp/pm_kernel.c | 20 +++---- net/mptcp/pm_userspace.c | 6 +-- net/mptcp/protocol.h | 19 +++---- net/mptcp/subflow.c | 4 +- 6 files changed, 93 insertions(+), 91 deletions(-) diff --git a/net/mptcp/options.c b/net/mptcp/options.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -XXX,XX +XXX,XX @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb) MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_ADDADDR); } else { mptcp_pm_add_addr_echoed(msk, &mp_opt.addr); - mptcp_pm_del_add_timer(msk, &mp_opt.addr, true); + mptcp_pm_add_addr_del_timer(msk, &mp_opt.addr, true); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_ECHOADD); } diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ #define ADD_ADDR_RETRANS_MAX 3 -struct mptcp_pm_add_entry { +struct mptcp_pm_add_addr { struct list_head list; struct mptcp_addr_info addr; u8 retrans_times; bool timer_done; - struct timer_list add_timer; + struct timer_list timer; struct mptcp_sock *sock; struct rcu_head rcu; }; @@ -XXX,XX +XXX,XX @@ static bool mptcp_pm_is_init_remote_addr(struct mptcp_sock *msk, return mptcp_addresses_equal(&mpc_remote, remote, remote->port); } -bool mptcp_lookup_subflow_by_saddr(const struct list_head *list, - const struct mptcp_addr_info *saddr) +bool mptcp_pm_subflow_lookup_by_saddr(struct mptcp_sock *msk, + const struct mptcp_addr_info *saddr) { struct mptcp_subflow_context *subflow; struct mptcp_addr_info cur; struct sock_common *skc; - list_for_each_entry(subflow, list, node) { + mptcp_for_each_subflow(msk, subflow) { skc = (struct sock_common *)mptcp_subflow_tcp_sock(subflow); mptcp_local_address(skc, &cur); @@ -XXX,XX +XXX,XX @@ bool mptcp_lookup_subflow_by_saddr(const struct list_head *list, return false; } -static struct mptcp_pm_add_entry * -mptcp_lookup_anno_list_by_saddr(const struct mptcp_sock *msk, - const struct mptcp_addr_info *addr) +static struct mptcp_pm_add_addr * +mptcp_pm_add_addr_lookup_by_addr(const struct mptcp_sock *msk, + const struct mptcp_addr_info *addr) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *add_addr; lockdep_assert_held(&msk->pm.lock); - list_for_each_entry(entry, &msk->pm.anno_list, list) { - if (mptcp_addresses_equal(&entry->addr, addr, true)) - return entry; + list_for_each_entry(add_addr, &msk->pm.anno_list, list) { + if (mptcp_addresses_equal(&add_addr->addr, addr, true)) + return add_addr; } return NULL; } -bool mptcp_remove_anno_list_by_saddr(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr) +bool mptcp_pm_add_addr_remove(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *add_addr; bool ret; - entry = mptcp_pm_del_add_timer(msk, addr, false); - ret = entry; - kfree_rcu(entry, rcu); + add_addr = mptcp_pm_add_addr_del_timer(msk, addr, false); + ret = add_addr; + kfree_rcu(add_addr, rcu); return ret; } -bool mptcp_pm_sport_in_anno_list(struct mptcp_sock *msk, const struct sock *sk) +bool mptcp_pm_add_addr_lookup_by_sk(struct mptcp_sock *msk, + const struct sock *sk) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *entry; struct mptcp_addr_info saddr; bool ret = false; @@ -XXX,XX +XXX,XX @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk) return max_stale && max_stale < rto ? max_stale : rto; } -static void mptcp_pm_add_timer(struct timer_list *timer) +static void mptcp_pm_add_addr_timer(struct timer_list *timer) { - struct mptcp_pm_add_entry *entry = timer_container_of(entry, timer, - add_timer); - struct mptcp_sock *msk = entry->sock; + struct mptcp_pm_add_addr *add_addr = timer_container_of(add_addr, timer, + timer); + struct mptcp_sock *msk = add_addr->sock; struct sock *sk = (struct sock *)msk; unsigned int timeout = 0; @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) spin_lock_bh(&msk->pm.lock); if (!mptcp_pm_should_add_signal_addr(msk)) { - pr_debug("retransmit ADD_ADDR id=%d\n", entry->addr.id); - mptcp_pm_announce_addr(msk, &entry->addr, false); + pr_debug("retransmit ADD_ADDR id=%d\n", add_addr->addr.id); + mptcp_pm_announce_addr(msk, &add_addr->addr, false); mptcp_pm_add_addr_send_ack(msk); - entry->retrans_times++; + add_addr->retrans_times++; } - if (entry->retrans_times < ADD_ADDR_RETRANS_MAX) - timeout <<= entry->retrans_times; + if (add_addr->retrans_times < ADD_ADDR_RETRANS_MAX) + timeout <<= add_addr->retrans_times; else timeout = 0; spin_unlock_bh(&msk->pm.lock); - if (entry->retrans_times == ADD_ADDR_RETRANS_MAX) + if (add_addr->retrans_times == ADD_ADDR_RETRANS_MAX) mptcp_pm_subflow_established(msk); out: @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) sk_reset_timer(sk, timer, jiffies + timeout); else /* if sock_put calls sk_free: avoid waiting for this timer */ - entry->timer_done = true; + add_addr->timer_done = true; bh_unlock_sock(sk); sock_put(sk); } -struct mptcp_pm_add_entry * -mptcp_pm_del_add_timer(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr, bool check_id) +struct mptcp_pm_add_addr * +mptcp_pm_add_addr_del_timer(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr, bool check_id) { - struct mptcp_pm_add_entry *entry; + struct mptcp_pm_add_addr *add_addr; struct sock *sk = (struct sock *)msk; bool stop_timer = false; rcu_read_lock(); spin_lock_bh(&msk->pm.lock); - entry = mptcp_lookup_anno_list_by_saddr(msk, addr); - if (entry && (!check_id || entry->addr.id == addr->id)) { - entry->retrans_times = ADD_ADDR_RETRANS_MAX; + add_addr = mptcp_pm_add_addr_lookup_by_addr(msk, addr); + if (add_addr && (!check_id || add_addr->addr.id == addr->id)) { + add_addr->retrans_times = ADD_ADDR_RETRANS_MAX; stop_timer = true; } - if (!check_id && entry) - list_del(&entry->list); + if (!check_id && add_addr) + list_del(&add_addr->list); spin_unlock_bh(&msk->pm.lock); /* Note: entry might have been removed by another thread. * We hold rcu_read_lock() to ensure it is not freed under us. */ if (stop_timer) - sk_stop_timer_sync(sk, &entry->add_timer); + sk_stop_timer_sync(sk, &add_addr->timer); rcu_read_unlock(); - return entry; + return add_addr; } -bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr) +bool mptcp_pm_add_addr_alloc(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr) { - struct mptcp_pm_add_entry *add_entry = NULL; + struct mptcp_pm_add_addr *add_addr = NULL; struct sock *sk = (struct sock *)msk; unsigned int timeout; lockdep_assert_held(&msk->pm.lock); - add_entry = mptcp_lookup_anno_list_by_saddr(msk, addr); - - if (add_entry) { + add_addr = mptcp_pm_add_addr_lookup_by_addr(msk, addr); + if (add_addr) { if (WARN_ON_ONCE(mptcp_pm_is_kernel(msk))) return false; goto reset_timer; } - add_entry = kmalloc_obj(*add_entry, GFP_ATOMIC); - if (!add_entry) + add_addr = kmalloc_obj(*add_addr, GFP_ATOMIC); + if (!add_addr) return false; - list_add(&add_entry->list, &msk->pm.anno_list); + list_add(&add_addr->list, &msk->pm.anno_list); - add_entry->addr = *addr; - add_entry->sock = msk; - add_entry->retrans_times = 0; + add_addr->addr = *addr; + add_addr->sock = msk; + add_addr->retrans_times = 0; - timer_setup(&add_entry->add_timer, mptcp_pm_add_timer, 0); + timer_setup(&add_addr->timer, mptcp_pm_add_addr_timer, 0); reset_timer: - add_entry->timer_done = false; + add_addr->timer_done = false; timeout = mptcp_adjust_add_addr_timeout(msk); if (timeout) - sk_reset_timer(sk, &add_entry->add_timer, jiffies + timeout); + sk_reset_timer(sk, &add_addr->timer, jiffies + timeout); else - add_entry->timer_done = true; + add_addr->timer_done = true; return true; } -static void mptcp_pm_free_anno_list(struct mptcp_sock *msk) +static void mptcp_pm_free_add_addr_list(struct mptcp_sock *msk) { - struct mptcp_pm_add_entry *entry, *tmp; + struct mptcp_pm_add_addr *add_addr, *tmp; struct sock *sk = (struct sock *)msk; LIST_HEAD(free_list); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_free_anno_list(struct mptcp_sock *msk) list_splice_init(&msk->pm.anno_list, &free_list); spin_unlock_bh(&msk->pm.lock); - list_for_each_entry_safe(entry, tmp, &free_list, list) { - if (!entry->timer_done) - sk_stop_timer_sync(sk, &entry->add_timer); - kfree_rcu(entry, rcu); + list_for_each_entry_safe(add_addr, tmp, &free_list, list) { + if (!add_addr->timer_done) + sk_stop_timer_sync(sk, &add_addr->timer); + kfree_rcu(add_addr, rcu); } } @@ -XXX,XX +XXX,XX @@ void mptcp_pm_add_addr_echoed(struct mptcp_sock *msk, spin_lock_bh(&pm->lock); - if (mptcp_lookup_anno_list_by_saddr(msk, addr) && READ_ONCE(pm->work_pending)) + if (mptcp_pm_add_addr_lookup_by_addr(msk, addr) && + READ_ONCE(pm->work_pending)) mptcp_pm_schedule_work(msk, MPTCP_PM_SUBFLOW_ESTABLISHED); spin_unlock_bh(&pm->lock); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_ops_release(struct mptcp_sock *msk) void mptcp_pm_destroy(struct mptcp_sock *msk) { - mptcp_pm_free_anno_list(msk); + mptcp_pm_free_add_addr_list(msk); mptcp_pm_ops_release(msk); } diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_kernel.c +++ b/net/mptcp/pm_kernel.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk) /* If the alloc fails, we are on memory pressure, not worth * continuing, and trying to create subflows. */ - if (!mptcp_pm_alloc_anno_list(msk, &local.addr)) + if (!mptcp_pm_add_addr_alloc(msk, &local.addr)) return; __clear_bit(endp_id, msk->pm.id_avail_bitmap); @@ -XXX,XX +XXX,XX @@ int mptcp_pm_nl_add_addr_doit(struct sk_buff *skb, struct genl_info *info) return ret; } -static void mptcp_pm_remove_anno_addr(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr, - bool force) +static void mptcp_pm_remove_add_addr(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr, + bool force) { struct mptcp_rm_list list = { .nr = 0 }; bool announced; list.ids[list.nr++] = mptcp_endp_get_local_id(msk, addr); - announced = mptcp_remove_anno_list_by_saddr(msk, addr); + announced = mptcp_pm_add_addr_remove(msk, addr); if (announced || force) { spin_lock_bh(&msk->pm.lock); if (announced) @@ -XXX,XX +XXX,XX @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net, goto next; lock_sock(sk); - remove_subflow = mptcp_lookup_subflow_by_saddr(&msk->conn_list, addr); - mptcp_pm_remove_anno_addr(msk, addr, remove_subflow && - !(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT)); + remove_subflow = mptcp_pm_subflow_lookup_by_saddr(msk, addr); + mptcp_pm_remove_add_addr(msk, addr, remove_subflow && + !(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT)); list.ids[0] = mptcp_endp_get_local_id(msk, addr); @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_flush_addrs_and_subflows(struct mptcp_sock *msk, entry = list_prepare_entry(entry, rm_list, list); list_for_each_entry_continue(entry, rm_list, list) { - if (mptcp_lookup_subflow_by_saddr(&msk->conn_list, &entry->addr)) + if (mptcp_pm_subflow_lookup_by_saddr(msk, &entry->addr)) slist.ids[slist.nr++] = mptcp_endp_get_local_id(msk, &entry->addr); - if (mptcp_remove_anno_list_by_saddr(msk, &entry->addr)) + if (mptcp_pm_add_addr_remove(msk, &entry->addr)) alist.ids[alist.nr++] = mptcp_endp_get_local_id(msk, &entry->addr); if (slist.nr == MPTCP_RM_IDS_MAX || diff --git a/net/mptcp/pm_userspace.c b/net/mptcp/pm_userspace.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm_userspace.c +++ b/net/mptcp/pm_userspace.c @@ -XXX,XX +XXX,XX @@ int mptcp_pm_nl_announce_doit(struct sk_buff *skb, struct genl_info *info) lock_sock(sk); spin_lock_bh(&msk->pm.lock); - if (mptcp_pm_alloc_anno_list(msk, &addr_val.addr)) { + if (mptcp_pm_add_addr_alloc(msk, &addr_val.addr)) { msk->pm.add_addr_signaled++; mptcp_pm_announce_addr(msk, &addr_val.addr, false); mptcp_pm_addr_send_ack(msk); @@ -XXX,XX +XXX,XX @@ void mptcp_pm_remove_addr_entry(struct mptcp_sock *msk, int anno_nr = 0; /* only delete if either announced or matching a subflow */ - if (mptcp_remove_anno_list_by_saddr(msk, &entry->addr)) + if (mptcp_pm_add_addr_remove(msk, &entry->addr)) anno_nr++; - else if (!mptcp_lookup_subflow_by_saddr(&msk->conn_list, &entry->addr)) + else if (!mptcp_pm_subflow_lookup_by_saddr(msk, &entry->addr)) return; alist.ids[alist.nr++] = entry->addr.id; diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -XXX,XX +XXX,XX @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk, struct mptcp_addr_info *addr, struct mptcp_addr_info *rem, u8 bkup); -bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk, +bool mptcp_pm_add_addr_alloc(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr); +struct mptcp_pm_add_addr * +mptcp_pm_add_addr_del_timer(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr, bool check_id); +bool mptcp_pm_add_addr_remove(struct mptcp_sock *msk, const struct mptcp_addr_info *addr); -bool mptcp_pm_sport_in_anno_list(struct mptcp_sock *msk, const struct sock *sk); -struct mptcp_pm_add_entry * -mptcp_pm_del_add_timer(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr, bool check_id); -bool mptcp_lookup_subflow_by_saddr(const struct list_head *list, - const struct mptcp_addr_info *saddr); -bool mptcp_remove_anno_list_by_saddr(struct mptcp_sock *msk, - const struct mptcp_addr_info *addr); +bool mptcp_pm_add_addr_lookup_by_sk(struct mptcp_sock *msk, + const struct sock *sk); +bool mptcp_pm_subflow_lookup_by_saddr(struct mptcp_sock *msk, + const struct mptcp_addr_info *saddr); int mptcp_pm_nl_set_flags(struct mptcp_pm_addr_entry *local, struct genl_info *info); int mptcp_userspace_pm_set_flags(struct mptcp_pm_addr_entry *local, diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -XXX,XX +XXX,XX @@ static int subflow_check_req(struct request_sock *req, pr_debug("syn inet_sport=%d %d\n", ntohs(inet_sk(sk_listener)->inet_sport), ntohs(inet_sk((struct sock *)subflow_req->msk)->inet_sport)); - if (!mptcp_pm_sport_in_anno_list(subflow_req->msk, sk_listener)) { + if (!mptcp_pm_add_addr_lookup_by_sk(subflow_req->msk, sk_listener)) { SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_MISMATCHPORTSYNRX); subflow_add_reset_reason(skb, MPTCP_RST_EPROHIBIT); return -EPERM; @@ -XXX,XX +XXX,XX @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, pr_debug("ack inet_sport=%d %d\n", ntohs(inet_sk(sk)->inet_sport), ntohs(inet_sk((struct sock *)owner)->inet_sport)); - if (!mptcp_pm_sport_in_anno_list(owner, sk)) { + if (!mptcp_pm_add_addr_lookup_by_sk(owner, sk)) { SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_MISMATCHPORTACKRX); subflow_add_reset_reason(skb, MPTCP_RST_EPROHIBIT); goto dispose_child; -- 2.53.0