:p
atchew
Login
This patch fixes a bug where the ADD_ADDR timer keeps rescheduling indefinitely when TCP option space is insufficient, blocking subsequent addresses in the endpoint list from being announced. The issue occurs when sending ADD_ADDR with an IPv6 address and port while tcp_timestamps is enabled. The first patch clears the signal in this path, skips the matching ADD_ADDR retransmission entry, and preserves PM state-machine progression semantics. The second patch adds a selftest to cover this behavior and prevent regressions. Patch summary: - patch 1: mptcp: pm: fix ADD_ADDR timer infinite retry on option space insufficient - patch 2: selftests: mptcp: join: cover ADD_ADDR tx drop and list progress Changes since v1: - add explicit MIB accounting for this drop path: - MPTCP_MIB_ADDADDRTXDROP for non-echo ADD_ADDR - MPTCP_MIB_ECHOADDTXDROP for echo ADD_ADDR - add explicit handling to stop the matching ADD_ADDR retransmission timer when skipping non-echo ADD_ADDR on pure-ACK option-space exhaustion - keep PM forward progress by advancing PM state after a successful skip of the matching ADD_ADDR entry - Added selftest to verify IPv6 ADD_ADDR with port tx-drop when tcp_timestamps is enabled - Link to v1: https://lore.kernel.org/netdev/20260418100018.2219500-1-lixiasong1@huawei.com/ Thanks to Matthieu Baerts for the detailed review and suggestions on v1, especially around PM progression and retry/timer handling. Li Xiasong (2): mptcp: pm: fix ADD_ADDR timer infinite retry on option space insufficient selftests: mptcp: join: cover ADD_ADDR tx drop and list progress net/mptcp/pm.c | 45 +++++++++++++++--- .../testing/selftests/net/mptcp/mptcp_join.sh | 47 +++++++++++++++++++ 2 files changed, 85 insertions(+), 7 deletions(-) -- 2.34.1
When TCP option space is insufficient (e.g., when sending ADD_ADDR with an IPv6 address and port while tcp_timestamps is enabled), the original code jumped to out_unlock without clearing the addr_signal flag. This caused mptcp_pm_add_timer to keep rescheduling indefinitely, not sending ADD_ADDR, preventing subsequent addresses in the endpoint list from being announced. Handle this case by clearing the ADD_ADDR signal and skipping the matching ADD_ADDR retransmission entry. The skip path cancels the matching timer (with id check) and advances PM state progression, preserving forward progress to subsequent PM work. This avoids endless retries for an ADD_ADDR that cannot be emitted in the current pure-ACK option-space constrained path. Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Li Xiasong <lixiasong1@huawei.com> --- net/mptcp/pm.c | 45 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 38 insertions(+), 7 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ mptcp_pm_del_add_timer(struct mptcp_sock *msk, return entry; } +static void mptcp_pm_add_addr_skip(struct mptcp_sock *msk, + const struct mptcp_addr_info *addr) +{ + mptcp_pm_del_add_timer(msk, addr, true); + mptcp_pm_subflow_established(msk); +} + bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk, const struct mptcp_addr_info *addr) { @@ -XXX,XX +XXX,XX @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb, struct mptcp_addr_info *addr, bool *echo, bool *drop_other_suboptions) { + bool skip_add_addr = false; int ret = false; u8 add_addr; u8 family; @@ -XXX,XX +XXX,XX @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb, } *echo = mptcp_pm_should_add_signal_echo(msk); - port = !!(*echo ? msk->pm.remote.port : msk->pm.local.port); - - family = *echo ? msk->pm.remote.family : msk->pm.local.family; - if (remaining < mptcp_add_addr_len(family, *echo, port)) - goto out_unlock; - if (*echo) { *addr = msk->pm.remote; add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_ECHO); + port = !!msk->pm.remote.port; + family = msk->pm.remote.family; } else { *addr = msk->pm.local; add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_SIGNAL); + port = !!msk->pm.local.port; + family = msk->pm.local.family; } - WRITE_ONCE(msk->pm.addr_signal, add_addr); + + if (remaining < mptcp_add_addr_len(family, *echo, port)) { + struct net *net = sock_net((struct sock *)msk); + + if (!*drop_other_suboptions) + goto out_unlock; + + if (*echo) { + MPTCP_INC_STATS(net, MPTCP_MIB_ECHOADDTXDROP); + } else { + skip_add_addr = true; + MPTCP_INC_STATS(net, MPTCP_MIB_ADDADDRTXDROP); + } + goto drop_signal_mark; + } + ret = true; +drop_signal_mark: + WRITE_ONCE(msk->pm.addr_signal, add_addr); + out_unlock: spin_unlock_bh(&msk->pm.lock); + + /* On pure-ACK option-space exhaustion, stop retrying this ADD_ADDR: + * clear the signal bit, cancel the matching retransmission timer, and + * let the PM state machine progress. + */ + if (skip_add_addr) + mptcp_pm_add_addr_skip(msk, addr); return ret; } -- 2.34.1
Extend add_addr_ports_tests with IPv6 signaling cases that exercise ADD_ADDR tx-space shortage when tcp_timestamps are enabled. Add one case to verify ADD_ADDR tx drop accounting via MPTcpExtAddAddrTxDrop, and another case to verify PM still progresses to later signal endpoints after the first one is dropped. This covers both failure accounting and the non-blocking behavior of the announce list after a tx-space drop on pure ACK. Signed-off-by: Li Xiasong <lixiasong1@huawei.com> --- .../testing/selftests/net/mptcp/mptcp_join.sh | 47 +++++++++++++++++++ 1 file changed, 47 insertions(+) diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -XXX,XX +XXX,XX @@ chk_add_tx_nr() fi } +chk_add_drop_tx_nr() +{ + local add_drop_tx_nr=$1 + local count + + print_check "add addr tx drop" + count=$(mptcp_lib_get_counter ${ns1} "MPTcpExtAddAddrTxDrop") + if [ -z "$count" ]; then + print_skip + elif [ "$count" != "$add_drop_tx_nr" ]; then + fail_test "got $count ADD_ADDR drop[s] TX, expected $add_drop_tx_nr" + else + print_ok + fi +} + chk_rm_nr() { local rm_addr_nr=$1 @@ -XXX,XX +XXX,XX @@ add_addr_ports_tests() chk_mpc_endp_attempt ${retl} 1 fi + + # signal address with port IPv6: tx fails with tcp timestamp + if reset "signal address with port IPv6 tx drop"; then + pm_nl_set_limits $ns1 0 1 + pm_nl_set_limits $ns2 1 0 + ip netns exec $ns1 sysctl -q net.ipv4.tcp_timestamps=1 + ip netns exec $ns2 sysctl -q net.ipv4.tcp_timestamps=1 + pm_nl_add_endpoint $ns1 dead:beef:2::1 flags signal port 10100 + speed=slow \ + run_tests $ns1 $ns2 dead:beef:1::1 + chk_add_nr 0 0 0 + chk_add_drop_tx_nr 1 + fi + + # first signal address drops, second one still progresses + if reset "signal addr list progresses after tx drop"; then + pm_nl_set_limits $ns1 0 2 + pm_nl_set_limits $ns2 1 0 + ip netns exec $ns1 sysctl -q net.ipv4.tcp_timestamps=1 + ip netns exec $ns2 sysctl -q net.ipv4.tcp_timestamps=1 + + pm_nl_add_endpoint $ns1 dead:beef:2::1 flags signal port 10100 + pm_nl_add_endpoint $ns1 dead:beef:3::1 flags signal + + speed=slow \ + run_tests $ns1 $ns2 dead:beef:1::1 + + chk_add_drop_tx_nr 1 + chk_add_tx_nr 1 1 + chk_add_nr 1 1 0 + fi } bind_tests() -- 2.34.1
This series fixes a bug where the ADD_ADDR timer can keep rescheduling indefinitely when TCP option space is insufficient, blocking subsequent addresses in the endpoint list from being announced. The issue is reproducible when sending an IPv6 ADD_ADDR with port while tcp_timestamps is enabled. In that case, option space may be exhausted on pure ACKs, and ADD_ADDR retransmission can stall PM forward progress. Patch 1 clears the signal in this path, skips the matching ADD_ADDR retransmission entry, and preserves PM state-machine progression. Note on concurrency: timer cancellation in this path is best-effort. A concurrent add_timer callback that acquires pm.lock first may still perform one final ADD_ADDR send attempt before cancel state is applied. After cancel sets entry->retrans_times to ADD_ADDR_RETRANS_MAX, the callback-side check prevents further ADD_ADDR retransmissions. Patch 2 adds a selftest to cover this behavior and prevent regressions. Patch summary: - patch 1: mptcp: pm: fix ADD_ADDR timer infinite retry on option space insufficient - patch 2: selftests: mptcp: join: cover ADD_ADDR tx drop and list progress Changes since v2: - document the possible timer-cancel race window in the commit message and code comments - keep best-effort cancellation semantics (possible one extra send attempt under lock-order race), while preventing further retries via retrans_times gating - selftests: remove the redundant ADD_ADDR tx-drop subtest that duplicated existing coverage for list progression behavior - link to v2: https://lore.kernel.org/mptcp/20260430112026.343691-1-lixiasong1@huawei.com/ Changes since v1: - add explicit MIB accounting for this drop path: - MPTCP_MIB_ADDADDRTXDROP for non-echo ADD_ADDR - MPTCP_MIB_ECHOADDTXDROP for echo ADD_ADDR - add explicit handling to stop the matching ADD_ADDR retransmission timer when skipping non-echo ADD_ADDR on pure-ACK option-space exhaustion - keep PM forward progress by advancing PM state after a successful skip of the matching ADD_ADDR entry - add selftest to verify IPv6 ADD_ADDR with port tx-drop when tcp_timestamps is enabled - link to v1: https://lore.kernel.org/netdev/20260418100018.2219500-1-lixiasong1@huawei.com/ Thanks to Matthieu Baerts for the detailed review and suggestions. Li Xiasong (2): mptcp: pm: fix ADD_ADDR timer infinite retry on option space insufficient selftests: mptcp: join: cover ADD_ADDR tx drop and list progress net/mptcp/pm.c | 56 +++++++++++++++---- .../testing/selftests/net/mptcp/mptcp_join.sh | 31 ++++++++++ 2 files changed, 77 insertions(+), 10 deletions(-) -- 2.34.1
When TCP option space is insufficient (e.g., when sending ADD_ADDR with an IPv6 address and port while tcp_timestamps is enabled), the original code jumped to out_unlock without clearing the addr_signal flag. This caused mptcp_pm_add_timer to keep rescheduling indefinitely, not sending ADD_ADDR, preventing subsequent addresses in the endpoint list from being announced. Handle this case by clearing the ADD_ADDR signal and skipping the matching ADD_ADDR retransmission entry. The skip path cancels the matching timer (with id check) and advances PM state progression, preserving forward progress to subsequent PM work. This cancellation is inherently best-effort. A concurrent add_timer callback may already be running and may acquire pm.lock before the cancel path updates entry state. In that case, one final ADD_ADDR transmit attempt can still be executed. Once the cancel path sets entry->retrans_times to ADD_ADDR_RETRANS_MAX, the callback-side retrans_times check suppresses further ADD_ADDR retransmissions. Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Li Xiasong <lixiasong1@huawei.com> --- net/mptcp/pm.c | 56 +++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 46 insertions(+), 10 deletions(-) diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -XXX,XX +XXX,XX @@ static void mptcp_pm_add_timer(struct timer_list *timer) spin_lock_bh(&msk->pm.lock); - if (!mptcp_pm_should_add_signal_addr(msk)) { + /* The cancel path (mptcp_pm_del_add_timer()) can race with this + * callback. Once cancel updates retrans_times to MAX, suppress further + * retransmissions here. If this callback acquires pm.lock first, one + * final transmit attempt is still possible. + */ + if (entry->retrans_times < ADD_ADDR_RETRANS_MAX && + !mptcp_pm_should_add_signal_addr(msk)) { pr_debug("retransmit ADD_ADDR id=%d\n", entry->addr.id); mptcp_pm_announce_addr(msk, &entry->addr, false); mptcp_pm_add_addr_send_ack(msk); @@ -XXX,XX +XXX,XX @@ mptcp_pm_del_add_timer(struct mptcp_sock *msk, /* Note: entry might have been removed by another thread. * We hold rcu_read_lock() to ensure it is not freed under us. */ - if (stop_timer) - sk_stop_timer_sync(sk, &entry->add_timer); + if (stop_timer) { + if (check_id) + sk_stop_timer(sk, &entry->add_timer); + else + sk_stop_timer_sync(sk, &entry->add_timer); + } rcu_read_unlock(); return entry; @@ -XXX,XX +XXX,XX @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb, struct mptcp_addr_info *addr, bool *echo, bool *drop_other_suboptions) { + bool skip_add_addr = false; int ret = false; u8 add_addr; u8 family; @@ -XXX,XX +XXX,XX @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb, } *echo = mptcp_pm_should_add_signal_echo(msk); - port = !!(*echo ? msk->pm.remote.port : msk->pm.local.port); - - family = *echo ? msk->pm.remote.family : msk->pm.local.family; - if (remaining < mptcp_add_addr_len(family, *echo, port)) - goto out_unlock; - if (*echo) { *addr = msk->pm.remote; add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_ECHO); + port = !!msk->pm.remote.port; + family = msk->pm.remote.family; } else { *addr = msk->pm.local; add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_SIGNAL); + port = !!msk->pm.local.port; + family = msk->pm.local.family; } - WRITE_ONCE(msk->pm.addr_signal, add_addr); + + if (remaining < mptcp_add_addr_len(family, *echo, port)) { + struct net *net = sock_net((struct sock *)msk); + + if (!*drop_other_suboptions) + goto out_unlock; + + if (*echo) { + MPTCP_INC_STATS(net, MPTCP_MIB_ECHOADDTXDROP); + } else { + skip_add_addr = true; + MPTCP_INC_STATS(net, MPTCP_MIB_ADDADDRTXDROP); + } + goto drop_signal_mark; + } + ret = true; +drop_signal_mark: + WRITE_ONCE(msk->pm.addr_signal, add_addr); + out_unlock: spin_unlock_bh(&msk->pm.lock); + + /* On pure-ACK option-space exhaustion, stop retrying this ADD_ADDR: + * clear the signal bit, cancel the matching retransmission timer, and + * let the PM state machine progress. + */ + if (skip_add_addr) { + mptcp_pm_del_add_timer(msk, addr, true); + mptcp_pm_subflow_established(msk); + } return ret; } -- 2.34.1
Extend add_addr_ports_tests with IPv6 signaling cases that exercise ADD_ADDR tx-space shortage when tcp_timestamps are enabled. Add one case to verify PM still progresses to later signal endpoints after the first one is dropped. This covers both failure accounting and the non-blocking behavior of the announce list after a tx-space drop on pure ACK. Signed-off-by: Li Xiasong <lixiasong1@huawei.com> --- .../testing/selftests/net/mptcp/mptcp_join.sh | 31 +++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index XXXXXXX..XXXXXXX 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -XXX,XX +XXX,XX @@ chk_add_tx_nr() fi } +chk_add_drop_tx_nr() +{ + local drop_tx_nr=$1 + local count + + print_check "add addr tx drop" + count=$(mptcp_lib_get_counter ${ns1} "MPTcpExtAddAddrTxDrop") + if [ -z "$count" ]; then + print_skip + elif [ "$count" != "$drop_tx_nr" ]; then + fail_test "got $count ADD_ADDR drop[s] TX, expected $drop_tx_nr" + else + print_ok + fi +} + chk_rm_nr() { local rm_addr_nr=$1 @@ -XXX,XX +XXX,XX @@ add_addr_ports_tests() chk_mpc_endp_attempt ${retl} 1 fi + + # first signal address drops, second one still progresses + if reset "signal addr list progresses after tx drop"; then + pm_nl_set_limits $ns1 0 2 + pm_nl_set_limits $ns2 1 0 + ip netns exec $ns1 sysctl -q net.ipv4.tcp_timestamps=1 + ip netns exec $ns2 sysctl -q net.ipv4.tcp_timestamps=1 + + pm_nl_add_endpoint $ns1 dead:beef:2::1 flags signal port 10100 + pm_nl_add_endpoint $ns1 dead:beef:3::1 flags signal + run_tests $ns1 $ns2 dead:beef:1::1 + chk_add_drop_tx_nr 1 + chk_add_tx_nr 1 1 + chk_add_nr 1 1 0 + fi } bind_tests() -- 2.34.1