From: Gang Yan <yangang@kylinos.cn>

After an MPTCP connection is established, the sk_sndbuf of client's msk
can be updated through 'subflow_finish_connect'. However, the newly
accepted msk on the server side has a smaller sk_sndbuf than
msk->first->sk_sndbuf:

'''
MPTCP: msk:00000000e55b09db, msk->sndbuf:20480, msk->first->sndbuf:2626560
'''

This means that when the server immediately sends MSG_DONTWAIT data to
the client after the connection is established, it is more likely to
encounter EAGAIN.

This patch synchronizes the sk_sndbuf by triggering its update during
accept.

Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning")
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/602
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Hi Paolo, Matt,

Sorry for the late response on this patch.

I've been analyzing this issue recently, and the basic picture is as
follows:

The root cause is a timing gap between msk creation and TCP sndbuf
auto-tuning on the server side:

1. When the server receives the SYN, mptcp_sk_clone_init() creates the
   msk and calls __mptcp_propagate_sndbuf(). At this point, the TCP
   subflow is still in SYN_RCVD state, so its sk_sndbuf has only the
   initial value (tcp_wmem[1], typically ~16KB).

2. When the 3-way handshake completes (ACK received), the TCP stack
   calls tcp_init_buffer_space() -> tcp_sndbuf_expand(), which grows
   the subflow's sk_sndbuf based on MSS, congestion window, etc.
   (potentially up to tcp_wmem[2], ~4MB).

3. However, this auto-tuning happens deep in the TCP stack without any
   callback to MPTCP, so msk->sk_sndbuf is never updated to reflect the
   new subflow sndbuf value.

4. When accept() returns, msk->sk_sndbuf still holds the small initial
   value, while msk->first->sk_sndbuf has been auto-tuned to a much
   larger value.

In contrast, the active (client) side doesn't have this issue because
subflow_finish_connect() calls mptcp_propagate_state() after the TCP
sndbuf auto-tuning has already occurred, ensuring proper
synchronization.

Thanks
Gang

 net/mptcp/protocol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index XXXXXXX..XXXXXXX 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -XXX,XX +XXX,XX @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
 		mptcp_graft_subflows(newsk);
 		mptcp_rps_record_subflows(msk);
+		__mptcp_propagate_sndbuf(newsk, mptcp_subflow_tcp_sock(subflow));
 		/* Do late cleanup for the first subflow as necessary. Also
 		 * deal with bad peers not doing a complete shutdown.
--
2.43.0
From: Gang Yan <yangang@kylinos.cn>

Changelog:
v4:
  - Update the commit log to show the root cause.
  - Remove the '__mptcp_propagate_sndbuf' in 'mptcp_sk_clone_init'
    according to Paolo's advice.
  - Replace 'grep' with 'sed' according to AI's suggestions.
  - Create a packetdrill PR for test:
    https://github.com/multipath-tcp/packetdrill/pull/193
v3:
  - Add a check in diag.sh to check the sndbuf of the server side.

Gang Yan (2):
  mptcp: sync the msk->sndbuf at accept() time
  selftests: mptcp: add a check for sndbuf of S/C

 net/mptcp/protocol.c                      |  2 +-
 tools/testing/selftests/net/mptcp/diag.sh | 28 +++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

--
2.43.0
From: Gang Yan <yangang@kylinos.cn>

On passive MPTCP connections, the msk sndbuf is not updated correctly.
The root cause is a timing issue in the accept path:

- tcp_check_req() -> subflow_syn_recv_sock() -> mptcp_sk_clone_init()
  calls __mptcp_propagate_sndbuf() to copy the ssk sndbuf into msk

- Later, tcp_child_process() -> tcp_init_transfer() ->
  tcp_sndbuf_expand() grows the ssk sndbuf.

So __mptcp_propagate_sndbuf() runs before the ssk sndbuf has been
expanded and the msk ends up with a much smaller sndbuf than the
subflow:

  MPTCP: msk->sndbuf:20480, msk->first->sndbuf:2626560

Fix this by removing the __mptcp_propagate_sndbuf() call in
mptcp_sk_clone_init(), as the ssk sndbuf is not yet finalized there.
Instead, call __mptcp_propagate_sndbuf() at accept() time, when the ssk
sndbuf has been fully expanded by tcp_sndbuf_expand().

Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/602
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index XXXXXXX..XXXXXXX 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -XXX,XX +XXX,XX @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	 * uses the correct data
 	 */
 	mptcp_copy_inaddrs(nsk, ssk);
-	__mptcp_propagate_sndbuf(nsk, ssk);
 	mptcp_rcv_space_init(msk, ssk);
 	msk->rcvq_space.time = mptcp_stamp();
@@ -XXX,XX +XXX,XX @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
 		mptcp_graft_subflows(newsk);
 		mptcp_rps_record_subflows(msk);
+		__mptcp_propagate_sndbuf(newsk, mptcp_subflow_tcp_sock(subflow));
 		/* Do late cleanup for the first subflow as necessary. Also
 		 * deal with bad peers not doing a complete shutdown.
--
2.43.0
From: Gang Yan <yangang@kylinos.cn>

Add a new chk_sndbuf() helper to diag.sh that extracts the sndbuf (the
'tb' field from 'ss -m' skmem output) for both server and client MPTCP
sockets, and verifies they are equal.

Without the previous patch, it will fail:

'''
07 ....chk sndbuf server/client    [FAIL] sndbuf S=20480 != C=2630656
'''

Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
 tools/testing/selftests/net/mptcp/diag.sh | 28 +++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/testing/selftests/net/mptcp/diag.sh b/tools/testing/selftests/net/mptcp/diag.sh
index XXXXXXX..XXXXXXX 100755
--- a/tools/testing/selftests/net/mptcp/diag.sh
+++ b/tools/testing/selftests/net/mptcp/diag.sh
@@ -XXX,XX +XXX,XX @@ wait_connected()
 	done
 }
+chk_sndbuf()
+{
+	local server_sndbuf client_sndbuf msg
+	local port=${1}
+
+	msg="....chk sndbuf server/client"
+	server_sndbuf=$(ss -N "${ns}" -inmHM "sport" "${port}" | \
+			sed -n 's/.*tb\([0-9]\+\).*/\1/p')
+	client_sndbuf=$(ss -N "${ns}" -inmHM "dport" "${port}" | \
+			sed -n 's/.*tb\([0-9]\+\).*/\1/p')
+
+	mptcp_lib_print_title "${msg}"
+	if [ -z "${server_sndbuf}" ] || [ -z "${client_sndbuf}" ]; then
+		mptcp_lib_pr_fail "sndbuf S=${server_sndbuf} C=${client_sndbuf}"
+		mptcp_lib_result_fail "${msg}"
+		ret=${KSFT_FAIL}
+	elif [ "${server_sndbuf}" != "${client_sndbuf}" ]; then
+		mptcp_lib_pr_fail "sndbuf S=${server_sndbuf} != C=${client_sndbuf}"
+		mptcp_lib_result_fail "${msg}"
+		ret=${KSFT_FAIL}
+	else
+		mptcp_lib_pr_ok
+		mptcp_lib_result_pass "${msg}"
+	fi
+}
+
+
 trap cleanup EXIT
 mptcp_lib_ns_init ns
@@ -XXX,XX +XXX,XX @@ echo "b" | \
 				127.0.0.1 >/dev/null &
 wait_connected $ns 10000
 chk_msk_nr 2 "after MPC handshake"
+chk_sndbuf 10000
 chk_last_time_info 10000
 chk_msk_remote_key_nr 2 "....chk remote_key"
 chk_msk_fallback_nr 0 "....chk no fallback"
--
2.43.0
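For reviewers who want to try the 'tb' extraction outside the selftest, here is a minimal sketch of the sed expression from chk_sndbuf() applied to a canned line. The extract_sndbuf name and the sample skmem string are illustrative assumptions, not part of the patch; the sample only mimics the skmem:(...) layout printed by 'ss -m', with the 20480 value borrowed from the failure output above. The '\+' in the expression is a GNU sed extension, so this assumes GNU sed.

```shell
#!/bin/sh
# Hypothetical helper (not in the patch): read 'ss -m' style text on stdin
# and print the number following 'tb' (the send buffer size), using the
# same sed expression as chk_sndbuf(). Requires GNU sed because of '\+'.
extract_sndbuf()
{
	sed -n 's/.*tb\([0-9]\+\).*/\1/p'
}

# Illustrative sample modelled on the skmem layout; the values are made up
# except for tb20480, which is taken from the failing run quoted above.
sample='skmem:(r0,rb131072,t0,tb20480,f0,w0,o0,bl0,d0)'

printf '%s\n' "${sample}" | extract_sndbuf
```

When the input has no 'tb<digits>' field, the helper prints nothing, which is the case the first branch of chk_sndbuf() reports as a failure.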