From nobody Sun Mar 22 08:21:24 2026
From: "Matthieu Baerts (NGI0)"
Date: Mon, 09 Mar 2026 16:56:45 +0100
Subject: [PATCH net-next 1/2] mptcp: better mptcp-level RTT estimator
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Message-Id: <20260309-net-next-mptcp-reduce-rbuf-v1-1-8f471206f9c5@kernel.org>
References: <20260309-net-next-mptcp-reduce-rbuf-v1-0-8f471206f9c5@kernel.org>
In-Reply-To: <20260309-net-next-mptcp-reduce-rbuf-v1-0-8f471206f9c5@kernel.org>
To: Mat Martineau, Geliang Tang, "David S. Miller", Eric Dumazet,
 Jakub Kicinski, Paolo Abeni, Simon Horman, Florian Westphal
Cc: netdev@vger.kernel.org, mptcp@lists.linux.dev,
 linux-kernel@vger.kernel.org, "Matthieu Baerts (NGI0)", Steven Rostedt,
 Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel@vger.kernel.org
X-Mailer: b4 0.14.3
X-Developer-Key: i=matttbe@kernel.org; a=openpgp; fpr=E8CB85F76877057A6E27F77AF6B7824F4269A073

From: Paolo Abeni

The current MPTCP-level RTT estimator has several issues. On high speed
links, the MPTCP-level receive buffer auto-tuning happens with a
frequency well above the TCP-level's one. That in turn can cause
excessive/unneeded receive buffer increase.
On such links, the initial rtt_us value is considerably higher than the
actual delay, and the current mptcp_rcv_space_adjust() updates
msk->rcvq_space.rtt_us with a period equal to that field's previous
value. If the initial rtt_us is 40ms, its first update will happen after
40ms, even if the subflows see actual RTTs orders of magnitude lower.

Additionally:

- setting the msk RTT to the maximum among all the subflows' RTTs makes
  DRS constantly overshoot the rcvbuf size when a subflow has
  considerably higher latency than the other(s).

- during unidirectional bulk transfers with multiple active subflows,
  the TCP-level RTT estimator occasionally sees a considerably higher
  value than the real link delay, i.e. when the packet scheduler reacts
  to an incoming ACK on a given subflow by pushing data on a different
  subflow.

- currently inactive but still open subflows (i.e. switched to backup
  mode) are always considered when computing the msk-level RTT.

Address all the issues above with a more accurate RTT estimation
strategy: the MPTCP-level RTT is set to the minimum of all the subflows
actually feeding data into the MPTCP receive buffer, using a small
sliding window.

While at it, also use EWMA to compute the msk-level scaling_ratio, so
that MPTCP can avoid traversing the subflow list in
mptcp_rcv_space_adjust().

Use some care to avoid updating msk and ssk level fields too often.
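For readers who want to experiment with the strategy outside the kernel, the following is a minimal userspace sketch of the sliding-window minimum and the EWMA described above. It is not the kernel code: the names rtt_est, rtt_est_init(), rtt_est_update() and rtt_est_min() are hypothetical, RTT values are treated as opaque integers, and there is no locking or READ_ONCE()/WRITE_ONCE() pairing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the estimator; not kernel code. */
#define RTT_SAMPLES 5

struct rtt_est {
	uint32_t samples[RTT_SAMPLES];	/* most recent per-subflow RTT samples */
	uint32_t next_sample;		/* slot overwritten by the next sample */
	uint8_t scaling_ratio;		/* EWMA of the subflows' scaling ratios */
};

/* Mark every slot as "no sample yet", so the minimum starts at UINT32_MAX. */
static inline void rtt_est_init(struct rtt_est *est, uint8_t default_ratio)
{
	assert(est);
	for (int i = 0; i < RTT_SAMPLES; i++)
		est->samples[i] = UINT32_MAX;
	est->next_sample = 0;
	est->scaling_ratio = default_ratio;
}

/* Store one sample in the ring and fold the subflow ratio into the EWMA
 * with weight 1/8: new = (7 * old + ratio) / 8.
 */
static inline void rtt_est_update(struct rtt_est *est, uint32_t rtt_us,
				  uint8_t ratio)
{
	if (!rtt_us)
		return;	/* the subflow has no estimate yet */

	est->samples[est->next_sample] = rtt_us;
	if (++est->next_sample == RTT_SAMPLES)
		est->next_sample = 0;

	est->scaling_ratio = (uint8_t)((((uint32_t)est->scaling_ratio << 3) -
					est->scaling_ratio + ratio) >> 3);
}

/* The estimate is the minimum over the window; UINT32_MAX means "unknown". */
static inline uint32_t rtt_est_min(const struct rtt_est *est)
{
	uint32_t rtt_us = est->samples[0];

	for (int i = 1; i < RTT_SAMPLES; i++)
		if (est->samples[i] < rtt_us)
			rtt_us = est->samples[i];
	return rtt_us;
}
```

In this model a single low sample dominates the estimate until RTT_SAMPLES newer samples have overwritten it, which is the trade-off the window size controls: a larger window filters more biased samples, a smaller one forgets a closed subflow sooner.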
Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
Signed-off-by: Paolo Abeni
Reviewed-by: Mat Martineau
Signed-off-by: Matthieu Baerts (NGI0)
---
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: linux-trace-kernel@vger.kernel.org
---
 include/trace/events/mptcp.h |  2 +-
 net/mptcp/protocol.c         | 63 ++++++++++++++++++++++++--------------------
 net/mptcp/protocol.h         | 38 +++++++++++++++++++++++++-
 3 files changed, 73 insertions(+), 30 deletions(-)

diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 269d949b2025..04521acba483 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -219,7 +219,7 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
 		__be32 *p32;
 
 		__entry->time = time;
-		__entry->rtt_us = msk->rcvq_space.rtt_us >> 3;
+		__entry->rtt_us = mptcp_rtt_us_est(msk) >> 3;
 		__entry->copied = msk->rcvq_space.copied;
 		__entry->inq = mptcp_inq_hint(sk);
 		__entry->space = msk->rcvq_space.space;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3da3da2c81b1..1ce6b9f51ea4 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -879,6 +879,32 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
+static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
+				 struct mptcp_subflow_context *subflow)
+{
+	const struct tcp_sock *tp = tcp_sk(subflow->tcp_sock);
+	u32 rtt_us = tp->rcv_rtt_est.rtt_us;
+	int id;
+
+	/* Update once per subflow per rcvwnd to avoid touching the msk
+	 * too often.
+	 */
+	if (!rtt_us || tp->rcv_rtt_est.seq == subflow->prev_rtt_seq)
+		return;
+
+	subflow->prev_rtt_seq = tp->rcv_rtt_est.seq;
+
+	/* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
+	id = msk->rcv_rtt_est.next_sample;
+	WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
+	if (++msk->rcv_rtt_est.next_sample == MPTCP_RTT_SAMPLES)
+		msk->rcv_rtt_est.next_sample = 0;
+
+	/* EWMA among the incoming subflows */
+	msk->scaling_ratio = ((msk->scaling_ratio << 3) - msk->scaling_ratio +
+			     tp->scaling_ratio) >> 3;
+}
+
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -892,6 +918,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		return;
 
 	mptcp_data_lock(sk);
+	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
 		/* Wake-up the reader only for in-sequence data */
 		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
@@ -2074,7 +2101,6 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 
 	msk->rcvspace_init = 1;
 	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
 
 	/* initial rcv_space offering made to peer */
 	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
@@ -2085,15 +2111,15 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 
 /* receive buffer autotuning. See tcp_rcv_space_adjust for more information.
  *
- * Only difference: Use highest rtt estimate of the subflows in use.
+ * Only difference: Use lowest rtt estimate of the subflows in use, see
+ * mptcp_rcv_rtt_update() and mptcp_rtt_us_est().
  */
 static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	u8 scaling_ratio = U8_MAX;
-	u32 time, advmss = 1;
-	u64 rtt_us, mstamp;
+	u32 time, rtt_us;
+	u64 mstamp;
 
 	msk_owned_by_me(msk);
 
@@ -2108,29 +2134,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	mstamp = mptcp_stamp();
 	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
 
-	rtt_us = msk->rcvq_space.rtt_us;
-	if (rtt_us && time < (rtt_us >> 3))
-		return;
-
-	rtt_us = 0;
-	mptcp_for_each_subflow(msk, subflow) {
-		const struct tcp_sock *tp;
-		u64 sf_rtt_us;
-		u32 sf_advmss;
-
-		tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
-
-		sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
-		sf_advmss = READ_ONCE(tp->advmss);
-
-		rtt_us = max(sf_rtt_us, rtt_us);
-		advmss = max(sf_advmss, advmss);
-		scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
-	}
-
-	msk->rcvq_space.rtt_us = rtt_us;
-	msk->scaling_ratio = scaling_ratio;
-	if (time < (rtt_us >> 3) || rtt_us == 0)
+	rtt_us = mptcp_rtt_us_est(msk);
+	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
 	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
@@ -2995,6 +3000,7 @@ static void __mptcp_init_sock(struct sock *sk)
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 	msk->backlog_len = 0;
+	mptcp_init_rtt_est(msk);
 
 	WRITE_ONCE(msk->first, NULL);
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -3440,6 +3446,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	msk->bytes_retrans = 0;
 	msk->rcvspace_init = 0;
 	msk->fastclosing = 0;
+	mptcp_init_rtt_est(msk);
 
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 0bd1ee860316..6ec65c0ae655 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -268,6 +268,13 @@ struct mptcp_data_frag {
 	struct page *page;
 };
 
+/* Arbitrary compromise between as low as possible to react timely to subflow
+ * close event and as big as possible to avoid being fouled by biased large
+ * samples due to peer sending data on a different subflow WRT to the incoming
+ * ack.
+ */
+#define MPTCP_RTT_SAMPLES 5
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
@@ -340,11 +347,17 @@ struct mptcp_sock {
 	 */
 	struct mptcp_pm_data pm;
 	struct mptcp_sched_ops *sched;
+
+	/* Most recent rtt_us observed by in use incoming subflows. */
+	struct {
+		u32 samples[MPTCP_RTT_SAMPLES];
+		u32 next_sample;
+	} rcv_rtt_est;
+
 	struct {
 		int space; /* bytes copied in last measurement window */
 		int copied; /* bytes copied in this measurement window */
 		u64 time; /* start time of measurement window */
-		u64 rtt_us; /* last maximum rtt of subflows */
 	} rcvq_space;
 	u8 scaling_ratio;
 	bool allow_subflows;
@@ -422,6 +435,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
 	return msk->first_pending;
 }
 
+static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
+{
+	int i;
+
+	for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
+		msk->rcv_rtt_est.samples[i] = U32_MAX;
+	msk->rcv_rtt_est.next_sample = 0;
+	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+}
+
+static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
+{
+	u32 rtt_us = msk->rcv_rtt_est.samples[0];
+	int i;
+
+	/* Lockless access of collected samples. */
+	for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
+		rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
+	return rtt_us;
+}
+
 static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -523,6 +557,8 @@ struct mptcp_subflow_context {
 	u32 map_data_len;
 	__wsum map_data_csum;
 	u32 map_csum_len;
+	u32 prev_rtt_us;
+	u32 prev_rtt_seq;
 	u32 request_mptcp : 1, /* send MP_CAPABLE */
 		request_join : 1, /* send MP_JOIN */
 		request_bkup : 1,
-- 
2.53.0

From nobody Sun Mar 22 08:21:24 2026
From: "Matthieu Baerts (NGI0)"
Date: Mon, 09 Mar 2026 16:56:46 +0100
Subject: [PATCH net-next 2/2] mptcp: add receive queue awareness in tcp_rcv_space_adjust()
Precedence: bulk
X-Mailing-List: mptcp@lists.linux.dev
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Message-Id: <20260309-net-next-mptcp-reduce-rbuf-v1-2-8f471206f9c5@kernel.org>
References: <20260309-net-next-mptcp-reduce-rbuf-v1-0-8f471206f9c5@kernel.org>
In-Reply-To: <20260309-net-next-mptcp-reduce-rbuf-v1-0-8f471206f9c5@kernel.org>
To: Mat Martineau, Geliang Tang, "David S. Miller", Eric Dumazet,
 Jakub Kicinski, Paolo Abeni, Simon Horman, Florian Westphal
Cc: netdev@vger.kernel.org, mptcp@lists.linux.dev,
 linux-kernel@vger.kernel.org, "Matthieu Baerts (NGI0)"
X-Mailer: b4 0.14.3
X-Developer-Key: i=matttbe@kernel.org; a=openpgp; fpr=E8CB85F76877057A6E27F77AF6B7824F4269A073

From: Paolo Abeni

This is the MPTCP counter-part of commit ea33537d8292 ("tcp: add receive
queue awareness in tcp_rcv_space_adjust()").

Prior to this commit:

ESTAB 33165568 0 192.168.255.2:5201 192.168.255.1:53380 \
	skmem:(r33076416,rb33554432,t0,tb91136,f448,w0,o0,bl0,d0)

After:

ESTAB 3279168 0 192.168.255.2:5201 192.168.255.1]:53042 \
	skmem:(r3190912,rb3719956,t0,tb91136,f1536,w0,o0,bl0,d0)

Same throughput.

Reviewed-by: Mat Martineau
Signed-off-by: Paolo Abeni
Signed-off-by: Matthieu Baerts (NGI0)
---
 net/mptcp/protocol.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 1ce6b9f51ea4..5f39eafc6e87 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2138,11 +2138,13 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
-	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
+	copied = msk->rcvq_space.copied;
+	copied -= mptcp_inq_hint(sk);
+	if (copied <= msk->rcvq_space.space)
 		goto new_measure;
 
 	trace_mptcp_rcvbuf_grow(sk, time);
-	if (mptcp_rcvbuf_grow(sk, msk->rcvq_space.copied)) {
+	if (mptcp_rcvbuf_grow(sk, copied)) {
 		/* Make subflows follow along. If we do not do this, we
 		 * get drops at subflow level if skbs can't be moved to
 		 * the mptcp rx queue fast enough (announced rcv_win can
@@ -2156,7 +2158,7 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 			slow = lock_sock_fast(ssk);
 			/* subflows can be added before tcp_init_transfer() */
 			if (tcp_sk(ssk)->rcvq_space.space)
-				tcp_rcvbuf_grow(ssk, msk->rcvq_space.copied);
+				tcp_rcvbuf_grow(ssk, copied);
 			unlock_sock_fast(ssk, slow);
 		}
 	}
-- 
2.53.0
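
The decision changed by patch 2/2 can be modelled in isolation. The sketch below is a userspace illustration only: rcvbuf_should_grow() and its parameters are hypothetical names, with `copied` standing in for msk->rcvq_space.copied, `inq` for the value returned by mptcp_inq_hint(), and `space` for msk->rcvq_space.space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the grow decision; not kernel code.
 *
 * The bytes still sitting in the receive queue (inq) are subtracted from
 * the bytes copied in this measurement window before the comparison with
 * the previous window's space, mirroring the two added lines in the diff.
 * The adjusted value is also what would be handed to the grow helpers.
 */
static inline bool rcvbuf_should_grow(int copied, int inq, int space,
				      int *effective)
{
	assert(effective != NULL);

	*effective = copied - inq;
	return *effective > space;
}
```

With copied = 1000 and space = 500, a queue holding 900 unread bytes gives an adjusted value of 100 and no growth, while a queue holding 100 bytes gives 900 and the buffer is allowed to grow.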