From nobody Thu Nov 27 14:02:39 2025
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: martineau@kernel.org
Subject: [PATCH mptcp-next v6 3/4] mptcp: better mptcp-level RTT estimator
Date: Wed, 19 Nov 2025 22:54:22 +0100
Message-ID: 
In-Reply-To: 
References: 
Precedence: bulk
X-Mailing-List: mptcp@lists.linux.dev
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The current MPTCP-level RTT estimator has several issues.

On high-speed links, the MPTCP-level receive buffer auto-tuning happens
at a much higher frequency than the TCP-level one, which in turn can
cause excessive/unneeded receive buffer growth. On such links, the
initial rtt_us value is considerably higher than the actual link delay,
and the current mptcp_rcv_space_adjust() updates msk->rcvq_space.rtt_us
with a period equal to that field's previous value: if the initial
rtt_us is 40ms, its first update happens only after 40ms, even when the
subflows observe actual RTTs orders of magnitude lower.

Additionally:

- setting the msk rtt to the maximum among all the subflows' RTTs makes
  DRS constantly overshoot the rcvbuf size when one subflow has
  considerably higher latency than the other(s).
- during unidirectional bulk transfers with multiple active subflows,
  the TCP-level RTT estimator occasionally sees considerably higher
  values than the real link delay, e.g. when the packet scheduler
  reacts to an incoming ack on a given subflow by pushing data on a
  different subflow.
- currently inactive but still open subflows (e.g. switched to backup
  mode) are always considered when computing the msk-level rtt.

Address all the issues above with a more accurate RTT estimation
strategy: set the MPTCP-level RTT to the minimum RTT among all the
subflows actually feeding data into the MPTCP receive buffer, over a
small sliding window of samples. While at it, also use an EWMA to
compute the msk-level scaling_ratio, so that MPTCP can avoid traversing
the subflow list in mptcp_rcv_space_adjust(). Take some care to avoid
updating msk- and ssk-level fields too often.
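To illustrate the idea, here is a minimal, self-contained userspace
sketch of the sampling scheme, not part of the patch; all names and
sample values below are made up for illustration, while the actual
implementation lives in mptcp_rcv_rtt_update()/mptcp_rtt_us_est():

#include <stdint.h>
#include <stdio.h>

#define RTT_SAMPLES 5			/* mirrors MPTCP_RTT_SAMPLES */

struct rtt_window {
	uint32_t samples[RTT_SAMPLES];
	unsigned int next;
};

static void rtt_window_init(struct rtt_window *w)
{
	int i;

	/* UINT32_MAX means "no sample collected yet" */
	for (i = 0; i < RTT_SAMPLES; i++)
		w->samples[i] = UINT32_MAX;
	w->next = 0;
}

static void rtt_window_add(struct rtt_window *w, uint32_t rtt_us)
{
	/* Overwrite the oldest slot: a biased large sample ages out
	 * after at most RTT_SAMPLES further updates.
	 */
	w->samples[w->next] = rtt_us;
	if (++w->next == RTT_SAMPLES)
		w->next = 0;
}

static uint32_t rtt_window_min(const struct rtt_window *w)
{
	uint32_t rtt_us = w->samples[0];
	int i;

	for (i = 1; i < RTT_SAMPLES; i++)
		if (w->samples[i] < rtt_us)
			rtt_us = w->samples[i];
	return rtt_us;
}

/* Same EWMA shape as the scaling_ratio update in the patch:
 * new = (old * 8 - old + sample) / 8, i.e. weight 7/8 for history.
 */
static uint8_t ewma8(uint8_t old, uint8_t sample)
{
	return (((unsigned int)old << 3) - old + sample) >> 3;
}

int main(void)
{
	/* made-up samples: a huge initial value, then realistic ones */
	uint32_t rtts[] = { 40000, 300, 280, 12000, 290 };
	struct rtt_window w;
	uint8_t ratio = 110;	/* arbitrary initial scaling ratio */
	unsigned int i;

	rtt_window_init(&w);
	for (i = 0; i < sizeof(rtts) / sizeof(rtts[0]); i++) {
		rtt_window_add(&w, rtts[i]);
		ratio = ewma8(ratio, 120);
		printf("min rtt_us=%u scaling_ratio=%u\n",
		       rtt_window_min(&w), ratio);
	}
	return 0;
}

Taking the min over a short ring buffer tracks the fastest subflow
actually delivering data, while the fixed window size bounds how long a
single biased large sample can influence the estimate.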
Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
Signed-off-by: Paolo Abeni
---
v4 -> v5:
 - avoid filtering out too high values, use a sliding window instead

v3 -> v4:
 - really refresh msk rtt after a full win per subflow (off-by-one in
   prev revision)
 - sync mptcp_rcv_space_adjust() comment with the new code

v1 -> v2:
 - do not use explicit reset flags - do rcv win based decision instead
 - discard 0 rtt_us samples from subflows
 - discard samples on non empty rx queue
 - discard "too high" samples, see the code comments WRT the whys
---
 include/trace/events/mptcp.h |  2 +-
 net/mptcp/protocol.c         | 63 ++++++++++++++++++++----------------
 net/mptcp/protocol.h         | 38 +++++++++++++++++++++-
 3 files changed, 73 insertions(+), 30 deletions(-)

diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 269d949b2025..04521acba483 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -219,7 +219,7 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
 		__be32 *p32;
 
 		__entry->time = time;
-		__entry->rtt_us = msk->rcvq_space.rtt_us >> 3;
+		__entry->rtt_us = mptcp_rtt_us_est(msk) >> 3;
 		__entry->copied = msk->rcvq_space.copied;
 		__entry->inq = mptcp_inq_hint(sk);
 		__entry->space = msk->rcvq_space.space;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index d6d2d9d1849a..7af4c13b8f60 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -886,6 +886,32 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
+static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
+				 struct mptcp_subflow_context *subflow)
+{
+	const struct tcp_sock *tp = tcp_sk(subflow->tcp_sock);
+	u32 rtt_us = tp->rcv_rtt_est.rtt_us;
+	int id;
+
+	/* Update once per subflow per rcvwnd to avoid touching the msk
+	 * too often.
+	 */
+	if (!rtt_us || tp->rcv_rtt_est.seq == subflow->prev_rtt_seq)
+		return;
+
+	subflow->prev_rtt_seq = tp->rcv_rtt_est.seq;
+
+	/* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
+	id = msk->rcv_rtt_est.next_sample;
+	WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
+	if (++msk->rcv_rtt_est.next_sample == MPTCP_RTT_SAMPLES)
+		msk->rcv_rtt_est.next_sample = 0;
+
+	/* EWMA among the incoming subflows */
+	msk->scaling_ratio = ((msk->scaling_ratio << 3) - msk->scaling_ratio +
+			      tp->scaling_ratio) >> 3;
+}
+
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -899,6 +925,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		return;
 
 	mptcp_data_lock(sk);
+	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
 		/* Wake-up the reader only for in-sequence data */
 		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
@@ -2069,15 +2096,15 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
 
 /* receive buffer autotuning. See tcp_rcv_space_adjust for more information.
  *
- * Only difference: Use highest rtt estimate of the subflows in use.
+ * Only difference: Use lowest rtt estimate of the subflows in use, see
+ * mptcp_rcv_rtt_update() and mptcp_rtt_us_est().
  */
 static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	u8 scaling_ratio = U8_MAX;
-	u32 time, advmss = 1;
-	u64 rtt_us, mstamp;
+	u32 time, rtt_us;
+	u64 mstamp;
 
 	msk_owned_by_me(msk);
 
@@ -2092,29 +2119,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	mstamp = mptcp_stamp();
 	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
 
-	rtt_us = msk->rcvq_space.rtt_us;
-	if (rtt_us && time < (rtt_us >> 3))
-		return;
-
-	rtt_us = 0;
-	mptcp_for_each_subflow(msk, subflow) {
-		const struct tcp_sock *tp;
-		u64 sf_rtt_us;
-		u32 sf_advmss;
-
-		tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
-
-		sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
-		sf_advmss = READ_ONCE(tp->advmss);
-
-		rtt_us = max(sf_rtt_us, rtt_us);
-		advmss = max(sf_advmss, advmss);
-		scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
-	}
-
-	msk->rcvq_space.rtt_us = rtt_us;
-	msk->scaling_ratio = scaling_ratio;
-	if (time < (rtt_us >> 3) || rtt_us == 0)
+	rtt_us = mptcp_rtt_us_est(msk);
+	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
 	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
@@ -2960,6 +2966,7 @@ static void __mptcp_init_sock(struct sock *sk)
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 	msk->backlog_len = 0;
+	mptcp_init_rtt_est(msk);
 
 	WRITE_ONCE(msk->first, NULL);
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -3403,6 +3410,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	msk->bytes_sent = 0;
 	msk->bytes_retrans = 0;
 	msk->rcvspace_init = 0;
+	mptcp_init_rtt_est(msk);
 
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
@@ -3546,7 +3554,6 @@ void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 
 	msk->rcvspace_init = 1;
 	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
 
 	/* initial rcv_space offering made to peer */
 	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 95c62f2ac705..df33b30043b6 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -269,6 +269,13 @@ struct mptcp_data_frag {
 	struct page *page;
 };
 
+/* Arbitrary compromise between as low as possible, to react timely to
+ * subflow close events, and as big as possible, to avoid being fooled
+ * by biased large samples due to the peer sending data on a different
+ * subflow WRT the incoming ack.
+ */
+#define MPTCP_RTT_SAMPLES	5
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
@@ -340,11 +347,17 @@ struct mptcp_sock {
 	 */
 	struct mptcp_pm_data	pm;
 	struct mptcp_sched_ops	*sched;
+
+	/* Most recent rtt_us observed by the in-use incoming subflows.
+	 */
+	struct {
+		u32	samples[MPTCP_RTT_SAMPLES];
+		u32	next_sample;
+	} rcv_rtt_est;
+
 	struct {
 		int	space;	/* bytes copied in last measurement window */
 		int	copied;	/* bytes copied in this measurement window */
 		u64	time;	/* start time of measurement window */
-		u64	rtt_us;	/* last maximum rtt of subflows */
 	} rcvq_space;
 	u8	scaling_ratio;
 	bool	allow_subflows;
@@ -422,6 +435,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
 	return msk->first_pending;
 }
 
+static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
+{
+	int i;
+
+	for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
+		msk->rcv_rtt_est.samples[i] = U32_MAX;
+	msk->rcv_rtt_est.next_sample = 0;
+	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+}
+
+static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
+{
+	u32 rtt_us = READ_ONCE(msk->rcv_rtt_est.samples[0]);
+	int i;
+
+	/* Lockless access of collected samples. */
+	for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
+		rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
+	return rtt_us;
+}
+
 static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -523,6 +557,8 @@ struct mptcp_subflow_context {
 	u32	map_data_len;
 	__wsum	map_data_csum;
 	u32	map_csum_len;
+	u32	prev_rtt_us;
+	u32	prev_rtt_seq;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_join : 1,   /* send MP_JOIN */
 		request_bkup : 1,
-- 
2.51.1