From: Paolo Abeni
To: mptcp@lists.linux.dev
Subject: [PATCH v2 mptcp-next 6/7] mptcp: better mptcp-level rtt estimator
Date: Tue, 4 Nov 2025 22:51:40 +0100

On high speed links, the
MPTCP-level receive buffer auto-tuning runs far more frequently than
the TCP-level one. That, in turn, can cause excessive and unneeded
receive buffer growth.

On such links, the initial rtt_us value is considerably higher than the
actual delay, but the current mptcp_rcv_space_adjust() logic prevents
msk->rcvq_space.rtt_us from decreasing.

Address the issue with a more accurate RTT estimation strategy: the
MPTCP-level RTT is set to the minimum RTT among all the subflows feeding
data into the MPTCP receive buffer, measured over a receive-window
based interval.

Take some care to avoid updating msk-level and ssk-level fields too
often.

Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
Signed-off-by: Paolo Abeni
---
v1 -> v2:
  - do not use explicit reset flags
  - do rcv win based decision instead
  - discard 0 rtt_us samples from subflows
  - discard samples on non empty rx queue
  - discard "too high" samples, see the code comments WRT the whys
---
 include/trace/events/mptcp.h |  2 +-
 net/mptcp/protocol.c         | 74 ++++++++++++++++++++++--------------
 net/mptcp/protocol.h         |  7 +++-
 3 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 71fd6d33f48b..999133000cb8 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -212,7 +212,7 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
 		__be32 *p32;
 
 		__entry->time = time;
-		__entry->rtt_us = msk->rcvq_space.rtt_us >> 3;
+		__entry->rtt_us = msk->rcv_rtt_est.rtt_us >> 3;
 		__entry->copied = msk->rcvq_space.copied;
 		__entry->inq = mptcp_inq_hint(sk);
 		__entry->space = msk->rcvq_space.space;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c85ad7ef29b0..414cca078541 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -868,6 +868,42 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
+static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
+				 struct mptcp_subflow_context *subflow)
+{
+	const struct tcp_sock *tp = tcp_sk(subflow->tcp_sock);
+	u32 rtt_us = tp->rcv_rtt_est.rtt_us;
+	u8 sr = tp->scaling_ratio;
+
+	/* MPTCP can react to incoming acks pushing data on different subflows,
+	 * causing apparent high RTT: ignore large samples; also do the update
+	 * only on RTT changes
+	 */
+	if (tp->rcv_rtt_est.seq == subflow->prev_rtt_seq ||
+	    (subflow->prev_rtt_us && (rtt_us >> 1) > subflow->prev_rtt_us))
+		return;
+
+	/* Similar to plain TCP, only consider samples with empty RX queue. */
+	if (!rtt_us || mptcp_data_avail(msk))
+		return;
+
+	/* Refresh the RTT after a full win per subflow */
+	subflow->prev_rtt_us = rtt_us;
+	subflow->prev_rtt_seq = tp->rcv_rtt_est.seq;
+	if (after(subflow->map_seq, msk->rcv_rtt_est.seq)) {
+		msk->rcv_rtt_est.seq = subflow->map_seq +
+				       tp->rcv_wnd * msk->pm.extra_subflows;
+		msk->rcv_rtt_est.rtt_us = rtt_us;
+		msk->scaling_ratio = sr;
+		return;
+	}
+
+	if (rtt_us < msk->rcv_rtt_est.rtt_us)
+		msk->rcv_rtt_est.rtt_us = rtt_us;
+	if (sr < msk->scaling_ratio)
+		msk->scaling_ratio = sr;
+}
+
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -881,6 +917,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		return;
 
 	mptcp_data_lock(sk);
+	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
 		/* Wake-up the reader only for in-sequence data */
 		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
@@ -2058,6 +2095,7 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk)
 	msk->rcvspace_init = 1;
 
 	mptcp_data_lock(sk);
+	msk->rcv_rtt_est.seq = atomic64_read(&msk->rcv_wnd_sent);
 	__mptcp_sync_rcvspace(sk);
 
 	/* Paranoid check: at least one subflow pushed data to the msk. */
@@ -2076,9 +2114,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	u8 scaling_ratio = U8_MAX;
-	u32 time, advmss = 1;
-	u64 rtt_us, mstamp;
+	u32 rtt_us, time;
+	u64 mstamp;
 
 	msk_owned_by_me(msk);
 
@@ -2093,29 +2130,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	mstamp = mptcp_stamp();
 	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
 
-	rtt_us = msk->rcvq_space.rtt_us;
-	if (rtt_us && time < (rtt_us >> 3))
-		return;
-
-	rtt_us = 0;
-	mptcp_for_each_subflow(msk, subflow) {
-		const struct tcp_sock *tp;
-		u64 sf_rtt_us;
-		u32 sf_advmss;
-
-		tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
-
-		sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
-		sf_advmss = READ_ONCE(tp->advmss);
-
-		rtt_us = max(sf_rtt_us, rtt_us);
-		advmss = max(sf_advmss, advmss);
-		scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
-	}
-
-	msk->rcvq_space.rtt_us = rtt_us;
-	msk->scaling_ratio = scaling_ratio;
-	if (time < (rtt_us >> 3) || rtt_us == 0)
+	rtt_us = READ_ONCE(msk->rcv_rtt_est.rtt_us);
+	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
 	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
@@ -2941,7 +2957,8 @@ static void __mptcp_init_sock(struct sock *sk)
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 	msk->backlog_len = 0;
 	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
+	msk->rcv_rtt_est.rtt_us = U32_MAX;
+	msk->scaling_ratio = U8_MAX;
 
 	WRITE_ONCE(msk->first, NULL);
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -3385,7 +3402,8 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	msk->bytes_retrans = 0;
 	msk->rcvspace_init = 0;
 	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
+	msk->scaling_ratio = U8_MAX;
+	msk->rcv_rtt_est.rtt_us = U32_MAX;
 
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index adc0851bad69..051f21b06d33 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -340,11 +340,14 @@ struct mptcp_sock {
 					 */
 	struct mptcp_pm_data	pm;
 	struct mptcp_sched_ops	*sched;
+	struct {
+		u32	rtt_us;		/* Minimum RTT of subflows */
+		u64	seq;		/* Refresh RTT after this seq */
+	} rcv_rtt_est;
 	struct {
 		int	space;	/* bytes copied in last measurement window */
 		int	copied; /* bytes copied in this measurement window */
 		u64	time;	/* start time of measurement window */
-		u64	rtt_us; /* last maximum rtt of subflows */
 	} rcvq_space;
 	u8		scaling_ratio;
 	bool		allow_subflows;
@@ -523,6 +526,8 @@ struct mptcp_subflow_context {
 	u32	map_data_len;
 	__wsum	map_data_csum;
 	u32	map_csum_len;
+	u32	prev_rtt_us;
+	u32	prev_rtt_seq;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_join : 1,   /* send MP_JOIN */
 		request_bkup : 1,
-- 
2.51.0
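
For reviewers who want to experiment with the sampling rules outside the
kernel, the filtering done by mptcp_rcv_rtt_update() can be approximated
with a small user-space model. This is only a sketch under stated
assumptions, not the kernel code: struct msk_model, struct subflow_model,
struct rtt_sample and rtt_update() are hypothetical names, sequence
numbers are compared without wraparound handling (the kernel uses
after()), the refresh window is passed in directly instead of being
computed from tp->rcv_wnd and msk->pm.extra_subflows, and scaling_ratio
tracking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the msk and subflow fields. */
struct msk_model {
	uint32_t rtt_us;	/* minimum RTT seen in the current window */
	uint64_t refresh_seq;	/* restart the minimum past this sequence */
};

struct subflow_model {
	uint32_t prev_rtt_us;	/* last accepted sample, 0 if none yet */
	uint32_t prev_rtt_seq;	/* TCP-level tag of that sample */
};

/* One sample, as seen when a subflow signals incoming data. */
struct rtt_sample {
	uint32_t rtt_us;	/* subflow receive RTT estimate */
	uint32_t rtt_seq;	/* changes when TCP refreshes its estimate */
	uint64_t map_seq;	/* MPTCP-level sequence of the incoming data */
	uint32_t window;	/* bytes to wait before restarting the minimum */
	bool rx_queue_empty;	/* no data pending at the MPTCP level */
};

/* Returns true if the sample was accepted, false if it was discarded. */
static bool rtt_update(struct msk_model *msk, struct subflow_model *sf,
		       const struct rtt_sample *s)
{
	/* Skip unchanged estimates and samples whose half exceeds the
	 * previously accepted one ("too high" samples).
	 */
	if (s->rtt_seq == sf->prev_rtt_seq ||
	    (sf->prev_rtt_us && (s->rtt_us >> 1) > sf->prev_rtt_us))
		return false;

	/* Skip zero samples and samples taken with a non-empty RX queue. */
	if (!s->rtt_us || !s->rx_queue_empty)
		return false;

	sf->prev_rtt_us = s->rtt_us;
	sf->prev_rtt_seq = s->rtt_seq;

	/* Past the refresh point: start a new window from this sample. */
	if (s->map_seq > msk->refresh_seq) {
		msk->refresh_seq = s->map_seq + s->window;
		msk->rtt_us = s->rtt_us;
		return true;
	}

	/* Inside the window: only keep the minimum. */
	if (s->rtt_us < msk->rtt_us)
		msk->rtt_us = s->rtt_us;
	return true;
}
```

In this model the estimate can only shrink while map_seq stays within
the current window, and it may grow again only when a sample arrives
past refresh_seq, which is the behaviour the commit message describes.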