From: Paolo Abeni
To: mptcp@lists.linux.dev
Subject: [PATCH mptcp-next 1/4] mptcp: avoid unneeded subflow-level drops.
Date: Fri, 31 Oct 2025 18:29:07 +0100

The rcv window is shared among all the subflows. Currently, MPTCP syncs
the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time.

As a consequence, incoming data may sporadically observe an outdated
TCP-level rcv window and be wrongly dropped by TCP.

Address the issue by checking for the edge condition before queuing the
data at TCP level, and syncing the rcv window if needed.

Note that the issue has been present since the very first MPTCP
implementation, but backports to trees older than the blamed commit
below will range from impossible to useless.
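The edge condition described above can be modeled in userspace. The
following is a minimal sketch under stated assumptions, not the kernel
code: struct model_subflow, seq64_after() and model_rwin_update() are
hypothetical names, and the space left in the TCP-level window is passed
in as a plain argument instead of being derived from the socket state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-subflow state touched by the fix. */
struct model_subflow {
	uint32_t tcp_rcv_wnd;	/* models tp->rcv_wnd */
	uint64_t rcv_wnd_sent;	/* models subflow->rcv_wnd_sent */
};

/* Wrap-safe "a is after b" comparison on 64-bit sequence numbers. */
static bool seq64_after(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Returns true when the TCP-level window had to be resynced. */
static bool model_rwin_update(struct model_subflow *sf,
			      uint64_t msk_rcv_wnd_sent,
			      uint32_t skb_len, uint32_t tcp_window_left)
{
	/* Fast path: the data fits without filling the TCP-level window. */
	if (!skb_len || skb_len < tcp_window_left)
		return false;

	/* Nothing to do unless the MPTCP-level window moved forward. */
	if (!seq64_after(msk_rcv_wnd_sent, sf->rcv_wnd_sent))
		return false;

	/* Another subflow grew the shared window: catch up. */
	sf->tcp_rcv_wnd += (uint32_t)(msk_rcv_wnd_sent - sf->rcv_wnd_sent);
	sf->rcv_wnd_sent = msk_rcv_wnd_sent;
	return true;
}
```

In this model, a segment that exactly fills the stale TCP-level window
triggers the resync, while a smaller segment leaves the state untouched.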
Before:
  nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
  TcpExtBeyondWindow              14                 0.0

After:
  nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
  TcpExtBeyondWindow              0                  0.0

Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
Signed-off-by: Paolo Abeni
---
 net/mptcp/options.c  | 31 +++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  1 +
 2 files changed, 32 insertions(+)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index cf531f2d815c..9e2516193e21 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1042,6 +1042,31 @@ static void __mptcp_snd_una_update(struct mptcp_sock *msk, u64 new_snd_una)
 	WRITE_ONCE(msk->snd_una, new_snd_una);
 }
 
+static void rwin_update(struct mptcp_sock *msk, struct sock *ssk,
+			struct sk_buff *skb)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	struct tcp_sock *tp = tcp_sk(ssk);
+	u64 mptcp_rcv_wnd;
+
+	/* Avoid touching extra cachelines if TCP is going to accept this
+	 * skb without filling the TCP-level window even with a possibly
+	 * outdated mptcp-level rwin.
+	 */
+	if (!skb->len || skb->len < tcp_receive_window(tp))
+		return;
+
+	mptcp_rcv_wnd = atomic64_read(&msk->rcv_wnd_sent);
+	if (!after64(mptcp_rcv_wnd, subflow->rcv_wnd_sent))
+		return;
+
+	/* Some other subflow grew the mptcp-level rwin since rcv_wup,
+	 * resync.
+	 */
+	tp->rcv_wnd += mptcp_rcv_wnd - subflow->rcv_wnd_sent;
+	subflow->rcv_wnd_sent = mptcp_rcv_wnd;
+}
+
 static void ack_update_msk(struct mptcp_sock *msk,
 			   struct sock *ssk,
 			   struct mptcp_options_received *mp_opt)
@@ -1209,6 +1234,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (mp_opt.use_ack)
 		ack_update_msk(msk, sk, &mp_opt);
+	rwin_update(msk, sk, skb);
 
 	/* Zero-data-length packets are dropped by the caller and not
 	 * propagated to the MPTCP layer, so the skb extension does not
@@ -1295,6 +1321,10 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 
 	if (rcv_wnd_new != rcv_wnd_old) {
 raise_win:
+		/* the msk-level rcv wnd is after the tcp level one,
+		 * sync the latter
+		 */
+		rcv_wnd_new = rcv_wnd_old;
 		win = rcv_wnd_old - ack_seq;
 		tp->rcv_wnd = min_t(u64, win, U32_MAX);
 		new_win = tp->rcv_wnd;
@@ -1318,6 +1348,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 
 update_wspace:
 	WRITE_ONCE(msk->old_wspace, tp->rcv_wnd);
+	subflow->rcv_wnd_sent = rcv_wnd_new;
 }
 
 __sum16 __mptcp_make_csum(u64 data_seq, u32 subflow_seq, u16 data_len, __wsum sum)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 8e0f780e9210..84f2c51d776c 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -513,6 +513,7 @@ struct mptcp_subflow_context {
 	u64	remote_key;
 	u64	idsn;
 	u64	map_seq;
+	u64	rcv_wnd_sent;
 	u32	snd_isn;
 	u32	token;
 	u32	rel_write_seq;
-- 
2.51.0

From: Paolo Abeni
To: mptcp@lists.linux.dev
Subject: [PATCH mptcp-next 2/4] mptcp: fix receive space time initialization.
Date: Fri, 31 Oct 2025 18:29:08 +0100

MPTCP initializes the receive buffer stamp in mptcp_rcv_space_init(),
using the provided subflow stamp. That helper is invoked in several
places, with a catch-up call in mptcp_rcv_space_adjust().

For passive sockets, MPTCP ends up accessing the subflow stamp before
its initialization, leading to quite random timing for the first
receive buffer auto-tune event.

Fix the issue using a fresh stamp. Drop all the mptcp_rcv_space_init()
invocations except the catch-up one: they add unneeded complexity for
no good reason.
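To see why a stale or zero stamp makes the first event unpredictable,
consider the sketch below. It is a userspace model under stated
assumptions: model_stamp_us_delta() assumes the delta helper truncates
the 64-bit difference to a signed 32-bit value and clamps negative
results to zero, and model_too_early() mirrors the "time < (rtt_us >> 3)"
gate used by mptcp_rcv_space_adjust(). Both names are hypothetical.

```c
#include <stdint.h>

/* Assumed behavior of the delta helper: keep the low 32 bits of the
 * difference, read them as signed, clamp negative values to zero.
 * The unsigned-to-signed conversion wraps on common compilers.
 */
static uint32_t model_stamp_us_delta(uint64_t now_us, uint64_t start_us)
{
	int32_t delta = (int32_t)(uint32_t)(now_us - start_us);

	return delta > 0 ? (uint32_t)delta : 0;
}

/* Auto-tune gate: skip the event until at least rtt/8 has elapsed. */
static int model_too_early(uint32_t elapsed_us, uint32_t rtt_us)
{
	return elapsed_us < (rtt_us >> 3);
}
```

With a fresh stamp the elapsed time reflects the real measurement
window. With a zero stamp, in this model, the result depends only on the
low 32 bits of the clock value, so the gate may open immediately or stay
closed depending on when the connection happens to start.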
As a side effect, this avoids using a zero value for
msk->rcvq_space.time, which made the first receive buffer auto-tune
event quite random.

This will also make the next patch cleaner.

Fixes: 013e3179dbd2 ("mptcp: fix rcv space initialization")
Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 40 ++++++++++++++++++++--------------------
 net/mptcp/protocol.h |  6 +++++-
 net/mptcp/subflow.c  |  2 --
 3 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index d16ad1a85411..e17abab7bab6 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2047,6 +2047,21 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
 	return copied;
 }
 
+static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
+{
+	const struct tcp_sock *tp = tcp_sk(ssk);
+
+	msk->rcvspace_init = 1;
+	msk->rcvq_space.copied = 0;
+	msk->rcvq_space.rtt_us = 0;
+
+	/* initial rcv_space offering made to peer */
+	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
+				      TCP_INIT_CWND * tp->advmss);
+	if (msk->rcvq_space.space == 0)
+		msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
+}
+
 /* receive buffer autotuning. See tcp_rcv_space_adjust for more information.
  *
  * Only difference: Use highest rtt estimate of the subflows in use.
@@ -2069,8 +2084,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 
 	msk->rcvq_space.copied += copied;
 
-	mstamp = div_u64(tcp_clock_ns(), NSEC_PER_USEC);
-	time = tcp_stamp_us_delta(mstamp, msk->rcvq_space.time);
+	mstamp = mptcp_stamp();
+	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
 
 	rtt_us = msk->rcvq_space.rtt_us;
 	if (rtt_us && time < (rtt_us >> 3))
@@ -3487,7 +3502,7 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	mptcp_copy_inaddrs(nsk, ssk);
 	__mptcp_propagate_sndbuf(nsk, ssk);
 
-	mptcp_rcv_space_init(msk, ssk);
+	msk->rcvq_space.time = mptcp_stamp();
 
 	if (mp_opt->suboptions & OPTION_MPTCP_MPC_ACK)
 		__mptcp_subflow_fully_established(msk, subflow, mp_opt);
@@ -3497,23 +3512,6 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	return nsk;
 }
 
-void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
-{
-	const struct tcp_sock *tp = tcp_sk(ssk);
-
-	msk->rcvspace_init = 1;
-	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
-
-	msk->rcvq_space.time = tp->tcp_mstamp;
-
-	/* initial rcv_space offering made to peer */
-	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
-				      TCP_INIT_CWND * tp->advmss);
-	if (msk->rcvq_space.space == 0)
-		msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
-}
-
 static void mptcp_destroy(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -3703,6 +3701,8 @@ void mptcp_finish_connect(struct sock *ssk)
 	 */
 	WRITE_ONCE(msk->local_key, subflow->local_key);
 
+	WRITE_ONCE(msk->rcvq_space.time, mptcp_stamp());
+
 	mptcp_pm_new_connection(msk, ssk, 0);
 }
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 84f2c51d776c..1f67d8468dfb 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -908,7 +908,11 @@ static inline bool mptcp_is_fully_established(struct sock *sk)
 	       READ_ONCE(mptcp_sk(sk)->fully_established);
 }
 
-void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk);
+static inline u64 mptcp_stamp(void)
+{
+	return div_u64(tcp_clock_ns(), NSEC_PER_USEC);
+}
+
 void mptcp_data_ready(struct sock *sk, struct sock *ssk);
 bool mptcp_finish_join(struct sock *sk);
 bool mptcp_schedule_work(struct sock *sk);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index ac8616e7521e..b64ab7649908 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -462,8 +462,6 @@ void __mptcp_sync_state(struct sock *sk, int state)
 
 	subflow = mptcp_subflow_ctx(ssk);
 	__mptcp_propagate_sndbuf(sk, ssk);
-	if (!msk->rcvspace_init)
-		mptcp_rcv_space_init(msk, ssk);
 
 	if (sk->sk_state == TCP_SYN_SENT) {
 		/* subflow->idsn is always available is TCP_SYN_SENT state,
-- 
2.51.0

From: Paolo Abeni
To: mptcp@lists.linux.dev
Subject: [PATCH mptcp-next 3/4] mptcp: better mptcp-level rtt estimator
Date: Fri, 31 Oct 2025 18:29:09 +0100

On high speed links, the MPTCP-level receive buffer auto-tuning happens
with a frequency well above the TCP-level's one. That in turn can cause
excessive/unneeded receive buffer increase.

On such links, the initial rtt_us value is considerably higher than the
actual delay, but the current mptcp_rcv_space_adjust() logic prevents
msk->rcvq_space.rtt_us from decreasing.

Address the issue with a more accurate RTT estimation strategy: the
MPTCP-level RTT is set to the minimum RTT across all the subflows
feeding data into the MPTCP receive buffer.

Some complexity comes from avoiding frequent updates of the MPTCP-level
fields and from allowing subflows that feed data via the backlog to
still perform the update under the msk socket lock.
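The estimation strategy described above can be sketched in userspace as
a running minimum that restarts on every measurement window. This is a
simplified model, not the kernel code: struct model_rtt_est and the
model_rtt_*() helpers are hypothetical names, and both the backlog path
and the "empty receive queue" filter are left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_rtt_est {
	uint32_t rtt_us;	/* minimum RTT seen in the current window */
	bool reset;		/* set when a new measurement window starts */
};

static void model_rtt_init(struct model_rtt_est *est)
{
	est->rtt_us = UINT32_MAX;	/* no sample collected yet */
	est->reset = false;
}

/* Fold one subflow RTT sample into the connection-level estimate. */
static void model_rtt_sample(struct model_rtt_est *est, uint32_t rtt_us)
{
	if (est->reset) {
		/* First sample of a new window replaces the old minimum. */
		est->rtt_us = rtt_us;
		est->reset = false;
		return;
	}

	if (rtt_us < est->rtt_us)
		est->rtt_us = rtt_us;
}

/* Called when a measurement window ends. */
static void model_rtt_new_window(struct model_rtt_est *est)
{
	est->reset = true;
}
```

Restarting the minimum at each window is what lets the estimate rise
again when a fast subflow stops feeding data, instead of staying pinned
to the lowest value ever observed.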
Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 89 ++++++++++++++++++++++++++++++--------------
 net/mptcp/protocol.h |  8 +++-
 2 files changed, 68 insertions(+), 29 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e17abab7bab6..4fc1519baab6 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -865,10 +865,52 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
+static void mptcp_rcv_rtt_update(struct mptcp_sock *msk, u32 rtt_us, u8 sr)
+{
+	/* Similar to plain TCP, only consider samples with empty RX queue */
+	if (mptcp_data_avail(msk))
+		return;
+
+	if (msk->rcv_rtt_est.reset) {
+		msk->rcv_rtt_est.rtt_us = rtt_us;
+		msk->rcv_rtt_est.reset = false;
+		msk->scaling_ratio = sr;
+		return;
+	}
+
+	if (rtt_us < msk->rcv_rtt_est.rtt_us)
+		msk->rcv_rtt_est.rtt_us = rtt_us;
+	if (sr < msk->scaling_ratio)
+		msk->scaling_ratio = sr;
+}
+
+static void mptcp_rcv_rtt_update_from_backlog(struct mptcp_sock *msk)
+{
+	mptcp_rcv_rtt_update(msk, msk->rcv_rtt_est.bl_rtt_us,
+			     msk->rcv_rtt_est.bl_scaling_ratio);
+
+	if (READ_ONCE(msk->rcv_rtt_est.reset_bl)) {
+		msk->rcv_rtt_est.bl_rtt_us = U32_MAX;
+		msk->rcv_rtt_est.bl_scaling_ratio = U8_MAX;
+		msk->rcv_rtt_est.reset_bl = false;
+	}
+}
+
+static void mptcp_backlog_rcv_rtt_update(struct mptcp_sock *msk, u32 rtt_us,
+					 u8 sr)
+{
+	if (rtt_us < msk->rcv_rtt_est.bl_rtt_us)
+		msk->rcv_rtt_est.bl_rtt_us = rtt_us;
+	if (sr < msk->rcv_rtt_est.bl_scaling_ratio)
+		msk->rcv_rtt_est.bl_scaling_ratio = sr;
+}
+
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	u32 rtt_us = tcp_sk(ssk)->rcv_rtt_est.rtt_us;
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	u8 sr = tcp_sk(ssk)->scaling_ratio;
 
 	/* The peer can send data while we are shutting down this
 	 * subflow at subflow destruction time, but we must avoid enqueuing
@@ -879,10 +921,12 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 
 	mptcp_data_lock(sk);
 	if (!sock_owned_by_user(sk)) {
+		mptcp_rcv_rtt_update(msk, rtt_us, sr);
 		/* Wake-up the reader only for in-sequence data */
 		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
 			sk->sk_data_ready(sk);
 	} else {
+		mptcp_backlog_rcv_rtt_update(msk, rtt_us, sr);
 		__mptcp_move_skbs_from_subflow(msk, ssk, false);
 	}
 	mptcp_data_unlock(sk);
@@ -2053,7 +2097,6 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 
 	msk->rcvspace_init = 1;
 	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
 
 	/* initial rcv_space offering made to peer */
 	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
@@ -2070,16 +2113,15 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	u8 scaling_ratio = U8_MAX;
-	u32 time, advmss = 1;
-	u64 rtt_us, mstamp;
+	u32 rtt_us, time;
+	u64 mstamp;
 
 	msk_owned_by_me(msk);
 
 	if (copied <= 0)
 		return;
 
-	if (!msk->rcvspace_init)
+	if (unlikely(!msk->rcvspace_init))
 		mptcp_rcv_space_init(msk, msk->first);
 
 	msk->rcvq_space.copied += copied;
@@ -2087,29 +2129,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	mstamp = mptcp_stamp();
 	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
 
-	rtt_us = msk->rcvq_space.rtt_us;
-	if (rtt_us && time < (rtt_us >> 3))
-		return;
-
-	rtt_us = 0;
-	mptcp_for_each_subflow(msk, subflow) {
-		const struct tcp_sock *tp;
-		u64 sf_rtt_us;
-		u32 sf_advmss;
-
-		tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
-
-		sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
-		sf_advmss = READ_ONCE(tp->advmss);
-
-		rtt_us = max(sf_rtt_us, rtt_us);
-		advmss = max(sf_advmss, advmss);
-		scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
-	}
-
-	msk->rcvq_space.rtt_us = rtt_us;
-	msk->scaling_ratio = scaling_ratio;
-	if (time < (rtt_us >> 3) || rtt_us == 0)
+	rtt_us = msk->rcv_rtt_est.rtt_us;
+	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
 	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
@@ -2137,6 +2158,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 new_measure:
 	msk->rcvq_space.copied = 0;
 	msk->rcvq_space.time = mstamp;
+	msk->rcv_rtt_est.reset = true;
+	WRITE_ONCE(msk->rcv_rtt_est.reset_bl, true);
 }
 
 static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
@@ -2198,6 +2221,7 @@ static bool mptcp_move_skbs(struct sock *sk)
 	u32 moved;
 
 	mptcp_data_lock(sk);
+	mptcp_rcv_rtt_update_from_backlog(mptcp_sk(sk));
 	while (mptcp_can_spool_backlog(sk, &skbs)) {
 		mptcp_data_unlock(sk);
 		enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
@@ -2933,6 +2957,10 @@ static void __mptcp_init_sock(struct sock *sk)
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 	msk->backlog_len = 0;
+	msk->rcv_rtt_est.bl_rtt_us = U32_MAX;
+	msk->rcv_rtt_est.rtt_us = U32_MAX;
+	msk->rcv_rtt_est.bl_scaling_ratio = U8_MAX;
+	msk->scaling_ratio = U8_MAX;
 
 	WRITE_ONCE(msk->first, NULL);
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -3375,6 +3403,10 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	msk->bytes_sent = 0;
 	msk->bytes_retrans = 0;
 	msk->rcvspace_init = 0;
+	msk->scaling_ratio = U8_MAX;
+	msk->rcv_rtt_est.rtt_us = U32_MAX;
+	msk->rcv_rtt_est.bl_rtt_us = U32_MAX;
+	msk->rcv_rtt_est.bl_scaling_ratio = U8_MAX;
 
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
@@ -3560,6 +3592,7 @@ static void mptcp_release_cb(struct sock *sk)
 
 		INIT_LIST_HEAD(&join_list);
 		list_splice_init(&msk->join_list, &join_list);
+		mptcp_rcv_rtt_update_from_backlog(msk);
 
 		/* the following actions acquire the subflow socket lock
 		 *
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 1f67d8468dfb..d38a455f8f5b 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -340,11 +340,17 @@ struct mptcp_sock {
 				 */
 	struct mptcp_pm_data	pm;
 	struct mptcp_sched_ops	*sched;
+	struct {
+		u32	rtt_us;		/* Minimum rtt of subflows */
+		u32	bl_rtt_us;	/* Min rtt of subflows using the bl */
+		u8	bl_scaling_ratio;
+		bool	reset;
+		bool	reset_bl;	/* Protected by data lock */
+	} rcv_rtt_est;
 	struct {
 		int	space;	/* bytes copied in last measurement window */
 		int	copied; /* bytes copied in this measurement window */
 		u64	time;	/* start time of measurement window */
-		u64	rtt_us; /* last maximum rtt of subflows */
 	} rcvq_space;
 	u8		scaling_ratio;
 	bool		allow_subflows;
-- 
2.51.0

From: Paolo Abeni
To: mptcp@lists.linux.dev
Subject: [PATCH mptcp-next 4/4] mptcp: add receive queue awareness in tcp_rcv_space_adjust()
Date: Fri, 31 Oct 2025 18:29:10 +0100
Message-ID: <4c75f202a33758222108c3766c1d80a87072628d.1761931390.git.pabeni@redhat.com>

This is the MPTCP counterpart of commit ea33537d8292 ("tcp: add receive
queue awareness in tcp_rcv_space_adjust()").

Prior to this commit:

ESTAB 33165568 0      192.168.255.2:5201 192.168.255.1:53380 \
	skmem:(r33076416,rb33554432,t0,tb91136,f448,w0,o0,bl0,d0)

After:

ESTAB 3279168 0      192.168.255.2:5201 192.168.255.1:53042 \
	skmem:(r3190912,rb3719956,t0,tb91136,f1536,w0,o0,bl0,d0)

(same throughput)

Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 50 +++++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4fc1519baab6..37a90d644e7b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2105,6 +2105,27 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 		msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
 }
 
+static unsigned int mptcp_inq_hint(const struct sock *sk)
+{
+	const struct mptcp_sock *msk = mptcp_sk(sk);
+	const struct sk_buff *skb;
+
+	skb = skb_peek(&sk->sk_receive_queue);
+	if (skb) {
+		u64 hint_val = READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq;
+
+		if (hint_val >= INT_MAX)
+			return INT_MAX;
+
+		return (unsigned int)hint_val;
+	}
+
+	if (sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN))
+		return 1;
+
+	return 0;
+}
+
 /* receive buffer autotuning. See tcp_rcv_space_adjust for more information.
  *
  * Only difference: Use highest rtt estimate of the subflows in use.
@@ -2133,10 +2154,12 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
-	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
+	copied = msk->rcvq_space.copied;
+	copied -= mptcp_inq_hint(sk);
+	if (copied <= msk->rcvq_space.space)
 		goto new_measure;
 
-	if (mptcp_rcvbuf_grow(sk, msk->rcvq_space.copied)) {
+	if (mptcp_rcvbuf_grow(sk, copied)) {
 
 		/* Make subflows follow along.  If we do not do this, we
 		 * get drops at subflow level if skbs can't be moved to
@@ -2150,7 +2173,7 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 			ssk = mptcp_subflow_tcp_sock(subflow);
 			slow = lock_sock_fast(ssk);
 			if (tcp_sk(ssk)->rcvq_space.space)
-				tcp_rcvbuf_grow(ssk, msk->rcvq_space.copied);
+				tcp_rcvbuf_grow(ssk, copied);
 			unlock_sock_fast(ssk, slow);
 		}
 	}
@@ -2233,27 +2256,6 @@ static bool mptcp_move_skbs(struct sock *sk)
 	return enqueued;
 }
 
-static unsigned int mptcp_inq_hint(const struct sock *sk)
-{
-	const struct mptcp_sock *msk = mptcp_sk(sk);
-	const struct sk_buff *skb;
-
-	skb = skb_peek(&sk->sk_receive_queue);
-	if (skb) {
-		u64 hint_val = READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq;
-
-		if (hint_val >= INT_MAX)
-			return INT_MAX;
-
-		return (unsigned int)hint_val;
-	}
-
-	if (sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN))
-		return 1;
-
-	return 0;
-}
-
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int flags, int *addr_len)
 {
-- 
2.51.0
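
The queue-aware accounting introduced by patch 4/4 can be sketched in
userspace as follows. This is a simplified model under stated
assumptions, not the kernel code: the model_*() names are hypothetical,
sequence numbers are passed in directly, and only the case where at
least one skb is still queued is covered (the closed/shutdown case is
left out).

```c
#include <limits.h>
#include <stdint.h>

/* Bytes still sitting in the receive queue: distance between the next
 * expected sequence number and the first queued byte, clamped to INT_MAX.
 */
static unsigned int model_inq_hint(uint64_t ack_seq, uint64_t first_map_seq)
{
	uint64_t hint = ack_seq - first_map_seq;

	return hint >= INT_MAX ? INT_MAX : (unsigned int)hint;
}

/* Bytes counted in this measurement window, discounting what the
 * application has not read yet. The result may be negative.
 */
static int model_effective_copied(int window_copied, unsigned int inq)
{
	return window_copied - (int)inq;
}

/* Growth is considered only when the discounted amount exceeds the
 * space measured in the previous window.
 */
static int model_should_grow(int window_copied, unsigned int inq, int space)
{
	return model_effective_copied(window_copied, inq) > space;
}
```

With 10000 bytes counted in the window and 3000 still queued, the model
compares 7000 against the previous space, so a reader that lags behind
no longer pushes the buffer to grow as quickly.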