From nobody Sat Oct 11 09:45:50 2025 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 33805258EE9 for ; Thu, 18 Sep 2025 17:15:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758215715; cv=none; b=iJKhJ0MQ30CgNQw/HSEn/kf1OdMF6mbvi9H6QSYb/1mwG1+QKpH78k+Ry63Ud/rbS4sCo5Pg4frgfQWEu6Mgjqit1SjTXxQCfkkCUgPthc3aQSDWrvKXC688OdCfgDjUVptryzBTVxxv9fM3pvUnwoFGCf8agPwI4XylChIViIo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758215715; c=relaxed/simple; bh=NzGf6W/q5ObL5GmP1XzGdlzIClQcXkms2rxHsAIx2QY=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=gII0EJzR6YtCl0QboYk9DH3PFslldKIrYH1kZh9eGaFTL3etkiMyDfSg+8tviA5tGdiy2Y5orjgXPiWZwI2R47NyxHgf5TV0GwebtqnuXu4jcBw2i2OF0kL+uvnZV+4jF+gixKOt1aWcFQt87m9ioHwlw7VMvAJ2nKbLuBbhBy8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=KoCxrRF3; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="KoCxrRF3" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1758215712; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5gjeFONxUKWySVppkSVXjFo5IEQbM26fyHBdWqLbC+c=; b=KoCxrRF34sAboeneelzvJj8HYEC4o1jxbKSSws89RxbciP+z/smkPLaoebyKMBt6z3drWh JnGOHzybpnljBIghG1D7aTWRv0nIjPu+T3QkRP9mvd1qT+vIWvi72tKCTaMxzcsbqeVLSY jWohQ3IYKjrCMGY3oQ12Y0G3zr+wya4= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-646-thp3VCssN-OT3LSVfqe_8w-1; Thu, 18 Sep 2025 13:15:10 -0400 X-MC-Unique: thp3VCssN-OT3LSVfqe_8w-1 X-Mimecast-MFC-AGG-ID: thp3VCssN-OT3LSVfqe_8w_1758215708 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id CEE52180057C for ; Thu, 18 Sep 2025 17:15:08 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.32.207]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 0B85530002C5 for ; Thu, 18 Sep 2025 17:15:07 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 03/12] mptcp: rcvbuf auto-tuning improvement Date: Thu, 18 Sep 2025 19:14:45 +0200 Message-ID: <41db4ac9e54972274efd501dc110c5820def3412.1758214563.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: bk5BgiFz6C_uvec6Vx0o1We_36dhfGiE0WcMdYeY95Y_1758215708 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" Apply to the MPTCP auto-tuning the same improvements introduced for the TCP protocol by the merge commit 2da35e4b4df9 ("Merge branch 'tcp-receive-side-improvements'"). The main difference is that TCP subflow and the main MPTCP socket need to account separately for OoO: MPTCP does not care for TCP-level OoO and vice versa, as a consequence do not reflect MPTCP-level rcvbuf increase due to OoO packets at the subflow level. This refeactor additionally allow dropping the msk receive buffer update at receive time, as the latter only intended to cope with subflow receive buffer increase due to OoO packets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/487 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/559 Signed-off-by: Paolo Abeni Reviewed-by: Geliang Tang Reviewed-by: Matthieu Baerts (NGI0) Tested-by: Geliang Tang --- v1 -> v2: - fix unused variable - reword the commit message --- net/mptcp/protocol.c | 92 ++++++++++++++++++++------------------------ net/mptcp/protocol.h | 4 +- 2 files changed, 44 insertions(+), 52 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 9d95d24781509..162abafe3f320 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -179,6 +179,30 @@ static bool mptcp_ooo_try_coalesce(struct mptcp_sock *= msk, struct sk_buff *to, return mptcp_try_coalesce((struct sock *)msk, to, from); } =20 +static bool mptcp_rcvbuf_grow(struct sock *sk) +{ + struct mptcp_sock *msk =3D mptcp_sk(sk); + int rcvwin, rcvbuf; + + if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) || + (sk->sk_userlocks & SOCK_RCVBUF_LOCK)) + return false; + + rcvwin =3D ((u64)msk->rcvq_space.space << 1); + + if (!RB_EMPTY_ROOT(&msk->out_of_order_queue)) + rcvwin +=3D MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq - msk->ack_seq; + + rcvbuf =3D min_t(u64, mptcp_space_from_win(sk, rcvwin), + READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2])); + + if (rcvbuf > sk->sk_rcvbuf) { + WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); + return true; + } + return false; +} + /* "inspired" by tcp_data_queue_ofo(), main differences: * - use mptcp seqs * - don't cope with sacks @@ -292,6 +316,9 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) end: skb_condense(skb); skb_set_owner_r(skb, sk); + /* do not grow rcvbuf for not-yet-accepted or orphaned sockets. */ + if (sk->sk_socket) + mptcp_rcvbuf_grow(sk); } =20 static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk, @@ -784,18 +811,10 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, = struct sock *ssk) return moved; } =20 -static void __mptcp_rcvbuf_update(struct sock *sk, struct sock *ssk) -{ - if (unlikely(ssk->sk_rcvbuf > sk->sk_rcvbuf)) - WRITE_ONCE(sk->sk_rcvbuf, ssk->sk_rcvbuf); -} - static void __mptcp_data_ready(struct sock *sk, struct sock *ssk) { struct mptcp_sock *msk =3D mptcp_sk(sk); =20 - __mptcp_rcvbuf_update(sk, ssk); - /* Wake-up the reader only for in-sequence data */ if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk)) sk->sk_data_ready(sk); @@ -2014,48 +2033,26 @@ static void mptcp_rcv_space_adjust(struct mptcp_soc= k *msk, int copied) if (msk->rcvq_space.copied <=3D msk->rcvq_space.space) goto new_measure; =20 - if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) && - !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { - u64 rcvwin, grow; - int rcvbuf; - - rcvwin =3D ((u64)msk->rcvq_space.copied << 1) + 16 * advmss; - - grow =3D rcvwin * (msk->rcvq_space.copied - msk->rcvq_space.space); - - do_div(grow, msk->rcvq_space.space); - rcvwin +=3D (grow << 1); - - rcvbuf =3D min_t(u64, mptcp_space_from_win(sk, rcvwin), - READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2])); - - if (rcvbuf > sk->sk_rcvbuf) { - u32 window_clamp; - - window_clamp =3D mptcp_win_from_space(sk, rcvbuf); - WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); + msk->rcvq_space.space =3D msk->rcvq_space.copied; + if (mptcp_rcvbuf_grow(sk)) { =20 - /* Make subflows follow along. If we do not do this, we - * get drops at subflow level if skbs can't be moved to - * the mptcp rx queue fast enough (announced rcv_win can - * exceed ssk->sk_rcvbuf). - */ - mptcp_for_each_subflow(msk, subflow) { - struct sock *ssk; - bool slow; + /* Make subflows follow along. If we do not do this, we + * get drops at subflow level if skbs can't be moved to + * the mptcp rx queue fast enough (announced rcv_win can + * exceed ssk->sk_rcvbuf). + */ + mptcp_for_each_subflow(msk, subflow) { + struct sock *ssk; + bool slow; =20 - ssk =3D mptcp_subflow_tcp_sock(subflow); - slow =3D lock_sock_fast(ssk); - WRITE_ONCE(ssk->sk_rcvbuf, rcvbuf); - WRITE_ONCE(tcp_sk(ssk)->window_clamp, window_clamp); - if (tcp_can_send_ack(ssk)) - tcp_cleanup_rbuf(ssk, 1); - unlock_sock_fast(ssk, slow); - } + ssk =3D mptcp_subflow_tcp_sock(subflow); + slow =3D lock_sock_fast(ssk); + tcp_sk(ssk)->rcvq_space.space =3D msk->rcvq_space.copied; + tcp_rcvbuf_grow(ssk); + unlock_sock_fast(ssk, slow); } } =20 - msk->rcvq_space.space =3D msk->rcvq_space.copied; new_measure: msk->rcvq_space.copied =3D 0; msk->rcvq_space.time =3D mstamp; @@ -2084,11 +2081,6 @@ static bool __mptcp_move_skbs(struct sock *sk) if (list_empty(&msk->conn_list)) return false; =20 - /* verify we can move any data from the subflow, eventually updating */ - if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) - mptcp_for_each_subflow(msk, subflow) - __mptcp_rcvbuf_update(sk, subflow->tcp_sock); - subflow =3D list_first_entry(&msk->conn_list, struct mptcp_subflow_context, node); for (;;) { diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 9b5a248bad404..6ac58e92a1aa3 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -342,8 +342,8 @@ struct mptcp_sock { struct mptcp_pm_data pm; struct mptcp_sched_ops *sched; struct { - u32 space; /* bytes copied in last measurement window */ - u32 copied; /* bytes copied in this measurement window */ + int space; /* bytes copied in last measurement window */ + int copied; /* bytes copied in this measurement window */ u64 time; /* start time of measurement window */ u64 rtt_us; /* last maximum rtt of subflows */ } rcvq_space; --=20 2.51.0