From nobody Mon Jun 8 18:55:16 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1BECA3F1665 for ; Wed, 27 May 2026 10:46:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878786; cv=none; b=cnVDDaA2e3KkmQi7HC/AJTrKuJJc73qNYRxXLRVxtjBfPpJBxkn10nSH4crK44e3R1rULfDRgYPqXCpua/J4h3DLrNPvuDO4Q2CeNtKOLtaTfTVhugYp/XYp7syM56w5V4KFml+4d1Dk1cKxU7cQ6kpQqv9teiW71VLNm7+KpoM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878786; c=relaxed/simple; bh=q2y6wYZ1kFZxhyFm2Sr8uf28Z+yUkWk2d7xtyKkHR30=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=g7nuUFt3Oi2K/LIitQ6fwq0bu2tnNf44VZJ303ymnV7k63kC6l9qPrtDMYBCliuLCBzrZ05u0gGyyjRtw2Duyu+sALnaCtNk8qfr8pij/z7Ns+WzqKZRkRrgz58ZA4XoXwdN9+jfJO40ND/PcDrkinDcv93KL6TxUgK66V+aZLE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Kbbvq3WS; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Kbbvq3WS" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1779878784; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: 
content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Wt9SaCLUquw51iOI5MDSGLtvCvn1tQArR3VoX8zNqoQ=; b=Kbbvq3WSX9e3dEurQytzS/+Uc31ieETnYguTrxCniF1qJ/teXyOVjxyWPf0c+ZT/VX2Tdp wqvjn2cINgy3sB0UhJzbUtUtqQw4QzaCuZlxu6HYGj7TZkqp7FzNNaJySaFTzT993Ie0UO k4qo7PEcHxbKk1PpQruMgep4V7f4ag4= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-361-7_4iAceIO92Qo0I-kfqy7g-1; Wed, 27 May 2026 06:46:22 -0400 X-MC-Unique: 7_4iAceIO92Qo0I-kfqy7g-1 X-Mimecast-MFC-AGG-ID: 7_4iAceIO92Qo0I-kfqy7g_1779878782 Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id E64A9195608F for ; Wed, 27 May 2026 10:46:21 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.48.107]) by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 284D719560AB for ; Wed, 27 May 2026 10:46:20 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH v9 mptcp-next 1/6] mptcp: allow subflow rcv wnd to shrink Date: Wed, 27 May 2026 12:45:31 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: WkPDQ8qnqTvZfystsF8k0B_nSBHfS2S921iFcW0uawk_1779878782 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" In MPTCP connection, the 
`window` field in the TCP header refers to the MPTCP-level rcv_nxt and its right edge should not move backward. Such a constraint is enforced at DSS option generation time. At the same time, the TCP stack independently ensures that the right edge of the TCP-level rcv wnd does not move backward. That in turn artificially inflates the MPTCP rcv window when the incoming data is acked at the TCP level and is OoO in the MPTCP sequence space (or lands in the backlog). As a consequence, the incoming traffic can exceed the receiver rcvbuf size even when the sender is not misbehaving. Prevent such a scenario by forcibly allowing the TCP subflow to shrink the TCP-level rcv wnd regardless of the current netns setting. Fixes: f3589be0c420 ("mptcp: never shrink offered window") Signed-off-by: Paolo Abeni --- net/mptcp/options.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 4d72f286a485..97ea4aa37b33 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -566,6 +566,7 @@ static bool mptcp_established_options_dss(struct sock *= sk, struct sk_buff *skb, { struct mptcp_subflow_context *subflow =3D mptcp_subflow_ctx(sk); struct mptcp_sock *msk =3D mptcp_sk(subflow->conn); + struct tcp_sock *tp =3D tcp_sk(sk); unsigned int dss_size =3D 0; struct mptcp_ext *mpext; unsigned int ack_size; @@ -614,6 +615,12 @@ static bool mptcp_established_options_dss(struct sock = *sk, struct sk_buff *skb, if (dss_size =3D=3D 0) ack_size +=3D TCPOLEN_MPTCP_DSS_BASE; =20 + /* The caller is __tcp_transmit_skb(), and will compute the new rcv + * wnd soon: ensure that the window can shrink. 
+ */ + if (skb) + tp->rcv_wnd =3D tp->rcv_nxt - tp->rcv_wup; + dss_size +=3D ack_size; =20 *size =3D ALIGN(dss_size, 4); --=20 2.54.0 From nobody Mon Jun 8 18:55:16 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7B1B03932E4 for ; Wed, 27 May 2026 10:46:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878788; cv=none; b=T1SAw3/Q6NmuDLhXpOYc83ak3AcF5W+8LbsIkR4nK2h8Dtgz9Xbqa9rLAioFjaZeKuVDmfDriYJenij0EuOTHMIdU6Zk0UfTIAjWViAGD7CSEs9RvofE79da+DLAMlHrVR3bosP5U9nYyJCifeKfxdkN/gJRqPEmCOsXrUleH68= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878788; c=relaxed/simple; bh=FgTE3J2X4d387bpT6F8YPHcxFBojQzyESaePYTxpg6w=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=mnzEy8yjDSD6Mn5knkIkwo4uBIgLFW0dvWGAxAYweuYlHsJL+FZRGOKmMFLtPVYitFLQcmMA7plqXEv08HDQLYyxCeIeK7Gh48tgoWyJ33zoX26SJqZkrBonCR9dP/BGf1r3JVCg2lgCwo8H+68dxaEimyXF7Hf1QMMMi6gZe7M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Q31WkCoK; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Q31WkCoK" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1779878785; 
h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xBm9l6M69EJA67QePIl2VtLjduLzmrJ5lQ4wIU+dT7U=; b=Q31WkCoKajtq0w7CE66O0DO/xYj13sVJoFGMvj5IoPj46e3x10cjNAWVfs74FNnVxN/pj/ 7JDSBg+Fdbxu7iyv9ONGiQPCt+jK3vRJ8P311YFI83O2ri8zrCViaX2ptz2HUC655RXkDu 7Fi7Yhvpr3JpY8qqHK25ywgIntIGZjg= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-190-PGxc7aUANzWa6sVTM-lulA-1; Wed, 27 May 2026 06:46:24 -0400 X-MC-Unique: PGxc7aUANzWa6sVTM-lulA-1 X-Mimecast-MFC-AGG-ID: PGxc7aUANzWa6sVTM-lulA_1779878783 Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 327F218005B2 for ; Wed, 27 May 2026 10:46:23 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.48.107]) by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 63B4119560AB for ; Wed, 27 May 2026 10:46:22 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH v9 mptcp-next 2/6] mptcp: explicitly drop over memory limits Date: Wed, 27 May 2026 12:45:32 +0200 Message-ID: <6ccf1a553f1e5a3ae63b5a5608f774623d71c9da.1779876523.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 
dVzwire8LCNfFOLwC0gzV6-amdeP0gzG1fZo9GvlT8c_1779878783 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" Currently the enforcement of the rcvbuf constraint is implemented when moving the skbs into the msk receive or OoO queue, keeping the incoming skbs in the subflow queue when over limit. Under significant memory pressure the above can cause permanent data transfer stalls, as the skb needed to make forward progress can be stuck in a subflow queue. Over memory limits, drop the incoming skb, relying on MPTCP-level retransmissions. Note that fallback sockets must enforce the limit before the skb reaches the subflow-level queue, as dropping an in-sequence already acked skb would break the stream. This is not a complete fix for the stall issue, as the drop strategy needs refinements that will come in the next patches. Signed-off-by: Paolo Abeni --- v7 -> v8: - removed non fallback check in mptcp_incoming_options(): that is a tput optimization (avoid rejections) for a slowpath case (sender is misbehaving) and needs too much additional complexity. - move the mibs definition here from later patches v6 -> v7: - fix sign extension issues v4 -> v5: - fix possible u32 overflow in mptcp_over_limit v3 -> v4: - schedule TCP ack on drop - enforce limits in __mptcp_move_skb() and __mptcp_add_backlog(), too but only if not fallback. v1 -> v2: - deal correctly with tcp fin and zero win probe RFC -> v1: - limit vs actual buffer size - use CB info instead of skb->len Note that: - this needs the follow-up patches to really fix the stall - sashiko can assume ZWP carries unacked data and may be silently dropped. AFAIK that is false. - sashiko apparently can't grasp that mptcp subflows never hit the tcp rx fastpath, and the mptcp_incoming_options in tcp_rcv_state_process is hit, the peer can't transmit any more data. 
- the memory comparison is intentionally very rough, as the msk socket lock is not currently held where the condition is now enforced. This should require some refinement, shared as-is to avoid more latency on my side --- net/mptcp/mib.c | 2 ++ net/mptcp/mib.h | 2 ++ net/mptcp/options.c | 28 +++++++++++++++++++++++++--- net/mptcp/protocol.c | 31 +++++++++++++++++++++++-------- 4 files changed, 52 insertions(+), 11 deletions(-) diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c index f23fda0c55a7..ef65e2df709f 100644 --- a/net/mptcp/mib.c +++ b/net/mptcp/mib.c @@ -85,6 +85,8 @@ static const struct snmp_mib mptcp_snmp_list[] =3D { SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK), SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED), SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE), + SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP), + SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED), }; =20 /* mptcp_mib_alloc - allocate percpu mib counters diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h index 812218b5ed2b..c84eb853d499 100644 --- a/net/mptcp/mib.h +++ b/net/mptcp/mib.h @@ -88,6 +88,8 @@ enum linux_mptcp_mib_field { MPTCP_MIB_SIMULTCONNFALLBACK, /* Simultaneous connect */ MPTCP_MIB_FALLBACKFAILED, /* Can't fallback due to msk status */ MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */ + MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */ + MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */ __MPTCP_MIB_MAX }; =20 diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 97ea4aa37b33..2b35bdc113a5 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -1161,8 +1161,30 @@ static bool add_addr_hmac_valid(struct mptcp_sock *m= sk, return hmac =3D=3D mp_opt->ahmac; } =20 -/* Return false in case of error (or subflow has been reset), - * else return true. 
+static bool mptcp_over_limit(struct sock *sk, struct sock *ssk, + const struct sk_buff *skb) +{ + struct mptcp_sock *msk =3D mptcp_sk(sk); + u64 mem =3D sk_rmem_alloc_get(sk); + + mem +=3D READ_ONCE(msk->backlog_len); + if (likely(mem <=3D READ_ONCE(sk->sk_rcvbuf))) + return false; + + /* Avoid silently dropping pure acks, fin or zero win probes. */ + if (TCP_SKB_CB(skb)->seq =3D=3D TCP_SKB_CB(skb)->end_seq || + TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN || + !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt)) + return false; + + /* Dropped due to memory constraints, schedule an ack. */ + inet_csk(ssk)->icsk_ack.pending |=3D ICSK_ACK_NOMEM | ICSK_ACK_NOW; + inet_csk_schedule_ack(ssk); + return true; +} + +/* Return false when the caller must drop the packet, i.e. in case of erro= r, + * subflow has been reset, or over memory limits. */ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb) { @@ -1188,7 +1210,7 @@ bool mptcp_incoming_options(struct sock *sk, struct s= k_buff *skb) =20 __mptcp_data_acked(subflow->conn); mptcp_data_unlock(subflow->conn); - return true; + return !mptcp_over_limit(subflow->conn, sk, skb); } =20 mptcp_get_options(skb, &mp_opt); diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 1d67728d4233..26a70b3b9566 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -381,6 +381,16 @@ static bool __mptcp_move_skb(struct sock *sk, struct s= k_buff *skb) =20 mptcp_borrow_fwdmem(sk, skb); =20 + /* Can't drop packets for fallback socket this late, or the stream + * will break. 
+ */ + if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) && + !__mptcp_check_fallback(msk)) { + MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED); + mptcp_drop(sk, skb); + return false; + } + if (MPTCP_SKB_CB(skb)->map_seq =3D=3D msk->ack_seq) { /* in sequence */ msk->bytes_received +=3D copy_len; @@ -675,6 +685,7 @@ static void __mptcp_add_backlog(struct sock *sk, struct sk_buff *tail =3D NULL; struct sock *ssk =3D skb->sk; bool fragstolen; + u64 limit; int delta; =20 if (unlikely(sk->sk_state =3D=3D TCP_CLOSE)) { @@ -682,6 +693,16 @@ static void __mptcp_add_backlog(struct sock *sk, return; } =20 + /* Similar additional allowance as plain TCP. */ + limit =3D READ_ONCE(sk->sk_rcvbuf); + limit +=3D (limit >> 1) + 64 * 1024; + limit =3D min_t(u64, limit, UINT_MAX); + if (msk->backlog_len > limit && !__mptcp_check_fallback(msk)) { + __MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_BACKLOGDROP); + kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_BACKLOG); + return; + } + /* Try to coalesce with the last skb in our backlog */ if (!list_empty(&msk->backlog_list)) tail =3D list_last_entry(&msk->backlog_list, struct sk_buff, list); @@ -753,7 +774,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp= _sock *msk, =20 mptcp_init_skb(ssk, skb, offset, len); =20 - if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) { + if (own_msk) { mptcp_subflow_lend_fwdmem(subflow, skb); ret |=3D __mptcp_move_skb(sk, skb); } else { @@ -2211,10 +2232,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struc= t list_head *skbs, u32 *delt =20 *delta =3D 0; while (1) { - /* If the msk recvbuf is full stop, don't drop */ - if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) - break; - prefetch(skb->next); list_del(&skb->list); *delta +=3D skb->truesize; @@ -2242,9 +2259,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, = struct list_head *skbs) DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket && mem_cgroup_from_sk(sk)); =20 - /* Don't spool the backlog if the rcvbuf is 
full. */ - if (list_empty(&msk->backlog_list) || - sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) + if (list_empty(&msk->backlog_list)) return false; =20 INIT_LIST_HEAD(skbs); --=20 2.54.0 From nobody Mon Jun 8 18:55:16 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BB8EB3DEAE3 for ; Wed, 27 May 2026 10:46:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878789; cv=none; b=SpyXUQUwVkzM4Rpus4KdgDwkriLDza+bFdnUYYrvmmGsKNkD1lg4wUEqfQ7wl0gsJGHv2EfrtOvKUtR4A7/ug9o03Cz8pkw5Y4ftOwbtDR0nw8nk7F+aGQpCXO+G8B+oFLGn2LEbL26q7WozR7w9bUnVPS4et+A7tl0NVUIAIOw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878789; c=relaxed/simple; bh=nd5fVne+R2tzq7Qg8D4rwQSwOeAwV9pEN7L7g3aGwLI=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=N6ITwSQtr/tcanXoWoh1EzQ28fvLIo8CNyCx+/F45cLvTLALloMvEXUkR5D6VXHFhrGAb/twJh/iWY1qNvhR2ykV7rX3zhXWyUebcwRZVfSVXXCxopDRECaXZXCzDLb0RyBWkM+8xkksykZEYKfcD7X1kpN1bAyPC0BmiIynAck= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Sc1gRZ6s; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Sc1gRZ6s" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; 
s=mimecast20190719; t=1779878786; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Z05qiNjmOUnUCM0HtIlLbx9ynCtYwR2e2gZHtouxjhk=; b=Sc1gRZ6sbbmrYK6+vKFTYFNotrCGw4+aq06yYZNQwzpoEzYbcVEWVnyzDCpAhL3yfjgcIm Z3JTIi9kvjm4e/E8wHweOwQtmwyjTNLDhSw4rSUvH9Zxa7FeTwe6sBdxO4W4nWsJMv9iQC im/COq5B97/1+TF9ZYGI4bWGpmL336k= Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-644-4JJHXVneMr-pC3r-_GyLSA-1; Wed, 27 May 2026 06:46:25 -0400 X-MC-Unique: 4JJHXVneMr-pC3r-_GyLSA-1 X-Mimecast-MFC-AGG-ID: 4JJHXVneMr-pC3r-_GyLSA_1779878784 Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id A37941956053 for ; Wed, 27 May 2026 10:46:24 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.48.107]) by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id BC75319560AB for ; Wed, 27 May 2026 10:46:23 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH v9 mptcp-next 3/6] mptcp: enforce hard limit on backlog flushing Date: Wed, 27 May 2026 12:45:33 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 
SNSTX37ng_s3t1qb9HrBBoIbnX2U3A9xQt1CMuu8I70_1779878784 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" Currently a wild producer could keep the backlog flushing operation spinning for an unbounded time. Since the previous patch the amount of data present in the backlog is hard-limited. Move the backlog len update to the end of the flush loop to prevent it from spinning forever. Also, no need to splice back the remaining skbs list into the backlog, as that list is always empty after each backlog processing loop. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 21 ++++++--------------- 1 file changed, 6 insertions(+), 15 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 26a70b3b9566..b0a0c51e0a13 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -2230,7 +2230,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct= list_head *skbs, u32 *delt struct mptcp_sock *msk =3D mptcp_sk(sk); bool moved =3D false; =20 - *delta =3D 0; while (1) { prefetch(skb->next); list_del(&skb->list); @@ -2267,20 +2266,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk= , struct list_head *skbs) return true; } =20 -static void mptcp_backlog_spooled(struct sock *sk, u32 moved, - struct list_head *skbs) -{ - struct mptcp_sock *msk =3D mptcp_sk(sk); - - WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved); - list_splice(skbs, &msk->backlog_list); -} - static bool mptcp_move_skbs(struct sock *sk) { + struct mptcp_sock *msk =3D mptcp_sk(sk); struct list_head skbs; bool enqueued =3D false; - u32 moved; + u32 moved =3D 0; =20 mptcp_data_lock(sk); while (mptcp_can_spool_backlog(sk, &skbs)) { @@ -2288,8 +2279,8 @@ static bool mptcp_move_skbs(struct sock *sk) enqueued |=3D __mptcp_move_skbs(sk, &skbs, &moved); =20 mptcp_data_lock(sk); - mptcp_backlog_spooled(sk, moved, &skbs); } + WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved); mptcp_data_unlock(sk); =20 
if (enqueued && mptcp_epollin_ready(sk)) @@ -3680,12 +3671,12 @@ static void mptcp_release_cb(struct sock *sk) __must_hold(&sk->sk_lock.slock) { struct mptcp_sock *msk =3D mptcp_sk(sk); + u32 moved =3D 0; =20 for (;;) { unsigned long flags =3D (msk->cb_flags & MPTCP_FLAGS_PROCESS_CTX_NEED); struct list_head join_list, skbs; bool spool_bl; - u32 moved; =20 spool_bl =3D mptcp_can_spool_backlog(sk, &skbs); if (!flags && !spool_bl) @@ -3718,9 +3709,9 @@ static void mptcp_release_cb(struct sock *sk) =20 cond_resched(); spin_lock_bh(&sk->sk_lock.slock); - if (spool_bl) - mptcp_backlog_spooled(sk, moved, &skbs); } + if (moved) + WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved); =20 if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags)) __mptcp_clean_una_wakeup(sk); --=20 2.54.0 From nobody Mon Jun 8 18:55:16 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A5BA58F7D for ; Wed, 27 May 2026 10:46:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878791; cv=none; b=T8epxY+9j0H5599KapE8pjfXQgiQspkxrdNBeZJKLzjAUdio9WNHeFXlNtozML/pflmXhlbtm67gt0Uz4TvVMp/G16uPUiR7JWw0EDARFyxRIU9vBLj5ys/vvVs5ockhWc1IkzfXxaqjMju/dCyEduov++UcQtEsK+0VXGiS/nk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779878791; c=relaxed/simple; bh=oziDYzaaZklVxmIasPfI4bZW+xuTrYq48ui7/d4w9i4=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=InIXo6V32CbFprrkR+Zx4/XGGqeqc2K6bk89nLKf5W+LkPXRanO275503WPtnOxjhNi3eGvZNQRforacuJe8lADZ2hvw3YnserIpW2VmX0lFWXYhymFiEhS1F/uQDXC9dCnQwnzNkUNiq9aefLeOzWUSJ3cM7UxGj6HNDyjZLN0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass 
(p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=WVpFRzce; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="WVpFRzce" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1779878788; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=YXC9p6mpSxj8jwy4nA/L9GVclegUpq5nSYCF0Xf6Slo=; b=WVpFRzce1cT1bDIzBBuCNs5JdWUuy9DOR5RfA/GaSFqD0LFYP52hjuELrDlB8moJ39XV/F jJbAmFccr5xk2xgp/a2/zxaBJrSIgT6YjgFqYhoWWGGNIJtMTdkpQ0mCkDtGwcVbO6nxmz d2uaNq6EsjqhV5x4NedW12EoXXizf28= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-505-ojRIDOnvMQWBxpu458QXIg-1; Wed, 27 May 2026 06:46:26 -0400 X-MC-Unique: ojRIDOnvMQWBxpu458QXIg-1 X-Mimecast-MFC-AGG-ID: ojRIDOnvMQWBxpu458QXIg_1779878785 Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id C1DCB180034C for ; Wed, 27 May 2026 10:46:25 +0000 (UTC) Received: from gerbillo.redhat.com (unknown 
[10.44.48.107]) by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 025A119560AB for ; Wed, 27 May 2026 10:46:24 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH v9 mptcp-next 4/6] mptcp: implemented OoO queue pruning Date: Wed, 27 May 2026 12:45:34 +0200 Message-ID: <94af7b55233828311a361cdd70b837805448218a.1779876523.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: zqeS4Mh4WdbvO2CWJDwcPRV8aty4R6yOnpqyHaJvWNU_1779878785 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" When moving incoming skbs into the msk receive queue and the latter is above limits, prune it as needed, much like what TCP does at the subflow level. The main difference lies in the stop condition: since MPTCP does not perform collapsing, it's better off dropping the bare minimum to fit the (newer) incoming packet. Signed-off-by: Paolo Abeni Reviewed-by: Matthieu Baerts (NGI0) Tested-by: Gang Yan --- v8 -> v9: - reworded the (obsoleted) commit message v6 -> v7: - fix u64 -> u32 truncation v2 -> v3: - deal with unsynced TFO skb at prune time - only possible when pruning in mptcp_over_limit() v1 -> v2: - collapse rcv queue, too - deal with MPC map, too - drop left-over sentence in the commit message RFC -> v1: - use data_seq only when available - avoid ack_seq lockless access - drop limit on fallback - collapse rcvqueue, too - drop only when pruning is not possible and over rcvbuf * 2 Note: - sashiko can be confused about fwd memory lifecycle (I can understand that :). Any excess amount of fwd allocated memory is always released by the next sk_mem_uncharge() - i.e. fwd memory is not tied to the current skb. 
- AFAICS KASAN handles bitmap variables in a sane way, and sashiko
  doesn't know about that
---
 net/mptcp/mib.c      |  1 +
 net/mptcp/mib.h      |  1 +
 net/mptcp/protocol.c | 48 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index ef65e2df709f..d9bd4f4afcc0 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -87,6 +87,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
 	SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
 	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
+	SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
 };
 
 /* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index c84eb853d499..18f35f7e0a2d 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -90,6 +90,7 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_WINPROBE,		/* MPTCP-level zero window probe */
 	MPTCP_MIB_BACKLOGDROP,		/* Backlog over memory limit */
 	MPTCP_MIB_RCVPRUNED,		/* Dropped due to memory constrains */
+	MPTCP_MIB_OFO_PRUNED,		/* MPTCP-level OoO queue pruned */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b0a0c51e0a13..29cb10c02ed8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -373,6 +373,45 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
 	skb_dst_drop(skb);
 }
 
+/* "Inspired" from the TCP version; main difference: stop as soon as the MPTCP
+ * socket is under memory limit.
+ */
+static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct rb_node *node, *prev;
+	bool pruned = false;
+	u64 mem;
+
+	if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
+		return;
+
+	node = &msk->ooo_last_skb->rbnode;
+
+	do {
+		struct sk_buff *skb = rb_to_skb(node);
+
+		/* Stop pruning if the incoming skb would land in OoO tail. */
+		if (after64(seq, MPTCP_SKB_CB(skb)->map_seq))
+			break;
+
+		pruned = true;
+		prev = rb_prev(node);
+		rb_erase(node, &msk->out_of_order_queue);
+		mptcp_drop(sk, skb);
+		msk->ooo_last_skb = rb_to_skb(prev);
+
+		mem = (unsigned int)atomic_read(&sk->sk_rmem_alloc);
+		if (mem < sk->sk_rcvbuf)
+			break;
+
+		node = prev;
+	} while (node);
+
+	if (pruned)
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+}
+
 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 {
 	u64 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
@@ -386,9 +425,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
 	    !__mptcp_check_fallback(msk)) {
-		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
-		mptcp_drop(sk, skb);
-		return false;
+		mptcp_prune_ofo_queue(sk, MPTCP_SKB_CB(skb)->map_seq);
+		if (sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) {
+			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
+			mptcp_drop(sk, skb);
+			return false;
+		}
 	}
 
 	if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
-- 
2.54.0

From nobody Mon Jun 8 18:55:16 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Subject: [PATCH v9 mptcp-next 5/6] mptcp: move the retrans loop to a separate helper
Date: Wed, 27 May 2026 12:45:35 +0200
Message-ID: <863c4d94afef82be90c91a9fe17a402aa76f2d06.1779876523.git.pabeni@redhat.com>
Precedence: bulk
X-Mailing-List: mptcp@lists.linux.dev
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"; x-default="true"

This is a cleanup in order to make the next patch simpler.

No functional change intended.

Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 74 +++++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 31 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 29cb10c02ed8..e21ace787a32 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2828,41 +2828,14 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 	sk_error_report(sk);
 }
 
-static void __mptcp_retrans(struct sock *sk)
+/* Retransmit the specified data fragment on all the selected subflows. */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 {
 	struct mptcp_sendmsg_info info = { .data_lock_held = true, };
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_subflow_context *subflow;
-	struct mptcp_data_frag *dfrag;
 	struct sock *ssk;
-	int ret, err;
-	u16 len = 0;
-
-	mptcp_clean_una_wakeup(sk);
-
-	/* first check ssk: need to kick "stale" logic */
-	err = mptcp_sched_get_retrans(msk);
-	dfrag = mptcp_rtx_head(sk);
-	if (!dfrag) {
-		if (mptcp_data_fin_enabled(msk)) {
-			struct inet_connection_sock *icsk = inet_csk(sk);
-
-			WRITE_ONCE(icsk->icsk_retransmits,
-				   icsk->icsk_retransmits + 1);
-			mptcp_set_datafin_timeout(sk);
-			mptcp_send_ack(msk);
-
-			goto reset_timer;
-		}
-
-		if (!mptcp_send_head(sk))
-			goto clear_scheduled;
-
-		goto reset_timer;
-	}
-
-	if (err)
-		goto reset_timer;
+	int ret, len = 0;
 
 	mptcp_for_each_subflow(msk, subflow) {
 		if (READ_ONCE(subflow->scheduled)) {
@@ -2890,7 +2863,7 @@ static void __mptcp_retrans(struct sock *sk)
 			    !msk->allow_subflows) {
 				spin_unlock_bh(&msk->fallback_lock);
 				release_sock(ssk);
-				goto clear_scheduled;
+				return -1;
 			}
 
 			while (info.sent < info.limit) {
@@ -2913,6 +2886,45 @@ static void __mptcp_retrans(struct sock *sk)
 			release_sock(ssk);
 		}
 	}
+	return len;
+}
+
+static void __mptcp_retrans(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
+	struct mptcp_data_frag *dfrag;
+	int err, len;
+
+	mptcp_clean_una_wakeup(sk);
+
+	/* first check ssk: need to kick "stale" logic */
+	err = mptcp_sched_get_retrans(msk);
+	dfrag = mptcp_rtx_head(sk);
+	if (!dfrag) {
+		if (mptcp_data_fin_enabled(msk)) {
+			struct inet_connection_sock *icsk = inet_csk(sk);
+
+			WRITE_ONCE(icsk->icsk_retransmits,
+				   icsk->icsk_retransmits + 1);
+			mptcp_set_datafin_timeout(sk);
+			mptcp_send_ack(msk);
+
+			goto reset_timer;
+		}
+
+		if (!mptcp_send_head(sk))
+			goto clear_scheduled;
+
+		goto reset_timer;
+	}
+
+	if (err)
+		goto reset_timer;
+
+	len = __mptcp_push_retrans(sk, dfrag);
+	if (len < 0)
+		goto clear_scheduled;
 
 	msk->bytes_retrans += len;
 	dfrag->already_sent = max(dfrag->already_sent, len);
-- 
2.54.0

From nobody Mon Jun 8 18:55:16 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Subject: [PATCH v9 mptcp-next 6/6] mptcp: let the retrans scheduler do its job.
Date: Wed, 27 May 2026 12:45:36 +0200
Precedence: bulk
X-Mailing-List: mptcp@lists.linux.dev
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"; x-default="true"

Currently the MPTCP core enforces that, when the MPTCP-level retrans
timer fires, at most a single dfrag is retransmitted.

In some corner cases it may be necessary to retransmit multiple dfrags,
and the MPTCP socket will need to wait for multiple retrans timeouts to
accomplish that.

Remove the mentioned constraint, allowing multiple dfrags to be
retransmitted per retrans period, as long as the scheduler keeps
selecting subflows for retransmissions and pending data is available in
the rtx queue.

The default scheduler will transmit a dfrag per available subflow.

Signed-off-by: Paolo Abeni
---
v7 -> v8:
 - fix corner-case retrans_seq update

v4 -> v5:
 - fixed already_sent update

v3 -> v4:
 - avoid quadratic behavior, fix retrans_seq update
 - fix rtx timer re-schedule miss

v2 -> v3:
 - fix infinite loop issue (should address TLS test failures)

v1 -> v2:
 - fix retrans sequence update (sashiko)

Note:
 - sashiko sees issues when dfrag = mptcp_rtx_head(sk) != NULL and
   dfrag->already_sent == 0. That condition should not be possible: if
   mptcp_rtx_head() is not NULL there should be some data already sent.
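
For readers who want a feel for the behavior change described in the
commit message, here is a minimal user-space sketch. It is a toy model
only, not the kernel code: struct toy_dfrag, toy_retrans() and the
"subflows" budget (standing in for how many times the scheduler keeps
selecting a subflow) are all hypothetical names, and plain integer
comparisons are used, so sequence wrap-around is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct mptcp_data_frag. */
struct toy_dfrag {
	uint64_t data_seq;	/* first sequence number covered */
	uint16_t data_len;	/* total payload length */
	uint16_t already_sent;	/* bytes sent at least once */
};

/*
 * Walk a toy RTX queue starting from snd_una. Each scheduler pick
 * (one unit of the assumed "subflows" budget) retransmits the
 * already-sent part of one fragment. Returns the bytes retransmitted.
 */
static uint64_t toy_retrans(const struct toy_dfrag *q, size_t n,
			    uint64_t snd_una, unsigned int subflows)
{
	uint64_t retrans_seq = snd_una;
	uint64_t total = 0;
	size_t i = 0;

	assert(q || !n);

	while (i < n && subflows > 0) {
		const struct toy_dfrag *d = &q[i];
		uint64_t sent_end = d->data_seq + d->already_sent;

		/* Data never sent before is not retransmitted. */
		if (!d->already_sent)
			break;

		/* Everything sent from this fragment is already acked. */
		if (retrans_seq >= sent_end) {
			i++;
			continue;
		}
		if (retrans_seq < d->data_seq)
			retrans_seq = d->data_seq;

		total += sent_end - retrans_seq;
		retrans_seq = sent_end;
		subflows--;

		/* Move on only if this fragment is completely covered. */
		if (retrans_seq < d->data_seq + d->data_len)
			break;
		i++;
	}
	return total;
}
```

With a budget of 1 the sketch retransmits from a single dfrag per timer
expiry, which is the constraint the commit message describes; a larger
budget lets the walk continue into the following fragments until it
meets one that was only partially sent.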
---
 net/mptcp/protocol.c | 117 +++++++++++++++++++++++++++++++------------
 1 file changed, 85 insertions(+), 32 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e21ace787a32..51509c062768 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1199,13 +1199,6 @@ static void __mptcp_clean_una_wakeup(struct sock *sk)
 	mptcp_write_space(sk);
 }
 
-static void mptcp_clean_una_wakeup(struct sock *sk)
-{
-	mptcp_data_lock(sk);
-	__mptcp_clean_una_wakeup(sk);
-	mptcp_data_unlock(sk);
-}
-
 static void mptcp_enter_memory_pressure(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -2828,8 +2821,12 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 	sk_error_report(sk);
 }
 
-/* Retransmit the specified data fragment on all the selected subflows. */
-static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
+/*
+ * Retransmit the specified data fragment on all the selected subflows,
+ * starting from the specified sequence
+ */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag,
+				u64 sent_seq)
 {
 	struct mptcp_sendmsg_info info = { .data_lock_held = true, };
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2839,6 +2836,7 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 
 	mptcp_for_each_subflow(msk, subflow) {
 		if (READ_ONCE(subflow->scheduled)) {
+			u16 offset = sent_seq - dfrag->data_seq;
 			u16 copied = 0;
 
 			mptcp_subflow_set_scheduled(subflow, false);
@@ -2848,9 +2846,12 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 			lock_sock(ssk);
 
 			/* limit retransmission to the bytes already sent on some subflows */
-			info.sent = 0;
+			info.sent = offset;
 			info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
 								    dfrag->already_sent;
+			DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq,
+							 dfrag->data_seq +
+							 info.limit));
 
 			/*
 			 * make the whole retrans decision, xmit, disallow
@@ -2894,45 +2895,97 @@ static void __mptcp_retrans(struct sock *sk)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_subflow_context *subflow;
 	struct mptcp_data_frag *dfrag;
+	bool retransmitted = false;
+	u64 retrans_seq;
 	int err, len;
 
-	mptcp_clean_una_wakeup(sk);
-
-	/* first check ssk: need to kick "stale" logic */
-	err = mptcp_sched_get_retrans(msk);
+	mptcp_data_lock(sk);
+	__mptcp_clean_una_wakeup(sk);
+	retrans_seq = msk->snd_una;
 	dfrag = mptcp_rtx_head(sk);
+	mptcp_data_unlock(sk);
+	if (!dfrag)
+		goto check_data_fin;
+
+	for (;;) {
+		bool already_retrans;
+		u64 sent_seq;
+
+		/* The scheduler may clean the RTX queue. */
+		get_page(dfrag->page);
+
+		/* The default scheduler will kick "stale" logic. */
+		err = mptcp_sched_get_retrans(msk);
+		if (err) {
+			put_page(dfrag->page);
+			break;
+		}
+
+		/* Incoming acks can have moved retrans sequence after
+		 * the current dfrag, if so try to start again from RTX head.
+		 */
+		mptcp_data_lock(sk);
+		already_retrans = !dfrag->already_sent ||
+				  !before64(msk->snd_una, dfrag->data_seq +
+							  dfrag->already_sent);
+		put_page(dfrag->page);
+		if (already_retrans) {
+			__mptcp_clean_una_wakeup(sk);
+			retrans_seq = msk->snd_una;
+			dfrag = mptcp_rtx_head(sk);
+		} else if (after64(msk->snd_una, retrans_seq)) {
+			retrans_seq = msk->snd_una;
+		}
+		mptcp_data_unlock(sk);
+		if (!dfrag)
+			break;
+
+		len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
+		if (len < 0)
+			goto clear_scheduled;
+
+		retransmitted = true;
+		retrans_seq += len;
+		msk->bytes_retrans += len;
+		dfrag->already_sent = max_t(u16, dfrag->already_sent,
+					    retrans_seq - dfrag->data_seq);
+
+		/* With csum enabled retransmission can send new data. */
+		sent_seq = dfrag->already_sent + dfrag->data_seq;
+		if (after64(sent_seq, msk->snd_nxt))
+			WRITE_ONCE(msk->snd_nxt, sent_seq);
+
+		/* Attempt the next fragment only if the current one is
+		 * completely retransmitted.
+		 */
+		if (before64(retrans_seq, dfrag->data_seq + dfrag->data_len))
+			break;
+
+		dfrag = list_is_last(&dfrag->list, &msk->rtx_queue) ?
+			NULL : list_next_entry(dfrag, list);
+		if (!dfrag || !dfrag->already_sent)
+			break;
+	}
+
+	/* Data fin retransmission needed only if no data retransmission took
+	 * place, and RTX queue is empty.
+	 */
+check_data_fin:
 	if (!dfrag) {
-		if (mptcp_data_fin_enabled(msk)) {
+		if (!retransmitted && mptcp_data_fin_enabled(msk)) {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 
 			WRITE_ONCE(icsk->icsk_retransmits,
 				   icsk->icsk_retransmits + 1);
 			mptcp_set_datafin_timeout(sk);
 			mptcp_send_ack(msk);
-
 			goto reset_timer;
 		}
 
 		if (!mptcp_send_head(sk))
 			goto clear_scheduled;
-
-		goto reset_timer;
 	}
 
-	if (err)
-		goto reset_timer;
-
-	len = __mptcp_push_retrans(sk, dfrag);
-	if (len < 0)
-		goto clear_scheduled;
-
-	msk->bytes_retrans += len;
-	dfrag->already_sent = max(dfrag->already_sent, len);
-
-	/* With csum enabled retransmission can send new data. */
-	if (after64(dfrag->already_sent + dfrag->data_seq, msk->snd_nxt))
-		WRITE_ONCE(msk->snd_nxt, dfrag->already_sent + dfrag->data_seq);
-
 reset_timer:
 	mptcp_check_and_set_pending(sk);
 
-- 
2.54.0
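
The retransmission loop above relies on before64()/after64() for every
sequence comparison. As a reading aid, the sketch below shows the
wrap-safe comparison idea in user space, together with the shape of the
"already_retrans" test. The toy_ names are hypothetical stand-ins, not
the kernel helpers, and the example assumes the usual two's complement
conversion from uint64_t to int64_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when seq1 precedes seq2, even across a 2^64 wrap. */
static bool toy_before64(uint64_t seq1, uint64_t seq2)
{
	/* The sign of the wrapped difference gives the ordering. */
	return (int64_t)(seq1 - seq2) < 0;
}

static bool toy_after64(uint64_t seq1, uint64_t seq2)
{
	return toy_before64(seq2, seq1);
}

/*
 * Same shape as the "already_retrans" test in the loop: nothing is left
 * to retransmit in a fragment when no byte of it was ever sent, or when
 * snd_una has reached the end of its already-sent part.
 */
static bool toy_already_retrans(uint64_t snd_una, uint64_t data_seq,
				uint16_t already_sent)
{
	return !already_sent ||
	       !toy_before64(snd_una, data_seq + already_sent);
}

/* Small self-check exercising a wrap around UINT64_MAX. */
static void toy_check_wrap(void)
{
	uint64_t near_max = UINT64_MAX - 5;

	/* near_max + 10 wraps to 4, yet still compares as "later". */
	assert(toy_before64(near_max, near_max + 10));
	assert(toy_after64(4, near_max));
}
```

A plain "<" on the raw values would order 4 before UINT64_MAX - 5 and
get the wrapped case backwards, which is why the signed difference is
used instead.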