From nobody Tue May 5 12:24:00 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: yangang@kylinos.cn, geliang@kernel.org, matttbe@kernel.org
Subject: [RFC PATCH 1/6] mptcp: move checks vs rcvbuf size earlier in the RX path
Date: Mon, 20 Apr 2026 12:29:25 +0200
Message-ID: <7bd58b4078ae99c6514d356792baa75e0c4f1b9a.1776680489.git.pabeni@redhat.com>
Precedence: bulk
X-Mailing-List: mptcp@lists.linux.dev
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"; x-default="true"

Currently the enforcement of the rcvbuf constraint is implemented when moving the skbs into the msk receive or OoO queue. Under significant memory pressure the above can cause permanent data transfer stalls.

Move the checks early on, before landing even in the subflow queues.

Signed-off-by: Paolo Abeni
---
Note that:
- this needs the follow-up patches to really fix the stall
- the memory comparison is intentionally very rough, as the msk socket lock is not currently held where the condition is now enforced. This will need some refinement; shared as-is to avoid more latency on my side
---
 net/mptcp/options.c | 21 +++++++++++++++++++--
 net/mptcp/protocol.c | 9 ++-------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 4cc583fdc7a9..a6d290427611 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -1158,8 +1158,19 @@ static bool add_addr_hmac_valid(struct mptcp_sock *m= sk, return hmac =3D=3D mp_opt->ahmac; } =20 -/* Return false in case of error (or subflow has been reset), - * else return true. +static bool mptcp_over_limit(const struct sock *sk, struct sk_buff *skb) +{ + int limit; + + if (!skb->len) + return false; + + limit =3D READ_ONCE(sk->sk_rcvbuf) << 1; + return sk_rmem_alloc_get(sk) > limit; +} + +/* Return false when the caller must drop the packet, i.e. in case of error, + * subflow has been reset, or over memory limits.
*/ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb) { @@ -1185,6 +1196,9 @@ bool mptcp_incoming_options(struct sock *sk, struct s= k_buff *skb) =20 __mptcp_data_acked(subflow->conn); mptcp_data_unlock(subflow->conn); + + if (mptcp_over_limit(subflow->conn, skb)) + return false; return true; } =20 @@ -1263,6 +1277,9 @@ bool mptcp_incoming_options(struct sock *sk, struct s= k_buff *skb) return true; } =20 + if (mptcp_over_limit(subflow->conn, skb)) + return false; + mpext =3D skb_ext_add(skb, SKB_EXT_MPTCP); if (!mpext) return false; diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 17b9a8c13ebf..2d143b929bbf 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -739,7 +739,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp= _sock *msk, =20 mptcp_init_skb(ssk, skb, offset, len); =20 - if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) { + if (own_msk) { mptcp_subflow_lend_fwdmem(subflow, skb); ret |=3D __mptcp_move_skb(sk, skb); } else { @@ -2197,10 +2197,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struc= t list_head *skbs, u32 *delt =20 *delta =3D 0; while (1) { - /* If the msk recvbuf is full stop, don't drop */ - if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) - break; - prefetch(skb->next); list_del(&skb->list); *delta +=3D skb->truesize; @@ -2229,8 +2225,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, = struct list_head *skbs) mem_cgroup_from_sk(sk)); =20 /* Don't spool the backlog if the rcvbuf is full. 
*/ - if (list_empty(&msk->backlog_list) || - sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) + if (list_empty(&msk->backlog_list)) return false; =20 INIT_LIST_HEAD(skbs); --=20 2.53.0

From nobody Tue May 5 12:24:00 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: yangang@kylinos.cn, geliang@kernel.org, matttbe@kernel.org
Subject: [RFC PATCH 2/6] mptcp: sync mptcp skb cb layout with tcp one
Date: Mon, 20 Apr 2026 12:29:26 +0200
Precedence: bulk
X-Mailing-List: mptcp@lists.linux.dev
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"; x-default="true"

The MPTCP protocol uses a significantly different CB layout compared to TCP, as it includes different information and uses 64 bits for the sequence numbers.

As the msk-level rcvbuf size is limited to INT_MAX by the core socket code, we can safely use 32 bits for MPTCP-level sequence numbers. This allows updating the MPTCP CB layout so that fields with corresponding TCP-level data use the same area inside the CB itself.

Add build-time checks to ensure the latter invariant.

Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 81 +++++++++++++++++++++++++-------------------
 net/mptcp/protocol.h | 5 +--
 2 files changed, 50 insertions(+), 36 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 2d143b929bbf..800aa7d9408e 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -28,7 +28,7 @@ #include "protocol.h" #include "mib.h" =20 -static unsigned int mptcp_inq_hint(const struct sock *sk); +static int mptcp_inq_hint(const struct sock *sk); =20 #define CREATE_TRACE_POINTS #include @@ -165,7 +165,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struc= t sk_buff *to, !skb_try_coalesce(to, from, fragstolen, delta)) return false; =20 - pr_debug("colesced seq %llx into %llx new len %d new end seq %llx\n", + pr_debug("colesced seq %x into %x new len %d new end seq %x\n", MPTCP_SKB_CB(from)->map_seq, MPTCP_SKB_CB(to)->map_seq, to->len, MPTCP_SKB_CB(from)->end_seq); MPTCP_SKB_CB(to)->end_seq =3D MPTCP_SKB_CB(from)->end_seq; @@ -244,20 +244,20 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *m= sk, struct sk_buff *skb) { struct sock *sk =3D (struct sock *)msk; struct rb_node **p, *parent; - u64 seq, end_seq, max_seq; + u32 seq, end_seq, max_seq; struct sk_buff *skb1; =20 seq =3D MPTCP_SKB_CB(skb)->map_seq; end_seq =3D MPTCP_SKB_CB(skb)->end_seq;
max_seq =3D atomic64_read(&msk->rcv_wnd_sent); =20 - pr_debug("msk=3D%p seq=3D%llx limit=3D%llx empty=3D%d\n", msk, seq, max_s= eq, + pr_debug("msk=3D%p seq=3D%x limit=3D%x empty=3D%d\n", msk, seq, max_seq, RB_EMPTY_ROOT(&msk->out_of_order_queue)); - if (after64(end_seq, max_seq)) { + if (after(end_seq, max_seq)) { /* out of window */ mptcp_drop(sk, skb); - pr_debug("oow by %lld, rcv_wnd_sent %llu\n", - (unsigned long long)end_seq - (unsigned long)max_seq, + pr_debug("oow by %d, rcv_wnd_sent %llu\n", + end_seq - max_seq, (unsigned long long)atomic64_read(&msk->rcv_wnd_sent)); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_NODSSWINDOW); return; @@ -282,7 +282,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) } =20 /* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */ - if (!before64(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) { + if (!before(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) { MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL); parent =3D &msk->ooo_last_skb->rbnode; p =3D &parent->rb_right; @@ -294,18 +294,18 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *m= sk, struct sk_buff *skb) while (*p) { parent =3D *p; skb1 =3D rb_to_skb(parent); - if (before64(seq, MPTCP_SKB_CB(skb1)->map_seq)) { + if (before(seq, MPTCP_SKB_CB(skb1)->map_seq)) { p =3D &parent->rb_left; continue; } - if (before64(seq, MPTCP_SKB_CB(skb1)->end_seq)) { - if (!after64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) { + if (before(seq, MPTCP_SKB_CB(skb1)->end_seq)) { + if (!after(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) { /* All the bits are present. Drop. */ mptcp_drop(sk, skb); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA); return; } - if (after64(seq, MPTCP_SKB_CB(skb1)->map_seq)) { + if (after(seq, MPTCP_SKB_CB(skb1)->map_seq)) { /* partial overlap: * | skb | * | skb1 | @@ -336,7 +336,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) merge_right: /* Remove other segments covered by skb. 
*/ while ((skb1 =3D skb_rb_next(skb)) !=3D NULL) { - if (before64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) + if (before(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) break; rb_erase(&skb1->rbnode, &msk->out_of_order_queue); mptcp_drop(sk, skb1); @@ -359,11 +359,12 @@ static void mptcp_init_skb(struct sock *ssk, struct s= k_buff *skb, int offset, =20 /* the skb map_seq accounts for the skb offset: * mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq - * value + * value; note that seq numbers are truncated to 32bits */ MPTCP_SKB_CB(skb)->map_seq =3D mptcp_subflow_get_mapped_dsn(subflow); MPTCP_SKB_CB(skb)->end_seq =3D MPTCP_SKB_CB(skb)->map_seq + copy_len; MPTCP_SKB_CB(skb)->offset =3D offset; + MPTCP_SKB_CB(skb)->flags =3D 0; MPTCP_SKB_CB(skb)->has_rxtstamp =3D has_rxtstamp; MPTCP_SKB_CB(skb)->cant_coalesce =3D 0; =20 @@ -375,13 +376,14 @@ static void mptcp_init_skb(struct sock *ssk, struct s= k_buff *skb, int offset, =20 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb) { - u64 copy_len =3D MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq; + u32 copy_len =3D MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq; struct mptcp_sock *msk =3D mptcp_sk(sk); + u32 ack_seq =3D msk->ack_seq; struct sk_buff *tail; =20 mptcp_borrow_fwdmem(sk, skb); =20 - if (MPTCP_SKB_CB(skb)->map_seq =3D=3D msk->ack_seq) { + if (MPTCP_SKB_CB(skb)->map_seq =3D=3D ack_seq) { /* in sequence */ msk->bytes_received +=3D copy_len; WRITE_ONCE(msk->ack_seq, msk->ack_seq + copy_len); @@ -392,7 +394,7 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk= _buff *skb) skb_set_owner_r(skb, sk); __skb_queue_tail(&sk->sk_receive_queue, skb); return true; - } else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) { + } else if (after(MPTCP_SKB_CB(skb)->map_seq, ack_seq)) { mptcp_data_queue_ofo(msk, skb); return false; } @@ -772,44 +774,42 @@ static bool __mptcp_move_skbs_from_subflow(struct mpt= cp_sock *msk, =20 static bool __mptcp_ofo_queue(struct mptcp_sock 
*msk) { + u32 seq_delta, ack_seq =3D msk->ack_seq; struct sock *sk =3D (struct sock *)msk; struct sk_buff *skb, *tail; bool moved =3D false; struct rb_node *p; - u64 end_seq; =20 p =3D rb_first(&msk->out_of_order_queue); pr_debug("msk=3D%p empty=3D%d\n", msk, RB_EMPTY_ROOT(&msk->out_of_order_q= ueue)); while (p) { skb =3D rb_to_skb(p); - if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) + if (after(MPTCP_SKB_CB(skb)->map_seq, ack_seq)) break; =20 p =3D rb_next(p); rb_erase(&skb->rbnode, &msk->out_of_order_queue); =20 - if (unlikely(!after64(MPTCP_SKB_CB(skb)->end_seq, - msk->ack_seq))) { + if (unlikely(!after(MPTCP_SKB_CB(skb)->end_seq, ack_seq))) { mptcp_drop(sk, skb); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA); continue; } =20 - end_seq =3D MPTCP_SKB_CB(skb)->end_seq; + seq_delta =3D MPTCP_SKB_CB(skb)->end_seq - ack_seq; tail =3D skb_peek_tail(&sk->sk_receive_queue); if (!tail || !mptcp_ooo_try_coalesce(msk, tail, skb)) { - int delta =3D msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq; + int delta =3D ack_seq - MPTCP_SKB_CB(skb)->map_seq; =20 /* skip overlapping data, if any */ - pr_debug("uncoalesced seq=3D%llx ack seq=3D%llx delta=3D%d\n", - MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq, - delta); + pr_debug("uncoalesced seq=3D%x ack seq=3D%x delta=3D%d\n", + MPTCP_SKB_CB(skb)->map_seq, ack_seq, delta); MPTCP_SKB_CB(skb)->offset +=3D delta; MPTCP_SKB_CB(skb)->map_seq +=3D delta; __skb_queue_tail(&sk->sk_receive_queue, skb); } - msk->bytes_received +=3D end_seq - msk->ack_seq; - WRITE_ONCE(msk->ack_seq, end_seq); + msk->bytes_received +=3D seq_delta; + WRITE_ONCE(msk->ack_seq, msk->ack_seq + seq_delta); moved =3D true; } return moved; @@ -2260,19 +2260,20 @@ static bool mptcp_move_skbs(struct sock *sk) return enqueued; } =20 -static unsigned int mptcp_inq_hint(const struct sock *sk) +static int mptcp_inq_hint(const struct sock *sk) { const struct mptcp_sock *msk =3D mptcp_sk(sk); const struct sk_buff *skb; =20 skb =3D skb_peek(&sk->sk_receive_queue); if (skb) { 
- u64 hint_val =3D READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq; + int hint_val =3D (u32)READ_ONCE(msk->ack_seq) - + MPTCP_SKB_CB(skb)->map_seq; =20 - if (hint_val >=3D INT_MAX) - return INT_MAX; + if (hint_val < 0) + return -hint_val; =20 - return (unsigned int)hint_val; + return hint_val; } =20 if (sk->sk_state =3D=3D TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN)) @@ -2380,7 +2381,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msgh= dr *msg, size_t len, tcp_recv_timestamp(msg, sk, &tss); =20 if (cmsg_flags & MPTCP_CMSG_INQ) { - unsigned int inq =3D mptcp_inq_hint(sk); + int inq =3D mptcp_inq_hint(sk); =20 put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq); } @@ -4601,11 +4602,23 @@ static int mptcp_napi_poll(struct napi_struct *napi= , int budget) return work_done; } =20 +#define CHK_CB_FIELD(mptcp_field, tcp_field) \ + ({ \ + BUILD_BUG_ON(offsetof(struct mptcp_skb_cb, mptcp_field) !=3D \ + offsetof(struct tcp_skb_cb, tcp_field)); \ + BUILD_BUG_ON(offsetofend(struct mptcp_skb_cb, mptcp_field) !=3D \ + offsetofend(struct tcp_skb_cb, tcp_field)); \ + }) + void __init mptcp_proto_init(void) { struct mptcp_delegated_action *delegated; int cpu; =20 + CHK_CB_FIELD(map_seq, seq); + CHK_CB_FIELD(end_seq, end_seq); + CHK_CB_FIELD(flags, tcp_flags); + mptcp_prot.h.hashinfo =3D tcp_prot.h.hashinfo; =20 if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL)) diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 661600f8b573..ad906737ee9f 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -126,9 +126,10 @@ #define MPTCP_SYNC_SNDBUF 7 =20 struct mptcp_skb_cb { - u64 map_seq; - u64 end_seq; + u32 map_seq; + u32 end_seq; u32 offset; + u16 flags; u8 has_rxtstamp; u8 cant_coalesce; }; --=20 2.53.0

From nobody Tue May 5 12:24:00 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: yangang@kylinos.cn, geliang@kernel.org, matttbe@kernel.org
Subject: [RFC PATCH 3/6] tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too
Date: Mon, 20 Apr 2026 12:29:27 +0200
Message-ID: <17ca96bd263c4f7189dadfd782440463338bcfdb.1776680489.git.pabeni@redhat.com>
Precedence: bulk
X-Mailing-List: mptcp@lists.linux.dev
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"; x-default="true"

The end goal is to avoid duplicating the rather non-trivial strategy at the MPTCP level.
After the previous patch, the mentioned helper can process skbs sitting in MPTCP-level queues without any CB-related adaptation. The only additional adjustment needed is explicitly providing the OoO queue reference, to cope with the different sk layout.

Additionally rename the helper to clearly document its hybrid nature and let it return the number of collapsed skbs, to allow proper accounting from the future MPTCP caller.

Signed-off-by: Paolo Abeni
---
Note:
- this will need a significant amount of testing at the TCP level and explicit approval from Eric, which I can't tell if we can hope for.
---
 include/net/tcp.h | 4 ++++
 net/ipv4/tcp_input.c | 55 ++++++++++++++++++++++++++++----------------
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h index 6156d1d068e1..4d23e75fc5cb 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1828,6 +1828,10 @@ extern void tcp_openreq_init_rwin(struct request_soc= k *req, =20 void tcp_enter_memory_pressure(struct sock *sk); void tcp_leave_memory_pressure(struct sock *sk); +unsigned int xtcp_collapse_ofo_queue(struct sock *sk, + struct rb_root *out_of_order_queue, + struct sk_buff **ooo_last_skb, + u8 scaling_ratio); =20 static inline int keepalive_intvl_when(const struct tcp_sock *tp) { diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 7171442c3ed7..4daccc9c4795 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5725,16 +5725,22 @@ static struct sk_buff *tcp_collapse_one(struct sock= *sk, struct sk_buff *skb, /* Collapse contiguous sequence of skbs head..tail with * sequence numbers start..end. * + * sk can be either a TCP or an MPTCP socket. + * * If tail is NULL, this means until the end of the queue. * * Segments with FIN/SYN are not collapsed (only because this * simplifies code) + * + * Returns the number of collapsed skbs.
*/ -static void -tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *r= oot, - struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end) +static unsigned int +xtcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *= root, + struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end, + u8 scaling_ratio) { struct sk_buff *skb =3D head, *n; + unsigned int collapsed =3D 0; struct sk_buff_head tmp; bool end_of_skbs; =20 @@ -5750,6 +5756,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, =20 /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { + collapsed++; skb =3D tcp_collapse_one(sk, skb, list, root); if (!skb) break; @@ -5762,7 +5769,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, * overlaps to the next one and mptcp allow collapsing. */ if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) && - (tcp_win_from_space(sk, skb->truesize) > skb->len || + (__tcp_win_from_space(scaling_ratio, skb->truesize) > skb->len || before(TCP_SKB_CB(skb)->seq, start))) { end_of_skbs =3D false; break; @@ -5782,7 +5789,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, if (end_of_skbs || (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || !skb_frags_readable(skb)) - return; + return collapsed; =20 __skb_queue_head_init(&tmp); =20 @@ -5819,6 +5826,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, start +=3D size; } if (!before(start, TCP_SKB_CB(skb)->end_seq)) { + collapsed++; skb =3D tcp_collapse_one(sk, skb, list, root); if (!skb || skb =3D=3D tail || @@ -5832,23 +5840,26 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *= list, struct rb_root *root, end: skb_queue_walk_safe(&tmp, skb, n) tcp_rbtree_insert(root, skb); + return collapsed; } =20 /* Collapse ofo queue. 
Algorithm: select contiguous sequence of skbs - * and tcp_collapse() them until all the queue is collapsed. + * and xtcp_collapse() them until all the queue is collapsed. */ -static void tcp_collapse_ofo_queue(struct sock *sk) +unsigned int xtcp_collapse_ofo_queue(struct sock *sk, + struct rb_root *ooo_queue, + struct sk_buff **ooo_last_skb, + u8 scaling_ratio) { - struct tcp_sock *tp =3D tcp_sk(sk); - u32 range_truesize, sum_tiny =3D 0; + u32 range_truesize, sum_tiny =3D 0, collapsed =3D 0; struct sk_buff *skb, *head; u32 start, end; =20 - skb =3D skb_rb_first(&tp->out_of_order_queue); + skb =3D skb_rb_first(ooo_queue); new_range: if (!skb) { - tp->ooo_last_skb =3D skb_rb_last(&tp->out_of_order_queue); - return; + *ooo_last_skb =3D skb_rb_last(ooo_queue); + return collapsed; } start =3D TCP_SKB_CB(skb)->seq; end =3D TCP_SKB_CB(skb)->end_seq; @@ -5866,12 +5877,13 @@ static void tcp_collapse_ofo_queue(struct sock *sk) /* Do not attempt collapsing tiny skbs */ if (range_truesize !=3D head->truesize || end - start >=3D SKB_WITH_OVERHEAD(PAGE_SIZE)) { - tcp_collapse(sk, NULL, &tp->out_of_order_queue, - head, skb, start, end); + collapsed +=3D xtcp_collapse(sk, NULL, ooo_queue, + head, skb, start, end, + scaling_ratio); } else { sum_tiny +=3D range_truesize; if (sum_tiny > sk->sk_rcvbuf >> 3) - return; + return collapsed; } goto new_range; } @@ -5882,6 +5894,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk) if (after(TCP_SKB_CB(skb)->end_seq, end)) end =3D TCP_SKB_CB(skb)->end_seq; } + return collapsed; } =20 /* @@ -5969,12 +5982,14 @@ static int tcp_prune_queue(struct sock *sk, const s= truct sk_buff *in_skb) if (tcp_can_ingest(sk, in_skb)) return 0; =20 - tcp_collapse_ofo_queue(sk); + xtcp_collapse_ofo_queue(sk, &tp->out_of_order_queue, + &tp->ooo_last_skb, tp->scaling_ratio); if (!skb_queue_empty(&sk->sk_receive_queue)) - tcp_collapse(sk, &sk->sk_receive_queue, NULL, - skb_peek(&sk->sk_receive_queue), - NULL, - tp->copied_seq, tp->rcv_nxt); + xtcp_collapse(sk, 
&sk->sk_receive_queue, NULL, + skb_peek(&sk->sk_receive_queue), + NULL, + tp->copied_seq, tp->rcv_nxt, + tp->scaling_ratio); =20 if (tcp_can_ingest(sk, in_skb)) return 0; --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8B9FD39B493 for ; Mon, 20 Apr 2026 10:30:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776681014; cv=none; b=AsYRbG4SSI99faERIpfJUqyLcSc374DMOfPhUvFUuUDARzfD/gQPFAAgR41RS/nPupDcmnlYHXLzr5If3nKaIlHFPLo7eRLoUPNruRLRrhLuyKKV1IBb1K6BfaV1BuXIxuAHm5M7MlpmtbikBJoJ/guV3EGkYzTJ2QHYoNj6shg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776681014; c=relaxed/simple; bh=UgHYfQOGttMcgjTDamWiz8RtB3Jd25f931B8VbOqt6U=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=RSsyhMuJ6YTmQe4ReqmLNr7zyajOjkyuMVpRBnLmjrPowtFv9GAUr0FZV5hxpA1EbARQ5X3swm4R83vRxDr3KiSG/0WwSINJslbWxIh+lBtnzMKHbjiujtTKsfmazsDmfNea/OPdDaXjYMNUpycxGfdmBuRl7MA0J9QxORnWmao= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=UIXffH1/; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="UIXffH1/" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; 
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: yangang@kylinos.cn, geliang@kernel.org, matttbe@kernel.org
Subject: [RFC PATCH 4/6] mptcp: implemented OoO queue pruning
Date: Mon, 20 Apr 2026 12:29:28 +0200

Leverage the hybrid helper to implement OoO queue pruning at ingress
time. If the msk is owned by user space when the skb arrives, defer the
pruning to the release_cb. The prune check is additionally performed
when the skb reaches the msk-level queues.

Signed-off-by: Paolo Abeni
---
Notes:
- Similarly to patch 'mptcp: move checks vs rcvbuf size earlier in the
  RX path', some cleanup/tuning in mptcp_over_limit() will be needed
- Pruning in the release_cb() is likely not needed and should probably
  be removed (after more testing).
---
 net/mptcp/mib.c      |  3 +++
 net/mptcp/mib.h      |  3 +++
 net/mptcp/options.c  | 22 +++++++++++++---
 net/mptcp/protocol.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  2 ++
 5 files changed, 87 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index f23fda0c55a7..5128feec942c 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -85,6 +85,9 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK),
 	SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED),
 	SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
+	SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
+	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
+	SNMP_MIB_ITEM("RcvCollapsed", MPTCP_MIB_RCVCOLLAPSED),
 };
 
 /* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 812218b5ed2b..2f8f68e33ac5 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -88,6 +88,9 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_SIMULTCONNFALLBACK,	/* Simultaneous connect */
 	MPTCP_MIB_FALLBACKFAILED,	/* Can't fallback due to msk status */
 	MPTCP_MIB_WINPROBE,		/* MPTCP-level zero window probe */
+	MPTCP_MIB_OFO_PRUNED,		/* MPTCP-level OoO queue pruned */
+	MPTCP_MIB_RCVPRUNED,		/* Dropped due to memory constraints */
+	MPTCP_MIB_RCVCOLLAPSED,		/* Collapsed due to memory pressure */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index a6d290427611..a6a6da262413 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1158,15 +1158,29 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
 	return hmac == mp_opt->ahmac;
 }
 
-static bool mptcp_over_limit(const struct sock *sk, struct sk_buff *skb)
+static bool mptcp_over_limit(struct sock *sk, struct sk_buff *skb, u32 seq)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool ret = true;
 	int limit;
 
 	if (!skb->len)
 		return false;
 
+	/* Allow some slack for backlog processing */
 	limit = READ_ONCE(sk->sk_rcvbuf) << 1;
-	return sk_rmem_alloc_get(sk) > limit;
+	if (sk_rmem_alloc_get(sk) < limit)
+		return false;
+
+	mptcp_data_lock(sk);
+	if (!sock_owned_by_user(sk)) {
+		__mptcp_check_prune(sk, seq);
+		ret = sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf);
+	} else {
+		__set_bit(MPTCP_PRUNE, &msk->cb_flags);
+	}
+	mptcp_data_unlock(sk);
+	return ret;
 }
 
 /* Return false when the caller must to drop the packet, i.e. in case of error,
@@ -1197,7 +1211,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		__mptcp_data_acked(subflow->conn);
 		mptcp_data_unlock(subflow->conn);
 
-		if (mptcp_over_limit(subflow->conn, skb))
+		if (mptcp_over_limit(subflow->conn, skb, msk->ack_seq))
 			return false;
 		return true;
 	}
@@ -1277,7 +1291,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		return true;
 	}
 
-	if (mptcp_over_limit(subflow->conn, skb))
+	if (mptcp_over_limit(subflow->conn, skb, mp_opt.use_map ? mp_opt.data_seq : msk->ack_seq))
 		return false;
 
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 800aa7d9408e..9cf135e04d69 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -374,6 +374,59 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
 	skb_dst_drop(skb);
 }
 
+/* "Inspired" by the TCP version */
+static void mptcp_prune_ofo_queue(struct sock *sk, u32 seq)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct rb_node *node, *prev;
+	bool pruned = false;
+
+	if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
+		return;
+
+	node = &msk->ooo_last_skb->rbnode;
+
+	do {
+		struct sk_buff *skb = rb_to_skb(node);
+
+		/* If incoming skb would land last in ofo queue, stop pruning. */
+		if (after(seq, MPTCP_SKB_CB(skb)->map_seq))
+			break;
+
+		pruned = true;
+		prev = rb_prev(node);
+		rb_erase(node, &msk->out_of_order_queue);
+		mptcp_drop(sk, skb);
+		msk->ooo_last_skb = rb_to_skb(prev);
+		if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)
+			break;
+
+		node = prev;
+	} while (node);
+
+	if (pruned)
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_OFOPRUNED);
+}
+
+bool __mptcp_check_prune(struct sock *sk, u32 seq)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	unsigned int dropped;
+
+	if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf))
+		return false;
+
+	dropped = xtcp_collapse_ofo_queue(sk, &msk->out_of_order_queue,
+					  &msk->ooo_last_skb, msk->scaling_ratio);
+	if (dropped)
+		MPTCP_ADD_STATS(sock_net(sk), MPTCP_MIB_RCVCOLLAPSED, dropped);
+	if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf))
+		return false;
+
+	mptcp_prune_ofo_queue(sk, seq);
+	return atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf;
+}
+
 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 {
 	u32 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
@@ -383,6 +436,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 
 	mptcp_borrow_fwdmem(sk, skb);
 
+	if (__mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq)) {
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
+		mptcp_drop(sk, skb);
+		return false;
+	}
+
 	if (MPTCP_SKB_CB(skb)->map_seq == ack_seq) {
 		/* in sequence */
 		msk->bytes_received += copy_len;
@@ -3693,6 +3752,8 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_error_report(sk);
 		if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
 			__mptcp_sync_sndbuf(sk);
+		if (__test_and_clear_bit(MPTCP_PRUNE, &msk->cb_flags))
+			__mptcp_check_prune(sk, msk->ack_seq - 1);
 	}
 }
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ad906737ee9f..e4bc77de725e 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,6 +124,7 @@
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_SYNC_STATE	6
 #define MPTCP_SYNC_SNDBUF	7
+#define MPTCP_PRUNE		8
 
 struct mptcp_skb_cb {
 	u32 map_seq;
@@ -828,6 +829,7 @@ bool __mptcp_close(struct sock *sk, long timeout);
 void mptcp_cancel_work(struct sock *sk);
 void __mptcp_unaccepted_force_close(struct sock *sk);
 void mptcp_set_state(struct sock *sk, int state);
+bool __mptcp_check_prune(struct sock *sk, u32 seq);
 
 bool mptcp_addresses_equal(const struct mptcp_addr_info *a,
 			   const struct mptcp_addr_info *b, bool use_port);
-- 
2.53.0
From nobody Tue May 5 12:24:00 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: yangang@kylinos.cn, geliang@kernel.org, matttbe@kernel.org
Subject: [RFC PATCH 5/6] mptcp: refine coalescing conditions
Date: Mon, 20 Apr 2026 12:29:29 +0200

The current conditions prevent any coalescing when the receive buffer
is small. Ensure that MPTCP can always aggregate at least up to the max
GSO size.

Signed-off-by: Paolo Abeni
---
Note:
- or we can drop the rcvbuf-related check entirely, to be verified vs
  the simult_flows tests
---
 net/mptcp/protocol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9cf135e04d69..8ddd4bb5172e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -161,7 +161,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 
 	if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
 	    MPTCP_SKB_CB(from)->offset ||
-	    ((to->len + from->len) > (limit >> 3)) ||
+	    ((to->len + from->len) > max(U16_MAX, (limit >> 3))) ||
 	    !skb_try_coalesce(to, from, fragstolen, delta))
 		return false;
 
-- 
2.53.0
From nobody Tue May 5 12:24:00 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: yangang@kylinos.cn, geliang@kernel.org, matttbe@kernel.org
Subject: [RFC PATCH 6/6] mptcp: unclone skbs before coalescing them, when needed
Date: Mon, 20 Apr 2026 12:29:30 +0200

The self-tests can trigger skb coalescing on cloned skbs, as the forward
path uses only veth devices. That in turn prevents coalescing, making
the memory pressure scenario more extreme.

Signed-off-by: Paolo Abeni
---
Possibly we could obtain the same effect with some netem magic, which
would be better.
---
 net/mptcp/protocol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8ddd4bb5172e..42af9f9e935d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -162,6 +162,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
 	    MPTCP_SKB_CB(from)->offset ||
 	    ((to->len + from->len) > max(U16_MAX, (limit >> 3))) ||
+	    (skb_cloned(to) && skb_unclone(to, GFP_ATOMIC)) ||
 	    !skb_try_coalesce(to, from, fragstolen, delta))
 		return false;
 
-- 
2.53.0
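
For readers following the length check touched by 'mptcp: refine coalescing conditions', here is a rough user-space sketch of the arithmetic only. It is not kernel code: the function names and SKETCH_U16_MAX are hypothetical stand-ins, and `limit` is assumed to be a non-negative byte count derived from the receive buffer size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's U16_MAX. */
#define SKETCH_U16_MAX 65535

/* Length gate before the change: combined length capped at limit / 8. */
bool len_ok_before(uint32_t to_len, uint32_t from_len, int limit)
{
	assert(limit >= 0);
	return (to_len + from_len) <= (uint32_t)(limit >> 3);
}

/* Length gate after the change: cap is max(U16_MAX, limit / 8). */
bool len_ok_after(uint32_t to_len, uint32_t from_len, int limit)
{
	int cap;

	assert(limit >= 0);
	cap = limit >> 3;
	if (cap < SKETCH_U16_MAX)
		cap = SKETCH_U16_MAX;
	return (to_len + from_len) <= (uint32_t)cap;
}
```

With an assumed limit of 8192 bytes, limit >> 3 is 1024, so two 1448-byte skbs fail the old length gate but pass the new one. A combined length above 65535 is still refused unless limit >> 3 is larger than that.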