From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 531193ECBC3 for ; Mon, 27 Apr 2026 19:52:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319540; cv=none; b=CvECowAxcYq8WjuA1/S3GI94SjqSgFDDcSA/fS/Ok5/LZ7HaF3rGDdr7rzQYF9CqG3Z09sGtNPAe/2FYKLLzdHdD/7KkLFibUkW/HbLtQHw6jGXShqGDdzqhg3Q2zffkEQJaYveiWjs8gJ4QC1YdcitT+3FbGc0Ui21c2SGKUCM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319540; c=relaxed/simple; bh=r2B2Znwq+STF7OG/akDr8kWwjPMsKQXPADrKxvHe34Q=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=m6275MqKa9UxCXRwbJfAi1KoyakTUXw67zqkLaxtuR4e5if5m78E/Z6VK8VfnC+0wyvQsdYAgZ7VcUrvqErPqpx8J8lVoYCUGVRFsRaKg3ZmYNj+q4CqqRRt89bhFkngDPY4Ub+4/p5xfTcOHyfTXkMZzatfFxvbVSE/WjcHI5c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Tk3cW/gx; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Tk3cW/gx" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777319538; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: 
content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=mpXzrkXaoWC2HqPMmw4WYl/kp+u1QDiWtdMkKwHddgI=; b=Tk3cW/gxM/rEAVOcew5+2RMIfCbJOKS6GVKnk5vfGfGj3XWULtzvMQB+YXJQCCo3XeKN3m NscMYdjWuNqS+RrXPb6nxQSg/qIvSb6ruWdxGUR9ps6+0+inB4VOwtGkmkjEymqi3y/Tyq cVJWhMOU1jFt3wW4JPPax2J26fF4ElM= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-85-N82GldSwM8OlfocloNAVGw-1; Mon, 27 Apr 2026 15:52:16 -0400 X-MC-Unique: N82GldSwM8OlfocloNAVGw-1 X-Mimecast-MFC-AGG-ID: N82GldSwM8OlfocloNAVGw_1777319535 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 49C861956088 for ; Mon, 27 Apr 2026 19:52:15 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 80943300757C for ; Mon, 27 Apr 2026 19:52:14 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 01/10] mptcp: move checks vs rcvbuf size earlier in the RX path Date: Mon, 27 Apr 2026 21:51:59 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: BrdbCmaegLylyMcv_IbXmOqwVu0Ep6R847xozuCXQ34_1777319535 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" 
Currently the enforcement of the rcvbuf constraint is implemented when moving the skbs into the msk receive or OoO queue. Under significant memory pressure the above can cause permanent data transfer stalls. Move the checks early on, before landing even in the subflow queues. Signed-off-by: Paolo Abeni --- v1 -> v2: - deal correctly with tcp fin and zero win probe RFC -> v1: - limit vs actual buffer size - use CB info instead of skb->len Note that: - this needs the follow-up patches to really fix the stall - the memory comparison is intentionally very rough, as the msk socket lock is not currently held where the condition is now enforced. This should require some refinement, shared as-is to avoid more latency on my side --- net/mptcp/options.c | 25 +++++++++++++++++++++++-- net/mptcp/protocol.c | 10 ++-------- 2 files changed, 25 insertions(+), 10 deletions(-) diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 4cc583fdc7a9..ad4bb6fd86e1 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -1158,8 +1158,23 @@ static bool add_addr_hmac_valid(struct mptcp_sock *m= sk, return hmac =3D=3D mp_opt->ahmac; } =20 -/* Return false in case of error (or subflow has been reset), - * else return true. +static bool mptcp_over_limit(struct sock *sk, const struct sock *ssk, + const struct sk_buff *skb) +{ + if (likely(sk_rmem_alloc_get(sk) <=3D READ_ONCE(sk->sk_rcvbuf))) + return false; + + /* Avoid silently dropping pure acks, fin or zero win probes. */ + if (TCP_SKB_CB(skb)->seq =3D=3D TCP_SKB_CB(skb)->end_seq || + TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN || + !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt)) + return false; + + return true; +} + +/* Return false when the caller must drop the packet, i.e. in case of erro= r, + * subflow has been reset, or over memory limits. 
*/ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb) { @@ -1185,6 +1200,9 @@ bool mptcp_incoming_options(struct sock *sk, struct s= k_buff *skb) =20 __mptcp_data_acked(subflow->conn); mptcp_data_unlock(subflow->conn); + + if (mptcp_over_limit(subflow->conn, sk, skb)) + return false; return true; } =20 @@ -1263,6 +1281,9 @@ bool mptcp_incoming_options(struct sock *sk, struct s= k_buff *skb) return true; } =20 + if (mptcp_over_limit(subflow->conn, sk, skb)) + return false; + mpext =3D skb_ext_add(skb, SKB_EXT_MPTCP); if (!mpext) return false; diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 17b9a8c13ebf..81a9b8077d6b 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -739,7 +739,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp= _sock *msk, =20 mptcp_init_skb(ssk, skb, offset, len); =20 - if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) { + if (own_msk) { mptcp_subflow_lend_fwdmem(subflow, skb); ret |=3D __mptcp_move_skb(sk, skb); } else { @@ -2197,10 +2197,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struc= t list_head *skbs, u32 *delt =20 *delta =3D 0; while (1) { - /* If the msk recvbuf is full stop, don't drop */ - if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) - break; - prefetch(skb->next); list_del(&skb->list); *delta +=3D skb->truesize; @@ -2228,9 +2224,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, = struct list_head *skbs) DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket && mem_cgroup_from_sk(sk)); =20 - /* Don't spool the backlog if the rcvbuf is full. 
*/ - if (list_empty(&msk->backlog_list) || - sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) + if (list_empty(&msk->backlog_list)) return false; =20 INIT_LIST_HEAD(skbs); --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5146E421886 for ; Mon, 27 Apr 2026 19:52:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319541; cv=none; b=jIWMthrOZA9/9aCbg2QCoclzIgDsb8G6JVAVHyQq63yrf0xXiUD6PjyDthEqkG474nx5tJL8w3uQqTSkD9cpcf2sQ+poPY+s3/z3kjClk6rUWCjG4t9PdlhJggI84SbK/UGBR7Au8dumdulVpiLlQI4JrjFABYs8qYn5+ft4s28= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319541; c=relaxed/simple; bh=+bg3kCbffeQ5PtOoholStJ8nA0IB5USDNsz3uv5ucTg=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=kb7OWTzks8Rw/5KeCpi97ifUdv+8CrDApj+Vjb9lYSS+GQ0lIfBWY3hyRNAI0MsaUUP3EllNG74+9F/2GUWVFjq5FDJCdDaWH1HkZAtptiMVfH6ePhCjFn8v2MphPYv1C56DI51zFlQ8FTifm0iVr43RCxHKqJCa7hFAu+CfwwQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=ZJacYgUp; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="ZJacYgUp" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; 
t=1777319539; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5g/yTm3LII6nhGJojhttzWhUFMSw+1mflut0KHF1M8s=; b=ZJacYgUpjogYAcu1NuvFgIdHM3xW3AYa8lOhQCiqJ7TWVFJ5rp8UG2qAYgZ0hELiwK2DFj t5OUsUEj/+zNkdUj1UWz6tiau/DZzJvcfyel14R9n5Tbne/hfhwayGf5cLnyMP2X0BDJT+ EDfInDf4BZOD+gxldZIPEfKHjGiad9s= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-662-ceflMRNuMRKF9JqLIsFYGg-1; Mon, 27 Apr 2026 15:52:17 -0400 X-MC-Unique: ceflMRNuMRKF9JqLIsFYGg-1 X-Mimecast-MFC-AGG-ID: ceflMRNuMRKF9JqLIsFYGg_1777319536 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 7E9891800451 for ; Mon, 27 Apr 2026 19:52:16 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id B9DB53000C22 for ; Mon, 27 Apr 2026 19:52:15 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 02/10] mptcp: drop the mptcp_ooo_try_coalesce() helper Date: Mon, 27 Apr 2026 21:52:00 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 6oLnqjEIf3yPbMdveNzCWhuCXZxgmadyI1tewa6gQQk_1777319536 
X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" It's used to save an additional comparison for in-order skbs, but is also a barrier to remove CB offset. Remove the helper, let __mptcp_try_coalesce() always perform the sequence check and remove duplicate checks from the callers. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 21 ++++++--------------- 1 file changed, 6 insertions(+), 15 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 81a9b8077d6b..ad0a289b544b 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -159,7 +159,8 @@ static bool __mptcp_try_coalesce(struct sock *sk, struc= t sk_buff *to, { int limit =3D READ_ONCE(sk->sk_rcvbuf); =20 - if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) || + if (MPTCP_SKB_CB(from)->map_seq !=3D MPTCP_SKB_CB(to)->end_seq || + unlikely(MPTCP_SKB_CB(to)->cant_coalesce) || MPTCP_SKB_CB(from)->offset || ((to->len + from->len) > (limit >> 3)) || !skb_try_coalesce(to, from, fragstolen, delta)) @@ -192,15 +193,6 @@ static bool mptcp_try_coalesce(struct sock *sk, struct= sk_buff *to, return true; } =20 -static bool mptcp_ooo_try_coalesce(struct mptcp_sock *msk, struct sk_buff = *to, - struct sk_buff *from) -{ - if (MPTCP_SKB_CB(from)->map_seq !=3D MPTCP_SKB_CB(to)->end_seq) - return false; - - return mptcp_try_coalesce((struct sock *)msk, to, from); -} - /* "inspired" by tcp_rcvbuf_grow(), main difference: * - mptcp does not maintain a msk-level window clamp * - returns true when the receive buffer is actually updated @@ -275,7 +267,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) /* with 2 subflows, adding at end of ooo queue is quite likely * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup. 
*/ - if (mptcp_ooo_try_coalesce(msk, msk->ooo_last_skb, skb)) { + if (mptcp_try_coalesce(sk, msk->ooo_last_skb, skb)) { MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOMERGE); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL); return; @@ -321,7 +313,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA); goto merge_right; } - } else if (mptcp_ooo_try_coalesce(msk, skb1, skb)) { + } else if (mptcp_try_coalesce(sk, skb1, skb)) { MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOMERGE); return; } @@ -672,8 +664,7 @@ static void __mptcp_add_backlog(struct sock *sk, if (!list_empty(&msk->backlog_list)) tail =3D list_last_entry(&msk->backlog_list, struct sk_buff, list); =20 - if (tail && MPTCP_SKB_CB(skb)->map_seq =3D=3D MPTCP_SKB_CB(tail)->end_seq= && - ssk =3D=3D tail->sk && + if (tail && ssk =3D=3D tail->sk && __mptcp_try_coalesce(sk, tail, skb, &fragstolen, &delta)) { skb->truesize -=3D delta; kfree_skb_partial(skb, fragstolen); @@ -797,7 +788,7 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk) =20 end_seq =3D MPTCP_SKB_CB(skb)->end_seq; tail =3D skb_peek_tail(&sk->sk_receive_queue); - if (!tail || !mptcp_ooo_try_coalesce(msk, tail, skb)) { + if (!tail || !mptcp_try_coalesce(sk, tail, skb)) { int delta =3D msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq; =20 /* skip overlapping data, if any */ --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CE94A42188D for ; Mon, 27 Apr 2026 19:52:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319542; cv=none; 
b=iqZRVLnJtM1q0N9XuEStdXi71pNGMsGaEIfPcSgxdQL43p2Py/8n/QG8NxxCyTp9tTrWqbQo0L0upJ5F4uCWzl9Z0ncqxI+zMYSlss07ERnS5lbNLXeTS1oJpWU5135rZ1xDZln9nLYpswPM0UvHgPcujionNeWLFktE5PJ3N6U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319542; c=relaxed/simple; bh=w134bd/7QWax2vB75js04WO3EJh0WCw84fPWDhTC5ZI=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=dpGlwSY4b/+0riGIT/eV52CxLC3Oq2RHGCHnfuLZU1XBlJd73XfgwNCcPpkMNf804zMmctDM+DzGFtCF0rQwAmMb3d8gTKLcwFELGvBUH4qI4gDLeWsPbJ76eXa58tML8srXr7czssEIJYnev6DII08W1wAV6+8PGuQEldhNkhQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=XwoF3rAd; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="XwoF3rAd" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777319540; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=M4M8LkqGqrpFnXTcpLRTfY5qTePjlisVyws8BVhENvw=; b=XwoF3rAdR2POtSR3lk1g/VRjV2jzluSjx7Aycw9R2WsbKKacldkJWMOsBxNvi2TPsdyCIM IBUv5ulCHKek6JuNF090lJpdBaTqn363Qzj+eyA/58yNlHHucT5FS6+ZTFoBBCnEBOBTxG V37P6iR1ru+umrKMuigLn/yf7lIf8dc= Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id 
us-mta-301-ap1Tw1bZMIW1bSz1ezzlLQ-1; Mon, 27 Apr 2026 15:52:18 -0400 X-MC-Unique: ap1Tw1bZMIW1bSz1ezzlLQ-1 X-Mimecast-MFC-AGG-ID: ap1Tw1bZMIW1bSz1ezzlLQ_1777319537 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id BB7281956065 for ; Mon, 27 Apr 2026 19:52:17 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id F1D55300070A for ; Mon, 27 Apr 2026 19:52:16 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 03/10] mptcp: drop the cant_coalesce CB field Date: Mon, 27 Apr 2026 21:52:01 +0200 Message-ID: <8f36529271f360ea255b382a168c76445b47f8b1.1777318959.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 8lmKme7sGltKlWkC4d2uXIZVLsWxUdhjoTB1rS46e5A_1777319537 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" This field is used to ensure in-sequence processing in case of fastopen. Instead, synchronize the fastopen skb sequence when the IASN becomes available with the 3rd ack. When the `cant_coalesce` field was introduced, commit f03afb3aeb9d ("mptcp: drop __mptcp_fastopen_gen_msk_ackseq()") noted that updating the already queued skb for a passive fastopen socket at 3rd ack time would be difficult and race-prone.
The main point is that such an update doesn't need to be performed synchronously at 3rd ack time; it is sufficient to perform it before the next segment is introduced into the msk. To that end, add an explicit test in __mptcp_move_skb(). Performance-wise this trades a conditional in the fast path - in __mptcp_try_coalesce() - with a similar one in __mptcp_move_skb() and a couple more in slow paths. After this change user space will always observe consistent sequence numbers in the receive queue, even in the TFO dummy mapping case. Signed-off-by: Paolo Abeni --- net/mptcp/fastopen.c | 2 +- net/mptcp/protocol.c | 28 ++++++++++++++++++++++++++-- net/mptcp/protocol.h | 4 +++- net/mptcp/subflow.c | 7 +++++++ 4 files changed, 37 insertions(+), 4 deletions(-) diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c index 82ec15bcfd7f..40168adfed22 100644 --- a/net/mptcp/fastopen.c +++ b/net/mptcp/fastopen.c @@ -46,11 +46,11 @@ void mptcp_fastopen_subflow_synack_set_params(struct mp= tcp_subflow_context *subf MPTCP_SKB_CB(skb)->end_seq =3D 0; MPTCP_SKB_CB(skb)->offset =3D 0; MPTCP_SKB_CB(skb)->has_rxtstamp =3D TCP_SKB_CB(skb)->has_rxtstamp; - MPTCP_SKB_CB(skb)->cant_coalesce =3D 1; =20 mptcp_data_lock(sk); DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk)); =20 + mptcp_sk(sk)->rcvd_dummy_seq =3D true; mptcp_borrow_fwdmem(sk, skb); skb_set_owner_r(skb, sk); __skb_queue_tail(&sk->sk_receive_queue, skb); diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index ad0a289b544b..fd88a81f1821 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -160,7 +160,6 @@ static bool __mptcp_try_coalesce(struct sock *sk, struc= t sk_buff *to, int limit =3D READ_ONCE(sk->sk_rcvbuf); =20 if (MPTCP_SKB_CB(from)->map_seq !=3D MPTCP_SKB_CB(to)->end_seq || - unlikely(MPTCP_SKB_CB(to)->cant_coalesce) || MPTCP_SKB_CB(from)->offset || ((to->len + from->len) > (limit >> 3)) || !skb_try_coalesce(to, from, fragstolen, delta)) @@ -357,7 +356,6 @@ static void mptcp_init_skb(struct
sock *ssk, struct sk_= buff *skb, int offset, MPTCP_SKB_CB(skb)->end_seq =3D MPTCP_SKB_CB(skb)->map_seq + copy_len; MPTCP_SKB_CB(skb)->offset =3D offset; MPTCP_SKB_CB(skb)->has_rxtstamp =3D has_rxtstamp; - MPTCP_SKB_CB(skb)->cant_coalesce =3D 0; =20 __skb_unlink(skb, &ssk->sk_receive_queue); =20 @@ -365,6 +363,24 @@ static void mptcp_init_skb(struct sock *ssk, struct sk= _buff *skb, int offset, skb_dst_drop(skb); } =20 +void __mptcp_sync_rcv_sequence(struct sock *sk) +{ + struct mptcp_sock *msk =3D mptcp_sk(sk); + struct sk_buff *skb; + + if (likely(!msk->rcvd_dummy_seq)) + return; + + /* User space can have already received the TFO skb. */ + msk->rcvd_dummy_seq =3D false; + skb =3D skb_peek_tail(&sk->sk_receive_queue); + if (!skb) + return; + + MPTCP_SKB_CB(skb)->map_seq =3D msk->ack_seq - skb->len; + MPTCP_SKB_CB(skb)->end_seq =3D msk->ack_seq; +} + static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb) { u64 copy_len =3D MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq; @@ -373,6 +389,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct s= k_buff *skb) =20 mptcp_borrow_fwdmem(sk, skb); =20 + /* Be sure to sync the eventual fastopen dummy mapping before any other + * skb lands into the msk. 
+ */ + if (unlikely(msk->rcvd_dummy_seq)) + __mptcp_sync_rcv_sequence(sk); + if (MPTCP_SKB_CB(skb)->map_seq =3D=3D msk->ack_seq) { /* in sequence */ msk->bytes_received +=3D copy_len; @@ -3682,6 +3704,8 @@ static void mptcp_release_cb(struct sock *sk) __mptcp_error_report(sk); if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags)) __mptcp_sync_sndbuf(sk); + if (__test_and_clear_bit(MPTCP_SYNC_SEQ, &msk->cb_flags)) + __mptcp_sync_rcv_sequence(sk); } } =20 diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 661600f8b573..16a1f4531dad 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -124,13 +124,13 @@ #define MPTCP_FLUSH_JOIN_LIST 5 #define MPTCP_SYNC_STATE 6 #define MPTCP_SYNC_SNDBUF 7 +#define MPTCP_SYNC_SEQ 8 =20 struct mptcp_skb_cb { u64 map_seq; u64 end_seq; u32 offset; u8 has_rxtstamp; - u8 cant_coalesce; }; =20 #define MPTCP_SKB_CB(__skb) ((struct mptcp_skb_cb *)&((__skb)->cb[0])) @@ -310,6 +310,7 @@ struct mptcp_sock { u32 token; unsigned long flags; unsigned long cb_flags; + bool rcvd_dummy_seq; bool recovery; /* closing subflow write queue reinjected */ bool can_ack; bool fully_established; @@ -1172,6 +1173,7 @@ void mptcp_event_pm_listener(const struct sock *ssk, enum mptcp_event_type event); bool mptcp_userspace_pm_active(const struct mptcp_sock *msk); =20 +void __mptcp_sync_rcv_sequence(struct sock *sk); void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context= *subflow, struct request_sock *req); int mptcp_pm_genl_fill_addr(struct sk_buff *msg, diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index c57ed27a5fb0..b226c7cd1b79 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -478,6 +478,8 @@ static void subflow_set_remote_key(struct mptcp_sock *m= sk, struct mptcp_subflow_context *subflow, const struct mptcp_options_received *mp_opt) { + struct sock *sk =3D (struct sock *)msk; + /* active MPC subflow will reach here multiple times: * at subflow_finish_connect() time and at 4th ack time */ 
@@ -496,6 +498,11 @@ static void subflow_set_remote_key(struct mptcp_sock *= msk, WRITE_ONCE(msk->ack_seq, subflow->iasn); WRITE_ONCE(msk->can_ack, true); atomic64_set(&msk->rcv_wnd_sent, subflow->iasn); + + if (!sock_owned_by_user(sk)) + __mptcp_sync_rcv_sequence(sk); + else + __set_bit(MPTCP_SYNC_SEQ, &msk->cb_flags); } =20 static void mptcp_propagate_state(struct sock *sk, struct sock *ssk, --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 527D841C313 for ; Mon, 27 Apr 2026 19:52:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319544; cv=none; b=jPOAiDp3pul+hvtvjZSb6Ro95ymsIjobTf70ziY2st0n2EPdnd/y9Ie+f896ZvcrUFGXh2kL1Bu5Ac96p1e6GVCz2I4L7Py+CgCWh6InnLDtIBQdLs1422Z0ue5gncEzTgsSRl1ElZkfCrTAUApA1V3pjWT7q/xupn1pR1U9Eeo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319544; c=relaxed/simple; bh=p/6fh99FQOThF5s1ivhu61rVIvGSYO5ZJV05AxJFPh0=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=hXinCHhSANPP3b5GZBZ4J/BRcsCUybcTbi0JwGb21BMPnyJ8WOEbN7muT/EEmV03CHZkZ1CfKFuXdfLvym1d04KtFYUYV6v1ENCsb4SCamLajbowl6tf29IZkCfWLeIuj8HHQsLv/kanlSN/MpxIodN+utDPN6WeQBjXNXKzCGw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=a+ZyVe9I; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass 
smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="a+ZyVe9I" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777319541; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=whan17p7FWhmYeydr0M+IsGGtd7ymiGHbE1oJjACUyM=; b=a+ZyVe9I/lABYFlqaQrWoKeKI67mcrFwCHQLD98cPg9X0P1atTk6jzzJ3vdfWY+JEOTM7H tCVkNxllQ+TH6yQ7iaADeJZyLKH4fLN2flps7KGaesZl8mK0Wkrz7vA1wwpQeLWRVRVElH ERTmx8T5ShnI3ZGhq8kxLLpFM56JyQ0= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-76-4KsbAjKlM_213Q_W_5HWsw-1; Mon, 27 Apr 2026 15:52:19 -0400 X-MC-Unique: 4KsbAjKlM_213Q_W_5HWsw-1 X-Mimecast-MFC-AGG-ID: 4KsbAjKlM_213Q_W_5HWsw_1777319539 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id F2CB019560AF for ; Mon, 27 Apr 2026 19:52:18 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 393E2300070A for ; Mon, 27 Apr 2026 19:52:17 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 04/10] mptcp: remove CB offset field Date: Mon, 27 Apr 2026 21:52:02 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: 
mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: mXDaF3GfKO_ZnbPLKYCYBpghPryAZ0YtjDXNwcCacD4_1777319539 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" Instead, use a new msk-level field to track the bytes already consumed inside each skb, carrying the amount of bytes already copied to user-space, like what TCP already does. `copied_seq` is always accessed under the msk socket lock, delegating the initialization to the msk release cb, when the socket is owned by the user-space at remote key reception time. This simplifies the __mptcp_recvmsg_mskq() and mptcp_inq_hint() code a bit and will also make the next patch possible. Signed-off-by: Paolo Abeni --- v1 -> v2: - deal correctly with peek, as usual "inspired" by the corresponding tcp code - update mptcp_inq_hint(), too Note: this has the potential to break almost everything. On the flip side the CB->offset vs copied_seq difference from TCP is quite confusing and removing it will be for the better.
Also, this explicitly relies on "mptcp: do not drop partial packets" to avoid dropping partially consumed packets --- net/mptcp/fastopen.c | 5 +-- net/mptcp/protocol.c | 89 +++++++++++++++++++------------------------- net/mptcp/protocol.h | 2 +- net/mptcp/subflow.c | 6 ++- 4 files changed, 46 insertions(+), 56 deletions(-) diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c index 40168adfed22..cbe2a6192002 100644 --- a/net/mptcp/fastopen.c +++ b/net/mptcp/fastopen.c @@ -42,9 +42,8 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptc= p_subflow_context *subf subflow->ssn_offset +=3D skb->len; =20 /* Only the sequence delta is relevant */ - MPTCP_SKB_CB(skb)->map_seq =3D -skb->len; - MPTCP_SKB_CB(skb)->end_seq =3D 0; - MPTCP_SKB_CB(skb)->offset =3D 0; + MPTCP_SKB_CB(skb)->map_seq =3D 0; + MPTCP_SKB_CB(skb)->end_seq =3D skb->len; MPTCP_SKB_CB(skb)->has_rxtstamp =3D TCP_SKB_CB(skb)->has_rxtstamp; =20 mptcp_data_lock(sk); diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index fd88a81f1821..b24228f87216 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -160,7 +160,6 @@ static bool __mptcp_try_coalesce(struct sock *sk, struc= t sk_buff *to, int limit =3D READ_ONCE(sk->sk_rcvbuf); =20 if (MPTCP_SKB_CB(from)->map_seq !=3D MPTCP_SKB_CB(to)->end_seq || - MPTCP_SKB_CB(from)->offset || ((to->len + from->len) > (limit >> 3)) || !skb_try_coalesce(to, from, fragstolen, delta)) return false; @@ -342,8 +341,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) skb_set_owner_r(skb, sk); } =20 -static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offs= et, - int copy_len) +static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offs= et) { struct mptcp_subflow_context *subflow =3D mptcp_subflow_ctx(ssk); bool has_rxtstamp =3D TCP_SKB_CB(skb)->has_rxtstamp; @@ -352,9 +350,9 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_= buff *skb, int offset, * mptcp_subflow_get_mapped_dsn() is
based on the current tp->copied_seq * value */ - MPTCP_SKB_CB(skb)->map_seq =3D mptcp_subflow_get_mapped_dsn(subflow); - MPTCP_SKB_CB(skb)->end_seq =3D MPTCP_SKB_CB(skb)->map_seq + copy_len; - MPTCP_SKB_CB(skb)->offset =3D offset; + MPTCP_SKB_CB(skb)->map_seq =3D mptcp_subflow_get_mapped_dsn(subflow) - + offset; + MPTCP_SKB_CB(skb)->end_seq =3D MPTCP_SKB_CB(skb)->map_seq + skb->len; MPTCP_SKB_CB(skb)->has_rxtstamp =3D has_rxtstamp; =20 __skb_unlink(skb, &ssk->sk_receive_queue); @@ -377,6 +375,8 @@ void __mptcp_sync_rcv_sequence(struct sock *sk) if (!skb) return; =20 + /* The TFO segment data sits before IDSN */ + msk->copied_seq -=3D skb->len; MPTCP_SKB_CB(skb)->map_seq =3D msk->ack_seq - skb->len; MPTCP_SKB_CB(skb)->end_seq =3D msk->ack_seq; } @@ -750,7 +750,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp= _sock *msk, if (offset < skb->len) { size_t len =3D skb->len - offset; =20 - mptcp_init_skb(ssk, skb, offset, len); + mptcp_init_skb(ssk, skb, offset); =20 if (own_msk) { mptcp_subflow_lend_fwdmem(subflow, skb); @@ -817,8 +817,6 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk) pr_debug("uncoalesced seq=3D%llx ack seq=3D%llx delta=3D%d\n", MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq, delta); - MPTCP_SKB_CB(skb)->offset +=3D delta; - MPTCP_SKB_CB(skb)->map_seq +=3D delta; __skb_queue_tail(&sk->sk_receive_queue, skb); } msk->bytes_received +=3D end_seq - msk->ack_seq; @@ -2062,34 +2060,22 @@ static void mptcp_eat_recv_skb(struct sock *sk, str= uct sk_buff *skb) } =20 static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg, - size_t len, int flags, int copied_total, + size_t len, int flags, u64 *seq, struct scm_timestamping_internal *tss, int *cmsg_flags, struct sk_buff **last) { struct mptcp_sock *msk =3D mptcp_sk(sk); struct sk_buff *skb, *tmp; - int total_data_len =3D 0; int copied =3D 0; =20 skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) { - u32 delta, offset =3D MPTCP_SKB_CB(skb)->offset; - u32 data_len =3D skb->len - offset; 
- u32 count; + u64 offset =3D *seq - MPTCP_SKB_CB(skb)->map_seq; + u32 count, data_len =3D skb->len - offset; int err; =20 - if (flags & MSG_PEEK) { - /* skip already peeked skbs */ - if (total_data_len + data_len <=3D copied_total) { - total_data_len +=3D data_len; - *last =3D skb; - continue; - } - - /* skip the already peeked data in the current skb */ - delta =3D copied_total - total_data_len; - offset +=3D delta; - data_len -=3D delta; - } + /* Skip the already peeked data. */ + if (offset >=3D skb->len) + continue; =20 count =3D min_t(size_t, len - copied, data_len); if (!(flags & MSG_TRUNC)) { @@ -2107,14 +2093,12 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, st= ruct msghdr *msg, } =20 copied +=3D count; + *seq +=3D count; =20 if (!(flags & MSG_PEEK)) { msk->bytes_consumed +=3D count; - if (count < data_len) { - MPTCP_SKB_CB(skb)->offset +=3D count; - MPTCP_SKB_CB(skb)->map_seq +=3D count; + if (count < data_len) break; - } =20 mptcp_eat_recv_skb(sk, skb); } else { @@ -2275,22 +2259,17 @@ static bool mptcp_move_skbs(struct sock *sk) static unsigned int mptcp_inq_hint(const struct sock *sk) { const struct mptcp_sock *msk =3D mptcp_sk(sk); - const struct sk_buff *skb; - - skb =3D skb_peek(&sk->sk_receive_queue); - if (skb) { - u64 hint_val =3D READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq; + u64 hint_val; =20 - if (hint_val >=3D INT_MAX) - return INT_MAX; + hint_val =3D READ_ONCE(msk->ack_seq) - msk->copied_seq; + if (hint_val >=3D INT_MAX) + return INT_MAX; =20 - return (unsigned int)hint_val; - } - - if (sk->sk_state =3D=3D TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN)) + if (!hint_val && + (sk->sk_state =3D=3D TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN))) return 1; =20 - return 0; + return (unsigned int)hint_val; } =20 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, @@ -2299,6 +2278,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msgh= dr *msg, size_t len, struct mptcp_sock *msk =3D mptcp_sk(sk); struct 
scm_timestamping_internal tss; int copied =3D 0, cmsg_flags =3D 0; + u64 peek_seq, *seq; int target; long timeo; =20 @@ -2318,6 +2298,11 @@ static int mptcp_recvmsg(struct sock *sk, struct msg= hdr *msg, size_t len, =20 len =3D min_t(size_t, len, INT_MAX); target =3D sock_rcvlowat(sk, flags & MSG_WAITALL, len); + seq =3D &msk->copied_seq; + if (flags & MSG_PEEK) { + peek_seq =3D msk->copied_seq; + seq =3D &peek_seq; + } =20 if (unlikely(msk->recvmsg_inq)) cmsg_flags =3D MPTCP_CMSG_INQ; @@ -2327,7 +2312,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msgh= dr *msg, size_t len, int err, bytes_read; =20 bytes_read =3D __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, - copied, &tss, &cmsg_flags, + seq, &tss, &cmsg_flags, &last); if (unlikely(bytes_read < 0)) { if (!copied) @@ -3479,6 +3464,7 @@ static int mptcp_disconnect(struct sock *sk, int flag= s) =20 /* for fallback's sake */ WRITE_ONCE(msk->ack_seq, 0); + msk->copied_seq =3D 0; =20 WRITE_ONCE(sk->sk_shutdown, 0); sk_error_report(sk); @@ -3704,8 +3690,13 @@ static void mptcp_release_cb(struct sock *sk) __mptcp_error_report(sk); if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags)) __mptcp_sync_sndbuf(sk); - if (__test_and_clear_bit(MPTCP_SYNC_SEQ, &msk->cb_flags)) + if (__test_and_clear_bit(MPTCP_SYNC_SEQ, &msk->cb_flags)) { + struct mptcp_subflow_context *subflow; + + subflow =3D mptcp_subflow_ctx(msk->first); + msk->copied_seq =3D subflow->iasn; __mptcp_sync_rcv_sequence(sk); + } } } =20 @@ -4364,7 +4355,7 @@ static struct sk_buff *mptcp_recv_skb(struct sock *sk= , u32 *off) mptcp_move_skbs(sk); =20 while ((skb =3D skb_peek(&sk->sk_receive_queue)) !=3D NULL) { - offset =3D MPTCP_SKB_CB(skb)->offset; + offset =3D msk->copied_seq - MPTCP_SKB_CB(skb)->map_seq; if (offset < skb->len) { *off =3D offset; return skb; @@ -4406,11 +4397,9 @@ static int __mptcp_read_sock(struct sock *sk, read_d= escriptor_t *desc, copied +=3D count; =20 msk->bytes_consumed +=3D count; - if (count < data_len) { - 
MPTCP_SKB_CB(skb)->offset +=3D count; - MPTCP_SKB_CB(skb)->map_seq +=3D count; + msk->copied_seq +=3D count; + if (count < data_len) break; - } =20 mptcp_eat_recv_skb(sk, skb); } diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 16a1f4531dad..68bedb60871f 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -129,7 +129,6 @@ struct mptcp_skb_cb { u64 map_seq; u64 end_seq; - u32 offset; u8 has_rxtstamp; }; =20 @@ -289,6 +288,7 @@ struct mptcp_sock { u64 bytes_sent; u64 snd_nxt; u64 bytes_received; + u64 copied_seq; u64 ack_seq; atomic64_t rcv_wnd_sent; u64 rcv_data_fin_seq; diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index b226c7cd1b79..6d3d0106749f 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -499,10 +499,12 @@ static void subflow_set_remote_key(struct mptcp_sock = *msk, WRITE_ONCE(msk->can_ack, true); atomic64_set(&msk->rcv_wnd_sent, subflow->iasn); =20 - if (!sock_owned_by_user(sk)) + if (!sock_owned_by_user(sk)) { + msk->copied_seq =3D subflow->iasn; __mptcp_sync_rcv_sequence(sk); - else + } else { __set_bit(MPTCP_SYNC_SEQ, &msk->cb_flags); + } } =20 static void mptcp_propagate_state(struct sock *sk, struct sock *ssk, --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ECC5E40149B for ; Mon, 27 Apr 2026 19:52:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319544; cv=none; b=s37ddLsQJjL7q8dlXD/NyJRWRNerVgjHd4DPDmQj6/DOYcOYtiJaCfBL/MHGPat2JrWcpXESS7NL62hwzH+XKsovdvyFeBcNB6hKuIZwPRCEmM6AWNGuPyiAF0rpozZmCyNwjVcR+osZWdd11CtK7njnjg9amgFKPX3qmsa95eM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; 
t=1777319544; c=relaxed/simple; bh=EDio4JrWScmgYY3Nl+YWXK5xBU+S8Z/s8WDOqLo9TKE=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=N8eg8zBsXvaJ4RBaC8Zv65bWdJoGHyioXpwJrNobbgGo663QXgRP1gMp7Lt7spkg6wTZLufvhPqa48X6Zaf6soC8+r2hlTqfEpSej0htegqcCBomEIEwLngSw1xoSnFxXx97DgMt28uUj3/QkIhT/3NwG/tnHBd5x1xmY05qiLk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Few447CT; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Few447CT" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777319542; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=cBJdekeHUt1pf66wTxpeWcmQWYKB1d0KVBC4Z7uuntg=; b=Few447CTCC/YFJGFHZtExFsJd+EjAH6G+xLu87ktBRgE0Z1TJ0x3o1D7TAukuY0PLVmCAO cevErk0Eg2L5paO1rQ7Wi+Q0xXPkAPwZ4S0RWlYHgxA0Xn0sEDXhTS5EW+n8cvnNBTOIsq TXURCUUgLIpiH4jCfXJfksZT9F/PJpM= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-518-lVDnO_i_OEaT8StOEOOfwg-1; Mon, 27 Apr 2026 15:52:21 -0400 X-MC-Unique: lVDnO_i_OEaT8StOEOOfwg-1 X-Mimecast-MFC-AGG-ID: lVDnO_i_OEaT8StOEOOfwg_1777319540 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com 
(mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 3781419560B0 for ; Mon, 27 Apr 2026 19:52:20 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 6EF2F300070A for ; Mon, 27 Apr 2026 19:52:19 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 05/10] mptcp: sync mptcp skb cb layout with tcp one Date: Mon, 27 Apr 2026 21:52:03 +0200 Message-ID: <957dfd22d5d9b88c3273f81859f128fc10be6946.1777318959.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: lYJg7DDq1LArk1NLxQ4YcwXxi4KFmb23B5svZkkyI3E_1777319540 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true"

The MPTCP protocol uses a significantly different CB layout than TCP, as
it includes different information and uses 64 bits for the sequence
numbers.

As the msk-level rcvbuf size is limited by the core socket code to
INT_MAX, after validating the incoming skb against the current receive
window we can safely use 32 bits for the MPTCP-level sequence numbers.

This allows updating the MPTCP CB layout so that fields with
corresponding TCP-level data use the same area inside the CB itself.
Add a build-time check to enforce the latter invariant.
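Illustration only, not part of the patch: a minimal user-space sketch of
why the truncation is safe. When two 64-bit sequence numbers are less
than 2^31 apart, a wrap-aware comparison of their low 32 bits gives the
same ordering as the full comparison. The demo_* helpers are
hypothetical stand-ins for the kernel's before()/before64(), and the two
structs are invented layouts used only to show a build-time offset
check in the spirit of BUILD_BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's before() and before64(). */
static inline int demo_seq_before32(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static inline int demo_seq_before64(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Invented layouts: the shared fields must sit at the same offsets. */
struct demo_tcp_cb {
	uint32_t seq;
	uint32_t end_seq;
};

struct demo_mptcp_cb {
	uint32_t map_seq;
	uint32_t end_seq;
	uint64_t map_seq64;
};

/* Build-time layout checks: compilation fails if the offsets diverge. */
static_assert(offsetof(struct demo_mptcp_cb, map_seq) ==
	      offsetof(struct demo_tcp_cb, seq), "map_seq/seq mismatch");
static_assert(offsetof(struct demo_mptcp_cb, end_seq) ==
	      offsetof(struct demo_tcp_cb, end_seq), "end_seq mismatch");

/* True when the 32-bit and 64-bit comparisons order a and b alike. */
static inline int demo_orderings_agree(uint64_t a, uint64_t b)
{
	return demo_seq_before64(a, b) ==
	       demo_seq_before32((uint32_t)a, (uint32_t)b);
}
```

The interesting case is a pair straddling a 32-bit wrap, e.g.
0x1fffffff0 and 0x2000000f0: the low halves are 0xfffffff0 and 0xf0,
and the wrap-aware helper still reports the first as coming before the
second. The equivalence only holds while the distance stays below 2^31.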
Signed-off-by: Paolo Abeni --- v1 -> v2: - use u64 for admission checks rfc -> v1: - keep `ack_seq` up2date --- net/mptcp/fastopen.c | 2 ++ net/mptcp/protocol.c | 78 ++++++++++++++++++++++++++++---------------- net/mptcp/protocol.h | 7 ++-- 3 files changed, 56 insertions(+), 31 deletions(-) diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c index cbe2a6192002..f65312b41b95 100644 --- a/net/mptcp/fastopen.c +++ b/net/mptcp/fastopen.c @@ -42,7 +42,9 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptc= p_subflow_context *subf subflow->ssn_offset +=3D skb->len; =20 /* Only the sequence delta is relevant */ + MPTCP_SKB_CB(skb)->map_seq64 =3D 0; MPTCP_SKB_CB(skb)->map_seq =3D 0; + MPTCP_SKB_CB(skb)->flags =3D 0; MPTCP_SKB_CB(skb)->end_seq =3D skb->len; MPTCP_SKB_CB(skb)->has_rxtstamp =3D TCP_SKB_CB(skb)->has_rxtstamp; =20 diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index b24228f87216..683eaa11634a 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -164,7 +164,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struc= t sk_buff *to, !skb_try_coalesce(to, from, fragstolen, delta)) return false; =20 - pr_debug("colesced seq %llx into %llx new len %d new end seq %llx\n", + pr_debug("colesced seq %x into %x new len %d new end seq %x\n", MPTCP_SKB_CB(from)->map_seq, MPTCP_SKB_CB(to)->map_seq, to->len, MPTCP_SKB_CB(from)->end_seq); MPTCP_SKB_CB(to)->end_seq =3D MPTCP_SKB_CB(from)->end_seq; @@ -234,14 +234,18 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *m= sk, struct sk_buff *skb) { struct sock *sk =3D (struct sock *)msk; struct rb_node **p, *parent; - u64 seq, end_seq, max_seq; + u64 end_seq, max_seq; struct sk_buff *skb1; + u32 seq; =20 seq =3D MPTCP_SKB_CB(skb)->map_seq; - end_seq =3D MPTCP_SKB_CB(skb)->end_seq; + end_seq =3D MPTCP_SKB_CB(skb)->map_seq64 + skb->len; max_seq =3D atomic64_read(&msk->rcv_wnd_sent); =20 - pr_debug("msk=3D%p seq=3D%llx limit=3D%llx empty=3D%d\n", msk, seq, max_s= eq, + /* Use the full sequence 
space to perform the admission checks, to + * protect vs possible wrap-arounds. + */ + pr_debug("msk=3D%p seq=3D%x limit=3D%llx empty=3D%d\n", msk, seq, max_seq, RB_EMPTY_ROOT(&msk->out_of_order_queue)); if (after64(end_seq, max_seq)) { /* out of window */ @@ -272,7 +276,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) } =20 /* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */ - if (!before64(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) { + if (!before(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) { MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL); parent =3D &msk->ooo_last_skb->rbnode; p =3D &parent->rb_right; @@ -284,18 +288,18 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *m= sk, struct sk_buff *skb) while (*p) { parent =3D *p; skb1 =3D rb_to_skb(parent); - if (before64(seq, MPTCP_SKB_CB(skb1)->map_seq)) { + if (before(seq, MPTCP_SKB_CB(skb1)->map_seq)) { p =3D &parent->rb_left; continue; } - if (before64(seq, MPTCP_SKB_CB(skb1)->end_seq)) { - if (!after64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) { + if (before(seq, MPTCP_SKB_CB(skb1)->end_seq)) { + if (!after(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) { /* All the bits are present. Drop. */ mptcp_drop(sk, skb); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA); return; } - if (after64(seq, MPTCP_SKB_CB(skb1)->map_seq)) { + if (after(seq, MPTCP_SKB_CB(skb1)->map_seq)) { /* partial overlap: * | skb | * | skb1 | @@ -326,7 +330,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk= , struct sk_buff *skb) merge_right: /* Remove other segments covered by skb. 
*/ while ((skb1 =3D skb_rb_next(skb)) !=3D NULL) { - if (before64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) + if (before((u32)end_seq, MPTCP_SKB_CB(skb1)->end_seq)) break; rb_erase(&skb1->rbnode, &msk->out_of_order_queue); mptcp_drop(sk, skb1); @@ -346,13 +350,15 @@ static void mptcp_init_skb(struct sock *ssk, struct s= k_buff *skb, int offset) struct mptcp_subflow_context *subflow =3D mptcp_subflow_ctx(ssk); bool has_rxtstamp =3D TCP_SKB_CB(skb)->has_rxtstamp; =20 - /* the skb map_seq accounts for the skb offset: + /* The skb map_seq accounts for the skb offset: * mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq - * value + * value; note that end seq number is only available in 32bits format. */ - MPTCP_SKB_CB(skb)->map_seq =3D mptcp_subflow_get_mapped_dsn(subflow) - - offset; + MPTCP_SKB_CB(skb)->map_seq64 =3D mptcp_subflow_get_mapped_dsn(subflow) - + offset; + MPTCP_SKB_CB(skb)->map_seq =3D (u32)MPTCP_SKB_CB(skb)->map_seq64; MPTCP_SKB_CB(skb)->end_seq =3D MPTCP_SKB_CB(skb)->map_seq + skb->len; + MPTCP_SKB_CB(skb)->flags =3D 0; MPTCP_SKB_CB(skb)->has_rxtstamp =3D has_rxtstamp; =20 __skb_unlink(skb, &ssk->sk_receive_queue); @@ -377,13 +383,14 @@ void __mptcp_sync_rcv_sequence(struct sock *sk) =20 /* The TFO segment data sits before IDSN */ msk->copied_seq -=3D skb->len; - MPTCP_SKB_CB(skb)->map_seq =3D msk->ack_seq - skb->len; + MPTCP_SKB_CB(skb)->map_seq64 =3D msk->ack_seq - skb->len; + MPTCP_SKB_CB(skb)->map_seq =3D (u32)MPTCP_SKB_CB(skb)->map_seq64; MPTCP_SKB_CB(skb)->end_seq =3D msk->ack_seq; } =20 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb) { - u64 copy_len =3D MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq; + u32 copy_len =3D MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq; struct mptcp_sock *msk =3D mptcp_sk(sk); struct sk_buff *tail; =20 @@ -395,7 +402,7 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk= _buff *skb) if (unlikely(msk->rcvd_dummy_seq)) __mptcp_sync_rcv_sequence(sk); =20 - if 
(MPTCP_SKB_CB(skb)->map_seq =3D=3D msk->ack_seq) { + if (MPTCP_SKB_CB(skb)->map_seq64 =3D=3D msk->ack_seq) { /* in sequence */ msk->bytes_received +=3D copy_len; WRITE_ONCE(msk->ack_seq, msk->ack_seq + copy_len); @@ -406,7 +413,8 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk= _buff *skb) skb_set_owner_r(skb, sk); __skb_queue_tail(&sk->sk_receive_queue, skb); return true; - } else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) { + } else if (after64(MPTCP_SKB_CB(skb)->map_seq64 + skb->len, + msk->ack_seq)) { mptcp_data_queue_ofo(msk, skb); return false; } @@ -787,40 +795,40 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk) { struct sock *sk =3D (struct sock *)msk; struct sk_buff *skb, *tail; + u32 seq_delta, ack_seq; bool moved =3D false; struct rb_node *p; - u64 end_seq; =20 p =3D rb_first(&msk->out_of_order_queue); pr_debug("msk=3D%p empty=3D%d\n", msk, RB_EMPTY_ROOT(&msk->out_of_order_q= ueue)); while (p) { + ack_seq =3D msk->ack_seq; skb =3D rb_to_skb(p); - if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) + if (after(MPTCP_SKB_CB(skb)->map_seq, ack_seq)) break; =20 p =3D rb_next(p); rb_erase(&skb->rbnode, &msk->out_of_order_queue); =20 - if (unlikely(!after64(MPTCP_SKB_CB(skb)->end_seq, - msk->ack_seq))) { + if (unlikely(!after(MPTCP_SKB_CB(skb)->end_seq, ack_seq))) { mptcp_drop(sk, skb); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA); continue; } =20 - end_seq =3D MPTCP_SKB_CB(skb)->end_seq; + seq_delta =3D MPTCP_SKB_CB(skb)->end_seq - ack_seq; tail =3D skb_peek_tail(&sk->sk_receive_queue); if (!tail || !mptcp_try_coalesce(sk, tail, skb)) { - int delta =3D msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq; + int delta =3D ack_seq - MPTCP_SKB_CB(skb)->map_seq; =20 /* skip overlapping data, if any */ - pr_debug("uncoalesced seq=3D%llx ack seq=3D%llx delta=3D%d\n", - MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq, + pr_debug("uncoalesced seq=3D%x ack seq=3D%x delta=3D%d\n", + MPTCP_SKB_CB(skb)->map_seq, ack_seq, delta); 
__skb_queue_tail(&sk->sk_receive_queue, skb); } - msk->bytes_received +=3D end_seq - msk->ack_seq; - WRITE_ONCE(msk->ack_seq, end_seq); + msk->bytes_received +=3D seq_delta; + WRITE_ONCE(msk->ack_seq, msk->ack_seq + seq_delta); moved =3D true; } return moved; @@ -2069,7 +2077,7 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, stru= ct msghdr *msg, int copied =3D 0; =20 skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) { - u64 offset =3D *seq - MPTCP_SKB_CB(skb)->map_seq; + u32 offset =3D (u32)(*seq) - MPTCP_SKB_CB(skb)->map_seq; u32 count, data_len =3D skb->len - offset; int err; =20 @@ -4604,11 +4612,23 @@ static int mptcp_napi_poll(struct napi_struct *napi= , int budget) return work_done; } =20 +#define CHK_CB_FIELD(mptcp_field, tcp_field) \ + ({ \ + BUILD_BUG_ON(offsetof(struct mptcp_skb_cb, mptcp_field) !=3D \ + offsetof(struct tcp_skb_cb, tcp_field)); \ + BUILD_BUG_ON(offsetofend(struct mptcp_skb_cb, mptcp_field) !=3D \ + offsetofend(struct tcp_skb_cb, tcp_field)); \ + }) + void __init mptcp_proto_init(void) { struct mptcp_delegated_action *delegated; int cpu; =20 + CHK_CB_FIELD(map_seq, seq); + CHK_CB_FIELD(end_seq, end_seq); + CHK_CB_FIELD(flags, tcp_flags); + mptcp_prot.h.hashinfo =3D tcp_prot.h.hashinfo; =20 if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL)) diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 68bedb60871f..e4569b3af744 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -127,9 +127,12 @@ #define MPTCP_SYNC_SEQ 8 =20 struct mptcp_skb_cb { - u64 map_seq; - u64 end_seq; + u32 map_seq; + u32 end_seq; + u32 unused; + u16 flags; u8 has_rxtstamp; + u64 map_seq64; }; =20 #define MPTCP_SKB_CB(__skb) ((struct mptcp_skb_cb *)&((__skb)->cb[0])) --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by 
smtp.subspace.kernel.org (Postfix) with ESMTPS id B7EA21D0DEE for ; Mon, 27 Apr 2026 19:52:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319547; cv=none; b=K+gQ1nr588Cl09HAtnYLALXi0Q4FPPtb+XIATcUzZUWPstHHmNJ8E1cClsoWnnm5RfV10tzVmXralCdK2m4zPsZ1+ZWENiASQVF+VYIR331chwaC2e2abN/4sH34HeM1eNlgHDSCwN+vk8XkMKiCZU87wQ9XOTt/zvgn02Qgb9I= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319547; c=relaxed/simple; bh=QAgeFVxrmR5mrmnN3gbfnYhnzJXIlgmvLuGBxnRNgpA=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=Z22ZoeLXqWMFR4gknhUGf4v9HoOx/X1wBggrZ3JpKr38oksMnZrpJ2gtyeloyxFwk4quLaVvAlN39c8Kl3ZQ0OsxLNYJfUnP9N3eCKjC39ZutL/zZJCIT1SQwSPhdRRXCFYVwjYS/rR2alr/5YLXkLtxH5L9G9gmpfQIeDqKZKQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=C870/KaD; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="C870/KaD" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777319545; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FUS3mIYnu0NHSGln8VsDNep7QRgEPvYfq6q3pmfspJA=; b=C870/KaDK5Ru5gRhPgGM1Fqb+vvI81WZIU9/1wTSWCkAWEXeEUGuwuXlv9/UoNd7R7TWY3 
pzNqSPre2eM+JeB3mbcya3tARMK/jwC9goo7cs+OavppWyukMMcQzsfn3+0Qd5+LOlBkzk gLSck064Fx3OlKTuT6c8HYOhyuJHGCY= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-524-ZTCH8r2eOV-6LC1EnZ8EBw-1; Mon, 27 Apr 2026 15:52:22 -0400 X-MC-Unique: ZTCH8r2eOV-6LC1EnZ8EBw-1 X-Mimecast-MFC-AGG-ID: ZTCH8r2eOV-6LC1EnZ8EBw_1777319541 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 6C06018002C0 for ; Mon, 27 Apr 2026 19:52:21 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id A7A80300070A for ; Mon, 27 Apr 2026 19:52:20 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 06/10] tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too Date: Mon, 27 Apr 2026 21:52:04 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: nz4P3szx7ThcBl6fLQCGI-29zf8qmYF9IdHdBZJ8Fco_1777319541 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true"

The end goal is to avoid duplicating the rather non-trivial collapsing
strategy at the MPTCP level. After the previous patch, the mentioned
helpers can process skbs sitting in MPTCP-level queues without any
CB-related adaptation.
The only additional adjustment needed is explicitly providing the OoO queue reference, to cope with different sk layout. Additionally rename the helper to clearly document its hybrid nature and let it return the number of collapsed skbs, to allow proper accounting from the future MPTCP caller. Signed-off-by: Paolo Abeni --- rfc -> v1: - fix arg typo Note: - this will need a significant amount of testing at the TCP level and explicit approval from Eric, which I can't guess if we can hope. --- include/net/tcp.h | 8 +++++++ net/ipv4/tcp_input.c | 55 ++++++++++++++++++++++++++++---------------- 2 files changed, 43 insertions(+), 20 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 6156d1d068e1..34a96f0bcf0a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1828,6 +1828,14 @@ extern void tcp_openreq_init_rwin(struct request_soc= k *req, =20 void tcp_enter_memory_pressure(struct sock *sk); void tcp_leave_memory_pressure(struct sock *sk); +unsigned int xtcp_collapse(struct sock *sk, struct sk_buff_head *list, + struct rb_root *root, struct sk_buff *head, + struct sk_buff *tail, u32 start, u32 end, + u8 scaling_ratio); +unsigned int xtcp_collapse_ofo_queue(struct sock *sk, + struct rb_root *out_of_order_queue, + struct sk_buff **ooo_last_skb, + u8 scaling_ratio); =20 static inline int keepalive_intvl_when(const struct tcp_sock *tp) { diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 7171442c3ed7..8417785fa48f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5725,16 +5725,22 @@ static struct sk_buff *tcp_collapse_one(struct sock= *sk, struct sk_buff *skb, /* Collapse contiguous sequence of skbs head..tail with * sequence numbers start..end. * + * sk can be either a TCP or an MPTCP socket. + * * If tail is NULL, this means until the end of the queue. * * Segments with FIN/SYN are not collapsed (only because this * simplifies code) + * + * Returns the number of collapsed skbs. 
*/ -static void -tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *r= oot, - struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end) +unsigned int +xtcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *= root, + struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end, + u8 scaling_ratio) { struct sk_buff *skb =3D head, *n; + unsigned int collapsed =3D 0; struct sk_buff_head tmp; bool end_of_skbs; =20 @@ -5750,6 +5756,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, =20 /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { + collapsed++; skb =3D tcp_collapse_one(sk, skb, list, root); if (!skb) break; @@ -5762,7 +5769,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, * overlaps to the next one and mptcp allow collapsing. */ if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) && - (tcp_win_from_space(sk, skb->truesize) > skb->len || + (__tcp_win_from_space(scaling_ratio, skb->truesize) > skb->len || before(TCP_SKB_CB(skb)->seq, start))) { end_of_skbs =3D false; break; @@ -5782,7 +5789,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, if (end_of_skbs || (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || !skb_frags_readable(skb)) - return; + return collapsed; =20 __skb_queue_head_init(&tmp); =20 @@ -5819,6 +5826,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *li= st, struct rb_root *root, start +=3D size; } if (!before(start, TCP_SKB_CB(skb)->end_seq)) { + collapsed++; skb =3D tcp_collapse_one(sk, skb, list, root); if (!skb || skb =3D=3D tail || @@ -5832,23 +5840,26 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *= list, struct rb_root *root, end: skb_queue_walk_safe(&tmp, skb, n) tcp_rbtree_insert(root, skb); + return collapsed; } =20 /* Collapse ofo queue. 
Algorithm: select contiguous sequence of skbs - * and tcp_collapse() them until all the queue is collapsed. + * and xtcp_collapse() them until all the queue is collapsed. */ -static void tcp_collapse_ofo_queue(struct sock *sk) +unsigned int xtcp_collapse_ofo_queue(struct sock *sk, + struct rb_root *ooo_queue, + struct sk_buff **ooo_last_skb, + u8 scaling_ratio) { - struct tcp_sock *tp =3D tcp_sk(sk); - u32 range_truesize, sum_tiny =3D 0; + u32 range_truesize, sum_tiny =3D 0, collapsed =3D 0; struct sk_buff *skb, *head; u32 start, end; =20 - skb =3D skb_rb_first(&tp->out_of_order_queue); + skb =3D skb_rb_first(ooo_queue); new_range: if (!skb) { - tp->ooo_last_skb =3D skb_rb_last(&tp->out_of_order_queue); - return; + *ooo_last_skb =3D skb_rb_last(ooo_queue); + return collapsed; } start =3D TCP_SKB_CB(skb)->seq; end =3D TCP_SKB_CB(skb)->end_seq; @@ -5866,12 +5877,13 @@ static void tcp_collapse_ofo_queue(struct sock *sk) /* Do not attempt collapsing tiny skbs */ if (range_truesize !=3D head->truesize || end - start >=3D SKB_WITH_OVERHEAD(PAGE_SIZE)) { - tcp_collapse(sk, NULL, &tp->out_of_order_queue, - head, skb, start, end); + collapsed +=3D xtcp_collapse(sk, NULL, ooo_queue, + head, skb, start, end, + scaling_ratio); } else { sum_tiny +=3D range_truesize; if (sum_tiny > sk->sk_rcvbuf >> 3) - return; + return collapsed; } goto new_range; } @@ -5882,6 +5894,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk) if (after(TCP_SKB_CB(skb)->end_seq, end)) end =3D TCP_SKB_CB(skb)->end_seq; } + return collapsed; } =20 /* @@ -5969,12 +5982,14 @@ static int tcp_prune_queue(struct sock *sk, const s= truct sk_buff *in_skb) if (tcp_can_ingest(sk, in_skb)) return 0; =20 - tcp_collapse_ofo_queue(sk); + xtcp_collapse_ofo_queue(sk, &tp->out_of_order_queue, + &tp->ooo_last_skb, tp->scaling_ratio); if (!skb_queue_empty(&sk->sk_receive_queue)) - tcp_collapse(sk, &sk->sk_receive_queue, NULL, - skb_peek(&sk->sk_receive_queue), - NULL, - tp->copied_seq, tp->rcv_nxt); + xtcp_collapse(sk, 
&sk->sk_receive_queue, NULL, + skb_peek(&sk->sk_receive_queue), + NULL, + tp->copied_seq, tp->rcv_nxt, + tp->scaling_ratio); =20 if (tcp_can_ingest(sk, in_skb)) return 0; --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C841F3B47C5 for ; Mon, 27 Apr 2026 19:52:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319547; cv=none; b=Cl66LzlXPAPZmkml0N4zn6iSRTgnRuD3RL6obS1iN2gue5ZSZ6i/Jp+DUYk7NNSgFKl5PrQQEBh4QhaRqvuT6P1/W0oiGICp2KUNGyumDLdD+m7wG7x0oDYe7EEjW3cmdSUnY31FaN3Y5y5IpPFxeLfnfscsemgpubVcmGGN1fo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319547; c=relaxed/simple; bh=mJr+LyHiZ7wBfKmnFNwIUgj4HOElo3M42uVO+jHQbGM=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=P0lul4hDBwW1XS0vAcGlzGfMuOQvErsIZB01JFSbrn6Q91lUFc5xm0BdnW9AnhRshzFrLSnzX/DcoMd6mBwgfjrJeZIOBCDbFcXqpppR3TR4eF9DScOGEwHXbCH5RW+7ViOFN1z09YpB0/Q+m5vCivswLNkA56Mu1Hdij5IV/nw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=d0cYMH9Z; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="d0cYMH9Z" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; 
s=mimecast20190719; t=1777319544; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=f4EgBq//cIlXgrjSvSNQ/Nj391IUgZFt6FXVel9hSdE=; b=d0cYMH9Z7Z9TmCNVaGQ44AxEAXqiUzFK9FTIKcoMgO7tkAVfssTvwe1aplS306/Ghg9la3 7fxH9yIzf4OFsyYMnVippsSXuwroxCUclioXZ3iirPDjB2rrETnAIVEX7t1b+CnlM9Me8Y GX2HwlXYR+ax6Xk0uQG20g6/5d+6B7Q= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-331--Gj7a1n9NoOu9xaubWlrUg-1; Mon, 27 Apr 2026 15:52:23 -0400 X-MC-Unique: -Gj7a1n9NoOu9xaubWlrUg-1 X-Mimecast-MFC-AGG-ID: -Gj7a1n9NoOu9xaubWlrUg_1777319542 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id A352518004A9 for ; Mon, 27 Apr 2026 19:52:22 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id DC5B1300070A for ; Mon, 27 Apr 2026 19:52:21 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 07/10] mptcp: implemented OoO queue pruning Date: Mon, 27 Apr 2026 21:52:05 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 
mmjUvJDasE_RoDpOjGrgo9V8jX5cDhzVAIS_-R7lDjg_1777319542 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" Leverage the hybrid helpers to implement the receive queue and OoO queue collapsing at ingress time when reaching memory bounds. If the msk is owned by user space when the incoming skb arrives, perform the pruning in the release_cb. The prune check is additionally performed when the skb reaches the msk-level queues. Signed-off-by: Paolo Abeni --- v1 -> v2: - collapse rcv queue, too - deal with MPC map, too - drop left-over sentence in the commit message RFC -> v1: - use data_seq only when available - avoid ack_seq lockless access - drop limit on fallback - collapse rcv queue, too - drop only when pruning is not possible and over rcvbuf * 2 Notes: - Similarly to patch 'mptcp: move checks vs rcvbuf size earlier in the RX path', some cleanup/tuning in mptcp_over_limit() will be needed - Pruning in the release_cb() is likely not needed and should probably be removed (after more testing).
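For readers following the notes above, the drop decision can be approximated by a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct rx_model`, its fields and `rx_model_over_limit()` are hypothetical names, the check for skbs already acked at the TCP level is left out, and pruning is reduced to releasing a byte count instead of collapsing and dropping real skbs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the msk receive-side state. */
struct rx_model {
	uint64_t rmem_alloc;	/* bytes currently charged to the socket */
	uint64_t rcvbuf;	/* receive buffer limit */
	uint64_t prunable;	/* bytes an immediate prune could release */
	bool owned_by_user;	/* socket currently owned by process context */
	bool prune_deferred;	/* models the MPTCP_PRUNE callback flag */
};

/* Returns true when the incoming packet should be dropped. */
static bool rx_model_over_limit(struct rx_model *m)
{
	assert(m->prunable <= m->rmem_alloc);

	/* Fast path: still within the receive buffer. */
	if (m->rmem_alloc <= m->rcvbuf)
		return false;

	if (!m->owned_by_user) {
		/* Prune right away; the model releases just the excess. */
		uint64_t excess = m->rmem_alloc - m->rcvbuf;
		uint64_t freed = excess < m->prunable ? excess : m->prunable;

		m->rmem_alloc -= freed;
		m->prunable -= freed;
		return m->rmem_alloc > m->rcvbuf;
	}

	/* Pruning is deferred to the release callback: allow extra
	 * slack, up to twice the receive buffer.
	 */
	m->prune_deferred = true;
	return m->rmem_alloc > (m->rcvbuf << 1);
}
```

In this model a packet arriving while the socket is owned by user space is dropped only above twice the buffer size, which mirrors the "over rcvbuf * 2" item in the changelog.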
--- net/mptcp/mib.c | 3 ++ net/mptcp/mib.h | 3 ++ net/mptcp/options.c | 36 ++++++++++++++++++++--- net/mptcp/protocol.c | 69 ++++++++++++++++++++++++++++++++++++++++++++ net/mptcp/protocol.h | 2 ++ 5 files changed, 109 insertions(+), 4 deletions(-) diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c index f23fda0c55a7..5128feec942c 100644 --- a/net/mptcp/mib.c +++ b/net/mptcp/mib.c @@ -85,6 +85,9 @@ static const struct snmp_mib mptcp_snmp_list[] =3D { SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK), SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED), SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE), + SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED), + SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED), + SNMP_MIB_ITEM("RcvCollapsed", MPTCP_MIB_RCVCOLLAPSED), }; =20 /* mptcp_mib_alloc - allocate percpu mib counters diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h index 812218b5ed2b..2f8f68e33ac5 100644 --- a/net/mptcp/mib.h +++ b/net/mptcp/mib.h @@ -88,6 +88,9 @@ enum linux_mptcp_mib_field { MPTCP_MIB_SIMULTCONNFALLBACK, /* Simultaneous connect */ MPTCP_MIB_FALLBACKFAILED, /* Can't fallback due to msk status */ MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */ + MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */ + MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constraints */ + MPTCP_MIB_RCVCOLLAPSED, /* Collapsed due to memory pressure */ __MPTCP_MIB_MAX }; =20 diff --git a/net/mptcp/options.c b/net/mptcp/options.c index ad4bb6fd86e1..0c1d1d4da88a 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -1159,8 +1159,12 @@ static bool add_addr_hmac_valid(struct mptcp_sock *m= sk, } =20 static bool mptcp_over_limit(struct sock *sk, const struct sock *ssk, - const struct sk_buff *skb) + const struct sk_buff *skb, + const struct mptcp_options_received *mp_opt) { + struct mptcp_sock *msk =3D mptcp_sk(sk); + bool ret; + if (likely(sk_rmem_alloc_get(sk) <=3D READ_ONCE(sk->sk_rcvbuf))) return false; =20 @@ -1170,7 +1174,27 @@ static bool
mptcp_over_limit(struct sock *sk, const = struct sock *ssk, !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt)) return false; =20 - return true; + mptcp_data_lock(sk); + if (!sock_owned_by_user(sk)) { + /* When the data sequence is not (yet) available for the + * incoming skb, allow pruning the whole OoO queue. + */ + u32 seq =3D !mp_opt->use_map || mp_opt->mpc_map ? msk->ack_seq : + mp_opt->data_seq; + + __mptcp_check_prune(sk, seq); + ret =3D sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf); + } else { + u64 limit =3D ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1; + + /* Pruning will take place later in the RX path, allow + * some extra slack. + */ + ret =3D sk_rmem_alloc_get(sk) > limit; + __set_bit(MPTCP_PRUNE, &msk->cb_flags); + } + mptcp_data_unlock(sk); + return ret; } =20 /* Return false when the caller must drop the packet, i.e. in case of erro= r, @@ -1201,7 +1225,11 @@ bool mptcp_incoming_options(struct sock *sk, struct = sk_buff *skb) __mptcp_data_acked(subflow->conn); mptcp_data_unlock(subflow->conn); =20 - if (mptcp_over_limit(subflow->conn, sk, skb)) + /* Will use ack_seq as limit for OoO pruning; any value would do + * as OoO queue must be empty. 
+ */ + mp_opt.use_map =3D 0; + if (mptcp_over_limit(subflow->conn, sk, skb, &mp_opt)) return false; return true; } @@ -1281,7 +1309,7 @@ bool mptcp_incoming_options(struct sock *sk, struct s= k_buff *skb) return true; } =20 - if (mptcp_over_limit(subflow->conn, sk, skb)) + if (mptcp_over_limit(subflow->conn, sk, skb, &mp_opt)) return false; =20 mpext =3D skb_ext_add(skb, SKB_EXT_MPTCP); diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 683eaa11634a..4137d587d3c5 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -388,6 +388,67 @@ void __mptcp_sync_rcv_sequence(struct sock *sk) MPTCP_SKB_CB(skb)->end_seq =3D msk->ack_seq; } =20 +/* "Inspired" from the TCP version */ +static void mptcp_prune_ofo_queue(struct sock *sk, u32 seq) +{ + struct mptcp_sock *msk =3D mptcp_sk(sk); + struct rb_node *node, *prev; + bool pruned =3D false; + + if (RB_EMPTY_ROOT(&msk->out_of_order_queue)) + return; + + node =3D &msk->ooo_last_skb->rbnode; + + do { + struct sk_buff *skb =3D rb_to_skb(node); + + /* Stop pruning if the incoming skb would land in OoO tail. 
*/ + if (after(seq, MPTCP_SKB_CB(skb)->map_seq)) + break; + + pruned =3D true; + prev =3D rb_prev(node); + rb_erase(node, &msk->out_of_order_queue); + mptcp_drop(sk, skb); + msk->ooo_last_skb =3D rb_to_skb(prev); + if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf) + break; + + node =3D prev; + } while (node); + + if (pruned) + MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED); +} + +bool __mptcp_check_prune(struct sock *sk, u32 seq) +{ + struct mptcp_sock *msk =3D mptcp_sk(sk); + unsigned int dropped; + + if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)) + return false; + + dropped =3D xtcp_collapse_ofo_queue(sk, &msk->out_of_order_queue, + &msk->ooo_last_skb, + msk->scaling_ratio); + if (!skb_queue_empty(&sk->sk_receive_queue)) + dropped +=3D xtcp_collapse(sk, &sk->sk_receive_queue, NULL, + skb_peek(&sk->sk_receive_queue), + NULL, + msk->copied_seq, msk->ack_seq, + msk->scaling_ratio); + + if (dropped) + MPTCP_ADD_STATS(sock_net(sk), MPTCP_MIB_RCVCOLLAPSED, dropped); + if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)) + return false; + + mptcp_prune_ofo_queue(sk, seq); + return atomic_read(&sk->sk_rmem_alloc) >=3D sk->sk_rcvbuf; +} + static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb) { u32 copy_len =3D MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq; @@ -402,6 +463,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct s= k_buff *skb) if (unlikely(msk->rcvd_dummy_seq)) __mptcp_sync_rcv_sequence(sk); =20 + if (__mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq)) { + MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED); + mptcp_drop(sk, skb); + return false; + } + if (MPTCP_SKB_CB(skb)->map_seq64 =3D=3D msk->ack_seq) { /* in sequence */ msk->bytes_received +=3D copy_len; @@ -3705,6 +3772,8 @@ static void mptcp_release_cb(struct sock *sk) msk->copied_seq =3D subflow->iasn; __mptcp_sync_rcv_sequence(sk); } + if (__test_and_clear_bit(MPTCP_PRUNE, &msk->cb_flags)) + __mptcp_check_prune(sk, msk->ack_seq - 1); } } =20 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index e4569b3af744..1116a402771d 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -125,6 +125,7 @@ #define MPTCP_SYNC_STATE 6 #define MPTCP_SYNC_SNDBUF 7 #define MPTCP_SYNC_SEQ 8 +#define MPTCP_PRUNE 9 =20 struct mptcp_skb_cb { u32 map_seq; @@ -831,6 +832,7 @@ bool __mptcp_close(struct sock *sk, long timeout); void mptcp_cancel_work(struct sock *sk); void __mptcp_unaccepted_force_close(struct sock *sk); void mptcp_set_state(struct sock *sk, int state); +bool __mptcp_check_prune(struct sock *sk, u32 seq); =20 bool mptcp_addresses_equal(const struct mptcp_addr_info *a, const struct mptcp_addr_info *b, bool use_port); --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DC15140149B for ; Mon, 27 Apr 2026 19:52:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319548; cv=none; b=IJ0jcv4fQNhE2GImCS7blSTzQmk1iRRpbvSydDWFPK7Cz8CWwnZkwiC/aAzll18IoncT61TUAdGSzPxNxesDiqpa9bopwYhUQlCoJfc4Dn+UBJ4SDFIFgK7TBBC8V+yhhRH2zJiCEcMp5VtmsLxV5mh5h6Yrl3fDF/lMWOe64Uw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319548; c=relaxed/simple; bh=PzjbCP5LmS2p4tsX/Lb2omigYVVuGlhElnWXQyWd7Iw=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=bj65y0F7hFonNm3YbVDFq+VPdcr14WvvZWyNrd4fpFf8lFpkUnX5cwxaV8lYP2jdRUedECDELzghgRN6NC7A4tXms3FJ+CJoZTOICfQqCLGVTNaGBAe3cGp3Pbi1wY7qb8wqsDxHYokigHVo1YyXRZ4lRCDtxQLMGA4ZavG9urU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass 
(1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=b2U9l01m; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="b2U9l01m" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777319546; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=RxI8NWf0ra8f8CoiZ4DpiTnLLBZWv72m1iMoSFdW8og=; b=b2U9l01mA4DZ4uouBnB8+DL5BoZuyYL2PLMla/5lw88d07C8/HSqlhTJvY15wYEtww4WyN mDhKE77GPCUayXaK1JTf+Teee8TVQQjQnUUEDmhZMlj9NtrijQNf6e6RmZ68CtUbkuT5XO UQQBiLlEZUgAm0SyEtjTysGjCAW9rEc= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-651-wgPlT1I9OZ6zP82HP6hJJQ-1; Mon, 27 Apr 2026 15:52:24 -0400 X-MC-Unique: wgPlT1I9OZ6zP82HP6hJJQ-1 X-Mimecast-MFC-AGG-ID: wgPlT1I9OZ6zP82HP6hJJQ_1777319544 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id DF5F919560BE for ; Mon, 27 Apr 2026 19:52:23 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 
22C27300070A for ; Mon, 27 Apr 2026 19:52:22 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 08/10] mptcp: track prune recovery status Date: Mon, 27 Apr 2026 21:52:06 +0200 Message-ID: <4790eaa2f5ccbac325da354d49116171b15d8d69.1777318959.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: jeDm6frA4A-wzsct5AHtqSyPwGlHT_CmxFUIOKP1ydw_1777319544 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" After dropping any data already acked at the TCP level, MPTCP must avoid inducing TCP-level retransmission until the pruned data has been successfully acked at the MPTCP level. Otherwise the subflows could keep retransmitting skbs carrying OoO MPTCP data, preventing reinjections and completely stalling the data transfer. Explicitly keep track of the highest pruned MPTCP-level seq number and stop dropping at TCP level until such sequence has been acked. Signed-off-by: Paolo Abeni --- net/mptcp/options.c | 7 ++++++- net/mptcp/protocol.c | 14 +++++++++++++- net/mptcp/protocol.h | 3 +++ net/mptcp/subflow.c | 1 + 4 files changed, 23 insertions(+), 2 deletions(-) diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 0c1d1d4da88a..2d050acad63b 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -1194,7 +1194,12 @@ static bool mptcp_over_limit(struct sock *sk, const = struct sock *ssk, __set_bit(MPTCP_PRUNE, &msk->cb_flags); } mptcp_data_unlock(sk); - return ret; + + /* After pruning any packets ensure that MPTCP-driven drops do not + * cause TCP-level retransmission + */ + return ret && + !before(READ_ONCE(msk->ack_seq), READ_ONCE(msk->pruned_seq)); } =20 /* Return false when the caller must drop the packet, i.e.
in case of erro= r, diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 4137d587d3c5..3a6b0506d3a7 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -394,12 +394,14 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u3= 2 seq) struct mptcp_sock *msk =3D mptcp_sk(sk); struct rb_node *node, *prev; bool pruned =3D false; + u32 pruned_seq; =20 if (RB_EMPTY_ROOT(&msk->out_of_order_queue)) return; =20 node =3D &msk->ooo_last_skb->rbnode; =20 + pruned_seq =3D msk->pruned_seq; do { struct sk_buff *skb =3D rb_to_skb(node); =20 @@ -410,16 +412,21 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u3= 2 seq) pruned =3D true; prev =3D rb_prev(node); rb_erase(node, &msk->out_of_order_queue); + if (after(MPTCP_SKB_CB(skb)->end_seq, pruned_seq)) + pruned_seq =3D MPTCP_SKB_CB(skb)->end_seq; mptcp_drop(sk, skb); msk->ooo_last_skb =3D rb_to_skb(prev); + if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf) break; =20 node =3D prev; } while (node); =20 - if (pruned) + if (pruned) { + WRITE_ONCE(msk->pruned_seq, pruned_seq); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED); + } } =20 bool __mptcp_check_prune(struct sock *sk, u32 seq) @@ -464,6 +471,8 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk= _buff *skb) __mptcp_sync_rcv_sequence(sk); =20 if (__mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq)) { + if (after(MPTCP_SKB_CB(skb)->end_seq, msk->pruned_seq)) + WRITE_ONCE(msk->pruned_seq, MPTCP_SKB_CB(skb)->end_seq); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED); mptcp_drop(sk, skb); return false; @@ -898,6 +907,8 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk) WRITE_ONCE(msk->ack_seq, msk->ack_seq + seq_delta); moved =3D true; } + if (after(msk->ack_seq, msk->pruned_seq)) + WRITE_ONCE(msk->pruned_seq, (u32)msk->ack_seq); return moved; } =20 @@ -3540,6 +3551,7 @@ static int mptcp_disconnect(struct sock *sk, int flag= s) /* for fallback's sake */ WRITE_ONCE(msk->ack_seq, 0); msk->copied_seq =3D 0; + WRITE_ONCE(msk->pruned_seq, 
0); =20 WRITE_ONCE(sk->sk_shutdown, 0); sk_error_report(sk); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 1116a402771d..c369e0efe260 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -307,6 +307,9 @@ struct mptcp_sock { u64 bytes_acked; u64 snd_una; u64 wnd_end; + u32 pruned_seq; /* If above ack_seq, highest + * seq pruned. + */ u32 last_data_sent; u32 last_data_recv; u32 last_ack_recv; diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index 6d3d0106749f..2a719146ebce 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -496,6 +496,7 @@ static void subflow_set_remote_key(struct mptcp_sock *m= sk, =20 WRITE_ONCE(msk->remote_key, subflow->remote_key); WRITE_ONCE(msk->ack_seq, subflow->iasn); + WRITE_ONCE(msk->pruned_seq, subflow->iasn); WRITE_ONCE(msk->can_ack, true); atomic64_set(&msk->rcv_wnd_sent, subflow->iasn); =20 --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EC2E83ECBC3 for ; Mon, 27 Apr 2026 19:52:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319549; cv=none; b=D50WpR5Q43TLKRTq3WvYDi7c3z9+Uhb4epPvT9KmYQ4CNapGuvSw/pGISWgkitbTIohVnsP2yKlpT9b07wujkXpe4KUnWoBANVE2yDN4BzCc0B3MxUQVn8a/TdtLgj0dsyMmRvhU1B/h2kH7NIe+JutA0ne0u8Em9/GYePr6zLE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319549; c=relaxed/simple; bh=NTp0KZiH/Qc/UTE97ufXdN7na+g8yi7MaqtkSjnkjec=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; 
b=gwNloJMNt8D0irrF6rVe+5vxUlvlSC3U7O3fRe8UaycAi2guiFLJk399xOIhzJRzDKp3CMqSvLRbGVLUuaMLo/hVY6haGRH4lKXAGM9+AB5pAM7E91Qw7HqLl9e7GtnSx+vwyhzqQiYqtzPkTX8UECRZL6hgcykbW18np8lEC78= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=N6m+D+yh; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="N6m+D+yh" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777319547; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Qn4NWeizyeJ6Uw0ONC3/u/t4VyKnqY93xFhHjoOqzXc=; b=N6m+D+yhBGZN6HIIj2eJNbw5FZ/ZSX1n7QXYyybPe81JoXFfhzdhW8C6bXedHK62M7FxY6 iLwOy7Kc0yirgbmwBkzt/wPCbHPk2fxwZ2SjOrOdsXDYKK3O/4JcZWEPkq8c4LSWTWBI5Z NKDnhjZsGaPtbwDNdA7DE4rQVgsFhqM= Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-638-0XHyYASeOIGNx1JhCyHpmA-1; Mon, 27 Apr 2026 15:52:25 -0400 X-MC-Unique: 0XHyYASeOIGNx1JhCyHpmA-1 X-Mimecast-MFC-AGG-ID: 0XHyYASeOIGNx1JhCyHpmA_1777319545 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) 
server-digest SHA256) (No client certificate requested) by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 23F2A19560BB for ; Mon, 27 Apr 2026 19:52:25 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 5B4E5300757C for ; Mon, 27 Apr 2026 19:52:24 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 09/10] mptcp: move the retrans loop to a separate helper Date: Mon, 27 Apr 2026 21:52:07 +0200 Message-ID: <912b04db09f9031b46c85cb83e2169d363aac8af.1777318959.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 6h9Fn3hBz3_FNnYHbDQp42SZMw0r2nsSm4ToITo0iF8_1777319545 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" This is a cleanup in order to make the next patch simpler. No functional change intended. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 74 +++++++++++++++++++++++++------------------- 1 file changed, 43 insertions(+), 31 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 3a6b0506d3a7..64991a5ee206 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -2840,41 +2840,14 @@ static void mptcp_check_fastclose(struct mptcp_sock= *msk) sk_error_report(sk); } =20 -static void __mptcp_retrans(struct sock *sk) +/* Retransmit the specified data fragment on all the selected subflows. 
*/ +static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *d= frag) { struct mptcp_sendmsg_info info =3D { .data_lock_held =3D true, }; struct mptcp_sock *msk =3D mptcp_sk(sk); struct mptcp_subflow_context *subflow; - struct mptcp_data_frag *dfrag; struct sock *ssk; - int ret, err; - u16 len =3D 0; - - mptcp_clean_una_wakeup(sk); - - /* first check ssk: need to kick "stale" logic */ - err =3D mptcp_sched_get_retrans(msk); - dfrag =3D mptcp_rtx_head(sk); - if (!dfrag) { - if (mptcp_data_fin_enabled(msk)) { - struct inet_connection_sock *icsk =3D inet_csk(sk); - - WRITE_ONCE(icsk->icsk_retransmits, - icsk->icsk_retransmits + 1); - mptcp_set_datafin_timeout(sk); - mptcp_send_ack(msk); - - goto reset_timer; - } - - if (!mptcp_send_head(sk)) - goto clear_scheduled; - - goto reset_timer; - } - - if (err) - goto reset_timer; + int ret, len =3D 0; =20 mptcp_for_each_subflow(msk, subflow) { if (READ_ONCE(subflow->scheduled)) { @@ -2902,7 +2875,7 @@ static void __mptcp_retrans(struct sock *sk) !msk->allow_subflows) { spin_unlock_bh(&msk->fallback_lock); release_sock(ssk); - goto clear_scheduled; + return -1; } =20 while (info.sent < info.limit) { @@ -2925,6 +2898,45 @@ static void __mptcp_retrans(struct sock *sk) release_sock(ssk); } } + return len; +} + +static void __mptcp_retrans(struct sock *sk) +{ + struct mptcp_sock *msk =3D mptcp_sk(sk); + struct mptcp_subflow_context *subflow; + struct mptcp_data_frag *dfrag; + int err, len; + + mptcp_clean_una_wakeup(sk); + + /* first check ssk: need to kick "stale" logic */ + err =3D mptcp_sched_get_retrans(msk); + dfrag =3D mptcp_rtx_head(sk); + if (!dfrag) { + if (mptcp_data_fin_enabled(msk)) { + struct inet_connection_sock *icsk =3D inet_csk(sk); + + WRITE_ONCE(icsk->icsk_retransmits, + icsk->icsk_retransmits + 1); + mptcp_set_datafin_timeout(sk); + mptcp_send_ack(msk); + + goto reset_timer; + } + + if (!mptcp_send_head(sk)) + goto clear_scheduled; + + goto reset_timer; + } + + if (err) + goto reset_timer; + + 
len =3D __mptcp_push_retrans(sk, dfrag); + if (len < 0) + goto clear_scheduled; =20 msk->bytes_retrans +=3D len; dfrag->already_sent =3D max(dfrag->already_sent, len); --=20 2.53.0 From nobody Tue May 5 12:24:00 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 83B313ECBC3 for ; Mon, 27 Apr 2026 19:52:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319550; cv=none; b=VkIXe47c90DAptKokZ/u+/oJEpBu5MBqoAijFQxC8hTf7uNRFdoSSoEZ8vvCaK2GZRNR6GfXMJDaH6zJhrjr2TQKuGjB0fwgiWYMaox5j7CD2K8P5XWKsATRR8jKJbNIAOtwtuZk6dg+2D/ZUfvjjRZNdVfkJPcyHW9xEgy54es= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777319550; c=relaxed/simple; bh=Vd9qBXPRaiKNerJzIHd6b7mFYw5CP7GqmnqKFmhFcmY=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=aq1i5s545QSO8gVZnqrIcfT8529vT83Fcmwmkm+/P4mXQCRGquCxReQdjfUqnYCKo0UDQZrXrkPGbwkQ2JJTHg6ICGJwYx7QiVTyNzAJnaK2UVo066jW+rVCb8K2Sc3n2lFob+GoOrBR0tAf02zFPkkwK0ymyv5/5tOnyfLMn/w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=gRrEmq8R; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="gRrEmq8R" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; 
s=mimecast20190719; t=1777319548; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Xl4II/IQ4CCFA8AYAQmdS5PV4JEqtOSTwLOeEwEcidY=; b=gRrEmq8RZ9xPaR0Y79/vD5WqPIWxNLquss3Vp5QnXnVHTDh35L0H9RBRiX0R+bMwz/UGTs qA8RKt/gJIHocg+Wtzb4XBhhhKHrlj+DRaAKYhUbAK8P/fNesZ8x7XrPxKUV1hoCOe+vHx 54IRi1Q5OkSwIZ5LeyMwZ/9roKWyafU= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-379-0TmzMMBdNUKHQWBUIExzlQ-1; Mon, 27 Apr 2026 15:52:27 -0400 X-MC-Unique: 0TmzMMBdNUKHQWBUIExzlQ-1 X-Mimecast-MFC-AGG-ID: 0TmzMMBdNUKHQWBUIExzlQ_1777319546 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 5A4C6180049F for ; Mon, 27 Apr 2026 19:52:26 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.49.253]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 94FDC300070A for ; Mon, 27 Apr 2026 19:52:25 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Subject: [PATCH mptcp-next v2 10/10] mptcp: let the retrans scheduler do its job. 
Date: Mon, 27 Apr 2026 21:52:08 +0200 Message-ID: <988764c7973b89d5105876aebb04098f855b7535.1777318959.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: RSiy5y_9f9FxOSH3jd4LZCvCHNX9fZOXkfkZJ_UZ_FY_1777319546 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" Currently the MPTCP core enforces that when the MPTCP-level retrans timer fires, at most a single dfrag is retransmitted. In some corner cases it may be necessary to retransmit multiple dfrags, and the MPTCP socket will need to wait for multiple retrans timeouts to accomplish that. Remove the mentioned constraint, allowing multiple dfrags to be retransmitted per retrans period, as long as the scheduler keeps selecting subflows for retransmissions and pending data is available in the rtx queue. The default scheduler will transmit a dfrag per available subflow. Signed-off-by: Paolo Abeni --- v1 -> v2: - fix retrans sequence update (sashiko) --- net/mptcp/protocol.c | 82 ++++++++++++++++++++++++++------------------ 1 file changed, 48 insertions(+), 34 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 64991a5ee206..a4b2d664af80 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -1219,13 +1219,6 @@ static void __mptcp_clean_una_wakeup(struct sock *sk) mptcp_write_space(sk); } =20 -static void mptcp_clean_una_wakeup(struct sock *sk) -{ - mptcp_data_lock(sk); - __mptcp_clean_una_wakeup(sk); - mptcp_data_unlock(sk); -} - static void mptcp_enter_memory_pressure(struct sock *sk) { struct mptcp_subflow_context *subflow; @@ -2840,7 +2833,10 @@ static void mptcp_check_fastclose(struct mptcp_sock = *msk) sk_error_report(sk); } =20 -/* Retransmit the specified data fragment on all the selected subflows.
*/ +/* + * Retransmit the specified data fragment on all the selected subflows, + * starting from the specified sequence + */ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *d= frag) { struct mptcp_sendmsg_info info =3D { .data_lock_held =3D true, }; @@ -2906,42 +2902,60 @@ static void __mptcp_retrans(struct sock *sk) struct mptcp_sock *msk =3D mptcp_sk(sk); struct mptcp_subflow_context *subflow; struct mptcp_data_frag *dfrag; + u64 retrans_seq; int err, len; =20 - mptcp_clean_una_wakeup(sk); - - /* first check ssk: need to kick "stale" logic */ - err =3D mptcp_sched_get_retrans(msk); - dfrag =3D mptcp_rtx_head(sk); - if (!dfrag) { - if (mptcp_data_fin_enabled(msk)) { - struct inet_connection_sock *icsk =3D inet_csk(sk); + mptcp_data_lock(sk); + __mptcp_clean_una_wakeup(sk); + retrans_seq =3D msk->snd_una; + mptcp_data_unlock(sk); =20 - WRITE_ONCE(icsk->icsk_retransmits, - icsk->icsk_retransmits + 1); - mptcp_set_datafin_timeout(sk); - mptcp_send_ack(msk); + for (;;) { + /* first check ssk: need to kick "stale" logic */ + err =3D mptcp_sched_get_retrans(msk); + dfrag =3D mptcp_rtx_head(sk); + if (!dfrag) { + if (mptcp_data_fin_enabled(msk)) { + struct inet_connection_sock *icsk; + + icsk =3D inet_csk(sk); + WRITE_ONCE(icsk->icsk_retransmits, + icsk->icsk_retransmits + 1); + mptcp_set_datafin_timeout(sk); + mptcp_send_ack(msk); + break; + } =20 - goto reset_timer; + if (!mptcp_send_head(sk)) + goto clear_scheduled; + break; } =20 - if (!mptcp_send_head(sk)) - goto clear_scheduled; - - goto reset_timer; - } + if (err) + break; =20 - if (err) - goto reset_timer; + /* Skip the data already retransmitted in this run */ + while (dfrag && !before64(retrans_seq, dfrag->data_seq + + dfrag->data_len)) + dfrag =3D list_is_last(&dfrag->list, &msk->rtx_queue) ? 
+ NULL : list_next_entry(dfrag, list); + if (!dfrag || !dfrag->already_sent) + break; =20 - len =3D __mptcp_push_retrans(sk, dfrag); - if (len < 0) - goto clear_scheduled; + len =3D __mptcp_push_retrans(sk, dfrag); + if (len < 0) + goto clear_scheduled; =20 - msk->bytes_retrans +=3D len; - dfrag->already_sent =3D max(dfrag->already_sent, len); + retrans_seq +=3D len; + msk->bytes_retrans +=3D len; + dfrag->already_sent =3D max(dfrag->already_sent, len); =20 -reset_timer: + /* Attempt the next fragment only if the current one is + * completely retransmitted + */ + if (dfrag->already_sent < dfrag->data_len) + break; + } mptcp_check_and_set_pending(sk); =20 if (!mptcp_rtx_timer_pending(sk)) --=20 2.53.0
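As a rough illustration of the retransmission loop in patch 10/10, the fragment-skipping logic can be modeled in user space as below. This is a sketch under assumptions rather than the kernel implementation: `struct frag_model`, `next_frag_to_retrans()` and `retrans_run()` are hypothetical names, the rtx queue is a plain array, each round is assumed to resend exactly `already_sent` bytes of the selected fragment, and the `subflows` argument stands in for the number of times the scheduler keeps selecting a subflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one entry of the rtx queue. */
struct frag_model {
	uint64_t data_seq;	/* MPTCP-level sequence of the first byte */
	uint32_t data_len;	/* bytes held by the fragment */
	uint32_t already_sent;	/* bytes transmitted at least once */
};

/* Return the index of the first fragment not fully covered by
 * retrans_seq, or n when every fragment is covered. The signed
 * difference keeps the comparison safe across sequence wraparound.
 */
static size_t next_frag_to_retrans(const struct frag_model *q, size_t n,
				   uint64_t retrans_seq)
{
	size_t i = 0;

	while (i < n &&
	       (int64_t)(retrans_seq - (q[i].data_seq + q[i].data_len)) >= 0)
		i++;
	return i;
}

/* Model one retrans timer run; returns the bytes retransmitted. */
static uint64_t retrans_run(const struct frag_model *q, size_t n,
			    uint64_t snd_una, unsigned int subflows)
{
	uint64_t retrans_seq = snd_una;
	uint64_t total = 0;

	for (; subflows > 0; subflows--) {
		size_t i = next_frag_to_retrans(q, n, retrans_seq);
		uint32_t len;

		/* Nothing left, or the next fragment was never sent. */
		if (i == n || q[i].already_sent == 0)
			break;

		len = q[i].already_sent;
		retrans_seq += len;
		total += len;

		/* Move on only after a complete retransmission. */
		if (q[i].already_sent < q[i].data_len)
			break;
	}
	return total;
}
```

With a single available subflow the model degenerates to the previous behaviour of one dfrag per timer run; with more subflows it keeps going until the queue is exhausted or a partially sent fragment is reached.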