From nobody Mon May 25 18:04:43 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: Geliang Tang, gang.yan@linux.dev
Subject: [PATCH v6 mptcp-next 1/7] mptcp: fix missing wakeups in edge scenarios
Date: Fri, 15 May 2026 11:07:22 +0200
Message-ID: <98f41ca828a7938005b796eccbff417b8bc32f2d.1778835009.git.pabeni@redhat.com>

mptcp_recvmsg() can fill the MPTCP socket receive queue via
mptcp_move_skbs(), but currently does not try to wake up any listener,
because the same process is going to check the receive queue soon.

When multiple threads are reading from the same fd, the above can cause
a stall. Add the missing wakeup.

Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue")
Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ce8372fb3c6a..b3ac1cb370f5 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2276,6 +2276,8 @@ static bool mptcp_move_skbs(struct sock *sk)
 		mptcp_backlog_spooled(sk, moved, &skbs);
 	}
 	mptcp_data_unlock(sk);
+	if (enqueued)
+		sk->sk_data_ready(sk);
 	return enqueued;
 }
 
-- 
2.54.0

From nobody Mon May 25 18:04:43 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: Geliang Tang, gang.yan@linux.dev
Subject: [PATCH v6 mptcp-next 2/7] mptcp: explicitly drop over memory limits
Date: Fri, 15 May 2026 11:07:23 +0200
Message-ID: <690600bcb4797076f531a262f352e9bd3e2cc85b.1778835009.git.pabeni@redhat.com>

Currently the enforcement of the rcvbuf constraint is implemented when
moving the skbs into the msk receive or OoO queue, keeping the incoming
skbs in the subflow queue when over limit.

Under significant memory pressure the above can cause permanent data
transfer stalls.

Hard enforce the memory limits as early as possible, before landing
even in the subflow queues, and refine the check when owning the msk
socket lock.

Note that fallback sockets must not drop on the later checks, as the
incoming skb is already acked, and such a drop would break the stream.

Signed-off-by: Paolo Abeni
---
v4 -> v5:
 - fix possible u32 overflow in mptcp_over_limit
v3 -> v4:
 - schedule TCP ack on drop
 - enforce limits in __mptcp_move_skb() and __mptcp_add_backlog(), too
   but only if not fallback.
v1 -> v2:
 - deal correctly with tcp fin and zero win probe
RFC -> v1:
 - limit vs actual buffer size
 - use CB info instead of skb->len

Note that:
 - this needs the follow-up patches to really fix the stall
 - sashiko can assume ZWP carries unacked data and may be silently
   dropped. AFAIK that is false.
 - sashiko apparently can't grasp that mptcp subflows never hit the tcp
   rx fastpath, and that when mptcp_incoming_options() in
   tcp_rcv_state_process() is hit, the peer can't transmit any more data.
 - the memory comparison is intentionally very rough, as the msk socket
   lock is not currently held where the condition is now enforced.
   This should require some refinement, shared as-is to avoid more
   latency on my side
---
 net/mptcp/options.c  | 32 ++++++++++++++++++++++++++++++--
 net/mptcp/protocol.c | 29 +++++++++++++++++++++--------
 2 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 4cc583fdc7a9..36f12e5dfa92 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1158,8 +1158,30 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
 	return hmac == mp_opt->ahmac;
 }
 
-/* Return false in case of error (or subflow has been reset),
- * else return true.
+static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
+			     const struct sk_buff *skb)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	u64 mem = sk_rmem_alloc_get(sk);
+
+	mem += READ_ONCE(msk->backlog_len);
+	if (likely(mem <= READ_ONCE(sk->sk_rcvbuf)))
+		return false;
+
+	/* Avoid silently dropping pure acks, fin or zero win probes. */
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq ||
+	    TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN ||
+	    !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt))
+		return false;
+
+	/* Dropped due to memory constraints, schedule an ack. */
+	inet_csk(ssk)->icsk_ack.pending |= ICSK_ACK_NOMEM | ICSK_ACK_NOW;
+	inet_csk_schedule_ack(ssk);
+	return true;
+}
+
+/* Return false when the caller must drop the packet, i.e. in case of error,
+ * subflow has been reset, or over memory limits.
  */
 bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 {
@@ -1185,6 +1207,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 
 		__mptcp_data_acked(subflow->conn);
 		mptcp_data_unlock(subflow->conn);
+
+		if (mptcp_over_limit(subflow->conn, sk, skb))
+			return false;
 		return true;
 	}
 
@@ -1263,6 +1288,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		return true;
 	}
 
+	if (mptcp_over_limit(subflow->conn, sk, skb))
+		return false;
+
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
 	if (!mpext)
 		return false;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b3ac1cb370f5..9d2ed9503d08 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -381,6 +381,15 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 
 	mptcp_borrow_fwdmem(sk, skb);
 
+	/* Can't drop packets for fallback socket this late, or the stream
+	 * will break.
+	 */
+	if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
+	    !__mptcp_check_fallback(msk)) {
+		mptcp_drop(sk, skb);
+		return false;
+	}
+
 	if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
 		/* in sequence */
 		msk->bytes_received += copy_len;
@@ -675,6 +684,7 @@ static void __mptcp_add_backlog(struct sock *sk,
 	struct sk_buff *tail = NULL;
 	struct sock *ssk = skb->sk;
 	bool fragstolen;
+	u64 limit;
 	int delta;
 
 	if (unlikely(sk->sk_state == TCP_CLOSE)) {
@@ -682,6 +692,15 @@ static void __mptcp_add_backlog(struct sock *sk,
 		return;
 	}
 
+	/* Similar additional allowance as plain TCP. */
+	limit = READ_ONCE(sk->sk_rcvbuf);
+	limit += (limit >> 1) + 64 * 1024;
+	limit = min_t(u64, limit, UINT_MAX);
+	if (msk->backlog_len > limit && !__mptcp_check_fallback(msk)) {
+		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_RCVBUFF);
+		return;
+	}
+
 	/* Try to coalesce with the last skb in our backlog */
 	if (!list_empty(&msk->backlog_list))
 		tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
@@ -753,7 +772,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 
 			mptcp_init_skb(ssk, skb, offset, len);
 
-			if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) {
+			if (own_msk) {
 				mptcp_subflow_lend_fwdmem(subflow, skb);
 				ret |= __mptcp_move_skb(sk, skb);
 			} else {
@@ -2211,10 +2230,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 
 	*delta = 0;
 	while (1) {
-		/* If the msk recvbuf is full stop, don't drop */
-		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
-			break;
-
 		prefetch(skb->next);
 		list_del(&skb->list);
 		*delta += skb->truesize;
@@ -2242,9 +2257,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 	DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket &&
 			       mem_cgroup_from_sk(sk));
 
-	/* Don't spool the backlog if the rcvbuf is full. */
-	if (list_empty(&msk->backlog_list) ||
-	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+	if (list_empty(&msk->backlog_list))
 		return false;
 
 	INIT_LIST_HEAD(skbs);
-- 
2.54.0

From nobody Mon May 25 18:04:43 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: Geliang Tang, gang.yan@linux.dev
Subject: [PATCH v6 mptcp-next 3/7] mptcp: enforce hard limit on backlog flushing
Date: Fri, 15 May 2026 11:07:24 +0200
Message-ID: <48f51185f95a05491c3dc0d2f73177d82e010bb3.1778835009.git.pabeni@redhat.com>

Currently a wild producer could keep the backlog flushing operation
spinning for an unbounded time.

Since the previous patch the amount of data present in the backlog is
hard-limited. Move the backlog len update to the end of the flush loop
to prevent it from spinning forever.

Also, there is no need to splice the remaining skbs list back into the
backlog, as such list is always empty after each backlog processing
loop.

Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9d2ed9503d08..78b8bcac7d91 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2228,7 +2228,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	bool moved = false;
 
-	*delta = 0;
 	while (1) {
 		prefetch(skb->next);
 		list_del(&skb->list);
@@ -2265,20 +2264,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 	return true;
 }
 
-static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
-				  struct list_head *skbs)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
-	list_splice(skbs, &msk->backlog_list);
-}
-
 static bool mptcp_move_skbs(struct sock *sk)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct list_head skbs;
 	bool enqueued = false;
-	u32 moved;
+	u32 moved = 0;
 
 	mptcp_data_lock(sk);
 	while (mptcp_can_spool_backlog(sk, &skbs)) {
@@ -2286,8 +2277,8 @@ static bool mptcp_move_skbs(struct sock *sk)
 		enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
 
 		mptcp_data_lock(sk);
-		mptcp_backlog_spooled(sk, moved, &skbs);
 	}
+	WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
 	mptcp_data_unlock(sk);
 	if (enqueued)
 		sk->sk_data_ready(sk);
@@ -3672,12 +3663,12 @@ static void mptcp_release_cb(struct sock *sk)
 	__must_hold(&sk->sk_lock.slock)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	u32 moved = 0;
 
 	for (;;) {
 		unsigned long flags = (msk->cb_flags & MPTCP_FLAGS_PROCESS_CTX_NEED);
 		struct list_head join_list, skbs;
 		bool spool_bl;
-		u32 moved;
 
 		spool_bl = mptcp_can_spool_backlog(sk, &skbs);
 		if (!flags && !spool_bl)
@@ -3710,9 +3701,9 @@ static void mptcp_release_cb(struct sock *sk)
 
 		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
-		if (spool_bl)
-			mptcp_backlog_spooled(sk, moved, &skbs);
 	}
+	if (moved)
+		WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
 
 	if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags))
 		__mptcp_clean_una_wakeup(sk);
-- 
2.54.0

From nobody Mon May 25 18:04:43 2026
From: Paolo Abeni
To: mptcp@lists.linux.dev
Cc: Geliang Tang, gang.yan@linux.dev
Subject: [PATCH v6 mptcp-next 4/7] mptcp: implemented OoO queue pruning
Date: Fri, 15 May 2026 11:07:25 +0200
Message-ID: <5b21b1148406d07e32e89a5c611f47c749b5ae41.1778835009.git.pabeni@redhat.com>

Leverage the hybrid helpers to implement the receive queue and OoO
queue collapsing at ingress time when reaching memory bounds.

If the msk is owned by user space at incoming skb time, perform the
pruning in the release_cb.

The prune check is additionally performed when the skb reaches the
msk-level queues.

Signed-off-by: Paolo Abeni
---
v2 -> v3:
 - deal with unsynced TFO skb at prune time - only possible when
   pruning in mptcp_over_limit()
v1 -> v2:
 - collapse rcv queue, too
 - deal with MPC map, too
 - drop left-over sentence in the commit message
RFC -> v1:
 - use data_seq only when available
 - avoid ack_seq lockless access
 - drop limit on fallback
 - collapse rcvqueue, too
 - drop only when pruning is not possible and over rcvbuf * 2

Note:
 - sashiko can be confused about fwd memory lifecycle (I can understand
   that :). Any exceeding amount of fwd allocated memory is always
   released by the next sk_mem_uncharge() - i.e. fwd memory is not tied
   to the current skb.
 - sashiko is also fooled by the main xtcp_collapse_ofo_queue() loop:
   ooo_last_skb is always kept up to date with the current tree status
 - AFAICS KASAN handles bitmap variables in a sane way, and sashiko
   doesn't know about that
---
 net/mptcp/mib.c      |  2 ++
 net/mptcp/mib.h      |  2 ++
 net/mptcp/options.c  | 42 +++++++++++++++++++++++++++++++++++-------
 net/mptcp/protocol.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  1 +
 5 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index f23fda0c55a7..bdc863c3a952 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -85,6 +85,8 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK),
 	SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED),
 	SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
+	SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
+	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
 };
 
 /* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 812218b5ed2b..8ec314847fc3 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -88,6 +88,8 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_SIMULTCONNFALLBACK,	/* Simultaneous connect */
 	MPTCP_MIB_FALLBACKFAILED,	/* Can't fallback due to msk status */
 	MPTCP_MIB_WINPROBE,		/* MPTCP-level zero window probe */
+	MPTCP_MIB_OFO_PRUNED,		/* MPTCP-level OoO queue pruned */
+	MPTCP_MIB_RCVPRUNED,		/* Dropped due to memory constraints */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 36f12e5dfa92..ec64e1a127d7 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1159,10 +1159,13 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
 }
 
 static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
-			     const struct sk_buff *skb)
+			     const struct sk_buff *skb,
+			     const struct mptcp_options_received *mp_opt)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	u64 mem = sk_rmem_alloc_get(sk);
+	u64 limit;
+	bool ret;
 
 	mem += READ_ONCE(msk->backlog_len);
 	if (likely(mem <= READ_ONCE(sk->sk_rcvbuf)))
@@ -1174,10 +1177,31 @@ static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
 	    !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt))
 		return false;
 
-	/* Dropped due to memory constraints, schedule an ack. */
-	inet_csk(ssk)->icsk_ack.pending |= ICSK_ACK_NOMEM | ICSK_ACK_NOW;
-	inet_csk_schedule_ack(ssk);
-	return true;
+	mptcp_data_lock(sk);
+	if (!sock_owned_by_user(sk)) {
+		/* When the data sequence is not (yet) available for the
+		 * incoming skb, allow pruning the whole OoO queue.
+		 */
+		u64 seq = (!mp_opt->use_map || mp_opt->mpc_map) ?
+			  msk->ack_seq : mp_opt->data_seq;
+
+		limit = sk->sk_rcvbuf;
+		__mptcp_check_prune(sk, seq);
+	} else {
+		/* Pruning will take place later in the RX path, allow
+		 * some extra slack.
+		 */
+		limit = ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1;
+	}
+	ret = sk_rmem_alloc_get(sk) + msk->backlog_len > limit;
+	mptcp_data_unlock(sk);
+
+	if (ret) {
+		/* Dropped due to memory constraints, schedule an ack. */
+		inet_csk(ssk)->icsk_ack.pending |= ICSK_ACK_NOMEM | ICSK_ACK_NOW;
+		inet_csk_schedule_ack(ssk);
+	}
+	return ret;
 }
 
 /* Return false when the caller must drop the packet, i.e. in case of error,
@@ -1208,7 +1232,11 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		__mptcp_data_acked(subflow->conn);
 		mptcp_data_unlock(subflow->conn);
 
-		if (mptcp_over_limit(subflow->conn, sk, skb))
+		/* Will use ack_seq as limit for OoO pruning; any value would do
+		 * as OoO queue must be empty.
+		 */
+		mp_opt.use_map = 0;
+		if (mptcp_over_limit(subflow->conn, sk, skb, &mp_opt))
 			return false;
 		return true;
 	}
@@ -1288,7 +1316,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		return true;
 	}
 
-	if (mptcp_over_limit(subflow->conn, sk, skb))
+	if (mptcp_over_limit(subflow->conn, sk, skb, &mp_opt))
 		return false;
 
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 78b8bcac7d91..b79dd4c4fe31 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -373,6 +373,46 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
 	skb_dst_drop(skb);
 }
 
+/* "Inspired" from the TCP version */
+static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct rb_node *node, *prev;
+	bool pruned = false;
+
+	if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
+		return;
+
+	node = &msk->ooo_last_skb->rbnode;
+
+	do {
+		struct sk_buff *skb = rb_to_skb(node);
+
+		/* Stop pruning if the incoming skb would land in OoO tail. */
+		if (after(seq, MPTCP_SKB_CB(skb)->map_seq))
+			break;
+
+		pruned = true;
+		prev = rb_prev(node);
+		rb_erase(node, &msk->out_of_order_queue);
+		mptcp_drop(sk, skb);
+		msk->ooo_last_skb = rb_to_skb(prev);
+		if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)
+			break;
+
+		node = prev;
+	} while (node);
+
+	if (pruned)
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+}
+
+bool __mptcp_check_prune(struct sock *sk, u64 seq)
+{
+	mptcp_prune_ofo_queue(sk, seq);
+	return atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf;
+}
+
 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 {
 	u64 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
@@ -385,7 +425,9 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 	 * will break.
 	 */
 	if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
+	    __mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq) &&
 	    !__mptcp_check_fallback(msk)) {
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
 		mptcp_drop(sk, skb);
 		return false;
 	}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 661600f8b573..95774a4e7231 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -827,6 +827,7 @@ bool __mptcp_close(struct sock *sk, long timeout);
 void mptcp_cancel_work(struct sock *sk);
 void __mptcp_unaccepted_force_close(struct sock *sk);
 void mptcp_set_state(struct sock *sk, int state);
+bool __mptcp_check_prune(struct sock *sk, u64 seq);
 
 bool mptcp_addresses_equal(const struct mptcp_addr_info *a,
 			   const struct mptcp_addr_info *b, bool use_port);
-- 
2.54.0

From nobody Mon May 25 18:04:43 2026
(p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=FVkHuPoI; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="FVkHuPoI" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1778836071; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=AU5BERyTkhJRjFIQJOtMMg7fgqdJQ1k9Q476Qxh89HY=; b=FVkHuPoImal1+6rAl+iuua6X7VlGcOSpNFrUgnva+BEE7upcy5SUT77APUTJf5O+QQ4+gF AvkhbIMe6STuvAA4hwNjodcbNf5wi8OphZgYrzEM0FGk95E/8bFFw1/k7+1JIj2bHWqZRJ UQhZ9IoWShwli4M7CgIpbg9R8v1h4q8= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-649-XSlzz-JjP8KKleUM4wBzkQ-1; Fri, 15 May 2026 05:07:50 -0400 X-MC-Unique: XSlzz-JjP8KKleUM4wBzkQ-1 X-Mimecast-MFC-AGG-ID: XSlzz-JjP8KKleUM4wBzkQ_1778836069 Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 0756019560A2; Fri, 15 May 2026 09:07:49 +0000 (UTC) Received: from gerbillo.redhat.com (unknown 
[10.44.32.244]) by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 9EEA6180075C; Fri, 15 May 2026 09:07:47 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Cc: Geliang Tang , gang.yan@linux.dev Subject: [PATCH v6 mptcp-next 5/7] mptcp: track prune recovery status Date: Fri, 15 May 2026 11:07:26 +0200 Message-ID: <2acec9f2de6c41b18a0ef92903ea1cc11759c644.1778835009.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: rKIEzPubZqUESbCetuN1UrjgKjTAxDvE6v_pqC49oZ8_1778836069 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" After dropping any data already acked at the TCP level, MPTCP must avoid inducing TCP-level retransmissions until the pruned data has been successfully acked at the MPTCP level. Otherwise the subflows could keep retransmitting skbs carrying OoO MPTCP data, preventing reinjections and completely stalling the data transfer. Explicitly keep track of the highest pruned MPTCP-level seq number and stop dropping at the TCP level until that sequence has been acked. Signed-off-by: Paolo Abeni --- Notes: - sashiko may miss that the msk->ack_seq access in mptcp_over_limit() happens under the msk data lock and is therefore race-free. 
--- net/mptcp/options.c | 7 +++++++ net/mptcp/protocol.c | 14 +++++++++++++- net/mptcp/protocol.h | 3 +++ net/mptcp/subflow.c | 1 + 4 files changed, 24 insertions(+), 1 deletion(-) diff --git a/net/mptcp/options.c b/net/mptcp/options.c index ec64e1a127d7..c96b8166224b 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -1194,6 +1194,13 @@ static bool mptcp_over_limit(struct sock *sk, struct= sock *ssk, limit =3D ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1; } ret =3D sk_rmem_alloc_get(sk) + msk->backlog_len > limit; + + /* After pruning any packets ensure that MPTCP-driven drops do not + * cause TCP-level retransmission. + */ + if (before64(msk->ack_seq, READ_ONCE(msk->pruned_seq))) + ret =3D false; + mptcp_data_unlock(sk); =20 if (ret) { diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index b79dd4c4fe31..640632c283e1 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -379,12 +379,14 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u6= 4 seq) struct mptcp_sock *msk =3D mptcp_sk(sk); struct rb_node *node, *prev; bool pruned =3D false; + u64 pruned_seq; =20 if (RB_EMPTY_ROOT(&msk->out_of_order_queue)) return; =20 node =3D &msk->ooo_last_skb->rbnode; =20 + pruned_seq =3D msk->pruned_seq; do { struct sk_buff *skb =3D rb_to_skb(node); =20 @@ -395,16 +397,21 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u6= 4 seq) pruned =3D true; prev =3D rb_prev(node); rb_erase(node, &msk->out_of_order_queue); + if (after(MPTCP_SKB_CB(skb)->end_seq, pruned_seq)) + pruned_seq =3D MPTCP_SKB_CB(skb)->end_seq; mptcp_drop(sk, skb); msk->ooo_last_skb =3D rb_to_skb(prev); + if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf) break; =20 node =3D prev; } while (node); =20 - if (pruned) + if (pruned) { + WRITE_ONCE(msk->pruned_seq, pruned_seq); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED); + } } =20 bool __mptcp_check_prune(struct sock *sk, u64 seq) @@ -427,6 +434,8 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk= _buff *skb) if 
(unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) && __mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq) && !__mptcp_check_fallback(msk)) { + if (after(MPTCP_SKB_CB(skb)->end_seq, msk->pruned_seq)) + WRITE_ONCE(msk->pruned_seq, MPTCP_SKB_CB(skb)->end_seq); MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED); mptcp_drop(sk, skb); return false; @@ -887,6 +896,8 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk) WRITE_ONCE(msk->ack_seq, end_seq); moved =3D true; } + if (after64(msk->ack_seq, msk->pruned_seq)) + WRITE_ONCE(msk->pruned_seq, msk->ack_seq); return moved; } =20 @@ -3536,6 +3547,7 @@ static int mptcp_disconnect(struct sock *sk, int flag= s) /* for fallback's sake */ WRITE_ONCE(msk->ack_seq, 0); atomic64_set(&msk->rcv_wnd_sent, 0); + WRITE_ONCE(msk->pruned_seq, 0); =20 WRITE_ONCE(sk->sk_shutdown, 0); sk_error_report(sk); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 95774a4e7231..32daf51e48ef 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -303,6 +303,9 @@ struct mptcp_sock { u64 bytes_acked; u64 snd_una; u64 wnd_end; + u64 pruned_seq; /* If strictly above ack_seq, + * the highest seq pruned. 
+ */ u32 last_data_sent; u32 last_data_recv; u32 last_ack_recv; diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index d562e149606f..cc75d914c1b5 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -494,6 +494,7 @@ static void subflow_set_remote_key(struct mptcp_sock *m= sk, =20 WRITE_ONCE(msk->remote_key, subflow->remote_key); WRITE_ONCE(msk->ack_seq, subflow->iasn); + WRITE_ONCE(msk->pruned_seq, subflow->iasn); WRITE_ONCE(msk->can_ack, true); atomic64_set(&msk->rcv_wnd_sent, subflow->iasn); } --=20 2.54.0 From nobody Mon May 25 18:04:43 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F2435372698 for ; Fri, 15 May 2026 09:07:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778836080; cv=none; b=rNdZdeFTm6cU7OdsVyTrKMPETrilpIBlL4PA8ZcY5G1rzUucGHOFCdFUll5K5BpC2M5OKVgmdMoxfwa9DOYpFE3Nx7XpRcxS0zAx0dH8ulHLOmxN+YJJsS3hH7KauEpi4hvMnjwk1MuLSA+jXJFQkpG8w9M0nd3EQvkd/uPQlD0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778836080; c=relaxed/simple; bh=5u5Wzrp+SW4TT55X15ObBWVVxuyrcufmOV6YrBdjdxM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=jabd0DdLGRIXQkNOb381mEbCoASgp7L8tF19AAza2nlcI6MizSIT1UiiJUxsBDeJ1ni7S67mtmzZ6HgaCbHUWkK8iOZ2dNZVaOR/8vXOR08fjz2/1rFFax4qws6EWQLZZLxxZSZJO+2yVXODcNXUpApTpj4r8mUc3CmYecR5N0E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=ZfVafNMG; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass 
(p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="ZfVafNMG" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1778836078; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lEg8FL2MSLHI809MFFeMfP6hp75tculaUYdya5OR+3c=; b=ZfVafNMGsYLyPuue5gJEJfHA5h78D31c35Oh0rt54m7gSBiB7DppMsLw1ofj854hR4b5m7 6Ua+sfyX4By/uPX9vVgDMnh2r9QV0CxkBDb6G/emjOz7y67baErEhUA+BV2VKirK9Vo2bU uSEKTJ3oXJot3hwNU2cUFoHJsP0eWVE= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-516-Ny8b_rTMOLCcvFNvGwdhEA-1; Fri, 15 May 2026 05:07:52 -0400 X-MC-Unique: Ny8b_rTMOLCcvFNvGwdhEA-1 X-Mimecast-MFC-AGG-ID: Ny8b_rTMOLCcvFNvGwdhEA_1778836071 Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 1D127195608E; Fri, 15 May 2026 09:07:51 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.32.244]) by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id AA56518001E8; Fri, 15 May 2026 09:07:49 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Cc: Geliang Tang , gang.yan@linux.dev Subject: [PATCH v6 mptcp-next 6/7] mptcp: 
move the retrans loop to a separate helper Date: Fri, 15 May 2026 11:07:27 +0200 Message-ID: In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 9YxydPx1w3K5oxdaE99NT4vtu2gDym7t-7KkVBhBaoY_1778836071 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" This is a cleanup in order to make the next patch simpler. No functional change intended. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 74 +++++++++++++++++++++++++------------------- 1 file changed, 43 insertions(+), 31 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 640632c283e1..de142be05934 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -2835,41 +2835,14 @@ static void mptcp_check_fastclose(struct mptcp_sock= *msk) sk_error_report(sk); } =20 -static void __mptcp_retrans(struct sock *sk) +/* Retransmit the specified data fragment on all the selected subflows. 
*/ +static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *d= frag) { struct mptcp_sendmsg_info info =3D { .data_lock_held =3D true, }; struct mptcp_sock *msk =3D mptcp_sk(sk); struct mptcp_subflow_context *subflow; - struct mptcp_data_frag *dfrag; struct sock *ssk; - int ret, err; - u16 len =3D 0; - - mptcp_clean_una_wakeup(sk); - - /* first check ssk: need to kick "stale" logic */ - err =3D mptcp_sched_get_retrans(msk); - dfrag =3D mptcp_rtx_head(sk); - if (!dfrag) { - if (mptcp_data_fin_enabled(msk)) { - struct inet_connection_sock *icsk =3D inet_csk(sk); - - WRITE_ONCE(icsk->icsk_retransmits, - icsk->icsk_retransmits + 1); - mptcp_set_datafin_timeout(sk); - mptcp_send_ack(msk); - - goto reset_timer; - } - - if (!mptcp_send_head(sk)) - goto clear_scheduled; - - goto reset_timer; - } - - if (err) - goto reset_timer; + int ret, len =3D 0; =20 mptcp_for_each_subflow(msk, subflow) { if (READ_ONCE(subflow->scheduled)) { @@ -2897,7 +2870,7 @@ static void __mptcp_retrans(struct sock *sk) !msk->allow_subflows) { spin_unlock_bh(&msk->fallback_lock); release_sock(ssk); - goto clear_scheduled; + return -1; } =20 while (info.sent < info.limit) { @@ -2920,6 +2893,45 @@ static void __mptcp_retrans(struct sock *sk) release_sock(ssk); } } + return len; +} + +static void __mptcp_retrans(struct sock *sk) +{ + struct mptcp_sock *msk =3D mptcp_sk(sk); + struct mptcp_subflow_context *subflow; + struct mptcp_data_frag *dfrag; + int err, len; + + mptcp_clean_una_wakeup(sk); + + /* first check ssk: need to kick "stale" logic */ + err =3D mptcp_sched_get_retrans(msk); + dfrag =3D mptcp_rtx_head(sk); + if (!dfrag) { + if (mptcp_data_fin_enabled(msk)) { + struct inet_connection_sock *icsk =3D inet_csk(sk); + + WRITE_ONCE(icsk->icsk_retransmits, + icsk->icsk_retransmits + 1); + mptcp_set_datafin_timeout(sk); + mptcp_send_ack(msk); + + goto reset_timer; + } + + if (!mptcp_send_head(sk)) + goto clear_scheduled; + + goto reset_timer; + } + + if (err) + goto reset_timer; + + 
len =3D __mptcp_push_retrans(sk, dfrag); + if (len < 0) + goto clear_scheduled; =20 msk->bytes_retrans +=3D len; dfrag->already_sent =3D max(dfrag->already_sent, len); --=20 2.54.0 From nobody Mon May 25 18:04:43 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CCAC7410D0C for ; Fri, 15 May 2026 09:07:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778836078; cv=none; b=dxtstUdC09PR3jbrDltXV5JiEcIrSYM0jp5arWqUCFlgv02VX05thdrkiCmoz/QeLS6S2abob9haOivoQ+YhYjE71s/kloH8HjSFma1BpY0rQnzF2Uu/Abqx6vp12CTL5fn1fOMEu9bo2wSo9fufljU4YfPuAVQFQDuPB27hPQ0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778836078; c=relaxed/simple; bh=wcIfINHHczpBgg567UOp9Hs6hXrPD5RkkcT7KytG/Ms=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:content-type; b=R3s/jaZ+PcmzXbK+1VtPxXvn5JDAl0GIvJHxWTvGCopGPueWI8/zZr1WYDIpduru23zUNJDzllE27KRJMLWgxn8exuiE+D0wnY3bBNbCmNqYu8jDzeacpJDPT+R1gXfkTM4IzqswvRbaG2/yFL+RL7RVMEj0DUUoHRvWlU+wPXo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=QlzkS322; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="QlzkS322" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; 
s=mimecast20190719; t=1778836075; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=aRa3SFFkIwhcB2VOwvbFuh8eWfgj33wy1ZIb8G3102A=; b=QlzkS322EXFxTYvsldEYwyjs0Wea1oBUw09x1IcZDIm56vCLwYaze4Zt9L4smt1uvXH8qz U7cinQR6ynhRCBv9lPRaKFZ0dW25mlLDDz81g5htC6b1OyoDIlk8QB4q4V0syr/XAHX4KC LJMdPGfH2M2r0DR/8rdYccMWjQpiwbQ= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-513-E2W85rsEN9GBjgGsOHiJGA-1; Fri, 15 May 2026 05:07:54 -0400 X-MC-Unique: E2W85rsEN9GBjgGsOHiJGA-1 X-Mimecast-MFC-AGG-ID: E2W85rsEN9GBjgGsOHiJGA_1778836073 Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id D09DF18005BB; Fri, 15 May 2026 09:07:52 +0000 (UTC) Received: from gerbillo.redhat.com (unknown [10.44.32.244]) by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 7DBB918001E8; Fri, 15 May 2026 09:07:51 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Cc: Geliang Tang , gang.yan@linux.dev Subject: [PATCH v6 mptcp-next 7/7] mptcp: let the retrans scheduler do its job. 
Date: Fri, 15 May 2026 11:07:28 +0200 Message-ID: <08a58d3e917735d08d0ca50621494ea6f2fe41ba.1778835009.git.pabeni@redhat.com> In-Reply-To: References: Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: E_JiXSH6DDddNz9dVWTj4meo0j2gZ1HGARuon8hMyuU_1778836073 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; x-default="true" Currently the MPTCP core enforces that when the MPTCP-level retrans timer fires, at most a single dfrag is retransmitted. In some corner cases it may be necessary to retransmit multiple dfrags, and the MPTCP socket will need to wait for multiple retrans timeouts to accomplish that. Remove the mentioned constraint, allowing multiple dfrags to be retransmitted per retrans period, as long as the scheduler keeps selecting subflows for retransmission and pending data is available in the rtx queue. The default scheduler will transmit one dfrag per available subflow. 
Signed-off-by: Paolo Abeni --- --- v4 -> v5: - fixed already_sent update v3 -> v4: - avoid quadratic behavior, fix retrans_seq update - fix rtx timer re-schedule miss v2 -> v3: - fix infinite loop issue (should address tls tests failures) v1 -> v2: - fix retrans sequence update (sashiko) --- net/mptcp/protocol.c | 105 +++++++++++++++++++++++++++++++------------ 1 file changed, 77 insertions(+), 28 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index de142be05934..b0877908883a 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -1208,13 +1208,6 @@ static void __mptcp_clean_una_wakeup(struct sock *sk) mptcp_write_space(sk); } =20 -static void mptcp_clean_una_wakeup(struct sock *sk) -{ - mptcp_data_lock(sk); - __mptcp_clean_una_wakeup(sk); - mptcp_data_unlock(sk); -} - static void mptcp_enter_memory_pressure(struct sock *sk) { struct mptcp_subflow_context *subflow; @@ -2835,8 +2828,12 @@ static void mptcp_check_fastclose(struct mptcp_sock = *msk) sk_error_report(sk); } =20 -/* Retransmit the specified data fragment on all the selected subflows. 
*/ -static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *d= frag) +/* + * Retransmit the specified data fragment on all the selected subflows, + * starting from the specified sequence + */ +static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *d= frag, + u64 sent_seq) { struct mptcp_sendmsg_info info =3D { .data_lock_held =3D true, }; struct mptcp_sock *msk =3D mptcp_sk(sk); @@ -2846,6 +2843,7 @@ static int __mptcp_push_retrans(struct sock *sk, stru= ct mptcp_data_frag *dfrag) =20 mptcp_for_each_subflow(msk, subflow) { if (READ_ONCE(subflow->scheduled)) { + u16 offset =3D sent_seq - dfrag->data_seq; u16 copied =3D 0; =20 mptcp_subflow_set_scheduled(subflow, false); @@ -2855,9 +2853,12 @@ static int __mptcp_push_retrans(struct sock *sk, str= uct mptcp_data_frag *dfrag) lock_sock(ssk); =20 /* limit retransmission to the bytes already sent on some subflows */ - info.sent =3D 0; + info.sent =3D offset; info.limit =3D READ_ONCE(msk->csum_enabled) ? dfrag->data_len : dfrag->already_sent; + DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq, + dfrag->data_seq + + info.limit)); =20 /* * make the whole retrans decision, xmit, disallow @@ -2901,41 +2902,89 @@ static void __mptcp_retrans(struct sock *sk) struct mptcp_sock *msk =3D mptcp_sk(sk); struct mptcp_subflow_context *subflow; struct mptcp_data_frag *dfrag; + bool retransmitted =3D false; + u64 retrans_seq; int err, len; =20 - mptcp_clean_una_wakeup(sk); - - /* first check ssk: need to kick "stale" logic */ - err =3D mptcp_sched_get_retrans(msk); + mptcp_data_lock(sk); + __mptcp_clean_una_wakeup(sk); + retrans_seq =3D msk->snd_una; dfrag =3D mptcp_rtx_head(sk); + mptcp_data_unlock(sk); + if (!dfrag) + goto check_data_fin; + + for (;;) { + bool already_retrans; + + /* The scheduler may clean the RTX queue. */ + get_page(dfrag->page); + + /* The default scheduler will kick "stale" logic. 
*/ + err =3D mptcp_sched_get_retrans(msk); + if (err) { + put_page(dfrag->page); + break; + } + + /* Incoming acks can have moved retrans sequence after + * the current dfrag, if so try to start again from RTX head. + */ + mptcp_data_lock(sk); + already_retrans =3D !dfrag->already_sent || + !before64(msk->snd_una, dfrag->data_seq + + dfrag->already_sent); + put_page(dfrag->page); + if (already_retrans) { + __mptcp_clean_una_wakeup(sk); + retrans_seq =3D msk->snd_una; + dfrag =3D mptcp_rtx_head(sk); + } + mptcp_data_unlock(sk); + if (!dfrag) + break; + + len =3D __mptcp_push_retrans(sk, dfrag, retrans_seq); + if (len < 0) + goto clear_scheduled; + + retransmitted =3D true; + retrans_seq +=3D len; + msk->bytes_retrans +=3D len; + dfrag->already_sent =3D max_t(u16, dfrag->already_sent, + retrans_seq - dfrag->data_seq); + + /* Attempt the next fragment only if the current one is + * completely retransmitted. + */ + if (before64(retrans_seq, dfrag->data_seq + dfrag->data_len)) + break; + + dfrag =3D list_is_last(&dfrag->list, &msk->rtx_queue) ? + NULL : list_next_entry(dfrag, list); + if (!dfrag || !dfrag->already_sent) + break; + } + + /* Data fin retransmission needed only if no data retransmission took + * place, and RTX queue is empty. + */ +check_data_fin: if (!dfrag) { - if (mptcp_data_fin_enabled(msk)) { + if (!retransmitted && mptcp_data_fin_enabled(msk)) { struct inet_connection_sock *icsk =3D inet_csk(sk); =20 WRITE_ONCE(icsk->icsk_retransmits, icsk->icsk_retransmits + 1); mptcp_set_datafin_timeout(sk); mptcp_send_ack(msk); - goto reset_timer; } =20 if (!mptcp_send_head(sk)) goto clear_scheduled; - - goto reset_timer; } =20 - if (err) - goto reset_timer; - - len =3D __mptcp_push_retrans(sk, dfrag); - if (len < 0) - goto clear_scheduled; - - msk->bytes_retrans +=3D len; - dfrag->already_sent =3D max(dfrag->already_sent, len); - reset_timer: mptcp_check_and_set_pending(sk); =20 --=20 2.54.0