[PATCH mptcp-next v2 00/10] mptcp: address stall under memory pressure

Paolo Abeni posted 10 patches 1 week ago
Patches applied successfully
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/cover.1777318959.git.pabeni@redhat.com
There is a newer version of this series
[PATCH mptcp-next v2 00/10] mptcp: address stall under memory pressure
Posted by Paolo Abeni 1 week ago
This is an attempt to fix the data transfer stall reported by Geliang and
Gang by more carefully enforcing memory constraints at the MPTCP level.

Patch 1/10 moves the bound check before entering the TCP socket.
Patches 2, 3, 4 and 5 are cleanups/refactors aimed at safely re-using
TCP helpers on MPTCP skbs.
Patch 6 makes TCP pruning-related helpers available to MPTCP and patch 7
makes use of them. Patch 8 addresses an edge scenario that could still
lead to a transfer stall under memory pressure.
Finally, patches 9 and 10 improve the MPTCP-level retransmission schema to
make recovery from memory pressure significantly faster.

Note that the diffstat is biased by the quite large patch 4/9, which
contains a mechanical transformation of existing code; the "real" changes
are noticeably smaller.

Tested successfully against the test cases proposed by Geliang and Gang
and against the selftests.
---
Notes:
- Sashiko apparently can't understand that holding mptcp_data_lock()
while the sk is not owned by user-space is equivalent to holding the sk
socket lock. As a consequence, it incorrectly marks as racy a few
sk_rcvbuf accesses performed under the data lock.
- mptcp_join with "4 subflows with under pressure" may require longer
than the default timeout, as the transfer in such conditions can be
really slow.

Paolo Abeni (10):
  mptcp: move checks vs rcvbuf size earlier in the RX path
  mptcp: drop the mptcp_ooo_try_coalesce() helper
  mptcp: drop the cant_coalesce CB field
  mptcp: remove CB offset field
  mptcp: sync mptcp skb cb layout with tcp one
  tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too
  mptcp: implemented OoO queue pruning
  mptcp: track prune recovery status
  mptcp: move the retrans loop to a separate helper
  mptcp: let the retrans scheduler do its job.

 include/net/tcp.h    |   8 +
 net/ipv4/tcp_input.c |  55 +++---
 net/mptcp/fastopen.c |   9 +-
 net/mptcp/mib.c      |   3 +
 net/mptcp/mib.h      |   3 +
 net/mptcp/options.c  |  58 ++++++-
 net/mptcp/protocol.c | 403 ++++++++++++++++++++++++++++---------------
 net/mptcp/protocol.h |  18 +-
 net/mptcp/subflow.c  |  10 ++
 9 files changed, 398 insertions(+), 169 deletions(-)

-- 
2.53.0
Re: [PATCH mptcp-next v2 00/10] mptcp: address stall under memory pressure
Posted by Paolo Abeni 6 days, 17 hours ago
On 4/27/26 9:51 PM, Paolo Abeni wrote:
> This is an attempt to fix the data transfer stall reported by Geliang and
> Gang by more carefully enforcing memory constraints at the MPTCP level.
> 
> Patch 1/10 moves the bound check before entering the TCP socket.
> Patches 2, 3, 4 and 5 are cleanups/refactors aimed at safely re-using
> TCP helpers on MPTCP skbs.
> Patch 6 makes TCP pruning-related helpers available to MPTCP and patch 7
> makes use of them. Patch 8 addresses an edge scenario that could still
> lead to a transfer stall under memory pressure.
> Finally, patches 9 and 10 improve the MPTCP-level retransmission schema to
> make recovery from memory pressure significantly faster.
> 
> Note that the diffstat is biased by the quite large patch 4/9, which
> contains a mechanical transformation of existing code; the "real" changes
> are noticeably smaller.
> 
> Tested successfully against the test cases proposed by Geliang and Gang
> and against the selftests.
> ---
> Notes:
> - Sashiko apparently can't understand that holding mptcp_data_lock()
> while the sk is not owned by user-space is equivalent to holding the sk
> socket lock. As a consequence, it incorrectly marks as racy a few
> sk_rcvbuf accesses performed under the data lock.
> - mptcp_join with "4 subflows with under pressure" may require longer
> than the default timeout, as the transfer in such conditions can be
> really slow.

Sashiko says:
---
> +static bool mptcp_over_limit(struct sock *sk, const struct sock *ssk,
> +			     const struct sk_buff *skb)
> +{
> +	if (likely(sk_rmem_alloc_get(sk) <= READ_ONCE(sk->sk_rcvbuf)))
> +		return false;
> +
> +	/* Avoid silently dropping pure acks, fin or zero win probes. */
> +	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq ||
> +	    TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN ||
> +	    !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt))
> +		return false;
Will this silently drop 1-byte Zero Window Probes (ZWPs)? For a standard
1-byte ZWP where seq = rcv_nxt and end_seq = rcv_nxt + 1, it seems it
would fail these conditions, returning true and dropping the packet.
Also, if mptcp_incoming_options() returns false due to being over limit,
does this cause tcp_data_queue() to execute __kfree_skb(skb) and return
without sending or scheduling a TCP ACK? If so, could this bypass standard
drop handlers (like goto out_of_window) and cause connection timeouts by
not informing the sender of the zero window condition?
---

This is wrong: zero window probes use already-acked sequence numbers.

---
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 718e910ff23f..77d7c7fbe8ef 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -739,7 +739,7 @@ static bool __mptcp_move_skbs_from_subflow(struct
mptcp_sock *msk,
>
>  			mptcp_init_skb(ssk, skb, offset, len);
>
> -			if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) {
> +			if (own_msk) {
By removing the sk_rmem_alloc_get() limit check here and moving the limit
enforcement exclusively to mptcp_incoming_options() in the TCP slow path,
can this bypass the MSK receive buffer limit for fallback connections
taking the TCP fast path?
If an MPTCP connection falls back to plain TCP, it appears packets take
the TCP fast path in tcp_rcv_established(), bypassing tcp_data_queue()
and mptcp_incoming_options(). These packets would go directly through
tcp_data_ready() to __mptcp_move_skbs_from_subflow(), which now
unconditionally moves the packets to the MSK.
Since ownership is transferred to the MSK, the subflow's sk_rmem_alloc
would remain near zero, preventing it from falling back to the slow path.
Could this allow unbounded memory growth in the MSK queue if a sender
transmits an infinite amount of data?
---

This is wrong: mptcp_incoming_options() is always invoked even for a
fallback msk, as the msk explicitly disables header prediction.

---
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 28a7482b3887..26b9a23c63a9 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -352,9 +350,9 @@ static void mptcp_init_skb(struct sock *ssk,
struct sk_buff *skb, int offset,
>  	 * mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq
>  	 * value
>  	 */
> -	MPTCP_SKB_CB(skb)->map_seq = mptcp_subflow_get_mapped_dsn(subflow);
> -	MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + copy_len;
> -	MPTCP_SKB_CB(skb)->offset = offset;
> +	MPTCP_SKB_CB(skb)->map_seq = mptcp_subflow_get_mapped_dsn(subflow) -
> +				     offset;
Could this cause packets with an offset greater than 0 to be incorrectly
dropped?
If map_seq represents the sequence of the entire SKB rather than just the
unread data, and __mptcp_move_skb() checks if
MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq, valid packets with an offset
(due
to TCP retransmissions or overlapping segments) might fail the in-order
check and be permanently dropped as old data.
---
This is wrong: the main point is that this series has a functional
dependency on "mptcp: do not drop partial packets".

---
> @@ -4364,7 +4355,7 @@ static struct sk_buff *mptcp_recv_skb(struct
sock *sk, u32 *off)
>  		mptcp_move_skbs(sk);
>
>  	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
> -		offset = MPTCP_SKB_CB(skb)->offset;
> +		offset = msk->copied_seq - MPTCP_SKB_CB(skb)->map_seq;
>  		if (offset < skb->len) {
>  			*off = offset;
>  			return skb;
Similar to the issue in mptcp_inq_hint(), if copied_seq evaluates to 0
due to
deferred initialization, could this offset calculation wrap around?
Since map_seq is the crypto iasn, 0 - map_seq would wrap around to a large
u32 value. This would cause offset < skb->len to evaluate to false and
could lead to unconditionally dropping valid, unread SKBs via
mptcp_eat_recv_skb() when called from __mptcp_read_sock().
---
This is wrong; due to __mptcp_sync_rcv_sequence() `copied_seq` can be
zero only if map_seq is 0, too.

---
Also, what happens if __mptcp_push_retrans() successfully queues some data
but returns a len less than dfrag->data_len?
The loop will continue and pick the same dfrag again since retrans_seq
advanced but still falls within the fragment. However, looking at
__mptcp_push_retrans(), it appears to always start from the beginning:
static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag
*dfrag)
{
	...
	info.sent = 0;
	info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
						    dfrag->already_sent;
	while (info.sent < info.limit) {
		ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
	...
}
Since it doesn't take retrans_seq as an offset, will this incorrectly
retransmit the first len bytes of the fragment multiple times, artificially
incrementing retrans_seq until it exceeds the fragment bounds, and
ultimately skipping the tail of the fragment entirely?
---
AFAICS here the LLM is hallucinating a lot, as the main termination
condition for that loop is `dfrag->retrans_seq < dfrag->data_len`.
The other comments look valid and I'll try to address them in the next
revision. I'm not sure how to guide Sashiko on the above; I'll try
adding more notes in the cover letter.

/P
Re: [PATCH mptcp-next v2 00/10] mptcp: address stall under memory pressure
Posted by MPTCP CI 1 week ago
Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Unstable: 2 failed test(s): packetdrill_dss selftest_mptcp_connect_checksum ⚠️ 
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/25017102382

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/423dbedaff6a
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1086404


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)