This is a follow-up of commit ecfea98b7d0d ("tcp: add
net.ipv4.tcp_rcvbuf_low_rtt"), adapted to MPTCP.
MPTCP has mptcp_rcvbuf_grow(), which is similar to tcp_rcvbuf_grow(),
but adapted for the MPTCP-level socket.
The idea here is similar to what has been done on the TCP side: do not
let mptcp_rcvbuf_grow() grow sk->sk_rcvbuf too fast for small RTT flows.
Quoting Eric: If sk->sk_rcvbuf is too big, this can force NIC driver to
not recycle pages from their page pool, and also can cause cache
evictions for DDIO enabled cpus/NIC, as receivers are usually slower
than senders.
If the RTT is smaller than the new net.ipv4.tcp_rcvbuf_low_rtt sysctl
value, use the RTT / tcp_rcvbuf_low_rtt ratio to control sk_rcvbuf
inflation.
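For example, with hypothetical numbers: rtt_us = 200 us and
tcp_rcvbuf_low_rtt = 2000 us would give grow = rcvwin * 200 / 2000 =
rcvwin / 10, instead of the usual slow-start growth of
2 * rcvwin * (newval - oldval) / oldval.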
Tested: NO :)
This is why it is still an RFC. My perf test env is currently broken.
I'm sharing this patch in case it is easy for someone to validate it.
Ideally, such tests should be done on top of the "trace: mptcp: add
mptcp_rcvbuf_grow tracepoint" patch from Paolo (and probably on top of
the related series), following similar tests to the ones done by Eric,
making sure the receiver is slower than the sender. Feel free to take
the patch, and send new versions changing the author, etc. if needed.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
net/mptcp/protocol.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e484c6391b48..715a9a072c6a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -208,6 +208,7 @@ static bool mptcp_rcvbuf_grow(struct sock *sk, u32 newval)
struct mptcp_sock *msk = mptcp_sk(sk);
const struct net *net = sock_net(sk);
u32 rcvwin, rcvbuf, cap, oldval;
+ u32 rtt_threshold, rtt_us;
u64 grow;
oldval = msk->rcvq_space.space;
@@ -219,10 +220,19 @@ static bool mptcp_rcvbuf_grow(struct sock *sk, u32 newval)
/* DRS is always one RTT late. */
rcvwin = newval << 1;
- /* slow start: allow the sender to double its rate. */
- grow = (u64)rcvwin * (newval - oldval);
- do_div(grow, oldval);
- rcvwin += grow << 1;
+ rtt_us = msk->rcvq_space.rtt_us >> 3;
+ rtt_threshold = READ_ONCE(net->ipv4.sysctl_tcp_rcvbuf_low_rtt);
+ if (rtt_us < rtt_threshold) {
+ /* For small RTT, we set @grow to rcvwin * rtt_us/rtt_threshold.
+ * It might take few additional ms to reach 'line rate',
+ * but will avoid sk_rcvbuf inflation and poor cache use.
+ */
+ grow = div_u64((u64)rcvwin * rtt_us, rtt_threshold);
+ } else {
+ /* slow start: allow the sender to double its rate. */
+ grow = div_u64(((u64)rcvwin << 1) * (newval - oldval), oldval);
+ }
+ rcvwin += grow;
if (!RB_EMPTY_ROOT(&msk->out_of_order_queue))
rcvwin += MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq - msk->ack_seq;
---
base-commit: 1fea9a6bd10f5c5494b7973141083ec56ecffd74
change-id: 20251127-mptcp-tcp_rcvbuf_low_rtt-fc64120b153a
Best regards,
--
Matthieu Baerts (NGI0) <matttbe@kernel.org>
Hi Matthieu,
Thank you for your modifications, that's great!
Our CI did some validations and here is its report:
- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Unstable: 1 failed test(s): packetdrill_add_addr 🔴
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/19742476256
Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/3d5676c09a8f
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1028361
If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:
$ cd [kernel source code]
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-normal
For more details:
https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
Please note that despite all the efforts already made to have a stable
test suite when executed on a public CI like this one, it is possible
that some reported issues are not due to your modifications. Still, do
not hesitate to help us improve that ;-)
Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
On 11/27/25 4:58 PM, Matthieu Baerts (NGI0) wrote:
> This is a follow up of commit ecfea98b7d0d ("tcp: add
> net.ipv4.tcp_rcvbuf_low_rtt"), but adapted to MPTCP.
>
> MPTCP has mptcp_rcvbuf_grow(), which is similar to tcp_rcvbuf_grow, but
> adapted for the MPTCP-level socket.
>
> The idea here is similar to what has been done on TCP side: not let
> mptcp_rcvbuf_grow() grow sk->sk_rcvbuf too fast for small RTT flows.
> Quoting Eric: If sk->sk_rcvbuf is too big, this can force NIC driver to
> not recycle pages from their page pool, and also can cause cache
> evictions for DDIO enabled cpus/NIC, as receivers are usually slower
> than senders.
>
> If RTT if smaller than the new net.ipv4.tcp_rcvbuf_low_rtt sysctl value,
> use the RTT / tcp_rcvbuf_low_rtt ratio to control sk_rcvbuf inflation.
Instead of duplicating the TCP math, I suggest factoring it out in a
helper and using it in both the TCP and MPTCP code.
/P
Hi Paolo,
On 27/11/2025 18:20, Paolo Abeni wrote:
> On 11/27/25 4:58 PM, Matthieu Baerts (NGI0) wrote:
>> This is a follow up of commit ecfea98b7d0d ("tcp: add
>> net.ipv4.tcp_rcvbuf_low_rtt"), but adapted to MPTCP.
>>
>> MPTCP has mptcp_rcvbuf_grow(), which is similar to tcp_rcvbuf_grow, but
>> adapted for the MPTCP-level socket.
>>
>> The idea here is similar to what has been done on TCP side: not let
>> mptcp_rcvbuf_grow() grow sk->sk_rcvbuf too fast for small RTT flows.
>> Quoting Eric: If sk->sk_rcvbuf is too big, this can force NIC driver to
>> not recycle pages from their page pool, and also can cause cache
>> evictions for DDIO enabled cpus/NIC, as receivers are usually slower
>> than senders.
>>
>> If RTT if smaller than the new net.ipv4.tcp_rcvbuf_low_rtt sysctl value,
>> use the RTT / tcp_rcvbuf_low_rtt ratio to control sk_rcvbuf inflation.
>
> Instead of duplicating the TCP math, I suggest factoring it out in an
> helper and use it in both the TCP and MPTCP code.
Thank you for the review! Good idea!
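Something like the sketch below could work, I suppose -- completely
untested, and the helper name, signature and placement are just
placeholders:

/* Untested sketch: share the rcvwin growth computation between
 * tcp_rcvbuf_grow() and mptcp_rcvbuf_grow().
 */
static u64 tcp_rcvwin_grow(const struct net *net, u32 rcvwin,
			   u32 oldval, u32 newval, u32 rtt_us)
{
	u32 rtt_threshold = READ_ONCE(net->ipv4.sysctl_tcp_rcvbuf_low_rtt);

	if (rtt_us < rtt_threshold)
		/* Small RTT: scale the growth by rtt_us / rtt_threshold
		 * to avoid sk_rcvbuf inflation and poor cache use.
		 */
		return div_u64((u64)rcvwin * rtt_us, rtt_threshold);

	/* slow start: allow the sender to double its rate. */
	return div_u64(((u64)rcvwin << 1) * (newval - oldval), oldval);
}

The MPTCP side would then just do:

	rcvwin += tcp_rcvwin_grow(net, rcvwin, oldval, newval,
				  msk->rcvq_space.rtt_us >> 3);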
I guess this patch can wait for the next cycle, right? Or should I rush
to get it in soon to stay in sync with TCP?
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.