The rcv window is shared among all the subflows. Currently, MPTCP syncs
the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time.
The above means that incoming data may sporadically observe an outdated
TCP-level rcv window and be wrongly dropped by TCP.
Address the issue by checking for the edge condition before queuing the
data at TCP level, and syncing the rcv window as needed.
Note that the issue is actually present from the very first MPTCP
implementation, but backports older than the blamed commit below will range
from impossible to useless.
Before:
nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
TcpExtBeyondWindow 14 0.0
After:
nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
TcpExtBeyondWindow 0 0.0
Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/options.c | 31 +++++++++++++++++++++++++++++++
net/mptcp/protocol.h | 1 +
2 files changed, 32 insertions(+)
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index cf531f2d815c..ad51dcf18984 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1042,6 +1042,31 @@ static void __mptcp_snd_una_update(struct mptcp_sock *msk, u64 new_snd_una)
 	WRITE_ONCE(msk->snd_una, new_snd_una);
 }
 
+static void rwin_update(struct mptcp_sock *msk, struct sock *ssk,
+			struct sk_buff *skb)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	struct tcp_sock *tp = tcp_sk(ssk);
+	u64 mptcp_rcv_wnd;
+
+	/* Avoid touching extra cachelines if TCP is going to accept this
+	 * skb without filling the TCP-level window even with a possibly
+	 * outdated mptcp-level rwin.
+	 */
+	if (!skb->len || skb->len < tcp_receive_window(tp))
+		return;
+
+	mptcp_rcv_wnd = atomic64_read(&msk->rcv_wnd_sent);
+	if (!after64(mptcp_rcv_wnd, subflow->rcv_wnd_sent))
+		return;
+
+	/* Some other subflow grew the mptcp-level rwin since rcv_wup,
+	 * resync.
+	 */
+	tp->rcv_wnd += mptcp_rcv_wnd - subflow->rcv_wnd_sent;
+	subflow->rcv_wnd_sent = mptcp_rcv_wnd;
+}
+
 static void ack_update_msk(struct mptcp_sock *msk,
 			   struct sock *ssk,
 			   struct mptcp_options_received *mp_opt)
@@ -1209,6 +1234,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (mp_opt.use_ack)
 		ack_update_msk(msk, sk, &mp_opt);
+	rwin_update(msk, sk, skb);
 
 	/* Zero-data-length packets are dropped by the caller and not
 	 * propagated to the MPTCP layer, so the skb extension does not
@@ -1295,6 +1321,10 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 
 	if (rcv_wnd_new != rcv_wnd_old) {
 raise_win:
+		/* The msk-level rcv wnd is after the tcp level one,
+		 * sync the latter.
+		 */
+		rcv_wnd_new = rcv_wnd_old;
 		win = rcv_wnd_old - ack_seq;
 		tp->rcv_wnd = min_t(u64, win, U32_MAX);
 		new_win = tp->rcv_wnd;
@@ -1318,6 +1348,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 
 update_wspace:
 	WRITE_ONCE(msk->old_wspace, tp->rcv_wnd);
+	subflow->rcv_wnd_sent = rcv_wnd_new;
 }
 
 __sum16 __mptcp_make_csum(u64 data_seq, u32 subflow_seq, u16 data_len, __wsum sum)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index d0881db16b12..f14eeb4fd884 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -513,6 +513,7 @@ struct mptcp_subflow_context {
 	u64 remote_key;
 	u64 idsn;
 	u64 map_seq;
+	u64 rcv_wnd_sent;
 	u32 snd_isn;
 	u32 token;
 	u32 rel_write_seq;
--
2.51.0
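
The edge-condition check described in the commit message can be
illustrated with a small userspace model. This is only a sketch, not
kernel code: struct model_subflow, model_receive_window(),
model_after64() and model_rwin_update() are hypothetical names that
mirror tp->rcv_wup, tp->rcv_wnd, tp->rcv_nxt, tcp_receive_window(),
after64() and rwin_update() from the patch, assuming plain fixed-width
integers and no locking or atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-subflow state used by the check. */
struct model_subflow {
	uint64_t rcv_wnd_sent;	/* MPTCP-level right edge last synced here */
	uint32_t rcv_wup;	/* models tp->rcv_wup */
	uint32_t rcv_wnd;	/* models tp->rcv_wnd */
	uint32_t rcv_nxt;	/* models tp->rcv_nxt */
};

/* Bytes still acceptable at TCP level, clamped at zero. */
static uint32_t model_receive_window(const struct model_subflow *sf)
{
	int32_t win = (int32_t)(sf->rcv_wup + sf->rcv_wnd - sf->rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}

/* Wrap-safe "a is after b" comparison on 64-bit sequence numbers. */
static bool model_after64(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0;
}

/* Returns true when the TCP-level window had to be resynced. */
static bool model_rwin_update(uint64_t msk_rcv_wnd_sent,
			      struct model_subflow *sf, uint32_t skb_len)
{
	assert(sf != NULL);

	/* Fast path: no payload, or it fits even in the stale window. */
	if (!skb_len || skb_len < model_receive_window(sf))
		return false;

	/* The MPTCP-level edge did not move past what this subflow knows. */
	if (!model_after64(msk_rcv_wnd_sent, sf->rcv_wnd_sent))
		return false;

	/* Another subflow grew the shared window: apply the delta. */
	sf->rcv_wnd += (uint32_t)(msk_rcv_wnd_sent - sf->rcv_wnd_sent);
	sf->rcv_wnd_sent = msk_rcv_wnd_sent;
	return true;
}
```

For example, with rcv_wnd_sent = 1000, rcv_wnd = 100 and 50 bytes
already received, an 80-byte segment would exceed the stale 50-byte
TCP-level window; if another subflow has advanced the MPTCP-level edge
to 1400, the model grows rcv_wnd by 400 before TCP checks the segment.
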
Hi Paolo,
On 07/11/2025 09:32, Paolo Abeni wrote:
> The rcv window is shared among all the subflows. Currently, MPTCP syncs
> the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time.
>
> The above means that incoming data may sporadically observe an outdated
> TCP-level rcv window and be wrongly dropped by TCP.
>
> Address the issue by checking for the edge condition before queuing the
> data at TCP level, and syncing the rcv window as needed.
>
> Note that the issue is actually present from the very first MPTCP
> implementation, but backports older than the blamed commit below will range
> from impossible to useless.
Thank you for this patch!
It looks good to me:
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> Before:
> nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
> TcpExtBeyondWindow 14 0.0
>
> After:
> nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
> TcpExtBeyondWindow 0 0.0
Should we eventually track this MIB counter in mptcp_connect.sh?
> Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
While patch 6/7 is still being discussed, to ease things, I suggest
applying only this patch to our tree now (fixes for -net): from what I
understood, it is the only material for 'net'.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
On 11/11/2025 19:43, Matthieu Baerts wrote:
> Hi Paolo,
>
> On 07/11/2025 09:32, Paolo Abeni wrote:
>> The rcv window is shared among all the subflows. Currently, MPTCP syncs
>> the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time.
>>
>> The above means that incoming data may sporadically observe an outdated
>> TCP-level rcv window and be wrongly dropped by TCP.
>>
>> Address the issue by checking for the edge condition before queuing the
>> data at TCP level, and syncing the rcv window as needed.
>>
>> Note that the issue is actually present from the very first MPTCP
>> implementation, but backports older than the blamed commit below will range
>> from impossible to useless.
>
> Thank you for this patch!
>
> It looks good to me:
>
> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>
>> Before:
>> nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
>> TcpExtBeyondWindow 14 0.0
>>
>> After:
>> nstat >/dev/null ;sleep 1; nstat -z TcpExtBeyondWindow
>> TcpExtBeyondWindow 0 0.0
>
> Should we eventually track this MIB counter in mptcp_connect.sh?
(This can be done in a separate patch later; we can also create a task
for it if that's easier, of course.)
>
>> Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
>
> While there are still discussions on patch 6/7, to ease things, I
> suggest already applying only this patch in our tree (fixes for -net),
> the only material for 'net' from what I understood.
Done:
New patches for t/upstream-net and t/upstream:
- 9250eb1bd82b: mptcp: avoid unneeded subflow-level drops
- Results: 7d372ea35141..25f5033b9887 (export-net)
- Results: 3b1f54eb2ace..3dad323fd8af (export)
Tests are now in progress:
- export-net:
https://github.com/multipath-tcp/mptcp_net-next/commit/ef271ef4961b23166a56d80c5136bb85b29826d8/checks
- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/473f3b6a6838866eed83d85449df20f654ac4c1a/checks
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.