[PATCH mptcp-net v3 1/5] mptcp: propagate shutdown to subflows when possible
Posted by Matthieu Baerts (NGI0) 1 week ago
When the MPTCP DATA FIN has been ACKed, there is no more MPTCP-related
metadata to exchange, and all subflows can be safely shut down.

Before this patch, the subflows were actually terminated at 'close()'
time. That's certainly fine most of the time, but not when userspace
calls 'shutdown()' on a connection without close()ing it. In that case,
the subflows stayed in the LAST_ACK state on one side -- and
consequently in FIN_WAIT2 on the other side -- until the 'close()' of
the MPTCP socket.

Now, when the DATA FIN has been ACKed, all subflows are shut down. A
consequence of this is that the TCP 'FIN' flag can be set earlier, but
the end result is the same. This affects the packetdrill tests checking
the end of MPTCP connections, but for a good reason.
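For illustration only, here is a minimal userspace sketch of the pattern
described above; the function name and the read loop are hypothetical and
not taken from any existing test:

#include <sys/socket.h>
#include <unistd.h>

/* Send our DATA FIN via shutdown(), but keep the socket open to keep
 * receiving from the peer: close() is intentionally not called here.
 */
static void finish_sending(int fd)
{
	char buf[4096];
	ssize_t ret;

	shutdown(fd, SHUT_WR);	/* no more data to send */

	do {
		ret = read(fd, buf, sizeof(buf));
	} while (ret > 0);
}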

Fixes: 3721b9b64676 ("mptcp: Track received DATA_FIN sequence number and add related helpers")
Fixes: 16a9a9da1723 ("mptcp: Add helper to process acks of DATA_FIN")
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/protocol.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 1c26acf6c4145896ecbb5c7f7004121c66a20649..9feb9f437c2bdfff392249e7ad00edd09ba67ac3 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -372,6 +372,20 @@ static void mptcp_close_wake_up(struct sock *sk)
 		sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
 }
 
+static void mptcp_shutdown_subflows(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+
+	mptcp_for_each_subflow(msk, subflow) {
+		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+		bool slow;
+
+		slow = lock_sock_fast(ssk);
+		tcp_shutdown(ssk, SEND_SHUTDOWN);
+		unlock_sock_fast(ssk, slow);
+	}
+}
+
 /* called under the msk socket lock */
 static bool mptcp_pending_data_fin_ack(struct sock *sk)
 {
@@ -396,6 +410,7 @@ static void mptcp_check_data_fin_ack(struct sock *sk)
 			break;
 		case TCP_CLOSING:
 		case TCP_LAST_ACK:
+			mptcp_shutdown_subflows(msk);
 			mptcp_set_state(sk, TCP_CLOSE);
 			break;
 		}
@@ -564,6 +579,7 @@ static bool mptcp_check_data_fin(struct sock *sk)
 			mptcp_set_state(sk, TCP_CLOSING);
 			break;
 		case TCP_FIN_WAIT2:
+			mptcp_shutdown_subflows(msk);
 			mptcp_set_state(sk, TCP_CLOSE);
 			break;
 		default:

-- 
2.51.0
Re: [PATCH mptcp-net v3 1/5] mptcp: propagate shutdown to subflows when possible
Posted by Matthieu Baerts 5 days, 23 hours ago
Hello,

On 10/09/2025 10:55, Matthieu Baerts (NGI0) wrote:
> When the MPTCP DATA FIN has been ACKed, there is no more MPTCP-related
> metadata to exchange, and all subflows can be safely shut down.
> 
> Before this patch, the subflows were actually terminated at 'close()'
> time. That's certainly fine most of the time, but not when userspace
> calls 'shutdown()' on a connection without close()ing it. In that case,
> the subflows stayed in the LAST_ACK state on one side -- and
> consequently in FIN_WAIT2 on the other side -- until the 'close()' of
> the MPTCP socket.
> 
> Now, when the DATA FIN has been ACKed, all subflows are shut down. A
> consequence of this is that the TCP 'FIN' flag can be set earlier, but
> the end result is the same. This affects the packetdrill tests checking
> the end of MPTCP connections, but for a good reason.

FYI, I don't know if it is linked to this patch, but the CI recently
complained about the "dss_fin_retrans_*" tests, in debug mode only:

> # dss_fin_retrans_close_wait.pkt:27: error handling packet: timing error: expected outbound packet in relative time range +0.200000~+0.300000 sec but happened at -0.017265 sec
> # script packet:  0.791882 . 1:1(0) ack 2 <dss dack4 3007449510 dsn8 8067261375896427910 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
> # actual packet:  0.574617 . 1:1(0) ack 2 win 1050 <dss dack4 3007449510 dsn8 8067261375896427910 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>

> # dss_fin_retrans_established.pkt:21: error handling packet: timing error: expected outbound packet in relative time range +0.200000~+0.300000 sec but happened at -0.021432 sec
> # script packet:  0.791877 . 1:1(0) ack 1 <dss dack4 3007449509 dsn8 8629039693965256448 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
> # actual packet:  0.570445 . 1:1(0) ack 1 win 1050 <dss dack4 3007449509 dsn8 8629039693965256448 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>

> # dss_fin_retrans_close_wait.pkt:27: error handling packet: timing error: expected outbound packet in relative time range +0.200000~+0.300000 sec but happened at -0.095031 sec
> # script packet:  0.745018 . 1:1(0) ack 2 <dss dack4 3007449510 dsn8 18052974533214962003 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
> # actual packet:  0.449987 . 1:1(0) ack 2 win 1050 <dss dack4 3007449510 dsn8 18052974533214962003 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>

I'm not able to reproduce this issue on my side to get more details
about the previous delays. Please also note that, a few weeks ago, I
reduced the tolerance by a factor of 4.

Because it is happening only in "debug" mode, I just re-increased the
tolerance a bit on the CI side. But if you have an idea of what could
cause the kernel to emit the retransmission faster than expected, feel
free to share :)

https://github.com/multipath-tcp/mptcp_net-next/actions/runs/17638263507
https://github.com/multipath-tcp/mptcp_net-next/actions/runs/17638077933

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.
Re: [PATCH mptcp-net v3 1/5] mptcp: propagate shutdown to subflows when possible
Posted by Christoph Paasch 5 days, 21 hours ago
On Thu, Sep 11, 2025 at 9:32 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hello,
>
> On 10/09/2025 10:55, Matthieu Baerts (NGI0) wrote:
> > When the MPTCP DATA FIN has been ACKed, there is no more MPTCP-related
> > metadata to exchange, and all subflows can be safely shut down.
> >
> > Before this patch, the subflows were actually terminated at 'close()'
> > time. That's certainly fine most of the time, but not when userspace
> > calls 'shutdown()' on a connection without close()ing it. In that case,
> > the subflows stayed in the LAST_ACK state on one side -- and
> > consequently in FIN_WAIT2 on the other side -- until the 'close()' of
> > the MPTCP socket.
> >
> > Now, when the DATA FIN has been ACKed, all subflows are shut down. A
> > consequence of this is that the TCP 'FIN' flag can be set earlier, but
> > the end result is the same. This affects the packetdrill tests checking
> > the end of MPTCP connections, but for a good reason.
>
> FYI, I don't know if it is linked to this patch, but the CI recently
> complained about the "dss_fin_retrans_*" tests, in debug mode only:
>
> > # dss_fin_retrans_close_wait.pkt:27: error handling packet: timing error: expected outbound packet in relative time range +0.200000~+0.300000 sec but happened at -0.017265 sec

What is more suspicious is that the time is negative. Do you capture
pcaps when running packetdrill?
That would allow us to see the "real" timestamps, independent of
packetdrill's scheduling issues.


Christoph

> > # script packet:  0.791882 . 1:1(0) ack 2 <dss dack4 3007449510 dsn8 8067261375896427910 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
> > # actual packet:  0.574617 . 1:1(0) ack 2 win 1050 <dss dack4 3007449510 dsn8 8067261375896427910 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
>
> > # dss_fin_retrans_established.pkt:21: error handling packet: timing error: expected outbound packet in relative time range +0.200000~+0.300000 sec but happened at -0.021432 sec
> > # script packet:  0.791877 . 1:1(0) ack 1 <dss dack4 3007449509 dsn8 8629039693965256448 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
> > # actual packet:  0.570445 . 1:1(0) ack 1 win 1050 <dss dack4 3007449509 dsn8 8629039693965256448 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
>
> > # dss_fin_retrans_close_wait.pkt:27: error handling packet: timing error: expected outbound packet in relative time range +0.200000~+0.300000 sec but happened at -0.095031 sec
> > # script packet:  0.745018 . 1:1(0) ack 2 <dss dack4 3007449510 dsn8 18052974533214962003 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
> > # actual packet:  0.449987 . 1:1(0) ack 2 win 1050 <dss dack4 3007449510 dsn8 18052974533214962003 ssn 0 dll 1 no_checksum flags: MmAF,nop,nop>
>
> I'm not able to reproduce this issue on my side to get more details
> about the previous delays. Please also note that, a few weeks ago, I
> reduced the tolerance by a factor of 4.
>
> Because it is happening only in "debug" mode, I just re-increased the
> tolerance a bit on the CI side. But if you have an idea of what could
> cause the kernel to emit the retransmission faster than expected, feel
> free to share :)
>
> https://github.com/multipath-tcp/mptcp_net-next/actions/runs/17638263507
> https://github.com/multipath-tcp/mptcp_net-next/actions/runs/17638077933
>
> Cheers,
> Matt
> --
> Sponsored by the NGI0 Core fund.
>
>
Re: [PATCH mptcp-net v3 1/5] mptcp: propagate shutdown to subflows when possible
Posted by Matthieu Baerts 5 days, 8 hours ago
Hi Christoph,

On 11/09/2025 21:08, Christoph Paasch wrote:
> On Thu, Sep 11, 2025 at 9:32 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hello,
>>
>> On 10/09/2025 10:55, Matthieu Baerts (NGI0) wrote:
>>> When the MPTCP DATA FIN has been ACKed, there is no more MPTCP-related
>>> metadata to exchange, and all subflows can be safely shut down.
>>>
>>> Before this patch, the subflows were actually terminated at 'close()'
>>> time. That's certainly fine most of the time, but not when userspace
>>> calls 'shutdown()' on a connection without close()ing it. In that case,
>>> the subflows stayed in the LAST_ACK state on one side -- and
>>> consequently in FIN_WAIT2 on the other side -- until the 'close()' of
>>> the MPTCP socket.
>>>
>>> Now, when the DATA FIN has been ACKed, all subflows are shut down. A
>>> consequence of this is that the TCP 'FIN' flag can be set earlier, but
>>> the end result is the same. This affects the packetdrill tests checking
>>> the end of MPTCP connections, but for a good reason.
>>
>> FYI, I don't know if it is linked to this patch, but the CI recently
>> complained about the "dss_fin_retrans_*" tests, in debug mode only:
>>
>>> # dss_fin_retrans_close_wait.pkt:27: error handling packet: timing error: expected outbound packet in relative time range +0.200000~+0.300000 sec but happened at -0.017265 sec
> 
> What is more suspicious is that the time is negative. Do you capture
> pcaps when running packetdrill?
> That would allow us to see the "real" timestamps, independent of
> packetdrill's scheduling issues.

No, we don't. I also don't know what effect it would have on the public
CI in debug mode.

Locally, I tried to reproduce the issue with '-vvv', but I didn't manage
to hit the bug after ~4h, even with stress-ng in parallel. I just tried
without packetdrill's debug output, but with the packet capture enabled,
and no luck there either.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.