disconnect() can still race with subflow fallback leaving the msk in
inconsistent status. To avoid such race, disconnect() must be blocking,
error out if the argument prevent such wait: the socket will linger
in SS_DISCONNECT state forever.
Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 71b258aebcd9..0dc55a17f274 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3119,8 +3119,19 @@ static void mptcp_copy_inaddrs(struct sock *msk, const struct sock *ssk)
inet_sk(msk)->inet_rcv_saddr = inet_sk(ssk)->inet_rcv_saddr;
}
+/* can safely clear the fallback status only if no subflow is going to
+ * push or receive any data on the wire
+ */
+static bool mptcp_can_reset_fb_status(const struct mptcp_sock *msk)
+{
+ return list_empty(&msk->conn_list) ||
+ inet_sk_state_load(msk->first) == TCP_CLOSE;
+}
+
static int mptcp_disconnect(struct sock *sk, int flags)
{
+ long timeout = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+ DEFINE_WAIT_FUNC(wait, woken_wake_function);
struct mptcp_sock *msk = mptcp_sk(sk);
/* We are on the fastopen error path. We can't call straight into the
@@ -3142,7 +3153,34 @@ static int mptcp_disconnect(struct sock *sk, int flags)
* subflow
*/
mptcp_destroy_common(msk, MPTCP_CF_FASTCLOSE);
+
+ if (!mptcp_can_reset_fb_status(msk)) {
+ if (!timeout)
+ return -EAGAIN;
+
+ add_wait_queue(sk_sleep(sk), &wait);
+
+ /* __mptcp_close_ssk() will wake us */
+ do {
+ if (sk_wait_event(sk, &timeout,
+ mptcp_can_reset_fb_status(msk),
+ &wait))
+ break;
+ } while (!signal_pending(current) && timeout);
+
+ remove_wait_queue(sk_sleep(sk), &wait);
+
+ /* timeout/signal can prevent reaching the expected status */
+ if (!mptcp_can_reset_fb_status(msk))
+ return -EBUSY;
+ }
+
+ spin_lock_bh(&msk->fallback_lock);
+ msk->allow_subflows = true;
+ msk->allow_infinite_fallback = true;
WRITE_ONCE(msk->flags, 0);
+ spin_unlock_bh(&msk->fallback_lock);
+
msk->cb_flags = 0;
msk->recovery = false;
WRITE_ONCE(msk->can_ack, false);
--
2.49.0
Hi Paolo, On 05/07/2025 09:24, Paolo Abeni wrote: > disconnect() can still race with subflow fallback leaving the msk in > inconsistent status. To avoid such race, disconnect() must be blocking, > error out if the argument prevent such wait: the socket will linger > in SS_DISCONNECT state forever. Good idea! Just to be sure, this is not really a behaviour change compared to TCP, right? No risks of breaking some cases? (io_uring, etc.?) In which cases is this MSG_DONTWAIT flag set with disconnect()? Cheers, Matt -- Sponsored by the NGI0 Core fund.
On 7/8/25 5:43 PM, Matthieu Baerts wrote: > Hi Paolo, > > On 05/07/2025 09:24, Paolo Abeni wrote: >> disconnect() can still race with subflow fallback leaving the msk in >> inconsistent status. To avoid such race, disconnect() must be blocking, >> error out if the argument prevent such wait: the socket will linger >> in SS_DISCONNECT state forever. > > Good idea! > > Just to be sure, this is not really a behaviour change compared to TCP, > right? Actually it is, as the TCP disconnect never blocks. On the flip side the race actually exists and I haven't find another way to plug it. An alternative I thought about was to create a new first subflow on disconnect and let the old one 'detach' from the mptcp socket, but that was problematic as the 'detached' ssk would have a NULL subflow->conn and currently we assume that to be ~always != NULL > No risks of breaking some cases? (io_uring, etc.?) Do io_uring work with mptcp? The disconnect failure when blocking is not possible should really never be a problem from a functional PoV: disconnect() can fail per documentation and the application should handle that case. Blocking could cause some additional latency or slowdown in that code path. I think it should be not problematic, but I have no real idea and tend to be too optimistic WRT this kind of issues. I suggest not sending this to stable (with a note in the commit message). > In which cases is this MSG_DONTWAIT flag set with disconnect()? Typo in the patch, should be O_NONBLOCK. The socket layer, the vfs layer and iouring could set that. /P
Hi Paolo, Thank you for your reply! On 09/07/2025 11:27, Paolo Abeni wrote: > > > On 7/8/25 5:43 PM, Matthieu Baerts wrote: >> Hi Paolo, >> >> On 05/07/2025 09:24, Paolo Abeni wrote: >>> disconnect() can still race with subflow fallback leaving the msk in >>> inconsistent status. To avoid such race, disconnect() must be blocking, >>> error out if the argument prevent such wait: the socket will linger >>> in SS_DISCONNECT state forever. >> >> Good idea! >> >> Just to be sure, this is not really a behaviour change compared to TCP, >> right? > > Actually it is, as the TCP disconnect never blocks. On the flip side the > race actually exists and I haven't find another way to plug it. > > An alternative I thought about was to create a new first subflow on > disconnect and let the old one 'detach' from the mptcp socket, but that > was problematic as the 'detached' ssk would have a NULL subflow->conn > and currently we assume that to be ~always != NULL Indeed, it doesn't sound safer :) >> No risks of breaking some cases? (io_uring, etc.?) > > Do io_uring work with mptcp? I don't know, I never tried. But I don't see why it wouldn't. (Of course, not when used with zero copy, etc.) > The disconnect failure when blocking is not > possible should really never be a problem from a functional PoV: > disconnect() can fail per documentation and the application should > handle that case. But it could set "O_NONBLOCK" and do the close later, no? Or is this correctly handled? > Blocking could cause some additional latency or slowdown in that code > path. I think it should be not problematic, but I have no real idea and > tend to be too optimistic WRT this kind of issues. > > I suggest not sending this to stable (with a note in the commit message). Maybe safer indeed. >> In which cases is this MSG_DONTWAIT flag set with disconnect()? > > Typo in the patch, should be O_NONBLOCK. The socket layer, the vfs layer > and iouring could set that. So typically what we do in the packetdrill tests. But maybe we don't check the returned value? Cheers, Matt -- Sponsored by the NGI0 Core fund.
On 7/9/25 12:39 PM, Matthieu Baerts wrote: > On 09/07/2025 11:27, Paolo Abeni wrote: >> The disconnect failure when blocking is not >> possible should really never be a problem from a functional PoV: >> disconnect() can fail per documentation and the application should >> handle that case. > > But it could set "O_NONBLOCK" and do the close later, no? Or is this > correctly handled? Yes and no. Meaning the kernel does not offer any mechanism to complete gracefully a blocking disconnect returning EAGAIN. In such a case the socket will stay in SS_DISCONNECTING (socket) state until close time. >>> In which cases is this MSG_DONTWAIT flag set with disconnect()? >> >> Typo in the patch, should be O_NONBLOCK. The socket layer, the vfs layer >> and iouring could set that. > > So typically what we do in the packetdrill tests. But maybe we don't > check the returned value? The only relevant pktdrill test I could find is `disconnect_after_accept.pkt` and it checks the return value. Let me check how it will run after the MSG_DONTWAIT -> O_NONBLOCK fix /P
On 09/07/2025 12:49, Paolo Abeni wrote: > On 7/9/25 12:39 PM, Matthieu Baerts wrote: >> On 09/07/2025 11:27, Paolo Abeni wrote: >>> The disconnect failure when blocking is not >>> possible should really never be a problem from a functional PoV: >>> disconnect() can fail per documentation and the application should >>> handle that case. >> >> But it could set "O_NONBLOCK" and do the close later, no? Or is this >> correctly handled? > > Yes and no. Meaning the kernel does not offer any mechanism to complete > gracefully a blocking disconnect returning EAGAIN. In such a case the > socket will stay in SS_DISCONNECTING (socket) state until close time. > >>>> In which cases is this MSG_DONTWAIT flag set with disconnect()? >>> >>> Typo in the patch, should be O_NONBLOCK. The socket layer, the vfs layer >>> and iouring could set that. >> >> So typically what we do in the packetdrill tests. But maybe we don't >> check the returned value? > > The only relevant pktdrill test I could find is > `disconnect_after_accept.pkt` and it checks the return value. > > Let me check how it will run after the MSG_DONTWAIT -> O_NONBLOCK fix Thank you! Cheers, Matt -- Sponsored by the NGI0 Core fund.
On 7/9/25 12:55 PM, Matthieu Baerts wrote: > On 09/07/2025 12:49, Paolo Abeni wrote: >> On 7/9/25 12:39 PM, Matthieu Baerts wrote: >>> On 09/07/2025 11:27, Paolo Abeni wrote: >>>> The disconnect failure when blocking is not >>>> possible should really never be a problem from a functional PoV: >>>> disconnect() can fail per documentation and the application should >>>> handle that case. >>> >>> But it could set "O_NONBLOCK" and do the close later, no? Or is this >>> correctly handled? >> >> Yes and no. Meaning the kernel does not offer any mechanism to complete >> gracefully a blocking disconnect returning EAGAIN. In such a case the >> socket will stay in SS_DISCONNECTING (socket) state until close time. >> >>>>> In which cases is this MSG_DONTWAIT flag set with disconnect()? >>>> >>>> Typo in the patch, should be O_NONBLOCK. The socket layer, the vfs layer >>>> and iouring could set that. >>> >>> So typically what we do in the packetdrill tests. But maybe we don't >>> check the returned value? >> >> The only relevant pktdrill test I could find is >> `disconnect_after_accept.pkt` and it checks the return value. >> >> Let me check how it will run after the MSG_DONTWAIT -> O_NONBLOCK fix > > Thank you! The test is passing here. I think that is due to mptcp_disconnect() invoking mptcp_close_ssk() with the MPTCP_CF_FASTCLOSE flag, which in turn forces tcp_disconnect() invocation on msk->first, that forciby closes the first subflow (sk_state == TCP_CLOSE after that). So we can trim that blocking wait out of patch 3. /P
© 2016 - 2025 Red Hat, Inc.