mptcp: do not queue data on closed subflows

[PATCH mptcp-net] mptcp: do not queue data on closed subflows

Posted by Paolo Abeni 1 year, 9 months ago

Dipanjan reported a syzbot splat at close time:

WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153
inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
Modules linked in: uio_ivshmem(OE) uio(E)
CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G           OE
5.19.0-rc6-g2eae0556bb9d #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Workqueue: events mptcp_worker
RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91
f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b
e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d
RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00
RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002
RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000
R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400
R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258
FS:  0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0
Call Trace:
 <TASK>
 __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067
 sk_destruct+0xbd/0xe0 net/core/sock.c:2112
 __sk_free+0xef/0x3d0 net/core/sock.c:2123
 sk_free+0x78/0xa0 net/core/sock.c:2134
 sock_put include/net/sock.h:1927 [inline]
 __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351
 __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828
 mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586
 process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289
 worker_thread+0x623/0x1070 kernel/workqueue.c:2436
 kthread+0x2e9/0x3a0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
 </TASK>

The root cause of the problem is that an mptcp-level (re)transmit can
race with mptcp_close() and the packet scheduler checks the subflow
state before acquiring the socket lock: we can try to (re)transmit on
an already closed ssk.

Fix the issue checking again the subflow socket status under the
subflow socket lock protection. Additionally add the missing check
for the fallback-to-tcp case.

Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Fixes: d5f49190def6 ("mptcp: allow picking different xmit subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c |  8 +++++++-
 net/mptcp/protocol.h | 11 +++++++----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index cc21fafd9726..7c7670942e76 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1275,6 +1275,9 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 			 info->limit > dfrag->data_len))
 		return 0;
 
+	if (unlikely(!__tcp_can_send(ssk)))
+		return -EAGAIN;
+
 	/* compute send limit */
 	info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
 	copy = info->size_goal;
@@ -1448,7 +1451,8 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 	if (__mptcp_check_fallback(msk)) {
 		if (!msk->first)
 			return NULL;
-		return sk_stream_memory_free(msk->first) ? msk->first : NULL;
+		return __tcp_can_send(msk->first) &&
+		       sk_stream_memory_free(msk->first) ? msk->first : NULL;
 	}
 
 	/* re-use last subflow, if the burst allow that */
@@ -1599,6 +1603,8 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 
 			ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
 			if (ret <= 0) {
+				if (ret == -EAGAIN)
+					continue;
 				mptcp_push_release(ssk, &info);
 				goto out;
 			}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 480c5320b86e..93d783039fce 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -625,16 +625,19 @@ void mptcp_info2sockaddr(const struct mptcp_addr_info *info,
 			 struct sockaddr_storage *addr,
 			 unsigned short family);
 
-static inline bool __mptcp_subflow_active(struct mptcp_subflow_context *subflow)
+static inline bool __tcp_can_send(const struct sock *ssk)
 {
-	struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+	/* only send if our side has not closed yet */
+	return ((1 << inet_sk_state_load(ssk)) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT));
+}
 
+static inline bool __mptcp_subflow_active(struct mptcp_subflow_context *subflow)
+{
 	/* can't send if JOIN hasn't completed yet (i.e. is usable for mptcp) */
 	if (subflow->request_join && !subflow->fully_established)
 		return false;
 
-	/* only send if our side has not closed yet */
-	return ((1 << ssk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT));
+	return __tcp_can_send(mptcp_subflow_tcp_sock(subflow));
 }
 
 void mptcp_subflow_set_active(struct mptcp_subflow_context *subflow);
-- 
2.35.3

Re: [PATCH mptcp-net] mptcp: do not queue data on closed subflows

Posted by Matthieu Baerts 1 year, 9 months ago

Hi Paolo, Mat,

On 25/07/2022 19:10, Paolo Abeni wrote:
> Dipanjan reported a syzbot splat at close time:

(...)

> The root cause of the problem is that an mptcp-level (re)transmit can
> race with mptcp_close() and the packet scheduler checks the subflow
> state before acquiring the socket lock: we can try to (re)transmit on
> an already closed ssk.
> 
> Fix the issue checking again the subflow socket status under the
> subflow socket lock protection. Additionally add the missing check
> for the fallback-to-tcp case.
> 
> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
> Fixes: d5f49190def6 ("mptcp: allow picking different xmit subflows")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Thank you for the fix and the review!

Now in our tree (fixes for -net):

New patches for t/upstream-net:
- 6f471fc45a7c: mptcp: do not queue data on closed subflows
- Results: 76dabdd1e02b..42b7bf36daba (export-net)

New patches for t/upstream:
- 6f471fc45a7c: mptcp: do not queue data on closed subflows
- f79659168250: conflict in t/mptcp-add-get_subflow-wrappers
- Results: f6822255c960..b6a8d1e56468 (export)


Tests are now in progress:

https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export/20220726T170140
https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export-net/20220726T170140

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

Re: [PATCH mptcp-net] mptcp: do not queue data on closed subflows

Posted by Mat Martineau 1 year, 9 months ago

On Mon, 25 Jul 2022, Paolo Abeni wrote:

> Dipanjan reported a syzbot splat at close time:
>
> WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153
> inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
> Modules linked in: uio_ivshmem(OE) uio(E)
> CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G           OE
> 5.19.0-rc6-g2eae0556bb9d #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> Workqueue: events mptcp_worker
> RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
> Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91
> f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b
> e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d
> RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00
> RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002
> RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000
> R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400
> R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258
> FS:  0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0
> Call Trace:
> <TASK>
> __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067
> sk_destruct+0xbd/0xe0 net/core/sock.c:2112
> __sk_free+0xef/0x3d0 net/core/sock.c:2123
> sk_free+0x78/0xa0 net/core/sock.c:2134
> sock_put include/net/sock.h:1927 [inline]
> __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351
> __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828
> mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586
> process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289
> worker_thread+0x623/0x1070 kernel/workqueue.c:2436
> kthread+0x2e9/0x3a0 kernel/kthread.c:376
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> </TASK>
>
> The root cause of the problem is that an mptcp-level (re)transmit can
> race with mptcp_close() and the packet scheduler checks the subflow
> state before acquiring the socket lock: we can try to (re)transmit on
> an already closed ssk.
>
> Fix the issue checking again the subflow socket status under the
> subflow socket lock protection. Additionally add the missing check
> for the fallback-to-tcp case.
>
> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
> Fixes: d5f49190def6 ("mptcp: allow picking different xmit subflows")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Thanks Paolo.

Patchew didn't tag this for some reason, so I manually pushed a tag to 
trigger the CI builds:

https://github.com/multipath-tcp/mptcp_net-next/actions/runs/2736071165
https://cirrus-ci.com/build/5132666281394176


The tests I ran worked ok, but the c reproducer didn't trigger the crash 
in my syzkaller vm with v5.19-rc6.

One question below.

> ---
> net/mptcp/protocol.c |  8 +++++++-
> net/mptcp/protocol.h | 11 +++++++----
> 2 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index cc21fafd9726..7c7670942e76 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1275,6 +1275,9 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> 			 info->limit > dfrag->data_len))
> 		return 0;
>
> +	if (unlikely(!__tcp_can_send(ssk)))
> +		return -EAGAIN;
> +
> 	/* compute send limit */
> 	info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
> 	copy = info->size_goal;
> @@ -1448,7 +1451,8 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> 	if (__mptcp_check_fallback(msk)) {
> 		if (!msk->first)
> 			return NULL;
> -		return sk_stream_memory_free(msk->first) ? msk->first : NULL;
> +		return __tcp_can_send(msk->first) &&
> +		       sk_stream_memory_free(msk->first) ? msk->first : NULL;
> 	}
>
> 	/* re-use last subflow, if the burst allow that */
> @@ -1599,6 +1603,8 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
>
> 			ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
> 			if (ret <= 0) {
> +				if (ret == -EAGAIN)
> +					continue;

There are two other mptcp_sendmsg_frag() callers: 
__mptcp_subflow_push_pending() and __mptcp_retrans(). The code for those 
has not changed to check for -EAGAIN like the line of code above, and it 
looks like such changes are not needed. Does my reasoning for these two 
other cases match yours:

__mptcp_subflow_push_pending() has the subflow lock held when it calls 
mptcp_subflow_get_send(), so it shouldn't run in to the problem you 
described in the commit message.

__mptcp_retrans() acquires the subflow lock after the call to 
mptcp_subflow_get_retrans(). But there's no check for -EAGAIN like the two 
lines added above so __mptcp_retrans() will give up. This looks valid to 
me since the retransmit is either not needed (if the connection is 
closing) or will get retried again when triggered by the retransmit timer.


The CI builds haven't finished yet, but so far it looks like the tests are 
passing.

Thanks,

- Mat

> 				mptcp_push_release(ssk, &info);
> 				goto out;
> 			}
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 480c5320b86e..93d783039fce 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -625,16 +625,19 @@ void mptcp_info2sockaddr(const struct mptcp_addr_info *info,
> 			 struct sockaddr_storage *addr,
> 			 unsigned short family);
>
> -static inline bool __mptcp_subflow_active(struct mptcp_subflow_context *subflow)
> +static inline bool __tcp_can_send(const struct sock *ssk)
> {
> -	struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
> +	/* only send if our side has not closed yet */
> +	return ((1 << inet_sk_state_load(ssk)) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT));
> +}
>
> +static inline bool __mptcp_subflow_active(struct mptcp_subflow_context *subflow)
> +{
> 	/* can't send if JOIN hasn't completed yet (i.e. is usable for mptcp) */
> 	if (subflow->request_join && !subflow->fully_established)
> 		return false;
>
> -	/* only send if our side has not closed yet */
> -	return ((1 << ssk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT));
> +	return __tcp_can_send(mptcp_subflow_tcp_sock(subflow));
> }
>
> void mptcp_subflow_set_active(struct mptcp_subflow_context *subflow);
> -- 
> 2.35.3
>
>
>

--
Mat Martineau
Intel

Re: [PATCH mptcp-net] mptcp: do not queue data on closed subflows

Posted by Paolo Abeni 1 year, 9 months ago

On Mon, 2022-07-25 at 18:14 -0700, Mat Martineau wrote:
> On Mon, 25 Jul 2022, Paolo Abeni wrote:
> 
> > Dipanjan reported a syzbot splat at close time:
> > 
> > WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153
> > inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
> > Modules linked in: uio_ivshmem(OE) uio(E)
> > CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G           OE
> > 5.19.0-rc6-g2eae0556bb9d #2
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > 1.13.0-1ubuntu1.1 04/01/2014
> > Workqueue: events mptcp_worker
> > RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
> > Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91
> > f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b
> > e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d
> > RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00
> > RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002
> > RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000
> > R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400
> > R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258
> > FS:  0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0
> > Call Trace:
> > <TASK>
> > __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067
> > sk_destruct+0xbd/0xe0 net/core/sock.c:2112
> > __sk_free+0xef/0x3d0 net/core/sock.c:2123
> > sk_free+0x78/0xa0 net/core/sock.c:2134
> > sock_put include/net/sock.h:1927 [inline]
> > __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351
> > __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828
> > mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586
> > process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289
> > worker_thread+0x623/0x1070 kernel/workqueue.c:2436
> > kthread+0x2e9/0x3a0 kernel/kthread.c:376
> > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> > </TASK>
> > 
> > The root cause of the problem is that an mptcp-level (re)transmit can
> > race with mptcp_close() and the packet scheduler checks the subflow
> > state before acquiring the socket lock: we can try to (re)transmit on
> > an already closed ssk.
> > 
> > Fix the issue checking again the subflow socket status under the
> > subflow socket lock protection. Additionally add the missing check
> > for the fallback-to-tcp case.
> > 
> > Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
> > Fixes: d5f49190def6 ("mptcp: allow picking different xmit subflows")
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> 
> Thanks Paolo.
> 
> Patchew didn't tag this for some reason, so I manually pushed a tag to 
> trigger the CI builds:
> 
> https://github.com/multipath-tcp/mptcp_net-next/actions/runs/2736071165
> https://cirrus-ci.com/build/5132666281394176
> 
> 
> The tests I ran worked ok, but the c reproducer didn't trigger the crash 
> in my syzkaller vm with v5.19-rc6.

I could reproduce the issue using the config attached to the report
(CONFIG_PREEMPT=y). I think the issue is possible even without that,
but with a much smaller window of opportunity.


> One question below.
> 
> > ---
> > net/mptcp/protocol.c |  8 +++++++-
> > net/mptcp/protocol.h | 11 +++++++----
> > 2 files changed, 14 insertions(+), 5 deletions(-)
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index cc21fafd9726..7c7670942e76 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -1275,6 +1275,9 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> > 			 info->limit > dfrag->data_len))
> > 		return 0;
> > 
> > +	if (unlikely(!__tcp_can_send(ssk)))
> > +		return -EAGAIN;
> > +
> > 	/* compute send limit */
> > 	info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
> > 	copy = info->size_goal;
> > @@ -1448,7 +1451,8 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > 	if (__mptcp_check_fallback(msk)) {
> > 		if (!msk->first)
> > 			return NULL;
> > -		return sk_stream_memory_free(msk->first) ? msk->first : NULL;
> > +		return __tcp_can_send(msk->first) &&
> > +		       sk_stream_memory_free(msk->first) ? msk->first : NULL;
> > 	}
> > 
> > 	/* re-use last subflow, if the burst allow that */
> > @@ -1599,6 +1603,8 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
> > 
> > 			ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
> > 			if (ret <= 0) {
> > +				if (ret == -EAGAIN)
> > +					continue;
> 
> There are two other mptcp_sendmsg_frag() callers: 
> __mptcp_subflow_push_pending() and __mptcp_retrans(). The code for those 
> has not changed to check for -EAGAIN like the line of code above, and it 
> looks like such changes are not needed. Does my reasoning for these two 
> other cases match yours:
> 
> __mptcp_subflow_push_pending() has the subflow lock held when it calls 
> mptcp_subflow_get_send(), so it shouldn't run in to the problem you 
> described in the commit message.


when mptcp_subflow_get_send is called, the lock held, if any refers to
the "previous" ssk. AFAICS the race could happen even there. The
"lifesaver" check is inside mptcp_sendmsg_frag(), so it protects both
caller. 

The main point is that when __mptcp_subflow_push_pending() we pick a
racing to close subflow, we still need to try to select another
subflow, or we could end-up not pushing any data even the mptcp-level
window allow for that, and the actual data transfer could be delayed (a
lot of time) for no reasons.

__mptcp_retrans() is less critical, because it's a slower path, we are
not 100% we have really to send any data (could be just high jitter),
and the retrans function will be called later no matter what.

Please let me know if the above answer your question,
thanks!

Paolo

Re: [PATCH mptcp-net] mptcp: do not queue data on closed subflows

Posted by Mat Martineau 1 year, 9 months ago

On Tue, 26 Jul 2022, Paolo Abeni wrote:

> On Mon, 2022-07-25 at 18:14 -0700, Mat Martineau wrote:
>> On Mon, 25 Jul 2022, Paolo Abeni wrote:
>>
>>> Dipanjan reported a syzbot splat at close time:
>>>
>>> WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153
>>> inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
>>> Modules linked in: uio_ivshmem(OE) uio(E)
>>> CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G           OE
>>> 5.19.0-rc6-g2eae0556bb9d #2
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> 1.13.0-1ubuntu1.1 04/01/2014
>>> Workqueue: events mptcp_worker
>>> RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
>>> Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91
>>> f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b
>>> e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d
>>> RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246
>>> RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00
>>> RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002
>>> RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000
>>> R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400
>>> R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258
>>> FS:  0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0
>>> Call Trace:
>>> <TASK>
>>> __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067
>>> sk_destruct+0xbd/0xe0 net/core/sock.c:2112
>>> __sk_free+0xef/0x3d0 net/core/sock.c:2123
>>> sk_free+0x78/0xa0 net/core/sock.c:2134
>>> sock_put include/net/sock.h:1927 [inline]
>>> __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351
>>> __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828
>>> mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586
>>> process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289
>>> worker_thread+0x623/0x1070 kernel/workqueue.c:2436
>>> kthread+0x2e9/0x3a0 kernel/kthread.c:376
>>> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
>>> </TASK>
>>>
>>> The root cause of the problem is that an mptcp-level (re)transmit can
>>> race with mptcp_close() and the packet scheduler checks the subflow
>>> state before acquiring the socket lock: we can try to (re)transmit on
>>> an already closed ssk.
>>>
>>> Fix the issue checking again the subflow socket status under the
>>> subflow socket lock protection. Additionally add the missing check
>>> for the fallback-to-tcp case.
>>>
>>> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
>>> Fixes: d5f49190def6 ("mptcp: allow picking different xmit subflows")
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>
>> Thanks Paolo.
>>
>> Patchew didn't tag this for some reason, so I manually pushed a tag to
>> trigger the CI builds:
>>
>> https://github.com/multipath-tcp/mptcp_net-next/actions/runs/2736071165
>> https://cirrus-ci.com/build/5132666281394176
>>
>>
>> The tests I ran worked ok, but the c reproducer didn't trigger the crash
>> in my syzkaller vm with v5.19-rc6.
>
> I could reproduce the issue using the config attached to the report
> (CONFIG_PREEMPT=y). I think the issue is possible even without that,
> but with a much smaller window of opportunity.
>
>
>> One question below.
>>
>>> ---
>>> net/mptcp/protocol.c |  8 +++++++-
>>> net/mptcp/protocol.h | 11 +++++++----
>>> 2 files changed, 14 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index cc21fafd9726..7c7670942e76 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>>> @@ -1275,6 +1275,9 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
>>> 			 info->limit > dfrag->data_len))
>>> 		return 0;
>>>
>>> +	if (unlikely(!__tcp_can_send(ssk)))
>>> +		return -EAGAIN;
>>> +
>>> 	/* compute send limit */
>>> 	info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
>>> 	copy = info->size_goal;
>>> @@ -1448,7 +1451,8 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
>>> 	if (__mptcp_check_fallback(msk)) {
>>> 		if (!msk->first)
>>> 			return NULL;
>>> -		return sk_stream_memory_free(msk->first) ? msk->first : NULL;
>>> +		return __tcp_can_send(msk->first) &&
>>> +		       sk_stream_memory_free(msk->first) ? msk->first : NULL;
>>> 	}
>>>
>>> 	/* re-use last subflow, if the burst allow that */
>>> @@ -1599,6 +1603,8 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
>>>
>>> 			ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
>>> 			if (ret <= 0) {
>>> +				if (ret == -EAGAIN)
>>> +					continue;
>>
>> There are two other mptcp_sendmsg_frag() callers:
>> __mptcp_subflow_push_pending() and __mptcp_retrans(). The code for those
>> has not changed to check for -EAGAIN like the line of code above, and it
>> looks like such changes are not needed. Does my reasoning for these two
>> other cases match yours:
>>
>> __mptcp_subflow_push_pending() has the subflow lock held when it calls
>> mptcp_subflow_get_send(), so it shouldn't run in to the problem you
>> described in the commit message.
>
>
> when mptcp_subflow_get_send is called, the lock held, if any refers to
> the "previous" ssk. AFAICS the race could happen even there. The
> "lifesaver" check is inside mptcp_sendmsg_frag(), so it protects both
> caller.
>
> The main point is that when __mptcp_subflow_push_pending() we pick a
> racing to close subflow, we still need to try to select another
> subflow, or we could end-up not pushing any data even the mptcp-level
> window allow for that, and the actual data transfer could be delayed (a
> lot of time) for no reasons.
>
> __mptcp_retrans() is less critical, because it's a slower path, we are
> not 100% we have really to send any data (could be just high jitter),
> and the retrans function will be called later no matter what.
>
> Please let me know if the above answer your question,
> thanks!
>

Yes, thanks for the explanation. Patch looks good to me:

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>

--
Mat Martineau
Intel