[PATCH net v3] mptcp: fix soft lockup in mptcp_recvmsg()

Li Xiasong posted 1 patch 6 days, 7 hours ago
There is a newer version of this series
net/mptcp/protocol.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
[PATCH net v3] mptcp: fix soft lockup in mptcp_recvmsg()
Posted by Li Xiasong 6 days, 7 hours ago
syzbot reported a soft lockup in mptcp_recvmsg() [0].

When receiving data with MSG_PEEK | MSG_WAITALL flags, the skb is not
removed from the sk_receive_queue. This causes sk_wait_data() to always
find available data and never perform actual waiting, leading to a soft
lockup.

Fix this by adding a 'last' parameter to track the last peeked skb.
This allows sk_wait_data() to make informed waiting decisions and prevent
infinite loops when MSG_PEEK is used.

[0]:
watchdog: BUG: soft lockup - CPU#2 stuck for 156s! [server:1963]
Modules linked in:
CPU: 2 UID: 0 PID: 1963 Comm: server Not tainted 6.19.0-rc8 #61 PREEMPT(none)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:sk_wait_data+0x15/0x190
Code: 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 56 41 55 41 54 49 89 f4 55 48 89 d5 53 48 89 fb <48> 83 ec 30 65 48 8b 05 17 a4 6b 01 48 89 44 24 28 31 c0 65 48 8b
RSP: 0018:ffffc90000603ca0 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff888102bf0800 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffc90000603d18 RDI: ffff888102bf0800
RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000101
R10: 0000000000000000 R11: 0000000000000075 R12: ffffc90000603d18
R13: ffff888102bf0800 R14: ffff888102bf0800 R15: 0000000000000000
FS:  00007f6e38b8c4c0(0000) GS:ffff8881b877e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055aa7bff1680 CR3: 0000000105cbe000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 mptcp_recvmsg+0x547/0x8c0 net/mptcp/protocol.c:2329
 inet_recvmsg+0x11f/0x130 net/ipv4/af_inet.c:891
 sock_recvmsg+0x94/0xc0 net/socket.c:1100
 __sys_recvfrom+0xb2/0x130 net/socket.c:2256
 __x64_sys_recvfrom+0x1f/0x30 net/socket.c:2267
 do_syscall_64+0x59/0x2d0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x76/0x7e arch/x86/entry/entry_64.S:131
RIP: 0033:0x7f6e386a4a1d
Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8d 05 f1 de 2c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b f3 c3 66 0f 1f 84 00 00 00 00 00 41 56 41
RSP: 002b:00007ffc3c4bb078 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 000000000000861e RCX: 00007f6e386a4a1d
RDX: 00000000000003ff RSI: 00007ffc3c4bb150 RDI: 0000000000000004
RBP: 00007ffc3c4bb570 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000103 R11: 0000000000000246 R12: 00005605dbc00be0
R13: 00007ffc3c4bb650 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Fixes: 8e04ce45a8db ("mptcp: fix MSG_PEEK stream corruption")
Signed-off-by: Li Xiasong <lixiasong1@huawei.com>
---

v2->v3:
Thanks to Matthieu Baerts for the review.
- Initialize last to NULL.
- Only update 'last' when MSG_PEEK is set, avoiding dangling pointer.

v1->v2:
Thanks to Matthieu Baerts for the suggestion and review.
- Move 'last' variable definition into the while loop in mptcp_recvmsg() to
  narrow its scope.
- In __mptcp_recvmsg_mskq(), assign 'last' right after 'copied' is updated,
  regardless of MSG_PEEK flag.
https://lore.kernel.org/netdev/20260324085131.4187473-1-lixiasong1@huawei.com/

v1:
https://lore.kernel.org/netdev/20260302052651.1466983-1-lixiasong1@huawei.com/

---
 net/mptcp/protocol.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index cf1852b99963..60f6e6a189b7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2006,7 +2006,7 @@ static void mptcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
 				size_t len, int flags, int copied_total,
 				struct scm_timestamping_internal *tss,
-				int *cmsg_flags)
+				int *cmsg_flags, struct sk_buff **last)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct sk_buff *skb, *tmp;
@@ -2058,6 +2058,8 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
 			}
 
 			mptcp_eat_recv_skb(sk, skb);
+		} else {
+			*last = skb;
 		}
 
 		if (copied >= len)
@@ -2288,10 +2290,12 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		cmsg_flags = MPTCP_CMSG_INQ;
 
 	while (copied < len) {
+		struct sk_buff *last = NULL;
 		int err, bytes_read;
 
 		bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags,
-						  copied, &tss, &cmsg_flags);
+						  copied, &tss, &cmsg_flags,
+						  &last);
 		if (unlikely(bytes_read < 0)) {
 			if (!copied)
 				copied = bytes_read;
@@ -2343,7 +2347,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 		pr_debug("block timeout %ld\n", timeo);
 		mptcp_cleanup_rbuf(msk, copied);
-		err = sk_wait_data(sk, &timeo, NULL);
+		err = sk_wait_data(sk, &timeo, last);
 		if (err < 0) {
 			err = copied ? : err;
 			goto out_err;
-- 
2.34.1
Re: [PATCH net v3] mptcp: fix soft lockup in mptcp_recvmsg()
Posted by Matthieu Baerts 6 days, 3 hours ago
Hi Li,

On 27/03/2026 08:55, Li Xiasong wrote:
> syzbot reported a soft lockup in mptcp_recvmsg() [0].
> 
> When receiving data with MSG_PEEK | MSG_WAITALL flags, the skb is not
> removed from the sk_receive_queue. This causes sk_wait_data() to always
> find available data and never perform actual waiting, leading to a soft
> lockup.
> 
> Fix this by adding a 'last' parameter to track the last peeked skb.
> This allows sk_wait_data() to make informed waiting decisions and prevent
> infinite loops when MSG_PEEK is used.

Thank you for the new version!

(...)

> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index cf1852b99963..60f6e6a189b7 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c

(...)

> @@ -2343,7 +2347,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  
>  		pr_debug("block timeout %ld\n", timeo);
>  		mptcp_cleanup_rbuf(msk, copied);
> -		err = sk_wait_data(sk, &timeo, NULL);
> +		err = sk_wait_data(sk, &timeo, last);

sashiko is saying [1] this:

> Will this cause a soft lockup if all socket buffers in the receive queue
> have already been peeked?
>
> If the queue only contains already-peeked buffers, the loop in
> __mptcp_recvmsg_mskq() skips them using a continue statement:
>
>     if (flags & MSG_PEEK) {
>         /* skip already peeked skbs */
>         if (total_data_len + data_len <= copied_total) {
>             total_data_len += data_len;
>             continue;
>         }
>
> This means last is never assigned and remains NULL.
>
> When last is NULL, sk_wait_data() checks if the receive queue tail is
> not equal to NULL. Since the queue still contains the unconsumed buffers,
> this evaluates to true and sk_wait_data() returns immediately without
> sleeping.
>
> Does this result in an infinite loop here when MSG_PEEK and MSG_WAITALL
> are used together?

I *think* that's a false positive. When MSG_PEEK and MSG_WAITALL are
used together, sk_wait_data() will be call with "last" != NULL and will
be unblocked when a new data packet (skb->len > 0) is added to the
queue. In other words, when walking the queue in __mptcp_recvmsg_mskq,
it should never be full of already-peeked skb. Or did I miss something?

If yes, "*last = skb" could be added before the "continue".

If no, this patch can be applied in 'net' directly:

Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>


[1]
https://sashiko.dev/#/patchset/20260327075544.3622851-1-lixiasong1%40huawei.com

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.
Re: [PATCH net v3] mptcp: fix soft lockup in mptcp_recvmsg()
Posted by Li Xiasong 3 days, 4 hours ago
Hi Matt,

On 3/27/2026 8:13 PM, Matthieu Baerts wrote:
> Hi Li,
> 
> On 27/03/2026 08:55, Li Xiasong wrote:
>> syzbot reported a soft lockup in mptcp_recvmsg() [0].
>>
>> When receiving data with MSG_PEEK | MSG_WAITALL flags, the skb is not
>> removed from the sk_receive_queue. This causes sk_wait_data() to always
>> find available data and never perform actual waiting, leading to a soft
>> lockup.
>>
>> Fix this by adding a 'last' parameter to track the last peeked skb.
>> This allows sk_wait_data() to make informed waiting decisions and prevent
>> infinite loops when MSG_PEEK is used.
> 
> Thank you for the new version!
> 
> (...)
> 
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index cf1852b99963..60f6e6a189b7 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
> 
> (...)
> 
>> @@ -2343,7 +2347,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>>  
>>  		pr_debug("block timeout %ld\n", timeo);
>>  		mptcp_cleanup_rbuf(msk, copied);
>> -		err = sk_wait_data(sk, &timeo, NULL);
>> +		err = sk_wait_data(sk, &timeo, last);
> 
> sashiko is saying [1] this:
> 
>> Will this cause a soft lockup if all socket buffers in the receive queue
>> have already been peeked?
>>
>> If the queue only contains already-peeked buffers, the loop in
>> __mptcp_recvmsg_mskq() skips them using a continue statement:
>>
>>     if (flags & MSG_PEEK) {
>>         /* skip already peeked skbs */
>>         if (total_data_len + data_len <= copied_total) {
>>             total_data_len += data_len;
>>             continue;
>>         }
>>
>> This means last is never assigned and remains NULL.
>>
>> When last is NULL, sk_wait_data() checks if the receive queue tail is
>> not equal to NULL. Since the queue still contains the unconsumed buffers,
>> this evaluates to true and sk_wait_data() returns immediately without
>> sleeping.
>>
>> Does this result in an infinite loop here when MSG_PEEK and MSG_WAITALL
>> are used together?
> 
> I *think* that's a false positive. When MSG_PEEK and MSG_WAITALL are
> used together, sk_wait_data() will be call with "last" != NULL and will
> be unblocked when a new data packet (skb->len > 0) is added to the
> queue. In other words, when walking the queue in __mptcp_recvmsg_mskq,
> it should never be full of already-peeked skb. Or did I miss something?
> 
> If yes, "*last = skb" could be added before the "continue".
> 
> If no, this patch can be applied in 'net' directly:
> 
> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> 

Through code analysis and testing, there is indeed the special scenario
sashiko described:

when receiving data with MSG_PEEK | MSG_WAITALL while waiting for data
and calling shutdown(sock_fd, SHUT_WR) in another thread simultaneously,
mptcp_close_wake_up will wake up sk_wait_data, but sk->sk_state remains
FIN_WAIT2, leading to a busy loop with CPU at 100%, which can further
lead to soft lockup.

Adding '*last = skb' before 'continue' can properly solve it. I'll send
v4 with this fix later.

Thanks,
Li Xiasong