Jakub reported an MPTCP deadlock at fallback time:
WARNING: possible recursive locking detected
6.18.0-rc7-virtme #1 Not tainted
--------------------------------------------
mptcp_connect/20858 is trying to acquire lock:
ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_try_fallback+0xd8/0x280
but task is already holding lock:
ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&msk->fallback_lock);
lock(&msk->fallback_lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by mptcp_connect/20858:
#0: ff1100001da18290 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x114/0x1bc0
#1: ff1100001db40fd0 (k-sk_lock-AF_INET#2){+.+.}-{0:0}, at: __mptcp_retrans+0x2cb/0xaa0
#2: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0
stack backtrace:
CPU: 0 UID: 0 PID: 20858 Comm: mptcp_connect Not tainted 6.18.0-rc7-virtme #1 PREEMPT(full)
Hardware name: Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xa0
print_deadlock_bug.cold+0xc0/0xcd
validate_chain+0x2ff/0x5f0
__lock_acquire+0x34c/0x740
lock_acquire.part.0+0xbc/0x260
_raw_spin_lock_bh+0x38/0x50
__mptcp_try_fallback+0xd8/0x280
mptcp_sendmsg_frag+0x16c2/0x3050
__mptcp_retrans+0x421/0xaa0
mptcp_release_cb+0x5aa/0xa70
release_sock+0xab/0x1d0
mptcp_sendmsg+0xd5b/0x1bc0
sock_write_iter+0x281/0x4d0
new_sync_write+0x3c5/0x6f0
vfs_write+0x65e/0xbb0
ksys_write+0x17e/0x200
do_syscall_64+0xbb/0xfd0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fa5627cbc5e
Code: 4d 89 d8 e8 14 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
RSP: 002b:00007fff1fe14700 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fa5627cbc5e
RDX: 0000000000001f9c RSI: 00007fff1fe16984 RDI: 0000000000000005
RBP: 00007fff1fe14710 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff1fe16920
R13: 0000000000002000 R14: 0000000000001f9c R15: 0000000000001f9c
The packet scheduler could attempt a reinjection after receiving an
MP_FAIL and before the infinite mapping has been transmitted, causing a
deadlock since MPTCP needs to perform the reinjection atomically with
respect to fallback.
Address the issue by explicitly avoiding the reinjection in the critical
scenario. Note that this is the only fallback critical section that
could potentially send packets and hit the double lock.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://netdev-ctrl.bots.linux.dev/logs/vmksft/mptcp-dbg/results/412720/1-mptcp-join-sh/stderr
Fixes: f8a1d9b18c5e ("mptcp: make fallback action and fallback decision atomic")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index cd5a19ab3ba1..2df36a125816 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2769,10 +2769,13 @@ static void __mptcp_retrans(struct sock *sk)
 
 		/*
 		 * make the whole retrans decision, xmit, disallow
-		 * fallback atomic
+		 * fallback atomic, note that we can't retrans even
+		 * when an infinite fallback is in progress, i.e. new
+		 * subflows are disallowed.
 		 */
 		spin_lock_bh(&msk->fallback_lock);
-		if (__mptcp_check_fallback(msk)) {
+		if (__mptcp_check_fallback(msk) ||
+		    !msk->allow_subflows) {
 			spin_unlock_bh(&msk->fallback_lock);
 			release_sock(ssk);
 			goto clear_scheduled;
--
2.52.0
Hi Paolo,
On 03/12/2025 19:55, Paolo Abeni wrote:
> Jakub reported an MPTCP deadlock at fallback time:
>
> [...]
>
> The packet scheduler could attempt a reinjection after receiving an
> MP_FAIL and before the infinite mapping has been transmitted, causing a
> deadlock since MPTCP needs to perform the reinjection atomically with
> respect to fallback.
>
> Address the issue by explicitly avoiding the reinjection in the critical
> scenario. Note that this is the only fallback critical section that
> could potentially send packets and hit the double lock.
Thank you for the fix!
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Out-of-curiosity: any idea why we only see it now while the fix tag is
from July? :)
Do you want to send it to netdev ASAP, or do you prefer if I do it?
Should I do it now, or can I send it with other fixes tomorrow?
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
On 12/4/25 6:44 PM, Matthieu Baerts wrote:
> On 03/12/2025 19:55, Paolo Abeni wrote:
>> Jakub reported an MPTCP deadlock at fallback time:
>>
>> [...]
>>
>> The packet scheduler could attempt a reinjection after receiving an
>> MP_FAIL and before the infinite mapping has been transmitted, causing a
>> deadlock since MPTCP needs to perform the reinjection atomically with
>> respect to fallback.
>>
>> Address the issue by explicitly avoiding the reinjection in the critical
>> scenario. Note that this is the only fallback critical section that
>> could potentially send packets and hit the double lock.
>
> Thank you for the fix!
>
> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>
> Out-of-curiosity: any idea why we only see it now while the fix tag is
> from July? :)
The deadlock is deterministic once the relevant pre-conditions are
reached, but those prerequisites are quite unlikely:
- the peer sends an MP_FAIL [1]
- the ssk/pm/msk tries to send an ack reply with an infinite mapping
- the allocation of such an skb fails [2]
- the scheduler kicks an mptcp-level retransmission before any other
  later transmit [3]
Each of [1], [2] and [3] is quite unlikely on its own, and we need all
of them with suitably strict timing.
(mostly wild guesses on my side)
/P
Hi Paolo,
On 05/12/2025 09:06, Paolo Abeni wrote:
> On 12/4/25 6:44 PM, Matthieu Baerts wrote:
>> On 03/12/2025 19:55, Paolo Abeni wrote:
>>> Jakub reported an MPTCP deadlock at fallback time:
>>>
>>> [...]
>>>
>>> The packet scheduler could attempt a reinjection after receiving an
>>> MP_FAIL and before the infinite mapping has been transmitted, causing a
>>> deadlock since MPTCP needs to perform the reinjection atomically with
>>> respect to fallback.
>>>
>>> Address the issue by explicitly avoiding the reinjection in the critical
>>> scenario. Note that this is the only fallback critical section that
>>> could potentially send packets and hit the double lock.
>>
>> Thank you for the fix!
>>
>> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>
>> Out-of-curiosity: any idea why we only see it now while the fix tag is
>> from July? :)
>
> The deadlock is deterministic once the relevant pre-conditions are
> reached, but those prerequisites are quite unlikely:
>
> - the peer sends an MP_FAIL [1]
> - the ssk/pm/msk tries to send an ack reply with an infinite mapping
> - the allocation of such an skb fails [2]
> - the scheduler kicks an mptcp-level retransmission before any other
>   later transmit [3]
>
> Each of [1], [2] and [3] is quite unlikely on its own, and we need all
> of them with suitably strict timing.
Thank you for your reply!
[1] is expected in this selftest, and [3] I can understand, but not [2].
Or is it an issue with the memory size allocated per VM on the new NIPA
LF machines?
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
On 12/5/25 2:47 PM, Matthieu Baerts wrote:
> On 05/12/2025 09:06, Paolo Abeni wrote:
>> On 12/4/25 6:44 PM, Matthieu Baerts wrote:
>>> On 03/12/2025 19:55, Paolo Abeni wrote:
>>>> Jakub reported an MPTCP deadlock at fallback time:
>>>>
>>>> [...]
>>>>
>>>> The packet scheduler could attempt a reinjection after receiving an
>>>> MP_FAIL and before the infinite mapping has been transmitted, causing a
>>>> deadlock since MPTCP needs to perform the reinjection atomically with
>>>> respect to fallback.
>>>>
>>>> Address the issue by explicitly avoiding the reinjection in the critical
>>>> scenario. Note that this is the only fallback critical section that
>>>> could potentially send packets and hit the double lock.
>>>
>>> Thank you for the fix!
>>>
>>> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>>
>>> Out-of-curiosity: any idea why we only see it now while the fix tag is
>>> from July? :)
>>
>> The deadlock is deterministic once the relevant pre-conditions are
>> reached, but those prerequisites are quite unlikely:
>>
>> - the peer sends an MP_FAIL [1]
>> - the ssk/pm/msk tries to send an ack reply with an infinite mapping
>> - the allocation of such an skb fails [2]
>> - the scheduler kicks an mptcp-level retransmission before any other
>>   later transmit [3]
>>
>> Each of [1], [2] and [3] is quite unlikely on its own, and we need all
>> of them with suitably strict timing.
>
> Thank you for your reply!
>
> [1] is expected in this selftest, [3] I can understand, but not [2]. Or
> an issue with the memory size allocated per VM in the new NIPA LF machines?
Allocations can always fail, and this one is GFP_ATOMIC, so it is even
more likely to fail (note that the allocation includes __GFP_NOWARN, so
the failure is silent).
Possibly the NIPA VMs are provisioned with a limited amount of memory (IDK).
/P
Hi Paolo,
On 04/12/2025 18:44, Matthieu Baerts wrote:
> Hi Paolo,
>
> On 03/12/2025 19:55, Paolo Abeni wrote:
>> Jakub reported an MPTCP deadlock at fallback time:
>>
>> [...]
>>
>> The packet scheduler could attempt a reinjection after receiving an
>> MP_FAIL and before the infinite mapping has been transmitted, causing a
>> deadlock since MPTCP needs to perform the reinjection atomically with
>> respect to fallback.
>>
>> Address the issue by explicitly avoiding the reinjection in the critical
>> scenario. Note that this is the only fallback critical section that
>> could potentially send packets and hit the double lock.
>
> Thank you for the fix!
>
> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>
> Out-of-curiosity: any idea why we only see it now while the fix tag is
> from July? :)
>
> Do you want to send it to netdev ASAP, or do you prefer if I do it?
> Should I do it now, or can I send it with other fixes tomorrow?
We still need these patches in our tree anyway, so here they are:
New patches for t/upstream-net and t/upstream:
- 372aa8a6919d: mptcp: avoid deadlock on fallback while reinjecting
- Results: 00ccc7a6f72d..d4a49e165085 (export-net)
- Results: f83b9a74e823..e6e1c3ebe99c (export)
Tests are now in progress:
- export-net:
https://github.com/multipath-tcp/mptcp_net-next/commit/5dd0db8d894823fbbaac17b23cc174be4dddd58b/checks
- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/db27293f26f389e0aa4807c47fc09a0cc1366702/checks
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Hi Paolo,
Thank you for your modifications, that's great!
Our CI did some validations and here is its report:
- KVM Validation: normal (except selftest_mptcp_join): Unstable: 1 failed test(s): selftest_simult_flows 🔴
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Unstable: 1 failed test(s): packetdrill_dss 🔴
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/19923942884
Initiator: Matthieu Baerts (NGI0)
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/aedb85d7baaf
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1030212
If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:
$ cd [kernel source code]
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-normal
For more details:
https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)
Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
© 2016 - 2025 Red Hat, Inc.