When TCP option space is insufficient (e.g., when sending ADD_ADDR with an
IPv6 address and port while tcp_timestamps is enabled), the original code
jumped to out_unlock without clearing the addr_signal flag. This caused
mptcp_pm_add_timer to keep rescheduling indefinitely, not sending ADD_ADDR,
preventing subsequent addresses in the endpoint list from being announced.
Handle this case by clearing the ADD_ADDR signal and skipping the matching
ADD_ADDR retransmission entry. The skip path cancels the matching timer
(with id check) and advances PM state progression, preserving forward
progress to subsequent PM work.
This avoids endless retries for an ADD_ADDR that cannot be emitted in the
current pure-ACK option-space constrained path.
Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout")
Signed-off-by: Li Xiasong <lixiasong1@huawei.com>
---
net/mptcp/pm.c | 45 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 38 insertions(+), 7 deletions(-)
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 57a456690406..a94177d9f265 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -401,6 +401,13 @@ mptcp_pm_del_add_timer(struct mptcp_sock *msk,
return entry;
}
+static void mptcp_pm_add_addr_skip(struct mptcp_sock *msk,
+ const struct mptcp_addr_info *addr)
+{
+ mptcp_pm_del_add_timer(msk, addr, true);
+ mptcp_pm_subflow_established(msk);
+}
+
bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk,
const struct mptcp_addr_info *addr)
{
@@ -860,6 +867,7 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
struct mptcp_addr_info *addr, bool *echo,
bool *drop_other_suboptions)
{
+ bool skip_add_addr = false;
int ret = false;
u8 add_addr;
u8 family;
@@ -881,24 +889,47 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
}
*echo = mptcp_pm_should_add_signal_echo(msk);
- port = !!(*echo ? msk->pm.remote.port : msk->pm.local.port);
-
- family = *echo ? msk->pm.remote.family : msk->pm.local.family;
- if (remaining < mptcp_add_addr_len(family, *echo, port))
- goto out_unlock;
-
if (*echo) {
*addr = msk->pm.remote;
add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_ECHO);
+ port = !!msk->pm.remote.port;
+ family = msk->pm.remote.family;
} else {
*addr = msk->pm.local;
add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_SIGNAL);
+ port = !!msk->pm.local.port;
+ family = msk->pm.local.family;
}
- WRITE_ONCE(msk->pm.addr_signal, add_addr);
+
+ if (remaining < mptcp_add_addr_len(family, *echo, port)) {
+ struct net *net = sock_net((struct sock *)msk);
+
+ if (!*drop_other_suboptions)
+ goto out_unlock;
+
+ if (*echo) {
+ MPTCP_INC_STATS(net, MPTCP_MIB_ECHOADDTXDROP);
+ } else {
+ skip_add_addr = true;
+ MPTCP_INC_STATS(net, MPTCP_MIB_ADDADDRTXDROP);
+ }
+ goto drop_signal_mark;
+ }
+
ret = true;
+drop_signal_mark:
+ WRITE_ONCE(msk->pm.addr_signal, add_addr);
+
out_unlock:
spin_unlock_bh(&msk->pm.lock);
+
+ /* On pure-ACK option-space exhaustion, stop retrying this ADD_ADDR:
+ * clear the signal bit, cancel the matching retransmission timer, and
+ * let the PM state machine progress.
+ */
+ if (skip_add_addr)
+ mptcp_pm_add_addr_skip(msk, addr);
return ret;
}
--
2.34.1
On Thu, 30 Apr 2026 19:20:25 +0800, Li Xiasong <lixiasong1@huawei.com> wrote:
> When TCP option space is insufficient (e.g., when sending ADD_ADDR with an
> IPv6 address and port while tcp_timestamps is enabled), the original code
> jumped to out_unlock without clearing the addr_signal flag. This caused
> mptcp_pm_add_timer to keep rescheduling indefinitely, not sending ADD_ADDR,
> preventing subsequent addresses in the endpoint list from being announced.
>
> Handle this case by clearing the ADD_ADDR signal and skipping the matching
> ADD_ADDR retransmission entry. The skip path cancels the matching timer
> (with id check) and advances PM state progression, preserving forward
> progress to subsequent PM work.
>
> This avoids endless retries for an ADD_ADDR that cannot be emitted in the
> current pure-ACK option-space constrained path.
Thank you for this new version!
>
>
> diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
> index 57a456690406..a94177d9f265 100644
> --- a/net/mptcp/pm.c
> +++ b/net/mptcp/pm.c
> @@ -401,6 +401,13 @@ mptcp_pm_del_add_timer(struct mptcp_sock *msk,
> return entry;
> }
>
> +static void mptcp_pm_add_addr_skip(struct mptcp_sock *msk,
> + const struct mptcp_addr_info *addr)
> +{
> + mptcp_pm_del_add_timer(msk, addr, true);
I *think* sashiko is right here, and I guess sk_stop_timer() should be
called instead of the _sync() version. Thinking about that, I wonder if
sk_stop_timer() shouldn't always be called when check_id == true, no? Do
you mind checking this please?
https://sashiko.dev/#/patchset/15237?part=1
> @@ -881,24 +889,47 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
> [ ... skip 26 lines ... ]
> + goto out_unlock;
> +
> + if (*echo) {
> + MPTCP_INC_STATS(net, MPTCP_MIB_ECHOADDTXDROP);
> + } else {
> + skip_add_addr = true;
Even if I don't think we would have space issues with the echo, should
this variable also be set also in case of ADD_ADDR echo?
(In this case, the variable could also simply be called 'skip', or
'drop', similar to the MIB counters here.)
> [ ... skip 13 lines ... ]
> + /* On pure-ACK option-space exhaustion, stop retrying this ADD_ADDR:
> + * clear the signal bit, cancel the matching retransmission timer, and
> + * let the PM state machine progress.
> + */
> + if (skip_add_addr)
> + mptcp_pm_add_addr_skip(msk, addr);
If you have to modify something else, maybe the new helper is not
needed, and the code can be added here directly? I know I suggested the
introduction of this helper, but now that I see the code, I think it
would be easier to understand the new comment above without it, and to
backport this patch.
--
Matthieu Baerts (NGI0) <matttbe@kernel.org>
Hi, Matt
On 4/30/26 9:20 PM, Matthieu Baerts wrote:
> On Thu, 30 Apr 2026 19:20:25 +0800, Li Xiasong <lixiasong1@huawei.com> wrote:
>> When TCP option space is insufficient (e.g., when sending ADD_ADDR with an
>> IPv6 address and port while tcp_timestamps is enabled), the original code
>> jumped to out_unlock without clearing the addr_signal flag. This caused
>> mptcp_pm_add_timer to keep rescheduling indefinitely, not sending ADD_ADDR,
>> preventing subsequent addresses in the endpoint list from being announced.
>>
>> Handle this case by clearing the ADD_ADDR signal and skipping the matching
>> ADD_ADDR retransmission entry. The skip path cancels the matching timer
>> (with id check) and advances PM state progression, preserving forward
>> progress to subsequent PM work.
>>
>> This avoids endless retries for an ADD_ADDR that cannot be emitted in the
>> current pure-ACK option-space constrained path.
>
> Thank you for this new version!
>
>>
>>
>> diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
>> index 57a456690406..a94177d9f265 100644
>> --- a/net/mptcp/pm.c
>> +++ b/net/mptcp/pm.c
>> @@ -401,6 +401,13 @@ mptcp_pm_del_add_timer(struct mptcp_sock *msk,
>> return entry;
>> }
>>
>> +static void mptcp_pm_add_addr_skip(struct mptcp_sock *msk,
>> + const struct mptcp_addr_info *addr)
>> +{
>> + mptcp_pm_del_add_timer(msk, addr, true);
>
> I *think* sashiko is right here, and I guess sk_stop_timer() should be
> called instead of the _sync() version. Thinking about that, I wonder if
> sk_stop_timer() shouldn't always be called when check_id == true, no? Do
> you mind checking this please?
>
> https://sashiko.dev/#/patchset/15237?part=1
>
Sure, I'll re-check that path and confirm whether `sk_stop_timer()` is
enough when `check_id == true`.
>> @@ -881,24 +889,47 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
>> [ ... skip 26 lines ... ]
>> + goto out_unlock;
>> +
>> + if (*echo) {
>> + MPTCP_INC_STATS(net, MPTCP_MIB_ECHOADDTXDROP);
>> + } else {
>> + skip_add_addr = true;
>
> Even if I don't think we would have space issues with the echo, should
> this variable also be set also in case of ADD_ADDR echo?
>
> (In this case, the variable could also simply be called 'skip', or
> 'drop', similar to the MIB counters here.)
>
My current understanding is that the skip marker is only required for
the non-echo ADD_ADDR path, because that is the one tied to the local
ADD_ADDR retransmission timer flow.
For the echo case here, it looks more like a reply path to a received
peer ADD_ADDR, so on option-space shortage, clearing the signal bit and
accounting the drop might be sufficient.
But I may be missing something in this area — if you think echo should
also go through the same skip handling, I can align with that in v3.
>> [ ... skip 13 lines ... ]
>> + /* On pure-ACK option-space exhaustion, stop retrying this ADD_ADDR:
>> + * clear the signal bit, cancel the matching retransmission timer, and
>> + * let the PM state machine progress.
>> + */
>> + if (skip_add_addr)
>> + mptcp_pm_add_addr_skip(msk, addr);
>
> If you have to modify something else, maybe the new helper is not
> needed, and the code can be added here directly? I know I suggested the
> introduction of this helper, but now that I see the code, I think it
> would be easier to understand the new comment above without it, and to
> backport this patch.
>
Agreed. I'll likely inline that logic to keep the change simpler and
easier to backport.
FYI: I’m replying from my personal address during the holiday period, as
I currently don’t have access to my corporate mailbox. I’ll keep
following up on this patchset and send v3 shortly.
Thanks,
Li Xiasong
Hi Li,
On 01/05/2026 17:03, Li Xiasong wrote:
> On 4/30/26 9:20 PM, Matthieu Baerts wrote:
>> On Thu, 30 Apr 2026 19:20:25 +0800, Li Xiasong <lixiasong1@huawei.com> wrote:
(...)
>>> @@ -881,24 +889,47 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
>>> [ ... skip 26 lines ... ]
>>> + goto out_unlock;
>>> +
>>> + if (*echo) {
>>> + MPTCP_INC_STATS(net, MPTCP_MIB_ECHOADDTXDROP);
>>> + } else {
>>> + skip_add_addr = true;
>>
>> Even if I don't think we would have space issues with the echo, should
>> this variable also be set also in case of ADD_ADDR echo?
>>
>> (In this case, the variable could also simply be called 'skip', or
>> 'drop', similar to the MIB counters here.)
>>
>
> My current understanding is that the skip marker is only required for
> the non-echo ADD_ADDR path, because that is the one tied to the local
> ADD_ADDR retransmission timer flow.
>
> For the echo case here, it looks more like a reply path to a received
> peer ADD_ADDR, so on option-space shortage, clearing the signal bit and
> accounting the drop might be sufficient.
>
> But I may be missing something in this area — if you think echo should
> also go through the same skip handling, I can align with that in v3.
Indeed, you are right, I mixed thing up: when an ADD_ADDR echo is
received, the timer for the linked ADD_ADDR is stopped, but no timer
needs to be stopped when sending the ADD_ADDR echo. So no need to change
anything here.
>>> [ ... skip 13 lines ... ]
>>> + /* On pure-ACK option-space exhaustion, stop retrying this ADD_ADDR:
>>> + * clear the signal bit, cancel the matching retransmission timer, and
>>> + * let the PM state machine progress.
>>> + */
>>> + if (skip_add_addr)
>>> + mptcp_pm_add_addr_skip(msk, addr);
>>
>> If you have to modify something else, maybe the new helper is not
>> needed, and the code can be added here directly? I know I suggested the
>> introduction of this helper, but now that I see the code, I think it
>> would be easier to understand the new comment above without it, and to
>> backport this patch.
>>
>
> Agreed. I'll likely inline that logic to keep the change simpler and
> easier to backport.
>
> FYI: I’m replying from my personal address during the holiday period, as
> I currently don’t have access to my corporate mailbox. I’ll keep
> following up on this patchset and send v3 shortly.
All good, no hurry here.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Hi Matt,
I’m currently preparing the v3 update for this fix.
In mptcp_pm_add_addr_signal(), when TCP option space is insufficient, we
cancel the ADD_ADDR retrans timer via mptcp_pm_del_add_timer(). During
analysis, I found a race window that can lead to one extra ADD_ADDR send
attempt, so I’d like to confirm the intended semantics before posting v3.
I see two race cases:
Case 1: cancel state applied first (handled by existing counter check)
CPU0 (cancel path) CPU1 (add_timer callback)
-------------------------------------- -----------------------------------
option space insufficient callback already running
call mptcp_pm_del_add_timer() ...
acquire pm.lock ...
set retrans_times = ADD_ADDR_RETRANS_MAX ...
release pm.lock ...
stop timer ...
acquire pm.lock
check retrans_times <
ADD_ADDR_RETRANS_MAX ?
false -> skip announce/send
release pm.lock
For this case, gating send by
`retrans_times < ADD_ADDR_RETRANS_MAX` prevents stale send.
Case 2: callback takes lock first and sends before cancel applies
CPU0 (cancel path) CPU1 (add_timer callback)
-------------------------------------- -----------------------------------
option space insufficient callback already running
call mptcp_pm_del_add_timer() acquire pm.lock
try acquire pm.lock -> blocked check send conditions
announce/send ADD_ADDR <- send
retrans_times++
release pm.lock
acquire pm.lock ...
set retrans_times = ADD_ADDR_RETRANS_MAX ...
release pm.lock ...
stop timer ...
I’d like to discuss which semantic target is better for v3:
A) best-effort cancel:
race may still allow at-most-one extra send attempt and `ADDADDRTXDROP`
send-failure statistics may be incremented one extra time.
B) strict cancel semantics:
once cancel is decided, no further ADD_ADDR send action
My current understanding:
- Case 1 can be covered by the existing retrans_times gate.
- Case 2 is the key trade-off between complexity and strictness.
Happy to hear your thoughts on which direction is preferable.
I can prepare v3 accordingly.
Thanks for the guidance.
Best regards,
Li Xiasong
On 5/1/2026 11:13 PM, Matthieu Baerts wrote:
> Hi Li,
>
> On 01/05/2026 17:03, Li Xiasong wrote:
>> On 4/30/26 9:20 PM, Matthieu Baerts wrote:
>>> On Thu, 30 Apr 2026 19:20:25 +0800, Li Xiasong <lixiasong1@huawei.com> wrote:
>
> (...)
>
>>>> @@ -881,24 +889,47 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
>>>> [ ... skip 26 lines ... ]
>>>> + goto out_unlock;
>>>> +
>>>> + if (*echo) {
>>>> + MPTCP_INC_STATS(net, MPTCP_MIB_ECHOADDTXDROP);
>>>> + } else {
>>>> + skip_add_addr = true;
>>>
>>> Even if I don't think we would have space issues with the echo, should
>>> this variable also be set also in case of ADD_ADDR echo?
>>>
>>> (In this case, the variable could also simply be called 'skip', or
>>> 'drop', similar to the MIB counters here.)
>>>
>>
>> My current understanding is that the skip marker is only required for
>> the non-echo ADD_ADDR path, because that is the one tied to the local
>> ADD_ADDR retransmission timer flow.
>>
>> For the echo case here, it looks more like a reply path to a received
>> peer ADD_ADDR, so on option-space shortage, clearing the signal bit and
>> accounting the drop might be sufficient.
>>
>> But I may be missing something in this area — if you think echo should
>> also go through the same skip handling, I can align with that in v3.
>
> Indeed, you are right, I mixed thing up: when an ADD_ADDR echo is
> received, the timer for the linked ADD_ADDR is stopped, but no timer
> needs to be stopped when sending the ADD_ADDR echo. So no need to change
> anything here.
>
>>>> [ ... skip 13 lines ... ]
>>>> + /* On pure-ACK option-space exhaustion, stop retrying this ADD_ADDR:
>>>> + * clear the signal bit, cancel the matching retransmission timer, and
>>>> + * let the PM state machine progress.
>>>> + */
>>>> + if (skip_add_addr)
>>>> + mptcp_pm_add_addr_skip(msk, addr);
>>>
>>> If you have to modify something else, maybe the new helper is not
>>> needed, and the code can be added here directly? I know I suggested the
>>> introduction of this helper, but now that I see the code, I think it
>>> would be easier to understand the new comment above without it, and to
>>> backport this patch.
>>>
>>
>> Agreed. I'll likely inline that logic to keep the change simpler and
>> easier to backport.
>>
>> FYI: I’m replying from my personal address during the holiday period, as
>> I currently don’t have access to my corporate mailbox. I’ll keep
>> following up on this patchset and send v3 shortly.
>
> All good, no hurry here.
>
> Cheers,
> Matt
Hi Li, On 06/05/2026 15:27, Li Xiasong wrote: > Hi Matt, > > I’m currently preparing the v3 update for this fix. Thanks! > In mptcp_pm_add_addr_signal(), when TCP option space is insufficient, we > cancel the ADD_ADDR retrans timer via mptcp_pm_del_add_timer(). During > analysis, I found a race window that can lead to one extra ADD_ADDR send > attempt, so I’d like to confirm the intended semantics before posting v3. > > I see two race cases: > > Case 1: cancel state applied first (handled by existing counter check) > > CPU0 (cancel path) CPU1 (add_timer callback) > -------------------------------------- ----------------------------------- > option space insufficient callback already running > call mptcp_pm_del_add_timer() ... > acquire pm.lock ... > set retrans_times = ADD_ADDR_RETRANS_MAX ... > release pm.lock ... > stop timer ... > acquire pm.lock > check retrans_times < > ADD_ADDR_RETRANS_MAX ? > false -> skip announce/send > release pm.lock > > For this case, gating send by > `retrans_times < ADD_ADDR_RETRANS_MAX` prevents stale send. > > > Case 2: callback takes lock first and sends before cancel applies > > CPU0 (cancel path) CPU1 (add_timer callback) > -------------------------------------- ----------------------------------- > option space insufficient callback already running > call mptcp_pm_del_add_timer() acquire pm.lock > try acquire pm.lock -> blocked check send conditions > announce/send ADD_ADDR <- send > retrans_times++ > release pm.lock > acquire pm.lock ... > set retrans_times = ADD_ADDR_RETRANS_MAX ... > release pm.lock ... > stop timer ... > > > I’d like to discuss which semantic target is better for v3: > > A) best-effort cancel: > race may still allow at-most-one extra send attempt and `ADDADDRTXDROP` > send-failure statistics may be incremented one extra time. I think that's OK to send an extra attempt, especially here where it is quite unlikely to get this race. > B) strict cancel semantics: > once cancel is decided, no further ADD_ADDR send action > > My current understanding: > - Case 1 can be covered by the existing retrans_times gate. > - Case 2 is the key trade-off between complexity and strictness. Case 1 looks fine to me, but do you mind documenting the possible race in the code and/or commit message depending on the patch, please? Cheers, Matt -- Sponsored by the NGI0 Core fund.
On 5/6/2026 9:34 PM, Matthieu Baerts wrote: > Hi Li, > > On 06/05/2026 15:27, Li Xiasong wrote: >> Hi Matt, >> >> I’m currently preparing the v3 update for this fix. > > Thanks! > >> In mptcp_pm_add_addr_signal(), when TCP option space is insufficient, we >> cancel the ADD_ADDR retrans timer via mptcp_pm_del_add_timer(). During >> analysis, I found a race window that can lead to one extra ADD_ADDR send >> attempt, so I’d like to confirm the intended semantics before posting v3. >> >> I see two race cases: >> >> Case 1: cancel state applied first (handled by existing counter check) >> >> CPU0 (cancel path) CPU1 (add_timer callback) >> -------------------------------------- ----------------------------------- >> option space insufficient callback already running >> call mptcp_pm_del_add_timer() ... >> acquire pm.lock ... >> set retrans_times = ADD_ADDR_RETRANS_MAX ... >> release pm.lock ... >> stop timer ... >> acquire pm.lock >> check retrans_times < >> ADD_ADDR_RETRANS_MAX ? >> false -> skip announce/send >> release pm.lock >> >> For this case, gating send by >> `retrans_times < ADD_ADDR_RETRANS_MAX` prevents stale send. >> >> >> Case 2: callback takes lock first and sends before cancel applies >> >> CPU0 (cancel path) CPU1 (add_timer callback) >> -------------------------------------- ----------------------------------- >> option space insufficient callback already running >> call mptcp_pm_del_add_timer() acquire pm.lock >> try acquire pm.lock -> blocked check send conditions >> announce/send ADD_ADDR <- send >> retrans_times++ >> release pm.lock >> acquire pm.lock ... >> set retrans_times = ADD_ADDR_RETRANS_MAX ... >> release pm.lock ... >> stop timer ... >> >> >> I’d like to discuss which semantic target is better for v3: >> >> A) best-effort cancel: >> race may still allow at-most-one extra send attempt and `ADDADDRTXDROP` >> send-failure statistics may be incremented one extra time. > > I think that's OK to send an extra attempt, especially here where it is > quite unlikely to get this race. > Thanks, got it. >> B) strict cancel semantics: >> once cancel is decided, no further ADD_ADDR send action >> >> My current understanding: >> - Case 1 can be covered by the existing retrans_times gate. >> - Case 2 is the key trade-off between complexity and strictness. > > Case 1 looks fine to me, but do you mind documenting the possible race > in the code and/or commit message depending on the patch, please? > I’ll document the race consideration in v3 (in the commit message, and add an in-code note as needed for clarity), then send the updated patch in the next revision. Best regards, Li Xiasong > Cheers, > Matt
© 2016 - 2026 Red Hat, Inc.