[PATCH mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr

Krister Johansen posted 1 patch 1 month, 3 weeks ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/20250221222146.GA1896@templeofstupid.com
There is a newer version of this series
net/mptcp/pm_netlink.c | 19 ++++++++++++-------
net/mptcp/protocol.h   |  1 +
2 files changed, 13 insertions(+), 7 deletions(-)
[PATCH mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Krister Johansen 1 month, 3 weeks ago
If multiple connection requests attempt to create an implicit mptcp
endpoint in parallel, more than one caller may end up in
mptcp_pm_nl_append_new_local_addr because none found the address in
local_addr_list during their call to mptcp_pm_nl_get_local_id.  In this
case, the concurrent new_local_addr calls may delete the address entry
created by the previous caller.  These deletes use synchronize_rcu, but
this is not permitted in some of the contexts where this function may be
called.  During packet recv, the caller may be in a rcu read critical
section and have preemption disabled.

An example stack:

   BUG: scheduling while atomic: swapper/2/0/0x00000302

   Call Trace:
   <IRQ>
   dump_stack_lvl+0x76/0xa0
   dump_stack+0x10/0x20
   __schedule_bug+0x64/0x80
   schedule_debug.constprop.0+0xdb/0x130
   __schedule+0x69/0x6a0
   schedule+0x33/0x110
   schedule_timeout+0x157/0x170
   wait_for_completion+0x88/0x150
   __wait_rcu_gp+0x150/0x160
   synchronize_rcu+0x12d/0x140
   mptcp_pm_nl_append_new_local_addr+0x1bd/0x280
   mptcp_pm_nl_get_local_id+0x121/0x160
   mptcp_pm_get_local_id+0x9d/0xe0
   subflow_check_req+0x1a8/0x460
   subflow_v4_route_req+0xb5/0x110
   tcp_conn_request+0x3a4/0xd00
   subflow_v4_conn_request+0x42/0xa0
   tcp_rcv_state_process+0x1e3/0x7e0
   tcp_v4_do_rcv+0xd3/0x2a0
   tcp_v4_rcv+0xbb8/0xbf0
   ip_protocol_deliver_rcu+0x3c/0x210
   ip_local_deliver_finish+0x77/0xa0
   ip_local_deliver+0x6e/0x120
   ip_sublist_rcv_finish+0x6f/0x80
   ip_sublist_rcv+0x178/0x230
   ip_list_rcv+0x102/0x140
   __netif_receive_skb_list_core+0x22d/0x250
   netif_receive_skb_list_internal+0x1a3/0x2d0
   napi_complete_done+0x74/0x1c0
   igb_poll+0x6c/0xe0 [igb]
   __napi_poll+0x30/0x200
   net_rx_action+0x181/0x2e0
   handle_softirqs+0xd8/0x340
   __irq_exit_rcu+0xd9/0x100
   irq_exit_rcu+0xe/0x20
   common_interrupt+0xa4/0xb0
   </IRQ>

This problem seems particularly prevalent if the user advertises an
endpoint that has a different external vs internal address.  In the case
where the external address is advertised and multiple connections
already exist, multiple subflow SYNs arrive in parallel which tends to
trigger the race during creation of the first local_addr_list entries
which have the internal address instead.

Fix this problem by switching mptcp_pm_nl_append_new_local_addr to use
call_rcu .  As part of plumbing this up, make
__mptcp_pm_release_addr_entry take a rcu_head which is used by all
callers regardless of cleanup method.

Cc: stable@vger.kernel.org
Fixes: d045b9eb95a9 ("mptcp: introduce implicit endpoints")
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 net/mptcp/pm_netlink.c | 19 ++++++++++++-------
 net/mptcp/protocol.h   |  1 +
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index c0e47f4f7b1a..4115b83cc2c3 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -967,9 +967,15 @@ static bool address_use_port(struct mptcp_pm_addr_entry *entry)
 		MPTCP_PM_ADDR_FLAG_SIGNAL;
 }
 
-/* caller must ensure the RCU grace period is already elapsed */
-static void __mptcp_pm_release_addr_entry(struct mptcp_pm_addr_entry *entry)
+/*
+ * Caller must ensure the RCU grace period is already elapsed or call this
+ * via a RCU callback.
+ */
+static void __mptcp_pm_release_addr_entry(struct rcu_head *head)
 {
+	struct mptcp_pm_addr_entry *entry;
+
+	entry = container_of(head, struct mptcp_pm_addr_entry, rcu_head);
 	if (entry->lsk)
 		sock_release(entry->lsk);
 	kfree(entry);
@@ -1064,8 +1070,7 @@ static int mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet,
 
 	/* just replaced an existing entry, free it */
 	if (del_entry) {
-		synchronize_rcu();
-		__mptcp_pm_release_addr_entry(del_entry);
+		call_rcu(&del_entry->rcu_head, __mptcp_pm_release_addr_entry);
 	}
 	return ret;
 }
@@ -1443,7 +1448,7 @@ int mptcp_pm_nl_add_addr_doit(struct sk_buff *skb, struct genl_info *info)
 	return 0;
 
 out_free:
-	__mptcp_pm_release_addr_entry(entry);
+	__mptcp_pm_release_addr_entry(&entry->rcu_head);
 	return ret;
 }
 
@@ -1623,7 +1628,7 @@ int mptcp_pm_nl_del_addr_doit(struct sk_buff *skb, struct genl_info *info)
 
 	mptcp_nl_remove_subflow_and_signal_addr(sock_net(skb->sk), entry);
 	synchronize_rcu();
-	__mptcp_pm_release_addr_entry(entry);
+	__mptcp_pm_release_addr_entry(&entry->rcu_head);
 
 	return ret;
 }
@@ -1689,7 +1694,7 @@ static void __flush_addrs(struct list_head *list)
 		cur = list_entry(list->next,
 				 struct mptcp_pm_addr_entry, list);
 		list_del_rcu(&cur->list);
-		__mptcp_pm_release_addr_entry(cur);
+		__mptcp_pm_release_addr_entry(&cur->rcu_head);
 	}
 }
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ad21925af061..29c4ee64cd0b 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -250,6 +250,7 @@ struct mptcp_pm_addr_entry {
 	u8			flags;
 	int			ifindex;
 	struct socket		*lsk;
+	struct rcu_head		rcu_head;
 };
 
 struct mptcp_data_frag {
-- 
2.25.1
Re: [PATCH mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Paolo Abeni 1 month, 3 weeks ago
Hi,

On 2/21/25 11:21 PM, Krister Johansen wrote:
> If multiple connection requests attempt to create an implicit mptcp
> endpoint in parallel, more than one caller may end up in
> mptcp_pm_nl_append_new_local_addr because none found the address in
> local_addr_list during their call to mptcp_pm_nl_get_local_id.  In this
> case, the concurrent new_local_addr calls may delete the address entry
> created by the previous caller.  These deletes use synchronize_rcu, but
> this is not permitted in some of the contexts where this function may be
> called.  During packet recv, the caller may be in a rcu read critical
> section and have preemption disabled.
> 
> An example stack:
> 
>    BUG: scheduling while atomic: swapper/2/0/0x00000302
> 
>    Call Trace:
>    <IRQ>
>    dump_stack_lvl+0x76/0xa0
>    dump_stack+0x10/0x20
>    __schedule_bug+0x64/0x80
>    schedule_debug.constprop.0+0xdb/0x130
>    __schedule+0x69/0x6a0
>    schedule+0x33/0x110
>    schedule_timeout+0x157/0x170
>    wait_for_completion+0x88/0x150
>    __wait_rcu_gp+0x150/0x160
>    synchronize_rcu+0x12d/0x140
>    mptcp_pm_nl_append_new_local_addr+0x1bd/0x280
>    mptcp_pm_nl_get_local_id+0x121/0x160
>    mptcp_pm_get_local_id+0x9d/0xe0
>    subflow_check_req+0x1a8/0x460
>    subflow_v4_route_req+0xb5/0x110
>    tcp_conn_request+0x3a4/0xd00
>    subflow_v4_conn_request+0x42/0xa0
>    tcp_rcv_state_process+0x1e3/0x7e0
>    tcp_v4_do_rcv+0xd3/0x2a0
>    tcp_v4_rcv+0xbb8/0xbf0
>    ip_protocol_deliver_rcu+0x3c/0x210
>    ip_local_deliver_finish+0x77/0xa0
>    ip_local_deliver+0x6e/0x120
>    ip_sublist_rcv_finish+0x6f/0x80
>    ip_sublist_rcv+0x178/0x230
>    ip_list_rcv+0x102/0x140
>    __netif_receive_skb_list_core+0x22d/0x250
>    netif_receive_skb_list_internal+0x1a3/0x2d0
>    napi_complete_done+0x74/0x1c0
>    igb_poll+0x6c/0xe0 [igb]
>    __napi_poll+0x30/0x200
>    net_rx_action+0x181/0x2e0
>    handle_softirqs+0xd8/0x340
>    __irq_exit_rcu+0xd9/0x100
>    irq_exit_rcu+0xe/0x20
>    common_interrupt+0xa4/0xb0
>    </IRQ>
> 
> This problem seems particularly prevalent if the user advertises an
> endpoint that has a different external vs internal address.  In the case
> where the external address is advertised and multiple connections
> already exist, multiple subflow SYNs arrive in parallel which tends to
> trigger the race during creation of the first local_addr_list entries
> which have the internal address instead.
> 
> Fix this problem by switching mptcp_pm_nl_append_new_local_addr to use
> call_rcu .  As part of plumbing this up, make
> __mptcp_pm_release_addr_entry take a rcu_head which is used by all
> callers regardless of cleanup method.
> 
> Cc: stable@vger.kernel.org
> Fixes: d045b9eb95a9 ("mptcp: introduce implicit endpoints")
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>

The proposed patch looks functionally correct to me, but I think it
would be better to avoid adding new fields to mptcp_pm_addr_entry, if
not strictly needed.

What about the following? (completely untested!). When inplicit
endpoints creations race one with each other, we don't need to replace
the existing one, we could simply use it.

That would additionally prevent an implicit endpoint created from a
subflow from overriding the flags set by a racing user-space endpoint add.

If that works/fits you feel free to take/use it.
---
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 572d160edca3..dcb27b479824 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -977,7 +977,7 @@ static void __mptcp_pm_release_addr_entry(struct
mptcp_pm_addr_entry *entry)

 static int mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet,
 					     struct mptcp_pm_addr_entry *entry,
-					     bool needs_id)
+					     bool needs_id, bool replace)
 {
 	struct mptcp_pm_addr_entry *cur, *del_entry = NULL;
 	unsigned int addr_max;
@@ -1017,6 +1017,12 @@ static int
mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet,
 			if (entry->addr.id)
 				goto out;

+			if (!replace) {
+				kfree(entry);
+				ret = cur->addr.id;
+				goto out;
+			}
+
 			pernet->addrs--;
 			entry->addr.id = cur->addr.id;
 			list_del_rcu(&cur->list);
@@ -1165,7 +1171,7 @@ int mptcp_pm_nl_get_local_id(struct mptcp_sock
*msk, struct mptcp_addr_info *skc
 	entry->ifindex = 0;
 	entry->flags = MPTCP_PM_ADDR_FLAG_IMPLICIT;
 	entry->lsk = NULL;
-	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry, true);
+	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry, true, false);
 	if (ret < 0)
 		kfree(entry);

@@ -1433,7 +1439,8 @@ int mptcp_pm_nl_add_addr_doit(struct sk_buff *skb,
struct genl_info *info)
 		}
 	}
 	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry,
-						!mptcp_pm_has_addr_attr_id(attr, info));
+						!mptcp_pm_has_addr_attr_id(attr, info),
+						true);
 	if (ret < 0) {
 		GENL_SET_ERR_MSG_FMT(info, "too many addresses or duplicate one: %d",
ret);
 		goto out_free;
Re: [PATCH mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Krister Johansen 1 month, 3 weeks ago
Hi Paolo,

Thanks for the feedback.

On Mon, Feb 24, 2025 at 11:09:17AM +0100, Paolo Abeni wrote:
> On 2/21/25 11:21 PM, Krister Johansen wrote:
> > If multiple connection requests attempt to create an implicit mptcp
> > endpoint in parallel, more than one caller may end up in
> > mptcp_pm_nl_append_new_local_addr because none found the address in
> > local_addr_list during their call to mptcp_pm_nl_get_local_id.  In this
> > case, the concurrent new_local_addr calls may delete the address entry
> > created by the previous caller.  These deletes use synchronize_rcu, but
> > this is not permitted in some of the contexts where this function may be
> > called.  During packet recv, the caller may be in a rcu read critical
> > section and have preemption disabled.
> > 
> > An example stack:
> > 
> >    BUG: scheduling while atomic: swapper/2/0/0x00000302
> > 
> >    Call Trace:
> >    <IRQ>
> >    dump_stack_lvl+0x76/0xa0
> >    dump_stack+0x10/0x20
> >    __schedule_bug+0x64/0x80
> >    schedule_debug.constprop.0+0xdb/0x130
> >    __schedule+0x69/0x6a0
> >    schedule+0x33/0x110
> >    schedule_timeout+0x157/0x170
> >    wait_for_completion+0x88/0x150
> >    __wait_rcu_gp+0x150/0x160
> >    synchronize_rcu+0x12d/0x140
> >    mptcp_pm_nl_append_new_local_addr+0x1bd/0x280
> >    mptcp_pm_nl_get_local_id+0x121/0x160
> >    mptcp_pm_get_local_id+0x9d/0xe0
> >    subflow_check_req+0x1a8/0x460
> >    subflow_v4_route_req+0xb5/0x110
> >    tcp_conn_request+0x3a4/0xd00
> >    subflow_v4_conn_request+0x42/0xa0
> >    tcp_rcv_state_process+0x1e3/0x7e0
> >    tcp_v4_do_rcv+0xd3/0x2a0
> >    tcp_v4_rcv+0xbb8/0xbf0
> >    ip_protocol_deliver_rcu+0x3c/0x210
> >    ip_local_deliver_finish+0x77/0xa0
> >    ip_local_deliver+0x6e/0x120
> >    ip_sublist_rcv_finish+0x6f/0x80
> >    ip_sublist_rcv+0x178/0x230
> >    ip_list_rcv+0x102/0x140
> >    __netif_receive_skb_list_core+0x22d/0x250
> >    netif_receive_skb_list_internal+0x1a3/0x2d0
> >    napi_complete_done+0x74/0x1c0
> >    igb_poll+0x6c/0xe0 [igb]
> >    __napi_poll+0x30/0x200
> >    net_rx_action+0x181/0x2e0
> >    handle_softirqs+0xd8/0x340
> >    __irq_exit_rcu+0xd9/0x100
> >    irq_exit_rcu+0xe/0x20
> >    common_interrupt+0xa4/0xb0
> >    </IRQ>
> > 
> > This problem seems particularly prevalent if the user advertises an
> > endpoint that has a different external vs internal address.  In the case
> > where the external address is advertised and multiple connections
> > already exist, multiple subflow SYNs arrive in parallel which tends to
> > trigger the race during creation of the first local_addr_list entries
> > which have the internal address instead.
> > 
> > Fix this problem by switching mptcp_pm_nl_append_new_local_addr to use
> > call_rcu .  As part of plumbing this up, make
> > __mptcp_pm_release_addr_entry take a rcu_head which is used by all
> > callers regardless of cleanup method.
> > 
> > Cc: stable@vger.kernel.org
> > Fixes: d045b9eb95a9 ("mptcp: introduce implicit endpoints")
> > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> 
> The proposed patch looks functionally correct to me, but I think it
> would be better to avoid adding new fields to mptcp_pm_addr_entry, if
> not strictly needed.
> 
> What about the following? (completely untested!). When inplicit
> endpoints creations race one with each other, we don't need to replace
> the existing one, we could simply use it.
> 
> That would additionally prevent an implicit endpoint created from a
> subflow from overriding the flags set by a racing user-space endpoint add.
> 
> If that works/fits you feel free to take/use it.

I like this suggestion.  In addition to the benefits you outlined, it
also prevents a series of back-to-back replacements from getting turned
into a chunk of call_rcu() calls.  Leaving this as a synchronize_rcu is
probably better too, if we can.  I was unsure whether it was acceptable
to skip the replacement in this case.  Thanks for clearing that up.

I'll test this and follow up with a v2.

> ---
> diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
> index 572d160edca3..dcb27b479824 100644
> --- a/net/mptcp/pm_netlink.c
> +++ b/net/mptcp/pm_netlink.c
> @@ -977,7 +977,7 @@ static void __mptcp_pm_release_addr_entry(struct
> mptcp_pm_addr_entry *entry)
> 
>  static int mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet,
>  					     struct mptcp_pm_addr_entry *entry,
> -					     bool needs_id)
> +					     bool needs_id, bool replace)
>  {
>  	struct mptcp_pm_addr_entry *cur, *del_entry = NULL;
>  	unsigned int addr_max;
> @@ -1017,6 +1017,12 @@ static int
> mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet,
>  			if (entry->addr.id)
>  				goto out;
> 
> +			if (!replace) {
> +				kfree(entry);
> +				ret = cur->addr.id;
> +				goto out;
> +			}
> +
>  			pernet->addrs--;
>  			entry->addr.id = cur->addr.id;
>  			list_del_rcu(&cur->list);
> @@ -1165,7 +1171,7 @@ int mptcp_pm_nl_get_local_id(struct mptcp_sock
> *msk, struct mptcp_addr_info *skc
>  	entry->ifindex = 0;
>  	entry->flags = MPTCP_PM_ADDR_FLAG_IMPLICIT;
>  	entry->lsk = NULL;
> -	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry, true);
> +	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry, true, false);
>  	if (ret < 0)
>  		kfree(entry);
> 
> @@ -1433,7 +1439,8 @@ int mptcp_pm_nl_add_addr_doit(struct sk_buff *skb,
> struct genl_info *info)
>  		}
>  	}
>  	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry,
> -						!mptcp_pm_has_addr_attr_id(attr, info));
> +						!mptcp_pm_has_addr_attr_id(attr, info),
> +						true);
>  	if (ret < 0) {
>  		GENL_SET_ERR_MSG_FMT(info, "too many addresses or duplicate one: %d",
> ret);
>  		goto out_free;

Thanks,

-K
[PATCH v2 mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Krister Johansen 1 month, 3 weeks ago
If multiple connection requests attempt to create an implicit mptcp
endpoint in parallel, more than one caller may end up in
mptcp_pm_nl_append_new_local_addr because none found the address in
local_addr_list during their call to mptcp_pm_nl_get_local_id.  In this
case, the concurrent new_local_addr calls may delete the address entry
created by the previous caller.  These deletes use synchronize_rcu, but
this is not permitted in some of the contexts where this function may be
called.  During packet recv, the caller may be in a rcu read critical
section and have preemption disabled.

An example stack:

   BUG: scheduling while atomic: swapper/2/0/0x00000302

   Call Trace:
   <IRQ>
   dump_stack_lvl+0x76/0xa0
   dump_stack+0x10/0x20
   __schedule_bug+0x64/0x80
   schedule_debug.constprop.0+0xdb/0x130
   __schedule+0x69/0x6a0
   schedule+0x33/0x110
   schedule_timeout+0x157/0x170
   wait_for_completion+0x88/0x150
   __wait_rcu_gp+0x150/0x160
   synchronize_rcu+0x12d/0x140
   mptcp_pm_nl_append_new_local_addr+0x1bd/0x280
   mptcp_pm_nl_get_local_id+0x121/0x160
   mptcp_pm_get_local_id+0x9d/0xe0
   subflow_check_req+0x1a8/0x460
   subflow_v4_route_req+0xb5/0x110
   tcp_conn_request+0x3a4/0xd00
   subflow_v4_conn_request+0x42/0xa0
   tcp_rcv_state_process+0x1e3/0x7e0
   tcp_v4_do_rcv+0xd3/0x2a0
   tcp_v4_rcv+0xbb8/0xbf0
   ip_protocol_deliver_rcu+0x3c/0x210
   ip_local_deliver_finish+0x77/0xa0
   ip_local_deliver+0x6e/0x120
   ip_sublist_rcv_finish+0x6f/0x80
   ip_sublist_rcv+0x178/0x230
   ip_list_rcv+0x102/0x140
   __netif_receive_skb_list_core+0x22d/0x250
   netif_receive_skb_list_internal+0x1a3/0x2d0
   napi_complete_done+0x74/0x1c0
   igb_poll+0x6c/0xe0 [igb]
   __napi_poll+0x30/0x200
   net_rx_action+0x181/0x2e0
   handle_softirqs+0xd8/0x340
   __irq_exit_rcu+0xd9/0x100
   irq_exit_rcu+0xe/0x20
   common_interrupt+0xa4/0xb0
   </IRQ>

This problem seems particularly prevalent if the user advertises an
endpoint that has a different external vs internal address.  In the case
where the external address is advertised and multiple connections
already exist, multiple subflow SYNs arrive in parallel which tends to
trigger the race during creation of the first local_addr_list entries
which have the internal address instead.

Fix by skipping the replacement of an existing implicit local address if
called via mptcp_pm_nl_get_local_id.

Cc: stable@vger.kernel.org
Fixes: d045b9eb95a9 ("mptcp: introduce implicit endpoints")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
v2:
  - Switch from call_rcu to skipping replacement if invoked via
    mptcp_pm_nl_get_local_id. (Feedback from Paolo Abeni)
---
 net/mptcp/pm_netlink.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index c0e47f4f7b1a..7868207c4e9d 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -977,7 +977,7 @@ static void __mptcp_pm_release_addr_entry(struct mptcp_pm_addr_entry *entry)
 
 static int mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet,
 					     struct mptcp_pm_addr_entry *entry,
-					     bool needs_id)
+					     bool needs_id, bool replace)
 {
 	struct mptcp_pm_addr_entry *cur, *del_entry = NULL;
 	unsigned int addr_max;
@@ -1017,6 +1017,17 @@ static int mptcp_pm_nl_append_new_local_addr(struct pm_nl_pernet *pernet,
 			if (entry->addr.id)
 				goto out;
 
+			/* allow callers that only need to look up the local
+			 * addr's id to skip replacement. This allows them to
+			 * avoid calling synchronize_rcu in the packet recv
+			 * path.
+			 */
+			if (!replace) {
+				kfree(entry);
+				ret = cur->addr.id;
+				goto out;
+			}
+
 			pernet->addrs--;
 			entry->addr.id = cur->addr.id;
 			list_del_rcu(&cur->list);
@@ -1165,7 +1176,7 @@ int mptcp_pm_nl_get_local_id(struct mptcp_sock *msk, struct mptcp_addr_info *skc
 	entry->ifindex = 0;
 	entry->flags = MPTCP_PM_ADDR_FLAG_IMPLICIT;
 	entry->lsk = NULL;
-	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry, true);
+	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry, true, false);
 	if (ret < 0)
 		kfree(entry);
 
@@ -1433,7 +1444,8 @@ int mptcp_pm_nl_add_addr_doit(struct sk_buff *skb, struct genl_info *info)
 		}
 	}
 	ret = mptcp_pm_nl_append_new_local_addr(pernet, entry,
-						!mptcp_pm_has_addr_attr_id(attr, info));
+						!mptcp_pm_has_addr_attr_id(attr, info),
+						true);
 	if (ret < 0) {
 		GENL_SET_ERR_MSG_FMT(info, "too many addresses or duplicate one: %d", ret);
 		goto out_free;

base-commit: 384fa1d90d092d36bfe13c0473194120ce28a50e
-- 
2.25.1
Re: [PATCH v2 mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Matthieu Baerts 1 month, 3 weeks ago
Hi Krister,

On 25/02/2025 00:20, Krister Johansen wrote:
> If multiple connection requests attempt to create an implicit mptcp
> endpoint in parallel, more than one caller may end up in
> mptcp_pm_nl_append_new_local_addr because none found the address in
> local_addr_list during their call to mptcp_pm_nl_get_local_id.  In this
> case, the concurrent new_local_addr calls may delete the address entry
> created by the previous caller.  These deletes use synchronize_rcu, but
> this is not permitted in some of the contexts where this function may be
> called.  During packet recv, the caller may be in a rcu read critical
> section and have preemption disabled.

Thank you for this patch, and for having taken the time to analyse the
issue!

> An example stack:
> 
>    BUG: scheduling while atomic: swapper/2/0/0x00000302
> 
>    Call Trace:
>    <IRQ>
>    dump_stack_lvl+0x76/0xa0
>    dump_stack+0x10/0x20
>    __schedule_bug+0x64/0x80
>    schedule_debug.constprop.0+0xdb/0x130
>    __schedule+0x69/0x6a0
>    schedule+0x33/0x110
>    schedule_timeout+0x157/0x170
>    wait_for_completion+0x88/0x150
>    __wait_rcu_gp+0x150/0x160
>    synchronize_rcu+0x12d/0x140
>    mptcp_pm_nl_append_new_local_addr+0x1bd/0x280
>    mptcp_pm_nl_get_local_id+0x121/0x160
>    mptcp_pm_get_local_id+0x9d/0xe0
>    subflow_check_req+0x1a8/0x460
>    subflow_v4_route_req+0xb5/0x110
>    tcp_conn_request+0x3a4/0xd00
>    subflow_v4_conn_request+0x42/0xa0
>    tcp_rcv_state_process+0x1e3/0x7e0
>    tcp_v4_do_rcv+0xd3/0x2a0
>    tcp_v4_rcv+0xbb8/0xbf0
>    ip_protocol_deliver_rcu+0x3c/0x210
>    ip_local_deliver_finish+0x77/0xa0
>    ip_local_deliver+0x6e/0x120
>    ip_sublist_rcv_finish+0x6f/0x80
>    ip_sublist_rcv+0x178/0x230
>    ip_list_rcv+0x102/0x140
>    __netif_receive_skb_list_core+0x22d/0x250
>    netif_receive_skb_list_internal+0x1a3/0x2d0
>    napi_complete_done+0x74/0x1c0
>    igb_poll+0x6c/0xe0 [igb]
>    __napi_poll+0x30/0x200
>    net_rx_action+0x181/0x2e0
>    handle_softirqs+0xd8/0x340
>    __irq_exit_rcu+0xd9/0x100
>    irq_exit_rcu+0xe/0x20
>    common_interrupt+0xa4/0xb0
>    </IRQ>
Detail: if possible, next time, do not hesitate to resolve the
addresses, e.g. using: ./scripts/decode_stacktrace.sh

> This problem seems particularly prevalent if the user advertises an
> endpoint that has a different external vs internal address.  In the case
> where the external address is advertised and multiple connections
> already exist, multiple subflow SYNs arrive in parallel which tends to
> trigger the race during creation of the first local_addr_list entries
> which have the internal address instead.
> 
> Fix by skipping the replacement of an existing implicit local address if
> called via mptcp_pm_nl_get_local_id.
The v2 looks good to me:

Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

I'm going to apply it in our MPTCP tree, but this patch can also be
directly applied in the net tree directly, not to delay it by one week
if preferred. If not, I can re-send it later on.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.
Re: [PATCH v2 mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Krister Johansen 1 month, 3 weeks ago
Hi Matt,
Thanks for the review!

On Tue, Feb 25, 2025 at 06:52:45PM +0100, Matthieu Baerts wrote:
> On 25/02/2025 00:20, Krister Johansen wrote:
> > If multiple connection requests attempt to create an implicit mptcp
> > endpoint in parallel, more than one caller may end up in
> > mptcp_pm_nl_append_new_local_addr because none found the address in
> > local_addr_list during their call to mptcp_pm_nl_get_local_id.  In this
> > case, the concurrent new_local_addr calls may delete the address entry
> > created by the previous caller.  These deletes use synchronize_rcu, but
> > this is not permitted in some of the contexts where this function may be
> > called.  During packet recv, the caller may be in a rcu read critical
> > section and have preemption disabled.
> 
> Thank you for this patch, and for having taken the time to analyse the
> issue!
> 
> > An example stack:
> > 
> >    BUG: scheduling while atomic: swapper/2/0/0x00000302
> > 
> >    Call Trace:
> >    <IRQ>
> >    dump_stack_lvl+0x76/0xa0
> >    dump_stack+0x10/0x20
> >    __schedule_bug+0x64/0x80
> >    schedule_debug.constprop.0+0xdb/0x130
> >    __schedule+0x69/0x6a0
> >    schedule+0x33/0x110
> >    schedule_timeout+0x157/0x170
> >    wait_for_completion+0x88/0x150
> >    __wait_rcu_gp+0x150/0x160
> >    synchronize_rcu+0x12d/0x140
> >    mptcp_pm_nl_append_new_local_addr+0x1bd/0x280
> >    mptcp_pm_nl_get_local_id+0x121/0x160
> >    mptcp_pm_get_local_id+0x9d/0xe0
> >    subflow_check_req+0x1a8/0x460
> >    subflow_v4_route_req+0xb5/0x110
> >    tcp_conn_request+0x3a4/0xd00
> >    subflow_v4_conn_request+0x42/0xa0
> >    tcp_rcv_state_process+0x1e3/0x7e0
> >    tcp_v4_do_rcv+0xd3/0x2a0
> >    tcp_v4_rcv+0xbb8/0xbf0
> >    ip_protocol_deliver_rcu+0x3c/0x210
> >    ip_local_deliver_finish+0x77/0xa0
> >    ip_local_deliver+0x6e/0x120
> >    ip_sublist_rcv_finish+0x6f/0x80
> >    ip_sublist_rcv+0x178/0x230
> >    ip_list_rcv+0x102/0x140
> >    __netif_receive_skb_list_core+0x22d/0x250
> >    netif_receive_skb_list_internal+0x1a3/0x2d0
> >    napi_complete_done+0x74/0x1c0
> >    igb_poll+0x6c/0xe0 [igb]
> >    __napi_poll+0x30/0x200
> >    net_rx_action+0x181/0x2e0
> >    handle_softirqs+0xd8/0x340
> >    __irq_exit_rcu+0xd9/0x100
> >    irq_exit_rcu+0xe/0x20
> >    common_interrupt+0xa4/0xb0
> >    </IRQ>
> Detail: if possible, next time, do not hesitate to resolve the
> addresses, e.g. using: ./scripts/decode_stacktrace.sh

My apologies for the oversight here.  This is the decoded version of the
stack:

   Call Trace:
   <IRQ>
   dump_stack_lvl (lib/dump_stack.c:117 (discriminator 1))
   dump_stack (lib/dump_stack.c:124)
   __schedule_bug (kernel/sched/core.c:5943)
   schedule_debug.constprop.0 (arch/x86/include/asm/preempt.h:33 kernel/sched/core.c:5970)
   __schedule (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 kernel/sched/features.h:29 kernel/sched/core.c:6621)
   schedule (arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:6804 kernel/sched/core.c:6818)
   schedule_timeout (kernel/time/timer.c:2160)
   wait_for_completion (kernel/sched/completion.c:96 kernel/sched/completion.c:116 kernel/sched/completion.c:127 kernel/sched/completion.c:148)
   __wait_rcu_gp (include/linux/rcupdate.h:311 kernel/rcu/update.c:444)
   synchronize_rcu (kernel/rcu/tree.c:3609)
   mptcp_pm_nl_append_new_local_addr (net/mptcp/pm_netlink.c:966 net/mptcp/pm_netlink.c:1061)
   mptcp_pm_nl_get_local_id (net/mptcp/pm_netlink.c:1164)
   mptcp_pm_get_local_id (net/mptcp/pm.c:420)
   subflow_check_req (net/mptcp/subflow.c:98 net/mptcp/subflow.c:213)
   subflow_v4_route_req (net/mptcp/subflow.c:305)
   tcp_conn_request (net/ipv4/tcp_input.c:7216)
   subflow_v4_conn_request (net/mptcp/subflow.c:651)
   tcp_rcv_state_process (net/ipv4/tcp_input.c:6709)
   tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1934)
   tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2334)
   ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1))
   ip_local_deliver_finish (include/linux/rcupdate.h:813 net/ipv4/ip_input.c:234)
   ip_local_deliver (include/linux/netfilter.h:314 include/linux/netfilter.h:308 net/ipv4/ip_input.c:254)
   ip_sublist_rcv_finish (include/net/dst.h:461 net/ipv4/ip_input.c:580)
   ip_sublist_rcv (net/ipv4/ip_input.c:640)
   ip_list_rcv (net/ipv4/ip_input.c:675)
   __netif_receive_skb_list_core (net/core/dev.c:5583 net/core/dev.c:5631)
   netif_receive_skb_list_internal (net/core/dev.c:5685 net/core/dev.c:5774)
   napi_complete_done (include/linux/list.h:37 include/net/gro.h:449 include/net/gro.h:444 net/core/dev.c:6114)
   igb_poll (drivers/net/ethernet/intel/igb/igb_main.c:8244) igb
   __napi_poll (net/core/dev.c:6582)
   net_rx_action (net/core/dev.c:6653 net/core/dev.c:6787)
   handle_softirqs (kernel/softirq.c:553)
   __irq_exit_rcu (kernel/softirq.c:588 kernel/softirq.c:427 kernel/softirq.c:636)
   irq_exit_rcu (kernel/softirq.c:651)
   common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
   </IRQ>

> > This problem seems particularly prevalent if the user advertises an
> > endpoint that has a different external vs internal address.  In the case
> > where the external address is advertised and multiple connections
> > already exist, multiple subflow SYNs arrive in parallel which tends to
> > trigger the race during creation of the first local_addr_list entries
> > which have the internal address instead.
> > 
> > Fix by skipping the replacement of an existing implicit local address if
> > called via mptcp_pm_nl_get_local_id.
> The v2 looks good to me:
> 
> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> 
> I'm going to apply it in our MPTCP tree, but this patch can also be
> directly applied in the net tree directly, not to delay it by one week
> if preferred. If not, I can re-send it later on.

Thanks, I'd be happy to send it to net directly now that it has your
blessing.  Would you like me to modify the call trace in the commit
message to match the decoded one that I included above before I send it
to net?

Thanks,

-K
Re: [PATCH v2 mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Matthieu Baerts 1 month, 3 weeks ago
Hi Krister,

On 25/02/2025 20:29, Krister Johansen wrote:
> Hi Matt,
> Thanks for the review!
> 
> On Tue, Feb 25, 2025 at 06:52:45PM +0100, Matthieu Baerts wrote:
>> On 25/02/2025 00:20, Krister Johansen wrote:
>>> If multiple connection requests attempt to create an implicit mptcp
>>> endpoint in parallel, more than one caller may end up in
>>> mptcp_pm_nl_append_new_local_addr because none found the address in
>>> local_addr_list during their call to mptcp_pm_nl_get_local_id.  In this
>>> case, the concurrent new_local_addr calls may delete the address entry
>>> created by the previous caller.  These deletes use synchronize_rcu, but
>>> this is not permitted in some of the contexts where this function may be
>>> called.  During packet recv, the caller may be in a rcu read critical
>>> section and have preemption disabled.
>>
>> Thank you for this patch, and for having taken the time to analyse the
>> issue!
>>
>>> An example stack:

(...)

>> Detail: if possible, next time, do not hesitate to resolve the
>> addresses, e.g. using: ./scripts/decode_stacktrace.sh
> 
> My apologies for the oversight here.  This is the decoded version of the
> stack:

No problem, thanks for the decoded version!

(...)

>> I'm going to apply it in our MPTCP tree, but this patch can also be
>> directly applied in the net tree directly, not to delay it by one week
>> if preferred. If not, I can re-send it later on.
> 
> Thanks, I'd be happy to send it to net directly now that it has your
> blessing.  Would you like me to modify the call trace in the commit
> message to match the decoded one that I included above before I send it
> to net?

Sorry, I forgot to mention that this bit was for the net maintainers.
Typically, trivial patches and small fixes related to MPTCP can go
directly to net.

No need for you to re-send it. If the net maintainers prefer me to send
it later with other patches (if any), I will update the call trace, no
problem!

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.
Re: [PATCH v2 mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by Krister Johansen 1 month, 3 weeks ago
Hi Matt,

On Tue, Feb 25, 2025 at 10:41:55PM +0100, Matthieu Baerts wrote:
> On 25/02/2025 20:29, Krister Johansen wrote:
> > On Tue, Feb 25, 2025 at 06:52:45PM +0100, Matthieu Baerts wrote:
> >> I'm going to apply it in our MPTCP tree, but this patch can also be
> >> directly applied in the net tree directly, not to delay it by one week
> >> if preferred. If not, I can re-send it later on.
> > 
> > Thanks, I'd be happy to send it to net directly now that it has your
> > blessing.  Would you like me to modify the call trace in the commit
> > message to match the decoded one that I included above before I send it
> > to net?
> 
> Sorry, I forgot to mention that this bit was for the net maintainers.
> Typically, trivial patches and small fixes related to MPTCP can go
> directly to net.
> 
> No need for you to re-send it. If the net maintainers prefer me to send
> it later with other patches (if any), I will update the call trace, no
> problem!

Thanks for clarifying.  I'll hold off on sending anything further and
will either let the net maintainers pick this up, or have you send it
with your next batch of patches.

Thanks again!

-K
Re: [PATCH v2 mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by MPTCP CI 1 month, 3 weeks ago
Hi Krister,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal: Success! ✅
- KVM Validation: debug: Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/13510193894

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/b89b385a81ca
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=937276


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
Re: [PATCH mptcp] mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
Posted by MPTCP CI 1 month, 3 weeks ago
Hi Krister,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal: Success! ✅
- KVM Validation: debug: Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/13466476124

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/ba4aa4f792b5
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=936626


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)