If a subflow receives data before gaining the memcg while the msk
socket lock is held at accept time, or the PM locks the msk socket
while it is still unaccepted and subflows push data to it at the same
time, mptcp_graph_subflows() can complete with a non-empty backlog.

The msk will try to borrow such memory, but (some of) the skbs there
were not memcg charged. When the msk finally returns such accounted
memory, we should hit the same splat as in #597.
[even if so far I was unable to replicate this scenario]

This patch tries to address such a potential issue by:
- preventing the subflow from queuing data into the backlog after
  gaining the memcg. This ensures that at the end of the loop all the
  skbs in the backlog (if any) are _not_ memory accounted.
- memcg charging the backlog to the msk
- 'restarting' the subflows and spooling any data waiting there.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 44 insertions(+), 2 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 5e9325c7ea9c..d6b08e1de358 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -4082,10 +4082,12 @@ static void mptcp_graph_subflows(struct sock *sk)
{
struct mptcp_subflow_context *subflow;
struct mptcp_sock *msk = mptcp_sk(sk);
+ struct sock *ssk;
+ int old_amt, amt;
+ bool slow;
mptcp_for_each_subflow(msk, subflow) {
- struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- bool slow;
+ ssk = mptcp_subflow_tcp_sock(subflow);
slow = lock_sock_fast(ssk);
@@ -4095,8 +4097,48 @@ static void mptcp_graph_subflows(struct sock *sk)
if (!ssk->sk_socket)
mptcp_sock_graft(ssk, sk->sk_socket);
+ if (!mem_cgroup_from_sk(sk))
+ goto unlock;
+
__mptcp_inherit_cgrp_data(sk, ssk);
__mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
+
+ /* Prevent subflows from queueing data into the backlog
+ * as soon as cg is set; note that we can't race
+ * with __mptcp_close_ssk setting this bit for a really
+ * closing socket, because we hold the msk socket lock here.
+ */
+ subflow->closing = 1;
+
+unlock:
+ unlock_sock_fast(ssk, slow);
+ }
+
+ if (!mem_cgroup_from_sk(sk))
+ return;
+
+ /* Charge the bl memory, note that __sk_charge accounted for
+ * fwd memory and rmem only
+ */
+ mptcp_data_lock(sk);
+ old_amt = sk_mem_pages(sk->sk_forward_alloc +
+ atomic_read(&sk->sk_rmem_alloc));
+ amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
+ atomic_read(&sk->sk_rmem_alloc));
+ amt -= old_amt;
+ if (amt)
+ mem_cgroup_sk_charge(sk, amt, GFP_ATOMIC | __GFP_NOFAIL);
+ mptcp_data_unlock(sk);
+
+ /* Finally let the subflow restart queuing data. */
+ mptcp_for_each_subflow(msk, subflow) {
+ ssk = mptcp_subflow_tcp_sock(subflow);
+
+ slow = lock_sock_fast(ssk);
+ subflow->closing = 0;
+
+ if (mptcp_subflow_data_available(ssk))
+ mptcp_data_ready(sk, ssk);
unlock_sock_fast(ssk, slow);
}
}
--
2.51.0
Hi Paolo,
On 09/11/2025 14:53, Paolo Abeni wrote:
> If a subflow receives data before gaining the memcg while the msk
> socket lock is held at accept time, or the PM locks the msk socket
> while still unaccepted and subflows push data to it at the same time,
> the mptcp_graph_subflows() can complete with a non empty backlog.
>
> The msk will try to borrow such memory, but (some) of the skbs there
> where not memcg charged. When the msk finally will return such accounted
> memory, we should hit the same splat of #597.
> [even if so far I was unable to replicate this scenario]
>
> This patch tries to address such potential issue by:
> - preventing the subflow from queuing data into the backlog after
> gaining the memcg. This ensure that at the end of the look all the
> skbs in the backlog (if any) are _not_ memory accounted.
> - mem charge the backlog to msk
> - 'restart' the subflow and spool any data waiting there.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/protocol.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 5e9325c7ea9c..d6b08e1de358 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -4082,10 +4082,12 @@ static void mptcp_graph_subflows(struct sock *sk)
> {
> struct mptcp_subflow_context *subflow;
> struct mptcp_sock *msk = mptcp_sk(sk);
> + struct sock *ssk;
> + int old_amt, amt;
> + bool slow;
>
> mptcp_for_each_subflow(msk, subflow) {
> - struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
> - bool slow;
> + ssk = mptcp_subflow_tcp_sock(subflow);
>
> slow = lock_sock_fast(ssk);
>
> @@ -4095,8 +4097,48 @@ static void mptcp_graph_subflows(struct sock *sk)
> if (!ssk->sk_socket)
> mptcp_sock_graft(ssk, sk->sk_socket);
>
> + if (!mem_cgroup_from_sk(sk))
Should we not call mem_cgroup_sk_enabled() instead? It does this:
return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
That's what is done in net/core/sock.c and net/ipv4/tcp_output.c. Not in
__inet_accept(), because mem_cgroup_sockets_enabled() is checked before.
Maybe we should do the same here?
(Note that it is not clear to me if mem_cgroup can be enabled later on,
and if yes, what should be done with existing connections.)
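Just to illustrate, the check I have in mind would simply be (untested,
same logic, only the helper changes):

		if (!mem_cgroup_sk_enabled(sk))
			goto unlock;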
> + goto unlock;
> +
> __mptcp_inherit_cgrp_data(sk, ssk);
> __mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
> +
> + /* Prevent subflows from queueing data into the backlog
> + * as soon as cg is set; note that we can't race
> + * with __mptcp_close_ssk setting this bit for a really
> + * closing socket, because we hold the msk socket lock here.
> + */
> + subflow->closing = 1;
> +
> +unlock:
> + unlock_sock_fast(ssk, slow);
> + }
> +
> + if (!mem_cgroup_from_sk(sk))
Same here?
> + return;
> +
> + /* Charge the bl memory, note that __sk_charge accounted for
> + * fwd memory and rmem only
> + */
> + mptcp_data_lock(sk);
> + old_amt = sk_mem_pages(sk->sk_forward_alloc +
> + atomic_read(&sk->sk_rmem_alloc));
> + amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
> + atomic_read(&sk->sk_rmem_alloc));
(Same as Geliang for the alignment here, and eventually calling
kmem_cache_charge() like in __inet_accept())
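For reference, the charging part of __inet_accept() looks roughly like
this (quoting from memory, locking elided, to be double-checked against
the current tree):

	if (mem_cgroup_sockets_enabled) {
		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
		int amt = 0;

		mem_cgroup_sk_alloc(newsk);
		if (mem_cgroup_from_sk(newsk))
			amt = sk_mem_pages(newsk->sk_forward_alloc +
					   atomic_read(&newsk->sk_rmem_alloc));
		if (amt)
			mem_cgroup_sk_charge(newsk, amt, gfp);

		kmem_cache_charge(newsk, gfp);
	}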
> + amt -= old_amt;
> + if (amt)
> + mem_cgroup_sk_charge(sk, amt, GFP_ATOMIC | __GFP_NOFAIL);
Just to be sure: no need to check if there was an error? It is not done
in __inet_accept() either, so I guess no?
> + mptcp_data_unlock(sk);
> +
> + /* Finally let the subflow restart queuing data. */
> + mptcp_for_each_subflow(msk, subflow) {
> + ssk = mptcp_subflow_tcp_sock(subflow);
> +
> + slow = lock_sock_fast(ssk);
> + subflow->closing = 0;
> +
> + if (mptcp_subflow_data_available(ssk))
> + mptcp_data_ready(sk, ssk);
> unlock_sock_fast(ssk, slow);
> }
> }
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
On 11/11/25 5:21 PM, Matthieu Baerts wrote:
> On 09/11/2025 14:53, Paolo Abeni wrote:
>> If a subflow receives data before gaining the memcg while the msk
>> socket lock is held at accept time, or the PM locks the msk socket
>> while still unaccepted and subflows push data to it at the same time,
>> the mptcp_graph_subflows() can complete with a non empty backlog.
>>
>> The msk will try to borrow such memory, but (some) of the skbs there
>> where not memcg charged. When the msk finally will return such accounted
>> memory, we should hit the same splat of #597.
>> [even if so far I was unable to replicate this scenario]
>>
>> This patch tries to address such potential issue by:
>> - preventing the subflow from queuing data into the backlog after
>> gaining the memcg. This ensure that at the end of the look all the
>> skbs in the backlog (if any) are _not_ memory accounted.
>> - mem charge the backlog to msk
>> - 'restart' the subflow and spool any data waiting there.
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> net/mptcp/protocol.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 44 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index 5e9325c7ea9c..d6b08e1de358 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
>> @@ -4082,10 +4082,12 @@ static void mptcp_graph_subflows(struct sock *sk)
>> {
>> struct mptcp_subflow_context *subflow;
>> struct mptcp_sock *msk = mptcp_sk(sk);
>> + struct sock *ssk;
>> + int old_amt, amt;
>> + bool slow;
>>
>> mptcp_for_each_subflow(msk, subflow) {
>> - struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>> - bool slow;
>> + ssk = mptcp_subflow_tcp_sock(subflow);
>>
>> slow = lock_sock_fast(ssk);
>>
>> @@ -4095,8 +4097,48 @@ static void mptcp_graph_subflows(struct sock *sk)
>> if (!ssk->sk_socket)
>> mptcp_sock_graft(ssk, sk->sk_socket);
>>
>> + if (!mem_cgroup_from_sk(sk))
>
> Should we not call mem_cgroup_sk_enabled() instead? It does this:
>
> return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
>
> That's what is done in net/core/sock.c and net/ipv4/tcp_output.c. Not in
> __inet_accept(), because mem_cgroup_sockets_enabled() is checked before.
> Maybe we should do the same here?
>
> (Note that it is not clear to me if mem_cgroup can be enabled later on,
> and if yes, what should be done with existing connections.)
It's just an additional optimization to leverage the static branch, but
it's not strictly needed. It can be added, then.
>> + goto unlock;
>> +
>> __mptcp_inherit_cgrp_data(sk, ssk);
>> __mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
>> +
>> + /* Prevent subflows from queueing data into the backlog
>> + * as soon as cg is set; note that we can't race
>> + * with __mptcp_close_ssk setting this bit for a really
>> + * closing socket, because we hold the msk socket lock here.
>> + */
>> + subflow->closing = 1;
>> +
>> +unlock:
>> + unlock_sock_fast(ssk, slow);
>> + }
>> +
>> + if (!mem_cgroup_from_sk(sk))
>
> Same here?
>
>> + return;
>> +
>> + /* Charge the bl memory, note that __sk_charge accounted for
>> + * fwd memory and rmem only
>> + */
>> + mptcp_data_lock(sk);
>> + old_amt = sk_mem_pages(sk->sk_forward_alloc +
>> + atomic_read(&sk->sk_rmem_alloc));
>> + amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
>> + atomic_read(&sk->sk_rmem_alloc));
>
> (Same as Geliang for the alignment here, and eventually calling
> kmem_cache_charge() like in __inet_accept())
This and the next one are the more obscure points. I chose not to call
kmem_cache_charge() because I am (was) a bit doubtful about such a call
being legit in __inet_accept(): active (plain TCP) sockets are not
accounted, just passive ones. Re-thinking about it, I guess it's better
to be consistent with TCP than to try to be smarter (history has proved
that does not work so well :-P)

TL;DR: I'll add the missing kmem_cache_charge();
>> + amt -= old_amt;
>> + if (amt)
>> + mem_cgroup_sk_charge(sk, amt, GFP_ATOMIC | __GFP_NOFAIL);
>
> Just to be sure: no need to check if there was an error? It is not done
> in __inet_accept() either, so I guess no?
The __GFP_NOFAIL flag ensures that the call cannot fail. Admittedly,
combining it with GFP_ATOMIC is at least "original" (this is the only
call site with this flag combination). In the next version I'll move
the call outside the spinlock (we are still under the msk socket lock)
so that GFP_ATOMIC can be replaced with GFP_KERNEL.
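I.e. something along these lines (untested, just to show the intended
ordering; the kmem_cache_charge() call is the one mentioned above):

	mptcp_data_lock(sk);
	old_amt = sk_mem_pages(sk->sk_forward_alloc +
			       atomic_read(&sk->sk_rmem_alloc));
	amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
			   atomic_read(&sk->sk_rmem_alloc)) - old_amt;
	mptcp_data_unlock(sk);

	/* subflows can't queue into the backlog at this point and we are
	 * still under the msk socket lock, so GFP_KERNEL is usable
	 */
	if (amt)
		mem_cgroup_sk_charge(sk, amt, GFP_KERNEL | __GFP_NOFAIL);
	kmem_cache_charge(sk, GFP_KERNEL | __GFP_NOFAIL);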
Many thanks for all the review effort!
/P
Hi Paolo,
Thank you for your reply!
On 12/11/2025 10:24, Paolo Abeni wrote:
> On 11/11/25 5:21 PM, Matthieu Baerts wrote:
>> On 09/11/2025 14:53, Paolo Abeni wrote:
>>> If a subflow receives data before gaining the memcg while the msk
>>> socket lock is held at accept time, or the PM locks the msk socket
>>> while still unaccepted and subflows push data to it at the same time,
>>> the mptcp_graph_subflows() can complete with a non empty backlog.
>>>
>>> The msk will try to borrow such memory, but (some) of the skbs there
>>> where not memcg charged. When the msk finally will return such accounted
>>> memory, we should hit the same splat of #597.
>>> [even if so far I was unable to replicate this scenario]
>>>
>>> This patch tries to address such potential issue by:
>>> - preventing the subflow from queuing data into the backlog after
>>> gaining the memcg. This ensure that at the end of the look all the
>>> skbs in the backlog (if any) are _not_ memory accounted.
>>> - mem charge the backlog to msk
>>> - 'restart' the subflow and spool any data waiting there.
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> net/mptcp/protocol.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 44 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index 5e9325c7ea9c..d6b08e1de358 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>>> @@ -4082,10 +4082,12 @@ static void mptcp_graph_subflows(struct sock *sk)
>>> {
>>> struct mptcp_subflow_context *subflow;
>>> struct mptcp_sock *msk = mptcp_sk(sk);
>>> + struct sock *ssk;
>>> + int old_amt, amt;
>>> + bool slow;
>>>
>>> mptcp_for_each_subflow(msk, subflow) {
>>> - struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>>> - bool slow;
>>> + ssk = mptcp_subflow_tcp_sock(subflow);
>>>
>>> slow = lock_sock_fast(ssk);
>>>
>>> @@ -4095,8 +4097,48 @@ static void mptcp_graph_subflows(struct sock *sk)
>>> if (!ssk->sk_socket)
>>> mptcp_sock_graft(ssk, sk->sk_socket);
>>>
>>> + if (!mem_cgroup_from_sk(sk))
>>
>> Should we not call mem_cgroup_sk_enabled() instead? It does this:
>>
>> return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
>>
>> That's what is done in net/core/sock.c and net/ipv4/tcp_output.c. Not in
>> __inet_accept(), because mem_cgroup_sockets_enabled() is checked before.
>> Maybe we should do the same here?
>>
>> (Note that it is not clear to me if mem_cgroup can be enabled later on,
>> and if yes, what should be done with existing connections.)
>
> It's just an additional optimization, to leverage static branch, but
> it's not strictly needed. Can be added, thus.
I see, thank you!
While at it, should we call mem_cgroup_*() only once by using local
variables?
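Something like this (untested sketch, unrelated lines elided):

	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);

	mptcp_for_each_subflow(msk, subflow) {
		/* ... */
		if (!memcg)
			goto unlock;
		/* ... */
	}

	if (!memcg)
		return;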
>>> + goto unlock;
>>> +
>>> __mptcp_inherit_cgrp_data(sk, ssk);
>>> __mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
>>> +
>>> + /* Prevent subflows from queueing data into the backlog
>>> + * as soon as cg is set; note that we can't race
>>> + * with __mptcp_close_ssk setting this bit for a really
>>> + * closing socket, because we hold the msk socket lock here.
>>> + */
>>> + subflow->closing = 1;
>>> +
>>> +unlock:
>>> + unlock_sock_fast(ssk, slow);
>>> + }
>>> +
>>> + if (!mem_cgroup_from_sk(sk))
>>
>> Same here?
>>
>>> + return;
>>> +
>>> + /* Charge the bl memory, note that __sk_charge accounted for
>>> + * fwd memory and rmem only
>>> + */
>>> + mptcp_data_lock(sk);
>>> + old_amt = sk_mem_pages(sk->sk_forward_alloc +
>>> + atomic_read(&sk->sk_rmem_alloc));
>>> + amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
>>> + atomic_read(&sk->sk_rmem_alloc));
>>
>> (Same as Geliang for the alignment here, and eventually calling
>> kmem_cache_charge() like in __inet_accept())
>
> This and the next are the more obscure point. I chose to not call
> kmem_cache_charge() because I'm (was) a bit doubtful about such call
> being legit in __inet_accept(): active (plain TCP) sockets are not
> accounted, just passive ones. Re-thinking about it I guess it's better
> to be consistent with TCP than trying to be smarted (history has proved
> it does not work so well :-P)
>
> TL;DR: I'll add the missing kmem_cache_charge();
:)
Indeed, safer. If it is changed on the TCP side, hopefully the same
will be done on the MPTCP side.
>>> + amt -= old_amt;
>>> + if (amt)
>>> + mem_cgroup_sk_charge(sk, amt, GFP_ATOMIC | __GFP_NOFAIL);
>>
>> Just to be sure: no need to check if there was an error? It is not done
>> in __inet_accept() either, so I guess no?
>
> The __GFP_NOFAIL flag ensures that the call can not fail.
Of course, I missed that!
> Adding it with
> GFP_ATOMIC is at least "original" (this is the only call site with this
> flags combo). In the next version I'll move the call outside the
> spinlock (we are still under the msk socket lock) to replace GFP_ATOMIC
> with GFP_KERNEL.
Good idea! Also better to align with what is done with TCP.
> Many thanks for all the review effort!
That's the least I can do with all the new fixes and optimisations :)
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Hi Paolo,
On 11/11/2025 17:21, Matthieu Baerts wrote:
> Hi Paolo,
>
> On 09/11/2025 14:53, Paolo Abeni wrote:
>> If a subflow receives data before gaining the memcg while the msk
>> socket lock is held at accept time, or the PM locks the msk socket
>> while still unaccepted and subflows push data to it at the same time,
>> the mptcp_graph_subflows() can complete with a non empty backlog.
>>
>> The msk will try to borrow such memory, but (some) of the skbs there
>> where not memcg charged. When the msk finally will return such accounted
>> memory, we should hit the same splat of #597.
>> [even if so far I was unable to replicate this scenario]
>>
>> This patch tries to address such potential issue by:
>> - preventing the subflow from queuing data into the backlog after
>> gaining the memcg. This ensure that at the end of the look all the
>> skbs in the backlog (if any) are _not_ memory accounted.
>> - mem charge the backlog to msk
>> - 'restart' the subflow and spool any data waiting there.
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> net/mptcp/protocol.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 44 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index 5e9325c7ea9c..d6b08e1de358 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
>> @@ -4082,10 +4082,12 @@ static void mptcp_graph_subflows(struct sock *sk)
>> {
>> struct mptcp_subflow_context *subflow;
>> struct mptcp_sock *msk = mptcp_sk(sk);
>> + struct sock *ssk;
>> + int old_amt, amt;
>> + bool slow;
>>
>> mptcp_for_each_subflow(msk, subflow) {
>> - struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>> - bool slow;
>> + ssk = mptcp_subflow_tcp_sock(subflow);
>>
>> slow = lock_sock_fast(ssk);
>>
>> @@ -4095,8 +4097,48 @@ static void mptcp_graph_subflows(struct sock *sk)
>> if (!ssk->sk_socket)
>> mptcp_sock_graft(ssk, sk->sk_socket);
>>
>> + if (!mem_cgroup_from_sk(sk))
>
> Should we not call mem_cgroup_sk_enabled() instead? It does this:
>
> return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
>
> That's what is done in net/core/sock.c and net/ipv4/tcp_output.c. Not in
> __inet_accept(), because mem_cgroup_sockets_enabled() is checked before.
> Maybe we should do the same here?
>
> (Note that it is not clear to me if mem_cgroup can be enabled later on,
> and if yes, what should be done with existing connections.)
Also, do you not still need to call __mptcp_inherit_cgrp_data() even if
!mem_cgroup_sockets_enabled() (or !mem_cgroup_from_sk())?
I guess the two are often linked, but can they not be used independently?
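E.g. something like this (untested), in case the two can really be used
independently:

		__mptcp_inherit_cgrp_data(sk, ssk);

		if (!mem_cgroup_from_sk(sk))
			goto unlock;

		__mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
		subflow->closing = 1;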
Cheers,
Matt
>
>> + goto unlock;
>> +
>> __mptcp_inherit_cgrp_data(sk, ssk);
>> __mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
>> +
>> + /* Prevent subflows from queueing data into the backlog
>> + * as soon as cg is set; note that we can't race
>> + * with __mptcp_close_ssk setting this bit for a really
>> + * closing socket, because we hold the msk socket lock here.
>> + */
>> + subflow->closing = 1;
>> +
>> +unlock:
>> + unlock_sock_fast(ssk, slow);
>> + }
>> +
>> + if (!mem_cgroup_from_sk(sk))
>
> Same here?
>
>> + return;
>> +
>> + /* Charge the bl memory, note that __sk_charge accounted for
>> + * fwd memory and rmem only
>> + */
>> + mptcp_data_lock(sk);
>> + old_amt = sk_mem_pages(sk->sk_forward_alloc +
>> + atomic_read(&sk->sk_rmem_alloc));
>> + amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
>> + atomic_read(&sk->sk_rmem_alloc));
>
> (Same as Geliang for the alignment here, and eventually calling
> kmem_cache_charge() like in __inet_accept())
>
>> + amt -= old_amt;
>> + if (amt)
>> + mem_cgroup_sk_charge(sk, amt, GFP_ATOMIC | __GFP_NOFAIL);
>
> Just to be sure: no need to check if there was an error? It is not done
> in __inet_accept() either, so I guess no?
>
>> + mptcp_data_unlock(sk);
>> +
>> + /* Finally let the subflow restart queuing data. */
>> + mptcp_for_each_subflow(msk, subflow) {
>> + ssk = mptcp_subflow_tcp_sock(subflow);
>> +
>> + slow = lock_sock_fast(ssk);
>> + subflow->closing = 0;
>> +
>> + if (mptcp_subflow_data_available(ssk))
>> + mptcp_data_ready(sk, ssk);
>> unlock_sock_fast(ssk, slow);
>> }
>> }
--
Sponsored by the NGI0 Core fund.
Hi Paolo,
Thanks for this fix.
On Sun, 2025-11-09 at 14:53 +0100, Paolo Abeni wrote:
> If a subflow receives data before gaining the memcg while the msk
> socket lock is held at accept time, or the PM locks the msk socket
> while still unaccepted and subflows push data to it at the same time,
> the mptcp_graph_subflows() can complete with a non empty backlog.
>
> The msk will try to borrow such memory, but (some) of the skbs there
> where not memcg charged. When the msk finally will return such
> accounted
> memory, we should hit the same splat of #597.
> [even if so far I was unable to replicate this scenario]
>
> This patch tries to address such potential issue by:
> - preventing the subflow from queuing data into the backlog after
> gaining the memcg. This ensure that at the end of the look all the
> skbs in the backlog (if any) are _not_ memory accounted.
> - mem charge the backlog to msk
> - 'restart' the subflow and spool any data waiting there.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/protocol.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 5e9325c7ea9c..d6b08e1de358 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -4082,10 +4082,12 @@ static void mptcp_graph_subflows(struct sock *sk)
> {
> struct mptcp_subflow_context *subflow;
> struct mptcp_sock *msk = mptcp_sk(sk);
> + struct sock *ssk;
> + int old_amt, amt;
> + bool slow;
>
> mptcp_for_each_subflow(msk, subflow) {
> - struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
> - bool slow;
> + ssk = mptcp_subflow_tcp_sock(subflow);
>
> slow = lock_sock_fast(ssk);
>
> @@ -4095,8 +4097,48 @@ static void mptcp_graph_subflows(struct sock *sk)
> if (!ssk->sk_socket)
> mptcp_sock_graft(ssk, sk->sk_socket);
>
> + if (!mem_cgroup_from_sk(sk))
> + goto unlock;
I think it's better to use "continue" here, just like in v1, so that
other subflows also have a chance to call mptcp_sock_graft(), but we
need to call unlock_sock_fast() before "continue".
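I.e. something like (untested):

		if (!mem_cgroup_from_sk(sk)) {
			unlock_sock_fast(ssk, slow);
			continue;
		}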
Besides, wouldn't it be more appropriate to squash these lines into
"mptcp: fix memcg accounting for passive sockets"?
> +
> __mptcp_inherit_cgrp_data(sk, ssk);
> __mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
> +
> + /* Prevent subflows from queueing data into the backlog
> + * as soon as cg is set; note that we can't race
> + * with __mptcp_close_ssk setting this bit for a really
> + * closing socket, because we hold the msk socket lock here.
> + */
> + subflow->closing = 1;
> +
> +unlock:
> + unlock_sock_fast(ssk, slow);
> + }
> +
> + if (!mem_cgroup_from_sk(sk))
> + return;
> +
> + /* Charge the bl memory, note that __sk_charge accounted for
> + * fwd memory and rmem only
> + */
> + mptcp_data_lock(sk);
> + old_amt = sk_mem_pages(sk->sk_forward_alloc +
> + atomic_read(&sk->sk_rmem_alloc));
> + amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
> + atomic_read(&sk->sk_rmem_alloc));
The code here is not aligned properly.
> + amt -= old_amt;
> + if (amt)
> + mem_cgroup_sk_charge(sk, amt, GFP_ATOMIC | __GFP_NOFAIL);
I'm not sure if we need to call kmem_cache_charge() here, just like in
__sk_charge().
WDYT?
Thanks,
-Geliang
> + mptcp_data_unlock(sk);
> +
> + /* Finally let the subflow restart queuing data. */
> + mptcp_for_each_subflow(msk, subflow) {
> + ssk = mptcp_subflow_tcp_sock(subflow);
> +
> + slow = lock_sock_fast(ssk);
> + subflow->closing = 0;
> +
> + if (mptcp_subflow_data_available(ssk))
> + mptcp_data_ready(sk, ssk);
> unlock_sock_fast(ssk, slow);
> }
> }
Hi Geliang,
On 11/11/2025 08:21, Geliang Tang wrote:
> Hi Paolo,
>
> Thanks for this fix.
>
> On Sun, 2025-11-09 at 14:53 +0100, Paolo Abeni wrote:
>> If a subflow receives data before gaining the memcg while the msk
>> socket lock is held at accept time, or the PM locks the msk socket
>> while still unaccepted and subflows push data to it at the same time,
>> the mptcp_graph_subflows() can complete with a non empty backlog.
>>
>> The msk will try to borrow such memory, but (some) of the skbs there
>> where not memcg charged. When the msk finally will return such
>> accounted
>> memory, we should hit the same splat of #597.
>> [even if so far I was unable to replicate this scenario]
>>
>> This patch tries to address such potential issue by:
>> - preventing the subflow from queuing data into the backlog after
>> gaining the memcg. This ensure that at the end of the look all the
>> skbs in the backlog (if any) are _not_ memory accounted.
>> - mem charge the backlog to msk
>> - 'restart' the subflow and spool any data waiting there.
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> net/mptcp/protocol.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 44 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index 5e9325c7ea9c..d6b08e1de358 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
>> @@ -4082,10 +4082,12 @@ static void mptcp_graph_subflows(struct sock *sk)
>> {
>> struct mptcp_subflow_context *subflow;
>> struct mptcp_sock *msk = mptcp_sk(sk);
>> + struct sock *ssk;
>> + int old_amt, amt;
>> + bool slow;
>>
>> mptcp_for_each_subflow(msk, subflow) {
>> - struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>> - bool slow;
>> + ssk = mptcp_subflow_tcp_sock(subflow);
>>
>> slow = lock_sock_fast(ssk);
>>
>> @@ -4095,8 +4097,48 @@ static void mptcp_graph_subflows(struct sock *sk)
>> if (!ssk->sk_socket)
>> mptcp_sock_graft(ssk, sk->sk_socket);
>>
>> + if (!mem_cgroup_from_sk(sk))
>> + goto unlock;
>
> I think it's better to use "continue" here, just like in v1, so that
> other subflows also have a chance to call mptcp_sock_graft(), but we
> need to call unlock_sock_fast() before "continue".
Mmh, that's what this code is doing: unlock and continue, no?
> Besides, wouldn't it be more appropriate to squash these lines into
> "mptcp: fix memcg accounting for passive sockets"?
From what I understood, that's not strictly needed: such checks are
done in the helpers below; here it is just to avoid doing the same
check yet another time before setting 'subflow->closing = 1'.
>
>> +
>> __mptcp_inherit_cgrp_data(sk, ssk);
>> __mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
>> +
>> + /* Prevent subflows from queueing data into the backlog
>> + * as soon as cg is set; note that we can't race
>> + * with __mptcp_close_ssk setting this bit for a really
>> + * closing socket, because we hold the msk socket lock here.
>> + */
>> + */
>> + subflow->closing = 1;
>> +
>> +unlock:
>> + unlock_sock_fast(ssk, slow);
>> + }
>> +
>> + if (!mem_cgroup_from_sk(sk))
>> + return;
>> +
>> + /* Charge the bl memory, note that __sk_charge accounted for
>> + * fwd memory and rmem only
>> + */
>> + mptcp_data_lock(sk);
>> + old_amt = sk_mem_pages(sk->sk_forward_alloc +
>> + atomic_read(&sk->sk_rmem_alloc));
>> + amt = sk_mem_pages(msk->backlog_len + sk->sk_forward_alloc +
>> + atomic_read(&sk->sk_rmem_alloc));
>
> The code here is not aligned properly.
>
>> + amt -= old_amt;
>> + if (amt)
>> + mem_cgroup_sk_charge(sk, amt, GFP_ATOMIC | __GFP_NOFAIL);
>
> I'm not sure if we need to call kmem_cache_charge() here, just like in
> __sk_charge().
>
> WDYT?
>
> Thanks,
> -Geliang
>
>> + mptcp_data_unlock(sk);
>> +
>> + /* Finally let the subflow restart queuing data. */
>> + mptcp_for_each_subflow(msk, subflow) {
>> + ssk = mptcp_subflow_tcp_sock(subflow);
>> +
>> + slow = lock_sock_fast(ssk);
>> + subflow->closing = 0;
>> +
>> + if (mptcp_subflow_data_available(ssk))
>> + mptcp_data_ready(sk, ssk);
>> unlock_sock_fast(ssk, slow);
>> }
>> }
>
>
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Hi Paolo,
Thank you for your modifications, that's great!
Our CI did some validations and here is its report:
- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/19214327116
Initiator: Matthieu Baerts (NGI0)
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/d100023cce38
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1021300
If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:
$ cd [kernel source code]
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-normal
For more details:
https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)
Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)