Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.
When running under a non-root cgroup, this memory is also charged to the
memcg as "sock" in memory.stat.
Even when a memcg controls memory usage, sockets of such protocols are
still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.
If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.
In reality, this assumption does not always hold, and processes that
belong to the root cgroup or opt out of memcg can consume memory up to
the global limit, becoming a noisy neighbour.
Let's decouple a memcg from the global per-protocol memory accounting
if the memcg has a finite memory.max (!= "max").
We still charge memory to both the memcg and the per-protocol counters
if the memcg has "max" in memory.max, because such sockets are bounded
only by the global limits (TCP allows only 10% of physical memory by
default).
This simplifies memcg configuration while keeping the global limits
within a reasonable range.
If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.
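As a rough sketch of the intent (the actual helper is added earlier in
the series; the flag name and the caching in the low bits of
sk->sk_memcg below are illustrative assumptions based on the discussion
further down, not the real implementation):

  /* Illustrative sketch only, not the helper added by this series. */
  #define SK_MEMCG_ISOLATED	1UL	/* hypothetical flag bit */

  static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
  {
  	/* Assumed to be set when sk->sk_memcg is assigned and the memcg
  	 * has a finite memory.max, i.e. the memcg limit alone is expected
  	 * to bound this socket's buffer memory.
  	 */
  	return (unsigned long)sk->sk_memcg & SK_MEMCG_ISOLATED;
  }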
In inet_csk_accept(), we need to reclaim the amount already charged to
the per-protocol counter for child sockets, because we do not allocate
sk->sk_memcg until accept() and the charge has already been made
globally by then.
Note that trace_sock_exceed_buf_limit() will always show 0 as the
accounted amount for isolated sockets, but the value can be obtained
via memory.stat.
Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
Setup:
# mkdir /sys/fs/cgroup/test
# echo $$ >> /sys/fs/cgroup/test/cgroup.procs
# sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
Without setting memory.max:
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 22642688
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53188
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:49972
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53868
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53554
# nstat | grep Pressure || echo no pressure
TcpExtTCPMemoryPressures 1 0.0
With memory.max:
# echo $((64 * 1024 ** 3)) > /sys/fs/cgroup/test/memory.max
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 2757468160
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:49026
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:45630
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:44870
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:45274
# nstat | grep Pressure || echo no pressure
no pressure
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v3:
* Fix build failure for kTLS
v2:
* Add sk_should_enter_memory_pressure() for
tcp_enter_memory_pressure() calls not in core
* Update example in changelog
---
include/net/proto_memory.h | 15 ++++++--
include/net/tcp.h | 10 ++++--
net/core/sock.c | 64 ++++++++++++++++++++++-----------
net/ipv4/inet_connection_sock.c | 18 ++++++++--
net/ipv4/tcp.c | 3 +-
net/ipv4/tcp_output.c | 10 ++++--
net/mptcp/protocol.c | 4 ++-
net/tls/tls_device.c | 4 ++-
8 files changed, 94 insertions(+), 34 deletions(-)
diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 8e91a8fa31b5..8e8432b13515 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,13 +31,22 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
if (!sk->sk_prot->memory_pressure)
return false;
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_isolated(sk))
+ return false;
+ }
return !!READ_ONCE(*sk->sk_prot->memory_pressure);
}
+static inline bool sk_should_enter_memory_pressure(struct sock *sk)
+{
+ return !mem_cgroup_sk_enabled(sk) || !mem_cgroup_sk_isolated(sk);
+}
+
static inline long
proto_memory_allocated(const struct proto *prot)
{
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950..0191a4585bba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
/* optimized version of sk_under_memory_pressure() for TCP sockets */
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_isolated(sk))
+ return false;
+ }
return READ_ONCE(tcp_memory_pressure);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index ab6953d295df..755540215570 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
if (!charged)
return -ENOMEM;
- /* pre-charge to forward_alloc */
- sk_memory_allocated_add(sk, pages);
- allocated = sk_memory_allocated(sk);
- /* If the system goes into memory pressure with this
- * precharge, give up and return error.
- */
- if (allocated > sk_prot_mem_limits(sk, 1)) {
- sk_memory_allocated_sub(sk, pages);
- mem_cgroup_sk_uncharge(sk, pages);
- return -ENOMEM;
+ if (!mem_cgroup_sk_isolated(sk)) {
+ /* pre-charge to forward_alloc */
+ sk_memory_allocated_add(sk, pages);
+ allocated = sk_memory_allocated(sk);
+
+ /* If the system goes into memory pressure with this
+ * precharge, give up and return error.
+ */
+ if (allocated > sk_prot_mem_limits(sk, 1)) {
+ sk_memory_allocated_sub(sk, pages);
+ mem_cgroup_sk_uncharge(sk, pages);
+ return -ENOMEM;
+ }
}
+
sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,11 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
return true;
- sk_enter_memory_pressure(sk);
+ if (sk_should_enter_memory_pressure(sk))
+ sk_enter_memory_pressure(sk);
+
sk_stream_moderate_sndbuf(sk);
+
return false;
}
EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3274,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
{
bool memcg_enabled = false, charged = false;
struct proto *prot = sk->sk_prot;
- long allocated;
-
- sk_memory_allocated_add(sk, amt);
- allocated = sk_memory_allocated(sk);
+ long allocated = 0;
if (mem_cgroup_sk_enabled(sk)) {
+ bool isolated = mem_cgroup_sk_isolated(sk);
+
memcg_enabled = true;
charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
- if (!charged)
+
+ if (isolated && charged)
+ return 1;
+
+ if (!charged) {
+ if (!isolated) {
+ sk_memory_allocated_add(sk, amt);
+ allocated = sk_memory_allocated(sk);
+ }
+
goto suppress_allocation;
+ }
}
+ sk_memory_allocated_add(sk, amt);
+ allocated = sk_memory_allocated(sk);
+
/* Under limit. */
if (allocated <= sk_prot_mem_limits(sk, 0)) {
sk_leave_memory_pressure(sk);
@@ -3357,7 +3376,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
- sk_memory_allocated_sub(sk, amt);
+ if (allocated)
+ sk_memory_allocated_sub(sk, amt);
if (charged)
mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3416,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
*/
void __sk_mem_reduce_allocated(struct sock *sk, int amount)
{
- sk_memory_allocated_sub(sk, amount);
-
- if (mem_cgroup_sk_enabled(sk))
+ if (mem_cgroup_sk_enabled(sk)) {
mem_cgroup_sk_uncharge(sk, amount);
+ if (mem_cgroup_sk_isolated(sk))
+ return;
+ }
+
+ sk_memory_allocated_sub(sk, amount);
+
if (sk_under_global_memory_pressure(sk) &&
(sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d..9d56085f7f54 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
#include <net/tcp.h>
#include <net/sock_reuseport.h>
#include <net/addrconf.h>
+#include <net/proto_memory.h>
#if IS_ENABLED(CONFIG_IPV6)
/* match_sk*_wildcard == true: IPV6_ADDR_ANY equals to any IPv6 addresses
@@ -710,7 +711,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
if (mem_cgroup_sockets_enabled) {
gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
- int amt = 0;
/* atomically get the memory usage, set and charge the
* newsk->sk_memcg.
@@ -719,15 +719,27 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
mem_cgroup_sk_alloc(newsk);
if (mem_cgroup_from_sk(newsk)) {
+ int amt;
+
/* The socket has not been accepted yet, no need
* to look at newsk->sk_wmem_queued.
*/
amt = sk_mem_pages(newsk->sk_forward_alloc +
atomic_read(&newsk->sk_rmem_alloc));
+ if (amt) {
+ /* This amt is already charged globally to
+ * sk_prot->memory_allocated due to lack of
+ * sk_memcg until accept(), thus we need to
+ * reclaim it here if newsk is isolated.
+ */
+ if (mem_cgroup_sk_isolated(newsk))
+ sk_memory_allocated_sub(newsk, amt);
+
+ mem_cgroup_sk_charge(newsk, amt, gfp);
+ }
+
}
- if (amt)
- mem_cgroup_sk_charge(newsk, amt, gfp);
kmem_cache_charge(newsk, gfp);
release_sock(newsk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71a956fbfc55..dcbd49e2f8af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
}
__kfree_skb(skb);
} else {
- sk->sk_prot->enter_memory_pressure(sk);
+ if (sk_should_enter_memory_pressure(sk))
+ tcp_enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
}
return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfbac0876d96..f7aa86661219 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3574,12 +3574,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
delta = size - sk->sk_forward_alloc;
if (delta <= 0)
return;
+
amt = sk_mem_pages(delta);
sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
- sk_memory_allocated_add(sk, amt);
- if (mem_cgroup_sk_enabled(sk))
+ if (mem_cgroup_sk_enabled(sk)) {
mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+ if (mem_cgroup_sk_isolated(sk))
+ return;
+ }
+
+ sk_memory_allocated_add(sk, amt);
}
/* Send a FIN. The caller locks the socket for us.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9a287b75c1b3..1a4089b05a16 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -16,6 +16,7 @@
#include <net/inet_common.h>
#include <net/inet_hashtables.h>
#include <net/protocol.h>
+#include <net/proto_memory.h>
#include <net/tcp_states.h>
#if IS_ENABLED(CONFIG_MPTCP_IPV6)
#include <net/transp_v6.h>
@@ -1016,8 +1017,9 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
mptcp_for_each_subflow(msk, subflow) {
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- if (first)
+ if (first && sk_should_enter_memory_pressure(sk))
tcp_enter_memory_pressure(ssk);
+
sk_stream_moderate_sndbuf(ssk);
first = false;
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..6696ef837116 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -35,6 +35,7 @@
#include <linux/netdevice.h>
#include <net/dst.h>
#include <net/inet_connection_sock.h>
+#include <net/proto_memory.h>
#include <net/tcp.h>
#include <net/tls.h>
#include <linux/skbuff_ref.h>
@@ -371,7 +372,8 @@ static int tls_do_allocation(struct sock *sk,
if (!offload_ctx->open_record) {
if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
sk->sk_allocation))) {
- READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+ if (sk_should_enter_memory_pressure(sk))
+ READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
return -ENOMEM;
}
--
2.51.0.rc0.205.g4a044479a3-goog

On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and processes that
> belong to the root cgroup or opt out of memcg can consume memory up to
> the global limit, becoming a noisy neighbour.

As per the last thread, this is not a supported usecase. Opting out of
memcg coverage for individual cgroups is a self-inflicted problem and
misconfiguration. There is *no* memory isolation *at all* on such
containers. Maybe their socket buffers is the only thing that happens
to matter to *you*, but this is in no way a generic, universal,
upstreamable solution. Knob or auto-detection is not the issue.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

On Wed, Aug 13, 2025 at 6:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and processes that
> > belong to the root cgroup or opt out of memcg can consume memory up to
> > the global limit, becoming a noisy neighbour.
>
> As per the last thread, this is not a supported usecase. Opting out of
> memcg coverage for individual cgroups is a self-inflicted problem and
> misconfiguration. There is *no* memory isolation *at all* on such
> containers.

I think the commit message needs to be improved, but could you read
throughout the patch again ? I think you have the same misunderstanding
that Shakeel had and corrected here.

https://lore.kernel.org/netdev/jmbszz4m7xkw7fzolpusjesbreaczmr4i64kynbs3zcoehrkpj@lwso5soc4dh3/

---8<---
Initially, I thought the series introduced multiple modes, including an
option to exclude network memory from memcg accounting. However, if I
understand correctly, that is not the case—the opt-out applies only to
the global TCP/UDP accounting. That’s a relief, and I apologize for the
misunderstanding.
---8<---

This patch does NOT change how memcg is applied to sockets but changes
how _another_ memory accounting in the networking layer is applied to
sockets.

Currently, memcg AND the other mem accounting are applied to socket
buffers.

With/without this patch, memcg is _always_ applied to socket buffers.

Also, there is _no_ behavioural change for _uncontrolled containers_
that have been subject to the two memory accounting.

This behaviour hasn't been changed since you added memcg support for
the networking stack in e805605c72102, and we want to _preserve_ this
behaviour.

This change stop double-charging by opting out of _the
networking layer one_ because it interferes with memcg
and complicates configuration of memory.max and the
global networking limit.

> Maybe their socket buffers is the only thing that happens
> to matter to *you*, but this is in no way a generic, universal,
> upstreamable solution. Knob or auto-detection is not the issue.
>
> Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

Please let me know if this nack still applies with the explanation
above.

On Wed, Aug 13, 2025 at 11:43:15AM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 13, 2025 at 6:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> This change stop double-charging by opting out of _the
> networking layer one_ because it interferes with memcg
> and complicates configuration of memory.max and the
> global networking limit.

No, we do want the global limits as a backstop - even if every single
cgroup in the system has its own memory limit.

Sure, from a fairness POV, we want socket buffers accounted towards the
containers' memory footprint and subject to their limits. But that
doesn't imply that we can let the cgroup limit be the only thing
curbing an explosion in socket buffers.

This isn't about fairness, but about host stability. The MM can easily
get rid of file cache and heap pages, but it has limited to no control
over the socket buffer lifetime.

If you split a 1TB host into 8 containers limited to ~128G, that
doesn't mean you want to allow up to 1TB of memory in socket buffers.
That could make low memory situations unrecoverable.

> > Maybe their socket buffers is the only thing that happens
> > to matter to *you*, but this is in no way a generic, universal,
> > upstreamable solution. Knob or auto-detection is not the issue.
> >
> > Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Please let me know if this nack still applies with the
> explanation above.

Yes, for one I think it's an unacceptable behavioral change of the
sysctl semantics.

But my wider point is that I think you're trying to fix something that
is a direct result of a flawed approach to containerization, and it
would make much more sense to address that instead.

On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as "sock" in memory.stat.
>
> Even when a memcg controls memory usage, sockets of such protocols are
> still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and processes that
> belong to the root cgroup or opt out of memcg can consume memory up to
> the global limit, becoming a noisy neighbour.

Processes running in root memcg (I am not sure what does 'opt out of
memcg means') means admin has intentionally allowed scenarios where
noisy neighbour situation can happen, so I am not really following your
argument here.

>
> Let's decouple memcg from the global per-protocol memory accounting if
> it has a finite memory.max (!= "max").

Why decouple only for some? (Also if you really want to check memcg
limits, you need to check limits for all ancestors and not just the
given memcg).

Why not start with just two global options (maybe start with boot
parameter)?

Option 1: Existing behavior where memcg and global TCP accounting are
coupled.

Option 2: Completely decouple memcg and global TCP accounting i.e. use
mem_cgroup_sockets_enabled to either do global TCP accounting or memcg
accounting.

Keep the option 1 default.

I assume you want third option where a mix of these options can happen
i.e. some sockets are only accounted to a memcg and some are accounted
to both memcg and global TCP. I would recommend to make that a followup
patch series. Keep this series simple and non-controversial.

On Wed, Aug 13, 2025 at 12:11 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as "sock" in memory.stat.
> >
> > Even when a memcg controls memory usage, sockets of such protocols are
> > still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and processes that
> > belong to the root cgroup or opt out of memcg can consume memory up to
> > the global limit, becoming a noisy neighbour.
>
> Processes running in root memcg (I am not sure what does 'opt out of
> memcg means')

Sorry, I should've clarified memory.max==max (and same
up to all ancestors as you pointed out below) as opt-out,
where memcg works but has no effect.

>
> means admin has intentionally allowed scenarios where

Not really intentionally, but rather reluctantly because the admin
cannot guarantee memory.max solely without tcp_mem=UINT_MAX.
We should not disregard the cause that the two mem accounting are
coupled now.

>
> noisy neighbour situation can happen, so I am not really following your
> argument here.

So basically here I meant with tcp_mem=UINT_MAX any process
can be noisy neighbour unnecessarily.

>
> >
> > Let's decouple memcg from the global per-protocol memory accounting if
> > it has a finite memory.max (!= "max").
>
> Why decouple only for some? (Also if you really want to check memcg
> limits, you need to check limits for all ancestors and not just the
> given memcg).

Oh, I assumed memory.max will be inherited to descendants.

>
> Why not start with just two global options (maybe start with boot
> parameter)?
>
> Option 1: Existing behavior where memcg and global TCP accounting are
> coupled.
>
> Option 2: Completely decouple memcg and global TCP accounting i.e. use
> mem_cgroup_sockets_enabled to either do global TCP accounting or memcg
> accounting.
>
> Keep the option 1 default.
>
> I assume you want third option where a mix of these options can happen
> i.e. some sockets are only accounted to a memcg and some are accounted
> to both memcg and global TCP.

Yes because usually not all memcg have memory.max configured
and we do not want to allow unlimited TCP memory for them.

Option 2 works for processes in the root cgroup but doesn't for
processes in non-root cgroup with memory.max == max.

A good example is system processes managed by systemd where
we do not want to specify memory.max but want a global seatbelt.

Note this is how it works _now_, and we want to _preserve_ the case.
Does this make sense ?

> why decouple only for some
> I would recommend to make that a followup
> patch series. Keep this series simple and non-controversial.

I can separate the series, but I'd like to make sure the
Option 2 is a must for you or Meta configured memory.max
for all cgroups ? I didn't think it's likely but if there's a real
use case, I'm happy to add a boot param.

The only diff would be boot param addition and the condition
change in patch 11 so simplicity won't change.

On Wed, Aug 13, 2025 at 11:19:31AM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 13, 2025 at 12:11 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > buffers and charge memory to per-protocol global counters pointed to by
> > > sk->sk_proto->memory_allocated.
> > >
> > > When running under a non-root cgroup, this memory is also charged to the
> > > memcg as "sock" in memory.stat.
> > >
> > > Even when a memcg controls memory usage, sockets of such protocols are
> > > still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > >
> > > This makes it difficult to accurately estimate and configure appropriate
> > > global limits, especially in multi-tenant environments.
> > >
> > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > >
> > > In reality, this assumption does not always hold, and processes that
> > > belong to the root cgroup or opt out of memcg can consume memory up to
> > > the global limit, becoming a noisy neighbour.
> >
> > Processes running in root memcg (I am not sure what does 'opt out of
> > memcg means')
>
> Sorry, I should've clarified memory.max==max (and same
> up to all ancestors as you pointed out below) as opt-out,
> where memcg works but has no effect.
> >
> > means admin has intentionally allowed scenarios where
>
> Not really intentionally, but rather reluctantly because the admin
> cannot guarantee memory.max solely without tcp_mem=UINT_MAX.
> We should not disregard the cause that the two mem accounting are
> coupled now.
> >
> > noisy neighbour situation can happen, so I am not really following your
> > argument here.
>
> So basically here I meant with tcp_mem=UINT_MAX any process
> can be noisy neighbour unnecessarily.

Only if there are processes in cgroups with unlimited memory limits.
I think you are still missing the point. So, let me be very clear:
Please stop using the "processes in cgroup with memory.max==max can be
source of isolation issues" argument. Having unlimited limit means you
don't want isolation. More importantly you don't really need this
argument for your work. It is clear (to me at least) that we want
global TCP memory accounting to be decoupled from memcg accounting.
Using the flawed argument is just making your series controversial.

[...]

> > Why not start with just two global options (maybe start with boot
> > parameter)?
> >
> > Option 1: Existing behavior where memcg and global TCP accounting are
> > coupled.
> >
> > Option 2: Completely decouple memcg and global TCP accounting i.e. use
> > mem_cgroup_sockets_enabled to either do global TCP accounting or memcg
> > accounting.
> >
> > Keep the option 1 default.
> >
> > I assume you want third option where a mix of these options can happen
> > i.e. some sockets are only accounted to a memcg and some are accounted
> > to both memcg and global TCP.
>
> Yes because usually not all memcg have memory.max configured
> and we do not want to allow unlimited TCP memory for them.
>
> Option 2 works for processes in the root cgroup but doesn't for
> processes in non-root cgroup with memory.max == max.
>
> A good example is system processes managed by systemd where
> we do not want to specify memory.max but want a global seatbelt.
>
> Note this is how it works _now_, and we want to _preserve_ the case.
> Does this make sense ?

> why decouple only for some
>
I hope I am very clear to stop using the memory.max == max argument.

>
> > I would recommend to make that a followup
> > patch series. Keep this series simple and non-controversial.
>
> I can separate the series, but I'd like to make sure the
> Option 2 is a must for you or Meta configured memory.max
> for all cgroups ? I didn't think it's likely but if there's a real
> use case, I'm happy to add a boot param.
>
> The only diff would be boot param addition and the condition
> change in patch 11 so simplicity won't change.

I am not sure if option 2 will be used by Meta or someone else, so no
objection from me to not pursue it. However I don't want some possibly
userspace policy to opt-in in one or the other accounting mechanism in
the kernel.

What I think is the right approach is to have BPF struct ops based
approach with possible callback 'is this socket under pressure' or maybe
'is this socket isolated' and then you can do whatever you want in those
callbacks. In this way your can follow the same approach of caching the
result in kernel (lower bits of sk->sk_memcg).

I am CCing bpf list to get some suggestions or concerns on this
approach.

On 8/13/25 1:53 PM, Shakeel Butt wrote:
> What I think is the right approach is to have BPF struct ops based
> approach with possible callback 'is this socket under pressure' or maybe
> 'is this socket isolated' and then you can do whatever you want in those
> callbacks. In this way your can follow the same approach of caching the
> result in kernel (lower bits of sk->sk_memcg).
>
> I am CCing bpf list to get some suggestions or concerns on this
> approach.

I have quickly looked at the set. In patch 11, it sets a bit in sk->sk_memcg.

On the bpf side, there are already cgroup bpf progs that can do bpf_setsockopt
on a sk, so the same can be done here. The bpf_setsockopt does not have to set
option/knob that is only available in the uapi in case we don't want to expose
this to the user space.

The cgroup bpf prog (BPF_CGROUP_INET_SOCK_CREATE) can already be run when a
"inet" sock is created. This hook (i.e. attach_type) does not have access to
bpf_setsockopt but should be easy to add.

For more comprehensive mem charge policy that needs new bpf hook, that probably
will need struct_ops instead of another cgroup attach_type but that will be
implementation details.

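As a rough illustration of the direction described above (not code from
this thread: it assumes bpf_setsockopt() access is added to the
BPF_CGROUP_INET_SOCK_CREATE hook as suggested, and the optname is made
up), a cgroup program marking sockets at creation time might look like:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #ifndef SOL_SOCKET
  #define SOL_SOCKET 1
  #endif

  /* Made-up optname for illustration; no such UAPI value exists here. */
  #define SK_BPF_MEMCG_SOCK_ISOLATED 1000

  SEC("cgroup/sock_create")
  int mark_isolated(struct bpf_sock *sk)
  {
  	int one = 1;

  	/* Assumes the hook gains bpf_setsockopt() access, which the
  	 * thread notes is not yet the case.
  	 */
  	bpf_setsockopt(sk, SOL_SOCKET, SK_BPF_MEMCG_SOCK_ISOLATED,
  		       &one, sizeof(one));
  	return 1;	/* allow the socket to be created */
  }

  char _license[] SEC("license") = "GPL";
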
On Wed, Aug 13, 2025 at 5:55 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/13/25 1:53 PM, Shakeel Butt wrote:
> > What I think is the right approach is to have BPF struct ops based
> > approach with possible callback 'is this socket under pressure' or maybe
> > 'is this socket isolated' and then you can do whatever you want in those
> > callbacks. In this way your can follow the same approach of caching the
> > result in kernel (lower bits of sk->sk_memcg).
> >
> > I am CCing bpf list to get some suggestions or concerns on this
> > approach.
>
> I have quickly looked at the set. In patch 11, it sets a bit in sk->sk_memcg.
>
> On the bpf side, there are already cgroup bpf progs that can do bpf_setsockopt
> on a sk, so the same can be done here. The bpf_setsockopt does not have to set
> option/knob that is only available in the uapi in case we don't want to expose
> this to the user space.
>
> The cgroup bpf prog (BPF_CGROUP_INET_SOCK_CREATE) can already be run when a
> "inet" sock is created. This hook (i.e. attach_type) does not have access to
> bpf_setsockopt but should be easy to add.

Okay, I will try the bpf_setsockopt() approach.
Should I post patch 1-10 to net-next separately ?
They are pure net material to gather memcg code under CONFIG_MEMCG.

> For more comprehensive mem charge policy that needs new bpf hook, that probably
> will need struct_ops instead of another cgroup attach_type but that will be
> implementation details.

On Wed, Aug 13, 2025 at 09:34:01PM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 13, 2025 at 5:55 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 8/13/25 1:53 PM, Shakeel Butt wrote:
> > > What I think is the right approach is to have BPF struct ops based
> > > approach with possible callback 'is this socket under pressure' or maybe
> > > 'is this socket isolated' and then you can do whatever you want in those
> > > callbacks. In this way your can follow the same approach of caching the
> > > result in kernel (lower bits of sk->sk_memcg).
> > >
> > > I am CCing bpf list to get some suggestions or concerns on this
> > > approach.
> >
> > I have quickly looked at the set. In patch 11, it sets a bit in sk->sk_memcg.
> >
> > On the bpf side, there are already cgroup bpf progs that can do bpf_setsockopt
> > on a sk, so the same can be done here. The bpf_setsockopt does not have to set
> > option/knob that is only available in the uapi in case we don't want to expose
> > this to the user space.
> >
> > The cgroup bpf prog (BPF_CGROUP_INET_SOCK_CREATE) can already be run when a
> > "inet" sock is created. This hook (i.e. attach_type) does not have access to
> > bpf_setsockopt but should be easy to add.
>
> Okay, I will try the bpf_setsockopt() approach.
> Should I post patch 1-10 to net-next separately ?
> They are pure net material to gather memcg code under CONFIG_MEMCG.

Yes please.

Kuniyuki Iwashima <kuniyu@google.com> writes:

> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as "sock" in memory.stat.
>
> Even when a memcg controls memory usage, sockets of such protocols are
> still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and processes that
> belong to the root cgroup or opt out of memcg can consume memory up to
> the global limit, becoming a noisy neighbour.
>
> Let's decouple memcg from the global per-protocol memory accounting if
> it has a finite memory.max (!= "max").

I think you can't make the new behavior as the new default, simple because
it might break existing setups. Basically anyone who is using memory.max
will suddenly have their processes being opted out of the net memory
accounting. Idk how many users are actually relying on the network
memory accounting, but I believe way more than 0.

So I guess a net sysctl/some other knob is warranted here, with the old
behavior being the default.

Thanks

On Tue, Aug 12, 2025 at 6:58 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Kuniyuki Iwashima <kuniyu@google.com> writes:
>
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as "sock" in memory.stat.
> >
> > Even when a memcg controls memory usage, sockets of such protocols are
> > still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and processes that
> > belong to the root cgroup or opt out of memcg can consume memory up to
> > the global limit, becoming a noisy neighbour.
> >
> > Let's decouple memcg from the global per-protocol memory accounting if
> > it has a finite memory.max (!= "max").
>
> I think you can't make the new behavior as the new default, simple because
> it might break existing setups. Basically anyone who is using memory.max
> will suddenly have their processes being opted out of the net memory
> accounting. Idk how many users are actually relying on the network
> memory accounting, but I believe way more than 0.
>
> So I guess a net sysctl/some other knob is warranted here, with the old
> behavior being the default.

I think we don't need a knob to change the behaviour because
the affected case must have a broken assumption.

There are 3 possible cases below.

1) memory.max == "max"
2) memory.max != "max" and tcp_mem does not suppress memory allocation
3) memory.max != "max" and tcp_mem suppresses memory allocation

1) is not affected, and 2) is not affected too because decoupling
does not change the situation.

Then, for 3), this change will allow more memory than ever, but
it's still limited by memory.max, which is configured by the sys
admin. If this could be a problem, then the total amount of all
memcg's memory.max should exceed the amount of system memory,
which is unlikely if configured properly.

Also, in the 3) case, TCP has quite bad performance and the sys
admin should have raised the tcp_mem limit and moved to 2) like
our unlimited tcp_mem setting.

What do you think ?