net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated.

[PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 3 weeks ago

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.

When running under a non-root cgroup, this memory is also charged to the
memcg as sock in memory.stat.

Even when memory usage is controlled by memcg, sockets using such protocols
are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).

This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.

If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.

In reality, this assumption does not always hold, and a single workload
that opts out of memcg can consume memory up to the global limit,
becoming a noisy neighbour.

Let's decouple memcg from the global per-protocol memory accounting.

This simplifies memcg configuration while keeping the global limits
within a reasonable range.

If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.

In inet_csk_accept(), we need to reclaim counts that are already charged
for child sockets because we do not allocate sk->sk_memcg until accept().

Note that trace_sock_exceed_buf_limit() will always show 0 as accounted
for the isolated sockets, but this can be obtained via memory.stat.

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"

Without memory.socket_isolated:

  # echo 0 > /sys/fs/cgroup/test/memory.socket_isolated
  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 24682496
  #  ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:54997    127.0.0.1:37738
  ESTAB 2000   0          127.0.0.1:54997    127.0.0.1:60122
  ESTAB 2000   0          127.0.0.1:54997    127.0.0.1:33622
  ESTAB 2000   0          127.0.0.1:54997    127.0.0.1:35042
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With memory.socket_isolated:

  # echo 1 > /sys/fs/cgroup/test/memory.socket_isolated
  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2766671872
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 112000 0           127.0.0.1:41729    127.0.0.1:35062
  ESTAB 110000 0           127.0.0.1:41729    127.0.0.1:36288
  ESTAB 112000 0           127.0.0.1:41729    127.0.0.1:37560
  ESTAB 112000 0           127.0.0.1:41729    127.0.0.1:37096
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/proto_memory.h      | 10 +++--
 include/net/tcp.h               | 10 +++--
 net/core/sock.c                 | 65 +++++++++++++++++++++++----------
 net/ipv4/inet_connection_sock.c | 18 +++++++--
 net/ipv4/tcp_output.c           | 10 ++++-
 5 files changed, 82 insertions(+), 31 deletions(-)

diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 8e91a8fa31b52..3c2e92f5a6866 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,9 +31,13 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9ffe971a1856b..a5ff82a59867b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return READ_ONCE(tcp_memory_pressure);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index ab6953d295dfa..e1ae6d03b8227 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	if (!charged)
 		return -ENOMEM;
 
-	/* pre-charge to forward_alloc */
-	sk_memory_allocated_add(sk, pages);
-	allocated = sk_memory_allocated(sk);
-	/* If the system goes into memory pressure with this
-	 * precharge, give up and return error.
-	 */
-	if (allocated > sk_prot_mem_limits(sk, 1)) {
-		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_sk_uncharge(sk, pages);
-		return -ENOMEM;
+	if (!mem_cgroup_sk_isolated(sk)) {
+		/* pre-charge to forward_alloc */
+		sk_memory_allocated_add(sk, pages);
+		allocated = sk_memory_allocated(sk);
+
+		/* If the system goes into memory pressure with this
+		 * precharge, give up and return error.
+		 */
+		if (allocated > sk_prot_mem_limits(sk, 1)) {
+			sk_memory_allocated_sub(sk, pages);
+			mem_cgroup_sk_uncharge(sk, pages);
+			return -ENOMEM;
+		}
 	}
+
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
 
 	WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,12 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
 		return true;
 
-	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
+
+	if (mem_cgroup_sk_enabled(sk) && mem_cgroup_sk_isolated(sk))
+		return false;
+
+	sk_enter_memory_pressure(sk);
 	return false;
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3275,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	long allocated;
-
-	sk_memory_allocated_add(sk, amt);
-	allocated = sk_memory_allocated(sk);
+	long allocated = 0;
 
 	if (mem_cgroup_sk_enabled(sk)) {
+		bool isolated = mem_cgroup_sk_isolated(sk);
+
 		memcg_enabled = true;
 		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
-		if (!charged)
+
+		if (isolated && charged)
+			return 1;
+
+		if (!charged) {
+			if (!isolated) {
+				sk_memory_allocated_add(sk, amt);
+				allocated = sk_memory_allocated(sk);
+			}
+
 			goto suppress_allocation;
+		}
 	}
 
+	sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated(sk);
+
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
@@ -3357,7 +3377,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
-	sk_memory_allocated_sub(sk, amt);
+	if (allocated)
+		sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
 		mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3417,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	sk_memory_allocated_sub(sk, amount);
-
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_uncharge(sk, amount);
 
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_sub(sk, amount);
+
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d1..9d56085f7f54b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <net/proto_memory.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_sk*_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses
@@ -710,7 +711,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 	if (mem_cgroup_sockets_enabled) {
 		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
-		int amt = 0;
 
 		/* atomically get the memory usage, set and charge the
 		 * newsk->sk_memcg.
@@ -719,15 +719,27 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 		mem_cgroup_sk_alloc(newsk);
 		if (mem_cgroup_from_sk(newsk)) {
+			int amt;
+
 			/* The socket has not been accepted yet, no need
 			 * to look at newsk->sk_wmem_queued.
 			 */
 			amt = sk_mem_pages(newsk->sk_forward_alloc +
 					   atomic_read(&newsk->sk_rmem_alloc));
+			if (amt) {
+				/* This amt is already charged globally to
+				 * sk_prot->memory_allocated due to lack of
+				 * sk_memcg until accept(), thus we need to
+				 * reclaim it here if newsk is isolated.
+				 */
+				if (mem_cgroup_sk_isolated(newsk))
+					sk_memory_allocated_sub(newsk, amt);
+
+				mem_cgroup_sk_charge(newsk, amt, gfp);
+			}
+
 		}
 
-		if (amt)
-			mem_cgroup_sk_charge(newsk, amt, gfp);
 		kmem_cache_charge(newsk, gfp);
 
 		release_sock(newsk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 09f0802f36afa..79e705fca8b67 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3562,12 +3562,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	delta = size - sk->sk_forward_alloc;
 	if (delta <= 0)
 		return;
+
 	amt = sk_mem_pages(delta);
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
-	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_add(sk, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
-- 
2.50.0.727.gbf7dc18ff4-goog

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Michal Koutný 1 month, 2 weeks ago

On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
> 
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
> 
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).

IIUC the envisioned use case is that some cgroups feed from global
resource and some from their own limit.
It means the admin knows both:
  a) how to configure individual cgroup,
  b) how to configure global limit (for the rest).
So why cannot they stick to a single model only?

> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
> 
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> 
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.

That doesn't like a good idea to remove limits from possibly noisy
units.

> Let's decouple memcg from the global per-protocol memory accounting.
> 
> This simplifies memcg configuration while keeping the global limits
> within a reasonable range.

I think this is a configuration issue only, i.e. instead of preserving
the global limit because of _some_ memcgs, the configuration management
could have a default memcg limit that is substituted to those memcgs so
that there's no risk of runaways even in absence of global limit.

Regards,
Michal

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 2 weeks ago

On Thu, Jul 31, 2025 at 6:39 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as sock in memory.stat.
> >
> > Even when memory usage is controlled by memcg, sockets using such protocols
> > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> IIUC the envisioned use case is that some cgroups feed from global
> resource and some from their own limit.
> It means the admin knows both:
>   a) how to configure individual cgroup,
>   b) how to configure global limit (for the rest).
> So why cannot they stick to a single model only?
>
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and a single workload
> > that opts out of memcg can consume memory up to the global limit,
> > becoming a noisy neighbour.
>
> That doesn't like a good idea to remove limits from possibly noisy
> units.
>
> > Let's decouple memcg from the global per-protocol memory accounting.
> >
> > This simplifies memcg configuration while keeping the global limits
> > within a reasonable range.
>
> I think this is a configuration issue only, i.e. instead of preserving
> the global limit because of _some_ memcgs, the configuration management
> could have a default memcg limit that is substituted to those memcgs so
> that there's no risk of runaways even in absence of global limit.

Doesn't that end up implementing another tcp_mem[] which now
enforce limits on uncontrolled cgroups (memory.max == max) ?
Or it will simply end up with the system-wide OOM killer ?

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Roman Gushchin 1 month, 2 weeks ago

Kuniyuki Iwashima <kuniyu@google.com> writes:

> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
>
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.
>
> Let's decouple memcg from the global per-protocol memory accounting.
>
> This simplifies memcg configuration while keeping the global limits
> within a reasonable range.

I don't think it should be a memcg feature. In fact, it doesn't have
much to do with cgroups at all (it's not hierarchical, it doesn't
control the resource allocation, and in the end it controls an
alternative to memory cgroups memory accounting system).

Instead, it can be a per-process prctl option.

(Assuming the feature is really needed - I'm also curious why some
processes have to be excluded from the memcg accounting - it sounds like
generally a bad idea).

Thanks

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Johannes Weiner 1 month, 3 weeks ago

On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
> 
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
> 
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> 
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
> 
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> 
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.

Yes, an uncontrolled cgroup can consume all of a shared resource and
thereby become a noisy neighbor. Why is network memory special?

I assume you have some other mechanisms for curbing things like
filesystem caches, anon memory, swap etc. of such otherwise
uncontrolled groups, and this just happens to be your missing piece.

But at this point, you're operating so far out of the cgroup resource
management model that I don't think it can be reasonably supported.

I hate to say this, but can't you carry this out of tree until the
transition is complete?

I just don't think it makes any sense to have this as a permanent
fixture in a general-purpose container management interface.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 2 weeks ago

On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as sock in memory.stat.
> >
> > Even when memory usage is controlled by memcg, sockets using such protocols
> > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and a single workload
> > that opts out of memcg can consume memory up to the global limit,
> > becoming a noisy neighbour.
>
> Yes, an uncontrolled cgroup can consume all of a shared resource and
> thereby become a noisy neighbor. Why is network memory special?
>
> I assume you have some other mechanisms for curbing things like
> filesystem caches, anon memory, swap etc. of such otherwise
> uncontrolled groups, and this just happens to be your missing piece.

I think that's the tcp_mem[] knob, limiting tcp mem globally for
the "uncontrolled" cgroup.  But we can't use it because the
"controlled" cgroup is also limited by this knob.

If we want to properly control the "controlled" cgroup by its feature
only, we must disable the global limit completely on the host,
meaning we lose the "missing piece".

Currently, there are only two poor choices

1) Use tcp_mem[] but memory allocation could fail even if the
   cgroup has available memory

2) Disable tcp_mem[] but uncontrolled cgroup lose seatbelt and
   can consume memory up to system limit

but what we really need is

3) Uncontrolled cgroup is limited by tcp_mem[],
   AND
   for controlled cgroup, memory allocation won't fail if
   it has available memory regardless of tcp_mem[]

>
> But at this point, you're operating so far out of the cgroup resource
> management model that I don't think it can be reasonably supported.

I think it's rather operated under the normal cgroup management
model, relying on the configured memory limit for each cgroup.

What's wrong here is we had to set tcp_mem[] to UINT_MAX and
get rid of the seatbelt for uncontrolled cgroup for the management
model.

But this is just because cgroup mem is also charged globally
to TCP, which should not be.

>
> I hate to say this, but can't you carry this out of tree until the
> transition is complete?
>
> I just don't think it makes any sense to have this as a permanent
> fixture in a general-purpose container management interface.

I understand that, and we should eventually fix "1) or 2)" to
just 3), but introducing this change without a knob will break
assumptions in userspace and trigger regression.

cgroup v2 is now widely enabled by major distro, and systemd
creates many processes under non-root cgroups but without
memory limits.

If we had no knob, such processes would suddenly lose the
tcp_mem[] seatbelt and could consume memory up to system
limit.

How about adding the knob's deprecation plan by pr_warn_once()
or something and letting users configure the max properly by
that ?

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Johannes Weiner 1 month, 2 weeks ago

On Mon, Jul 28, 2025 at 02:41:38PM -0700, Kuniyuki Iwashima wrote:
> On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > buffers and charge memory to per-protocol global counters pointed to by
> > > sk->sk_proto->memory_allocated.
> > >
> > > When running under a non-root cgroup, this memory is also charged to the
> > > memcg as sock in memory.stat.
> > >
> > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > >
> > > This makes it difficult to accurately estimate and configure appropriate
> > > global limits, especially in multi-tenant environments.
> > >
> > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > >
> > > In reality, this assumption does not always hold, and a single workload
> > > that opts out of memcg can consume memory up to the global limit,
> > > becoming a noisy neighbour.
> >
> > Yes, an uncontrolled cgroup can consume all of a shared resource and
> > thereby become a noisy neighbor. Why is network memory special?
> >
> > I assume you have some other mechanisms for curbing things like
> > filesystem caches, anon memory, swap etc. of such otherwise
> > uncontrolled groups, and this just happens to be your missing piece.
> 
> I think that's the tcp_mem[] knob, limiting tcp mem globally for
> the "uncontrolled" cgroup.  But we can't use it because the
> "controlled" cgroup is also limited by this knob.

No, I was really asking what you do about other types of memory
consumed by such uncontrolled cgroups.

You can't have uncontrolled groups and complain about their resource
consumption.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 2 weeks ago

On Tue, Jul 29, 2025 at 7:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jul 28, 2025 at 02:41:38PM -0700, Kuniyuki Iwashima wrote:
> > On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > > buffers and charge memory to per-protocol global counters pointed to by
> > > > sk->sk_proto->memory_allocated.
> > > >
> > > > When running under a non-root cgroup, this memory is also charged to the
> > > > memcg as sock in memory.stat.
> > > >
> > > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > > >
> > > > This makes it difficult to accurately estimate and configure appropriate
> > > > global limits, especially in multi-tenant environments.
> > > >
> > > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > > >
> > > > In reality, this assumption does not always hold, and a single workload
> > > > that opts out of memcg can consume memory up to the global limit,
> > > > becoming a noisy neighbour.
> > >
> > > Yes, an uncontrolled cgroup can consume all of a shared resource and
> > > thereby become a noisy neighbor. Why is network memory special?
> > >
> > > I assume you have some other mechanisms for curbing things like
> > > filesystem caches, anon memory, swap etc. of such otherwise
> > > uncontrolled groups, and this just happens to be your missing piece.
> >
> > I think that's the tcp_mem[] knob, limiting tcp mem globally for
> > the "uncontrolled" cgroup.  But we can't use it because the
> > "controlled" cgroup is also limited by this knob.
>
> No, I was really asking what you do about other types of memory
> consumed by such uncontrolled cgroups.
>
> You can't have uncontrolled groups and complain about their resource
> consumption.

Only 10% of physical memory is allowed to be used globally for TCP.
How is it supposed to work if we don't enforce limits on uncontrolled
cgroups ?

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Shakeel Butt 1 month, 3 weeks ago

On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
> 
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
> 
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> 
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
> 
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> 
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.
> 

Sorry but the above is not reasonable. On a multi-tenant system no
workload should be able to opt out of memcg accounting if isolation is
needed. If a workload can opt out then there is no guarantee.

In addition please avoid adding a per-memcg knob. Why not have system
level setting for the decoupling. I would say start with a build time
config setting or boot parameter then if really needed we can discuss if
system level setting is needed which can be toggled at runtime though
there might be challenges there.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Eric Dumazet 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as sock in memory.stat.
> >
> > Even when memory usage is controlled by memcg, sockets using such protocols
> > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and a single workload
> > that opts out of memcg can consume memory up to the global limit,
> > becoming a noisy neighbour.
> >
>
> Sorry but the above is not reasonable. On a multi-tenant system no
> workload should be able to opt out of memcg accounting if isolation is
> needed. If a workload can opt out then there is no guarantee.

Deployment issue ?

In a multi-tenant system you can not suddenly force all workloads to
be TCP memcg charged. This has caused many OMG.

Also, the current situation of maintaining two limits (memcg one, plus
global tcp_memory_allocated) is very inefficient.

If we trust memcg, then why have an expensive safety belt ?

With this series, we can finally use one or the other limit. This
should have been done from day-0 really.

>
> In addition please avoid adding a per-memcg knob. Why not have system
> level setting for the decoupling. I would say start with a build time
> config setting or boot parameter then if really needed we can discuss if
> system level setting is needed which can be toggled at runtime though
> there might be challenges there.

Built time or boot parameter ? I fail to see how it can be more convenient.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Shakeel Butt 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 08:24:23AM -0700, Eric Dumazet wrote:
> On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > buffers and charge memory to per-protocol global counters pointed to by
> > > sk->sk_proto->memory_allocated.
> > >
> > > When running under a non-root cgroup, this memory is also charged to the
> > > memcg as sock in memory.stat.
> > >
> > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > >
> > > This makes it difficult to accurately estimate and configure appropriate
> > > global limits, especially in multi-tenant environments.
> > >
> > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > >
> > > In reality, this assumption does not always hold, and a single workload
> > > that opts out of memcg can consume memory up to the global limit,
> > > becoming a noisy neighbour.
> > >
> >
> > Sorry but the above is not reasonable. On a multi-tenant system no
> > workload should be able to opt out of memcg accounting if isolation is
> > needed. If a workload can opt out then there is no guarantee.
> 
> Deployment issue ?
> 
> In a multi-tenant system you can not suddenly force all workloads to
> be TCP memcg charged. This has caused many OMG.

Let's discuss the above at the end.

> 
> Also, the current situation of maintaining two limits (memcg one, plus
> global tcp_memory_allocated) is very inefficient.

Agree.

> 
> If we trust memcg, then why have an expensive safety belt ?
> 
> With this series, we can finally use one or the other limit. This
> should have been done from day-0 really.

Same, I agree.

> 
> >
> > In addition please avoid adding a per-memcg knob. Why not have system
> > level setting for the decoupling. I would say start with a build time
> > config setting or boot parameter then if really needed we can discuss if
> > system level setting is needed which can be toggled at runtime though
> > there might be challenges there.
> 
> Built time or boot parameter ? I fail to see how it can be more convenient.

I think we agree on decoupling the global and memcg accounting of
network memory. I am still not clear on the need of per-memcg knob. From
the earlier comment, it seems like you want mix of jobs with memcg
limited network memory accounting and with global network accounting
running concurrently on a system. Is that correct?

I expect this state of jobs with different network accounting config
running concurrently is temporary while the migrationg from one to other
is happening. Please correct me if I am wrong.

My main concern with the memcg knob is that it is permanent and it
requires a hierarchical semantics. No need to add a permanent interface
for a temporary need and I don't see a clear hierarchical semantic for
this interface.

I am wondering if alternative approches for per-workload settings are
explore starting with BPF.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 8:52 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 08:24:23AM -0700, Eric Dumazet wrote:
> > On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > > buffers and charge memory to per-protocol global counters pointed to by
> > > > sk->sk_proto->memory_allocated.
> > > >
> > > > When running under a non-root cgroup, this memory is also charged to the
> > > > memcg as sock in memory.stat.
> > > >
> > > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > > >
> > > > This makes it difficult to accurately estimate and configure appropriate
> > > > global limits, especially in multi-tenant environments.
> > > >
> > > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > > >
> > > > In reality, this assumption does not always hold, and a single workload
> > > > that opts out of memcg can consume memory up to the global limit,
> > > > becoming a noisy neighbour.
> > > >
> > >
> > > Sorry but the above is not reasonable. On a multi-tenant system no
> > > workload should be able to opt out of memcg accounting if isolation is
> > > needed. If a workload can opt out then there is no guarantee.
> >
> > Deployment issue ?
> >
> > In a multi-tenant system you can not suddenly force all workloads to
> > be TCP memcg charged. This has caused many OMG.
>
> Let's discuss the above at the end.
>
> >
> > Also, the current situation of maintaining two limits (memcg one, plus
> > global tcp_memory_allocated) is very inefficient.
>
> Agree.
>
> >
> > If we trust memcg, then why have an expensive safety belt ?
> >
> > With this series, we can finally use one or the other limit. This
> > should have been done from day-0 really.
>
> Same, I agree.
>
> >
> > >
> > > In addition please avoid adding a per-memcg knob. Why not have system
> > > level setting for the decoupling. I would say start with a build time
> > > config setting or boot parameter then if really needed we can discuss if
> > > system level setting is needed which can be toggled at runtime though
> > > there might be challenges there.
> >
> > Built time or boot parameter ? I fail to see how it can be more convenient.
>
> I think we agree on decoupling the global and memcg accounting of
> network memory. I am still not clear on the need of per-memcg knob. From
> the earlier comment, it seems like you want mix of jobs with memcg
> limited network memory accounting and with global network accounting
> running concurrently on a system. Is that correct?

Correct.


>
> I expect this state of jobs with different network accounting config
> running concurrently is temporary while the migrationg from one to other
> is happening. Please correct me if I am wrong.

We need to migrate workload gradually and the system-wide config
does not work at all.  AFAIU, there are already years of effort spent
on the migration but it's not yet completed at Google.  So, I don't think
the need is temporary.

>
> My main concern with the memcg knob is that it is permanent and it
> requires a hierarchical semantics. No need to add a permanent interface
> for a temporary need and I don't see a clear hierarchical semantic for
> this interface.

I don't see merits of having hierarchical semantics for this knob.
Regardless of this knob, hierarchical semantics is guaranteed
by other knobs.  I think such semantics for this knob just complicates
the code with no gain.


>
> I am wondering if alternative approches for per-workload settings are
> explore starting with BPF.
>
>
>

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Shakeel Butt 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> >
> > I expect this state of jobs with different network accounting config
> > running concurrently is temporary while the migrationg from one to other
> > is happening. Please correct me if I am wrong.
> 
> We need to migrate workload gradually and the system-wide config
> does not work at all.  AFAIU, there are already years of effort spent
> on the migration but it's not yet completed at Google.  So, I don't think
> the need is temporary.
> 

From what I remembered shared borg had completely moved to memcg
accounting of network memory (with sys container as an exception) years
ago. Did something change there?

> >
> > My main concern with the memcg knob is that it is permanent and it
> > requires a hierarchical semantics. No need to add a permanent interface
> > for a temporary need and I don't see a clear hierarchical semantic for
> > this interface.
> 
> I don't see merits of having hierarchical semantics for this knob.
> Regardless of this knob, hierarchical semantics is guaranteed
> by other knobs.  I think such semantics for this knob just complicates
> the code with no gain.
> 

Cgroup interfaces are hierarchical and we want to keep it that way.
Putting non-hierarchical interfaces just makes configuration and setup
hard to reason about.

> 
> >
> > I am wondering if alternative approches for per-workload settings are
> > explore starting with BPF.
> >

Any response on the above? Any alternative approaches explored?

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > >
> > > I expect this state of jobs with different network accounting config
> > > running concurrently is temporary while the migrationg from one to other
> > > is happening. Please correct me if I am wrong.
> >
> > We need to migrate workload gradually and the system-wide config
> > does not work at all.  AFAIU, there are already years of effort spent
> > on the migration but it's not yet completed at Google.  So, I don't think
> > the need is temporary.
> >
>
> From what I remembered shared borg had completely moved to memcg
> accounting of network memory (with sys container as an exception) years
> ago. Did something change there?

AFAICS, there are some workloads that opted out from memcg and
consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
OOM and disrupting other workloads.

>
> > >
> > > My main concern with the memcg knob is that it is permanent and it
> > > requires a hierarchical semantics. No need to add a permanent interface
> > > for a temporary need and I don't see a clear hierarchical semantic for
> > > this interface.
> >
> > I don't see merits of having hierarchical semantics for this knob.
> > Regardless of this knob, hierarchical semantics is guaranteed
> > by other knobs.  I think such semantics for this knob just complicates
> > the code with no gain.
> >
>
> Cgroup interfaces are hierarchical and we want to keep it that way.
> Putting non-hierarchical interfaces just makes configuration and setup
> hard to reason about.

Actually, I tried that way in the initial draft version, but even if the
parent's knob is 1 and child one is 0, a harmful scenario didn't come
to my mind.


>
> >
> > >
> > > I am wondering if alternative approches for per-workload settings are
> > > explore starting with BPF.
> > >
>
> Any response on the above? Any alternative approaches explored?

Do you mean flagging each socket by BPF at cgroup hook ?

I think it's overkill and we don't need such finer granularity.

Also it sounds way too hacky to use BPF to correct the weird
behaviour from day0.  We should have more generic way to
control that.  I know this functionality is helpful for some workloads
at Amazon as well.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Shakeel Butt 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > >
> > > > I expect this state of jobs with different network accounting config
> > > > running concurrently is temporary while the migrationg from one to other
> > > > is happening. Please correct me if I am wrong.
> > >
> > > We need to migrate workload gradually and the system-wide config
> > > does not work at all.  AFAIU, there are already years of effort spent
> > > on the migration but it's not yet completed at Google.  So, I don't think
> > > the need is temporary.
> > >
> >
> > From what I remembered shared borg had completely moved to memcg
> > accounting of network memory (with sys container as an exception) years
> > ago. Did something change there?
> 
> AFAICS, there are some workloads that opted out from memcg and
> consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> OOM and disrupting other workloads.
> 

What were the reasons behind opting out? We should fix those
instead of a permanent opt-out option.

> >
> > > >
> > > > My main concern with the memcg knob is that it is permanent and it
> > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > this interface.
> > >
> > > I don't see merits of having hierarchical semantics for this knob.
> > > Regardless of this knob, hierarchical semantics is guaranteed
> > > by other knobs.  I think such semantics for this knob just complicates
> > > the code with no gain.
> > >
> >
> > Cgroup interfaces are hierarchical and we want to keep it that way.
> > Putting non-hierarchical interfaces just makes configuration and setup
> > hard to reason about.
> 
> Actually, I tried that way in the initial draft version, but even if the
> parent's knob is 1 and child one is 0, a harmful scenario didn't come
> to my mind.
> 

It is not just about harmful scenario but more about clear semantics.
Check memory.zswap.writeback semantics.

> 
> >
> > >
> > > >
> > > > I am wondering if alternative approches for per-workload settings are
> > > > explore starting with BPF.
> > > >
> >
> > Any response on the above? Any alternative approaches explored?
> 
> Do you mean flagging each socket by BPF at cgroup hook ?

Not sure. Will it not be very similar to your current approach? Each
socket is associated with a memcg and the at the place where you need to
check which accounting method to use, just check that memcg setting in
bpf and you can cache the result in socket as well.

> 
> I think it's overkill and we don't need such finer granularity.
> 
> Also it sounds way too hacky to use BPF to correct the weird
> behaviour from day0.

What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
with different accounting mechanisms concurrently is also weird.

> We should have more generic way to
> control that.  I know this functionality is helpful for some workloads
> at Amazon as well.

The reason I am against this permanent opt-out interface is if we add
this interface then we will never fix the underlying issues blocking the
full conversion to memcg accounting of network memory. I am ok with some
temporary measures to allow opt-out impacted workload until the
underlying issue is fixed.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > > >
> > > > > I expect this state of jobs with different network accounting config
> > > > > running concurrently is temporary while the migrationg from one to other
> > > > > is happening. Please correct me if I am wrong.
> > > >
> > > > We need to migrate workload gradually and the system-wide config
> > > > does not work at all.  AFAIU, there are already years of effort spent
> > > > on the migration but it's not yet completed at Google.  So, I don't think
> > > > the need is temporary.
> > > >
> > >
> > > From what I remembered shared borg had completely moved to memcg
> > > accounting of network memory (with sys container as an exception) years
> > > ago. Did something change there?
> >
> > AFAICS, there are some workloads that opted out from memcg and
> > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> > OOM and disrupting other workloads.
> >
>
> What were the reasons behind opting out? We should fix those
> instead of a permanent opt-out option.
>
> > >
> > > > >
> > > > > My main concern with the memcg knob is that it is permanent and it
> > > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > > this interface.
> > > >
> > > > I don't see merits of having hierarchical semantics for this knob.
> > > > Regardless of this knob, hierarchical semantics is guaranteed
> > > > by other knobs.  I think such semantics for this knob just complicates
> > > > the code with no gain.
> > > >
> > >
> > > Cgroup interfaces are hierarchical and we want to keep it that way.
> > > Putting non-hierarchical interfaces just makes configuration and setup
> > > hard to reason about.
> >
> > Actually, I tried that way in the initial draft version, but even if the
> > parent's knob is 1 and child one is 0, a harmful scenario didn't come
> > to my mind.
> >
>
> It is not just about harmful scenario but more about clear semantics.
> Check memory.zswap.writeback semantics.

zswap checks all parent cgroups when evaluating the knob, but
this is not an option for the networking fast path as we cannot
check them for every skb, which will degrade the performance.

Also, we don't track which sockets were created with the knob
enabled and how many such sockets are still left under the cgroup,
there is no way to keep options consistent throughout the hierarchy
and no need to try hard to make the option pretend to be consistent
if there's no real issue.


>
> >
> > >
> > > >
> > > > >
> > > > > I am wondering if alternative approches for per-workload settings are
> > > > > explore starting with BPF.
> > > > >
> > >
> > > Any response on the above? Any alternative approaches explored?
> >
> > Do you mean flagging each socket by BPF at cgroup hook ?
>
> Not sure. Will it not be very similar to your current approach? Each
> socket is associated with a memcg and the at the place where you need to
> check which accounting method to use, just check that memcg setting in
> bpf and you can cache the result in socket as well.

The socket pointer is not writable by default, thus we need to add
a bpf helper or kfunc just for flipping a single bit.  As said, this is
overkill, and per-memcg knob is much simpler.


>
> >
> > I think it's overkill and we don't need such finer granularity.
> >
> > Also it sounds way too hacky to use BPF to correct the weird
> > behaviour from day0.
>
> What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
> with different accounting mechanisms concurrently is also weird.

Not that weird given the root cgroup does not allocate sk->sk_memcg
and are subject to the global tcp memory accounting.  We already have
a mixed set of memcgs.

Also, not every cgroup sets memory limits.  systemd puts some
processes to a non-root cgroup by default without setting memory.max.
In such a case we definitely want the global memory accounting to take
place.

Having to set memory.max to every non-root cgroup is less flexible
and too restricted.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Shakeel Butt 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 02:59:33PM -0700, Kuniyuki Iwashima wrote:
> On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> > > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > > > >
> > > > > > I expect this state of jobs with different network accounting config
> > > > > > running concurrently is temporary while the migrationg from one to other
> > > > > > is happening. Please correct me if I am wrong.
> > > > >
> > > > > We need to migrate workload gradually and the system-wide config
> > > > > does not work at all.  AFAIU, there are already years of effort spent
> > > > > on the migration but it's not yet completed at Google.  So, I don't think
> > > > > the need is temporary.
> > > > >
> > > >
> > > > From what I remembered shared borg had completely moved to memcg
> > > > accounting of network memory (with sys container as an exception) years
> > > > ago. Did something change there?
> > >
> > > AFAICS, there are some workloads that opted out from memcg and
> > > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> > > OOM and disrupting other workloads.
> > >
> >
> > What were the reasons behind opting out? We should fix those
> > instead of a permanent opt-out option.
> >

Any response to the above?

> > > >
> > > > > >
> > > > > > My main concern with the memcg knob is that it is permanent and it
> > > > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > > > this interface.
> > > > >
> > > > > I don't see merits of having hierarchical semantics for this knob.
> > > > > Regardless of this knob, hierarchical semantics is guaranteed
> > > > > by other knobs.  I think such semantics for this knob just complicates
> > > > > the code with no gain.
> > > > >
> > > >
> > > > Cgroup interfaces are hierarchical and we want to keep it that way.
> > > > Putting non-hierarchical interfaces just makes configuration and setup
> > > > hard to reason about.
> > >
> > > Actually, I tried that way in the initial draft version, but even if the
> > > parent's knob is 1 and child one is 0, a harmful scenario didn't come
> > > to my mind.
> > >
> >
> > It is not just about harmful scenario but more about clear semantics.
> > Check memory.zswap.writeback semantics.
> 
> zswap checks all parent cgroups when evaluating the knob, but
> this is not an option for the networking fast path as we cannot
> check them for every skb, which will degrade the performance.

That's an implementation detail and you can definitely optimize it. One
possible way might be caching the state in socket at creation time which
puts some restrictions like to change the config, workload needs to be
restarted.

> 
> Also, we don't track which sockets were created with the knob
> enabled and how many such sockets are still left under the cgroup,
> there is no way to keep options consistent throughout the hierarchy
> and no need to try hard to make the option pretend to be consistent
> if there's no real issue.
> 
> 
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > I am wondering if alternative approches for per-workload settings are
> > > > > > explore starting with BPF.
> > > > > >
> > > >
> > > > Any response on the above? Any alternative approaches explored?
> > >
> > > Do you mean flagging each socket by BPF at cgroup hook ?
> >
> > Not sure. Will it not be very similar to your current approach? Each
> > socket is associated with a memcg and the at the place where you need to
> > check which accounting method to use, just check that memcg setting in
> > bpf and you can cache the result in socket as well.
> 
> The socket pointer is not writable by default, thus we need to add
> a bpf helper or kfunc just for flipping a single bit.  As said, this is
> overkill, and per-memcg knob is much simpler.
> 

Your simple solution is exposing a stable permanent user facing API
which I suspect is temporary situation. Let's discuss it at the end.

> 
> >
> > >
> > > I think it's overkill and we don't need such finer granularity.
> > >
> > > Also it sounds way too hacky to use BPF to correct the weird
> > > behaviour from day0.
> >
> > What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
> > with different accounting mechanisms concurrently is also weird.
> 
> Not that weird given the root cgroup does not allocate sk->sk_memcg
> and are subject to the global tcp memory accounting.  We already have
> a mixed set of memcgs.

Running workloads in root cgroup is not normal and comes with a warning
of no isolation provided.

I looked at the patch again to understand the modes you are introducing.
Initially, I thought the series introduced multiple modes, including an
option to exclude network memory from memcg accounting. However, if I
understand correctly, that is not the case—the opt-out applies only to
the global TCP/UDP accounting. That’s a relief, and I apologize for the
misunderstanding.

If I’m correct, you need a way to exclude a workload from the global
TCP/UDP accounting, and currently, memcg serves as a convenient
abstraction for the workload. Please let me know if I misunderstood.

Now memcg is one way to represent the workload. Another more natural, at
least to me, is the core cgroup. Basically cgroup.something interface.
BPF is yet another option.

To me cgroup seems preferrable but let's see what other memcg & cgroup
folks think. Also note that for cgroup and memcg the interface will need
to be hierarchical.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 3 weeks ago

On Tue, Jul 22, 2025 at 5:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 02:59:33PM -0700, Kuniyuki Iwashima wrote:
> > On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> > > > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > >
> > > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > > > > >
> > > > > > > I expect this state of jobs with different network accounting config
> > > > > > > running concurrently is temporary while the migrationg from one to other
> > > > > > > is happening. Please correct me if I am wrong.
> > > > > >
> > > > > > We need to migrate workload gradually and the system-wide config
> > > > > > does not work at all.  AFAIU, there are already years of effort spent
> > > > > > on the migration but it's not yet completed at Google.  So, I don't think
> > > > > > the need is temporary.
> > > > > >
> > > > >
> > > > > From what I remembered shared borg had completely moved to memcg
> > > > > accounting of network memory (with sys container as an exception) years
> > > > > ago. Did something change there?
> > > >
> > > > AFAICS, there are some workloads that opted out from memcg and
> > > > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> > > > OOM and disrupting other workloads.
> > > >
> > >
> > > What were the reasons behind opting out? We should fix those
> > > instead of a permanent opt-out option.
> > >
>
> Any response to the above?

I'm just checking with internal folks, not sure if I will follow up on
this though, see below.

>
> > > > >
> > > > > > >
> > > > > > > My main concern with the memcg knob is that it is permanent and it
> > > > > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > > > > this interface.
> > > > > >
> > > > > > I don't see merits of having hierarchical semantics for this knob.
> > > > > > Regardless of this knob, hierarchical semantics is guaranteed
> > > > > > by other knobs.  I think such semantics for this knob just complicates
> > > > > > the code with no gain.
> > > > > >
> > > > >
> > > > > Cgroup interfaces are hierarchical and we want to keep it that way.
> > > > > Putting non-hierarchical interfaces just makes configuration and setup
> > > > > hard to reason about.
> > > >
> > > > Actually, I tried that way in the initial draft version, but even if the
> > > > parent's knob is 1 and child one is 0, a harmful scenario didn't come
> > > > to my mind.
> > > >
> > >
> > > It is not just about harmful scenario but more about clear semantics.
> > > Check memory.zswap.writeback semantics.
> >
> > zswap checks all parent cgroups when evaluating the knob, but
> > this is not an option for the networking fast path as we cannot
> > check them for every skb, which will degrade the performance.
>
> That's an implementation detail and you can definitely optimize it. One
> possible way might be caching the state in socket at creation time which
> puts some restrictions like to change the config, workload needs to be
> restarted.
>
> >
> > Also, we don't track which sockets were created with the knob
> > enabled and how many such sockets are still left under the cgroup,
> > there is no way to keep options consistent throughout the hierarchy
> > and no need to try hard to make the option pretend to be consistent
> > if there's no real issue.
> >
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > I am wondering if alternative approches for per-workload settings are
> > > > > > > explore starting with BPF.
> > > > > > >
> > > > >
> > > > > Any response on the above? Any alternative approaches explored?
> > > >
> > > > Do you mean flagging each socket by BPF at cgroup hook ?
> > >
> > > Not sure. Will it not be very similar to your current approach? Each
> > > socket is associated with a memcg and the at the place where you need to
> > > check which accounting method to use, just check that memcg setting in
> > > bpf and you can cache the result in socket as well.
> >
> > The socket pointer is not writable by default, thus we need to add
> > a bpf helper or kfunc just for flipping a single bit.  As said, this is
> > overkill, and per-memcg knob is much simpler.
> >
>
> Your simple solution is exposing a stable permanent user facing API
> which I suspect is temporary situation. Let's discuss it at the end.
>
> >
> > >
> > > >
> > > > I think it's overkill and we don't need such finer granularity.
> > > >
> > > > Also it sounds way too hacky to use BPF to correct the weird
> > > > behaviour from day0.
> > >
> > > What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
> > > with different accounting mechanisms concurrently is also weird.
> >
> > Not that weird given the root cgroup does not allocate sk->sk_memcg
> > and are subject to the global tcp memory accounting.  We already have
> > a mixed set of memcgs.
>
> Running workloads in root cgroup is not normal and comes with a warning
> of no isolation provided.
>
> I looked at the patch again to understand the modes you are introducing.
> Initially, I thought the series introduced multiple modes, including an
> option to exclude network memory from memcg accounting. However, if I
> understand correctly, that is not the case—the opt-out applies only to
> the global TCP/UDP accounting. That’s a relief, and I apologize for the
> misunderstanding.
>
> If I’m correct, you need a way to exclude a workload from the global
> TCP/UDP accounting, and currently, memcg serves as a convenient
> abstraction for the workload. Please let me know if I misunderstood.

Correct.

Currently, memcg by itself cannot guarantee that memory allocation for
socket buffer does not fail even when memory.current < memory.max
due to the global protocol limits.

It means we need to increase the global limits to

(bytes of TCP socket buffer in each cgroup) * (number of cgroup)

, which is hard to predict, and I guess that's the reason why you
or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
limit.

But we should keep tcp_mem[] within a sane range in the first place.

This series allows us to configure memcg limits only and let memcg
guarantee no failure until it fully consumes memory.max.

The point is that memcg should not be affected by the global limits,
and this is orthogonal with the assumption that every workload should
be running under memcg.


>
> Now memcg is one way to represent the workload. Another more natural, at
> least to me, is the core cgroup. Basically cgroup.something interface.
> BPF is yet another option.
>
> To me cgroup seems preferrable but let's see what other memcg & cgroup
> folks think. Also note that for cgroup and memcg the interface will need
> to be hierarchical.

As the root cgroup doesn't have the knob, these combinations are
considered hierarchical:

(parent, child) = (0, 0), (0, 1), (1, 1)

and only the pattern below is not considered hierarchical

(parent, child) = (1, 0)

Let's say we lock the knob at the first socket creation like your
idea above.

If a parent and its child' knobs are (0, 0) and the child creates a
socket, the child memcg is locked as 0.  When the parent enables
the knob, we must check all child cgroups as well.  Or, we lock
the all parents' knobs when a socket is created in a child cgroup
with knob=0 ?  In any cases we need a global lock.

Well, I understand that the hierarchical semantics is preferable
for cgroup but I think it does not resolve any real issue and rather
churns the code unnecessarily.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Shakeel Butt 1 month, 3 weeks ago

Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF
options.

On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
[...]
> >
> > Running workloads in root cgroup is not normal and comes with a warning
> > of no isolation provided.
> >
> > I looked at the patch again to understand the modes you are introducing.
> > Initially, I thought the series introduced multiple modes, including an
> > option to exclude network memory from memcg accounting. However, if I
> > understand correctly, that is not the case—the opt-out applies only to
> > the global TCP/UDP accounting. That’s a relief, and I apologize for the
> > misunderstanding.
> >
> > If I’m correct, you need a way to exclude a workload from the global
> > TCP/UDP accounting, and currently, memcg serves as a convenient
> > abstraction for the workload. Please let me know if I misunderstood.
> 
> Correct.
> 
> Currently, memcg by itself cannot guarantee that memory allocation for
> socket buffer does not fail even when memory.current < memory.max
> due to the global protocol limits.
> 
> It means we need to increase the global limits to
> 
> (bytes of TCP socket buffer in each cgroup) * (number of cgroup)
> 
> , which is hard to predict, and I guess that's the reason why you
> or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> limit.

No that was not the reason. The main reason behind max tcp_mem global
limit was it was not needed as memcg should account and limit the
network memory. I think the reason you don't want tcp_mem global limit
unlimited now is you have internal feature to let workloads opt out of
the memcg accounting of network memory which is causing isolation
issues.

> 
> But we should keep tcp_mem[] within a sane range in the first place.
> 
> This series allows us to configure memcg limits only and let memcg
> guarantee no failure until it fully consumes memory.max.
> 
> The point is that memcg should not be affected by the global limits,
> and this is orthogonal with the assumption that every workload should
> be running under memcg.
> 
> 
> >
> > Now memcg is one way to represent the workload. Another more natural, at
> > least to me, is the core cgroup. Basically cgroup.something interface.
> > BPF is yet another option.
> >
> > To me cgroup seems preferrable but let's see what other memcg & cgroup
> > folks think. Also note that for cgroup and memcg the interface will need
> > to be hierarchical.
> 
> As the root cgroup doesn't have the knob, these combinations are
> considered hierarchical:
> 
> (parent, child) = (0, 0), (0, 1), (1, 1)
> 
> and only the pattern below is not considered hierarchical
> 
> (parent, child) = (1, 0)
> 
> Let's say we lock the knob at the first socket creation like your
> idea above.
> 
> If a parent and its child' knobs are (0, 0) and the child creates a
> socket, the child memcg is locked as 0.  When the parent enables
> the knob, we must check all child cgroups as well.  Or, we lock
> the all parents' knobs when a socket is created in a child cgroup
> with knob=0 ?  In any cases we need a global lock.
> 
> Well, I understand that the hierarchical semantics is preferable
> for cgroup but I think it does not resolve any real issue and rather
> churns the code unnecessarily.

All this is implementation detail and I am asking about semantics. More
specifically:

1. Will the root be non-isolated always?
2. If a cgroup is isolated, does it mean all its desendants are
   isolated?
3. Will there ever be a reasonable use-case where there is non-isolated
   sub-tree under an isolated ancestor?

Please give some thought to the above (and related) questions.

I am still not convinced that memcg is the right home for this opt-out
feature. I have CCed cgroup folks to get their opinion as well.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 3 weeks ago

On Wed, Jul 23, 2025 at 10:28 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF
> options.
>
> On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
> [...]
> > >
> > > Running workloads in root cgroup is not normal and comes with a warning
> > > of no isolation provided.
> > >
> > > I looked at the patch again to understand the modes you are introducing.
> > > Initially, I thought the series introduced multiple modes, including an
> > > option to exclude network memory from memcg accounting. However, if I
> > > understand correctly, that is not the case—the opt-out applies only to
> > > the global TCP/UDP accounting. That’s a relief, and I apologize for the
> > > misunderstanding.
> > >
> > > If I’m correct, you need a way to exclude a workload from the global
> > > TCP/UDP accounting, and currently, memcg serves as a convenient
> > > abstraction for the workload. Please let me know if I misunderstood.
> >
> > Correct.
> >
> > Currently, memcg by itself cannot guarantee that memory allocation for
> > socket buffer does not fail even when memory.current < memory.max
> > due to the global protocol limits.
> >
> > It means we need to increase the global limits to
> >
> > (bytes of TCP socket buffer in each cgroup) * (number of cgroup)
> >
> > , which is hard to predict, and I guess that's the reason why you
> > or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> > limit.
>
> No that was not the reason. The main reason behind max tcp_mem global
> limit was it was not needed

but the global limit did take place thus you had to set tcp_mem
to unlimited.

> as memcg should account and limit the
> network memory.
> I think the reason you don't want tcp_mem global limit
> unlimited now is

memcg has been subject to the global limit from day 0.

And note that not every process is under memcg with memory.max
configured.


> you have internal feature to let workloads opt out of
> the memcg accounting of network memory which is causing isolation
> issues.
>
> >
> > But we should keep tcp_mem[] within a sane range in the first place.
> >
> > This series allows us to configure memcg limits only and let memcg
> > guarantee no failure until it fully consumes memory.max.
> >
> > The point is that memcg should not be affected by the global limits,
> > and this is orthogonal with the assumption that every workload should
> > be running under memcg.
> >
> >
> > >
> > > Now memcg is one way to represent the workload. Another more natural, at
> > > least to me, is the core cgroup. Basically cgroup.something interface.
> > > BPF is yet another option.
> > >
> > > To me cgroup seems preferrable but let's see what other memcg & cgroup
> > > folks think. Also note that for cgroup and memcg the interface will need
> > > to be hierarchical.
> >
> > As the root cgroup doesn't have the knob, these combinations are
> > considered hierarchical:
> >
> > (parent, child) = (0, 0), (0, 1), (1, 1)
> >
> > and only the pattern below is not considered hierarchical
> >
> > (parent, child) = (1, 0)
> >
> > Let's say we lock the knob at the first socket creation like your
> > idea above.
> >
> > If a parent and its child' knobs are (0, 0) and the child creates a
> > socket, the child memcg is locked as 0.  When the parent enables
> > the knob, we must check all child cgroups as well.  Or, we lock
> > the all parents' knobs when a socket is created in a child cgroup
> > with knob=0 ?  In any cases we need a global lock.
> >
> > Well, I understand that the hierarchical semantics is preferable
> > for cgroup but I think it does not resolve any real issue and rather
> > churns the code unnecessarily.
>
> All this is implementation detail and I am asking about semantics. More
> specifically:
>
> 1. Will the root be non-isolated always?

Yes, because the root cgroup doesn't have memcg.
Also, the knob has CFTYPE_NOT_ON_ROOT.


> 2. If a cgroup is isolated, does it mean all its desendants are
>    isolated?

No, but this is because we MUST think about how we handle
the scenario above that (parent, child) = (0,0) becomes (1, 0).

We cannot think about the semantics without implementation
detail.  And if we allow such scenario, the hierarchical semantics
is fake and has no meaning.


> 3. Will there ever be a reasonable use-case where there is non-isolated
>    sub-tree under an isolated ancestor?

I think no, but again, we need to think about the scenario above,
otherwise, your ideal semantics is just broken.

Also, "no reasonable scenario" does not always mean "we must
prevent the scenario".

If there's nothing harmful, we can just let it be, especially if such
restriction gives nothing andrather hurts performance with no
good reason.


>
> Please give some thought to the above (and related) questions.

Please think about the implementation detail and if its trade-off
(just keeping semantics vs code churn & perf regression) makes
really sense.


>
> I am still not convinced that memcg is the right home for this opt-out
> feature. I have CCed cgroup folks to get their opinion as well.

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Jakub Kicinski 1 month, 3 weeks ago

On Wed, 23 Jul 2025 11:06:14 -0700 Kuniyuki Iwashima wrote:
> > 3. Will there ever be a reasonable use-case where there is non-isolated
> >    sub-tree under an isolated ancestor?  
> 
> I think no, but again, we need to think about the scenario above,
> otherwise, your ideal semantics is just broken.
> 
> Also, "no reasonable scenario" does not always mean "we must
> prevent the scenario".
> 
> If there's nothing harmful, we can just let it be, especially if such
> restriction gives nothing andrather hurts performance with no
> good reason.

Stating the obvious perhaps but it's probably too late in the release
cycle to get enough agreement here to merge the series. So I'll mark
it as Deferred.

While I'm typing, TBH I'm not sure I'm following the arguments about
making the property hierarchical. Since the memory limit gets inherited
I don't understand why the property of being isolated would not.
Either I don't understand the memcg enough, or I don't understand your
intended semantics. Anyway..

Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Posted by Kuniyuki Iwashima 1 month, 3 weeks ago

On Thu, Jul 24, 2025 at 6:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 23 Jul 2025 11:06:14 -0700 Kuniyuki Iwashima wrote:
> > > 3. Will there ever be a reasonable use-case where there is non-isolated
> > >    sub-tree under an isolated ancestor?
> >
> > I think no, but again, we need to think about the scenario above,
> > otherwise, your ideal semantics is just broken.
> >
> > Also, "no reasonable scenario" does not always mean "we must
> > prevent the scenario".
> >
> > If there's nothing harmful, we can just let it be, especially if such
> > restriction gives nothing andrather hurts performance with no
> > good reason.
>
> Stating the obvious perhaps but it's probably too late in the release
> cycle to get enough agreement here to merge the series. So I'll mark
> it as Deferred.

Fair enough.

>
> While I'm typing, TBH I'm not sure I'm following the arguments about
> making the property hierarchical. Since the memory limit gets inherited
> I don't understand why the property of being isolated would not.
> Either I don't understand the memcg enough, or I don't understand your
> intended semantics. Anyway..

Inheriting a config is easy, but keeping the hierarchy complete isn't,
or maybe I'm thinking too hard :S

[root@fedora ~]# mkdir /sys/fs/cgroup/test1
[root@fedora ~]# mkdir /sys/fs/cgroup/test1/test2
[root@fedora ~]# echo +memory > /sys/fs/cgroup/test1/cgroup.subtree_control
[root@fedora ~]# echo 10000 > /sys/fs/cgroup/test1/test2/memory.max
[root@fedora ~]# echo 1000 > /sys/fs/cgroup/test1/memory.max
[  108.130895] bash invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL),
order=0, oom_score_adj=0
...
[  108.260164] Out of memory and no killable processes...
[root@fedora ~]# cat /sys/fs/cgroup/test1/test2/memory.max
8192
[root@fedora ~]# cat /sys/fs/cgroup/test1/memory.max
0