Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.
When running under a non-root cgroup, this memory is also charged to the
memcg and reported as "sock" in memory.stat.
Even when memory usage is controlled by memcg, sockets using such protocols
are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.
If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0-2] to UINT_MAX.
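For example (an illustrative command only, not part of this series; the
exact values would depend on the deployment), the global limits could be
effectively disabled with:
# sysctl -w net.ipv4.tcp_mem="4294967295 4294967295 4294967295"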
In reality, this assumption does not always hold, and a single workload
that opts out of memcg can consume memory up to the global limit,
becoming a noisy neighbour.
Let's decouple memcg from the global per-protocol memory accounting.
This simplifies memcg configuration while keeping the global limits
within a reasonable range.
If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.
In inet_csk_accept(), we need to reclaim the amount already charged to
the per-protocol global counter for child sockets, because sk->sk_memcg
is not allocated until accept().
Note that trace_sock_exceed_buf_limit() will always report 0 as the
globally allocated amount for isolated sockets; their usage can still be
obtained via memory.stat.
Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
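The script is not included in this patch; the following is a minimal
sketch of what such a script might look like (the constants and the
filename pressure.py are placeholders, and the actual script may differ):

    #!/usr/bin/env python3
    # Open many loopback TCP connections and keep send()ing from the
    # clients; the accepted sockets never recv(), so data piles up in
    # their receive queues (and in the senders' send queues).
    # Run with a raised fd limit (e.g., prlimit -n) for many connections.
    import socket

    NR_CONNS = 1024          # number of local connections (placeholder)
    CHUNK = b"a" * 1000      # payload per send() (placeholder)

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(NR_CONNS)
    port = listener.getsockname()[1]

    clients, servers = [], []
    for _ in range(NR_CONNS):
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.connect(("127.0.0.1", port))
        s, _ = listener.accept()
        c.setblocking(False)
        clients.append(c)
        servers.append(s)    # intentionally never recv()ed from

    while True:
        for c in clients:
            try:
                c.send(CHUNK)
            except OSError:
                # EAGAIN or allocation failure under memory pressure
                pass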
Setup:
# mkdir /sys/fs/cgroup/test
# echo $$ >> /sys/fs/cgroup/test/cgroup.procs
# sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
Without memory.socket_isolated:
# echo 0 > /sys/fs/cgroup/test/memory.socket_isolated
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 24682496
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:37738
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:60122
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:33622
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:35042
# nstat | grep Pressure || echo no pressure
TcpExtTCPMemoryPressures 1 0.0
With memory.socket_isolated:
# echo 1 > /sys/fs/cgroup/test/memory.socket_isolated
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 2766671872
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 112000 0 127.0.0.1:41729 127.0.0.1:35062
ESTAB 110000 0 127.0.0.1:41729 127.0.0.1:36288
ESTAB 112000 0 127.0.0.1:41729 127.0.0.1:37560
ESTAB 112000 0 127.0.0.1:41729 127.0.0.1:37096
# nstat | grep Pressure || echo no pressure
no pressure
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/net/proto_memory.h | 10 +++--
include/net/tcp.h | 10 +++--
net/core/sock.c | 65 +++++++++++++++++++++++----------
net/ipv4/inet_connection_sock.c | 18 +++++++--
net/ipv4/tcp_output.c | 10 ++++-
5 files changed, 82 insertions(+), 31 deletions(-)
diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 8e91a8fa31b52..3c2e92f5a6866 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,9 +31,13 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
if (!sk->sk_prot->memory_pressure)
return false;
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_isolated(sk))
+ return false;
+ }
return !!READ_ONCE(*sk->sk_prot->memory_pressure);
}
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9ffe971a1856b..a5ff82a59867b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
/* optimized version of sk_under_memory_pressure() for TCP sockets */
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_isolated(sk))
+ return false;
+ }
return READ_ONCE(tcp_memory_pressure);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index ab6953d295dfa..e1ae6d03b8227 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
if (!charged)
return -ENOMEM;
- /* pre-charge to forward_alloc */
- sk_memory_allocated_add(sk, pages);
- allocated = sk_memory_allocated(sk);
- /* If the system goes into memory pressure with this
- * precharge, give up and return error.
- */
- if (allocated > sk_prot_mem_limits(sk, 1)) {
- sk_memory_allocated_sub(sk, pages);
- mem_cgroup_sk_uncharge(sk, pages);
- return -ENOMEM;
+ if (!mem_cgroup_sk_isolated(sk)) {
+ /* pre-charge to forward_alloc */
+ sk_memory_allocated_add(sk, pages);
+ allocated = sk_memory_allocated(sk);
+
+ /* If the system goes into memory pressure with this
+ * precharge, give up and return error.
+ */
+ if (allocated > sk_prot_mem_limits(sk, 1)) {
+ sk_memory_allocated_sub(sk, pages);
+ mem_cgroup_sk_uncharge(sk, pages);
+ return -ENOMEM;
+ }
}
+
sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,12 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
return true;
- sk_enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
+
+ if (mem_cgroup_sk_enabled(sk) && mem_cgroup_sk_isolated(sk))
+ return false;
+
+ sk_enter_memory_pressure(sk);
return false;
}
EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3275,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
{
bool memcg_enabled = false, charged = false;
struct proto *prot = sk->sk_prot;
- long allocated;
-
- sk_memory_allocated_add(sk, amt);
- allocated = sk_memory_allocated(sk);
+ long allocated = 0;
if (mem_cgroup_sk_enabled(sk)) {
+ bool isolated = mem_cgroup_sk_isolated(sk);
+
memcg_enabled = true;
charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
- if (!charged)
+
+ if (isolated && charged)
+ return 1;
+
+ if (!charged) {
+ if (!isolated) {
+ sk_memory_allocated_add(sk, amt);
+ allocated = sk_memory_allocated(sk);
+ }
+
goto suppress_allocation;
+ }
}
+ sk_memory_allocated_add(sk, amt);
+ allocated = sk_memory_allocated(sk);
+
/* Under limit. */
if (allocated <= sk_prot_mem_limits(sk, 0)) {
sk_leave_memory_pressure(sk);
@@ -3357,7 +3377,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
- sk_memory_allocated_sub(sk, amt);
+ if (allocated)
+ sk_memory_allocated_sub(sk, amt);
if (charged)
mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3417,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
*/
void __sk_mem_reduce_allocated(struct sock *sk, int amount)
{
- sk_memory_allocated_sub(sk, amount);
-
- if (mem_cgroup_sk_enabled(sk))
+ if (mem_cgroup_sk_enabled(sk)) {
mem_cgroup_sk_uncharge(sk, amount);
+ if (mem_cgroup_sk_isolated(sk))
+ return;
+ }
+
+ sk_memory_allocated_sub(sk, amount);
+
if (sk_under_global_memory_pressure(sk) &&
(sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d1..9d56085f7f54b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
#include <net/tcp.h>
#include <net/sock_reuseport.h>
#include <net/addrconf.h>
+#include <net/proto_memory.h>
#if IS_ENABLED(CONFIG_IPV6)
/* match_sk*_wildcard == true: IPV6_ADDR_ANY equals to any IPv6 addresses
@@ -710,7 +711,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
if (mem_cgroup_sockets_enabled) {
gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
- int amt = 0;
/* atomically get the memory usage, set and charge the
* newsk->sk_memcg.
@@ -719,15 +719,27 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
mem_cgroup_sk_alloc(newsk);
if (mem_cgroup_from_sk(newsk)) {
+ int amt;
+
/* The socket has not been accepted yet, no need
* to look at newsk->sk_wmem_queued.
*/
amt = sk_mem_pages(newsk->sk_forward_alloc +
atomic_read(&newsk->sk_rmem_alloc));
+ if (amt) {
+ /* This amt is already charged globally to
+ * sk_prot->memory_allocated due to lack of
+ * sk_memcg until accept(), thus we need to
+ * reclaim it here if newsk is isolated.
+ */
+ if (mem_cgroup_sk_isolated(newsk))
+ sk_memory_allocated_sub(newsk, amt);
+
+ mem_cgroup_sk_charge(newsk, amt, gfp);
+ }
+
}
- if (amt)
- mem_cgroup_sk_charge(newsk, amt, gfp);
kmem_cache_charge(newsk, gfp);
release_sock(newsk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 09f0802f36afa..79e705fca8b67 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3562,12 +3562,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
delta = size - sk->sk_forward_alloc;
if (delta <= 0)
return;
+
amt = sk_mem_pages(delta);
sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
- sk_memory_allocated_add(sk, amt);
- if (mem_cgroup_sk_enabled(sk))
+ if (mem_cgroup_sk_enabled(sk)) {
mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+ if (mem_cgroup_sk_isolated(sk))
+ return;
+ }
+
+ sk_memory_allocated_add(sk, amt);
}
/* Send a FIN. The caller locks the socket for us.
--
2.50.0.727.gbf7dc18ff4-goog
On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote: > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > buffers and charge memory to per-protocol global counters pointed to by > sk->sk_proto->memory_allocated. > > When running under a non-root cgroup, this memory is also charged to the > memcg as sock in memory.stat. > > Even when memory usage is controlled by memcg, sockets using such protocols > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). IIUC the envisioned use case is that some cgroups feed from global resource and some from their own limit. It means the admin knows both: a) how to configure individual cgroup, b) how to configure global limit (for the rest). So why cannot they stick to a single model only? > This makes it difficult to accurately estimate and configure appropriate > global limits, especially in multi-tenant environments. > > If all workloads were guaranteed to be controlled under memcg, the issue > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > In reality, this assumption does not always hold, and a single workload > that opts out of memcg can consume memory up to the global limit, > becoming a noisy neighbour. That doesn't like a good idea to remove limits from possibly noisy units. > Let's decouple memcg from the global per-protocol memory accounting. > > This simplifies memcg configuration while keeping the global limits > within a reasonable range. I think this is a configuration issue only, i.e. instead of preserving the global limit because of _some_ memcgs, the configuration management could have a default memcg limit that is substituted to those memcgs so that there's no risk of runaways even in absence of global limit. Regards, Michal
On Thu, Jul 31, 2025 at 6:39 AM Michal Koutný <mkoutny@suse.com> wrote: > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote: > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > > buffers and charge memory to per-protocol global counters pointed to by > > sk->sk_proto->memory_allocated. > > > > When running under a non-root cgroup, this memory is also charged to the > > memcg as sock in memory.stat. > > > > Even when memory usage is controlled by memcg, sockets using such protocols > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > IIUC the envisioned use case is that some cgroups feed from global > resource and some from their own limit. > It means the admin knows both: > a) how to configure individual cgroup, > b) how to configure global limit (for the rest). > So why cannot they stick to a single model only? > > > This makes it difficult to accurately estimate and configure appropriate > > global limits, especially in multi-tenant environments. > > > > If all workloads were guaranteed to be controlled under memcg, the issue > > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > > > In reality, this assumption does not always hold, and a single workload > > that opts out of memcg can consume memory up to the global limit, > > becoming a noisy neighbour. > > That doesn't like a good idea to remove limits from possibly noisy > units. > > > Let's decouple memcg from the global per-protocol memory accounting. > > > > This simplifies memcg configuration while keeping the global limits > > within a reasonable range. > > I think this is a configuration issue only, i.e. instead of preserving > the global limit because of _some_ memcgs, the configuration management > could have a default memcg limit that is substituted to those memcgs so > that there's no risk of runaways even in absence of global limit. Doesn't that end up implementing another tcp_mem[] which now enforce limits on uncontrolled cgroups (memory.max == max) ? Or it will simply end up with the system-wide OOM killer ?
Kuniyuki Iwashima <kuniyu@google.com> writes: > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > buffers and charge memory to per-protocol global counters pointed to by > sk->sk_proto->memory_allocated. > > When running under a non-root cgroup, this memory is also charged to the > memcg as sock in memory.stat. > > Even when memory usage is controlled by memcg, sockets using such protocols > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > This makes it difficult to accurately estimate and configure appropriate > global limits, especially in multi-tenant environments. > > If all workloads were guaranteed to be controlled under memcg, the issue > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > In reality, this assumption does not always hold, and a single workload > that opts out of memcg can consume memory up to the global limit, > becoming a noisy neighbour. > > Let's decouple memcg from the global per-protocol memory accounting. > > This simplifies memcg configuration while keeping the global limits > within a reasonable range. I don't think it should be a memcg feature. In fact, it doesn't have much to do with cgroups at all (it's not hierarchical, it doesn't control the resource allocation, and in the end it controls an alternative to memory cgroups memory accounting system). Instead, it can be a per-process prctl option. (Assuming the feature is really needed - I'm also curious why some processes have to be excluded from the memcg accounting - it sounds like generally a bad idea). Thanks
On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > buffers and charge memory to per-protocol global counters pointed to by > sk->sk_proto->memory_allocated. > > When running under a non-root cgroup, this memory is also charged to the > memcg as sock in memory.stat. > > Even when memory usage is controlled by memcg, sockets using such protocols > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > This makes it difficult to accurately estimate and configure appropriate > global limits, especially in multi-tenant environments. > > If all workloads were guaranteed to be controlled under memcg, the issue > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > In reality, this assumption does not always hold, and a single workload > that opts out of memcg can consume memory up to the global limit, > becoming a noisy neighbour. Yes, an uncontrolled cgroup can consume all of a shared resource and thereby become a noisy neighbor. Why is network memory special? I assume you have some other mechanisms for curbing things like filesystem caches, anon memory, swap etc. of such otherwise uncontrolled groups, and this just happens to be your missing piece. But at this point, you're operating so far out of the cgroup resource management model that I don't think it can be reasonably supported. I hate to say this, but can't you carry this out of tree until the transition is complete? I just don't think it makes any sense to have this as a permanent fixture in a general-purpose container management interface.
On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > > buffers and charge memory to per-protocol global counters pointed to by > > sk->sk_proto->memory_allocated. > > > > When running under a non-root cgroup, this memory is also charged to the > > memcg as sock in memory.stat. > > > > Even when memory usage is controlled by memcg, sockets using such protocols > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > > > This makes it difficult to accurately estimate and configure appropriate > > global limits, especially in multi-tenant environments. > > > > If all workloads were guaranteed to be controlled under memcg, the issue > > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > > > In reality, this assumption does not always hold, and a single workload > > that opts out of memcg can consume memory up to the global limit, > > becoming a noisy neighbour. > > Yes, an uncontrolled cgroup can consume all of a shared resource and > thereby become a noisy neighbor. Why is network memory special? > > I assume you have some other mechanisms for curbing things like > filesystem caches, anon memory, swap etc. of such otherwise > uncontrolled groups, and this just happens to be your missing piece. I think that's the tcp_mem[] knob, limiting tcp mem globally for the "uncontrolled" cgroup. But we can't use it because the "controlled" cgroup is also limited by this knob. If we want to properly control the "controlled" cgroup by its feature only, we must disable the global limit completely on the host, meaning we lose the "missing piece". Currently, there are only two poor choices 1) Use tcp_mem[] but memory allocation could fail even if the cgroup has available memory 2) Disable tcp_mem[] but uncontrolled cgroup lose seatbelt and can consume memory up to system limit but what we really need is 3) Uncontrolled cgroup is limited by tcp_mem[], AND for controlled cgroup, memory allocation won't fail if it has available memory regardless of tcp_mem[] > > But at this point, you're operating so far out of the cgroup resource > management model that I don't think it can be reasonably supported. I think it's rather operated under the normal cgroup management model, relying on the configured memory limit for each cgroup. What's wrong here is we had to set tcp_mem[] to UINT_MAX and get rid of the seatbelt for uncontrolled cgroup for the management model. But this is just because cgroup mem is also charged globally to TCP, which should not be. > > I hate to say this, but can't you carry this out of tree until the > transition is complete? > > I just don't think it makes any sense to have this as a permanent > fixture in a general-purpose container management interface. I understand that, and we should eventually fix "1) or 2)" to just 3), but introducing this change without a knob will break assumptions in userspace and trigger regression. cgroup v2 is now widely enabled by major distro, and systemd creates many processes under non-root cgroups but without memory limits. If we had no knob, such processes would suddenly lose the tcp_mem[] seatbelt and could consume memory up to system limit. How about adding the knob's deprecation plan by pr_warn_once() or something and letting users configure the max properly by that ?
On Mon, Jul 28, 2025 at 02:41:38PM -0700, Kuniyuki Iwashima wrote: > On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > > > buffers and charge memory to per-protocol global counters pointed to by > > > sk->sk_proto->memory_allocated. > > > > > > When running under a non-root cgroup, this memory is also charged to the > > > memcg as sock in memory.stat. > > > > > > Even when memory usage is controlled by memcg, sockets using such protocols > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > > > > > This makes it difficult to accurately estimate and configure appropriate > > > global limits, especially in multi-tenant environments. > > > > > > If all workloads were guaranteed to be controlled under memcg, the issue > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > > > > > In reality, this assumption does not always hold, and a single workload > > > that opts out of memcg can consume memory up to the global limit, > > > becoming a noisy neighbour. > > > > Yes, an uncontrolled cgroup can consume all of a shared resource and > > thereby become a noisy neighbor. Why is network memory special? > > > > I assume you have some other mechanisms for curbing things like > > filesystem caches, anon memory, swap etc. of such otherwise > > uncontrolled groups, and this just happens to be your missing piece. > > I think that's the tcp_mem[] knob, limiting tcp mem globally for > the "uncontrolled" cgroup. But we can't use it because the > "controlled" cgroup is also limited by this knob. No, I was really asking what you do about other types of memory consumed by such uncontrolled cgroups. You can't have uncontrolled groups and complain about their resource consumption.
On Tue, Jul 29, 2025 at 7:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Jul 28, 2025 at 02:41:38PM -0700, Kuniyuki Iwashima wrote: > > On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > > > > buffers and charge memory to per-protocol global counters pointed to by > > > > sk->sk_proto->memory_allocated. > > > > > > > > When running under a non-root cgroup, this memory is also charged to the > > > > memcg as sock in memory.stat. > > > > > > > > Even when memory usage is controlled by memcg, sockets using such protocols > > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > > > > > > > This makes it difficult to accurately estimate and configure appropriate > > > > global limits, especially in multi-tenant environments. > > > > > > > > If all workloads were guaranteed to be controlled under memcg, the issue > > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > > > > > > > In reality, this assumption does not always hold, and a single workload > > > > that opts out of memcg can consume memory up to the global limit, > > > > becoming a noisy neighbour. > > > > > > Yes, an uncontrolled cgroup can consume all of a shared resource and > > > thereby become a noisy neighbor. Why is network memory special? > > > > > > I assume you have some other mechanisms for curbing things like > > > filesystem caches, anon memory, swap etc. of such otherwise > > > uncontrolled groups, and this just happens to be your missing piece. > > > > I think that's the tcp_mem[] knob, limiting tcp mem globally for > > the "uncontrolled" cgroup. But we can't use it because the > > "controlled" cgroup is also limited by this knob. > > No, I was really asking what you do about other types of memory > consumed by such uncontrolled cgroups. > > You can't have uncontrolled groups and complain about their resource > consumption. Only 10% of physical memory is allowed to be used globally for TCP. How is it supposed to work if we don't enforce limits on uncontrolled cgroups ?
On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > buffers and charge memory to per-protocol global counters pointed to by > sk->sk_proto->memory_allocated. > > When running under a non-root cgroup, this memory is also charged to the > memcg as sock in memory.stat. > > Even when memory usage is controlled by memcg, sockets using such protocols > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > This makes it difficult to accurately estimate and configure appropriate > global limits, especially in multi-tenant environments. > > If all workloads were guaranteed to be controlled under memcg, the issue > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > In reality, this assumption does not always hold, and a single workload > that opts out of memcg can consume memory up to the global limit, > becoming a noisy neighbour. > Sorry but the above is not reasonable. On a multi-tenant system no workload should be able to opt out of memcg accounting if isolation is needed. If a workload can opt out then there is no guarantee. In addition please avoid adding a per-memcg knob. Why not have system level setting for the decoupling. I would say start with a build time config setting or boot parameter then if really needed we can discuss if system level setting is needed which can be toggled at runtime though there might be challenges there.
On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > > buffers and charge memory to per-protocol global counters pointed to by > > sk->sk_proto->memory_allocated. > > > > When running under a non-root cgroup, this memory is also charged to the > > memcg as sock in memory.stat. > > > > Even when memory usage is controlled by memcg, sockets using such protocols > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > > > This makes it difficult to accurately estimate and configure appropriate > > global limits, especially in multi-tenant environments. > > > > If all workloads were guaranteed to be controlled under memcg, the issue > > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > > > In reality, this assumption does not always hold, and a single workload > > that opts out of memcg can consume memory up to the global limit, > > becoming a noisy neighbour. > > > > Sorry but the above is not reasonable. On a multi-tenant system no > workload should be able to opt out of memcg accounting if isolation is > needed. If a workload can opt out then there is no guarantee. Deployment issue ? In a multi-tenant system you can not suddenly force all workloads to be TCP memcg charged. This has caused many OMG. Also, the current situation of maintaining two limits (memcg one, plus global tcp_memory_allocated) is very inefficient. If we trust memcg, then why have an expensive safety belt ? With this series, we can finally use one or the other limit. This should have been done from day-0 really. > > In addition please avoid adding a per-memcg knob. Why not have system > level setting for the decoupling. I would say start with a build time > config setting or boot parameter then if really needed we can discuss if > system level setting is needed which can be toggled at runtime though > there might be challenges there. Built time or boot parameter ? I fail to see how it can be more convenient.
On Tue, Jul 22, 2025 at 08:24:23AM -0700, Eric Dumazet wrote: > On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > > > buffers and charge memory to per-protocol global counters pointed to by > > > sk->sk_proto->memory_allocated. > > > > > > When running under a non-root cgroup, this memory is also charged to the > > > memcg as sock in memory.stat. > > > > > > Even when memory usage is controlled by memcg, sockets using such protocols > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > > > > > This makes it difficult to accurately estimate and configure appropriate > > > global limits, especially in multi-tenant environments. > > > > > > If all workloads were guaranteed to be controlled under memcg, the issue > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > > > > > In reality, this assumption does not always hold, and a single workload > > > that opts out of memcg can consume memory up to the global limit, > > > becoming a noisy neighbour. > > > > > > > Sorry but the above is not reasonable. On a multi-tenant system no > > workload should be able to opt out of memcg accounting if isolation is > > needed. If a workload can opt out then there is no guarantee. > > Deployment issue ? > > In a multi-tenant system you can not suddenly force all workloads to > be TCP memcg charged. This has caused many OMG. Let's discuss the above at the end. > > Also, the current situation of maintaining two limits (memcg one, plus > global tcp_memory_allocated) is very inefficient. Agree. > > If we trust memcg, then why have an expensive safety belt ? > > With this series, we can finally use one or the other limit. This > should have been done from day-0 really. Same, I agree. > > > > > In addition please avoid adding a per-memcg knob. Why not have system > > level setting for the decoupling. I would say start with a build time > > config setting or boot parameter then if really needed we can discuss if > > system level setting is needed which can be toggled at runtime though > > there might be challenges there. > > Built time or boot parameter ? I fail to see how it can be more convenient. I think we agree on decoupling the global and memcg accounting of network memory. I am still not clear on the need of per-memcg knob. From the earlier comment, it seems like you want mix of jobs with memcg limited network memory accounting and with global network accounting running concurrently on a system. Is that correct? I expect this state of jobs with different network accounting config running concurrently is temporary while the migrationg from one to other is happening. Please correct me if I am wrong. My main concern with the memcg knob is that it is permanent and it requires a hierarchical semantics. No need to add a permanent interface for a temporary need and I don't see a clear hierarchical semantic for this interface. I am wondering if alternative approches for per-workload settings are explore starting with BPF.
On Tue, Jul 22, 2025 at 8:52 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Tue, Jul 22, 2025 at 08:24:23AM -0700, Eric Dumazet wrote: > > On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote: > > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket > > > > buffers and charge memory to per-protocol global counters pointed to by > > > > sk->sk_proto->memory_allocated. > > > > > > > > When running under a non-root cgroup, this memory is also charged to the > > > > memcg as sock in memory.stat. > > > > > > > > Even when memory usage is controlled by memcg, sockets using such protocols > > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem). > > > > > > > > This makes it difficult to accurately estimate and configure appropriate > > > > global limits, especially in multi-tenant environments. > > > > > > > > If all workloads were guaranteed to be controlled under memcg, the issue > > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX. > > > > > > > > In reality, this assumption does not always hold, and a single workload > > > > that opts out of memcg can consume memory up to the global limit, > > > > becoming a noisy neighbour. > > > > > > > > > > Sorry but the above is not reasonable. On a multi-tenant system no > > > workload should be able to opt out of memcg accounting if isolation is > > > needed. If a workload can opt out then there is no guarantee. > > > > Deployment issue ? > > > > In a multi-tenant system you can not suddenly force all workloads to > > be TCP memcg charged. This has caused many OMG. > > Let's discuss the above at the end. > > > > > Also, the current situation of maintaining two limits (memcg one, plus > > global tcp_memory_allocated) is very inefficient. > > Agree. > > > > > If we trust memcg, then why have an expensive safety belt ? > > > > With this series, we can finally use one or the other limit. This > > should have been done from day-0 really. > > Same, I agree. > > > > > > > > > In addition please avoid adding a per-memcg knob. Why not have system > > > level setting for the decoupling. I would say start with a build time > > > config setting or boot parameter then if really needed we can discuss if > > > system level setting is needed which can be toggled at runtime though > > > there might be challenges there. > > > > Built time or boot parameter ? I fail to see how it can be more convenient. > > I think we agree on decoupling the global and memcg accounting of > network memory. I am still not clear on the need of per-memcg knob. From > the earlier comment, it seems like you want mix of jobs with memcg > limited network memory accounting and with global network accounting > running concurrently on a system. Is that correct? Correct. > > I expect this state of jobs with different network accounting config > running concurrently is temporary while the migrationg from one to other > is happening. Please correct me if I am wrong. We need to migrate workload gradually and the system-wide config does not work at all. AFAIU, there are already years of effort spent on the migration but it's not yet completed at Google. So, I don't think the need is temporary. > > My main concern with the memcg knob is that it is permanent and it > requires a hierarchical semantics. No need to add a permanent interface > for a temporary need and I don't see a clear hierarchical semantic for > this interface. 
I don't see merits of having hierarchical semantics for this knob. Regardless of this knob, hierarchical semantics is guaranteed by other knobs. I think such semantics for this knob just complicates the code with no gain. > > I am wondering if alternative approches for per-workload settings are > explore starting with BPF. > > >
On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote: > > > > I expect this state of jobs with different network accounting config > > running concurrently is temporary while the migrationg from one to other > > is happening. Please correct me if I am wrong. > > We need to migrate workload gradually and the system-wide config > does not work at all. AFAIU, there are already years of effort spent > on the migration but it's not yet completed at Google. So, I don't think > the need is temporary. > From what I remembered shared borg had completely moved to memcg accounting of network memory (with sys container as an exception) years ago. Did something change there? > > > > My main concern with the memcg knob is that it is permanent and it > > requires a hierarchical semantics. No need to add a permanent interface > > for a temporary need and I don't see a clear hierarchical semantic for > > this interface. > > I don't see merits of having hierarchical semantics for this knob. > Regardless of this knob, hierarchical semantics is guaranteed > by other knobs. I think such semantics for this knob just complicates > the code with no gain. > Cgroup interfaces are hierarchical and we want to keep it that way. Putting non-hierarchical interfaces just makes configuration and setup hard to reason about. > > > > > I am wondering if alternative approches for per-workload settings are > > explore starting with BPF. > > Any response on the above? Any alternative approaches explored?
On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote: > > > > > > I expect this state of jobs with different network accounting config > > > running concurrently is temporary while the migrationg from one to other > > > is happening. Please correct me if I am wrong. > > > > We need to migrate workload gradually and the system-wide config > > does not work at all. AFAIU, there are already years of effort spent > > on the migration but it's not yet completed at Google. So, I don't think > > the need is temporary. > > > > From what I remembered shared borg had completely moved to memcg > accounting of network memory (with sys container as an exception) years > ago. Did something change there? AFAICS, there are some workloads that opted out from memcg and consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering OOM and disrupting other workloads. > > > > > > > My main concern with the memcg knob is that it is permanent and it > > > requires a hierarchical semantics. No need to add a permanent interface > > > for a temporary need and I don't see a clear hierarchical semantic for > > > this interface. > > > > I don't see merits of having hierarchical semantics for this knob. > > Regardless of this knob, hierarchical semantics is guaranteed > > by other knobs. I think such semantics for this knob just complicates > > the code with no gain. > > > > Cgroup interfaces are hierarchical and we want to keep it that way. > Putting non-hierarchical interfaces just makes configuration and setup > hard to reason about. Actually, I tried that way in the initial draft version, but even if the parent's knob is 1 and child one is 0, a harmful scenario didn't come to my mind. > > > > > > > > > I am wondering if alternative approches for per-workload settings are > > > explore starting with BPF. > > > > > Any response on the above? Any alternative approaches explored? Do you mean flagging each socket by BPF at cgroup hook ? I think it's overkill and we don't need such finer granularity. Also it sounds way too hacky to use BPF to correct the weird behaviour from day0. We should have more generic way to control that. I know this functionality is helpful for some workloads at Amazon as well.
On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote: > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote: > > > > > > > > I expect this state of jobs with different network accounting config > > > > running concurrently is temporary while the migrationg from one to other > > > > is happening. Please correct me if I am wrong. > > > > > > We need to migrate workload gradually and the system-wide config > > > does not work at all. AFAIU, there are already years of effort spent > > > on the migration but it's not yet completed at Google. So, I don't think > > > the need is temporary. > > > > > > > From what I remembered shared borg had completely moved to memcg > > accounting of network memory (with sys container as an exception) years > > ago. Did something change there? > > AFAICS, there are some workloads that opted out from memcg and > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering > OOM and disrupting other workloads. > What were the reasons behind opting out? We should fix those instead of a permanent opt-out option. > > > > > > > > > > My main concern with the memcg knob is that it is permanent and it > > > > requires a hierarchical semantics. No need to add a permanent interface > > > > for a temporary need and I don't see a clear hierarchical semantic for > > > > this interface. > > > > > > I don't see merits of having hierarchical semantics for this knob. > > > Regardless of this knob, hierarchical semantics is guaranteed > > > by other knobs. I think such semantics for this knob just complicates > > > the code with no gain. > > > > > > > Cgroup interfaces are hierarchical and we want to keep it that way. > > Putting non-hierarchical interfaces just makes configuration and setup > > hard to reason about. > > Actually, I tried that way in the initial draft version, but even if the > parent's knob is 1 and child one is 0, a harmful scenario didn't come > to my mind. > It is not just about harmful scenario but more about clear semantics. Check memory.zswap.writeback semantics. > > > > > > > > > > > > > > I am wondering if alternative approches for per-workload settings are > > > > explore starting with BPF. > > > > > > > > Any response on the above? Any alternative approaches explored? > > Do you mean flagging each socket by BPF at cgroup hook ? Not sure. Will it not be very similar to your current approach? Each socket is associated with a memcg and the at the place where you need to check which accounting method to use, just check that memcg setting in bpf and you can cache the result in socket as well. > > I think it's overkill and we don't need such finer granularity. > > Also it sounds way too hacky to use BPF to correct the weird > behaviour from day0. What weird behavior? Two accounting mechanisms. Yes I agree but memcgs with different accounting mechanisms concurrently is also weird. > We should have more generic way to > control that. I know this functionality is helpful for some workloads > at Amazon as well. The reason I am against this permanent opt-out interface is if we add this interface then we will never fix the underlying issues blocking the full conversion to memcg accounting of network memory. I am ok with some temporary measures to allow opt-out impacted workload until the underlying issue is fixed.
On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote: > > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote: > > > > > > > > > > I expect this state of jobs with different network accounting config > > > > > running concurrently is temporary while the migrationg from one to other > > > > > is happening. Please correct me if I am wrong. > > > > > > > > We need to migrate workload gradually and the system-wide config > > > > does not work at all. AFAIU, there are already years of effort spent > > > > on the migration but it's not yet completed at Google. So, I don't think > > > > the need is temporary. > > > > > > > > > > From what I remembered shared borg had completely moved to memcg > > > accounting of network memory (with sys container as an exception) years > > > ago. Did something change there? > > > > AFAICS, there are some workloads that opted out from memcg and > > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering > > OOM and disrupting other workloads. > > > > What were the reasons behind opting out? We should fix those > instead of a permanent opt-out option. > > > > > > > > > > > > > > My main concern with the memcg knob is that it is permanent and it > > > > > requires a hierarchical semantics. No need to add a permanent interface > > > > > for a temporary need and I don't see a clear hierarchical semantic for > > > > > this interface. > > > > > > > > I don't see merits of having hierarchical semantics for this knob. > > > > Regardless of this knob, hierarchical semantics is guaranteed > > > > by other knobs. I think such semantics for this knob just complicates > > > > the code with no gain. > > > > > > > > > > Cgroup interfaces are hierarchical and we want to keep it that way. > > > Putting non-hierarchical interfaces just makes configuration and setup > > > hard to reason about. > > > > Actually, I tried that way in the initial draft version, but even if the > > parent's knob is 1 and child one is 0, a harmful scenario didn't come > > to my mind. > > > > It is not just about harmful scenario but more about clear semantics. > Check memory.zswap.writeback semantics. zswap checks all parent cgroups when evaluating the knob, but this is not an option for the networking fast path as we cannot check them for every skb, which will degrade the performance. Also, we don't track which sockets were created with the knob enabled and how many such sockets are still left under the cgroup, there is no way to keep options consistent throughout the hierarchy and no need to try hard to make the option pretend to be consistent if there's no real issue. > > > > > > > > > > > > > > > > > > > > I am wondering if alternative approches for per-workload settings are > > > > > explore starting with BPF. > > > > > > > > > > > Any response on the above? Any alternative approaches explored? > > > > Do you mean flagging each socket by BPF at cgroup hook ? > > Not sure. Will it not be very similar to your current approach? Each > socket is associated with a memcg and the at the place where you need to > check which accounting method to use, just check that memcg setting in > bpf and you can cache the result in socket as well. The socket pointer is not writable by default, thus we need to add a bpf helper or kfunc just for flipping a single bit. 
As said, this is overkill, and per-memcg knob is much simpler. > > > > > I think it's overkill and we don't need such finer granularity. > > > > Also it sounds way too hacky to use BPF to correct the weird > > behaviour from day0. > > What weird behavior? Two accounting mechanisms. Yes I agree but memcgs > with different accounting mechanisms concurrently is also weird. Not that weird given the root cgroup does not allocate sk->sk_memcg and are subject to the global tcp memory accounting. We already have a mixed set of memcgs. Also, not every cgroup sets memory limits. systemd puts some processes to a non-root cgroup by default without setting memory.max. In such a case we definitely want the global memory accounting to take place. Having to set memory.max to every non-root cgroup is less flexible and too restricted.
On Tue, Jul 22, 2025 at 02:59:33PM -0700, Kuniyuki Iwashima wrote: > On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote: > > > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote: > > > > > > > > > > > > I expect this state of jobs with different network accounting config > > > > > > running concurrently is temporary while the migrationg from one to other > > > > > > is happening. Please correct me if I am wrong. > > > > > > > > > > We need to migrate workload gradually and the system-wide config > > > > > does not work at all. AFAIU, there are already years of effort spent > > > > > on the migration but it's not yet completed at Google. So, I don't think > > > > > the need is temporary. > > > > > > > > > > > > > From what I remembered shared borg had completely moved to memcg > > > > accounting of network memory (with sys container as an exception) years > > > > ago. Did something change there? > > > > > > AFAICS, there are some workloads that opted out from memcg and > > > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering > > > OOM and disrupting other workloads. > > > > > > > What were the reasons behind opting out? We should fix those > > instead of a permanent opt-out option. > > Any response to the above? > > > > > > > > > > > > > > > > My main concern with the memcg knob is that it is permanent and it > > > > > > requires a hierarchical semantics. No need to add a permanent interface > > > > > > for a temporary need and I don't see a clear hierarchical semantic for > > > > > > this interface. > > > > > > > > > > I don't see merits of having hierarchical semantics for this knob. > > > > > Regardless of this knob, hierarchical semantics is guaranteed > > > > > by other knobs. I think such semantics for this knob just complicates > > > > > the code with no gain. > > > > > > > > > > > > > Cgroup interfaces are hierarchical and we want to keep it that way. > > > > Putting non-hierarchical interfaces just makes configuration and setup > > > > hard to reason about. > > > > > > Actually, I tried that way in the initial draft version, but even if the > > > parent's knob is 1 and child one is 0, a harmful scenario didn't come > > > to my mind. > > > > > > > It is not just about harmful scenario but more about clear semantics. > > Check memory.zswap.writeback semantics. > > zswap checks all parent cgroups when evaluating the knob, but > this is not an option for the networking fast path as we cannot > check them for every skb, which will degrade the performance. That's an implementation detail and you can definitely optimize it. One possible way might be caching the state in socket at creation time which puts some restrictions like to change the config, workload needs to be restarted. > > Also, we don't track which sockets were created with the knob > enabled and how many such sockets are still left under the cgroup, > there is no way to keep options consistent throughout the hierarchy > and no need to try hard to make the option pretend to be consistent > if there's no real issue. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am wondering if alternative approches for per-workload settings are > > > > > > explore starting with BPF. > > > > > > > > > > > > > > Any response on the above? Any alternative approaches explored? 
> > > > > > Do you mean flagging each socket by BPF at cgroup hook ? > > > > Not sure. Will it not be very similar to your current approach? Each > > socket is associated with a memcg and the at the place where you need to > > check which accounting method to use, just check that memcg setting in > > bpf and you can cache the result in socket as well. > > The socket pointer is not writable by default, thus we need to add > a bpf helper or kfunc just for flipping a single bit. As said, this is > overkill, and per-memcg knob is much simpler. > Your simple solution is exposing a stable permanent user facing API which I suspect is temporary situation. Let's discuss it at the end. > > > > > > > > > I think it's overkill and we don't need such finer granularity. > > > > > > Also it sounds way too hacky to use BPF to correct the weird > > > behaviour from day0. > > > > What weird behavior? Two accounting mechanisms. Yes I agree but memcgs > > with different accounting mechanisms concurrently is also weird. > > Not that weird given the root cgroup does not allocate sk->sk_memcg > and are subject to the global tcp memory accounting. We already have > a mixed set of memcgs. Running workloads in root cgroup is not normal and comes with a warning of no isolation provided. I looked at the patch again to understand the modes you are introducing. Initially, I thought the series introduced multiple modes, including an option to exclude network memory from memcg accounting. However, if I understand correctly, that is not the case—the opt-out applies only to the global TCP/UDP accounting. That’s a relief, and I apologize for the misunderstanding. If I’m correct, you need a way to exclude a workload from the global TCP/UDP accounting, and currently, memcg serves as a convenient abstraction for the workload. Please let me know if I misunderstood. Now memcg is one way to represent the workload. Another more natural, at least to me, is the core cgroup. Basically cgroup.something interface. BPF is yet another option. To me cgroup seems preferrable but let's see what other memcg & cgroup folks think. Also note that for cgroup and memcg the interface will need to be hierarchical.
On Tue, Jul 22, 2025 at 5:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Tue, Jul 22, 2025 at 02:59:33PM -0700, Kuniyuki Iwashima wrote: > > On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote: > > > > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote: > > > > > > > > > > > > > > I expect this state of jobs with different network accounting config > > > > > > > running concurrently is temporary while the migrationg from one to other > > > > > > > is happening. Please correct me if I am wrong. > > > > > > > > > > > > We need to migrate workload gradually and the system-wide config > > > > > > does not work at all. AFAIU, there are already years of effort spent > > > > > > on the migration but it's not yet completed at Google. So, I don't think > > > > > > the need is temporary. > > > > > > > > > > > > > > > > From what I remembered shared borg had completely moved to memcg > > > > > accounting of network memory (with sys container as an exception) years > > > > > ago. Did something change there? > > > > > > > > AFAICS, there are some workloads that opted out from memcg and > > > > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering > > > > OOM and disrupting other workloads. > > > > > > > > > > What were the reasons behind opting out? We should fix those > > > instead of a permanent opt-out option. > > > > > Any response to the above? I'm just checking with internal folks, not sure if I will follow up on this though, see below. > > > > > > > > > > > > > > > > > > > > My main concern with the memcg knob is that it is permanent and it > > > > > > > requires a hierarchical semantics. No need to add a permanent interface > > > > > > > for a temporary need and I don't see a clear hierarchical semantic for > > > > > > > this interface. > > > > > > > > > > > > I don't see merits of having hierarchical semantics for this knob. > > > > > > Regardless of this knob, hierarchical semantics is guaranteed > > > > > > by other knobs. I think such semantics for this knob just complicates > > > > > > the code with no gain. > > > > > > > > > > > > > > > > Cgroup interfaces are hierarchical and we want to keep it that way. > > > > > Putting non-hierarchical interfaces just makes configuration and setup > > > > > hard to reason about. > > > > > > > > Actually, I tried that way in the initial draft version, but even if the > > > > parent's knob is 1 and child one is 0, a harmful scenario didn't come > > > > to my mind. > > > > > > > > > > It is not just about harmful scenario but more about clear semantics. > > > Check memory.zswap.writeback semantics. > > > > zswap checks all parent cgroups when evaluating the knob, but > > this is not an option for the networking fast path as we cannot > > check them for every skb, which will degrade the performance. > > That's an implementation detail and you can definitely optimize it. One > possible way might be caching the state in socket at creation time which > puts some restrictions like to change the config, workload needs to be > restarted. 
> > > > > Also, we don't track which sockets were created with the knob > > enabled and how many such sockets are still left under the cgroup, > > there is no way to keep options consistent throughout the hierarchy > > and no need to try hard to make the option pretend to be consistent > > if there's no real issue. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am wondering if alternative approches for per-workload settings are > > > > > > > explore starting with BPF. > > > > > > > > > > > > > > > > > Any response on the above? Any alternative approaches explored? > > > > > > > > Do you mean flagging each socket by BPF at cgroup hook ? > > > > > > Not sure. Will it not be very similar to your current approach? Each > > > socket is associated with a memcg and the at the place where you need to > > > check which accounting method to use, just check that memcg setting in > > > bpf and you can cache the result in socket as well. > > > > The socket pointer is not writable by default, thus we need to add > > a bpf helper or kfunc just for flipping a single bit. As said, this is > > overkill, and per-memcg knob is much simpler. > > > > Your simple solution is exposing a stable permanent user facing API > which I suspect is temporary situation. Let's discuss it at the end. > > > > > > > > > > > > > > I think it's overkill and we don't need such finer granularity. > > > > > > > > Also it sounds way too hacky to use BPF to correct the weird > > > > behaviour from day0. > > > > > > What weird behavior? Two accounting mechanisms. Yes I agree but memcgs > > > with different accounting mechanisms concurrently is also weird. > > > > Not that weird given the root cgroup does not allocate sk->sk_memcg > > and are subject to the global tcp memory accounting. We already have > > a mixed set of memcgs. > > Running workloads in root cgroup is not normal and comes with a warning > of no isolation provided. > > I looked at the patch again to understand the modes you are introducing. > Initially, I thought the series introduced multiple modes, including an > option to exclude network memory from memcg accounting. However, if I > understand correctly, that is not the case—the opt-out applies only to > the global TCP/UDP accounting. That’s a relief, and I apologize for the > misunderstanding. > > If I’m correct, you need a way to exclude a workload from the global > TCP/UDP accounting, and currently, memcg serves as a convenient > abstraction for the workload. Please let me know if I misunderstood. Correct. Currently, memcg by itself cannot guarantee that memory allocation for socket buffer does not fail even when memory.current < memory.max due to the global protocol limits. It means we need to increase the global limits to (bytes of TCP socket buffer in each cgroup) * (number of cgroup) , which is hard to predict, and I guess that's the reason why you or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global limit. But we should keep tcp_mem[] within a sane range in the first place. This series allows us to configure memcg limits only and let memcg guarantee no failure until it fully consumes memory.max. The point is that memcg should not be affected by the global limits, and this is orthogonal with the assumption that every workload should be running under memcg. > > Now memcg is one way to represent the workload. Another more natural, at > least to me, is the core cgroup. Basically cgroup.something interface. > BPF is yet another option. 
> Now memcg is one way to represent the workload. Another more natural one,
> at least to me, is the core cgroup, basically a cgroup.something interface.
> BPF is yet another option.
>
> To me cgroup seems preferable, but let's see what the other memcg & cgroup
> folks think. Also note that for cgroup and memcg the interface will need
> to be hierarchical.

As the root cgroup doesn't have the knob, these combinations are
considered hierarchical:

  (parent, child) = (0, 0), (0, 1), (1, 1)

and only the pattern below is not considered hierarchical:

  (parent, child) = (1, 0)

Let's say we lock the knob at the first socket creation, like your
idea above.

If a parent's and its child's knobs are (0, 0) and the child creates a
socket, the child memcg is locked as 0. When the parent enables the
knob, we must check all child cgroups as well. Or do we lock all the
parents' knobs when a socket is created in a child cgroup with knob=0?
In any case we need a global lock.

Well, I understand that hierarchical semantics is preferable for
cgroup, but I think it does not resolve any real issue and rather
churns the code unnecessarily.
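For completeness, the two knob semantics being debated would differ roughly
as below; memcg->socket_isolated and both helpers are hypothetical names used
only for this sketch, not from this series:

    /* Non-hierarchical reading: a single load, cheap enough for the
     * charge path, but a child can differ from its parent.
     */
    static bool memcg_isolated_flat(struct mem_cgroup *memcg)
    {
            return READ_ONCE(memcg->socket_isolated);
    }

    /* zswap-style hierarchical reading: any isolated ancestor wins.
     * This walk is what would have to be cached per socket and kept
     * consistent when a parent's knob flips, hence the locking concern
     * above.
     */
    static bool memcg_isolated_hierarchical(struct mem_cgroup *memcg)
    {
            for (; memcg; memcg = parent_mem_cgroup(memcg))
                    if (READ_ONCE(memcg->socket_isolated))
                            return true;

            return false;
    }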
Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF options.

On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
[...]
> > Running workloads in the root cgroup is not normal and comes with a
> > warning of no isolation provided.
> >
> > I looked at the patch again to understand the modes you are introducing.
> > Initially, I thought the series introduced multiple modes, including an
> > option to exclude network memory from memcg accounting. However, if I
> > understand correctly, that is not the case -- the opt-out applies only to
> > the global TCP/UDP accounting. That's a relief, and I apologize for the
> > misunderstanding.
> >
> > If I'm correct, you need a way to exclude a workload from the global
> > TCP/UDP accounting, and currently, memcg serves as a convenient
> > abstraction for the workload. Please let me know if I misunderstood.
>
> Correct.
>
> Currently, memcg by itself cannot guarantee that memory allocation for
> socket buffers does not fail even when memory.current < memory.max,
> due to the global protocol limits.
>
> It means we would need to increase the global limits to
>
>   (bytes of TCP socket buffer in each cgroup) * (number of cgroups),
>
> which is hard to predict, and I guess that's the reason why you
> or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> limit.

No, that was not the reason. The main reason behind the max tcp_mem global
limit was that it was not needed, as memcg should account and limit the
network memory. I think the reason you don't want the tcp_mem global limit
unlimited now is that you have an internal feature to let workloads opt out
of the memcg accounting of network memory, which is causing isolation
issues.

> But we should keep tcp_mem[] within a sane range in the first place.
>
> This series allows us to configure memcg limits only and lets memcg
> guarantee no failure until it fully consumes memory.max.
>
> The point is that memcg should not be affected by the global limits,
> and this is orthogonal to the assumption that every workload should
> be running under memcg.
>
> > Now memcg is one way to represent the workload. Another more natural
> > one, at least to me, is the core cgroup, basically a cgroup.something
> > interface. BPF is yet another option.
> >
> > To me cgroup seems preferable, but let's see what the other memcg &
> > cgroup folks think. Also note that for cgroup and memcg the interface
> > will need to be hierarchical.
>
> As the root cgroup doesn't have the knob, these combinations are
> considered hierarchical:
>
>   (parent, child) = (0, 0), (0, 1), (1, 1)
>
> and only the pattern below is not considered hierarchical:
>
>   (parent, child) = (1, 0)
>
> Let's say we lock the knob at the first socket creation, like your
> idea above.
>
> If a parent's and its child's knobs are (0, 0) and the child creates a
> socket, the child memcg is locked as 0. When the parent enables the
> knob, we must check all child cgroups as well. Or do we lock all the
> parents' knobs when a socket is created in a child cgroup with knob=0?
> In any case we need a global lock.
>
> Well, I understand that hierarchical semantics is preferable for
> cgroup, but I think it does not resolve any real issue and rather
> churns the code unnecessarily.

All this is implementation detail and I am asking about semantics. More
specifically:

1. Will the root be non-isolated always?
2. If a cgroup is isolated, does it mean all its descendants are isolated?
3. Will there ever be a reasonable use-case where there is a non-isolated
   sub-tree under an isolated ancestor?

Please give some thought to the above (and related) questions. I am still
not convinced that memcg is the right home for this opt-out feature. I have
CCed cgroup folks to get their opinion as well.
On Wed, Jul 23, 2025 at 10:28 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF
> options.
>
> On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
> [...]
> >
> > Currently, memcg by itself cannot guarantee that memory allocation for
> > socket buffers does not fail even when memory.current < memory.max,
> > due to the global protocol limits.
> >
> > It means we would need to increase the global limits to
> >
> >   (bytes of TCP socket buffer in each cgroup) * (number of cgroups),
> >
> > which is hard to predict, and I guess that's the reason why you
> > or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> > limit.
>
> No, that was not the reason. The main reason behind the max tcp_mem global
> limit was that it was not needed

but the global limit did take effect, thus you had to set tcp_mem to
unlimited.

> as memcg should account and limit the
> network memory.
>
> I think the reason you don't want the tcp_mem global limit
> unlimited now is

memcg has been subject to the global limit from day 0.

And note that not every process is under a memcg with memory.max
configured.

> that you have an internal feature to let workloads opt out
> of the memcg accounting of network memory, which is causing isolation
> issues.

> [...]
>
> > Well, I understand that hierarchical semantics is preferable for
> > cgroup, but I think it does not resolve any real issue and rather
> > churns the code unnecessarily.
>
> All this is implementation detail and I am asking about semantics. More
> specifically:
>
> 1. Will the root be non-isolated always?

Yes, because the root cgroup doesn't have memcg. Also, the knob has
CFTYPE_NOT_ON_ROOT.

> 2. If a cgroup is isolated, does it mean all its descendants are
>    isolated?

No, but this is because we MUST think about how we handle the scenario
above where (parent, child) = (0, 0) becomes (1, 0). We cannot think
about the semantics without the implementation detail. And if we allow
such a scenario, the hierarchical semantics is fake and has no meaning.

> 3. Will there ever be a reasonable use-case where there is a non-isolated
>    sub-tree under an isolated ancestor?

I think no, but again, we need to think about the scenario above;
otherwise, your ideal semantics is just broken.

Also, "no reasonable scenario" does not always mean "we must prevent
the scenario".

If there's nothing harmful, we can just let it be, especially if such a
restriction gives nothing and rather hurts performance with no good
reason.

> Please give some thought to the above (and related) questions.

Please think about the implementation detail and whether its trade-off
(just keeping the semantics vs. code churn & a perf regression) really
makes sense.

> I am still not convinced that memcg is the right home for this opt-out
> feature. I have CCed cgroup folks to get their opinion as well.
On Wed, 23 Jul 2025 11:06:14 -0700 Kuniyuki Iwashima wrote:
> > 3. Will there ever be a reasonable use-case where there is a non-isolated
> >    sub-tree under an isolated ancestor?
>
> I think no, but again, we need to think about the scenario above;
> otherwise, your ideal semantics is just broken.
>
> Also, "no reasonable scenario" does not always mean "we must prevent
> the scenario".
>
> If there's nothing harmful, we can just let it be, especially if such a
> restriction gives nothing and rather hurts performance with no good
> reason.

Stating the obvious perhaps, but it's probably too late in the release
cycle to get enough agreement here to merge the series, so I'll mark it
as Deferred.

While I'm typing, TBH I'm not sure I'm following the arguments about
making the property hierarchical. Since the memory limit gets inherited,
I don't understand why the property of being isolated would not be.
Either I don't understand memcg enough, or I don't understand your
intended semantics. Anyway..
On Thu, Jul 24, 2025 at 6:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 23 Jul 2025 11:06:14 -0700 Kuniyuki Iwashima wrote:
> > > 3. Will there ever be a reasonable use-case where there is a
> > >    non-isolated sub-tree under an isolated ancestor?
> >
> > I think no, but again, we need to think about the scenario above;
> > otherwise, your ideal semantics is just broken.
> >
> > Also, "no reasonable scenario" does not always mean "we must prevent
> > the scenario".
> >
> > If there's nothing harmful, we can just let it be, especially if such a
> > restriction gives nothing and rather hurts performance with no good
> > reason.
>
> Stating the obvious perhaps, but it's probably too late in the release
> cycle to get enough agreement here to merge the series, so I'll mark it
> as Deferred.

Fair enough.

> While I'm typing, TBH I'm not sure I'm following the arguments about
> making the property hierarchical. Since the memory limit gets inherited,
> I don't understand why the property of being isolated would not be.
> Either I don't understand memcg enough, or I don't understand your
> intended semantics. Anyway..

Inheriting a config is easy, but keeping the hierarchy complete isn't,
or maybe I'm thinking too hard :S

[root@fedora ~]# mkdir /sys/fs/cgroup/test1
[root@fedora ~]# mkdir /sys/fs/cgroup/test1/test2
[root@fedora ~]# echo +memory > /sys/fs/cgroup/test1/cgroup.subtree_control
[root@fedora ~]# echo 10000 > /sys/fs/cgroup/test1/test2/memory.max
[root@fedora ~]# echo 1000 > /sys/fs/cgroup/test1/memory.max
[  108.130895] bash invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
...
[  108.260164] Out of memory and no killable processes...
[root@fedora ~]# cat /sys/fs/cgroup/test1/test2/memory.max
8192
[root@fedora ~]# cat /sys/fs/cgroup/test1/memory.max
0