[PATCH v3] cgroup/dmem: allow max to be set below current usage

Thadeu Lima de Souza Cascardo posted 1 patch 1 week ago
kernel/cgroup/dmem.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
[PATCH v3] cgroup/dmem: allow max to be set below current usage
Posted by Thadeu Lima de Souza Cascardo 1 week ago
page_counter_set_max may return -EBUSY in case the current usage is above
the new max. When writing to dmem.max, this error is ignored and the new
max is not set.

Instead of using page_counter_set_max when writing to dmem.max, atomically
update its value irrespective of the current usage.

Since there is no current mechanism to evict a given dmemcg pool, this will
at least prevent the current usage from growing any further.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
When writing to dmem.max, it was noticed that some writes did not take
effect, even though the write was successful.

It turns out that page_counter_set_max checks if the new max value is above
the current usage and returns -EBUSY in case it is, which was being ignored
by dmemcg_limit_write.

It was also noticed that when setting limits for multiple regions in a
single write, while setting one region's limit may fail, others might have
succeeded before. Tejun Heo brought up that this breaks atomicity.

Maarten Lankhorst and Michal Koutný have brought up that instead of
failing, setting max below the current usage should behave like memcg and
start eviction until usage goes below the new max.

However, since there is no current mechanism to evict a given region, they
suggest that setting the new max will at least prevent further allocations.

v1 kept the multiple region support when writing to limit files, while
returning -EBUSY as soon as setting a region's max was above the usage.

v2 only allows setting a single region limit per write, while allowing any
new max to be set.

This version (v3) still allows multiple regions to be set, and explains why
page_counter_set_max is not used anymore.

I am sending this version dropping the multiple region restriction for now,
as we continue to discuss whether it should be supported or not.
---
Changes in v3:
- Dropped first patch as it was already applied.
- Added comment explaining why page_counter_set_max is not used.
- Dropped patch restricting multiple regions to be set for now.
- Link to v2: https://lore.kernel.org/r/20260319-dmem_max_ebusy-v2-0-b5ce97205269@igalia.com

Changes in v2:
- Remove support for setting multiple regions' limits.
- Allow any new max limit to be set.
- Link to v1: https://lore.kernel.org/r/20260318-dmem_max_ebusy-v1-1-b7e461157b29@igalia.com
---
 kernel/cgroup/dmem.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 1ab1fb47f2711ecc60dd13e611a8a4920b48f3e9..c00aa06759967a8f8977aaf036dd7966ddb55718 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -159,7 +159,15 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
 static void
 set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
 {
-	page_counter_set_max(&pool->cnt, val);
+	/*
+	 * page_counter_set_max will return -EBUSY in case the current
+	 * usage is above the new max.
+	 *
+	 * Since, there is no current eviction mechanism yet, setting max
+	 * irrespective of the current usage will prevent further
+	 * allocations.
+	 */
+	xchg(&pool->cnt.max, val);
 }
 
 static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)

---
base-commit: 6ffdb01db6d5d785e7278c6d98fd4ef8598b0fc5
change-id: 20260318-dmem_max_ebusy-bfd33564f2c3

Best regards,
-- 
Thadeu Lima de Souza Cascardo <cascardo@igalia.com>