From: Qiliang Yuan
Date: Sat, 30 May 2026 15:35:27 +0800
Subject: [PATCH v4] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
Message-Id: <20260530-feature-dmem-high-v4-1-ee7c6ec1c8da@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
 Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
 Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný, Natalie Vock
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, Qiliang Yuan

The dmem cgroup v2 controller currently only provides a hard "max" limit,
which causes immediate allocation failures when a cgroup's device memory
usage reaches its quota. GPU-bound AI workloads need smoother
over-subscription support: a soft limit that temporarily allows excess
usage while applying backpressure through reclaim rather than outright
failure.

Add dmem.high, a soft limit that penalizes over-limit cgroups by evicting
their buffer objects first when eviction is triggered (e.g. due to a "max"
limit hit).
Unlike the rejected v1 approach which used sleep-on-allocation
throttling, this version provides a meaningful recovery action through
prioritized reclaim.

Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits(). Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.

Extend dmem_cgroup_state_evict_valuable() with a "try_high" parameter.
When set, the function evaluates the try_high condition first (before the
limit_pool == test_pool shortcut) so that even the limit-hitting cgroup's
own BOs are filtered by the high threshold. It then walks the page_counter
parent chain to check whether any ancestor exceeds its high limit, and
verifies that the pool is above its effective minimum to respect dmem.min
protection.

Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy. Pass 1 uses
a blocking lock and targets only BOs whose cgroup exceeds dmem.high,
ensuring over-limit cgroups are penalized even when their BOs are actively
in use. Pass 2 falls back to the standard above-elow trylock eviction.
Pass 3+ uses proper locking and repeats while making progress with the
existing low-watermark fallback.

Signed-off-by: Qiliang Yuan
---
Introduce a "high" soft limit for the dmem cgroup v2 controller. When a
"max" limit is hit and eviction is triggered, buffer objects belonging to
cgroups that exceed their dmem.high limit are targeted first, providing a
meaningful recovery action through reclaim.

The dmem cgroup currently only supports hard "max" limits, which cause
immediate allocation failures for GPU-bound workloads. A soft limit
enables smoother over-subscription by penalizing over-limit cgroups via
prioritized eviction rather than outright rejection.

The implementation adds a "high" cgroupfs control file per region, a
try_high parameter to dmem_cgroup_state_evict_valuable() for tier-1
eviction, and a 3-pass strategy in ttm_bo_evict_alloc().

---
V3 -> V4:
- Use a blocking lock in Pass 1 instead of trylock to ensure over-limit
  cgroups are penalized even when their BOs are actively in use, as
  requested by Maarten Lankhorst.
- Evaluate the try_high condition before the limit_pool == test_pool
  early-return so that the limit-hitting cgroup's own BOs are also
  filtered by dmem.high.
- Remove the high-priority compensation retry at the start of Pass 3,
  which is no longer needed now that Pass 1 uses a blocking lock.

V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
  child cgroups from evading the penalty when a parent cgroup exceeds its
  dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting BOs
  below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3 so
  that actively-used over-limit BOs (which failed trylock in Pass 1) are
  not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX to
  match the behavior of get_resource_max().

V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction. When
  a "max" limit is hit, BOs from cgroups exceeding dmem.high are evicted
  first in a dedicated pass. No throttling or sleeping is performed in
  the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
  resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter to
  target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
  (1) trylock: evict only BOs exceeding dmem.high
  (2) trylock: above-elow
  (3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().

v3: https://lore.kernel.org/r/20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com
v2: https://lore.kernel.org/r/20260522-feature-dmem-high-v2-1-1d7d4a0fa5da@gmail.com
v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
---
 drivers/gpu/drm/ttm/ttm_bo.c | 32 +++++++++++++----
 include/linux/cgroup_dmem.h  |  4 +--
 kernel/cgroup/dmem.c         | 81 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..bf06e9e4b18a3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
 
 	/** @limit_pool: Which pool limit we should test against */
 	struct dmem_cgroup_pool_state *limit_pool;
+	/** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+	bool try_high;
 	/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
 	bool try_low;
 	/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
-					      evict_walk->try_low, &evict_walk->hit_low))
+					      evict_walk->try_high, evict_walk->try_low,
+					      &evict_walk->hit_low))
 		return 0;
 
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,46 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	};
 	s64 lret;
 
-	evict_walk.walk.arg.trylock_only = true;
+	/*
+	 * Pass 1 (blocking, high-priority): Evict only BOs whose cgroup
+	 * exceeds its dmem.high soft limit. A blocking lock is used to
+	 * ensure over-limit cgroups are penalized even when their BOs are
+	 * actively in use.
+	 */
+	evict_walk.walk.arg.trylock_only = false;
+	evict_walk.try_high = true;
 	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
-	/* One more attempt if we hit low limit? */
+	/*
+	 * Pass 2 (trylock): Evict BOs above the effective low watermark.
+	 * Falls back to low-priority eviction if needed.
+	 */
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	if (!lret && evict_walk.hit_low) {
 		evict_walk.try_low = true;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	}
+
 	if (lret || !ticket)
 		goto out;
 
-	/* Reset low limit */
+	/*
+	 * Pass 3+ (properly locked): Evict while making progress.
+	 * Reset flags and retry with try_low if we hit the low watermark.
+	 */
 	evict_walk.try_low = evict_walk.hit_low = false;
-	/* If ticket-locking, repeat while making progress. */
 	evict_walk.walk.arg.trylock_only = false;
 
 retry:
 	do {
-		/* The walk may clear the evict_walk.walk.ticket field */
 		evict_walk.walk.arg.ticket = ticket;
 		evict_walk.evicted = 0;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	} while (!lret && evict_walk.evicted);
 
-	/* We hit the low limit? Try once more */
 	if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
 		evict_walk.try_low = true;
 		goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low);
+				      bool try_high, bool ignore_low, bool *ret_hit_low);
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
 static inline
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	return true;
 }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..f81fbb538cf2f 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
 	page_counter_set_low(&pool->cnt, val);
 }
 
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+	page_counter_set_high(&pool->cnt, val);
+}
+
 static void
 set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
 {
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
 	return pool ? READ_ONCE(pool->cnt.low) : 0;
 }
 
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
 static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
 {
 	return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
 	set_resource_low(rpool, 0);
+	set_resource_high(rpool, PAGE_COUNTER_MAX);
 	set_resource_max(rpool, PAGE_COUNTER_MAX);
 }
 
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  * dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
  * @limit_pool: The pool for which we hit limits
  * @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
  * @ignore_low: Whether we have to respect low watermarks.
  * @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
  *
  * This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
  * When returning false and @ignore_low is false, @ret_hit_low may
  * be set to true to indicate this function can be retried with @ignore_low
  * set to true.
@@ -301,12 +316,56 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  */
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	struct dmem_cgroup_pool_state *pool = test_pool;
 	struct page_counter *ctest;
 	u64 used, min, low;
 
+	ctest = &test_pool->cnt;
+	used = page_counter_read(ctest);
+
+	if (try_high) {
+		/*
+		 * When the limit-hitting cgroup's own BOs are being
+		 * considered, only evict them if their pool exceeds its
+		 * own dmem.high limit. No ancestry check is needed
+		 * because the limit was triggered by this pool itself.
+		 */
+		if (limit_pool == test_pool)
+			return used > READ_ONCE(ctest->high);
+
+		{
+			struct page_counter *c;
+
+			/*
+			 * Walk the page_counter parent chain to check
+			 * whether any ancestor cgroup exceeds its
+			 * dmem.high limit. This prevents child cgroups
+			 * from evading the penalty when a parent cgroup
+			 * is over its high limit.
+			 */
+			if (used <= READ_ONCE(ctest->high)) {
+				for (c = ctest->parent; c; c = c->parent) {
+					if (page_counter_read(c) >
+					    READ_ONCE(c->high))
+						break;
+				}
+				if (!c)
+					return false;
+			}
+		}
+
+		/*
+		 * Respect dmem.min protection: do not evict BOs below the
+		 * effective minimum even during the high-priority pass.
+		 */
+		dmem_cgroup_calculate_protection(limit_pool, test_pool);
+		min = READ_ONCE(ctest->emin);
+
+		return used > min;
+	}
+
 	/* Can always evict from current pool, despite limits */
 	if (limit_pool == test_pool)
 		return true;
@@ -329,11 +388,8 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 			{}
 	}
 
-	ctest = &test_pool->cnt;
-
 	dmem_cgroup_calculate_protection(limit_pool, test_pool);
 
-	used = page_counter_read(ctest);
 	min = READ_ONCE(ctest->emin);
 
 	if (used <= min)
@@ -835,6 +891,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
 	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
 }
 
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+					     char *buf, size_t nbytes, loff_t off)
+{
+	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
 static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +935,12 @@ static struct cftype files[] = {
 		.seq_show = dmem_cgroup_region_low_show,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+	{
+		.name = "high",
+		.write = dmem_cgroup_region_high_write,
+		.seq_show = dmem_cgroup_region_high_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "max",
 		.write = dmem_cgroup_region_max_write,

---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38

Best regards,
-- 
Qiliang Yuan