From: Qiliang Yuan
Date: Thu, 28 May 2026 20:03:28 +0800
Subject: [PATCH v3] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"
Message-Id: <20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com>
X-Change-ID: 20260519-feature-dmem-high-16997148dc38
To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
 Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
 Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný, Natalie Vock
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, Qiliang Yuan
X-Mailer: b4 0.14.3

The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota. GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim rather
than outright failure.

Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit).
Unlike the rejected v1 approach, which used sleep-on-allocation
throttling, this version provides a meaningful recovery action through
prioritized reclaim.

Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits(). Like
get_resource_max(), get_resource_high() returns PAGE_COUNTER_MAX when
the pool is NULL.

Extend dmem_cgroup_state_evict_valuable() with a "try_high" parameter.
When set, the function checks whether the pool itself, or any ancestor
on its page_counter parent chain, exceeds its high limit, then verifies
that the pool is above its effective minimum to respect dmem.min
protection. Only pools meeting both criteria are evicted.

Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy. Pass 1
uses trylock and targets only BOs whose cgroup exceeds dmem.high.
Pass 2 falls back to the standard trylock eviction of BOs above their
effective low watermark (elow), retrying with the low watermark ignored
if it was hit. Pass 3 begins with a properly locked high-priority pass
in case Pass 1 failed due to trylock contention, then proceeds with the
standard repeat-while-making-progress loop with low-watermark fallback.

Signed-off-by: Qiliang Yuan
---
Introduce a "high" soft limit for the dmem cgroup v2 controller. When a
"max" limit is hit and eviction is triggered, buffer objects belonging
to cgroups that exceed their dmem.high limit are targeted first,
providing a meaningful recovery action through reclaim.

The dmem cgroup currently only supports hard "max" limits, which cause
immediate allocation failures for GPU-bound workloads. A soft limit
enables smoother over-subscription by penalizing over-limit cgroups via
prioritized eviction rather than outright rejection.

The implementation adds a "high" cgroupfs control file per region, a
try_high parameter to dmem_cgroup_state_evict_valuable() for tier-1
eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
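
As a reading aid for the try_high check described above, here is a
minimal userspace sketch. It is not the kernel code: struct counter is
a hypothetical stand-in for struct page_counter, and emin is assumed to
be precomputed, whereas the patch derives it through
dmem_cgroup_calculate_protection().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct page_counter. */
struct counter {
	uint64_t usage;		/* current charge */
	uint64_t high;		/* dmem.high soft limit */
	uint64_t emin;		/* effective dmem.min, assumed precomputed */
	struct counter *parent;	/* NULL at the root */
};

/* True if @c itself or any of its ancestors is above its high limit. */
static bool over_high_in_chain(const struct counter *c)
{
	for (; c; c = c->parent) {
		if (c->usage > c->high)
			return true;
	}
	return false;
}

/*
 * Model of the try_high branch: a pool is evictable in the high-priority
 * pass only if something in its chain is over "high" AND its own usage
 * is above its effective minimum.
 */
static bool evictable_in_high_pass(const struct counter *test)
{
	if (!over_high_in_chain(test))
		return false;

	return test->usage > test->emin;
}
```

In this model a child that is under its own limit still qualifies when
its parent is over "high", and stops qualifying once its usage is
within its effective minimum.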
---
V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
  child cgroups from evading the penalty when a parent cgroup exceeds
  its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting BOs
  below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
  so that actively-used over-limit BOs (which failed trylock in Pass 1)
  are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
  to match the behavior of get_resource_max().

V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
  When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
  evicted first in a dedicated pass. No throttling or sleeping is
  performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
  resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter to
  target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
  (1) trylock: evict only BOs exceeding dmem.high
  (2) trylock: above-elow
  (3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().
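
The pass ordering listed above can be sketched as a toy model. This is
an illustration under loud assumptions, not TTM code: struct toy_bo,
its flags, and every toy_* function are hypothetical, each walk picks
at most one victim, and the repeat-while-making-progress loop of the
last pass is collapsed into a single walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy buffer object; none of these fields exist in TTM. */
struct toy_bo {
	bool over_high;	/* cgroup (or an ancestor) is above dmem.high */
	bool contended;	/* a trylock on this BO would fail */
	bool below_low;	/* cgroup is within its effective low protection */
};

enum toy_pass {
	TOY_NONE,
	TOY_PASS1_TRYLOCK_HIGH,
	TOY_PASS2_TRYLOCK,
	TOY_PASS3_LOCKED_HIGH,
	TOY_PASS3_LOCKED,
};

struct toy_pick {
	enum toy_pass pass;
	int index;	/* -1 when nothing was evictable */
};

/* One LRU walk: return the first BO this pass may evict, or -1. */
static int toy_walk(const struct toy_bo *bos, size_t n, bool trylock_only,
		    bool try_high, bool try_low, bool *hit_low)
{
	for (size_t i = 0; i < n; i++) {
		if (try_high) {
			if (!bos[i].over_high)
				continue;
		} else if (bos[i].below_low && !try_low) {
			*hit_low = true;	/* worth retrying with try_low */
			continue;
		}
		if (trylock_only && bos[i].contended)
			continue;
		return (int)i;
	}
	return -1;
}

/* Standard walk, retried once with low protection ignored if it was hit. */
static int toy_walk_low_fallback(const struct toy_bo *bos, size_t n,
				 bool trylock_only)
{
	bool hit_low = false;
	int idx = toy_walk(bos, n, trylock_only, false, false, &hit_low);

	if (idx < 0 && hit_low)
		idx = toy_walk(bos, n, trylock_only, false, true, &hit_low);
	return idx;
}

static struct toy_pick toy_evict_alloc(const struct toy_bo *bos, size_t n,
				       bool have_ticket)
{
	bool unused = false;
	int idx;

	/* Pass 1: trylock, only BOs above dmem.high. */
	idx = toy_walk(bos, n, true, true, false, &unused);
	if (idx >= 0)
		return (struct toy_pick){ TOY_PASS1_TRYLOCK_HIGH, idx };

	/* Pass 2: trylock, standard eviction with low fallback. */
	idx = toy_walk_low_fallback(bos, n, true);
	if (idx >= 0)
		return (struct toy_pick){ TOY_PASS2_TRYLOCK, idx };
	if (!have_ticket)
		return (struct toy_pick){ TOY_NONE, -1 };

	/* Pass 3: proper locking, high-priority retry first. */
	idx = toy_walk(bos, n, false, true, false, &unused);
	if (idx >= 0)
		return (struct toy_pick){ TOY_PASS3_LOCKED_HIGH, idx };

	idx = toy_walk_low_fallback(bos, n, false);
	if (idx >= 0)
		return (struct toy_pick){ TOY_PASS3_LOCKED, idx };
	return (struct toy_pick){ TOY_NONE, -1 };
}
```

The have_ticket flag models the "!ticket" early exit: without a ticket
only the two trylock passes run.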

v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
v2: https://lore.kernel.org/all/20260522-feature-dmem-high-v2-1-d805deddecbb@gmail.com
---
 drivers/gpu/drm/ttm/ttm_bo.c | 35 ++++++++++++++++++++----
 include/linux/cgroup_dmem.h  |  4 +--
 kernel/cgroup/dmem.c         | 65 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..2f2b428f1d30a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
 
 	/** @limit_pool: Which pool limit we should test against */
 	struct dmem_cgroup_pool_state *limit_pool;
+	/** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+	bool try_high;
 	/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
 	bool try_low;
 	/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
-					      evict_walk->try_low, &evict_walk->hit_low))
+					      evict_walk->try_high, evict_walk->try_low,
+					      &evict_walk->hit_low))
 		return 0;
 
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,51 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	};
 	s64 lret;
 
+	/*
+	 * Pass 1 (trylock): Only evict BOs whose cgroup is above its
+	 * dmem.high soft limit. This penalizes over-limit cgroups first.
+	 */
 	evict_walk.walk.arg.trylock_only = true;
+	evict_walk.try_high = true;
 	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
-	/* One more attempt if we hit low limit? */
+	/*
+	 * Pass 2 (trylock): Evict BOs above the effective low watermark.
+	 * Falls back to low-priority eviction if needed.
+	 */
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	if (!lret && evict_walk.hit_low) {
 		evict_walk.try_low = true;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	}
+
 	if (lret || !ticket)
 		goto out;
 
-	/* Reset low limit */
+	/*
+	 * Pass 3+ (properly locked): Evict while making progress.
+	 * First retry the high-priority pass with proper locking in case
+	 * Pass 1 failed due to trylock contention on over-limit BOs.
+	 * If that still fails, fall back to the standard low-priority eviction.
+	 */
 	evict_walk.try_low = evict_walk.hit_low = false;
-	/* If ticket-locking, repeat while making progress. */
 	evict_walk.walk.arg.trylock_only = false;
+	evict_walk.try_high = true;
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
 retry:
 	do {
-		/* The walk may clear the evict_walk.walk.ticket field */
 		evict_walk.walk.arg.ticket = ticket;
 		evict_walk.evicted = 0;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	} while (!lret && evict_walk.evicted);
 
-	/* We hit the low limit? Try once more */
 	if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
 		evict_walk.try_low = true;
 		goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low);
+				      bool try_high, bool ignore_low, bool *ret_hit_low);
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
 static inline
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	return true;
 }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..c80444c0da177 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
 	page_counter_set_low(&pool->cnt, val);
 }
 
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+	page_counter_set_high(&pool->cnt, val);
+}
+
 static void
 set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
 {
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
 	return pool ? READ_ONCE(pool->cnt.low) : 0;
 }
 
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
 static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
 {
 	return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
 	set_resource_low(rpool, 0);
+	set_resource_high(rpool, PAGE_COUNTER_MAX);
 	set_resource_max(rpool, PAGE_COUNTER_MAX);
 }
 
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  * dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
  * @limit_pool: The pool for which we hit limits
  * @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
  * @ignore_low: Whether we have to respect low watermarks.
  * @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
  *
  * This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
  * When returning false and @ignore_low is false, @ret_hit_low may
  * be set to true to indicate this function can be retried with @ignore_low
  * set to true.
@@ -301,7 +316,7 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  */
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	struct dmem_cgroup_pool_state *pool = test_pool;
 	struct page_counter *ctest;
@@ -331,9 +346,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 
 	ctest = &test_pool->cnt;
 
+	used = page_counter_read(ctest);
+
+	if (try_high) {
+		struct page_counter *c;
+
+		/*
+		 * Walk the page_counter parent chain to check whether any
+		 * ancestor cgroup exceeds its dmem.high limit. This prevents
+		 * child cgroups from evading the penalty when a parent cgroup
+		 * is over its high limit.
+		 */
+		if (used <= READ_ONCE(ctest->high)) {
+			for (c = ctest->parent; c; c = c->parent) {
+				if (page_counter_read(c) > READ_ONCE(c->high))
+					break;
+			}
+			if (!c)
+				return false;
+		}
+
+		/*
+		 * Respect dmem.min protection: do not evict BOs below the
+		 * effective minimum even during the high-priority pass.
+		 */
+		dmem_cgroup_calculate_protection(limit_pool, test_pool);
+		min = READ_ONCE(ctest->emin);
+
+		return used > min;
+	}
+
 	dmem_cgroup_calculate_protection(limit_pool, test_pool);
 
-	used = page_counter_read(ctest);
 	min = READ_ONCE(ctest->emin);
 
 	if (used <= min)
@@ -835,6 +879,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
 	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
 }
 
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+					     char *buf, size_t nbytes, loff_t off)
+{
+	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
 static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +923,12 @@ static struct cftype files[] = {
 		.seq_show = dmem_cgroup_region_low_show,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+	{
+		.name = "high",
+		.write = dmem_cgroup_region_high_write,
+		.seq_show = dmem_cgroup_region_high_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "max",
 		.write = dmem_cgroup_region_max_write,

---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38

Best regards,
-- 
Qiliang Yuan