From: Hao Jia
To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
    shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org,
    mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
    muchun.song@linux.dev, roman.gushchin@linux.dev
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
Subject: [PATCH v2 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
Date: Mon, 25 May 2026 20:22:39 +0800
Message-Id: <20260525122242.36127-2-jiahao.kernel@gmail.com>
In-Reply-To: <20260525122242.36127-1-jiahao.kernel@gmail.com>
References: <20260525122242.36127-1-jiahao.kernel@gmail.com>

From: Hao Jia

The zswap background writeback worker shrink_worker() uses a global
cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
across the online memcgs under root_mem_cgroup.

Proactive writeback also wants a similar per-memcg cursor that is scoped
to the specified memcg, so that repeated invocations against the same
memcg make forward progress across its descendant memcgs instead of
restarting from the first child memcg each time.

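A rough userspace illustration of the per-subtree cursor idea described above, not the kernel code: struct node, next_preorder() and cursor_next() are hypothetical names, and locking, reference counting and online checks are deliberately left out. Each node owns its own cursor, so repeated calls against the same root resume where the previous call stopped and wrap around after returning NULL once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup: a plain tree node. */
struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
	struct node *cursor;	/* per-node cursor, playing the role of zswap_wb_iter.pos */
	int id;
};

/*
 * Return the pre-order successor of @pos within @root's subtree.
 * A NULL @pos starts the walk at @root itself; NULL is returned once
 * the whole subtree has been visited.
 */
static struct node *next_preorder(struct node *root, struct node *pos)
{
	if (!pos)
		return root;
	if (pos->first_child)
		return pos->first_child;
	/* Climb until a sibling is found, but never above @root. */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

/* Advance @root's own cursor by one step and return the new position. */
static struct node *cursor_next(struct node *root)
{
	assert(root != NULL);
	root->cursor = next_preorder(root, root->cursor);
	return root->cursor;
}
```

With a tree root -> {a -> {c}, b}, successive cursor_next(&root) calls yield root, a, c, b, NULL and then root again, while calls against a advance a's cursor independently of root's.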
Naturally, group the cursor and its protecting spinlock into a
zswap_wb_iter struct, and make it a member of struct mem_cgroup to
realize per-memcg cursor management. Accordingly, shrink_worker() now
uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.

Because the cursor is now per-memcg, the offline cleanup must visit
every ancestor that could be holding a reference to the dying memcg.
Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
to the root.

No functional change intended for shrink_worker().

Signed-off-by: Hao Jia
---
 include/linux/memcontrol.h |   3 +
 include/linux/zswap.h      |   9 +++
 mm/memcontrol.c            |   3 +
 mm/zswap.c                 | 119 ++++++++++++++++++++++++++-----------
 4 files changed, 98 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..b8323c8d6565 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -228,6 +228,9 @@ struct mem_cgroup {
 	 * swap, and from being swapped out on zswap store failures.
 	 */
 	bool zswap_writeback;
+
+	/* Per-memcg writeback cursor */
+	struct zswap_wb_iter zswap_wb_iter;
 #endif
 
 	/* vmpressure notifications */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..efa6b551217e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -11,6 +11,15 @@ extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
 
+/* Iteration cursor for zswap writeback over a memcg's subtree. */
+struct zswap_wb_iter {
+	/* protects @pos against concurrent advances */
+	spinlock_t lock;
+	struct mem_cgroup *pos;
+};
+
+void zswap_wb_iter_init(struct zswap_wb_iter *iter);
+
 struct zswap_lruvec_state {
 	/*
 	 * Number of swapped in pages from disk, i.e not found in the zswap pool.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d466..409c41359dc8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4022,6 +4022,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->memory_peaks);
 	INIT_LIST_HEAD(&memcg->swap_peaks);
 	spin_lock_init(&memcg->peaks_lock);
+#ifdef CONFIG_ZSWAP
+	zswap_wb_iter_init(&memcg->zswap_wb_iter);
+#endif
 	memcg->socket_pressure = get_jiffies_64();
 #if BITS_PER_LONG < 64
 	seqlock_init(&memcg->socket_pressure_seqlock);
diff --git a/mm/zswap.c b/mm/zswap.c
index 4b5149173b0e..6519f646b496 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -163,9 +163,6 @@ struct zswap_pool {
 /* Global LRU lists shared by all zswap pools. */
 static struct list_lru zswap_list_lru;
 
-/* The lock protects zswap_next_shrink updates. */
-static DEFINE_SPINLOCK(zswap_shrink_lock);
-static struct mem_cgroup *zswap_next_shrink;
 static struct work_struct zswap_shrink_work;
 static struct shrinker *zswap_shrinker;
 
@@ -717,28 +714,88 @@ void zswap_folio_swapin(struct folio *folio)
 	}
 }
 
-/*
- * This function should be called when a memcg is being offlined.
+void zswap_wb_iter_init(struct zswap_wb_iter *iter)
+{
+	spin_lock_init(&iter->lock);
+}
+
+#ifdef CONFIG_MEMCG
+/**
+ * zswap_mem_cgroup_iter - advance the writeback cursor
+ * @root: subtree root whose cursor to advance
+ *
+ * Advance @root->zswap_wb_iter.pos to @root itself or the next online
+ * descendant. Passing root_mem_cgroup yields a global walk.
  *
- * Since the global shrinker shrink_worker() may hold a reference
- * of the memcg, we must check and release the reference in
- * zswap_next_shrink.
+ * The cursor is retained across invocations, so successive calls walk
+ * @root's subtree cyclically in pre-order and, after %NULL, restart
+ * from the beginning.
  *
- * shrink_worker() must handle the case where this function releases
- * the reference of memcg being shrunk.
+ * The returned memcg carries an extra reference; release it with
+ * mem_cgroup_put().
+ *
+ * Return: the next online memcg in @root's subtree, or @root itself,
+ * with an extra reference, or %NULL after a full round-trip.
  */
-void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
 {
-	/* lock out zswap shrinker walking memcg tree */
-	spin_lock(&zswap_shrink_lock);
-	if (zswap_next_shrink == memcg) {
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	spin_lock(&root->zswap_wb_iter.lock);
+	do {
+		memcg = mem_cgroup_iter(root, root->zswap_wb_iter.pos, NULL);
+		root->zswap_wb_iter.pos = memcg;
+	} while (memcg && !mem_cgroup_tryget_online(memcg));
+	spin_unlock(&root->zswap_wb_iter.lock);
+
+	return memcg;
+}
+
+/*
+ * If @root's cursor currently points at @dead_memcg, advance it to the
+ * next online descendant so @dead_memcg can be freed.
+ */
+static void __zswap_memcg_offline_cleanup(struct mem_cgroup *root,
+					  struct mem_cgroup *dead_memcg)
+{
+	spin_lock(&root->zswap_wb_iter.lock);
+	if (root->zswap_wb_iter.pos == dead_memcg) {
 		do {
-			zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-		} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
+			root->zswap_wb_iter.pos =
+				mem_cgroup_iter(root,
+						root->zswap_wb_iter.pos, NULL);
+		} while (root->zswap_wb_iter.pos &&
+			 !mem_cgroup_online(root->zswap_wb_iter.pos));
 	}
-	spin_unlock(&zswap_shrink_lock);
+	spin_unlock(&root->zswap_wb_iter.lock);
+}
+
+/*
+ * Called when a memcg is being offlined. If @memcg or any of its
+ * ancestors has a cursor pointing at @memcg, it must be advanced
+ * past @memcg before @memcg can be freed. Walk the chain and
+ * release such references.
+ */
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	do {
+		__zswap_memcg_offline_cleanup(parent, memcg);
+	} while ((parent = parent_mem_cgroup(parent)));
+}
+#else /* !CONFIG_MEMCG */
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
+{
+	return NULL;
 }
 
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) { }
+#endif /* CONFIG_MEMCG */
+
 /*********************************
 * zswap entry functions
 **********************************/
@@ -1328,38 +1385,28 @@ static void shrink_worker(struct work_struct *w)
 	 * - No writeback-candidate memcgs found in a memcg tree walk.
 	 * - Shrinking a writeback-candidate memcg failed.
 	 *
-	 * We save iteration cursor memcg into zswap_next_shrink,
+	 * We save the iteration cursor in root_mem_cgroup->zswap_wb_iter.pos,
 	 * which can be modified by the offline memcg cleaner
 	 * zswap_memcg_offline_cleanup().
 	 *
 	 * Since the offline cleaner is called only once, we cannot leave an
-	 * offline memcg reference in zswap_next_shrink.
+	 * offline memcg reference in root_mem_cgroup->zswap_wb_iter.pos.
 	 * We can rely on the cleaner only if we get online memcg under lock.
 	 *
 	 * If we get an offline memcg, we cannot determine if the cleaner has
 	 * already been called or will be called later. We must put back the
 	 * reference before returning from this function. Otherwise, the
-	 * offline memcg left in zswap_next_shrink will hold the reference
-	 * until the next run of shrink_worker().
+	 * offline memcg left in root_mem_cgroup->zswap_wb_iter.pos will hold
+	 * the reference until the next run of shrink_worker().
 	 */
 	do {
 		/*
-		 * Start shrinking from the next memcg after zswap_next_shrink.
-		 * When the offline cleaner has already advanced the cursor,
-		 * advancing the cursor here overlooks one memcg, but this
-		 * should be negligibly rare.
-		 *
-		 * If we get an online memcg, keep the extra reference in case
-		 * the original one obtained by mem_cgroup_iter() is dropped by
-		 * zswap_memcg_offline_cleanup() while we are shrinking the
-		 * memcg.
+		 * Start shrinking from the next memcg after
+		 * root_mem_cgroup->zswap_wb_iter.pos. When the offline cleaner
+		 * has already advanced the cursor, advancing the cursor here
+		 * overlooks one memcg, but this should be negligibly rare.
 		 */
-		spin_lock(&zswap_shrink_lock);
-		do {
-			memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-			zswap_next_shrink = memcg;
-		} while (memcg && !mem_cgroup_tryget_online(memcg));
-		spin_unlock(&zswap_shrink_lock);
+		memcg = zswap_mem_cgroup_iter(root_mem_cgroup);
 
 		if (!memcg) {
 			/*
-- 
2.34.1

From: Hao Jia
To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
    shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org,
    mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
    muchun.song@linux.dev, roman.gushchin@linux.dev
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
Subject: [PATCH v2 2/4] mm/zswap: Implement proactive writeback
Date: Mon, 25 May 2026 20:22:40 +0800
Message-Id: <20260525122242.36127-3-jiahao.kernel@gmail.com>
In-Reply-To: <20260525122242.36127-1-jiahao.kernel@gmail.com>
References: <20260525122242.36127-1-jiahao.kernel@gmail.com>

From: Hao Jia

Zswap currently writes back pages to backing swap reactively, triggered
either by the shrinker or when the pool reaches its size limit. There is
no mechanism to control the amount of writeback for a specific memory
cgroup. However, users may want to proactively write back zswap pages,
e.g., to free up memory for other applications or to prepare for
memory-intensive workloads.

Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
interface. When specified, this key bypasses standard memory reclaim and
exclusively performs proactive zswap writeback up to the requested
budget. If omitted, the default reclaim behavior remains unchanged.

Example usage:
  # Write back 100MB of pages from zswap to the backing swap
  echo "100M zswap_writeback_only" > memory.reclaim

Note that the actual amount written back may be less than requested due
to the zswap second-chance algorithm: referenced entries are rotated on
the LRU on the first encounter and only written back on a second pass.
The interface returns -EAGAIN if no pages were successfully written
back.

Internally, extend user_proactive_reclaim() to parse the new
"zswap_writeback_only" token and invoke the dedicated handler. Add
zswap_proactive_writeback() to walk the target memcg subtree via the
per-memcg writeback cursor, draining per-node zswap LRUs through
list_lru_walk_one() with the shrink_memcg_cb() callback.

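To see why a request may write back fewer pages than asked, here is a small userspace model of the second-chance rule mentioned above. It is only a sketch under stated assumptions: struct entry and second_chance_pass() are hypothetical, array order stands in for LRU order, and the real rotation to the LRU tail is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one zswap LRU entry. */
struct entry {
	bool referenced;	/* set while the entry counts as recently used */
	bool written_back;	/* set once the entry has left the pool */
};

/*
 * One pass over @n entries with a budget of @nr_to_write.
 * A referenced entry only loses its flag (its "second chance") and is
 * skipped; an unreferenced entry is written back.
 * Returns the number of entries written back during this pass.
 */
static size_t second_chance_pass(struct entry *lru, size_t n, size_t nr_to_write)
{
	size_t written = 0;

	assert(lru != NULL || n == 0);

	for (size_t i = 0; i < n && written < nr_to_write; i++) {
		if (lru[i].written_back)
			continue;
		if (lru[i].referenced) {
			lru[i].referenced = false;	/* spared this time */
			continue;
		}
		lru[i].written_back = true;
		written++;
	}
	return written;
}
```

In this model a list made up only of referenced entries writes back nothing on the first pass, which corresponds to the case where the proposed interface reports -EAGAIN; a second pass over the same list then succeeds.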
Suggested-by: Yosry Ahmed
Suggested-by: Nhat Pham
Signed-off-by: Hao Jia
---
 Documentation/admin-guide/cgroup-v2.rst |  18 +++-
 Documentation/admin-guide/mm/zswap.rst  |  11 +-
 include/linux/zswap.h                   |   7 ++
 mm/vmscan.c                             |  14 +++
 mm/zswap.c                              | 138 ++++++++++++++++++++++++
 5 files changed, 185 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..6564abf0dec5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
 
 	The following nested keys are defined.
 
-	  ==========            ================================
+	  ====================  ==================================================
 	  swappiness            Swappiness value to reclaim with
-	  ==========            ================================
+	  zswap_writeback_only  Only perform proactive zswap writeback
+	  ====================  ==================================================
 
 	Specifying a swappiness value instructs the kernel to perform
 	the reclaim with that swappiness value. Note that this has the
@@ -1437,6 +1438,19 @@ The following nested keys are defined.
 	The valid range for swappiness is [0-200, max], setting
 	swappiness=max exclusively reclaims anonymous memory.
 
+	The zswap_writeback_only key skips ordinary memory reclaim and
+	writes back pages from zswap to the backing swap device until
+	the requested amount has been written or no further candidates
+	are found. This is useful to proactively offload cold pages from
+	the zswap pool to the swap device. It is only available if
+	zswap writeback is enabled. zswap_writeback_only cannot be combined
+	with swappiness; specifying both returns -EINVAL.
+
+	Example::
+
+	  # Write back up to 100MB of pages from zswap to the backing swap
+	  echo "100M zswap_writeback_only" > memory.reclaim
+
   memory.peak
 	A read-write single value file which exists on non-root cgroups.
 
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 2464425c783d..1c0598e77958 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -131,7 +131,16 @@ User can enable it as follows::
   echo Y > /sys/module/zswap/parameters/shrinker_enabled
 
 This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
-selected.
+selected. Once enabled, the shrinker automatically writes back zswap pages to
+backing swap during memory reclaim.
+
+If users want to explicitly trigger proactive zswap writeback for a specific
+memory cgroup without invoking standard page reclaim, it can be done as follows::
+
+  echo "100M zswap_writeback_only" > /sys/fs/cgroup//memory.reclaim
+
+Both of the methods mentioned above are subject to the ``memory.zswap.writeback``
+control. This means that ``memory.zswap.writeback`` can reject all zswap writeback.
 
 A debugfs interface is provided for various statistic about pool size, number
 of pages stored, same-value filled pages and various counters for the reasons
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index efa6b551217e..98434d39339a 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -44,6 +44,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+int zswap_proactive_writeback(struct mem_cgroup *memcg, unsigned long nr_to_writeback);
 #else
 
 struct zswap_lruvec_state {};
@@ -78,6 +79,12 @@ static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline int zswap_proactive_writeback(struct mem_cgroup *memcg,
+					    unsigned long nr_to_writeback)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..6249176b9886 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
 
 #include
 #include
+#include
 
 #include "internal.h"
 #include "swap.h"
@@ -7894,11 +7895,13 @@ static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
 enum {
 	MEMORY_RECLAIM_SWAPPINESS = 0,
 	MEMORY_RECLAIM_SWAPPINESS_MAX,
+	MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY,
 	MEMORY_RECLAIM_NULL,
 };
 static const match_table_t tokens = {
 	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
 	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
+	{ MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY, "zswap_writeback_only"},
 	{ MEMORY_RECLAIM_NULL, NULL },
 };
 
@@ -7908,6 +7911,7 @@ int user_proactive_reclaim(char *buf,
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	int swappiness = -1;
+	bool zswap_writeback_only = false;
 	char *old_buf, *start;
 	substring_t args[MAX_OPT_ARGS];
 	gfp_t gfp_mask = GFP_KERNEL;
@@ -7938,11 +7942,21 @@ int user_proactive_reclaim(char *buf,
 		case MEMORY_RECLAIM_SWAPPINESS_MAX:
 			swappiness = SWAPPINESS_ANON_ONLY;
 			break;
+		case MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY:
+			zswap_writeback_only = true;
+			break;
 		default:
 			return -EINVAL;
 		}
 	}
 
+	if (zswap_writeback_only) {
+		/* zswap_writeback_only and swappiness are mutually exclusive. */
+		if (swappiness != -1)
+			return -EINVAL;
+		return zswap_proactive_writeback(memcg, nr_to_reclaim);
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		/* Will converge on zero, but reclaim enforces a minimum */
 		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
diff --git a/mm/zswap.c b/mm/zswap.c
index 6519f646b496..947507b9a185 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1684,6 +1684,144 @@ int zswap_load(struct folio *folio)
 	return 0;
 }
 
+/*
+ * Maximum LRU scan limit:
+ * number of entries to scan per page of remaining budget.
+ */
+#define ZSWAP_PROACTIVE_WB_SCAN_RATIO	16UL
+/*
+ * Batch size for proactive writeback:
+ * - As the per-memcg writeback target in the outer memcg loop.
+ * - As the per-walk budget passed to list_lru_walk_one().
+ */
+#define ZSWAP_PROACTIVE_WB_BATCH	128UL
+
+/*
+ * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
+ * Returns the number of pages written back, or -ENOENT if @memcg is a
+ * zombie or has writeback disabled.
+ */
+static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
+					 unsigned long nr_to_write)
+{
+	unsigned long nr_written = 0;
+	int nid;
+
+	if (!mem_cgroup_zswap_writeback_enabled(memcg))
+		return -ENOENT;
+
+	if (!mem_cgroup_online(memcg))
+		return -ENOENT;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		bool encountered_page_in_swapcache = false;
+		unsigned long nr_to_scan, nr_scanned = 0;
+
+		/*
+		 * Cap by LRU length: bounds rewalks when referenced
+		 * entries keep rotating to the tail.
+		 */
+		nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
+		if (!nr_to_scan)
+			continue;
+
+		/*
+		 * Cap by SCAN_RATIO * remaining budget: bounds scan cost
+		 * to the remaining writeback budget.
+		 */
+		nr_to_scan = min(nr_to_scan,
+				 (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
+
+		while (nr_scanned < nr_to_scan) {
+			unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
+						       nr_to_scan - nr_scanned);
+
+			if (signal_pending(current))
+				return nr_written;
+
+			/*
+			 * Account for the committed budget rather than the walker's
+			 * actual delta. If the list is emptied concurrently, the
+			 * walker visits nothing and nr_scanned would never advance.
+			 */
+			nr_scanned += nr_to_walk;
+
+			nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
+							&shrink_memcg_cb,
+							&encountered_page_in_swapcache,
+							&nr_to_walk);
+
+			if (nr_written >= nr_to_write)
+				return nr_written;
+			if (encountered_page_in_swapcache)
+				break;
+
+			cond_resched();
+		}
+	}
+
+	return nr_written;
+}
+
+int zswap_proactive_writeback(struct mem_cgroup *memcg,
+			      unsigned long nr_to_writeback)
+{
+	struct mem_cgroup *iter_memcg;
+	unsigned long nr_written = 0;
+	int failures = 0, attempts = 0;
+
+	if (!memcg)
+		return -EINVAL;
+	if (!nr_to_writeback)
+		return 0;
+
+	/*
+	 * Writeback will be aborted with -EAGAIN if @nr_written is still
+	 * zero and we encounter the following MAX_RECLAIM_RETRIES times:
+	 * - No writeback-candidate memcgs found in a subtree walk.
+	 * - A writeback-candidate memcg wrote back zero pages.
+	 */
+	while (nr_written < nr_to_writeback) {
+		unsigned long batch_size;
+		long shrunk;
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		iter_memcg = zswap_mem_cgroup_iter(memcg);
+
+		if (!iter_memcg) {
+			/*
+			 * Continue without incrementing failures if we found
+			 * candidate memcgs in the last subtree walk.
+			 */
+			if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
+				goto out;
+			attempts = 0;
+			continue;
+		}
+
+		batch_size = min(nr_to_writeback - nr_written,
+				 ZSWAP_PROACTIVE_WB_BATCH);
+		shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
+		mem_cgroup_put(iter_memcg);
+
+		/* Writeback-disabled or offline: skip without counting. */
+		if (shrunk == -ENOENT)
+			continue;
+
+		++attempts;
+		if (shrunk > 0)
+			nr_written += shrunk;
+		else if (++failures == MAX_RECLAIM_RETRIES)
+			goto out;
+
+		cond_resched();
+	}
+out:
+	return nr_written ? 0 : -EAGAIN;
+}
+
 void zswap_invalidate(swp_entry_t swp)
 {
 	pgoff_t offset = swp_offset(swp);
-- 
2.34.1

From: Hao Jia
To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
    shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org,
    mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
    muchun.song@linux.dev, roman.gushchin@linux.dev
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
Subject: [PATCH v2 3/4] mm/zswap: Add per-memcg stat for proactive writeback
Date: Mon, 25 May 2026 20:22:41 +0800
Message-Id: <20260525122242.36127-4-jiahao.kernel@gmail.com>
In-Reply-To: <20260525122242.36127-1-jiahao.kernel@gmail.com>
References: <20260525122242.36127-1-jiahao.kernel@gmail.com>

From: Hao Jia

Currently, zswap writeback can be triggered by either the pool limit
being hit or by the proactive writeback mechanism.
However, the existing 'zswpwb' metric in memory.stat and /proc/vmstat
counts all written back pages, making it difficult to distinguish
between pages written back due to the pool limit and those written
back proactively.

Add a new statistic 'zswpwb_proactive' to memory.stat and /proc/vmstat.
This counter tracks the number of pages written back due to proactive
writeback. This allows users to better monitor and tune the proactive
writeback mechanism.

Signed-off-by: Hao Jia
---
 Documentation/admin-guide/cgroup-v2.rst |  4 +++
 include/linux/vm_event_item.h           |  1 +
 mm/memcontrol.c                         |  1 +
 mm/vmstat.c                             |  1 +
 mm/zswap.c                              | 41 ++++++++++++++++++-------
 5 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6564abf0dec5..7d65aef83f7b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1748,6 +1748,10 @@ The following nested keys are defined.
 	  zswpwb
 		Number of pages written from zswap to swap.
 
+	  zswpwb_proactive
+		Number of pages written from zswap to swap by proactive
+		writeback. This is a subset of zswpwb.
+
 	  zswap_incomp
 		Number of incompressible pages currently stored in zswap
 		without compression. These pages could not be compressed to
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7a5bee0a20b6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -138,6 +138,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		ZSWPIN,
 		ZSWPOUT,
 		ZSWPWB,
+		ZSWPWB_PROACTIVE,
 #endif
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 409c41359dc8..67de71b2a659 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -571,6 +571,7 @@ static const unsigned int memcg_vm_event_stat[] = {
 	ZSWPIN,
 	ZSWPOUT,
 	ZSWPWB,
+	ZSWPWB_PROACTIVE,
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..66fd06d1bb01 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1452,6 +1452,7 @@ const char * const vmstat_text[] = {
 	[I(ZSWPIN)]				= "zswpin",
 	[I(ZSWPOUT)]				= "zswpout",
 	[I(ZSWPWB)]				= "zswpwb",
+	[I(ZSWPWB_PROACTIVE)]			= "zswpwb_proactive",
 #endif
 #ifdef CONFIG_X86
 	[I(DIRECT_MAP_LEVEL2_SPLIT)]		= "direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index 947507b9a185..78190631e2c4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -160,6 +160,11 @@ struct zswap_pool {
 	char tfm_name[CRYPTO_MAX_ALG_NAME];
 };
 
+struct zswap_shrink_walk_arg {
+	bool proactive;
+	bool encountered_page_in_swapcache;
+};
+
 /* Global LRU lists shared by all zswap pools. */
 static struct list_lru zswap_list_lru;
 
@@ -1042,7 +1047,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
  * freed.
  */
 static int zswap_writeback_entry(struct zswap_entry *entry,
-				 swp_entry_t swpentry)
+				 swp_entry_t swpentry,
+				 bool proactive)
 {
 	struct xarray *tree;
 	pgoff_t offset = swp_offset(swpentry);
@@ -1102,6 +1108,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	if (entry->objcg)
 		count_objcg_events(entry->objcg, ZSWPWB, 1);
 
+	if (proactive) {
+		count_vm_event(ZSWPWB_PROACTIVE);
+		if (entry->objcg)
+			count_objcg_events(entry->objcg, ZSWPWB_PROACTIVE, 1);
+	}
+
 	zswap_entry_free(entry);
 
 	/* folio is up to date */
@@ -1151,7 +1163,8 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 				       void *arg)
 {
 	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
-	bool *encountered_page_in_swapcache = (bool *)arg;
+	struct zswap_shrink_walk_arg *walk_arg = arg;
+	bool proactive_wb = walk_arg && walk_arg->proactive;
 	swp_entry_t swpentry;
 	enum lru_status ret = LRU_REMOVED_RETRY;
 	int writeback_result;
@@ -1206,7 +1219,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	 */
 	spin_unlock(&l->lock);
 
-	writeback_result = zswap_writeback_entry(entry, swpentry);
+	writeback_result = zswap_writeback_entry(entry, swpentry, proactive_wb);
 
 	if (writeback_result) {
 		zswap_reject_reclaim_fail++;
@@ -1217,9 +1230,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 		 * into the warmer region. We should terminate shrinking (if we're in the dynamic
 		 * shrinker context).
 		 */
-		if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
+		if (writeback_result == -EEXIST && walk_arg) {
 			ret = LRU_STOP;
-			*encountered_page_in_swapcache = true;
+			walk_arg->encountered_page_in_swapcache = true;
 		}
 	} else {
 		zswap_written_back_pages++;
@@ -1231,8 +1244,11 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
+	struct zswap_shrink_walk_arg walk_arg = {
+		.proactive = false,
+		.encountered_page_in_swapcache = false,
+	};
 	unsigned long shrink_ret;
-	bool encountered_page_in_swapcache = false;
 
 	if (!zswap_shrinker_enabled ||
 			!mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
@@ -1241,9 +1257,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 	}
 
 	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
-		&encountered_page_in_swapcache);
+		&walk_arg);
 
-	if (encountered_page_in_swapcache)
+	if (walk_arg.encountered_page_in_swapcache)
 		return SHRINK_STOP;
 
 	return shrink_ret ? shrink_ret : SHRINK_STOP;
@@ -1714,7 +1730,10 @@ static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
 		return -ENOENT;
 
 	for_each_node_state(nid, N_NORMAL_MEMORY) {
-		bool encountered_page_in_swapcache = false;
+		struct zswap_shrink_walk_arg walk_arg = {
+			.proactive = true,
+			.encountered_page_in_swapcache = false,
+		};
 		unsigned long nr_to_scan, nr_scanned = 0;
 
 		/*
@@ -1748,12 +1767,12 @@ static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
 
 		nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
 						&shrink_memcg_cb,
-						&encountered_page_in_swapcache,
+						&walk_arg,
 						&nr_to_walk);
 
 		if (nr_written >= nr_to_write)
 			return nr_written;
-		if (encountered_page_in_swapcache)
+		if (walk_arg.encountered_page_in_swapcache)
 			break;
 
 		cond_resched();
-- 
2.34.1

From nobody Mon Jun 8 23:56:08 2026
From: Hao Jia
To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org,
	mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
	muchun.song@linux.dev, roman.gushchin@linux.dev
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
Subject: [PATCH v2 4/4] selftests/cgroup: Add tests for zswap proactive writeback
Date: Mon, 25 May 2026 20:22:42 +0800
Message-Id: <20260525122242.36127-5-jiahao.kernel@gmail.com>
X-Mailer: git-send-email 2.39.2 (Apple Git-143)
In-Reply-To: <20260525122242.36127-1-jiahao.kernel@gmail.com>
References: <20260525122242.36127-1-jiahao.kernel@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Hao Jia

Add test_zswap_proactive_writeback() to cover the new memory.reclaim
"zswap_writeback_only" key. The test populates a memory cgroup zswap
pool, triggers proactive writeback, and verifies the behavior by
observing the change in zswpwb_proactive. Invalid input combinations
are also covered.

Extend test_zswap_writeback_one() to assert that the existing
non-proactive writeback path leaves zswpwb_proactive at zero.

Signed-off-by: Hao Jia
---
 tools/testing/selftests/cgroup/test_zswap.c | 161 +++++++++++++++++++-
 1 file changed, 153 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
index a7bdcdd09d62..b80ed13bc5e2 100644
--- a/tools/testing/selftests/cgroup/test_zswap.c
+++ b/tools/testing/selftests/cgroup/test_zswap.c
@@ -57,6 +57,11 @@ static long get_cg_wb_count(const char *cg)
 	return cg_read_key_long(cg, "memory.stat", "zswpwb");
 }
 
+static long get_cg_pwb_count(const char *cg)
+{
+	return cg_read_key_long(cg, "memory.stat", "zswpwb_proactive");
+}
+
 static long get_zswpout(const char *cgroup)
 {
 	return cg_read_key_long(cgroup, "memory.stat", "zswpout ");
@@ -323,11 +328,17 @@ static int attempt_writeback(const char *cgroup, void *arg)
 
 static int test_zswap_writeback_one(const char *cgroup, bool wb)
 {
-	long zswpwb_before, zswpwb_after;
+	long wb_cnt, pwb_cnt;
+
+	wb_cnt = get_cg_wb_count(cgroup);
+	if (wb_cnt != 0) {
+		ksft_print_msg("zswpwb_before = %ld instead of 0\n", wb_cnt);
+		return -1;
+	}
 
-	zswpwb_before = get_cg_wb_count(cgroup);
-	if (zswpwb_before != 0) {
-		ksft_print_msg("zswpwb_before = %ld instead of 0\n", zswpwb_before);
+	pwb_cnt = get_cg_pwb_count(cgroup);
+	if (pwb_cnt != 0) {
+		ksft_print_msg("zswpwb_proactive_before = %ld instead of 0\n", pwb_cnt);
 		return -1;
 	}
 
@@ -335,13 +346,24 @@ static int test_zswap_writeback_one(const char *cgroup, bool wb)
 		return -1;
 
 	/* Verify that zswap writeback occurred only if writeback was enabled */
-	zswpwb_after = get_cg_wb_count(cgroup);
-	if (zswpwb_after < 0)
+	wb_cnt = get_cg_wb_count(cgroup);
+	if (wb_cnt < 0)
 		return -1;
 
-	if (wb != !!zswpwb_after) {
+	if (wb != !!wb_cnt) {
 		ksft_print_msg("zswpwb_after is %ld while wb is %s\n",
-				zswpwb_after, wb ? "enabled" : "disabled");
+			       wb_cnt, wb ? "enabled" : "disabled");
+		return -1;
+	}
+
+	/*
+	 * attempt_writeback() does not use the proactive writeback path, so
+	 * zswpwb_proactive must stay at zero regardless of whether writeback
+	 * was enabled.
+	 */
+	pwb_cnt = get_cg_pwb_count(cgroup);
+	if (pwb_cnt != 0) {
+		ksft_print_msg("zswpwb_proactive_after is %ld, expected 0\n", pwb_cnt);
 		return -1;
 	}
 
@@ -709,6 +731,128 @@ static int test_zswap_incompressible(const char *root)
 	return ret;
 }
 
+/*
+ * Trigger proactive zswap writeback with the following steps:
+ * 1. Allocate memory.
+ * 2. Push allocated memory into zswap.
+ * 3. Proactively write back zswap pages to swap
+ *    using "zswap_writeback_only".
+ */
+static int proactive_writeback_workload(const char *cgroup, void *arg)
+{
+	long pagesize = sysconf(_SC_PAGESIZE);
+	size_t memsize = MB(4);
+	char reclaim_cmd[64];
+	char buf[pagesize];
+	int ret = -1;
+	char *mem;
+
+	mem = (char *)malloc(memsize);
+	if (!mem)
+		return ret;
+
+	for (int i = 0; i < pagesize; i++)
+		buf[i] = i < pagesize / 2 ? (char)i : 0;
+	for (int i = 0; i < memsize; i += pagesize)
+		memcpy(&mem[i], buf, pagesize);
+
+	/* Evict allocated memory into zswap. */
+	if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
+		ksft_print_msg("Failed to push pages into zswap\n");
+		goto out;
+	}
+
+	/* Trigger proactive zswap writeback for the same amount. */
+	snprintf(reclaim_cmd, sizeof(reclaim_cmd), "%zu zswap_writeback_only", memsize);
+	if (cg_write(cgroup, "memory.reclaim", reclaim_cmd)) {
+		ksft_print_msg("memory.reclaim zswap_writeback_only failed\n");
+		goto out;
+	}
+
+	ret = 0;
+out:
+	free(mem);
+	return ret;
+}
+
+static int check_writeback_invalid_inputs(const char *cgroup)
+{
+	static char * const bad_inputs[] = {
+		"zswap_writeback_only",
+		"1M zswap_writeback_only swappiness=60",
+		"1M swappiness=60 zswap_writeback_only",
+		"1M zswap_writeback_only swappiness=max",
+		"1M swappiness=max zswap_writeback_only",
+	};
+	int i, rc;
+
+	for (i = 0; i < ARRAY_SIZE(bad_inputs); i++) {
+		rc = cg_write(cgroup, "memory.reclaim", bad_inputs[i]);
+		if (rc != -EINVAL) {
+			ksft_print_msg("memory.reclaim '%s': returned %d, expected %d\n",
+				       bad_inputs[i], rc, -EINVAL);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static int test_zswap_proactive_writeback(const char *root)
+{
+	long pwb_before, wb_before, pwb_after, wb_after;
+	long pwb_delta, wb_delta;
+	int ret = KSFT_FAIL;
+	char *test_group;
+
+	if (cg_read_strcmp(root, "memory.zswap.writeback", "1"))
+		return KSFT_SKIP;
+
+	test_group = cg_name(root, "zswap_proactive_test");
+	if (!test_group)
+		return KSFT_FAIL;
+	if (cg_create(test_group))
+		goto out;
+	if (check_writeback_invalid_inputs(test_group))
+		goto out;
+
+	pwb_before = get_cg_pwb_count(test_group);
+	wb_before = get_cg_wb_count(test_group);
+	if (pwb_before < 0 || wb_before < 0)
+		goto out;
+
+	if (cg_run(test_group, proactive_writeback_workload, NULL))
+		goto out;
+
+	pwb_after = get_cg_pwb_count(test_group);
+	wb_after = get_cg_wb_count(test_group);
+	if (pwb_after < 0 || wb_after < 0)
+		goto out;
+
+	pwb_delta = pwb_after - pwb_before;
+	wb_delta = wb_after - wb_before;
+
+	if (pwb_delta <= 0) {
+		ksft_print_msg("zswpwb_proactive did not increase: delta=%ld\n",
+			       pwb_delta);
+		goto out;
+	}
+	if (wb_delta <= 0) {
+		ksft_print_msg("zswpwb did not increase: delta=%ld\n", wb_delta);
+		goto out;
+	}
+	if (pwb_delta > wb_delta) {
+		ksft_print_msg("zswpwb_proactive delta (%ld) > zswpwb delta (%ld)\n",
+			       pwb_delta, wb_delta);
+		goto out;
+	}
+
+	ret = KSFT_PASS;
+out:
+	cg_destroy(test_group);
+	free(test_group);
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct zswap_test {
 	int (*fn)(const char *root);
@@ -722,6 +866,7 @@ struct zswap_test {
 	T(test_no_kmem_bypass),
 	T(test_no_invasive_cgroup_shrink),
 	T(test_zswap_incompressible),
+	T(test_zswap_proactive_writeback),
 };
 #undef T
 
-- 
2.34.1
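
The cgroup-v2.rst hunk in patch 3/4 documents zswpwb_proactive as a
subset of zswpwb, so a monitoring tool could derive the pages written
back by the other paths by subtraction. A minimal userspace sketch,
assuming memory.stat-style "name value" lines as input; stat_value()
and zswpwb_non_proactive() are hypothetical helpers for illustration
and are not part of this series:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the series): return the value of
 * `key` in memory.stat-style text ("name value\n" per line), or -1 if
 * the key is absent.
 */
static long stat_value(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/*
		 * Require a space right after the key so that "zswpwb"
		 * does not match the "zswpwb_proactive" line.
		 */
		if (!strncmp(line, key, klen) && line[klen] == ' ')
			return strtol(line + klen + 1, NULL, 10);
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Pages written back by paths other than proactive writeback, or -1 if
 * either counter is missing or the subset relation does not hold.
 */
static long zswpwb_non_proactive(const char *text)
{
	long wb = stat_value(text, "zswpwb");
	long pwb = stat_value(text, "zswpwb_proactive");

	if (wb < 0 || pwb < 0 || pwb > wb)
		return -1;
	return wb - pwb;
}
```

Feeding the sketch the contents of a cgroup's memory.stat would give
the non-proactive share directly. The two counters are read at
different instants in a live system, so a caller sampling them
separately may want to tolerate small transient inconsistencies.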