From nobody Mon Jun 8 21:51:20 2026
From: Hao Jia
To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org,
	mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
	muchun.song@linux.dev, roman.gushchin@linux.dev
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
Subject: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
Date: Tue, 26 May 2026 19:45:58 +0800
Message-Id: <20260526114601.67041-2-jiahao.kernel@gmail.com>
X-Mailer: git-send-email 2.39.2 (Apple Git-143)
In-Reply-To: <20260526114601.67041-1-jiahao.kernel@gmail.com>
References: <20260526114601.67041-1-jiahao.kernel@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Hao Jia

The zswap background writeback worker shrink_worker() uses a global
cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
across the online memcgs under root_mem_cgroup.

Proactive writeback also wants a similar per-memcg cursor that is scoped
to the specified memcg, so that repeated invocations against the same
memcg make forward progress across its descendant memcgs instead of
restarting from the first child memcg each time.

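As a rough user-space illustration of the cursor behaviour described
above (not the kernel code): a flat array stands in for a memcg subtree
in pre-order, and the cursor remembers the last entry handed out. The
names sketch_memcg, sketch_cursor and sketch_cursor_next() are
hypothetical, and locking and reference counting are deliberately left
out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one memcg in a pre-ordered subtree. */
struct sketch_memcg {
	const char *name;
	int online;
};

/* Hypothetical cursor: index of the last entry handed out, -1 = none yet. */
struct sketch_cursor {
	int pos;
};

/*
 * Return the next online entry after the cursor and remember it.
 * Returns NULL once the walk runs off the end; the cursor is reset so
 * that the following call starts a new round from the first entry.
 */
static const struct sketch_memcg *
sketch_cursor_next(struct sketch_cursor *cur,
		   const struct sketch_memcg *subtree, int nr)
{
	int i;

	assert(cur && subtree && nr >= 0);

	for (i = cur->pos + 1; i < nr; i++) {
		if (subtree[i].online) {
			cur->pos = i;
			return &subtree[i];
		}
	}

	cur->pos = -1;
	return NULL;
}
```

With entries A (online), B (offline) and C (online), two separate
invocations that each take one entry get A and then C, rather than A
twice, because the cursor persists between calls.
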
Naturally, group the cursor and its protecting spinlock into a
zswap_wb_iter struct, and make it a member of struct mem_cgroup to
realize per-memcg cursor management. Accordingly, shrink_worker() now
uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.

Because the cursor is now per-memcg, the offline cleanup must visit
every ancestor that could be holding a reference to the dying memcg.
Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
to the root.

No functional change intended for shrink_worker().

Signed-off-by: Hao Jia
Reviewed-by: Nhat Pham
---
 include/linux/memcontrol.h |   3 +
 include/linux/zswap.h      |   9 +++
 mm/memcontrol.c            |   3 +
 mm/zswap.c                 | 119 ++++++++++++++++++++++++++-----------
 4 files changed, 98 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf1a6e131eca..5e29c2b7e376 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -229,6 +229,9 @@ struct mem_cgroup {
 	 * swap, and from being swapped out on zswap store failures.
 	 */
 	bool zswap_writeback;
+
+	/* Per-memcg writeback cursor */
+	struct zswap_wb_iter zswap_wb_iter;
 #endif
 
 	/* vmpressure notifications */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..efa6b551217e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -11,6 +11,15 @@ extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
 
+/* Iteration cursor for zswap writeback over a memcg's subtree. */
+struct zswap_wb_iter {
+	/* protects @pos against concurrent advances */
+	spinlock_t lock;
+	struct mem_cgroup *pos;
+};
+
+void zswap_wb_iter_init(struct zswap_wb_iter *iter);
+
 struct zswap_lruvec_state {
 	/*
 	 * Number of swapped in pages from disk, i.e not found in the zswap pool.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13f5d4b2a78e..e205e5de193d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4024,6 +4024,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->memory_peaks);
 	INIT_LIST_HEAD(&memcg->swap_peaks);
 	spin_lock_init(&memcg->peaks_lock);
+#ifdef CONFIG_ZSWAP
+	zswap_wb_iter_init(&memcg->zswap_wb_iter);
+#endif
 	memcg->socket_pressure = get_jiffies_64();
 #if BITS_PER_LONG < 64
 	seqlock_init(&memcg->socket_pressure_seqlock);
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..73e64a635690 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -163,9 +163,6 @@ struct zswap_pool {
 /* Global LRU lists shared by all zswap pools. */
 static struct list_lru zswap_list_lru;
 
-/* The lock protects zswap_next_shrink updates. */
-static DEFINE_SPINLOCK(zswap_shrink_lock);
-static struct mem_cgroup *zswap_next_shrink;
 static struct work_struct zswap_shrink_work;
 static struct shrinker *zswap_shrinker;
 
@@ -717,28 +714,88 @@ void zswap_folio_swapin(struct folio *folio)
 	}
 }
 
-/*
- * This function should be called when a memcg is being offlined.
+void zswap_wb_iter_init(struct zswap_wb_iter *iter)
+{
+	spin_lock_init(&iter->lock);
+}
+
+#ifdef CONFIG_MEMCG
+/**
+ * zswap_mem_cgroup_iter - advance the writeback cursor
+ * @root: subtree root whose cursor to advance
+ *
+ * Advance @root->zswap_wb_iter.pos to @root itself or the next online
+ * descendant. Passing root_mem_cgroup yields a global walk.
  *
- * Since the global shrinker shrink_worker() may hold a reference
- * of the memcg, we must check and release the reference in
- * zswap_next_shrink.
+ * The cursor is retained across invocations, so successive calls walk
+ * @root's subtree cyclically in pre-order and, after %NULL, restart
+ * from the beginning.
  *
- * shrink_worker() must handle the case where this function releases
- * the reference of memcg being shrunk.
+ * The returned memcg carries an extra reference; release it with
+ * mem_cgroup_put().
+ *
+ * Return: the next online memcg in @root's subtree, or @root itself,
+ * with an extra reference, or %NULL after a full round-trip.
  */
-void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
 {
-	/* lock out zswap shrinker walking memcg tree */
-	spin_lock(&zswap_shrink_lock);
-	if (zswap_next_shrink == memcg) {
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	spin_lock(&root->zswap_wb_iter.lock);
+	do {
+		memcg = mem_cgroup_iter(root, root->zswap_wb_iter.pos, NULL);
+		root->zswap_wb_iter.pos = memcg;
+	} while (memcg && !mem_cgroup_tryget_online(memcg));
+	spin_unlock(&root->zswap_wb_iter.lock);
+
+	return memcg;
+}
+
+/*
+ * If @root's cursor currently points at @dead_memcg, advance it to the
+ * next online descendant so @dead_memcg can be freed.
+ */
+static void __zswap_memcg_offline_cleanup(struct mem_cgroup *root,
+					  struct mem_cgroup *dead_memcg)
+{
+	spin_lock(&root->zswap_wb_iter.lock);
+	if (root->zswap_wb_iter.pos == dead_memcg) {
 		do {
-			zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-		} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
+			root->zswap_wb_iter.pos =
+				mem_cgroup_iter(root,
+						root->zswap_wb_iter.pos, NULL);
+		} while (root->zswap_wb_iter.pos &&
+			 !mem_cgroup_online(root->zswap_wb_iter.pos));
 	}
-	spin_unlock(&zswap_shrink_lock);
+	spin_unlock(&root->zswap_wb_iter.lock);
+}
+
+/*
+ * Called when a memcg is being offlined. If @memcg or any of its
+ * ancestors has a cursor pointing at @memcg, it must be advanced
+ * past @memcg before @memcg can be freed. Walk the chain and
+ * release such references.
+ */
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	do {
+		__zswap_memcg_offline_cleanup(parent, memcg);
+	} while ((parent = parent_mem_cgroup(parent)));
+}
+#else /* !CONFIG_MEMCG */
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
+{
+	return NULL;
 }
 
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) { }
+#endif /* CONFIG_MEMCG */
+
 /*********************************
 * zswap entry functions
 **********************************/
@@ -1323,38 +1380,28 @@ static void shrink_worker(struct work_struct *w)
 	 * - No writeback-candidate memcgs found in a memcg tree walk.
 	 * - Shrinking a writeback-candidate memcg failed.
 	 *
-	 * We save iteration cursor memcg into zswap_next_shrink,
+	 * We save the iteration cursor in root_mem_cgroup->zswap_wb_iter.pos,
 	 * which can be modified by the offline memcg cleaner
 	 * zswap_memcg_offline_cleanup().
 	 *
 	 * Since the offline cleaner is called only once, we cannot leave an
-	 * offline memcg reference in zswap_next_shrink.
+	 * offline memcg reference in root_mem_cgroup->zswap_wb_iter.pos.
 	 * We can rely on the cleaner only if we get online memcg under lock.
 	 *
 	 * If we get an offline memcg, we cannot determine if the cleaner has
 	 * already been called or will be called later. We must put back the
 	 * reference before returning from this function. Otherwise, the
-	 * offline memcg left in zswap_next_shrink will hold the reference
-	 * until the next run of shrink_worker().
+	 * offline memcg left in root_mem_cgroup->zswap_wb_iter.pos will hold
+	 * the reference until the next run of shrink_worker().
 	 */
 	do {
 		/*
-		 * Start shrinking from the next memcg after zswap_next_shrink.
-		 * When the offline cleaner has already advanced the cursor,
-		 * advancing the cursor here overlooks one memcg, but this
-		 * should be negligibly rare.
-		 *
-		 * If we get an online memcg, keep the extra reference in case
-		 * the original one obtained by mem_cgroup_iter() is dropped by
-		 * zswap_memcg_offline_cleanup() while we are shrinking the
-		 * memcg.
+		 * Start shrinking from the next memcg after
+		 * root_mem_cgroup->zswap_wb_iter.pos. When the offline cleaner
+		 * has already advanced the cursor, advancing the cursor here
+		 * overlooks one memcg, but this should be negligibly rare.
 		 */
-		spin_lock(&zswap_shrink_lock);
-		do {
-			memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-			zswap_next_shrink = memcg;
-		} while (memcg && !mem_cgroup_tryget_online(memcg));
-		spin_unlock(&zswap_shrink_lock);
+		memcg = zswap_mem_cgroup_iter(root_mem_cgroup);
 
 		if (!memcg) {
 			/*
-- 
2.34.1

From nobody Mon Jun 8 21:51:20 2026
From: Hao Jia
To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org,
	mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
	muchun.song@linux.dev, roman.gushchin@linux.dev
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
Subject: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
Date: Tue, 26 May 2026 19:45:59 +0800
Message-Id: <20260526114601.67041-3-jiahao.kernel@gmail.com>
X-Mailer: git-send-email 2.39.2 (Apple Git-143)
In-Reply-To: <20260526114601.67041-1-jiahao.kernel@gmail.com>
References: <20260526114601.67041-1-jiahao.kernel@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Hao Jia

Zswap currently writes back pages to backing swap reactively, triggered
either by the shrinker or when the pool reaches its size limit. There is
no mechanism to control the amount of writeback for a specific memory
cgroup. However, users may want to proactively write back zswap pages,
e.g., to free up memory for other applications or to prepare for
memory-intensive workloads.

Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
interface. When specified, this key bypasses standard memory reclaim
and exclusively performs proactive zswap writeback up to the requested
budget. If omitted, the default reclaim behavior remains unchanged.

Example usage:
  # Write back 100MB of pages from zswap to the backing swap
  echo "100M zswap_writeback_only" > memory.reclaim

Note that the actual amount written back may be less than requested due
to the zswap second-chance algorithm: referenced entries are rotated on
the LRU on the first encounter and only written back on a second pass.
If fewer bytes are written back than requested, -EAGAIN is returned,
matching the existing memory.reclaim semantics.

Internally, extend user_proactive_reclaim() to parse the new
"zswap_writeback_only" token and invoke the dedicated handler. Add
zswap_proactive_writeback() to walk the target memcg subtree via the
per-memcg writeback cursor, draining per-node zswap LRUs through
list_lru_walk_one() with the shrink_memcg_cb() callback.

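To make the second-chance note above concrete, here is a simplified
user-space sketch (not the kernel implementation). The names
sketch_entry and sketch_second_chance_pass() are hypothetical, and as a
simplifying assumption a referenced entry stays in place instead of
being rotated to the LRU tail. It shows why a single pass can write
back fewer entries than the budget allows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a zswap LRU entry. */
struct sketch_entry {
	int referenced;	/* set when the entry was recently used */
	int written;	/* set once the entry has been written back */
};

/*
 * One pass over @lru: a referenced entry only loses its referenced bit
 * (its "second chance"); an unreferenced one is written back. Stops
 * early once @budget entries have been written. Returns the number of
 * entries written back in this pass.
 */
static unsigned long sketch_second_chance_pass(struct sketch_entry *lru,
					       size_t nr, unsigned long budget)
{
	unsigned long nr_written = 0;
	size_t i;

	assert(lru || nr == 0);

	for (i = 0; i < nr && nr_written < budget; i++) {
		if (lru[i].written)
			continue;
		if (lru[i].referenced) {
			lru[i].referenced = 0;
			continue;
		}
		lru[i].written = 1;
		nr_written++;
	}

	return nr_written;
}
```

With four entries of which two are referenced and a budget of four, the
first pass writes back only the two unreferenced entries; the other two
go out on the second pass.
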
Suggested-by: Yosry Ahmed
Suggested-by: Nhat Pham
Signed-off-by: Hao Jia
---
 Documentation/admin-guide/cgroup-v2.rst |  18 +++-
 Documentation/admin-guide/mm/zswap.rst  |  11 +-
 include/linux/zswap.h                   |   7 ++
 mm/vmscan.c                             |  14 +++
 mm/zswap.c                              | 138 ++++++++++++++++++++++++
 5 files changed, 185 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..6564abf0dec5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
 
 	The following nested keys are defined.
 
-	  ==========            ================================
+	  ====================  ==================================================
 	  swappiness            Swappiness value to reclaim with
-	  ==========            ================================
+	  zswap_writeback_only  Only perform proactive zswap writeback
+	  ====================  ==================================================
 
 	Specifying a swappiness value instructs the kernel to perform
 	the reclaim with that swappiness value. Note that this has the
@@ -1437,6 +1438,19 @@ The following nested keys are defined.
 	The valid range for swappiness is [0-200, max], setting
 	swappiness=max exclusively reclaims anonymous memory.
 
+	The zswap_writeback_only key skips ordinary memory reclaim and
+	writes back pages from zswap to the backing swap device until
+	the requested amount has been written or no further candidates
+	are found. This is useful to proactively offload cold pages from
+	the zswap pool to the swap device. It is only available if
+	zswap writeback is enabled. zswap_writeback_only cannot be combined
+	with swappiness; specifying both returns -EINVAL.
+
+	Example::
+
+	  # Write back up to 100MB of pages from zswap to the backing swap
+	  echo "100M zswap_writeback_only" > memory.reclaim
+
   memory.peak
 	A read-write single value file which exists on non-root cgroups.
 
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 2464425c783d..1c0598e77958 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -131,7 +131,16 @@ User can enable it as follows::
   echo Y > /sys/module/zswap/parameters/shrinker_enabled
 
 This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
-selected.
+selected. Once enabled, the shrinker automatically writes back zswap pages to
+backing swap during memory reclaim.
+
+If users want to explicitly trigger proactive zswap writeback for a specific
+memory cgroup without invoking standard page reclaim, it can be done as follows::
+
+  echo "100M zswap_writeback_only" > /sys/fs/cgroup//memory.reclaim
+
+Both of the methods mentioned above are subject to the ``memory.zswap.writeback``
+control. This means that ``memory.zswap.writeback`` can reject all zswap writeback.
 
 A debugfs interface is provided for various statistic about pool size, number
 of pages stored, same-value filled pages and various counters for the reasons
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index efa6b551217e..98434d39339a 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -44,6 +44,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+int zswap_proactive_writeback(struct mem_cgroup *memcg, unsigned long nr_to_writeback);
 #else
 
 struct zswap_lruvec_state {};
@@ -78,6 +79,12 @@ static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline int zswap_proactive_writeback(struct mem_cgroup *memcg,
+					    unsigned long nr_to_writeback)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca4533eba701..63fa4341b823 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
 
 #include
 #include
+#include
 
 #include "internal.h"
 #include "swap.h"
@@ -7856,11 +7857,13 @@ static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
 enum {
 	MEMORY_RECLAIM_SWAPPINESS = 0,
 	MEMORY_RECLAIM_SWAPPINESS_MAX,
+	MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY,
 	MEMORY_RECLAIM_NULL,
 };
 static const match_table_t tokens = {
 	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
 	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
+	{ MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY, "zswap_writeback_only"},
 	{ MEMORY_RECLAIM_NULL, NULL },
 };
 
@@ -7870,6 +7873,7 @@ int user_proactive_reclaim(char *buf,
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	int swappiness = -1;
+	bool zswap_writeback_only = false;
 	char *old_buf, *start;
 	substring_t args[MAX_OPT_ARGS];
 	gfp_t gfp_mask = GFP_KERNEL;
@@ -7900,11 +7904,21 @@ int user_proactive_reclaim(char *buf,
 		case MEMORY_RECLAIM_SWAPPINESS_MAX:
 			swappiness = SWAPPINESS_ANON_ONLY;
 			break;
+		case MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY:
+			zswap_writeback_only = true;
+			break;
 		default:
 			return -EINVAL;
 		}
 	}
 
+	if (zswap_writeback_only) {
+		/* zswap_writeback_only and swappiness are mutually exclusive. */
+		if (swappiness != -1)
+			return -EINVAL;
+		return zswap_proactive_writeback(memcg, nr_to_reclaim);
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		/* Will converge on zero, but reclaim enforces a minimum */
 		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
diff --git a/mm/zswap.c b/mm/zswap.c
index 73e64a635690..7bcbf788f634 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
 	return 0;
 }
 
+/*
+ * Maximum LRU scan limit:
+ * number of entries to scan per page of remaining budget.
+ */
+#define ZSWAP_PROACTIVE_WB_SCAN_RATIO	16UL
+/*
+ * Batch size for proactive writeback:
+ * - As the per-memcg writeback target in the outer memcg loop.
+ * - As the per-walk budget passed to list_lru_walk_one().
+ */
+#define ZSWAP_PROACTIVE_WB_BATCH	128UL
+
+/*
+ * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
+ * Returns the number of pages written back, or -ENOENT if @memcg is a
+ * zombie or has writeback disabled.
+ */
+static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
+					 unsigned long nr_to_write)
+{
+	unsigned long nr_written = 0;
+	int nid;
+
+	if (!mem_cgroup_zswap_writeback_enabled(memcg))
+		return -ENOENT;
+
+	if (!mem_cgroup_online(memcg))
+		return -ENOENT;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		bool encountered_page_in_swapcache = false;
+		unsigned long nr_to_scan, nr_scanned = 0;
+
+		/*
+		 * Cap by LRU length: bounds rewalks when referenced
+		 * entries keep rotating to the tail.
+		 */
+		nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
+		if (!nr_to_scan)
+			continue;
+
+		/*
+		 * Cap by SCAN_RATIO * remaining budget: bounds scan cost
+		 * to the remaining writeback budget.
+		 */
+		nr_to_scan = min(nr_to_scan,
+				 (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
+
+		while (nr_scanned < nr_to_scan) {
+			unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
+						       nr_to_scan - nr_scanned);
+
+			if (signal_pending(current))
+				return nr_written;
+
+			/*
+			 * Account for the committed budget rather than the walker's
+			 * actual delta. If the list is emptied concurrently, the
+			 * walker visits nothing and nr_scanned would never advance.
+			 */
+			nr_scanned += nr_to_walk;
+
+			nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
+							&shrink_memcg_cb,
+							&encountered_page_in_swapcache,
+							&nr_to_walk);
+
+			if (nr_written >= nr_to_write)
+				return nr_written;
+			if (encountered_page_in_swapcache)
+				break;
+
+			cond_resched();
+		}
+	}
+
+	return nr_written;
+}
+
+int zswap_proactive_writeback(struct mem_cgroup *memcg,
+			      unsigned long nr_to_writeback)
+{
+	struct mem_cgroup *iter_memcg;
+	unsigned long nr_written = 0;
+	int failures = 0, attempts = 0;
+
+	if (!memcg)
+		return -EINVAL;
+	if (!nr_to_writeback)
+		return 0;
+
+	/*
+	 * Writeback will be aborted with -EAGAIN if we encounter
+	 * the following MAX_RECLAIM_RETRIES times:
+	 * - No writeback-candidate memcgs found in a subtree walk.
+	 * - A writeback-candidate memcg wrote back zero pages.
+	 */
+	while (nr_written < nr_to_writeback) {
+		unsigned long batch_size;
+		long shrunk;
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		iter_memcg = zswap_mem_cgroup_iter(memcg);
+
+		if (!iter_memcg) {
+			/*
+			 * Continue without incrementing failures if we found
+			 * candidate memcgs in the last subtree walk.
+			 */
+			if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
+				return -EAGAIN;
+			attempts = 0;
+			continue;
+		}
+
+		batch_size = min(nr_to_writeback - nr_written,
+				 ZSWAP_PROACTIVE_WB_BATCH);
+		shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
+		mem_cgroup_put(iter_memcg);
+
+		/* Writeback-disabled or offline: skip without counting. */
+		if (shrunk == -ENOENT)
+			continue;
+
+		++attempts;
+		if (shrunk > 0)
+			nr_written += shrunk;
+		else if (++failures == MAX_RECLAIM_RETRIES)
+			return -EAGAIN;
+
+		cond_resched();
+	}
+
+	return 0;
+}
+
 void zswap_invalidate(swp_entry_t swp)
 {
 	pgoff_t offset = swp_offset(swp);
-- 
2.34.1

From nobody Mon Jun 8 21:51:20 2026
mail-pj1-f50.google.com with SMTP id 98e67ed59e1d1-3697c35eab7so6309801a91.0 for ; Tue, 26 May 2026 04:46:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1779796004; x=1780400804; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=LyxgSUePtHx6I/zD41D6CDDvd5fSjKcBNMIc54wEtIs=; b=FXoFEHp4fPTD7wpEIocU8DRTuPQeKr8IQtP6X0ks7XU+78ZzY1lFwvSxK+3TxryEf9 p5c5pS3CklPXzbBvlV4Hf/BUOETUr0vtoh2sPvsbrOzImY+MnTmImArY90ko6wD7GHdd NjnRTW5iAFaw8JS+Zy7Aj0PzOhgG7HYUIwz5LZJ8BgYqljlr3tpGN3SwHzPKRxMu6+i4 3DHwVPV5LH5wR02GeOKBLBwfWx5QNjPQ8w/Cu6D61KmOK/z4KieJ454Am+1gJL2a9KS9 cVSKkEk5mnEvem0wBCO9NJn2o3ObZkWTG+jw90pMP5aQrhcKIVfCAa5QxTtxLca6plTd FlhQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779796004; x=1780400804; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=LyxgSUePtHx6I/zD41D6CDDvd5fSjKcBNMIc54wEtIs=; b=OnvLE8FOIccwhYa2xgYrRilM7AR1ve4LEGWIA/F0/xgygp2NugOTi3H8yi3OnGOGmq RIaUYynilDBmZ0BtqJ+m/O6nmiB+XL7cn67Y3MlGeflqV2CCK8J9zAo8OmZx+BiOhT+h FJhe0ZUWzofuPuSxWXJTpbJWRpkQerVW5IJT5De2Uu6FysWwpGsUsZO6yVaO4tUsT7ew DIjILJT6M0kDJ+EgB2dtYnS2y9c3WyKBDV0Ft2NRFBDUVeLJg4cCPS72iUaa3DcwrZxT 1VOKt3G6jYS+1JNQUbIyOkQnlzpXLg/4f5ObamGmDw9CS/5iyIpM0XOZ5cOetQECYixx 9G8Q== X-Forwarded-Encrypted: i=1; AFNElJ/gMTpb71jb1pNmgeb/hJL/MqsyYfynBTOrcaZoEChVdcb/G394Kx00vZCfi9Kmfl6nUbEWL9VnkdLucxA=@vger.kernel.org X-Gm-Message-State: AOJu0Yz0fmWe1WTD7w8Z0KDAl7cT3fNfUw2QcMf9ho90BGpHwHJHeng2 79m9MYoQlNdd5wIEod8darwWFv6P9wEbPaVnoldbJV9bVhZ4H85nForv X-Gm-Gg: Acq92OGV0qMfN54hUizAxmfuTuvqf3qUvTQM0m324ITjzrinT80sPZB8RUhFzO4JgeA ILUod2COQ06RSnsX7HrF+5TfJ+sVAKPVsA/+THoIScbhprFIjyrUp3lp69zpvaIhcjI1jDrAZJI BEBUYXPNvSmanp1YmtJSiqK1B2gub6pAksG1K50+yaRFQqUaWB8L8sME8MgB/e0NqbVtlvv4rFE 
UkMaf4yRmgV89J6Vkk/PQbkE8Lbnj7xKPrzdqalOsdP4U69ce+T4syDPpGCE28mgFx2p5hj6RLA MAvMFXOAhM8zf9RRd/zrJdR5yPYs+Eh5UZlvwko2TXGKfxK5+Zwx5i2iNJuxMYAfnolp1R8K6hh viC9A7Axeen+1dQhsNssma5P/7Qndti3s4k3E96hIUDoDzRC63Bl0nHkyaC9bW0f2rCJ+L9KKtK 9MENQtZJuDspyl0ssHbyIX8pCbGVsFMFYXzlG6pECJpsRMawhV2i0= X-Received: by 2002:a17:90b:350a:b0:36a:fcf5:64d2 with SMTP id 98e67ed59e1d1-36afcf566b1mr3550820a91.16.1779796003831; Tue, 26 May 2026 04:46:43 -0700 (PDT) Received: from localhost.localdomain ([210.184.73.204]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-c852028fe99sm10304341a12.4.2026.05.26.04.46.37 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Tue, 26 May 2026 04:46:43 -0700 (PDT) From: Hao Jia To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org, shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org, mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev, muchun.song@linux.dev, roman.gushchin@linux.dev Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia Subject: [PATCH v3 3/4] mm/zswap: Add per-memcg stat for proactive writeback Date: Tue, 26 May 2026 19:46:00 +0800 Message-Id: <20260526114601.67041-4-jiahao.kernel@gmail.com> X-Mailer: git-send-email 2.39.2 (Apple Git-143) In-Reply-To: <20260526114601.67041-1-jiahao.kernel@gmail.com> References: <20260526114601.67041-1-jiahao.kernel@gmail.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Hao Jia Currently, zswap writeback can be triggered by either the pool limit being hit or by the proactive writeback mechanism. However, the existing 'zswpwb' metric in memory.stat and /proc/vmstat counts all written back pages, making it difficult to distinguish between pages written back due to the pool limit and those written back proactively. 
Add a new statistic 'zswpwb_proactive' to memory.stat and /proc/vmstat. This counter tracks the number of pages written back due to proactive writeback. This allows users to better monitor and tune the proactive writeback mechanism. Signed-off-by: Hao Jia Reviewed-by: Nhat Pham --- Documentation/admin-guide/cgroup-v2.rst | 4 +++ include/linux/vm_event_item.h | 1 + mm/memcontrol.c | 1 + mm/vmstat.c | 1 + mm/zswap.c | 41 ++++++++++++++++++------- 5 files changed, 37 insertions(+), 11 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-= guide/cgroup-v2.rst index 6564abf0dec5..7d65aef83f7b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1748,6 +1748,10 @@ The following nested keys are defined. zswpwb Number of pages written from zswap to swap. =20 + zswpwb_proactive + Number of pages written from zswap to swap by proactive + writeback. This is a subset of zswpwb. + zswap_incomp Number of incompressible pages currently stored in zswap without compression. 
These pages could not be compressed to diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 03fe95f5a020..7a5bee0a20b6 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -138,6 +138,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, ZSWPIN, ZSWPOUT, ZSWPWB, + ZSWPWB_PROACTIVE, #endif #ifdef CONFIG_X86 DIRECT_MAP_LEVEL2_SPLIT, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e205e5de193d..7648b3fd940e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -571,6 +571,7 @@ static const unsigned int memcg_vm_event_stat[] =3D { ZSWPIN, ZSWPOUT, ZSWPWB, + ZSWPWB_PROACTIVE, #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE THP_FAULT_ALLOC, diff --git a/mm/vmstat.c b/mm/vmstat.c index f534972f517d..66fd06d1bb01 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1452,6 +1452,7 @@ const char * const vmstat_text[] =3D { [I(ZSWPIN)] =3D "zswpin", [I(ZSWPOUT)] =3D "zswpout", [I(ZSWPWB)] =3D "zswpwb", + [I(ZSWPWB_PROACTIVE)] =3D "zswpwb_proactive", #endif #ifdef CONFIG_X86 [I(DIRECT_MAP_LEVEL2_SPLIT)] =3D "direct_map_level2_splits", diff --git a/mm/zswap.c b/mm/zswap.c index 7bcbf788f634..b45d094f532a 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -160,6 +160,11 @@ struct zswap_pool { char tfm_name[CRYPTO_MAX_ALG_NAME]; }; =20 +struct zswap_shrink_walk_arg { + bool proactive; + bool encountered_page_in_swapcache; +}; + /* Global LRU lists shared by all zswap pools. */ static struct list_lru zswap_list_lru; =20 @@ -1042,7 +1047,8 @@ static bool zswap_decompress(struct zswap_entry *entr= y, struct folio *folio) * freed. 
*/ static int zswap_writeback_entry(struct zswap_entry *entry, - swp_entry_t swpentry) + swp_entry_t swpentry, + bool proactive) { struct xarray *tree; pgoff_t offset =3D swp_offset(swpentry); @@ -1097,6 +1103,12 @@ static int zswap_writeback_entry(struct zswap_entry = *entry, if (entry->objcg) count_objcg_events(entry->objcg, ZSWPWB, 1); =20 + if (proactive) { + count_vm_event(ZSWPWB_PROACTIVE); + if (entry->objcg) + count_objcg_events(entry->objcg, ZSWPWB_PROACTIVE, 1); + } + zswap_entry_free(entry); =20 /* folio is up to date */ @@ -1146,7 +1158,8 @@ static enum lru_status shrink_memcg_cb(struct list_he= ad *item, struct list_lru_o void *arg) { struct zswap_entry *entry =3D container_of(item, struct zswap_entry, lru); - bool *encountered_page_in_swapcache =3D (bool *)arg; + struct zswap_shrink_walk_arg *walk_arg =3D arg; + bool proactive_wb =3D walk_arg && walk_arg->proactive; swp_entry_t swpentry; enum lru_status ret =3D LRU_REMOVED_RETRY; int writeback_result; @@ -1201,7 +1214,7 @@ static enum lru_status shrink_memcg_cb(struct list_he= ad *item, struct list_lru_o */ spin_unlock(&l->lock); =20 - writeback_result =3D zswap_writeback_entry(entry, swpentry); + writeback_result =3D zswap_writeback_entry(entry, swpentry, proactive_wb); =20 if (writeback_result) { zswap_reject_reclaim_fail++; @@ -1212,9 +1225,9 @@ static enum lru_status shrink_memcg_cb(struct list_he= ad *item, struct list_lru_o * into the warmer region. We should terminate shrinking (if we're in th= e dynamic * shrinker context). 
*/ - if (writeback_result =3D=3D -EEXIST && encountered_page_in_swapcache) { + if (writeback_result =3D=3D -EEXIST && walk_arg) { ret =3D LRU_STOP; - *encountered_page_in_swapcache =3D true; + walk_arg->encountered_page_in_swapcache =3D true; } } else { zswap_written_back_pages++; @@ -1226,8 +1239,11 @@ static enum lru_status shrink_memcg_cb(struct list_h= ead *item, struct list_lru_o static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc) { + struct zswap_shrink_walk_arg walk_arg =3D { + .proactive =3D false, + .encountered_page_in_swapcache =3D false, + }; unsigned long shrink_ret; - bool encountered_page_in_swapcache =3D false; =20 if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(sc->memcg)) { @@ -1236,9 +1252,9 @@ static unsigned long zswap_shrinker_scan(struct shrin= ker *shrinker, } =20 shrink_ret =3D list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb, - &encountered_page_in_swapcache); + &walk_arg); =20 - if (encountered_page_in_swapcache) + if (walk_arg.encountered_page_in_swapcache) return SHRINK_STOP; =20 return shrink_ret ? 
shrink_ret : SHRINK_STOP; @@ -1709,7 +1725,10 @@ static long zswap_proactive_shrink_memcg(struct mem_= cgroup *memcg, return -ENOENT; =20 for_each_node_state(nid, N_NORMAL_MEMORY) { - bool encountered_page_in_swapcache =3D false; + struct zswap_shrink_walk_arg walk_arg =3D { + .proactive =3D true, + .encountered_page_in_swapcache =3D false, + }; unsigned long nr_to_scan, nr_scanned =3D 0; =20 /* @@ -1743,12 +1762,12 @@ static long zswap_proactive_shrink_memcg(struct mem= _cgroup *memcg, =20 nr_written +=3D list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb, - &encountered_page_in_swapcache, + &walk_arg, &nr_to_walk); =20 if (nr_written >=3D nr_to_write) return nr_written; - if (encountered_page_in_swapcache) + if (walk_arg.encountered_page_in_swapcache) break; =20 cond_resched(); --=20 2.34.1 From nobody Mon Jun 8 21:51:20 2026 Received: from mail-pg1-f172.google.com (mail-pg1-f172.google.com [209.85.215.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 117EE3EDE50 for ; Tue, 26 May 2026 11:46:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.172 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779796018; cv=none; b=PMNtoZWPZwrTVuvLY/BMcHLsvCh7ZJtTP3udp0N0Ga2C6jTIMHh+hMZpf2EX6k0ji5rjVdYFIkAYlWqjTr1QHRngSg2rtE5dVCQjT3I+W0LS0LRFZMGR1uaO1dNfgb+sulR6p+UqgN2D2iNzRGDMdHbP8doXhRc5HBGBzj/fHQ8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779796018; c=relaxed/simple; bh=KAua5ug0ikJ+vEnPoRL10cFv6cThk1xddjc36M+Xnfg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=DjSEwpNJCr4QlBQWImXsd5nD/LAexPo8YTUKmZU4fsN5/48rpXXiBfYrSg+xhzGPdpu+H1vRvhD9R6LZjHX7xK7Xd6mZE9r2eOiEFso9jwSfbT68Yqe4jaPlmvMs0gVQGNnQyvJIX3QIeRjx0FY+tx0Gv/uIBS0b8LPjFA5Rx+g= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=s/JxFRE2; arc=none smtp.client-ip=209.85.215.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="s/JxFRE2" Received: by mail-pg1-f172.google.com with SMTP id 41be03b00d2f7-c80227b1f6cso3891140a12.1 for ; Tue, 26 May 2026 04:46:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1779796011; x=1780400811; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=lgSgho/uSma27Dv6R9PjYX30l7Y2Ab+qd5YVtfXCHFQ=; b=s/JxFRE2VGoOLDvsRjaRx0888PrcfgdyfstLjkMCZTcxbMy/7ur2WiK8kCPx0iaM+3 jaYNI5hPMYLq0qBSHr1gAN3PqhoWANjd2slvCbv9QyJCswSogAXbBfiPu24IeeLhcdxI AGE3EPE7p6QMo9xw1DB0KnRl0HKy0TO9yDPhTDw1wH2pqE4U/L01xGSMlY50LXg6PoCy wREuEF8iG95b3v4lXXlEssD8sts/862IYDVKP9GQw7vBwfY+nVUKkK4IFBwZXIw1zjE8 knO+wF4X8fL7lxbigdbrXD+CFH8qsUn/8qr8UyKeVfbTeQXwUrBye9ik4bUph920kGpd 4vFQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779796011; x=1780400811; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=lgSgho/uSma27Dv6R9PjYX30l7Y2Ab+qd5YVtfXCHFQ=; b=Py3REjA/pIrAJSqzd1AogtUFIAXIgLTK6APs/u9esQdQndOOG9bacoHedkCG2K+daO xB8VfQMxzOyiJZ0WGrwnk0S7pLMp1+YTqWwOge1s741qlw6L2pjxcMRXJUtlRpdXR137 WnRWyjNyW3200xvtYr2XKdpm58nR5BncC7tzNfxqPvy9A3jdezBmO7KOvWa4Zi+Ok6zr hBvZletAQZ9sKOZ+/HG7C3yHHPUqdOyrBu4pXSUmIrLgijBUmXxdVIgl+Wjn9kNmb6sz 
UcQOeAVubAFdl/leelzUwRgUMBgf7tZF1XSz3HRfgOnMvi1Ta+10irsgbR231eELgSL1 LQ7w== X-Forwarded-Encrypted: i=1; AFNElJ/FxGt0FwgYeh1xgAt1mAbV1FVzA0ML9MnCZ0naZHiO6cBkSLlbWeJMD68gx/nu6JArBQalhcMFj2WaJmA=@vger.kernel.org X-Gm-Message-State: AOJu0YxPMDjWO5qAyNloVizV2UPHB2rdxv3sz+vRbBvhU5oquHXk7jMA q/uVJi3qBmUjN9JcryLUAv52bUlqzgHOJBV969tLvbRCr5WEXKpzLHZ6 X-Gm-Gg: Acq92OHRGnSYOUdhU6vDzLU7koEwdA6clL0sHGWFwRWMQbhilIO3TrMGBF9AoFPLaJK YflsgevGv4ODRX9TUUiCsCmNBZ9I5KzIeCw06ryPJ8sUr04dFEiDsyfqVRIOepdEsnvB4VN2c9Q GUNdnciEQRX5LH/cKPSPV16sKXJ7q0+3F8HQRrH/6slMi+RqWtrw3e9JgTz/RTbA/sNgyLdQXph p1/QErbfUI6AAsnaN7pams7OO79e2pitiiTm8k25BS2fEtZsXMQgpiPJhrMezp4brPiwQe0s96p E/SobApQefpGPPvHraPWC7WT75D5vUuOAdLuAid/NESJWWSNtHupq8sPWqg037RU1cBvTJIVSuE z3F181zFNYWuSeqvFgcji4tAmxsLWgzY5f21vdcabe/41+/F7HzdpXE16a+SW8nLlNN1Au0fGuR NjfwMKK9NRJZrS2xLFQA7R9446GLtbjspUY/zla/HnPu5CV+ISZ1Q= X-Received: by 2002:a05:6a20:7491:b0:39b:8dcb:f36d with SMTP id adf61e73a8af0-3b328f65039mr18535592637.35.1779796010995; Tue, 26 May 2026 04:46:50 -0700 (PDT) Received: from localhost.localdomain ([210.184.73.204]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-c852028fe99sm10304341a12.4.2026.05.26.04.46.45 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Tue, 26 May 2026 04:46:50 -0700 (PDT) From: Hao Jia To: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org, shakeel.butt@linux.dev, mhocko@kernel.org, yosry@kernel.org, mkoutny@suse.com, nphamcs@gmail.com, chengming.zhou@linux.dev, muchun.song@linux.dev, roman.gushchin@linux.dev Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia Subject: [PATCH v3 4/4] selftests/cgroup: Add tests for zswap proactive writeback Date: Tue, 26 May 2026 19:46:01 +0800 Message-Id: <20260526114601.67041-5-jiahao.kernel@gmail.com> X-Mailer: git-send-email 2.39.2 (Apple Git-143) In-Reply-To: <20260526114601.67041-1-jiahao.kernel@gmail.com> References: <20260526114601.67041-1-jiahao.kernel@gmail.com> Precedence: 
bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Hao Jia Add test_zswap_proactive_writeback() to cover the new memory.reclaim "zswap_writeback_only" key. The test populates a memory cgroup zswap pool, triggers proactive writeback, and verifies the behavior by observing the change in zswpwb_proactive. Invalid input combinations are also covered. Extend test_zswap_writeback_one() to assert that the existing non-proactive writeback path leaves zswpwb_proactive at zero. Signed-off-by: Hao Jia Reviewed-by: Nhat Pham --- tools/testing/selftests/cgroup/test_zswap.c | 155 +++++++++++++++++++- 1 file changed, 154 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/se= lftests/cgroup/test_zswap.c index 49b36ee79160..6ab9394a37cc 100644 --- a/tools/testing/selftests/cgroup/test_zswap.c +++ b/tools/testing/selftests/cgroup/test_zswap.c @@ -60,7 +60,12 @@ static int get_zswap_stored_pages(size_t *value) =20 static long get_cg_wb_count(const char *cg) { - return cg_read_key_long(cg, "memory.stat", "zswpwb"); + return cg_read_key_long(cg, "memory.stat", "zswpwb "); +} + +static long get_cg_pwb_count(const char *cg) +{ + return cg_read_key_long(cg, "memory.stat", "zswpwb_proactive "); } =20 static long get_zswpout(const char *cgroup) @@ -355,6 +360,7 @@ static int attempt_writeback(const char *cgroup, void *= arg) static int test_zswap_writeback_one(const char *cgroup, bool wb) { long zswpwb_before, zswpwb_after; + long pwb_cnt; =20 zswpwb_before =3D get_cg_wb_count(cgroup); if (zswpwb_before !=3D 0) { @@ -362,6 +368,12 @@ static int test_zswap_writeback_one(const char *cgroup= , bool wb) return -1; } =20 + pwb_cnt =3D get_cg_pwb_count(cgroup); + if (pwb_cnt !=3D 0) { + ksft_print_msg("zswpwb_proactive_before =3D %ld instead of 0\n", pwb_cnt= ); + return -1; + } + if 
(cg_run(cgroup, attempt_writeback, (void *) &wb)) return -1; =20 @@ -379,6 +391,17 @@ static int test_zswap_writeback_one(const char *cgroup= , bool wb) return -1; } =20 + /* + * attempt_writeback() does not use the proactive writeback path, so + * zswpwb_proactive must stay at zero regardless of whether writeback + * was enabled. + */ + pwb_cnt =3D get_cg_pwb_count(cgroup); + if (pwb_cnt !=3D 0) { + ksft_print_msg("zswpwb_proactive_after is %ld, expected 0\n", pwb_cnt); + return -1; + } + return 0; } =20 @@ -770,6 +793,135 @@ static int test_zswap_incompressible(const char *root) return ret; } =20 +/* + * Trigger proactive zswap writeback with the following steps: + * 1. Allocate memory. + * 2. Push allocated memory into zswap. + * 3. Proactively write back zswap pages to swap + * using "zswap_writeback_only". + */ +static int proactive_writeback_workload(const char *cgroup, void *arg) +{ + size_t memsize =3D page_size * 1024; + char reclaim_cmd[64]; + char buf[page_size]; + long zswap_usage; + int ret =3D -1; + char *mem; + + mem =3D (char *)malloc(memsize); + if (!mem) + return ret; + + for (int i =3D 0; i < page_size; i++) + buf[i] =3D i < page_size / 2 ? (char)i : 0; + for (int i =3D 0; i < memsize; i +=3D page_size) + memcpy(&mem[i], buf, page_size); + + /* Evict allocated memory into zswap. */ + if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) { + ksft_print_msg("Failed to push pages into zswap\n"); + goto out; + } + + zswap_usage =3D cg_read_long(cgroup, "memory.zswap.current"); + if (zswap_usage <=3D 0) { + ksft_print_msg("no zswap pool to write back\n"); + goto out; + } + + /* Trigger proactive zswap writeback. 
*/ + snprintf(reclaim_cmd, sizeof(reclaim_cmd), "%zu zswap_writeback_only", me= msize); + int rc =3D cg_write(cgroup, "memory.reclaim", reclaim_cmd); + if (rc && rc !=3D -EAGAIN) { + ksft_print_msg("proactive zswap writeback failed: %d\n", rc); + goto out; + } + + ret =3D 0; +out: + free(mem); + return ret; +} + +static int check_writeback_invalid_inputs(const char *cgroup) +{ + static char * const bad_inputs[] =3D { + "zswap_writeback_only", + "1M zswap_writeback_only swappiness=3D60", + "1M swappiness=3D60 zswap_writeback_only", + "1M zswap_writeback_only swappiness=3Dmax", + "1M swappiness=3Dmax zswap_writeback_only", + }; + int i, rc; + + for (i =3D 0; i < ARRAY_SIZE(bad_inputs); i++) { + rc =3D cg_write(cgroup, "memory.reclaim", bad_inputs[i]); + if (rc !=3D -EINVAL) { + ksft_print_msg("memory.reclaim '%s': returned %d, expected %d\n", + bad_inputs[i], rc, -EINVAL); + return -1; + } + } + return 0; +} + +static int test_zswap_proactive_writeback(const char *root) +{ + long pwb_before, wb_before, pwb_after, wb_after; + long pwb_delta, wb_delta; + int ret =3D KSFT_FAIL; + char *test_group; + + if (cg_read_strcmp(root, "memory.zswap.writeback", "1")) + return KSFT_SKIP; + + test_group =3D cg_name(root, "zswap_proactive_test"); + if (!test_group) + return KSFT_FAIL; + if (cg_create(test_group)) + goto out; + if (check_writeback_invalid_inputs(test_group)) + goto out; + + pwb_before =3D get_cg_pwb_count(test_group); + wb_before =3D get_cg_wb_count(test_group); + if (pwb_before < 0 || wb_before < 0) + goto out; + + if (cg_run(test_group, proactive_writeback_workload, NULL)) + goto out; + + pwb_after =3D get_cg_pwb_count(test_group); + wb_after =3D get_cg_wb_count(test_group); + if (pwb_after < 0 || wb_after < 0) + goto out; + + pwb_delta =3D pwb_after - pwb_before; + wb_delta =3D wb_after - wb_before; + + if (pwb_delta <=3D 0) { + ksft_print_msg("zswpwb_proactive did not increase: delta=3D%ld\n", + pwb_delta); + goto out; + } + if (wb_delta <=3D 0) { + 
ksft_print_msg("zswpwb did not increase: delta=3D%ld\n", wb_delta); + goto out; + } + if (pwb_delta > wb_delta) { + ksft_print_msg("zswpwb_proactive delta (%ld) > zswpwb delta (%ld)\n", + pwb_delta, wb_delta); + goto out; + } + + ret =3D KSFT_PASS; +out: + cg_destroy(test_group); + free(test_group); + return ret; +} + #define T(x) { x, #x } struct zswap_test { int (*fn)(const char *root); @@ -783,6 +935,7 @@ struct zswap_test { T(test_no_kmem_bypass), T(test_no_invasive_cgroup_shrink), T(test_zswap_incompressible), + T(test_zswap_proactive_writeback), }; #undef T =20 --=20 2.34.1
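
Since patch 3/4 documents zswpwb_proactive as a subset of zswpwb, a monitoring tool could derive the number of pages written back for other reasons (pool limit, shrinker) by subtracting one from the other. The following is only a minimal userspace sketch of that arithmetic, assuming the contents of memory.stat have already been read into a NUL-terminated buffer. The helper names stat_lookup() and zswpwb_non_proactive() are hypothetical and are not part of this series or of the cgroup selftest helpers. It matches whole keys at the start of a line, so "zswpwb" is never confused with "zswpwb_proactive".

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up an exact "key value" line in
 * memory.stat-style text. Returns the value, or -1 if the key is
 * absent or its value cannot be parsed.
 */
static long stat_lookup(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Exact match only: the key must be followed by a space. */
		if (!strncmp(line, key, klen) && line[klen] == ' ') {
			long val;

			if (sscanf(line + klen, "%ld", &val) == 1)
				return val;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical helper: pages written back by anything other than
 * proactive writeback. Relies on the documented relationship that
 * zswpwb_proactive is a subset of zswpwb; returns -1 if either key
 * is missing or the relationship does not hold.
 */
static long zswpwb_non_proactive(const char *text)
{
	long total = stat_lookup(text, "zswpwb");
	long proactive = stat_lookup(text, "zswpwb_proactive");

	if (total < 0 || proactive < 0 || proactive > total)
		return -1;
	return total - proactive;
}
```

The two counters are read at different instants on a live system, so a real tool would probably want to sample memory.stat once and parse both keys from the same buffer, as assumed here, rather than reading the file twice.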