From nobody Tue Dec 23 14:22:25 2025
From: Chengming Zhou
Date: Sun, 04 Feb 2024 03:05:59 +0000
Subject: [PATCH v2 1/6] mm/zswap: add more comments in shrink_memcg_cb()
Message-Id: <20240201-b4-zswap-invalidate-entry-v2-1-99d4084260a0@bytedance.com>
References: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
In-Reply-To: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
To: Nhat Pham , Yosry Ahmed , Andrew Morton , Johannes Weiner
Cc: linux-mm@kvack.org, Nhat Pham , Chengming Zhou , linux-kernel@vger.kernel.org, Yosry Ahmed , Johannes Weiner

Add more comments in shrink_memcg_cb() to describe the deref dance
implemented to fix the race between LRU writeback and swapoff, and the
reason why we rotate the entry at the beginning.

Also fix the stale comments in zswap_writeback_entry(), and add more
comments stating that we only deref the tree after we get the
swapcache reference.

Suggested-by: Yosry Ahmed
Suggested-by: Johannes Weiner
Acked-by: Yosry Ahmed
Reviewed-by: Nhat Pham
Signed-off-by: Johannes Weiner
Signed-off-by: Chengming Zhou
---
 mm/zswap.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4aea03285532..735f1a6ef336 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1207,10 +1207,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,

 	/*
 	 * folio is locked, and the swapcache is now secured against
-	 * concurrent swapping to and from the slot. Verify that the
-	 * swap entry hasn't been invalidated and recycled behind our
-	 * backs (our zswap_entry reference doesn't prevent that), to
-	 * avoid overwriting a new swap folio with old compressed data.
+	 * concurrent swapping to and from the slot, and concurrent
+	 * swapoff so we can safely dereference the zswap tree here.
+	 * Verify that the swap entry hasn't been invalidated and recycled
+	 * behind our backs, to avoid overwriting a new swap folio with
+	 * old compressed data. Only when this is successful can the entry
+	 * be dereferenced.
 	 */
 	tree = swap_zswap_tree(swpentry);
 	spin_lock(&tree->lock);
@@ -1263,22 +1265,29 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	int writeback_result;

 	/*
-	 * Rotate the entry to the tail before unlocking the LRU,
-	 * so that in case of an invalidation race concurrent
-	 * reclaimers don't waste their time on it.
+	 * As soon as we drop the LRU lock, the entry can be freed by
+	 * a concurrent invalidation. This means the following:
 	 *
-	 * If writeback succeeds, or failure is due to the entry
-	 * being invalidated by the swap subsystem, the invalidation
-	 * will unlink and free it.
+	 * 1. We extract the swp_entry_t to the stack, allowing
+	 *    zswap_writeback_entry() to pin the swap entry and
+	 *    then validate the zswap entry against that swap entry's
+	 *    tree using pointer value comparison. Only when that
+	 *    is successful can the entry be dereferenced.
 	 *
-	 * Temporary failures, where the same entry should be tried
-	 * again immediately, almost never happen for this shrinker.
-	 * We don't do any trylocking; -ENOMEM comes closest,
-	 * but that's extremely rare and doesn't happen spuriously
-	 * either. Don't bother distinguishing this case.
+	 * 2. Usually, objects are taken off the LRU for reclaim. In
+	 *    this case this isn't possible, because if reclaim fails
+	 *    for whatever reason, we have no means of knowing if the
+	 *    entry is alive to put it back on the LRU.
 	 *
-	 * But since they do exist in theory, the entry cannot just
-	 * be unlinked, or we could leak it. Hence, rotate.
+	 * So rotate it before dropping the lock. If the entry is
+	 * written back or invalidated, the free path will unlink
+	 * it. For failures, rotation is the right thing as well.
+	 *
+	 * Temporary failures, where the same entry should be tried
+	 * again immediately, almost never happen for this shrinker.
+	 * We don't do any trylocking; -ENOMEM comes closest,
+	 * but that's extremely rare and doesn't happen spuriously
+	 * either. Don't bother distinguishing this case.
 	 */
 	list_move_tail(item, &l->list);

-- 
b4 0.10.1

From nobody Tue Dec 23 14:22:25 2025
From: Chengming Zhou
Date: Sun, 04 Feb 2024 03:06:00 +0000
Subject: [PATCH v2 2/6] mm/zswap: invalidate zswap entry when swap entry free
Message-Id: <20240201-b4-zswap-invalidate-entry-v2-2-99d4084260a0@bytedance.com>
References: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
In-Reply-To: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
To: Nhat Pham , Yosry Ahmed , Andrew Morton , Johannes Weiner
Cc: linux-mm@kvack.org, Nhat Pham , Chengming Zhou , linux-kernel@vger.kernel.org, Yosry Ahmed , Johannes Weiner

During testing I found that zswap_writeback_entry() sometimes returns
-ENOMEM, which is not what we expected:

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[-12]: 1563
@[0]: 277221

The reason is that __read_swap_cache_async() returns NULL because
swapcache_prepare() fails: we don't invalidate the zswap entry when the
swap entry is freed to the per-cpu pool, so these zswap entries stay on
the zswap tree and LRU list.

This patch moves the invalidation ahead, to the point when the swap
entry is freed to the per-cpu pool, since there is no benefit to
leaving a trashy zswap entry on the tree and LRU list.

With this patch:

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[0]: 259744

Note: a large folio can't have a zswap entry for now, so we don't
bother to add zswap entry invalidation to the large folio swap free
path.
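
For reference, the failure chain looks roughly like this (an
illustrative sketch of the call flow only, not the exact kernel
signatures):

	shrink_memcg_cb()
	  -> zswap_writeback_entry(entry, swpentry)
	       -> __read_swap_cache_async(swpentry, ...)
	            -> swapcache_prepare(swpentry)  /* fails: the slot was
	                                               already freed to the
	                                               per-cpu pool */
	            -> returns NULL
	       -> return -ENOMEM  /* the spurious errors seen above */

After this patch, free_swap_slot() calls zswap_invalidate() first, so
such trashy entries are gone from the tree and LRU list before
writeback can trip over them.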
Reviewed-by: Nhat Pham
Acked-by: Johannes Weiner
Signed-off-by: Chengming Zhou
Acked-by: Yosry Ahmed
---
 include/linux/zswap.h | 4 ++--
 mm/swap_slots.c       | 3 +++
 mm/swapfile.c         | 1 -
 mm/zswap.c            | 5 +++--
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 91895ce1fdbc..341aea490070 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -29,7 +29,7 @@ struct zswap_lruvec_state {

 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
-void zswap_invalidate(int type, pgoff_t offset);
+void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
@@ -50,7 +50,7 @@ static inline bool zswap_load(struct folio *folio)
 	return false;
 }

-static inline void zswap_invalidate(int type, pgoff_t offset) {}
+static inline void zswap_invalidate(swp_entry_t swp) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
 	return 0;
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..90973ce7881d 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -273,6 +273,9 @@ void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;

+	/* Large folio swap slot is not covered. */
+	zswap_invalidate(entry);
+
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
 		spin_lock_irq(&cache->free_lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0580bb3e34d7..65b49db89b36 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -744,7 +744,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	swap_slot_free_notify = NULL;
 	while (offset <= end) {
 		arch_swap_invalidate_page(si->type, offset);
-		zswap_invalidate(si->type, offset);
 		if (swap_slot_free_notify)
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
diff --git a/mm/zswap.c b/mm/zswap.c
index 735f1a6ef336..d8bb0e06e2b0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1738,9 +1738,10 @@ bool zswap_load(struct folio *folio)
 	return true;
 }

-void zswap_invalidate(int type, pgoff_t offset)
+void zswap_invalidate(swp_entry_t swp)
 {
-	struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
+	pgoff_t offset = swp_offset(swp);
+	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;

 	spin_lock(&tree->lock);

-- 
b4 0.10.1

From nobody Tue Dec 23 14:22:25 2025
From: Chengming Zhou
Date: Sun, 04 Feb 2024 03:06:01 +0000
Subject: [PATCH v2 3/6] mm/zswap: stop lru list shrinking when encounter warm region
Message-Id: <20240201-b4-zswap-invalidate-entry-v2-3-99d4084260a0@bytedance.com>
References: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
In-Reply-To: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
To: Nhat Pham , Yosry Ahmed , Andrew Morton , Johannes Weiner
Cc: linux-mm@kvack.org, Nhat Pham , Chengming Zhou , linux-kernel@vger.kernel.org, Yosry Ahmed , Johannes Weiner

When the shrinker encounters an existing folio in the swap cache, it
means we are shrinking into the warmer region, so we should terminate
shrinking if we're in the dynamic shrinker context.

This patch adds LRU_STOP to support this and avoid overshrinking.

Acked-by: Johannes Weiner
Acked-by: Nhat Pham
Reviewed-by: Yosry Ahmed
Signed-off-by: Chengming Zhou
---
 include/linux/list_lru.h | 2 ++
 mm/list_lru.c            | 3 +++
 mm/zswap.c               | 4 +++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f2882a820690..792b67ceb631 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -24,6 +24,8 @@ enum lru_status {
 	LRU_SKIP,		/* item cannot be locked, skip */
 	LRU_RETRY,		/* item not freeable. May drop the lock
 				   internally, but has to return locked. */
+	LRU_STOP,		/* stop lru list walking. May drop the lock
+				   internally, but has to return locked. */
 };

 struct list_lru_one {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 61f3b6b1134f..3fd64736bc45 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -243,6 +243,9 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 		 */
 		assert_spin_locked(&nlru->lock);
 		goto restart;
+	case LRU_STOP:
+		assert_spin_locked(&nlru->lock);
+		goto out;
 	default:
 		BUG();
 	}
diff --git a/mm/zswap.c b/mm/zswap.c
index d8bb0e06e2b0..4381b7a2d4d6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1315,8 +1315,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	 * into the warmer region. We should terminate shrinking (if we're in the dynamic
 	 * shrinker context).
 	 */
-	if (writeback_result == -EEXIST && encountered_page_in_swapcache)
+	if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
+		ret = LRU_STOP;
 		*encountered_page_in_swapcache = true;
+	}
 	} else {
 		zswap_written_back_pages++;
 	}

-- 
b4 0.10.1

From nobody Tue Dec 23 14:22:25 2025
From: Chengming Zhou
Date: Sun, 04 Feb 2024 03:06:02 +0000
Subject: [PATCH v2 4/6] mm/zswap: remove duplicate_entry debug value
Message-Id: <20240201-b4-zswap-invalidate-entry-v2-4-99d4084260a0@bytedance.com>
References: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
In-Reply-To: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
To: Nhat Pham , Yosry Ahmed , Andrew Morton , Johannes Weiner
Cc: linux-mm@kvack.org, Nhat Pham , Chengming Zhou , linux-kernel@vger.kernel.org, Yosry Ahmed , Johannes Weiner

cat /sys/kernel/debug/zswap/duplicate_entry
2086447

When testing, the duplicate_entry value is very high, but there is no
warning message in the kernel log. Going by the comment on
duplicate_entry, "Duplicate store was encountered (rare)", it seems
something went wrong.

Actually it's incremented at the beginning of zswap_store() when the
folio's zswap entry is found already on the tree. And this is a normal
case: the folio can leave its zswap entry on the tree after swapin;
if it's later dirtied and swapped out/zswap_store()'d again, its
original zswap entry is found there.

So duplicate_entry should only be incremented in the real bug case,
which already has a WARN_ON(1). It looks redundant to count the bug
case, so this patch just removes it.
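
To illustrate, the sequence that legitimately inflated the counter is
roughly the following (a sketch of the lifecycle with non-exclusive
loads, which still exist at this point in the series):

	zswap_store(folio)	/* entry inserted into the tree          */
	... swapin ...		/* non-exclusive load leaves the entry   */
	... folio redirtied ...
	zswap_store(folio)	/* zswap_rb_search() finds the stale     */
				/* entry: counted as duplicate_entry,    */
				/* then invalidated -- normal, not a bug */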
Acked-by: Johannes Weiner
Reviewed-by: Nhat Pham
Acked-by: Yosry Ahmed
Signed-off-by: Chengming Zhou
---
 mm/zswap.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4381b7a2d4d6..3fbb7e2c8b8d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -71,8 +71,6 @@ static u64 zswap_reject_compress_poor;
 static u64 zswap_reject_alloc_fail;
 /* Store failed because the entry metadata could not be allocated (rare) */
 static u64 zswap_reject_kmemcache_fail;
-/* Duplicate store was encountered (rare) */
-static u64 zswap_duplicate_entry;

 /* Shrinker work queue */
 static struct workqueue_struct *shrink_wq;
@@ -1571,10 +1569,8 @@ bool zswap_store(struct folio *folio)
 	 */
 	spin_lock(&tree->lock);
 	entry = zswap_rb_search(&tree->rbroot, offset);
-	if (entry) {
+	if (entry)
 		zswap_invalidate_entry(tree, entry);
-		zswap_duplicate_entry++;
-	}
 	spin_unlock(&tree->lock);
 	objcg = get_obj_cgroup_from_folio(folio);
 	if (objcg && !obj_cgroup_may_zswap(objcg)) {
@@ -1661,7 +1657,6 @@ bool zswap_store(struct folio *folio)
 	 */
 	while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
 		WARN_ON(1);
-		zswap_duplicate_entry++;
 		zswap_invalidate_entry(tree, dupentry);
 	}
 	if (entry->length) {
@@ -1822,8 +1817,6 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("written_back_pages", 0444,
 			   zswap_debugfs_root, &zswap_written_back_pages);
-	debugfs_create_u64("duplicate_entry", 0444,
-			   zswap_debugfs_root, &zswap_duplicate_entry);
 	debugfs_create_u64("pool_total_size", 0444,
 			   zswap_debugfs_root, &zswap_pool_total_size);
 	debugfs_create_atomic_t("stored_pages", 0444,

-- 
b4 0.10.1

From nobody Tue Dec 23 14:22:25 2025
From: Chengming Zhou
Date: Sun, 04 Feb 2024 03:06:03 +0000
Subject: [PATCH v2 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
Message-Id: <20240201-b4-zswap-invalidate-entry-v2-5-99d4084260a0@bytedance.com>
References: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
In-Reply-To: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
To: Nhat Pham , Yosry Ahmed , Andrew Morton , Johannes Weiner
Cc: linux-mm@kvack.org, Nhat Pham , Chengming Zhou , linux-kernel@vger.kernel.org, Yosry Ahmed , Johannes Weiner

The !zswap_exclusive_loads_enabled mode leaves a compressed copy in
the zswap tree and LRU list after folio swapin. This mode has several
disadvantages:

1. It's a waste of memory, since there are two copies of the data: the
   folio, and the compressed data in zswap. And it's unlikely the
   compressed data will be useful in the near future.

2. If that folio is dirtied, the compressed data is no longer useful,
   but we don't know, and don't invalidate the trashy memory in zswap.

3. It's not reclaimable by the zswap shrinker, since
   zswap_writeback_entry() will always return -EEXIST and terminate
   the shrinking process.

On the other hand, the only downside of zswap_exclusive_loads_enabled
is a little more CPU usage/latency for compression, and the same if
the folio is removed from the swapcache or dirtied.

More explanation by Johannes on why we should consider exclusive load
as the default for zswap:

  Caching "swapout work" is helpful when the system is thrashing. Then
  recently swapped in pages might get swapped out again very soon. It
  certainly makes sense with conventional swap, because keeping a
  clean copy on the disk saves IO work and doesn't cost any additional
  memory.

  But with zswap, it's different. It saves some compression work on a
  thrashing page. But the act of keeping compressed memory contributes
  to a higher rate of thrashing. And that can cause IO in other places
  like zswap writeback and file memory.

And the A/B test results of a kernel build in tmpfs with limited
memory support this theory:

                            !exclusive        exclusive
real                             63.80            63.01
user                           1063.83          1061.32
sys                             290.31           266.15

workingset_refault_anon     2383084.40       1976397.40
workingset_refault_file       44134.00         45689.40
workingset_activate_anon     837878.00        728441.20
workingset_activate_file       4710.00          4085.20
workingset_restore_anon      732622.60        639428.40
workingset_restore_file        1007.00           926.80
workingset_nodereclaim            0.00             0.00
pgscan                     14343003.40      12409570.20
pgscan_kswapd                     0.00             0.00
pgscan_direct              14343003.40      12409570.20
pgscan_khugepaged                 0.00             0.00

Acked-by: Yosry Ahmed
Acked-by: Johannes Weiner
Reviewed-by: Nhat Pham
Signed-off-by: Chengming Zhou
---
 mm/Kconfig | 16 ----------------
 mm/zswap.c | 14 +++-----------
 2 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..673b35629074 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -45,22 +45,6 @@ config ZSWAP_DEFAULT_ON
 	  The selection made here can be overridden by using the kernel
 	  command line 'zswap.enabled=' option.

-config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
-	bool "Invalidate zswap entries when pages are loaded"
-	depends on ZSWAP
-	help
-	  If selected, exclusive loads for zswap will be enabled at boot,
-	  otherwise it will be disabled.
-
-	  If exclusive loads are enabled, when a page is loaded from zswap,
-	  the zswap entry is invalidated at once, as opposed to leaving it
-	  in zswap until the swap entry is freed.
-
-	  This avoids having two copies of the same page in memory
-	  (compressed and uncompressed) after faulting in a page from zswap.
-	  The cost is that if the page was never dirtied and needs to be
-	  swapped out again, it will be re-compressed.
-
 config ZSWAP_SHRINKER_DEFAULT_ON
 	bool "Shrink the zswap pool on memory pressure"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 3fbb7e2c8b8d..cbf379abb6c7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -139,10 +139,6 @@ static bool zswap_non_same_filled_pages_enabled = true;
 module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled, bool, 0644);

-static bool zswap_exclusive_loads_enabled = IS_ENABLED(
-		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
-module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
-
 /* Number of zpools in zswap_pool (empirically determined for scalability) */
 #define ZSWAP_NR_ZPOOLS 32

@@ -1722,16 +1718,12 @@ bool zswap_load(struct folio *folio)
 		count_objcg_event(entry->objcg, ZSWPIN);

 	spin_lock(&tree->lock);
-	if (zswap_exclusive_loads_enabled) {
-		zswap_invalidate_entry(tree, entry);
-		folio_mark_dirty(folio);
-	} else if (entry->length) {
-		zswap_lru_del(&entry->pool->list_lru, entry);
-		zswap_lru_add(&entry->pool->list_lru, entry);
-	}
+	zswap_invalidate_entry(tree, entry);
 	zswap_entry_put(entry);
 	spin_unlock(&tree->lock);

+	folio_mark_dirty(folio);
+
 	return true;
 }

-- 
b4 0.10.1

From nobody Tue Dec 23 14:22:25 2025
From: Chengming Zhou
Date: Sun, 04 Feb 2024 03:06:04 +0000
Subject: [PATCH v2 6/6] mm/zswap: zswap entry doesn't need refcount anymore
Message-Id: <20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com>
References: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
In-Reply-To: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
To: Nhat Pham , Yosry Ahmed , Andrew Morton , Johannes Weiner
Cc: linux-mm@kvack.org, Nhat Pham , Chengming Zhou , linux-kernel@vger.kernel.org, Yosry Ahmed , Johannes Weiner

Since we no longer need to leave the zswap entry on the zswap tree, we
can remove it from the tree as soon as we find it there. Then, after
using it, we can directly free it; no concurrent path can find it in
the tree anymore. Only the shrinker can see it on the LRU list, and it
double-checks under the tree lock, so there is no race problem.

So we don't need the refcount in the zswap entry anymore, and we don't
need to take the spinlock a second time to invalidate it.

The side effect is that zswap_entry_free() may no longer happen under
the tree spinlock, but that's fine since nothing needs to be protected
by the lock.

Reviewed-by: Nhat Pham
Acked-by: Johannes Weiner
Signed-off-by: Chengming Zhou
---
 mm/zswap.c | 63 +++++++++++----------------------------------------------
 1 file changed, 11 insertions(+), 52 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index cbf379abb6c7..cd67f7f6b302 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -193,12 +193,6 @@ struct zswap_pool {
  *
  * rbnode - links the entry into red-black tree for the appropriate swap type
  * swpentry - associated swap entry, the offset indexes into the red-black tree
- * refcount - the number of outstanding reference to the entry. This is needed
- *            to protect against premature freeing of the entry by code
- *            concurrent calls to load, invalidate, and writeback. The lock
- *            for the zswap_tree structure that contains the entry must
- *            be held while changing the refcount. Since the lock must
- *            be held, there is no reason to also make refcount atomic.
 * length - the length in bytes of the compressed page data. Needed during
 *          decompression. For a same value filled page length is 0, and both
 *          pool and lru are invalid and must be ignored.
@@ -211,7 +205,6 @@ struct zswap_pool {
 struct zswap_entry {
 	struct rb_node rbnode;
 	swp_entry_t swpentry;
-	int refcount;
 	unsigned int length;
 	struct zswap_pool *pool;
 	union {
@@ -222,11 +215,6 @@ struct zswap_entry {
 	struct list_head lru;
 };

-/*
- * The tree lock in the zswap_tree struct protects a few things:
- * - the rbtree
- * - the refcount field of each entry in the tree
- */
 struct zswap_tree {
 	struct rb_root rbroot;
 	spinlock_t lock;
@@ -890,14 +878,10 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
 	return 0;
 }

-static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
+static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
 {
-	if (!RB_EMPTY_NODE(&entry->rbnode)) {
-		rb_erase(&entry->rbnode, root);
-		RB_CLEAR_NODE(&entry->rbnode);
-		return true;
-	}
-	return false;
+	rb_erase(&entry->rbnode, root);
+	RB_CLEAR_NODE(&entry->rbnode);
 }

 /*********************************
@@ -911,7 +895,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
 	entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
 	if (!entry)
 		return NULL;
-	entry->refcount = 1;
 	RB_CLEAR_NODE(&entry->rbnode);
 	return entry;
 }
@@ -954,33 +937,15 @@ static void zswap_entry_free(struct zswap_entry *entry)
 	zswap_update_total_size();
 }

-/* caller must hold the tree lock */
-static void zswap_entry_get(struct zswap_entry *entry)
-{
-	WARN_ON_ONCE(!entry->refcount);
-	entry->refcount++;
-}
-
-/* caller must hold the tree lock */
-static void zswap_entry_put(struct zswap_entry *entry)
-{
-	WARN_ON_ONCE(!entry->refcount);
-	if (--entry->refcount == 0) {
-		WARN_ON_ONCE(!RB_EMPTY_NODE(&entry->rbnode));
-		zswap_entry_free(entry);
-	}
-}
-
 /*
- * If the entry is still valid in the tree, drop the initial ref and remove it
- * from the tree. This function must be called with an additional ref held,
- * otherwise it may race with another invalidation freeing the entry.
+ * The caller holds the tree lock and has found the entry in the tree,
+ * so it must be on the tree; remove it from the tree and free it.
 */
 static void zswap_invalidate_entry(struct zswap_tree *tree,
 				   struct zswap_entry *entry)
 {
-	if (zswap_rb_erase(&tree->rbroot, entry))
-		zswap_entry_put(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
+	zswap_entry_free(entry);
 }

 /*********************************
@@ -1219,7 +1184,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	}

 	/* Safe to deref entry after the entry is verified above. */
-	zswap_entry_get(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);

 	zswap_decompress(entry, &folio->page);
@@ -1228,10 +1193,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPWB);

-	spin_lock(&tree->lock);
-	zswap_invalidate_entry(tree, entry);
-	zswap_entry_put(entry);
-	spin_unlock(&tree->lock);
+	zswap_entry_free(entry);

 	/* folio is up to date */
 	folio_mark_uptodate(folio);
@@ -1702,7 +1664,7 @@ bool zswap_load(struct folio *folio)
 		spin_unlock(&tree->lock);
 		return false;
 	}
-	zswap_entry_get(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);

 	if (entry->length)
@@ -1717,10 +1679,7 @@ bool zswap_load(struct folio *folio)
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);

-	spin_lock(&tree->lock);
-	zswap_invalidate_entry(tree, entry);
-	zswap_entry_put(entry);
-	spin_unlock(&tree->lock);
+	zswap_entry_free(entry);

 	folio_mark_dirty(folio);

-- 
b4 0.10.1
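
To recap, the resulting exclusive-load path at the end of the series,
condensed from the diffs above (a sketch only, with error handling and
the same-filled case omitted):

	bool zswap_load(struct folio *folio)
	{
		...
		spin_lock(&tree->lock);
		entry = zswap_rb_search(&tree->rbroot, offset);
		if (!entry) {
			spin_unlock(&tree->lock);
			return false;
		}
		/*
		 * Erase from the tree under the lock: no concurrent
		 * path can find the entry afterwards, so no refcount
		 * is needed.
		 */
		zswap_rb_erase(&tree->rbroot, entry);
		spin_unlock(&tree->lock);

		if (entry->length)
			zswap_decompress(entry, &folio->page);

		zswap_entry_free(entry);	/* outside the lock: fine */
		folio_mark_dirty(folio);
		return true;
	}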